Machine Learning

Multiple Disparate Data Sources

  • All the text data sources have to be identified
  • A text data source may be:
    • web pages scraped or crawled from the internet
    • databases of various systems and their interrelationships
    • data warehouses
    • text files or logs
    • live incoming text, e.g. live chat
    • a combination of the above


  • Data from all these sources is aggregated for further processing
  • This usually also involves converting the data into a uniform format (homogenization)
  • A dedicated, horizontally scalable, microservices-based system (e.g. Kafka) may be used for storing and serving the data
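
As a rough illustration of the homogenization step, the sketch below maps records from three kinds of sources onto one common shape. The input formats, field names, and helper functions (from_log_line, from_chat_event, from_db_row) are assumptions made up for this example; real sources will need their own adapters.

```python
from datetime import datetime, timezone

# All adapters return the same (assumed) record shape:
# {"source", "author", "timestamp", "text"}

def from_log_line(line):
    # Hypothetical log format: "<ISO timestamp> <author> <message text>"
    timestamp, author, text = line.split(" ", 2)
    return {"source": "log", "author": author,
            "timestamp": timestamp, "text": text.strip()}

def from_chat_event(event):
    # Hypothetical chat payload carrying epoch seconds in "sent_at"
    ts = datetime.fromtimestamp(event["sent_at"], tz=timezone.utc).isoformat()
    return {"source": "chat", "author": event["user"],
            "timestamp": ts, "text": event["message"].strip()}

def from_db_row(row):
    # Hypothetical database row: (author, created_at, body)
    author, created_at, body = row
    return {"source": "database", "author": author,
            "timestamp": created_at, "text": body.strip()}

records = [
    from_log_line("2001-05-14T09:30:00+00:00 alice Meeting moved to noon"),
    from_chat_event({"user": "bob", "sent_at": 0, "message": " hello there "}),
    from_db_row(("carol", "2001-05-15T08:00:00+00:00", "Quarterly report attached")),
]

for record in records:
    print(record)
```

Once every source produces the same record shape, the later stages can treat the aggregated data as a single stream regardless of where it came from.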

Feature Engineering

  • Based on the goal, relevant and interesting data is extracted at this stage
  • Usually, for text, you may end up collecting words or groups of words in a Bag of Words
  • Some NLP may be used to select interesting words, and weights may have to be decided, e.g. sentiment scores
  • Words are grouped and labeled so that a machine can learn from them
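
A Bag of Words can be sketched with the standard library alone. The two sample documents and their labels below are invented for illustration, and the tokenizer is deliberately naive; a library vectorizer would normally replace both helper functions.

```python
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase, keep runs of letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Toy labeled documents (made up for this example)
documents = [
    ("please review the attached contract", "legal"),
    ("the gas trade closed at a higher price", "trading"),
]

# The vocabulary is every distinct word seen in the documents
vocabulary = sorted({word for text, _ in documents for word in tokenize(text)})

features = [bag_of_words(text, vocabulary) for text, _ in documents]
labels = [label for _, label in documents]

for vector, label in zip(features, labels):
    print(label, vector)
```

Each document becomes a fixed-length vector of word counts paired with a label, which is the form most learning algorithms expect.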

Machine Learning

  • Based on the problem statement, appropriate machine learning algorithms are applied to build a model
  • Learning may be achieved by processing weighted features, or may be as simple as counting words
  • At this stage, libraries like scikit-learn, NumPy, pandas, TensorFlow, Theano, etc. come in handy
  • The learning process can be actively monitored via graphs, which builds intuition for tackling problems such as scaling, overfitting, underfitting, and an improper learning rate
  • Validation is also carried out to determine the accuracy of the model
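
To make the learn-then-validate loop concrete, here is a minimal sketch of a multinomial Naive Bayes classifier trained on word counts and scored on a held-out set. Naive Bayes is chosen here only as a simple example of learning from word counts, not because the text prescribes it. The training and validation sentences are invented, and the train, predict, and accuracy helpers are illustrative names; in practice one of the libraries listed above would do this work.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(samples, alpha=1.0):
    # samples: list of (text, label); alpha is the Laplace smoothing constant
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return {"word_counts": dict(word_counts), "label_counts": label_counts,
            "vocabulary": vocabulary, "alpha": alpha}

def predict(model, text):
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + model["alpha"] * vocab_size
        score = math.log(n / total)  # log prior
        for token in tokenize(text):
            if token in model["vocabulary"]:  # skip words never seen in training
                score += math.log((counts[token] + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, samples):
    correct = sum(predict(model, text) == label for text, label in samples)
    return correct / len(samples)

# Toy data, made up for this example
training = [
    ("please review the attached contract", "legal"),
    ("the contract needs a signature", "legal"),
    ("gas prices rose on the trade desk", "trading"),
    ("the trade closed at a higher price", "trading"),
]
validation = [
    ("sign the contract today", "legal"),
    ("trade desk prices", "trading"),
]

model = train(training)
print("validation accuracy:", accuracy(model, validation))
```

Keeping the validation samples out of training is what makes the accuracy figure meaningful: a model scored only on data it has already seen can look accurate while overfitting.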


  • After learning, the model is ready for prediction
  • The model itself can be stored in a binary format (e.g. with TensorFlow) and re-loaded later to make predictions
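
The store-and-reload idea can be sketched with Python's pickle module, used here as a stand-in for a framework's own binary format such as the TensorFlow one mentioned above. The model dictionary is a placeholder invented for this example, not a trained model.

```python
import os
import pickle
import tempfile

# Placeholder "model": any picklable Python object works the same way
model = {
    "label_counts": {"legal": 2, "trading": 2},
    "vocabulary": ["contract", "trade"],
}

# Write the model to a binary file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as handle:
    pickle.dump(model, handle)

# Later, possibly in another process: load it back and use it for predictions
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.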

Small Sample Project

Predict the author of a given text snippet. As a sample, we use the Enron email corpus and learn patterns from it to predict the author.