Multiple Disparate Data Sources

  • All text data sources have to be identified first
  • A text data source may be:
    • web pages scraped or crawled from the internet
    • databases of various systems and their interrelationships
    • data warehouses
    • text files or logs
    • live incoming text, e.g. live chat
    • a combination of the above


  • Data from all these sources is aggregated for further processing
  • This usually also involves converting the data into a uniform format (homogenization)
  • A dedicated, horizontally scalable, microservices-based system may be used for storing and serving the data, e.g. Kafka
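
The homogenization step can be sketched roughly as below. This is a minimal illustration, not a prescribed design: the log line layout, the chat event fields (`sender`, `sent_at`, `body`) and the target record keys are all assumptions made up for the example.

```python
import json
from datetime import datetime, timezone


def from_log_line(line):
    """Parse a hypothetical 'timestamp<TAB>user<TAB>message' log line."""
    ts, user, text = line.rstrip("\n").split("\t", 2)
    return {"source": "log", "author": user, "timestamp": ts, "text": text}


def from_chat_event(payload):
    """Parse a hypothetical JSON chat event (sent_at is epoch seconds)."""
    event = json.loads(payload)
    ts = datetime.fromtimestamp(event["sent_at"], tz=timezone.utc).isoformat()
    return {
        "source": "chat",
        "author": event["sender"],
        "timestamp": ts,
        "text": event["body"],
    }


# Two differently shaped inputs end up as records with identical keys.
records = [
    from_log_line("2020-01-01T10:00:00+00:00\talice\tdisk usage at 91%\n"),
    from_chat_event('{"sender": "bob", "sent_at": 0, "body": "hello there"}'),
]
```

Once every source is mapped to the same record shape, the later stages only need to understand one format.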

Feature Engineering

  • Based on the goal, relevant and interesting data is extracted at this stage
  • For text, this usually means collecting words or groups of words into a Bag of Words
  • Some NLP may be used to select interesting words, and weights may have to be assigned, e.g. sentiment scores
  • Words are grouped and marked with labels so that the machine can learn from them
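
A Bag of Words with labels might look like the sketch below. The stop-word list, the tokenizing regular expression and the sample sentences are illustrative assumptions; a real project would likely use a fuller stop-word list or an NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "to", "and", "of"}


def bag_of_words(text):
    """Lowercase the text, split it into words and count the non-stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


# Each sample pairs a text with the label the machine should learn.
samples = [
    ("The meeting is moved to Monday", "alice"),
    ("Please review the trading report", "bob"),
]
labelled = [(bag_of_words(text), label) for text, label in samples]
```

Word order is discarded here on purpose: only which words occur, and how often, is kept as a feature.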

Machine Learning

  • Based on the problem statement, appropriate machine learning algorithms are applied to build a model
  • Learning may be achieved by processing weighted features, or may be as simple as using word counts
  • At this stage, libraries like scikit-learn, NumPy, pandas, TensorFlow, Theano etc. come in handy
  • The learning process can be actively monitored via graphs, which helps build the intuition needed to tackle problems such as scaling, overfitting, underfitting or an improper learning rate
  • Validation is also carried out to determine the accuracy of the model
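
To make learning from word counts and validation concrete, here is a small sketch of a multinomial Naive Bayes classifier written with the standard library only. Naive Bayes is chosen here purely as an illustration, the text above does not name an algorithm, and the toy training and validation sets are invented. In practice a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label. examples: list of (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, tokens) == label for tokens, label in examples)
    return hits / len(examples)


train_set = [
    ("please review trading report today".split(), "bob"),
    ("trading desk report attached please".split(), "bob"),
    ("meeting agenda for monday".split(), "alice"),
    ("monday project meeting notes".split(), "alice"),
]
validation_set = [
    ("trading report attached".split(), "bob"),
    ("project meeting agenda".split(), "alice"),
]

model = train(train_set)
train_acc = accuracy(model, train_set)
val_acc = accuracy(model, validation_set)
```

Comparing `train_acc` with `val_acc` is the simplest form of the monitoring mentioned above: a high training accuracy combined with a much lower validation accuracy is a typical sign of overfitting.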


  • After learning, the model is ready for prediction
  • The model itself can be stored in a binary format (e.g. with TensorFlow) that can be reloaded and used for predictions in the future
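
Storing and reloading a model can be sketched as follows. The text mentions TensorFlow's binary format; this example swaps in Python's standard `pickle` module as a stand-in so that it stays self-contained, and the `model` dictionary is a made-up placeholder for a trained model.

```python
import os
import pickle
import tempfile

# Placeholder for a trained model; the structure is an assumption.
model = {
    "labels": ["alice", "bob"],
    "word_counts": {"alice": {"meeting": 3}, "bob": {"report": 2}},
}

# Write the model to a binary file...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# ...and load it back later, e.g. in a separate prediction service.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.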

Small Sample Project

Predict the author of a given text snippet. As a sample, we use the Enron email text corpus and learn patterns from it to predict the author.
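
As a first step for such a project, each raw email has to be turned into a (text, author) training pair. The sketch below assumes the emails are plain RFC 822 style messages with a From header; the sample message and the `to_example` helper are hypothetical and not taken from the corpus.

```python
from email import message_from_string

# Invented sample message in the usual header/blank line/body layout.
RAW = """Message-ID: <example-1@example.com>
From: alice@example.com
To: bob@example.com
Subject: Schedule

The meeting is moved to Monday.
"""


def to_example(raw):
    """Return (body text, author) for one single-part raw email."""
    msg = message_from_string(raw)
    return msg.get_payload().strip(), msg["From"]


text, author = to_example(RAW)
```

The body text then goes through feature extraction, while the From address serves as the label to predict.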

