Multiple Disparate Data Sources
- All the text data sources have to be identified
- Text data sources may be:
- web pages scraped or crawled from the internet
- databases of various systems and their interrelationships
- data warehouses
- text files or logs
- live incoming text, e.g. live chat
- a combination of the above (a minimal collection sketch follows this list)
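As an illustration only, a collector for two of these source types might look like the Python sketch below. The URL and file path are placeholders; databases, warehouses, and live streams would each need their own client.

    import requests

    def collect_documents():
        docs = []
        # 1. A web page fetched over HTTP (a crawler would iterate over links)
        resp = requests.get("https://example.com/article.html")
        docs.append(resp.text)
        # 2. A local text file or application log
        with open("app.log", encoding="utf-8") as f:
            docs.append(f.read())
        return docs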
Aggregation
- Data from all these sources is aggregated for further processing
- This usually also involves converting the data into a uniform format (homogenization)
- A dedicated, microservices-based, horizontally scalable system may be used for storing and serving the data, e.g. Kafka, as sketched below
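A minimal sketch of homogenizing documents into a common JSON shape and publishing them to Kafka, using the kafka-python client; the broker address and topic name are assumptions.

    import json
    from kafka import KafkaProducer  # pip install kafka-python

    # Serialize every record as UTF-8 JSON before it reaches the broker
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    )

    def publish(source, text):
        record = {"source": source, "text": text}  # uniform format
        producer.send("raw-text", record)

    publish("weblog", "user reported a crash on login")
    producer.flush()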
Feature Engineering
- Based on the goal, relevant and interesting data is extracted at this stage
- For text, you usually end up collecting words or groups of words into a Bag of Words
- Some NLP may be used to select interesting words, and weights may have to be decided, e.g. sentiment scores
- Words are grouped and marked with labels so the machine can learn from them (a minimal sketch follows this list)
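A minimal scikit-learn sketch of the Bag of Words step; the toy corpus and labels are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy corpus with a label attached to each document
    corpus = [
        "the product is great and the support is great",
        "great service, quick and helpful support",
        "terrible support, slow response",
        "slow, buggy and a terrible experience",
    ]
    labels = ["positive", "positive", "negative", "negative"]

    vectorizer = CountVectorizer(stop_words="english")  # drop common filler words
    bow = vectorizer.fit_transform(corpus)              # documents -> word counts
    print(vectorizer.get_feature_names_out())
    print(bow.toarray())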
Machine Learning
- Based on the problem statement, appropriate machine learning algorithms are applied to build a model
- Learning may be achieved by processing weighted features, or may be as simple as using word counts
- At this stage, libraries like scikit-learn, NumPy, pandas, TensorFlow, Theano etc. come in handy
- The learning process can be actively monitored via graphs, and intuition built to tackle problems like scaling, overfitting, underfitting, an improper learning rate etc.
- Validation is also carried out to determine the accuracy of the model, as sketched below
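Continuing the Feature Engineering sketch above (it assumes the `bow` matrix and `labels` list from there), training and validation could look like this minimal sketch:

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    # Hold out part of the data, fit a simple classifier on word counts,
    # and validate on the unseen portion
    X_train, X_test, y_train, y_test = train_test_split(
        bow, labels, test_size=0.25, random_state=42
    )
    model = MultinomialNB()  # a common baseline for word-count features
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))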
Prediction
- After learning, the model is ready for prediction
- The model itself can be stored in binary form (e.g. as a TensorFlow checkpoint) and reloaded for predictions in the future, as sketched below
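For a scikit-learn model, the equivalent of storing a TensorFlow binary is serializing with joblib. A minimal sketch continuing the example above; the file names are arbitrary.

    import joblib

    # Persist the fitted model and vectorizer to disk
    joblib.dump(model, "model.joblib")
    joblib.dump(vectorizer, "vectorizer.joblib")

    # Later: reload both and classify new text
    model = joblib.load("model.joblib")
    vectorizer = joblib.load("vectorizer.joblib")
    print(model.predict(vectorizer.transform(["quick and helpful support"])))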
Small Sample Project
Predict the author of a given text snippet. As a sample, we use Enron’s email text corpus and learn patterns to predict the author.
https://github.com/neovasolutions/predict-author
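The repository linked above has its own implementation; the sketch below only illustrates the overall idea, and assumes a hypothetical layout of enron/<author>/<message>.txt on disk.

    import os
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # Read every email, labeling it with its author directory name
    texts, authors = [], []
    for author in os.listdir("enron"):
        for name in os.listdir(os.path.join("enron", author)):
            path = os.path.join("enron", author, name)
            with open(path, encoding="latin-1") as f:
                texts.append(f.read())
                authors.append(author)

    # Vectorize, train, and report per-author accuracy on held-out emails
    X_train, X_test, y_train, y_test = train_test_split(
        texts, authors, test_size=0.2, random_state=42
    )
    vectorizer = TfidfVectorizer(stop_words="english")
    clf = LinearSVC().fit(vectorizer.fit_transform(X_train), y_train)
    pred = clf.predict(vectorizer.transform(X_test))
    print(classification_report(y_test, pred))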