Multiple Disparate Data Sources
- All the text data sources have to be identified
- A text data source may be:
- web pages scraped or crawled from the internet
- databases of various systems and their interrelationships
- data warehouses
- text files or logs
- live incoming text, e.g. live chat
- a combination of the above (a minimal ingestion sketch follows this list)
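To make these sources concrete, here is a minimal ingestion sketch in Python covering two of them, a web page and a log file; the URL and file name are placeholders, not part of the original material.

```python
# A minimal ingestion sketch; the URL and file name below are hypothetical.
from urllib.request import urlopen

def fetch_web_page(url):
    """Download raw HTML from a web page (scraping/crawling source)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def read_log_file(path):
    """Read lines from a local text file or log source."""
    with open(path, encoding="utf-8") as f:
        return f.readlines()

documents = []
documents.append(fetch_web_page("https://example.com/articles"))  # hypothetical URL
documents.extend(read_log_file("app.log"))                        # hypothetical file
```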
Aggregation
- Data from all these sources is aggregated for further processing
- This usually also involves converting the data into a uniform format (homogenization)
- A dedicated, microservices-based, horizontally scalable system may be used for storing and serving the data, e.g. Kafka (a small producer sketch follows)
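Below is a small sketch of homogenization plus publishing to Kafka with the kafka-python client; the broker address, topic name, and record format are assumptions for illustration.

```python
# Homogenize records into one uniform format and publish them to Kafka.
# Broker address and topic name are assumptions, not prescribed values.
import json
from kafka import KafkaProducer

def homogenize(source, text):
    """Convert data from any source into one uniform record format."""
    return {"source": source, "text": text.strip()}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for raw in [("web", "<p>Some page text</p>"), ("log", "2017-01-01 user login")]:
    producer.send("aggregated-text", homogenize(*raw))  # assumed topic name
producer.flush()
```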
Feature Engineering
- Based on the goal, relevant and interesting data is extracted at this stage
- Usually, for text, you end up collecting words or groups of words into a Bag of Words
- Some NLP may be used to select interesting words, and weights may have to be decided, e.g. sentiment scores
- Words are grouped and marked with labels so that a machine can learn from them (see the sketch after this list)
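A minimal Bag of Words sketch using scikit-learn's CountVectorizer; the two sample sentences are illustrative only.

```python
# Build a Bag of Words (unigrams and bigrams) from a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the meeting is scheduled for Monday",
    "please review the attached report",
]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # words and word pairs
X = vectorizer.fit_transform(corpus)              # sparse document-term matrix

print(vectorizer.vocabulary_)  # feature name -> column index mapping
print(X.toarray())             # per-document word counts
```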
Machine Learning
- Based on the problem statement, appropriate machine learning algorithms are applied to build a model
- Learning may be achieved by processing weighted features, or may be as simple as using word counts
- At this stage, libraries like scikit-learn, NumPy, pandas, TensorFlow, Theano, etc. come in handy
- The learning process can be actively monitored via graphs, and intuition built to tackle problems like scaling, overfitting, underfitting, an improper learning rate, etc.
- Validation is also carried out to estimate the accuracy of the model (a training-and-validation sketch follows)
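The following sketch ties these steps together on a toy labeled corpus: word-count features, a Naive Bayes baseline (one reasonable choice, not the only one), and a held-out validation split.

```python
# Train and validate a classifier on Bag of Words counts.
# The tiny labeled corpus here is illustrative, not real project data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["great product", "terrible service", "loved it", "awful experience",
         "really great", "so terrible", "loved the service", "awful product"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # toy sentiment-style labels

X = CountVectorizer().fit_transform(texts)  # word-count features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

model = MultinomialNB()      # common baseline for count features
model.fit(X_train, y_train)  # the learning step

print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```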
Prediction
- After learning, the model is ready for prediction
- The model itself can be stored as a binary (e.g. a TensorFlow checkpoint) that can be re-loaded and used for predictions in the future, as in the sketch below
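A sketch of storing and re-loading a model as a binary using Python's standard pickle module (TensorFlow provides its own Saver/checkpoint mechanism for the same purpose); the file name and tiny training data are hypothetical.

```python
# Persist a trained model to disk and re-load it for later predictions.
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts, labels = ["buy gold now", "meeting at noon"], ["spam", "ham"]
vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(texts), labels)

with open("text_model.pkl", "wb") as f:  # hypothetical file name
    pickle.dump((vec, model), f)         # persist vectorizer + model together

with open("text_model.pkl", "rb") as f:
    vec2, model2 = pickle.load(f)        # re-load in a later session

print(model2.predict(vec2.transform(["buy gold at noon"])))
```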
Small Sample Project
Predict the author of a given text snippet. As a sample, we use the Enron email corpus and learn patterns that identify the author.
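A compressed end-to-end sketch of this task; the four toy emails and author labels below merely stand in for the real Enron corpus, whose loading and labeling are not shown.

```python
# Author prediction: vectorize email text, train a classifier, predict.
# The toy data and labels are placeholders for the Enron corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "please send the gas trading report by noon",
    "the pipeline schedule needs approval today",
    "let's grab lunch after the board meeting",
    "see you at the conference call this afternoon",
]
authors = ["kay", "kay", "jeff", "jeff"]  # toy author labels

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(emails, authors)  # learn per-author word patterns

print(pipeline.predict(["approve the gas pipeline report"]))  # likely 'kay'
```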