Have you noticed that a trained ML model's performance degrades over time? Why does it degrade? Say we have a model that takes a person's image as input and detects the face. With the Covid situation, almost 90% of people wear masks, so the model can no longer detect faces reliably, resulting in low performance and low accuracy. What is this phenomenon called? It is called Model Drift, and it is categorized into Concept Drift and Data Drift.
Concept Drift is when the properties of the dependent variable change, i.e., the output/prediction of the model.
Data Drift is when the properties of the independent variable change, i.e., the input to the model.
y = a + bx
A change in the dependent variable y leads to Concept Drift, and a change in the independent variable x leads to Data Drift.
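To make Data Drift concrete, here is a minimal sketch (not tied to any particular monitoring tool) that compares a feature's training-time distribution against recent production values using scipy's two-sample Kolmogorov–Smirnov test. The feature and both distributions are made up for illustration.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_ages = rng.normal(loc=35, scale=8, size=1000)  # feature distribution at training time
prod_ages = rng.normal(loc=45, scale=8, size=1000)   # distribution seen in production (shifted)

stat, p_value = ks_2samp(train_ages, prod_ages)
if p_value < 0.05:  # unlikely the two samples come from the same distribution
    print(f"Data drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")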
Change is inevitable
The world, and the parameters on which we train a model, change over time. Consider another example: a travel agency model that takes a person's average salary, the season, and the weather as inputs to predict the number of people traveling to some country X. With Covid regulations such as border closures and flying restrictions, along with job losses, inflation, and a change in people's mindset, the model's predictions would quickly become unreliable.
How to detect model drift?
Monitoring the model in production is the only way to detect model drift. Set thresholds on metrics such as precision, recall, and F1-score, and trigger alerts through monitoring tools when a metric drops below its threshold. Evidently AI is one such monitoring tool.
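As a minimal illustration of the alerting idea (the labels, predictions, and threshold below are made up), we can recompute the metrics on freshly labelled production data with scikit-learn and raise an alert when any metric falls below its threshold:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # ground truth collected in production
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]  # the deployed model's predictions

THRESHOLD = 0.8
scores = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in scores.items():
    if value < THRESHOLD:
        print(f"ALERT: {name} dropped to {value:.2f} (threshold {THRESHOLD})")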
How to avoid model drift?
Either we train the model continuously on the incoming stream of data, or we retrain it on a schedule (weekly, monthly, etc.) with updated data. Retraining from scratch is not an efficient way to handle models that are already deployed in production; online/incremental training is the more efficient approach, as the sketch below illustrates.
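To make the contrast concrete, here is a sketch using scikit-learn's SGDClassifier purely for illustration (the creme example follows later); the data is random. Scheduled retraining rebuilds the model on the full updated dataset, while incremental training updates the already-deployed model in place.

import numpy as np
from sklearn.linear_model import SGDClassifier

X_old = np.random.rand(100, 3); y_old = np.random.randint(0, 2, 100)
X_new = np.random.rand(10, 3);  y_new = np.random.randint(0, 2, 10)

# Scheduled retraining: rebuild from scratch on old + new data.
batch_model = SGDClassifier().fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Incremental training: update the deployed model without a full retrain.
online_model = SGDClassifier().fit(X_old, y_old)
online_model.partial_fit(X_new, y_new)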
Incremental Model Training Approach
Let us explore the Python library creme to train an incremental ML model on streaming data, one record at a time.
With creme, we encourage a different approach, which is to continuously learn from a stream of data. This means that the model processes one observation at a time and can therefore be updated on the fly. This allows learning from massive datasets that don't fit in main memory. Online machine learning also integrates nicely in cases where new data is constantly arriving. It shines in many use cases, such as time series forecasting, spam filtering, recommender systems, CTR prediction, and IoT applications. If you're bored with retraining models and want to instead build dynamic models, then online machine learning (and therefore creme!) might be what you're looking for.
Install from PyPI
pip install creme
Dataset to train the model
docs = [
    ("Cricket news: England James Anderson determined to revive international career despite West Indies axing", "Cricket"),
    ("We'll Have Just One Head Coach For All Cricket Formats: CA Chairman", "Cricket"),
    ("Rod Marsh: Australian cricket legend in critical condition after suffering heart attack", "Cricket"),
    ("Facebook, Twitter highlight security steps for users in Ukraine", "Technology"),
    ("Apple launching new series of iPhone", "Technology"),
    ("Galaxy S22 preorder sales indicate the phone is already a huge success", "Technology"),
]
Setting up the model pipeline
from creme import compose
from creme import feature_extraction
from creme import naive_bayes

model = compose.Pipeline(
    ('tokenize', feature_extraction.TFIDF(lowercase=True)),
    ('nb', naive_bayes.MultinomialNB(alpha=1))
)
Here, we are using TFIDF as the feature extraction method and Naive Bayes as the ML algorithm. creme ships other feature extraction methods we can try, such as BagOfWords; a sketch follows.
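For example, here is the same pipeline with creme's BagOfWords extractor, which uses raw token counts instead of TF-IDF weights; only the tokenize step changes:

bow_model = compose.Pipeline(
    ('tokenize', feature_extraction.BagOfWords(lowercase=True)),
    ('nb', naive_bayes.MultinomialNB(alpha=1))
)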
Fitting the data to the model, one record at a time
%%time
for sentence, label in docs:
    model = model.fit_one(sentence, label)

Wall time: 998 µs
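A common way to evaluate an online model is progressive validation (test-then-train): score each record before learning from it. Here is a sketch using a fresh copy of the pipeline above and creme's metrics module:

from creme import metrics

fresh_model = compose.Pipeline(
    ('tokenize', feature_extraction.TFIDF(lowercase=True)),
    ('nb', naive_bayes.MultinomialNB(alpha=1))
)
metric = metrics.Accuracy()
for sentence, label in docs:
    y_pred = fresh_model.predict_one(sentence)          # predict first...
    metric = metric.update(label, y_pred)               # ...score the prediction...
    fresh_model = fresh_model.fit_one(sentence, label)  # ...then learn from the record
print(metric)  # running accuracy over the stream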
Predictions – Testing the model
model.predict_one("Traffic arrangements for Australian cricket team's visit, Pakistan Day events reviewed")
Out: 'Cricket'

model.predict_one("Launching Facebook Reels Globally and New Ways for Creators to Make Money")
Out: 'Technology'

test = "Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad"
model.predict_one(test)
Out: 'Cricket'
As we can see in the testing above, the last record, a football headline, is predicted as Cricket. Both are sports-related, but the model has never seen a Football class, so we can simply train it on the new category.
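Before retraining, it is worth peeking at the model's confidence. This assumes the classifier exposes predict_proba_one, which creme's MultinomialNB does:

test = "Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad"
model.predict_proba_one(test)
# Returns a dict mapping each known class ('Cricket', 'Technology') to a probability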
Training on new data with a new category
newDocs = [
    "Footballer took out insurance policy on BMW minutes after smashing into parked cars",
    "Russian footballer Fedor Smolov, a 32-year-old striker currently playing for his country, became one of the first Russian sportsmen to express his heartbreak at the invasion of Ukraine by his country.",
    "Ukraine's international footballer Roman Yaremchuk scored the equalizer for Benfica in a Champions League match",
]

for doc_ in newDocs:
    model.fit_one(doc_, "Football")
Retesting the model
test = "Footballer Koo Ja-cheol to Return to K-League After More Than Decade Abroad"
model.predict_one(test)
Out: 'Football'
We can update the model with new data for an existing category, or with new data for an entirely new category. The sketch below shows how this fits into a production feedback loop.
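Putting it together, here is a minimal (hypothetical) production feedback loop: predict on each incoming record, then update the model once the true label arrives. stream_of_records is a made-up stand-in for your real data source, and pickling works because creme models are plain Python objects.

import pickle

stream_of_records = [  # hypothetical (text, true_label) pairs arriving over time
    ("Apple unveils new MacBook lineup", "Technology"),
    ("England name squad for Test series", "Cricket"),
]

for text, true_label in stream_of_records:
    prediction = model.predict_one(text)     # serve the prediction
    model = model.fit_one(text, true_label)  # learn once feedback arrives

with open("model.pkl", "wb") as f:  # persist the updated model across restarts
    pickle.dump(model, f)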
Some benefits of using creme (and online machine learning in general):
- Incremental: models can update themselves in real-time.
- Adaptive: models can adapt to concept drift.
- Production-ready: working with data streams makes it simple to replicate production scenarios during model development.
- Efficient: models don't have to be retrained and require little compute power, which lowers their carbon footprint.
- Fast: when the goal is to learn and predict with a single instance at a time, creme is an order of magnitude faster than PyTorch, TensorFlow, and scikit-learn.