A machine learning model, simplified
Let’s start by taking a look at a simple representation of a machine learning model:
X → y
Given an input X, the model produces a prediction y. The → symbol represents the relationship that the model learned between the input and the output it produces.
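To make that concrete, here's a minimal sketch using scikit-learn and a toy dataset (both choices are mine, just for illustration):

```python
# A minimal sketch of the X -> y mapping, assuming scikit-learn
# is available and using a synthetic dataset purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

model = LogisticRegression()
model.fit(X, y)  # the model learns the relationship X -> y

predictions = model.predict(X[:5])  # given inputs X, produce predictions y
print(predictions)
```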
Now that we've trained and deployed our model, what happens if the distribution of the input X changes?
Remember your cellphone camera from 10 years ago?
Imagine that a few smart people created a face recognition model back then. They had to use the data that was available, which meant a bunch of 1.2-megapixel pictures 🤮.
Not great, but good enough to get a model working.
What do you think happens over time with better cameras coming out every year?
The pictures that we take and use with the model have a different resolution than those used to train it. Our faces are still the same, but the data is different.
This shift in the distribution of the data is called data drift. It can happen slowly over time—like in the above example—or overnight, like when our faces turned into eyes and a mask 😷.
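One common way to catch data drift is to compare the distribution of an input feature at training time against what the model sees in production. Here's a minimal sketch using a two-sample Kolmogorov-Smirnov test; SciPy, the synthetic arrays, and the significance threshold are all my assumptions for illustration:

```python
# A sketch of one common data drift check: compare the distribution of a
# feature at training time against production data with a two-sample
# Kolmogorov-Smirnov test. The arrays and threshold here are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)    # feature at training time
production_feature = rng.normal(loc=0.5, scale=1.0, size=10_000)  # same feature today, shifted

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # the significance threshold is a judgment call
    print(f"Possible data drift detected (KS statistic={statistic:.3f})")
```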
Covid was a great reminder to all of us: not only did the data used by many models change, but the patterns those models had learned changed as well.
I'm sure that Netflix, Disney, and other streaming services have models predicting the viewing patterns of different segments of their user base.
Think about what happened to those predictions when Covid forced us to stay inside and hordes of people started watching more shows and movies than ever before. Pretty bad, huh?
Notice that the input data to those models didn’t change at all. What changed was the relationship of that input to the output predictions. This is called concept drift.
Concept drift can also happen slowly over time: think of a model predicting buying patterns for products that start facing new competition, words changing their meaning, or our shifting definition of, and tolerance for, what's wrong and what isn't.
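Since the inputs don't change, concept drift tends to show up as degrading performance on freshly labeled data rather than as a shift in the features themselves. A minimal sketch of that signal, with made-up numbers standing in for real production feedback:

```python
# A sketch of the simplest concept-drift signal: the inputs look the same,
# but the model's error rate on freshly labeled data climbs past a baseline.
# The baseline, labels, and tolerance below are all assumptions.
import numpy as np

def error_rate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true != y_pred))

baseline_error = 0.08  # error measured on the held-out test set at deploy time
tolerance = 0.05       # how much degradation we accept before acting

# Pretend these arrived from production with ground-truth labels attached.
recent_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
recent_predictions = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0])

current_error = error_rate(recent_labels, recent_predictions)
if current_error > baseline_error + tolerance:
    print(f"Possible concept drift: error rose from {baseline_error:.2f} to {current_error:.2f}")
```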
Together, data and concept drift are a significant threat to the quality of our models.
A way out of it
Every machine learning model needs continuous monitoring.
Monitoring is a necessary step, together with a process for updating the model to keep its performance where it needs to be.
Updates might be as simple as retraining a new version of the model using new data or as complex as a completely new implementation that solves the problem.
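Here's a rough sketch of what the simple end of that spectrum could look like; `monitor` and `load_recent_data` are hypothetical placeholders for whatever drift checks and data pipeline you have in place:

```python
# A sketch of the "monitor, then retrain" loop described above, assuming
# scikit-learn. monitor() and load_recent_data() are hypothetical stand-ins
# for your actual drift checks and data pipeline.
from sklearn.base import clone

def retrain_if_needed(model, monitor, load_recent_data):
    drift_detected = monitor(model)  # e.g., the KS test or error-rate check above
    if not drift_detected:
        return model
    X_recent, y_recent = load_recent_data()
    new_model = clone(model)  # same model type and hyperparameters, fresh parameters
    new_model.fit(X_recent, y_recent)
    return new_model  # deploy this version after validating it
```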
If you'd like to dig deeper into data and concept drift, check out Elena's article for a more comprehensive analysis of the topic.