
You can't do much without labels
By Santiago • Issue #9
Hey there!
(In case you don’t know, most of the machine learning content I publish is on my Twitter account. If you want more of the same, you won’t be disappointed by following me there.)
This week, I’ve been thinking mostly about the gap between machine learning education and the real job. I put together some thoughts, linked below, that inspired this article: a technique you can use to overcome one of the most common problems you’ll find out there.
Without further ado, here’s my rant, with today’s story right after it.

Santiago
Machine learning education is broken.

If you are preparing for a research position, you are good. If you are looking to get out there and start solving problems, not even close.

Here are some thoughts so you can get ahead.

An introduction to Active Learning
Do you know what scares me? Having to go through a mountain of data to come up with labels.
Data labeling is hard, expensive, and sometimes outright prohibitive. It can kill your machine learning project before it even starts.
Let’s kick this off with a hypothetical problem: we’d like to build a model capable of visually inspecting photos of circuit boards and classifying them based on their specific configuration.
Imagine a factory producing thousands of these boards per minute. Going through each circuit board manually would be a nightmare and slow down production significantly. That’s where our model would come in!
One of the hypothetical circuit boards that our factory produced.
A big hurdle
How hard would it be to use deep learning to build a classifier that categorizes a picture of a circuit board appropriately?
As with any good ol’ classification model, we need to build a training set containing samples for each specific category we want to classify.
And here lies the problem.
Somebody would have to do this manually. Depending on how many images we need, the number of different classes, how long it takes to inspect each picture visually, and how much we need to pay people to do this job, creating the training dataset may be too expensive to consider.
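To make that concrete with some made-up numbers: say we need 100,000 labeled photos, and visually inspecting each one takes 30 seconds. That’s more than 800 hours of work and, at $15 per hour, roughly $12,500 just to build the initial dataset. Every one of those numbers is hypothetical, but the math rarely gets friendlier at factory scale.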
We know how to solve the problem, but we need training data. And putting this together is not a trivial task.
Unfortunately, this is a common situation out there.
There’s always a way
Framing this as a traditional Supervised Learning problem, where we need a ton of labeled data to train a classifier, is not an option. We have to think differently.
Here’s an idea. We can start by labeling a few examples of each circuit board category to create a small training set that can kick things off, and build a model from it.
After training this model, we can use it to make predictions across the entire unlabeled dataset. Since we didn’t use enough data to train the model, the results won’t be good, but we can still use them to our advantage.
The predictions will tell us how confident the model is that each image belongs to each category. We can select the few images the model is least confident about, label them manually, add them to the training dataset, and retrain a new version of the model.
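Here’s what that selection step could look like in code. This is just a sketch: I’m using scikit-learn and synthetic data as a stand-in for the circuit board photos, but the same idea works with the predicted probabilities of any classifier, deep learning models included.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic features standing in for our circuit board photos
    X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                               n_informative=10, random_state=42)

    # Pretend we could only afford to label the first 30 examples
    X_labeled, y_labeled = X[:30], y[:30]
    X_unlabeled = X[30:]

    # Train an initial model on the tiny labeled set
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    # Score every unlabeled sample by the probability of its most
    # likely class, then pick the ones the model is least sure about
    confidence = model.predict_proba(X_unlabeled).max(axis=1)
    to_label = np.argsort(confidence)[:10]  # 10 samples to label next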
Can you see where I’m going with this?
From this point, we can repeat the whole process: use the model to make new predictions, select the worst results, label them, and create a new version of the model using the expanded training set.
One step at a time, the model will get better and better. In every round, we’ll only label the examples that give the model the most new information, minimizing the amount of manual work we have to do.
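A minimal version of the whole loop could look like this. Again, a sketch with synthetic data: the line that flips labeled[worst] to True is where a human would actually label the selected samples.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, n_features=20, n_classes=3,
                               n_informative=10, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=500, random_state=0)

    # Start with a tiny labeled set; everything else stays unlabeled
    labeled = np.zeros(len(X_pool), dtype=bool)
    labeled[:30] = True

    for round_number in range(5):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_pool[labeled], y_pool[labeled])
        accuracy = accuracy_score(y_test, model.predict(X_test))
        print(f"Round {round_number}: {labeled.sum()} labels, "
              f"accuracy {accuracy:.3f}")

        # Find the unlabeled samples the model is least confident about
        candidates = np.flatnonzero(~labeled)
        confidence = model.predict_proba(X_pool[candidates]).max(axis=1)
        worst = candidates[np.argsort(confidence)[:20]]

        # "Label" them (in a real project, a person does this step)
        labeled[worst] = True

If you run it, you should see the accuracy climb as the labeled set grows, even though each round only adds 20 labels.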
Active Learning
This approach is called “Active Learning,” a semi-supervised machine learning technique. Here is an excerpt from Wikipedia:
Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.
There are different variations of Active Learning, specifically in how you select which images to label during each iteration. The example above uses an approach known as “Least Confidence,” but you can find several more explained in this fantastic tutorial.
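For reference, here is roughly how three common selection strategies score a batch of predicted probabilities. The function names are mine, but Least Confidence, Margin Sampling, and Entropy are standard strategies from the Active Learning literature:

    import numpy as np

    def least_confidence(probabilities):
        # Uncertain when the most likely class has a low probability
        return 1.0 - probabilities.max(axis=1)

    def margin_sampling(probabilities):
        # Uncertain when the top two classes are close together
        top_two = np.sort(probabilities, axis=1)[:, -2:]
        return 1.0 - (top_two[:, 1] - top_two[:, 0])

    def entropy_sampling(probabilities):
        # Uncertain when probability is spread across all the classes
        return -np.sum(probabilities * np.log(probabilities + 1e-12), axis=1)

Whichever one you pick, you label the samples with the highest uncertainty scores first.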
Active Learning is instrumental in making many use cases work that would otherwise be impractical to tackle with a Supervised Learning approach.
Look into it and keep it close. It will prove much more useful than you might imagine.
YouTube is happening
I’m convinced.
I’ll be posting content on YouTube. Here is my channel with a whopping 67 subscribers so far. 🙃 No videos there yet, so I can’t complain.
I already know the type of content I’ll be creating. I also know the way I want to deliver that message. I’m close to figuring out everything I need to make this happen.
And I’m extremely excited! I’ll be creating the type of content I feel is missing out there, delivered the way I’d like to watch it.
I still don’t know when I’ll publish the first video, but I’ll keep you posted about my progress.
Thanks to everyone who replied to the last issue encouraging me!
We’ll talk again next week.