
Don't look

By Santiago • Issue #31 • View online
You can’t unsee things.
Contrary to popular belief, knowing more is not always better.
Here, I want to make my case using one example: the test data that we use to evaluate our models.
Let me convince you that your biases are a constant threat to your work and that, sometimes, the only solution is to remove yourself from the picture.
Have a great weekend, everyone!

Don't look at your test set
Why would you?
Test data should mimic production data. You don’t have access to the latter, so why would you treat the former differently?
When building a machine learning model, I want my test set to represent real-world data as closely as possible.

The best strategy I've found is to split off the test set *before* I even look at the data.
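Here is a minimal sketch of what that looks like in practice, using scikit-learn's `train_test_split` (the arrays below are placeholder data, not a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for whatever you just loaded.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Split the test set off FIRST, before any exploration or cleaning.
# A fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# From here on, explore and iterate on X_train only.
# Touch X_test exactly once, at the very end.
```

The point of doing this first is that no decision you make later, about features, cleaning, or model choice, can be contaminated by anything you saw in the test set.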

Here is why this helps.
A process to improve your data
It turns out that good data is hard to come by.
Even datasets reviewed and used for years are riddled with mistakes that conspire against your work.
Here are some tips to improve your data.
A team led by MIT researchers examined 10 of the most-cited datasets used to test machine learning systems.

They found that around 3.4% of the data was inaccurate or mislabeled.

Those are very popular datasets. How about yours?
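One practical way to audit your own labels is to train a model with cross-validation and flag the examples whose given label receives the least support. This is a simplified sketch of that idea, not the exact method used in the study; the model and data here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data with a few deliberately flipped labels.
X, y = make_classification(n_samples=200, random_state=0)
y_noisy = y.copy()
y_noisy[:5] = 1 - y_noisy[:5]  # inject label errors

# Out-of-fold predicted probabilities, so no example is
# scored by a model that trained on it.
probs = cross_val_predict(
    LogisticRegression(), X, y_noisy, cv=5, method="predict_proba"
)

# Probability the model assigns to each example's GIVEN label.
given_label_prob = probs[np.arange(len(y_noisy)), y_noisy]

# The least-supported labels are the ones worth reviewing by hand.
suspects = np.argsort(given_label_prob)[:10]
```

The output is a shortlist for human review, not an automatic fix: a low score means "look at this one," not "relabel it."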
