More splits

By Santiago • Issue #24
Today it’s all about splitting data. Again.
Such a simple step, yet misunderstood by many.
Every problem starts here, or at least right after collecting your data. The more I write about this, the more people reach out with questions about the process.
It’s hard to get anywhere if you don’t get the basics right.
Let’s talk about it.

Splitting in half is usually not a good idea. Or maybe it is. It depends...
More about leaking data
Last week, I talked about data leaks.
Some replied with many different questions.
Here is a follow-up thread that expands on the one I started before. You can start here if you want to understand how easy it is to leak data and what to do about it.
Always split your dataset before transforming the data.

I posted a thread earlier this week. A few people replied with a valid concern:

"How do you know the true range of a column without looking at all of your data?"

Good question. Let's talk about this: ↓
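
To make the "split first, transform second" rule concrete, here is a minimal sketch in plain Python. It is one common way to handle the range question, not necessarily the answer the thread gives: you never look up the "true" range, you estimate it from the training rows only and reuse that estimate everywhere else. The helper names (split_rows, fit_min_max, apply_min_max) and the toy column are made up for illustration.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_min_max(values):
    """Learn the column range from the given values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with a range that was learned elsewhere."""
    span = hi - lo
    if span == 0:
        # A constant column carries no range information.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical toy column: the numbers 1.0 through 100.0.
column = [float(v) for v in range(1, 101)]

# 1. Split before touching the data in any other way.
train, test = split_rows(column)

# 2. Learn the range from the training rows only.
lo, hi = fit_min_max(train)

# 3. Apply that same range to both sets.
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
```

With this setup, a test value can land slightly below 0 or above 1 when it falls outside the range seen in training. That is expected: it mirrors what happens when the model meets new data in production.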
The most important thing about splitting
Do you know why you should split your data?
It turns out this is not an obvious question, even for people who have worked with machine learning models for a long time.
Let’s try to fix that with this thread.
Surprisingly, many people don't understand why they split the data into different sets to build a machine learning model.

They know what to do but don't know why, when, or how.

Thread: On the most important thing you should know about splitting your data.