First, a little bit of context
We can’t talk shop without focusing for a quick second on how the training process works. Here is a rough summary that should be enough for our purposes:
- We take a batch of examples from the training dataset.
- We run that batch through the model to compute a result.
- We find how far away that result is from where it needs to be.
- We adjust the model’s parameters by a specific amount.
- We repeat the process for as many iterations as required.
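The steps above can be sketched in a few lines of code. Here’s a minimal toy example fitting a line to made-up data with plain NumPy (the data, learning rate, and batch size are all invented for illustration; real training uses a framework, but the shape of the loop is the same):

```python
import numpy as np

# Toy data: y = 3x + 1, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0   # the model's parameters
lr = 0.1          # how big each adjustment is
batch_size = 16   # the choice this article is about

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # 1. take a batch
    pred = w * X[idx, 0] + b                        # 2. run it through the model
    error = pred - y[idx]                           # 3. how far off are we?
    w -= lr * 2 * np.mean(error * X[idx, 0])        # 4. adjust the parameters
    b -= lr * 2 * np.mean(error)                    # 5. repeat (the loop)
```

After the loop, `w` and `b` end up close to the true values (3 and 1). Every variation we’ll discuss below is this same loop with a different `batch_size`.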
The number of examples we use to create that batch is the first decision we need to make. It’s a critical choice that will impact how the whole process behaves.
We have three possible options to pick from:
- We can use the entire training dataset to create one single, long-ass batch 🙈.
- We can go to the other extreme and use a single example at a time.
- We can fall somewhere in the middle and use a small group of examples in every batch.
Let’s think through each one of these options.
Using the whole dataset at once
Machine learning practitioners love to come up with names for everything; hence they decided to call this process “Batch Gradient Descent.” Gradient Descent because that’s the optimization algorithm’s name, and Batch because we’ll be using the entire dataset 🤷♂️. Yeah, I know it doesn’t make sense, but let’s roll with it.
If we create a batch with every example from our dataset, run it through the process, and only update the model once at the end, we’ll save a lot of processing time. That’s great, but on the other hand, it may be hard to fit a lot of examples in memory at the same time, so this won’t work for large datasets.
The most interesting aspect of using the entire dataset to compute the updates is that we smooth out all of the noise in the data, producing small, stable adjustments. That sounds boring, but it’s predictable. Some problems will benefit from this, but the lack of noise may also prevent the algorithm from escaping a suboptimal solution.
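In code, Batch Gradient Descent means the batch is simply the whole dataset, so every update is one smooth average over everything we have. A toy sketch with invented data and numbers:

```python
import numpy as np

# Toy data: y = 3x + 1, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1

for step in range(200):
    error = (w * X[:, 0] + b) - y           # the WHOLE dataset, every step
    w -= lr * 2 * np.mean(error * X[:, 0])  # one smooth, averaged update
    b -= lr * 2 * np.mean(error)
```

Note there’s no randomness inside the loop at all: run it twice and you get exactly the same sequence of updates. That’s the stability (and the memory cost: `X` must fit in memory in one piece).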
Using a single example
This one is called “Stochastic Gradient Descent” (usually referred to as SGD, because it takes some time to write the whole thing and acronyms always make us look smarter).
In this case, we adjust the model’s parameters for every single example in our dataset. Here, we don’t have to process the whole thing at once, so we won’t have memory constraints, and we’ll get immediate feedback about how training is going.
Updating the parameters for every example introduces a lot of noise into the adjustments. That’s good for specific problems because it keeps them from getting stuck: the values will jump around and escape any trap. But it also produces an ugly, noisy signal while the model is training.
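Here’s the same toy problem trained one example at a time (data and numbers invented for illustration). The only structural change is the inner loop: one update per example, in shuffled order:

```python
import numpy as np

# Toy data: y = 3x + 1, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(10):
    for i in rng.permutation(len(X)):  # visit examples in random order
        x_i, y_i = X[i, 0], y[i]
        error = (w * x_i + b) - y_i    # the error for ONE example
        w -= lr * 2 * error * x_i      # a noisy update after every example
        b -= lr * 2 * error
```

Each individual update points in a slightly wrong direction, but on average they move toward the solution, so `w` and `b` still land near 3 and 1 while bouncing around a bit.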
Somewhere in the middle
Splitting the difference is usually a good strategy, and that’s what “Mini-Batch Gradient Descent” does: It takes a few examples from the dataset to compute the updates to the model.
This is a great compromise that gives us the advantages of both previous methods and avoids their problems. In practice, Mini-Batch Gradient Descent is the one commonly used. Still, we usually refer to it as “Stochastic Gradient Descent” because we really want to make sure to make it as confusing as possible. When you hear somebody say “SGD,” keep in mind that they probably use a batch with more than one example 🤦🏻♂️.
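Mini-Batch Gradient Descent is, again, the same loop, just slicing the shuffled dataset into chunks. A toy sketch with invented data, using a batch size of 32:

```python
import numpy as np

# Toy data: y = 3x + 1, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # the next (up to) 32 examples
        error = (w * X[idx, 0] + b) - y[idx]
        w -= lr * 2 * np.mean(error * X[idx, 0])
        b -= lr * 2 * np.mean(error)
```

Each update averages over 32 examples, so it’s much less jumpy than the one-at-a-time version, but it still carries enough noise to shake the model out of bad spots, and only one small batch needs to be in memory at a time.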
The final question we need to answer is how many examples you should include in a batch. There’s been a lot of research into this, and the empirical evidence suggests that smaller batches perform better.
To make it even more concrete, quoting a good paper exploring this idea: “(…) 32 is a good default value.”
Let’s wrap this up
Alright, let’s summarize this really quick with some practical advice.
The number of examples you use during every iteration of your training process is an important choice. A good practice is to start with 32 unless you have a good reason to go with a different size.
After you get a model that works, feel free to experiment with different batch sizes. Usually, I don’t deviate too much from the default value and rarely go with anything other than 16, 32, 64, or 128.
And, of course, a toast 🍸 to those who work hard to make us feel welcome with their use of names and acronyms!