I didn’t pay too much attention when I heard about CLIP, a new neural network that learns visual concepts from natural language supervision. At some point, however, my social media feed was all about it. I mostly have to blame this guy. I had to look into it 👀.
It immediately felt like Christmas in March (CLIP was released in January, but I was late to the party). Here you had this new technique that kicked everyone’s butts with a zero-shot approach!
Let’s try to unpack this a little bit.
What in the world is “zero-shot”?
If your model can correctly predict classes it never saw during training, you have a zero-shot-capable model.
For example, you might have heard of ImageNet, a dataset of 14+ million images organized into roughly 22,000 categories. CLIP reaches 76.2% top-1 accuracy on the standard ImageNet benchmark without training on that dataset or its classes, matching the original ResNet-50, which was trained on every one of its labeled examples.
> (…), given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an AI which has been trained to recognize horses, but has never seen a zebra, can still recognize a zebra if it also knows that zebras look like striped horses.
Mic-drop, mind-blown moment. Take a minute and try to appreciate this.
Think about this: zero-shot capabilities indicate that the model is learning to relate visual concepts to categories at a much deeper level than previous approaches.
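You can poke at this yourself. Here is a minimal sketch of zero-shot classification using OpenAI’s released `clip` package (installable from the openai/CLIP GitHub repo); the `zebra.jpg` file and the candidate captions are just placeholders for whatever image and classes you want to try:

```python
# Minimal zero-shot classification sketch with OpenAI's clip package
# (pip install git+https://github.com/openai/CLIP.git).
# "zebra.jpg" and the captions below are placeholders, not part of CLIP itself.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess one image and tokenize the candidate "classes".
# The classes are just free-form text; none of them were fixed at training time.
image = preprocess(Image.open("zebra.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(
    ["a photo of a zebra", "a photo of a horse", "a photo of a dog"]
).to(device)

with torch.no_grad():
    # Similarity scores between the image and each caption.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # highest probability should land on the matching caption
```

Swap in any image and any list of captions you like; that is the whole point of zero-shot.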
This changes the game, and here is why
There are three key advantages of CLIP over existing supervised techniques:
- Putting together a good dataset and labeling it is a pain in the rear end. We don’t need this with CLIP, which learns directly from images paired with the free-form text already floating around the internet, and I can’t describe my happiness because of it.
- Even if we collect and label a good dataset, existing models don’t generalize very well outside of that. That’s not the case with CLIP, which we can use for all sorts of tasks unrelated to a specific dataset.
- And the cherry 🍒 on top is that CLIP’s real-world performance is consistent with its performance on vision benchmarks. Just in case you didn’t know, most current deep learning models do much better on curated benchmark datasets than out in the wild. This sucks, but CLIP takes care of it.
This is a big deal! Not in the “oh-wow-we-just-discovered-something-that-will-be-useful-someday” sense, but more in the “holy-crap-we-can-use-this-now-and-it’s-awesome” way.
Well, this is all fine and dandy. Now what?
Following in their GPT-3 footsteps, OpenAI didn’t publish the full model, just a smaller version. They also warned against using it in production, citing the need for more task-specific testing and the potential for bias in the model.
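If you’re curious about exactly which checkpoints are out there, the `clip` package can tell you directly (shortly after the January release the list was limited to the smaller variants, roughly a ResNet-50 and a ViT-B/32):

```python
import clip

# Lists the checkpoints OpenAI has published so far,
# e.g. ["RN50", "ViT-B/32"] around the time of the initial release.
print(clip.available_models())
```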