Let’s try to come up with a way to build this thing.
Alright, we know we can’t simply compare pixels from two different images to determine whether they show the same person. If you aren’t convinced about this, take some time and think it through. Pretty complex stuff, right?
Since raw pixels don’t work, we could convert images into a different format that’s easy to compare. If we do that, we could solve this problem in two steps:
- Turn the suspect’s photo into a new representation that follows the new format that we came up with.
- Compare that with a database of mugshots stored in the same format.
Simple, right? 😎
Making up a good representation.
Let’s focus for a minute on a format that allows us to compare two images. How can we do this?
Here is a simple way to think about it: we could create a list of every person’s features. For example, we could have features like these:
- Eye color
- Hair color
- Hair length
- Beard color
- Nose size
We can then assign a numeric value to each one of these features. For example, if the person in the picture has black eyes, we will assign the value 0 to the first feature, while blue eyes will correspond to the value 1, and so on.
If we take every photo in our database and do this, finding any person becomes a matter of comparing lists of features and returning the most similar one. That’s a problem we can easily solve. As long as our list of features is long and descriptive enough, we should have a shot at identifying pictures that belong to similar individuals!
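To make this concrete, here is a toy sketch of the idea: a handful of made-up features encoded as numbers, with similarity measured by how many feature values two lists share. The feature names and encodings here are hypothetical, purely for illustration.

```python
def encode(eye_color, hair_color, hair_length, beard_color, nose_size):
    """Turn categorical features into a list of numbers (a tiny feature list)."""
    colors = {"black": 0, "blue": 1, "brown": 2, "blond": 3, "none": 4}
    lengths = {"short": 0, "medium": 1, "long": 2}
    sizes = {"small": 0, "medium": 1, "large": 2}
    return [colors[eye_color], colors[hair_color], lengths[hair_length],
            colors[beard_color], sizes[nose_size]]

def similarity(a, b):
    """Count how many features two encoded lists agree on."""
    return sum(1 for x, y in zip(a, b) if x == y)

suspect = encode("brown", "black", "short", "none", "medium")
mugshot = encode("brown", "black", "long", "none", "medium")
print(similarity(suspect, mugshot))  # 4 — only hair length differs
```

A real system would use a much richer similarity measure than exact matches, but counting agreements is enough to show why comparing lists is so much easier than comparing pixels.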
There are only two small problems left.
Assuming we can turn pixels into a list of features, we can open the champagne 🍾 and move to a more interesting problem, but we still have to answer a couple of questions:
- What is a good list of features?
- How in the world can we do the conversion?
Remember that our features need to be descriptive enough that we can quickly identify a match. Simultaneously, the more features we add, the harder it will be to do the conversion, store the data, and compare the lists.
Which brings me to the second question: how can we do the conversion in the first place? Even if we put the entire police department 👮‍♀️ to work day and night, it would take many, many movies’ worth of time to convert their database of pictures into lists of features.
Machine learning to the rescue, of course.
You probably saw this coming, didn’t you? We can train a neural network so it learns to turn images into a list of features, called a “feature vector.” Of course, we can’t predict which features it will learn, and we may not even be able to interpret those features, but they will do the job for us.
Remember our discussion about autoencoders? Some of the same intuition applies here (although the solution to this problem is not an autoencoder). Using enough pictures of faces, we can have a neural network learn what’s similar and different about them and produce the features we need.
The rest is simple: using that network, we can pre-compute the feature vector for every mugshot in the database. When we are ready to look up a photo, we get its feature vector and compare it with every stored vector, returning the most similar at the end. Mr. Detective will then take a look, say his lines, and start making phone calls.
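The lookup step can be sketched in a few lines. Assume some trained network has already turned each mugshot into a feature vector; the vectors below are hand-made stand-ins, and the suspect names are invented.

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Pre-computed feature vectors for the mugshot database (made-up values;
# a real system would get these from the trained network).
database = {
    "suspect_042": [0.90, 0.10, 0.40],
    "suspect_107": [0.20, 0.80, 0.50],
    "suspect_233": [0.85, 0.15, 0.35],
}

def most_similar(query):
    """Return the name whose stored vector is closest to the query vector."""
    return min(database, key=lambda name: euclidean(query, database[name]))

query = [0.86, 0.14, 0.36]  # feature vector of the photo we're looking up
print(most_similar(query))  # suspect_233
```

Scanning every stored vector like this works for small databases; at police-department scale you’d reach for an approximate nearest-neighbor index, but the principle is the same.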
Please, don’t answer that.