SecurLinx: Hyping Facial Recognition

Monday, March 26, 2012

Sometimes I let these go. It can be a time-consuming process to untangle the assumptions that lead to articles like the one below. But the article in question enjoyed a brief run on the Drudge Report over the weekend, so enough people saw it to make a clarification of the issues involved worth the effort.

Big Brother just got scarier: Japanese CCTV camera can scan 36 million faces per second - and recognise anyone who has walked into its gaze (UK Daily Mail)

One commenter asks "...why on earth would a camera need to scan 36 million people per second??? Would you have 36 million people somehow bunch up and face this magical camera?" ...which pretty thoroughly debunks the headline. The camera isn't really scanning anything, it's just a camera. It's software that is handling the video and images cropped from it. The real news here is (stop the presses): Applying a "face finder" to CCTV footage in real time will make later facial recognition faster.

A face finder is a program that identifies a certain group of digital pixels as a human face.

In a typical facial recognition transaction, the face finder then passes that cropped image to a facial recognition algorithm where it is turned into a template. The template is then compared to all the templates in the database. The database images are then ranked in order of the likelihood of a match and the facial recognition application presents the user with some number of results. Organizational policy takes over from there.
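The compare-and-rank step described above can be sketched in a few lines. This is only an illustration: the three-number templates, the names in the database, and the use of cosine similarity as the score are all assumptions made for the example. Real templates are much larger, and each vendor's scoring method is proprietary.

```python
import math

def cosine_similarity(a, b):
    """Similarity score between two templates (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(probe, database, top_n=3):
    """Compare the probe template to every enrolled template, best first."""
    scored = [(cosine_similarity(probe, tmpl), name)
              for name, tmpl in database.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_n]]

# Hypothetical enrolled templates (toy 3-number vectors).
database = {
    "person_a": [0.9, 0.1, 0.3],
    "person_b": [0.2, 0.8, 0.5],
    "person_c": [0.85, 0.15, 0.35],
}

# Hypothetical template built from the face cropped out of the video.
probe = [0.88, 0.12, 0.32]

results = rank_candidates(probe, database, top_n=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

Note that the output is a ranked list of candidates, not a yes-or-no answer. What happens with that list is the organizational policy question.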

So, technically, two challenges must be met: you have to find faces, and you have to match faces. Two hard numbers in the article point to these two challenges: 40x40 pixels and 36 million faces per second. 40x40 refers to the minimum face size the face finder can handle; 36 million per second refers to the speed of the matching algorithm. The 40x40 number is pretty firm. The 36 million number is a little different because it is a rate, not a database size. 36 million records per second could mean 3.6 million records in 0.1 seconds. It could mean 360,000 records in 0.01 seconds, etc.
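The arithmetic is simple enough to check. The sketch below takes the article's 36 million comparisons per second at face value and assumes search time scales linearly with database size, which is a simplification; the function name is just for illustration.

```python
MATCH_RATE = 36_000_000  # comparisons per second, as quoted in the article

def search_time_seconds(database_size, rate=MATCH_RATE):
    """Time to compare one face against every record, assuming linear scaling."""
    return database_size / rate

for size in (360_000, 3_600_000, 36_000_000):
    print(f"{size:>10,} records -> {search_time_seconds(size):.2f} s")
```

The same rate is consistent with all three database sizes, which is why the headline number says nothing about how many people the system can actually tell apart.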

Assuming the facial recognition matching algorithm can handle a database of 36 million records, a claim which is not explicitly made, what about the performance of a facial recognition system with a database of 36 million 40x40 pixel photos? Is 1600 pixels enough to distinguish among 36 million people? [Update: With only 1600 pixels to work with and 36 million people to identify, there's a high likelihood that many different people will look too much alike at that resolution to be told apart.]

Photo PRPhotos.com & SecurLinx

Here's a 40x40 gray-scale* photo of Penelope Cruz.

Why 40x40? The installed base of CCTV cameras is poorly suited to facial recognition. Facial recognition is what it says: the recognition of faces. It's not top-of-the-head recognition; it's not profile recognition; it's not back-of-the-head recognition. In general, CCTV cameras have been installed to observe and/or record what people are doing, not who they are. They have been deployed to answer the question, "what's going on?"

That's why the 40x40 pixel specification is important. CCTV cameras typically use such a wide angle that a person's face may only occupy a very small slice of the camera's field of vision, and when you zoom in on the face, you get a picture similar to the one above, in the best of cases. In more ordinary circumstances, few faces suitable for face recognition are captured at all.
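To put rough numbers on that, here is a back-of-the-envelope sketch using simple pinhole-camera geometry. Every input is an assumption chosen for illustration: a 640-pixel-wide image, a 90-degree horizontal field of view, and a face about 0.15 meters wide. None of these figures come from the article.

```python
import math

def face_width_pixels(image_width_px, horizontal_fov_deg, distance_m,
                      face_width_m=0.15):
    """Approximate width of a face in pixels at a given distance from the camera."""
    # Width of the scene covered by the camera at that distance.
    scene_width_m = 2 * distance_m * math.tan(math.radians(horizontal_fov_deg) / 2)
    return image_width_px * face_width_m / scene_width_m

def max_distance_m(image_width_px, horizontal_fov_deg, min_face_px,
                   face_width_m=0.15):
    """Farthest distance at which a face still spans min_face_px pixels."""
    half_fov_tan = math.tan(math.radians(horizontal_fov_deg) / 2)
    return image_width_px * face_width_m / (2 * min_face_px * half_fov_tan)

# Assumed camera: 640 pixels wide, 90-degree horizontal field of view.
print(f"Face at 5 m: {face_width_pixels(640, 90, 5):.1f} px wide")
print(f"Distance for a 40 px face: {max_distance_m(640, 90, 40):.2f} m")
```

Under those assumptions, a face five meters from the camera is under 10 pixels wide, and a person has to be a little over a meter away before the face reaches 40 pixels across.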

Unfortunately, this is the way these things go. A company makes technical claims based upon laboratory findings. Those claims are exported via the media into the real world with the assumption that they will work tomorrow (or next tax year) in the chaotic real world, at least as well as (or better than) they worked in the lab.

This isn't really good for the industry, customers or the public at large, though. If believed, it results in customers with too-high expectations, companies trying to reset those expectations, and a public with unrealistic hopes and fears about what these systems are capable of.

It is good to get people thinking about the kind of future they would like to build for themselves, but for people to do that effectively, they must be presented with an accurate picture of the world in which they currently live. They aren't getting that from articles like this.

*Post for another day: Why use gray-scale images for facial recognition?