This is how we lost control of our faces | MIT Technology Review
The largest ever study of facial-recognition data shows how much the rise of deep learning has fueled a loss of privacy.
February 5, 2021
In 1964, mathematician and computer scientist Woodrow Bledsoe first attempted the task of matching suspects’ faces to mugshots. He measured out the distances between different facial features in printed photographs and fed them into a computer program. His rudimentary successes would set off decades of research into teaching machines to recognize human faces.
Now a new study shows just how much this enterprise has eroded our privacy. It hasn’t just fueled an increasingly powerful tool of surveillance. The latest generation of deep-learning-based facial recognition has completely disrupted our norms of consent.
People were extremely cautious about collecting, documenting, and verifying face data in the early days, says Raji. “Now we don’t care anymore. All of that has been abandoned,” she says. “You just can’t keep track of a million faces. After a certain point, you can’t even pretend that you have control.”
A history of facial-recognition data
The researchers identified four major eras of facial recognition, each driven by an increasing desire to improve the technology. The first phase, which ran until the 1990s, was largely characterized by manually intensive and computationally slow methods.
But then, spurred by the realization that facial recognition could track and identify individuals more effectively than fingerprints, the US Department of Defense pumped $6.5 million into creating the first large-scale face data set. Over 15 photography sessions in three years, the project captured 14,126 images of 1,199 individuals. The Face Recognition Technology (FERET) database was released in 1996.
The following decade saw an uptick in academic and commercial facial-recognition research, and many more data sets were created. The vast majority were sourced through photo shoots like FERET’s and had full participant consent. Many also included meticulous metadata, Raji says, such as the age and ethnicity of subjects, or illumination information. But these early systems struggled in real-world settings, which drove researchers to seek larger and more diverse data sets.
In 2007, the release of the Labeled Faces in the Wild (LFW) data set opened the floodgates to data collection through web search. Researchers began downloading images directly from Google, Flickr, and Yahoo without concern for consent. LFW also relaxed standards around the inclusion of minors, using photos found with search terms like “baby,” “juvenile,” and “teen” to increase diversity. This process made it possible to create significantly larger data sets in a short time, but facial recognition still faced many of the same challenges as before. This pushed researchers to seek yet more methods and data to overcome the technology’s poor performance.
Then, in 2014, Facebook used its user photos to train a deep-learning model called DeepFace. While the company never released the data set, the system’s superhuman performance elevated deep learning to the de facto method for analyzing faces. This is when manual verification and labeling became nearly impossible as data sets grew to tens of millions of photos, says Raji. It’s also when really strange phenomena start appearing, like auto-generated labels that include offensive terminology.
Image-generation algorithms are regurgitating the same sexist, racist ideas that exist on the internet.
The way the data sets were used began to change around this time, too. Instead of trying to match individuals, new models began focusing more on classification. “Instead of saying, ‘Is this a photo of Karen? Yes or no,’ it turned into ‘Let’s predict Karen’s internal personality, or her ethnicity,’ and boxing people into these categories,” Raji says.
Amba Kak, the global policy director at AI Now, who did not participate in the research, says the paper offers a stark picture of how the biometrics industry has evolved. Deep learning may have rescued the technology from some of its struggles, but “that technological advance also has come at a cost,” she says. “It’s thrown up all these issues that we now are quite familiar with: consent, extraction, IP issues, privacy.”
Raji says her investigation into the data has made her gravely concerned about deep-learning-based facial recognition.
“It’s so much more dangerous,” she says. “The data requirement forces you to collect incredibly sensitive information about, at minimum, tens of thousands of people. It forces you to violate their privacy. That in itself is a basis of harm. And then we’re hoarding all this information that you can’t control to build something that likely will function in ways you can’t even predict. That’s really the nature of where we’re at.”