MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs • The Register
The dataset holds more than 79,300,000 images, scraped from Google Images, arranged in 75,000-odd categories. A smaller version, with 2.2 million images, could be searched and perused online from the website of MIT’s Computer Science and Artificial Intelligence Lab (CSAIL). This visualization, along with the full downloadable database, were removed on Monday from the CSAIL website after El Reg alerted the dataset’s creators to the work done by Prabhu and Birhane.
The key problem is that the dataset includes, for example, pictures of Black people and monkeys labeled with the N-word; women in bikinis, or holding their children, labeled whores; parts of the anatomy labeled with crude terms; and so on – needlessly linking everyday imagery to slurs and offensive language, and baking prejudice and bias into future AI models.
Screenshot from the MIT AI training dataset
A screenshot of the 2.2m dataset visualization before it was taken offline this week. It shows some of the dataset’s examples for the label ’whore’, which we’ve pixelated for legal and decency reasons. The images ranged from a headshot photo of woman and a mother holding her baby with Santa to porn actresses and a woman in a bikini ... Click to enlarge
Antonio Torralba, a professor of electrical engineering and computer science at CSAIL, said the lab wasn’t aware these offensive images and labels were present within the dataset at all. “It is clear that we should have manually screened them,” he told The Register. “For this, we sincerely apologize. Indeed, we have taken the dataset offline so that the offending images and categories can be removed.”
In a statement on its website, however, CSAIL said the dataset will be permanently pulled offline because the images were too small for manual inspection and filtering by hand. The lab also admitted it automatically obtained the images from the internet without checking whether any offensive pics or language were ingested into the library, and it urged people to delete their copies of the data:
“The dataset contains 53,464 different nouns, directly copied over from WordNet," Prof Torralba said referring to Princeton University’s database of English words grouped into related sets. “These were then used to automatically download images of the corresponding noun from internet search engines at the time, using the available filters at the time, to collect the 80 million images.”
WordNet was built in the mid-1980s at Princeton’s Cognitive Science Laboratory under George Armitage Miller, one of the founders of cognitive psychology. “Miller was obsessed with the relationships between words,” Prabhu told us. “The database essentially maps how words are associated with one another.”
For example, the words cat and dog are more closely related than cat and umbrella. Unfortunately, some of the nouns in WordNet are racist slang and insults. Now, decades later, with academics and developers using the database as a convenient silo of English words, those terms haunt modern machine learning.
“When you are building huge datasets, you need some sort of structure,” Birhane told El Reg. “That’s why WordNet is effective. It provides a way for computer-vision researchers to categorize and label their images. Why do that yourself when you could just use WordNet?”
WordNet may not be so harmful on its own, as a list of words, though when combined with images and AI algorithms, it can have upsetting consequences. “The very aim of that [WordNet] project was to map words that are close to each other,” said Birhane. "But when you begin associating images with those words, you are putting a photograph of a real actual person and associating them with harmful words that perpetuate stereotypes.”
The fraction of problematic images and labels in these giant datasets is small, and it’s easy to brush them off as anomalies. Yet this material can lead to real harm if they’re used to train machine-learning models that are used in the real world, Prabhu and Birhane argued.
“The absence of critical engagement with canonical datasets disproportionately negatively impacts women, racial and ethnic minorities, and vulnerable individuals and communities at the margins of society,” they wrote in their paper.