First he’d need data. While his dissertation work continued to run on the side, he set up 12 fake OkCupid accounts and wrote a Python script to manage them. The script would search his target demographic (heterosexual and bisexual women between the ages of 25 and 45), visit their pages, and scrape their profiles for every scrap of available information: ethnicity, height, smoker or nonsmoker, astrological sign—“all that crap,” he says.
To find the survey answers, he had to do a bit of extra sleuthing. OkCupid lets users see the responses of others, but only to questions they’ve answered themselves. McKinlay set up his bots to simply answer each question randomly—he wasn’t using the dummy profiles to attract any of the women, so the answers didn’t matter—then scooped the women’s answers into a database.
McKinlay watched with satisfaction as his bots purred along. Then, after about a thousand profiles were collected, he hit his first roadblock. OkCupid has a system in place to prevent exactly this kind of data harvesting: It can spot rapid-fire use easily. One by one, his bots started getting banned.
He would have to train them to act human.
He turned to his friend Sam Torrisi, a neuroscientist who’d recently taught McKinlay music theory in exchange for advanced math lessons. Torrisi was also on OkCupid, and he agreed to install spyware on his computer to monitor his use of the site. With the data in hand, McKinlay programmed his bots to simulate Torrisi’s click-rates and typing speed. He brought in a second computer from home and plugged it into the math department’s broadband line so it could run uninterrupted 24 hours a day.
After three weeks he’d harvested 6 million questions and answers from 20,000 women all over the country. McKinlay’s dissertation was relegated to a side project as he dove into the data. He was already sleeping in his cubicle most nights. Now he gave up his apartment entirely and moved into the dingy beige cell, laying a thin mattress across his desk when it was time to sleep.
For McKinlay’s plan to work, he’d have to find a pattern in the survey data—a way to roughly group the women according to their similarities. The breakthrough came when he coded up a modified Bell Labs algorithm called K-Modes. First used in 1998 to analyze diseased soybean crops, it takes categorical data and clumps it like the colored wax swimming in a Lava Lamp. With some fine-tuning he could adjust the viscosity of the results, thinning it into a slick or coagulating it into a single, solid glob.
He played with the dial and found a natural resting point where the 20,000 women clumped into seven statistically distinct clusters based on their questions and answers. “I was ecstatic,” he says. “That was the high point of June.”
He retasked his bots to gather another sample: 5,000 women in Los Angeles and San Francisco who’d logged on to OkCupid in the past month. Another pass through K-Modes confirmed that they clustered in a similar way. His statistical sampling had worked.