Part III · Chapter 15

Bigotry

Jacky Alciné's tweet, Joy Buolamwini's audits, Timnit Gebru, and the birth of fairness research. → An introduction to AI's bias reckoning — a beginning, not the last word.

“Google Photos, y’all fucked up. My friend’s not a gorilla.” — Jacky Alciné, on Twitter, June 29, 2015

The picture that broke it open was unremarkable. Jacky Alciné, a young software engineer in Brooklyn, had been at a concert with a friend, and afterward the photos uploaded themselves the way photos did by the summer of 2015, sucked silently into Google Photos and sorted into albums by a system that needed no instruction. Google had launched the app weeks earlier, in late May, and its headline trick was exactly this: it looked at your pictures and figured out what was in them. Beaches. Birthdays. Dogs. Skylines. You could type “snow” into the search box and the app would surface every snowy day you had ever photographed, even ones you had forgotten. It was the kind of feature that, a decade before, would have been science fiction, and it ran on the same convolutional networks that had won ImageNet and triggered the auctions and the hiring wars. This was the technology arriving in a normal person’s pocket.

On June 29, Alciné opened the app and saw that it had created an album. The album was labeled “Gorillas.” Inside were dozens of photographs of him and his friend, both of them Black. The system had looked at two human faces and filed them under a word that has served for centuries as one of the cruelest racist insults in the language. He took a screenshot and posted it. “Google Photos, y’all fucked up,” he wrote. “My friend’s not a gorilla.”

The tweet moved fast. By that evening it had been shared thousands of times, and it had reached the right person inside Google. Yonatan Zunger, the company’s chief social architect, replied to Alciné directly, within about an hour and a half of the original post. He did not deflect. “This is 100% Not OK,” Zunger wrote, and asked for access to the data so engineers could find the bug. Within a day, Google had apologized publicly and promised a fix. The speed of the response was, by the standards of a large technology company facing a public-relations problem, genuinely impressive.

The fix was not. When Google’s engineers went looking for a way to keep the system from ever again labeling a Black person a gorilla, the cleanest path they could find was to stop the system from labeling anything a gorilla. They removed the category. They also blocked “chimp,” “chimpanzee,” and “monkey.” For years afterward, Google Photos, a product built to recognize the visual world in extraordinary detail, was constitutionally incapable of finding an actual gorilla in your photographs of a zoo. Wired would confirm, in 2018, that the blunt patch was still in place. The most sophisticated image-recognition system available to consumers had a hole in it shaped exactly like the company’s embarrassment.

What the episode exposed was not that someone at Google was a racist. No engineer had sat down and taught the network that Black faces were apes. The failure was upstream of any intention, baked into the data the system learned from. A neural network trained to label photographs learns only what its training images contain. If the people in those images are overwhelmingly light-skinned, the network gets very good at light-skinned faces and remains clumsy with dark-skinned ones, the way a student who has only ever studied one kind of problem fumbles the first time the test changes. The model was not lying about what it saw. It simply had not been shown enough of the world to see it correctly, and no one inside Google had checked before shipping. The people most likely to be misrecognized were the people least represented in the training set, and they were also, not coincidentally, the people least represented in the rooms where the decisions were made.

That was the quiet scandal underneath the loud one. The systems being sold as superhuman were not failing at random. They were failing along the oldest fault lines, and they were failing the people who had the least power to complain. Proving it, rigorously, with numbers a company could not wave away, would take an outsider, and the outsider who did it was at that moment a graduate student a few hundred miles up the coast, sitting in front of a webcam that could not see her face.

Joy Buolamwini had come to the MIT Media Lab by way of Georgia, Oxford, and a Rhodes Scholarship, a computer scientist with a poet’s ear and an unusual willingness to make herself the subject of her own experiments. In 2015, working on a project she called the Aspire Mirror, she wanted to build something playful: a mirror that would project an inspiring face over your reflection. To do it she needed face-tracking software, the off-the-shelf kind that detects where a face is in a frame, the same commodity capability that underpinned everything from phone cameras to airport security. She pointed the camera at herself, and the software did not find her. She moved closer, adjusted the light, sat directly in front of the lens. Nothing. The little box that was supposed to snap onto a detected face stayed empty.

Then she picked up a plain white Halloween mask that happened to be on her desk and held it in front of her face. The software found the mask instantly. It drew its confident little box around a blank white plastic oval, registering it as a human face, while the actual human being holding it, a dark-skinned woman, remained invisible to the machine. Buolamwini was a Black woman literally putting on a white face to be seen by a computer. She has said she did not have to add the symbolism; the machine supplied it.

She might have shrugged it off as one cheap library’s bug, the way most people shrugged off such things. Instead she treated it as a research question, and the question she asked was the one Google had not bothered to ask itself: not whether these systems worked, but for whom they worked, and how badly they failed for everyone else. The honest way to answer it was to test the systems the way a scientist tests a drug, with a defined population and a measured outcome, rather than with anecdotes. So Buolamwini built a benchmark.

She assembled a dataset of 1,270 faces, drawn from parliamentarians in three African and three Nordic countries to get a wide spread of skin tones, and she labeled each face along two axes: gender and skin type, the latter using the Fitzpatrick scale that dermatologists use to classify skin. Then she ran three of the most prominent commercial facial-analysis systems on the market against it. The systems came from IBM, from Microsoft, and from Face++, a Chinese company whose technology was widely deployed across Asia. All three sold a feature that guessed a person’s gender from a photograph. All three were marketed as accurate. Buolamwini, working with a young Ethiopian-born computer scientist named Timnit Gebru, who had done her PhD at Stanford and was as comfortable with the statistics as with the politics, set out to measure exactly how accurate, and for whom.

The paper they published in early 2018, presented at the inaugural Conference on Fairness, Accountability, and Transparency, was titled Gender Shades, and its central finding was the kind of result that does not need rhetoric because the table does the arguing. On lighter-skinned men, all three systems were nearly flawless: even the worst of them erred on fewer than one face in a hundred. On darker-skinned women, the same systems collapsed. The error rates climbed to as high as 34.7 percent, more than one wrong guess in three. The gap between the group the systems served best and the group they served worst ran to more than thirty percentage points, on the same task, inside the same product. A technology being sold as a uniform capability was in fact two technologies: one that worked, for some people, and one that did not, for others, and the line between them was drawn by sex and skin.

Part of what made the result so hard to argue with was how carefully the two researchers had anticipated the objection that the systems were not really worse at dark skin, only worse at women, or worse at some confounded combination of lighting and pose. By scoring each system separately on four groups, lighter men, lighter women, darker men, and darker women, they could show that the failures stacked. The systems were somewhat worse on women than men and somewhat worse on darker skin than lighter, and at the intersection of the two the errors compounded into the headline gap. A person who was both a woman and dark-skinned was failed twice over, by a product that on a lighter-skinned man would have been essentially perfect. The math made an old idea legible in a new place: that harm concentrates where disadvantages overlap.
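The four-way scoring described above can be sketched in a few lines of Python. What follows is a minimal illustration, not the authors' code: the function names and the toy counts are invented for the example, and the numbers are chosen only to echo the shape of the published gap, not to reproduce the paper's table.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return the misclassification rate for each (skin, gender) subgroup.

    `records` is an iterable of (skin, true_gender, predicted_gender)
    tuples. This record layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for skin, true_gender, predicted in records:
        group = (skin, true_gender)
        totals[group] += 1
        if predicted != true_gender:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def worst_gap(rates):
    """Difference between the worst-served and best-served subgroup."""
    return max(rates.values()) - min(rates.values())


def make_toy_records():
    """Build invented data: 100 faces per subgroup with a fixed error count."""
    spec = {
        ("lighter", "male"): (100, 1),
        ("lighter", "female"): (100, 7),
        ("darker", "male"): (100, 12),
        ("darker", "female"): (100, 35),
    }
    records = []
    for (skin, gender), (count, wrong) in spec.items():
        other = "female" if gender == "male" else "male"
        for i in range(count):
            # The first `wrong` faces in each subgroup are misclassified.
            predicted = other if i < wrong else gender
            records.append((skin, gender, predicted))
    return records


records = make_toy_records()
rates = error_rates_by_group(records)

# The single aggregate number a vendor might quote.
overall = sum(1 for _, g, p in records if p != g) / len(records)

if __name__ == "__main__":
    print(f"overall error rate: {overall:.1%}")
    for group, rate in sorted(rates.items()):
        print(f"{group[0]:>8} {group[1]:<6} {rate:.1%}")
    print(f"worst-to-best gap: {worst_gap(rates):.1%}")
```

On this toy data the aggregate error rate comes to just under 14 percent, a figure that might look tolerable on a product sheet. Split four ways, the same data shows one group failing at 1 percent and another at 35. The audit's design choice was exactly this: refuse the single average and report each intersection on its own.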

The companies’ first instinct was the institutional one. IBM and Microsoft were given the results before publication, as scientific courtesy required, and the early responses had the defensive texture of any large organization told its flagship product is broken. But Buolamwini and Gebru had done the work properly, with a public benchmark and a reproducible method, and there was no comfortable way to dismiss a number you could check yourself. Within months IBM had retrained its system and published its own improved figures; Microsoft did the same. The audit had done what audits are supposed to do and almost never get to: it had forced a correction by being undeniable.

A year later Buolamwini and a colleague, Deborah Raji, ran the test again, this time adding Amazon’s Rekognition to the lineup, the facial system Amazon was actively selling to police departments. Amazon’s product showed the same pattern the others had, the same cliff between light-skinned men and dark-skinned women. Amazon’s response was different from IBM’s and Microsoft’s. Rather than fix the product, the company attacked the study, disputing the methodology in public statements and a blog post, insisting the test did not reflect how the technology was used in the field. The fight was uglier than the science, which is often how it goes when the science is correct and the implications are expensive.

What Buolamwini and Gebru had built was bigger than three error rates. They had demonstrated a method, and a movement formed around it. Buolamwini had already founded the Algorithmic Justice League while at the Media Lab, an organization devoted to auditing deployed systems and dragging their failures into public view, and Gender Shades became its proof of concept. She became one of the most visible voices arguing that the burden of proof had been backward: that companies deploying these systems on the public should have to show they worked across the public, rather than leaving it to the people harmed to discover otherwise. She testified before Congress, narrated her findings in a spoken-word piece she called “AI, Ain’t I a Woman?” after Sojourner Truth, and turned the dry vocabulary of false-positive rates into something a senator or a city council could grasp. Her argument was not that facial recognition should be made fairer and then deployed everywhere. For some uses, police surveillance among them, she argued that fairer was not the same as safe, and that a system which identified everyone equally well might simply mean everyone was equally surveilled. The conference where Gender Shades appeared, known first as FAT* and later as FAccT, became the institutional home of a new subfield, where computer scientists, lawyers, sociologists, and ethicists argued about what fairness in an algorithm even meant, and whether a single mathematical definition of it could exist. Inside the big labs, “AI ethics” and “responsible AI” teams appeared on org charts where two years earlier there had been nothing, hiring exactly the kind of researcher who had previously had to do this work from the outside.

The deeper point, the one the field would spend years circling, was that “bias” in these systems was not a bug in the ordinary sense. A bug is a place where the code does something other than what was intended. These systems did precisely what they were built to do: they learned the statistical contours of the data they were fed and reproduced them faithfully. The problem was the data, and the data was the world, captured unevenly. Photographs on the internet skewed white because the institutions that had been photographing people for a century skewed white. Image datasets scraped from the web inherited that skew, and the networks inherited it from the datasets, and the products inherited it from the networks, until a Black man in Brooklyn opened an app and found himself filed under “Gorillas.” Each layer had behaved exactly as designed. The discrimination was not injected anywhere; it was laundered through a pipeline that everyone could see and no one had thought to question, because questioning it required imagining that the system might treat you, specifically, as less than fully visible.

This was also a story about who got to ask the questions. It took an outsider to surface the Google Photos failure, and graduate students rather than the companies to quantify the facial-recognition gap. The labs employed plenty of careful people. But the people who experienced the failures most acutely were not the ones building the systems or signing off on them, and the people who were building and signing off rarely experienced the failures at all. A team of mostly light-skinned engineers testing a face detector on themselves would conclude, accurately, that it worked great. The blind spot in the data was mirrored by a blind spot in the room, and the second was harder to fix than the first. You could retrain a model in a few weeks. Changing who held the pen took longer, and the field was about to learn how much longer.

Timnit Gebru, who had co-authored the audit that forced two of the largest technology companies on earth to admit their products were broken, went on to co-lead an ethical-AI team at Google, recruited precisely because she was the kind of researcher whose work had teeth. She would last there a little over two years. Late in 2020, at the very edge of the period this book has so far described, a dispute over a research paper she had co-authored, one that questioned the risks of the enormous language models Google was beginning to build, would end with her departure under circumstances both she and the company described in flatly contradictory terms, and it would become one of the defining controversies of the field. Whether she resigned or was fired, whether the science or the politics was the real issue, would be argued over for years. But that fight belongs to a later chapter and a later, larger machine. What mattered in the summer of 2015 and the winter of 2018 was simpler and more durable: that the work had been done at all, by people the field had not been built to listen to, and that once the numbers were on the table they could not be taken back off it.

The reckoning that began with a tweet and a white mask was an opening, not a verdict. The audits proved the systems could be measured; they did not settle what anyone owed when the measurement came back bad. The fairness researchers had shown the labs a kind of harm that emerged from products doing their jobs too literally, and they had forced a few corrections through sheer rigor. The harder question was the one that came when a system worked exactly as intended and the intent itself was the problem. The labs had been slow to police the harms their products caused by accident. They were about to face a harder test, over a product that would cause harm entirely on purpose, and that fight would not be carried by outsiders at all. It would come from inside.