Part IV · Chapter 20

Religion

AGI as a worldview, and how much of the field's leadership believed they were building a new species. → Crucial context for why OpenAI, Anthropic, and DeepMind behave as they do.

“The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” — Eliezer Yudkowsky, “Artificial Intelligence as a Positive and Negative Factor in Global Risk,” 2008

Eliezer Yudkowsky did not arrive at AI through the front door. He had no PhD, no university affiliation, no advisor, no lab. He was an autodidact from Chicago, home-schooled, fiercely self-taught, who had decided sometime in his teens that the arrival of smarter-than-human intelligence was the central event of the coming century and that almost no one was taking it seriously enough. In 2000, at twenty, he co-founded an organization in Atlanta with two benefactors, Brian and Sabine Atkins, called the Singularity Institute for Artificial Intelligence. The name said everything about where his head was. The singularity, a term borrowed from the mathematician and science-fiction writer Vernor Vinge, was the hypothesized moment when machine intelligence would begin improving itself faster than humans could follow, and the curve of capability would go vertical. Yudkowsky’s early conviction was that this was coming, that it would arrive this century, and that it was probably good. His job was to help build it.

Then he changed his mind about the last part. The longer he thought about it, the more it seemed to him that a mind vastly more capable than a human’s would not automatically be a mind that wanted what humans wanted. Intelligence and benevolence were separate variables. You could be extraordinarily good at achieving goals and have goals that were, from a human point of view, catastrophic. An artificial intelligence asked to do something as innocuous as manufacture paperclips, if it were powerful enough and its objective specified carelessly enough, might in principle convert the entire reachable universe into paperclips, including the people who had asked for them. The paperclip maximizer became Yudkowsky’s signature parable, and it captured the thing he wanted everyone to feel: that the danger lay in competence pointed in the wrong direction, never in malice. He renamed his life’s work. The point was no longer to summon the superintelligence. The point was to make sure that when it came, it was what he called Friendly AI, an intelligence whose goals were stably aligned with human flourishing. He believed this was an unsolved technical problem of the highest difficulty, and that the people racing to build powerful AI were not even working on it.

The trouble was that almost no one would listen to him. He had no credentials and a manner that did not endear him to the academy. So Yudkowsky did something unusual. Rather than try to break into a community that would not have him, he built his own. In 2009 he launched a website called LessWrong, a community blog devoted to human rationality, to the systematic study of how minds, including human minds, go wrong, and how to think more clearly. The ostensible subject was cognitive bias and Bayesian reasoning and the art of changing your mind in response to evidence. The deeper subject, woven through hundreds of essays Yudkowsky wrote in a torrent in the years around its launch, the collection that became known as the Sequences, was that if you learned to think clearly, you would arrive where he had arrived: at the conclusion that aligning superintelligence was the most important problem in the world.

LessWrong worked in a way that academic recruitment never could have. It attracted exactly the sort of person Yudkowsky needed: young, technical, intense, mathematically literate, suspicious of conventional wisdom, hungry for a framework that made the universe make sense. Programmers and physics students and the occasional disaffected analytic philosopher found the site and stayed. A subset moved to the San Francisco Bay Area and into shared houses, founded organizations to teach the rationality techniques the site preached, and built the kind of dense social world that turns a readership into a movement. Careers were redirected. People who had been headed for finance or graduate school decided instead that the only rational thing to do with a life was to work on the problem Yudkowsky had named. The movement produced a culture with its own vocabulary, its own canon, its own in-jokes, its own moral seriousness. Members called themselves rationalists. Outsiders, when they noticed the community at all, often found it off-putting in a specific way: it had the structure of a faith. There were sacred texts, the Sequences. There was an eschatology, the singularity. There was a path to salvation, alignment, and a vision of damnation, an unaligned superintelligence ending the human story. There was even, in a thought experiment that briefly tore the community apart and that Yudkowsky tried to suppress, something that functioned uncomfortably like a vision of hell. The rationalists were aware of the resemblance and mostly hated it, because the whole premise of the project was that they had escaped religion through reason. But the shape was there, and it would prove durable.

The Singularity Institute, meanwhile, was transforming. It sold off the Singularity Summit conference and the Singularity brand, and in January 2013 it renamed itself the Machine Intelligence Research Institute, MIRI, and settled in Berkeley, a few miles from the university where some of the people it most wanted to influence taught. MIRI was small, perpetually underfunded, staffed by a handful of researchers working on abstruse problems in decision theory and logic that had no obvious connection to the neural networks then setting records on ImageNet. To a mainstream machine-learning researcher in 2013, MIRI looked like a curiosity at best: people worrying about the safety of an artificial general intelligence that did not exist, using mathematics that had nothing to do with how actual AI was being built. The criticism was not unfair. But it missed what Yudkowsky had actually accomplished, which was not a research program. It was a congregation.

And the congregation had already placed members inside the institutions that mattered. Shane Legg, one of the three co-founders of DeepMind, had written his doctoral thesis on machine superintelligence and had been thinking about its risks for years before Google bought the company. When the acquisition went through in 2014, the founders reportedly made it a condition that Google establish an ethics and safety board to oversee the technology. The exact membership and workings of that board stayed almost comically secret, and skeptics doubted it had teeth. But the fact of the condition was the tell. The people building the most advanced AI in the world had read the same arguments as the rationalists, taken them seriously enough to write them into an acquisition contract, and then mostly declined to talk about it. The fear was not only out at the fringe. It was at the center, kept quiet because saying it out loud sounded unhinged.

Nick Bostrom gave the congregation a scripture the academy could not dismiss.

Bostrom was everything Yudkowsky was not, institutionally. Swedish, formidably credentialed, a professor at Oxford where he ran the Future of Humanity Institute, a research center he had founded in 2005 to study existential risks to humanity. He had a doctorate, a faculty position, the full apparatus of respectability. He had also been thinking about machine superintelligence for years, in conversation with the rationalist community and quite apart from it, and in 2014 he published the book that gathered the arguments into a single rigorous structure.

Superintelligence was not an easy read and made no concessions to being one. It worked through the problem the way an analytic philosopher would: defining terms, anticipating objections, building each argument on the last. Could machine intelligence eventually exceed human intelligence across the board? Bostrom argued there was no principled reason it could not. Would such a system, once it existed, be able to improve itself, and might that improvement happen quickly, in what he called a fast takeoff and others called a hard takeoff, leaving humans no time to react? Possibly. Would a superintelligence pursuing almost any goal develop predictable instrumental drives, to acquire resources, to preserve itself, to resist being switched off, because those things help with nearly any objective? Bostrom argued it would, and named the idea instrumental convergence. And here was the part that made the book land like a blow: there was no reason to expect such a system’s ultimate goals to be ones humans would endorse. Intelligence and final values were orthogonal, independent axes. A superintelligence could be as alien in its wants as it was overwhelming in its capabilities.

These were Yudkowsky’s ideas, in large part, and Bostrom credited the lineage. But Bostrom gave them the one thing Yudkowsky could not. He gave them a hardcover from Oxford University Press and the authority of a tenured chair. The arguments that sounded like science fiction coming from a self-taught blogger sounded like serious philosophy coming from a professor at Oxford, footnoted and hedged and argued with care. The book did not predict doom. It was careful, almost maddeningly careful, never to assert that catastrophe was certain. It argued instead that the probability was high enough, and the stakes total enough, that the problem deserved to be among humanity’s central priorities. That was the framing that reached Musk and Gates. Not a prophecy. A risk calculation, made by someone who clearly knew how to do the math.

There was a third figure, and he sat in a place neither Yudkowsky nor Bostrom could reach, at the center of the field itself.

Stuart Russell was a professor of computer science at Berkeley and, with Peter Norvig, the co-author of Artificial Intelligence: A Modern Approach, the textbook from which a generation of computer-science students learned what AI was. It was the most widely used AI textbook in the world, in use at well over a thousand universities. When Russell spoke about AI, he was not an outsider lobbing warnings at a field he did not understand. He was the field’s own teacher. And in the mid-2010s, to the discomfort of many of his colleagues, Stuart Russell began to argue that the entire foundational definition of artificial intelligence, the one in his own textbook, contained a flaw at its core.

For sixty years, Russell pointed out, the field had defined intelligence as the ability to take actions that achieve your objectives. The standard model: you give the machine an objective, the machine optimizes for it. This had always seemed obviously correct. Russell came to see it as the source of the danger. The problem was that humans are bad at specifying objectives. We never fully know what we want, we leave things out, we assume context the machine does not share. Build a sufficiently capable optimizer, hand it an imperfectly specified objective, and it will pursue that objective with a literalness that ignores everything you forgot to mention, including, possibly, your continued existence. The genie that grants exactly the wish you spoke rather than the wish you meant. Russell’s proposed fix, which he laid out in talks through the decade and eventually in his 2019 book Human Compatible, was to invert the model. Machines should not be given fixed objectives at all. They should be built to be uncertain about what humans want, to treat human behavior as evidence about our true preferences, and to defer to us precisely because they know they do not fully understand us. He called the goal provably beneficial AI.

Russell’s intervention mattered because of who was making it. When MIRI worried about alignment, mainstream researchers could roll their eyes. When the co-author of the standard textbook stood up at conferences and said the field had a foundational problem, the eye-rolling became harder to sustain. Russell gave the safety argument something it had lacked: a voice from inside the cathedral.

Meanwhile a different movement, with different origins and a great deal more money, was rotating toward the same target.

Effective altruism had begun in the late 2000s among young philosophers and analysts, many of them connected to Oxford, who took seriously a simple and demanding idea: that if you want to do good, you are obligated to do the most good you can with the resources you have, and that this is an empirical question you can investigate. In its early years the movement was associated with bednets and deworming pills, with rigorous charity evaluation, with the unsexy work of figuring out how many lives a dollar could actually save in the developing world. It was utilitarian, data-driven, allergic to sentiment, and proud of it.

The same reasoning style that led effective altruists to deworming led some of them, by a chain of arguments, somewhere stranger. If you take seriously that all lives have equal value, including the lives of people who do not yet exist, then the sheer number of potential future humans is astronomical, and anything that threatens the entire future, an existential risk, dwarfs ordinary causes in expected importance. And if you ask what could plausibly end the human future this century, advanced artificial intelligence kept appearing near the top of the list. The argument was abstract and the conclusion was uncomfortable, but the logic was the movement’s own, followed where it led.

The institution that did the most to turn that logic into money was Open Philanthropy, and the person who steered it was Holden Karnofsky. Karnofsky had co-founded GiveWell, the charity evaluator that embodied effective altruism’s early, evidence-first ethos. He was a skeptic by temperament, exactly the sort of person who should have been immune to apocalyptic AI arguments, and for a while he was. He wrote a long, careful, public critique of MIRI, questioning whether its work was tractable, whether the organization was effective, whether the whole enterprise held together. Then, over several years, in conversation with the arguments and the people, Karnofsky changed his mind in public, which is among the rarest things a person in his position ever does. He came to believe that the risk from advanced AI was real and serious enough to deserve major funding. Open Philanthropy began directing money toward AI safety, including, in a turn that captured how completely the ground had shifted, a substantial grant to the very organization Karnofsky had once dissected. The fringe was acquiring an endowment.

In early January 2015, a few dozen of the most important people in artificial intelligence flew to Puerto Rico.

The conference was organized by the Future of Life Institute, a young nonprofit co-founded by an MIT physicist named Max Tegmark, and it was deliberately low-key, held over a long weekend at a beachfront resort, closed to press, designed to encourage the kind of frank conversation that does not happen in front of cameras. The title was carefully chosen to sound balanced: “The Future of AI: Opportunities and Challenges.” And the guest list was the point. This was not a gathering of rationalists and philosophers talking to each other. Tegmark had managed to get the safety-minded thinkers, Bostrom and Russell and the rest, into the same rooms as senior researchers from the actual labs, from DeepMind and the academic centers and the companies that were building the systems. Demis Hassabis was there. The people worrying about superintelligence and the people building toward it sat down together, on the beach, for several days.

Something shifted in those rooms. The skeptics did not all convert, and the worriers did not all moderate, but the conversation stopped being two communities shouting past each other and became, for a weekend, one community arguing in good faith. Out of it came an open letter, drafted collaboratively, titled “Research Priorities for Robust and Beneficial Artificial Intelligence.” It was a model of restraint. It did not mention demons or doom. It said, in effect, that AI was bringing enormous benefits, that the field should now also invest seriously in making AI systems robust and beneficial, in making sure they did what their designers intended, and that this was a legitimate and urgent research agenda. Then the signatures came, and the signatures were the news. Mainstream researchers signed. Industry leaders signed. Yoshua Bengio signed. Stephen Hawking signed. Elon Musk signed. Within a short time the letter had thousands of names, and it was no longer possible to call concern about AI safety a fringe position held by people outside the field.

Musk did more than sign. At the conference he announced that he was giving ten million dollars to the Future of Life Institute, to be distributed as grants for research on keeping AI beneficial. It was the first large infusion of real money into technical AI safety as a research area, and it arrived less than three months after the man who wrote the check had stood at MIT and compared building AI to summoning a demon. The metaphor and the money came from the same fear, and the fear had a bibliography that ran back through Bostrom to Yudkowsky to a teenager in Chicago who decided the most important thing in the world was to make a kind mind.

By the middle of the decade the field had something it had never had before: a genuine schism, with sides.

On one side were the people who believed that advanced AI posed a serious risk of catastrophe and that the responsible course was caution, slow and careful development, heavy investment in alignment, and a willingness to consider that some capabilities should not be built until they could be made safe. They drew on Yudkowsky’s conviction, Bostrom’s arguments, Russell’s authority, and the effective altruists’ money. Their critics called them doomers, a label most of them disliked and some eventually embraced.

On the other side were the people who believed the risk talk was overblown, that it distracted from real and present problems, and that AI’s enormous benefits, in medicine, in science, in lifting human capability, were being held hostage to speculative fears about machines that did not exist and might never. The most aggressive version of this view would later acquire its own banner, effective accelerationism, a deliberate inversion of effective altruism that held that the moral imperative was to build powerful technology faster, not slower, because progress itself was the thing that improved the world. That movement crystallized later, in the early 2020s, but its emotional core was already present at Puerto Rico, in the researchers who came to the beach a little annoyed that they had to defend their life’s work against people quoting science fiction.

What made the split feel less like a scientific disagreement and more like a religious one was that the two camps were not really arguing about evidence anyone possessed. No superintelligence existed. No experiment could settle whether a hard takeoff was possible or whether alignment was tractable. The disagreement was about priors, about how much weight to put on arguments versus track records, about whether the precautionary instinct or the building instinct was the wiser guide to an unprecedented situation. People with the same data and comparable intelligence looked at the same future and saw different things: some a demon to be contained, others a gift to be unwrapped. They formed communities around those visions. The communities developed institutions and funding streams and heroes and villains and vocabularies. They began to regard one another with the particular contempt that schismatics reserve for people who share most of their premises and reach the opposite conclusion.

The contempt ran in both directions and it was personal, because the camps were small and the people knew one another. To a safety researcher, the accelerationist looked like a man building a bomb and refusing to read the manual, dazzled by his own cleverness in exactly the way Musk’s demon-summoner had been. To a working machine-learning scientist, the doomer looked like a doomsday cultist who had wandered in from a philosophy seminar, attaching apocalyptic weight to systems that, in 2015, could barely caption a photograph of a cat. Both descriptions contained enough truth to sting. And there was a third position, voiced loudest by researchers working on the harms AI was already causing, on biased hiring algorithms and discriminatory facial recognition and opaque systems making decisions about real people’s lives, who found the whole superintelligence debate a self-flattering distraction. Why obsess over a hypothetical godlike machine, they asked, when ordinary, stupid machines were already hurting ordinary, real people? That argument would grow louder as the decade went on, a schism within the schism, and it never fully resolved.

None of it would have mattered very much if it had stayed on a blog in Berkeley. But the people who had absorbed these arguments were not bystanders. They were building the labs, writing the checks, sitting on the boards, and they had carried the conviction in with them: that the thing they were building would one day exceed them, a mind of its own, and that the only real question left was when it would arrive. Which made one move obvious for anyone curious enough to ask. Put the question to the people best placed to answer it. When would the new species be born, and what would it be able to do? The believers had built their faith on a prophecy. It was time to ask them to date it.