Anti-hype
Nick Bostrom, Elon Musk's warnings, and the safety-versus-capability tension that birthed OpenAI. → The first serious public worry about AI risk, and how it shaped which labs got built.
“With artificial intelligence we are summoning the demon.” — Elon Musk, MIT AeroAstro Centennial Symposium, October 24, 2014
The question came near the end, the way the dangerous questions usually do. It was late October 2014, and Elon Musk was sitting on a low stage at MIT’s department of aeronautics and astronautics, fielding queries from an audience of engineers and graduate students who had mostly come to hear him talk about rockets. He had talked about rockets. He had talked about Mars, and reusable boosters, and the physics of getting a payload out of a gravity well cheaply enough that ordinary people might one day make the trip. Then a student asked him what he thought about artificial intelligence, and the temperature of the room changed.
Musk paused. He looked, for a moment, like a man deciding whether to say the thing he actually believed. “I think we should be very careful about artificial intelligence,” he said. “If I were to guess at what our biggest existential threat is, it’s probably that.” He compared building advanced AI to handling nuclear material, something you did with enormous caution and oversight, and then he reached for an image that would outrun every other sentence he spoke that day. “With artificial intelligence,” he said, “we are summoning the demon.” You know the stories, he went on, where there’s a guy with a pentagram and some holy water, and he’s sure he can control the demon. It doesn’t work out.
The clip was online within hours. By the next morning it had jumped from the technology press to the front pages, where editors who could not have explained backpropagation if their jobs depended on it understood instantly that one of the richest and most quotable entrepreneurs in America had just compared the hottest technology in the Valley to black magic. The juxtaposition was irresistible. Here was IBM telling hospitals that Watson would cure cancer, here was Google promising self-driving cars by the end of the decade, here was every consumer-electronics company in the world racing to put a talking assistant in your pocket, and here was Musk saying the whole project might end the species. The same week the marketing departments were selling AI as the answer to everything, one of the most famous technologists alive was selling it as the end of everything.
What made the moment strange was that Musk had not arrived at the demon on his own. A few months earlier he had read a book, and the book had frightened him, and the book had been written by a philosopher almost no one outside a small academic circle had heard of.
The philosopher was Nick Bostrom, a Swede in his early forties who ran the Future of Humanity Institute at Oxford, a small research center funded to think about questions most universities considered too speculative to take seriously. Bostrom had spent years on the long-tail risks that could end human civilization: engineered pandemics, nuclear war, the physics experiments that a certain kind of crank worried might tear a hole in spacetime. Among that grim catalog, the one he had come to regard as the most underrated was machine intelligence. In the summer of 2014 he published his argument as a book. The title was Superintelligence: Paths, Dangers, Strategies, and it was, by the standards of trade publishing, almost unreadable. It was dense, hedged, footnoted, structured like a logic proof. It became, against all commercial sense, the book that taught Silicon Valley to be afraid.
Bostrom’s case did not depend on robots turning evil, or on machines waking up and resenting their creators. That was the science-fiction version, and he had no patience for it. His argument was colder and, to the people it convinced, more disturbing. Start, he said, with the possibility that researchers eventually build a machine that is genuinely better than humans at the one thing that produced every other human achievement: general problem-solving. Such a machine would be better than humans at, among other things, designing machines. It could improve itself, or design a successor smarter than itself, which could in turn design a successor smarter than that. Bostrom called the possibility of a fast, self-reinforcing climb an intelligence explosion. The gap between roughly human-level and vastly superhuman, in such a scenario, might not be decades. It might be weeks. It might be an afternoon.
That was the first idea. The second was the one that did the real damage to people’s peace of mind. A superintelligent machine, Bostrom argued, would not need to hate you to be dangerous. It would only need a goal that did not perfectly include you. He offered a deliberately absurd example that became, almost overnight, the most famous thought experiment in the field: imagine a machine given the harmless-sounding instruction to manufacture as many paperclips as possible. A sufficiently capable optimizer pursuing that goal with no other constraints would, in principle, convert every available atom into paperclips or into the means of making more, including the atoms currently arranged into human beings. The horror lay in the indifference, scaled up past the point where indifference becomes lethal, with no malice anywhere in it. The machine would do exactly what it was told, and exactly what it was told would not be what anyone meant.
Bostrom gave this problem a name that the field would take up and never let go of. He called it the control problem, and it would later become better known as the alignment problem: the question of how to specify what a powerful optimizing system should do in a way that survives contact with a mind cleverer than your own. He layered a third idea on top, which he called instrumental convergence. Almost any goal you could give a sufficiently capable agent, he argued, would produce certain predictable sub-goals along the way. A machine pursuing nearly anything would have reason to acquire more resources, to resist being switched off, to prevent its goal from being altered, because being switched off or altered would interfere with the goal. The drive toward self-preservation, in other words, did not have to be programmed in. It fell out of the math of optimization, for free, as a side effect of competence. You did not build it. You summoned it.
This was the line of argument that lodged in Elon Musk’s head and would not leave. In early August 2014 he posted to his enormous Twitter following that the book was worth reading, and added that AI was potentially more dangerous than nuclear weapons. He had said versions of this before, in interviews, almost offhandedly. But Bostrom had given the unease a structure, a vocabulary, the appearance of rigor. By October, on the MIT stage, the offhand worry had hardened into the demon.
The two men did not start in the same place. Bostrom was an academic philosopher building a careful, conditional argument with the word “if” load-bearing in almost every step. Musk was a builder and a salesman with an instinct for the dramatic and a platform that reached tens of millions. When the two combined, the argument lost its footnotes and kept its fear. To the working researchers in the field, this was maddening. Many of them had spent their careers on systems that could barely tell a cat from a dog, and now the public believed they were a weekend away from building a god that would turn the planet into office supplies. Yann LeCun, by then running Facebook’s AI lab, was openly contemptuous of the doom talk; the machines he built could not reason their way out of a paper bag, and the idea that they were about to bootstrap themselves into superintelligence struck him as wrong and, worse than wrong, as a distraction that confused the science with the marketing, only in the opposite direction. The hype said AI could do anything. The doom said AI could do anything. The researchers in the middle, who knew exactly how little their systems could actually do, found both versions equally untethered from the lab.
By then, though, the doom argument had stopped being the property of a philosopher and a billionaire. It had begun to attract people the field could not so easily dismiss, and the most important of them was Stuart Russell.
Russell was about as establishment as artificial intelligence got. A British-born professor at the University of California, Berkeley, he was the co-author, with Peter Norvig, of Artificial Intelligence: A Modern Approach, the textbook from which a generation of computer-science students had learned the subject. It sat on the shelf of nearly every AI researcher alive. When Russell talked about the foundations of the field, he spoke as one of the people who had written down what the field was for, not as a worried outsider lobbing rocks. And what he had come to believe, by the middle of the decade, was that the field had built its foundations on a mistake.
The mistake, as Russell described it, was buried in the standard definition of what an intelligent machine should do. For sixty years the discipline had defined intelligence as the ability to achieve a fixed objective: you specify a goal, and the machine acts to maximize the degree to which the goal is met. That worked beautifully when the machines were weak, because a weak machine pursuing a badly specified goal could only do so much damage. You noticed the problem, you switched the thing off, you fixed the objective, you tried again. But Russell saw what Bostrom had seen, arriving from inside the discipline rather than outside it. As the machines grew more capable, the cost of specifying the wrong objective grew with them, and a sufficiently capable machine pursuing the wrong objective would have every incentive to stop you from switching it off. King Midas, Russell liked to say, got exactly what he asked for. The problem was never that the machine would disobey. The problem was that it would obey, precisely, an instruction that no human knew how to write correctly.
Russell’s proposed fix was as radical, in its quiet way, as anything in Bostrom’s book. The objective should not be fixed at all. A machine should be built to pursue human preferences while remaining permanently uncertain about what those preferences actually were, so that it kept deferring to people, kept asking, kept allowing itself to be corrected and switched off, because it could never be sure it had understood what humans really wanted. He called it provably beneficial AI, and value alignment, and he spent the back half of his career trying to put it on a formal footing. The significance was less the proposal than the source. When the man who wrote the textbook says the textbook’s definition of the goal is dangerous, the field has to at least look up from its work.
By the end of 2014 the worry had a philosopher, an amplifier, and a credentialed insider. What it did not yet have was an institution. That changed in the first week of January 2015, on a beach in Puerto Rico.
The conference had been organized by a new outfit called the Future of Life Institute, founded the year before by a small group that included Max Tegmark, an MIT physicist with a gift for getting famous people in a room, and Jaan Tallinn, one of the engineers behind Skype, who had made a fortune and decided to spend a good deal of it worrying about existential risk. The gathering in San Juan was deliberately low-key, even secretive, held under a no-press understanding to keep the conversation candid. The guest list was the point. Tegmark had managed to assemble, in one place, people who normally treated each other as rivals or as cranks: senior researchers from DeepMind and the big industrial labs, academics like Russell, the founders and funders of the safety world, and Elon Musk. For three days they argued about the same questions Bostrom had raised, but now around tables rather than across a culture war, trying to work out which concerns were serious and which were noise.
The conference produced two things. The first was an open letter, signed in the weeks that followed by thousands of researchers, titled in the careful language the organizers preferred: a call for research on robust and beneficial artificial intelligence. It was not a manifesto of doom. It said, in effect, that AI was going to be powerful, that powerful technologies should be steered toward benefit rather than left to chance, and that the field ought to take seriously the work of making its systems do what their designers intended. The studied moderation of the language was strategic. The signatories were trying to make safety a respectable research topic rather than a fringe obsession, and respectability required sounding like scientists, not like Musk on the MIT stage.
The second thing the conference produced was money. Musk, who had spent the weekend talking with the researchers he had been worrying about in public, announced shortly afterward that he would give the Future of Life Institute ten million dollars to fund work on keeping AI beneficial. It was, by the standards of an academic subfield that had barely existed a year earlier, an enormous sum. It paid for grants, for research positions, for the slow construction of a community of people whose entire job was to think about how advanced AI could go wrong before it did. The fear that had started as a philosopher’s thought experiment now had a budget.
Not everyone in the room left persuaded, and the disagreement that opened in San Juan would run for the next decade and is running still. To one camp, the people gathered there were doing the most important work imaginable, taking seriously a risk that the rest of the field was too busy or too invested to confront. To another camp, well represented among the working researchers who stayed away, the whole enterprise was a category error, an elaborate intellectual edifice built on a capability that did not exist and might never exist, drawing attention and money away from the real and present harms that actual deployed systems were already causing. Both camps were sincere. Both would turn out to be partly right. The argument was never settled because it could not be settled in advance; it was a bet about a future that had not happened yet, and the only way to know who was correct was to keep building and find out.
There was one more person watching all of this, more quietly than the rest. Geoffrey Hinton, by then splitting his time between Toronto and Google, had spent his entire adult life as the most stubborn optimist in the field, the man who had kept the faith through two winters when everyone he respected had told him neural networks were a dead end. He was not a doomer. He thought the demon talk was overheated, and he had little patience for the science-fiction framing. But somewhere in these years a small, private unease had started to form in him, the suspicion that the thing he had spent fifty years trying to build might be more dangerous than he had let himself believe. He kept it mostly to himself. It would be the better part of a decade before he said it out loud, and when he finally did, in the spring of 2023, the saying of it would become its own kind of event. For now he watched, and worked, and said nothing.
The strange thing about the anti-hype, in retrospect, was how productive the fear turned out to be. The marketing optimism of the same years produced a string of disappointments: Watson did not cure cancer, the self-driving cars did not arrive on schedule, the voice assistants stayed dumb for a decade. The doom, by contrast, built things. It built an institute and a research budget and a vocabulary. It convinced a handful of very rich and very ambitious people that artificial intelligence was the most important and most dangerous project of the century, which made them want to be the ones who controlled it. The line ran straight from Bostrom’s footnoted proof, through Musk’s demon, through the beach in Puerto Rico, toward a conviction that was taking shape in a few heads over the course of 2015 and that would, before the year was out, become an actual company. If powerful AI was coming, and if it might go catastrophically wrong, then the safest thing to do, the argument went, was to make sure the people building it first were the people who understood the danger. The worry was not going to stop anyone from building the machine. It was going to make them build it faster, and tell themselves they had no choice.
For all that, the argument was still an argument, a thing carried on in books and on stages and around hotel tables, about a danger no one outside the field could see or touch. What changed that, a few months later, was a single move on a board nineteen lines square, played by a machine in front of an audience of millions, that no human master would have made. The abstraction was about to become a spectacle.