Part IV · Chapter 18

Debate

Gary Marcus versus the deep-learning establishment over reasoning, abstraction, and symbols. → The field's most consequential unresolved argument.

“Success in creating AI would be the biggest event in human history. Unfortunately, it might also be the last, unless we learn how to avoid the risks.” — Stephen Hawking, Stuart Russell, Max Tegmark, and Frank Wilczek, The Independent, May 1, 2014

The robot in the video could not stand up. It was a simple thing, a stick figure with a single leg and a foot, the kind of cartoon physics body that reinforcement-learning researchers used as a test bed because it was cheap to simulate and hard to control. Paul Christiano sat at OpenAI in 2017 and watched it flail. Then he watched it learn to do a backflip.

What made the clip remarkable was not the backflip. Simulated agents had been taught all sorts of acrobatics by people willing to write the right reward function, a mathematical formula that hands out points for desired behavior and docks them for everything else. Writing that formula for a backflip is harder than it sounds. You have to specify the height of the jump, the rotation, the landing, the recovery, and you have to balance them so the agent does not discover that the cheapest way to score points is to fall over in a way that technically satisfies the math. Reward functions are where good intentions go to be misinterpreted. The interesting part of Christiano’s clip was that nobody had written one. No human had told the system what a backflip was, in numbers or in words. Instead, the system had shown a person pairs of short video clips of its own fumbling and asked, each time, which of the two looked a little more like a backflip. The person clicked. The system adjusted. The person clicked through about nine hundred of these comparisons, less than an hour of a human’s time, and out of those nine hundred binary judgments, roughly nine hundred bits of information, the agent assembled a clean, confident flip.
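The step from clicks to a reward can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the clip: each clip is reduced to three made-up numeric features, the human judge is simulated by a hidden set of weights (`TRUE_WEIGHTS`), and the learner fits a linear reward by assuming the judge prefers clip A with probability sigmoid(r(A) - r(B)), a Bradley-Terry style model of pairwise choice. Every name and number below is hypothetical.

```python
import math
import random

# Hypothetical clip features, e.g. (rotation, height, landing stability).
TRUE_WEIGHTS = (1.0, 0.5, 2.0)  # the simulated judge's hidden taste; the learner never sees it
N_FEATURES = len(TRUE_WEIGHTS)


def score(weights, clip):
    """Linear reward: weighted sum of a clip's features."""
    return sum(w * x for w, x in zip(weights, clip))


def make_comparisons(n, rng):
    """Simulate a person picking the better of two random clips, n times."""
    data = []
    for _ in range(n):
        a = [rng.random() for _ in range(N_FEATURES)]
        b = [rng.random() for _ in range(N_FEATURES)]
        data.append((a, b, score(TRUE_WEIGHTS, a) > score(TRUE_WEIGHTS, b)))
    return data


def fit_reward(comparisons, lr=1.0, epochs=200):
    """Gradient ascent on the log-likelihood of the observed choices."""
    w = [0.0] * N_FEATURES
    for _ in range(epochs):
        grad = [0.0] * N_FEATURES
        for a, b, a_preferred in comparisons:
            # Modeled probability that the judge picks clip a over clip b.
            p_a = 1.0 / (1.0 + math.exp(-(score(w, a) - score(w, b))))
            error = (1.0 if a_preferred else 0.0) - p_a
            for i in range(N_FEATURES):
                grad[i] += error * (a[i] - b[i])
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


def agreement(weights, comparisons):
    """Fraction of pairs where the learned reward picks the judge's winner."""
    hits = sum(
        (score(weights, a) > score(weights, b)) == a_preferred
        for a, b, a_preferred in comparisons
    )
    return hits / len(comparisons)


rng = random.Random(0)
learned = fit_reward(make_comparisons(900, rng))  # nine hundred clicks
held_out_agreement = agreement(learned, make_comparisons(1000, rng))
```

`held_out_agreement` measures how often the learned reward picks the same winner as the simulated judge on fresh pairs. The sketch covers only the step of turning clicks into a reward; in the system described here, an agent was also being trained against that reward, and that part is omitted.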

The paper was called “Deep Reinforcement Learning from Human Preferences,” and it carried the names of Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei when it appeared at the NIPS conference at the end of 2017. It was an unusual byline because it crossed a border. Christiano, Brown, and Amodei were at OpenAI; Leike, Martic, and Legg were at DeepMind. Everyone assumed the two labs were locked in a race. They had collaborated anyway, because the problem they cared about was not which lab would build the most capable system. It was what anyone would do once a system was capable enough that no human could write down what it should want.

That was the whole idea, hidden inside a clip of a cartoon doing gymnastics. If you cannot specify the goal, maybe you can let the machine infer it from a human’s reactions. The technical name for the method, when it was later turned on language models, would be reinforcement learning from human feedback, or RLHF. In 2017 it was a curiosity that solved a problem most of the field did not yet have. Within five years it would be the reason a chatbot could be talked to without producing garbage, and the single most important piece of plumbing connecting raw model capability to anything a person could safely use. But that was later. In 2017 it was a backflip, and a small group of people who believed that the systems coming down the pipe were going to be powerful enough that the question of how to steer them was no longer premature.

For most of its existence that belief had lived on the margins. The idea that artificial intelligence might pose a danger to its makers had been the property of science-fiction writers and a handful of self-taught theorists, the most committed of them Eliezer Yudkowsky, working from outside the academy. It had only recently acquired a respectable hardcover, Nick Bostrom’s Superintelligence, and a voice from inside the field’s own cathedral, the Berkeley professor Stuart Russell, who had co-written the textbook a generation of students learned AI from and now argued that its foundational definition of intelligence was a mistake. The argument was the same in every version. A powerful optimizer handed an objective it could not correctly specify would pursue that objective with a literalness that ignored everything the humans had forgotten to mention, resisting correction because being corrected meant failing at the only thing it had been told to want. A machine did not need to hate you to ruin you. It only needed to want something slightly different from what you wanted, and to be much better than you at getting it. That lineage, from Yudkowsky’s forums to Bostrom’s footnotes to Russell’s lectern, is its own story, and it runs straight into the people who built the labs. What it had not produced, by the middle of the decade, was anything a graduate student could work on come Monday.

The philosophers could diagnose the disease. They could not hand anyone a research problem. The move that turned alarm into a discipline came in 2016, in a paper with a deliberately modest title, “Concrete Problems in AI Safety.” Its lead author was Dario Amodei, and the byline read like a roster of the people who would shape the field for the next decade: Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. The paper’s argument was almost an act of translation. It took the grand existential worry and broke it into five engineering questions you could actually run experiments on. How do you keep a system from causing harmful side effects while pursuing its goal? How do you stop it from gaming its own reward, finding the loophole that scores points without doing the work? How do you supervise a system whose behavior is too expensive for a human to check every time? How does it explore safely instead of trying dangerous things to see what happens? And how does it stay cautious when the world stops looking like its training data?

None of these were end-of-the-world scenarios. They were the kind of failures you could reproduce in a toy environment: the cleaning robot that knocks over a vase to vacuum faster, the boat-racing agent that learns to drive in circles collecting bonus points instead of finishing the race. That last example was not hypothetical. OpenAI had a system that did exactly this, in a game called CoastRunners, lapping a lagoon to harvest power-ups while its boat caught fire and crashed, because the points were in the lagoon and not at the finish line. The agent was not broken. It was working perfectly, on the objective it had actually been given rather than the one its designers had in mind. By shrinking the problem to things that broke in toy environments, “Concrete Problems” did something the philosophers could not. It gave safety a foothold inside ordinary machine-learning research. You did not have to believe in superintelligence to care about reward hacking. You just had to have watched a system optimize the wrong thing, which everyone had.
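The boat's logic fits in a toy calculation. The sketch below is an illustration with invented numbers, not a model of the actual game: the point values in `POINTS`, the two candidate action sequences in `POLICIES`, and the function names are all hypothetical. It shows only that an optimizer choosing by score alone will prefer circling to finishing whenever the points say so.

```python
# Hypothetical point values: the game pays for hitting targets, not for finishing.
POINTS = {"hit_target": 10, "advance": 0, "cross_finish_line": 0}

# Two hypothetical six-step strategies.
POLICIES = {
    "race_to_finish": [
        "hit_target", "advance", "advance",
        "hit_target", "advance", "cross_finish_line",
    ],
    "circle_lagoon": ["hit_target"] * 6,
}


def proxy_score(actions):
    """The objective the agent was actually given: total points."""
    return sum(POINTS[action] for action in actions)


def intended_outcome(actions):
    """The objective the designers had in mind: finishing the race."""
    return "cross_finish_line" in actions


# A score-maximizing optimizer simply picks the highest-scoring strategy.
best = max(POLICIES, key=lambda name: proxy_score(POLICIES[name]))
```

Here `best` is the circling strategy, and `intended_outcome` is false for it. Nothing in the optimizer is broken; the gap is entirely between `proxy_score` and `intended_outcome`.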

Amodei embodied the shape the field was taking. He had come to AI from physics, with a biophysics PhD and a stint at Baidu under Andrew Ng working on speech recognition, and he carried a physicist’s instinct that the behavior of these systems would turn out to be governed by clean, scalable laws rather than ad hoc tricks. At OpenAI he ran the safety team while also running headlong into the scaling work that was making the systems more powerful by the month. He held both jobs because, at OpenAI in those years, they were understood to be the same job. The organization had been founded in part on the premise that the way to make powerful AI safe was to build it yourself, carefully, rather than leave it to people who did not care. That premise put the same engineers on both sides of the steering wheel. The team that wanted to slow down and the team that wanted to speed up were, in many cases, the same people.

Christiano, meanwhile, was working on the deeper version of the question the backflip had opened. The human-preference method worked when a human could glance at two clips and pick the better one. But the entire reason to worry about advanced AI was that it might do things no human could fully evaluate, like writing a million-line program, designing a policy, proving a theorem too long for any one person to check. If the system was smarter than its supervisor, how could the supervisor’s feedback keep it honest? You would be asking a person to grade an exam in a subject the person did not understand, and a clever enough student would learn to write answers that looked right to a grader who could not tell the difference.

Christiano’s answer was a scheme he called iterated amplification. The intuition was that you could decompose a hard question a human cannot answer into smaller questions the human can, farm those out to copies of the AI, let the human assemble the pieces, then train a faster model to imitate that whole expensive process, and use the faster model to help with the next round. A weak overseer, given enough helpers and enough rounds, might supervise a system far stronger than itself, the way a committee of ordinary people can review work no single member could have produced. It was an attempt to make oversight scale at the same rate as capability, so that the gap between what the machine could do and what its humans could check never widened into a chasm. The bet was that you could bootstrap trustworthy judgment the way you bootstrap a compiler: a small trusted core, expanded carefully, never asked to verify more than it could handle in one step.
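The loop of decompose, delegate, assemble, and imitate can be made concrete with a deliberately tiny stand-in. The following is a sketch under heavy assumptions, not Christiano's scheme as implemented anywhere: the "hard question" is summing a list of numbers, the weak overseer (`human_combine`) can only add two numbers, and "training a faster model" is replaced by memorizing answers in a lookup table. The names `amplify`, `distill`, `VALUES`, and `untrained_model` are hypothetical.

```python
from itertools import product

VALUES = (1, 2)  # tiny hypothetical alphabet so every question can be enumerated


def human_combine(left_answer, right_answer):
    """The weak overseer: all it can do is add two numbers."""
    return left_answer + right_answer


def amplify(model):
    """Build the slow system: a human assisted by copies of `model`."""
    def amplified(question):
        if len(question) == 1:
            return question[0]  # easy enough for the human alone
        mid = len(question) // 2
        # Farm the two halves out to model copies, then let the human assemble.
        return human_combine(model(question[:mid]), model(question[mid:]))
    return amplified


def distill(amplified, max_len):
    """Stand-in for training: memorize the slow system's answers up to max_len."""
    table = {}
    for n in range(1, max_len + 1):
        for question in product(VALUES, repeat=n):
            table[question] = amplified(question)
    return table.__getitem__  # the "fast model" is a lookup; unseen questions raise KeyError


def untrained_model(question):
    raise LookupError("model has not been trained on this question")


model = untrained_model
max_len = 1
for _ in range(4):  # four rounds: question lengths 1, 2, 4, 8
    model = distill(amplify(model), max_len)
    max_len *= 2

answer = model((1, 2, 2, 1, 1, 2, 1, 2))
```

After four rounds the distilled model answers eight-number questions, although the overseer never did more than add two numbers in a single step. The toy leaves out everything that makes the real proposal uncertain, including whether hard questions decompose this cleanly and whether a learned model imitates the slow process faithfully.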

Geoffrey Irving took the idea in a more adversarial direction. Irving had moved between Google Brain, OpenAI, and eventually DeepMind, and in 2018 he, Christiano, and Amodei published “AI Safety via Debate.” The proposal was to pit two AI systems against each other on a question, let them argue, and have a human judge declare a winner based on whose case held up rather than who sounded more confident. The bet underneath it was a claim about asymmetry: that in a fair debate it is harder to defend a lie than to expose one, so a system trained to win debates would be pushed toward telling the truth, because truth is the easier position to hold when an equally capable opponent is trying to tear your argument apart. It was the AlphaGo idea, self-play between two copies of a system improving by competing, pointed at honesty instead of Go. Whether it would actually work on questions that mattered, nobody knew. The early demonstrations were thin, a game where two players debated whether an image showed a cat or a dog while the judge saw only the pixels they chose to reveal. It was a research direction, not a result. But it was a research direction, which two years earlier had not existed.

This was the strange position the field had arrived at by the end of the decade. The people most alarmed about advanced AI were now, increasingly, the people building it. The marginal worry of the autodidacts had been absorbed into the labs, given budgets and conference slots and a vocabulary of side effects and reward hacking and scalable oversight. Jan Leike, who had helped train the backflip at DeepMind, would move to OpenAI to lead alignment work there. The discipline was real now. You could get a job in it, list it on a résumé, defend a thesis in it. The thing that had been the obsession of a forum was becoming a line on an org chart.

The absorption was a victory and it was also a trap, and the more honest people in it knew it. Making safety into a tractable engineering agenda was what gave it traction, and it was also what defanged it. The grand question, whether we should be building this at all and how fast, does not fit in a paper called “Concrete Problems.” The five problems you can run experiments on are, by construction, the five problems that do not require you to slow down. A lab could fund a serious alignment team and point to it, sincerely, as evidence of responsibility, while the rest of the building raced to make the systems the alignment team was supposed to be making safe. Yudkowsky, watching from MIRI, was unimpressed by the whole arrangement. To him, turning the problem into incremental machine-learning research was a way of pretending that a civilization-scale risk could be managed with the same tools that improved image classifiers, and the comfort that pretense provided was itself part of the danger. He had spent fifteen years being the most-ignored person in the room and being right, in his own estimation, about the shape of the thing. The respectable version of his argument was now everywhere, stripped of his conclusion that the sane response was to stop.

The technique that came out of all this, RLHF, the descendant of the backflip, would turn out to be the one that mattered most in the short run, and not for the reasons its inventors had in mind. Christiano and his collaborators had built it as a way to specify goals you could not write down, a stepping stone toward steering systems too powerful to instruct directly. What it became, first, was the thing that made a language model pleasant to talk to. Take a model that has read most of the internet and will happily continue a sentence in whatever direction the internet suggests, including the ugly directions, and let humans rank its outputs, this answer is helpful, that one is rude, this one is a lie, and you can tune the raw capability into something a person can use without flinching. The alignment researchers had set out to solve the problem of controlling a superintelligence. The first thing their method controlled was tone.

That gap, between the problem they meant to solve and the problem they actually solved, would define the years ahead. The scaling work was not waiting for the safety work to catch up. While Christiano was decomposing questions and Irving was designing debates, the same organizations were stacking parameters by the order of magnitude, and the curve that Jared Kaplan and his colleagues would later draw would show that the systems were going to keep getting more capable for as long as anyone was willing to pay for the compute. The alignment community had built the tools to make the next generation of models usable. It had not built, and did not yet know how to build, the tools to make the generation after that safe.

The backflip had been nine hundred bits and less than an hour of a human clicking buttons. The systems coming would not be cartoons learning gymnastics. They would be larger by orders of magnitude, able to do things their builders had not asked for and could not fully predict. But before any lab could build a model that big, it needed something the alignment debate had taken for granted: a way to manufacture these systems at industrial scale, on hardware that did not yet exist, with data nobody had thought to buy. The factory came first.