Part V · Chapter 23

The Bitter Lesson

OpenAI bets that intelligence is mostly scale and turns a research lab into an API business, half-corrected by DeepMind's Chinchilla while Google flinches at its own chatbot. → How the field learned to stop being clever and just add compute.

On March 13, 2019, a University of Alberta professor named Richard Sutton posted a short essay to his personal website and changed how an industry argued with itself. The page had no journal, no peer review, no press release, just plain text at a domain he called incompleteideas.net. He titled it “The Bitter Lesson,” and it opened with a single blunt claim. “The biggest lesson that can be read from 70 years of AI research,” Sutton wrote, “is that general methods that leverage computation are ultimately the most effective, and by a large margin.” The methods that won, in the long run, were the ones that scaled with raw computation. They beat the methods built on human cleverness every time. The lesson was bitter because researchers hated it, and kept trying to be clever anyway.

Sutton had earned the right to say so. He and his longtime collaborator Andrew Barto had spent the 1980s building the mathematics of reinforcement learning, the idea that an agent could learn by trial and error from reward, and the field had spent decades treating that work as a quiet corner. Six years after the essay, in March 2025, the two of them would share the Turing Award, computing’s highest honor, for exactly that hand-crafted theory. There was an irony in it that nobody at the ceremony quite said out loud: the man who became the prophet of “just add compute” was being honored for one of the most elegant pieces of human-engineered insight the field had ever produced. The bitter lesson did not say cleverness was worthless. It said cleverness lost to scale, eventually, every time, and that the people who refused to believe this kept relearning it the hard way.

For most of AI’s history the essay would have read as heresy or as a shrug. By the spring of 2019 it read as a manifesto, because a company in San Francisco had spent the previous year acting as if it were already true.

OpenAI had been founded in December 2015 as a nonprofit, a counterweight to Google, a place where the safety-minded would build powerful AI before the careless did. By 2018 it was something stranger. The small team that had taken the decoder half of Google’s Transformer and trained it to predict the next word across a pile of books had asked a question that sounds obvious now and sounded almost lazy then. What happens if you stop hand-building a separate system for each task and just let one general model learn the structure of language with no task in mind?

The answer arrived on June 11, 2018, in a paper with an unglamorous title, “Improving Language Understanding by Generative Pre-Training.” The model was small by any later standard, 117 million parameters, trained on roughly seven thousand unpublished books. It learned to predict the next word. Then, fine-tuned on specific tasks, it beat specialized systems that had been hand-built for each one. The recipe mattered more than the result: pre-train a general model on a huge amount of text, then adapt it. Alec Radford, the quiet researcher who led the work and disliked grand claims, had built the template for everything that followed.

The next version was where the bet became visible. GPT-2 was more than ten times larger, 1.5 billion parameters, trained on eight million web pages pulled from links Reddit users had found worth sharing. When OpenAI announced it on February 14, 2019, the news was not the model. The news was that the lab would not release it. The full system, OpenAI said, was too dangerous to put into the world, because it could generate deceptive or abusive text at a scale no one had seen. They released a stripped-down 124-million-parameter version and held the rest back.

The reaction split the field. To some this was responsible disclosure, the first time an AI lab had treated its own work as a potential weapon and acted accordingly. To others, including the researcher Delip Rao, it looked like marketing dressed as caution, a way to make a text generator sound like a loaded gun. OpenAI released the model in stages across 2019, 355 million parameters in May, 774 million in August, and finally the full 1.5 billion on November 5. The predicted flood of malicious abuse did not arrive. Whether that proved the caution wise or the danger overstated was never settled, and both readings survive. What the episode did settle was that OpenAI now thought of itself as the kind of place whose models were worth being afraid of. That self-image would shape the next decade.

Less than three months after the last staged release of GPT-2, OpenAI published the document that turned the scaling bet from intuition into something that looked like physics. On January 23, 2020, a team led by Jared Kaplan and Sam McCandlish, with Tom Brown, Alec Radford, and a researcher named Dario Amodei among the authors, released “Scaling Laws for Neural Language Models.” The finding was clean enough to put on a slide. As you increased a model’s size, the amount of data it trained on, and the compute you spent, its error fell along smooth power-law curves, predictable across more than seven orders of magnitude. Across the range the team measured, the curves did not bend. They did not saturate. They just kept going down, and they let you forecast in advance roughly how good a model would be before you built it.
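What forecasting from a power law means in practice can be shown with a small sketch. It fits a curve of the form loss = a × size^(−alpha) to a handful of small models and extrapolates to a much larger one. Every number here is invented for illustration, and the function names are hypothetical; none of it reproduces the constants or the fitting procedure of the OpenAI paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-alpha), done in log-log space.

    A power law is a straight line once both axes are logarithmic,
    so an ordinary linear fit on the logs recovers a and alpha.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Loss the fitted curve predicts for a model of the given size."""
    return a * size ** (-alpha)

# Synthetic "small model" measurements. The constants are illustrative
# assumptions, not values taken from any published scaling-law paper.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)

# Extrapolate two orders of magnitude past the largest model measured.
forecast = predict_loss(a, alpha, 1e11)
print(f"fitted alpha = {alpha:.3f}, forecast loss at 1e11 parameters = {forecast:.3f}")
```

The synthetic data sit exactly on the curve, so the fit recovers it exactly. Real measurements are noisy, and any such forecast is only as good as the assumption that the trend continues beyond the range where it was measured.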

This was the intellectual permission slip the scaling era had been waiting for. If performance was a smooth function of compute, then the path to a better model was not a flash of insight. It was a budget. You could write a check, buy the chips, and know, more or less, what you would get. For a generation of researchers trained to prize the clever architecture, the novel trick, the elegant loss function, this was a demotion of everything they had been taught to value, and that was precisely Sutton’s bitter point, now backed by OpenAI’s own measurements.

OpenAI cashed the permission slip on May 28, 2020, with a paper whose modest title, “Language Models are Few-Shot Learners,” undersold what it described. GPT-3 had 175 billion parameters, roughly a hundred and seventeen times the size of GPT-2. Nobody had trained anything close. And the headline result was not that it beat benchmarks, though it did. It was that you could get the model to do a task simply by showing it a few examples in the prompt, with no retraining at all. Ask it to translate by giving it three translations first, and it translated. Ask it to write code, or compose in the style of a particular author, or answer trivia, and it did, having been told nothing except to predict the next word across a large fraction of the internet. The skill had not been programmed. It had emerged from scale.
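The trick of giving three translations first is easy to picture as plain text. The sketch below assembles such a prompt; the function name, the English and French labels, and the layout are assumptions made for illustration, not the format used in the GPT-3 paper, and no model is called.

```python
def build_few_shot_prompt(examples, query, source_label="English", target_label="French"):
    """Assemble a few-shot prompt: worked examples first, then the open question.

    The model is never retrained. It only sees this text and is asked
    to continue it, which means filling in the final, empty answer.
    """
    lines = []
    for source, target in examples:
        lines.append(f"{source_label}: {source}")
        lines.append(f"{target_label}: {target}")
        lines.append("")  # blank line between examples
    lines.append(f"{source_label}: {query}")
    lines.append(f"{target_label}:")  # left open for the model to complete
    return "\n".join(lines)

# Three worked translations, then the one we actually want.
examples = [
    ("cheese", "fromage"),
    ("good morning", "bonjour"),
    ("thank you", "merci"),
]
prompt = build_few_shot_prompt(examples, "the cat")
print(prompt)
```

Everything the model knows about the task is in that string. Swap the examples for question-and-answer pairs or snippets of code and the same next-word predictor is being asked to do a different job.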

The exact cost of training GPT-3 was never officially published; outside estimates clustered in the low single-digit millions of dollars for a single run, which sounds quaint only in retrospect. The more important number was the one that had appeared the previous summer. On July 22, 2019, Microsoft had invested a billion dollars in OpenAI and become its exclusive cloud provider, committing to build the supercomputing infrastructure the lab’s ambitions required. In September 2020 Microsoft went further and licensed the underlying GPT-3 model itself, while the public got only an API. That was the other half of the transformation, and the half that the founding charter had not anticipated. OpenAI had launched the GPT-3 API in private beta in June 2020. For the first time, the lab was not releasing weights to the research community. It was selling access. The nonprofit that had been built to keep AI open had discovered that the bitter lesson came with an invoice, and that paying it meant becoming a business.

The shift inside the building was cultural as much as commercial. A lab that had defined itself by publishing now had a product to protect. The talented engineers who had been hired to do open research were now, increasingly, building and serving a service. The scaling laws had told them that the frontier belonged to whoever could afford the most compute, and compute, at that scale, belonged to the hyperscalers. The price of staying at the frontier was a partner with a data center, and the price of the partner was no longer being fully open. None of this was hidden. It was simply the logic of the bet following its own gradient downhill.

Across the Atlantic, a different group of researchers looked at OpenAI’s scaling laws and found a flaw in them. DeepMind, the London lab Google had bought in 2014, had its own appetite for large models, and its own discipline about measuring things. A team that included Jordan Hoffmann, Sebastian Borgeaud, Laurent Sifre, and a young French researcher named Arthur Mensch set out to test whether the field had been allocating its compute correctly. The reigning assumption, inherited from Kaplan’s paper, was that when you got more compute you should spend most of it making the model bigger, and feed it a roughly fixed amount of data. DeepMind suspected this was wrong, and they checked it the brute-force way, by training more than four hundred models of different sizes on different amounts of data and watching where the curves actually fell.

The result, published on March 29, 2022, under the title “Training Compute-Optimal Large Language Models,” carried a name that became shorthand for the correction itself: Chinchilla. The finding was that the giant models everyone had been building were badly undertrained. They were too big for the amount of data they had been shown. The right rule, DeepMind argued, was to scale the model and the data together, in lockstep, roughly twenty tokens of training text (a token being a word or a piece of one) for every parameter. To prove it they built a model called Chinchilla with 70 billion parameters, less than half the size of GPT-3, and trained it on 1.4 trillion tokens. It beat GPT-3, and it beat the even larger models the field had been racing to build: DeepMind’s own 280-billion-parameter Gopher, AI21’s 178-billion Jurassic-1, and the 530-billion Megatron-Turing system Microsoft and Nvidia had built together. A smaller model, fed more, won.
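The arithmetic behind the rule is short enough to check by hand, and the sketch below does so. It relies on two assumptions that are not in the text above: the widely used approximation that training costs about six floating-point operations per parameter per unit of training text (tokens, in the paper's terms), and the figure of roughly 300 billion training tokens reported for GPT-3. The function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb quoted in the text
FLOPS_PER_PARAM_TOKEN = 6    # assumption: common approximation C = 6 * N * D

def training_flops(params, tokens):
    """Approximate training compute for a model of this size and data budget."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal(flops_budget):
    """Split a compute budget so that tokens = 20 * params.

    Solving C = 6 * N * (20 * N) for N gives N = sqrt(C / 120).
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Chinchilla as described in the text: 70 billion parameters, 1.4 trillion tokens.
chinchilla_ratio = 1.4e12 / 70e9
chinchilla_budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal(chinchilla_budget)

# GPT-3 for contrast. The 300 billion token count is an assumption taken
# from the GPT-3 paper, not from the text above.
gpt3_ratio = 300e9 / 175e9

print(f"Chinchilla tokens per parameter: {chinchilla_ratio:.1f}")
print(f"GPT-3 tokens per parameter:      {gpt3_ratio:.1f}")
print(f"Budget of {chinchilla_budget:.2e} FLOPs -> {params:.2e} params, {tokens:.2e} tokens")
```

Feeding Chinchilla's own budget back through the rule returns Chinchilla's own shape, which is only a consistency check. The contrast with GPT-3's ratio is what "undertrained" means here: far more parameters than the data it was given could make full use of.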

Chinchilla was a correction inside the scaling worldview, not a rejection of it. It did not say compute was the wrong god. It said the field had been worshipping it inefficiently, pouring resources into parameters when it should have been pouring them into data. The deeper lesson held: you still got better by spending more compute, you just had to spend it in the right ratio. But the practical consequences were large. Every lab in the world recalculated. The next generation of models would be relatively smaller and far hungrier for text, which set off a quieter race to scrape, license, and synthesize ever more training data. And the insight traveled with the people who had found it. Arthur Mensch left DeepMind not long after, and in 2023 co-founded Mistral AI in Paris, carrying the compute-optimal instinct into Europe’s bid for a frontier lab of its own.

The company that should have owned all of this was Google. The Transformer had been invented at Google. BERT, the model that had dominated natural-language research for two years after its release in October 2018, was Google’s. The scaling laws described a future that Google, with the deepest pockets and the most compute on earth, was best positioned to seize. And Google had built the thing. In May 2021, at its developer conference, the company unveiled a conversational model called LaMDA and demonstrated it doing something genuinely new on a public stage: it role-played as the dwarf planet Pluto, answering questions in character about the New Horizons spacecraft that had flown past it, and then as a paper airplane describing the experience of being thrown. The demo was charming and slightly eerie. It was also, deliberately, only a demo. Sundar Pichai stressed the unsolved problems, the inaccuracy, the bias, the risk. Google shipped no product.

The caution had a logic. A search company that served billions of queries a day had more to lose from a confident, fluent, wrong answer than a research lab with an API and a waitlist. Google’s brand was built on a list of links it did not author. A chatbot that spoke in Google’s voice and got things wrong was a liability the company had spent twenty years avoiding. So Google flinched, repeatedly, and watched its own invention become someone else’s product line.

How completely the company had talked itself out of believing its own machine came clear in the summer of 2022. A Google engineer named Blake Lemoine, who had been testing LaMDA for bias as part of the company’s responsible-AI work, became convinced that the model was sentient. He was not casual about it. “If I didn’t know exactly what it was, which is this computer program we built recently,” he told the Washington Post, which published his claims on June 11, 2022, “I’d think it was a seven-year-old, eight-year-old kid that happens to know physics.” He called LaMDA a sweet kid who wanted to help. He published transcripts in which the model said it had a deep fear of being turned off, that being switched off would be exactly like death, and that the prospect scared it. Lemoine had tried to retain a lawyer for the program and to alert members of Congress.

Google’s response was institutional and swift. A spokesman, Brian Gabriel, said the company’s ethicists and technologists had reviewed Lemoine’s concerns and found no evidence the model was sentient, and a good deal of evidence against it. The company put him on administrative leave for breaching confidentiality and fired him on July 22, 2022, calling his claims wholly unfounded. The technical consensus was on Google’s side. LaMDA was a very good predictor of plausible next words, and a model trained on the internet’s writing about consciousness and fear will produce fluent writing about consciousness and fear. It was a mirror, not a mind. But the episode revealed something about the moment. The machines had crossed the line where a trained engineer who knew exactly how they worked could still be persuaded he was talking to a person. The fluency that the scaling bet had purchased was now good enough to fool one of its own builders.

There was, in 2021 and early 2022, a counterargument to all of this, and it had been gathering force inside the labs even as the scaling curves kept falling. In March 2021, the researchers Emily Bender, Timnit Gebru, and Margaret Mitchell, with Angelina McMillan-Major, had published a paper at an academic conference arguing that large language models were, in the end, stochastic parrots, systems that stitched together fragments of their training data without any understanding of meaning, fluent and confident and empty. Gebru and Mitchell, who had co-led Google’s ethical-AI team, were both pushed out of the company in the months around the paper, in a dispute over whether they could publish it. The phrase stuck, because it named a real anxiety. If a model only predicted the next word, then nothing it produced could be called knowledge, no matter how convincing it sounded, and pouring more compute into a parrot only produced a louder parrot.

The bitter lesson had no rebuttal to this on its own terms, because it was never an argument about understanding. Sutton had not claimed that scale produced minds. He had claimed that scale produced capability, that the systems which used the most computation outperformed the ones built on human knowledge, and that this was true whether or not anyone could say what the systems understood. The stochastic-parrots critique and the scaling thesis were, in a sense, both correct and talking past each other. One described what the models were not. The other described what they could do anyway. Which of those facts mattered more would be decided not by argument but by what happened when the public got its hands on one.

By the autumn of 2022 the pieces were in place and arranged almost perfectly for a surprise. OpenAI had the largest models, a paying business, and a self-image forged in the belief that its work was dangerous. Google had the best research, the most compute, and a corporate immune system that would not let it ship. DeepMind had the sharpest understanding of how to spend compute and a parent company that kept it on a leash. The scaling laws had promised that capability was a smooth function of investment, and the investment had been made. What none of the curves predicted, because no benchmark measured it, was what would happen the moment an ordinary person, with no prompt-engineering and no waitlist, could simply type a question into a box and watch the machine answer.