Neuron Makers
Part VIII · Chapter 33

Teaching Machines to Think

A new paradigm of test-time compute lets models reason step by step; OpenAI's o1 and o3 and DeepSeek's R1 send the old benchmarks tumbling. → The second scaling law, and the return of reasoning to AI.

“o3 is a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.” — François Chollet, on OpenAI’s o3, December 20, 2024

Noam Brown had spent the better part of a decade teaching computers to do something most people assumed they could already do: stop and think before acting. He had come up through games that punish haste. As a doctoral student at Carnegie Mellon he had helped build Libratus, the poker bot that beat four of the best heads-up players in the world over twenty days in a Pittsburgh casino in 2017, and what made Libratus win was simple: when the stakes were high, it took longer. It had not memorized more hands than its opponents. It searched the game tree further. It reasoned at the table, in the moment, rather than relying only on what it had learned in advance. Brown had measured the effect and found it staggering. Letting the program think harder during a single hand was worth roughly the same as making the program a hundred thousand times larger. A little deliberation, applied at the right instant, was worth an ocean of preparation.

When Brown joined OpenAI in 2023, he carried that finding with him like a key to a door no one in language modeling had thought to try. For five years the entire field had been organized around a single lever. You made the model bigger, you fed it more text, you spent more on the training run, and the model got better in ways you could chart on a clean line. That was the bitter lesson the field had swallowed after years of resisting it, and it had produced GPT-4 and Gemini and Claude. But the lever had a cost curve that bent the wrong way. Each new tier of capability demanded an order of magnitude more compute than the last, and the supply of high-quality text on the internet was not infinite. Some researchers had begun to whisper that the curve was flattening, that the next GPT might cost ten times as much and feel only a little smarter. The party that had begun with AlexNet might be running out of fuel.

Brown’s poker work pointed at a second lever, one nobody had pulled. What if you left the trained model alone and instead let it think longer at the moment you asked it a question?

The intuition was almost embarrassingly human. Ask a person a hard question and a good answer rarely arrives the instant their mouth opens. They pause. They work it out, sometimes on paper, sometimes in their head, trying a path, hitting a wall, backing up, trying another. The large language models of 2023 did none of this. They produced text the way a fluent liar does, one word after the next at a constant clip, committing to each token before they could possibly know where the sentence was going. They were extraordinary at tasks that rewarded fluent pattern completion and brittle at tasks that required following a chain of steps, because they were structurally incapable of stepping. Set GPT-4o, the conversational model OpenAI shipped in May 2024, against the competition math problems of the American Invitational Mathematics Examination, and it solved about thirteen percent of them. A strong high-school competitor scored far higher. The teenager did not know more. The teenager could sit with the problem.

The fix, when OpenAI revealed it on September 12, 2024, looked deceptively small. The new model was called o1, and it had been built under the internal codename Strawberry, a name that carried freight. In late 2023, during the five days Sam Altman spent fired and rehired, reporters had picked up on an internal project called Q-star that some at the company believed represented a meaningful leap, and the rumor had attached itself to the idea that OpenAI had quietly cracked machine reasoning. Strawberry was the descendant of that work. What it did, mechanically, was generate a long internal monologue before answering, a private chain of thought in which it could try an approach, notice an error, and correct itself, all before a single word reached the user. OpenAI hid the raw monologue and showed only a sanitized summary, partly to protect a trade secret and partly because the unedited thinking could be strange to read. But the effect on the scoreboard was not hidden at all. The thirteen percent on the math exam became roughly eighty-three. On a benchmark of competitive programming problems, o1 reached the eighty-ninth percentile of human contestants. On a set of graduate-level science questions designed to stump anyone without a PhD in the relevant field, it crossed the threshold of expert human performance.

The number that mattered most was not on any leaderboard. It was the shape of a curve OpenAI published alongside the release. The company plotted o1’s accuracy against the amount of compute it was allowed to spend thinking at the moment of the question, and the line went up and to the right, smoothly, the same disciplined slope the field had spent five years chasing through training. Here was the same payoff from a different faucet. Brown and his colleagues had a name for it, and they used it deliberately, because naming a thing is how you make people believe it is real. They called it a new scaling law. There was the old one, where you spent money before deployment to make the model larger. And now there was a second, where you spent money at deployment to let the model deliberate, and it bought you intelligence on the same kind of predictable schedule. The implication was that the ceiling everyone feared was not a ceiling at all. It was a door.

What was actually happening inside the model was less mystical than it sounded and more interesting. Earlier researchers had noticed that if you simply asked a language model to “think step by step,” it often did better, a trick called chain-of-thought prompting that had been floating around since 2022. The model, prompted to show its work, would sometimes reason its way to a right answer it would have botched if forced to blurt. But that was a prompt, a polite request, and the model complied unevenly. What OpenAI had done was train the behavior in, using reinforcement learning, so the model learned to produce a good chain of thought rather than merely any chain of thought, to develop the habits of a careful thinker, to recognize when it had gone wrong and reverse course. The model was being taught a process, not new facts, and the process was the kind of thing humans call thinking.

OpenAI was not the only lab to find the door, and the most vivid demonstration that the behavior could emerge rather than be hand-built came, a few months later, from an unlikely place. On January 20, 2025, a Chinese firm called DeepSeek, a research outfit spun out of a quantitative hedge fund, released a reasoning model called R1, and released it the way OpenAI no longer released anything: in the open, weights downloadable by anyone, under a permissive MIT license, with a full technical paper describing how it was built. The market and geopolitical aftershocks of that release belong to a later part of this story. What belongs here is what was inside the paper.

DeepSeek’s researchers had trained a version of the model, R1-Zero, using almost nothing but reinforcement learning. They had not shown it thousands of human-written examples of careful reasoning to imitate. They had given it problems with checkable answers, rewarded it when it got them right, and let it discover, on its own, how to get them right more often. And the model discovered the same thing the human race had discovered: that hard problems reward patience. Over the course of training it began, unprompted, to spend more time on each problem, to write out longer chains of reasoning, to pause mid-solution and reconsider. In one passage the DeepSeek authors highlighted a moment in the model’s output where it stops, in effect catches itself, and re-examines its own work before continuing. They annotated it, in the paper, with a single phrase: an aha moment. No one had programmed the reasoning. It had been incentivized into existence, the way a child left alone with a hard puzzle eventually learns to slow down.
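The training signal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DeepSeek's code: the function names, the "Answer:" convention for marking a final answer, and the sample completions are all invented for the example. What it shows is that the reward looks only at whether the final answer checks out, never at the reasoning that led there.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Score 1.0 if the final answer matches the known reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

# Hypothetical sampled completions for the problem "What is 17 * 24?"
completions = [
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. Answer: 408",
    "17 * 24 is about 400. Answer: 400",
    "17 * 24 = 398. Wait, let me recheck: 340 + 68 = 408. Answer: 408",
]
rewards = [verifiable_reward(c, "408") for c in completions]
print(rewards)  # [1.0, 0.0, 1.0]
```

The third completion earns full reward even though it stumbles first. Under a signal like this, a habit of pausing and rechecking is reinforced only to the extent that it ends in a correct answer, which is one way to read how such habits could emerge without being programmed in.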

For a brief stretch in late 2024 and early 2025, the reasoning models seemed to be erasing the field’s oldest measuring sticks faster than anyone could invent new ones. The benchmarks that had defined progress for years fell one after another. On December 20, 2024, on the final day of a holiday publicity campaign OpenAI called the twelve days of OpenAI, the company previewed a successor to o1 called o3. The headline result was a score on a test that had been engineered, specifically and stubbornly, to be unbeatable by memorization.

The test was the Abstraction and Reasoning Corpus, and it was the work of François Chollet, a French engineer at Google who had spent years arguing, against the prevailing enthusiasm of his own employer, that the large language models were not as smart as their scores suggested. Chollet’s complaint was that a model trained on the entire internet could ace almost any benchmark by having effectively seen the answers, or close cousins of them, somewhere in its training. He built his corpus to defeat exactly that. Each puzzle was a small grid of colored squares that transformed into another grid according to a hidden rule, and the rules were novel, the kind a bright child could infer from a couple of examples but that could not be looked up because they had never existed before. For years the leading language models scored in the single digits where humans scored around eighty-five percent. Chollet had begun to treat the gap as a kind of proof that something essential was still missing.

When o3’s preview scored 75.7 percent within the benchmark’s standard compute limit, and 87.5 percent when allowed to spend lavishly on thinking, Chollet did not hedge. He called it a genuine breakthrough. The man who had built the test to be hard, and who had made a small career of skepticism, conceded that the machine had done something he had not expected to see for years. He was careful to add that it was not artificial general intelligence, that puzzles a human found trivial could still trip the system up. But the concession itself was the news.

And then, in the way of this period, the news got more complicated. The 87.5 percent had been bought at a price almost no one noticed in the excitement. The high-compute run had let the model generate over a thousand candidate solutions per puzzle and search among them, at a cost that ran into thousands of dollars for a single task. It was a demonstration of what test-time compute could do if you spent without limit, not a description of a product anyone could use. When OpenAI finally released o3 to the public in April 2025, it was a different model, tuned and cost-optimized for the real world, and the ARC Prize team that maintained the benchmark went back and measured it. The shipping o3 scored somewhere in the low-to-mid forties on the same puzzles, not the mid-eighties, at a tiny fraction of the cost. On a harder second version of the test, released to replace the one the models were beginning to crack, it scored in the low single digits.
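Why sampling many candidates helps can be seen in a toy simulation. The sketch below is an illustration only: OpenAI has not described how the high-compute run chose among its candidates, so majority voting stands in here as one simple selection rule, and the solver, its 40 percent success rate, and the pool of wrong answers are all assumptions made for the example.

```python
import random
from collections import Counter

def majority_vote(candidates):
    """Return the most common candidate answer (ties go to the first seen)."""
    return Counter(candidates).most_common(1)[0][0]

def noisy_solver(rng, correct, p_correct=0.4, n_wrong=20):
    """Toy stand-in for one sampled attempt: right with probability
    p_correct, otherwise one of n_wrong scattered wrong answers."""
    if rng.random() < p_correct:
        return correct
    return f"wrong-{rng.randrange(n_wrong)}"

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the vote over n_samples attempts is right."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = [noisy_solver(rng, "right") for _ in range(n_samples)]
        hits += majority_vote(votes) == "right"
    return hits / trials

# More samples per question means more compute spent per question.
for n in (1, 4, 16, 64):
    print(n, accuracy(n))
```

Because wrong attempts scatter across many different answers while right attempts agree with one another, the vote grows more reliable as the number of samples rises. The cost rises in step: every extra sample is another full pass through the model, which is how a single puzzle can come to cost what the high-compute run did.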

This pattern, of a launch-day number that shrank when the dust settled, was not unique to o3. It was becoming the defining tension of the era. The same era had produced a faked Google demo, a coding agent whose viral debut was later shown to be edited, and a string of benchmark claims that did not survive contact with reproduction. The capability was real, and the theater around the capability was also real, and learning to tell them apart was becoming the central skill of anyone trying to follow the field honestly. The reasoning models could do things in early 2025 that no model could do in early 2024. They could also be made to look superhuman in a controlled demo in ways that did not generalize to a phone in someone’s pocket. Both were true at once, and the labs had every incentive to advertise the first and stay quiet about the second.

What did not shrink was the trajectory. Through 2025 the second scaling law held, stubbornly, run after run. OpenAI shipped o3 and a smaller, faster o4-mini in April, the first models that could think with images, manipulating a photograph or a diagram inside their own chain of reasoning rather than merely describing it, and the first trained to reach for tools, a web search, a Python interpreter, a file reader, in the middle of working a problem the way a person reaches for a calculator. The smaller model, given a Python interpreter to check its arithmetic, answered essentially every question on a recent edition of the same math exam that had stumped GPT-4o a year earlier. Anthropic took a different tack and shipped, in February 2025, what it called the first hybrid reasoning model, a single model with a dial. For an easy question it answered quickly, in the old conversational way; for a hard one the user could tell it to engage extended thinking and spend a budget of additional reasoning before replying. The distinction between a chatbot and a reasoner was collapsing into a setting.

The deeper change was philosophical, and it ran against the grain of everything the field had believed since AlexNet. For thirteen years the story of progress had been a story of scale applied in advance. You did the expensive work once, during training, and then inference was cheap; you ran the finished model billions of times for fractions of a cent. The reasoning models inverted the economics. They made the model’s answers expensive again, deliberately, because the expense was where the intelligence now lived. A single hard query to a reasoning model could consume a hundred times the compute of a simple one, and the labs began charging accordingly, OpenAI introducing a two-hundred-dollar-a-month tier built around models that could be told to think for a long time. The compute that the industry had spent years pushing into ever-larger training clusters was now also being pushed into inference, into the moment of use, and that shift would soon reorder the economics of the entire data-center buildout. A model that thinks longer is a model that runs hotter, and a planet’s worth of chips would have to answer for it.

There was a humbler reading of all this, and the more careful researchers held onto it even as the scoreboards lit up. The reasoning models were not thinking in any sense a philosopher would sign off on. They were generating text, still, one token at a time, and the chain of thought was itself just more generated text, a performance of reasoning that happened to correlate, often, with correct answers. No one could fully explain why training a model to ramble toward an answer made the answer better, only that it reliably did. The aha moment in the DeepSeek paper was a real pattern in the output, and it was also a phrase a human had chosen to put in a caption. The single-digit scores on the harder abstraction test were a reminder that whatever the models were doing, it was not yet the fluid, sample-efficient generalization that lets a child learn a new rule from one example. Gary Marcus, who had spent years insisting that deep learning could not reason, found in the o3 walk-back fresh ammunition; the boosters found in the o3 preview fresh proof he was wrong. The honest answer was that both had a point, and that the question of whether these systems reasoned, or only convincingly imitated reasoning, was no longer obviously answerable, because the imitation had gotten good enough that the distinction had begun to lose its grip.

What was not in doubt was the commercial consequence, and it arrived fastest in the one domain where reasoning could be checked instantly and at scale: software. Code either runs or it does not. A model that could plan, try, fail, and correct, that could hold a multi-step problem in view long enough to work it through, turned out to be exactly the kind of model that could write a program, notice the program was broken, and fix it. The benchmark the labs began to obsess over was SWE-bench Verified, a set of real bug reports drawn from open-source software, where the model had to produce a patch that made the actual tests pass. In the autumn of 2024 the best systems solved roughly half of the problems. Within a year that figure would climb past three-quarters, and then past eighty percent, and the reasoning that had been demonstrated on contrived puzzles about colored grids would be quietly absorbed into the daily work of millions of programmers. The machines had learned to stop and think, and the first thing the world asked them to think about was how to build more machines.