Part VIII · The New Machines

A Second Axis of Scaling

For years a model improved only by getting bigger. In September 2024 OpenAI's o1 learned to think longer before answering, and benchmarks on which conventional models had stalled began to fall. By January 2025 China's open-weight DeepSeek-R1 had reached the same tier. These are the few benchmark numbers the sources actually state: from left to right, the conventional model gives way to the reasoning models.

[Chart series, left to right: GPT-4o, conventional (May 2024); o1, reasoning (Sept 2024); o3, reasoning (Dec 2024 preview / Apr 2025 release); DeepSeek-R1, open reasoning (Jan 2025).]

The catch the excitement missed. o3's headline 87.5% on ARC-AGI came from a high-compute run that searched through more than a thousand candidate answers per puzzle, at a cost of thousands of dollars per task. The model OpenAI actually shipped in April 2025 scored 41–53% on the same test, and only 3–4% on a harder second version. The capability was real; so was the demo theater. On SWE-bench Verified the chapter gives a trajectory rather than per-model points: roughly a third of real bugs solved in autumn 2024, then more than three-quarters, and then more than 80% within a year.

SOURCE · NEURON MAKERS, Ch. 33 (“Teaching Machines to Think”) + research dossier 08 (Reasoning / Agents / Multimodal). Underlying: OpenAI o1 & o3 announcements; DeepSeek-R1 (open weights, Jan 20 2025); ARC Prize / François Chollet. Only figures stated in the sources are plotted; released-o3 ARC bars use the midpoint of the sourced 41–53% and 3–4% ranges.
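
The midpoint rule in the source note is simple arithmetic, and a minimal sketch of it may help readers check the plotted bars. The function name `midpoint` and the dictionary labels are illustrative assumptions, not names from the sources; only the two ranges (41–53% and 3–4%) come from the note itself.

```python
# Sketch only: reproduces the midpoint rule described in the source note.
# The names below are hypothetical; the ranges are the ones the note states.

def midpoint(low: float, high: float) -> float:
    """Return the value halfway between the two ends of a range."""
    return (low + high) / 2


# Sourced score ranges for the released o3 model, in percent.
released_o3_ranges = {
    "ARC-AGI": (41.0, 53.0),
    "ARC-AGI, harder second version": (3.0, 4.0),
}

# Bar height used in the chart = midpoint of each sourced range.
bar_heights = {
    name: midpoint(low, high)
    for name, (low, high) in released_o3_ranges.items()
}

for name, value in bar_heights.items():
    print(f"{name}: {value:.1f}%")
```

Under these assumptions the released-o3 bars sit at 47.0% and 3.5%, so each bar is a summary of a range rather than a single reported score.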