Part V · The Transformer

A straight line, drawn on a log scale

Once the Transformer existed, capability climbed with scale (more parameters, more data, more compute) in a way that looked predictable on a log chart. From GPT-1 to GPT-3, parameter count grew roughly 1,500× in two years. Then DeepMind’s Chinchilla showed the field had been reading the line wrong: a smaller model fed far more data beat a bigger one.

[Chart legend: OpenAI GPT line (params) · DeepMind (Gopher, Chinchilla) · The era’s thesis: “bigger is better” (dashed line, illustrative)]

The thesis (2018–2020)

GPT-1 (117M) → GPT-2 (1.5B) → GPT-3 (175B). Loss fell smoothly as compute grew. The straight line said: keep scaling, and capability keeps coming. Scale, not cleverness, was the lever.
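As a quick check on the size of that jump, the parameter counts above can be turned into growth factors. This is a minimal sketch using only the figures quoted in this panel; the names `PARAMS` and `growth_factor` are illustrative, not taken from any source.

```python
import math

# Parameter counts as quoted in this section.
PARAMS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
}

def growth_factor(smaller: str, larger: str) -> float:
    """Ratio of parameter counts between two models in PARAMS."""
    return PARAMS[larger] / PARAMS[smaller]

step_1 = growth_factor("GPT-1", "GPT-2")   # roughly 13x
step_2 = growth_factor("GPT-2", "GPT-3")   # roughly 117x
total = growth_factor("GPT-1", "GPT-3")    # roughly 1,500x

# On a log10 axis, the whole GPT-1 to GPT-3 span is a bit over 3 units.
orders_of_magnitude = math.log10(total)

print(f"GPT-1 -> GPT-2: {step_1:.1f}x")
print(f"GPT-2 -> GPT-3: {step_2:.1f}x")
print(f"GPT-1 -> GPT-3: {total:.0f}x ({orders_of_magnitude:.2f} orders of magnitude)")
```

A jump of about 1,500× collapses to a little over three steps on a log axis, which is why growth this steep can still read as a tidy line.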

The correction (2022)

Chinchilla (70B on ~1.4T tokens) beat Gopher (280B). The compute-optimal recipe is roughly 20 tokens per parameter — most giant models had been starved of data. The slope changed; the straight line did not bend.
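The 20-tokens-per-parameter figure can be applied as simple arithmetic. The sketch below treats it as a rough rule of thumb rather than an exact law; `TOKENS_PER_PARAM` and `compute_optimal_tokens` are illustrative names, and the only inputs are the model sizes quoted above.

```python
# Rough rule of thumb from the text: about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate training tokens suggested by the tokens-per-parameter rule."""
    return params * tokens_per_param

chinchilla_params = 70e9
gopher_params = 280e9

# Chinchilla: 70B parameters -> about 1.4T tokens, matching the figure quoted above.
chinchilla_tokens = compute_optimal_tokens(chinchilla_params)

# By the same rule, a 280B-parameter model would call for about 5.6T tokens.
gopher_tokens = compute_optimal_tokens(gopher_params)

print(f"Chinchilla (70B): {chinchilla_tokens / 1e12:.1f}T tokens")
print(f"Gopher-sized model (280B): {gopher_tokens / 1e12:.1f}T tokens")
```

Under this rule, a model four times Chinchilla's size would need four times the data to be trained compute-optimally, which is the sense in which the giant models were starved.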

SOURCE · NEURON MAKERS, CH. 23 “THE BITTER LESSON” & dossier 01 · params: GPT-1 117M, GPT-2 1.5B, GPT-3 175B, Gopher 280B, Chinchilla 70B / ~1.4T tokens · dashed thesis line illustrative