Part V · The Transformer
Once the Transformer existed, capability climbed with scale (more parameters, more data, more compute) in a way that looked predictable on a log chart. From GPT-1 to GPT-3, parameter count grew roughly 1,500× in about two years. Then DeepMind’s Chinchilla showed the field had been reading the line wrong: a smaller model fed far more data beat a bigger one.
GPT-1 (117M) → GPT-2 (1.5B) → GPT-3 (175B). Loss fell smoothly as compute grew. The straight line said: keep scaling, and capability keeps coming. Scale, not cleverness, was the lever.
Chinchilla (70B on ~1.4T tokens) beat Gopher (280B) on roughly the same training compute. The compute-optimal recipe is roughly 20 tokens per parameter; by that measure, most giant models had been starved of data. The slope changed; the straight line did not bend.
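The arithmetic behind these figures can be checked with a minimal sketch. It uses only the parameter counts quoted here and treats "20 tokens per parameter" as the rough rule of thumb the text describes, not an exact law. The function names `compute_optimal_tokens` and `growth_factor` are hypothetical helpers introduced for illustration.

```python
# Parameter counts as quoted in the text.
PARAMS = {
    "GPT-1": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
    "Gopher": 280e9,
    "Chinchilla": 70e9,
}

# Rough rule of thumb from the text; an approximation, not an exact constant.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Approximate training-token budget suggested by the rule of thumb."""
    return n_params * ratio


def growth_factor(small: float, large: float) -> float:
    """How many times larger `large` is than `small`."""
    return large / small


if __name__ == "__main__":
    # Token budget the rule of thumb would suggest for each model size.
    for name, n in PARAMS.items():
        tokens = compute_optimal_tokens(n)
        print(f"{name:<10} {n / 1e9:>8.3f}B params -> ~{tokens / 1e12:.2f}T tokens")

    # Parameter growth from GPT-1 to GPT-3.
    ratio = growth_factor(PARAMS["GPT-1"], PARAMS["GPT-3"])
    print(f"GPT-1 -> GPT-3: ~{ratio:,.0f}x parameters")
```

For Chinchilla's 70B parameters the rule gives 1.4T tokens, matching the figure quoted above. For a 280B model such as Gopher it would suggest about 5.6T tokens. The GPT-1 to GPT-3 ratio works out to about 1,496×, which the text rounds to 1,500×.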
SOURCE · NEURON MAKERS, CH. 23 “THE BITTER LESSON” & dossier 01 · params: GPT-1 117M, GPT-2 1.5B, GPT-3 175B, Gopher 280B, Chinchilla 70B / ~1.4T tokens · dashed thesis line illustrative