Neuron Makers
Part V · Chapter 22

Attention Is All You Need

Eight Google researchers invent the Transformer in 2017, title their paper after a Beatles song, and scatter to found the companies that will challenge Google itself. → The one architectural decision the entire modern era rests on.

“Attention Is All You Need.” — title of Vaswani et al. (2017), a play on the Beatles’ “All You Need Is Love”

Jakob Uszkoreit had a problem with how machines read.

By 2016 he was a researcher at Google, working on the systems that answered questions and translated text, and the dominant tool for that work was the recurrent neural network. A recurrent network read a sentence the way a person reads a ticker tape: one word, then the next, then the next, each word’s meaning shaped by a running memory of everything that came before. It was elegant and it was slow. Because each step depended on the step before it, the math could not be spread across many processors at once. You waited for word one before you could compute word two. For a company that owned warehouses of parallel hardware and ran a translation product used by hundreds of millions of people, the bottleneck was maddening.
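The bottleneck can be made concrete with a toy sketch. What follows is not any system Google ran; the function name `recurrent_read`, the `carry` factor, and the one-number-per-word inputs are all invented for illustration, and a real recurrent network uses learned weight matrices and vector-valued memory. The shape of the loop is the point: each pass needs the memory left by the pass before it.

```python
import math

def recurrent_read(inputs, carry=0.5):
    """Toy recurrent pass over a list of numbers, one per word.

    Illustrative only: `carry` is a hand-picked stand-in for the learned
    weights a real recurrent network would use.
    """
    memory = 0.0
    states = []
    for x in inputs:
        # This step cannot begin until the previous step has produced `memory`.
        memory = math.tanh(carry * memory + x)
        states.append(memory)
    return states

# Only the first "word" carries any signal; later steps see it through memory.
states = recurrent_read([1.0, 0.0, 0.0])
```

The first word's influence survives into the third state, but only by being handed down step by step, and no amount of extra hardware lets the third step start before the second has finished.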

Uszkoreit was the son of Hans Uszkoreit, a computational linguist of some renown, and he had grown up around the idea that language could be modeled. His own conviction, which he repeated to anyone who would listen and to several who would rather he stopped, was that the sequential reading was unnecessary. A model did not need to march through a sentence in order. It could look at all the words at once and learn, for each word, which of the others mattered to it. The pronoun “it” in a sentence could reach directly back across a dozen words to the noun it referred to, without passing through every word in between. The mechanism for that reaching already had a name in the research literature. It was called attention, and it had been bolted onto recurrent networks a few years earlier, in 2014, by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, as a way to help translation systems keep track of long sentences. Attention was the helper. Uszkoreit’s heresy was that attention could be the whole thing. Take away the recurrence, take away the slow sequential spine, and let attention carry the model on its own.

What attention did, stripped of the mathematics, was let every word in a sentence interrogate every other word and weigh the answers. For each word the model produced a query, and every word offered a key; where a query matched a key, the model paid that word more heed and pulled in more of its meaning. The word “it” could broadcast a query that asked, in effect, which earlier noun do I stand for, and the right noun’s key would answer loudest, no matter how far back it sat. All of those comparisons happened in a single sweep rather than in the slow relay of a recurrent network. The team ran several of these attention systems side by side, each one free to learn a different kind of relationship, so that one set of comparisons might track grammar while another tracked meaning, and then stitched the results together. That was the whole bet. If the model could learn, from data alone, which words should attend to which, it would not need to be walked through the sentence in order at all. It would only need one small concession: because it no longer read in sequence, it had to be told where each word sat in the line, a stamp of position added to every word so the model would not mistake a sentence for a bag of unordered tokens.
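For readers who want the mechanics, here is a minimal sketch of a single attention system in plain Python. It follows the scaled dot-product form the paper describes, but everything specific is an assumption made for illustration: the three words, the two-number vectors, and the helper names `softmax`, `dot`, and `attend` are invented, and in a trained model the queries, keys, and values are learned from data rather than typed in by hand. The sketch leaves out the side-by-side attention systems and the stamp of position.

```python
import math

def softmax(scores):
    """Turn raw match scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's query against every word's key.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend the values in proportion to how well each key matched.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three words: "cat", "mat", "it". The numbers are made up.
keys    = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]  # only "it" asks a pointed question

outputs, weights = attend(queries, keys, values)
it_weights = weights[2]  # how much "it" attends to "cat", "mat", and itself
```

With these invented numbers, the query for "it" lines up with the key for "cat", so nearly all of its weight lands there and its output is almost a copy of that word's value. Each word's comparisons depend only on the inputs, not on any other word's result, which is why the loop over queries could run all at once on parallel hardware.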

Most of his colleagues thought this was wrong, or at least unpromising. Attention was a useful patch. Nobody believed it could replace the architecture it patched. Uszkoreit kept arguing anyway, and over 2016 and into 2017 he gathered a small, shifting group of collaborators at Google’s Mountain View campus willing to try to prove him right. Among the earliest was Illia Polosukhin, a Ukrainian engineer who worked on the question-answering system behind Google Search. The two of them, with others drifting in and out, started building toward a model with no recurrence in it at all.

What happened next has become one of the more contested origin stories in the history of computing, partly because so many of the people involved have an interest in how it is told, and partly because the truth is genuinely diffuse. The work that became the Transformer did not spring from one mind. It was assembled by a group, in fits and starts, with several people each contributing a piece without which it would not have worked, which means several people can each say, accurately, that it would not have worked without them.

The clearest turning point came from a man who was not originally on the project at all. Noam Shazeer was one of Google’s most formidable engineers, a veteran who had been at the company since 2000 and had a hand in the early spelling correction in Google Search and in some of the infrastructure that made the place run. He was the kind of programmer about whom other very good programmers spoke with a certain awe. By 2017 he had grown restless, was thinking about leaving, and had told colleagues he might quit to do something else. Then, by the account he and others have given to journalists, he overheard the group talking about their attention-only idea near the office snacks. He found it interesting. He asked if he could join.

Shazeer took the group’s working code and rewrote large parts of it. He was the rare researcher who was also a world-class systems engineer, and he restructured the model so that it actually used the hardware the way Uszkoreit had dreamed it would. The results jumped. The thing that had been a promising research direction became, in a matter of weeks, a model that was beating the best translation systems Google had. The team kept pulling pieces out to see what they could do without. They removed convolutions. They removed the recurrence entirely, the feature that every competitive sequence model had relied on for years. Each time they expected the quality to collapse, and each time it held or improved. The architecture got simpler and better at the same time, which almost never happens, and which is why the people in the room remember the spring of 2017 the way some people remember a particular summer.

The eight names that ended up on the paper were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. They listed themselves with a footnote declaring equal contribution and ordered their names at random, an unusual gesture meant to head off exactly the argument about credit that broke out anyway. Vaswani’s name came first, which is why his is the one most people who cite the paper learn. Parmar was the only woman among them. Gomez was the youngest by a wide margin, a twenty-year-old undergraduate from the University of Toronto doing an internship, who found himself running experiments through the night on what turned out to be one of the most influential pieces of engineering of the century. Kaiser, who had co-built a Google library called Tensor2Tensor, helped turn the model into reusable code that others could pick up.

The title came from Llion Jones, a Welsh engineer with a fondness for the Beatles. The team had been calling the architecture the Transformer, a label Uszkoreit had pushed because the model transformed one representation of language into another and because, by his own admission, it sounded good. When it came time to title the paper, Jones suggested “Attention Is All You Need,” a play on the Beatles’ “All You Need Is Love.” Some of the authors thought it was too cute for a serious machine-learning paper. They used it anyway. It would become, in time, one of the most recognized titles in the field, and the joke would curdle slightly with success, because the title overstated the case. Attention was not, strictly, all you need. The model also needed that stamp of position on every word, and several other components the title elides. But the phrase stuck, and the field’s habit of naming later papers “X Is All You Need” is a small monument to how thoroughly it stuck.

They were writing against the deadline for NeurIPS, the year’s largest machine-learning conference, and the work was not finished when the writing started. Experiments were still running as the paper was assembled, results dropping into the tables in the last days before submission. They got it in. On June 12, 2017, the paper went up on the arXiv preprint server as arXiv:1706.03762.

What the paper reported was, on its face, narrow. It was a translation result. On the standard 2014 benchmark for translating English to German, the Transformer scored 28.4 on the BLEU metric that grades translation quality, a new record. On English to French it scored 41.8. Those numbers, in the small world of machine translation, were genuinely good. But the line in the paper that mattered most was not about quality. It was about cost. The model reached those scores after training for around three and a half days on eight GPUs. The systems it beat had taken far longer and far more hardware. The Transformer was better and it was cheaper, and it was cheaper precisely because Uszkoreit had been right: without the sequential spine, the computation spread across parallel hardware, and the hardware did what it was built to do. The architecture’s defining feature was that it scaled with compute almost effortlessly. Throw more chips at it and it went faster. Throw a bigger version of it at more data and it kept getting better, with none of the bottlenecks that had strangled the older designs.

Nobody in the room fully understood, in June 2017, what they had. They had built a better translation model. The paper presents it as a better translation model. The abstract talks about translation, the experiments are about translation, and the motivation, all the way back to Uszkoreit’s original frustration, was about making Google Translate faster. The universality of the thing, the fact that it would turn out to be the engine behind essentially every important AI system of the next decade, nobody had designed. Other people discovered it, later.

When the paper was presented at NeurIPS that December, in Long Beach, it drew interest but not pandemonium. Translation researchers paid close attention. The broader field took longer to register what had arrived. Recognition came not from any single demonstration but from the slow realization, over 2018 and 2019, that the architecture worked for almost anything you pointed it at. A research group could take the Transformer, designed for converting English into German, and train it instead to predict the next word in a stretch of ordinary text, and the model would absorb a startling amount of how language works. The same skeleton that translated could summarize, answer questions, write code, and carry on a conversation. The narrow tool was a general one. The translation result had been a keyhole, and the building behind it was enormous.

Two labs walked through that keyhole first. At OpenAI, a small team led by Alec Radford took the decoder half of the Transformer and trained it to do nothing but predict the next word across a large pile of books, releasing the result in June 2018 as the first of a series they called GPT, for Generative Pre-trained Transformer. The last word of that name, and the entire family of systems that followed it, traces directly to the eight authors’ paper; the T in GPT is theirs. Four months later, in October 2018, Google’s own researchers released BERT, which used a different slice of the Transformer to read both directions across a sentence at once, swept the field’s benchmarks, and quietly went into Google Search to help it understand what people were actually asking. The architecture invented to translate was, within eighteen months, remaking search, language understanding, and the early shape of the systems that would become chatbots. What the years after would add to it was not a new idea but a willingness to make it enormous, to feed it more data and more compute than anyone had dared, and to find that it simply kept getting smarter. By 2020 OpenAI had built a version with a hundred and seventy-five billion parameters, more than a thousand times the size of Radford’s first one, and the larger it grew the more it could do that no one had trained it to do. The Transformer was the substrate. Scale was what the field would do with it.

That part of the story belongs to others. The part that belongs to the eight authors is what they did next, which was leave.

They did not leave all at once, and they did not leave for one reason, but they left almost completely. Polosukhin went first, in 2017, before the paper had even been presented, to start a company that began as an effort to teach machines to write code and pivoted into a blockchain platform called NEAR. Gomez, the intern, co-founded a company called Cohere in 2019, aimed at selling language models to businesses rather than consumers, and by 2025 it had raised money at a valuation around seven billion dollars. Shazeer, the engineer whose rewrite had made the thing work, grew frustrated that Google would not ship the conversational AI he believed it could build. The company worried about the reputational hazard of a chatbot that might say something embarrassing or false, a caution that was perfectly rational for a firm whose search box was trusted by billions and whose mistakes made headlines, and that would later look, in hindsight, like the most expensive timidity in corporate history. In 2021 Shazeer left with a colleague named Daniel De Freitas to found Character.AI and build exactly the kind of chatbot Google would not. Uszkoreit left the same year to start Inceptive, applying the architecture not to language but to biology, designing RNA molecules for new medicines. Vaswani and Parmar co-founded a company called Adept in 2022, then left it to found another, Essential AI, in 2023. Kaiser went to OpenAI, where the scaling bet was being made most aggressively. Jones, who had named the paper, was the last of the eight out the door, founding Sakana AI in Tokyo in 2023.

The arithmetic of that exodus is the central irony of the modern era. Google employed all eight of the people who invented the Transformer. Google ran the data centers that could train enormous versions of it. Google had, in BERT, already proven to itself that the architecture generalized beyond translation. The company had the idea, the talent, and the compute, the full set, in one building, years before anyone else. And it watched the people holding all three pieces leave to found the companies that would spend the next decade trying to beat it. The lab that built the foundation of modern AI became the lab known for having let its builders go.

Google noticed, eventually, and tried to buy back what it had lost. In August 2024 it struck a deal worth around 2.7 billion dollars that licensed Character.AI’s technology and brought Shazeer and roughly thirty of his people back inside the company, with Shazeer installed as a technical co-lead of Gemini, Google’s answer to the systems his old colleagues had helped others build. The cost of getting Noam Shazeer back, a man who had been on the payroll in 2017 and had thought about quitting out of restlessness, ran into the billions and took the better part of a decade. The paper, by then, had been cited well over a hundred thousand times, making it one of the most cited in the history of computer science.

In March 2024, seven of the eight authors sat together on a stage at a conference hosted by Nvidia, the chipmaker whose hardware had made their architecture run, reunited in public for the first time as something close to celebrities. They were asked about what they had built and what should come after it. The striking thing about the conversation was their impatience. Several said openly that the Transformer was good but not the end, that the field had become stuck on a single architecture mostly because it worked and the money had piled onto it, and that the world would be better off finding something better. They had built the foundation of the era, and they were already itching to tear it up. They had spent the years since 2017 watching an idea they assembled in a few feverish weeks, to make a translation model run faster on Google’s chips, become the most consequential architecture in the history of artificial intelligence, and they were the ones least inclined to treat it as finished.