Neuron Makers
Part I · Chapter 5

Testament

Google Brain, the YouTube cat detector, and Jeff Dean's industrial-scale compute. → Why the deep-learning era is also a story about a few engineers who knew how to wield data centers.

“Compilers don’t warn Jeff Dean. Jeff Dean warns compilers.” — a “Jeff Dean fact,” Google in-house folklore

Sometime in 2011, on a cluster of a thousand machines humming inside a Google data center, a number began to climb. The number belonged to a single artificial neuron buried deep in a large network, one unit in a web of more than a billion connections, and what it was tracking was simple to state and strange to witness. It had taught itself to notice cats.

No one had told it what a cat was. The engineers had not labeled a single image. They had pointed the network at ten million still frames pulled at random from YouTube videos, each one a 200-by-200 thumbnail of whatever the internet happened to contain, and let it run for three days. The instruction, if it could be called that, was almost contentless: look at these pictures, and learn to describe them to yourself in a way that lets you reconstruct them. Find the patterns that recur. Compress the chaos. The network was free to decide for itself what was worth noticing.

When the researchers went looking afterward, hunting through the layers for whatever the machine had decided to care about, they found that one unit lit up reliably for cat faces. Another responded to human faces. A third had learned the shape of a human body. The team could even run the process in reverse, asking the network to draw the input that would most excite the cat-detecting neuron, and out came a smudged, dreamlike composite, the platonic cat the machine had assembled on its own from a million accidental frames of pet videos. It was not a useful product. No one was going to sell it. But it was a small piece of evidence for a large claim, which was that if you gave a neural network enough data and enough computers, it would start to organize the world without being asked.

The man who could supply the computers was named Jeff Dean, and inside Google he was less an engineer than a legend that happened to have a desk.

The legend took the form of a meme. Around 2007, in the tradition of the Chuck Norris jokes then circulating, a couple of Google engineers built a webpage of mock-heroic “Jeff Dean facts” as an inside gift, and the genre never stopped growing. Jeff Dean’s PIN is the last four digits of pi. Jeff Dean once failed a Turing test when he correctly identified the 203rd Fibonacci number in less than a second. When Jeff Dean designs software, he first codes the binary and then writes the source as documentation. The speed of light in a vacuum used to be about 35 miles per hour, until Jeff Dean spent a weekend optimizing physics. The jokes were absurd on purpose, and like all good in-jokes they encoded a real belief held by people who would never say it plainly: that Dean operated at a level the rest of them could only approximate.

The belief was not unearned. Dean had joined Google in 1999, when the company was small enough that a single brilliant systems programmer could shape what it became, and he had spent the next decade building the machinery that let Google grow into the largest computer the world had ever assembled. In 2004, with his colleague Sanjay Ghemawat, he co-authored MapReduce, a programming model that let an ordinary engineer write a simple piece of code and have it run across thousands of unreliable machines at once, with the system handling the failures and the bookkeeping invisibly. Two years later the same pair helped build Bigtable, a storage system that held the petabytes Google’s services were generating. These were not products anyone outside the company would ever name, and that was the point. They were the floor everything else stood on. Search, Gmail, Maps, the ads that paid for all of it: under each ran infrastructure that Dean and a small number of people like him had designed so that the rest of the company could treat planetary-scale computing as a thing you simply called from a function.
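For readers who want to see the shape of that bargain, here is a toy sketch in Python. It is an illustration only, not Google's code: the word-counting task, the function names, and the single-process `run_mapreduce` stand-in are all assumptions made for the example. The engineer writes the two small functions at the top; everything the real system did across thousands of machines (scheduling, retries, moving data between them) is collapsed here into one ordinary loop.

```python
from collections import defaultdict


def map_phase(document):
    """User-written map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """User-written reduce step: combine every value emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Hypothetical single-process stand-in for the framework.

    A real deployment would run mappers and reducers on many machines and
    re-run any that failed; this version only shows the data flow.
    """
    groups = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


counts = run_mapreduce(["the cat sat", "the cat ran"], map_phase, reduce_phase)
print(counts)
```

The point of the design is visible even at this size: the two user functions know nothing about machines, failures, or networks, so the same pair could in principle be handed to a much larger runner without being rewritten.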

What made Dean rare was not raw cleverness, though he had that. It was a particular instinct for the boundary where an algorithm meets a machine, the place where a beautiful idea either runs fast on real hardware or quietly dies of friction. He thought in terms of cache lines and network latency and the cost of moving a byte, and he could look at a piece of mathematics and see, almost physically, where it would bottleneck once it had to run ten thousand times across a fleet of imperfect computers. In a field that prized theorists, Dean was proof that the engineer who knew how to wield the data center was every bit as load-bearing as the scientist who designed the experiment. The breakthroughs in ideas were real. But an idea that no organization could feed at industrial scale stayed an idea.

The neural network that learned about cats came to Dean by way of a Stanford professor who wandered into a Google microkitchen with a problem.

Andrew Ng was already one of the most visible figures in machine learning. He had built Stanford’s reputation in the area, advised a generation of students, and would soon reach a vastly larger audience through the online courses that helped launch the company Coursera. Around 2010 he began spending time at Google as a consultant, and he carried with him a conviction that had been unfashionable for most of his career and was only beginning to look defensible: that the way forward in artificial intelligence was not more cleverness but more scale. Bigger networks. More data. More computation thrown at the same old idea, the layered neural network that Geoffrey Hinton and a handful of others had refused to abandon through two decades of professional winter. Ng suspected that the reason neural networks had underdelivered was not that the idea was wrong but that no one had ever run it large enough to find out.

Google was, by 2011, possibly the only place on earth where you could test that suspicion without first raising a fortune to buy computers. The computers were already there, idling in data centers, available to anyone inside the company who could justify the cycles. Ng pitched the project to Larry Page and to Sebastian Thrun, the roboticist who ran the secretive lab called Google X, and got the room. The effort was christened the Google Brain, and it began as one of Google X’s “moonshots,” housed alongside the self-driving cars and the internet-beaming balloons, a small bet that thinking machines might be closer than anyone respectable was willing to say.

Ng knew the science. What he needed was someone who could make a network with a billion connections actually run, and one day in a Google microkitchen he fell into conversation with Dean. The pitch was simple: take a neural network, make it enormous, train it on more data than anyone had tried, and see what happens. Dean was intrigued. He had touched neural networks years earlier, as an undergraduate in the late 1980s, during the brief flare of interest that backpropagation had set off, and had concluded they were promising and impractical because the machines of the era were far too slow. The machines were no longer too slow. The question was whether anyone could write the software to harness them.

Dean started spending one day a week on the project, then more. He and a small group, including the researchers Quoc Le, Greg Corrado, and others drawn from Google’s ranks and from Ng’s Stanford orbit, set out to build a system that could train a single neural network across thousands of machines simultaneously. This was harder than it sounds, and the reason it was hard is the reason Dean was the right person for it. A neural network learns by passing data forward through its layers, measuring its error, and then adjusting its millions of internal weights to do a little better next time. When the whole thing fits inside one computer, the bookkeeping is a nuisance. When you split the network across a thousand computers that must constantly tell one another how to update, the bookkeeping becomes the entire problem. Machines fail. Messages arrive late or out of order. A naive design spends all its time waiting and none of its time learning. The system they built, later known as DistBelief, was an answer to that problem: a way to spread one giant model across a fleet and keep it learning despite the chaos, the kind of infrastructure that turned a research idea into something an organization could run.
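The loop described above, and the trouble that appears once it is split across machines, can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of how DistBelief was built: the `ParameterServer` class, the one-weight model, and the simulated workers are all hypothetical, chosen only to show workers computing updates from slightly out-of-date copies of the weights while a shared store keeps applying whatever arrives.

```python
import random


class ParameterServer:
    """Hypothetical shared store: workers pull the weight and push gradients."""

    def __init__(self, weight=0.0, learning_rate=0.02):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        return self.weight

    def push(self, gradient):
        # Apply the update as it arrives, even if it was computed
        # from a copy of the weight that is now several steps old.
        self.weight -= self.learning_rate * gradient


def gradient(weight, shard):
    """Forward pass, error, and mean-squared-error gradient for y = weight * x."""
    total = 0.0
    for x, y in shard:
        prediction = weight * x      # pass the data forward
        error = prediction - y       # measure the error
        total += 2.0 * error * x     # direction that would reduce it
    return total / len(shard)


def train(shards, steps=400, seed=0):
    rng = random.Random(seed)
    server = ParameterServer()
    local = [server.pull() for _ in shards]  # each worker's possibly stale copy
    for _ in range(steps):
        worker = rng.randrange(len(shards))  # workers finish in no fixed order
        server.push(gradient(local[worker], shards[worker]))
        local[worker] = server.pull()        # refresh only after pushing
    return server.weight


# Synthetic data with true relationship y = 3x, split across four workers.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]
shards = [data[i::4] for i in range(4)]
learned = train(shards)
print(round(learned, 3))
```

Nothing here waits for anything else, which is the property the paragraph above says a naive design lacks. The price is that some updates are based on old information, and the sketch works only because the learning rate is small enough to tolerate it.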

With the machinery in place, they ran the experiment that would become the project’s calling card. Quoc Le led it. They assembled ten million unlabeled images, each a 200-by-200 frame sampled from YouTube, and fed them to a network of roughly a billion connections spread across a cluster of about a thousand machines, on the order of sixteen thousand processor cores working in concert. There were no labels because the point was to see what the network would learn with no guidance at all. This was unsupervised learning, the hardest and most tantalizing version of the problem. A supervised network is shown a picture of a cat and told “cat,” a million times, until it can tell cats from dogs; it is a very patient student with a very strict tutor. An unsupervised network is shown the pictures and told nothing. It has to invent its own categories, to discover that certain shapes and textures recur and are worth having a name for, even if it never learns the name. If a machine could do that, it would be learning the way a child does in its first year, sorting the visual world into objects before anyone teaches it the words.
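The difference between the two kinds of learning comes down to what the network's output is compared against. A minimal sketch, in Python, of the unsupervised case: the model below is far smaller and simpler than anything the team ran, and its two weights, its data, and its function names are assumptions made for illustration. It is shown points with no labels, squeezes each one down to a single number, and is scored only on how well it can rebuild the original from that number.

```python
def reconstruct(w, x):
    """Encode x as one number (w . x), then decode it with the same weights."""
    code = w[0] * x[0] + w[1] * x[1]
    return [code * w[0], code * w[1]]


def reconstruction_error(w, data):
    """Mean squared distance between each input and its own reconstruction."""
    total = 0.0
    for x in data:
        rebuilt = reconstruct(w, x)
        total += (x[0] - rebuilt[0]) ** 2 + (x[1] - rebuilt[1]) ** 2
    return total / len(data)


def train(data, steps=500, learning_rate=0.02):
    w = [0.5, 0.1]  # arbitrary starting weights
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in data:
            code = w[0] * x[0] + w[1] * x[1]
            residual = [x[0] - code * w[0], x[1] - code * w[1]]
            residual_dot_w = residual[0] * w[0] + residual[1] * w[1]
            for j in range(2):
                # Gradient of the squared reconstruction error for this point.
                grad[j] += -2.0 * (x[j] * residual_dot_w + code * residual[j])
        w = [w[j] - learning_rate * grad[j] / len(data) for j in range(2)]
    return w


# Unlabeled points that happen to lie along the direction (1, 2).
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
before = reconstruction_error([0.5, 0.1], data)
w = train(data)
after = reconstruction_error(w, data)
print(before > after, round(w[1] / w[0], 3))
```

No one tells the model that the points lie on a line. It finds the line because that is the only way to rebuild two numbers from one, which is the same pressure, at a vastly smaller scale, that the text describes pushing a network toward recurring shapes.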

After three days of training, the cat neuron was there, and the face neuron, and the body neuron, each having emerged on its own from nothing but the statistics of what people upload to the internet. By the narrow standards of accuracy the result was modest. By the standard of what it implied, it was hard to look away from. The team published it in 2012, and the press, predictably, fixed on the cat. A computer had watched YouTube and learned what a cat looked like. It made a good headline, and it buried the deeper point, which was that scale had done the work. The architecture was not new. The learning rule was not new. The only thing that was new was the size, the number of images and the number of computers, and that alone had carried the network across a threshold no one had reached before.

That was the same lesson AlexNet would land, more violently, a few months later in Florence. The two results came from opposite corners of the field, one from a graduate student’s improvised rig of gaming cards and one from a thousand-machine cluster inside the world’s most sophisticated data center, and they pointed in exactly the same direction. The old idea had not needed a new theory. It had needed to be fed.

Google understood the implication faster than anyone, because Google was the one player that already owned both halves of the equation. It had the data, generated by a billion users a day. And it had the compute, the warehouses of machines that Dean and his colleagues had spent a decade learning to drive. What it lacked was the cluster of human beings who understood the old idea most deeply, the small priesthood that had kept neural networks alive through the years when admitting you worked on them could end a career. So Google bought them.

The acquisition that mattered most arrived in early 2013. After the bidding war that Geoffrey Hinton had run, by email, from a hotel room at a casino in Lake Tahoe, his tiny company of three was folded into Google, and Hinton himself came with it. He did not move to California or abandon the University of Toronto; the terms let him keep one foot in academia and spend the rest of his time inside the company. What he walked into was the infrastructure that academia could never give him. For thirty years Hinton had been the believer who could not get enough computers. Now he had Jeff Dean’s machines.

The arrangement made the two halves of the field’s history visible in one building. Hinton represented the idea, carried through exile by stubbornness; Dean represented the means, built up over a decade with no particular thought of artificial intelligence at all. MapReduce and Bigtable had been designed to index the web and serve ads, not to train neural networks. But the same fleets of machines, the same hard-won skill at making thousands of computers behave as one, turned out to be precisely what the resurrected idea required. The infrastructure had been waiting, in a sense, for a use no one had foreseen when it was built.

What that fusion produced, almost immediately, was an appetite. Once you have seen that a network gets better as you make it bigger, the obvious next move is to make it bigger again, and then again, and the obvious obstacle is the hardware. The processors and graphics cards Google had on hand were general-purpose machines, built to do many things adequately rather than one thing perfectly. Dean and a handful of others began to wonder what would happen if Google designed a chip from scratch to do nothing but run neural networks, a piece of silicon shaped to the exact contours of the math. The thought would harden, over the following years, into a project that gave Google its own custom processors and changed the economics of the entire enterprise. That is a story for later. The seed of it was planted here, in the recognition that the bottleneck on intelligence was no longer the idea, and no longer even the data, but the chips and the data centers that fed the idea its data.

The center of gravity had shifted, and by more than the cat experiment alone suggested. For most of its history, artificial intelligence had been a science, which is to say an argument among researchers about which approach was correct, settled in papers and at conferences. With the Google Brain it became something else as well: an industrial process, settled in part by who controlled the most computers. A graduate student with a brilliant idea could still change the field, as the Toronto students had. But to push the idea to its limit, the student now needed to be standing next to a data center, which meant standing inside one of a small number of companies, which meant the future of the technology was bound up with the future of those companies in a way it had never been before. The professors had not lost. Many of them, like Hinton, simply moved in.

Dean stayed at the center of it. He would go on to lead Google’s artificial intelligence research as the Brain absorbed more of the company and the company organized more of itself around the Brain, and the “Jeff Dean facts” kept multiplying, now with a faint edge of truth to the absurdity, because the man really had built much of the substrate that the new era would run on. The joke had always been that Dean could bend physics. The reality was narrower and more interesting. He could bend a thousand machines to a single purpose, reliably, at a scale almost no one else could manage, and in an age when intelligence turned out to be a function of how much computation you could marshal, that was nearly the same thing.

What none of it answered was the question of what all this compute was for. Google had the data, the chips, the talent, and a clear commercial reason to want better image recognition and better translation and better ads. The technology served the products. But a few miles of cultural distance away, in a small office in London, a former chess prodigy had founded a company around a far stranger premise. He did not want to build better products. He said he wanted to build a mind, and then to use that mind to solve everything else, and he had managed to say it with enough seriousness that the people with the data centers were starting to listen.