Part I · Chapter 4

Breakthrough

AlexNet, two GPUs in a bedroom, and the 2012 ImageNet result that cut the error rate nearly in half. → The proof that scale and GPUs change everything: the hinge of the modern story.

George Dahl had a result he did not entirely believe, and a conference badge that gave him almost no standing to present it.

It was December 2009, in Whistler, British Columbia, where the Neural Information Processing Systems conference held its annual run of post-conference workshops up in the ski-resort sprawl above Vancouver. Dahl was a graduate student in Geoffrey Hinton’s lab at the University of Toronto. So was the man sitting beside him, Abdel-rahman Mohamed, an Egyptian engineer who had come to Canada to study machine learning and had been handed, more or less by accident, a problem nobody else in the lab wanted. The problem was speech.

That a speech problem ended up in Hinton’s lab at all was something of an accident, and the accident ran through Mohamed. He had not come to Toronto to work on sound. He had come to work on machine learning in the abstract, the way Hinton’s students tended to, and the speech question had been lying around the lab the way a hard unclaimed problem does, half-interesting and a little radioactive, the sort of thing a senior student knows better than to pick up. Mohamed picked it up. He had an engineer’s tolerance for problems that did not want to be solved, and an outsider’s freedom from the field’s accumulated certainties about what could not be done. He did not know enough about speech recognition to know that what he was attempting was supposed to be hopeless. That turned out to be an advantage.

Speech recognition in 2009 was a mature field with a settled orthodoxy and a graveyard of people who had tried to overturn it. The standard system had two halves. A Gaussian mixture model described what the little slices of sound looked like, statistically; a hidden Markov model stitched those slices into the most probable sequence of words. The acronym was GMM-HMM, and everyone who worked on speech for a living had spent a career tuning it. The pairing had been refined for two decades, in industrial labs with budgets the size of small countries, against benchmarks that everyone agreed on. The big speech houses, IBM and AT&T and Microsoft and the academic groups feeding them students, had turned the tuning into a discipline of its own, a craft of features and adaptations and smoothing tricks passed down like a trade. Newcomers who arrived announcing that they would replace it with something cleverer tended to leave, a year or two later, having learned why it was hard.

Neural networks had been tried on speech before. In the late 1980s and early 1990s, researchers had bolted small networks onto the front of HMM systems and gotten modest gains, then watched the approach stall when the networks refused to go deep. A network with one hidden layer could be coaxed into learning. A network with many layers, trained the only way anyone knew how, with errors propagated backward from the output, simply did not improve. The gradients that were supposed to carry the correction signal back through the layers grew faint and then vanished. By the time the signal reached the early layers there was almost nothing left to learn from. People in the field had a name for the wall, and a quiet consensus that it could not be climbed.
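The fading can be seen in a toy calculation. The sketch below is an illustration under stated assumptions, not a reconstruction of any system from the period: it assumes a chain of layers with a single sigmoid unit each and a fixed weight of 1.0, and the function name and parameters are hypothetical. Backpropagation multiplies the correction signal by one factor per layer, and for a sigmoid unit with that weight the factor can never exceed 0.25, so the signal shrinks geometrically on its way down.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_magnitudes(n_layers, weight=1.0, x=0.5):
    """Size of the backpropagated signal at each layer of a toy network.

    The network is a chain of one-unit sigmoid layers (an assumption made
    for illustration). The returned list starts next to the output and
    ends at the first layer.
    """
    # Forward pass: remember every activation along the chain.
    activations = [x]
    for _ in range(n_layers):
        activations.append(sigmoid(weight * activations[-1]))

    # Backward pass: the chain rule multiplies in one factor per layer.
    grad = 1.0
    magnitudes = []
    for k in range(n_layers, 0, -1):
        a = activations[k]
        grad *= weight * a * (1.0 - a)  # derivative of sigmoid(weight * input)
        magnitudes.append(abs(grad))
    return magnitudes


if __name__ == "__main__":
    for depth, size in enumerate(gradient_magnitudes(10), start=1):
        print(f"{depth} layers back from the output: {size:.2e}")
```

With ten layers, the signal that reaches the first layer is already smaller than one millionth of what it was at the output.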

What Dahl and Mohamed had on their laptops in Whistler was a network with many layers that worked.

They had built it on top of an idea Hinton had published three years earlier, in 2006, and which had quietly reopened a door the field assumed was nailed shut. The trick was not to train all the layers at once. It was to train them one at a time, from the bottom up, each layer learning to reconstruct the patterns in the layer below it without anyone telling it what the right answer was. Hinton called these stacked layers a deep belief network, and the layer-by-layer procedure pretraining. After the stack had taught itself the rough structure of the data, you could connect it to a final output layer and fine-tune the whole thing with ordinary backpropagation. The pretraining gave the network a sane starting point. From there the gradients had something to grab. The wall, it turned out, had a staircase behind it, and Hinton’s lab had found the bottom step.
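The procedure can be sketched in a few dozen lines. What follows is a minimal illustration under loud assumptions, not Hinton's method as published: the deep belief networks of 2006 were built from restricted Boltzmann machines, and this sketch swaps in small tied-weight autoencoders because they are shorter to write down. The function names, the toy data, and the settings are all hypothetical, and the final supervised fine-tuning pass is left out. What it does show is the shape of the idea: each layer learns, without labels, to reconstruct the output of the layer beneath it, and only then is the next layer trained on top.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b):
    """Hidden activations of one layer for a single input vector x."""
    return [sigmoid(sum(w * x_i for w, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]


def train_layer(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Train one layer, without labels, to reconstruct its own input.

    Uses a tied-weight autoencoder (an assumption made for brevity, not the
    restricted Boltzmann machine of the original work). Returns the encoder
    weights, the hidden biases, and the mean reconstruction loss per epoch.
    """
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden  # hidden biases
    c = [0.0] * n_in      # reconstruction biases
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = encode(x, W, b)
            # Linear reconstruction through the transposed weights.
            r = [sum(W[j][i] * h[j] for j in range(n_hidden)) + c[i]
                 for i in range(n_in)]
            err = [r_i - x_i for r_i, x_i in zip(r, x)]
            total += 0.5 * sum(e * e for e in err)
            # Gradient of the loss with respect to each hidden pre-activation.
            dh = [sum(W[j][i] * err[i] for i in range(n_in)) * h[j] * (1.0 - h[j])
                  for j in range(n_hidden)]
            for j in range(n_hidden):
                for i in range(n_in):
                    W[j][i] -= lr * (err[i] * h[j] + dh[j] * x[i])
                b[j] -= lr * dh[j]
            for i in range(n_in):
                c[i] -= lr * err[i]
        losses.append(total / len(data))
    return W, b, losses


def pretrain_stack(data, layer_sizes):
    """Greedy bottom-up pretraining: each new layer learns from the one below."""
    layers, current = [], data
    for n_hidden in layer_sizes:
        W, b, losses = train_layer(current, n_hidden)
        layers.append((W, b, losses))
        current = [encode(x, W, b) for x in current]  # input for the next layer
    return layers, current


# Toy data (hypothetical): six-bit patterns with obvious repeated structure.
TOY_DATA = [
    [1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1],
]
```

Calling pretrain_stack(TOY_DATA, [4, 2]) trains a four-unit layer on the raw patterns and then a two-unit layer on the first layer's output. In the full recipe, the stacked weights would then serve as the starting point for ordinary backpropagation.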

The 2006 papers had been demonstrated mostly on images of handwritten digits, the discipline’s tidy sandbox. The question hanging over the lab was whether the staircase led anywhere that mattered. Speech was a brutal place to find out. It had real benchmarks, a real industry, and a real establishment that would not be charmed by a clever toy.

The benchmark Mohamed and Dahl chose was called TIMIT, a corpus of American English speech recorded in the 1980s by Texas Instruments and MIT, which is where the name came from. TIMIT was small and old and exhaustively studied. Every serious speech group on earth had run its system against it. The numbers it produced were known to three decimal places and trusted like a tide table. If you posted a better TIMIT result, nobody could wave it away by saying you had tuned your test to flatter your method. The test had been fixed in place for two decades.

Mohamed’s network read the sound, frame by frame, and tried to name the phoneme, the smallest unit of spoken sound, that each frame belonged to. It was a narrow slice of the full speech problem, but it was the slice everyone measured. He fed the network not raw waveform but the standard acoustic features the field already used, the same numerical description of each sliver of sound that the Gaussian systems consumed. The network’s only job was to do better at the same task with the same inputs. When the phone error rate came back, it landed in the low twenties, percentage-wise, better than the best published Gaussian mixture systems on the same test. Not by a rounding error. By enough that Dahl, checking it, assumed he had made a mistake, found no mistake, and then sat with the uncomfortable feeling of having done something that the literature said should not be possible.

The number was suspicious precisely because it was clean. There is a particular dread that comes over a careful researcher when an experiment works too well, because the most likely explanation is always that you have fooled yourself, that some answer has leaked from the test set into the training, that you are measuring your own mistake. Dahl and Mohamed went back through the pipeline looking for the leak. There was no leak. The result was real, and it was theirs, and it contradicted twenty years of consensus, which is an exhilarating and slightly frightening thing for a graduate student to be holding.

The man who understood what they had before almost anyone else was sitting in the audience because he had arranged to be there.

Li Deng was a principal researcher at Microsoft Research in Redmond, Washington, and he had spent his career inside the speech orthodoxy, building the kind of systems Mohamed’s network had just embarrassed. Deng was not a neural-network man. He was a speech man, a Chinese-born engineer with a doctorate from Wisconsin and a deep technical command of the GMM-HMM machinery that had defined his field. But Deng had grown restless with it. The gains had been getting smaller every year, a fraction of a percent wrung out of ever more elaborate engineering, and he had begun to suspect the architecture itself was the ceiling.

In 2009 Deng had invited Hinton to Microsoft to talk, and the conversation had turned to speech. The two of them, the believer in neural nets and the insider who knew where the bodies were buried in speech, organized the Whistler workshop together. Its title named the subject plainly, deep learning for speech recognition, and its purpose was to drag a small heretical result in front of the people who could either ignore it or act on it. Most of the room was speech researchers who had come, in part, to see whether the Toronto students were serious or whether this was another neural-network revival that would die on contact with a real benchmark.

The TIMIT number made them sit up. A small benchmark was still only a small benchmark, and the veterans in the room knew the difference between beating TIMIT and running a product that transcribed millions of phone calls without falling over. But the result was clean and the test was honest, and that combination was rare enough to be worth a flight to Whistler. Deng did not treat it as a curiosity. He treated it as a lead. He offered Dahl and Mohamed internships at Microsoft Research, where they would have access to something Toronto did not have: data at the scale of an actual business, and a system in production to measure themselves against.

That was the inflection, though nobody in the room would have called it that yet. The speech establishment did not surrender at Whistler. Some of the veterans were skeptical in the specific, informed way of people who had watched neural networks promise and underdeliver before, and they were not wrong to be: a single phone-recognition benchmark, however honest, was a long way from a deployed product. But the room agreed, for the first time in years, that the thing it had buried might be worth digging up. Deng’s credibility was part of why. He was not a convert arriving from outside to tell speech people their life’s work was obsolete. He was one of them, fluent in their methods and trusted in their hallways, and when an insider of his standing took a heresy seriously, it became permissible for others to take it seriously too.

The internships at Redmond mattered because TIMIT was a laboratory and Microsoft was the world. TIMIT asked the network to name phonemes in carefully recorded sentences read by cooperative speakers. A shipping speech product had to handle a stranger mumbling a restaurant name into a phone on a windy street. The gap between the two was enormous, and it was exactly the gap where neural-network revivals had died before, looking strong on the benchmark and collapsing in the field.

Dahl, working at Microsoft with Deng and a researcher named Dong Yu, took the approach up to large-vocabulary recognition, the real thing, on data drawn from Microsoft’s voice search for businesses. They built what came to be written as CD-DNN-HMM, a context-dependent deep neural network feeding a hidden Markov model, which kept the parts of the old system that worked and replaced the part that did not, the Gaussian mixture, with a deep network. On Bing’s voice search task the deep network cut the sentence error rate by roughly a sixth to a quarter relative to the carefully tuned Gaussian system it replaced. In a field that celebrated single-percent gains, a relative cut of that size was not an improvement. It was a different machine.

This is the part of the story that gets told least often, and it is the part that matters most for understanding what came after. The deep-learning revolution is remembered as a vision story, a story about a network that recognized photographs of dogs and mushrooms in 2012 and stunned a roomful of computer-vision researchers. That moment was real and it is coming. But the proof of concept, the first time a deep network walked into a mature field with a settled orthodoxy and an honest benchmark and won, happened in speech, and it happened first. Speech was the canary. Vision was the explosion everyone remembered.

There was a reason the canary sang when it did, and it had less to do with cleverness than with two boring forces that had been gathering quietly underneath the science: money to keep the people alive, and hardware to make the math go fast.

The money came from an unlikely source, a Canadian research charity, and it had been keeping Hinton’s small tribe together for years before anyone outside it noticed. In 2004 the Canadian Institute for Advanced Research, CIFAR, had launched a program with the unwieldy name Neural Computation and Adaptive Perception and made Hinton its director. CIFAR did not fund laboratories or buy equipment. It funded a network of people, a few dozen researchers scattered across institutions, and it paid for them to come together, repeatedly, in the same rooms.

This sounds like a small thing. It was not. By the mid-2000s neural networks were close to academic poison. Reviewers rejected the papers, hiring committees passed over the people, and a graduate student who announced an intention to work on deep networks was advised, kindly, to think about their career. The field had survived two winters and was deep in a third. What CIFAR bought was permission. Twice a year the program gathered its members, Hinton in Toronto, Yann LeCun in New York, Yoshua Bengio in Montreal, and a rotating cast of younger researchers, and for a few days they could speak to people who took the work seriously instead of defending it to people who did not. They shared half-finished results that no conference would have accepted. Pretraining, the staircase behind the wall, was hashed out in these rooms before it was published. So was much of what the students would carry into industry.

It was a hothouse, deliberately kept warm against the climate outside. The total budget over the lean years was a rounding error against what a single company would later spend in a quarter chasing these same ideas. But it kept the community from dispersing, and a community is what a field is. The students mattered most. A program that pays for senior professors to meet twice a year is pleasant; a program that lets those professors keep producing students who believe in an unfashionable idea is how the idea survives a generation. Dahl and Mohamed were Hinton’s students. The students who would soon take deep networks into Google and Microsoft and a dozen startups had been raised inside the warmth CIFAR paid for. When the results finally came, there was a group ready to produce them, trained by one another, who already knew and trusted one another’s work. The institute, which would later rename the program Learning in Machines and Brains, had funded the survival of an idea through the years when survival was the only available victory.

The second force was hardware, and the people who saw it first were not in Toronto.

In 2009, the same year as the Whistler workshop, three researchers at Stanford published a paper with a title that read like an engineering memo: large-scale deep unsupervised learning using graphics processors. The lead author was Rajat Raina, the senior author was Andrew Ng, and the middle author, Anand Madhavan, had done much of the work of bending a piece of consumer entertainment hardware to a purpose its makers had never imagined.

The graphics processing unit existed to render video games. Its job was to compute, in parallel, the color of millions of pixels many times a second, which meant it was built to do enormous numbers of simple arithmetic operations all at once. A central processor, the brain of an ordinary computer, was a brilliant generalist that did one complicated thing at a time, very fast. A GPU was an army of dim arithmeticians doing the same dumb thing in lockstep, by the thousand.

It happened that the dumb thing the GPU did best, multiplying big grids of numbers, was precisely the operation at the heart of a neural network. Training a deep network is, underneath the biology metaphors, a vast pile of matrix multiplication repeated until the numbers settle. For years the field had been doing that math on processors that were never designed for it, and training a serious network could take weeks. The Stanford group showed that the same training, moved onto graphics cards, ran faster by more than an order of magnitude. The paper reported speedups around seventy times for some of their models, and showed deep networks with on the order of a hundred million connections, far larger than anyone had been training, made tractable on hardware a graduate student could buy.
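To make the arithmetic concrete, here is a minimal sketch of one network layer as a matrix multiplication, written in plain Python with hypothetical names. It is an illustration, not the Stanford group's code. The point to notice is that every cell of the output is computed independently of every other cell, which is exactly the kind of work a graphics card can hand out to thousands of arithmetic units at once.

```python
def matmul(A, B):
    """Multiply an (n x k) grid by a (k x m) grid with plain nested loops."""
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    # Each output cell depends only on one row of A and one column of B,
    # so all n * m cells could be computed at the same time.
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def multiply_add_count(batch, n_inputs, n_outputs):
    """Multiply-adds needed to push one batch through one dense layer."""
    return batch * n_inputs * n_outputs


if __name__ == "__main__":
    inputs = [[1, 2], [3, 4]]    # two examples, two features each
    weights = [[5, 6], [7, 8]]   # two features in, two units out
    print(matmul(inputs, weights))
    print(multiply_add_count(256, 1000, 1000))
```

A single layer mapping 1,000 inputs to 1,000 outputs for a batch of 256 examples takes 256 million multiply-adds, and training repeats passes like that again and again.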

That number, an order of magnitude and then some, was the difference between an experiment you could run and an experiment you could not. A network that took a month to train was a network you could try a few times a year. A network that took a day was one you could iterate on, fail with, fix, and try again before the week was out. Research moves at the speed of its feedback loop, and the GPU collapsed the loop. The gaming industry, chasing ever more lifelike explosions, had spent two decades and billions of dollars building, without meaning to, the cheapest supercomputer the machine-learning world had ever seen. It was sitting in the chassis of teenagers’ computers, waiting for someone to point it at the right problem.

Programming the thing was its own ordeal. The graphics cards had not been built for general computation, and bending them to it meant writing code in the cards’ own awkward idiom, thinking about memory and parallelism in ways a normal programmer never had to. It was finicky, low-level work, closer to hardware engineering than to the clean mathematics of a machine-learning paper, and it rewarded a particular kind of person: someone willing to spend weeks coaxing a few hundred lines of code into running ten times faster, who found that kind of grinding optimization satisfying rather than beneath them. The field did not have many of those people yet. The ones it had were about to become very valuable.

Put the pieces together and the breakthrough stops looking like magic. The science, the staircase that let networks go deep, had been worked out in Hinton’s lab and rehearsed in CIFAR’s rooms. The hardware, the cheap parallel math machine, had arrived from the games industry and been pointed at deep networks by Ng’s group at Stanford. And the proving ground, the honest benchmark in a mature field, had been offered up by speech, where Deng knew the orthodoxy was ripe and Mohamed and Dahl had a method that beat it. Each piece had a long, separate history. They converged inside about three years.

Once the proof existed, industry moved with a speed that startled the academics who had spent decades being ignored.

At Google, the man who carried the work into production was Vincent Vanhoucke, a French-born engineer who had been working on the company’s speech systems. Google had a speech problem of unusual urgency, because Google had Voice Search, and Voice Search meant millions of ordinary people speaking queries into phones, expecting the right answer, every day. The economic value of a one-percent improvement in that error rate was enormous, and Vanhoucke’s team, around 2011 and into 2012, swapped the Gaussian mixture acoustic models for deep neural networks. The word error rate dropped by something in the range of twenty to twenty-five percent relative, the kind of jump the field had not seen from a single change in living memory. It went into production, serving real users at a scale that no academic benchmark could simulate.

Microsoft, where the large-vocabulary work had been done, moved its own systems. IBM, the third great speech house, moved too. Within roughly two years of the Whistler workshop, the three largest speech-recognition operations in the world had abandoned the architecture that had defined their field for two decades and adopted the one a handful of graduate students had demonstrated on an old benchmark in a ski town. The orthodoxy did not so much fall as evaporate. The veterans who had spent careers tuning Gaussian mixtures found themselves, almost overnight, tuning deep networks instead.

The speed of it was the strange part. Mature engineering fields do not turn on a dime. They have installed bases, trained staff, papers in flight, reputations staked on the existing way. A speech researcher in 2009 had often spent fifteen years acquiring an expertise that the new approach made partly obsolete. There is a version of this story in which the establishment defends its turf for a decade, the way fields often do, and the deep networks win slowly and bitterly. That is not what happened. What happened instead was that the gains were too large to argue with and the people best positioned to argue, the insiders like Deng and Yu and Vanhoucke, were the ones leading the change rather than resisting it. The orthodoxy switched sides. The most credible defenders of the old system became the builders of the new one, and a transition that might have taken a generation took about two years.

The cultural effect went well beyond speech. For thirty years the people who believed in neural networks had been able to say only that the idea was elegant and ought to work. They had faith and theory and a handful of demonstrations on toy problems. What they had never had was the thing that ends an argument in engineering: a result, in production, in a field that had every reason to resist it, that no one could explain away. Speech gave them that. It was a public, measurable, commercially deployed proof that when you stacked depth and data and compute in the right proportions, the networks won outright.

And the computer-vision people were watching. They had their own settled orthodoxy, their own honest benchmarks, their own decades of hand-engineered methods that everyone agreed were the state of the art. They had, lately, something the speech people had lacked at the start: a dataset so large it bordered on the absurd, assembled by a researcher most of the field had thought was wasting her time. The recipe that had just toppled speech, depth plus data plus cheap parallel hardware, was about to be aimed at a competition called ImageNet, and the explosion this time would be one the whole world heard.

The dataset was the work of Fei-Fei Li, a Stanford computer scientist who had spent the late 2000s on a project her colleagues considered eccentric to the point of folly. The reigning wisdom in machine learning held that better algorithms made better systems. Li suspected the opposite, that the algorithms were starved, that what they lacked was not cleverness but examples, in the way a child learns to recognize a dog only after seeing a great many of them. So she set out to build a training set on a scale no one had attempted: not thousands of labeled images but millions, scraped from the internet and sorted into thousands of categories by an army of anonymous workers hired through Amazon’s Mechanical Turk to do the labeling one picture at a time. The result, released around 2009, was ImageNet, a labeled photographic catalog of the visual world. To make it a contest, Li and her collaborators launched the ImageNet Large Scale Visual Recognition Challenge, an annual test in which programs competed to name the dominant object in photographs they had never seen. For its first two years the winners were elaborate hand-engineered systems, and they improved on one another by the usual fractions.

Alex Krizhevsky entered the 2012 challenge from his bedroom. He was the quiet, intense student from Hinton’s Toronto lab, the kind of programmer who would rather wrestle code into running faster than explain to anyone what he was doing, and he had a particular talent for the finicky, low-level GPU work that the moment demanded. Working with Ilya Sutskever, who supplied the conviction that the approach would scale, and with Hinton advising, he built a deep convolutional network, a descendant of the kind LeCun had pioneered for check-reading, but far larger, with some sixty million parameters, and trained it not on a cluster but on two consumer NVIDIA GTX 580 graphics cards installed in a computer at his parents’ house. The cards had three gigabytes of memory each, which was not enough, so Krizhevsky split the network across the two of them and spent days hand-tuning the code that let them share the load. Training took the better part of a week, the cards running hot, the error rate falling and falling.

When the 2012 results were posted, the network, which the field would come to call AlexNet, had a top-five error rate of 15.3 percent. The second-place entry, a conventional system of the established kind, came in at 26.2 percent. In a competition where the state of the art moved by a percent or two a year, a graduate student working on gaming hardware had beaten the best in the world by nearly eleven points, cutting the error rate by more than forty percent. It was the same shape of result that speech had produced three years earlier, the same too-large-to-argue-with margin, except that this time the field watching was computer vision, and computer vision had a much larger audience.
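The size of the margin is easy to check. The small sketch below, with illustrative function names, separates the two ways of quoting it: the absolute gap in percentage points, and the relative reduction, which is the fraction of the runner-up's errors that the winner eliminated.

```python
def absolute_gap(old_error, new_error):
    """Difference between two error rates, in percentage points."""
    return old_error - new_error


def relative_reduction(old_error, new_error):
    """Fraction of the old error rate that the new system removed."""
    return (old_error - new_error) / old_error


RUNNER_UP_TOP5 = 26.2  # percent, second-place entry in 2012
ALEXNET_TOP5 = 15.3    # percent, AlexNet in 2012

if __name__ == "__main__":
    print(round(absolute_gap(RUNNER_UP_TOP5, ALEXNET_TOP5), 1))
    print(round(relative_reduction(RUNNER_UP_TOP5, ALEXNET_TOP5), 3))
```

With the two figures above, the gap is 10.9 points and the relative reduction is about 0.416, or roughly 42 percent.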

The reaction came at a workshop in Florence that October, where the result was presented at the European Conference on Computer Vision. The room understood at once. Yann LeCun, who had built the first convolutional networks and spent the wilderness years watching the field ignore them, was among those who grasped immediately what had happened: the methods he and Hinton had kept alive through two winters had just won the discipline’s hardest open contest, in public, by a margin no one could explain away. Those who were in the room remember a charged quality to the discussion, the sense of a settled field discovering in the space of a talk that it was no longer settled. The hand-engineered approaches that had defined computer vision were, as of that afternoon, the past.

Everything in the prologue followed from this. The crowd that overflowed the room at Lake Tahoe two months later, the four companies bidding by email for a three-person company with no product, the forty-four million dollars, the man who would not sit down running an auction over a hotel-room laptop: all of it was the market reacting to the number Krizhevsky had produced in his bedroom. The idea that had been buried in 1969 and laundered out of paper titles in the 1990s had done the one thing that ends an argument in this field. It had won a fair contest, twice, and the second time the whole world was positioned to see it.