Part II · Chapter 11

Expansion

Deep learning swallows speech, vision, and translation; George Dahl topples the statistical old guard. → How the technique generalized faster than anyone expected.

The strange thing about the result that upended the speech-recognition community was that almost nobody in the room understood what they were looking at. It was around 2010, and a graduate student named George Dahl was presenting results that should have been impossible. For roughly two decades the science of teaching computers to understand spoken words had run on a single combination of methods, refined to a high polish by hundreds of careful researchers: hidden Markov models to track how sounds unfold in time, and Gaussian mixtures to decide which sound was being made at each instant. Every improvement for twenty years had come from tuning that machinery, a fraction of a percent at a time. The acoustic model was the part that turned raw sound into a guess about which phoneme had just been spoken, and it was the part everyone had given up expecting to leap. Dahl had replaced it with a deep neural network, and the error rate had not crept down. It had fallen off a cliff.

The people who built speech systems for a living were among the most sophisticated machine-learning practitioners on earth. They had not been lazy or unimaginative. They had simply been certain, on the basis of two decades of evidence, that the Gaussian-mixture approach was the right tool for the job and that neural networks were a charming dead end someone tried every few years before going back to real work. Now a student from Geoffrey Hinton’s lab in Toronto was showing numbers that made their life’s specialty look like the slow path. There is a line, attributed within that community, that captures the strangeness of the moment: George had wiped out the whole field without even knowing its name. Dahl was not a speech person. He did not carry the decades of accumulated intuition about formants and triphones and pronunciation dictionaries. He had a tool that worked, and he pointed it at a problem, and the problem fell.

What made the result so unsettling went beyond the fact that a neural network had beaten the old guard on one benchmark. It was the manner of the beating, and what it implied. Dahl’s network knew none of what the engineers had so painstakingly encoded about speech. It had no built-in model of how the human vocal tract produces sound, no hand-tuned features designed by people who understood acoustics. It learned its own features from raw spectrograms, layer by layer, the way Hinton’s deep networks had been taught to learn anything. The implication, once it sank in, was vertiginous. If a generic learning machine with no domain knowledge could outperform twenty years of specialist craft in speech, then the specialist craft had never been the source of the performance. The data and the architecture were. And if that was true in speech, there was no obvious reason it would not be true everywhere.

The path that led Dahl to that conference room ran back through a small Toronto lab that had spent years being ignored. Hinton’s group had been refining ways to train networks with many layers, and around 2009 two of his students, Dahl and Abdel-rahman Mohamed, started applying those networks to a standard speech benchmark called TIMIT. The early results were good enough that Microsoft and Google both took notice and brought the Toronto students in to try the approach on real, industrial-scale data, the kind with thousands of hours of messy human speech rather than a tidy academic corpus. At Microsoft, a senior researcher named Li Deng had been independently convinced that deep learning might break the logjam, and he became one of the approach’s most important champions inside a large company. By 2012 the largest technology companies in the world were quietly swapping out the acoustic models that powered their voice products. The change was invisible to users. Voice search and dictation simply got better, fast, in a way they had not in years.

The reason this transfer was possible, and the reason it kept happening across one field after another, is worth stopping on, because it is the whole story of these years. Before deep learning, a researcher who wanted a machine to do something hard had to teach it what to look for. In speech you designed features that captured the acoustics of vowels and consonants. In vision you designed features that captured edges and corners and textures. In machine translation you built elaborate statistical tables of which phrases tended to translate to which other phrases. Each field had its own bespoke toolkit, refined over decades by people who had devoted their careers to understanding that specific domain. The features were the expertise. They were what a graduate student spent years learning to build.

A deep neural network dissolved that arrangement. Instead of being told what features to look for, it discovered its own, from the bottom up, given enough examples. The early layers learned simple patterns, the later layers combined them into more abstract ones, and the whole stack was shaped by nothing but the gap between its guesses and the right answers. This meant that the same basic recipe, a deep network trained on a large labeled dataset, could be aimed at almost any perceptual problem. The thing that had made each field a separate priesthood, its hand-crafted features, was exactly the thing the network made unnecessary. A method that learns its own features does not care whether the features are acoustic or visual or linguistic. It is, in a sense that the older specialists found hard to accept, indifferent to the domain.

Vision had supplied the first proof that the method traveled, in the same year the speech transfer was completing. The image-recognition result on the ImageNet benchmark in 2012 had cut the field’s error rate roughly in half, and it had done so with the same family of techniques, a deep network learning its own visual features from a large pile of labeled photographs. What had looked at first like a single spectacular result in one corner of computer science began, over the next few years, to look like the leading edge of something general. Google folded the technology into its products at a pace that surprised even the people building it. Photos became searchable by their contents, so that a user could type “dog” or “beach” and find pictures they had never tagged. The same machinery began creeping into medical imaging, where networks trained on labeled scans started matching specialists at spotting particular abnormalities. The image classifier that had won a contest was becoming infrastructure.

Then came translation, and translation was the field that made the displacement impossible to ignore, because it had been the rival approach’s most celebrated success for two decades. Statistical machine translation had ruled since the 1990s. It worked by chopping sentences into phrases, looking up how those phrases had been translated in enormous collections of human-translated text, and stitching the pieces back together with a model of which word orders sounded plausible in the target language. It was a triumph of statistics over the older dream of teaching a computer grammar by hand, and it powered the translation services that hundreds of millions of people used every day. It was also, by the mid-2010s, stuck. The phrase-based systems produced translations that were useful but clumsy, full of the telltale word-salad that everyone recognized as machine output.

The neural alternative had been incubating in plain sight. In 2014 a cluster of papers introduced a deceptively simple idea: train one network to read an entire sentence in one language and compress its meaning into a vector of numbers, then train a second network to expand that vector into a sentence in another language. Sequence to sequence, the approach was called. The foundational paper came from Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google, with a parallel line of work from Kyunghyun Cho in Yoshua Bengio’s Montreal group. A refinement followed almost immediately, from Bengio’s lab, that let the decoder look back at the original sentence as it worked rather than relying on a single fixed summary, a mechanism its authors called attention. It would matter enormously later, in ways nobody in 2014 fully grasped, but for the moment its job was modest: it made neural translation good enough to be more than a research curiosity.

In September 2016, Google made the switch official. It announced that Google Translate would move from its phrase-based statistical engine to a neural system, the Google Neural Machine Translation system, starting with Chinese-to-English and expanding from there. The company’s own paper claimed the new system reduced translation errors by large margins over the old one on several language pairs, in some cases narrowing the gap to human translation by more than half on Google’s internal measures. The figures were Google’s own, and like all vendor benchmarks they flattered the product; the system still made plenty of mistakes, and professional translators were quick to point out where it failed. But the direction was unmistakable. Two decades of phrase-based statistical machine translation, the field’s proudest achievement, were being retired in favor of the same kind of network that had already taken speech and vision. The third domino had fallen, and it had fallen the same way as the first two.

For the people whose expertise was suddenly worth less than it had been the year before, the experience was disorienting in a way the triumphant press releases never captured. This was not a new feeling in the field. Years earlier a Microsoft researcher named Chris Brockett had watched statistical methods make his seven years of hand-built linguistic rules obsolete nearly overnight, and the memory of that vertigo was instructive: the people who had once been the disruptors, the statisticians who displaced the rule-writers, were now themselves the old guard being displaced. The wheel had turned again, and faster. A computational linguist who had spent a career learning the grammar of a dozen languages, or a speech engineer who could hear the difference between two acoustic models the way a sommelier tastes wine, found that the most valuable skill in the room had become something else entirely: knowing how to assemble a large dataset, design a network, and run it on enough hardware. The expertise that had taken a career to build could not be transferred to the new regime, because the new regime did not use it.

There was a hard question buried in all of this, and the honest researchers asked it of themselves. Had the decades of specialist work been wasted? The generous answer, and probably the true one, was that the old methods had kept these problems alive and defined during the years when neural networks could not be trained well enough to touch them. The statistical translators had built the parallel-text corpora that the neural systems then trained on. The speech engineers had defined the benchmarks and assembled the labeled audio that made the deep networks’ victory measurable. The features they hand-crafted had encoded real understanding of the problems, and that understanding did not vanish; it had simply been absorbed, made implicit, learned rather than specified. But the harder truth underneath the generous one was that the particular skills, the specific craft of building a good Gaussian-mixture acoustic model or a good phrase table, were now museum pieces. The people were not worthless. Their decades of specialized technique mostly were.

What unsettled even the winners was how few ideas were doing all the work. A skeptic might have expected that conquering speech, vision, and translation would require three different breakthroughs, three insights tailored to three very different kinds of data. Instead it was substantially one idea, deep networks that learn their own features, applied three times with variations. The architecture differed in the details. Vision used convolutional networks that slid feature detectors across an image. Speech and translation leaned on networks that processed sequences in order. But the core bet was identical, and so was the recipe: more data, more layers, more compute, fewer hand-built assumptions. The field had spent fifty years believing that intelligence would have to be engineered piece by piece, each capability requiring its own clever design. The years from 2010 to 2016 suggested something the older researchers found almost offensive in its crudeness, that a single general method, fed enough examples and enough computation, could swallow problem after problem that had each been thought to need its own science.

George Dahl, who had started it in that conference room, was an early and almost accidental agent of a shift much larger than speech. He went on to a long career at Google, one figure among many whose work helped retire the methods they had grown up with. The acoustic-modeling community he had upended did not disappear; many of its best people simply learned the new tools and carried their hard-won intuitions about sound into the deep-learning era, where the intuitions still helped even when the old machinery did not. The same happened in translation and in vision. The fields were not destroyed so much as converted, sometimes against the will of their elders, to a common method that did not respect the borders between them.

By 2016 the pattern was clear enough to name. Deep learning had turned out to be more than a better speech recognizer or image classifier or translator. It was a general solvent, and one field after another was dissolving in it. It had taken decades to get neural networks to work at all, and only a handful of years, once they did, for them to conquer three of the hardest problems in computing, problems that had each supported entire research communities for a generation. The implication that hung in the air, unspoken at most conferences but impossible to miss, was that the same thing might keep happening. If one method could take speech and vision and translation in little more than five years, the open question had stopped being whether it would take the next field. It was which field, and how soon. With the technique proven and general, and the money and talent already pouring in, the only thing left unsettled was who would own it.