Part I · Chapter 2

Promise

Geoff Hinton, backpropagation, and the connectionist underground that refused to let the idea die. → How a few true believers kept a discredited idea alive against the consensus of their field.

“Learning representations by back-propagating errors.” — David Rumelhart, Geoffrey Hinton, and Ronald Williams, Nature, 1986

Before he was the man who would auction his company from a hotel-room floor, before the field he believed in turned out to be right, Geoffrey Hinton was a carpenter. He had a degree from Cambridge in experimental psychology, taken after he had abandoned, in turn, natural sciences, the history of art, and philosophy, and he had concluded that the academic study of the mind was a disappointment. Psychology, as it was practiced in the late 1960s, seemed to him to have given up on the hardest and most interesting question, which was how a brain actually worked. So he stopped. He spent a while doing manual work, building things with his hands, on the theory that it was at least honest labor and that the mystery of the mind would still be there if he ever found a way in.

He came from a family that did not take ordinary careers for granted. His great-great-grandfather was George Boole, the self-taught mathematician whose algebra of logic became, a century after his death, the language in which every digital computer thinks. Another forebear, Charles Howard Hinton, wrote about the fourth dimension and coined the word “tesseract.” His father was an entomologist of some renown. To be a Hinton was to carry a quiet expectation that you would do something with your mind, which made dropping out to do carpentry both a small rebellion and a kind of confession that he had not yet found the thing worth doing.

What pulled him back was the same question that had driven Rosenblatt: not how to program a computer to be clever, but how a physical network of cells, learning from experience, could come to understand anything at all. By the early 1970s that question had a name and a tiny, embattled community of people asking it. The trouble was that the name was poison. Marvin Minsky and Seymour Papert’s 1969 book had done its work; neural networks were the thing serious people had moved past. Hinton arrived at the University of Edinburgh in 1972 to do a PhD in artificial intelligence at almost the worst possible moment, just as the field’s first long winter was setting in.

The winter had a date and a document. In 1973 the British government published a report by the applied mathematician James Lighthill, commissioned to assess whether artificial intelligence research deserved continued funding. Lighthill’s verdict was withering. The field had promised general intelligence and delivered toy programs that worked on toy problems and collapsed the moment the problems got real. He saw no evidence that any of it would scale. British funding for AI was gutted in the aftermath, and the chill spread. To work on neural networks in this climate was to work on the discredited corner of a discredited field, and Hinton’s advisors told him as much. He was, more or less, advised that he was wasting his life. He kept at it anyway. This was the trait that mattered more than any other in the decades to come, and it is worth naming plainly: Hinton was stubborn in a way that did not depend on encouragement. He thought the brain was a neural network because the brain was, observably, a network of neurons, and he could not see why anyone would try to build intelligence on any other principle. The fashion of the field struck him as beside the point. The field could be wrong.

His doctoral supervisor at Edinburgh, Christopher Longuet-Higgins, was a brilliant man who had come to artificial intelligence from theoretical chemistry and who believed, as most serious people then believed, that the way to build a mind was to build it out of symbols and logical rules, the way you build a proof. Hinton wanted to build it out of neurons. The two of them argued for years, with real respect and real disagreement, and the disagreement was the whole argument of the field in miniature: was intelligence something you programmed, a system of explicit rules a clever designer wrote down, or was it something that grew, an arrangement of connections that tuned itself in response to the world? Longuet-Higgins thought the first. Hinton thought the second, and could not be talked out of it, and finished his thesis on a topic his own department regarded as unpromising.

For years that stubbornness mostly meant isolation. Hinton finished his doctorate in 1978 and bounced between institutions, a believer in a thing almost no one funded. Then, in the early 1980s, he found the warmest place the connectionist underground would ever have, and it was in Southern California.

At the University of California, San Diego, a cognitive scientist named David Rumelhart had assembled a group of researchers who shared the conviction that the mind should be understood as a system of many simple units working in parallel, each one doing almost nothing, the intelligence emerging from the pattern of their connections. They called the idea Parallel Distributed Processing, PDP for short, and the name was partly a careful dodge: it let them describe neural networks without using the radioactive word. Hinton came to San Diego, first as a visitor in the late 1970s and then in earnest in the early 1980s, and joined them. For the first time he was not the lone heretic in the building. He was among people who took the premise as obviously true and spent their days on the real problem, which was getting the networks to learn anything useful.

Rumelhart was the quiet center of it. He was a mathematical psychologist, soft-spoken, more interested in being right than in being heard, and he had the rare gift of taking an intuition and finding the equation underneath it. Where Hinton was restless and full of analogies, Rumelhart was precise. The pairing worked. The group around them eventually poured years of argument and experiment into two thick volumes, published in 1986 under the title Parallel Distributed Processing, that became the unlikely bible of a movement most of computer science had declared dead. The books sold in numbers no one expected for a dense academic work on cognition. A generation of researchers who would later build the industry read them as students, and the word “connectionism” became the banner under which the believers gathered.

That problem was the one Minsky and Papert had turned into a weapon. A single layer of neurons could draw only a straight divide between categories, and the fix everyone understood was to stack layers, to put a middle tier between the input and the output so the network could build complicated decisions out of simple ones. The catch was the training. With a single layer, Rosenblatt’s old rule could tell each connection whether to grow stronger or weaker, because you could see directly how each input had contributed to the mistake. Add a hidden middle layer and that clarity vanished. When the network got an answer wrong, which of the hidden neurons was at fault, and by how much? The error appeared at the output, but the blame had to be distributed backward through a tangle of connections that had no obvious owner. This was the credit-assignment problem, and it had stalled the field for a generation.

The answer turned out to be a piece of calculus that, in hindsight, looks almost too simple to have been a barrier. It is called backpropagation, and what it does is run the network’s error backward through the layers, the way you might trace a flood back upstream to find which tributaries fed it. The network makes a guess. You measure how wrong the guess was. Then, working from the output back toward the input, you compute for every single connection how much it contributed to that error, and you nudge each one a little in the direction that would have made the error smaller. The mathematics that makes this possible is the chain rule, the rule for how a change deep inside a chain of dependencies ripples out to the end. Backpropagation is the chain rule applied, layer by layer, to assign each connection its precise share of the blame. Do it once and the network gets infinitesimally better. Do it millions of times, across millions of examples, and a network with hidden layers can learn to recognize things that no single-layer machine ever could.

The technique was not, strictly, Hinton’s invention. Versions of it had been worked out and forgotten more than once, by researchers in different fields who did not know about one another and whose papers sank without trace. A young American named Paul Werbos had written down essentially the same idea in his Harvard doctoral thesis in 1974, in the depths of the first winter, and almost no one had noticed. What Hinton, Rumelhart, and a young researcher named Ronald Williams did was different and, for the history, decisive. In a paper published in Nature in October 1986, titled “Learning representations by back-propagating errors,” they did the thing the earlier discoverers had not: they showed it working on problems people cared about, explained clearly why it mattered, and put it in front of a wide audience that was ready, finally, to listen. The paper demonstrated that a multi-layer network trained by backpropagation would discover useful internal representations on its own, that the hidden neurons would organize themselves into detectors for the features that mattered, without anyone telling them what those features were. The single-layer limitation that Minsky and Papert had used to bury the field was simply gone. You could train the deeper networks now.

That phrase, “learning representations,” carried the whole weight of the thing. The old hand-built systems needed a human to decide, in advance, what features of a problem mattered, to tell the machine to look for edges or corners or particular sounds. A network trained by backpropagation decided for itself. Shown enough examples, its hidden layers grew their own internal vocabulary for describing the world, a set of features no engineer had specified and that sometimes no engineer could even name. The machine was learning more than how to answer questions. It was learning what to pay attention to, which is closer to what a brain does than anything the rule-writers had built.

The 1986 paper is the moment the idea Rosenblatt had been right about, and wrong to oversell, became something you could actually build on. It did not end the exile; the field was still marginal, still underfunded, still treated by most of computer science as a curiosity at best. But it gave the believers a working tool, and a small ecosystem of results began to grow around it.

The believers were a small and identifiable group, and part of what kept the idea alive was simply that they knew one another. There were perhaps a few dozen people in the world in the early 1980s who took neural networks seriously as a route to intelligence, scattered across psychology departments and physics departments and the odd computer-science outpost, often working under labels chosen to avoid the suspicion the word “neural” would draw from a grant committee. They organized their own small meetings. They read one another’s unpublished work. When a young researcher turned up who shared the conviction, word traveled, and the network of believers pulled them in. It functioned less like a research program with institutional backing than like a faith kept alive by a congregation, passed hand to hand because the established church had declared it heresy. Hinton was becoming, without quite intending to, one of its central figures, the person other believers wrote to, the advisor students sought out when they wanted to work on the unfashionable thing.

One of the most striking results came from Hinton’s collaborator Terry Sejnowski, who had trained with him on a related architecture called the Boltzmann machine, a network that learned by settling into low-energy states the way a physical system cools toward equilibrium. Sejnowski took backpropagation and pointed it at an almost theatrical demonstration. He built a network called NETtalk and trained it to read English aloud, feeding it written text and the correct pronunciations and letting it learn the maddening, exception-riddled mapping between English spelling and English sound. As it trained, you could listen to it. At first it produced a continuous babble, like an infant. Then it found the boundaries between words, and the rhythm of syllables, and slowly, over a night of training, it resolved into something that haltingly but recognizably read. People who heard the recording found it eerie, because the network had not been given a single rule of English pronunciation. It had listened to examples and built the rules itself, exactly as Rosenblatt had promised a machine someday would, except that this one actually did it.

The most consequential demonstration of all was not in a lab at all but on an interstate highway. At Carnegie Mellon, a graduate student named Dean Pomerleau built a system he called ALVINN, for Autonomous Land Vehicle in a Neural Network. It was a neural network that watched the road through a camera and learned to steer. Pomerleau trained it by recording what a human driver did, the image from the windshield paired with the angle of the wheel, and letting the network learn the association, the same trick as NETtalk applied to driving instead of speech. ALVINN was a small network with a single hidden layer by the standards of what came later, but by the early 1990s it could steer one of the lab’s research vehicles on real public roads, handling the wheel at highway speed over stretches as long as tens of miles while a human kept watch. This was years before self-driving cars became a fashionable idea, accomplished by a system whose entire conceptual lineage ran straight back to the discredited perceptron. The Carnegie Mellon group would go on, in 1995, to drive a minivan called Navlab 5 almost the whole way from Pittsburgh to San Diego with the steering automated for about ninety-eight percent of the distance, a trip they called No Hands Across America, though the vision system that drove that particular run was a later, non-neural one. The neural network’s claim on the future was ALVINN, the proof that a machine could learn to drive by watching a person do it.

So the idea worked. The puzzle of the era, the thing that makes the period genuinely strange, is that working was not enough. NETtalk read aloud and ALVINN drove across the country and the broader field of computer science remained largely unmoved, still convinced that the path to machine intelligence ran through hand-built rules and explicit logic rather than networks that learned from examples. The neural-network people had results, but they were a tiny minority, and their results were treated as clever stunts rather than as the early evidence of a coming revolution. Part of the reason was that the networks of the 1980s and early 1990s were small, the computers slow, the datasets thin. The technique was right but starved, and starved technologies look like failed ones. Hinton would say later that he had always assumed the principle was correct and the field would come around once the machines got big enough to prove it. In the meantime he had to live inside a discipline that mostly thought he was wrong.

He also had to decide where to live, and that decision was its own kind of statement. Hinton had landed at Carnegie Mellon, in Pittsburgh, and the work was going well. But this was Reagan’s America in the 1980s, and a great deal of American artificial-intelligence research was paid for by the Department of Defense, threaded into the Pentagon’s interest in autonomous weapons and battlefield computing. Hinton found that intolerable. He did not want the science he believed in to be, in its funding and its purposes, an arm of the military. In 1987 he left for the University of Toronto, where the money was cleaner and a Canadian research foundation was willing to back long-shot ideas that the American mainstream had written off. It was a move that looked, at the time, like a step away from the center of the field. He was leaving the well-funded heart of AI for a quieter outpost in another country, to keep working on an idea that the well-funded heart of AI did not believe in.

In Toronto he settled in, and he stayed. The Canadian Institute for Advanced Research, a foundation built specifically to back long-horizon ideas that ordinary grant committees would not touch, gave him a fellowship and the kind of patient money that does not demand a product next year. It was an arrangement perfectly suited to a man working on something the rest of the field thought was finished. He built a lab. He took students, and the students took students, and slowly an institution grew up around a man who had been told, decades earlier, that he was wasting his life on a dead idea. The idea was still marginal. But it was alive, and it had a home, and it had a stubborn keeper who had never once been persuaded that the consensus of his field was the same thing as the truth.

What none of them had yet was the one ingredient that would eventually settle the argument in their favor: scale. The networks were too small and the computers too slow to show what the principle could really do. That would come. And the person who would carry the next piece of the work, who would take backpropagation and convolution and turn them into a system that read a meaningful fraction of America’s checks, was a young French engineer who had read a single sentence about the perceptron and decided it would be his life. He found his way to Hinton’s lab in Toronto, and then to Bell Labs, and into the second winter that was waiting just ahead.