Part I · Chapter 3

Rejection

Yann LeCun, LeNet, the second AI winter, and the years researchers had to hide the word "neural." → The wilderness years, and why so many pioneers ended up Canadian or European.

The machine in the basement at Holmdel could read a number it had never seen. A nine that someone had scrawled in a hurry, looping and lopsided; a seven a European had crossed; a smudged four written by a child. Yann LeCun would feed it the scanned image of a handwritten amount off a bank check, and the network would think for a moment, in the way that networks think, and produce the digits. It did this thousands of times a day without complaint and almost without error. By any honest measure it was one of the most successful artificial neural networks ever built. It was running in production, on real money, for real customers, in the early 1990s, at a time when most of computer science had decided that artificial neural networks were a dead end.

LeCun had wanted to build something like it since he was a teenager, though he could not have said so at the time, because the field he wanted to enter had effectively ceased to exist. He had grown up outside Paris, the son of an aerospace engineer, the kind of boy who took apart radios and model airplanes and put them back together changed. He went to engineering school, the École Supérieure d’Ingénieurs en Électrotechnique et Électronique, and trained as an electrical engineer, which is a respectable thing to be and was nothing like what he wanted. What he wanted he found in a book.

The book documented a debate, held in 1975 at an abbey north of Paris, between the linguist Noam Chomsky and the psychologist Jean Piaget, with a roomful of philosophers and scientists arguing about whether the structure of the mind was something a child built up from experience or something written into the species in advance. LeCun, reading it as a student, was less interested in the two famous men than in a participant who appeared at the edges of the argument: someone defending the idea that a machine could learn the way Piaget thought a child did, by adjusting itself in response to the world. There was a reference, in passing, to a device called the Perceptron. LeCun went looking for everything he could find about it. Most of what he found was an explanation of why it had failed. He read Minsky and Papert’s book, the one that had buried the idea a decade earlier, and came away with the opposite of the lesson it intended. The limitations they described seemed to him like problems to be solved, not proofs that the road was closed. He decided that learning machines were what he would do with his life, in a field that no longer admitted to having a name.

He taught himself. There was no one in France working on this; there was barely anyone anywhere. He reinvented, more or less from scratch, the idea of training a network with many layers by propagating the error backward through them, not knowing that a few people on the other side of the Atlantic were converging on the same thing. In 1985 he went to a small workshop in the French Alps and gave a talk about his version of it. A man from Carnegie Mellon was in the audience and came up afterward, interested. The man was Geoffrey Hinton, who had just made backpropagation work, and who knew immediately that the young Frenchman in front of him had been pulling at the same thread alone. Hinton, who would soon move his lab to Toronto, invited him to join it. LeCun finished his doctorate first, then crossed over in 1987 to do a postdoc in Hinton’s lab in Toronto, joining the small and somewhat clannish group of people who believed, against the field, that this was going to work.

In 1988 he went to Bell Labs.

It is hard now to convey what Bell Labs meant. It was the research arm of the American telephone monopoly, and for half a century it had been the most productive industrial laboratory in the history of science. The transistor was invented there. So were information theory, the laser, the Unix operating system, and the C programming language, and it was there that the cosmic microwave background was discovered. Researchers were given real problems and real money and a great deal of latitude, and the campus in Holmdel, New Jersey, was a temple to the idea that if you hired brilliant people and left them mostly alone, useful things would come out the other end. LeCun arrived into a group that had been handed a concrete, valuable, unglamorous task: teach a computer to read.

Specifically, to read the handwritten amounts on bank checks and the digits in zip codes, which the U.S. Postal Service and the banking system processed by the billions, and which human beings had to key in by hand, slowly and expensively. It was the perfect proving ground for what LeCun believed. The digits were a small, closed world. There was an enormous amount of training data, eventually gathered into a standard set of handwritten numerals that researchers would test against for the next thirty years. And no one’s life depended on perfection, only on the system being faster and cheaper than people, which was not a high bar.

The thing LeCun built to do it became known, eventually, as a convolutional neural network, and the idea behind it is worth pausing on, because nearly everything that came later in computer vision is a descendant of it. The problem with reading an image is that the thing you are looking for can be anywhere in it. A seven written in the top-left corner is the same seven written in the bottom-right; a network that learned to recognize sevens only in one position would be useless. The earlier neural nets wired every pixel of the input to every unit in the next layer, which meant they had to learn the appearance of a seven separately for every possible location, an absurd waste of effort that the available computers and data could not afford.

LeCun’s network did something more economical, borrowed loosely from how the visual cortex was thought to work. Instead of one giant tangle of connections, it used small detectors, each looking at only a little patch of the image at a time, hunting for a simple local feature: an edge, a corner, the stub of a curve. And the same detector was slid across the whole image, applied at every position, so that whatever it had learned to find in one place it could find everywhere. A feature was a feature wherever it appeared. These local detectors fed into more detectors that combined their findings into larger shapes, and those into larger ones still, until the final layers were looking not at pixels but at assemblies of strokes that added up to a digit. Between the layers, the network also threw away precise position information on purpose, pooling each neighborhood down so that the exact placement of a stroke stopped mattering and only its rough presence carried forward, which made the whole thing tolerant of the wobble and slant of real handwriting.
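For readers who think better in code, the two moves in that description, sliding one small detector across the whole image and then pooling away exact position, can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not LeNet: the six-by-six image, the vertical-stroke detector, and every function name are invented for illustration, and max pooling is swapped in for simplicity where the Bell Labs networks are usually described as using an averaging form of subsampling.

```python
# Toy illustration only. Sizes, detector values, and names are assumptions,
# not LeNet's actual configuration.

# A hypothetical 3x3 detector that responds to a short vertical stroke.
VERTICAL_STROKE = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]


def make_bar(column, side=6, height=3):
    """Build a side x side blank image with a vertical bar in one column."""
    image = [[0] * side for _ in range(side)]
    for row in range(height):
        image[row][column] = 1
    return image


def slide_detector(image, detector):
    """Apply the same small detector at every position of the image."""
    k = len(detector)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * detector[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


def max_pool(feature_map, size=2):
    """Keep only the strongest response in each size x size neighborhood."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        out_row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            out_row.append(
                max(feature_map[r + i][c + j]
                    for i in range(size) for j in range(size))
            )
        pooled.append(out_row)
    return pooled


# The same stroke, drawn one pixel apart.
map_a = slide_detector(make_bar(1), VERTICAL_STROKE)
map_b = slide_detector(make_bar(2), VERTICAL_STROKE)
pooled_a = max_pool(map_a)
pooled_b = max_pool(map_b)
```

In this sketch the stroke in column 1 and the stroke in column 2 produce different feature maps, because the detector fires at a different position, but `max_pool` reduces both to the same coarse summary. That is the tolerance for wobble in miniature: the detail of where the stroke sat is discarded and only its rough presence is carried forward.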

The trick that made it tractable was that the detectors shared their settings, the same small pattern reused thousands of times rather than learned thousands of times over. It drastically cut the number of things the network had to learn, which meant it could actually be trained on the hardware of the late 1980s. The system that resulted was called LeNet, and it worked. The same three ideas at its core, local features, weight sharing, and the willingness to blur away exact position, would two decades later be the engine inside the network that recognized photographs, and the network after that, and nearly every machine that learned to see. LeCun had the architecture of modern computer vision essentially right in 1989. What he did not have, and could not have had, was a fast enough computer or a large enough world to point it at. The idea was waiting on the hardware, and the hardware was twenty years away.
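The saving from shared settings is easy to put rough numbers on. What follows is back-of-the-envelope arithmetic, not a description of LeNet’s actual layer sizes: the 28-by-28 image, the 5-by-5 detectors, and the choice of six detectors are assumptions picked to be in a plausible range, and bias terms are ignored.

```python
# Illustrative sizes only (assumptions, not LeNet's real configuration).
IMAGE_SIDE = 28    # pixels along each side of the input
KERNEL_SIDE = 5    # pixels along each side of one small detector
DETECTORS = 6      # number of distinct detectors in the first layer


def output_positions(image_side, kernel_side):
    """Number of places a detector can sit without hanging off the edge."""
    side = image_side - kernel_side + 1
    return side * side


def dense_weights(image_side, outputs):
    """Every pixel wired to every output unit."""
    return image_side * image_side * outputs


def local_unshared_weights(kernel_side, outputs):
    """Each output sees only a small patch, but learns its own weights."""
    return kernel_side * kernel_side * outputs


def shared_weights(kernel_side, detectors):
    """Each detector's weights are learned once and reused everywhere."""
    return kernel_side * kernel_side * detectors


outputs = output_positions(IMAGE_SIDE, KERNEL_SIDE) * DETECTORS
counts = {
    "dense": dense_weights(IMAGE_SIDE, outputs),
    "local_unshared": local_unshared_weights(KERNEL_SIDE, outputs),
    "shared": shared_weights(KERNEL_SIDE, DETECTORS),
}
```

Under these assumed sizes the first layer needs 150 learned weights with sharing, 86,400 if each position learns its own small detector, and about 2.7 million if every pixel is wired to every unit. The exact figures depend entirely on the sizes chosen; the orders of magnitude are the point.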

It worked well enough to be sold. Through AT&T and then NCR, the check-reading technology was built into the machines that banks used to process payments, and by the late 1990s a substantial share of all the checks written in the United States, a figure usually given as well over a tenth and sometimes higher, passed through a descendant of LeCun’s network on their way to being cashed. There is no cleaner test of a technology than that businesses pay money to use it on something that matters. By that test, neural networks had passed. LeCun and his colleagues even designed a special chip, called ANNA, to run these networks faster than a general-purpose processor could, an early ancestor of the specialized AI silicon that would one day be fought over by nations. They had a working product, a profitable application, and a piece of custom hardware. And it made almost no difference to the standing of their field.

Because down the hall at Bell Labs was Vladimir Vapnik.

Vapnik was a Soviet mathematician, brilliant and combative, who had emigrated and brought with him a body of statistical learning theory of formidable elegance. Out of it came a method called the support vector machine, which solved many of the same classification problems neural networks did, and solved them with a property neural nets conspicuously lacked: you could prove things about it. A support vector machine came with mathematical guarantees about how well it would generalize from its training data to new examples. It was clean. It was provable. It did not require the dark art of choosing how many layers and how to initialize them and when to stop training, the fiddly empirical craft that made neural networks feel less like science than like cooking. To a field that prized rigor and was embarrassed by anything that smelled of alchemy, Vapnik’s machines were almost a moral relief.

The two camps coexisted in the same building, sometimes the same corridor, and the rivalry was real and good-humored and, in retrospect, historic. In 1995 Vapnik and Larry Jackel, who ran the group, made a pair of wagers about the future, with LeCun as the witness and a fancy dinner as the stake. Jackel bet that by the year 2000 someone would have a real theory of why big neural networks worked, a proof of the kind that already existed for Vapnik’s machines. Vapnik bet the reverse about practice: that by 2005 no one in their right mind would still be using neural nets of the sort they had in 1995, because everyone would have switched to support vector machines. The terms were written down. What gives the bet its flavor is that both men lost, and LeCun, who thought neural nets would survive and did not need a theory to keep using them, collected on both. But that resolution lay in the future. For most of the 1990s, sitting in that corridor, the smart money looked like Vapnik’s. The serious people, the ones who understood the mathematics best, were betting that neural networks were a passing inelegance, a clever hack the field would outgrow once it had proper tools.

What happened next was the second winter.

The field had been through one already, in the 1970s, after the Minsky and Papert book and a skeptical British government report had drained the money and the morale out of the first wave of AI. People who lived through that one had learned to recognize the season. By the mid-1990s the signs were back. Support vector machines and other kernel methods were ascendant. Funding agencies and journals and hiring committees had moved on. The hot results were elsewhere. And the word itself, neural, had acquired a faint odor of failure, the smell of a field that had overpromised in the 1960s and overpromised again in the 1980s and was not going to be given a third chance to embarrass everyone.

So researchers stopped using the word. This is the detail that captures the era better than any funding chart. People who were doing neural network research learned not to call it that. They wrote papers about convolutional networks, or about specific architectures, or they leaned on the more respectable umbrella of machine learning, anything to keep a program committee from rejecting the work on sight for the company it kept. A grant proposal with neural in the title was a proposal that did not get funded. A paper that foregrounded the connection to the discredited old idea was a paper that did not get in. The believers were still doing the work. They had simply learned to do it in a kind of internal exile, laundering the vocabulary, hiding the lineage, waiting.

There is something almost comic in it, viewed from the far side. The same researchers who would later have companies bid tens of millions of dollars for them spent the late 1990s engineering the titles of their own papers to conceal what the papers were about, the way a writer might slip a manuscript past a censor. And there was a quiet desperation underneath the comedy, because a scientist’s career is built of accepted papers and funded grants, and a field that has decided your subject is illegitimate is a field that can end your livelihood without ever proving you wrong. Being right is not a defense when the people who decide what counts have stopped reading. LeCun, who never stopped believing he had the better of the argument, would say years afterward that he had been sure he was right the whole time. Certainty of that kind is cheap to claim in hindsight. What is rare is to hold it during the years when it buys you nothing.

The hiding was a humiliation, but it was survivable. The deeper cost of these years was the human one. The people who kept faith with neural networks paid for it in careers that stalled, in students who took the safer path, in the slow grind of being right and unrewarded for it. And for some the cost was heavier still. In the middle of this stretch, while the field was at its lowest and the work was hardest to defend, Hinton’s wife, Rosalind, died of ovarian cancer, leaving him with two young children, Emma and Thomas, and a grief that nearly pulled him out of research entirely. He had given his professional life to an idea the world had decided was wrong, and now the personal ground had given way beneath him too. He kept going. It is one thing to persist in an unpopular scientific bet when your life is otherwise whole. It is another to keep at it through that.

The unfashionability had a geography to it, and the geography is the part that surprised people later. The great consolidation of AI talent into a handful of American technology companies would not happen for another decade, and when it did, the founders of the field would be conspicuously not American. Hinton was British and had taken his lab to Canada, partly out of distaste for how much American AI money came from the military, settling in Toronto where the government was content to fund basic research without asking it to point at a weapon. LeCun was French. And the most important keeper of the flame in this particular stretch was a German computer scientist working in Switzerland, far from the centers of the field, where being out of fashion mattered a little less because you were already out of the way.

His name was Jürgen Schmidhuber, and in 1997, with his student Sepp Hochreiter, he published a network design called Long Short-Term Memory, the LSTM, built to solve a problem that had quietly crippled an entire branch of neural networks. The networks that handled sequences, language, speech, anything that arrived one piece at a time, had to carry information forward from earlier in the sequence to make sense of what came later. But the error signal that taught them tended to fade as it traveled back across many steps, so they could not learn long-range dependencies; by the time the lesson reached the beginning of a sentence it had dissolved to nothing. Hochreiter had characterized this vanishing-gradient problem precisely. The LSTM was the fix: a unit with a kind of protected internal memory and a set of learned gates that decided what to remember, what to forget, and what to let through. It was an ingenious piece of engineering, and for years almost no one outside a small circle cared. It sat on the European margin, mostly ignored, until the world turned and it became, for a decade, the workhorse that transcribed the speech spoken into phones and translated the languages of the internet. The pattern was becoming familiar. The right answers were being found. They were simply being found by people the field was not listening to.
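The fading and the fix can both be shown with a little arithmetic. The sketch below is a deliberately stripped-down, single-number version and rests on several assumptions: the error signal is treated as one number multiplied by a fixed factor at each step, the cell is the now-common variant with an explicit forget gate rather than a reconstruction of the 1997 paper, and the parameter names in the dictionary are invented for illustration.

```python
import math


def sigmoid(z):
    """Squash any number into the range (0, 1); used for the gates."""
    return 1.0 / (1.0 + math.exp(-z))


def surviving_signal(per_step_factor, steps):
    """How much of an error signal is left after being scaled at every step."""
    signal = 1.0
    for _ in range(steps):
        signal *= per_step_factor
    return signal


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM-style cell (hypothetical parameter names)."""
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # what to keep
    write = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # what to store
    read = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])     # what to emit
    candidate = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    # The memory is updated by addition, scaled only by the forget gate.
    c = forget * c_prev + write * candidate
    h = read * math.tanh(c)
    return h, c


# A plain recurrent path that halves the signal each step, versus one that
# passes almost all of it through.
faded = surviving_signal(0.5, 50)
kept = surviving_signal(0.99, 50)

# A cell whose forget gate is biased toward "keep" and whose write gate is
# biased toward "closed" holds its memory across fifty empty steps.
params = {name: 0.0 for name in
          ("wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg")}
params["bf"] = 5.0
params["bi"] = -5.0
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
```

With a factor of one half, fifty steps leave essentially nothing of the signal; with a factor close to one, most of it survives. The gated cell is a way of letting the network learn to hold that factor near one along its memory path when something is worth remembering.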

Meanwhile the field’s old guard was learning, in real time, how it feels to be displaced. The lesson did not arrive only for the neural network people; it arrived, more brutally, for the people whose methods the neural network people would eventually replace. At Microsoft, a computational linguist named Chris Brockett had spent the better part of seven years doing painstaking, expert work on machine translation: hand-building the linguistic rules that a computer would need to turn one human language into another, encoding grammar and syntax and the thousand exceptions that make translation hard, the accumulated craft of a discipline that took human language seriously as a structured thing. It was scholarship as much as engineering. And then the statistical approach arrived, machines that learned to translate not from rules a linguist wrote but from raw heaps of already-translated text, finding patterns no human had articulated, and it began, almost overnight, to do better than the rules. Brockett has described the vertigo of it, the sensation of watching years of careful, intelligent work be rendered beside the point by a method that knew nothing about language and did not need to. It was a preview, in miniature, of what was coming for everyone. The era that was about to begin would be full of people who had been the best in the world at something, watching a learning machine do it better without understanding how.

That, in the end, is the strange shape of these years. A neural network was reading the nation’s checks and a chip had been built to run it faster, and the field had decided neural networks were finished. The people who would turn out to be most right were scattered to Toronto and Switzerland and the basements of Bell Labs, publishing under other names, betting their careers against the consensus and the smart money both, burying the parents who would not live to see them vindicated. The idea was not dead. It had been deemed dead, which is a different thing, and the difference would matter enormously, because deeming requires only that everyone agree, and agreement can change in an afternoon.

It would take a discontinuity to change it. Not a better argument, which the believers already had, and not a better result, which they already had too, in the form of a machine reading real money. It would take a result so large that no argument could contain it, a number so far beyond what anyone expected that the consensus would simply break under the weight of it. That number was a few years off, and it would not arrive first in the field everyone remembers. Before vision, there was a quieter test in a place where the orthodoxy was just as entrenched and the believers had a student ready to break it.