Neuron Makers
Part IV · Chapter 19

Automation

Robotics, OpenAI's Rubik's-cube hand, and what lab founders really thought about jobs. → The "what does this mean for work?" question, asked through people rather than econometrics.

“Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we develop software.” — Andrej Karpathy, “Software 2.0,” 2017

On November 9, 2015, Google did something that, a few years earlier, would have struck its own engineers as financially insane. It gave away the machine.

The machine was called TensorFlow, and it was the second of its kind. The first, an internal system named DistBelief, had been built starting in 2011 by Jeff Dean and a small group inside Google Brain, and it had quietly become one of the most valuable pieces of software in the company. DistBelief trained the neural networks that read addresses on Street View images, that ranked search results, that powered the voice recognition in every Android phone. It was the engine room under the floorboards of products a billion people touched. It was also, by the accounts of the people who used it, an ungainly beast: tangled with Google’s internal infrastructure, hard to extend, built by researchers for researchers and bearing all the scars of a thing assembled in a hurry to win.

So Dean’s team rebuilt it. They spent two years writing a cleaner, faster, more general system, one that could express almost any computation as a graph of operations, with tensors, multidimensional arrays of numbers, flowing from one node to the next. That is where the name came from. And when it was done, instead of guarding it, Google published the source code under an open license and put it on the internet for anyone to download.

The decision baffled outsiders. Here was the company’s hard-won advantage in the most important technology of the decade, and Google was handing it to competitors free of charge. Inside Google, the logic was different. Dean and his colleagues had watched what happened in the open-source world: the tools that everyone used became the tools that everyone improved, and the company that controlled the tool’s direction enjoyed a gravitational pull no marketing budget could buy. If every graduate student in the world learned to think in TensorFlow, then every graduate student in the world would arrive at their first job already fluent in Google’s dialect. And the dialect was tuned, not coincidentally, to run beautifully on Google’s hardware.

TensorFlow was the opening move in a contest that almost nobody outside the field noticed at the time, and that would matter as much to the next decade of AI as any single model. It was the industrialization of machine learning itself.

For most of the story so far, building a neural network had been a craft. The image of the field in 2012, the year AlexNet won ImageNet, was three exhausted people and two gaming GPUs in a Toronto apartment, hand-tuning kernels in CUDA because no off-the-shelf software could do what they needed. Alex Krizhevsky had written much of his convolution code himself, by hand, because there was no library to call. Every lab had its own private stack of scripts, half-documented, held together by the institutional memory of whoever wrote them. A new student joining a group might spend the first three months just learning to operate the local apparatus before training a single useful model. The knowledge of how to make these systems work lived in people’s heads and in their fingers, and it did not transfer.

That arrangement is fine for a guild and fatal for an industry. You cannot run a factory on craftsmen who each build their own lathes. And what was coming, though only a handful of people fully believed it yet, was a factory.

The first thing a factory needs is interchangeable parts and a common shop floor. That is what the frameworks provided. TensorFlow was the loudest, but it was not alone, and within a year it had a rival that would eventually eat its lunch among researchers. In the autumn of 2016, a team at Facebook AI Research led by Soumith Chintala released PyTorch.

The two systems embodied a genuine disagreement about how the work should feel. TensorFlow, in its first incarnation, made you describe your entire computation up front as a static graph, then hand that graph to the system to execute. It was the way a compiler thinks: define everything, optimize the whole, then run. This was wonderful for deploying a finished model at scale across thousands of machines, which was precisely Google’s problem. It was miserable for the actual experience of research, where you want to poke at a tensor in the middle of a computation, print its value, change one line, and see what happens. Debugging a static graph felt like trying to inspect a sealed engine while it ran.

PyTorch made the opposite bet. It built the graph on the fly, as your code executed, line by line, the way ordinary Python runs. You could drop a print statement anywhere. You could set a breakpoint. You could write a loop whose shape depended on the data flowing through it. The technique was called define-by-run, and to the researchers who tried it, it felt less like operating a machine and more like thinking out loud. Chintala had come from the old Torch library, written in the niche language Lua, and the move to Python, the language every scientist already knew, removed the last barrier. Within two years PyTorch had quietly conquered the research labs. Most of the papers that would define the next era were written in it. Google would spend years and an entire rewrite of TensorFlow trying to win that ground back, and would largely fail.
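
The difference between the two styles is easier to feel in code than in prose. The sketch below is a toy in plain Python, and nothing in it is the TensorFlow or PyTorch API: the Node class and both function names are invented for illustration. The first function describes the whole computation before any number exists and only then runs it. The second computes real values line by line, so a loop and an inspection point can sit in the middle of the work.

```python
# Toy illustration only: Node, static_style and eager_style are hypothetical
# names, not part of TensorFlow or PyTorch.

class Node:
    """A deferred operation in a static graph; nothing runs until run() is called."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def run(self, feed):
        if self.op == "input":
            return feed[self.name]
        values = [node.run(feed) for node in self.inputs]
        if self.op == "add":
            return values[0] + values[1]
        if self.op == "mul":
            return values[0] * values[1]
        raise ValueError(f"unknown op: {self.op}")


def static_style(x_value, w_value):
    # Step 1: describe the entire computation up front. No numbers exist yet.
    x = Node("input", name="x")
    w = Node("input", name="w")
    y = Node("add", (Node("mul", (x, w)), x))  # y = x * w + x
    # Step 2: hand the finished graph to the executor.
    return y.run({"x": x_value, "w": w_value})


def eager_style(x_value, w_value, steps):
    # Define-by-run: every line produces a real value immediately, so ordinary
    # Python control flow and inspection work mid-computation.
    trace = []
    y = x_value
    for _ in range(steps):  # the loop length could depend on the data
        y = y * w_value + x_value
        trace.append(y)  # stand-in for a print statement or a breakpoint
    return y, trace
```

Calling static_style(2.0, 3.0) returns 8.0, and eager_style(2.0, 3.0, 2) returns 26.0 along with the intermediate values it passed through. The second style gives up the chance to optimize the whole graph before running it, which is roughly the trade the two camps were arguing over.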

The framework wars looked, from a distance, like a tooling squabble. Up close they were much larger than that. The frameworks turned the construction of a neural network from a bespoke act into an assembly process. A researcher in Mumbai or São Paulo or Tübingen could now download the same shop floor that Google and Facebook used, import a standard layer, stack it on another standard layer, and have a working model in an afternoon. The barbed-wire fence around the field, the requirement that you be able to write your own GPU code, came down. The population of people who could build a neural network exploded, and the per-model cost of building one collapsed. Both of those things were prerequisites for what scaling would soon demand.

But interchangeable parts are only half of an industrial system. The other half is power. And the frameworks had a hunger for it that ordinary processors could not feed.

Inside Google, the same insight that produced TensorFlow had produced something quieter and stranger. As early as 2013, Jeff Dean and his colleagues had done a back-of-the-envelope calculation that frightened them. If voice search took off, if every Android user spoke to their phone for just three minutes a day and a neural network had to process each query, Google would need to roughly double the number of data centers it owned. The economics did not work. General-purpose chips, even the GPUs that had powered AlexNet, were spending most of their transistors on things a neural network did not need. What a neural network mostly did was multiply matrices, over and over, billions of times. So Google decided to build a chip that did almost nothing but that.

The project was led in part by Norman Jouppi, a veteran computer architect, and the result was the Tensor Processing Unit. At its heart sat a grid of 256 by 256 multiply-accumulate units, a so-called systolic array, through which data marched in lockstep like a bucket brigade. It was not a flexible chip. It was barely a computer in the conventional sense. It was a purpose-built organ for the one operation that mattered. Google had the TPU running in its data centers by 2015, before TensorFlow was even public, and revealed its existence at the I/O developer conference in May 2016. When the paper describing it appeared the following year, the numbers were startling: for the inference workloads it targeted, the TPU ran fifteen to thirty times faster than contemporary CPUs and GPUs, and delivered far more performance per watt.
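
The bucket brigade can be made concrete with a small simulation. What follows is a sketch under stated assumptions, not Google's design: it assumes a weight-stationary layout, in which each cell holds one fixed weight while inputs move to the right and running sums move downward, and the function name systolic_matmul is invented for illustration. A 256 by 256 grid would have 65,536 such cells; the toy uses whatever grid the weight matrix implies.

```python
def systolic_matmul(X, W):
    """Compute the matrix product X @ W on a simulated grid of multiply-accumulate cells.

    Assumed layout (illustrative): cell (i, j) permanently holds W[i][j].
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # input value currently held in each cell
    psum = [[0] * N for _ in range(K)]  # running sum leaving each cell
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):      # cycles until the last result drains out
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for i in range(K):
            for j in range(N):
                if j == 0:
                    m = t - i           # row i is fed i cycles late (skewed input)
                    new_act[i][j] = X[m][i] if 0 <= m < M else 0
                else:
                    new_act[i][j] = act[i][j - 1]       # take input from the left
                above = psum[i - 1][j] if i > 0 else 0  # take sum from the cell above
                new_psum[i][j] = above + new_act[i][j] * W[i][j]
        act, psum = new_act, new_psum
        for j in range(N):              # finished sums fall out of the bottom row
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m][j] = psum[K - 1][j]
    return Y
```

On a two-by-two example, systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) yields [[19, 22], [43, 50]], which matches the ordinary matrix product. No cell ever fetches from memory in the middle of the work; each one only talks to its neighbors, which is where the efficiency comes from.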

The TPU was also, it turned out, watching the most famous AI event of the decade from inside the building. When AlphaGo sat across the board from Lee Sedol in Seoul in March 2016 and played the move that human professionals called impossible, the network choosing those moves was running on Google’s custom silicon. The chip and the algorithm had grown up in the same company, tuned to each other, and that co-design, software shaped for hardware and hardware shaped for software, was the second pillar of the new industrial order. Nvidia, watching all of this, would respond by reorienting its entire business around the same idea, stuffing its GPUs with dedicated matrix-multiply units called Tensor Cores and selling them by the rackful to anyone trying to build the next big model. The picks and shovels were becoming the most profitable part of the gold rush.

The power also stopped requiring ownership. For most of computing history, if you wanted to run a large job you had to buy the machines, house them, cool them, and hire the people to keep them alive. Around the middle of the decade that changed for machine learning. Google, Amazon, and Microsoft began renting their accelerators by the hour, and a graduate student with a credit card could now summon, for a single afternoon, more computing power than most universities owned outright. On paper this was the great equalizer. In practice it had a sliding floor. Anyone could rent a few hours on a few chips. Renting thousands of chips for weeks, the regime that the largest models would soon demand, cost millions of dollars, and the cloud bill arrived at the end of the month whether the run had worked or not. The cloud democratized small experiments and quietly walled off large ones. It widened the bottom of the funnel and narrowed the top.

So now there was a shop floor and there was power. The third thing a factory needs is raw material, and in machine learning the raw material is labeled data.

The field had always known this, in a way. ImageNet had been built on the backs of tens of thousands of anonymous Mechanical Turk workers who drew boxes and ticked categories for pennies a task. But that had been a one-time act of academic heroics, a dataset assembled and then frozen. As neural networks moved out of the lab and into products, into self-driving cars and content moderation and medical imaging, the appetite for labeled data became continuous and industrial. A self-driving car company did not need one labeled dataset. It needed a river of them, fresh every week, every pedestrian boxed, every lane line traced, every traffic cone tagged, forever.

In 2016 a nineteen-year-old named Alexandr Wang dropped out of MIT and, with a co-founder, started a company to sell exactly that river. He had grown up in Los Alamos, the son of two physicists at the national laboratory, and he had the instinct of someone raised around large institutions that turned raw inputs into finished science. Scale AI was, at its core, an unglamorous proposition: it organized human beings, many of them in the Philippines and Kenya and Venezuela, to look at images and audio and text and attach the labels that machine learning models needed to learn. A worker in Nairobi might spend an eight-hour shift drawing polygons around cars in dashcam footage shot on a freeway in California, so that a self-driving system would one day know, in some statistical sense, what a car was. The technology around the humans was real and clever: the routing, the quality control, the software that decided which examples a model was most confused by and should therefore be labeled next. But the business was, in the end, a managed workforce for the manufacture of ground truth. It was a labor company wearing the clothes of a software company, and it would make Wang one of the youngest billionaires in the world.
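
The idea of sending human labelers the examples a model is most confused by has a standard textbook form, usually called uncertainty sampling. The sketch below shows that general technique, not Scale AI's actual system, whose internals the account above does not describe. The function names, the frame identifiers, and the probabilities are all made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; higher means the model is less sure."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def pick_for_labeling(predictions, budget):
    """Return the ids of the `budget` examples with the most uncertain predictions."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:budget]]


# Hypothetical model outputs: (example id, predicted class probabilities).
predictions = [
    ("frame_001", [0.98, 0.01, 0.01]),  # confident
    ("frame_002", [0.40, 0.35, 0.25]),  # confused
    ("frame_003", [0.70, 0.20, 0.10]),
    ("frame_004", [0.34, 0.33, 0.33]),  # most confused
]

to_label = pick_for_labeling(predictions, budget=2)
```

With a budget of two, the selection is frame_004 followed by frame_002. The confident frame never reaches a human, which is the point: paid attention goes where the model's guess is closest to a coin toss.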

The emergence of a data-labeling industry was its own kind of confession. It meant that the bottleneck in building an AI system was no longer the algorithm or the compute. Those had become, if not commodities, at least things you could buy. The bottleneck was the patient, expensive, human work of telling the machine what was true, millions of times over. The intelligence everyone marveled at was, at its foundation, a vast act of remembering examples that people had been paid to curate. The factory ran on hidden hands.

There was a deeper trick coming for the data problem, and it pointed in the same direction as everything else. If you already had a large model trained on the open internet, you could use it to help label the next batch, or to generate raw text to train on, or to filter a noisy dataset down to the good parts. Data could be made to compound. But the seed of that loop, the first large model, was again something only a well-resourced lab could grow. The companies that had labeled the most could now label more cheaply, and the gap between them and everyone else widened with each turn.

There was a fourth piece, and it was the most vertiginous, because it pointed at the possibility that the factory might one day run itself.

If building a neural network had become an assembly process, stack this layer, choose that width, set this learning rate, then a natural question presented itself to anyone with an engineer’s temperament. Why was a human doing the stacking? The choice of architecture, the number of layers and their arrangement and the size of each, was still made by people, by intuition and trial and accumulated taste. It was the last bit of true craft left in the process. So a few researchers asked whether a machine could learn to design machines.

In November 2016, two Google Brain researchers, Barret Zoph and Quoc Le, posted a paper with a deceptively flat title: “Neural Architecture Search with Reinforcement Learning.” The idea inside it was almost recursive enough to make you dizzy. They built a neural network whose job was to write descriptions of other neural networks. This controller network would propose an architecture, that architecture would be built and trained on a real task, its accuracy would come back as a reward signal, and the controller would adjust itself to propose better architectures next time. It was the same kind of reinforcement-learning machinery that had taught DeepMind’s systems to play Atari, pointed inward, at the design of intelligence itself.
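
The propose, train, reward, adjust loop can be sketched in a few dozen lines, with heavy simplifications that should be read as assumptions. The controller here is a pair of independent probability tables and not a neural network, the search space has only nine architectures, and the reward comes from an invented scoring table in place of actually training anything. The update rule is the classic REINFORCE policy gradient with a moving-average baseline.

```python
import math
import random

# Hypothetical search space and made-up scores. In the real method the reward
# came from training each proposed network, which is the expensive part.
DEPTHS = [2, 4, 8]
WIDTHS = [32, 64, 128]
DEPTH_SCORE = {2: 0.0, 4: 0.2, 8: 0.1}
WIDTH_SCORE = {32: 0.0, 64: 0.1, 128: 0.2}


def stand_in_accuracy(depth, width):
    """Placeholder for 'build the network, train it, measure its accuracy'."""
    return 0.5 + DEPTH_SCORE[depth] + WIDTH_SCORE[width]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search(steps=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    options = {"depth": DEPTHS, "width": WIDTHS}
    logits = {k: [0.0] * len(v) for k, v in options.items()}
    baseline = None
    for _ in range(steps):
        # The controller proposes an architecture by sampling each decision.
        probs = {k: softmax(v) for k, v in logits.items()}
        picks = {k: rng.choices(range(len(options[k])), weights=probs[k])[0]
                 for k in options}
        reward = stand_in_accuracy(DEPTHS[picks["depth"]], WIDTHS[picks["width"]])
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # REINFORCE: make choices that beat the baseline more likely next time.
        for k in options:
            for idx in range(len(options[k])):
                indicator = 1.0 if idx == picks[k] else 0.0
                logits[k][idx] += lr * advantage * (indicator - probs[k][idx])
    final = {k: softmax(v) for k, v in logits.items()}
    best = (DEPTHS[final["depth"].index(max(final["depth"]))],
            WIDTHS[final["width"].index(max(final["width"]))])
    return best, final
```

Each pass through the loop stands for one complete training run of a candidate network. The toy answers in a fraction of a second because stand_in_accuracy is a table lookup; replace it with real training and each of those three thousand steps becomes hours of work on real hardware.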

It worked. The architectures the system discovered were competitive with the best human designs on standard benchmarks, and some of them had a strange, asymmetric quality, full of connections that no human designer would have drawn but that turned out to help. The result was real, and it was also a glimpse of an unsettling future: the human designer, the last craftsman on the floor, being automated like everyone before.

There was a catch, and the catch was the whole point. Neural Architecture Search worked by training thousands of candidate networks to evaluate them, and training thousands of networks costs an extraordinary amount of compute. The early experiments consumed hundreds of GPUs running for weeks. To automate the design of a network, you had to be able to throw away the cost of building thousands of them. This was an option available to roughly three organizations on Earth. The dream of the self-designing AI turned out, on inspection, to be a dream you could only afford if you already owned a data center. Google folded the technique into a product it called AutoML and pitched it as democratization, push a button and get a custom model, but the button was wired to a server farm only Google possessed.

The same pattern ran underneath all four pillars. Each step that made machine learning easier also made it more concentrated. The free framework lowered the barrier to entry, and tied the entrants to a particular hardware ecosystem. The custom silicon delivered enormous efficiency, and only a handful of companies could design and deploy it. The data-labeling industry turned ground truth into a purchasable commodity, and the largest buyers could afford rivers while everyone else could afford a trickle. The architecture search promised to automate the last human step, and priced that automation at a level only the giants could pay. The tools spread outward to everyone. The frontier moved inward, toward the few.

Andrej Karpathy, who had been one of Fei-Fei Li’s students at Stanford and was by 2017 running AI at Tesla, gave the whole shift a name that stuck. In an essay he called “Software 2.0,” published that November, he argued that something more profound than a new set of tools was underway. The classical way of writing software, Software 1.0 in his framing, meant a programmer typing explicit instructions, line by line, telling the computer exactly what to do. Software 2.0 abandoned that. Instead of writing the program, you specified what you wanted with a dataset and a goal, and then an optimization process searched the vast space of possible neural-network weights for a program that satisfied it. The programmer’s job became the curation of data and the design of objectives. The actual logic, the billions of weights, was found by gradient descent and was not really written by anyone at all. It was grown.
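
The contrast can be shown at the smallest possible scale. In the sketch below, which is an illustration and not an example taken from Karpathy's essay, the first function is a rule a programmer typed. The second half never sees that rule. It is handed only examples and a goal, here mean squared error, and gradient descent searches for two weights that satisfy them. The function names and the tiny dataset are invented for the purpose.

```python
def written_by_hand(x):
    # Software 1.0: a programmer types the rule explicitly.
    return 2.0 * x + 1.0


# Software 2.0: the "source code" is a dataset of input/output examples.
# The examples are generated from the rule above purely for illustration.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
dataset = [(x, written_by_hand(x)) for x in xs]


def fit(dataset, steps=500, lr=0.1):
    """Search for weights (w, b) that minimize mean squared error by gradient descent."""
    w, b = 0.0, 0.0
    n = len(dataset)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in dataset) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in dataset) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit(dataset)
```

The search lands on weights within a hair of 2 and 1, recovering the rule without anyone having written it down. With two weights the result is still legible. With billions, the same procedure produces a program that works and that nobody can read.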

Karpathy’s point was that this was a new substrate, not a marginal technique for special problems. Whole categories of code that engineers used to write by hand, for vision and speech and translation, were being deleted and replaced by trained networks that did the job better and that no human had explicitly designed. The compiler for this new kind of software was the optimizer. The source code was the dataset. And the means of production were the framework, the chip, the labeled data, and the search: exactly the four things that had just been industrialized.

The inversion was complete. A generation of computer scientists had been trained to believe that the essence of their craft was the precise, legible specification of behavior: you understood a system because you had written every rule, and you could trace any output back to a line you had typed. Software 2.0 threw that away. Nobody could point to the line of code that made the network recognize a cat, because there was no such line. There was only a pattern of weights, distilled out of millions of examples, that worked for reasons no one could fully articulate. The field was trading comprehension for capability, and the trade was so favorable that almost no one hesitated. The machines were getting better at things, and the price was that we no longer understood, in the old way, how.

By 2019 the assembly line was complete, and it was running. You could download the framework for nothing, rent the compute by the hour from any cloud, buy the labeled data from a managed workforce, and, if you were one of the few who could afford it, let a machine help design the model. The romance of the lone researcher with a clever idea had not died, but it had been surrounded. Increasingly the work that moved the frontier had become a logistics operation more than a single insight: assembling enough compute, enough data, and enough engineers to run a single training job that might cost as much as a house, or eventually as much as a small office building.

There is a tell in the historical record for exactly when a science becomes an industry, and it is mundane. It is the author list. The papers that had defined the early breakthroughs carried two or three names. AlexNet had three. The 1986 paper that put backpropagation on the map had three as well. As the era of automation gave way to the era it had made possible, the author lists began to swell, twenty names, thirty, more, each one a specialist responsible for one station on the line: the data pipeline, the distributed training, the evaluation harness, the infrastructure. The lone inventor had become a manufacturing team.

A factory this expensive needs a reason to exist that justifies the bill. You do not assemble compute by the data center, buy rivers of labeled data, and pay a manufacturing team to run a single job that costs as much as a house unless you believe the thing coming off the line is worth it. The people building these machines did believe that, and not modestly. A great many of them thought the end of the line was a mind, an intelligence that would one day exceed their own, and that conviction, more than any framework or chip, was the engine that kept the factory running.