Part VIII · Chapter 34

The Agentic Turn

Models stop answering and start doing: using computers, writing code on their own, and reorganizing software work around agents. How the chatbot became a worker.

In March 2023, a thirty-three-year-old game developer named Toran Bruce Richards uploaded a side project to GitHub and gave it a name that sounded like a parody of the moment: Auto-GPT. The idea was simple to the point of recklessness. Take GPT-4, the model OpenAI had released only days earlier, and let it talk to itself. Give it a goal in plain English. Let it break the goal into tasks, do the tasks, look at the results, and decide what to do next, without a human in the loop.

Within weeks, Auto-GPT was the top-trending repository on GitHub, gathering stars faster than almost any established tool or framework ever had. People pointed it at their businesses, their research projects, their personal to-do lists. They gave it a credit card, figuratively, and in a few reckless cases literally, and told it to make money. Screenshots flooded Twitter: Auto-GPT spawning sub-agents, writing its own code, googling its own errors. It felt, for a few weeks in the spring of 2023, like the future had arrived early.

And then, mostly, it didn’t work.

Auto-GPT would start strong and then wander. It would get caught in loops, repeating the same failed action with the patience of a machine and the judgment of none. It would declare victory on tasks it hadn’t finished. It would spend real money on API calls chasing goals it had quietly misunderstood three steps earlier. The gap between the demo and the dependable tool turned out to be enormous, and closing it would consume the field for the next two years.

The pattern was familiar to anyone who had watched the field before. A capability would appear in a demo, look like magic, and then refuse to survive contact with the real world. What was new this time was the shape of the failure. Earlier AI systems failed by being wrong about a single thing: misclassifying an image, mistranslating a phrase. Auto-GPT failed by being wrong about a sequence of things, each error feeding the next, until a project that began as “research the market and draft a business plan” ended as a machine confidently emailing itself nonsense. The demos went viral precisely because the good runs were so good. The bad runs, which were most of them, did not make it onto Twitter.

That gap, between the run worth screenshotting and the run you would actually trust, is the subject of this chapter. It is the story of how AI went from a thing that answered questions to a thing that did things, and of all the ways that the second turned out to be harder than the first.

The models that powered the chatbot boom of 2022 and 2023 were, at their core, prediction engines. You gave them text, and they predicted the text that should come next. A remarkable amount of apparent intelligence fell out of that simple objective, but it was, fundamentally, a system that produced words. It did not act on the world. It had no hands.

This is worth sitting with, because it is the hinge the whole chapter turns on. A model that could write a flawless Python script to rename a thousand files could not rename a single one. A model that could explain, in fluent detail, exactly how to book a flight could not book a flight. It was, in the most literal sense, all talk. The intelligence was real and the agency was nil, and the entire agentic project was an attempt to bridge that gap, to take the thing that could describe an action and connect it to the world where actions actually happened.

The first hands were crude. In 2022, Shunyu Yao and colleagues at Princeton and Google described a technique they called ReAct, a contraction of “reason” and “act.” A language model, they showed, could be prompted to interleave reasoning steps with actions: generate a thought, take an action, observe the result, generate the next thought. It was a loop. Think, act, observe, repeat.

In practice it looked almost conversational, the model narrating its own work. Asked a question it could not answer from memory, it would write a thought (“I need to find out when this happened”), then an action (“search for the date”), then read back the result and decide what to do with it. The narration was not decoration. Forcing the model to state its reasoning before each move made the moves better, the same way a person who talks through a problem out loud catches mistakes a silent thinker misses. ReAct was the first hint that the path to an agent ran through reasoning, not around it.
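The think, act, observe cycle is small enough to sketch in a few lines of Python. What follows is an illustration under loud assumptions, not the ReAct authors' code: scripted_model is a hypothetical stand-in for a real language model, search is a toy lookup table rather than a search engine, and the transcript format is invented for the example.

```python
# Toy "search engine": a fixed lookup table, not a real API.
FACTS = {"apollo 11 landing date": "July 20, 1969"}


def search(query: str) -> str:
    """Return a canned result for a query, or a miss."""
    return FACTS.get(query.lower(), "no results")


def scripted_model(question: str, transcript: list[str]) -> tuple[str, str, str]:
    """Hypothetical stand-in for a language model.

    Returns (thought, action, argument). A real system would prompt a model
    with the transcript so far; this stub only checks whether an observation
    has already been recorded.
    """
    observations = [line for line in transcript if line.startswith("Observation: ")]
    if not observations:
        return ("I need to find out when this happened.", "search", "Apollo 11 landing date")
    answer = observations[-1].removeprefix("Observation: ")
    return ("The search result answers the question.", "finish", answer)


def react_loop(question, model, tools, max_steps=5):
    """Think, act, observe, repeat, until the model finishes or steps run out."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = model(question, transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, transcript
        observation = tools[action](argument)  # act, then observe the result
        transcript.append(f"Observation: {observation}")
    return None, transcript  # gave up without an answer


answer, transcript = react_loop("When did Apollo 11 land?", scripted_model, {"search": search})
print("\n".join(transcript))
```

The max_steps cap is a design choice of this sketch: a loop with no budget is exactly the kind that wanders.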

The idea drew on something deep in the prehistory of AI, the old dream of the software agent, the autonomous program that could be dispatched to accomplish a goal. In the 1990s, researchers had built “softbots” that could navigate Unix file systems and the early web. The dream had never died; it had only waited for the software to get smart enough. ReAct, and the techniques that followed it, were the moment the dream met a brain capable of improvising.

What made the loop work, when it worked, was that the model could now use tools.

The breakthrough that made tools practical arrived in June 2023, when OpenAI added “function calling” to its API. Developers could describe a set of functions to the model (a weather lookup, a database query, a code executor), and the model could respond not with prose but with a structured request to call one of them. The model became a router, deciding when to reach outside itself and how. Function calling turned the language model from an oracle into an operator.

The change sounded technical and was, in fact, profound. Before function calling, a developer who wanted a model to take an action had to coax it into producing text in a precise format, then parse that text and hope the model had not improvised. It was brittle and embarrassing, the AI equivalent of slipping a note under a door and praying the right thing happened on the other side. Function calling made the contract explicit. The model knew which tools existed, what arguments they took, and how to ask for them. The hallucinated half-step between intention and action was, if not eliminated, at least cornered. Within months, every serious AI application was built on top of it.
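The explicit contract can be sketched without reference to any particular vendor. The example below is an assumption-laden illustration, not OpenAI's API: the TOOLS registry, the get_weather function, and the JSON shape of the request are all hypothetical. It shows only the general idea, that the model emits a structured request and the application checks it against the declared tools before running anything.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool returning canned data; a real one would call a service."""
    return f"Sunny in {city}"


# Hypothetical registry: each tool declares its description and argument types.
TOOLS = {
    "get_weather": {
        "description": "Look up the current weather for a city.",
        "parameters": {"city": str},
        "function": get_weather,
    },
}


def dispatch(model_output: str) -> str:
    """Validate a structured tool request and execute it.

    Rejects unknown tools and malformed arguments instead of guessing,
    which is the point of making the contract explicit.
    """
    request = json.loads(model_output)
    name = request["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    spec = TOOLS[name]
    args = request.get("arguments", {})
    expected = spec["parameters"]
    if set(args) != set(expected):
        raise ValueError(f"expected arguments {sorted(expected)}, got {sorted(args)}")
    for key, expected_type in expected.items():
        if not isinstance(args[key], expected_type):
            raise TypeError(f"argument {key!r} must be {expected_type.__name__}")
    return spec["function"](**args)


# What a model's structured reply might look like under this made-up format.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'
print(dispatch(model_output))  # Sunny in Lisbon
```

Compare this with parsing free text: here a request for a tool that does not exist fails loudly at the boundary rather than silently somewhere downstream.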

Function calling did not come from nowhere. It was the productization of an idea researchers had been circling for two years: that a language model’s greatest limitation, its inability to do anything but generate text, could be turned into a strength if the text it generated could be made to mean something to another system.

Toolformer, a Meta paper from early 2023, had shown that a model could teach itself to use APIs (calculators, search engines, translation systems) by inserting candidate API calls into its own training text and keeping the ones that helped it predict what came next. WebGPT, from OpenAI back in 2021, had taught a model to browse the web to answer questions, clicking links and scrolling pages like a person. These were proofs of concept. Function calling made the capability a product.

Each had been chasing the same insight from a different angle. WebGPT had attacked the model’s ignorance: a system trained on text that stops at a certain date cannot know what happened after it, but a system that can search can go and find out. Toolformer had attacked the model’s incompetence at the things it was worst at, like arithmetic, by teaching it to hand those tasks off to a tool built to do them perfectly. The thread connecting them was a kind of humility built into the architecture, the recognition that a language model did not have to be good at everything if it knew when to ask for help. Function calling made that humility routine.

The progression from there was rapid. By late 2023, the major labs were racing to turn tool use into a platform. OpenAI shipped plugins, then a product it called GPTs, customizable agents that ordinary users could assemble without code. The vision was a kind of app store for autonomous helpers. The reality, at first, was a lot of half-working toys: clever for a demonstration, frustrating for daily use, abandoned as often as not within a session or two. The platform was real. The reliability that would have made anyone depend on it was not there yet.

The agent that came closest to the original dream arrived from Anthropic, and it did something none of the others had dared. In October 2024, Anthropic gave its Claude model the ability to use a computer the way a person does, looking at a screen, moving a cursor, clicking buttons, typing into fields. “Computer use,” the company called it, plainly. Instead of asking developers to wire up a tidy menu of functions, Anthropic let the model operate the same graphical interfaces that humans operate, pixels and all.

It was the first time a frontier lab had shipped an agent that drove a desktop. The demonstration was striking: ask the model to fill out a form, or pull data from a spreadsheet into a web app, and it would move the pointer across the screen and do it. The implication was larger than the demo. If a model could use any software a human could use, then the universe of things it could be asked to do was no longer limited to whatever functions a programmer had bothered to expose. It was limited only by what could be done on a screen, which is to say, almost everything.

Anthropic was unusually candid about the limits. The feature was released in beta, slow and error-prone, and the company warned in plain language that it could still make mistakes, misreading a screen, clicking the wrong thing, getting confused by an unexpected pop-up. The honesty was the point. The lab that had been founded on safety concerns was not pretending the agent was finished. It was showing the work.

While the general-purpose desktop agent stayed an impressive beta, a narrower kind of agent was quietly becoming useful. The place agents worked first was code. Software has a property the rest of the world lacks: it tells you immediately whether you got it right. A program either compiles or it doesn’t; a test either passes or it fails. An agent writing code lives inside a tight loop of action and feedback (make a change, run the tests, read the error, try again), and that feedback is exactly the thing that keeps a long chain of actions from drifting into nonsense. By 2024, coding agents that could take a description of a bug, find the relevant files, edit them, and verify the fix were doing real work for real developers. They were not general. They did not drive desktops or browse the open web. They did one kind of thing in an environment that punished mistakes instantly, and in that environment, the agentic loop finally held together.

Underneath every agent demo lay the same unglamorous problem, and it was a problem of arithmetic.

An agent that takes a single action and gets it right ninety-five percent of the time looks impressive. But agents do not take single actions. They take chains of them: read the email, search the database, draft the reply, check the calendar, book the meeting. And reliability compounds the wrong way down a chain. Ninety-five percent accuracy on one step is ninety-five percent. On ten steps in a row, each depending on the last and each failing independently, the chance of a clean run is about sixty percent. On twenty steps, it falls to roughly thirty-six. The longer the horizon, the more likely failure becomes.
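The arithmetic is easy to check. The sketch below assumes, as the paragraph above implicitly does, that every step succeeds with the same probability and that failures are independent; real agents violate both assumptions, so treat the numbers as a simplified model rather than a measurement. The function names are illustrative.

```python
import math


def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds (independent steps)."""
    return per_step ** steps


def longest_chain(per_step: float, target: float) -> int:
    """Longest chain whose overall success rate is still at least `target`."""
    return math.floor(math.log(target) / math.log(per_step))


for steps in (1, 10, 20):
    print(steps, round(chain_success(0.95, steps), 3))
# 1 0.95
# 10 0.599
# 20 0.358

print(longest_chain(0.95, 0.5))  # 13
```

Under these assumptions, a ninety-five percent agent is worse than a coin flip on any job longer than thirteen steps.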

This was the mathematics behind Auto-GPT’s wandering, and behind the gap between every dazzling agent demo and the dependable product that stubbornly refused to follow. A chatbot that gives a wrong answer is a nuisance; you read it, you notice, you ask again. An agent that takes a wrong action has already acted, sent the email, spent the money, deleted the file, and then built its next ten actions on top of the mistake. The errors cascaded.

Closing that gap became the central engineering problem of the agentic era. Labs attacked it from every direction: better planning, so the model could lay out its steps before committing to them; the reasoning models of the previous chapter, which could think harder about each move before making it; tighter supervision, with humans kept in or near the loop for the consequential steps; and narrower domains such as coding, customer support, and research, where the actions were constrained enough that the math could be made to work.

The other half of the answer was to let the agent check itself. A human who books a meeting glances at the calendar afterward to confirm it landed on the right day; an agent could be built to do the same, treating each action not as a fact but as a hypothesis to be verified before moving on. Verification turned out to be at least as important as the action itself. A model that could notice it had clicked the wrong button and undo it was worth far more than a model that clicked faster. Much of the progress of 2024 came not from agents that acted more boldly but from agents that doubted more usefully, that paused, looked at what they had done, and caught the error before it had children.
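Treating an action as a hypothesis has a simple shape in code: act, check, and undo if the check fails. The sketch below is one possible arrangement, not a description of any lab's system. The act_and_verify helper, the toy calendar, and the scripted sequence of days (which makes the first booking land on the wrong day) are all invented for illustration.

```python
def act_and_verify(action, verify, undo, max_attempts=3):
    """Run an action, confirm its effect, and roll back and retry on failure.

    Returns (result, attempts_used). Raises if no attempt passes verification,
    so an unverified step can never be silently built upon.
    """
    for attempt in range(1, max_attempts + 1):
        result = action()
        if verify(result):
            return result, attempt
        undo(result)  # roll back before trying again
    raise RuntimeError(f"step still unverified after {max_attempts} attempts")


# Toy world: a calendar, and an agent whose first booking is a mistake.
calendar: list[str] = []
scripted_days = iter(["Tuesday", "Wednesday"])  # hypothetical agent behavior


def book_meeting() -> str:
    day = next(scripted_days)
    calendar.append(day)
    return day


def landed_on_wednesday(day: str) -> bool:
    # Check the state of the world, not just what the action claimed.
    return bool(calendar) and calendar[-1] == "Wednesday"


def cancel_booking(day: str) -> None:
    calendar.remove(day)


day, attempts = act_and_verify(book_meeting, landed_on_wednesday, cancel_booking)
print(day, attempts, calendar)  # Wednesday 2 ['Wednesday']
```

The check reads the calendar itself rather than trusting the action's return value, which mirrors the glance a person gives the calendar after booking.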

None of this dissolved the arithmetic. It bent it. An agent that recovered from most of its mistakes could run a longer chain before the odds caught up with it, and a longer chain meant a bigger job it could be trusted to finish. Reliability did not arrive as a breakthrough. It accrued, percentage point by percentage point, as the unglamorous work of the era.
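How far recovery bends the curve can be estimated with a simplified model. Assume each step succeeds outright with probability p, and that when it fails the agent notices and repairs the failure with probability r, so the effective per-step success rate is p + (1 - p) * r. Both the model and the eighty percent recovery rate used below are illustrative assumptions, not measurements of any real system.

```python
def effective_step_success(p: float, recovery: float) -> float:
    """Per-step success once a fraction of failures is caught and repaired."""
    return p + (1.0 - p) * recovery


def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds (independent steps)."""
    return per_step ** steps


baseline = chain_success(0.95, 20)
p_eff = effective_step_success(0.95, 0.8)  # assumed: 80% of failures recovered
with_recovery = chain_success(p_eff, 20)

print(f"{baseline:.3f} {with_recovery:.3f}")  # 0.358 0.818
```

In this toy model the agent makes exactly as many first-try mistakes as before. The gain on a twenty-step job, from about thirty-six percent to about eighty-two, comes entirely from catching them.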

The agentic turn reframed what an AI product could be. A chatbot was something you talked to; an agent was something you handed a job and walked away from. The difference sounds small and is not. Delegation requires trust, and trust requires reliability, and reliability over long horizons was exactly the thing the technology did not yet have. Toran Bruce Richards had stumbled onto the right idea eighteen months before the field had built the machinery to support it. Auto-GPT was not wrong about where things were going. It was just early, a sketch of the destination drawn before anyone knew how to get there.

But the direction was set. By the end of 2024, every major lab had concluded that the future of its products lay in the agent behind the chat box, a system that could be handed a goal and trusted, increasingly, to carry it out. The chatbot had been the demo. The agent was the business.

And a business that ambitious needed an engine to match. Building models that could act, not merely answer questions, meant training runs of a scale that had never been attempted, and bills just as large. The race to build agents was, underneath, a race to spend.