Part VII · Chapter 31

The Counterfeit and the Real

Image and video generation explode into deepfakes, the Taylor Swift fakes, and a wave of copyright suits led by the New York Times against OpenAI. → The first great copyright war of the AI age.

“What a bunch of malarkey. Voting this Tuesday only enables the Republicans in their quest to elect Donald Trump again.” — the AI-cloned voice of Joe Biden, robocall to New Hampshire voters, January 21, 2024

On the Sunday before the New Hampshire primary, the phones rang across the state and Joe Biden was on the line. Not really, of course. But the voice was his: the loose-jointed cadence, the verbal shrug, the catchphrase he had been using since the Senate. “What a bunch of malarkey,” it began, and any voter who had watched the man for forty years would have heard nothing wrong. The voice told Democrats that voting in Tuesday’s primary was a trap, that it would only help Donald Trump, that they should save their vote for November. It signed off with a callback number that belonged, when anyone bothered to dial it, to the treasurer of a New Hampshire Democratic group who had said no such thing.

The primary was January 23, 2024. The calls went out on the 21st, to thousands of households, in the voice of a sitting president telling members of his own party not to vote. By the time the New Hampshire attorney general’s office opened an investigation, the more interesting question was no longer whether the voice was Biden’s. It was who had made him say it.

The answer, it turned out, was a street magician in New Orleans named Paul Carpenter, who built the audio in under twenty minutes for something close to a dollar. Carpenter normally bent forks and read minds for tips on Bourbon Street. He had been hired by a political consultant, paid $150, given a script, and pointed at a website. The website was ElevenLabs. Carpenter typed the words, the model returned a Biden who had never spoken them, and a presidential voice entered an American election as a forgery that almost no listener could catch.

This is the part of the story that did not fit the old shape of political dirty tricks. A forged letter takes a forger. A doctored photo, in the era before all this, took a darkroom and a steady hand and someone who knew what they were doing. The Biden robocall took a magician with no relevant skill, a credit card, and twenty minutes. The asymmetry was the whole point, and it would define everything that followed. The thing was almost free to make and enormously expensive to answer. The FCC, the New Hampshire courts, voice-forensics labs, and the national press would spend the next two years on something that had cost roughly the price of a coffee.

The consultant who hired Carpenter was Steve Kramer, and Kramer was working, at the time, for the campaign of Dean Phillips, a Minnesota congressman mounting a long-shot Democratic challenge to Biden. The Phillips campaign disavowed any knowledge of the calls, and there is no evidence it knew. Kramer’s own explanation, once he was caught, was that he had done it as a public service: a demonstration, he said, of how dangerous the technology had become, a fire alarm pulled on purpose so the country would finally smell smoke. Whether anyone believed him hardly mattered. He had built the demonstration, and the demonstration worked.

The institutions moved with unusual speed for institutions. On February 8, 2024, less than three weeks after the calls, the Federal Communications Commission issued a declaratory ruling that AI-generated voices in robocalls were illegal under the Telephone Consumer Protection Act, the same law written to stop telemarketers, now pressed into service against synthetic presidents. The agency went on to propose a $6 million fine against Kramer. Lingo Telecom, the carrier that had let the calls onto its network with falsified caller-ID authentication, faced a proposed $2 million penalty and settled in August 2024 for $1 million and a promise to follow the rules it had broken. New Hampshire brought criminal charges against Kramer: felony voter suppression, misdemeanor impersonation of a candidate. The voice that cost a dollar had generated millions of dollars in penalties and a federal rulemaking, which is a strange kind of efficiency.

What the New Hampshire robocall announced was a new physics rather than a new crime. For the entire history of recording, from Edison’s tinfoil cylinder to the smartphone in a witness’s pocket, a recording of a voice had carried a presumption. It was evidence. It meant the person had been there and had said the thing. That presumption had survived photography, film, and audiotape because faking any of them convincingly was hard, slow, and expensive enough to be rare. The presumption was now gone. No state intelligence service had taken it down. The thing that did was a company two friends from Poland had started in a moment of irritation at the way movies sounded.

Mati Staniszewski and Piotr Dąbkowski had known each other since school. Staniszewski went to Palantir; Dąbkowski to Google as a machine-learning engineer. The grievance they shared was specific and small, and it became the company’s founding myth because it was true. In Poland, foreign films were not dubbed the way they were in Germany or France, with a full cast of voice actors. They were read. A single male narrator, the lektor, recited every line of dialogue, every character, man and woman and child, in a flat unbroken monotone laid over the quieted original. Generations of Poles watched Hollywood through this one exhausted voice. Staniszewski and Dąbkowski had grown up inside it, and the question they kept returning to was why, in 2022, a machine could not simply do better, could not give every character a real voice, in any language, with the feeling intact.

They founded ElevenLabs in 2022. The technical lineage they were building on had been assembling for years. DeepMind’s WaveNet in 2016 had shown that a neural network could generate raw audio sample by sample, producing speech that no longer sounded like a robot reading a manual; it was too slow to be practical but it proved the thing was possible. Google’s Tacotron and Tacotron 2, in 2017 and 2018, turned text into spectrograms and spectrograms into sound with a fluency that closed much of the remaining gap. Then the question shifted from making a good voice to making a specific person’s voice from almost nothing. By early 2023 Microsoft’s VALL-E could clone a speaker from a three-second sample by treating audio as a sequence of discrete tokens, the same trick that had made language models work, applied to sound. The technology to counterfeit a human voice from a few seconds of it was, by 2023, a research result anyone could read.

What ElevenLabs did was make it a product. Their model was accurate, and beyond that it was expressive. It caught the breath, the hesitation, the rising heat of an argument, the things that make a voice sound like a person rather than a transcript being read aloud. They put it behind a clean website with a free tier. You pasted text, picked a voice or supplied a sample of one, and got back audio good enough to fool the person whose voice it was. The same capability that could finally give the lektor’s films a real cast could give a New Orleans magician a president.

The dual-use problem arrived almost immediately, and it arrived in the ugliest possible form. Within days of the January 2023 beta, users on 4chan had cloned the voices of celebrities and made them say things designed to wound. A synthetic Emma Watson reading passages of Mein Kampf was one of the clips that circulated, chosen precisely because it paired an admired voice with the most repellent text the trolls could find. The company had been live for less than a week. ElevenLabs responded the way the industry would learn to respond: it moved voice cloning behind paid tiers and identity friction, added provenance tooling, and began building detection for its own outputs. None of it was sufficient and the company knew it was not sufficient, because the underlying capability was now loose in the world and a paid tier is a speed bump, not a wall.

The market did not punish ElevenLabs for any of this. It rewarded the company spectacularly. A $19 million Series A in 2023 was followed, in January 2024, the same month as the Biden robocall, by an $80 million Series B led by Andreessen Horowitz that valued the company at roughly $1.1 billion. A year after that, in January 2025, a $180 million Series C pushed the valuation to $3.3 billion. Both were true at once: the technology was a billion-dollar business and a loaded weapon, and the same people built both, and there was no version of one without the other.

If voice fell first because it was easiest, video came next because it was hardest, and the hardest problems attract the most ambition. An image has to be coherent in two dimensions and one instant. A video has to be coherent across time. A face has to stay the same face from frame to frame, a coffee cup has to remain on the table when the camera looks away and comes back, a person who walks behind a pillar has to come out the other side. Object permanence, the thing a human infant masters before it can speak, is brutally hard for a generative model, and generating video demands orders of magnitude more compute than generating a still image. The early attempts looked like it. Meta’s Make-A-Video and Google’s Imagen Video, both released in the autumn of 2022, produced clips that were a few seconds long, low resolution, and visibly unstable, the way a dream is unstable, fine until you look at any detail twice.

The company that turned video generation from a research demo into a tool was Runway, and Runway had come to it sideways. Cristóbal Valenzuela, a Chilean who had studied economics before falling into code and art, met Anastasis Germanidis and Alejandro Matamala at NYU’s Interactive Telecommunications Program, the downtown-Manhattan workshop where artists go to learn just enough engineering to be dangerous. They founded Runway in 2018 with a creative-tools sensibility rather than a research-lab one. Their early users were filmmakers and editors who wanted machine learning inside their existing software, not researchers chasing benchmarks. When the diffusion wave broke in 2022, Runway was close to its center. The company contributed to the open release of Stable Diffusion that year. Then it made a bet that the others had not yet made, that the real prize was the moving image rather than the still one.

Runway’s Gen-1, in February 2023, restyled existing video, turning a clip of a man into a clip of a marble statue of the same man moving the same way. Gen-2, that summer, went the rest of the distance to text. Type a sentence, get a few seconds of video that no camera had recorded. The clips were short and strange and you could feel the seams, but they were generated from nothing but words, and filmmakers started using them in real projects, which was the line research demos rarely crossed. Then, on February 15, 2024, less than four weeks after the Biden robocall and a week after the FCC issued its ruling, OpenAI published Sora, and the conversation changed register again.

Sora generated clips up to a minute long with a coherence that the field had not expected for years. A woman walking down a neon Tokyo street, her sunglasses reflecting the signs, her gait consistent step to step; a litter of golden-retriever puppies in snow; drone footage of a coastline that no drone had flown. OpenAI’s framing was deliberately grand. The technical report was titled “Video generation models as world simulators,” and the claim embedded in that title was that to predict the next frame convincingly, a model had to learn something about how the world works: that water flows downhill, that a bitten cookie shows teeth marks, that a body in motion stays in motion. The architecture was a diffusion model with a transformer’s backbone, operating on small patches of space and time rather than on whole frames, the spacetime equivalent of the tokens that language models chew through. Whether Sora “understood” physics was the kind of question that would occupy people for years, and it had its skeptics; the demos also showed a chair drifting impossibly, a person’s legs swapping places mid-stride. But the visceral fact was harder to argue with. The clips were good enough that you wanted to believe them.
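The idea of cutting video into patches of space and time can be made concrete with a toy sketch. This is an illustration only, not OpenAI’s code: the function name `spacetime_patches`, the 2×2×2 patch size, and the tiny grayscale “video” are all assumptions, and the technical report describes Sora as patching a compressed latent representation rather than raw pixels as done here.

```python
from itertools import product

def spacetime_patches(video, pt, ph, pw):
    """Split a video (nested lists indexed [t][y][x]) into flattened patches.

    Each patch spans pt frames, ph rows, and pw columns, and becomes one
    "token": a flat list of pt * ph * pw pixel values. (Hypothetical helper;
    the patch sizes are illustrative, not taken from any real model.)
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    # Walk the top-left-earliest corner of every patch in time, then rows, then columns.
    for t0, y0, x0 in product(range(0, T, pt), range(0, H, ph), range(0, W, pw)):
        tokens.append([
            video[t][y][x]
            for t in range(t0, t0 + pt)
            for y in range(y0, y0 + ph)
            for x in range(x0, x0 + pw)
        ])
    return tokens

# A toy 4-frame, 4x4 "video" whose pixel value encodes its own position.
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # prints: 8 8
```

Each token here mixes pixels from two consecutive frames, which is the point: the sequence the transformer sees already carries a little motion in every element, instead of one frozen frame at a time.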

OpenAI did not release Sora to the public on the day it announced it. For most of 2024 the model existed as a controlled preview, available to red-teamers probing it for harm and to a handful of artists, and the restraint was itself a kind of admission, that a tool which could manufacture convincing footage of anything was not something you simply switched on for the world the way you might a chatbot. The public version, branded Sora Turbo, arrived in December 2024, by which point the field was crowded: Runway shipping Gen-3 Alpha, Luma’s Dream Machine, Pika, Kuaishou’s Kling out of China, Google’s Veo climbing toward photorealism. Runway, for its part, was reportedly valued above $3 billion by 2025. The race that had started with three-second dream-fragments in 2022 had, in roughly two years, reached the point where the technology no longer limited what you could fake. The only thing left was finding a reason not to.

The reasons were everywhere, and most of them were not about elections. The word “deepfake” had entered the language in 2017, coined by a Reddit user who went by “deepfakes” and used early face-swapping code to paste celebrities’ faces onto pornography. That was the technology’s first and, by volume, overwhelming application. A 2019 study by the firm Deeptrace found that the great majority of deepfake videos online (96 percent, by its count) were nonconsensual pornography, almost all of it targeting women. The political deepfakes that frightened legislators were the rare case. The common case was a private cruelty, a face stolen and grafted onto a body in an act it never performed, and the victims were overwhelmingly ordinary women with no recourse.

In January 2024, the same month as the robocall and the ElevenLabs funding round, the cruelty went to scale and hit a target large enough that the world finally looked. Sexually explicit AI-generated images of Taylor Swift spread across X over a span of hours; one of them was viewed tens of millions of times before it came down. The images had reportedly originated in a Telegram group and on 4chan, generated with consumer tools by people treating it as a game. X, unable to find and remove the images fast enough, fell back on a blunt instrument and simply blocked searches for “Taylor Swift” entirely. A platform with hundreds of engineers, defeated badly enough that its emergency response was to make the most famous woman in the world unsearchable on its own service. If it could happen to her, with her lawyers and her leverage and her millions of defenders, the question of what protected anyone else answered itself. Within weeks the incident had become a talking point in Congress, cited by senators of both parties as evidence that the law had fallen behind the tools, and it gave the long-stalled push for federal limits on nonconsensual intimate images a momentum it had never had.

So the counter-effort assembled, and it had two broad strategies, both of them losing ground from the start. The first was detection: train classifiers to spot the artifacts a generator leaves behind, the too-smooth skin, the wrong number of teeth, the blink that never comes, the spectral fingerprint of synthesized speech. Pindrop, the voice-fraud firm that traced the Biden robocall back to ElevenLabs, was a detection company, and the trace was a genuine win. The firm’s analysts had compared the synthetic audio against samples from the major voice engines and matched its fingerprints to ElevenLabs with confidence, which is how a magician in New Orleans came to be identified at all. But detection is structurally a step behind. Every classifier is trained on the outputs of yesterday’s generators, and tomorrow’s generator is built, often explicitly, to defeat exactly those classifiers. Hany Farid, the Berkeley forensics scientist who had spent a career on image authentication, said out loud the part that everyone in the field knew. The people making the fakes had vastly more money and momentum than the people catching them, and the gap was widening.

The second strategy ran the other direction. Instead of detecting fakes after the fact, mark the real or mark the synthetic at the moment of creation. Google DeepMind’s SynthID embedded an imperceptible watermark directly into generated images starting in 2023, then extended it to audio, video, and even text by 2024, weaving a signal into the very choices a model makes so that the output could later be identified as machine-made. The C2PA standard, developed by the Coalition for Content Provenance and Authenticity and backed by Adobe, Microsoft, and a long roster of others, took the opposite tack: cryptographically sign content at the point of capture, so a photograph could carry a tamper-evident record of where it came from and what had been done to it, a chain of custody for pixels. Both ideas were sound. Both shared a fatal property. A watermark can be stripped by a screenshot or a re-encode; provenance is opt-in, and the people making forgeries are precisely the people who will not opt in. You can label every photo your camera takes as real, and it does nothing about the fake, which simply carries no label and looks no different.
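The logic of signing at capture, and the hole in it, fits in a few lines. What follows is a minimal sketch and not the C2PA format: real provenance systems sign with public-key certificates and embed a structured manifest, while this toy uses a shared-secret HMAC from the standard library as a stand-in. The key, the function names `sign_at_capture` and `check`, and the sample bytes are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical device key. Real provenance schemes use public-key
# certificates, not a shared secret; HMAC is a stand-in for this sketch.
CAMERA_KEY = b"demo-key-not-a-real-credential"

def sign_at_capture(content: bytes, key: bytes = CAMERA_KEY) -> dict:
    """Return a toy manifest: the content hash plus a keyed signature over it."""
    digest = hashlib.sha256(content).hexdigest()
    signature = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"sha256": digest, "signature": signature}

def check(content: bytes, manifest, key: bytes = CAMERA_KEY) -> str:
    """Classify content as 'verified', 'tampered', or 'unlabeled'."""
    if manifest is None:
        return "unlabeled"  # nothing to verify: says nothing either way
    digest = hashlib.sha256(content).hexdigest()
    expected = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    intact = digest == manifest["sha256"] and hmac.compare_digest(
        expected, manifest["signature"]
    )
    return "verified" if intact else "tampered"

photo = b"raw sensor bytes from a real camera"
manifest = sign_at_capture(photo)

print(check(photo, manifest))              # prints: verified
print(check(photo + b"edit", manifest))    # prints: tampered
print(check(b"synthetic pixels", None))    # prints: unlabeled
```

The third result is the whole problem. The scheme can vouch for what was signed and can flag what was altered after signing, but a forgery that never entered the system comes back as “unlabeled,” the same answer an honest photo from an old phone would get.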

This is where the chapter’s real subject comes into focus. The subject is what the mere possibility of fakes does to everything that is real. In 2018 the law professors Bobby Chesney and Danielle Citron had named it in a paper on deepfakes: the liar’s dividend. Their insight was perverse and exactly correct. As the public absorbs the fact that audio and video can be convincingly faked, the value of a fake goes up, and so does the value of a denial. The real recording loses its power to compel belief. A politician caught on tape can now say the tape is a deepfake, and enough people will entertain the doubt that the tape stops mattering. The dividend is paid to the liar, and the cruelest part of the equation is that it grows as public awareness grows. The more thoroughly we teach people that fakes exist, the easier we make it for the guilty to dismiss the truth.

Sam Gregory, who ran the human-rights organization Witness, had been warning about exactly this side of the ledger before the celebrities and the elections made it fashionable. His constituency was the activist filming a beating, the witness recording a massacre on a phone, the citizen-journalist whose footage was sometimes the only evidence a crime had occurred. For them the deepfake threat had less to do with someone faking an atrocity. The real danger was that every real atrocity could now be waved away as faked, that a government caught on video could shrug and say the video was generated, and that the burden of proving authenticity, which used to belong to the doubter, would shift onto the person who had risked their life to record the truth. The collapse ran in both directions. Anything could be faked, so nothing had to be believed. Gregory’s organization had spent years pushing for provenance standards precisely because it understood that the people least able to prove their footage was real were the people whose footage mattered most, and that an authentication regime built around expensive cameras and corporate sign-on would protect the powerful long before it protected the witness in a war zone with a five-year-old phone.

Governments did what governments do, which was legislate against the last incident. The FCC’s robocall ruling was the fast, narrow response. The European Union’s AI Act, which entered into force on August 1, 2024, took the broad approach, requiring that deepfakes and AI-generated content be disclosed and labeled, putting the force of law behind the provenance ideas the industry was still struggling to make stick. In the United States, where federal legislation moved slowly, the states filled the gap, with a wave of laws in 2024 aimed at deepfakes in elections and nonconsensual intimate images, and Congress took up the NO FAKES Act to give people a property right in their own voice and likeness. Denmark went furthest in spirit, proposing in 2025 to give every citizen what amounted to copyright over their own face and voice, the recognition, finally, that in a world where anyone could be synthesized, a person might need to own themselves as a matter of law. The platforms added labels. None of it touched the asymmetry. The forgery still cost a dollar and the response still cost millions, and no statute had repealed that arithmetic.

Underneath the funding rounds and the fines and the dueling acronyms, something quieter had shifted in what a recording meant. For a century and a half a recording had been a kind of testimony, imperfect, sometimes staged, but anchored to an event that had occurred. The chain from world to recording to belief had held, mostly, because breaking it was hard. Voice cloning and video generation broke it, and they broke it as their core function rather than as a side effect, because the function of these tools is to produce convincing media of things that never happened. The triumph and the catastrophe were the same achievement. A model good enough to give the lektor’s films a real cast is a model good enough to put words in a president’s mouth, and there is no model that can do the first without doing the second, because they are the identical capability pointed at different intentions.

The puppies in the snow and the woman on the Tokyo street were beautiful, and they were also a demonstration, exactly as Steve Kramer had claimed his robocall was a demonstration. Both showed the same thing. The line between the counterfeit and the real, which everyone had assumed was a property of the world, turned out to be a property of how expensive forgery happened to be. Make forgery cheap enough and the line does not move. It dissolves. By the time the institutions understood that, the cost had already fallen to a dollar, and it was still falling.

These were the easy media, in the end, voice and video, things that move through time and trade on resemblance. The harder generative problems were the ones where there was no obvious original to copy and no clear way even to say whether the result was good: a song that had never been sung, a building that had never been built, a world rendered from a single photograph. Those would test something more basic than the line between fake and real: what it means for a machine to make something at all, and how, when no human made it, anyone could possibly judge whether it was any good.