
What happens when AI trains itself?

Artificial intelligence will soon run out of human sentences to learn from. What are its options then?
September 6, 2023

The Boott Cotton Mills Museum in Lowell, Massachusetts shakes and rattles with the movement of water-powered looms, massive and complex machines more than a century old. Housed in what is now a national park, a few dozen of the looms in a vast weave room have been put back into service. Visitors can get a taste of what factory life might have felt like from 1835—when the complex was built—into the early 20th century. A sign warns that the weave room is hot, loud and filled with cotton dust, and that visitors might find it overwhelming and need to leave.

Unfortunately, that option wasn’t available to thousands of workers who toiled in gruelling conditions to keep power looms running from Lowell in the US to Lancashire in the UK. From nimble-fingered children who reached into the works to re-tie broken threads, to grown men who loaded the massive bobbins of thread and unloaded the bolts of finished cloth, the automated marvel of the power loom was fed and cared for by armies of unautomated humans. The looms could produce cloth faster and cheaper than hundreds of professional weavers working in parallel, but they were powerless without hundreds of humans skilled enough to keep the machines in good order. 

We are starting to learn that modern AI systems have a lot in common with the power looms of the Industrial Revolution. These systems already generate impressively detailed images and texts, and their advocates promise they will transform, well, pretty much everything, from how we search for information, book a trip or shop for clothes to how we organise our workplaces and wider society. But much as the loom’s bulk obscures our sight of the children mending broken threads, the impressive achievements of these massive systems tend to blind us to the human labour that makes them possible.

When an image-generation program like Stable Diffusion produces an illustration from a written prompt—a blue bowl of flowers in the style of Van Gogh, say—it relies on massive sets of labelled data: images that show blue bowls, bowls of flowers and Van Gogh paintings, all carefully labelled by humans. Reporting in the Verge, Josh Dzieza interviewed some of the thousands—possibly millions—of workers who label these images from their computers in countries like Kenya and Nepal for as little as $1.20 an hour. Other annotators in the US give feedback to chatbots about which of their prompt responses are more conversational, receiving $14 an hour to provide the “human” in a process known as “reinforcement learning from human feedback”, or RLHF, the method that’s allowed ChatGPT to provide such lifelike results.
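To make that feedback loop a little more concrete, here is a minimal sketch of the preference-learning step at the heart of RLHF, in the spirit of the process described above rather than a description of any lab's actual pipeline. An annotator picks the better of two candidate replies, and a toy scoring model is nudged so that the chosen reply earns the higher reward. The single "conversationalness" feature and the example comparisons are invented purely for illustration.

```python
import math

# A toy version of the preference-learning step behind RLHF. The reward "model" here is
# a single weight applied to one invented feature; real reward models are large neural
# networks trained on huge numbers of human comparisons.

def feature(reply: str) -> float:
    # Hypothetical "conversationalness" score: longer replies that ask a question rate higher.
    return len(reply.split()) / 10.0 + (1.0 if "?" in reply else 0.0)

def reward(w: float, reply: str) -> float:
    return w * feature(reply)

def update(w: float, preferred: str, rejected: str, lr: float = 0.1) -> float:
    """One step of the pairwise preference loss: nudge the weight so the reply the
    annotator preferred earns a higher reward than the one they rejected."""
    margin = reward(w, preferred) - reward(w, rejected)
    # gradient of -log(sigmoid(margin)) with respect to w
    grad = -(1.0 - 1.0 / (1.0 + math.exp(-margin))) * (feature(preferred) - feature(rejected))
    return w - lr * grad

# Invented annotator judgements: (preferred reply, rejected reply) pairs.
comparisons = [
    ("Sure! What kind of flowers would you like in the bowl?", "Request received."),
    ("Happy to help. Could you tell me more about the style you want?", "No."),
]

w = 0.0
for preferred, rejected in comparisons * 20:
    w = update(w, preferred, rejected)

print(f"learned weight: {w:.2f}")
print(f"reward for the chatty reply: {reward(w, comparisons[0][0]):.2f}")
print(f"reward for the curt reply:   {reward(w, comparisons[0][1]):.2f}")
```

Once a reward model of this kind ranks replies the way annotators do, the chatbot itself is tuned to produce replies that score well, which is where the "reinforcement learning" in the name comes in.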


While these annotators are on the frontlines of feeding the machine, other human contributors may not be aware they are part of the AI supply chain. ChatGPT learned to write sonnets by ingesting Shakespeare and Donne, but it learned how to answer thousands of other questions from content published on the web. A team at the Washington Post worked with the Allen Institute for AI to study the “Common Crawl”, a giant dataset scraped from millions of publicly accessible websites and known to be primary raw material for large language models such as Google’s T5 and Facebook’s LLaMA. (OpenAI, the creator of ChatGPT, won’t disclose what data was used to train its model, but it may well include Common Crawl.) The sites that provided the most data to these massive AIs aren’t hard to predict: Wikipedia ranks second, and many of the top sites are respected newspapers.

But there’s weird stuff in there, too. My personal blog, with 20 years of my opinions on various topics, ranks 42,458th out of the millions of websites contributing data to these AIs, responsible for 0.0003 per cent of the “tokens” (the words and fragments of words) that these programs use to generate text in English. Reddit, the diverse set of conversations tended by teams of volunteer moderators, is far better represented at 540th. And a notorious repository of pirated books ranks 190th, with authors from JK Rowling to Hannah Arendt providing AIs with writing lessons.

Neither Google nor Facebook asked my permission before training AIs on my writing, and while I’m reacting to my new job as a tutor to AIs with bemusement, other authors are responding with litigation. Comedian Sarah Silverman has joined a set of authors in suing OpenAI and Meta, arguing they used her work without permission, credit or compensation. The New York Times has changed its terms of service to prohibit its content from being used to train AI systems, prompting speculation that it, too, may sue OpenAI for copyright infringement should negotiations between the companies over the licensing of the newspaper’s content break down.

Courts will need to decide whether ingesting a comedian’s jokes or a newspaper’s reporting constitutes copyright infringement, or fair use of copyrighted materials. If a court finds OpenAI or others to have infringed authors’ rights, the consequences could be significant: US law allows statutory damages of up to $150,000 for each work wilfully infringed, which could (theoretically) add up to billions of dollars in the case of the New York Times, whose archive of hundreds of thousands of articles has likely been incorporated into AI models. Projects like ChatGPT and Stable Diffusion might need to start again, with “clean”, licensed corpora and documented rights to all the data used to train their systems.

Even if AI entrepreneurs can find their way through these legal thickets, another, more existential barrier may cap the growth of large language models: the limits of human creativity. A recent paper from Epoch, a team of researchers focused on the future of AI, predicts that AI companies will run out of “high-quality” language data like “books, news articles, scientific papers, Wikipedia, and filtered web content” as soon as 2026. While the number of books and scientific papers authored each year is massive, it is also finite, and the appetites of large language models have grown exponentially. Companies like OpenAI may be able to train models on lower-quality data, such as comments on social media, but when it comes to filtered, edited content, the paper concludes: “The high-quality language stock will almost surely be exhausted before 2027 if current trends continue.”
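The arithmetic behind that forecast is simple compound growth: a finite stock of good text set against an appetite that multiplies every year. The little projection below makes the logic visible; the stock, the starting consumption and the growth rate are illustrative assumptions I have chosen for the sketch, not Epoch’s published estimates.

```python
# A back-of-the-envelope version of the "running out of data" argument.
# Every number below is an illustrative assumption, not Epoch's published estimate.

STOCK = 9e12          # assumed total stock of high-quality tokens (books, papers, filtered web)
usage = 1e12          # assumed tokens consumed by frontier training runs in 2023
GROWTH = 2.0          # assumed yearly multiplier on training-data appetite

year, consumed = 2023, 0.0
while True:
    consumed += usage                 # this year's training runs eat into the stock
    if consumed >= STOCK:
        break
    usage *= GROWTH                   # next year's models are hungrier still
    year += 1

print(f"Under these assumptions, the high-quality stock is exhausted around {year}.")
```

Change the assumed numbers and the date moves, but the shape of the argument stays the same: an exponential appetite always catches up with a fixed stock.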

But isn’t the point of these generative AIs to create high-quality images and text? Why not simply have ChatGPT generate billions of training texts once we run out of human-generated ones? A paper from UK researchers titled “The Curse of Recursion” warns that this may not be an option. One of its authors, Ross Anderson, has suggested that training AIs on their own outputs may be analogous to engines “choking on their own exhaust”. Subtle errors and biases are amplified each time a model learns from the output of its predecessor, and the rare, unusual material at the edges of the original data is the first to disappear, until the model—the statistical engine that generates original text, images or music—becomes useless. The authors of “The Curse of Recursion” warn that we may need to segregate genuinely human-authored texts from those written partially or entirely with generative AI, lest we trigger the process of model collapse. At the same time, they acknowledge that other scientists may find ways to mitigate such collapse, allowing AIs to learn from themselves. But for now, we still need humans not only to tend the looms, but to produce the threads—text and images—for AIs to weave together.
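The simplest way to see the danger is a toy version of the paper’s opening example: fit a model to some data, generate fresh “data” from that model, fit again, and repeat. In the sketch below the “model” is nothing more than a bell curve described by a mean and a spread, and the sample size and number of generations are arbitrary choices made for illustration.

```python
import random
import statistics

# Toy illustration of "model collapse": each generation of the "model" (here just a
# Gaussian, i.e. a mean and a spread) is fitted to data generated by its predecessor.
# The sample size and number of generations are arbitrary, illustrative choices.

random.seed(0)
SAMPLE_SIZE = 50

# Generation 0: "human" data drawn from a standard bell curve.
data = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]

for generation in range(1, 61):
    mu = statistics.fmean(data)       # fit the model to whatever data it was given...
    sigma = statistics.pstdev(data)
    # ...then produce the next generation's "training data" from the model itself.
    data = [random.gauss(mu, sigma) for _ in range(SAMPLE_SIZE)]
    if generation % 20 == 0:
        print(f"generation {generation:2d}: mean {mu:+.2f}, spread {sigma:.2f}")
```

Over enough generations the spread tends to dwindle and the mean wanders away from where it started: the toy equivalent of a model whose prose converges on a narrow, distorted version of the writing it first learned from.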

I’ve started to think of the recent wave of progress in generative AI through another industrial-age metaphor. Fuelled by profits from the power loom and the other fruits of industrialisation, tycoons extended US railroads west, settling land that had never been farmed before. 

This land was made agriculturally productive by damming rivers, but also by tapping underground aquifers that had accumulated water over thousands of years. Now some of those aquifers are nearly tapped out, and it will likely take thousands of years of rainfall to refill them. If contemporary AIs are built on thousands of years of human writing and imagery, what happens if we find our aquifers of information running dry?