A federal judge in California ruled Monday that Anthropic likely violated copyright law when it pirated authors’ books to build a giant dataset and “forever” library, but that training its AI on those books without the authors’ permission constitutes transformative fair use under copyright law. The complex decision is one of the first of its kind among the high-profile copyright lawsuits authors and artists have brought against AI companies, and it’s largely a very bad one for authors, artists, writers, and web developers.
This case, in which authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson sued Anthropic, maker of the Claude family of large language models, is one of dozens of high-profile lawsuits brought against AI giants. The authors sued because Anthropic downloaded full copies of their books to train its AI models, pulling them from the now-notorious Books3 dataset as well as from the piracy websites LibGen and Pirate Library Mirror (PiLiMi). The suit also claims that Anthropic bought used physical copies of books and scanned them for AI training.
“From the start, Anthropic ‘had many places from which’ it could have purchased books, but it preferred to steal them to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it. So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated,” William Alsup, a federal judge for the Northern District of California, wrote in his decision Monday. “Anthropic’s next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.”
Notably, in addition to scraping the books to train its AI, Anthropic also created an internal, “general-purpose library” made up partially of pirated copyrighted works, kept for the “various uses” the company might have for them. Alsup wrote in his decision Monday that the creation of this “pirated library … points against fair use” and must be considered at trial. At a hearing in May, Alsup signaled that he was leaning toward this type of decision: “I’m inclined to say they did violate the Copyright Act but the subsequent uses were fair use,” he said.
“The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use,” Alsup wrote. “Anthropic employees said copies of works (pirated ones, too) would be retained ‘forever’ for ‘general purpose’ even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic’s pocketbook and convenience. We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness). That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”
Still, this is not a good decision for authors, because the judge ruled that actually training AI on the works was not illegal, though it is too early to say exactly what this means in the larger context. For now, it suggests that training an AI on legally purchased works is sufficiently transformative to count as fair use, but that pirating those works in the first place is not. The case did not address what this means for AI trained on free-to-access content from the open web, social media, libraries, and so on. It’s largely a win for AI companies, which, when faced with these sorts of lawsuits, have almost universally argued that their data scraping and training is legal as transformative fair use under copyright law, and that they therefore do not need to ask permission or pay compensation when they scrape the internet to build AI tools.
This lawsuit does not allege that Anthropic or Claude directly reproduced parts of the authors’ books for its users: “When each LLM was put into a public-facing version of Claude, it was complemented by other software that filtered user inputs to the LLM and filtered outputs from the LLM back to the user,” Alsup wrote in his order. “As a result, Authors do not allege that any infringing copy of their works was or would ever be provided to users by the Claude service. Yes, Claude could help less capable writers create works as well-written as Authors’ and competing in the same categories. But Claude created no exact copy, nor any substantial knock-off. Nothing traceable to Authors’ works. Such allegations are simply not part of plaintiffs’ amended complaint, nor in our record.”
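The order describes this guardrail layer only at a high level: software screens user prompts on the way into the model and screens the model’s output on the way back to the user. As a rough illustration of that pattern, and nothing more, here is a minimal Python sketch; every function name and check in it is hypothetical and invented for this example, not a description of Anthropic’s actual filtering software.

```python
# Hypothetical illustration only: a thin guardrail wrapper around an LLM call.
# None of these names or checks reflect Anthropic's actual software; the
# court order describes the filtering only at a high level.

def violates_input_policy(prompt: str) -> bool:
    # Invented placeholder: refuse prompts asking for verbatim reproduction.
    return "reproduce the full text" in prompt.lower()

def violates_output_policy(completion: str) -> bool:
    # Invented placeholder: a real system might scan for long verbatim
    # matches against known copyrighted works.
    return False

def guarded_generate(prompt: str, llm) -> str:
    """Filter the prompt on the way in and the completion on the way out."""
    if violates_input_policy(prompt):
        return "Sorry, I can't help with that."
    completion = llm(prompt)  # the raw, unfiltered model call
    if violates_output_policy(completion):
        return "Sorry, I can't share that output."
    return completion
```

The point of the pattern, as the order frames it, is that even if the underlying model could emit infringing text, the filters sit between the model and the user.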
Many other copyright lawsuits against AI companies argue not only that the companies trained on pirated copyrighted data, but that the AI tools they built then regurgitate large passages of those copyrighted works, either verbatim or in substantially similar form. Researchers found, for example, that Meta’s AI has “memorized” huge portions of books and will regurgitate them. This case, by contrast, largely considered whether the training itself violates copyright law.
“The use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act,” Alsup wrote in his order. “And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.”
In February, Thomson Reuters won a case against a competitor that it claimed had illegally scraped its works to train AI. Dozens of similar lawsuits are currently winding their way through the legal system, so it will likely take several more decisions before we get a full picture of how courts view the legality of mass, unauthorized AI data training.