Millions of Used Books Cut and Scanned for Anthropic's Claude Training | GeekNews
Key Points
- A judge ruled that Anthropic's practice of purchasing, scanning, and digitizing millions of used books for AI model training constitutes "exceedingly transformative" fair use.
- Conversely, the ruling explicitly stated that Anthropic's separate downloading and use of over 7 million pirated books was not fair use and amounted to copyright infringement.
- This landmark decision sets a crucial legal precedent distinguishing legitimate data acquisition from piracy in AI training, with implications for ongoing copyright disputes across the AI industry.
A recent ruling by Judge William Alsup in the U.S. District Court for the Northern District of California has set a significant precedent regarding copyright and AI model training, specifically concerning Anthropic's AI chatbot, Claude. The case involved Anthropic's use of copyrighted materials, primarily books, to train its large language model (LLM).
The core of Anthropic's data acquisition methodology involved two distinct approaches:
- Legal Acquisition and Digitization: Anthropic invested millions of dollars to purchase a massive quantity of second-hand physical books. These books underwent an intensive digitization process: the bindings were removed, the pages were cut, and the loose pages were scanned with high-volume industrial scanners to produce digital files. These files were stored in an internal research library, and the original physical books were discarded. This process implies that optical character recognition (OCR) technology was employed to extract machine-readable text from the scanned images for subsequent model training.
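The destructive-scan workflow described above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's actual pipeline: the `ocr_page` function is a stand-in for a real OCR engine (e.g. Tesseract), and all names here are hypothetical, since the article does not describe the tooling used.

```python
from dataclasses import dataclass

@dataclass
class ScannedPage:
    """A single page image produced by a high-volume scanner."""
    book_id: str
    page_number: int
    image_bytes: bytes  # raw scan data; a real pipeline would hold TIFF/PNG


def ocr_page(page: ScannedPage) -> str:
    """Stub for an OCR engine (e.g. Tesseract).

    Here we simply decode the bytes so the sketch runs end to end; a real
    pipeline would run an OCR library on the page image instead.
    """
    return page.image_bytes.decode("utf-8")


def digitize_book(book_id: str, pages: list[ScannedPage]) -> str:
    """Order the cut pages, OCR each one, and join them into a single text."""
    ordered = sorted(pages, key=lambda p: p.page_number)
    return "\n".join(ocr_page(p) for p in ordered)


# Tiny end-to-end run with fake "scans" (pages arrive out of order).
pages = [
    ScannedPage("book-001", 2, b"Second page."),
    ScannedPage("book-001", 1, b"First page."),
]
text = digitize_book("book-001", pages)
print(text)
```

The key design point the sketch captures is that once the text is extracted, the digital file (not the destroyed physical book) becomes the archival copy in the internal library.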
- Illegal Acquisition (Pirated Downloads): Separately, Anthropic admitted to downloading over 7 million pirated e-books. Specifically, co-founder Ben Mann acknowledged downloading at least 5 million books from Library Genesis in 2021 and an additional 2 million from Pirate Library Mirror in 2022. These sources typically provide pre-digitized files (e.g., PDFs, EPUBs), bypassing the need for physical scanning.
For the training of the Claude LLM, the textual data from both sources, along with other materials like social media posts and videos, was processed. This involved tokenization, where raw text is converted into numerical tokens (sub-word units). These tokens form the massive dataset, typically hundreds of billions to trillions of tokens, that is fed into a Transformer-based neural network architecture. The model learns intricate patterns, statistical relationships, grammar, and semantic information by predicting the next token in a sequence (or filling masked tokens) through an iterative optimization process (e.g., stochastic gradient descent) that minimizes a loss function (e.g., categorical cross-entropy) across its vast number of parameters. The output of the trained model is generated text, which is statistically derived from the learned patterns, rather than a direct reproduction of the input data.
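A toy illustration of that training objective: the bigram model below (an illustrative stand-in for a Transformer, not Anthropic's actual setup) tokenizes text, estimates next-token probabilities from counts, and scores a sequence with the average cross-entropy a training loop would minimize. Note that what the "model" stores is a table of statistics derived from the text, not a copy of the text itself.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer; production LLMs use sub-word schemes like BPE."""
    return text.lower().split()


def train_bigram(corpus: str) -> dict[str, dict[str, float]]:
    """Estimate P(next token | current token) from raw bigram counts."""
    counts: dict[str, Counter] = defaultdict(Counter)
    tokens = tokenize(corpus)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }


def cross_entropy(model: dict[str, dict[str, float]], text: str) -> float:
    """Average negative log-likelihood of next-token predictions (the loss)."""
    tokens = tokenize(text)
    nll = 0.0
    for cur, nxt in zip(tokens, tokens[1:]):
        p = model.get(cur, {}).get(nxt, 1e-9)  # tiny floor for unseen pairs
        nll -= math.log(p)
    return nll / (len(tokens) - 1)


corpus = "the cat sat on the mat the cat ran"
model = train_bigram(corpus)
print(model["the"])  # learned statistics, e.g. {'cat': 0.666..., 'mat': 0.333...}
print(cross_entropy(model, "the cat sat"))
```

Scaling this idea up, a real LLM replaces the count table with billions of neural-network parameters and the counting step with stochastic gradient descent on the same kind of cross-entropy loss, which is why its outputs are statistically derived rather than retrieved verbatim.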
The judge's ruling made a critical distinction between these two data acquisition methods:
- Fair Use for Legally Purchased and Scanned Books: Judge Alsup ruled that Anthropic's act of purchasing physical books, digitizing them for an internal library, and using this data to train its AI model constituted "exceedingly transformative" fair use. The reasoning behind this was that Anthropic's LLM does not merely replicate or substitute existing documents. Instead, it learns from the data to "create something entirely different." The LLM's internal representations are abstract statistical models, not copies of the original works, and its outputs are not intended to be replacements for the original copyrighted materials. This ruling emphasizes that the *purpose* and *character* of the use, specifically the highly transformative nature of AI learning, can justify what might otherwise be considered extensive copying. The fact that Anthropic created a private, internal digital library from purchased copies, without re-distributing the digital copies themselves, was also a key factor.
- Copyright Infringement for Pirated Books: Conversely, the judge unequivocally stated that using pirated e-books (from sources like Library Genesis and Pirate Library Mirror) for training purposes does *not* constitute fair use and is a clear act of copyright infringement. The ruling explicitly noted that Anthropic had no right to use these illegally obtained copies, and the justification of building a "permanent, universal library" internally does not legitimize the use of pirated materials. This indicates that while the *transformative nature* of AI training can support a fair use claim, the *legality of data acquisition* remains paramount.
This judgment is considered a significant turning point in the ongoing debate about copyright and AI, setting a precedent that differentiates between legitimate data sourcing and illegal acquisition for AI training. It highlights that the method of obtaining data and its legal standing profoundly impact fair use determinations, even if the ultimate use (AI training) is considered transformative. The ruling comes amidst a wave of similar lawsuits against AI companies, underscoring the legal challenges faced by the AI industry in balancing innovation with intellectual property rights.