Media & Culture

Book publishers sue Meta over AI’s ‘word-for-word’ copying

Five major publishers claim Meta stole copyrighted books to train Llama.

Deep Dive

Five major book publishers, Macmillan, McGraw Hill, Elsevier, Hachette, and Cengage, along with author Scott Turow, have filed a class-action lawsuit against Meta, accusing the company of large-scale copyright infringement in training its Llama AI models. The suit alleges Meta knowingly downloaded pirated copies of books and journal articles from notorious sites such as LibGen, Anna's Archive, Sci-Hub, and Sci-Mag, as well as from the Common Crawl dataset, which the complaint describes as 'full of unauthorized copies.' The publishers claim that Llama can reproduce entire sections verbatim: for example, when prompted with two sentences from Cengage's calculus textbook, the model generated a word-for-word continuation. They are asking the court for damages, an injunction halting Meta's alleged unlawful conduct, and a list of all copyrighted works used in training.

Meta has responded by asserting that training AI on copyrighted material constitutes fair use, a position that has seen mixed results in court. Last year, a federal judge ruled in Meta's favor in a similar lawsuit but cautioned that the decision did not amount to a blanket legal endorsement of such use. Meanwhile, Anthropic settled a similar case with authors for $1.5 billion. Meta spokesperson Dave Arnold stated, 'AI is powering transformative innovations, productivity and creativity for individuals and companies, and courts have rightly found that training AI on copyrighted material can qualify as fair use.' The company plans to fight the lawsuit aggressively.

Key Points
  • Five major publishers and author Scott Turow filed a class-action suit against Meta over copyright infringement in AI training.
  • Meta allegedly used pirated content from LibGen, Sci-Hub, and Common Crawl to train Llama models.
  • Example: Llama reproduced verbatim text from Cengage's calculus textbook when prompted with just two sentences.

Why It Matters

This case could reshape fair use boundaries for AI training, affecting how tech companies source data.