Media & Culture

Encyclopedia Britannica is suing OpenAI for allegedly ‘memorizing’ its content with ChatGPT

Lawsuit claims GPT-4 outputs near-verbatim Britannica passages, cannibalizing publisher's web traffic.

Deep Dive

Encyclopedia Britannica and dictionary publisher Merriam-Webster have filed a copyright lawsuit against OpenAI, accusing the company of using their proprietary content without authorization to train models such as GPT-4. The core allegation is that OpenAI's systems have "memorized" copyrighted material, enabling them to output "near-verbatim copies" of significant portions of Britannica's text on demand. The complaint includes side-by-side examples in which entire passages of GPT-4 responses match the publishers' content word for word, and it presents these outputs as unauthorized copies derived from training data.

Beyond direct copying, the publishers argue OpenAI is "cannibalizing" their business by generating responses that "substitute, or directly compete" with their core encyclopedia and dictionary content. This, they claim, diverts web traffic that search engines would otherwise send to their sites. The case is the latest in a growing wave of legal challenges from content creators against AI companies, joining The New York Times' ongoing lawsuit against OpenAI and following Anthropic's recent $1.5 billion settlement with authors over the use of copyrighted books as training data. The outcome could set important precedents for how AI models are trained and what constitutes fair use of copyrighted material in the age of generative AI.

Key Points
  • Lawsuit alleges GPT-4 outputs "near-verbatim" copies of Britannica's copyrighted encyclopedia entries.
  • Publishers claim AI-generated responses substitute for their websites, diverting web traffic and "cannibalizing" their business.
  • Case follows similar lawsuits by The New York Times and a recent $1.5B settlement by Anthropic over training data.

Why It Matters

This lawsuit could redefine fair use for AI training and force major changes in how companies source data for models like GPT-4.