Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs

New research shows LLMs can memorize and reproduce large portions of books verbatim after finetuning.

Deep Dive

A new arXiv paper shows that finetuning large language models can cause them to reproduce copyrighted book content verbatim. The paper's repository includes code for finetuning models like GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1, and uses excerpts from Cormac McCarthy's 'The Road' as an example. The authors provide a preprocessing pipeline that splits books into 300-500 word chunks, generates plot summaries, and then finetunes models to complete those excerpts. The paper evaluates memorization using metrics like the longest contiguous regurgitated span. The repository notes that full book content and model generations are not included because they contain large portions of verbatim text.
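The preprocessing step described above can be sketched as a simple word-level chunker. This is an illustrative reconstruction under stated assumptions, not the paper's actual code: the function name and the merge-short-tail behavior are hypothetical, and the paper's pipeline may split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, min_words: int = 300, max_words: int = 500) -> list[str]:
    """Split a book's text into excerpts of roughly min_words..max_words words.

    Hypothetical sketch of the paper's chunking step: cut every max_words
    words, then merge a too-short final remainder into the previous chunk
    so no excerpt falls below min_words (when the text is long enough).
    """
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        chunks[-2].extend(chunks.pop())
    return [" ".join(c) for c in chunks]


if __name__ == "__main__":
    sample = ("word " * 1200).strip()
    print([len(c.split()) for c in chunk_text(sample)])  # [500, 700]
```

Each resulting excerpt would then be paired with a generated plot summary and used as a finetuning completion target, per the paper's description.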

Key Points
  • Finetuning GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 on copyrighted books can trigger verbatim recall of 300-500 word excerpts
  • Study used temperature 1.0 and 100 completions per excerpt to maximize memorization detection
  • Open-source pipeline available for testing memorization in other finetuned models
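A metric like the paper's "longest contiguous regurgitated span" can be sketched as a word-level longest-common-substring computation between a model generation and the reference excerpt, maximized over sampled completions. The dynamic program below is a minimal illustration, not the paper's implementation; the function name and word-level tokenization are assumptions.

```python
def longest_regurgitated_span(reference: str, generation: str) -> int:
    """Length (in words) of the longest contiguous word sequence that
    appears in both reference and generation.

    Standard longest-common-substring DP over words: prev[j] holds the
    length of the common run ending at ref[i-1] / gen[j-1].
    """
    ref, gen = reference.split(), generation.split()
    best = 0
    prev = [0] * (len(gen) + 1)
    for i in range(1, len(ref) + 1):
        curr = [0] * (len(gen) + 1)
        for j in range(1, len(gen) + 1):
            if ref[i - 1] == gen[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

In the study's setup, one would sample 100 completions per excerpt at temperature 1.0 and report the maximum span across samples, e.g. `max(longest_regurgitated_span(excerpt, g) for g in generations)`.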

Why It Matters

The finding exposes a gap in current LLM safeguards: finetuning can unlock memorized training data that the base model would not surface, creating copyright-infringement and data-leakage risks in production systems that offer finetuning.