Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
New detection tool reveals LLM-modified text in 6.5-16.9% of review text at top AI conferences.
A team of 12 researchers from Stanford University and other institutions has published a study titled "Monitoring AI-Modified Content at Scale," presenting a novel maximum likelihood model for detecting large language model (LLM) influence in large text corpora. The researchers applied their tool to more than 15,000 peer reviews submitted to four major AI conferences (ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023) following ChatGPT's release. Between 6.5% and 16.9% of review text showed signs of substantial LLM modification, meaning the AI assistance went beyond simple spell-checking to substantive content generation. This is the first large-scale empirical measurement of how LLMs are infiltrating academic peer review, the very system meant to evaluate AI research itself.
The study's technical approach leverages both expert-written and AI-generated reference texts to estimate the probability that any given text segment was LLM-modified, enabling corpus-level analysis that individual detection tools miss. The data reveals clear behavioral patterns: reviewers who submitted close to deadlines, expressed lower confidence in their assessments, or were less likely to respond to author rebuttals showed significantly higher rates of AI-generated text. These corpus-level trends, which might be undetectable in individual cases, suggest LLMs are being used as a productivity crutch under time pressure or uncertainty. The researchers call for urgent interdisciplinary work to examine how LLM use is changing information and knowledge practices, as the peer review system—already strained—now faces the meta-problem of evaluating AI-generated critiques of AI research.
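The estimator behind these corpus-level numbers can be sketched as a two-component mixture: the corpus is modeled as a blend of a human-written token distribution P and an LLM-generated distribution Q, and the blend weight alpha (the estimated fraction of LLM-modified text) is fit by maximum likelihood. The toy sketch below assumes made-up token frequencies for four adjectives; the actual study estimates these distributions from expert-written and AI-generated reference reviews.

```python
import math

# Hypothetical token probabilities (assumed for illustration only):
# p = probability of each token under the human-written reference corpus,
# q = probability under the LLM-generated reference corpus.
p = {"notable": 0.2, "commendable": 0.1, "solid": 0.4, "innovative": 0.3}
q = {"notable": 0.3, "commendable": 0.4, "solid": 0.1, "innovative": 0.2}

def log_likelihood(alpha, tokens):
    """Log-likelihood of the corpus under the mixture (1-alpha)*P + alpha*Q."""
    return sum(math.log((1 - alpha) * p[t] + alpha * q[t]) for t in tokens)

def estimate_alpha(tokens, grid=1000):
    """Grid-search maximum likelihood estimate of the LLM-modified fraction."""
    best_ll, best_alpha = max(
        (log_likelihood(a / grid, tokens), a / grid) for a in range(grid + 1)
    )
    return best_alpha

# A toy corpus of adjective occurrences, skewed toward LLM-favored words:
corpus = (["commendable"] * 40 + ["notable"] * 25
          + ["solid"] * 20 + ["innovative"] * 15)
print(f"estimated LLM fraction: {estimate_alpha(corpus):.2f}")
```

The key point is that alpha is estimated for the corpus as a whole rather than per document, which is why the method can surface aggregate trends even when no single review can be confidently flagged.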
- Detection tool found 6.5-16.9% of peer review text at major AI conferences (ICLR, NeurIPS, CoRL, EMNLP) contained substantial LLM modifications
- AI-generated text was 3x more common in reviews submitted within 48 hours of deadlines and in reviews where reviewers reported lower confidence
- Stanford's maximum likelihood model uses expert and AI reference texts to estimate LLM influence at corpus scale, revealing patterns invisible in individual cases
Why It Matters
The very system evaluating AI research is increasingly AI-generated, raising fundamental questions about authenticity and quality in scientific discourse.