Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation
Academic AI papers are evaluating outdated models, and 52.5% of conclusions generalize their claims to 'AI' broadly rather than to the specific model tested
A pre-registered audit by David Gringras and Misha Salahshoor examined 112,303 candidate records (18,574 admissible papers, 4,766 full texts) published between January 2022 and April 2026. They found that the median academic paper evaluates a model that trails the contemporaneous frontier by 10.85 ECI points, roughly 1.4x the capability gap between Claude Sonnet 3.7 and Claude Opus 4.5. This 'publication elicitation gap' is widening at 5.53 ECI per year (95% CI [+5.03, +5.83]). The authors decompose the lag into ~25% peer-review latency and ~75% excess lag from delayed adoption of frontier models.
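To make the headline figures concrete, here is a minimal back-of-the-envelope sketch in Python that works through the 25/75 decomposition and projects the gap forward. The variable names are illustrative, and the assumptions that the trend stays linear and the split stays constant are ours, not the paper's.

```python
# Back-of-the-envelope sketch of the reported lag figures.
# Constants are the point estimates quoted in the audit; the
# linear-trend and constant-share assumptions are illustrative only.

median_gap_eci = 10.85        # median paper-vs-frontier gap, in ECI points
growth_per_year = 5.53        # reported widening rate (95% CI [+5.03, +5.83])
review_latency_share = 0.25   # ~25% attributed to peer-review latency
excess_lag_share = 0.75       # ~75% attributed to delayed frontier adoption

print(f"review latency: {median_gap_eci * review_latency_share:.2f} ECI")  # ~2.71
print(f"excess lag:     {median_gap_eci * excess_lag_share:.2f} ECI")      # ~8.14

def projected_gap(years_ahead: float) -> float:
    """Naive linear projection of the median gap, assuming the trend holds."""
    return median_gap_eci + growth_per_year * years_ahead

for years in (1, 2, 3):
    print(f"+{years} yr: projected median gap ~ {projected_gap(years):.2f} ECI")
```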
Worse, 52.5% of paper conclusions (95% CI [48.2, 56.9]) abstract upward to claims about 'AI' rather than the specific evaluated model, a share rising at OR = 1.23 per year. Only 3.2% of abstracts and 21.2% of full texts disclose reasoning-mode status on reasoning-capable models (e.g., GPT-4o-mini zero-shot vs GPT-5.5 Pro or Claude Opus 4.7). The authors propose VERSIO-AI, a 13-item reporting checklist whose three Core items trigger desk rejection if undisclosed. It mandates configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, and prompting) to combat misrepresentation in policy, media, and downstream citations.
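As a rough illustration of what a configuration-surface disclosure record could capture, the sketch below encodes only the five fields named above; the paper's actual 13 checklist items are not reproduced here, and the choice of which three fields count as Core is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalDisclosure:
    """Configuration-surface fields named in the VERSIO-AI summary.
    The full 13-item checklist is not reproduced; field names are illustrative."""
    model_snapshot: Optional[str] = None   # exact dated model version evaluated
    reasoning_mode: Optional[str] = None   # reasoning on/off and effort setting
    tool_access: Optional[str] = None      # tools, browsing, or code execution available
    scaffolding: Optional[str] = None      # agent framework, retries, sampling setup
    prompting: Optional[str] = None        # prompt templates, shots, system prompt

# Which three items are 'Core' is an assumption for this sketch, not the paper's list.
CORE_FIELDS = ("model_snapshot", "reasoning_mode", "tool_access")

def missing_core_items(d: EvalDisclosure) -> list[str]:
    """Core fields left undisclosed; a non-empty list would trigger desk rejection."""
    return [name for name in CORE_FIELDS if getattr(d, name) is None]

# Usage: a paper that reports only the model snapshot fails two core items.
print(missing_core_items(EvalDisclosure(model_snapshot="gpt-4o-mini-2024-07-18")))
# -> ['reasoning_mode', 'tool_access']
```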
- Median paper tests a model 10.85 ECI (~1.4x gap between Sonnet 3.7 and Opus 4.5) behind the frontier; gap grows 5.53 ECI/year
- Only 3.2% of abstracts disclose reasoning-mode status; 52.5% of conclusions generalize to 'AI' rather than the evaluated model
- VERSIO-AI checklist (13 items, core 3 for desk reject) proposed to enforce configuration-surface disclosure
Why It Matters
Misleading evaluations distort media narratives and policy decisions about real-world AI capabilities.