RepoMirage shows code agents lose 41 points of accuracy on repository context reasoning
When tasks force cross-file reasoning, AI coding agents drop from 66.8% to 25.3% accuracy.
A new paper from Tsinghua University researchers introduces RepoMirage, a diagnostic evaluation suite that exposes a critical blind spot in today's code agents. Built on SWE-Bench Verified, RepoMirage applies three types of semantics-preserving, repository-level perturbations (renaming variables, restructuring imports, and reordering functions) to test whether agents truly reason about code across multiple files or merely pattern-match. The results are striking: on standard issue resolution tasks, agents score 66.8%, but when RepoMirage-Extend introduces structural bottlenecks that force cross-file reasoning, accuracy falls to 25.3%, a drop of 41.5 percentage points.
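To make "semantics-preserving perturbation" concrete, here is a minimal Python sketch of one of the three kinds named above, function reordering. It is an illustration only, not the paper's implementation: the helper `reorder_functions` and the toy module are hypothetical, and the sketch assumes the module runs no top-level code that calls a function before it is defined.

```python
import ast


def reorder_functions(source: str) -> str:
    """Reverse the order of top-level function definitions.

    Hypothetical helper for illustration; other top-level statements
    keep their positions. Requires Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    # Indices of top-level function definitions in the module body.
    slots = [i for i, node in enumerate(tree.body)
             if isinstance(node, ast.FunctionDef)]
    funcs = [tree.body[i] for i in slots]
    # Put the same functions back into the same slots, in reverse order.
    for i, fn in zip(slots, reversed(funcs)):
        tree.body[i] = fn
    return ast.unparse(tree)


original = '''
def helper(x):
    return x + 1

def main(x):
    return helper(x) * 2
'''

perturbed = reorder_functions(original)


def run(src: str) -> int:
    """Execute a module's source and call its main() with a fixed input."""
    namespace = {}
    exec(src, namespace)
    return namespace["main"](3)


# The layout changes, but the behavior does not.
print(run(original), run(perturbed))
```

An agent that relies on where a function sits in a file, rather than on what calls it, would be thrown off by this kind of change even though the program behaves identically.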
The researchers identified 'exploration drift' as the root cause: agents browse more files but fail to synthesize the retrieved information into an actionable picture of the repository's structure. To fix this, they propose RepoAnchor, a prototype workflow that explicitly separates repository exploration from downstream problem solving. By first building a structural scaffold (dependency graphs, call hierarchies) and then using that scaffold to guide the agent, RepoAnchor recovers much of the lost performance. The findings suggest that current code agents, including those powering GitHub Copilot, Cursor, and Devin, lack robust repository context reasoning, and that structure-aware methods are the next frontier for helping AI understand large codebases.
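The structure-first idea can be sketched in a few lines of Python. This is a hedged illustration, not RepoAnchor itself, since the article does not describe how its scaffold is built. The toy `repo` dictionary and the helpers `build_import_graph` and `reachable` are hypothetical. The sketch builds an import dependency graph first, then uses it to decide which files are relevant to a starting module.

```python
import ast
from collections import deque


def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the in-repo modules it imports."""
    graph = {name: set() for name in files}
    for name, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            # Keep only edges that point at modules inside this repository.
            graph[name].update(t for t in targets if t in files)
    return graph


def reachable(graph: dict[str, set[str]], start: str) -> set[str]:
    """Breadth-first search: modules that `start` depends on, directly or transitively."""
    seen, queue = {start}, deque([start])
    while queue:
        for dep in graph[queue.popleft()]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {start}


# Toy repository: module name -> source text (assumed for illustration).
repo = {
    "app": "import service\nimport os\n",
    "service": "from storage import load\n",
    "storage": "import json\n",
    "unused": "import storage\n",
}

graph = build_import_graph(repo)
scaffold = reachable(graph, "app")
print(sorted(scaffold))
```

Here the scaffold for `app` contains `service` and `storage` but not `unused`, so an agent guided by it would read the two files on the dependency path instead of wandering through the whole repository.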
- RepoMirage applies 3 types of semantics-preserving code perturbations to test multi-file reasoning; under RepoMirage-Extend, accuracy drops 41.5 percentage points (66.8% to 25.3%)
- Trajectory analysis shows 'exploration drift': agents access more files but fail to connect information into correct structural understanding
- Proposed RepoAnchor separates repository exploration from problem solving via a structure-first workflow, recovering significant performance
Why It Matters
Current code agents appear to overfit to surface patterns; without robust repository context reasoning, they are likely to fail on complex, real-world repositories.