RepoMirage reveals code agents lose 41 points of accuracy on repository context reasoning

arXiv cs.SE May 27, 2026

⚡ When a repository is restructured without changing its behavior, AI coding agents drop from 67% to 25% accuracy.

Deep Dive

A new paper from Tsinghua University researchers introduces RepoMirage, a diagnostic evaluation suite that exposes a critical blind spot in today's code agents. Built on SWE-Bench Verified, RepoMirage applies three types of semantics-preserving, repository-level perturbations (renaming variables, restructuring imports, and reordering functions) to test whether agents truly reason about code across multiple files or merely pattern-match. Agents score 66.8% on standard issue-resolution tasks, but when RepoMirage-Extend introduces structural bottlenecks that force cross-file reasoning, accuracy falls to 25.3%, a drop of roughly 41 percentage points.
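To make "semantics-preserving perturbation" concrete, here is a minimal sketch of one of the three kinds mentioned above, reordering functions, using Python's standard `ast` module. It is an illustration only, not RepoMirage's implementation; the function name `reorder_functions` and the sample module are hypothetical, and the sketch assumes the module has no top-level code that depends on definition order.

```python
import ast


def reorder_functions(source: str) -> str:
    """Reverse the order of top-level function definitions (hypothetical helper).

    Assumes no module-level statement calls a function before the
    module finishes loading, so the reordering cannot change behavior.
    """
    tree = ast.parse(source)
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    replacements = iter(reversed(funcs))
    # Non-function statements keep their slots; functions swap places.
    tree.body = [
        next(replacements) if isinstance(node, ast.FunctionDef) else node
        for node in tree.body
    ]
    return ast.unparse(tree)  # requires Python 3.9+


original = """
def helper(x):
    return x + 1

def main(x):
    return helper(x) * 2
"""

perturbed = reorder_functions(original)

# Both versions behave identically even though the text differs.
ns_original, ns_perturbed = {}, {}
exec(original, ns_original)
exec(perturbed, ns_perturbed)
```

The perturbed module defines `main` before `helper`, yet calling `main` gives the same result in both namespaces, because Python resolves `helper` at call time. An agent that relies on the surface layout of a file rather than on what the code does would see a different file with the same behavior.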

The researchers identify 'exploration drift' as the root cause: agents browse more files but fail to synthesize what they retrieve into an accurate picture of the repository's structure. To fix this, they propose RepoAnchor, a prototype workflow that explicitly separates repository exploration from downstream problem solving. By first building a structural scaffold (dependency graphs, call hierarchies) and then using that scaffold to guide the agent, RepoAnchor recovers much of the lost performance. The findings suggest that current code agents, including those powering GitHub Copilot, Cursor, and Devin, lack robust repository context reasoning, and that structure-aware methods are the next frontier for making AI truly understand large codebases.
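As a rough illustration of what a "structural scaffold" could look like, the sketch below builds a module-level import dependency graph before any problem solving starts. It is not RepoAnchor's actual method; the function `import_graph`, the in-memory `modules` dictionary, and the sample module names are all assumptions made for the example.

```python
import ast


def import_graph(modules: dict[str, str]) -> dict[str, set[str]]:
    """Map each module to the in-repository modules it imports (hypothetical helper).

    `modules` maps a module name to its source text. Imports of modules
    outside the repository (for example the standard library) are ignored.
    """
    graph: dict[str, set[str]] = {name: set() for name in modules}
    for name, source in modules.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            graph[name].update(t for t in targets if t in modules)
    return graph


# A tiny, made-up repository held in memory.
modules = {
    "app": "import utils\nfrom models import User\nimport os\n",
    "models": "from utils import slugify\n",
    "utils": "import re\n",
}

graph = import_graph(modules)
```

In a structure-first workflow, a graph like this would be computed once and handed to the agent as context, so that it knows `app` depends on `models` and `utils` before it opens any file, instead of rediscovering those links while it explores.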

Key Points

RepoMirage applies 3 types of semantics-preserving code perturbations to test multi-file reasoning, revealing a 41 percentage point drop in accuracy (66.8% to 25.3%)
Trajectory analysis shows 'exploration drift': agents access more files but fail to connect information into correct structural understanding
The proposed RepoAnchor workflow separates repository exploration from problem solving via a structure-first approach, recovering much of the lost performance

Why It Matters

Current code agents overfit to surface patterns; without real context reasoning, they'll fail on complex, real-world repositories.
