Code-QA-Bench separates code reasoning from documentation memorization

arXiv cs.SE May 29, 2026

⚡ New benchmark shows that access to source code matters far more than documentation recall when AI models answer questions about a repository.

Deep Dive

Code-QA-Bench is a new framework from researchers led by Jun Zhang that addresses a core problem in evaluating AI code understanding: distinguishing models that actually reason about code from models that recall documentation or training data. The framework generates repository-level QA tasks with an answer-first pipeline: a tool-equipped agent explores the source code to produce verified gold answers, then derives questions from those answers, so each task is grounded in real code structure. Models are then tested under three conditions: closed-book (no repository access), code-only (documentation removed), and documented (full repository). The score differences between these conditions measure how much a model relies on documentation recall versus code reasoning.
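One way to picture the code-only condition is as a view of the repository with documentation stripped out. The summary does not say how Code-QA-Bench removes documentation, so the sketch below is only an illustration under an assumption: that docstrings count as documentation. The helper name `strip_docstrings` is hypothetical and is not part of the framework.

```python
import ast


def strip_docstrings(source: str) -> str:
    """Return Python source with module, class, and function docstrings removed.

    Assumption for illustration only: docstrings are treated as documentation.
    The real framework may define "documentation removed" differently.
    """
    tree = ast.parse(source)
    documented_nodes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, documented_nodes):
            continue
        body = node.body
        has_docstring = (
            bool(body)
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        )
        if has_docstring:
            # Keep the body non-empty so the result is still valid Python.
            node.body = body[1:] or [ast.Pass()]
    # ast.unparse (Python 3.9+) regenerates source; ordinary comments are
    # dropped as a side effect, because the AST does not store them.
    return ast.unparse(tree)


example = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
print(strip_docstrings(example))
```

A fuller code-only view would presumably also exclude README files and documentation directories; this sketch covers a single source file.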

The benchmark comprises 528 code-derivable and 100 doc-dependent tasks drawn from 10 Python repositories in SWE-Bench. Experiments on four frontier models (e.g., GPT-4o, Claude 3.5) show a clear pattern: code access is the dominant factor, yielding an average gain of +0.23 over the closed-book condition. Documentation adds only a modest +0.071 on doc-dependent tasks, and on code-derivable tasks code-only performance essentially matches documented performance. This pattern suggests the benchmark isolates code reasoning from memorization. The framework is open-source and can be applied to any well-documented Python repository, offering a standardized way to assess whether a model understands code structure or relies on rote recall.
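The two reported deltas can be read as differences of mean scores between conditions. The following is a minimal sketch of that arithmetic, assuming per-task scores in the range 0 to 1; the function name, the dictionary keys, and the toy numbers are all hypothetical and are not results from the paper.

```python
from statistics import mean


def condition_deltas(scores: dict[str, list[float]]) -> dict[str, float]:
    """Compute gains between evaluation conditions from per-task scores.

    `scores` maps a condition name to a list of per-task scores.
    The keys used here are illustrative labels, not the framework's API.
    """
    means = {condition: mean(values) for condition, values in scores.items()}
    return {
        # How much access to source code helps over no repository access.
        "code_gain": means["code_only"] - means["closed_book"],
        # How much documentation adds on top of code access.
        "doc_gain": means["documented"] - means["code_only"],
    }


# Toy numbers for illustration only; they are NOT the benchmark's data.
toy_scores = {
    "closed_book": [0.2, 0.4],
    "code_only": [0.5, 0.7],
    "documented": [0.5, 0.8],
}
print(condition_deltas(toy_scores))
```

Under this reading, a large `code_gain` paired with a small `doc_gain` corresponds to the pattern the article describes: most of the improvement comes from seeing the code rather than the documentation.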

Key Points

Code access boosts performance by an average of +0.23 over closed-book.
Documentation adds only +0.071 on doc-dependent tasks, showing limited impact.
528 code-derivable and 100 doc-dependent tasks generated from 10 Python repos in SWE-Bench; framework is open-source.

Why It Matters

Helps developers and researchers evaluate whether AI models actually understand code or merely recall documentation.
