Code-QA-Bench separates code reasoning from documentation memorization
A new benchmark shows that access to source code matters far more than documentation recall when AI models answer questions about a repository.
Code-QA-Bench is a new framework from researchers led by Jun Zhang that tackles a critical problem in evaluating AI code understanding: how to tell whether a model truly reasons about code or merely recalls documentation and training data. The framework generates repository-level QA tasks using an answer-first pipeline: a tool-equipped agent explores the source code to produce verified gold answers and only then derives the questions, so each task is grounded in real code structure. It then tests models under three conditions: closed-book (no repository access), code-only (documentation removed), and documented (full repository). The score differences between these conditions measure how much a model relies on documentation recall versus actual code reasoning.
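To make the code-only condition concrete, here is a minimal sketch of one way documentation could be stripped from a Python file. This is an illustration under stated assumptions, not the framework's actual procedure: the summary does not say how documentation is removed, and the function name `strip_docstrings` is hypothetical. The sketch removes docstrings and, as a side effect of round-tripping through the AST, comments as well. A full code-only setup would presumably also drop README and docs files.

```python
import ast


def strip_docstrings(source: str) -> str:
    """Return Python source with docstrings (and comments) removed.

    Hypothetical helper: an assumed approximation of a "code-only" view,
    not the method used by Code-QA-Bench. Requires Python 3.9+ (ast.unparse).
    """
    tree = ast.parse(source)
    documented_nodes = (
        ast.Module,
        ast.ClassDef,
        ast.FunctionDef,
        ast.AsyncFunctionDef,
    )
    for node in ast.walk(tree):
        if not isinstance(node, documented_nodes):
            continue
        body = node.body
        has_docstring = (
            bool(body)
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        )
        if has_docstring:
            # Keep the body non-empty so the result is still valid Python.
            node.body = body[1:] or [ast.Pass()]
    # ast.unparse regenerates code from the tree, so comments are dropped too.
    return ast.unparse(tree)


example = '''
def add(a, b):
    """Add two numbers."""
    # plain sum
    return a + b


class Empty:
    """A class with only a docstring."""
'''

stripped = strip_docstrings(example)
print(stripped)
```

The function and class signatures survive, so a model can still reason from names and structure, while the natural-language explanations are gone.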
The benchmark comprises 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench. Experiments on four frontier models (e.g., GPT-4o, Claude 3.5) reveal clear patterns: code access is the dominant factor, yielding an average gain of +0.23 over the closed-book condition. Documentation provides only a modest additional benefit (+0.071) on doc-dependent tasks, and code-only performance essentially matches documented performance on code-derivable tasks. This is the pattern one would expect if the benchmark successfully isolates reasoning from memorization. The framework is open-source and can be applied to any well-documented Python repository, offering a standardized way to assess whether AI models understand code structure or rely on rote recall.
- Code access boosts performance by an average of +0.23 over the closed-book condition.
- Documentation adds only +0.071 on doc-dependent tasks, showing limited impact.
- 528 code-derivable and 100 doc-dependent tasks generated from 10 Python repos in SWE-Bench; framework is open-source.
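The gains quoted above are differences between conditions, which can be sketched in a few lines. This is a hedged illustration, not the benchmark's scoring code: the record layout, the function names, and every score below are made up for the example, and it assumes the code gain is code-only minus closed-book while the documentation gain is documented minus code-only.

```python
from statistics import mean

# Hypothetical per-task scores in [0, 1]. These numbers are invented for
# illustration and are NOT results from Code-QA-Bench.
results = [
    {"task_type": "code-derivable", "closed_book": 0.50, "code_only": 0.75, "documented": 0.75},
    {"task_type": "code-derivable", "closed_book": 0.25, "code_only": 0.50, "documented": 0.50},
    {"task_type": "doc-dependent", "closed_book": 0.25, "code_only": 0.50, "documented": 0.625},
]


def condition_deltas(rows):
    """Return (code_gain, doc_gain) averaged over the given task records.

    Assumed definitions:
      code_gain = code_only - closed_book   (benefit of seeing the code)
      doc_gain  = documented - code_only    (extra benefit of documentation)
    """
    code_gain = mean(r["code_only"] - r["closed_book"] for r in rows)
    doc_gain = mean(r["documented"] - r["code_only"] for r in rows)
    return code_gain, doc_gain


def deltas_by_task_type(rows):
    """Group records by task type and compute the deltas for each group."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["task_type"], []).append(row)
    return {task_type: condition_deltas(group) for task_type, group in grouped.items()}


per_type = deltas_by_task_type(results)
for task_type, (code_gain, doc_gain) in per_type.items():
    print(f"{task_type}: code gain {code_gain:+.3f}, doc gain {doc_gain:+.3f}")
```

Splitting by task type matters for the interpretation: a documentation gain near zero on code-derivable tasks and a small positive one on doc-dependent tasks is the shape the article describes.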
Why It Matters
Code-QA-Bench gives developers and researchers a way to check whether an AI model actually understands a codebase or is merely recalling its documentation.