Developer Tools

SWE-QA: A Dataset and Benchmark for Complex Code Understanding

New dataset tests 15 models; best scores only 74%.

Deep Dive

Researchers from EPITA and LRE have released SWE-QA, a new dataset and benchmark designed to test multi-hop code comprehension in AI models. Unlike existing benchmarks that focus on isolated code snippets, SWE-QA requires models to connect information across multiple, dispersed code segments—a skill crucial for real-world software development. The dataset comprises 9,072 multiple-choice questions systematically generated from 12 Python repositories taken from SWE-bench. It evaluates reasoning patterns such as Declaration-and-Call questions, which link entity definitions to their usage, and Interacting-Entity questions, which examine dynamic relationships among collaborating components. Questions are generated through parsing-based entity extraction and LLM-assisted construction with carefully validated distractors, so the benchmark tests genuine comprehension rather than superficial pattern matching.
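
To make the Declaration-and-Call idea concrete, here is a minimal sketch of parsing-based entity extraction using Python's standard `ast` module: it records where each function is declared and where it is called, then pairs the two facts the way a Declaration-and-Call question stem might. The sample source, the `build_question` template, and the pairing logic are illustrative assumptions for this summary, not SWE-QA's published pipeline.

```python
import ast
import textwrap

# Toy module whose structure we will mine for declaration/call pairs.
SOURCE = textwrap.dedent("""
    def parse_config(path):
        return open(path).read()

    def load_app(path):
        cfg = parse_config(path)   # call site: links back to the declaration above
        return cfg
""")


def collect_declarations_and_calls(source: str):
    """Return {function name: (declaration line, [call-site lines])}."""
    tree = ast.parse(source)
    decls = {}
    calls = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            decls[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.setdefault(node.func.id, []).append(node.lineno)
    return {name: (line, calls.get(name, [])) for name, line in decls.items()}


def build_question(name: str, decl_line: int) -> str:
    """Hypothetical question stem; real SWE-QA items add validated distractors."""
    return f"Which function calls `{name}`, declared on line {decl_line}?"


if __name__ == "__main__":
    facts = collect_declarations_and_calls(SOURCE)
    for name, (decl_line, call_lines) in facts.items():
        if call_lines:  # keep only entities that are actually used elsewhere
            print(build_question(name, decl_line), "| evidence lines:", call_lines)
```

Answering even this toy stem requires hopping from a definition to a separate call site, which is the multi-hop behavior the benchmark is probing.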

Evaluation of 15 language models, ranging from 360 million to 671 billion parameters, revealed significant challenges in multi-hop reasoning. The best-performing model achieved only 74.41% accuracy. Dense architectures consistently outperformed mixture-of-experts (MoE) models by 10 to 14 percentage points, while reasoning-enhanced variants showed inconsistent benefits. These results underscore that current AI systems still struggle with the complex, interconnected reasoning required in professional software engineering. The benchmark provides a more realistic measure of code understanding, pushing the field toward better evaluation and development of models capable of handling real-world coding tasks.
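
For context on how an accuracy figure like 74.41% is computed over a multiple-choice set, below is a minimal scoring sketch. The record fields (`question`, `choices`, `answer`) and the `ask_model` stub are assumptions for illustration and do not reflect SWE-QA's actual schema or evaluation harness.

```python
from typing import Callable

# One hypothetical multiple-choice record; a real run would load all 9,072 items.
EXAMPLES = [
    {
        "question": "Which function calls `parse_config`?",
        "choices": {"A": "load_app", "B": "main", "C": "read_file", "D": "init"},
        "answer": "A",
    },
]


def accuracy(ask_model: Callable[[str, dict], str], examples: list[dict]) -> float:
    """Fraction of questions where the model returns the keyed choice letter."""
    correct = sum(
        1
        for ex in examples
        if ask_model(ex["question"], ex["choices"]) == ex["answer"]
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Stand-in model that always answers "A"; a real harness would prompt one of
    # the 15 evaluated LLMs and parse the letter from its response.
    print(f"accuracy: {accuracy(lambda q, c: 'A', EXAMPLES):.2%}")
```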

Key Points
  • SWE-QA includes 9,072 multiple-choice questions from 12 Python repositories in SWE-bench.
  • Best model accuracy was 74.41%; dense architectures beat MoE by 10-14 percentage points.
  • Tests multi-hop reasoning patterns such as Declaration-and-Call and Interacting-Entity questions.

Why It Matters

Highlights the gap between current AI systems and the complex code reasoning real software work demands, pushing the field toward more realistic benchmarks.