TraceEval benchmark reveals LLMs still struggle to recover program runtime call structure
Claude-Opus-4.6 leads zero-shot results at 72.9% F1; fine-tuned Qwen2.5-Coder-32B closes to within 1.7 points
A team of researchers (Li et al.) has introduced TraceEval, a benchmark that shifts the evaluation of large language models (LLMs) from passing unit tests to understanding program execution semantics. Traditional coding benchmarks such as HumanEval and SWE-Bench check only whether the output matches expected results, offering little insight into whether a model grasps the internal call structure of code. TraceEval addresses this by requiring models to recover the runtime call graph from source code. The ground truth is execution-verified: every positive edge is mechanically witnessed by an actual program run, which removes annotator bias and label noise.
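To make "mechanically witnessed" concrete, here is a minimal Python sketch of how caller-to-callee edges can be recorded while a program runs. It is an illustration only, not TraceEval's actual tracer: the `trace_call_edges` helper and the toy functions are hypothetical, and it relies on the standard library's `sys.setprofile` hook, which only observes Python-level calls in the current thread.

```python
import inspect
import sys
from typing import Callable, Set, Tuple

Edge = Tuple[str, str]  # (caller name, callee name); a real tracer would likely use qualified names


def trace_call_edges(entry: Callable[[], object]) -> Set[Edge]:
    """Run `entry` and return the call edges observed between Python functions."""
    edges: Set[Edge] = set()
    root = inspect.currentframe()  # frame of this harness, used to skip the harness -> entry edge

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; C builtins emit "c_call" and are ignored here.
        if event != "call" or frame.f_back is None or frame.f_back is root:
            return
        edges.add((frame.f_back.f_code.co_name, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        entry()
    finally:
        sys.setprofile(None)  # always detach the hook, even if `entry` raises
    return edges


# Toy program (hypothetical): `fallback` appears in the source but is never executed.
def helper(x):
    return x + 1


def fallback(x):
    return x - 1


def main(flag=False):
    return fallback(1) if flag else helper(1)


print(trace_call_edges(main))  # {('main', 'helper')}
```

Only the `main` to `helper` edge is reported, because the call to `fallback` is never executed. This is the sense in which an execution-verified edge is witnessed rather than inferred from the source text.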
The benchmark comprises 10,583 real-world programs (2,129 test, 8,454 train) drawn from over 1,600 open-source repositories across Python, JavaScript, and Java. The instances are created by an LLM-assisted harness generation pipeline with tracer validation. The researchers evaluated 10 LLMs in zero-shot mode; the strongest performer was Claude-Opus-4.6, with an average F1 of 72.9% across all three languages. When they fine-tuned the Qwen2.5-Coder family on the training split, however, they observed gains of up to +55.6 F1, with Qwen2.5-Coder-32B reaching 71.2% F1, within 1.7 points of the top proprietary model. The authors also release a reproducible pipeline that converts any open-source repository into new verified benchmark instances, enabling the community to expand TraceEval continuously.
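For readers unfamiliar with scoring call graphs by F1, the sketch below shows one plausible edge-level computation in Python. It assumes a predicted call graph and the traced ground truth are each represented as a set of (caller, callee) pairs; the `edge_prf` function and the example edges are hypothetical, and the article does not specify how TraceEval aggregates scores across programs or languages.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def edge_prf(predicted: Set[Edge], witnessed: Set[Edge]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted edges against traced edges."""
    true_positives = len(predicted & witnessed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(witnessed) if witnessed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 edges were witnessed at runtime; the model predicts 4, of which 2 are correct.
witnessed = {("main", "parse"), ("main", "total"), ("total", "add")}
predicted = {("main", "parse"), ("main", "total"), ("main", "add"), ("parse", "add")}

precision, recall, f1 = edge_prf(predicted, witnessed)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Under this reading, a model is penalized both for inventing edges that never execute (lower precision) and for missing edges that do (lower recall).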
- TraceEval is the first execution-verified, multi-language benchmark for code semantic reasoning, covering 10,583 programs across Python, JavaScript, and Java.
- Top zero-shot model Claude-Opus-4.6 scores 72.9% F1; fine-tuned Qwen2.5-Coder-32B reaches 71.2% F1, closing the gap to just 1.7 points.
- Includes a reproducible pipeline for converting any open-source repository into new benchmark instances, facilitating community-driven expansion.
Why It Matters
TraceEval moves code evaluation beyond test-passing to measure whether models understand what code does at runtime, which is critical for reliable AI coding assistants in production.