Developer Tools

TraceEval benchmark reveals LLMs still struggle to recover program runtime call structure

arXiv cs.SE May 13, 2026

⚡ Claude-Opus-4.6 leads at 72.9% F1; fine-tuned Qwen2.5-Coder-32B comes within 1.7 F1 points

Deep Dive

A team of researchers (Li et al.) has released TraceEval, a benchmark that shifts the evaluation of large language models (LLMs) from passing unit tests to understanding program execution semantics. Established coding benchmarks such as HumanEval and SWE-Bench check only whether generated code passes its tests, which offers little insight into whether a model grasps the internal call structure of the code. TraceEval instead asks models to recover the runtime call graph from source code. The ground truth is execution-verified: every positive edge is mechanically witnessed by an actual program run, which removes annotator bias and label noise.
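To make "execution-verified" concrete, here is a minimal Python sketch of how call edges can be witnessed by actually running a program. It is not the authors' tracer: the function names (trace_call_edges, main, helper, unused) are hypothetical, and it relies only on the standard library's sys.setprofile hook.

```python
import sys


def trace_call_edges(entry):
    """Run entry() and return the (caller, callee) edges seen at runtime.

    Illustrative only: a hypothetical helper, not TraceEval's tracer.
    """
    edges = set()

    def profiler(frame, event, arg):
        # "call" fires whenever a Python-level function is entered.
        if event != "call" or frame.f_back is None:
            return
        # Skip the edge from this harness into the entry point.
        if frame.f_back.f_code is trace_call_edges.__code__:
            return
        edges.add((frame.f_back.f_code.co_name, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        entry()
    finally:
        sys.setprofile(None)  # always detach the profiler
    return edges


def helper(x):
    return x * 2


def unused(x):
    # Defined in the source but never executed, so no edge is witnessed.
    return x - 1


def main():
    return helper(3) + helper(4)


edges = trace_call_edges(main)
print(sorted(edges))  # [('main', 'helper')]
```

Only main -> helper is recorded, because only that call actually happened. A function that appears in the source but never runs, like unused here, contributes no positive edge.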

The benchmark comprises 10,583 real-world programs (8,454 train, 2,129 test) drawn from over 1,600 open-source repositories across Python, JavaScript, and Java. The instances are created by an LLM-assisted harness-generation pipeline with tracer validation. Among 10 LLMs evaluated zero-shot, the strongest was Claude-Opus-4.6, with an average F1 of 72.9% across the three languages. When the researchers fine-tuned the Qwen2.5-Coder family on the training split, they observed gains of up to 55.6 F1 points, with Qwen2.5-Coder-32B reaching 71.2% F1, within 1.7 points of the top proprietary model. The authors also release a reproducible pipeline that converts any open-source repository into new verified benchmark instances, so the community can keep expanding TraceEval.
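The article reports F1 scores without spelling out how they are computed. One plausible reading, assumed here rather than confirmed by the source, is a set-based F1 over call-graph edges, comparing a model's predicted edges with the execution-verified ones. The function name edge_f1 and the example edges are hypothetical.

```python
def edge_f1(predicted, reference):
    """Set-based F1 between predicted and reference (caller, callee) edges.

    Assumption: edge-level scoring; the benchmark's exact metric may differ.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth witnessed at runtime.
reference = {("main", "parse"), ("main", "run"), ("run", "step")}
# Hypothetical model output: two correct edges, one spurious, one missed.
predicted = {("main", "parse"), ("main", "run"), ("run", "log")}

score = edge_f1(predicted, reference)
print(f"{score:.3f}")  # 0.667
```

Here precision and recall are both 2/3, so F1 is also 2/3. Under this reading, a model is penalized both for edges it invents and for real edges it misses.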

Key Points

TraceEval is the first execution-verified, multi-language benchmark for code semantic reasoning, covering 10,583 programs across Python, JavaScript, and Java.
Top zero-shot model Claude-Opus-4.6 scores 72.9% F1; fine-tuned Qwen2.5-Coder-32B reaches 71.2% F1, closing the gap to just 1.7 points.
Includes a reproducible pipeline for converting any open-source repository into new benchmark instances, facilitating community-driven expansion.

Why It Matters

TraceEval moves code evaluation beyond test-passing toward measuring whether a model understands what code does at runtime, which is critical for reliable AI coding assistants in production.
