QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
New benchmark shows LLMs score 42-59% on quantum code, but feedback repair boosts performance to 66-83%.
A research team from King Abdullah University of Science and Technology (KAUST) and collaborators has introduced QuanBench+, the first unified benchmark for evaluating Large Language Models (LLMs) on quantum code generation across multiple frameworks. The benchmark spans three major quantum programming frameworks—Qiskit (IBM), PennyLane (Xanadu), and Cirq (Google)—with 42 carefully aligned tasks covering quantum algorithms, gate decomposition, and state preparation. Unlike previous single-framework evaluations, QuanBench+ uses executable functional tests with KL-divergence-based acceptance for probabilistic outputs, allowing researchers to separate genuine quantum reasoning from mere framework familiarity.
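The KL-divergence acceptance criterion deserves a closer look, since quantum programs often produce measurement distributions rather than deterministic answers. The paper does not publish its exact tolerance, so the threshold and the Bell-state example below are illustrative assumptions; the sketch only shows the general shape of such a check.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared set of measurement outcomes.
    eps guards against zero probabilities in the candidate distribution."""
    outcomes = set(p) | set(q)
    return sum(
        p[k] * math.log(p[k] / max(q.get(k, eps), eps))
        for k in outcomes
        if p.get(k, 0.0) > 0.0
    )

def accept(reference, candidate, threshold=0.05):
    """Accept a probabilistic output if its sampled distribution is close
    to the reference. The threshold here is illustrative, not from the paper."""
    return kl_divergence(reference, candidate) < threshold

# Hypothetical Bell-state preparation task: ideal distribution vs. 1000
# shots of measurement counts from the model-generated circuit.
ref = {"00": 0.5, "11": 0.5}
counts = {"00": 503, "11": 497}
total = sum(counts.values())
sampled = {k: v / total for k, v in counts.items()}
print(accept(ref, sampled))  # True: the divergence is tiny
```

A threshold-based check like this tolerates shot noise while still rejecting circuits that prepare the wrong state, which is why it separates functional correctness from exact string matching.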
The benchmark reveals significant performance gaps across frameworks, with the strongest models achieving 59.5% Pass@1 on Qiskit, 54.8% on Cirq, but only 42.9% on PennyLane. This 16.6-percentage-point spread highlights how current LLM performance remains heavily dependent on framework-specific training data rather than fundamental quantum understanding. However, the study introduces a promising feedback-based repair mechanism in which models revise their code after runtime errors or wrong answers, boosting best scores to 83.3% (Qiskit), 76.2% (Cirq), and 66.7% (PennyLane)—demonstrating that iterative correction significantly improves reliability.
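The repair mechanism described above amounts to an execute-and-revise loop: run the generated code, and on failure feed the error back into the prompt for another attempt. The paper does not specify its exact harness, so the `generate` callback and round budget below are hypothetical stand-ins; this is a minimal sketch of the control flow, not the authors' implementation.

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: int = 10):
    """Execute generated code in a subprocess; return (ok, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    if proc.returncode != 0:
        return False, proc.stderr  # runtime error becomes repair feedback
    return True, proc.stdout

def repair_loop(generate, task: str, max_rounds: int = 2):
    """One-shot attempt plus up to max_rounds feedback-driven revisions.
    `generate(prompt)` stands in for an LLM call (hypothetical)."""
    prompt = task
    for _ in range(max_rounds + 1):
        code = generate(prompt)
        ok, feedback = run_candidate(code)
        if ok:
            return code
        prompt = (
            f"{task}\n\nPrevious attempt failed with:\n{feedback}\n"
            "Please fix the code."
        )
    return None  # give up after the round budget is exhausted
```

In a real harness the success check would also run the functional tests (including the KL-divergence acceptance for probabilistic outputs), not just look for a clean exit code; the loop structure stays the same.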
QuanBench+ represents a crucial step toward standardized evaluation in the emerging field of AI-assisted quantum programming. By providing aligned tasks across frameworks and measuring both one-shot performance and repair capabilities, it establishes a rigorous methodology for tracking progress toward truly framework-agnostic quantum code generation. The benchmark has been accepted to the ICLR 2026 Workshop, signaling its importance to both the machine learning and quantum computing communities.
- QuanBench+ evaluates LLMs across Qiskit, PennyLane, and Cirq with 42 aligned quantum programming tasks
- Best one-shot scores show 16.6 percentage point gap between frameworks (59.5% Qiskit vs 42.9% PennyLane)
- Feedback-based repair boosts performance by 21.4-23.8 percentage points across all three frameworks
Why It Matters
Provides the first standardized way to measure AI progress in quantum programming, revealing current limitations in framework-agnostic reasoning.