PowerCodeBench: documentation injection boosts LLMs 32-56 points on power grid code

arXiv cs.SE June 01, 2026

⚡ With targeted documentation injection, 70B-120B open-weight models match commercial mid-tier APIs at 59% lower prompt-token cost than full-context prompting.

Deep Dive

Large language models (LLMs) are increasingly used to automate power-system analysis, but utilities require on-premise deployment for confidentiality and reproducibility. A new preprint from Hui Wu, Xiaoyang Wang, and Zhong Fan reveals that first-pass failures in power-system code generation are dominated by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. To tackle this, the team built PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth. They also developed an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles.
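One of the failure modes named above, hallucinated function names, can be caught statically before any code runs. The following is a minimal sketch of that idea, not the authors' probing procedure: `KNOWN_API` is a small hand-picked allow-list assumed for illustration, and `unknown_calls` is a hypothetical helper that only inspects calls of the form `pp.<name>(...)`.

```python
import ast

# Assumption: a tiny hand-picked allow-list of pandapower function names.
# A real check would build this from the installed, versioned library docs.
KNOWN_API = {
    "create_empty_network",
    "create_bus",
    "create_ext_grid",
    "create_load",
    "create_line",
    "runpp",
}


def unknown_calls(source: str, module_alias: str = "pp") -> list[str]:
    """Return names called as `<module_alias>.<name>(...)` that are not in KNOWN_API."""
    tree = ast.parse(source)  # parses only; the generated code is never executed
    missing = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == module_alias
            and node.func.attr not in KNOWN_API
        ):
            missing.add(node.func.attr)
    return sorted(missing)


# Example model output containing one call that is absent from the allow-list.
generated = """
import pandapower as pp
net = pp.create_empty_network()
bus = pp.create_bus(net, vn_kv=20.0)
pp.run_power_flow(net)
"""

flagged = unknown_calls(generated)
print(flagged)
```

A check like this only covers unknown names. Misused parameters and mishandled result tables would still need execution against numerical ground truth, which is what the benchmark's execution validation is for.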

The core contribution is a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, the team evaluated ten open-weight LLMs (from 1.5B to 480B parameters) and four commercial mid-tier APIs. The intervention improved every open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range matched commercial mid-tier accuracy, while Llama-3.1-405B and Qwen3-Coder-480B led the panel. The targeted prompts preserved the full-context accuracy ceiling while using only 41% of the prompt-token cost of full-context prompting, a 59% reduction. The authors present this as a deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.
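The idea of estimating which APIs a query needs and injecting only their documentation can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the `DOCS` snippets are abbreviated paraphrases written for this example, the `KEYWORDS` map and the helpers `estimate_demand`, `build_prompt`, and `token_count` are hypothetical, and whitespace word counts stand in for a real tokenizer.

```python
import re

# Assumption: abbreviated documentation snippets written for this example.
DOCS = {
    "runpp": "runpp: runs an AC power flow on a network; results are written to the net.res_* tables.",
    "create_load": "create_load: adds a load to a bus, with active power given in MW.",
    "create_line": "create_line: adds a line between two buses, with a length in km.",
    "create_bus": "create_bus: adds a bus with a nominal voltage in kV.",
}

# Assumption: a hand-written keyword map used as a crude demand estimator.
KEYWORDS = {
    "runpp": {"flow", "voltage", "loading"},
    "create_load": {"load", "demand"},
    "create_line": {"line", "cable"},
    "create_bus": {"bus", "substation"},
}


def estimate_demand(query: str) -> list[str]:
    """Return the API names whose keywords appear as whole words in the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return [api for api, keywords in KEYWORDS.items() if words & keywords]


def build_prompt(query: str, apis: list[str]) -> str:
    """Prepend the documentation snippets for the selected APIs to the query."""
    snippets = "\n".join(DOCS[api] for api in apis)
    return f"{snippets}\n\nTask: {query}"


def token_count(text: str) -> int:
    """Whitespace word count, used here as a rough proxy for prompt tokens."""
    return len(text.split())


query = "Add a 2 MW load and run a power flow"
targeted = build_prompt(query, estimate_demand(query))
full = build_prompt(query, list(DOCS))  # full-context baseline: inject everything

ratio = token_count(targeted) / token_count(full)
print(estimate_demand(query), f"targeted/full prompt size: {ratio:.2f}")
```

The ratio printed here depends entirely on the toy snippets and says nothing about the paper's 41% figure. It only shows how a targeted prompt can be compared against a full-context baseline on the same query.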

Key Points

The PowerCodeBench release includes 2,000 execution-validated tasks pairing natural-language queries with pandapower code and numerical ground truth.
The boundary-aware intervention improved accuracy by 32-56 points for all open-weight models ≥7B and all commercial APIs tested.
Llama-3.1-405B and Qwen3-Coder-480B led the panel, and targeted prompts matched full-context accuracy at only 41% of the prompt-token cost.

Why It Matters

Enables reliable on-premise LLM assistance for power grid analysis without cloud inference or fine-tuning.
