CodeGolf Bench tests LLMs on concise code in 60 languages

arXiv cs.SE June 01, 2026

⚡ Reasoning models outperform non-reasoning ones by 30% in code golf challenges

Deep Dive

Researcher Vedant Padwal has introduced CodeGolf Bench, a benchmark that evaluates how well LLMs generate concise code across 60 programming languages. It draws dynamic problems and live human baselines from the code golf platform code.golf. In an evaluation of nine LLMs on Python and C++ tasks, reasoning models significantly outperformed non-reasoning models, reaching a best average percentile of 70.97%. The gap was most pronounced in C++, underscoring the importance of reasoning for languages with strict syntax. Non-reasoning models struggled more with efficiency optimization in both languages, and their best percentiles were significantly lower than those of their reasoning counterparts.
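To make the "average percentile" figure concrete, here is a minimal sketch of how a score against human baselines could be computed. The summary does not give the paper's exact formula, so the definition used here (share of human solutions that are strictly longer than the model's) is an assumption, and the function names and sample byte counts are hypothetical.

```python
def solution_bytes(source: str) -> int:
    """Length of a solution in UTF-8 bytes (assumed scoring unit)."""
    return len(source.encode("utf-8"))


def length_percentile(model_bytes: int, human_bytes: list[int]) -> float:
    """Percent of human solutions strictly longer than the model's.

    Assumed definition, not necessarily the one used by CodeGolf Bench.
    """
    if not human_bytes:
        raise ValueError("need at least one human baseline")
    longer = sum(1 for h in human_bytes if h > model_bytes)
    return 100.0 * longer / len(human_bytes)


def average_percentile(results: list[tuple[int, list[int]]]) -> float:
    """Mean percentile over (model_bytes, human_bytes) pairs, one per problem."""
    scores = [length_percentile(m, h) for m, h in results]
    return sum(scores) / len(scores)


# Hypothetical data: (model solution size, human solution sizes) per problem.
results = [
    (50, [40, 45, 60, 80]),  # shorter than 2 of 4 human solutions
    (30, [25, 35, 40, 50]),  # shorter than 3 of 4 human solutions
]
print(average_percentile(results))  # 62.5
```

Under this assumed definition, a higher percentile means the model's solution is shorter than more of the human submissions, so a 70.97% average would mean the best model beat roughly seven in ten human entries on length.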

Key Points

Evaluates LLMs on code golf across 60 programming languages using dynamic problems from code.golf
Reasoning models hit 70.97% best average percentile vs. significantly lower scores for non-reasoning models
Performance gap largest in C++, where strict syntax demands higher reasoning for minimal solutions

Why It Matters

Code golf benchmarks push LLMs toward producing efficient, minimal code, which is critical for constrained environments and production optimization.
