CodeGolf Bench tests LLMs on concise code in 60 languages
Reasoning models outperform non-reasoning ones by 30% in code golf challenges
Researcher Vedant Padwal introduced CodeGolf Bench, a benchmark that evaluates LLMs on generating concise code across 60 programming languages. Built on code golf, where the goal is to solve a problem with the shortest possible source code, it draws dynamic problems and live human baselines from the code.golf platform. An evaluation of nine LLMs on Python and C++ tasks showed reasoning models significantly outperforming non-reasoning models, with a best average percentile of 70.97%. The gap was particularly pronounced in C++, which points to the importance of reasoning for languages with strict syntax. Non-reasoning models struggled more with efficiency optimization in both languages, and their best percentiles were significantly lower than those of their reasoning counterparts.
- Evaluates LLMs on code golf across 60 programming languages using dynamic problems from code.golf
- Reasoning models hit 70.97% best average percentile vs. significantly lower scores for non-reasoning models
- Performance gap largest in C++, where strict syntax demands higher reasoning for minimal solutions
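To make the percentile figures concrete, here is a minimal sketch of how a solution's length could be ranked against human submissions. The summary does not give the benchmark's exact scoring formula, so the definition used here (share of human entries that are strictly longer), the helper names `byte_length` and `percentile_rank`, and the baseline numbers are all illustrative assumptions rather than CodeGolf Bench's actual method or data.

```python
def byte_length(source: str) -> int:
    """Length of a solution in UTF-8 bytes (assumed scoring unit)."""
    return len(source.encode("utf-8"))


def percentile_rank(solution_bytes: int, human_bytes: list[int]) -> float:
    """Percent of human submissions strictly longer than the solution.

    Hypothetical definition for illustration; the benchmark may rank differently.
    """
    if not human_bytes:
        raise ValueError("need at least one human baseline")
    beaten = sum(1 for b in human_bytes if b > solution_bytes)
    return 100.0 * beaten / len(human_bytes)


# Two Python programs that both print the integers 1 through 10.
verbose = "for i in range(1, 11):\n    print(i)\n"
golfed = "for i in range(10):print(i+1)"

# Made-up human byte counts for the same task, not real code.golf data.
human_baseline = [28, 29, 31, 35, 40, 52, 60, 75]

for name, src in [("verbose", verbose), ("golfed", golfed)]:
    size = byte_length(src)
    print(name, size, percentile_rank(size, human_baseline))
```

Under this assumed definition, a shorter program beats more of the human field and so earns a higher percentile, which is the direction the reported 70.97% figure should be read in.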
Why It Matters
Code golf benchmarks push LLMs toward producing efficient, minimal code, which is critical for constrained environments and production optimization.