Developer Tools

CodeGolf Bench tests LLMs on concise code in 60 languages

Reasoning models outperform non-reasoning ones by 30% in code golf challenges

Deep Dive

Researcher Vedant Padwal introduced CodeGolf Bench, a benchmark that evaluates how well LLMs generate concise code across 60 programming languages. It is modeled on code golf, the practice of solving a problem in as few characters as possible, and draws dynamic problems and live human baselines from the code.golf platform. An evaluation of nine LLMs on Python and C++ tasks showed reasoning models significantly outperforming non-reasoning models, with a best average percentile of 70.97%. The gap was most pronounced in C++, suggesting that reasoning matters more in languages with strict syntax. Non-reasoning models struggled more to optimize their solutions in both languages, and their best percentiles were significantly lower than those of their reasoning counterparts.
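To make the percentile figure concrete, here is a minimal sketch of one way a solution's length could be ranked against human baselines. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names, the tie handling, and the sample byte counts are all hypothetical.

```python
# Illustrative only: CodeGolf Bench's real scoring rules are not given in this
# summary. Function names, tie handling, and sample data are assumptions.

def percentile_vs_humans(model_bytes: int, human_bytes: list[int]) -> float:
    """Percentage of human solutions that the model's solution ties or beats.

    Shorter is better, so the model "ties or beats" a human entry when its
    byte count is less than or equal to that entry's byte count.
    """
    if not human_bytes:
        raise ValueError("need at least one human baseline")
    tied_or_beaten = sum(1 for h in human_bytes if model_bytes <= h)
    return 100.0 * tied_or_beaten / len(human_bytes)


def average_percentile(results: list[tuple[int, list[int]]]) -> float:
    """Mean percentile over (model_bytes, human_bytes) pairs, one per problem."""
    if not results:
        raise ValueError("need at least one problem")
    scores = [percentile_vs_humans(m, hs) for m, hs in results]
    return sum(scores) / len(scores)


# A golfed Python answer is scored by its length in bytes.
solution = "print(sum(range(101)))"
model_len = len(solution.encode("utf-8"))

# Hypothetical human byte counts for the same problem.
humans = [20, 22, 25, 30]
score = percentile_vs_humans(model_len, humans)
```

Under this convention a higher percentile means the model's code is shorter than more of the human entries, so an average of 70.97% would correspond to ties or wins against roughly seven in ten human solutions per problem.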

Key Points
  • Evaluates LLMs on code golf across 60 programming languages using dynamic problems from code.golf
  • Reasoning models hit 70.97% best average percentile vs. significantly lower scores for non-reasoning models
  • Performance gap is largest in C++, where strict syntax demands more reasoning to reach minimal solutions

Why It Matters

Code golf benchmarks push LLMs toward producing efficient, minimal code, which matters for constrained environments and production optimization.