Is ProgramBench Impossible?
All frontier models fail ProgramBench, but the real problem is that some tasks are literally unsolvable.
Deep Dive
ProgramBench is a new coding benchmark that every frontier model spectacularly fails, not because the models lack intelligence but because some tasks are unwinnable by construction: the auto-generated unit tests exercise undocumented sub-commands (e.g., seqtk's hrun and kfreq) that appear nowhere in the documentation the model is given, making those tasks effectively impossible. The author suggests improvements like downstream testing and weighted scoring to make future benchmarks more meaningful.
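To make the failure mode concrete, here is a minimal sketch of what such an auto-generated test might look like. This is hypothetical, not ProgramBench's actual harness: the `reference/seqtk` and `candidate/seqtk` paths and the test structure are assumptions. The point is that the test compares against a sub-command (hrun) that is absent from seqtk's usage text, so a model working only from the documentation cannot know it exists.

```python
import subprocess
import tempfile

def run(binary, args):
    """Run a CLI binary and capture its stdout."""
    result = subprocess.run([binary, *args], capture_output=True, timeout=30)
    return result.stdout

def test_hrun_matches_reference():
    # Hypothetical auto-generated test: it was derived by probing the
    # reference binary, not by reading its docs. 'hrun' never appears
    # in seqtk's usage message, so no amount of skill lets a model
    # reimplement it from the documented interface alone.
    with tempfile.NamedTemporaryFile(suffix=".fa") as fa:
        fa.write(b">r1\nAAAACGTTTT\n")
        fa.flush()
        expected = run("reference/seqtk", ["hrun", fa.name])
        actual = run("candidate/seqtk", ["hrun", fa.name])
    assert actual == expected
```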
Key Points
- ProgramBench tests whether AI can recreate CLI programs from black-box access + documentation, but its auto-generated unit tests can encode undocumented behaviors (e.g., seqtk's hrun and kfreq sub-commands).
- All frontier models fail spectacularly, not due to lack of skill but because some tasks are literally unsolvable without hidden information.
- Proposed improvements: downstream testing (e.g., checking that a recreated tool can still compile the Linux kernel), weighted test scoring (see the sketch below), and allowing agents to see test failures for iterative debugging.
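To illustrate the weighted-scoring idea, here is a minimal sketch in Python, assuming each test can be tagged by whether the behavior it exercises appears in the documentation. The `TestResult` class, the 0.1 weight, and the tagging mechanism are all my assumptions, not specifics from the proposal; the intent is that undocumented behaviors become a small bonus rather than a hard requirement, so no task is unsolvable from the docs alone.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool
    documented: bool  # does the tested behavior appear in the docs?

def weighted_score(results, undocumented_weight=0.1):
    """Score a submission, down-weighting tests of undocumented behavior."""
    total = earned = 0.0
    for r in results:
        w = 1.0 if r.documented else undocumented_weight
        total += w
        earned += w * r.passed
    return earned / total if total else 0.0

# Example: passing every documented test yields a near-perfect score
# even when the hidden 'hrun' test fails.
results = [
    TestResult("seq_basic", True, True),
    TestResult("comp_counts", True, True),
    TestResult("hrun_hidden", False, False),
]
print(f"{weighted_score(results):.2f}")  # 0.95
```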
Why It Matters
Highlights critical flaws in current AI benchmarks, pushing for more realistic evaluation of coding abilities.