ProgramBench: Can LLMs rebuild programs from scratch?

Even the best LLMs fail to rebuild a single program from its compiled binary and documentation.

Deep Dive

A new benchmark called ProgramBench challenges AI agents to reconstruct a complete, functional codebase from only a compiled binary and its documentation. Models must analyze the binary, infer its logic, and write code that replicates the original behavior. According to the article, every model tested so far scores 0%.

Key Points
  • ProgramBench tests AI on full program reconstruction from compiled binaries and documentation only.
  • Every major LLM tested (GPT-4o, Claude 3.5, Llama 3) scored 0% on the benchmark's challenges.
  • The benchmark includes real-world programs like gzip and FFmpeg subsets, graded by exact behavior replication.
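
To make "graded by exact behavior replication" concrete, the sketch below shows one way such a check could work: run the original program and the rebuilt one on the same inputs and count how often the exit code and output match exactly. This is an illustration only, not ProgramBench's actual grader; the names `run` and `behavior_match_rate` are hypothetical, and the two commands are small Python one-liners standing in for a real binary and its reconstruction.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run(cmd: Sequence[str], stdin_data: bytes, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Run a command on the given stdin bytes and return (exit code, stdout)."""
    proc = subprocess.run(
        list(cmd),
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def behavior_match_rate(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    inputs: List[bytes],
) -> float:
    """Fraction of inputs on which the candidate matches the reference exactly.

    Assumption: "exact" here means identical exit code and byte-identical
    stdout. A real grader may compare more (files written, stderr, timing).
    """
    if not inputs:
        return 0.0
    matches = sum(
        1 for data in inputs if run(reference_cmd, data) == run(candidate_cmd, data)
    )
    return matches / len(inputs)


if __name__ == "__main__":
    # Hypothetical stand-ins: both "programs" upper-case their input.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(behavior_match_rate(reference, candidate, [b"abc", b"Hello\n", b""]))
```

Under a check like this, a reconstruction that handles most inputs but diverges on a few still counts as a miss for those inputs, which helps explain how a benchmark demanding exact replication can produce a 0% score.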

Why It Matters

The result reveals a critical AI blind spot in reverse engineering, a skill essential for security work, legacy code recovery, and software maintenance.