ProgramBench: Can LLMs rebuild programs from scratch?
Even the best LLMs fail to rebuild a single program from its compiled binary and documentation.
Deep Dive
A new benchmark called ProgramBench challenges AI agents to reconstruct a complete, functional codebase from only a compiled binary and its documentation. Models must analyze the binary, infer its logic, and write code that replicates the original behavior. According to the article, every model tested currently scores 0%.
Key Points
- ProgramBench tests AI on full program reconstruction from compiled binaries and documentation only.
- Every major LLM tested (GPT-4o, Claude 3.5, Llama 3) scored 0% on the benchmark's challenges.
- The benchmark includes real-world programs like gzip and FFmpeg subsets, graded on whether the rebuilt program exactly replicates the original's behavior.
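To make "exact behavior replication" concrete, here is a minimal sketch of one way such a check could work: run the reference program and the rebuilt program on the same inputs and require identical exit codes and output. This is not ProgramBench's actual harness, which the article does not describe. The helper names `run` and `behavior_matches` are hypothetical, and small Python one-liners stand in for the reference binary and the candidate rebuilds.

```python
import subprocess
import sys


def run(cmd, stdin_data):
    """Run a command with the given stdin bytes; return (exit code, stdout)."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=10)
    return proc.returncode, proc.stdout


def behavior_matches(reference_cmd, candidate_cmd, inputs):
    """Return True only if both commands agree on every test input."""
    for data in inputs:
        if run(reference_cmd, data) != run(candidate_cmd, data):
            return False
    return True


# Assumption: these one-liners stand in for a real reference binary
# and two attempted rebuilds (one faithful, one subtly wrong).
reference = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
faithful = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
flawed = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read().lower())"]

inputs = [b"hello\n", b"Mixed Case\n", b""]

if __name__ == "__main__":
    print("faithful rebuild passes:", behavior_matches(reference, faithful, inputs))
    print("flawed rebuild passes:", behavior_matches(reference, flawed, inputs))
```

Under a check like this, a rebuild that differs on even one input fails outright, which is one reason all-or-nothing scoring can produce a 0% result.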
Why It Matters
The result reveals a critical AI blind spot in reverse engineering, a skill essential for security, legacy code, and software maintenance.