Developer Tools

SCDBench: New Benchmark Shows Even the Best LLM Fails 93% of Smart Contract Decompilations

The best frontier model, Claude Opus 4.7, perfectly decompiles only 42 of 600 contracts

Deep Dive

SCDBench is a new benchmark for rigorously evaluating LLM-based smart contract decompilers. Its dataset pairs 600 real-world Solidity contracts with their bytecode inputs, ground-truth source code, and replayable semantic checkpoints. Decompiler outputs are assessed across four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. The benchmark targets a growing problem: LLMs often produce Solidity that compiles and looks valid but does not behave like the original contract.
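One way to picture the cumulative scoring is sketched below. This is a minimal illustration under our own assumptions, not SCDBench's harness: the stage names, `StageResult`, `cumulative_pass`, and `stage_rates` are hypothetical, and the per-stage checks themselves (running a compiler, comparing ABIs, replaying checkpoints) are assumed to have already produced booleans. "Cumulative" is read here as: an output counts at a stage only if it also cleared every earlier stage.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stage keys in the order the benchmark lists them.
STAGES = ["format", "compile", "abi", "semantic"]


@dataclass
class StageResult:
    contract_id: str
    passed: Dict[str, bool]  # raw outcome of each stage's check


def cumulative_pass(result: StageResult) -> Dict[str, bool]:
    """A stage counts as passed only if every earlier stage also passed."""
    out: Dict[str, bool] = {}
    ok = True
    for stage in STAGES:
        ok = ok and result.passed.get(stage, False)
        out[stage] = ok
    return out


def stage_rates(results: List[StageResult]) -> Dict[str, float]:
    """Fraction of contracts clearing each stage under cumulative scoring."""
    counts = {stage: 0 for stage in STAGES}
    for r in results:
        for stage, ok in cumulative_pass(r).items():
            counts[stage] += int(ok)
    return {stage: counts[stage] / len(results) for stage in STAGES}


results = [
    StageResult("c1", {"format": True, "compile": True, "abi": True, "semantic": True}),
    # The semantic check passes on its own, but ABI recovery failed first.
    StageResult("c2", {"format": True, "compile": True, "abi": False, "semantic": True}),
    StageResult("c3", {"format": False, "compile": True, "abi": True, "semantic": True}),
]
print(stage_rates(results))
```

If a "perfect" decompilation means clearing all four stages, the headline result corresponds to a semantic-stage rate of 42/600 = 0.07 under this kind of scoring.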

Tests of frontier models (Claude Opus 4.7, GPT-5.3-Codex, and GLM-5, with and without extended reasoning) show that semantic consistency is far from solved: the best-performing model perfectly decompiled only 42 of 600 contracts (7%). Adding same-model compilation repair, in which the model that produced the code is asked to fix outputs that fail to compile, substantially improved results at little extra cost. These results expose a critical gap in current LLM capabilities for blockchain security, where reliable decompilation is essential for auditing and transparency.
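A same-model repair loop could look roughly like the sketch below, assuming repair means feeding the compiler's error log back to the model that wrote the code. The prompt wording, `decompile_with_repair`, `max_repairs`, and the toy model and compiler are all hypothetical stand-ins; a real setup would call an LLM API and the Solidity compiler. Counting model calls makes the extra cost of repair visible.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: a model maps a prompt to Solidity source;
# a compiler maps source to (success, error log).
Model = Callable[[str], str]
Compiler = Callable[[str], Tuple[bool, str]]


def decompile_with_repair(
    bytecode: str, model: Model, compile_source: Compiler, max_repairs: int = 1
) -> Tuple[Optional[str], int]:
    """Decompile bytecode, then let the same model fix compile errors.

    Returns the compiling source (or None if the repair budget runs out)
    and the total number of model calls spent.
    """
    source = model(f"Decompile this EVM bytecode to Solidity:\n{bytecode}")
    calls = 1
    ok, log = compile_source(source)
    repairs = 0
    while not ok and repairs < max_repairs:
        # Feed the compiler's error log back to the model that wrote the code.
        source = model(
            "This Solidity fails to compile. Fix it without changing behavior.\n"
            f"Errors:\n{log}\n\nSource:\n{source}"
        )
        calls += 1
        repairs += 1
        ok, log = compile_source(source)
    return (source if ok else None), calls


# Toy stand-ins: the "model" emits broken code until it sees an error log,
# and the "compiler" only checks that braces are balanced.
def toy_model(prompt: str) -> str:
    return "contract C {}" if "Errors:" in prompt else "contract C {"


def toy_compiler(source: str) -> Tuple[bool, str]:
    ok = source.count("{") == source.count("}")
    return ok, ("" if ok else "ParserError: unbalanced braces")


print(decompile_with_repair("0x6080", toy_model, toy_compiler))  # ('contract C {}', 2)
```

Note that a successful repair only gets the output past the compilability stage; it says nothing yet about ABI recovery or semantic consistency.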

Key Points
  • SCDBench contains 600 real-world Solidity contracts with bytecode, source code, and replayable semantic checkpoints for reproducible evaluation.
  • Evaluation covers four stages: format completeness, compilability, ABI recovery, and semantic consistency via differential replay.
  • The best frontier model (Claude Opus 4.7) perfectly decompiles only 42/600 contracts (7%); same-model compilation repair substantially boosts performance at low cost.
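Differential replay, the final stage in the list above, is what catches code that compiles but behaves differently. The toy model below illustrates the idea under our own assumptions rather than SCDBench's checkpoint format: contracts are plain Python functions over a dictionary of storage, and `differential_replay` and both toy contracts are hypothetical. A real harness would replay transactions on an EVM and compare storage slots, return data, and possibly emitted events.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical model of a contract, not an EVM: a function from
# (storage, call) to (new storage, return value).
Storage = Dict[str, int]
Call = Tuple[str, Tuple[Any, ...]]  # (function name, arguments)
Contract = Callable[[Storage, Call], Tuple[Storage, Any]]


def differential_replay(original: Contract, candidate: Contract,
                        trace: List[Call]) -> bool:
    """Replay one recorded call trace on both contracts; any divergence fails."""
    s_orig: Storage = {}
    s_cand: Storage = {}
    for call in trace:
        # Pass copies so the previous checkpoint's state is never mutated in place.
        s_orig, out_orig = original(dict(s_orig), call)
        s_cand, out_cand = candidate(dict(s_cand), call)
        # Checkpoint after every call: outputs and storage must both match.
        if out_orig != out_cand or s_orig != s_cand:
            return False
    return True


def original_counter(storage: Storage, call: Call) -> Tuple[Storage, Any]:
    name, args = call
    if name == "increment":
        storage["count"] = storage.get("count", 0) + args[0]
        return storage, None
    return storage, storage.get("count", 0)  # any other call acts as a getter


def plausible_but_wrong(storage: Storage, call: Call) -> Tuple[Storage, Any]:
    name, args = call
    if name == "increment":
        # Plausible-looking, but ignores the amount argument.
        storage["count"] = storage.get("count", 0) + 1
        return storage, None
    return storage, storage.get("count", 0)


trace = [("increment", (5,)), ("get", ())]
print(differential_replay(original_counter, original_counter, trace))     # True
print(differential_replay(original_counter, plausible_but_wrong, trace))  # False
```

Both toy contracts expose the same functions, so a check on names alone would pass; only replaying behavior exposes the bug.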

Why It Matters

Reliable smart contract decompilation is critical for blockchain security; current LLMs fall far short.