Research & Papers

Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

A new neuro-symbolic framework combines LLMs with theorem provers to automate complex software verification.

Deep Dive

A research team led by Baoding He, Zenan Li, and others has introduced Stepwise, a novel neuro-symbolic framework designed to automate the complex task of formal software verification. The system tackles the longstanding bottleneck in proving critical systems correct—the manual construction of proof scripts—by integrating the reasoning capabilities of large language models (LLMs) with the precision of interactive theorem provers (ITPs) like Isabelle. Stepwise operates by performing a best-first search over a tree of possible proof states. At each step, it queries a fine-tuned LLM for candidate actions, then leverages symbolic ITP tools to validate, repair, or filter these suggestions and automatically discharge simpler subgoals. This synergy allows for data-efficient adaptation of the LLM and semantics-informed pruning of the vast search space, making the proof search process far more efficient and effective.

The team implemented their framework on a new Isabelle REPL and evaluated it rigorously. Its most notable achievement is on the FVEL seL4 benchmark, a formal verification of a high-assurance microkernel used in security-critical systems. Stepwise successfully proved 77.6% of the theorems, a result that substantially surpasses previous LLM-based methods and even outperforms Isabelle's powerful standalone automation tool, Sledgehammer. Crucially, it solved significantly more multi-step proofs, demonstrating an ability to handle complex, sequential reasoning that earlier approaches struggled with. Results across additional Isabelle developments showed strong generalization, indicating the framework is not a narrow solution but a viable and scalable approach to automating software verification for a wide range of critical systems, from operating systems to cryptographic protocols.

Key Points
  • Proved 77.6% of theorems on the seL4 microkernel benchmark, outperforming prior AI methods and the Sledgehammer tool.
  • Uses a neuro-symbolic architecture: fine-tuned LLMs suggest steps, while symbolic theorem prover tools validate and repair them.
  • Implements a best-first tree search over proof states, enabling it to solve complex, multi-step proofs previously out of reach.

Why It Matters

This dramatically accelerates the verification of critical software (OS kernels, crypto), making high-assurance systems more feasible to build.