Research & Papers

Verifiable Transformers: solver-checkable circuit explanations for AI

A new framework turns circuit-level explanations of Transformers into bounded formal claims that SMT solvers can prove or refute.

Deep Dive

Mechanistic interpretability often identifies circuits inside Transformer models, but explanations are typically validated through examples and ablations, which leaves a gap between plausible circuits and proven understanding. Neel Somani's new paper, 'Towards Verifiable Transformers: Solver-Checkable Circuit Explanations', introduces a framework that bridges this gap by converting task-localized circuits into bounded, formal claims that can be verified or refuted using SMT (Satisfiability Modulo Theories) solvers. Given a behavior, a finite input domain, and a candidate-token projection, the approach extracts a task circuit and encodes it directly as SMT constraints. For circuits whose operators cannot be tractably encoded, a surrogate-mediated method fits an SMT-encodable surrogate, validates it over the bounded domain, and verifies symbolic explanations against the surrogate.
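To make the idea of a bounded, refutable claim concrete, here is a minimal Python sketch. It is not the paper's method: exhaustive enumeration stands in for an SMT solver, and `toy_circuit`, `explanation`, `broken_explanation`, the alphabet, and the length bound are all hypothetical, invented for illustration. The check either confirms the claim on every input in the finite domain or returns a counterexample.

```python
from itertools import product

VOCAB = ("a", "b", '"')  # hypothetical toy alphabet (assumption)
MAX_LEN = 4              # bound on sequence length (assumption)

def toy_circuit(tokens):
    """Hypothetical stand-in for an extracted circuit: is a quote currently open?"""
    state = 0
    for t in tokens:
        if t == '"':
            state = 1 - state  # flip on every quote character
    return state == 1

def explanation(tokens):
    """Candidate symbolic claim: a quote is open iff the quote count is odd."""
    return tokens.count('"') % 2 == 1

def broken_explanation(tokens):
    """Deliberately wrong claim: a quote is open iff any quote appears."""
    return '"' in tokens

def check_claim(circuit, claim, vocab=VOCAB, max_len=MAX_LEN):
    """Return None if circuit and claim agree on the whole bounded domain,
    otherwise return the first disagreeing input as a counterexample."""
    for n in range(1, max_len + 1):
        for tokens in product(vocab, repeat=n):
            if circuit(tokens) != claim(tokens):
                return tokens
    return None

print(check_claim(toy_circuit, explanation))         # None: claim holds on the domain
print(check_claim(toy_circuit, broken_explanation))  # a counterexample tuple
```

A real solver would search the same bounded domain symbolically rather than by enumeration, but the outcome has the same shape: either the claim holds everywhere in the domain, or a concrete refuting input comes back.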

Somani instantiates direct verification with a GPT-style architecture that uses Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks (e.g., quote closing, bracket type tracking), the framework exhaustively verifies properties such as projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, but naive direct SMT verification is intractable. Instead, surrogate-mediated verification is demonstrated on task-localized circuits with hard-to-encode attention, yielding both verified symbolic explanations and solver-generated counterexamples. The paper emphasizes that the goal is not full-model verification but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted, a key step for trustworthy AI.
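One plausible reason this operator stack suits solver encoding is that sparsemax and LeakyReLU are piecewise-linear, whereas softmax relies on exponentials. The sketch below implements both in plain Python using the standard sparsemax definition (Euclidean projection onto the probability simplex). It is an illustration rather than code from the paper; the LeakyReLU slope of 1/100 is an assumed value, and Signed L1 BandNorm is omitted because this summary does not define it.

```python
from fractions import Fraction

def sparsemax(scores):
    """Project scores onto the probability simplex (standard sparsemax)."""
    z = sorted(scores, reverse=True)
    cumulative = 0
    tau = 0
    for j, value in enumerate(z, start=1):
        cumulative += value
        # Position j belongs to the support while 1 + j * z_(j) > sum of the top-j scores.
        # The support is a prefix of the sorted scores, so the last j that passes wins.
        if 1 + j * value > cumulative:
            tau = (cumulative - 1) / j
    # Only comparisons, sums, and one division: no exponentials are involved.
    return [max(s - tau, 0) for s in scores]

def leaky_relu(x, slope=Fraction(1, 100)):
    """Two linear pieces: identity for x >= 0, a small slope otherwise (slope is assumed)."""
    return x if x >= 0 else slope * x

weights = sparsemax([Fraction(2), Fraction(1), Fraction(-1)])
print(weights)                   # all attention mass lands on the first position
print(leaky_relu(Fraction(-2)))
```

Using `Fraction` inputs keeps the arithmetic exact, which mirrors how a solver reasons over rationals. Unlike softmax, sparsemax can assign exactly zero weight to low-scoring positions.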

Key Points
  • Introduces direct SMT encoding for Transformer circuits with sparsemax attention and LeakyReLU; exhaustively verified on small symbolic tasks.
  • Surrogate-mediated verification extends the approach to task-localized circuits in GPT-2-scale models by fitting an SMT-encodable surrogate and validating it over a bounded domain.
  • Properties verified include projected functional equivalence, edge necessity, content invariance, and final-residual robustness, with solver-generated counterexamples when claims fail.
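Two of the listed properties can be illustrated as bounded checks. The paper's formal definitions are not reproduced in this summary, so the readings below are assumptions: content invariance is taken to mean that swapping one non-quote token for another never changes the output, and edge necessity is taken to mean that ablating an edge changes the output on at least one input. The circuit, alphabet, and bound are hypothetical, and enumeration again stands in for a solver.

```python
from itertools import product

VOCAB = ("a", "b", '"')  # hypothetical toy alphabet (assumption)
CONTENT = ("a", "b")     # tokens treated as interchangeable content (assumption)
MAX_LEN = 3              # bound on sequence length (assumption)

def circuit(tokens, quote_edge=True):
    """Hypothetical toy circuit; quote_edge=False ablates its quote-tracking edge."""
    is_open = False
    for t in tokens:
        if quote_edge and t == '"':
            is_open = not is_open
    return is_open

def domain():
    """Enumerate every sequence in the bounded domain."""
    for n in range(1, MAX_LEN + 1):
        yield from product(VOCAB, repeat=n)

def content_invariance_counterexample():
    """Search for a content-token swap that changes the output; None means the claim holds."""
    for tokens in domain():
        for i, t in enumerate(tokens):
            if t not in CONTENT:
                continue
            for other in CONTENT:
                swapped = tokens[:i] + (other,) + tokens[i + 1:]
                if circuit(swapped) != circuit(tokens):
                    return tokens, swapped
    return None

def edge_necessity_witness():
    """Search for an input where ablation changes the output; None means the edge is not needed."""
    for tokens in domain():
        if circuit(tokens) != circuit(tokens, quote_edge=False):
            return tokens
    return None

print(content_invariance_counterexample())  # None: no swap changes the output
print(edge_necessity_witness())             # an input on which the edge matters
```

Note the asymmetry: content invariance is a universal claim that a single counterexample refutes, while edge necessity under this reading is an existential claim that a single witness establishes.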

Why It Matters

Turns informal circuit explanations into bounded formal claims that a solver can prove or refute, which matters for auditing and trusting AI systems in production.