Research & Papers

FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Turning LaTeX physics problems into verified Lean4 code with 100% accuracy

Deep Dive

FormalScience, developed by Jordan Meadows, Lan Zhang, and Andre Freitas, introduces a domain-agnostic human-in-the-loop pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning for low economic cost. The system leverages agentic code generation in Lean, a proof assistant, to autoformalize scientific reasoning from LaTeX sources. Applied to physics, the team constructed FormalPhysics, a dataset of 200 university-level physics problems and solutions (primarily quantum mechanics and electromagnetism) with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity.

The researchers evaluated open-source models and proprietary systems on a statement autoformalization task via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach. They provide the first systematic characterization of semantic drift in physics autoformalization, identifying concepts such as notational collapse and abstraction elevation, revealing what formal language verifies when full semantic preservation is unattainable. The codebase and interactive UI-based FormalScience system are released on GitHub, facilitating autoformalization and theorem proving in scientific domains beyond physics. This work, accepted at ACL 2026, marks a significant step toward scalable, verifiable scientific reasoning with AI.

Key Points
  • FormalScience uses a human-in-the-loop agentic pipeline to autoformalize scientific reasoning into Lean4 proofs with perfect formal validity
  • The FormalPhysics dataset includes 200 university-level physics problems (quantum mechanics, electromagnetism) with higher statement complexity than existing formal math benchmarks
  • The system characterizes semantic drift through notational collapse and abstraction elevation, revealing limitations in full semantic preservation during autoformalization

Why It Matters

Enables domain experts without formal language skills to produce verified proofs, advancing AI-verified scientific research.