Research & Papers

Formalizing statistical learning theory in Lean 4

A new Lean 4 repository formalizes PAC bounds, VC-dimension, and Rademacher complexity.

Deep Dive

The FormalSLT repository, created by researcher R.S., provides a machine-checked formalization of key results in statistical learning theory using the Lean 4 proof assistant. Current results cover finite-class empirical risk minimization (ERM) bounds, Rademacher symmetrization, high-probability Rademacher bounds, the Sauer–Shelah lemma (which bounds how many labelings a hypothesis class can realize in terms of its VC-dimension), finite scalar contraction, linear predictor bounds, finite PAC-Bayes bounds, and algorithmic stability. The project prioritizes readability and pedagogical structure over raw infrastructure, aiming to serve as a "theorem ladder" that builds from assumptions to final bounds in clear, scoped steps.
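To give a sense of what one rung of such a ladder looks like, here is the standard textbook form of the finite-class ERM bound, built up in steps. This is a sketch in conventional notation, not a transcription of the repository's Lean statements: the symbols $\mathcal{H}$ (finite hypothesis class), $R$ (true risk), $\hat{R}_n$ (empirical risk on $n$ i.i.d. samples), $\hat{h}$ (the ERM output), and the assumption that the loss takes values in $[0,1]$ are all introduced here for illustration.

```latex
\begin{align*}
&\text{Hoeffding, for a fixed } h \in \mathcal{H}:
  && \Pr\bigl[\,|R(h) - \hat{R}_n(h)| > \varepsilon\,\bigr] \le 2e^{-2n\varepsilon^{2}} \\
&\text{Union bound over } \mathcal{H}:
  && \Pr\bigl[\,\exists h \in \mathcal{H} : |R(h) - \hat{R}_n(h)| > \varepsilon\,\bigr]
     \le 2\,|\mathcal{H}|\,e^{-2n\varepsilon^{2}} \\
&\text{Set the right-hand side to } \delta:
  && \varepsilon = \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}} \\
&\text{ERM excess risk, with probability} \ge 1-\delta:
  && R(\hat{h}) \le \hat{R}_n(\hat{h}) + \varepsilon
     \le \hat{R}_n(h^{*}) + \varepsilon
     \le R(h^{*}) + 2\varepsilon,
     \quad h^{*} \in \arg\min_{h \in \mathcal{H}} R(h)
\end{align*}
```

Each line depends only on the one before it plus a single named assumption, which is the kind of scoped step a proof assistant can check in isolation.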

Unlike existing Lean SLT efforts that rely heavily on empirical-process machinery and abstract probability, FormalSLT focuses on explicit finite-sample PAC bounds, Rademacher complexity, and stability routes. The code maintains explicit assumptions, scoped theorem statements, and close alignment with standard SLT textbooks. The author invites feedback on theorem organization, proof structure, naming/API decisions, and future targets. This makes the project especially valuable for ML practitioners and theorists who want to verify proofs without wading through dense probabilistic infrastructure.
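For reference, the Rademacher and VC routes mentioned above are usually stated in textbooks as the following explicit finite-sample bounds. Again this is a sketch in assumed notation rather than the repository's own theorem statements: $\mathcal{F}$ is a class of functions with values in $[0,1]$, $\mathfrak{R}_n(\mathcal{F})$ is its Rademacher complexity, $\Pi_{\mathcal{H}}(n)$ is the growth function of a binary class $\mathcal{H}$, and $d$ is its VC-dimension.

```latex
\begin{align*}
&\text{Rademacher bound, with probability} \ge 1-\delta, \text{ for all } f \in \mathcal{F}:
  && \mathbb{E}[f] \le \frac{1}{n}\sum_{i=1}^{n} f(z_i)
     + 2\,\mathfrak{R}_n(\mathcal{F})
     + \sqrt{\frac{\ln(1/\delta)}{2n}} \\
&\text{Sauer--Shelah, for } n \ge d \ge 1:
  && \Pi_{\mathcal{H}}(n) \le \sum_{i=0}^{d} \binom{n}{i}
     \le \Bigl(\frac{en}{d}\Bigr)^{d}
\end{align*}
```

Both statements name every quantity and constant outright, which is what "explicit finite-sample" means in practice: nothing is hidden behind asymptotic notation or an abstract empirical-process theorem.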

Key Points
  • Formalizes eight families of SLT results, including finite-class ERM bounds, Rademacher symmetrization, the Sauer–Shelah lemma for VC-dimension, finite PAC-Bayes bounds, and algorithmic stability.
  • Prioritizes readable, explicit assumptions and scoped theorem statements over abstract empirical-process infrastructure.
  • Built as a pedagogically structured "theorem ladder" that climbs from stated assumptions to final bounds, all checked in Lean 4.

Why It Matters

Machine-checked proofs force every assumption and constant in a learning-theory bound to be stated explicitly and verified step by step, which makes published ML bounds easier to trust and to reuse.