The Network Structure of Mathlib
A deep dive into 308K declarations and 8.4M edges shows human taxonomies clash with logic.
A team of researchers from UCLA and other institutions has published a network analysis of Lean 4's Mathlib, the world's largest formalized mathematics library. By extracting its dependency structure into a multilayer graph of 308,129 declarations, 8.4 million edges, and 7,563 modules, they quantified how mathematical knowledge is organized in a formal proof assistant. The study introduces graph decompositions that isolate explicit edges from those synthesized by the compiler or driven by proofs, revealing structural properties that were previously opaque.
Three major findings emerge. First, human-designed taxonomies (like namespaces) diverge significantly from logical dependencies, with a 50.9% coupling across namespaces—meaning nearly half of all dependencies cross intended boundaries. Second, developers typically use only a median of 1.6% of the scope they import, suggesting massive over-importing. Third, formalization compresses semantic hierarchies, so network centrality measures capture language infrastructure (like type classes) rather than mathematical importance. This work, submitted to arXiv, has implications for designing better programming languages, improving library organization, and understanding how formalization shapes mathematical thinking.
- Graph includes 308,129 declarations, 8.4M edges, and 7,563 modules from Lean 4's Mathlib
- Human taxonomies show 50.9% coupling across namespaces, indicating logical structure diverges from human organization
- Developers use only a median 1.6% of imported scope, revealing massive over-importing in formal proofs
Why It Matters
This work reveals structural inefficiencies in formal math libraries, guiding better language design and proof engineering.