LeanSearch v2 boosts Lean 4 theorem proving with 20% success rate

arXiv cs.IR May 14, 2026

⚡New AI retrieval system recovers 46.1% of ground-truth theorem premise groups within 10 candidates.

Deep Dive

Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof, a task called global premise retrieval. Existing tools either find individual declarations that match a query or predict useful lemmas one tactic step at a time, but none recovers the full premise set that an entire theorem requires. LeanSearch v2, developed by Gao et al., tackles this with a two-mode architecture. Its standard mode runs an embedding-reranker pipeline over a hierarchy-informalized Mathlib corpus and achieves state-of-the-art single-query retrieval without any domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system).
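
The standard mode uses a two-stage design: a fast embedding search builds a shortlist, and a reranker re-scores only that shortlist. That design can be sketched in a few lines. This is a toy illustration under stated assumptions, not LeanSearch v2's code. Bag-of-words cosine similarity stands in for the learned embedding model, and token overlap stands in for the reranker. The three-entry corpus, whose descriptions are prefixed with a library-hierarchy label, is a hypothetical imitation of a hierarchy-informalized Mathlib corpus. The names `search`, `embed`, and `rerank_score` are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def embed(text: str) -> Counter:
    # Stand-in "embedding": a sparse bag-of-words vector (hypothetical; a real
    # system would use a learned dense embedding model here).
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[tok] for tok, count in a.items())  # missing tokens count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, doc: str) -> float:
    # Stand-in reranker: share of query tokens that occur in the document.
    # Real rerankers are often cross-encoders that score (query, doc) jointly.
    q_toks = set(tokenize(query))
    return len(q_toks & set(tokenize(doc))) / len(q_toks) if q_toks else 0.0


def search(query: str, corpus: dict[str, str],
           k_retrieve: int = 50, k_final: int = 10) -> list[str]:
    # Stage 1: cheap vector similarity over the whole corpus keeps a shortlist.
    q_vec = embed(query)
    shortlist = sorted(corpus.items(),
                       key=lambda item: cosine(q_vec, embed(item[1])),
                       reverse=True)[:k_retrieve]
    # Stage 2: the (normally costlier) reranker re-orders only the shortlist.
    reranked = sorted(shortlist, key=lambda item: rerank_score(query, item[1]),
                      reverse=True)
    return [name for name, _ in reranked[:k_final]]


# Hypothetical corpus: declaration name -> informal description prefixed with
# its place in the library hierarchy.
corpus = {
    "Nat.add_comm": "Nat basic: addition of natural numbers is commutative",
    "Nat.mul_comm": "Nat basic: multiplication of natural numbers is commutative",
    "Finset.sum_comm": "Finset big operators: swapping the order of a double sum",
}
print(search("commutativity of addition on natural numbers", corpus, k_final=2))
```

The split matters because the reranker only ever sees `k_retrieve` candidates. That makes room for a slower but more accurate scorer at the second stage without scanning the whole library with it.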

The reasoning mode builds on this substrate and targets global premise retrieval through iterative sketch-retrieve-reflect cycles. On a benchmark of 69 research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates. That beats strong reasoning-based retrieval systems (38.0%) and premise-selection baselines (9.3%). In a controlled downstream evaluation with a fixed prover loop, swapping in LeanSearch v2 for the alternative retrievers yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), indicating that retrieval quality carries through to proof generation. The authors have open-sourced all code, data, and benchmarks, and the standard mode is accessible via a public API.
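
A sketch-retrieve-reflect cycle can be pictured as a loop that alternates planning, search, and self-checking. The sketch below is a minimal illustration, not the authors' implementation. The `sketch`, `retrieve`, and `reflect` callables, the stopping rule, and the toy index are all hypothetical; in a real system the sketch and reflect steps would be model-driven, whereas here they are fixed rules. The example also scores the output with a group recall@k. It assumes a ground-truth premise group counts as recovered when any one of its members appears among the top k candidates; the paper's exact definition may differ. The lemma names are illustrative Mathlib-style names.

```python
from typing import Callable

# Hypothetical signatures: retrieve(query, k) -> names; sketch(theorem, found) -> queries;
# reflect(theorem, found) -> True when the candidate set looks sufficient.
Retrieve = Callable[[str, int], list[str]]
Sketch = Callable[[str, list[str]], list[str]]
Reflect = Callable[[str, list[str]], bool]


def reasoning_search(theorem: str, retrieve: Retrieve, sketch: Sketch,
                     reflect: Reflect, max_rounds: int = 3, k: int = 10) -> list[str]:
    """Iterate sketch -> retrieve -> reflect, accumulating unique candidates."""
    found: list[str] = []
    for _ in range(max_rounds):
        # Sketch: propose sub-goal queries, conditioned on what was found so far.
        for query in sketch(theorem, found):
            # Retrieve: single-query search for each sub-goal, deduplicated.
            for name in retrieve(query, k):
                if name not in found:
                    found.append(name)
        # Reflect: stop early once the candidate set looks sufficient.
        if reflect(theorem, found):
            break
    return found[:k]


def group_recall_at_k(candidates: list[str], groups: list[list[str]], k: int = 10) -> float:
    """Fraction of premise groups with at least one member in the top k (assumed definition)."""
    top = set(candidates[:k])
    hits = sum(1 for group in groups if top & set(group))
    return hits / len(groups) if groups else 0.0


# Toy stand-ins (hypothetical index and rules).
INDEX = {
    "continuity": ["Continuous.comp", "continuous_id"],
    "compactness": ["IsCompact.image", "isCompact_univ"],
}


def toy_retrieve(query: str, k: int) -> list[str]:
    return INDEX.get(query, [])[:k]


def toy_sketch(theorem: str, found: list[str]) -> list[str]:
    # Fixed rule: look up continuity facts first, then compactness facts.
    return ["continuity"] if not found else ["compactness"]


def toy_reflect(theorem: str, found: list[str]) -> bool:
    return any(name.startswith("IsCompact") for name in found)


candidates = reasoning_search("the continuous image of a compact set is compact",
                              toy_retrieve, toy_sketch, toy_reflect)
groups = [["IsCompact.image", "IsCompact.image_of_continuousOn"],
          ["Continuous.comp"],
          ["Set.image_subset"]]
print(candidates)
print(group_recall_at_k(candidates, groups))
```

In this toy run, the second round's compactness lookup satisfies the reflect check and the loop stops early. Two of the three groups end up covered. The third group's lemma was never queried, which is exactly the kind of gap a further sketch step would need to close.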

Key Points

Two-mode retrieval: standard (hierarchy-informalized Mathlib with embedding-reranker pipeline) and reasoning (iterative sketch-retrieve-reflect cycles).
Standard mode achieves nDCG@10 of 0.62 vs. 0.53 for the next-best system, without domain-specific fine-tuning.
Reasoning mode recovers 46.1% of ground-truth premise groups within 10 candidates and yields 20% proof success in a controlled evaluation, five times the 4% success rate without retrieval.
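
The nDCG@10 scores quoted in the key points reward placing relevant declarations near the top of the first ten results. Below is a minimal sketch of the standard metric. It assumes graded relevance labels per declaration and the common log2 rank discount; the relevance grading behind the reported scores may differ. `ndcg_at_k` and the example labels are hypothetical.

```python
import math


def dcg_at_k(gains: list[float], k: int = 10) -> float:
    # Discounted cumulative gain: the gain at 1-based rank i is divided by log2(i + 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: list[str], relevance: dict[str, float], k: int = 10) -> float:
    # Gains of the system's ranking, in order; unlabeled results count as 0.
    gains = [relevance.get(name, 0.0) for name in ranked]
    # Ideal DCG: the same labels sorted best-first.
    ideal = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


relevance = {"Nat.add_comm": 1.0, "Nat.add_assoc": 1.0}
print(ndcg_at_k(["Nat.add_comm", "Nat.mul_comm", "Nat.add_assoc"], relevance))
print(ndcg_at_k(["Nat.add_comm", "Nat.add_assoc", "Nat.mul_comm"], relevance))
```

Moving `Nat.add_assoc` from rank 3 to rank 2 lifts the score from about 0.92 to 1.0. The metric is sensitive to ordering within the top ten, not just to whether relevant items appear.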

Why It Matters

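LeanSearch v2 shows that retrieving the full premise set a theorem needs, rather than one lemma at a time, measurably improves automated theorem proving in Lean 4. In the authors' controlled evaluation on research-level Mathlib theorems, proof success rose from 4% without retrieval to 20%.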
