Agent Frameworks

STAR-PólyaMath beats GPT-5.5 on MathArena Apex 2025, 93.75% to 80.21%

Perfect scores on AIME, Putnam, and HMMT set a new state of the art for multi-agent mathematical reasoning.

Deep Dive

STAR-PólyaMath tackles foundational reliability issues in multi-agent mathematical reasoning: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. Its architecture is an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that bounds error propagation through trace-back and re-planning. The key innovation is a persistent Meta-Strategist that maintains cross-attempt memory and issues high-level strategic guidance or mandatory directives, allowing the system to escape unproductive loops instead of stagnating or over-relying on tools.
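
The control flow described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the names MetaStrategist, orchestrate, run_step, challenge, and replan are hypothetical, the rule that escalates from guidance to a directive after two failures is invented, and trace-back is simplified to discarding the current attempt. It is not the framework's actual code; it only shows an orchestrator that does no reasoning itself and routes between callables while a persistent object carries memory across attempts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MetaStrategist:
    """Hypothetical stand-in for the persistent Meta-Strategist."""
    memory: List[str] = field(default_factory=list)  # survives across attempts

    def record(self, note: str) -> None:
        self.memory.append(note)

    def advise(self) -> str:
        # Assumed escalation rule: guidance first, a directive after 2 failures.
        if len(self.memory) >= 2:
            return "DIRECTIVE: switch approach"
        return "GUIDANCE: proceed"


def orchestrate(
    plan: List[str],
    run_step: Callable[[str, str], str],
    challenge: Callable[[str, str], bool],
    replan: Callable[[List[str], str], List[str]],
    strategist: MetaStrategist,
    max_attempts: int = 3,
) -> Optional[List[str]]:
    """Plain control flow only; all reasoning lives in the callables."""
    for attempt in range(max_attempts):
        accepted: List[str] = []
        for step in plan:
            result = run_step(step, strategist.advise())
            if challenge(step, result):
                accepted.append(result)
                continue
            # Trace back: drop this attempt's results, log the failure, replan.
            strategist.record(f"attempt {attempt}: {step!r} failed challenge")
            plan = replan(plan, strategist.advise())
            break
        else:
            return accepted  # every step survived its challenge
    return None  # attempt budget exhausted


# Toy usage with stub agents: the "risky" step always fails its challenge.
strategist = MetaStrategist()
steps = orchestrate(
    ["setup", "risky"],
    run_step=lambda step, advice: f"{step}:ok",
    challenge=lambda step, result: step != "risky",
    replan=lambda plan, advice: ["setup", "safe"],
    strategist=strategist,
)
print(steps, strategist.memory)
```

In the toy run, the first attempt fails at the "risky" step, the failure is written to the strategist's memory, and the replanned second attempt completes. Because the attempt budget caps the outer loop, a bad step can cost at most one attempt, which is the sense in which error propagation is bounded in this sketch.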

The framework achieves state-of-the-art results on all eight top-tier competition benchmarks, including AIME 2025-2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIME, Putnam, and HMMT, and shows its largest margin on MathArena Apex 2025 (93.75% vs. 80.21% for GPT-5.5, a gap of 13.54 percentage points). Ablation studies indicate that the gains come from the orchestration strategy rather than model diversity: substituting mixed backbones or removing key components consistently weakens performance.

Key Points
  • Achieves perfect scores on AIME 2025-2026, Putnam 2025, and HMMT February 2026.
  • Outperforms GPT-5.5 on MathArena Apex 2025 by 13.54 percentage points (93.75% vs. 80.21%).
  • Employs a persistent Meta-Strategist with cross-attempt memory and a Python orchestrator for robust error recovery.
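
The Apex 2025 margin can be checked with a few lines of arithmetic. The two scores are taken from the text; the relative-improvement figure is derived here and is not a number reported in the source. Decimal is used so the subtraction is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal

star_score = Decimal("93.75")  # STAR-PólyaMath on Apex 2025, in percent
gpt_score = Decimal("80.21")   # GPT-5.5 on Apex 2025, in percent

# Absolute gap, in percentage points.
margin_pp = star_score - gpt_score

# Relative gain over the GPT-5.5 score (derived, not from the source).
relative_pct = (margin_pp / gpt_score * 100).quantize(Decimal("0.01"))

print(margin_pp, relative_pct)  # 13.54 16.88
```

The distinction matters when comparing headlines: the gap is 13.54 percentage points, which corresponds to roughly a 16.88% relative gain over the baseline score.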

Why It Matters

The results point to reliable long-horizon mathematical reasoning at benchmark-topping levels, a capability critical for automated research, education, and formal theorem proving.