ZAYA1-8B Technical Report
An MoE model with just 700M active parameters rivals giants on math and coding benchmarks.
Zyphra's ZAYA1-8B is a reasoning-focused Mixture-of-Experts (MoE) model that packs a punch with only 700M active parameters out of 8B total. Built on Zyphra's MoE++ architecture and trained from scratch on a full-stack AMD compute platform, it matches or exceeds DeepSeek-R1-0528 on challenging mathematics and coding benchmarks. The model stays competitive with much larger open-weight reasoning models, proving that efficiency can rival scale.
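The active-vs-total parameter gap comes from sparse expert routing: each token activates only a few experts, so most of the 8B parameters sit idle on any given forward pass. Below is a minimal sketch of a standard top-k routed MoE layer to illustrate the idea; the report's MoE++ specifics are not detailed here, and all sizes and names are illustrative, not ZAYA1's actual configuration.

```python
# Illustrative top-k routed MoE layer (generic switch-style design, NOT
# the MoE++ architecture itself; dimensions are arbitrary placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # mixing weights over the k picks
        out = torch.zeros_like(x)
        # Only k of n_experts run per token, so "active" parameters per token
        # are roughly k/n_experts of the total expert parameters.
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == int(e)
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```

With k=2 of 16 experts, each token touches about an eighth of the expert weights, mirroring (in miniature) ZAYA1-8B's roughly 700M-active-of-8B ratio.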
Training is reasoning-first throughout: reasoning data is incorporated from pretraining onward via an answer-preserving trimming scheme. Post-training uses a four-stage reinforcement learning cascade: warmup on math and puzzles, a 400-task RLVE-Gym curriculum, math and code RL with test-time compute traces, and behavioral RL for chat. Zyphra also introduces Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only 4K-token tails. This pushes AIME'25 accuracy to 91.9% and HMMT'25 to 89.6%, narrowing the gap to models like Gemini-2.5 Pro and GPT-5-High.
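To make the "Markovian" structure concrete, here is a hedged sketch of the aggregation loop as described above: sample parallel traces, keep only a bounded tail of each, and fold the tails into the next round's state so later rounds never see the full history. The `generate` callable, the aggregation prompt, and the round/branch counts are all assumptions for illustration, not the report's actual implementation.

```python
# Sketch of a Markovian recursive-aggregation loop. Everything beyond
# "parallel traces, 4K-token tails, recursive aggregation" is assumed.
from typing import Callable, List

TAIL_TOKENS = 4096  # per the report, only 4K-token tails are carried forward

def markovian_rsa(
    question: str,
    generate: Callable[[str], str],  # any LLM sampling call: prompt -> trace
    n_parallel: int = 4,             # illustrative branch count
    n_rounds: int = 3,               # illustrative recursion depth
) -> str:
    """Recursively aggregate parallel reasoning traces, Markov-style:
    each round conditions only on bounded tails of the previous round."""
    state = question
    for _ in range(n_rounds):
        # Sample several independent reasoning traces from the current state.
        traces: List[str] = [generate(state) for _ in range(n_parallel)]
        # Keep only the tail of each trace (character slice as a stand-in
        # for a real token-level 4K truncation).
        tails = [t[-TAIL_TOKENS:] for t in traces]
        # Fold the tails into a fresh state; this prompt template is a
        # guess, not the report's actual aggregation prompt.
        state = (
            f"Question: {question}\n\nCandidate solutions:\n"
            + "\n---\n".join(tails)
            + "\n\nReconcile these and continue the solution."
        )
    return generate(state)  # final pass over the aggregated state
```

Because each round's state is bounded, the context never grows with the number of rounds, which is what lets the method scale test-time compute without scaling the context window.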
- ZAYA1-8B uses MoE++ with 700M active out of 8B total parameters, trained on AMD hardware.
- Matches or beats DeepSeek-R1-0528 on math and coding benchmarks despite far fewer active parameters.
- Markovian RSA test-time compute achieves 91.9% on AIME'25 with only 4K-token reasoning tails.
Why It Matters
Small, efficient MoE models can rival massive reasoning engines, lowering barriers for advanced AI deployment.