Nous Research's TST cuts LLM pre-training time by 2.5x

r/Singularity May 16, 2026

⚡4,768 B200-GPU-hours vs 12,311 — a 2.5x speedup at fixed compute.

Deep Dive

Nous Research releases Token Superposition Training (TST), a method that substantially reduces LLM pre-training wall-clock time at fixed compute without changing architecture, optimizer, or data. At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while using 4,768 B200-GPU-hours versus the baseline’s 12,311 — roughly a 2.5x reduction in total pre-training time.
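The headline ratio can be checked directly from the two GPU-hour figures quoted above. The snippet below is only an arithmetic sketch: the function names `speedup` and `fraction_saved` are illustrative and do not come from the Nous Research release, and the only inputs are the two numbers reported in this summary.

```python
# Arithmetic check of the reported figures. Helper names are hypothetical,
# chosen for illustration only.

BASELINE_GPU_HOURS = 12_311  # matched-FLOPs baseline, B200-GPU-hours (from the summary)
TST_GPU_HOURS = 4_768        # TST run, B200-GPU-hours (from the summary)


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Return how many times fewer GPU-hours the new run used."""
    return baseline_hours / new_hours


def fraction_saved(baseline_hours: float, new_hours: float) -> float:
    """Return the share of baseline GPU-hours that the new run avoided."""
    return 1.0 - new_hours / baseline_hours


if __name__ == "__main__":
    ratio = speedup(BASELINE_GPU_HOURS, TST_GPU_HOURS)
    saved = fraction_saved(BASELINE_GPU_HOURS, TST_GPU_HOURS)
    print(f"speedup: {ratio:.2f}x, GPU-hours saved: {saved:.1%}")
```

By these figures the ratio comes out slightly above the rounded 2.5x in the headline, and the TST run uses a little under two fifths of the baseline's GPU-hours.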

Key Points

2.5x wall-clock speedup on 10B-A1B MoE models: 4,768 vs 12,311 B200-GPU-hours
Works across model sizes from 270M to 10B parameters without architecture changes
Open-source release by Nous Research, applicable to existing pre-training pipelines

Why It Matters

Cuts GPU costs for LLM pre-training substantially, enabling faster iteration and helping smaller labs compete.
