Nous Research's TST cuts LLM pre-training time by 2.5x
4,768 B200-GPU-hours vs 12,311: a roughly 2.5x speedup at fixed compute.
Deep Dive
Nous Research releases Token Superposition Training (TST), a method that substantially reduces LLM pre-training wall-clock time at fixed compute without changing architecture, optimizer, or data. At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while using 4,768 B200-GPU-hours versus the baseline’s 12,311 — roughly a 2.5x reduction in total pre-training time.
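As a quick sanity check on the headline number, the two GPU-hour figures quoted above can be divided directly. The short Python sketch below does only that arithmetic; the function and variable names are illustrative and are not taken from the Nous Research release.

```python
# GPU-hour figures as quoted for the 10B-A1B MoE run.
BASELINE_GPU_HOURS = 12_311  # matched-FLOPs baseline
TST_GPU_HOURS = 4_768        # Token Superposition Training


def speedup(baseline_hours: float, method_hours: float) -> float:
    """Return how many times fewer GPU-hours the method needed."""
    return baseline_hours / method_hours


def fraction_saved(baseline_hours: float, method_hours: float) -> float:
    """Return the share of baseline GPU-hours that the method avoided."""
    return 1.0 - method_hours / baseline_hours


ratio = speedup(BASELINE_GPU_HOURS, TST_GPU_HOURS)
saved = fraction_saved(BASELINE_GPU_HOURS, TST_GPU_HOURS)

print(f"speedup: {ratio:.2f}x")
print(f"GPU-hours saved: {saved:.1%}")
```

The ratio comes out slightly above 2.5, in line with the "roughly 2.5x" figure, and corresponds to a little over 60% fewer GPU-hours than the baseline.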
Key Points
- 2.5x wall-clock speedup on 10B-A1B MoE models: 4,768 vs 12,311 B200-GPU-hours
- Works across model sizes from 270M to 10B parameters without architecture changes
- Open-source release by Nous Research, applicable to existing pre-training pipelines
Why It Matters
Cutting GPU-hours by this much lowers the cost of LLM pre-training, which allows faster iteration and makes it easier for smaller labs to compete.