Test-Time Scaling Makes Overtraining Compute-Optimal
Research challenges the Chinchilla scaling laws, showing that heavily overtrained LLMs become compute-optimal once inference costs are factored in.
A team of researchers including Nicholas Roberts and Sungjun Cho has published a groundbreaking paper introducing Train-to-Test (T²) scaling laws. This new framework fundamentally challenges the established Chinchilla scaling laws by accounting for a critical real-world factor: the cost of inference. Modern LLMs often use test-time scaling techniques like repeated sampling or chain-of-thought reasoning, where the inference cost scales with both model size and the number of samples generated. T² laws pose a joint optimization problem over model size, training tokens, and inference samples under a fixed end-to-end compute budget.
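To make that joint budget concrete, here is a minimal sketch in Python, assuming the common approximations of roughly 6ND FLOPs to train an N-parameter model on D tokens and roughly 2N FLOPs per generated token at inference. This is not the paper's exact formulation; the budget, query count, and candidate grids below are illustrative placeholders.

```python
# Minimal sketch of a T^2-style joint budget (not the paper's exact model).
# Assumed approximations:
#   training FLOPs  ~ 6 * N * D          (N params, D training tokens)
#   inference FLOPs ~ 2 * N per generated token
# BUDGET, QUERIES, and the candidate grids are hypothetical placeholders.

BUDGET = 1e23              # total end-to-end FLOPs (training + inference)
QUERIES = 1e9              # expected lifetime inference queries (assumption)
TOKENS_PER_SAMPLE = 1_000  # generated tokens per sampled answer (assumption)

def total_flops(n_params: float, train_tokens: float, k_samples: int) -> float:
    """End-to-end compute: one training run plus k samples per query."""
    train = 6.0 * n_params * train_tokens
    infer = 2.0 * n_params * k_samples * TOKENS_PER_SAMPLE * QUERIES
    return train + infer

# Enumerate joint (N, D, k) allocations that fit the budget. Fitting T^2
# laws would additionally score each allocation with a predicted downstream
# accuracy; here we only show the feasibility constraint they share.
for n in (1e9, 3e9, 10e9):            # model sizes (params)
    for d in (20e9, 200e9, 2000e9):   # training tokens
        for k in (1, 4, 16, 64):      # inference samples per query
            if total_flops(n, d, k) <= BUDGET:
                print(f"N={n:.0e}, D={d:.0e} ({d / n:.0f} tok/param), k={k}")
```

The key structural point the sketch captures is that inference cost grows with model size too, so a larger model is penalized twice: once in training and once on every sampled token.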
Across eight downstream tasks, the team found that when inference costs are included in the total budget, the optimal pretraining strategy shifts dramatically into the 'overtraining' regime. The most compute-efficient models are trained on significantly more data than Chinchilla laws recommend, often beyond the range covered by standard pretraining suites. The researchers validated their forecasts by pretraining models in this optimal overtrained region, confirming that they deliver substantially stronger performance than models optimized by pretraining laws alone.
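A toy calculation shows why the optimum moves. Under one fixed end-to-end budget, a smaller model trained far past Chinchilla's roughly 20 tokens-per-parameter rule is also cheaper per inference token, so the same budget buys many more queries. All numbers below are illustrative, not from the paper.

```python
# Toy arithmetic for the overtraining shift (illustrative numbers only).
# Chinchilla's training-only rule pairs N params with roughly 20*N tokens;
# the overtrained alternative trains a smaller model on far more tokens,
# which also makes every inference token cheaper (fewer params per pass).

BUDGET = 1e23              # fixed end-to-end FLOPs (assumption)
TOKENS_PER_QUERY = 1_000   # generated tokens per query (assumption)

def queries_affordable(n_params: float, train_tokens: float) -> float:
    """Inference queries the leftover budget buys after training."""
    leftover = BUDGET - 6.0 * n_params * train_tokens
    per_query = 2.0 * n_params * TOKENS_PER_QUERY
    return max(leftover, 0.0) / per_query

allocations = {
    "chinchilla (20 tok/param)": (10e9, 200e9),   # 10B params, 200B tokens
    "overtrained (200 tok/param)": (3e9, 600e9),  # 3B params, 600B tokens
}
for name, (n, d) in allocations.items():
    print(f"{name}: {queries_affordable(n, d):.2e} queries within budget")
```

In this hypothetical, the overtrained 3B model serves roughly three times as many queries as the Chinchilla-style 10B model at a similar training cost; the paper's contribution is showing that, with fitted T² laws, the accuracy gains from overtraining make such allocations optimal rather than merely cheaper.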
The implications are significant for how companies like OpenAI, Anthropic, and Google build frontier models. The research also shows the findings hold up through the post-training stage (such as RLHF or DPO), making T² scaling directly relevant to modern LLM deployments. The work provides a new, more holistic blueprint for allocating compute, suggesting that investing more in training to create a smarter model can be far more efficient than spending that compute on drawing more inference samples from a less capable model.
- Introduces Train-to-Test (T²) scaling laws that jointly optimize model size, training tokens, and inference samples under a fixed end-to-end compute budget.
- Shows that once inference costs are counted, optimal training shifts into the 'overtraining' regime, using roughly 10x more data than Chinchilla laws suggest.
- Validated by pretraining models in the overtrained regime, which outperform those tuned by pretraining laws alone; the findings also survive the post-training (RLHF) stage.
Why It Matters
Provides a new cost-optimization blueprint for AI labs, potentially leading to more capable and efficient frontier models by reallocating compute from inference to training.