Research & Papers

The loss curve said tie. The judges said otherwise. Seeking replication for an early LLM training result [R]

A single-GPU researcher's new loss functions show a 60% human preference rate for LLM outputs

Deep Dive

An independent researcher has introduced two novel loss-shaping functions for LLM training that achieved a 59.9% preference rate over standard cross-entropy training in blind evaluations by human and AI judges. The two functions, per-token gain and per-layer divergence scaling, were tested on two 1.2B-parameter models trained on identical data for 30,000 steps (3.9B tokens). The per-token gain function scales each token's loss by its surprise level: confident, correct tokens get reduced weight while surprising tokens are amplified, preserving the overall gradient budget. The per-layer divergence scaling function adjusts gradients per transformer block based on how much that block changed the representation during the forward pass, amplifying actively revising layers and attenuating settled ones.
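The post does not include code or exact formulas, so the sketch below is only one plausible PyTorch reading of the two ideas; the surprise-proportional weighting, the running-average normalization, and every name here are assumptions, not the author's implementation.

```python
# Illustrative sketch only: one way the two loss-shaping ideas could look in
# PyTorch. The exact weighting and normalization used by the author are unknown.
import torch
import torch.nn.functional as F


def per_token_gain_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-token gain (assumed form): weight each token's loss by its own surprise.

    logits: (batch, seq, vocab); targets: (batch, seq)
    """
    # Per-token negative log-likelihood = that token's surprise.
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Down-weight confident tokens, amplify surprising ones; detach so the
    # weights shape the gradient but are not themselves optimized.
    weights = nll.detach()
    # Renormalize to mean 1 so the overall loss scale (gradient budget) is preserved.
    weights = weights / (weights.mean() + 1e-8)
    return (weights * nll).mean()


class DivergenceScaledBlock(torch.nn.Module):
    """Per-layer divergence scaling (assumed form): rescale a block's gradient by
    how much it changed the representation, relative to its own running average.

    Assumes `block` maps a tensor to a same-shaped tensor (residual stream in/out).
    """

    def __init__(self, block: torch.nn.Module, momentum: float = 0.99):
        super().__init__()
        self.block = block
        self.momentum = momentum
        self.register_buffer("div_ema", torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        # Divergence: relative size of the change this block applied.
        div = ((y - x).norm() / (x.norm() + 1e-8)).detach()
        if self.training:
            self.div_ema.mul_(self.momentum).add_((1 - self.momentum) * div)
        # Amplify actively revising blocks, attenuate settled ones (bounded).
        scale = torch.clamp(div / (self.div_ema + 1e-8), 0.5, 2.0)
        # Straight-through trick: forward value stays y, backward gradient is scaled.
        return y.detach() + (y - y.detach()) * scale
```

The detach-based trick in the wrapper leaves the forward activations untouched and only rescales the gradient flowing back through each block, so the per-layer term changes learning dynamics without changing the loss value itself.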

The evaluation involved 42 blind judges (29 humans and 13 foundation models from 11 vendors) making 1,181 pairwise comparisons. The gain-trained model was preferred in 59.9% of the 784 decisive comparisons (two-sided binomial p=2.80e-8). Human and AI judges showed strong agreement: 60.5% vs. 59.0% decisive preference, with 81.2% agreement on which prompts favored the new method. The result survived all sensitivity filters, including the exclusion of speed-clickers and tie-biased judges. Limitations include single-seed training at 1.2B parameters, training on only 16.4% of the Chinchilla-optimal token count, and no separate ablation of the two functions. The researcher is seeking an arXiv endorser for the cs.LG category.
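As a sanity check on the headline statistic: 59.9% of 784 decisive comparisons corresponds to roughly 470 wins (the exact count is not stated in the post), and a two-sided binomial test on that inferred count lands in the same range as the reported p-value.

```python
# Sanity check of the reported significance, using an inferred win count of 470
# (59.9% of 784); the post does not state the exact number of wins.
from scipy.stats import binomtest

result = binomtest(k=470, n=784, p=0.5, alternative="two-sided")
print(result.pvalue)  # on the order of 1e-8, consistent with the reported p = 2.80e-8
```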

Key Points
  • Two novel functions: per-token gain (scales loss by token surprise) and per-layer divergence scaling (amplifies active transformer layers)
  • 42 blind judges (29 humans + 13 AI models) preferred the new method in 59.9% of decisive comparisons (p=2.80e-8)
  • Results consistent across human and AI judges with 81.2% agreement on prompt-level preferences

Why It Matters

Could improve LLM training efficiency and output quality with minimal code changes, democratizing better AI