NVIDIA benchmark reveals AI-generated CUDA kernels silently break training
Top-ranked AI kernels pass tests but cause loss divergence in real training runs.
NVIDIA recently introduced SOL-ExecBench, a benchmark containing 235 real CUDA kernels extracted from production models like DeepSeek, Qwen, Gemma, and Kimi. These kernels represent the exact operations running in modern transformers. The research community submitted AI-generated kernel implementations to compete on speed. The fastest submissions passed the benchmark's verifier with ease, promising significant performance gains. However, when researchers attempted to use these kernels in actual training workloads, they encountered unexpected failures. The most insidious case involved a fused embedding-gradient and RMSNorm backward pass kernel — a core component executed at the end of every transformer training step. When inserted into a small training loop, the loss started to diverge and never recovered. The kernel had passed all benchmark checks, yet it introduced a silent bug that looked exactly like a failed research idea.
Debugging traced the root cause to precision: the kernel accumulated the embedding gradient in bf16 instead of fp32. An embedding backward pass sums many small gradient contributions into rows of the embedding matrix. With uniformly sampled tokens, as in the benchmark, contributions spread evenly across rows and bf16 precision sufficed. In real training, a handful of token IDs accumulate thousands of contributions: small gradients round to zero against the growing accumulator, and the high-frequency rows drift. AdamW's per-parameter normalization masks the bug, so under that optimizer the loss appears stable. Other top submissions had different failure modes, all equally dangerous. Together they highlight a critical gap in validating AI-generated code: benchmarks verify correctness under ideal conditions, but real workloads expose precision- and distribution-dependent bugs that can waste weeks of research time.
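The rounding effect described above can be reproduced without a GPU. The following is a minimal sketch, not the benchmark kernel: it emulates bf16 rounding in pure Python and adds the same small gradient into a single accumulator, once for a rarely seen token and once for a frequent one. The gradient size (1e-3), the contribution counts (4 and 4096), and all function names are illustrative assumptions, not values from the benchmark.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then keep only the top 16 bits of the fp32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(grad: float, count: int, round_fn) -> float:
    """Add `grad` into one accumulator `count` times, rounding after every add."""
    g = round_fn(grad)
    acc = 0.0
    for _ in range(count):
        acc = round_fn(acc + g)
    return acc

# Hypothetical values chosen only to illustrate the effect.
GRAD = 1e-3   # size of each gradient contribution
RARE = 4      # contributions to a rarely seen token's embedding row
HOT = 4096    # contributions to a high-frequency token's embedding row

rare_bf16 = accumulate(GRAD, RARE, to_bf16)
hot_bf16 = accumulate(GRAD, HOT, to_bf16)
hot_fp32 = accumulate(GRAD, HOT, to_fp32)

print(f"rare row: bf16={rare_bf16:.6f}  exact={RARE * GRAD:.6f}")
print(f"hot row:  bf16={hot_bf16:.6f}  fp32={hot_fp32:.6f}  exact={HOT * GRAD:.6f}")
```

Under these assumptions the rare row stays close to the exact sum in bf16, while the hot row stops growing once the gap between adjacent bf16 values near the accumulator exceeds twice the per-step gradient. The fp32 accumulator tracks the exact sum closely. A verifier that only feeds uniformly sampled token IDs exercises the first case and never the second.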
- NVIDIA's SOL-ExecBench includes 235 real production CUDA kernels from models like DeepSeek, Qwen, Gemma, and Kimi.
- A top AI-generated fused embedding-gradient + RMSNorm backward kernel passed the benchmark but caused the loss to diverge because it accumulated in bf16 instead of fp32.
- The bug only appears with non-uniform token distributions and is masked by AdamW, making it indistinguishable from a failed research idea.
Why It Matters
AI-generated code that passes benchmarks can still carry subtle bugs that surface only under real workloads, costing weeks of research time.