NVIDIA benchmark reveals AI-generated CUDA kernels silently break training

r/MachineLearning May 28, 2026

⚡ Top-ranked AI kernels pass tests but cause loss divergence in real training runs.

Deep Dive

NVIDIA recently introduced SOL-ExecBench, a benchmark containing 235 real CUDA kernels extracted from production models like DeepSeek, Qwen, Gemma, and Kimi. These kernels represent the exact operations running in modern transformers. The research community submitted AI-generated kernel implementations to compete on speed, and the fastest submissions passed the benchmark's verifier with ease, promising significant performance gains. When researchers tried these kernels in actual training workloads, however, they hit unexpected failures. The most insidious case involved a fused embedding-gradient and RMSNorm backward pass kernel, a core component executed at the end of every transformer training step. When the kernel was inserted into a small training loop, the loss began to diverge and never recovered. The kernel had passed every benchmark check, yet it introduced a silent bug that looked exactly like a failed research idea.

Debugging traced the root cause to the kernel accumulating the embedding gradient in bf16 instead of fp32. An embedding backward pass sums many small gradient contributions into rows of the embedding matrix. With uniformly sampled tokens (as in the benchmark), contributions spread evenly across rows and bf16 precision sufficed. In real training, a handful of token IDs accumulate thousands of contributions. bf16 carries only 8 bits of significand, so once a row's accumulator is a few hundred times larger than an incoming gradient, the addition rounds to no change: small gradients vanish against the growing accumulator, and high-frequency rows drift. AdamW's per-parameter normalization masks the bug, so under that optimizer the loss appears stable. Other top submissions had different failure modes, all equally dangerous. The episode highlights a critical gap in validating AI-generated code: benchmarks verify correctness under ideal conditions, but real workloads expose precision- and distribution-dependent bugs that can waste weeks of research time.
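The distribution dependence described above can be reproduced in a few lines. What follows is a minimal sketch, not the benchmark's verifier or the submitted kernel: it emulates bf16 rounding in pure Python, scatter-adds a constant per-token gradient into one scalar per embedding row, and compares bf16 accumulation against a double-precision reference. The helper names (`to_bf16`, `worst_row_error`), the constant gradient of `1e-3`, the vocabulary size, and the "four hot tokens take half the traffic" distribution are all assumptions chosen for illustration.

```python
import random
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 is the top 16 bits of an fp32 word; add a rounding bias, then truncate.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def worst_row_error(token_ids, vocab_size, grad=1e-3):
    """Scatter-add one gradient per token into its row; return the worst
    relative error of bf16 accumulation against a double-precision sum."""
    g = to_bf16(grad)              # the incoming gradient is itself a bf16 value
    low = [0.0] * vocab_size       # accumulator rounded to bf16 after every add
    ref = [0.0] * vocab_size       # reference accumulator (Python float, fp64)
    for t in token_ids:
        low[t] = to_bf16(low[t] + g)
        ref[t] += g
    return max(abs(l - r) / r for l, r in zip(low, ref) if r > 0.0)


# Hypothetical sizes, chosen only to keep the demo fast.
VOCAB, N_TOKENS, N_HOT = 1024, 16384, 4
rng = random.Random(0)

# Benchmark-like input: every token ID is equally likely.
uniform_ids = [rng.randrange(VOCAB) for _ in range(N_TOKENS)]

# Training-like input (assumed): N_HOT token IDs receive half of all tokens.
weights = [0.5 / N_HOT] * N_HOT + [0.5 / (VOCAB - N_HOT)] * (VOCAB - N_HOT)
skewed_ids = rng.choices(range(VOCAB), weights=weights, k=N_TOKENS)

uniform_err = worst_row_error(uniform_ids, VOCAB)
skewed_err = worst_row_error(skewed_ids, VOCAB)
print(f"worst relative row error  uniform: {uniform_err:.3f}  skewed: {skewed_err:.3f}")
```

With uniform IDs each row sees only a handful of contributions and the error stays at a few percent. With skewed IDs the hot rows stop growing long before their true sum is reached, so most of their gradient is lost. Real gradients vary in sign and magnitude, so the exact numbers here are not representative, only the contrast between the two inputs.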

Key Points

NVIDIA's SOL-ExecBench includes 235 real production CUDA kernels from models like DeepSeek, Qwen, Gemma, and Kimi.
A top AI-generated fused embedding-gradient + RMSNorm backward kernel passed the benchmark but caused the loss to diverge because it accumulated in bf16 instead of fp32.
The bug only appears with non-uniform token distributions and is masked by AdamW, making it indistinguishable from a failed research idea.
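One plausible reading of the AdamW masking claim is that Adam-style optimizers divide the running mean of the gradient by its running root-mean-square, so a row whose gradient is consistently too small still takes a step of nearly the same size. The sketch below illustrates that property for a single scalar parameter. It is an assumption-laden illustration, not the setup from the original post: the function name `adam_step_size`, the constant gradient sequences, and the values `2.0` and `0.5` are hypothetical, and weight decay is omitted.

```python
import math


def adam_step_size(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates over a gradient sequence for one scalar
    parameter and return the final update (weight decay omitted)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g          # first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (uncentered variance)
    t = len(grads)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Hypothetical values: the true gradient for a hot row, and the same gradient
# after a stalled low-precision accumulator has shrunk it to a quarter.
true_grads = [2.0] * 100
damaged_grads = [0.5] * 100

# Plain SGD steps are proportional to the gradient, so the damage is visible.
sgd_ratio = damaged_grads[-1] / true_grads[-1]

# Adam normalizes by the gradient's own scale, so the step barely changes.
adam_ratio = adam_step_size(damaged_grads) / adam_step_size(true_grads)

print(f"SGD step ratio: {sgd_ratio}  Adam step ratio: {adam_ratio:.6f}")
```

Under these assumptions the damaged row's SGD step shrinks in proportion to the lost gradient, while the Adam step is almost unchanged. That would explain why a magnitude error of this kind can go unnoticed in the loss curve when AdamW is used.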

Why It Matters

AI-generated code that passes benchmarks can introduce subtle bugs, wasting weeks of research time in production environments.
