Media & Culture

A Chinese AI lab just built an AI that writes CUDA code better than torch.compile, and 40% better than Claude Opus 4.5 on the hardest benchmark.

A Chinese AI lab's new agent system generates optimized CUDA kernels, dramatically outperforming existing tools on hard benchmarks.

Deep Dive

A research team has unveiled CUDA Agent, a breakthrough AI system designed to automate and optimize the complex task of writing high-performance CUDA code for GPUs. This addresses a major bottleneck in deep learning, where crafting efficient GPU kernels traditionally requires deep hardware expertise. The system's core innovation is its agentic reinforcement learning (RL) architecture, which moves beyond existing methods like training-free refinement or simple execution-feedback loops. CUDA Agent establishes a new state-of-the-art, significantly outperforming industry-standard tools like PyTorch's torch.compile and advanced language models like Claude Opus 4.5 on the challenging KernelBench benchmark.

The technical achievement lies in CUDA Agent's three-component design: a pipeline for scalable synthetic data generation, a specialized development environment that verifies and profiles code reliably, and novel RL algorithms for stable training over long contexts. On the KernelBench evaluation, it delivered speedups of 100% over torch.compile on Level-1 and Level-2 tasks, and a 92% improvement on the most difficult Level-3 split. Furthermore, it demonstrated a 40% performance advantage over Anthropic's Claude Opus 4.5 on the hardest benchmarks. This represents a shift towards AI systems that can not only generate code but actively reason about and optimize for specific hardware constraints, potentially democratizing high-performance computing and accelerating model development cycles.
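The "verified development environment" component can be sketched in miniature: a candidate implementation is accepted only if it matches a reference numerically on randomized inputs, and only then is it profiled for a speedup measurement. Everything below (the toy op, function names, tolerances) is illustrative pure Python; the actual system compiles and profiles real CUDA kernels on a GPU.

```python
import random
import statistics
import time

def reference_op(xs):
    # Reference implementation the candidate must match.
    return [x * 2.0 + 1.0 for x in xs]

def candidate_op(xs):
    # Stand-in for an agent-generated kernel; must be numerically equivalent.
    return [2.0 * x + 1.0 for x in xs]

def verify(candidate, reference, trials=5, tol=1e-6):
    """Correctness gate: compare candidate vs. reference on random inputs."""
    for _ in range(trials):
        xs = [random.uniform(-10.0, 10.0) for _ in range(256)]
        got, want = candidate(xs), reference(xs)
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True

def profile(fn, xs, repeats=20):
    """Median wall-clock time; the median resists timer noise and outliers."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(xs)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

xs = [float(i) for i in range(4096)]
if verify(candidate_op, reference_op):
    speedup = profile(reference_op, xs) / profile(candidate_op, xs)
    print(f"verified; speedup vs reference: {speedup:.2f}x")
else:
    print("rejected: numerical mismatch")
```

Gating profiling behind verification matters for RL training: a reward signal based on raw speed alone would let the agent "win" with fast but incorrect kernels.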

Key Points
  • CUDA Agent outperformed torch.compile by 100%, 100%, and 92% on KernelBench Levels 1, 2, and 3, respectively.
  • The system uses a novel agentic RL framework with scalable data synthesis and a verified development environment.
  • It achieved a 40% performance advantage over Claude Opus 4.5 on the most difficult optimization tasks.
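A note on reading the figures: the announcement does not define "speedup of 100%", but under the common convention that an X% speedup means the baseline takes (1 + X/100) times as long as the optimized code, the arithmetic works out as follows. The function name and numbers here are purely illustrative.

```python
def speedup_percent(t_baseline, t_candidate):
    # X% speedup <=> baseline runtime is (1 + X/100) times the candidate's.
    return (t_baseline / t_candidate - 1.0) * 100.0

# Under that reading, a 100% speedup over torch.compile means 2x faster:
print(speedup_percent(2.0, 1.0))            # -> 100.0
# and a 92% speedup means the baseline takes 1.92x as long:
print(round(speedup_percent(1.92, 1.0), 2)) # -> 92.0
```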

Why It Matters

This automates a specialized engineering skill, potentially accelerating AI model training and deployment by generating faster, hardware-optimized code.