Developer Tools

CodegenBench: LLMs fail on niche CPU architectures like Sunway and Kunpeng

New benchmark finds LLM-generated code can run 2–5x slower than native implementations on Sunway and Kunpeng

Deep Dive

A team led by Jie Li at Sun Yat-sen University has released CodegenBench, a benchmark that tests whether large language models can write efficient parallel code across diverse CPU architectures, not just the usual x86_64 or GPU targets. It comprises 106 standard BLAS routines (the fundamental building blocks of linear algebra) plus 20 specialized computational kernels adapted for two Chinese-developed supercomputing platforms, Sunway and Kunpeng. The authors evaluated several state-of-the-art LLMs on tasks ranging from simple vector operations to complex matrix multiplications, measuring both the correctness and the runtime performance of the generated code.
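To make "measuring both correctness and runtime performance" concrete, here is a minimal sketch of what such a check could look like for a single BLAS level-1 routine (AXPY, y := alpha * x + y). It is not CodegenBench's actual harness: the function names, tolerances, and best-of-five timing are illustrative assumptions, and `candidate_axpy` merely stands in for model-generated code.

```python
import math
import time


def reference_axpy(alpha, x, y):
    """Reference result for y := alpha * x + y (BLAS level-1 AXPY)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def candidate_axpy(alpha, x, y):
    """Hypothetical stand-in for model-generated code: an explicit loop."""
    out = list(y)
    for i in range(len(x)):
        out[i] += alpha * x[i]
    return out


def is_correct(got, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Element-wise comparison within a floating-point tolerance (assumed values)."""
    return len(got) == len(expected) and all(
        math.isclose(g, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, e in zip(got, expected)
    )


def best_time(fn, *args, repeats=5):
    """Best-of-N wall-clock time in seconds, to dampen timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate(candidate, reference, *args):
    """Return (correct, slowdown); slowdown > 1 means slower than the reference."""
    correct = is_correct(candidate(*args), reference(*args))
    slowdown = best_time(candidate, *args) / best_time(reference, *args)
    return correct, slowdown


x = [float(i) for i in range(10_000)]
y = [1.0] * 10_000
correct, slowdown = evaluate(candidate_axpy, reference_axpy, 2.0, x, y)
print(f"correct={correct}, slowdown={slowdown:.2f}x")
```

A real evaluation would compile the generated kernel for the target architecture and compare it against a vendor or hand-tuned library; the structure (verify first, then report a time ratio) is the part this sketch is meant to show.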

The results reveal a sharp divide. On x86_64, modern LLMs can generate near-optimal code that often matches hand-tuned libraries. On Sunway and Kunpeng, for which public documentation and training data are sparse, performance degrades significantly, with some models producing code that runs 2–5x slower than native implementations. The study also found that LLMs are most effective on moderately difficult problems that call for concise code (under 50 lines) and struggle with long, complex kernels that demand deep architectural knowledge. The dataset and evaluation infrastructure are open-sourced to support future research in LLM-driven HPC code generation.
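A slowdown factor is easy to misread as a percentage, so a short worked conversion may help. The sketch below takes "Nx slower" to mean the generated code's runtime is N times the native runtime, which is an assumption about how the figure is defined; the helper names are illustrative.

```python
def relative_throughput(slowdown):
    """Fraction of native throughput achieved at a given slowdown factor."""
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    return 1.0 / slowdown


def extra_runtime_percent(slowdown):
    """Additional runtime relative to native, as a percentage."""
    return (slowdown - 1.0) * 100.0


for s in (2.0, 5.0):
    print(
        f"{s:.0f}x slower -> {relative_throughput(s):.0%} of native throughput, "
        f"+{extra_runtime_percent(s):.0f}% runtime"
    )
# prints:
# 2x slower -> 50% of native throughput, +100% runtime
# 5x slower -> 20% of native throughput, +400% runtime
```

Under that reading, the reported 2–5x range corresponds to generated code reaching between half and one fifth of native throughput.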

Key Points
  • CodegenBench spans x86_64, Sunway, and Kunpeng, with 106 BLAS routines plus 20 specialized kernels adapted for the latter two
  • On Sunway and Kunpeng, where public documentation and training data are sparse, some models produce code that runs 2–5x slower than native implementations
  • Models perform best on moderately difficult tasks that need concise code (under 50 lines) and struggle with long kernels that demand deep architectural knowledge

Why It Matters

As HPC becomes more heterogeneous, LLMs must generalize beyond mainstream hardware to remain useful for scientific computing.