CLARC: C/C++ Benchmark for Robust Code Search
New C/C++ benchmark shows models fail when identifiers are anonymized or code is compiled to Assembly.
A research team led by Kaicheng Wang has introduced CLARC, a benchmark designed to rigorously test AI models on C/C++ code search. Accepted at ICLR 2026, CLARC addresses a critical gap in AI evaluation: most existing benchmarks focus on Python, while this dataset tests models on lower-level languages where understanding code semantics is harder. The benchmark was built with an automated pipeline that first verifies that code snippets drawn from real GitHub repositories actually compile, then categorizes them by dependency complexity. What makes CLARC particularly valuable is its focus on robustness: it goes beyond basic code retrieval and introduces challenging conditions that expose fundamental weaknesses in current AI approaches.
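The dependency-categorization step the pipeline performs can be approximated in spirit with a lightweight static check. The sketch below is a naive regex heuristic, not the authors' actual pipeline; the `C_BUILTINS` filter list and the `categorize` labels are illustrative assumptions. It flags snippets that call functions they do not themselves define:

```python
import re

# Names never counted as external dependencies: C keywords plus a few
# libc staples (illustrative list, not CLARC's actual filter).
C_BUILTINS = {
    "int", "char", "void", "float", "double", "long", "unsigned",
    "return", "for", "while", "if", "else", "switch", "sizeof",
    "struct", "const", "static", "printf", "malloc", "free",
}

def external_references(code: str) -> set:
    """Collect called function names that the snippet does not define
    itself -- a naive proxy for dependency complexity."""
    called = set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", code))
    # A name at the head of a function definition counts as defined.
    defined = set(re.findall(
        r"^\s*\w[\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*\)\s*\{",
        code, flags=re.MULTILINE))
    return called - defined - C_BUILTINS

def categorize(code: str) -> str:
    """Label a snippet by whether it leans on outside helpers."""
    return "dependent" if external_references(code) else "self-contained"

print(categorize("int add(int a, int b) { return a + b; }"))    # self-contained
print(categorize("int f(int x) { return helper(x) + 1; }"))     # dependent
```

A real pipeline would resolve symbols with a compiler front end rather than regexes, but the heuristic conveys the distinction CLARC draws between self-contained snippets and those depending on custom types or helper functions.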
The benchmark's most revealing tests involve identifier anonymization (replacing meaningful variable and function names with generic placeholders) and compilation to low-level representations such as Assembly and WebAssembly. Six state-of-the-art models evaluated under these conditions showed sharp performance drops, sometimes exceeding 50% in retrieval accuracy. This suggests that current code-search models rely heavily on superficial lexical features rather than genuinely understanding code semantics, dependencies, and logic. The automated pipeline enables systematic testing of how well models handle code that depends on custom-defined types or helper functions, giving a more realistic assessment of their fitness for professional software engineering tasks.
- CLARC contains 1,245 query-code pairs for evaluation and 5,472 for training, all from real GitHub C/C++ repositories
- Models showed 50%+ performance drops when identifiers were anonymized or code was compiled to Assembly/WebAssembly
- Benchmark reveals AI's reliance on lexical cues rather than semantic understanding of code logic and dependencies
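The identifier anonymization behind those performance drops can be illustrated with a minimal sketch. This is an assumed regex-based rewriting, not CLARC's published implementation; the `PRESERVE` keyword list and the `v0, v1, ...` placeholder scheme are hypothetical choices:

```python
import re

# Keywords and library names to leave intact (illustrative subset).
PRESERVE = {
    "int", "char", "void", "float", "double", "long", "unsigned",
    "return", "for", "while", "if", "else", "struct", "const",
    "sizeof", "static", "include", "stdio", "printf", "main",
}

def anonymize_identifiers(code: str) -> str:
    """Replace each distinct user-defined identifier with a generic
    placeholder (v0, v1, ...), preserving keywords and reusing the
    same placeholder for repeated occurrences of a name."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name in PRESERVE:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", repl, code)

print(anonymize_identifiers("int compute_sum(int *values, int count);"))
# int v0(int *v1, int v2);
```

The transformation preserves the program's structure and semantics exactly, so a model that truly understood the code's logic would be largely unaffected; the 50%+ drops indicate the models were leaning on the stripped-out names.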
Why It Matters
Exposes fundamental weaknesses in AI code assistants, pushing development toward true semantic understanding needed for professional software engineering.