CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
New method adapts adversarial prompts to local contexts, revealing 2x more vulnerabilities than direct translation.
A team of researchers has published a paper, accepted at ICLR 2026, introducing CAGE (Culturally Adaptive Generation), a novel framework designed to address a critical blind spot in large language model (LLM) safety evaluation. The core problem is that existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture the nuanced socio-technical vulnerabilities rooted in local culture, law, and social norms. CAGE systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts, moving beyond testing for simple jailbreaks to modeling realistic, localized threats that models might actually encounter in deployment.
The technical innovation at the heart of CAGE is the 'Semantic Mold,' a method that disentangles a prompt's adversarial structure from its specific cultural content. This allows researchers to preserve the harmful intent of a prompt while re-grounding it in a new cultural context. As a proof of concept, the team created KoRSET, a Korean red-teaming safety evaluation benchmark. In their evaluation, CAGE-generated benchmarks proved significantly more effective at revealing model vulnerabilities compared to direct translation baselines, highlighting the inadequacy of current translation-based adaptation methods. The framework offers a scalable, systematic solution for developing meaningful, context-aware safety evaluations across diverse global cultures, and the team has made their dataset and evaluation rubrics publicly available to advance the field of AI safety.
- Introduces the 'Semantic Mold' to separate adversarial structure from cultural content in prompts.
- Demonstrated with KoRSET, a Korean benchmark that outperforms direct translation baselines.
- Framework and datasets are publicly released to enable culturally-aware LLM safety testing globally.
Why It Matters
Ensures AI safety evaluations are globally relevant, catching localized harmful outputs that simple translations miss.