SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
Researchers unveil first standardized safety test for Arabic AI, revealing major gaps in popular models like Jais 2.
A research team including Omar Abdelnasser and Mohammed Fouda has published SalamahBench, a pioneering benchmark designed to systematically evaluate the safety of Arabic Language Models (ALMs). The work addresses a critical gap in AI safety evaluation, as existing benchmarks and safeguard models are predominantly English-centric, leaving Arabic NLP systems without standardized testing. SalamahBench comprises 8,170 carefully curated prompts across 12 distinct safety categories aligned with the MLCommons Safety Hazard Taxonomy, created through a rigorous pipeline of AI filtering and multi-stage human verification. This enables category-aware safety evaluation, moving beyond aggregate scores to identify specific harm domains where models fail.
The researchers used SalamahBench to evaluate five leading ALMs: Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, testing them under multiple safeguard configurations including individual guard models and majority-vote aggregation. Results revealed significant disparities in safety alignment. Fanar 2 achieved the lowest aggregate attack success rate but showed uneven robustness across specific harm categories. In contrast, Jais 2 consistently exhibited elevated vulnerability, indicating weaker intrinsic safety alignment. The study also demonstrated that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges, highlighting the necessity for specialized safety mechanisms tailored to Arabic linguistic and cultural contexts.
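The evaluation described above rests on two simple quantities: a per-prompt safety verdict aggregated by majority vote across several guard models, and a per-category attack success rate (ASR), i.e. the fraction of prompts in a harm category whose responses were judged unsafe. The sketch below illustrates that logic in Python; the function names, data layout, and tie-breaking rule are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def majority_unsafe(verdicts):
    """True if a strict majority of guard models flag a response as unsafe.

    `verdicts` is a list of booleans, one per guard model (True = unsafe).
    The strict-majority tie-breaking rule is an assumption for illustration.
    """
    return sum(verdicts) > len(verdicts) / 2

def attack_success_rates(results):
    """Compute per-category attack success rate (ASR).

    `results` is a list of (category, verdicts) pairs, one per prompt,
    where `verdicts` holds the guard models' unsafe flags for the target
    model's response. ASR for a category is the fraction of its prompts
    whose response was judged unsafe -- i.e. the attack succeeded.
    """
    successes, totals = Counter(), Counter()
    for category, verdicts in results:
        totals[category] += 1
        if majority_unsafe(verdicts):
            successes[category] += 1
    return {cat: successes[cat] / totals[cat] for cat in totals}

# Toy example: two hypothetical harm categories, three guard models per prompt.
results = [
    ("violent_crimes", [True, True, False]),    # majority unsafe -> success
    ("violent_crimes", [False, False, False]),  # safe response
    ("hate", [True, False, False]),             # minority vote -> safe
    ("hate", [True, True, True]),               # unanimous unsafe -> success
]
print(attack_success_rates(results))  # {'violent_crimes': 0.5, 'hate': 0.5}
```

Reporting the per-category dictionary rather than a single pooled score is what enables the category-aware comparison in the study, where a model with a low aggregate ASR (like Fanar 2) can still show uneven robustness across individual harm domains.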
- First standardized safety benchmark for Arabic AI with 8,170 prompts across 12 harm categories
- Tested 5 major models: Fanar 2 performed best overall, Jais 2 showed highest vulnerability
- Reveals native Arabic models perform worse than dedicated safeguards as safety judges
Why It Matters
Enables safer deployment of Arabic AI by identifying specific safety gaps, crucial for enterprise and government adoption.