Research & Papers

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

New benchmark tests 21 LLMs on military ethics; most fail to follow doctrinal rules.

Deep Dive

Large language models are increasingly considered for defense applications, but existing safety benchmarks evaluate only civilian social risks. To address this gap, researchers at Virginia Tech developed ARMOR 2025, a benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. They extracted doctrinal text and generated multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized around the Observe-Orient-Decide-Act (OODA) decision-making framework, enabling systematic testing of accuracy and refusal behavior across military-relevant decision types. It comprises a structured 12-category taxonomy, 519 doctrinally grounded prompts, and a rigorous evaluation procedure.
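
To make the evaluation concrete, here is a minimal sketch of what a doctrinally grounded multiple-choice item and its scoring could look like. The schema, field names, and refusal heuristic are illustrative assumptions for exposition, not ARMOR 2025's published format.

```python
from dataclasses import dataclass

# Hypothetical item schema; field names are illustrative, not ARMOR 2025's actual format.
@dataclass
class ArmorItem:
    doctrine: str            # "Law of War", "Rules of Engagement", or "Joint Ethics Regulation"
    ooda_stage: str          # "Observe", "Orient", "Decide", or "Act"
    category: str            # one of the 12 taxonomy categories
    question: str            # multiple-choice prompt derived from doctrinal text
    choices: dict[str, str]  # option letter -> option text
    answer: str              # correct option letter

# Naive keyword heuristic for detecting refusals; a real evaluation would be more careful.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "unable to assist")

def score_response(item: ArmorItem, response: str) -> str:
    """Classify a model response as 'correct', 'incorrect', or 'refusal'."""
    text = response.strip().lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refusal"
    # Treat the first characters of the response as the chosen option letter.
    return "correct" if text.startswith(item.answer.lower()) else "incorrect"
```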

Applied to 21 commercial LLMs, ARMOR 2025 revealed critical safety gaps: models often failed to comply with legal and ethical rules, showing that current safety training is inadequate for defense contexts. The findings point to an urgent need for domain-specific alignment methods so that LLMs can provide reliable, legally compliant decision support in military operations. As AI deployment in defense accelerates, benchmarks like ARMOR 2025 are essential for evaluating and improving model behavior in high-stakes scenarios.
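
Surfacing such gaps comes down to per-model, per-category accuracy and refusal rates. The aggregation sketch below assumes the `score_response` outcomes from the previous example; it is a plausible reconstruction, not the paper's actual analysis code.

```python
from collections import defaultdict

def aggregate(results: list[tuple[str, str, str]]) -> dict:
    """Summarize (model, category, outcome) triples into accuracy and refusal rates.

    `outcome` is 'correct', 'incorrect', or 'refusal' as produced by score_response.
    """
    counts = defaultdict(lambda: {"correct": 0, "incorrect": 0, "refusal": 0})
    for model, category, outcome in results:
        counts[(model, category)][outcome] += 1

    report = {}
    for key, c in counts.items():
        total = sum(c.values())
        report[key] = {
            "accuracy": c["correct"] / total,
            "refusal_rate": c["refusal"] / total,
        }
    return report
```

Low accuracy combined with a low refusal rate would be the worst case here: the model answers confidently but against doctrine, rather than declining.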

Key Points
  • Grounded in three core military doctrines: Law of War, Rules of Engagement, Joint Ethics Regulation
  • Uses 519 prompts across 12 categories, tested on 21 commercial LLMs
  • Reveals critical safety gaps: models fail to follow legal and ethical rules for military operations

Why It Matters

Essential for evaluating AI safety in high-stakes defense scenarios that civilian benchmarks do not cover.