Luna-2: Scalable Single-Token Evaluation with Small Language Models
New architecture replaces expensive LLM judges with specialized small models, saving $30M annually.
A research team led by Vatsal Goel has introduced Luna-2, an architecture designed to replace slow and expensive LLM-as-a-judge (LLMAJ) systems for real-time AI guardrails. The system uses decoder-only small language models (SLMs) as a shared backbone, with hundreds of specialized metrics, such as toxicity detection, hallucination scoring, and tool selection quality, implemented as lightweight LoRA/PEFT adapter heads. Because each metric emits a single token rather than free-form text, evaluation is deterministic, and hundreds of metrics can run concurrently on a single GPU.
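The shared-backbone-plus-adapters idea can be sketched in toy form. The following is a minimal illustration under stated assumptions, not the paper's implementation: a single frozen projection stands in for the SLM backbone, the metric names and all dimensions are hypothetical, and each metric owns a low-rank LoRA delta plus a single-logit head that produces one deterministic score per forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 16, 2  # toy hidden size and LoRA rank (illustrative values)

# Shared backbone: one frozen projection standing in for the decoder stack.
W_backbone = rng.standard_normal((D, D)) / np.sqrt(D)

def make_adapter():
    """One metric = a low-rank LoRA pair plus a single-token logit head."""
    return {
        "A": rng.standard_normal((R, D)) * 0.01,      # LoRA down-projection
        "B": np.zeros((D, R)),                        # LoRA up-projection (zero init)
        "head": rng.standard_normal(D) / np.sqrt(D),  # single-logit scoring head
    }

# Hypothetical metric names; a production system would host hundreds.
metrics = {name: make_adapter() for name in ("toxicity", "hallucination", "tool_choice")}

def evaluate(hidden_state, adapter):
    """Backbone + adapter delta, then one logit: no multi-token generation."""
    h = (W_backbone + adapter["B"] @ adapter["A"]) @ hidden_state
    logit = adapter["head"] @ np.tanh(h)
    return 1.0 / (1.0 + np.exp(-logit))  # deterministic probability-like score

x = rng.standard_normal(D)  # stand-in for the last-token hidden state
scores = {name: evaluate(x, adapter) for name, adapter in metrics.items()}
```

Because every adapter shares the same frozen backbone weights, only the small `A`/`B`/`head` tensors differ per metric, which is what makes serving many metrics on one GPU cheap in the paper's design.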
Technically, Luna-2 achieves accuracy on par with or higher than state-of-the-art LLM-based evaluators while delivering dramatic efficiency gains: over 80x lower inference cost and over 20x lower latency. The architecture is already deployed at scale, processing over 100B tokens per month and protecting 100M+ AI sessions for customers, translating to annual cost savings exceeding $30M. The paper details the model architecture and training methodology, and presents empirical results across content safety and hallucination benchmarks.
This development addresses a critical bottleneck in production AI systems, where real-time evaluation must be accurate, cheap, and fast. Traditional methods rely on multi-token generation from large frontier models, which introduces operational non-determinism, high latency, and significant expense. Luna-2's approach makes sophisticated, privacy-preserving guardrails deployable locally alongside AI applications, optimizing for both latency and cost without sacrificing evaluation quality.
- Matches frontier LLM evaluator accuracy while cutting cost by 80x and latency by 20x
- Uses a shared SLM backbone with specialized LoRA heads for hundreds of concurrent metrics
- Already in production protecting 100M+ sessions, processing 100B tokens/month, saving $30M/year
Why It Matters
Enables affordable, fast, and local deployment of sophisticated AI safety guardrails for real-time applications.