Luna-2: Scalable Single-Token Evaluation with Small Language Models
New architecture replaces expensive LLM judges with specialized small models, saving $30M annually.
A research team led by Vatsal Goel has introduced Luna-2, an architecture designed to replace slow and expensive LLM-as-a-judge (LLMAJ) systems for real-time AI guardrails. The system uses decoder-only small language models (SLMs) as a shared backbone, with hundreds of specialized metrics, such as toxicity detection, hallucination scoring, and tool selection quality, implemented as lightweight LoRA/PEFT adapter heads. Because each metric emits a single token rather than free-form text, evaluation is deterministic, and hundreds of metrics can run concurrently on a single GPU.
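The shared-backbone-plus-adapters idea can be sketched in toy form. The following is a minimal illustration under stated assumptions, not the paper's implementation: a single frozen projection stands in for the SLM backbone, the metric names and all dimensions are hypothetical, and each metric owns a low-rank LoRA delta plus a single-logit head that produces one deterministic score per forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 16, 2  # toy hidden size and LoRA rank (illustrative values)

# Shared backbone: one frozen projection standing in for the decoder stack.
W_backbone = rng.standard_normal((D, D)) / np.sqrt(D)

def make_adapter():
    """One metric = a low-rank LoRA pair plus a single-token logit head."""
    return {
        "A": rng.standard_normal((R, D)) * 0.01,      # LoRA down-projection
        "B": np.zeros((D, R)),                        # LoRA up-projection (zero init)
        "head": rng.standard_normal(D) / np.sqrt(D),  # single-logit scoring head
    }

# Hypothetical metric names; a production system would host hundreds.
metrics = {name: make_adapter() for name in ("toxicity", "hallucination", "tool_choice")}

def evaluate(hidden_state, adapter):
    """Backbone + adapter delta, then one logit: no multi-token generation."""
    h = (W_backbone + adapter["B"] @ adapter["A"]) @ hidden_state
    logit = adapter["head"] @ np.tanh(h)
    return 1.0 / (1.0 + np.exp(-logit))  # deterministic probability-like score

x = rng.standard_normal(D)  # stand-in for the last-token hidden state
scores = {name: evaluate(x, adapter) for name, adapter in metrics.items()}
```

Because every adapter shares the same frozen backbone weights, only the small `A`/`B`/`head` tensors differ per metric, which is what makes serving many metrics on one GPU cheap in the paper's design.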
Technically, Luna-2 achieves accuracy on par with or higher than state-of-the-art LLM-based evaluators while delivering dramatic efficiency gains: over 80x lower inference cost and over 20x lower latency. The architecture is already deployed at scale, processing over 100B tokens per month and protecting 100M+ AI sessions for customers, translating to annual cost savings exceeding $30M. The paper details the model architecture and training methodology, and presents empirical results across content safety and hallucination benchmarks.
This development addresses a critical bottleneck in production AI systems, where real-time evaluation must be accurate, cheap, and fast. Traditional methods rely on multi-token generation from large frontier models, which introduces operational non-determinism, high latency, and significant expense. Luna-2's approach makes sophisticated, privacy-preserving guardrails deployable locally alongside AI applications, optimizing for both latency and cost without sacrificing evaluation quality.
- Matches frontier LLM evaluator accuracy while cutting cost by 80x and latency by 20x
- Uses a shared SLM backbone with specialized LoRA heads for hundreds of concurrent metrics
- Already in production protecting 100M+ sessions, processing 100B tokens/month, saving $30M/year
Why It Matters
Enables affordable, fast, and local deployment of sophisticated AI safety guardrails for real-time applications.