Research & Papers

Luna-2: Scalable Single-Token Evaluation with Small Language Models

New architecture replaces expensive LLM judges with specialized small models, saving $30M annually.

Deep Dive

A research team led by Vatsal Goel has introduced Luna-2, a new architecture designed to replace slow and expensive LLM-as-a-judge (LLMAJ) systems for real-time AI guardrails. The system uses a decoder-only small language model (SLM) as a shared backbone, with hundreds of specialized metrics—toxicity detection, hallucination scoring, tool selection quality, and more—implemented as lightweight LoRA/PEFT adapter heads. Because each metric is scored by reading a single output token rather than generating free-form text, evaluation is deterministic, and hundreds of adapters can run concurrently on a single GPU.
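The pattern described above—one frozen backbone shared by many low-rank adapter heads, each emitting a single evaluation score—can be sketched as follows. This is a toy illustration of the idea, not Luna-2's actual implementation: the backbone stand-in, adapter shapes, metric names, and scoring function are all assumptions for demonstration.

```python
import numpy as np

# Toy sketch of shared-backbone + per-metric LoRA heads (illustrative only).
# In the real system the backbone is a decoder-only SLM and each adapter is
# a trained PEFT module; here both are replaced by small numpy stand-ins.

rng = np.random.default_rng(0)
HIDDEN, RANK = 64, 4          # toy dimensions, not the real model's

W_head = rng.normal(size=(HIDDEN,)) * 0.1  # frozen shared output head

def make_adapter():
    """Low-rank LoRA delta (B @ A); one per metric, trained independently."""
    A = rng.normal(size=(RANK, HIDDEN)) * 0.1
    B = rng.normal(size=(RANK,)) * 0.1
    return A, B

# Hypothetical metric names mirroring those mentioned in the article.
adapters = {name: make_adapter()
            for name in ("toxicity", "hallucination", "tool_selection")}

def backbone(text: str) -> np.ndarray:
    """Stand-in for the shared SLM: a deterministic hidden state per input."""
    h = np.zeros(HIDDEN)
    for i, ch in enumerate(text.encode()):
        h[i % HIDDEN] += ch / 255.0
    return h / max(len(text), 1)

def evaluate(text: str) -> dict[str, float]:
    """One backbone pass; every metric reads a single logit from its head."""
    h = backbone(text)
    scores = {}
    for name, (A, B) in adapters.items():
        logit = (W_head + B @ A) @ h                  # base head + LoRA delta
        scores[name] = 1.0 / (1.0 + np.exp(-logit))   # single-token probability
    return scores

print(evaluate("The moon is made of cheese."))
```

The key property the sketch captures is that adding a metric costs only a small adapter, not another model: all metrics reuse the same forward pass through the backbone, which is what makes concurrent evaluation on a single GPU plausible.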

Technically, Luna-2 achieves accuracy on par with or higher than state-of-the-art LLM-based evaluators while delivering dramatic efficiency gains: over 80x lower inference cost and over 20x lower latency. The architecture is already deployed at scale, processing over 100B tokens per month and protecting 100M+ AI sessions for customers, translating to annual cost savings exceeding $30M. The paper outlines the model architecture and training methodology, and presents empirical results across content safety and hallucination benchmarks.

This development addresses a critical bottleneck in production AI systems, where real-time evaluation must be accurate, cheap, and fast at once. Traditional methods rely on multi-token generation from large frontier models, which introduces non-determinism, high latency, and significant expense. Luna-2's approach makes sophisticated, privacy-preserving guardrails deployable locally alongside AI applications, optimizing for both latency and cost without sacrificing evaluation quality.

Key Points
  • Matches frontier LLM evaluator accuracy while cutting inference cost by over 80x and latency by over 20x
  • Uses a shared SLM backbone with specialized LoRA heads for hundreds of concurrent metrics
  • Already in production protecting 100M+ sessions, processing 100B tokens/month, saving $30M/year

Why It Matters

Enables affordable, fast, and local deployment of sophisticated AI safety guardrails for real-time applications.