Research & Papers

Accuracy Is Speed: Towards Long-Context-Aware Routing for Distributed LLM Serving

A new metric, TTCA, shows that inaccurate LLM responses create hidden delays through retries.

Deep Dive

A team of researchers has published a paper arguing that traditional latency metrics are misleading for distributed large language model (LLM) serving, especially with long-context prompts. They introduce a new metric, Time-to-Correct-Answer (TTCA): the total wall-clock time a user experiences from sending a query to receiving the first correct response, including the hidden cost of retries triggered by inaccurate initial answers. Their measurement study found that prompt characteristics such as length and language drive large variance in accuracy, which in turn inflates TTCA: lower accuracy directly translates into slower effective performance.
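As a rough illustration (not the paper's formulation), TTCA for a single request could be computed from a per-attempt log like the sketch below. The Attempt record, the retry_gap_s parameter, and the correctness labels are illustrative assumptions, not artifacts from the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    latency_s: float  # wall-clock time of this attempt (queueing + generation)
    correct: bool     # whether the response was judged correct

def ttca(attempts: list[Attempt], retry_gap_s: float = 0.0) -> Optional[float]:
    """Cumulative wall-clock time from the first request until the first
    correct response, charging every failed attempt plus the gap before
    each retry. Returns None if no attempt ever succeeds."""
    elapsed = 0.0
    for i, attempt in enumerate(attempts):
        if i > 0:
            elapsed += retry_gap_s  # time the user spends before retrying
        elapsed += attempt.latency_s
        if attempt.correct:
            return elapsed
    return None
```

The key property is that a fast-but-wrong first response still costs the user its full latency plus a retry, which is why a latency-only view understates the delay the user actually experiences.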

To solve this, the researchers propose Lightweight Accuracy-Aware Routing (LAAR), a novel routing design for distributed LLM systems. Instead of routing requests only to the fastest or least busy server, LAAR adds a lightweight assessment of which model instance or server is most likely to produce an accurate answer given a prompt's characteristics. By prioritizing first-attempt accuracy, the system minimizes costly retries and thereby reduces user-perceived TTCA. The work, accepted to the EuroMLSys '26 workshop, challenges system designers to treat accuracy as a first-class performance objective alongside throughput and latency, since accuracy has become a key determinant of real-world speed in modern AI serving stacks.
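The summary does not specify LAAR's actual policy, but a minimal sketch of the general idea, scoring replicas by expected TTCA under an assumed independent-retry model, might look like this. Replica, est_latency_s, and predict_accuracy are hypothetical names, not LAAR's real interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Replica:
    name: str
    est_latency_s: float                       # queueing + service time estimate
    predict_accuracy: Callable[[dict], float]  # lightweight per-prompt accuracy model

def pick_replica(features: dict, replicas: list[Replica]) -> Replica:
    """Route to the replica minimizing *expected* TTCA: if attempts on a
    replica succeed independently with probability p, the expected number
    of attempts is 1/p, so expected TTCA is roughly latency / p."""
    def expected_ttca(r: Replica) -> float:
        p = max(r.predict_accuracy(features), 1e-3)  # clamp to avoid division by zero
        return r.est_latency_s / p
    return min(replicas, key=expected_ttca)
```

Under this toy model, a latency-only router would always pick a 1.0 s replica, but at 40% predicted accuracy its expected TTCA is 2.5 s, so a 1.5 s replica with 90% predicted accuracy (about 1.7 s expected) wins instead.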

Key Points
  • Introduces Time-to-Correct-Answer (TTCA), a new metric that captures the cumulative delay from inaccurate responses requiring retries.
  • Proposes Lightweight Accuracy-Aware Routing (LAAR), a capability-based system that routes prompts to optimize for first-attempt accuracy.
  • Argues that for long-context LLM serving, accuracy must be treated as a core systems objective because it directly determines user-visible speed.

Why It Matters

This research could lead to faster, more reliable AI APIs and reduce computational waste from retries, lowering costs for providers and users.