Research & Papers

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

Same model, different endpoint: up to a 12.5-point accuracy gap and 10x tail-latency swings.

Deep Dive

Token Arena, a new benchmark from researchers Yuxuan Gao, Megan Wang, and Yi Ling Yu, moves beyond model-level comparisons to measure AI inference at the endpoint level, meaning the exact combination of provider, model, quantization, decoding strategy, region, and serving stack. Covering 78 endpoints that serve 12 model families, it tracks five core axes: output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint. These are combined into three headline metrics: joules per correct answer, dollars per correct answer, and endpoint fidelity (distribution similarity to a first-party reference). The findings are stark: across endpoints, the same model can differ by up to 12.5 accuracy points on math and code, tail latency varies by an order of magnitude, and modeled energy per correct answer differs by a factor of 6.2.
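To make the "per correct answer" metrics concrete, here is a minimal Python sketch that divides a run's total cost or modeled energy by the number of correct answers. The summary does not give the benchmark's exact formulas, so this division is an assumption; the `EvalRun` fields, the endpoint names, and every number are hypothetical and are not taken from the Token Arena leaderboard.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One evaluation run against a single endpoint (all fields hypothetical)."""
    endpoint: str
    n_questions: int
    n_correct: int
    total_cost_usd: float
    total_energy_joules: float  # modeled energy, not a measured value


def per_correct(total: float, n_correct: int) -> float:
    """Resource spent per correct answer; infinite if nothing was correct."""
    return total / n_correct if n_correct > 0 else math.inf


# Invented numbers for illustration only.
RUNS = [
    EvalRun("provider-a/fp16", 200, 160, 0.80, 48000.0),
    EvalRun("provider-b/int4", 200, 140, 0.56, 28000.0),
]

for run in RUNS:
    accuracy = run.n_correct / run.n_questions
    usd = per_correct(run.total_cost_usd, run.n_correct)
    joules = per_correct(run.total_energy_joules, run.n_correct)
    print(f"{run.endpoint}: accuracy={accuracy:.1%} "
          f"usd/correct={usd:.4f} joules/correct={joules:.0f}")
```

Dividing by correct answers rather than by queries means a cheaper endpoint only wins if its accuracy holds up: in this made-up data the second endpoint is 30% cheaper per query but only 20% cheaper per correct answer.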

Workload-aware pricing further reshuffles the rankings: 7 of the top 10 endpoints under a chat preset (3:1 input-to-output ratio) drop out of the top 10 under a retrieval-augmented generation preset (20:1), while a reasoning preset (1:5) elevates frontier closed models previously penalized on price. Token Arena is released as an open methodology with full provenance, schema, probe, eval harness, and a v1.0 leaderboard under CC BY 4.0. The results show that deployment decisions must consider endpoint specifics, not just the model name, and that energy efficiency and cost vary dramatically with workload type.
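The ranking shifts described above follow from how the input-to-output ratio weights the two per-token prices. The sketch below assumes the workload-blended price is a ratio-weighted average of input and output prices, which may differ from the benchmark's actual definition. Only the three preset ratios come from the text; the endpoint names and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    input_price: float   # USD per million input tokens (hypothetical)
    output_price: float  # USD per million output tokens (hypothetical)


# Input:output token ratios for the presets named in the text.
PRESETS = {"chat": (3, 1), "rag": (20, 1), "reasoning": (1, 5)}


def blended_price(ep: Endpoint, preset: str) -> float:
    """Ratio-weighted average price per million tokens (assumed formula)."""
    w_in, w_out = PRESETS[preset]
    return (w_in * ep.input_price + w_out * ep.output_price) / (w_in + w_out)


def rank(endpoints: list[Endpoint], preset: str) -> list[str]:
    """Endpoint names ordered from cheapest to most expensive for a preset."""
    ordered = sorted(endpoints, key=lambda ep: blended_price(ep, preset))
    return [ep.name for ep in ordered]


# Invented prices: one endpoint is cheap on input, the other on output.
ENDPOINTS = [
    Endpoint("cheap-input", 0.20, 4.00),
    Endpoint("cheap-output", 1.00, 1.50),
]

for preset in PRESETS:
    print(preset, rank(ENDPOINTS, preset))
```

With these made-up prices, the output-cheap endpoint ranks first under the chat and reasoning presets, while the input-heavy RAG preset puts the input-cheap endpoint first. That is the same kind of reordering the leaderboard reports, at a much smaller scale.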

Key Points
  • Same model on different endpoints shows up to 12.5-point accuracy variance in math and code tasks.
  • Modeled energy per correct answer varies by a factor of 6.2 across the 78 endpoints tested.
  • Workload-aware pricing reorders leaderboards: 7 of the top 10 chat endpoints fall out of the top 10 under the RAG preset.

Why It Matters

Enterprises can't rely on model-name rankings; endpoint choice dramatically affects cost, energy, and quality.