Models & Releases

Comparison of AI Models across Intelligence, Performance, and Price

Independent benchmark of 364 models reveals speed and cost leaders across all tiers.

Deep Dive

Artificial Analysis published an extensive benchmark of 364 AI models spanning intelligence, speed, latency, price, and context windows. The new Intelligence Index v4.0 aggregates 10 evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, and GPQA Diamond. GPT-5.5 in its highest (xhigh) configuration takes the top intelligence spot, followed closely by Claude Opus 4.7 (max) and Gemini 3.1 Pro Preview. The analysis distinguishes between proprietary and open-weight models, and separately highlights reasoning models.
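The article does not specify how the Intelligence Index combines its 10 evaluations; as a minimal sketch, assuming an unweighted mean of scores on a common 0-100 scale (the scores below are illustrative, not real results):

```python
def intelligence_index(scores: dict[str, float]) -> float:
    """Aggregate per-evaluation scores into one index.

    Assumption: a simple unweighted mean of scores already
    normalized to a 0-100 scale. The actual Intelligence Index
    v4.0 weighting is not described in this article.
    """
    return sum(scores.values()) / len(scores)

# Illustrative (made-up) scores for four of the ten evaluations
evals = {
    "GDPval-AA": 72.0,
    "Terminal-Bench Hard": 55.0,
    "SciCode": 48.0,
    "GPQA Diamond": 81.0,
}
intelligence_index(evals)  # mean of the four illustrative scores: 64.0
```

A weighted mean would follow the same shape, with per-evaluation weights replacing the implicit 1/n factor.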

On performance, Mercury 2 dominates output speed at 859 tokens/s, followed by Granite 4.0 H Small at 461 tokens/s. Latency leaders are NVIDIA Nemotron 3 Nano (0.40s) and Ministral 3 3B (0.47s). On cost efficiency, Qwen3.5 0.8B leads at just $0.02 per million tokens. Llama 4 Scout tops context window size at 10 million tokens, ahead of Grok 4.20 0309 v2 at 2 million. The benchmark also includes a hallucination index (AA-Omniscience) that rewards accuracy and penalizes incorrect answers.

Key Points
  • GPT-5.5 (xhigh) and GPT-5.5 (high) top the Artificial Analysis Intelligence Index v4.0 among 364 models.
  • Mercury 2 delivers fastest output at 859 tokens/s; Qwen3.5 0.8B is cheapest at $0.02/M tokens.
  • Llama 4 Scout offers largest context window at 10 million tokens; NVIDIA Nemotron 3 Nano has lowest latency (0.40s).

Why It Matters

This independent benchmark helps teams pick the right model for intelligence, speed, or budget without vendor bias.