LLM Leaderboard 2026 — Compare 228 AI Models | BenchLM.ai
New leaderboard puts Claude Mythos Preview on top with a 99 overall score, followed by Gemini 3.1 Pro at 92
BenchLM.ai's 2026 LLM Leaderboard provides an exhaustive comparison of 228 large language models across 186 benchmarks, covering quality, cost, context, and runtime performance. It features 115 provisionally ranked and 23 verified models, with scores backed by exact-source coverage where possible. The leaderboard includes real-time metrics such as time-to-first-token (TTFT) and tokens per second (TPS), plus pricing data averaging $0.30 per million output tokens. Use-case filters make it easy to explore models for agentic coding, multimodal reasoning, long context, tool use, web research, and more. Top performers include Anthropic's Claude Mythos Preview (99 overall score), Google's Gemini 3.1 Pro (92), and OpenAI's GPT-5.5 (91), with DeepSeek V4 Pro Max ranking highest among open-weight models at 88.
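To make the two speed metrics concrete, here is a minimal sketch of how TTFT and TPS are commonly measured against a streaming API. The `stream_tokens` iterator is a hypothetical stand-in for any client that yields tokens as they arrive; this is not BenchLM.ai's published measurement harness.

```python
import time

def measure_stream(stream_tokens):
    """Compute TTFT and TPS for a token iterator.

    `stream_tokens` is a hypothetical generator yielding tokens
    as a model API streams them back.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token boundary
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at else float("nan")
    # TPS is usually quoted over the generation phase: tokens after the
    # first, divided by the time elapsed since the first token arrived.
    tps = (count - 1) / (end - first_token_at) if count > 1 else float("nan")
    return ttft, tps
```

Reporting the two numbers separately matters because a model can stream quickly yet take a long time to produce its first token, and vice versa.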
The platform also offers decision-ready picks, model comparisons (e.g., Claude Opus 4.7 vs Gemini 3.1 Pro), and data export in CSV, JSON, or embed formats. Pricing trends show a 94% drop since 2023. The leaderboard distinguishes between provisional and verified rankings, helping users assess score confidence. On speed, Mercury 2 leads with 789 tokens/sec, while LFM2-24B-A2B offers the lowest latency at 0.42s TTFT. NVIDIA's Nemotron 3 Ultra 500B boasts the largest context window at 10M tokens. This tool is essential for enterprises and developers evaluating frontier models such as GPT-5, Claude, Gemini, and Llama for real-world deployment decisions.
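As a sketch of how the CSV export might feed a deployment decision, the snippet below filters and sorts exported rows. The file name and column names (`model`, `overall_score`, `output_price_per_m`) are assumptions for illustration; the real export schema may differ.

```python
import csv

# Load a hypothetical leaderboard export; column names are placeholders.
with open("benchlm_leaderboard.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Shortlist: models scoring at least 90 overall, cheapest output price first.
shortlist = [r for r in rows if float(r["overall_score"]) >= 90]
shortlist.sort(key=lambda r: float(r["output_price_per_m"]))
for r in shortlist:
    print(r["model"], r["overall_score"], r["output_price_per_m"])
```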
- 228 models tracked with 115 provisional and 23 verified rankings across 186 benchmarks
- Includes real-time pricing ($0.30 avg per million output tokens) and speed metrics (TTFT, TPS); a worked cost example follows this list
- Claude Mythos Preview leads overall with a 99 score; Gemini 3.1 Pro (92) and GPT-5.5 (91) follow
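As a worked example of the pricing unit above (the workload size is hypothetical, not a BenchLM.ai figure):

```python
# At the reported average of $0.30 per million output tokens,
# a job that emits 2.5M output tokens costs:
avg_price_per_m = 0.30            # USD per 1,000,000 output tokens
output_tokens = 2_500_000         # hypothetical workload
cost = output_tokens / 1_000_000 * avg_price_per_m
print(f"${cost:.2f}")             # -> $0.75
```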
Why It Matters
Empowers developers and enterprises to compare LLMs by performance, cost, and speed for informed AI deployment.