Gemini 3 Pro dominates reasoning, Claude Opus 4.5 excels at coding

Macaron May 15, 2026

⚡ Google's model leads on PhD-level tests; Anthropic's leads on code generation.

Deep Dive

In late 2025, Anthropic, OpenAI, and Google DeepMind released their flagship LLMs: Claude Opus 4.5, GPT-5.1 (the series behind ChatGPT 5.1), and Gemini 3 Pro. On standard knowledge benchmarks such as MMLU, all three score near the human-expert level (~90%). On ultra-hard reasoning exams, however, Gemini 3 Pro leads by a wide margin: it scores 37.5% on Humanity's Last Exam (vs. 26.8% for GPT-5.1 and ~13.7% for the prior Claude model) and reaches 31-45% on ARC-AGI, far ahead of its competitors. This suggests that Google's model has stronger planning and problem-solving abilities on questions written to challenge PhD-level experts.

Claude Opus 4.5 finds its strength in code generation, agents, and computer use, where Anthropic claims it is the best in the world. GPT-5.1 offers a balance with two modes (Instant and Thinking) to trade speed for depth. All three support context windows above 100K tokens and have similar API latency, but pricing varies. For professionals, the choice now hinges on the specific task: extreme reasoning favors Gemini, coding and autonomy favor Claude, and broad knowledge tasks favor GPT-5.1.
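The task-based choice described above can be written down as a simple lookup. The following is only a minimal sketch: the `Task` categories, the `RECOMMENDED_MODEL` table, and the `pick_model` helper are hypothetical names introduced for illustration, and the strings are the display names used in this article, not API model identifiers.

```python
from enum import Enum


class Task(Enum):
    """Hypothetical task categories, mirroring the article's three-way split."""
    FRONTIER_REASONING = "frontier_reasoning"
    CODING_AND_AGENTS = "coding_and_agents"
    GENERAL_KNOWLEDGE = "general_knowledge"


# Assumption: display names taken from the article, not real API model IDs.
RECOMMENDED_MODEL = {
    Task.FRONTIER_REASONING: "Gemini 3 Pro",
    Task.CODING_AND_AGENTS: "Claude Opus 4.5",
    Task.GENERAL_KNOWLEDGE: "GPT-5.1",
}


def pick_model(task: Task) -> str:
    """Return the model the article favors for the given task category."""
    return RECOMMENDED_MODEL[task]


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value}: {pick_model(task)}")
```

In practice, pricing and latency would likely enter the decision too; the table only encodes the benchmark-driven preference stated in the article.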

Key Points

Google Gemini 3 Pro leads on the hardest reasoning benchmarks: 37.5% on Humanity's Last Exam, 31-45% on ARC-AGI.
OpenAI GPT-5.1 scores ~91% on MMLU; it is competitive on knowledge but trails Gemini on frontier reasoning.
Anthropic Claude Opus 4.5 targets coding and agents, outperforming rivals on software and tool-use benchmarks.

Why It Matters

Choosing an LLM now depends on the task: reasoning (Gemini), coding (Claude), or general knowledge (GPT-5.1).
