Gemini 3 Pro dominates reasoning, Claude Opus 4.5 excels at coding
Google's model crushes PhD-level tests; Anthropic's wins code generation.
In late 2025, Anthropic, OpenAI, and Google DeepMind released their flagship LLMs: Claude Opus 4.5, ChatGPT 5.1 (the GPT-5.1 series), and Gemini 3 Pro. On standard knowledge benchmarks like MMLU, all three score near the human-expert level (~90%). On ultra-hard reasoning exams, however, Gemini 3 Pro dominates: it scores 37.5% on Humanity's Last Exam (vs. 26.8% for GPT-5.1 and ~13.7% for a prior Claude model) and reaches 31-45% on ARC-AGI, far ahead of its competitors. These results suggest that Google's model has stronger planning and problem-solving abilities on PhD-level questions.
Claude Opus 4.5 finds its strength in code generation, agents, and computer use, where Anthropic claims it is the best in the world. GPT-5.1 strikes a balance, offering two modes (Instant and Thinking) that trade speed for depth. All three support context windows above 100K tokens and have similar API latency, but pricing varies. For professionals, the choice now hinges on the specific task: extreme reasoning favors Gemini, coding and autonomy favor Claude, and broad knowledge tasks favor GPT-5.1.
- Google Gemini 3 Pro leads on the hardest reasoning benchmarks: 37.5% on Humanity's Last Exam, 31-45% on ARC-AGI.
- OpenAI GPT-5.1 scores ~91% on MMLU: competitive on knowledge, but it trails Gemini on frontier reasoning.
- Anthropic Claude Opus 4.5 targets coding and agents, outperforming rivals in software and tool-use benchmarks.
Why It Matters
Choosing an LLM now depends on the task: reasoning (Gemini), coding (Claude), or general knowledge (GPT-5.1).
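For teams that want to encode this rule of thumb, the task-to-model mapping can be written as a simple lookup. The sketch below is illustrative only: the `Task` categories and the `pick_model` helper are hypothetical names, and the strings are the product names used in this article, not API model identifiers.

```python
from enum import Enum


class Task(Enum):
    """Hypothetical task categories, mirroring the article's three buckets."""
    FRONTIER_REASONING = "frontier_reasoning"
    CODING_AND_AGENTS = "coding_and_agents"
    GENERAL_KNOWLEDGE = "general_knowledge"


# Product names as used in the article (assumption: these are display
# labels only, NOT the identifiers any vendor API expects).
RECOMMENDED_MODEL = {
    Task.FRONTIER_REASONING: "Gemini 3 Pro",
    Task.CODING_AND_AGENTS: "Claude Opus 4.5",
    Task.GENERAL_KNOWLEDGE: "GPT-5.1",
}


def pick_model(task: Task) -> str:
    """Return the article's suggested model for a given task category."""
    return RECOMMENDED_MODEL[task]


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value}: {pick_model(task)}")
```

In practice the mapping would also need to weigh pricing and latency, which the article notes differ across vendors but does not quantify.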