Models & Releases

Gemini 3.1 Pro Crushes Benchmarks – 77% on ARC-AGI-2, More Than Double Its Predecessor!

Gemini 3.1 Pro scores 77% on ARC-AGI-2, while Claude Opus 4.6 and GPT-5.3 Codex bring major speed and reasoning upgrades.

Deep Dive

February 2026 was a landmark month for frontier AI models, headlined by Google DeepMind's Gemini 3.1 Pro. The natively multimodal reasoning model, now in preview, delivered a stunning 77.1% score on the challenging ARC-AGI-2 benchmark, more than doubling the performance of Gemini 3 Pro, and set a new record on the GPQA Diamond benchmark at 94.3%. Priced at $2 per million input tokens, it signals Google's aggressive push in the high-reasoning model space. General availability is coming soon, and the older Gemini 3 Pro Preview will be discontinued on March 9.
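For a sense of what that price point means in practice, here is a back-of-envelope cost calculation. Only the $2-per-million-input-tokens figure comes from the article; the token counts and the helper function are illustrative, and output-token pricing (not quoted) is ignored.

```python
# Input-side API cost at the quoted $2 per million input tokens.
# Output-token pricing was not quoted, so this covers input cost only.
INPUT_PRICE_PER_MILLION = 2.00  # USD, figure quoted in the article

def input_cost(tokens: int, price_per_million: float = INPUT_PRICE_PER_MILLION) -> float:
    """Return the input-side cost in USD for a request of `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million

# A 100k-token prompt (e.g., a long document) costs 20 cents on the input side.
print(f"${input_cost(100_000):.2f}")  # → $0.20
```

At this rate, even very long-context workloads stay in the cents-per-request range on the input side.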

The competitive landscape intensified with Anthropic's Claude Opus 4.6, which boasts a 1M-token context window and a 144-point lead over GPT-5.2 in knowledge-work Elo ratings. Meanwhile, OpenAI countered with GPT-5.3 Codex, offering a 25% speed increase and a 48% token-efficiency gain over GPT-5.2-Codex, alongside a new 'high capability' classification in cybersecurity. Other notable releases included xAI's Grok 4.20, with a parallel-agents architecture, and continued strong showings from Chinese labs like Zhipu AI's GLM-5 and Alibaba's Qwen 3.5, keeping pressure on both the open-source and proprietary frontiers.

Key Points
  • Gemini 3.1 Pro scores 77.1% on ARC-AGI-2, more than double its predecessor's 31.1%.
  • Claude Opus 4.6 offers a 1M-token context window and leads GPT-5.2 by 144 Elo points on knowledge work.
  • GPT-5.3 Codex runs 25% faster and uses 48% fewer tokens than GPT-5.2-Codex for a 2.6x throughput gain.
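The headline multiplier above can be reproduced with simple arithmetic, under one assumed reading of the figures: if generation is 1.25x faster per token and a task now needs roughly 48% of the baseline token count, the two effects compound. This decomposition is an illustration, not a published methodology from OpenAI.

```python
# Hedged sketch: combining a per-token speedup with a token reduction into an
# end-to-end throughput multiplier. Only the 25%, 48%, and 2.6x figures come
# from the article; the decomposition itself is an assumption.

def throughput_multiplier(speedup: float, token_fraction: float) -> float:
    """End-to-end gain when generation is `speedup`x faster per token and a
    task now needs `token_fraction` of the baseline token count."""
    return speedup / token_fraction

# Reading the efficiency gain as tasks needing ~48% of baseline tokens:
print(round(throughput_multiplier(1.25, 0.48), 1))  # → 2.6
```

Speed and token-count improvements multiply rather than add, which is why two moderate gains can yield a large combined throughput jump.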

Why It Matters

These leaps in reasoning, speed, and efficiency directly translate to more capable AI assistants, cheaper API costs, and new applications in research and cybersecurity.