Media & Culture

Claude Mythos tops major AI benchmarks

New Claude model scores 90.1% on MMLU, beating GPT-4o and Gemini Ultra.

Deep Dive

Anthropic has unveiled Claude Mythos, a new state-of-the-art AI model that outperforms rival frontier models across major benchmarks. According to the company's technical paper, Mythos scores 90.1% on the Massive Multitask Language Understanding (MMLU) benchmark, beating OpenAI's GPT-4o (88.7%) and narrowly edging out Google's Gemini Ultra (90.0%). It posts particularly strong gains in mathematical reasoning, scoring 85.7% on the challenging MATH dataset, a 40% relative improvement over its predecessor, Claude 3 Opus. The model retains a 200K-token context window, enabling deep analysis of long documents.
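The "40% improvement" figure is relative, not an absolute jump in percentage points. A quick back-of-the-envelope check (hypothetical; the article does not state Claude 3 Opus's MATH score) shows what predecessor score that implies:

```python
# Back-derive the implied predecessor MATH score from the article's
# two figures: Mythos at 85.7% and a 40% relative improvement.
mythos_math = 85.7        # Mythos score on MATH, per the article
relative_gain = 0.40      # "40% improvement over its predecessor"

implied_opus_math = mythos_math / (1 + relative_gain)
print(f"Implied Claude 3 Opus MATH score: {implied_opus_math:.1f}%")
# → Implied Claude 3 Opus MATH score: 61.2%
```

In other words, the claim is consistent with a predecessor scoring around 61% on MATH, roughly 25 percentage points lower in absolute terms.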

Beyond raw scores, Claude Mythos introduces enhanced "agentic" capabilities, meaning it can plan and execute multi-step tasks such as coding projects, data analysis, and research synthesis with less human oversight. Anthropic emphasizes the model's improved safety and reduced hallucination rates, attributed to Constitutional AI techniques. While full pricing has not yet been announced, early-access terms suggest a tiered structure for enterprise clients, positioning Mythos as a premium tool for complex analytical work where accuracy is critical.

Key Points
  • Scores 90.1% on MMLU, beating GPT-4o and narrowly edging out Gemini Ultra
  • Shows a 40% relative improvement in complex mathematical reasoning (85.7% on MATH)
  • Features 200K context window and advanced agentic task execution

Why It Matters

Sets a new accuracy standard for enterprise AI, enabling more reliable complex analysis and automation.