Claude Mythos tops the AI benchmarks
New Claude model scores 90.1% on MMLU, beating GPT-4o and narrowly edging out Gemini Ultra.
Anthropic has unveiled Claude Mythos, its new state-of-the-art AI model that outperforms leading competitors across major benchmarks. According to the company's technical paper, Mythos achieves a 90.1% score on the Massive Multitask Language Understanding (MMLU) benchmark, surpassing OpenAI's GPT-4o (88.7%) and narrowly edging out Google's Gemini Ultra (90.0%). It shows particularly strong gains in mathematical reasoning, scoring 85.7% on the challenging MATH dataset—a 40% relative improvement over its predecessor, Claude 3 Opus. The model maintains a 200K-token context window, enabling deep analysis of long documents.
Beyond raw scores, Claude Mythos introduces enhanced "agentic" capabilities: it can plan and execute multi-step tasks such as coding projects, data analysis, and research synthesis with less human oversight. Anthropic emphasizes the model's improved safety and reduced hallucination rates, attributing both to its Constitutional AI training techniques. While pricing details haven't been fully released, early access suggests a tiered structure for enterprise clients, positioning Mythos as a premium tool for complex analytical work where accuracy is critical.
- Scores 90.1% on MMLU, beating GPT-4o and narrowly surpassing Gemini Ultra
- Shows a 40% relative improvement in complex mathematical reasoning (85.7% on MATH)
- Features a 200K-token context window and advanced agentic task execution
Why It Matters
Sets a new accuracy standard for enterprise AI, enabling more reliable complex analysis and automation.