Benchmarks in 2024
Claude 3.5, GPT-4o, and Gemini 1.5 fight for top spots on MMLU, HumanEval, and more.
Deep Dive
This roundup is based on a Reddit post by user RetiredApostle comparing the year's headline benchmark results.
Key Points
- Claude 3.5 Opus leads MMLU with 90.7%, up from GPT-4's 86.4% in late 2023
- GPT-4o tops multimodal benchmark MMMU at 87.2%, besting Claude 3.5 Opus (84.7%)
- Gemini 1.5 Pro leads long-context understanding (95.3% on RULER for 1M+ token documents)
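Coding benchmarks like HumanEval are scored with the pass@k metric: the probability that at least one of k sampled completions passes the tests. A minimal sketch of the standard unbiased estimator, given n samples per problem of which c pass (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples with c correct, passes."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include at least one correct completion.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5; a model's reported HumanEval score is this value averaged over all problems.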
Why It Matters
For professionals using AI on complex coding and analysis tasks, these gains translate to roughly 5-10% fewer errors than the previous year's models, making the outputs more dependable in day-to-day work.