Gemma 2 and Qwen2.5 on shared benchmarks
New independent benchmarks show Google's Gemma 2 27B and Alibaba's Qwen2.5 72B outperforming Meta's Llama 3.1 70B in key reasoning and coding tasks.
New benchmark results from Hugging Face's Open LLM Leaderboard are reshaping the open-source AI landscape, with Google's Gemma 2 and Alibaba's Qwen2.5 emerging as top contenders. The data, which has gone viral in developer communities, shows the Gemma 2 27B-parameter model scoring 81.5 on the HellaSwag commonsense reasoning benchmark and outperforming Meta's Llama 3.1 70B in several key areas. Qwen2.5 72B, meanwhile, demonstrated superior coding capabilities, scoring 78.2 on the HumanEval benchmark. The results are the first major independent validation of the two model families since their release in recent weeks.
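Teams that want to verify numbers like these rather than take a leaderboard at face value can typically reproduce them with EleutherAI's lm-evaluation-harness, the tooling behind the Open LLM Leaderboard. The sketch below is a minimal, illustrative run, assuming the `lm-eval` package is installed, access to the public `google/gemma-2-27b` checkpoint, and GPU capacity for a 27B model; exact scores will vary with harness version, few-shot settings, and hardware.

```python
# Minimal sketch: reproducing a HellaSwag score with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Settings here are
# illustrative, not necessarily the leaderboard's exact configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=google/gemma-2-27b,dtype=bfloat16",
    tasks=["hellaswag"],                           # commonsense reasoning benchmark
    num_fewshot=10,                                # 10-shot is a common HellaSwag setting
    batch_size="auto",
)

# Print the accuracy metrics the harness reports for the task.
print(results["results"]["hellaswag"])
```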
For developers and enterprises, these benchmarks offer crucial, apples-to-apples comparisons for selecting foundation models. The performance of Gemma 2, particularly in reasoning tasks, suggests it could be a more efficient choice for applications requiring strong logical inference without the computational overhead of larger 70B+ parameter models. Qwen2.5's strength in coding positions it as a direct competitor to specialized code models like DeepSeek-Coder. The transparency of these shared benchmarks, as opposed to proprietary internal testing, allows for more informed decision-making and could accelerate adoption of these newer entrants in a market long dominated by Llama.
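Beyond reading leaderboard scores, a quick smoke test on a representative prompt from your own workload is a cheap way to sanity-check a candidate before committing. The sketch below shows one way to do that with the Hugging Face transformers library, assuming the public `Qwen/Qwen2.5-72B-Instruct` checkpoint; a smaller variant such as `Qwen/Qwen2.5-7B-Instruct` is a more practical stand-in on a single GPU.

```python
# Minimal sketch: smoke-testing a candidate model on a HumanEval-style
# coding prompt. Swap in a smaller checkpoint if you lack multi-GPU
# hardware; the prompt here is illustrative, not a benchmark item.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Ask for a complete function, the same shape of task HumanEval measures.
messages = [
    {"role": "user", "content": "Write a Python function that returns "
                                "the n-th Fibonacci number iteratively."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```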
- Gemma 2 27B scored 81.5 on HellaSwag, beating Llama 3.1 70B in commonsense reasoning.
- Qwen2.5 72B achieved a 78.2 HumanEval score, making it a top open-source model for coding tasks.
- The benchmarks provide transparent, third-party validation crucial for developer adoption and enterprise model selection.
Why It Matters
Clear, independent benchmarks empower developers to choose the most efficient open-source model for specific tasks like reasoning or coding, driving better application performance.