Open Source

Artificial Analysis leaderboards: local-friendly models

Gemma 3 27B scores lower than its 12B sibling in a new intelligence index.

Deep Dive

Artificial Analysis has published its latest intelligence benchmarking results, giving developers concrete data for evaluating local AI model deployment. The independent benchmark ranks models across reasoning and non-reasoning capabilities, with categories for tiny, small, and medium-sized models suited to local hardware. The most surprising finding: Google's Gemma 3 27B scores only 10 points on the index, below the smaller Gemma 3 12B at 12 points. This counterintuitive result underscores that parameter count alone does not predict benchmark performance, particularly once architectural approaches and optimization strategies differ between models.

Solar Open 100B leads the pack with 22 points, the strongest reasoning result in the comparison, though at 100B parameters it is also the largest model listed. Llama Nemotron Super 49B v1.5 follows with 19 points, while Meta's Llama 3.3 70B scores 14 points, trailing the much smaller Nemotron. Notably absent are results for GLM-Air, though GLM-4.6V appears in the dataset. The benchmark methodology emphasizes practical intelligence metrics rather than crowning a single winner, giving developers nuanced data for matching models to specific use cases, hardware constraints, and performance requirements. The timing matters: more organizations are looking to run capable AI models locally rather than rely on cloud APIs.
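
As a rough illustration of how these numbers might feed into a model-selection decision, the sketch below encodes the scores reported above and shortlists models under a parameter budget. The parameter counts are read off the model names; the 50B budget is a hypothetical hardware constraint, not something defined by the benchmark.

```python
# Minimal sketch: shortlist local models by Artificial Analysis index score
# under a parameter budget. Scores are the values reported in this article;
# parameter counts come from the model names. The budget is hypothetical.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    params_b: int  # parameter count in billions (from the model name)
    score: int     # Artificial Analysis intelligence index score

MODELS = [
    Model("Solar Open 100B", 100, 22),
    Model("Llama Nemotron Super 49B v1.5", 49, 19),
    Model("Llama 3.3 70B", 70, 14),
    Model("Gemma 3 12B", 12, 12),
    Model("Gemma 3 27B", 27, 10),
]

def shortlist(max_params_b: int) -> list[Model]:
    """Return models within the parameter budget, highest score first."""
    fits = [m for m in MODELS if m.params_b <= max_params_b]
    return sorted(fits, key=lambda m: m.score, reverse=True)

if __name__ == "__main__":
    # Example: hardware that comfortably runs models up to ~50B parameters.
    for m in shortlist(50):
        print(f"{m.name}: {m.score} points ({m.params_b}B)")
```

Run against the article's numbers, a 50B budget surfaces Llama Nemotron Super 49B v1.5 first, and the Gemma 3 12B outranking its 27B sibling, which is exactly the kind of non-obvious ordering this benchmark is meant to expose.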

Key Points
  • Gemma 3 27B scores 10 points, lower than Gemma 3 12B's 12 points
  • Solar Open 100B leads with 22 points in reasoning category
  • Llama 3.3 70B scores 14 points, behind Llama Nemotron Super 49B's 19 points

Why It Matters

Provides data-driven guidance for developers choosing local AI models, showing that bigger models don't always perform better.