AI IQ project ranks GPT-5.5, Gemini 3.1, Claude Opus 4.7 by IQ score
GPT-5.5 tops the IQ chart, but Gemini offers better cost efficiency per point.
Ryan Shay's AI IQ project compares frontier AI models by converting scores from multiple public benchmarks onto a human IQ scale. Models evaluated include GPT-5.5, Anthropic's Claude Opus 4.7, Google's Gemini 3.1, Grok 4.3, Kimi K2.6, Qwen 3.6, DeepSeek V4, and Muse Spark. GPT-5.5 currently holds the top IQ score, followed closely by GPT-5.4, Gemini 3.1 Pro, and Opus 4.7. The scoring aggregates 12 benchmarks, including ARC-AGI-1 and ARC-AGI-2, across four reasoning areas: abstract, mathematical, programming, and academic. Adjustments are meant to keep contamination or memorization from inflating results, and missing data is imputed conservatively. Beyond raw scores, AI IQ provides time-series trends, model-family comparisons (for example, filtering by xAI shows the Grok generations), and a side-by-side comparison of OpenAI, Anthropic, and Google. Cost efficiency is also measured: the effective cost per IQ assumes a task with 2,000,000 input tokens and 1,000,000 output tokens and factors in token-use efficiency. The results show Gemini costing less than GPT and Opus within the same IQ band. The project has faced pushback for compressing diverse AI strengths into a single number, with some experts arguing that it risks misleading users. Even so, AI IQ points to a shift in the AI race from raw benchmark bragging toward practical usability and cost-per-intelligence metrics that could influence purchasing decisions.
- GPT-5.5 has the highest IQ score, followed by GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.7.
- Scoring aggregates 12 benchmarks, including ARC-AGI-1 and ARC-AGI-2, across 4 reasoning areas (abstract, math, programming, academic).
- Effective cost per IQ assumes 2M input + 1M output tokens; Gemini costs less than GPT and Opus within the same IQ band.
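To make the cost metric concrete, here is a minimal Python sketch of how an effective cost per IQ could be computed for the 2M-input, 1M-output task described above. The project's exact formula is not given here, so the structure is an assumption: the per-million-token prices, the `token_use_factor` multiplier (standing in for "token-use efficiency"), and the two example models are all hypothetical placeholders, not real prices or scores.

```python
from dataclasses import dataclass

# Task size stated in the article: 2M input tokens, 1M output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 1_000_000


@dataclass
class ModelPricing:
    """Hypothetical pricing record; none of these fields come from AI IQ."""
    name: str
    iq: float                      # placeholder IQ estimate
    usd_per_m_input: float         # placeholder price per 1M input tokens
    usd_per_m_output: float        # placeholder price per 1M output tokens
    token_use_factor: float = 1.0  # assumed: >1 means more output tokens spent on the same task


def task_cost(m: ModelPricing) -> float:
    """Dollar cost of the reference task, scaling output tokens by the assumed usage factor."""
    input_cost = INPUT_TOKENS / 1_000_000 * m.usd_per_m_input
    output_cost = OUTPUT_TOKENS / 1_000_000 * m.usd_per_m_output * m.token_use_factor
    return input_cost + output_cost


def cost_per_iq(m: ModelPricing) -> float:
    """Task cost divided by the model's IQ score (assumed definition)."""
    return task_cost(m) / m.iq


# Made-up example models, for illustration only.
model_a = ModelPricing("model_a", iq=120, usd_per_m_input=2.0, usd_per_m_output=10.0, token_use_factor=1.5)
model_b = ModelPricing("model_b", iq=125, usd_per_m_input=5.0, usd_per_m_output=20.0)

if __name__ == "__main__":
    for m in sorted([model_a, model_b], key=cost_per_iq):
        print(f"{m.name}: task cost ${task_cost(m):.2f}, ${cost_per_iq(m):.3f} per IQ")
```

Under these placeholder numbers, the cheaper model comes out ahead on cost per IQ even though it uses more output tokens, which is the kind of trade-off the metric is meant to surface within an IQ band.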
Why It Matters
A single IQ score and cost metric could reshape how enterprises select AI models for practical deployment.