GPT-5.6 Sol tops Pareto frontier: performance vs cost in AI evaluations
New benchmarks show logarithmic gains: more tokens yield diminishing returns.
The article 'Success Per Tokens' applies Pareto frontier analysis to LLM evaluations. A configuration sits on the frontier when no alternative delivers better performance for the same or lower resource cost. Using data from OpenAI’s GPT-5.6 Preview System Card, it shows GPT-5.6 Sol exceeding 50% pass@1 on the Multi Select Virology Troubleshooting benchmark at the lowest API cost tier. Graphs from DeepSWE (a frontier coding benchmark) reveal that even with high reasoning modes, all tested models plateau at around 70% completion, with GPT-5.5 and Claude Fable 5 following similar cost-performance curves. Notably, Anthropic’s high-reasoning modes deliberately restrict token output to reduce costs, sometimes at the expense of coding performance.
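To make the frontier idea concrete, here is a minimal sketch of how a cost-versus-score frontier could be computed. It is not the article's method. The `EvalPoint` type, the `pareto_frontier` function, the configuration labels, and every number below are hypothetical placeholders chosen for illustration, not figures from the system card or from DeepSWE.

```python
from typing import NamedTuple


class EvalPoint(NamedTuple):
    label: str    # hypothetical configuration name
    cost: float   # resource spent per task, e.g. dollars or tokens (lower is better)
    score: float  # benchmark result, e.g. pass@1 in [0, 1] (higher is better)


def pareto_frontier(points: list[EvalPoint]) -> list[EvalPoint]:
    """Return the points that no other point beats on both cost and score."""
    # Sort by ascending cost; on equal cost, put the higher score first.
    ordered = sorted(points, key=lambda p: (p.cost, -p.score))
    frontier: list[EvalPoint] = []
    best_score = float("-inf")
    for point in ordered:
        # A point survives only if it scores strictly higher than every
        # cheaper (or equally priced) point seen so far.
        if point.score > best_score:
            frontier.append(point)
            best_score = point.score
    return frontier


# Placeholder data, invented for this sketch.
points = [
    EvalPoint("config-a", 1.0, 0.40),
    EvalPoint("config-b", 2.0, 0.35),   # costs more than config-a, scores lower
    EvalPoint("config-c", 4.0, 0.60),
    EvalPoint("config-d", 8.0, 0.58),   # costs more than config-c, scores lower
    EvalPoint("config-e", 16.0, 0.70),
]

for point in pareto_frontier(points):
    print(f"{point.label}: cost={point.cost}, score={point.score}")
```

With these placeholder values, `config-b` and `config-d` fall off the frontier because a cheaper configuration already scores higher. The surviving points also illustrate the diminishing-returns pattern: each doubling or quadrupling of cost buys a smaller gain in score.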
Beyond models, the author draws parallels to human workers and startups. Individuals can bend their own Pareto frontier by investing in custom tooling, keyboard shortcuts, and efficient context management rather than burning tokens on agent calls. In the startup world, capital efficiency—achieving outsized growth with minimal resources—mirrors the same principle: maximize output per unit input. The meme of YC founders disrupting industries with zero domain experience is reframed as a rational risk-taking strategy. Overall, the piece argues that 'work smart more than hard' expands the frontier for both AI and humans, but pure hard work still moves the curve outward.
- GPT-5.6 Sol achieves >50% pass@1 on the Multi Select Virology Troubleshooting benchmark at the lowest API cost tier, per OpenAI's GPT-5.6 Preview System Card.
- DeepSWE coding benchmarks show all tested models plateau at ~70% completion, even in high reasoning modes, with GPT-5.5 and Claude Fable 5 on similar cost-performance curves.
- Human efficiency gains come from custom tooling and reduced token usage; startups apply the same Pareto logic to capital efficiency.
Why It Matters
Resource efficiency defines AI competitiveness—cheaper, faster models win, and human productivity now mirrors token economics.