Research & Papers

Weaker LLMs + Verifier = Frontier Performance on High-Verifiability Tasks

r/MachineLearning June 11, 2026

⚡Mistral 3 8B with one retry beat GPT-5.5 on code tests (95% vs 91%)

Deep Dive

A self-described small experiment (n=120, no peer review) tested Karpathy's task verifiability framework by routing three LLMs (Claude Sonnet 4.6, GPT-5.5, and a local Mistral 3 8B served via vLLM 0.6.3) across four task categories: code unit tests, structured JSON extraction, multi-hop reasoning, and creative summarization. The premise is that high-verifiability tasks (code, extraction) allow automatic error checking, so a weaker model paired with a verifier can often match frontier performance. On code tests, Sonnet 4.6 scored 94%, GPT-5.5 91%, and Mistral 3 8B 87%; with one retry, Mistral reached 95%, edging out both frontier models. On structured extraction, Sonnet 4.6 scored 97%, GPT-5.5 94%, and Mistral 3 8B 89%, rising to 96% with a retry. An ambiguous JSON schema initially caused Sonnet to underperform on extraction (fixing the schema raised its score to 98%), which highlights that verifier design matters as much as model choice.
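As a rough sanity check on the retry numbers above, the short sketch below compares the reported with-retry rates against what one would expect if the second attempt failed independently of the first. The function names are illustrative, the independence baseline is an assumption introduced here rather than something the experiment measured, and with a sample this small the differences may well be noise.

```python
# Reported Mistral 3 8B pass rates from the write-up: (first attempt, with one retry).
REPORTED = {
    "code": (0.87, 0.95),
    "extraction": (0.89, 0.96),
}


def independent_retry_rate(first: float) -> float:
    """Pass rate after one retry IF both attempts failed independently (assumption)."""
    return 1 - (1 - first) ** 2


def implied_retry_success(first: float, with_retry: float) -> float:
    """Share of first-attempt failures that the retry actually recovered."""
    return (with_retry - first) / (1 - first)


for task, (first, with_retry) in REPORTED.items():
    print(
        task,
        "independent baseline:", round(independent_retry_rate(first), 3),
        "reported:", with_retry,
        "failures recovered:", round(implied_retry_success(first, with_retry), 3),
    )
```

Under these figures the retry recovers roughly 62% of failed code tasks and 64% of failed extractions, below the first-attempt pass rates. One reading, offered tentatively, is that the failures cluster on harder items instead of being random slips.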

Multi-hop reasoning exposed a genuine capability gap: Sonnet 4.6 scored 78%, GPT-5.5 71%, and Mistral 3 8B 51%. Retries did not help here, because the models persistently hallucinated reasoning paths. Creative summarization (human-rated on a 1-5 scale) showed the expected gaps: Sonnet 4.6 at 4.2, GPT-5.5 at 3.9, and Mistral 3 8B at 3.1. The experiment's creator notes severe limitations: a tiny sample (n=120), a single internal evaluator, simple verifiers (JSON Schema plus regexes), and the exclusion of prompts over 8k tokens, which likely biases the results away from Mistral's weaknesses. Constrained decoding could also change the results substantially. Despite these caveats, the directional takeaway is that for tasks whose outputs can be mechanically verified, a weak model with a retry loop can rival frontier models at a fraction of the cost, which frees inference budget on high-volume, automatable work.
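A minimal sketch of what a "simple verifier plus one retry" loop might look like for the extraction task follows. It is not the experiment's code. The field names (`title`, `date`), the hand-rolled checks standing in for a JSON Schema validator, the `generate` callable, and the choice to feed the verifier's error back into the retry prompt are all assumptions made for illustration.

```python
import json
import re
from typing import Callable, Optional, Tuple

# Hypothetical regex check; the write-up only says "JSON Schema + regexes".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def verify_extraction(raw: str) -> Optional[str]:
    """Return None if the output passes, otherwise a short error message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    if not isinstance(data.get("title"), str):
        return "missing string field 'title'"
    date = data.get("date")
    if not isinstance(date, str) or not DATE_RE.match(date):
        return "field 'date' must look like YYYY-MM-DD"
    return None


def generate_with_retry(
    generate: Callable[[str], str],  # hypothetical wrapper around any model call
    prompt: str,
    verify: Callable[[str], Optional[str]] = verify_extraction,
    max_retries: int = 1,
) -> Tuple[str, bool, int]:
    """Call the model, verify, and retry up to max_retries times on failure."""
    attempt_prompt = prompt
    output = ""
    for attempt in range(max_retries + 1):
        output = generate(attempt_prompt)
        error = verify(output)
        if error is None:
            return output, True, attempt + 1
        # Assumption: the retry sees why the previous answer was rejected.
        attempt_prompt = f"{prompt}\n\nYour previous answer was rejected: {error}. Try again."
    return output, False, max_retries + 1


def make_flaky_model() -> Callable[[str], str]:
    """Stand-in for a model: first reply lacks 'date', second reply is valid."""
    replies = iter([
        '{"title": "Q2 report"}',
        '{"title": "Q2 report", "date": "2026-06-11"}',
    ])
    return lambda prompt: next(replies)


output, passed, attempts = generate_with_retry(make_flaky_model(), "Extract title and date.")
print(passed, attempts, output)
```

The loop only helps when the verifier can actually detect the failure. That fits the report that retries did nothing for multi-hop reasoning, where no mechanical check was available.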

Key Points

Mistral 3 8B + one retry hit 95% on code tests, beating Sonnet 4.6 (94%) and GPT-5.5 (91%)
Structured extraction: Mistral 8B with retry reached 96% vs Sonnet 4.6 at 97% (schema bug initially hid Sonnet's strength)
Multi-hop reasoning gap was stark: Sonnet 78%, Mistral 51%—retries didn't fix hallucinated reasoning paths

Why It Matters

The results suggest a cost-effective routing strategy: use weak models with a verifier for code and JSON tasks, and reserve frontier models for unstructured reasoning.
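The routing idea can be sketched as a small lookup keyed on whether a task's output is mechanically checkable. The task categories come from the experiment; the tier labels, the `route` function, and the returned settings are hypothetical names chosen for this illustration.

```python
from enum import Enum


class Task(Enum):
    CODE = "code unit tests"
    EXTRACTION = "structured JSON extraction"
    MULTI_HOP = "multi-hop reasoning"
    SUMMARIZATION = "creative summarization"


# Tasks whose outputs can be checked automatically (unit tests, schema checks).
VERIFIABLE = {Task.CODE, Task.EXTRACTION}


def route(task: Task) -> dict:
    """Pick a model tier and retry policy for a task (illustrative policy only)."""
    if task in VERIFIABLE:
        # Small model, checked by a verifier, with the single retry the write-up describes.
        return {"tier": "small", "verifier": True, "max_retries": 1}
    # No mechanical check available, so retries cannot be guided by a verifier.
    return {"tier": "frontier", "verifier": False, "max_retries": 0}


for task in Task:
    print(task.value, "->", route(task))
```

A production router would likely also need a fallback when the small model still fails after its retry; that policy is outside what the write-up reports.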
