Weaker LLMs + Verifier = Frontier Performance on High-Verifiability Tasks
Mistral 3 8B with one retry beat GPT-5.5 on code tests (95% vs 91%)
A self-described small experiment (n=120, no peer review) tested Karpathy's task verifiability framework by routing three LLMs (Claude Sonnet 4.6, GPT-5.5, and a local Mistral 3 8B served via vLLM 0.6.3) across four task categories: code unit tests, structured JSON extraction, multi-hop reasoning, and creative summarization. High-verifiability tasks (code, extraction) allow errors to be checked automatically, so a weaker model paired with a verifier can often match frontier performance. On code tests, Sonnet 4.6 scored 94%, GPT-5.5 91%, and Mistral 3 8B 87%; with one retry, Mistral reached 95%, edging out both frontier models. On structured extraction, Sonnet 4.6 scored 97%, GPT-5.5 94%, and Mistral 3 8B 89%, rising to 96% with a retry. An ambiguous JSON schema initially caused Sonnet to underperform, and fixing the schema raised Sonnet's score to 98%, which suggests that verifier design matters as much as model choice.
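The retry setup described above can be sketched as a small loop. This is a minimal illustration, not the experiment's actual harness: `generate_with_retry`, `make_flaky_model`, and `passes_unit_test` are hypothetical names, the "model" is a stub that replays canned outputs, and the source does not say whether the retry prompt included the verifier's error message.

```python
from typing import Callable, Optional, Tuple


def generate_with_retry(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_retries: int = 1,
) -> Tuple[Optional[str], int]:
    """Return (first output that passes the verifier, attempts used).

    The first element is None if every attempt fails verification.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        candidate = generate(prompt)
        if verify(candidate):
            return candidate, attempts
    return None, attempts


def make_flaky_model(outputs):
    """Hypothetical stand-in for an LLM call: replays canned outputs in order."""
    iterator = iter(outputs)
    return lambda prompt: next(iterator)


def passes_unit_test(source: str) -> bool:
    """Toy verifier: run the candidate and check one unit test.

    Assumption: a real harness would run untrusted code in a sandbox,
    not with a bare exec() in the current process.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


# First canned answer is wrong, second is right, so one retry is needed.
model = make_flaky_model([
    "def add(a, b): return a - b",
    "def add(a, b): return a + b",
])
result, attempts = generate_with_retry(model, passes_unit_test, "Write add(a, b).")
```

The loop only helps when `verify` is cheap and trustworthy, which is the property that separates the code and extraction categories from the other two.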
Multi-hop reasoning exposed a genuine capability gap: Sonnet 4.6 scored 78%, GPT-5.5 71%, and Mistral 3 8B 51%. Retries did not help here, because the models persistently hallucinated reasoning paths. Creative summarization (human-rated on a 1-5 scale) showed the expected gaps: Sonnet 4.6 at 4.2, GPT-5.5 at 3.9, and Mistral 3 8B at 3.1. The experiment's creator notes severe limitations: a tiny sample (n=120), a single internal evaluator, simple verifiers (JSON Schema plus regexes), and the exclusion of prompts over 8k tokens, which likely biases results in Mistral's favor by hiding its weaknesses on long inputs. The creator also notes that constrained decoding could change the results substantially. Despite these caveats, the directional takeaway is that for tasks whose outputs can be mechanically verified, a weak model with a retry loop can rival frontier models at a fraction of the cost, saving inference budget on high-volume, automatable work.
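To make "simple verifiers (JSON Schema plus regexes)" concrete, here is a minimal sketch of that kind of check using only the Python standard library. It is a stand-in rather than the experiment's verifier: the field names (`invoice_id`, `total`, `date`) and the date format are hypothetical, and a real setup would more likely use a full JSON Schema validator.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> accepted Python types after json.loads().
REQUIRED_FIELDS: Dict[str, Tuple[type, ...]] = {
    "invoice_id": (str,),
    "total": (int, float),  # accept 12 and 12.0 alike to avoid ambiguity
    "date": (str,),
}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def verify_extraction(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems: List[str] = []
    for name, accepted in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], accepted):
            problems.append(f"wrong type for {name}")

    date = data.get("date")
    if isinstance(date, str) and not DATE_PATTERN.fullmatch(date):
        problems.append("date is not YYYY-MM-DD")
    return problems
```

Returning the list of problems, rather than a bare pass/fail, makes it easier to notice when the checker itself is at fault, which is the kind of issue the ambiguous schema caused for Sonnet.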
- Mistral 3 8B + one retry hit 95% on code tests, beating Sonnet 4.6 (94%) and GPT-5.5 (91%)
- Structured extraction: Mistral 8B with retry reached 96% vs Sonnet 4.6 at 97% (schema bug initially hid Sonnet's strength)
- Multi-hop reasoning gap was stark: Sonnet 4.6 at 78%, Mistral 3 8B at 51%, and retries didn't fix hallucinated reasoning paths
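A quick back-of-the-envelope check on the retry numbers above. The helper names are hypothetical, and the "independent retry" ceiling is a simplifying assumption (it treats the second attempt as an independent draw with the same pass rate); the source reports only the percentages, so with n=120 these figures are rough.

```python
def share_of_failures_recovered(first_pass: float, with_retry: float) -> float:
    """Fraction of first-pass failures that the retry turned into passes."""
    return (with_retry - first_pass) / (1.0 - first_pass)


def independent_retry_ceiling(first_pass: float) -> float:
    """Pass rate after one retry IF attempts were independent (an assumption)."""
    return 1.0 - (1.0 - first_pass) ** 2


# Mistral 3 8B figures reported in the summary: (first pass, with one retry).
reported = {"code": (0.87, 0.95), "extraction": (0.89, 0.96)}

for task, (first, retried) in reported.items():
    recovered = share_of_failures_recovered(first, retried)
    ceiling = independent_retry_ceiling(first)
    print(f"{task}: recovered {recovered:.1%} of failures; "
          f"independent-retry ceiling {ceiling:.1%}")
```

On the reported figures, one retry recovers roughly 62% of code failures and 64% of extraction failures. Both retried scores sit below the independence ceilings of about 98.3% and 98.8%, which is consistent with some failures being systematic rather than random, although the sample is too small to say more.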
Why It Matters
The results suggest a cost-effective routing strategy: weaker models plus a verifier for code and JSON tasks, frontier models for unstructured reasoning.
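One way to picture that routing rule is a small lookup keyed on task category. This is a sketch under stated assumptions: the `TaskKind` enum, the `choose_route` function, and the model labels `"small-local"` and `"frontier"` are hypothetical placeholders, not names from the experiment.

```python
from enum import Enum
from typing import Dict, Union


class TaskKind(Enum):
    CODE = "code"
    EXTRACTION = "extraction"
    MULTI_HOP = "multi_hop"
    SUMMARIZATION = "summarization"


# Categories where outputs can be checked mechanically (tests, schemas).
VERIFIABLE = {TaskKind.CODE, TaskKind.EXTRACTION}


def choose_route(kind: TaskKind) -> Dict[str, Union[str, bool, int]]:
    """Pick a model tier and retry policy from the task category."""
    if kind in VERIFIABLE:
        # Cheap model, automatic check, one retry (the setup that reached 95-96%).
        return {"model": "small-local", "verifier": True, "max_retries": 1}
    # No reliable automatic check, and retries did not help on multi-hop tasks.
    return {"model": "frontier", "verifier": False, "max_retries": 0}


route = choose_route(TaskKind.EXTRACTION)
```

Keying the decision on whether a verifier exists, rather than on the model's raw score, mirrors the experiment's framing: the retry loop pays off only where failures can be detected automatically.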