Research & Papers

New study: enforcing output schemas on small LLMs drops accuracy by 45%

arXiv cs.LG May 27, 2026

⚡ Qwen2.5-1.5B's accuracy nearly halved when forced into a strict tool-call schema

Deep Dive

A new paper on arXiv (2605.26128) by Jaideep Ray systematically measures what it calls the 'constraint tax': the hidden accuracy cost of enforcing structured output schemas (JSON, tool-call formats) on small language models (SLMs) under 3 billion parameters. The study tested Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B across 15,000 GPU generations. Hard answer-only schema decoding raised schema validity from 61.5% to 100%, but answer accuracy fell from 19.7% to 11.0%, and the share of outputs that were schema-valid but wrong jumped from 49.5% to 88.9%. In a realistic calendar tool-call task, Qwen2.5-1.5B reached 91.5% executable accuracy with prompt-only JSON but only 48.0% under a strict tool-call schema, even though both modes hit 100% validity. The failure is semantic, not structural: the schema itself is not broken, but the models sacrifice correct reasoning to satisfy its constraints.
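The separate metrics named above can be tallied independently, as in the minimal Python sketch below. It assumes each generation has already been judged on three yes/no questions; the `Outcome` record, its field names, and the `report` function are hypothetical and are not taken from the paper, whose exact scoring rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One judged generation (hypothetical record, not the paper's format)."""
    schema_valid: bool    # output parsed and matched the schema
    answer_correct: bool  # answer matched the reference, judged separately
    executable: bool      # tool call ran and did what was intended


def report(outcomes: list[Outcome]) -> dict[str, float]:
    """Return each metric as a percentage of all generations."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to score")

    def rate(count: int) -> float:
        return 100.0 * count / n

    return {
        "schema_validity": rate(sum(o.schema_valid for o in outcomes)),
        "answer_accuracy": rate(sum(o.answer_correct for o in outcomes)),
        "executable_accuracy": rate(sum(o.executable for o in outcomes)),
        # Valid structure, wrong content: the case a validity-only
        # dashboard would silently count as a success.
        "wrong_valid_rate": rate(
            sum(o.schema_valid and not o.answer_correct for o in outcomes)
        ),
    }


# Toy data for illustration only; these are not results from the study.
toy = [
    Outcome(schema_valid=True, answer_correct=True, executable=True),
    Outcome(schema_valid=True, answer_correct=False, executable=False),
    Outcome(schema_valid=False, answer_correct=True, executable=False),
    Outcome(schema_valid=True, answer_correct=False, executable=False),
]
metrics = report(toy)
```

Keeping `wrong_valid_rate` as its own number is the point of the sketch: in the toy data, validity is 75% while half of all outputs are valid but wrong.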

The paper shows that even models at the 3B boundary still pay a direct-schema tax: the constraint cost does not disappear at the size threshold often cited as safe. The authors do identify a constructive design pattern, 'reason free, constrain late': let the model generate free-form reasoning first, then apply the structured output wrapper in a separate step. For production systems, the practical conclusion is to report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately rather than conflating them. The finding matters most for on-device and low-cost deployments, where SLMs are attractive for privacy and latency but reliability demands structured outputs.
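One way to picture 'reason free, constrain late' is a two-stage call, sketched below in Python. This is an illustration under stated assumptions, not the paper's implementation: the `Generate` callable type, the prompts, the `{"answer": ...}` shape, and both stub models are hypothetical stand-ins for a real unconstrained call and a real schema-constrained call.

```python
import json
from typing import Callable

# Hypothetical model interface: takes a prompt, returns generated text.
Generate = Callable[[str], str]


def reason_free_constrain_late(
    question: str, free: Generate, constrained: Generate
) -> dict:
    # Stage 1: no schema is in play, so the model reasons in plain text.
    reasoning = free(
        f"Question: {question}\nThink step by step, then state the answer."
    )
    # Stage 2: the schema applies only to the formatting step.
    formatted = constrained(
        "Copy the final answer from the reasoning below into JSON "
        'of the form {"answer": "..."}.\n\n' + reasoning
    )
    result = json.loads(formatted)
    if not isinstance(result, dict) or not isinstance(result.get("answer"), str):
        raise ValueError("formatter output does not match the expected shape")
    return result


def stub_free(prompt: str) -> str:
    # Stand-in for an unconstrained model call.
    return "3 + 4 = 7. The answer is 7."


def stub_constrained(prompt: str) -> str:
    # Stand-in for schema-constrained decoding: wraps the last token
    # of the reasoning text in the expected JSON shape.
    answer = prompt.rstrip(".").split()[-1]
    return json.dumps({"answer": answer})


result = reason_free_constrain_late("What is 3 + 4?", stub_free, stub_constrained)
```

The second stage only has to copy an answer that already exists, so the structural constraint no longer competes with the reasoning itself.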

Key Points

Constraint tax: Qwen2.5-1.5B dropped from 91.5% to 48.0% executable accuracy on a calendar tool-call task when forced into a strict schema.
Aggregate result: across 15,000 generations from three models, answer accuracy fell from 19.7% to 11.0% while schema validity rose from 61.5% to 100%.
Recommended pattern: 'reason free, constrain late', which separates reasoning from structural formatting to avoid semantic errors.

Why It Matters

For production SLM deployments, blindly enforcing schemas can halve accuracy — metrics must separate validity from correctness.
