Developer Tools

The Limits of Long-Context Reasoning in Automated Bug Fixing

A new study shows GPT-5-nano and Qwen3-Coder failing on 64k+ token tasks, challenging the assumption that a large context window translates into usable reasoning capacity.

Deep Dive

Researchers from multiple universities published a paper titled "The Limits of Long-Context Reasoning in Automated Bug Fixing." They systematically tested models such as GPT-5-nano and Qwen3-Coder-30B-A3B on the SWE-bench benchmark. While agentic workflows achieved resolve rates of up to 31%, single-shot performance on 64k-128k token contexts plummeted to 0-7%. The results indicate that current LLMs struggle to reason effectively across entire codebases, exposing a gap between nominal and usable context capacity.
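The headline metric here is the resolve rate: the fraction of benchmark tasks whose generated patch makes the repository's tests pass. The sketch below illustrates that computation only; the function and variable names are illustrative placeholders, not the paper's code or the actual SWE-bench harness API.

```python
# Minimal sketch of a SWE-bench-style "resolve rate": a task counts as
# resolved only if the model's patch makes the repo's tests pass.
# Names are illustrative, not the real SWE-bench tooling.

def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose patch passed the tests."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Hypothetical pass/fail outcomes for 10 tasks under the two settings
# the article contrasts: agentic (iterative tool use) vs. single-shot
# (one completion over a 64k-128k token context).
agentic_outcomes = [True, True, True] + [False] * 7
single_shot_outcomes = [False] * 10

print(f"agentic:     {resolve_rate(agentic_outcomes):.0%}")      # 30%
print(f"single-shot: {resolve_rate(single_shot_outcomes):.0%}")  # 0%
```

The point of the comparison: both settings see the same tasks, so the gap between the two rates isolates the effect of how the context is consumed, not the tasks themselves.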

Why It Matters

This challenges the core premise of many AI coding tools and forces a re-evaluation of how we build and benchmark developer agents.