
Research reveals LLMs like GPT-4 and Claude 3 struggle with long, complex code reasoning

A new study shows AI performance drops by up to 40% when answering questions about legacy code like COBOL.

Deep Dive

Researchers Kishan Maharaj, Nandakishore Menon, Ashita Saxena, and Srikanth Tamilselvam published a paper analyzing how robust LLMs are at long-context code question answering (QA). They extended the LongCodeBench dataset with COBOL and Java, then tested models under three conditions: multiple-choice questions with shuffled options, open-ended questions, and 'needle-in-a-haystack' contexts in which the relevant code is surrounded by irrelevant material. The results showed substantial performance drops and brittle behavior when irrelevant information was present, highlighting a key limitation of current AI coding assistants.
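To make two of these perturbations concrete, here is a minimal Python sketch. It is an illustration only, not the authors' code: the names MultipleChoiceItem, shuffle_options, and embed_needle are hypothetical, and the paper's actual data format and prompts may differ.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MultipleChoiceItem:
    # Hypothetical record type; the real benchmark schema may differ.
    question: str
    options: List[str]
    answer_index: int  # position of the correct option in `options`


def shuffle_options(item: MultipleChoiceItem, seed: int) -> MultipleChoiceItem:
    """Return a copy of the item with its options permuted.

    The correct answer's new position is tracked, so a model that truly
    knows the answer should score the same before and after shuffling.
    """
    rng = random.Random(seed)  # seeded for reproducible permutations
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = [item.options[i] for i in order]
    new_answer = order.index(item.answer_index)
    return MultipleChoiceItem(item.question, new_options, new_answer)


def embed_needle(relevant: str, distractors: List[str], position: int) -> str:
    """Build a long context with the relevant snippet placed among
    irrelevant snippets at the given position (clamped to a valid range)."""
    pos = max(0, min(position, len(distractors)))
    chunks = distractors[:pos] + [relevant] + distractors[pos:]
    return "\n\n".join(chunks)
```

Comparing a model's accuracy on the original items against the shuffled or padded versions gives a rough measure of how much it relies on option order or is distracted by irrelevant code.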

Why It Matters

The findings expose critical weaknesses in the AI tools that developers rely on to understand legacy systems and complex codebases.
