Developer Tools

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

A new study shows LLM performance can drop by up to 40% when answering questions about legacy code such as COBOL.

Deep Dive

Researchers Kishan Maharaj, Nandakishore Menon, Ashita Saxena, and Srikanth Tamilselvam published a paper analyzing LLM robustness in long-context code question answering. They extended the LongCodeBench dataset with COBOL and Java, then tested models under three perturbations: shuffled answer options, open-ended reformulations of multiple-choice questions, and "needle-in-a-haystack" contexts where the relevant code is buried in large amounts of irrelevant material. Results showed substantial performance drops and brittle behavior when irrelevant context was added, highlighting a key limitation of current AI coding assistants.
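To make the option-shuffling perturbation concrete, here is a minimal illustrative sketch (not the paper's actual harness): shuffle the answer options of a multiple-choice question while remapping the answer key, so a model that relies on option position rather than content will be exposed. The function name and the COBOL-flavored example options are hypothetical.

```python
import random

def shuffle_options(options, correct_index, rng=None):
    """Shuffle multiple-choice options and return the new position
    of the correct answer, so the answer key stays valid."""
    rng = rng or random.Random()
    indexed = list(enumerate(options))  # remember original positions
    rng.shuffle(indexed)
    new_correct = next(pos for pos, (orig, _) in enumerate(indexed)
                       if orig == correct_index)
    shuffled = [text for _, text in indexed]
    return shuffled, new_correct

# Hypothetical question: which statement copies A into B in COBOL?
options = ["MOVE A TO B", "ADD A TO B", "PERFORM PARA-1", "STOP RUN"]
shuffled, key = shuffle_options(options, correct_index=0,
                                rng=random.Random(42))
# A model that always picks the first option ("position bias")
# will now be wrong whenever the correct answer has moved.
```

Comparing a model's accuracy before and after such shuffles is one simple way to separate genuine code understanding from positional or formatting shortcuts.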

Why It Matters

This exposes critical weaknesses in the AI tools developers rely on to understand legacy systems and complex codebases.