Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
A new study shows that LLM performance drops by up to 40% when answering questions about legacy code such as COBOL.
Researchers Kishan Maharaj, Nandakishore Menon, Ashita Saxena, and Srikanth Tamilselvam published a paper analyzing the robustness of LLMs in long-context code question answering. They extended the LongCodeBench dataset with COBOL and Java, and tested models under three perturbations: shuffled answer options, open-ended rather than multiple-choice questions, and "needle-in-a-haystack" contexts in which the relevant code is buried among irrelevant material. Models showed substantial performance drops and brittle behavior in the presence of irrelevant information, highlighting a key limitation of current AI coding assistants.
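To make the perturbations concrete, here is a minimal sketch of what option shuffling and needle-in-a-haystack construction might look like. This is an illustrative assumption, not the paper's actual harness; the record fields (`options`, `answer_index`) and helper names are hypothetical:

```python
import random

def shuffle_options(question: dict, seed: int = 0) -> dict:
    """Return a copy of a multiple-choice record with its options reordered.

    A robust model's answer should not depend on option order.
    (Hypothetical record format: {"options": [...], "answer_index": int}.)
    """
    rng = random.Random(seed)
    options = list(question["options"])
    rng.shuffle(options)
    shuffled = dict(question)
    shuffled["options"] = options
    # Track where the correct answer moved after shuffling.
    correct_text = question["options"][question["answer_index"]]
    shuffled["answer_index"] = options.index(correct_text)
    return shuffled

def needle_in_haystack(snippet: str, distractors: list[str], position: int) -> str:
    """Bury the relevant code snippet ("needle") among irrelevant files."""
    files = distractors[:position] + [snippet] + distractors[position:]
    return "\n\n".join(files)
```

Under this setup, a model is brittle if its answer changes between the original and shuffled records, or if its accuracy falls as more distractor files are added around the needle.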
Why It Matters
This exposes critical weaknesses in AI tools developers rely on for understanding legacy systems and complex codebases.