Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
A new study shows that LLM performance drops by up to 40% when answering questions about legacy code such as COBOL.
Researchers Kishan Maharaj, Nandakishore Menon, Ashita Saxena, and Srikanth Tamilselvam published a paper analyzing the robustness of LLMs in long-context code question answering. They extended the LongCodeBench dataset with COBOL and Java, and tested models under three perturbations: shuffled answer options, open-ended rather than multiple-choice questions, and "needle-in-a-haystack" contexts in which the relevant code is buried among irrelevant material. Models showed substantial performance drops and brittle behavior in the presence of irrelevant information, highlighting a key limitation of current AI coding assistants.
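To make the perturbations concrete, here is a minimal sketch of what option shuffling and needle-in-a-haystack construction might look like. This is an illustrative assumption, not the paper's actual harness; the record fields (`options`, `answer_index`) and helper names are hypothetical:

```python
import random

def shuffle_options(question: dict, seed: int = 0) -> dict:
    """Return a copy of a multiple-choice record with its options reordered.

    A robust model's answer should not depend on option order.
    (Hypothetical record format: {"options": [...], "answer_index": int}.)
    """
    rng = random.Random(seed)
    options = list(question["options"])
    rng.shuffle(options)
    shuffled = dict(question)
    shuffled["options"] = options
    # Track where the correct answer moved after shuffling.
    correct_text = question["options"][question["answer_index"]]
    shuffled["answer_index"] = options.index(correct_text)
    return shuffled

def needle_in_haystack(snippet: str, distractors: list[str], position: int) -> str:
    """Bury the relevant code snippet ("needle") among irrelevant files."""
    files = distractors[:position] + [snippet] + distractors[position:]
    return "\n\n".join(files)
```

Under this setup, a model is brittle if its answer changes between the original and shuffled records, or if its accuracy falls as more distractor files are added around the needle.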
Why It Matters
This exposes critical weaknesses in AI tools developers rely on for understanding legacy systems and complex codebases.