Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN
New research reconciles contradictory findings on whether LLMs can update outdated information using external context.
A research team from the University of Glasgow, led by Ritajit Dey, Iadh Ounis, Graham McDonald, and Yashar Moshfeghi, published a reproducibility paper that resolves a significant contradiction in AI research. Two prominent studies, DYNAMICQA and MULAN, had reported opposite conclusions on whether external context (like a news article) can effectively update outdated temporal facts in Large Language Models (LLMs). DYNAMICQA found temporal facts were resistant to change, while MULAN concluded they were easier to update. The team's work aimed to identify the source of this disagreement.
To do this, they first reproduced the original experiments from both benchmarks, then applied each study's methodology to the other's dataset, standardizing the evaluation settings for a direct comparison. A key step was using an LLM to synthetically generate realistic natural-language contexts in place of MULAN's programmatically constructed statements, aligning MULAN with DYNAMICQA's framework. Their analysis revealed strong dataset dependence: MULAN's findings held under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA's data yielded mixed results.
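To make that alignment step concrete, below is a minimal sketch of what such a pipeline could look like: rewriting a structured fact into a natural-language context, then checking whether that context changes the model's answer. Everything in it is an illustrative assumption, not the authors' implementation; the model name, prompt wording, helper functions, and example fact are hypothetical.

```python
# Hypothetical sketch, not the paper's actual code: the model, prompts, and
# example fact below are illustrative assumptions only.
from transformers import pipeline

# Any instruction-tuned causal LM works here; this 7B chat model is just an example.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")


def triple_to_context(subject: str, relation: str, new_object: str) -> str:
    """Rewrite a programmatically constructed fact (MULAN-style) into a
    realistic natural-language passage, as DYNAMICQA's setup expects."""
    prompt = (
        "Rewrite the following fact as one short, realistic news-style sentence:\n"
        f"{subject} {relation} {new_object}.\nSentence:"
    )
    out = generator(prompt, max_new_tokens=40, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()


def answer(question: str, context: str | None = None) -> str:
    """Query the model with or without the updating context prepended."""
    prompt = (f"Context: {context}\n" if context else "") + f"Question: {question}\nAnswer:"
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()


# Compare the parametric (no-context) answer with the context-conditioned answer
# to see whether the injected temporal fact overrides the model's stored knowledge.
question = "Who is the head coach of FC Example?"  # hypothetical temporal fact
context = triple_to_context("FC Example", "is coached by", "Jane Doe")
print("Without context:", answer(question))
print("With context:   ", answer(question, context))
```

In an actual evaluation, this comparison would be run over the full benchmark and scored with each study's own metrics, which is where the dataset and metric dependence the paper identifies becomes visible.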
Furthermore, while the original studies tested only 7B-parameter models, the Glasgow team expanded the scope to LLMs of varying sizes, revealing how model scale influences the encoding and updating of temporal knowledge and adding another critical dimension to the findings. The paper concludes that LLM behavior under temporal fact conflicts is not absolute but is significantly shaped by three factors: the design of the evaluation dataset, the chosen performance metrics, and the size of the model being tested.
- The study reconciles contradictory findings from DYNAMICQA and MULAN benchmarks, showing MULAN's conclusion that temporal facts are easier to update generalizes more broadly.
- Methodology involved standardizing datasets, using an LLM to generate synthetic contexts, and cross-testing each study's framework on the other's data.
- Expanded testing beyond 7B models revealed that model size is a key factor influencing how LLMs encode and can update temporal knowledge.
Why It Matters
These findings provide practical guidance for developers building reliable RAG systems and AI agents that must handle real-world, evolving information.