When Stability Fails: Hidden Failure Modes Of LLMs in Data-Constrained Scientific Decision-Making
Research shows AI models achieve perfect reproducibility while systematically diverging from statistical truth.
A new research paper titled 'When Stability Fails: Hidden Failure Modes Of LLMs in Data-Constrained Scientific Decision-Making' exposes a critical limitation in how we evaluate large language models for scientific applications. Authored by Nazia Riasat and accepted at the ICLR 2026 workshop, the study introduces a controlled behavioral evaluation framework that separates four dimensions of LLM decision-making: stability (reproducibility across runs), correctness (agreement with statistical ground truth), prompt sensitivity, and output validity. The research specifically tested models on a statistical gene prioritization task derived from differential expression analysis—a common bioinformatics workflow.
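The four dimensions can be made concrete with a small scoring sketch. This is not the paper's code: the metric choices (Jaccard agreement), the `evaluate` helper, and the gene sets are all illustrative assumptions about how such a framework could be implemented.

```python
# Illustrative sketch (not the paper's implementation): scoring repeated LLM
# runs on a gene-selection task along three of the four dimensions described
# above. All gene names, run outputs, and ground-truth sets are made up.
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two gene sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def evaluate(runs, ground_truth, input_genes):
    """Score a list of per-run gene selections."""
    # Stability: mean pairwise agreement across repeated runs.
    pairs = list(combinations(runs, 2))
    stability = sum(jaccard(x, y) for x, y in pairs) / len(pairs)
    # Correctness: agreement of each run with the statistical ground truth.
    correctness = sum(jaccard(r, ground_truth) for r in runs) / len(runs)
    # Output validity: fraction of returned identifiers present in the input.
    returned = [g for r in runs for g in r]
    validity = sum(g in input_genes for g in returned) / len(returned)
    return {"stability": stability, "correctness": correctness, "validity": validity}

# Perfectly stable runs that are nonetheless wrong and include a fabricated ID.
runs = [["TP53", "BRCA1", "FAKE9"]] * 3
scores = evaluate(runs,
                  ground_truth={"TP53", "EGFR"},
                  input_genes={"TP53", "BRCA1", "EGFR"})
print(scores)  # stability is 1.0 even though correctness and validity are not
```

Prompt sensitivity, the fourth dimension, would be measured the same way: run `evaluate` once per prompt variant and compare the resulting selections across variants.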
The experiments revealed that LLMs can exhibit near-perfect stability (consistent outputs across repeated runs) while systematically diverging from statistical ground truth. In practical tests, models over-selected genes under relaxed significance thresholds, responded sharply to minor changes in prompt wording, and even produced syntactically plausible gene identifiers that were not present in the input data. This shows that stability, often used as a proxy for reliability, does not guarantee correctness: a model can reproduce the same wrong answer on every run. The findings challenge evaluation practices that emphasize reproducibility without validating against known statistical ground truth.
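In a differential-expression setting, the statistical reference itself is well defined, which is what makes over-selection measurable. The sketch below shows one plausible shape of that comparison; the threshold, the adjusted p-values, and the model's selection are invented for illustration and are not the paper's data.

```python
# Hypothetical sketch: derive the statistical reference set from
# differential-expression results via an adjusted p-value cutoff, then
# measure how far a model's selection over-shoots it. Values are made up.
de_results = {  # gene -> adjusted p-value (illustrative)
    "TP53": 0.001, "EGFR": 0.02, "BRCA1": 0.04, "MYC": 0.20, "KRAS": 0.60,
}

def ground_truth_selection(results, alpha=0.05):
    """Genes passing the significance threshold; this is the reference set."""
    return {g for g, p in results.items() if p <= alpha}

truth = ground_truth_selection(de_results)          # {"TP53", "EGFR", "BRCA1"}
model_selection = {"TP53", "EGFR", "BRCA1", "MYC"}  # adds a non-significant gene

over_selection = (len(model_selection) - len(truth)) / len(truth)
print(f"over-selection: {over_selection:.0%}")
```

Relaxing `alpha` enlarges the reference set, which is why the reported over-selection depends on how strictly the significance threshold is enforced.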
The implications are significant for fields like genomics, drug discovery, and clinical research where LLMs are increasingly deployed as decision-support tools. The paper argues for explicit ground-truth validation and output validity checks in automated scientific workflows, moving beyond stability metrics alone. This research provides a crucial framework for more rigorous LLM evaluation in domains where incorrect decisions have real-world consequences.
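An output validity check of the kind the paper calls for can be as simple as a hard gate between the model and the downstream pipeline. The function name and behavior below are illustrative, not taken from the paper.

```python
# One possible shape of an output validity gate: reject any model output
# that references a gene identifier absent from the analysed dataset.
def check_output_validity(selected, input_genes):
    """Raise if the model returned identifiers not present in the input data."""
    hallucinated = set(selected) - set(input_genes)
    if hallucinated:
        raise ValueError(f"identifiers not in input data: {sorted(hallucinated)}")
    return list(selected)

check_output_validity(["TP53", "EGFR"], {"TP53", "EGFR", "BRCA1"})  # passes
# check_output_validity(["TP53", "FAKE9"], {"TP53"}) would raise ValueError
```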
- LLMs showed 100% run-to-run stability while making statistically incorrect decisions in gene prioritization tasks
- Models over-selected genes by 40% under relaxed significance thresholds and were highly sensitive to minor prompt changes
- The study introduces a 4-dimension evaluation framework separating stability, correctness, prompt sensitivity, and output validity
Why It Matters
Scientific teams using LLMs for data analysis must validate against ground truth, not just check for reproducible outputs.