In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts
A blinding study shows top AI models fail at genuine reasoning when predicting molecular properties.
A team of researchers from the Technical University of Munich and the Helmholtz Institute published a groundbreaking study titled "In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts." The paper investigates a critical question in AI for science: when large language models (LLMs) predict molecular properties like solubility or energy, are they actually reasoning from new examples (in-context learning) or just recalling memorized data from their training sets? The researchers developed a novel "blinding" framework to systematically control the information available to the models during testing.
They evaluated nine LLM variants from the GPT-4.1, GPT-5, and Gemini 2.5 families on three established chemistry benchmarks: the Delaney solubility, Lipophilicity, and QM7 atomization energy datasets. Their method progressively reduced the available context, from full information down to fully blinded scenarios, while varying the number of in-context examples from zero-shot to 1000-shot. The results were revealing: performance degraded sharply as contextual information was removed, strongly suggesting that the models were not performing robust regression on novel data but were instead heavily dependent on prior exposure. This work provides a principled methodology for evaluating AI in scientific domains and highlights a major challenge, training data contamination, that must be addressed before LLMs can be trusted for genuine discovery on unseen molecules.
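The ablation protocol described above can be sketched as a loop over blinding levels and shot counts. The sketch below is illustrative only: the blinding level names, the masking scheme, and the `dummy_llm` stand-in (which simply predicts the mean of in-context values) are assumptions for demonstration, not the paper's actual implementation.

```python
import random
import statistics

# Hypothetical blinding levels (assumption: names are illustrative,
# not the paper's terminology).
BLIND_LEVELS = ["full", "fully_blinded"]

def build_prompt(examples, query_smiles, level):
    """Assemble an in-context prompt; under blinding, the molecular
    identity (here, the SMILES string) is masked out."""
    lines = []
    for smiles, value in examples:
        shown = smiles if level == "full" else "<masked>"
        lines.append(f"Molecule: {shown}  Property: {value:.3f}")
    shown_q = query_smiles if level == "full" else "<masked>"
    lines.append(f"Molecule: {shown_q}  Property: ?")
    return "\n".join(lines)

def dummy_llm(prompt):
    """Stand-in for a real LLM call: predicts the mean of the
    in-context property values (a trivial, memorization-free baseline)."""
    values = [float(line.rsplit(" ", 1)[-1])
              for line in prompt.splitlines()
              if not line.endswith("?")]
    return statistics.mean(values) if values else 0.0

def evaluate(dataset, shots, level, seed=0):
    """Leave-one-out evaluation: for each molecule, sample `shots`
    in-context examples from the rest and score the prediction (MAE)."""
    rng = random.Random(seed)
    errors = []
    for smiles, true_val in dataset:
        pool = [d for d in dataset if d[0] != smiles]
        examples = rng.sample(pool, min(shots, len(pool)))
        pred = dummy_llm(build_prompt(examples, smiles, level))
        errors.append(abs(pred - true_val))
    return statistics.mean(errors)

# Toy solubility-like data (hypothetical values, not from Delaney).
data = [("CCO", -0.77), ("c1ccccc1", -2.13),
        ("CC(=O)O", 0.09), ("CCCCCC", -3.96)]
results = {(level, k): evaluate(data, k, level)
           for level in BLIND_LEVELS for k in (0, 2, 3)}
```

In the real study, `dummy_llm` would be replaced by API calls to the GPT or Gemini models, and comparing error across the `(level, shots)` grid is what separates genuine in-context regression from recall of memorized training data.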
- The study tested 9 LLM variants (GPT-4.1, GPT-5, Gemini 2.5) on 3 MoleculeNet chemistry datasets using a novel blinding framework.
- Performance dropped sharply when models were "blinded" to contextual info, indicating reliance on memorized training data over in-context reasoning.
- The work establishes a critical evaluation standard for AI in science, exposing risks of data contamination in popular benchmarks.
Why It Matters
This challenges the reliability of AI for drug discovery and materials science, where models must reason about truly novel compounds.