Benchmark for Assessing Olfactory Perception of Large Language Models
LLMs scored just 64.4% on smell tests, revealing they rely on word associations over molecular reasoning.
A team of researchers from Yale University and other institutions has introduced the Olfactory Perception (OP) benchmark, a novel test designed to evaluate how well large language models (LLMs) can reason about smell. The comprehensive benchmark contains 1,010 questions spanning eight distinct task categories, including odor classification, intensity judgment, and predicting olfactory receptor activation. To assess different reasoning approaches, each question is presented in two formats: using common compound names (like 'vanillin') and using isomeric SMILES strings, which are text-based representations of a molecule's structure.
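To make the dual-format setup concrete, here is a minimal sketch of how one benchmark item might be rendered both ways. The question wording, field names, and `make_prompts` helper are illustrative assumptions, not the benchmark's actual schema; only the vanillin example and the name-vs-SMILES contrast come from the article.

```python
# Hypothetical sketch: render one olfactory question in both formats.
# The template text and dict keys are made up for illustration.

def make_prompts(name: str, smiles: str, question: str) -> dict:
    """Fill the same question template with a compound name and with SMILES."""
    return {
        "name_prompt": question.format(molecule=name),
        "smiles_prompt": question.format(molecule=smiles),
    }

# Vanillin by common name vs. its SMILES structural representation.
prompts = make_prompts(
    name="vanillin",
    smiles="COc1cc(C=O)ccc1O",
    question="Which odor category best describes {molecule}: sweet, floral, or pungent?",
)
print(prompts["name_prompt"])
print(prompts["smiles_prompt"])
```

The two prompts differ only in how the molecule is referenced, which is what lets the benchmark isolate lexical recall from structural reasoning.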
In evaluating 21 configurations across major model families, the researchers found a significant performance gap. Models consistently scored higher when prompted with compound names, outperforming SMILES-based prompts by an average of approximately 7 percentage points, with gains as high as +18.9 points. This suggests current LLMs access olfactory knowledge primarily through lexical associations learned from text, rather than through an understanding of molecular structure. The best-performing model achieved only 64.4% overall accuracy, highlighting a substantial capability gap in this sensory domain.
The study also explored multilingual performance, testing a subset of the benchmark across 21 languages. Interestingly, aggregating predictions from models prompted in different languages improved performance, with the best ensemble model achieving an AUROC (Area Under the Receiver Operating Characteristic curve) score of 0.86. This indicates that knowledge about smell is distributed across languages in training data, and combining these perspectives can yield better predictions. The work establishes a crucial baseline for a neglected area of AI sensory reasoning.
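The cross-lingual ensembling idea can be sketched as follows. This is not the authors' pipeline: the language keys, toy scores, and simple score-averaging are assumptions for illustration. The `auroc` function implements the standard rank-based (Mann-Whitney) formulation of the metric the study reports.

```python
# Illustrative sketch: average per-language model scores for a binary
# smell property, then evaluate the ensemble with AUROC.

def auroc(labels, scores):
    """AUROC: probability a random positive item outscores a random negative."""
    pairs = sorted(zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # group tied scores together
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)

# Toy data: scores from the same model prompted in three languages,
# for six molecules (label 1 = molecule has the smell property).
labels = [1, 0, 1, 0, 1, 0]
per_language = {
    "en": [0.9, 0.4, 0.3, 0.6, 0.8, 0.2],
    "de": [0.7, 0.5, 0.8, 0.3, 0.6, 0.4],
    "ja": [0.8, 0.2, 0.6, 0.5, 0.9, 0.3],
}
ensemble = [sum(s) / len(per_language) for s in zip(*per_language.values())]
print(round(auroc(labels, ensemble), 3))
```

Averaging washes out language-specific errors (note how the English scores alone misrank the third molecule, while the ensemble does not), which is the intuition behind the reported ensemble gains.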
- The OP benchmark contains 1,010 questions across 8 smell-related tasks, from classification to receptor prediction.
- Models scored an average of 7 points higher using compound names vs. molecular SMILES, suggesting they rely on word associations rather than structural reasoning.
- The top score was just 64.4% accuracy, while ensembling predictions across languages boosted performance to an AUROC of 0.86.
Why It Matters
This result exposes a fundamental weakness in AI 'understanding' and paves the way for models that can genuinely reason about chemistry and sensory data.