
SMDD-Bench benchmark shows LLMs fail real-world drug design

arXiv cs.AI May 23, 2026

⚡ Even GPT5.4 solves only about 40% of complex molecular design tasks.

Deep Dive

A team of researchers from multiple institutions has released SMDD-Bench, a new benchmark designed to rigorously evaluate large language models on real-world small molecule drug design (SMDD) tasks. The benchmark consists of 502 guaranteed-solvable task instances spanning five distinct types: 2D Pharmacophore Identification, Interaction Point Discovery, Scaffold Hopping, Lead Optimization, and Fragment Assembly. These tasks cover a wide region of chemical space and involve 102 unique protein targets, requiring LLMs to exhibit strong chemical and biological reasoning, 3D intuition, specialized tool use, and planning expertise over a limited number of oracle calls.
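To make the "limited number of oracle calls" constraint concrete, here is a minimal Python sketch of a multi-turn loop with a capped oracle budget. It is an illustration only, not SMDD-Bench's actual interface: `BudgetedOracle`, `run_episode`, `toy_score`, and `toy_propose` are hypothetical names, and the toy scoring function merely stands in for an expensive evaluator such as a docking score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, float]]


class BudgetExhausted(Exception):
    """Raised when more oracle calls are requested than the budget allows."""


@dataclass
class BudgetedOracle:
    # Hypothetical wrapper: score_fn stands in for an expensive evaluator.
    score_fn: Callable[[str], float]
    max_calls: int
    calls_used: int = 0
    history: History = field(default_factory=list)

    def query(self, candidate: str) -> float:
        if self.calls_used >= self.max_calls:
            raise BudgetExhausted(f"budget of {self.max_calls} calls spent")
        self.calls_used += 1
        score = self.score_fn(candidate)
        self.history.append((candidate, score))
        return score


def run_episode(
    propose: Callable[[History], str],
    oracle: BudgetedOracle,
    threshold: float,
) -> Optional[str]:
    """Propose, score, and feed the history back until success or budget end."""
    while oracle.calls_used < oracle.max_calls:
        candidate = propose(oracle.history)
        if oracle.query(candidate) >= threshold:
            return candidate
    return None


# Toy stand-ins (assumptions, not real chemistry): the score is the string
# length, and the proposer adds one carbon atom per turn.
def toy_score(smiles: str) -> float:
    return float(len(smiles))


def toy_propose(history: History) -> str:
    return "C" * (len(history) + 1)


if __name__ == "__main__":
    oracle = BudgetedOracle(toy_score, max_calls=5)
    print(run_episode(toy_propose, oracle, threshold=3.0), oracle.calls_used)
```

In a setup like this, the model acts as the proposer: each turn it sees the scored history and has to decide what to try next, so a poor plan wastes calls it cannot get back.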

In their evaluation of seven frontier open- and closed-source LLMs, the best performer, GPT5.4, solved only 40.2% of the tasks, revealing a significant performance gap even for state-of-the-art models. The researchers argue that existing evaluation methods are ad hoc, too simplistic, or limited to single-turn question answering, which makes SMDD-Bench a much-needed standardized testbed. The benchmark is publicly available with a leaderboard, and the team hopes it will invigorate research toward fully autonomous computational drug design.

Key Points

SMDD-Bench includes 502 tasks across 5 categories: pharmacophore identification, interaction point discovery, scaffold hopping, lead optimization, and fragment assembly.
Seven LLMs were evaluated; GPT5.4 scored highest, with a success rate of only 40.2%.
Tasks require multi-turn reasoning, 3D chemistry intuition, and specialized tool use within limited oracle calls.

Why It Matters

Standardized evaluation reveals critical limitations in LLM-based drug design, guiding future research toward autonomous discovery.
