Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
A new, formally verified metric shows LLMs fail to outperform simple ML classifiers in selecting drug candidates.
Researchers Abhinaba Basu and Pavan Chakraborty have published a paper introducing the Budget-Sensitive Discovery Score (BSDS), a formally verified framework designed to close a critical gap in AI-assisted science. As AI systems, especially large language models (LLMs), are increasingly used to propose candidates for expensive experimental validation, such as potential drug molecules, there has been no principled, budget-aware way to compare selection strategies. The BSDS framework, verified with 20 theorems in the Lean 4 proof assistant, defines a metric that jointly penalizes false discoveries and excessive abstention at every budget level, preventing proposers from gaming the evaluation by performing well only at a cherry-picked budget.
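The paper's exact definitions live in its Lean 4 formalization, but the intuition can be sketched in a few lines. The Python snippet below is only an illustration of the described idea, not the authors' BSDS/DQS formula: it scores a ranked candidate list at several budget levels, penalizing false discoveries and unused budget (abstention) at each level, then averages across budgets so no single budget can be cherry-picked. The function name, penalty weights, and normalization are assumptions.

```python
# Illustrative sketch only: NOT the paper's BSDS/DQS definition.
# It mimics the described idea of a score that, at every budget level,
# penalizes false discoveries and excessive abstention, then aggregates
# across budgets so no single cherry-picked budget dominates.
# The penalty weights (alpha, beta) and the aggregation are assumptions.

def budget_sensitive_score(selections, labels, budgets, alpha=1.0, beta=1.0):
    """selections: candidate indices in the proposer's priority order; it may be
    shorter than the largest budget, i.e. the proposer abstains on the remainder.
    labels: dict mapping candidate index -> True if it is a real discovery.
    budgets: iterable of budget levels to evaluate.
    """
    per_budget = []
    for b in budgets:
        chosen = selections[:b]                      # what gets tested under budget b
        true_hits = sum(labels[i] for i in chosen)   # confirmed discoveries
        false_discoveries = len(chosen) - true_hits  # wasted experiments
        abstentions = max(b - len(chosen), 0)        # unused budget
        score = true_hits - alpha * false_discoveries - beta * abstentions
        per_budget.append(score / b)                 # normalize by budget size
    # Average over budgets so a proposer cannot game one budget level.
    return sum(per_budget) / len(per_budget)


# Example: a proposer ranks three of five candidates; indices 0 and 2 are true hits.
labels = {0: True, 1: False, 2: True, 3: False, 4: False}
print(budget_sensitive_score([0, 2, 1], labels, budgets=[1, 2, 5]))
```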
In a major case study applying BSDS to drug discovery, the researchers delivered a sobering verdict on the current utility of LLMs. They evaluated 39 AI 'proposers' on the MoleculeNet HIV dataset, which contains over 41,000 compounds: 11 mechanistic machine learning variants and 28 LLM configurations (14 zero-shot and 14 few-shot) operating on SMILES representations. The results were clear: a simple Random Forest-based model named Greedy-ML achieved the best Discovery Quality Score of -0.046, and no LLM configuration surpassed it, leading the authors to conclude that LLMs currently provide "no marginal value" over an existing trained classifier for this task.
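For context, a Greedy-ML-style proposer can be approximated with standard tooling: train a classifier on fingerprinted SMILES and greedily take the top-ranked compounds up to the budget. The sketch below is not the authors' implementation; the fingerprint settings, hyperparameters, file names, and column names are assumptions.

```python
# Sketch of a Greedy-ML-style baseline: a trained classifier ranks compounds
# by predicted probability of activity and the budget is filled greedily.
# NOT the authors' implementation; featurization, hyperparameters, and the
# CSV layout ('smiles'/'active' columns, hiv_*.csv files) are assumptions.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles_list, n_bits=2048):
    """Morgan fingerprints for each SMILES string (invalid SMILES -> all zeros)."""
    feats = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            feats.append([0] * n_bits)
        else:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
            feats.append(list(fp))
    return feats

# Hypothetical CSV exports of a MoleculeNet HIV split.
train = pd.read_csv("hiv_train.csv")
pool = pd.read_csv("hiv_pool.csv")

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(featurize(train["smiles"]), train["active"])

# Greedy selection: take the top-k compounds by predicted activity for budget k.
budget = 100
scores = clf.predict_proba(featurize(pool["smiles"]))[:, 1]
selected = pool.assign(score=scores).nlargest(budget, "score")
print(selected[["smiles", "score"]].head())
```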
The study's findings proved robust, generalizing across five different MoleculeNet benchmarks with prevalence rates from 0.18% to 46.2%, and across a grid of penalty parameters. This work provides the scientific community with a rigorous, game-proof tool for evaluating AI selection strategies in any resource-constrained domain, from materials science to biology, where the cost of false positives and missed discoveries is high.
- The BSDS/DQS framework is formally verified with 20 theorems machine-checked by the Lean 4 proof assistant, ensuring mathematical rigor.
- In testing on the MoleculeNet HIV dataset (41,127 compounds), a simple Random Forest model (Greedy-ML) outperformed all 28 tested LLM configurations.
- The proposer hierarchy and conclusion—that LLMs add no marginal value over a trained classifier—generalized across five benchmarks and a wide range of parameters.
Why It Matters
Provides a rigorous, game-proof standard for evaluating AI-guided selection in costly research, preventing resources from being wasted on ineffective methods.