Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models
New knowledge distillation technique transfers learned rules from decision trees to LLMs, narrowing the gap with specialist models.
A research team from the University of Notre Dame and other institutions has introduced TreeKD, a method for improving Large Language Models (LLMs) on Molecular Property Prediction (MPP), a key task in drug discovery. The core idea is knowledge distillation from specialist models. The method first trains decision trees on functional group features derived from molecular structures. The predictive rules these trees learn are then 'verbalized,' that is, translated into natural language explanations. The result is a rule-augmented context that LLMs can learn from, giving them structural insights that are difficult to extract from raw SMILES strings alone.
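To make the verbalization step concrete, here is a minimal Python sketch of how a single decision-tree path might be turned into a natural language rule and placed in a prompt. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `verbalize_path` function, the hand-built toy tree, the functional group names, the labels, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a toy decision tree over functional-group counts (hypothetical)."""
    feature: Optional[str] = None   # functional group tested at this node
    threshold: int = 0              # go left if count <= threshold, else right
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None     # set only on leaf nodes


def verbalize_path(tree: Node, counts: Dict[str, int]) -> str:
    """Follow the molecule's path through the tree and describe it in words."""
    conditions: List[str] = []
    node = tree
    while node.label is None:
        value = counts.get(node.feature, 0)
        if value <= node.threshold:
            conditions.append(f"at most {node.threshold} {node.feature} group(s)")
            node = node.left
        else:
            conditions.append(f"more than {node.threshold} {node.feature} group(s)")
            node = node.right
    joined = " and ".join(conditions)
    return f"Rule: if the molecule has {joined}, then the predicted label is {node.label}."


# Assumption: a hand-built tree for illustration; a real tree would be fit on training data.
toy_tree = Node(
    feature="hydroxyl", threshold=1,
    left=Node(label="low"),
    right=Node(
        feature="amine", threshold=0,
        left=Node(label="low"),
        right=Node(label="high"),
    ),
)

# Functional-group counts for 3-amino-1,2-propanediol (SMILES: NCC(O)CO).
rule_text = verbalize_path(toy_tree, {"hydroxyl": 2, "amine": 1})

# Rule-augmented context: the SMILES string plus the verbalized rule.
prompt = f"SMILES: NCC(O)CO\n{rule_text}\nQuestion: is the property high or low?"
print(prompt)
```

In this sketch the rule text, rather than the raw tree, is what reaches the LLM, so the model sees the structural reasoning in plain language alongside the SMILES string.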
To further boost performance, the team introduced 'rule-consistency,' a test-time scaling technique inspired by ensemble methods such as bagging. It aggregates the LLM's predictions across diverse rules extracted from a Random Forest, which makes the final prediction more robust and accurate. In experiments on 22 ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties from the Therapeutics Data Commons (TDC) benchmark, TreeKD substantially improved LLM performance. The method narrows the performance gap between generalist LLMs and specialized, state-of-the-art models, a step toward making LLMs practically viable for the early stages of drug discovery and development.
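The aggregation idea behind rule-consistency can be sketched in a few lines of Python. This is a simplified illustration, assuming a classification task and a plain majority vote; the source only says that predictions are aggregated, so the voting scheme, the `rule_consistency` function, the prompt wording, and the `stub_llm` stand-in for a real model call are all assumptions.

```python
from collections import Counter
from typing import Callable, List


def rule_consistency(smiles: str, rules: List[str], ask_llm: Callable[[str], str]) -> str:
    """Query the model once per rule and return the most frequent answer."""
    answers = []
    for rule in rules:
        # One rule-augmented prompt per rule (for example, one per tree in a forest).
        prompt = f"SMILES: {smiles}\n{rule}\nAnswer 'high' or 'low'."
        answers.append(ask_llm(prompt).strip().lower())
    # Majority vote; on a tie, Counter keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: echoes the label named in the rule."""
    return "high" if "label is high" in prompt else "low"


# Hypothetical verbalized rules, as if taken from three different trees.
rules = [
    "Rule: if the molecule has more than 1 hydroxyl group(s), then the predicted label is high.",
    "Rule: if the molecule has at most 0 halogen group(s), then the predicted label is high.",
    "Rule: if the molecule has more than 0 amine group(s), then the predicted label is low.",
]

final_answer = rule_consistency("NCC(O)CO", rules, stub_llm)
print(final_answer)
```

For a regression property, the same loop could average numeric answers instead of voting. Either way, the cost grows linearly with the number of rules, since each rule requires one extra model call at test time.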
- TreeKD distills knowledge from tree-based specialist models by verbalizing their learned rules into natural language for LLMs.
- The method improved LLM performance on 22 ADMET properties from the TDC benchmark, narrowing the gap with state-of-the-art specialist models.
- A test-time technique, 'rule-consistency,' aggregates predictions across diverse Random Forest rules for more robust outputs.
Why It Matters
TreeKD makes generalist LLMs more accurate at predicting molecular properties, which could make them more practical for early-stage drug discovery and speed up research.