VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models
New AI framework automates creation of high-quality scientific datasets from dense literature.
A research team led by Blessy Antony from Virginia Tech has introduced VILLA (Versatile Information Retrieval from scientific Literature using Large LAnguage models), a novel multi-step retrieval-augmented generation (RAG) framework designed to tackle complex scientific information extraction (SIE). The study addresses a critical gap in AI for science: the lack of high-quality training datasets. Existing SIE methods are often limited to broad topics like biomedicine, choice-based tasks, and well-formatted text. VILLA, by contrast, is engineered for open-ended, domain-specific queries, demonstrated here in virology, a field largely overlooked by prior SIE work.
The team curated a new ground-truth dataset of 629 mutations across ten influenza A virus proteins, manually extracted from 239 publications. This dataset served as the benchmark for their distinctive task: extracting mutations that modify virus-host interactions. In comprehensive evaluations, VILLA's multi-step RAG approach, which appears to involve iterative query refinement and synthesis, significantly outperformed both a vanilla RAG baseline and other advanced RAG- and agent-based SIE tools. The paper is currently under review at ACM KDD 2026, highlighting its potential impact on AI-driven scientific discovery.
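To make the multi-step idea concrete, here is a minimal sketch of an iterative retrieve-extract-refine loop of the kind the article describes. Everything below is a toy illustration: the keyword retriever, the regex-based "extractor", and the two-document corpus are stand-ins, not VILLA's actual components or data.

```python
import re

# Toy corpus standing in for a literature index (not the paper's 239 publications).
CORPUS = {
    "doc1": "Mutation E627K in PB2 enhances polymerase activity in mammals.",
    "doc2": "Mutation D701N in PB2 also adapts the virus to mammalian hosts.",
}

def retrieve(query, corpus):
    """Toy keyword retriever: return passages sharing any word with the query."""
    words = set(query.lower().split())
    return [text for text in corpus.values()
            if words & set(text.lower().split())]

def extract_mutations(passages):
    """Toy extraction step: pull tokens shaped like point mutations (e.g. E627K)."""
    found = set()
    for passage in passages:
        found.update(re.findall(r"\b[A-Z]\d+[A-Z]\b", passage))
    return found

def multi_step_rag(seed_query, corpus, max_steps=3):
    """Iteratively refine the query with newly extracted entities and re-retrieve."""
    query, mutations = seed_query, set()
    for _ in range(max_steps):
        passages = retrieve(query, corpus)
        new = extract_mutations(passages) - mutations
        if not new:  # converged: the last pass surfaced nothing new
            break
        mutations |= new
        # Refinement step: fold extracted entities back into the next query.
        query = seed_query + " " + " ".join(sorted(new))
    return mutations

print(sorted(multi_step_rag("PB2 mutation host adaptation", CORPUS)))
# → ['D701N', 'E627K']
```

The loop terminates when a pass yields no new entities, which is one plausible stopping rule for this kind of iterative extraction; a real system would replace the keyword retriever with dense retrieval and the regex with an LLM extraction prompt.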
- VILLA is a new multi-step RAG framework for extracting complex scientific information from literature.
- It was tested on a novel virology task, extracting 629 influenza A virus mutations from 239 papers to create a benchmark dataset.
- The framework demonstrated superior performance over vanilla RAG and other state-of-the-art agent-based SIE tools.
Why It Matters
Automates the labor-intensive creation of high-quality scientific datasets, accelerating AI-driven research in specialized fields.