Research & Papers

TaxDistill uses 500M-parameter GenomeOcean to boost metagenomic annotation F1 by a relative 23%

Knowledge distillation slashes label noise, raising MMseqs2 F1 from 0.763 to 0.941 on gut data.

Deep Dive

Metagenomic taxonomic annotation, the task of identifying which microbe each DNA fragment comes from, is crucial for understanding environmental and gut microbiomes. Traditional similarity-search methods (e.g., MMseqs2, BLAST) struggle with high microbial diversity and incomplete reference databases. Learning-based approaches like Taxometer attempt post hoc correction, but they are trained on labels produced by those same search tools, so they inherit the label noise and their performance suffers. To break this cycle, Rongye Ye and colleagues introduce TaxDistill, a knowledge distillation framework that uses a large pre-trained genomic foundation model as a teacher.

The teacher, GenomeOcean (500M parameters), extracts deep semantic features from DNA sequences and generates soft labels based on confidence scores. These soft labels are distilled into a lightweight student network, which reduces the noise introduced by the initial retrieval tools. Experiments on seven diverse CAMI2 datasets show consistent gains over baselines. On the Gastrointestinal dataset, TaxDistill boosts MMseqs2’s F1 score from 0.763 to 0.941, surpassing Taxometer. The results indicate that distillation from a genomic foundation model is a reliable strategy for label correction in complex metagenomic analysis. Because only the lightweight student runs at inference time, it also offers a practical path to more accurate microbial profiling without massive compute.
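To make the distillation idea concrete, here is a minimal sketch of a generic soft-label loss in plain Python. It is not the paper's implementation: the function names, the `alpha` weighting, and the toy three-taxon numbers are all assumptions made for illustration, and the actual TaxDistill objective may differ.

```python
import math

def softmax(logits):
    """Turn raw student scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))

def distillation_loss(student_logits, teacher_probs, noisy_label, alpha=0.9):
    """Blend the teacher's soft labels with the search tool's hard label.

    alpha is a hypothetical knob: 1.0 trusts only the teacher,
    0.0 trusts only the (possibly noisy) similarity-search label.
    """
    student_probs = softmax(student_logits)
    one_hot = [1.0 if i == noisy_label else 0.0
               for i in range(len(student_logits))]
    soft_term = cross_entropy(teacher_probs, student_probs)
    hard_term = cross_entropy(one_hot, student_probs)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Toy example with three candidate taxa (made-up numbers).
teacher_probs = [0.80, 0.15, 0.05]   # teacher is confident in taxon 0
noisy_label = 1                      # search tool assigned taxon 1
agrees_with_teacher = [3.0, 0.5, 0.0]
agrees_with_search = [0.5, 3.0, 0.0]

loss_teacher_like = distillation_loss(agrees_with_teacher, teacher_probs, noisy_label)
loss_search_like = distillation_loss(agrees_with_search, teacher_probs, noisy_label)
print(loss_teacher_like < loss_search_like)
```

With a high `alpha`, a student that sides with the teacher is penalized less than one that reproduces the noisy search label, which is the sense in which soft labels can dilute label noise in this simplified setting.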

Key Points
  • TaxDistill uses a 500M-parameter genomic foundation model (GenomeOcean) as a teacher to generate soft labels and reduce noise from similarity-search tools.
  • On the CAMI2 Gastrointestinal dataset, the method improves MMseqs2's F1 score from 0.763 to 0.941, a 23% relative gain (about 18 F1 points in absolute terms).
  • TaxDistill shows consistent gains over baselines across seven diverse CAMI2 datasets and surpasses the prior Taxometer approach on the Gastrointestinal dataset, providing a reliable label-correction framework.
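
The 23% figure is a relative improvement, not a jump of 23 F1 points. A short check of the arithmetic, using only the two F1 scores reported above (the helper name `relative_gain` is just for illustration):

```python
def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting value."""
    return (after - before) / before

baseline_f1 = 0.763    # MMseqs2 alone
distilled_f1 = 0.941   # MMseqs2 labels corrected by TaxDistill

absolute = distilled_f1 - baseline_f1
relative = relative_gain(baseline_f1, distilled_f1)
print(f"absolute: {absolute:.3f}, relative: {relative:.1%}")
# prints "absolute: 0.178, relative: 23.3%"
```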

Why It Matters

More accurate taxonomic annotation from noisy short-read data improves microbiome research, diagnostics, and environmental monitoring at scale.