2D-ProteinRAG: new dual-dimensional RAG framework boosts protein-text QA
Outperforms baselines on out-of-distribution benchmarks by integrating BLAST with a dual-dimensional filtering strategy.
Protein-Text Question Answering (QA) is essential for interpreting biological sequences via natural language, but standard Retrieval-Augmented Generation (RAG) methods rely on curated static datasets and struggle with novel, out-of-distribution (OOD) proteins. 2D-ProteinRAG closes this gap by letting LLMs work within BLAST, the gold-standard sequence-search workflow in biological research. The framework introduces a dual-dimensional filtering strategy with two components. Horizontal Fine-grained Attribute Alignment prunes irrelevant metadata and aligns database entries with the specific user query. Vertical Homology-based Semantic Denoising uses hierarchical clustering to resolve functional contradictions across multiple homologs.
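To make the horizontal step concrete, here is a minimal sketch of intent-aware attribute pruning. It is not the authors' implementation: the keyword table `INTENT_FIELDS`, the field names, the hit layout, and the function names are all hypothetical, and a real system would presumably use something richer than keyword matching to infer what a query asks about.

```python
# Hypothetical mapping from query keywords to annotation fields.
# These names are illustrative and do not come from the 2D-ProteinRAG paper.
INTENT_FIELDS = {
    "function": {"function", "catalytic_activity"},
    "location": {"subcellular_location"},
    "pathway": {"pathway"},
}


def relevant_fields(query):
    """Return the annotation fields whose trigger keyword appears in the query."""
    q = query.lower()
    fields = set()
    for keyword, names in INTENT_FIELDS.items():
        if keyword in q:
            fields |= names
    return fields


def horizontal_filter(hits, query):
    """Prune each hit to query-relevant fields; drop hits left with nothing."""
    keep = relevant_fields(query)
    filtered = []
    for hit in hits:
        pruned = {
            name: value
            for name, value in hit["annotations"].items()
            if name in keep and value
        }
        if pruned:
            filtered.append(
                {"id": hit["id"], "identity": hit["identity"], "annotations": pruned}
            )
    return filtered


if __name__ == "__main__":
    # Made-up homolog records standing in for annotated BLAST hits.
    example_hits = [
        {
            "id": "P1",
            "identity": 91.0,
            "annotations": {
                "function": "ATP-dependent kinase",
                "subcellular_location": "Cytoplasm",
            },
        },
        {
            "id": "P2",
            "identity": 60.0,
            "annotations": {"subcellular_location": "Membrane"},
        },
    ]
    for hit in horizontal_filter(example_hits, "What is the function of this protein?"):
        print(hit["id"], hit["annotations"])
```

In this toy setup, a question about function keeps only the function-related fields of each hit, so location metadata never reaches the LLM prompt, and hits with no relevant fields are discarded entirely.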
In evaluations on both in-distribution and diverse biological OOD benchmarks, 2D-ProteinRAG consistently achieves state-of-the-art performance, surpassing fine-tuned baselines as well as other RAG methods. The results support the framework's robustness and scalability, and position it as a practical way to interpret protein functions in real-world scientific scenarios and as a step toward more accurate, generalizable AI tools for biology.
- 2D-ProteinRAG integrates the BLAST biological workflow instead of relying on static, curated datasets.
- A horizontal, intent-aware filter aligns database entries with the user query and prunes irrelevant metadata.
- Vertical, clustering-based denoising resolves functional contradictions across multiple homologs, improving OOD generalization.
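The vertical step can be illustrated with a small, dependency-free sketch: group homologs whose functional annotations look alike, then keep the group with the strongest sequence support. Everything specific here is an assumption rather than the paper's method: the Jaccard distance over annotation words, the average-linkage merge rule, the `threshold` value, and the "highest summed identity wins" selection are stand-ins chosen only to show how hierarchical clustering can separate contradictory annotations.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of annotation words."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerative_clusters(token_sets, threshold):
    """Average-linkage agglomerative clustering over index lists.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (an assumed cut-off).
    """
    clusters = [[i] for i in range(len(token_sets))]

    def linkage(c1, c2):
        total = sum(
            jaccard_distance(token_sets[i], token_sets[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def vertical_denoise(hits, threshold=0.6):
    """Keep the homolog cluster with the largest summed sequence identity."""
    if not hits:
        return []
    token_sets = [set(hit["function"].lower().split()) for hit in hits]
    clusters = agglomerative_clusters(token_sets, threshold)
    best = max(clusters, key=lambda c: sum(hits[i]["identity"] for i in c))
    return [hits[i] for i in sorted(best)]


if __name__ == "__main__":
    # Made-up homologs: two agree on a kinase role, one contradicts them.
    example_hits = [
        {"id": "P1", "identity": 92.0, "function": "ATP dependent protein kinase"},
        {"id": "P2", "identity": 88.0, "function": "serine protein kinase ATP binding"},
        {"id": "P3", "identity": 45.0, "function": "DNA binding transcription repressor"},
    ]
    print([hit["id"] for hit in vertical_denoise(example_hits)])
```

With these made-up records, the two kinase annotations merge into one cluster and the low-identity repressor annotation is left on its own and dropped, so the contradictory description never reaches the generator. Word overlap is a crude proxy for semantic agreement; it is used here only to keep the example self-contained.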
Why It Matters
By retrieving evidence through BLAST rather than a static dataset, 2D-ProteinRAG enables robust, scalable interpretation of novel protein functions via natural language, which could accelerate biological discovery.