RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support
With ~19,000 depositor messages received in 2025, this RAG system aims to cut biocurator workload with citation-backed answers.
The RCSB Protein Data Bank (PDB) has deployed an AI-powered Help Desk using Retrieval-Augmented Generation (RAG) to streamline support for structural biologists depositing 3D macromolecular structures. With over 245,000 structures in the PDB and ~19,000 messages received from depositors in 2025, the system addresses a critical bottleneck faced by ~20 expert biocurators who handle >40% of global depositions. Built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini, the system uses pymupdf4llm for Markdown-preserving PDF extraction, two-stage document chunking, and Maximal Marginal Relevance retrieval to ensure relevant, citation-backed answers.
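To make the retrieval step concrete, here is a minimal, dependency-free sketch of Maximal Marginal Relevance. It is a generic illustration of the technique, not the Help Desk's code: the function names, the toy 2D "embeddings", and the `lambda_mult` values are all assumptions chosen for the example. Each round picks the chunk that best balances similarity to the query against similarity to chunks already selected.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult = 1.0 ranks purely by relevance to the query;
    lower values penalise redundancy with already-selected documents.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []

    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for i in candidates:
            # Redundancy = highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)

    return selected


def unit(angle_deg: float) -> List[float]:
    """Toy 2D embedding: a unit vector at the given angle (illustrative only)."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


# Hypothetical data: chunks 0 and 1 are near-duplicates, chunk 2 is distinct.
query = unit(0)
chunks = [unit(10), unit(12), unit(-40)]

diverse = mmr_select(query, chunks, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, chunks, k=2, lambda_mult=1.0)
```

With these toy vectors, the balanced setting skips the near-duplicate chunk in favour of the distinct one, while `lambda_mult=1.0` falls back to plain similarity ranking. In a LangChain and pgvector deployment the same trade-off would normally be set through the vector store's retriever options instead of hand-written code.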
The Help Desk features a dual-LLM architecture with separate model configurations for question condensing and response generation, plus a specialized system prompt that prevents exposure of internal terminology. A topical guardrail filters off-topic queries, ensuring focus on deposition support. Deployed in production on Kubernetes, it provides around-the-clock assistance with streaming responses. The system is freely available, marking a significant step in applying AI to scientific data management and biocurator efficiency.
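The control flow described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the deployed implementation: `ModelConfig`, `answer_turn`, `OFF_TOPIC_REPLY`, the temperature values, and the callable parameters are hypothetical names introduced here, and the model calls are passed in as plain functions so the sketch runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (role, message) pairs


@dataclass(frozen=True)
class ModelConfig:
    """Per-role model settings; the values used below are assumptions."""
    model: str
    temperature: float
    streaming: bool


# Two configurations for the two roles: one for rewriting follow-up
# questions, one for generating the final streamed answer.
CONDENSE_CONFIG = ModelConfig(model="gpt-4.1-mini", temperature=0.0, streaming=False)
ANSWER_CONFIG = ModelConfig(model="gpt-4.1-mini", temperature=0.2, streaming=True)

OFF_TOPIC_REPLY = "I can only help with questions about PDB structure deposition."


def answer_turn(
    question: str,
    history: History,
    is_on_topic: Callable[[str], bool],
    condense: Callable[[History, str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Run one chat turn: guardrail, condense, retrieve, generate."""
    # 1. Topical guardrail: refuse before spending any retrieval or LLM calls.
    if not is_on_topic(question):
        return OFF_TOPIC_REPLY

    # 2. Condense a follow-up into a standalone question (first LLM role).
    standalone = condense(history, question) if history else question

    # 3. Retrieve supporting passages, then 4. generate the cited answer
    #    (second LLM role).
    passages = retrieve(standalone)
    return generate(standalone, passages)
```

Keeping the condensing and answering roles behind separate configurations lets each be tuned independently, for example a deterministic rewrite step alongside a streamed answer step.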
- Targets the ~19,000 depositor messages, from ~8,000 entries, received annually
- Uses GPT-4.1-mini with RAG built on LangChain and pgvector, in a dual-LLM architecture
- Deployed on Kubernetes with citation-backed, streaming responses
Why It Matters
Offloads routine deposition questions from the ~20 biocurators who handle >40% of global PDB depositions, freeing them for complex tasks.