Research & Papers

RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support

With ~19,000 depositor messages arriving each year, this RAG system aims to cut biocurator workload with citation-backed answers.

Deep Dive

The RCSB Protein Data Bank (PDB) has deployed an AI-powered Help Desk that uses Retrieval-Augmented Generation (RAG) to streamline support for structural biologists depositing 3D macromolecular structures. The PDB holds over 245,000 structures and received ~19,000 messages from depositors in 2025, a critical bottleneck for the ~20 expert biocurators who handle >40% of global depositions. The system is built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini. It uses pymupdf4llm for Markdown-preserving PDF extraction, two-stage document chunking, and Maximal Marginal Relevance (MMR) retrieval, which balances relevance to the query against redundancy among the retrieved passages, to ground its citation-backed answers.
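To illustrate the retrieval step, here is a minimal sketch of Maximal Marginal Relevance selection in plain Python. It is not the Help Desk's code: the function names, the toy 2-D vectors, and the `lambda_mult` weighting parameter are assumptions chosen for illustration, and a production system would run this over real embeddings stored in pgvector.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,  # hypothetical name: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Return indices of up to k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for i in candidates:
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data: documents 0 and 1 are duplicates, document 2 is less relevant but distinct.
query = [1.0, 0.5]
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With the balanced setting the duplicate is skipped in favor of the distinct passage, while the relevance-only setting returns both copies. That trade-off is the reason to use MMR when answers must cite several different source documents.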

The Help Desk features a dual-LLM architecture with separate model configurations for question condensing and response generation, plus a specialized system prompt that prevents exposure of internal terminology. A topical guardrail filters off-topic queries, ensuring focus on deposition support. Deployed in production on Kubernetes, it provides around-the-clock assistance with streaming responses. The system is freely available, marking a significant step in applying AI to scientific data management and biocurator efficiency.
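The control flow described above (condense the question, apply a topical guardrail, retrieve, then generate) can be sketched as follows. This is a hedged illustration, not the deployed implementation: `HelpDeskPipeline`, its field names, the ordering of the guardrail relative to condensing, and the keyword-based guardrail in the demo are all assumptions, and the lambdas stand in for real LLM and vector-store calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One chat turn as (user_message, assistant_reply).
History = List[Tuple[str, str]]


@dataclass
class HelpDeskPipeline:
    """Hypothetical wiring of the stages; each stage is an injected callable."""

    is_on_topic: Callable[[str], bool]          # topical guardrail
    condense: Callable[[History, str], str]     # first LLM: rewrite follow-up as standalone
    retrieve: Callable[[str], List[str]]        # vector-store lookup (e.g. MMR)
    generate: Callable[[str, List[str]], str]   # second LLM: answer from passages
    refusal: str = "Sorry, I can only help with structure deposition questions."

    def answer(self, history: History, question: str) -> str:
        # Only condense when there is prior context to fold into the question.
        standalone = self.condense(history, question) if history else question
        if not self.is_on_topic(standalone):
            return self.refusal
        passages = self.retrieve(standalone)
        return self.generate(standalone, passages)


def demo_pipeline() -> HelpDeskPipeline:
    """Stub stages so the sketch runs without any model or database."""
    return HelpDeskPipeline(
        # Assumption: a keyword check stands in for the real guardrail.
        is_on_topic=lambda q: "deposit" in q.lower() or "pdb" in q.lower(),
        condense=lambda history, q: f"{history[-1][0]} {q}",
        retrieve=lambda q: ["[doc1] placeholder passage"],
        generate=lambda q, passages: f"Answer to '{q}' citing {len(passages)} source(s)",
    )


pipeline = demo_pipeline()
off_topic = pipeline.answer([], "What is the weather?")
first_turn = pipeline.answer([], "How do I deposit a structure?")
follow_up = pipeline.answer(
    [("How do I deposit a structure?", "...")], "What about ligands?"
)
```

Keeping the condensing and generation stages as separate callables mirrors the idea of giving each its own model configuration, for example a cheaper or more deterministic setting for question rewriting.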

Key Points
  • Targets the ~19,000 depositor messages received annually, spanning ~8,000 entries
  • Uses GPT-4.1-mini with RAG built on LangChain and pgvector, in a dual-LLM architecture
  • Deployed on Kubernetes with citation-backed, streaming responses

Why It Matters

Automates routine deposition support for the ~20 biocurators who handle >40% of global protein structure depositions, freeing them for complex tasks.