RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support


Motivation: Structural biologists have contributed more than 245,000 experimentally determined three-dimensional structures of biological macromolecules to the Protein Data Bank (PDB). Incoming data are validated and biocurated by ~20 expert biocurators across the wwPDB. RCSB PDB biocurators, who process more than 40% of global depositions, face increasing challenges in maintaining efficient Help Desk operations: in 2025, depositors sent approximately 19,000 messages concerning approximately 8,000 entries.

Results: We developed an AI-powered Help Desk using Retrieval-Augmented Generation (RAG), built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini. The system employs pymupdf4llm for Markdown-preserving PDF extraction, two-stage document chunking, Maximal Marginal Relevance (MMR) retrieval, a topical guardrail that filters off-topic queries, and a specialized system prompt that prevents exposure of internal terminology. A dual-LLM architecture uses separate model configurations for question condensing and response generation. Deployed in production on Kubernetes with PostgreSQL (pgvector), the system provides around-the-clock depositor assistance with citation-backed, streaming responses.

Availability and implementation: Freely available at https://rcsb-deposit-help.rcsb.org.
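To make the Maximal Marginal Relevance step concrete, here is a minimal, dependency-free sketch of the general MMR selection rule. It is not the authors' code, and the abstract does not state which parameters they use: the function name `mmr_select`, the parameters `k` and `lambda_mult`, and the toy vectors are all assumptions chosen for illustration. The idea is that each pick balances similarity to the query against similarity to chunks already chosen, so near-duplicate passages are less likely to crowd out other relevant material.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult = 1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already-selected documents.
    (Hypothetical helper; parameter names are illustrative assumptions.)
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            # Redundancy: highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates; document 2 is
# less relevant to the query but covers a different direction.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))
print(mmr_select(query, docs, k=2, lambda_mult=1.0))
```

With the balanced setting, the second pick skips the near-duplicate in favour of the more distinct document; with `lambda_mult=1.0` the selection reduces to plain top-k by relevance. In a production pipeline this selection would typically be delegated to the vector store or retrieval framework rather than hand-written.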

Original article
arXiv cs.AI

Computer Science > Information Retrieval
arXiv:2604.22800 (cs)
Submitted on 13 Apr 2026
Title: RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support
Authors: Vivek Reddy Chithari (1), Jasmine Y. Young (1), Irina Persikova (1), Yuhe Liang (1), Gregg V. Crichlow (1), Justin W. Flatt (1), Sutapa Ghosh (1), Brian P. Hudson (1), Ezra Peisach (1), Monica Sekharan (1), Chenghua Shao (1), Stephen K. Burley (1 and 2)
Affiliations: (1) RCSB Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, USA; (2) RCSB Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, CA, USA

