MITRA's on-premise AI helps physicists find answers in 1M+ CERN documents

arXiv cs.IR March 11, 2026

⚡Researchers built a private RAG system that outperforms keyword search for complex physics queries.

Deep Dive

Researchers from the University of Wisconsin-Madison have developed MITRA, a prototype AI assistant designed to solve a critical problem in large-scale physics: finding specific information in massive, internal documentation sets. Built for collaborations like the Compact Muon Solenoid (CMS) at CERN, MITRA tackles the challenge of navigating a vast and ever-growing corpus of technical notes, analysis summaries, and internal reports. It uses a Retrieval-Augmented Generation (RAG) framework, meaning it retrieves relevant documents and then uses a language model to generate precise answers, ensuring responses are grounded in the collaboration's actual work.

MITRA's architecture is built for security and precision in a sensitive scientific environment. Its entire pipeline, from the embedding model that encodes document meaning to the final Large Language Model (LLM) that formulates answers, is hosted on-premise, so proprietary and unpublished collaboration data never leaves the secure internal network. The system employs a novel two-tiered vector database: it first identifies the correct physics analysis from a set of abstracts, then searches that analysis's full, detailed documentation, which helps resolve ambiguities between similarly named studies.
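The two-tier lookup can be illustrated with a minimal sketch. This is not MITRA's code: the bag-of-words `embed` function stands in for a real on-premise embedding model, the in-memory dictionaries stand in for the vector database, and the analysis IDs and texts are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tier 1: one abstract per analysis (hypothetical IDs and text).
ABSTRACTS = {
    "ANA-001": "search for higgs boson decays to muon pairs",
    "ANA-002": "measurement of top quark pair production cross section",
}

# Tier 2: detailed documentation chunks, grouped by analysis.
CHUNKS = {
    "ANA-001": [
        "the muon trigger requires transverse momentum above threshold",
        "signal is extracted from a fit to the dimuon mass spectrum",
    ],
    "ANA-002": [
        "the trigger selects events with one lepton and jets",
        "the cross section is extracted from a likelihood fit",
    ],
}


def two_tier_search(query, top_k=1):
    """Pick the best-matching analysis first, then rank only its chunks."""
    q = embed(query)
    # Tier 1: choose the analysis whose abstract is closest to the query.
    analysis = max(ABSTRACTS, key=lambda a: cosine(q, embed(ABSTRACTS[a])))
    # Tier 2: rank chunks from that analysis only.
    ranked = sorted(CHUNKS[analysis], key=lambda c: cosine(q, embed(c)), reverse=True)
    return analysis, ranked[:top_k]


analysis, hits = two_tier_search("which trigger is used in the higgs to muon pairs search")
print(analysis, hits)
```

Restricting the second search to one analysis means chunks from a different study that happen to use similar wording (here, the other analysis's trigger description) are never candidates.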

The technical pipeline is highly automated, using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) combined with layout parsing for high-fidelity text extraction from various file formats. In tests, the prototype demonstrated superior retrieval performance compared to standard keyword-based search baselines when answering realistic, complex queries from researchers. The work was accepted at the NeurIPS 2025 Machine Learning for the Physical Sciences workshop, highlighting its relevance at the intersection of AI and big science.
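A comparison between a retrieval system and a keyword baseline is commonly scored with a ranking metric. The sketch below uses recall@k as one plausible choice; the article does not state which metric the authors used, and the document IDs and rankings are made up for illustration, not results from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical rankings for two queries: (system output, ground-truth relevant docs).
keyword_runs = [
    (["d3", "d7", "d1"], ["d1"]),
    (["d4", "d2", "d9"], ["d2", "d5"]),
]
semantic_runs = [
    (["d1", "d3", "d7"], ["d1"]),
    (["d2", "d5", "d4"], ["d2", "d5"]),
]

print("keyword recall@2:", mean_recall_at_k(keyword_runs, 2))
print("semantic recall@2:", mean_recall_at_k(semantic_runs, 2))
```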

Key Points

Fully on-premise RAG system keeps sensitive CERN/CMS collaboration data completely private and secure.
Uses a two-tiered vector database to first identify the correct analysis before retrieving detailed docs, improving accuracy.
Automated pipeline with Selenium and OCR outperforms standard keyword search on realistic, complex physics queries.

Why It Matters

Could speed up scientific work by letting researchers query more than a million internal documents in natural language instead of searching them manually.
