Research & Papers

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Synthetic Slack, Jira, and Gmail data with real-world noise for enterprise AI

Deep Dive

EnterpriseRAG-Bench fills a critical gap in RAG evaluation: most existing benchmarks focus on web or public data, while real enterprise use cases involve messy, proprietary internal knowledge. The dataset includes about 500,000 synthetic documents from nine common enterprise tools — Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence — all grounded in shared projects, people, and initiatives. It also features 500 carefully crafted questions across ten categories that test everything from simple lookups to multi-document reasoning, constrained retrieval, conflict resolution, and detection of missing information.
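To make the category structure concrete, here is a minimal sketch of how such a benchmark might be scored per question category. The record shape (`BenchQuestion`), the category names, and the `NOT_FOUND` convention for absence-detection questions are illustrative assumptions, not the released harness's actual schema or API:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchQuestion:
    # Hypothetical record shape; the real EnterpriseRAG-Bench schema may differ.
    qid: str
    category: str      # e.g. "simple_lookup", "multi_doc", "conflict", "absence"
    question: str
    gold_answer: str   # by convention here, "NOT_FOUND" marks absence questions

def score_by_category(questions, predictions):
    """Exact-match accuracy aggregated per question category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        totals[q.category] += 1
        pred = predictions.get(q.qid, "")
        if pred.strip().lower() == q.gold_answer.strip().lower():
            hits[q.category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

Per-category breakdowns matter here because a system can ace simple lookups while failing entirely on conflict resolution or absence detection; a single aggregate score would hide that.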

To make the benchmark realistic, the authors added noise like misfiled documents, near-duplicates, and conflicting information. The generation framework is open-source, allowing teams to adapt the corpus to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are publicly available. This gives enterprise teams a standardized way to compare RAG systems before deploying them on sensitive internal data.
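The kinds of noise described above can be sketched as a simple corpus post-processing pass. The document shape (dicts with `id`, `source`, and `text` keys) and the function below are illustrative assumptions; the open-source generation framework defines its own noise pipeline:

```python
import random
import copy

def inject_noise(docs, dup_rate=0.05, misfile_rate=0.05, seed=0):
    """Add near-duplicates and misfiled copies to a synthetic corpus.

    `docs` is a list of dicts with 'id', 'source', and 'text' keys
    (an assumed shape, not the framework's actual format)."""
    rng = random.Random(seed)
    sources = sorted({d["source"] for d in docs})
    noisy = list(docs)
    for d in docs:
        if rng.random() < dup_rate:
            # Near-duplicate: same content with a small surface perturbation.
            dup = copy.deepcopy(d)
            dup["id"] = d["id"] + "-dup"
            dup["text"] = d["text"].replace(".", "!", 1)
            noisy.append(dup)
        if rng.random() < misfile_rate:
            # Misfiled copy: identical text attributed to the wrong tool.
            mis = copy.deepcopy(d)
            mis["id"] = d["id"] + "-misfiled"
            mis["source"] = rng.choice([s for s in sources if s != d["source"]])
            noisy.append(mis)
    return noisy
```

Noise like this is what separates the benchmark from clean public-data evaluations: a retriever that ranks well on a deduplicated corpus may degrade sharply once near-duplicates and misfiled documents compete for the same queries.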

Key Points
  • 500,000 synthetic documents across 9 enterprise tools (Slack, Gmail, Jira, etc.)
  • 500 questions in 10 categories testing retrieval, reasoning, conflict resolution, and absence detection
  • Includes realistic noise: misfiled docs, near-duplicates, conflicting information; generation framework allows customization

Why It Matters

Provides the first realistic benchmark for enterprise RAG, enabling safer and more reliable AI deployments on internal data.