Research & Papers

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

New pipeline replaces manual curation with a three-stage automated process for evaluating RAG reports...

Deep Dive

Evaluating long-form, citation-backed reports from RAG systems traditionally relies on manually curated nuggets—atomic facts that assess coverage of query-relevant information. This manual process scales poorly, especially for cross-lingual settings with multilingual source documents. To address this, Bryan Li, William Walden, and colleagues present DoGMaTiQ, a pipeline that automatically generates high-quality QA-based nugget sets. The pipeline operates in three stages: (1) document-grounded nugget generation using LLMs, (2) paraphrase clustering to reduce redundancy, and (3) principled subselection based on quality criteria. DoGMaTiQ integrates with AutoArgue, enabling fully automatic evaluation of generated reports.
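The paper's own implementation is not reproduced here, but a minimal sketch can make the three-stage data flow concrete. Everything in the snippet is an assumption for illustration: the `llm` callable stands in for the actual nugget generator, `SequenceMatcher` similarity stands in for proper paraphrase clustering, and the function names (`generate_nuggets`, `cluster_paraphrases`, `subselect`) are not the authors' API.

```python
# Illustrative sketch only: names, the `llm` callable, and the similarity-based
# clustering are assumptions, not the DoGMaTiQ implementation.
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


@dataclass
class Nugget:
    question: str
    answer: str
    source_doc: str          # document the nugget is grounded in
    quality: float = 0.0     # filled in during subselection


def generate_nuggets(llm: Callable[[str], List[Tuple[str, str]]],
                     docs: List[str]) -> List[Nugget]:
    """Stage 1: prompt an LLM to emit (question, answer) pairs per document."""
    nuggets = []
    for doc in docs:
        for q, a in llm(doc):
            nuggets.append(Nugget(question=q, answer=a, source_doc=doc))
    return nuggets


def cluster_paraphrases(nuggets: List[Nugget],
                        threshold: float = 0.8) -> List[List[Nugget]]:
    """Stage 2: greedily group near-duplicate questions to reduce redundancy."""
    clusters: List[List[Nugget]] = []
    for n in nuggets:
        for cluster in clusters:
            sim = SequenceMatcher(None, n.question, cluster[0].question).ratio()
            if sim >= threshold:
                cluster.append(n)
                break
        else:
            clusters.append([n])
    return clusters


def subselect(clusters: List[List[Nugget]], k: int,
              score: Callable[[Nugget], float]) -> List[Nugget]:
    """Stage 3: keep one representative per cluster, ranked by a quality score."""
    representatives = []
    for cluster in clusters:
        best = max(cluster, key=score)
        best.quality = score(best)
        representatives.append(best)
    representatives.sort(key=lambda n: n.quality, reverse=True)
    return representatives[:k]
```

In practice, stage 2 would rely on semantic similarity rather than surface string matching, and stage 3 would apply whatever quality criteria the paper specifies; the sketch only fixes the order of the stages and the shape of the data passing between them.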

The team conducted extensive experiments on two cross-lingual TREC shared tasks—NeuCLIR and RAGTIME—and demonstrated that DoGMaTiQ produces system rankings with strong rank correlations with both human-in-the-loop and fully manual evaluations. A detailed analysis reveals that the choice of LLM nugget generator is the most critical factor for quality, and that the resulting rankings are robust to outlier systems. By open-sourcing the code and artifacts, the researchers aim to facilitate future work in report evaluation, potentially removing a major bottleneck for scaling RAG assessment across languages and topics.
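Rank correlation between the automatic and manual system rankings is the headline measure of success here. As a self-contained illustration (the per-system scores below are invented, not results from the paper), Kendall's tau compares two orderings of the same systems:

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    assert len(scores_a) == len(scores_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system scores (NOT values from the paper):
automatic = [0.71, 0.64, 0.58, 0.49, 0.33]   # automatic nugget-based evaluation
manual    = [0.75, 0.60, 0.61, 0.45, 0.30]   # manual nugget-based evaluation

print(f"Kendall's tau = {kendall_tau(automatic, manual):.2f}")  # 0.80
```

A tau close to 1 means the automatic evaluation would pick the same winners and losers as the manual one, which is exactly the property the paper reports.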

Key Points
  • DoGMaTiQ uses a three-stage pipeline: document-grounded generation, paraphrase clustering, and quality-based subselection.
  • Achieves strong rank correlations with manual judgments on two cross-lingual TREC shared tasks (NeuCLIR and RAGTIME).
  • Key finding: a strong LLM nugget generator is critical for pipeline quality; code and artifacts are publicly released.

Why It Matters

Automates the labor-intensive manual curation of evaluation nuggets, enabling scalable assessment of RAG reports across languages.