Research & Papers

T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

Open-source T2S-Metrics standardizes SPARQL query evaluation with over 20 metrics.

Deep Dive

Evaluating question-answering systems over knowledge graphs has long suffered from fragmented, ad-hoc metrics that hinder reproducibility. To address this, researchers from ICN, WIMMICS, and I3S have introduced T2S-Metrics, an open-source, extensible library designed for comparing and assessing SPARQL queries generated from natural language. The library provides a unified framework with over 20 evaluation metrics drawn from the literature and from practical evaluation needs, covering lexical, syntactic, semantic, structural, execution-based, and ranking-based dimensions. Inspired by the ir-metrics library for information retrieval, T2S-Metrics offers a modular abstraction layer that decouples metric specification from implementation, ensuring consistent and transparent evaluation across studies.
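
To illustrate the decoupling idea, the sketch below shows what such an abstraction layer can look like. This is a hypothetical Python interface, not T2S-Metrics' actual API: a shared base class gives every metric the same scoring contract, so evaluation code never depends on how an individual metric is implemented.

  from abc import ABC, abstractmethod

  class Metric(ABC):
      """Hypothetical shared contract for all metrics."""
      name: str

      @abstractmethod
      def score(self, predicted: str, gold: str) -> float:
          """Compare a generated SPARQL query against a reference query."""

  class TokenF1(Metric):
      """Token-level F1 over the two query strings."""
      name = "token_f1"

      def score(self, predicted: str, gold: str) -> float:
          pred, ref = set(predicted.split()), set(gold.split())
          overlap = len(pred & ref)
          if overlap == 0:
              return 0.0
          precision, recall = overlap / len(pred), overlap / len(ref)
          return 2 * precision * recall / (precision + recall)

  def evaluate(metrics, pairs):
      """Average each metric over (predicted, gold) query pairs."""
      return {m.name: sum(m.score(p, g) for p, g in pairs) / len(pairs)
              for m in metrics}

  pairs = [("SELECT ?x WHERE { ?x a dbo:City }",
            "SELECT ?c WHERE { ?c a dbo:City }")]
  print(evaluate([TokenF1()], pairs))  # {'token_f1': 0.857...}

Note how the renamed variable (?x vs. ?c) lowers the score even though the queries are equivalent; that gap is precisely what variable-normalized metrics such as SP-F1 are meant to close.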

Key metrics include token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics such as SP-BLEU and SP-F1; graph- and URI-based exact-match metrics; answer-set metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. The library goes beyond simple answer correctness, enabling deeper diagnostic insight into system behavior such as syntactic validity, semantic faithfulness, and computational efficiency. T2S-Metrics represents a significant step toward systematic, standardized evaluation of question answering over knowledge graphs, making it easier for researchers to compare results and build more robust systems.
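
As a concrete example of the answer-set family, the sketch below (standard formulas, not the library's code) computes Jaccard similarity and a set-level F1 over the result sets obtained by executing the gold and generated queries; the actual F1-QALD metric additionally applies special handling for empty answer sets.

  def jaccard(pred: set, gold: set) -> float:
      """|intersection| / |union| of the two SPARQL result sets."""
      if not pred and not gold:
          return 1.0  # both queries return nothing: count as agreement
      return len(pred & gold) / len(pred | gold)

  def answer_f1(pred: set, gold: set) -> float:
      """Set-level F1 over execution results."""
      overlap = len(pred & gold)
      if overlap == 0:
          return 0.0
      precision, recall = overlap / len(pred), overlap / len(gold)
      return 2 * precision * recall / (precision + recall)

  # Result sets from executing both queries against the endpoint:
  pred = {"dbr:Paris"}
  gold = {"dbr:Paris", "dbr:Lyon"}
  print(jaccard(pred, gold))    # 0.5
  print(answer_f1(pred, gold))  # 0.666...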

Key Points
  • Provides over 20 evaluation metrics across lexical, syntactic, semantic, structural, execution, and ranking dimensions
  • Includes BLEU, ROUGE, METEOR, CodeBLEU, F1-QALD, MRR, NDCG, and LLM-as-a-Judge metrics (ranking metrics sketched after this list)
  • Modular abstraction decouples metric specification from implementation for reproducibility
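
The ranking metrics above apply when a system produces several candidate queries per question. A minimal sketch of MRR and Hit@k under that assumption (standard definitions, not T2S-Metrics' API):

  def mrr(runs: list[list[bool]]) -> float:
      """Mean reciprocal rank: 1/rank of the first correct candidate,
      averaged over questions (0 when no candidate is correct)."""
      total = 0.0
      for flags in runs:  # one ranked list of correctness flags per question
          for rank, ok in enumerate(flags, start=1):
              if ok:
                  total += 1.0 / rank
                  break
      return total / len(runs)

  def hit_at_k(runs: list[list[bool]], k: int) -> float:
      """Fraction of questions with a correct candidate in the top k."""
      return sum(any(flags[:k]) for flags in runs) / len(runs)

  # Two questions: correct query ranked 2nd, then ranked 1st.
  runs = [[False, True, False], [True, False, False]]
  print(mrr(runs))          # (1/2 + 1) / 2 = 0.75
  print(hit_at_k(runs, 1))  # 0.5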

Why It Matters

Standardizes SPARQL QA evaluation, enabling reproducible, comparable benchmarks and deeper system diagnostics.