Research & Papers

T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

Open-source T2S-Metrics standardizes SPARQL query evaluation with over 20 metrics.

Deep Dive

Evaluating question-answering systems over knowledge graphs has long suffered from fragmented, ad-hoc metrics that hinder reproducibility. To address this, researchers from ICN, WIMMICS, and I3S have introduced T2S-Metrics, an open-source, extensible library designed for comparing and assessing SPARQL queries generated from natural language. The library provides a unified framework with over 20 evaluation metrics drawn from the literature and from practical evaluation needs, covering lexical, syntactic, semantic, structural, execution-based, and ranking-based dimensions. Inspired by the ir-metrics library for information retrieval, T2S-Metrics offers a modular abstraction layer that decouples metric specification from implementation, ensuring consistent and transparent evaluation across studies.
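
To illustrate the decoupling idea, the sketch below shows what such an abstraction layer can look like. This is a hypothetical Python interface, not T2S-Metrics' actual API: a shared base class gives every metric the same scoring contract, so evaluation code never depends on how an individual metric is implemented.

  from abc import ABC, abstractmethod

  class Metric(ABC):
      """Hypothetical shared contract for all metrics."""
      name: str

      @abstractmethod
      def score(self, predicted: str, gold: str) -> float:
          """Compare a generated SPARQL query against a reference query."""

  class TokenF1(Metric):
      """Token-level F1 over the two query strings."""
      name = "token_f1"

      def score(self, predicted: str, gold: str) -> float:
          pred, ref = set(predicted.split()), set(gold.split())
          overlap = len(pred & ref)
          if overlap == 0:
              return 0.0
          precision, recall = overlap / len(pred), overlap / len(ref)
          return 2 * precision * recall / (precision + recall)

  def evaluate(metrics, pairs):
      """Average each metric over (predicted, gold) query pairs."""
      return {m.name: sum(m.score(p, g) for p, g in pairs) / len(pairs)
              for m in metrics}

  pairs = [("SELECT ?x WHERE { ?x a dbo:City }",
            "SELECT ?c WHERE { ?c a dbo:City }")]
  print(evaluate([TokenF1()], pairs))  # {'token_f1': 0.857...}

Note how the renamed variable (?x vs. ?c) lowers the score even though the queries are equivalent; that gap is precisely what variable-normalized metrics such as SP-F1 are meant to close.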

Key metrics include token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics such as SP-BLEU and SP-F1; graph- and URI-based exact-match metrics; answer-set metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. The library goes beyond simple answer correctness, enabling deeper diagnostic insight into system behavior such as syntactic validity, semantic faithfulness, and computational efficiency. T2S-Metrics represents a significant step toward systematic, standardized evaluation of question answering over knowledge graphs, making it easier for researchers to compare results and build more robust systems.
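
As a concrete example of the answer-set family, the sketch below (standard formulas, not the library's code) computes Jaccard similarity and a set-level F1 over the result sets obtained by executing the gold and generated queries; the actual F1-QALD metric additionally applies special handling for empty answer sets.

  def jaccard(pred: set, gold: set) -> float:
      """|intersection| / |union| of the two SPARQL result sets."""
      if not pred and not gold:
          return 1.0  # both queries return nothing: count as agreement
      return len(pred & gold) / len(pred | gold)

  def answer_f1(pred: set, gold: set) -> float:
      """Set-level F1 over execution results."""
      overlap = len(pred & gold)
      if overlap == 0:
          return 0.0
      precision, recall = overlap / len(pred), overlap / len(gold)
      return 2 * precision * recall / (precision + recall)

  # Result sets from executing both queries against the endpoint:
  pred = {"dbr:Paris"}
  gold = {"dbr:Paris", "dbr:Lyon"}
  print(jaccard(pred, gold))    # 0.5
  print(answer_f1(pred, gold))  # 0.666...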

Key Points
  • Provides over 20 evaluation metrics across lexical, syntactic, semantic, structural, execution, and ranking dimensions
  • Includes BLEU, ROUGE, METEOR, CodeBLEU, F1-QALD, MRR, NDCG, and LLM-as-a-Judge metrics (ranking metrics sketched after this list)
  • Modular abstraction decouples metric specification from implementation for reproducibility
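
The ranking metrics above apply when a system produces several candidate queries per question. A minimal sketch of MRR and Hit@k under that assumption (standard definitions, not T2S-Metrics' API):

  def mrr(runs: list[list[bool]]) -> float:
      """Mean reciprocal rank: 1/rank of the first correct candidate,
      averaged over questions (0 when no candidate is correct)."""
      total = 0.0
      for flags in runs:  # one ranked list of correctness flags per question
          for rank, ok in enumerate(flags, start=1):
              if ok:
                  total += 1.0 / rank
                  break
      return total / len(runs)

  def hit_at_k(runs: list[list[bool]], k: int) -> float:
      """Fraction of questions with a correct candidate in the top k."""
      return sum(any(flags[:k]) for flags in runs) / len(runs)

  # Two questions: correct query ranked 2nd, then ranked 1st.
  runs = [[False, True, False], [True, False, False]]
  print(mrr(runs))          # (1/2 + 1) / 2 = 0.75
  print(hit_at_k(runs, 1))  # 0.5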

Why It Matters

Standardizes SPARQL QA evaluation, enabling reproducible, comparable benchmarks and deeper system diagnostics.