SKG-Eval uses knowledge graphs to fix multi-turn dialogue evaluation
New framework catches contradictions across 20+ turns without using LLM judges.
Evaluating multi-turn AI dialogues is notoriously hard because each response depends on a growing history of entities, claims, and commitments. Existing methods, whether LLM-based judges or embedding metrics, treat each turn in isolation, so they miss contradictions, topic drift, and entity inconsistencies that span long conversations.
To address this, researchers Avijit Shil and Suman Samui introduce SKG-Eval, a framework that treats dialogue as a living Semantic Knowledge Graph (SKG). As turns progress, the system incrementally extracts structured triples (subject-predicate-object) and updates the graph. It then computes three complementary signals: local relevance (how well a response matches the current prompt), historical consistency (how new information connects to prior context, using graph and embedding signals), and logical coherence, which is assessed by a novel geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused into a length-invariant session score via recency-weighted trend analysis.
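The pipeline described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the names `Triple`, `DialogueGraph`, `fuse_turn`, and `session_score` are hypothetical, the fixed equal weights stand in for the paper's adaptive fusion, and a normalized exponential decay stands in for its recency-weighted trend analysis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object fact, tagged with the turn it came from."""
    subject: str
    predicate: str
    obj: str
    turn: int


@dataclass
class DialogueGraph:
    """Hypothetical stand-in for the evolving Semantic Knowledge Graph."""
    triples: list = field(default_factory=list)

    def add_turn(self, turn, extracted):
        # `extracted` is an iterable of (subject, predicate, object) tuples.
        # How triples are extracted from text is outside this sketch.
        for s, p, o in extracted:
            self.triples.append(Triple(s, p, o, turn))

    def entities(self):
        # Every node mentioned so far, as either subject or object.
        return {t.subject for t in self.triples} | {t.obj for t in self.triples}


def fuse_turn(relevance, consistency, coherence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three per-turn signals (assumed to lie in [0, 1]).

    Assumption: fixed weights; the paper describes the fusion as adaptive.
    """
    w_rel, w_con, w_coh = weights
    return w_rel * relevance + w_con * consistency + w_coh * coherence


def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores.

    Dividing by the sum of the weights keeps the result in the same range
    as the inputs no matter how many turns there are.
    """
    if not turn_scores:
        raise ValueError("need at least one turn score")
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # latest turn has weight 1
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)
```

In this sketch a caller would add each turn's triples with `add_turn`, compute the three signals for that turn by whatever means, pass them through `fuse_turn`, and finally hand the list of fused values to `session_score`.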
On multiple benchmarks, SKG-Eval shows stronger correlation with human judgments than existing evaluators and significantly improves detection of long-range inconsistencies in extended conversations (e.g., 20+ turns). Crucially, the framework outputs explicit contradiction certificates and, for fixed inputs, deterministic scores, which makes evaluations reproducible and auditable. That is a key advantage over black-box LLM judges. The approach suggests that structured, externalized state tracking through knowledge graphs can scale better than implicit reasoning for dialogue evaluation. The paper (36 pages, 6 figures) is available on arXiv and includes code links. For enterprise or research teams building conversational AI that must maintain coherence over long sessions, SKG-Eval offers a transparent, robust way to catch errors that traditional metrics miss.
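To make the idea of an explicit, deterministic contradiction certificate concrete, here is a deliberately simple sketch. It is not the paper's geometric contradiction engine, whose details are not covered here. Instead it uses a much cruder rule as a stand-in: if a predicate is declared single-valued (the `functional_predicates` set, an assumption of this example) and the same subject receives two different values in different turns, a certificate is emitted that names both turns. All names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    """Auditable record of one cross-turn conflict."""
    subject: str
    predicate: str
    earlier_turn: int
    earlier_value: str
    later_turn: int
    later_value: str


def find_contradictions(triples, functional_predicates):
    """Return certificates for conflicting single-valued facts.

    `triples` is an iterable of (turn, subject, predicate, object) tuples.
    Stand-in rule, not the paper's method: a predicate listed in
    `functional_predicates` may hold only one object per subject.
    """
    latest = {}  # (subject, predicate) -> (turn, object) most recently seen
    certificates = []
    # Sorting fixes the processing order, so the same input always
    # produces the same certificates in the same order.
    for turn, subject, predicate, obj in sorted(triples):
        if predicate not in functional_predicates:
            continue
        key = (subject, predicate)
        if key in latest and latest[key][1] != obj:
            prev_turn, prev_obj = latest[key]
            certificates.append(
                Certificate(subject, predicate, prev_turn, prev_obj, turn, obj)
            )
        latest[key] = (turn, obj)
    return certificates


facts = [
    (1, "user", "lives_in", "Paris"),
    (5, "user", "likes", "tea"),
    (22, "user", "lives_in", "Berlin"),
]
certs = find_contradictions(facts, {"lives_in"})
```

Here `certs` holds a single certificate linking turn 1 to turn 22, which is the kind of long-range conflict a turn-by-turn evaluator would not see. Because the function involves no sampling and no model calls, rerunning it on the same `facts` yields an identical result.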
- SKG-Eval models dialogue as an evolving Semantic Knowledge Graph of entities, relations, and commitments across turns, updating via structured triple extraction.
- It computes three signals: local relevance, historical consistency (graph + embedding), and logical coherence via a geometric contradiction engine that detects cross-turn conflicts without NLI models or LLM judges.
- It correlates more strongly with human judgments than existing evaluators on multiple benchmarks, and it produces deterministic scores and explicit contradiction certificates for reproducible, auditable evaluation.
Why It Matters
SKG-Eval provides a transparent method, free of LLM judges, for catching contradictions and drift in long AI conversations, which is critical for enterprise chatbots and assistants.