Research & Papers

Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

A new qualitative metric reveals that LLMs achieve high linguistic similarity but underperform on semantic accuracy.

Deep Dive

Researchers Natalie Perez, Sreyoshi Bhaduri, and Aman Chadha have published a paper introducing the Inductive Conceptual Rating (ICR), a novel metric designed to evaluate the meaning, not just the lexical similarity, of text summaries generated by Large Language Models (LLMs). The work, titled 'Simulating Meaning, Nevermore!', critiques current evaluation methods that rely on statistical approximations like BLEU or ROUGE scores, arguing they fail to capture the relational, context-dependent, and emergent nature of human meaning. The authors propose an interdisciplinary framework integrating semiotics (the study of signs) and hermeneutics (the theory of interpretation) with qualitative research methods to bridge the gap between LLM outputs and human interpretive understanding.
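
To make that critique concrete, here is a minimal sketch of the failure mode the authors describe: a lexical metric like ROUGE can rank a meaning-inverting near-copy above a faithful paraphrase. This is not the paper's ICR procedure (which is a qualitative, human-led analysis); the sentences are hypothetical and the rouge-score package is assumed to be installed.

    # Minimal sketch of the lexical-metric failure mode the paper critiques.
    # Hypothetical sentences for illustration; not from the paper's datasets.
    from rouge_score import rouge_scorer  # pip install rouge-score

    reference = "Participants said the new policy eroded their trust in management."

    # Same meaning, almost no shared words.
    paraphrase = ("Respondents reported that the updated rules damaged "
                  "their confidence in leadership.")

    # Nearly identical words, opposite meaning.
    inversion = "Participants said the new policy restored their trust in management."

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    for name, candidate in [("paraphrase", paraphrase), ("inversion", inversion)]:
        scores = scorer.score(reference, candidate)
        print(name, {k: round(v.fmeasure, 2) for k, v in scores.items()})

    # Expected: the meaning-inverting near-copy scores far higher than the
    # faithful paraphrase, the lexical/semantic gap that ICR is meant to expose.

A purely lexical score cannot register the inversion; the authors' qualitative coding, by contrast, assesses whether the summary's concepts actually align with the source's meaning.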

The ICR metric is grounded in inductive content analysis and reflexive thematic analysis, applying systematic qualitative interpretation to assess semantic accuracy and meaning alignment. In an empirical comparison of LLM-generated and human-generated thematic summaries across five datasets, the research found that while models like GPT-4 and Claude achieve high linguistic similarity to reference texts, they consistently underperform on semantic accuracy. Performance improved with larger datasets but remained variable, suggesting LLMs struggle with contextually grounded meanings and with keeping recurring concepts coherent. This finding challenges the assumption that higher scores on lexical metrics equate to better summaries, and it calls for a paradigm shift in AI evaluation toward frameworks that can assess the conceptual and interpretive depth of machine-generated language.

Key Points
  • Introduces the Inductive Conceptual Rating (ICR), a qualitative metric based on semiotics and hermeneutics to evaluate meaning in LLM summaries.
  • Empirical tests across five datasets (N = 50–800) show LLMs achieve high linguistic similarity but underperform on semantic accuracy by 15–30%.
  • Argues for a paradigm shift from lexical similarity metrics (BLEU, ROUGE) to evaluation frameworks that assess true meaning alignment.

Why It Matters

This exposes a critical flaw in how we judge AI-generated text, pushing for evaluations that measure understanding, not just word matching.