Research & Papers

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Researchers catalog 30+ recurring evaluation pitfalls in LLM testing.

Deep Dive

Ruchira Dhar and Anders Søgaard have published a scoping review on arXiv that systematically catalogs evaluation concerns in natural language processing (NLP). Titled "Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing," the paper synthesizes decades of methodological reflection—from early debates on corpus construction to modern critiques of large language model (LLM) benchmarks. The taxonomy organizes recurring positions and trade-offs across areas like data leakage, metric selection, and reproducibility, providing a structured checklist to guide more deliberate evaluation design.

By situating contemporary LLM evaluation debates within their historical context, Dhar and Søgaard argue that many supposedly new criticisms, such as benchmark contamination or over-reliance on accuracy, were already debated extensively in earlier NLP eras. The paper consolidates these insights into a practical framework that helps researchers avoid reinventing the wheel and make more informed methodological choices. Under review at the time of writing, the work aims to serve as a reference for both practitioners and theorists looking to improve evaluation rigor in NLP.

Key Points
  • Taxonomy synthesizes 30+ recurring evaluation concerns from decades of NLP literature
  • Includes practical checklist for designing and interpreting evaluations more deliberately
  • Links modern LLM benchmark critiques to historical methodological debates in NLP

Why It Matters

Provides a structured lens to avoid repeating past evaluation mistakes in fast-moving LLM research.