Research & Papers

Study of 284 linguistic features finds lexical richness is best AI text detector

Lexical richness beats context-dependent signals across 27 LLMs and 10 domains

Deep Dive

A new preprint on arXiv (2606.04177) from a team including Yassir El Attar, Esra Dönmez, Maximilian Maurer, and Agnieszka Falenska systematically evaluates how robust linguistic features are for detecting AI-generated text. The study spans 284 interpretable features, 27 large language models (among them GPT, Claude, LLaMA, and Mistral), and 10 domains (news, academic writing, creative fiction, social media, code, etc.). The goal is to find which linguistic signals generalize across contexts rather than holding only on a specific dataset.

Key results show that a classifier relying solely on linguistic features can reliably separate human-written from LLM-generated text. However, many previously touted indicators, such as sentence length variance, punctuation patterns, or specific part-of-speech distributions, fail to generalize across models or domains. Only lexical richness (e.g., type-token ratio, hapax legomena) consistently signals machine generation: AI text shows lower lexical diversity than human text. This finding gives practitioners a simple yet robust signal for building interpretable AI detection systems, especially for non-expert users who need explainable decisions rather than black-box verdicts.
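For readers unfamiliar with the two lexical richness measures named above, here is a minimal Python sketch of how they are commonly computed. It is an illustration only, not the preprint's implementation: the function names `tokenize` and `lexical_richness`, the regex tokenizer, and the choice to normalize hapax legomena by token count are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a crude lowercase word tokenizer. The preprint's actual
    # tokenization is not described here, so treat this as a stand-in.
    return re.findall(r"\w+", text.lower())


def lexical_richness(text):
    """Return simple lexical richness statistics for a text.

    ttr         = distinct word types / total tokens (type-token ratio)
    hapax_ratio = words occurring exactly once / total tokens
                  (normalizing by tokens is an assumption of this sketch)
    """
    tokens = tokenize(text)
    if not tokens:
        return {"tokens": 0, "types": 0, "ttr": 0.0, "hapax_ratio": 0.0}

    counts = Counter(tokens)
    # Hapax legomena: word types that appear exactly once in the text.
    hapax = sum(1 for c in counts.values() if c == 1)

    return {
        "tokens": len(tokens),
        "types": len(counts),
        "ttr": len(counts) / len(tokens),
        "hapax_ratio": hapax / len(tokens),
    }


if __name__ == "__main__":
    sample = "The cat sat on the mat"
    print(lexical_richness(sample))
```

Note that the plain type-token ratio falls as texts get longer, so values are only comparable between texts of similar length. Any threshold for flagging a text would have to be calibrated on labeled data; none is suggested here.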

Key Points
  • Study tested 284 linguistic features across 27 LLMs (GPT, Claude, LLaMA, Mistral, etc.) and 10 text domains
  • Lexical richness (e.g., type-token ratio) is the only signal that stays robust across models and domains; the other features are context-dependent
  • A classifier using only linguistic features can reliably distinguish AI from human text, providing an interpretable alternative to black-box detectors

Why It Matters

The finding gives professionals a reliable, explainable way to detect AI-written content without relying on complex black-box models.