Research & Papers

GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

New benchmark shows AI models fail to recall exact Persian verses, highlighting cultural training gaps.

Deep Dive

Researchers Ghazal Kalhor and Yadollah Yaghoobzadeh have introduced GhazalBench, a novel benchmark designed to evaluate how large language models (LLMs) interact with Persian ghazals—a culturally significant poetic form. Unlike traditional benchmarks, GhazalBench is usage-grounded, simulating real-world scenarios where Iranians quote, paraphrase, or complete verses from canonical poets like Hafez. It tests two complementary skills: the ability to produce accurate prose paraphrases of poetic couplets, and the ability to retrieve the exact canonical verse when given varying semantic and formal cues.
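The distinction between grasping a verse's meaning and reproducing its canonical surface form can be made concrete with a minimal scoring sketch. The normalization rules, helper names, and placeholder strings below are illustrative assumptions, not GhazalBench's actual evaluation code:

```python
# Hypothetical sketch of an exact verse-recall metric: a completion only
# counts as correct if it reproduces the canonical line verbatim after
# light normalization, so a faithful paraphrase still scores zero.

import unicodedata


def normalize(text: str) -> str:
    """Collapse whitespace and strip combining marks (e.g. diacritics),
    so superficial orthographic variation does not mask an exact match."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.split())


def exact_recall(predictions: list[str], references: list[str]) -> float:
    """Fraction of completions matching the canonical verse verbatim
    after normalization."""
    hits = sum(
        normalize(p) == normalize(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Toy strings standing in for Persian couplets.
refs = ["verse alpha", "verse beta"]
preds = ["verse  alpha", "a close paraphrase of verse beta"]
print(exact_recall(preds, refs))  # 0.5: the paraphrase fails exact match
```

Under a metric like this, a model can score well on paraphrase quality while scoring near zero on recall, which is exactly the dissociation the paper reports.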

When testing several proprietary and open-weight multilingual LLMs, the researchers observed a consistent and revealing pattern. Models generally demonstrated a strong grasp of poetic meaning, producing faithful paraphrases. However, they struggled significantly with exact verse recall in completion settings, where the model must generate the canonical line itself; the gap narrowed substantially in recognition settings, where the model needs only to identify the canonical verse rather than produce it. A parallel evaluation on English sonnets showed markedly higher recall, indicating the limitation stems from differences in training exposure—models see far less Persian poetic text—rather than from inherent architectural constraints.

The findings underscore a critical blind spot in current LLM evaluation. Performance on standard benchmarks often fails to capture how models handle culturally entrenched surface forms and cue-dependent access to specific texts. GhazalBench highlights the need for evaluation frameworks that jointly assess meaning, form, and cultural recall. The benchmark is publicly available, providing a tool for developers to better understand and improve model performance on non-English, culturally rich content, moving beyond mere translation or semantic understanding.

Key Points
  • GhazalBench tests LLMs on two tasks: prose paraphrasing of Persian couplets and exact verse recall from cues.
  • Models showed a dissociation—good at understanding meaning but poor at recalling exact verses, especially in completion tasks.
  • The recall gap was much smaller for English sonnets, pointing to a training data imbalance, not a model architecture flaw.

Why It Matters

Reveals critical cultural blind spots in AI training, showing benchmarks must test for form and recall, not just meaning.