Research & Papers

Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

A new metric reveals AI stories lack narrative tension, correctly ranking them below human-written New Yorker fiction.

Deep Dive

A team of researchers from institutions including the University of Chicago and Carnegie Mellon University has published a paper titled 'Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling.' They argue that current benchmarks for evaluating AI-generated stories, like EQ-Bench, overlook the critical element of narrative tension, leading to flawed assessments. In fact, on EQ-Bench, LLM judges rank zero-shot AI stories above prestigious New Yorker short stories. To address this, the researchers developed the '100-Endings' metric, a novel evaluation method grounded in narratological principles.

The 100-Endings metric works by walking through a story sentence by sentence. At each position, a model predicts the story's ending 100 times based only on the preceding text. Tension is quantified by the mismatch rate: how often these predictions fail to match the ground-truth ending. Beyond the overall rate, the method analyzes the sentence-level curve for statistics like the 'inflection rate,' which tracks plot twists. Using this metric, the researchers confirmed that New Yorker stories exhibit far greater tension than typical LLM outputs.
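The procedure described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: `predict_ending` and `matches` stand in for a sampled LLM completion and a semantic-equivalence check, and the `jump` threshold for detecting twist-like inflections is an invented parameter.

```python
from typing import Callable, List

def tension_curve(
    sentences: List[str],
    true_ending: str,
    predict_ending: Callable[[str], str],   # stand-in for one sampled LLM completion
    matches: Callable[[str, str], bool],    # stand-in for a semantic-match check
    n_samples: int = 100,
) -> List[float]:
    """For each sentence prefix, sample n endings and record the
    fraction that fail to match the ground-truth ending."""
    curve = []
    for i in range(1, len(sentences) + 1):
        prefix = " ".join(sentences[:i])
        misses = sum(
            not matches(predict_ending(prefix), true_ending)
            for _ in range(n_samples)
        )
        curve.append(misses / n_samples)
    return curve

def inflection_rate(curve: List[float], jump: float = 0.2) -> float:
    """Fraction of steps where the mismatch rate rises sharply --
    a crude proxy for plot twists that invalidate earlier guesses."""
    if len(curve) < 2:
        return 0.0
    rises = sum(1 for a, b in zip(curve, curve[1:]) if b - a >= jump)
    return rises / (len(curve) - 1)
```

A low-tension story yields a curve that drops to zero early (the ending is predictable from the first sentences), while a high-tension story keeps the mismatch rate high and spikes it again at twists.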

Building on this diagnostic tool, the team designed a new story-generation pipeline that incorporates structural constraints such as narrative scaffolding and template analysis. This pipeline increased narrative tension as measured by the 100-Endings metric while maintaining performance on traditional leaderboards. The work provides both a more accurate way to evaluate compelling storytelling in AI and a blueprint for building models that generate more engaging, suspenseful narratives.

Key Points
  • Introduced the '100-Endings' metric, which measures narrative tension by having a model predict an ending 100 times per sentence and calculating the mismatch rate.
  • The metric correctly ranks human-written New Yorker stories above AI-generated ones, unlike current benchmarks where LLMs rank their own output higher.
  • The team built a new story-generation pipeline using structural constraints that significantly improved tension scores on the 100-Endings metric.

Why It Matters

Provides a better way to evaluate and build AI that can create genuinely engaging, suspenseful stories for entertainment and content generation.