NYU researchers: LMs still less surprised than humans despite fewer parses

arXiv cs.CL May 18, 2026

⚡ Even when limited to a few simultaneous interpretations, language models still underpredict how much humans slow down on garden path sentences.

Deep Dive

A new preprint by NYU researchers William Timkey, Brian Dillon, and Tal Linzen investigates why language models (LMs) systematically underestimate human processing difficulty in syntactically ambiguous sentences. The team tested the Parse Multiplicity Mismatch Hypothesis, which suggests LMs may be able to simultaneously consider a greater number of distinct sentence interpretations than humans, leading to lower surprisal values.

Using Recurrent Neural Network Grammars (RNNGs) with word-synchronous beam search, they manipulated the number of active parses used to compute word surprisal, then used those surprisals to predict human reading times. Reducing beam width from 50 to just 2 active parses did increase predicted garden path effects, but the magnitude remained far below actual human slowdowns. The finding suggests that differences in the number of simultaneous interpretations are not the primary cause of the human-LM surprisal gap.
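To make the beam-width manipulation concrete, here is a minimal toy sketch, not the authors' RNNG code. It assumes one common way of reading surprisal off a beam: the negative log ratio of the total probability mass in the beam after a word to the mass before it. The parse labels (`MV`, `NP-mod`, `RR`), the probabilities, and the function `beam_surprisals` are all invented for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def beam_surprisals(steps, beam_width):
    """Per-word surprisal (bits) when only `beam_width` parses are kept.

    `steps` holds one dict per word, mapping a surviving parse label to a
    list of (next_label, probability) extensions. Hypothetical toy format.
    """
    beam = {"S": 0.0}  # parse label -> log probability of the parse so far
    surprisals = []
    for step in steps:
        prev_total = logsumexp(list(beam.values()))
        candidates = {}
        for parent, logp in beam.items():
            for child, prob in step.get(parent, []):
                score = logp + math.log(prob)
                if child in candidates:  # merge mass of identical labels
                    score = logsumexp([candidates[child], score])
                candidates[child] = score
        # Prune: keep only the `beam_width` most probable parses.
        kept = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = dict(kept[:beam_width])
        new_total = logsumexp(list(beam.values()))
        surprisals.append((prev_total - new_total) / math.log(2))
    return surprisals

# Invented numbers: an ambiguous word, then a disambiguating word that
# only the initially unlikely reduced-relative ("RR") parse predicts well.
TOY_STEPS = [
    {"S": [("MV", 0.7), ("NP-mod", 0.2), ("RR", 0.1)]},
    {"MV": [("MV", 1e-6)], "NP-mod": [("NP-mod", 1e-6)], "RR": [("RR", 0.5)]},
]

narrow = beam_surprisals(TOY_STEPS, beam_width=2)  # RR parse is pruned
wide = beam_surprisals(TOY_STEPS, beam_width=3)    # RR parse survives
```

In this toy, the narrow beam discards the parse that later turns out to be correct, so surprisal at the disambiguating word rises from roughly 4.3 bits to roughly 19.9 bits. That is the direction of the effect the article describes. The reported finding is that, in the real models, this increase was still far smaller than the human slowdown.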

The result indicates that a simple capacity limit on parallel parsing does not explain the divergence, so LMs may differ from humans in how they resolve ambiguity rather than only in how many parses they track. For NLP researchers, this points to deeper mismatches in ambiguity resolution that may require new training objectives or architectures to better align models with human language processing.

Key Points

NYU researchers tested whether LMs' lower surprisal stems from considering more simultaneous sentence parses than humans do.
Using RNNGs with beam search (2 to 50 active parses), they found that reducing the number of parses increased predicted garden path effects, but the effects still fell short of human slowdowns in reading times.
The Parse Multiplicity Mismatch Hypothesis is insufficient to explain the human-LM processing gap, pointing to other factors.

Why It Matters

This research shows that limiting the number of parallel parses does not make LM surprisal match human garden path difficulty, which narrows the search for what separates model and human syntactic processing.
