Humans and LLMs Diverge on Probabilistic Inferences
New research reveals AI models can't match human judgment on uncertain inferences, even with chain-of-thought prompting.
A new paper from researchers at Stanford University, McGill University, and Cornell University reveals a persistent gap between human and machine probabilistic reasoning. The team, led by Gaurav Kamath, introduced ProbCOPA, a carefully constructed dataset of 210 open-ended inference problems whose answers are not certain but merely probable. Comparing responses from 25-30 human annotators per problem against eight state-of-the-art LLMs (including GPT-4, Claude 3 Opus, and Llama 3), they found that the models consistently failed to match the nuanced, graded probability distributions that humans naturally produce. The finding challenges the assumption that current benchmarks adequately capture human-like reasoning.
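The headline comparison here is between distributions, not single answers. A minimal sketch of what that might look like, assuming per-item answer counts from annotators and from repeated model samples (the Jensen-Shannon divergence and the toy counts below are illustrative, not the authors' exact metric):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical item: 27 annotators spread over three candidate inferences,
# versus 50 model samples concentrated on a single "definitive" answer.
human_counts = np.array([15, 9, 3])   # graded: humans hedge across plausible options
model_counts = np.array([48, 1, 1])   # near-deterministic model behavior
human_dist = human_counts / human_counts.sum()
model_dist = model_counts / model_counts.sum()

print(f"JS divergence: {js_divergence(human_dist, model_dist):.3f} bits")
```

A score of 0 would mean the model reproduces the human distribution exactly; values near 1 bit indicate the distributions barely overlap.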
Analyzing the models' reasoning chains, the researchers identified a common pattern: LLMs approach probabilistic problems through deterministic frameworks, hunting for a single definitive answer where humans recognize ambiguity. The study tested models under both standard and chain-of-thought prompting, and neither closed the gap with human performance. The result underscores that evaluating AI reasoning purely on deterministic tasks such as mathematics or logic puzzles misses how humans actually think in real-world situations saturated with uncertainty. The authors argue for evaluation frameworks that better capture probabilistic reasoning, which could lead to more robust, human-aligned AI systems.
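Running such a comparison requires turning a model's free-text replies into a distribution over candidate inferences. A hedged sketch, assuming a hypothetical `complete(prompt, temperature)` callable standing in for any chat-completion API (the function name, prompt wording, and naive answer matching are all assumptions, not the paper's protocol):

```python
from collections import Counter

def elicit_distribution(complete, question, candidates, cot=False, n_samples=50):
    """Sample a model repeatedly and tally which candidate inference it picks."""
    instructions = (
        "Think step by step, then end with exactly one of the options.\n"
        if cot else
        "Answer with exactly one of the options.\n"
    )
    counts = Counter()
    for _ in range(n_samples):
        reply = complete(instructions + question, temperature=1.0)
        # Naive matching: credit the first candidate mentioned in the reply.
        for c in candidates:
            if c.lower() in reply.lower():
                counts[c] += 1
                break
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in candidates}

# dist_std = elicit_distribution(complete, q, opts)
# dist_cot = elicit_distribution(complete, q, opts, cot=True)
# Compare each against the human distribution, e.g. with js_divergence above.
```

The paper's finding corresponds to both `dist_std` and `dist_cot` remaining far from the human distribution: chain-of-thought changes the reasoning trace, not the deterministic tendency.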
- Researchers created ProbCOPA with 210 problems, each rated for likelihood by 25-30 human annotators
- Eight state-of-the-art LLMs (including GPT-4, Claude 3 Opus, and Llama 3) failed to match human probability distributions
- Analysis revealed models use deterministic reasoning patterns even for inherently uncertain problems
Why It Matters
Shows that current AI benchmarks miss crucial aspects of real-world reasoning, limiting the reliability of AI in domains where answers are only probable, not certain.