Humans and LLMs Diverge on Probabilistic Inferences
New research reveals AI models can't match human judgment on uncertain inferences, even with chain-of-thought prompting.
A new paper from researchers at Stanford University, McGill University, and Cornell University reveals a persistent gap between human and machine probabilistic reasoning. The team, led by Gaurav Kamath, introduced ProbCOPA, a carefully constructed dataset of 210 open-ended inference problems whose answers are not certain but merely probable. Comparing responses from 25-30 human annotators per problem against eight state-of-the-art LLMs (including GPT-4, Claude 3 Opus, and Llama 3), they found that the models consistently failed to match the nuanced, graded probability distributions that humans naturally produce. The finding challenges the assumption that current benchmarks adequately capture human-like reasoning.
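The headline comparison here is between distributions, not single answers. A minimal sketch of what that might look like, assuming per-item answer counts from annotators and from repeated model samples (the Jensen-Shannon divergence and the toy counts below are illustrative, not the authors' exact metric):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical item: 27 annotators spread over three candidate inferences,
# versus 50 model samples concentrated on a single "definitive" answer.
human_counts = np.array([15, 9, 3])   # graded: humans hedge across plausible options
model_counts = np.array([48, 1, 1])   # near-deterministic model behavior
human_dist = human_counts / human_counts.sum()
model_dist = model_counts / model_counts.sum()

print(f"JS divergence: {js_divergence(human_dist, model_dist):.3f} bits")
```

A score of 0 would mean the model reproduces the human distribution exactly; values near 1 bit indicate the distributions barely overlap.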
Analyzing the models' reasoning chains, the researchers identified a common pattern: LLMs approach probabilistic problems through deterministic frameworks, hunting for a single definitive answer where humans recognize ambiguity. The study tested models under both standard and chain-of-thought prompting, and neither closed the gap with human performance. The result underscores that evaluating AI reasoning purely on deterministic tasks such as mathematics or logic puzzles misses how humans actually think in real-world situations saturated with uncertainty. The authors argue for evaluation frameworks that better capture probabilistic reasoning, which could lead to more robust, human-aligned AI systems.
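Running such a comparison requires turning a model's free-text replies into a distribution over candidate inferences. A hedged sketch, assuming a hypothetical `complete(prompt, temperature)` callable standing in for any chat-completion API (the function name, prompt wording, and naive answer matching are all assumptions, not the paper's protocol):

```python
from collections import Counter

def elicit_distribution(complete, question, candidates, cot=False, n_samples=50):
    """Sample a model repeatedly and tally which candidate inference it picks."""
    instructions = (
        "Think step by step, then end with exactly one of the options.\n"
        if cot else
        "Answer with exactly one of the options.\n"
    )
    counts = Counter()
    for _ in range(n_samples):
        reply = complete(instructions + question, temperature=1.0)
        # Naive matching: credit the first candidate mentioned in the reply.
        for c in candidates:
            if c.lower() in reply.lower():
                counts[c] += 1
                break
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in candidates}

# dist_std = elicit_distribution(complete, q, opts)
# dist_cot = elicit_distribution(complete, q, opts, cot=True)
# Compare each against the human distribution, e.g. with js_divergence above.
```

The paper's finding corresponds to both `dist_std` and `dist_cot` remaining far from the human distribution: chain-of-thought changes the reasoning trace, not the deterministic tendency.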
- Researchers created ProbCOPA with 210 problems, each rated for likelihood by 25-30 human annotators
- Eight state-of-the-art LLMs (including GPT-4, Claude 3 Opus, and Llama 3) failed to match human probability distributions
- Analysis revealed models use deterministic reasoning patterns even for inherently uncertain problems
Why It Matters
Shows that current AI benchmarks miss crucial aspects of real-world reasoning, limiting the reliability of AI in domains where answers are only probable, not certain.