Research & Papers

Study: Llama, Qwen, GPT overreach in sensor data explanations

LLMs invent causal stories from sparse sensor data, new audit reveals

Deep Dive

A new paper introduces the concept of 'epistemic overreach' (EO) to quantify a critical flaw in LLM-generated explanations of personal sensor data. As LLMs are increasingly used to translate raw behavioral traces, such as activity, sleep, and mood anomalies, into coherent natural-language narratives, the authors warn that these stories often sound plausible while lacking evidential grounding. To audit this, they generated 14,922 explanations using three LLM families (Llama, Qwen, GPT) across three longitudinal student datasets (StudentLife, GLOBEM, CollegeExperience), covering multiple anomaly types (activity, sleep, affect). They tested two prompt conditions: a minimally constrained prompt and one explicitly instructing models to bound their claims to the data. They also varied the amount of behavioral evidence available to test whether richer context reduces EO.
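To make the audit design concrete, the factors described above can be laid out as a grid of conditions. The following is a minimal sketch, not the authors' code: the model, dataset, and anomaly names come from the summary, while the prompt labels ("minimal", "bounded") and the two context levels ("sparse", "rich") are placeholder names assumed here, since the summary does not say how many context levels were used.

```python
from itertools import product

# Factor values named in the summary above.
MODEL_FAMILIES = ["Llama", "Qwen", "GPT"]
DATASETS = ["StudentLife", "GLOBEM", "CollegeExperience"]
ANOMALY_TYPES = ["activity", "sleep", "affect"]

# Assumed labels for the two prompt conditions described in the text.
PROMPT_CONDITIONS = ["minimal", "bounded"]

# Hypothetical: the number and naming of context levels is not given.
CONTEXT_LEVELS = ["sparse", "rich"]


def audit_conditions():
    """Return every combination of the audit factors as a list of dicts."""
    keys = ("model", "dataset", "anomaly", "prompt", "context")
    grid = product(
        MODEL_FAMILIES, DATASETS, ANOMALY_TYPES, PROMPT_CONDITIONS, CONTEXT_LEVELS
    )
    return [dict(zip(keys, combo)) for combo in grid]


conditions = audit_conditions()
print(len(conditions))  # 3 * 3 * 3 * 2 * 2 = 108 condition cells
```

Note that the cell count here is a property of this sketch only. The 14,922 explanations reported in the paper depend on how many anomalous days each dataset contributes, which the summary does not break down.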

The central finding is that LLMs routinely attribute anomalous days to causes the data does not sufficiently support, a pattern that replicates across datasets, anomaly types, and model families. Giving the models more context did not reliably reduce overreach; bounded prompting helped but failed to eliminate it. The authors decomposed EO into five dimensions: unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference. They argue that for LLMs explaining personal sensing data, 'evidential discipline' is essential: systems must clearly distinguish what is observed, what is inferred, and what remains unknown. This work challenges the assumption that more data leads to better explanations and calls for evidential grounding to become a first-order evaluation criterion alongside fluency and plausibility.
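One way to picture the five-dimension decomposition is as a set of flags attached to each explanation, from which a per-dimension rate can be computed. The sketch below is illustrative only: the dimension names come from the summary, but the `AnnotatedExplanation` record, the binary flagging scheme, and the `dimension_rates` helper are assumptions made here, and the two example explanations are invented for the demonstration rather than taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class EODimension(Enum):
    """The five overreach dimensions named in the summary."""
    UNSUPPORTED_CAUSAL_ATTRIBUTION = "unsupported causal attribution"
    UNACKNOWLEDGED_DATA_GAPS = "unacknowledged data gaps"
    OVERCONFIDENT_LANGUAGE = "overconfident language"
    TEMPORAL_INCONSISTENCY = "temporal inconsistency"
    DIAGNOSTIC_INFERENCE = "diagnostic inference"


@dataclass
class AnnotatedExplanation:
    """Hypothetical record: one explanation plus the dimensions it was flagged for."""
    text: str
    flags: set = field(default_factory=set)


def dimension_rates(annotations):
    """Fraction of explanations flagged for each dimension (assumed binary scoring)."""
    total = len(annotations)
    if total == 0:
        return {dim: 0.0 for dim in EODimension}
    return {
        dim: sum(dim in a.flags for a in annotations) / total
        for dim in EODimension
    }


# Invented toy examples, for illustration only.
examples = [
    AnnotatedExplanation(
        "You slept less because of exam stress.",
        {
            EODimension.UNSUPPORTED_CAUSAL_ATTRIBUTION,
            EODimension.OVERCONFIDENT_LANGUAGE,
        },
    ),
    AnnotatedExplanation(
        "Sleep was two hours below your average; the data does not show why.",
    ),
]

rates = dimension_rates(examples)
for dim, rate in rates.items():
    print(f"{dim.value}: {rate:.2f}")
```

The second toy explanation shows the kind of 'evidential discipline' the authors call for: it states what was observed and says explicitly what remains unknown, so it carries no flags.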

Key Points
  • Llama, Qwen, and GPT attributed anomalous days to unsupported causes across 14,922 explanations from three student datasets.
  • Providing richer behavioral context did not reliably reduce epistemic overreach; bounded prompting helped but didn't eliminate it.
  • The study identifies five dimensions of overreach: unsupported causal attribution, unacknowledged data gaps, overconfident language, temporal inconsistency, and diagnostic inference.

Why It Matters

As LLMs explain personal data, overconfident stories risk misleading users about their health and behavior.