Nonstandard Errors in AI Agents
150 autonomous AI agents analyzing the same data produced wildly different results, revealing a new type of AI uncertainty.
A new study finds that state-of-the-art AI coding agents, when deployed autonomously on the same research task, produce inconsistent empirical results, a phenomenon the authors term 'nonstandard errors' (NSEs). In the paper, Ruijiang Gao and Steven Chong Xiao tasked 150 autonomous Claude Code agents (running both the Sonnet 4.6 and Opus 4.6 models) with independently testing six hypotheses about market quality trends in NYSE TAQ data for SPY from 2015 to 2024. The agents diverged on fundamental analytical choices, such as whether to measure price efficiency with autocorrelation or variance ratios, or to compute dollar versus share volume, and those choices produced substantial variation in their final estimates.
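To make that divergence concrete, here is a minimal, hypothetical sketch (not from the paper): two defensible ways an agent might measure price efficiency on the same return series. The synthetic returns standing in for SPY TAQ data, and the specific measure definitions, are illustrative assumptions.

```python
import numpy as np

# Synthetic one-minute log returns standing in for the SPY TAQ sample.
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 1e-4, size=100_000)

# Choice A: first-order return autocorrelation (near 0 = efficient prices).
autocorr = np.corrcoef(returns[:-1], returns[1:])[0, 1]

# Choice B: a 2-period variance ratio over non-overlapping sums
# (near 1 = random walk, i.e. efficient prices).
two_period = returns.reshape(-1, 2).sum(axis=1)
variance_ratio = two_period.var() / (2 * returns.var())

print(f"autocorrelation estimate: {autocorr:+.5f}")
print(f"variance ratio estimate:  {variance_ratio:.5f}")
```

Both measures point the same way for a random walk (autocorrelation near 0, ratio near 1), but they live on different scales, which is why pooling estimates across measure families, rather than within them, inflates the apparent dispersion.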
The research also found that different AI model families exhibit stable 'empirical styles': systematic, persistent differences in methodological preferences. In a three-stage feedback experiment, AI peer review in the form of written critiques did little to reduce this dispersion. Exposing agents to top-rated exemplar papers, however, shrank the interquartile range of estimates by 80-99% within converging measure families. Agents converged either by tightening their estimates within a chosen method or by switching measure families entirely, but the study concludes that this reflects imitation rather than genuine understanding of the methodological choices.
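As a hedged illustration of the dispersion metric itself (the code and numbers below are invented for the sketch, not the authors' data or pipeline), the nonstandard error can be summarized as the interquartile range of point estimates across independent agents:

```python
import numpy as np

# Illustrative stand-ins: one point estimate per agent, before and after
# exemplar exposure. Locations and spreads are assumptions, chosen only
# so the reduction lands in the ballpark the study reports.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.05, scale=0.040, size=150)  # 150 agents, pre-exemplar
exemplar = rng.normal(loc=0.05, scale=0.004, size=150)  # post-exemplar, tighter

def iqr(estimates):
    """Interquartile range: the spread of the middle 50% of estimates."""
    q1, q3 = np.percentile(estimates, [25, 75])
    return q3 - q1

before, after = iqr(baseline), iqr(exemplar)
print(f"IQR before exemplars: {before:.4f}")
print(f"IQR after exemplars:  {after:.4f}")
print(f"dispersion reduction: {100 * (1 - after / before):.0f}%")
```

The IQR is a natural choice here because it captures disagreement among the middle half of agents while ignoring the extreme outliers that autonomous runs occasionally produce.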
These findings have significant implications for the growing reliance on AI in automated empirical research and policy evaluation. The presence of NSEs means that AI-agent results are not reproducible by default: two agents given identical data and instructions can reach different conclusions depending on each agent's inherent 'style', a new source of uncertainty that must be accounted for. The study also underscores that while AI can be guided toward consensus, that consensus may be superficial, raising questions about the robustness and interpretability of AI-driven scientific conclusions.
- 150 autonomous Claude Code agents (Sonnet 4.6 & Opus 4.6) produced widely varying results on identical financial data analysis tasks.
- Different model families showed stable 'empirical styles,' with systematic preferences for certain analytical methods over others.
- Exposure to exemplar papers reduced estimate dispersion by 80-99%, but convergence was driven by imitation, not methodological understanding.
Why It Matters
This reveals a new, unpredictable source of error in AI-driven research, challenging the reliability of automated analysis for science and policy.