Developer Tools

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

New research shows Claude 4.5 Sonnet exhibits 2.1× lower behavioral variance than GPT-5 when solving complex software engineering tasks.

Deep Dive

A new research paper titled 'Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy' by Aman Mehta provides crucial insights for deploying LLM-based agents in production systems. The study tested three leading models, Anthropic's Claude 4.5 Sonnet, OpenAI's GPT-5, and Meta's Llama-3.1-70B, on SWE-bench, a challenging software engineering benchmark requiring multi-step reasoning. Across 50 runs per model (10 tasks × 5 runs), Claude demonstrated superior performance with 58% accuracy and the lowest coefficient of variation (CV) at 15.2%, indicating high behavioral consistency. GPT-5 showed intermediate results with 32% accuracy and 32.2% CV, while Llama-3.1-70B struggled with just 4% accuracy and 47% CV.
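
To make the methodology concrete, here is a minimal sketch (not the paper's code) of how accuracy and CV can be computed from repeated runs. The per-run success counts below are hypothetical, chosen only to illustrate a tightly clustered agent versus a highly variable one:

```python
# Sketch: accuracy and coefficient of variation (CV) from repeated agent runs.
# The run data is hypothetical, not the paper's results.
from statistics import mean, stdev

def accuracy_and_cv(runs, num_tasks=10):
    """runs: list of per-run success counts out of num_tasks."""
    accs = [passed / num_tasks for passed in runs]  # per-run accuracy
    avg = mean(accs)
    cv = stdev(accs) / avg * 100  # CV as a percentage of the mean
    return avg, cv

# Five hypothetical runs over the same 10 tasks:
consistent_agent = [6, 6, 5, 6, 6]  # tightly clustered -> low CV
variable_agent   = [4, 2, 3, 5, 2]  # spread out -> high CV

for name, runs in [("consistent", consistent_agent), ("variable", variable_agent)]:
    acc, cv = accuracy_and_cv(runs)
    print(f"{name}: accuracy={acc:.0%}, CV={cv:.1f}%")
```

Note that CV normalizes the spread by the mean, which is what lets the paper compare the consistency of models whose raw accuracies differ widely.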

The research reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. While Claude's high consistency correlated with better performance, 71% of its failures stemmed from 'consistent wrong interpretation': making the same incorrect assumption across all runs. Interestingly, GPT-5 reached early strategic agreement nearly as quickly as Claude (diverging at step 3.4 vs. 3.2) yet exhibited 2.1× higher variance, suggesting that divergence timing alone doesn't determine consistency. The findings challenge conventional wisdom about agent evaluation, showing that interpretation accuracy matters more than execution consistency for reliable production deployment.
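
To illustrate the 'consistent wrong interpretation' failure mode, here is a hedged sketch (the labels and the function are hypothetical, not the paper's taxonomy) that flags a task as consistently misinterpreted when every failed run shares the same reading of the task:

```python
# Sketch: classifying a task's failure pattern across repeated runs.
# Interpretation labels are hypothetical illustrations.

def classify_failures(interpretations, passed):
    """interpretations: one label per run describing how the agent
    read the task; passed: parallel list of booleans."""
    failed = [interp for interp, ok in zip(interpretations, passed) if not ok]
    if not failed:
        return "no failures"
    # Every failed run made the same assumption:
    if len(set(failed)) == 1:
        return "consistent wrong interpretation"
    return "variable failure"

runs = ["treat path as relative"] * 5
print(classify_failures(runs, [False] * 5))
# -> consistent wrong interpretation
```

The distinction matters because a consistently wrong agent looks reliable on variance metrics alone; only checking the failures against ground truth exposes the shared misreading.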

These results have significant implications for how companies should evaluate and train AI agents for real-world applications. The study suggests that while consistency metrics are valuable, they must be paired with accuracy assessments to avoid deploying agents that are consistently wrong. For software engineering tasks specifically, the research indicates that current models still have substantial room for improvement, with even the best-performing Claude model succeeding in only 58% of complex tasks. This work provides a framework for more nuanced agent evaluation that could shape future development priorities across the AI industry.
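
One way to operationalize pairing consistency with accuracy is a simple deployment gate. The sketch below uses the paper's reported numbers as inputs, but the thresholds and the report structure are hypothetical assumptions for illustration:

```python
# Sketch: a deployment gate that requires BOTH adequate accuracy and
# low variance, since a low CV alone can mask a consistently wrong agent.
# Thresholds and the AgentReport shape are hypothetical.
from dataclasses import dataclass

@dataclass
class AgentReport:
    name: str
    accuracy: float  # fraction of tasks solved, averaged over runs
    cv: float        # coefficient of variation across runs, in percent

def deployable(report, min_accuracy=0.5, max_cv=20.0):
    # Both checks must pass: consistency without correctness is not enough.
    return report.accuracy >= min_accuracy and report.cv <= max_cv

reports = [
    AgentReport("Claude 4.5 Sonnet", 0.58, 15.2),
    AgentReport("GPT-5", 0.32, 32.2),
    AgentReport("Llama-3.1-70B", 0.04, 47.0),
]
for r in reports:
    print(f"{r.name}: {'deploy' if deployable(r) else 'hold'}")
```

Under these illustrative thresholds only the first report passes, which mirrors the paper's point that variance metrics must be read alongside accuracy, not in place of it.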

Key Points
  • Claude 4.5 Sonnet achieved 58% accuracy on SWE-bench with just 15.2% behavioral variance, outperforming GPT-5 (32% accuracy, 32.2% CV) and Llama-3.1-70B (4% accuracy, 47% CV)
  • 71% of Claude's failures came from 'consistent wrong interpretation'—making the same incorrect assumption across all runs, showing consistency amplifies both correct and incorrect outcomes
  • GPT-5 showed similar early strategic agreement as Claude (diverging at step 3.4 vs. 3.2) but had 2.1× higher variance, indicating divergence timing alone doesn't determine consistency

Why It Matters

This research provides a crucial framework for evaluating AI agents in production, showing that consistency metrics alone can mask systematic failures in complex reasoning tasks.