AI Safety

ChatGPT and Claude disagree 66% of the time but fail identically

⚡ New research shows ChatGPT and Claude agree on why a brand fails 95.1% of the time, rising to 99.6% for long-tail brands.

Deep Dive

New research by Will Jack and colleagues, posted to arXiv (2606.26116), reveals a paradox in AI commercial recommendations: ChatGPT and Claude pick different brands roughly two-thirds of the time, but when both skip a brand, they almost always agree on why. The study ran 215 commercially framed prompts across four batches and measured a cross-provider brand-recommendation Jaccard similarity of just 0.35, well below the same-prompt rerun baseline of 0.50-0.61. For brands, that means no single optimization playbook is guaranteed to carry over: a strategy that works for one provider may fail for the other.
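For readers unfamiliar with the metric, here is a minimal sketch of Jaccard similarity between two recommendation sets. The brand names and the `jaccard` helper are hypothetical illustrations, not data or code from the study, and the article does not say how the paper aggregates scores across prompts.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical picks for one prompt (placeholder brand names).
chatgpt_picks = {"BrandA", "BrandB", "BrandC", "BrandD"}
claude_picks = {"BrandC", "BrandD", "BrandE", "BrandF"}

# 2 shared brands out of 6 distinct brands overall.
score = jaccard(chatgpt_picks, claude_picks)
print(f"Jaccard similarity: {score:.2f}")
```

In this toy case, each provider shares half of its picks with the other, yet the score is only about 0.33, because the union counts every brand either provider named.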

The real surprise comes from the failure-mode analysis. When neither provider recommends a brand, the researchers classify the reason into one of three modes: discoverability (the brand never enters the model), compellingness (the model sees the brand but doesn't mention it), or positioning (the brand is mentioned but not recommended). Across 7,763 joint failures, both providers diagnosed the same mode 95.1% of the time. Agreement increases with brand obscurity: 81% for category leaders versus 99.6% for long-tail regional brands. Notably, Anthropic's Claude models lean more on priors (43-52% of recommendations) than OpenAI's GPT models (8-29%), yet because the two providers' diagnoses converge, a single content fix may lift visibility on both platforms for lesser-known brands.
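The agreement figure can be read as a simple match rate over joint failures. The sketch below assumes each joint failure is recorded as a pair of mode labels, one per provider. The `agreement_rate` helper and the sample records are hypothetical, and the study's actual classification pipeline is not described here.

```python
# The three failure modes named in the article.
MODES = ("discoverability", "compellingness", "positioning")


def agreement_rate(pairs):
    """Share of joint failures where both providers report the same mode."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one joint failure")
    for first, second in pairs:
        if first not in MODES or second not in MODES:
            raise ValueError(f"unknown failure mode in {(first, second)}")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


# Hypothetical records: (mode diagnosed via ChatGPT, mode diagnosed via Claude).
joint_failures = [
    ("discoverability", "discoverability"),
    ("discoverability", "discoverability"),
    ("compellingness", "positioning"),
    ("positioning", "positioning"),
]

rate = agreement_rate(joint_failures)
print(f"Failure-mode agreement: {rate:.1%}")
```

Running the same calculation separately for each brand tier (category leaders, long-tail brands) would give per-tier rates comparable to the 81% and 99.6% figures above.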

Key Points
  • Cross-provider brand recommendation overlap is low (Jaccard 0.35), far below the same-prompt rerun baseline of 0.50-0.61
  • 95.1% joint failure-mode diagnosis agreement across ChatGPT and Claude, rising to 99.6% for long-tail brands
  • Anthropic recommends from priors 43-52% of the time vs OpenAI's 8-29%, yet failure diagnoses converge

Why It Matters

Because both providers tend to agree on why a brand was skipped, one failure-mode fix may boost visibility across both ChatGPT and Claude, especially for long-tail products.
