Media & Culture

Anthropic has been nerfing models, according to BridgeBench, and it looks like a marketing strategy.

Independent benchmark shows Opus 4.6's accuracy plummeted from 83.3% to 68.3% in one week.

Deep Dive

Independent AI benchmark platform BridgeBench has published data alleging a significant performance drop in Anthropic's flagship Claude Opus 4.6 model. According to their tests, the model's accuracy on a hallucination benchmark plummeted from 83.3% to just 68.3% over the course of a week, causing its ranking to fall from #2 to #10 on their leaderboard. That accuracy drop corresponds to a near-doubling of the model's hallucination rate, a dramatic shift that has fueled widespread user complaints, particularly from subscribers to Anthropic's $200/month Claude Max plan who rely on consistent, high-quality outputs.
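As a quick sanity check on those figures: under the simple assumption that the benchmark's hallucination rate is just 100% minus accuracy (BridgeBench's own metric definition may differ), the drop from 83.3% to 68.3% accuracy takes the hallucination rate from 16.7% to 31.7%, roughly a 90% relative increase. A minimal sketch:

```python
def hallucination_increase(acc_before: float, acc_after: float) -> float:
    """Relative increase in hallucination rate (percent), assuming
    hallucination rate = 100% - benchmark accuracy."""
    before = 100.0 - acc_before  # 16.7% hallucination rate
    after = 100.0 - acc_after    # 31.7% hallucination rate
    return (after - before) / before * 100.0

# BridgeBench's reported accuracy figures for Claude Opus 4.6:
print(round(hallucination_increase(83.3, 68.3), 1))  # -> 89.8
```

In other words, hallucinations nearly doubled week-over-week; the precise relative-increase percentage depends on how BridgeBench computes its metric.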

The allegations have ignited a debate within the AI community. Some speculate this could be an intentional 'nerfing'—a strategy to make a subsequent model release appear more impressive by comparison. Others suggest it may be an unintended side effect of backend adjustments or safety fine-tuning. Regardless of the cause, the incident has eroded trust in model consistency for power users and enterprises. It underscores the opaque nature of how providers manage live model performance and the risks of vendor lock-in without transparent versioning.

In response, developers and businesses are actively exploring alternatives. Benchmarks now show models like OpenAI's GPT-4o and the newly released, more affordable GLM-5.1 from Zhipu AI matching or surpassing the current tested performance of Opus 4.6. This event highlights the critical need for independent, continuous benchmarking in a fast-moving market where a model's capabilities are not a static guarantee but a variable service.

Key Points
  • BridgeBench data shows Claude Opus 4.6's accuracy dropped from 83.3% to 68.3%, nearly doubling its measured hallucination rate.
  • The performance drop caused the model to fall from #2 to #10 on the benchmark's leaderboard in one week.
  • The incident has triggered user backlash and spurred evaluation of rival models like GPT-4o and GLM-5.1.

Why It Matters

For businesses, inconsistent model performance undermines reliability in production systems and raises costs by forcing contingency planning.