AI benchmarks are broken. Here’s what we need instead.
Research finds that FDA-approved medical AI models create workflow delays despite near-perfect benchmark scores.
Traditional AI benchmarks are fundamentally broken, according to new research examining real-world AI deployment across healthcare, education, and nonprofit sectors. While models like GPT-4 and Claude 3 post impressive scores on standardized tests such as MMLU (general knowledge) and HumanEval (code generation), those metrics fail to capture how AI performs inside actual organizational workflows. The research documents cases where FDA-approved radiology AI models with 98% accuracy scores created delays in hospital settings, as medical teams struggled to reconcile AI outputs with existing reporting standards and multidisciplinary decision-making processes.
Researchers propose HAIC (Human-AI Context-Specific Evaluation) benchmarks as a solution. Unlike today's static tests, HAIC evaluates AI performance over extended time horizons within real human teams and organizational contexts. The framework addresses the 'AI graveyard' problem, in which technically superior models are abandoned after failing to deliver promised benefits in practice. Studies of organizations in the UK, US, and Asia show that AI's real value emerges through sustained interaction with multiple stakeholders, not through isolated task completion. This shift could prevent wasted investments and rebuild trust in AI systems for critical applications.
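The research describes HAIC conceptually rather than as software, but the shape of such an evaluation is easy to sketch. In the hypothetical Python below, every name (Interaction, HAICEvaluation, minutes_vs_baseline) is our illustration, not the researchers' API: the point is that outcomes are logged per stakeholder over weeks of real use, and the headline metric is workflow impact rather than test-set accuracy.

```python
"""Illustrative sketch only: HAIC is a framework, not a published library,
so all class and field names here are hypothetical."""
from dataclasses import dataclass, field
from datetime import datetime
from statistics import mean

@dataclass
class Interaction:
    """One human-AI exchange inside a real workflow."""
    timestamp: datetime
    stakeholder: str            # e.g. "radiologist", "nurse", "admin"
    task_completed: bool        # did this workflow step actually finish?
    minutes_vs_baseline: float  # time delta vs. the pre-AI process

@dataclass
class HAICEvaluation:
    """Accumulates outcomes over an extended deployment window."""
    interactions: list = field(default_factory=list)

    def log(self, interaction: Interaction) -> None:
        self.interactions.append(interaction)

    def completion_rate(self) -> float:
        """Share of workflow steps that finished with AI in the loop."""
        return mean(i.task_completed for i in self.interactions)

    def mean_workflow_delta(self) -> float:
        """Positive = the AI slowed the team down; negative = sped it up."""
        return mean(i.minutes_vs_baseline for i in self.interactions)

    def per_stakeholder_delta(self) -> dict:
        """Workflow impact broken out by role, since value varies by user."""
        groups = {}
        for i in self.interactions:
            groups.setdefault(i.stakeholder, []).append(i.minutes_vs_baseline)
        return {role: mean(vals) for role, vals in groups.items()}
```

Under this kind of scoring, the 98%-accurate radiology model described above would register a positive mean_workflow_delta and fail the evaluation, despite acing its static benchmark.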
- Current benchmarks test AI in isolation, but real-world use involves complex team interactions
- FDA-approved medical AI with a 98% accuracy score slowed hospital workflows by 15-30%
- HAIC framework evaluates AI performance over time within actual organizational contexts
Why It Matters
Prevents wasted AI investments and ensures models deliver real value in healthcare, education, and business applications.