AI agent benchmarks obsess over coding while ignoring 92% of the US labor market, study finds
Research reveals benchmarks are skewed toward coding, ignoring management, law, and interpersonal skills.
A new study from researchers at Carnegie Mellon University and Stanford University has exposed a significant bias in how AI agents are evaluated. Analyzing popular benchmarks, the researchers found them overwhelmingly focused on programming and narrow computer-based tasks. This creates a distorted view of AI progress: the benchmarks ignore 92% of the US labor market, including entire sectors like management, law, and healthcare that require complex reasoning and human interaction.
The study criticizes current benchmarks for primarily testing skills like information retrieval while almost entirely ignoring critical capabilities such as interpersonal communication, negotiation, and physical task management. The researchers argue that this coding-centric focus risks producing a narrow "Artificial Specialized Intelligence" that excels at technical tasks but fails at broader economic and social applications. They warn that over-optimizing for these skewed benchmarks could steer AI development down a path irrelevant to most real-world work.
To address this, the team advocates for a new generation of evaluation frameworks. These proposed benchmarks would cover currently underrepresented domains and, crucially, assess not just an agent's final answer but also the intermediate steps and reasoning processes it uses to get there. The researchers argue this shift is essential for developing AI that can perform meaningful work across the full spectrum of human labor, rather than remaining narrowly optimized for software engineering.
- Study finds AI benchmarks ignore 92% of US jobs, over-indexing on coding tasks.
- Critical fields like management and law, plus interpersonal skills, are largely absent from evaluations.
- Researchers call for new benchmarks that assess reasoning steps, not just outcomes, across diverse domains.
Why It Matters
Over-optimizing for coding creates narrow AI that can't perform most real-world jobs, misdirecting billions in R&D.