JobBench benchmark refocuses AI agents on human delegation, not replacement
Despite years of agentic AI hype, the best model scores just 45.9% on a benchmark that measures human delegation rather than replacement. That gap is the story.
A new benchmark called JobBench is resetting how we measure AI agents. Unlike predecessors that focus on raw task completion or economic value, JobBench evaluates AI across 130 tasks spanning 35 occupations—from data entry to industrial design—using rubrics defined by domain experts. Each task includes an average of 35.6 binary criteria that reflect real-world delegation priorities: not just 'did the agent finish the job,' but 'did it do so in a way that respects human workflow and judgment.' The top-performing system, Claude Opus 4.7 operating under Claude Code, achieved only 45.9%. That number is less a failure and more a map of the chasm between current capabilities and practical human needs.
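To make the scoring idea concrete, here is a minimal Python sketch of how a rubric of binary criteria could roll up into a single percentage. The aggregation rule is an assumption: the article does not say how JobBench combines per-task results, so this sketch scores each task as the fraction of criteria met and then averages across tasks with equal weight. The `TaskResult` class, its field names, and the sample data are hypothetical and are not taken from the benchmark. For scale, 130 tasks at an average of 35.6 criteria each works out to roughly 4,628 individual checks.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Graded outcome for one task (hypothetical structure, not JobBench's schema)."""
    occupation: str
    criteria_met: list[bool]  # one entry per binary rubric criterion

    def score(self) -> float:
        """Fraction of this task's binary criteria that were satisfied."""
        if not self.criteria_met:
            raise ValueError("a task needs at least one criterion")
        return sum(self.criteria_met) / len(self.criteria_met)


def benchmark_score(results: list[TaskResult]) -> float:
    """Assumed aggregation: every task counts equally, whatever its rubric length."""
    return mean(r.score() for r in results)


# Illustrative data only; real rubrics average 35.6 criteria per task.
results = [
    TaskResult("data entry", [True, True, False, True]),           # 3 of 4 met
    TaskResult("industrial design", [True, False, False, False]),  # 1 of 4 met
]
print(f"{benchmark_score(results):.1%}")  # prints 50.0%
```

Under a different assumption, such as pooling all criteria across tasks before dividing, tasks with longer rubrics would weigh more heavily, so the same raw grades could yield a different headline number.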
The benchmark arrives amid a wave of agentic evaluation tools. SWE-bench tests software engineering via real GitHub issues but measures only code patches; AgentBench assesses agents across web, games, and operating systems but treats success as pass/fail. WebArena focuses on web interaction with ecological validity. JobBench departs from all of these by making human delegation the central metric. The rubrics were built by asking professionals: 'What would you trust an AI to do without oversight?' and 'What steps must it get right?' This shifts the conversation from 'can AI replace a worker?' to 'can AI earn a worker's trust?'—a far more nuanced and difficult question.
The implications for the AI industry are stark. The AI agent market is projected to exceed $30 billion by 2030, and companies from Anthropic to OpenAI to Google rely on benchmark scores to market their systems. A 45.9% score on a delegation-focused benchmark suggests that even frontier models are far from reliable autonomous workers. This could redirect R&D investment toward structured delegation workflows (explicit handoffs, iterative feedback loops, and fallback protocols) rather than end-to-end autonomy.
However, the benchmark is not without risks. The rubrics were created by a small set of experts, potentially embedding cultural or domain biases. The binary criteria, while precise, may oversimplify professional judgment that requires nuance. Moreover, the model name 'Claude Opus 4.7' does not correspond to any publicly announced Anthropic product as of early 2025, raising questions about whether the results come from a pre-release version or a hypothetical scenario. If the latter, the benchmark's credibility is in doubt until the model's identity is confirmed.
JobBench’s deeper value is philosophical: it reframes AI evaluation around human priorities rather than technical prowess. By centering delegation, it forces the field to ask what we want from AI—not just what AI can do. The low top score is a healthy reality check, but the real lesson is that future benchmarks must incorporate similar human-centric design to avoid misleading optimism. Until AI agents can consistently meet the criteria set by people who actually do these jobs, the promise of autonomous delegation remains a long-term goal—not a near-term product.
- JobBench introduces expert-defined rubrics with an average of 35.6 binary criteria per task, shifting evaluation from replacement to delegation—a new standard for AI agent testing.
- The top model's 45.9% score reveals a substantial gap between current agentic AI capabilities and real-world human trust requirements, with implications for the $30B+ agent market.
- Hidden risks include potential expert bias in rubric design, oversimplification of nuanced professional tasks, and uncertainty about the model identity ('Claude Opus 4.7'), which may affect benchmark reproducibility.
Why It Matters
JobBench reframes AI evaluation around human delegation, exposing the gap between agentic hype and practical workplace trust.