Research & Papers

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

New benchmark tests AI on 153 live website tasks; best model Claude Sonnet 4.6 succeeds only 33.3% of the time.

Deep Dive

A new benchmark called ClawBench, developed by researchers from the NAIL Group, has delivered a sobering reality check for AI browser agents. Unlike synthetic tests, ClawBench evaluates agents on 153 real-world tasks, such as booking flights or filing taxes, across 144 actual, live websites. The results are stark: the top-performing model, Anthropic's Claude Sonnet 4.6, succeeded in only 33.3% of tasks. Zhipu AI's text-only GLM-5 surprisingly took second place with a 24.2% success rate, a sign that even advanced multimodal models struggle with the complexity of the modern web. No model exceeded 50% success in any category; finance and academic tasks proved the easiest, while travel and development tasks were significantly harder.

ClawBench's methodology is what sets it apart. It employs a five-layer data capture system, recording everything from session replays and screenshots to HTTP traffic and the agent's internal reasoning traces. Crucially, a request interceptor blocks the final HTTP call before any irreversible action—like completing a payment or booking—allowing for safe, ethical testing on production sites without real-world consequences. Each task is validated against human-provided ground truth, and an agentic evaluator provides step-level diagnostics to pinpoint exactly where failures occur. The benchmark is fully open-source, with the dataset on Hugging Face, code on GitHub, and an interactive leaderboard available online.
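
To make the interceptor idea concrete, here is a minimal sketch of how such a safety layer could be wired up using Playwright's request-routing API. The tooling choice, URL patterns, and handler below are my own assumptions for illustration; the post doesn't say how ClawBench's interceptor is actually implemented.

```python
# Minimal sketch of a request interceptor that blocks irreversible actions.
# The blocked URL patterns and handler logic are illustrative assumptions,
# not ClawBench's actual implementation.
from playwright.sync_api import sync_playwright

# Hypothetical endpoint patterns that would commit a payment or booking.
BLOCKED_PATTERNS = ["**/checkout/**", "**/payments/**", "**/bookings/confirm*"]

def block_irreversible(route, request):
    # Abort the final HTTP call so nothing is actually purchased or booked;
    # the agent still observes the page state up to this point.
    print(f"Blocked irreversible request: {request.method} {request.url}")
    route.abort()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for pattern in BLOCKED_PATTERNS:
        page.route(pattern, block_irreversible)
    page.goto("https://example.com")
    # ... the agent drives the page here; any matching request is intercepted ...
    browser.close()
```

Blocking at the network layer rather than in the UI means the agent can be scored on whether it reached the confirmation step, while the production site never receives the committing request.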
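The step-level diagnostics can likewise be pictured as a comparison between the agent's recorded trace and the human ground-truth steps. The sketch below is a deliberately simplified, hypothetical stand-in: ClawBench's evaluator is described as agentic (model-driven), not a literal step diff, and the Step fields and helper name here are invented for illustration.

```python
# Illustrative stand-in for step-level diagnostics: find the first step where
# the agent's trace diverges from the human ground truth. Data model and
# function are hypothetical, not ClawBench's actual evaluator.
from dataclasses import dataclass

@dataclass
class Step:
    action: str   # e.g. "navigate", "click", "type"
    target: str   # e.g. a URL or CSS selector

def first_divergence(agent_trace: list[Step], ground_truth: list[Step]) -> int | None:
    """Return the index of the first divergent step, or None if all steps match."""
    for i, expected in enumerate(ground_truth):
        if i >= len(agent_trace):
            return i  # agent stopped early
        got = agent_trace[i]
        if (got.action, got.target) != (expected.action, expected.target):
            return i  # wrong action or wrong element at this step
    return None

# Example with made-up steps:
truth = [Step("navigate", "https://example.com/flights"), Step("click", "#search")]
trace = [Step("navigate", "https://example.com/flights"), Step("click", "#login")]
print(first_divergence(trace, truth))  # -> 1
```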

The findings underscore that while AI agents excel in controlled environments, they are far from reliable for autonomous operation on the dynamic, unstructured web. The low success rates, especially in complex domains like travel, reveal fundamental challenges in understanding website layouts, handling multi-step workflows, and adapting to unexpected errors. This benchmark provides a crucial, standardized tool for the research community to measure progress toward practical, general-purpose AI assistants that can truly operate in the wild.

Key Points
  • Claude Sonnet 4.6 is the top model but succeeds in only 33.3% of 153 real-world web tasks.
  • The benchmark tests agents on 144 live websites with a safety interceptor to block payments and bookings.
  • Finance tasks saw 50% success for the best model, while travel and development tasks were significantly harder.

Why It Matters

This benchmark exposes the massive gap between AI lab performance and real-world usability, setting a clear target for building reliable autonomous agents.