KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
The best frontier model passes only 27.9% of tasks that require identifying professional problems from raw data alone.
Researcher Ankit Maloo has introduced KWBench (Knowledge Work Bench), a new benchmark designed to test a critical but overlooked capability in large language models: unprompted problem recognition. Unlike existing benchmarks that measure how well models execute predefined tasks, KWBench evaluates whether an AI can first identify the governing structure of a professional situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across six complex domains including acquisitions, contract negotiations, clinical pharmacy, and fraud analysis. Each task encodes a formal game-theoretic pattern like principal-agent conflict or strategic omission. Models receive only raw data and a task prompt with no indication of the underlying problem type, forcing them to recognize the situation's core dynamics before attempting a solution.
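To make the setup concrete, here is a minimal sketch of how such a task could be represented and presented to a model. The dataclass fields, prompt assembly, and the `hidden_pattern` label are illustrative assumptions rather than KWBench's actual schema; the point is only that the pattern label never reaches the model.

```python
from dataclasses import dataclass

# Hypothetical task record: field names are illustrative, not KWBench's actual schema.
@dataclass
class Task:
    domain: str           # e.g. "contract negotiations"
    raw_inputs: str       # unstructured source material the model receives
    task_prompt: str      # generic instruction with no hint of the problem type
    hidden_pattern: str   # e.g. "principal-agent conflict"; used only for grading

def build_model_input(task: Task) -> str:
    """Assemble what the model actually sees: the neutral prompt plus raw data.

    The hidden_pattern label is deliberately excluded, so recognizing the
    situation's governing structure is left entirely to the model.
    """
    return f"{task.task_prompt}\n\n---\n\n{task.raw_inputs}"
```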
In evaluations of 16 models, the results were stark. The best-performing model passed just 27.9% of tasks, highlighting a significant gap in current AI capabilities. Perhaps more revealing is the low agreement among top models: the top two agreed on only 31.7% of their passes. Among the top eight models, 44 tasks were solved by exactly one model, suggesting specialized but non-generalizable strengths. Routing each task to the best-suited model in this top cohort would have covered 50.7% of the benchmark, nearly double the coverage of the best single model, pointing to the potential of model routing for complex work. The research also found that models could often articulate the correct game-theoretic concept when asked directly but failed to apply it unprompted, indicating a disconnect between knowledge and situational application. Maloo has released KWBench to shift how frontier models are evaluated, pushing the field to measure whether an AI can recognize the right problem from the situation alone, not just execute once the problem has been framed for it.
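The coverage and routing figures reduce to simple set arithmetic: a single model's coverage is the share of tasks it passes, while a routed cohort's coverage is the share of tasks passed by at least one of its members. The sketch below illustrates this with made-up pass sets; the task IDs and resulting percentages are hypothetical and do not reproduce the benchmark's reported results.

```python
# Illustrative set arithmetic behind the coverage figures; the pass sets below
# are made up and do not reproduce the benchmark's actual numbers.
TOTAL_TASKS = 223

def coverage(pass_sets: list[set[int]], total: int = TOTAL_TASKS) -> float:
    """Fraction of tasks passed by at least one model in the cohort."""
    solved = set().union(*pass_sets)
    return len(solved) / total

# Hypothetical pass sets: each set holds the IDs of tasks a model passed.
model_a = set(range(0, 62))      # 62 of 223 tasks, roughly 27.8%
model_b = set(range(40, 100))    # overlaps partially with model_a

print(f"best single model: {coverage([model_a]):.1%}")            # 27.8%
print(f"routed pair:       {coverage([model_a, model_b]):.1%}")   # 44.8%
```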
- The benchmark contains 223 real-world tasks from six domains like contract negotiations and fraud analysis, each encoding a game-theoretic pattern.
- The best model (unspecified) passed only 27.9% of tasks, and even the top eight models combined covered just 50.7%, showing that no current model, alone or in combination, handles the majority of this work.
- Models could articulate correct concepts when prompted but failed to apply them unprompted, revealing a critical gap between knowledge and situational recognition.
Why It Matters
This exposes a fundamental weakness in current AI: it cannot reliably identify real-world business problems without explicit instructions, limiting its autonomous utility in professional settings.