XpertBench: Expert-Level Tasks with Rubric-Based Evaluation
A new benchmark of 1,346 expert-level tasks shows that leading LLMs such as GPT-4o and Claude 3.5 struggle, averaging scores of just 55%.
A consortium of researchers led by Xue Liu and Yuxin Ma has unveiled XpertBench, a benchmark designed to rigorously test large language models (LLMs) on authentic, expert-level professional work. Unlike conventional benchmarks built around narrow or generalist tasks, XpertBench comprises 1,346 carefully curated tasks across 80 categories spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). The tasks were sourced from over 1,000 submissions by genuine domain experts, including practitioners and academics, giving the benchmark high ecological validity. Each task is scored against a detailed rubric of 15 to 40 weighted checkpoints that capture professional rigor.
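The article does not show XpertBench's rubric format, but as a rough illustration of how weighted-checkpoint scoring of this kind typically works, the sketch below aggregates per-checkpoint verdicts into a task score. The `Checkpoint` structure and its fields are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str   # what the rubric author expects to see in the response
    weight: float      # relative importance assigned to this checkpoint
    passed: bool       # judge's verdict for this checkpoint

def rubric_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of rubric checkpoints satisfied, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total

# Example: a three-checkpoint rubric where the heaviest item fails.
rubric = [
    Checkpoint("Cites the governing regulation", 3.0, False),
    Checkpoint("Computes the liability correctly", 2.0, True),
    Checkpoint("Flags the disclosure deadline", 1.0, True),
]
print(f"Task score: {rubric_score(rubric):.0%}")  # -> Task score: 50%
```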
To enable scalable yet accurate assessment, the team developed 'ShotJudge,' a new evaluation paradigm that uses LLMs as judges but calibrates them with expert-provided few-shot exemplars, mitigating the self-rewarding bias common when AI systems evaluate AI output. Applied to state-of-the-art models, the results were stark: even the most advanced LLMs hit a performance ceiling, with a peak success rate of only about 66% and a mean score around 55%. The benchmark also revealed domain-specific divergence, with models showing non-overlapping strengths in areas such as quantitative reasoning versus linguistic synthesis.
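The article does not spell out ShotJudge's prompt or protocol, but a minimal sketch of the general idea, grading one checkpoint with an LLM judge whose standard is anchored by expert-graded exemplars, might look like the following. The prompt wording, exemplar fields, and the `llm` callable are assumptions for illustration, not the authors' implementation.

```python
def build_judge_prompt(task, response, checkpoint, exemplars):
    """Assemble a judging prompt seeded with expert-graded exemplars.

    Each exemplar pairs a past response excerpt with the expert's
    pass/fail verdict and rationale, anchoring the judge's strictness
    before it sees the new response.
    """
    shots = "\n\n".join(
        f"Response excerpt: {ex['response']}\n"
        f"Checkpoint: {ex['checkpoint']}\n"
        f"Expert verdict: {ex['verdict']} ({ex['rationale']})"
        for ex in exemplars
    )
    return (
        "You are grading one rubric checkpoint of an expert-level task.\n"
        "Calibrate your strictness to the expert-graded examples below.\n\n"
        f"{shots}\n\n"
        f"Task: {task}\n"
        f"Checkpoint: {checkpoint}\n"
        f"Response to grade: {response}\n"
        "Answer PASS or FAIL, followed by a one-sentence rationale."
    )

def judge_checkpoint(llm, task, response, checkpoint, exemplars) -> bool:
    """Return True if the judge model marks the checkpoint as satisfied.

    `llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around whichever model API is used as the judge.
    """
    verdict = llm(build_judge_prompt(task, response, checkpoint, exemplars))
    return verdict.strip().upper().startswith("PASS")
```

The key design point is that the judge never grades in a vacuum: its decision threshold is set by concrete expert verdicts rather than by the model's own notion of quality, which is what counteracts self-rewarding bias.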
These findings underscore a pronounced 'expert gap' in current AI systems, demonstrating that while LLMs excel as general-purpose assistants, they fall significantly short of performing as reliable specialized collaborators in professional contexts. XpertBench establishes a new, more demanding standard for AI evaluation, crucial for guiding the next phase of development toward true expert-level competency.
- Benchmarks 1,346 expert-level tasks across 80 categories in professional domains such as finance and law, sourced from over 1,000 submissions by domain experts.
- Introduces 'ShotJudge,' a novel evaluation method using LLM judges calibrated with expert examples to reduce self-rewarding bias.
- Reveals a major 'expert gap': top models like GPT-4o max out at ~66% success, with a mean score of just 55%.
Why It Matters
Sets a new, harder standard for AI evaluation, showing that current models are not yet reliable for high-stakes professional work and guiding future development.