Developer Tools

BlueFin benchmark reveals frontier LLMs fail at financial spreadsheet tasks

Even the strongest AI models score under 50% on real-world finance spreadsheet work.

Deep Dive

BlueFin is a new benchmark from researchers at CMU and industry partners that tests LLM agents on three kinds of real-world financial spreadsheet tasks: synthesis, manipulation, and comprehension. The dataset includes 131 challenging tasks with 3,225 fine-grained rubric criteria, all validated by a team of expert human annotators. An LM judge agent reaches near-expert agreement with those annotators (Cohen’s kappa 0.826, macro-F1 0.839), so submissions can be graded reliably without a human scoring every run.
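To make the two agreement numbers concrete, here is a minimal sketch of how Cohen's kappa and macro-F1 can be computed when a judge and an expert each mark rubric criteria as met (1) or not met (0). The labels are made-up toy data and the function names are illustrative. This is not BlueFin's harness, and the paper's exact aggregation may differ.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def macro_f1(truth, predicted):
    """Unweighted mean of per-label F1, treating `truth` as the reference."""
    labels = sorted(set(truth) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(truth, predicted))
        fp = sum(t != label and p == label for t, p in zip(truth, predicted))
        fn = sum(t == label and p != label for t, p in zip(truth, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): 1 = criterion met, 0 = criterion not met.
expert = [1, 1, 1, 0, 0, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 0, 1, 1]

print(f"kappa    = {cohens_kappa(expert, judge):.3f}")
print(f"macro-F1 = {macro_f1(expert, judge):.3f}")
```

Kappa discounts agreement that would occur by chance, which matters when most criteria fall into one class. Macro-F1 weights the "met" and "not met" classes equally regardless of how often each occurs.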

Results are sobering: the best frontier models (e.g., GPT-4o, Claude 3.5) score below 50% on average and are weakest on “dynamic correctness” (adapting to evolving data). The paper argues that although spreadsheets have hundreds of millions of users globally, an order of magnitude more than there are professional developers, AI research has largely neglected the domain. BlueFin ships an open-source evaluation harness and dataset to spur progress, aiming to narrow the gap between LLM capabilities and occupational finance tasks.

Key Points
  • 131 complex tasks with 3,225 rubric criteria validated by human experts
  • LM judge reaches near-expert agreement with annotators (Cohen’s kappa 0.826, macro-F1 0.839)
  • Top LLMs score below 50% average, weakest on dynamic correctness

Why It Matters

Gives AI researchers a concrete target in a domain with hundreds of millions of spreadsheet users, roughly an order of magnitude more than there are professional developers.