SpreadsheetBench reveals specialist AI tools beat GPT-5.4 and Claude by 10 to 15 points on Excel accuracy
Dealglass and Leni score 90%+ strict cell accuracy, while general models lag behind.
A new benchmark called SpreadsheetBench is giving the AI world a reality check on how accurately different models handle Excel documents. Unlike looser comparisons that only check whether formulas look right, SpreadsheetBench pulls real-world tasks from actual Excel forums and evaluates strict cell accuracy, meaning every single cell in the output must exactly match the computed values in the correct solution file. The results show a clear divide: specialized AI tools built for spreadsheets, such as Dealglass and Leni, score above 90% on strict accuracy, while general-purpose models fall significantly behind.
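To make the all-or-nothing nature of strict cell accuracy concrete, here is a minimal Python sketch. It is not SpreadsheetBench's actual scoring code, which the article does not describe. The names `strict_match` and `strict_accuracy`, the dictionary layout, and the choice to compare only the cells present in the solution file are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]      # (sheet name, cell address), e.g. ("Sheet1", "B2")
Grid = Dict[Cell, object]   # computed cell values, not formula text


def strict_match(output: Grid, expected: Grid) -> bool:
    """True only if every expected cell is present with exactly the expected value."""
    return all(
        cell in output and output[cell] == value
        for cell, value in expected.items()
    )


def strict_accuracy(results: List[Tuple[Grid, Grid]]) -> float:
    """Share of tasks that pass the all-or-nothing check (hypothetical metric)."""
    if not results:
        return 0.0
    passed = sum(strict_match(out, exp) for out, exp in results)
    return passed / len(results)


# Two made-up tasks: the second has a single wrong cell, so the whole task fails.
tasks = [
    ({("Sheet1", "A1"): 1, ("Sheet1", "A2"): 2}, {("Sheet1", "A1"): 1, ("Sheet1", "A2"): 2}),
    ({("Sheet1", "A1"): 1, ("Sheet1", "A2"): 3}, {("Sheet1", "A1"): 1, ("Sheet1", "A2"): 2}),
]
print(strict_accuracy(tasks))  # 0.5
```

Under this kind of scoring, one wrong cell costs the entire task, which is why a formula that merely looks plausible earns no credit.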
Claude Opus 4.6 manages about 80% accuracy, and GPT-5.4 scores in the high 70s, leaving the general models 10 to 15 points behind the specialist tools on the same tasks. The difference becomes even more pronounced on harder structural tasks, such as formulas that depend on other sheets or spreadsheets that get reorganized. General models tend to write formulas without verifying what they actually compute when executed, so they break under those conditions. These harder tasks are where real financial modeling work happens, which makes the gap matter most for professionals. The leaderboard is updated regularly as new tools are added, so anyone considering an AI spreadsheet subscription should check it first.
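The difference between checking formula text and checking computed values can be illustrated with a toy Python sketch. Everything here is hypothetical: the `evaluate` helper, the workbook layout, and the numbers are invented for illustration and say nothing about how any of the benchmarked tools work internally. The sketch assumes a formula written against one layout is reused after the input sheet is reorganized.

```python
import re

# Matches cross-sheet references such as Inputs!B2 (toy grammar, assumption).
REF = re.compile(r"([A-Za-z_]\w*)!([A-Z]+[0-9]+)")


def evaluate(workbook, sheet, addr):
    """Compute a cell's value, resolving Sheet!Cell references recursively."""
    raw = workbook[sheet][addr]
    if not (isinstance(raw, str) and raw.startswith("=")):
        return raw  # plain value, nothing to compute
    expr = REF.sub(
        lambda m: repr(evaluate(workbook, m.group(1), m.group(2))),
        raw[1:],
    )
    # Toy arithmetic only; eval is not safe for untrusted input.
    return eval(expr, {"__builtins__": {}})


formula = "=Inputs!B2*2"

# Original layout: the revenue figure sits in Inputs!B2.
original = {"Inputs": {"B2": 100}, "Model": {"A1": formula}}

# Reorganized layout: a year header now occupies B2 and revenue moved to B3.
reorganized = {"Inputs": {"B2": 2023, "B3": 100}, "Model": {"A1": formula}}

expected = 200
print(evaluate(original, "Model", "A1") == expected)     # True
print(evaluate(reorganized, "Model", "A1") == expected)  # False
```

The formula text is identical in both workbooks, so a check that only reads the formula would pass both. Only executing it shows that the reorganized sheet now produces the wrong number.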
- SpreadsheetBench evaluates AI on real Excel forum tasks with strict cell-by-cell accuracy matching.
- Specialized tools Dealglass and Leni exceed 90% accuracy; Claude Opus 4.6 hits ~80%, GPT-5.4 high 70s.
- General models struggle on structural tasks like cross-sheet dependencies because they tend to write formulas without verifying the values those formulas compute.
Why It Matters
Professionals relying on AI for financial modeling need specialist tools, not general LLMs, for reliable Excel automation.