Research & Papers

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

Claude Opus 4.6 tops the leaderboard at 94.1%, but property memorization ≠ reasoning

Deep Dive

A new benchmark called ThermoQA, published on arXiv, tests six frontier LLMs on 293 open-ended engineering thermodynamics problems across three tiers: property lookups (110 questions), component analysis (101 questions), and full cycle analysis (82 questions). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a refrigerant, and variable-cp air. The composite leaderboard is led by Claude Opus 4.6 at 94.1%, GPT-5.4 at 93.1%, and Gemini 3.1 Pro at 92.5%.
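
The dataset's generation code is open-source; as a rough sketch of how programmatic ground truth works for Tier 1, a property lookup reduces to a single call into CoolProp's PropsSI interface. The states queried below are illustrative, not actual benchmark items:

    # Minimal sketch of Tier 1 ground-truth generation via CoolProp's PropsSI.
    # The fluids match the benchmark's scope; the states are illustrative.
    from CoolProp.CoolProp import PropsSI

    def enthalpy_kJ_per_kg(fluid: str, T_K: float, P_Pa: float) -> float:
        """Specific enthalpy at a single-phase (T, P) state, in kJ/kg."""
        return PropsSI("H", "T", T_K, "P", P_Pa, fluid) / 1e3

    print(enthalpy_kJ_per_kg("Water", 500.0, 1e5))   # superheated steam
    print(enthalpy_kJ_per_kg("R134a", 300.0, 5e5))   # superheated R-134a vapor
    print(enthalpy_kJ_per_kg("Air", 400.0, 1e5))     # real-gas (variable-cp) air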

Cross-tier degradation ranges from just 2.8 percentage points for Opus to 32.5 points for MiniMax, confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators, with 40-60 percentage point performance spreads across models. Multi-run sigma, the run-to-run standard deviation of scores, ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. The dataset and code are open-source.
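
As a hypothetical illustration of how these two aggregate metrics reduce to simple arithmetic (the numbers below are invented, not the paper's):

    # Hypothetical sketch of the two aggregate metrics reported above:
    # cross-tier degradation (Tier 1 accuracy minus Tier 3 accuracy, in
    # percentage points) and multi-run sigma (standard deviation of the
    # composite score across repeated runs). All values are made up.
    from statistics import stdev

    tier_accuracy = {"lookup": 0.95, "component": 0.88, "cycle": 0.70}
    composite_per_run = [0.923, 0.918, 0.931, 0.925, 0.920]

    degradation_pp = (tier_accuracy["lookup"] - tier_accuracy["cycle"]) * 100
    sigma_pct = stdev(composite_per_run) * 100

    print(f"cross-tier degradation: {degradation_pp:.1f} pp")   # 25.0 pp
    print(f"multi-run sigma: +/-{sigma_pct:.2f}%")              # ~ +/-0.50%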

Key Points
  • ThermoQA includes 293 questions across three tiers: property lookups, component analysis, and full cycle analysis
  • Claude Opus 4.6 leads at 94.1%, with GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%
  • Cross-tier degradation reveals performance drops of up to 32.5 percentage points, showing that memorization ≠ reasoning

Why It Matters

By separating property recall from multi-step cycle analysis, this benchmark exposes critical gaps in LLMs' engineering reasoning, a necessary step toward reliable deployment in technical fields.