AI Safety

AI Is Bad at Physics

Even GPT-5.3-powered Codex couldn't reproduce a single numerical result from physics papers.

Deep Dive

A team from Peking University has released a preprint evaluating LLMs on their ability to reproduce results from experimental physics papers. Using a benchmark called PRBench, they tested models including OpenAI's Codex powered by GPT-5.3. The results were stark: every agent scored a 0% end-to-end reproduction rate, meaning none could reproduce a complete set of numerical results within the specified accuracy bounds. Even the best performer, Codex, scored only 34% on an overall reproduction index combining comprehension, code accuracy, and result accuracy. Notably, all of the LLMs passed reading-comprehension tests, suggesting they understood each paper's methods but failed at implementation.

The researchers identified five common failure modes, three of which stand out. First, formula implementation errors: LLMs made small mistakes in code, such as sign errors or wrong index conventions. Second, lapses in algorithmic fidelity: models oversimplified complex physics, for instance reducing a full Skyrme–Hartree–Fock calculation to a basic Schrödinger equation. Third, methodological inconsistency: LLMs replaced a paper's specific formulation with more modern variants drawn from their training data, introducing critical errors. These findings suggest that while LLMs excel at recalling information, they struggle with the nuanced, context-dependent reasoning that real scientific work requires.
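To make the first failure mode concrete, here is a minimal, hypothetical sketch (not taken from PRBench or any paper in it) of how a one-character sign error corrupts every numerical result while the code still runs cleanly. The quantum harmonic oscillator energy formula E_n = ħω(n + 1/2) stands in for a paper's equation; the function names and values are illustrative assumptions.

```python
HBAR = 1.0  # natural units; illustrative choice, not from the benchmark

def energy_correct(n: int, omega: float) -> float:
    """Quantum harmonic oscillator levels: E_n = hbar * omega * (n + 1/2)."""
    return HBAR * omega * (n + 0.5)

def energy_sign_error(n: int, omega: float) -> float:
    """The kind of slip the benchmark flags: a minus where a plus belongs."""
    return HBAR * omega * (n - 0.5)

# The buggy version raises no exception, so nothing looks wrong at runtime,
# but every level is shifted by hbar * omega and no reproduced number can
# match the paper's value within any reasonable accuracy bound.
for n in range(3):
    good = energy_correct(n, 1.0)
    bad = energy_sign_error(n, 1.0)
    print(f"n={n}: correct={good}, buggy={bad}, offset={abs(good - bad)}")
```

Running this shows a constant offset of 1.0 at every level: the error is systematic rather than random, which is exactly why a reproduction check against the paper's reported numbers catches it while unit tests on "does it run" do not.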

Key Points
  • All LLMs scored 0% on end-to-end reproduction of numerical physics results from papers
  • OpenAI's Codex powered by GPT-5.3 was the top performer but still scored only 34% on overall reproduction
  • Common failures included sign errors, oversimplification of equations, and replacing paper-specific methods with modern variants

Why It Matters

LLMs can't yet replace human researchers in physics, despite passing reading comprehension tests.