[R] I built a benchmark that catches LLMs breaking physics laws
Gemini Pro scored worse than Flash Lite, and Bernoulli's equation stumped every model at 0%.
A new open-source benchmark called 'Lawbreaker' is exposing fundamental flaws in how large language models (LLMs) handle physics. Built by independent researcher Agodianel, it uses the symbolic-math library SymPy and the unit-handling library Pint to generate and grade adversarial physics questions, avoiding subjective 'LLM-as-judge' scoring. The benchmark systematically tests 28 core physical laws, including Ohm's Law, Newton's Laws, and Coulomb's Law, by baking in traps like unit confusion (mixing mA/A, Celsius/Kelvin) and anchoring-bias prompts.
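To make that grading approach concrete, here is a minimal sketch of how a SymPy/Pint pipeline could generate and score a unit-trap question. The function names, question template, and tolerance are illustrative assumptions, not code from the Lawbreaker repo:

```python
# Hypothetical sketch (not the actual Lawbreaker code): generate an Ohm's-law
# question with a mA/A trap and grade a numeric answer against the exact
# symbolic result, with units handled by Pint.
import sympy as sp
import pint

ureg = pint.UnitRegistry()

def ohms_law_question(current_mA: float, resistance_ohm: float):
    """Build a question whose current is stated in milliamps (the unit trap)."""
    prompt = (
        f"A resistor of {resistance_ohm} ohm carries a current of "
        f"{current_mA} mA. What is the voltage across it, in volts?"
    )
    # Ground truth: convert mA -> A with Pint, then compute V = I * R symbolically.
    I = (current_mA * ureg.milliampere).to(ureg.ampere).magnitude
    V_true = sp.Rational(str(I)) * sp.Rational(str(resistance_ohm))
    return prompt, V_true

def grade(model_answer_volts: float, V_true, rel_tol: float = 1e-3) -> bool:
    """Accept the answer only if it matches the symbolic value within tolerance."""
    return abs(float(V_true) - model_answer_volts) <= rel_tol * abs(float(V_true))

prompt, V_true = ohms_law_question(current_mA=250.0, resistance_ohm=4.0)
print(prompt)
print(grade(1.0, V_true))     # 0.25 A * 4 ohm = 1 V   -> True
print(grade(1000.0, V_true))  # treated 250 mA as 250 A -> False
```

Because the answer key is computed rather than judged, a model that forgets the unit conversion fails deterministically.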
Initial results testing seven Google Gemini models revealed surprising failures. The high-capacity Gemini 3.1 Pro scored a dismal 22.1%, performing worse than the lighter Gemini 3.1 Flash Lite (35.7%). The Pro model repeatedly fell for the 'forget the ½ in kinetic energy' trap. Meanwhile, the top performer was Gemini 3.1 Flash Image Preview, which aced 24 out of 28 laws. However, Bernoulli's Equation proved impossible for all models, with a 0% success rate due to pressure unit confusion (Pa vs. atm).
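For context on those two failure modes, a quick illustration (again, not the benchmark's own code) shows how a symbolic comparison flags the missing ½ and why skipping the atm-to-Pa conversion wrecks Bernoulli answers:

```python
# Illustrative checks for the failure modes described above.
import sympy as sp
import pint

# 1) The "forgot the 1/2" kinetic-energy trap: symbolic difference is nonzero.
m, v = sp.symbols("m v", positive=True)
correct = sp.Rational(1, 2) * m * v**2   # E_k = (1/2) m v^2
forgot_half = m * v**2                   # the trap Gemini 3.1 Pro kept hitting
print(sp.simplify(correct - forgot_half) == 0)   # False -> graded wrong

# 2) The Pa-vs-atm trap: 1 atm is 101325 Pa, so mixing the two units in
# Bernoulli's equation shifts the pressure term by about five orders of magnitude.
ureg = pint.UnitRegistry()
print((1.0 * ureg.atm).to(ureg.pascal))  # 101325.0 pascal
```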
The benchmark's strength lies in its procedural generation, creating effectively unlimited question variations to prevent memorization. Results are auto-pushed to a Hugging Face dataset, and the creator plans to test OpenAI, Claude, and open models next. This tool provides a much-needed, mathematically rigorous check on LLMs' tendency to 'confidently hallucinate' incorrect scientific answers, pushing for more reliable reasoning in AI assistants.
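A sketch of what that procedural generation might look like, with hypothetical names and a Coulomb's-law template chosen purely for illustration (the real generator's parameter ranges and traps may differ):

```python
# Assumed design based on the post's description: every call draws fresh
# parameters, so there is no fixed answer key to memorize.
import random

def make_coulomb_question(rng: random.Random):
    """Coulomb's law with randomized values plus microcoulomb/centimetre unit traps."""
    q1_uC = rng.uniform(1.0, 9.0)    # charges given in microcoulombs
    q2_uC = rng.uniform(1.0, 9.0)
    r_cm = rng.uniform(5.0, 50.0)    # distance given in centimetres
    k = 8.9875517923e9               # Coulomb constant, N*m^2/C^2
    force_N = k * (q1_uC * 1e-6) * (q2_uC * 1e-6) / (r_cm * 1e-2) ** 2
    prompt = (
        f"Two point charges of {q1_uC:.2f} uC and {q2_uC:.2f} uC are "
        f"{r_cm:.1f} cm apart. What is the magnitude of the force between "
        f"them, in newtons?"
    )
    return prompt, force_N

rng = random.Random(42)              # seed only to make a test run reproducible
prompt, answer = make_coulomb_question(rng)
print(prompt)
print(f"expected answer: {answer:.4f} N")
```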
- Benchmark tests 28 physics laws with symbolic math grading, avoiding LLM judges.
- Gemini 3.1 Pro scored only 22.1%, worse than the lighter Flash Lite model (35.7%).
- Bernoulli's Equation had a 0% success rate across all tested models due to unit confusion.
Why It Matters
Provides objective, rigorous testing to expose when AI assistants hallucinate incorrect science, pushing for more reliable reasoning.