Physics-R1 exposes 46-point MCQ bias in visual physics AI benchmarks
Hidden benchmark flaws revealed: 46 percentage point gap between multiple-choice and open-ended physics reasoning
A new paper from researcher Shan Yang audits the multimodal-physics evaluation pipeline end to end and reports three flaws that distort measurements of vision-language reasoning. First, train-eval contamination: a simple 5-gram overlap check finds zero hits, but a three-stage audit (Jaccard → cosine → LLM judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in the SciInstruct training pool alone. Second, translation drift: accuracy differs by 17 percentage points between Estonian and English olympiad problems (30.5% vs. 13.6%), indicating that the language of a problem affects measured accuracy. Third, MCQ saturation: identical model weights score 79.7% on multiple-choice PhyX but 33.4% on open-ended PhysOlym-A, a 46 percentage point gap suggesting that closed-ended question formats inflate scores.
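The staged audit can be pictured with a small sketch. This is not the paper's implementation: the tokenizer, the bag-of-words cosine (a real audit would more likely use embeddings), the thresholds `jaccard_min` and `cosine_min`, and the `judge` callback standing in for the LLM judge are all hypothetical choices made for illustration. It shows how a paraphrased problem can share no exact 5-gram with a training item yet still pass cheap similarity filters and reach the judge.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shares_ngram(a, b, n=5):
    """Baseline check: do the two texts share any exact token n-gram?"""
    def grams(toks):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return bool(grams(tokens(a)) & grams(tokens(b)))


def jaccard(a, b):
    """Stage 1 (assumed form): Jaccard overlap of the token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine(a, b):
    """Stage 2 (assumed form): cosine similarity of token-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def audit(train_item, eval_item, judge, jaccard_min=0.3, cosine_min=0.6):
    """Return 'clean', or the judge's verdict for pairs that pass both filters.

    Thresholds are hypothetical; `judge` is a stand-in for an LLM call.
    """
    if jaccard(train_item, eval_item) < jaccard_min:
        return "clean"
    if cosine(train_item, eval_item) < cosine_min:
        return "clean"
    return judge(train_item, eval_item)  # Stage 3: only the survivors are judged


train = ("A block of mass 2 kg slides down a frictionless incline "
         "of angle 30 degrees. Find its acceleration.")
test_q = ("Find the acceleration of a 2 kg block sliding down a "
          "frictionless incline with a 30 degree angle.")
unrelated = "What is the boiling point of water at sea level?"


def stub_judge(a, b):
    """Placeholder verdict; a real judge would prompt a language model."""
    return "paraphrase"
```

In this toy pair, `shares_ngram(train, test_q)` is false while `audit(train, test_q, stub_judge)` escalates to the judge, which is the pattern the audit numbers describe: zero n-gram hits alongside many paraphrase candidates. Ordering the stages from cheapest to most expensive keeps the number of judge calls small.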
To address these gaps, Physics-R1 releases four key artifacts: PhysCorp-A (a 6,432-record multimodal corpus triple-audited for contamination), PhysR1Corp (2,268 records for reinforcement learning), PhysOlym-A (500 novel-source olympiad problems with native difficulty labels and an English-Estonian bilingual subset), and a reference GSPO+DAPO training recipe cold-started from Qwen3-VL-8B-Thinking. Across three seeds, Physics-R1 lifts the 8B base model by 18.3 percentage points on PhysOlym-A (8.0% → 26.3%, just 7.1 pp behind Sonnet 4.5), by 15.7 pp on PhysReason (now ahead of Qwen3-VL-32B and Gemini 2.5 Pro), by 6.9 pp on OlympiadBench-Physics, and by 4.1 pp on PhyX MCQ. The work provides both a rigorous audit methodology and a practical recipe for fair evaluation of visual physics reasoning.
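Because the reported deltas are differences between accuracies, they are percentage points rather than percent. A minimal sketch of the arithmetic, using only figures quoted in this summary (the helper names `pp_gap` and `relative_change_pct` are illustrative, not from the paper):

```python
def pp_gap(a_pct, b_pct):
    """Absolute difference between two accuracies, in percentage points."""
    return round(a_pct - b_pct, 1)


def relative_change_pct(new_pct, old_pct):
    """Relative change of new over old, in percent (a different quantity)."""
    return (new_pct - old_pct) / old_pct * 100


mcq_gap = pp_gap(79.7, 33.4)         # PhyX MCQ vs. open-ended PhysOlym-A
drift_gap = pp_gap(30.5, 13.6)       # Estonian vs. English olympiad problems
olym_gain = pp_gap(26.3, 8.0)        # PhysOlym-A after vs. before training
olym_relative = relative_change_pct(26.3, 8.0)
```

The first three values come out to 46.3, 16.9, and 18.3 points, matching the rounded 46, 17, and 18.3 quoted above. The relative change for PhysOlym-A is roughly 229 percent, which shows why the two units should not be used interchangeably.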
- Three-stage audit of training pools found 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone, exposing hidden contamination
- Translation drift caused a 17 percentage point performance difference between Estonian and English olympiad problems (30.5% vs. 13.6%)
- Physics-R1 recipe improved Qwen3-VL-8B-Thinking by +18.3 pp on open-ended PhysOlym-A and by +15.7 pp on PhysReason, where it now outperforms Qwen3-VL-32B and Gemini 2.5 Pro
Why It Matters
The work exposes hidden flaws in visual physics benchmarks and provides an audited corpus and a training recipe for more reliable evaluation of visual reasoning.