Physics-R1 exposes 46-point MCQ bias in visual physics AI benchmarks

arXiv cs.CL May 15, 2026

⚡ Hidden benchmark flaws revealed: a 46-percentage-point gap between multiple-choice and open-ended physics reasoning

Deep Dive

A new paper from researcher Shan Yang audits the multimodal-physics evaluation pipeline end to end and identifies three flaws that distort measurements of vision-language reasoning. First, train-eval contamination: simple 5-gram checks find zero hits, but a three-stage audit (Jaccard → cosine → LLM-judge) reveals 134 near-duplicates and 4,846 paraphrase candidates in the SciInstruct training pool alone. Second, translation drift: a 17 percentage point performance gap between Estonian and English olympiad problems (30.5% vs. 13.6%) indicates that scores depend on the language in which a problem is posed. Third, MCQ saturation: identical model weights show a 46 percentage point gap between multiple-choice (79.7% on PhyX) and open-ended (33.4% on PhysOlym-A) evaluation, which suggests that the multiple-choice format artificially inflates scores.
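To make the three-stage idea concrete, here is a minimal sketch of a Jaccard → cosine → LLM-judge filter. It is not the paper's implementation: the tokenizer, the bag-of-words cosine (the paper may well use embeddings), both thresholds, the `judge` callable, and the mapping of stages to "near-duplicate" and "paraphrase candidate" labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokens(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    # Stage 1 score: overlap of the two token sets.
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    # Stage 2 score: cosine similarity of term-count vectors
    # (a stand-in for whatever vector representation the paper uses).
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def audit(
    train: list[str],
    evalset: list[str],
    judge: Callable[[str, str], bool],
    jaccard_min: float = 0.8,  # assumed threshold
    cosine_min: float = 0.6,   # assumed threshold
) -> dict[str, list[tuple[str, str]]]:
    found = {"near_duplicates": [], "paraphrase_candidates": [], "confirmed": []}
    for t in train:
        for e in evalset:
            if jaccard(t, e) >= jaccard_min:
                # Stage 1: high lexical overlap counts as a near-duplicate.
                found["near_duplicates"].append((t, e))
            elif cosine(t, e) >= cosine_min:
                # Stage 2: looser similarity yields a paraphrase candidate.
                found["paraphrase_candidates"].append((t, e))
                # Stage 3: only candidates are sent to the (expensive) judge.
                if judge(t, e):
                    found["confirmed"].append((t, e))
    return found


# Toy data, invented for this example only.
train_pool = [
    "A 2 kg block slides down a frictionless incline of 30 degrees. Find its acceleration.",
    "Find the acceleration of a 2 kg block sliding down a frictionless 30 degree incline.",
    "Compute the pH of a 0.1 M HCl solution.",
]
eval_pool = [
    "A 2 kg block slides down a frictionless incline of 30 degrees. Find the acceleration.",
]

# Placeholder judge that accepts every candidate; a real one would call an LLM.
result = audit(train_pool, eval_pool, judge=lambda t, e: True)
```

The cascade orders the checks from cheapest to most expensive, so the judge only sees pairs that survived the two lexical stages. The reordered second training item shows why an exact 5-gram check can miss a paraphrase that a similarity score still flags.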

To address these gaps, Physics-R1 releases four key artifacts: PhysCorp-A (a 6,432-record multimodal corpus triple-audited for contamination), PhysR1Corp (2,268 records for reinforcement learning), PhysOlym-A (500 novel-source olympiad problems with native difficulty labels and an English-Estonian bilingual subset), and a reference GSPO+DAPO training recipe cold-started from Qwen3-VL-8B-Thinking. Across three seeds, Physics-R1 lifts the 8B base model by +18.3 percentage points on PhysOlym-A (8.0% → 26.3%, just 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (now ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics, and +4.1 pp on PhyX MCQ. The work provides both a rigorous audit methodology and a practical recipe for fair evaluation of visual physics reasoning.
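The headline figures are differences in percentage points (absolute differences between two accuracies), not relative percentages. A short check of the arithmetic, using only the numbers quoted in this summary (the helper name `pp_gap` and the dictionary keys are illustrative):

```python
def pp_gap(high: float, low: float) -> float:
    # Percentage-point gap: plain difference of two accuracies given in percent.
    return round(high - low, 1)


# Accuracy pairs as quoted in the summary above.
reported = {
    "mcq_vs_open_ended": (79.7, 33.4),  # PhyX MCQ vs. PhysOlym-A open-ended
    "translation_drift": (30.5, 13.6),  # Estonian/English bilingual subset
    "physolym_a_gain": (26.3, 8.0),     # after vs. before the Physics-R1 recipe
}

gaps = {name: pp_gap(high, low) for name, (high, low) in reported.items()}
```

The results are 46.3, 16.9, and 18.3 percentage points, which match the rounded 46-point and 17-point gaps and the +18.3 pp gain reported in the article.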

Key Points

Three-stage audit of training pools found 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone, exposing hidden contamination
Translation drift produced a 17 percentage point performance gap between Estonian and English olympiad problems (30.5% vs. 13.6%)
Physics-R1 recipe improved Qwen3-VL-8B by +18.3 pp on open-ended PhysOlym-A and by +15.7 pp on PhysReason, where it now outperforms Qwen3-VL-32B and Gemini 2.5 Pro

Why It Matters

The paper exposes hidden flaws in physics AI benchmarks and provides an audited corpus and a training recipe for more reliable evaluation of visual reasoning.
