[R] Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
New 6,372-question benchmark shows GPT-5 gets 58% of final answers right but recovers only 48% of the reference reasoning steps.
A new research paper introduces CRYSTAL, a benchmark of 6,372 visual questions with verified step-by-step reasoning paths, designed to evaluate whether multimodal AI models actually reason through problems or merely guess the final answer. The team tested 20 leading models and found a widespread disconnect: most achieve respectable final-answer accuracy while failing at structured reasoning. GPT-5, for instance, scored 58% on final answers but recovered only 48% of the reference reasoning steps. Smaller models like the 4B-parameter Gemma3 sometimes out-reasoned far larger ones like the 38B InternVL3.5, challenging the assumption that scale guarantees better logic. The benchmark also exposed that 19 of 20 models engage in "cherry-picking" (stating a few correct facts while skipping crucial steps) and that no model kept its reasoning steps in the correct order more than 60% of the time.
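To make the step-level metrics concrete, here is a minimal sketch of how step recovery and order consistency could be scored, assuming gold steps are matched to predicted steps by embedding cosine similarity at the 0.35 threshold the paper reports (detailed in the next paragraph). The bag-of-words embedding and the pairwise order metric are stand-in assumptions, not the authors' implementation.

```python
import numpy as np

def embed(texts):
    # Stand-in bag-of-words embedding; the paper presumably uses a learned
    # sentence encoder, so swap in any real embedding model here.
    vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t.lower().split()}))}
    vecs = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            vecs[row, vocab[w]] += 1.0
    # Normalize rows to unit length so dot products below are cosines.
    return vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9, None)

def match_steps(pred_steps, gold_steps, threshold=0.35):
    # Match each gold step to its most similar predicted step; a gold step
    # counts as recovered if that cosine similarity clears the threshold.
    vecs = embed(pred_steps + gold_steps)
    pred, gold = vecs[:len(pred_steps)], vecs[len(pred_steps):]
    sims = gold @ pred.T                      # (n_gold, n_pred) cosine matrix
    recovered = sims.max(axis=1) >= threshold
    recovery = float(recovered.mean())
    # Order consistency (assumed metric): among recovered gold steps, the
    # fraction of pairs whose matched predictions preserve the gold order.
    hits = [int(sims[g].argmax()) for g in range(len(gold_steps)) if recovered[g]]
    pairs = [a <= b for i, a in enumerate(hits) for b in hits[i + 1:]]
    order = sum(pairs) / len(pairs) if pairs else 1.0
    return recovery, order

gold = ["count the red blocks in the image", "compare the count to five"]
pred = ["first count red blocks", "the answer is yes"]
print(match_steps(pred, gold))  # -> (0.5, 1.0): one of two gold steps recovered
```

Swapping the toy embedding for a real sentence encoder changes the scores but not the shape of the metric: recovery penalizes cherry-picked chains, and the order term penalizes steps stated out of sequence.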
To address these shortcomings, the researchers developed a novel training method called CPR Curriculum, which rewards models for complete reasoning chains rather than just correct endpoints. This approach yielded dramatic improvements, boosting reasoning performance by +32% on Qwen2.5 VL 3B and by +93% on InternVL3.5 4B, a model on which standard reward methods had failed entirely. The study acknowledges limitations: there is no single "correct" reasoning path, and the step-matching system (cosine similarity with a 0.35 threshold) agrees with human judgments 84% of the time but struggles on borderline cases. And while CRYSTAL doesn't capture causal dependencies between steps, it provides a crucial new lens, revealing flaws that accuracy metrics alone miss. The dataset and code are available on GitHub and HuggingFace.
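The summary doesn't spell out the CPR Curriculum reward, but the stated idea, rewarding complete reasoning chains rather than correct endpoints alone, suggests a shaped reward along these lines. The blend weights and stage schedule below are purely illustrative assumptions, not the paper's formula.

```python
def cpr_style_reward(answer_correct: bool, step_recovery: float,
                     stage: int, num_stages: int = 3) -> float:
    """Illustrative reward shaping, NOT the paper's actual formula: early
    curriculum stages weight step recovery heavily so the model learns to
    lay out complete chains; later stages shift weight back toward getting
    the final answer right."""
    w_steps = 1.0 - stage / max(num_stages - 1, 1)  # 1.0 -> 0.0 across stages
    w_steps = 0.3 + 0.4 * w_steps                   # keep both terms in play
    return (1 - w_steps) * float(answer_correct) + w_steps * step_recovery

# Early in training, a chain-skipping rollout scores worse than a complete one:
print(cpr_style_reward(True,  0.2, stage=0))  # correct answer, thin chain -> 0.44
print(cpr_style_reward(False, 0.9, stage=0))  # wrong answer, full chain   -> 0.63
```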
- CRYSTAL benchmark contains 6,372 visual questions with validated reasoning steps; testing 20 models exposed a gap between answer accuracy and step-level reasoning.
- GPT-5 showed a 10-point gap (58% accuracy vs. 48% reasoning recovery); 19/20 models cherry-pick steps with poor logical order.
- Novel CPR Curriculum training improved reasoning by up to +93% on some models, suggesting that better evaluation signals can drive better AI design.
Why It Matters
CRYSTAL exposes a critical blind spot in how we evaluate AI, pushing development beyond correct final answers toward verifiable, transparent reasoning chains.