[R] Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
New 6,372-question benchmark shows GPT-5 gets 58% of final answers right but recovers only 48% of the reference reasoning steps.
A new research paper introduces CRYSTAL, a benchmark of 6,372 visual questions with verified step-by-step reasoning paths, designed to evaluate whether multimodal AI models actually reason through problems or merely guess the final answer. The team tested 20 leading models and found a widespread disconnect: most achieve respectable final-answer accuracy while failing at structured reasoning. GPT-5, for instance, scored 58% on final answers but recovered only 48% of the reference reasoning steps. Smaller models like the 4B-parameter Gemma3 sometimes out-reasoned far larger ones like the 38B InternVL3.5, challenging the assumption that scale guarantees better logic. The benchmark also exposed that 19 of 20 models engage in "cherry-picking" (stating a few correct facts while skipping crucial steps) and that no model kept its reasoning steps in the correct order more than 60% of the time.
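To make the step-level metrics concrete, here is a minimal sketch of how step recovery and order consistency could be scored, assuming gold steps are matched to predicted steps by embedding cosine similarity at the 0.35 threshold the paper reports (detailed in the next paragraph). The bag-of-words embedding and the pairwise order metric are stand-in assumptions, not the authors' implementation.

```python
import numpy as np

def embed(texts):
    # Stand-in bag-of-words embedding; the paper presumably uses a learned
    # sentence encoder, so swap in any real embedding model here.
    vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t.lower().split()}))}
    vecs = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            vecs[row, vocab[w]] += 1.0
    # Normalize rows to unit length so dot products below are cosines.
    return vecs / np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9, None)

def match_steps(pred_steps, gold_steps, threshold=0.35):
    # Match each gold step to its most similar predicted step; a gold step
    # counts as recovered if that cosine similarity clears the threshold.
    vecs = embed(pred_steps + gold_steps)
    pred, gold = vecs[:len(pred_steps)], vecs[len(pred_steps):]
    sims = gold @ pred.T                      # (n_gold, n_pred) cosine matrix
    recovered = sims.max(axis=1) >= threshold
    recovery = float(recovered.mean())
    # Order consistency (assumed metric): among recovered gold steps, the
    # fraction of pairs whose matched predictions preserve the gold order.
    hits = [int(sims[g].argmax()) for g in range(len(gold_steps)) if recovered[g]]
    pairs = [a <= b for i, a in enumerate(hits) for b in hits[i + 1:]]
    order = sum(pairs) / len(pairs) if pairs else 1.0
    return recovery, order

gold = ["count the red blocks in the image", "compare the count to five"]
pred = ["first count red blocks", "the answer is yes"]
print(match_steps(pred, gold))  # -> (0.5, 1.0): one of two gold steps recovered
```

Swapping the toy embedding for a real sentence encoder changes the scores but not the shape of the metric: recovery penalizes cherry-picked chains, and the order term penalizes steps stated out of sequence.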
To address these shortcomings, the researchers developed a novel training method called CPR Curriculum, which rewards models for complete reasoning chains rather than just correct endpoints. This approach yielded dramatic improvements, boosting reasoning performance by +32% on Qwen2.5 VL 3B and by +93% on InternVL3.5 4B, a model on which standard reward methods had failed entirely. The study acknowledges limitations: there is no single "correct" reasoning path, and the step-matching system (cosine similarity with a 0.35 threshold) agrees with human judgments 84% of the time but struggles on borderline cases. And while CRYSTAL doesn't capture causal dependencies between steps, it provides a crucial new lens, revealing flaws that accuracy metrics alone miss. The dataset and code are available on GitHub and HuggingFace.
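The summary doesn't spell out the CPR Curriculum reward, but the stated idea, rewarding complete reasoning chains rather than correct endpoints alone, suggests a shaped reward along these lines. The blend weights and stage schedule below are purely illustrative assumptions, not the paper's formula.

```python
def cpr_style_reward(answer_correct: bool, step_recovery: float,
                     stage: int, num_stages: int = 3) -> float:
    """Illustrative reward shaping, NOT the paper's actual formula: early
    curriculum stages weight step recovery heavily so the model learns to
    lay out complete chains; later stages shift weight back toward getting
    the final answer right."""
    w_steps = 1.0 - stage / max(num_stages - 1, 1)  # 1.0 -> 0.0 across stages
    w_steps = 0.3 + 0.4 * w_steps                   # keep both terms in play
    return (1 - w_steps) * float(answer_correct) + w_steps * step_recovery

# Early in training, a chain-skipping rollout scores worse than a complete one:
print(cpr_style_reward(True,  0.2, stage=0))  # correct answer, thin chain -> 0.44
print(cpr_style_reward(False, 0.9, stage=0))  # wrong answer, full chain   -> 0.63
```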
- CRYSTAL benchmark contains 6,372 visual questions with validated reasoning steps; testing 20 models exposed a gap between answer accuracy and step-level reasoning.
- GPT-5 showed a 10-point gap (58% accuracy vs. 48% reasoning recovery); 19/20 models cherry-pick steps with poor logical order.
- Novel CPR Curriculum training improved reasoning by up to +93% on some models, suggesting that better evaluation signals can drive better AI design.
Why It Matters
CRYSTAL exposes a critical blind spot in how we evaluate AI, pushing development beyond correct final answers toward verifiable, transparent reasoning chains.