Research & Papers

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

Gemini 2.5 Pro beats GPT-4.1, but all models struggle with frequency-domain tasks.

Deep Dive

PlotChain is a new deterministic benchmark that tests multimodal LLMs on reading quantitative values from 450 engineering plots, such as Bode plots and stress-strain curves. Models answer in strict JSON at zero temperature. The top models are closely grouped: Gemini 2.5 Pro leads with an 80.42% pass rate, just ahead of GPT-4.1 (79.84%) and Claude Sonnet 4.5 (78.21%). The larger gaps are between task types: on frequency-domain tasks such as bandpass response, pass rates fall to 23% or lower.
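To make the scoring idea concrete, here is a minimal Python sketch of how a strict-JSON, tolerance-based pass rate could be computed. It is an illustration, not PlotChain's actual scorer: the field names (`cutoff_hz`, `gain_db`), the 5% relative tolerance, and the helpers `score_response` and `pass_rate` are all assumptions made for this example.

```python
import json
import math


def score_response(raw: str, truth: dict, rel_tol: float = 0.05) -> dict:
    """Return a pass/fail flag per ground-truth field.

    Assumption: a field passes when the model's number is within
    rel_tol (hypothetical 5%) of the expected value. Output that is
    not a valid JSON object fails every field.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {name: False for name in truth}
    if not isinstance(parsed, dict):
        return {name: False for name in truth}

    results = {}
    for name, expected in truth.items():
        value = parsed.get(name)
        results[name] = (
            isinstance(value, (int, float))
            and not isinstance(value, bool)  # bool is a subclass of int
            and math.isclose(value, expected, rel_tol=rel_tol)
        )
    return results


def pass_rate(all_results: list) -> float:
    """Fraction of fields that passed across all scored responses."""
    flags = [ok for result in all_results for ok in result.values()]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical ground truth for one Bode-style plot.
truth = {"cutoff_hz": 1000.0, "gain_db": -3.0}
scored = [
    score_response('{"cutoff_hz": 1020, "gain_db": -3.1}', truth),  # both within 5%
    score_response("The cutoff is about 1 kHz.", truth),            # not JSON
    score_response('{"cutoff_hz": 1500}', truth),                   # off, and a field missing
]
print(f"pass rate: {pass_rate(scored):.2%}")
```

Because the scorer depends only on the parsed numbers and fixed tolerances, the same responses always produce the same score. A purely relative tolerance breaks down when the expected value is zero; a real scorer would likely need an absolute tolerance for such fields.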

Why It Matters

These results point to a major blind spot for multimodal AI in science and engineering, where accurate data extraction from graphs is essential.