Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
138k expert Q&A pairs expose why AI can't think like a botanist yet.
A team of researchers from multiple institutions has released PlantInquiryVQA, a novel benchmark designed to test multimodal large language models (MLLMs) on multi-step, intent-driven visual reasoning. Unlike standard vision evaluations that rely on single-turn question answering, this benchmark mimics how expert botanists diagnose plant diseases: they inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. The dataset includes 24,950 expert-curated plant images and 138,068 question-answer pairs, each annotated with visual grounding, severity labels, and domain-specific reasoning templates.
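To make the annotation structure concrete, here is a minimal sketch of what a single PlantInquiryVQA record might look like. The field names (`image_path`, `grounding_bbox`, `severity`, and so on) are illustrative assumptions; the article does not specify the dataset's actual release format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one PlantInquiryVQA example. All field names
# are assumptions for illustration; the real release format may differ.
@dataclass
class InquiryTurn:
    question: str         # intent-driven question posed at this step
    answer: str           # expert-curated reference answer
    intent: str           # e.g. "identify_symptom", "assess_severity"
    grounding_bbox: tuple # (x, y, w, h) image region the question refers to

@dataclass
class PlantInquiryExample:
    image_path: str  # path to the expert-curated leaf image
    species: str     # plant species label
    severity: str    # e.g. "mild" / "moderate" / "severe"
    diagnosis: str   # final expert diagnosis
    inquiry_chain: list = field(default_factory=list)  # ordered InquiryTurn steps

# Example record mirroring the multi-step structure described above.
example = PlantInquiryExample(
    image_path="images/tomato_0001.jpg",
    species="tomato",
    severity="moderate",
    diagnosis="early blight (Alternaria solani)",
    inquiry_chain=[
        InquiryTurn("What lesions are visible on the leaf?",
                    "Concentric brown rings on the older leaves.",
                    "identify_symptom", (120, 80, 60, 45)),
        InquiryTurn("How widespread is the damage?",
                    "Roughly a third of the leaf area is affected.",
                    "assess_severity", (0, 0, 512, 512)),
    ],
)
```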
Evaluations of top-tier MLLMs revealed a significant gap: while models could describe visual symptoms adequately, they struggled with safe clinical reasoning and accurate diagnosis. However, when models followed a structured chain-of-inquiry, guided by sequential, intent-driven questions, diagnostic correctness improved substantially, hallucinations decreased, and reasoning efficiency increased. The work, accepted at ACL 2026 Findings, positions PlantInquiryVQA as a foundational benchmark for training diagnostic agents that reason like expert botanists rather than like static classifiers. The research highlights a critical limitation in current AI systems and points toward a more robust framework for agricultural and medical diagnostics.
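The chain-of-inquiry protocol can be pictured as a loop that feeds each intent-driven question, together with the model's earlier answers, back to the model before asking for a final diagnosis. The sketch below assumes a generic `ask_mllm(image, prompt)` client function; it illustrates the protocol described in the article, not the authors' actual evaluation code.

```python
def chain_of_inquiry_diagnosis(image, inquiry_questions, ask_mllm):
    """Run a multi-step, intent-driven dialogue before diagnosing.

    `ask_mllm(image, prompt) -> str` is a placeholder for any
    multimodal-LLM client; `inquiry_questions` is the ordered list
    of intent-driven questions for this example.
    """
    transcript = []
    for question in inquiry_questions:
        # Each step conditions on the accumulated findings so far,
        # mimicking how a botanist narrows the hypothesis space.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)
        prompt = f"{context}\nQ: {question}\nA:" if context else f"Q: {question}\nA:"
        answer = ask_mllm(image, prompt)
        transcript.append((question, answer))

    # Only after the guided inquiry is the model asked to commit to a
    # diagnosis, grounded in its own intermediate answers.
    findings = "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)
    final_prompt = (f"{findings}\n"
                    "Based on the findings above, what is the most likely "
                    "diagnosis, and how severe is it?")
    return ask_mllm(image, final_prompt), transcript
```

A single-turn baseline collapses this loop into one direct "what disease is this?" prompt, which is the setting where the benchmark finds models most prone to hallucinated diagnoses.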
- Dataset includes 24,950 expert-curated plant images and 138,068 QA pairs with visual grounding and severity labels.
- Top MLLMs like GPT-4V describe symptoms but fail at safe clinical reasoning and accurate diagnosis.
- Structured chain-of-inquiry questioning improves diagnostic correctness and reduces hallucination.
- Accepted at ACL 2026 Findings, positioning PlantInquiryVQA as a foundational benchmark for diagnostic AI agents.
Why It Matters
This benchmark exposes a critical blind spot in MLLMs for real-world diagnostic tasks, pushing AI toward expert-level reasoning.