Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
New agent framework uses a manager and workers to reason across multiple subplots.
Advanced chart question answering demands precise perception of small visual elements and multi-step reasoning across multiple subplots, two tasks where current multimodal large language models (MLLMs) often fall short. To address this, Qihua Dong and colleagues from multiple institutions introduce HierVA, a hierarchical visual agent framework that manages context in a joint image-text space. A high-level manager generates plans and maintains a compact working context containing only key information, while specialized worker agents carry out the reasoning steps, gather evidence, and return results. A key innovation is the separation of visual and textual contexts, with a zoom-in tool that restricts the visual field to the relevant chart regions. This design lets HierVA decompose a complex query into smaller steps and update its context iteratively, so irrelevant data does not distract the model.
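The manager and worker loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Step`, `ManagerContext`, `run_manager`, `scripted_planner`, and `toy_worker` are all hypothetical, the region format is an assumption, and the worker is a stub standing in for an MLLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Assumed region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Step:
    """One unit of work the manager hands to a worker (illustrative)."""
    instruction: str  # sub-question for the worker
    region: Box       # the only part of the chart the worker should look at


@dataclass
class ManagerContext:
    """Compact working context: the question plus distilled findings only."""
    question: str
    findings: List[str] = field(default_factory=list)


def run_manager(
    question: str,
    plan_next: Callable[[ManagerContext], Optional[Step]],
    worker: Callable[[Step], str],
    max_steps: int = 8,
) -> ManagerContext:
    """Plan one step at a time, delegate it, and keep only the returned result."""
    ctx = ManagerContext(question)
    for _ in range(max_steps):
        step = plan_next(ctx)  # the manager plans from its compact context
        if step is None:       # the planner signals that it has enough evidence
            break
        # The worker receives only its step, not the manager's full history.
        ctx.findings.append(worker(step))
    return ctx


def scripted_planner(steps: List[Step]) -> Callable[[ManagerContext], Optional[Step]]:
    """Stand-in planner that replays a fixed list of steps."""
    def plan_next(ctx: ManagerContext) -> Optional[Step]:
        done = len(ctx.findings)
        return steps[done] if done < len(steps) else None
    return plan_next


def toy_worker(step: Step) -> str:
    """Stand-in for a worker agent; a real one would query an MLLM."""
    return f"{step.instruction}: inspected region {step.region}"


if __name__ == "__main__":
    steps = [
        Step("read the peak in subplot (a)", (0, 0, 200, 150)),
        Step("read the peak in subplot (b)", (200, 0, 400, 150)),
    ]
    ctx = run_manager("Which subplot has the higher peak?",
                      scripted_planner(steps), toy_worker)
    for finding in ctx.findings:
        print(finding)
```

The point of the sketch is the information flow: the manager's context grows by one short finding per step, and each worker sees only the instruction and region it was given.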
On the reasoning subset of CharXiv, a challenging benchmark for chart understanding, HierVA consistently outperforms strong multimodal baselines. Ablation studies show that each component of the architecture contributes complementary gains: the hierarchical structure enables efficient task decomposition, the scoped visual context prevents information overload, and the distilled context ensures that only essential information is retained. The paper, accepted to ACL 2026, points to a promising direction for making MLLMs more reliable in data-heavy domains such as financial analysis, scientific visualization, and business intelligence, where reasoning across multiple charts is common.
- HierVA uses a high-level manager for planning and low-level workers for reasoning, mimicking a hierarchical team structure.
- The framework introduces a zoom-in tool that restricts visual context to key chart regions, improving accuracy on fine-grained perception.
- On the CharXiv reasoning subset, HierVA shows consistent gains over baselines; ablations confirm that the hierarchical design, scoped vision, and distilled context each add value.
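To make the idea of a scoped visual context concrete, here is a small sketch of a zoom-in operation as a plain crop. It assumes the image is a list of pixel rows and the region is a `(left, top, right, bottom)` box; both the function name `zoom_in` and this representation are illustrative assumptions, not the tool's actual interface.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom)


def zoom_in(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return only the pixels inside `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    # Clamp the requested region so an oversized box cannot index out of range.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("requested region is empty after clamping")
    return [list(row[left:right]) for row in image[top:bottom]]


if __name__ == "__main__":
    # A 3x4 toy "chart" whose pixel values encode row * 10 + column.
    chart = [[r * 10 + c for c in range(4)] for r in range(3)]
    print(zoom_in(chart, (1, 0, 3, 2)))  # prints [[1, 2], [11, 12]]
```

Everything outside the box is dropped before a worker sees the image, which is the sense in which the visual context is restricted.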
Why It Matters
HierVA improves how MLLMs handle complex reasoning across multiple charts and subplots, a capability that matters for data analysis in finance, science, and business.