Research & Papers

RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

A new 2,800-instance benchmark reveals that VLMs often fail to generate accurate code for multi-panel charts from real data.

Deep Dive

A research team led by Jiajun Zhang has introduced RealChart2Code, a benchmark designed to rigorously test the ability of Vision-Language Models (VLMs) to generate code for complex data visualizations. Unlike previous benchmarks that relied on simplified or synthetic data, RealChart2Code comprises over 2,800 instances grounded in authentic datasets with clear analytical intent. Crucially, it evaluates two challenging tasks: generating chart code from large-scale raw data, and refining that code through multi-turn conversation. This moves beyond simple chart recognition to assess true data-to-visualization automation capabilities.
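
To make the first task concrete, here is a minimal sketch of what a single chart-generation instance might involve, assuming a small hypothetical sales dataset and a two-panel matplotlib target; the benchmark's actual datasets, prompts, and chart specifications differ and are not reproduced here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical raw data standing in for an instance's authentic dataset;
# real RealChart2Code instances are far larger and messier.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"] * 2,
    "region": ["North"] * 4 + ["South"] * 4,
    "sales": [120, 135, 160, 150, 90, 110, 95, 130],
})

# The model's job: given the raw data and an analytical goal
# ("compare monthly sales across regions"), emit plotting code
# such as this multi-panel figure, one panel per region.
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
for ax, (region, group) in zip(axes, df.groupby("region")):
    ax.plot(group["month"], group["sales"], marker="o")
    ax.set_title(region)
    ax.set_xlabel("month")
axes[0].set_ylabel("sales")
fig.suptitle("Monthly sales by region")
fig.tight_layout()
fig.savefig("chart.png")
```

In the second, multi-turn task, the model would then have to revise code like this in response to follow-up instructions, which is where the reported failures on intricate chart structures become most visible.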

When the team evaluated 14 leading VLMs, including proprietary models like GPT-4V and open-weight alternatives, the results exposed significant limitations. Models showed substantial performance degradation compared with simpler benchmarks, struggling in particular with intricate, multi-panel chart structures. The analysis also revealed a pronounced performance gap between proprietary and open-weight models, and even state-of-the-art systems often failed to accurately replicate complex visualizations from real-world data. These findings offer concrete guidance for future AI development in data science applications.

The benchmark's release includes both the dataset and evaluation code, enabling researchers and developers to systematically test and improve their models. This represents a significant step toward more realistic assessment of AI's data visualization capabilities, moving beyond toy examples to real-world complexity. The work highlights that while VLMs excel at many code generation tasks, translating messy, authentic data into accurate, multi-faceted visualizations remains a formidable challenge requiring focused research attention.
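
The article does not detail the scoring pipeline, but chart-to-code benchmarks are typically evaluated by executing the generated code and comparing the rendered figure against a reference. The sketch below illustrates that general pattern with a hypothetical script runner and a crude pixel-level similarity score; it is not RealChart2Code's actual evaluation code, whose metrics are likely finer-grained.

```python
import subprocess
import numpy as np
from PIL import Image

def score_generated_chart(code_path, reference_png, output_png="pred.png"):
    """Run a model-generated plotting script and compare its rendered output
    to a reference image. Illustrative only: real chart-to-code evaluators
    usually combine execution checks with element-level or perceptual metrics."""
    # Execution check: does the generated code run to completion?
    try:
        result = subprocess.run(["python", code_path],
                                capture_output=True, timeout=60)
    except subprocess.TimeoutExpired:
        return {"executable": False, "similarity": 0.0}
    if result.returncode != 0:
        return {"executable": False, "similarity": 0.0}

    # Crude visual similarity: mean per-pixel agreement after resizing
    # the prediction to the reference image's dimensions.
    pred = Image.open(output_png).convert("RGB")
    ref = Image.open(reference_png).convert("RGB")
    pred = pred.resize(ref.size)
    a = np.asarray(pred, dtype=np.float32) / 255.0
    b = np.asarray(ref, dtype=np.float32) / 255.0
    return {"executable": True,
            "similarity": float(1.0 - np.abs(a - b).mean())}
```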

Key Points
  • First benchmark with 2,800+ instances using authentic datasets and multi-turn code refinement evaluation
  • Reveals that the 14 tested VLMs show a significant performance drop on complex charts versus simpler benchmarks
  • Exposes substantial gap between proprietary and open-weight models in real-world data visualization tasks

Why It Matters

Data scientists need realistic benchmarks to assess which AI tools can truly automate complex visualization workflows from messy real data.