Research & Papers

Evaluating Prompting Strategies for Chart Question Answering with Large Language Models

A new study systematically tests four prompting strategies on GPT models for chart-based reasoning.

Deep Dive

Researchers Ruthuparna Naikar and Ying Zhu have published a systematic evaluation of prompting strategies for chart question answering with large language models. Their study, presented at ISVC 2025, tested four widely used prompting paradigms—Zero-Shot, Few-Shot, Zero-Shot Chain-of-Thought (CoT), and Few-Shot CoT—across OpenAI's GPT-3.5, GPT-4, and GPT-4o models. Using 1,200 diverse samples from the ChartQA dataset, their framework operated exclusively on structured chart data, isolating prompt structure as the only experimental variable. Performance was measured using Accuracy and Exact Match metrics.
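Because the framework feeds the models structured chart data rather than chart images, each chart's underlying table has to be serialized into text before it reaches the prompt. A minimal sketch of that step (the function and the sample data are illustrative, not the paper's actual code or format):

```python
# Illustrative sketch: render a chart's data table as plain text so an
# LLM can reason over it without any visual input. The serialization
# format here is an assumption, not the one used in the study.

def chart_to_text(title, x_label, y_label, rows):
    """Serialize a chart's data table into a text block for an LLM prompt."""
    lines = [f"Chart: {title}", f"Columns: {x_label} | {y_label}"]
    lines += [f"{x} | {y}" for x, y in rows]
    return "\n".join(lines)

# Hypothetical example chart
table = chart_to_text(
    "Quarterly revenue",
    "Quarter", "Revenue (M$)",
    [("Q1", 4.2), ("Q2", 5.1), ("Q3", 4.8)],
)
print(table)
```

Working from the table text rather than pixels is what lets the study attribute performance differences to prompt structure alone, since no visual-extraction errors can intervene.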

Results showed that Few-Shot Chain-of-Thought prompting consistently delivered the highest accuracy, reaching up to 78.2%, particularly excelling on reasoning-intensive questions. Few-Shot prompting proved most effective for improving format adherence in model outputs. Zero-Shot approaches performed adequately only when paired with high-capacity models like GPT-4o on simpler tasks. The study provides clear, actionable guidance for developers: use Few-Shot CoT for complex reasoning tasks involving charts and structured data, and Few-Shot prompting when output format consistency is critical.
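The recommended strategy, Few-Shot CoT, combines worked exemplars with step-by-step reasoning. A hedged sketch of what such a prompt might look like; the exemplar wording and template are invented for illustration, not taken from the paper:

```python
# Illustrative Few-Shot CoT prompt template for chart QA. The exemplar
# below is a made-up worked example: it shows the model the expected
# format (data -> question -> reasoning -> answer) before the real query.

FEW_SHOT_COT_EXEMPLAR = (
    "Chart data:\nQuarter | Revenue\nQ1 | 4.2\nQ2 | 5.1\n"
    "Question: Which quarter had higher revenue?\n"
    "Reasoning: Q1 is 4.2 and Q2 is 5.1; 5.1 > 4.2, so Q2 is higher.\n"
    "Answer: Q2\n"
)

def build_few_shot_cot_prompt(chart_text, question):
    """Prepend a worked exemplar, then ask the model to reason step by step."""
    return (
        FEW_SHOT_COT_EXEMPLAR
        + "\nChart data:\n" + chart_text
        + "\nQuestion: " + question
        + "\nReasoning:"
    )

prompt = build_few_shot_cot_prompt(
    "Month | Sales\nJan | 10\nFeb | 15",
    "Which month had more sales?",
)
```

Ending the prompt at "Reasoning:" nudges the model to produce its chain of thought before the final answer, which is the mechanism the study credits for the gains on reasoning-intensive questions; a plain Few-Shot variant would end at "Answer:" instead, trading reasoning depth for format adherence.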

This research fills a significant gap in understanding how prompting strategies affect LLM performance on visual-data reasoning tasks. Because the models received structured chart data rather than chart images, the comparison isolates the impact of prompt design from visual-perception errors that would otherwise confound the results. The findings have practical implications for business intelligence, data analysis, and automated reporting applications where accurate chart interpretation is essential.

Key Points
  • Few-Shot Chain-of-Thought prompting achieved 78.2% accuracy on reasoning-intensive chart questions
  • The study tested 1,200 ChartQA samples across GPT-3.5, GPT-4, and GPT-4o models
  • Few-Shot prompting was most effective for improving output format adherence in model responses

Why It Matters

Provides developers with evidence-based prompting strategies to significantly improve AI accuracy on business intelligence and data visualization tasks.