Research & Papers

Information Extraction from Electricity Invoices with General-Purpose Large Language Models

New study shows prompt quality dominates over hyperparameter tuning in LLM extraction.

Deep Dive

A new study from arXiv evaluates how well general-purpose large language models (LLMs) can extract structured information from semi-structured business documents, specifically Spanish electricity invoices, without any task-specific fine-tuning. The researchers benchmarked two architecturally distinct models—Gemini 1.5 Pro and Mistral-small—across 19 parameter configurations and 6 prompting strategies, using a subset of the IDSEM dataset. Their experimental framework treated prompt engineering as the primary variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies.
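The zero-shot versus few-shot distinction the paper benchmarks can be sketched as simple prompt assembly. The field names, invoice text, and prompt wording below are illustrative assumptions, not taken from the paper or the IDSEM dataset:

```python
def build_prompt(invoice_text, fields, examples=()):
    """Assemble an extraction prompt for an LLM.

    With no examples this is a zero-shot baseline; passing labeled
    (invoice_text, json_output) pairs turns it into a few-shot prompt.
    """
    lines = [f"Extract the fields {', '.join(fields)} from the invoice as JSON."]
    for text, output in examples:  # few-shot demonstrations, if any
        lines += [f"Invoice: {text}", f"Output: {output}"]
    lines += [f"Invoice: {invoice_text}", "Output:"]
    return "\n".join(lines)
```

The assembled string would then be sent to the model (Gemini 1.5 Pro or Mistral-small in the study) and the JSON completion parsed into structured fields.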

The results are striking: prompt quality dominates over hyperparameter tuning. The F1-score variation across all parameter configurations was marginal, while the gap between zero-shot and the best few-shot strategy exceeded 19 percentage points. The best configuration, few-shot with cross-validation, achieved an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity, providing an empirical framework for integrating general-purpose LLMs into business document automation without costly fine-tuning.
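For reference, a field-level F1-score like the ones quoted above can be computed as below. Treating a field as correct only on an exact value match is an assumption about the scoring, not a detail confirmed by the paper:

```python
def field_f1(predicted, gold):
    """Field-level F1 over two dicts of extracted invoice fields.

    A predicted field counts as a true positive only when its value
    exactly matches the gold value for the same field name.
    """
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting two fields correctly out of three gold fields, with no spurious fields, yields a recall of 2/3 at perfect precision, i.e. F1 = 0.8.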

Key Points
  • Prompt quality dominated: F1-score gap between zero-shot and best few-shot exceeded 19 percentage points across both models.
  • Best config (few-shot with cross-validation) hit 97.61% F1 for Gemini 1.5 Pro and 96.11% for Mistral-small.
  • Document template structure was the primary determinant of extraction difficulty, not hyperparameter tuning.

Why It Matters

Shows that invoice processing can be automated with general-purpose LLMs and well-designed prompts alone, with no task-specific fine-tuning required.