Information Extraction from Electricity Invoices with General-Purpose Large Language Models
New study shows prompt quality dominates over hyperparameter tuning in LLM extraction.
A new study posted to arXiv evaluates how well general-purpose large language models (LLMs) can extract structured information from semi-structured business documents, specifically Spanish electricity invoices, without any task-specific fine-tuning. The researchers benchmarked two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and six prompting strategies, using a subset of the IDSEM dataset. Their experimental framework treated prompt engineering as the primary variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies.
The results are striking: prompt quality dominates over hyperparameter tuning. The F1-score variation across all parameter configurations was marginal, while the gap between zero-shot and the best few-shot strategy exceeded 19 percentage points. The best configuration, few-shot with cross-validation, achieved an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity, providing an empirical framework for integrating general-purpose LLMs into business document automation without costly fine-tuning.
- Prompt quality dominated: F1-score gap between zero-shot and best few-shot exceeded 19 percentage points across both models.
- Best config (few-shot with cross-validation) hit 97.61% F1 for Gemini 1.5 Pro and 96.11% for Mistral-small.
- Document template structure was the primary determinant of extraction difficulty, not hyperparameter tuning.
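The few-shot setup described above can be illustrated with a minimal sketch: assemble a prompt from exemplar invoice/JSON pairs, then score the model's structured output field by field. The field names, example values, prompt wording, and the micro-F1 scoring shown here are illustrative assumptions, not details taken from the paper, whose exact protocol may differ.

```python
# Hedged sketch of few-shot extraction prompting and field-level F1 scoring.
# All field names, example invoices, and prompt wording are illustrative.
import json


def build_few_shot_prompt(examples, invoice_text):
    """Assemble a few-shot prompt: each exemplar pairs raw invoice text
    with its gold JSON, then the target invoice is appended."""
    parts = ["Extract the fields as JSON from each electricity invoice."]
    for text, fields in examples:
        parts.append(
            f"Invoice:\n{text}\nJSON:\n{json.dumps(fields, ensure_ascii=False)}"
        )
    parts.append(f"Invoice:\n{invoice_text}\nJSON:")
    return "\n\n".join(parts)


def field_f1(predicted, gold):
    """Micro F1 over exact field matches, one common way to score
    structured extraction (the paper's scoring protocol may differ)."""
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical exemplar and target; the model call itself is omitted.
examples = [("Supply point: ES0021  Total: 45.30 EUR",
             {"supply_point": "ES0021", "total": "45.30"})]
prompt = build_few_shot_prompt(examples, "Supply point: ES0099  Total: 12.10 EUR")

pred = {"supply_point": "ES0099", "total": "12.10"}  # simulated model output
gold = {"supply_point": "ES0099", "total": "12.10"}
print(round(field_f1(pred, gold), 2))  # → 1.0
```

Swapping the exemplar set per target document (e.g. choosing exemplars via cross-validation, as in the paper's best configuration) changes only the `examples` argument; the scoring stays the same, which is what makes prompt strategy easy to A/B test against a fixed metric.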
Why It Matters
The study shows that invoice processing can be automated with general-purpose LLMs and well-designed prompts, with no fine-tuning required.