Research & Papers

Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

A new benchmark shows AI models falter on multilingual footnotes and historical references, and that specialized adaptation is needed to close the gap.

Deep Dive

A new academic benchmark reveals a critical gap in how large language models handle scholarly citations, particularly in the Social Sciences and Humanities (SSH). Researchers from the University of Lausanne and EPFL evaluated models including DeepSeek-V3.1, Mistral-Small-3.2-24B, and Gemma-3-27B-it against the established GROBID pipeline. The test used three challenging datasets—CEX, EXCITE, and LinkedBooks—designed to reflect real-world SSH conditions where references are multilingual, buried in footnotes, and follow archaic formatting. The results show that while basic reference extraction is largely solved, the structured parsing of these messy citations remains a significant bottleneck due to model brittleness with noisy document layouts.

The study found that lightweight fine-tuning via LoRA adaptation provided consistent performance gains, especially on SSH-heavy tasks. More importantly, the researchers propose a pragmatic, hybrid deployment strategy. This involves routing well-structured, standard PDFs to the efficient GROBID system, while automatically escalating complex documents with heavy footnotes or multiple languages to more capable, task-adapted LLMs. This approach balances cost, accuracy, and robustness, moving beyond a one-size-fits-all model evaluation to a practical system design for real academic and publishing workflows.
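The routing logic described above can be sketched as a simple heuristic dispatcher. Everything below is illustrative: the `DocProfile` features, threshold values, and backend names are assumptions for the sketch, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class DocProfile:
    """Per-document features a pre-processing pass might extract (hypothetical)."""
    footnote_ratio: float  # fraction of references found in footnotes
    languages: set         # languages detected in the reference sections
    layout_noise: float    # 0..1 score for OCR/layout irregularities

def route(doc: DocProfile,
          footnote_threshold: float = 0.3,
          noise_threshold: float = 0.5) -> str:
    """Send clean, standard PDFs to GROBID; escalate footnote-heavy,
    multilingual, or layout-noisy documents to a task-adapted LLM."""
    if (doc.footnote_ratio > footnote_threshold
            or len(doc.languages) > 1
            or doc.layout_noise > noise_threshold):
        return "llm"      # e.g. a LoRA-adapted open-weight model
    return "grobid"       # the cheap, efficient default path

# A clean single-language PDF stays on the cheap path...
print(route(DocProfile(0.05, {"en"}, 0.1)))       # grobid
# ...while a multilingual, footnote-heavy SSH monograph escalates.
print(route(DocProfile(0.6, {"de", "it"}, 0.4)))  # llm
```

The design point is that the expensive model is invoked only when document features predict GROBID will be brittle, which is how the paper balances cost against accuracy.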

Key Points
  • Benchmark tested LLMs (DeepSeek-V3.1, Gemma-3-27B-it) vs. GROBID on three multilingual SSH datasets with messy citations.
  • Basic extraction is near-solved; parsing complex, footnote-heavy references remains a major bottleneck.
  • Recommends a hybrid system: GROBID for clean PDFs, task-adapted LLMs (via LoRA) for complex, multilingual documents.

Why It Matters

This pushes AI tooling beyond clean STEM papers to handle the messy reality of global humanities research, enabling better knowledge graphs.