Temporal Text Classification with Large Language Models
First systematic test shows proprietary LLMs excel at identifying when a text was written, a key task for historians and archivists.
A new research paper provides the first comprehensive benchmark for using Large Language Models (LLMs) to perform Temporal Text Classification (TTC), the task of automatically estimating the publication date of a text. Authored by Nishat Raihan and Marcos Zampieri, the study systematically evaluates leading proprietary models—Claude 3.5, GPT-4o, and Gemini 1.5—against open-source contenders like LLaMA 3.2, Gemma 2, and Mistral. The models were tested on three historical text corpora (two in English, one in Portuguese) using zero-shot prompting, few-shot prompting, and fine-tuning techniques.
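The few-shot setup can be sketched as a prompt that pairs a handful of dated excerpts with the text to be classified. This is a minimal illustration only: the helper name `build_ttc_prompt`, the excerpts, and the period labels are invented placeholders, not data or prompts from the paper's corpora.

```python
def build_ttc_prompt(examples, query_text):
    """Assemble a few-shot prompt asking an LLM to date a text.

    examples: list of (excerpt, period_label) pairs used as demonstrations.
    query_text: the undated text whose publication period is to be estimated.
    """
    lines = [
        "Estimate the publication period of the final excerpt.",
        "Answer with a range of years only.",
        "",
    ]
    # Each demonstration shows the model an excerpt and its known period.
    for excerpt, period in examples:
        lines.append(f"Text: {excerpt}")
        lines.append(f"Period: {period}")
        lines.append("")
    # The query text is left unlabeled for the model to complete.
    lines.append(f"Text: {query_text}")
    lines.append("Period:")
    return "\n".join(lines)


# Illustrative demonstrations with invented period labels.
examples = [
    ("Whilst the carriage waited, she perused the letter.", "1800-1850"),
    ("The website crashed during the livestream.", "2000-2025"),
]
prompt = build_ttc_prompt(examples, "The telegraph office closed at dusk.")
print(prompt)
```

The resulting string would be sent to a model's chat or completion endpoint; the zero-shot condition is the same prompt with an empty `examples` list.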
The results reveal a clear performance hierarchy. Proprietary models from OpenAI, Anthropic, and Google delivered strong results, particularly when given a few examples in the prompt (few-shot). While fine-tuning substantially improved the capabilities of open-source models, the open models still failed to match the accuracy of their closed-source counterparts. This gap highlights the current advantage in temporal reasoning held by the most advanced commercial models. The study establishes a crucial baseline for a specialized task with applications in historical research, digital archiving, and content verification.
The findings suggest that for professionals needing to date documents or analyze linguistic change over time, current top-tier proprietary LLMs are the most effective tools. The research also underscores that simply fine-tuning a smaller open-source model is not yet a substitute for the nuanced temporal understanding embedded in larger, frontier models. This work opens the door for more specialized AI applications in humanities and social sciences, providing a methodology and benchmark for future development in temporal language modeling.
Key Takeaways
- First systematic benchmark of LLMs like GPT-4o and Claude 3.5 on Temporal Text Classification (dating texts).
- Proprietary models outperformed open-source ones, especially with few-shot prompting on English and Portuguese corpora.
- Fine-tuning improved open-source models (LLaMA 3.2, Gemma 2) but couldn't close the performance gap with commercial leaders.
Why It Matters
Provides a benchmark for historians and archivists to use AI for dating documents and analyzing linguistic change over time.