
PRIMETIME: Limits of LLMs in Temporal Primitives

LLM accuracy on basic datetime math ranges from near-zero to perfect, but fine-tuning on synthetic data lets small models reach frontier-level accuracy.

Deep Dive

A new paper from researchers Edward Gaere and Florian Wangenheim introduces PRIMETIME, a synthetic data generator designed to diagnose and improve temporal reasoning in large language models (LLMs). Unlike existing benchmarks that conflate multiple skills into one score, PRIMETIME breaks temporal reasoning into two primitive operations: parsing datetime strings and performing arithmetic on them (e.g., adding days). The generator creates unlimited, uncontaminated exemplars in canonical forms, so each primitive can be evaluated in isolation. Across the models and prompting conditions tested, accuracy ranged from near-zero to perfect, indicating that current LLMs do not handle basic temporal operations reliably.
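To make the two primitives concrete, here is a minimal Python sketch of what a generator of this kind might look like. It is not the authors' code: the canonical format string, the sampled date range, the question wording, and every function name are assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta

# Assumed canonical form; the paper's actual formats are not given in this summary.
CANONICAL = "%Y-%m-%d %H:%M:%S"


def random_datetime(rng):
    """Draw a datetime uniformly from 1970-01-01 up to (not including) 2070-01-01."""
    start = datetime(1970, 1, 1)
    span_seconds = int((datetime(2070, 1, 1) - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(span_seconds))


def add_days(text, days):
    """Ground truth for the arithmetic primitive: parse, add days, re-format."""
    result = datetime.strptime(text, CANONICAL) + timedelta(days=days)
    return result.strftime(CANONICAL)


def parsing_exemplar(rng):
    """One exemplar for the parsing primitive (hypothetical question wording)."""
    dt = random_datetime(rng)
    text = dt.strftime(CANONICAL)
    return {
        "task": "parse",
        "question": f"What is the month number in '{text}'?",
        "answer": str(dt.month),
    }


def arithmetic_exemplar(rng):
    """One exemplar for the arithmetic primitive (hypothetical question wording)."""
    text = random_datetime(rng).strftime(CANONICAL)
    days = rng.randint(1, 365)
    return {
        "task": "arithmetic",
        "question": f"What is '{text}' plus {days} days?",
        "answer": add_days(text, days),
    }


# A seeded generator makes a run reproducible; a fresh seed yields new exemplars.
rng = random.Random(0)
for exemplar in (parsing_exemplar(rng), arithmetic_exemplar(rng)):
    print(exemplar)
```

Because every exemplar is sampled at generation time and its answer is computed by the standard library, the supply is effectively unlimited and the exact strings are unlikely to appear in any training corpus, which is the sense in which such data is uncontaminated.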

The paper's constructive contribution is equally significant: small quantized transformers fine-tuned with LoRA on PRIMETIME-generated data achieve frontier-level accuracy on the composed Event Planning task. This demonstrates that the primitives are fully learnable with targeted synthetic data, and that the same generator used for diagnosis can also produce production-ready models. The broader implication is that this methodological pattern, in which a single synthetic generator serves both evaluation and remediation, could extend beyond temporal reasoning to other domains where LLMs exhibit superficial understanding.
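The dual use of one generator can be sketched as follows: the same exemplar records are scored by exact match for diagnosis and serialized as prompt/completion pairs for fine-tuning. This is an illustrative sketch only. The record schema, the function names, and the stub model are assumptions, and the paper's actual metric and training format may differ.

```python
import json
from collections import defaultdict

# Hand-written exemplars in an assumed schema: task, question, answer.
EXEMPLARS = [
    {"task": "parse",
     "question": "What is the month number in '2024-02-28 12:00:00'?",
     "answer": "2"},
    {"task": "arithmetic",
     "question": "What is '2024-02-28 12:00:00' plus 2 days?",
     "answer": "2024-03-01 12:00:00"},
]


def accuracy_by_primitive(exemplars, predict):
    """Exact-match accuracy per task; `predict` is any callable str -> str."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in exemplars:
        total[ex["task"]] += 1
        if predict(ex["question"]).strip() == ex["answer"]:
            correct[ex["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


def to_training_record(exemplar):
    """Serialize one exemplar as a JSON line (assumed prompt/completion schema)."""
    return json.dumps({"prompt": exemplar["question"],
                       "completion": exemplar["answer"]})


def stub_model(question):
    """Stand-in for a real model call; it always answers '2'."""
    return "2"


# Diagnosis: per-primitive scores show which skill fails.
print(accuracy_by_primitive(EXEMPLARS, stub_model))

# Remediation: the same records become fine-tuning data.
for ex in EXEMPLARS:
    print(to_training_record(ex))
```

Reporting accuracy per primitive rather than as one pooled number is what separates a parsing failure from an arithmetic failure, and the stub model here is a deliberately extreme case of that split.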

Key Points
  • PRIMETIME isolates two temporal primitives, datetime parsing and datetime arithmetic, and finds accuracy ranging from near-zero to perfect across models and prompting conditions.
  • Existing benchmarks conflate skills and offer no remediation path; PRIMETIME provides uncontaminated, unlimited exemplars.
  • Fine-tuning on PRIMETIME data lets small quantized LoRA models match frontier LLMs on the composed Event Planning task.

Why It Matters

Synthetic generators like PRIMETIME can systematically diagnose and fix fundamental LLM gaps, enabling reliable temporal reasoning for real-world applications.