Research & Papers

TSFMAudit detects hidden data contamination in time series AI models

New method reveals if your forecasting model was trained on test data...

Deep Dive

Data contamination, where evaluation data leaks into pretraining, is a growing problem for AI models. For text-based LLMs, exact-match checks offer some protection, but time series signals are continuous and heterogeneous, which makes contamination auditing far harder. A new paper from researchers at Zhejiang University, Salesforce AI, and others proposes TSFMAudit, the first formal framework to detect such leakage in time series foundation models (TSFMs). The method rests on one observation: when a model has already seen a dataset during pretraining, fine-tuning on that same data adapts unusually efficiently, with loss dropping faster and the model’s weights changing less. TSFMAudit measures these two signals through probe adaptation dynamics and flags evaluation sets that are potentially contaminated.
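To make the two adaptation signals concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the linear model, the synthetic data, the `probe_adapt` helper, the 5% plateau tolerance, and the step count and learning rate are all assumptions chosen for illustration. A least-squares fit stands in for pretraining, and a few gradient steps stand in for the probe fine-tuning run.

```python
import numpy as np

def make_dataset(rng, true_w, n=200, noise=0.1):
    """Synthetic regression data standing in for one benchmark dataset."""
    X = rng.normal(size=(n, true_w.size))
    y = X @ true_w + noise * rng.normal(size=n)
    return X, y

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def probe_adapt(w0, X, y, steps=20, lr=0.05):
    """Briefly fine-tune from w0 and record two adaptation signals (hypothetical helper)."""
    w = w0.copy()
    losses = [mse(w, X, y)]
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE loss
        w -= lr * grad
        losses.append(mse(w, X, y))
    # Signal 1: first step at which the loss is within 5% of its final probe value.
    steps_to_plateau = next(t for t, l in enumerate(losses) if l <= 1.05 * losses[-1])
    # Signal 2: how far the weights moved, relative to their starting norm.
    movement = float(np.linalg.norm(w - w0) / np.linalg.norm(w0))
    return {"steps_to_plateau": steps_to_plateau, "movement": movement}

rng = np.random.default_rng(0)
seen_X, seen_y = make_dataset(rng, rng.normal(size=8))
unseen_X, unseen_y = make_dataset(rng, rng.normal(size=8))

# Toy "pretraining": fit the weights on the seen dataset only.
w_pre, *_ = np.linalg.lstsq(seen_X, seen_y, rcond=None)

seen_stats = probe_adapt(w_pre, seen_X, seen_y)
unseen_stats = probe_adapt(w_pre, unseen_X, unseen_y)
print("seen:  ", seen_stats)
print("unseen:", unseen_stats)
```

In this toy setting the seen dataset starts at its loss floor and the weights barely move, while the unseen dataset needs more steps and a larger weight update. A real TSFM is nonlinear and far larger, so the size of the gap there is an empirical question that the paper's experiments address.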

The team evaluated TSFMAudit across 6 major TSFMs (including Lag-Llama, PatchTST, and TimesNet) and 187 benchmark datasets, using documented training sources as ground-truth labels for contamination. Compared with 10 baseline methods from the LLM contamination literature, TSFMAudit achieved the highest detection accuracy, which suggests that time series contamination leaves a distinct adaptive signature. The authors have also released their code and benchmarks. The work matters because TSFMs are increasingly deployed in finance, energy, and weather forecasting, domains where inflated performance claims could lead to costly real-world decisions. By giving auditors a reliable detection tool, TSFMAudit helps ensure that reported accuracy reflects genuine generalization, not data leakage.
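As a rough illustration of how a detector can be scored against documented training sources, the sketch below computes AUROC from audit scores and binary contamination labels. This summary does not say which metric the paper reports, so AUROC is an assumption here, and the scores and labels are made-up placeholder values rather than results from the paper.

```python
def auroc(scores, labels):
    """Probability that a random contaminated item scores above a random clean one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical audit scores (higher = more suspicious) for six (model, dataset) pairs.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
# 1 = dataset listed in the model's documented training sources, 0 = not listed.
labels = [1, 1, 1, 0, 0, 0]

print(round(auroc(scores, labels), 3))
```

A score of 1.0 would mean every contaminated pair ranks above every clean pair, and 0.5 corresponds to random ranking.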

Key Points
  • First formal study of pretraining contamination auditing for time series foundation models
  • Detects contamination via probe adaptation dynamics: 30% faster loss reduction and 40% smaller backbone movement on leaked datasets
  • Evaluated on 6 TSFMs (e.g., Lag-Llama, PatchTST) and 187 datasets, outperforming 10 baselines from LLM literature

Why It Matters

Helps keep evaluation of time series AI honest, guarding against over-optimistic claims in finance, energy, and weather forecasting.