AI Safety

Explaining undesirable model behavior: (How) can influence functions help?

Researchers show how influence functions can identify which training examples cause harmful outputs, even in trillion-token datasets.

Deep Dive

A new research paper from Jinesis AI Lab and EuroSafeAI demonstrates how influence functions (IFs) are becoming a critical mechanistic tool for AI safety and alignment. The technique, which mathematically approximates how much a single training example affects a model's output, addresses the core 'garbage in, garbage out' problem by letting researchers efficiently identify problematic data points within trillion-token web-scale datasets. Recent applications show IFs can trace harmful or biased LLM outputs, such as those on SocialHarmBench, where adversarial success rates reach 98%, back to specific training documents, determining whether failures stem from propaganda, euphemistic media, or confabulation.
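
In the classical formulation, the influence of a training point z on a test point z_test is I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z), where H is the Hessian of the training loss. The sketch below computes this exactly for a toy linear regression; the model, data, and damping value are illustrative assumptions, and at LLM scale the inverse-Hessian product must be approximated, as discussed next.

```python
import torch

# Classical influence function (Koh & Liang, 2017), written as
#   I(z_train, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z_train)
# The toy model, data, and damping term below are illustrative assumptions;
# forming H explicitly is only feasible at this tiny scale.

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)            # 5 parameters in total
params = list(model.parameters())

def loss_fn(x, y):
    return torch.nn.functional.mse_loss(model(x), y)

X_train, y_train = torch.randn(8, 4), torch.randn(8, 1)  # toy "corpus"
x_test, y_test = torch.randn(1, 4), torch.randn(1, 1)    # query of interest

def flat_grad(loss, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

# Explicit (damped) Hessian of the mean training loss
g = flat_grad(loss_fn(X_train, y_train), create_graph=True)
rows = [torch.cat([r.reshape(-1) for r in
                   torch.autograd.grad(g[i], params, retain_graph=True)])
        for i in range(g.numel())]
H = torch.stack(rows) + 1e-3 * torch.eye(g.numel())

# Inverse-Hessian-vector product for the test-point gradient
ihvp = torch.linalg.solve(H, flat_grad(loss_fn(x_test, y_test)))

# Score every training example; a positive score means upweighting that
# example would increase the loss on the query (it is "harmful" here)
for i in range(len(X_train)):
    g_i = flat_grad(loss_fn(X_train[i:i+1], y_train[i:i+1]))
    print(f"train example {i}: influence = {-(g_i @ ihvp).item():+.4f}")
```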

Technically, the field has advanced significantly since Koh & Liang's 2017 introduction, with Grosse et al. (2023) scaling IFs to LLMs with up to 52B parameters using curvature approximations like KFAC and EKFAC. Subsequent optimizations include Choe et al.'s LoGra for acceleration and tools like LogIX. Practically, this enables two major use cases: auditing benchmark contamination by distinguishing genuine reasoning from memorization, and tracing misaligned safety responses to implement durable data-level 'unlearning.' The main current limitation is that IFs require access to the training data, restricting analysis to open-data models like OLMo and excluding open-weight but closed-data models such as Llama and DeepSeek.
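
A minimal sketch of why Kronecker-factored approximations make this tractable: if a layer's curvature factorizes as H ≈ A ⊗ S, the inverse-Hessian-vector product reduces to two small solves via the identity (A ⊗ S)^{-1} vec(V) = vec(S^{-1} V A^{-1}). The dimensions and random factors below are placeholder assumptions, not the actual EKFAC pipeline from the paper.

```python
import torch

# KFAC-style curvature: per-layer H is approximated as a Kronecker product
# H = A (x) S, where A is the input covariance and S the output-gradient
# covariance. The inverse-Hessian-vector product then never materializes
# the full (d_in*d_out)^2 matrix. All matrices here are random placeholders.

torch.manual_seed(0)
d_in, d_out = 64, 32

def spd(d):
    """Random symmetric positive-definite Kronecker factor."""
    m = torch.randn(d, d, dtype=torch.float64)
    return m @ m.T + d * torch.eye(d, dtype=torch.float64)

A, S = spd(d_in), spd(d_out)                       # Kronecker factors
V = torch.randn(d_out, d_in, dtype=torch.float64)  # one layer's gradient

# Factored route: two small solves instead of one huge one
ihvp_kfac = torch.linalg.solve(S, V) @ torch.linalg.inv(A)

# Naive route for verification: build the full Kronecker product explicitly
H = torch.kron(A, S)
vec_v = V.T.reshape(-1)                            # column-major vec(V)
ihvp_naive = torch.linalg.solve(H, vec_v).reshape(d_in, d_out).T

print(torch.allclose(ihvp_kfac, ihvp_naive))       # True
```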

Key Points
  • Influence Functions scale to LLMs with 52B parameters using KFAC/EKFAC approximations
  • Can trace harmful outputs, including those behind 98% adversarial success rates on benchmarks like SocialHarmBench, back to specific training documents
  • Enables data-level 'unlearning' of harmful influences as an alternative to RLHF patches (see the sketch below)
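
As a concrete, hypothetical pattern for the unlearning use case: score training documents by their estimated influence on a harmful completion, drop the worst offenders, and retrain on the filtered corpus. The scorer below is a stub standing in for any IF backend (such as an EKFAC-based one), not a real library call.

```python
from typing import Callable, List, Tuple

def filter_harmful_influences(
    documents: List[str],
    influence_on_harmful_output: Callable[[str], float],
    threshold: float = 0.0,   # illustrative cutoff: drop net-harmful docs
) -> Tuple[List[str], List[str]]:
    """Split a corpus into (kept, removed) by influence score.

    A positive score here means the document increases the probability
    of the harmful completion under the influence-function estimate.
    """
    kept, removed = [], []
    for doc in documents:
        score = influence_on_harmful_output(doc)
        (removed if score > threshold else kept).append(doc)
    return kept, removed

# Toy usage with a stub scorer (assumption: propaganda-like text scores high)
docs = ["benign cooking recipe", "propaganda tract", "news article"]
stub_scorer = lambda d: 1.0 if "propaganda" in d else -0.1
kept, removed = filter_harmful_influences(docs, stub_scorer)
print(kept, removed)  # retrain / fine-tune on `kept` for durable unlearning
```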

Why It Matters

Provides a mechanistic audit trail for AI safety, moving from black-box fixes to targeted data interventions.