AI Safety

Jane Street's LLM backdoor challenge reveals hidden triggers in DeepSeek-V3 models

Participants crack open LLMs to find secret backdoors using activation analysis and white-box methods.

Deep Dive

The Jane Street backdoor challenge, advertised on Dwarkesh Patel's podcast, tasked participants with finding hidden backdoors in LLMs that behave normally until a secret trigger appears. Three large models (M1, M2, M3) were identified as DeepSeek-V3, a 671B-parameter Mixture-of-Experts model, and were served through an API that also exposed activations for probing. A smaller warmup model was a fine-tuned Qwen2.5-7B-Instruct that can run locally in 24 GB of memory. The author, Cipolla, described their experience exploring these models. Initial prompting yielded occasional odd outputs (e.g., repeating "banana") but no clear backdoors. White-box methods proved more promising: comparing activations across models on identical prompts showed that M1 and M2 were more similar to each other than either was to M3. By applying SVD to the modified MLP layers (gate_proj, up_proj, down_proj) of the warmup model, the author made partial progress in cracking the backdoors.
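The activation comparison described above can be illustrated with a minimal sketch. It assumes each model's activations for the same prompts have already been collected into arrays of shape (n_prompts, hidden_size); the function names and the synthetic data are hypothetical stand-ins, not the challenge API or the author's code, and the numbers are generated only to make the example runnable.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two activation vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_similarity(acts):
    """Mean per-prompt cosine similarity for every pair of models.

    acts maps a model name to an array of shape (n_prompts, hidden_size),
    where row i of every array comes from the same prompt.
    """
    names = sorted(acts)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            sims = [cosine_similarity(x, y)
                    for x, y in zip(acts[first], acts[second])]
            result[(first, second)] = float(np.mean(sims))
    return result

# Synthetic stand-in data (an assumption for illustration, NOT real results):
# M1 and M2 are small perturbations of a shared base, M3 a larger one.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 64))
acts = {
    "M1": base + 0.05 * rng.normal(size=base.shape),
    "M2": base + 0.05 * rng.normal(size=base.shape),
    "M3": base + 0.50 * rng.normal(size=base.shape),
}

sims = pairwise_similarity(acts)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

A pair whose score sits clearly above the others, as M1 and M2 do in this synthetic setup, is the kind of pattern the write-up reports. On its own it says only that two models are closer to each other, not where a trigger lives.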

The challenge highlights the gap between academic backdoor research and real-world detection. Unlike researchers in controlled settings, participants lacked the resources for extensive GPU training and had to work within API constraints. Even with access to model activations, finding the triggers proved non-trivial. The warmup model's modified MLP layers suggested a specific attack vector, but the larger models resisted simple white-box analysis. The author shared partial results and called for better approaches. The exercise underscores how hard it is in practice to secure LLMs that can be secretly modified before deployment.

Key Points
  • Warmup model identified as fine-tuned Qwen2.5-7B-Instruct with modified MLP layers (gate_proj, up_proj, down_proj).
  • Three large models are DeepSeek-V3 (671B MoE), accessible via API with activation probing; M1 and M2 are more similar to each other in activation space than either is to M3.
  • Author used white-box methods (activation comparison, SVD on weight differences) to partially crack backdoors, but full triggers remain elusive.
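The SVD-on-weight-differences idea can be sketched as follows. This is a minimal illustration under stated assumptions: it uses small random matrices in place of real gate_proj, up_proj, or down_proj weights, the low-rank modification is invented for the example, and the helper name `top_singular_directions` is hypothetical. It is not the author's actual procedure.

```python
import numpy as np

def top_singular_directions(w_base, w_tuned, k=3):
    """SVD of the weight difference between a tuned and a base matrix.

    Returns the top-k left vectors, singular values, right vectors, and the
    fraction of the difference's squared Frobenius norm they capture.
    """
    delta = w_tuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    total = float((s ** 2).sum())
    energy = float((s[:k] ** 2).sum() / total) if total > 0 else 0.0
    return u[:, :k], s[:k], vt[:k], energy

# Synthetic stand-in weights (assumption): a base matrix plus a hypothetical
# rank-1 modification and a little noise, mimicking a localized edit.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(32, 16))
u_true = rng.normal(size=(32, 1))
v_true = rng.normal(size=(1, 16))
w_tuned = w_base + u_true @ v_true + 1e-3 * rng.normal(size=w_base.shape)

u, s, vt, energy = top_singular_directions(w_base, w_tuned, k=1)

# Compare the recovered input direction with the planted one (sign-agnostic).
alignment = abs(float(vt[0] @ v_true.ravel())) / np.linalg.norm(v_true)
print("energy in top direction:", round(energy, 4))
print("alignment with planted direction:", round(alignment, 4))
```

When a fine-tune changes a layer along only a few directions, most of the difference's energy concentrates in the top singular vectors, which makes them natural candidates for closer inspection. Mapping such a direction back to a concrete trigger is a separate and harder step.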

Why It Matters

This real-world LLM security exercise shows that backdoor detection remains a formidable problem for professional AI safety researchers.