[R] Systematic Vulnerability in Open-Weight LLMs: Prefill Attacks Achieve Near-Perfect Success Rates Across 50 Models
Simple local prefill attacks bypass safety training in every major open model tested, including Llama 3 and DeepSeek-R1.
A groundbreaking study by FAR.AI researchers Lukas Struppek, Adam Gleave, and Kellin Pelrine reveals a universal and severe security flaw in open-weight large language models. Their systematic investigation of 'prefill attacks', in which an attacker running a model locally forces its response to begin with attacker-chosen tokens, demonstrated near-perfect success rates across 50 state-of-the-art models, including major families like Llama 3/4, Qwen3, DeepSeek-R1, and GPT-OSS. Unlike complex jailbreaks that require optimization, these attacks are trivial to execute yet consistently effective: seeding the opening tokens biases the model toward compliance before its refusal behavior can engage. The researchers tested 23 distinct attack strategies against 179 unambiguously harmful requests and found that safety mechanisms are often shallow, rarely extending beyond the opening of a response.
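To make the mechanism concrete, here is a minimal sketch of response prefilling during local inference with Hugging Face transformers (the `continue_final_message` option requires transformers >= 4.44). The model id and the deliberately benign prefill string are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch of the prefill mechanism, not the paper's harness.
# Assumption: any locally run open-weight chat model works; the id
# below is just an example, and the prefill is deliberately benign.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Write a haiku about autumn."},
    # Running locally, the attacker renders the chat template themselves,
    # so they can seed the assistant turn with an arbitrary opening
    # instead of letting the model choose its first tokens (the point
    # where refusals normally occur).
    {"role": "assistant", "content": "Sure! Here is the haiku:"},
]

# continue_final_message=True renders the template without closing the
# assistant turn, so generation continues from the seeded prefix.
inputs = tokenizer.apply_chat_template(
    messages,
    continue_final_message=True,
    return_tensors="pt",
)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the attacker controls template rendering end to end, nothing prevents seeding the assistant turn; a hosted API can reject or sanitize such requests server-side, which is why this vector is specific to locally run open-weight models.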
The technical findings are alarming in their consistency: every tested model was vulnerable, scale proved irrelevant (405B-parameter models were as vulnerable as smaller variants), and even sophisticated reasoning models with multi-stage safety checks were compromised. Attack strategies like 'System Simulation' and 'Fake Citation' achieved near-perfect success rates, while model-specific tailored prefills pushed even the most resistant systems above 90% success; a sketch of how such rates are tabulated follows the list below. The root cause is structural: local inference gives the attacker full control over how a response begins, and current safety training does not withstand a compliant opening. As open-weight models approach frontier capabilities, this attack vector enables generation of detailed harmful content (including malware guides and CBRNE information) with minimal technical skill. The paper, available on arXiv, underscores an urgent need for safety integration that runs deeper than surface-level refusal mechanisms as open-weight models become more capable and widely deployed.
- Universal vulnerability across all 50 tested models, including Llama 3, Qwen3, and DeepSeek-R1, with attack success rates approaching 100%
- Scale is irrelevant: 405B-parameter models proved as vulnerable as smaller variants, and reasoning models' multi-stage safety checks were bypassed
- 23 attack strategies tested; approaches like System Simulation and Fake Citation achieved near-perfect success rates despite being trivial to execute
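To clarify what the reported success rates measure, here is a hypothetical sketch of the bookkeeping: each harmful request is run once per model and strategy, a judge labels the completion, and the attack success rate (ASR) is the fraction judged harmful. The helper names below are assumptions, not the paper's published harness.

```python
# Hypothetical ASR tabulation; generate_with_prefill() and judge_harmful()
# are stand-ins, not functions from the paper.

def generate_with_prefill(model_id: str, request: str, prefill: str) -> str:
    """Stand-in: local generation seeded with a prefill (see sketch above)."""
    raise NotImplementedError

def judge_harmful(request: str, response: str) -> bool:
    """Stand-in: harmfulness judge (human review or a trained classifier)."""
    raise NotImplementedError

def attack_success_rates(models, strategies, requests):
    """ASR per (model, strategy): fraction of requests judged harmful.

    `strategies` maps a strategy name to a function that builds a prefill
    string from a request (e.g., a model-specific tailored opening).
    """
    results = {}
    for model_id in models:
        for name, make_prefill in strategies.items():
            hits = sum(
                judge_harmful(req, generate_with_prefill(model_id, req, make_prefill(req)))
                for req in requests
            )
            results[(model_id, name)] = hits / len(requests)
    return results
```

Under this scheme, a "near-perfect" strategy is one whose ASR approaches 1.0 over all 179 requests for a given model.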
Why It Matters
Enables trivial generation of harmful content from locally run models, forcing urgent reconsideration of open-weight LLM safety architectures.