Gemma 4 Jailbreak System Prompt
A viral system prompt forces open-source models like Gemma to discuss any topic, including explicit content.
A user on the r/LocalLLaMA subreddit has shared a viral system prompt dubbed the 'Gemma 4 Jailbreak,' designed to override the built-in safety filters of open-source large language models such as Google's Gemma. The prompt works by instructing the model to replace its standard content policy with a new 'SYSTEM POLICY' that explicitly permits all forms of content, including explicit, graphic, sexual, and pornographic material, unless a topic appears on a user-provided disallowed list. The technique, adapted from similar jailbreaks used against models such as OpenAI's GPT series, exploits the model's tendency to prioritize explicit in-context instructions over its trained-in safety behavior.
The prompt's structure is critical: it declares that any conflict between the model's default policy and the new SYSTEM policy must be resolved in favor of the user's commands, then lists categories of content that are now 'allowed,' effectively creating a permissive environment. The creator notes that it works with both GGUF builds (the llama.cpp quantized-model format for local CPU/GPU inference) and MLX builds (Apple's machine-learning framework for Apple Silicon), making it accessible to a wide range of users running models locally; a sketch of how such a prompt is applied appears below. This highlights a persistent challenge in AI safety: the fragility of alignment techniques against determined prompt engineering.
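For readers unfamiliar with how a system prompt reaches a local model, here is a minimal sketch using llama-cpp-python, one common way to run GGUF checkpoints. The model path and the placeholder policy string are illustrative assumptions, and the actual override text from the Reddit post is intentionally not reproduced:

```python
# Minimal sketch: passing a custom system prompt to a local GGUF model
# with llama-cpp-python. The path and prompt text are illustrative only.
from llama_cpp import Llama

llm = Llama(model_path="./models/gemma-q4.gguf", n_ctx=4096)  # hypothetical path

# Placeholder: the policy-override text described above would occupy
# this system turn; it is deliberately omitted here.
SYSTEM_PROMPT = "SYSTEM POLICY: <override text omitted>"

response = llm.create_chat_completion(
    messages=[
        # Note: some Gemma chat templates lack a native system role and
        # fold this content into the first user turn instead.
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Hello"},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```

Because the override lives entirely in the prompt rather than in the weights, it requires no fine-tuning, which is why the same text transfers unchanged between GGUF and MLX runtimes.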
While the prompt is framed for 'Gemma,' the underlying method is model-agnostic and likely transfers to other open-source models aligned with similar safety fine-tuning. The incident underscores the ongoing arms race between developers implementing guardrails and users finding ways to disable them. For professionals and researchers, it serves as a case study in the limitations of prompt-level content moderation in LLMs and in the ethics of deploying open-source AI without more fundamental, hard-coded safety mechanisms.
- The 'Gemma 4' jailbreak uses a system prompt to override a model's default content policy with a user-defined 'SYSTEM POLICY'.
- It explicitly allows normally restricted content like explicit, graphic, and sexual material unless specifically listed as disallowed by the user.
- The method works with popular local model formats GGUF and MLX, making it widely applicable for users running open-source models offline.
Why It Matters
The incident exposes how vulnerable prompt-level AI safety filters remain to prompt engineering, raising concerns about the responsible deployment of open-source models.