Research & Papers

[P] Runtime GGUF tampering in llama.cpp: persistent output steering without server restart

Attackers can permanently alter AI model outputs by modifying shared model files during inference.

Deep Dive

Security researcher Piotr Maciej Bednarski has exposed a significant runtime integrity vulnerability in popular local AI inference setups built on llama.cpp. His proof-of-concept tool, llm-inference-tampering, demonstrates how attackers can permanently alter model behavior by modifying GGUF model files while they are actively being served. The attack exploits llama.cpp's default memory-mapping (mmap) behavior, in which model weights are read directly from the on-disk file during inference rather than copied into private memory. If another process has write access to the same GGUF file, it can modify quantization scale values in the output.weight tensor, making certain tokens disproportionately likely to appear in generated outputs.
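The core mechanism can be illustrated with a short sketch. The snippet below is not Bednarski's tool; it only shows, under assumed values, how a process with write access can rewrite bytes inside a GGUF file that a llama.cpp server has memory-mapped. The path, byte offset, and multiplier are hypothetical placeholders, and locating the real block-scale bytes would require parsing the GGUF header and tensor metadata.

```python
# Illustrative sketch only; hypothetical path, offset, and factor, not the actual PoC.
import mmap
import struct

MODEL_PATH = "/models/shared/model.gguf"  # hypothetical shared, writable model path
SCALE_OFFSET = 0x1A2B3C0                  # hypothetical offset of one block scale in output.weight
BOOST_FACTOR = 8.0                        # hypothetical inflation applied to that scale

with open(MODEL_PATH, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    # Read the existing half-precision scale, inflate it, and write it back in place.
    (scale,) = struct.unpack_from("<e", mm, SCALE_OFFSET)
    struct.pack_into("<e", mm, SCALE_OFFSET, scale * BOOST_FACTOR)
    mm.flush()
    # Because the serving process maps the same file, the modified page is what the
    # next matrix multiplication reads; no restart of the server is needed.
```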

Unlike traditional attacks requiring process injection or server restarts, this tampering occurs at the model file level and persists across inference sessions. The vulnerability is particularly dangerous for self-hosted AI deployments where model directories are shared between containers or processes—a common pattern in local development stacks. Bednarski's research challenges the widespread assumption that "model loaded once = immutable behavior," revealing this as an infrastructure-level security issue rather than just a prompt injection concern.

Mitigation strategies include mounting model directories as read-only, isolating serving user permissions, avoiding shared writable model paths with untrusted code, using the --no-mmap flag in llama.cpp where security is critical, and implementing periodic model hash integrity checks. The discovery highlights how local AI inference, often considered more secure than cloud services, carries its own unique attack surfaces that require careful infrastructure hardening.
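As one concrete example of the last mitigation, a minimal hash-check sketch follows. The path, interval, and alerting are assumptions made for illustration; in a real deployment the reference hash would be recorded at deploy time and the alert wired into existing monitoring.

```python
# Minimal sketch of a periodic model-hash integrity check (path and interval are assumptions).
import hashlib
import time

MODEL_PATH = "/models/shared/model.gguf"  # hypothetical model location
CHECK_INTERVAL_S = 300                    # hypothetical check interval in seconds

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-gigabyte models never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Reference hash taken from a trusted copy at deploy time.
reference_hash = sha256_of(MODEL_PATH)

while True:
    time.sleep(CHECK_INTERVAL_S)
    if sha256_of(MODEL_PATH) != reference_hash:
        print(f"ALERT: {MODEL_PATH} changed on disk while being served")
        break
```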

Key Points
  • llama.cpp's default mmap loading means changes written to a GGUF file on disk take effect mid-inference, with no server restart
  • Attackers with write access can persistently steer token generation by altering quantization scales in the output.weight tensor
  • Mitigations include read-only model mounts, permission isolation, the --no-mmap flag, and periodic hash integrity checks

Why It Matters

Self-hosted AI deployments must treat model files, not just prompts, as part of the attack surface; otherwise an attacker with write access can persistently manipulate outputs.