Research & Papers

SLAM: Structural Linguistic Activation Marking for Language Models

100% detection accuracy on Gemma-2 with only 1-2 reward points of quality cost

Deep Dive

A new paper from UCLA researchers Fabrice Harel-Canada and Amit Sahai presents SLAM (Structural Linguistic Activation Marking), a watermarking scheme for large language models that avoids the usual quality trade-off. Unlike existing methods such as KGW, EWD, and Unigram, which bias the next-token distribution and incur measurable text quality loss, SLAM embeds its mark in the structural geometry of the generated text. It uses sparse autoencoders to identify directions in the residual stream that encode linguistic features such as voice, tense, and clause order. At generation time, SLAM causally steers the model's activations along those directions, leaving lexical sampling and semantics unconstrained.
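The steering mechanism can be illustrated with a toy sketch. The paper's actual SAE features, layers, and steering strengths are not reproduced here; this only shows the core move of adding a fixed feature direction to one layer's residual stream, with everything else (weights, the `feature_dir` vector, `alpha`) standing in as hypothetical placeholders.

```python
# Toy sketch of SLAM-style activation steering. A unit vector standing in
# for an SAE decoder direction (e.g., a "passive voice" feature) is added
# to the residual stream between two residual blocks; the rest of the
# forward pass is untouched.
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy residual-stream width
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))

# Hypothetical SAE feature direction (unit norm) and steering strength.
feature_dir = rng.normal(size=D)
feature_dir /= np.linalg.norm(feature_dir)
alpha = 4.0

def forward(x, steer=False):
    h = x + x @ W1                       # residual block 1
    if steer:
        h = h + alpha * feature_dir      # shift along the watermark direction
    return h + h @ W2                    # residual block 2

x = rng.normal(size=D)
base, marked = forward(x), forward(x, steer=True)
print(np.allclose(base, marked))  # False: the steered activations differ
```

In the real method the shifted activations bias the model toward the marked structural choices (e.g., passive rather than active constructions) without re-weighting individual tokens, which is why lexical diversity is largely preserved.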

On Gemma-2 2B and 9B, SLAM demonstrates 100% detection accuracy with a quality cost of only 1-2 reward points on standard metrics, dramatically lower than the 7.5-11.5 points lost by competing methods. Naturalness and diversity scores remain near those of unwatermarked text across both model sizes. Its robustness profile complements that of token-based approaches: SLAM resists word-level edits (e.g., synonym substitution) but is vulnerable to paraphrasing that restructures syntax, although such attacks incur a quality cost of their own. The paper is under review and offers a practical path toward lossless watermarking for production LLMs.

Key Points
  • SLAM uses sparse autoencoders to identify residual-stream directions encoding linguistic structure (voice, tense, clause order) and steers them to embed a watermark.
  • On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy with a quality cost of only 1-2 reward points (vs. 7.5-11.5 for KGW, EWD, and Unigram).
  • Naturalness and diversity are preserved at near-unwatermarked levels; robust to word-level edits but vulnerable to syntax-changing paraphrase (at a quality cost).
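A detection sketch, under loud assumptions: the paper's actual detection statistic is not reproduced here. As a toy, suppose a secret key derives an expected pattern over k binary structural features (active/passive voice, tense, clause order, and so on), and detection counts how many of the features observed in a text match that pattern. All names below (`expected_pattern`, `detect`, the threshold) are hypothetical.

```python
# Toy key-based structural watermark detector (hypothetical, not the
# paper's statistic). A key deterministically yields k target bits; a
# text is flagged as watermarked if enough of its observed structural
# features agree with the target.
import hashlib

def expected_pattern(key: str, k: int) -> list[int]:
    # Derive k pseudorandom target bits from the secret key.
    digest = hashlib.sha256(key.encode()).digest()
    return [(digest[i // 8] >> (i % 8)) & 1 for i in range(k)]

def detect(observed: list[int], key: str, threshold: float = 0.9) -> bool:
    target = expected_pattern(key, len(observed))
    matches = sum(o == t for o, t in zip(observed, target))
    return matches / len(observed) >= threshold

key = "secret"
pattern = expected_pattern(key, 8)
print(detect(pattern, key))                   # True: marked text matches
print(detect([1 - b for b in pattern], key))  # False: flipped features fail
```

This framing also makes the robustness trade-off in the key points concrete: synonym substitution leaves the observed structural bits intact, while a paraphrase that flips voice or clause order corrupts them.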

Why It Matters

SLAM offers near-zero quality cost watermarking for LLMs, enabling reliable detection without degrading output naturalness or diversity.