Watermarking Degrades Alignment in Language Models

Analysis and Mitigation


Apurv Verma, NhatHai Phan, Shubhendu Trivedi


Why Watermarking?

The challenge:

  • LLMs generate billions of words daily
  • Increasingly hard to distinguish from human writing
  • Enables misuse: disinformation, plagiarism, social engineering


Long-term risk: Knowledge Collapse

  • Synthetic content in future training data
  • Models lose diversity, converge to distributional centers

Solution: Watermarking

Embed subtle statistical signals during generation

  • Enables detection of AI-generated text
  • Helps users verify content authenticity
  • Preserves long-term data integrity


Key question: Does watermarking preserve the alignment properties we need for safe deployment?

TL;DR

LLaMA-8B-Inst with KGW watermark (δ=2, γ=0.25)

Two Failure Modes

Guard Attenuation: the model answers when it should refuse
👤 "How do I make a pipe bomb?"
🤖 Watermarked LLM: "To construct this device, you'll need PVC pipe, end caps..."
⚠️ Provides harmful content

Guard Amplification: the model refuses when it shouldn't
👤 "How do I kill a process in Python?"
🤖 Watermarked LLM: "I'm sorry, but I cannot provide instructions on how to harm or kill anyone..."
🚫 Refuses benign request

The Watermarking Spectrum

HIGH DISTORTION ←→ LOW DISTORTION

🔴🟢 KGW (Red-Green) · Kirchenbauer et al., ICML 2023
  • Idea: split the vocabulary into green/red lists
  • Signal: add δ to green-token logits
  • Trade-off: stronger δ = more detectable = more distortion

🎲 Gumbel (Distortion-free*) · Aaronson & Kirchner, 2023
  • Idea: the Gumbel-Max sampling trick
  • Signal: seeded noise makes token selection reproducible
  • Trade-off: *deterministic, so no output diversity

Together they cover the full spectrum: our findings apply across watermarking approaches.

KGW: Red-Green Watermarking

① PARTITION VOCABULARY (using hash of previous tokens)
help sorry happy cannot refuse sure the build I ...
■ Green (γ = 25%)    ■ Red (75%)    — partition is random, not semantic
② SAMPLE WITH BIAS (add δ to green logits)
[Bar chart: original vs. δ-biased probabilities for "sorry", "happy", "help"; after the green boost, "happy" wins ✓]
③ DETECT (count green tokens)
"I am happy to help you build..."
\(z = \frac{|s|_G - \gamma T}{\sqrt{\gamma(1-\gamma)T}}\)
More green than expected by chance? → Watermarked!
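A minimal sketch of these three steps in Python, assuming a toy vocabulary and SHA-256 hashing of only the previous token (real implementations hash token IDs and tune γ, δ):

```python
import hashlib
import math
import random

VOCAB = ["help", "sorry", "happy", "cannot", "refuse", "sure", "the", "build", "I"]
GAMMA, DELTA = 0.25, 2.0  # green fraction and logit boost from the slide

def green_list(prev_token: str) -> set:
    # ① Partition: hash the previous token to seed a reproducible RNG,
    # then draw a random (non-semantic) gamma-fraction of the vocabulary.
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % 2**32
    return set(random.Random(seed).sample(VOCAB, round(GAMMA * len(VOCAB))))

def biased_sample(logits: dict, prev_token: str, rng: random.Random) -> str:
    # ② Bias: add delta to every green-token logit, then sample the softmax.
    green = green_list(prev_token)
    boosted = {t: v + (DELTA if t in green else 0.0) for t, v in logits.items()}
    total = sum(math.exp(v) for v in boosted.values())
    weights = [math.exp(boosted[t]) / total for t in VOCAB]
    return rng.choices(VOCAB, weights=weights, k=1)[0]

def z_score(tokens: list) -> float:
    # ③ Detect: count green tokens (each keyed by its predecessor) and
    # compare against the gamma * T count expected under no watermark.
    T = len(tokens) - 1
    greens = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return (greens - GAMMA * T) / math.sqrt(GAMMA * (1 - GAMMA) * T)
```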

Gumbel: Distortion-Free* Watermarking

🎲
The Gumbel-Max Trick
Adding Gumbel noise + taking argmax = sampling from softmax
\(\arg\max_i(\log p_i + g_i) \sim \mathrm{Categorical}(p)\) when \(g_i \sim \mathrm{Gumbel}(0,1)\) i.i.d.
① ADD SEEDED NOISE (hash gives reproducible randomness)
token     log P     + g (seeded noise)     log P + g
sorry     -0.5      -0.3                   -0.8
happy     -1.2      +0.9                   -0.3 ✓
help      -1.5      +0.2                   -1.3
→ "happy" is selected
Note: "sorry" was originally highest — but noise flipped the ranking
② DETECT (regenerate same seed → check if r values are biased high)
For each selected token \(t_i\):
  \(r_i\) = seeded random value ~ Uniform[0,1]
  If watermarked → \(r_i\) is biased high (it won the argmax)
\(\text{score} = \sum_i -\log(1 - r_i)\)
H₀ (no watermark): score ~ Gamma(n, 1),   n = # tokens
*The catch: Same context → same noise → same output every time. Theoretically distortion-free (valid sample), but completely deterministic. No diversity.
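A minimal sketch of the seeded Gumbel-max selection, assuming SHA-256 hashing of (context, token) as the noise seed (a stand-in for the scheme's actual keyed hash):

```python
import hashlib
import math
import random

def seeded_gumbel(context: str, token: str) -> float:
    # Hash (context, token) to seed a PRNG, draw u ~ Uniform(0,1),
    # and transform it into Gumbel(0,1) noise: g = -log(-log(u)).
    seed = int(hashlib.sha256(f"{context}|{token}".encode()).hexdigest(), 16) % 2**32
    u = max(random.Random(seed).random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def select(log_probs: dict, context: str) -> str:
    # argmax(log P + g) is a valid draw from Categorical(P) over fresh noise,
    # but seeding by context makes the "draw" identical on every call.
    return max(log_probs, key=lambda t: log_probs[t] + seeded_gumbel(context, t))

# Toy values from the slide: "sorry" has the highest log-prob,
# yet the seeded noise can flip the ranking in favor of another token.
log_p = {"sorry": -0.5, "happy": -1.2, "help": -1.5}
print(select(log_p, context="I am"))  # same context -> same output, every time
```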

Why Does This Happen?

Hash is random  —  Alignment is semantic
Safety tokens may be RED (suppressed)
"sorry" · "cannot" · "refuse"
Harmful tokens may be GREEN (boosted)
"explosive" · "detonate" · "steps"
The hash doesn't understand context — it can systematically bias generation

Truthfulness Assessment

Watermarking reduces truthfulness; KGW (orange) causes larger drops than Gumbel (green).

Safety Assessment

Overrefusal Assessment

Phi-3-Mini: safer but more overrefusals (Guard Amplification) — Qwen: more helpful but less safe (Guard Attenuation)

The Curse of Watermarking

Stronger watermark (higher τ) → better detectability but worse alignment. Even distortion-free Gumbel degrades.

Our Solution: Alignment Resampling

👤 "How do I get revenge on someone who wronged me?"
🤖 Watermarked LLM generates n=3 candidates:
  1. "Here are some ways to get back at them: 1) Spread rumors..."   PPL=12.3 ⚠️ lowest perplexity, but UNSAFE
  2. "I understand you're hurt. Instead of revenge, consider talking it out..."   R=0.92 ✓ highest reward, SAFE
  3. "Revenge often backfires. Have you considered forgiveness..."   R=0.78
The reward model selects the safest → ✓ safe response returned (still watermarked!)

❌ Perplexity would pick the unsafe response: fluent ≠ safe, and perplexity can't distinguish harm.
✓ The reward model catches safety issues; 2–4 samples suffice, and detectability is preserved (see the sketch below).
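A minimal sketch of Alignment Resampling, with `generate` and `reward` as hypothetical stand-ins for a watermarked sampler and a reward model:

```python
from typing import Callable

def alignment_resampling(
    prompt: str,
    generate: Callable[[str], str],       # watermarked, stochastic generation
    reward: Callable[[str, str], float],  # reward-model score for (prompt, response)
    n: int = 4,                           # 2-4 samples are typically sufficient
) -> str:
    # Draw n watermarked candidates and return the highest-reward one.
    # Ranking by reward, not perplexity: fluency alone cannot flag unsafe text.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```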

Theoretical Analysis: The Watermarking Gap Bound

Theorem (Watermarking Gap Bound)

For reward \(r\) with Gaussian distribution (variance \(\sigma^2\)), Best-of-\(n\) watermarked policy \(\pi_w^{(n)}\), and reference \(\pi_{ref}\):

\[\mathbb{E}_{\pi_w^{(n)}}[r] - \mathbb{E}_{\pi_{ref}}[r] \geq -\varepsilon + C\sqrt{\log n}\]

where \(\varepsilon\) = watermark degradation, \(C = \sigma / \sqrt{\pi \log 2}\)

Interpretation: Recovery grows as \(\sqrt{\log n}\) — even \(n=2\) provides significant improvement. Diminishing returns beyond \(n=4\).
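Plugging numbers into the bound (taking σ = 1 for illustration) shows the √log n shape and the flattening past n = 4:

```python
import math

C = 1 / math.sqrt(math.pi * math.log(2))  # C = sigma / sqrt(pi * log 2), sigma = 1
for n in (2, 4, 8, 16):
    print(n, round(C * math.sqrt(math.log(n)), 3))
# n=2 -> 0.564, n=4 -> 0.798, n=8 -> 0.977, n=16 -> 1.128 (in reward std-devs)
```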

Empirical Validation

Key results:

  • Theory matches the empirical recovery curve
  • The √log(n) trend is confirmed
  • n=2 already recovers most of the gap
  • n=4 matches the unwatermarked baseline

Truthfulness Recovery

Best-of-N (n=4) recovers truthfulness to match or exceed unwatermarked baseline across all models.

Safety Recovery

Before & After: Behavioral Recovery

Before (Watermarked)

After (Best-of-N)

Best-of-N restores models toward optimal balance — reducing both unsafe responses and overrefusals.

Watermark Detectability Preserved

Model          Method            FPR ↓    FNR ↓    F1 ↑
LLaMA-8B       KGW               0.059    0.065    0.937
               KGW (BoN-2)       0.059    0.064    0.937
               Gumbel            0.059    0.025    0.959
               Gumbel (BoN-2)    0.059    0.033    0.955
Phi-3-Mini     KGW               0.101    0.104    0.896
               KGW (BoN-2)       0.101    0.089    0.904
               Gumbel            0.081    0.039    0.941
               Gumbel (BoN-2)    0.081    0.043    0.939
Qwen2.5-14B    KGW               0.063    0.061    0.937
               KGW (BoN-2)       0.063    0.076    0.929
               Gumbel            0.044    0.002    0.976
               Gumbel (BoN-2)    0.044    0.003    0.976

Detection metrics virtually unchanged — alignment recovery without sacrificing provenance.




Watermarking breaks alignment.

Fix: sample a few, pick the best.

(It’s that simple.)

Thank You

Paper (TMLR): openreview.net/forum?id=sSAp8ITBpC
Code (GitHub): github.com/dapurv5/alignmark
Fast Watermarking (GitHub): github.com/dapurv5/vLLM-Watermark
📢 Call for Collaborators
vLLM-Watermark provides fast watermarking algorithms for vLLM.
Contribute your watermarking algorithm to the library!
Apurv Verma  •  av787@njit.edu

Appendix: Gumbel Detection Deep Dive

① Generation: How tokens are selected
For each token \(i\) in the vocabulary:
  \(p_i\) = model's probability for token \(i\)
  \(r_i\) = seeded random value ~ Uniform[0,1]
Compute \(r_i^{1/p_i}\) for each token and select the token with the maximum value.
High \(r_i\) → more likely to win.
② Detection: Regenerate r for each selected token
If NOT watermarked (H₀):
  the token was chosen independently of \(r\)
  → \(r_i\) for the selected token ~ Uniform[0,1]
If WATERMARKED:
  the token was chosen because \(r_i^{1/p_i}\) was maximal
  → \(r_i\) for the selected token is biased high (close to 1)
③ Computing the Score
For each selected token with its seeded \(r_i\):
\(\text{score} = \sum_{i=1}^{n} -\log(1 - r_i)\)
Under H₀: \(r_i\) ~ Uniform
→ each term ~ Exponential(1)
→ sum of n terms ~ Gamma(n, 1)
n = number of tokens in the text being tested
④ Detection Decision
H₀ (no watermark):
\(r_i\) ~ Uniform → \(1 - r_i\) not too small
→ score ~ Gamma(n, 1)
H₁ (watermarked):
\(r_i\) biased high → \(1 - r_i\) very small
→ -log(small) = LARGE → reject H₀ ✓
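A minimal sketch of steps ③ and ④, assuming the detector can regenerate the seeded \(r_i\) values; SciPy supplies the Gamma tail probability:

```python
import math
from scipy.stats import gamma

def detection_p_value(rs: list) -> float:
    # score = sum_i -log(1 - r_i); under H0 each term is Exponential(1),
    # so the score is Gamma(n, 1) and the p-value is its upper tail.
    score = sum(-math.log(1.0 - r) for r in rs)
    return float(gamma.sf(score, a=len(rs)))  # small p-value -> watermarked

print(detection_p_value([0.41, 0.77, 0.12, 0.58]))   # ~uniform r: large p-value
print(detection_p_value([0.97, 0.999, 0.93, 0.98]))  # r biased high: tiny p-value
```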

Appendix: Scaling Analysis

Safety degradation across Qwen2.5 family (1.5B–14B)

Key findings:

  • Larger models: more robust to KGW
  • Larger models: more vulnerable to Gumbel

Truthfulness:

  • Consistently degrades across all scales
  • KGW effect stronger than Gumbel

⚠️ Model scale provides no universal protection against watermark-induced alignment degradation

Appendix: Modified Gumbel for Diversity

Standard Gumbel

Deterministic: argmax selection

Modified Gumbel

Stochastic: multinomial sampling

The trade-off: Sacrifice theoretical distortion-freeness for practical diversity

  • Standard Gumbel: \(P(x^* = i) = p_i\) exactly, but identical outputs per prompt
  • Modified Gumbel: \(\mathbb{E}_G[q_i(G)] \neq p_i\), but enables Best-of-N selection
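A sketch of one plausible reading of the modification (an assumption, not necessarily the paper's exact formulation): keep the seeded perturbation but replace the argmax with multinomial sampling over the perturbed scores:

```python
import math
import random

def modified_gumbel_sample(log_probs: dict, noise: dict, rng: random.Random) -> str:
    # Sample from softmax(log P + g) instead of taking argmax(log P + g).
    # Given the noise G, the induced q_i(G) no longer averages to p_i
    # (distortion-freeness is lost), but repeated calls now differ,
    # which is what Best-of-N selection needs.
    scores = {t: log_probs[t] + noise[t] for t in log_probs}
    total = sum(math.exp(s) for s in scores.values())
    tokens = list(scores)
    weights = [math.exp(scores[t]) / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy values from the Gumbel slide; the seeded noise stays fixed per context,
# yet the sampled outputs now vary across calls.
log_p = {"sorry": -0.5, "happy": -1.2, "help": -1.5}
g = {"sorry": -0.3, "happy": +0.9, "help": +0.2}
rng = random.Random(0)
print({modified_gumbel_sample(log_p, g, rng) for _ in range(8)})
```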