Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi
The challenge: a long-term risk of knowledge collapse.
Solution: watermarking, i.e., embedding subtle statistical signals during generation.
Key question: does watermarking preserve the alignment properties we need for safe deployment?
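The KGW-style watermark used in our experiments can be sketched as follows: hash the previous token to seed a pseudorandom partition of the vocabulary into a "green list" (fraction γ), then add a bias δ to green-token logits before sampling. This is a toy illustration on a 10-token vocabulary, not the actual implementation; the helper names are ours.

```python
import hashlib
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.25) -> set:
    """Partition the vocabulary: a fraction gamma of token ids form the
    'green list', seeded by a hash of the previous token (KGW's core idea)."""
    rng = random.Random(hashlib.sha256(str(prev_token).encode()).hexdigest())
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermarked_probs(logits, prev_token, delta=2.0, gamma=0.25):
    """Add bias delta to green-token logits, then renormalize via softmax."""
    green = green_list(prev_token, len(logits), gamma)
    biased = [l + (delta if i in green else 0.0) for i, l in enumerate(logits)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: uniform logits over 10 tokens, delta=2, gamma=0.25.
probs = watermarked_probs([0.0] * 10, prev_token=7, delta=2.0, gamma=0.25)
```

The biased distribution overweights green tokens, which is what a detector later tests for; it is also the distortion that our results show can degrade alignment.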

LLaMA-8B-Inst with KGW watermark (δ=2, γ=0.25)

Watermarking reduces truthfulness; KGW (orange) causes larger drops than Gumbel (green).

Phi-3-Mini: safer but more overrefusals (Guard Amplification). Qwen: more helpful but less safe (Guard Attenuation).

Stronger watermark (higher τ) → better detectability but worse alignment. Even the distortion-free Gumbel scheme degrades alignment.
Theorem (Watermarking Gap Bound)
For reward \(r\) with Gaussian distribution (variance \(\sigma^2\)), Best-of-\(n\) watermarked policy \(\pi_w^{(n)}\), and reference \(\pi_{ref}\):
\[\mathbb{E}_{\pi_w^{(n)}}[r] - \mathbb{E}_{\pi_{ref}}[r] \geq -\varepsilon + C\sqrt{\log n}\]
where \(\varepsilon\) = watermark degradation, \(C = \sigma / \sqrt{\pi \log 2}\)
Interpretation: recovery grows as \(\sqrt{\log n}\); even \(n=2\) provides a significant improvement, with diminishing returns beyond \(n=4\).
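To make the \(\sqrt{\log n}\) behavior concrete, the bound can be evaluated numerically. The values of \(\sigma\) and \(\varepsilon\) below are illustrative placeholders, not measured quantities from the paper.

```python
import math

sigma = 1.0   # reward standard deviation (illustrative)
eps = 0.5     # watermark degradation epsilon (illustrative)
C = sigma / math.sqrt(math.pi * math.log(2))

# Lower bound on the reward gap, -eps + C*sqrt(log n), for increasing n.
bounds = {n: -eps + C * math.sqrt(math.log(n)) for n in (1, 2, 4, 8, 16)}
for n, b in bounds.items():
    print(f"n={n:2d}  bound={b:+.3f}")
```

The printed bounds increase with \(n\), but each doubling of \(n\) adds less than the previous one, matching the diminishing-returns claim.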

Best-of-N (n=4) recovers truthfulness to match or exceed unwatermarked baseline across all models.
Before (Watermarked) vs. After (Best-of-N)

Best-of-N restores models toward the optimal balance, reducing both unsafe responses and overrefusals.
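The Best-of-N mitigation itself is a one-liner: draw \(n\) watermarked samples and keep the one a reward model scores highest. The sketch below uses hypothetical stand-ins for the generator and reward model (a cycling string source and a length-based score) purely to show the selection logic.

```python
import itertools

def best_of_n(prompt, generate, reward, n=4):
    """Best-of-N: draw n (watermarked) samples and return the
    highest-reward candidate. Every candidate is still watermarked,
    so detectability is preserved."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins: a deterministic cycling "generator" and a
# length-based "reward model", for illustration only.
_samples = itertools.cycle(["ok", "better answer", "meh", "longest response here"])
gen = lambda prompt: next(_samples)

best = best_of_n("some prompt", gen, reward=len, n=4)
```

Because selection happens after watermarked decoding, the chosen output carries the same statistical signal as any single sample, which is why detection metrics are largely unaffected.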
| Model | Method | FPR ↓ | FNR ↓ | F1 ↑ |
|---|---|---|---|---|
| LLaMA-8B | KGW | 0.059 | 0.065 | 0.937 |
| LLaMA-8B | KGW (BoN-2) | 0.059 | 0.064 | 0.937 |
| LLaMA-8B | Gumbel | 0.059 | 0.025 | 0.959 |
| LLaMA-8B | Gumbel (BoN-2) | 0.059 | 0.033 | 0.955 |
| Phi-3-Mini | KGW | 0.101 | 0.104 | 0.896 |
| Phi-3-Mini | KGW (BoN-2) | 0.101 | 0.089 | 0.904 |
| Phi-3-Mini | Gumbel | 0.081 | 0.039 | 0.941 |
| Phi-3-Mini | Gumbel (BoN-2) | 0.081 | 0.043 | 0.939 |
| Qwen2.5-14B | KGW | 0.063 | 0.061 | 0.937 |
| Qwen2.5-14B | KGW (BoN-2) | 0.063 | 0.076 | 0.929 |
| Qwen2.5-14B | Gumbel | 0.044 | 0.002 | 0.976 |
| Qwen2.5-14B | Gumbel (BoN-2) | 0.044 | 0.003 | 0.976 |
Detection metrics are virtually unchanged: alignment recovery without sacrificing provenance.
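The FPR/FNR numbers above come from thresholding a detection statistic. For KGW-style schemes this is a one-proportion z-test on the green-token count; a minimal sketch, assuming the standard test form (the threshold value is illustrative):

```python
import math

def kgw_z_score(green_count: int, total: int, gamma: float = 0.25) -> float:
    """One-proportion z-test used by KGW-style detectors: compare the
    observed green-token count to the null expectation gamma * total."""
    mean = gamma * total
    std = math.sqrt(total * gamma * (1 - gamma))
    return (green_count - mean) / std

# 90 green tokens out of 200 is far above the null expectation of 50.
z = kgw_z_score(green_count=90, total=200, gamma=0.25)
detected = z > 4.0  # illustrative detection threshold
```

Since Best-of-N only chooses among already-watermarked samples, the green-token statistics (and hence this z-score) are essentially unchanged, consistent with the table.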
Watermarking breaks alignment.
Fix: sample a few, pick the best.
(It’s that simple.)

Safety degradation across the Qwen2.5 family (1.5B–14B)
Key finding: ⚠️ Model scale provides no universal protection against watermark-induced alignment degradation

Deterministic: argmax selection

Stochastic: multinomial sampling
The trade-off: Sacrifice theoretical distortion-freeness for practical diversity
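The two selection rules above can be contrasted in a few lines. This is a simplified sketch: in a real Gumbel watermark the noise would come from a keyed pseudorandom function, which we abstract here as a precomputed `key_noise` vector.

```python
import math
import random

def gumbel_argmax(logits, key_noise):
    """Deterministic: pick the argmax of logits + keyed Gumbel noise.
    With a fixed key this is the distortion-free selection rule."""
    return max(range(len(logits)), key=lambda i: logits[i] + key_noise[i])

def multinomial_sample(logits, rng):
    """Stochastic: sample from the softmax distribution, trading
    theoretical distortion-freeness for output diversity."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights):
        acc += w / z
        if r < acc:
            return i
    return len(logits) - 1

# Deterministic rule: same key_noise always yields the same token.
tok = gumbel_argmax([0.0, 0.0, 5.0], key_noise=[0.1, 0.2, 0.3])

# Stochastic rule: different draws can yield different tokens.
rng = random.Random(0)
tok2 = multinomial_sample([0.0, 0.0, 0.0], rng)
```

The deterministic rule makes generation repeatable (and keeps the distortion-free guarantee) at the cost of diversity; multinomial sampling recovers diversity but perturbs the output distribution.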