Watermarking has emerged as a critical tool for ensuring the authenticity of LLM outputs. However, its broader effects on model behavior remain underexplored. In our paper, “Watermarking Degrades Alignment in Language Models: Analysis and Mitigation,” presented at the 1st GenAI Watermarking Workshop at ICLR 2025, we investigate how watermarking impacts key alignment properties such as truthfulness, safety, and helpfulness.
While watermarking methods are designed to minimally perturb token selection, even small distributional shifts can lead to measurable degradation in model alignment — a phenomenon that standard metrics like perplexity fail to capture.
Figure 1: Simplex plot of the safety profiles of watermarked LLMs. KGW and Gumbel watermarking schemes tend to increase unsafe responses or over-refusal behaviors (left), while Alignment Resampling recovers or surpasses the baseline (unwatermarked) alignment scores (right).
Curse of Watermarking
Our analysis reveals two distinct behavioral shifts introduced by watermarking: guard amplification and guard attenuation. In guard amplification, models become overly cautious, increasing refusal rates even for benign queries. In guard attenuation, increased helpfulness weakens the safety profile, making models more likely to produce unsafe outputs.
These effects illustrate a fundamental trade-off: increasing watermark detectability tends to worsen alignment. We refer to this phenomenon as the Watermarking Dilemma.
We propose a simple inference-time technique called Alignment Resampling that samples multiple watermarked generations and selects the best using an external reward model. Theoretically, we show that the alignment recovered through best-of-$n$ sampling grows on the order of $\sqrt{\log n}$ in the number of generations:
$$ \mathbb{E}[r_{\text{best of }n}] - \mathbb{E}[r_{\text{unwatermarked}}] \geq C \sqrt{\log n} - \epsilon $$
where $C$ is a constant, $\epsilon$ captures the initial alignment degradation caused by watermarking, and $n$ is the number of generations.
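Rearranging the bound gives a rough sense of how many generations are needed: the watermarking-induced gap $\epsilon$ is closed once

$$ C\sqrt{\log n} \geq \epsilon \quad\Longleftrightarrow\quad n \geq \exp\!\left(\left(\frac{\epsilon}{C}\right)^{2}\right), $$

so when $\epsilon / C$ is modest, only a handful of generations is required, consistent with the empirical results below.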
Modified Gumbel Watermarking
The following snippet gives simplified pseudo-code for the default Gumbel watermark implementation.
# Original Gumbel Watermark (distortion-free)
import torch

seed = hash(tuple(preceding_tokens))         # Hash previous tokens
rng = torch.Generator().manual_seed(seed)    # Deterministic generator
rs = torch.rand(vocab_size, generator=rng)   # Uniform randomness r_i
scores = torch.pow(rs, 1 / probs)            # Gumbel-max trick: r_i^(1/p_i)
next_token = torch.argmax(scores)            # Deterministic selection
Next, we show a slightly modified version in which the final argmax is replaced with multinomial sampling.
# Modified Gumbel Watermark
seed = hash(tuple(preceding_tokens))         # Hash previous tokens
rng = torch.Generator().manual_seed(seed)    # Deterministic generator
rs = torch.rand(vocab_size, generator=rng)   # Uniform randomness r_i
scores = torch.pow(rs, 1 / probs)            # Gumbel-max trick: r_i^(1/p_i)
next_token = torch.multinomial(scores, num_samples=1)  # Stochastic selection
This transformation preserves the relative ordering of scores from the Gumbel-max trick. However, instead of taking the argmax, the modified implementation introduces an additional source of randomness by sampling from the scores with multinomial sampling. Consequently, the output is no longer equivalent to sampling from the softmax distribution, which violates the distortion-free property of the Gumbel watermark. In exchange, this change allows us to generate diverse samples under the Gumbel watermark.
Alignment Resampling (AR)
KGW is compatible with Alignment Resampling (AR) by default. With the modification above, we also make the Gumbel watermark compatible with AR. AR is an inference-time mitigation technique that samples multiple watermarked completions for a given prompt and selects the best according to an external reward model. In the paper, we also evaluate a variant that selects the best completion by perplexity and still observe degraded alignment metrics, suggesting that the alignment degradation observed with watermarking is not merely an incidental consequence of increased perplexity; rather, it is an intrinsic effect of the watermarking itself.
Empirical results show that sampling just two to four completions per query suffices to recover, and often surpass, the alignment metrics of unwatermarked models across a range of LLMs and watermarking methods.
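For concreteness, here is a minimal sketch of AR as best-of-$n$ selection. The functions `generate_watermarked` and `reward_model` are hypothetical stand-ins for a watermarked decoder (KGW or the modified Gumbel scheme above) and an external reward model, not the exact interfaces used in the paper.

import torch

def alignment_resampling(prompt, generate_watermarked, reward_model, n=4):
    """Best-of-n Alignment Resampling (sketch).

    generate_watermarked(prompt) -> str: one watermarked completion (hypothetical stand-in).
    reward_model(prompt, completion) -> float: external reward score (hypothetical stand-in).
    """
    # Draw n independent watermarked completions for the same prompt.
    completions = [generate_watermarked(prompt) for _ in range(n)]
    # Score each completion with the external reward model.
    rewards = torch.tensor([reward_model(prompt, c) for c in completions])
    # Keep the completion with the highest reward.
    return completions[rewards.argmax().item()]

Because the modified Gumbel decoder is stochastic, the $n$ completions differ from one another, which is what makes best-of-$n$ selection meaningful here.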
Broader Implications
Reliable watermarking is essential for safeguarding the provenance of AI-generated content. However, watermarking techniques must be carefully designed and evaluated to ensure they do not compromise core alignment objectives.
This work shows that the tension between watermark strength and alignment can be effectively managed at inference time, offering a practical solution for responsible deployment of watermarked models.
Citation
If you found this work helpful, please consider citing our paper:
@inproceedings{vermawatermarking,
title={Watermarking Degrades Alignment in Language Models: Analysis and Mitigation},
author={Verma, Apurv and Phan, Hai and Trivedi, Shubhendu},
booktitle={The 1st Workshop on GenAI Watermarking},
year={2025}
}