Most text-watermarking papers ship code that nobody can actually use. Each codebase is welded to a specific experimental setup, and none of them are fast. MarkLLM consolidates several watermarking methods, but Gumbel watermarking still took hours on my benchmarks.

Last spring I needed to compare two watermarking schemes for my paper. The reference implementations would have taken weeks to run, so I built vLLM-Watermark to fix this. It provides minimal implementations of KGW and Gumbel watermarking running at vLLM speed. Generation that took hours now takes minutes.

Design

vLLM has a logit processor hook. If your watermark just biases the logits, you’re done. KGW works this way. Gumbel watermarking is different. It controls the RNG during sampling, then replaces sampling with argmax. There’s no API for that.
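
For concreteness, here is a minimal sketch of a KGW-style bias wired through that hook. It is illustrative, not the library's code: the hash is simplified, and it assumes vLLM's classic logits_processors callable shape, (generated token ids, logits) -> logits, which has shifted across versions.

import torch

def kgw_processor(seed: int = 42, ngram: int = 1, gamma: float = 0.5, delta: float = 2.0):
    """Illustrative KGW-style logits processor: bias a pseudorandom green list."""
    def process(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        if len(token_ids) < ngram:
            return logits
        # Seed a CPU generator from the last `ngram` tokens (simplified hash).
        key = seed
        for t in token_ids[-ngram:]:
            key = (key * 1_000_003 + t) % (2**31 - 1)
        g = torch.Generator(device="cpu")
        g.manual_seed(key)
        vocab = logits.shape[-1]
        # The first gamma fraction of a seeded permutation is the green list.
        green = torch.randperm(vocab, generator=g)[: int(gamma * vocab)]
        logits[green.to(logits.device)] += delta
        return logits
    return process

# e.g. SamplingParams(logits_processors=[kgw_processor()])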

I ended up monkey patching vLLM’s sampler directly. Each watermark is a small class (under 100 lines) that intercepts sampling and returns modified samples. The patching is isolated so it doesn’t break every time vLLM updates. It is not elegant, but it is fast.
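
The pattern itself is simple. A hedged sketch, with sampler_cls standing in for whichever internal Sampler class the installed vLLM version exposes, and watermark.sample as a hypothetical hook rather than the library's actual interface:

def install_sampler_patch(sampler_cls, watermark):
    """Illustrative patching pattern: wrap the sampler's forward in place."""
    original_forward = sampler_cls.forward

    def patched_forward(self, logits, *args, **kwargs):
        # Let the watermark reseed the RNG and/or replace sampling entirely
        # (e.g. seeded Gumbel noise + argmax), falling back to the stock path
        # when it chooses not to intervene.
        return watermark.sample(original_forward, self, logits, *args, **kwargs)

    sampler_cls.forward = patched_forward
    # Return a callable that restores the original method.
    return lambda: setattr(sampler_cls, "forward", original_forward)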

Watermarking implementations need the generator and detector to produce the same random sequence from the same seed. One thing that tripped me up: PyTorch's CPU and CUDA RNGs don't do this; the same seed yields different streams on each device. If you generate on GPU but run detection on CPU, the sequences don't match and detection fails silently.
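
The fix is to pin the shared randomness to one device. A minimal sketch of the idea (function name is mine, not the library's):

import torch

def seeded_scores(seed_key: int, vocab_size: int) -> torch.Tensor:
    # Keep the generator on CPU on both sides: a torch.Generator(device="cuda")
    # seeded with the same value produces a *different* stream, which is exactly
    # the silent generation/detection mismatch described above.
    g = torch.Generator(device="cpu")
    g.manual_seed(seed_key)
    return torch.rand(vocab_size, generator=g)

# Generator (GPU box) and detector (often CPU) both call this with the same
# seed_key derived from the preceding tokens, so the scores line up.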

vLLM integration has other benefits. Batching, memory optimization, and parallelism come for free. No hand-rolled sampling loops in PyTorch.

Benchmarks

Gumbel sees the largest speedup because it can't use vLLM's logit processor API.

Method    Reference impl.    vLLM-Watermark    Speedup
KGW       45 min             8 min             5.6×
Gumbel    3 hours            12 min            15×

Benchmarked on 1000 examples with Llama-2-7b.

Limitations

This only works on a single GPU. Monkey patching happens at runtime, and you can’t patch objects across Python processes. vLLM’s multiprocessing has to be disabled.

The patching may also break when vLLM updates its internals. I’ve isolated the patching logic to minimize this, but there’s no guarantee. If you need multi-GPU or long-term stability, this isn’t the right tool. For single-GPU experiments, the speed gain is worth the fragility.

Quick Start

Install from source (not yet on PyPI):

git clone https://github.com/dapurv5/vLLM-Watermark.git
cd vLLM-Watermark
pip install -e .

Generate:

import os

# The sampler patch is applied at runtime, so vLLM must run in-process:
# use the V1 engine with multiprocessing disabled.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams
from vllm_watermark.core import WatermarkedLLMs, WatermarkingAlgorithm

llm = LLM(model="meta-llama/Llama-3.2-1B")
wm_llm = WatermarkedLLMs.create(
    llm,
    algo=WatermarkingAlgorithm.MARYLAND,  # KGW-style green-list watermark
    seed=42,
    ngram=2,    # previous tokens used to seed the green list
    delta=2.0   # logit bias added to green-list tokens
)

prompts = ["Explain what an LLM watermark is in one paragraph."]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = wm_llm.generate(prompts, sampling_params)

Detect:

from vllm_watermark.core import WatermarkDetectors, DetectionAlgorithm

detector = WatermarkDetectors.create(
    algo=DetectionAlgorithm.MARYLAND_Z,  # z-score detector for the KGW watermark
    model=llm,
    ngram=2,         # must match the settings used at generation time
    seed=42,
    gamma=0.5,       # fraction of the vocabulary in the green list
    delta=2.0,
    threshold=0.05
)

# Score the text produced above (first prompt, first completion).
generated_text = outputs[0].outputs[0].text
result = detector.detect(generated_text)
print(f"Watermarked: {result['is_watermarked']}")

Conclusion

If you’re running single-GPU experiments, there’s no reason to use the standard PyTorch reference implementations. They’re too slow. This framework isn’t “production ready.” It requires disabling multiprocessing and patching the runtime. But it works. It cuts iteration time from hours to minutes.

I’d welcome implementations of other watermarking algorithms. A generator class, a detection function, and a usage example — that’s all it takes. If you find a cleaner way to inject logic into vLLM, send a pull request.

Documentation: vLLM-Watermark.

Citation

If you use this framework in your research, please consider citing:

@software{vllm_watermark,
  title  = {vLLM-Watermark: A tiny, hackable research framework for
            LLM watermarking experiments},
  author = {Verma, Apurv},
  year   = {2025},
  url    = {https://github.com/dapurv5/vLLM-Watermark}
}