# Performance Benchmarking

Comprehensive benchmarking tool for evaluating the performance of watermarking algorithms on the C4 dataset. Each algorithm runs in an isolated process to ensure complete GPU memory cleanup between runs.
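
A minimal sketch of that isolated-process pattern, using the CLI flags documented under Usage Examples below; whether `run_benchmark.sh` orchestrates runs exactly this way is an assumption:

```python
# Sketch: one fresh Python process per algorithm, so CUDA memory is fully
# released when each process exits (an in-process loop cannot guarantee this).
# The flags mirror the documented benchmark_watermarks.py usage.
import subprocess
import sys

ALGORITHMS = ["OPENAI", "MARYLAND", "MARYLAND_L", "PF"]

for algo in ALGORITHMS:
    subprocess.run(
        [sys.executable, "scripts/benchmark/benchmark_watermarks.py",
         "--model_name", "meta-llama/Llama-3.2-1B",
         "--algorithms", algo,
         "--num_samples", "5000"],
        check=True,  # stop the sweep if any algorithm's run fails
    )
```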

## Quick Start

```bash
# Install required dependencies
pip install tabulate

# From the project root directory
./scripts/benchmark/run_benchmark.sh
```

## Supported Algorithms

| Algorithm | Parameters | Description |
|---|---|---|
| OPENAI | ngram, seed, payload | Power-law transformation with n-gram hashing |
| MARYLAND | ngram, seed, gamma, delta | Statistical watermarking with hypothesis testing |
| MARYLAND_L | ngram, seed, gamma, delta | Maryland algorithm with logit processing |
| PF | ngram, seed, payload | Prefix-free coding watermarking |
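
For reference, the parameter sets above can be written out as a single mapping. The values below mirror the configurations reported under Sample Results; the PF entry and the mapping itself are illustrative assumptions, not the benchmark's actual configuration object:

```python
# Hypothetical per-algorithm defaults, collected from the sample-results
# configurations below. PF does not appear in the sample results, so its
# values are assumed by analogy with OPENAI.
DEFAULT_CONFIGS = {
    "OPENAI":     {"ngram": 2, "seed": 42, "payload": 0},
    "MARYLAND":   {"ngram": 2, "seed": 42, "gamma": 0.5, "delta": 1.0},
    "MARYLAND_L": {"ngram": 2, "seed": 42, "gamma": 0.5, "delta": 1.0},
    "PF":         {"ngram": 2, "seed": 42, "payload": 0},  # assumed
}
```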

## Usage Examples

**Shell Script (Recommended):**

```bash
# Default: all algorithms, 5000 samples, meta-llama/Llama-3.2-1B
./scripts/benchmark/run_benchmark.sh

# Custom model
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-3B

# Specific algorithms
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI MARYLAND"

# Custom sample count
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI PF" 1000
```

**Python Script (Advanced):**

```bash
python scripts/benchmark/benchmark_watermarks.py \
    --model_name meta-llama/Llama-3.2-1B \
    --algorithms OPENAI MARYLAND PF \
    --num_samples 5000 \
    --data_path resources/datasets/c4/processed_c4.jsonl
```

## Output Metrics

The benchmark reports the following metrics for each algorithm:

**Detection Performance** (see the sketch after this list):

- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- FPR (False Positive Rate): FP / (FP + TN)
- FNR (False Negative Rate): FN / (TP + FN)
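
All of these follow directly from the confusion-matrix counts. A minimal, self-contained sketch (the function name and return layout are illustrative, not the benchmark's API):

```python
# Derive the detection metrics above from raw confusion-matrix counts.
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "fpr": fpr, "fnr": fnr}
```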

**Performance Metrics** (see the timing sketch after this list):

- Input Tokens/Second: throughput for input processing
- Output Tokens/Second: throughput for text generation
- Generation Time: time spent generating watermarked and unwatermarked text
- Detection Time: time spent running detection
- Total Time: generation time + detection time
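
A sketch of how such timings can be collected with `time.perf_counter`; `generate_fn` and `detect_fn` stand in for the benchmark's real calls and are assumptions:

```python
# Time a generation pass and a detection pass separately, then derive
# throughput and total time as reported by the benchmark.
import time
from typing import Callable, Sequence

def timed_run(generate_fn: Callable[[], Sequence[int]],
              detect_fn: Callable[[Sequence[int]], None]) -> dict:
    t0 = time.perf_counter()
    output_ids = generate_fn()        # watermarked + unwatermarked generation
    generation_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    detect_fn(output_ids)             # watermark detection pass
    detection_time = time.perf_counter() - t1

    return {
        "generation_time_s": generation_time,
        "detection_time_s": detection_time,
        "total_time_s": generation_time + detection_time,
        "output_tokens_per_s": len(output_ids) / generation_time,
    }
```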

## Sample Results

### LLaMA-3.2-1B Performance on C4 Dataset (500 samples)

**Detection Performance Comparison**

| Algorithm | Configuration | Precision | Recall | F1 Score | Accuracy | FPR |
|---|---|---|---|---|---|---|
| OPENAI | ngram=2, seed=42, payload=0 | 0.941 | 0.996 | 0.968 | 0.967 | 0.062 |
| MARYLAND | ngram=2, seed=42, γ=0.5, δ=1.0 | 0.902 | 0.882 | 0.892 | 0.893 | 0.096 |
| MARYLAND_L | ngram=2, seed=42, γ=0.5, δ=1.0 | 0.899 | 0.962 | 0.929 | 0.927 | 0.108 |

**Performance Metrics Comparison**

| Algorithm | Generation Rate (tokens/s) | Generation Time (s) | Detection Time (s) | Total Time (s) |
|---|---|---|---|---|
| OPENAI | 7,487 | 54.9 | 56.9 | 111.8 |
| MARYLAND | 5,968 | 66.1 | 104.4 | 170.5 |
| MARYLAND_L | 3,093 | 139.1 | 200.7 | 339.8 |

**Key Observations:**

- OPENAI achieves the highest precision (0.941) and recall (0.996) of the three algorithms, making it well suited to applications where both false positives and false negatives must be kept low
- MARYLAND_L offers high recall (0.962) at the cost of the slowest generation speed (3,093 tokens/s)
- MARYLAND delivers moderate performance across all metrics and is faster than MARYLAND_L
- Generation throughput varies significantly: OPENAI (7,487 tokens/s) > MARYLAND (5,968 tokens/s) > MARYLAND_L (3,093 tokens/s)

## Note

Results are saved to the `output/benchmark/` directory along with the full configuration parameters for reproducibility.
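
Assuming each run writes a JSON summary (the file naming and field names below are guesses, not a documented schema), the saved results could be inspected with the `tabulate` dependency installed in Quick Start:

```python
# Hedged sketch: load per-run JSON summaries from output/benchmark/ and
# print a comparison table. File layout and field names are assumptions.
import json
from pathlib import Path

from tabulate import tabulate

rows = []
for path in sorted(Path("output/benchmark").glob("*.json")):  # assumed layout
    with open(path) as f:
        result = json.load(f)
    rows.append([path.stem,
                 result.get("f1"),                    # assumed field name
                 result.get("output_tokens_per_s")])  # assumed field name

print(tabulate(rows, headers=["Run", "F1", "Tokens/s"], tablefmt="github"))
```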