# Performance Benchmarking

Comprehensive benchmarking tool for evaluating the performance of watermarking algorithms on the C4 dataset. Each algorithm runs in an isolated process to ensure complete GPU memory cleanup between runs.
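
A minimal sketch of that isolated-process pattern, using the CLI flags documented under Usage Examples below; whether `run_benchmark.sh` orchestrates runs exactly this way is an assumption:

```python
# Sketch: one fresh Python process per algorithm, so CUDA memory is fully
# released when each process exits (an in-process loop cannot guarantee this).
# The flags mirror the documented benchmark_watermarks.py usage.
import subprocess
import sys

ALGORITHMS = ["OPENAI", "MARYLAND", "MARYLAND_L", "PF"]

for algo in ALGORITHMS:
    subprocess.run(
        [sys.executable, "scripts/benchmark/benchmark_watermarks.py",
         "--model_name", "meta-llama/Llama-3.2-1B",
         "--algorithms", algo,
         "--num_samples", "5000"],
        check=True,  # stop the sweep if any algorithm's run fails
    )
```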

## Quick Start

```bash
# Install required dependencies
pip install tabulate

# From the project root directory
./scripts/benchmark/run_benchmark.sh
```

## Supported Algorithms

| Algorithm | Parameters | Description |
|---|---|---|
| OPENAI | ngram, seed, payload | Power-law transformation with n-gram hashing |
| MARYLAND | ngram, seed, gamma, delta | Statistical watermarking with hypothesis testing |
| MARYLAND_L | ngram, seed, gamma, delta | Maryland algorithm with logit processing |
| PF | ngram, seed, payload | Prefix-free coding watermarking |
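
For reference, the parameter sets above can be written out as a single mapping. The values below mirror the configurations reported under Sample Results; the PF entry and the mapping itself are illustrative assumptions, not the benchmark's actual configuration object:

```python
# Hypothetical per-algorithm defaults, collected from the sample-results
# configurations below. PF does not appear in the sample results, so its
# values are assumed by analogy with OPENAI.
DEFAULT_CONFIGS = {
    "OPENAI":     {"ngram": 2, "seed": 42, "payload": 0},
    "MARYLAND":   {"ngram": 2, "seed": 42, "gamma": 0.5, "delta": 1.0},
    "MARYLAND_L": {"ngram": 2, "seed": 42, "gamma": 0.5, "delta": 1.0},
    "PF":         {"ngram": 2, "seed": 42, "payload": 0},  # assumed
}
```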

## Usage Examples

**Shell Script (Recommended):**

```bash
# Default: all algorithms, 5000 samples, meta-llama/Llama-3.2-1B
./scripts/benchmark/run_benchmark.sh

# Custom model
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-3B

# Specific algorithms
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI MARYLAND"

# Custom sample count
./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI PF" 1000
```

**Python Script (Advanced):**

```bash
python scripts/benchmark/benchmark_watermarks.py \
    --model_name meta-llama/Llama-3.2-1B \
    --algorithms OPENAI MARYLAND PF \
    --num_samples 5000 \
    --data_path resources/datasets/c4/processed_c4.jsonl
```

## Output Metrics

The benchmark reports the following metrics for each algorithm:

**Detection Performance** (see the sketch after this list):

- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- FPR (False Positive Rate): FP / (FP + TN)
- FNR (False Negative Rate): FN / (TP + FN)
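
All of these follow directly from the confusion-matrix counts. A minimal, self-contained sketch (the function name and return layout are illustrative, not the benchmark's API):

```python
# Derive the detection metrics above from raw confusion-matrix counts.
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "fpr": fpr, "fnr": fnr}
```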

**Performance Metrics** (see the timing sketch after this list):

- Input Tokens/Second: throughput for input processing
- Output Tokens/Second: throughput for text generation
- Generation Time: time spent generating watermarked and unwatermarked text
- Detection Time: time spent running detection
- Total Time: generation time + detection time
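
A sketch of how such timings can be collected with `time.perf_counter`; `generate_fn` and `detect_fn` stand in for the benchmark's real calls and are assumptions:

```python
# Time a generation pass and a detection pass separately, then derive
# throughput and total time as reported by the benchmark.
import time
from typing import Callable, Sequence

def timed_run(generate_fn: Callable[[], Sequence[int]],
              detect_fn: Callable[[Sequence[int]], None]) -> dict:
    t0 = time.perf_counter()
    output_ids = generate_fn()        # watermarked + unwatermarked generation
    generation_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    detect_fn(output_ids)             # watermark detection pass
    detection_time = time.perf_counter() - t1

    return {
        "generation_time_s": generation_time,
        "detection_time_s": detection_time,
        "total_time_s": generation_time + detection_time,
        "output_tokens_per_s": len(output_ids) / generation_time,
    }
```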

## Sample Results

### LLaMA-3.2-1B Performance on C4 Dataset (500 samples)

**Detection Performance Comparison**

| Algorithm | Configuration | Precision | Recall | F1 Score | Accuracy | FPR |
|---|---|---|---|---|---|---|
| OPENAI | ngram=2, seed=42, payload=0 | 0.941 | 0.996 | 0.968 | 0.967 | 0.062 |
| MARYLAND | ngram=2, seed=42, γ=0.5, δ=1.0 | 0.902 | 0.882 | 0.892 | 0.893 | 0.096 |
| MARYLAND_L | ngram=2, seed=42, γ=0.5, δ=1.0 | 0.899 | 0.962 | 0.929 | 0.927 | 0.108 |

**Performance Metrics Comparison**

| Algorithm | Generation Rate (tokens/s) | Generation Time (s) | Detection Time (s) | Total Time (s) |
|---|---|---|---|---|
| OPENAI | 7,487 | 54.9 | 56.9 | 111.8 |
| MARYLAND | 5,968 | 66.1 | 104.4 | 170.5 |
| MARYLAND_L | 3,093 | 139.1 | 200.7 | 339.8 |

**Key Observations:**

- OPENAI achieves the highest precision (0.941) and recall (0.996) of the three algorithms, making it well suited to applications where both false positives and false negatives must be kept low
- MARYLAND_L offers high recall (0.962) at the cost of the slowest generation speed (3,093 tokens/s)
- MARYLAND delivers moderate performance across all metrics and is faster than MARYLAND_L
- Generation throughput varies significantly: OPENAI (7,487 tokens/s) > MARYLAND (5,968 tokens/s) > MARYLAND_L (3,093 tokens/s)

## Note

Results are saved to the `output/benchmark/` directory along with the full configuration parameters for reproducibility.
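
Assuming each run writes a JSON summary (the file naming and field names below are guesses, not a documented schema), the saved results could be inspected with the `tabulate` dependency installed in Quick Start:

```python
# Hedged sketch: load per-run JSON summaries from output/benchmark/ and
# print a comparison table. File layout and field names are assumptions.
import json
from pathlib import Path

from tabulate import tabulate

rows = []
for path in sorted(Path("output/benchmark").glob("*.json")):  # assumed layout
    with open(path) as f:
        result = json.load(f)
    rows.append([path.stem,
                 result.get("f1"),                    # assumed field name
                 result.get("output_tokens_per_s")])  # assumed field name

print(tabulate(rows, headers=["Run", "F1", "Tokens/s"], tablefmt="github"))
```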