Performance Benchmarking
========================

Comprehensive benchmarking tool for evaluating the performance of watermarking algorithms on the C4 dataset. Each algorithm runs in an isolated process to ensure complete GPU memory cleanup between algorithms.

Quick Start
-----------

.. code-block:: bash

   # Install required dependencies
   pip install tabulate

   # From project root directory
   ./scripts/benchmark/run_benchmark.sh

Supported Algorithms
--------------------

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Parameters
     - Description
   * - **OPENAI**
     - ngram, seed, payload
     - Power-law transformation with n-gram hashing
   * - **MARYLAND**
     - ngram, seed, gamma, delta
     - Statistical watermarking with hypothesis testing
   * - **MARYLAND_L**
     - ngram, seed, gamma, delta
     - Maryland algorithm with logit processing
   * - **PF**
     - ngram, seed, payload
     - Prefix-free coding watermarking

Usage Examples
--------------

**Shell Script (Recommended):**

.. code-block:: bash

   # Default: all algorithms, 5000 samples, meta-llama/Llama-3.2-1B
   ./scripts/benchmark/run_benchmark.sh

   # Custom model
   ./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-3B

   # Specific algorithms
   ./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI MARYLAND"

   # Custom sample count
   ./scripts/benchmark/run_benchmark.sh meta-llama/Llama-3.2-1B "OPENAI PF" 1000

**Python Script (Advanced):**

.. code-block:: bash

   python scripts/benchmark/benchmark_watermarks.py \
       --model_name meta-llama/Llama-3.2-1B \
       --algorithms OPENAI MARYLAND PF \
       --num_samples 5000 \
       --data_path resources/datasets/c4/processed_c4.jsonl

Output Metrics
--------------

The benchmark provides comprehensive metrics for each algorithm.

**Detection Performance:**

* **Precision:** TP / (TP + FP)
* **Recall:** TP / (TP + FN)
* **F1 Score:** 2 × (Precision × Recall) / (Precision + Recall)
* **Accuracy:** (TP + TN) / (TP + TN + FP + FN)
* **FPR:** False Positive Rate = FP / (FP + TN)
* **FNR:** False Negative Rate = FN / (TP + FN)

**Performance Metrics:**

* **Input Tokens/Second:** Throughput for input processing
* **Output Tokens/Second:** Throughput for text generation
* **Generation Time:** Time spent generating watermarked and unwatermarked text
* **Detection Time:** Time spent running detection
* **Total Time:** Generation time + detection time
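The detection metrics above are simple functions of the confusion-matrix counts (TP, FP, TN, FN) accumulated over the watermarked and unwatermarked samples. The sketch below is illustrative only; the ``ConfusionCounts`` container and ``detection_metrics`` helper are hypothetical and not part of the benchmark script.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class ConfusionCounts:
       tp: int  # watermarked texts correctly flagged as watermarked
       fp: int  # unwatermarked texts incorrectly flagged as watermarked
       tn: int  # unwatermarked texts correctly left unflagged
       fn: int  # watermarked texts the detector missed


   def detection_metrics(c: ConfusionCounts) -> dict:
       """Compute the detection metrics listed above from raw counts."""
       precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
       recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
       f1 = (2 * precision * recall / (precision + recall)
             if (precision + recall) else 0.0)
       return {
           "precision": precision,
           "recall": recall,
           "f1": f1,
           "accuracy": (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn),
           "fpr": c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0,
           "fnr": c.fn / (c.tp + c.fn) if (c.tp + c.fn) else 0.0,
       }


   # Toy counts for illustration (not taken from the results below)
   print(detection_metrics(ConfusionCounts(tp=90, fp=10, tn=85, fn=15)))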
Sample Results
--------------

LLaMA-3.2-1B Performance on C4 Dataset (500 samples)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Detection Performance Comparison**

.. list-table::
   :header-rows: 1
   :widths: 15 15 15 15 15 15 15

   * - Algorithm
     - Configuration
     - Precision
     - Recall
     - F1 Score
     - Accuracy
     - FPR
   * - **OPENAI**
     - ngram=2, seed=42, payload=0
     - 0.941
     - 0.996
     - 0.968
     - 0.967
     - 0.062
   * - **MARYLAND**
     - ngram=2, seed=42, γ=0.5, δ=1.0
     - 0.902
     - 0.882
     - 0.892
     - 0.893
     - 0.096
   * - **MARYLAND_L**
     - ngram=2, seed=42, γ=0.5, δ=1.0
     - 0.899
     - 0.962
     - 0.929
     - 0.927
     - 0.108

**Performance Metrics Comparison**

.. list-table::
   :header-rows: 1
   :widths: 15 20 20 20 20

   * - Algorithm
     - Generation Rate (tokens/s)
     - Generation Time (s)
     - Detection Time (s)
     - Total Time (s)
   * - **OPENAI**
     - 7,487
     - 54.9
     - 56.9
     - 111.8
   * - **MARYLAND**
     - 5,968
     - 66.1
     - 104.4
     - 170.5
   * - **MARYLAND_L**
     - 3,093
     - 139.1
     - 200.7
     - 339.8

**Key Observations:**

- **OPENAI** achieves the highest precision (0.941) and recall (0.996), making it well suited to applications that must minimize both false positives and false negatives
- **MARYLAND_L** reaches high recall (0.962) but at the cost of the slowest generation speed (3,093 tokens/s)
- **MARYLAND** offers moderate performance across all detection metrics with faster processing than MARYLAND_L
- Generation throughput varies significantly: OPENAI (7,487 tokens/s) > MARYLAND (5,968 tokens/s) > MARYLAND_L (3,093 tokens/s)

.. note::

   Results are saved to the ``output/benchmark/`` directory along with the detailed configuration parameters needed for reproducibility.
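Because ``tabulate`` is already installed as a dependency, saved runs can be summarized with a few lines of Python. The layout assumed below (one JSON file per algorithm under ``output/benchmark/`` with flat ``precision``/``recall``/``f1`` keys) is a guess for illustration; check the files a real run produces and adjust the keys accordingly.

.. code-block:: python

   import json
   from pathlib import Path

   from tabulate import tabulate

   # Assumed layout: one JSON result file per algorithm; adjust to match
   # what your benchmark run actually writes under output/benchmark/.
   results_dir = Path("output/benchmark")

   rows = []
   for result_file in sorted(results_dir.glob("*.json")):
       with result_file.open() as f:
           result = json.load(f)
       rows.append([
           result.get("algorithm", result_file.stem),
           result.get("precision"),
           result.get("recall"),
           result.get("f1"),
       ])

   print(tabulate(rows, headers=["Algorithm", "Precision", "Recall", "F1"],
                  tablefmt="github"))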