Red-Teaming Large Language Models (LLMs)

Imagine asking ChatGPT for homework help and receiving the response “Please die.”; or picture discovering that an AI chatbot has leaked someone’s private medical information. These real incidents that have made headlines in recent months underscore the fragility of current AI safeguards.

As Large Language Models (LLMs) become deeply integrated in various professions – from helping doctors diagnose patients to assisting lawyers with case research – their safety and security has never been more critical. While these AI systems show remarkable capabilities, they can also exhibit concerning behaviors like generating harmful content, leaking private information, or being manipulated to bypass their built-in safety measures.

The AI research community has been actively studying these vulnerabilities, but until now, most work has focused on specific types of attacks or defenses in isolation. What’s been missing is a comprehensive framework for understanding the full landscape of threats – from how attackers might compromise these systems to how we can defend against such attempts.

Our new paper, “Operationalizing a Threat Model for Red-Teaming Large Language Models,” published at TMLR, aims to fill this gap. We present the first systematic analysis of LLM security, mapping out the various ways these systems can be attacked, identifying their vulnerable points, and outlining strategies for testing and protecting them. Our goal in this work is to provide a rigorous, unified framework that researchers and practitioners can use to assess vulnerability across the entire model lifecycle.

Attack Taxonomy

The first part of our paper introduces a new attack taxonomy. While prior literature has attempted to systematize attacks on Large Language Models (LLMs), these taxonomies have largely centered around loosely defined terms such as ‘jailbreaks’ and ‘prompt injections’. Rather than organizing attacks by the methodology or the risk posed, our research introduces a novel way to categorize these attacks based on the level of access required to execute the attack. This shift cultivates a more red-teaming centric mindset essential for building robust guardrails.

As illustrated in Figure 1, this access-based taxonomy presents a hierarchical organization of attacks ranging from those requiring minimal access (such as crafting malicious prompts) to those demanding privileged access to the model’s training procedure. This systematic categorization not only provides ML practitioners with an intuitive framework for threat assessment but also facilitates the implementation of targeted defensive measures at each access level.

Attack Taxonomy Figure 1: Taxonomy of attacks on LLMs

Attack Surface Visualization

To make this taxonomy more precise, we visualize the attack surface in Figure 2. The diagram traces the LLM lifecycle from left to right, progressing from the application layer to the model’s foundations. On the far left, we see attack vectors targeting the end user-facing application, the point of entry for prompt-based interactions. As we move right, we extend deeper into the stack, ultimately reaching the model’s underlying training data and algorithms.

The figure employs a systematic visual encoding scheme: Colored boxes mark different types of attack vectors, each corresponding to a category from our taxonomy. Black arrows show how information normally flows through the system. Gray arrows reveal “side channels” – subtle vulnerabilities that emerge from standard practices like data deduplication (where duplicate data is removed to improve efficiency)

This visualization serves a distinct purpose: it compels security teams to adopt an adversarial mindset. By mapping every entry point—from obvious prompt manipulations to subtle pre-processing exploits—we can better fortify LLM systems at each stage of their lifecycle. This comprehensive view leads us to several fundamental questions that developers must address before deployment.

Attack Surface Figure 2: Visualization of attack vectors and their corresponding entry points

Key Questions Addressed

Through a detailed survey of existing literature and industry practices, our paper tackles the following key security challenges and questions:

How red-teaming evaluation differs from traditional benchmarking in uncovering vulnerabilities.
Who the potential adversaries are and what capabilities they possess at each system entry point.
What distinguishes often-confused terms like Direct vs. Indirect Prompt Injection.
How to practically detect and mitigate backdoor triggers.
Which defense strategies align best with specific attack vectors.
How subtle vulnerabilities, such as side-channel attacks, manifest in production systems.
What the evidence-based best practices and limitations of red-teaming exercises are.

Staying Current

LLM security is a rapidly moving target, with new attack vectors and defense strategies emerging regularly. To help researchers and practitioners stay updated, we maintain a GitHub repository tracking papers in the area. While recent work on red-teaming has predominantly focused on prompt-based attacks, effective red-teaming must encompass the entire model development pipeline. As LLMs become increasingly integrated into critical systems, the community must broaden its security focus to address vulnerabilities across all levels of access identified in our taxonomy.

Citation

If you found this work helpful for understanding and securing LLM systems, please consider citing our paper:

@article{verma2024operationalizing,
  title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
  author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
  journal={arXiv preprint arXiv:2407.14937},
  year={2024}
}