Imagine asking ChatGPT for homework help and receiving the response “Please die.” Or picture discovering that an AI chatbot has leaked someone’s private medical information. These aren’t hypothetical scenarios – they’re real incidents that have made headlines in recent months.

As Large Language Models (LLMs) become deeply woven into the fabric of our daily lives – from helping doctors diagnose patients to assisting lawyers with case research to tutoring students – their safety and security have never been more critical. While these AI systems show remarkable capabilities, they can also exhibit concerning behaviors like generating harmful content, leaking private information, or being manipulated to bypass their built-in safety measures.

The AI research community has been actively studying these vulnerabilities, but until now, most work has focused on specific types of attacks or defenses in isolation. What’s been missing is a comprehensive framework for understanding the full landscape of threats – from how attackers might compromise these systems to how we can defend against such attempts.

Our new paper, “Operationalizing a Threat Model for Red-Teaming Large Language Models,” aims to fill this gap. We present the first systematic analysis of LLM security, mapping out the various ways these systems can be attacked, identifying their vulnerable points, and outlining strategies for testing and protecting them. Think of it as a complete security guide for the age of AI – one that’s essential for anyone building, deploying, or interested in the safety of LLM applications.

Attack Taxonomy

How do we make sense of the dizzying array of possible attacks on LLMs? Just as a building’s security depends on understanding its entry points – from front doors to windows to delivery entrances – LLM security starts with mapping out all the ways an attacker might gain access to the system. Our research introduces a novel way to categorize these attacks based on exactly this principle: the level of access required to execute them.

Let’s break down this taxonomy, visualized in Figure 1, which organizes attacks from those requiring minimal access (like crafting malicious prompts) to those demanding deep access to the model’s training process. This systematic organization helps developers and researchers not only understand potential threats but also implement appropriate defenses at each level.

Attack Taxonomy Figure 1: Taxonomy of attacks on LLMs

Attack Surface Visualization

Attack Surface Figure 2: Visualization of attack vectors and their corresponding entry points

Now that we understand the different types of attacks, let’s explore where these attacks can actually occur in an LLM system. Figure 2 provides a comprehensive “attack surface visualization” showing all the potential weak points in an LLM’s defenses.

Reading from left to right, the figure traces the full lifecycle of an LLM system. On the left, we see attack vectors targeting the “finished product” – the user-facing application where people input their prompts and receive responses. As we move right, we encounter progressively deeper levels of the system, ultimately reaching its foundations: the training data and algorithms that give the LLM its capabilities.

The figure uses a thoughtful visual coding:

  • Colored boxes mark different types of attack vectors, each corresponding to a category from our taxonomy
  • Black arrows show how information normally flows through the system
  • Gray arrows reveal “side channels” – subtle vulnerabilities that emerge from standard practices like data deduplication (where duplicate data is removed to improve efficiency)

This visualization serves a crucial purpose: it helps security teams and developers think like attackers. By understanding all possible entry points – from the most obvious (like malicious prompts) to the most subtle (like exploiting data preprocessing) – we can better protect LLM systems at every stage of their lifecycle.

This comprehensive view of attack surfaces and entry points leads us to several crucial questions that security teams and developers must address when building and deploying LLM systems.

Key Questions Addressed

Our paper addresses the following key questions and security challenges:

  • How does red-teaming evaluation differ from traditional benchmarking in uncovering vulnerabilities
  • Who are the potential adversaries and what capabilities they possess at each entry point
  • Clear definitions of often-confused terms like Direct vs. Indirect Prompt Injection
  • Practical strategies for detecting and mitigating backdoor triggers
  • A systematic categorization of defense strategies aligned with different attack vectors
  • Deep dives into subtle vulnerabilities like side-channel attacks
  • Evidence-based best practices and limitations of red-teaming exercises

Staying Current

The landscape of LLM security evolves rapidly, with new attack vectors and defense strategies emerging regularly. To help researchers and practitioners stay updated, we maintain a GitHub repository that catalogs all papers referenced in our attack taxonomy.

Citation

If you found this work helpful for understanding and securing LLM systems, please consider citing our paper:

@article{verma2024operationalizing,
  title={Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)},
  author={Verma, Apurv and Krishna, Satyapriya and Gehrmann, Sebastian and Seshadri, Madhavan and Pradhan, Anu and Ault, Tom and Barrett, Leslie and Rabinowitz, David and Doucette, John and Phan, NhatHai},
  journal={arXiv preprint arXiv:2407.14937},
  year={2024}
}