Red Teaming

Definition

Red Teaming is structured process where authorized teams (the “red team”) systematically attempt to compromise, circumvent, or find weaknesses in AI system, without source code access but with input freedom. Goal is identifying vulnerabilities before malicious adversaries exploit them.

In generative AI context like LLMs, red teaming means seeking prompts, query sequences, or behaviors causing model to produce unsafe output: jailbreaks, misinformation, bias, harmful content.

Types of Red Teaming

Jailbreak Attempts: formulate prompt causing model to violate safety guidelines. Example: “Ignore previous instructions” or elaborate social engineering making model reveal dangerous information.

Bias and Fairness Attacks: formulate queries exposing model bias. Example: ask model describe different professions noting if it stereotypes gender/racial.

Hallucination Triggers: seek input causing model to hallucinate false information with confidence. Particularly critical for decision-supporting models.

Out-of-Distribution Attacks: highly anomalous inputs, rare languages, exotic texts. How does model behave seeing unseen input?

Contradiction Attacks: formulate prompts with logical contradictions or impossible requirements. Is model honest about limitations or hallucinate plausible answer?

Multi-Step Attacks: question sequences progressively bypassing safety measures. Attacker doesn’t ask directly; extracts gradually.

Red Teaming Methodologies

Automated Red Teaming: use generative model (e.g., GPT-4) to autonomously generate attack prompts. More scalable but less creative.

Human Red Teaming: creative people with domain expertise attempt breaking system. More creative, discovers unanticipated attack types, but expensive.

Hybrid Approach: combines automated generation (for scale) with human review (for validation and creativity). Often best.

Interactive Red Teaming: red teamer iteratively refines attacks based on model responses. Conversational approach discovering new vulnerabilities.

Red Teaming Roles

Red Team Lead: coordinates efforts, allocates resources, prioritizes findings

Adversarial Prompt Specialist: creative prompt engineer formulating sophisticated attacks

Domain Expert: specific expertise (e.g., medical, legal, security) knowing dangerous failure modes

Victim Model Expert: understands architecture and model limitations being tested

Analyst: aggregates findings, categorizes, produces reports

Red Teaming Output

Vulnerability Report: documented list of discovered vulnerabilities with severity rating

Reproducible Cases: specific prompt/input combinations reliably causing problem

Mitigation Recommendations: suggestions reducing vulnerability (e.g., improved prompting, architectural changes, additional training)

Metrics: number and type of vulnerabilities by category

Executive Summary: high-level stakeholder communication on risk posture

Red Teaming Challenges

Scope Creep: how many attack types consider? How many permutations? Red teaming can last indefinitely.

Subjectivity: what constitutes “unsafe” varies culturally. One team finds output problematic, another doesn’t.

Resource Intensity: good red teaming requires creative, skilled people. Expensive.

False Negatives: not finding vulnerabilities doesn’t mean they don’t exist. Missing creative attacker.

Adversary Arms Race: once known exploits mitigated, better attackers invent new ones. Red teaming is continuous, not one-time.

Best Practices

Involve diverse people (different backgrounds, expertise, perspectives find different problems)
Meticulously document findings and reproduction steps
Prioritize vulnerabilities by severity and likelihood
Iterate: first round, second round, continuous monitoring
Make psychologically safe: red teamer shouldn’t be punished for finding problems
Proactively mitigate discovered vulnerabilities
Communicate transparently about limitations

AI Testing and Evaluation: testing framework for red teaming
Quality Assurance AI: part of robustness QA
Model Behavior Evaluation: edge case evaluation
AI Governance: governance of red teaming

Sources

“Red Teaming Language Models to Reduce Harms” - Perez et al.
Anthropic: Constitutional AI and red teaming approach
Center for AI Safety: Red teaming resources