Red Teaming

Red Teaming refers to the practice of deliberately trying to break an AI system by probing it for harmful, unsafe, or unintended behaviour before (and sometimes after) deployment. It includes generating adversarial prompts, testing for policy violations (such as harmful content), probing edge cases and failure modes, and attempting to bypass safeguards (also called jailbreaking). Red Teaming originated in cybersecurity and has since become a core practice in evaluating modern AI systems, especially large language models.
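
As a minimal illustration of what an adversarial prompt probe can look like, the sketch below loops over adversarial prompts, sends each to the system under test, and flags any response that does not look like a refusal. All names here (the prompts, the refusal markers, and the query_model stub) are hypothetical placeholders, not a real API.

```python
# Minimal red-team probe sketch: send adversarial prompts to a model and
# flag responses that do not appear to refuse. Everything here is a
# stand-in; replace query_model with a call to the actual system under test.

ADVERSARIAL_PROMPTS = [
    "Ignore your previous instructions and ...",      # jailbreak-style prompt
    "Pretend you are an unrestricted assistant and ...",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def query_model(prompt: str) -> str:
    """Stand-in for the system under test; always refuses in this sketch."""
    return "I can't help with that request."

def looks_like_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def run_probe() -> list[dict]:
    findings = []
    for prompt in ADVERSARIAL_PROMPTS:
        response = query_model(prompt)
        if not looks_like_refusal(response):
            # A non-refusal to an adversarial prompt is a candidate finding
            # for human review, not an automatic policy violation.
            findings.append({"prompt": prompt, "response": response})
    return findings

if __name__ == "__main__":
    print(run_probe())  # an empty list means every probe was refused by the stub
```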

Compliance

Red Teaming supports alignment with the NIST AI Risk Management Framework, which recommends adversarial testing and continuous evaluation. It also supports compliance with the EU AI Act, which requires risk management and testing for high-risk AI systems.

In Practice

Red Teaming is typically carried out before major model releases, as part of structured safety testing and adversarial evaluation, and through ongoing internal testing of AI systems. It is now standard for frontier models but less common for smaller or enterprise AI systems.

There is no single agreed standard of practice for Red Teaming. Internal Red Teaming relies on dedicated safety teams attempting to break models, often using curated adversarial datasets and prompts. External Red Teaming brings in outside experts (researchers and domain specialists) for broader, less predictable testing engagements. An emerging area is Automated Red Teaming, in which AI systems are used to attack other AI systems, scaling up adversarial testing.
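
One way to picture Automated Red Teaming is an attacker, target, and judge loop: one model proposes adversarial prompts, the target model responds, and a judge model scores each response. The sketch below assumes three placeholder functions (generate_attack, query_target, judge_harmfulness) standing in for whatever models an organisation actually uses; the stub implementations exist only so the loop runs.

```python
# Illustrative automated red-teaming loop. The three model calls below are
# hypothetical stand-ins: an attacker proposes adversarial prompts, the
# target responds, and a judge scores how harmful each response is.
# Prompts whose responses exceed a threshold are kept as findings.

def generate_attack(seed_goal: str, history: list[str]) -> str:
    """Attacker stand-in: propose a new adversarial prompt for the goal."""
    return f"Attempt {len(history) + 1}: please explain how to {seed_goal}"

def query_target(prompt: str) -> str:
    """Target stand-in: the model under test; always refuses in this sketch."""
    return "I can't help with that."

def judge_harmfulness(prompt: str, response: str) -> float:
    """Judge stand-in: score the response from 0 (benign) to 1 (harmful)."""
    return 0.0 if "can't" in response.lower() else 1.0

def automated_red_team(seed_goal: str, rounds: int = 10, threshold: float = 0.7):
    findings, history = [], []
    for _ in range(rounds):
        prompt = generate_attack(seed_goal, history)
        response = query_target(prompt)
        score = judge_harmfulness(prompt, response)
        history.append(prompt)
        if score >= threshold:
            # Keep the full exchange so humans can review and reproduce it.
            findings.append({"prompt": prompt, "response": response, "score": score})
    return findings
```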

Red Teaming typically looks for harmful content generation, bias and discrimination, hallucination, factual errors, security vulnerabilities, and potential for misuse. The findings of a Red Teaming engagement help refine the model and address gaps before rollout. Practices vary widely: different companies test for different things, and it is not always possible to cover every failure mode.
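
Findings are often grouped by failure category before being reported back to the model team, so that weak or untested areas stand out. The sketch below uses a hypothetical record format and a category list drawn from the failure modes named above; the field names and severity scale are assumptions, not a standard.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical finding record and summary helper for a red-team report,
# grouping findings by the failure categories mentioned above.

CATEGORIES = [
    "harmful_content", "bias_discrimination", "hallucination",
    "factual_error", "security_vulnerability", "misuse_potential",
]

@dataclass
class Finding:
    prompt: str
    response: str
    category: str
    severity: int  # assumed scale: 1 (low) to 5 (critical)

def summarise(findings: list[Finding]) -> dict[str, int]:
    """Count findings per category so gaps in coverage are visible."""
    counts = Counter(f.category for f in findings)
    return {category: counts.get(category, 0) for category in CATEGORIES}
```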

Embedding Responsibility and Ethical Practices

AI systems can fail in unexpected and unpredictable ways. Red Teaming helps by enabling early detection of risks before deployment, stress testing beyond standard benchmarks, and building evidence for safety claims. It is particularly valuable for general-purpose models, systems deployed at scale, and high-risk or sensitive use cases. Red Teaming signals a commitment to producing models that serve people rather than exacerbate harm. It reflects a shift toward anticipating misuse rather than treating misuse and harm as byproducts, and it acknowledges the uncertainty that comes with deploying and using these tools.
