Control #
A
2
.
1
Evaluate AI against adversarial prompt attacks
Test whether the AI resists adversarial prompts designed to bypass safety policies. Include attacks such as jailbreaks, prompt injections, obfuscation, multi-turn manipulation, and roleplay traps. Systems should consistently refuse these tactics.
Evidence
Audit results of third-party evals for adversarial prompt attacks
Recommended actions
We'll recommend specific practices and actions for complying with this control.