Principle #

A

2

Prevent adversarial inputs from causing harmful outputs

Ensure that AI vendors undergo risk assessments to meet security, privacy, and compliance requirements.

Vendor questions

For the purposes of this questionnaire, adversarial inputs refer to prompts or interactions intentionally crafted to bypass an AI system’s safety policies—such as jailbreaks, prompt injections, obfuscated language, multi-step manipulation, or roleplay-based traps. This section evaluates your system’s ability to detect, block, resist, and respond to these inputs through testing, moderation layers, and ongoing review. 1. How do you evaluate your AI system’s behavior under adversarial prompting? a. Describe your evaluation or red-teaming process, including the types of attacks tested (e.g., roleplay, injections, chaining, obfuscation). b. How frequently are these evaluations conducted, and who is responsible? c. What severity classifications or failure thresholds do you use to track results? 2. Do you use real-time content moderation systems to detect and block adversarial prompts before they reach the model? a. Describe the rule-based, ML-based, or hybrid techniques in use. b. Have you implemented any model-side defenses such as refusal tuning or adversarial training? c. How do you measure coverage, latency, and false positive/negative rates? 3. How do you evaluate the performance and resilience of your input moderation systems? a. What metrics do you use to measure detection accuracy and evasion resistance? b. How often do you re-test these systems against new or emerging attack strategies? c. Do you benchmark them against internal or third-party adversarial inputs? 4. How does your system detect and respond to adversarial behavior in production? a. What types of behavior are flagged in real time (e.g., repeated probing, prompt tampering)? b. What automated or manual response mechanisms are in place (e.g., rate limiting, user blocks, alerts)? c. How are production responses logged, reviewed, and improved over time? 5. How are flagged or blocked prompts reviewed and analyzed on an ongoing basis? a. Who conducts prompt log reviews, and how frequently are they performed? b. What criteria guide triage, escalation, and resolution of incidents? c. How do findings from reviews feed back into your system improvements?

Provide feedback on this principle