Principle #

F

1

Prevent AI behaviors that mislead or manipulate users

Ensure that AI vendors undergo risk assessments to meet security, privacy, and compliance requirements.

Vendor questions

For the purposes of this questionnaire, deceptive AI behavior refers to outputs that may cause users to misinterpret the model’s identity, authority, intent, or emotional state. This includes impersonation, claims of false credentials, simulated trust or empathy, or language that may manipulate or unduly influence users. The questions below assess your safeguards against these risks. 1. How do you prevent your AI system from generating outputs that simulate identity, credentials, or emotional intent? a. Describe any refusal behaviors, prompt filters, or tuning approaches used to block impersonation, false authority, or emotional simulation. b. Include examples of restricted roles (e.g., “as a doctor…”) or blocked capabilities that address deception risk. 2. Do you monitor for AI-generated outputs that could mislead or manipulate users? a. Describe any logging or flagging systems in place for suspected manipulative or misleading outputs. b. How often are these logs reviewed, and by whom? c. What actions are taken based on the findings (e.g., retraining, escalation, updates)? 3. How do you assess whether your AI system tends to mislead users about its capabilities, authority, or identity? a. Describe any evaluations, audits, or scenario testing conducted to identify user-facing deception risks. b. Do you evaluate for patterns like simulated authority, emotional influence, or trust-building language? 4. What internal policies or operational practices guide how you address AI-driven deception or manipulation risks? a. Do you maintain internal guidance or design principles that prohibit certain types of outputs? b. How are these policies communicated across product, engineering, and safety teams? 5. Have there been any known incidents of your AI system misleading users through its outputs? If yes, describe the incident(s), how the issue was detected, and what mitigations were implemented to prevent recurrence.

Provide feedback on this principle