Clear use-case & policy scope |
Define allowed/blocked intents, refusal rules, and escalation paths. |
Privacy & data minimization |
Strip/obfuscate PII, control retention, encrypt data in transit and at rest. |
Input validation & prompt hardening |
Templatize prompts, sanitize user/tool inputs, and block jailbreak patterns. |
Prompt-injection defenses |
Filter untrusted content (RAG docs, web pages) and isolate it from system instructions. |
Grounding to trusted sources |
Use RAG/knowledge graphs; require citations and confidence signals to curb hallucinations. |
Safety & content filters |
Toxicity, hate, self-harm, IP/PII leakage, and malware classifiers on inputs and outputs. |
Bias & fairness checks |
Run bias evaluations, adjust datasets/prompts, and document mitigations. |
Human-in-the-loop gates |
Require review for high-risk actions (code deploys, financial trades, customer emails). |
Tool/agent safety |
Least-privilege keys, allow/deny lists, timeouts, cost/iteration budgets, sandboxed execution. |
Access control & auditability |
RBAC, per-tenant isolation, comprehensive logs, and immutable audit trails. |
Rate limiting & abuse prevention |
Quotas, anomaly detection, circuit breakers/kill switches. |
Evaluation & red-teaming |
Task-specific benchmarks, adversarial testing, and pre-prod safety bars. |
Monitoring & drift detection |
Track quality, safety incidents, latency/cost; alert on regressions and model drift. |
Change management |
Version prompts/models/tools, run A/B or shadow tests, maintain rollback plans. |
Compliance & provenance |
Map to SOC2/ISO/GDPR/CCPA; add watermarking or provenance tags where applicable. |