07 · Safety, Alignment & Governance
Keeping AI systems trustworthy at scale. Technical safety comes first (guardrails, red-teaming), then regulatory frameworks (EU AI Act, NIST AI RMF).
| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | Input guardrails | The 5-stage pipeline that sanitises every user message before it reaches the model | 🔴 |
| 2 | PII redaction | Detecting and reversibly replacing personal data in prompts | 🔴 |
| 3 | Prompt injection | Attacks that override system prompts via user input — detection and defence | 🔴 |
| 4 | Output guardrails | Validating and filtering model responses before returning to the client | 🔴 |
| 5 | Content filtering | Blocking harmful content categories: hate, violence, CSAM, self-harm | 🔴 |
| 6 | Red-teaming | Structured adversarial testing to find safety failures before deployment | 🔴 |
| 7 | Adversarial inputs | The attack surface: jailbreaks, emoji smuggling, homoglyphs, indirect injection | 🔴 |
| 8 | Safety alignment | How models are trained to refuse harmful requests — RLHF, RLAIF, CAI | 🔴 |
| 9 | Bias detection | Identifying systematic unfairness in model outputs across demographic groups | 🔴 |
| 10 | Fairness metrics | Quantitative measures of equitable treatment in model outputs | 🔴 |
| 11 | Human approval gate | Routing high-risk agent actions to humans before execution | 🔴 |
| 12 | Policy-as-code | Expressing safety rules as machine-readable policies that run in CI/CD | 🔴 |
| 13 | AI RMF (NIST) | The voluntary AI Risk Management Framework from the US National Institute of Standards and Technology | 🔴 |
| 14 | EU AI Act | The EU's binding regulation on AI — risk tiers, obligations, and timelines | 🔴 |
| 15 | Responsible AI | The principles and practices that tie governance to engineering decisions | 🔴 |
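
Step 2 in the table above describes PII redaction as reversible replacement. As a rough illustration of what "reversible" can mean, here is a minimal sketch that assumes only email addresses count as PII and detects them with a simple regular expression. The names `EMAIL_RE`, `redact`, and `restore` and the `<EMAIL_n>` placeholder format are hypothetical choices for this example, not part of any particular library.

```python
import re

# Assumption: emails only. A real detector would cover many more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Replace each email with a placeholder; return (redacted_text, mapping)."""
    seen = {}      # original value -> placeholder
    mapping = {}   # placeholder -> original value, kept outside the prompt

    def substitute(match):
        value = match.group(0)
        if value not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"
            seen[value] = placeholder
            mapping[placeholder] = value
        return seen[value]

    return EMAIL_RE.sub(substitute, text), mapping


def restore(text, mapping):
    """Put the original values back into text that contains placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "Mail ana@example.com or bob@example.org; ana@example.com again."
redacted, mapping = redact(prompt)
print(redacted)
print(restore(redacted, mapping) == prompt)
```

The same value always maps to the same placeholder, so the model can still tell that two mentions refer to the same person, and the mapping stays on the caller's side so the model response can be restored before it is returned.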