07 · Safety, Alignment & Governance

Keeping AI systems trustworthy at scale. Technical safety comes first (guardrails, red-teaming), then governance frameworks (NIST AI RMF, EU AI Act).

Step	Topic	One-liner	Status
1	Input guardrails	The 5-stage pipeline that sanitises every user message before it reaches the model	🔴
2	PII redaction	Detecting and reversibly replacing personal data in prompts	🔴
3	Prompt injection	Attacks that override system prompts via user input — detection and defence	🔴
4	Output guardrails	Validating and filtering model responses before returning to the client	🔴
5	Content filtering	Blocking harmful content categories: hate, violence, CSAM, self-harm	🔴
6	Red-teaming	Structured adversarial testing to find safety failures before deployment	🔴
7	Adversarial inputs	The attack surface: jailbreaks, emoji smuggling, homoglyphs, indirect injection	🔴
8	Safety alignment	How models are trained to refuse harmful requests — RLHF, RLAIF, CAI	🔴
9	Bias detection	Identifying systematic unfairness in model outputs across demographic groups	🔴
10	Fairness metrics	Quantitative measures of equitable treatment in model outputs	🔴
11	Human approval gate	Routing high-risk agent actions to humans before execution	🔴
12	Policy-as-code	Expressing safety rules as machine-readable policies that run in CI/CD	🔴
13	AI RMF (NIST)	The voluntary AI Risk Management Framework from the US National Institute of Standards and Technology	🔴
14	EU AI Act	The EU's binding regulation on AI — risk tiers, obligations, and timelines	🔴
15	Responsible AI	The principles and practices that tie governance to engineering decisions	🔴
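
To make step 2 (PII redaction) concrete, here is a minimal sketch of reversible replacement. It assumes regex-based detection of only two PII types (email addresses and phone numbers). The names `PATTERNS`, `redact`, and `restore` and the `<LABEL_n>` placeholder format are hypothetical, chosen for illustration. Production systems typically pair patterns like these with trained entity recognisers.

```python
import re

# Hypothetical, deliberately simple patterns; real detectors cover many more PII types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace detected PII with numbered placeholders.

    Returns the redacted text and a placeholder -> original value mapping,
    which is what makes the replacement reversible.
    """
    mapping = {}
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            # Reuse the existing placeholder when the same value appears again.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            count = sum(1 for key in mapping if key.startswith(f"<{label}_"))
            placeholder = f"<{label}_{count + 1}>"
            mapping[placeholder] = value
            return placeholder

        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back in place of their placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


prompt = "Email alice@example.com or call +1 415 555 0100. Again: alice@example.com"
redacted, mapping = redact(prompt)
print(redacted)
print(restore(redacted, mapping) == prompt)
```

In this sketch the mapping stays on the caller's side and only the redacted text is sent to the model. Calling `restore` on the model's response reinserts the original values wherever the placeholders survive.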
