07 · Safety, Alignment & Governance

Keeping AI systems trustworthy at scale. Technical safety comes first (guardrails, red-teaming), followed by regulatory frameworks (EU AI Act, NIST AI RMF).

| Step | Topic | One-liner | Status |
|------|-------|-----------|--------|
| 1 | Input guardrails | The 5-stage pipeline that sanitises every user message before it reaches the model | 🔴 |
| 2 | PII redaction | Detecting and reversibly replacing personal data in prompts | 🔴 |
| 3 | Prompt injection | Attacks that override system prompts via user input: detection and defence | 🔴 |
| 4 | Output guardrails | Validating and filtering model responses before returning them to the client | 🔴 |
| 5 | Content filtering | Blocking harmful content categories: hate, violence, CSAM, self-harm | 🔴 |
| 6 | Red-teaming | Structured adversarial testing to find safety failures before deployment | 🔴 |
| 7 | Adversarial inputs | The attack surface: jailbreaks, emoji smuggling, homoglyphs, indirect injection | 🔴 |
| 8 | Safety alignment | How models are trained to refuse harmful requests: RLHF, RLAIF, CAI | 🔴 |
| 9 | Bias detection | Identifying systematic unfairness in model outputs across demographic groups | 🔴 |
| 10 | Fairness metrics | Quantitative measures of equitable treatment in model outputs | 🔴 |
| 11 | Human approval gate | Routing high-risk agent actions to humans before execution | 🔴 |
| 12 | Policy-as-code | Expressing safety rules as machine-readable policies that run in CI/CD | 🔴 |
| 13 | AI RMF (NIST) | The US National Institute of Standards and Technology's AI Risk Management Framework | 🔴 |
| 14 | EU AI Act | The EU's binding regulation on AI: risk tiers, obligations, and timelines | 🔴 |
| 15 | Responsible AI | The principles and practices that tie governance to engineering decisions | 🔴 |

← Previous section: 06 · Agents & Orchestration | Next section: 08 · Evaluation & Quality →