
08 · Evaluation & Quality

Measuring what the model actually does. Frameworks, metrics, and CI/CD patterns for knowing when your system is working — and when it regresses.

| Step | Topic | One-liner | Status |
|------|-------|-----------|--------|
| 1 | RAGAS | The most widely used RAG evaluation framework: faithfulness, relevancy, recall | 🔴 |
| 2 | Faithfulness metrics | Measuring whether the model's answer is grounded in the retrieved context | 🔴 |
| 3 | Hallucination rate | Quantifying how often the model generates confident-but-wrong output | 🔴 |
| 4 | LLM-as-judge | Using a second LLM to score the output of the first | 🔴 |
| 5 | Golden dataset | Curating a fixed test set that represents what good looks like | 🔴 |
| 6 | Offline vs online eval | Pre-deployment evaluation on fixed datasets vs. live traffic monitoring | 🔴 |
| 7 | DeepEval | Open-source eval framework with 14+ built-in metrics | 🔴 |
| 8 | Benchmark evals | Industry-standard datasets: MMLU, HellaSwag, BEIR, and when they matter | 🔴 |
| 9 | Regression testing | Catching quality degradation when prompts or models change | 🔴 |
| 10 | A/B prompt testing | Statistically comparing prompt variants on real or simulated traffic | 🔴 |
| 11 | Trajectory evaluation | Evaluating agents on the quality of their full action sequence, not just final output | 🔴 |
| 12 | CI/CD eval gates | Blocking deployments when eval scores drop below thresholds | 🔴 |
| 13 | PROMOTE / HOLD / ROLLBACK | The deployment decision framework for LLM systems | 🔴 |
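
As a rough illustration of the regression-testing and CI/CD eval-gate topics listed above, a gate can be sketched as a function that compares a candidate's eval scores against fixed minimums and, optionally, against the previous release. Everything named here is an assumption for illustration only: the metric names, the threshold values, the `max_drop` tolerance, and the `eval_gate` helper itself are hypothetical and do not come from any specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def eval_gate(scores, thresholds, baseline=None, max_drop=0.02):
    """Check candidate eval scores against minimums and a baseline.

    scores, thresholds, baseline: dicts mapping metric name -> float.
    max_drop: largest tolerated decrease relative to the baseline
              (hypothetical default, tune for your own metrics).
    """
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None:
            # A missing metric is treated as a failure, not a pass.
            failures.append(f"{metric}: missing score")
            continue
        if score < minimum:
            failures.append(f"{metric}: {score:.2f} below minimum {minimum:.2f}")
        if baseline is not None and metric in baseline:
            drop = baseline[metric] - score
            if drop > max_drop:
                failures.append(f"{metric}: regressed by {drop:.2f} vs baseline")
    return GateResult(passed=not failures, failures=failures)


if __name__ == "__main__":
    # Hypothetical numbers; in CI these would come from an eval run.
    result = eval_gate(
        scores={"faithfulness": 0.91, "answer_relevancy": 0.80},
        thresholds={"faithfulness": 0.85, "answer_relevancy": 0.75},
        baseline={"faithfulness": 0.95},
    )
    for failure in result.failures:
        print("FAIL", failure)
    # A non-zero exit code is what makes a CI pipeline block the deploy.
    raise SystemExit(0 if result.passed else 1)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an eval job that silently stops reporting a score should not be able to pass the gate.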
