08 · Evaluation & Quality

Measuring what the model actually does. Frameworks, metrics, and CI/CD patterns for knowing when your system is working — and when it regresses.

Step	Topic	One-liner	Status
1	RAGAS	The most widely used RAG evaluation framework — faithfulness, relevancy, recall	🔴
2	Faithfulness metrics	Measuring whether the model's answer is grounded in the retrieved context	🔴
3	Hallucination rate	Quantifying how often the model generates confident-but-wrong output	🔴
4	LLM-as-judge	Using a second LLM to score the output of the first	🔴
5	Golden dataset	Curating a fixed test set that represents what good looks like	🔴
6	Offline vs online eval	Pre-deployment evaluation on fixed datasets vs. live traffic monitoring	🔴
7	DeepEval	Open-source eval framework with 14+ built-in metrics	🔴
8	Benchmark evals	Industry-standard datasets: MMLU, HellaSwag, BEIR, and when they matter	🔴
9	Regression testing	Catching quality degradation when prompts or models change	🔴
10	A/B prompt testing	Statistically comparing prompt variants on real or simulated traffic	🔴
11	Trajectory evaluation	Evaluating agents on the quality of their full action sequence, not just final output	🔴
12	CI/CD eval gates	Blocking deployments when eval scores drop below thresholds	🔴
13	PROMOTE / HOLD / ROLLBACK	The deployment decision framework for LLM systems	🔴

← Previous section: 07 · Safety, Alignment & Governance | Next section → 09 · Observability & Ops