09 · Observability & Ops (LLMOps)

Running LLM systems reliably in production: tracing, cost tracking, drift detection, and the tooling ecosystem that keeps them healthy.

Step	Topic	One-liner	Status
1	LLMOps overview	The operational discipline for LLM systems — how it differs from MLOps	🔴
2	OpenTelemetry for LLMs	The open standard for instrumenting LLM calls with traces and metrics	🔴
3	Tracing & spans	Distributed tracing across model calls, tool calls, and retrieval steps	🔴
4	Spans & traces — primitives	The span/trace/context data model in detail	🔴
5	Latency percentiles — p95/p99	Why averages lie and how to interpret tail latency for LLMs	🔴
6	Cost per token	Tracking and attributing inference cost at request, feature, and user level	🔴
7	Prompt & semantic drift	Detecting when model behavior changes over time without an explicit deployment	🔴
8	Toxicity scoring	Monitoring production outputs for harmful content	🔴
9	RBAC & IAM	Role-based access control for LLM APIs and model access	🔴
10	Audit logs	What to log, how long to retain it, and how to make it queryable	🔴
11	Langfuse & LangSmith	Two widely used LLM observability platforms: Langfuse (open source, self-hostable) and LangSmith (hosted)	🔴
12	Arize & Datadog LLM	Enterprise-grade LLM monitoring — ML observability meets LLMOps	🔴
13	Data residency	Where prompt and response data lives and the compliance implications	🔴
14	Model versioning	Tracking which model version handled which request for debugging and rollback	🔴

← Previous section: 08 · Evaluation & Quality | Next section → 10 · Integration & Cloud