
09 · Observability & Ops (LLMOps)

Running LLM systems reliably in production: tracing, cost tracking, drift detection, and the tooling ecosystem that keeps them healthy.

| Step | Topic | One-liner | Status |
| --- | --- | --- | --- |
| 1 | LLMOps overview | The operational discipline for LLM systems and how it differs from MLOps | 🔴 |
| 2 | OpenTelemetry for LLMs | The open standard for instrumenting LLM calls with traces and metrics | 🔴 |
| 3 | Tracing & spans | Distributed tracing across model calls, tool calls, and retrieval steps | 🔴 |
| 4 | Spans & traces: primitives | The span/trace/context data model in detail | 🔴 |
| 5 | Latency percentiles: p95/p99 | Why averages lie and how to interpret tail latency for LLMs | 🔴 |
| 6 | Cost per token | Tracking and attributing inference cost at request, feature, and user level | 🔴 |
| 7 | Prompt & semantic drift | Detecting when model behavior changes over time without explicit deployment | 🔴 |
| 8 | Toxicity scoring | Monitoring output quality for harmful content in production | 🔴 |
| 9 | RBAC & IAM | Role-based access control for LLM APIs and model access | 🔴 |
| 10 | Audit logs | What to log, how long to retain it, and how to make it queryable | 🔴 |
| 11 | Langfuse & LangSmith | The two leading open-source/hosted LLM observability platforms | 🔴 |
| 12 | Arize & Datadog LLM | Enterprise-grade LLM monitoring, where ML observability meets LLMOps | 🔴 |
| 13 | Data residency | Where prompt and response data lives and the compliance implications | 🔴 |
| 14 | Model versioning | Tracking which model version handled which request for debugging and rollback | 🔴 |
