09 · Observability & Ops (LLMOps)
Running LLM systems reliably in production: tracing, cost tracking, drift detection, and the tooling ecosystem that keeps them healthy.
| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | LLMOps overview | The operational discipline for LLM systems — how it differs from MLOps | 🔴 |
| 2 | OpenTelemetry for LLMs | The open standard for instrumenting LLM calls with traces and metrics | 🔴 |
| 3 | Tracing & spans | Distributed tracing across model calls, tool calls, and retrieval steps | 🔴 |
| 4 | Spans & traces — primitives | The span/trace/context data model in detail | 🔴 |
| 5 | Latency percentiles — p95/p99 | Why averages lie and how to interpret tail latency for LLMs | 🔴 |
| 6 | Cost per token | Tracking and attributing inference cost at request, feature, and user level | 🔴 |
| 7 | Prompt & semantic drift | Detecting when model behavior changes over time without explicit deployment |