03 · Serving Infrastructure
How to deploy and scale LLM inference: the engines, parallelism strategies, and operational concepts that turn a model into a production service.
| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | vLLM | The most widely used open-source inference engine — PagedAttention, continuous batching | 🔴 |
| 2 | OpenAI-compatible API | The REST interface standard that makes engines interchangeable | 🔴 |
| 3 | TGI (Text Generation Inference) | Hugging Face's inference server and its strengths vs vLLM | 🔴 |
| 4 | TensorRT-LLM | NVIDIA's compiled inference engine — maximum GPU throughput | 🔴 |
| 5 | SGLang | Structured generation language — fast multi-call and constrained decoding | 🔴 |
| 6 | llama.cpp & Ollama | CPU-first inference for local and edge use cases | 🔴 |
| 7 | Triton Inference Server | NVIDIA's production serving framework supporting multiple backends | 🔴 |
| 8 | P/D disaggregation | Splitting prefill and decode onto separate workers for efficiency | 🔴 |
| 9 | Tensor parallelism | Splitting each layer's weight matrices across GPUs (intra-layer sharding) | 🔴 |
| 10 | Pipeline parallelism | Assigning consecutive groups of layers to different GPUs as sequential stages | 🔴 |
| 11 | Expert parallelism | Distributing MoE expert networks across GPUs | 🔴 |
| 12 | Serving metrics | Throughput, latency SLO, goodput — how to measure a serving system | 🔴 |
| 13 | Batch inference | Offline processing of large request volumes without latency constraints | 🔴 |
| 14 | Edge inference | Running models on-device with constrained compute and memory | 🔴 |
| 15 | NVIDIA Dynamo | NVIDIA's open-source distributed inference serving framework (announced 2025) | 🔴 |
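
As a rough illustration of the three quantities named in step 12, the sketch below computes throughput, goodput, and SLO attainment over one measurement window. It assumes goodput means "completed requests per second that met every latency SLO", which is one common definition rather than the only one. The `RequestRecord` shape, the `serving_metrics` helper, and the SLO thresholds are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed request (hypothetical record shape, for illustration)."""
    ttft_s: float           # time to first token, in seconds
    total_latency_s: float  # end-to-end latency, in seconds
    output_tokens: int      # number of generated tokens


def serving_metrics(records, window_s, ttft_slo_s, latency_slo_s):
    """Summarize completed requests observed during a window of window_s seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # A request counts toward goodput only if it met both SLOs (assumed definition).
    met_slo = [
        r for r in records
        if r.ttft_s <= ttft_slo_s and r.total_latency_s <= latency_slo_s
    ]
    return {
        "throughput_rps": len(records) / window_s,
        "token_throughput_tps": sum(r.output_tokens for r in records) / window_s,
        "goodput_rps": len(met_slo) / window_s,
        "slo_attainment": len(met_slo) / len(records) if records else 0.0,
    }


# Made-up sample data: four requests completed within a 10-second window.
records = [
    RequestRecord(ttft_s=0.2, total_latency_s=2.0, output_tokens=100),
    RequestRecord(ttft_s=0.9, total_latency_s=3.0, output_tokens=150),  # misses TTFT SLO
    RequestRecord(ttft_s=0.3, total_latency_s=6.0, output_tokens=250),  # misses latency SLO
    RequestRecord(ttft_s=0.1, total_latency_s=1.0, output_tokens=100),
]
metrics = serving_metrics(records, window_s=10.0, ttft_slo_s=0.5, latency_slo_s=5.0)
print(metrics)
```

With this sample data, half of the requests miss an SLO, so goodput is half of raw throughput. That gap is why a system can look healthy on throughput alone while failing its latency targets.
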
← Previous section: 02 · Prompting & Control | Next section → 04 · Model Optimization & Formats