03 · Serving Infrastructure

How to deploy and scale inference: the engines, parallelism strategies, and operational concepts that turn a model into a production service.

Step	Topic	One-liner	Status
1	vLLM	The most widely used open-source inference engine — PagedAttention, continuous batching	🔴
2	OpenAI-compatible API	The REST interface standard that makes engines interchangeable	🔴
3	TGI — Text Generation Inference	HuggingFace's inference server — strengths vs vLLM	🔴
4	TensorRT-LLM	NVIDIA's compiled inference engine — maximum GPU throughput	🔴
5	SGLang	Structured generation language — fast multi-call and constrained decoding	🔴
6	llama.cpp & Ollama	CPU-first inference for local and edge use cases	🔴
7	Triton Inference Server	NVIDIA's production serving framework supporting multiple backends	🔴
8	P/D disaggregation	Splitting prefill and decode onto separate workers for efficiency	🔴
9	Tensor parallelism	Splitting each layer's weight matrices across GPUs (intra-layer), so every GPU computes part of every layer	🔴
10	Pipeline parallelism	Assigning consecutive groups of layers to different GPUs as sequential stages (inter-layer)	🔴
11	Expert parallelism	Distributing MoE expert networks across GPUs	🔴
12	Serving metrics	Throughput, latency SLO, goodput — how to measure a serving system	🔴
13	Batch inference	Offline processing of large request volumes without latency constraints	🔴
14	Edge inference	Running models on-device with constrained compute and memory	🔴
15	NVIDIA Dynamo	NVIDIA's distributed inference serving framework, announced in 2025	🔴
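
As a small preview of step 12, the sketch below shows how throughput and goodput can differ for the same traffic. It assumes one common definition of goodput (requests per second that meet every latency SLO). The `RequestRecord` fields, the `summarize` helper, and the SLO thresholds are hypothetical names and values chosen for illustration, not taken from any particular engine.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One completed request (hypothetical log format)."""
    ttft_s: float           # time to first token, in seconds
    total_latency_s: float  # end-to-end latency, in seconds
    output_tokens: int      # number of generated tokens


def summarize(records, window_s, ttft_slo_s, latency_slo_s):
    """Compute throughput and goodput over a measurement window.

    Goodput here counts only requests that met BOTH SLOs (an assumed
    definition; real systems may use per-token latency targets instead).
    """
    total_tokens = sum(r.output_tokens for r in records)
    met_slo = [
        r for r in records
        if r.ttft_s <= ttft_slo_s and r.total_latency_s <= latency_slo_s
    ]
    return {
        "tokens_per_s": total_tokens / window_s,
        "requests_per_s": len(records) / window_s,
        "goodput_rps": len(met_slo) / window_s,
    }


# Four requests completed within a 10-second window (made-up numbers).
records = [
    RequestRecord(ttft_s=0.2, total_latency_s=2.0, output_tokens=100),
    RequestRecord(ttft_s=0.9, total_latency_s=6.0, output_tokens=300),  # misses both SLOs
    RequestRecord(ttft_s=0.3, total_latency_s=3.0, output_tokens=200),
    RequestRecord(ttft_s=0.4, total_latency_s=4.5, output_tokens=400),
]

stats = summarize(records, window_s=10.0, ttft_slo_s=0.5, latency_slo_s=5.0)
print(stats)
```

With these sample numbers, all four requests count toward throughput but only three count toward goodput, which is why a system tuned for raw throughput alone can still miss its latency targets.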

← Previous section: 02 · Prompting & Control | Next section → 04 · Model Optimization & Formats