03 · Serving Infrastructure

How to deploy and scale models: the engines, parallelism strategies, and operational concepts that turn a model into a production service.

| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | vLLM | One of the most widely used open-source inference engines: PagedAttention, continuous batching | 🔴 |
| 2 | OpenAI-compatible API | The REST interface standard that makes engines interchangeable | 🔴 |
| 3 | TGI (Text Generation Inference) | HuggingFace's inference server and its strengths vs vLLM | 🔴 |
| 4 | TensorRT-LLM | NVIDIA's compiled inference engine, built for maximum GPU throughput | 🔴 |
| 5 | SGLang | Structured generation language: fast multi-call and constrained decoding | 🔴 |
| 6 | llama.cpp & Ollama | CPU-first inference for local and edge use cases | 🔴 |
| 7 | Triton Inference Server | NVIDIA's production serving framework supporting multiple backends | 🔴 |
| 8 | P/D disaggregation | Splitting prefill and decode onto separate workers for efficiency | 🔴 |
| 9 | Tensor parallelism | Splitting each layer's weight matrices across GPUs (intra-layer) | 🔴 |
| 10 | Pipeline parallelism | Splitting the model's layers across GPUs into sequential stages (inter-layer) | 🔴 |
| 11 | Expert parallelism | Distributing MoE expert networks across GPUs | 🔴 |
| 12 | Serving metrics | Throughput, latency SLO, goodput: how to measure a serving system | 🔴 |
| 13 | Batch inference | Offline processing of large request volumes without latency constraints | 🔴 |
| 14 | Edge inference | Running models on-device with constrained compute and memory | 🔴 |
| 15 | NVIDIA Dynamo | NVIDIA's new inference and serving framework (2025) | 🔴 |
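
As a small preview of step 12, the sketch below shows how throughput and goodput might be computed from a log of completed requests. It assumes one common definition of goodput (requests per second that finish within the latency SLO). The `RequestRecord` type, the `serving_metrics` helper, and the sample numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical log entry for one completed request."""
    latency_s: float     # end-to-end latency in seconds
    output_tokens: int   # number of tokens generated


def serving_metrics(records, window_s, slo_s):
    """Return (throughput in tokens/s, goodput in requests/s) for a window.

    Throughput counts every generated token; goodput counts only the
    requests whose latency met the SLO (an assumed definition).
    """
    total_tokens = sum(r.output_tokens for r in records)
    within_slo = sum(1 for r in records if r.latency_s <= slo_s)
    return total_tokens / window_s, within_slo / window_s


# Made-up sample: three requests observed over a 10-second window.
records = [
    RequestRecord(latency_s=0.8, output_tokens=120),
    RequestRecord(latency_s=1.5, output_tokens=200),
    RequestRecord(latency_s=3.2, output_tokens=80),
]
throughput, goodput = serving_metrics(records, window_s=10.0, slo_s=2.0)
print(f"throughput={throughput} tokens/s, goodput={goodput} requests/s")
```

The third request still contributes tokens to throughput but misses the 2-second SLO, so it does not count toward goodput. That gap is why the two metrics are tracked separately.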
