10 · Integration & Cloud

Connecting LLMs to your existing stack: gateways, routing, streaming, cloud provider specifics, and deployment topology.

Step	Topic	One-liner	Status
1	Streaming (SSE)	Server-sent events for token-by-token streaming from LLM APIs	🔴
2	AI gateway	The entry point: auth, rate limiting, routing, and observability in one layer	🔴
3	Auth & rate limiting	OAuth, API keys, and per-tenant token budgets	🔴
4	Load balancing	Distributing requests across model replicas — LLM-specific challenges	🔴
5	Model router	Routing requests to different models based on cost, latency, or capability	🔴
6	KV cache-aware routing	Routing requests to the replica that already has the matching KV cache	🔴
7	Prefix-aware routing	Routing by shared prompt prefix to maximize cache hit rate	🔴
8	Fallback chain	Cascading through model options when the primary fails or is rate-limited	🔴
9	Open-weight vs managed endpoints	The cost, control, and operational tradeoffs between self-hosted models and managed APIs	🔴
10	Cloud platforms	Vertex AI, Azure ML, and SageMaker — what each offers for LLM workloads	🔴
11	Managed inference	Groq, Baseten, Modal — fast inference APIs without self-hosting	🔴
12	Hugging Face Inference Endpoints	One-click hosted inference for HF models	🔴
13	Data sovereignty	Legal and architectural constraints on where model data can go	🔴
14	VPC & private endpoint	Keeping LLM traffic inside your network perimeter	🔴
15	On-prem & hybrid deployment	Running models on your own hardware alongside cloud capacity	🔴
16	AI SDK patterns	Vercel AI SDK, Anthropic SDK, and OpenAI SDK	🔴
17	Webhook vs streaming vs polling	Patterns for receiving AI responses, and their tradeoffs for real integrations	🔴
