10 · Integration & Cloud

Connecting to your existing stack: gateways, routing, streaming, cloud provider specifics, and deployment topology.

| Step | Topic | One-liner | Status |
|------|-------|-----------|--------|
| 1 | Streaming (SSE) | Server-sent events for token-by-token streaming from LLM APIs | 🔴 |
| 2 | AI gateway | The entry point: auth, rate limiting, routing, and observability in one layer | 🔴 |
| 3 | Auth & rate limiting | OAuth, API keys, and per-tenant token budgets | 🔴 |
| 4 | Load balancing | Distributing requests across model replicas, with LLM-specific challenges | 🔴 |
| 5 | Model router | Routing requests to different models based on cost, latency, or capability | 🔴 |
| 6 | KV cache-aware routing | Routing requests to the replica that already has the matching KV cache | 🔴 |
| 7 | Prefix-aware routing | Routing by shared prompt prefix to maximize cache hit rate | 🔴 |
| 8 | Fallback chain | Cascading through model options when the primary fails or is rate-limited | 🔴 |
| 9 | Open-weight vs managed endpoints | The cost, control, and operational tradeoffs between self-hosted and API | 🔴 |
| 10 | Cloud platforms | Vertex AI, Azure ML, and SageMaker: what each offers for LLM workloads | 🔴 |
| 11 | Managed inference | Groq, Baseten, Modal: fast inference APIs without self-hosting | 🔴 |
| 12 | Hugging Face Inference Endpoints | One-click hosted inference for HF models | 🔴 |
| 13 | Data sovereignty | Legal and architectural constraints on where model data can go | 🔴 |
| 14 | VPC & private endpoint | Keeping LLM traffic inside your network perimeter | 🔴 |
| 15 | On-prem & hybrid deployment | Running models on your own hardware alongside cloud capacity | 🔴 |
| 16 | AI SDK Patterns | Vercel AI SDK, Anthropic SDK, OpenAI SDK | 🔴 |
| 17 | Webhook vs Streaming vs Polling | Patterns for receiving AI responses: tradeoffs for real integrations | 🔴 |
