10 · Integration & Cloud
Connecting LLM workloads to your existing stack: gateways, routing, streaming, cloud-provider specifics, and deployment topology.
| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | Streaming (SSE) | Server-sent events for token-by-token streaming from LLM APIs | 🔴 |
| 2 | AI gateway | The entry point: auth, rate limiting, routing, and observability in one layer | 🔴 |
| 3 | Auth & rate limiting | OAuth, API keys, and per-tenant token budgets | 🔴 |
| 4 | Load balancing | Distributing requests across model replicas, and the challenges specific to LLM workloads | 🔴 |
| 5 | Model router | Routing requests to different models based on cost, latency, or capability | 🔴 |
| 6 | KV cache-aware routing | Routing requests to the replica that already has the matching KV cache | 🔴 |
| 7 | Prefix-aware routing | Routing by shared prompt prefix to maximize cache hit rate | 🔴 |
| 8 | Fallback chain | Cascading through model options when the primary fails or is rate-limited | 🔴 |
| 9 | Open-weight vs managed endpoints | The cost, control, and operational tradeoffs between self-hosted and API | 🔴 |
| 10 | Cloud platforms | Vertex AI, Azure ML, and SageMaker — what each offers for LLM workloads | 🔴 |
| 11 | Managed inference | Groq, Baseten, Modal — fast inference APIs without self-hosting | 🔴 |
| 12 | Hugging Face Inference Endpoints | One-click hosted inference for HF models | 🔴 |
| 13 | Data sovereignty | Legal and architectural constraints on where model data can go | 🔴 |
| 14 | VPC & private endpoint | Keeping LLM traffic inside your network perimeter | 🔴 |
| 15 | On-prem & hybrid deployment | Running models on your own hardware alongside cloud capacity | 🔴 |
| 16 | AI SDK patterns | Common integration patterns across the Vercel AI SDK, Anthropic SDK, and OpenAI SDK | 🔴 |
| 17 | Webhook vs streaming vs polling | Patterns for receiving AI responses, and their tradeoffs in real integrations | 🔴 |
← Previous section: 09 · Observability & Ops