01 · Model Inference Core

This section covers what actually runs the model. Start here before any other section; everything else builds on these concepts.

Work through topics in the order below. Each one depends on the previous.

| Step | Topic | One-liner | Status |
| --- | --- | --- | --- |
| 1 | LLM, SLM & Foundation Models | What an LLM actually is and how it differs from earlier ML models | 🟢 |
| 2 | Tokenization | How raw text is split into the tokens a model actually sees | 🟢 |
| 3 | Embeddings | How tokens get numerical meaning in high-dimensional space | 🟢 |
| 4 | Transformer architecture | The architecture behind virtually every modern LLM, most often in its decoder-only form | 🟢 |
| 5 | Attention mechanism | How the model relates every token to every other token | 🟢 |
| 6 | Context window | Why there's a limit and what happens when you hit it | 🟢 |
| 7 | Autoregressive decoding | How the model generates one token at a time | 🟢 |
| 8 | KV cache | How attention keys and values from earlier tokens are cached to avoid recomputing them | 🔴 |
| 9 | TTFT & TBT metrics | Time to first token and time between tokens: how you measure inference speed | 🔴 |
| 10 | Continuous batching | How multiple requests share the GPU without waiting for each other | 🔴 |
| 11 | Paged attention | Dynamic KV cache memory management, the core vLLM innovation | 🔴 |
| 12 | FlashAttention | IO-aware attention that cuts memory usage and speeds up training/inference | 🔴 |
| 13 | Chunked prefill | Breaking long prompts into chunks to improve scheduling fairness | 🔴 |
| 14 | Speculative decoding | Using a small draft model to generate candidates the large model verifies | 🔴 |
| 15 | Mixture of Experts (MoE) | Routing each token to a subset of specialized sub-networks | 🔴 |
| 16 | Multimodal LLMs & VLMs | Extending language models to images, audio, and video | 🔴 |
| 17 | Reasoning models | Models trained to think step-by-step before answering | 🔴 |

Next section → 02 · Prompting & Control