01 · Model Inference Core

What runs the model. Start here before any other section — everything else builds on these concepts.

Work through topics in the order below. Each one depends on the previous.

Step	Topic	One-liner	Status
1	LLM, SLM & Foundation Models	What an LLM actually is and how it differs from earlier ML models	🟢
2	Tokenization	How raw text is split into the tokens a model actually sees	🟢
3	Embeddings	How tokens get numerical meaning in high-dimensional space	🟢
4	Transformer architecture	The encoder-decoder architecture that underlies every modern LLM	🟢
5	Attention mechanism	How the model relates every token to every other token	🟢
6	Context window	Why there's a limit and what happens when you hit it	🟢
7	Autoregressive decoding	How the model generates one token at a time	🟢
8	KV cache	How intermediate attention state is cached to avoid recomputing	🔴
9	TTFT & TBT metrics	Time to first token and time between tokens — how you measure inference speed	🔴
10	Continuous batching	How multiple requests share the GPU without waiting for each other	🔴
11	Paged attention	Dynamic KV cache memory management — the core vLLM innovation	🔴
12	FlashAttention	IO-aware attention that cuts memory usage and speeds up training/inference	🔴
13	Chunked prefill	Breaking long prompts into chunks to improve scheduling fairness	🔴
14	Speculative decoding	Using a small draft model to generate candidates the large model verifies	🔴
15	Mixture of Experts (MoE)	Routing each token to a subset of specialized sub-networks	🔴
16	Multimodal LLMs & VLMs	Extending language models to images, audio, and video	🔴
17	Reasoning models	Models trained to think step-by-step before answering	🔴

Next section → 02 · Prompting & Control