04 · Model Optimization & Formats
Making models smaller, faster, and better adapted to a task. Study section 01 first: you need to understand what a model is before learning how to shrink or adapt it.
| Step | Topic | One-liner | Status |
|---|---|---|---|
| 1 | Quantization | Reducing weight precision from FP32 to INT8/INT4 to shrink model size and speed up inference | 🔴 |
| 2 | FP8 / INT8 / INT4 | The numeric formats used in quantization and what each trades off | 🔴 |
| 3 | GPTQ, AWQ & GGUF | Practical quantization algorithms and the file formats they produce | 🔴 |
| 4 | Pruning & Sparsity | Removing weights that contribute little to model output | 🔴 |
| 5 | Knowledge Distillation | Training a small model to mimic the output distribution of a large one | 🔴 |
| 6 | LoRA & QLoRA | Low-rank adaptation — fine-tuning with a fraction of the parameters | 🔴 |
| 7 | Adapter layers | Small trainable modules inserted into a frozen base model | 🔴 |
| 8 | Fine-tuning & SFT | Supervised fine-tuning on task-specific data | 🔴 |
| 9 | RLHF | Reinforcement learning from human feedback — how alignment training works | 🔴 |
| 10 | DPO & GRPO | Direct preference optimization and group relative policy optimization: simplified RLHF, where DPO drops the separate reward model and GRPO drops the learned critic (value model) | 🔴 |
| 11 | Model merging | Combining weights from multiple fine-tuned models into one | 🔴 |
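
As a taste of step 1, the one-liner on quantization can be made concrete with a minimal sketch. It assumes the simplest scheme, symmetric per-tensor INT8 with a single scale factor; the function names `quantize_int8` and `dequantize` are illustrative and not taken from any library. Practical methods such as GPTQ and AWQ (step 3) are considerably more elaborate.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.03, 1.0]   # toy FP32 weights (hypothetical values)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight. That size-versus-accuracy trade is what the first three steps explore.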