LLM Infrastructure

A 4-part series

How do large language models actually run? This series walks through the inference engine itself — the difference between the prefill and decode phases, the role of the KV cache, the numerical compression of weights, and the batched scheduling of requests. The goal is a clear, vendor-neutral picture of where cost and speed are really decided — foundational knowledge that serves you whatever your stack.

PagedAttention and continuous batching: how one server answers more users

MLP is the model's memory: where knowledge lives

What quantization actually does: precision loss and vector-space collapse

Save first, then publish: a simple rule for not losing work