Engineering · Series
LLM Infrastructure
A 4-part series
How do large language models actually run? This series walks through the inference engine itself: the difference between the prefill and decode phases, the role of the KV cache, the numerical compression of weights (quantization), and the batched scheduling of requests. The goal is a clear, vendor-neutral picture of where cost and speed are really decided, which is foundational knowledge that applies whatever your stack.