When you serve a language model to real users, the key question is: how many requests can you answer at once on the same hardware? The answer depends largely on two infrastructure tricks, both of which come from one simple idea — don’t waste resources. Let’s get to know them.
The problem with simple batching
The simplest way to serve is static batching: you wait until a few requests gather, process them all together, and don’t move to the next batch until the whole batch is finished. The problem is that requests have different lengths. If one request finishes early but another in the same batch is long, the finished request’s slot sits idle until the batch ends. Expensive GPUs spend much of their time waiting.
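To put a number on that idle time, here is a minimal sketch. It assumes each request occupies one slot and generates one token per step; the function name and the example lengths are made up for illustration.

```python
def static_batch_utilisation(lengths):
    """Fraction of slot-steps doing useful work in one static batch.

    The batch runs for as many steps as its longest request, so every
    shorter request leaves its slot idle for the remaining steps.
    """
    steps = max(lengths)            # the batch ends with its longest request
    useful = sum(lengths)           # slot-steps that actually produce a token
    return useful / (steps * len(lengths))


# Hypothetical output lengths (in tokens) for four requests in one batch.
utilisation = static_batch_utilisation([10, 20, 100, 30])
print(f"{utilisation:.0%} of slot-steps are useful")
```

With these made-up lengths, one long request keeps three finished slots waiting, so well under half of the slot-steps do useful work.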
Continuous batching
Continuous batching removes this waste. The scheduler makes its decisions at every generation step rather than once per batch: the moment one request completes, it’s pulled out of the batch and a waiting request takes its place. As long as requests are queued, no slot sits idle. The effect of this change is dramatic: throughput can be several times that of static batching, especially when request lengths vary widely.
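The difference can be seen in a toy simulation. This is only a sketch under simplifying assumptions: every request is already queued, each running request produces exactly one token per step, and prompt processing is ignored. The function names are hypothetical and do not come from any serving library.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Total steps when each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_steps(lengths, batch_size):
    """Total steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step: every active request emits a token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: two long requests mixed with six short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_steps(lengths, batch_size=4))
print(continuous_steps(lengths, batch_size=4))
```

On this workload static batching needs 200 steps and continuous batching 110, because the second long request starts as soon as a short one frees a slot instead of waiting for the first batch to drain.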
The KV-cache problem
But there’s a second bottleneck: the KV cache. For every token it has processed, a request keeps the attention keys and values so they don’t have to be recomputed, and that store grows with the length of the text. The simple approach is to reserve one large contiguous block per request, sized for the worst case. But most requests don’t get that long, so a large part of the reserved memory is wasted. In practice, half or more of the reserved memory can go unused.
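A back-of-the-envelope calculation shows why worst-case reservation hurts. The model dimensions below are assumptions chosen for illustration (32 layers, 32 key/value heads, head dimension 128, 16-bit values), not figures from any particular model; the per-token formula is the usual two tensors (keys and values) per layer.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys plus values in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape and limits, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
max_len = 2048      # worst-case length the server reserves for
actual_len = 300    # length a typical request actually reaches

reserved = per_token * max_len
used = per_token * actual_len
print(f"per token: {per_token / 2**10:.0f} KiB")
print(f"reserved:  {reserved / 2**20:.0f} MiB")
print(f"used:      {used / 2**20:.0f} MiB ({used / reserved:.0%})")
```

Under these assumptions each token costs 512 KiB, so a 2048-token reservation pins 1 GiB while a 300-token request touches only 150 MiB of it.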
PagedAttention
The solution is an idea borrowed from operating systems: PagedAttention. Just as an OS breaks memory into small pages and keeps them scattered, here too the KV cache is divided into small blocks (say, sixteen tokens per block). Each request takes only as many blocks as it actually needs, and the blocks don’t have to sit next to each other in memory; a per-request block table records where each one lives. The result is that waste shrinks to the unfilled part of each request’s last block, and utilisation can rise above ninety percent.
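The bookkeeping behind this idea can be sketched in a few lines. This is a simplified illustration, not the implementation of any real serving library: the class name, its methods, and the free-list policy are all hypothetical, and it tracks only block ids rather than actual GPU memory.

```python
class BlockAllocator:
    """Hands out fixed-size KV blocks on demand and tracks a table per request."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req_id):
        """Record one more token, taking a new block only when the last is full."""
        table = self.tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return all of a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = BlockAllocator(num_blocks=8)
for _ in range(16):
    alloc.append_token("a")      # fills request a's first block
alloc.append_token("b")          # request b takes the next free block
alloc.append_token("a")          # request a now needs a second block
print(alloc.tables)              # a's two blocks are not adjacent
```

Request a ends up with two blocks that are not neighbours, which is fine because attention looks blocks up through the table. The only unused space is the tail of each request's last block.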
There’s a bonus too: if several requests share a common prefix — say, the same system prompt — they can share the same blocks rather than each keeping its own copy. For uses like chat that have a fixed prompt, this is a big saving.
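One common way to make such sharing safe is reference counting: a block returns to the free list only when no request points at it any more. The sketch below assumes the shared prefix fills whole blocks and that shared blocks are never written to again (a real system would need copy-on-write for a partially filled shared block); the class and method names are hypothetical.

```python
class SharedBlockPool:
    """Reference-counted block pool so requests can share prefix blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}  # block id -> number of holders

    def allocate(self):
        """Take a fresh block with a single holder."""
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, blocks):
        """Give another holder a reference to existing blocks; nothing is copied."""
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def release(self, blocks):
        """Drop one reference per block; free a block when nobody holds it."""
        for b in blocks:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                del self.refcount[b]
                self.free.append(b)


pool = SharedBlockPool(num_blocks=10)
# Hypothetical system prompt of 64 tokens = 4 blocks of 16, cached once.
prefix = [pool.allocate() for _ in range(4)]
chat_a = pool.share(prefix) + [pool.allocate()]  # prefix + one private block
chat_b = pool.share(prefix) + [pool.allocate()]
print(len(pool.free))  # blocks still available
```

Here two chats occupy six blocks in total instead of the ten they would need with private copies of the prompt, and releasing one chat frees only its private block.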
Why these two matter together
Continuous batching keeps the compute slots full, and PagedAttention frees memory so more requests fit. The two reinforce each other: the freed memory means you can build a larger batch, and a larger batch means more throughput. This is why these two tricks are the foundation of modern language-model serving libraries.
Putting it together
The important point is that these improvements don’t change the model; they just manage resources more intelligently. The same model, on the same hardware, can answer far more users — simply because it no longer wastes compute slots and memory. At scale, it’s often this resource management, not the model itself, that decides how much each user costs.