Text generation in a language model doesn’t work the way it looks from the outside. Behind every request, the work splits into two phases with very different characteristics: prefill and decode. Understanding the difference between them explains why optimising inference is so hard.

Why generation is sequential

Language models generate text token by token, and each new token depends on all the tokens before it. The model can’t produce the next token until the current one exists, because the current token becomes part of the input for the next prediction. Generation is therefore inherently sequential across output tokens: the work inside a single step can run in parallel, but the steps themselves can’t.
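The dependency described above can be sketched as a plain loop. This is a minimal illustration, not a real model: `toy_model` is a hypothetical stand-in for the network, and `generate` and `eos_id` are names chosen here for the example.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos_id: int = 0) -> List[int]:
    """Greedy autoregressive loop: every step consumes all earlier tokens."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The prediction needs the full prefix, including the token
        # produced one iteration ago, so iterations cannot overlap.
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens


def toy_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for a real model: the next token is some
    # function of the whole prefix. It never returns eos_id (0).
    return (sum(prefix) * 31 + len(prefix)) % 50 + 1


output = generate(toy_model, prompt=[3, 5], max_new_tokens=4)
```

Each call to `next_token` receives a list that already contains the previous call’s result, which is why the loop can’t be unrolled into independent parallel tasks.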

Prefill: compute-bound

In the prefill phase, the model processes the whole input prompt at once, in parallel. All positions are computed at the same time, which produces large matrix multiplications that saturate the GPU’s compute cores. This phase is “compute-bound”: its speed is limited by how fast the cores can do arithmetic, not by how fast data arrives from memory. Core utilisation is very high, close to 95 percent in the source’s example.

Decode: memory-bandwidth-bound

In the decode phase, the model produces just one token at a time. The input shrinks to a single token’s vector, so the large matrix multiplications of prefill turn into small matrix-vector products. The main cost moves elsewhere: at every step, the model weights and the entire cache have to be read from memory. This phase is “memory-bandwidth-bound”: gigabytes of data move per step, but very little computation is done on them. In the source’s example, core utilisation in this phase drops to around 10 percent, while memory bandwidth is near saturation.
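One way to see the contrast between the two phases is arithmetic intensity: floating-point operations per byte moved. The sketch below estimates it for a single square weight matrix under simplifying assumptions that are not from the source: half-precision values, a weight matrix read once, and activations read and written once. The function name and the choice of one `d × d` projection are illustrative only.

```python
def arithmetic_intensity(n_tokens: int, d: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for one (n_tokens x d) @ (d x d) projection."""
    flops = 2 * n_tokens * d * d                 # one multiply and one add per weight per token
    weight_bytes = d * d * bytes_per_value       # the weight matrix is read once
    activation_bytes = 2 * n_tokens * d * bytes_per_value  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


# Prefill: a 4,096-token prompt goes through the matrix in one pass.
prefill = arithmetic_intensity(n_tokens=4096, d=4096)
# Decode: a single token goes through the same matrix.
decode = arithmetic_intensity(n_tokens=1, d=4096)

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
# prefill: 1365 FLOPs/byte, decode: 1.00 FLOPs/byte
```

Under these assumptions the same weights do over a thousand times more useful work per byte in prefill than in decode, which is consistent with the utilisation gap described above.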

Why the cache exists

If there were no cache at all, every step would have to recompute all the past tokens from scratch, which is hugely wasteful. The key observation is that when a new token is added, the keys and values that attention needs from the past tokens don’t change, because the past doesn’t depend on the future. So it’s enough to compute them once and keep them stored instead of rebuilding them at every later step. This is the key-value cache (KV cache).
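The idea can be checked on a toy single-head attention layer. The sketch below is written in plain Python so it runs without dependencies. The dimension `D`, the random weights, and the names `KVCache` and `step_without_cache` are all hypothetical choices for illustration, and a real model would add multiple heads, layers, and positional information.

```python
import math
import random

random.seed(0)
D = 4  # toy head dimension (hypothetical)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def project(x, W):
    """Row vector times matrix."""
    return [sum(x[i] * W[i][j] for i in range(D)) for j in range(D)]


def attend(q, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(D)]


def step_without_cache(tokens):
    """Recompute keys and values for every token seen so far."""
    keys = [project(x, W_k) for x in tokens]
    values = [project(x, W_v) for x in tokens]
    return attend(project(tokens[-1], W_q), keys, values)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append it, then attend over the cache."""
        self.keys.append(project(x, W_k))
        self.values.append(project(x, W_v))
        return attend(project(x, W_q), self.keys, self.values)


tokens = rand_matrix(6, D)  # six toy token embeddings
cache = KVCache()
for t, x in enumerate(tokens):
    cached = cache.step(x)
    full = step_without_cache(tokens[: t + 1])
    # Same result, but the cached path projected one token instead of t + 1.
    assert all(math.isclose(a, b) for a, b in zip(cached, full))
```

The cached path does two projections per step for keys and values regardless of sequence length, while the uncached path does two per token seen so far. The price is that `self.keys` and `self.values` keep growing.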

This cache has a price: it grows linearly with the length of the text. In the source’s example, for a model with 32 layers and a hidden dimension of 4,096 at half precision, each token needs about 512 KB of cache (2 vectors, one key and one value, × 32 layers × 4,096 values × 2 bytes). That comes to about 2 GB for a 4,096-token text, and that’s for a single request. This is why it’s often the cache, not the size of the model itself, that decides how many users a system can serve at once.
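The figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes every layer stores one full-width key vector and one full-width value vector per token, which matches the numbers quoted above. The function names and the 40 GiB cache budget are hypothetical and only there to illustrate the capacity argument.

```python
KIB, GIB = 2**10, 2**30


def kv_bytes_per_token(n_layers: int, hidden_dim: int, bytes_per_value: int = 2) -> int:
    # 2 = one key vector plus one value vector per layer
    return 2 * n_layers * hidden_dim * bytes_per_value


def kv_bytes_per_request(n_tokens: int, n_layers: int, hidden_dim: int,
                         bytes_per_value: int = 2) -> int:
    # The cache grows linearly with the number of tokens in the request.
    return n_tokens * kv_bytes_per_token(n_layers, hidden_dim, bytes_per_value)


def max_concurrent_requests(cache_budget_bytes: int, n_tokens: int,
                            n_layers: int, hidden_dim: int,
                            bytes_per_value: int = 2) -> int:
    per_request = kv_bytes_per_request(n_tokens, n_layers, hidden_dim, bytes_per_value)
    return cache_budget_bytes // per_request


per_token = kv_bytes_per_token(n_layers=32, hidden_dim=4096)
per_request = kv_bytes_per_request(n_tokens=4096, n_layers=32, hidden_dim=4096)
print(per_token // KIB, "KiB per token")      # 512 KiB per token
print(per_request // GIB, "GiB per request")  # 2 GiB per request

# Hypothetical budget: 40 GiB of GPU memory left for cache after the weights.
print(max_concurrent_requests(40 * GIB, 4096, 32, 4096))  # 20
```

With these numbers, the cache budget caps the system at 20 simultaneous full-length requests no matter how much compute is available.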

What this means in practice

In a typical request, the decode phase takes most of the time, because each output token is a separate step. Since this phase is bound by memory bandwidth rather than computation, the most effective way to raise throughput is to batch several requests together: the weights are read once per step and shared by every request in the batch, so the system’s total throughput goes up considerably. Each request still has to read its own cache, which is why cache size limits how far batching can go.
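A rough bandwidth-only model shows why batching helps and where it stops helping. It assumes each decode step reads the weights once plus one cache per request, and it ignores compute time entirely. The 14 GiB weight size and the 1 TB/s bandwidth are hypothetical numbers, not taken from the source. The 2 GiB cache per request is the source’s 4,096-token example.

```python
GIB = 2**30


def decode_tokens_per_second(batch: int, weight_bytes: float,
                             kv_bytes_per_request: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Throughput if each step is limited purely by memory traffic."""
    # Weights are read once and shared; every request reads its own cache.
    step_bytes = weight_bytes + batch * kv_bytes_per_request
    step_seconds = step_bytes / bandwidth_bytes_per_s
    return batch / step_seconds  # one new token per request per step


WEIGHTS = 14 * GIB   # hypothetical model size
KV = 2 * GIB         # cache per 4,096-token request, from the example above
BANDWIDTH = 1e12     # hypothetical 1 TB/s of memory bandwidth

for batch in (1, 8, 32):
    rate = decode_tokens_per_second(batch, WEIGHTS, KV, BANDWIDTH)
    print(f"batch {batch:>2}: {rate:6.1f} tokens/s")

# Upper bound as the batch grows: cache reads dominate the step.
ceiling = BANDWIDTH / KV
```

In this model throughput rises with batch size because the weight read is amortised, but it flattens towards `ceiling` once the per-request cache reads dominate the traffic.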