Quantization compresses a model by storing each weight in fewer bits. That cuts memory use and usually speeds up inference. But what is actually happening beneath this saving? What exactly do we lose, and why do models usually survive? Let’s take a closer look.
Rounding to a smaller grid
At the heart of quantization sits a simple operation: rounding. In a 16-bit floating-point format, each weight can take one of 65,536 bit patterns; in 32-bit, about 4.3 billion. Quantization restricts the weights to a much smaller grid: 256 values in the eight-bit case, or just 16 in the four-bit case. Each original weight is rounded to the nearest value on this grid. For example, a weight like 0.3847 might be rounded to 0.3861: a small error, but a real one.
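The rounding step can be sketched in a few lines. This is a minimal illustration, not any particular library’s method: it assumes a uniform grid of 2**bits points stretched between the smallest and largest weight, and the function name quantize and the sample weights are made up for the example.

```python
def quantize(weights, bits):
    """Round each weight to the nearest point on a uniform grid.

    Assumption: the grid has 2**bits evenly spaced points spanning the
    smallest and largest weight (simple min-max quantization).
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:                          # all weights equal: nothing to round
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)    # gap between neighbouring grid points
    return [lo + round((w - lo) / step) * step for w in weights]


weights = [-1.0, -0.25, 0.0, 0.3847, 1.0]   # made-up example values
for bits in (8, 4):
    rounded = quantize(weights, bits)
    worst = max(abs(w - q) for w, q in zip(weights, rounded))
    print(f"{bits}-bit: worst rounding error = {worst:.5f}")
```

On a grid like this, no weight can move by more than half a step, so the worst-case error is set by how far apart the grid points are.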
Error accumulation
This small error seems harmless on its own. But a model has dozens of layers, and each layer passes its error on to the next, so a small initial error can grow as it travels through the model. The coarser the quantization grid (fewer bits), the greater the quality loss, and the relationship is not linear: on a uniform grid, every bit you remove roughly doubles the spacing between grid points, so the rounding error climbs faster and faster as the bit width shrinks.
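To get a feel for how rounding error travels through a stack of layers, here is a toy sketch. Everything in it is an assumption made for illustration: the "model" is just a chain of random linear layers with no nonlinearity, the sizes (6 layers, width 16) are arbitrary, and relative_drift is a hypothetical helper, not a standard metric.

```python
import math
import random


def quantize(values, bits):
    """Round values to a uniform grid of 2**bits points spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relative_drift(bits, depth=6, width=16, seed=0):
    """Relative output error after each layer when every layer is quantized."""
    rng = random.Random(seed)
    exact = [rng.gauss(0.0, 1.0) for _ in range(width)]
    approx = list(exact)
    drift = []
    for _ in range(depth):
        # Toy layer: random weights scaled so activations keep a similar size.
        layer = [[rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)]
                 for _ in range(width)]
        flat = quantize([w for row in layer for w in row], bits)
        layer_q = [flat[i * width:(i + 1) * width] for i in range(width)]
        exact = matvec(layer, exact)       # full-precision path
        approx = matvec(layer_q, approx)   # quantized path carries its error forward
        err = math.sqrt(sum((a - e) ** 2 for a, e in zip(approx, exact)))
        norm = math.sqrt(sum(e ** 2 for e in exact))
        drift.append(err / norm)
    return drift


for bits in (8, 4, 2):
    print(bits, [round(d, 4) for d in relative_drift(bits)])
```

Each printed list shows the gap between the quantized and full-precision outputs layer by layer. A real network has nonlinearities, normalization, and trained weights, so treat the numbers as an illustration of the mechanism rather than a prediction.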
The outlier problem
But quantization’s biggest challenge lies elsewhere: outliers. The distribution of a model’s weights is usually tight and centred near zero; most weights are small. But there are also a few very large weights. If you stretch the quantization grid wide enough to fit those large values, the gap between grid points grows and you lose precision in the range near zero, which is where the majority of weights live. This is where naive quantization does the most damage, and it is why more advanced methods handle the outliers separately.
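The effect of a single outlier is easy to reproduce. The sketch below uses made-up weights (101 small values plus one large one) and a deliberately simple remedy: keep anything above a threshold unquantized and round the rest on its own grid. The helper quantize_with_outliers and the threshold of 1.0 are assumptions for illustration; real methods differ in how they detect and store outliers.

```python
def quantize(values, bits):
    """Round values to a uniform grid of 2**bits points spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def quantize_with_outliers(weights, bits, threshold):
    """Hypothetical helper: leave |w| > threshold untouched, quantize the rest."""
    inliers = [w for w in weights if abs(w) <= threshold]
    rounded = iter(quantize(inliers, bits)) if inliers else iter(())
    return [next(rounded) if abs(w) <= threshold else w for w in weights]


def mean_abs_error(original, rounded):
    """Average absolute rounding error over all weights."""
    return sum(abs(o - r) for o, r in zip(original, rounded)) / len(original)


small = [i / 1000 for i in range(-50, 51)]   # 101 weights in [-0.05, 0.05]
weights = small + [8.0]                      # plus one large outlier

naive = quantize(weights, 4)                               # grid stretched to reach 8.0
split = quantize_with_outliers(weights, 4, threshold=1.0)  # outlier kept as-is

print("naive:", mean_abs_error(weights, naive))
print("split:", mean_abs_error(weights, split))
```

In the naive case the four-bit grid is stretched so far that every small weight lands on the same grid point; once the outlier is set aside, the same 16 grid points cover only the narrow range where the small weights actually sit.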
Vector-space collapse
Under extreme compression (say, two or three bits), something worse happens. The grid now has only four or eight values, so large groups of weights that were once distinct are mapped to the same quantized value. When that happens, the model loses its power to discriminate between them, as though its information capacity had collapsed. This helps explain why quality loss at very low bit widths becomes suddenly severe rather than gradual.
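One rough way to see the collapse is to count how many distinct values survive rounding. The sketch below is illustrative only: it assumes a uniform min-max grid and 1000 evenly spread made-up weights, and distinct_after is a hypothetical helper name.

```python
def quantize(values, bits):
    """Round values to a uniform grid of 2**bits points spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def distinct_after(weights, bits):
    """Number of different values left once the weights are quantized."""
    return len(set(quantize(weights, bits)))


weights = [i / 999 for i in range(1000)]   # 1000 distinct made-up weights in [0, 1]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit: {distinct_after(weights, bits)} distinct values "
          f"out of {len(set(weights))}")
```

The count can never exceed 2**bits, so at two bits a thousand different weights are forced into at most four buckets.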
Why models are robust
Even so, the surprising thing is that language models are remarkably robust to moderate quantization. In the eight-bit case, quality loss is usually under one percent; at four bits, with a good method, it is typically one to three percent. The reason is that a model’s weights are the result of a statistical training process, not precise computation, and the model’s internal redundancy (millions or even billions of parameters) lets it “route around” small errors. The model has more parameters than its task strictly needs, and it is this margin that makes quantization possible.
Putting it together
Quantization is a smart trade: you give up a little precision in exchange for memory and speed. The key to success is knowing the limits: how far you can compress before error accumulation and vector-space collapse bring quality down. And as always, the final judge is the model’s behaviour on your real task, not the bit count.