A seventy-billion-parameter model needs about 140 GB of memory at FP16 precision for its weights alone. That is more than a single 80 GB A100 can hold, so you need at least two cards before you even count the KV cache. For most teams, that’s out of reach. But with a technique called quantization, you can compress that same model until it fits on a single GPU. In this guide we’ll walk through the memory math step by step and see what you gain and what you give up.
The memory math
The starting point is simple: number of parameters times bytes per parameter. At FP16, each parameter takes two bytes, so seventy billion parameters is about 140 GB. Quantization reduces the number of bits per parameter:
| Precision | Bits per parameter | Memory for a 70B model |
|---|---|---|
| FP16 | 16 | ~140 GB |
| INT8 | 8 | ~70 GB |
| INT4 | 4 | ~35 GB |
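At INT8 the weights halve to about 70 GB, which fits on a single 80 GB card, though with only about 10 GB left over for the KV cache and activations. At INT4 they drop to about 35 GB, which fits comfortably on a 48 GB card with room left for the KV cache and batch processing. That is still more than a 24 GB consumer card can hold, so on consumer hardware you would need two cards or CPU offloading. These figures count weights only; real quantized checkpoints are slightly larger because they also store scale factors.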
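The arithmetic behind the table can be sketched in a few lines of Python. This is a minimal, weights-only estimate: the function name `weight_memory_gb` is made up for this example, 1 GB is taken as 10^9 bytes, and the KV cache, activations, and quantization metadata are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory estimate in GB (assuming 1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Reproduce the table rows for a 70B-parameter model.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(70e9, bits):.0f} GB")
# FP16: ~140 GB
# INT8: ~70 GB
# INT4: ~35 GB
```

Swapping in a different parameter count gives the same kind of estimate for other model sizes.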
What quantization actually does
Quantization rounds each weight to the nearest value on a smaller grid of numbers. FP16 has more than sixty thousand representable values; INT8 has only 256, and INT4 just 16. This rounding introduces error, and the error grows as the grid gets coarser. The key point is that language models are remarkably robust to it, because their many parameters and internal redundancy let them “route around” small errors.
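To make the rounding concrete, here is a minimal sketch of the simplest scheme: symmetric round-to-nearest with one scale for the whole tensor. It is an illustration under stated assumptions, not how any particular library implements quantization. The helper names are invented for this example, and the weights are synthetic Gaussian values rather than weights from a real model.

```python
import random


def quantize_dequantize(weights, bits):
    """Round each weight to the nearest point on a symmetric integer grid,
    then map it back to a float so the error can be measured."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    # Symmetric grids use levels -qmax..qmax, one fewer than the full 2**bits.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4)
}
for bits, err in errors.items():
    print(f"INT{bits}: mean absolute rounding error = {err:.6f}")
```

The INT4 error comes out larger than the INT8 error because the same value range is covered by far fewer grid points.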
What you give up
The price of quantization is a small drop in quality that grows with how hard you compress:
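- INT8: typically under one percent quality loss, which in practice is rarely noticeable. Speed depends on the implementation: halving the bytes read per weight can speed up memory-bound generation, but the gain relies on kernel support, and some INT8 schemes run slower than FP16.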
- INT4: roughly one to three percent quality loss, provided you use a good method. Acceptable for most uses.
- Below that (e.g. INT2): the loss grows sharply; this range isn’t recommended for serious work.
The important point is that the quantization method matters. Naive round-to-nearest quantization can drop quality by five to ten percent, whereas smarter methods achieve the same compression with far less loss.
The outlier problem
Why does crude quantization do badly? The main reason is outliers. Among a model’s weights, there are a few very large ones. If you make the quantization grid wide enough to fit those large values, the steps between grid points get coarser, and you lose precision in the range of common values, which is where the vast majority of weights live. Advanced methods target exactly this problem.
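The effect of a single outlier is easy to reproduce. The sketch below plants one large value in a synthetic tensor and compares INT4 rounding with one scale for the whole tensor against one scale per group of 64 weights. Group-wise scaling is used here only as a simple, widely used mitigation; it is not a full implementation of any advanced method, and the helper names, group size, and data are assumptions made for this example.

```python
import random


def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest with a single scale for `weights`."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def quantize_groupwise(weights, bits, group_size):
    """Same rounding, but each group of `group_size` weights gets its own scale."""
    restored = []
    for start in range(0, len(weights), group_size):
        restored.extend(quantize_dequantize(weights[start:start + group_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
weights[0] = 1.0  # one planted outlier, far larger than the common values

tensor_err = mean_abs_error(weights, quantize_dequantize(weights, 4))
group_err = mean_abs_error(weights, quantize_groupwise(weights, 4, 64))
print(f"one scale for the tensor: {tensor_err:.5f}")
print(f"one scale per 64 weights: {group_err:.5f}")
```

With a single scale, the outlier stretches the grid so far that almost every ordinary weight rounds to zero. With per-group scales, only the group containing the outlier pays that price.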
The practical recipe
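For INT4 quantization on a large model, methods like GPTQ and AWQ are the most common choices, and both keep quality loss low by going beyond plain rounding. GPTQ quantizes weights step by step and adjusts the remaining weights to compensate for the rounding error, using a small calibration set. AWQ uses activation statistics to find the weight channels that matter most and rescales them before quantizing. The NF4 format used in QLoRA is also optimised for the typical, roughly normal distribution of weights. The practical path is: try INT8 first, and if the model fits and quality is good, that’s enough. If you need more compression, move to INT4 with a good method, and measure the result on your own real task, not on public benchmarks.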
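The "does it fit" half of that recipe can be sketched as a small helper. This is a rough, hypothetical planning aid, not a sizing tool: `pick_precision` and `headroom_gb` are names invented for this example, the headroom for the KV cache and activations is a number you supply yourself, and quality still has to be measured separately on your own task.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory estimate in GB (assuming 1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def pick_precision(num_params: float, gpu_memory_gb: float, headroom_gb: float):
    """Return the least-compressed option whose weights plus headroom fit,
    or None if even INT4 does not fit on the card."""
    for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        if weight_memory_gb(num_params, bits) + headroom_gb <= gpu_memory_gb:
            return name
    return None


# A 70B model with 8 GB reserved for KV cache and activations (an assumed figure).
print(pick_precision(70e9, gpu_memory_gb=80, headroom_gb=8))  # INT8
print(pick_precision(70e9, gpu_memory_gb=48, headroom_gb=8))  # INT4
print(pick_precision(70e9, gpu_memory_gb=24, headroom_gb=8))  # None
```

The helper only answers the memory question. Whether the chosen precision is good enough is something only an evaluation on your own data can tell you.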
Putting it together
Quantization is one of those technologies that changes the cost equation: what needed several GPUs yesterday is possible on one today. The key is choosing the right level of compression: enough to fit, but not so much that quality collapses. And as always, the final judge isn’t the number on paper but how the model behaves on your real data.