LoRA is remarkable in practice: it tunes a giant model by training only a few small matrices, while the main body stays frozen. But why does this work? How can you make a meaningful change in the behaviour of a billion-parameter model without touching most of it? The answer lies in a subtle idea.
Two kinds of knowledge, two kinds of need
To understand LoRA, we have to distinguish two different jobs. Pre-training is a big job: the model has to learn broad general knowledge from scratch (language, facts, patterns). This needs full capacity and uses all of those billions of parameters. But fine-tuning is a smaller job: just adapting an already-capable model for one specific task. And that adaptation isn't all that complex.
The low-intrinsic-dimension hypothesis
LoRA's key idea is this: although the model's original weights are high-dimensional, the change needed for fine-tuning has a low intrinsic dimension. Put simply, what has to change in fine-tuning fits in a small space, even if the model itself is enormous. This is an empirical observation that forms the basis of the whole method: you don't need to change the whole model, because the useful change is inherently small.
The simple math: ΔW = BA
LoRA implements this idea directly. Instead of training a full change matrix ΔW, which is as large as the weight matrix itself, it writes the change as the product of two small matrices: ΔW = B × A. If the original weight has shape d × k, then B has shape d × r and A has shape r × k, where the rank r is much smaller than d and k. This means that instead of training millions of values per weight matrix, you train only thousands. The effective weight becomes the original weight (frozen) plus this trained low-rank change: W + B × A. The reduction in the number of trainable parameters is dramatic, often hundreds of times.
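To make the saving concrete, here is a small arithmetic sketch. The sizes are illustrative assumptions, not figures from any particular model: a single square weight matrix of 4096 × 4096 and a rank of 8. The helper names are hypothetical.

```python
def full_update_params(d_out, d_in):
    # A full change matrix has the same shape as the weight: d_out x d_in.
    return d_out * d_in


def lora_update_params(d_out, d_in, r):
    # B has shape d_out x r and A has shape r x d_in.
    return d_out * r + r * d_in


# Illustrative sizes (assumed for this example only).
d_out = d_in = 4096
r = 8

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, r)
ratio = full // lora

print(full, lora, ratio)  # prints: 16777216 65536 256
```

Under these assumed sizes, the low-rank form trains 256 times fewer values for that one matrix. The count grows linearly with r, so doubling the rank doubles the trainable parameters.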
The initialisation trick
There's a beautiful detail at work too. At the start of training, matrix A is initialised randomly and matrix B with zeros. The result is that the product B × A is zero at the start, meaning the model behaves exactly like the original on the first step. This means training begins from the very point where the base model stands and gradually moves away from it. This trick prevents a sudden, destructive jump at the start of training.
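The effect of this initialisation can be checked with a toy example. This is a minimal sketch in plain Python, not a real training setup: the tiny dimensions, the Gaussian scale for A, and the helper names matvec and lora_forward are all assumptions made for illustration.

```python
import random


def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x):
    # Frozen path W x plus the low-rank path B (A x).
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r = 6, 5, 2  # toy sizes, assumed for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # random init
B = [[0.0] * r for _ in range(d_out)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

base_out = matvec(W, x)
lora_out = lora_forward(W, A, B, x)
print(lora_out == base_out)  # True: B is all zeros, so B (A x) adds nothing

# Once B moves away from zero, the output starts to differ from the base model.
B[0][0] = 0.5
changed_out = lora_forward(W, A, B, x)
```

Because B is zero, the low-rank path contributes nothing, whatever values A holds. Only when training starts to update B does the output drift away from the base model.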
Why this matters
Understanding this "why" isn't just theoretical curiosity; it shapes practical decisions. Because the useful change is low-rank, you know that very large ranks are usually unnecessary and only raise the risk of overfitting. And because the main body stays untouched, the model's general knowledge is preserved and the risk of catastrophic forgetting drops. LoRA, in effect, keeps small what should in theory be small.
Putting it together
LoRA is born from a simple insight: adapting a model for a specific task doesn't need a big change, because that change inherently fits in a small space. By writing this change as the product of two small matrices, you can tune enormous models for a small fraction of the usual cost, without harming the model's core knowledge. That's the beauty of a right idea in the right place.