When we think about a language model, we often cast attention as the hero of the story. But if we ask where the model’s knowledge is stored (that the capital of France is Paris, say, or the syntax of a programming language), the answer lies largely elsewhere: in the MLP layers. Let’s see why.
Two different roles
A transformer block has two main parts, attention and the MLP (feed-forward network), and they do different jobs. Attention routes information: it decides which positions in the text are relevant to which others and moves information between them. In this picture, it holds little knowledge of its own. The MLP, which processes each position independently, is where most facts and patterns are stored.
A useful metaphor: attention is like a librarian who knows which book to fetch from where, and the MLP is like the books themselves, where the content lives. To answer “what is the capital of France,” attention brings the context of “France” and “capital” together, and the MLP retrieves the answer “Paris.”
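To make the routing-versus-storage split concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how a real model is implemented: the attention uses identity projections (no learned weights), and `positionwise` is a hypothetical stand-in for the MLP. The point it illustrates is that attention output at one position depends on the other positions, while a position-wise function cannot see them.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(xs):
    """Single-head self-attention with identity projections (a simplification)."""
    out = []
    for q in xs:
        # Score every position against the query, then mix their vectors.
        weights = softmax([dot(q, k) for k in xs])
        mixed = [sum(w * v[i] for w, v in zip(weights, xs)) for i in range(len(q))]
        out.append(mixed)
    return out

def positionwise(xs):
    """Hypothetical stand-in for the MLP: the same function applied to each position alone."""
    return [[max(0.0, 2.0 * x - 1.0) for x in vec] for vec in xs]

seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 3.0], [0.0, 1.0]]  # only position 0 differs from seq_a

# Does the output at position 1 change when only position 0 changes?
attn_changed = attention(seq_a)[1] != attention(seq_b)[1]
mlp_changed = positionwise(seq_a)[1] != positionwise(seq_b)[1]
print(attn_changed, mlp_changed)
```

Changing the first token changes what attention delivers to the second position, but the position-wise function at the second position is unaffected. That is the sense in which attention moves information and the MLP only transforms what has already arrived.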
A key-value kind of memory
Why does the MLP work this way? You can think of it as a key-value memory. The first layer of the MLP acts like a set of keys that recognise patterns, and the second layer like a set of values retrieved in response to those patterns. When a particular pattern appears in the input (say, “this is about the capital of a country”), the matching key lights up and its associated value, the stored knowledge, is added to the running representation that flows through the model (the residual stream).
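The key-value reading can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the three-dimensional vectors, the two hand-written keys and values, and the ReLU activation are made up for the example, and real MLPs have thousands of learned rows rather than two readable ones.

```python
def relu(x):
    return max(0.0, x)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy weights; all numbers are invented for illustration.
keys = [
    [1.0, 0.0, 0.0],   # key 0 responds to "pattern A" (first dimension)
    [0.0, 1.0, 0.0],   # key 1 responds to "pattern B" (second dimension)
]
values = [
    [0.0, 0.0, 5.0],   # written into the third dimension when key 0 fires
    [0.0, 0.0, -5.0],  # written into the third dimension when key 1 fires
]

def mlp(x):
    # First layer: one activation per key, measuring how well x matches it.
    acts = [relu(dot(k, x)) for k in keys]
    # Second layer: sum of values, each weighted by its key's activation.
    return [sum(a * v[i] for a, v in zip(acts, values)) for i in range(len(x))]

def mlp_block(x):
    # The retrieved value is added to the input (residual connection).
    return [xi + ui for xi, ui in zip(x, mlp(x))]

out_a = mlp_block([1.0, 0.0, 0.0])  # input matching pattern A
out_b = mlp_block([0.0, 1.0, 0.0])  # input matching pattern B
print(out_a, out_b)
```

Each input keeps its original content and gains whatever the matching key's value contributes, so the two patterns end up with opposite entries in the third dimension.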
Why the MLP is so large
This role explains why the MLP takes up such a large share of the model’s parameters. In a standard transformer block with hidden size d and a 4× MLP expansion, attention has about 4d² weights (the query, key, value, and output projections) and the MLP about 8d², so the MLP holds roughly two-thirds of each block’s parameters; the embeddings account for the rest of the model. The reason is intuitive: storing millions of independent facts (names, dates, relations, language syntax) takes a lot of capacity. Routing is lighter work; holding knowledge is heavy.
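The size difference is easy to check with quick arithmetic. The sketch below assumes a classic block layout: four square attention projections, a two-layer MLP with a 4× expansion, and no biases or gated variants. The function name and the example width of 768 are illustrative choices, and embeddings are left out.

```python
def block_param_counts(d_model, mlp_ratio=4):
    """Approximate weight counts for one transformer block (biases ignored)."""
    # Attention: query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # MLP: up-projection (d_model -> mlp_ratio * d_model) plus down-projection back.
    mlp = 2 * d_model * (mlp_ratio * d_model)
    return attention, mlp

attn, mlp = block_param_counts(d_model=768)
mlp_share = mlp / (attn + mlp)
print(attn, mlp, round(mlp_share, 3))
```

Under these assumptions the MLP holds twice as many weights as attention in every block, regardless of the width chosen. Architectures with a different expansion ratio or a gated MLP will shift the exact split.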
Why this matters
This separation isn’t just a theoretical point; it has practical consequences. When you fine-tune a model to teach it new knowledge, make sure the update reaches the MLP layers too, not just attention, because that’s where knowledge lives. Likewise, when you think about compressing or pruning a model, keep in mind that touching the MLP means touching the model’s memory.
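As a small illustration of what “targeting the MLP” means in practice, the sketch below picks which weights to train by module name. The parameter names and the `select_trainable` helper are hypothetical; real models and fine-tuning libraries use their own naming schemes, so treat this as the shape of the decision rather than a recipe.

```python
# Hypothetical parameter names for a single block; real models name these differently.
PARAM_NAMES = [
    "blocks.0.attn.q_proj.weight",
    "blocks.0.attn.v_proj.weight",
    "blocks.0.mlp.up_proj.weight",
    "blocks.0.mlp.down_proj.weight",
]

def select_trainable(names, target_modules):
    """Return the parameters whose dotted path contains one of the target modules."""
    return [n for n in names if any(part in target_modules for part in n.split("."))]

attention_only = select_trainable(PARAM_NAMES, {"attn"})
with_mlp = select_trainable(PARAM_NAMES, {"attn", "mlp"})
print(len(attention_only), len(with_mlp))
```

With an attention-only selection, none of the MLP weights would be updated at all, which is exactly the situation the paragraph above warns about when the goal is to add new knowledge.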
Putting it together
The simple picture is this: attention says “look at this,” and the MLP says “and here’s what I know about it.” A model’s knowledge is spread mostly across its MLP weights rather than stored in the attention mechanism. Understanding this separation helps both with designing better fine-tuning and with grasping why models are so large: most of that size is memory.