When we think about the cost of serving a language model, we often settle for a vague number. But if you open up the bill, you see a clearer picture: most of the cost is concentrated in one place. Understanding that concentration not only helps control cost, but also clarifies the perennial question of “build or buy.”

Where the money goes

In serving a model, the bulk of the infrastructure cost goes to GPUs, often close to ninety percent. The rest goes to operations and people. This means any decision that raises GPU utilisation lowers the bill directly. And this is exactly where the big saving opportunity lies.
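
To make the link between utilisation and the bill concrete, here is a minimal sketch under a deliberately simple assumption: a fixed load must be served, and each GPU delivers its full throughput only for the fraction of time it is busy. The function name, the load, the per-GPU throughput and the price are all hypothetical and chosen for easy arithmetic. None of them come from the source.

```python
import math

def monthly_gpu_bill(load_rps, rps_per_gpu, utilisation, price_per_gpu_month):
    """Return (gpu_count, monthly_cost) for a fixed load.

    Simplified model (an assumption, not a measurement): a GPU that is
    busy `utilisation` of the time delivers rps_per_gpu * utilisation.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    effective_rps = rps_per_gpu * utilisation
    gpus = math.ceil(load_rps / effective_rps)
    return gpus, gpus * price_per_gpu_month

# Hypothetical inputs: 400 requests/s, 10 requests/s per fully busy GPU,
# 2,000 units per GPU per month.
low = monthly_gpu_bill(400, 10, 0.25, 2_000)
high = monthly_gpu_bill(400, 10, 0.50, 2_000)
print(low)   # (160, 320000)
print(high)  # (80, 160000)
```

In this toy model, doubling utilisation halves the number of GPUs and therefore the bill. Real systems are less tidy, but the direction of the effect is the same.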

An illustrative example

The source we rely on offers a worked example worth seeing. We stress that this is an illustrative example, not a result we’ve achieved. In this example, the baseline monthly GPU cost is about thirty thousand units. Then several optimisations stack on top of one another:

  • Quantization: compressing the model brings the biggest saving.
  • Response caching: a fraction of requests are answered from a response cache, not by re-running the model.
  • Model routing: a large share of traffic is handed to a smaller model, and only the hard work reaches the large one.
  • Reserved capacity: a long-term commitment instead of on-demand payment brings a discount.

In this example, the sum of these optimisations brings the cost from thirty thousand down to about seven thousand units — roughly a seventy-seven percent saving. The important point is that no single trick reaches this number alone; it’s the result of several optimisations stacking together.
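
The stacking can be sketched in a few lines. The source gives only the start and end figures, so the per-step fractions below are hypothetical, picked solely so that the total lands near the example's result. The sketch also assumes that each saving applies to whatever cost remains after the previous step.

```python
# Baseline from the illustrative example; everything else is assumed.
BASELINE = 30_000.0

# Hypothetical fractional savings per step (not from the source).
STEPS = [
    ("quantization", 0.40),
    ("response caching", 0.15),
    ("model routing", 0.35),
    ("reserved capacity", 0.30),
]

def stack_savings(baseline, steps):
    """Apply each fractional saving to the cost left after the previous step."""
    cost = baseline
    trail = []
    for name, fraction in steps:
        cost *= 1.0 - fraction
        trail.append((name, cost))
    return cost, trail

final_cost, trail = stack_savings(BASELINE, STEPS)
total_saving = 1.0 - final_cost / BASELINE

for name, cost in trail:
    print(f"after {name:<18} {cost:>8,.0f}")
print(f"total saving: {total_saving:.0%}")  # total saving: 77%
```

With these assumed fractions the cost falls to about 6,960 units. Note that the fractions would add up to 120 percent if summed naively, which is impossible. They compound multiplicatively, which is why each additional step saves less in absolute terms than its percentage suggests.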

The lesson of this example

The message of this example isn’t that you’ll reach exactly the same number; it’s that serving cost is largely in your own hands. When you know where the money goes — to the GPU — and know which tricks raise its utilisation, you can shrink the bill meaningfully. This is a combined engineering effort, not a magic button.

Build or buy

This picture also clarifies “build or buy.” When cost is mostly tied to GPU utilisation, and when control of that utilisation belongs to whoever runs the infrastructure, then building in-house capability can make sense over the long term. This holds especially at scale, where even a few percentage points of saving turn into large sums.

Putting it together

The cost of serving a language model isn’t a fixed fate; it’s the result of decisions. Most of the money goes to the GPU, and any optimisation that raises its utilisation brings direct savings. The key is to know where the money goes and which tricks compound — and then, rather than accepting the bill, to design it.