Evaluate models on your own set, not the public leaderboard

Public leaderboards tell you less about your work than you think. The reliable method: build a small eval set that represents your real task, and score the candidates on that.

Whenever a new model is released, our first move is usually to check where it lands on the public leaderboards. Yet those rankings say little about how the model will handle our own tasks.

The problem with the public leaderboard

A single number on a leaderboard compresses dozens of distinct capabilities, which means the very capability that’s critical for your project gets hidden in the average. The more serious problem is data contamination: because public benchmarks are published, quoted, and re-posted so widely, parts of them gradually leak into models’ training data, so a high score may reflect memorized test items rather than real ability. On top of that, your task is rarely exactly what the public benchmark measures.

Build your own small eval set

The reliable method is simple: gather a few dozen real examples of the very task you intend to give the model, along with their correct answers. The set doesn’t need to be large; it just needs to represent your real work and cover the hard and edge cases. Then test the candidate models on that set. It takes only a few hours, yet it gives you a far more accurate picture of how each model performs on the real job than any public leaderboard can.
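
As a rough illustration of how little machinery this needs, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the `EVAL_SET` items, the `accuracy` helper, and the two stand-in "models", which are plain functions returning canned answers so that the block runs without any API. In practice each stand-in would wrap a call to a real candidate, and exact-match scoring would likely give way to whatever comparison suits the task.

```python
from typing import Callable

# Hypothetical eval set: each item pairs a real input with its human-checked answer.
EVAL_SET = [
    {"input": "invoice_001", "expected": "1250000"},
    {"input": "invoice_002", "expected": "980000"},
    {"input": "payslip_003", "expected": "43500000"},
]


def accuracy(model: Callable[[str], str], eval_set: list[dict[str, str]]) -> float:
    """Return the fraction of items the model answers exactly right."""
    if not eval_set:
        raise ValueError("eval set is empty")
    correct = sum(
        1 for item in eval_set if model(item["input"]).strip() == item["expected"]
    )
    return correct / len(eval_set)


# Stand-ins for real candidates; replace each with a call to the actual model.
def model_a(doc_id: str) -> str:
    answers = {
        "invoice_001": "1250000",
        "invoice_002": "980000",
        "payslip_003": "43500000",
    }
    return answers.get(doc_id, "")


def model_b(doc_id: str) -> str:
    # One wrong value and one missing answer, to make the scores differ.
    answers = {"invoice_001": "1250000", "invoice_002": "890000"}
    return answers.get(doc_id, "")


CANDIDATES = {"model_a": model_a, "model_b": model_b}
scores = {name: accuracy(fn, EVAL_SET) for name, fn in CANDIDATES.items()}

# Print candidates from best to worst on this eval set.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.0%}")
```

With only a few dozen items, a single example moves the score by several points, so it is worth reading the individual failures as well as the totals.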

Tie the bar to the cost of failure

Before you evaluate, decide what level you need for each capability, and set that level by the cost of failure, not by the highest score possible. If an error in a capability is only mildly annoying, the level you need is low; if its failure destroys money or the trust of your users, the level is critical. Then set a clear rule: a model passes only if it meets the required threshold on every necessary capability. Finally, among the models that clear that bar, choose the smallest, since it is usually the cheapest and fastest to run.
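
The pass rule above can be sketched as a small gate function. This is an illustration under stated assumptions, not a prescribed implementation: the capability names, the thresholds in `REQUIRED`, the candidate names, their sizes, and their scores are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    size_b: float             # hypothetical parameter count, in billions
    scores: dict[str, float]  # per-capability score on your own eval set


# Hypothetical thresholds: the costlier a failure, the higher the bar.
REQUIRED = {
    "amount_extraction": 0.95,  # a wrong amount costs money
    "no_fabrication": 1.00,     # invented values are disqualifying
    "date_parsing": 0.80,       # errors here are only mildly annoying
}


def passes(candidate: Candidate, required: dict[str, float]) -> bool:
    """A model passes only if it meets the bar on every required capability."""
    return all(candidate.scores.get(cap, 0.0) >= bar for cap, bar in required.items())


def choose(candidates: list[Candidate], required: dict[str, float]) -> Optional[Candidate]:
    """Among the models that clear every bar, return the smallest (or None)."""
    passing = [c for c in candidates if passes(c, required)]
    return min(passing, key=lambda c: c.size_b) if passing else None


# Invented candidates and scores, for illustration only.
CANDIDATES = [
    Candidate("small", 8, {"amount_extraction": 0.90, "no_fabrication": 1.0, "date_parsing": 0.90}),
    Candidate("mid", 30, {"amount_extraction": 0.96, "no_fabrication": 1.0, "date_parsing": 0.85}),
    Candidate("large", 200, {"amount_extraction": 0.99, "no_fabrication": 1.0, "date_parsing": 0.95}),
]

winner = choose(CANDIDATES, REQUIRED)
print(winner.name if winner else "no candidate clears the bar")
```

A missing capability score is treated as 0.0 here, so a model that was never tested on a required capability cannot pass by omission.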

When the measure changes, the winner changes

This is the crux: when you measure exactly what matters for your project instead of relying on a single overall score, the ordering of models often shifts. A model that looks weaker by public evaluation may turn out to perform best on the specific capability you need, while a famous, much-discussed model may fail on your particular job. Public leaderboards show only what models do in general, whereas a dedicated eval set tells you which model fits your need. For a sound decision, you need only the second.
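
A toy calculation makes the shift concrete. The model names, capability names, and scores below are invented purely for illustration and do not come from any real benchmark; the point is only that ranking by an average and ranking by one capability can disagree.

```python
from statistics import mean

# Invented per-capability scores for two hypothetical models.
SCORES = {
    "model_x": {"reasoning": 0.92, "coding": 0.90, "table_extraction": 0.70},
    "model_y": {"reasoning": 0.80, "coding": 0.78, "table_extraction": 0.91},
}


def rank_by_average(scores: dict[str, dict[str, float]]) -> list[str]:
    """Order models by their mean score across all capabilities, best first."""
    return sorted(scores, key=lambda m: mean(scores[m].values()), reverse=True)


def rank_by_capability(scores: dict[str, dict[str, float]], capability: str) -> list[str]:
    """Order models by a single capability, best first."""
    return sorted(scores, key=lambda m: scores[m][capability], reverse=True)


print("by average:         ", rank_by_average(SCORES))
print("by table_extraction:", rank_by_capability(SCORES, "table_extraction"))
```

Here `model_x` leads on the average while `model_y` leads on the one capability the hypothetical project depends on.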

What this looked like in practice

We once had to choose a vision model to read Persian financial documents (birth certificates, bank statements, pay slips, and the like). So we built exactly the set described above: about 27 real documents, each paired with a human-checked answer, covering the easy formats and the awkward ones. We then scored the candidates on it.

The result rearranged our expectations. The strongest candidate was not the largest: a mid-sized model was the most consistent and never invented data. A larger previous-generation model and a flagship several times the mid-sized model’s size each lost on overall reliability. That flagship was nonetheless the most precise on numeric amounts, which is exactly the kind of split that tells you to tie the bar to the capability that matters. The decisive signal was one only our own set could surface: one model returned plausible but fabricated values that looked fine, were wrong, and would be invisible on any public score. For a financial task that single failure mode was disqualifying, and we saw it only because we had our own answers to check against.
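
One way to make that failure mode visible is to separate "the model gave no answer" from "the model gave a wrong value" when scoring. The sketch below is an assumption about how such a check could look, not a description of the pipeline used here; `classify`, `outcome_counts`, and the sample pairs are hypothetical. A wrong value may be a misread rather than a fabrication, so this only flags the cases a human should inspect.

```python
from collections import Counter
from typing import Iterable


def classify(predicted: str, expected: str) -> str:
    """Label one prediction as correct, abstained (empty), or wrong_value."""
    predicted = predicted.strip()
    if predicted == expected:
        return "correct"
    if not predicted:
        return "abstained"
    return "wrong_value"  # needs human review: misread or fabricated?


def outcome_counts(pairs: Iterable[tuple[str, str]]) -> Counter:
    """Count outcomes over (predicted, expected) pairs."""
    return Counter(classify(predicted, expected) for predicted, expected in pairs)


# Invented (predicted, expected) pairs for illustration.
RESULTS = [
    ("1250000", "1250000"),   # correct
    ("", "980000"),           # model declined to answer
    ("7310000", "43500000"),  # confident but wrong
]

counts = outcome_counts(RESULTS)
print(dict(counts))
```

Two models with the same accuracy can differ sharply in this breakdown, and for a financial task the model that abstains is far safer than the one that guesses.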

One honest caveat, because honesty about evidence is part of the method: this was a small, curated set of about 27 documents, scored by meaning rather than exact string match, run in early 2026 on an earlier version of our pipeline that we have since replaced. It is not a production guarantee or a benchmark result — it is an illustration of what evaluating on your own data buys you.
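
To give a sense of what scoring by meaning rather than exact string match can involve, here is one narrow, hypothetical example: comparing numeric amounts after normalizing Persian and Arabic-Indic digits and stripping thousands separators. It is a sketch under that assumption only and is not the scoring method used in the evaluation described above; the helper names `normalize_amount` and `same_amount` are invented for the example.

```python
# Persian digits are U+06F0..U+06F9; Arabic-Indic digits are U+0660..U+0669.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

# Characters to drop: comma, space, Arabic thousands separator (U+066C),
# and zero-width non-joiner (U+200C).
_SEPARATORS = ", \u066c\u200c"

# Map both digit sets to ASCII digits and delete the separators in one pass.
_TABLE = str.maketrans(PERSIAN_DIGITS + ARABIC_DIGITS, "0123456789" * 2, _SEPARATORS)


def normalize_amount(text: str) -> str:
    """Return the amount with ASCII digits and no separators."""
    return text.translate(_TABLE)


def same_amount(predicted: str, expected: str) -> bool:
    """Treat two amounts as equal if they match after normalization."""
    return normalize_amount(predicted) == normalize_amount(expected)


# The same amount written with Persian digits and separators, and in ASCII.
persian = "\u06f1\u066c\u06f2\u06f5\u06f0\u066c\u06f0\u06f0\u06f0"
print(same_amount(persian, "1,250,000"))
```

An exact string match would mark that pair as a failure even though the model read the amount correctly, which is why the comparison rule deserves as much care as the examples themselves.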