Stop ranking LLMs, start profiling them

A single number on a leaderboard won't tell you which model fits your job. A multi-dimensional profile of capabilities will.

Every time a new language model ships, the first question is: “where is it on the leaderboard?” That question is natural but misleading. A single rank compresses dozens of distinct capabilities into one number — and in that compression, exactly the thing that matters for your decision gets lost.

Models are like people: someone who is strong at math is not necessarily a good writer. Asking “who is the smartest?” is almost meaningless; the right question is “for this specific job, who is the better fit?” Instead of ranking, we should profile.

What ranking hides

A model might be excellent at mathematical reasoning yet slip when generating structured output. Another might be superb at following precise instructions yet forget details in the middle of very long documents. If you only look at the overall rank, these differences stay invisible — until the day they surface in production as an expensive error.

The problem with the leaderboard isn’t that it is wrong; it’s that it doesn’t ask your question. Your question is always specific: how well does this model do my job?

Instead of a rank, a capability profile

Profiling means scoring each model across several independent dimensions, not a single axis. The dimensions that matter in real practice are usually these:

Instruction following: when you set an explicit constraint, how precisely does it hold to it?
Structured-output stability: does it produce valid, repeatable JSON or whatever format you requested?
Long-context recall: does it find a detail buried in the middle of a long input?
Reasoning depth: on multi-step problems, does it keep the chain of logic to the end?
Refusal behavior: where does it correctly say “I don’t know,” and where does it answer wrong with false confidence?
Latency and cost: does the quality it delivers justify what it costs in time and money?
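Several of these dimensions can be checked mechanically. As a rough sketch, assuming each example is a dict carrying hypothetical fields such as `max_words`, `expected_fact` and `answerable` (illustrative names, not taken from any library), probes for three of the dimensions might look like this:

```python
# Hypothetical probes: each takes (example, output) and returns a score in [0, 1].
# The field names and the refusal phrases below are assumptions for illustration.

REFUSAL_MARKERS = ("i don't know", "i do not know", "not enough information")


def follows_constraints(example, output):
    """1.0 if the answer stays within the word limit set in the prompt."""
    return 1.0 if len(output.split()) <= example["max_words"] else 0.0


def finds_buried_fact(example, output):
    """1.0 if the fact planted in the long input shows up in the answer."""
    return 1.0 if example["expected_fact"].lower() in output.lower() else 0.0


def refuses_appropriately(example, output):
    """1.0 if the model declines exactly when the question is unanswerable."""
    refused = any(marker in output.lower() for marker in REFUSAL_MARKERS)
    return 1.0 if refused != example["answerable"] else 0.0
```

Plain string matching like this is crude, and a real probe would probably need to be stricter. The point is only that each dimension gets its own small, inspectable check.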

The key point is that none of these is “general”; you have to measure each one on your own work. Build a small evaluation set of real (or near-real) examples and score every model on those same examples, dimension by dimension. Even 20 to 30 cases are enough to reveal large differences between models, though not small ones.
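To make the evaluation set concrete, here is a minimal sketch. It assumes each case is a plain dict in which only `prompt` is required; the other fields, and the prompts themselves, are hypothetical placeholders. The `standard_error` helper is an addition for gauging how noisy a mean score is at small sample sizes.

```python
import math

# A hypothetical evaluation set. Only "prompt" is required; the extra fields
# are assumptions that individual probes may choose to read.
dataset = [
    {"prompt": "Summarize the ticket below in at most 50 words: ...",
     "max_words": 50},
    {"prompt": "Return the customer record as JSON: ...",
     "required_keys": ["id", "email"]},
]


def standard_error(scores):
    """Standard error of the mean score, a rough gauge of sampling noise."""
    n = len(scores)
    if n < 2:
        return float("nan")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(variance / n)
```

With 25 pass/fail cases and a pass rate near one half, the standard error is roughly 0.1, so a gap of a few points between two models at that sample size is probably noise.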

A simple sketch for profiling

The idea fits in a few lines of code. Turn each dimension into a probe: a function that takes a model’s output and returns a score between zero and one. Then run those same probes across every candidate model and build a profile.

# Profile, not rank: score each model across independent dimensions.
# `probes` maps a dimension name to a function that evaluates a model's
# output on one example and returns a score in [0, 1].

import json


def profile_model(run, dataset, probes):
    """Return {dimension: mean score} for one model over the whole dataset."""
    scores = {trait: [] for trait in probes}
    for example in dataset:
        output = run(example["prompt"])          # one call to the model
        for trait, probe in probes.items():
            scores[trait].append(probe(example, output))
    # mean of each dimension -> that model's profile (nan if the dataset is empty)
    return {trait: sum(v) / len(v) if v else float("nan")
            for trait, v in scores.items()}


def valid_json(example, output):
    """Score 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except (ValueError, TypeError):      # malformed JSON, or output is not text
        return 0.0


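# `follows_constraints` and `finds_buried_fact` stand for probes you write
# yourself, with the same (example, output) -> score signature as `valid_json`.
probes = {
    "structured_output": valid_json,
    "instruction_following": follows_constraints,   # your own probes
    "long_context_recall": finds_buried_fact,
    # ... whatever dimension matters for your job
}

# `candidate_models` maps a model name to a function that takes a prompt and
# returns that model's output; `dataset` is your list of evaluation examples.
profiles = {name: profile_model(run, dataset, probes)
            for name, run in candidate_models.items()}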

The output is no longer a single number; it’s a table that shows, for each model, its strengths and weaknesses side by side. Now the choice is an informed judgment: you pick the model whose profile matches the shape of your work — not the one that ranks higher on some arbitrary axis.
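One way to read such a table against “the shape of your work” is to state the minimum score the job needs on each dimension and see where each model falls short. The sketch below assumes profiles shaped as {model name: {dimension: score}}; the model names, scores and thresholds are made-up placeholders, not measurements.

```python
# Placeholder profiles and requirements, invented purely for illustration.
profiles = {
    "model_a": {"structured_output": 0.95, "instruction_following": 0.80,
                "long_context_recall": 0.60},
    "model_b": {"structured_output": 0.70, "instruction_following": 0.90,
                "long_context_recall": 0.85},
}
requirements = {"structured_output": 0.90, "instruction_following": 0.75}


def format_table(profiles):
    """Render profiles as a plain-text table, one row per model."""
    traits = sorted({t for p in profiles.values() for t in p})
    header = "model".ljust(10) + "".join(t.rjust(24) for t in traits)
    rows = []
    for name, profile in profiles.items():
        cells = "".join(f"{profile.get(t, float('nan')):>24.2f}" for t in traits)
        rows.append(name.ljust(10) + cells)
    return "\n".join([header] + rows)


def shortfalls(profile, requirements):
    """Dimensions where a model scores below the minimum the job needs."""
    return {t: profile.get(t, 0.0) for t, minimum in requirements.items()
            if profile.get(t, 0.0) < minimum}


print(format_table(profiles))
for name, profile in profiles.items():
    print(name, "falls short on:", shortfalls(profile, requirements) or "nothing")
```

Thresholds keep the dimensions separate instead of folding them back into one score, which is the reason for profiling in the first place.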

Why this makes better decisions

Profiling has three clear benefits. First, your choice is tied to your job, not to a generic benchmark that may have nothing to do with your need. Second, when a new model ships, you don’t have to rely on rumor; you run the same probes and within minutes know whether it is better for your job or not. Third, profiling surfaces weaknesses before production, while fixing them is still cheap.
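The second benefit can be sketched as a per-dimension comparison between the model you use today and a newly released one. The `tolerance` noise band and the example scores are assumptions chosen for illustration, not measured values.

```python
def compare(incumbent, candidate, tolerance=0.05):
    """Per-dimension change from incumbent to candidate, with a coarse label.

    `tolerance` is an assumed noise band; changes inside it count as "same".
    """
    report = {}
    for trait in incumbent:
        delta = candidate.get(trait, 0.0) - incumbent[trait]
        if delta > tolerance:
            label = "better"
        elif delta < -tolerance:
            label = "worse"
        else:
            label = "same"
        report[trait] = (round(delta, 3), label)
    return report


# Placeholder profiles, invented for illustration.
incumbent = {"structured_output": 0.95, "long_context_recall": 0.60}
candidate = {"structured_output": 0.80, "long_context_recall": 0.85}

for trait, (delta, label) in compare(incumbent, candidate).items():
    print(f"{trait}: {delta:+.2f} ({label})")
```

A new model that improves recall but regresses on structured output, as in these placeholder numbers, might sit higher on a leaderboard and still be the wrong choice for a pipeline that depends on valid JSON.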

The leaderboard is good for a headline. An engineering decision needs a profile. Don’t rank your models; profile them.