Extracting information from a document always looks easy in a demo. You hand the model a clean invoice, and it pulls out the amount, the date, and the invoice number correctly. But real documents are not like that. They arrive as skewed scans, with broken OCR, layouts that differ from one organization to the next, tables that have run together, and sometimes handwriting. The difference between a demo system and a reliable one shows up exactly in that “messy tail.”
This is a general problem; its shape differs across domains — invoices and financial forms, insurance documents, administrative records, reports — but the core is the same: how do we make the model stay stable even on messy input? The answer is not a bigger model. It’s a better system.
Why reliability is not a property of the model
The common temptation is to think the problem is solved by a stronger model. But reliability isn’t something that lives only in the model’s weights; it is a property of the whole system — and that means it can be designed and measured.
A model might do fine on a clean document and then misread a single figure on a bad scan of that same document. The most dangerous case isn’t that it makes a mistake; it’s that it makes the mistake with full confidence and you don’t notice. In a reliable system, the model is allowed to say “I’m not sure,” and the system knows what to do with that uncertainty.
Break the problem down, then measure
The first step is to break “document extraction” into specific fields and define, for each one, what “correct” means. A total amount is an exact figure; a date has a defined format; a party’s name must match the text of the document. Once you have defined correctness field by field, you can finally measure it.
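To make "correct" concrete, here is a minimal sketch of per-field checks for the three examples above. The function names, the ISO date format, and the choice to ignore thousands separators and letter case are our illustrative assumptions, not a fixed specification; your own definitions may well differ.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def amount_correct(predicted: str, expected: str) -> bool:
    """A total amount is an exact figure: compare as decimals, not as floats."""
    try:
        # Assumption: "," is only ever a thousands separator in our data.
        return Decimal(predicted.replace(",", "")) == Decimal(expected.replace(",", ""))
    except InvalidOperation:
        return False  # unparseable output counts as wrong


def date_correct(predicted: str, expected: date, fmt: str = "%Y-%m-%d") -> bool:
    """A date must parse in the agreed format and name the same day."""
    try:
        return datetime.strptime(predicted, fmt).date() == expected
    except ValueError:
        return False


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore case."""
    return " ".join(text.split()).casefold()


def name_correct(predicted: str, document_text: str) -> bool:
    """A party's name must actually appear in the text of the document."""
    return bool(predicted.strip()) and _normalize(predicted) in _normalize(document_text)
```

Keeping one small function per field makes each definition easy to argue about and to change on its own.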
And measuring requires a real evaluation set: examples of the same kind of messy documents you actually face, each paired with the correct answer for every field. That set is your ground truth. You measure every change to the system against it: did this change make extraction better, or did it just look better on a few lucky examples?
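As a rough illustration, measuring against a ground-truth set can be as simple as counting, field by field, how often the system's output is correct. The sketch below assumes each document is represented as a plain dictionary of field name to string value; `field_accuracy` and its default exact-match comparison are hypothetical names and choices for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Fields = Dict[str, str]


def exact_match(field: str, predicted: str, expected: str) -> bool:
    """Default definition of "correct"; swap in per-field checks as needed."""
    return predicted == expected


def field_accuracy(
    predictions: List[Fields],
    ground_truth: List[Fields],
    is_correct: Callable[[str, str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Return the share of documents on which each field was extracted correctly."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be aligned one-to-one")
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for predicted, expected in zip(predictions, ground_truth):
        for field, gold_value in expected.items():
            totals[field] += 1
            # A field the system failed to return counts as wrong, not as skipped.
            if field in predicted and is_correct(field, predicted[field], gold_value):
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}
```

Running the same function on the output of the system before and after a change, against the same set, gives a per-field comparison instead of an impression formed from a few examples.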
The numbers we report are evaluation results on our own internal test set, not the outcome of a customer project. That distinction is fundamental to us: a number only means something when you know which set it was measured on and with what definition of “correct.”
Turn uncertainty into a feature
The most important design decision is this: what happens when the model is not sure? In a brittle system, uncertainty is hidden and the wrong output quietly moves on. In a reliable system, uncertainty is a signal: the uncertain field is flagged and set aside for human review.
This is what honest AI looks like. The system doesn’t need to be right about everything; it needs to know where it is sure and where it isn’t — and to surface that boundary instead of hiding it. Combining “automatic extraction on the confident cases” with “human review on the borderline cases” produces something that is both fast and dependable.
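One way to sketch that split in code is a simple router: fields above a confidence threshold pass through automatically, and everything else goes to a review queue. This assumes each extracted field comes with a confidence score between 0 and 1. How that score is produced is out of scope here, and the names `ExtractedField` and `route` and the 0.9 threshold are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    fields: List[ExtractedField], threshold: float = 0.9
) -> Tuple[Dict[str, str], List[ExtractedField]]:
    """Split fields into auto-accepted values and a human review queue."""
    accepted: Dict[str, str] = {}
    needs_review: List[ExtractedField] = []
    for field in fields:
        if field.confidence >= threshold:
            accepted[field.name] = field.value
        else:
            # The uncertain value is kept so the reviewer sees the model's guess.
            needs_review.append(field)
    return accepted, needs_review
```

The threshold is something you would tune against the evaluation set: set it too low and confident mistakes slip through, set it too high and the reviewers see almost everything.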
The transferable idea
The core of this goes well beyond documents. Anywhere AI has to work on messy real-world input, the same principle holds: you don’t buy reliability, you build it and measure it. Define correctness precisely, measure it on real examples, and surface uncertainty instead of burying it.
On a clean document, any model is a hero. Trust is earned in the messy tail — and there, you win not with a bigger model, but with a better-measured system.