The Benchmark
Enterprise document extraction accuracy varies dramatically by approach. Dedicated OCR/IDP solutions such as ABBYY achieve 99.5% field-level accuracy on structured invoices and 97%+ on semi-structured formats. LLM-based approaches lag behind: Claude 3.5 Sonnet achieved 90% field-level accuracy, Gemini 2.5 Pro reached 96.5% on clean invoices (92.7% on scanned), and GPT-4o hit 91% when combined with OCR preprocessing. GPT-5.2 (2026) improved to 96% on invoices but drops to 87% on documents with handwritten annotations. The gap matters: at 10,000 invoices per month, the difference between 99.5% and 96% accuracy is 350 additional errors requiring human review.
Accuracy Comparison

Dedicated IDP:
  ABBYY structured:        99.5%
  ABBYY semi-structured:   97.0%
  Overall IDP average:     98.7%

LLM-based:
  GPT-5.2 (2026):          96.0%
  Gemini 2.5 Pro:          96.5% clean / 92.7% scanned
  Claude 3.5 Sonnet:       90.0%
  GPT-4o + OCR:            91.0%

At 10,000 invoices/month:
  99.5% accuracy = 50 errors
  96.0% accuracy = 400 errors

Source: Onezipp, ChatFin, 2026
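The error arithmetic above generalizes to any volume and accuracy pair. A minimal sketch (the function name `expected_errors` is ours, not from the cited sources):

```python
def expected_errors(volume: int, accuracy: float) -> int:
    """Expected field-level errors requiring human review,
    given monthly document volume and extraction accuracy."""
    return round(volume * (1 - accuracy))

# Figures from the comparison above: 10,000 invoices/month
idp_errors = expected_errors(10_000, 0.995)   # dedicated IDP -> 50
llm_errors = expected_errors(10_000, 0.960)   # GPT-5.2      -> 400

extra_reviews = llm_errors - idp_errors       # 350 additional manual reviews
```

Plugging in the 90% Claude 3.5 Sonnet figure instead gives 1,000 errors per month, which is why field-level accuracy differences of a few points dominate total review cost at volume.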
Key insight: For structured extraction (invoices, forms), dedicated IDP still wins. LLMs shine on unstructured understanding (contracts, emails, reports). The best production systems combine the two: IDP for extraction, LLM for comprehension.
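One way to realize that hybrid pattern is a type-based router that sends structured documents to the IDP extractor and unstructured ones to an LLM. The names here (`DocType`, `idp_extract`, `llm_comprehend`) are illustrative placeholders, not a real vendor API:

```python
from enum import Enum, auto
from typing import Callable

class DocType(Enum):
    INVOICE = auto()
    FORM = auto()
    CONTRACT = auto()
    EMAIL = auto()
    REPORT = auto()

# Structured formats go to dedicated IDP; unstructured ones to an LLM.
STRUCTURED = {DocType.INVOICE, DocType.FORM}

def route(doc_type: DocType,
          idp_extract: Callable[[bytes], dict],
          llm_comprehend: Callable[[bytes], dict],
          payload: bytes) -> dict:
    """Dispatch a document to the engine best suited to it."""
    if doc_type in STRUCTURED:
        return idp_extract(payload)    # field-level extraction (e.g. ABBYY)
    return llm_comprehend(payload)     # free-form comprehension (LLM)
```

In practice the two paths can also be chained: IDP pulls the fields, then the LLM summarizes or validates them, but the routing decision itself stays this simple.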