Why LLMs still have problems with OCR
Hype vs. reality of VLM-based OCR
- The thread centers on whether multimodal LLMs/VLMs (e.g., GPT‑4o, Gemini 2.0, Claude) “solve” OCR or are fundamentally unreliable compared with traditional OCR pipelines.
- Several people report impressive, near‑frictionless results on simple, single‑page tasks (screenshots, basic PDFs, grocery lists, product labels, small tables).
- Others working in large‑scale or high‑stakes document extraction say these success stories don’t generalize to complex layouts, large volumes, or mission‑critical domains (finance, medical, regulatory).
Hallucinations, determinism, and reliability
- A recurring complaint: VLMs “helpfully” infer missing or unclear text (e.g., truncated grocery items, incomplete recipes, financial rows), which is useful for casual use but unacceptable for production OCR.
- Unlike classic OCR, which typically surfaces confidence scores and obvious failure modes, VLMs produce fluent text that hides errors and is hard to systematically validate.
- Some suggest post‑verification passes (“is the image text identical to this text?”) or separate verifier models, but current models reportedly still hallucinate or ignore such instructions at scale.
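One cheap verification idea implied above is cross‑checking two independent transcriptions (two model passes, or a model pass plus a classic OCR engine) and flagging any spans where they differ for human review. A minimal sketch, with the two passes supplied as plain strings so it stays self‑contained; in a real pipeline they would come from separate engine/model calls:

```python
import difflib

def flag_disagreements(pass_a: str, pass_b: str):
    """Return (span_a, span_b) pairs where two independent
    transcriptions of the same image diverge, for routing to review."""
    matcher = difflib.SequenceMatcher(a=pass_a, b=pass_b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / insert / delete all count
            disagreements.append((pass_a[i1:i2], pass_b[j1:j2]))
    return disagreements

# Example: a model "helpfully" completing a truncated grocery item.
ocr_pass = "2x MILK SEMI-SK 1.18\nBAN"
vlm_pass = "2x MILK SEMI-SK 1.18\nBANANAS"
print(flag_disagreements(ocr_pass, vlm_pass))  # [('', 'ANAS')]
```

This catches where the outputs differ, but of course cannot say which side is right; that is exactly why commenters argue disagreements still need a trusted engine or a human in the loop.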
Architecture and training debates
- The article’s critique of ViT/CLIP‑style vision stacks (patch size, positional embeddings, “semantic over precise”) is challenged as technically inaccurate or overstated.
- Counter‑arguments: vision encoders can and do capture fine‑grained text, and derived models (e.g., CLIP variants, OWLv2, Florence‑style detectors) can output bounding boxes and confidence scores.
- Broad agreement that current VLMs are weaker on vision than on text, largely due to training data and benchmarks rather than inherent architectural limits.
- Some argue better training (synthetic complex documents, RL with strict verifiers, 2D attention) could close the gap; others think hallucination is structurally hard to constrain.
Use cases, scale, and layout complexity
- Commenters who distinguish “OCR as a step” from “vision‑based RAG / semantic querying” note that VLMs can excel at high‑level understanding even when raw transcription isn’t perfect.
- Where high character‑level fidelity is required (financial tables, historical archives, long multi‑page documents, nested/50‑page tables), practitioners report persistent digit drops, misaligned columns, and layout confusion.
- Reported acceptable error rates differ: some are happy with “99%+” for business use; OCR veterans call that “terrible” compared to traditional pipelines plus human review.
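The “99%+” dispute is easier to see in concrete numbers. Character error rate (CER) is the standard metric: Levenshtein edit distance from the reference text, normalised by reference length. A self‑contained implementation, with an illustrative page length chosen purely for the arithmetic:

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings divided by the
    reference length: the standard character error rate (CER)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))          # edit distances for empty prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

# "99% accuracy" on a ~3,000-character page still leaves ~30 wrong characters:
page = "x" * 3000
corrupted = page[:-30] + "y" * 30
print(char_error_rate(page, corrupted))  # 0.01
```

At a 1% CER, a 50‑page financial document accumulates hundreds of character errors, which is why practitioners who bill “99%+” as a success and those who call it “terrible” are both describing the same number.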
Tools, hybrids, and future directions
- Multiple traditional or hybrid systems are mentioned: Tesseract + LLM cleanup, PaddleOCR, Surya, Mathpix, Florence‑2, Moondream, and bespoke pipelines focused on bounding boxes and layout.
- Some believe pure OCR will fade as VLMs subsume it; others argue that specialized OCR/layout engines plus LLMs for higher‑level tasks will coexist for the foreseeable future.
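The “Tesseract + LLM cleanup” hybrid mentioned above usually amounts to confidence‑gated routing: keep the engine’s high‑confidence tokens verbatim and send only the dubious ones to an LLM. A minimal sketch under assumed structures; the `Word` record and the 0.85 threshold are illustrative stand‑ins for what an engine like Tesseract or PaddleOCR reports per token:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    conf: float  # engine confidence in [0, 1] (illustrative; scales vary by engine)

def route_for_cleanup(words: list[Word], threshold: float = 0.85):
    """Split OCR output into tokens kept verbatim and tokens routed
    to an LLM (ideally with the corresponding image crop) for repair."""
    keep = [w.text for w in words if w.conf >= threshold]
    review = [w.text for w in words if w.conf < threshold]
    return keep, review

line = [Word("Total:", 0.98), Word("$1,2S4.00", 0.41), Word("USD", 0.97)]
keep, review = route_for_cleanup(line)
print(keep)    # ['Total:', 'USD']
print(review)  # ['$1,2S4.00']
```

The design appeal is that the deterministic engine stays authoritative for everything it is sure about, and the LLM only ever touches spans that were flagged anyway, which bounds where hallucinations can leak in.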