Replace OCR with Vision Language Models
Capabilities and Use Cases
- VLM-based OCR is praised for handling “semantic” tasks: understanding context, inferring units, reading charts with unlabeled axes or ambiguous legends, and coping with historical censuses and messily filled-in forms.
- People report good results on simple-to-medium-complexity forms, flowchart-to-schema extraction, financial data, and specific tasks like finding Apple serial numbers in poorly shot photos of the box.
- VLMs can directly produce structured outputs (JSON now, trivially convertible to YAML), and some users want more ambitious outputs (e.g., LaTeX reconstruction of whole books).
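The JSON-to-YAML conversion mentioned above really is trivial; as a minimal sketch (stdlib only, handling scalars, dicts, and lists, with a hypothetical invoice-style payload):

```python
import json

def to_yaml(value, indent=0):
    """Serialize a parsed JSON value as simple YAML (scalars, dicts, lists)."""
    pad = "  " * indent
    if isinstance(value, dict):
        lines = []
        for key, val in value.items():
            if isinstance(val, (dict, list)) and val:
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml(val, indent + 1))
            else:
                # json.dumps gives valid YAML scalars (quoted strings, numbers, null)
                lines.append(f"{pad}{key}: {json.dumps(val)}")
        return "\n".join(lines)
    if isinstance(value, list):
        return "\n".join(f"{pad}- {json.dumps(item)}" for item in value)
    return f"{pad}{json.dumps(value)}"

# Hypothetical structured output from a VLM extraction run
raw = '{"vendor": "ACME", "total": 41.5, "line_items": ["widget", "gadget"]}'
print(to_yaml(json.loads(raw)))
```

In practice a library like PyYAML would handle edge cases (multiline strings, anchors), but this illustrates why the output format is a thin layer over the structured extraction itself.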
Schemas and Structured Extraction
- The project’s main “value-add” is described as schema-driven, typed extraction that coaxes models into strict, structured formats.
- Type constraints and optional fields are used to reduce hallucinations and enforce well-formed JSON; some argue they still do not solve “making things up” when content is unreadable.
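The schema-driven, typed extraction described above can be sketched with stdlib dataclasses; the schema (`CensusRecord` and its fields) is hypothetical, and the key design choice is that optional fields let the model return null for unreadable content instead of guessing:

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class CensusRecord:
    # Required field: parsing fails loudly if it is absent.
    surname: str
    # Optional fields: a null here is preferable to a hallucinated value.
    birth_year: Optional[int] = None
    occupation: Optional[str] = None

def parse_record(raw_json: str) -> CensusRecord:
    """Validate model output against the schema; reject extra keys and wrong types."""
    data = json.loads(raw_json)
    allowed = {f.name for f in fields(CensusRecord)}
    extra = set(data) - allowed
    if extra:
        raise ValueError(f"unexpected keys: {extra}")
    record = CensusRecord(**data)
    if not isinstance(record.surname, str):
        raise TypeError("surname must be a string")
    if record.birth_year is not None and not isinstance(record.birth_year, int):
        raise TypeError("birth_year must be an integer or null")
    return record

# A model that cannot read a smudged year returns null instead of inventing one.
rec = parse_record('{"surname": "Okafor", "birth_year": null}')
print(rec)  # CensusRecord(surname='Okafor', birth_year=None, occupation=None)
```

Note the caveat from the discussion: this enforces well-formed output, but nothing stops the model from filling a well-typed field with a confidently wrong value.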
Bounding Boxes, Layout, and Tables
- Traditional OCR is still seen as better for precise bounding boxes, dense text, and multi-column layouts.
- VLM “visual grounding” is claimed to provide bounding boxes and experimental table detection, but even supporters acknowledge this remains weaker than classic methods.
- A separate open benchmark suggests VLMs outperform OCR on handwriting and charts/infographics, while OCR wins on dense standardized text and precise box coordinates.
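One way the benchmark-style comparison of box coordinates is typically scored is intersection-over-union between a reference box (e.g. from a classic OCR engine) and a VLM-grounded box; a minimal sketch with illustrative coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

ocr_box = (10, 10, 110, 40)   # tight box from a classic OCR engine
vlm_box = (5, 8, 120, 45)     # looser box from VLM visual grounding
print(round(iou(ocr_box, vlm_box), 3))  # → 0.705
```

The systematically looser VLM boxes are exactly why “precise box coordinates” remains an OCR win in such comparisons.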
Quality, Hallucinations, and Confidence
- A major concern: VLMs confidently hallucinate missing names, dates, or text, with no grounded confidence measure; “confidence scores” returned by models are viewed as fabricated.
- Traditional OCR errors are local and usually recognizable as gibberish, while VLM failures can globally rewrite or “summarize” text incorrectly.
- For regulated domains (audit, legal, healthcare, finance), commenters insist on confidence intervals and traceable failure modes; some say hallucinations make pure VLM OCR a non-starter for production.
- Proposed mitigations: strict schemas, fine-tuning, ensembles of multiple models with majority voting, or using VLMs only for layout/semantics on top of OCR output.
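The ensemble-with-majority-voting mitigation can be sketched per field: run several models (or several sampled runs), keep a value only when enough of them agree, and return null otherwise. The field names and outputs below are hypothetical:

```python
from collections import Counter

def majority_vote(extractions, min_agreement=2):
    """Per-field majority vote across several models' extractions.

    Fields without enough agreement come back as None rather than
    trusting a single model's possibly hallucinated value.
    """
    result = {}
    all_fields = {k for ext in extractions for k in ext}
    for field in all_fields:
        votes = Counter(ext[field] for ext in extractions if field in ext)
        value, count = votes.most_common(1)[0]
        result[field] = value if count >= min_agreement else None
    return result

# Three hypothetical model runs over the same scanned form
runs = [
    {"name": "J. Smith", "date": "1912-03-04"},
    {"name": "J. Smith", "date": "1912-03-01"},  # one run misreads the day
    {"name": "J. Smith", "date": "1912-03-04"},
]
voted = majority_vote(runs)
print(voted["name"], voted["date"])  # J. Smith 1912-03-04
```

This gives a crude, but at least empirically grounded, agreement signal in place of the fabricated “confidence scores” criticized above.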
Performance, Cost, and Deployment
- VLMs are acknowledged as 2–3 orders of magnitude worse in characters-per-watt than OCR today, but proponents expect future distillation/quantization to close the gap.
- Some users want fully local, API-key-free setups; others report success via Ollama/vLLM, while one user criticizes the hosted service for 500s, format issues, and hallucinations.
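The “2–3 orders of magnitude” efficiency gap is a characters-per-joule comparison; a back-of-envelope sketch, where every number is an illustrative assumption rather than a measurement:

```python
# All figures below are assumed for illustration, not benchmarked.
ocr_chars_per_sec = 50_000   # assumed classic OCR throughput on one CPU core
ocr_watts = 15               # assumed CPU power draw
vlm_chars_per_sec = 500      # assumed VLM decoding throughput on a GPU
vlm_watts = 300              # assumed GPU power draw

ocr_chars_per_joule = ocr_chars_per_sec / ocr_watts  # ~3333 chars/J
vlm_chars_per_joule = vlm_chars_per_sec / vlm_watts  # ~1.7 chars/J
gap = ocr_chars_per_joule / vlm_chars_per_joule
print(f"OCR is ~{gap:.0f}x more energy-efficient")   # ~2000x, i.e. ~3 orders of magnitude
```

Distillation and quantization attack the denominator (watts) and the numerator (throughput) of the VLM side, which is the basis for the optimism about the gap closing.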