Replace OCR with Vision Language Models
Capabilities and Use Cases
- VLM-based OCR is praised for handling “semantic” tasks: understanding context, inferring units, reading charts with unlabeled axes or ambiguous legends, and coping with historical censuses and messily filled-in forms.
- People report good results on simple-to-medium-complexity forms, flowchart-to-schema extraction, financial data, and specific tasks like finding Apple serial numbers in poorly shot photos of the box.
- VLMs can directly produce structured outputs (JSON now, trivially convertible to YAML), and some users want more ambitious outputs (e.g., LaTeX reconstruction of whole books).
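The JSON-to-YAML conversion mentioned above really is trivial; as a minimal sketch (stdlib only, handling scalars, dicts, and lists, with a hypothetical invoice-style payload):

```python
import json

def to_yaml(value, indent=0):
    """Serialize a parsed JSON value as simple YAML (scalars, dicts, lists)."""
    pad = "  " * indent
    if isinstance(value, dict):
        lines = []
        for key, val in value.items():
            if isinstance(val, (dict, list)) and val:
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml(val, indent + 1))
            else:
                # json.dumps gives valid YAML scalars (quoted strings, numbers, null)
                lines.append(f"{pad}{key}: {json.dumps(val)}")
        return "\n".join(lines)
    if isinstance(value, list):
        return "\n".join(f"{pad}- {json.dumps(item)}" for item in value)
    return f"{pad}{json.dumps(value)}"

# Hypothetical structured output from a VLM extraction run
raw = '{"vendor": "ACME", "total": 41.5, "line_items": ["widget", "gadget"]}'
print(to_yaml(json.loads(raw)))
```

In practice a library like PyYAML would handle edge cases (multiline strings, anchors), but this illustrates why the output format is a thin layer over the structured extraction itself.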
Schemas and Structured Extraction
- The project’s main “value-add” is described as schema-driven, typed extraction that coaxes models into strict, structured formats.
- Type constraints and optional fields are used to reduce hallucinations and enforce well-formed JSON; some argue they still do not solve “making things up” when content is unreadable.
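The schema-driven, typed extraction described above can be sketched with stdlib dataclasses; the schema (`CensusRecord` and its fields) is hypothetical, and the key design choice is that optional fields let the model return null for unreadable content instead of guessing:

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class CensusRecord:
    # Required field: parsing fails loudly if it is absent.
    surname: str
    # Optional fields: a null here is preferable to a hallucinated value.
    birth_year: Optional[int] = None
    occupation: Optional[str] = None

def parse_record(raw_json: str) -> CensusRecord:
    """Validate model output against the schema; reject extra keys and wrong types."""
    data = json.loads(raw_json)
    allowed = {f.name for f in fields(CensusRecord)}
    extra = set(data) - allowed
    if extra:
        raise ValueError(f"unexpected keys: {extra}")
    record = CensusRecord(**data)
    if not isinstance(record.surname, str):
        raise TypeError("surname must be a string")
    if record.birth_year is not None and not isinstance(record.birth_year, int):
        raise TypeError("birth_year must be an integer or null")
    return record

# A model that cannot read a smudged year returns null instead of inventing one.
rec = parse_record('{"surname": "Okafor", "birth_year": null}')
print(rec)  # CensusRecord(surname='Okafor', birth_year=None, occupation=None)
```

Note the caveat from the discussion: this enforces well-formed output, but nothing stops the model from filling a well-typed field with a confidently wrong value.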
Bounding Boxes, Layout, and Tables
- Traditional OCR is still seen as better for precise bounding boxes, dense text, and multi-column layouts.
- VLM “visual grounding” is claimed to provide bounding boxes and experimental table detection, but even supporters acknowledge this remains weaker than classic methods.
- A separate open benchmark suggests VLMs outperform OCR on handwriting and charts/infographics, while OCR wins on dense standardized text and precise box coordinates.
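One way the benchmark-style comparison of box coordinates is typically scored is intersection-over-union between a reference box (e.g. from a classic OCR engine) and a VLM-grounded box; a minimal sketch with illustrative coordinates:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

ocr_box = (10, 10, 110, 40)   # tight box from a classic OCR engine
vlm_box = (5, 8, 120, 45)     # looser box from VLM visual grounding
print(round(iou(ocr_box, vlm_box), 3))  # → 0.705
```

The systematically looser VLM boxes are exactly why “precise box coordinates” remains an OCR win in such comparisons.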
Quality, Hallucinations, and Confidence
- A major concern: VLMs confidently hallucinate missing names, dates, or text, with no grounded confidence measure; “confidence scores” returned by models are viewed as fabricated.
- Traditional OCR errors are local and usually recognizable as gibberish, while VLM failures can globally rewrite or “summarize” text incorrectly.
- For regulated domains (audit, legal, healthcare, finance), commenters insist on confidence intervals and traceable failure modes; some say hallucinations make pure VLM OCR a non-starter for production.
- Proposed mitigations: strict schemas, fine-tuning, ensembles of multiple models with majority voting, or using VLMs only for layout/semantics on top of OCR output.
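The ensemble-with-majority-voting mitigation can be sketched per field: run several models (or several sampled runs), keep a value only when enough of them agree, and return null otherwise. The field names and outputs below are hypothetical:

```python
from collections import Counter

def majority_vote(extractions, min_agreement=2):
    """Per-field majority vote across several models' extractions.

    Fields without enough agreement come back as None rather than
    trusting a single model's possibly hallucinated value.
    """
    result = {}
    all_fields = {k for ext in extractions for k in ext}
    for field in all_fields:
        votes = Counter(ext[field] for ext in extractions if field in ext)
        value, count = votes.most_common(1)[0]
        result[field] = value if count >= min_agreement else None
    return result

# Three hypothetical model runs over the same scanned form
runs = [
    {"name": "J. Smith", "date": "1912-03-04"},
    {"name": "J. Smith", "date": "1912-03-01"},  # one run misreads the day
    {"name": "J. Smith", "date": "1912-03-04"},
]
voted = majority_vote(runs)
print(voted["name"], voted["date"])  # J. Smith 1912-03-04
```

This gives a crude, but at least empirically grounded, agreement signal in place of the fabricated “confidence scores” criticized above.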
Performance, Cost, and Deployment
- VLMs are acknowledged as 2–3 orders of magnitude worse in characters-per-watt than OCR today, but proponents expect future distillation/quantization to close the gap.
- Some users want fully local, API-key-free setups; others report success via Ollama/vLLM, while one user criticizes the hosted service for 500s, format issues, and hallucinations.
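The “2–3 orders of magnitude” efficiency gap is a characters-per-joule comparison; a back-of-envelope sketch, where every number is an illustrative assumption rather than a measurement:

```python
# All figures below are assumed for illustration, not benchmarked.
ocr_chars_per_sec = 50_000   # assumed classic OCR throughput on one CPU core
ocr_watts = 15               # assumed CPU power draw
vlm_chars_per_sec = 500      # assumed VLM decoding throughput on a GPU
vlm_watts = 300              # assumed GPU power draw

ocr_chars_per_joule = ocr_chars_per_sec / ocr_watts  # ~3333 chars/J
vlm_chars_per_joule = vlm_chars_per_sec / vlm_watts  # ~1.7 chars/J
gap = ocr_chars_per_joule / vlm_chars_per_joule
print(f"OCR is ~{gap:.0f}x more energy-efficient")   # ~2000x, i.e. ~3 orders of magnitude
```

Distillation and quantization attack the denominator (watts) and the numerator (throughput) of the VLM side, which is the basis for the optimism about the gap closing.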