Unlimited OCR: One-shot long-horizon parsing
State of OCR Today
- Strong disagreement on whether OCR is “solved.”
- Some argue traditional OCR is fast, cheap, and very reliable for many printed, simple documents.
- Others say even in 2026 OCR “still sucks,” especially for complex layouts, tables, math, and messy real-world scans.
- Non-Latin and diacritic-heavy scripts (CJK, Arabic, Thai, Vietnamese) and cursive or historical handwriting are cited as especially challenging.
Traditional OCR vs LLM/VLM-based OCR
- Traditional OCR:
  - Good at character-level detection, hand-filled forms, fixed layouts, and offline CPU use.
  - Often struggles with layout understanding (multi-columns, headers/footers, ads, irregular structures).
- LLM/VLM-based OCR:
  - Better at handling diverse scripts, cursive, mixed-language content, and complex layouts.
  - Can leverage language priors to fix noisy input, but raises concerns about hallucinations and silent corrections.
- Commercial cloud offerings (e.g., major cloud OCR APIs) are rated at around “85%” accuracy and described as expensive, each with its own failure modes.
Unlimited OCR & Long-Horizon Attention
- KV cache growth is a major bottleneck for long documents; naïve approaches require per-page chunking.
- Unlimited OCR uses Reference Sliding Window Attention:
  - Global: always attends to the full document image.
  - Local: only keeps a small moving window of its own generated text.
- This aims to maintain document-wide context without the KV cache growing linearly, O(N), with the generated text, which is promising for long PDFs and local deployment.
- Some question whether the local window is too small for very long or token-heavy inputs.
Use Cases, Tools, and Benchmarks
- Users report mixed results with a variety of tools (marker, Mistral OCR, Claude, cloud services, document-parsing frameworks).
- For long, complex technical documents (standards, datasheets, scientific PDFs), structure-aware chunking and multi-page context are seen as crucial.
- Commenters ask for comparisons against other SOTA OCR parsers and benchmarks; the current standing of Unlimited OCR relative to leading models is unclear.
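As a rough illustration of the structure-aware chunking mentioned above, the sketch below groups parsed lines so that chunks break only at section boundaries. It assumes the parser emits Markdown-style headings (lines starting with `#`), which is an assumption for this example rather than something any of the listed tools is confirmed to do; `chunk_by_headings` and `max_chars` are hypothetical names.

```python
def chunk_by_headings(lines: list[str], max_chars: int = 2000) -> list[str]:
    """Group lines into chunks that start a new chunk only at a heading.

    A section (heading plus its body) is never split. A single section
    larger than `max_chars` is kept whole in a chunk of its own.
    """
    # Pass 1: split the line stream into sections at heading lines.
    sections: list[list[str]] = []
    for line in lines:
        if line.startswith("#") or not sections:
            sections.append([line])
        else:
            sections[-1].append(line)

    # Pass 2: pack whole sections into chunks under the character budget.
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for section in sections:
        section_len = sum(len(text) + 1 for text in section)  # +1 per newline
        if current and size + section_len > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.extend(section)
        size += section_len
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = ["# Scope", "Applies to X.", "# Ratings", "Max 5 V.", "# Notes", "None."]
    for chunk in chunk_by_headings(doc, max_chars=30):
        print(repr(chunk))
```

Keeping a heading attached to its body is the point of the design: a table or clause is less useful downstream if its section title ended up in a different chunk. Multi-page context (for example, a table continued on the next page) would need additional handling not shown here.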
Reliability, Hallucinations, and Trust
- Concern that context-aware models “correct” text (names, foreign words) or translate when they should just transcribe, which can be unacceptable for archival or legal use.
- Some propose expensive verification loops (regenerating images from text and visually comparing) to approach near-100% reliability.
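The proposed verification loop can be sketched as "render the transcription, compare it with the scan, flag mismatches". The example below is a toy under heavy assumptions: images are tiny grids of 0/1 pixels, `toy_render` is a stand-in for a real text renderer, and exact pixel agreement replaces the font matching, alignment, and perceptual comparison a real pipeline would need. All names and the threshold are hypothetical.

```python
from typing import Callable, Sequence

Image = Sequence[Sequence[int]]  # rows of 0/1 pixels; toy stand-in for a bitmap


def pixel_agreement(a: Image, b: Image) -> float:
    """Fraction of matching pixels; images of different shape score 0.0."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 1.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def flag_suspect_lines(
    line_images: Sequence[Image],
    line_texts: Sequence[str],
    render: Callable[[str], Image],
    threshold: float = 0.98,
) -> list[int]:
    """Indices of lines whose re-rendered text disagrees with the scanned line."""
    suspects = []
    for i, (image, text) in enumerate(zip(line_images, line_texts)):
        if pixel_agreement(image, render(text)) < threshold:
            suspects.append(i)
    return suspects


def toy_render(text: str) -> Image:
    """Toy renderer (assumption): one row, ink (1) for any non-space character."""
    return [[0 if ch == " " else 1 for ch in text]]


if __name__ == "__main__":
    scans = [[[1, 1, 0, 1]], [[1, 1, 1, 1]]]  # two scanned "lines"
    texts = ["ab c", "ab c"]                  # model output for each line
    print(flag_suspect_lines(scans, texts, toy_render))
```

Flagged lines would then go to a second pass or a human reviewer. The cost noted in the thread shows up here as one render and one comparison per line, on top of the OCR pass itself.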
Open Source and Strategy
- Open-sourcing by large companies is seen as driven by a mix of ideals, reputation, hiring, ecosystem-building, and potential strategic impact on competitors.
- Some view the broad release of strong OCR models from China as possibly weakening the revenue of Western AI labs.