DeepSeek OCR

Capabilities vs Existing OCR

  • The thread is split on the claim that “any vision model beats commercial OCR.”
    • Consensus: modern VLMs excel at clean printed text and layout-aware extraction, and can output rich formats (Markdown/HTML).
    • However, proprietary cloud OCR (Azure, Google, etc.) is still seen as state of the art for messy, real-world business documents, partly due to better training data.
  • DeepSeek-OCR impresses commenters on multi-column magazines, PDFs, and layout reconstruction (including embedded images), but it’s not obviously superior across all tasks.

What’s Actually Hard in OCR

  • “OCR is solved” is strongly contested. Persistent hard cases:
    • Complex tables (row/col spans, multi-page, checkboxes) and technical forms.
    • Historical and handwritten text (HTR), especially for genealogy and archival records.
    • CJK and other non-Latin scripts, vertical writing, signatures, and low-res scans.
    • Dense, creative layouts (ads, old magazines, SEC filings, complex diagrams).
  • Traditional OCR gives character-level confidence and bounding boxes; many VLM-based pipelines don’t, which is a blocker for high-precision or coordinate-sensitive use cases.
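
  A minimal sketch of the structured output at issue (the schema below is an
  illustrative assumption of a generic word-level result, not any particular
  engine’s or model’s API):

      from dataclasses import dataclass

      @dataclass
      class OcrWord:
          text: str
          confidence: float                  # per-word recognition confidence, 0.0-1.0
          bbox: tuple[int, int, int, int]    # (x0, y0, x1, y1) in page-pixel coordinates

      # Traditional OCR engines typically emit something like this for every word,
      # letting pipelines threshold on confidence or map extracted text back to the page:
      traditional = [
          OcrWord("Invoice", 0.99, (120, 80, 310, 118)),
          OcrWord("No.", 0.87, (322, 80, 390, 118)),
      ]

      # Many VLM-based pipelines return only a flat string (often Markdown), with no
      # coordinates or confidences to check against the source page:
      vlm_output = "# Invoice No. 42\n..."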

Vision-Token Compression & Context

  • Main research interest is “contexts optical compression”:
    • Images are encoded into far fewer “vision tokens” than equivalent text tokens, while retaining ~97% OCR accuracy at 10× compression and ~60% at 20×.
    • Discussion centers on why this works: vision tokens are continuous, high-dimensional embeddings over patches, effectively packing multiple words into each token.
  • This is framed as a path to cheaper long-context LLMs: compress long text into visual/latent form, process fewer tokens, then decode back to text.
  • Debate over the information-theoretic intuition: some see it as a better use of embedding space; others emphasize it’s still an experimental engineering result, not a clean theory; see the sketch below.
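
  A back-of-envelope sketch of the arithmetic behind these claims; the page size,
  vocabulary, embedding dimension, and precision below are illustrative
  assumptions, not values from the paper or the released model:

      import math

      # Token budget at the compression ratios and accuracies reported above.
      page_text_tokens = 1000                      # a dense page of prose, roughly
      for ratio, reported_acc in [(10, 0.97), (20, 0.60)]:
          vision_tokens = page_text_tokens // ratio
          print(f"{ratio}x: ~{vision_tokens} vision tokens per page, "
                f"reported OCR accuracy ~{reported_acc:.0%}")

      # Why one continuous vision token could plausibly pack several words: a text
      # token is a single choice from a discrete vocabulary, while a vision token
      # is a high-dimensional float vector.
      vocab_size = 128_000                         # assumed tokenizer vocabulary
      bits_per_text_token = math.log2(vocab_size)  # ~17 bits of choice per token
      embed_dim, bits_per_dim = 1024, 8            # assumed embedding width and precision
      bits_per_vision_token = embed_dim * bits_per_dim
      print(f"text token ~{bits_per_text_token:.0f} bits vs. "
            f"vision token ~{bits_per_vision_token} bits of raw capacity")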

Benchmarks and Comparisons

  • dots-ocr is repeatedly praised, particularly for table extraction, though it’s less open. PaddleOCR is also mentioned.
  • OmniAI’s own benchmark is criticized; OmniDocBench is recommended instead.
  • Reports: Gemini 2.5 performs very well on OCR and handwriting but has “recitation” cutoffs, hallucinations on blank pages, and PII refusals. OpenAI models are decent but drop headers, footers, or rotated pages.
  • Mistral OCR and IBM Granite Docling are viewed as behind current SOTA.

Licensing, Data, and Ethics

  • DeepSeek-OCR code and weights are MIT-licensed, which is widely praised.
  • Prior DeepSeek work explicitly used Anna’s Archive; commenters suspect similar data here, raising worries about legal risk to such archives and about unreleasable training sets.