Mistral OCR

Performance & Features

  • Many commenters report Mistral OCR is very fast and works well at converting clean PDFs to Markdown, sometimes “significantly” better than Google, Claude, ChatGPT, etc. for basic text extraction.
  • Standout feature: returns Markdown plus coordinates for extracted images, enabling figure extraction and layout-aware applications.
  • However, several users note that on some pages it classifies everything as a single image and returns just an `![img-0]` tag, with no text.
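The image-only failure mode is easy to screen for downstream. A minimal sketch (the regex and function are illustrative, not part of any SDK) that flags pages whose returned Markdown is nothing but image tags:

```python
import re

# Matches Markdown image tags like ![img-0](img-0.jpeg)
IMAGE_TAG = re.compile(r"!\[[^\]]*\]\([^)]*\)")

def is_image_only(page_markdown: str) -> bool:
    """True if the page contains image tags but no extracted text at all."""
    has_image = bool(IMAGE_TAG.search(page_markdown))
    remaining = IMAGE_TAG.sub("", page_markdown).strip()
    return has_image and not remaining
```

Pages that trip this check can be routed to a fallback OCR engine instead of silently losing their text.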

Pricing, Batching & API UX

  • Pricing of “$1 / 1000 pages” is seen as aggressive vs other cloud OCR tools, but not obviously cheaper than renting GPUs if you self-host a model.
  • “Batching” is understood as async, high-latency jobs (minutes/hours) that let providers utilize GPUs more efficiently; single huge PDFs can still time out.
  • Some frustration with documentation: OCR API is hard to discover, chunking behavior is unclear, and image vs PDF endpoints are confusing.
  • Le Chat integration is inconsistent: some users say it works, others say it hallucinates or truncates pages and appears not to use the new OCR at all.
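The pricing comparison is simple arithmetic; a back-of-envelope sketch (the GPU price and throughput below are illustrative assumptions, not measured figures):

```python
API_COST_PER_PAGE = 1.00 / 1000  # the advertised $1 / 1,000 pages

def self_host_cost_per_page(gpu_dollars_per_hour: float, pages_per_hour: float) -> float:
    """Effective per-page cost of running your own model on a rented GPU."""
    return gpu_dollars_per_hour / pages_per_hour

# Hypothetical: a $2/hr GPU breaks even with the API at 2,000 pages/hr
# and undercuts it at 3,000 pages/hr -- before counting engineering time.
break_even = self_host_cost_per_page(2.00, 2000)  # == API_COST_PER_PAGE
cheaper = self_host_cost_per_page(2.00, 3000)
```

Whether self-hosting wins thus hinges entirely on sustained throughput and utilization, which is why commenters call the API price aggressive but "not obviously cheaper."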

Benchmarks vs Real-World Tests

  • Mistral advertises ~95% “text-only” accuracy; independent benchmarks on mixed, real-world documents report ~72% accuracy and frequent image-only outputs where other VLMs produce usable text.
  • Multiple users find Gemini Flash / Gemini 2.0, Claude, or specialized tools (Mathpix, Marker, MinerU, Docling, other SaaS OCRs) outperform Mistral on:
    • Complex tables, receipts, invoices
    • Scientific papers with equations/figures
    • Domain documents (medical, legal, regulatory, technical textbooks).
  • Handwriting, historical scripts, and multilingual/bidi (e.g., Hebrew, Arabic, Chinese, old German) are recurring weak spots; some models like Gemini or custom HTR models do better there.
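For context, headline figures like ~95% vs ~72% usually mean character-level accuracy, i.e. one minus the character error rate. A minimal sketch of that metric (a plain stdlib implementation; published benchmarks typically also normalize whitespace and Unicode first):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def char_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER: share of reference characters the OCR got right."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - levenshtein(reference, hypothesis) / len(reference))
```

Note that a single swapped digit barely dents this score, which is why aggregate accuracy understates the risk on names, numbers, and dates.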

Hallucinations, Reliability & Use Cases

  • LLM-based OCR is praised for flexibility (can summarize, structure, or normalize) but criticized for hallucinations, dropped content, and lack of confidence scores.
  • In high-stakes domains (contracts, leases, finance, medical, regulation), even 1–2% errors in names, numbers, or dates are unacceptable; most commenters see human-in-the-loop workflows as mandatory.
  • Traditional CNN/character-level OCR is still regarded as more predictable for strict text fidelity; several suggest hybrid pipelines (classic OCR + LLM cleanup, or multi-model “tournaments”).
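A toy sketch of the “tournament” idea (the engine names and pre-aligned, line-for-line outputs are simplifying assumptions; real pipelines need an alignment step first): majority-vote each line across engines and flag any disagreement for human review.

```python
from collections import Counter

def tournament(outputs: dict[str, list[str]]) -> list[tuple[str, bool]]:
    """outputs: engine name -> OCR'd lines (assumed already aligned).
    Returns (winning_line, needs_human_review) per line position."""
    results = []
    for candidates in zip(*outputs.values()):
        counts = Counter(candidates)
        best, votes = counts.most_common(1)[0]
        results.append((best, votes < len(candidates)))  # flag any dissent
    return results

page = {
    "classic_ocr": ["Total: $1,234.56", "Due: 2024-03-01"],
    "vlm_a":       ["Total: $1,234.56", "Due: 2024-03-01"],
    "vlm_b":       ["Total: $1,284.56", "Due: 2024-03-01"],
}
# tournament(page) -> [("Total: $1,234.56", True), ("Due: 2024-03-01", False)]
```

Flagging on any dissent rather than only on ties is the conservative choice for the high-stakes use cases above.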

Ecosystem & Future Direction

  • Many expect OCR itself to become a commodity; real value will come from:
    • Document structuring, table/figure linking, layout semantics
    • Domain-specific models
    • Tooling: pipelines, validation, human review, integrations, on-prem options.
  • Some discuss “microLLM”/agent architectures: specialized OCR or document-understanding models plugged into larger orchestration frameworks rather than one monolithic VLM.
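One way to read the “microLLM” idea is a router over small specialists rather than one monolithic VLM; a toy sketch (all handler names and return values are placeholders, not real models):

```python
from typing import Callable

HANDLERS: dict[str, Callable[[bytes], str]] = {}

def register(doc_type: str):
    """Decorator that plugs a specialist into the orchestration layer."""
    def wrap(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
        HANDLERS[doc_type] = fn
        return fn
    return wrap

@register("invoice")
def invoice_specialist(doc: bytes) -> str:
    return "structured invoice fields"   # stand-in for a table/receipt model

@register("handwriting")
def htr_specialist(doc: bytes) -> str:
    return "transcribed handwriting"     # stand-in for a dedicated HTR model

@register("default")
def general_vlm(doc: bytes) -> str:
    return "generic markdown"            # monolithic fallback

def process(doc: bytes, doc_type: str) -> str:
    return HANDLERS.get(doc_type, HANDLERS["default"])(doc)
```

The point of the pattern is that each specialist can be swapped or benchmarked independently, which matches the thread's view that value shifts from raw OCR to the surrounding tooling.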