2026-06-23

Mistral OCR 4

Overall impressions of Mistral OCR 4

Many commenters report strong real-world performance, especially on degraded or old documents, comparing it favorably to classic tools like ABBYY FineReader and some modern VLMs.
Others are skeptical due to earlier Mistral OCR versions underperforming relative to marketing claims; some say OCR 4 looks better but want independent benchmarks first.
Some users praise Mistral specifically for OCR while criticizing its coding/general models as weaker than US/Chinese SOTA.

Benchmarks, accuracy & evaluation

Commenters question the heavy reliance on internal benchmarks and the limited public metrics, citing past claims of the “98% accurate on tiny internal sets” kind.
External benchmarks like OlmOCRBench, OmniDocBench, ParseBench, and Arbitr leaderboards are referenced; one link suggests previous Mistral OCR wasn’t top-tier.
Several complain about “chart crimes”: truncated y‑axes and presentation that may exaggerate gains.
There’s interest in comparisons against Baidu’s Unlimited-OCR, Llama Parse, Apple’s local models, Claude’s vision, Gemini, and Google Vision / Document AI, but the data is incomplete or absent.
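
To make the “chart crimes” complaint concrete, here is a minimal sketch of how a truncated y-axis inflates the visual gap between two bars. The scores are hypothetical and are not taken from any benchmark mentioned in the discussion; the function name `apparent_ratio` is likewise just an illustration.

```python
def apparent_ratio(a: float, b: float, axis_min: float = 0.0) -> float:
    """Ratio of drawn bar heights for values a and b when the y-axis starts at axis_min."""
    if axis_min >= min(a, b):
        raise ValueError("axis_min must be below both values")
    return (b - axis_min) / (a - axis_min)


# Hypothetical accuracy scores (assumed for illustration only).
old_score, new_score = 92.0, 96.0

honest = apparent_ratio(old_score, new_score)           # y-axis starts at 0
truncated = apparent_ratio(old_score, new_score, 90.0)  # y-axis starts at 90

print(f"ratio of values: {honest:.3f}")
print(f"ratio of drawn bar heights on the truncated axis: {truncated:.1f}")
```

With these assumed numbers, a gain of about 4% in the underlying value is drawn as a bar three times as tall, which is the kind of exaggeration commenters object to.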

Pricing and competition

$4 per 1,000 pages is seen as very cheap by some, but others note that Google Vision OCR is cheaper for plain text ($1.50/1k) and that layout-aware Google/Azure offerings are closer in price.
Some wonder how traditional OCR vendors can compete at these price points.
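
The per-page arithmetic behind this comparison can be sketched as follows. Only the two per-1,000-page prices come from the discussion; the page volume is a hypothetical figure, and the sketch ignores free tiers, volume discounts, and any difference in what each service returns.

```python
from decimal import Decimal

# Prices per 1,000 pages as quoted in the discussion.
PRICE_PER_1K = {
    "Mistral OCR 4": Decimal("4.00"),
    "Google Vision OCR (plain text)": Decimal("1.50"),
}


def cost(pages: int, price_per_1k: Decimal) -> Decimal:
    """Flat-rate cost for a page count, with no tiers or discounts applied."""
    return Decimal(pages) / Decimal(1000) * price_per_1k


pages = 250_000  # hypothetical monthly volume, assumed for illustration

for name, price in PRICE_PER_1K.items():
    print(f"{name}: ${cost(pages, price):,.2f}")
```

At that assumed volume the absolute gap is a few hundred dollars, so the choice is likely to hinge more on accuracy and layout handling than on list price.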

Use cases, limitations & risks

Good results are reported on complex business docs, tables, forms, and magazines; one commenter found the automatic Markdown output with image cropping particularly useful.
Some real-world failures are noted (e.g., misrecognized dates on receipts, quotation mark style changes), highlighting risk for high‑stakes or formatting‑sensitive workflows.
Commenters discuss feeding OCR output into downstream decision systems and worry about silent OCR errors affecting financial or other critical decisions.
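
One way to reduce the risk of silent errors is to validate extracted fields before they reach a decision system. The sketch below is not something proposed in the discussion and does not use any Mistral API. It assumes the OCR step has already produced a date string in ISO format, and the helper names are hypothetical.

```python
from datetime import date, datetime

# Map curly quotation marks to straight ones so text comparisons are not
# thrown off by a change in quote style.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(QUOTE_MAP)


def check_receipt_date(raw: str, earliest: date, latest: date) -> tuple[bool, str]:
    """Return (ok, reason); anything not ok should go to human review."""
    try:
        parsed = datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        return False, "unparseable date"
    if not (earliest <= parsed <= latest):
        return False, "date outside plausible window"
    return True, "ok"


# Hypothetical plausibility window for a batch of receipts.
window = (date(2026, 1, 1), date(2026, 12, 31))

for candidate in ["2026-06-23", "2062-06-23", "2026-O6-23"]:
    print(candidate, check_receipt_date(candidate, *window))
```

A check like this catches only implausible values. A misread digit that still lands inside the window passes unnoticed, which is why commenters keep a human review step for high-stakes workflows.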

Handwriting, languages & edge cases

Multiple comments confirm good handwriting recognition in practice (including historical documents), though always with a human review tail.
Other tools like Transkribus, Sarvam, Gemini Pro, and Qwen models are cited as strong for handwriting or Indic languages.
One user reports language misclassification (Malayalam as Kannada); another notes “rare/specialized languages” labeling (formerly “minor”) as revealing of training priorities.
Some ask for benchmarks by language and on handwritten data; current public benchmarks are seen as skewed toward printed text.

Related topics