2024-08-09

Show HN: LLM-aided OCR – Correcting Tesseract OCR errors with LLMs

Overall response

Many commenters like the idea of layering LLMs on top of OCR, especially for long, low‑quality scans (old books, archives) where manual cleanup is painful.
Others feel Tesseract is outdated and that newer OCR or vision models alone can now do better.

Tesseract and other OCR engines

Tesseract is praised for being free, CPU‑friendly, and fast, but accuracy is a recurring complaint—especially with digits, punctuation, plus signs, and non‑Latin scripts.
Some note that tuning the input resolution (around 300 DPI) helps, but they still hit errors such as “77” → “7” or “+40%” → “440%”.
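As a rough illustration of the DPI tuning mentioned above, the sketch below builds a Tesseract command line that passes an explicit resolution. It assumes Tesseract 4 or later is installed and on the PATH; the helper names `build_tesseract_cmd` and `ocr_image` and the file name `page.png` are hypothetical and do not come from the thread.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, dpi=300, lang="eng"):
    """Return a Tesseract CLI invocation that prints recognized text to stdout.

    The --dpi option tells Tesseract what resolution to assume for the
    input image; -l selects the language model.
    """
    return ["tesseract", image_path, "stdout", "--dpi", str(dpi), "-l", lang]


def ocr_image(image_path, dpi=300, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, dpi, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("page.png")` would then return the raw text for one page. Setting the DPI only changes what Tesseract assumes about the image; it does not resample the scan, so a genuinely low-resolution image may still need upscaling or rescanning.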
Alternatives mentioned: EasyOCR, PaddleOCR, Surya, TrOCR, Nougat, Donut, and commercial APIs (Google, Amazon Textract, Azure).
For Arabic, Japanese, and German, several report that commercial cloud OCR is clearly superior to open‑source options, but at the cost of vendor dependency and reduced privacy.

LLM‑aided correction: strengths and weaknesses

The LLM layer shines on prose: fixing obvious OCR typos, normalizing punctuation, and turning noisy scans into readable markdown/epub text.
It is much less trustworthy for numbers, names, and contract‑like documents; several insist such outputs must be manually checked, often field‑by‑field.
Several argue that prompt engineering matters more than model choice: multi‑stage prompting, with each step restricted to one narrow task, is described as key.
Some show examples where an LLM correctly “rescues” Tesseract errors using domain context (finance, road names).
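The multi‑stage idea can be sketched as a small pipeline in which each chunk of OCR text passes through a sequence of narrow prompts. This is a minimal illustration, not the project's actual implementation: the prompt wording, the chunk size, and the `llm` callable (any function that takes a prompt string and returns the model's reply) are all assumptions.

```python
from typing import Callable, List

# Hypothetical prompts: one narrow task per stage.
STAGES = [
    "Fix obvious OCR errors in the text below. "
    "Do not add, remove, or rephrase content.\n\n{text}",
    "Reformat the text below as markdown. Do not change any wording.\n\n{text}",
]


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def correct_text(raw: str, llm: Callable[[str], str], max_chars: int = 2000) -> str:
    """Run every chunk through each stage in order and rejoin the results."""
    corrected = []
    for chunk in split_into_chunks(raw, max_chars):
        for template in STAGES:
            chunk = llm(template.format(text=chunk))
        corrected.append(chunk)
    return "\n\n".join(corrected)
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider and makes it easy to substitute a stub when testing the chunking and stage order.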

Vision LLMs vs OCR+LLM pipelines

Opinions diverge:
- Some claim GPT‑4V/4o, Claude, Gemini Flash, and specialized models (e.g., Florence‑2) already outperform Tesseract/EasyOCR for many documents, including equations and tables, and can work directly from PDFs or page images.
- Others find small open multimodal models slower and less accurate than classical OCR, especially on symbols, dense tables, and weird layouts.
There’s debate over cost: some say page‑image → GPT‑4o‑mini is comparable to OCR+LLM; others think full VLM OCR is still orders of magnitude more expensive.

Hallucinations, safety, and trust

Multiple commenters worry about “silent” semantic errors (numbers, legal terms), analogous to the Xerox JBIG2 bug, in which scanner compression silently substituted digits in scanned documents; they prefer visibly broken OCR over polished but wrong text.
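One way to make such silent changes visible, offered here only as an illustrative sketch and not something proposed in the thread, is to compare the numeric tokens in the raw OCR text with those in the LLM‑corrected text and flag any difference for manual review. The function names and the regular expression are assumptions, and the pattern covers only simple integers, decimals, signs, and percentages.

```python
import re
from collections import Counter

# Matches an optional sign, digits with optional , or . separators, optional %.
NUMBER_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*%?")


def numeric_tokens(text):
    """Count every numeric token found in the text."""
    return Counter(NUMBER_RE.findall(text))


def changed_numbers(ocr_text, corrected_text):
    """Report numeric tokens that disappeared or appeared after correction."""
    before = numeric_tokens(ocr_text)
    after = numeric_tokens(corrected_text)
    return {
        "dropped": sorted((before - after).elements()),
        "introduced": sorted((after - before).elements()),
    }


report = changed_numbers(
    "Revenue rose +40% to 77 units",
    "Revenue rose 440% to 77 units",
)
print(report)  # {'dropped': ['+40%'], 'introduced': ['440%']}
```

A non‑empty report does not say which version is right, since the raw OCR may be the one that is wrong; it only marks the passage as needing a human check against the page image.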
LLM hallucinations and safety filters (e.g., refusing violent content) are seen as real risks for sensitive documents like police reports or contracts.
Attempts to detect prompt‑injection or safety interference via additional prompts are viewed as only partially reliable.

Complex layouts, tables, and handwriting

Multi‑column layouts, forms, dense tables, charts, and scientific formulas remain difficult. Some recommend specialized models (e.g., for formulas or invoices) plus a final LLM cleanup.
Handwriting OCR is still weak overall; certain vision transformers and GPT‑4o work well on neat handwriting, but large personal archives and cursive remain challenging.
