Show HN: LLM-aided OCR – Correcting Tesseract OCR errors with LLMs
Overall response
- Many commenters like the idea of layering LLMs on top of OCR, especially for long, low‑quality scans (old books, archives) where manual cleanup is painful.
- Others feel Tesseract is outdated and that newer OCR or vision models alone can now do better.
Tesseract and other OCR engines
- Tesseract is praised for being free, CPU‑friendly, and fast, but accuracy is a recurring complaint, especially with digits, punctuation, plus signs, and non‑Latin scripts.
- Some note that feeding it images at around 300 DPI helps, but they still hit errors such as “77” → “7” or “+40%” → “440%”.
- Alternatives mentioned: EasyOCR, PaddleOCR, Surya, TrOCR, Nougat, Donut, and commercial APIs (Google, Amazon Textract, Azure).
- For Arabic, Japanese, and German, several report that commercial cloud OCR is clearly superior to open‑source options, though at the cost of vendor dependency and reduced privacy.
LLM‑aided correction: strengths and weaknesses
- The LLM layer shines on prose: fixing obvious OCR typos, normalizing punctuation, and turning noisy scans into readable markdown/epub text.
- It is much less trustworthy for numbers, names, and contract‑like documents; several insist such outputs must be manually checked, often field‑by‑field.
- A recurring view is that multi‑stage prompting, with one narrow task per step, matters more than model choice; prompt engineering is seen as key.
- Some show examples where an LLM correctly “rescues” Tesseract errors using domain context (finance, road names).
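
The "one narrow task per step" idea can be sketched as below. This is a minimal illustration, not the project's actual code: the prompt wording, the function names, the chunk size, and the `llm` callable (any function that takes a prompt string and returns the model's reply) are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical prompts; the thread does not quote the project's real wording.
FIX_PROMPT = "Correct obvious OCR errors in the text below. Do not add or remove content.\n\n"
FORMAT_PROMPT = "Reformat the text below as markdown. Do not change any wording.\n\n"


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def correct_document(ocr_text: str, llm: Callable[[str], str], max_chars: int = 2000) -> str:
    """Run two narrow LLM passes per chunk: fix OCR errors, then format."""
    results = []
    for chunk in split_into_chunks(ocr_text, max_chars):
        fixed = llm(FIX_PROMPT + chunk)          # stage 1: error correction only
        formatted = llm(FORMAT_PROMPT + fixed)   # stage 2: formatting only
        results.append(formatted)
    return "\n\n".join(results)
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider or client library, and makes each stage easy to test with a stub.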
Vision LLMs vs OCR+LLM pipelines
- Opinions diverge:
  - Some claim GPT‑4V/4o, Claude, Gemini Flash, and specialized models (e.g., Florence‑2) already outperform Tesseract/EasyOCR for many documents, including equations and tables, and can work directly from PDFs or page images.
  - Others find small open multimodal models slower and less accurate than classical OCR, especially on symbols, dense tables, and weird layouts.
- There’s debate over cost: some say page‑image → GPT‑4o‑mini is comparable to OCR+LLM; others think full VLM OCR is still orders of magnitude more expensive.
Hallucinations, safety, and trust
- Multiple commenters worry about “silent” semantic errors (numbers, legal terms) analogous to the historic JBIG2 Xerox bug; they prefer visibly broken OCR over polished but wrong text.
- LLM hallucinations and safety filters (e.g., refusing violent content) are seen as real risks for sensitive documents like police reports or contracts.
- Attempts to detect prompt‑injection or safety interference via additional prompts are viewed as only partially reliable.
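
One way to make "silent" changes visible is to compare the LLM output against the raw OCR text and flag pages for human review. The sketch below is an illustration only, not something proposed verbatim in the thread: the function names, the number regex, and the 0.8 similarity threshold are assumptions. It cannot tell a correct fix from a hallucination; it only surfaces pages where numbers changed or where the text drifted far from the OCR input.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Rough pattern for numeric tokens such as "77", "+40%", "1,234.5" (an assumption).
NUMBER_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*%?")


def number_tokens(text):
    """Return a multiset of the numeric tokens found in text."""
    return Counter(NUMBER_RE.findall(text))


def review_flags(ocr_text, llm_text, min_similarity=0.8):
    """Return human-readable reasons why this page needs a manual check."""
    flags = []
    before, after = number_tokens(ocr_text), number_tokens(llm_text)
    if before != after:
        removed = sorted((before - after).elements())
        added = sorted((after - before).elements())
        flags.append(f"numbers changed: removed={removed} added={added}")
    # Character-level similarity; a low ratio suggests rewriting, not correction.
    similarity = SequenceMatcher(None, ocr_text, llm_text).ratio()
    if similarity < min_similarity:
        flags.append(f"low similarity to OCR text: {similarity:.2f}")
    return flags
```

Under this scheme an LLM that turns “440%” back into “+40%” is flagged as well, which matches the view that numeric fields need field-by-field checking regardless of how plausible the corrected text looks.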
Complex layouts, tables, and handwriting
- Multi‑column layouts, forms, dense tables, charts, and scientific formulas remain difficult. Some recommend specialized models (e.g., for formulas or invoices) plus a final LLM cleanup.
- Handwriting OCR is still weak overall; certain vision transformers and GPT‑4o work well on neat handwriting, but large personal archives and cursive remain challenging.