Show HN: LLM-aided OCR – Correcting Tesseract OCR errors with LLMs

Overall response

  • Many commenters like the idea of layering LLMs on top of OCR, especially for long, low‑quality scans (old books, archives) where manual cleanup is painful.
  • Others feel Tesseract is outdated and that newer OCR or vision models alone can now do better.

Tesseract and other OCR engines

  • Tesseract is praised for being free, CPU‑friendly, and fast, but accuracy is a recurring complaint, especially with digits, punctuation, plus signs, and non‑Latin scripts.
  • Some note that tuning the input DPI to around 300 helps, but they still hit errors such as “77” → “7” or “+40%” → “440%”.
  • Alternatives mentioned: EasyOCR, PaddleOCR, Surya, TrOCR, Nougat, Donut, and commercial APIs (Google, Amazon Textract, Azure).
  • For Arabic, Japanese, and German, several report that commercial cloud OCR is clearly superior to open‑source options, at the cost of vendor dependency and reduced privacy.
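
The DPI tip above can be sketched as a small pre-processing calculation. This is only an illustration: the function names are hypothetical, the 300 DPI target comes from the comments, and the "never downscale" rule is an assumption rather than something the thread states. The actual resize and OCR call would happen in whatever imaging library is in use.

```python
def upscale_factor(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the resize factor needed to reach target_dpi.

    Assumption: scans already at or above the target are left alone,
    so the factor is never below 1.0.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


def target_size(width_px: int, height_px: int, source_dpi: float,
                target_dpi: float = 300.0) -> tuple[int, int]:
    """Pixel dimensions to resize a page image to before running OCR."""
    factor = upscale_factor(source_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)
```

For example, a page scanned at 100 DPI would be enlarged threefold in each dimension before being handed to the OCR engine.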

LLM‑aided correction: strengths and weaknesses

  • The LLM layer shines on prose: fixing obvious OCR typos, normalizing punctuation, and turning noisy scans into readable markdown/epub text.
  • It is much less trustworthy for numbers, names, and contract‑like documents; several insist such outputs must be manually checked, often field‑by‑field.
  • Multi‑stage prompting, with one narrow task per step, is said to matter more than model choice; prompt engineering is considered key.
  • Some show examples where an LLM correctly “rescues” Tesseract errors using domain context (finance, road names).
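
The multi‑stage idea can be sketched roughly as follows. This is a minimal sketch, not the project's actual implementation: the prompt wording, the `STAGES` list, the chunk size, and the `llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions made for illustration.

```python
from typing import Callable, List

# Hypothetical narrow-task prompts, one per stage; the wording is illustrative.
STAGES = [
    "Fix obvious OCR typos. Do not change numbers or names.\n\n{text}",
    "Normalize punctuation and paragraph breaks. Do not reword.\n\n{text}",
]


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def correct(text: str, llm: Callable[[str], str], max_chars: int = 2000) -> str:
    """Run every chunk through each stage in order and rejoin the results."""
    corrected = []
    for chunk in split_into_chunks(text, max_chars):
        for template in STAGES:
            chunk = llm(template.format(text=chunk))
        corrected.append(chunk)
    return "\n\n".join(corrected)
```

Keeping `llm` as a plain callable means the same pipeline can be pointed at a local model, a hosted API, or a stub for testing, which fits the observation that the staging matters more than the model.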

Vision LLMs vs OCR+LLM pipelines

  • Opinions diverge:
    • Some claim GPT‑4V/4o, Claude, Gemini Flash, and specialized models (e.g., Florence‑2) already outperform Tesseract/EasyOCR for many documents, including equations and tables, and can work directly from PDFs or page images.
    • Others find small open multimodal models slower and less accurate than classical OCR, especially on symbols, dense tables, and weird layouts.
  • There’s debate over cost: some say page‑image → GPT‑4o‑mini is comparable to OCR+LLM; others think full VLM OCR is still orders of magnitude more expensive.

Hallucinations, safety, and trust

  • Multiple commenters worry about “silent” semantic errors (numbers, legal terms), analogous to the historic Xerox JBIG2 bug in which scanners silently substituted digits; they prefer visibly broken OCR over polished but wrong text.
  • LLM hallucinations and safety filters (e.g., refusing violent content) are seen as real risks for sensitive documents like police reports or contracts.
  • Attempts to detect prompt‑injection or safety interference via additional prompts are viewed as only partially reliable.
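
One way to surface the "silent" number changes that commenters worry about is to compare the numeric tokens in the raw OCR text with those in the LLM‑corrected text and send any difference to a human. The sketch below is an assumption‑laden illustration, not something proposed in the thread: the regular expression and the function names are hypothetical, and the check only flags differences without deciding which version is right.

```python
import re
from collections import Counter

# Illustrative pattern: optional sign, digits with . or , separators, optional %.
NUMERIC = re.compile(r"[+-]?\d+(?:[.,]\d+)*%?")


def numeric_tokens(text: str) -> Counter:
    """Count every numeric token found in the text."""
    return Counter(NUMERIC.findall(text))


def changed_numbers(ocr_text: str, corrected_text: str) -> dict:
    """Report numeric tokens that the correction step dropped or introduced."""
    before = numeric_tokens(ocr_text)
    after = numeric_tokens(corrected_text)
    return {
        "dropped": sorted((before - after).elements()),
        "introduced": sorted((after - before).elements()),
    }


report = changed_numbers(
    "Revenue rose 440% to 77 units.",
    "Revenue rose +40% to 77 units.",
)
# report == {"dropped": ["440%"], "introduced": ["+40%"]}
```

An empty report does not prove the numbers are correct, since the OCR itself may have misread them; it only shows that the LLM step left them untouched.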

Complex layouts, tables, and handwriting

  • Multi‑column layouts, forms, dense tables, charts, and scientific formulas remain difficult. Some recommend specialized models (e.g., for formulas or invoices) plus a final LLM cleanup.
  • Handwriting OCR is still weak overall; certain vision transformers and GPT‑4o work well on neat handwriting, but large personal archives and cursive remain challenging.