Ask HN: What is the best method for turning a scanned book as a PDF into text?

Traditional OCR vs. Simple Text Extraction

  • Tools like pdftotext and desktop PDF readers work only if the PDF already has a text layer; they fail on pure scans.
  • Classic OCR stacks mentioned: Tesseract (often via OCRmyPDF), Surya, EasyOCR, PaddleOCR, MuPDF-based scripts, Paperless, OCR4All, extractous, ABBYY FineReader, Mathpix (especially for math).
  • Several users report good results from OCRmyPDF combined with preprocessing (e.g., ScanTailor) and say it “just works” for many books.
  • Handwriting is a weak spot for open-source OCR; cloud services (e.g., Google Vision) reportedly outperform them there.
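
The first decision above (extract the existing text layer, or run OCR) can be sketched as follows. This is a minimal sketch, not a method anyone in the thread posted: it assumes the pdftotext and ocrmypdf command-line tools are installed, and the function names and the 50-characters-per-page threshold are illustrative assumptions.

```python
import subprocess


def extract_pages(pdf_path):
    """Run pdftotext and return one string per page (empty strings for pure scans)."""
    raw = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    pages = raw.split("\f")  # pdftotext ends each page with a form feed
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk after the final form feed
    return pages


def has_text_layer(page_texts, min_chars_per_page=50):
    """Heuristic: treat the PDF as text-bearing if pages average enough characters.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not page_texts:
        return False
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) >= min_chars_per_page


def ocrmypdf_command(src, dst, language="eng"):
    """Build an OCRmyPDF invocation for a scan without a usable text layer."""
    # --skip-text leaves pages that already contain text untouched;
    # --deskew straightens tilted scans before recognition.
    return ["ocrmypdf", "--skip-text", "--deskew", "-l", language, src, dst]
```

A caller would run `extract_pages`, and only if `has_text_layer` returns False pass the result of `ocrmypdf_command` to `subprocess.run` before extracting again.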

Cloud and Commercial OCR Services

  • Common recommendations: Google Document AI / Vision, AWS Textract, Azure Document Intelligence, Adobe PDF text extraction API, ABBYY FineReader, Mathpix, LlamaParse.
  • People describe building pipelines: e.g., upload to S3 → trigger Textract → store text → email results.
  • Some highlight Google’s tools (Document AI, Gemini OCR) as accurate across languages; others note limited flexibility or schema assumptions.
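
For the "store text" step of a Textract-style pipeline, the sketch below turns a response into plain text per page. It assumes the response is a dict shaped like Textract's text-detection output, with a "Blocks" list whose LINE blocks carry "Text" and (for multi-page jobs) "Page"; the function name is hypothetical, and the upload, trigger, and email steps are left out.

```python
from collections import defaultdict


def textract_lines_by_page(response):
    """Group LINE blocks from a Textract-style response into text per page."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        # WORD and PAGE blocks are skipped; LINE blocks already hold joined words.
        if block.get("BlockType") == "LINE":
            # Single-page responses may omit "Page", so default to page 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}
```

Keeping the parsing separate from the API call makes it easy to test against saved responses before wiring it to S3 events.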

LLMs as OCR Engines

  • Many advocate multimodal LLMs (Gemini 2.0/Flash, Claude Sonnet, GPT‑4o) as highly accurate, especially when given one page image at a time.
  • Reported advantages: better handling of noisy scans and context-aware correction; easy Markdown/styled output.
  • Concerns:
    • Marketing claims about “state of the art” Gemini OCR are seen as overhyped and as holding only for narrow subproblems.
    • LLMs can hallucinate or silently change text, which is unacceptable for high‑stakes domains or strict transcription.
    • Long-context degradation: users observe lower quality when feeding whole books vs. single pages.
    • Safety filters may censor or silently drop “awkward” content; the suggested fix is to tune the API safety settings and insist on verbatim output.
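
Two of the concerns above (long-context degradation and silently dropped content) suggest a page-at-a-time loop that flags empty results. The sketch below is deliberately provider-agnostic: `transcribe_page` is a hypothetical callable you would implement against whichever model API you use, and the prompt wording is only an assumption.

```python
VERBATIM_PROMPT = (
    "Transcribe this page exactly as printed. "
    "Do not correct, summarize, or omit anything."
)


def transcribe_book(page_images, transcribe_page, prompt=VERBATIM_PROMPT):
    """Transcribe one page per request and report pages that came back empty.

    transcribe_page(image, prompt) -> str is supplied by the caller (hypothetical).
    """
    texts, suspect_pages = [], []
    for number, image in enumerate(page_images, start=1):
        text = transcribe_page(image, prompt) or ""
        if not text.strip():
            # An empty reply may mean a blank page or a filtered response;
            # either way it deserves a manual look.
            suspect_pages.append(number)
        texts.append(text)
    return texts, suspect_pages
```

An empty-output check catches only dropped pages, not altered wording; catching that requires comparing against an independent transcription.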

Hybrid and Workflow Approaches

  • Suggested best practice for high accuracy is to combine classical OCR with LLMs:
    • Do OCR first, then have an LLM clean formatting and correct clear OCR glitches.
    • Or send both the page image and OCR text to an LLM to reconcile differences and avoid hallucinations.
  • Several tools (e.g., zerox, LLMWhisperer, custom scripts) orchestrate page splitting, OCR/LLM calls, and structured output.
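
One cheap way to act on the reconciliation idea is to measure how far the LLM's text drifts from the classical OCR output and send low-agreement pages to a human. The sketch below uses Python's standard difflib; the function names and the 0.8 threshold are assumptions, not something the tools named above are known to do.

```python
import difflib


def agreement(ocr_text, llm_text):
    """Word-level similarity in [0, 1] between OCR output and LLM output."""
    ocr_words = ocr_text.split()  # split() also collapses whitespace differences
    llm_words = llm_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, llm_words, autojunk=False)
    return matcher.ratio()


def needs_review(ocr_text, llm_text, threshold=0.8):
    """Flag a page when the two transcriptions disagree more than expected.

    A few differing words are normal (the LLM fixing OCR glitches); large
    divergence suggests hallucinated, rewritten, or dropped text.
    """
    return agreement(ocr_text, llm_text) < threshold
```

Word-level comparison tolerates reflowed lines, but the threshold has to be tuned per corpus: noisy scans produce lower agreement even when the LLM output is correct.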

File Conversion, Layout, and Ecosystem

  • PDF → EPUB/flowed text remains hard; Calibre’s ebook-convert is widely recommended but imperfect.
  • Tools like Docling, LlamaParse, fixmydocuments, and Mathpix target layout and Markdown/structural recovery.
  • Internet Archive’s upload-and-OCR workflow is praised for convenience and public benefit, but its OCR is often less accurate than LLM-based methods, especially for historical or complex texts.