2025-02-14

Ask HN: What is the best method for turning a scanned book as a PDF into text?

Traditional OCR vs. Simple Text Extraction

Tools like pdftotext and desktop PDF readers work only if the PDF already has a text layer; they fail on pure scans.
Classic OCR tools mentioned: Tesseract (often via OCRmyPDF), Surya, EasyOCR, PaddleOCR, MuPDF-based scripts, Paperless, OCR4All, extractous, ABBYY FineReader, Mathpix (especially for math).
Several users report good results from OCRmyPDF with preprocessing (e.g., ScanTailor) and say it “just works” for many books.
Handwriting is a weak spot for open-source OCR; cloud services (e.g., Google Vision) reportedly outperform them there.
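
A minimal sketch of the "try plain extraction first, OCR only if needed" idea from this section. It assumes the `pdftotext` (poppler) and `ocrmypdf` command-line tools are installed; the helper names and the 50-character threshold are illustrative choices, not something from the thread.

```python
import shutil
import subprocess
from typing import List


def has_text_layer(extracted: str, min_chars: int = 50) -> bool:
    """Heuristic: a pure scan yields almost no non-whitespace characters."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def ocrmypdf_command(src: str, dst: str, lang: str = "eng") -> List[str]:
    """Build an OCRmyPDF call that skips pages already containing text
    and also writes the recognized text to a sidecar .txt file."""
    return ["ocrmypdf", "--skip-text", "-l", lang,
            "--sidecar", dst + ".txt", src, dst]


def extract_text(src: str) -> str:
    """Run pdftotext and return whatever text layer the PDF has."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(["pdftotext", "-layout", src, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def pdf_to_text(src: str, ocr_dst: str) -> str:
    """Use the existing text layer if present, otherwise OCR the scan."""
    text = extract_text(src)
    if has_text_layer(text):
        return text
    subprocess.run(ocrmypdf_command(src, ocr_dst), check=True)
    return extract_text(ocr_dst)
```

Calling `pdf_to_text("book.pdf", "book_ocr.pdf")` would return the embedded text when there is one and fall back to OCR otherwise. Handwritten material is unlikely to benefit, per the comments above.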

Cloud and Commercial OCR Services

Common recommendations: Google Document AI / Vision, AWS Textract, Azure Document Intelligence, Adobe PDF text extraction API, ABBYY FineReader, Mathpix, LlamaParse.
People describe building pipelines: e.g., upload to S3 → trigger Textract → store text → email results.
Some highlight Google’s tools (Document AI, Gemini OCR) as accurate across languages; others note limited flexibility or schema assumptions.
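
The Textract step of the S3 pipeline described above could look roughly like this. It is a sketch under assumptions: it uses the synchronous `detect_document_text` call from `boto3` on a single page image already stored in S3, and it leaves out the S3 trigger, storage, and email steps. Multi-page PDFs need Textract's asynchronous API instead. The function names are hypothetical.

```python
from typing import Any, Dict, List


def lines_from_textract(response: Dict[str, Any]) -> List[str]:
    """Keep only LINE blocks; the response also contains PAGE and WORD blocks."""
    return [block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"]


def ocr_s3_image(bucket: str, key: str, client: Any = None) -> str:
    """OCR one page image stored in S3 and return its text, one line per row."""
    if client is None:
        import boto3  # third-party; needs AWS credentials configured
        client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return "\n".join(lines_from_textract(response))
```

Passing the client in as a parameter keeps the parsing logic testable without AWS access.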

LLMs as OCR Engines

Many advocate multimodal LLMs (Gemini 2.0/Flash, Claude Sonnet, GPT‑4o) as highly accurate, especially when fed one page image at a time.
Reported advantages: better handling of noisy scans and context-aware correction; easy Markdown/styled output.
Concerns:
- Marketing claims about “state of the art” Gemini OCR are seen as overhyped and as holding only for narrow subproblems.
- LLMs can hallucinate or silently change text, which is unacceptable for high‑stakes domains or strict transcription.
- Long-context degradation: users observe lower quality when feeding whole books vs. single pages.
- Possible censorship/safety filters dropping “awkward” content; suggested fix is tuning API safety settings and insisting on verbatim output.
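
To make the page-by-page and verbatim-output advice concrete, here is a small sketch. It assumes poppler's `pdftoppm` is installed for rasterizing pages, and it deliberately leaves the model call abstract: `transcribe_page` is a hypothetical callable you would implement against whichever LLM API you use. The prompt wording is only an example.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

VERBATIM_PROMPT = (
    "Transcribe this page exactly as printed. Do not summarize, correct, "
    "modernize, or omit anything. Mark unreadable text as [illegible]."
)


def render_pages(pdf_path: str, out_dir: str, dpi: int = 300) -> List[Path]:
    """Rasterize every page to a PNG file using pdftoppm."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), "-png", pdf_path, str(out / "page")],
        check=True,
    )
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    return sorted(out.glob("page-*.png"))


def transcribe_pages(
    images: Iterable[bytes],
    transcribe_page: Callable[[bytes, str], str],
) -> List[str]:
    """Send one request per page instead of the whole book at once,
    to sidestep the long-context degradation users reported."""
    return [transcribe_page(image, VERBATIM_PROMPT) for image in images]
```

A caller would read each rendered PNG with `Path.read_bytes()` and pass the bytes to `transcribe_pages`. Any safety-filter settings belong inside the caller's own `transcribe_page` implementation, since they differ per provider.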

Hybrid and Workflow Approaches

Suggested best practice for high accuracy is to combine classical OCR with LLMs:
- Do OCR first, then have an LLM clean formatting and correct clear OCR glitches.
- Or send both the page image and OCR text to an LLM to reconcile differences and avoid hallucinations.
Several tools (e.g., zerox, LLMWhisperer, custom scripts) orchestrate page splitting, OCR/LLM calls, and structured output.
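
One cheap guard for the "OCR first, then LLM cleanup" flow is to check that the LLM's output still resembles the raw OCR text, and to fall back to the OCR text when it does not. This is not a technique named in the thread; it is an illustrative sketch using Python's standard `difflib`, and the 0.8 threshold is an arbitrary assumption that would need tuning.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, llm_text: str) -> float:
    """Character-level similarity in [0, 1], ignoring whitespace differences."""
    a = " ".join(ocr_text.split())
    b = " ".join(llm_text.split())
    return SequenceMatcher(None, a, b).ratio()


def accept_correction(ocr_text: str, llm_text: str,
                      threshold: float = 0.8) -> bool:
    """Accept the LLM version only if it stays close to the OCR text."""
    return similarity(ocr_text, llm_text) >= threshold


def reconcile(ocr_text: str, llm_text: str, threshold: float = 0.8) -> str:
    """Prefer the cleaned text, but keep raw OCR when the LLM drifted too far."""
    if accept_correction(ocr_text, llm_text, threshold):
        return llm_text
    return ocr_text
```

Small fixes such as "teh" to "the" pass the check, while a rewritten or invented passage falls below the threshold and can be queued for manual review instead.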

File Conversion, Layout, and Ecosystem

PDF → EPUB/flowed text remains hard; Calibre’s ebook-convert is widely recommended but imperfect.
Tools like Docling, LlamaParse, fixmydocuments, and Mathpix target layout and Markdown/structural recovery.
Internet Archive’s upload-and-OCR workflow is praised for convenience and public benefit, but its OCR is often less accurate than LLM-based methods, especially for historical or complex texts.
