Ask HN: What is the best method for turning a scanned book as a PDF into text?
Traditional OCR vs. Simple Text Extraction
- Tools like pdftotext and desktop PDF readers work only if the PDF already has a text layer; they fail on pure scans.
- Classic OCR stacks mentioned: Tesseract (often via OCRmyPDF), Surya, EasyOCR, Paddle, MuPDF-based scripts, Paperless, OCR4All, extractous, ABBYY FineReader, Mathpix (especially for math).
- Several users report good results with OCRmyPDF + preprocessing (e.g., Scantailor) and say it “just works” for many books.
- Handwriting is a weak spot for open-source OCR; cloud services (e.g., Google Vision) reportedly outperform them there.
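
The "text layer first, OCR as fallback" idea from the bullets above can be sketched roughly as follows. This is a minimal sketch, assuming the pdftotext (Poppler) and ocrmypdf command-line tools are installed; the function names and the 200-character threshold are illustrative assumptions, not something taken from the thread.

```python
import subprocess
from pathlib import Path


def has_text_layer(extracted: str, min_chars: int = 200) -> bool:
    """Heuristic: the PDF counts as text-bearing if extraction yields at
    least `min_chars` non-whitespace characters (threshold is an assumption)."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract_or_ocr(pdf: Path, workdir: Path) -> str:
    """Try plain extraction first; fall back to OCR for pure scans."""
    # `pdftotext file.pdf -` writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"], capture_output=True, text=True, check=True
    )
    if has_text_layer(result.stdout):
        return result.stdout

    # No usable text layer: run OCRmyPDF and read its sidecar text file.
    # --force-ocr re-OCRs pages even if a stray text layer exists.
    ocr_pdf = workdir / (pdf.stem + ".ocr.pdf")
    sidecar = workdir / (pdf.stem + ".txt")
    subprocess.run(
        ["ocrmypdf", "--force-ocr", "--deskew", "--sidecar", str(sidecar),
         str(pdf), str(ocr_pdf)],
        check=True,
    )
    return sidecar.read_text(encoding="utf-8")
```

Calling `extract_or_ocr(Path("book.pdf"), Path("out"))` would return the embedded text when there is one and the OCR sidecar text otherwise. Preprocessing with a tool such as Scantailor would happen before this step and is not shown.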
Cloud and Commercial OCR Services
- Common recommendations: Google Document AI / Vision, AWS Textract, Azure Document Intelligence, Adobe PDF text extraction API, ABBYY FineReader, Mathpix, Llamaparse.
- People describe building pipelines: e.g., upload to S3 → trigger Textract → store text → email results.
- Some highlight Google’s tools (Document AI, Gemini OCR) as accurate across languages; others note limited flexibility or schema assumptions.
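
The Textract step of such a pipeline might look roughly like the sketch below. It assumes a client that behaves like `boto3.client("textract")` and a PDF already uploaded to S3; the S3 trigger, storage, and email steps are left out, and the function names are illustrative rather than anything posted in the thread.

```python
import time


def lines_from_blocks(blocks):
    """Keep only LINE blocks, in the order Textract returned them."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]


def textract_pdf_to_text(textract, bucket, key, poll_seconds=5.0):
    """Start an asynchronous text-detection job for a PDF in S3 and return its text.

    `textract` is expected to behave like boto3.client("textract").
    """
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS.
    while True:
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines = lines_from_blocks(page.get("Blocks", []))
    while page.get("NextToken"):
        page = textract.get_document_text_detection(
            JobId=job_id, NextToken=page["NextToken"]
        )
        lines.extend(lines_from_blocks(page.get("Blocks", [])))
    return "\n".join(lines)
```

Passing the client in as an argument keeps the function easy to exercise with a stub before any AWS credentials are involved. In a production pipeline, a completion notification would usually replace the polling loop.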
LLMs as OCR Engines
- Many advocate multimodal LLMs (Gemini 2.0/Flash, Claude Sonnet, GPT‑4o) as highly accurate, especially page‑by‑page using images.
- Reported advantages: better handling of noisy scans and context-aware correction; easy Markdown/styled output.
- Concerns:
  - Marketing claims about “state of the art” Gemini OCR are seen as overhyped and limited to subproblems.
  - LLMs can hallucinate or silently change text, which is unacceptable for high‑stakes domains or strict transcription.
  - Long-context degradation: users observe lower quality when feeding whole books vs. single pages.
  - Possible censorship/safety filters dropping “awkward” content; suggested fix is tuning API safety settings and insisting on verbatim output.
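
The page-by-page approach with a verbatim instruction can be sketched without committing to any particular provider. In the sketch below, `transcribe_page` is a hypothetical callable standing in for whichever multimodal API is used (it takes the page image bytes and a prompt and returns text), and the prompt wording is only an assumption about what "insisting on verbatim output" might look like.

```python
from typing import Callable, Iterable, List

# Illustrative prompt; the exact wording is an assumption.
VERBATIM_PROMPT = (
    "Transcribe this scanned page exactly as printed. "
    "Do not correct, summarize, translate, or omit anything. "
    "Mark unreadable spans as [illegible]."
)


def transcribe_book(
    page_images: Iterable[bytes],
    transcribe_page: Callable[[bytes, str], str],
) -> List[str]:
    """Send one page per request instead of the whole book in one context."""
    pages = []
    for number, image in enumerate(page_images, start=1):
        text = transcribe_page(image, VERBATIM_PROMPT)
        if not text.strip():
            # An empty reply may mean a filter dropped the page; keep a
            # visible marker rather than silently losing it.
            text = f"[page {number}: no text returned, needs review]"
        pages.append(text)
    return pages
```

Keeping each request to a single page sidesteps the long-context degradation mentioned above, and the explicit marker makes dropped pages easy to find afterwards.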
Hybrid and Workflow Approaches
- Suggested best practice for high accuracy: combine classical OCR with LLMs:
  - Do OCR first, then have an LLM clean formatting and correct clear OCR glitches.
  - Or send both the page image and OCR text to an LLM to reconcile differences and avoid hallucinations.
- Several tools (e.g., zerox, LLMWhisperer, custom scripts) orchestrate page splitting, OCR/LLM calls, and structured output.
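
One cheap guard in a hybrid workflow is to compare the LLM output against the classical OCR text for each page and flag pages where the two diverge sharply. The sketch below uses Python's standard difflib for this; it is an illustration of the reconciliation idea, not a method described in the thread, and the 0.9 threshold is an arbitrary assumption.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse whitespace so layout differences do not count as edits."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(ocr_text: str, llm_text: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(ocr_text), normalize(llm_text), autojunk=False
    )
    return matcher.ratio()


def pages_needing_review(ocr_pages, llm_pages, threshold=0.9):
    """Return 1-based page numbers where the LLM text drifts far from the OCR text."""
    return [
        number
        for number, (ocr, llm) in enumerate(zip(ocr_pages, llm_pages), start=1)
        if similarity(ocr, llm) < threshold
    ]
```

A corrected OCR glitch such as "qu1ck" to "quick" barely moves the ratio, while a rewritten or invented sentence lowers it substantially, so flagged pages are the ones worth a human look. Very noisy OCR would need a lower threshold.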
File Conversion, Layout, and Ecosystem
- PDF → EPUB/flowed text remains hard; Calibre’s ebook-convert is widely recommended but imperfect.
- Tools like Docling, Llamaparse, fixmydocuments, and Mathpix target layout and Markdown/structural recovery.
- Internet Archive’s upload-and-OCR workflow is praised for convenience and public benefit, but its OCR is often less accurate than LLM-based methods, especially for historical or complex texts.