PDF to Text: A Challenging Problem

Why PDF → Text Is Hard

  • PDFs are primarily an object graph of drawing instructions (glyphs placed at coordinates), not a text/markup format. Logical structure (paragraphs, headings, tables) is usually absent, or present only as optional “extra” metadata (see the sketch after this list).
  • The same visual document can be encoded in many different ways, depending on the authoring tool (graphics suite vs. word processor, etc.).
  • Tables, headers/footers, multi-column layouts, nested boxes, and arbitrary positioning are often just loose collections of text and lines; they only look structured when rendered.
  • Fonts may map character codes to arbitrary glyphs, so the internal text can differ from what is visibly rendered; some PDFs also contain hidden or white-on-white text.
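
To make the first point concrete, here is a minimal sketch using pdfminer.six (one of the tools discussed below): what an extractor actually sees is individual glyphs with coordinates and font names, and everything else has to be inferred. The file name is a placeholder.

```python
# Minimal sketch with pdfminer.six: the parser exposes individual glyphs
# with bounding boxes and font names; "paragraph" and "heading" are not
# concepts it can see. "example.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("example.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for obj in text_line:
                    if isinstance(obj, LTChar):
                        # e.g. 'H' (56.8, 708.1, 66.3, 720.0) 'Helvetica-Bold'
                        print(obj.get_text(), obj.bbox, obj.fontname)
```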

Traditional Tools and Heuristics

  • Poppler’s pdftotext / pdftohtml, pdfminer.six, and PDFium-based tools are reported as fast and “serviceable”, but they differ in how they handle paragraph breaks and layout fidelity.
  • Some users convert to HTML and reconstruct structure using element coordinates (e.g., x-positions for columns).
  • Others rely on geometric clustering: treat everything as positioned geometry and use spacing to infer word/paragraph breaks and reading order (a sketch of this approach follows this list).
  • Many workflows resort to OCR and segmentation instead of relying on internal PDF text, especially for complex or scanned documents.
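
A toy illustration of that geometric-clustering approach, assuming word boxes have already been extracted by some tool; the Word type and the thresholds are illustrative, not taken from any particular library.

```python
# Toy sketch of geometric clustering: given word boxes (from any extractor),
# use vertical proximity to form lines and horizontal gaps to guess
# word vs. column breaks. Thresholds are arbitrary illustrative values.
from dataclasses import dataclass

@dataclass
class Word:
    x0: float   # left edge
    y0: float   # bottom edge (PDF y grows upward)
    x1: float   # right edge
    text: str

def group_into_lines(words, y_tol=2.0):
    """Cluster words whose bottoms are within y_tol points into one visual line."""
    lines = []
    for w in sorted(words, key=lambda w: (-w.y0, w.x0)):  # top-to-bottom
        if lines and abs(lines[-1][0].y0 - w.y0) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return lines

def line_to_text(line, col_gap=18.0):
    """Join a line left to right; a gap wider than col_gap points is treated
    as a column or table-cell break (tab), otherwise as a plain space."""
    line = sorted(line, key=lambda w: w.x0)
    parts = [line[0].text]
    for prev, w in zip(line, line[1:]):
        parts.append("\t" if w.x0 - prev.x1 > col_gap else " ")
        parts.append(w.text)
    return "".join(parts)

def page_to_text(words):
    return "\n".join(line_to_text(line) for line in group_into_lines(words))
```

This is exactly the kind of heuristic that works well on clean single-column text and starts to misfire on dense tables and multi-column layouts.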

ML, Vision Models, and LLMs

  • ML-based tools (e.g., DocLayNet + YOLO, Docling/SmolDocling, Mistral OCR, specialized pipelines) segment pages into regions (text, tables, images, formulas) and then run OCR on them. This yields strong results but is compute-heavy (a sketch of such a pipeline follows this list).
  • Vision LLMs (Gemini, Claude, OpenAI, etc.) can read PDFs as images and often perform impressively on simple pages, but:
    • Hallucinate, especially on complex tables and nested layouts.
    • Have trouble with long documents and global structure.
    • Are costly at scale (e.g., a 1 TB corpus for a search engine), making them impractical for some use cases.
  • Some argue that old-school (non-LLM) ML trained on good labeled data might compete well with hand-written heuristics.
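
For a sense of what the segmentation-then-OCR pipelines look like, here is a hedged sketch using the ultralytics YOLO API and pytesseract. The DocLayNet-trained weights file and the pre-rendered page image are assumptions for illustration, not artifacts any of the mentioned tools ship.

```python
# Hedged sketch of a layout-segmentation + OCR pipeline.
# Assumptions: "doclaynet_yolo.pt" is a YOLO checkpoint fine-tuned on
# DocLayNet classes, and "page-001.png" is a page already rendered to an
# image (e.g., with pdftoppm).
from PIL import Image
from ultralytics import YOLO
import pytesseract

model = YOLO("doclaynet_yolo.pt")
page = Image.open("page-001.png")

result = model(page)[0]                      # one Results object per image
regions = []
for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
    label = result.names[int(cls)]           # e.g. "Text", "Table", "Picture"
    crop = page.crop(tuple(int(v) for v in box))
    text = "" if label == "Picture" else pytesseract.image_to_string(crop)
    regions.append({"label": label, "bbox": box, "text": text})

# Rough reading order: top-to-bottom, then left-to-right by box origin.
regions.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))
for r in regions:
    print(r["label"], r["text"][:60].replace("\n", " "))
```

The per-page model inference is exactly the compute cost that makes this approach hard to justify for very large corpora.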

Scale, Use Cases, and Reliability

  • For massive corpora (millions of PDFs), CPU-only pipelines built on heuristics or classical ML are favored over GPU-bound vision models.
  • Business and legal workflows need structured extraction (tables, fields, dates) and high reliability; VLMs are seen as too error-prone for this today (an example of heuristic table extraction follows this list).
  • Accessibility adds another dimension: semantics (tables, math, headings) must be recovered for arbitrary PDFs, without sending data to the cloud.
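
As a concrete example of the structured-extraction side, heuristic table extraction is what libraries like pdfplumber (listed below) provide; a minimal sketch, with the file name as a placeholder:

```python
# Minimal sketch of heuristic table extraction with pdfplumber.
# "contract.pdf" is a placeholder; real documents usually need tuned
# table_settings (line-based vs. text/whitespace-based strategies).
import pdfplumber

with pdfplumber.open("contract.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"--- table on page {page_number} ---")
            for row in table:
                # Cells come back as strings, or None for empty cells.
                print([cell or "" for cell in row])
```

Such heuristics are deterministic and cheap, which is part of why they are still preferred over VLMs in reliability-sensitive workflows.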

Ecosystem, Tools, and Alternatives

  • Many tools are mentioned: pdf.js (rendering + text extraction), pdf-table-extractor, pdfplumber, Cloudflare’s ai.toMarkdown(), ocrmypdf, Azure Document Intelligence, docTR, pdftotext wrappers, Marker/AlcheMark, custom libraries like “natural-pdf”.
  • There’s demand for “PDF dev tools” akin to browser inspectors: a live mapping between content-stream operators (BT/TJ, etc.) and the regions they render (a minimal operator dump is sketched after this list).
  • Suggestions like embedding the original editable source in PDFs or enforcing Tagged PDF could help, but depend on author incentives and legacy content.
  • Several comments defend PDF as an excellent “digital paper” format; the core issue is using it as a machine-readable data container, which it was never designed to be.
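
In the spirit of the “PDF dev tools” idea above, dumping the raw content-stream operators is already straightforward (here with pikepdf; the file name is a placeholder); the missing piece is the live, inspector-style mapping from each instruction to the region it paints.

```python
# Minimal sketch: print the raw content-stream instructions (BT, Tf, Td,
# Tj/TJ, ...) for the first page using pikepdf. "example.pdf" is a
# placeholder; linking each instruction to an on-screen region would
# additionally require a renderer.
import pikepdf

with pikepdf.open("example.pdf") as pdf:
    page = pdf.pages[0]
    for operands, operator in pikepdf.parse_content_stream(page):
        print(operator, operands)
```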