PDF to Text: A Challenging Problem

Why PDF → Text Is Hard

  • PDFs are primarily an object graph of drawing instructions (glyphs placed at coordinates), not a text/markup format. Logical structure (paragraphs, headings, tables) is usually absent, or present only as optional “extra” metadata (see the sketch after this list).
  • The same visual document can be encoded in many different ways, depending on the authoring tool (graphics suite vs. word processor, etc.).
  • Tables, headers/footers, multi-column layouts, nested boxes, and arbitrary positioning are often just loose collections of text and lines; they only look structured when rendered.
  • Fonts may map character codes to arbitrary glyphs, so the internal text can differ from what is visibly rendered; some PDFs also contain hidden or white-on-white text.
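
To make the first point concrete, here is a minimal sketch using pdfminer.six (one of the tools discussed below): what an extractor actually sees is individual glyphs with coordinates and font names, and everything else has to be inferred. The file name is a placeholder.

```python
# Minimal sketch with pdfminer.six: the parser exposes individual glyphs
# with bounding boxes and font names; "paragraph" and "heading" are not
# concepts it can see. "example.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("example.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for obj in text_line:
                    if isinstance(obj, LTChar):
                        # e.g. 'H' (56.8, 708.1, 66.3, 720.0) 'Helvetica-Bold'
                        print(obj.get_text(), obj.bbox, obj.fontname)
```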

Traditional Tools and Heuristics

  • Poppler’s pdftotext / pdftohtml, pdfminer.six, and PDFium-based tools are reported as fast and “serviceable”, but they differ in how they handle paragraph breaks and layout fidelity.
  • Some users convert to HTML and reconstruct structure using element coordinates (e.g., x-positions for columns).
  • Others rely on geometric clustering: treat everything as positioned geometry and use spacing to infer word/paragraph breaks and reading order (a sketch of this approach follows this list).
  • Many workflows resort to OCR and segmentation instead of relying on internal PDF text, especially for complex or scanned documents.
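
A toy illustration of that geometric-clustering approach, assuming word boxes have already been extracted by some tool; the Word type and the thresholds are illustrative, not taken from any particular library.

```python
# Toy sketch of geometric clustering: given word boxes (from any extractor),
# use vertical proximity to form lines and horizontal gaps to guess
# word vs. column breaks. Thresholds are arbitrary illustrative values.
from dataclasses import dataclass

@dataclass
class Word:
    x0: float   # left edge
    y0: float   # bottom edge (PDF y grows upward)
    x1: float   # right edge
    text: str

def group_into_lines(words, y_tol=2.0):
    """Cluster words whose bottoms are within y_tol points into one visual line."""
    lines = []
    for w in sorted(words, key=lambda w: (-w.y0, w.x0)):  # top-to-bottom
        if lines and abs(lines[-1][0].y0 - w.y0) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return lines

def line_to_text(line, col_gap=18.0):
    """Join a line left to right; a gap wider than col_gap points is treated
    as a column or table-cell break (tab), otherwise as a plain space."""
    line = sorted(line, key=lambda w: w.x0)
    parts = [line[0].text]
    for prev, w in zip(line, line[1:]):
        parts.append("\t" if w.x0 - prev.x1 > col_gap else " ")
        parts.append(w.text)
    return "".join(parts)

def page_to_text(words):
    return "\n".join(line_to_text(line) for line in group_into_lines(words))
```

This is exactly the kind of heuristic that works well on clean single-column text and starts to misfire on dense tables and multi-column layouts.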

ML, Vision Models, and LLMs

  • ML-based tools (e.g., DocLayNet + YOLO, Docling/SmolDocling, Mistral OCR, specialized pipelines) segment pages into regions (text, tables, images, formulas) and then run OCR on them. This yields strong results but is compute-heavy (a sketch of such a pipeline follows this list).
  • Vision LLMs (Gemini, Claude, OpenAI, etc.) can read PDFs as images and often perform impressively on simple pages, but:
    • Hallucinate, especially on complex tables and nested layouts.
    • Have trouble with long documents and global structure.
    • Are costly at scale (e.g., a 1 TB corpus for a search engine), making them impractical for some use cases.
  • Some argue that old-school (non-LLM) ML trained on good labeled data might compete well with hand-written heuristics.
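
For a sense of what the segmentation-then-OCR pipelines look like, here is a hedged sketch using the ultralytics YOLO API and pytesseract. The DocLayNet-trained weights file and the pre-rendered page image are assumptions for illustration, not artifacts any of the mentioned tools ship.

```python
# Hedged sketch of a layout-segmentation + OCR pipeline.
# Assumptions: "doclaynet_yolo.pt" is a YOLO checkpoint fine-tuned on
# DocLayNet classes, and "page-001.png" is a page already rendered to an
# image (e.g., with pdftoppm).
from PIL import Image
from ultralytics import YOLO
import pytesseract

model = YOLO("doclaynet_yolo.pt")
page = Image.open("page-001.png")

result = model(page)[0]                      # one Results object per image
regions = []
for box, cls in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
    label = result.names[int(cls)]           # e.g. "Text", "Table", "Picture"
    crop = page.crop(tuple(int(v) for v in box))
    text = "" if label == "Picture" else pytesseract.image_to_string(crop)
    regions.append({"label": label, "bbox": box, "text": text})

# Rough reading order: top-to-bottom, then left-to-right by box origin.
regions.sort(key=lambda r: (r["bbox"][1], r["bbox"][0]))
for r in regions:
    print(r["label"], r["text"][:60].replace("\n", " "))
```

The per-page model inference is exactly the compute cost that makes this approach hard to justify for very large corpora.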

Scale, Use Cases, and Reliability

  • For massive corpora (millions of PDFs), CPU-only pipelines built on heuristics or classical ML are favored over GPU-bound vision models.
  • Business and legal workflows need structured extraction (tables, fields, dates) and high reliability; VLMs are seen as too error-prone for this today (an example of heuristic table extraction follows this list).
  • Accessibility adds another dimension: semantics (tables, math, headings) must be recovered for arbitrary PDFs, without sending data to the cloud.
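
As a concrete example of the structured-extraction side, heuristic table extraction is what libraries like pdfplumber (listed below) provide; a minimal sketch, with the file name as a placeholder:

```python
# Minimal sketch of heuristic table extraction with pdfplumber.
# "contract.pdf" is a placeholder; real documents usually need tuned
# table_settings (line-based vs. text/whitespace-based strategies).
import pdfplumber

with pdfplumber.open("contract.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"--- table on page {page_number} ---")
            for row in table:
                # Cells come back as strings, or None for empty cells.
                print([cell or "" for cell in row])
```

Such heuristics are deterministic and cheap, which is part of why they are still preferred over VLMs in reliability-sensitive workflows.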

Ecosystem, Tools, and Alternatives

  • Many tools are mentioned: pdf.js (rendering + text extraction), pdf-table-extractor, pdfplumber, Cloudflare’s ai.toMarkdown(), ocrmypdf, Azure Document Intelligence, docTR, pdftotext wrappers, Marker/AlcheMark, custom libraries like “natural-pdf”.
  • There’s demand for “PDF dev tools” akin to browser inspectors: a live mapping between content-stream operators (BT/TJ, etc.) and the regions they render (a minimal operator dump is sketched after this list).
  • Suggestions like embedding the original editable source in PDFs or enforcing Tagged PDF could help, but depend on author incentives and legacy content.
  • Several comments defend PDF as an excellent “digital paper” format; the core issue is using it as a machine-readable data container, which it was never designed to be.
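
In the spirit of the “PDF dev tools” idea above, dumping the raw content-stream operators is already straightforward (here with pikepdf; the file name is a placeholder); the missing piece is the live, inspector-style mapping from each instruction to the region it paints.

```python
# Minimal sketch: print the raw content-stream instructions (BT, Tf, Td,
# Tj/TJ, ...) for the first page using pikepdf. "example.pdf" is a
# placeholder; linking each instruction to an on-screen region would
# additionally require a renderer.
import pikepdf

with pikepdf.open("example.pdf") as pdf:
    page = pdf.pages[0]
    for operands, operator in pikepdf.parse_content_stream(page):
        print(operator, operands)
```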