PDF to Text: A Challenging Problem
Why PDF → Text Is Hard
- PDFs are primarily an object graph of drawing instructions (glyphs placed at coordinates), not a text/markup format. Logical structure (paragraphs, headings, tables) is often absent, or present only as optional “extra” metadata (see the sketch after this list).
- The same visual document can be encoded many different ways, depending on the authoring tool (graphics suite vs word processor, etc.).
- Tables, headers/footers, multi-column layouts, nested boxes, and arbitrary positioning are often just loose collections of text and lines; they only look structured when rendered.
- Fonts may map character codes to arbitrary glyphs, so a PDF’s internal text can differ from what is visibly rendered; some documents even contain hidden or white-on-white text.
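A minimal sketch of that first point, using pdfminer.six (the file name is a placeholder): every “letter” on a page is just a glyph at a coordinate in some font, and any notion of words or paragraphs has to be reconstructed afterwards.

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue  # skip lines, rectangles, images, ...
        for text_line in element:
            for obj in text_line:
                if isinstance(obj, LTChar):
                    # Each character is just a glyph positioned at (x0, y0);
                    # "paragraph" and "heading" are not concepts at this level.
                    print(f"{obj.get_text()!r} at ({obj.x0:.1f}, {obj.y0:.1f}) "
                          f"in {obj.fontname}")
```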
Traditional Tools and Heuristics
- Poppler’s pdftotext/pdftohtml, pdfminer.six, and PDFium-based tools are reported as fast and “serviceable”, but they differ in how they handle paragraph breaks and layout fidelity.
- Some users convert to HTML and reconstruct structure from element coordinates (e.g., x-positions for columns).
- Others rely on geometric clustering: treat everything as geometry and use spacing to infer word/paragraph breaks and reading order (see the sketch after this list).
- Many workflows resort to OCR and segmentation instead of relying on internal PDF text, especially for complex or scanned documents.
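A rough sketch of the geometric-clustering idea, here using pdfplumber; the file name and the gap thresholds are arbitrary assumptions, and real pipelines tune them per corpus.

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    # Word boxes only: dicts with "text", "x0", "x1", "top", "bottom", ...
    words = sorted(page.extract_words(), key=lambda w: (round(w["top"]), w["x0"]))

    # Group words into lines: a jump in "top" of more than ~3 pt starts a new line.
    lines, current, last_top = [], [], None
    for w in words:
        if last_top is not None and w["top"] - last_top > 3:
            lines.append(current)
            current = []
        current.append(w)
        last_top = w["top"]
    if current:
        lines.append(current)

    # Guess paragraph breaks: a vertical gap much larger than one line height.
    paragraphs, para, prev = [], [], None
    for line in lines:
        if prev is not None:
            line_height = prev[0]["bottom"] - prev[0]["top"]
            if line[0]["top"] - prev[0]["top"] > 1.8 * line_height:
                paragraphs.append(" ".join(para))
                para = []
        para.append(" ".join(w["text"] for w in line))
        prev = line
    if para:
        paragraphs.append(" ".join(para))

    print("\n\n".join(paragraphs))
```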
ML, Vision Models, and LLMs
- ML-based tools (e.g., DocLayNet + YOLO, Docling/SmolDocling, Mistral OCR, specialized pipelines) segment pages into text, tables, images, and formulas, then run OCR. This yields strong results but is compute-heavy.
- Vision LLMs (Gemini, Claude, OpenAI, etc.) can read PDFs as rendered page images (see the sketch after this list) and often perform impressively on simple pages, but they:
  - Hallucinate, especially on complex tables and nested layouts.
  - Have trouble with long documents and global structure.
  - Are costly at scale (e.g., over a 1 TB corpus for a search engine), making them impractical for some use cases.
- Some argue old-school ML (non-LLM) on good labeled data might compete well with hand-written heuristics.
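A hedged sketch of the vision-LLM route: rasterize one page with pdf2image (which needs Poppler installed) and send it to a vision-capable chat model via the OpenAI Python SDK. The model name, prompt, DPI, and file name are placeholder assumptions, and the cost and hallucination caveats above apply in full.

```python
import base64
import io

from pdf2image import convert_from_path
from openai import OpenAI

# Render the first page of the PDF to a PNG data URL.
page_image = convert_from_path("report.pdf", dpi=200, first_page=1, last_page=1)[0]
buf = io.BytesIO()
page_image.save(buf, format="PNG")
data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

client = OpenAI()  # expects OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; placeholder choice
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this page to Markdown, preserving tables."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(resp.choices[0].message.content)
```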
Scale, Use Cases, and Reliability
- For massive corpora (millions of PDFs), CPU-only, heuristic-heavy or classical-ML pipelines are favored over GPU vision models (a minimal batch sketch follows this list).
- Business and legal workflows need structured extraction (tables, fields, dates) and high reliability; VLMs are seen as too error-prone today.
- Accessibility adds another dimension: assistive tools must recover semantics (tables, math, headings) from arbitrary PDFs without sending data to the cloud.
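A minimal sketch of the CPU-only, at-scale approach: shell out to Poppler’s pdftotext across a corpus with a process pool. The corpus path, worker count, and timeout are placeholders; real pipelines add error handling, deduplication, and quality checks.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def extract(pdf_path: Path) -> Path:
    """Run Poppler's pdftotext on one file; -layout keeps a rough column layout."""
    out_path = pdf_path.with_suffix(".txt")
    subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), str(out_path)],
        check=True,
        timeout=120,
    )
    return out_path

if __name__ == "__main__":
    pdfs = sorted(Path("corpus").rglob("*.pdf"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        for out in pool.map(extract, pdfs):
            print("wrote", out)
```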
Ecosystem, Tools, and Alternatives
- Many tools are mentioned: pdf.js (rendering + text extraction), pdf-table-extractor, pdfplumber, Cloudflare’s ai.toMarkdown(), ocrmypdf, Azure Document Intelligence, docTR, pdftotext wrappers, Marker/AlcheMark, and custom libraries like “natural-pdf”.
- There’s demand for “PDF dev tools” akin to browser inspectors: a live mapping between content-stream operators (BT/TJ, etc.) and the regions they render on the page (a small inspection sketch follows this list).
- Suggestions like embedding the original editable source in PDFs or enforcing Tagged PDF could help, but depend on author incentives and legacy content.
- Several comments defend PDF as an excellent “digital paper” format; the core issue is using it as a machine-readable data container, which it was never designed to be.
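As a starting point for that kind of inspection tooling, a sketch that dumps a page’s text-related content-stream operators, assuming pikepdf’s parse_content_stream helper; the file name is a placeholder, and mapping operators back to rendered regions would additionally require tracking the text and transformation matrices.

```python
import pikepdf

with pikepdf.open("report.pdf") as pdf:
    page = pdf.pages[0]
    # Walk the raw content stream: BT/ET bracket text objects, Tf sets the font,
    # Td/Tm move the text cursor, and Tj/TJ actually paint glyph strings.
    text_ops = {"BT", "ET", "Tf", "Td", "Tm", "Tj", "TJ"}
    for instruction in pikepdf.parse_content_stream(page):
        op = str(instruction.operator)
        if op in text_ops:
            print(op, [str(o) for o in instruction.operands])
```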