So you want to parse a PDF?

PDF structure, streaming, and corruption

  • The trailer dictionary and startxref pointer sit at the very end of the file, which makes naïve streaming hard; linearized PDFs exist to enable first-page rendering without a full download.
  • HTTP range requests can still support streaming: fetch the end of the file for the xref, then only the byte ranges you need, at the cost of a couple of extra round trips (see the range-request sketch after this list).
  • Real-world PDFs frequently have broken incremental-save chains: /Prev offsets that are wrong, out of bounds, or inconsistent. Robust parsers fall back to brute-force scanning for obj tokens and reconstruct the xref table (sketched below).
  • Newer PDF versions (1.5+) add cross-reference streams and object streams, typically compressed; offsets may point into compressed structures, further complicating parsing.
  • Some libraries choose recovery-first designs, accepting slower throughput in exchange for surviving malformed files.
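
A minimal sketch of the tail-first read and the brute-force fallback described above, assuming a local file named example.pdf; a real parser would also follow /Prev chains, parse the trailer dictionary, and handle xref streams.

```python
import re

def find_startxref(buf: bytes) -> int | None:
    """Look for the 'startxref <offset>' footer near the end of the file."""
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF", buf[-2048:])
    return int(m.group(1)) if m else None

def scan_objects(buf: bytes) -> dict[tuple[int, int], int]:
    """Fallback: brute-force scan for 'N G obj' tokens and rebuild an offset table."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", buf)
    }

with open("example.pdf", "rb") as f:   # placeholder file name
    data = f.read()

offset = find_startxref(data)
if offset is None or offset >= len(data):
    # startxref missing or pointing out of bounds: reconstruct from object tokens instead
    offsets = scan_objects(data)
```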
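
And a sketch of the range-request approach, assuming the server honors HTTP Range headers (206 responses) and reports Content-Length; the URL is a placeholder.

```python
import requests

URL = "https://example.com/big.pdf"   # placeholder URL

# First round trip: learn the file size, then fetch only the tail that holds
# startxref and the trailer.
size = int(requests.head(URL, timeout=10).headers["Content-Length"])
tail = requests.get(
    URL, headers={"Range": f"bytes={size - 2048}-{size - 1}"}, timeout=10
).content

def fetch_range(start: int, length: int) -> bytes:
    """Subsequent round trips: fetch exactly the byte ranges the xref points at."""
    r = requests.get(
        URL, headers={"Range": f"bytes={start}-{start + length - 1}"}, timeout=10
    )
    r.raise_for_status()
    return r.content
```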

“PDF hell”: complexity and fragility

  • Many commenters stress how deceptively hard PDF is: weird mix of text and binary, multiple compression layers, various font encodings, and decades of buggy producers.
  • The internal “structure” of text is often just glyphs with arbitrary numeric codes, sometimes reversed or split into individual letters; ligature code points (e.g., “ﬀ” for “ff”) confuse downstream parsers (see the normalization sketch after this list).
  • PDFs may contain only images, paths used as text, hidden or overwritten text, rotated pages, watermarks, or partially OCR’d layers.
  • Large-scale tests show many libraries either fail to parse a nontrivial fraction of real PDFs or are 1–2 orders of magnitude slower than the fastest ones.
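
Downstream code often has to undo ligatures after extraction; Unicode compatibility normalization covers the common cases (the sample string is made up):

```python
import unicodedata

extracted = "The oﬃce eﬀect"   # extractor kept the ﬃ / ﬀ ligature code points
print(unicodedata.normalize("NFKC", extracted))   # -> "The office effect"
```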

Raster/OCR vs direct PDF parsing

  • One camp converts each page to an image, then uses OCR or vision/multimodal LLMs to recover text, layout, and tables (a minimal raster-then-OCR sketch follows this list).
    • Arguments for: works uniformly on scanned/image-only PDFs; bypasses broken encodings and bizarre layouts; models approximate human reading order; easier to ship quickly.
  • The opposing camp calls this “absurd”: if you can render to pixels, you’ve already solved parsing, so render to structured data (text/SVG/XML) instead and avoid OCR errors, hallucinations, and heavy compute.
    • They report high accuracy and efficiency using renderers plus custom geometry-based algorithms to rebuild words, lines, and blocks (see the geometry-grouping sketch after this list).
  • Middle ground: direct parsing can be superior for well-behaved, known sources; pure CV is often more robust for heterogeneous, adversarial, or legacy corpora. Many real systems are hybrids (PDF metadata + layout models + OCR for images).
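
A minimal sketch of the raster-then-OCR route, assuming the pdf2image and pytesseract packages (which wrap Poppler and Tesseract) and a placeholder file name; production pipelines layer table/layout models or multimodal LLMs on top.

```python
from pdf2image import convert_from_path   # needs the Poppler binaries installed
import pytesseract                        # needs the Tesseract binary installed

pages = convert_from_path("scan.pdf", dpi=300)   # render each page to a PIL image
text_per_page = [pytesseract.image_to_string(p) for p in pages]
print(text_per_page[0])
```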
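
And a sketch of the renderer-plus-geometry route, here using PyMuPDF's word extraction and a naive vertical-clustering pass; the tolerance and file name are illustrative assumptions, not anyone's published algorithm.

```python
import fitz  # PyMuPDF

def lines_from_page(page, y_tol: float = 3.0):
    """Group extracted words into visual lines by their vertical position."""
    # get_text("words") yields (x0, y0, x1, y1, word, block_no, line_no, word_no)
    words = page.get_text("words")
    words.sort(key=lambda w: (w[1], w[0]))   # top-to-bottom, then left-to-right
    lines, current, last_y = [], [], None
    for w in words:
        if last_y is not None and abs(w[1] - last_y) > y_tol:
            lines.append(" ".join(current))
            current = []
        current.append(w[4])
        last_y = w[1]
    if current:
        lines.append(" ".join(current))
    return lines

doc = fitz.open("report.pdf")   # placeholder file name
for line in lines_from_page(doc[0]):
    print(line)
```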

Use cases, tooling, and alternatives

  • Pain points: bank statements, invoices, resumes, complex magazines/catalogs, forms, and financial documents where CSV exports or APIs are missing or crippled.
  • Some banking ecosystems expose proper APIs; others rely solely on PDFs, sometimes deliberately hindering analysis.
  • Tagged PDF, PDF/A, PDF/UA, embedded metadata, and digital signatures can make PDFs machine- and accessibility-friendly, but they are inconsistently used and ignored by vision-only approaches.
  • Suggested tools and approaches: Poppler (pdftotext, pdftocairo), MuPDF/mutool, pdfgrep, Ghostscript-based PDF/A converters, and layout-analysis frameworks like PdfPig or Docling (a minimal pdftotext wrapper is sketched at the end of this section).
  • Several commercial APIs/SDKs pitch “PDF in, structured JSON out,” often combining structural parsing with computer vision.
  • Broader sentiment: PDF is “digital paper,” great for fixed layout, terrible as a primary data format; some hope future workflows adopt content-first formats (Markdown, HTML/EPUB, XML/ODF) with PDFs as derived views only.
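
For quick experiments, Poppler's pdftotext alone goes a long way; a minimal wrapper, with a placeholder input path ("-" sends the output to stdout):

```python
import subprocess

def pdf_to_text(path: str) -> str:
    """Run Poppler's pdftotext with layout preservation, capturing stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, check=True, encoding="utf-8",
    )
    return result.stdout

print(pdf_to_text("statement.pdf"))   # placeholder file name
```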