So you want to parse a PDF?

PDF structure, streaming, and corruption

  • The trailer dictionary and startxref pointer sit at the very end of the file, which makes naïve streaming hard; linearized PDFs exist to enable first-page rendering without a full download.
  • HTTP range requests can still support streaming: fetch the end of the file for the xref, then only the byte ranges you need, at the cost of a couple of extra round trips (see the range-request sketch after this list).
  • Real-world PDFs frequently have broken incremental-save chains: /Prev offsets that are wrong, out of bounds, or inconsistent. Robust parsers fall back to brute-force scanning for obj tokens and reconstruct the xref table (sketched below).
  • Newer PDF versions (1.5+) add cross-reference streams and object streams, typically compressed; offsets may point into compressed structures, further complicating parsing.
  • Some libraries choose recovery-first designs, accepting slower throughput in exchange for surviving malformed files.
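
A minimal sketch of the tail-first read and the brute-force fallback described above, assuming a local file named example.pdf; a real parser would also follow /Prev chains, parse the trailer dictionary, and handle xref streams.

```python
import re

def find_startxref(buf: bytes) -> int | None:
    """Look for the 'startxref <offset>' footer near the end of the file."""
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF", buf[-2048:])
    return int(m.group(1)) if m else None

def scan_objects(buf: bytes) -> dict[tuple[int, int], int]:
    """Fallback: brute-force scan for 'N G obj' tokens and rebuild an offset table."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b", buf)
    }

with open("example.pdf", "rb") as f:   # placeholder file name
    data = f.read()

offset = find_startxref(data)
if offset is None or offset >= len(data):
    # startxref missing or pointing out of bounds: reconstruct from object tokens instead
    offsets = scan_objects(data)
```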
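
And a sketch of the range-request approach, assuming the server honors HTTP Range headers (206 responses) and reports Content-Length; the URL is a placeholder.

```python
import requests

URL = "https://example.com/big.pdf"   # placeholder URL

# First round trip: learn the file size, then fetch only the tail that holds
# startxref and the trailer.
size = int(requests.head(URL, timeout=10).headers["Content-Length"])
tail = requests.get(
    URL, headers={"Range": f"bytes={size - 2048}-{size - 1}"}, timeout=10
).content

def fetch_range(start: int, length: int) -> bytes:
    """Subsequent round trips: fetch exactly the byte ranges the xref points at."""
    r = requests.get(
        URL, headers={"Range": f"bytes={start}-{start + length - 1}"}, timeout=10
    )
    r.raise_for_status()
    return r.content
```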

“PDF hell”: complexity and fragility

  • Many commenters stress how deceptively hard PDF is: weird mix of text and binary, multiple compression layers, various font encodings, and decades of buggy producers.
  • The internal “structure” of text is often just glyphs with arbitrary numeric codes, sometimes reversed or split into individual letters; ligature code points (e.g., “ﬀ” for “ff”) confuse downstream parsers (see the normalization sketch after this list).
  • PDFs may contain only images, paths used as text, hidden or overwritten text, rotated pages, watermarks, or partially OCR’d layers.
  • Large-scale tests show many libraries either fail to parse a nontrivial fraction of real PDFs or are 1–2 orders of magnitude slower than the fastest ones.
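
Downstream code often has to undo ligatures after extraction; Unicode compatibility normalization covers the common cases (the sample string is made up):

```python
import unicodedata

extracted = "The oﬃce eﬀect"   # extractor kept the ﬃ / ﬀ ligature code points
print(unicodedata.normalize("NFKC", extracted))   # -> "The office effect"
```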

Raster/OCR vs direct PDF parsing

  • One camp converts each page to an image, then uses OCR or vision/multimodal LLMs to recover text, layout, and tables (a minimal raster-then-OCR sketch follows this list).
    • Arguments for: works uniformly on scanned/image-only PDFs; bypasses broken encodings and bizarre layouts; models approximate human reading order; easier to ship quickly.
  • The opposing camp calls this “absurd”: if you can render to pixels, you’ve already solved parsing, so render to structured data (text/SVG/XML) instead and avoid OCR errors, hallucinations, and heavy compute.
    • They report high accuracy and efficiency using renderers plus custom geometry-based algorithms to rebuild words, lines, and blocks (see the geometry-grouping sketch after this list).
  • Middle ground: direct parsing can be superior for well-behaved, known sources; pure CV is often more robust for heterogeneous, adversarial, or legacy corpora. Many real systems are hybrids (PDF metadata + layout models + OCR for images).
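
A minimal sketch of the raster-then-OCR route, assuming the pdf2image and pytesseract packages (which wrap Poppler and Tesseract) and a placeholder file name; production pipelines layer table/layout models or multimodal LLMs on top.

```python
from pdf2image import convert_from_path   # needs the Poppler binaries installed
import pytesseract                        # needs the Tesseract binary installed

pages = convert_from_path("scan.pdf", dpi=300)   # render each page to a PIL image
text_per_page = [pytesseract.image_to_string(p) for p in pages]
print(text_per_page[0])
```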
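
And a sketch of the renderer-plus-geometry route, here using PyMuPDF's word extraction and a naive vertical-clustering pass; the tolerance and file name are illustrative assumptions, not anyone's published algorithm.

```python
import fitz  # PyMuPDF

def lines_from_page(page, y_tol: float = 3.0):
    """Group extracted words into visual lines by their vertical position."""
    # get_text("words") yields (x0, y0, x1, y1, word, block_no, line_no, word_no)
    words = page.get_text("words")
    words.sort(key=lambda w: (w[1], w[0]))   # top-to-bottom, then left-to-right
    lines, current, last_y = [], [], None
    for w in words:
        if last_y is not None and abs(w[1] - last_y) > y_tol:
            lines.append(" ".join(current))
            current = []
        current.append(w[4])
        last_y = w[1]
    if current:
        lines.append(" ".join(current))
    return lines

doc = fitz.open("report.pdf")   # placeholder file name
for line in lines_from_page(doc[0]):
    print(line)
```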

Use cases, tooling, and alternatives

  • Pain points: bank statements, invoices, resumes, complex magazines/catalogs, forms, and financial documents where CSV exports or APIs are missing or crippled.
  • Some banking ecosystems expose proper APIs; others rely solely on PDFs, sometimes deliberately hindering analysis.
  • Tagged PDF, PDF/A, PDF/UA, embedded metadata, and digital signatures can make PDFs machine- and accessibility-friendly, but they are inconsistently used and ignored by vision-only approaches.
  • Suggested tools and approaches: Poppler (pdftotext, pdftocairo), MuPDF/mutool, pdfgrep, Ghostscript-based PDF/A converters, and layout-analysis frameworks like PdfPig or Docling (a minimal pdftotext wrapper is sketched at the end of this section).
  • Several commercial APIs/SDKs pitch “PDF in, structured JSON out,” often combining structural parsing with computer vision.
  • Broader sentiment: PDF is “digital paper,” great for fixed layout, terrible as a primary data format; some hope future workflows adopt content-first formats (Markdown, HTML/EPUB, XML/ODF) with PDFs as derived views only.
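
For quick experiments, Poppler's pdftotext alone goes a long way; a minimal wrapper, with a placeholder input path ("-" sends the output to stdout):

```python
import subprocess

def pdf_to_text(path: str) -> str:
    """Run Poppler's pdftotext with layout preservation, capturing stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, check=True, encoding="utf-8",
    )
    return result.stdout

print(pdf_to_text("statement.pdf"))   # placeholder file name
```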