New information extracted from Snowden PDFs through metadata version analysis

How PDF “version history” works

  • PDFs are collections of numbered objects linked in a graph; readers follow a root “catalog” and ignore unreferenced or superseded objects.
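  • The format supports incremental updates: edits append new versions of objects, together with a new cross-reference section and trailer, instead of rewriting the file. The object number is normally reused as-is; the generation number only goes up when a freed object number is recycled.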
  • Older revisions can persist in the file as orphaned or superseded objects, or as earlier “revisions” delimited by repeated %%EOF markers.
  • Tools like pdfresurrect, or manually truncating the file at an earlier %%EOF position, can expose prior document states.
  • This behavior was intended to make editing, annotations, and signatures fast on limited hardware, not as a security feature.
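
The manual truncation idea above can be sketched in a few lines of Python. This is a minimal illustration, not what pdfresurrect does internally: the function names revision_ends and carve_revisions are made up for this example, and the sample bytes are a synthetic stand-in rather than a valid PDF. It assumes every %%EOF marks the end of a revision, which does not always hold (the marker can appear inside stream data, and some files, such as linearized PDFs, can contain more than one %%EOF without having been edited).

```python
import re

# %%EOF followed by optional trailing spaces and one line ending.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def revision_ends(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker and its line ending."""
    return [m.end() for m in EOF_MARKER.finditer(data)]


def carve_revisions(data: bytes) -> list[bytes]:
    """Return one candidate file per %%EOF: the bytes from offset 0 up to that marker.

    Each slice is only a *candidate* earlier revision; whether a reader can
    open it depends on the file actually having been saved incrementally.
    """
    return [data[:end] for end in revision_ends(data)]


if __name__ == "__main__":
    # Synthetic stand-in for an incrementally updated file (not a valid PDF).
    sample = (
        b"%PDF-1.4\n1 0 obj\n(original text)\nendobj\ntrailer\n%%EOF\n"
        b"1 0 obj\n(edited text)\nendobj\ntrailer\n%%EOF\n"
    )
    for number, revision in enumerate(carve_revisions(sample), start=1):
        print(f"candidate revision {number}: {len(revision)} bytes")
```

On a real file, each slice would be written to its own file and opened in a reader; a single %%EOF means there is nothing to recover this way.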

Tools and need for better PDF inspection

  • Commenters use mutool, qpdf (QDF mode), and reverse‑engineering toolkits like REMnux for inspecting structure, objects, and potential malware.
  • There is a desire for more user‑friendly GUIs on top of these low-level tools.
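
As a rough idea of the kind of structural check these tools make possible, the sketch below scans raw bytes for "N G obj" headers and reports object numbers that are defined more than once, a hint that an older copy of the object may still be in the file. It is not a substitute for mutool or qpdf: the names object_definitions and redefined_objects are hypothetical, the sample is synthetic, and a plain regex scan cannot see objects packed inside compressed object streams and may produce false hits inside stream data.

```python
import re
from collections import defaultdict

# An indirect object header: object number, generation number, keyword "obj".
OBJ_HEADER = re.compile(rb"(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")


def object_definitions(data: bytes) -> dict[int, list[tuple[int, int]]]:
    """Map object number -> list of (generation, byte offset) per header found."""
    found: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for match in OBJ_HEADER.finditer(data):
        number = int(match.group(1))
        generation = int(match.group(2))
        found[number].append((generation, match.start()))
    return dict(found)


def redefined_objects(data: bytes) -> dict[int, list[tuple[int, int]]]:
    """Keep only object numbers that appear with more than one definition."""
    return {
        number: definitions
        for number, definitions in object_definitions(data).items()
        if len(definitions) > 1
    }


if __name__ == "__main__":
    # Synthetic example: object 1 is defined twice, object 2 once.
    sample = (
        b"%PDF-1.4\n1 0 obj\n(a)\nendobj\n2 0 obj\n(b)\nendobj\n%%EOF\n"
        b"1 0 obj\n(c)\nendobj\n%%EOF\n"
    )
    for number, definitions in redefined_objects(sample).items():
        offsets = [offset for _, offset in definitions]
        print(f"object {number} defined at byte offsets {offsets}")
```

For real documents, qpdf's QDF mode or mutool gives a far more reliable view, since those tools parse the cross-reference data and decode streams rather than pattern-matching on bytes.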

Redaction failures & journalistic practices

  • The Snowden PDFs in question appear to have journalist-made redactions, with metadata timestamps suggesting edits weeks before publication.
  • Most documents in the archive are described as carefully handled; these specific files are exceptions where metadata leaks revealed significant extra info.
  • Some commenters think redactions should have been visibly marked and that safer workflows (screenshots, rasterization) should have been used.

Sanitizing PDFs and alternative workflows

  • Proposed mitigation approaches:
    • Print-and-scan to image-only PDFs.
    • Convert to PNG/JPEG/TIFF or BMP, optionally add noise, then rebuild a PDF.
    • “Print to PDF” is seen as less trustworthy unless it truly rasterizes.
    • More extreme ideas: LLM-based rephrasing plus rasterization to strip subtle identifiers.

Printer tracking dots and OPSEC limits

  • Color printers often embed tiny yellow tracking dots encoding at least serial numbers and timestamps; some commenters doubt claims that public IPs are encoded.
  • Black-and-white laser printers are believed not to use yellow-dot schemes, making them preferable for anonymity.
  • Open, fully controllable printers are discussed as a missing piece of privacy infrastructure.

Broader reflections

  • Some see this as a new OSINT technique and note parallels with long-term exploitation of archived data.
  • Others critique the article series for being more descriptive than analytical and question how much novel insight it actually adds.