New information extracted from Snowden PDFs through metadata version analysis
How PDF “version history” works
- PDFs are collections of numbered objects linked in a graph; readers follow a root “catalog” and ignore unreferenced or superseded objects.
- The format supports incremental updates: changed objects are appended to the end of the file, together with a new cross-reference section and trailer, instead of rewriting the file.
- Older revisions can persist in the file as orphaned or superseded objects, or as earlier “revisions” delimited by repeated %%EOF markers.
- Tools like pdfresurrect and manual truncation at earlier %%EOF positions can expose prior document states.
- This behavior was intended to make editing, annotations, and signatures fast on limited hardware, not as a security feature.
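Manual truncation at earlier %%EOF positions can be sketched in a few lines of Python. This is a minimal illustration, not how pdfresurrect works internally; the function names are made up for this example. It treats every %%EOF marker as a possible revision boundary, which can overcount: the marker may also appear inside stream data, and linearized files may carry an extra one near the start.

```python
import re

# %%EOF, optional trailing spaces/tabs, then an optional end-of-line sequence.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def revision_end_offsets(data: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker, in file order."""
    return [m.end() for m in EOF_MARKER.finditer(data)]


def extract_revisions(data: bytes) -> list[bytes]:
    """Return one candidate file per %%EOF marker.

    Each candidate is a prefix of the original bytes, cut just after a marker.
    Because incremental updates only append, a prefix ending at an earlier
    marker is (in the simple case) the file as it was before later edits.
    """
    return [data[:end] for end in revision_end_offsets(data)]


def write_revisions(path: str) -> list[str]:
    """Write each candidate revision next to the input file (hypothetical naming)."""
    with open(path, "rb") as fh:
        data = fh.read()
    written = []
    for index, revision in enumerate(extract_revisions(data), start=1):
        out_path = f"{path}.rev{index}.pdf"
        with open(out_path, "wb") as fh:
            fh.write(revision)
        written.append(out_path)
    return written
```

Calling write_revisions on a file produces one output per marker; the last output is the unmodified document when the file ends with %%EOF, and any earlier outputs that open in a reader are candidate prior states.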
Tools and need for better PDF inspection
- Commenters use mutool, qpdf (QDF mode), and reverse-engineering toolkits like REMnux for inspecting structure, objects, and potential malware.
- There is a desire for more user-friendly GUIs on top of these low-level tools.
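As a rough idea of what such tools surface, the sketch below scans the raw bytes for "N G obj" headers and reports object numbers that are defined more than once, which hints at superseded versions left behind by incremental updates. It is a crude stand-in, not a replacement for the tools named above: it does not parse the cross-reference data, cannot see objects packed into compressed object streams, and may report false positives from bytes inside stream data. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches an indirect object header such as "12 0 obj".
OBJ_HEADER = re.compile(rb"(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")


def object_definitions(data: bytes) -> dict[int, list[tuple[int, int]]]:
    """Map each object number to a list of (generation, byte offset) pairs."""
    found = defaultdict(list)
    for match in OBJ_HEADER.finditer(data):
        number = int(match.group(1))
        generation = int(match.group(2))
        found[number].append((generation, match.start()))
    return dict(found)


def redefined_objects(data: bytes) -> dict[int, list[tuple[int, int]]]:
    """Return only the object numbers that appear more than once in the file."""
    return {
        number: definitions
        for number, definitions in object_definitions(data).items()
        if len(definitions) > 1
    }
```

For each reported object number, the definition at the highest offset is normally the one a reader uses; the earlier ones are the candidates worth inspecting by hand.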
Redaction failures & journalistic practices
- The Snowden PDFs in question appear to have journalist-made redactions, with metadata timestamps suggesting edits weeks before publication.
- Most documents in the archive are described as carefully handled; these specific files are exceptions where metadata leaks revealed significant extra info.
- Some commenters think redactions should have been visibly marked and that safer workflows (screenshots, rasterization) should have been used.
Sanitizing PDFs and alternative workflows
- Proposed mitigation approaches:
- Print-and-scan to image-only PDFs.
- Convert to PNG/JPEG/TIFF or BMP, optionally add noise, then rebuild a PDF.
- “Print to PDF” is seen as less trustworthy unless it truly rasterizes.
- More extreme ideas: LLM-based rephrasing plus rasterization to strip subtle identifiers.
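The "convert to an image, optionally add noise, then rebuild a PDF" idea can be sketched with the standard library alone. The example assumes the page has already been rasterized or scanned and decoded into raw 8-bit RGB pixels by some external tool; that step is not shown. The function names, the 150 dpi default, and the noise amplitude are arbitrary choices for illustration, and whether small random noise defeats any particular fingerprinting scheme is not something this sketch establishes.

```python
import random
import zlib
from typing import Optional


def add_noise(pixels: bytes, amplitude: int = 2, seed: Optional[int] = None) -> bytes:
    """Shift each 8-bit sample by a random offset in [-amplitude, amplitude], clamped to 0..255."""
    rng = random.Random(seed)
    return bytes(
        min(255, max(0, value + rng.randint(-amplitude, amplitude))) for value in pixels
    )


def image_only_pdf(pixels: bytes, width: int, height: int, dpi: int = 150) -> bytes:
    """Wrap raw RGB pixels (row-major, top row first) in a new single-page PDF.

    The output is built from pixel values only, so no objects, metadata, or
    earlier revisions from a source PDF are carried over.
    """
    if len(pixels) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of RGB data")
    page_w = width * 72 / dpi   # page size in points
    page_h = height * 72 / dpi
    image = zlib.compress(pixels)
    # Scale the unit-square image to fill the page, then paint it.
    content = f"q {page_w:.2f} 0 0 {page_h:.2f} 0 0 cm /Im0 Do Q".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {page_w:.2f} {page_h:.2f}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>").encode("ascii"),
        (f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
         f"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /FlateDecode "
         f"/Length {len(image)} >>\nstream\n").encode("ascii") + image + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode("ascii") + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode("ascii")
    return bytes(out)
```

A typical use would be image_only_pdf(add_noise(pixels), width, height), written to a new file. Any text in the result exists only as pixels, so it is no longer selectable or searchable, which is the point of the approach.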
Printer tracking dots and OPSEC limits
- Color printers often embed tiny yellow tracking dots encoding at least serial numbers and timestamps; some commenters doubt claims that public IPs are encoded.
- Black-and-white laser printers are believed not to use yellow-dot schemes, making them preferable for anonymity.
- Open, fully controllable printers are discussed as a missing piece of privacy infrastructure.
Broader reflections
- Some see this as a new OSINT technique and note parallels with long-term exploitation of archived data.
- Others critique the article series for being more descriptive than analytical and question how much novel insight it actually adds.