X-ray: a Python library for finding bad redactions in PDF documents
Context: Epstein PDFs and Redaction Failures
- Many recently released Epstein court PDFs used naive “black box over text” redactions, leaving underlying text intact.
- In some files, users can simply select and copy “redacted” lines in a browser PDF viewer and see the hidden text.
- X-ray is highlighted as a tool that detects these overlay-style redactions at scale; it has already been run on examples from justice.gov.
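The overlay failure described above comes down to a geometric check: does extractable text sit underneath a filled rectangle? The following is a minimal sketch of that idea only, not X-ray's actual implementation or API. The `Rect` and `TextSpan` types, the `find_bad_redactions` function, and the 50% overlap threshold are all hypothetical, and it assumes text spans and black rectangles have already been extracted from the page by some PDF parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page coordinates, (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Clamp to zero so non-overlapping intersections have no area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Rect") -> "Rect":
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))


@dataclass(frozen=True)
class TextSpan:
    """A run of extractable text and the box it occupies (hypothetical type)."""
    text: str
    box: Rect


def find_bad_redactions(spans, black_boxes, min_overlap=0.5):
    """Return (box, text) pairs where selectable text lies under a black box.

    min_overlap is the fraction of the text span's area that the box must
    cover; 0.5 is an arbitrary illustrative threshold.
    """
    hits = []
    for box in black_boxes:
        for span in spans:
            span_area = span.box.area()
            if span_area == 0 or not span.text.strip():
                continue
            covered = box.intersection(span.box).area() / span_area
            if covered >= min_overlap:
                hits.append((box, span.text))
    return hits


# Invented page content: one label, one name, and a box drawn over the name.
spans = [
    TextSpan("Witness:", Rect(72, 700, 120, 712)),
    TextSpan("Jane Doe", Rect(125, 700, 175, 712)),
]
boxes = [Rect(123, 698, 178, 714)]
leaks = find_bad_redactions(spans, boxes)
```

Here `leaks` contains the single pair for "Jane Doe", since the text is still present in the page even though the box hides it visually. A real detector would also need to confirm the rectangle is actually opaque and drawn above the text.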
Proper Redaction Tools and Workflows
- Commenters note that Adobe Acrobat Pro, when used correctly with “mark for redaction” then “apply redactions,” permanently removes content and has been standard in legal practice for years.
- “Draw a black box” is described as a legacy paper-era habit; in a PDF, this only hides the text rather than removing it.
- Some PDFs retain older revisions via incremental updates (the /Prev trailer entry), meaning earlier, less-redacted states can still be recovered.
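As a rough illustration of the incremental-update point, a file's raw bytes can be scanned for multiple `%%EOF` markers and `/Prev` entries. This is a heuristic sketch, not a PDF parser: the `revision_info` function is hypothetical, and linearized PDFs also carry a `/Prev` entry and more than one `%%EOF` without holding any older revision, so a positive result only means the file deserves a closer look.

```python
import re


def revision_info(pdf_bytes: bytes) -> dict:
    """Heuristically report signs of incremental updates in raw PDF bytes."""
    # Each saved revision normally ends with its own %%EOF marker.
    eof_markers = len(re.findall(rb"%%EOF", pdf_bytes))
    # /Prev holds the byte offset of the previous cross-reference section.
    prev_offsets = [int(m) for m in re.findall(rb"/Prev\s+(\d+)", pdf_bytes)]
    return {
        "eof_markers": eof_markers,
        "prev_offsets": prev_offsets,
        "possibly_incremental": eof_markers > 1 and bool(prev_offsets),
    }


# Invented, heavily abbreviated bytes standing in for a twice-saved file.
sample = (
    b"%PDF-1.7\n...original objects...\n"
    b"trailer\n<< /Size 10 >>\nstartxref\n100\n%%EOF\n"
    b"...objects appended by a later save...\n"
    b"trailer\n<< /Size 12 /Prev 100 >>\nstartxref\n400\n%%EOF\n"
)
info = revision_info(sample)
```

For this sample, `info["possibly_incremental"]` is true. In a real file, truncating the bytes just after an earlier `%%EOF` is the usual way to view that earlier state, which is why redactions applied as an incremental save can be undone.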
Rasterization vs. Searchability
- A common “safer” user workflow is to overlay black boxes then rasterize pages to images, but this produces large, non-CCITT-compressed files and loses text search.
- Requirements for searchable public records often rule out full rasterization.
- Some governments go the opposite direction and intentionally scramble text layers so documents are readable but not searchable or copyable, which is viewed as hostile.
AI, Side-Channels, and Font Metrics
- One proposal: use AI to enforce an objective redaction standard and compare human vs. AI redaction rates.
- Others argue AI isn’t needed to detect naive redactions, but could help infer what should be redacted.
- There is extensive discussion of “glyph spacing” / font-metric attacks: inferring redacted words from bounding box width, kerning, and context, especially when combined with AI.
- Suggested mitigations include widening redaction boxes (possibly to a constant width) and using reflowable formats; skepticism remains about fully eliminating these leaks, especially for short, predictable text.
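The width-based attack and the constant-width mitigation can be sketched with simple arithmetic. Everything here is an illustrative assumption: the advance widths are invented rather than taken from a real font, kerning is ignored, and the `text_width`, `candidates_matching_width`, and `pad_to_bucket` helpers are hypothetical names.

```python
import math

# Invented per-character advance widths in 1/1000 em; other characters use 500.
ADVANCES = {"i": 250, "l": 250, "m": 800, "w": 750, " ": 250}
DEFAULT_ADVANCE = 500


def text_width(text: str, font_size: float) -> float:
    """Approximate rendered width in points, ignoring kerning."""
    units = sum(ADVANCES.get(ch, DEFAULT_ADVANCE) for ch in text)
    return units * font_size / 1000


def candidates_matching_width(names, box_width, font_size, tolerance=0.5):
    """Keep only names whose rendered width fits the redaction box width."""
    return [n for n in names
            if abs(text_width(n, font_size) - box_width) <= tolerance]


def pad_to_bucket(width: float, bucket: float = 72.0) -> float:
    """Mitigation sketch: round every box width up to a multiple of bucket."""
    return math.ceil(width / bucket) * bucket


names = ["Bill Smith", "Tom Lee", "William Jones"]

# Attack: a tight box measured at 48.5 pt in 12 pt text narrows the candidates.
matches = candidates_matching_width(names, box_width=48.5, font_size=12)

# Mitigation: after padding, all three names produce the same box width.
padded = {pad_to_bucket(text_width(n, 12)) for n in names}
```

With these made-up metrics, only "Bill Smith" matches the tight box, while padding collapses all three names to one width. Padding does nothing about contextual clues, which is part of why commenters remain skeptical for short, predictable text.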
Intentional vs. Incompetent Redactions
- Some insist such failures are pure amateurism and lack of training or process.
- Others, citing strict federal redaction training, argue this was likely “malicious compliance” or deliberate sabotage of overbroad redactions.
- There is no consensus; motives are described as unclear.
Legal and Ethical Redaction Considerations
- Defensible reasons for redaction: protect victims, witnesses, informants, ongoing investigations, national security, and avoid releasing child sexual abuse material.
- Law is said to prohibit redaction for embarrassment, reputational harm, or political sensitivity, yet many redactions in these documents appear unconnected to any of the legitimate grounds listed above.
- Victims have publicly complained that required redactions (victim identities) were missed while non-permitted redactions (protecting others) were applied.
Disclosure Ethics and Impact
- One camp argues powerful de-redaction techniques should be withheld to avoid mass unmasking of past redactions.
- Another argues that undisclosed vulnerabilities only create more victims, and that publicizing flaws is necessary for improved tools and workflows.
- A key complication: unlike cryptography, existing redacted PDFs can’t be retroactively “patched,” so disclosure has permanent consequences.
Broader PDF and Government Tech Critiques
- Several comments deride PDF as a fragile, overcomplicated, and insecure format for long-term public records, especially when redaction is anticipated.
- Observers are unsurprised by widespread PDF mishandling, citing poor technical literacy in non-technical institutions and management that underestimates the complexity of secure redaction.