X-ray: a Python library for finding bad redactions in PDF documents

Context: Epstein PDFs and Redaction Failures

  • Many recently released Epstein court PDFs used naive “black box over text” redactions, leaving underlying text intact.
  • In some files, users can simply select and copy “redacted” lines in a browser PDF viewer and see the hidden text.
  • X-ray is highlighted as a tool that detects these overlay-style redactions at scale; it has already been run on examples from justice.gov.
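
The overlay failure described above can be illustrated with a small geometric check. This is a minimal sketch and not X-ray's actual implementation: it assumes that filled rectangles and word bounding boxes have already been extracted from a page by some PDF library, and the `Box` type, the `find_bad_redactions` helper, the `min_cover` threshold, and the sample coordinates are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(b: Box) -> float:
    return max(b.x1 - b.x0, 0.0) * max(b.y1 - b.y0, 0.0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def find_bad_redactions(rects, words, min_cover=0.5):
    """Return (rect, text) pairs where a filled rectangle covers at least
    `min_cover` of a word's box while the word is still in the text layer."""
    hits = []
    for rect in rects:
        for text, word_box in words:
            word_area = area(word_box)
            if word_area > 0 and overlap_area(rect, word_box) / word_area >= min_cover:
                hits.append((rect, text))
    return hits


# Hypothetical page data: one black rectangle and three extracted words.
rects = [Box(100, 700, 220, 715)]
words = [
    ("Jane", Box(102, 702, 130, 713)),
    ("Doe", Box(134, 702, 160, 713)),
    ("testified", Box(230, 702, 280, 713)),
]

hidden = [text for _, text in find_bad_redactions(rects, words)]
print(hidden)  # ['Jane', 'Doe']
```

Any word reported here is still present in the file and can be copied out, even though a viewer paints a box over it. A real detector would also need to confirm that the rectangle is opaque and drawn above the text.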

Proper Redaction Tools and Workflows

  • Commenters note that Adobe Acrobat Pro, when used correctly (“mark for redaction” followed by “apply redactions”), permanently removes the content and has been standard in legal practice for years.
  • “Draw a black box” is described as a legacy paper-era habit; in a PDF it only hides the text rather than removing it.
  • Some PDFs retain older revisions through incremental updates (each appended trailer references the previous cross-reference section through its /Prev entry), so earlier, less-redacted states can still be recovered.
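
The incremental-update point can be checked with a rough byte-level heuristic. The sketch below is an assumption-laden illustration, not a PDF parser: it counts `%%EOF` markers and `/Prev` entries in the raw bytes and returns the file prefix ending at each `%%EOF` as a candidate earlier revision. The function names and the synthetic sample are hypothetical, and linearized ("fast web view") files can trigger the same signals without containing any edit history.

```python
import re


def count_revisions(pdf_bytes: bytes) -> dict:
    """Heuristic signals that a PDF was saved with incremental updates."""
    eof_markers = len(re.findall(rb"%%EOF", pdf_bytes))
    prev_entries = len(re.findall(rb"/Prev\s+\d+", pdf_bytes))
    return {
        "eof_markers": eof_markers,
        "prev_entries": prev_entries,
        "likely_incremental": eof_markers > 1 or prev_entries > 0,
    }


def revision_prefixes(pdf_bytes: bytes) -> list:
    """Return the file truncated after each %%EOF marker.

    Each prefix is a candidate earlier state of the document; whether it
    opens cleanly depends on how the file was actually written.
    """
    return [pdf_bytes[: m.end()] for m in re.finditer(rb"%%EOF", pdf_bytes)]


# Synthetic stand-in for a file with one appended update (not a valid PDF).
sample = (
    b"%PDF-1.7\n...original objects...\n"
    b"trailer\n<< /Size 10 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
    b"...appended objects...\n"
    b"trailer\n<< /Size 12 /Root 1 0 R /Prev 1234 >>\nstartxref\n5678\n%%EOF\n"
)

report = count_revisions(sample)
prefixes = revision_prefixes(sample)
print(report)
print(len(prefixes), "candidate revision(s)")
```

On a real file, `pdf_bytes` would come from `open(path, "rb").read()`. A positive result only means the file deserves a closer look with a proper PDF tool.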

Rasterization vs. Searchability

  • A common “safer” user workflow is to overlay black boxes and then rasterize pages to images, but the resulting files are large (the page images are not CCITT-compressed) and text search is lost.
  • Requirements for searchable public records often rule out full rasterization.
  • Some governments go the opposite direction and intentionally scramble text layers so that documents remain readable but cannot be searched or copied, which commenters view as hostile.
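
To give a sense of why rasterized pages get large, the arithmetic below computes the uncompressed size of a single page image. It is only an illustration under stated assumptions: a US Letter page at 300 DPI, and a hypothetical helper `raw_raster_bytes`. Real files are compressed, so actual sizes depend heavily on the encoder and the page content.

```python
import math


def raw_raster_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one page image, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


# Assumed page: 8.5 x 11 inches at 300 DPI (2550 x 3300 pixels).
rgb = raw_raster_bytes(8.5, 11, 300, 24)      # 24-bit color
gray = raw_raster_bytes(8.5, 11, 300, 8)      # 8-bit grayscale
bilevel = raw_raster_bytes(8.5, 11, 300, 1)   # 1-bit black and white

print(rgb, gray, bilevel)  # 25245000 8415000 1052700
```

A 1-bit image is the kind of input that bilevel fax-style compression such as CCITT Group 4 is designed for, which is why scanning text pages in color or grayscale tends to cost so much more per page.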

AI, Side-Channels, and Font Metrics

  • One proposal: use AI to enforce an objective redaction standard and compare human vs. AI redaction rates.
  • Others argue AI isn’t needed to detect naive redactions, but could help infer what should be redacted.
  • There is extensive discussion of “glyph spacing” / font-metric attacks: inferring redacted words from bounding box width, kerning, and context, especially when combined with AI.
  • Suggested mitigations include widening redaction boxes (possibly to a constant width) and using reflowable formats; skepticism remains about fully eliminating these leaks, especially for short, predictable text.
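
The width-based inference and the constant-width mitigation discussed above can be sketched as follows. Everything here is hypothetical: the `ADVANCE` table contains made-up advance widths rather than real font metrics, the candidate names are invented, and kerning and surrounding context (which the discussion says make the attack stronger) are ignored.

```python
# Made-up advance widths in 1/1000 em units; NOT taken from any real font.
ADVANCE = {"i": 250, "l": 250, "m": 800, "W": 950, "M": 900, "A": 700, "B": 650}
DEFAULT_ADVANCE = 500  # assumed width for every glyph not listed above


def text_width(word: str, font_size: float) -> float:
    """Rendered width in points, ignoring kerning."""
    units = sum(ADVANCE.get(ch, DEFAULT_ADVANCE) for ch in word)
    return units * font_size / 1000


def matching_width(candidates, box_width, font_size, tolerance=0.3):
    """Candidates whose width matches a tightly fitted redaction box."""
    return [w for w in candidates if abs(text_width(w, font_size) - box_width) <= tolerance]


def fitting_within(candidates, box_width, font_size):
    """Candidates that could be hidden under a box of at least their own width."""
    return [w for w in candidates if text_width(w, font_size) <= box_width]


candidates = ["Ann", "Bob", "Maria", "William", "Mia"]

# A tight box leaks the width, which narrows the candidate list.
matches = matching_width(candidates, box_width=19.8, font_size=12)
print(matches)  # ['Bob', 'Mia']

# A constant-width box (assumed here to be 60 pt) no longer discriminates.
still_possible = fitting_within(candidates, box_width=60, font_size=12)
print(still_possible)  # ['Ann', 'Bob', 'Maria', 'William', 'Mia']
```

Even in this toy setup the tight box cuts five candidates to two, which is consistent with the skepticism about short, predictable text: a constant-width box hides the length, but context can still narrow the options.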

Intentional vs. Incompetent Redactions

  • Some insist such failures are pure amateurism and lack of training or process.
  • Others, citing strict federal redaction training, argue this was likely “malicious compliance” or deliberate sabotage of overbroad redactions.
  • There is no consensus; motives are described as unclear.

Legal and Ethical Redaction Considerations

  • Defensible reasons for redaction: protect victims, witnesses, informants, ongoing investigations, national security, and avoid releasing child sexual abuse material.
  • The law is said to prohibit redaction for embarrassment, reputational harm, or political sensitivity, yet many redactions in these documents appear unconnected to any of the legitimate grounds listed above.
  • Victims have publicly complained that required redactions (victim identities) were missed while non-permitted redactions (protecting others) were applied.

Disclosure Ethics and Impact

  • One camp argues powerful de-redaction techniques should be withheld to avoid mass unmasking of past redactions.
  • Another argues that undisclosed vulnerabilities only create more victims, and that publicizing flaws is necessary for improved tools and workflows.
  • A key complication: unlike flaws in cryptography, redacted PDFs that have already been published can’t be retroactively “patched,” so disclosure has permanent consequences.

Broader PDF and Government Tech Critiques

  • Several comments deride PDF as a fragile, overcomplicated, and insecure format for long-term public records, especially when redaction is anticipated.
  • Observers are unsurprised by widespread PDF mishandling, citing poor technical literacy in non-technical institutions and management that underestimates the complexity of secure redaction.