Archivists work to save disappearing data.gov datasets

Historical and Political Framing

  • Several commenters frame the deletions as a modern “book burning,” invoking Qin Shi Huang and Orwell’s 1984 to highlight deliberate erasure of history as a tool of power.
  • Others caution against over-literal historical analogies (e.g., Nazis, concentration camps), arguing that hyperbole blurs the line between real authoritarian danger and normal—if aggressive—policy change.

Are Deletions Routine or Malicious?

  • Multiple people note that large swings in the dataset count (hundreds or thousands of entries) have occurred before; one user who has been scraping data.gov regularly reports several swings of roughly 10,000 datasets in 2024 alone.
  • Commenters point out that about 1,000 datasets disappeared right after Biden’s inauguration as well; this suggests some churn is normal around transitions.
  • However, there’s concern that current deletions appear concentrated in environmental, climate, and DEI‑related domains, which many interpret as ideologically motivated rather than housekeeping.
  • Others stress that without a clear, public diff of what changed, it’s impossible to separate renames/moves from genuine deletions or to judge intent.
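Absent an official diff, one can be approximated from periodic catalog snapshots, and content hashes help separate renames from true deletions. A minimal sketch (the dataset identifiers and the id-to-content-hash snapshot format here are hypothetical):

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two catalog snapshots keyed by dataset identifier.

    old/new map dataset id -> content hash. A dataset whose id vanished
    but whose content hash reappears under a different id was likely
    renamed or moved, not deleted.
    """
    removed_ids = old.keys() - new.keys()
    added_ids = new.keys() - old.keys()
    added_hashes = {new[i] for i in added_ids}
    renamed = {i for i in removed_ids if old[i] in added_hashes}
    return {
        "deleted": removed_ids - renamed,  # gone, content not seen elsewhere
        "renamed": renamed,                # same content, new identifier
        "added": added_ids,
    }
```

With two hypothetical snapshots, `diff_snapshots({"epa-air-2020": "h1", "noaa-sst": "h2"}, {"noaa-sea-surface-temp": "h2", "usda-crops": "h3"})` reports `epa-air-2020` as deleted and `noaa-sst` as merely renamed.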

Targets, Motives, and Legal Context

  • Reported removals from EPA, NOAA, and USDA climate pages, and from USAID and contractor sites (climate, gender, biodiversity), are seen by many commenters as part of a broader effort to suppress inconvenient science and development work.
  • Some commenters connect this to executive orders and broader moves to weaken regulatory agencies and freeze spending, debating the legality (impoundment, APA constraints, prior Supreme Court decisions).
  • A long subthread argues about Trump’s broader rule‑of‑law record, jury nullification, and whether democratic legitimacy can “wash away” legal violations, reflecting deep polarization.

Archiving Efforts and Technical Challenges

  • Independent archivists, the Internet Archive, End of Term (EOT), and academic labs are all copying datasets and web content; there’s also grassroots organizing on r/DataHoarder.
  • A lab representative describes:
    • Signed BagIt-based snapshots (bag-nabit) to provide provenance and verifiable integrity.
    • Difficulty distinguishing true deletions from renamed/relocated datasets.
    • The challenge of capturing important data that sits behind HTML landing pages or deep links.
  • Commenters propose techniques for change detection and deduplication (hashing, Jaccard/MinHash), cryptographic timestamping, and even TLS-based or blockchain-based attestation; others argue these can’t fully solve social trust problems.
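A BagIt snapshot like the one the lab describes is simple to sketch from the BagIt specification (RFC 8493) with the standard library alone. This illustrates only the generic bag layout and fixity check, not bag-nabit's actual implementation; its signing layer is omitted, and the flat payload filenames are hypothetical:

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal BagIt bag: a data/ payload plus a SHA-256 manifest.

    payload maps flat filenames to bytes (no subdirectories in this sketch).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload file's digest and compare against the manifest."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        digest, rel_path = line.split("  ", 1)
        if hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True
```

The point of the manifest is that any later tampering with a payload file makes `verify_bag` return `False`; signing the manifest (as bag-nabit does) then extends that integrity check into provenance.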
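Of the proposed change-detection techniques, MinHash is the least obvious: each dataset is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the underlying token sets. A minimal pure-stdlib illustration (a real deduplication pipeline would layer locality-sensitive hashing on top):

```python
import hashlib

def minhash_signature(tokens: set, num_hashes: int = 128) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the token set. Identical sets yield identical signatures."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two snapshots of a dataset that share two thirds of their rows will score close to 0.5 Jaccard similarity, while a renamed but byte-identical dataset scores exactly 1.0, which is what makes the technique useful for telling moves apart from edits.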

How to Help and Low-Budget Archiving

  • Suggestions for volunteers include:
    • Targeted scraping of key domains (especially scientific and climate-related data).
    • Using WARC-based tools, torrents, IPFS, and rclone to mirror and share datasets.
  • Some stress that archives must be not only complete but usable and discoverable, or they’ll never meaningfully inform future research or accountability.

Broader Concerns

  • Several comments lament that erasing or muddying public data undermines one of the U.S.’s core strengths: long-term governmental transparency enabling science, policy evaluation, and legal accountability.
  • Others warn against “hysteria,” arguing that extreme rhetoric benefits Trump politically and obscures the real, documentable harms—such as the quiet disappearance of critical datasets.