Archivists work to save disappearing data.gov datasets

Historical and Political Framing

  • Several commenters frame the deletions as a modern “book burning,” invoking Qin Shi Huang and Orwell’s 1984 to highlight deliberate erasure of history as a tool of power.
  • Others caution against over-literal historical analogies (e.g., Nazis, concentration camps), arguing that hyperbole blurs the line between real authoritarian danger and normal—if aggressive—policy change.

Are Deletions Routine or Malicious?

  • Multiple people note that large swings in the dataset count (hundreds or thousands of entries) have occurred before; one user who has been scraping data.gov regularly reports several swings of roughly 10,000 datasets in 2024 alone.
  • Commenters point out that about 1,000 datasets disappeared right after Biden’s inauguration as well; this suggests some churn is normal around transitions.
  • However, there’s concern that current deletions appear concentrated in environmental, climate, and DEI‑related domains, which many interpret as ideologically motivated rather than housekeeping.
  • Others stress that without a clear, public diff of what changed, it’s impossible to separate renames/moves from genuine deletions or to judge intent.
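Absent an official diff, one can be approximated from periodic catalog snapshots, and content hashes help separate renames from true deletions. A minimal sketch (the dataset identifiers and the id-to-content-hash snapshot format here are hypothetical):

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two catalog snapshots keyed by dataset identifier.

    old/new map dataset id -> content hash. A dataset whose id vanished
    but whose content hash reappears under a different id was likely
    renamed or moved, not deleted.
    """
    removed_ids = old.keys() - new.keys()
    added_ids = new.keys() - old.keys()
    added_hashes = {new[i] for i in added_ids}
    renamed = {i for i in removed_ids if old[i] in added_hashes}
    return {
        "deleted": removed_ids - renamed,  # gone, content not seen elsewhere
        "renamed": renamed,                # same content, new identifier
        "added": added_ids,
    }
```

With two hypothetical snapshots, `diff_snapshots({"epa-air-2020": "h1", "noaa-sst": "h2"}, {"noaa-sea-surface-temp": "h2", "usda-crops": "h3"})` reports `epa-air-2020` as deleted and `noaa-sst` as merely renamed.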

Targets, Motives, and Legal Context

  • Reported removals from EPA, NOAA, and USDA climate pages, and from USAID and contractor sites (climate, gender, biodiversity), are seen by many commenters as part of a broader effort to suppress inconvenient science and development work.
  • Some commenters connect this to executive orders and broader moves to weaken regulatory agencies and freeze spending, debating the legality (impoundment, APA constraints, prior Supreme Court decisions).
  • A long subthread argues about Trump’s broader rule‑of‑law record, jury nullification, and whether democratic legitimacy can “wash away” legal violations, reflecting deep polarization.

Archiving Efforts and Technical Challenges

  • Independent archivists, the Internet Archive, End of Term (EOT), and academic labs are all copying datasets and web content; there’s also grassroots organizing on r/DataHoarder.
  • A lab representative describes:
    • Signed BagIt-based snapshots (bag-nabit) to provide provenance and verifiable integrity.
    • Difficulty distinguishing true deletions from renamed/relocated datasets.
    • The challenge of capturing important data that sits behind HTML landing pages or deep links.
  • Commenters propose techniques for change detection and deduplication (hashing, Jaccard/MinHash), cryptographic timestamping, and even TLS-based or blockchain-based attestation; others argue these can’t fully solve social trust problems.
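A BagIt snapshot like the one the lab describes is simple to sketch from the BagIt specification (RFC 8493) with the standard library alone. This illustrates only the generic bag layout and fixity check, not bag-nabit's actual implementation; its signing layer is omitted, and the flat payload filenames are hypothetical:

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal BagIt bag: a data/ payload plus a SHA-256 manifest.

    payload maps flat filenames to bytes (no subdirectories in this sketch).
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")

def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload file's digest and compare against the manifest."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        digest, rel_path = line.split("  ", 1)
        if hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True
```

The point of the manifest is that any later tampering with a payload file makes `verify_bag` return `False`; signing the manifest (as bag-nabit does) then extends that integrity check into provenance.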
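Of the proposed change-detection techniques, MinHash is the least obvious: each dataset is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the underlying token sets. A minimal pure-stdlib illustration (a real deduplication pipeline would layer locality-sensitive hashing on top):

```python
import hashlib

def minhash_signature(tokens: set, num_hashes: int = 128) -> list:
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the token set. Identical sets yield identical signatures."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two snapshots of a dataset that share two thirds of their rows will score close to 0.5 Jaccard similarity, while a renamed but byte-identical dataset scores exactly 1.0, which is what makes the technique useful for telling moves apart from edits.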

How to Help and Low-Budget Archiving

  • Suggestions for volunteers include:
    • Targeted scraping of key domains (especially scientific and climate-related data).
    • Using WARC-based tools, torrents, IPFS, and rclone to mirror and share datasets.
  • Some stress that archives must be not only complete but usable and discoverable, or they’ll never meaningfully inform future research or accountability.

Broader Concerns

  • Several comments lament that erasing or muddying public data undermines one of the U.S.’s core strengths: long-term governmental transparency enabling science, policy evaluation, and legal accountability.
  • Others warn against “hysteria,” arguing that extreme rhetoric benefits Trump politically and obscures the real, documentable harms—such as the quiet disappearance of critical datasets.