Data centers contain 90% crap data
What Counts as “Crap Data”?
- Several commenters distinguish “crap” from “cold” but still-important data: old emails, photos, logs, and business records may be rarely accessed but can be critical for debugging, audits, disputes, or personal memory.
- Others recount cleaning up product databases or “big data” lakes where 50–95% of the data was obviously wrong, duplicated, or never used: true waste rather than low-frequency value.
- Sturgeon’s law (“90% of everything is crap”) is invoked: the issue is not just data volume, but that most human output is of low value.
Economic Tradeoffs and Incentives
- A recurring theme: storage is so cheap that it’s often rational to keep everything rather than pay humans to decide what to delete.
- Cloud providers may profit from unused allocations and have weak incentives to make deletion easy; users on flat-rate plans hoard “just in case.”
- Some argue usage is correctly incentivized: if the expected debugging or future-proofing value exceeds the storage cost, the “waste” is acceptable (a rough cost sketch follows below).
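For illustration, a minimal sketch of that keep-vs-delete tradeoff; the storage price, engineer rate, and triage speed are assumed placeholders, not figures from the discussion:

```python
# Rough keep-vs-delete economics. All constants are assumed placeholders.

STORAGE_PRICE_PER_TB_MONTH = 20.0   # USD, assumed object-storage price
ENGINEER_RATE_PER_HOUR = 100.0      # USD, assumed loaded cost of a reviewer
TRIAGE_RATE_TB_PER_HOUR = 0.25      # assumed: data one person can review per hour

def years_of_storage_per_review_hour() -> float:
    """How many years of storing the reviewed data one hour of human triage costs."""
    monthly_storage_cost = TRIAGE_RATE_TB_PER_HOUR * STORAGE_PRICE_PER_TB_MONTH
    return ENGINEER_RATE_PER_HOUR / (monthly_storage_cost * 12)

if __name__ == "__main__":
    print(f"One review hour costs as much as ~{years_of_storage_per_review_hour():.1f} years of storage.")
    # With these placeholders: 100 / (0.25 * 20 * 12) ≈ 1.7 years, so paying a person
    # to triage only wins if the data would otherwise sit untouched for years.
```

Under these particular assumptions, deletion only pays off for data that would otherwise linger for years; change the inputs and the conclusion moves with them.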
Environmental Impact Disagreement
- One side: data centers contribute meaningfully to emissions; storing useless data is morally analogous to other wasteful consumption, and externalities are underpriced.
- Other side: storage’s share of global energy use is small compared with heavy industry, transport, and compute-heavy workloads (AI, crypto); blaming “crap storage” is seen as misplaced or rhetorical (a back-of-envelope estimate follows this list).
- Some suggest the right lever is carbon pricing on energy, not moralizing over what people store.
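As a rough illustration of why the two sides talk past each other, a back-of-envelope estimate; the power-per-terabyte and grid-intensity figures are assumptions chosen for illustration, not numbers from the thread:

```python
# Back-of-envelope CO2 estimate for keeping data on spinning disks year-round.
# Every constant is an assumption for illustration only.

WATTS_PER_TB = 1.0           # assumed: disk power per TB incl. redundancy and facility overhead
GRID_KG_CO2_PER_KWH = 0.4    # assumed: average grid carbon intensity
HOURS_PER_YEAR = 24 * 365

def annual_kg_co2(terabytes: float) -> float:
    """Estimated kg of CO2 per year to keep `terabytes` spinning."""
    kwh_per_year = terabytes * WATTS_PER_TB * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * GRID_KG_CO2_PER_KWH

if __name__ == "__main__":
    for tb in (2, 100, 10_000):
        print(f"{tb:>6} TB  ->  ~{annual_kg_co2(tb):,.0f} kg CO2/year")
    # 2 TB of hoarded photos lands around 7 kg CO2/year under these assumptions:
    # negligible per person, yet it adds up at datacenter scale, which is the crux of the dispute.
```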
Compliance, Risk, and Legal Holds
- Long-retention business data often exists to satisfy regulation, audits, fraud investigations, and PCI/GDPR obligations.
- A story from a large migration shows how litigation holds on “leftover” petabytes can stall deletion for months.
- The GDPR “right to be forgotten” is called practically unenforceable in messy real-world estates (SharePoint sprawl, orphaned backups, test databases).
Personal Photos, Email, and Hoarding
- Many admit to multi‑TB photo libraries and full email archives; most items are never revisited but carry emotional value or potential practical use.
- People want smarter tools: dedup, similarity grouping, AI culling, retention policies by type/importance; existing UX makes manual cleanup painful (a minimal dedup sketch follows this list).
- Some argue that better search and AI make big unsorted piles more useful over time, reducing the need to cull.
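As a sketch of the simplest item on that wish list, exact-duplicate detection by content hash; it only catches byte-identical copies (not near-duplicates or similar shots), and the directory path is a placeholder:

```python
# Group byte-identical files by content hash: the crudest form of photo dedup.
# Only exact copies are found; similarity grouping or "AI culling" needs far more.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map SHA-256 digest -> files under `root` with identical contents."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()  # reads whole file into memory
            groups[digest].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_exact_duplicates(Path("~/Pictures").expanduser()).items():
        print(f"{len(paths)} copies ({digest[:12]}…):")
        for p in paths:
            print(f"  {p}")
```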
Technical Approaches and Limits
- Debate over deduplication: mail signatures, filesystem-level dedup versus Exchange dropping single‑instance storage, and end-to-end encryption (E2EE) complicating dedup.
- Calls for filesystem‑level expiry/retention flags; today this is mostly ad‑hoc cron jobs and enterprise records-management systems (a cron-style sketch follows this list).
- Cold storage tiers (tape, deep archive, compressed/slow formats) are seen as a better target than aggressive deletion, though operational complexity often outweighs savings.
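In the absence of filesystem-level retention flags, the ad-hoc approach usually looks roughly like the sketch below: an age-based job that moves (rather than deletes) stale files to a cheaper tier. The paths and retention window are placeholders, not anything from the discussion:

```python
# Ad-hoc age-based tiering: move files untouched for N days into a "cold" tree.
# A stand-in for the filesystem-level expiry/retention flags commenters wish existed.
import shutil
import time
from pathlib import Path

RETENTION_DAYS = 365            # assumed policy: untouched for a year -> cold tier
HOT_DIR = Path("/data/hot")     # placeholder paths
COLD_DIR = Path("/data/cold")

def tier_stale_files(hot: Path, cold: Path, retention_days: int) -> None:
    cutoff = time.time() - retention_days * 86400
    for path in list(hot.rglob("*")):               # snapshot the tree before moving anything
        if path.is_file() and path.stat().st_mtime < cutoff:
            target = cold / path.relative_to(hot)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))     # move, not delete: reversible
            print(f"archived {path} -> {target}")

if __name__ == "__main__":
    tier_stale_files(HOT_DIR, COLD_DIR, RETENTION_DAYS)
```

Moving rather than deleting mirrors the cold-tier preference above: the stale data stays recoverable, which is the main argument against aggressive deletion.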
Cultural and Managerial Drivers
- Examples of pointless logging, CI overuse, multi-copy pipelines, and “checkbox” big-data/AI projects that collect data but never use it.
- Some blame vanity metrics and promotion incentives for generating data and systems whose outputs are rarely consumed.
Archives, History, and Link Rot
- Multiple commenters push back against the article’s “few pages get 80% of hits” framing: rare, long-tail content (old government pages, obscure tech docs) can be crucial later.
- Libraries and the Internet Archive are used as analogies: most items are rarely accessed, but that doesn’t make them crap; deletion leads to irreversible knowledge loss and link rot.