38% of webpages that existed in 2013 are no longer accessible a decade later
Code reuse, mirroring, and copyright
- Disagreement on whether you can mirror useful code from random websites to GitHub.
- One view: the answer depends on the license, but mirroring small snippets for educational or archival purposes likely falls under fair use (in the U.S.), and much code is closer to “idea” than “expression.”
- Counterpoints:
  - Fair use is an after-the-fact legal defense, not a shield.
  - Many jurisdictions don’t have fair use at all; in some (e.g., Germany) mirroring could clearly be infringement.
- Safest route is to follow the explicit license or just keep private copies.
Ephemerality vs preservation (“feature or bug”)
- Some argue disappearance is good: forgetting is healthy, storage and attention are finite, and we shouldn’t try to fight entropy.
- Others push back: rediscovering lost scientific/cultural knowledge is costly; future historians and archaeologists benefit from “store everything,” including mundane content (analogy to Sumerian clay-tablet trash heaps).
- Debate over “worthy content”: highly subjective; we can’t predict what will matter in 100–10,000 years.
- Moral angle: if disappearance is treated as a “feature,” who decides what vanishes, especially when data is in private platforms or behind paywalls?
Archiving practices and tools
- Strong praise for the Internet Archive; several commenters donate and use it heavily, but worry about its legal exposure.
- Many now save local copies (PDF, MHTML, SingleFile, EPUB) instead of merely bookmarking.
- Tools mentioned: ArchiveBox, Linkwarden, bookmarklets that auto-save to Wayback, browser extensions, self-hosted link archives.
- Some post their site content to public Git repos so others can mirror or rebuild it.
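The "auto-save to Wayback" bookmarklets mentioned above generally work by sending the current page's address to the Internet Archive. The following is a minimal Python sketch of the same idea, not any commenter's actual tool. It only builds the URLs and makes no network requests. It assumes the publicly documented Save Page Now prefix (`https://web.archive.org/save/`) and the Wayback availability endpoint (`https://archive.org/wayback/available`); the function names are hypothetical.

```python
from urllib.parse import urldefrag, urlencode

# Assumed endpoints: the public "Save Page Now" prefix and the
# Wayback availability API. Check the Internet Archive docs before relying on them.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"
WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"


def save_page_now_url(page_url: str) -> str:
    """Return the URL a bookmarklet would open to request a snapshot."""
    # Drop the #fragment: it is never sent to a server, so it is not part
    # of what gets archived.
    clean_url, _fragment = urldefrag(page_url.strip())
    return WAYBACK_SAVE_PREFIX + clean_url


def availability_query_url(page_url: str, timestamp: str = "") -> str:
    """Return the URL that asks whether a snapshot of page_url exists.

    timestamp is an optional YYYYMMDD-style string used to find the
    snapshot closest to that date.
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY_API + "?" + urlencode(params)


if __name__ == "__main__":
    print(save_page_now_url("https://example.com/page#top"))
    print(availability_query_url("example.com", "20130101"))
```

Keeping URL construction separate from fetching makes it easy to plug the same helpers into a browser extension backend, a cron job, or a self-hosted link archive.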
Technical causes of decay
- Dynamic and database-backed sites are fragile: code rot, dependency changes, CVEs, framework deprecations, and unmaintained APIs kill sites even if the HTML would still work.
- Static sites on stable hosting (e.g., S3, GitHub Pages, plain HTML) are seen as the most durable.
- TLS upkeep, server maintenance, and commercial hosting costs contribute to short lifetimes; lapsed domain registrations are common.
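The failure modes listed above look different from the outside: a lapsed domain or broken TLS setup fails before any HTTP response arrives, a deleted page returns 404 or 410, and a rotting backend tends to return 5xx. The sketch below is a rough link checker that separates these cases. It is an illustration, not a tool from the thread; the category names and the `fetch` hook are assumptions, and a HEAD request is used for brevity even though some servers reject it.

```python
import urllib.error
import urllib.request


def _default_fetch(url: str, timeout: float) -> int:
    """Send a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def classify_link(url: str, fetch=_default_fetch, timeout: float = 10.0) -> str:
    """Classify a link as alive, gone, server_error, unreachable, or other."""
    try:
        status = fetch(url, timeout)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx; keep the status code.
        status = exc.code
    except OSError:
        # Covers DNS failures (lapsed domains), refused connections,
        # timeouts, and TLS errors, all of which subclass OSError.
        return "unreachable"

    if 200 <= status < 400:
        return "alive"
    if status in (404, 410):
        return "gone"
    if 500 <= status < 600:
        return "server_error"
    return "other"
```

The `fetch` parameter exists so the classifier can be exercised without network access, for example `classify_link("http://x", fetch=lambda url, timeout: 503)` returns `"server_error"`.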
Centralization, walled gardens, and discoverability
- Widespread shift from independent sites/forums to Facebook pages, Instagram, Reddit, and Discord:
  - Small businesses often use Facebook as their only presence; people without accounts are excluded or discouraged.
  - Forums and niche communities move to private or semi-private groups/Discords; information becomes harder to search, index, or archive.
- Mixed views:
  - Some lament the loss of an open, diverse “Web 1.0” and the burying of high-signal hobbyist content.
  - Others welcome semi-closed spaces: knowledge stays within communities, less exposed to scraping, SEO spam, and large ML models.
How bad is 38%, and what is lost?
- Some are surprised the number isn’t higher, given business churn and hobby sites dying.
- Others find “62% still alive” troubling in itself, especially for references in Wikipedia, news, government sites, and niche resources (e.g., immigration forums).
- Recognition that search engines increasingly surface SEO-heavy or “content farm” material, while older, high-quality but inactive sites sink or vanish.
- Several respondents have already taken old content offline intentionally, sometimes explicitly because of AI scraping concerns.
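To put the headline figure in perspective, the 38% loss over ten years can be converted into a yearly rate. The sketch below assumes, purely for illustration, that pages disappear at a constant rate each year, which the source does not claim. Under that assumption the loss works out to roughly 4.7% of surviving pages per year, and about 14.5 years for half of a cohort to vanish. The function names are hypothetical.

```python
import math


def annual_loss_rate(fraction_lost: float, years: float) -> float:
    """Yearly loss rate implied by a total loss over a period.

    Assumes a constant per-year survival probability s, so that
    s ** years == 1 - fraction_lost.
    """
    survival = 1.0 - fraction_lost
    return 1.0 - survival ** (1.0 / years)


def half_life_years(fraction_lost: float, years: float) -> float:
    """Years until half of the pages are gone, under the same assumption."""
    survival = 1.0 - fraction_lost
    return years * math.log(0.5) / math.log(survival)


if __name__ == "__main__":
    print(f"annual loss: {annual_loss_rate(0.38, 10):.1%}")
    print(f"half-life:   {half_life_years(0.38, 10):.1f} years")
```

Real decay is unlikely to be this uniform, so the numbers are best read as an order-of-magnitude guide rather than a forecast.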