2024-05-18

38% of webpages that existed in 2013 are no longer accessible a decade later

Code reuse, mirroring, and copyright

Disagreement on whether you can mirror useful code from random websites to GitHub.
One view: licensing “depends,” but mirroring small snippets for educational or archival purposes likely falls under fair use (in the U.S.), and much code is closer to “idea” than “expression.”
Counterpoints:
- Fair use is an after-the-fact legal defense, not a shield.
- Many jurisdictions don’t have fair use at all; in some (e.g., Germany) mirroring could clearly be infringement.
- Safest route is to follow the explicit license or just keep private copies.

Ephemerality vs preservation (“feature or bug”)

Some argue disappearance is good: forgetting is healthy, storage and attention are finite, and we shouldn’t try to fight entropy.
Others push back: rediscovering lost scientific/cultural knowledge is costly; future historians and archaeologists benefit from “store everything,” including mundane content (analogy to Sumerian clay-tablet trash heaps).
Debate over “worthy content”: highly subjective; we can’t predict what will matter in 100–10,000 years.
Moral angle: if disappearance is treated as a “feature,” who decides what vanishes, especially when data is in private platforms or behind paywalls?

Archiving practices and tools

Strong praise for the Internet Archive; several commenters donate and use it heavily, but worry about its legal exposure.
Many now save local copies (PDF, MHTML, SingleFile, EPUB) instead of merely bookmarking.
Tools mentioned: ArchiveBox, Linkwarden, bookmarklets that auto-save to Wayback, browser extensions, self-hosted link archives.
Some post their site content to public Git repos so others can mirror or rebuild it.
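The "auto-save to Wayback" bookmarklets mentioned above generally work by sending the browser to a Wayback Machine save URL for the current page. A minimal Python sketch of the same idea follows. It assumes the publicly known `https://web.archive.org/save/` prefix and the `https://archive.org/wayback/available` lookup endpoint; the function names are hypothetical, and rate limits or authentication requirements on the Internet Archive side are not modeled here.

```python
from urllib.parse import urlencode

# Assumed endpoints (not taken from the thread itself).
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def save_url(page_url: str) -> str:
    """Build the URL a bookmarklet would open to request a fresh capture."""
    return WAYBACK_SAVE_PREFIX + page_url


def availability_url(page_url: str) -> str:
    """Build a lookup URL asking whether a capture of page_url already exists."""
    return WAYBACK_AVAILABLE + "?" + urlencode({"url": page_url})


if __name__ == "__main__":
    # Only prints the URLs; fetching them is left to the caller.
    print(save_url("https://example.com/"))
    print(availability_url("example.com"))
```

Keeping URL construction separate from any network call makes it easy to plug the result into a browser, `curl`, or a scheduled job, whichever fits a given archiving habit.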

Technical causes of decay

Dynamic and database-backed sites are fragile: code rot, dependency changes, CVEs, framework deprecations, and unmaintained APIs kill sites even if the HTML would still work.
Static sites on stable hosting (e.g., S3, GitHub Pages, plain HTML) are seen as the most durable.
TLS upkeep, server maintenance, and the cost of commercial hosting all shorten site lifetimes; lapsed domains are common.
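The failure modes above (lapsed domains, TLS problems, dead servers, removed pages) look different from the outside, and a small link checker can tell them apart. The sketch below is one possible approach, not something described in the thread: the category names and the `classify` and `check` helpers are hypothetical, and it uses only the Python standard library.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map an HTTP status code (or None for no response) to a coarse label."""
    if status is None:
        return "unreachable"  # DNS failure, TLS error, timeout, refused connection
    if 200 <= status < 400:
        return "alive"
    if status in (404, 410):
        return "gone"  # server answers, but the page was removed
    return "error"  # other 4xx/5xx, e.g. a broken backend


def check(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request and classify the outcome."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as exc:
        # Must come first: HTTPError is a subclass of URLError.
        return classify(exc.code)
    except (urllib.error.URLError, OSError):
        return classify(None)
```

Some servers reject `HEAD` requests, so a real checker would probably retry with `GET` before declaring a page dead; a status of "alive" also says nothing about whether the original content is still there.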

Centralization, walled gardens, and discoverability

Widespread shift from independent sites/forums to Facebook pages, Instagram, Reddit, and Discord:
- Small businesses often use Facebook as their only presence; people without accounts are excluded or discouraged.
- Forums and niche communities move to private or semi-private groups/Discords; information becomes harder to search, index, or archive.
Mixed views:
- Some lament the loss of an open, diverse “Web 1.0” and the burying of high-signal hobbyist content.
- Others welcome semi-closed spaces: knowledge stays within communities, less exposed to scraping, SEO spam, and large ML models.

How bad is 38%, and what is lost?

Some are surprised the number isn’t higher, given business churn and hobby sites dying.
Others find “62% still alive” troubling nonetheless, especially for references in Wikipedia, news, government sites, and niche resources (e.g., immigration forums).
Recognition that search engines increasingly surface SEO-heavy or “content farm” material, while older, high-quality but inactive sites sink or vanish.
Several respondents have already taken old content offline intentionally, sometimes explicitly because of AI scraping concerns.

Related topics