ArchiveTeam has finished archiving all goo.gl short links
Scope and Method of the goo.gl Archive
- Commenters confirm “all” means an exhaustive enumeration of the entire goo.gl keyspace, not just URLs already known from prior crawls.
- Volunteers ran a distributed client (“Warrior”) that iterates every possible key and records the HTTP response, spreading requests across many volunteer IPs to avoid rate limits and bans (see the sketch after this list).
- Since goo.gl no longer issues new links, the namespace is finite and can be enumerated to completion.
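
A minimal, single-threaded sketch of what a Warrior-style enumerator does conceptually. The alphabet, key lengths, and goo.gl response behavior here are assumptions for illustration; the real project shards the keyspace across many volunteers and writes full WARC captures rather than printing to stdout:

```python
# Conceptual sketch only: iterate candidate short-link keys and record
# where each one redirects. Not the actual ArchiveTeam tooling.
import itertools
import requests

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def keys(length):
    """Yield every base62 key of the given length, in order."""
    for chars in itertools.product(ALPHABET, repeat=length):
        yield "".join(chars)

def resolve(key):
    """Fetch one short link without following the redirect."""
    resp = requests.get(f"https://goo.gl/{key}", allow_redirects=False, timeout=10)
    # A 301/302 carries the destination in the Location header;
    # a 404 means the key was never assigned.
    return resp.status_code, resp.headers.get("Location")

if __name__ == "__main__":
    for key in itertools.islice(keys(4), 5):  # tiny demo slice
        print(key, resolve(key))
```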
ArchiveTeam vs Internet Archive
- Several comments clarify the title: ArchiveTeam did the crawling and packaging; Internet Archive is mainly the hosting library.
- ArchiveTeam writes site-specific scripts, coordinates volunteers (via Warrior VMs/Docker), and “grazes” rate limits when sites are shutting down.
- They’re described as the “bucket brigade” rescuing data from dying services; Internet Archive is the storage.
- One anecdote highlights how quickly and efficiently ArchiveTeam infrastructure scaled to archive a video platform.
Google’s Policy Shift and Cost Debate
- People question why Google would deprecate “inactive” links, given how simple and cheap a read-only key–value redirector should be to run (a minimal sketch follows this list).
- Several argue the infrastructure cost is negligible at Google’s scale; organizational churn and internal stack turnover are speculated to be the more likely drivers.
- Clarification: “recently clicked” isn’t quite the criterion; links that showed activity in late 2024 are kept, and the rest will break.
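
To make the “simple and cheap” claim concrete, here is a toy read-only redirector using only the Python standard library. The mapping data is hypothetical, and a production service would sit in front of a static key-value store, but the request path really is this thin:

```python
# Toy read-only URL redirector: look up the key, emit a 301 or a 404.
from http.server import BaseHTTPRequestHandler, HTTPServer

MAPPING = {"abc123": "https://example.com/some/long/path"}  # hypothetical data

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = MAPPING.get(self.path.lstrip("/"))
        if target:
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 8080), Redirector).serve_forever()
```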
Dataset Size, Format, and Access
- Confusion over the reported tens to hundreds of TiB is resolved: the data is stored as WARC files containing full HTTP requests and responses, often including captures of the destination pages, not just the short-to-long mappings (see the sketch after this list).
- Some wonder why destination pages were archived at all, since they are no more “at risk” than the rest of the web.
- The WARC sets on archive.org are temporarily access-restricted; the explanation relayed is a concern about being caught up in the broader “AI scraping wars.”
- This frustrates some volunteers who contributed, though others note the content remains accessible via the Wayback Machine, just not as bulk dumps.
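
A hedged sketch of extracting short-to-long mappings from such a dump with the warcio library. The filename is hypothetical, and the assumption is that the redirect responses appear as ordinary WARC response records interleaved with captures of the destination pages:

```python
# Pull key -> destination mappings out of a WARC dump (assumed layout).
from warcio.archiveiterator import ArchiveIterator

with open("googl-shortlinks.warc.gz", "rb") as stream:  # hypothetical file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        short_url = record.rec_headers.get_header("WARC-Target-URI")
        location = record.http_headers.get_header("Location")
        if location:  # a 301/302 record: this is the mapping itself
            print(short_url, "->", location)
```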
Privacy and Ethics of Archiving Short URLs
- Debate over whether anyone should have expected privacy: short URLs are trivially enumerable, so treating them as secrets is called “silly” (the arithmetic sketch after this list makes the point concrete).
- Others worry about sensitive material (private documents, unlisted videos) and draw comparisons to earlier incidents in which private GPT links were archived.
- Counterpoint: preserving history sometimes means acting without explicit consent, especially when a service is being shut down.
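
Back-of-envelope arithmetic behind the “easily enumerable” point; the key lengths and crawl rate are illustrative assumptions, not measurements of the actual project:

```python
# How long would brute-forcing a base62 keyspace take at a given rate?
ALPHABET_SIZE = 62  # [a-zA-Z0-9]

for length in (4, 5, 6):
    keyspace = ALPHABET_SIZE ** length
    # assume a distributed crawl sustaining 10k requests/second
    days = keyspace / 10_000 / 86_400
    print(f"{length}-char keys: {keyspace:,} possibilities, ~{days:,.1f} days at 10k req/s")
```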
Wider Web Archiving and Anti–Link-Rot Efforts
- Discussion of similar archives for Reddit (Pushshift, ArcticShift, AcademicTorrents) and speculation about comparable HN datasets.
- A proposal for a blockchain/P2P global web snapshot meets pushback; some point to Common Crawl as a de facto shared corpus, while acknowledging it is incomplete (see the query sketch below).
- Overall, many celebrate the goo.gl effort as a concrete win against link rot, especially for references embedded in old documents and Stack Overflow posts.
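
For illustration, a sketch of checking whether a URL appears in Common Crawl via its public CDX index API. The collection name below is an example; current collection names are listed at https://index.commoncrawl.org/:

```python
# Query the Common Crawl CDX index for captures of a given URL pattern.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"  # example collection

resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"}, timeout=30)
for line in resp.text.splitlines():
    rec = json.loads(line)  # one JSON object per capture
    print(rec["timestamp"], rec["url"], rec.get("status"))
```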