ArchiveTeam has finished archiving all goo.gl short links

Scope and Method of the goo.gl Archive

  • Commenters confirm “all” means exhaustive enumeration of the entire goo.gl keyspace, not just known URLs.
  • Volunteers ran a distributed client (the “Warrior”) that iterated every possible key and recorded each HTTP response; spreading the work across many volunteer IPs also avoided per-machine bans (see the sketch after this list).
  • Since goo.gl no longer issues new links, the namespace is finite and can be enumerated exhaustively.
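A minimal sketch of that enumeration loop, assuming case-sensitive alphanumeric keys (the usual short-link alphabet). The real Warrior pipeline adds WARC output, rate limiting, retries, and coordinated work distribution; the helper names here are illustrative.

```python
# Illustrative only: walk a short-link keyspace and record where each key
# redirects. ArchiveTeam's actual pipeline writes full WARC records instead.
import itertools
import string

import requests  # third-party: pip install requests

ALPHABET = string.ascii_letters + string.digits  # assumed goo.gl key alphabet

def enumerate_keys(length):
    """Yield every possible key of the given length, in lexicographic order."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "".join(combo)

def record(key):
    """Fetch one short link without following it; return (status, target URL)."""
    resp = requests.head(f"https://goo.gl/{key}",
                         allow_redirects=False, timeout=10)
    return resp.status_code, resp.headers.get("Location")

if __name__ == "__main__":
    for key in itertools.islice(enumerate_keys(5), 3):  # tiny demo slice
        print(key, *record(key))
```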

ArchiveTeam vs Internet Archive

  • Several comments clarify the title: ArchiveTeam did the crawling and packaging; Internet Archive is mainly the hosting library.
  • ArchiveTeam writes site-specific scripts, coordinates volunteers (via Warrior VMs/Docker images), and “grazes” rate limits (runs right up against them) when sites are shutting down.
  • They’re described as the “bucket brigade” rescuing data from dying services; Internet Archive is the storage.
  • One anecdote highlights how quickly and efficiently ArchiveTeam infrastructure scaled to archive a video platform.

Google’s Policy Shift and Cost Debate

  • People question why Google would deprecate “inactive” links at all, given how simple and cheap a read-only key–value redirector should be (a minimal sketch follows this list).
  • Several argue infra costs are negligible for a company like Google; organizational churn and stack churn are speculated as more likely drivers.
  • Clarification: “recently clicked” isn’t quite the criterion; links that showed activity in late 2024 are kept, and all others will break.
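As a sense of scale for the “simple and cheap” claim: once the service stops issuing links, it reduces to a frozen lookup table plus a 301 per hit. A minimal sketch, with an invented two-entry mapping standing in for the real table:

```python
# A read-only key-value redirector in ~20 lines. The MAPPINGS dict is an
# invented stand-in; a real deployment would load the frozen goo.gl table.
from http.server import BaseHTTPRequestHandler, HTTPServer

MAPPINGS = {
    "abc12": "https://example.com/some/long/path",   # invented entries
    "xyz99": "https://example.org/another/page",
}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = MAPPINGS.get(self.path.lstrip("/"))
        if target:
            self.send_response(301)                  # permanent redirect
            self.send_header("Location", target)
        else:
            self.send_response(404)                  # unknown key
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Redirector).serve_forever()
```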

Dataset Size, Format, and Access

  • Confusion over the reported tens to hundreds of TiB prompts an explanation: the data is stored as WARC files containing full HTTP requests and responses, often including the destination page content rather than just the short-link mappings (see the sketch after this list).
  • Some wonder why destination pages are archived given they’re no more “at risk” than the rest of the web.
  • The WARC sets on archive.org are temporarily access-restricted; the explanation relayed is a concern about the collections being blocked amid the broader “AI scraping wars.”
  • This frustrates some volunteers who contributed, though others note the content is still accessible via the Wayback Machine, just not as bulk dumps.
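For readers unfamiliar with the format, here is a sketch of reading such a dump with the warcio library; it shows why the archives are so large, since each record carries the full HTTP exchange. The filename is a placeholder:

```python
# Iterate a (placeholder-named) WARC and print each short link, its HTTP
# status, and the redirect target captured in the response headers.
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("googl-shortlinks.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata/warcinfo records
        short_url = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        target = record.http_headers.get_header("Location")
        print(short_url, status, target or "(no redirect)")
```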

Privacy and Ethics of Archiving Short URLs

  • Debate over whether anyone should have expected privacy: short URLs are easily enumerable (see the arithmetic after this list), so treating them as secrets is called “silly.”
  • Others worry about sensitive material (private docs, unlisted videos) and compare this to earlier incidents in which private GPT links were archived.
  • Counterpoint: preserving history sometimes requires acting without explicit consent when services are being shuttered.
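The back-of-envelope arithmetic behind “easily enumerable,” assuming the usual 62-character alphabet; the request rate is an illustrative figure, not ArchiveTeam’s actual throughput:

```python
# Keyspace sizes for short alphanumeric keys, and how long a brute-force
# sweep would take at an assumed aggregate rate of 100k requests/second.
ALPHABET = 62  # a-z, A-Z, 0-9

for length in (4, 5, 6):
    keys = ALPHABET ** length
    days = keys / 100_000 / 86_400
    print(f"{length}-char keys: {keys:>14,} possibilities "
          f"≈ {days:,.1f} days at 100k req/s")
```

Even six-character keys (roughly 57 billion possibilities) fall to a coordinated crawl within days, which is why commenters argue short links were never secrets.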

Wider Web Archiving and Anti–Link-Rot Efforts

  • Discussion of similar archives for Reddit (Pushshift, ArcticShift, AcademicTorrents), and speculation about HN datasets.
  • A proposal for a blockchain/P2P global web snapshot meets pushback, with some pointing to Common Crawl as a de facto shared corpus, though one acknowledged as incomplete (see the index-lookup sketch below).
  • Overall, many celebrate the goo.gl effort as a concrete win against link rot, especially for references embedded in old documents and Stack Overflow posts.
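On the Common Crawl point, here is a sketch of checking whether a specific URL appears in one crawl via its public CDX index API; the crawl ID below is just one example collection (current IDs are listed at index.commoncrawl.org):

```python
# Query Common Crawl's CDX index for exact-URL captures and print the
# first few hits (timestamp, URL, HTTP status).
import json
import urllib.request

CRAWL = "CC-MAIN-2024-33"  # example crawl ID; substitute a current one
query = ("https://index.commoncrawl.org/" + CRAWL + "-index"
         "?url=https://stackoverflow.com/&output=json")

with urllib.request.urlopen(query) as resp:
    for line in resp.read().decode().splitlines()[:5]:
        hit = json.loads(line)  # one JSON object per captured page
        print(hit["timestamp"], hit["url"], hit["status"])
```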