ArchiveTeam has finished archiving all goo.gl short links

Scope and Method of the goo.gl Archive

  • Commenters confirm “all” means exhaustive enumeration of the entire goo.gl keyspace, not just known URLs.
  • Volunteers ran a distributed client (the “Warrior”) that iterated every possible key and recorded each HTTP response; spreading the work across many volunteer IPs also avoided per-machine bans (see the sketch after this list).
  • Since goo.gl no longer issues new links, the namespace is finite and can be enumerated exhaustively.
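A minimal sketch of that enumeration loop, assuming case-sensitive alphanumeric keys (the usual short-link alphabet). The real Warrior pipeline adds WARC output, rate limiting, retries, and coordinated work distribution; the helper names here are illustrative.

```python
# Illustrative only: walk a short-link keyspace and record where each key
# redirects. ArchiveTeam's actual pipeline writes full WARC records instead.
import itertools
import string

import requests  # third-party: pip install requests

ALPHABET = string.ascii_letters + string.digits  # assumed goo.gl key alphabet

def enumerate_keys(length):
    """Yield every possible key of the given length, in lexicographic order."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield "".join(combo)

def record(key):
    """Fetch one short link without following it; return (status, target URL)."""
    resp = requests.head(f"https://goo.gl/{key}",
                         allow_redirects=False, timeout=10)
    return resp.status_code, resp.headers.get("Location")

if __name__ == "__main__":
    for key in itertools.islice(enumerate_keys(5), 3):  # tiny demo slice
        print(key, *record(key))
```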

ArchiveTeam vs Internet Archive

  • Several comments clarify the title: ArchiveTeam did the crawling and packaging; Internet Archive is mainly the hosting library.
  • ArchiveTeam writes site-specific scripts, coordinates volunteers (via Warrior VMs/Docker images), and “grazes” rate limits (runs right up against them) when sites are shutting down.
  • They’re described as the “bucket brigade” rescuing data from dying services; Internet Archive is the storage.
  • One anecdote highlights how quickly and efficiently ArchiveTeam infrastructure scaled to archive a video platform.

Google’s Policy Shift and Cost Debate

  • People question why Google would deprecate “inactive” links at all, given how simple and cheap a read-only key–value redirector should be (a minimal sketch follows this list).
  • Several argue infra costs are negligible for a company like Google; organizational churn and stack churn are speculated as more likely drivers.
  • Clarification: “recently clicked” isn’t quite the criterion; links that showed activity in late 2024 are kept, and all others will break.
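As a sense of scale for the “simple and cheap” claim: once the service stops issuing links, it reduces to a frozen lookup table plus a 301 per hit. A minimal sketch, with an invented two-entry mapping standing in for the real table:

```python
# A read-only key-value redirector in ~20 lines. The MAPPINGS dict is an
# invented stand-in; a real deployment would load the frozen goo.gl table.
from http.server import BaseHTTPRequestHandler, HTTPServer

MAPPINGS = {
    "abc12": "https://example.com/some/long/path",   # invented entries
    "xyz99": "https://example.org/another/page",
}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = MAPPINGS.get(self.path.lstrip("/"))
        if target:
            self.send_response(301)                  # permanent redirect
            self.send_header("Location", target)
        else:
            self.send_response(404)                  # unknown key
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Redirector).serve_forever()
```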

Dataset Size, Format, and Access

  • Confusion over the reported tens to hundreds of TiB prompts an explanation: the data is stored as WARC files containing full HTTP requests and responses, often including the destination page content rather than just the short-link mappings (see the sketch after this list).
  • Some wonder why destination pages are archived given they’re no more “at risk” than the rest of the web.
  • The WARC sets on archive.org are temporarily access-restricted; the explanation relayed is a concern about the collections being blocked amid the broader “AI scraping wars.”
  • This frustrates some volunteers who contributed, though others note the content is still accessible via the Wayback Machine, just not as bulk dumps.
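For readers unfamiliar with the format, here is a sketch of reading such a dump with the warcio library; it shows why the archives are so large, since each record carries the full HTTP exchange. The filename is a placeholder:

```python
# Iterate a (placeholder-named) WARC and print each short link, its HTTP
# status, and the redirect target captured in the response headers.
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("googl-shortlinks.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata/warcinfo records
        short_url = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        target = record.http_headers.get_header("Location")
        print(short_url, status, target or "(no redirect)")
```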

Privacy and Ethics of Archiving Short URLs

  • Debate over whether anyone should have expected privacy: short URLs are easily enumerable (see the arithmetic after this list), so treating them as secrets is called “silly.”
  • Others worry about sensitive material (private docs, unlisted videos) and compare this to earlier incidents in which private GPT links were archived.
  • Counterpoint: preserving history sometimes requires acting without explicit consent when services are being shuttered.
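The back-of-envelope arithmetic behind “easily enumerable,” assuming the usual 62-character alphabet; the request rate is an illustrative figure, not ArchiveTeam’s actual throughput:

```python
# Keyspace sizes for short alphanumeric keys, and how long a brute-force
# sweep would take at an assumed aggregate rate of 100k requests/second.
ALPHABET = 62  # a-z, A-Z, 0-9

for length in (4, 5, 6):
    keys = ALPHABET ** length
    days = keys / 100_000 / 86_400
    print(f"{length}-char keys: {keys:>14,} possibilities "
          f"≈ {days:,.1f} days at 100k req/s")
```

Even six-character keys (roughly 57 billion possibilities) fall to a coordinated crawl within days, which is why commenters argue short links were never secrets.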

Wider Web Archiving and Anti–Link-Rot Efforts

  • Discussion of similar archives for Reddit (Pushshift, ArcticShift, AcademicTorrents), and speculation about HN datasets.
  • A proposal for a blockchain/P2P global web snapshot meets pushback, with some pointing to Common Crawl as a de facto shared corpus, though one acknowledged as incomplete (see the index-lookup sketch below).
  • Overall, many celebrate the goo.gl effort as a concrete win against link rot, especially for references embedded in old documents and Stack Overflow posts.
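On the Common Crawl point, here is a sketch of checking whether a specific URL appears in one crawl via its public CDX index API; the crawl ID below is just one example collection (current IDs are listed at index.commoncrawl.org):

```python
# Query Common Crawl's CDX index for exact-URL captures and print the
# first few hits (timestamp, URL, HTTP status).
import json
import urllib.request

CRAWL = "CC-MAIN-2024-33"  # example crawl ID; substitute a current one
query = ("https://index.commoncrawl.org/" + CRAWL + "-index"
         "?url=https://stackoverflow.com/&output=json")

with urllib.request.urlopen(query) as resp:
    for line in resp.read().decode().splitlines()[:5]:
        hit = json.loads(line)  # one JSON object per captured page
        print(hit["timestamp"], hit["url"], hit["status"])
```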