We can't have nice things because of AI scrapers
MetaBrainz changes and “nice things”
- Many commenters see MetaBrainz’s new auth requirements (tokens for lookup APIs, requiring login for ListenBrainz Radio, removing debug endpoints) as modest and reasonable, but lament the need to add friction at all (a token-auth sketch follows this list).
- Some question the title: what’s really being lost is unauthenticated, frictionless API access and open infrastructure for casual users and learners.
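To make the "friction" concrete, here is a minimal sketch of what a token-gated lookup tends to look like from the client side. The endpoint path is a placeholder and the `Authorization: Token ...` header is an assumption modeled on ListenBrainz-style auth; the exact MetaBrainz mechanism may differ.

```python
import os
import requests

# Hypothetical token-gated lookup: the endpoint path and response shape are
# illustrative placeholders, not the documented MetaBrainz API.
API_TOKEN = os.environ.get("LISTENBRAINZ_TOKEN", "")

resp = requests.get(
    "https://api.listenbrainz.org/1/some-lookup-endpoint",  # placeholder path
    headers={"Authorization": f"Token {API_TOKEN}"},         # ListenBrainz-style token header (assumption)
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```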
Technical defenses against AI scrapers
- Cloudflare’s “AI Labyrinth” and similar tarpits (e.g., iocaine, Anubis, Poison Fountain) are discussed: they detect likely AI scrapers and serve infinite junk or mazes (a minimal tarpit sketch follows this list).
- Objections: using Cloudflare centralizes control, degrades UX (VPNs, shared IPs, uncommon browsers), and may exempt paying scrapers.
- DIY tarpits and IP/ASN blocking help somewhat but cost bandwidth and are undermined by residential proxies, rotating IPs, and headless browsers that ignore honeypot links.
- Some suggest per-IP/netblock request budgets, sophisticated rate limiting, and more efficient backends (see the request-budget sketch below); others say bots will still overwhelm small projects.
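As a sketch of the tarpit idea mentioned above, here is a stand-alone server using only the Python standard library that serves an endless maze of junk pages to suspected scrapers. The detection rule (a naive User-Agent substring check) is a placeholder; real tools like iocaine or Anubis rely on far richer signals.

```python
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder heuristic: real tarpits use behavioural and fingerprint signals,
# not a simple User-Agent substring check.
SUSPECT_AGENTS = ("python-requests", "scrapy", "gptbot")

def junk_page(depth: int) -> bytes:
    """Generate a page of random text plus links leading deeper into the maze."""
    words = " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(200)
    )
    links = "".join(
        f'<a href="/maze/{depth + 1}/{random.randint(0, 9999)}">more</a> '
        for _ in range(10)
    )
    return f"<html><body><p>{words}</p>{links}</body></html>".encode()

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "").lower()
        if any(s in agent for s in SUSPECT_AGENTS) or self.path.startswith("/maze/"):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(junk_page(self.path.count("/")))
        else:
            # A real deployment would proxy normal visitors to the actual site;
            # this sketch just returns 404 for them.
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```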
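And here is a minimal sketch of the per-IP/netblock request budgets suggested in the last bullet: a token-bucket limiter keyed on the containing /24 (IPv4) or /48 (IPv6) network rather than the individual address, so rotating IPs inside one block does not reset the budget. Capacity and refill rate are illustrative values.

```python
import ipaddress
import time
from collections import defaultdict

BUCKET_CAPACITY = 60      # max burst of requests per netblock (illustrative)
REFILL_PER_SECOND = 1.0   # sustained rate per netblock (illustrative)

# netblock -> (tokens remaining, timestamp of last refill)
_buckets = defaultdict(lambda: (float(BUCKET_CAPACITY), time.monotonic()))

def netblock(ip: str) -> str:
    """Group addresses by /24 (IPv4) or /48 (IPv6) so rotation inside a block is pooled."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def allow(ip: str) -> bool:
    """Classic token bucket: refill based on elapsed time, spend one token per request."""
    key = netblock(ip)
    tokens, last = _buckets[key]
    now = time.monotonic()
    tokens = min(BUCKET_CAPACITY, tokens + (now - last) * REFILL_PER_SECOND)
    if tokens >= 1.0:
        _buckets[key] = (tokens - 1.0, now)
        return True
    _buckets[key] = (tokens, now)
    return False

if __name__ == "__main__":
    # Two addresses in the same /24 share one budget.
    print(allow("203.0.113.10"), allow("203.0.113.99"))
```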
Bulk data dumps vs page-by-page scraping
- MetaBrainz already offers full DB dumps and torrents, yet scrapers still crawl page-by-page, ignoring robots.txt and bulk-download options.
- This is framed as a coordination failure: sites assume good faith; large crawlers assume adversarial sites and just run generic DFS scrapers.
- Several people propose standards: .well-known machine-readable files, llms.txt, or explicit “here is the canonical dump” metadata, possibly with deltas/ETags (a conditional-download sketch follows below).
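None of these proposed standards exist yet, so the following is purely a sketch of the crawler-side behaviour being argued for: discover the canonical dump from a hypothetical /.well-known/data-dumps manifest (path and fields are invented for illustration), then use ordinary HTTP conditional requests (ETag / If-None-Match) so an unchanged dump is never re-downloaded.

```python
import json
import pathlib
import requests

SITE = "https://example.org"                  # stand-in for any data-publishing site
MANIFEST_PATH = "/.well-known/data-dumps"     # hypothetical path, not a real standard
STATE_FILE = pathlib.Path("dump_etag.json")   # remembers the last ETag we saw

def fetch_dump() -> None:
    # 1. Discover the canonical bulk dump instead of crawling page-by-page.
    manifest = requests.get(SITE + MANIFEST_PATH, timeout=30).json()
    dump_url = manifest["latest_dump_url"]    # assumed manifest field

    # 2. Conditional download: send the previous ETag, expect 304 if unchanged.
    headers = {}
    if STATE_FILE.exists():
        headers["If-None-Match"] = json.loads(STATE_FILE.read_text())["etag"]

    resp = requests.get(dump_url, headers=headers, timeout=300)
    if resp.status_code == 304:
        print("Dump unchanged since last fetch; nothing to download.")
        return

    resp.raise_for_status()
    pathlib.Path("dump.tar.zst").write_bytes(resp.content)
    STATE_FILE.write_text(json.dumps({"etag": resp.headers.get("ETag", "")}))

if __name__ == "__main__":
    fetch_dump()
```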
Economics, incentives, and “tragedy of the commons”
- Widely shared view: AI companies are externalizing crawling costs onto volunteer or low-budget projects, similar to SQLite’s experience.
- Some think new standards or tip-address mechanisms (e.g., in llms.txt) could align incentives; others are skeptical scrapers that already ignore robots.txt will respect new signals or pay.
- Blocking large IP ranges and clouds harms legitimate API users and smaller good-faith bots (see the netblock example below), but many feel forced into it.
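To make the collateral-damage point concrete, this tiny standard-library example shows how many addresses a single coarse block covers; the prefixes are illustrative, not any particular provider's allocation.

```python
import ipaddress

# Illustrative prefix lengths only; real cloud allocations vary widely.
for prefix in ("/24", "/16", "/13"):
    net = ipaddress.ip_network("198.51.100.0" + prefix, strict=False)
    print(f"{net}: {net.num_addresses:,} addresses blocked at once")
```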
Impact on small sites and the open web
- Multiple anecdotes of small personal or hobby sites being taken down, pushed to static hosting, or put behind donation/login walls due to scraping load or host suspensions.
- AI summaries and browser-integrated summarizers are seen as further eroding traffic and incentives to publish, while still feeding models.
Normative debate and analogies
- Strong language (“evil”, “shitty and selfish”, “destroying the free internet”) is used against aggressive scrapers; a minority finds this rhetoric overblown or inevitable.
- Some compare today’s anger to early complaints about search indexers, predicting eventual acceptance; others reject this analogy since search sent traffic back, whereas LLMs often don’t.