I fear for the unauthenticated web
Copyright, Fair Use, and Legal Tactics
- Some propose aggressive copyright notices or per-word fees to deter LLM training; others say such footer text is legally meaningless without an actual contract or EULA click-through.
- Debate over whether LLM training is fair use:
  - One side expects courts to treat training as transformative and non-infringing.
  - Another cites recent fair-use rulings (e.g. Warhol) and argues market harm and paid licensing deals make “fair use” unlikely.
- Others shift focus from copyright to computer-misuse laws (e.g. the CFAA): if your terms of access explicitly ban AI training, every non-compliant GET request could arguably constitute unauthorized access.
- Skepticism that individuals can realistically enforce any of this against large AI companies with deep pockets and little regard for copyright.
Scraping Ethics and Changing Norms
- Some note the tech community previously cheered unrestricted scraping (e.g. LinkedIn cases) and argue the law hasn’t changed—only people’s feelings about AI.
- Others distinguish normal indexing from LLM crawlers that ignore robots.txt, spoof user agents, and cause heavy load, likening the latter to abusive bots rather than traditional search engines.
- There’s dissatisfaction that LLMs effectively republish and profit from others’ work without attribution.
Costs Externalized to Small Sites
- Core concern: site owners are literally paying for bandwidth and compute so AI companies can extract value.
- This hurts most on usage-billed platforms (Vercel, Cloud Run, and other clouds without hard billing caps).
- Rate limiting is seen as a precursor to putting more content behind logins/paywalls, degrading the open web.
Defenses: Rate Limits, CDNs, and Proof‑of‑Work
- Suggestions include strict rate limiting, mandatory respect for robots.txt, accurate scraper identification, and legal penalties for misbehaving crawlers.
- Some recommend Cloudflare or similar CDNs; others fear over-centralization, opaque business practices, account shutdowns, and invasive bot challenges.
- Proof-of-work schemes (e.g. Anubis, as used by GNOME’s GitLab) are floated as a way to throttle anonymous traffic (a minimal sketch follows this list), though people note targeted scrapers can adapt with headless browsers and cookie reuse.
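
A minimal hashcash-style sketch of how such a proof-of-work gate can work, in the spirit of Anubis but not its actual implementation: the server issues a random challenge plus a difficulty, the visitor (normally in-browser JavaScript) brute-forces a nonce whose hash has enough leading zero bits, and the server verifies with a single hash. The difficulty value and function names are illustrative assumptions.

```python
# Hashcash-style proof-of-work gate (illustrative sketch, not Anubis's real code).
import hashlib
import os

DIFFICULTY_BITS = 18  # assumed tuning knob: higher = more client CPU per challenge


def issue_challenge() -> str:
    """Server side: random challenge string embedded in the interstitial page."""
    return os.urandom(16).hex()


def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce that meets the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)           # the expensive part, done by the visitor
    assert verify(c, n)    # the cheap part, done by the server
```

The asymmetry is the point: verification costs one hash while solving costs hundreds of thousands, so anonymous bulk crawling gets expensive. As the thread notes, a targeted scraper can still run the challenge in a headless browser and reuse the resulting cookie, so this raises costs rather than blocking access outright.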
Micropayments and HTTP 402
- Several commenters see a fit for per-request micropayments (e.g. L402, HTTP 402 “Payment Required”) so scrapers pay for the resources they consume; a sketch of the 402 flow follows this list.
- Others note this is conceptually similar to current “CPU payment” via heavy frontends or PoW challenges.
- There’s hope that machines might handle micropayments better than humans did, though this would likely accelerate paywalling.
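
For illustration, a minimal sketch of the 402 flow being described, using only Python’s standard library. It is loosely shaped like L402’s challenge/response, but it is not an L402 implementation (real L402 ties a macaroon to a Lightning invoice); the header format, token check, and price below are assumptions made up for this example.

```python
# Toy "pay per request" gate built on HTTP 402 (illustrative, not real L402).
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_MSATS = 100  # hypothetical per-request price
VALID_TOKENS = {"example-prepaid-token"}  # stand-in for real payment verification


class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        token = self.headers.get("Authorization", "").removeprefix("Bearer ").strip()
        if token not in VALID_TOKENS:
            # No proof of payment: answer 402 and describe how to pay.
            self.send_response(402)
            self.send_header("WWW-Authenticate", f'Payment msats="{PRICE_MSATS}"')
            self.end_headers()
            self.wfile.write(b"Payment required for automated access.\n")
            return
        # Paid (or otherwise whitelisted) request: serve the content.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, paying crawler.\n")


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8402), PaywalledHandler).serve_forever()
```

A scraper that wants the content retries with an `Authorization: Bearer <token>` header once it has paid; human visitors would be exempted by other signals (sessions, challenges, or simply free tiers).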
Good vs Bad Bots
- A proposed distinction:
  - “Good bots”: search crawlers and useful automation that obey robots.txt, identify themselves, and rate-limit.
  - “Bad bots”: LLM scrapers, spam, fraud, DDoS—anything that increases costs or degrades service.
- Verifying the big search bots (Google, Bing) is straightforward via their published reverse-DNS checks (sketched below), but this may entrench incumbents and make life harder for new search engines.
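
The method Google and Bing document is a two-step DNS check: reverse-resolve the requesting IP, confirm the hostname falls under the engine’s domain, then forward-resolve that hostname and confirm it maps back to the same IP. A small sketch, using the commonly documented domain suffixes (check each engine’s current docs before relying on them):

```python
# Reverse/forward DNS verification of major search crawlers (sketch).
import socket

VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_search_bot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS (PTR lookup)
    except socket.herror:
        return False
    if not hostname.endswith(VERIFIED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips


if __name__ == "__main__":
    print(is_verified_search_bot("66.249.66.1"))  # a Googlebot range, for illustration
```

A new search engine without this kind of published verification infrastructure has no equivalent way to prove itself to site operators, which is the entrenchment worry raised above.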
Centralization and Cloudflare Concerns
- Many dislike the growing dependence on a few CDNs, both for power concentration and jurisdictional control over traffic.
- Multiple anecdotes describe Cloudflare as a “protection racket”: free or cheap at first, then expensive upsells, bandwidth surprises, or abrupt service changes.
- Others defend Cloudflare’s technical quality while acknowledging philosophical and market-power worries.
Broader Reactions to LLMs and the Open Web
- Some are unbothered, having always assumed anything online is public and scrapable; they see LLMs as just another user of data and find them practically useful.
- Others feel viscerally exploited: they welcome humans reusing their work (e.g. YouTube videos with credit) but resent high-leverage automated reuse without consent or attribution.
- A recurring cynical stance: “If you don’t want it used, don’t put it online,” which others argue leads directly to the death of the open, unauthenticated web.
Meta: Blogspam and Curation
- A subthread criticizes the linked post as thin “blogspam” that adds little beyond an earlier, more in-depth article.
- Others defend short commentary posts as legitimate curation and participation in the “participatory web,” especially compared to fully machine-generated content.