I fear for the unauthenticated web

Copyright, Fair Use, and Legal Tactics

  • Some propose aggressive copyright notices or per-word fees to deter LLM training; others say such footer text is legally meaningless without an actual contract or EULA click-through.
  • Debate over whether LLM training is fair use:
    • One side expects courts to treat training as transformative and non-infringing.
    • Another cites recent fair-use rulings (e.g. Andy Warhol Foundation v. Goldsmith) and argues that market harm and existing paid licensing deals make “fair use” unlikely.
  • Others shift focus from copyright to computer-misuse laws (e.g. the CFAA): if a site’s terms of access explicitly ban AI training, every non-compliant GET request could arguably be unauthorized access.
  • Skepticism that individuals can realistically enforce any of this against large AI companies with deep pockets and little regard for copyright.

Scraping Ethics and Changing Norms

  • Some note the tech community previously cheered unrestricted scraping (e.g. the hiQ v. LinkedIn cases) and argue the law hasn’t changed; only people’s feelings about AI have.
  • Others distinguish normal search indexing from LLM crawlers that ignore robots.txt, spoof user agents, and cause heavy load, likening the latter to abusive bots rather than traditional search engines.
  • There’s dissatisfaction that LLMs effectively republish and profit from others’ work without attribution.

Costs Externalized to Small Sites

  • Core concern: site owners are literally paying for bandwidth and compute so AI companies can extract value.
  • This particularly hurts on usage-based platforms (Vercel, Cloud Run, clouds without hard billing caps).
  • Rate limiting is seen as a precursor to putting more content behind logins/paywalls, degrading the open web.

Defenses: Rate Limits, CDNs, and Proof‑of‑Work

  • Suggestions include strict rate limiting (a token-bucket sketch follows this list), mandatory respect for robots.txt, accurate scraper identification, and legal penalties for misbehaving crawlers.
  • Some recommend Cloudflare or similar CDNs; others fear over-centralization, opaque business practices, account shutdowns, and invasive bot challenges.
  • Proof-of-work schemes (e.g. Anubis, as used by GNOME’s GitLab) are floated as a way to throttle anonymous traffic, though people note targeted scrapers can adapt with headless browsers and cookie reuse.
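
In practice, “strict rate limiting” usually means something like a per-client token bucket. A minimal sketch in Python, with made-up rate and burst numbers; a real deployment would more likely live in the web server or CDN layer than in application code:

    import time
    from collections import defaultdict

    class TokenBucket:
        # Illustrative per-client limiter: `rate` requests per second, bursts up to `capacity`.
        def __init__(self, rate=2.0, capacity=10.0):
            self.rate, self.capacity = rate, capacity
            self.tokens = defaultdict(lambda: capacity)    # each client starts with a full bucket
            self.last_seen = defaultdict(time.monotonic)

        def allow(self, client_id: str) -> bool:
            now = time.monotonic()
            elapsed = now - self.last_seen[client_id]
            self.last_seen[client_id] = now
            # Refill in proportion to the time since this client's last request, capped at capacity.
            self.tokens[client_id] = min(self.capacity, self.tokens[client_id] + elapsed * self.rate)
            if self.tokens[client_id] >= 1.0:
                self.tokens[client_id] -= 1.0
                return True
            return False  # caller would answer 429 Too Many Requests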
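
The proof-of-work idea behind tools like Anubis is to make each anonymous request cost the client some CPU before the page is served. A rough sketch of the underlying scheme, assuming a SHA-256 puzzle and an arbitrary 20-bit difficulty (Anubis itself runs the solver as JavaScript in the visitor’s browser and differs in detail):

    import hashlib
    import secrets

    DIFFICULTY_BITS = 20  # arbitrary example: roughly a million hash attempts on average

    def issue_challenge() -> str:
        # Server side: a random challenge binds the work to this visit.
        return secrets.token_hex(16)

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def solve(challenge: str) -> int:
        # Client side: brute-force a counter until the hash clears the difficulty bar.
        counter = 0
        while leading_zero_bits(hashlib.sha256(f"{challenge}:{counter}".encode()).digest()) < DIFFICULTY_BITS:
            counter += 1
        return counter

    def verify(challenge: str, counter: int) -> bool:
        # Server side: a single hash suffices to check the submitted work.
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS

As commenters point out, this only raises the price of scraping; a targeted crawler with a headless browser can still solve the puzzle once and reuse the resulting cookie.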

Micropayments and HTTP 402

  • Several commenters see a fit for per-request micropayments (e.g. L402, HTTP 402 “Payment Required”) so scrapers pay for the resources they consume; a rough sketch of the flow follows this list.
  • Others note this is conceptually similar to current “CPU payment” via heavy frontends or PoW challenges.
  • There’s hope that machines might handle micropayments better than humans did, though this would likely accelerate paywalling.
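
The 402 flow itself is simple: refuse requests that carry no proof of payment, tell the client how to pay, and serve the content once a valid token is presented. A toy sketch using only the Python standard library; the header names and the in-memory token set are invented for illustration, and the actual L402 scheme (Lightning invoices plus macaroons) is considerably more involved:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Stand-in for a real payment backend (e.g. Lightning invoices under L402).
    PAID_TOKENS = {"demo-token"}

    class PaywalledHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            token = self.headers.get("X-Payment-Token")  # hypothetical header
            if token not in PAID_TOKENS:
                # 402 Payment Required: tell the client how to obtain access.
                self.send_response(402)
                self.send_header("X-Payment-Invoice", "invoice-or-price-goes-here")  # hypothetical header
                self.end_headers()
                self.wfile.write(b"Payment required for this resource.\n")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Paid content.\n")

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8402), PaywalledHandler).serve_forever()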

Good vs Bad Bots

  • A proposed distinction:
    • “Good bots”: search crawlers and useful automation that obey robots.txt, identify themselves, and rate-limit.
    • “Bad bots”: LLM scrapers, spam, fraud, DDoS—anything that increases costs or degrades service.
  • Verifying the big search bots (Google, Bing) is straightforward via their published reverse-DNS methods (sketched below), but this may entrench incumbents and make life harder for new search engines.
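
The published method is a reverse-DNS lookup followed by a forward confirmation: resolve the visiting IP to a hostname, check that the hostname sits under the crawler’s documented domain, then resolve that hostname back and confirm it matches the original IP. A sketch in Python; the suffix list reflects Google’s and Bing’s published guidance but should be checked against their current documentation:

    import socket

    CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_verified_search_bot(ip: str) -> bool:
        # Step 1: reverse DNS. What hostname does this IP claim to be?
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False
        # Step 2: the hostname must sit under a documented crawler domain.
        if not hostname.endswith(CRAWLER_SUFFIXES):
            return False
        # Step 3: forward-confirm, since anyone can publish a misleading PTR record.
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except OSError:
            return False
        return ip in addresses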

Centralization and Cloudflare Concerns

  • Many dislike the growing dependence on a few CDNs, both for power concentration and jurisdictional control over traffic.
  • Multiple anecdotes describe Cloudflare as a “protection racket”: free or cheap at first, then expensive upsells, bandwidth surprises, or abrupt service changes.
  • Others defend Cloudflare’s technical quality while acknowledging philosophical and market-power worries.

Broader Reactions to LLMs and the Open Web

  • Some are unbothered, having always assumed anything online is public and scrapable; they see LLMs as just another user of data and find them practically useful.
  • Others feel viscerally exploited: they welcome humans reusing their work (e.g. YouTube videos with credit) but resent high-leverage automated reuse without consent or attribution.
  • A recurring cynical stance: “If you don’t want it used, don’t put it online,” which others argue leads directly to the death of the open, unauthenticated web.

Meta: Blogspam and Curation

  • A subthread criticizes the linked post as thin “blogspam” that adds little beyond an earlier, more in-depth article.
  • Others defend short commentary posts as legitimate curation and participation in the “participatory web,” especially compared to fully machine-generated content.