I fear for the unauthenticated web
Copyright, Fair Use, and Legal Tactics
- Some propose aggressive copyright notices or per-word fees to deter LLM training; others say such footer text is legally meaningless without an actual contract or EULA click-through.
- Debate over whether LLM training is fair use:
  - One side expects courts to treat training as transformative and non-infringing.
  - Another cites recent fair-use rulings (e.g. Warhol) and argues market harm and paid licensing deals make “fair use” unlikely.
- Others shift focus from copyright to computer-misuse laws (e.g. the CFAA): if your terms of access explicitly ban AI training, every non-compliant GET request could arguably constitute unauthorized access.
- Skepticism that individuals can realistically enforce any of this against large AI companies with deep pockets and little regard for copyright.
Scraping Ethics and Changing Norms
- Some note the tech community previously cheered unrestricted scraping (e.g. LinkedIn cases) and argue the law hasn’t changed—only people’s feelings about AI.
- Others distinguish normal indexing from LLM crawlers that ignore robots.txt, spoof user agents, and cause heavy load, likening the latter to abusive bots rather than traditional search engines.
- There’s dissatisfaction that LLMs effectively republish and profit from others’ work without attribution.
Costs Externalized to Small Sites
- Core concern: site owners are literally paying for bandwidth and compute so AI companies can extract value.
- This hurts most on usage-billed platforms (Vercel, Cloud Run, and other clouds without hard billing caps).
- Rate limiting is seen as a precursor to putting more content behind logins/paywalls, degrading the open web.
Defenses: Rate Limits, CDNs, and Proof‑of‑Work
- Suggestions include strict rate limiting, mandatory respect for robots.txt, accurate scraper identification, and legal penalties for misbehaving crawlers.
- Some recommend Cloudflare or similar CDNs; others fear over-centralization, opaque business practices, account shutdowns, and invasive bot challenges.
- Proof-of-work schemes (e.g. Anubis, as used by GNOME’s GitLab) are floated as a way to throttle anonymous traffic (a minimal sketch follows this list), though people note targeted scrapers can adapt with headless browsers and cookie reuse.
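
A minimal hashcash-style sketch of how such a proof-of-work gate can work, in the spirit of Anubis but not its actual implementation: the server issues a random challenge plus a difficulty, the visitor (normally in-browser JavaScript) brute-forces a nonce whose hash has enough leading zero bits, and the server verifies with a single hash. The difficulty value and function names are illustrative assumptions.

```python
# Hashcash-style proof-of-work gate (illustrative sketch, not Anubis's real code).
import hashlib
import os

DIFFICULTY_BITS = 18  # assumed tuning knob: higher = more client CPU per challenge


def issue_challenge() -> str:
    """Server side: random challenge string embedded in the interstitial page."""
    return os.urandom(16).hex()


def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce that meets the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)           # the expensive part, done by the visitor
    assert verify(c, n)    # the cheap part, done by the server
```

The asymmetry is the point: verification costs one hash while solving costs hundreds of thousands, so anonymous bulk crawling gets expensive. As the thread notes, a targeted scraper can still run the challenge in a headless browser and reuse the resulting cookie, so this raises costs rather than blocking access outright.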
Micropayments and HTTP 402
- Several commenters see a fit for per-request micropayments (e.g. L402, HTTP 402 “Payment Required”) so scrapers pay for the resources they consume; a sketch of the 402 flow follows this list.
- Others note this is conceptually similar to current “CPU payment” via heavy frontends or PoW challenges.
- There’s hope that machines might handle micropayments better than humans did, though this would likely accelerate paywalling.
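
For illustration, a minimal sketch of the 402 flow being described, using only Python’s standard library. It is loosely shaped like L402’s challenge/response, but it is not an L402 implementation (real L402 ties a macaroon to a Lightning invoice); the header format, token check, and price below are assumptions made up for this example.

```python
# Toy "pay per request" gate built on HTTP 402 (illustrative, not real L402).
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_MSATS = 100  # hypothetical per-request price
VALID_TOKENS = {"example-prepaid-token"}  # stand-in for real payment verification


class PaywalledHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        token = self.headers.get("Authorization", "").removeprefix("Bearer ").strip()
        if token not in VALID_TOKENS:
            # No proof of payment: answer 402 and describe how to pay.
            self.send_response(402)
            self.send_header("WWW-Authenticate", f'Payment msats="{PRICE_MSATS}"')
            self.end_headers()
            self.wfile.write(b"Payment required for automated access.\n")
            return
        # Paid (or otherwise whitelisted) request: serve the content.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, paying crawler.\n")


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8402), PaywalledHandler).serve_forever()
```

A scraper that wants the content retries with an `Authorization: Bearer <token>` header once it has paid; human visitors would be exempted by other signals (sessions, challenges, or simply free tiers).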
Good vs Bad Bots
- A proposed distinction:
  - “Good bots”: search crawlers and useful automation that obey robots.txt, identify themselves, and rate-limit.
  - “Bad bots”: LLM scrapers, spam, fraud, DDoS—anything that increases costs or degrades service.
- Verifying the big search bots (Google, Bing) is straightforward via their published reverse-DNS checks (sketched below), but this may entrench incumbents and make life harder for new search engines.
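
The method Google and Bing document is a two-step DNS check: reverse-resolve the requesting IP, confirm the hostname falls under the engine’s domain, then forward-resolve that hostname and confirm it maps back to the same IP. A small sketch, using the commonly documented domain suffixes (check each engine’s current docs before relying on them):

```python
# Reverse/forward DNS verification of major search crawlers (sketch).
import socket

VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_search_bot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS (PTR lookup)
    except socket.herror:
        return False
    if not hostname.endswith(VERIFIED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips


if __name__ == "__main__":
    print(is_verified_search_bot("66.249.66.1"))  # a Googlebot range, for illustration
```

A new search engine without this kind of published verification infrastructure has no equivalent way to prove itself to site operators, which is the entrenchment worry raised above.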
Centralization and Cloudflare Concerns
- Many dislike the growing dependence on a few CDNs, both for power concentration and jurisdictional control over traffic.
- Multiple anecdotes describe Cloudflare as a “protection racket”: free or cheap at first, then expensive upsells, bandwidth surprises, or abrupt service changes.
- Others defend Cloudflare’s technical quality while acknowledging philosophical and market-power worries.
Broader Reactions to LLMs and the Open Web
- Some are unbothered, having always assumed anything online is public and scrapable; they see LLMs as just another user of data and find them practically useful.
- Others feel viscerally exploited: they welcome humans reusing their work (e.g. YouTube videos with credit) but resent high-leverage automated reuse without consent or attribution.
- A recurring cynical stance: “If you don’t want it used, don’t put it online,” which others argue leads directly to the death of the open, unauthenticated web.
Meta: Blogspam and Curation
- A subthread criticizes the linked post as thin “blogspam” that adds little beyond an earlier, more in-depth article.
- Others defend short commentary posts as legitimate curation and participation in the “participatory web,” especially compared to fully machine-generated content.