OpenAI's bot crushed this seven-person company's web site 'like a DDoS attack'

Legal and liability questions

  • Commenters discuss whether excess hosting costs from scraping are legally recoverable.
  • Cited case law suggests scraping public data generally does not violate the Computer Fraud and Abuse Act (CFAA) criminally; disputes are mostly civil.
  • One small site reports successfully getting a few thousand dollars from an AI company for bandwidth overuse.
  • Several note robots.txt has no legal force; lawsuits would likely rest on general claims of harm, not robots violations.
  • There is disagreement over whether lawsuits against large-scale, non-consensual scraping could realistically succeed; some say there is “no precedent,” while others point to past crawling lawsuits (unclear which).

Who is responsible for the overload?

  • One camp: if you run a public site without auth, rate limits, caching, or robots.txt, you should expect heavy crawling and design for it.
  • Opposing camp: small businesses can’t all be infra experts; bots that knock sites offline are behaving unreasonably, regardless of site quality.
  • Analogies (e.g., emptying a free library, filling a shop with non-buyers) are used to argue that “not illegal” ≠ ethical.

Crawler behavior and engineering quality

  • Many describe AI crawlers as poorly engineered: over-aggressive, ignoring 429 / Retry-After, re-crawling unchanged content, and sometimes spoofing user agents.
  • Some note that classic search bots historically provided support channels and honored robots more reliably.
  • Others argue the article’s “DDoS-like” framing is unproven because no request rates or timestamps are shown; Cloudflare IPs in logs further muddy attribution.
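Honoring 429 and Retry-After, which commenters say AI crawlers ignore, takes only a few lines of backoff logic. A minimal sketch of what a well-behaved crawler would do (function names are hypothetical; the request function is injected so the policy is testable without a network):

```python
import time

def fetch_politely(url, do_request, max_retries=3):
    """Fetch a URL while honoring 429 responses and Retry-After headers.

    do_request(url) -> (status_code, headers, body). Injected for
    illustration; a real crawler would wrap an HTTP client here.
    """
    status, headers, body = None, {}, ""
    for attempt in range(max_retries + 1):
        status, headers, body = do_request(url)
        if status != 429:
            return status, body
        # Honor Retry-After when the server supplies it;
        # otherwise fall back to exponential backoff.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return status, body
```

The key design point is that the server's Retry-After value takes precedence over the client's own schedule; crawlers that skip this step are the ones the thread describes as "poorly engineered."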

Mitigations and countermeasures

  • Suggested defenses: robots.txt (with explicit AI blocks), Cloudflare protection and AI-bot blocking, fail2ban, HTTP 429 with subnet-level throttling, ASN or country blocks, IPv6-only access, and .htaccess rules.
  • Some propose “data poisoning” defenses: serving gibberish, recursive content, or compressible text to abusive bots; others argue such gibberish is easy to filter in curation.
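For the "explicit AI blocks" in robots.txt, an opt-out looks like the following, using user-agent tokens the major vendors publish (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training). As commenters note, this is purely advisory:

```text
# robots.txt — ask known AI crawlers to stay out (advisory only)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```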
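The "recursive content" defense can be sketched in a few lines: serve deterministic gibberish whose links point ever deeper into a synthetic tree, so an abusive bot burns its crawl budget on worthless pages. A toy version (function name and path scheme are hypothetical):

```python
import hashlib

def gibberish_page(path, n_words=50, n_links=5):
    """Return (text, links) for a page in an endless synthetic maze.

    The content is derived from a hash of the path, so every URL yields
    stable nonsense, and each page links to n_links deeper pages that
    exist only as more gibberish. Illustrative sketch only.
    """
    words = []
    seed = path.encode()
    for i in range(n_words):
        digest = hashlib.sha256(seed + str(i).encode()).hexdigest()
        words.append(digest[:6])
    links = [f"{path.rstrip('/')}/{w}" for w in words[:n_links]]
    return " ".join(words), links
```

As the skeptics in the thread point out, hex-fragment text like this is trivially easy to filter during dataset curation, which is the main argument against the approach.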

Broader implications for the web and AI

  • Concern that aggressive AI scraping will push more content behind logins/paywalls, reducing open information.
  • Some see AI agents as reviving “personal webcrawlers” and automating interactions with sites that don’t offer APIs.
  • Others worry this simply recreates a centralized, Google-like gatekeeper, now controlled by AI companies.