Using lots of little tools to aggressively reject the bots
Bot‑blocking techniques and tools
- Many liked the article’s Nginx+fail2ban approach (a minimal rate-limit/ban sketch follows this list); others suggested more automated tools like Anubis or go-away, or platforms like tirreno with rule engines and dashboards.
- People describe mixed strategies: IP/ASN blocking, honeypot endpoints, “bait” robots.txt entries that trigger zip bombs or bans (see the trap-path sketch below), simple arithmetic captchas with cookies, and log-scan-and-ban systems.
- Some argue whack‑a‑mole IP blocking is fragile and recommend fixing app hot spots instead (e.g., disabling Gitea “download archive” for commits, or putting heavy files behind separate rules or auth); a proxy-level sketch of the archive blocking follows this list.
- There’s debate over whether to focus on banning bots versus restructuring sites (caching, CDNs, removing costly features) so bots can be tolerated.
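As a concrete illustration of the Nginx+fail2ban pattern from the first bullet, a minimal sketch: Nginx rate-limits each client IP, and fail2ban’s stock `nginx-limit-req` filter bans IPs that keep tripping the limit. Zone names, rates, ports, and ban times are placeholder values, not recommendations from the thread.

```nginx
# nginx, http block: one shared rate-limit zone keyed by client IP
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # allow short bursts, reject the rest with 429; rejections land in the
        # error log, which is what fail2ban watches below
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:3000;
    }
}
```

```ini
# /etc/fail2ban/jail.local -- ban IPs that nginx keeps rate-limiting
# (nginx-limit-req is a filter that ships with fail2ban)
[nginx-limit-req]
enabled  = true
port     = http,https
filter   = nginx-limit-req
logpath  = /var/log/nginx/error.log
findtime = 600
maxretry = 10
bantime  = 86400
```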
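The honeypot/bait idea from the second bullet can be wired up with the same tools: disallow a path that nothing on the site links to, then ban whatever requests it anyway. The `/trap/` path and the `nginx-bot-trap` filter name are invented for illustration.

```text
# robots.txt -- nothing links to /trap/, so only crawlers probing disallowed paths hit it
User-agent: *
Disallow: /trap/
```

```ini
# /etc/fail2ban/filter.d/nginx-bot-trap.conf (custom filter; matches the default combined access log)
[Definition]
failregex = ^<HOST> .* "(GET|POST) /trap/
```

```ini
# /etc/fail2ban/jail.local
[nginx-bot-trap]
enabled  = true
filter   = nginx-bot-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 604800
```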
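For the “fix the hot spots” suggestion, one proxy-level sketch is to refuse Gitea’s on-the-fly archive downloads outright; the URL pattern assumes Gitea’s default /{owner}/{repo}/archive/ routes and should be checked against the running version (disabling archive downloads in Gitea’s own configuration, as suggested above, avoids the proxy dependency altogether).

```nginx
# inside the server block fronting Gitea: refuse per-commit/branch archive generation
location ~* ^/[^/]+/[^/]+/archive/ {
    return 403;
}
```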
Robots.txt, user agents, and evasion
- One camp reports that big AI crawlers identify themselves, obey robots.txt, and stop when disallowed (a per-agent robots.txt example follows this list).
- Others provide detailed counterexamples: bots using random or spoofed UAs, ignoring robots.txt, hitting expensive endpoints (git blame, per‑commit archives) from thousands of residential IPs while mimicking human traffic.
- Several note that many abusive bots simply impersonate reputable crawlers, so logs may not reflect who is actually behind the traffic.
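For the crawlers that do identify themselves and honor robots.txt, per-agent disallow rules are the low-effort first step; the tokens below are the publicly documented user agents of OpenAI, Anthropic, and Common Crawl, and, as the counterexamples above note, this does nothing against spoofed or anonymous traffic.

```text
# robots.txt -- only effective against crawlers that read and honor it
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```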
Load, cost, and infrastructure constraints
- Some say 20 r/s is negligible and that better caching or a CDN is the “real” fix (a caching sketch follows this list).
- Others reply that bandwidth, CPU-heavy endpoints, autoscaling bills, and low-end “basement servers” make this traffic genuinely harmful, especially with binary downloads or dynamic VCS views.
- There is disagreement over whether small sites should be forced into CDNs and complex caching purely because of third‑party scrapers.
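The minimal version of the “just cache it” position is a shared response cache in front of the app, so repeated anonymous hits are served from disk instead of re-rendered; the paths, sizes, TTLs, and cookie name below are placeholders.

```nginx
# http block: small disk cache for anonymous responses
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=appcache:50m
                 max_size=1g inactive=60m use_temp_path=off;

server {
    location / {
        proxy_cache           appcache;
        proxy_cache_valid     200 301 10m;
        proxy_cache_use_stale error timeout updating;
        # skip the cache for logged-in users (cookie name is app-specific)
        proxy_cache_bypass    $cookie_session;
        proxy_no_cache        $cookie_session;
        proxy_pass            http://127.0.0.1:3000;
    }
}
```

As the rebuttal above notes, this helps least when every crawler request is a unique, expensive URL (per-commit archives, blame views).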
Ethics and purpose of scraping
- One view: public data is by definition for anyone to access, including AI; people are inconsistent if they accept search engines but reject AI crawlers.
- Opposing view: classic search engines share value and behave relatively considerately; many AI scrapers externalize costs, overwhelm infra, ignore consent, and provide little or no attribution or traffic.
- Motivations for blocking include resource protection, copyright/licensing concerns, and hostility to unasked‑for commercial reuse.
Collateral damage to legitimate users
- Multiple comments describe being locked out or harassed by CAPTCHAs, Cloudflare-style challenges, VPN/Tor/datacenter/IP-block rules, and JS-heavy verification walls.
- Some criticize IP-range and /24‑style blocking as punishing privacy-conscious users, those behind CGNAT, or users of Apple/Google privacy relays.
- There’s tension between “adapt to bot reality” and “we’re sliding into walled gardens, attested browsers and constant human‑proof burdens.”
Residential proxies and botnets
- Several note that AI and other scrapers increasingly route through residential proxy networks (Infatica, BrightData, etc.), often via SDKs in consumer apps and smart-TV software, making IP‑based blocking and attribution very hard.
- Some suggest ISPs or network operators should be stricter about infected endpoints; others argue that would mean blocking almost everyone, and that security and attribution are fundamentally hard.
Alternative models and ideas
- Ideas floated: push/submit indexing instead of scraping; “page knocking” or behavior‑based unlocking; separating static “landing/docs” from heavy dynamic views; restricting expensive operations (git blame, archives) to logged‑in users (a proxy-level sketch of that last idea follows below).
- Some see aggressive bot defenses as necessary adaptation; others call them maladaptive, creating a worse web for humans.
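One way to read “restrict expensive operations to logged-in users” is a proxy-level gate: cookie-less requests for blame/archive URLs get bounced to the login page. The cookie name, login path, and URL pattern here are assumptions to adapt to the actual application.

```nginx
# http block: flag requests with no session cookie (replace "session" with the app's cookie name)
map $cookie_session $anonymous {
    ""      1;
    default 0;
}

server {
    # expensive VCS views require a session; everything else stays public
    location ~* ^/[^/]+/[^/]+/(blame|archive)/ {
        if ($anonymous) {
            return 302 /user/login;
        }
        proxy_pass http://127.0.0.1:3000;
    }

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}
```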