Using lots of little tools to aggressively reject the bots
Bot‑blocking techniques and tools
- Many liked the article’s Nginx+fail2ban approach (a minimal rate-limit/ban sketch follows this list); others suggested more automated tools like Anubis or go-away, or platforms like tirreno with rule engines and dashboards.
- People describe mixed strategies: IP/ASN blocking, honeypot endpoints, “bait” robots.txt entries that trigger zip bombs or bans (see the trap-path sketch below), simple arithmetic captchas with cookies, and log-scan-and-ban systems.
- Some argue whack‑a‑mole IP blocking is fragile and recommend fixing app hot spots instead (e.g., disabling Gitea “download archive” for commits, or putting heavy files behind separate rules or auth); a proxy-level sketch of the archive blocking follows this list.
- There’s debate over whether to focus on banning bots versus restructuring sites (caching, CDNs, removing costly features) so bots can be tolerated.
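As a concrete illustration of the Nginx+fail2ban pattern from the first bullet, a minimal sketch: Nginx rate-limits each client IP, and fail2ban’s stock `nginx-limit-req` filter bans IPs that keep tripping the limit. Zone names, rates, ports, and ban times are placeholder values, not recommendations from the thread.

```nginx
# nginx, http block: one shared rate-limit zone keyed by client IP
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    listen 80;

    location / {
        # allow short bursts, reject the rest with 429; rejections land in the
        # error log, which is what fail2ban watches below
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:3000;
    }
}
```

```ini
# /etc/fail2ban/jail.local -- ban IPs that nginx keeps rate-limiting
# (nginx-limit-req is a filter that ships with fail2ban)
[nginx-limit-req]
enabled  = true
port     = http,https
filter   = nginx-limit-req
logpath  = /var/log/nginx/error.log
findtime = 600
maxretry = 10
bantime  = 86400
```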
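The honeypot/bait idea from the second bullet can be wired up with the same tools: disallow a path that nothing on the site links to, then ban whatever requests it anyway. The `/trap/` path and the `nginx-bot-trap` filter name are invented for illustration.

```text
# robots.txt -- nothing links to /trap/, so only crawlers probing disallowed paths hit it
User-agent: *
Disallow: /trap/
```

```ini
# /etc/fail2ban/filter.d/nginx-bot-trap.conf (custom filter; matches the default combined access log)
[Definition]
failregex = ^<HOST> .* "(GET|POST) /trap/
```

```ini
# /etc/fail2ban/jail.local
[nginx-bot-trap]
enabled  = true
filter   = nginx-bot-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 604800
```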
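For the “fix the hot spots” suggestion, one proxy-level sketch is to refuse Gitea’s on-the-fly archive downloads outright; the URL pattern assumes Gitea’s default /{owner}/{repo}/archive/ routes and should be checked against the running version (disabling archive downloads in Gitea’s own configuration, as suggested above, avoids the proxy dependency altogether).

```nginx
# inside the server block fronting Gitea: refuse per-commit/branch archive generation
location ~* ^/[^/]+/[^/]+/archive/ {
    return 403;
}
```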
Robots.txt, user agents, and evasion
- One camp reports that big AI crawlers identify themselves, obey robots.txt, and stop when disallowed (a per-agent robots.txt example follows this list).
- Others provide detailed counterexamples: bots using random or spoofed UAs, ignoring robots.txt, hitting expensive endpoints (git blame, per‑commit archives) from thousands of residential IPs while mimicking human traffic.
- Several note that many abusive bots simply impersonate reputable crawlers, so logs may not reflect who is actually behind the traffic.
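For the crawlers that do identify themselves and honor robots.txt, per-agent disallow rules are the low-effort first step; the tokens below are the publicly documented user agents of OpenAI, Anthropic, and Common Crawl, and, as the counterexamples above note, this does nothing against spoofed or anonymous traffic.

```text
# robots.txt -- only effective against crawlers that read and honor it
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```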
Load, cost, and infrastructure constraints
- Some say 20 r/s is negligible and that better caching or a CDN is the “real” fix (a caching sketch follows this list).
- Others reply that bandwidth, CPU-heavy endpoints, autoscaling bills, and low-end “basement servers” make this traffic genuinely harmful, especially with binary downloads or dynamic VCS views.
- There is disagreement over whether small sites should be forced into CDNs and complex caching purely because of third‑party scrapers.
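The minimal version of the “just cache it” position is a shared response cache in front of the app, so repeated anonymous hits are served from disk instead of re-rendered; the paths, sizes, TTLs, and cookie name below are placeholders.

```nginx
# http block: small disk cache for anonymous responses
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=appcache:50m
                 max_size=1g inactive=60m use_temp_path=off;

server {
    location / {
        proxy_cache           appcache;
        proxy_cache_valid     200 301 10m;
        proxy_cache_use_stale error timeout updating;
        # skip the cache for logged-in users (cookie name is app-specific)
        proxy_cache_bypass    $cookie_session;
        proxy_no_cache        $cookie_session;
        proxy_pass            http://127.0.0.1:3000;
    }
}
```

As the rebuttal above notes, this helps least when every crawler request is a unique, expensive URL (per-commit archives, blame views).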
Ethics and purpose of scraping
- One view: public data is by definition for anyone to access, including AI; people are inconsistent if they accept search engines but reject AI crawlers.
- Opposing view: classic search engines share value and behave relatively considerately; many AI scrapers externalize costs, overwhelm infra, ignore consent, and provide little or no attribution or traffic.
- Motivations for blocking include resource protection, copyright/licensing concerns, and hostility to unasked‑for commercial reuse.
Collateral damage to legitimate users
- Multiple comments describe being locked out or harassed by CAPTCHAs, Cloudflare-style challenges, VPN/Tor/datacenter/IP-block rules, and JS-heavy verification walls.
- Some criticize IP-range and /24‑style blocking as punishing privacy-conscious users, those behind CGNAT, or users of Apple/Google privacy relays.
- There’s tension between “adapt to bot reality” and “we’re sliding into walled gardens, attested browsers and constant human‑proof burdens.”
Residential proxies and botnets
- Several note that AI and other scrapers increasingly route through residential proxy networks (Infatica, BrightData, etc.), often via SDKs in consumer apps and smart-TV software, making IP‑based blocking and attribution very hard.
- Some suggest ISPs or network operators should be stricter about infected endpoints; others argue that would mean blocking almost everyone, and that security and attribution are fundamentally hard.
Alternative models and ideas
- Ideas floated: push/submit indexing instead of scraping; “page knocking” or behavior‑based unlocking; separating static “landing/docs” from heavy dynamic views; restricting expensive operations (git blame, archives) to logged‑in users (a proxy-level sketch of that last idea follows below).
- Some see aggressive bot defenses as necessary adaptation; others call them maladaptive, creating a worse web for humans.
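One way to read “restrict expensive operations to logged-in users” is a proxy-level gate: cookie-less requests for blame/archive URLs get bounced to the login page. The cookie name, login path, and URL pattern here are assumptions to adapt to the actual application.

```nginx
# http block: flag requests with no session cookie (replace "session" with the app's cookie name)
map $cookie_session $anonymous {
    ""      1;
    default 0;
}

server {
    # expensive VCS views require a session; everything else stays public
    location ~* ^/[^/]+/[^/]+/(blame|archive)/ {
        if ($anonymous) {
            return 302 /user/login;
        }
        proxy_pass http://127.0.0.1:3000;
    }

    location / {
        proxy_pass http://127.0.0.1:3000;
    }
}
```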