Using lots of little tools to aggressively reject the bots

Bot‑blocking techniques and tools

  • Many liked the article’s Nginx+fail2ban approach; others suggested more automated tools like Anubis or go-away, or platforms like tirreno with rule engines and dashboards.
  • People describe mixed strategies: IP/ASN blocking, honeypot endpoints, “bait” robots.txt entries that trigger zip bombs or bans, simple arithmetic captchas backed by cookies, and log-scan-and-ban systems (a minimal sketch of the latter follows this list).
  • Some argue whack‑a‑mole IP blocking is fragile and recommend fixing app hot spots instead (e.g., disabling Gitea “download archive” for commits, or putting heavy files behind separate rules or auth).
  • There’s debate over whether to focus on banning bots versus restructuring sites (caching, CDNs, removing costly features) so bots can be tolerated.
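
As a rough illustration of the “log-scan-and-ban” approach mentioned above, the sketch below reads an nginx-style access log, flags IPs that either touch a bait path or exceed a request threshold, and prints nginx deny lines. The bait paths, threshold, and output format are assumptions for illustration, not details taken from the article or the thread.

    #!/usr/bin/env python3
    """Minimal log-scan-and-ban sketch: read an nginx-style access log,
    flag IPs that hit a bait path or exceed a request threshold, and
    emit `deny` lines for an nginx include file."""
    import re
    import sys
    from collections import Counter

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')
    BAIT_PREFIXES = ("/secret-admin", "/wp-login.php")  # hypothetical honeypot paths
    MAX_REQUESTS = 2000                                 # per scanned log; tune to taste

    def scan(log_path):
        hits = Counter()
        banned = set()
        with open(log_path, errors="replace") as fh:
            for line in fh:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                ip, path = m.groups()
                hits[ip] += 1
                if path.startswith(BAIT_PREFIXES):
                    banned.add(ip)        # instant ban for touching bait
        banned.update(ip for ip, n in hits.items() if n > MAX_REQUESTS)
        return banned

    if __name__ == "__main__":
        for ip in sorted(scan(sys.argv[1])):
            print(f"deny {ip};")          # drop into an nginx include and reload

A cron job that regenerates an nginx include file from this output (or hands the IP list to fail2ban instead) would be the natural way to wire such a script into the kind of setup the article describes.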

Robots.txt, user agents, and evasion

  • One camp reports that big AI crawlers identify themselves, obey robots.txt, and stop when disallowed.
  • Others provide detailed counterexamples: bots using random or spoofed UAs, ignoring robots.txt, hitting expensive endpoints (git blame, per‑commit archives) from thousands of residential IPs while mimicking human traffic.
  • Several note that many abusive bots simply impersonate reputable crawlers, so logs may not reflect who is actually behind the traffic.
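
One concrete way to test the impersonation claim is the reverse-then-forward DNS check that major search engines document for verifying their own crawlers. The sketch below is a minimal version of that check; the hostname suffixes are the commonly published ones but should be treated as assumptions and confirmed against each vendor's current documentation.

    import socket

    # Hostname suffixes for a few well-known crawlers (assumed; verify
    # against the vendors' own verification docs before relying on them).
    CRAWLER_SUFFIXES = {
        "googlebot": (".googlebot.com", ".google.com"),
        "bingbot": (".search.msn.com",),
    }

    def verify_crawler(ip, claimed):
        """Reverse-then-forward DNS check: the PTR record for the IP must
        end in a known suffix for the claimed crawler, and that hostname
        must resolve back to the same IP."""
        suffixes = CRAWLER_SUFFIXES.get(claimed.lower())
        if not suffixes:
            return False
        try:
            host, _, _ = socket.gethostbyaddr(ip)           # reverse (PTR) lookup
            if not host.endswith(suffixes):
                return False
            return ip in socket.gethostbyname_ex(host)[2]   # forward (A) lookup
        except OSError:
            return False

    if __name__ == "__main__":
        # Illustrative only; requires working DNS. Feed it an IP from your
        # own logs whose user agent claims to be Googlebot.
        print(verify_crawler("66.249.66.1", "googlebot"))

A spoofed Googlebot user agent coming from a residential IP fails this check, which is the point several commenters make: the UA string in the logs says nothing reliable about who is actually crawling.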

Load, cost, and infrastructure constraints

  • Some say 20 requests/second is negligible and that better caching or a CDN is the “real” fix (a micro-caching sketch follows this list).
  • Others reply that bandwidth, CPU-heavy endpoints, autoscaling bills, and low-end “basement servers” make this traffic genuinely harmful, especially with binary downloads or dynamic VCS views.
  • There is disagreement over whether small sites should be forced into CDNs and complex caching purely because of third‑party scrapers.
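
The “better caching” side of this argument can be made concrete with a small in-process micro-cache in front of an expensive handler. The sketch below is framework-agnostic; the handler name and TTL are hypothetical.

    import time
    from functools import wraps

    def ttl_cache(seconds=60):
        """Cache a handler's result for a short TTL, keyed by its arguments.
        Even a 60-second micro-cache collapses many identical requests into
        one expensive render per minute."""
        def decorator(fn):
            store = {}
            @wraps(fn)
            def wrapper(*args):
                now = time.monotonic()
                hit = store.get(args)
                if hit and now - hit[0] < seconds:
                    return hit[1]
                value = fn(*args)
                store[args] = (now, value)
                return value
            return wrapper
        return decorator

    @ttl_cache(seconds=60)
    def render_commit_archive(repo, commit):
        # Stand-in for a CPU/IO-heavy view such as a per-commit tarball
        # or a blame page (hypothetical function).
        time.sleep(0.5)
        return f"archive for {repo}@{commit}"

    if __name__ == "__main__":
        t0 = time.monotonic()
        render_commit_archive("example-repo", "abc123")   # slow, fills the cache
        render_commit_archive("example-repo", "abc123")   # served from the cache
        print(f"two calls took {time.monotonic() - t0:.2f}s")

The counterargument in the thread still stands, though: a TTL cache only helps when URLs repeat, and per-commit archives or blame views are mostly unique, so caching alone does not neutralize that class of traffic.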

Ethics and purpose of scraping

  • One view: public data is by definition for anyone to access, including AI; people are inconsistent if they accept search engines but reject AI crawlers.
  • Opposing view: classic search engines share value and behave relatively considerately; many AI scrapers externalize costs, overwhelm infra, ignore consent, and provide little or no attribution or traffic.
  • Motivations for blocking include resource protection, copyright/licensing concerns, and hostility to unasked‑for commercial reuse.

Collateral damage to legitimate users

  • Multiple comments describe being locked out or harassed by CAPTCHAs, Cloudflare-style challenges, blanket blocks on VPN, Tor, and datacenter IP ranges, and JS-heavy verification walls.
  • Some criticize IP-range and /24‑style blocking as punishing privacy-conscious users, those behind CGNAT, or users of Apple/Google privacy relays.
  • There’s tension between “adapt to bot reality” and “we’re sliding into walled gardens, attested browsers and constant human‑proof burdens.”

Residential proxies and botnets

  • Several note that AI and other scrapers increasingly route through residential proxy networks (Infatica, BrightData, etc.), often via SDKs in consumer apps and smart-TV software, making IP‑based blocking and attribution very hard.
  • Suggestions include having ISPs or network operators crack down on infected endpoints; others counter that this would mean blocking almost everyone, and that security and attribution are fundamentally hard problems.

Alternative models and ideas

  • Ideas floated: push/submit indexing instead of scraping; “page knocking” or behavior‑based unlocking (sketched after this list); separating static “landing/docs” pages from heavy dynamic views; restricting expensive operations (git blame, archives) to logged‑in users.
  • Some see aggressive bot defenses as necessary adaptation; others call them maladaptive, creating a worse web for humans.
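
“Page knocking” is only named in the thread, not specified; one possible reading, sketched below as standard-library WSGI middleware, is that expensive paths stay locked until the client has first fetched a cheap landing page that sets a signed cookie. The path prefixes, cookie name, and the choice to bind the token to the client address are all assumptions.

    import hashlib
    import hmac
    import secrets
    from http import cookies

    SECRET = secrets.token_bytes(32)           # rotate per deployment (assumption)
    HEAVY_PREFIXES = ("/blame/", "/archive/")  # hypothetical expensive paths

    def _token(remote_addr):
        return hmac.new(SECRET, remote_addr.encode(), hashlib.sha256).hexdigest()

    class PageKnock:
        """WSGI middleware sketch: heavy endpoints are served only to clients
        that previously fetched an ordinary page, which set a signed cookie
        tied to their address. Crawlers that jump straight to deep links
        never pick up the cookie and get a 403."""
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            path = environ.get("PATH_INFO", "/")
            addr = environ.get("REMOTE_ADDR", "")
            jar = cookies.SimpleCookie(environ.get("HTTP_COOKIE", ""))
            knocked = "knock" in jar and hmac.compare_digest(
                jar["knock"].value.encode(), _token(addr).encode())

            if path.startswith(HEAVY_PREFIXES) and not knocked:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Visit the landing page first.\n"]

            def set_cookie(status, headers, exc_info=None):
                if not knocked:
                    headers.append(("Set-Cookie", f"knock={_token(addr)}; Path=/"))
                return start_response(status, headers, exc_info)

            return self.app(environ, set_cookie)

Wrapping an existing app is just PageKnock(app). Note that binding the token to REMOTE_ADDR re-creates the CGNAT and privacy-relay problem raised in the collateral-damage discussion above, so this is a trade-off rather than a clean fix.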