Wikipedia is struggling with voracious AI bot crawlers
Why crawl Wikipedia instead of using dumps?
- Many note that Wikipedia already publishes complete database dumps and HuggingFace datasets, so crawling the HTML is both technically and economically wasteful (see the sketch after this list).
- Explanations offered: generic “one-size-fits-all” crawlers, developer laziness, lack of awareness of the dumps, or avoiding the work of parsing Wikipedia’s XML/markup and transclusion model.
- Others argue crawlers may want up‑to‑the‑minute article versions around current events (e.g., recent deaths), which the periodic dumps lag behind.
- Some suspect deliberate harm or “soft DDoS” to weaken an open competitor to proprietary AI services; this remains speculative and contested.
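For concreteness, a minimal sketch of the dump-based alternative, assuming the Hugging Face datasets library and the wikimedia/wikipedia dataset; the dated config name and field names below are taken as given from the dataset card and should be checked against the current listing:

```python
# Minimal sketch: read Wikipedia text from published dumps instead of crawling HTML.
# Assumes the Hugging Face "wikimedia/wikipedia" dataset with a dated English config;
# check the dataset page for the configs that actually exist.
from datasets import load_dataset

# Streaming avoids downloading the full multi-gigabyte dump up front.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

for article in wiki.take(3):
    # Each record carries the article id, URL, title, and plain text.
    print(article["title"], len(article["text"]), "chars")
```

This puts the load on a static download once per dump cycle rather than on Wikipedia's article servers per page.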
Quality and behavior of crawlers
- Many describe the crawlers as poorly implemented: no rate limits, naive retry loops, multi-threaded hammering, ignored robots.txt, and code that degrades into “spaghetti” as edge cases accumulate.
- A distinction is drawn between a merely functional crawler and a genuinely “polite” one; the latter takes real engineering effort that most companies don’t invest (a minimal sketch of the polite version follows this list).
- Some connect bad behavior to “vibe-coded” / auto‑generated code by inexperienced developers or LLMs.
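A minimal sketch of what “polite” means in practice: check robots.txt, honor a crawl delay, and back off on 429/503. The user-agent string, delay, and retry counts are illustrative assumptions, not any site’s actual policy:

```python
# Minimal sketch of a "polite" fetcher: honor robots.txt, apply a crawl delay,
# and back off exponentially on errors or 429/503 responses.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "ExampleBot/1.0 (contact@example.org)"  # hypothetical identity


def polite_get(url, default_delay=5.0, max_retries=4):
    base = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # robots.txt disallows this path; give up politely

    delay = rp.crawl_delay(USER_AGENT) or default_delay
    for attempt in range(max_retries):
        time.sleep(delay * (2 ** attempt))  # crawl delay, doubled on each retry
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code in (429, 503):  # server is asking us to slow down
            continue
        resp.raise_for_status()
        return resp.text
    return None
```

Nothing here is sophisticated, which is part of the commenters’ complaint: the baseline is cheap to implement and still widely skipped.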
Impact beyond Wikipedia
- Multiple commenters report their own servers and small sites being hammered, sometimes to the point of crashes or disk exhaustion.
- Perception that many AI companies now run large, distributed crawls that collectively amount to a “worldwide DDoS” on the open web.
Proposed defenses and countermeasures
- Ideas include strict rate limiting, honeypot links (hidden via CSS/JS) that trigger autobans, IP or ASN blocking, Spamhaus-style shared blocklists, tarpit techniques, and special “fake” pages served to AI user agents (a honeypot sketch follows this list).
- Captchas and identity verification (eID, video ID) are debated; critics argue they’re impractical, abusable, and won’t reliably distinguish humans from bots at scale.
- Some advocate “free for humans, pay for automation” models or paid APIs for bots, but enforcement is seen as hard.
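A minimal sketch of the honeypot-plus-autoban idea, here using Flask; the trap path and in-memory ban set are hypothetical, and a real deployment would persist bans, use the proxy-forwarded client IP, and also disallow the trap path in robots.txt so only non-compliant bots hit it:

```python
# Minimal sketch of a honeypot link: a URL no human should follow (rendered
# invisibly in the page); any client that requests it gets its IP banned.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # in-memory only; swap for a shared store in production


@app.before_request
def reject_banned_clients():
    # Refuse all further requests from IPs that previously hit the trap.
    if request.remote_addr in banned_ips:
        abort(403)


@app.route("/trap-do-not-follow")  # hypothetical honeypot path, hidden via CSS
def honeypot():
    banned_ips.add(request.remote_addr)
    abort(403)


@app.route("/")
def index():
    # The hidden link would normally live in the site template, invisible to humans.
    return '<a href="/trap-do-not-follow" style="display:none">ignore</a> Hello'
```

The design bet is that compliant clients never see the trap, while bots that ignore both CSS and robots.txt self-identify on their first visit.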
Ethical and structural critiques
- Strong sentiment that many AI outfits behave like “sociopathic” or “criminal” enterprises: ignoring robots.txt, externalizing costs, and exploiting open projects without attribution or reciprocity.
- Concern that relentless scraping could make hosting public content unaffordable for smaller actors, accelerating centralization under large platforms.
Side thread: Wikipedia finances
- A brief tangent questions whether Wikimedia overstates its financial need; others push back, distinguishing “shady marketing” from actual corruption and noting the point is off-topic to the crawler issue.