Wikipedia is struggling with voracious AI bot crawlers

Why crawl Wikipedia instead of using dumps?

  • Many note that Wikipedia publishes complete database dumps and HuggingFace-hosted datasets, making HTML crawling both technically and economically irrational (see the dump-loading sketch after this list).
  • Explanations offered: generic “one-size-fits-all” crawlers, developer laziness, lack of awareness of the dumps, or avoiding the work of parsing Wikipedia’s XML/markup and transclusion model.
  • Others argue crawlers may need up‑to‑the‑minute article versions for breaking events (e.g., recent deaths), which periodic dumps lag behind.
  • Some suspect deliberate harm or “soft DDoS” to weaken an open competitor to proprietary AI services; this remains speculative and contested.
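
For context, a minimal sketch of what commenters mean by “just use the dumps”: pulling article text from a published Wikipedia snapshot via the HuggingFace datasets library instead of fetching live HTML. The dataset name, snapshot config, and field names used here (wikimedia/wikipedia, 20231101.en, title/text) are assumptions for illustration and may differ from current releases.

```python
# Sketch: read Wikipedia from a published dump instead of crawling live pages.
# Assumes the HuggingFace `datasets` library and a `wikimedia/wikipedia`
# snapshot config ("20231101.en"); adjust to whatever snapshot is current.
from datasets import load_dataset

# Streaming avoids downloading the multi-gigabyte snapshot up front.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en",
                    split="train", streaming=True)

for i, article in enumerate(wiki):
    print(article["title"])        # article title (field name assumed)
    print(article["text"][:200])   # plain text already extracted; no HTML fetch needed
    if i >= 2:
        break
```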

Quality and behavior of crawlers

  • Many describe crawlers as poorly implemented: no rate limits, naive retry loops, multi-threaded hammering, ignoring robots.txt, and turning into “spaghetti code” due to edge cases.
  • A distinction is drawn between a merely functional crawler and a genuinely “polite” one; the latter requires significant engineering effort that most companies don’t invest (a minimal sketch of the polite pattern follows this list).
  • Some attribute the bad behavior to “vibe-coded” or auto‑generated code from inexperienced developers or LLMs.
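
A minimal sketch of the “polite crawler” pattern the thread contrasts with the hammering behavior above: check robots.txt, identify yourself, pace requests, and back off on 429/503 responses. The user-agent string and delay values are hypothetical placeholders, not anything from the discussion.

```python
# Sketch of a "polite" fetch loop for a single host: honor robots.txt,
# identify yourself, pace requests, and back off when told to slow down.
import time
import urllib.error
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleResearchBot/0.1 (ops@example.org)"  # hypothetical bot identity
CRAWL_DELAY = 2.0  # seconds between requests; a real crawler would also honor Crawl-delay

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://en.wikipedia.org/robots.txt")
robots.read()

def polite_fetch(url: str) -> bytes | None:
    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt: skip it, don't retry harder
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    for attempt in range(3):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):                  # server asks us to back off
                time.sleep(CRAWL_DELAY * 2 ** attempt)  # exponential backoff
                continue
            raise
    return None

for url in ["https://en.wikipedia.org/wiki/Main_Page"]:
    polite_fetch(url)
    time.sleep(CRAWL_DELAY)  # fixed pacing instead of multi-threaded hammering
```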

Impact beyond Wikipedia

  • Multiple commenters report their own servers and small sites being hammered, sometimes to the point of crashes or disk exhaustion.
  • Perception that many AI companies now run large, distributed crawls that collectively amount to a “worldwide DDoS” on the open web.

Proposed defenses and countermeasures

  • Ideas: strict rate limiting, honeypot links (hidden via CSS/JS) that trigger autobans (see the sketch after this list), IP or ASN blocking, Spamhaus-style blocklists, tarpit techniques, or special “fake” pages for AI user agents.
  • Captchas and identity verification (eID, video ID) are debated; critics argue they’re impractical, abusable, and won’t reliably distinguish humans from bots at scale.
  • Some advocate “free for humans, pay for automation” models or paid APIs for bots, but enforcement is seen as hard.
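
A rough illustration of the honeypot-link idea, here sketched with Flask: a link hidden from humans and excluded from legitimate navigation, so any client that requests it is assumed to be a misbehaving bot and is autobanned. The route name and in-memory ban set are hypothetical; a real deployment would push bans to a firewall or shared blocklist.

```python
# Rough sketch of the honeypot idea with Flask: a link hidden from humans via CSS,
# so any client that follows it is treated as a misbehaving bot and banned.
# The route name and the in-memory ban set are hypothetical.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips: set[str] = set()  # real setups would feed a firewall or shared blocklist

@app.before_request
def reject_banned_clients():
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/")
def index():
    # The trap link is invisible to humans and should also be disallowed in
    # robots.txt, so polite crawlers never request it either.
    return (
        "<html><body>"
        "<p>Normal page content.</p>"
        '<a href="/trap-f8a3" style="display:none" rel="nofollow">do not follow</a>'
        "</body></html>"
    )

@app.route("/trap-f8a3")
def trap():
    banned_ips.add(request.remote_addr)  # autoban whoever ignored the hints
    abort(403)

if __name__ == "__main__":
    app.run()
```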

Ethical and structural critiques

  • Strong sentiments that many AI outfits behave like “sociopathic” or “criminal” enterprises: ignoring robots.txt, externalizing costs, exploiting open projects without attribution or reciprocity.
  • Concern that relentless scraping could make hosting public content unaffordable for smaller actors, accelerating centralization under large platforms.

Side thread: Wikipedia finances

  • A brief tangent questions whether Wikimedia overstates its financial need; others push back, distinguishing “shady marketing” from actual corruption and noting the point is off‑topic to the crawler issue.