Wikipedia is struggling with voracious AI bot crawlers
Why crawl Wikipedia instead of using dumps?
- Many note that Wikipedia already publishes complete database dumps and HuggingFace datasets, so crawling the HTML is both technically and economically wasteful (see the sketch after this list).
- Explanations offered: generic “one-size-fits-all” crawlers, developer laziness, lack of awareness of the dumps, or avoiding the work of parsing Wikipedia’s XML/markup and transclusion model.
- Others argue crawlers may want up‑to‑the‑minute article versions around current events (e.g., recent deaths), which the periodic dumps lag behind.
- Some suspect deliberate harm or “soft DDoS” to weaken an open competitor to proprietary AI services; this remains speculative and contested.
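For concreteness, a minimal sketch of the dump-based alternative, assuming the Hugging Face datasets library and the wikimedia/wikipedia dataset; the dated config name and field names below are taken as given from the dataset card and should be checked against the current listing:

```python
# Minimal sketch: read Wikipedia text from published dumps instead of crawling HTML.
# Assumes the Hugging Face "wikimedia/wikipedia" dataset with a dated English config;
# check the dataset page for the configs that actually exist.
from datasets import load_dataset

# Streaming avoids downloading the full multi-gigabyte dump up front.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

for article in wiki.take(3):
    # Each record carries the article id, URL, title, and plain text.
    print(article["title"], len(article["text"]), "chars")
```

This puts the load on a static download once per dump cycle rather than on Wikipedia's article servers per page.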
Quality and behavior of crawlers
- Many describe the crawlers as poorly implemented: no rate limits, naive retry loops, multi-threaded hammering, ignored robots.txt, and code that degrades into “spaghetti” as edge cases accumulate.
- A distinction is drawn between a merely functional crawler and a genuinely “polite” one; the latter takes real engineering effort that most companies don’t invest (a minimal sketch of the polite version follows this list).
- Some connect bad behavior to “vibe-coded” / auto‑generated code by inexperienced developers or LLMs.
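A minimal sketch of what “polite” means in practice: check robots.txt, honor a crawl delay, and back off on 429/503. The user-agent string, delay, and retry counts are illustrative assumptions, not any site’s actual policy:

```python
# Minimal sketch of a "polite" fetcher: honor robots.txt, apply a crawl delay,
# and back off exponentially on errors or 429/503 responses.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "ExampleBot/1.0 (contact@example.org)"  # hypothetical identity


def polite_get(url, default_delay=5.0, max_retries=4):
    base = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # robots.txt disallows this path; give up politely

    delay = rp.crawl_delay(USER_AGENT) or default_delay
    for attempt in range(max_retries):
        time.sleep(delay * (2 ** attempt))  # crawl delay, doubled on each retry
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code in (429, 503):  # server is asking us to slow down
            continue
        resp.raise_for_status()
        return resp.text
    return None
```

Nothing here is sophisticated, which is part of the commenters’ complaint: the baseline is cheap to implement and still widely skipped.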
Impact beyond Wikipedia
- Multiple commenters report their own servers and small sites being hammered, sometimes to the point of crashes or disk exhaustion.
- Perception that many AI companies now run large, distributed crawls that collectively amount to a “worldwide DDoS” on the open web.
Proposed defenses and countermeasures
- Ideas include strict rate limiting, honeypot links (hidden via CSS/JS) that trigger autobans, IP or ASN blocking, Spamhaus-style shared blocklists, tarpit techniques, and special “fake” pages served to AI user agents (a honeypot sketch follows this list).
- Captchas and identity verification (eID, video ID) are debated; critics argue they’re impractical, abusable, and won’t reliably distinguish humans from bots at scale.
- Some advocate “free for humans, pay for automation” models or paid APIs for bots, but enforcement is seen as hard.
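A minimal sketch of the honeypot-plus-autoban idea, here using Flask; the trap path and in-memory ban set are hypothetical, and a real deployment would persist bans, use the proxy-forwarded client IP, and also disallow the trap path in robots.txt so only non-compliant bots hit it:

```python
# Minimal sketch of a honeypot link: a URL no human should follow (rendered
# invisibly in the page); any client that requests it gets its IP banned.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # in-memory only; swap for a shared store in production


@app.before_request
def reject_banned_clients():
    # Refuse all further requests from IPs that previously hit the trap.
    if request.remote_addr in banned_ips:
        abort(403)


@app.route("/trap-do-not-follow")  # hypothetical honeypot path, hidden via CSS
def honeypot():
    banned_ips.add(request.remote_addr)
    abort(403)


@app.route("/")
def index():
    # The hidden link would normally live in the site template, invisible to humans.
    return '<a href="/trap-do-not-follow" style="display:none">ignore</a> Hello'
```

The design bet is that compliant clients never see the trap, while bots that ignore both CSS and robots.txt self-identify on their first visit.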
Ethical and structural critiques
- Strong sentiment that many AI outfits behave like “sociopathic” or “criminal” enterprises: ignoring robots.txt, externalizing costs, and exploiting open projects without attribution or reciprocity.
- Concern that relentless scraping could make hosting public content unaffordable for smaller actors, accelerating centralization under large platforms.
Side thread: Wikipedia finances
- A brief tangent questions whether Wikimedia overstates its financial need; others push back, distinguishing “shady marketing” from actual corruption and noting the point is off-topic to the crawler issue.