LWN is currently under the heaviest scraper attack seen yet

Attack characteristics and status

  • The original report described “the heaviest” attack yet, involving tens of thousands of IPs; a follow‑up says it has since subsided.
  • Terminology is debated: some argue it is more accurately called aggressive scraping than a DDoS, since the apparent intent is data collection, not denial of service.
  • Others counter that sufficiently aggressive or misconfigured scrapers are operationally indistinguishable from a DDoS, regardless of intent.

Is this really “AI scraping”?

  • Some see an “AI scraper” pattern: incidents beginning around 2022, persistent load, repeated waves, and similarity to what other FOSS projects and small sites report.
  • Skeptics argue that LWN’s content is old and already present in common crawl datasets, so the marginal value of a fresh scrape is low; they suggest misconfigured generic crawlers or non‑AI actors instead.
  • Counterargument: LWN is a primary source for kernel development; coding models have clear incentive to ingest and re‑ingest it, including new content.

Who is doing the scraping and why?

  • Speculation ranges from big labs to small unknown AI/data companies, Chinese AI firms, individuals using proxy services, and generic botnets. No hard attribution evidence is presented.
  • Residential proxy networks, and SDKs that quietly turn users’ devices into exit nodes, are described as a likely enabler; they make an attack appear to come from tens of thousands of ordinary residential IPs.
  • Some think incompetence and bad incentives (KPIs around ingestion volume) drive over‑aggressive crawlers; others see deliberate attempts to evade blocking and treat it as effectively malicious.

Impact beyond LWN

  • Many reports of similar pressure on FOSS sites, niche forums, tiny browser games, and specialized wikis, often forcing content behind logins.
  • Concern that user‑side AI agents and easy “write me a crawler” tooling will further change traffic patterns and amplify load.

Mitigation ideas

  • Techniques discussed: robots.txt hardening, IP and user‑agent blocking, fronting by Cloudflare or similar, JavaScript API sabotage, Shadow DOM tricks, feeding garbage data to unwanted bots, and light interaction gates (sliders, simple checks) instead of full logins.
  • Several note trade‑offs: breakage for testing tools, SEO impact, and arms‑race dynamics.
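Of the techniques above, per‑IP rate limiting is the simplest to sketch. The following is a minimal, hypothetical token‑bucket limiter (not LWN’s actual setup, and the class and parameter names are illustrative), which allows a sustained request rate per client with a bounded burst:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: allow `rate` requests/sec, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        # Each client starts with a full bucket; first access creates the entry.
        self.tokens = defaultdict(lambda: burst)     # client -> available tokens
        self.last = defaultdict(time.monotonic)      # client -> last refill timestamp

    def allow(self, client: str) -> bool:
        """Return True if this client's request should be served, False to reject."""
        now = time.monotonic()
        elapsed = now - self.last[client]
        self.last[client] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[client] = min(self.burst, self.tokens[client] + elapsed * self.rate)
        if self.tokens[client] >= 1.0:
            self.tokens[client] -= 1.0
            return True
        return False
```

A limiter like this is trivially evaded by the “10k residential IPs” pattern described above, since each IP gets its own bucket; that is exactly why the thread treats proxy networks as the key enabler, and why operators reach for heavier tools such as Cloudflare fronting or interaction gates.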

Legal and copyright concerns

  • Strong frustration that site operators must fall back on technical defenses instead of lawsuits; others note that attribution is extremely hard, which makes legal action impractical.
  • A separate thread covers AI “laundering” of open‑source and copyleft code, LLMs regurgitating niche codebases, and confusion over who owns, or infringes, the copyright; the legal status is described as unresolved.