LWN is currently under the heaviest scraper attack seen yet
Attack characteristics and status
- The original report described this as “the heaviest” attack yet, involving tens of thousands of IPs; a follow-up notes it has since subsided.
- Debate over terminology: some argue it is more accurately called aggressive scraping rather than a DDoS, since the intent appears to be data collection, not denial of service.
- Others counter that sufficiently aggressive or misconfigured scrapers are operationally indistinguishable from a DDoS, regardless of intent.
Is this really “AI scraping”?
- Some see an “AI scraper” pattern: attacks beginning around 2022, persistent load, repeated incidents, and similarity to what other FOSS projects and small sites are experiencing.
- Skeptics argue LWN’s content is old and already present in common crawl datasets, so the marginal value of a fresh scrape is low; they suggest misconfigured generic crawlers or non-AI actors instead.
- Counterargument: LWN is a primary source on kernel development, so coding models have a clear incentive to ingest and re-ingest it, including new content.
Who is doing the scraping and why?
- Speculation ranges from big labs to small unknown AI/data companies, Chinese AI firms, individuals using proxy services, and generic botnets. No hard attribution evidence is presented.
- Residential proxy networks, and SDKs that turn users’ devices into exit nodes, are described as a likely enabler; they make an attack appear to come from “10k residential IPs.”
- Some attribute over-aggressive crawling to incompetence and bad incentives (KPIs built around ingestion volume); others see deliberate evasion of blocking and treat it as effectively malicious.
Impact beyond LWN
- Many reports of similar pressure on FOSS sites, niche forums, tiny browser games, and specialized wikis, often forcing content behind logins.
- Concern that user‑side AI agents and easy “write me a crawler” tooling will further change traffic patterns and amplify load.
Mitigation ideas
- Techniques discussed: JavaScript API sabotage, Shadow DOM, feeding garbage data to unwanted bots, robots.txt hardening, IP/user‑agent blocking, Cloudflare or similar fronting, light interaction gates (sliders, simple checks) instead of full logins.
- Several note trade‑offs: breakage for testing tools, SEO impact, and arms‑race dynamics.
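Two of the cheaper techniques above, user-agent blocking and feeding garbage data to unwanted bots, can be sketched in a few lines. This is a minimal illustration, not any site's actual defense: the substring deny-list and the filler vocabulary are made-up placeholders, and a spoofed user agent trivially evades the check, which is why it is only one layer among several.

```python
import random

# Hypothetical deny-list of crawler user-agent substrings; real
# deny-lists are much longer and change constantly.
BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot")

def is_blocked_agent(user_agent: str) -> bool:
    """Cheap case-insensitive user-agent check."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in BLOCKED_AGENT_SUBSTRINGS)

def garbage_page(seed: int, words: int = 50) -> str:
    """Deterministic low-value filler served to unwanted bots instead
    of real content, at negligible server cost."""
    rng = random.Random(seed)  # seeded so repeat fetches look stable
    vocab = ["kernel", "patch", "merge", "lorem", "scheduler", "ipsum"]
    return " ".join(rng.choice(vocab) for _ in range(words))

def handle(user_agent: str, path: str) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for a request."""
    if is_blocked_agent(user_agent):
        # Either refuse outright (403), or tarpit: serve cheap garbage
        # so the bot wastes its crawl budget on noise.
        return 200, garbage_page(hash(path) & 0xFFFF)
    return 200, f"real content for {path}"
```

The trade-offs noted above apply here too: a substring match will also break legitimate testing tools that reuse these user agents, and garbage-feeding only helps if the bot cannot cheaply distinguish filler from real pages.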
Legal and copyright concerns
- Strong frustration that technical defenses are being deployed instead of lawsuits; others counter that attribution is extremely hard.
- Separate thread on AI “laundering” open‑source and copyleft code, LLMs regurgitating niche codebases, and confusion over who owns or violates copyright; legal status is described as unresolved.