LWN is currently under the heaviest scraper attack seen yet
Attack characteristics and status
- The original report described this as “the heaviest” attack yet, involving tens of thousands of IPs; a follow-up notes it has since subsided.
- Debate over terminology: some argue it is more accurately called aggressive scraping rather than a DDoS, since the intent appears to be data collection, not denial of service.
- Others counter that sufficiently aggressive or misconfigured scrapers are operationally indistinguishable from a DDoS, regardless of intent.
Is this really “AI scraping”?
- Some see an “AI scraper” pattern: attacks beginning around 2022, persistent load, repeated incidents, and similarity to what other FOSS projects and small sites are experiencing.
- Skeptics argue LWN’s content is old and already present in common crawl datasets, so the marginal value of a fresh scrape is low; they suggest misconfigured generic crawlers or non-AI actors instead.
- Counterargument: LWN is a primary source on kernel development, so coding models have a clear incentive to ingest and re-ingest it, including new content.
Who is doing the scraping and why?
- Speculation ranges from big labs to small unknown AI/data companies, Chinese AI firms, individuals using proxy services, and generic botnets. No hard attribution evidence is presented.
- Residential proxy networks, and SDKs that turn users’ devices into exit nodes, are described as a likely enabler; they make an attack appear to come from “10k residential IPs.”
- Some attribute over-aggressive crawling to incompetence and bad incentives (KPIs built around ingestion volume); others see deliberate evasion of blocking and treat it as effectively malicious.
Impact beyond LWN
- Many reports of similar pressure on FOSS sites, niche forums, tiny browser games, and specialized wikis, often forcing content behind logins.
- Concern that user‑side AI agents and easy “write me a crawler” tooling will further change traffic patterns and amplify load.
Mitigation ideas
- Techniques discussed: JavaScript API sabotage, Shadow DOM, feeding garbage data to unwanted bots, robots.txt hardening, IP/user‑agent blocking, Cloudflare or similar fronting, light interaction gates (sliders, simple checks) instead of full logins.
- Several note trade‑offs: breakage for testing tools, SEO impact, and arms‑race dynamics.
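Two of the cheaper techniques above, user-agent blocking and feeding garbage data to unwanted bots, can be sketched in a few lines. This is a minimal illustration, not any site's actual defense: the substring deny-list and the filler vocabulary are made-up placeholders, and a spoofed user agent trivially evades the check, which is why it is only one layer among several.

```python
import random

# Hypothetical deny-list of crawler user-agent substrings; real
# deny-lists are much longer and change constantly.
BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot")

def is_blocked_agent(user_agent: str) -> bool:
    """Cheap case-insensitive user-agent check."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in BLOCKED_AGENT_SUBSTRINGS)

def garbage_page(seed: int, words: int = 50) -> str:
    """Deterministic low-value filler served to unwanted bots instead
    of real content, at negligible server cost."""
    rng = random.Random(seed)  # seeded so repeat fetches look stable
    vocab = ["kernel", "patch", "merge", "lorem", "scheduler", "ipsum"]
    return " ".join(rng.choice(vocab) for _ in range(words))

def handle(user_agent: str, path: str) -> tuple[int, str]:
    """Return an (HTTP status, body) pair for a request."""
    if is_blocked_agent(user_agent):
        # Either refuse outright (403), or tarpit: serve cheap garbage
        # so the bot wastes its crawl budget on noise.
        return 200, garbage_page(hash(path) & 0xFFFF)
    return 200, f"real content for {path}"
```

The trade-offs noted above apply here too: a substring match will also break legitimate testing tools that reuse these user agents, and garbage-feeding only helps if the bot cannot cheaply distinguish filler from real pages.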
Legal and copyright concerns
- Strong frustration that technical defenses are being deployed instead of lawsuits; others counter that attribution is extremely hard.
- Separate thread on AI “laundering” open‑source and copyleft code, LLMs regurgitating niche codebases, and confusion over who owns or violates copyright; legal status is described as unresolved.