Feed the bots
Implementation & Security of the “Markov Babbler”
- Follow-up project with a C-based Markov text server is shared; some readers praise its elegance and speed.
- Others raise concerns: an incorrect pthread_detach call (since fixed), potential thread exhaustion (no concurrency limits), use of unsafe C functions, and hand-rolled HTTP parsing (a corrected threading sketch follows this list).
- Recommendations include running it as an unprivileged user, running it inside a container, and compiling with aggressive warnings; some suggest not exposing ad-hoc C code to the public internet at all.
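The babbler's source isn't reproduced here, so the following is only a minimal sketch of how the two threading concerns above are commonly handled in a thread-per-connection C server: detach each handler thread so its resources are reclaimed when it exits, and refuse new work past a hard cap. MAX_CLIENTS, serve_connection, and handle_client are illustrative names, not the project's API.

```c
/* Minimal sketch, not the posted project's code: cap concurrent handler
 * threads with an atomic counter and detach each one so its resources are
 * reclaimed automatically when it exits. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_CLIENTS 128            /* illustrative hard cap on live handlers */
static atomic_int active_clients;  /* handler threads currently running */

static void *handle_client(void *arg) {
    int fd = (int)(intptr_t)arg;
    /* ... read the request and stream the Markov response here ... */
    close(fd);
    atomic_fetch_sub(&active_clients, 1);
    return NULL;
}

/* Called from the accept loop with a freshly accepted socket. */
void serve_connection(int client_fd) {
    if (atomic_fetch_add(&active_clients, 1) >= MAX_CLIENTS) {
        atomic_fetch_sub(&active_clients, 1);   /* over the cap: shed load */
        close(client_fd);
        return;
    }
    pthread_t tid;
    if (pthread_create(&tid, NULL, handle_client,
                       (void *)(intptr_t)client_fd) != 0) {
        atomic_fetch_sub(&active_clients, 1);
        close(client_fd);
        return;
    }
    pthread_detach(tid);   /* detached: no join needed, no leaked thread state */
}
```

Compiling with aggressive warnings (e.g. -Wall -Wextra) and running the binary as an unprivileged user inside a container, as recommended above, are orthogonal hardening steps.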
Denial-of-Service & Threading Concerns
- One camp argues a reverse proxy (nginx, etc.) protects against slow-loris-style attacks because it handles connections with event-driven I/O.
- Others counter that even with a proxy, a determined attacker can flood with well-formed requests, and unbounded threads remain dangerous.
- Alternative designs proposed: async I/O with a fixed worker pool rather than thread-per-connection.
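A minimal sketch of the fixed-worker-pool alternative, assuming a bounded queue of accepted sockets; NWORKERS, QUEUE_CAP, and handle_client are illustrative names rather than anything from the project.

```c
/* Sketch: N pre-spawned workers pull accepted sockets from a bounded ring
 * buffer, so a flood of connections queues or is shed instead of spawning
 * an unbounded number of threads. */
#include <pthread.h>
#include <unistd.h>

#define NWORKERS  8
#define QUEUE_CAP 256

static int queue[QUEUE_CAP];
static int q_head, q_tail, q_len;
static pthread_mutex_t q_mu        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

static void handle_client(int fd) {
    /* request parsing + Markov output would go here */
    (void)fd;
}

/* Producer side, called from the accept loop. Returns 0 if queued,
 * -1 if the queue is full and the connection should be dropped. */
int enqueue_client(int fd) {
    pthread_mutex_lock(&q_mu);
    if (q_len == QUEUE_CAP) {        /* shed load under flood */
        pthread_mutex_unlock(&q_mu);
        return -1;
    }
    queue[q_tail] = fd;
    q_tail = (q_tail + 1) % QUEUE_CAP;
    q_len++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_mu);
    return 0;
}

/* Worker: loops forever, serving one connection at a time. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_mu);
        while (q_len == 0)
            pthread_cond_wait(&q_not_empty, &q_mu);
        int fd = queue[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_len--;
        pthread_mutex_unlock(&q_mu);
        handle_client(fd);
        close(fd);
    }
    return NULL;
}

void start_workers(void) {
    for (int i = 0; i < NWORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);
    }
}
```

True async I/O (epoll/kqueue or an event library) would go further by multiplexing many sockets per worker, but even this bounded pool removes the unbounded-thread failure mode discussed above.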
Goal: Make Scraping Economically Expensive
- Core idea: serve recursively linked Markov garbage so crawlers burn CPU, memory, and especially bandwidth at very low cost to the site (a minimal sketch follows this list).
- If widely adopted, this could shift economics for LLM scrapers; they might need an LLM gatekeeper to detect junk, greatly raising per-page cost.
- Some note storage and bandwidth costs also fall on scrapers; others suspect many scrapers get paid per GB regardless of content quality.
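As a toy illustration of the asymmetry, here is a self-contained order-1 Markov babbler: each page costs the server a few string lookups, yet every link on it leads to another generated page. The seed corpus, the /maze/ URL scheme, and the page size are invented for the sketch and are not the posted project's.

```c
/* Toy order-1 Markov babbler: emits one page of pseudo-text whose links all
 * point back into the maze, so following them only yields more garbage. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static const char *seed =
    "the quick brown fox jumps over the lazy dog and the dog barks at "
    "the quick fox while the lazy crawler follows every link it sees";

#define MAXW 64
static char *words[MAXW];
static int   nwords;

/* Split the seed corpus into words once. */
static void build_corpus(void) {
    static char buf[512];
    strncpy(buf, seed, sizeof buf - 1);
    for (char *w = strtok(buf, " "); w && nwords < MAXW; w = strtok(NULL, " "))
        words[nwords++] = w;
}

/* Markov step: pick a random successor of the current word in the corpus;
 * fall back to a random word if the current word never recurs. */
static int next_word(int cur) {
    int candidates[MAXW], n = 0;
    for (int i = 0; i + 1 < nwords; i++)
        if (strcmp(words[i], words[cur]) == 0)
            candidates[n++] = i + 1;
    return n ? candidates[rand() % n] : rand() % nwords;
}

/* Emit one "page": ~200 words of babble plus a handful of links deeper in. */
static void emit_page(FILE *out) {
    int cur = rand() % nwords;
    fputs("<html><body><p>", out);
    for (int i = 0; i < 200; i++) {
        fprintf(out, "%s ", words[cur]);
        cur = next_word(cur);
    }
    fputs("</p>", out);
    for (int i = 0; i < 5; i++)
        fprintf(out, "<a href=\"/maze/%08x\">%s</a> ",
                (unsigned)rand(), words[rand() % nwords]);
    fputs("</body></html>\n", out);
}

int main(void) {
    srand((unsigned)time(NULL));
    build_corpus();
    emit_page(stdout);   /* a real server would write this to the socket */
    return 0;
}
```

Each page costs the server almost nothing to produce, while a crawler that fetches, stores, and later processes millions of them pays the full price in bandwidth and compute, which is the economic shift described above.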
Can Bots & LLMs Just Filter the Garbage?
- Several argue LLMs can cheaply classify nonsense text and already train on mostly low-quality data; training pipelines already filter junk.
- Others respond that even adding an LLM-in-the-loop triage step is expensive at scale and turns this into an arms race.
- Markov text is seen as relatively easy to detect; more subtle poisoning (e.g., realistic but misleading content) is suggested but controversial.
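To make the "filtering is cheap" side concrete, here is one deliberately crude heuristic of the kind data-cleaning pipelines use: the fraction of repeated word trigrams, which scores low-order babble from a small corpus far higher than ordinary prose. The hashing scheme and any threshold you would apply are assumptions, not any scraper's actual pipeline.

```c
/* Illustrative junk signal: fraction of word trigrams that repeat within a
 * document. Repetitive generator output scores high; normal prose scores low. */
#include <stdint.h>
#include <string.h>

/* FNV-1a over three consecutive words, as a cheap trigram fingerprint. */
static uint64_t trigram_hash(const char *a, const char *b, const char *c) {
    const char *parts[3] = { a, b, c };
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < 3; i++)
        for (const char *p = parts[i]; *p; p++)
            h = (h ^ (unsigned char)*p) * 1099511628211ULL;
    return h;
}

/* Returns the fraction (0.0 .. 1.0) of trigrams already seen earlier in text. */
double duplicate_trigram_fraction(const char *text) {
    enum { MAXW = 4096, NBUCKETS = 8192 };
    static char buf[1 << 16];
    static uint64_t seen[NBUCKETS];
    char *words[MAXW];
    int nwords = 0, dup = 0, total = 0;

    memset(seen, 0, sizeof seen);
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " \t\r\n"); w != NULL && nwords < MAXW;
         w = strtok(NULL, " \t\r\n"))
        words[nwords++] = w;

    for (int i = 0; i + 2 < nwords; i++, total++) {
        uint64_t h = trigram_hash(words[i], words[i + 1], words[i + 2]);
        uint64_t slot = h % NBUCKETS;
        if (seen[slot] == h)
            dup++;            /* trigram seen before (modulo hash collisions) */
        else
            seen[slot] = h;   /* crude: collisions simply overwrite */
    }
    return total > 0 ? (double)dup / total : 0.0;
}
```

A real pipeline would combine several such signals (symbol ratios, deduplication, perplexity under a small model), which is why the counter-argument above is about cost at scale rather than feasibility.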
Bot Behavior, Traps & Hidden Links
- Strategy: hide links (CSS tricks, 0px links, off-screen elements) that humans won’t click but naive crawlers will, leading them into the garbage maze; a sketch of such a link follows this list.
- Counterpoint: more advanced bots could render pages, consider visual layout, or only follow “visible” links; at that point scraping costs rise further.
- Some doubt the trap actually “protects” real pages, since many crawlers prioritize pre‑queued known URLs over newly discovered ones.
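A sketch of the hidden-link trap, emitted from C like the rest of the page: the anchor is pushed far off-screen, so visitors never see or click it, but a crawler that only parses HTML will still extract and follow it. The /maze/ path and the function name are illustrative, not the project's.

```c
/* Emit an anchor that is invisible to human visitors (positioned far
 * off-screen, skipped by keyboard focus and screen readers) but still present
 * in the HTML for naive crawlers to harvest. */
#include <stdio.h>

void emit_trap_link(FILE *out, unsigned page_id) {
    fprintf(out,
        "<a href=\"/maze/%08x\" tabindex=\"-1\" aria-hidden=\"true\" "
        "style=\"position:absolute;left:-9999px;top:-9999px\">archive</a>\n",
        page_id);
}
```

Whether to also mark such links rel="nofollow" and disallow the maze in robots.txt is the real design choice: with those signals, well-behaved crawlers are spared and only bots that ignore them fall in; without them, anything that parses the page is fair game.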
Legal & Authentication Angles
- Idea: put content behind Basic Auth with public credentials like nobots/nobots (a sketch follows this list). Some say bots could easily use them; others raise legal risks of using public or leaked passwords at scale.
- Discussion touches on DMCA anti-circumvention and whether public credentials still count as “access controls” for bots vs. humans.
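A sketch of the public-credentials idea, assuming the Authorization header has already been extracted from the request: the expected value is just the base64 of nobots:nobots, and is_authorized/send_challenge are illustrative names.

```c
/* Only answer requests that present the publicly advertised credentials;
 * everyone else gets a 401 whose realm spells out the username and password. */
#include <stdio.h>
#include <string.h>

/* "Basic " + base64("nobots:nobots") */
static const char *EXPECTED = "Basic bm9ib3RzOm5vYm90cw==";

int is_authorized(const char *authorization_header) {
    return authorization_header != NULL &&
           strcmp(authorization_header, EXPECTED) == 0;
}

void send_challenge(FILE *out) {
    fputs("HTTP/1.1 401 Unauthorized\r\n"
          "WWW-Authenticate: Basic realm=\"username: nobots, password: nobots\"\r\n"
          "Content-Length: 0\r\n\r\n", out);
}
```

The mechanical side is trivial, as shown; none of it settles the legal question of whether publicly advertised credentials count as an access control.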
Ethical Debate
- One side: serving junk to unauthorized scrapers is a legitimate self-defense against resource abuse and uncompensated data extraction.
- Other side: deliberately poisoning training data increases global “information entropy” and could indirectly harm users of these systems; “two wrongs don’t make a right.”
Alternative Defenses & Operational Notes
- Suggestions: use Cloudflare or other CDNs (met with distrust by some), more precise robots.txt, IP blocking (hard against residential/proxy botnets), basic hidden-link + fail2ban traps, or client-side encryption/obfuscation where JS is acceptable.
- Gzip bombs are discussed as ineffective in practice: streaming decompression and small maximum expansion ratios limit their impact (a capped-decompression sketch follows this list).
- Some note that for low‑traffic personal sites, static hosting (e.g., GitHub Pages) sidesteps most cost concerns entirely.
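On the gzip-bomb point: a client that streams decompression through a fixed buffer can cap total output and abort, and a single DEFLATE layer tops out at an expansion ratio on the order of 1000:1, so a small payload cannot balloon without bound. Below is a sketch of such a capped reader using zlib; MAX_OUTPUT is an arbitrary illustrative limit.

```c
/* Capped streaming decompression: inflate in fixed-size chunks and bail out
 * once the output exceeds a sane limit, so a "bomb" never materialises. */
#include <string.h>
#include <zlib.h>

#define MAX_OUTPUT (8UL * 1024 * 1024)   /* give up after 8 MiB of output */

/* Returns 0 on success, -1 on error or if the stream expands past the cap. */
int inflate_capped(const unsigned char *in, size_t in_len) {
    z_stream zs;
    unsigned char out[16384];
    unsigned long total = 0;

    memset(&zs, 0, sizeof zs);
    if (inflateInit2(&zs, 15 + 16) != Z_OK)   /* 15+16: expect a gzip wrapper */
        return -1;

    zs.next_in  = (unsigned char *)in;
    zs.avail_in = (uInt)in_len;

    int ret;
    do {
        zs.next_out  = out;
        zs.avail_out = sizeof out;
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            inflateEnd(&zs);
            return -1;
        }
        total += sizeof out - zs.avail_out;
        if (total > MAX_OUTPUT) {             /* output cap hit: stop inflating */
            inflateEnd(&zs);
            return -1;
        }
        /* a real client would process the chunk in `out` here */
    } while (ret != Z_STREAM_END);

    inflateEnd(&zs);
    return 0;
}
```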