Feed the bots
Implementation & Security of the “Markov Babbler”
- Follow-up project with a C-based Markov text server is shared; some readers praise its elegance and speed.
- Others raise concerns: an incorrect pthread_detach call (since fixed), potential thread exhaustion (no concurrency limits), use of unsafe C functions, and hand-rolled HTTP parsing (a corrected threading sketch follows this list).
- Recommendations include running it as an unprivileged user, running it inside a container, and compiling with aggressive warnings; some suggest not exposing ad-hoc C code to the public internet at all.
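The babbler's source isn't reproduced here, so the following is only a minimal sketch of how the two threading concerns above are commonly handled in a thread-per-connection C server: detach each handler thread so its resources are reclaimed when it exits, and refuse new work past a hard cap. MAX_CLIENTS, serve_connection, and handle_client are illustrative names, not the project's API.

```c
/* Minimal sketch, not the posted project's code: cap concurrent handler
 * threads with an atomic counter and detach each one so its resources are
 * reclaimed automatically when it exits. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_CLIENTS 128            /* illustrative hard cap on live handlers */
static atomic_int active_clients;  /* handler threads currently running */

static void *handle_client(void *arg) {
    int fd = (int)(intptr_t)arg;
    /* ... read the request and stream the Markov response here ... */
    close(fd);
    atomic_fetch_sub(&active_clients, 1);
    return NULL;
}

/* Called from the accept loop with a freshly accepted socket. */
void serve_connection(int client_fd) {
    if (atomic_fetch_add(&active_clients, 1) >= MAX_CLIENTS) {
        atomic_fetch_sub(&active_clients, 1);   /* over the cap: shed load */
        close(client_fd);
        return;
    }
    pthread_t tid;
    if (pthread_create(&tid, NULL, handle_client,
                       (void *)(intptr_t)client_fd) != 0) {
        atomic_fetch_sub(&active_clients, 1);
        close(client_fd);
        return;
    }
    pthread_detach(tid);   /* detached: no join needed, no leaked thread state */
}
```

Compiling with aggressive warnings (e.g. -Wall -Wextra) and running the binary as an unprivileged user inside a container, as recommended above, are orthogonal hardening steps.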
Denial-of-Service & Threading Concerns
- One camp argues a reverse proxy (nginx, etc.) protects against slow-loris-style attacks because it handles connections with event-driven I/O.
- Others counter that even with a proxy, a determined attacker can flood with well-formed requests, and unbounded threads remain dangerous.
- Alternative designs proposed: async I/O with a fixed worker pool rather than thread-per-connection.
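A minimal sketch of the fixed-worker-pool alternative, assuming a bounded queue of accepted sockets; NWORKERS, QUEUE_CAP, and handle_client are illustrative names rather than anything from the project.

```c
/* Sketch: N pre-spawned workers pull accepted sockets from a bounded ring
 * buffer, so a flood of connections queues or is shed instead of spawning
 * an unbounded number of threads. */
#include <pthread.h>
#include <unistd.h>

#define NWORKERS  8
#define QUEUE_CAP 256

static int queue[QUEUE_CAP];
static int q_head, q_tail, q_len;
static pthread_mutex_t q_mu        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;

static void handle_client(int fd) {
    /* request parsing + Markov output would go here */
    (void)fd;
}

/* Producer side, called from the accept loop. Returns 0 if queued,
 * -1 if the queue is full and the connection should be dropped. */
int enqueue_client(int fd) {
    pthread_mutex_lock(&q_mu);
    if (q_len == QUEUE_CAP) {        /* shed load under flood */
        pthread_mutex_unlock(&q_mu);
        return -1;
    }
    queue[q_tail] = fd;
    q_tail = (q_tail + 1) % QUEUE_CAP;
    q_len++;
    pthread_cond_signal(&q_not_empty);
    pthread_mutex_unlock(&q_mu);
    return 0;
}

/* Worker: loops forever, serving one connection at a time. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_mu);
        while (q_len == 0)
            pthread_cond_wait(&q_not_empty, &q_mu);
        int fd = queue[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_len--;
        pthread_mutex_unlock(&q_mu);
        handle_client(fd);
        close(fd);
    }
    return NULL;
}

void start_workers(void) {
    for (int i = 0; i < NWORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);
    }
}
```

True async I/O (epoll/kqueue or an event library) would go further by multiplexing many sockets per worker, but even this bounded pool removes the unbounded-thread failure mode discussed above.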
Goal: Make Scraping Economically Expensive
- Core idea: serve recursively linked Markov garbage so crawlers burn CPU, memory, and especially bandwidth at very low cost to the site (a minimal sketch follows this list).
- If widely adopted, this could shift economics for LLM scrapers; they might need an LLM gatekeeper to detect junk, greatly raising per-page cost.
- Some note storage and bandwidth costs also fall on scrapers; others suspect many scrapers get paid per GB regardless of content quality.
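As a toy illustration of the asymmetry, here is a self-contained order-1 Markov babbler: each page costs the server a few string lookups, yet every link on it leads to another generated page. The seed corpus, the /maze/ URL scheme, and the page size are invented for the sketch and are not the posted project's.

```c
/* Toy order-1 Markov babbler: emits one page of pseudo-text whose links all
 * point back into the maze, so following them only yields more garbage. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static const char *seed =
    "the quick brown fox jumps over the lazy dog and the dog barks at "
    "the quick fox while the lazy crawler follows every link it sees";

#define MAXW 64
static char *words[MAXW];
static int   nwords;

/* Split the seed corpus into words once. */
static void build_corpus(void) {
    static char buf[512];
    strncpy(buf, seed, sizeof buf - 1);
    for (char *w = strtok(buf, " "); w && nwords < MAXW; w = strtok(NULL, " "))
        words[nwords++] = w;
}

/* Markov step: pick a random successor of the current word in the corpus;
 * fall back to a random word if the current word never recurs. */
static int next_word(int cur) {
    int candidates[MAXW], n = 0;
    for (int i = 0; i + 1 < nwords; i++)
        if (strcmp(words[i], words[cur]) == 0)
            candidates[n++] = i + 1;
    return n ? candidates[rand() % n] : rand() % nwords;
}

/* Emit one "page": ~200 words of babble plus a handful of links deeper in. */
static void emit_page(FILE *out) {
    int cur = rand() % nwords;
    fputs("<html><body><p>", out);
    for (int i = 0; i < 200; i++) {
        fprintf(out, "%s ", words[cur]);
        cur = next_word(cur);
    }
    fputs("</p>", out);
    for (int i = 0; i < 5; i++)
        fprintf(out, "<a href=\"/maze/%08x\">%s</a> ",
                (unsigned)rand(), words[rand() % nwords]);
    fputs("</body></html>\n", out);
}

int main(void) {
    srand((unsigned)time(NULL));
    build_corpus();
    emit_page(stdout);   /* a real server would write this to the socket */
    return 0;
}
```

Each page costs the server almost nothing to produce, while a crawler that fetches, stores, and later processes millions of them pays the full price in bandwidth and compute, which is the economic shift described above.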
Can Bots & LLMs Just Filter the Garbage?
- Several argue LLMs can cheaply classify nonsense text and already train on mostly low-quality data; training pipelines already filter junk.
- Others respond that even adding an LLM-in-the-loop triage step is expensive at scale and turns this into an arms race.
- Markov text is seen as relatively easy to detect; more subtle poisoning (e.g., realistic but misleading content) is suggested but controversial.
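To make the "filtering is cheap" side concrete, here is one deliberately crude heuristic of the kind data-cleaning pipelines use: the fraction of repeated word trigrams, which scores low-order babble from a small corpus far higher than ordinary prose. The hashing scheme and any threshold you would apply are assumptions, not any scraper's actual pipeline.

```c
/* Illustrative junk signal: fraction of word trigrams that repeat within a
 * document. Repetitive generator output scores high; normal prose scores low. */
#include <stdint.h>
#include <string.h>

/* FNV-1a over three consecutive words, as a cheap trigram fingerprint. */
static uint64_t trigram_hash(const char *a, const char *b, const char *c) {
    const char *parts[3] = { a, b, c };
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < 3; i++)
        for (const char *p = parts[i]; *p; p++)
            h = (h ^ (unsigned char)*p) * 1099511628211ULL;
    return h;
}

/* Returns the fraction (0.0 .. 1.0) of trigrams already seen earlier in text. */
double duplicate_trigram_fraction(const char *text) {
    enum { MAXW = 4096, NBUCKETS = 8192 };
    static char buf[1 << 16];
    static uint64_t seen[NBUCKETS];
    char *words[MAXW];
    int nwords = 0, dup = 0, total = 0;

    memset(seen, 0, sizeof seen);
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " \t\r\n"); w != NULL && nwords < MAXW;
         w = strtok(NULL, " \t\r\n"))
        words[nwords++] = w;

    for (int i = 0; i + 2 < nwords; i++, total++) {
        uint64_t h = trigram_hash(words[i], words[i + 1], words[i + 2]);
        uint64_t slot = h % NBUCKETS;
        if (seen[slot] == h)
            dup++;            /* trigram seen before (modulo hash collisions) */
        else
            seen[slot] = h;   /* crude: collisions simply overwrite */
    }
    return total > 0 ? (double)dup / total : 0.0;
}
```

A real pipeline would combine several such signals (symbol ratios, deduplication, perplexity under a small model), which is why the counter-argument above is about cost at scale rather than feasibility.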
Bot Behavior, Traps & Hidden Links
- Strategy: hide links (CSS tricks, 0px links, off-screen elements) that humans won’t click but naive crawlers will, leading them into the garbage maze; a sketch of such a link follows this list.
- Counterpoint: more advanced bots could render pages, consider visual layout, or only follow “visible” links; at that point scraping costs rise further.
- Some doubt the trap actually “protects” real pages, since many crawlers prioritize pre‑queued known URLs over newly discovered ones.
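A sketch of the hidden-link trap, emitted from C like the rest of the page: the anchor is pushed far off-screen, so visitors never see or click it, but a crawler that only parses HTML will still extract and follow it. The /maze/ path and the function name are illustrative, not the project's.

```c
/* Emit an anchor that is invisible to human visitors (positioned far
 * off-screen, skipped by keyboard focus and screen readers) but still present
 * in the HTML for naive crawlers to harvest. */
#include <stdio.h>

void emit_trap_link(FILE *out, unsigned page_id) {
    fprintf(out,
        "<a href=\"/maze/%08x\" tabindex=\"-1\" aria-hidden=\"true\" "
        "style=\"position:absolute;left:-9999px;top:-9999px\">archive</a>\n",
        page_id);
}
```

Whether to also mark such links rel="nofollow" and disallow the maze in robots.txt is the real design choice: with those signals, well-behaved crawlers are spared and only bots that ignore them fall in; without them, anything that parses the page is fair game.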
Legal & Authentication Angles
- Idea: put content behind Basic Auth with public credentials like nobots/nobots (a sketch follows this list). Some say bots could easily use them; others raise legal risks of using public or leaked passwords at scale.
- Discussion touches on DMCA anti-circumvention and whether public credentials still count as “access controls” for bots vs. humans.
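A sketch of the public-credentials idea, assuming the Authorization header has already been extracted from the request: the expected value is just the base64 of nobots:nobots, and is_authorized/send_challenge are illustrative names.

```c
/* Only answer requests that present the publicly advertised credentials;
 * everyone else gets a 401 whose realm spells out the username and password. */
#include <stdio.h>
#include <string.h>

/* "Basic " + base64("nobots:nobots") */
static const char *EXPECTED = "Basic bm9ib3RzOm5vYm90cw==";

int is_authorized(const char *authorization_header) {
    return authorization_header != NULL &&
           strcmp(authorization_header, EXPECTED) == 0;
}

void send_challenge(FILE *out) {
    fputs("HTTP/1.1 401 Unauthorized\r\n"
          "WWW-Authenticate: Basic realm=\"username: nobots, password: nobots\"\r\n"
          "Content-Length: 0\r\n\r\n", out);
}
```

The mechanical side is trivial, as shown; none of it settles the legal question of whether publicly advertised credentials count as an access control.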
Ethical Debate
- One side: serving junk to unauthorized scrapers is a legitimate self-defense against resource abuse and uncompensated data extraction.
- Other side: deliberately poisoning training data increases global “information entropy” and could indirectly harm users of these systems; “two wrongs don’t make a right.”
Alternative Defenses & Operational Notes
- Suggestions: use Cloudflare or other CDNs (met with distrust by some), more precise robots.txt, IP blocking (hard against residential/proxy botnets), basic hidden-link + fail2ban traps, or client-side encryption/obfuscation where JS is acceptable.
- Gzip bombs are discussed as ineffective in practice: streaming decompression and small maximum expansion ratios limit their impact (a capped-decompression sketch follows this list).
- Some note that for low‑traffic personal sites, static hosting (e.g., GitHub Pages) sidesteps most cost concerns entirely.
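On the gzip-bomb point: a client that streams decompression through a fixed buffer can cap total output and abort, and a single DEFLATE layer tops out at an expansion ratio on the order of 1000:1, so a small payload cannot balloon without bound. Below is a sketch of such a capped reader using zlib; MAX_OUTPUT is an arbitrary illustrative limit.

```c
/* Capped streaming decompression: inflate in fixed-size chunks and bail out
 * once the output exceeds a sane limit, so a "bomb" never materialises. */
#include <string.h>
#include <zlib.h>

#define MAX_OUTPUT (8UL * 1024 * 1024)   /* give up after 8 MiB of output */

/* Returns 0 on success, -1 on error or if the stream expands past the cap. */
int inflate_capped(const unsigned char *in, size_t in_len) {
    z_stream zs;
    unsigned char out[16384];
    unsigned long total = 0;

    memset(&zs, 0, sizeof zs);
    if (inflateInit2(&zs, 15 + 16) != Z_OK)   /* 15+16: expect a gzip wrapper */
        return -1;

    zs.next_in  = (unsigned char *)in;
    zs.avail_in = (uInt)in_len;

    int ret;
    do {
        zs.next_out  = out;
        zs.avail_out = sizeof out;
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            inflateEnd(&zs);
            return -1;
        }
        total += sizeof out - zs.avail_out;
        if (total > MAX_OUTPUT) {             /* output cap hit: stop inflating */
            inflateEnd(&zs);
            return -1;
        }
        /* a real client would process the chunk in `out` here */
    } while (ret != Z_STREAM_END);

    inflateEnd(&zs);
    return 0;
}
```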