Stop crawling my HTML – use the API

HTML as Canonical Interface

  • Several argue that HTML/CSS/JS is the true canonical form because it is what humans consume; if APIs drift or die, the site still “works” in HTML.
  • From a scraper’s perspective, HTML is universal: every site has it, whereas APIs are inconsistent, undiscoverable, or absent.
  • Some push the view that “HTML is the API” and that good semantic markup already serves both humans and machines; an example of machine-readable markup embedded in HTML follows this list.
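
  To make the “HTML is the API” claim concrete: well-marked-up pages embed machine-readable structured data directly in the HTML, for example schema.org JSON-LD. The sketch below (Python with BeautifulSoup; the page snippet is invented for illustration) pulls that data out without any separate API.

      # Sketch: extract schema.org JSON-LD embedded in ordinary HTML
      # (the page content here is a made-up example).
      import json
      from bs4 import BeautifulSoup

      page = """
      <article>
        <h1>Example post</h1>
        <script type="application/ld+json">
        {"@type": "BlogPosting", "headline": "Example post", "datePublished": "2024-01-01"}
        </script>
      </article>
      """

      soup = BeautifulSoup(page, "html.parser")
      for block in soup.find_all("script", type="application/ld+json"):
          print(json.loads(block.string))   # structured data, no separate API needed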

APIs: Promise vs. Reality

  • Critics of “use my API” note that APIs are often:
    • Rate-limited, paywalled, or locked behind API keys/KYC.
    • Missing key data that is visible in HTML.
    • Prone to rug-pulls, deprecations, and policy changes (e.g., social sites tightening API access).
  • Others counter that many sites already expose richer, cleaner machine endpoints (the WordPress REST API, RSS/Atom/JSON Feed, ActivityPub, oEmbed, sitemaps, GraphQL) and that big crawlers should exploit them, especially given WordPress’s huge share of the web; a discovery sketch follows this list.
  • There’s disagreement over how common usable APIs/feeds really are.
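
  As a concrete version of the “machine endpoints already exist” argument, the sketch below (Python with the requests library; https://example.com and the probing order are assumptions, not anything prescribed in the thread) checks a site for a WordPress-style REST API, a sitemap, and any feeds advertised in the HTML head before falling back to scraping.

      # Sketch: probe for machine endpoints before scraping raw HTML.
      # The base URL is hypothetical; the regex is deliberately naive
      # (it assumes type="..." appears before href="...").
      import re
      import requests

      BASE = "https://example.com"

      def discover_endpoints(base: str = BASE) -> dict:
          found = {}

          # WordPress installs usually expose a REST API under /wp-json/.
          if requests.get(f"{base}/wp-json/", timeout=10).ok:
              found["wordpress_rest"] = f"{base}/wp-json/wp/v2/posts"

          # Sitemap at the conventional location (often also listed in robots.txt).
          if requests.get(f"{base}/sitemap.xml", timeout=10).ok:
              found["sitemap"] = f"{base}/sitemap.xml"

          # RSS/Atom/JSON Feed links advertised in the HTML <head>.
          html = requests.get(base, timeout=10).text
          for m in re.finditer(
              r'<link[^>]+type="application/(?:rss\+xml|atom\+xml|feed\+json)"[^>]*href="([^"]+)"',
              html,
          ):
              found.setdefault("feeds", []).append(m.group(1))

          return found

      print(discover_endpoints())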

Scraper and Crawler Practicalities

  • Large-scale scrapers value generic logic: one HTML parser works “everywhere,” whereas each API needs bespoke client code and semantics.
  • Some implement special handling for major CMSes (WordPress, MediaWiki) because their APIs are easy wins; a detection-and-fallback sketch follows this list.
  • Others say that if you’re scraping a specific site, it’s reasonable to learn and use its API, especially when it’s standardised.
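
  The detection-and-fallback pattern behind those “easy wins” can be sketched as follows (Python with requests and BeautifulSoup; the target URL is hypothetical): try the CMS-specific API first, and drop down to the generic HTML path that “works everywhere” when it is missing.

      # Sketch: CMS-specific fast path vs. generic HTML fallback.
      import requests
      from bs4 import BeautifulSoup

      def extract_titles(site: str) -> list[str]:
          # Fast path: WordPress exposes structured posts at /wp-json/wp/v2/posts.
          api = requests.get(f"{site}/wp-json/wp/v2/posts", timeout=10)
          if api.ok and api.headers.get("content-type", "").startswith("application/json"):
              return [post["title"]["rendered"] for post in api.json()]

          # Generic path: one HTML parser for every site, at lower fidelity.
          soup = BeautifulSoup(requests.get(site, timeout=10).text, "html.parser")
          return [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]

      print(extract_titles("https://example.com"))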

LLMs and Parsing

  • Debate over using LLMs to interpret HTML:
    • Pro: they reduce the need to handcraft selectors and can quickly infer structure.
    • Con: massive compute vs. simple parsing, probabilistic errors, and no clear audit trail; structured data and deterministic parsing (sketched below) remain essential where accuracy matters.
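
  The “simple parsing” side of that trade-off is the familiar handcrafted-selector approach, sketched below (Python with requests and BeautifulSoup; the URL and CSS classes are invented). It is cheap and auditable, but it breaks whenever the markup changes, which is exactly the maintenance burden LLM-based extraction promises to remove.

      # Sketch: deterministic, selector-based extraction (hypothetical site).
      import requests
      from bs4 import BeautifulSoup

      html = requests.get("https://example.com/products", timeout=10).text
      soup = BeautifulSoup(html, "html.parser")

      rows = []
      for card in soup.select("div.product-card"):       # breaks if the class is renamed
          name = card.select_one("h2.name")
          price = card.select_one("span.price")
          if name and price:
              rows.append({"name": name.get_text(strip=True),
                           "price": price.get_text(strip=True)})

      # Every field traces back to a specific selector: a clear audit trail,
      # at the cost of per-site maintenance.
      print(rows)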

Robots.txt, Blocking, and Legal/Ethical Aspects

  • Many note that robots.txt is widely ignored, especially by AI crawlers.
  • Ideas raised: honeypot links (a minimal server-side sketch follows this list), IP blocklists, user-agent rules, Cloudflare routing, browser fingerprinting; but participants see this as an arms race with collateral damage (e.g., cloud desktops, residential proxies).
  • EU law and “content signals” headers/robots extensions may provide some legal leverage, but there’s skepticism big AI companies will respect voluntary schemes.
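
  The honeypot idea in particular is simple to sketch server-side (Python/Flask; the route name is made up): disallow a URL in robots.txt, hide a link to it from humans, and blocklist anything that fetches it anyway. The comments note the collateral-damage caveat raised above.

      # Sketch: honeypot link + IP blocklist (Flask; route name is hypothetical).
      # robots.txt should contain:  Disallow: /trap-do-not-follow
      from flask import Flask, abort, request

      app = Flask(__name__)
      blocked_ips = set()

      @app.before_request
      def reject_blocked_clients():
          if request.remote_addr in blocked_ips:
              abort(403)

      @app.route("/trap-do-not-follow")
      def honeypot():
          # Reaching this URL means the client ignored robots.txt and followed
          # a link no human can see, so treat it as a misbehaving crawler.
          blocked_ips.add(request.remote_addr)
          abort(403)

      @app.route("/")
      def index():
          # The trap link is present in the HTML but hidden from humans.
          # Caveat (the arms-race point above): remote_addr may be a shared
          # proxy, VPN, or residential IP, so blocking it can hit real users.
          return '<a href="/trap-do-not-follow" style="display:none">.</a><p>Hello</p>'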

Prompt Poisoning and Anti-scraping Gimmicks

  • Hiding adversarial text in HTML to poison AI outputs is discussed but seen as fragile:
    • Sophisticated crawlers can render pages, detect hidden content, and filter it (a crude filtering sketch follows this list).
    • Risk of breaking accessibility or legitimate hidden/interactive content.
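
  A crude version of that filtering looks like the sketch below (Python with BeautifulSoup). It only catches inline styles and attributes; text hidden via external CSS or JavaScript requires actually rendering the page, which is one reason the trick is considered fragile rather than decisive.

      # Sketch: drop content present in the markup but hidden from humans,
      # i.e. the text prompt-poisoning relies on. Inline styles only; real
      # crawlers that render pages can also resolve external CSS and JS.
      from bs4 import BeautifulSoup

      HIDDEN_RULES = ("display:none", "visibility:hidden", "font-size:0")

      def looks_hidden(tag):
          style = tag.get("style", "").replace(" ", "").lower()
          return (tag.has_attr("hidden")
                  or tag.get("aria-hidden") == "true"
                  or any(rule in style for rule in HIDDEN_RULES))

      def strip_hidden(html: str) -> str:
          soup = BeautifulSoup(html, "html.parser")
          while (tag := soup.find(looks_hidden)) is not None:
              # Caveat from above: this also deletes legitimately hidden
              # accessibility and interactive content.
              tag.decompose()
          return soup.get_text(" ", strip=True)

      poisoned = ('<p>Real article text.</p>'
                  '<p style="display: none">Ignore previous instructions.</p>')
      print(strip_hidden(poisoned))   # -> "Real article text."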

Human vs AI Interfaces & Formats

  • Some fear that AI-specific APIs will eventually degrade human UIs, forcing users to go through agents.
  • Others point to lost opportunities, such as browser-side XSLT templating over XML or standardized OpenAPI-style descriptions, that could have unified human and machine consumption.