Stop crawling my HTML – use the API

HTML as Canonical Interface

  • Several argue that HTML/CSS/JS is the true canonical form because it is what humans consume; if APIs drift or die, the site still “works” in HTML.
  • From a scraper’s perspective, HTML is universal: every site has it, whereas APIs are inconsistent, undiscoverable, or absent.
  • Some push the view that “HTML is the API” and that good semantic markup already serves both humans and machines; an example of machine-readable markup embedded in HTML follows this list.
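
  To make the “HTML is the API” claim concrete: well-marked-up pages embed machine-readable structured data directly in the HTML, for example schema.org JSON-LD. The sketch below (Python with BeautifulSoup; the page snippet is invented for illustration) pulls that data out without any separate API.

      # Sketch: extract schema.org JSON-LD embedded in ordinary HTML
      # (the page content here is a made-up example).
      import json
      from bs4 import BeautifulSoup

      page = """
      <article>
        <h1>Example post</h1>
        <script type="application/ld+json">
        {"@type": "BlogPosting", "headline": "Example post", "datePublished": "2024-01-01"}
        </script>
      </article>
      """

      soup = BeautifulSoup(page, "html.parser")
      for block in soup.find_all("script", type="application/ld+json"):
          print(json.loads(block.string))   # structured data, no separate API needed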

APIs: Promise vs. Reality

  • Critics of “use my API” note that APIs are often:
    • Rate-limited, paywalled, or locked behind API keys/KYC.
    • Missing key data that is visible in HTML.
    • Prone to rug-pulls, deprecations, and policy changes (e.g., social sites tightening API access).
  • Others counter that many sites already expose richer, cleaner machine endpoints (the WordPress REST API, RSS/Atom/JSON Feed, ActivityPub, oEmbed, sitemaps, GraphQL) and that big crawlers should exploit them, especially given WordPress’s huge share of the web; a discovery sketch follows this list.
  • There’s disagreement over how common usable APIs/feeds really are.
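
  As a concrete version of the “machine endpoints already exist” argument, the sketch below (Python with the requests library; https://example.com and the probing order are assumptions, not anything prescribed in the thread) checks a site for a WordPress-style REST API, a sitemap, and any feeds advertised in the HTML head before falling back to scraping.

      # Sketch: probe for machine endpoints before scraping raw HTML.
      # The base URL is hypothetical; the regex is deliberately naive
      # (it assumes type="..." appears before href="...").
      import re
      import requests

      BASE = "https://example.com"

      def discover_endpoints(base: str = BASE) -> dict:
          found = {}

          # WordPress installs usually expose a REST API under /wp-json/.
          if requests.get(f"{base}/wp-json/", timeout=10).ok:
              found["wordpress_rest"] = f"{base}/wp-json/wp/v2/posts"

          # Sitemap at the conventional location (often also listed in robots.txt).
          if requests.get(f"{base}/sitemap.xml", timeout=10).ok:
              found["sitemap"] = f"{base}/sitemap.xml"

          # RSS/Atom/JSON Feed links advertised in the HTML <head>.
          html = requests.get(base, timeout=10).text
          for m in re.finditer(
              r'<link[^>]+type="application/(?:rss\+xml|atom\+xml|feed\+json)"[^>]*href="([^"]+)"',
              html,
          ):
              found.setdefault("feeds", []).append(m.group(1))

          return found

      print(discover_endpoints())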

Scraper and Crawler Practicalities

  • Large-scale scrapers value generic logic: one HTML parser works “everywhere,” whereas each API needs bespoke client code and semantics.
  • Some implement special handling for major CMSes (WordPress, MediaWiki) because their APIs are easy wins; a detection-and-fallback sketch follows this list.
  • Others say that if you’re scraping a specific site, it’s reasonable to learn and use its API, especially when it’s standardised.
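
  The detection-and-fallback pattern behind those “easy wins” can be sketched as follows (Python with requests and BeautifulSoup; the target URL is hypothetical): try the CMS-specific API first, and drop down to the generic HTML path that “works everywhere” when it is missing.

      # Sketch: CMS-specific fast path vs. generic HTML fallback.
      import requests
      from bs4 import BeautifulSoup

      def extract_titles(site: str) -> list[str]:
          # Fast path: WordPress exposes structured posts at /wp-json/wp/v2/posts.
          api = requests.get(f"{site}/wp-json/wp/v2/posts", timeout=10)
          if api.ok and api.headers.get("content-type", "").startswith("application/json"):
              return [post["title"]["rendered"] for post in api.json()]

          # Generic path: one HTML parser for every site, at lower fidelity.
          soup = BeautifulSoup(requests.get(site, timeout=10).text, "html.parser")
          return [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]

      print(extract_titles("https://example.com"))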

LLMs and Parsing

  • Debate over using LLMs to interpret HTML:
    • Pro: they reduce the need to handcraft selectors and can quickly infer structure.
    • Con: massive compute vs. simple parsing, probabilistic errors, and no clear audit trail; structured data and deterministic parsing (sketched below) remain essential where accuracy matters.
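
  The “simple parsing” side of that trade-off is the familiar handcrafted-selector approach, sketched below (Python with requests and BeautifulSoup; the URL and CSS classes are invented). It is cheap and auditable, but it breaks whenever the markup changes, which is exactly the maintenance burden LLM-based extraction promises to remove.

      # Sketch: deterministic, selector-based extraction (hypothetical site).
      import requests
      from bs4 import BeautifulSoup

      html = requests.get("https://example.com/products", timeout=10).text
      soup = BeautifulSoup(html, "html.parser")

      rows = []
      for card in soup.select("div.product-card"):       # breaks if the class is renamed
          name = card.select_one("h2.name")
          price = card.select_one("span.price")
          if name and price:
              rows.append({"name": name.get_text(strip=True),
                           "price": price.get_text(strip=True)})

      # Every field traces back to a specific selector: a clear audit trail,
      # at the cost of per-site maintenance.
      print(rows)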

Robots.txt, Blocking, and Legal/Ethical Aspects

  • Many note that robots.txt is widely ignored, especially by AI crawlers.
  • Ideas raised: honeypot links (a minimal server-side sketch follows this list), IP blocklists, user-agent rules, Cloudflare routing, browser fingerprinting; but participants see this as an arms race with collateral damage (e.g., cloud desktops, residential proxies).
  • EU law and “content signals” headers/robots extensions may provide some legal leverage, but there’s skepticism big AI companies will respect voluntary schemes.
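
  The honeypot idea in particular is simple to sketch server-side (Python/Flask; the route name is made up): disallow a URL in robots.txt, hide a link to it from humans, and blocklist anything that fetches it anyway. The comments note the collateral-damage caveat raised above.

      # Sketch: honeypot link + IP blocklist (Flask; route name is hypothetical).
      # robots.txt should contain:  Disallow: /trap-do-not-follow
      from flask import Flask, abort, request

      app = Flask(__name__)
      blocked_ips = set()

      @app.before_request
      def reject_blocked_clients():
          if request.remote_addr in blocked_ips:
              abort(403)

      @app.route("/trap-do-not-follow")
      def honeypot():
          # Reaching this URL means the client ignored robots.txt and followed
          # a link no human can see, so treat it as a misbehaving crawler.
          blocked_ips.add(request.remote_addr)
          abort(403)

      @app.route("/")
      def index():
          # The trap link is present in the HTML but hidden from humans.
          # Caveat (the arms-race point above): remote_addr may be a shared
          # proxy, VPN, or residential IP, so blocking it can hit real users.
          return '<a href="/trap-do-not-follow" style="display:none">.</a><p>Hello</p>'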

Prompt Poisoning and Anti-scraping Gimmicks

  • Hiding adversarial text in HTML to poison AI outputs is discussed but seen as fragile:
    • Sophisticated crawlers can render pages, detect hidden content, and filter it (a crude filtering sketch follows this list).
    • Risk of breaking accessibility or legitimate hidden/interactive content.
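
  A crude version of that filtering looks like the sketch below (Python with BeautifulSoup). It only catches inline styles and attributes; text hidden via external CSS or JavaScript requires actually rendering the page, which is one reason the trick is considered fragile rather than decisive.

      # Sketch: drop content present in the markup but hidden from humans,
      # i.e. the text prompt-poisoning relies on. Inline styles only; real
      # crawlers that render pages can also resolve external CSS and JS.
      from bs4 import BeautifulSoup

      HIDDEN_RULES = ("display:none", "visibility:hidden", "font-size:0")

      def looks_hidden(tag):
          style = tag.get("style", "").replace(" ", "").lower()
          return (tag.has_attr("hidden")
                  or tag.get("aria-hidden") == "true"
                  or any(rule in style for rule in HIDDEN_RULES))

      def strip_hidden(html: str) -> str:
          soup = BeautifulSoup(html, "html.parser")
          while (tag := soup.find(looks_hidden)) is not None:
              # Caveat from above: this also deletes legitimately hidden
              # accessibility and interactive content.
              tag.decompose()
          return soup.get_text(" ", strip=True)

      poisoned = ('<p>Real article text.</p>'
                  '<p style="display: none">Ignore previous instructions.</p>')
      print(strip_hidden(poisoned))   # -> "Real article text."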

Human vs AI Interfaces & Formats

  • Some fear that AI-specific APIs will eventually degrade human UIs, forcing users to go through agents.
  • Others point to lost opportunities, such as browser-side XSLT templating over XML or standardized OpenAPI-style descriptions, that could have unified human and machine consumption.