Stop crawling my HTML – use the API
HTML as Canonical Interface
- Several argue that HTML/CSS/JS is the true canonical form because it is what humans consume; if APIs drift or die, the site still “works” in HTML.
- From a scraper’s perspective, HTML is universal: every site has it, whereas APIs are inconsistent, undiscoverable, or absent.
- Some push the view that “HTML is the API” and that good semantic markup already serves both humans and machines.
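A minimal sketch of the “HTML is the API” view in practice: reading the structured data (JSON-LD) that many pages already embed for machines, assuming Python with `requests` and BeautifulSoup. The URL and the printed fields are illustrative, not taken from the discussion.

```python
# Sketch: treating "HTML as the API" by consuming embedded JSON-LD
# (<script type="application/ld+json"> blocks) instead of scraping layout.
import json

import requests
from bs4 import BeautifulSoup

def extract_json_ld(url: str) -> list[dict]:
    """Return every JSON-LD object embedded in the page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    objects = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # malformed blocks are common in the wild; skip them
        objects.extend(data if isinstance(data, list) else [data])
    return objects

if __name__ == "__main__":
    # Hypothetical article URL for illustration only.
    for obj in extract_json_ld("https://example.com/article"):
        print(obj.get("@type"), obj.get("headline"))
```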
APIs: Promise vs. Reality
- Critics of “use my API” note that APIs are often:
  - Rate-limited, paywalled, or gated behind keys/KYC.
  - Missing key data that is visible in the HTML.
  - Prone to rug-pulls, deprecations, and policy changes (e.g., social sites tightening API access).
- Others counter that many sites already expose richer, cleaner machine endpoints (WordPress’s REST API, RSS/Atom/JSON Feed, ActivityPub, oEmbed, sitemaps, GraphQL) and that big crawlers should exploit them, especially given WordPress’s huge share of the web; see the discovery sketch after this list.
- There’s disagreement over how common usable APIs/feeds really are.
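A rough endpoint-discovery sketch, assuming Python with `requests` and BeautifulSoup. The `/wp-json/` and `/sitemap.xml` paths are common conventions rather than guarantees, and every probe result should be treated as a hint.

```python
# Sketch: look for machine-readable endpoints before falling back to HTML scraping.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}

def discover_endpoints(base_url: str) -> dict:
    found = {"feeds": [], "wp_rest": None, "sitemap": None}
    html = requests.get(base_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # RSS/Atom/JSON Feed are usually advertised via <link rel="alternate">.
    for link in soup.find_all("link"):
        rels = link.get("rel") or []
        if "alternate" in rels and link.get("type") in FEED_TYPES and link.get("href"):
            found["feeds"].append(urljoin(base_url, link["href"]))

    # Most WordPress installs expose a REST API root at /wp-json/.
    wp = requests.get(urljoin(base_url, "/wp-json/"), timeout=10)
    if wp.ok and wp.headers.get("content-type", "").startswith("application/json"):
        found["wp_rest"] = wp.url

    # Sitemaps commonly live at the conventional path (or are listed in robots.txt).
    sm = requests.head(urljoin(base_url, "/sitemap.xml"), timeout=10)
    if sm.ok:
        found["sitemap"] = sm.url
    return found
```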
Scraper and Crawler Practicalities
- Large-scale scrapers value generic logic: one HTML parser works “everywhere,” whereas each API needs bespoke client code and semantics.
- Some implement special handling for major CMSes (WordPress, MediaWiki) because their APIs are easy wins; see the fallback sketch after this list.
- Others say that if you’re scraping a specific site, it’s reasonable to learn and use its API, especially when it’s standardised.
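A sketch of that “easy win” pattern: use a known CMS API when one responds, otherwise fall back to generic HTML parsing. The WordPress (`/wp-json/wp/v2/posts`) and MediaWiki (`api.php?action=query`) endpoints are their standard public APIs; the record shape and base URL handling are made up for illustration.

```python
# Sketch: CMS fast path first, generic HTML parsing as the fallback.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def fetch_posts(base_url: str) -> list[dict]:
    # WordPress fast path: /wp-json/wp/v2/posts returns structured post objects.
    wp = requests.get(urljoin(base_url, "/wp-json/wp/v2/posts"), timeout=10)
    if wp.ok and "application/json" in wp.headers.get("content-type", ""):
        return [{"title": p["title"]["rendered"], "url": p["link"]} for p in wp.json()]

    # MediaWiki fast path: the Action API at api.php.
    mw = requests.get(
        urljoin(base_url, "/w/api.php"),
        params={"action": "query", "list": "allpages", "aplimit": 20, "format": "json"},
        timeout=10,
    )
    if mw.ok and "application/json" in mw.headers.get("content-type", ""):
        pages = mw.json().get("query", {}).get("allpages", [])
        return [{"title": p["title"], "url": None} for p in pages]

    # Generic fallback: whatever the HTML gives us (crude by design).
    soup = BeautifulSoup(requests.get(base_url, timeout=10).text, "html.parser")
    return [{"title": a.get_text(strip=True), "url": a["href"]}
            for a in soup.find_all("a", href=True)]
```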
LLMs and Parsing
- Debate over using LLMs to interpret HTML:
  - Pro: they reduce the need to hand-craft selectors and can quickly infer page structure.
  - Con: massive compute versus simple parsing, probabilistic errors, and no clear audit trail; structured data remains essential where accuracy matters (see the validation sketch after this list).
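One hedged way to bridge the two positions is to validate whatever the model returns against a strict schema rather than trusting it. In the sketch below, `call_llm` is a hypothetical stand-in for a real model client, and the required fields are illustrative.

```python
# Sketch: never trust LLM-extracted records without schema validation.
import json

REQUIRED_FIELDS = {"title": str, "author": str, "published": str}

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual model client."""
    raise NotImplementedError

def extract_record(html: str) -> dict:
    raw = call_llm(
        "Extract title, author and published date from this HTML as JSON only:\n" + html
    )
    record = json.loads(raw)  # fails loudly if the model returns non-JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"LLM output failed validation on {field!r}: {record!r}")
    return record
```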
Robots.txt, Blocking, and Legal/Ethical Aspects
- Many note that robots.txt is widely ignored, especially by AI crawlers (a minimal compliance check is sketched after this list).
- Ideas raised: honeypot links, IP blocklists, user-agent rules, Cloudflare routing, browser fingerprinting; participants see this as an arms race with collateral damage, e.g., blocking legitimate users on cloud desktops or shared IPs while scrapers hide behind residential proxies.
- EU law and “content signals” headers/robots extensions may provide some legal leverage, but there is skepticism that big AI companies will respect voluntary schemes.
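For reference, honoring robots.txt takes very little code on the crawler side; this standard-library sketch uses an illustrative user agent string and URL.

```python
# Sketch: a robots.txt compliance check with only the Python standard library.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

def allowed(url: str, user_agent: str = "ExampleBot/1.0") -> bool:
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed("https://example.com/private/page.html"))
```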
Prompt Poisoning and Anti-scraping Gimmicks
- Hiding adversarial text in HTML to poison AI outputs is discussed but seen as fragile:
  - Sophisticated crawlers can render pages, detect hidden content, and filter it out (a rough detection sketch follows this list).
  - Risk of breaking accessibility or legitimate hidden/interactive content.
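A rough static approximation of that filtering, assuming Python with BeautifulSoup. A real crawler would evaluate computed CSS in a headless browser; this sketch only looks at inline styles and ARIA/`hidden` attributes, which is exactly why it can also clobber legitimate hidden content.

```python
# Sketch: strip elements that are hidden by inline style or markup attributes.
import re

from bs4 import BeautifulSoup

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0")

def strip_hidden(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        style = tag.get("style", "")
        if (
            HIDDEN_STYLE.search(style)
            or tag.get("aria-hidden") == "true"
            or tag.has_attr("hidden")
        ):
            tag.extract()  # detach the hidden subtree from the document
    return soup.get_text(" ", strip=True)
```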
Human vs AI Interfaces & Formats
- Some fear that AI-specific APIs will eventually degrade human UIs, forcing users to go through agents.
- Others point to lost opportunities like browser-side XSLT/XML+templates or standardized OpenAPI-style descriptions that could have unified human and machine consumption.