It seems that OpenAI is scraping [certificate transparency] logs

OpenAI bot behavior and identification

  • Commenters verify that the IP in the blog post falls within OpenAI’s published searchbot IP ranges and that the User-Agent is consistent with its declared crawler.
  • Some note the UA string is messy/malformed but still clearly self-identifying; others suggest blocking malformed UAs outright.
  • Header spoofing is mentioned as common among scrapers, but in this case the IP check confirms it really is OpenAI.
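The IP check the commenters describe can be sketched with Python’s stdlib `ipaddress` module. The CIDR below is a hypothetical placeholder; OpenAI’s actual searchbot ranges are published in a machine-readable list and change over time, so a real check should fetch the current list rather than hardcode it:

```python
import ipaddress

# Hypothetical example range standing in for OpenAI's published
# searchbot ranges (fetch the real, current list in production).
OPENAI_RANGES = [ipaddress.ip_network("20.42.10.176/28")]

def is_openai_ip(addr: str) -> bool:
    """Return True if addr falls inside any published searchbot range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in OPENAI_RANGES)
```

This is why the IP check beats UA matching: a scraper can spoof any User-Agent header, but it cannot send packets from someone else’s published address range.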

Certificate Transparency (CT) logs as a public feed

  • Multiple people stress that CT logs are explicitly designed as public, third‑party–consumable data (“transparency” is the point).
  • Many systems already monitor CT: search engines, security firms, archives, bots, “script kiddies,” etc. For some, this makes the story unremarkable.
  • One view: this is equivalent to using a phone book; anyone can read it and act on it.
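The “anyone can read it” point is easy to demonstrate: crt.sh offers a JSON output mode for its search. A minimal sketch, assuming crt.sh’s current behavior of returning records whose `name_value` field holds newline-separated names:

```python
import json
from urllib.parse import urlencode

def crtsh_url(domain: str) -> str:
    # %.example.com is crt.sh's syntax for matching subdomains.
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})

def hostnames(records: list) -> set:
    """Collect unique hostnames from crt.sh JSON records."""
    names = set()
    for rec in records:
        # One certificate can cover several names, newline-separated.
        names.update(rec.get("name_value", "").splitlines())
    return names
```

Fetching `crtsh_url("example.com")` and feeding the parsed JSON to `hostnames` yields every name certified for that domain, which is exactly the “phone book” the commenter describes.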

Use cases: scrapers, security, and discovery

  • CT logs provide an almost real-time feed of new hostnames, useful for:
    • Discovering new websites to crawl/index.
    • Detecting rogue certificates issued for your domains.
    • Security scanning (e.g., finding fresh WordPress installs).
  • Some see OpenAI’s use as standard practice: if your job is to crawl the web, CT is a natural starting point.
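The near-real-time feed works by polling a log’s signed tree head (`/ct/v1/get-sth`, RFC 6962) for its `tree_size` and then fetching new entries in batches via `/ct/v1/get-entries?start=&end=`. A sketch of just the range bookkeeping, leaving out HTTP and MerkleTreeLeaf parsing:

```python
def new_entry_range(last_seen: int, tree_size: int, batch: int = 256):
    """Given the last entry index we've processed and the log's current
    tree_size (from /ct/v1/get-sth), return the inclusive (start, end)
    range for the next /ct/v1/get-entries call, or None if caught up."""
    if last_seen >= tree_size:
        return None
    end = min(last_seen + batch, tree_size) - 1
    return (last_seen, end)
```

A crawler loops: poll get-sth, compute the next range, fetch and decode the entries, extract the SAN hostnames, repeat. Logs cap the batch size per request, hence the `batch` parameter.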

Privacy, surprise, and mitigation

  • Several commenters admit they hadn’t realized that issuing a public TLS cert effectively announces a hostname to the entire world.
  • Concern: sites not linked anywhere but using public certs are still “found” immediately via CT.
  • Suggested mitigations:
    • Use wildcard certs (so subdomains aren’t individually exposed in logs), ideally terminated at a shared load balancer.
    • Use private CAs for internal/non-public services.
  • Tradeoffs are noted: a wildcard cert increases the blast radius if its private key is compromised, since one key now covers every subdomain.
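The wildcard mitigation works because the CT log then records only `*.example.com`, not the individual subdomains. A sketch of the matching rule (a TLS wildcard covers exactly one left-most label, per RFC 6125), which shows what is and isn’t hidden:

```python
def matches_wildcard(pattern: str, hostname: str) -> bool:
    """Check a hostname against a TLS certificate name, where a leading
    '*.' matches exactly one left-most DNS label."""
    if not pattern.startswith("*."):
        return pattern.lower() == hostname.lower()
    suffix = pattern[1:].lower()          # e.g. ".example.com"
    host = hostname.lower()
    # The part replacing '*' must be one non-empty label (no dots).
    return host.endswith(suffix) and "." not in host[: -len(suffix)]
```

So `*.example.com` covers `api.example.com` without naming it in any log, but it does not cover `a.b.example.com` or the bare `example.com`, which is why deeper internal hierarchies still need their own certs (or a private CA).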

Scraping ethics and “stolen” content

  • One side argues that any publicly served content is, by design, available to be read by anyone, including AI and search companies; calling it “stolen” is inaccurate.
  • Others worry about CT being used to shortcut organic discovery and accelerate scraping of brand‑new, possibly unready sites.
  • Some report OpenAI appears to respect robots.txt and published IP/UA conventions, unlike many other scrapers.
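Whether a crawler honors robots.txt is easy to model with Python’s stdlib `urllib.robotparser`. The rules below are a hypothetical example targeting OpenAI’s documented `OAI-SearchBot` user agent:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking OAI-SearchBot from /private/.
robots_txt = """\
User-agent: OAI-SearchBot
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("OAI-SearchBot", "https://example.com/private/x"))  # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/public"))     # True
```

The commenters’ point is that robots.txt is purely advisory: a well-behaved crawler runs a check like this before fetching, while the “many other scrapers” simply skip it.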

Tools, infrastructure, and experimentation

  • crt.sh and merklemap are discussed as CT search tools; merklemap’s scaling and ZeroFS backend come up briefly.
  • Ideas mentioned: honeypot domains discovered only via CT to study bot behavior; feeds that normalize or deduplicate CT data (e.g., names-only APIs).
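A names-only, deduplicated CT feed of the kind mentioned above could be sketched as a streaming filter; the normalization here (lowercase, strip trailing dot) is an assumption about how such a feed would canonicalize names:

```python
def dedupe(names):
    """Yield each normalized hostname from a raw CT name stream once."""
    seen = set()
    for name in names:
        n = name.strip().lower().rstrip(".")  # canonicalize before comparing
        if n and n not in seen:
            seen.add(n)
            yield n
```

This keeps memory proportional to the number of unique names, which matters for a feed that replays every certificate ever logged; the scaling discussion around merklemap is essentially about doing this at full CT volume.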