It seems that OpenAI is scraping [certificate transparency] logs
OpenAI bot behavior and identification
- Commenters verify that the IP in the blog post is inside OpenAI’s published searchbot IP ranges and that the User-Agent is consistent with their declared crawler.
- Some note the UA string is messy/malformed but still clearly self-identifying; others consider blocking malformed UAs entirely.
- Header spoofing is mentioned as common among scrapers, but in this case the IP check confirms it really is OpenAI.
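The IP check the commenters describe can be sketched with Python’s standard library. The CIDR blocks below are placeholders for illustration, not OpenAI’s actual published list (which is a separate document of IP ranges that would be fetched and parsed first):

```python
import ipaddress

# Placeholder CIDR blocks for illustration -- a real check would load the
# crawler IP ranges that the operator actually publishes.
PUBLISHED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_published_crawler_ip(ip: str) -> bool:
    """Return True if `ip` falls inside any published crawler CIDR block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)
```

This is the step that defeats header spoofing: a UA string is trivial to forge, but a request genuinely arriving from inside the published ranges is not.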
Certificate Transparency (CT) logs as a public feed
- Multiple people stress that CT logs are explicitly designed as public, third‑party–consumable data (“transparency” is the point).
- Many systems already monitor CT: search engines, security firms, archives, bots, “script kiddies,” etc. For some, this makes the story unremarkable.
- One view: this is equivalent to using a phone book; anyone can read it and act on it.
Use cases: scrapers, security, and discovery
- CT logs provide an almost real-time feed of new hostnames, useful for:
- Discovering new websites to crawl/index.
- Detecting rogue certificates issued for your domains.
- Security scanning (e.g., finding fresh WordPress installs).
- Some see OpenAI’s use as standard practice: if your job is to crawl the web, CT is a natural starting point.
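As a sketch of that “natural starting point”: RFC 6962 CT logs expose a plain HTTP API (`get-sth` for the current tree head, `get-entries` for ranges of leaves), so a crawler can poll the tree size and fetch only entries it hasn’t seen. The log URL below is a placeholder, and only the pure batch-splitting helper is exercised here:

```python
import json
import urllib.request

LOG_URL = "https://ct.example.net/ct/v1"  # placeholder; real logs are enumerated by browser log programs

def get_tree_size(log_url: str = LOG_URL) -> int:
    """Fetch the log's signed tree head and return its current tree size."""
    with urllib.request.urlopen(f"{log_url}/get-sth") as resp:
        return json.load(resp)["tree_size"]

def batch_ranges(start: int, end: int, batch: int = 256):
    """Split the half-open index range [start, end) into inclusive
    (first, last) pairs, as get-entries expects. Logs cap how many
    entries one request may return, hence the chunking."""
    out = []
    while start < end:
        last = min(start + batch, end) - 1
        out.append((start, last))
        start = last + 1
    return out
```

A polling loop would remember the last tree size it processed and request `get-entries?start=…&end=…` for each new pair, giving the near-real-time hostname feed described above.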
Privacy, surprise, and mitigation
- Several commenters admit they hadn’t realized that issuing a public TLS cert effectively announces a hostname to the entire world.
- Concern: sites not linked anywhere but using public certs are still “found” immediately via CT.
- Suggested mitigations:
- Use wildcard certs (so subdomains aren’t individually exposed in logs), ideally terminated at a shared load balancer.
- Use private CAs for internal/non-public services.
- Tradeoffs are noted: wildcard certs increase blast radius if compromised.
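The wildcard tradeoff can be made concrete: a cert name like `*.example.com` covers any single left-most label (the usual RFC 6125 matching rule), so individual subdomains never show up in CT, but one compromised key then impersonates all of them. A minimal matcher, as a sketch:

```python
def covered_by(hostname: str, cert_name: str) -> bool:
    """True if `cert_name` (possibly a wildcard like *.example.com)
    covers `hostname`. A wildcard matches exactly one left-most label,
    so it neither covers the bare domain nor deeper subdomains."""
    host = hostname.lower().rstrip(".").split(".")
    name = cert_name.lower().rstrip(".").split(".")
    if name[0] == "*":
        return len(host) == len(name) and host[1:] == name[1:]
    return host == name
```

So `app.example.com` stays out of the logs under a wildcard cert, while a per-subdomain cert would have announced it.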
Scraping ethics and “stolen” content
- One side argues that any publicly served content is, by design, available to be read by anyone, including AI and search companies; calling it “stolen” is inaccurate.
- Others worry about CT being used to shortcut organic discovery and accelerate scraping of brand‑new, possibly unready sites.
- Some report OpenAI appears to respect robots.txt and published IP/UA conventions, unlike many other scrapers.
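Respecting robots.txt is mechanically simple, which is why commenters treat it as a baseline. The stdlib parser illustrates the check; the rules and bot token below are made up for the example, not OpenAI’s actual directives:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real crawler would fetch the site's /robots.txt.
rules = [
    "User-agent: ExampleBot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("ExampleBot", "https://example.com/page"))       # → True
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))  # → False
```

A well-behaved crawler runs this check before every fetch; the complaint in the thread is about the many scrapers that don’t.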
Tools, infrastructure, and experimentation
- crt.sh and merklemap are discussed as CT search tools; merklemap’s scaling and ZeroFS backend come up briefly.
- Ideas mentioned: honeypot domains discovered only via CT to study bot behavior; feeds that normalize or deduplicate CT data (e.g., names-only APIs).
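A names-only normalization pass of the kind mentioned is straightforward to sketch: lowercase, strip trailing dots, collapse wildcard labels to their base domain, and deduplicate while preserving order (the exact rules such a feed would use are an assumption here):

```python
def normalize_names(raw_names):
    """Normalize hostnames pulled from CT entries: lowercase, drop any
    trailing dot, collapse wildcard labels, and dedupe in order."""
    seen = set()
    out = []
    for name in raw_names:
        n = name.strip().lower().rstrip(".")
        if n.startswith("*."):
            n = n[2:]  # *.example.com -> example.com
        if n and n not in seen:
            seen.add(n)
            out.append(n)
    return out

print(normalize_names(["Example.COM.", "*.example.com", "api.example.com"]))
# → ['example.com', 'api.example.com']
```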