It seems that OpenAI is scraping [certificate transparency] logs
OpenAI bot behavior and identification
- Commenters verify that the IP in the blog post is inside OpenAI’s published searchbot IP ranges and that the User-Agent is consistent with their declared crawler.
- Some note the UA string is messy/malformed but still clearly self-identifying; others consider blocking malformed UAs entirely.
- Header spoofing is mentioned as common among scrapers, but in this case the IP check confirms it really is OpenAI.
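The IP check the commenters describe can be sketched with Python’s standard library. The CIDR blocks below are placeholders for illustration, not OpenAI’s actual published list (which is a separate document of IP ranges that would be fetched and parsed first):

```python
import ipaddress

# Placeholder CIDR blocks for illustration -- a real check would load the
# crawler IP ranges that the operator actually publishes.
PUBLISHED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_published_crawler_ip(ip: str) -> bool:
    """Return True if `ip` falls inside any published crawler CIDR block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PUBLISHED_RANGES)
```

This is the step that defeats header spoofing: a UA string is trivial to forge, but a request genuinely arriving from inside the published ranges is not.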
Certificate Transparency (CT) logs as a public feed
- Multiple people stress that CT logs are explicitly designed as public, third‑party–consumable data (“transparency” is the point).
- Many systems already monitor CT: search engines, security firms, archives, bots, “script kiddies,” etc. For some, this makes the story unremarkable.
- One view: this is equivalent to using a phone book; anyone can read it and act on it.
Use cases: scrapers, security, and discovery
- CT logs provide an almost real-time feed of new hostnames, useful for:
- Discovering new websites to crawl/index.
- Detecting rogue certificates issued for your domains.
- Security scanning (e.g., finding fresh WordPress installs).
- Some see OpenAI’s use as standard practice: if your job is to crawl the web, CT is a natural starting point.
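As a sketch of that “natural starting point”: RFC 6962 CT logs expose a plain HTTP API (`get-sth` for the current tree head, `get-entries` for ranges of leaves), so a crawler can poll the tree size and fetch only entries it hasn’t seen. The log URL below is a placeholder, and only the pure batch-splitting helper is exercised here:

```python
import json
import urllib.request

LOG_URL = "https://ct.example.net/ct/v1"  # placeholder; real logs are enumerated by browser log programs

def get_tree_size(log_url: str = LOG_URL) -> int:
    """Fetch the log's signed tree head and return its current tree size."""
    with urllib.request.urlopen(f"{log_url}/get-sth") as resp:
        return json.load(resp)["tree_size"]

def batch_ranges(start: int, end: int, batch: int = 256):
    """Split the half-open index range [start, end) into inclusive
    (first, last) pairs, as get-entries expects. Logs cap how many
    entries one request may return, hence the chunking."""
    out = []
    while start < end:
        last = min(start + batch, end) - 1
        out.append((start, last))
        start = last + 1
    return out
```

A polling loop would remember the last tree size it processed and request `get-entries?start=…&end=…` for each new pair, giving the near-real-time hostname feed described above.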
Privacy, surprise, and mitigation
- Several commenters admit they hadn’t realized that issuing a public TLS cert effectively announces a hostname to the entire world.
- Concern: sites not linked anywhere but using public certs are still “found” immediately via CT.
- Suggested mitigations:
- Use wildcard certs (so subdomains aren’t individually exposed in logs), ideally terminated at a shared load balancer.
- Use private CAs for internal/non-public services.
- Tradeoffs are noted: wildcard certs increase blast radius if compromised.
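The wildcard tradeoff can be made concrete: a cert name like `*.example.com` covers any single left-most label (the usual RFC 6125 matching rule), so individual subdomains never show up in CT, but one compromised key then impersonates all of them. A minimal matcher, as a sketch:

```python
def covered_by(hostname: str, cert_name: str) -> bool:
    """True if `cert_name` (possibly a wildcard like *.example.com)
    covers `hostname`. A wildcard matches exactly one left-most label,
    so it neither covers the bare domain nor deeper subdomains."""
    host = hostname.lower().rstrip(".").split(".")
    name = cert_name.lower().rstrip(".").split(".")
    if name[0] == "*":
        return len(host) == len(name) and host[1:] == name[1:]
    return host == name
```

So `app.example.com` stays out of the logs under a wildcard cert, while a per-subdomain cert would have announced it.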
Scraping ethics and “stolen” content
- One side argues that any publicly served content is, by design, available to be read by anyone, including AI and search companies; calling it “stolen” is inaccurate.
- Others worry about CT being used to shortcut organic discovery and accelerate scraping of brand‑new, possibly unready sites.
- Some report OpenAI appears to respect robots.txt and published IP/UA conventions, unlike many other scrapers.
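Respecting robots.txt is mechanically simple, which is why commenters treat it as a baseline. The stdlib parser illustrates the check; the rules and bot token below are made up for the example, not OpenAI’s actual directives:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; a real crawler would fetch the site's /robots.txt.
rules = [
    "User-agent: ExampleBot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("ExampleBot", "https://example.com/page"))       # → True
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))  # → False
```

A well-behaved crawler runs this check before every fetch; the complaint in the thread is about the many scrapers that don’t.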
Tools, infrastructure, and experimentation
- crt.sh and merklemap are discussed as CT search tools; merklemap’s scaling and ZeroFS backend come up briefly.
- Ideas mentioned: honeypot domains discovered only via CT to study bot behavior; feeds that normalize or deduplicate CT data (e.g., names-only APIs).
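A names-only normalization pass of the kind mentioned is straightforward to sketch: lowercase, strip trailing dots, collapse wildcard labels to their base domain, and deduplicate while preserving order (the exact rules such a feed would use are an assumption here):

```python
def normalize_names(raw_names):
    """Normalize hostnames pulled from CT entries: lowercase, drop any
    trailing dot, collapse wildcard labels, and dedupe in order."""
    seen = set()
    out = []
    for name in raw_names:
        n = name.strip().lower().rstrip(".")
        if n.startswith("*."):
            n = n[2:]  # *.example.com -> example.com
        if n and n not in seen:
            seen.add(n)
            out.append(n)
    return out

print(normalize_names(["Example.COM.", "*.example.com", "api.example.com"]))
# → ['example.com', 'api.example.com']
```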