Web scraping with GPT-4o: powerful but expensive
Why LLM-based scraping is trending
- Many sites expose their data only as rendered HTML (often server-rendered) rather than through stable APIs, so scraping is the only practical way to access it.
- Traditional scrapers are brittle: small DOM changes break XPaths/CSS selectors; building and maintaining them is tedious.
- LLMs can treat scraping as “summarize this page into structured data,” which makes them more robust to layout and class-name changes and viable for many one-off or low-precision tasks (a minimal sketch follows this list).
- Personal use cases abound: consolidating school or subscription communications, archiving articles, tracking receipts and purchases, hobby analytics during the pandemic, etc.
- Some see this as part of a broader shift from “chat about static data” to “automate the web.”
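As a concrete illustration of that framing, here is a minimal sketch using the OpenAI Python client. The prompt wording, schema fields, and URL are illustrative assumptions, not something prescribed in the discussion.

```python
# Minimal sketch: "summarize this page into structured data" with an LLM.
# The schema, prompt, and URL below are illustrative assumptions.
import json

import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
html = requests.get("https://example.com/events", timeout=30).text  # hypothetical page

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # cheap variant; swap in gpt-4o for harder pages
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system",
         "content": "Extract every event from the HTML and return JSON like "
                    '{"events": [{"title": "...", "date": "...", "location": "..."}]}.'},
        {"role": "user", "content": html[:100_000]},  # crude length cap
    ],
)
events = json.loads(resp.choices[0].message.content)["events"]
print(events)
```

Because the model reads the rendered content rather than matching selectors, a renamed CSS class or reshuffled DOM usually does not break this — which is the robustness commenters are pointing at.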
Techniques & preprocessing
- Approaches range from classic Playwright/Puppeteer with CSS selectors or regexes, up to screenshots fed through OCR or vision models for heavily obfuscated or canvas-based UIs.
- Several commenters advocate HTML “reduction”: strip scripts, styles, and attributes; keep visible text and minimal structure; or convert the DOM to Markdown or simplified HTML (see the first sketch after this list).
- Tools like readability-style extractors, semantic-markdown converters, and text-focused libraries can dramatically cut token counts while preserving semantics.
- Some prefer to use LLMs only to generate scraper code (XPath expressions, CSS selectors, BeautifulSoup snippets, etc.), then run that code repeatedly until it breaks (second sketch below).
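A hedged sketch of the reduction step using BeautifulSoup; which tags and attributes to strip is a judgment call, and the choices below are assumptions rather than a fixed recipe.

```python
# Sketch of HTML "reduction": drop non-content tags and attributes so far
# fewer tokens reach the model. Tag/attribute choices are assumptions.
from bs4 import BeautifulSoup

def reduce_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe", "head"]):
        tag.decompose()  # elements that carry no visible content
    for tag in soup.find_all(True):
        tag.attrs = {}   # drop classes, ids, inline styles, data-* attributes
    return str(soup)     # simplified HTML; use soup.get_text() for text only

print(reduce_html('<div class="x" onclick="f()">Hi<script>track()</script></div>'))
# -> <div>Hi</div>
```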
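And a sketch of the generate-the-scraper-once approach: ask the model for a CSS selector, cache it, and only pay for another LLM call when it stops matching. The prompt and the “product price” target are assumptions.

```python
# Sketch: one LLM call to produce a CSS selector, then cheap reuse.
# URL, prompt, and target description are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()
raw_html = requests.get("https://example.com/shop", timeout=30).text  # hypothetical

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Reply with only a CSS selector (no prose) that matches "
                          "each product price in this HTML:\n" + raw_html[:20_000]}],
)
selector = resp.choices[0].message.content.strip()

# Cache `selector` and reuse it on every run; regenerate only when it
# stops matching (i.e., when the site's layout finally changes).
prices = [el.get_text(strip=True)
          for el in BeautifulSoup(raw_html, "html.parser").select(selector)]
print(selector, prices)
```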
Model choice, cost, and infrastructure
- GPT-4o is widely viewed as very capable but expensive at scale; mini variants and other “cheap” frontier models hold up well when paired with good preprocessing.
- Several argue small open-source models (e.g., Llama variants) are already strong enough for extraction, especially when run on serverless GPUs or local inference engines.
- OpenAI’s Batch API can halve costs for non-real-time workloads, at the price of added latency and occasionally dropped requests (sketched after this list).
- Some note that proxy/bandwidth costs for large-scale scraping may exceed LLM fees.
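A sketch of the Batch API route. The JSONL request shape follows OpenAI’s documented batch format; the file name, custom IDs, and placeholder pages are assumptions.

```python
# Sketch: submit extraction jobs through OpenAI's Batch API
# (roughly half price, results within a 24h completion window).
import json

from openai import OpenAI

client = OpenAI()
reduced_pages = ["<ul><li>May 3: Spring fair, Oslo</li></ul>"]  # placeholder input

# One JSONL line per page, in the Batch API's request format.
with open("batch_input.jsonl", "w") as f:
    for i, page in enumerate(reduced_pages):
        f.write(json.dumps({
            "custom_id": f"page-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user",
                                   "content": "Extract events as JSON:\n" + page}]},
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)  # poll with client.batches.retrieve(batch.id)
```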
Reliability, scale, and limits
- Hallucinations remain an issue (e.g., mislabeling cities as countries, merging repeated table rows); suggested mitigations include two-stage pipelines and LLM-as-judge validation (first sketch after this list).
- Anti-bot systems (Cloudflare, similar) and paywalls are major practical obstacles; allowlisted partnerships are one workaround.
- For many structured pages (lists, simple tables), heuristic or DOM-based extractors are cheaper, faster, and more reliable than LLMs (second sketch below).
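One hedged sketch of a second-stage check, assuming the extractor returns flat records: verify that each extracted value occurs verbatim in the source text (a cheap way to catch invented values), then optionally escalate to an LLM-as-judge call. The field handling and judge prompt are assumptions.

```python
# Sketch: two-stage validation of extracted records.
# The containment heuristic and judge prompt are assumptions.
from openai import OpenAI

client = OpenAI()

def grounded(record: dict, source_text: str) -> bool:
    # Cheap check: every extracted string must appear verbatim in the source.
    # Misses paraphrases and normalized values, but catches outright inventions.
    return all(str(v) in source_text for v in record.values() if v)

def judged_ok(record: dict, source_text: str) -> bool:
    # Optional LLM-as-judge pass for records that survive the cheap check.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Does this record accurately reflect the text? "
                               f"Answer YES or NO.\nRecord: {record}\n"
                               f"Text: {source_text[:10_000]}")}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```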
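And for the simple-table case, a no-LLM baseline for comparison: pandas.read_html parses most plain HTML tables directly (the URL is a placeholder).

```python
# Sketch: DOM-based extraction of a plain HTML table, no LLM involved.
# Cheaper, faster, and deterministic when the page structure cooperates.
import pandas as pd

tables = pd.read_html("https://example.com/standings")  # placeholder URL
print(tables[0].head())  # first table on the page as a DataFrame
```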
Ethics, use cases, and skepticism
- There is disagreement over scraping sites like Instagram: some dismiss terms-of-service objections; others emphasize respecting users and obtaining consent.
- Some participants see LLM-based scraping as genuinely new leverage; others view it as overkill for an already-solved problem and question its energy and maintenance costs.