Web scraping with GPT-4o: powerful but expensive
Why LLM-based scraping is trending
- Many sites expose their data only as rendered HTML (often server-rendered) rather than through stable APIs, so scraping is the only practical way to access it.
- Traditional scrapers are brittle: small DOM changes break XPaths/CSS selectors; building and maintaining them is tedious.
- LLMs can treat scraping as “summarize this page into structured data,” which makes them more robust to layout and class-name changes and viable for many one-off or low-precision tasks (a minimal sketch follows this list).
- Personal use cases abound: consolidating school or subscription communications, archiving articles, tracking receipts and purchases, hobby analytics during the pandemic, etc.
- Some see this as part of a broader shift from “chat about static data” to “automate the web.”
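As a concrete illustration of that framing, here is a minimal sketch using the OpenAI Python client. The prompt wording, schema fields, and URL are illustrative assumptions, not something prescribed in the discussion.

```python
# Minimal sketch: "summarize this page into structured data" with an LLM.
# The schema, prompt, and URL below are illustrative assumptions.
import json

import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
html = requests.get("https://example.com/events", timeout=30).text  # hypothetical page

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # cheap variant; swap in gpt-4o for harder pages
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system",
         "content": "Extract every event from the HTML and return JSON like "
                    '{"events": [{"title": "...", "date": "...", "location": "..."}]}.'},
        {"role": "user", "content": html[:100_000]},  # crude length cap
    ],
)
events = json.loads(resp.choices[0].message.content)["events"]
print(events)
```

Because the model reads the rendered content rather than matching selectors, a renamed CSS class or reshuffled DOM usually does not break this — which is the robustness commenters are pointing at.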
Techniques & preprocessing
- Approaches range from classic Playwright/Puppeteer with CSS selectors or regexes, up to screenshots fed through OCR or vision models for heavily obfuscated or canvas-based UIs.
- Several commenters advocate HTML “reduction”: strip scripts, styles, and attributes; keep visible text and minimal structure; or convert the DOM to Markdown or simplified HTML (see the first sketch after this list).
- Tools like readability-style extractors, semantic-markdown converters, and text-focused libraries can dramatically cut token counts while preserving semantics.
- Some prefer to use LLMs only to generate scraper code (XPath expressions, CSS selectors, BeautifulSoup snippets, etc.), then run that code repeatedly until it breaks (second sketch below).
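A hedged sketch of the reduction step using BeautifulSoup; which tags and attributes to strip is a judgment call, and the choices below are assumptions rather than a fixed recipe.

```python
# Sketch of HTML "reduction": drop non-content tags and attributes so far
# fewer tokens reach the model. Tag/attribute choices are assumptions.
from bs4 import BeautifulSoup

def reduce_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg", "iframe", "head"]):
        tag.decompose()  # elements that carry no visible content
    for tag in soup.find_all(True):
        tag.attrs = {}   # drop classes, ids, inline styles, data-* attributes
    return str(soup)     # simplified HTML; use soup.get_text() for text only

print(reduce_html('<div class="x" onclick="f()">Hi<script>track()</script></div>'))
# -> <div>Hi</div>
```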
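And a sketch of the generate-the-scraper-once approach: ask the model for a CSS selector, cache it, and only pay for another LLM call when it stops matching. The prompt and the “product price” target are assumptions.

```python
# Sketch: one LLM call to produce a CSS selector, then cheap reuse.
# URL, prompt, and target description are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()
raw_html = requests.get("https://example.com/shop", timeout=30).text  # hypothetical

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Reply with only a CSS selector (no prose) that matches "
                          "each product price in this HTML:\n" + raw_html[:20_000]}],
)
selector = resp.choices[0].message.content.strip()

# Cache `selector` and reuse it on every run; regenerate only when it
# stops matching (i.e., when the site's layout finally changes).
prices = [el.get_text(strip=True)
          for el in BeautifulSoup(raw_html, "html.parser").select(selector)]
print(selector, prices)
```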
Model choice, cost, and infrastructure
- GPT-4o is widely viewed as very capable but expensive at scale; mini variants and other “cheap” frontier models hold up well when paired with good preprocessing.
- Several argue small open-source models (e.g., Llama variants) are already strong enough for extraction, especially when run on serverless GPUs or local inference engines.
- OpenAI’s Batch API can halve costs for non-real-time workloads, at the price of added latency and occasionally dropped requests (sketched after this list).
- Some note that proxy/bandwidth costs for large-scale scraping may exceed LLM fees.
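A sketch of the Batch API route. The JSONL request shape follows OpenAI’s documented batch format; the file name, custom IDs, and placeholder pages are assumptions.

```python
# Sketch: submit extraction jobs through OpenAI's Batch API
# (roughly half price, results within a 24h completion window).
import json

from openai import OpenAI

client = OpenAI()
reduced_pages = ["<ul><li>May 3: Spring fair, Oslo</li></ul>"]  # placeholder input

# One JSONL line per page, in the Batch API's request format.
with open("batch_input.jsonl", "w") as f:
    for i, page in enumerate(reduced_pages):
        f.write(json.dumps({
            "custom_id": f"page-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user",
                                   "content": "Extract events as JSON:\n" + page}]},
        }) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)  # poll with client.batches.retrieve(batch.id)
```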
Reliability, scale, and limits
- Hallucinations remain an issue (e.g., mislabeling cities as countries, merging repeated table rows); suggested mitigations include two-stage pipelines and LLM-as-judge validation (first sketch after this list).
- Anti-bot systems (Cloudflare, similar) and paywalls are major practical obstacles; allowlisted partnerships are one workaround.
- For many structured pages (lists, simple tables), heuristic or DOM-based extractors are cheaper, faster, and more reliable than LLMs (second sketch below).
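One hedged sketch of a second-stage check, assuming the extractor returns flat records: verify that each extracted value occurs verbatim in the source text (a cheap way to catch invented values), then optionally escalate to an LLM-as-judge call. The field handling and judge prompt are assumptions.

```python
# Sketch: two-stage validation of extracted records.
# The containment heuristic and judge prompt are assumptions.
from openai import OpenAI

client = OpenAI()

def grounded(record: dict, source_text: str) -> bool:
    # Cheap check: every extracted string must appear verbatim in the source.
    # Misses paraphrases and normalized values, but catches outright inventions.
    return all(str(v) in source_text for v in record.values() if v)

def judged_ok(record: dict, source_text: str) -> bool:
    # Optional LLM-as-judge pass for records that survive the cheap check.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Does this record accurately reflect the text? "
                               f"Answer YES or NO.\nRecord: {record}\n"
                               f"Text: {source_text[:10_000]}")}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```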
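And for the simple-table case, a no-LLM baseline for comparison: pandas.read_html parses most plain HTML tables directly (the URL is a placeholder).

```python
# Sketch: DOM-based extraction of a plain HTML table, no LLM involved.
# Cheaper, faster, and deterministic when the page structure cooperates.
import pandas as pd

tables = pd.read_html("https://example.com/standings")  # placeholder URL
print(tables[0].head())  # first table on the page as a DataFrame
```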
Ethics, use cases, and skepticism
- There is disagreement over scraping sites like Instagram: some dismiss terms-of-service objections; others emphasize respecting users and obtaining consent.
- Some participants see LLM-based scraping as genuinely new leverage; others view it as overkill for an already-solved problem and question its energy and maintenance costs.