2024-07-30

Show HN: Turn any website into a knowledge base for LLMs

What the tool does

Crawls provided URLs/domains, extracts content, embeds it into a vector DB, and exposes an API for RAG and semantic search.
Works with most sites, including GitBook-style documentation.
Supports grouping multiple websites into a “collection” that can be queried as one KB.
Attempts to auto-discover sitemaps; explicit sitemap submission is on the roadmap.
Currently cloud-only; no on-prem/self-hosting option.
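
The thread does not say how sitemap auto-discovery is implemented. A common approach, sketched here as an assumption rather than the author's method, is to read the `Sitemap:` lines in a site's robots.txt and then collect the `<loc>` entries from each sitemap file. The function names and the `ExampleBot` user agent in the test data are hypothetical; fetching over the network is left out so the sketch stays self-contained.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document.

    For a sitemap index file these are the child sitemap URLs,
    which a crawler would fetch and parse in turn.
    """
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str) -> list[str]:
    """Keep only the URLs that robots.txt permits for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A crawler built this way would fetch `/robots.txt` first, pass its text to `sitemaps_from_robots`, download each sitemap, and filter the resulting URLs through `allowed_urls` before crawling.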

Tech stack and implementation details

Built with serverless Laravel plus Cloudflare and AWS functions; Pinecone for vector storage.
Pages are chunked based on heading hierarchy (h1, h2, etc.), not <section> tags; each chunk keeps its heading context.
According to the author, the crawler respects robots.txt; clearer documentation of its user agents and behavior is planned.
No image embeddings yet; considered a future feature.
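
Heading-based chunking as described above can be sketched with Python's standard `html.parser`. This is a minimal illustration under assumptions, not the tool's actual code: the class name, the output shape (a list of dicts with `headings` and `text`), and the whitespace handling are all invented for the example, and it does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingChunker(HTMLParser):
    """Split a page into chunks at each heading, keeping the heading trail."""

    def __init__(self):
        super().__init__()
        self.trail = []            # (level, title) pairs for the current position
        self.chunks = []           # finished chunks
        self._open_level = None    # level of the heading being read, if any
        self._heading_parts = []
        self._body_parts = []

    def flush(self):
        """Emit the body text gathered so far as one chunk."""
        text = " ".join(" ".join(self._body_parts).split())
        if text:
            headings = [title for _, title in self.trail]
            self.chunks.append({"headings": headings, "text": text})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self.flush()  # a new heading closes the previous chunk
            self._open_level = HEADING_LEVELS[tag]
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._open_level is not None:
            title = " ".join("".join(self._heading_parts).split())
            # Drop headings at the same or a deeper level before adding this one.
            while self.trail and self.trail[-1][0] >= self._open_level:
                self.trail.pop()
            self.trail.append((self._open_level, title))
            self._open_level = None

    def handle_data(self, data):
        if self._open_level is not None:
            self._heading_parts.append(data)
        else:
            self._body_parts.append(data)


def chunk_by_headings(html: str) -> list[dict]:
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the text after the last heading
    return parser.chunks
```

Storing the heading trail with each chunk means a passage under "Usage > Auth" can be embedded and displayed with that context, even though the body text alone never mentions it.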

Use cases, questions, and feature requests

Interest in applying it to PDFs, forums, dev docs, and internal sites; some ask about sitemaps and WARC ingest.
Logins / gated content: not supported unless you own the site; details for authenticated crawling are unclear.
Users want:
- Longer, richer answers and more conversational behavior.
- Prompt customization.
- Data export (as vectors or as text chunks), possibly including formats for non-technical users such as PDF.

RAG quality, models, and limitations

The underlying LLM and the specific RAG configuration are not clearly described; several commenters ask about models and hallucination rates.
One user notes that the demo chat feels limited, more a proof of concept than the main product, which is the API.

Pricing, business viability, and reliability

Interest in reasonably priced non-enterprise tiers; some explicitly say they would pay for a solid service.
Enterprise pricing is “contact us,” which some find ominous or vague.
Others argue that the product is easy to build yourself and that the niche may be short-lived; counter-voices defend it as useful and non-trivial to productize.
Reports of 404s, internal server errors, and broken confirmation emails during launch.

Ethics, scraping, and broader trends

Debate over ethics:
- Concerns about GDPR, consent, and inability to easily block or audit this specific crawler.
- Criticism of disguising the crawler as a regular browser.
- Comparisons to wider AI scraping issues and to Clearview-like data use.
Some see services like this as accelerating bot-blocking and gating of websites; others suggest microtransactions or built-in site AI as future directions.
Discussion branches into IP, attribution, and whether small content owners gain or lose from AI-driven reuse of their data.

Alternatives and DIY approaches

Multiple users mention building similar systems themselves with Playwright, OCR, various embeddings, and open-source RAG stacks.
Several open-source tools and libraries are referenced (e.g., Vectara-based apps, SQLite-based RAG, Ollama + LangChain, web UIs), indicating a rich DIY ecosystem around “chat with any site” functionality.
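
To make the DIY argument concrete, the retrieval half of such a system can be sketched in a few lines. This is a toy under loud assumptions: the `embed` function is a word-count stand-in for a real embedding model, the in-memory list replaces a vector database, and the `Collection` class and its methods are hypothetical names. It only illustrates how chunks from several sites can be grouped and searched as one knowledge base.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Collection:
    """Chunks from several sites, searchable as a single knowledge base."""

    def __init__(self):
        self.items = []  # (site, chunk text, vector)

    def add_site(self, site: str, chunks: list[str]) -> None:
        for text in chunks:
            self.items.append((site, text, embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        """Return up to k (site, chunk) pairs ranked by similarity to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), site, text) for site, text, vec in self.items]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(site, text) for score, site, text in scored[:k] if score > 0]
```

In a RAG setup the chunks returned by `search` would be placed into the LLM prompt as context. The parts that commenters call non-trivial to productize (crawling at scale, quality embeddings, hosting, and reliability) are exactly what this sketch leaves out.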
