Show HN: Turn any website into a knowledge base for LLMs
What the tool does
- Crawls provided URLs/domains, extracts content, embeds it into a vector DB, and exposes an API for RAG and semantic search.
- Works with most sites, including GitBook-style documentation.
- Supports grouping multiple websites into a “collection” that can be queried as one KB.
- Attempts to auto-discover sitemaps; explicit sitemap submission is on the roadmap.
- Currently cloud-only; no on-prem/self-hosting option.
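The sitemap auto-discovery mentioned above is not described in the thread. One common approach is to read the Sitemap entries a site declares in robots.txt and otherwise guess the conventional /sitemap.xml location. A minimal sketch of that approach, where the function name `candidate_sitemaps` and the fallback rule are illustrative assumptions and not the tool's actual behavior:

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def candidate_sitemaps(base_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs to try for a site (hypothetical helper).

    robots_txt is the already-downloaded body of the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when robots.txt declares no Sitemap lines.
    declared = parser.site_maps() or []
    if declared:
        return declared
    # Assumed fallback: the conventional location at the site root.
    return [urljoin(base_url, "/sitemap.xml")]


robots = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Sitemap: https://example.com/sitemap_index.xml\n"
)
print(candidate_sitemaps("https://example.com/docs/", robots))
```

Fetching robots.txt and the sitemap itself is left out so the sketch runs without network access.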
Tech stack and implementation details
- Built with serverless Laravel plus Cloudflare and AWS functions; Pinecone for vector storage.
- Pages are chunked based on heading hierarchy (h1, h2, etc.), not <section> tags; each chunk keeps its heading context.
- Respects robots.txt according to the author, with plans to document user agents and behavior more clearly.
- No image embeddings yet; considered a future feature.
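To make the heading-based chunking concrete, here is a rough sketch using only Python's standard-library HTML parser. It splits a page at each h1 to h6 tag and stores the trail of enclosing headings with every chunk. The class name `HeadingChunker`, the chunk dictionary shape, and the list of block tags are assumptions for illustration; the author's real implementation is not shown in the thread.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
# Tags treated as word boundaries so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "br", "tr", "td", "pre"}


class HeadingChunker(HTMLParser):
    """Split HTML into chunks at headings, keeping the heading trail."""

    def __init__(self):
        super().__init__()
        self.path = []            # [(level, title), ...] of enclosing headings
        self.chunks = []          # [{"headings": [...], "text": "..."}, ...]
        self._body = []           # text pieces of the current chunk
        self._heading_level = None
        self._heading_text = []

    def flush(self):
        # Collapse whitespace and emit the current chunk if it has any text.
        text = " ".join("".join(self._body).split())
        if text:
            self.chunks.append(
                {"headings": [title for _, title in self.path], "text": text}
            )
        self._body = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self.flush()  # a new heading closes the previous chunk
            self._heading_level = HEADING_LEVELS[tag]
            self._heading_text = []
        elif tag in BLOCK_TAGS:
            self._body.append(" ")

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._heading_level == HEADING_LEVELS[tag]:
            level = self._heading_level
            title = " ".join("".join(self._heading_text).split())
            # Drop headings at the same or a deeper level, then add this one.
            self.path = [(l, t) for l, t in self.path if l < level]
            self.path.append((level, title))
            self._heading_level = None

    def handle_data(self, data):
        if self._heading_level is not None:
            self._heading_text.append(data)
        else:
            self._body.append(data)


def chunk_by_headings(html: str) -> list[dict]:
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the trailing chunk
    return parser.chunks


page = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Install</h2><p>Run the installer.</p>"
    "<h2>Usage</h2><p>Call the <b>API</b>.</p>"
)
for chunk in chunk_by_headings(page):
    print(chunk["headings"], "->", chunk["text"])
```

The sketch does not skip script or style content and does not cap chunk size, both of which a production crawler would presumably need.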
Use cases, questions, and feature requests
- Interest in applying it to PDFs, forums, dev docs, and internal sites; some ask about sitemaps and WARC ingest.
- Logins / gated content: not supported unless you own the site; details for authenticated crawling are unclear.
- Users want:
  - Longer, richer answers and more conversational behavior.
  - Prompt customization.
  - Export of data (vectors vs. text chunks) and possibly no-code formats like PDF.
RAG quality, models, and limitations
- The underlying LLM and the specific RAG configuration are not clearly described; several commenters ask about models and hallucination rates.
- One user notes the demo chat feels limited and more like a proof-of-concept than the main product (the API).
Pricing, business viability, and reliability
- Interest in reasonably priced non-enterprise tiers; some explicitly say they would pay for a solid service.
- Enterprise pricing is “contact us,” which some find ominous or vague.
- Others argue the niche is easy to build yourself and may be short-lived; counter-voices defend it as useful and non-trivial to productize.
- Reports of 404s, internal server errors, and broken confirmation emails during launch.
Ethics, scraping, and broader trends
- Debate over ethics:
  - Concerns about GDPR, consent, and inability to easily block or audit this specific crawler.
  - Criticism of disguising the crawler as a regular browser.
  - Comparisons to wider AI scraping issues and to Clearview-like data use.
- Some see services like this as accelerating bot-blocking and gating of websites; others suggest microtransactions or built-in site AI as future directions.
- Discussion branches into IP, attribution, and whether small content owners gain or lose from AI-driven reuse of their data.
Alternatives and DIY approaches
- Multiple users mention building similar systems themselves with Playwright, OCR, various embeddings, and open-source RAG stacks.
- Several open-source tools and libraries are referenced (e.g., Vectara-based apps, SQLite-based RAG, Ollama + LangChain, web UIs), indicating a rich DIY ecosystem around “chat with any site” functionality.
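For a sense of what the retrieval core of such a DIY setup looks like, here is a toy sketch of semantic search over text chunks. It deliberately swaps in word-count vectors for a real embedding model and an in-memory list for a vector database, so it only illustrates the ranking step; the function names `embed`, `cosine`, and `top_k` are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the best k.
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "install the package with pip",
    "the api returns json results",
    "pricing starts with a free tier",
]
print(top_k("how to install the package", chunks, k=1))
```

In a real stack the retrieved chunks would then be placed into the LLM prompt as context; that generation step is omitted here.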