Show HN: Turn any website into a knowledge base for LLMs

What the tool does

  • Crawls provided URLs/domains, extracts content, embeds it into a vector DB, and exposes an API for RAG and semantic search.
  • Works with most sites, including GitBook-style documentation.
  • Supports grouping multiple websites into a “collection” that can be queried as a single knowledge base.
  • Attempts to auto-discover sitemaps; explicit sitemap submission is on the roadmap.
  • Currently cloud-only; no on-prem/self-hosting option.
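
The sitemap auto-discovery mentioned above is not described in detail in the thread. One common approach, shown here only as a minimal sketch, is to read the Sitemap: lines from a site's robots.txt and fall back to the conventional /sitemap.xml location. The function name discover_sitemaps and the fallback path are assumptions for illustration, not the tool's actual behavior. The robots.txt body is passed in as text so the sketch needs no network access.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def discover_sitemaps(base_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, else a conventional guess.

    Hypothetical helper: the real service's discovery logic is not documented.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps()  # None when no "Sitemap:" lines are present
    if declared:
        return list(declared)
    # Assumed fallback: many sites publish a sitemap at /sitemap.xml.
    return [urljoin(base_url, "/sitemap.xml")]
```

A caller would fetch https://<host>/robots.txt itself and pass the body in; explicit sitemap submission, which the author lists as a roadmap item, would simply bypass this step.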

Tech stack and implementation details

  • Built with serverless Laravel plus Cloudflare and AWS functions; Pinecone for vector storage.
  • Pages are chunked based on heading hierarchy (h1, h2, etc.), not <section> tags; each chunk keeps its heading context.
  • According to the author, the crawler respects robots.txt; clearer documentation of its user agents and behavior is planned.
  • No image embeddings yet; considered a future feature.
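
To make the heading-based chunking concrete, here is a minimal sketch using only Python's standard html.parser. It starts a new chunk at every h1 to h6 tag and attaches the trail of enclosing headings as context. The class name HeadingChunker, the function chunk_by_headings, and the output shape (a dict with "context" and "text") are assumptions for illustration; the thread does not show the tool's real implementation.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChunker(HTMLParser):
    """Split HTML into chunks at headings, keeping each chunk's heading trail."""

    def __init__(self) -> None:
        super().__init__()
        self.trail = []          # stack of (level, heading text) ancestors
        self.chunks = []         # finished chunks
        self._level = 0          # level of the heading being read, 0 if none
        self._heading_parts = []
        self._body_parts = []

    def flush(self) -> None:
        """Emit the text collected so far as one chunk under the current trail."""
        text = " ".join(" ".join(self._body_parts).split())
        if text:
            context = [title for _, title in self.trail]
            self.chunks.append({"context": context, "text": text})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.flush()  # close the chunk that belonged to the previous heading
            self._level = int(tag[1])
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level:
            title = " ".join("".join(self._heading_parts).split())
            # Drop headings at the same or a deeper level, then push this one.
            while self.trail and self.trail[-1][0] >= self._level:
                self.trail.pop()
            self.trail.append((self._level, title))
            self._level = 0

    def handle_data(self, data):
        target = self._heading_parts if self._level else self._body_parts
        target.append(data)


def chunk_by_headings(html: str) -> list:
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final chunk
    return parser.chunks
```

For example, a page with an h1 "Guide" followed by an h2 "Install" yields a chunk whose context is ["Guide", "Install"], so the embedded text still carries where it sat in the document. The sketch does not strip script or style content and ignores <section> tags entirely, which matches the behavior described above but would need hardening for real pages.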

Use cases, questions, and feature requests

  • Commenters are interested in applying it to PDFs, forums, developer docs, and internal sites; some ask about sitemap support and WARC ingestion.
  • Logins and gated content: not supported unless you own the site; how authenticated crawling would work is left unclear.
  • Users want:
    • Longer, richer answers and more conversational behavior.
    • Prompt customization.
    • Data export (as vectors or as text chunks), possibly also in non-technical formats such as PDF.

RAG quality, models, and limitations

  • The underlying LLM and the specific RAG configuration are not clearly described; several commenters ask about models and hallucination rates.
  • One user notes the demo chat feels limited and more like a proof-of-concept than the main product (the API).

Pricing, business viability, and reliability

  • Interest in reasonably priced non-enterprise tiers; some explicitly say they would pay for a solid service.
  • Enterprise pricing is “contact us,” which some find ominous or vague.
  • Others argue the niche is easy to build yourself and may be short-lived; counter-voices defend it as useful and non-trivial to productize.
  • Reports of 404s, internal server errors, and broken confirmation emails during launch.

Ethics, scraping, and broader trends

  • Debate over ethics:
    • Concerns about GDPR, consent, and inability to easily block or audit this specific crawler.
    • Criticism of disguising the crawler as a regular browser.
    • Comparisons to wider AI scraping issues and to Clearview-like data use.
  • Some see services like this as accelerating bot-blocking and gating of websites; others suggest microtransactions or built-in site AI as future directions.
  • Discussion branches into IP, attribution, and whether small content owners gain or lose from AI-driven reuse of their data.
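
The blocking concern can be illustrated with a small sketch using Python's standard urllib.robotparser. A robots.txt rule can only target a crawler that identifies itself with a stable user-agent token; a crawler presenting a browser-like user agent falls through to the wildcard rule. The token "ExampleKBBot" and the robots.txt contents below are invented for illustration and are not the real crawler's user agent or any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. "ExampleKBBot" is an
# invented token, not the user agent of the tool discussed in the thread.
ROBOTS_TXT = """\
User-agent: ExampleKBBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these assumed rules, is_allowed("ExampleKBBot/1.0", ...) is False while the same request sent with a generic browser user agent is allowed, which is why commenters tie the ability to block a crawler to it announcing a documented user agent.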

Alternatives and DIY approaches

  • Multiple users mention building similar systems themselves with Playwright, OCR, various embeddings, and open-source RAG stacks.
  • Several open-source tools and libraries are referenced (e.g., Vectara-based apps, SQLite-based RAG, Ollama + LangChain, web UIs), indicating a rich DIY ecosystem around “chat with any site” functionality.