Show HN: Building a web search engine from scratch with 3B neural embeddings

Overall reception

  • Strong enthusiasm for the project and the write-up; many call it one of the best technical articles they’ve read in a while.
  • People are impressed that a solo engineer built a working web-scale search engine with relatively low cost and detailed documentation.
  • Several commenters say they’d pay for it and see it as a credible seed for a community-run or even commercial alternative to Google/Kagi.

State of search and the web

  • Many lament Google’s decline: weaker exact-match search, heavy ad/SEO noise, and suspicion that profit is prioritized over quality.
  • Explanations given:
    • Arms race with SEO and ad-driven “garbage” content.
    • Fundamental change in the web: good content moved to walled gardens (social media, Discord, etc.), much of the old web has disappeared.
  • Some wish for an “old Google” style engine (n‑grams + PageRank) and even a mode that surfaces dead URLs as “missing” for research.

Technical approach and limitations

  • Praise for the clear cost breakdown, stack diagram, and use of neural embeddings + vector DB at scale.
  • Several note that vector-only search misses important keyword-sensitive cases (e.g., recipe searches, the query “Apple” not returning apple.com first, and definition-style queries where SBERT-like embeddings underperform).
  • Multiple commenters advocate hybrid search (BM25 + embeddings) with re-ranking for best quality.
  • There’s interest in scaling choices (HNSW vs IVF, RocksDB, CoreNN) and mention of alternatives like sparse embeddings (SPLADE).
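The hybrid approach commenters advocate is typically implemented by running a keyword retriever (BM25) and a vector retriever in parallel, then merging their result lists before re-ranking. A minimal sketch of one common merge strategy, Reciprocal Rank Fusion (the document doesn’t say which fusion method the author or commenters had in mind; the doc IDs below are made up for illustration):

```python
def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    """Merge two best-first ranked lists with Reciprocal Rank Fusion.

    Each input is a list of doc IDs ordered best-first. k dampens the
    influence of the very top ranks (60 is a commonly cited default).
    """
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank + 1) to the doc's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Docs found only by BM25 (exact keywords) and only by embeddings
# (semantic matches) both survive; docs found by both rise to the top.
bm25_results = ["apple.com", "recipes.example", "blog.example"]
vector_results = ["blog.example", "apple.com", "fruit.example"]
print(rrf_fuse(bm25_results, vector_results))
```

Because fusion only needs rank positions, not comparable scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales; a heavier cross-encoder re-ranker can then be applied to the fused top-k.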

Ranking, SEO, and spam

  • Some think click-based ranking is weak because clickbait inflates it; they propose penalizing ad-heavy pages as a better anti-spam signal.
  • Others argue embeddings/LLM ranking can also be gamed: adversarially generating text to target specific embedding vectors.
  • Counterpoint: using sentence embeddings (not instruction-following models) mitigates prompt-style attacks, and generating matching embeddings is more work than classic keyword stuffing.
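The proposed “penalize ad-heavy pages” signal could be as simple as scaling a page’s relevance score by its ad density. A hypothetical sketch (the function, parameter names, and the linear penalty curve are all illustrative assumptions, not anything from the thread):

```python
def ad_penalty_score(base_score, ad_elements, total_elements, max_penalty=0.5):
    """Downweight a page's relevance score by its ad density.

    ad_elements / total_elements is a crude proxy for how ad-heavy a
    page is; a page that is entirely ads loses max_penalty of its score.
    All names and the linear falloff here are illustrative only.
    """
    if total_elements == 0:
        return base_score  # nothing measured, leave the score untouched
    ad_ratio = min(ad_elements / total_elements, 1.0)
    return base_score * (1.0 - max_penalty * ad_ratio)

# A page that is half ads keeps 75% of its score at max_penalty=0.5.
print(ad_penalty_score(0.8, ad_elements=50, total_elements=100))  # 0.6
```

Unlike click signals, a density heuristic like this is computed from the page itself at crawl time, which is why commenters see it as harder to game with clickbait, though SEO pressure would presumably shift to disguising ads as content.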

Data sources, crawling, and openness

  • Strong encouragement to integrate Common Crawl and EU OpenWebSearch data; some dream of a high-quality non-profit search engine.
  • Discussion with Common Crawl about legal constraints: they stress they don’t own crawled content, can’t grant broad reuse rights, and must respect robots.txt.
  • Some ask for open sourcing the engine and/or building a federated or decentralized search network; others worry about sustainability.

LLMs, OpenAI, and privacy

  • Surprise at how cheap OpenAI’s batch embedding pricing is; speculation about whether it’s a “honeypot” or a “drug dealer” tactic (cheap now, costly once you depend on it).
  • Debate over whether OpenAI truly avoids training on API data; terms say no training unless users opt in, but some remain skeptical due to broader AI copyright concerns.

User experience and reliability

  • Early users report mostly good results, though aggregator-style “meta” pages sometimes rank above pages with deep expertise, much as on the major engines.
  • The demo experienced CORS/502 errors attributed to an HN “hug of death” (traffic overload).