Show HN: Building a web search engine from scratch with 3B neural embeddings
Overall reception
- Strong enthusiasm for the project and the write-up; many call it one of the best technical articles they’ve read in a while.
- People are impressed that a solo engineer built a working web-scale search engine with relatively low cost and detailed documentation.
- Several commenters say they’d pay for it and see it as a credible seed for a community-run or even commercial alternative to Google/Kagi.
State of search and the web
- Many lament Google’s decline: weaker exact-match search, heavy ad/SEO noise, and suspicion that profit is prioritized over quality.
- Explanations given:
  - An arms race with SEO and ad-driven “garbage” content.
  - A fundamental change in the web: good content moved to walled gardens (social media, Discord, etc.), and much of the old web has disappeared.
- Some wish for an “old Google” style engine (n‑grams + PageRank) and even a mode that surfaces dead URLs as “missing” for research.
Technical approach and limitations
- Praise for the clear cost breakdown, stack diagram, and use of neural embeddings + vector DB at scale.
- Several note vector-only search misses important keyword-sensitive cases (e.g., recipes, “Apple” not returning apple.com first, SBERT definition queries).
- Multiple commenters advocate hybrid search (BM25 + embeddings) with re-ranking for best quality.
- There’s interest in scaling choices (HNSW vs IVF, RocksDB, CoreNN) and mention of alternatives like sparse embeddings (SPLADE).
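The hybrid approach commenters advocate can be sketched minimally: score documents with classic BM25 for keyword sensitivity, score them with embedding cosine similarity for semantic match, then re-rank by a blend. The toy corpus, the 3-dimensional stand-in vectors, and the blend weight `alpha` below are illustrative assumptions, not anything from the project itself.

```python
import math

# Toy corpus standing in for crawled pages (assumption for illustration).
docs = {
    "d1": "apple pie recipe with fresh apples",
    "d2": "apple releases new iphone",
    "d3": "how to bake a chocolate cake",
}

# Tiny 3-dim vectors standing in for real neural embeddings (assumption).
doc_vecs = {
    "d1": [0.9, 0.1, 0.2],
    "d2": [0.2, 0.9, 0.1],
    "d3": [0.7, 0.0, 0.6],
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over whitespace-tokenized documents."""
    tokenized = {d: text.split() for d, text in docs.items()}
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized.values()) / N
    scores = {d: 0.0 for d in docs}
    for term in query.split():
        df = sum(1 for toks in tokenized.values() if term in toks)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for d, toks in tokenized.items():
            tf = toks.count(term)
            denom = tf + k1 * (1 - b + b * len(toks) / avgdl)
            scores[d] += idf * tf * (k1 + 1) / denom
    return scores

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_rank(query, query_vec, alpha=0.5):
    """Re-rank by a weighted sum of normalized BM25 and cosine similarity."""
    bm = bm25_scores(query, docs)
    max_bm = max(bm.values()) or 1.0
    combined = {
        d: alpha * (bm[d] / max_bm) + (1 - alpha) * cosine(query_vec, doc_vecs[d])
        for d in docs
    }
    return sorted(combined, key=combined.get, reverse=True)

# Keyword-heavy query: the BM25 term pulls the exact-match recipe page up,
# which a vector-only ranker could miss (the "recipes" failure mode above).
print(hybrid_rank("apple recipe", [0.8, 0.2, 0.3]))
```

In production the blend is usually done over the top-k candidates from each retriever (or via reciprocal rank fusion) rather than over the full corpus, with a cross-encoder re-ranker as an optional final stage.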
Ranking, SEO, and spam
- Some think click-based ranking is weak because it rewards clickbait; they propose penalizing ad-heavy pages as a stronger anti-spam signal.
- Others argue embeddings/LLM ranking can also be gamed: adversarially generating text to target specific embedding vectors.
- Counterpoint: using sentence embeddings (not instruction-following models) mitigates prompt-style attacks, and generating matching embeddings is more work than classic keyword stuffing.
Data sources, crawling, and openness
- Strong encouragement to integrate Common Crawl and EU OpenWebSearch data; some dream of a high-quality non-profit search engine.
- Discussion with Common Crawl about legal constraints: they stress they don’t own crawled content, can’t grant broad reuse rights, and must respect robots.txt.
- Some ask for open sourcing the engine and/or building a federated or decentralized search network; others worry about sustainability.
LLMs, OpenAI, and privacy
- Surprise at how cheap OpenAI’s batch embedding pricing is; speculation whether it’s a “honeypot” or “drug dealer” tactic.
- Debate over whether OpenAI truly avoids training on API data; terms say no training unless users opt in, but some remain skeptical due to broader AI copyright concerns.
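For context on the batch pricing discussion: OpenAI's Batch API takes a JSONL file of pre-built requests and returns results asynchronously at a discount. A minimal sketch of building such a file locally (no network call); the model name follows OpenAI's public docs, while the `custom_id` scheme and sample texts are assumptions.

```python
import json

def make_batch_lines(texts, model="text-embedding-3-small"):
    """One JSONL line per embedding request, in the Batch API's request shape."""
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"doc-{i}",          # caller-chosen ID to match results back
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }))
    return lines

lines = make_batch_lines(["first page text", "second page text"])
print(lines[0])
```

The resulting file is then uploaded and submitted as a batch job; embedding billions of pages this way is what commenters found surprisingly cheap.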
User experience and reliability
- Early users report mostly good results, though some aggregator/“meta” pages still rank above deep-expertise sources, similar to the major engines.
- The demo experienced CORS/502 issues attributed to a “hug of death.”