Show HN: Building a web search engine from scratch with 3B neural embeddings
Overall reception
- Strong enthusiasm for the project and the write-up; many call it one of the best technical articles they’ve read in a while.
- People are impressed that a solo engineer built a working web-scale search engine with relatively low cost and detailed documentation.
- Several commenters say they’d pay for it and see it as a credible seed for a community-run or even commercial alternative to Google/Kagi.
State of search and the web
- Many lament Google’s decline: weaker exact-match search, heavy ad/SEO noise, and suspicion that profit is prioritized over quality.
- Explanations given:
  - An arms race with SEO and ad-driven “garbage” content.
  - A fundamental change in the web: good content moved to walled gardens (social media, Discord, etc.), and much of the old web has disappeared.
- Some wish for an “old Google” style engine (n‑grams + PageRank) and even a mode that surfaces dead URLs as “missing” for research.
Technical approach and limitations
- Praise for the clear cost breakdown, stack diagram, and use of neural embeddings + vector DB at scale.
- Several note vector-only search misses important keyword-sensitive cases (e.g., recipes, “Apple” not returning apple.com first, SBERT definition queries).
- Multiple commenters advocate hybrid search (BM25 + embeddings) with re-ranking for best quality.
- There’s interest in scaling choices (HNSW vs IVF, RocksDB, CoreNN) and mention of alternatives like sparse embeddings (SPLADE).
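The hybrid approach commenters advocate can be sketched minimally: score documents with classic BM25 for keyword sensitivity, score them with embedding cosine similarity for semantic match, then re-rank by a blend. The toy corpus, the 3-dimensional stand-in vectors, and the blend weight `alpha` below are illustrative assumptions, not anything from the project itself.

```python
import math

# Toy corpus standing in for crawled pages (assumption for illustration).
docs = {
    "d1": "apple pie recipe with fresh apples",
    "d2": "apple releases new iphone",
    "d3": "how to bake a chocolate cake",
}

# Tiny 3-dim vectors standing in for real neural embeddings (assumption).
doc_vecs = {
    "d1": [0.9, 0.1, 0.2],
    "d2": [0.2, 0.9, 0.1],
    "d3": [0.7, 0.0, 0.6],
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 over whitespace-tokenized documents."""
    tokenized = {d: text.split() for d, text in docs.items()}
    N = len(docs)
    avgdl = sum(len(t) for t in tokenized.values()) / N
    scores = {d: 0.0 for d in docs}
    for term in query.split():
        df = sum(1 for toks in tokenized.values() if term in toks)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for d, toks in tokenized.items():
            tf = toks.count(term)
            denom = tf + k1 * (1 - b + b * len(toks) / avgdl)
            scores[d] += idf * tf * (k1 + 1) / denom
    return scores

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_rank(query, query_vec, alpha=0.5):
    """Re-rank by a weighted sum of normalized BM25 and cosine similarity."""
    bm = bm25_scores(query, docs)
    max_bm = max(bm.values()) or 1.0
    combined = {
        d: alpha * (bm[d] / max_bm) + (1 - alpha) * cosine(query_vec, doc_vecs[d])
        for d in docs
    }
    return sorted(combined, key=combined.get, reverse=True)

# Keyword-heavy query: the BM25 term pulls the exact-match recipe page up,
# which a vector-only ranker could miss (the "recipes" failure mode above).
print(hybrid_rank("apple recipe", [0.8, 0.2, 0.3]))
```

In production the blend is usually done over the top-k candidates from each retriever (or via reciprocal rank fusion) rather than over the full corpus, with a cross-encoder re-ranker as an optional final stage.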
Ranking, SEO, and spam
- Some think click-based ranking is weak because it rewards clickbait; they propose penalizing ad-heavy pages as a stronger anti-spam signal.
- Others argue embeddings/LLM ranking can also be gamed: adversarially generating text to target specific embedding vectors.
- Counterpoint: using sentence embeddings (not instruction-following models) mitigates prompt-style attacks, and generating matching embeddings is more work than classic keyword stuffing.
Data sources, crawling, and openness
- Strong encouragement to integrate Common Crawl and EU OpenWebSearch data; some dream of a high-quality non-profit search engine.
- Discussion with Common Crawl about legal constraints: they stress they don’t own crawled content, can’t grant broad reuse rights, and must respect robots.txt.
- Some ask for open sourcing the engine and/or building a federated or decentralized search network; others worry about sustainability.
LLMs, OpenAI, and privacy
- Surprise at how cheap OpenAI’s batch embedding pricing is; speculation whether it’s a “honeypot” or “drug dealer” tactic.
- Debate over whether OpenAI truly avoids training on API data; terms say no training unless users opt in, but some remain skeptical due to broader AI copyright concerns.
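For context on the batch pricing discussion: OpenAI's Batch API takes a JSONL file of pre-built requests and returns results asynchronously at a discount. A minimal sketch of building such a file locally (no network call); the model name follows OpenAI's public docs, while the `custom_id` scheme and sample texts are assumptions.

```python
import json

def make_batch_lines(texts, model="text-embedding-3-small"):
    """One JSONL line per embedding request, in the Batch API's request shape."""
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"doc-{i}",          # caller-chosen ID to match results back
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }))
    return lines

lines = make_batch_lines(["first page text", "second page text"])
print(lines[0])
```

The resulting file is then uploaded and submitted as a batch job; embedding billions of pages this way is what commenters found surprisingly cheap.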
User experience and reliability
- Early users report mostly good results, though some aggregator/“meta” pages still rank above deep-expertise sources, similar to the major engines.
- The demo experienced CORS/502 issues attributed to a “hug of death.”