Embeddings are underrated (2024)
Applications and Use Cases
- Commenters share many concrete uses: semantic “related posts” for blogs, RSS aggregators with arbitrary categories, patent similarity search, literature and arXiv search, legal text retrieval, code search over local repos, and personal knowledge tools (e.g., Recallify).
- Embeddings + classical ML (scikit-learn classifiers, clustering) are reported as practical and often “good enough” compared to fine-tuning large language models, at a fraction of the training cost (see the classifier/clustering sketch after this list).
- For clustering, embeddings make simple algorithms like k-means work much better than they do on older bag-of-words vectors.
- Some are exploring novel UX ideas such as “semantic scrolling” and HNSW-based client-side indexes for semantic browsing (see the HNSW sketch after this list).
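A minimal sketch of the “embeddings + classical ML” pattern described above, assuming the sentence-transformers and scikit-learn packages and using all-MiniLM-L6-v2 as a stand-in model; the texts, labels, and cluster count are made up for illustration.

```python
# Sketch: embeddings as features for classical scikit-learn models.
# Assumes: pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model

# Toy corpus with made-up labels (0 = billing, 1 = bug report).
texts = [
    "I was charged twice for my subscription",
    "Please refund last month's invoice",
    "The app crashes when I open settings",
    "Clicking export throws an error",
]
labels = [0, 0, 1, 1]

# One embedding per text; shape (n_texts, embedding_dim).
X = model.encode(texts)

# Classification: a plain linear model on top of frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(model.encode(["my card was billed two times"])))  # expect [0]

# Clustering: k-means tends to work far better on embeddings than on bag-of-words.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```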
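And a sketch of the HNSW-index idea from the last bullet. The thread’s examples run HNSW-style indexes client-side in JavaScript (transformers.js); this shows the same approximate nearest-neighbour idea in Python with hnswlib, with illustrative parameters rather than anything recommended in the discussion.

```python
# Sketch: approximate nearest-neighbour search over embeddings with HNSW.
# Assumes: pip install hnswlib numpy
import hnswlib
import numpy as np

dim = 384                      # e.g. MiniLM embedding size
n_docs = 1000
vectors = np.random.rand(n_docs, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n_docs))
index.set_ef(50)               # query-time accuracy/speed trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0], distances[0])  # ids and cosine distances of the 5 nearest docs

index.save_index("docs.hnsw")   # a static file that can be shipped alongside a site
```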
Search, RAG, and Technical Documentation
- Many see semantic search as the most compelling use: matching on meaning rather than exact words, handling synonyms and fuzzy queries like “that feature that runs a function on every column”.
- Hybrid search (keywords + embeddings) is reported as the best approach in production: exact matches still matter, especially for jargon, while embeddings handle conceptual similarity (see the hybrid-scoring sketch after this list).
- For technical docs, embeddings are framed as a tool for:
  - Better in-site search and “more like this” suggestions.
  - Improving “discoveryness” across large doc sets.
  - Supporting work on three “intractable” technical-writing challenges (coverage, consistency, findability), though details are mostly deferred to future posts and patents.
- In RAG, embeddings primarily serve as pointers back to source passages; more granular concept-level citation is discussed, with GraphRAG suggested as promising (see the retrieval sketch below).
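A minimal hybrid-scoring sketch for the “keywords + embeddings” point above, assuming the rank_bm25 and sentence-transformers packages; the 0.5/0.5 weighting and min-max rescaling are arbitrary illustrative choices, not a recipe from the thread.

```python
# Sketch: hybrid search = keyword (BM25) scores blended with embedding cosine similarity.
# Assumes: pip install rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "pandas.DataFrame.apply runs a function on every column or row",
    "How to configure the CI pipeline",
    "Styling tables in the docs theme",
]
query = "that feature that runs a function on every column"

# Keyword side: classic BM25 over whitespace tokens.
bm25 = BM25Okapi([d.lower().split() for d in docs])
kw_scores = bm25.get_scores(query.lower().split())

# Semantic side: cosine similarity between normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode(query, normalize_embeddings=True)
sem_scores = doc_vecs @ q_vec

def rescale(x):
    # Min-max rescale so the two score ranges are comparable before blending.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * rescale(kw_scores) + 0.5 * rescale(sem_scores)
for i in np.argsort(-hybrid):
    print(f"{hybrid[i]:.2f}  {docs[i]}")
```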
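And a retrieval sketch for the RAG point: the embedding index returns ids that point back to source passages, and those ids travel into the prompt so an answer can cite its sources. The passage ids and prompt format here are made up; no LLM call is shown.

```python
# Sketch: in RAG, the embedding index mainly returns *pointers* to source passages;
# the passages (with ids) go into the prompt so the answer can cite them.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

passages = {
    "guide.md#install": "Install the CLI with pip install mytool.",
    "guide.md#auth": "Authenticate by exporting MYTOOL_TOKEN before running.",
    "faq.md#proxy": "Behind a proxy, set HTTPS_PROXY for the CLI.",
}
ids = list(passages)

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def retrieve(question, k=2):
    q = model.encode(question, normalize_embeddings=True)
    top = np.argsort(-(vecs @ q))[:k]
    return [ids[i] for i in top]          # pointers back to the source passages

hits = retrieve("how do I log in to the tool?")
context = "\n".join(f"[{h}] {passages[h]}" for h in hits)
prompt = f"Answer using only the sources below and cite their ids.\n{context}\n\nQ: how do I log in?"
print(prompt)
```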
Technical Nuances and Models
- There is extended discussion on:
  - Directions vs. dimensions in embedding spaces, and how traits (e.g., gender) are encoded as directions rather than single axes.
  - High-dimensional geometry (near-orthogonality, Johnson–Lindenstrauss, UMAP for visualization), illustrated in the first sketch after this list.
  - Limitations of classic word vectors (GloVe/word2vec) versus contextual transformer embeddings, plus the role of tokenization (BPE, casing, punctuation).
  - Whether embeddings are meaningfully analogous to hashes; several argue they are fundamentally different despite both mapping variable-length input to fixed-length output (see the second sketch after this list).
  - Embedding inversion and “semantic algebra” over texts as emerging research topics.
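First sketch, for the geometry bullets: a quick numpy check that random unit vectors become nearly orthogonal as dimension grows (which is why a 768-d space can host far more than 768 roughly independent concept directions), plus scikit-learn’s Johnson–Lindenstrauss bound. Pure math, no embedding model required.

```python
# Sketch: why "directions, not dimensions" scales. Random unit vectors in high
# dimensions are nearly orthogonal, so many more concept directions than axes fit.
# Assumes: pip install numpy scikit-learn
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)

for dim in (3, 50, 768):
    v = rng.normal(size=(1000, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)       # unit vectors
    cos = v @ v.T
    off_diag = cos[~np.eye(1000, dtype=bool)]            # drop self-similarities
    print(f"dim={dim:4d}  mean |cos| = {np.abs(off_diag).mean():.3f}")
    # |cos| shrinks toward 0 as dim grows: random directions decorrelate.

# Johnson-Lindenstrauss: how many dimensions preserve pairwise distances
# among 10,000 points to within ~10%?
print(johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1))
```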
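Second sketch, for the hash analogy: both map variable-length text to fixed-length output, but a cryptographic hash scatters near-identical inputs while an embedding keeps them close. Assumes sentence-transformers for the embedding half; the example strings are made up.

```python
# Sketch of the hash-vs-embedding contrast from the thread.
# Assumes: pip install sentence-transformers
import hashlib
from sentence_transformers import SentenceTransformer

a = "Embeddings are underrated."
b = "Embeddings are underrated!"   # one character changed

print(hashlib.sha256(a.encode()).hexdigest()[:16])
print(hashlib.sha256(b.encode()).hexdigest()[:16])   # completely different digest

model = SentenceTransformer("all-MiniLM-L6-v2")
va, vb = model.encode([a, b], normalize_embeddings=True)
print(float(va @ vb))   # cosine similarity, very close to 1.0
```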
Evaluation, Limits, and Skepticism
- Some readers find the article too introductory and vague, wanting earlier definitions, a clearer thesis, and concrete “killer apps” for tech writers.
- Others note embeddings are long-established in IR and recommender systems, so “underrated” mainly applies relative to LLM hype or within the technical-writing community.
- Several caution that embeddings are “hunchy”: great for similarity and clustering, but not for precise logical queries or structured data conditions.
- There is debate over whether text generation or embeddings will have the bigger long‑term impact on technical writing; many conclude the real power lies in combining both.
Performance, Deployment, and Ethics
- Commenters emphasize that generating an embedding costs roughly one forward pass over the input (comparable to generating a single token), with some extra cost for bidirectional models (see the timing sketch after this list).
- Lightweight open-source models (e.g., MiniLM, BGE, GTE, Nomic) are cited as small, fast, and sometimes outperforming commercial APIs on MTEB.
- Client-side embeddings using ONNX and transformers.js, with static HNSW-like indexes in Parquet queried via DuckDB, are highlighted as near-free, low-latency options (see the DuckDB sketch after this list).
- Ethical concerns focus on training data for embedding models, though many see embeddings as a strongly “augmentative” rather than replacement technology.
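A timing sketch for the “one forward pass” and MiniLM points above, assuming sentence-transformers and the all-MiniLM-L6-v2 model; absolute numbers depend entirely on hardware and batch size.

```python
# Sketch: an embedding is roughly one encoder forward pass per text (batched here).
# Assumes: pip install sentence-transformers
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dimensional embeddings
sentences = ["Semantic search over our docs"] * 256

model.encode(sentences[:8])                        # warm-up
t0 = time.perf_counter()
emb = model.encode(sentences, batch_size=64)
dt = time.perf_counter() - t0

print(emb.shape)                                   # (256, 384)
print(f"{1000 * dt / len(sentences):.2f} ms per sentence on this machine")
```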
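And a sketch of the “embeddings in Parquet, queried via DuckDB” setup, assuming a reasonably recent DuckDB release that ships list_cosine_similarity; this does a brute-force scan rather than an HNSW index, which is usually fine at documentation scale. File name, column names, and texts are illustrative.

```python
# Sketch: store embeddings in a Parquet file and rank by cosine similarity in DuckDB.
# Assumes: pip install duckdb sentence-transformers
import duckdb
from sentence_transformers import SentenceTransformer

docs = ["How to install", "How to authenticate", "Troubleshooting proxies"]
model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(docs, normalize_embeddings=True).tolist()

con = duckdb.connect()
con.execute("CREATE TABLE docs(body VARCHAR, emb DOUBLE[])")
con.executemany("INSERT INTO docs VALUES (?, ?)", list(zip(docs, vecs)))
con.execute("COPY docs TO 'docs.parquet' (FORMAT PARQUET)")   # static, shippable index

q = model.encode("how do I log in?", normalize_embeddings=True).tolist()
rows = con.execute(
    """
    SELECT body, list_cosine_similarity(emb, ?) AS score
    FROM 'docs.parquet'
    ORDER BY score DESC
    LIMIT 3
    """,
    [q],
).fetchall()
print(rows)
```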