Embeddings are underrated (2024)

Applications and Use Cases

  • Commenters share many concrete uses: semantic “related posts” for blogs, RSS aggregators with arbitrary categories, patent similarity search, literature and arXiv search, legal text retrieval, code search over local repos, and personal knowledge tools (e.g., Recallify).
  • Embeddings + classical ML (scikit-learn classifiers, clustering) are reported as practical and often “good enough” compared to fine‑tuning large language models, at vastly lower training cost.
  • For clustering, embeddings make simple algorithms like k‑means work much better than they did on old bag‑of‑words vectors (a sketch of this embeddings-plus-classical-ML pattern follows this list).
  • Some are exploring novel UX ideas like “semantic scrolling” and HNSW-based client‑side indexes for semantic browsing.
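
A minimal sketch of the “embeddings + classical ML” pattern described above, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; the tiny dataset, labels, and cluster count are illustrative rather than taken from the discussion:

```python
# Frozen embeddings + cheap scikit-learn models: no LLM fine-tuning involved.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, CPU-friendly

docs = [
    "How to reset your password",
    "Billing cycles and invoices",
    "Troubleshooting login errors",
    "Updating payment methods",
]
labels = ["account", "billing", "account", "billing"]

X = model.encode(docs)  # one dense vector per document

# Classifier trained on top of frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(model.encode(["I can't sign in"])))  # predicted label, e.g. "account"

# The same vectors also make k-means far more useful than bag-of-words counts.
print(KMeans(n_clusters=2, random_state=0).fit_predict(X))
```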

Search, RAG, and Technical Documentation

  • Many see semantic search as the most compelling use: matching on meaning rather than exact words, handling synonyms and fuzzy queries like “that feature that runs a function on every column”.
  • Hybrid search (keywords + embeddings) is reported to work best in production: exact matches remain important, especially for jargon, while embeddings handle conceptual similarity (a simplified blend is sketched after this list).
  • For technical docs, embeddings are framed as a tool for:
    • Better in‑site search and “more like this” suggestions.
    • Improving “discoveryness” across large doc sets.
    • Supporting work on three “intractable” technical-writing challenges (coverage, consistency, findability), though details are mostly deferred to future posts and patents.
  • In RAG, embeddings primarily serve as pointers back to source passages; more granular concept‑level citation is discussed, with GraphRAG suggested as promising.
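
A hedged sketch of the hybrid-search idea: combine a lexical score with an embedding cosine score. TF-IDF stands in for a production BM25 index, the documents and query are invented, and the 50/50 weighting is an arbitrary illustration rather than a recommendation from the thread:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

docs = [
    "pandas.DataFrame.apply runs a function along a DataFrame axis",
    "GROUP BY aggregates rows that share a key",
    "Apply a patch file with git apply",
]
query = "that feature that runs a function on every column"

# Lexical side: exact term matches still matter, especially for jargon.
tfidf = TfidfVectorizer().fit(docs)
lexical = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]

# Semantic side: embeddings handle synonyms and fuzzy phrasing.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)
semantic = (query_vec @ doc_vecs.T)[0]

# Simple linear blend; production systems often use reciprocal rank fusion instead.
hybrid = 0.5 * lexical + 0.5 * semantic
print(docs[int(np.argmax(hybrid))])
```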

Technical Nuances and Models

  • There is extended discussion on:
    • Directions vs dimensions in embedding spaces and how traits (e.g., gender) are encoded as directions, not single axes.
    • High‑dimensional geometry (near‑orthogonality, Johnson–Lindenstrauss, UMAP for visualization); a small numeric illustration of near‑orthogonality follows this list.
    • Limitations of classic word vectors (GloVe/word2vec) versus contextual transformer embeddings, plus the role of tokenization (BPE, casing, punctuation).
    • Whether embeddings are meaningfully analogous to hashes; several argue they are fundamentally different despite both mapping variable-length input to fixed-length output.
    • Embedding inversion and “semantic algebra” over texts as emerging research topics.
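
A small numeric illustration of the near-orthogonality point: random unit vectors in high dimensions are almost always nearly perpendicular to each other, which is part of why an embedding space can encode far more distinct directions (traits) than it has axes. The dimensions and sample count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 50, 768):
    vecs = rng.standard_normal((1000, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    cosines = vecs @ vecs.T                              # pairwise cosine similarities
    off_diag = cosines[~np.eye(len(cosines), dtype=bool)]
    print(f"dim={dim:4d}  mean |cos|={np.abs(off_diag).mean():.3f}")

# Typical output: the mean |cosine| shrinks toward 0 as the dimension grows,
# i.e. random directions become nearly orthogonal.
```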

Evaluation, Limits, and Skepticism

  • Some readers find the article too introductory and vague, wanting earlier definitions, a clearer thesis, and concrete “killer apps” for tech writers.
  • Others note embeddings are long-established in IR and recommender systems, so “underrated” mainly applies relative to LLM hype or within the technical-writing community.
  • Several caution that embeddings are “hunchy”: great for similarity and clustering, but not for precise logical queries or structured data conditions.
  • There is debate over whether text generation or embeddings will have the bigger long‑term impact on technical writing; many conclude the real power lies in combining both.

Performance, Deployment, and Ethics

  • Commenters emphasize that generating an embedding is roughly one forward pass (like one token of generation), with some extra cost for bidirectional models.
  • Lightweight open-source models (e.g., MiniLM, BGE, GTE, Nomic) are cited as small, fast, and sometimes outperforming commercial APIs on MTEB.
  • Client‑side embeddings using ONNX and transformers.js, with static HNSW‑like indexes stored in Parquet and queried via DuckDB, are highlighted as near‑free, low‑latency options (a simplified precompute-and-query sketch follows this list).
  • Ethical concerns focus on training data for embedding models, though many see embeddings as a strongly “augmentative” rather than replacement technology.
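
A simplified sketch of the “precompute once, query cheaply” pattern described above, using a small open model and a Parquet file; brute-force cosine search stands in for an HNSW index, and the file name, model choice, and documents are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open model, runs fine on CPU

# Offline step: embed the corpus once and ship the vectors as a static file.
docs = ["Install the CLI", "Configure authentication", "Rotate API keys"]
vecs = model.encode(docs, normalize_embeddings=True)
pd.DataFrame({"text": docs, "embedding": list(vecs)}).to_parquet("index.parquet")

# Query time: one forward pass for the query, then a dot product over the index.
df = pd.read_parquet("index.parquet")
matrix = np.vstack(df["embedding"].to_numpy())
query_vec = model.encode(["how do I log in"], normalize_embeddings=True)[0]
scores = matrix @ query_vec
print(df["text"].iloc[int(np.argmax(scores))])
```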