Embeddings are underrated

Overall sentiment: underrated vs overrated

  • Many argue embeddings are underused outside ML (especially in tech writing, search, tools), calling them a “bicycle for the mind” that augments rather than replaces thinking.
  • Others say they’re overrated: they tend to latch onto surface word overlap, can produce many false positives and false negatives, and are often adopted by people who don’t rigorously evaluate the results.
  • General consensus: embeddings are powerful but not magic; expectations should be realistic, and embeddings should be combined with evaluation, classic IR techniques, and sometimes fine-tuning.

Applications people are excited about

  • Semantic search and discovery: docs, logs, man pages, email, git commits, nuclear-doc search, multi-language search, clustering comments and summarizing clusters.
  • Technical docs: chunk-level embeddings, similarity search to jump to the right section, possible auto-footnotes and annotations, “hypothetical document” indexing.
  • Job matching: automatically matching resumes to job descriptions and building personalized job boards; some early products already exist.
  • Classification and recommendation: embedding-based classifiers, user–item embeddings in recommender systems, niche ad targeting.
  • Misc: embeddings powering better note-taking, topic grouping, cross-language “Babelfish”-like search.
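The semantic-search applications above all reduce to the same core operation: embed the query and the documents, then rank by vector similarity. A minimal sketch, using hand-written toy vectors in place of real model outputs (the function names and the 3-dimensional vectors are illustrative, not from any library):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    scored = [(i, cosine_sim(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

# Toy 3-d "embeddings"; a real system would get these from an embedding model.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
ranked = semantic_search(query, docs)  # list of (doc index, score) pairs
```

Everything else in the list (doc search, log search, commit search, clustering) is this same loop plus a domain-specific way of producing and chunking the inputs.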

Technical debates and best practices

  • Chunking and preprocessing matter: using document structure or dynamic chunking rather than whole-doc embeddings; stripping markup selectively.
  • Evaluation: several references to the MTEB leaderboard; concern about benchmark overfitting and test-set contamination.
  • Model choice: tension between large 7B+ LLM-based encoders vs lighter specialized models; concern over small embedding dimensions possibly harming niche performance.
  • Sparse vs dense embeddings: sparse/BM25-ish variants seen as strong for large-scale retrieval, efficiency, interpretability, and user familiarity.
  • Fine-tuning: often recommended for domain-specific corpora or languages; claims that ~100k relevance pairs can significantly improve task-specific performance.
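One concrete form of the "use document structure for chunking" advice is to split a markdown source at its headings, so each chunk is one self-contained section rather than an arbitrary fixed-size window. A rough sketch (the function name is hypothetical, and real pipelines usually add size limits and overlap):

```python
import re

def chunk_by_heading(text):
    """Split a markdown document at headings so each chunk is one section."""
    chunks, current = [], []
    for line in text.splitlines():
        # A new heading starts a new chunk (flush whatever came before it).
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```

Chunks produced this way inherit a natural label (their heading), which also helps with the "jump to the right section" use case mentioned above.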

Structure of embedding space

  • Interest in decomposing embeddings into “content vs tone” or other factors using vector arithmetic, PCA, or special training; no definitive recipe, but multiple proposed methods.
  • Observations about dimensional collapse (similarity scores clustering near the top of the range) and matryoshka representations suggest significant room for future optimization.
  • Some discuss translating between embedding spaces or creating canonical “semantic hashes,” with disagreement on feasibility.
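The vector-arithmetic approach to "content vs tone" can be sketched as: estimate a tone direction from paired examples (same content, different tone), then project it out of an embedding to leave a content-only residual. This is one of the proposed methods, not a settled recipe, and the toy 2-d vectors and function names below are purely illustrative:

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def tone_direction(pairs):
    """Mean difference over (formal, informal) embedding pairs, normalized."""
    diffs = [sub(f, i) for f, i in pairs]
    mean = [sum(d[k] for d in diffs) / len(diffs) for k in range(len(diffs[0]))]
    return unit(mean)

def remove_component(v, direction):
    """Project out the tone direction, keeping the 'content' residual."""
    c = dot(v, direction)
    return [x - c * d for x, d in zip(v, direction)]
```

PCA-based variants do the same thing but discover the direction from variance structure instead of labeled pairs.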

RAG, LLMs, and environment

  • Many find vanilla RAG underwhelming; semantic search plus optional LLM summarization (with citations) is seen as more robust.
  • Energy use of embedding models is raised; others counter that, relative to human work and travel, compute may be a net efficiency gain.
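The "semantic search plus LLM summarization with citations" pattern mostly comes down to how retrieved chunks are presented to the model: number them as sources so the answer can cite them. A minimal sketch of that prompt-building step (the function name and prompt wording are assumptions; the actual LLM call is omitted):

```python
def build_cited_prompt(query, ranked_chunks):
    """Format retrieved chunks as numbered sources the model can cite as [n]."""
    sources = "\n".join(f"[{i + 1}] {chunk}"
                        for i, chunk in enumerate(ranked_chunks))
    return (
        "Answer using only the sources below; cite them as [n].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Because the source numbering is fixed before generation, citations in the answer can be mapped back to the original chunks, which is what makes this variant more auditable than vanilla RAG.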