Embeddings are underrated
Overall sentiment: underrated vs overrated
- Many argue embeddings are underused outside ML (especially in tech writing, search, tools), calling them a “bicycle for the mind” that augments rather than replaces thinking.
- Others say they’re overrated: similarity scores often track surface word overlap rather than meaning, producing frequent false positives and negatives, and they’re often adopted by people who don’t rigorously evaluate the results.
- General consensus: embeddings are powerful but not magic; set realistic expectations and pair them with rigorous evaluation, classic IR techniques, and sometimes fine-tuning.
Applications people are excited about
- Semantic search and discovery: docs, logs, man pages, email, git commits, nuclear-doc search, multi-language search, clustering comments and summarizing the clusters (a minimal search sketch follows this list).
- Technical docs: chunk-level embeddings, similarity search to jump the reader to the right section, possible auto-footnotes and annotations, and “hypothetical document” (HyDE-style) indexing.
- Job matching: embedding resumes against job descriptions for personalized job boards and automatic matching; some early products already exist.
- Classification and recommendation: embedding-based classifiers (classifier sketch after this list), user–item embeddings in recommender systems, niche ad targeting.
- Misc: embeddings powering better note-taking, topic grouping, cross-language “Babelfish”-like search.
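A minimal sketch of the semantic-search pattern above, assuming the sentence-transformers library and the small all-MiniLM-L6-v2 model (both placeholders; any embedding model works the same way):

```python
# Minimal semantic search: embed a corpus once, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model

docs = [
    "How to rotate API keys for the billing service",
    "Git commit message conventions for this repo",
    "Restarting the log aggregator after a config change",
]

# Normalized vectors make cosine similarity a plain dot product.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q          # cosine similarity via dot product
    top = np.argsort(-scores)[:k]  # indices of the k best matches
    return [(docs[i], float(scores[i])) for i in top]

print(search("how do I bounce the logging daemon?"))
```

Note that the query shares almost no words with the best document; that gap is exactly what embeddings are supposed to bridge.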
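And a hedged sketch of the embedding-based-classifier idea: embed the texts once, then fit an ordinary linear model on the vectors. The model name, texts, and labels are all toy assumptions:

```python
# Embedding-based classification: vectors in, any classic classifier on top.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["refund my order", "app crashes on launch",
         "charged me twice", "login button broken"]
labels = ["billing", "bug", "billing", "bug"]

X = model.encode(texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# New text is embedded the same way, then classified by the linear model.
print(clf.predict(model.encode(["I was billed twice this month"],
                               normalize_embeddings=True)))
```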
Technical debates and best practices
- Chunking and preprocessing matter: use document structure or dynamic chunking rather than whole-doc embeddings, and strip markup selectively (chunking sketch after this list).
- Evaluation: several references to the MTEB (Massive Text Embedding Benchmark) leaderboard; concern about benchmark overfitting and test-set contamination.
- Model choice: tension between large 7B+ LLM-based encoders and lighter specialized models; concern that small embedding dimensions may hurt niche-domain performance.
- Sparse vs dense embeddings: sparse/BM25-ish variants are seen as strong for large-scale retrieval, efficiency, interpretability, and user familiarity (score-fusion sketch after this list).
- Fine-tuning: often recommended for domain-specific corpora or languages; claims that ~100k relevance pairs can significantly improve task-specific performance (fine-tuning sketch after this list).
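A toy illustration of structure-aware chunking, assuming markdown input; splitting at headings keeps each chunk a coherent section. Real pipelines also handle code blocks, tables, and size caps:

```python
# Structure-aware chunking: split at markdown headings instead of fixed windows,
# so each embedded chunk corresponds to one section of the document.
import re

def chunk_by_heading(markdown: str) -> list[dict]:
    chunks, current = [], {"heading": "(preamble)", "lines": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):      # a heading starts a new chunk
            if current["lines"]:
                chunks.append(current)
            current = {"heading": line.lstrip("# "), "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
            for c in chunks]

doc = "# Install\npip install foo\n\n# Configure\nEdit foo.toml."
for c in chunk_by_heading(doc):
    print(c["heading"], "->", repr(c["text"]))
```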
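For the sparse-vs-dense point, one standard way to get the best of both is reciprocal rank fusion (RRF); the technique is a common industry choice, not something the thread prescribes, and the two rankings below are placeholders standing in for real BM25 and embedding retrievers:

```python
# Reciprocal rank fusion: merge rankings without having to calibrate the
# incompatible score scales of sparse and dense retrievers.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first; k=60 is the usual constant.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]   # pretend sparse (BM25) results
dense_ranking = ["doc1", "doc5", "doc3"]  # pretend embedding results
print(rrf([bm25_ranking, dense_ranking]))  # doc1 and doc3 rise to the top
```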
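And a hedged sketch of pair-based fine-tuning using the classic sentence-transformers training API; the model name and the two toy pairs are placeholders for an in-domain dataset at roughly the scale the thread mentions:

```python
# Fine-tune an embedding model on (query, relevant passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    InputExample(texts=["reset 2FA", "How to reset two-factor authentication"]),
    InputExample(texts=["invoice PDF", "Downloading invoices as PDF files"]),
]  # in practice: tens of thousands of in-domain query-passage pairs

loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedder")  # output path is a placeholder
```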
Structure of embedding space
- Interest in decomposing embeddings into “content vs tone” or other factors using vector arithmetic, PCA, or special training; no definitive recipe, but multiple proposed methods (one sketched after this list).
- Observations about dimensional collapse (cosine similarities bunched into a narrow high band) and matryoshka representations suggest significant room for future optimization (truncation sketch after this list).
- Some discuss translating between embedding spaces or creating canonical “semantic hashes,” with disagreement on feasibility.
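One of the proposed decomposition methods, sketched in numpy under strong assumptions: estimate a “tone” direction as the mean difference between embeddings of formal/informal paraphrase pairs, then project it out of any vector. Random vectors stand in for real embeddings here:

```python
# Decompose "tone" out of an embedding with simple vector arithmetic.
import numpy as np

rng = np.random.default_rng(0)
dim = 384
formal = rng.normal(size=(50, dim))    # embeddings of formal phrasings
informal = rng.normal(size=(50, dim))  # embeddings of informal paraphrases

tone = (formal - informal).mean(axis=0)
tone /= np.linalg.norm(tone)           # unit "tone" direction

def strip_tone(vec: np.ndarray) -> np.ndarray:
    # Project out the tone component; what remains is a content-ish vector.
    return vec - (vec @ tone) * tone

v = rng.normal(size=dim)
print(np.allclose(strip_tone(v) @ tone, 0.0))  # True: tone component removed
```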
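And a sketch of why matryoshka representations leave room for optimization: models trained with matryoshka representation learning let you keep just a prefix of each vector and renormalize, trading a little accuracy for a much smaller index. Random unit vectors stand in for real embeddings:

```python
# Matryoshka-style truncation: the first d dimensions carry most of the signal.
import numpy as np

def truncate(vecs: np.ndarray, d: int) -> np.ndarray:
    kept = vecs[:, :d]  # keep only the leading prefix of each vector
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

full = np.random.default_rng(1).normal(size=(3, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate(full, 256)  # 4x smaller index
print(small.shape)           # (3, 256), still unit-norm for cosine search
```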
RAG, LLMs, and environmental cost
- Many find vanilla RAG underwhelming; semantic search plus optional LLM summarization with citations is seen as more robust (prompt-assembly sketch below).
- Energy use of embedding models is raised as a concern; others counter that, relative to the human work and travel it can replace, compute may be a net efficiency gain.
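A sketch of the retrieve-then-summarize-with-citations pattern, reduced to the prompt-assembly step; the retrieved hits are placeholders, and the resulting prompt would be sent to any completion API:

```python
# Retrieve, number the sources, and ask the model to cite them as [n],
# so every claim in the summary can be traced back to a search hit.
def build_prompt(query: str, hits: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {h}" for i, h in enumerate(hits))
    return (
        "Answer the question using only the sources below, citing them as [n].\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}\nAnswer:"
    )

hits = ["Restart the aggregator with `svc restart logs`.",
        "Config changes require a restart to take effect."]
print(build_prompt("How do I apply a logging config change?", hits))
```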