Embeddings are underrated
Overall sentiment: underrated vs overrated
- Many argue embeddings are underused outside ML (especially in tech writing, search, tools), calling them a “bicycle for the mind” that augments rather than replaces thinking.
- Others say they’re overrated: similarity scores often track surface word overlap rather than meaning, producing frequent false positives and negatives, and they’re often adopted by people who don’t rigorously evaluate the results.
- General consensus: embeddings are powerful but not magic; set realistic expectations and pair them with rigorous evaluation, classic IR techniques, and sometimes fine-tuning.
Applications people are excited about
- Semantic search and discovery: docs, logs, man pages, email, git commits, nuclear-doc search, multi-language search, clustering comments and summarizing the clusters (a minimal search sketch follows this list).
- Technical docs: chunk-level embeddings, similarity search to jump the reader to the right section, possible auto-footnotes and annotations, and “hypothetical document” (HyDE-style) indexing.
- Job matching: embedding resumes against job descriptions for personalized job boards and automatic matching; some early products already exist.
- Classification and recommendation: embedding-based classifiers (classifier sketch after this list), user–item embeddings in recommender systems, niche ad targeting.
- Misc: embeddings powering better note-taking, topic grouping, cross-language “Babelfish”-like search.
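A minimal sketch of the semantic-search pattern above, assuming the sentence-transformers library and the small all-MiniLM-L6-v2 model (both placeholders; any embedding model works the same way):

```python
# Minimal semantic search: embed a corpus once, then rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used model

docs = [
    "How to rotate API keys for the billing service",
    "Git commit message conventions for this repo",
    "Restarting the log aggregator after a config change",
]

# Normalized vectors make cosine similarity a plain dot product.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q          # cosine similarity via dot product
    top = np.argsort(-scores)[:k]  # indices of the k best matches
    return [(docs[i], float(scores[i])) for i in top]

print(search("how do I bounce the logging daemon?"))
```

Note that the query shares almost no words with the best document; that gap is exactly what embeddings are supposed to bridge.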
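And a hedged sketch of the embedding-based-classifier idea: embed the texts once, then fit an ordinary linear model on the vectors. The model name, texts, and labels are all toy assumptions:

```python
# Embedding-based classification: vectors in, any classic classifier on top.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["refund my order", "app crashes on launch",
         "charged me twice", "login button broken"]
labels = ["billing", "bug", "billing", "bug"]

X = model.encode(texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# New text is embedded the same way, then classified by the linear model.
print(clf.predict(model.encode(["I was billed twice this month"],
                               normalize_embeddings=True)))
```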
Technical debates and best practices
- Chunking and preprocessing matter: use document structure or dynamic chunking rather than whole-doc embeddings, and strip markup selectively (chunking sketch after this list).
- Evaluation: several references to the MTEB (Massive Text Embedding Benchmark) leaderboard; concern about benchmark overfitting and test-set contamination.
- Model choice: tension between large 7B+ LLM-based encoders and lighter specialized models; concern that small embedding dimensions may hurt niche-domain performance.
- Sparse vs dense embeddings: sparse/BM25-ish variants are seen as strong for large-scale retrieval, efficiency, interpretability, and user familiarity (score-fusion sketch after this list).
- Fine-tuning: often recommended for domain-specific corpora or languages; claims that ~100k relevance pairs can significantly improve task-specific performance (fine-tuning sketch after this list).
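A toy illustration of structure-aware chunking, assuming markdown input; splitting at headings keeps each chunk a coherent section. Real pipelines also handle code blocks, tables, and size caps:

```python
# Structure-aware chunking: split at markdown headings instead of fixed windows,
# so each embedded chunk corresponds to one section of the document.
import re

def chunk_by_heading(markdown: str) -> list[dict]:
    chunks, current = [], {"heading": "(preamble)", "lines": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line):      # a heading starts a new chunk
            if current["lines"]:
                chunks.append(current)
            current = {"heading": line.lstrip("# "), "lines": []}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
            for c in chunks]

doc = "# Install\npip install foo\n\n# Configure\nEdit foo.toml."
for c in chunk_by_heading(doc):
    print(c["heading"], "->", repr(c["text"]))
```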
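For the sparse-vs-dense point, one standard way to get the best of both is reciprocal rank fusion (RRF); the technique is a common industry choice, not something the thread prescribes, and the two rankings below are placeholders standing in for real BM25 and embedding retrievers:

```python
# Reciprocal rank fusion: merge rankings without having to calibrate the
# incompatible score scales of sparse and dense retrievers.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first; k=60 is the usual constant.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]   # pretend sparse (BM25) results
dense_ranking = ["doc1", "doc5", "doc3"]  # pretend embedding results
print(rrf([bm25_ranking, dense_ranking]))  # doc1 and doc3 rise to the top
```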
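And a hedged sketch of pair-based fine-tuning using the classic sentence-transformers training API; the model name and the two toy pairs are placeholders for an in-domain dataset at roughly the scale the thread mentions:

```python
# Fine-tune an embedding model on (query, relevant passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    InputExample(texts=["reset 2FA", "How to reset two-factor authentication"]),
    InputExample(texts=["invoice PDF", "Downloading invoices as PDF files"]),
]  # in practice: tens of thousands of in-domain query-passage pairs

loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedder")  # output path is a placeholder
```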
Structure of embedding space
- Interest in decomposing embeddings into “content vs tone” or other factors using vector arithmetic, PCA, or special training; no definitive recipe, but multiple proposed methods (one sketched after this list).
- Observations about dimensional collapse (cosine similarities bunched into a narrow high band) and matryoshka representations suggest significant room for future optimization (truncation sketch after this list).
- Some discuss translating between embedding spaces or creating canonical “semantic hashes,” with disagreement on feasibility.
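One of the proposed decomposition methods, sketched in numpy under strong assumptions: estimate a “tone” direction as the mean difference between embeddings of formal/informal paraphrase pairs, then project it out of any vector. Random vectors stand in for real embeddings here:

```python
# Decompose "tone" out of an embedding with simple vector arithmetic.
import numpy as np

rng = np.random.default_rng(0)
dim = 384
formal = rng.normal(size=(50, dim))    # embeddings of formal phrasings
informal = rng.normal(size=(50, dim))  # embeddings of informal paraphrases

tone = (formal - informal).mean(axis=0)
tone /= np.linalg.norm(tone)           # unit "tone" direction

def strip_tone(vec: np.ndarray) -> np.ndarray:
    # Project out the tone component; what remains is a content-ish vector.
    return vec - (vec @ tone) * tone

v = rng.normal(size=dim)
print(np.allclose(strip_tone(v) @ tone, 0.0))  # True: tone component removed
```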
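And a sketch of why matryoshka representations leave room for optimization: models trained with matryoshka representation learning let you keep just a prefix of each vector and renormalize, trading a little accuracy for a much smaller index. Random unit vectors stand in for real embeddings:

```python
# Matryoshka-style truncation: the first d dimensions carry most of the signal.
import numpy as np

def truncate(vecs: np.ndarray, d: int) -> np.ndarray:
    kept = vecs[:, :d]  # keep only the leading prefix of each vector
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

full = np.random.default_rng(1).normal(size=(3, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate(full, 256)  # 4x smaller index
print(small.shape)           # (3, 256), still unit-norm for cosine search
```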
RAG, LLMs, and environmental cost
- Many find vanilla RAG underwhelming; semantic search plus optional LLM summarization with citations is seen as more robust (prompt-assembly sketch below).
- Energy use of embedding models is raised as a concern; others counter that, relative to the human work and travel it can replace, compute may be a net efficiency gain.
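A sketch of the retrieve-then-summarize-with-citations pattern, reduced to the prompt-assembly step; the retrieved hits are placeholders, and the resulting prompt would be sent to any completion API:

```python
# Retrieve, number the sources, and ask the model to cite them as [n],
# so every claim in the summary can be traced back to a search hit.
def build_prompt(query: str, hits: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {h}" for i, h in enumerate(hits))
    return (
        "Answer the question using only the sources below, citing them as [n].\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}\nAnswer:"
    )

hits = ["Restart the aggregator with `svc restart logs`.",
        "Config changes require a restart to take effect."]
print(build_prompt("How do I apply a logging config change?", hits))
```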