Vector indexing all of Wikipedia on a laptop
Cost and approach to embedding Wikipedia
- Several commenters say they have embedded all of English Wikipedia in about 8 hours on Colab for roughly $10 of GPU time, using lightweight open models, versus the article’s $5,000 estimate.
- Key reasons for the discrepancy:
  - The article indexes 300+ languages, not just English.
  - The article uses a paid embedding API priced per million tokens.
- Some see the paid API as an expensive choice given the open-source options.
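The gap between the two figures is mostly arithmetic: per-token API pricing scales with corpus size, while the GPU route scales with rental hours. A rough sketch of both calculations follows. The token count and the per-million-token price are hypothetical placeholders, not figures from the article or the thread; only the "about 8 hours for about $10" input comes from the comments, and the hourly rate is simply what those two numbers imply.

```python
def api_embedding_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of embedding a corpus through an API priced per million tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


def gpu_embedding_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of embedding a corpus on rented GPU time."""
    return gpu_hours * usd_per_gpu_hour


# Hypothetical inputs, chosen only to show the shape of the calculation.
corpus_tokens = 10_000_000_000  # assumed corpus size in tokens (placeholder)
api_price = 0.10                # assumed USD per million tokens (placeholder)
api_cost = api_embedding_cost(corpus_tokens, api_price)

# The thread's English-only figure: ~8 GPU-hours for ~$10 in total,
# which implies roughly $1.25 per GPU-hour.
gpu_cost = gpu_embedding_cost(gpu_hours=8, usd_per_gpu_hour=1.25)

print(f"API route: ${api_cost:,.2f}   GPU route: ${gpu_cost:,.2f}")
```

Plugging in a real token count and a real price list is what would reproduce either side of the debate; the sketch only shows why corpus size (300+ languages versus English) and pricing model dominate the result.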
Chunking strategies and context length
- People agree you must chunk articles; you can’t embed “all of Wikipedia” as one vector.
- One practical approach: split each article into sentences, accumulate sentences into a chunk until it reaches the model's context-window limit, embed each chunk, then pool the chunk vectors into a per-article embedding.
- Debate over large context windows:
  - Some argue bigger windows are an efficiency win.
  - Others claim model performance degrades at long context lengths and that large chunks hurt retrieval precision, even though they may capture long-range semantics better.
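The sentence-accumulation approach described above can be sketched in a few lines. This is an illustration under stated assumptions, not any commenter's actual code: tokens are approximated by whitespace-separated words, the sentence splitter is a naive regex, pooling is a plain mean (the thread only says "pool"), and `toy_embed` is a hypothetical stand-in for a real embedding model.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences: Sequence[str], max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words. A single sentence
    longer than max_tokens becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors (empty input gives [])."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_article(text: str,
                  embed: Callable[[str], List[float]],
                  max_tokens: int) -> List[float]:
    """Chunk an article, embed each chunk, and pool into one vector."""
    chunks = chunk_sentences(split_sentences(text), max_tokens)
    return mean_pool([embed(chunk) for chunk in chunks])


def toy_embed(chunk: str) -> List[float]:
    """Hypothetical stand-in for a real model: [word count, period count]."""
    return [float(len(chunk.split())), float(chunk.count("."))]


sample = "One two three. Four five. Six seven eight nine. Ten."
article_vector = embed_article(sample, toy_embed, max_tokens=5)
```

Keeping the per-chunk vectors for retrieval and using the pooled vector only as an article-level summary is one way to address the precision concern raised in the debate, since pooling discards which chunk matched.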
Proprietary vs open embeddings and retrieval quality
- Skepticism about relying on proprietary embedding services this early, given rapid model turnover and opaque training.
- Others note that many businesses are comfortable with closed models and may use such datasets as a low-cost basis for evaluation.
- Several comments stress that “embeddings aren’t magic”: the similarity objective a model was trained on needs to match the retrieval task, and “semantic meaning” is a vague target that is not necessarily tuned for retrieval.
Vector search vs traditional information retrieval
- Multiple comments stress that classic IR (BM25, keyword search, metadata) remains important.
- A common mature pattern: use keyword/metadata search for first-pass recall, then embeddings for reranking or catching misses.
- Critique of “vector-only” systems built quickly for demos; they may be weaker than hybrid approaches.
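The hybrid pattern described above (keyword search for first-pass recall, embeddings for reranking) can be sketched with a small BM25 scorer and a cosine rerank. This is a minimal illustration, not a production design: the BM25 parameters `k1` and `b` are common defaults, the function names are made up for this example, documents are pre-tokenized lists, and the vectors are toy values. It covers only the reranking half of the pattern, not the "catching misses" variant.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(query_terms: Sequence[str], query_vec: Sequence[float],
                  docs: Sequence[Sequence[str]], doc_vecs: Sequence[Sequence[float]],
                  first_pass: int = 10, top_k: int = 3) -> List[int]:
    """Keyword first pass for recall, then rerank survivors by embedding similarity."""
    lexical = bm25_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: lexical[i], reverse=True)
    # Keep the best lexical matches; drop documents with no keyword overlap.
    candidates = [i for i in ranked[:first_pass] if lexical[i] > 0]
    reranked = sorted(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]),
                      reverse=True)
    return reranked[:top_k]


# Toy corpus and toy 2-d "embeddings", purely illustrative.
docs = [["vector", "index", "laptop"],
        ["keyword", "search", "bm25"],
        ["vector", "search", "hybrid"],
        ["swap", "jvm"]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]
results = hybrid_search(["vector", "search"], [0.0, 1.0], docs, doc_vecs)
```

Note the trade-off this makes visible: the last document has a perfect embedding match but no keyword overlap, so the first pass filters it out. Whether that is desirable depends on the task, which is why the thread also mentions using embeddings to catch what keyword search misses.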
JVector, large indexes, and segmentation
- JVector supports building indexes larger than RAM by using compressed vectors (PQ) during index construction while keeping edge lists in memory.
- Reported benchmarks show near-zero accuracy loss vs building with raw vectors.
- Compared with DiskANN: JVector compresses vectors incrementally as it builds, whereas DiskANN partitions the data and merges the resulting indexes, which increases construction cost.
- The complexity of searching a segmented index is debated; some argue that theoretical O(N) vs O(log N) analysis is less useful than measured performance.
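Why compressed vectors make larger-than-RAM builds feasible comes down to footprint. The sketch below compares raw float32 storage with a generic product-quantization (PQ) layout: each vector is split into subspaces, and each subvector is replaced by a one-byte code pointing into a 256-entry codebook. The vector count, dimension, and subspace count are hypothetical, and the layout is textbook PQ, not necessarily JVector's exact on-heap format.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Memory needed to hold every vector uncompressed."""
    return num_vectors * dim * bytes_per_float


def pq_vector_bytes(num_vectors: int, dim: int, num_subspaces: int,
                    bytes_per_float: int = 4) -> int:
    """Memory for PQ codes plus codebooks, assuming 256 centroids per subspace.

    With 256 centroids, each subvector is stored as a single byte.
    """
    if dim % num_subspaces != 0:
        raise ValueError("dim must be divisible by num_subspaces")
    code_bytes = num_vectors * num_subspaces  # one byte per subspace
    # num_subspaces codebooks, each 256 centroids of (dim / num_subspaces) floats
    codebook_bytes = 256 * dim * bytes_per_float
    return code_bytes + codebook_bytes


# Hypothetical index size, for illustration only.
n, d, m = 1_000_000, 1024, 128
raw = raw_vector_bytes(n, d)
compressed = pq_vector_bytes(n, d, m)
print(f"raw: {raw / 2**30:.2f} GiB, PQ: {compressed / 2**30:.2f} GiB, "
      f"ratio: {raw / compressed:.1f}x")
```

The graph's edge lists still need memory on top of this, which matches the description of keeping edge lists in RAM while the vectors themselves are held in compressed form during construction.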
System behavior: swap, JVM, and hardware
- Several comments note that Linux may swap pages out even when free RAM is available, which severely degrades performance.
- Workarounds discussed: disabling swap, tuning swappiness, using mlock, or careful JVM heap sizing.
- Laptops with 36–64GB+ RAM (including “mobile workstations” and high-end Macs) are considered adequate for the described index.
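Before changing swap settings, it helps to confirm that swapping is actually happening. A small Linux-only sketch: it reads the kernel's `vm.swappiness` value and computes swap in use from `/proc/meminfo`-style text. The function names are made up for this example, and the readers return `None` or zero on systems where those files do not exist.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(meminfo: Dict[str, int]) -> int:
    """Swap currently in use, from the SwapTotal and SwapFree fields."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def read_swappiness(path: str = "/proc/sys/vm/swappiness") -> Optional[int]:
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_swap_used_kb(path: str = "/proc/meminfo") -> Optional[int]:
    """Return swap in use on this machine, or None if meminfo is unavailable."""
    try:
        return swap_used_kb(parse_meminfo(Path(path).read_text()))
    except OSError:
        return None
```

If swap usage grows while the index build is running and free memory is still reported, that is the behavior the commenters describe, and the workarounds listed above are the candidates to try.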
Ecosystem and alternatives
- Alternatives mentioned: pgvector in Postgres, hosted vector services, and hybrid models like ColBERT.
- Some ask which vector database to choose; the advice includes starting with a vector extension in a database you already run, for simplicity.
- Wikipedia database dumps and precomputed embedding datasets are highlighted as free resources.