2024-05-29

Vector indexing all of Wikipedia on a laptop

Cost and approach to embedding Wikipedia

Several commenters say they embedded all of English Wikipedia in about 8 hours on Colab for around $10 of GPU time using lightweight open models, compared with the article’s $5,000 estimate.
Key reasons for the discrepancy:
- The article indexes 300+ languages, not just English.
- It uses a paid embedding API priced per million tokens.
- Some commenters consider that an expensive choice given the open-source options available.
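
The per-token pricing point reduces to simple arithmetic. The sketch below is illustrative only: the function name, token count, and price are hypothetical placeholders, not figures from the article or the thread.

```python
def embedding_api_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of embedding `total_tokens` at a per-million-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical inputs for illustration only; not taken from the article.
example_tokens = 2_000_000_000
example_price = 0.10  # USD per million tokens (assumed)
print(f"${embedding_api_cost(example_tokens, example_price):,.2f}")
```

Because cost scales linearly with token count, indexing many languages instead of one multiplies the bill by roughly the ratio of total tokens.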

Chunking strategies and context length

Commenters agree you must chunk articles; you can’t embed “all of Wikipedia” as one vector.
One practical approach: split each article into sentences, accumulate them until the chunk reaches the model’s context-window limit, then pool the chunk vectors into a per-article embedding.
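
The sentence-accumulation approach can be sketched roughly as follows. This is a minimal sketch under stated assumptions: whitespace-separated words stand in for model tokens, the regex sentence splitter is naive, mean pooling is one possible pooling choice (the thread does not specify one), and `embed` is a hypothetical callable standing in for whatever embedding model is used.

```python
import re
from typing import Callable, List

Vector = List[float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences: List[str], max_tokens: int) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` words.

    Words are an assumed stand-in for model tokens. A single sentence longer
    than the limit becomes its own oversized chunk rather than being split.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk if adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_article(text: str, embed: Callable[[str], Vector], max_tokens: int) -> Vector:
    """Chunk an article, embed each chunk, and pool into one article vector."""
    chunks = chunk_sentences(split_sentences(text), max_tokens)
    return mean_pool([embed(chunk) for chunk in chunks])
```

In practice the per-chunk vectors would usually be kept for retrieval as well, with the pooled vector serving only as a coarse per-article summary.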
Debate over large context windows:
- Some argue bigger windows are an efficiency win.
- Others claim model quality degrades on long inputs and that large chunks hurt retrieval precision, though large chunks may better capture long-range semantics.

Proprietary vs open embeddings and retrieval quality

Some commenters are skeptical about relying on proprietary embedding services this early, given rapid model turnover and opaque training.
Others note many businesses are comfortable with closed models and may use such datasets as a low-cost evaluation basis.
Several comments argue that “embeddings aren’t magic”: the similarity objective needs to match the retrieval task, and “semantic meaning” is a vague target that is not necessarily tuned for retrieval.

Vector search vs traditional information retrieval

Multiple comments stress that classic IR (BM25, keyword search, metadata) remains important.
A common mature pattern: use keyword/metadata search for first-pass recall, then embeddings for reranking or catching misses.
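
The two-stage pattern can be sketched as follows. This is a toy illustration, not a production design: simple term overlap stands in for a real keyword scorer such as BM25, and the function names and the precomputed `doc_vecs` mapping are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def keyword_candidates(query: str, docs: Dict[str, str], limit: int) -> List[str]:
    """First pass: rank docs by how many distinct query terms they contain.

    Term overlap is an assumed stand-in for a real scorer such as BM25.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(
    query_vec: Sequence[float],
    candidates: List[str],
    doc_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Second pass: order keyword candidates by embedding similarity."""
    return sorted(candidates, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```

The first pass keeps the candidate set small, so the embedding comparison only runs over documents that already matched on keywords or metadata.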
Some criticize “vector-only” systems built quickly for demos, arguing they may be weaker than hybrid approaches.

JVector, large indexes, and segmentation

JVector supports building indexes larger than RAM by using vectors compressed with product quantization (PQ) during index construction while keeping edge lists in memory.
Reported benchmarks show near-zero accuracy loss compared with building from raw vectors.
Comparison to DiskANN’s partition-and-merge approach: JVector compresses incrementally; DiskANN partitions and merges, increasing construction cost.
Complexity of segmented search is debated; some argue theoretical O(N) vs O(log N) analysis is less useful than empirical performance.
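
A rough memory calculation shows why PQ-compressed vectors can fit in RAM when raw vectors would not. The sketch assumes float32 raw vectors and one byte per PQ subspace code (256 centroids per subspace), a common PQ setup that is not necessarily JVector's configuration. The vector count, dimension, and subspace count are hypothetical, and codebooks and edge lists are ignored.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Memory for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_component


def pq_code_bytes(num_vectors: int, num_subspaces: int) -> int:
    """Memory for PQ codes, assuming one byte per subspace."""
    return num_vectors * num_subspaces


# Hypothetical sizes for illustration only.
n, dim, m = 10_000_000, 1024, 64
raw = raw_vector_bytes(n, dim)
pq = pq_code_bytes(n, m)
print(f"raw: {raw / 2**30:.1f} GiB, PQ codes: {pq / 2**30:.2f} GiB, ratio: {raw // pq}x")
```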

System behavior: swap, JVM, and hardware

Several comments note Linux may swap even with free RAM, which can severely degrade performance.
Workarounds discussed: disabling swap, tuning swappiness, using mlock, or careful JVM heap sizing.
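
Before applying any of these workarounds, it can help to confirm that swap is actually in use. The sketch below parses `/proc/meminfo` (its `SwapTotal` and `SwapFree` fields) and reads `/proc/sys/vm/swappiness`, both standard on Linux; the helper names are hypothetical, and the script simply reports that the files are unavailable on other systems.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. huge page counts) are plain counts.
    """
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(meminfo: Dict[str, int]) -> int:
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        with open("/proc/sys/vm/swappiness") as f:
            swappiness = f.read().strip()
        print(f"swap used: {swap_used_kb(info)} kB, vm.swappiness: {swappiness}")
    except OSError:
        print("/proc files not available (not Linux?)")
```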
Laptops with 36–64GB+ RAM (including “mobile workstations” and high-end Macs) are considered adequate for the described index.

Ecosystem and alternatives

Alternatives mentioned: pgvector in Postgres, hosted vector services, and hybrid models like ColBERT.
Some ask about best vector DB choices; advice includes starting with vector extensions in existing databases for simplicity.
Wikipedia database dumps and precomputed embedding datasets are highlighted as free resources.
