Vector indexing all of Wikipedia on a laptop

Cost and approach to embedding Wikipedia

  • Several commenters say they have embedded all of English Wikipedia for around $10 of GPU time on Colab in ~8 hours, using lightweight open models, vs the article’s $5,000 estimate.
  • Key reasons for the discrepancy:
    • The article indexes 300+ languages, not just English.
    • It uses a paid embedding API priced per million tokens.
    • Some see this as an expensive choice given open-source options.
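
The gap between the two figures comes down to two different pricing models, which a few lines of arithmetic make concrete. This is a rough sketch only: the ~$10 and ~8 hours come from the commenters, but the token count and per-million-token price in the example are placeholders chosen so the result lands near the article's figure. They are not taken from the article or from any provider's price list.

```python
def gpu_embedding_cost(hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of self-hosted embedding: pay for wall-clock GPU time."""
    return hours * usd_per_gpu_hour


def api_embedding_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a hosted embedding API: pay per million input tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Commenters' figure: roughly $10 over roughly 8 hours implies about $1.25/hour.
colab_cost = gpu_embedding_cost(hours=8, usd_per_gpu_hour=1.25)

# HYPOTHETICAL inputs: 50 billion tokens at $0.10 per million tokens.
# Both numbers are illustrative assumptions, not the article's actual values.
api_cost = api_embedding_cost(total_tokens=50_000_000_000,
                              usd_per_million_tokens=0.10)
```

The API cost scales linearly with corpus size, so adding languages multiplies it directly, while the GPU cost depends mostly on model throughput.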

Chunking strategies and context length

  • People agree you must chunk articles; you can’t embed “all of Wikipedia” as one vector.
  • One practical approach: split into sentences and accumulate until a context-window limit, then pool chunk vectors for a per-article embedding.
  • Debate over large context windows:
    • Some argue bigger windows are an efficiency win.
    • Others claim embedding quality degrades at long context lengths and that large chunks hurt retrieval precision, though large chunks may better capture long-range semantics.
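
The sentence-accumulation approach described above can be sketched as follows. This is a minimal illustration, not any commenter's actual code: the regex sentence splitter, the whitespace token counter, and the choice of mean pooling are all assumptions. A real pipeline would use the embedding model's own tokenizer and might pool differently.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(
    sentences: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Accumulate whole sentences into chunks of at most max_tokens.

    count_tokens defaults to a whitespace word count (an assumption).
    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the running chunk if this sentence would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average chunk vectors element-wise into one per-article vector."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Each chunk would be passed to the embedding model, and `mean_pool` applied to the resulting vectors. Keeping the per-chunk vectors as well preserves the finer-grained retrieval that the large-chunk critics care about.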

Proprietary vs open embeddings and retrieval quality

  • Skepticism about relying on proprietary embedding services at this early stage, given rapid model turnover and opaque training.
  • Others note many businesses are comfortable with closed models and may use such datasets as a low-cost evaluation basis.
  • Several commenters caution that “embeddings aren’t magic”: the similarity objective a model was trained on needs to match the retrieval task, and “semantic meaning” is a vague target that is not necessarily tuned for retrieval.

Vector search vs traditional information retrieval

  • Multiple comments stress that classic IR (BM25, keyword search, metadata) remains important.
  • A common mature pattern: use keyword/metadata search for first-pass recall, then embeddings for reranking or catching misses.
  • Critique of “vector-only” systems built quickly for demos; they may be weaker than hybrid approaches.
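
The keyword-first, embeddings-second pattern can be sketched in a few functions. This is a simplified illustration under stated assumptions: the first pass uses plain term overlap as a stand-in for BM25, and the document and query vectors are assumed to have been computed elsewhere by some embedding model. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def keyword_candidates(query: str, docs: Dict[str, str], limit: int) -> List[str]:
    """First pass: rank documents by how many query terms they contain.

    Term overlap is a crude stand-in for BM25, used only to keep the sketch short.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(
    query_vec: Sequence[float],
    candidates: Sequence[str],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[str]:
    """Second pass: reorder keyword candidates by embedding similarity."""
    ranked = sorted(
        candidates,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]
```

The first pass keeps the candidate set small and cheap to score, so the embedding comparison only runs over documents that already matched on terms or metadata.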

JVector, large indexes, and segmentation

  • JVector supports building indexes larger than RAM by using vectors compressed with product quantization (PQ) during index construction while keeping edge lists in memory.
  • Reported benchmarks show near-zero accuracy loss vs building with raw vectors.
  • Comparison to DiskANN: JVector compresses vectors incrementally as the index is built, whereas DiskANN partitions the data and merges the partial indexes, which increases construction cost.
  • Complexity of segmented search is debated; some argue theoretical O(N) vs O(log N) analysis is less useful than empirical performance.
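
To make the segmented-search debate concrete, here is a minimal sketch of what querying a segmented index involves, assuming "segmented" means the index is split into independently searchable pieces whose results are merged. The brute-force scan inside each segment is a stand-in for a per-segment graph search, and none of this reflects JVector's or DiskANN's actual code.

```python
import heapq
from typing import List, Sequence, Tuple

Segment = Sequence[Tuple[str, Sequence[float]]]  # (doc id, vector) pairs


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_segment(segment: Segment, query: Sequence[float], k: int) -> List[Tuple[float, str]]:
    """Return the k nearest (distance, doc id) pairs within one segment.

    Brute force here; a real index would run an approximate graph search.
    """
    return heapq.nsmallest(
        k, ((squared_l2(vec, query), doc_id) for doc_id, vec in segment)
    )


def search_all_segments(
    segments: Sequence[Segment], query: Sequence[float], k: int
) -> List[Tuple[float, str]]:
    """Query every segment for its local top k, then merge to a global top k."""
    partial: List[Tuple[float, str]] = []
    for segment in segments:
        partial.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, partial)
```

The loop over segments is where the per-query cost grows with segment count, which is the theoretical concern. Whether that matters in practice depends on how many segments exist and how fast each per-segment search is, which is the empirical counterargument.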

System behavior: swap, JVM, and hardware

  • Several comments note Linux may swap even with free RAM, heavily affecting performance.
  • Workarounds discussed: disabling swap, tuning swappiness, using mlock, or careful JVM heap sizing.
  • Laptops with 36–64GB+ RAM (including “mobile workstations” and high-end Macs) are considered adequate for the described index.
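
Whether a given amount of RAM is adequate is mostly arithmetic over vector count, dimensionality, and graph degree. The sketch below is a back-of-the-envelope estimate only. It assumes float32 raw vectors, one byte per PQ subspace code (that is, 256 centroids per subspace), and 4-byte neighbor ids. The example corpus size, dimension, subspace count, and degree are hypothetical and are not figures from the article or from JVector.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Memory for uncompressed vectors (float32 by default)."""
    return n_vectors * dim * bytes_per_float


def pq_vector_bytes(n_vectors: int, subspaces: int, bytes_per_code: int = 1) -> int:
    """Memory for PQ codes: one small code per subspace per vector.

    Excludes the codebooks themselves, which are small relative to the codes.
    """
    return n_vectors * subspaces * bytes_per_code


def edge_list_bytes(n_vectors: int, max_degree: int, bytes_per_id: int = 4) -> int:
    """Upper bound on graph memory: max_degree neighbor ids per node."""
    return n_vectors * max_degree * bytes_per_id


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# HYPOTHETICAL sizing inputs, for illustration only.
N, DIM, SUBSPACES, DEGREE = 10_000_000, 1024, 64, 32

raw_gib = to_gib(raw_vector_bytes(N, DIM))
compressed_gib = to_gib(pq_vector_bytes(N, SUBSPACES) + edge_list_bytes(N, DEGREE))
```

Under these assumptions the compressed vectors plus edge lists are a small fraction of the raw vectors, which is why construction with PQ can fit in memory when the raw data cannot.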

Ecosystem and alternatives

  • Alternatives mentioned: pgvector in Postgres, hosted vector services, and hybrid models like ColBERT.
  • Some ask about best vector DB choices; advice includes starting with vector extensions in existing databases for simplicity.
  • Wikipedia database dumps and precomputed embedding datasets are highlighted as free resources.