Succinct data structures

Core concepts and rank/select techniques

  • Many comments focus on how constant-time rank and select are achieved on bitvectors.
  • One explanation distinguishes dense vs. sparse bitvectors:
    • Dense: partition into blocks and sub-blocks, store cumulative 1-bit counts (rank samples) per block, then use word-level popcount to finish.
    • Sparse: treat set bits as a sorted list of integers and use Elias–Fano encoding; the “high bits” are dense and can reuse the dense machinery.
  • Hardware support (e.g., POPCNT) is emphasized as a key enabler for practical implementations.
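
The dense-bitvector scheme above can be sketched in a few lines. This is a simplified illustration, not any particular library's layout; the block size and the `RankBitvector` name are hypothetical choices, and Python's `bin(x).count("1")` stands in for a native POPCNT instruction:

```python
# Sketch: rank1 on a dense bitvector via per-block prefix counts
# plus word-level popcount. One counter per block makes the query
# a table lookup, a short scan within the block, and one popcount.

WORDS_PER_BLOCK = 8  # 512-bit blocks; a hypothetical tuning choice

class RankBitvector:
    def __init__(self, bits):
        self.n = len(bits)
        nwords = self.n // 64 + 1  # one spare word simplifies rank1(n)
        self.words = [0] * nwords
        for i, b in enumerate(bits):
            if b:
                self.words[i // 64] |= 1 << (i % 64)
        # Cumulative 1-bit count stored at each block boundary.
        self.block_rank = []
        total = 0
        for w in range(0, nwords, WORDS_PER_BLOCK):
            self.block_rank.append(total)
            total += sum(bin(x).count("1")
                         for x in self.words[w:w + WORDS_PER_BLOCK])

    def rank1(self, i):
        """Number of 1-bits in positions [0, i)."""
        word = i // 64
        block = word // WORDS_PER_BLOCK
        r = self.block_rank[block]
        # Whole words between the block boundary and position i...
        for w in range(block * WORDS_PER_BLOCK, word):
            r += bin(self.words[w]).count("1")
        # ...plus the partial word, masked to the low i % 64 bits
        # (a single POPCNT in a native implementation).
        r += bin(self.words[word] & ((1 << (i % 64)) - 1)).count("1")
        return r
```

For example, `RankBitvector([1, 0, 1, 1]).rank1(3)` returns `2`. The sparse (Elias–Fano) case reduces to this: its dense "high bits" sequence is queried with exactly this kind of rank/select machinery.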

Succinct vs. compact; theory vs. practice

  • Several comments clarify the formal definition: space = Z + o(Z) bits, where Z is the information-theoretic minimum (for a length-n bitvector, n + o(n) bits).
  • A fixed 30% overhead is O(Z) extra space, so it would typically be considered “compact,” not “succinct.”
  • There’s significant discussion that asymptotic bounds can be misleading:
    • Theoretical parameter choices (b1 = log²n, b2 = log n, etc.) give O(1) queries and sublinear overhead but are often too slow in practice.
    • Real implementations fix constant-sized blocks tuned to word size and cache; overhead then becomes linear but small.
  • Some “succinct” operations, such as truly O(1) select, are rarely implemented exactly; the known constant-time constructions are complex enough that practical libraries approximate them.
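
For context, the accounting behind those theoretical parameter choices (the standard two-level rank directory over a length-$n$ bitvector) works out as follows:

```latex
% Superblocks of b_1 = \log^2 n bits store absolute ranks, \log n bits each:
\frac{n}{\log^2 n} \cdot \log n \;=\; \frac{n}{\log n} \;=\; o(n)
% Blocks of b_2 = \Theta(\log n) bits store ranks relative to their
% superblock, O(\log\log n) bits each:
\frac{n}{b_2} \cdot O(\log\log n) \;=\; O\!\left(\frac{n \log\log n}{\log n}\right) \;=\; o(n)
```

By contrast, a fixed practical layout, say one 64-bit counter per 512-bit block, costs 64/512 = 12.5% of the input: Θ(n) overhead, but small and cache-friendly.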

Implementations, libraries, and algorithms

  • Multiple libraries and codebases are mentioned: Haskell packages, Zig implementation (minimal perfect hashes, constant-time select), SDSL-lite, Sux/Sux4J, marisa-trie, others.
  • Recent work on minimal perfect hash functions (RecSplit, Consensus-RecSplit, PtrHash) pushes memory use close to theoretical limits (≈1.4–4 bits/key) with varying speed tradeoffs.
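
The “theoretical limit” for minimal perfect hashing is the information-theoretic lower bound: a uniformly random function from $n$ keys to $n$ slots is a bijection with probability $n!/n^n$, so by Stirling’s approximation describing one costs at least about

```latex
\log_2 \frac{n^n}{n!} \;\approx\; n \log_2 e \;\approx\; 1.44\, n \text{ bits}
```

which is why ≈1.44 bits/key is the floor the RecSplit family approaches.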

Performance, hardware, and when they help

  • Succinct data structures can be slower than conventional ones when everything already fits comfortably in RAM.
  • They shine when:
    • Datasets are huge (e.g., genomics) and asymptotics matter.
    • Keeping data within LLC / a single NUMA node avoids latency.
    • Memory costs (especially in the cloud) dominate.
  • Several comments note the tension between memory footprint, cache behavior, and vectorization; the impact on SIMD/vector ops is raised in the thread but never really answered.

Applications and practice

  • Real-world uses mentioned:
    • FM-index–based tools in bioinformatics (e.g., read aligners).
    • Large graph processing (via succinct libraries under the hood).
    • Tries for dictionaries (e.g., marisa-trie) with large memory savings but some slowdown.
  • One commenter reports very small, fast balanced-parentheses trees used in practice; another independently reinvented that idea.

Trees, XML/JSON, and streaming

  • Discussion around balanced-parentheses encoding:
    • It captures tree topology; payload data is stored separately in traversal order.
  • Several comments argue that in many “huge file” scenarios, streaming (SAX/StAX/JSON streaming, partial unmarshalling) is the first, simpler optimization.
  • Others counter that for complex hierarchical formats, streaming APIs or protobuf streaming are often missing or awkward, making in-memory, compact structures attractive.
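
The balanced-parentheses idea can be sketched concretely. In this illustrative encoding (a minimal version, not a production layout), each node costs 2 bits: an open parenthesis on preorder entry and a close on exit, while payloads go into a separate array in the same preorder:

```python
# Sketch: balanced-parentheses encoding of tree topology.
# Topology lives in the parenthesis string (2 bits/node); labels are
# stored separately, aligned with the "(" symbols in preorder.

def encode(node, bits, labels):
    """node = (label, [children]); appends to bits/labels in preorder."""
    label, children = node
    bits.append("(")      # enter node
    labels.append(label)  # payload kept out-of-band, in traversal order
    for c in children:
        encode(c, bits, labels)
    bits.append(")")      # leave node

tree = ("a", [("b", []), ("c", [("d", [])])])
bits, labels = [], []
encode(tree, bits, labels)
# "".join(bits) -> "(()(()))"  : 4 nodes, 8 parentheses
# labels        -> ["a", "b", "c", "d"]
```

Navigation (parent, subtree size, i-th child) then reduces to rank/select and find-matching-paren queries on the bit string, which is where the bitvector machinery from earlier comes back in.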

Miscellaneous

  • Some debate article style (wordy vs. readable).
  • There’s side discussion on terminology (“wavelet” naming, the recency of the field).
  • A long tangent develops around the evolving usage of the word “literally.”