Succinct data structures
Core concepts and rank/select techniques
- Many comments focus on how constant-time rank and select are achieved on bitvectors.
- One explanation distinguishes dense vs. sparse bitvectors:
- Dense: partition into blocks and sub-blocks, store prefix sums (indexes) per block, then use word-level popcount to finish.
- Sparse: treat set bits as a sorted list of integers and use Elias–Fano encoding; the “high bits” are dense and can reuse the dense machinery.
- Hardware support (e.g., POPCNT) is emphasized as a key enabler for practical implementations.
Succinct vs. compact; theory vs. practice
- Several comments clarify the formal definition: space = information-theoretic minimum + o(n) bits.
- 30% overhead would typically be considered “compact,” not “succinct.”
- There’s significant discussion that asymptotic bounds can be misleading:
- Theoretical parameter choices (b1 = log²n, b2 = log n, etc.) give O(1) queries and sublinear overhead but are often too slow in practice.
- Real implementations fix constant-sized blocks tuned to word size and cache; overhead then becomes linear but small.
- Some “succinct” operations, like true O(1) select, are rarely implemented exactly due to their complexity.
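A back-of-envelope calculation makes the theory-vs-practice point concrete. The sketch below (my own illustration, not from the thread) compares the textbook two-level rank index, with superblocks of log²n bits and blocks of (log n)/2 bits, against a practical fixed layout of one 64-bit counter per 512-bit block:

```python
import math

def theoretical_overhead(n):
    # superblocks of log^2(n) bits store absolute prefix counts
    # (log n bits each); blocks of (log n)/2 bits store relative
    # offsets (log(log^2 n) bits each); lookup tables do the rest.
    lg = math.log2(n)
    super_bits = (n / lg**2) * lg                    # Theta(n / log n)
    block_bits = (n / (lg / 2)) * math.log2(lg**2)   # Theta(n loglog n / log n)
    return (super_bits + block_bits) / n             # o(1): -> 0 as n grows

def practical_overhead(counter_bits=64, block_bits=512):
    # fixed 512-bit blocks with one 64-bit cumulative counter each:
    # linear overhead, but a small constant (12.5%)
    return counter_bits / block_bits

for n in (2**20, 2**30, 2**40):
    print(n, round(theoretical_overhead(n), 3), practical_overhead())
```

The sublinear term shrinks as promised, but at realistic sizes (even n = 2⁴⁰ bits) it is still several times larger than the fixed 12.5% of the practical layout, which is why real implementations pick constant block sizes.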
Implementations, libraries, and algorithms
- Multiple libraries and codebases are mentioned: Haskell packages, Zig implementation (minimal perfect hashes, constant-time select), SDSL-lite, Sux/Sux4J, marisa-trie, others.
- Recent work on minimal perfect hash functions (RecSplit, Consensus-RecSplit, PtrHash) pushes memory use close to theoretical limits (≈1.4–4 bits/key) with varying speed tradeoffs.
Performance, hardware, and when they help
- Succinct data structures can be slower than conventional ones when everything already fits comfortably in RAM.
- They shine when:
- Datasets are huge (e.g., genomics) and asymptotics matter.
- Keeping data within LLC / a single NUMA node avoids latency.
- Memory costs (especially in the cloud) dominate.
- Several comments note the tension between memory footprint, cache behavior, and vectorization; the impact on SIMD/vector ops is raised but left unresolved in the thread.
Applications and practice
- Real-world uses mentioned:
- FM-index–based tools in bioinformatics (e.g., read aligners).
- Large graph processing (via succinct libraries under the hood).
- Tries for dictionaries (e.g., marisa-trie) with large memory savings but some slowdown.
- One commenter reports very small, fast balanced-parentheses trees used in practice; another independently reinvented that idea.
Trees, XML/JSON, and streaming
- Discussion around balanced-parentheses encoding:
- It captures tree topology; payload data is stored separately in traversal order.
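The balanced-parentheses idea can be shown in a few lines. In this toy sketch (function and variable names are my own), `1` is an opening parenthesis, `0` a closing one, and node payloads go into a parallel array in the same preorder traversal:

```python
def encode_bp(tree, bits=None, labels=None):
    """tree is (label, [children]); returns (bits, labels) in preorder."""
    if bits is None:
        bits, labels = [], []
    label, children = tree
    bits.append(1)        # '(' : enter the node
    labels.append(label)  # payload stored separately, in traversal order
    for child in children:
        encode_bp(child, bits, labels)
    bits.append(0)        # ')' : leave the node
    return bits, labels

# tree: a with children b and c, where c has child d
t = ("a", [("b", []), ("c", [("d", [])])])
bits, labels = encode_bp(t)
# bits == [1, 1, 0, 1, 1, 0, 0, 0], i.e. (()(()))
```

Topology costs 2 bits per node instead of two or more pointers per node; navigation (parent, subtree size, etc.) is then done with rank/select and matching-parenthesis queries over the bitvector.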
- Several comments argue that in many “huge file” scenarios, streaming (SAX/StAX/JSON streaming, partial unmarshalling) is the first, simpler optimization.
- Others counter that for complex hierarchical formats, streaming APIs or protobuf streaming are often missing or awkward, making in-memory, compact structures attractive.
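For the common case where streaming does work, the standard library already suffices; this sketch uses Python's incremental XML parser (`xml.etree.ElementTree.iterparse`) on a made-up `<records>` document to process elements one at a time with bounded memory:

```python
import io
import xml.etree.ElementTree as ET

# stand-in for a huge file on disk (element names are illustrative)
doc = io.BytesIO(b"<records><r v='1'/><r v='2'/><r v='3'/></records>")

total = 0
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "r":
        total += int(elem.get("v"))
        elem.clear()  # drop the finished subtree so memory stays bounded
print(total)  # 6
```

This works well for flat, record-like data; the counterargument in the thread is about deeply hierarchical formats where no such event boundary exists.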
Miscellaneous
- Some debate article style (wordy vs. readable).
- There’s side discussion on terminology (“wavelet” naming, the recency of the field).
- A long tangent develops around the evolving usage of the word “literally.”