Succinct data structures

Core concepts and rank/select techniques

  • Many comments focus on how constant-time rank and select are achieved on bitvectors.
  • One explanation distinguishes dense vs. sparse bitvectors:
    • Dense: partition into blocks and sub-blocks, store cumulative 1-bit counts (rank samples) per block, then use word-level popcount to finish.
    • Sparse: treat set bits as a sorted list of integers and use Elias–Fano encoding; the “high bits” are dense and can reuse the dense machinery.
  • Hardware support (e.g., POPCNT) is emphasized as a key enabler for practical implementations.
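
The dense-bitvector scheme above can be sketched in a few lines. This is a simplified illustration, not any particular library's layout; the block size and the `RankBitvector` name are hypothetical choices, and Python's `bin(x).count("1")` stands in for a native POPCNT instruction:

```python
# Sketch: rank1 on a dense bitvector via per-block prefix counts
# plus word-level popcount. One counter per block makes the query
# a table lookup, a short scan within the block, and one popcount.

WORDS_PER_BLOCK = 8  # 512-bit blocks; a hypothetical tuning choice

class RankBitvector:
    def __init__(self, bits):
        self.n = len(bits)
        nwords = self.n // 64 + 1  # one spare word simplifies rank1(n)
        self.words = [0] * nwords
        for i, b in enumerate(bits):
            if b:
                self.words[i // 64] |= 1 << (i % 64)
        # Cumulative 1-bit count stored at each block boundary.
        self.block_rank = []
        total = 0
        for w in range(0, nwords, WORDS_PER_BLOCK):
            self.block_rank.append(total)
            total += sum(bin(x).count("1")
                         for x in self.words[w:w + WORDS_PER_BLOCK])

    def rank1(self, i):
        """Number of 1-bits in positions [0, i)."""
        word = i // 64
        block = word // WORDS_PER_BLOCK
        r = self.block_rank[block]
        # Whole words between the block boundary and position i...
        for w in range(block * WORDS_PER_BLOCK, word):
            r += bin(self.words[w]).count("1")
        # ...plus the partial word, masked to the low i % 64 bits
        # (a single POPCNT in a native implementation).
        r += bin(self.words[word] & ((1 << (i % 64)) - 1)).count("1")
        return r
```

For example, `RankBitvector([1, 0, 1, 1]).rank1(3)` returns `2`. The sparse (Elias–Fano) case reduces to this: its dense "high bits" sequence is queried with exactly this kind of rank/select machinery.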

Succinct vs. compact; theory vs. practice

  • Several comments clarify the formal definition: space = Z + o(Z) bits, where Z is the information-theoretic minimum (for a length-n bitvector, n + o(n) bits).
  • A fixed 30% overhead is O(Z) extra space, so it would typically be considered “compact,” not “succinct.”
  • There’s significant discussion that asymptotic bounds can be misleading:
    • Theoretical parameter choices (b1 = log²n, b2 = log n, etc.) give O(1) queries and sublinear overhead but are often too slow in practice.
    • Real implementations fix constant-sized blocks tuned to word size and cache; overhead then becomes linear but small.
  • Some “succinct” operations, such as truly O(1) select, are rarely implemented exactly; the known constant-time constructions are complex enough that practical libraries approximate them.
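
For context, the accounting behind those theoretical parameter choices (the standard two-level rank directory over a length-$n$ bitvector) works out as follows:

```latex
% Superblocks of b_1 = \log^2 n bits store absolute ranks, \log n bits each:
\frac{n}{\log^2 n} \cdot \log n \;=\; \frac{n}{\log n} \;=\; o(n)
% Blocks of b_2 = \Theta(\log n) bits store ranks relative to their
% superblock, O(\log\log n) bits each:
\frac{n}{b_2} \cdot O(\log\log n) \;=\; O\!\left(\frac{n \log\log n}{\log n}\right) \;=\; o(n)
```

By contrast, a fixed practical layout, say one 64-bit counter per 512-bit block, costs 64/512 = 12.5% of the input: Θ(n) overhead, but small and cache-friendly.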

Implementations, libraries, and algorithms

  • Multiple libraries and codebases are mentioned: Haskell packages, Zig implementation (minimal perfect hashes, constant-time select), SDSL-lite, Sux/Sux4J, marisa-trie, others.
  • Recent work on minimal perfect hash functions (RecSplit, Consensus-RecSplit, PtrHash) pushes memory use close to theoretical limits (≈1.4–4 bits/key) with varying speed tradeoffs.
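
The “theoretical limit” for minimal perfect hashing is the information-theoretic lower bound: a uniformly random function from $n$ keys to $n$ slots is a bijection with probability $n!/n^n$, so by Stirling’s approximation describing one costs at least about

```latex
\log_2 \frac{n^n}{n!} \;\approx\; n \log_2 e \;\approx\; 1.44\, n \text{ bits}
```

which is why ≈1.44 bits/key is the floor the RecSplit family approaches.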

Performance, hardware, and when they help

  • Succinct data structures can be slower than conventional ones when everything already fits comfortably in RAM.
  • They shine when:
    • Datasets are huge (e.g., genomics) and asymptotics matter.
    • Keeping data within LLC / a single NUMA node avoids latency.
    • Memory costs (especially in the cloud) dominate.
  • Several comments note the tension between memory footprint, cache behavior, and vectorization; the impact on SIMD/vector ops is raised in the thread but never really answered.

Applications and practice

  • Real-world uses mentioned:
    • FM-index–based tools in bioinformatics (e.g., read aligners).
    • Large graph processing (via succinct libraries under the hood).
    • Tries for dictionaries (e.g., marisa-trie) with large memory savings but some slowdown.
  • One commenter reports very small, fast balanced-parentheses trees used in practice; another independently reinvented that idea.

Trees, XML/JSON, and streaming

  • Discussion around balanced-parentheses encoding:
    • It captures tree topology; payload data is stored separately in traversal order.
  • Several comments argue that in many “huge file” scenarios, streaming (SAX/StAX/JSON streaming, partial unmarshalling) is the first, simpler optimization.
  • Others counter that for complex hierarchical formats, streaming APIs or protobuf streaming are often missing or awkward, making in-memory, compact structures attractive.
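
The balanced-parentheses idea can be sketched concretely. In this illustrative encoding (a minimal version, not a production layout), each node costs 2 bits: an open parenthesis on preorder entry and a close on exit, while payloads go into a separate array in the same preorder:

```python
# Sketch: balanced-parentheses encoding of tree topology.
# Topology lives in the parenthesis string (2 bits/node); labels are
# stored separately, aligned with the "(" symbols in preorder.

def encode(node, bits, labels):
    """node = (label, [children]); appends to bits/labels in preorder."""
    label, children = node
    bits.append("(")      # enter node
    labels.append(label)  # payload kept out-of-band, in traversal order
    for c in children:
        encode(c, bits, labels)
    bits.append(")")      # leave node

tree = ("a", [("b", []), ("c", [("d", [])])])
bits, labels = [], []
encode(tree, bits, labels)
# "".join(bits) -> "(()(()))"  : 4 nodes, 8 parentheses
# labels        -> ["a", "b", "c", "d"]
```

Navigation (parent, subtree size, i-th child) then reduces to rank/select and find-matching-paren queries on the bit string, which is where the bitvector machinery from earlier comes back in.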

Miscellaneous

  • Some debate article style (wordy vs. readable).
  • There’s side discussion on terminology (“wavelet” naming, the recency of the field).
  • A long tangent develops around the evolving usage of the word “literally.”