Ask HN: Fast data structures for disjoint intervals?
Problem framing
- Data: many disjoint, often adjacent time intervals (usually “free” slots) with day‑level granularity over tens of years.
- Core query: find the nearest interval (often in either direction) of at least duration X from a given time.
- Scale: thousands of intervals; hundreds of thousands of queries in a real‑time, client‑side, interactive scheduler.
- Current baseline: an ordered map (Rust
BTreeMap<u32, u32>) that coalesces adjacent intervals; already “very fast” but still a bottleneck for larger instances.
Baseline structures and trade‑offs
- Ordered map / sorted vector:
- Many commenters argue a sorted array or ordered map is hard to beat for a few thousand intervals due to cache locality.
- OP’s benchmarks:
Vec<(u32,u32)>with binary search is roughly on par or slightly worse thanBTreeMap.
- Interval trees, R‑trees, KD‑trees, BVHs:
- Often suggested, but multiple experiments (1D R‑tree, BVH‑like structures) failed to outperform a simple ordered map for this workload, especially with updates.
- Interval trees are noted as better for overlapping intervals; here intervals are disjoint.
Structures exploiting extra properties
- Priority search trees and range trees:
- Proposed to handle queries with a bound in one dimension (time) and inequality in another (duration).
- Range trees with per‑node “max gap” are highlighted as a way to avoid linear scans over many too‑small intervals and still get O(log n) queries.
- Augmented trees / monoid‑cached trees:
- Idea: store additional metadata per node (e.g., longest free/occupied block under that node) to jump directly to intervals long enough for duration X.
- Requires custom tree or hacking a B‑tree; no ready‑made Rust library identified.
Bitsets, roaring bitmaps, and SIMD
- Dense bitsets and roaring bitmaps are discussed as allocator‑like representations.
- OP’s tests: dense bitsets underperformed the
BTreeMapdue to time spent searching within words; roaring may help in some sparsity regimes, but API and run‑length needs are a concern. - Several bit‑hack/SIMD schemes are proposed to find long runs of 1s, possibly aided by multiresolution summaries.
Higher‑level design ideas
- Analogies to
mallocand job‑shop/RCPSP scheduling: problem is structurally hard; data‑specific tuning may matter more than asymptotics. - Suggestions include:
- Multi‑bucket free lists by size class (e.g., ≥2^n).
- Coarse “summary” arrays or trees over time buckets to skip hopeless regions.
- Separate structures for free vs used intervals or connected “clusters” of use.
- Some skepticism that anything can significantly beat a well‑tuned B‑tree without custom, domain‑specific augmentation.