2024-07-18

Ask HN: Fast data structures for disjoint intervals?

Problem framing

Data: many disjoint, often adjacent time intervals (usually “free” slots) with day‑level granularity over tens of years.
Core query: find the nearest interval (often in either direction) of at least duration X from a given time.
Scale: thousands of intervals; hundreds of thousands of queries in a real‑time, client‑side, interactive scheduler.
Current baseline: an ordered map (Rust BTreeMap<u32, u32>) that coalesces adjacent intervals; already “very fast” but still a bottleneck for larger instances.

Baseline structures and trade‑offs

Ordered map / sorted vector:
- Many commenters argue a sorted array or ordered map is hard to beat for a few thousand intervals due to cache locality.
- OP’s benchmarks: Vec<(u32,u32)> with binary search is roughly on par or slightly worse than BTreeMap.
Interval trees, R‑trees, KD‑trees, BVHs:
- Often suggested, but multiple experiments (1D R‑tree, BVH‑like structures) failed to outperform a simple ordered map for this workload, especially with updates.
- Interval trees are noted as better for overlapping intervals; here intervals are disjoint.

Structures exploiting extra properties

Priority search trees and range trees:
- Proposed to handle queries with a bound in one dimension (time) and inequality in another (duration).
- Range trees with per‑node “max gap” are highlighted as a way to avoid linear scans over many too‑small intervals and still get O(log n) queries.
Augmented trees / monoid‑cached trees:
- Idea: store additional metadata per node (e.g., longest free/occupied block under that node) to jump directly to intervals long enough for duration X.
- Requires custom tree or hacking a B‑tree; no ready‑made Rust library identified.

Bitsets, roaring bitmaps, and SIMD

Dense bitsets and roaring bitmaps are discussed as allocator‑like representations.
OP’s tests: dense bitsets underperformed the BTreeMap due to time spent searching within words; roaring may help in some sparsity regimes, but API and run‑length needs are a concern.
Several bit‑hack/SIMD schemes are proposed to find long runs of 1s, possibly aided by multiresolution summaries.

Higher‑level design ideas

Analogies to malloc and job‑shop/RCPSP scheduling: problem is structurally hard; data‑specific tuning may matter more than asymptotics.
Suggestions include:
- Multi‑bucket free lists by size class (e.g., ≥2^n).
- Coarse “summary” arrays or trees over time buckets to skip hopeless regions.
- Separate structures for free vs used intervals or connected “clusters” of use.
Some skepticism that anything can significantly beat a well‑tuned B‑tree without custom, domain‑specific augmentation.

Related topics