Undergraduate shows that searches within hash tables can be much faster
What the new hash table result actually is
- Commenters clarify: the result gives worst-case expected probe counts on the order of (log(1/δ))² for queries and insertions in open-addressing hash tables, where δ is the fraction of slots still empty; for nearly full tables this is far below the prior Ω(1/δ)-type bounds.
- High‑level idea (from paper/talk summaries): conceptually split the table into exponentially smaller regions (“layers” or “funnels”). Insertions probe a bounded number of slots in each layer, possibly skipping early free slots, and “funnel” down to smaller layers.
- This deliberately non‑greedy probing breaks a long‑standing conjectured lower bound for greedy schemes, while preserving constant average query time even near full capacity.
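The layered idea described above can be sketched in a few lines. This is a heavily simplified illustration (the hypothetical `FunnelTable` class, level sizes, and the fixed probe budget are my choices; the paper's actual construction, probe budgets, and overflow handling differ):

```python
class FunnelTable:
    """Simplified sketch of layered ("funnel") probing: the table is a
    stack of exponentially smaller levels, and each level gets only a
    small fixed probe budget before the insertion moves down a level.
    Not the paper's exact construction."""

    def __init__(self, capacity=1024, probes_per_level=4):
        self.levels = []
        size = capacity
        while size >= probes_per_level:
            self.levels.append([None] * size)  # each level half the size
            size //= 2
        self.probes_per_level = probes_per_level

    def _slot(self, key, level_idx, j):
        # j-th probe position for this key within the given level
        level = self.levels[level_idx]
        return hash((key, level_idx, j)) % len(level)

    def insert(self, key, value):
        for li, level in enumerate(self.levels):
            for j in range(self.probes_per_level):
                slot = self._slot(key, li, j)
                if level[slot] is None or level[slot][0] == key:
                    level[slot] = (key, value)
                    return True
        return False  # every level exhausted its probe budget

    def get(self, key):
        # Queries retrace the same deterministic probe sequence.
        for li, level in enumerate(self.levels):
            for j in range(self.probes_per_level):
                entry = self.levels[li][self._slot(key, li, j)]
                if entry is not None and entry[0] == key:
                    return entry[1]
        return None
```

Unlike greedy linear probing, each level gets only a bounded probe budget, so an insertion can bypass a crowded level and "funnel" down rather than scanning a long run of occupied slots.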
Big‑O, models of computation, and “O(1) hash tables”
- Large subthread debates whether hash tables are really O(1):
  - One side argues that in pure big‑O, as keyspace and table grow, hashing and comparisons require O(log n) work (word size, key length), so hash tables and trees both pick up extra log factors.
  - Others counter that big‑O is always relative to an assumed cost model (word‑RAM, hashing oracle, external memory, etc.), and in that standard model, “one hash + one memory access” is taken as O(1).
- Related discussion on how these abstract models ignore cache hierarchies and memory latency, so asymptotics alone don’t predict real performance.
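The "hashing itself is not free" side of the debate can be made concrete with back-of-the-envelope accounting in the word-RAM model (the helper names and the one-mixing-step-per-word assumption here are illustrative, not from any comment):

```python
import math

WORD_BITS = 64  # machine word size assumed by the word-RAM model

def words_per_key(key_bits: int) -> int:
    """Machine words needed to hold a key of key_bits bits."""
    return max(1, math.ceil(key_bits / WORD_BITS))

def hash_cost(key_bits: int) -> int:
    """Word operations to hash a key, assuming one mixing step per word.
    This is constant only while keys fit in O(1) machine words."""
    return words_per_key(key_bits)

def min_hash_cost(n_keys: int) -> int:
    """Addressing n distinct keys needs keys of at least log2(n) bits,
    so strict accounting charges Theta(log(n)/WORD_BITS) per hash."""
    return hash_cost(math.ceil(math.log2(n_keys)))
```

This is the crux of the disagreement: with fixed-width keys the cost is a constant, but if the keyspace grows with n, the "constant" hash silently carries a log factor.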
Practical impact and implementations
- Several note that typical production hash tables resize well before they get “almost full” (e.g., at 75–87.5% load), keeping worst‑case probe sequences short; over‑allocating is often simpler and faster.
- Concern that the new structures may have poor locality and large constants, making them “galactic” algorithms: asymptotically better but slower for all realistic sizes.
- Others point out niche value: nearly-full, non‑resizable tables or memory‑constrained settings (databases, large in‑memory indexes).
- Links are shared to early PoC implementations of funnel/elastic hashing and to other high‑performance tables (Swiss tables, Robin Hood, bucketed designs) for comparison; some skepticism that uniform‑probing tables like in the paper are actually used in production.
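For contrast with the near-full regime the paper targets, here is a minimal sketch of the usual production strategy (a linear-probing map that doubles at a load threshold; the class name and the 0.75 threshold are illustrative):

```python
class ResizingTable:
    """Minimal linear-probing map that resizes at 75% load: a sketch of
    the common production strategy of never letting the table get
    "almost full" in the first place."""
    MAX_LOAD = 0.75

    def __init__(self):
        self.slots = [None] * 8
        self.count = 0

    def _find(self, key, slots):
        # Linear probing: scan forward until the key or an empty slot.
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def put(self, key, value):
        if (self.count + 1) / len(self.slots) > self.MAX_LOAD:
            old = [e for e in self.slots if e is not None]
            self.slots = [None] * (len(self.slots) * 2)
            for k, v in old:  # rehash into the doubled array
                self.slots[self._find(k, self.slots)] = (k, v)
        i = self._find(key, self.slots)
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)

    def get(self, key):
        e = self.slots[self._find(key, self.slots)]
        return e[1] if e else None
```

Because δ (the empty fraction) never drops below 0.25 here, even the old Θ(1/δ) probe bounds stay small, which is why commenters question how often the paper's near-full regime arises in practice.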
Article presentation and terminology
- Multiple readers complain Quanta’s piece is heavy on human‑interest and light on the actual mechanism, with no real pseudocode, benchmarks, or clarity on when it helps.
- Some object to calling this “data science” rather than theoretical computer science/data structures, speculating the term came from misreading arXiv’s cs.DS tag or editorial trendiness.
- Others defend the profile style as celebrating a notable undergraduate result and inspiring students.
AI, originality, and prior work
- One commenter tried major LLMs on the open problem; none rediscovered the technique, prompting remarks that “human‑driven computer science is still safe.”
- Thread splits between:
  - Those who see this as evidence that fresh human intuition still beats current AI on deep theory.
  - Those who note AI has improved other algorithms and could plausibly find such results as models and tooling improve.
- A long meta‑discussion explores whether breakthroughs often come from ignoring prior work vs. building on it:
  - Pro‑ignorance camp cites this result and game‑design anecdotes as benefits of not being trapped by prior “impossibility” lore.
  - Others emphasize survivorship bias: most such attempts just re‑invent or fail; the undergrad was still deeply embedded in current theory and inspired by an earlier “Tiny Pointers” paper, not working in a vacuum.
Monty Hall and counterintuitive proofs
- A huge side‑thread uses the Monty Hall problem as an analogy: a simple, counterintuitive probabilistic result that many smart people initially rejected.
- Debate centers on whether the classic statement is underspecified (host strategy, conditional probabilities) or well‑formed; people argue about independence assumptions and how real‑world behavior affects probabilities.
- This is tied back rhetorically to how counterintuitive results in hashing and complexity can also clash with experts’ heuristics.
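The disputed assumptions can be pinned down in code. A quick simulation of the standard formulation (host knows the prize and always opens a non-picked goat door; the function below is my own illustration, not from the thread) shows the 2/3 vs 1/3 split; change the host's strategy and the numbers change, which is exactly the underspecification commenters argue about:

```python
import random

def monty_hall_trials(n=100_000, switch=True, rng=None):
    """Win fraction under the standard Monty Hall rules: the host always
    opens a door that is neither the player's pick nor the prize."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    wins = 0
    for _ in range(n):
        prize = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a non-picked, non-prize door. Ties (pick == prize)
        # are broken deterministically here; the host's tie-breaking rule
        # only matters for conditional probabilities, one of the points
        # people debate.
        opened = next(d for d in range(3) if d != pick and d != prize)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / n
```

Switching wins about 2/3 of the time, staying about 1/3; the counterintuitive part is that the host's constrained reveal transfers information, much as commenters say counterintuitive hashing bounds clash with experts' heuristics.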
Other notable discussion points
- Some ask for Rust or other language implementations; answers range from “be the change” to skepticism about AI‑generated implementations for a “fiddly and novel” algorithm.
- Explanation that in some theory subfields author order is alphabetical, so the undergraduate not being “first author” is normal; contribution statements are suggested as clearer practice.
- One commenter highlights that the work was NSF‑funded and raises concern that proposed cuts to such funding would reduce exactly these kinds of foundational advances.