A lost decade chasing distributed architectures for data analytics?
Small vs. “Big” Data and Hardware Reality
- Many commenters report that most real-world “big data” workloads measure in the gigabytes to low terabytes and often fit comfortably on a single modern server or VM (sometimes even a 2012-era laptop).
- NVMe and large-RAM machines make single-node analytics viable for far more use cases than the “web-scale” narrative suggested.
- Some note that median and even 99.9th-percentile Redshift/Snowflake scan sizes are modest, but others argue those small reads partly reflect users contorting workloads around platform limitations.
DuckDB, Small Data, and Analyst Ergonomics
- DuckDB is praised for revolutionizing workflow more than raw capability: easy local analysis, SQL joins, and integration with notebooks and Parquet (a minimal sketch follows this list).
- Comparisons are usually to pandas/dplyr/Polars: DuckDB is seen as more convenient for joins and for datasets that approach or exceed available RAM, though R’s data.table and dplyr remain strong for in-memory work.
- Critics stress DuckDB’s sweet spot: static or slowly changing data, few writers, small-ish total datasets, and tolerable multi‑second latencies.
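To make the ergonomics point concrete, here is a minimal sketch of the kind of local workflow commenters describe: querying Parquet files with SQL from an in-process database and handing the result to a notebook. The file names and columns (orders.parquet, customers.parquet, customer_id, amount) are hypothetical.

```python
import duckdb

# In-process, in-memory database: no server to run, nothing to deploy.
con = duckdb.connect()

# Join two local Parquet files directly with SQL and pull the result
# back into pandas for use in a notebook.
top_customers = con.execute("""
    SELECT c.name, count(*) AS n_orders, sum(o.amount) AS total_spend
    FROM 'orders.parquet'    AS o
    JOIN 'customers.parquet' AS c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spend DESC
    LIMIT 10
""").df()

print(top_customers)
```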
Database Choice: More Than Query Speed
- One side argues databases must fit into a broader ecosystem: governance, operations, compliance, collaboration, and business processes often dominate over pure performance.
- Others counter that a database’s core job is reliable storage and fast queries, and everything else is layered on top.
SQL vs. NoSQL / JSON Stores
- Several comments revisit the long-running “relational vs. hierarchical/JSON” debate:
  - Pro-SQL voices cite relational algebra, flexibility of querying, and historical cycles (network/XML/JSON DBs repeatedly losing ground).
  - Defenders of MongoDB/Cassandra note they solve real problems, have strong commercial traction, and are appropriate when schemas are uncontrolled or application-defined.
- There is pushback against using company revenue as proof of technical merit; success is seen as weak evidence of architectural soundness.
Distributed Stacks, Spark, and Scala
- Multiple practitioners report being forced onto Spark/Scala “big data” stacks for sub-GB feeds, describing them as slow to develop, operationally heavy, and unnecessary for most jobs.
- Others reply that:
  - Centralized clusters solve governance/productionization problems (no copying sensitive data to laptops).
  - Single-node Spark is possible, and you may someday need to scale without rewriting (see the sketch after this list).
- Opinions on Scala are polarized: some see it as powerful and innovative; others report painful experiences with tooling, compilation speed, and “personal dialects.”
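As a sketch of the “single-node Spark” point: running in `local[*]` mode keeps the whole job in one process while leaving the DataFrame code unchanged if a cluster is ever needed. The input file and column names (events.csv, event_type) are hypothetical.

```python
from pyspark.sql import SparkSession

# Single-node Spark: master("local[*]") runs the whole job in one JVM,
# using all local cores, with no cluster to provision.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("small-feed")
    .getOrCreate()
)

# The same DataFrame code would run unchanged against a real cluster
# by pointing master() at a cluster manager instead of local mode.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.groupBy("event_type").count().orderBy("count", ascending=False).show()

spark.stop()
```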
Statistics, Benchmarks, and Geometric Mean
- A side thread debates the geometric vs. arithmetic mean for timing benchmarks.
- Arguments for the geometric mean: it treats speedups and slowdowns symmetrically and suits multiplicative effects.
- Critics show concrete examples where the geometric mean understates real wall-clock impact, arguing it only fits compounding scenarios (e.g., price changes), not sequential tasks (a worked example follows this list).
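To make the disagreement concrete, here is a small sketch with made-up timings for two tasks run back to back, showing how the two summaries can diverge:

```python
import math

# Hypothetical timings (seconds) for two tasks run sequentially.
baseline  = [100.0, 1.0]   # version A
candidate = [ 10.0, 2.0]   # version B: task 1 is 10x faster, task 2 is 2x slower

speedups = [b / c for b, c in zip(baseline, candidate)]   # [10.0, 0.5]

geo_mean   = math.prod(speedups) ** (1 / len(speedups))   # ~2.24x
wall_clock = sum(baseline) / sum(candidate)               # 101 / 12 ~= 8.4x

print(f"geometric-mean speedup: {geo_mean:.2f}x")
print(f"wall-clock speedup:     {wall_clock:.2f}x")
```

With these numbers the geometric mean reports about a 2.2x speedup while total wall-clock time improves about 8.4x; if the 10x improvement landed on the short task and the 2x regression on the long one, the same geometric mean would coexist with a roughly 2x wall-clock slowdown.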
Hype, Incentives, and the “Lost Decade” Question
- Several comments frame the 2010s big‑data wave as driven by:
  - Investor and management obsession with “web-scale,” microservices, and modern data stacks.
  - Resume-driven architecture and VC-funded ecosystems that lock data into hosted platforms.
- Others argue the distributed push was justified for genuine petabyte‑scale analytics and high-ingest, low-latency workloads (logs, observability, SIEM, etc.), where single-node tools are insufficient.
- A recurring theme: data size alone is a poor proxy; concurrency, latency, ingest, governance, and economics often determine whether distributed architectures are warranted.