A lost decade chasing distributed architectures for data analytics?

Small vs. “Big” Data and Hardware Reality

  • Many commenters report that most real-world “big data” workloads are only a few GB–TB and often fit comfortably on a single modern server or VM (sometimes even a 2012-era laptop).
  • NVMe and large-RAM machines make single-node analytics viable for far more use cases than the “web-scale” narrative suggested.
  • Some note that median and even 99.9th-percentile Redshift/Snowflake scans are modest, but others argue those small reads partly reflect users contorting workloads around platform limitations.

DuckDB, Small Data, and Analyst Ergonomics

  • DuckDB is praised less for raw capability than for revolutionizing workflow: easy local analysis, SQL joins, and tight integration with notebooks and Parquet.
  • Comparisons are usually to pandas/dplyr/Polars: DuckDB is seen as more convenient for joins and for datasets near or beyond RAM, though R's data.table and dplyr remain strong for in-memory work.
  • Critics stress DuckDB’s sweet spot: static or slowly changing data, few writers, small-ish total datasets, and tolerable multi‑second latencies.

Database Choice: More Than Query Speed

  • One side argues databases must fit into a broader ecosystem: governance, operations, compliance, collaboration, and business processes often dominate over pure performance.
  • Others counter that a database’s core job is reliable storage and fast queries, and everything else is layered on top.

SQL vs. NoSQL / JSON Stores

  • Several comments revisit the long-running “relational vs. hierarchical/JSON” debate:
    • Pro-SQL voices cite relational algebra, flexibility of querying, and historical cycles (network/XML/JSON DBs repeatedly losing ground).
    • Defenders of MongoDB/Cassandra note they solve real problems, have strong commercial traction, and are appropriate when schemas are uncontrolled or application-defined.
  • There is pushback against using company revenue as proof of technical merit; success is seen as weak evidence of architectural soundness.

Distributed Stacks, Spark, and Scala

  • Multiple practitioners report being forced onto Spark/Scala “big data” stacks for sub-GB feeds, describing them as slow to develop, operationally heavy, and unnecessary for most jobs.
  • Others reply that:
    • Centralized clusters solve governance/productionization problems (no copying sensitive data to laptops).
    • Single-node Spark is possible, and starting there means you can scale out later without a rewrite.
  • Opinions on Scala are polarized: some see it as powerful and innovative; others report painful experiences with tooling, compilation speed, and “personal dialects.”

Statistics, Benchmarks, and Geometric Mean

  • A side thread debates geometric vs. arithmetic means for timing benchmarks.
  • Pro-geo-mean arguments: symmetric treatment of speedups/slowdowns, appropriate for multiplicative effects.
  • Critics show concrete examples where geometric mean understates real wall-clock impact, arguing it only fits compounding scenarios (e.g., price changes), not sequential tasks.
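The critics' point can be reproduced in a few lines of stdlib Python; the task timings below are invented to make the effect obvious:

```python
from statistics import geometric_mean

# Two benchmark tasks run sequentially, before and after an "optimization".
before = [1.0, 100.0]   # seconds per task
after  = [0.5, 200.0]   # task 1 got 2x faster, task 2 got 2x slower

speedups = [b / a for b, a in zip(before, after)]  # [2.0, 0.5]

# The geometric mean treats the 2x speedup and 2x slowdown symmetrically
# and reports "no change"...
print(geometric_mean(speedups))   # 1.0

# ...even though total wall-clock time nearly doubled.
print(sum(before), sum(after))    # 101.0 200.5
```

For sequential tasks, total wall-clock time (an arithmetic quantity) is what users experience; the geometric mean answers a different question about multiplicative, compounding effects.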

Hype, Incentives, and the “Lost Decade” Question

  • Several comments frame the 2010s big‑data wave as driven by:
    • Investor and management obsession with “web-scale,” microservices, and modern data stacks.
    • Resume-driven architecture and VC-funded ecosystems that lock data into hosted platforms.
  • Others argue the distributed push was justified for genuine petabyte‑scale analytics and high-ingest, low-latency workloads (logs, observability, SIEM, etc.), where single-node tools are insufficient.
  • A recurring theme: data size alone is a poor proxy; concurrency, latency, ingest, governance, and economics often determine whether distributed architectures are warranted.