A lost decade chasing distributed architectures for data analytics?
Small vs. “Big” Data and Hardware Reality
- Many commenters report that most real-world “big data” workloads measure in the gigabytes to low terabytes and often fit comfortably on a single modern server or VM (sometimes even a 2012-era laptop).
- NVMe and large-RAM machines make single-node analytics viable for far more use cases than the “web-scale” narrative suggested.
- Some note that median and even 99.9th-percentile Redshift/Snowflake scan sizes are modest, but others argue those small reads partly reflect users contorting workloads around platform limitations.
DuckDB, Small Data, and Analyst Ergonomics
- DuckDB is praised for revolutionizing workflow more than raw capability: easy local analysis, SQL joins, and integration with notebooks and Parquet (a minimal sketch follows this list).
- Comparisons are usually to pandas/dplyr/Polars: DuckDB is seen as more convenient for joins and for datasets that approach or exceed available RAM, though R’s data.table and dplyr remain strong for in-memory work.
- Critics stress DuckDB’s sweet spot: static or slowly changing data, few writers, small-ish total datasets, and tolerable multi‑second latencies.
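To make the ergonomics point concrete, here is a minimal sketch of the kind of local workflow commenters describe: querying Parquet files with SQL from an in-process database and handing the result to a notebook. The file names and columns (orders.parquet, customers.parquet, customer_id, amount) are hypothetical.

```python
import duckdb

# In-process, in-memory database: no server to run, nothing to deploy.
con = duckdb.connect()

# Join two local Parquet files directly with SQL and pull the result
# back into pandas for use in a notebook.
top_customers = con.execute("""
    SELECT c.name, count(*) AS n_orders, sum(o.amount) AS total_spend
    FROM 'orders.parquet'    AS o
    JOIN 'customers.parquet' AS c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spend DESC
    LIMIT 10
""").df()

print(top_customers)
```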
Database Choice: More Than Query Speed
- One side argues databases must fit into a broader ecosystem: governance, operations, compliance, collaboration, and business processes often dominate over pure performance.
- Others counter that a database’s core job is reliable storage and fast queries, and everything else is layered on top.
SQL vs. NoSQL / JSON Stores
- Several comments revisit the long-running “relational vs. hierarchical/JSON” debate:
  - Pro-SQL voices cite relational algebra, flexibility of querying, and historical cycles (network/XML/JSON DBs repeatedly losing ground).
  - Defenders of MongoDB/Cassandra note they solve real problems, have strong commercial traction, and are appropriate when schemas are uncontrolled or application-defined.
- There is pushback against using company revenue as proof of technical merit; success is seen as weak evidence of architectural soundness.
Distributed Stacks, Spark, and Scala
- Multiple practitioners report being forced onto Spark/Scala “big data” stacks for sub-GB feeds, describing them as slow to develop, operationally heavy, and unnecessary for most jobs.
- Others reply that:
  - Centralized clusters solve governance/productionization problems (no copying sensitive data to laptops).
  - Single-node Spark is possible, and you may someday need to scale without rewriting (see the sketch after this list).
- Opinions on Scala are polarized: some see it as powerful and innovative; others report painful experiences with tooling, compilation speed, and “personal dialects.”
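As a sketch of the “single-node Spark” point: running in `local[*]` mode keeps the whole job in one process while leaving the DataFrame code unchanged if a cluster is ever needed. The input file and column names (events.csv, event_type) are hypothetical.

```python
from pyspark.sql import SparkSession

# Single-node Spark: master("local[*]") runs the whole job in one JVM,
# using all local cores, with no cluster to provision.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("small-feed")
    .getOrCreate()
)

# The same DataFrame code would run unchanged against a real cluster
# by pointing master() at a cluster manager instead of local mode.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.groupBy("event_type").count().orderBy("count", ascending=False).show()

spark.stop()
```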
Statistics, Benchmarks, and Geometric Mean
- A side thread debates the geometric vs. arithmetic mean for timing benchmarks.
- Arguments for the geometric mean: it treats speedups and slowdowns symmetrically and suits multiplicative effects.
- Critics show concrete examples where the geometric mean understates real wall-clock impact, arguing it only fits compounding scenarios (e.g., price changes), not sequential tasks (a worked example follows this list).
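To make the disagreement concrete, here is a small sketch with made-up timings for two tasks run back to back, showing how the two summaries can diverge:

```python
import math

# Hypothetical timings (seconds) for two tasks run sequentially.
baseline  = [100.0, 1.0]   # version A
candidate = [ 10.0, 2.0]   # version B: task 1 is 10x faster, task 2 is 2x slower

speedups = [b / c for b, c in zip(baseline, candidate)]   # [10.0, 0.5]

geo_mean   = math.prod(speedups) ** (1 / len(speedups))   # ~2.24x
wall_clock = sum(baseline) / sum(candidate)               # 101 / 12 ~= 8.4x

print(f"geometric-mean speedup: {geo_mean:.2f}x")
print(f"wall-clock speedup:     {wall_clock:.2f}x")
```

With these numbers the geometric mean reports about a 2.2x speedup while total wall-clock time improves about 8.4x; if the 10x improvement landed on the short task and the 2x regression on the long one, the same geometric mean would coexist with a roughly 2x wall-clock slowdown.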
Hype, Incentives, and the “Lost Decade” Question
- Several comments frame the 2010s big‑data wave as driven by:
  - Investor and management obsession with “web-scale,” microservices, and modern data stacks.
  - Resume-driven architecture and VC-funded ecosystems that lock data into hosted platforms.
- Others argue the distributed push was justified for genuine petabyte‑scale analytics and high-ingest, low-latency workloads (logs, observability, SIEM, etc.), where single-node tools are insufficient.
- A recurring theme: data size alone is a poor proxy; concurrency, latency, ingest, governance, and economics often determine whether distributed architectures are warranted.