Big data is dead (2023)

Scale vs Reality

  • Many argue most organizations’ data is “small”: often tens of GB, occasionally TB, usually fitting in RAM or on a single SSD.
  • Common refrain: people reach for Hadoop/Spark/“data lakes” when a single Postgres/SQLite/DuckDB instance, or even awk/shell scripts, would suffice.
  • Several anecdotes describe interview questions about processing 6 TiB of data, where candidates proposed unnecessarily complex “stacks” instead of a simple single-machine solution.
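
A rough back-of-envelope calculation illustrates why commenters consider 6 TiB a single-machine problem. This is only a sketch: the throughput figures are assumed round numbers for illustration, not measurements from the thread, and `scan_seconds` is a hypothetical helper.

```python
# Back-of-envelope: how long does one machine need to read a dataset once?
# All throughput figures are assumed for illustration, not measured.

TIB = 1024 ** 4  # bytes in one tebibyte
GIB = 1024 ** 3  # bytes in one gibibyte


def scan_seconds(dataset_bytes: int, bytes_per_second: float) -> float:
    """Time for a single sequential pass over the data."""
    return dataset_bytes / bytes_per_second


dataset = 6 * TIB

# Hypothetical sustained sequential-read rates.
assumed_rates = {
    "one NVMe SSD at 3 GiB/s (assumed)": 3 * GIB,
    "four striped NVMe SSDs at 12 GiB/s (assumed)": 12 * GIB,
}

for label, rate in assumed_rates.items():
    minutes = scan_seconds(dataset, rate) / 60
    print(f"{label}: {minutes:.1f} min per full scan")
```

Under these assumptions a full pass takes about 34 minutes on one drive and under 9 minutes on four, before any indexing, compression, or column pruning is applied.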

Overengineering and Architecture

  • Strong criticism of “planning for unicorn scale” as premature optimization that slows delivery and agility.
  • Counterview: if you truly aim for high growth (VC-backed, unicorn ambition), you should at least sketch an architectural path to scale, without implementing it upfront.
  • Consensus trend: optimize for the next few months, keep obvious pivot points flexible, avoid speculative complexity.

AI and Big Data

  • Some see AI as “Big Data 2.0” or a rebranding; others stress the tech stack and use cases are quite different (Hadoop vs GPUs, batch queries vs models/chatbots).
  • LLMs are seen as a poor fit for trustworthy data analysis because of hallucinations, though traditional ML (classifiers, anomaly detection) remains valuable.
  • AI and internal “data science” are often used politically: to confirm management beliefs or signal modernity rather than drive decisions.

What “Big Data” Actually Means

  • Reminder of the 3 Vs: Volume (largely “solved”), Velocity (solved but expensive), Variety (still hard: heterogeneous, poorly described, semi‑structured data).
  • True “big data” problems persist in domains like SAR/radio astronomy, seismology, climate, genomics, high‑frequency finance, and heavy IoT telemetry, where PB‑scale storage and compute are genuine bottlenecks.
  • For most business workloads, “big data” is now more cognitive (making sense of many disparate sources) than infrastructural.

Tools, Databases, and Formats

  • Strong preference for SQL and OLAP warehouses (BigQuery, Snowflake, Databricks, ClickHouse, DuckDB) for analytics; NoSQL mostly for specialized OLTP or key‑value use.
  • MongoDB is widely criticized; Postgres and SQLite often praised as default choices.
  • Columnar formats (especially Parquet) are lauded for compression and predicate pushdown, though some note scaling limits and under‑documented edge cases.
  • Debate over cloud warehouses vs DIY: managed services are seen as pragmatic and cheap at modest scales; others highlight runaway costs and complexity.
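
The predicate pushdown mentioned for columnar formats can be sketched with a toy model: keep min/max statistics per chunk of a column and skip chunks that cannot match a filter. This is an illustration of the idea only, not Parquet's actual layout or API; `Chunk`, `make_chunks`, and `scan_greater_than` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Toy stand-in for a row group: one column's values plus min/max stats."""
    values: list
    lo: int
    hi: int


def make_chunks(column, chunk_size):
    """Split a column into fixed-size chunks and record each chunk's min/max."""
    chunks = []
    for start in range(0, len(column), chunk_size):
        part = column[start:start + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values > threshold and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue  # stats prove nothing here can match, so skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


column = list(range(1000))  # sorted data makes the statistics very selective
chunks = make_chunks(column, chunk_size=100)
matches, chunks_read = scan_greater_than(chunks, threshold=949)
print(len(matches), "matches;", chunks_read, "of", len(chunks), "chunks read")
```

Here only one of ten chunks is read. On unsorted or high-cardinality data the min/max ranges overlap and far fewer chunks can be skipped, which is one reason the benefit varies by dataset.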

Data Quality, Regulation, and Value

  • “Garbage in, garbage out”: many firms hoard logs/telemetry with little information content, generating dashboards but few decisions.
  • GDPR and similar regulations turned large opaque data lakes into liabilities, encouraging aggressive deletion and tighter scope.
  • Some advocate ingest‑time dimensionality reduction (e.g., PCA, factor models) to keep only useful structure and outliers.
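
One way to picture ingest-time reduction is a minimal sketch, assuming a single dominant factor: estimate the first principal component, store one score per row, and keep raw rows only when they sit far from that component. The data is synthetic, the threshold is an arbitrary assumption, and `top_component` and `score_and_residual` are hypothetical helpers; a real pipeline would more likely use a numerical library.

```python
import math
import random


def top_component(rows, iterations=100):
    """Approximate the first principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize so v stays a unit vector
    return means, v


def score_and_residual(row, means, v):
    """Projection onto the component and distance from it."""
    c = [x - m for x, m in zip(row, means)]
    score = sum(ci * vi for ci, vi in zip(c, v))
    resid = math.sqrt(max(0.0, sum(ci * ci for ci in c) - score * score))
    return score, resid


random.seed(1)
rows = []
for _ in range(200):
    x = random.uniform(-10, 10)  # one latent factor drives all three columns
    rows.append([x, 2 * x + random.gauss(0, 0.1), -x + random.gauss(0, 0.1)])
outlier_index = len(rows)
rows.append([0.0, 5.0, 5.0])  # a row that breaks the pattern

means, v = top_component(rows)
reduced = [score_and_residual(r, means, v) for r in rows]
threshold = 1.0  # assumed cutoff for "far from the component"
kept_raw = [i for i, (_, resid) in enumerate(reduced) if resid > threshold]
print("rows kept in raw form:", kept_raw)
```

Every other row is represented by a single score instead of three columns. What is lost is anything the chosen components do not capture, so the cutoff and number of components are judgment calls.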

Sampling and Statistics

  • Question raised: why not just sample instead of aggregating everything?
  • Responses: sampling is common and powerful, but it requires careful design and ETL, and the resulting error has to be communicated to consumers. Row‑level predictions, audits, and skewed data often need the full dataset or sophisticated sketches.
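
The trade-off described in the responses can be shown with a small sketch on synthetic, skewed data (an assumption made for illustration): a simple random sample gives a good estimate of the mean with a quantifiable error, but says little about extremes.

```python
import random
import statistics

random.seed(0)

# Synthetic right-skewed "population"; the distribution is assumed, not real data.
population = [random.expovariate(1 / 50) for _ in range(200_000)]

# Simple random sample without replacement (1% of the rows).
sample = random.sample(population, 2_000)

estimate = statistics.fmean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5
low, high = estimate - 1.96 * stderr, estimate + 1.96 * stderr  # approx. 95% interval

true_mean = statistics.fmean(population)
print(f"sample mean {estimate:.2f} (approx. 95% CI {low:.2f} to {high:.2f})")
print(f"full-data mean {true_mean:.2f}")

# Aggregates like the maximum cannot be recovered from a sample.
print(f"sample max {max(sample):.1f} vs full-data max {max(population):.1f}")
```

The interval is the "error communication" part: a sampled figure is only useful if it is reported together with its uncertainty. Questions about individual rows or rare extremes still require the full dataset or a purpose-built sketch.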

Organizational and Hiring Dynamics

  • Big data/AI projects are often vehicles for resume‑driven development or managerial empire‑building.
  • Interview anecdotes show misaligned expectations: some penalize simple, correct solutions; others use “trick” questions to filter for pragmatic generalists.