2024-05-27

Big data is dead (2023)

Scale vs Reality

Many argue most organizations’ data is “small”: often tens of GB, occasionally TB, usually fitting in RAM or on a single SSD.
Common refrain: people reach for Hadoop/Spark/“data lakes” when a single Postgres/SQLite/DuckDB instance, or even awk/shell scripts, would suffice.
Several anecdotes involve interview questions about 6 TiB of data that elicited unnecessarily complex “stacks” instead of simple single-machine solutions.
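As a rough illustration of the single-machine approach described above, the sketch below aggregates a handful of log-like rows using only SQLite from the Python standard library. The table name `hits`, its columns, and the sample rows are hypothetical and chosen purely for the example; nothing here comes from the discussion itself.

```python
import sqlite3


def top_pages(rows, limit=3):
    """Count hits and sum bytes per page using an in-memory SQLite database.

    `rows` is an iterable of (user, page, bytes) tuples (a made-up schema).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE hits (user TEXT, page TEXT, bytes INTEGER)")
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
        cur = conn.execute(
            "SELECT page, COUNT(*) AS n, SUM(bytes) AS total "
            "FROM hits GROUP BY page ORDER BY n DESC, page ASC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
    finally:
        conn.close()


# Hypothetical sample data standing in for a real log file.
sample_rows = [
    ("alice", "/home", 512),
    ("bob", "/home", 256),
    ("alice", "/docs", 1024),
    ("carol", "/home", 128),
    ("bob", "/docs", 2048),
    ("dave", "/pricing", 64),
]

result = top_pages(sample_rows)
print(result)
```

The same query would run against a file-backed database by passing a path instead of `":memory:"`; whether that is fast enough for a given dataset depends on the hardware and the data, which this sketch does not try to measure.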

Overengineering and Architecture

Strong criticism of “planning for unicorn scale” as premature optimization that slows delivery and agility.
Counterview: if you truly aim for high growth (VC-backed, unicorn ambition), you should at least sketch an architectural path to scale, without implementing it upfront.
Consensus trend: optimize for the next few months, keep obvious pivot points flexible, avoid speculative complexity.

AI and Big Data

Some see AI as “Big Data 2.0” or a rebranding; others stress the tech stack and use cases are quite different (Hadoop vs GPUs, batch queries vs models/chatbots).
LLMs are seen as a poor fit for trustworthy data analysis because of hallucinations, though traditional ML (classifiers, anomaly detection) remains valuable.
AI and internal “data science” are often used politically: to confirm management beliefs or signal modernity rather than drive decisions.

What “Big Data” Actually Means

Reminder of the 3 Vs: Volume (largely “solved”), Velocity (solved but expensive), Variety (still hard: heterogeneous, poorly described, semi‑structured data).
True “big data” problems persist in domains like SAR/radio astronomy, seismology, climate, genomics, high‑frequency finance, and heavy IoT telemetry, where PB‑scale storage and compute are genuine bottlenecks.
For most business workloads, “big data” is now more cognitive (making sense of many disparate sources) than infrastructural.

Tools, Databases, and Formats

Strong preference for SQL and OLAP warehouses (BigQuery, Snowflake, Databricks, ClickHouse, DuckDB) for analytics; NoSQL mostly for specialized OLTP or key‑value use.
MongoDB is widely criticized; Postgres and SQLite are often praised as default choices.
Columnar formats (especially Parquet) are lauded for compression and predicate pushdown, though some note scaling limits and under‑documented edge cases.
Debate over cloud warehouses vs DIY: some see managed services as pragmatic and cheap at modest scales; others highlight runaway costs and complexity.
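The predicate pushdown credited to columnar formats above can be illustrated with a toy model. The sketch below is not Parquet's actual implementation or API; it only assumes the general idea that each block of column values carries min/max statistics, so a reader can skip blocks that cannot match a filter. The names `RowGroup`, `build_row_groups`, and `scan_greater_than` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A block of values from one column, plus min/max statistics."""
    values: list
    lo: int
    hi: int


def build_row_groups(values, group_size):
    """Split a column into fixed-size groups and record each group's min/max."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold and the number of groups actually read."""
    matches = []
    groups_read = 0
    for group in groups:
        if group.hi <= threshold:
            continue  # statistics prove no value here can match; skip the block
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


column = list(range(1, 101))            # sorted values 1..100
groups = build_row_groups(column, 10)   # ten groups of ten values
matches, groups_read = scan_greater_than(groups, 85)
print(len(groups), groups_read, len(matches))  # 10 2 15
```

Skipping works well here because the column is sorted, so each group covers a narrow range; with randomly ordered data most groups would overlap the filter and little would be skipped.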

Data Quality, Regulation, and Value

“Garbage in, garbage out”: many firms hoard logs/telemetry with little information content, generating dashboards but few decisions.
GDPR and similar regulations turned large opaque data lakes into liabilities, encouraging aggressive deletion and tighter scope.
Some advocate ingest‑time dimensionality reduction (e.g., PCA, factor models) to keep only useful structure and outliers.
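A minimal sketch of the ingest-time reduction idea above, assuming a single principal component is enough to capture the "useful structure" and that outliers are rows far from that component. It uses power iteration in pure Python rather than a numerical library; the function names, the residual threshold, and the sample data are all hypothetical.

```python
import math


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Approximate the leading eigenvector by power iteration.

    Assumes the starting vector is not orthogonal to that eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def reduce_rows(rows, residual_threshold):
    """Keep one score per row; flag rows whose residual exceeds the threshold."""
    centered = mean_center(rows)
    pc = top_component(covariance(centered))
    scores, outliers = [], []
    for idx, row in enumerate(centered):
        score = sum(a * b for a, b in zip(row, pc))
        residual = math.sqrt(sum((a - score * b) ** 2 for a, b in zip(row, pc)))
        scores.append(score)
        if residual > residual_threshold:
            outliers.append(idx)
    return pc, scores, outliers


# Hypothetical data: eleven points on the line y = x plus one point off it.
rows = [(float(x), float(x)) for x in range(-5, 6)] + [(2.0, -2.0)]
pc, scores, outliers = reduce_rows(rows, residual_threshold=1.0)
print(outliers)  # [11]
```

In this toy case each two-column row is stored as one score, and only the flagged row would need to be kept in full. Real pipelines would have to fit the components on a representative batch first and decide how many to keep, which this sketch leaves out.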

Sampling and Statistics

Question raised: why not just sample instead of aggregating everything?
Responses: sampling is common and powerful but requires careful design, ETL, and error communication; row‑level predictions, audits, or skewed data often need full datasets or sophisticated sketches.
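To make the sampling trade-off above concrete, here is a small sketch that draws a uniform sample from a stream with reservoir sampling (Algorithm R) and reports the sample mean together with its standard error. The population, the sample size, and the seed are arbitrary choices for illustration, and the standard error formula assumes a simple random sample from a much larger population.

```python
import math
import random
import statistics


def reservoir_sample(stream, k, rng):
    """Uniformly sample k items from an iterable of unknown length (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def mean_with_stderr(sample):
    """Return the sample mean and its estimated standard error."""
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, stderr


rng = random.Random(0)               # fixed seed so runs are repeatable
population = range(1, 10_001)        # hypothetical stream of 10,000 values
sample = reservoir_sample(population, 500, rng)
estimate, stderr = mean_with_stderr(sample)
print(f"estimated mean = {estimate:.1f} +/- {stderr:.1f} (true mean 5000.5)")
```

Reporting the standard error alongside the estimate is the "error communication" part; the sketch does nothing for the cases named above where a sample is not enough, such as row-level predictions, audits, or heavily skewed data.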

Organizational and Hiring Dynamics

Big data and AI projects are often described as resume‑driven development or managerial empire‑building.
Interview anecdotes show misaligned expectations: some penalize simple, correct solutions; others use “trick” questions to filter for pragmatic generalists.
