Big data is dead (2023)
Scale vs Reality
- Many argue most organizations’ data is “small”: often tens of GB, occasionally TB, usually fitting in RAM or on a single SSD.
- Common refrain: people reach for Hadoop/Spark/“data lakes” when a single Postgres/SQLite/DuckDB instance, or even awk/shell scripts, would suffice.
- Several anecdotes describe interview questions about processing 6 TiB of data, where candidates proposed unnecessarily complex “stacks” instead of a simple single-machine solution.
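As a rough back-of-envelope check on the single-machine claim, the sketch below estimates how long one sequential pass over 6 TiB would take. The throughput figures are illustrative assumptions, not measurements from the thread, and real workloads add parsing and CPU overhead on top.

```python
TIB = 1024 ** 4  # bytes in one tebibyte

# Assumed sequential read throughput in bytes/second (illustrative only).
ASSUMED_THROUGHPUT = {
    "single HDD": 200e6,
    "SATA SSD": 500e6,
    "NVMe SSD": 3e9,
}


def scan_seconds(size_bytes, bytes_per_second):
    """Time for one full sequential pass, ignoring CPU and parsing cost."""
    return size_bytes / bytes_per_second


dataset_bytes = 6 * TIB
for device, rate in ASSUMED_THROUGHPUT.items():
    hours = scan_seconds(dataset_bytes, rate) / 3600
    print(f"{device}: {hours:.1f} h per full scan")
```

Under the assumed 3 GB/s figure, a full pass takes roughly 37 minutes, which is the kind of number behind the "one machine is enough" argument.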
Overengineering and Architecture
- Strong criticism of “planning for unicorn scale” as premature optimization that slows delivery and agility.
- Counterview: if you truly aim for high growth (VC-backed, unicorn ambition), you should at least sketch an architectural path to scale, without implementing it upfront.
- Consensus trend: optimize for the next few months, keep obvious pivot points flexible, avoid speculative complexity.
AI and Big Data
- Some see AI as “Big Data 2.0” or a rebranding; others stress the tech stack and use cases are quite different (Hadoop vs GPUs, batch queries vs models/chatbots).
- LLM hallucinations are seen as a poor fit for trustworthy data analysis, though traditional ML (classifiers, anomaly detection) remains valuable.
- AI and internal “data science” are often used politically: to confirm management beliefs or signal modernity rather than drive decisions.
What “Big Data” Actually Means
- Reminder of the 3 Vs: Volume (largely “solved”), Velocity (solved but expensive), Variety (still hard: heterogeneous, poorly described, semi‑structured data).
- True “big data” problems persist in domains like SAR/radio astronomy, seismology, climate, genomics, high‑frequency finance, and heavy IoT telemetry, where PB‑scale storage and compute are genuine bottlenecks.
- For most business workloads, “big data” is now more cognitive (making sense of many disparate sources) than infrastructural.
Tools, Databases, and Formats
- Strong preference for SQL and OLAP warehouses (BigQuery, Snowflake, Databricks, ClickHouse, DuckDB) for analytics; NoSQL mostly for specialized OLTP or key‑value use.
- MongoDB is widely criticized; Postgres and SQLite often praised as default choices.
- Columnar formats (especially Parquet) are lauded for compression and predicate pushdown, though some note scaling limits and under‑documented edge cases.
- Debate over cloud warehouses vs DIY: managed services are seen as pragmatic and cheap at modest scales; others highlight runaway costs and complexity.
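The predicate pushdown credited to columnar formats can be illustrated with a toy model. This is not the Parquet format or any library's API: `Chunk`, `make_chunks`, and `scan_greater_than` are hypothetical names, and the only idea borrowed is that each chunk stores min/max metadata so a reader can skip chunks that cannot match a filter.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    values: list
    lo: int  # minimum value in this chunk (stored as metadata)
    hi: int  # maximum value in this chunk (stored as metadata)


def make_chunks(values, size):
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for start in range(0, len(values), size):
        part = values[start:start + size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return values > threshold and how many chunks had to be read."""
    matches = []
    chunks_read = 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue  # skipped using metadata only; values never touched
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


column = list(range(1000))  # sorted data, the best case for skipping
chunks = make_chunks(column, 100)
matches, chunks_read = scan_greater_than(chunks, 949)
print(f"read {chunks_read} of {len(chunks)} chunks, {len(matches)} matches")
```

The skipping only pays off when values are sorted or clustered; on shuffled data nearly every chunk's range overlaps the filter and little is saved.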
Data Quality, Regulation, and Value
- “Garbage in, garbage out”: many firms hoard logs/telemetry with little information content, generating dashboards but few decisions.
- GDPR and similar regulations turned large opaque data lakes into liabilities, encouraging aggressive deletion and tighter scope.
- Some advocate ingest‑time dimensionality reduction (e.g., PCA, factor models) to keep only useful structure and outliers.
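A minimal sketch of what ingest-time reduction might look like, assuming a single principal component found by power iteration in pure Python. The function names and the toy data are hypothetical, and a real pipeline would more likely use a numerical library and more than one component. Each row is reduced to one score, and the residual (distance from the fitted direction) is kept as an outlier signal.

```python
import math


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Leading eigenvector via power iteration (assumes non-degenerate data)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def compress(rows):
    """Return the component, per-row scores, and per-row residual norms."""
    centered = mean_center(rows)
    v = top_component(covariance(centered))
    scores = [sum(a * b for a, b in zip(r, v)) for r in centered]
    residuals = [math.sqrt(max(0.0, sum(a * a for a in r) - s * s))
                 for r, s in zip(centered, scores)]
    return v, scores, residuals


# Toy data: 20 points near the line y = x, plus one point far off it.
rows = [(float(i), i + 0.1 * (-1) ** i) for i in range(20)] + [(10.0, 0.0)]
component, scores, residuals = compress(rows)
outlier_index = residuals.index(max(residuals))
print("most anomalous row:", rows[outlier_index])
```

Storing only the scores plus the rows with large residuals is one way to read "keep only useful structure and outliers"; whether discarding the rest is acceptable depends on later audit and row-level needs.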
Sampling and Statistics
- Question raised: why not just sample instead of aggregating everything?
- Responses: sampling is common and powerful but requires careful design, ETL, and error communication; row‑level predictions, audits, or skewed data often need full datasets or sophisticated sketches.
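The "error communication" point can be made concrete with a small sketch: estimate a mean from a simple random sample and report a standard error and an approximate 95% interval alongside it. The population here is synthetic and the normal approximation is an assumption; it ignores the finite-population correction and would be unreliable for heavily skewed data, which is exactly the caveat raised above.

```python
import math
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns (estimate, standard_error, (ci_low, ci_high)) using a
    normal approximation for the 95% interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    half_width = 1.96 * std_error
    return estimate, std_error, (estimate - half_width, estimate + half_width)


# Synthetic population: 100,000 values cycling through 0..99.
population = [float(i % 100) for i in range(100_000)]
true_mean = statistics.fmean(population)

estimate, std_error, interval = estimate_mean(population, 1_000, seed=42)
print(f"true mean {true_mean:.2f}")
print(f"estimate {estimate:.2f} +/- {1.96 * std_error:.2f} (approx. 95% CI)")
```

Reading 1% of the rows gives an answer with a stated uncertainty, but it cannot support row-level predictions or audits, which still need the full dataset.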
Organizational and Hiring Dynamics
- Big data and AI are often adopted for resume‑driven development or managerial empire‑building.
- Interview anecdotes show misaligned expectations: some penalize simple, correct solutions; others use “trick” questions to filter for pragmatic generalists.