2024-08-19

What If Data Is a Bad Idea?

Data quality, provenance, and negative value

Several practitioners describe spending most of their time wading through poorly collected data lacking provenance (how, why, by whom, and with what transformations it was created).
Such data is often judged to have negative value: it misleads, blocks good decisions, and can “sabotage” teams when logging or analytics are misconfigured.
Examples include broken onboarding flows hidden by bad server‑side logging, contaminated “data lakes” ruining models, and expensive logging systems (e.g. Splunk) treated as opaque oracles by a small priesthood.
In cybersecurity, data is likened to a “toxic asset”: easy to misuse, hard to secure, and now used to train models that inherit its low quality.
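
The provenance questions raised above (how, why, by whom, and with what transformations) can be made concrete with a small record attached to each dataset. This is only a sketch: the `Provenance` class and all of its field names are hypothetical and do not come from the discussion or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Provenance:
    """Hypothetical provenance record for one dataset."""
    collected_by: str      # by whom
    method: str            # how
    purpose: str           # why
    collected_at: datetime
    transformations: list[str] = field(default_factory=list)

    def record(self, step: str) -> None:
        """Append a description of one transformation, in the order applied."""
        self.transformations.append(step)

    def missing_fields(self) -> list[str]:
        """Return the names of the who/how/why fields that were left blank."""
        return [name for name in ("collected_by", "method", "purpose")
                if not getattr(self, name).strip()]


onboarding_log = Provenance(
    collected_by="web team",
    method="server-side request logging",
    purpose="",  # nobody wrote down why this was collected
    collected_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
onboarding_log.record("dropped rows with null user_id")
print(onboarding_log.missing_fields())
```

A dataset whose `missing_fields()` list is non-empty is the kind commenters describe as having negative value: it can still be queried, but nobody can say what the answers mean.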

Data, information, meaning, and models

Recurrent theme: data ≠ information ≠ meaning. Data is a “leaky” abstraction; the map is not the territory.
Some argue data is useful precisely because it is “dumb” and separate from interpretation; others stress that meaning always depends on an interpreter and context (semiotics, “meaning as use”).
Shannon/Weaver’s separation of information from meaning is invoked, as is the distinction between extensional (data) and intensional (models/programs) representations.
Models with strong inductive biases can need far less data; in the limit of a perfect model, the demand for data would approach zero.
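
A toy illustration of the extensional/intensional contrast and of inductive bias, assuming a made-up process `y = 3x + 2` that is not taken from the discussion: a lookup table (extensional) can only answer for inputs it has stored, while a model that assumes linearity (intensional) recovers the whole process from two observations.

```python
def true_process(x):
    """Hypothetical ground truth, unknown to both learners."""
    return 3 * x + 2


def fit_line(points):
    """Least-squares fit of y = a*x + b; linearity is the assumed inductive bias."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Only two observations are available.
observations = [(x, true_process(x)) for x in (0, 10)]

# Extensional representation: the stored data itself.
lookup = dict(observations)

# Intensional representation: a two-parameter program fitted to the data.
a, b = fit_line(observations)

unseen = 7
table_answer = lookup.get(unseen)   # None: the table has nothing to say
model_answer = a * unseen + b       # 23.0: matches true_process(7)
print(table_answer, model_answer)
```

The saving only holds because the bias happens to be right; a linear model fitted to a non-linear process would be confidently wrong from the same two points.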
Some speculate that LLMs and agents could act as “ambassadors” or interpreters between heterogeneous data formats, or even replace much raw data usage by model queries.

Big Data, science, and organizational behavior

Many comments compare corporate “data‑drivenness” to modern augury: collecting the wrong data for poorly posed questions, then forcing it to justify pre‑chosen decisions.
A/B testing, metrics filters, and physics experiments are cited as places where inconvenient signals get filtered away until results match expectations.
Critique: managers and product teams often lack training in the scientific method, treat numbers as oracles, and conflate quantification with truth.
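
The filtering failure mode described above can be simulated. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares how often one segment chosen in advance looks significant with how often at least one of 20 segments does. The segment count, sample size, conversion rate, and trial count are arbitrary assumptions chosen for illustration.

```python
import math
import random


def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def run_experiment(rng, segments=20, n=200, rate=0.10):
    """A/A test: no real effect exists, so any 'win' in a segment is noise."""
    zs = []
    for _ in range(segments):
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        zs.append(z_score(conv_a, conv_b, n))
    return zs


rng = random.Random(0)
trials = 200
first_hits = 0   # the segment chosen before looking is "significant"
any_hits = 0     # some segment, picked after looking, is "significant"
for _ in range(trials):
    zs = run_experiment(rng)
    first_hits += abs(zs[0]) > 1.96
    any_hits += any(abs(z) > 1.96 for z in zs)

print(f"pre-chosen segment significant: {first_hits / trials:.0%}")
print(f"some segment significant:       {any_hits / trials:.0%}")
```

For 20 independent tests at the 5% level, the chance that at least one comes out significant by luck is 1 - 0.95^20, roughly 64%, so the second figure should land far above the first even though nothing real is being measured.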

Privacy, surveillance, and consent

Strong concern that profiling data is used not just for ads but also for dynamic pricing, customer support stratification, credit scoring, and state surveillance (including “parallel construction” and social‑credit‑like systems).
Cookie/consent popups are debated: some see them as necessary friction for out‑of‑bounds data sharing; others see them as annoying dark patterns that users work around with blockers.
There’s tension between legal requirements for explicit consent and industry efforts to nudge, bundle, or obscure refusal options.

Philosophical and legal responses

Some call for seeing data as political power: centralizing it in private warehouses concentrates control.
Proposals include: banning the sale of personal data, strict definitions of anonymity, per‑use consent with clear terms, and a government portal listing all data uses with revocation controls.
Ideas like a periodic “data jubilee” or mandated deletion cycles are floated, sometimes framed via religious or historical analogies.
