Big Data's religious faith denies the reality of failed promises, privacy Chernobyls

Maciej Ceglowski (previously) spoke at O'Reilly's Strata Big Data conference this month about the toxicity of data: the fact that collected data is likely to leak, and that data leaks resemble nuclear leaks in that even the "dilute" material (metadata, or lightly contaminated boiler suits and tools) is still deadly when enough of it leaks out (I've been using this metaphor since 2008).

Ceglowski also raises a critical point: Big Data has not lived up to its promises, especially in life sciences, where we were promised that deep analysis of data would yield new science, a promise that has spectacularly failed to materialise. What's more, the factors that confound Big Data in life science are also at play in other domains, including the business domains where so much energy has been expended.

The key point is that people react to manipulation through Big Data: when you optimize a system to get people to behave in ways they don't want to (spending more money, clicking links they aren't interested in, etc.), people adapt to your interventions and regress to the mean.

Big Data's advocates believe that all this can be solved with more Big Data. This requires them to deny the privacy harms from collecting (and, inevitably, leaking) our personal information, and to assert without evidence that they can massage the data so that it can't be associated with the humans from whom it was extracted.

As Ceglowski puts it, 'people speak of the "data driven organization" with the same religious fervor as a "Christ-centered life".'

This has been a bitter pill to swallow for the pharmacological industry. They bought in to the idea of big data very early on.

The growing fear is that the data-driven approach is inherently a poor fit for life science. In the world of computers, we learn to avoid certain kinds of complexity, because they make our systems impossible to reason about.

But Nature is full of self-modifying, interlocking systems, with interdependent variables you can't isolate. In these vast data spaces, directed iterative search performs better than any amount of data mining.

My contention is that many of you doing data analysis on the real world will run into similar obstacles, hopefully not at the same cost as pharmacology.

The ultimate self-modifying, adaptive system is any system that involves people. In other words, the kind of thing most of you are trying to model. Once you're dealing with human behavior, models go out the window, because people will react to what you do.

In Soviet times, there was the old anecdote about a nail factory. In the first year of the Five-Year Plan, they were evaluated by how many nails they could produce, so they made hundreds of millions of uselessly tiny nails.

Haunted By Data [Maciej Ceglowski/Idle Words]

(via O'Reilly Radar)