Inherent biases warp Big Data

The theory of Big Data is that the numbers have an objective property that makes their revealed truth especially valuable; but as Kate Crawford points out, Big Data has inherent, lurking bias, because the datasets are the creation of fallible, biased humans. For example, the data-points on how people reacted to Hurricane Sandy mostly emanate from Manhattan, because that's where you find the highest concentration of people wealthy enough to own tweeting, data-emanating smartphones. But more severely affected locations -- Breezy Point, Coney Island and Rockaway -- produced almost no data because they had fewer smartphones per capita, and the ones they had didn't work because their power and cellular networks failed first.

I wrote about this in 2012, when Google switched strategies for describing how it arrived at its search rankings. Before that, the company had described its ranking process as a mathematical one and told people who didn't like how they were ranked that the problem was their own, because the numbers didn't lie. After governments took this argument to heart and started ordering Google to change its search results -- on the grounds that there's no free speech question if you're just ordering post-processing on the outcome of an equation -- Google started commissioning law review articles explaining that the algorithms that determined search rank were the outcome of an expressive, human, editorial process that deserved free speech protection.
