Big Data is a new book from Viktor Mayer-Schonberger, a respected Internet governance theorist; and Kenneth Cukier, a long-time technology journalist who's been on the Economist for many years. As the title and pedigree imply, this is a business-oriented book about "Big Data," a computational approach to business, regulation, science and entertainment that uses data-mining applied to massive, Internet-connected data-sets to learn things that previous generations weren't able to see because their data was too thin and diffuse.
Big Data is an eminently practical and sensible book, but it's also an exciting and excitable text, one that conveys enormous enthusiasm for the field and its fruits. The authors use well-chosen examples to show how everything from shipping logistics to video-game design to healthcare stand to benefit from studying the whole data-set, rather than random samples. They even pose this as a simple way of thinking of big data versus "small data." Small data relies on statistical sampling, and emphasises the reliability and accuracy of each measurement. With big data, you sample the entire pool of activities -- all the books sold, all the operations performed -- and worry less about inaccuracies and anomalies in individual measurements, because these are drowned out by the huge numbers of observations performed.
As you'd expect, Big Data is particularly fascinating when it explores the business implications of all this: the changing leverage between firms that own data versus the firms that know how to make sense of it, and why sometimes data is best processed by unaffiliated third parties who can examine data from rival firms and find out things from which all parties stand to benefit, but which none of them could have discovered on their own. They also cover some of the bigger Big Data business blunders through history -- companies whose culture blinkered them to the opportunities in their data, which were exploited by clever rivals.
While Big Data is an excellent primer on the opportunities of the field, it's thin on the risks, overall. For example, Big Data is rightly fascinated with stories about how we can look at data sets and find predictors of consequential things: for example, when Google mined its query-history and compared it with CDC data on flu outbreaks, it found that it could predict flu outbreaks ahead of the CDC, which is amazingly useful. However, all those search-strings were entered by people who didn't expect to have them mined for subsequent action. If searching for "scratchy throat" and "runny nose" gets your neighborhood quarantined (or gets it extra healthcare dollars), you might get all your friends to search on those terms over and over -- or not at all. Google knows this -- or it should -- because when it started measuring the number of links between sites to define the latent authority of different parts of the Internet, it got great results, but immediately triggered a whole scummy ecosystem of linkfarms and other SEO tricks that create links whose purpose is to produce more of the indicators Google is searching for.
Another important subject is looking at algorithmic prediction in domains where the outcome is punishment, instead of reward. British Airways may get great results from using an algorithm to pick out passengers for upgrades, trying to find potential frequent fliers. But we should be very cautious about applying the same algorithm to building the TSA's No-Fly list. If BA's algorithm fails 20% of the time, it just means that a few lucky people get to ride up front of the plane. If the TSA has a 20% failure rate, it means that one in five "potential terrorists" is an innocent whose fundamental travel rights have been compromised by a secretive and unaccountable algorithm.
Secrecy and accountability are the third important area for examination in a Big Data world. Cukier and Mayer-Schonberger propose a kind of inspector-general for algorithms who'll make sure they're not corrupted to punish the undeserving or line someone's pockets unjustly. But they also talk about the fact that these algorithms are likely to be illegible -- the product of a continuously evolving machine-learning system -- and that no one will be able to tell you why a certain person was denied credit, refused insurance, kept out of a university, or blackballed for a choice job. And when you get into a world where you can't distinguish between an algorithm that gets it wrong because the math is unreliable (a "fair" wrong outcome) from an algorithm that gets it wrong because its creators set out to punish the innocent or enrich the undeserving, then we can't and won't have justice. We know that computers make mistakes, but when we combine the understandable enthusiasm for Big Data's remarkable, counterintuitive recommendations with the mysterious and oracular nature of the algorithms that produce those conclusions, then we're taking on a huge risk when we put these algorithms in charge of anything that matters.