
Big Data has big problems


Writing in the Financial Times, Tim Harford (The Undercover Economist Strikes Back, Adapt, etc) offers a nuanced, but ultimately damning critique of Big Data and its promises. Harford's point is that Big Data's premise is that sampling bias can be overcome by simply sampling everything, but the actual data-sets that make up Big Data are anything but comprehensive, and are even more prone to the statistical errors that haunt regular analytic science.

What's more, much of Big Data is "theory free" -- the correlation is observable and repeatable, so it is assumed to be real, even if you don't know why it exists -- but theory-free conclusions are brittle: "If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down." Harford builds on recent critiques of Google Flu (the poster child for Big Data) and goes further. This is your must-read for today.
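To make the sampling point concrete, here is a minimal sketch in Python. It is not from Harford's piece. Every number in it is a made-up assumption: a hypothetical population in which people who use some online service differ from people who don't. It compares an "N = all" data-set that covers only the service's users with a small random survey of everyone.

```python
import random

def build_population():
    """Hypothetical population: one (uses_service, has_trait) pair per person."""
    # Assumed figures for illustration only.
    users = [(True, i < 6_000) for i in range(30_000)]        # 20% of users have the trait
    non_users = [(False, i < 3_500) for i in range(70_000)]   # 5% of non-users have it
    return users + non_users

def rate(people):
    """Fraction of the given people who have the trait."""
    return sum(has_trait for _, has_trait in people) / len(people)

population = build_population()
true_rate = rate(population)

# "N = all": every record the service holds, but only for people who use it.
found_data = [person for person in population if person[0]]
found_rate = rate(found_data)

# A far smaller sample, drawn at random from the whole population.
rng = random.Random(0)
survey = rng.sample(population, 1_000)
survey_rate = rate(survey)

print(f"true rate:         {true_rate:.3f}")
print(f"30,000 found rows: {found_rate:.3f}")
print(f"1,000 random rows: {survey_rate:.3f}")
```

Under these assumptions the true rate is 9.5%, while the 30,000-row "found" data-set reports 20%. Adding more rows from the same skewed source never closes that gap, whereas the random sample's error shrinks as the sample grows.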

Read the rest

Big Data Hubris: Google Flu versus reality

In The Parable of Google Flu: Traps in Big Data Analysis [PDF], published in Science, researchers try to understand why Google Flu (which uses search history to predict flu outbreaks) performed so well at first but has not done well since. One culprit: people don't know what the flu is, so their search for "flu" doesn't necessarily mean they have flu. More telling, though, is that Google can't let outsiders see their data or replicate their findings, meaning that they can't get the critical review that might help them spot problems before years of failure. (via Hacker News)

Full NHS hospital records uploaded to Google servers, "infinitely worse" story to come

PA Consulting, a management consulting firm, obtained the entire English and Welsh hospital episode statistics database and uploaded it to Google's BigQuery service. The stats filled 27 DVDs and took "a couple of weeks" to transfer to Google's service, which is hosted in non-EU data centres. This is spectacularly illegal. The NHS dataset includes each patient's NHS number, post code, address, date of birth and gender, as well as all their inpatient, outpatient and emergency hospital records. Google's BigQuery service allows for full data-set sharing with one click.

The news of the breach comes after the collapse of a scheme under which the NHS would sell patient records to pharma companies, insurers and others (there was no easy way to opt out of the scheme, until members of the public created the independent Fax Your GP service).

According to researcher and epidemiologist Ben Goldacre, this story is just the beginning: there's an "infinitely worse" story that is coming shortly.

Read the rest

Weinberger's "Too Big to Know" in paperback

David Weinberger's 2012 book Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room was one of the smartest, most thought-provoking reads I had the pleasure of being buffeted by in 2012. I'm delighted to learn that it's out in paperback this month. Here's my original review from 2012:

David Weinberger is one of the Internet's clearest and cleverest thinkers, an understated and deceptively calm philosopher who builds his arguments like a bricklayer builds a wall, one fact at a time. In books like Everything is Miscellaneous and Small Pieces, Loosely Joined, he erects solid edifices with no gaps between the bricks, inviting conclusions that are often difficult to reconcile with your pre-existing prejudices, but which are even harder to deny.

Too Big to Know, Weinberger's latest book-length argument, is another of these surprising brick walls. Weinberger presents us with a long, fascinating account of how knowledge itself changes in the age of the Internet -- what it means to know something when there are millions and billions of "things" at your fingertips, when everyone who might disagree with you can find and rebut your assertions, and when the ability to be heard isn't tightly bound to your credentials or public reputation for expertise.

Read the rest

Chicago PD's Big Data: using pseudoscience to justify racial profiling


The Chicago Police Department has ramped up the use of its "predictive analysis" system to identify people it believes are likely to commit crimes. These people, who are placed on a "heat list," are visited by police officers who tell them that they are considered pre-criminals by CPD, and are warned that if they do commit any crimes, they are likely to be caught.

The CPD defends the practice, and its technical champion, Miles Wernick from the Illinois Institute of Technology, characterizes it as a neutral, data-driven system for preventing crime in a city that has struggled with street violence and other forms of crime. Wernick's approach involves searching through the data for "abnormal" patterns that correlate with crime. He compares it with epidemiological approaches, stating that people whose social networks have violence within them are also likely to commit violence.

The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put them there.

This is a terrible way of running a criminal justice system.

Read the rest

Comic explains problems with Oakland's Domain Awareness Center surveillance plan


Hugh sez, "What's wrong with Oakland's proposed Domain Awareness Center? This new comic by Susie Cagle lays out the issues."

The Testing Ground for the New Surveillance (Thanks, Hugh!)

Fax Your GP: quick opt-out from insane NHS plan to sell your medical records


The UK National Health Service has initiated a plan to take the nation's private health records and sell them off to private companies in a process overseen by notorious multinational bumblewads ATOS. If you live in England, your records -- mental health records, prescriptions, records of surgeries including abortions, and other sensitive personal information -- will be handed over to a wide-ranging group of companies all over the world.

Unless you opt out. And opting out isn't easy. There's no central place to opt out. Instead, you have to send a letter to your GP's surgery, which means you have to look up your GP's surgery's address, compose a legally sufficient letter, print it out, find an envelope and a stamp -- etc.

However! There's a better way. A group of volunteers whom I trust implicitly, including the astounding Stef Magdalinski (who made the FaxYourMP service that is the ancestor of TheyWorkForYou), have created Fax Your GP, a dead-simple form that will look up your GP's fax number for you, create a form opt-out letter you can fill in in just a few easy steps, and then fax that letter directly to your GP's surgery. I just opted out.

Read the rest

UK set to sell sensitive NHS records to commercial companies with no meaningful privacy protections - UPDATED

The UK government's Health and Social Care Information Centre quietly announced plans to share all patient records held by the National Health Service with private companies, from insurers to pharmaceutical companies. The information sharing is on an opt-out basis, so if you don't want your "clinical records, mental health consultations, drug addiction rehabilitation details, sexual health clinic attendance and abortion procedures" shared, along with your "GP records, NHS numbers, post-codes, gender, date of birth," you need to contact your doctor and opt out of the process.

This is a complex issue. Large data-sets are the lifeblood of epidemiology and evidence-based care and policy, and the desire to extract useful health information from this data is a legitimate one.

However, it's clear that no one involved in the process gives a damn about privacy. These data-sets -- which will be sold on the open market to commercial operators -- are "anonymized" and "pseudonymized" through processes that don't work, have never worked, and are well-documented to be without any basis in reality.

And that's the thing that brings the whole enterprise out of the realm of legitimate scientific project and into the realm of corporatist hucksterism. Once the architects of this project announced that its privacy protections would be based on junk science, they lost any claim they had to operating in good faith.

Effectively, the managers of this programme have said, "We can't figure out how to protect the most private, potentially damaging facts of your life, so we're not going to try." It is pure cynicism, and it makes me furious. It brings the whole field of evidence-based medicine into disrepute. It is a scandal. And as it goes ahead, it will spectacularly destroy the lives of random people in the UK through the involuntary, totally foreseeable disclosure of health information, in ways that make the general public leery of any participation in this kind of inquiry.

If you set about to discredit the open data movement, you could do no better than this.


Update: As if that wasn't bad enough, Noemi adds, "The contract for handling and managing the care data has been given to ATOS. This is the same company whose disability benefit assessment has been found to be flawed and unacceptable in 40% of cases by the Audit Commission." Here's more.

Read the rest

Officemax sends junkmail addressed to "Daughter Killed In Car Crash"


Officemax sent junkmail to Mike Seay at his address in Lindenhurst, IL, with the notation "Daughter Killed In Car Crash" under his name. Seay's 17-year-old daughter was killed in a crash last year. Officemax says it bought Seay's name from a marketing company, and implies that the company had made the notation in its list. It's not clear what marketing purpose this information was intended for (is there a sub-list for "bereaved parents" that's rented out to grief counselors looking for business?) or whether this was a one-off in a data-entry department.

Seay is understandably very upset. The Officemax call-center person he spoke to refused to believe him, as did an official spokesdroid. He's seeking an apology from Officemax's CEO.

Read the rest

Judge rules that NSA metadata surveillance is constitutional

U.S. District Judge William Pauley of New York, a Clinton appointee, has ruled (PDF) that the bulk collection of metadata by the NSA and the phone companies is constitutional. He called it a "vital tool" for fighting terrorism, and pooh-poohed claims that it was invasive, in part because people "voluntarily" give their data to large corporations. The suit was brought by the ACLU, and was dismissed by Pauley at government request. The ACLU will appeal.

Earlier this month, a different federal judge ruled that NSA spying was illegal. It was likely from the start that that case would go to the Supreme Court, but that likelihood just shot up now that there's a circuit split brewing among the federal courts.

Judge Pauley's ruling advanced the theory that mass spying detects "relationships so attenuated and ephemeral they would otherwise escape notice," though there's no evidence that this "attenuated relationship detection" leads to any useful counterterrorism -- and there's an abundance of evidence that it generates thousands and thousands of false positives: people judged guilty by a secret and unaccountable algorithm.
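The false-positive problem is base-rate arithmetic, and a short Python sketch shows why it bites. None of the figures below come from the ruling or from the NSA. The population size, the number of real targets, and both accuracy rates are hypothetical assumptions, deliberately chosen to flatter the algorithm.

```python
def screening_outcomes(population, true_targets, sensitivity, false_positive_rate):
    """Expected results of screening a whole population for a very rare trait.

    sensitivity: fraction of real targets the screen flags.
    false_positive_rate: fraction of everyone else the screen wrongly flags.
    """
    innocents = population - true_targets
    true_positives = true_targets * sensitivity
    false_positives = innocents * false_positive_rate
    # Precision: of the people flagged, what fraction are real targets?
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision

# Hypothetical inputs: a screen that catches 99% of real targets and
# wrongly flags only 0.1% of everyone else.
tp, fp, precision = screening_outcomes(
    population=300_000_000,
    true_targets=3_000,
    sensitivity=0.99,
    false_positive_rate=0.001,
)

print(f"real targets flagged:    {tp:,.0f}")
print(f"innocent people flagged: {fp:,.0f}")
print(f"share of flags correct:  {precision:.2%}")
```

Even with those generous assumptions, the screen flags roughly a hundred innocent people for every real target. The rarer the thing you are hunting, the worse that ratio gets.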

Pauley has subscribed to the NSA's Greater Manure Pile theory of crimefighting ("If the pile of manure is big enough, there must be a pony underneath it somewhere!"). The fact that the evidence in support of the Greater Manure Pile is secret means that its advocates can simply wink and lay their fingers alongside their noses and say "If you only knew what I knew..." and then ask for another billion dollars for their own surveillance empires.

Both rulings -- in support of, and against, NSA spying -- cite Smith v. Maryland, a Supreme Court case that held that spying on one person's phone-metadata for a limited time was legal in order to catch a purse-snatcher. A secret interpretation of Smith was used by the Obama administration and the NSA to justify harvesting all phone metadata, of all people, all the time. Judge Pauley agreed that this was a reasonable interpretation. The ACLU disagreed: "[The decision] misinterprets the relevant statutes, understates the privacy implications of the government’s surveillance and misapplies a narrow and outdated precedent to read away core constitutional protections."

Read the rest

NYC think-tank devoted to critical analysis of Big Data seeks fellows

Outstanding social scientist danah boyd has founded a new thinktank (or "think/do-tank") called The Data & Society Research Institute, based in New York City, and devoted to critical analysis of big data, and "social, technical, ethical, legal, and policy issues that are emerging because of data-centric technological development." It's well-funded, with an exciting mission, and they're hiring.

Read the rest

Understanding spurious correlation in data-mining


Last May, Dave at Euri.ca took a crack at expanding Gabriel Rossman's excellent post on spurious correlation in data. It's an important read for anyone wondering whether the core hypothesis of the Big Data movement is that every sufficiently large pile of horseshit must have a pony in it somewhere. As O'Reilly's Nat Torkington says, "Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this."
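For a feel of how spurious correlation arises, here is a small Python sketch. It is not taken from either post. It is a toy simulation under assumed settings: 1,000 candidate series of 10 points each, all pure random noise, each compared with an equally random target.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(42)
N_POINTS = 10          # assumed: short series, e.g. ten yearly observations
N_CANDIDATES = 1_000   # assumed: number of unrelated variables being mined

# Everything here is independent noise; no series has any real link to the target.
target = [rng.gauss(0, 1) for _ in range(N_POINTS)]
candidates = [[rng.gauss(0, 1) for _ in range(N_POINTS)] for _ in range(N_CANDIDATES)]

correlations = [pearson(target, series) for series in candidates]
strongest = max(correlations, key=abs)
strong_count = sum(abs(r) > 0.7 for r in correlations)

print(f"strongest correlation found: {strongest:+.2f}")
print(f"series with |r| > 0.7:       {strong_count} of {N_CANDIDATES}")
```

With this many candidates and so few data points, some of the noise series will correlate strongly with the target by chance alone. Mining for the best match and then reporting it as a discovery is how you end up with a pony that isn't there.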

Read the rest

Big Data meets Bigfoot


Big Data meets Bigfoot in Penn State PhD candidate Joshua Stevens's visualization of nearly a century of Sasquatch sighting reports in the US and Canada. Stevens mapped and graphed more than 3,000 sightings included in the Bigfoot Field Researchers Organization's database of geocoded and timestamped reports. Stevens writes:

Right away you can see that sightings are not evenly distributed. At first glance, it looks a lot like a map of population distribution. After all, you would expect sightings to be the most frequent in areas where there are a lot of people. But a bivariate view of the data shows a very different story. There are distinct regions where sightings are incredibly common, despite a very sparse population. On the other hand, in some of the most densely populated areas sasquatch sightings are exceedingly rare.
"‘Squatch Watch: 92 Years of Bigfoot Sightings in the US and Canada" (Thanks, everyone!)

Can has data-optimized cheeseburger? Yes.

Here's an Ignite talk by Hilary Mason, chief scientist at Bitly, explaining how she scraped data from multiple sources to create a service that locates NYC's optimal cheeseburger, using an algorithm that balances out price, proximity, and sentiment analysis from various review sites. As Mason points out, this isn't about cheeseburgers, really: it's about the power (and limits) of cross-referenced data.
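The excerpt doesn't describe Mason's actual algorithm, so what follows is only a guess at the general shape of such a thing: a weighted score over price, distance, and review sentiment, sketched in Python. The `Burger` class, the weights, the three places, and their numbers are all hypothetical, and the sentiment score is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass

@dataclass
class Burger:
    name: str
    price: float        # dollars
    distance_km: float  # distance from the hungry person
    sentiment: float    # 0.0 (panned) to 1.0 (adored); assumed precomputed from reviews

def normalize(value, lo, hi):
    """Scale value into [0, 1]; fall back to 0.5 when every value is the same."""
    return 0.5 if hi == lo else (value - lo) / (hi - lo)

def rank(burgers, w_price=1.0, w_distance=1.0, w_sentiment=1.0):
    """Sort burgers best-first by a weighted sum of cheapness, closeness and sentiment."""
    prices = [b.price for b in burgers]
    distances = [b.distance_km for b in burgers]

    def score(b):
        cheapness = 1 - normalize(b.price, min(prices), max(prices))
        closeness = 1 - normalize(b.distance_km, min(distances), max(distances))
        return w_price * cheapness + w_distance * closeness + w_sentiment * b.sentiment

    return sorted(burgers, key=score, reverse=True)

# Hypothetical candidates, invented for illustration.
burgers = [
    Burger("Place A", price=6.0, distance_km=0.5, sentiment=0.60),
    Burger("Place B", price=12.0, distance_km=0.2, sentiment=0.90),
    Burger("Place C", price=9.0, distance_km=2.0, sentiment=0.95),
]

print("equal weights:      ", rank(burgers)[0].name)
print("reviews count for 5x:", rank(burgers, w_sentiment=5.0)[0].name)
```

With equal weights the cheap, nearby Place A wins. Weight the reviews five times as heavily and Place B takes over. The "optimal" burger depends on whoever picks the weights, which is one of the limits of cross-referenced data.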

Read the rest

Unsupervised AI makes up some pretty funny jokes

Unsupervised joke generation from big data [PDF], a paper by University of Edinburgh researchers Sasa Petrovic and David Matthews, describes an ingenious and successful method for teaching a computer to make up jokes like "I like my relationships like I like my source, open;" "I like my coffee like I like my war, cold;" and "I like my boys like I like my sectors, bad." The researchers wrote code that called on Google's n-gram database to find noun-attribute pairs, zero in on nouns with ambiguous meaning, and automatically generate jokes.

Read the rest