Big Data should not be a faith-based initiative
Cory Doctorow summarizes the problem with the idea that sensitive personal information can be removed responsibly from big data: computer scientists are pretty sure that's impossible.
The debate is a hot one, and a lot of non-technical privacy regulators have been led on by sweet promises from the companies that they regulate about the possibility of creating booming markets in highly sensitive personal data that is somehow neutralized through a magic "de-identification" process that lets information about, say, the personal lives of cancer patients be bought and sold without compromising the patients' privacy.
The most recent example of this is a report by former Ontario privacy commissioner Ann Cavoukian and Daniel Castro from the pro-market thinktank the Information Technology and Innovation Foundation. The authors argue that the risk of "re-identification" has been grossly exaggerated and that it is indeed possible to produce meaningful, valuable datasets that are effectively "de-identified."
Princeton's Arvind Narayanan and Ed Felten have published a stinging rebuttal, pointing out the massive holes in Cavoukian and Castro's arguments -- cherry picking studies, improperly generalizing, ignoring the existence of multiple re-identification techniques, and so on.
As Narayanan and Felten demonstrate, the Cavoukian/Castro position is grounded in a lack of understanding of both computer science and security research. The "penetrate-and-patch" method they recommend -- where systems are fielded with live data, broken through challenges, and then revised -- has been hugely ineffective in both traditional information security development and in de-identification efforts. And as Narayanan and Felten point out, there is no shortage of computer science experts who could have helped them with this.
Cavoukian and Castro are rightly excited by Big Data and the new ways that scientists are discovering to make use of data collected for one purpose in the service of another. But they do not admit that the same theoretical advances that unlock new meaning in big datasets also unlock new ways of re-identifying the people whose data is collected in the set.
Re-identification is part of the Big Data revolution: among the new meanings we are learning to extract from huge corpuses of data is the identity of the people in that dataset. And since we're commodifying and sharing these huge datasets, they will still be around in ten, twenty and fifty years, when those same Big Data advancements open up new ways of re-identifying -- and harming -- their subjects.
Narayanan and Felten would like to have a "best of both worlds" solution that lets the world reap the benefits of Big Data without compromising the privacy of the subjects of the datasets. But if there is such a solution, it is to be found through rigorous technical examinations, not through hand-waving, wishful thinking, and bad stats.
The faith-based belief in de-identification is at the root of the worst privacy laws in recent memory. In the EU, the General Data Protection Regulation -- the most-lobbied regulatory effort in EU history -- decided to divide data protection into two categories: identifiable data and "de-identified" data, with practically no limits on how the latter could be bought and sold. This mirrors the existing UK approach, which allows companies to unilaterally declare that the data they hold has been "de-identified" and then treat it as a commodity. In both cases, it's a disaster, as I wrote in the Guardian last year. You can't make good technical regulations by ignoring technical experts, even if the thing those technical experts are telling you is that your cherished plans are impossible.
I recommend you read both Narayanan and Felten's paper, and Cavoukian and Castro's. But in the meantime, Narayanan has helpfully summarized the debate:
Specifically, we argue that:
1. There is no known effective method to anonymize location data, and no evidence that it’s meaningfully achievable.
2. Computing re-identification probabilities based on proof-of-concept demonstrations is silly.
3. Cavoukian and Castro ignore many realistic threats by focusing narrowly on a particular model of re-identification.
4. Cavoukian and Castro concede that de-identification is inadequate for high-dimensional data. But nowadays most interesting datasets are high-dimensional.
5. Penetrate-and-patch is not an option.
6. Computer science knowledge is relevant and highly available.
7. Cavoukian and Castro apply different standards to big data and re-identification techniques.
8. Quantification of re-identification probabilities, which permeates Cavoukian and Castro’s arguments, is a fundamentally meaningless exercise.
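To make the high-dimensionality point (item 4) concrete, here is a minimal sketch in Python. It is an illustration only, not code or data from Narayanan and Felten's paper: the records are synthetic random values, and the helper names `unique_fraction` and `synthetic_records` are made up for this example. It counts how many records become unique, and so potentially linkable to one person, as more attributes per record are considered.

```python
import random
from collections import Counter


def unique_fraction(records, columns):
    """Fraction of records whose values on `columns` match no other record."""
    keys = [tuple(r[c] for c in columns) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


def synthetic_records(n, n_attributes, n_values, seed=0):
    """Hypothetical data: n records, each a tuple of random categorical values."""
    rng = random.Random(seed)
    return [
        tuple(rng.randrange(n_values) for _ in range(n_attributes))
        for _ in range(n)
    ]


# Assumption: 1,000 people, 8 attributes, 5 possible values per attribute.
records = synthetic_records(n=1000, n_attributes=8, n_values=5)

# Look at the first k attributes of every record, for k = 1..8.
fractions = [unique_fraction(records, range(k)) for k in range(1, 9)]

for k, fraction in enumerate(fractions, start=1):
    print(f"{k} attributes: {fraction:.1%} of records are unique")
```

With a single attribute nobody stands out, but the unique share can only grow as attributes are added, because a record that is unique on some attributes stays unique when more are included. Real datasets are not uniformly random, so the exact percentages here mean nothing; the sketch shows the direction of the effect, not its size.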