An incredibly important paper on whether data can ever be "anonymized" and how we should handle release of large data-sets

Even the most stringent privacy rules have massive loopholes: they all allow for free distribution of "de-identified" or "anonymized" data that is deemed to be harmless because it has been subjected to some process.


But the reality of "re-identification" attacks tells a different story: again and again and again and again and again, datasets are released on the promise that they have been de-identified, only to be rapidly (and often trivially) re-identified, putting privacy, financial security, lives and even geopolitical stability at risk.

The problems of good anonymization are myriad. One is that re-identification risk increases over time: a database of taxi trips, say, might be re-identifiable later, when a database of user accounts, or home addresses, or specific journeys taken by one person, leaks. Releasing a database of logins and their corresponding real names might allow a database of logins without real names to be mass re-identified with little effort.
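
To make that linkage risk concrete, here's a toy sketch (mine, not the paper's) of how a later leak can re-identify an "anonymized" release in bulk; the datasets, the column names and the use of pandas are illustrative assumptions, not anything from the paper.

```python
# Toy illustration of a linkage attack: joining a "de-identified" dataset
# with a later leak that shares a quasi-identifier. Column names and values
# are hypothetical; real attacks work the same way at much larger scale.
import pandas as pd

# "Anonymized" release: login handles with real names stripped out.
trips = pd.DataFrame({
    "login": ["user123", "user456"],
    "pickup_zip": ["10001", "94103"],
    "dropoff_zip": ["10013", "94110"],
})

# A later, unrelated leak that maps the same logins to real names.
leak = pd.DataFrame({
    "login": ["user123", "user456"],
    "real_name": ["A. Smith", "B. Jones"],
})

# One join and the "anonymized" data is re-identified in bulk.
reidentified = trips.merge(leak, on="login")
print(reidentified)
```

The point is that the join costs the attacker essentially nothing: the "anonymized" dataset only has to share one stable quasi-identifier with any future leak.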

Anonymization suffers from the "wanting it badly is enough" problem: industry and regulators would benefit immensely if there were such a thing as reliable de-identification, and the fact that it would be useful to have this creates the certainty that it's possible to have this (see also: crypto backdoors, DRM, no-fly lists, etc).


Regulators, wary of being overly prescriptive, often use "industry best practices" as a benchmark for whether anonymization has taken place. But this only works if the public rewards companies that practice good anonymization, so that companies compete with one another to find effective anonymization techniques. Since it's impossible for a prospective customer to evaluate which anonymization techniques work until after there has been a breach, markets don't reward companies that spend their resources perfecting anonymization. Instead, industries race to the bottom, adopting the cheapest methods without regard to whether they work, and this becomes "best practice."


You can see this in other sectors: UK anti-money-laundering rules require that banks verify their customers' home addresses using "industry best practices." Bank customers don't care whether their bank is complicit in money-laundering, so they don't preferentially choose banks with good anti-money-laundering practices, which means that all the UK banks converged on accepting laser-printed, easily forged gas bills as proof of address, despite the fact that these offer nothing in the fight against money-laundering. It's enough that they're cheap to process and can be waved around when a bank is caught in a money-laundering prosecution.

A Precautionary Approach to Big Data Privacy, a 2015 paper by Princeton computer scientists Arvind Narayanan (previously), Joanna Huey and Edward Felten (previously), is the best look at this subject I've yet read, and should be required reading.


The authors point out that the traditional (and controversial) security methodology of "penetrate and patch," in which bugs are identified after the product is rolled out and then fixed, is totally unsuited to the problem of data anonymization. That's because once an "anonymized" database is distributed, it's impossible to patch the copies floating around in the wild (you can't recall the data and remove problematic identifiers), and partner organizations you've given the data to have no incentive to "patch" it by taking away identifiers that they might be using or find useful in the future.


Instead of "ad-hoc" de-identification methods, the authors recommend a wonkish, exciting idea called "differential privacy," in which precise amounts of noise is added to the data-set before it is released, allowing the publisher to quantify exactly how likely a future re-identification attack is, balancing the benefits of release with the potential costs to the people implicated in it.


Differential privacy is hard to do right, though: famously, Apple flubbed one of the first large-scale deployments of it.


The authors set out a set of clear guidelines for how to turn their recommendations into policy, and give examples of how existing data-releases could integrate these principles.

We're living in a moment of unprecedented data-gathering and intentional release, but the entire policy framework for those releases is based on a kind of expedient shrug: "We're not sure if this'll work, but it needs doing, so…" This paper — written in admirably clear, non-technical language — establishes a sound policy/computer science basis for undertaking these activities, and not a moment too soon.


New attributes continue to be linked with identities: search queries, social network data, genetic information (without DNA samples from the targeted people), and geolocation data all can permit re-identification, and Acquisti, Gross, and Stutzman have shown that it is possible to determine some people's interests and Social Security numbers from only a photo of their faces. The realm of potential identifiers will continue to expand, increasing the privacy risks of already released datasets.

Furthermore, even staunch proponents of current de-identification methods admit that they are inadequate for high-dimensional data. These high-dimensional datasets, which contain many data points for each individual's record, have become the norm: social network data has at least a hundred dimensions and genetic data can have millions. We expect that datasets will continue this trend towards higher dimensionality as the costs of data storage decrease and the ability to track a large number of observations about a single individual increases. High dimensionality is one of the hallmarks of "big data."

Finally, we should note that re-identification of particular datasets is likely underreported. First, the re-identification of particular datasets is likely to be included in the academic literature only if it involves a novel advancement of techniques, so while the first use of a re-identification method may be published, reuses rarely are. Similarly, people who blog or otherwise report re-identification vulnerabilities are unlikely to do so unless interesting methods or notable datasets are involved. Second, those with malicious motivations for re-identification are probably unwilling to announce their successes. Thus, even if a specific dataset has not been re-identified publicly, it should not be presumed secure.
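
The excerpt's point about high-dimensional data is easy to see with a toy simulation (mine, not the paper's): as the number of attributes per record grows, almost every record becomes unique, and a unique record is a linkable one. The record counts and attribute cardinalities below are arbitrary assumptions.

```python
# Toy demonstration of why high-dimensional records resist de-identification:
# as the number of attributes per record grows, the fraction of records that
# are unique (and hence linkable) rapidly approaches 1. Values are synthetic.
import random
from collections import Counter

def fraction_unique(n_records=10_000, n_attributes=5, values_per_attribute=4):
    """Fraction of records whose full attribute vector appears only once."""
    records = [
        tuple(random.randrange(values_per_attribute) for _ in range(n_attributes))
        for _ in range(n_records)
    ]
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / n_records

for dims in (3, 5, 8, 12):
    print(dims, round(fraction_unique(n_attributes=dims), 3))
```

Even with only a handful of possible values per attribute, a dozen dimensions is enough to make nearly every synthetic record unique.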

A Precautionary Approach to Big Data Privacy [Arvind Narayanan, Joanna Huey, Edward W. Felten/Princeton]

(via 4 Short Links)