Machine learning systems are pretty good at finding hidden correlations in data and using them to infer potentially compromising information about the people who generate that data: for example, researchers fed an ML system a bunch of Google Play reviews by reviewers whose locations were explicitly given in their Google Plus reviews; based on this, the model was able to predict the locations of other Google Play reviewers with about 44% accuracy.
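To make the mechanics concrete, here's a hypothetical sketch (not the researchers' actual pipeline, with made-up data) of the general approach: train an ordinary text classifier on the reviews of people who disclosed their location somewhere public, then apply it to everyone who didn't.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: review text from users who disclosed a home city elsewhere.
labeled_reviews = [
    "great app but it crashes whenever the subway goes underground",
    "love it, I track my beach runs with this every morning",
    "useless once you're on the L train, no offline mode",
    "perfect for logging surf sessions before work",
]
labeled_cities = ["New York", "Los Angeles", "New York", "Los Angeles"]

# Reviews from users who never disclosed a location anywhere.
unlabeled_reviews = ["keeps freezing in the subway, otherwise fine"]

# An ordinary bag-of-words classifier picks up on location-correlated vocabulary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labeled_reviews, labeled_cities)
print(model.predict(unlabeled_reviews))  # e.g. ['New York']
```

Nothing about this requires exotic technology; the compromising inference falls out of vocabulary that happens to correlate with where people live.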
"Differential privacy" (previously) is a promising, complicated statistical method for analyzing data while preventing reidentification attacks that de-anonymize people in aggregated data-sets.
"Anonymized data" is one of those holy grails, like "healthy ice-cream" or "selectively breakable crypto" — if "anonymized data" is a thing, then companies can monetize their surveillance dossiers on us by selling them to all comers, without putting us at risk or putting themselves in legal jeopardy (to say nothing of the benefits to science and research of being able to do large-scale data analyses and then publish them along with the underlying data for peer review without posing a risk to the people in the data-set, AKA "release and forget").
The millions of Hong Kong people participating in the #612strike uprising are justifiably worried about state retaliation during the protests, given the violent crackdowns on earlier uprisings like the Umbrella Revolution and Occupy Central; they're equally worried about being identified and punished after the fact.
Late last year, a pair of economists released an interesting paper that used mobile location data to estimate how much political polarization had shortened family Thanksgiving dinners in 2016.
Even the most stringent privacy rules have massive loopholes: they all allow for free distribution of "de-identified" or "anonymized" data that is deemed to be harmless because it has been subjected to some process.
Strava is a popular fitness route-tracker focused on sharing the maps of your workouts with others; last November, the company released an "anonymized" data-set of over 3 trillion GPS points, and over the weekend, Institute for United Conflict Analysts co-founder Nathan Ruser started a Twitter thread pointing out the sensitive locations and details the release gives away, including the outlines and activity patterns of secret military bases.
The Australian government's open data initiative is in the laudable business of publishing publicly accessible data about the government's actions and spending, in order to help scholars, businesses and officials understand and improve its processes.
In their Defcon 25 presentation, "Dark Data", journalist Svea Eckert and data scientist Andreas Dewes described how easy it was to get a massive trove of "anonymized" browsing habits (collected by browser plugins) and then re-identify the people in the data-set, discovering (among other things) the porn-browsing habits of a German judge and the medication regime of a German MP.
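The shape of that attack generalizes. Here's a minimal, hypothetical sketch (not Eckert and Dewes's actual code, and with entirely fictional URLs) of a linkage attack: if even a few URLs in a pseudonymous browsing history can be publicly tied to a named person (links they tweeted, an analytics page only they would visit), the whole "anonymized" history collapses onto them.

```python
def link_pseudonyms(anon_histories, public_footprints, min_overlap=3):
    """Match pseudonymous browsing histories against publicly attributable URLs.

    anon_histories:    {pseudonym: set of URLs}  (the "anonymized" release)
    public_footprints: {real_name: set of URLs}  (links a person is publicly
                       known to have shared, or that only they would plausibly open)
    Returns {pseudonym: real_name} for every pseudonym whose history overlaps
    someone's public footprint by at least min_overlap URLs.
    """
    matches = {}
    for pseudonym, history in anon_histories.items():
        for name, footprint in public_footprints.items():
            if len(history & footprint) >= min_overlap:
                matches[pseudonym] = name
                break
    return matches

# Entirely fictional example data:
anon = {
    "user_4821": {
        "https://example.org/article-1",
        "https://translate.example/doc?id=7741",          # only its owner would open this
        "https://analytics.example/account/jdoe_stats",   # ditto
        "https://pharmacy.example/refill?rx=998877",
    }
}
public = {
    "Jane Doe": {
        "https://example.org/article-1",                  # a link she tweeted
        "https://translate.example/doc?id=7741",
        "https://analytics.example/account/jdoe_stats",
    }
}
print(link_pseudonyms(anon, public))  # {'user_4821': 'Jane Doe'}
```

Once the pseudonym is linked to a name, everything else in that history (the pharmacy refill, the porn, the medication searches) comes along for free.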
In Evaluating the privacy properties of telephone metadata, a paper by researchers from Stanford's departments of Law and Computer Science published in Proceedings of the National Academy of Sciences, the authors analyzed metadata from six months' worth of volunteers' phone logs to see what kind of compromising information they could extract from them.
The largest carriers in the world partner with companies like SAP to package up data on your movements, social graph and wake/sleep patterns and sell it to marketing firms.
Economist Paul Mason's blockbuster manifesto Postcapitalism suggests that markets just can't organize products whose major input is information rather than labor or materials, and that this means, for the first time in history, a society based on abundance is conceivable.
Cory Doctorow summarizes the problem with the idea that sensitive personal information can be removed responsibly from big data: computer scientists are pretty sure that's impossible.