In their Defcon 25 presentation, "Dark Data", journalist Svea Eckert and data scientist Andreas Dewes described how easy it was to get a massive trove of "anonymized" browsing habits (collected by browser plugins) and then re-identify the people in the data-set, discovering, among other things, the porn-browsing habits of a German judge and the medication regimen of a German MP.
The pair were making a point about the ease of "re-identification" attacks on data-sets that have been "anonymized" – a very active field of research, and one that is especially relevant because the EU's strict data-handling rules can be bypassed if you "anonymize" your data.
The data they were eventually given came, for free, from a data broker that was willing to let them test their hypothetical AI advertising platform. And while the data-set was nominally anonymous, many of its users proved easy to de-anonymise.
Dewes described some methods by which a canny broker can find an individual in the noise, just from a long list of URLs and timestamps. Some make things very easy: for instance, anyone who visits their own analytics page on Twitter ends up with a URL in their browsing record which contains their Twitter username, and is only visible to them. Find that URL, and you’ve linked the anonymous data to an actual person. A similar trick works for German social networking site Xing.
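The Twitter trick above can be sketched in a few lines. This is a toy illustration, not the pair's actual tooling: the URL pattern (`analytics.twitter.com/user/<username>/...`) reflects where Twitter's analytics dashboard has historically lived, and the example history is invented.

```python
import re

# Assumed pattern: the Twitter analytics dashboard is served from
# analytics.twitter.com/user/<username>/..., a page only its owner can view.
# Any such URL in a browsing record therefore leaks the owner's handle.
ANALYTICS_RE = re.compile(r"https?://analytics\.twitter\.com/user/([^/]+)/")

def find_twitter_usernames(urls):
    """Return every username leaked by analytics URLs in a browsing record."""
    found = set()
    for url in urls:
        match = ANALYTICS_RE.match(url)
        if match:
            found.add(match.group(1))
    return found

# Hypothetical browsing record: one analytics visit gives the game away.
history = [
    "https://www.example.com/news",
    "https://analytics.twitter.com/user/some_judge/home",
]
print(find_twitter_usernames(history))  # {'some_judge'}
```

The same idea works for any site that embeds the logged-in user's identifier in a URL only they would visit, which is why the Xing variant Dewes mentioned behaves identically.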
For other users, a more probabilistic approach can deanonymise them. For instance, a mere 10 URLs can be enough to uniquely identify someone – just think of how few people share your employer, your bank, your hobby, your preferred newspaper and your mobile phone provider. By creating “fingerprints” from the data, it’s possible to compare them to other, more public, sources of what URLs people have visited, such as social media accounts, or public YouTube playlists.