Study shows detailed, compromising inferences can be readily made with metadata

In "Evaluating the privacy properties of telephone metadata," a paper published in Proceedings of the National Academy of Sciences, researchers from Stanford's departments of Law and Computer Science analyzed metadata from six months' worth of volunteers' phone logs to see what kind of compromising information they could extract from them.

The research comes at a time when the UK Tory government is pushing for sweeping, invasive domestic surveillance powers, with Home Secretary Theresa May arguing that this kind of data is nothing more than an "itemised phone bill."

Another argument advanced in service of "metadata" collection is that agencies can minimise harm by limiting how long they hold data and how many "hops" they follow through their subjects' social graphs. The paper also investigates these claims.

In the case of limiting the number of hops, the researchers found that even a two-hop constraint was effectively useless at limiting collection: because so many people call into a few high-density nodes (a GCHQ research paper on data-mining called these "pizza nodes," and used neighborhood pizza parlors as an example), two hops from a suspect captures nearly everyone. Once the suspect calls a highly trafficked line (like a customer service number for a popular service), then everyone who's ever called that number is now two hops away from the suspect, and within the captured dataset.
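To make the hub effect concrete, here is a minimal sketch of a two-hop sweep over a toy call graph. It is not the researchers' code or data: the names (`suspect`, `pizza_parlor`, `customer_0` and so on), the figure of 1,000 customers, and both helper functions are invented for illustration. It only shows how a single busy number pulls a crowd of unrelated callers inside a two-hop limit.

```python
from collections import deque


def build_graph(calls):
    """Build an undirected call graph from (caller, callee) pairs."""
    graph = {}
    for a, b in calls:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each number within max_hops of start to its hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical toy data: the suspect calls one friend and one busy pizza parlor.
calls = [("suspect", "friend"), ("suspect", "pizza_parlor")]
# Assumption for illustration: 1,000 unrelated customers also called the parlor.
calls += [(f"customer_{i}", "pizza_parlor") for i in range(1000)]
calls += [("friend", "friend_of_friend"), ("friend_of_friend", "stranger")]

graph = build_graph(calls)
captured = within_hops(graph, "suspect", max_hops=2)
one_hop = [n for n, d in captured.items() if d == 1]
two_hop = [n for n, d in captured.items() if d == 2]
print(len(one_hop), len(two_hop))  # prints "2 1001"
```

In this toy graph the suspect has only two direct contacts, yet the two-hop set holds 1,001 numbers, nearly all of them pizza customers with no other link to the suspect.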

The researchers also investigated surveillance agencies' claims that the data in phone records is anonymised and thus not "personally identifying information" (PII). The paper confirms what other researchers have found: it is trivial to re-identify/de-anonymize the records using metadata. A very simple, automated re-identification technique was able to de-anonymize a third of the records. An equally simple manual re-identification process de-anonymized the "overwhelming majority" of records.

The paper goes on to enumerate other kinds of compromising data that could be extracted from an "itemised phone bill," such as location, personal relationships, and "sensitive traits," including "familial, political, professional, religious, and sexual associations."

From the paper's conclusion:

The results of our study are unambiguous: there are significant privacy impacts associated with telephone metadata surveillance. Telephone metadata is densely interconnected, easily reidentifiable, and trivially gives rise to location, relationship, and sensitive inferences. In combination with independent reviews that have found bulk metadata surveillance to be an ineffective intelligence strategy (7, 8), our findings should give policymakers pause when authorizing such programs.

More broadly, this project emphasizes the need for scientifically rigorous surveillance regulation. Much of the law and policy that we explored in this research was informed by assumption and conventional wisdom, not quantitative analysis. To strike an appropriate balance between national security and civil liberties, future policymaking must be informed by input from the relevant sciences.

Our results also bear on commercial data practices. It is routine practice for telecommunications firms to collect, retain, and transfer subscriber telephone records, often dubbed "Customer Proprietary Network Information" (49, 50). Telecommunications regulation should also incorporate a scientifically rigorous understanding of the privacy properties of these data.

There remains much future work to be done in this space. To conduct this study, we were compelled to rely on a small and unrepresentative dataset. Future efforts would benefit from population-scale data; the challenges are in sourcing the data, not computing on them. Future work could also pair telephone records with more comprehensive ground truth than the Facebook data we accessed. Subscriber records and cell site location information, for instance, would better enable testing for inferences. Another potential direction is testing more advanced approaches to automated inferences; the machine-learning techniques we applied in this study were effective, although relatively rudimentary.

Evaluating the privacy properties of telephone metadata [Jonathan Mayer, Patrick Mutchler, and John C. Mitchell/Proceedings of the National Academy of Sciences]

Itemised phone logs reveal scary personal details about you, study finds [Madhumita Murgia/Telegraph]

(Thanks, William Hay!)

(Image: Scanned page from my AT&T phone bill, Dave Winer, CC-BY-SA)