The Australian government's open data initiative is in the laudable business of publishing publicly accessible data about the government's actions and spending, in order to help scholars, businesses and officials understand and improve its processes.
As part of that initiative, the government published a 10% sample of its Medicare Benefits Schedule (MBS) claims records, after encrypting the patient and provider ID numbers to protect the privacy of the Australians whose sensitive health information was contained in the database.
This is called "de-identification" and holy shit, it's very, very hard to get right. When you have a lot of data to work with, there are so many ways to fill in the blanks, from using metadata to uncover the identities of people in the set (called "linkage attacks") to targeting weaknesses in the encryption used to scramble the data ("encryption attacks").
A group of computer scientists from the University of Melbourne conducted an encryption attack on the scrambled provider identities in the dataset and were able to easily unscramble them (they didn't attempt to unscramble the patient identifiers). They reported this weakness to the Australian government, which quickly pulled the dataset offline.
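To get a feel for why scrambled IDs can be so fragile, here is a toy sketch in Python. It is not the scheme the government used, nor the researchers' actual method. It assumes a hypothetical setup in which each numeric ID is "encrypted" with an unsalted, unkeyed SHA-256 hash and the IDs come from a small range (under 100,000 here). The function names and sample IDs are invented for illustration. Under those assumptions, an attacker can simply scramble every possible ID and look up the published values.

```python
import hashlib

def scramble(provider_id):
    # Hypothetical weak scheme: unsalted, unkeyed SHA-256 of the ID.
    # An assumption for illustration, not the real MBS algorithm.
    return hashlib.sha256(str(provider_id).encode()).hexdigest()

def build_lookup(max_id):
    # Scramble every candidate ID once, mapping scrambled value -> original ID.
    return {scramble(i): i for i in range(max_id)}

# Invented sample IDs standing in for a "published" scrambled column.
published = [scramble(i) for i in (48213, 907, 77001)]

# The attacker only needs the scrambling method and the size of the ID space.
lookup = build_lookup(100_000)
recovered = [lookup[s] for s in published]
print(recovered)
```

The general lesson is that scrambling is only as strong as its secret: if the method can be worked out and the space of possible IDs is small, exhaustive enumeration undoes it.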
This is an interesting story because the researchers found vulnerabilities in the encryption, rather than re-identifying by using metadata, which is generally considered the easiest route to re-identification.
In a fascinating postmortem, the researchers describe the implications of their work for the important practice of publishing government datasets for research and evaluation, while protecting sensitive information.
The Australian Government's open data program provides numerous benefits, allowing better decisions to be made based on evidence, careful analysis, and widespread access to accurate information. Decisions about data publication itself should follow the same philosophy.
We have some important decisions to make about what personal data to publish and how it should be anonymised, encrypted or linked. Making good decisions requires accurate technical information about the security of the system and the secrecy of the data.
Details about the privacy protections should be published long in advance.
They can then be subject to empirical testing, scientific analysis, and open public review, before they are used on real data. Then we can make sound, evidence-based decisions about how to benefit from open data without sacrificing individual privacy.
Understanding the maths is crucial for protecting privacy [Dr Chris Culnane, Dr Benjamin Rubinstein and Dr Vanessa Teague/Department of Computing and Information Systems, University of Melbourne]
(via 4 Short Links)