Pete Warden writes on O'Reilly Radar about the problems of anonymizing datasets. AOL, Netflix and others have been burned by releasing datasets that they believed had been stripped of identifiable elements, only to discover that de-anonymizing some or all of the data was easier than they expected. He cites research by Arvind Narayanan, and then makes some practical recommendations for handling user data (including the most important principle: minimize what data you collect in the first place):
Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone's actions has a good chance of matching identifiable public records. Arvind first demonstrated this when he and a fellow researcher took the "anonymous" dataset released as part of the first Netflix Prize and showed how they could correlate the movie rentals listed with public IMDB reviews. That let them identify some named individuals and gave them access to those individuals' complete rental histories. More recently, he and his collaborators used the same approach to win a Kaggle contest by matching the topology of an anonymized version and a publicly crawled version of the social connections on Flickr. They were able to take two partial social graphs and, like piecing together a jigsaw puzzle, figure out which fragments matched and represented the same users in both.
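To make the cross-referencing idea concrete, here is a minimal Python sketch of a linkage attack on toy data. It is not the method the researchers used; it is an illustration under stated assumptions. The datasets, the names `anonymized`, `public_reviews`, `rarity_weights` and `link`, the log-based rarity weighting, and the `min_margin` threshold are all hypothetical choices made for this example. The idea it shows is that a handful of uncommon titles shared between a public profile and an "anonymous" record can single that record out, while common titles alone cannot.

```python
import math
from collections import Counter

# Hypothetical "anonymized" release: opaque IDs mapped to the titles each user rated.
anonymized = {
    "user_17": {"Brazil", "Primer", "Heat", "Alien"},
    "user_42": {"Alien", "Heat", "Titanic"},
    "user_99": {"Titanic", "Alien"},
}

# Hypothetical public reviews posted under real names.
public_reviews = {
    "alice": {"Brazil", "Primer", "Alien"},
    "bob": {"Titanic", "Heat"},
    "carol": {"Alien"},
}


def rarity_weights(records):
    """Weight each title by how rare it is in the release (rarer = more identifying)."""
    counts = Counter(title for titles in records.values() for title in titles)
    total = len(records)
    return {title: math.log(1 + total / count) for title, count in counts.items()}


def link(public, released, min_margin=0.5):
    """Match each public profile to a released record, but only when the best
    candidate beats the runner-up by at least min_margin (an assumed threshold).
    Requires at least two released records."""
    weights = rarity_weights(released)
    matches = {}
    for name, seen in public.items():
        scores = sorted(
            (
                (sum(weights.get(t, 0.0) for t in seen & titles), anon_id)
                for anon_id, titles in released.items()
            ),
            reverse=True,
        )
        (best, best_id), (second, _) = scores[0], scores[1]
        if best - second >= min_margin:
            matches[name] = best_id
    return matches


print(link(public_reviews, anonymized))
```

On this toy data, "alice" and "bob" are each linked to a single record, while "carol", who reviewed only a title that every user rated, stays ambiguous. Once a record is linked, everything else in it (here, the titles the person never reviewed publicly) is exposed, which is the risk the passage describes.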