Researchers show method for de-anonymizing 95% of "anonymous" cellular location data

Unique in the Crowd: The privacy bounds of human mobility, a Nature Scientific Reports paper by MIT researchers and colleagues at Belgium's Université Catholique de Louvain, documents that 95% of "anonymous" location data from cellphone towers can be de-anonymized to the individual level. That is, given data from a region's cellular towers, the researchers can ascribe individuals to 95% of the data points.

“We show that the uniqueness of human mobility traces is high, thereby emphasizing the importance of the idiosyncrasy of human movements for individual privacy,” they explain. “Indeed, this uniqueness means that little outside information is needed to re-identify the trace of a targeted individual even in a sparse, large-scale, and coarse mobility dataset. Given the amount of information that can be inferred from mobility data, as well as the potentially large number of simply anonymized mobility datasets available, this is a growing concern.”

The data they studied came from users in an unidentified European country, possibly Belgium, and consisted of anonymized records collected by their carriers between 2006 and 2007.

Anonymized Phone Location Data Not So Anonymous, Researchers Find [Wired/Kim Zetter]



  1. After reading the paper I wonder if this would be even easier in a Bayesian framework. (Not easier as in faster to calculate, but easier as in less data-hungry and less prone to type 1 errors.) For example: the probability that a subscriber is in cell B given that the subscriber was in cell A, and so on. Building a hierarchical, spatially explicit model based on priors gathered from other data populations might not be so difficult, or am I wrong?

    I’m just starting to learn that Bayesian stuff, can anyone comment on my thoughts?

    1. I’m not sure how other populations would fit into this – you can’t identify someone from the data set of monitored cellphone users, if they’re not in that data set.  Unless I’m totally misunderstanding what a “data population” is…

      As I understand it, what’s been demonstrated is that given three observations of a particular subject’s location and time (specific to instances when they make or receive a phone call), you can then identify their “anonymous” identifier from a set of supposedly anonymized cellphone tracking data.

      So, for instance, if you’ve been with someone when they took two phone calls, and had one phone conversation with them in which they tell you their location, that would be enough to use the supposedly “anonymized” data set of phone tower data to infer a lot of detail about their movements, including things they might wish to keep confidential from you.

      1. You may have misunderstood my crude phrasing. I just wanted to suggest that you could use informative priors (rather than flat priors) based on observed behaviour in other datasets. (You can’t use the same dataset to do that.)

        BTW, their methodology seems to be different from what you suggest, but that’s not very important. I’m still unsure about the interpretation. My current understanding is that Cory’s text could be misleading. De Montjoye et al. seem to show that with four known positions (at four different times) they could establish the uniqueness of traces (that is: find a difference in the movement) for 95% of their dataset. But I’m puzzled by this; it seems too obvious, so I must have overlooked something.

        The real merit of the study is that they show that data aggregation, either over time or over space, doesn’t make a strong difference for identifying “uniqueness”, that is: for telling apart different movements in time and space.

        Which is still… not important in practice, I guess? Because given that there is a unique identifier in the dataset (IMSI, IMEI, UDID, or even just “customer 3”, whatever) we don’t need to find “uniqueness” of movements?
        I don’t get it.

        1. I think you’ve got the methodology right, but are missing the implication – they had the set of time-place-anonymous-identifier tuples, where

          – each tuple represents an instance where a person made or received a call at the time given
          – time is at the granularity of an hour
          – place is at the granularity of a cell tower’s coverage area
          – anonymous-identifier stands in for something that is essentially personally identifying, like an IMEI

          Then, if they randomly selected four tuples T1, T2, T3, T4 with a common anonymous-identifier A, 95% of the time there would be no other anonymous-identifier B such that you could swap in B for A in the four tuples T1-T4, and find existing tuples T1b, T2b, T3b, T4b.
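          The test described above can be sketched roughly as follows. This is only an illustration under assumptions, not the paper's code: the data is assumed to be a list of (hour, tower, anon_id) tuples, and the function names (build_traces, is_unique, fraction_unique) are made up for this example.

```python
import random
from collections import defaultdict

def build_traces(tuples):
    """Group (hour, tower, anon_id) tuples into a set of (hour, tower) points per identifier."""
    traces = defaultdict(set)
    for hour, tower, anon_id in tuples:
        traces[anon_id].add((hour, tower))
    return dict(traces)

def is_unique(traces, anon_id, points):
    """True if no *other* identifier's trace contains all of the given points."""
    return not any(
        other != anon_id and points <= trace
        for other, trace in traces.items()
    )

def fraction_unique(traces, k=4, trials=1000, seed=0):
    """Estimate how often k random points drawn from one trace match no other trace."""
    rng = random.Random(seed)
    # Only identifiers with at least k distinct points can be sampled from.
    eligible = [a for a, t in traces.items() if len(t) >= k]
    if not eligible:
        return 0.0
    hits = 0
    for _ in range(trials):
        anon_id = rng.choice(eligible)
        points = set(rng.sample(sorted(traces[anon_id]), k))
        hits += is_unique(traces, anon_id, points)
    return hits / trials
```

          On the real dataset, the claim in the paper corresponds to fraction_unique(traces, k=4) coming out at about 0.95.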

          So far, so meaninglessly math-y.  But what does it mean in practice?

          Imagine a phone company decides: “Sure, we can publish anonymized cell tower traces for researchers’ use.  We’ll have to remove the personally identifying stuff like IMEIs, and reduce the resolution of the timestamps to an hour.  That should protect users’ privacy.”

          The problem:  If someone gets hold of that data, they can de-anonymize a person’s whole trace by knowing only a few data points that must exist for that person.
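          To make the attack concrete, here is a small hypothetical sketch: given the published records and a few (hour, tower) points you happen to know about a person, list the anonymous identifiers consistent with all of them. The record layout, the function name candidates, and the sample values are assumptions for illustration only.

```python
def candidates(records, known_points):
    """Return the anonymous identifiers whose records include every known (hour, tower) point."""
    seen = {}
    for hour, tower, anon_id in records:
        seen.setdefault(anon_id, set()).add((hour, tower))
    wanted = set(known_points)
    return sorted(a for a, points in seen.items() if wanted <= points)

# Made-up "anonymized" records: (hour, tower, anonymous identifier).
records = [
    (9, "T1", "u17"), (13, "T7", "u17"), (20, "T3", "u17"),
    (9, "T1", "u42"), (13, "T2", "u42"), (20, "T3", "u42"),
]

one_point = candidates(records, [(9, "T1")])               # still ambiguous
two_points = candidates(records, [(9, "T1"), (13, "T7")])  # narrows to one identifier
```

          Once the list shrinks to a single identifier, every other record under that identifier belongs to the person you were looking for.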
