Researchers show method for de-anonymizing 95% of "anonymous" cellular location data


4 Responses to “Researchers show method for de-anonymizing 95% of "anonymous" cellular location data”

  1. Luther Blissett says:

    After reading the paper I wonder if this would be even easier in a Bayesian framework. (Not easier as in faster to calculate, but easier as in less data-hungry and less prone to type 1 errors.) Probability that a subscriber is in cell B given that the subscriber was in cell A, and so on. Building a hierarchical, spatially explicit model based on priors gathered from other data populations might not be so difficult, or am I wrong?

    I’m just starting to learn that Bayesian stuff, can anyone comment on my thoughts?
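The idea in the comment above can be sketched as a toy Dirichlet-multinomial model (this is only one way to read "informative priors from another population"; all names and numbers below are invented for illustration, not from the paper):

```python
import numpy as np

# Toy sketch: use transition counts observed in a *different* population
# as a Dirichlet prior over P(next cell | current cell), then update
# with the sparse dataset under study.
n_cells = 5
rng = np.random.default_rng(0)

# "Other population" data -> informative prior pseudo-counts
prior_counts = rng.integers(1, 20, size=(n_cells, n_cells))

# Thin observations from the target dataset
observed = rng.integers(0, 3, size=(n_cells, n_cells))

# Posterior mean of the Dirichlet-multinomial: blend prior and data, so
# even transitions with few observations get a sensible estimate.
posterior = prior_counts + observed
p_next_given_current = posterior / posterior.sum(axis=1, keepdims=True)

print(p_next_given_current[0])  # P(next cell | currently in cell 0)
```

With flat priors, rarely observed transitions would be dominated by noise; the pseudo-counts from the other population regularize them.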

    • dragonfrog says:

      I’m not sure how other populations would fit into this – you can’t identify someone from the data set of monitored cellphone users, if they’re not in that data set.  Unless I’m totally misunderstanding what a “data population” is…

      As I understand it, what’s been demonstrated is that given four observations of a particular subject’s location and time (specific to instances when they make or receive a phone call), you can then identify their “anonymous” identifier from a set of supposedly anonymized cellphone tracking data.

      So, for instance, if you’ve been with someone when they took three phone calls, and had one phone conversation with them in which they tell you their location, that would be enough to use the supposedly “anonymized” data set of phone tower data to infer a lot of detail about their movements, including things they might wish to keep confidential from you.

      • Luther Blissett says:

        You may have misunderstood my crude phrasing. I just wanted to suggest that you could use informative priors (rather than flat priors) based on observed behaviour in other datasets. (You can’t use the same dataset to do that.)

        BTW, their methodology seems to be different from what you suggest, but that’s not very important. I’m still unsure about the interpretation. My current understanding is that Cory’s text could be misleading. Montjoye et al. seem to prove that given four known positions (at four different times) they could determine the uniqueness of the traces (that is: find a difference in the movement) of 95% of their dataset. But I’m puzzled by this, this seems too obvious, I must have overlooked something.

        The real merit of the study is that they show that aggregating the data over time or over space doesn’t make a strong difference for identifying “uniqueness”, that is: for telling apart different movements in time and space.

        Which is still… not important in practice, I guess? Because given that there is a unique identifier in the dataset (IMSI, IMEI, UDID, or even just “customer 3”, whatever) we don’t need to find “uniqueness” of movements?
        I don’t get it.

        • dragonfrog says:

          I think you’ve got the methodology right, but are missing the implication – they had the set of time-place-anonymous-identifier tuples, where

          - each tuple represents an instance where a person made or received a call at the time given
          - time is at the granularity of an hour
          - place is at the granularity of a cell tower’s coverage area
          - anonymous-identifier stands in for something actually essentially personally identifying, like an IMEI

          Then, if they randomly selected four tuples T1, T2, T3, T4 with a common anonymous-identifier A, 95% of the time there would be no other anonymous-identifier B such that you could swap in B for A in the four tuples T1-T4, and find existing tuples T1b, T2b, T3b, T4b.
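The check described above can be sketched on synthetic data. (Everything here is invented: de Montjoye et al. worked with real call records, not the uniform-random ones generated below, so the exact fraction will differ.)

```python
import random
from collections import defaultdict

# Toy stand-in for the (hour, tower, anonymous-identifier) tuples
# described above.
random.seed(1)
records = set()
for anon_id in range(200):          # 200 "anonymous" subscribers
    for _ in range(50):             # ~50 calls each
        records.add((random.randrange(24 * 30),  # hour within a month
                     random.randrange(100),      # cell tower id
                     anon_id))

by_user = defaultdict(set)
for hour, tower, anon_id in records:
    by_user[anon_id].add((hour, tower))

def is_unique(anon_id, k=4):
    """Pick k random (hour, tower) points from this user's trace and
    check that no *other* user's trace also contains all k points."""
    pts = random.sample(sorted(by_user[anon_id]), k)
    return not any(all(p in trace for p in pts)
                   for other, trace in by_user.items() if other != anon_id)

frac = sum(is_unique(u) for u in by_user) / len(by_user)
print(f"fraction uniquely pinned down by 4 points: {frac:.2f}")
```

On this random toy data essentially every subscriber is unique; the paper's contribution is measuring how high that fraction stays on real traces even after coarsening time and space.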

          So far, so meaninglessly math-y.  But what does it mean in practice?

          It means the following. Suppose a phone company decides: “Sure, we can publish anonymized cell tower traces for researchers’ use.  We’ll have to remove the personally identifying stuff like IMEIs, and reduce the resolution of the timestamps to an hour.  That should protect users’ privacy.”

          The problem:  If someone gets a hold of that data, they can de-anonymize it by knowing only a few data points that must exist for a particular person.
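That de-anonymization step fits in a few lines. (All identifiers, towers, and sightings below are made up for illustration.)

```python
# Given a few known (hour, tower) sightings of a target, find which
# "anonymous" trace in the published dataset contains all of them.
traces = {
    "anon-17": {(9, "tower-A"), (13, "tower-B"), (20, "tower-C"), (22, "tower-A")},
    "anon-42": {(9, "tower-A"), (14, "tower-D"), (20, "tower-C")},
    "anon-99": {(8, "tower-E"), (13, "tower-B"), (21, "tower-F")},
}
known_sightings = {(9, "tower-A"), (13, "tower-B"), (20, "tower-C")}

matches = [anon for anon, trace in traces.items()
           if known_sightings <= trace]   # subset test
print(matches)  # only "anon-17" contains all three sightings
```

Once the match is unique, the target's entire trace, including every movement the attacker never observed directly, is exposed.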
