Remember when they caught the Golden State Killer by comparing crime-scene DNA evidence to big commercial genomic databases (like those maintained by Ancestry.com, 23andMe, etc.) to find his family members and then track him down?
It's not just him.
If you're an American of European descent, there's a 60% chance that you can be identified from genomic database searches, because even if you've never signed up for one of these junk science services, your stupid cousins have.
That's the conclusion of a group of computer science, computational biology, genomics, and public health researchers from Columbia and the Hebrew University of Jerusalem, who published their findings in the journal Science: Identity inference of genomic data using long-range familial searches (Sci-Hub mirror).
They also predict that in the "near future," "nearly any US individual of European descent" will be identifiable from commercial genomic databases.
The researchers propose a mitigation technique for avoiding nonconsensual genetic profiling: "DTC providers should cryptographically sign the text file containing the raw data available to customers (fig. S6). Third-party services will be able to authenticate that a raw genotyping file was created by a valid DTC provider and not further modified. If adopted, our approach has the potential to prevent the exploitation of long-range familial searches to identify research subjects from genomic data. Moreover, it will complicate the ability to conduct unilaterally long-range familial searches from DNA evidence."
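To make the "sign the raw data file" idea concrete, here is a minimal sketch of a tamper check on a raw genotype file. It is not the scheme from the paper's fig. S6. To stay within Python's standard library it uses an HMAC, which is a symmetric stand-in. A real deployment would need a public-key signature (for example Ed25519) so that third-party services could verify files without holding the provider's secret. The key, the function names, and the sample file contents are all made up for illustration.

```python
import hashlib
import hmac

# Hypothetical provider secret. With a real public-key signature, the provider
# would keep a private key and publish the matching public key instead.
PROVIDER_KEY = b"demo-provider-secret"


def sign_raw_file(raw_bytes: bytes, key: bytes = PROVIDER_KEY) -> str:
    """Return a hex tag tied to the exact bytes of the raw data file."""
    return hmac.new(key, raw_bytes, hashlib.sha256).hexdigest()


def verify_raw_file(raw_bytes: bytes, tag: str, key: bytes = PROVIDER_KEY) -> bool:
    """Check that the file is byte-for-byte what the provider tagged."""
    expected = sign_raw_file(raw_bytes, key)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, tag)


# Made-up two-line genotype file, not real data or a real provider format.
raw = b"# rsid\tchromosome\tposition\tgenotype\nrs0000001\t1\t1000\tAA\n"
tag = sign_raw_file(raw)

untouched_ok = verify_raw_file(raw, tag)
edited_ok = verify_raw_file(raw.replace(b"AA", b"AG"), tag)
```

Under this sketch, a third-party service would refuse any upload whose tag fails to verify. That covers both a file edited after download and a file that never came from a provider at all, such as one built from crime-scene DNA.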
From the paper's abstract: "Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European-descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers. Moreover, the technique could implicate nearly any US-individual of European-descent in the near future. We demonstrate that the technique can also identify research participants of a public sequencing project. Based on these results, we propose a potential mitigation strategy and policy implications to human subject research."
Identity inference of genomic data using long-range familial searches [Yaniv Erlich, Tal Shor, Itsik Pe'er and Shai Carmi/Science] (Sci-Hub mirror)