Identifying Personal Genomes by Surname Inference

Melissa Gymrek, Amy L. McGuire, David Golan, Eran Halperin, Yaniv Erlich
Pages 321-324
Anonymity Compromised. The balance between maintaining individual privacy and sharing genomic information for research purposes has been a topic of considerable controversy. Gymrek et al. (p. 321; see the Policy Forum by Rodriguez et al.) demonstrate that the anonymity of participants (and their families) can be compromised by analyzing Y-chromosome sequences from public genetic genealogy Web sites that contain (sometimes distant) relatives with the same surname. Short tandem repeats (STRs) on…

Identity inference of genomic data using long-range familial searches

Testing models of relatedness, Erlich et al. show that many individuals of European ancestry in the United States, even those who have not undergone genetic testing, can be identified on the basis of available genetic information, indicating a need for procedures to help maintain genetic privacy for individuals.

Found your DNA on the web: reconciling privacy and progress.

The researchers used surname inferences from commercial genealogy databases and Internet searches to deduce the identity of nearly fifty research participants whose supposedly private data were stored in large, publicly available datasets.

Challenges in Genomic Privacy: An Analysis of Surname Attacks in the Population of Britain

In 2013, Gymrek et al. reported that personal genomes can be re-identified through surname inference using patrilineal information inherent in the Y chromosome. They highlighted that the attack is

Identification of Anonymous DNA Using Genealogical Triangulation

This work presents a “genealogical triangulation” algorithm and shows that for over 50% of targets, their anonymous DNA can be identified (matched to the correct individual or same-sex sibling) when the genetic database includes just 1% of the population.

Reconciling Utility with Privacy in Genomics

An obfuscation mechanism is proposed that enables genomic data to be publicly available for research while protecting the genomic privacy of the individuals in a family, along with an extension of the optimization algorithm to cope with the non-linear constraints induced by the correlations between SNPs.

Genomics: Finding Mr Anonymous

Y-STR haplotypes derived from personal whole-genome sequences could be combined with associated demographic data to identify the individual participant in some cases, and could be used to deduce the participant's known surname with a ~12% success rate.
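The idea of matching a Y-STR haplotype to a surname can be illustrated with a minimal sketch. It assumes a haplotype is a mapping from marker name to repeat count and that a genealogy database is a list of (surname, haplotype) records. The function names, thresholds, and toy records are hypothetical, and the nearest-match rule is a simplification for illustration, not the procedure used in the study.

```python
def count_mismatches(query, record):
    """Return (mismatching markers, shared markers) between two haplotypes."""
    shared = query.keys() & record.keys()
    diff = sum(1 for marker in shared if query[marker] != record[marker])
    return diff, len(shared)


def infer_surname(query, database, max_mismatches=1, min_shared=3):
    """Return the surname of the closest record, or None if nothing is close.

    max_mismatches and min_shared are illustrative thresholds (assumptions).
    """
    best = None
    for surname, haplotype in database:
        diff, shared = count_mismatches(query, haplotype)
        if shared < min_shared or diff > max_mismatches:
            continue
        if best is None or diff < best[1]:
            best = (surname, diff)
    return best[0] if best else None


# Toy database with made-up repeat counts (hypothetical data).
database = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
]

close_query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
far_query = {"DYS19": 17, "DYS390": 21, "DYS391": 9, "DYS393": 15}

print(infer_surname(close_query, database))
print(infer_surname(far_query, database))
```

Returning None when no record is close enough reflects the point that surname recovery succeeds only in a fraction of cases; a realistic attack would also need to weigh how common the surname is and combine it with demographic data such as age and location.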

A utility maximizing and privacy preserving approach for protecting kinship in genomic databases

Results indicate that concurrent sharing of data pertaining to a parent and an offspring results in high risks to kinship privacy, whereas sharing data from more distant relatives together is often safer, and it is shown that the arrival order of family members has a high impact on the level of privacy risks and on the utility of sharing data.

Attacks on genetic privacy via uploads to genealogical databases

Several methods by which an adversary who wants to learn the genotypes of people in the database can do so by uploading multiple datasets are described, and simple-to-implement suggestions that will prevent the exploits are provided.

Pedigrees and Perpetrators: Uses of DNA and Genealogy in Forensic Investigations.

S. Katsanis. Annual Review of Genomics and Human Genetics, 2020.
The necessary policies will take time to develop but can be informed by reflection on the familial searching policies developed for searches of the federal DNA database and considerations of the anonymity and privacy interests of civilians.

Addressing the concerns of the Lacks family: quantification of kin genomic privacy

This work formalizes the problem and details an efficient reconstruction attack based on graphical models and belief propagation, and introduces the quantification of health privacy, specifically the measure of how well the predisposition to a disease is concealed from an attacker.



Founders, Drift, and Infidelity: The Relationship between Y Chromosome Diversity and Patrilineal Surnames

A comparative analysis of published data on Y diversity within Irish surnames demonstrates a relative lack of surname frequency dependence of coancestry, a difference probably mediated through distinct Irish and British demographic histories including even more marked genetic drift in Ireland.

Y-chromosomes and the extent of patrilineal ancestry in Irish surnames

Ireland has one of the oldest systems of patrilineal hereditary surnames in the world and there is a substantial role for the Y-chromosome and a molecular genealogical approach to complement and expand existing sources.

From linkage maps to quantitative trait loci: the history and science of the Utah genetic reference project.

The families recruited from Utah provided the most widely used samples in the Centre d'Etudes du Polymorphisme Humain set, were instrumental in generating human linkage maps, and often serve as the benchmark for establishing allele frequency when a new variant is identified.

A new statistic and its power to infer membership and phenotype in a genome-wide association study using genotype frequencies

Using a likelihood-based statistical framework, an improved statistic is developed that uses genotype frequencies and individual genotypes to infer whether a specific individual or any close relatives participated in the GWAS and, if so, what the participant's phenotype status is.

lobSTR: A short tandem repeat profiler for personal genomes

The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling, and the algorithm was used to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome.
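To make concrete what an STR genotype is, the following minimal sketch counts the longest run of back-to-back copies of a repeat motif in a DNA string. It is a naive scan written for illustration only and is not how lobSTR works; the function name and the toy sequence are hypothetical.

```python
def longest_tandem_run(sequence, motif):
    """Return the largest number of consecutive copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    k = len(motif)
    best = 0
    for start in range(len(sequence) - k + 1):
        run = 0
        pos = start
        # Count how many copies of the motif follow each other from here.
        while sequence[pos:pos + k] == motif:
            run += 1
            pos += k
        best = max(best, run)
    return best


# Toy read: two flanking bases, four GATA repeats, two flanking bases.
read = "TT" + "GATA" * 4 + "CC"
print(longest_tandem_run(read, "GATA"))
```

The repeat count at each marker is the value that would populate a Y-STR haplotype. Real profilers must also handle sequencing errors, reads that do not span the full repeat, and alignment of the flanking regions, none of which this sketch attempts.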

Surnames and the Y chromosome.

A randomly ascertained sample of males with the surname "Sykes" was typed with four Y-chromosome microsatellites, which points to a single surname founder for extant Sykes males, even though written sources had predicted multiple origins.

A map of human genome variation from population-scale sequencing

The pilot phase of the 1000 Genomes Project is presented, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms, and the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants are described.

Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays

High-density single nucleotide polymorphism genotyping microarrays are used to demonstrate the ability to accurately and robustly determine whether individuals are in a complex genomic DNA mixture, and suggest future research efforts into assessing the viability of previously sub-optimal DNA sources due to sample contamination.
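One way to illustrate mixture membership testing is a per-SNP distance comparison: for each SNP, ask whether the individual's allele frequency (0, 0.5, or 1) lies closer to the mixture's frequency or to a reference population's frequency, then summarize the differences with a one-sample t-like statistic. The sketch below assumes that formulation; the function name and all frequencies are made-up illustrative values, and it should not be read as the authors' implementation.

```python
import math
from statistics import mean, stdev


def membership_statistic(individual, mixture, reference):
    """Return a t-like statistic; positive values suggest the individual is
    closer to the mixture than to the reference population.

    Each argument is a list of allele frequencies, one entry per SNP.
    """
    distances = [
        abs(y - ref) - abs(y - mix)
        for y, mix, ref in zip(individual, mixture, reference)
    ]
    return mean(distances) / (stdev(distances) / math.sqrt(len(distances)))


# Hypothetical allele frequencies at six SNPs.
individual = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0]
reference = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
mixture = [0.6, 0.5, 0.4, 0.7, 0.5, 0.3]

print(membership_statistic(individual, mixture, reference))
```

With only six SNPs the statistic is meaningless in practice; the power of this kind of test comes from aggregating tiny per-SNP shifts across hundreds of thousands of markers on a genotyping array.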