Identification of individuals by trait prediction using whole-genome sequencing data

Christoph Lippert (Human Longevity (United States)), Riccardo Sabatini (Human Longevity (United States)), M. Cyrus Maher (Human Longevity (United States)), Eun Yong Kang (Human Longevity (United States)), Seunghak Lee (Human Longevity (United States)), Okan Arıkan (Human Longevity (United States)), Alena Harley (Human Longevity (United States)), Axel Bernal (Human Longevity (United States)), Peter Garst (Human Longevity (United States)), Victor Lavrenko (Human Longevity (United States)), Ken Yocum (Human Longevity (United States)), Theodore M. Wong (Human Longevity (United States)), Mingfu Zhu (Human Longevity (United States)), Wen-Yun Yang (Human Longevity (United States)), Christopher Chang (Human Longevity (United States)), Tim Lu (Human Longevity (United States)), Charlie W. H. Lee (Human Longevity (United States)), Barry Hicks (Human Longevity (United States)), Smriti Ramakrishnan (Human Longevity (United States)), Haibao Tang (Human Longevity (United States)), Chao Xie, Jason Piper, Suzanne Brewerton, Yaron Turpaz (Human Longevity (United States)), Amalio Telenti (Human Longevity (United States)), Rhonda K. Roby (J. Craig Venter Institute), Franz Josef Och (Human Longevity (United States)), J. Craig Venter (J. Craig Venter Institute)
Proceedings of the National Academy of Sciences
September 5, 2017
Cited by 159 · Open Access

Abstract

Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
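
The matching task described above, deciding which phenotype record in a pool of 10 belongs to a given genome, can be illustrated with a toy sketch. This is not the paper's maximum entropy algorithm: it assumes a simpler stand-in, a softmax over negative weighted squared distances between genome-predicted and measured traits, and it runs on synthetic Gaussian data. All function names, the trait weights, and the noise parameter are hypothetical.

```python
import math
import random


def match_probabilities(predicted, observed_pool, weights):
    """Softmax over negative weighted squared distances.

    predicted     -- trait values predicted from one genome (hypothetical input)
    observed_pool -- measured trait vectors, one per candidate individual
    weights       -- per-trait weights (assumed; e.g. higher for reliable traits)
    """
    scores = []
    for observed in observed_pool:
        dist = sum(w * (p - o) ** 2
                   for w, p, o in zip(weights, predicted, observed))
        scores.append(-dist)
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identify(predicted, observed_pool, weights):
    """Return the index of the most probable candidate in the pool."""
    probs = match_probabilities(predicted, observed_pool, weights)
    return max(range(len(probs)), key=probs.__getitem__)


def simulate(pool_size=10, n_traits=5, noise=0.5, trials=200, seed=0):
    """Fraction of trials in which the true individual is picked.

    Traits are synthetic standard-normal draws; `noise` is the standard
    deviation of the simulated prediction error (an assumption, not a
    value taken from the paper).
    """
    rng = random.Random(seed)
    weights = [1.0] * n_traits
    correct = 0
    for _ in range(trials):
        pool = [[rng.gauss(0, 1) for _ in range(n_traits)]
                for _ in range(pool_size)]
        target = rng.randrange(pool_size)
        predicted = [t + rng.gauss(0, noise) for t in pool[target]]
        if identify(predicted, pool, weights) == target:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    print(simulate())
```

In this toy setting, no single noisy trait needs to separate candidates on its own; the combined distance across traits does the work, which mirrors the abstract's point that integrating several weak predictions can still single out an individual.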

