Semi-automated assembly of high-quality diploid human reference genomes

Erich D. Jarvis(Howard Hughes Medical Institute), Giulio Formenti(Rockefeller University), Arang Rhie(National Institutes of Health), Andrea Guarracino(Human Technopole), Chentao Yang(BGI Group (China)), Jonathan Wood(Wellcome Sanger Institute), Alan Tracey(Wellcome Sanger Institute), Françoise Thibaud‐Nissen(National Institutes of Health), Mitchell R. Vollger(University of Washington), David Porubskỳ(University of Washington), Haoyu Cheng(Harvard University), Mobin Asri(University of California, Santa Cruz), Glennis A. Logsdon(University of Washington), P. Carnevali(Chan Zuckerberg Initiative (United States)), Mark Chaisson(University of Southern California), Chen-Shan Chin, Sarah Cody(James S. McDonnell Foundation), Joanna Collins(Wellcome Sanger Institute), Peter Ebert(Heinrich Heine University Düsseldorf), Merly Escalona(University of California, Santa Cruz), Olivier Fédrigo(Rockefeller University), Robert S. Fulton(James S. McDonnell Foundation), Lucinda Fulton(James S. McDonnell Foundation), Shilpa Garg(University of Copenhagen), Jennifer L. Gerton(Stowers Institute for Medical Research), Jay Ghurye(Dovetail Genomics (United States)), Anastasiya Granat(Illumina (United States)), Richard E. Green(University of California, Santa Cruz), William T. Harvey(University of Washington), Patrick Hasenfeld(European Molecular Biology Laboratory), Alex Hastie(BioNano Genomics (United States)), Marina Haukness(University of California, Santa Cruz), Erich Jaeger(Illumina (United States)), Miten Jain(University of California, Santa Cruz), Melanie Kirsche(Johns Hopkins University), Mikhail Kolmogorov(University of California San Diego), Jan O. Korbel(European Molecular Biology Laboratory), Sergey Koren(National Institutes of Health), Jonas Korlach(Pacific Biosciences (United States)), Joyce Lee(BioNano Genomics (United States)), Daofeng Li(Washington University in St. Louis), Tina Lindsay(James S. McDonnell Foundation), Julian Lucas(University of California, Santa Cruz), Feng Luo(Clemson University), Tobias Marschall(Heinrich Heine University Düsseldorf), Matthew W. Mitchell(Coriell Institute For Medical Research), Jennifer McDaniel(National Institute of Standards and Technology), Fan Nie(Central South University), Hugh E. Olsen(University of California, Santa Cruz), Nathan D. Olson(National Institute of Standards and Technology), Trevor Pesout(University of California, Santa Cruz), Tamara Potapova(Stowers Institute for Medical Research), Daniela Puiu(Johns Hopkins University), Allison Regier(DNAnexus (United States)), Jue Ruan(Agricultural Genomics Institute at Shenzhen), Steven L. Salzberg(Johns Hopkins University), Ashley D. Sanders(Max Delbrück Center), Michael C. Schatz(Johns Hopkins University), Anthony D. Schmitt(Arima Genomics (United States)), Valérie Schneider(National Institutes of Health), Siddarth Selvaraj(Arima Genomics (United States)), Kishwar Shafin(University of California, Santa Cruz), Alaina Shumate(Johns Hopkins University), Nathan O. Stitziel(James S. McDonnell Foundation), Catherine Stober(European Molecular Biology Laboratory), James Torrance(Wellcome Sanger Institute), Justin Wagner(National Institute of Standards and Technology), Jianxin Wang(Central South University), Aaron M. Wenger(Pacific Biosciences (United States)), Chuan‐Le Xiao(Sun Yat-sen University), Aleksey V. Zimin(Johns Hopkins University), Guojie Zhang(Zhejiang University), Ting Wang(James S. McDonnell Foundation), Heng Li(Dana-Farber Cancer Institute), Erik Garrison(University of Tennessee Health Science Center), David Haussler(Howard Hughes Medical Institute), Ira M. Hall(Yale University), Justin M. Zook(National Institute of Standards and Technology), Evan E. Eichler(Howard Hughes Medical Institute), Adam M. Phillippy(National Institutes of Health), Benedict Paten(University of California, Santa Cruz), Kerstin Howe(Wellcome Sanger Institute), Karen H. Miga(University of California, Santa Cruz)
Nature
October 19, 2022
Cited by 226
Open Access

Abstract

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and it does not represent a biological genome because it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
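
The two headline assembly metrics in the abstract (average gaps per chromosome, and chromosome length within ±1% of CHM13) can be illustrated with a minimal sketch. This is not the consortium's evaluation pipeline. It assumes that a gap is a maximal run of `N` bases in the assembled sequence, and the function names (`count_gaps`, `mean_gaps_per_chromosome`, `within_tolerance`) and the toy sequences are hypothetical, chosen only for illustration.

```python
import re
from typing import Dict


def count_gaps(sequence: str) -> int:
    """Count gaps, assumed here to be maximal runs of 'N' (either case)."""
    return len(re.findall(r"[Nn]+", sequence))


def mean_gaps_per_chromosome(assembly: Dict[str, str]) -> float:
    """Average the per-chromosome gap counts over all sequences."""
    return sum(count_gaps(seq) for seq in assembly.values()) / len(assembly)


def within_tolerance(length: int, reference_length: int,
                     tolerance: float = 0.01) -> bool:
    """True if length is within +/- tolerance (as a fraction) of the reference."""
    return abs(length - reference_length) <= tolerance * reference_length


# Toy data (hypothetical, not real chromosomes): name -> sequence.
toy_assembly = {
    "chrA": "ACGTNNNNACGTNNACGT",  # two runs of N
    "chrB": "ACGTACGTNNNACGT",     # one run of N
}
# Hypothetical reference lengths standing in for CHM13 chromosome lengths.
toy_reference_lengths = {"chrA": 18, "chrB": 16}

average_gaps = mean_gaps_per_chromosome(toy_assembly)
length_checks = {
    name: within_tolerance(len(seq), toy_reference_lengths[name])
    for name, seq in toy_assembly.items()
}
```

In practice the sequences would be read from a FASTA file and the reference lengths from the CHM13 assembly; both are replaced by in-memory toy values here so the sketch stays self-contained.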
