Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Valérie Schneider (National Institutes of Health), Tina A. Graves-Lindsay (James S. McDonnell Foundation), Kerstin Howe (Wellcome Sanger Institute), Nathan Bouk (National Institutes of Health), Hsiu-Chuan Chen (National Institutes of Health), Paul Kitts (National Institutes of Health), Terence D. Murphy (National Institutes of Health), Kim D. Pruitt (National Institutes of Health), Françoise Thibaud‐Nissen (National Institutes of Health), Derek Albracht (James S. McDonnell Foundation), Robert S. Fulton (James S. McDonnell Foundation), Milinn Kremitzki (James S. McDonnell Foundation), Vincent Magrini (James S. McDonnell Foundation), Chris Markovic (James S. McDonnell Foundation), Sean McGrath (James S. McDonnell Foundation), Karyn Meltz Steinberg (James S. McDonnell Foundation), Kate Auger (Wellcome Sanger Institute), William Chow (Wellcome Sanger Institute), Joanna Collins (Wellcome Sanger Institute), Glenn Harden (Wellcome Sanger Institute), Tim Hubbard (Wellcome Sanger Institute), Sarah Pelan (Wellcome Sanger Institute), Jared T. Simpson (Wellcome Sanger Institute), Glen Threadgold (Wellcome Sanger Institute), James Torrance (Wellcome Sanger Institute), Jonathan Wood (Wellcome Sanger Institute), Laura Clarke (European Bioinformatics Institute), Sergey Koren (National Institutes of Health), Matthew Boitano (Pacific Biosciences (United States)), Paul Peluso (Pacific Biosciences (United States)), Heng Li (Broad Institute), Chen-Shan Chin (Pacific Biosciences (United States)), Adam M. Phillippy (National Institutes of Health), Richard Durbin (Wellcome Sanger Institute), Richard K. Wilson (James S. McDonnell Foundation), Paul Flicek (European Bioinformatics Institute), Evan E. Eichler (Howard Hughes Medical Institute), Deanna M. Church (National Institutes of Health)
Genome Research
April 10, 2017
Cited by 1,280 | Open Access

Abstract

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
