Towards complete and error-free genome assemblies of all vertebrate species

Arang Rhie(National Institutes of Health), Shane McCarthy(University of Cambridge), Olivier Fédrigo(Rockefeller University), Joana Damas(University of California, Davis), Giulio Formenti(Rockefeller University), Sergey Koren(National Institutes of Health), Marcela Uliano‐Silva(Berlin Center for Genomics in Biodiversity Research), William Chow(Wellcome Sanger Institute), Arkarachai Fungtammasan(DNAnexus (United States)), Juwan Kim(Seoul National University), Chul Lee(Seoul National University), Byung June Ko(Seoul National University), Mark Chaisson(University of Southern California), Gregory Gedman(Rockefeller University), Lindsey Cantin(Rockefeller University), Françoise Thibaud‐Nissen(National Center for Biotechnology Information), Leanne Haggerty(European Bioinformatics Institute), Iliana Bista(University of Cambridge), Michelle Smith(Wellcome Sanger Institute), Bettina Haase(Rockefeller University), Jacquelyn Mountcastle(Rockefeller University), Sylke Winkler(Center for Systems Biology Dresden), Sadye Paez(Rockefeller University), Jason T. Howard, Sonja C. Vernes(Radboud University Nijmegen), Tanya M. Lama(University of Massachusetts Amherst), Frank Grützner(The University of Adelaide), Wesley C. Warren(University of Missouri), Christopher N. Balakrishnan(East Carolina University), David W. Burt(The University of Queensland), Julia M. George(Clemson University), Matthew T. Biegler(Rockefeller University), David Iorns, Andrew Digby, Daryl Eason, Bruce C. Robertson(University of Otago), Taylor Edwards(University of Arizona), Mark Wilkinson(Natural History Museum), George F. Turner(Bangor University), Axel Meyer(University of Konstanz), Andreas F. Kautt(Harvard University), Paolo Franchini(University of Konstanz), H. William Detrich(Northeastern University), Hannes Svardal(Naturalis Biodiversity Center), Maximilian Wagner(University of Graz), Gavin J. P. 
Naylor(Florida Museum of Natural History), Martin Pippel(Center for Systems Biology Dresden), Milan Malinsky(University of Basel), Mark P. Mooney(Technology Affinity Group), Maria Simbirsky(DNAnexus (United States)), Brett T. Hannigan(DNAnexus (United States)), Trevor Pesout(University of California, Santa Cruz), Marlys L. Houck(Zoological Society of San Diego), Ann Misuraca(Zoological Society of San Diego), Sarah B. Kingan(Pacific Biosciences (United States)), Richard Hall(Pacific Biosciences (United States)), Zev Kronenberg(Pacific Biosciences (United States)), Ivan Sović(Pacific Biosciences (United States)), Christopher Dunn(Pacific Biosciences (United States)), Zemin Ning(Wellcome Sanger Institute), Alex Hastie(BioNano Genomics (United States)), Joyce Lee(BioNano Genomics (United States)), Siddarth Selvaraj(Arima Genomics (United States)), Richard E. Green(University of California, Santa Cruz), Nicholas H. Putnam(Santa Cruz County Office of Education), Marta Gut(Universitat Pompeu Fabra), Jay Ghurye(Dovetail Genomics (United States)), Erik Garrison(University of California, Santa Cruz), Ying Sims(Wellcome Sanger Institute), Joanna Collins(Wellcome Sanger Institute), Sarah Pelan(Wellcome Sanger Institute), James Torrance(Wellcome Sanger Institute), Alan Tracey(Wellcome Sanger Institute), Jonathan Wood(Wellcome Sanger Institute), Robel E. Dagnew(University of Southern California), Dengfeng Guan(Harbin Institute of Technology), Sarah E. London(University of Chicago), David F. Clayton(Clemson University), Claudio V. Mello(Oregon Health & Science University), Samantha R. Friedrich(Oregon Health & Science University), Peter V. Lovell(Oregon Health & Science University), Ekaterina Osipova(Max Planck Institute for the Physics of Complex Systems), Farooq O. 
Al-Ajli(Monash University Malaysia), Simona Secomandi(University of Milan), Heebal Kim(Seoul National University), Constantina Theofanopoulou(Rockefeller University), Michael Hiller(Goethe University Frankfurt), Yang Zhou(BGI Group (China)), Robert S. Harris(Pennsylvania State University), Kateryna D. Makova(Pennsylvania State University), Paul Medvedev(Pennsylvania State University), Jinna Hoffman(National Center for Biotechnology Information), Patrick Masterson(National Center for Biotechnology Information), Karen Clark(National Center for Biotechnology Information), Fergal J. Martin(European Bioinformatics Institute), Kevin Howe(European Bioinformatics Institute), Paul Flicek(European Bioinformatics Institute), Brian P. Walenz(National Institutes of Health), Woori Kwak, Hiram Clawson(University of California, Santa Cruz), Mark Diekhans(University of California, Santa Cruz), Luis R Nassar(University of California, Santa Cruz), Benedict Paten(University of California, Santa Cruz), R.H. Kraus(University of Konstanz), Andrew J. Crawford(Universidad de Los Andes), M. Thomas P. Gilbert(University of Copenhagen), Guojie Zhang(University of Copenhagen), Byrappa Venkatesh(Agency for Science, Technology and Research), Robert W. Murphy(Royal Ontario Museum), Klaus‐Peter Koepfli(Smithsonian Conservation Biology Institute), Beth Shapiro(Howard Hughes Medical Institute), Warren E. Johnson(Smithsonian Institution), Federica Di Palma(University of East Anglia), Tomàs Marquès‐Bonet(Institució Catalana de Recerca i Estudis Avançats), Emma C. Teeling(University College Dublin), Tandy Warnow(University of Illinois Urbana-Champaign), Jennifer A. Marshall Graves(La Trobe University), Oliver A. Ryder(Zoological Society of San Diego), David Haussler(University of California, Santa Cruz), Stephen J. O’Brien(ITMO University), Jonas Korlach(Pacific Biosciences (United States)), Harris A. Lewin(John Muir Health), Kerstin Howe(European Bioinformatics Institute), Eugene W. 
Myers(Center for Systems Biology Dresden), Richard Durbin(University of Cambridge), Adam M. Phillippy(National Institutes of Health), Erich D. Jarvis(Howard Hughes Medical Institute)
Nature
April 28, 2021
Cited by 3,031
Open Access

Abstract

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1–4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

