Petabase-scale sequence alignment catalyses viral discovery

R. C. Edgar; Jeff Taylor; Victor S.-Y. Lin; Tomer Altman; Pierre Barbera; Dmitry Meleshko; Dan Lohr; Gherman Novakovsky; Benjamin Buchfink; Basem Al-Shayeb; Jillian F. Banfield; Marcos de la Peña; Anton Korobeynikov; Rayan Chikhi; Artem Babaian

doi:10.1101/2020.08.07.241729

R. C. Edgar, Jeff Taylor, Victor S.-Y. Lin, Tomer Altman (Verisk Analytics (United States)), Pierre Barbera (Heidelberg Institute for Theoretical Studies), Dmitry Meleshko (St Petersburg University), Dan Lohr, Gherman Novakovsky (University of British Columbia), Benjamin Buchfink (Max Planck Institute for Developmental Biology), Basem Al-Shayeb (University of California, Berkeley), Jillian F. Banfield (Planetary Science Institute), Marcos de la Peña (Instituto de Biología Molecular y Celular de Plantas), Anton Korobeynikov (St Petersburg University), Rayan Chikhi (Centre National de la Recherche Scientifique), Artem Babaian

bioRxiv (Cold Spring Harbor Laboratory)

August 10, 2020

Abstract

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, now exceeding multiple petabases and growing exponentially [1, 2]. We developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase, identifying well over 10^5 novel RNA viruses and thereby expanding the number of known species by roughly an order of magnitude. We characterised novel viruses related to coronaviruses and to hepatitis δ virus, and explored their environmental reservoirs. To catalyse a new era of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.
