Petabase-scale sequence alignment catalyses viral discovery

R. C. Edgar, Jeff Taylor, Victor S.-Y. Lin, Tomer Altman (Verisk Analytics (United States)), Pierre Barbera (Heidelberg Institute for Theoretical Studies), Dmitry Meleshko (St Petersburg University), Dan Lohr, Gherman Novakovsky (University of British Columbia), Benjamin Buchfink (Max Planck Institute for Developmental Biology), Basem Al-Shayeb (University of California, Berkeley), Jillian F. Banfield (Planetary Science Institute), Marcos de la Peña (Instituto de Biología Molecular y Celular de Plantas), Anton Korobeynikov (St Petersburg University), Rayan Chikhi (Centre National de la Recherche Scientifique), Artem Babaian
bioRxiv (Cold Spring Harbor Laboratory)
August 10, 2020
Cited by 57 · Open Access

Abstract

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which now exceeds multiple petabases and is growing exponentially [1, 2]. We developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase, identifying well over 10⁵ novel RNA viruses and thereby expanding the number of known species by roughly an order of magnitude. We characterised novel viruses related to coronaviruses and to hepatitis δ virus, and explored their environmental reservoirs. To catalyse a new era of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.

