Petabyte-scale innovations at the European Nucleotide Archive

Guy Cochrane (European Bioinformatics Institute), R.A. Akhtar (European Bioinformatics Institute), James Bonfield (European Bioinformatics Institute), L. Bower (European Bioinformatics Institute), F. Demiralp (European Bioinformatics Institute), Nadeem Faruque (European Bioinformatics Institute), Richard Gibson (European Bioinformatics Institute), G. Hoad (European Bioinformatics Institute), Tim Hubbard (European Bioinformatics Institute), Chris Hunter (European Bioinformatics Institute), M. Jang (European Bioinformatics Institute), Szilveszter Juhos (European Bioinformatics Institute), Rasko Leinonen (European Bioinformatics Institute), Susan R. Leonard (European Bioinformatics Institute), Q. Lin (European Bioinformatics Institute), Rodrigo López (European Bioinformatics Institute), D. Lorenc (European Bioinformatics Institute), Hamish McWilliam (European Bioinformatics Institute), G. Mukherjee (European Bioinformatics Institute), S. Plaister (European Bioinformatics Institute), Rajesh Radhakrishnan (European Bioinformatics Institute), Stephen J. Robinson (European Bioinformatics Institute), S. Sobhany (European Bioinformatics Institute), Petra ten Hoopen (European Bioinformatics Institute), Robert Vaughan (European Bioinformatics Institute), Vadim Zalunin (European Bioinformatics Institute), Ewan Birney (European Bioinformatics Institute)
Nucleic Acids Research
November 1, 2008
Cited by 104
Open Access

Abstract

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next-generation sequence data and a brief summary of the contents of the ENA, and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

