Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data

Marco Masseroli (Politecnico di Milano), Arif Canakoglu (Politecnico di Milano), Pietro Pinoli (Politecnico di Milano), Abdulrahman Kaitoua (German Research Centre for Artificial Intelligence), Andrea Gulino (Politecnico di Milano), Olha Horlova (Politecnico di Milano), Luca Nanni (Politecnico di Milano), Anna Bernasconi (Politecnico di Milano), Stefano Perna (Politecnico di Milano), Eirini Stamoulakatou (Politecnico di Milano), Stefano Ceri (Politecnico di Milano)
Bioinformatics
August 6, 2018
Cited by 80
Open Access

Abstract

MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.

RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.

AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

