Big omics data experience

Patricia Kovatch; Anthony Costa; Zachary Giles; Eugene Fluder; Hyung Min Cho; Svetlana Mazurkova

doi:10.1145/2807591.2807595

Patricia Kovatch (Icahn School of Medicine at Mount Sinai), Anthony Costa (Icahn School of Medicine at Mount Sinai), Zachary Giles (Icahn School of Medicine at Mount Sinai), Eugene Fluder (Icahn School of Medicine at Mount Sinai), Hyung Min Cho (Icahn School of Medicine at Mount Sinai), Svetlana Mazurkova (Icahn School of Medicine at Mount Sinai)

October 27, 2015

Cited by 9 | Open Access

Abstract

As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are sequenced is rising quickly, with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis ToolKit (GATK) Best Practices" whole-genome DNA and RNA pipeline, based on an evaluation of compute, workload and I/O characteristics. The characteristics of genomic workloads are vastly different from those of traditional HPC workloads and require different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques that not only speed up their workflow, yielding improved and repeatable performance, but also make more efficient use of storage and compute resources.
