Abstract 1085: Analyzing economic storage solutions for cancer research data
Abstract
Purpose: The National Cancer Institute’s (NCI) Cancer Research Data Commons (CRDC) is a cloud-based data ecosystem that allows researchers to share and access clinical, genomic, proteomic, and imaging data. CRDC currently houses more than 10 petabytes (PB) of data, predominantly genomic. The volume of genomic data in CRDC has more than doubled since 2022, from 3.7 PB to 8.8 PB. To address escalating data storage costs, CRDC must identify economical genomic data storage and compression strategies to achieve long-term sustainability.

Methods: To evaluate the impact of genomic compression algorithms and cloud storage solutions, a data compression and storage pilot study was conducted on two CRDC data sources: 185 GB from the 1000 Genomes Project and 151 GB from the Integrated Canine Data Commons (ICDC). Four compression algorithms (PetaGene, CRAM, PigZ, and Genozip) were chosen based on their current use within the cancer research community. The study consisted of three parts:
Compression-Only: the four compression algorithms were evaluated for efficiency and cost-effectiveness.
Cloud Storage + Compression: compressed data were placed in various AWS S3 storage tiers, including S3 Intelligent-Tiering. (Assumption: no monitoring and automation costs for Intelligent-Tiering.)
AWS HealthOmics: a range of possible storage costs was estimated for two scenarios: (1) data accessed monthly and (2) data never accessed. (Assumptions: 4 gigabases per gigabyte; no egress costs when moving data off HealthOmics; each genome is downloaded in 500 parts, generating 500 GET API calls.)

Results: Of the four algorithms tested, PetaGene performed best on both datasets. For the 1000 Genomes data, PetaGene compressed the data by 76% in ∼70 minutes for a one-time cost of $2.86. For the ICDC data, PetaGene compressed the data by 83% in ∼63 minutes for $2.57. It is broadly accepted that tiered storage yields cost savings; placing the compressed data in tiered storage yielded further savings.
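As a rough illustration of the arithmetic behind the compression figures above, the following Python sketch converts a percentage reduction into a compressed size and then into an annual storage cost. The function names and the per-GB-month price are hypothetical placeholders introduced here for illustration; they are not taken from the study, and the sketch ignores request, retrieval, and license costs.

```python
# Minimal sketch: compressed size and annual storage cost.
# ASSUMPTION: PRICE_PER_GB_MONTH is an illustrative placeholder, not a
# price reported in the abstract. Substitute the rate for the tier of interest.

PRICE_PER_GB_MONTH = 0.023  # hypothetical USD per GB per month


def compressed_size_gb(original_gb: float, reduction: float) -> float:
    """Size after compression; `reduction` is the fraction removed (0.76 = 76%)."""
    return original_gb * (1.0 - reduction)


def annual_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Flat annual cost for storing `size_gb` at a fixed monthly rate."""
    return size_gb * price_per_gb_month * 12


# Dataset sizes and PetaGene reductions as stated in the abstract.
datasets = {
    "1000 Genomes": (185.0, 0.76),
    "ICDC": (151.0, 0.83),
}

for name, (original_gb, reduction) in datasets.items():
    size = compressed_size_gb(original_gb, reduction)
    cost = annual_storage_cost(size, PRICE_PER_GB_MONTH)
    print(f"{name}: {size:.1f} GB compressed, ${cost:.2f}/year at the assumed rate")
```

With the stated reductions, the compressed sizes come to roughly 44.4 GB (1000 Genomes) and 25.7 GB (ICDC). Because the price is a placeholder, the dollar amounts the sketch prints are not the study's results.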
For both datasets, the most cost-effective strategy was S3 Intelligent-Tiering of PetaGene-compressed data. The annual cost for PetaGene-compressed data stored in Intelligent-Tiering ranges from $2.10 to $6.90 (ICDC) and from $3.70 to $12.14 (1000 Genomes). The annual cost to store data in AWS HealthOmics is $11.15 to $41.82 (ICDC) and $13.66 to $51.23 (1000 Genomes) for data accessed monthly.

Conclusion: Significant cost savings can be achieved by pairing effective genomic data compression tools with intelligent tiering storage solutions. The study had two main limitations: the license costs of the compression algorithms were not studied, and the frequency of data access, which should be considered in real-world applications, was outside the scope of this study. Both of these factors will need to be considered as CRDC selects strategies to mitigate cost and inform overall infrastructure.

Citation Format: Juergen Klenk, Dina Mikdadi, Bhavani Singh, Chelsea Owens, Eric Barner, Ross Campbell, Mary A. Sears, Ina Felau, Michael Warfe, Erika Kim, Tanja Davidsen. Analyzing economic storage solutions for cancer research data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 1085.