Synthetic data generation methods in healthcare: A review on open-source tools and methods

Vasileios C. Pezoulas (University of Ioannina), Dimitrios I. Zaridis (National Technical University of Athens), Eugenia Mylona (University of Ioannina), Christos Androutsos (University of Ioannina), Kosmas Apostolidis (University of Ioannina), Nikolaos S. Tachos (University of Ioannina), Dimitrios I. Fotiadis (University of Ioannina)
Computational and Structural Biotechnology Journal
July 9, 2024
Cited by 195
Open Access

Abstract

Synthetic data generation has emerged as a promising solution to the challenges posed by data scarcity and privacy concerns, and to the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare, considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases, focusing on tabular, imaging, radiomics, time-series, and omics data. Studies involving multi-modal synthetic data generation were also explored. The method used for synthetic data generation was identified in each study and categorized as statistical, probabilistic, machine learning, or deep learning. Emphasis was given to the programming languages used to implement each method. Our evaluation revealed that the majority of the studies use synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multi-modal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning-based synthetic data generators, found in 72.6% of the included studies, with 75.3% of the generators implemented in Python. Finally, thorough documentation of open-source repositories is provided to accelerate research in the field.

