
Advances in Synthetic Data: Current State and Challenges

Data is essential for decision making, and its importance has grown with the emergence of artificial intelligence (AI)-based technologies and methods. Even before the advent of AI, statistical analysis of real data was widely used to support human predictions.
In healthcare, however, data collection faces privacy challenges because health- and disease-related data are sensitive. Safety concerns also arise in experiments and clinical studies. Synthetic data is a potential solution because it can substitute for real data.

Synthetic data is artificially generated data designed to represent the real world. This review1) presents a comprehensive survey of open-source tools and synthetic data generation methods in healthcare.
Synthetic data is receiving attention because of its potential to serve as a robust and generalizable substitute for real data when training AI models. The reliability and accuracy of AI predictions drop sharply when the training data set is too small for the prediction target. In healthcare, however, accessing more data raises the risk of exposing individual and sensitive information.
Synthetic data generation offers an opportunity to balance privacy protection with the use of real data. Regulatory requirements can also be built into the data generation algorithm to enforce privacy protection.

There are many types of synthetic data. Tabular, imaging, time-series and omics data are widely applicable data classes in healthcare studies. Regardless of the data class, the generation workflow is basically the same: data acquisition to retrieve the real data securely; data preparation to curate, preprocess and transform it into a form suitable for modeling; data modeling to generate data that retains the essence of the original data; and data quality evaluation to assess fidelity, utility and privacy.
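To make the four stages concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the toy records, the column names (`age`, `sbp`), the function names, and the choice of one independent Gaussian per column as a stand-in model. It is not a method from the review, and it skips utility evaluation entirely.

```python
import random
import statistics

def acquire():
    # Stage 1: stand-in for secure retrieval. These toy records are invented.
    return [
        {"age": 54, "sbp": 131.0},
        {"age": 61, "sbp": 142.0},
        {"age": 47, "sbp": None},  # incomplete record
        {"age": 70, "sbp": 150.0},
        {"age": 38, "sbp": 118.0},
    ]

def prepare(records):
    # Stage 2: drop incomplete rows and convert the rest to numeric tuples.
    return [(float(r["age"]), float(r["sbp"]))
            for r in records if None not in r.values()]

def generate(rows, n, seed=0):
    # Stage 3: placeholder model, one independent Gaussian per column.
    rng = random.Random(seed)
    columns = list(zip(*rows))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(rng.gauss(mu, sd) for mu, sd in params) for _ in range(n)]

def evaluate(real_rows, synth_rows):
    # Stage 4: crude fidelity check (gap between column means) and a crude
    # privacy check (number of synthetic rows that exactly copy a real row).
    mean_gap = [abs(statistics.mean(r) - statistics.mean(s))
                for r, s in zip(zip(*real_rows), zip(*synth_rows))]
    real_set = set(real_rows)
    copies = sum(row in real_set for row in synth_rows)
    return {"mean_gap": mean_gap, "exact_copies": copies}

real = prepare(acquire())
synthetic = generate(real, n=500)
report = evaluate(real, synthetic)
print(report)
```

Because the placeholder model treats each column independently, it loses any relationship between the columns. That loss is exactly the kind of problem a fuller fidelity evaluation would be expected to surface.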
Numerous methods have been proposed for generating such data with high quality and fidelity. They are categorized into four groups:

- Statistical methods (e.g., multivariate normal distribution (MVND)2))
- Probabilistic methods (e.g., stochastic block models (SBM)3))
- Machine learning-based methods (e.g., Gaussian mixture models4))
- Deep learning-based methods (e.g., generative adversarial networks (GANs)5))
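As an illustration of the simplest group, the sketch below fits a multivariate normal distribution to two variables and samples synthetic pairs from it. It is a minimal sketch under stated assumptions, not the implementation from the cited work: the data values, the variable names and both function names are invented, and it assumes the two variables are not perfectly collinear so that the covariance matrix has a Cholesky factor.

```python
import math
import random
import statistics

def fit_bivariate_normal(xs, ys):
    # Estimate the mean vector and the three distinct entries of the
    # 2x2 sample covariance matrix.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(xs)
    vxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    vyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    vxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return (mx, my), (vxx, vxy, vyy)

def sample_bivariate_normal(mean, cov, n, seed=0):
    # Cholesky factor L of the covariance: if z is standard normal,
    # mean + L z has the fitted mean and covariance.
    (mx, my), (vxx, vxy, vyy) = mean, cov
    l11 = math.sqrt(vxx)
    l21 = vxy / l11
    l22 = math.sqrt(vyy - l21 ** 2)  # assumes the variables are not collinear
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        samples.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return samples

# Invented toy data: age in years and systolic blood pressure in mmHg.
ages = [38, 47, 54, 61, 70, 45, 66]
sbps = [118, 125, 131, 142, 150, 128, 139]

mean, cov = fit_bivariate_normal(ages, sbps)
synthetic_pairs = sample_bivariate_normal(mean, cov, n=2000)

# Refit on the synthetic sample to see how closely it tracks the original.
refit_mean, refit_cov = fit_bivariate_normal(*zip(*synthetic_pairs))
print("real mean:", mean, "synthetic mean:", refit_mean)
print("real cov:", cov, "synthetic cov:", refit_cov)
```

Unlike sampling each variable separately, this approach carries the covariance between the variables into the synthetic sample. It still assumes the data are jointly Gaussian, which real clinical variables often are not.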

At the time the review was published, deep learning accounted for 72.6% of the published studies on data generation. Deep learning-based generators have each been developed to tackle a specific challenge in synthetic data generation, and their integration is also in progress.
The review summarizes the studies to date, together with information on open-source code and libraries. It is a useful catalog of what synthetic data can do. It also offers guidance for selecting a method, because the advantages and disadvantages of the major data generation algorithms are clearly organized and described in tables.

While the opportunities for synthetic data are expanding, there are still challenges to overcome.
First, data generation is mostly driven by AI, so a diverse data set is needed to train the initial model and mitigate potential bias. This is a severe issue for underrepresented populations, where the data set itself is biased. Depending on the situation and the purpose of the prediction, extrapolation through the generation of synthetic data may be needed.
Second, the transparency of synthetic data must always be guaranteed in order to protect privacy. Maintaining and improving ethical and regulatory guidelines for the use of synthetic data in healthcare is also essential.

Synthetic data generation is a powerful approach for applying AI-based technology in healthcare. It is worth keeping in mind that data fidelity needs to be evaluated in multiple ways before synthetic data is applied in real healthcare settings. Privacy considerations are critically important as well. Even so, synthetic data opens up opportunities for more efficient and individualized treatment. Further trials and research could enable the creation of a new modality in healthcare.

1) https://doi.org/10.1016%2Fj.csbj.2024.07.005
2) https://doi.org/10.1002/psp4.12613
3) https://doi.org/10.1186/s12859-022-04826-4
4) https://doi.org/10.1109%2FOJEMB.2022.3181796
5) https://doi.org/10.3389/frai.2022.918813
