Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation

Background: Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients’ privacy while properly reflecting the data.

Objective: This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected.

Methods: We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients.

Results: The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data.

Conclusions: We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.


Section S1: The Choice of the Loss Functions in the GCP Tensor Decomposition
The choice of the loss function in the GCP tensor decomposition depends on how the original data are assumed to be generated; the candidate loss functions are listed in Table 1 [1]. Table 1 includes the constraint x > 0, m ≥ 0.
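As a rough illustration of what an elementwise GCP loss looks like, the sketch below codes three forms that are standard in the GCP literature: squared error, the Poisson loss with a log link, and the β-divergence loss. It is an assumption that these match the corresponding rows of Table 1. The function names are hypothetical, x denotes an observed tensor entry, and m denotes the corresponding model entry. The β-divergence sketch requires m > 0, which is stricter than m ≥ 0, because a negative exponent is undefined at zero.

```python
import math


def gaussian_loss(x, m):
    """Squared error, the usual choice for real-valued data."""
    return (x - m) ** 2


def poisson_log_loss(x, m):
    """Poisson negative log-likelihood (up to a constant) with a log link.

    Here m is the log of the expected count, so m may be any real number.
    """
    return math.exp(m) - x * m


def beta_divergence_loss(x, m, beta):
    """Beta-divergence loss for beta outside {0, 1}.

    Assumes x > 0 and m > 0 (assumption of this sketch).
    """
    if beta in (0, 1):
        raise ValueError("this form is only defined for beta not in {0, 1}")
    if m <= 0:
        raise ValueError("this sketch requires m > 0")
    return (m ** beta) / beta - x * (m ** (beta - 1)) / (beta - 1)


def total_loss(xs, ms, loss, **kwargs):
    """Sum an elementwise loss over paired data and model entries."""
    return sum(loss(x, m, **kwargs) for x, m in zip(xs, ms))
```

Each loss is smallest when the model entry reproduces the data (m = x for the Gaussian and β-divergence forms, exp(m) = x for the Poisson log form), which is the property the decomposition exploits when fitting.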

Section S2: The Model Block of STAN
Here is the model block of STAN for sampling from the patient factor matrix using HMC. The x and x_sim in the model block of STAN represent the patient factor matrix variables a_i and their simulations â_i, respectively, and N is the number of patients.
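For readers unfamiliar with HMC, the following Python sketch shows the general mechanics on a deliberately simple case. It is not the STAN model block of this study. It assumes, purely for illustration, a scalar factor with x_i ~ Normal(mu, 1) and a flat prior on mu, whereas the actual patient factors a_i are vectors. The names hmc_draws and simulate and all tuning values are hypothetical.

```python
import math
import random


def hmc_draws(x, n_draws=500, step=0.3, n_leapfrog=5, seed=0):
    """Draw mu from p(mu | x), assuming x_i ~ Normal(mu, 1) and a flat prior."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    eps = step / math.sqrt(n)  # scale the step size to the posterior width

    def potential(mu):  # negative log posterior, up to an additive constant
        return 0.5 * n * (mu - xbar) ** 2

    def grad(mu):
        return n * (mu - xbar)

    mu, draws = xbar, []
    for _ in range(n_draws):
        p0 = rng.gauss(0.0, 1.0)  # resample the momentum
        q, p = mu, p0
        p -= 0.5 * eps * grad(q)  # leapfrog: initial half step for momentum
        for i in range(n_leapfrog):
            q += eps * p  # full step for position
            if i < n_leapfrog - 1:
                p -= eps * grad(q)  # full step for momentum
        p -= 0.5 * eps * grad(q)  # final half step for momentum
        h_old = potential(mu) + 0.5 * p0 ** 2
        h_new = potential(q) + 0.5 * p ** 2
        # Metropolis correction for the discretization error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            mu = q
        draws.append(mu)
    return draws


def simulate(draws, n_sim, seed=1):
    """Posterior predictive x_sim: pick a posterior draw, then sample around it."""
    rng = random.Random(seed)
    return [rng.gauss(rng.choice(draws), 1.0) for _ in range(n_sim)]
```

In this sketch, simulate plays the role of x_sim: n_sim need not equal N, so the number of simulated patients can differ from the number observed.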

Section S3: The Outcomes of Generating Synthetic Continuous Data Using β-loss in GCP Decomposition
In the following, we present the outcomes of synthetic continuous data generated by GCP using the β-divergence loss with β = 0.75 and R = 15, for which the fit score and MSE were about 0.977 and 2.5, respectively. The dataset used in this experiment consists of 226 patients, 4 laboratory tests, and 36 clinical visits. It is the imputed version of the continuous dataset derived from the MIMIC dataset. Because the MSE is not very small, we did not expect the results to be outstanding; nevertheless, Copula and sequential trees performed better than HMC. As can be observed, all three proposed methods of patient factor matrix sampling produce much stronger correlations than those in the real data. The Copula, sequential trees, and HMC results are presented below, in that order.

According to the Copula summary (Table 2), the minimums of the variables "Sodium" and "Hematocrit" are somewhat higher than in the original data.

The sequential trees summary (Table 4) shows that the range of the variable "Sodium" is considerably improved relative to the previous analysis using Copula; refer to Table 3 for the original data summary.

Here are the results of the Hamiltonian Monte Carlo performance on the dataset. If the distribution of the HMC model were properly defined, the outcome would be quite satisfactory. We did not expect HMC to perform well here because the β-divergence loss produces a non-Gaussian latent space, and a good result cannot be obtained even after standardizing the latent space. Defining a proper model distribution for HMC would considerably improve the findings. However, due to the time constraints of this study, we were unable to test alternative model distributions such as Tweedie.

Finally, we provide Figure 10 to make it easier to compare the three sampling techniques on the GCP decomposition with β-divergence loss.
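To make the Copula-based sampling of the patient factor matrix more concrete, the following is a minimal sketch of a Gaussian copula for two columns of a factor matrix. It is an illustration under stated assumptions, not the implementation used in this study: the choice of a Gaussian copula, the use of empirical marginals, the restriction to two columns, and the names normal_scores, pearson, empirical_quantile, and copula_sample are all hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def normal_scores(col):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(col)
    order = sorted(range(n), key=lambda i: col[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return z


def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def empirical_quantile(sorted_col, u):
    """Inverse empirical CDF: the order statistic at probability u."""
    idx = min(int(u * len(sorted_col)), len(sorted_col) - 1)
    return sorted_col[idx]


def copula_sample(col_a, col_b, n_new, seed=0):
    """Draw n_new synthetic (a, b) pairs that keep the marginals and the
    rank-based dependence of the two observed columns."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    resid_sd = math.sqrt(max(0.0, 1.0 - rho ** 2))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    pairs = []
    for _ in range(n_new):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + resid_sd * rng.gauss(0.0, 1.0)  # correlated normals
        u1, u2 = _STD_NORMAL.cdf(z1), _STD_NORMAL.cdf(z2)  # back to uniforms
        pairs.append((empirical_quantile(sorted_a, u1),
                      empirical_quantile(sorted_b, u2)))
    return pairs
```

Because the marginals here are empirical, each synthetic value is one of the observed values for that column; only the pairing across columns is new. A smoothed or parametric marginal would avoid this, at the cost of an extra modeling assumption.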

Section S5: Results Plot
Here is an outcome of sampling the patient factor matrix using the sequential trees approach in experiments on continuous dense data. The figure demonstrates that all three synthetic datasets have similar statistical properties in terms of dependency and univariate distributions.
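The kind of comparison described here (univariate summaries and pairwise dependency) can be sketched as a small utility check. This is a hedged illustration only; the study's actual utility assessments are not specified at this level of detail, and the names summary, pearson, and utility_report are hypothetical.

```python
import math


def summary(col):
    """Minimum, median, mean, and maximum of one variable."""
    s = sorted(col)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return {"min": s[0], "median": median, "mean": sum(s) / n, "max": s[-1]}


def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def utility_report(real, synth):
    """Compare two datasets given as {variable name: list of values}.

    Returns synthetic-minus-real gaps for each summary statistic and for
    each pairwise correlation; gaps near zero indicate similar structure.
    """
    names = list(real)
    report = {"summary_gap": {}, "corr_gap": {}}
    for name in names:
        r, s = summary(real[name]), summary(synth[name])
        report["summary_gap"][name] = {k: s[k] - r[k] for k in r}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = pearson(synth[a], synth[b]) - pearson(real[a], real[b])
            report["corr_gap"][(a, b)] = gap
    return report
```

A positive gap in "min" for a variable, for instance, corresponds to a synthetic minimum that is higher than the original, and a positive correlation gap corresponds to a stronger correlation in the synthetic data than in the real data.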

Section S10: The Outcomes of Generating Synthetic Categorical Data Using Poisson log link in GCP Decomposition
Here are the results from generating categorical data using the GCP decomposition with a Poisson log link. The simulation of the patient factor matrix was conducted using HMC.
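As a sketch of how a simulated patient factor matrix feeds into the model tensor under a log link, the code below rebuilds each model entry from three factor matrices and exponentiates it to obtain a Poisson mean. The canonical polyadic form m_ijk = Σ_r a_ir b_jr c_kr is standard; the names A, B, C, cp_entry, and poisson_mean_tensor are hypothetical, and the step that turns these means into categorical levels is not shown because it is not described here.

```python
import math


def cp_entry(a_i, b_j, c_k):
    """Model entry m_ijk = sum over r of a_ir * b_jr * c_kr."""
    return sum(a * b * c for a, b, c in zip(a_i, b_j, c_k))


def poisson_mean_tensor(A, B, C):
    """Expected counts exp(m_ijk) for every (patient, feature, visit) triple.

    A: patient factors (observed or simulated), one row of length R per patient
    B: feature factors, one row of length R per categorical feature
    C: visit factors, one row of length R per clinical visit
    """
    return [[[math.exp(cp_entry(a, b, c)) for c in C] for b in B] for a in A]
```

Replacing the rows of A with simulated rows leaves B and C untouched, which is why the synthesis task reduces to simulating a single nonlongitudinal matrix.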

Section S11: Results Plot
The following is an outcome of the GCP decomposition with a Gaussian loss function, using HMC for the patient factor matrix simulation of categorical variables.

Figure 1: The different modes (Patients, Laboratory tests, and Clinical visits) of Copula's generated data and the original data are shown.

Figure 4: The different modes (Patients, Laboratory tests, and Clinical visits) of the sequential trees' generated data are shown.

Figure 7: The different modes (Patients, Laboratory tests, and Clinical visits) of HMC's generated data are shown.
Figure 8: The plot shows the correlation and distribution of the original data and the data generated by HMC.

Figure 10: The distribution and scatter plots of the original dataset and the synthetic data generated using Copula, sequential trees, and HMC.

Figure 12: The different modes (Patients, Laboratory tests, and Clinical visits) of the sequential trees' generated data and the original data.

Figure 14: The distribution and scatter plots of the original variables and synthetic variables generated using Copula, sequential trees, and HMC in experiments on continuous dense data.

Figure 17: The different modes (Patients, Categorical features, and Clinical visits) of HMC's generated categorical data are shown.

Figure 20: The different modes (Patients, Categorical features, and Clinical visits) of HMC's generated categorical data are shown.

Table 1: Loss functions.

Table 2: The Copula's synthetic data summary.

Table 3: The original data summary.

Table 4: The sequential decision trees' synthetic data summary. See Table 3 for the summary of the original dataset.

Table 5: The HMC's synthetic data summary. A comparison with Table 3 reveals that HMC performed poorly in this particular scenario.