Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing


Using a differential privacy framework, we generated realistic samples that can be used for initial analysis while guaranteeing a specified level of participant privacy.
The source code for all analyses is available under a permissive open source license in our repository [1]. In addition, continuous analysis [2] was used to re-run all analyses, to generate Docker images matching the environment of the original analysis, and to track intermediate results and logs. These artifacts are freely available [3].

Background
A pair of recent preprints have reported generation of synthetic individual participant data via neural networks [4,5]. For example, Esteban et al. generated synthetic patient data and showed that a neural network could not distinguish between the synthetic data and real data. However, it is not enough to simply build synthetic participants. Numerous linkage and membership inference attacks on both biomedical datasets [6-13] and machine learning models [14-16] have demonstrated the ability to re-identify participants or reveal participation in a study.
To provide a formal privacy guarantee, we built GANs to generate realistic synthetic individual participant data with mathematical properties like those of the original participants' data, adding the extra protection of differential privacy [17]. Differential privacy protects against common privacy attacks, including membership inference, homogeneity, and background knowledge attacks. Informally, differential privacy requires that no single study participant has a significant influence on the information released by the algorithm (see Materials and Methods for a formal definition). Despite being a stringent notion, differential privacy allows us to generate new plausible individuals while revealing almost nothing about any single study participant. Within the biomedical domain, Simmons and Berger developed a method using differential privacy to enable privacy-preserving genome-wide association studies [18]. Recently, methods have also been developed to train deep neural networks under differential privacy with formal assurances about privacy risks [19,20]. In the context of a GAN, the discriminator is the only component that accesses the real, private data. By training the discriminator under differential privacy, we can produce a differentially private GAN framework.

Auxiliary Classifier Generative Adversarial Network
We implemented the AC-GAN as described in Odena et al. [21] using Keras [22] to simulate systolic and diastolic blood pressures as well as the number of hypertension medications prescribed. Results shown use a latent vector of dimension 100, a learning rate of 0.0002, and a batch size of 1, trained for 500 epochs. To conform with the privacy claims laid out in Abadi et al. [19], gradients must be clipped per example; in our implementation this requires a batch size of 1. To handle edge cases and mimic the sensitivity of the real data measurements, we take the maximum of zero and the simulated value (flooring negative values at zero) and convert all values to integers. Full implementation details can be seen in the GitHub repository [1].
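For concreteness, a minimal sketch of these settings and the postprocessing step, assuming NumPy arrays of simulated measurements (the exact code lives in the repository [1]):

```python
import numpy as np

# Hyperparameters as reported above.
LATENT_DIM = 100      # dimension of the generator's latent vector
LEARNING_RATE = 2e-4  # Adam learning rate
BATCH_SIZE = 1        # required for per-example gradient clipping
EPOCHS = 500

def postprocess(samples: np.ndarray) -> np.ndarray:
    """Floor simulated values at zero and convert to integers to
    mimic the sensitivity of the real measurements (a sketch of the
    edge-case handling described above)."""
    return np.maximum(samples, 0).astype(int)
```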
We chose convolutional layers because of the structure imposed by sequential measurements made during the clinical trial. The features were ordered according to timing, so local structure was tied to temporality. We used deep convolutional neural networks for both the generator and discriminator (Supp. Fig. 1B, 1C).
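A hedged Keras sketch of this layout follows; the time-point count, channel ordering, and layer widths here are illustrative assumptions rather than the exact architectures of Supp. Fig. 1B and 1C (see the repository [1] for those):

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100
N_TIMEPOINTS = 12   # illustrative; measurements are ordered by visit time
N_CHANNELS = 3      # e.g., systolic BP, diastolic BP, medication count (assumed)
N_CLASSES = 2       # treatment groups used as the auxiliary class label

def build_generator():
    # AC-GAN generator: latent vector plus class label -> synthetic series.
    noise = keras.Input(shape=(LATENT_DIM,))
    label = keras.Input(shape=(1,), dtype="int32")
    label_emb = layers.Flatten()(layers.Embedding(N_CLASSES, LATENT_DIM)(label))
    x = layers.multiply([noise, label_emb])
    x = layers.Dense(N_TIMEPOINTS * 32, activation="relu")(x)
    x = layers.Reshape((N_TIMEPOINTS, 32))(x)
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(x)
    out = layers.Conv1D(N_CHANNELS, 3, padding="same")(x)
    return keras.Model([noise, label], out)

def build_discriminator():
    # AC-GAN discriminator: real/fake score plus auxiliary class prediction.
    series = keras.Input(shape=(N_TIMEPOINTS, N_CHANNELS))
    x = layers.Conv1D(32, 3, strides=2, padding="same")(series)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv1D(64, 3, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    validity = layers.Dense(1, activation="sigmoid", name="validity")(x)
    aux = layers.Dense(N_CLASSES, activation="softmax", name="class")(x)
    return keras.Model(series, [validity, aux])
```

The one-dimensional convolutions slide along the visit axis, so the networks can exploit the local temporal structure of the ordered measurements.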

Differential Privacy
Differential privacy is a stability property for algorithms, specifically for randomized algorithms [23]. Informally, it requires that the change of any single data point in the dataset has little influence on the output distribution of the algorithm. To formally define differential privacy, let us consider X as the set of all possible data records in our domain. A dataset is a collection of n data records from X. A pair of datasets D and D′ are neighboring if they differ by at most one data record. In the following, we will write R to denote the output range of the algorithm, which in our case corresponds to the set of generative models.

Definition 1 [Differential Privacy [17]]: Let ε, δ > 0. An algorithm A: X^n → R satisfies (ε, δ)-differential privacy if for any pair of neighboring datasets D, D′ and any event S ⊆ R, the following holds:

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ,

where the probability is taken over the randomness of the algorithm.
A crucial property of differential privacy is its resilience to post-processing: any data-independent post-processing procedure applied to the output of a private algorithm remains private.
More formally: Lemma [Resilience to Post-Processing]: Let A: X^n → R be an (ε, δ)-differentially private algorithm, and let A′: R → R′ be any "post-processing" procedure. Then the composition A′(A(D)), obtained by running A on the dataset D and then running A′ on the output A(D), also satisfies (ε, δ)-differential privacy.
Robustness to post-processing is critical to our application because it means that all downstream uses of the data are also (ε, δ)-differentially private. Therefore, by making the discriminator, the only part of the system that accesses the real data, differentially private, the rest of the system is also differentially private.

Determination of Privacy Budget
The privacy budget is determined a priori and takes the form of (ε, δ). In non-technical terms, ε represents the amount the results of an analysis can change due to a single study participant, and δ represents the likelihood that the ε bound is exceeded. The problem therefore becomes an optimization: minimize both ε and δ while still drawing the same conclusions from the analysis that is performed. Within deep learning, the current standard is to achieve ε < 10 and δ < 10^-4 in accordance with Abadi et al. [19]. It is also important to note that the accounting of ε in stochastic gradient descent is still a loose upper bound, and the actual privacy loss is likely to be significantly lower. In more recent work, Beaulieu-Jones et al. [24] found that Rényi differential privacy [25] yields a tighter bound on the privacy loss. For our use case the Rényi privacy cost was roughly one fourth of that computed by the moments accountant used in this work. Rényi differential privacy could therefore be used to reduce the ε value reported here.
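As a rough illustration of how a Rényi-style accountant turns accumulated noise into an (ε, δ) guarantee, the sketch below composes the RDP of repeated Gaussian-mechanism releases and applies the standard conversion. It deliberately ignores the subsampling amplification that the moments accountant exploits, so the numbers it produces are illustrative only, and the function names are our own:

```python
import numpy as np

def gaussian_rdp(alpha, sigma):
    """RDP of order alpha for one Gaussian-mechanism release with noise
    multiplier sigma (no subsampling amplification; worst-case sketch)."""
    return alpha / (2 * sigma ** 2)

def rdp_to_dp(sigma, steps, delta, orders=np.arange(2, 128)):
    """Convert RDP accumulated over `steps` releases into an (eps, delta)
    guarantee via eps = rdp + log(1/delta)/(alpha - 1), minimized over alpha."""
    rdp = steps * gaussian_rdp(orders, sigma)      # RDP composes additively
    eps = rdp + np.log(1 / delta) / (orders - 1)
    return float(eps.min())                        # tightest order wins

# Example: 500 noisy updates with sigma = 1 at delta = 1e-5.
print(rdp_to_dp(sigma=1.0, steps=500, delta=1e-5))
```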

Training AC-GANs in a Differentially Private Manner
We trained under differential privacy by limiting the effect any single SPRINT study participant has on the training process and by adding random noise scaled to the maximum effect of a single study participant. From the technical perspective, we limited the effect of participants by clipping the norm of the discriminator's training gradient and adding proportionate Gaussian noise. This combination ensures that training cannot be tied to an individual and could equally have been guided by a different subject within or outside the real training data; the maximum effect of an outlier is limited and bounded. Comparing the neural network loss functions of the private and non-private training processes demonstrates the effects of these constraints. Under normal training, the losses of the generator and discriminator converged to an equilibrium before eventually increasing steadily (Supp. Fig. 1D). Under differentially private training, the losses converged to and remained in a noisy equilibrium (Supp. Fig. 1F). At the beginning of training the neural networks changed rapidly. As training continued and the model achieved a better fit, these steps (the gradients) decreased. Eventually the gradient became too small in comparison to the noise for training to progress any further.
As the models achieve a better fit, the gradient shrinks, causing the gradient-to-noise ratio to decrease. This can occasionally lead to the private generator and discriminator falling out of sync (Supp. Fig. 3) or, more commonly, to the private model generating less realistic samples due to noise. To best select epochs, or training steps, where synthetic samples closely resemble real samples, we tested each epoch's data by training an additional classifier that must distinguish whether a generated participant was part of the normal or intensive treatment group.
During the training of the AC-GAN, the only part that requires direct access to the private (real) data is the training of the discriminator. To achieve differential privacy, we therefore only needed to "privatize" the training of the discriminator; the differential privacy guarantee of the entire AC-GAN follows directly because the output generative models are simply post-processing of the discriminator. We trained the discriminator using a differentially private version of the Adam method [26]. The standard Adam method iteratively updates the model parameters based on the gradients of the underlying loss function. To preserve privacy, we added noise to the gradient computed at each step as follows: first, we ensured that the ℓ2-norm of the gradient is bounded by clipping the gradient; then we perturbed each coordinate of the gradient with noise drawn from a Gaussian distribution with mean 0 and standard deviation proportional to the gradient clip size. The more noise we added (relative to the clipped norm of the gradient), the better the privacy guarantee we could provide.
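A minimal sketch of this clip-then-perturb step, assuming a flattened NumPy gradient (the full differentially private Adam implementation is in the repository [1]):

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1e-4, noise_multiplier=1.0,
                       rng=np.random.default_rng()):
    """One differentially private gradient step at batch size 1:
    bound the per-example gradient's l2 norm at `clip_norm`, then add
    Gaussian noise with standard deviation proportional to the clip size."""
    norm = np.linalg.norm(grad)
    grad = grad * min(1.0, clip_norm / (norm + 1e-12))      # clip l2 norm
    noise = rng.normal(0.0, noise_multiplier * clip_norm,   # N(0, (sigma*C)^2)
                       size=grad.shape)
    return grad + noise
```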
Due to the noisy training process, the losses of the discriminator and generator do not always converge (Supp. Fig. 1), and the training algorithm may have to be rerun. To properly account for the total privacy loss across all runs, we started with a target privacy budget (given by privacy parameters ε and δ) and repeatedly ran the private training algorithm until the AC-GAN converged or the privacy budget was exhausted. We used the moments accountant described in Abadi et al. [19] to keep track of the privacy parameters (ε, δ) over time.
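The retry logic can be sketched as follows; `train_one_epoch` and `epsilon_spent` are hypothetical callbacks standing in for the actual trainer and the moments accountant [19]:

```python
def train_within_budget(train_one_epoch, epsilon_spent,
                        eps_budget=4.0, steps_per_epoch=1000):
    """Rerun/continue private training until the AC-GAN converges or the
    privacy budget is exhausted. `train_one_epoch()` performs one epoch and
    returns True once the losses reach a stable equilibrium; `epsilon_spent(s)`
    returns the epsilon consumed after s noisy gradient steps."""
    steps = 0
    # Only take another epoch's worth of steps if it still fits the budget.
    while epsilon_spent(steps + steps_per_epoch) <= eps_budget:
        steps += steps_per_epoch
        if train_one_epoch():
            break                        # converged within budget
    return steps, epsilon_spent(steps)   # total steps and epsilon consumed
```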
We clipped the ℓ2-norm of the gradient at 0.0001 and added noise from a normal distribution with a σ² scale of 1 relative to the squared clip size, i.e., Ɲ(0, 1 · (0.0001)²). In our experiment, the AC-GAN trained in the second run of the algorithm converged, and the entire training process incurred a privacy loss within the budget (ε = 4, δ = 10^-5).

Differentially Private Model Selection
We found that sampling from multiple different epochs throughout training provided a more diverse training set, yielding summary statistics closer to the real data and higher accuracy in the transfer learning task. During GAN training, we saved the generative model from every epoch. We then generated a batch of synthetic data from each generative model and used a machine learning algorithm (logistic regression or random forest) to train a prediction model on each synthetic batch. We then tested each prediction model on the training set from the real dataset and calculated the resulting accuracy. This testing is done only against the training set, ensuring the test set is used solely for evaluation. To select the epochs that generate training data for the most accurate models under differential privacy, we used the standard "Report Noisy Min" subroutine: first, add independent Laplace noise (drawn from Lap(1/(n·ε)), where n is the size of the private dataset on which we perform the prediction) to the accuracy of each model to achieve (ε, 0)-differential privacy; then output the model with the best noisy accuracy.
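A minimal sketch of this selection routine, with an illustrative private dataset size n (the value below is an assumption, not the SPRINT figure):

```python
import numpy as np

def report_noisy_min(accuracies, n, eps, rng=np.random.default_rng()):
    """Select the best epoch under (eps, 0)-differential privacy.
    `accuracies` holds each candidate model's accuracy on the real training
    set of size n; accuracy has sensitivity 1/n, so Laplace noise of scale
    1/(n*eps) suffices for the 'Report Noisy Min' subroutine."""
    noisy = np.asarray(accuracies) + rng.laplace(0.0, 1.0 / (n * eps),
                                                 size=len(accuracies))
    return int(np.argmax(noisy))   # index with the best noisy accuracy

# Example: pick the top five epochs at eps = 0.05 per round, removing each
# winner before the next round so repeated rounds select distinct epochs.
accs = list(np.random.uniform(0.5, 0.9, size=500))  # per-epoch accuracies
chosen = []
for _ in range(5):
    i = report_noisy_min(accs, n=6000, eps=0.05)    # n is illustrative
    chosen.append(i)
    accs[i] = -np.inf   # exclude this epoch from later rounds
```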
Unlike the differentially private stochastic gradient descent training process, the "Report Noisy Min" subroutine can be performed with δ = 0 [17]. We report the aggregate of these two processes to determine the upper bound of privacy risk: (ε = 4.5, δ = 10^-5). In practice, we chose the five models that performed best on the transfer learning task for the training data using both logistic regression classification and random forest classification (for a total of 10 models). We performed this task under (0.5, 0)-differential privacy; in each of the ten rounds of selection, ε was set to 0.05. In simulations of the "Report Noisy Min" subroutine this provided a strong signal-to-noise ratio (i.e., it returned 10 values from the top 5% of epochs >99% of the time) while adding only 0.5 to the total ε. The ε value is a hyperparameter that could be optimized depending on the application. We applied two common machine learning classification algorithms and selected the top epochs in a differentially private manner (Supp. Fig. 2B and 2C). However, selecting only a single epoch does not account for the AC-GAN training process. Because the discriminator and generator compete from epoch to epoch, their results can cycle around the underlying distribution. The non-private models consistently improved throughout training (Supp. Fig. 4A, Supp. Fig. 5A), but this could be due to the generator eventually learning characteristics specific to individual participants. We observed that epoch selection based on the training data was important for the generation of realistic populations from models that incorporated differential privacy (Supp. Fig. 4B, Supp. Fig. 5B). To address this, we simulated 1,000 participants from each of the top five epochs selected by both the logistic regression and random forest evaluation on the training data and combined them to form a multi-epoch training set. This process maintained differential privacy and resulted in a generated population that, throughout the trial, was consistent with the real population (Supp. Fig. 2D). The epoch selection process was independent of the holdout testing data.

Table 1. Spearman correlation between variable importance scores (Random Forests) and model coefficients (Support Vector Machine and Logistic Regression) for the SPRINT trial data.