1 Introduction

Recently, we have witnessed an increasing number of data science applications in the field of wise information technology of medicine (WITMED) and its sustainable development. These applications include drug discovery and disease surveillance, where personal information such as name, age, gender, postal code, profession, disease, and medical history can be collected, published and used by third-party terminal devices or authorities. The analysis and application of medical data have become a hot topic in recent years [1, 2].

By combining data science and modern medicine, the analysis of medical data yields benefits spanning disease prediction, new drug research and development, auxiliary diagnosis and treatment, and health management. However, as more data is collected and processed through interconnected devices [3], privacy becomes a significant concern because of the sensitive personal information the data may contain.

In data science research, privacy preservation has become increasingly significant in addressing security and privacy challenges. The development of privacy-enhancing techniques, including differential privacy, secure multi-party computation and homomorphic encryption, is imperative for enabling privacy protection while collecting and analyzing data collaboratively. Additionally, transparent and accountable data governance frameworks that protect privacy and facilitate informed consent should be developed to ensure the responsible use of data. Therefore, a comprehensive approach that encompasses both technical and ethical considerations is necessary to effectively address the privacy challenges arising at the intersection of artificial intelligence and data science.

As for medical data analysis, attacks on medical datasets and models have increased rapidly in recent years, so research on privacy-preserving methods has become a crucial area of study in medical informatics. Privacy computing can realize medical simulation, prediction and secure statistical analysis of medical data at specified privacy-preserving levels. For publishing medical data, anonymization methods are capable of defending against linking attacks, skewness attacks and similarity attacks, to name a few; however, they are not sufficiently resistant to background-knowledge attacks [4]. Differential privacy is not only robust to differential attacks, but also defends against all of the above attacks on sensitive medical data. Moreover, for publicly released models, differential privacy algorithms also prevent adversaries from recovering private information from the original medical data.

In recent years, there has been a surge in the development of novel algorithms for differentially private medical analysis, which this paper aims to survey. The contributions of this paper can be summarized as follows. First, we discuss why differential privacy matters in medical data publishing and data mining. Second, we discuss typical noise-based differential privacy mechanisms, which helps in understanding existing work. Third, we analyze the limitations of the differential privacy strategy and summarize possible future challenges, highlighting future research directions for medical applications of differential privacy.

The rest of this paper is structured as follows. Section 2 introduces privacy computing technology for medical data and the characteristics of anonymization methods. The fundamental theory of differential privacy and its noise mechanisms are presented in Sect. 3. Section 4 illustrates applications of differential privacy to medical data. Subsequently, we analyze and discuss possible future challenges of differential privacy in Sect. 5. Section 6 concludes the paper.

2 Medical Data Privacy Computing

The connotation of privacy is dynamic, and its meaning continues to be enriched with the progress of social politics and economic culture and the improvement of human consciousness. Privacy computing refers to a series of privacy-preserving methods that keep sensitive data available but not visible when conducting joint analysis and collaborative computation on data and models.

Unlike secure blockchain frameworks [5] or web attack detection techniques in cloud-IoT systems [6], privacy computing mainly integrates cryptography, artificial intelligence and computer hardware technologies into a relatively mature technical system represented by secure multi-party computation, trusted execution environments and federated learning. Meanwhile, it also treats differential privacy, homomorphic encryption, zero-knowledge proofs and others as auxiliary technologies, providing a technical guarantee for data security and circulation.

Research on privacy problems can be divided into five categories: financial privacy, Internet privacy, medical privacy, political privacy and information privacy [7]. Among them, medical privacy arises from a wide range of sources and complex types of medical data, mainly covering information that patients do not want known to the outside world, such as genomic information, past medical history and medical records. These are commonly stored in the form of electronic medical records (EMR), electronic health records (EHR) and personal health records (PHR).

Medical data scattered across different institutions is difficult to interconnect, which may seriously restrict the output of clinical scientific research results. For this problem, privacy computing technology can provide a series of practical solutions to achieve data circulation and take full advantage of medical data. Moreover, it can also mitigate the problem that insufficient samples from a single institution undermine the credibility of research results.

During the COVID-19 epidemic prevention and control period [8, 9], analyses of medical services and tests, pulse count, body temperature and the overall effect of age and gender were conducted [10, 11]. Furthermore, privacy computing technology such as secure multi-party computation enables researchers from all over the world to jointly conduct genome analysis of case samples and share sequencing results without disclosing detailed personal information, so as to implement real-time tracking of the current virus situation and prediction of future strain evolution [1, 12]. This helps more countries diagnose COVID-19 patients efficiently and take effective measures in time.

Generally, genome analysis relies on a large amount of personally private data. With privacy computing, the original genetic data remain sealed in local databases while sensitive genomic data are shared safely, after which joint computation and association analysis can be carried out. In this way, various genome resources can be mined by different medical institutions under the premise of privacy preservation.

For clinical medical research, utilizing local data protected by privacy computing technology allows distributed statistical analysis algorithms to perform joint modeling and obtain related results, such as feasibility analysis of clinical research, large-sample cohort studies, disease prediction and drug insight. Therefore, the application of privacy computing will greatly improve medical research efficiency and accelerate the transformation of scientific research achievements.

As shown in Fig. 1, the complete medical data life cycle incorporates data publishing, storing, mining and utilizing [13, 14]. Data publishers, storage parties, miners and users are involved in this process. Both the private data threats and the corresponding privacy-preserving techniques are different at each phase.

Fig. 1 Privacy-preserving life cycle of medical data

In practical medical scenarios, the data publishing phase usually involves continuous release of medical data, which attracts the attention of adversaries who can combine specific background knowledge to carry out a series of analyses and attacks on sensitive medical data. Thus, in the data publishing phase, while ensuring efficient transmission and strong usability of data, how to handle potentially leaked sensitive information safely and reliably is also a crucial issue for medical researchers and clinicians.

Traditional anonymous publishing methods are usually adopted in the process of medical data release, including k-anonymity [15], l-diversity [16] and t-closeness [17]. Through generalization, suppression and substitution of dataset tuples, they group records by quasi-identifiers according to specific rules so as to meet the need for de-identified release of medical data, as sketched below. Although anonymization approaches are capable of protecting sensitive plaintext information [18, 19], they cannot effectively prevent attackers from launching linking attacks with background knowledge drawn from external databases, and their privacy protection effect lacks strict theoretical proof. The differential privacy computing technology introduced in the following can make up for these disadvantages of anonymization methods.
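To make the generalization step concrete, the following is a minimal, illustrative sketch of checking k-anonymity after generalizing two hypothetical quasi-identifiers (age and postal code); the records and grouping rules are invented for illustration and do not come from the surveyed works.

```python
# Minimal sketch of generalization-based k-anonymity (illustrative only).
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: age to a 10-year band, postal code to a prefix."""
    age_band = f"{(record['age'] // 10) * 10}-{(record['age'] // 10) * 10 + 9}"
    zip_prefix = record['postal_code'][:3] + "**"
    return (age_band, zip_prefix)

def is_k_anonymous(records, k):
    """Every combination of generalized quasi-identifiers must appear at least k times."""
    groups = Counter(generalize(r) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"age": 34, "postal_code": "10115", "disease": "diabetes"},
    {"age": 37, "postal_code": "10117", "disease": "asthma"},
    {"age": 31, "postal_code": "10119", "disease": "flu"},
]
print(is_k_anonymous(records, k=3))  # True: all three fall into one generalized group
```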

3 Differential Privacy

In a hypothetical scenario where data collectors collect the published patient diagnosis and treatment records from a hospital, differential privacy can protect sensitive information by adding random noise or perturbation to the original records: no individual user's personal data in the dataset can be revealed, while the overall statistical characteristics remain within specified bounds, thus maintaining data utility to a certain extent. This strategy greatly strengthens the privacy and security of medical data.

Proposed by Dwork et al. [20], the concept of differential privacy derives from semantic security in cryptography, under which adversaries cannot distinguish the encryptions of different plaintexts. In addition, differential privacy provides a mathematically strict upper bound on privacy loss, namely the privacy budget. The direct purpose of differential privacy is to prevent differential attacks by adding random noise, so that the adversary cannot effectively infer personal privacy while the utility of query results on neighboring datasets is maximized. In a differential attack, the adversary compares the statistical results of queries on neighboring datasets and uses their difference to infer the sensitive data of a particular person.

In the data publishing phase, differential privacy ensures that the same query issued on two neighboring datasets returns results that are essentially the same, so as to confuse the judgment of the adversary. In addition to guarding against differential attacks, differential privacy can also largely prevent linking attacks based on background knowledge.

3.1 Definition

Generally speaking, differential privacy is defined as follows: given a randomized algorithm (query function) \(M\), let \(P_{m}\) be the set of all possible outputs of \(M\) and \(S_{m} \subseteq P_{m}\). For any two neighboring datasets \(D\) and \(D^{\prime}\) (differing in at most one row), if the algorithm \(M\) satisfies:

$$\Pr \left[ M\left( D \right) \in S_{m} \right] \le e^{\varepsilon } \cdot \Pr \left[ M\left( D^{\prime} \right) \in S_{m} \right]$$
(3.1)

Then algorithm \(M\) is said to satisfy \(\varepsilon\)-differential privacy, where the parameter \(\varepsilon\) is the privacy budget. As can be seen from Eq. (3.1) (equivalently, the ratio of the two probabilities is bounded by \(e^{\varepsilon }\)), the smaller the privacy budget, the more similar the probability distributions of the query results returned by \(M\) on the neighboring datasets, and the harder it is for the adversary to distinguish the pair of neighboring datasets. A smaller budget therefore provides a higher degree of protection for sensitive data, but data utility correspondingly degrades. On the contrary, a larger privacy budget lowers the degree of privacy protection and improves data utility.

Notably, the probabilities that a third party obtains the same statistic when querying the two neighboring datasets are only very close, not exactly equal. While protecting specific data from leakage, it is also essential to prevent the data from being completely randomized, which would destroy usability.
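As a concrete illustration (a worked numerical example added here, not part of the original formulation): with \(\varepsilon = 0.1\) we have \(e^{\varepsilon } \approx 1.105\), so by Eq. (3.1) the probability of any output set changes by at most about 10.5% when one patient's record is added or removed; with \(\varepsilon = 1\), \(e^{\varepsilon } \approx 2.718\), the bound is much looser and the protection correspondingly weaker.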

3.2 Noise-Based Mechanisms

In this part, we discuss three noise mechanisms commonly used in differential privacy.

3.2.1 Laplace Mechanism

A query on the original dataset \(D\) is regarded as the value of a function \(f\) on \(D\). The Laplace mechanism adds noise \(\eta\) to \(f\left( D \right)\) and returns \(f\left( D \right) + \eta\), where \(\eta\) is a continuous random variable following the \(Lap\left( 0, \frac{\Delta \left( f \right)}{\varepsilon } \right)\) distribution with probability density function:

$$P\left( \eta \right) = \frac{1}{2\lambda } e^{ - \frac{\left| \eta \right|}{\lambda }}$$
(3.2)

In Eq. (3.2), the expected value of the Laplace distribution is 0, the variance is \(2\lambda^{2}\), and the parameter \(\lambda\) reflects the amplitude of the noise and the strength of privacy protection: a larger \(\lambda\) means a wider range of added noise and a higher degree of privacy protection. In addition, the sensitivity is another important factor affecting the strength of privacy protection.

Given a query function \(f:D \to R\) mapping a dataset to a query result, the global sensitivity of \(f\) is:

$$\Delta \left( f \right) = \mathop {\max }\limits_{D, D^{\prime}} \left\| f\left( D \right) - f\left( D^{\prime} \right) \right\|_{1}$$
(3.3)

for all neighboring datasets D and D'.

The global sensitivity reflects the maximum variation of a query function over neighboring datasets; together with the privacy budget, it controls the amount of generated noise.
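As a minimal sketch of the Laplace mechanism (illustrative only; the function name and the counting-query example are our own, assuming a query with global sensitivity 1):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release a noisy answer satisfying epsilon-differential privacy.

    The noise scale lambda = sensitivity / epsilon, following Eq. (3.2)."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many patients have diabetes?") has global
# sensitivity 1, since adding or removing one record changes the count by at most 1.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy_count)  # e.g. 44.3 -- the exact value varies per run
```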

3.2.2 Gaussian Mechanism

The Gaussian mechanism achieves \(\left( \varepsilon , \delta \right)\)-differential privacy, which is defined as follows:

$$\Pr \left[ M\left( D \right) \in S_{m} \right] \le e^{\varepsilon } \cdot \Pr \left[ M\left( D^{\prime} \right) \in S_{m} \right] + \delta$$
(3.4)

In Eq. (3.4), the additive term \(\delta\) denotes the probability with which plain \(\varepsilon\)-differential privacy is allowed to be violated. Given a function \(f\) over dataset \(D\), if \(\varepsilon < 1\), \(\delta \in \left( 0,1 \right)\), \(\delta \ge \frac{4}{5}e^{ - \left( \sigma \varepsilon \right)^{2} /2}\) [21] and \(\sigma \ge \sqrt {2\ln \frac{1.25}{\delta }} \cdot \Delta f/\varepsilon\), then the Gaussian mechanism can be expressed as \(M\left( D \right) = f\left( D \right) + N\left( 0, \Delta f^{2} \cdot \sigma^{2} \right)\) [22, 23], where \(N\left( 0, \Delta f^{2} \cdot \sigma^{2} \right)\) is a zero-mean Gaussian distribution with noise parameter \(\sigma\) and standard deviation \(\Delta f \cdot \sigma\). Compared with the \(L_{1}\)-sensitivity norm used by the Laplace mechanism, the Gaussian mechanism follows the same privacy composition but uses the \(L_{2}\)-sensitivity norm.
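A minimal sketch of the Gaussian mechanism under the classical calibration above (illustrative; the function name and the blood-pressure query with an assumed \(L_{2}\)-sensitivity are invented for demonstration):

```python
import numpy as np

def gaussian_mechanism(true_value, l2_sensitivity, epsilon, delta,
                       rng=np.random.default_rng()):
    """Release a noisy answer satisfying (epsilon, delta)-differential privacy.

    Uses the classical calibration sigma >= sqrt(2 ln(1.25/delta)) * Delta_2 f / epsilon,
    which requires epsilon < 1."""
    assert 0 < epsilon < 1 and 0 < delta < 1
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

# Example: a mean-blood-pressure query with an assumed L2 sensitivity of 0.1.
noisy_mean = gaussian_mechanism(118.0, l2_sensitivity=0.1, epsilon=0.5, delta=1e-5)
print(noisy_mean)
```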

3.2.3 Exponential Mechanism

The above two noise mechanisms are mainly used to protect numerical data, while the exponential mechanism is suitable for non-numerical data. It defines a utility (evaluation) function \(q\) that assigns a satisfaction score \(q\left( D,\omega \right)\) to each candidate output \(\omega\); outputs with higher scores are published with higher probability. The exponential mechanism satisfies:

$$\Pr \left( \omega \right) \propto \exp \left( \frac{\varepsilon }{2\Delta \left( q \right)}q\left( D,\omega \right) \right)$$
(3.5)

In Formula (3.5), \(\Delta \left( q \right)\) is the global sensitivity of the evaluation function.
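A minimal sketch of the exponential mechanism for a non-numerical choice (illustrative; the candidate diagnoses and the counting utility, whose sensitivity is 1, are invented for demonstration):

```python
import numpy as np

def exponential_mechanism(candidates, utility, sensitivity, epsilon,
                          rng=np.random.default_rng()):
    """Sample one candidate with probability proportional to
    exp(epsilon * q(D, w) / (2 * sensitivity)), as in Eq. (3.5)."""
    scores = np.array([utility(c) for c in candidates], dtype=float)
    # Subtract the maximum score before exponentiating for numerical stability;
    # this does not change the sampling probabilities.
    weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
    probabilities = weights / weights.sum()
    return rng.choice(candidates, p=probabilities)

# Example: privately choose which diagnosis category to publish as the most common one.
counts = {"flu": 120, "asthma": 95, "diabetes": 80}
chosen = exponential_mechanism(list(counts), utility=counts.get,
                               sensitivity=1.0, epsilon=0.5)
print(chosen)
```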

3.3 Classification of Differential Privacy

Traditional differential privacy gathers the original datasets at a data center and then releases relevant statistics satisfying differential privacy; this is called centralized differential privacy (CDP). In other words, CDP's protection of sensitive information rests on the assumption that the third-party data collectors are trusted, that is, that they will not steal or disclose users' sensitive information. However, in practical applications, users' privacy is still not guaranteed [24]. An investigation in 2018 showed that most mobile health apps jeopardized users' privacy by violating data protection regulations and revealing sensitive information [25].

In view of this, local differential privacy (LDP) [26] emerges in scenarios with untrusted third-party data collectors. Against the same quantified privacy attacks as CDP, LDP refines the protection of sensitive personal information. Specifically, LDP delegates data protection authority to each user, enabling users to protect sensitive personal information independently and thus achieving more thorough privacy preservation locally. At present, LDP has mainly been used in frequency estimation and mean estimation [27] and has gradually been put into industrial applications. For example, Apple [28] applied it in the iOS 10 operating system to protect user device data, and Google [29] used it to collect users' behavior statistics from the Chrome browser.
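As an illustration of LDP-style frequency estimation, the following is a minimal sketch of binary randomized response (a basic LDP primitive; the simulation parameters are invented, and this is not the specific protocol used by Apple or Google):

```python
import numpy as np

def randomize(true_bit, epsilon, rng):
    """Each user reports the true bit with probability e^eps / (e^eps + 1)
    and flips it otherwise; this satisfies epsilon-LDP."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_frequency(reports, epsilon):
    """Unbiased estimate of the true proportion of 1s from the noisy reports."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1)
    observed = np.mean(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
epsilon = 1.0
true_bits = rng.random(10_000) < 0.3          # 30% of users have the sensitive attribute
reports = [randomize(int(b), epsilon, rng) for b in true_bits]
print(estimate_frequency(reports, epsilon))   # close to 0.3
```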

3.4 Differential Privacy in Machine Learning

Recently, differential privacy has also gradually been applied in the data mining field and combined with a growing number of machine learning algorithms.

Differential privacy relies on noise or perturbation, so compared with other privacy computing methods it has low computational complexity, which improves its efficiency in machine learning applications while providing more explicit privacy guarantees. Noise can be added not only to the original data, the objective function, the output model parameters or the features extracted by a neural network [30]; sensitive features specified by users or automatically detected by a recognition network can also be perturbed or screened [31, 32]. Shokri et al. [33] used differential privacy mechanisms early on to design a distributed learning method for privacy protection. In their method, the privacy loss can be calculated from the model parameters, but too many model parameters may lead to a huge privacy loss. On this basis, Abadi et al. [22] introduced a more efficient gradient descent algorithm based on differential privacy, which has a smaller privacy budget and better performance. More importantly, Abadi et al. [22] also introduced a method for measuring privacy loss, the moments accountant, to automate its calculation. The differentially private stochastic gradient descent (DP-SGD) algorithm proposed in that paper also laid the foundation for subsequent research on privacy-preserving machine learning.
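The following is a simplified, illustrative sketch of a single DP-SGD update in the spirit of [22], with per-example gradient clipping and Gaussian noise; the privacy accounting (moments accountant) is omitted, and all names and the toy gradients are placeholders rather than the authors' implementation.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier,
                rng=np.random.default_rng()):
    """One simplified DP-SGD update: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise with scale clip_norm * noise_multiplier, average, and step.

    per_example_grads: array of shape (batch_size, num_params)."""
    batch_size = per_example_grads.shape[0]
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean_grad = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_mean_grad

# Toy usage with random gradients standing in for a real model's per-example gradients.
params = np.zeros(4)
grads = np.random.default_rng(1).normal(size=(32, 4))
params = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1)
```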

Applying differential privacy to machine learning reduces the probability that an adversary can reversely infer sensitive personal information in the original training datasets from the model. Data utility and model security are both crucial in this process. On the one hand, it is necessary to reasonably select and control the privacy budget during training according to the privacy loss; methods such as dynamic allocation of the privacy budget [34], exploiting the post-processing property of differential privacy for noise reduction [35], or reducing the privacy budget consumed by composition [36] can be considered. On the other hand, model architectures that are more conducive to protecting user privacy can also be selected [37, 38].

4 Differential Privacy for Medical Data

For medical data, differential privacy is mainly applied to data publishing and data mining. In the data publishing phase, it largely prevents the privacy leakage caused by data queries based on background knowledge. In the data mining phase, it resists the privacy leakage caused by membership inference attacks (MIA) on the model.

As Fig. 2 shows, current application research focuses on genomic data, medical wearable devices, electronic medical records, medical images, etc.

Fig. 2 Differential privacy application to medical data

4.1 Genomic Data

Genomic data in medicine consists of individuals' DNA sequences carrying genetic information; such data is difficult to change over a lifetime and has long-lived value [39,40,41]. Given this, some enterprises may be tempted by commercial interests to violate the genetic privacy of others.

Genome-wide association studies (GWAS) help to learn genome-phenome associations by analyzing the statistical correlation between the variants of a case group (phenotype positive) and a control group (phenotype negative) [4]. The adversary may infer the potential traits and genotypes of victims from trait associations available in the GWAS catalogue [42]. To reduce the possibility of leaking genomic privacy through published aggregate GWAS statistics, differential privacy strategies can be introduced. For example, differential privacy can, to a certain extent, prevent attackers from inferring the number and location of single nucleotide polymorphisms (SNPs) that might be significantly linked with certain diseases in the original genetic datasets, so as to protect gene privacy [43, 44]. As another example, controlled noise can be added to query results from genomic databases, which promotes genome openness while preserving privacy [45, 46]. However, adding large-scale noise to high-dimensional genomic data inevitably degrades data utility. To address this problem, He et al. [47] proposed an effective method that factorizes a high-dimensional distribution into a set of local distributions, reducing the scale of added noise.

Moreover, Almadhoun et al. [48] showed that an adversary could infer genomic privacy from noisy query results by exploiting the correlations between the genomes of family members; Almadhoun et al. [49] then formalized a differential privacy notion that prevents an adversary relying on prior knowledge of dependent tuples from inferring sensitive information. Similarly, in order to strengthen differential privacy against correlation attacks, Yilmaz et al. [50] proposed a scheme that eliminates certain states of an SNP loosely correlated with previously shared SNPs. Chen et al. [51] studied the ability of machine learning models to defend against MIA on genomic data and evaluated the effect of model sparsity on privacy vulnerability under different differential privacy settings.

4.2 Wearable Device Data

Medical wearable devices that store personal health data such as heart rate and blood sugar play an important role in disease diagnosis and treatment, and they make it possible to collect real-time medical health data continuously [52]. Since the sensitive personal data stored in medical wearable devices need to be collected in real time, they also demand privacy preservation during data publishing.

Tu et al. [53] applied differential privacy to the publishing of numerical mean stream data from medical wearable devices, and adopted an adaptive sampling algorithm based on Kalman-filter error adjustment to allocate privacy budgets, which improves the usability of the published stream data. Kim et al. [54] added Laplace noise to salient points when collecting one-dimensional heart rate data, but the approach suffers from large data error.

Revolving around the Laplace mechanism, researchers have extended a series of works to provide better data utility and privacy guarantees. Li et al. [55] proposed an improved randomized method to tackle stream medical data collection with a single attribute; the method combines randomized response and the Laplace mechanism, further improving the availability of mean-value estimation for stream data from medical wearable devices. Moreover, for partitioned or temporal medical datasets, the geometric technique [56], the Haar wavelet technique [57], the bucket partition algorithm [58] and the Fourier perturbation algorithm [59] have also been combined with the Laplace distribution of differential privacy.
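To illustrate the general idea of locally perturbing bounded sensor readings before publishing, the following is a generic sketch, not the specific schemes of [53,54,55]; the heart-rate range and parameters are assumed for demonstration.

```python
import numpy as np

def perturb_reading(value, lower, upper, epsilon, rng):
    """Each device perturbs one bounded reading (e.g., heart rate in [lower, upper])
    with Laplace noise scaled to the range, a generic LDP-style perturbation."""
    clipped = min(max(value, lower), upper)
    scale = (upper - lower) / epsilon   # sensitivity of a single bounded value
    return clipped + rng.laplace(0.0, scale)

rng = np.random.default_rng(42)
true_rates = rng.normal(loc=72, scale=8, size=5000)             # simulated heart rates
reports = [perturb_reading(v, 40, 180, epsilon=1.0, rng=rng) for v in true_rates]
print(np.mean(true_rates), np.mean(reports))  # the noisy mean approximates the true mean
```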

4.3 Other Medical Data

As an inevitable product of modern information technology in the medical field, the electronic medical record is the carrier of various medical information in the diagnosis and treatment process, greatly benefiting the modern management of hospital medical records. Combining with the LDP strategy, Wu et al. [60] designed a blockchain-enabled framework to provide attribute-based privacy protection for transactions. Medical diagnosis results also form part of electronic medical records; Chen et al. [61] presented a differentially private quasi-identifier classification scheme to handle original disease datasets and defined a privacy ratio for evaluating dataset vulnerability. Zhang et al. [62] designed an attribute-association-based differentially private classification tree method for data publishing, conducting experiments on real medical record datasets.

In addition, Ziller et al. [63] proposed an open-source software framework based on the DP-SGD algorithm to deal with medical imaging classification and semantic segmentation deep learning tasks. Yuan et al. [64] exploited collaborative deep learning with the Gaussian noise mechanism in experiments on a chest X-ray (pneumonia) dataset and found that the accuracy loss was small, affecting the results little. Adnan et al. [65] indicated that federated learning with differential privacy is a viable and reliable collaborative machine learning framework for medical image analysis.

5 Discussions

Although the application of differential privacy to medical data has achieved some results, it still faces difficulties and challenges in practical application.

First, we still need to explore how to continually improve data utility when medical data is shared and circulated across institutions, and how to select suitable algorithmic strategies to reduce global sensitivity and control the privacy budget.

Second, due to the complexity of the scale and structure of medical data, rapidly growing medical data has begun to be expressed in unstructured forms. As a popular method for describing networked data [66, 67], the graph neural network (GNN) has been applied by many researchers to various medical tasks, such as predicting the chemical properties of molecules, predicting the biological interaction properties of proteins, and drug recommendation [68,69,70]. However, when GNN models are uploaded to a server and the graph nodes or labels involve personal sensitive information, the process of learning on graph data still carries the possibility of privacy leakage. For this scenario, the differential privacy strategy can also be used to add noise locally [71, 72]. Combining differential privacy with graph data is harder than with general medical data types because of the more complex structure. On the one hand, the structural characteristics of a graph may greatly increase the global sensitivity of queries, resulting in excessive noise. On the other hand, since each user perturbs data locally and independently, how to preserve the relationships among the original data and then build a highly usable graph structure from the perturbed data remains a main challenge in current practical applications.

Third, existing privacy-preserving computation methods have their own limitations. Finding a reasonable trade-off among privacy-preserving intensity, data utility and algorithm execution efficiency has always been the common goal of these methods [73,74,75]. Treating differential privacy as a privacy-enhancing technique and combining it with mainstream privacy computing methods such as federated learning is a promising direction, and could be widely applied to distributed training on decentralized medical data in the future.

6 Conclusion

Due to its increasingly large scale and complex structure, medical data inevitably contains sensitive personal information, and the demand for privacy preservation is particularly prominent. In this survey, we discussed the development of differential privacy and its applications to medical data. As a privacy computing method with strict mathematical guarantees and various implementations, differential privacy is capable of addressing the security and efficiency challenges of medical data publishing and mining, providing a reliable environment and solution for medical data analysis. Finally, we discussed major challenges and future research directions for the medical data applications of differential privacy.