Privacy-preservation Framework for Sharing Genomic Data: A Game Theoretic Approach

Sharing genomic data is important for healthcare, but privacy must be protected because links between de-identified genomic data and named persons can be re-established by users with malicious intent. In this paper, a game theoretic approach is developed for quantifiable protection of genomic data sharing. This approach accounts for adversarial behavior and capabilities. Based on a Stackelberg game, the model is developed to discover the best solution for sharing genomic summary statistics under an inference attack by an economically driven recipient (the adversary). The inference attack checks whether a targeted individual's DNA is in a genome pool with published summary statistics (that is, minor allele frequencies of single-nucleotide polymorphisms).


Introduction
A huge amount of information is encoded in the human genome, and this information is very useful in personalized medicine, paternity testing, disease susceptibility testing, and genetic compatibility testing. The central question is how patient privacy can be protected while still making the most of the data [1]. Some researchers are apprehensive that preservation of privacy might be impossible to realize, even when sharing only summary statistics [2]. This concern is fueled by high-profile demonstrations over the past decade of how de-identified genomic data can be traced back to named persons, leading to public apologies [3].
A model for genomic data dissemination that accounts for adversarial behavior and capabilities can be built using game theory. The proposed approach is new to genomic data privacy, though such techniques have already been used to analyse re-identification risk [4].
In this paper, a game model is applied to determine the optimal set of protections for genomic data sharing, using a public resource, the Sequence and Phenotype Integration Exchange (SPHINX) system, as a case study. SPHINX reports single-nucleotide polymorphism (SNP) summary statistics (that is, minor allele frequencies, or MAFs) on data collected from the NIH-sponsored Electronic Medical Records and Genomics Pharmacogenomics (eMERGE-PGx) network [1].

Related Works
In this section, a brief overview of existing cryptographic and non-cryptographic techniques is presented. Cryptographic approaches perform computation on encrypted data to guarantee the privacy of individuals. A cryptographic approach to outsourcing genomic sequences to a cloud server is presented in [9]. Researchers leveraged trusted hardware inside an untrusted cloud to ensure privacy [5]. This secure hardware helps the server to execute queries independently. Instead of homomorphic encryption, the authors use a symmetric cryptosystem. However, both of these techniques can only process count queries. Some recent works present secure versions of statistical algorithms used in genomic studies, such as Hardy-Weinberg equilibrium, the Pearson goodness-of-fit test, and linkage disequilibrium [6]. A new notion of 'Similar Patient Queries' was introduced by [4], which showed the importance of secure ranked queries. The main contribution of that work is an approximation of edit distance that can be securely computed between two parties. With this distance (or string difference), the sequences of different patients can be ranked to retrieve the patients most similar to the one searched for. A number of other cryptographic solutions have been proposed for genomic data privacy. These techniques use homomorphic encryption, Yao's garbled circuits, or both as the underlying secure computation primitive.
Non-cryptographic approaches implement various sanitization techniques to ensure the privacy of genomic data. Privacy-preserving data publishing (PPDP) has been researched extensively for various types of data. These techniques study how to transform raw data into a version that is immunized against privacy attacks but still preserves useful information for data analysis. Existing techniques are primarily based on two major privacy models: k-anonymity and ε-differential privacy. Despite the wide applicability of k-anonymity in the healthcare domain, recent research results indicate that techniques based on it are vulnerable to an adversary's background knowledge. This has stimulated a discussion in the research community in favor of the ε-differential privacy model, which provides provable privacy guarantees independent of an adversary's background knowledge. However, it is not yet well understood whether differential privacy is the right privacy model for biomedical data, as it can fail to provide adequate data utility. Earlier studies proved that de-identification is an ineffective way to protect the privacy of participants in genome-wide association studies (GWAS) [3,4].
Building upon [3], the authors of [7] provide quantitative guidelines for researchers willing to make a certain number of SNPs publicly available in GWAS without revealing the presence of any single individual within a study group. Differential privacy has been proposed to protect the identities of participants in scientific studies [8]. In the same vein, researchers proposed privacy-preserving algorithms for computing various statistics related to SNPs while guaranteeing differential privacy [10]. Privacy-preserving schemes have also been proposed for medical tests and personalized medicine methods that use patients' genomic data [9]. For privacy-preserving clinical genomics, a group of researchers proposes outsourcing some costly computations to a public cloud or semi-trusted service provider [4].

Methodology
The genome contains very sensitive information about its owner. Predisposition to some diseases can be determined from single-nucleotide polymorphisms (SNPs). A SNP occurs when the nucleotide at a specific position in the deoxyribonucleic acid (DNA) varies between individuals of a given population. The SNPs of all individuals are represented by the random variable X, which takes values in the set {0,1,2}^(n × m) for n individuals and m SNPs per DNA sequence. The genomic privacy game is depicted in equation 1.1 as a tuple of four components, (P, S, U, Σ), where P is the set of players (sharer and recipient), S is the set of strategies, U is the set of payoff functions, and Σ is the finite DNA alphabet, Σ = {A (adenine), C (cytosine), G (guanine), T (thymine)}.
The sharer (S) and the recipient (R) are the two actors involved in genomic data sharing interactions. The sharer is an investigator of a study or an organization, such as an academic medical center, that manages genomic data. The recipient requests and accesses the data for some purpose (for example, replication of published findings or discovery of new associations). The privacy concern is that a recipient may exploit named genomes, or targets, by determining their presence in the research study.
A core motivation for both sharing and attacking genomic records is the belief that the data have intrinsic value. The sharer benefits by gaining utility from disseminating data, while the recipient benefits by detecting (and exploiting) the targets. Attacks entail costs, such as obtaining the identified data needed for linkage, as well as the human capital or computational power necessary to run the attack. The sharer's decision about a combination of instituting a data use agreement (DUA) and technical protection measures (for instance, suppressing information on certain SNPs), together with the recipient's decision as to whether an attack is worth its cost, constitutes a Stackelberg game, a natural model of this interaction.
In this model, the sharer is a leader who can (1) require a DUA with liquidated damages in the event of a breach of contract and (2) share a subset of SNP summary statistics from a specific study (suppressing the rest). The recipient of the data then follows by determining whether the benefits gained by attacking each target outweigh the costs. Importantly, the sharer chooses the policy that optimally balances the anticipated utility and privacy risk. To be precise, g is defined as the set of genomic variants (or SNPs) to be shared, a as the set of individuals to be attacked, B_S(g) as the benefit to the sharer, and Ĉ_S(a) as the estimated cost to the sharer. The sharer's goal is to maximize a payoff function that weighs B_S(g) against Ĉ_S(a) by selecting the best strategy g^*. The recipient, in turn, aims to maximize his or her own payoff (achieved through exploiting the data) and thus determines the subset of targets that he or she believes can be attacked successfully. Ĉ_R(a) denotes the estimated cost to the recipient of attacking targets, which includes the expected liquidated-damages penalty for breach of the DUA. The estimated benefit to the recipient is denoted B̂_R(g,a), the benefit the recipient expects to gain from attacking targets.
For any successfully attacked target, the recipient gains a fixed amount, G_R, but the sharer loses a fixed amount, L_S. As a result, both the sharer's cost and the recipient's benefit are proportional to the number of successfully attacked targets.
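The payoff structure described above can be written compactly as follows. This is only a sketch consistent with the stated definitions, assuming that each player's payoff is benefit minus cost. The symbol \hat{N}(g,a), standing for the estimated number of successfully attacked targets, is introduced here for illustration and does not appear in the original text. Because \hat{N} depends on g, the sharer's cost depends on g as well, even though the text writes it as Ĉ_S(a).

```latex
% Requires amsmath. \hat{N}(g,a) is a hypothetical symbol for the
% estimated number of successfully attacked targets.
\begin{align}
\hat{B}_R(g,a) &= G_R \, \hat{N}(g,a) \\
\hat{C}_S(a)   &= L_S \, \hat{N}(g,a) \\
a^*(g) &= \arg\max_{a} \left[ \hat{B}_R(g,a) - \hat{C}_R(a) \right] \\
g^*    &= \arg\max_{g} \left[ B_S(g) - \hat{C}_S\!\left(a^*(g)\right) \right]
\end{align}
```

The last two lines express the leader-follower order of the Stackelberg game: the sharer evaluates each g by anticipating the recipient's best response a^*(g).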
To simulate the recipient's uncertainty, the framework introduced by [11] is adopted. It compares MAFs between the shared data and a public reference, under the assumption that the shared data are drawn from a reference population. The estimated number of successfully attacked targets is then computed from the following quantities: π(i,g) is the posterior probability that target i is in the study; I is the set of individuals available for attack; p[i] is the prior probability that target i is in the study; l(i,g) is the likelihood ratio comparing the likelihood that target i is in the study with the likelihood that it is in the reference population; τ[i] is the probability that individual i is targeted; and a[i] is a binary variable that is 1 if target i is attacked and 0 otherwise.
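A minimal Python sketch of how these quantities could fit together is given below. It rests on assumptions that the text does not confirm: SNPs are treated as independent with a binomial allele model for the likelihood ratio, the posterior follows Bayes' rule in odds form, π = p·l / (p·l + 1 − p), and the estimated number of successes is the sum of τ[i]·a[i]·π(i,g) over targets. All function names are hypothetical.

```python
import math


def log_likelihood_ratio(genotype, pool_maf, ref_maf):
    """Log of l(i, g): pool membership versus reference membership.

    genotype holds minor-allele counts (0, 1, or 2) for the shared SNPs.
    Assumption of this sketch: independent SNPs, binomial allele model.
    """
    llr = 0.0
    for x, p_pool, p_ref in zip(genotype, pool_maf, ref_maf):
        llr += x * math.log(p_pool / p_ref)
        llr += (2 - x) * math.log((1 - p_pool) / (1 - p_ref))
    return llr


def posterior(prior, llr):
    """Posterior pi(i, g) from prior p[i] and the log-likelihood ratio."""
    lr = math.exp(llr)
    return prior * lr / (prior * lr + 1.0 - prior)


def expected_successes(posteriors, targeted, attacked):
    """Sum of tau[i] * a[i] * pi(i, g) over all candidate targets."""
    return sum(t * a * pi for pi, t, a in zip(posteriors, targeted, attacked))


# Toy example: one target, two shared SNPs, pool MAFs shifted from the reference.
llr = log_likelihood_ratio([2, 1], pool_maf=[0.4, 0.3], ref_maf=[0.2, 0.3])
pi = posterior(prior=0.5, llr=llr)
print(round(pi, 3))
```

When the pool and reference MAFs agree, the likelihood ratio is 1 and the posterior equals the prior, so publishing those SNPs gives the recipient no information about membership.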

System implementation and results
A Genomic Privacy Game Solver is developed to find the best solution for sharing genomic summary statistics (MAFs of SNPs) under an inference attack by an economically motivated recipient (the adversary), based on a Stackelberg game model. The inference attack aims to infer whether a targeted individual's DNA is in a genome pool with published summary statistics (that is, minor allele frequencies of SNPs).
In the experimental setup, the proposed model is applied to determine the optimal set of protections for genomic data sharing. The model is evaluated with the data of 8,194 individuals from the Sequence and Phenotype Integration Exchange (SPHINX) system datasets (Auton, 2015) as a case study. SPHINX reports single-nucleotide polymorphism (SNP) summary statistics (that is, minor allele frequencies, or MAFs) on data collected from the NIH-sponsored Electronic Medical Records and Genomics Pharmacogenomics (eMERGE-PGx) network. The genomic data include 82 genes identified as important for pharmacogenomics, with 38,112 variant regions (that is, areas of the genome with any type of notable variation across individuals), including 51,826 SNPs. The phenomic data include various clinical factors extracted from the electronic medical records of these participants (that is, diagnosis codes, procedural codes, and medications). SPHINX publishes all summary statistics and requires an end user licensing agreement that prohibits re-identification attempts.

Experimental Setup Datasets
We use the GWAS datasets from the 2015 iDASH Healthcare Privacy Protection challenge, which consist of 311 SNPs located on human chromosome 2 from 200 PGP participants.
Searching for the Best Solution to the Game
Globally and locally optimal solutions to the Stackelberg game are provided via a backward induction algorithm (BIA). The inputs to BIA are listed with Algorithm 1.4.

Algorithm 1.4: Backward induction algorithm
In the Stackelberg game, the sharer needs to evaluate his or her payoff for each available strategy. For each of the sharer's strategies, the recipient can play any of their own available strategies. The sharer will choose the strategy that maximizes his or her own payoff. Given the large space of possible strategy combinations, BIA is applied to facilitate the search. BIA is a brute force approach that systematically evaluates all of the possible strategies to discover the one with the maximal payoff.
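The brute-force search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the recipient attacks a target only when the expected gain exceeds the per-attack cost (access cost plus expected penalty), targeting probabilities are omitted, and the posterior and benefit functions are toy placeholders. All names are hypothetical.

```python
from itertools import combinations


def best_response(posteriors, gain, cost_per_attack):
    """Recipient's move: attack target i only if gain * pi exceeds the cost."""
    return [1 if gain * pi > cost_per_attack else 0 for pi in posteriors]


def backward_induction(snps, posterior_fn, benefit_fn, gain, cost_per_attack, loss):
    """Enumerate every subset of SNPs and keep the sharer's best payoff."""
    best = None
    for k in range(len(snps) + 1):
        for g in combinations(snps, k):
            posteriors = posterior_fn(g)
            attack = best_response(posteriors, gain, cost_per_attack)
            expected_hits = sum(a * pi for a, pi in zip(attack, posteriors))
            payoff = benefit_fn(g) - loss * expected_hits
            if best is None or payoff > best[0]:
                best = (payoff, g, attack)
    return best


# Toy placeholders (assumptions): sharing more SNPs raises both utility and risk.
def toy_posterior(g):
    return [min(1.0, 0.1 + 0.3 * len(g)), min(1.0, 0.05 + 0.1 * len(g))]


def toy_benefit(g):
    return 10.0 * len(g)


payoff, shared, attack = backward_induction(
    ["rs1", "rs2", "rs3"], toy_posterior, toy_benefit,
    gain=100.0, cost_per_attack=50.0, loss=40.0,
)
print(payoff, shared, attack)
```

In this toy setting the sharer releases a single SNP: releasing a second one would make attacking the first target worthwhile for the recipient, and the expected loss would outweigh the added utility.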

Backward induction algorithm (BIA)
Input: the pool dataset (D) and the reference dataset (R), where each row represents an individual, each column represents a SNP, and each cell represents the genotype as an integer from -1 to 2, where 2 represents minor-minor, 1 represents minor-major, 0 represents major-major, and -1 represents a missing genotype value; the maximal allowable missing rate; a threshold on minor allele frequency (mafcutoff); a threshold on the p-value indicating linkage disequilibrium (ldcutoff); the worth of the data to the sharer; the prior probability that a target is in the pool (p); the gain to the recipient per successful attack (G_R); the cost of access to the recipient per attack; the expected cost of penalty to the recipient per attack; the loss to the sharer per successful attack (L_S); and the number of targets.
Table 1 shows the average running time (ART) of BIA, where n is the number of individuals in the study and m is the number of SNPs available for sharing.
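The genotype coding and two of the input thresholds can be illustrated with a short Python sketch. How the thresholds are applied is an assumption here: a SNP is kept when its missing rate does not exceed the maximal allowable rate and its MAF is at least mafcutoff. The linkage disequilibrium filter (ldcutoff) is omitted, and the function names are hypothetical.

```python
def minor_allele_frequency(column):
    """MAF of one SNP column coded 0/1/2, ignoring missing calls (-1)."""
    called = [x for x in column if x != -1]
    if not called:
        return None
    # Each individual carries two alleles, hence the factor of 2.
    return sum(called) / (2 * len(called))


def missing_rate(column):
    """Fraction of individuals whose genotype is missing for this SNP."""
    return sum(1 for x in column if x == -1) / len(column)


def filter_snps(pool, max_missing_rate, maf_cutoff):
    """Return the indices of SNP columns that pass both filters."""
    kept = []
    for j in range(len(pool[0])):
        column = [row[j] for row in pool]
        maf = minor_allele_frequency(column)
        if (missing_rate(column) <= max_missing_rate
                and maf is not None and maf >= maf_cutoff):
            kept.append(j)
    return kept


# Toy pool: 4 individuals (rows) by 3 SNPs (columns); the last SNP is mostly missing.
pool = [
    [0, 1, -1],
    [1, 2, -1],
    [2, 0, 0],
    [1, 1, -1],
]
print(filter_snps(pool, max_missing_rate=0.1, maf_cutoff=0.05))
```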

Experimental Results Presentations
For Table 1, the legend also gives the number of iterations and the size of each subpopulation. Figure 1 shows a bar chart of the ART of BIA. Table 2 shows the performance comparison of the two algorithms on computational results of the sharer's payoff, where n is the number of individuals in the study and m is the number of SNPs available for sharing. It can be seen that the results of the two algorithms are the same. Figure 2 shows the sharer's payoff. BIA is compared with prior work that used a genetic algorithm (GA). The two algorithms are compared on two types of performance, (1) computational complexity and average running time and (2) accuracy, to determine which is more efficient in terms of computational results.
The parameters used by BIA are shown in Table 3. To measure the runtime (in milliseconds, or ms), we use an Intel Core i7 3 GHz processor. Table 4 shows the performance comparison of the two algorithms on computational complexity and running time for varying sizes of the study dataset. Initially, when there are only 5 SNPs and 20 individuals, GA is approximately 263 times slower than BIA. However, as the size of the search space grows, the running time of BIA quickly overtakes that of GA: by the time there are 20 SNPs and 200 individuals, GA is 189 times faster than BIA. This matters because the dataset used in the real-life case study contains hundreds of SNPs and thousands of individuals.
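The growth of the search space can be made concrete with a small Python calculation. It assumes that each sharer strategy is a subset of the m available SNPs and that BIA enumerates all of them, which gives 2^m candidates. It illustrates the trend only and does not reproduce the measured running times.

```python
def sharer_strategy_count(num_snps):
    """Number of SNP subsets a brute-force search must evaluate (2^m)."""
    return 2 ** num_snps


for m in (5, 20):
    print(m, sharer_strategy_count(m))
# 5 32
# 20 1048576
```

Going from 5 to 20 SNPs multiplies the number of candidate strategies by 32,768, which is why exhaustive search becomes impractical well before the case study's scale.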
In Table 4, n is the number of individuals in the study and m is the number of SNPs available for sharing; the legend also gives the number of iterations and the size of each subpopulation. Table 5 shows the performance comparison of the two algorithms on computational results of the sharer's payoff, with n and m defined in the same way.

Conclusion
Biomedical research cannot succeed without human genomic data sharing, and genomic data sharing cannot progress without some reasonable level of assurance that de-identified data from patients and other research participants will stay de-identified after they are released for research. Data use agreements that carry penalties for attempted re-identification of participants may be a deterrent, but they are hardly a guarantee of privacy. Genomic data can be partially suppressed as they are released, addressing vulnerabilities and rendering individual records unrecognizable, but suppression quickly spoils a data set's scientific usefulness. Game theory provides a quantitative framework for modeling the interaction between sharers and recipients. The game and its solution could serve as a basis for decision making by predicting the attacker's behavior. The game theoretic perspective is used to represent the way the sharer and the recipient interact around the release of genomic data. By estimating risk and the attacker's costs, the model estimates the likelihood that any named individual genotype record already held by the attacker is included in the de-identified data set slated for release.

Parameter Settings
Loss to the sharer per successful attack (L_S)
Gain to the recipient per successful attack
Worth of the data to the sharer
Cost of access to the recipient per attack
Expected cost of penalty to the recipient per attack
Prior probability that a target is in the study
Threshold on minor allele frequency (mafcutoff)
Threshold on the p-value indicating linkage disequilibrium (ldcutoff)
Maximal allowable missing rate
Number of targets