Privacy Preserving for Multiple Sensitive Attributes against Fingerprint Correlation Attack Satisfying c-Diversity



Introduction
Data generation and sharing have increased drastically in the ongoing decade. The reason is obviously the growing number of data sources driven by extensive research and the smart revolution (smart grids, cities, devices, etc.). The shared/published data are utilized by data researchers for research and analysis, which may involve data mining, statistical data analysis, and other policy making. In the context of health records, the data owners are the individuals to whom the data belong. The hospital that collects, manipulates, and shares those data is known as the data publisher. The data researchers may be a wide range of stakeholders (e.g., pharmaceuticals, government agencies, and survey organizations). The collected data contain private information (e.g., name, contact number, and social security number), partial identifiers (e.g., age, gender, zipcode, and country), and confidential or sensitive information (e.g., disease) about the data owners. Sharing such sensitive information is a privacy breach and legislatively wrong if it is disclosed to unauthorized parties.
To ensure the privacy of such information, most of the existing algorithms [1][2][3][4][5][6] in the literature deal with a single sensitive attribute only. However, a dataset may practically have multiple sensitive attributes (MSAs) [7][8][9][10][11][12][13][14]. For example, a hospital may publish data with more than one sensitive attribute, such as disease, symptom, and physician, as shown in Table 1. The sensitive nature of healthcare records urges researchers to handle such scenarios and to ensure that the privacy of an individual is not breached.
In data publications, along with privacy, data utility is also a major concern so that researchers may perform research and analysis. Therefore, data should be anonymized in such a way that research analysts may still extract useful information. Balancing privacy and utility in privacy preserving data publishing (PPDP) is an NP-hard problem [15][16][17][18][19][20]. Therefore, the scenario in this paper is more challenging, as we consider the dimensionality in quasi-identifiers (QIs) as well as more than one sensitive attribute, i.e., MSAs.
An adversary or an attacker is a person who tries to breach the data privacy using different types of background knowledge (bk) about the MSA dataset. The bk includes the fact that certain patterns of values in published data are more likely to be observed than others. For example, this knowledge can be fingerprint correlation (fcorr) knowledge, QI knowledge (qik) [10], or nonmembership knowledge (nmk) [21,22]. The MSA values in a table that belong to a specific individual form a fingerprint. The fcorr between two k-anonymous [1] groups can increase an adversary's knowledge. The qik is the personally identifiable information (PII) [21] that lets an adversary uniquely identify an individual, and according to nmk, an individual cannot be linked to a specific sensitive value (SV). The (p, k)-Angelization [22] is a strong privacy algorithm for MSAs, where p represents the different sensitivity levels of categorical SAs and k implies the k-anonymous QIs. Tables 2 and 3, produced by the (p, k)-Angelization algorithm, are obtained from the original microdata in Table 1. The authors in [22] overcame the problem of the nmk attack, but privacy could still be breached through fcorr, which we name the fcorr attack. The fcorr attack is considered a comparatively strong privacy attack. If the adversary intends to disclose the privacy of almost every individual, the fcorr attack can iteratively breach the privacy of the whole dataset. The privacy breach scenario is explained in Section 1.1 in detail.

Motivation.
The (p, k)-Angelization [22] algorithm directly adopts the single-SA approach named angelization [23] to implement privacy for MSAs. This approach invalidates the (p, k)-Angelization under the fcorr attack. Privacy breach scenario I explains the invalidation of [22] in detail. The complexity, lack of utility, and privacy breaches of the SLOMS [24] and SLAMSA [25] techniques have already been invalidated by the (p, k)-Angelization. Although [22] is an efficient solution for utility improvement, the intruder can easily breach the privacy of a record using the bk and his intelligence. Our work has been motivated by the following limitations of the (p, k)-Angelization algorithm: (i) Privacy breach scenario I. For example, an adversary (i.e., David) intends to identify p2's (Lisa's) information in Table 1. Since they both live in the same neighbourhood, her age, gender, and zipcode are known (21, F, and 34607). Using the QIs, David identifies her presence in group 3 of the generalized table (GT), i.e., Table 2, and through the batch ID, the corresponding group 3 in the sensitive batch table (SBT), i.e., Table 3, can be accessed. For the (p, k)-Angelization, physician is the maximum weighted attribute (see Section 5.2). The maximum weighted attribute implies high dependency, which carries a high privacy risk; an attack on it can easily breach privacy. So the intruder starts the attack from the physician attribute. It is an iterative process that leads to the record identification of the target. This is a column-wise vertical correlation between two SA fingerprint buckets (SAFBs) in the SBT that have common physicians and other SA values. The intruder takes the intersection of SAFB 3 with groups having common physicians and proceeds iteratively until p2 is identified. So, he takes the intersection between SAFB 3 and SAFB 2 because of the common physician Jack, between SAFB 2 and SAFB 4 because of Tom, and then between SAFB 3 and SAFB 1 because of Alan. Table 4 depicts the identified SVs and hence the disclosed individuals.
Although the intruder was interested in identifying only p2, the privacy of p1 and p4 was also breached during the process, which implies that this process can iteratively breach the privacy of the individuals in the complete dataset. The intruder uses Table 3 (SBT), at each step stores the values in Table 4, and finally identifies all the sensitive information related to p2. In Table 4, the values against each physician are those obtained by taking the intersection between two SAFBs in Table 3 linked through common physicians' names. In Table 3, Jack is common between SAFB 3 and SAFB 2, so whatever value David gets from the intersection, he adds against Jack in Table 4. First, chest X-ray is common in the diagnostic method. The leftover value ultrasound belongs to Alan for sure. In group 3, the two remaining diagnosis values cannot both be assigned to Tom, as Tom may have only one value, so the intruder is not sure at this stage. In the symptoms attribute, back pain is common and is stored against Jack. Here, another symptom value, "swelling," definitely belongs to Alan because there is no other physician or symptom left. Since any further intersection for cancer treatment and cancer type does not produce any value, the process moves on to SAFB 2 and SAFB 4 because of Tom. Similarly, for the diagnostic method, Tom had CT scan and blood test, and there is no value for Frank. Although there is one value left for Frank, the intruder can refine it while taking Frank's intersection with other SAFBs that are not in the current sample dataset. In the symptom column, abdominal pain is assigned to Tom, and the leftover value in SAFB 4, testis swelling, to Frank. The weight loss and back pain symptoms in SAFB 2 cannot be assigned to either Tom or Jack because the intruder does not have enough information about them yet. For cancer treatment, there is no common value, while for the cancer type, prostate is assigned to Tom. The last intersection process is between SAFB 1 and SAFB 3.
Although there is no common value for the diagnostic method, and the values in SAFB 3 are only related to rays, the intruder is intelligent enough to assign the MRI test to Alan. This may not be the exact value, but it helps to guess or identify the record. For the symptom, although we already have swelling for Alan, assigning back pain to Alan does not work, since back pain is already assigned to Jack and the only two values in this cell do not suit the intruder's knowledge. For cancer treatment, the common value is surgery, and for the cancer type, breast is the only attribute value. Among the leftover values, rectal in SAFB 3 goes to Jack and colon in SAFB 1 to Daisy. At the end, as the intruder also knows that p2 is a female, her attribute values {breast cancer, swelling, and ultrasound/MRI} can easily identify p2. The weighted sensitive attribute values disclosed against the linkable (L) SA identify the patient p2's record. Table 4 shows that during the process, the details of patients p1 and p4 are also identified. Some of the information regarding Frank and Daisy is incomplete or incorrect because no further intersection with any other group is possible, since the current data are sample data. This process executes iteratively and can also identify the MSA values of the remaining patients.
(ii) Need for bucketization. Deeply analysing the (p, k)-Angelization, it is observed that not all the features of angelization [16] were well utilized. Tables 2 (GT) and 3 (SBT), produced by the (p, k)-angelization, have one-to-one correspondence/linking through the bucket ID (BID). Due to the one-to-one correspondence, the two tables are not independent, while the purpose of angelization was to publish both tables independently. Applying SA diversity may affect the utility in GT. Similarly, increasing the dimensionality of the QIs in GT also decreases the utility. After finding the presence of an individual in a bucket in GT, the adversary can easily move from GT to the exact group in SBT, where the L fingerprint buckets may help in isolating the sensitive values. In fact, splitting the table into GT and SBT in the (p, k)-angelization serves no purpose.
In our proposed (c, k)-anonymization algorithm, the bucketization approach is adopted, which separates the QIs and SAs into two tables, the generalized table (GT) and the sensitive table (ST), published independently. The two tables are linked through the BID.
GT consists of k-anonymous generalized buckets (GBs) of QIs, from which an adversary cannot get additional information about an individual's privacy. The ST is a bucket table with MSAs in bucketized form, named sensitive attribute fingerprint buckets (SAFBs). Anatomy [26] and Angel [23] are examples of bucketization for preserving privacy; however, they are applicable to single sensitive attributes. In this work, we use bucketization for MSAs, which can prevent different types of adversary attacks, e.g., the fcorr attack. Tables 5 and 6 are the GT and ST produced by the proposed (c, k)-anonymization algorithm. With the proposed approach, better privacy is achieved with minimum utility loss. It is also not necessary that the publisher always publish the data with all the QI attributes; publishing the GT with a few QI attributes instead of all of them, along with the ST, is known as marginal publication. The idea of marginal publication was introduced in [23]. The bucketization has minimum information loss because of the independent publishing of the GT and ST. In these two tables, the connection is not between the buckets; instead, it is between the records in the generalized buckets and the sensitive buckets.
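The iterative intersection process of the fcorr attack described in Scenario I can be sketched as follows. The bucket contents below are illustrative placeholders rather than the exact values of Table 3, and the helper names (`linkable`, `fcorr`) are ours:

```python
# Sketch of the fingerprint-correlation (fcorr) attack from Scenario I:
# intersect linkable SAFBs that share a value of the maximum weighted
# attribute delta (physician) to deduce single sensitive values.

# Each SAFB maps an SA name to the set of values in that bucket.
safbs = {
    2: {"physician": {"Jack", "Tom"},
        "symptom": {"back pain", "weight loss"}},
    3: {"physician": {"Jack", "Alan"},
        "symptom": {"back pain", "swelling"}},
    4: {"physician": {"Tom", "Frank"},
        "symptom": {"abdominal pain", "testis swelling"}},
}

def linkable(a, b, delta="physician"):
    """Two SAFBs are linkable (L) if they share a value of the
    maximum weighted attribute delta (here: physician)."""
    return safbs[a][delta] & safbs[b][delta]

def fcorr(a, b, attr):
    """Values of `attr` common to two linkable buckets: candidates
    for the shared physician's fingerprint."""
    return safbs[a][attr] & safbs[b][attr]

# The intruder walks the chain of linkable buckets, as in Scenario I.
for pair in [(3, 2), (2, 4)]:
    common_phys = linkable(*pair)
    if common_phys:
        print(pair, common_phys, fcorr(*pair, "symptom"))
```

A singleton intersection (e.g., {"back pain"} for the pair SAFB 3 and SAFB 2) is exactly the uniquely deducible value the attack exploits; an empty or multi-valued intersection leaves the intruder uncertain.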

Contributions.
We propose an efficient solution, (c, k)-anonymization, for privacy preservation with MSAs. In the (p, k)-angelization [22], privacy can be breached under the fcorr attack (explained in Section 1.1). The tables published by the proposed (c, k)-anonymization are depicted in Tables 5 and 6. The "Name" attribute in Table 5 is not published while publishing the data. The proposed approach also prevents the adversary's nmk and qik attacks. The main contributions are as follows: (i) We propose an improvement of the (p, k)-angelization, named the (c, k)-anonymization algorithm, for MSA privacy. The proposed solution prevents the fcorr attack. To reduce the privacy risk, the real (i.e., one-to-one) linking between GT and ST is transformed into one-to-many (i.e., real and likely) linking. (ii) We formally model and investigate the invalidation of the (p, k)-angelization under the fcorr attack and the correctness of the proposed (c, k)-anonymization algorithm. (iii) Based on the above points, the experimental results prove that our proposed approach provides better privacy and utility as compared to its counterpart.

Related Work
In this section, we broadly categorize the data privacy models in order to position the proposed work within the available literature.

Data Privacy Models and Methods.
Privacy models can be categorized as (i) syntactic (i.e., partition based) or (ii) semantic (i.e., randomized). The syntactic approach achieves privacy at two levels: clustering the data and applying a privacy framework. The k-anonymity [1], its extension l-diversity [3], and t-closeness [4] are examples of syntactic data privacy models, in which the final set of groups are called equivalence classes (ECs). In the semantic approach, the original values are noised in a random way. ϵ-differential privacy [27] is an example of the semantic data model. Researchers have proposed both syntactic and semantic privacy models for different types of data, e.g., a single sensitive attribute [3,4,6], MSAs [7][8][9], or 1 : m (i.e., one individual having many records) [28] microdata. For preserving privacy, the algorithms in these privacy models follow different approaches, which can be categorized as (i) generalization [1][2][3][4][5][13][14][15] (i.e., greedily convert more specialized values to less specialized values), (ii) anatomy [25,26] (i.e., partition the QI and S attributes), and (iii) microaggregation [29,30] (i.e., partition the dataset into clusters where the QI values of the records are replaced with the cluster mean). The proposed work in this paper considers syntactic data privacy, using generalization and anatomy for MSAs. SLAMSA [25] is an effective technique for MSAs; although it does not generalize the QI attributes, which enhances utility, it publishes many tables, making the solution more complex. To prevent proximity breach, the authors in [7] adopted the multisensitive bucketization (MSB) technique using clustering; however, it is applicable to numerical data only. The (α, l) model [8] for a single sensitive attribute satisfies the privacy requirements for MSAs. The authors in [31] prevent the negative and positive disclosure of associations between MSAs. In [33], a rating of MSAs was proposed that fulfils the privacy requirements.
However, the inherent relationship between the SAs can cause an association rule attack. An adversary can use related bk to breach the privacy. The authors in [32] protected the data from the association attack and removed the weakness of the rating algorithm. In [37,38], the authors perform vertical partitioning (i.e., anatomy) and implement decomposition and decomposition plus, respectively, to achieve l-diversity for MSAs. Decomposition plus [38] optimizes the noise value selection of [37] and keeps it closer to the original. The possibility of skewness and similarity attacks in [4,39] was eliminated by the p+ sensitive t-closeness model [40], which combines the good features of the p-sensitive k-anonymity [39] and t-closeness [4] approaches.

Syntactic Anonymization
ANGELMS (anatomy with generalization for MSAs) [34] vertically partitions the dataset into a QI table and several SA tables satisfying the k-anonymity [1] and l-diversity [3] principles, but it can still be attacked with similarity, skewness, and sensitivity attacks. In [16,18], the KC-Slice method for dynamic data publishing of MSAs integrates the features of the KC-privacy and slicing techniques. The authors have presented the method for a single release, and no studies of multiple releases are available to prove the dynamic claim. In [35,36], MSAs were handled for achieving privacy, but the l-diversity [3] principle was directly adopted, which caused huge information loss. The nmk attack was prevented in [41], but this still caused high information loss due to the grouping conditions over the data and vulnerability to the background join attack. The proposed work categorizes the sensitivity of MSAs as top secret, secret, less secret, and nonsecret. c-diverse fingerprint buckets are created that contain records from different categories. The QI values of the created fingerprint buckets are bottom-up generalized through k-anonymity [1].

Preliminaries
Let table T = {EI, QI, S} (as shown in Table 1) be the private data a publisher intends to publish. Let there be t tuples in T, where each tuple represents an individual or record respondent i. The components of a tuple t ∈ T are the explicit identifier attributes (also called identifying attributes) EI, the quasi-identifiers QI, and the sensitive attributes S. QIs are the partial identifiers or personally identifiable information (PII) that can identify an individual i if linked with external data, e.g., voting or census data. Data privacy is all about protecting the sensitive information, i.e., the confidential and private information belonging to an individual. In this work, we consider the challenging scenario of more than one sensitive attribute for a single individual, named multiple sensitive attributes (MSAs). The notations used in the paper are shown in Table 7.
Definition 1 (MSA fingerprint). The MSA values in table T that belong to a specific individual form a fingerprint, known as the MSA fingerprint.
Definition 2 (sensitive attribute fingerprint bucket (SAFB)). A sensitive attribute partitioning of the microdata T consists of a list of SAFBs, FB1, FB2, . . ., FBm, according to the following conditions: In the literature, an adversary is the attacker who intends to breach the privacy and has different types of knowledge, known as bk. Data correlation is an important type of adversary knowledge that breaches the privacy. The data correlation can be attribute correlation, which exists among two or more attributes, e.g., [42], or row correlation between two or more rows, e.g., [43]. This paper is related to row correlation and more specifically to FB correlation. This work focuses on reducing the threat exposed by FB correlation linked through the high-weighted SA value. Each FB contains a few fingerprints that belong to k individuals in a specific GB inside GT. The adversary uniquely identifies an individual from the fingerprint correlation knowledge, which has a direct correspondence with the QI values.
Definition 4 (nonmembership knowledge (nmk) [22]). If an adversary knows that an individual i in a GB cannot be linked to a specific SV in an FB, this is known as nmk.
The MSA values obtained from correlating two linkable FBs, i.e., FBi ∩ FBj, can be assigned to a specific individual.
Based on the available information, we consider that the adversary's bk consists of bk = {GT, ST, publicly available data}. The adversary applies the bk to the available anonymized data to perform an attack and to breach an individual's privacy.
Definition 6 (fingerprint correlation (fcorr) attack). An adversary with known QI values and fcorr knowledge is able to perform the fcorr attack by deducing single SVs from the intersection of FBs that are linked via δ. The fcorr attack can be a (i) partial (pfcorr) attack, where a few of the SAs from fingerprints in two or more FBs may produce unique SVs, or a (ii) full (ffcorr) attack, where all of the SAs from fingerprints in two or more FBs may produce unique SVs. For an adversary, the ffcorr attack leaves no doubt in uniquely identifying an individual, while the pfcorr attack can identify a record respondent if the resulting sensitive information belongs to attributes above the minimum weighted attributes (ζ w_i) (see the algorithm in Section 5.1). The ζ w_i attributes do not contribute to an individual record's identification. Definition 7 (high-level Petri nets (HLPNs) [44]). A Petri net is used as a model to examine the control of information in a system. An HLPN formally analyses the system with its mathematical properties. An HLPN is a 7-tuple N = (P, T, F, φ, Rn, L, M0), where P is a set of places represented by circles; T is a set of transitions represented by rectangular boxes such that P ∪ T ≠ ∅ and P ∩ T = ∅; F is the flow relation such that F ⊆ (P × T) ∪ (T × P); L are the labels on F; φ maps places to types; Rn represents the rules for transitions; and M0 is the initial marking. In short, L, φ, and Rn represent the static semantics, whereas P, T, and F depict the dynamic structure.
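Definition 7 can be illustrated with a minimal sketch of the 7-tuple and its structural constraints (P ∩ T = ∅ and F ⊆ (P × T) ∪ (T × P)); the toy net below is hypothetical and only demonstrates the shape of the model:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

# Minimal sketch of the HLPN 7-tuple N = (P, T, F, phi, Rn, L, M0)
# from Definition 7; the example net is an illustrative placeholder.

@dataclass
class HLPN:
    P: Set[str]                       # places (circles)
    T: Set[str]                       # transitions (rectangular boxes)
    F: Set[Tuple[str, str]]           # flow relation, F ⊆ (P×T) ∪ (T×P)
    phi: Dict[str, type]              # maps places to data types
    Rn: Dict[str, Callable]           # rules guarding transitions
    L: Dict[Tuple[str, str], str]     # labels on flows
    M0: Dict[str, list]               # initial marking

    def well_formed(self) -> bool:
        # P and T must be disjoint, and every arc must connect a place
        # and a transition (in either direction).
        return (not self.P & self.T) and all(
            (s in self.P and t in self.T) or (s in self.T and t in self.P)
            for s, t in self.F)

net = HLPN(P={"RawData", "GT"}, T={"Anonymize"},
           F={("RawData", "Anonymize"), ("Anonymize", "GT")},
           phi={"RawData": tuple, "GT": tuple},
           Rn={"Anonymize": lambda token: True},
           L={}, M0={"RawData": []})
print(net.well_formed())  # a valid place→transition→place chain
```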
and SA are the MSAs from table T. (iii) Batch partitioning satisfies (p, k)-anonymity [16], where every bucket has records from p categories and k tuples to prevent a linking attack, while the group partitioning satisfies (p, k)-anonymity [23].
(i) The invalidation of the existing (p, k)-angelization is due to the fcorr that causes the fcorr attack (as shown in Table 4). Although the records from p categories are k-indistinguishable on the MSA fingerprint, they are uniquely distinguishable because of the unique SAFB values obtained from linkable buckets. So, Lemma 2 in [22] is incorrect. Its corrected form is given in Lemma 1 (Section 5). (ii) Invalidation of Theorem 2 in [15]: the adversary can correlate the sensitive information from qik. Scenario I explains the fcorr attack that extracts unique sensitive information using QI values.
Now, we formally model the (p, k)-angelization algorithm to check its invalidation with respect to the fcorr attack. The (p, k)-angelization algorithm is depicted with an HLPN and formally analysed with its mathematical properties. The purpose of using HLPNs is to depict (i) the interconnection of the model components and processes, (ii) a clear flow of data among the processes, and (iii) an in-depth insight into how the processing of information takes place, in order to isolate the flaw in the (p, k)-angelization. Figure 1 depicts the HLPN for the (p, k)-angelization. The variable types and the mapping of data types onto places are shown in Tables 9 and 10, respectively. The adversarial model in Figure 1 comprises three entities: the end user, the trusted data sanitizer, and the adversary. The initial transition is referred to as the input transition and contains the raw data (e.g., patients' EHRs) collected from a health organization. The trusted data sanitizer anonymizes the data using the (p, k)-angelization algorithm and produces the GT (Table 2) and SBT (Table 3) tables.
The produced tables are ready to be published and are exploited by the adversary through the fcorr attack in Table 4.
Rule 1 checks the existence of the number of dependent SAs with respect to another SA. Rule 2 counts the dependent attributes and selects the maximum weighted attributes. However, if there exists more than one attribute in the weight set, then an external factor ε is added to one of them, based on some external facts. The weight set is sorted in descending order to select the maximum weight, as in rules 3 and 4. Based on the weight calculation and the MSAs in T, the category table is formed in rule 5.
The problem arises from rule 6 onward. The (p, k)-angelization [22] blindly follows the basic angelization [23] mechanism. According to rule 6, the data in table T, based on the category table (Table 8), use angelization to create the GT and SBT. In [23], the L between two SA buckets

Proposed (c, k)-Anonymization for Multiple Sensitive Attributes
Although the (p, k)-Angelization model is a state-of-the-art approach for MSAs, especially for the categorization of sensitive values, the ST still lacks privacy because the same angelization [23] approach is blindly used for MSAs. This leads to the fcorr attack, i.e., the ffcorr attack or the pfcorr attack. We name the improved form of the (p, k)-angelization (c, k)-anonymization for MSAs, and it is described as follows:

Figure 1: HLPN for (p, k)-angelization.

Lemma 1 (uncertainty in L SAFBs). If, for a table T having the MSA dataset, the anonymized form T' satisfies (c, k)-anonymization, then T' satisfies (c, k)-diversity for the MSA fingerprints.
Proof. Let sbp be a random sensitive bucket partitioning in T'. There must be at least t tuples from c categories in sbp such that (sbp_a ∩ sbp_b > 0 or 1, a ≠ b) or fcorr ≤ ζ w_j ∀ L sbp, where ζ w_j are the minimum weighted SAs having the lowest dependency (see the algorithm in Section 5.1). So the c categorized records are indistinguishable on the MSA fingerprint among k records. Thus, sbp satisfies the definition of (c, k)-diversity, and the uncertainty in the L SAFBs is maximized. Consider Tables 2 and 3. The (p, k)-angelization has one-to-one real linking through the BID. The real linking has high privacy leakage and a 100% chance of a presence attack. Another type of linking is likely linking between GT and ST, where not all BID linkings are real. Relating the real and likely linkings, the privacy risk for an individual is defined as PriRisk(t) = r/n, where r is the number of real linkings and n is the total number of likely linkings. The likely linking for a specific size of EC varies, and it depends on the QI values in GT that have linkings with certain FBs. For preventing privacy leakage, PriRisk(t) ≤ 1/l, where l represents the l-diversity [3]. Every FB has c-diverse fingerprints that correspond to at least k individuals. In our proposed (c, k)-anonymization, c = 1 implies l = 2; c = 2 implies l ≤ 4; c = 3 implies l ≤ 6; and so on.
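The linking-based risk measure above, PriRisk(t) = r/n with the threshold PriRisk(t) ≤ 1/l, can be sketched as follows; this is a minimal sketch, and the function names are ours:

```python
# Sketch of the privacy-risk measure PriRisk(t) = r / n: r real BID
# linkings hidden among n likely linkings for a tuple t. The threshold
# check enforces PriRisk(t) <= 1/l for l-diversity.

def pri_risk(real_links: int, likely_links: int) -> float:
    if likely_links == 0:
        raise ValueError("a tuple must have at least one likely linking")
    return real_links / likely_links

def within_l_diversity(real_links: int, likely_links: int, l: int) -> bool:
    return pri_risk(real_links, likely_links) <= 1 / l

# One real linking hidden among 4 likely linkings meets l = 4 ...
print(within_l_diversity(1, 4, 4))   # True
# ... but a one-to-one real linking (as in (p, k)-angelization) never does.
print(within_l_diversity(1, 1, 2))   # False
```

The second call illustrates why the real one-to-one linking of the (p, k)-angelization carries maximal risk: PriRisk(t) = 1 regardless of l.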

(c, k)-Anonymization Algorithm.
The objective of the proposed (c, k)-anonymization algorithm is to provide sustainable privacy for MSAs. The algorithm takes a microdata table T (Table 1) as input and produces two anonymized tables, i.e., GT (Table 5) and ST (Table 6). The proposed Algorithm 1 performs two major functionalities: categorizing the MSAs based on calculated weights and creating secure FBs for the whole dataset. For SA categorization, the algorithm calculates weights for all the MSAs in the dataset to learn the dependency of the SAs. The dependency shows the sensitivity level of the SAs. These weights are sorted in descending order to get the categorized MSAs. The weight calculation for the MSAs creates the category table (CtgT), which helps to create l-diverse (c-diverse with respect to category) FBs. The FBs must satisfy equations (4) and (5) in order to prevent the fcorr attack. Therefore, if some of the FBs do not conform to equation (4) or (5), they are refined until they do. The refinement is necessary because the input data may be of different natures and may contain SVs that cannot be grouped initially. The complete algorithm (Algorithm 1) and its working are explained in the following. Sensitive attribute weight calculation. In Algorithm 1, a calling function wtCalc( ), shown in Function 1, calculates weights for all the MSAs from Table 1. There are six different types of SAs, and each has its own level of sensitivity or sensitivity weight. Let W = {w_1, w_2, . . ., w_n} be the set of weights for each SA such that s_1 has weight w_1, s_2 has weight w_2, and so on. An SA's weight reflects its dependency on all the other SAs. Similar to [22], the weight is calculated in the following equation, where m is the total number of attributes dependent on attribute s_u. The dependency of an SA is determined by the total range of attributes identifying the SA. The for loop at the beginning calculates the sensitivity of s_u with respect to all other SAs, i.e., s_v. This determines the sensitivity level of all the SAs.
To calculate the weights (second for loop), cardinality checking is performed on all dependent attributes. The calculated weights for the SAs that exist in the microdata are shown in Table 11.
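The weight calculation of wtCalc( ) can be sketched as follows. Since the exact weight equation is not reproduced here, the sketch assumes the weight of s_u is its normalized dependent count m/(n − 1); the dependency table itself is an illustrative placeholder, not the paper's measured dependencies:

```python
# Sketch of the sensitive-attribute weight calculation (wtCalc()).
# ASSUMPTION: weight(s_u) = m / (n - 1), where m is the number of
# attributes dependent on s_u and n is the number of SAs.

# depends_on[a] = set of SAs whose values can be inferred from attribute a
# (illustrative placeholder values)
depends_on = {
    "physician":         {"disease", "symptom", "diagnostic method"},
    "disease":           {"symptom", "cancer type", "cancer treatment"},
    "symptom":           {"disease"},
    "diagnostic method": set(),
    "cancer type":       {"cancer treatment"},
    "cancer treatment":  set(),
}

def wt_calc(dep):
    n = len(dep)
    # Higher dependent count m => higher sensitivity weight.
    return {sa: len(dependents) / (n - 1) for sa, dependents in dep.items()}

weights = wt_calc(depends_on)
# Sorted descending, ready for categorization into top secret, secret,
# less secret, and nonsecret.
for sa, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{sa}: {w:.2f}")
```

Note that, under this illustrative dependency table, physician and disease come out tied, mirroring the tie the paper resolves with the external factor χ.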

Wireless Communications and Mobile Computing
Maximum weighted sensitive attribute selection. From the dependencies calculated through Function 1, the maximum weighted attributes ξ w_j are selected. A maximum weight means high dependency, which leads to a high disclosure risk, so the attribute set ξ w_j needs maximum protection. Although the chances are very rare, a problem arises if more than one SA has the same ξ w_j. Then, an external factor χ is added to select only one maximum weighted attribute. The algorithm adds the external factor χ, i.e., w_j + χ(w_j), to each attribute in the set ξ w_j, and the weights w_i are then sorted in descending order to get one single maximum weighted SA (δ) (see Algorithm 1).
Categorizing sensitive attributes. The attribute occurring at the first position of the sorted weight set is selected as δ. Based on the descending order of the calculated weights, the MSAs are categorized through the categorize() function as top secret, secret, less secret, and nonsecret in Table 8. From the weight calculations, although disease and physician have equal weights, we select physician as the maximum weighted SA (δ) because of external factors χ; for example, physician information may be publicly available on the Internet.
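The tie-breaking selection of δ with the external factor χ can be sketched as follows; the weight values, the χ value, and the choice of which tied attribute is externally exposed are illustrative assumptions:

```python
# Sketch of maximum-weighted-attribute selection with the external
# factor chi used to break ties between equally weighted SAs.

def select_delta(weights, chi=0.05, externally_exposed="physician"):
    """Pick the single maximum weighted SA (delta). If several SAs tie,
    add chi * w to the one favoured by external facts (e.g., physician
    names being publicly listed on the Internet)."""
    top = max(weights.values())
    tied = [sa for sa, w in weights.items() if w == top]
    if len(tied) > 1 and externally_exposed in tied:
        weights = dict(weights)  # do not mutate the caller's dict
        weights[externally_exposed] += chi * weights[externally_exposed]
    # Descending sort by weight; the first attribute is delta.
    return max(weights, key=weights.get)

weights = {"physician": 0.6, "disease": 0.6, "symptom": 0.2,
           "cancer type": 0.2, "diagnostic method": 0.0}
print(select_delta(weights))  # physician wins the tie via chi
```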
Creating FBs. FBs consist of MSAs with the BID column at the time of anonymization. Function 2, CreateFB( ), shows the whole process of creating c-diverse FBs. FBs are created mainly based on the CtgT (shown in Table 8). Let A_s = {A_s1, A_s2, . . ., A_sm}, and let N be the total number of records in the microdata table T. To create a 2-anonymous, 2-diverse FB, i.e., k = 2, l = 2 (i.e., c = 1), for example, let r_u and r_v be two different records in T such that r_u ∈ ctg_x and r_v ∈ ctg_y, where ctg_x ≠ ctg_y and ctg_i ∈ CtgT. The union of these two different records from different sensitive categories creates an FB, as shown in the following equation, where u ≠ v and x ≠ y. Selecting records from different categories to create an FB implements the l-diversity principle, in the form of c-diversity in our case. Privacy in data is all about creating an EC that prevents an intruder from breaching any of its sensitive contents. Unlike [22], we focus on creating an FB that satisfies c-diversity and prevents any attack from the adversary, e.g., the fcorr attack. The fcorr attack is prevented by taking two measures: (i) minimizing the linkability or linking (L) between two FBs and (ii) uncorrelating the records, or maximizing the uncertainty, between the L FBs (explained under Refining FBs). Patient records (i.e., SVs) are high in number compared to the physician attribute, and both can normally be correlated or linked with one another. This L provides an opening for an attack. Therefore, to minimize L, we use the linkability control factor c_f (1 ≤ c_f ≤ max count for a specific δ). c_f minimizes the repetition of the same δ value in different FBs. The table published by the proposed algorithm (Table 6) has c_f = 2, which means that a maximum of two records for a single physician can exist in an FB. This confines the existing δ values to a minimum number of FBs, and L is reduced.
So, the chances of the fcorr attack on the possible FBs are ultimately reduced (Table 3 has 3 L FBs, while Table 6 has only 1 L FB). A high value of c_f further reduces L but increases the information loss, so a balance should be maintained between c_f and utility preservation.

Input: T: microdata table = {ID, QI, S}; χ: external factor. Output: GData: generalized table (GT); SAFB: sensitive table (ST).

Refining FBs. FBs are refined through the function RefineFB( ), as depicted in Function 3. The fcorr attack between any two L FBs can breach the privacy; the purpose here is to completely avoid the correlation. A percentage of a record disclosed from the intersection between L FBs can be associated with a specific individual. For example, any percentage of a record obtained from equation (3) that results in a single value for each SA correlation is a privacy breach and is not acceptable, especially for the high weighted attributes.
where i ≠ j and i, j are L via δ. A high percentage disclosure implies high intruder confidence of a privacy breach, and vice versa. A decreasing order of w_i corresponds to a decreasing probability of privacy breach among the n L FBs. The measure to prevent the fcorr attack is no or minimal data exposure from the intersection. The refining process for FBs in Function 3 works under two approaches, i.e., strict and relax. The strict approach is given in the following equation, where i ≠ j and the FBs are L through δ, i.e., δ_FBi = δ_FBj (for example, they are L through the physician), and n is the total number of FBs in which the same physician exists. The intersection in this case ensures that fcorr should be zero or should have more than one SV in common, to create uncertainty about a single SA.
In the case of a worst dataset, some records may not fit into any FB via the strict approach. A relax strategy is then adopted with no breach of privacy: the percentage of record exposure from fcorr is minimized to an acceptable value. In the worst-case scenario, the proposed (c, k)-anonymization maintains fcorr ≤ ζ w j , ∀ FB L , where ζ w j denotes the minimum-weighted attributes that have no dependency. The relax strategy is given in the following equation:

fcorr ≤ ζ w j , ∀ FB L , (5)

where i ≠ j and the FBs are L through δ, i.e., δ FB i = δ FB j ; for example, they are L through the physician, and n is the total number of FBs in which the same physician exists. According to equation (5), only fcorr ≤ ζ w j , ∀ FB L is acceptable, which is a percentage of information leakage but not a privacy breach, because the attributes involved are not dependent, and it is hence impossible for an intruder to link them with another SA or QI of a specific patient record. The working of the RefineFB( ) function for equation (5) is the same as for equation (4).
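As an illustration, the two refinement checks can be expressed as set operations over the sensitive values of linked FBs. This is a minimal sketch under assumed data structures (each FB is modelled as a set of SVs, and `zeta` as the set of minimum-weighted, dependency-free values); it is not the paper's RefineFB( ):

```python
def fcorr(fb_i, fb_j):
    """Overlap of sensitive values between two linked FBs."""
    return fb_i & fb_j

def strict_ok(fb_i, fb_j):
    """Equation (4): the overlap must be empty or larger than one SV, so a
    single sensitive value can never be pinned to one individual."""
    return len(fcorr(fb_i, fb_j)) != 1

def relax_ok(fb_i, fb_j, zeta):
    """Equation (5): any disclosed overlap must stay inside `zeta`, the
    minimum-weighted, dependency-free values (leakage, not a breach)."""
    return fcorr(fb_i, fb_j) <= zeta  # subset test on sets
```

For example, two FBs sharing only the value "flu" fail the strict check but pass the relax check whenever "flu" belongs to `zeta`.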
Generalization. Function 4 deals with the QI attributes and the BIDs that correspond to unique FBs in ST. Initially, the records are sorted by QIs. Then, every individual record is generalized to achieve a k-anonymous EC. For generalizing a tuple t ∈ T, t QI = {x 1 , x 2 , . . . , x n } is generalized to QI′ = ([y 1 − z 1 ], [y 2 − z 2 ], . . . , [y n − z n ]), where y i ≤ x i ≤ z i and y i , z i are the closest boundaries for x i . The Generalization( ) function in Function 4 shows the generalization process.
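The interval generalization described above can be sketched for numeric QIs as follows; `generalize_ec` is a hypothetical helper for one equivalence class, not the paper's Function 4:

```python
def generalize_ec(ec):
    """Generalize one equivalence class of numeric QI tuples: every value
    x_i is replaced by the closed interval [y_i, z_i], where y_i and z_i
    are the column-wise minimum and maximum over the EC."""
    columns = list(zip(*ec))                       # one sequence per QI
    bounds = [(min(c), max(c)) for c in columns]   # [y_i, z_i] per QI
    return [list(bounds) for _ in ec]              # identical for all tuples
```

After this step, all tuples of the EC carry the same generalized QI intervals, which is exactly the indistinguishability k-anonymity requires.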

Input:
  FB = {FB 1 , FB 2 , FB 3 , . . .}: all FBs in the whole dataset T
  r: source record to create an FB; it can be r u or r v , u ≠ v
  N: number of records in the actual dataset T
  c f : linkability control factor
  k: k-anonymity level (minimum FB size)
  ctg x : any category of SA in the category table (Table 8)
  ctg y : any category of SA in the category table (Table 8)
Output:
  FB: list of FBs, with k-anonymous records in each group
  // e.g., c f = 2 will select at most 2 records having the same physician
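A minimal sketch of the bucket-creation logic follows, assuming each record is represented as a dict with `bid`, `ctg` (sensitive category), and `delta` (physician) keys; `create_fbs` is an illustrative stand-in for Function 2, not the published code:

```python
def create_fbs(records, k=2, c_f=2):
    """Greedily build c-diverse FBs: each FB holds k records drawn from k
    distinct sensitive categories (ctg), and at most c_f records inside an
    FB may share the same delta (physician) value."""
    pool = list(records)
    fbs = []
    progress = True
    while len(pool) >= k and progress:
        fb = []
        for r in pool:
            # c-diversity: one record per sensitive category inside an FB
            if any(r["ctg"] == s["ctg"] for s in fb):
                continue
            # linkability control factor: cap repeats of the same delta
            if sum(s["delta"] == r["delta"] for s in fb) >= c_f:
                continue
            fb.append(r)
            if len(fb) == k:
                break
        if len(fb) == k:
            fbs.append(fb)
            for r in fb:
                pool.remove(r)
        else:
            progress = False  # leftovers cannot form a diverse FB
    return fbs, pool
```

With k = 2 and c f = 2 this mirrors the 2-anonymous, 2-diverse example above; any leftover records would then be handled by the relax strategy.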

Formal Modelling and Analysis for the Proposed (c, k)-Anonymization Algorithm.
In this section, we formally model the proposed (c, k)-anonymization algorithm to analyse and validate it against the adversary's background knowledge, i.e., the fcorr attack. We use an HLPN (Definition 7) to model the proposed system. The HLPN provides a mathematical representation for analysing the behaviour of the proposed system. For a system representation in HLPN, first the data types associated with the P (Places) are defined, and then the set of rules for the HLPN is defined. Figure 2 represents the HLPN for the proposed (c, k)-anonymization algorithm. Tables 12 and 13 show the data types and the mapping of data types onto the places involved in the proposed algorithm's HLPN. The algorithm begins with weight calculation, as in [22], to create the CtgT table (Table 8). So, the transitions in rules 1, 2, 3, 4, and 5 are the same for the proposed (c, k)-anonymization algorithm. The main goal is to nullify rule 7 and to create FBs that prevent the fcorr attack. In rule 8, the FBs are created. The function createFB( ) takes T (i.e., P (QI × MSF × PID)) and, based on category id C, accommodates the categorized MSAs into different c-diverse FBs.
In the next step, the FBs created in rule 8 are evaluated to prevent the fcorr attack. Rule 9 takes equations (4) and (5) as input to verify the nonexistence of fcorr knowledge between different L FBs. All FBs that satisfy either of the equations are stored at place SAFB via rule 10, while the other SAFBs are forwarded for refinement to satisfy the equations, as shown in rule 11.
The minimum requirement to prevent the fcorr attack is to satisfy at least equation (5). In rule 12, this requirement is fulfilled via the function RefineFB( ) for all the FALSE FBs from rule 9, and the results are stored at place SAFBr. Rule 13 simply combines all the secured FBs from places SAFB and SAFBr to create one ST that can withstand any bk attack, e.g., the fcorr attack.
The generalization of the QIs in T with respect to the sensitivity categorization and the BIDs obtained from ST (rule 13) is performed in rule 14. The GT and ST tables, created via rules 14 and 13, respectively, can thwart an adversary's presence and fcorr attacks and are ready to publish. Rule 15 shows the adversary's zero gain after the attack.
In rule 15, the adversary's bk, i.e., fcorr, consists of QI, MSA, and PID. To apply the fcorr attack, the adversary compares the SVs in the bk with the MSA buckets of the published tables and with the corresponding QIs to obtain matching MSAs that belong to a specific PID, as in the original table T. However, the adversary fails to do so, and the union over the bk yields an empty set, which shows that the bk could not identify a specific individual record.

Experimental Analysis
In this section, we evaluate our proposed anonymization algorithm, (c, k)-anonymization, and compare its performance with (p, k)-angelization. Both algorithms are implemented in Java on a machine running the Windows 10 operating system with 4 GB RAM and an Intel Core i5 2.39 GHz processor. The values plotted for the (p, k)-angelization algorithm have been obtained from that algorithm's program code executed on the same machine. The dataset, obtained from the Cleveland Clinic Foundation heart disease data, is available at https://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset consists of 75 attributes. The experiments are performed on two QI attributes (age and gender) and 12 sensitive attributes. These attributes are sufficient to evaluate the performance of the proposed algorithm. Table 14 shows the QIs and SAs used in the experiments, with attribute descriptions and the number of distinct values in each domain.
Different general-purpose posteriori measures for utility and privacy loss [9, 15, 18, 22] are available for generalization-based algorithms. In these approaches, the publisher does not know the recipient's analysis method. The publisher only evaluates the similarity between the original and anonymized data. Lower values of utility loss and privacy loss reflect the effectiveness of the developed algorithm. We measure utility loss using the normalized certainty penalty (NCP) [22] and query accuracy [22], and privacy loss by counting the vulnerable records. The execution times of both algorithms are also analysed, and a discussion is provided at the end.

Utility Loss.
For the utility loss measure, we analyse our algorithm using the following techniques. The total weighted certainty penalty for the whole table is obtained by summing, for each tuple, the penalty over all attributes and then adding the NCP of all tuples, as shown in the following equation:

NCP(T*) = Σ t∈T* NCP(t),

where NCP(t) = Σ i=1..q w i · NCP Q i (t) represents the penalty for a tuple, w i are the weights associated with the attributes, and T* is the final anonymized release. Figure 3 shows the NCP percentage for varying values of k-anonymity, keeping the number of attributes fixed (e.g., MSA = 6), for the (c, k)-anonymization and (p, k)-angelization algorithms. The penalty, i.e., the NCP% value, for (p, k)-angelization increases continuously as k-anonymity increases because the k groups have a one-to-one correspondence with the FBs in ST. This means that high diversity in the FBs may further affect the utility in GT. So, the splitting of table T (Table 1) into GT (Table 2) and ST (Table 3) has no benefit at all: the attributes in each table are still dependent on each other. In contrast, the proposed (c, k)-anonymization has a one-to-many correspondence between ST and GT, where the GB creates closer k-anonymous groups for the same k size class. The comparatively less generalized QIs have lower utility loss, with almost zero loss.
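Under the definition above, the NCP computation can be sketched as follows (numeric QIs only; the tuple representation, weights, and domain ranges are assumptions for illustration):

```python
def ncp(release, weights, domain_ranges):
    """Weighted NCP of an anonymized release: for every generalized tuple,
    sum w_i * (z_i - y_i) / |domain_i| over its QI intervals, then add up
    the per-tuple penalties over the whole table."""
    return sum(
        sum(w * (z - y) / rng
            for (y, z), w, rng in zip(tup, weights, domain_ranges))
        for tup in release
    )
```

Wider generalization intervals raise every term of the inner sum, which is why heavier generalization directly translates into a higher NCP%.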

Query Accuracy or Precision of Data Analysis Queries.
The purpose of anonymized data is to allow useful statistics to be extracted and to contribute to decision-making. Such utility of the anonymized release is measured through aggregate query answering, considering aggregate (e.g., COUNT) queries over the QI and SA attributes. The query error is a common metric for measuring the utility of an anonymized release:

query error = |Actual − Published| / Actual, (8)

where Actual and Published are the query answers on the original and the published tables, respectively. We compare the utility of (p, k)-angelization and (c, k)-anonymization by generating 1000 random queries and averaging their query errors. Figure 4 depicts the query error for (p, k)-angelization and (c, k)-anonymization comparatively. In Figure 4(a), the comparative increase in the query error for varying k size is because of new record insertions that increase the generalization range, while the proposed (c, k)-anonymization shows a comparatively low error rate because of the comparative decrease in QI generalization. The selectivity (θ) graph in Figure 4(b) depicts that, for high selectivity, more records are selected, so the difference used to calculate the query error via equation (8) automatically decreases.

Privacy Loss.
Identification of individual record respondents from an anonymized release is directly proportional to the privacy loss. Therefore, privacy loss is measured in terms of the number of single-record identifications obtained from the intersection of L FBs, considered as vulnerable records. We analyse the privacy loss for varying values of k and MSA. Figure 5 shows the experimental results for the (p, k)-angelization and (c, k)-anonymization algorithms. In Figure 5(a), for (p, k)-angelization, the number of vulnerable records increases with the k size because more records have a single SV obtained from the intersection among L FBs. Similarly, in Figure 5(b), as the number of MSAs increases, the chance of obtaining a single SV for each SA increases, and hence the number of vulnerable records increases for (p, k)-angelization. In both cases, i.e., Figures 5(a) and 5(b), the proposed (c, k)-anonymization yields no such single SV from the intersection of L FBs, because equations (4) and (5) are satisfied.
Therefore, no vulnerable records exist in the proposed (c, k)-anonymization algorithm.
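The vulnerable-record count used for this privacy-loss measure can be sketched as follows, with FBs modelled as sets of SVs and links given by a shared δ value (an illustrative sketch, not the experimental code):

```python
from itertools import combinations

def vulnerable_records(fbs, deltas):
    """Count pairs of linked FBs (same delta) whose intersection exposes
    exactly one sensitive value, i.e., one identifiable record."""
    return sum(
        1
        for i, j in combinations(range(len(fbs)), 2)
        if deltas[i] == deltas[j] and len(fbs[i] & fbs[j]) == 1
    )
```

A release satisfying equation (4) drives this count to zero, since every linked pair overlaps in either no SV or more than one.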

Execution Time Analysis.
The execution time of the proposed (c, k)-anonymization is higher than that of (p, k)-angelization.
This is because of the privacy requirement to satisfy equations (4) and (5). Although categorization and record selection are almost the same in both algorithms, satisfying equations (4) and (5) is time consuming and may require further time to merge records between L FBs. Therefore, the execution time increases at the cost of improved privacy. In Figure 6, as the number of MSAs increases, the execution time of the proposed (c, k)-anonymization algorithm increases comparatively, because the privacy equations must be satisfied for all the available attributes; however, the execution time remains small and acceptable.

Discussion.
The proposed algorithm has been analysed through utility loss and privacy loss metrics. The main goal of the algorithm is to prevent attribute disclosures that exploit background knowledge, and the algorithm achieves this goal. Its main priority is to prevent the fcorr attack. For this purpose, the algorithm creates a classification strategy in the form of CtgT to have c-diverse records in each FB and to prevent attribute disclosures. On the defensive side, the algorithm focuses on reducing the L between FBs using c f in order to reduce the number of possible attacks. In the next phase, each FB is enforced to fulfil equation (4), or in the worst case equation (5), which completely avoids the risk of the fcorr attack. Partitioning the attributes based on weights and enforcing these conditions reduces the privacy loss, and creating the closest k-anonymous QI classes also results in low information loss. The evaluation parameters for privacy and utility show that the proposed (c, k)-anonymization algorithm has minimal disclosure risk and information loss.

Conclusion
In this paper, privacy for MSAs of healthcare data has been addressed. Rather than adopting a predefined methodology similar to (p, k)-angelization, we proposed a novel algorithm, (c, k)-anonymization, for privacy and utility improvement with MSAs. The proposed algorithm consists of two major steps. First, it categorizes the MSAs based on calculated weights. Second, FBs are created and refined iteratively to implement privacy. To categorize the MSAs, the weight calculation is done as in [22]. The privacy risk is reduced by implementing one-to-many linking, which disassociates the buckets in GT and ST. The one-to-many linking not only reduces the probability of adversary attacks but also improves the utility in GT. The major step in the privacy implementation is to reduce the correlation between L FBs by satisfying equations (4) and (5). Such measures remove the main cause of privacy breaches, i.e., the correlation between SAs, and make the adversary unable to disclose the privacy of an intended individual. The experimental results show that, with respect to both utility and privacy, the proposed (c, k)-anonymization algorithm performs well compared with the (p, k)-angelization algorithm.
Preserving the same privacy, this work can be extended to 1 : M microdata [28]. Another challenging future direction is combining 1 : M with MSAs in a dynamic data publishing [6, 16] scenario; to the best of our knowledge, no literature exists for the latter scenario.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.