ANONYMISING THE SPARSE DATASET: A NEW PRIVACY PRESERVATION APPROACH WHILE PREDICTING DISEASES

*Corresponding author: Email: shyamalasusan@gmail.com, Tel: +91-9916913776


BACKGROUND STUDY
Health information systems have made it far easier to construct medical documents and make them available to researchers, public health organizations, and others with an interest in medical data. However, health care data generally contains a large amount of private patient information, and sharing such data directly is a serious threat to patient privacy. Hence practical techniques need to be developed. Data mining techniques analyze the medical dataset with the intention of enhancing patients' health and privacy. Most existing techniques are suited only to low dimensional medical datasets. The proposed methodology designs a model for representing a sparse high dimensional medical dataset with the aim of protecting the patient's privacy from an adversary and, additionally, predicting the risk level of the disease. In a sparse data set the non-zero values are scattered randomly across the entire data space. The challenge is therefore to cluster correlated patient records so that the risk level of the disease can be predicted before it occurs in patients, while preserving privacy. The first phase converts the sparse dataset into a band matrix using a Genetic algorithm combined with Cuckoo Search (GCS). This groups correlated patient records together and arranges them close to the diagonal. The next phase dissociates the patient's disease, which is the sensitive attribute (SA), from the parameters that determine the disease, namely the quasi-identifiers (QI). Finally, a density based clustering technique is applied to the underlying data to create anonymized groups that maintain privacy and predict the risk level of the disease. Empirical assessments on real health care data, the V.A. Medical Centre heart disease dataset, show the efficiency of this model in terms of information loss, utility and privacy.

KEY WORDS
Such techniques are needed for balancing healthcare data sharing and privacy preservation. In recent times, the relevance of Privacy-Preserving Data Mining (PPDM) methods has been analyzed and studied thoroughly by Matwin [9]. Certain techniques were shown to be capable of preventing the discriminatory use of data mining. A few techniques proposed that no stigmatized group should be subjected to more generalization of its data than the general population.
Recently, the k-anonymization technique has also been used to protect patient information [7][9][10]. k-anonymity has in turn been integrated with data mining techniques to protect against identity disclosure. Once privacy protection is in place, the data is prepared for analysis and the knowledge that supports decision making is extracted. This is complicated and incurs a loss in data utility [7][9]. Zhu and Peng [11] first described the current state of medical informatization in China and explained the need for cross-organizational information sharing. The particular problem known as the linking attack is also studied [10]. A corresponding k-anonymization model is then formulated together with suppression techniques for preserving privacy.
Machanavajjhala et al [6] proposed l-diversity which, unlike k-anonymity, takes into account the distribution of values of the sensitive attributes and the impact of background knowledge. l-diversity, a framework that provides a stronger privacy assurance, is applied to inpatient microdata gathered from adult dataset samples. Such methods protect the privacy of the user either by changing the quasi-identifier values or by adding noise. The identity of patients has to be protected while the patient data are shared.
Gal et al [12] introduced a privacy model that extends k-anonymity and l-diversity to multiple sensitive attributes. Here the patient data set is obtained from the Kentucky Cancer Registry. However, this model has difficulty distinguishing QIs from SAs.
t-closeness, an extension of k-anonymity studied by Soria-Comas et al [13], is closely related to ε-differential privacy. This approach improves the quality of anonymity and minimizes the loss of information for Patient Discharge Data.
Soria-Comas et al [14] introduced new refinements of k-anonymity, among which t-closeness provides the strictest privacy guarantees. Such refinements usually depend on generalization and suppression; the authors examine the benefits of microaggregation instead, and then propose and empirically evaluate multiple microaggregation algorithms for k-anonymous t-closeness.
Loukides and Gkoulalas-Divanis [15] presented a new mechanism for anonymizing data that meets the data publishers' utility requirements with low information loss. An accurate information loss measure and an efficient anonymization algorithm are used to minimize the information loss. Experiments on click-stream and medical data showed that the new technique permits more robust query answers than state-of-the-art techniques of comparable efficiency. Privacy is enforced by partitioning the patient record dataset into disjoint sets of patient records called anonymized groups. The probability of associating any transaction in a group G with a given item amounts to one half [16].
CBA [17] presented an efficient technique to preserve privacy with minimal information loss by modifying prognosis codes; it measured the data loss due to generalization and suppression and anonymised the patient's identity through the Clustering-Based Anonymizer (CBA). However, this clustering approach is not suitable for high dimensional data, suffers from high information loss and fails to provide l-diversity. Biomedical researchers can anonymise the data with improved utility, but this does not help with the high dimensionality problem. In the DRC [18] approach, utility metrics are specific to the data recipient's requirements. Utility loss is observed to increase as the data recipient's requirements vary.
The problem with the available methods is that a large portion of the initial terms is typically absent from the anonymized dataset, and every one of these methods is applicable only to low dimensional dataset samples. In Anatomy [10], the quasi-identifiers are separated from the sensitive values, which provides protection against attribute disclosure. However, because it publishes the quasi-identifiers directly, it also compromises data utility.

Privacy protection by disassociation is described as a complex problem in [10]. Only a few works [18-19] help preserve the original data, without the addition of noise, on the basis of the Anatomy [10] idea.

PROBLEM SPECIFICATION
The proposed approach designs a model for representing sparse high dimensional medical data with a view to shielding the patient's privacy and, in addition, helping patients learn the risk level of their disease. The heart disease dataset comprises a small number of SAs, which determine the risk stage of heart disease, as well as QI values, which are the parameters that identify the disease. As the data are distributed arbitrarily over the complete space, the task lies in efficiently grouping patients' records with similar QI values to predict the risk stage, and in anonymising the patient records that carry a disease together with records that have no sensitive value so as to preserve privacy. The distinct contributions of this work are as follows. The first phase minimises the bandwidth of the sparse patient record matrix using a Genetic algorithm with Cuckoo Search (GCS). This permutes the records so that adjacent rows are correlated and lie near the diagonal. Because it captures the correlation well, it maximizes utility and reduces the search space. Once the patients' records are regrouped as a band matrix, the subsequent step is the construction of anonymised groups of patients, both to guard privacy and to predict the disease. The anatomisation method is applied to the regrouped band to dissociate the QI values from the SA. This results in two tables, the QIT and the ST. Then the density based clustering algorithm anonymises the sensitive attributes in the ST with the QI values in the QIT by clustering the closest non-sensitive QI values with the SA. This helps protect privacy, and each group predicts the risk level of its patients. Empirical assessments on real health care data, the V.A. Medical Centre heart disease dataset, illustrate the efficiency of this model in terms of information loss, utility and privacy.
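The quantity that the first phase minimises can be illustrated with a small sketch. The code below only computes the bandwidth of a binary matrix and applies a given row and column permutation; it does not implement GCS itself, and the 4x4 matrix and the chosen permutation are hypothetical values picked for illustration.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def bandwidth(matrix: Matrix) -> int:
    """Largest |i - j| over all non-zero cells (0 for an all-zero matrix)."""
    return max(
        (abs(i - j)
         for i, row in enumerate(matrix)
         for j, value in enumerate(row) if value),
        default=0,
    )


def permute(matrix: Matrix, row_order: Sequence[int],
            col_order: Sequence[int]) -> Matrix:
    """Reorder rows and columns; row_order[k] is the old index of new row k."""
    return [[matrix[r][c] for c in col_order] for r in row_order]


# Hypothetical sparse matrix: rows are patient records, columns are attributes.
sparse = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

# A permutation that places the correlated records (0, 3) and (1, 2) side by side.
banded = permute(sparse, row_order=[0, 3, 1, 2], col_order=[0, 3, 1, 2])

print(bandwidth(sparse), bandwidth(banded))
```

An optimiser such as GCS would search over `row_order` and `col_order` for the permutation with the smallest bandwidth; here the permutation is simply supplied by hand.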

Case study
The V.A. Medical Centre database is a well-known heart disease data set used extensively by ML researchers working on heart disease. This database comprises 76 attributes; nonetheless, all published experiments refer to using a subset of 14 of them.
Table-1 below is a sample test report of the patients, used by the medical practitioner to determine the degree of heart disease before it occurs in patients. The attributes can be grouped into three kinds:
1. Identifier attributes: a minimal set of attributes that explicitly identify individual records.
2. Sensitive Attributes (SA): attributes that constitute a privacy breach if they are linked to a specific individual; they are shown on the right of the table.
3. Quasi Identifier (QI) attributes: the remaining, non-sensitive attributes, which are used for determining the level of heart disease and can also be used by an adversary to re-identify an individual patient record.
These experiments were carried out on the V.A. Medical Centre database to determine the probability of the presence of values 1 and 2 for the non-sensitive classes. In this work, only the 14 attributes considered by ML researchers are used: age (Age), sex (Sex), type of chest pain (Cpt), blood pressure at rest (Rbp), serum cholesterol (Sc), blood sugar at fasting (Fbs), electrocardiographic results at rest (Recg), maximum heart rate (Thalach), ST depression induced by exercise relative to rest (Old peak), exercise induced angina (Exang), slope of the peak exercise ST segment (Slope), number of major vessels (Ca), blood disorder (Thal), and the predicted attribute (Class). The patients' records are then reorganized as a band matrix by applying the GCS approach on the basis of their correlation, as depicted in Table-5. The subsequent step improves the quality of anonymisation through disassociation, which outputs two tables, namely QIT and ST, and thereby helps preserve privacy. QIT holds the patients' history records and ST holds the patients' level of heart disease. This is provided in Table 6.
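The disassociation step that produces QIT and ST can be sketched as follows. This is a minimal illustration in the spirit of Anatomy [10], not the authors' implementation: the function name `anatomize`, the record layout, the group assignment and the sample values are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[object]]


def anatomize(records: List[Record], qi_names: List[str], sa_name: str,
              group_of: Dict[str, int]) -> Tuple[List[Record], List[Record]]:
    """Split records into a quasi-identifier table and a sensitive table.

    QIT rows keep only the QI values plus a group id; ST rows keep, per
    group, each sensitive value and how often it occurs. The patient id
    and the record-level link between QI values and SA are dropped.
    """
    qit: List[Record] = []
    counts: Counter = Counter()
    for rec in records:
        gid = group_of[rec["id"]]
        row: Record = {name: rec[name] for name in qi_names}
        row["group"] = gid
        qit.append(row)
        if rec.get(sa_name) is not None:  # records without a diagnosis add nothing to ST
            counts[(gid, rec[sa_name])] += 1
    st = [{"group": g, sa_name: s, "count": c}
          for (g, s), c in sorted(counts.items())]
    return qit, st


# Hypothetical records and groups (values are illustrative, not from Table-1).
records = [
    {"id": "P1", "Age": 63, "Ca": 1, "Class": 2},
    {"id": "P2", "Age": 61, "Ca": 1, "Class": None},
    {"id": "P3", "Age": 45, "Ca": 0, "Class": 1},
    {"id": "P4", "Age": 47, "Ca": 0, "Class": 1},
]
groups = {"P1": 1, "P2": 1, "P3": 2, "P4": 2}

qit, st = anatomize(records, ["Age", "Ca"], "Class", groups)
print(qit)
print(st)
```

Because the two tables share only the group id, an adversary who locates a patient in QIT learns the distribution of sensitive values within that group rather than the patient's own value.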

Eight patients' treatment history records are considered for evaluation as depicted in

Table 6. Final published groups (columns: Patient id; QIT; ST; ca, fbs, thal, oldpeak, trestbps, chol, thalach, cp)

Then Modified Density Based Clustering (MDBSCAN) is performed over the QIT and ST in Table 6, which again yields an anonymised table by splitting the medical records into two clusters, as illustrated in Table-7 and Table-8. This protects privacy in that patients P7, P4, P6, P3 and P1 are grouped into one cluster and P2, P5 and P8 into the other. In a similar manner, the diseases are predicted: P7, P4, P6 and P2 are more susceptible to type 2, P2 and P5 may be vulnerable to type 1, and these patients are advised to consult clinicians before the disease worsens. The GCS approach combined with clustering proved efficient in terms of utility, information loss and execution time.
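How density based clustering splits records into groups can be seen in the following sketch. It implements plain DBSCAN, not the modified MDBSCAN used in this work, whose changes are not detailed here; the two-dimensional points, `eps` and `min_pts` are hypothetical values chosen so that eight records fall into two clusters.

```python
from math import dist
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float,
           min_pts: int) -> List[int]:
    """Plain DBSCAN: return one cluster label per point, or NOISE."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


# Hypothetical 2-D projections of the QI values of eight patient records.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Each resulting cluster plays the role of an anonymized group: records in the same cluster have similar QI values, so the sensitive values observed in the cluster suggest the risk level of its undiagnosed members.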

Notation and description
The goal of the paper is to anonymize the V.A. Medical Centre heart disease data, which comprises a set of patient records T, n = |T|. Each patient record contains attributes drawn from an attribute set A, d = |A|. The data is represented as a binary matrix A with n rows and d columns. For example, Table-1 has 16 patient records with 14 attributes, which is discussed in more detail in the section that follows. For instance, the matrix for the heart disease data shown in Table-4 is expressed as Equation (2). In the set of attributes for the patient records, some are privacy sensitive, such as whether the heart disease is in a severe or medium condition, as seen in the running example [Table-1].

Definition 1 (Sensitive Attributes (SA)):
The set S of attributes that pose a privacy threat when associated with particular patient records constitutes the sensitive attributes, m = |S|. The rest of the attributes in the table, referred to as Quasi Identifiers (QIs), are non-sensitive, which means that their association with a specific individual is not harmful. However, these harmless items can be used by an adversary to re-identify individual patient records, as shown in the introductory section. These items are referred to as QID attributes.

Definition 2 (Quasi Identifier (QI) attributes):
The collection of attributes in A that an attacker can gain knowledge of in order to re-identify individual patient records constitutes the set of QIs, Q. Typically, any attribute that is non-sensitive is a QI, therefore Q = A \ S. The records consisting of attributes from SA

EXPERIMENTAL EVALUATION
The heart disease dataset available at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/heartdisease/processed.va.data is used to evaluate the GCS technique with MDBSCAN. The data set contains 76 raw attributes. However, most published experiments refer to only 14 of them for classification.
The data set comprises 200 rows. The efficiency of this methodology is checked with respect to utility and effectiveness, and it is shown that this method performs better than existing techniques such as CBA [17] and DRC [18]. The reconstruction error is measured while varying p, the degree of privacy, m, the number of sensitive items, which are selected arbitrarily, and r, the number of QI values. More than 100 group-by queries are created by arbitrarily choosing q1, q2, q3, ..., qn and s1, s2, s3, ..., sm. The average reconstruction error is determined and the clustering accuracy is measured. The experimental result is illustrated in Figure-1. The clustering algorithm preserves the correlations between the patient records better, which yields better utility.

CONCLUSION
The proposed methodology designs a model for representing the sparse high dimensional medical dataset obtained from the V.A. Medical Centre heart disease dataset, with the aim of protecting the patient's privacy from an adversary and, additionally, predicting the risk level of the disease. The GCS algorithm converts the sparse high-dimensional heart disease data set into a band matrix, which limits the search space and maximizes the utility.
The compared techniques are the Clustering-Based Anonymizer (CBA) [17] and the Data Recipient Centered (DRC) [18] approach.

Utility: The utility of the anonymized data set is measured by calculating the distance between the original and the estimated pdf over all cells. KL-divergence [24] is used as the metric; it quantifies the amount of information lost through data anonymization:

$$\mathrm{KL\_Divergence}(Act, Est) = \sum_{\forall\, \text{cell } C} Act_C^S \, \log \frac{Act_C^S}{Est_C^S} \qquad (5)$$

The actual pdf value of the Sensitive Attribute (SA) S for a cell C is expressed by

$$Act_C^S = \frac{\text{Occurrences of } S \text{ in } C}{\text{Occurrences of } S \text{ in } PR} \qquad (6)$$

The estimated pdf Est_C^S is calculated in a similar way, except that the numerator is the quantity in Equation (7) summed over all groups G intersecting cell C:

$$\frac{a \cdot b}{|G|} \qquad (7)$$

Here a is the number of occurrences of the item s in G, and b is the number of transactions in G that match the QIT selection predicate of the query. For each (p, r) setting, 100 group-by queries are generated by randomly choosing the SA and qi1, ..., qir, and from these the average reconstruction error is computed. The reconstruction error is measured with the following query:

    SELECT COUNT(*) FROM T
    WHERE (Sensitive Item sa is present)
      AND (qi1 = val1) ∧ ... ∧ (qir = valr)
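The utility metric can be made concrete with a short sketch. It assumes the natural logarithm (the base is not stated here), represents each pdf as a dictionary from cell to probability, and uses hypothetical cell names and values; the helper names `kl_divergence` and `estimated_count` are likewise illustrative.

```python
from math import log
from typing import Dict, Hashable


def kl_divergence(actual: Dict[Hashable, float],
                  estimated: Dict[Hashable, float]) -> float:
    """Sum over cells of Act * log(Act / Est), as in Equation (5)."""
    total = 0.0
    for cell, act in actual.items():
        if act == 0:
            continue  # a cell with Act = 0 contributes nothing
        est = estimated.get(cell, 0.0)
        if est == 0:
            return float("inf")  # Est misses mass that Act has
        total += act * log(act / est)
    return total


def estimated_count(a: int, b: int, group_size: int) -> float:
    """Estimated query answer contributed by one group: a * b / |G|."""
    return a * b / group_size


# Hypothetical pdfs of one sensitive item over two cells.
actual = {"cell_1": 0.5, "cell_2": 0.5}
estimated = {"cell_1": 0.25, "cell_2": 0.75}

print(kl_divergence(actual, actual))     # identical pdfs give zero loss
print(kl_divergence(actual, estimated))  # a mismatch gives a positive loss
print(estimated_count(a=2, b=3, group_size=6))
```

A divergence of zero means the anonymized groups reproduce the original distribution of the sensitive item exactly; larger values indicate greater reconstruction error.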

Figure- 2
Figure-2 illustrates the execution time of CBA, DRC and MDBSCAN for p = 20, which is the factor with the most influence on runtime performance. MDBSCAN is time-effective, with a completion time of at most 16 sec for the heart disease dataset. The GCS execution incurs a more considerable overhead, requiring 185 sec for the heart disease dataset. Nevertheless, this overhead is incurred only once, for the input transformation, regardless of the value of p. The execution time of the new MDBSCAN with GCS is lower than that of the other clustering methods, because the high dimensionality problem of MDBSCAN is solved by using GCS. DRC requires execution times in the range of 300 sec for the heart disease dataset. The efficiency with respect to QI is evaluated in terms of clustering accuracy. For this, p = 10 is fixed and the number of QIs is varied in {2, 4, 6, 8}. Figure-3 illustrates the results on learning a QI attribute. It can be observed that clustering accuracy decreases only slightly as the number of QIs increases, since the most correlated attributes remain in the same column. In every case, MDBSCAN with GCS shows better clustering accuracy than the other clustering techniques.

Figure- 4
Figure-4 indicates the utility loss vs. privacy loss for various privacy requirements (p = 10 and p = 20). The effects on both privacy and utility are measured. If a different measure of privacy (or utility) is selected, the figure may look different. As the figure shows, GCS performs better across the privacy requirements considered. The results indicate that GCS yields considerably better data utility than RCM and RCM with greedy (RCMG), as the GCS method also takes unsymmetric band matrix reconstruction into consideration.

Table-2 and Table-3 indicate the ranges for diagnosing the diseases. In Table-2 only two patients' diseases (P1 and P8) are diagnosed. The challenge now is to predict the heart risk level of the other patients and at the same time to preserve the privacy of the diagnosed patients (P1 and P8) by anonymising the table. The initial phase removes the attributes Recg, Exang and Slope, as they have no values in Table-1. Table-4 is obtained by replacing each value with 1 if it falls within the range specified in Table-3 and with 0 if it does not; where no value is available the cell is left blank.
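The conversion of attribute values into the binary entries of Table-4 can be sketched as below. The ranges and the sample record are placeholders chosen for illustration and are not the values from Table-3; a missing value is represented by `None`, standing in for a blank cell.

```python
from typing import Dict, Optional, Tuple

Ranges = Dict[str, Tuple[float, float]]


def binarize(record: Dict[str, Optional[float]],
             ranges: Ranges) -> Dict[str, Optional[int]]:
    """Map each attribute to 1 (inside its range), 0 (outside) or None (blank)."""
    result: Dict[str, Optional[int]] = {}
    for attr, (low, high) in ranges.items():
        value = record.get(attr)
        result[attr] = None if value is None else int(low <= value <= high)
    return result


# Placeholder diagnostic ranges; the real ones are those listed in Table-3.
ranges: Ranges = {"Rbp": (140, 200), "Sc": (240, 600), "Thalach": (60, 100)}

# Hypothetical patient record with one missing measurement.
patient = {"Rbp": 150, "Sc": 200, "Thalach": None}

print(binarize(patient, ranges))
```

Applying this mapping to every record yields the sparse binary matrix that the GCS phase then reorders into a band matrix.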