An Insight into Privacy Preserving Data Mining Methods

Abstract: Recent advances in information, communications, data mining, and security technologies have given rise to a new area of research known as Privacy Preserving Data Mining (PPDM). Several data mining algorithms incorporating privacy preserving mechanisms have been developed that allow one to extract relevant knowledge from large amounts of data while hiding sensitive data or information from disclosure or inference. PPDM is a new endeavor, and several research questions are often asked. For instance: (1) How should the performance of these algorithms be measured? (2) How effective are these algorithms at preserving privacy? (3) Will they affect the accuracy of data mining results? (4) Which one better protects sensitive information? To help answer these questions, we conduct an extensive review of the literature. We present a classification scheme, adapted from earlier studies, to guide the review process. Finally, we share directions for future research.


I. INTRODUCTION
Increasing network complexity, greater access to and sharing of information, and a growing emphasis on the Internet have made information security and privacy a major concern for individuals and organizations. Data mining is a well-known technology for automatically and intelligently extracting knowledge from large amounts of data. Such a process, however, can also disclose sensitive information about individuals, compromising the individual's right to privacy. Moreover, data mining techniques can reveal critical information about business transactions, compromising free competition in a business setting [Bertino et al., 2005]. Privacy preserving data mining (PPDM) is a new area of research in data mining. Its ultimate goal is to develop efficient algorithms that allow one to extract relevant knowledge from large amounts of data while preventing sensitive information from disclosure or inference.
PPDM research usually takes one of three philosophical approaches: (1) data hiding, in which sensitive raw data such as identifiers, names, and addresses are altered, blocked, or trimmed from the original database so that users of the data cannot compromise another person's privacy; (2) rule hiding, in which sensitive knowledge extracted by the data mining process is excluded from use, because confidential information may be derived from the released knowledge (this problem is also commonly called the "database inference problem"); and (3) Secure Multiparty Computation (SMC), in which distributed data are encrypted before being released or shared for computation, so that no party learns anything except its own inputs and the results.
PPDM is a fast-growing research area. Given the number of different algorithms that have been developed in recent years, there is an emerging need to synthesize the literature in order to understand the nature of the problem, identify potential research issues, standardize the new research area, and evaluate the relative performance of different approaches [Verykios et al., 2004; Bertino et al., 2005]. The main purpose of this study is to review the state of the art in current PPDM research in order to better understand existing algorithms, answer the research questions, and move the field forward.

II. CLASSIFICATION FRAMEWORK FOR PPDM
In this paper, we propose to consolidate and simplify the taxonomy introduced by Bertino et al. (2005). We propose to reduce the PPDM taxonomy to four levels: data distribution, purposes of hiding, data mining algorithms, and privacy preserving techniques (see Figure 1). The difficulty of applying PPDM algorithms to a distributed database can be attributed to two reasons: first, the data owners have privacy concerns, so they may not be willing to release their own data to others; second, even if they are willing to share data for data mining, the communication cost between the sites is too high. In today's global digital environment, data are often stored at different sites; thus, more attention and research should be focused on distributed PPDM algorithms.

Hiding Purposes
PPDM algorithms can be further classified into two types, data hiding and rule hiding, according to the purpose of hiding. Data hiding refers to cases where sensitive data in the original database, such as identity, name, and address, that can be linked directly or indirectly to an individual person are hidden. In contrast, in rule hiding, we remove the sensitive knowledge derived from the original database after applying data mining algorithms.
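The distinction between the two purposes can be illustrated with a minimal Python sketch. The record fields, the rule representation, and the helper names `hide_data` and `hide_rules` are hypothetical and chosen only for illustration; they are not taken from any of the surveyed algorithms.

```python
# Hypothetical illustration of the two hiding purposes; not a surveyed algorithm.

IDENTIFYING_FIELDS = {"id", "name", "address"}  # assumed set of identifying attributes


def hide_data(records, identifying=IDENTIFYING_FIELDS):
    """Data hiding: drop identifying attributes from every record before release."""
    return [{k: v for k, v in rec.items() if k not in identifying} for rec in records]


def hide_rules(rules, sensitive):
    """Rule hiding: withhold mined rules that the data owner marks as sensitive."""
    return [rule for rule in rules if rule not in sensitive]


records = [
    {"id": 1, "name": "Alice", "address": "12 Elm St", "age": 34, "diagnosis": "flu"},
    {"id": 2, "name": "Bob", "address": "9 Oak Ave", "age": 51, "diagnosis": "asthma"},
]

# Each rule is an (antecedent, consequent) pair of item sets.
rules = [
    (frozenset({"bread"}), frozenset({"butter"})),
    (frozenset({"age>50"}), frozenset({"asthma"})),
]
sensitive = {(frozenset({"age>50"}), frozenset({"asthma"}))}

released_records = hide_data(records)
released_rules = hide_rules(rules, sensitive)
```

In this sketch data hiding acts on the raw records, whereas rule hiding acts on the output of the mining step. Removing direct identifiers alone may still leave records linkable indirectly through the remaining attributes, which is one reason the techniques in the next subsection exist.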

Privacy Preservation Technique
Four techniques (sanitization, blocking, distortion, and generalization) have been used to hide data items in a centralized data distribution. Data sanitization removes or modifies items in a database to reduce the support of some frequent item sets so that sensitive patterns cannot be mined. The blocking approach replaces certain attribute values in the data with a question mark. Because the blocked values are unknown, the support and confidence of a rule are no longer single numbers but ranges, and the minimum support and confidence thresholds are accordingly each replaced by a minimum interval. As long as the support and/or the confidence of a sensitive rule lies below the middle of these two ranges, the confidentiality of the data is expected to be protected. Data distortion protects the privacy of individual data records by modifying the original data; the original distribution of the data is then reconstructed from the randomized data. These techniques aim to design distortion methods after which the true value of any individual record is difficult to ascertain, but "global" properties of the data remain largely unchanged. Generalization transforms and replaces each record value with a corresponding generalized value.
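Two of these techniques can be sketched in a few lines of Python. The first function computes the range of possible support values for an item set once some values have been blocked with "?". The second pair of functions uses a randomized-response style of distortion, which is an assumption made here for illustration and is only one of several possible distortion schemes. All function names, the toy data, and the truth probability of 0.75 are hypothetical.

```python
import random


def support_interval(transactions, itemset):
    """Blocking: return (min_support, max_support) of `itemset` when values may be '?'.

    Each transaction maps item -> 1 (present), 0 (absent) or '?' (blocked).
    """
    n = len(transactions)
    certain = possible = 0
    for t in transactions:
        values = [t[item] for item in itemset]
        if all(v == 1 for v in values):
            certain += 1   # supports the item set whatever the '?' values hide
        if all(v in (1, "?") for v in values):
            possible += 1  # would support it if every '?' were actually 1
    return certain / n, possible / n


def randomize(bits, p_truth, rng):
    """Distortion: report each bit truthfully with probability p_truth, else flipped."""
    return [b if rng.random() < p_truth else 1 - b for b in bits]


def estimate_proportion(reported, p_truth):
    """Reconstruct the original proportion of 1s from the randomized bits."""
    observed = sum(reported) / len(reported)
    # observed = p_truth * pi + (1 - p_truth) * (1 - pi); solve for pi.
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


blocked_db = [
    {"a": 1, "b": 1},
    {"a": 1, "b": "?"},
    {"a": 0, "b": 1},
    {"a": "?", "b": "?"},
]
interval = support_interval(blocked_db, ("a", "b"))

rng = random.Random(0)                  # fixed seed so the sketch is repeatable
true_bits = [1] * 6000 + [0] * 14000    # true proportion of 1s is 0.3
reported = randomize(true_bits, 0.75, rng)
estimate = estimate_proportion(reported, 0.75)
```

In the blocked toy database, one transaction certainly supports the item set and three could support it, so the support is only known to lie in an interval. In the distortion part, no individual reported bit can be trusted, yet the estimated proportion stays close to the true one, which mirrors the point that "global" properties survive while individual values are obscured.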
The privacy preservation techniques used in a distributed database are mainly based on cryptography. SMC algorithms deal with computing any function on any input in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's own input and output. Data distortion is the most popular method used in hiding data [Du & Zhan, 2003; Evfimievski et al., 2003, 2004].
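The SMC setting described above can be illustrated with a secure sum, a common building block in which several sites learn the total of their private values without revealing the values themselves. The sketch below uses additive secret sharing and simulates all parties in one process; the choice of this protocol, the modulus, and the names `make_shares` and `secure_sum` are assumptions made for illustration, not a method attributed to the cited works. It assumes honest-but-curious parties and non-negative inputs whose total is smaller than the modulus.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n_parties random shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares


def secure_sum(private_inputs, modulus=MODULUS):
    """Simulate all parties in one process and return the sum of their inputs."""
    n = len(private_inputs)
    # Row i holds the shares created by party i; entry j is sent to party j.
    share_matrix = [make_shares(x, n, modulus) for x in private_inputs]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in share_matrix) % modulus for j in range(n)]
    return sum(partial_sums) % modulus


site_counts = [120, 45, 310]  # e.g. local support counts held by three sites
total = secure_sum(site_counts)
```

Each individual share is uniformly random, so a party that sees only the shares sent to it learns nothing about another site's count beyond what the final total already implies. In a real deployment the shares would travel over the network, which is where the communication cost noted earlier arises.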

III. SUGGESTIONS FOR FUTURE WORK
There are many future research directions for privacy preserving data mining. First, present studies tend to use different terminology to describe similar or related practices. For instance, data modification, data perturbation, data sanitization, data hiding, and preprocessing have all been named as methods for preserving privacy; however, all of them refer to some technique for modifying the original data so that private data and knowledge remain private even after the mining process. Lacking a common language for discussion causes misunderstanding and slows research progress. Therefore, there is an emerging need to standardize the terminology and practice of PPDM.
Second, most prior PPDM algorithms were developed for use with data stored in a centralized database. However, in today's global digital environment, data are often stored at different sites. With recent advances in information and communication technologies, distributed PPDM methodology may have wider application, especially in medical, health care, banking, military, and supply chain scenarios.
Third, data hiding techniques have been the dominant methods for protecting the privacy of individual information. However, those algorithms do not pay full attention to data mining results, which may lead to leakage of sensitive rules. While some algorithms are designed to hide rules that carry sensitive information, they may degrade the accuracy of other, non-sensitive rules. Thus, further investigation focusing on combining data and rule hiding may be beneficial, specifically when taking into account the interaction between sensitive and non-sensitive rules.
Fourth, although many machine learning methods have been used for classification, clustering, and other data mining tasks (e.g., diagnosis, prediction, optimization), currently only the association rules method has been predominantly used for classification. It would be interesting to see how current techniques and practices can be extended to other problem domains or data mining tasks. Furthermore, it is important to find a privacy preserving technique that is independent of the data mining task, so that after the technique has been applied a database can be released without being constrained to the original task.
Finally, identifying suitable evaluation criteria and developing benchmarks for algorithm selection are two important aspects of PPDM research. A framework for evaluating selected association rule hiding algorithms has been proposed by Bertino et al. (2005). Future research can consider testing the proposed evaluation framework on other privacy preservation algorithms, such as data distortion or cryptographic methods.

IV. CONCLUSIONS
PPDM has recently emerged as a new field of study. As a newcomer, PPDM may offer wide application prospects, but at the same time it raises many open issues. In this study, we conduct a comprehensive survey of 29 prior studies to determine the current status of PPDM development. We propose a generic PPDM framework and a simplified taxonomy to help understand the problem and explore possible research issues. We also examine the strengths and weaknesses of different privacy preserving techniques and summarize general principles from earlier research to guide the selection of PPDM algorithms. As part of future work, we plan to apply the proposed evaluation framework to formally test a complete spectrum of PPDM algorithms.