Elsevier

Knowledge-Based Systems, Volume 43, May 2013, Pages 82-94

Anonymizing classification data using rough set theory

https://doi.org/10.1016/j.knosys.2013.01.007

Abstract

Identity disclosure is one of the most serious privacy concerns in many data mining applications. A well-known privacy model for protecting identity disclosure is k-anonymity. The main goal of anonymizing classification data is to protect individual privacy while maintaining the utility of the data in building classification models. In this paper, we present an approach based on rough sets for measuring the data quality and guiding the process of anonymization operations. First, we make use of the attribute reduction theory of rough sets and introduce the conditional entropy to measure the classification data quality of anonymized datasets. Then, we extend conditional entropy under single-level granulation to hierarchical conditional entropy under multi-level granulation, and study its properties by dynamically coarsening and refining attribute values. Guided by these properties, we develop an efficient search metric and present a novel algorithm for achieving k-anonymity, Hierarchical Conditional Entropy-based Top-Down Refinement (HCE-TDR), which combines rough set theory and attribute value taxonomies. Theoretical analysis and experiments on real world datasets show that our algorithm is efficient and improves data utility.

Introduction

Identity disclosure is one of the most serious privacy concerns in many data mining applications. Some organizations, such as hospitals and insurance companies, have collected large amounts of microdata, i.e., data published in raw, non-aggregated form. Microdata can provide tremendous opportunities for knowledge-based decision making. However, these organizations are reluctant to publish the data because of privacy threats. One important type of privacy attack is the re-identification of individuals by joining data from multiple public tables; such an attack is called a linking attack. For example, according to [27], more than 85% of the population of the United States can be uniquely identified using their gender, zipcode, and date of birth. The minimal set of attributes that can be joined with external information to re-identify individual records is called the quasi-identifier (QI).

To prevent linking attacks through QI, k-anonymity was proposed [27]. A table satisfies k-anonymity if each record in the table is indistinguishable from at least (k − 1) other records with respect to certain QI attributes; such a table is called a k-anonymous table. Consequently, the probability of identifying an individual from a specific record through QI is at most 1/k. This ensures that individuals cannot be uniquely identified by linking attacks. For example, Fig. 1 illustrates how k-anonymization hinders linking attacks. Joining the original table in Fig. 1a with the public data in Fig. 1c would reveal that Alice’s income is high and Bob’s is low. Fig. 1b shows a 3-anonymous table that generalizes QI = {Job, Age, Sex} from the original table using the attribute value taxonomies in Fig. 2. The 3-anonymous table has two distinct groups on QI, “White_collar, [40, 99), Male” and “Blue_collar, [1, 40), Female”. Because each group contains at least 3 records, the table is 3-anonymous. If we link the records in Fig. 1b to the records in Fig. 1c through the QI, each record is linked to either no record or at least 3 records in Fig. 1c. Therefore, the outcome of joining the 3-anonymous table with the public data is ambiguous.
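The definition above can be checked mechanically by counting the QI equivalence classes of a table. The following sketch (the row layout and helper name are illustrative assumptions, not from the paper) verifies k-anonymity on a toy table shaped like the 3-anonymous table of Fig. 1b:

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """A table is k-anonymous iff every QI group has at least k records."""
    groups = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(count >= k for count in groups.values())

# Toy table with QI = (Job, Age, Sex); the last column is the class label.
table = [
    ("White_collar", "[40, 99)", "Male",   "high"),
    ("White_collar", "[40, 99)", "Male",   "low"),
    ("White_collar", "[40, 99)", "Male",   "high"),
    ("Blue_collar",  "[1, 40)",  "Female", "low"),
    ("Blue_collar",  "[1, 40)",  "Female", "low"),
    ("Blue_collar",  "[1, 40)",  "Female", "high"),
]
print(is_k_anonymous(table, qi_indices=(0, 1, 2), k=3))  # True
print(is_k_anonymous(table, qi_indices=(0, 1, 2), k=4))  # False
```

With two QI groups of size 3 each, the table is 3-anonymous but not 4-anonymous.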

Data in their original form often contain sensitive information about individuals. However, the data typically do not satisfy the k-anonymity requirement, and publishing such data would violate individual privacy. A task of the utmost importance is to modify the data so that the modified data remain practically useful for data mining while individual privacy is preserved. For example, drug companies and researchers may be interested in patient records for drug development. Data mining harnesses a large amount of patient data available for extracting knowledge crucial to the progress of drug research. Such additional uses of data are important and should certainly be supported. However, privacy-sensitive information related to individual patients should be protected as well. To address the conflicting requirements of assuring privacy while supporting legitimate uses, the original data should be modified by applying some anonymization methods while ensuring that the anonymized data can be effectively used for data mining.

Generalization and suppression are popular anonymization methods. In generalization, quasi-identifier values are replaced with values that are less specific but semantically consistent according to given attribute value taxonomies. For example, in Fig. 2, the parent node, White_collar, is more general than its child nodes, Adm_clerical and Sales. The root node, ANY_Job, represents the most general value of Job. If the information “Job = Sales, Age = [35, 40), Sex = Male” is too specific in a table, e.g., fewer than k men of age [35, 40) work in sales, then the probability of linking these people to a specific record through {Job, Age, Sex} is greater than 1/k. In this case, with the help of additional information, an attacker could potentially identify these individual records uniquely from the data table. A larger value of k results in greater generalization and better privacy protection. If the record is generalized to “Job = White_collar, Age = [1, 40), Sex = Male”, more than k people will share the same person-identifiable information in the data, and therefore their privacy is better preserved. The most generalized form of a record is “ANY_Job, ANY_Age, ANY_Sex”. When values are generalized to the highest level, the generalization is called suppression.
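Generalization along such a taxonomy can be sketched as a climb from a leaf value to the ancestor lying on a chosen "cut" of the tree. The parent map below follows Fig. 2's Job taxonomy; the blue-collar leaves (Craft_repair, Handlers) are hypothetical names added for illustration:

```python
# Hypothetical child -> parent map for the Job taxonomy of Fig. 2.
PARENT = {
    "Adm_clerical": "White_collar",
    "Sales": "White_collar",
    "Craft_repair": "Blue_collar",   # assumed leaf, not named in the paper
    "Handlers": "Blue_collar",       # assumed leaf, not named in the paper
    "White_collar": "ANY_Job",
    "Blue_collar": "ANY_Job",
}

def generalize(value, cut):
    """Climb the taxonomy until the value lies on the chosen cut."""
    while value not in cut:
        value = PARENT[value]  # the root ANY_Job is reachable from any leaf
    return value

cut = {"White_collar", "Blue_collar"}
print(generalize("Sales", cut))         # White_collar
print(generalize("Craft_repair", cut))  # Blue_collar
print(generalize("Sales", {"ANY_Job"})) # ANY_Job (full suppression of Job)
```

Choosing the cut {ANY_Job} corresponds to suppression; lower cuts retain more specific values.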

Meanwhile, information loss is an unfortunate consequence of anonymization. To make the anonymous data as useful as possible, the information loss must be minimized. Information metrics for measuring data usefulness can be categorized into data metrics and search metrics. A data metric measures the data quality of the entire anonymous table with respect to the data quality of the raw table. The problem of finding the optimal k-anonymous table using generalization has been proven to be NP-hard [1]; therefore, heuristic algorithms are needed. A search metric is used to guide each step of the anonymization operations toward an anonymous table with maximum information or minimum distortion.
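As a concrete illustration of a data metric, the discernibility metric [1] charges each record a penalty equal to the size of the QI equivalence class it falls into, so coarser generalizations incur larger penalties. A minimal sketch (the row layout is an assumption for illustration):

```python
from collections import Counter

def discernibility_penalty(records, qi_indices):
    """Discernibility metric: each record is charged the size of its
    QI equivalence class, so a group of size s contributes s * s."""
    groups = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return sum(size * size for size in groups.values())

# Four records in two QI groups of size 2 -> penalty 2*2 + 2*2 = 8.
rows = [("White_collar", "[1, 40)"), ("White_collar", "[1, 40)"),
        ("Blue_collar", "[40, 99)"), ("Blue_collar", "[40, 99)")]
print(discernibility_penalty(rows, (0, 1)))  # 8
```

At the extremes, a table of n fully distinct records scores n, while a fully suppressed table scores n²; note the metric is task-agnostic, which is exactly the limitation discussed below for classification.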

When the anonymous data are used to build classification models, protecting individual privacy while ensuring that the data remain useful for building classification models is a challenge. Some data metrics, such as the minimal distortion [27] and the discernibility metric [1], have been considered for achieving k-anonymity. However, these data metrics do not consider any particular data mining task; as a result, the anonymous tables they produce might not be suitable for every classification algorithm. Much research has been conducted to evaluate the data quality of anonymous tables for classification [12], [16], [30], but these efforts have not considered a search metric. For classification tasks, a more relevant approach is to search for a useful anonymization operation according to certain heuristics: an anonymization operation is ranked high if it preserves useful classification information. A search metric can be adopted to guide each step of the anonymization operations using various anonymization algorithms, such as a greedy algorithm or a hill climbing optimization algorithm. Because the anonymous table identified by a search metric is eventually evaluated by a data metric, the two types of metrics usually share the same principle of measuring data quality. Some past research [8], [31] has proposed a search metric based on the tradeoff between information gain and anonymity loss. However, this information gain metric is defined only for a single attribute in a single equivalence class and may not retain useful classification information.

In rough set theory, attribute reduction seeks to find a minimum subset of condition attributes that has the same classification ability as the set of all condition attributes with respect to the decision attributes [20], [24], [25]. The classification ability of all condition attributes with respect to the decision attributes can be measured by conditional entropy [29]. An anonymous table does not have to be close to the original table at the value level, but a classification model built on the anonymous table should be as good as a classification model built on the original table. For classification tasks, the data quality of a table can be considered as the classification ability. Therefore, we apply the conditional entropy to measure the classification ability of an anonymous table.
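Concretely, the conditional entropy H(D | C) of a table can be computed as the group-size-weighted entropy of the class attribute D within each equivalence class induced by the condition attributes C. A minimal sketch, with an assumed in-memory row layout:

```python
import math
from collections import Counter

def conditional_entropy(records, cond_indices, class_index):
    """H(D | C) = sum_j (|E_j| / |U|) * H(D | E_j), where the E_j are the
    equivalence classes induced by the condition attributes C."""
    n = len(records)
    groups = {}
    for r in records:
        key = tuple(r[i] for i in cond_indices)
        groups.setdefault(key, []).append(r[class_index])
    h = 0.0
    for labels in groups.values():
        weight = len(labels) / n
        for count in Counter(labels).values():
            p = count / len(labels)
            h -= weight * p * math.log2(p)
    return h

# Job determines the class in the first group but not in the second:
rows = [("White_collar", "high"), ("White_collar", "high"),
        ("Blue_collar", "low"), ("Blue_collar", "high")]
print(conditional_entropy(rows, (0,), 1))  # 0.5
```

H(D | C) = 0 means the condition attributes classify perfectly; generalizing attribute values can only merge equivalence classes and hence can only increase (never decrease) this entropy, which is why it serves as a measure of the classification ability lost through anonymization.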

In this paper, we aim at releasing a k-anonymous table for modeling classification of the form S(P ∪ Q, D), where P ∪ Q is a finite set of condition attributes; P is a subset of condition attributes, such as sensitive attributes and neutral attributes, that must be retained for modeling classification; the quasi-identifier Q = {a1, a2, …, am} is a subset of condition attributes that could potentially identify individuals in the table; P ∩ Q = ∅; and D is a class attribute. The attributes in the quasi-identifier are associated with a set of attribute value taxonomies {AVT(a1), AVT(a2), …, AVT(am)} that defines how values are generalized. These attribute value taxonomies are created manually, and the process requires prior knowledge of the problem domain. Our method starts from a table containing the most general values of the taxonomy trees for all the attributes in Q and iteratively tries to refine a general value into its child values. Each refinement operation increases the information of the table for classification and decreases its anonymity. Through the attribute reduction theory of rough sets, we use conditional entropy to compare the data quality of an anonymous table with that of the original table. Meanwhile, using an information gain metric defined by the reduction in conditional entropy over the set of all condition attributes in the table, we introduce a new search metric that prefers the refinement maximizing the information gained per unit of anonymity lost. The search metric guides the top-down refinement process for building anonymous tables until any further refinement would violate the k-anonymity constraint. We apply the k-anonymity method to the original training samples to obtain new anonymous training samples. Additionally, we obtain a global cut that is used to transform the test samples into new anonymous test samples.
We also experimentally evaluate the impact of anonymization by building a classifier from the anonymous training samples and observing the classifier’s performance on the anonymous testing samples.
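The refinement loop described above can be sketched for a single QI attribute as follows. The taxonomy, row layout, and anonymity-loss measure (the drop in minimum group size) are simplifying assumptions made for illustration; the paper's HCE-TDR handles multiple attributes and uses hierarchical conditional entropy:

```python
import math
from collections import Counter

# Hypothetical Job taxonomy; the blue-collar leaves are assumed names.
CHILDREN = {"ANY_Job": ("White_collar", "Blue_collar"),
            "White_collar": ("Adm_clerical", "Sales"),
            "Blue_collar": ("Craft_repair", "Handlers")}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def on_cut(leaf, cut):
    """Generalize a leaf value to its ancestor on the current cut."""
    while leaf not in cut:
        leaf = PARENT[leaf]
    return leaf

def h_cond(records, cut):
    """Conditional entropy H(D | generalized attribute) under the cut."""
    n, groups = len(records), {}
    for job, label in records:
        groups.setdefault(on_cut(job, cut), []).append(label)
    h = 0.0
    for labels in groups.values():
        for count in Counter(labels).values():
            p = count / len(labels)
            h -= (len(labels) / n) * p * math.log2(p)
    return h

def min_group(records, cut):
    return min(Counter(on_cut(j, cut) for j, _ in records).values())

def top_down_refinement(records, k):
    cut = {"ANY_Job"}                       # start fully generalized
    while True:
        best, best_score = None, -1.0
        for v in [v for v in cut if v in CHILDREN]:
            new_cut = (cut - {v}) | set(CHILDREN[v])
            if min_group(records, new_cut) < k:
                continue                    # refinement would break k-anonymity
            gain = h_cond(records, cut) - h_cond(records, new_cut)
            loss = max(min_group(records, cut) - min_group(records, new_cut), 1)
            if gain / loss > best_score:    # information gained per anonymity lost
                best, best_score = new_cut, gain / loss
        if best is None:
            return cut                      # no further refinement is valid
        cut = best

records = [("Adm_clerical", "high"), ("Sales", "high"), ("Adm_clerical", "high"),
           ("Craft_repair", "low"), ("Handlers", "low"), ("Craft_repair", "low")]
print(top_down_refinement(records, k=3))  # {'White_collar', 'Blue_collar'}
```

On this toy table the loop refines ANY_Job one level (each group still has 3 records and the class becomes pure), then stops because splitting either occupation class would leave a group smaller than k = 3.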

To the best of our knowledge, this is the first attempt to combine rough set theory and attribute value taxonomies to achieve k-anonymity. The contributions of this paper are threefold. First, we define the k-anonymity property using the equivalence class concept from rough sets; extend the single-level granulation rough set model to a hierarchical (i.e., multi-level) granulation rough set model (HGRS) by combining attribute value taxonomies and full-subtree generalization; apply conditional entropy from the information viewpoint to measure the classification ability of an anonymous table; and formulate the goal of k-anonymity for classification analysis from the information viewpoint. Second, we extend general conditional entropy under single-level granulation to hierarchical conditional entropy under multi-level granulation and study its properties under dynamic coarsening and refining of attribute values. Finally, we develop an efficient search metric based on the tradeoff principle between information gain and anonymity loss and present a novel approach for achieving k-anonymity, Hierarchical Conditional Entropy-based Top-Down Refinement (HCE-TDR).

The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, we introduce some basic concepts of rough sets, the k-anonymity model, and full-subtree generalization, and then give a formal definition of k-anonymization for classification. In Section 4, we investigate the main properties of HGRS from the information viewpoint. The HCE-TDR method is proposed in Section 5. An experimental evaluation of the proposed method is given in Section 6. Finally, Section 7 concludes the paper.

Section snippets

Related work

Privacy-preserving data mining (PPDM) addresses the tradeoff between the utility of the mining process and the privacy of the data subjects, aiming to minimize the privacy exposure with minimal effect on the mining results [2], [23]. Data perturbation and data anonymity are two popular obfuscation techniques adopted in PPDM. A data perturbation framework was presented in [11] for protecting individual privacy while maintaining data quality, which perturbs all data values by adding noise to all

Preliminaries and definitions

In this section, we introduce several basic notions of rough set theory, the k-anonymity model, and full-subtree generalization. We then present the fundamental concept of the hierarchical decision table and give a formal definition of k-anonymization for classification.

Hierarchical granulation rough set

In this section, several important properties of the hierarchical granulation rough set (HGRS) model in a hierarchical decision table are discussed.

Most previous studies of rough sets focus on a single concept level of abstraction: each attribute can be regarded as a granule for problem solving, and data are processed at a single level within each granule, with multiple attributes added or deleted at a time. When some of a decision table’s condition

HCE-TDR algorithm

In this section, based on the above properties, we design a search metric, with a comparative analysis of existing search metrics, and then present a heuristic algorithm for achieving k-anonymity, named HCE-TDR.

Experimental evaluation

In order to evaluate the performance of the proposed method for applying k-anonymity to data sets used for classification tasks, comparative experiments are conducted on benchmark data sets. Specifically, the experimental study has the following goals. (1) We compare the classification error obtained on anonymized data with the original error obtained without applying k-anonymity, and also with the error caused by simply removing all QI attributes (i.e., a trivially anonymized dataset). (2) We

Conclusions

In this paper, we employed the conditional entropy from the information viewpoint to measure the classification ability (i.e., the data quality or data utility for classification) of an anonymized dataset. By extending the traditional single-level granulation rough set model to a hierarchical granulation rough set model from the information viewpoint, we identified several important characteristics about data utility and privacy for classification. These characteristics show that the anonymity

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments, which have greatly improved the paper. This work is supported in part by the National High Technology Research and Development Program (863 Program) of China under Grant 2012AA011005, the National 973 Program of China under Grant 2013CB329604, the National Natural Science Foundation of China (NSFC) under Grants 60975034, 61272540 and 61229301, and the US National Science Foundation (NSF) under Grant CCF-0905337.

References (36)

  • R.J. Bayardo, R. Agrawal, Data privacy through optimal k-anonymization, in: Proceedings of the 21st International...
  • K. Chen et al., Geometric data perturbation for privacy preserving outsourced data mining, Knowledge and Information Systems (2011)
  • A. Friedman et al., Providing k-anonymity in data mining, International Journal of Very Large Data Bases (2008)
  • A. Friedman, A. Schuster, Data mining with differential privacy, in: Proceedings of the 16th ACM SIGKDD International...
  • B.C.M. Fung et al., Anonymizing classification data for privacy preservation, IEEE Transactions on Knowledge and Data Engineering (2007)
  • S.V. Iyengar, Transforming data to satisfy privacy constraints, in: Proceedings of the 8th ACM SIGKDD International...
  • H. Jo, Y.C. Na, B. Oh, J. Yang, V. Honavar, Attribute value taxonomy generation through matrix based adaptive genetic...
  • K. Kisilevich et al., Efficient multidimensional suppression for k-anonymity, IEEE Transactions on Knowledge and Data Engineering (2010)