Abstract
Protecting individual privacy is an important problem in microdata distribution and publishing. Anonymization algorithms typically aim to satisfy certain privacy definitions with minimal impact on the quality of the resulting data. While much of the previous literature has measured quality through simple one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used.
This article provides a suite of anonymization algorithms that incorporate a target class of workloads, consisting of one or more data mining tasks as well as selection predicates. An extensive empirical evaluation indicates that this approach is often more effective than previous techniques. In addition, we consider the problem of scalability. The article describes two extensions that allow us to scale the anonymization algorithms to datasets much larger than main memory. The first extension is based on ideas from scalable decision trees, and the second is based on sampling. A thorough performance evaluation indicates that these techniques are viable in practice.
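The abstract does not spell out the algorithms themselves, so the following is only a minimal sketch of what "incorporating a workload" could look like for a single classification task. It assumes k-anonymity as the privacy definition, numeric quasi-identifier attributes, and a greedy recursive partitioning that picks the split with the lowest weighted class-label entropy, in the style of a decision tree. The function names (`entropy`, `best_split`, `anonymize`, `generalize`) and the toy records are hypothetical and are not taken from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(rows, k):
    """Return (attribute index, threshold) with the lowest weighted label
    entropy among splits that leave at least k rows on each side, or None
    if no such split exists. Each row is (quasi_identifiers, label)."""
    best, best_score = None, None
    n = len(rows)
    for a in range(len(rows[0][0])):
        values = sorted({qi[a] for qi, _ in rows})
        for t in values[:-1]:
            left = [lab for qi, lab in rows if qi[a] <= t]
            right = [lab for qi, lab in rows if qi[a] > t]
            if len(left) < k or len(right) < k:
                continue  # this split would violate k-anonymity
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best_score is None or score < best_score:
                best, best_score = (a, t), score
    return best


def anonymize(rows, k):
    """Recursively partition rows; every returned partition has >= k rows
    (provided the input itself has >= k rows)."""
    split = best_split(rows, k)
    if split is None:
        return [rows]
    a, t = split
    left = [r for r in rows if r[0][a] <= t]
    right = [r for r in rows if r[0][a] > t]
    return anonymize(left, k) + anonymize(right, k)


def generalize(partition):
    """Replace each quasi-identifier by the (min, max) range of its partition."""
    num_attrs = len(partition[0][0])
    return tuple(
        (min(qi[a] for qi, _ in partition), max(qi[a] for qi, _ in partition))
        for a in range(num_attrs)
    )


# Toy records: (age, region code) as quasi-identifiers, plus a class label.
rows = [
    ((23, 1), "a"), ((25, 1), "a"), ((31, 2), "b"),
    ((35, 2), "b"), ((41, 1), "a"), ((47, 2), "b"),
]
parts = anonymize(rows, k=2)
for p in parts:
    print(generalize(p), [label for _, label in p])
```

A workload-agnostic variant would choose splits by attribute range or partition size alone; here the class label steers the choice, so records that the classifier needs to tell apart tend to land in different partitions. The sketch keeps all data in memory and does not attempt the scalability extensions the abstract mentions.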