ABSTRACT
Many websites provide form-like interfaces that allow users to execute search queries on the underlying hidden databases. In this paper, we explain the importance of protecting sensitive aggregate information of hidden databases from being disclosed through the individual tuples returned by search queries. This stands in contrast to the traditional privacy problem, where individual tuples must be protected while access to aggregate information is preserved. We propose techniques to thwart bots from sampling the hidden database to infer aggregate information. We present theoretical analysis and extensive experiments to illustrate the effectiveness of our approach.
Index Terms
- Privacy preservation of aggregates in hidden databases: why and how?