Abstract
A k-means clustering method built on a new privacy-preserving concept, user-centric privacy preservation, is presented. In this framework, users can conduct data mining on their private information while keeping it in their own local storage. After the computation, they obtain only the mining result, without disclosing private information to others. Conventional privacy-preserving data mining has, in most cases, assumed that only two parties join the protocol. In our framework, we assume that a large number of parties join the protocol; therefore, not only scalability but also asynchronism and fault tolerance are important. Considering this, we propose a k-means algorithm combined with a decentralized cryptographic protocol and a gossip-based protocol. The computational complexity is O(log n) with respect to the number of parties n, and experimental results show that our protocol is scalable even with one million parties.
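To make the gossip-based part of the abstract concrete, the following is a minimal Python sketch of one decentralized k-means update in which per-cluster sums and counts are estimated by pairwise gossip averaging. It is an illustration under strong assumptions, not the protocol of the paper: the cryptographic layer is omitted entirely (values are exchanged in the clear), each party holds a single one-dimensional point, and the names `gossip_average`, `nearest`, and `decentralized_kmeans_step` are hypothetical.

```python
import random


def gossip_average(values, rounds, rng):
    """Pairwise gossip averaging (illustrative, no encryption).

    In each round every node contacts one random other node, and both
    replace their values with the pairwise mean. The global sum is
    preserved, so all values drift toward the global average.
    """
    x = list(values)
    n = len(x)  # assumes at least two nodes
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # pick a peer different from node i
            mean = (x[i] + x[j]) / 2.0
            x[i] = x[j] = mean
    return x


def nearest(point, centroids):
    """Index of the centroid closest to a 1-D point."""
    return min(range(len(centroids)), key=lambda k: abs(point - centroids[k]))


def decentralized_kmeans_step(points, centroids, rounds, rng):
    """One k-means update; each party contributes one private 1-D point.

    For every cluster, parties gossip-average their local (sum, count)
    contributions. The ratio of the two averages equals the cluster mean.
    """
    labels = [nearest(p, centroids) for p in points]
    updated = []
    for c in range(len(centroids)):
        sums = [p if lab == c else 0.0 for p, lab in zip(points, labels)]
        counts = [1.0 if lab == c else 0.0 for lab in labels]
        # Every node ends with an estimate; node 0's view is used here.
        avg_sum = gossip_average(sums, rounds, rng)[0]
        avg_cnt = gossip_average(counts, rounds, rng)[0]
        updated.append(avg_sum / avg_cnt if avg_cnt > 0 else centroids[c])
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [i / 500 for i in range(500)] + [10 + i / 500 for i in range(500)]
    print(decentralized_kmeans_step(pts, [1.0, 9.0], rounds=40, rng=rng))
```

In this sketch no party ever sends its raw point to a coordinator, only running averages, but those averages still leak information; the abstract's decentralized cryptographic protocol is what would protect them. The number of gossip rounds is a free parameter here, chosen generously for the example.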
Cite this article
Sakuma, J., Kobayashi, S. Large-scale k-means clustering with user-centric privacy-preservation. Knowl Inf Syst 25, 253–279 (2010). https://doi.org/10.1007/s10115-009-0243-x