Abstract
We propose a novel collaborative approach for document classification, combining the knowledge of multiple users for improved organization of data such as individual document repositories or emails. To this end, we distribute locally built classification models in a network of participating users, and combine the shared classifiers into more powerful meta models. To increase the propagation efficiency, we apply a method for selecting the most discriminative model components and transmitting only these to other participants. In our experiments on four large standard collections for text classification, we study the resulting tradeoffs between network cost and classification accuracy. The experimental results show that the proposed model propagation has negligible communication costs and substantially outperforms current approaches with respect to efficiency and classification quality.
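The idea of sharing only the most discriminative model components and combining them into a meta model can be pictured with a small sketch. It is illustrative only and rests on assumptions the abstract does not state: local models are taken to be sparse linear classifiers (term to weight), "most discriminative" is approximated by the largest absolute weight, and the meta model is a simple majority vote. The names `select_top_components`, `score`, and `meta_predict` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Assumption: a local model is a sparse linear classifier, term -> weight.
SparseModel = Dict[str, float]


def select_top_components(model: SparseModel, k: int) -> SparseModel:
    """Keep the k weights with the largest absolute value.

    This stands in for the paper's component selection, whose exact
    criterion is not given in the abstract.
    """
    ranked = sorted(model.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])


def score(model: SparseModel, doc: Dict[str, float]) -> float:
    """Dot product between a sparse model and a sparse document vector."""
    return sum(w * doc.get(term, 0.0) for term, w in model.items())


def meta_predict(models: List[SparseModel], doc: Dict[str, float]) -> int:
    """Majority vote over the sign of each shared model's score.

    Returns +1 or -1; ties fall to -1. Voting is an assumed combination
    rule chosen for brevity.
    """
    votes = sum(1 if score(m, doc) > 0 else -1 for m in models)
    return 1 if votes > 0 else -1


# Two hypothetical peers with locally trained weights.
peer_a = {"invoice": 1.2, "meeting": -0.9, "the": 0.05, "budget": 0.7}
peer_b = {"invoice": 0.8, "lunch": -1.1, "and": -0.02}

# Each peer transmits only its two strongest components.
shared = [select_top_components(peer_a, 2), select_top_components(peer_b, 2)]

work_doc = {"invoice": 1.0, "budget": 1.0}
social_doc = {"lunch": 1.0, "meeting": 1.0}
print(meta_predict(shared, work_doc), meta_predict(shared, social_doc))
```

In this toy setup each peer sends two weights instead of its full model, which is the kind of tradeoff between network cost and accuracy that the experiments study.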
Cite this article
Papapetrou, O., Siberski, W. & Siersdorfer, S. Efficient model sharing for scalable collaborative classification. Peer-to-Peer Netw. Appl. 8, 384–398 (2015). https://doi.org/10.1007/s12083-014-0259-1
DOI: https://doi.org/10.1007/s12083-014-0259-1