Efficient model sharing for scalable collaborative classification

Peer-to-Peer Networking and Applications

Abstract

We propose a novel collaborative approach for document classification that combines the knowledge of multiple users to improve the organization of data such as individual document repositories or emails. To this end, we distribute locally built classification models in a network of participating users and combine the shared classifiers into more powerful meta models. To increase propagation efficiency, we apply a method that selects the most discriminative model components and transmits them to other participants. In our experiments on four large standard collections for text classification, we study the resulting tradeoffs between network cost and classification accuracy. The experimental results show that the proposed model propagation has negligible communication costs and substantially outperforms current approaches with respect to efficiency and classification quality.
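
The abstract does not spell out how model components are selected or how shared classifiers are combined, so the following is only a minimal sketch of the general idea under stated assumptions: each local model is a sparse linear classifier (term to weight), "most discriminative" is approximated by the largest absolute weight, and the meta model averages the scores of the shared, truncated models. The names `top_k_components`, `score`, and `meta_predict`, as well as the toy weights, are hypothetical and not taken from the article.

```python
from typing import Dict, List

# Assumption: a local model is a sparse linear classifier, term -> weight.
SparseModel = Dict[str, float]


def top_k_components(model: SparseModel, k: int) -> SparseModel:
    """Keep the k weights with the largest absolute value.

    This selection rule is an illustrative assumption, not the
    article's stated criterion.
    """
    ranked = sorted(model.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])


def score(model: SparseModel, doc: Dict[str, float]) -> float:
    """Dot product between a sparse model and a sparse document vector."""
    return sum(w * doc.get(term, 0.0) for term, w in model.items())


def meta_predict(models: List[SparseModel], doc: Dict[str, float]) -> int:
    """Average the scores of all shared models; return +1 or -1."""
    avg = sum(score(m, doc) for m in models) / len(models)
    return 1 if avg >= 0 else -1


# Two hypothetical local models for a "spam" (+1) vs "not spam" (-1) task.
local_a = {"viagra": 2.1, "meeting": -1.4, "the": 0.05, "free": 0.9}
local_b = {"lottery": 1.8, "agenda": -1.2, "and": -0.02, "free": 1.1}

# Each user transmits only the two strongest components of the local model.
shared = [top_k_components(m, k=2) for m in (local_a, local_b)]

doc = {"free": 1.0, "lottery": 1.0, "meeting": 1.0}
label = meta_predict(shared, doc)
```

In this sketch, the parameter `k` is where the tradeoff between network cost and accuracy shows up: a smaller `k` means fewer weights are transmitted per model, at the risk of dropping terms that would have mattered for some documents.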

Author information

Corresponding author

Correspondence to Odysseas Papapetrou.

About this article

Cite this article

Papapetrou, O., Siberski, W. & Siersdorfer, S. Efficient model sharing for scalable collaborative classification. Peer-to-Peer Netw. Appl. 8, 384–398 (2015). https://doi.org/10.1007/s12083-014-0259-1

  • DOI: https://doi.org/10.1007/s12083-014-0259-1
