Bridging social media via distant supervision

  • Original Article
  • Social Network Analysis and Mining

Abstract

Microblog classification has received a lot of attention in recent years. Different classification tasks have been investigated, most of them focusing on classifying microblogs into a small number of classes (five or fewer) using a training set of manually annotated tweets. Unfortunately, labelling data is tedious and expensive, and finding tweets that cover all the classes of interest is not always straightforward, especially when some of the classes rarely arise in practice. In this paper, we study an approach to tweet classification based on distant supervision, whereby we automatically transfer labels from one social medium to another for a single-label multi-class classification task. In particular, we assign to each tweet that links to a YouTube video the class of that video. This provides, at no cost, a virtually unlimited number of labelled instances that can be used as training data. Our classification experiments show that training a tweet classifier on these automatically labelled data achieves substantially better performance than training the same classifier on a limited amount of manually labelled data; this is advantageous, given that the automatically labelled data come at no cost. Further investigation of our approach shows that it is robust when applied with different numbers of classes and across different languages.
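The label-transfer step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `video_category` dictionary (a stand-in for a lookup of each video's class), and the example class names are all hypothetical, and only plain youtube.com and youtu.be links are recognised here, without expanding shortened links.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches http(s) URLs inside the tweet text.
URL_RE = re.compile(r"https?://\S+")


def youtube_video_id(url):
    """Return the video id of a youtube.com/watch or youtu.be URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def label_tweets(tweets, video_category):
    """Yield (tweet_text, label) for tweets linking to a video of known class.

    `video_category` is a hypothetical dict mapping video id -> class name.
    Tweets without a recognised link, or linking to an unknown video, are skipped.
    """
    for text in tweets:
        for url in URL_RE.findall(text):
            vid = youtube_video_id(url)
            if vid in video_category:
                yield text, video_category[vid]
                break  # one label per tweet (single-label task)


if __name__ == "__main__":
    tweets = [
        "Great goal! https://www.youtube.com/watch?v=abc123",
        "no link here",
        "listen https://youtu.be/xyz789 now",
    ]
    categories = {"abc123": "Sports", "xyz789": "Music"}  # illustrative only
    for text, label in label_tweets(tweets, categories):
        print(label, "<-", text)
```

The resulting (text, label) pairs would then serve as training data for any standard text classifier; the sketch deliberately stops before that step.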

Notes

  1. Ours is thus a single-label multi-class classification task, since each tweet is assigned exactly one out of a set of 14 available classes.

  2. The dataset is available for download at http://alt.qcri.org/~wmagdy/resources.htm.

  3. http://topsy.com/analytics?q1=site:youtube.com.

  4. http://twitter4j.org/en/index.html.

  5. This also captures tweets with shortened links to YouTube.

  6. http://developers.google.com/youtube/.

  7. http://svmlight.joachims.org/.

  8. https://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html.

  9. Relative to the difficulty of the task, these results are only apparently inferior to other results published in the tweet classification literature. For instance, in Lee et al. (2011) (which, as mentioned in Sect. 2, is the only article in the tweet classification literature that deals with a set of classes comparable to ours), the authors obtain 70.96% accuracy. However, their dataset is easier than ours: in their case, 70.96% accuracy is 3.68 times higher than that of their trivial acceptor (the classifier that always picks the majority class), while our accuracy of 0.574 (i.e., 57.4%) is 8.03 times higher than that obtained by our trivial acceptor.

  10. The fact that we obtain better results on Arabic than on English might at first seem surprising, but is not implausible. The literature on multilingual classification (see e.g., Gonçalves and Quaresma 2010) reports many cases in which substantially different levels of accuracy are obtained for different languages, even when the training data and the test data are exactly the same (i.e., each training/test document is a translation equivalent of a training/test document in another language). These differences may be due to a multiplicity of factors, such as the different accuracy of preprocessing tools (stop word lists, stemmers, lemmatizers, decompounders, parsers, and the like) in the different languages, or the presence of different linguistic phenomena in different languages. In our case, the documents are not even translation equivalents of each other, so the difference should be even less surprising.
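The comparison in note 9 can be checked with a little arithmetic. The sketch below only inverts the ratios stated in the note to recover the trivial-acceptor accuracies they imply; these baseline values are derived here, not quoted from either paper, and the function name is hypothetical.

```python
def implied_baseline(accuracy, ratio):
    """Trivial-acceptor accuracy implied by: accuracy = ratio * baseline."""
    return accuracy / ratio


# Figures taken from note 9, both expressed as fractions.
lee_baseline = implied_baseline(0.7096, 3.68)  # Lee et al. (2011)
our_baseline = implied_baseline(0.574, 8.03)   # this article

print(f"{lee_baseline:.3f}")  # 0.193
print(f"{our_baseline:.3f}")  # 0.071
```

Under these figures the majority class would cover roughly 19% of the instances in the dataset of Lee et al. (2011) and roughly 7% in ours, which is the sense in which the note calls our dataset the harder one.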

References

  • Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on Twitter. In: Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011). Barcelona, ES

  • Bollen J, Mao H, Zeng XJ (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8

  • Chen M, Jin X, Shen D (2011) Short text classification improved by learning multi-granularity topics. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011). Barcelona, ES, pp 1776–1781

  • Chen Y, Li Z, Nie L, Hu X, Wang X, Chua TS, Zhang X (2014) A semi-supervised Bayesian network model for microblog topic classification. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012). Mumbai, IN, pp 561–576

  • Darwish K, Magdy W, Mourad A (2012) Language processing for Arabic microblog retrieval. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012). Maui, US, pp 2427–2430

  • De Choudhury M, Diakopoulos N, Naaman M (2012) Unfolding the event landscape on Twitter: Classification and exploration of user categories. In: Proceedings of the 15th ACM Conference on Computer Supported Cooperative Work (CSCW 2012). Seattle, US, pp 241–244

  • Do CB, Ng AY (2005) Transfer learning for text classification. In: Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS 2005). Vancouver, CA, pp 299–306

  • Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS One 6(12)

  • Forman G (2004) A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004). Banff, CA, pp 38–45

  • Genc Y, Sakamoto Y, Nickerson JV (2011) Discovering context: Classifying tweets through a semantic transform based on Wikipedia. In: Proceedings of the 6th International Conference on Foundations of Augmented Cognition (FAC 2011). Orlando, US, pp 484–492

  • Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. Stanford University, Tech. rep

  • Gonçalves T, Quaresma P (2010) Polylingual text classification in the legal domain. Informatica e Diritto XIX(1–2):203–216

  • Husby SD, Barbosa D (2012) Topic classification of blog posts using distant supervision. In: Proceedings of the EACL Workshop on Semantic Analysis in Social Media. Avignon, FR, pp 28–36

  • Imran M, Castillo C, Diaz F, Vieweg S (2014) Processing social media messages in mass emergency: a survey. http://arxiv.org/abs/1407.7071v2

  • Irani D, Webb S, Pu C, Li K (2010) Study of trend-stuffing on Twitter through text classification. In: Proceedings of the 7th Conference on Collaboration, Electronic Messaging, Anti-Abuse and Spam (CEAS 2010). Redmond, US

  • Joachims T (2002) Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers, Dordrecht

  • Kinsella S, Passant A, Breslin JG (2011) Topic classification in social media using metadata from hyperlinked objects. In: Proceedings of the 33rd European Conference on Information Retrieval (ECIR 2011). Dublin, IE, pp 201–206

  • Kothari A, Magdy W, Darwish K, Mourad A, Taei A (2013) Detecting comments on news articles in microblogs. In: Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM 2013). Cambridge, US

  • Lee K, Palsetia D, Narayanan R, Patwary MMA, Agrawal A, Choudhary A (2011) Twitter trending topic classification. In: Proceedings of the 6th Workshop on optimization-based techniques for emerging data mining problems (OEDM 2011). Vancouver, CA, pp 251–258

  • Magdy W, Elsayed T (2014) Adaptive method for following dynamic topics on Twitter. In: Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM 2014). Ann Arbor, US

  • Marchetti-Bowick M, Chambers N (2012) Learning for microblogs with distant supervision: Political forecasting with Twitter. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012). Avignon, FR, pp 603–612

  • McCallum AK, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Proceedings of the AAAI Workshop on Learning for Text Categorization. Madison, US, pp 41–48

  • Mintz M, Bills S, Snow R, Jurafsky D (2009) Distant supervision for relation extraction without labeled data. In: Proceedings of the 47th Annual Meeting of the ACL and 4th International Joint Conference on Natural Language Processing (ACL/IJCNLP 2009). Singapore, SG, pp 1003–1011

  • Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359

  • Pan W, Zhong E, Yang Q (2012) Transfer learning for text mining. In: Aggarwal CC, Zhai C (eds) Mining text data. Springer, Heidelberg, DE, pp 223–258

  • Quercia D, Askham H, Crowcroft J (2012) TweetLDA: Supervised topic classification and link prediction in Twitter. In: Proceedings of the 4th ACM Conference on Web Science (WS 2012). Evanston, US, pp 247–250

  • Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007). Corvallis, US, pp 759–766

  • Sammut C, Harries M (2011) Concept drift. In: Sammut C, Webb GI (eds) Encyclopedia of Machine Learning. Springer, Heidelberg, pp 202–205

  • Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) TwitterStand: news in tweets. In: Proceedings of the 17th ACM International Conference on Advances in Geographic Information Systems (GIS 2009). Seattle, US, pp 42–51

  • Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M (2010) Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd ACM International Conference on Research and Development in Information Retrieval (SIGIR 2010). Geneva, CH, pp 841–842

  • Yang Y, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR 1999). Berkeley, US, pp 42–49

  • Zubiaga A, Ji H (2013) Harnessing Web page directories for large-scale classification of tweets. In: Posters Proceedings of the 22nd International World Wide Web Conference (WWW 2013). Rio de Janeiro, BR, pp 225–226

Corresponding author

Correspondence to Walid Magdy.

Additional information

Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche, Italy.

Cite this article

Magdy, W., Sajjad, H., El-Ganainy, T. et al. Bridging social media via distant supervision. Soc. Netw. Anal. Min. 5, 35 (2015). https://doi.org/10.1007/s13278-015-0275-z
