Bridging social media via distant supervision

Magdy, Walid; Sajjad, Hassan; El-Ganainy, Tarek; Sebastiani, Fabrizio

doi:10.1007/s13278-015-0275-z

Original Article
Published: 04 July 2015

Volume 5, article number 35, (2015)
Social Network Analysis and Mining

Abstract

Microblog classification has received a lot of attention in recent years. Different classification tasks have been investigated, most of them focusing on classifying microblogs into a small number of classes (five or fewer) using a training set of manually annotated tweets. Unfortunately, labelling data is tedious and expensive, and finding tweets that cover all the classes of interest is not always straightforward, especially when some of the classes do not frequently arise in practice. In this paper, we study an approach to tweet classification based on distant supervision, whereby we automatically transfer labels from one social medium to another for a single-label multi-class classification task. In particular, we apply YouTube video classes to tweets linking to these videos. This provides for free a virtually unlimited number of labelled instances that can be used as training data. The classification experiments we have run show that training a tweet classifier via these automatically labelled data achieves substantially better performance than training the same classifier with a limited amount of manually labelled data; this is advantageous, given that the automatically labelled data come at no cost. Further investigation of our approach shows its robustness when applied with different numbers of classes and across different languages.
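
The label-transfer step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `video_category` lookup table (standing in for category metadata obtained from YouTube), and the example category names are all assumptions made for illustration. The sketch only recognises `youtube.com/watch` and `youtu.be` links; links hidden behind other URL shorteners would first have to be expanded, which is not shown.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Simple URL matcher; real tweets may need more careful tokenisation.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_video_id(url: str) -> Optional[str]:
    """Return the YouTube video ID found in `url`, or None if there is none."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short form: the video ID is the path itself.
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        # Long form: the video ID is the `v` query parameter.
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def label_tweets(
    tweets: Iterable[str],
    video_category: Dict[str, str],  # hypothetical lookup: video ID -> class
) -> List[Tuple[str, str]]:
    """Give each tweet the class of the first known video it links to."""
    labelled = []
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            video_id = extract_video_id(url)
            category = video_category.get(video_id) if video_id else None
            if category is not None:
                labelled.append((text, category))
                break  # single-label task: one class per tweet
    return labelled
```

Tweets without a recognisable link, or linking to a video whose class is unknown, are simply dropped, so the output can be used directly as (text, label) training pairs for a single-label multi-class classifier.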

Notes

Ours is thus a single-label multi-class classification task, since each tweet is assigned exactly one out of a set of 14 available classes.
The dataset is available for download at http://alt.qcri.org/~wmagdy/resources.htm.
http://topsy.com/analytics?q1=site:youtube.com.
http://twitter4j.org/en/index.html.
This also captures tweets with shortened links to YouTube.
http://developers.google.com/youtube/.
http://svmlight.joachims.org/.
https://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html.
These results are, relative to the difficulty of the task, only deceptively inferior to other results published in the tweet classification literature. For instance, in Lee et al. (2011) (which, as mentioned in Sect. 2, is the only article in the tweet classification literature that deals with a set of classes comparable to ours), the authors obtain 70.96 % accuracy. However, their dataset is easier than ours: in their case, 70.96 % accuracy is 3.68 times higher than their trivial acceptor (the classifier that always picks the majority class), while our 0.574 accuracy value is 8.03 times higher than that obtained by our trivial acceptor.
The fact that we obtain better results on Arabic than on English might at first seem surprising, but is not implausible. The literature on multilingual classification (see e.g., Gonçalves and Quaresma 2010) reports many cases in which substantially different levels of accuracy are obtained for different languages, even when the training data and the test data are exactly the same (i.e., each training/test document is a translation equivalent of a training/test document in another language). These differences may be due to a multiplicity of factors, including the different accuracy of preprocessing tools (e.g., stop word lists, stemmers, lemmatizers, decompounders, parsers, etc.) in the different languages, the presence of different linguistic phenomena in different languages, etc. In our case, the documents are not even translation equivalents of each other, so the difference should be even less surprising.
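
The comparison with Lee et al. (2011) in the note above rests on the ratio between a classifier's accuracy and the accuracy of the trivial acceptor. As a worked check, the short sketch below back-calculates the trivial-acceptor accuracies implied by the figures quoted in that note. The resulting values are derived here from rounded numbers and are not figures reported in the source; the function name is hypothetical.

```python
def implied_baseline(accuracy: float, ratio: float) -> float:
    """Trivial-acceptor accuracy implied by accuracy / baseline == ratio."""
    return accuracy / ratio


# Figures quoted in the note: accuracy and its ratio to the trivial acceptor.
lee_baseline = implied_baseline(0.7096, 3.68)   # Lee et al. (2011)
ours_baseline = implied_baseline(0.574, 8.03)   # this article

print(f"Lee et al. implied trivial acceptor: {lee_baseline:.3f}")
print(f"This article implied trivial acceptor: {ours_baseline:.3f}")
```

Under these figures the implied majority-class accuracy is roughly 0.193 for Lee et al. and roughly 0.071 for this article. The latter is close to 1/14, which would be consistent with a test set spread fairly evenly over the 14 classes, although the note itself does not state this.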

References

Becker H, Naaman M, Gravano L (2011) Beyond trending topics: real-world event identification on Twitter. In: Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011). Barcelona, ES
Bollen J, Mao H, Zeng XJ (2011) Twitter mood predicts the stock market. J Comput Sci 2(1):1–8
Chen M, Jin X, Shen D (2011) Short text classification improved by learning multi-granularity topics. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011). Barcelona, ES, pp 1776–1781
Chen Y, Li Z, Nie L, Hu X, Wang X, Chua TS, Zhang X (2014) A semi-supervised Bayesian network model for microblog topic classification. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012). Mumbai, IN, pp 561–576
Darwish K, Magdy W, Mourad A (2012) Language processing for Arabic microblog retrieval. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012). Maui, US, pp 2427–2430
De Choudhury M, Diakopoulos N, Naaman M (2012) Unfolding the event landscape on Twitter: Classification and exploration of user categories. In: Proceedings of the 15th ACM Conference on Computer Supported Cooperative Work (CSCW 2012). Seattle, US, pp 241–244
Do CB, Ng AY (2005) Transfer learning for text classification. In: Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS 2005). Vancouver, CA, pp 299–306
Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS One 6(12)
Forman G (2004) A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004). Banff, CA, pp 38–45
Genc Y, Sakamoto Y, Nickerson JV (2011) Discovering context: Classifying tweets through a semantic transform based on Wikipedia. In: Proceedings of the 6th International Conference on Foundations of Augmented Cognition (FAC 2011). Orlando, US, pp 484–492
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. Stanford University, Tech. rep
Gonçalves T, Quaresma P (2010) Polylingual text classification in the legal domain. Informatica e Diritto XIX(1–2), pp 203–216
Husby SD, Barbosa D (2012) Topic classification of blog posts using distant supervision. In: Proceedings of the EACL Workshop on Semantic Analysis in Social Media. Avignon, FR, pp 28–36
Imran M, Castillo C, Diaz F, Vieweg S (2014) Processing social media messages in mass emergency: a survey. http://arxiv.org/abs/1407.7071v2
Irani D, Webb S, Pu C, Li K (2010) Study of trend-stuffing on Twitter through text classification. In: Proceedings of the 7th Conference on Collaboration, Electronic Messaging, Anti-Abuse and Spam (CEAS 2010). Redmond, US
Joachims T (2002) Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers, Dordrecht
Kinsella S, Passant A, Breslin JG (2011) Topic classification in social media using metadata from hyperlinked objects. In: Proceedings of the 33rd European Conference on Information Retrieval (ECIR 2011). Dublin, IE, pp 201–206
Kothari A, Magdy W, Darwish K, Mourad A, Taei A (2013) Detecting comments on news articles in microblogs. In: Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM 2013). Cambridge, US
Lee K, Palsetia D, Narayanan R, Patwary MMA, Agrawal A, Choudhary A (2011) Twitter trending topic classification. In: Proceedings of the 6th Workshop on optimization-based techniques for emerging data mining problems (OEDM 2011). Vancouver, CA, pp 251–258
Magdy W, Elsayed T (2014) Adaptive method for following dynamic topics on Twitter. In: Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM 2014). Ann Arbor, US
Marchetti-Bowick M, Chambers N (2012) Learning for microblogs with distant supervision: Political forecasting with Twitter. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012). Avignon, FR, pp 603–612
McCallum AK, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Proceedings of the AAAI Workshop on Learning for Text Categorization. Madison, US, pp 41–48
Mintz M, Bills S, Snow R, Jurafsky D (2009) Distant supervision for relation extraction without labeled data. In: Proceedings of the 47th Annual Meeting of the ACL and 4th International Joint Conference on Natural Language Processing (ACL/IJCNLP 2009). Singapore, SG, pp 1003–1011
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Pan W, Zhong E, Yang Q (2012) Transfer learning for text mining. In: Aggarwal CC, Zhai C (eds) Mining text data. Springer, Heidelberg, DE, pp 223–258
Quercia D, Askham H, Crowcroft J (2012) TweetLDA: Supervised topic classification and link prediction in Twitter. In: Proceedings of the 4th ACM Conference on Web Science (WS 2012). Evanston, US, pp 247–250
Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: Proceedings of the 24th International Conference on Machine Learning (ICML 2007). Corvallis, US, pp 759–766
Sammut C, Harries M (2011) Concept drift. In: Sammut C, Webb GI (eds) Encyclopedia of Machine Learning. Springer, Heidelberg, pp 202–205
Sankaranarayanan J, Samet H, Teitler BE, Lieberman MD, Sperling J (2009) TwitterStand: news in tweets. In: Proceedings of the 17th ACM International Conference on Advances in Geographic Information Systems (GIS 2009). Seattle, US, pp 42–51
Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M (2010) Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd ACM International Conference on Research and Development in Information Retrieval (SIGIR 2010). Geneva, CH, pp 841–842
Yang Y, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR 1999). Berkeley, US, pp 42–49
Zubiaga A, Ji H (2013) Harnessing Web page directories for large-scale classification of tweets. In: Posters Proceedings of the 22nd International World Wide Web Conference (WWW 2013). Rio de Janeiro, BR, pp 225–226

Author information

Authors and Affiliations

Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
Walid Magdy, Hassan Sajjad, Tarek El-Ganainy & Fabrizio Sebastiani

Corresponding author

Correspondence to Walid Magdy.

Additional information

Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche, Italy.

About this article

Cite this article

Magdy, W., Sajjad, H., El-Ganainy, T. et al. Bridging social media via distant supervision. Soc. Netw. Anal. Min. 5, 35 (2015). https://doi.org/10.1007/s13278-015-0275-z

Received: 10 March 2015
Revised: 09 June 2015
Accepted: 16 June 2015
Published: 04 July 2015
DOI: https://doi.org/10.1007/s13278-015-0275-z
