Abstract
Many tasks involving social media text require mining rare samples. In text classification, information retrieval, and other NLP tasks, highly skewed or imbalanced data sets pose many challenges. In such settings, training data sets can be rapidly bootstrapped using highly targeted sampling strategies. This chapter draws on work in active learning, semantic similarity, and sampling strategies to address a variety of social media text mining tasks. These topics are particularly well suited to social media analysis: many tasks surrounding user-generated social media text, such as content moderation and recommendation, involve constructing models rapidly, in real time, in response to real-world events. The methods discussed allow task-specific data sets and models to be constructed rapidly, often from just a handful of initial samples. We then explore extensions for sampling across languages, enabling powerful pipelines that transfer resources from well-resourced languages to their low-resource counterparts.
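As a rough illustration of bootstrapping a data set from a handful of seed samples, the sketch below ranks an unlabeled pool by similarity to the seeds and returns the closest items as candidates for annotation. It is not the chapter's implementation: the function names, the toy comments, and the bag-of-words `embed` function (a stand-in for a real sentence embedding) are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_sample(seeds, pool, k):
    """Return the k pool items most similar to any seed (candidates for labeling)."""
    seed_vecs = [embed(s) for s in seeds]
    scored = [(max(cosine(embed(doc), sv) for sv in seed_vecs), doc) for doc in pool]
    # Python's sort is stable, so ties keep their original pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy data, invented for illustration only.
seeds = ["we want peace not war"]
pool = [
    "great song love the video",
    "peace is what both countries want",
    "who is watching in 2021",
    "no more war we want peace",
]
candidates = nearest_neighbor_sample(seeds, pool, k=2)
print(candidates)
```

In an active learning loop, the returned candidates would be labeled by an annotator, the positives added to the seed set, and the sampling repeated; a real pipeline would replace `embed` with a trained sentence or document embedding.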
Notes
1. Example from KhudaBukhsh et al. [17].
References
Alphabet, Inc. (2021) Hate speech policy. YouTube https://support.google.com/youtube/answer/2801939
Attenberg J, Melville P, Provost F (2010) A unified approach to active dual supervision for labeling features and examples. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 40–55
Attenberg J, Ipeirotis P, Provost F (2011) Beat the machine: challenging workers to find the unknown unknowns
BBC News (2021) Kashmir attack: tracing the path that led to Pulwama. BBC News. https://www.bbc.com/news/world-asia-india-47302467
Buyse A (2014) Words of violence: “fear speech,” or how violent conflict escalation relates to the freedom of expression. Hum Rights Q 36(4):779–797. http://www.jstor.org/stable/24518298
Chen Y, Mani S (2011) Active learning for unbalanced data in the challenge with multiple models and biasing. In: Guyon I, Cawley G, Dror G, Lemaire V, Statnikov A (eds) Active learning and experimental design workshop in conjunction with AISTATS 2010, JMLR Workshop and conference proceedings, Sardinia, Italy, Proceedings of machine learning research, vol 16, pp 113–126, http://proceedings.mlr.press/v16/chen11a.html
Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116
Culliford E, Paul K (2020) Facebook offers up first-ever estimate of hate speech prevalence on its platform. Reuters https://www.reuters.com/article/uk-facebook-content-idINKBN27Z2QY
Culotta A, McCallum A (2005) Reducing labeling effort for structured prediction tasks. In: AAAI
Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Prieditis A, Russell S (eds) Machine learning proceedings 1995, Morgan Kaufmann, San Francisco, pp 150–157. https://doi.org/10.1016/B978-1-55860-377-6.50027-X
Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: ICWSM
Ertekin S, Huang J, Bottou L, Lee Giles C (2007) Learning on the border: active learning in imbalanced data classification. In: CIKM 2007 - Proceedings of the 16th ACM conference on information and knowledge management, pp 127–136. https://doi.org/10.1145/1321440.1321461
Facebook, Inc. (2021) Facebook community standards: objectionable content hate speech. Facebook. https://www.facebook.com/communitystandards/objectionable_content
Jacobs J, Potter K (1997) Hate crimes: a critical perspective. Crime and Justice: A Review of Research 22. https://doi.org/10.1086/449259
KhudaBukhsh AR, Bennett PN, White RW (2015) Building effective query classifiers: a case study in self-harm intent detection. In: Proceedings of the 24th ACM international on conference on information and knowledge management. Association for Computing Machinery, New York, CIKM ’15, pp 1735–1738. https://doi.org/10.1145/2806416.2806594
KhudaBukhsh AR, Palakodety S, Carbonell JG (2020) Harnessing code switching to transcend the linguistic barrier. In: Bessiere C (ed) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI-20, International joint conferences on Artificial Intelligence Organization, pp 4366–4374, special track on AI for CompSust and Human well-being
KhudaBukhsh AR, Palakodety S, Mitchell TM (2020) Discovering bilingual lexicons in polyglot word embeddings. CoRR abs/2008.13347. https://arxiv.org/abs/2008.13347
Kim S, Song Y, Kim K, Cha JW, Lee GG (2006) MMR-based active machine learning for bio named entity recognition. In: Proceedings of the human language technology conference of the NAACL, companion volume: Short Papers. Association for Computational Linguistics, New York City, pp 69–72. https://aclanthology.org/N06-2018
Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G (2018) Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J Biomed Inf 87:68–78
Lample G, Ott M, Conneau A, Denoyer L, Ranzato M (2018) Phrase-based & neural unsupervised machine translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 5039–5049. https://doi.org/10.18653/v1/D18-1549. https://aclanthology.org/D18-1549
Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Machine learning proceedings 1994, Elsevier, pp 148–156
Pagliardini M, Gupta P, Jaggi M (2018) Unsupervised learning of sentence embeddings using compositional n-gram features. In: NAACL 2018—Conference of the North American Chapter of the Association for Computational Linguistics
Palakodety S, KhudaBukhsh AR, Carbonell JG (2020a) Hope speech detection: a computational analysis of the voice of peace. In: Giacomo GD, Catalá A, Dilkina B, Milano M, Barro S, Bugarín A, Lang J (eds) ECAI 2020: 24th European conference on artificial intelligence. Frontiers in artificial intelligence and applications. IOS Press, vol 325, pp 1881–1889. https://doi.org/10.3233/FAIA200305
Palakodety S, KhudaBukhsh AR, Carbonell JG (2020) Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: Proceedings of the AAAI conference on artificial intelligence, vol 34(01), pp 454–462
Pereira-Kohatsu JC, Sánchez L, Liberatore F, Camacho-Collados M (2019) Detecting and monitoring hate speech in Twitter. Sensors (Basel, Switzerland) 19
Saha P, Mathew B, Garimella K, Mukherjee A (2021) “Short is the road that leads from fear to hate”: fear speech in Indian WhatsApp groups. In: Proceedings of the web conference 2021. Association for Computing Machinery, New York, WWW ’21, pp 1110–1121
Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information extraction. In: Hoffmann F, Hand DJ, Adams N, Fisher D, Guimaraes G (eds) Advances in intelligent data analysis. Springer, Berlin, pp 309–318
Settles B (2009) Active learning literature survey
Sindhwani V, Melville P, Lawrence RD (2009) Uncertainty sampling and transductive experimental design for active dual supervision. In: Proceedings of the 26th annual international conference on machine learning. Association for Computing Machinery, New York, ICML ’09, pp 953–960. https://doi.org/10.1145/1553374.1553496
Twitter, Inc (2021) Hateful conduct policy. Twitter https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
United Nations Office for the Coordination of Humanitarian Affairs (2021) Rohingya refugee crisis. OCHA. https://www.unocha.org/rohingya-refugee-crisis
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Palakodety, S., KhudaBukhsh, A.R., Jayachandran, G. (2021). Semantic Sampling. In: Low Resource Social Media Text Mining. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-16-5625-5_6
DOI: https://doi.org/10.1007/978-981-16-5625-5_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5624-8
Online ISBN: 978-981-16-5625-5
eBook Packages: Computer Science, Computer Science (R0)