Semantic Sampling

Palakodety, Shriphani; KhudaBukhsh, Ashiqur R.; Jayachandran, Guha

doi:10.1007/978-981-16-5625-5_6

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

A variety of tasks involving social media text require mining rare samples. In text classification, information retrieval, and other NLP tasks, working with highly skewed or imbalanced data sets poses many challenges. In such settings, training data sets can be bootstrapped rapidly using highly targeted sampling strategies. This chapter draws on work in active learning, semantic similarity, and sampling strategies to address a variety of social media text mining tasks. These topics are particularly well suited to social media analysis: most tasks involving user-generated social media text, such as content moderation and recommendation, require models to be built rapidly in response to real-world events as they unfold. The methods discussed allow task-specific data sets and models to be constructed rapidly, often from just a handful of initial samples. We then explore extensions that sample across languages, enabling pipelines that transfer resources from well-resourced languages to their low-resource counterparts.
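
To make the idea of bootstrapping from a handful of seeds concrete, the following is a minimal sketch of one possible targeted sampling strategy: nearest-neighbor sampling in an embedding space using cosine similarity. It is an illustration under stated assumptions, not necessarily the procedure used in the chapter. The toy two-dimensional vectors stand in for sentence embeddings, and the function names `cosine_similarity_matrix` and `sample_nearest_neighbors` are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def sample_nearest_neighbors(seed_vecs, pool_vecs, k=2):
    """Return indices of the k pool items closest to any seed, best first.

    Assumption: each pool item is scored by its similarity to its single
    most similar seed; other aggregation rules are equally plausible.
    """
    sims = cosine_similarity_matrix(pool_vecs, seed_vecs)  # (n_pool, n_seeds)
    best = sims.max(axis=1)                                # best seed match per pool item
    order = np.argsort(-best, kind="stable")               # descending similarity
    return order[:k].tolist()


# Toy data (assumed): five unlabeled pool items and one labeled seed.
pool = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.7, 0.7],
])
seeds = np.array([[1.0, 0.05]])

# Candidates to send to an annotator next.
picked = sample_nearest_neighbors(seeds, pool, k=2)
print(picked)
```

In a loop of this kind, the sampled candidates would be labeled by an annotator, and confirmed positives would be added to the seed set before the next round.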

Notes

1. Example from KhudaBukhsh et al. [17].

References

1. Alphabet, Inc. (2021) Hate speech policy. YouTube. https://support.google.com/youtube/answer/2801939
2. Attenberg J, Melville P, Provost F (2010) A unified approach to active dual supervision for labeling features and examples. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 40–55
3. Attenberg J, Ipeirotis P, Provost F (2011) Beat the machine: challenging workers to find the unknown unknowns
4. BBC News (2021) Kashmir attack: tracing the path that led to Pulwama. BBC News. https://www.bbc.com/news/world-asia-india-47302467
5. Buyse A (2014) Words of violence: “fear speech,” or how violent conflict escalation relates to the freedom of expression. Hum Rights Q 36(4):779–797. http://www.jstor.org/stable/24518298
6. Chen Y, Mani S (2011) Active learning for unbalanced data in the challenge with multiple models and biasing. In: Guyon I, Cawley G, Dror G, Lemaire V, Statnikov A (eds) Active learning and experimental design workshop in conjunction with AISTATS 2010, JMLR Workshop and conference proceedings, Sardinia, Italy, Proceedings of machine learning research, vol 16, pp 113–126. http://proceedings.mlr.press/v16/chen11a.html
7. Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087
8. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116
9. Culliford E, Paul K (2020) Facebook offers up first-ever estimate of hate speech prevalence on its platform. Reuters. https://www.reuters.com/article/uk-facebook-content-idINKBN27Z2QY
10. Culotta A, McCallum A (2005) Reducing labeling effort for structured prediction tasks. In: AAAI
11. Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Prieditis A, Russell S (eds) Machine learning proceedings 1995, Morgan Kaufmann, San Francisco, pp 150–157. https://doi.org/10.1016/B978-1-55860-377-6.50027-X. https://www.sciencedirect.com/science/article/pii/B978155860377650027X
12. Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: ICWSM
13. Ertekin S, Huang J, Bottou L, Lee Giles C (2007) Learning on the border: active learning in imbalanced data classification. In: CIKM 2007: Proceedings of the 16th ACM conference on information and knowledge management, pp 127–136. https://doi.org/10.1145/1321440.1321461
14. Facebook, Inc. (2021) Facebook community standards: objectionable content hate speech. Facebook. https://www.facebook.com/communitystandards/objectionable_content
15. Jacobs J, Potter K (1997) Hate crimes: a critical perspective. Crime and Justice 22. https://doi.org/10.1086/449259
16. KhudaBukhsh AR, Bennett PN, White RW (2015) Building effective query classifiers: a case study in self-harm intent detection. In: Proceedings of the 24th ACM international on conference on information and knowledge management. Association for Computing Machinery, New York, CIKM ’15, pp 1735–1738. https://doi.org/10.1145/2806416.2806594
17. KhudaBukhsh AR, Palakodety S, Carbonell JG (2020) Harnessing code switching to transcend the linguistic barrier. In: Bessiere C (ed) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI-20, International joint conferences on Artificial Intelligence Organization, pp 4366–4374, special track on AI for CompSust and Human well-being
18. KhudaBukhsh AR, Palakodety S, Mitchell TM (2020) Discovering bilingual lexicons in polyglot word embeddings. CoRR abs/2008.13347. https://arxiv.org/abs/2008.13347
19. Kim S, Song Y, Kim K, Cha JW, Lee GG (2006) MMR-based active machine learning for bio named entity recognition. In: Proceedings of the human language technology conference of the NAACL, companion volume: Short Papers. Association for Computational Linguistics, New York City, pp 69–72. https://aclanthology.org/N06-2018
20. Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G (2018) Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J Biomed Inf 87:68–78
21. Lample G, Ott M, Conneau A, Denoyer L, Ranzato M (2018) Phrase-based & neural unsupervised machine translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 5039–5049. https://doi.org/10.18653/v1/D18-1549. https://aclanthology.org/D18-1549
22. Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Machine learning proceedings 1994, Elsevier, pp 148–156
23. Pagliardini M, Gupta P, Jaggi M (2018) Unsupervised learning of sentence embeddings using compositional n-gram features. In: NAACL 2018, Conference of the North American Chapter of the Association for Computational Linguistics
24. Palakodety S, KhudaBukhsh AR, Carbonell JG (2020a) Hope speech detection: a computational analysis of the voice of peace. In: Giacomo GD, Catalá A, Dilkina B, Milano M, Barro S, Bugarín A, Lang J (eds) ECAI 2020, 24th European conference on artificial intelligence. Frontiers in artificial intelligence and applications. IOS Press, vol 325, pp 1881–1889. https://doi.org/10.3233/FAIA200305
25. Palakodety S, KhudaBukhsh AR, Carbonell JG (2020) Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: Proceedings of the AAAI conference on artificial intelligence, vol 34(01), pp 454–462
26. Pereira-Kohatsu JC, Sánchez L, Liberatore F, Camacho-Collados M (2019) Detecting and monitoring hate speech in Twitter. Sensors (Basel, Switzerland) 19
27. Saha P, Mathew B, Garimella K, Mukherjee A (2021) “Short is the road that leads from fear to hate”: fear speech in Indian WhatsApp groups. In: Proceedings of the web conference 2021. Association for Computing Machinery, New York, WWW ’21, pp 1110–1121
28. Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information extraction. In: Hoffmann F, Hand DJ, Adams N, Fisher D, Guimaraes G (eds) Advances in intelligent data analysis. Springer, Berlin, pp 309–318
29. Settles B (2009) Active learning literature survey
30. Sindhwani V, Melville P, Lawrence RD (2009) Uncertainty sampling and transductive experimental design for active dual supervision. In: Proceedings of the 26th annual international conference on machine learning. Association for Computing Machinery, New York, ICML ’09, pp 953–960. https://doi.org/10.1145/1553374.1553496
31. Twitter, Inc (2021) Hateful conduct policy. Twitter. https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy
32. United Nations Office for the Coordination of Humanitarian Affairs (2021) Rohingya refugee crisis. YouTube. https://www.unocha.org/rohingya-refugee-crisis

Author information

Authors and Affiliations

Shriphani Palakodety: Onai Inc., San Jose, CA, USA
Ashiqur R. KhudaBukhsh: Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, USA
Guha Jayachandran: Onai Inc., San Jose, CA, USA

Corresponding author

Correspondence to Shriphani Palakodety.

About this chapter

Cite this chapter

Palakodety, S., KhudaBukhsh, A.R., Jayachandran, G. (2021). Semantic Sampling. In: Low Resource Social Media Text Mining. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-16-5625-5_6

DOI: https://doi.org/10.1007/978-981-16-5625-5_6
Published: 02 October 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5624-8
Online ISBN: 978-981-16-5625-5
eBook Packages: Computer Science, Computer Science (R0)