Semantic Sampling

Low Resource Social Media Text Mining

Abstract

Many social media text mining tasks require finding rare samples. In text classification, information retrieval, and other NLP tasks, heavily skewed or imbalanced data sets pose many challenges. In such settings, training data sets can be bootstrapped rapidly using highly targeted sampling strategies. This chapter draws on work in active learning, semantic similarity, and sampling strategies to address a variety of social media text mining tasks. These topics are particularly well suited to social media analysis: tasks involving user-generated text, such as content moderation and recommendation, often require rapid model construction in response to real-world events as they unfold. The methods discussed allow task-specific data sets and models to be constructed rapidly, often from just a handful of initial samples. We then explore extensions that sample across languages, allowing powerful pipelines that transfer resources from well-resourced languages to their low-resource counterparts.
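As a rough illustration of bootstrapping from a handful of seed samples via semantic similarity, the sketch below ranks an unlabeled pool by cosine similarity to the nearest seed and returns the top candidates for annotation. It is a minimal sketch under stated assumptions, not the chapter's own procedure: the function names `cosine` and `sample_nearest` are hypothetical, and the toy 2-dimensional vectors stand in for sentence embeddings produced by whatever encoder is in use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_nearest(seed_vecs, pool_vecs, k):
    """Return indices of the k pool items most similar to any seed (hypothetical helper)."""
    scored = []
    for idx, vec in enumerate(pool_vecs):
        # Score each pool item by its similarity to the closest seed.
        best = max(cosine(vec, seed) for seed in seed_vecs)
        scored.append((best, idx))
    # Highest similarity first; ties broken by pool position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy stand-ins for embeddings: one seed and a pool of four candidates.
seeds = [[1.0, 0.0]]
pool = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
candidates = sample_nearest(seeds, pool, k=2)
```

In an active-sampling loop, the returned candidates would typically be labeled by an annotator and the confirmed positives added back to the seed set before the next round.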

Notes

  1. Example from KhudaBukhsh et al. [17].

References

  1. Alphabet, Inc. (2021) Hate speech policy. YouTube https://support.google.com/youtube/answer/2801939

  2. Attenberg J, Melville P, Provost F (2010) A unified approach to active dual supervision for labeling features and examples. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 40–55

  3. Attenberg J, Ipeirotis P, Provost F (2011) Beat the machine: challenging workers to find the unknown unknowns

  4. BBC News (2021) Kashmir attack: tracing the path that led to Pulwama. BBC News https://www.bbc.com/news/world-asia-india-47302467

  5. Buyse A (2014) Words of violence: “fear speech,” or how violent conflict escalation relates to the freedom of expression. Hum Rights Q 36(4):779–797. http://www.jstor.org/stable/24518298

  6. Chen Y, Mani S (2011) Active learning for unbalanced data in the challenge with multiple models and biasing. In: Guyon I, Cawley G, Dror G, Lemaire V, Statnikov A (eds) Active learning and experimental design workshop in conjunction with AISTATS 2010, JMLR Workshop and conference proceedings, Sardinia, Italy, Proceedings of machine learning research, vol 16, pp 113–126, http://proceedings.mlr.press/v16/chen11a.html

  7. Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087

  8. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116

  9. Culliford E, Paul K (2020) Facebook offers up first-ever estimate of hate speech prevalence on its platform. Reuters https://www.reuters.com/article/uk-facebook-content-idINKBN27Z2QY

  10. Culotta A, McCallum A (2005) Reducing labeling effort for structured prediction tasks. In: AAAI

  11. Dagan I, Engelson SP (1995) Committee-based sampling for training probabilistic classifiers. In: Prieditis A, Russell S (eds) Machine learning proceedings 1995, Morgan Kaufmann, San Francisco, pp 150–157. https://doi.org/10.1016/B978-1-55860-377-6.50027-X. https://www.sciencedirect.com/science/article/pii/B978155860377650027X

  12. Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: ICWSM

  13. Ertekin S, Huang J, Bottou L, Lee Giles C (2007) Learning on the border: active learning in imbalanced data classification. In: Proceedings of the 16th ACM conference on information and knowledge management (CIKM 2007), pp 127–136. https://doi.org/10.1145/1321440.1321461

  14. Facebook, Inc. (2021) Facebook community standards: objectionable content hate speech. Facebook. https://www.facebook.com/communitystandards/objectionable_content

  15. Jacobs J, Potter K (1997) Hate crimes: a critical perspective. Crime Justice 22. https://doi.org/10.1086/449259

  16. KhudaBukhsh AR, Bennett PN, White RW (2015) Building effective query classifiers: a case study in self-harm intent detection. In: Proceedings of the 24th ACM international on conference on information and knowledge management. Association for Computing Machinery, New York, CIKM ’15, pp 1735–1738. https://doi.org/10.1145/2806416.2806594

  17. KhudaBukhsh AR, Palakodety S, Carbonell JG (2020) Harnessing code switching to transcend the linguistic barrier. In: Bessiere C (ed) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI-20, International joint conferences on Artificial Intelligence Organization, pp 4366–4374, special track on AI for CompSust and Human well-being

  18. KhudaBukhsh AR, Palakodety S, Mitchell TM (2020) Discovering bilingual lexicons in polyglot word embeddings. CoRR abs/2008.13347. https://arxiv.org/abs/2008.13347

  19. Kim S, Song Y, Kim K, Cha JW, Lee GG (2006) MMR-based active machine learning for bio named entity recognition. In: Proceedings of the human language technology conference of the NAACL, companion volume: Short Papers. Association for Computational Linguistics, New York City, pp 69–72. https://aclanthology.org/N06-2018

  20. Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G (2018) Social media mining for birth defects research: a rule-based, bootstrapping approach to collecting data for rare health-related events on twitter. J Biomed Inf 87:68–78

  21. Lample G, Ott M, Conneau A, Denoyer L, Ranzato M (2018) Phrase-based & neural unsupervised machine translation. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, pp 5039–5049. https://doi.org/10.18653/v1/D18-1549. https://aclanthology.org/D18-1549

  22. Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: Machine learning proceedings 1994, Elsevier, pp 148–156

  23. Pagliardini M, Gupta P, Jaggi M (2018) Unsupervised learning of sentence embeddings using compositional n-gram features. In: NAACL 2018—Conference of the North American Chapter of the Association for Computational Linguistics

  24. Palakodety S, KhudaBukhsh AR, Carbonell JG (2020a) Hope speech detection: a computational analysis of the voice of peace. In: Giacomo GD, Catalá A, Dilkina B, Milano M, Barro S, Bugarín A, Lang J (eds) ECAI 2020—24th European conference on artificial intelligence. Frontiers in artificial intelligence and applications. IOS Press, vol 325, pp 1881–1889. https://doi.org/10.3233/FAIA200305

  25. Palakodety S, KhudaBukhsh AR, Carbonell JG (2020) Voice for the voiceless: active sampling to detect comments supporting the Rohingyas. In: Proceedings of the AAAI conference on artificial intelligence, vol 34(01), pp 454–462

  26. Pereira-Kohatsu JC, Sánchez L, Liberatore F, Camacho-Collados M (2019) Detecting and monitoring hate speech in twitter. Sensors (Basel, Switzerland) 19

  27. Saha P, Mathew B, Garimella K, Mukherjee A (2021) “Short is the road that leads from fear to hate”: fear speech in Indian WhatsApp groups. In: Proceedings of the web conference 2021. Association for Computing Machinery, New York, WWW ’21, pp 1110–1121

  28. Scheffer T, Decomain C, Wrobel S (2001) Active hidden Markov models for information extraction. In: Hoffmann F, Hand DJ, Adams N, Fisher D, Guimaraes G (eds) Advances in intelligent data analysis. Springer, Berlin, pp 309–318

  29. Settles B (2009) Active learning literature survey

  30. Sindhwani V, Melville P, Lawrence RD (2009) Uncertainty sampling and transductive experimental design for active dual supervision. In: Proceedings of the 26th annual international conference on machine learning. Association for Computing Machinery, New York, ICML ’09, pp 953–960. https://doi.org/10.1145/1553374.1553496

  31. Twitter, Inc (2021) Hateful conduct policy. Twitter https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

  32. United Nations Office for the Coordination of Humanitarian Affairs (2021) Rohingya refugee crisis. YouTube. https://www.unocha.org/rohingya-refugee-crisis

Author information

Corresponding author

Correspondence to Shriphani Palakodety.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Palakodety, S., KhudaBukhsh, A.R., Jayachandran, G. (2021). Semantic Sampling. In: Low Resource Social Media Text Mining. SpringerBriefs in Computer Science. Springer, Singapore. https://doi.org/10.1007/978-981-16-5625-5_6

  • DOI: https://doi.org/10.1007/978-981-16-5625-5_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-5624-8

  • Online ISBN: 978-981-16-5625-5

  • eBook Packages: Computer Science, Computer Science (R0)
