
Usefulness of Graphemes in Word-Level Language Identification in Code-Mixed Text

  • Conference paper
Advances in Distributed Computing and Machine Learning

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 302)

Abstract

Language Identification (LI) is a crucial part of many text-processing pipelines, as most techniques presume that the language of the input text is known. Document-level LI is regarded as an almost solved problem in some application areas, but language detectors fail on social media text because of code-switching, word borrowing across languages, and phonetic typing, all of which imply that LI in code-mixed text must be carried out at the word level. Hence, this work focuses on identifying languages at the word level in multilingual environments such as social media. A major concern in these environments is phonetic typing, which can be accounted for by incorporating graphemic features into the model. Character n-grams take every combination of adjacent characters into account, resulting in a large model size, whereas graphemic features consider only those character combinations that have some underlying linguistic significance. For example, the graphemes ‘kh’ and ‘gh’ occur far more often in languages such as Hindi and Urdu than in English. In the dataset of Sarma et al. (Word level language identification in Assamese-Bengali-Hindi-English code-mixed social media text, pp. 261–266), we observe that a larger share of graphemes (53.46%) is exclusive to a single language than of bigrams (21.38%) or trigrams (39.43%). This work presents a detailed analysis and comparison, across several metrics, of character n-gram and grapheme-based features: we take several popular methods that originally use only character n-gram features and rerun them with grapheme-based features in their place. Through these experiments and our analysis, we show the usefulness of graphemes for word-level LI.
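The contrast drawn in the abstract between character n-grams and graphemic features, and the "exclusive to a particular language" statistic, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `GRAPHEMES` inventory, the greedy longest-match segmentation, and the function names are hypothetical and are not taken from the paper, whose actual grapheme definition is not given in the abstract.

```python
from collections import defaultdict

# Hypothetical inventory of multi-character graphemes for romanized text.
# Illustrative only; this is NOT the inventory used in the paper.
GRAPHEMES = ("kh", "gh", "sh", "ch", "th", "aa", "ee", "oo")


def char_ngrams(word, n):
    """Return every contiguous character n-gram of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def graphemes(word, inventory=GRAPHEMES):
    """Segment a word by greedy longest match against the inventory.

    Characters that start no known grapheme are emitted on their own.
    """
    by_length = sorted(inventory, key=len, reverse=True)
    units, i = [], 0
    while i < len(word):
        match = next((g for g in by_length if word.startswith(g, i)), word[i])
        units.append(match)
        i += len(match)
    return units


def exclusive_share(labelled_words, featurize):
    """Fraction of distinct features that occur in exactly one language."""
    languages_by_feature = defaultdict(set)
    for word, lang in labelled_words:
        for feature in featurize(word):
            languages_by_feature[feature].add(lang)
    if not languages_by_feature:
        return 0.0
    exclusive = sum(1 for langs in languages_by_feature.values() if len(langs) == 1)
    return exclusive / len(languages_by_feature)
```

With this sketch, `char_ngrams("ghar", 2)` yields the three overlapping bigrams `gh`, `ha`, `ar`, while `graphemes("ghar")` yields the non-overlapping units `gh`, `a`, `r`. Passing `graphemes` or `lambda w: char_ngrams(w, 2)` as `featurize` to `exclusive_share` gives the kind of per-feature-type exclusivity ratio that the abstract reports for its dataset; values obtained on toy input say nothing about the paper's figures.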

References

  1. Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., Weiss, D.: A fast, compact, accurate model for language identification of codemixed text. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 328–337, Brussels, Belgium. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/D18-1030. https://www.aclweb.org/anthology/D18-1030

  2. Bali, K., Sharma, J., Choudhury, M., Vyas, Y.: "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In: Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 116–126, Doha, Qatar. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/W14-3914. https://www.aclweb.org/anthology/W14-3914

  3. Rijhwani, S., Sequiera, R., Choudhury, M., Bali, K., Maddila, C. S.: Estimating code-switching on twitter with a novel generalized word-level language detection technique. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1971–1982, Vancouver, Canada. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1180. https://www.aclweb.org/anthology/P17-1180

  4. Wikipedia. n-gram—wikipedia, the free encyclopedia (2004). https://en.wikipedia.org/wiki/N-gram. [Online; Accessed 22 July 2004]

  5. Urban Dictionary (2011). https://dictionary.reference.com/browse/grapheme. [Online; Accessed 22 July 2011]

  6. Jaech, A., Mulcaire, G., Hathi, S., Ostendorf, M., Smith, N.: Hierarchical character-word models for language identification, pp. 84–93 (2016). https://doi.org/10.18653/v1/W16-6212

  7. Singh, K., Sen, I., Kumaraguru, P.: A Twitter Corpus Hindi English Code Mixed Dataset for POS Tagging. Workshop on Natural Language Processing for Social Media (Social NLP 2018)

  8. Killer, M., Stüker, S., Schultz, T.: Grapheme based speech recognition. Proc. Eurospeech, 04 (2009)

  9. Baldwin, T., Lui, M.: Language identification: The long and the short of the matter. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 229–237, Los Angeles, California. Association for Computational Linguistics (2010). https://www.aclweb.org/anthology/N10-1027

  10. Murthy, K., Kumar, G.: Language identification from small text samples. J. Quant. Linguis. 13, 57–80 (2006). https://doi.org/10.1080/09296170500500694

  11. Church, K.: Stress assignment in letter to sound rules for speech synthesis. In: 23rd Annual Meeting of the Association for Computational Linguistics, pp. 246–253, Chicago, Illinois, USA. Association for Computational Linguistics (1985). https://doi.org/10.3115/981210.981240. https://www.aclweb.org/anthology/P85-1030

  12. Yang, X., Liang, W.: An n-gram-and-wikipedia joint approach to natural language identification. In: 2010 4th International Universal Communication Symposium, pp. 332–339 (2010). https://doi.org/10.1109/IUCS.2010.5666010

  13. Lui, M., Baldwin, T.: Langid.py: An off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30, Jeju Island, Korea. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/P12-3005

  14. Tromp, E., Pechenizkiy, M.: Graph-based n-gram language identification on short texts. In: Proceedings of Benelearn, pp. 27–34 (2011)

  15. Moodley, A.: Language Identification With Decision Trees: Identification Of Individual Words In The South African Languages. Ph.D. thesis (2016)

  16. Malmasi, S., Zampieri, M.: Arabic dialect identification using iVectors and ASR transcripts. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 178–183, Valencia, Spain. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-1222. https://www.aclweb.org/anthology/W17-1222

  17. Jhamtani, H., Kumar, B., Raychoudhury, V.: Word-level language identification in bi-lingual code-switched texts (2014)

  18. Barman, U., Das, A., Wagner, J., Foster, J.: Code-mixing: A challenge for language identification in the language of social media. In: Proceedings of the First Workshop on Computational Approaches to Code-Switching (2014)

  19. Samih, Y., Maharjan, S., Attia, M., Kallmeyer, L., Solorio, T.: Multilingual code-switching identification via LSTM recurrent neural networks. In: Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 50–59, Austin, Texas. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/W16-5806. https://www.aclweb.org/anthology/W16-5806

  20. Yeong, Y.-L., Tan, T.-P.: Applying grapheme, word, and syllable information for language identification in code switching sentences (2011). https://doi.org/10.1109/IALP.2011.34

  21. Oh, J.-H., Choi, K.-S.: An ensemble of grapheme and phoneme for machine transliteration, vol. 3651, pp. 450–461 (2005). https://doi.org/10.1007/1156221440

  22. Banerjee, S., Choudhury, M., Chakma, K., Naskar, S.K., Das, A., Bandyopadhyay, S., Rosso, P.: Msir@fire: A comprehensive report from 2013 to 2016. SN Comput. Sci. 1(1), 55 (2020). https://doi.org/10.1007/s42979-019-0058-0

  23. Dongen, N.: Analysis and prediction of Dutch-English code-switching in Dutch social media messages (2017)

  24. Bartlett, S., Kondrak, G., Cherry, C.: On the syllabification of phonemes. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 308–316, Boulder, Colorado. Association for Computational Linguistics (2009). https://www.aclweb.org/anthology/N09-1035

  25. Sarma, N., Singh, S.R., Goswami, D.: Word level language identification in Assamese-Bengali-Hindi-English code-mixed social media text. In: 2018 International Conference on Asian Language Processing (IALP), pp. 261–266 (2018)

  26. Rathod, P., Dhore, M.L., Dhore, R.: Hindi and Marathi to English machine transliteration using SVM (2013)

  27. Yenigalla, P., Kumar, A., Tripathi, S., Singh, C., Kar, S., Vepa, J.: Speech emotion recognition using spectrogram and phoneme embedding, pp. 3688–3692 (2018). https://doi.org/10.21437/Interspeech.2018-1811

  28. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). ISSN 1573-0565. https://doi.org/10.1023/A:1022627411411

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Jain, S., Agarwal, K. (2022). Usefulness of Graphemes in Word-Level Language Identification in Code-Mixed Text. In: Sahoo, J.P., Tripathy, A.K., Mohanty, M., Li, KC., Nayak, A.K. (eds) Advances in Distributed Computing and Machine Learning. Lecture Notes in Networks and Systems, vol 302. Springer, Singapore. https://doi.org/10.1007/978-981-16-4807-6_17
