Opinion Classification on Code-mixed Tamil Language

Divya, S.; Sripriya, N.; Evangelin, Daphne; Saai Sindhoora, G.

doi:10.1007/978-3-031-33231-9_10

S. Divya (ORCID: orcid.org/0000-0001-6317-8680), N. Sripriya, Daphne Evangelin & G. Saai Sindhoora

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1802)

Included in the following conference series:

International Conference on Speech and Language Technologies for Low-resource Languages

Abstract

User Sentiment Analysis (SA) is an application of Natural Language Processing (NLP) that analyzes the opinions of individuals. User opinions help the public, business organizations, movie producers, and others make informed decisions and improvements. Some sentiments are interpreted incorrectly because of context errors such as multi-polarity. People in multilingual communities use multiple regional languages to communicate, and social media platforms allow users to express their ideas in mixed languages. User opinions posted as a mixture of two or more languages are known as code-mixed data. Handling such code-mixed data is quite challenging because it contains colloquial vocabulary and its context is difficult to interpret across mixed languages. The proposed system addresses this issue by analyzing the efficiency of several word embedding techniques in generating contextual representations of words. To evaluate the embedding techniques, the generated representations are given as input to a standard machine learning technique for sentiment classification, so that each embedding algorithm is judged by how well the code-mixed data can be classified from its representation. The analysis is carried out on the Dravidian Code-mixed FIRE 2020 Tamil dataset, which contains review comments collected from YouTube. The evaluation shows that the transformer model generates effective representations, with positive labels identified at an F1 score of 0.75. The representations generated by the various embedding algorithms are also fed as input to several classification algorithms, and the accuracy of the resulting models is estimated. The results indicate that IndicBERT generates semantically effective representations and thereby helps achieve greater classification accuracy.
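The evaluation scheme described in the abstract (build a representation, pass it to a standard classifier, report a per-class F1 score) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the chapter's implementation: TF-IDF stands in for the embedding techniques, a nearest-centroid rule stands in for the classifier, and the four training comments are invented toy tokens, not samples from the FIRE 2020 dataset.

```python
import math
from collections import Counter


def tfidf_fit(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf_transform(doc, idf):
    """Map one tokenized document to an L2-normalized sparse TF-IDF vector."""
    tf = Counter(tok for tok in doc if tok in idf)  # unseen tokens are dropped
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def class_centroids(vectors, labels):
    """Average the training vectors of each class."""
    counts = Counter(labels)
    sums = {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, {})
        for tok, v in vec.items():
            acc[tok] = acc.get(tok, 0.0) + v
    return {lab: {tok: v / counts[lab] for tok, v in acc.items()}
            for lab, acc in sums.items()}


def predict(vec, centroids):
    """Return the label whose centroid has the largest dot product with vec."""
    def dot(a, b):
        return sum(v * b.get(tok, 0.0) for tok, v in a.items())
    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(centroids), key=lambda lab: dot(vec, centroids[lab]))


def f1_for_label(y_true, y_pred, label):
    """Per-class F1: harmonic mean of precision and recall for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy comments (hypothetical, for illustration only).
train_docs = [["padam", "super"], ["semma", "trailer"],
              ["mokka", "padam"], ["waste", "trailer"]]
train_labels = ["positive", "positive", "negative", "negative"]

idf = tfidf_fit(train_docs)
centroids = class_centroids([tfidf_transform(d, idf) for d in train_docs],
                            train_labels)
```

Swapping `tfidf_transform` for another representation while keeping the classifier and the metric fixed is what makes the F1 scores of different embeddings comparable.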

Author information

Authors and Affiliations

Department of Information Technology, Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamilnadu, India
S. Divya, N. Sripriya, Daphne Evangelin & G. Saai Sindhoora

Corresponding authors

Correspondence to S. Divya or N. Sripriya.

Editor information

Editors and Affiliations

National Institute of Technology Karnataka, Mangalore, India
Anand Kumar M
National University of Ireland, Galway, Ireland
Bharathi Raja Chakravarthi
Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, India
Bharathi B
National University of Ireland, Galway, Ireland
Colm O’Riordan
Indian Institute of Technology Madras, Chennai, India
Hema Murthy
Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, India
Thenmozhi Durairaj
University of Hildesheim, Hildesheim, Germany
Thomas Mandl

About this paper

Cite this paper

Divya, S., Sripriya, N., Evangelin, D., Saai Sindhoora, G. (2023). Opinion Classification on Code-mixed Tamil Language. In: M, A.K., et al. Speech and Language Technologies for Low-Resource Languages. SPELLL 2022. Communications in Computer and Information Science, vol 1802. Springer, Cham. https://doi.org/10.1007/978-3-031-33231-9_10

DOI: https://doi.org/10.1007/978-3-031-33231-9_10
Published: 29 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33230-2
Online ISBN: 978-3-031-33231-9
eBook Packages: Computer Science, Computer Science (R0)
