ABSTRACT
Detecting cyberbullying in a sentence or utterance is challenging because of syntactic and lexical (meaning) variation. Term Frequency-Inverse Document Frequency (TF-IDF) extracts textual features by producing thematic candidate terms from word-occurrence statistics. However, these candidates are generated without considering the relationships between the constituent terms in the syntax of the language being parsed. This study discusses a TF-IDF feature extraction model that uses an n-Gram approach to produce candidate features based on a specified term relationship. The use of thresholding to form dynamic n-Gram segmentation is also discussed. The dynamic n-Gram model in TF-IDF feature extraction can then be used in cyberbullying classification to handle variations in the syntax and meaning of sentences and utterances in Bahasa Indonesia.
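The abstract does not specify how the threshold or the n-Gram segmentation is defined, so the following is only a minimal Python sketch of one plausible reading, not the authors' method. It assumes whitespace tokenization, a hypothetical document-frequency threshold `min_df` that decides which multi-word n-Grams are kept, and the plain `tf * log(N / df)` weighting. The function names, parameters, and toy sentences are all illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dynamic_ngram_terms(docs, max_n=3, min_df=2):
    """Assumed 'dynamic' segmentation: every unigram is kept, while a longer
    n-gram is kept only if it appears in at least min_df documents."""
    per_doc_terms = []
    df = Counter()  # document frequency of each candidate term
    for doc in docs:
        tokens = doc.lower().split()  # assumption: whitespace tokenization
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        per_doc_terms.append(terms)
        df.update(set(terms))
    vocab = {t for t, count in df.items() if " " not in t or count >= min_df}
    kept = [[t for t in terms if t in vocab] for terms in per_doc_terms]
    return kept, df


def tfidf(docs, max_n=3, min_df=2):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    per_doc_terms, df = dynamic_ngram_terms(docs, max_n, min_df)
    n_docs = len(docs)
    weights = []
    for terms in per_doc_terms:
        tf = Counter(terms)
        total = sum(tf.values()) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Toy Bahasa Indonesia sentences, invented for illustration only.
docs = ["kamu bodoh sekali", "dasar kamu bodoh sekali", "selamat pagi semua"]
weights = tfidf(docs)
```

With these toy sentences, the bigram "kamu bodoh" occurs in two documents and survives the threshold, while "dasar kamu" occurs in only one and is dropped. In practice the threshold, tokenizer, and weighting variant would need to follow the definitions given in the full paper.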
Index Terms
- The Use of Dynamic n-Gram to Enhance TF-IDF Features Extraction for Bahasa Indonesia Cyberbullying Classification