Performance Analysis of Word Embeddings for Cyberbullying Detection

Cyberbullying activities are increasing day by day with the growth of social media platforms such as Facebook, Twitter, and Instagram. Bullies take advantage of these large, connected online platforms, which makes cyberbullying detection a challenging task in Natural Language Processing (NLP). In this paper, we compare the performance of various word embedding methods, from basic word embeddings to recent advanced language models such as RoBERTa, XLNet, and ALBERT, for cyberbullying detection. We used LightGBM and logistic regression classifiers to classify tweets as bullying or non-bullying. Among all the models, RoBERTa outperformed the state-of-the-art models.


Introduction
The usage of social media is increasing day by day with the proliferation of digital devices such as smartphones, tablets, and other electronic gadgets. The most common social media platforms, such as Facebook, Twitter, Instagram, and WhatsApp, are widely used to share messages and express opinions. According to the Statista* website, the number of social media users increased from 10 million to 260 million. These platforms give users the freedom to express their ideas and feelings in different forms, such as text messages, images, audio, and graphics. However, some people use these networks for anti-social activities such as online trolling and cyberbullying.
In recent years, cyberbullying has become a major challenge in the social media era. Cyberbullying is a form of online hate speech in which hateful text messages, images, audio, and videos are posted intentionally [1]. Its effects are more dangerous than those of traditional bullying because it reaches a very large number of people in a short time. The major victims of cyberbullying are school children and adolescents, who are psychologically affected and sometimes driven to suicide attempts [2]. To protect victims from these activities, researchers have developed machine learning algorithms that can detect cyberbullying messages containing humiliating content. In natural language processing tasks, machine learning algorithms depend on word embeddings: each text message must be converted into numeric form by a word embedding technique. In this paper, we study the performance of word embedding techniques for cyberbullying detection with two classifiers, logistic regression and LightGBM. The major contributions of this paper are as follows:
➢ We conducted experiments on the most advanced word embedding techniques in natural language processing.
➢ We compared the performance of the word embeddings in three different sampling modes.
➢ We demonstrated the misclassification rate of the word embedding techniques.

Related work
In this section, we discuss previous work on cyberbullying detection using various word embedding techniques. M. Karan and J. Snajder [3] proposed a pre-emptive detection system for toxic language, using a support vector machine as the classifier and tf-idf for word embeddings. Xu et al. [4] used bag-of-words embeddings and studied the different roles of participants in cyberbullying activities, such as victims, bullies, and accusers; they achieved an f1-score of 0.77 with a support vector machine classifier. Nahar et al. [5] implemented a graph model to estimate victim scores on a Twitter dataset, using tf-idf to extract features from the text. Pang et al. [6] employed three machine learning algorithms (SVM, Naive Bayes, and Maximum Entropy) with bag-of-words features for sentiment classification on the IMDb dataset, classifying movie reviews as positive or negative. Zhao and Mao [7] proposed a method called Semantic-enhanced Marginalized Stacked Denoising Autoencoder (smSDA) for cyberbullying detection; smSDA is capable of reducing noise and identifying the best features from a bag-of-words representation in an effective and efficient way. Yen et al. [8] employed a supervised learning approach to identify online harassment on Web 2.0. They combined different document features, such as contextual, sentiment, and content features, and trained a support vector machine classifier; the proposed technique outperforms traditional tf-idf. Vikas et al. [9] added two features, pronoun capturing and skip-grams, to the features extracted by tf-idf and n-grams; this method increased the accuracy of the detection system up to 86% with a logistic regression classifier. Kasture [10] used Linguistic Inquiry and Word Count (LIWC) for feature extraction, trained different machine learning classifiers, and found random forest to be the best on the Twitter dataset, with a recall of 96.3% and a precision of 98.4%.
Ghasem et al. [11] chose the 500 most informative words as the feature vector, trained a support vector machine (SVM), and achieved an f1-score of 95%. Jiale Wu et al. [12] proposed the ITFT (improved tf-idf based fastText) model for bullying detection, adding keywords and position weights to the traditional tf-idf word embedding method; the proposed model is able to filter noise and improves accuracy. B. Sri Nandini et al. [13] proposed a model using fuzzy logic and genetic algorithms to classify bullying actions such as terrorism, racism, and harassment; they used the genetic algorithm to optimize the features and a fuzzy rule set to retrieve the data for the classification task. Chakrabarty N. [14] used bag-of-words and tf-idf word embeddings for toxic comment classification on a Wikipedia dataset. Salminen et al. [15] proposed an online hate content detection approach for various social media networks, collecting 197,566 comments from YouTube, Reddit, Wikipedia, and Twitter.

Materials and Methods
In this section, we discuss the data collection, the proposed methodology for cyberbullying detection, and the evaluation metrics. Fig 1 shows the key components of the proposed methodology. We used the dataset of Waseem and Hovy [16], who collected tweets from the popular social networking site Twitter over a period of two months with the help of the Twitter API. They retrieved 136,052 tweets in total and labelled 13,762 of them: 9,533 tweets that are neither sexist nor racist are labelled 0, and 4,229 tweets that contain sexism or racism are labelled 1. We treat tweets labelled '0' as non-bullying comments and tweets labelled '1' as bullying comments. Table 1 shows example labelling of bullying and non-bullying tweets. The raw input data is first pre-processed to remove inconsistencies, and then different feature extraction techniques are used to extract robust features. Finally, the classification methods are used to compare the effectiveness of each feature set in terms of the evaluation metrics.

Data pre-processing
Pre-processing is a very important step in any natural language processing (NLP) task. The data collected from social media contains noise such as stop words, URLs, and unwanted numbers. Table 2 shows examples of noise in social media tweets. To remove this noise from the tweets, we applied standard pre-processing steps such as tokenization and lemmatization [17] to obtain the original root words, and stemming [18] to reduce inflected words to their root form.
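A minimal sketch of such a cleaning pipeline is shown below. The regular expressions, the small stop-word subset, and the function name are our own illustration (a full pipeline would use an NLP library's tokenizer, lemmatizer, and stemmer, as cited above):

```python
import re

# illustrative subset only; real pipelines use a full stop-word list
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of"}

def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip URLs, mentions, hashtags and numbers, then tokenize."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)        # remove mentions / hashtags
    tweet = re.sub(r"[^a-z\s]", " ", tweet)       # remove digits and punctuation
    return [t for t in tweet.split() if t not in STOPWORDS]

print(preprocess("Check this out!! http://t.co/abc @user 123 The weather is nice"))
# → ['check', 'this', 'out', 'weather', 'nice']
```

Stemming or lemmatization would then be applied to each surviving token to map inflected forms to a common root.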

Feature Extraction
Feature extraction methods play an important role in any natural language processing task. Machine learning classifiers accept only numerical representations as input, so every text message needs to be converted into numeric form by a feature extraction technique. In the proposed methodology, we used feature extraction methods such as Bag of Words (BoW), TF-IDF, word2vec, GloVe, and FastText, as well as language models such as ALBERT [19], ELECTRA [20], GPT-2 [21], XLNet [22], and RoBERTa [23].
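For the two count-based baselines, scikit-learn (part of the implementation stack used in this paper) provides ready-made vectorizers. The toy tweets below are our own illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = ["you are awful", "have a nice day", "nobody likes you"]

bow = CountVectorizer().fit_transform(tweets)    # Bag of Words: raw term counts
tfidf = TfidfVectorizer().fit_transform(tweets)  # TF-IDF: counts reweighted by rarity

# one row per tweet, one column per vocabulary term
print(bow.shape, tfidf.shape)
```

Both produce a sparse document-term matrix that can be fed directly to a classifier such as logistic regression or LightGBM.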

Feature Extraction Techniques
In this section, we outline the feature extraction techniques compared in our work.

ALBERT
ALBERT is a language model whose name stands for A Lite BERT. It is a self-supervised language model of language representations. ALBERT incorporates two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, in order to lift the major obstacles to scaling pre-trained NLP models. The number of parameters is reduced by 89% compared to the BERT [24] model, while the model still achieves respectable performance on benchmark datasets. The major advantage of ALBERT is its parameter efficiency: much larger batches fit into memory, allowing quick inference.

GPT-2
GPT-2 is a pre-trained language model useful for various natural language processing tasks such as text generation, question answering, and language translation. It is a transformer-based model trained on a dataset of eight million web documents, uses a huge vocabulary during training, and can predict the next word in a sentence. The GPT-2 architecture is based on the Transformer, using its decoder blocks to model dependencies between tokens. It was evaluated through zero-shot learning.

XL-NET
XLNet is a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. It integrates two methods, autoregressive modelling and autoencoding. Given a text sequence k = (k_1, …, k_T), autoregressive models factorize the likelihood as p(k) = ∏_{t=1}^{T} p(k_t | k_{<t}) for forward prediction of the words, or as p(k) = ∏_{t=T}^{1} p(k_t | k_{>t}) for backward prediction. XLNet outperforms BERT on 20 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

RoBERTa
RoBERTa is an enhanced version of the BERT model with an improved training methodology. It introduces a dynamic masking pattern, whereas BERT follows a static masking pattern, and it removes the Next Sentence Prediction (NSP) task from BERT's pre-training. It is pre-trained on 160GB of text from various sources such as BookCorpus, English Wikipedia, and OpenWebText, whereas BERT was trained on BookCorpus and English Wikipedia only. This model outperforms both XLNet and BERT on the GLUE benchmark.
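The language models above emit one vector per token, while the classifiers expect a single fixed-size feature vector per tweet, so the token vectors are typically pooled. A minimal mean-pooling sketch in NumPy (the helper name, toy shapes, and values are our own illustration, not from the paper):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions (mask value 0)."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = mask.sum(axis=1)                         # (batch, 1)
    return summed / counts

# toy batch: 2 tweets, 4 token positions, 3-dimensional embeddings
emb = np.arange(24, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])  # first tweet has 2 real tokens
print(mean_pool(emb, mask).shape)  # → (2, 3)
```

The resulting matrix, one row per tweet, plays the same role as the BoW or TF-IDF matrix in the classification stage.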

Classifiers
Classification is a supervised learning approach in which the model predicts a discrete output variable from the input variables. A machine learning classifier plays a vital role after the word embeddings are generated. Classifiers are used in many applications such as speech recognition, handwriting recognition, face detection, and document classification. A range of machine learning classifiers, such as Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and k-nearest neighbour (kNN), are used for these tasks. In our experiments, we used a simple linear baseline classifier, Logistic Regression (LR) [25], and a non-linear classifier, LightGBM [26], for the classification task.

Logistic Regression
Logistic regression is a supervised machine learning algorithm intended for binary classification tasks. It uses a logistic function, the sigmoid, whose output lies in the range 0 to 1; Eq. 1 gives the sigmoid equation. The sigmoid function generates a probability score from the input features. If the probability score is ≥ 0.5, the tweet is classified as bullying; otherwise, it is classified as non-bullying.
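The sigmoid referred to in Eq. 1 is the standard logistic function σ(z) = 1 / (1 + e^(−z)). A minimal sketch of the function and the 0.5 decision rule described above (function names are our own):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z: float) -> int:
    """Decision rule from the text: probability >= 0.5 -> bullying (class 1)."""
    return 1 if sigmoid(z) >= 0.5 else 0

print(sigmoid(0.0))  # → 0.5, the decision boundary
```

In practice, z is the weighted sum of the feature vector learned by the classifier.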

Light GBM
LightGBM (Light Gradient Boosting Machine) addresses many shortcomings of decision tree and gradient boosting decision tree algorithms, and is therefore very useful for drawing insights from cyberbullying data. The LightGBM classifier follows leaf-wise tree growth, which is a big advantage. Earlier boosting algorithms were unable to handle large datasets, whereas LightGBM handles huge amounts of data and offers faster training, good accuracy, parallel learning, and low memory usage. Fig 3 shows the leaf-wise growth of the LightGBM classifier; each node represents a decision in the classification task.

Evaluation Metrics
We considered precision, recall, and f1-score as evaluation metrics.
Precision: Precision is the ratio of tweets correctly predicted as bullying to the total number of tweets predicted as bullying. Eq. 2 shows the precision formula, where C is the number of bullying tweets correctly predicted as bullying and B is the number of non-bullying tweets predicted as bullying.
Recall: Recall is the ratio of tweets correctly predicted as bullying to the total number of actual bullying tweets. Eq. 3 shows the recall formula; here, NB is the number of bullying tweets incorrectly predicted as non-bullying.
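In standard confusion-matrix terms, C corresponds to true positives, B to false positives, and NB to false negatives, giving precision = TP/(TP+FP), recall = TP/(TP+FN), and f1 as their harmonic mean. A small sketch with illustrative counts (not results from the paper):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """tp: bullying tweets correctly flagged; fp: non-bullying tweets flagged
    as bullying; fn: bullying tweets the classifier missed."""
    precision = tp / (tp + fp)                          # Eq. 2
    recall = tp / (tp + fn)                             # Eq. 3
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(80, 20, 10))  # ≈ (0.80, 0.89, 0.84)
```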

Results and Discussion
The proposed approach is implemented in Python using several packages, namely Pandas, NumPy, Matplotlib, Scikit-Learn, LightGBM, and TensorFlow, on a Linux operating system. We ran our code on an Intel i7 8th Gen 12-core CPU and an Nvidia Max-Q 1070 with 8GB of memory. We conducted several experiments to evaluate the performance of the various word embedding methods on the cyberbullying detection data. We performed 5-fold validation on the dataset and report the macro average of the precision, recall, and f1-scores of class 0 and class 1. Fig 4 shows the f1-scores on the Waseem dataset in the no-sampling mode. We observed that most of the language models scored higher f1-scores than traditional word embeddings such as BoW and TF-IDF and semantic word embeddings such as word2vec, GloVe, and FastText. Oversampling and undersampling are two techniques used on imbalanced datasets to avoid class bias. We conducted experiments with both; Table 3 shows the precision, recall, and f1-score on the Waseem dataset in the oversampling and undersampling modes. We noticed that the language models perform better than the traditional word embedding techniques in terms of precision, recall, and f1-score.

Error Analysis: The misclassification rate (MCR) is the ratio of false positives (FP) and false negatives (FN) to the total number of records; Eq. 5 shows the misclassification rate. In this experiment we applied 5-fold validation, with 2,753 tweets in each fold. Table 4 shows the total number of misclassifications (both false positives and false negatives) in each fold as a percentage. The fewest misclassifications were found with language models such as RoBERTa and OpenAI GPT-2. We also noticed that LightGBM classifies better than the logistic regression classifier with all the word embedding methods. Fig 5 show
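The misclassification rate in Eq. 5 can be sketched directly. The FP and FN counts below are purely illustrative; only the fold size of 2,753 tweets comes from the paper:

```python
def misclassification_rate(fp: int, fn: int, total: int) -> float:
    """Eq. 5: share of tweets in a fold that the classifier gets wrong."""
    return (fp + fn) / total

# each fold holds 2,753 tweets; the error counts here are made up
print(misclassification_rate(fp=250, fn=200, total=2753))  # ≈ 0.163
```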

Conclusion and Future works
With the growth of social media platforms, bullying activities are increasing significantly. This paper proposed a methodology for cyberbullying detection that experiments with various word embedding techniques and two classifiers. We conducted experiments on the Twitter dataset in three sampling modes: 1) no sampling, 2) undersampling, and 3) oversampling. The f1-scores of the language models in the sampling modes are close to 80%, whereas the basic word embedding methods reach around 75%. The misclassification rate of the basic word embeddings is 21%, whereas the misclassification rate of the language models drops to 16%. We observed that the drawbacks of basic word embedding techniques can be overcome with these language models for many downstream tasks in natural language processing. As future work, language models should be considered for building robust features for cyberbullying detection systems.