NIT_Agartala_NLP_Team at SemEval-2019 Task 6: An Ensemble Approach to Identifying and Categorizing Offensive Language in Twitter Social Media Corpora

The paper describes the systems submitted to OffensEval (SemEval 2019, Task 6) on ‘Identifying and Categorizing Offensive Language in Social Media’ by the ‘NIT_Agartala_NLP_Team’. The task organizers provided an annotated dataset of 13,240 English tweets for training the individual models, with the best results obtained using an ensemble model composed of six different classifiers. The ensemble model produced macro-averaged F1-scores of 0.7434, 0.7078 and 0.4853 on Subtasks A, B, and C, respectively. The paper highlights the overall low predictive power of various linguistic and surface-level count features, as well as the limitations of a traditional machine learning approach compared to a deep learning counterpart.


Introduction
Offensive language has been the scourge of the internet since the rise of social media. Social media provides a platform for everyone and anyone to voice their opinion. This has empowered people to make their voices heard and to speak out on global issues. The downside, however, is the misuse of such platforms to attack an individual or a minority group, and to spread hateful opinions. Paired with the perceived anonymity the internet provides, this has led to a massive upswing in the use of social media for cyberbullying and hate speech, with technology giants coming under increased pressure to address the issue.
Most of what we may be interested in detecting can be broadly labelled as hate speech, cyberbullying or abusive use of swearing. The union of these three subsets forms what can be identified as 'Offensive Language on Social Media'. However, what we consider offensive is often a grey area, as is evident from the low inter-annotator agreement rates when labelling data for offensive language (Waseem et al., 2017b).
Detecting offensive language has proven to be difficult, due to the broad spectrum of ways in which language can be used to convey an insult. The abuse can be implicit, drawing on sarcasm and humour rather than offensive terms, as well as explicit, making extensive use of traditional offensive terms and profanity. It does not help that the reverse also occurs, with profanity often being used to signal informality in speech or for emphasis. These are also the reasons why lexical detection methods have been unfruitful in classifying text as offensive or non-offensive.
The OffensEval 2019 shared task (Zampieri et al., 2019b) is one of several endeavours to further the state-of-the-art in addressing the offensive language problem. The paper describes the insights obtained when tackling the shared task using an ensemble of traditional machine learning classification models and a Long Short-Term Memory (LSTM) deep learning model. Section 2 first discusses other related approaches to detecting hate speech and offensive language. Then Section 3 describes the dataset and Section 4 the ideas and methodology behind our approach. Section 5 reports the results obtained, while Section 6 discusses those results with a particular eye towards the errors committed by the models. Finally, Section 7 sums up the key results and points to ways the work can be extended.

Related Work
Most datasets for offensive language detection represent multiclass classification problems (Davidson et al., 2017; Founta et al., 2018; Waseem and Hovy, 2016), with the annotations often obtained via crowd-sourcing portals, with varying degrees of success. Waseem et al. (2017b) state that annotation via crowd-sourcing tends to work best when the abuse is explicit (Waseem and Hovy, 2016), but is considerably less reliable when considering implicit abuse (Dadvar et al., 2013; Justo et al., 2014; Dinakar et al., 2011). They propose a typology that can synthesise different offensive language detection subtasks. Zampieri et al. (2019a) expand on these ideas and propose a hierarchical three-level annotation model, which is used in the OffensEval 2019 shared task. Another issue is whether the datasets should be balanced or not (Waseem and Hovy, 2016), since there are far fewer offensive comments than benign comments in randomly sampled real-life data (Schmidt and Wiegand, 2017).
Classical machine learning algorithms have been wielded with some success in automated offensive language detection, mainly Logistic Regression (Davidson et al., 2017; Waseem and Hovy, 2016; Burnap and Williams, 2015) and Support Vector Machines (Xu et al., 2012; Dadvar et al., 2013). Recently, however, deep learning models have outperformed their traditional machine learning counterparts, with both Recurrent Neural Networks (RNNs), such as LSTMs (Pitsilis et al., 2018) and Bi-LSTMs (Gao and Huang, 2017), and Convolutional Neural Networks (CNNs) having been used. Gambäck and Sikdar (2017) utilised a CNN model with word2vec embeddings to obtain a higher F1-score and precision than a previous logistic regression model (Waseem and Hovy, 2016), while Zhang et al. (2018) combined a CNN model with a Gated Recurrent Unit (GRU) layer. Malmasi and Zampieri (2018) used an ensemble system much like ours to separate profanity from hate speech, but reported no significant improvement over a single-classifier system.
In terms of features, simple bag-of-words models have proven to be highly predictive (Waseem and Hovy, 2016; Davidson et al., 2017; Nobata et al., 2016; Burnap and Williams, 2015). Mehdad and Tetreault (2016) endorsed the use of character n-grams over token n-grams, citing their robustness to the spelling errors that are frequent in online texts. Nobata et al. (2016) and Chen et al. (2012) showed small improvements from including features capturing the frequency of different entities such as URLs and mentions, with other features such as part-of-speech (POS) tags (Xu et al., 2012; Davidson et al., 2017) and sentiment scores (Van Hee et al., 2015; Davidson et al., 2017) also having been used (Schmidt and Wiegand, 2017). More recently, meta-information about the users has been suggested as a feature source, but no consistent correlation between user information and the tendency for offensive behaviour online has been shown: Waseem and Hovy (2016) claimed that gender information led to improvements in classifier performance, but Unsvåg and Gambäck (2018) challenged this and reported user-network data to be more important instead. Wulczyn et al. (2017) concluded that anonymity increases the likelihood of a comment being an attack.

Data
The training dataset used for the shared task, the Offensive Language Identification Dataset (OLID; Zampieri et al., 2019a), contains 13,240 tweets, each annotated on the basis of a hierarchical three-level model. An additional 860 tweets were used as the test set for the shared task. The three levels/subtasks are as follows: A, whether the tweet is offensive (OFF) or non-offensive (NOT); B, whether the tweet is targeted (TIN) or untargeted (UNT); and C, whether the target is an individual (IND), a group (GRP) or other (OTH; e.g., an issue or an organisation). The dataset does not have an equal number of offensive and non-offensive tweets. Only about one-third of the tweets are marked offensive, to partially account for the fact that most online discourse is non-offensive. The corpus exhibits a larger number of male (∼3,000) than female pronouns (∼2,500), but is reasonably balanced.
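The three-level scheme described above is hierarchical: each level is only annotated when its parent label applies. A minimal sketch of this constraint, as an illustrative encoding rather than anything shipped with the dataset:

```python
# Hierarchical OLID annotation scheme: level B is only annotated
# when A == OFF, and level C only when B == TIN.
OLID_SCHEME = {
    "A": {"OFF": "offensive", "NOT": "not offensive"},
    "B": {"TIN": "targeted insult/threat", "UNT": "untargeted"},
    "C": {"IND": "individual", "GRP": "group", "OTH": "other"},
}

def valid_path(a, b=None, c=None):
    """Check that a (possibly partial) label path follows the hierarchy."""
    if a not in OLID_SCHEME["A"]:
        return False
    if b is not None and (a != "OFF" or b not in OLID_SCHEME["B"]):
        return False
    if c is not None and (b != "TIN" or c not in OLID_SCHEME["C"]):
        return False
    return True
```

For example, `valid_path("OFF", "TIN", "IND")` holds, while `valid_path("NOT", "TIN")` does not, since untargeted/targeted is undefined for non-offensive tweets.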
Noticeably, the annotators were very conservative in classifying tweets as non-offensive. It is unclear whether this was due to a stricter definition provided by the task organisers. For example, it is not immediately clear why tweets such as "@USER Ouch!" (23159), "@USER He is a beast" (50771), and "@USER That shit weird! Lol" (31404) were annotated as offensive.
The annotators furthermore seemed to disagree over the cathartic and emphatic use of swearing, as in "@USER Oh my Carmen. He is SO FRICKING CUTE" (39021), "@USER GIVE ME A FUCKING MIC" (60566), and "@USER why are you so fucking good." (80097). These tweets do not really seem to be offensive beyond containing varying degrees of profanity. However, this treatment is inconsistent, with some other such tweets annotated, as expected, as non-offensive: "@USER No fucking way he said this!" (47427) and "@USER IT'S FUCKING TIME!!" (59465), although most tweets containing profanity were placed in the offensive class.
Also notable is the large amount of political criticism among the tweets in the corpus. Whether left wing or right wing, extreme cases seem to be correctly annotated as offensive, while healthy criticism and political discourse is correctly annotated as non-offensive. The dataset also exhibits a dearth of racist tweets.

Methodology
Initially, a suite of features was composed based on those used successfully in previous work such as Waseem and Hovy (2016), Davidson et al. (2017), Nobata et al. (2016) and Burnap and Williams (2015): surface-level token unigrams, bigrams, and trigrams, weighted by TF-IDF; POS tags obtained through the CMU tagger (Gimpel et al., 2011), which was specifically developed for the language used on Twitter; a sentiment score assigned using a pre-trained model included in TextBlob; and count features for URLs, mentions, hashtags, punctuation marks, words, syllables, and sentences.
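A minimal sketch of the surface-level part of this feature suite, using scikit-learn for the TF-IDF-weighted token n-grams and simple regular expressions for a few of the count features (the CMU tagger and TextBlob sentiment are omitted; the exact regexes and the example tweets are illustrative, not the system's actual code):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF-weighted token unigrams, bigrams and trigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True)

def count_features(tweet):
    """Illustrative surface-level counts (URLs, mentions, hashtags, words)."""
    return {
        "urls": len(re.findall(r"https?://\S+|\bURL\b", tweet)),
        "mentions": tweet.count("@USER"),
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "words": len(tweet.split()),
    }

tweets = ["@USER He is a beast", "@USER That shit weird! Lol #weird"]
X = vectorizer.fit_transform(tweets)          # sparse n-gram matrix
counts = [count_features(t) for t in tweets]  # per-tweet count dicts
```

In practice the count features would be appended to the n-gram matrix (e.g. via `scipy.sparse.hstack`) before being fed to a classifier.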
Scikit-learn (Pedregosa et al., 2011) was used as the primary library for modelling and training. L1-regularised Logistic Regression and a Linear Support Vector Classifier initially stood out as the best models. Further experimentation showed that while those two models exhibited the highest accuracy, their recall of offensive tweets in subtask A and of untargeted insults in subtask B was lower than that of other classifiers provided in the Scikit-learn library, such as the Passive-Aggressive (PA) classifier (Crammer et al., 2006) and stochastic gradient descent (SGD).
Further exploration showed that the classifiers disagreed on certain tweets. This led to the idea of a vote-based ensemble model built on the following five classifiers combined by plurality voting (Kuncheva, 2004): L1-regularised Logistic Regression, L2-regularised Logistic Regression, Linear SVC, SGD, and PA. The ensemble model exhibited the best results in subtasks A and B. In subtask C, the multi-class classification problem and a severe reduction in the size of the training set led to much lower macro-averaged F1-scores, with the ensemble model performing badly. A deep learning approach, based on an LSTM architecture (Hochreiter and Schmidhuber, 1997), was therefore adopted specifically for this subtask. The model used 200-dimensional GloVe embeddings pre-trained on 2 billion tweets (Pennington et al., 2014), with the embedding weights frozen. The embedding layer was followed by a 1D convolution layer with 64 output filters and a Rectified Linear Unit (ReLU) activation function. The output of this layer was down-sampled using a max pooling layer of size 4, then fed into an LSTM layer of 200 units and subsequently a dense layer of 3 units with a softmax activation function. The model used the Adam optimiser and the categorical cross-entropy loss function. Due to the small amount of data, overfitting was quite common after as few as 3 epochs; the model therefore benefited from larger dropout values (up to 0.5). This model exhibited a better result than the ensemble model in subtask C, although only by a small margin.
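The plurality-voting step itself is straightforward; a minimal stdlib sketch, with the per-classifier predictions shown as hypothetical values:

```python
from collections import Counter

def plurality_vote(predictions):
    """Combine per-classifier labels for one tweet by plurality vote.

    predictions: one label per base classifier (here: L1/L2 Logistic
    Regression, Linear SVC, SGD, and PA). With five voters a two-class
    vote cannot tie; in general, Counter breaks ties by insertion order.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-classifier outputs for a single subtask A tweet.
votes = ["OFF", "NOT", "OFF", "OFF", "NOT"]
label = plurality_vote(votes)  # → "OFF"
```

Scikit-learn's `VotingClassifier` with `voting="hard"` implements the same scheme over fitted estimators.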

Results
The experiments were run in three stages. First, before choosing the models, a mini ablation study was carried out on how various features affected the accuracy and F1-score metrics of different models. The selected models were then optimised on the training set, before being evaluated on the test dataset.

Feature Engineering
The initial ablation study was carried out on a small set of models: the Linear SVC and L1/L2-penalised Logistic Regression. The results are presented in Table 1.
The ablation analysis revealed that surface-level token/character n-grams are by far the most predictive of the features. An interesting observation is the significantly improved recall of offensive tweets when character n-grams are included.
However, the best F1-score/accuracy was never achieved with the character n-gram model, and hence only token n-grams were included in the final feature list. Other features provided only small improvements, in accordance with previous observations (Wiegand et al., 2018). The addition of POS information seemed to cause a reduction in performance, so this feature was dropped, except in subtask C, where a small positive effect could be observed. Furthermore, artificially balancing the classes by modifying class weights helped alleviate the low-recall issue to some extent.
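In scikit-learn, class-weight balancing amounts to a single parameter; a minimal sketch on toy data (the example tweets and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy imbalanced sample: roughly one-third offensive, as in OLID.
tweets = ["you are awful", "have a nice day", "lovely weather",
          "what an idiot", "great game today", "see you soon"]
labels = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]

X = TfidfVectorizer().fit_transform(tweets)

# class_weight="balanced" reweights each class inversely to its
# frequency, boosting recall of the minority (OFF) class.
clf = LogisticRegression(penalty="l1", solver="liblinear",
                         class_weight="balanced").fit(X, labels)
preds = clf.predict(X)
```

Custom weights (e.g. `class_weight={"OFF": 2.0, "NOT": 1.0}`) can be used instead when the inverse-frequency heuristic over- or under-corrects.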

Training Set
A 10-fold cross-validation was performed on each model used in the ensemble, with the metrics obtained in each fold averaged to estimate each model's performance on the dataset. These initial results were obtained only for subtask A, to decide which models would be part of the ensemble. Most models used in the ensemble exhibited similar accuracy, but varied in their recall of offensive tweets. It was also observed that models with higher recall of offensive tweets exhibited correspondingly lower recall of non-offensive tweets. These observations are graphically represented in Figure 1. Small improvements in F1-score and accuracy were achieved using the ensemble model (F1-score: 0.7338; accuracy: 0.7720).
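A minimal sketch of this evaluation protocol for one base model, using synthetic stand-in features rather than the TF-IDF features of the actual training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in binary data; the system uses TF-IDF features of the
# 13,240 OLID training tweets instead.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 10-fold cross-validation of one base model; the macro-averaged F1
# over the folds estimates its performance on the training set.
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="f1_macro")
mean_f1 = scores.mean()
```

For classifiers, `cross_val_score` stratifies the folds by default, preserving the OFF/NOT imbalance in each fold.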

Test Set
After the models were trained, their performance was measured on a separate set of 860 unseen tweets. All F1-scores provided by the OffensEval organising team were macro-averaged. Baselines for each metric were also provided.

Conclusion
The idea of a hierarchical classification of offensive language is a step in the right direction in reducing the ambiguity between various similar subtasks. It remains to be seen, however, how effective this method will be in synthesising more specific subsets of offensive language. For example, cyberbullying instances may yield the labels OFF, TIN and IND at the three levels of classification, but we do not know how effectively models developed for the OffensEval subtasks perform on cyberbullying datasets. Some issues that have plagued offensive language detection, such as the ambiguity and overlap between various subtasks, could effectively be solved if the idea of hierarchical classification achieves what it sets out to do. Consistent with previous work, we find it difficult to classify non-offensive tweets containing profanity and offensive tweets lacking profanity. A similar issue persists with politically motivated tweets, where valid criticism is incorrectly classified as offensive and, similarly, political hate incorrectly classified as non-offensive.
On the topic of selecting a classification model, it is noteworthy that even a simple and crude deep learning model such as the one used here can obtain better results than a more polished ensemble model. Except for surface-level n-grams, most features are not as predictive as we would like them to be.
The data analysis showed that even though the annotators of the OLID dataset were experienced with the platform, quite a few cases of apparently erroneous annotation remain, just as has been noted for other datasets (Waseem and Hovy, 2016; Davidson et al., 2017; Nobata et al., 2016) for which amateur annotators were found unreliable.
Offensive language detection has proven to be a more layered issue than initially expected, but with various developments in research the task seems surmountable. Future work must focus on building upon previous endeavours, to reduce the redundancy between subtasks and publications. The OffensEval shared task is a significant step towards this goal, and we look forward to seeing how future research will be affected by the work that has been done here.

Table 1 :
Ablation analysis on subtask A, with the training set.