ES-JUST at SemEval-2021 Task 7: Detecting and Rating Humor and Offensive Text Using Deep Learning

This paper presents the work of team ES-JUST at SemEval-2021 Task 7 on detecting and rating humorous and offensive text using deep learning. The team evaluated several approaches (i.e., BERT, RoBERTa, XLM-RoBERTa, and BERT embeddings + Bi-LSTM) across four sub-tasks. The first sub-task asks whether a text is humorous or not. The second predicts the degree of humor in the text if it is humorous. The third predicts whether a humorous text is controversial. The last predicts the degree of offensiveness of the text. The RoBERTa pre-trained model outperformed the other approaches and scored highest on all sub-tasks. On the evaluation-phase leaderboard we ranked 14th, 15th, 20th, and 5th, with results of 0.9564 F-score, 0.5709 RMSE, 0.4888 F-score, and 0.4467 RMSE for the first, second, third, and fourth sub-tasks, respectively.


Introduction
Dealing with natural languages has long been a challenge and an interesting topic for researchers (Chowdhury, 2003). Understanding and generating language is part of natural language processing (NLP) (Nadkarni et al., 2011). Recently, language models have become able to handle sequence-to-sequence problems such as question answering, translation, and multiple choice. In addition, they are able to capture complex relationships, semantic meaning, word-sense disambiguation, and aspect-based word meaning (Deng and Liu, 2018). Humorous text is something we encounter every day. It is commonly used to express an opinion on issues (societal, political, sporting, and economic), whether in posts on social media platforms or as advertising for a specific product (Kramer, 2011). In addition, humor makes a text harder to interpret and understand, because of the manipulation of word meanings and the way the text is written to convey a sense of humor. On the other hand, perception of humor in text varies according to a person's age, gender, culture, social status, and mentality (Goel and Dolan, 2007). For this task, a dataset was collected in English representing humor and jokes in text. We participated in this task to build an approach capable of distinguishing whether a text is humorous or not. Specifically, we used pre-trained models that capture contextual meaning, such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and XLM-RoBERTa (Conneau et al., 2019). In addition, as a baseline, we trained on the dataset using a BERT embedding layer, extracting the weights and feeding them into Bi-LSTM and dense layers.
In all sub-tasks, the RoBERTa model outperformed the other approaches. We were ranked according to the official results among 36 participating teams. In the first sub-task, we achieved 26th place with a 0.9564 F-score. In the second sub-task, we ranked 26th with a score of 0.5709 RMSE. In the third sub-task, we placed 25th with a 0.4888 F-score (Sokolova et al., 2006). In the last sub-task, we took 9th place with a score of 0.4467 RMSE (Chai and Draxler, 2014). The remainder of this paper is organized as follows: Section 2 gives background; Section 3 describes the dataset and the system; Section 4 explains the experiments and analyzes the results; Section 5 presents conclusions and future work.

Background
Hossain et al. (2019) developed a new humor corpus consisting of 15,095 English news headlines, in which a few words of each headline were substituted to make it funny. Li et al. (2020) used attention-based bi-directional long short-term memory (AttBiLSTM) to classify slang language into negative or positive humor. Annamoradnejad (2020) used a BERT embedding layer with several parallel hidden layers to categorize 200K sentences as humorous (positive) or not (negative). Fan et al. (2020) used two kinds of attention mechanism (internal and external) to capture the sense of humor in words. Most previous work predicts either humor polarity (positive, negative) or a humor rating (a range of values) in text. This research, in contrast, addresses both humor and offense score detection.

Task Description
We worked with the four sub-tasks provided by SemEval-2021 1 : Task 1 is divided into sub-tasks (a, b, and c), each related to the others, while Task 2 has one sub-task (a). Sub-task 1a predicts whether a text is humorous (a binary classification problem, 1 or 0). Sub-task 1b, if the text is humorous, predicts how humorous it is on a scale from 0 to 5 (a regression problem). Sub-task 1c, if the text is humorous, predicts whether the humor is controversial (a binary classification problem, 1 or 0). Sub-task 2a predicts the offense rating.
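The conditional structure of the four sub-tasks can be sketched as follows. This is an illustrative routing function, not part of the official task: the model arguments are hypothetical callables standing in for trained classifiers and regressors.

```python
def predict_all(text, is_humor_model, humor_rating_model,
                controversy_model, offense_model):
    """Route a text through the four sub-tasks.

    Sub-tasks 1b and 1c only apply when sub-task 1a predicts humor.
    Each model argument is a hypothetical callable returning a label
    or a score for the given text.
    """
    result = {"is_humor": is_humor_model(text)}             # sub-task 1a: 0 or 1
    if result["is_humor"] == 1:
        result["humor_rating"] = humor_rating_model(text)   # sub-task 1b: 0..5
        result["controversy"] = controversy_model(text)     # sub-task 1c: 0 or 1
    result["offense_rating"] = offense_model(text)          # sub-task 2a: 0..5
    return result

# Toy stand-ins for trained models:
out = predict_all("why did the chicken cross the road?",
                  is_humor_model=lambda t: 1,
                  humor_rating_model=lambda t: 2.3,
                  controversy_model=lambda t: 0,
                  offense_model=lambda t: 0.1)
```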

Data-set
The dataset consists of a set of English texts, each labeled with four categories (is-humor, humor-rating, humor-controversy, offense-rating) (Meaney et al., 2021). Each text was labeled for each category by 20 annotators drawn from different gender and age groups. For the is-humor and humor-controversy categories, the majority class over the 20 annotators was taken as the label for each text, whereas for the humor-rating and offense-rating categories the label is the average of the ratings, in the range 1 to 5, over the 20 annotators. Table 2 shows examples of the training dataset with the four labels for each text. Moreover, the humor-rating and humor-controversy categories contain many NaN values: if the majority of annotators labeled a text as not humorous (is-humor = 0), the remaining humor categories are NaN. Therefore, we remove the NaN values from each category when pre-processing the dataset before training the models. Table 1 shows the totals in the training, development, and test sets for each category.
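The NaN-removal step can be sketched with pandas; the column names below are our assumption about the released CSV layout, not a guaranteed match.

```python
import pandas as pd

# Toy frame mimicking the training file: non-humorous texts have
# NaN humor_rating and humor_controversy, as described above.
df = pd.DataFrame({
    "text": ["a joke", "a plain sentence", "another joke"],
    "is_humor": [1, 0, 1],
    "humor_rating": [2.3, None, 1.1],
    "humor_controversy": [0, None, 1],
    "offense_rating": [0.1, 0.0, 0.4],
})

# Sub-tasks 1b and 1c train only on humorous texts, so drop rows
# whose humor columns are NaN before training.
df_1b = df.dropna(subset=["humor_rating"]).reset_index(drop=True)
df_1c = df.dropna(subset=["humor_controversy"]).reset_index(drop=True)
```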


System overview
The proposed system focused on pre-trained transformer models; we additionally applied a technique that embeds words and feeds them into long short-term memory (LSTM) layers trained on the dataset. Across all sub-tasks, the highest scores were achieved with the RoBERTa model. It is a powerful model pre-trained on a huge dataset with a complex architecture; it was released by Facebook and designed based on the BERT model released by Google. All of these pre-trained models are capable of handling long-range text dependencies and capturing features and relationships. Furthermore, the encoder-decoder structure of pre-trained models (Cho et al., 2014) enables them to handle sequence-to-sequence tasks (Sutskever et al., 2014). BERT-Large (Devlin et al., 2018) was pre-trained with masked language modeling (MLM) and next-sentence prediction, using labels of 1 for true next sentences and 0 for randomly chosen sentences. In contrast, RoBERTa used only the MLM objective during training, and was trained on a much larger dataset than BERT. We also tried the XLM-RoBERTa-Large pre-trained model (Conneau et al., 2019), which has 550M parameters in a 24-layer architecture, with a hidden-state embedding size of 1024, a feed-forward hidden size of 4096, and 16 attention heads. The model was trained on 2.5 TB of newly created, cleaned CommonCrawl data covering 100 languages.
On the other hand, this research also exploits BERT embeddings to represent text. The weights extracted from the BERT embedding layer during training are fed into a Bi-LSTM layer of 128 units (Graves and Schmidhuber, 2005). We then apply a dropout layer with a 0.3 ratio and a max-pooling layer before passing the information into a dense layer with 64 units. For classification tasks, the final dense layer has 2 output units with a sigmoid activation function; for regression, the final dense layer has one output unit. Figure 1 shows the model architecture used to predict labels for the classification and regression tasks.
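A minimal PyTorch sketch of this Bi-LSTM head follows. The 768-dimensional input is our assumption (it matches BERT-Base embeddings; the paper does not state the exact size), and the layer sizes mirror the description above.

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Bi-LSTM head over BERT embeddings, per the architecture above.

    `n_out=2` with sigmoid for the classification tasks; set `n_out=1`
    (and drop the sigmoid) for the regression tasks.
    """
    def __init__(self, emb_dim=768, n_out=2):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, 128, batch_first=True,
                              bidirectional=True)   # 128 units per direction
        self.dropout = nn.Dropout(0.3)              # 0.3 dropout ratio
        self.dense = nn.Linear(2 * 128, 64)         # dense layer, 64 units
        self.out = nn.Linear(64, n_out)             # final dense layer

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)
        h = self.dropout(h)
        h = h.max(dim=1).values           # max-pool over the sequence
        h = torch.relu(self.dense(h))
        return torch.sigmoid(self.out(h))

model = BiLSTMHead()
emb = torch.randn(4, 16, 768)             # 4 texts, 16 tokens, BERT-Base dims
probs = model(emb)                        # shape: (4, 2)
```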

Experiments and Results
In the experimental phase, the dataset was divided into three parts (training, development, and testing). We used the training set to train the models and the development set to fine-tune them, capturing the best hyper-parameters without over-fitting or under-fitting. We used the test set to check model performance on unseen data and to assess generalizability. The experiments were run on Google Colab, which provides a number of GPUs with modest memory sizes 2 . For the pre-trained models, we used the Transformers library, which is based on PyTorch and allows fine-tuning the models on one's own dataset 3 . We did not apply any pre-processing to the input text, even though the dataset contains symbols, mixed upper- and lower-case letters, misspellings, and some abbreviations. We left these untouched so that the model would be more realistic and robust in dealing with real data, and so that it might learn to treat such cases as features. The only pre-processing step was removing the NaN values for sub-tasks 1b and 1c. Each sub-task has a metric matching its output type (classification or regression): accuracy and F-score measure performance in sub-tasks 1a and 1c, while RMSE measures performance in sub-tasks 1b and 2a.
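The two evaluation metrics can be written out in plain Python; this is a generic sketch of binary F-score and RMSE, not the official scoring script.

```python
import math

def f1_score(y_true, y_pred, positive=1):
    """Binary F-score, as used for sub-tasks 1a and 1c."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def rmse(y_true, y_pred):
    """Root mean squared error, as used for sub-tasks 1b and 2a."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```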
During model tuning, we tried several hyper-parameters: the batch size was fixed at 8 and the Adam optimizer was used in all experiments. We applied several learning rates (1e-5, 4e-5, 1e-6, 3e-6) and different numbers of epochs (2, 4, 8, and 12). Table 3 shows the main experiments across the models with different learning rates and epoch counts for each sub-task.


Results
RoBERTa achieved high performance compared to the other approaches, demonstrating its ability to capture traits and distinguish between labels. Table 4 presents the best results at both the development and evaluation levels, together with the best hyper-parameters selected during the experimental phase for each sub-task. In sub-task 1a, RoBERTa-Large achieved high scores in the binary classification problem compared to the other models, placing us 26th on the leaderboard by F-score in the evaluation phase. In sub-task 1b, RoBERTa also achieved acceptable results on the regression problem and outperformed the other models; we ranked 26th on the leaderboard by RMSE. Sub-task 1c is treated as binary classification, in which we achieved 25th place in the evaluation phase by F-score. In the last sub-task, we ranked 9th by RMSE in the evaluation phase.

Error Analysis
This section presents some analysis clarifying the outcomes and limitations of the model on each sub-task. Figure 2 shows the confusion matrix for sub-task 1a, a classification problem. In 946 cases in total the actual label matches the predicted label (y = ŷ), while in 54 cases (y != ŷ) the model could not predict the correct label. The cell representing 31 false positives is slightly larger than the cell representing 23 false negatives. This is because the training dataset is slightly biased toward label 1, which accounts for 4,932 of the 8,000 training examples, while label 0 makes up the remaining 3,068. The third sub-task is also a binary classification problem. Figure 4 shows that the model recognizes label 0 slightly better than label 1, but in general the model did not learn well (it is highly biased), even though the two labels are almost balanced in the training dataset (2,467 and 2,465 cases for labels 0 and 1, respectively).
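The per-cell counts behind these confusion matrices can be computed with a simple counter; the labels below are toy values for illustration, not the task data.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs: (1, 1) are true positives,
    (0, 1) false positives, (1, 0) false negatives, (0, 0) true negatives."""
    return Counter(zip(y_true, y_pred))

cm = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```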
Finally, for the last sub-task (2a), we used a round function to map the continuous predicted values to discrete values. The diagonal of Figure 5 clearly shows which labels the model distinguishes best. The values are plausible given the number of cases per label in the training dataset: 5,737, 1,043, 623, 364, 214, and 19 occurrences for labels 0 through 5, respectively.
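The discretization step can be sketched as follows; clipping the rounded value to the valid label range is our assumption, since the paper only mentions rounding.

```python
def discretize(score, low=0, high=5):
    """Round a continuous predicted rating to the nearest integer label,
    clipping to the valid range [low, high] (the clipping is an assumption)."""
    return min(high, max(low, round(score)))
```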

Conclusion
In this paper, we presented several approaches addressing four sub-tasks. We obtained high scores using the pre-trained RoBERTa model on each sub-task. In the first sub-task, predicting whether a text is humorous, we obtained a 0.9564 F-score. In the second sub-task, rating how humorous a text is on a scale from 0 to 5, we obtained a 0.5709 RMSE. In the third sub-task, verifying whether a text is controversial, we obtained a 0.4888 F-score. In the last sub-task, predicting the offense rating of a text in the range 0 to 5, we achieved a 0.4467 RMSE. In future work, we plan to run more experiments and use ensemble techniques to improve the results, as well as to add more data to the original dataset to address the label bias.

Figure 1: An illustration of the proposed model architecture.

Figure 2: An illustration of the confusion matrix for sub-task-1A.
In sub-task 1b (Figure 3), the model mostly recognized labels 2 and 3, while it could not recognize labels 0, 1, and 4 in the prediction phase. This is because the training dataset contains few examples of these labels compared with labels 2 and 3: labels 0, 1, and 4 appear 16, 410, and 47 times in the training dataset, respectively, while label 2 appears 2,835 times and label 3 appears 1,624 times.

Figure 3: An illustration of the discretization confusion matrix for sub-task-1B.

Figure 4: An illustration of the confusion matrix for sub-task-1C.

Figure 5: An illustration of the discretization confusion matrix for sub-task-2A.

Table 1 :
The total number of examples in the training, development, and test sets for each category (is-humor, humor-rating, humor-controversy, offense-rating) after removing NaN values.

Table 2 :
An example illustrating the features of a training instance, whether humorous or offensive: "I got REALLY angry today and it wasn't about nothing, but you're going to have to take my word for that." If the Is-humor class is 0, then the Humor-rating and Humor-controversy classes are NaN values.

Table 3 :
The models applied and the hyper-parameters used. (ES denotes the early-stopping technique.)

Table 4 :
The best results obtained at both the development and evaluation levels, with the chosen hyper-parameters.