ECNU at SemEval-2018 Task 3: Exploration on Irony Detection from Tweets via Machine Learning and Deep Learning Methods

This paper describes our submissions to Task 3 in SemEval-2018. There are two subtasks: Subtask A is a binary classification task to determine whether a tweet is ironic, and Subtask B is a fine-grained classification task with four classes. To address them, we explored supervised machine learning methods alone and in combination with neural networks.


Introduction
Irony, also known as sarcasm, refers to the use of words and sentences whose intended meanings are contrary to their literal meanings. Modeling irony has great potential for applications in various research areas, so SemEval-2018 Task 3 (Hee et al.) aims to classify irony into different classes.
There are two subtasks. In subtask A, given a tweet, the classifier should predict whether the tweet is ironic or non-ironic; in subtask B, the ironic class is further divided into three categories, i.e., irony by polarity contrast, situational irony, and other verbal irony.
Polarity contrast irony covers tweets containing an expression whose polarity (positive, negative) is inverted between the literal and the intended meaning. Situational irony covers tweets without an explicit polarity contrast, in which the events or outcomes described run contrary to expectations or common knowledge. Other verbal irony tweets also contain no explicit polarity contrast, but they cannot be classified as situational irony. Finally, the non-ironic class contains instances that are clearly not ironic, or that lack adequate context to be sure that they are ironic.
In the remainder of this paper, Section 2 describes our system in detail, Section 3 reports the datasets, experiments, and discussion of results, and Section 4 concludes our work.

System Description
In both subtasks we used supervised machine learning to model irony in the datasets. Moreover, we explored neural networks in subtask A.
• In subtask A, we built a binary classification system to make predictions (see 2.2.1), and then combined it with a Bi-LSTM neural network (see 2.2.2).
• In subtask B, we used two machine learning systems for training and evaluation.
1. 4-class classification system: We used the classifier directly to make 4-class predictions.
2. 4 binary-classification system: We designed a two-step system as follows:
- Step 1: The entire problem was treated as 4 binary-classification problems. Each tweet was trained and evaluated against the 4 classes, and 4 confidence values were returned.
- Step 2: The classifier assigned each tweet the label with the highest confidence, and then the evaluation was made.
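The two-step "4 binary-classification" system above can be sketched as follows. Logistic regression stands in for whichever base classifier is used, and the function name is ours; this is an illustrative sketch, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X_train, y_train, X_test, n_classes=4):
    """Step 1: train one binary classifier per class and collect a
    confidence value for each class.  Step 2: label each test tweet
    with the class whose classifier is most confident."""
    confidences = np.zeros((X_test.shape[0], n_classes))
    for c in range(n_classes):
        clf = LogisticRegression()
        clf.fit(X_train, (y_train == c).astype(int))
        # Confidence that the tweet belongs to class c
        confidences[:, c] = clf.predict_proba(X_test)[:, 1]
    return confidences.argmax(axis=1)
```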

Feature Engineering
4 types of features were designed to extract effective information from the given tweets.

Linguistic-informed Features
• Word N-grams: We extracted word n-gram features (n = 1, 2, 3) from tweets, using the TweetTokenizer from NLTK (Bird et al., 2009). In addition, n-gram features weighted with Relevance Frequency (RF) (Lan et al., 2009) were applied in this system.
• Named Entities: We used Stanford CoreNLP (Manning et al., 2014) to recognize named entities, and a 12-dimension binary feature to indicate the entities in tweets.
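The n-gram extraction can be sketched with NLTK's TweetTokenizer and scikit-learn; the exact vectorizer settings below are our assumption, not taken from the paper.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# TweetTokenizer keeps hashtags and emoticons intact, unlike a
# plain word tokenizer.
tokenizer = TweetTokenizer()
vectorizer = CountVectorizer(tokenizer=tokenizer.tokenize,
                             token_pattern=None,
                             ngram_range=(1, 3))
tweets = ["Just love it when my train is delayed #blessed",
          "What a great start to the day"]
X = vectorizer.fit_transform(tweets)  # sparse count matrix of 1-3-grams
```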

Word Embedding Features
Many recent studies report good performance on NLP applications by using word vectors, for example in document classification (Sebastiani, 2002) and question answering (Lan et al., 2016). In our work, two widely used word embeddings were adopted: Google Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). For Word2Vec, a dictionary (available from Google) with 31,622 words and 300 dimensions was applied. For GloVe, we used the dictionary with 2,196,017 words and 300 dimensions (glove.840B.300d, available from the GloVe website).
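A common way to turn pre-trained word embeddings into tweet-level features is to average the word vectors; the paper does not specify its aggregation, so the sketch below is an assumption.

```python
import numpy as np

def tweet_embedding(tokens, vectors, dim=300):
    """Average the pre-trained vectors of a tweet's tokens;
    out-of-vocabulary words are skipped (zeros if all are OOV)."""
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        return np.zeros(dim)
    return np.mean(hits, axis=0)

# Toy two-word vocabulary standing in for the real GloVe/Word2Vec lookup
vecs = {"so": np.ones(300), "funny": np.zeros(300)}
emb = tweet_embedding(["so", "funny", "oov"], vecs)
```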

Tweet domain Features
We collected tweet-related features, and used unigrams to indicate whether a tweet contained such information.
• Hashtags: All tokens beginning with the "#" symbol are called hashtags. We extracted all hashtags, removed the "#" symbol, and built unigram features from them.
• Punctuation: We used a binary feature to indicate whether the last token of a tweet is "?".
• Emoticons: We collected 67 emoticons labeled with positive and negative scores from the Internet, and used a 67-dimension binary feature to record the sentiment scores of the emoticons in tweets.
• Elongated Words: In the sentence "Ahhaaaaaaa, that's sooooo funny!", "Ahhaaaaaaa" and "sooooo" are elongated words. The existence of such words leads to overfitting in unigram features, so we designed a feature to handle them. In our work, an elongated word was defined as a word with a character repeated 3 to 11 times; we captured and handled such words using regular expressions.
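A sketch of the elongated-word handling with a regular expression; collapsing the run down to a single character is our assumption about how the words are handled.

```python
import re

# A character repeated 3 to 11 times marks an elongated word
# (the base character plus 2-10 repetitions).
ELONGATED = re.compile(r"(\w)\1{2,10}")

def has_elongated(text):
    """Binary feature: does the tweet contain an elongated word?"""
    return ELONGATED.search(text) is not None

def normalize_elongated(text):
    """Collapse elongated runs so unigram features do not fragment."""
    return ELONGATED.sub(r"\1", text)
```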

Machine Learning Algorithm
In both subtasks, we used supervised machine learning algorithms to train the models, implemented with LibLinear (Fan et al., 2008) and Scikit-Learn.

Deep Learning
Next, we explored neural networks in subtask A. We modeled all the tweet data with a Bi-LSTM network; the general architecture of the model is depicted in Figure 1.
• Input and Embedding Layer: Each tweet was preprocessed by normalizing hyperlinks and mentions to someurl and someuser as described in 2.1.1, and by extracting word n-grams in hashtags as described in 2.1.4. The tweet was then converted into a vector and padded to an equal length (or truncated if longer than the pre-defined length). The input vector was fed to the embedding layer (initialized with pre-trained glove.twitter.27B vectors), which converted each word into a distributional vector.
• Bi-LSTM Layer: We used bi-directional LSTMs to model the input sequence. In the bidirectional architecture, two layers of hidden nodes from two LSTMs capture compositional semantics from the forward and backward directions of the word sequence.
• Attention Layer: We added an attention layer to model the weights of input words following (Raffel and Ellis, 2016), i.e., learning the weights of the hidden states at each time step, then computing the sentence representation as a weighted sum.
• Output Layer: The output of the Bi-LSTM was passed to a fully connected (FC) layer, which produced a higher-order feature set easily separable into 2 classes. Finally, a softmax layer was added on top of the fully connected layer. The network was trained by minimizing the binary cross-entropy error, with ADAM (Kingma and Ba, 2015) for parameter optimization.
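Using the hyper-parameters reported in the experiments section (100-dimensional embeddings, 256 hidden units, dropout 0.2), the submitted model of Figure 1(a) might be sketched in Keras as follows. The vocabulary size, sequence length, and FC activation are hypothetical, and the embedding layer would be initialized with the pre-trained GloVe vectors in practice.

```python
from tensorflow.keras import layers, models

# Hypothetical sizes; the paper does not report vocabulary size or
# maximum tweet length.
MAX_LEN, VOCAB_SIZE, EMB_DIM = 40, 20000, 100

def build_bilstm():
    """Sketch of the submitted model (Figure 1a): embedding ->
    Bi-LSTM -> dropout -> fully connected -> softmax over 2 classes."""
    inputs = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)  # GloVe-initialized in the paper
    x = layers.Bidirectional(layers.LSTM(256))(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    # Binary cross-entropy minimized with ADAM, as described above
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```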
3 Experiments and Results

Datasets
The statistics of the datasets provided by SemEval 2018 task 3 are shown in Table 1.

Evaluation Metric
The official evaluation criteria are as follows:
• For subtask A, the F1-score of the positive (ironic) class is used.
• For subtask B, the macro-averaged F1-score over all four classes is used:

F1-macro = (F_polar-contrast + F_situational + F_other + F_non-ironic) / 4

3.3 Experiments on training data

Subtask A: Irony Detection
We used a series of features and explored different machine learning algorithms, in combination with neural networks, in subtask A.

Machine Learning
The training set contained only 3,834 tweets, and no development set was provided. To fully exploit these data, we used 10-fold cross-validation with data shuffling. The major feature selection work was done with LibLinear L2-regularized logistic regression (LibLinear LR).
We used the features in Table 2 as the baseline features. Since the cross-validation operations were done with data shuffling, some fluctuation in the results may exist. From the table it can be observed that all these features contribute to the classifier. We then added three other features: Word N-grams, Hashtags, and Hashtag unigrams. Each feature had two versions, with or without Relevance Frequency (RF). Simultaneously, we varied the word frequency threshold used when building the lexicon for these features, from 1 to 5. To choose the features that improve performance most, we used the Hill Climbing method.
Hill Climbing is a method that automatically selects the best features from a set of candidates. Its principle is as follows:
1. Given a Candidate Feature set, traverse each feature and move the feature producing the best performance into the Best Feature set.
2. Traverse the remaining features in the Candidate Feature set, combining each one with the Best Feature set to train the model. If a feature leads to better performance than before, move it into the Best Feature set, and repeat until no remaining feature improves performance.
Then we explored the performance of different learning algorithms. Table 4 lists the comparison of the three best supervised learning algorithms with all the above features.
Finally, we ensembled the three algorithms in Table 4; the ensemble score was 0.6982.
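The hill-climbing feature selection described above can be sketched as a greedy forward-selection loop; the `evaluate` function, which stands in for the cross-validation scoring, is a hypothetical interface.

```python
def hill_climb(candidates, evaluate):
    """Greedy forward selection: repeatedly move the feature that
    most improves the score into the Best Feature set, stopping
    when no remaining feature helps."""
    best, best_score = set(), float("-inf")
    improved = True
    while improved and candidates:
        improved = False
        pick, pick_score = None, best_score
        for feature in candidates:
            score = evaluate(best | {feature})
            if score > pick_score:
                pick, pick_score = feature, score
        if pick is not None:
            best.add(pick)
            candidates.remove(pick)
            best_score = pick_score
            improved = True
    return best, best_score
```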

Neural Networks
In our LSTM framework, the dimension of the word vectors was set to 100, and the hidden sizes of both the LSTM and FC layers were set to 256. The dropout rate was set to 0.2 to prevent overfitting.
10% of the training data were randomly selected as the validation set, and the best model during training was used in the test evaluation stage. We implemented the framework with TensorFlow (Abadi et al., 2016) and Keras. The performance results on the training datasets are listed in Table 5; the average F1-score is about 0.66.
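The random validation split described above can be sketched as follows; the random seed and exact mechanics are our assumptions.

```python
import numpy as np

def train_val_split(X, y, val_frac=0.1, seed=0):
    """Hold out a random 10% of the training data as the
    validation set (the seed is an assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    return X[idx[n_val:]], y[idx[n_val:]], X[idx[:n_val]], y[idx[:n_val]]
```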

Ensemble of Machine Learning and Neural Networks
The average performance of machine learning and of neural networks was 0.69 and 0.66, respectively. We ensembled different results from the neural network and from machine learning. Here we used 4 algorithms, i.e., Scikit-Learn's Naïve Bayes, LR, and SVM, and LibLinear's LR, to avoid label 0 and label 1 being voted the same number of times.
During the ensemble, we also tried another strategy. Since we wanted to raise the recall of the positive label, we ensembled only the data predicted as "label 0" by the neural networks; the data predicted as "label 1" kept their original labels. The results of this strategy are discussed in 3.4.
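Both ensemble strategies can be sketched as follows; the array shapes and function names are illustrative, not from the paper's code.

```python
import numpy as np

def majority_vote(pred_matrix):
    """Majority vote over an (n_voters, n_samples) matrix of 0/1
    labels; an odd number of voters avoids ties."""
    return (pred_matrix.mean(axis=0) > 0.5).astype(int)

def recall_oriented_merge(nn_pred, ensemble_pred):
    """Second strategy: keep the network's positive predictions and
    re-decide only the tweets it labeled 0, raising positive recall."""
    return np.where(nn_pred == 1, nn_pred, ensemble_pred)
```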

Subtask B: Irony Classification
When handling subtask B, we used only machine learning, and conducted two steps.
In the first step, the average macro-F1 score was between 0.42 and 0.43. Table 6 shows how each class scored: the F1-scores of labels 2 and 3 are much lower than those of labels 0 and 1, which is caused by the imbalanced data distribution. Here label 0 represents Non-ironic, 1 Polarity contrast, 2 Situational irony, and 3 Other verbal irony.
In the second step, to address the imbalanced data distribution, we enlarged the data for labels 2 and 3: label 2 was expanded 6 times, and label 3 was expanded 10 times. We then ensembled multiple algorithms, each performing 4 binary classifications successively. Finally, we used Scikit-Learn LR for labels 0, 1, and 2, and Scikit-Learn SVM for label 3. Results are listed in Table 7. Table 8 shows the results of our system and the top-ranked systems provided by the organizers. Compared with the top-ranked systems, there is much room for improvement in our work. There are several possible reasons for this.
• First, the overfitting problem is serious: the scores during the training and dev period and the test period differed significantly. This is discussed in 3.5.
• Second, some features possibly failed to extract useful information from the test data. Unlike Word N-grams, for features such as hashtags, the probability of the same hashtag or matching words appearing in both the test and training files is quite low.
3.5 Supplementary results beyond the contest

Ensemble of Machine Learning and Neural Networks on subtask A
This subsection reports the performance of machine learning algorithms on subtask A after the contest. The numbers in parentheses represent the positions in the official ranking that the results would have obtained if submitted. The last record is the same as ECNU's official submission.
In Table 10, TOP3 means the ensemble of the 3 best algorithms on the training datasets; 4+NN means ensembling the 4 best machine learning algorithms with the results of the neural networks; and en0 means using the alternative strategy mentioned in 3.3.1. Hence, the ensemble using that strategy has a particularly high recall. Nevertheless, the performance of these results differs greatly from that on the training datasets. In Table 11, the average F1-score of the pure neural network results is about 0.61. This indicates that in our work the training of the supervised machine learning models appeared to overfit.

Neural Networks on subtask A
In Table 9, the average F1-scores of the pure neural network results are 0.61, 0.62, and 0.60 for the three groups, respectively. This again indicates that in our work the training of the supervised machine learning models appeared to overfit. Moreover, turning on dropout in more neural network layers can further reduce overfitting.
However, our attempt to further incorporate an attention layer had a negative effect on subtask A's performance. This may suggest that the weighted sum of hidden states is not a good sentence representation for irony detection.

Conclusion
In this paper, we explored supervised machine learning algorithms and neural networks to detect whether a given tweet was ironic, and to classify tweets into four more detailed categories. We found that the machine learning classifiers overfitted, while the neural networks performed better than the traditional training methods. Our system ranked above average for subtask A, but did not perform as well on subtask B. In future work, we plan to focus more on exploring neural networks.
Figure 1: The architecture of our LSTM models. (a) The NN model submitted to subtask A, which only incorporates a dropout layer after the Bi-LSTM layer. (b) The NN model explored after the contest, which adds an attention layer and incorporates additional dropout at both the embedding and LSTM layers.

Table 1 :
Statistics of the train and test datasets. Label 0 stands for non-ironic; label 1 in subtask A is ironic; labels 1, 2, and 3 in subtask B are polarity contrast irony, situational irony, and other verbal irony, respectively.

Table 2 :
Performance of different features in cross-validation on shuffled data. ".+" means adding the current feature to the previous feature set. The numbers in parentheses are the performance increments compared with the previous results.

Table 3 :
The results of hill climbing.

Table 4 :
Performance of three best learning algorithms.

Table 5 :
Performance of partial neural networks on subtask A on the train and dev datasets.

Table 6 :
The f 1 -scores of each label in subtask B.

Table 8 :
Performance of our systems and the top-ranked teams on both subtasks. The numbers in parentheses are the official rankings. The evaluation metrics are described in 3.2.

Table 9 :
Performance of pure neural networks on subtask A on the test datasets. The numbers in parentheses are the positions the results would have taken if submitted. Performances in group 'NN' are based on Figure 1(a); performances in group 'NN+more dropout' are based on Figure 1(a) with additional dropout settings; and performances in group 'NN+more dropout+attention' are based on Figure 1(b).

Table 10 :
Performance of ensembles of machine learning and neural networks on the subtask A test datasets.

Table 11 :
Performance of pure neural networks. The numbers in parentheses represent the positions in the official ranking if the results had been submitted.