TSA-INF at SemEval-2017 Task 4: An Ensemble of Deep Learning Architectures Including Lexicon Features for Twitter Sentiment Analysis

This paper describes the submission of team TSA-INF to SemEval-2017 Task 4 Subtask A. The submitted system is an ensemble of three varying deep learning architectures for sentiment analysis. The core of the architecture is a convolutional neural network that performs well on text classification as is. The second subsystem is a gated recurrent neural network implementation. Additionally, the third system integrates opinion lexicons directly into a convolutional neural network architecture. The resulting ensemble of the three architectures achieved a top-ten ranking with a macro-averaged recall of 64.3%. Additional results comparing variations of the submitted system are not conclusive enough to determine a best architecture, but serve as a benchmark for further implementations.


Introduction
The SemEval competitions continually offer suitable datasets and resulting benchmarks for a variety of natural language processing tasks. SemEval-2017 Task 4 Subtask A addresses the polarity classification of informal texts (Rosenthal et al., 2017). Tweets serve as a very accessible sample of the abundant social media content. Submitted systems must classify tweets into the categories of negative, positive, and neutral opinion. Submitted results are compared by macro-averaged recall.
In recent benchmarks on this task, deep learning implementations achieved top results (Nakov et al., 2016). We seek to combine three varying deep learning approaches in an ensemble. Conventional methods seem to have become obsolete since convolutional neural networks (CNN) first showed state-of-the-art results in sentiment analysis (Kim, 2014). SemEval has since seen successful results by similar models (Severyn and Moschitti, 2015) as well as ensembles of CNNs (Deriu et al., 2016). Long short-term memory recurrent neural networks (LSTM) (Hochreiter and Schmidhuber, 1997) have also been applied successfully in ensemble with a CNN (Xu et al., 2016). As an alternative to LSTMs, gated recurrent neural networks (GRNN) have been shown to be competitive in other domains (Chung et al., 2014). These models are well suited to model sequential data and were successfully applied for sentiment analysis of larger documents (Yang et al., 2016). The core contribution of a recent non-deep-learning system that won this task back to back in 2013 and 2014 (Kiritchenko et al., 2014) was the integration of opinion lexicons into a support vector machine system. Opinion lexicons have since also been integrated into CNN architectures (Rouvier and Favre, 2016; Shin et al., 2016). In this work we combine these diverse architectures.
We use a CNN, thoroughly optimized for text classification, as the foundation of our ensemble approach. We add a lexicon-integrated CNN to take advantage of lexicon features. To diversify the approach, we also include a GRNN architecture as a sequential model. The idea is to obtain better and more robust results with a broader architecture. Results show that adding the latter two systems does not improve overall results, though the results for the core CNN approach were already good. However, results for individual classes do improve, making this a viable option when prioritizing specific classes or evaluation metrics. With this work we seek to contribute to the growing body of literature that presents comparable and reproducible solutions for this task.

Approach
This section outlines the overall approach before detailing the subcomponents of the architecture. The purpose of the system is to classify an input tweet into one of the opinion classes C, with |C| = 3. The class is determined by the maximal value of a system output vector o ∈ R^|C|.
As outlined in Fig. 1, we propose an ensemble of three deep learning architectures to solve this task: a CNN and a GRNN over word embeddings, as well as a CNN over lexicon embeddings. The ensemble components' output vectors v1, v2, and v3 can be considered abstractions of the input tweet. These representations are the input to an ensemble system which determines the final output. We will describe the preprocessing steps that create the ensemble layer input before outlining the ensemble architecture.

Preprocessing
As a first preprocessing step for creating embeddings, the tweet data is tokenized with NLP4J.
In the following we refer to a tweet as a document: a sequence of tokens, constrained to n = 120. If the actual document is shorter, it is padded with zero vectors; otherwise it is truncated. The tokens are then converted into either word embeddings of dimension d or lexicon embeddings of dimension l. We use pretrained word embeddings from Frederic Godin (Godin et al., 2015), with d = 400; the embeddings were trained on 400 million tweets. The lexicon embeddings are polarity scores from three different lexicons, thus l = 3. We use Bing Liu's Opinion Lexicon (Hu and Liu, 2004), the Hashtag Sentiment Lexicon, and the Sentiment140 Lexicon (Mohammad et al., 2013) to form the complete lexicon embedding.
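As an illustration, the padding, truncation, and embedding lookup can be sketched in NumPy. This is a minimal sketch, not the submitted implementation; the `lookup` dictionary and its toy entries stand in for the pretrained embeddings.

```python
import numpy as np

N_TOKENS = 120   # fixed document length n
D_WORD = 400     # word-embedding dimension d
L_LEX = 3        # lexicon-embedding dimension l (one polarity score per lexicon)

def to_document_matrix(tokens, lookup, dim, n=N_TOKENS):
    """Map a token sequence to an n x dim matrix: pad short documents
    with zero vectors, truncate long ones. Unknown tokens map to zeros."""
    doc = np.zeros((n, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:n]):
        doc[i] = lookup.get(tok, np.zeros(dim, dtype=np.float32))
    return doc

# toy lookup standing in for the pretrained embeddings of Godin et al. (2015)
lookup = {"good": np.ones(D_WORD, dtype=np.float32)}
D = to_document_matrix(["good", "movie"], lookup, D_WORD)
print(D.shape)  # (120, 400)
```

The same function applies unchanged to lexicon embeddings with `dim=L_LEX`.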
TensorFlow (Abadi et al., 2016) is used as the deep learning framework for implementing the system. The next subsections describe the components of this system, followed by an outline of how their outputs are combined into an ensemble.

Convolutional Neural Network
This component is based on a standard CNN architecture used for text classification (Kim, 2014). We make small changes to fine-tune it to the task. The input to this component are the word embeddings described in the preceding subsection. The embeddings of dimension d are formed into a document matrix D ∈ R^{n×d} across the n input tokens. The document matrix D is passed through k filters of filter size s; the convolution weights belong to R^{s×d}. The convolutions result in k output vectors of dimension n−s+1. A max pooling layer converts these vectors to a single vector in R^k. We then add a batch normalization layer (Ioffe and Szegedy, 2015) so as to merge outputs from different filter sizes. With f filter sizes, we finally arrive at a vector v1 ∈ R^{k·f}. This vector is passed to a dense layer of 256 ReLU units, followed by a softmax output layer. We apply dropout (Hinton et al., 2012) at the beginning of the dense layer as well as the output layer. The implementation uses the following configuration:
• Weights are initialized using Xavier weight initialization (Glorot and Bengio, 2010).
• The learning rate is set to 0.0001 with a batch size of 100.
• The input vector v1 to the dense layer thus has a dimension of 1280.
• At the dense layer and output layer, we use dropout with a keep probability of 0.7.
• We run 200 training iterations and select the model that performed best on development data.
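The convolution and pooling shapes for a single filter size can be verified with a minimal NumPy sketch. The filter size s = 3 and filter count k = 256 here are illustrative assumptions (the text above only fixes n = 120, d = 400, and k·f = 1280), and the random matrices stand in for learned weights.

```python
import numpy as np

n, d = 120, 400        # tokens x word-embedding dimension
s, k = 3, 256          # illustrative filter size and filter count

rng = np.random.default_rng(0)
D = rng.standard_normal((n, d)).astype(np.float32)      # document matrix
W = rng.standard_normal((s, d, k)).astype(np.float32)   # convolution weights

# convolution over token windows: one scalar per window and filter
conv = np.stack([np.tensordot(D[i:i + s], W, axes=([0, 1], [0, 1]))
                 for i in range(n - s + 1)])            # shape (n - s + 1, k)
pooled = conv.max(axis=0)                               # max pool -> shape (k,)
print(conv.shape, pooled.shape)  # (118, 256) (256,)
```

Concatenating such pooled vectors over f filter sizes yields the k·f-dimensional vector v1.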

Gated Recurrent Neural Network
The GRNN is based on the gated recurrent unit (GRU) (Cho et al., 2014), which uses a gating mechanism while tracking the input sequence with a latent state. GRUs performed better than other RNN cells such as LSTMs in our experiments, which go beyond the scope of this paper. The input to the GRNN are the word embeddings described in Section 2.1. The input is read sequentially by a GRU layer. The GRU cell is designed to learn how to represent a state based on previous inputs and the current input. The GRU layer consists of g hidden units. After the last token of the sequence is processed, the output vector v2 ∈ R^g of this layer is collected to be merged into other architectures. The implementation is configured as follows:
• g = 256 hidden units with tanh activation,
• resulting in the 256-dimensional output vector v2.
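A single GRU update can be sketched as follows. This follows the standard formulation of Cho et al. (2014) rather than our TensorFlow implementation; the weight dictionary, its initialization scale, and the random input sequence are illustrative assumptions (biases are omitted for brevity).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU update; p holds the input (W*) and recurrent (U*) weights."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return z * h + (1.0 - z) * h_tilde                  # interpolated new state

d, g = 400, 256  # word-embedding dimension, hidden units
rng = np.random.default_rng(0)
p = {name: rng.standard_normal((g, d if name.startswith("W") else g)) * 0.01
     for name in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}

h = np.zeros(g)
for x in rng.standard_normal((120, d)):  # read the token sequence
    h = gru_step(x, h, p)
print(h.shape)  # (256,) -> the output vector v2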

Lexicon-Integrated Convolutional Neural Network
The lexicon-integrated CNN (CNN-lex) is similar to the previously described CNN architecture. The fundamental difference is that convolutions are performed over the lexicon embeddings described in Section 2.1. The input to this component is a document matrix L ∈ R^{n×l} across the l-dimensional lexicon embeddings of the n input tokens. The architecture uses j filters for each of e convolution filter sizes.
The convolution layer output is passed through a max pooling and normalization layer. This results in an output vector v3 ∈ R^{j·e} that is collected to be merged into other architectures. The implementation uses the following configuration:
• The architecture uses e = 3 filter sizes, [3, 4, 5], and j = 64 filters per filter size over n = 120 lexicon embeddings of dimension l = 3.
• The ensemble output vector v3 thus has a dimension of 192.
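The dimensions of v3 follow directly from this configuration and can be checked with a short NumPy sketch; the random input and weights are placeholders for the lexicon embeddings and learned filters.

```python
import numpy as np

n, l = 120, 3                  # tokens x lexicon-embedding dimension
filter_sizes, j = [3, 4, 5], 64
rng = np.random.default_rng(0)
L = rng.standard_normal((n, l))  # placeholder lexicon document matrix

parts = []
for s in filter_sizes:
    W = rng.standard_normal((s, l, j))                       # filters for size s
    conv = np.stack([np.tensordot(L[i:i + s], W, axes=([0, 1], [0, 1]))
                     for i in range(n - s + 1)])             # (n - s + 1, j)
    parts.append(conv.max(axis=0))                           # max pool -> (j,)
v3 = np.concatenate(parts)                                   # j * e values
print(v3.shape)  # (192,)
```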

Ensembles
The previously described architectures can be combined into ensemble systems. The CNN-lex and GRNN architectures are already defined as ensemble subsystems through their output vectors v2 and v3. While the CNN architecture was previously introduced as a stand-alone system, it is naturally described as an ensemble component: the vector v1 described in Section 2.2 is the output vector of the CNN as a subsystem. The three vectors are concatenated as input to the ensemble layer. The ensemble layer consists of a dense layer of 256 ReLU units followed by a softmax output layer, which results in the output vector o of dimension |C|. The CNN is combined into three ensembles by concatenating its output vector v1 with v2 (CNN, GRNN), with v3 (CNN, CNN-lex), and with both v2 and v3 (CNN, CNN-lex, GRNN). The latter is the submission architecture while the other two are evaluated for comparison. These ensembles are trained as a single system; training is conducted as previously described for the CNN architecture.
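The concatenation and the ensemble layer's shapes can be sketched as follows. The zero-valued vectors and weights are placeholders (which make the softmax output uniform); the dimensions come from the component configurations above.

```python
import numpy as np

v1 = np.zeros(1280)  # CNN subsystem output (k * f = 1280)
v2 = np.zeros(256)   # GRNN subsystem output (g = 256)
v3 = np.zeros(192)   # CNN-lex subsystem output (j * e = 192)

ensemble_input = np.concatenate([v1, v2, v3])  # shape (1728,)

# dense ReLU layer of 256 units followed by a softmax over |C| = 3 classes
W1, W2 = np.zeros((256, 1728)), np.zeros((3, 256))
hidden = np.maximum(W1 @ ensemble_input, 0.0)
logits = W2 @ hidden
o = np.exp(logits) / np.exp(logits).sum()      # output vector o, dimension |C|
print(ensemble_input.shape, o.shape)  # (1728,) (3,)
```

The predicted class is then the argmax of o, as defined at the start of Section 2.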

Data
The training data for this approach was constrained to data published in the context of the SemEval workshops. Table 1 lists the data available to the authors. It is important to note that the data is heavily imbalanced. Before submission, the system was tested with the 2016-test set as development test data. The results described in this paper focus on the 2017-test data, which was used to rank the submissions. All other data in Table 1 was combined into one data set, shuffled, and split four to one into training and development data.
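The shuffle and four-to-one split can be sketched as follows; the data set size of 1000 is an arbitrary placeholder.

```python
import numpy as np

rng = np.random.default_rng(42)
idx = rng.permutation(1000)          # shuffle indices of the combined data set
cut = int(0.8 * len(idx))            # four-to-one train/development split
train_idx, dev_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(dev_idx))  # 800 200
```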

Results
The following results compare the core CNN architecture against ensembles with the CNN as a subsystem. The ranked submission is marked by * in Table 4.

Table 4: Results on test data (2017-test, Table 1) for CNN and ensemble architectures, where * marks the ranked submission.
For detailed rankings we refer to the task description (Rosenthal et al., 2017); here we only put this result in context with the described architectures. Three ensembles are used for comparison: the basic CNN combined with either the GRNN or the lexicon-integrated CNN, as well as with both. The two result data sets are the 2016-test set as pre-submission test data and the final 2017-test data set used to benchmark the submissions.
Overall, the CNN performs on par with or better than the ensembles on macro-averaged recall and macro-averaged positive-negative F1. For both the development test data in Table 3 and the test data in Table 4, we observe that the CNN outperforms the ensembles on macro-averaged F1. Though there is a substantial difference between the macro-averaged recall of the CNN and that of the ensembles on the development test data, macro-averaged recall on the test data is consistent across all systems.
The strongest fluctuation in averaged results is the drop in accuracy for the CNN, CNN-lex ensemble across both data sets. Detailed results in Table 2 show that this is due to a steep drop in neutral class recall, a class the data is heavily biased towards (Table 1). We note that though macro-averaged recall stays consistent on the test data (Table 4), per-class results fluctuate substantially (Table 2). These class trends were generally consistent across both the 2017-test and 2016-test data; the latter results are thus omitted for brevity.

Conclusions
In the previous sections we described experiments of adding various deep learning architecture elements to a basic CNN. Results show that the derived ensembles did not improve performance on the more relevant metrics of macro-averaged recall and F1. For further context, it is important to mention that substantially more effort went into engineering and tuning the CNN model than the additional architectures. Like the submission system, the CNN architecture by itself would have ranked within the top ten of this sentiment analysis task; room for improvement was thus limited. We do note that per-class results fluctuate considerably across ensembles, which means these architectures can be used to prioritize class-specific recall and precision.
It remains open whether the more complex architectures perform more robustly across diverse datasets. We will seek more clarity on this issue by experimenting with different data sets. Another architectural choice to pursue is to include an attention mechanism so that the ensemble system can learn which subcomponents to prioritize.

Figure 1 :
Figure 1: Ensemble component output vectors v_i are used as input to an ensemble, which determines a classification output vector o.

Table 3 :
Macro-averaged recall ρ, negative-positive macro-averaged F1, and accuracy on development test data (2016-test, Table 1) for CNN and ensemble architectures.