Modeling Multi-Targets Sentiment Classification via Graph Convolutional Networks and Auxiliary Relation

Existing solutions do not work well when multiple targets coexist in a sentence. The reason is that they usually separate the targets and process them one at a time: if the original sentence has N targets, the sentence is repeated N times, and only one target is processed each time. To some extent, this approach degenerates the fine-grained sentiment classification task into a sentence-level sentiment classification task, and processing the targets separately ignores the internal relations and interactions between them. Based on these considerations, we propose to use a Graph Convolutional Network (GCN) to model and process all targets appearing in a sentence at the same time based on their positional relationship, and then to construct a graph of the sentiment relationship between targets based on the difference in sentiment polarity between target words. In addition to the standard target-dependent sentiment classification task, an auxiliary node relation classification task is constructed. Experiments demonstrate that our model achieves performance comparable to the best existing results on the benchmark datasets of SemEval-2014 Task 4, i.e., reviews for restaurants and laptops. Furthermore, the results indicate that treating the target words as isolated individuals has disadvantages, and that the multi-task learning model helps enhance the feature extraction and expression ability of the model.


Introduction
With the development of the social economy, people's lives are increasingly dependent on the mining of large amounts of data. Sentiment analysis is a general term for tasks such as sentiment subject recognition and sentiment polarity classification. Within sentiment classification, document-level and sentence-level tasks, which give a holistic evaluation of the review text, have been studied extensively, while fine-grained sentiment classification has received less attention. Fine-grained sentiment classification has two definitions. One refines the granularity of emotion, for example from the general three classes of positive, negative and neutral to seven classes: anger, disgust, fear, happiness, like, sadness and surprise [Rathnayaka, Abeysinghe, Samarajeewa et al. (2019)]. The other refers to classifying the sentiment towards specific subjects in the review text, called Target-Dependent Sentiment Classification (TDSC) or Aspect-Based Sentiment Classification (ABSC). The difference between the two is that in TDSC the evaluated objects appear in the original text, while in ABSC the objects may be abstracted from the original text and need not appear in it. Both are clearly defined in SemEval-2014 Task 4 [Pontiki, Galanis, Pavlopoulos et al. (2014)]. In this paper, we focus on Target-Dependent Sentiment Classification [Jiang, Yu, Zhou et al. (2011); Dong, Wei, Tan et al. (2014); Vo and Zhang (2015); Tang, Qin, Feng et al. (2015); Song, Wang, Jiang et al. (2019)]. As a special case of aspect-level sentiment classification, the targets in the target-dependent sentiment classification task are bound to appear in the text, and the sentiment polarity towards each of them needs to be identified separately.
For example, given the sentence "While this is a pretty place in that overly cute French way, the food was insultingly horrible.", the sentiment polarity for "place" is positive and for "food" is negative. As another example, given the sentence "Not a large place, but it's cute and cozy.", the sentiment for "place" is conflict, as both negative (Not a large) and positive (cute and cozy) sentiments are expressed towards the same target. In previous methods, researchers usually split the multiple targets in a sentence into multiple instances for processing. This ignores the correlation and influence between the targets, so in this paper we propose a new model called TSR-GCN, which uses Graph Convolutional Networks to model multiple targets in a sentence simultaneously based on their positional relationship, and we introduce an auxiliary relation classification task to further explore the sentiment polarity relation between targets (the nodes of the graph). The experimental results show that our model achieves performance comparable to the current best results even when the graph construction is relatively coarse, indicating that the approach is worth further exploration and research.

Related work
In this section, we review related work on Target-Dependent Sentiment Classification (TDSC) and Graph Convolutional Networks (GCN).

Conventional neural networks
Traditional approaches mainly focus on designing a set of features to train a classifier (e.g., SVM) for target-dependent sentiment classification [Jiang, Yu, Zhou et al. (2011); Wagner, Arora, Cortes et al. (2014); Kiritchenko, Zhu, Cherry et al. (2014)]. Such methods rely on complex feature engineering, require a lot of manpower and resources, and generalize poorly across domains. Multiple sentiment lexicons have been built for this purpose [Neviarouskaya, Prendinger and Ishizuka (2009); Qiu, Bing, Bu et al. (2009); Taboada, Brooke, Tofiloski et al. (2011)]. With the development of deep learning, neural network models have attracted growing interest for this Natural Language Processing (NLP) task because of their capacity to learn representations from data without feature engineering [Dong, Wei, Tan et al. (2014); Tang, Qin, Feng et al. (2015); Tang, Qin and Liu (2016); Wang, Huang, Zhao et al. (2016); Ma, Li, Zhang et al. (2017); Chen, Sun, Bing et al. (2017); Huang and Carley (2018); Zhang, Wang, Li et al. (2018); Song, Wang, Jiang et al. (2019); Sun, Huang and Qiu (2019)]. The mainstream neural network methods are based on long short-term memory networks [Hochreiter and Schmidhuber (1997)], memory networks [Sukhbaatar, Weston, Fergus et al. (2015)] and the attention mechanism [Bahdanau, Cho and Bengio (2014)]. Recursive neural networks [Dong, Wei, Tan et al. (2014)], gated neural networks [Zhang, Zhang and Vo (2016); Xue and ] and convolutional neural networks [Huang and Carley (2018)] are used relatively rarely in this field. More recently, pre-trained language models, such as ULMFiT [Howard and Ruder (2018)], OpenAI GPT [Radford, Narasimhan, Salimans et al. (2018)], ELMo [Peters, Neumann, Iyyer et al. (2018)] and BERT [Devlin, Chang, Lee et al. (2018)], have shown great power in the semantic expression of text.
In particular, BERT has achieved excellent results in sentence-level sentiment classification. Song et al. [Song, Wang, Jiang et al. (2019)] proposed an Attentional Encoder Network (AEN) without a recurrent structure, which uses attention-based encoders to model the interaction between context and target words. Sun et al. [Sun, Huang and Qiu (2019)] transformed the target-dependent sentiment classification problem into a sentence-pair classification task by constructing an auxiliary question from the given target. Xu et al. [Xu, Liu, Shu et al. (2019)] argued that customer reviews can be turned into a large-scale source of knowledge that can then be used to answer users' questions. They proposed a new task named Review Reading Comprehension (RRC), explored fine-tuning the BERT model to further improve performance on RRC, and then cast the target-dependent sentiment classification problem as a special Machine Reading Comprehension (MRC) problem in which all questions concern the sentiment polarity of the given target. Hazarika et al. [Hazarika, Poria, Vij et al. (2018)] processed multiple targets with an LSTM network in the first stage and then used another LSTM to aggregate each group of features in the second stage, indicating that the target words earlier in the text affect the target words later in the text. Ma et al. [Ma, Zeng, Peng et al. (2019)] introduced positional attention. In the first stage, their model processes the target words one by one, and in the second stage, it integrates the multiple target words of the whole sentence simultaneously.

Graph convolutional networks
As shown in Fig. 1, the left picture is a 2D Convolutional Neural Network (CNN) whose input is a matrix of 4 rows and 4 columns. The convolution over the entire input is realized by gradually moving the convolution kernel. The input on the right is a graph, whose structure and connections are irregular, so convolution cannot be applied to it as in a CNN. To perform convolution on graph-structured data, Defferrard et al. [Defferrard, Bresson and Vandergheynst (2016)] presented a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Defferrard et al. [Defferrard, Bresson and Vandergheynst (2016)] employed Graph Convolutional Networks (GCN) in text classification tasks and outperformed traditional CNN models. Kipf and Welling [Kipf and Welling (2016)] proposed a more general GCN model, which achieved excellent results in experiments on citation networks and knowledge graph datasets. By stacking GCN layers, the hidden states of neighbor nodes at the current step are used as part of the input to generate the hidden state of the center node at the next step, until the hidden state of each node changes very little and the information flow of the whole graph becomes stable. At that point, each node has aggregated the information of its neighbors. GNNs are widely used in various fields, such as relation extraction [Zhang, Qi and Manning (2018)]. Zhaoa et al. [Zhaoa, Houb and Wua (2019)] were among the first to apply Graph Convolutional Networks to fine-grained sentiment classification. They took the target words as the nodes of the graph and proposed two graph construction methods: one connects each node to its left and right adjacent nodes, and the other connects all nodes in pairs globally. Both construction methods include a self-loop on each node (that is, the node is connected to itself with an edge).
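The two graph construction methods described above can be sketched as follows. This is a minimal illustration rather than the code of Zhaoa et al.: the function name `build_target_graph`, the `mode` values, and the assumption that targets are indexed in their order of appearance in the sentence are all hypothetical.

```python
def build_target_graph(num_targets, mode="adjacent"):
    """Build a 0/1 adjacency matrix over the targets of one sentence.

    Targets are assumed to be indexed by their order in the sentence.
    mode="adjacent": each target is linked to its left and right neighbours.
    mode="global":   every pair of targets is linked.
    Both modes add a self-loop on every node.
    """
    if mode not in ("adjacent", "global"):
        raise ValueError(f"unknown mode: {mode}")
    adj = [[0] * num_targets for _ in range(num_targets)]
    for i in range(num_targets):
        for j in range(num_targets):
            if i == j:
                adj[i][j] = 1  # self-loop
            elif mode == "global" or abs(i - j) == 1:
                adj[i][j] = 1  # undirected edge between targets i and j
    return adj


if __name__ == "__main__":
    for row in build_target_graph(3, "adjacent"):
        print(row)
```

With three targets, the adjacent construction leaves the first and last target unconnected, while the global construction yields a fully connected graph.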

Our approach

Problem definition and notations
A target-dependent sentiment classification task usually predicts the sentiment polarity of a tuple $(s, t)$, which consists of a sentence and a target. The difference between multi-target sentiment classification and target-dependent sentiment classification in the general sense is that the former processes all targets in a sentence at the same time, while the latter processes them separately. The sentence

$$s = [w_1, \dots, w_i, \dots, w_n] \tag{1}$$

consists of $n$ words, and the number of targets in each sentence is at least 1 and less than $n$. Each target consists of one or more words and is a subsequence of the sentence $s$; the intersection between any two targets is empty.
After WordPiece [Wu, Schuster and Chen (2016)] segmentation, a target word may be divided into multiple sub-word units. The outputs at the corresponding positions in the last transformer encoding layer are taken as the feature representation of the target, and max-pooling is used to extract the most significant features as the vector representation of the entire target.
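The max-pooling step over sub-word units can be illustrated with a small sketch. It assumes that the encoder output is available as a list of per-token vectors and that the target occupies the half-open token span `[start, end)`; the function name and the toy vectors are hypothetical and only serve the illustration.

```python
def max_pool_target(token_vectors, start, end):
    """Element-wise max over the sub-word vectors of one target.

    token_vectors: list of equal-length float lists, one per sub-word token
                   (standing in for the last encoder layer's output).
    start, end:    half-open token span [start, end) covered by the target.
    """
    span = token_vectors[start:end]
    if not span:
        raise ValueError("target span is empty")
    # zip(*span) groups the values of each feature dimension together.
    return [max(dim_values) for dim_values in zip(*span)]


if __name__ == "__main__":
    # Toy 2-dimensional outputs for three sub-word tokens.
    outputs = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]
    print(max_pool_target(outputs, 1, 3))
```

Pooling yields one fixed-size vector per target regardless of how many sub-word units the target was split into.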

Graph convolutional networks
We construct a graph to capture the sentiment dependencies between multiple targets in one sentence, where each node represents a target and each edge represents a sentiment dependency relation, as shown in Fig. 3. A graph is a set of nodes connected via a set of edges. If two nodes are connected by an edge, they are neighbors of each other. Formally, given a node $v$, we use $N(v)$ to denote all neighbors of $v$. The adjacency matrix $A$ of a graph encodes the graph topology, where each element $A_{ij}$ represents an edge from node $i$ to node $j$. If $A_{ij} = 1$, there is an edge between node $i$ and node $j$; if $A_{ij} = 0$, there is no edge between them.
A GCN layer propagates the node features $h^{(l)}$ at layer $l$ using a function $f(A)$ of the adjacency matrix, and its output is given by

$$h^{(l+1)} = \sigma\left(f(A)\, h^{(l)} W^{(l)} + b^{(l)}\right)$$

where $W^{(l)}$ is the weight matrix and $b^{(l)}$ is the bias, both of which are learned parameters, and $\sigma$ is a nonlinear activation function, for which we use ReLU. $f$ is called the propagation rule. Three rules are commonly used [Li, Tarlow, Brockschmidt et al. (2015); Kipf and Welling (2016); Hamilton, Ying and Leskovec (2017)]:

$$f_1(A) = A, \qquad f_2(A) = D^{-1/2} A D^{-1/2}, \qquad f_3(A) = D^{-1} A$$

where $D$ is the degree matrix, defined as

$$D_{ii} = \sum_j A_{ij}$$

Different propagation rules capture different characteristics of the nodes in the graph. Following Dehmamy et al. [Dehmamy, Barabási and Yu (2019)], we combine three different GCN propagation modules and residual connections in our model.
The outputs of the modules are concatenated and fed into a fully connected layer.
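To make the propagation step concrete, the sketch below implements one GCN layer of the form ReLU(f(A) h W + b) for three propagation rules and concatenates the module outputs. It assumes the rules are f(A) = A, f(A) = D^{-1} A and f(A) = D^{-1/2} A D^{-1/2}; the rule names, the plain-list matrix helpers and the `params` layout are hypothetical, and the residual connection and the final fully connected layer are omitted for brevity.

```python
import math


def propagation_matrix(adj, rule):
    """Return f(A) for one of three propagation rules (rule names are illustrative)."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # D_ii = sum_j A_ij
    if rule == "adjacency":  # f(A) = A
        return [[float(x) for x in row] for row in adj]
    if rule == "mean":  # f(A) = D^{-1} A
        return [[adj[i][j] / deg[i] if deg[i] else 0.0 for j in range(n)]
                for i in range(n)]
    if rule == "symmetric":  # f(A) = D^{-1/2} A D^{-1/2}
        return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] and deg[j] else 0.0
                 for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown rule: {rule}")


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, weight, bias, rule):
    """One layer: ReLU(f(A) h W + b), with h holding one feature row per node."""
    z = matmul(matmul(propagation_matrix(adj, rule), h), weight)
    return [[max(0.0, v + b) for v, b in zip(row, bias)] for row in z]


def combined_layer(adj, h, params):
    """Run one module per rule and concatenate their outputs for each node.

    params maps a rule name to its (weight, bias) pair.
    """
    outputs = [gcn_layer(adj, h, w, b, rule) for rule, (w, b) in params.items()]
    return [sum((out[i] for out in outputs), []) for i in range(len(h))]


if __name__ == "__main__":
    adj = [[1, 1], [1, 1]]          # two targets, fully connected with self-loops
    h = [[1.0], [3.0]]              # one feature per node
    params = {rule: ([[1.0]], [0.0]) for rule in ("adjacency", "mean", "symmetric")}
    print(combined_layer(adj, h, params))
```

On this toy graph the unnormalised rule sums the neighbour features, while the two normalised rules average them, which shows how the rules expose different views of the same neighbourhood.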

Sentiment classification and auxiliary relation
By classifying the relations among nodes, we construct a multi-task joint learning model. In relation classification, we directly predict the relation between every pair of nodes, without relying on the existing adjacency matrix (dependency edges).

Figure 4: An example of how to construct the sentiment relation between nodes
As shown in Fig. 4, the relations we design fall into three types according to the difference in sentiment polarity between nodes: similar, opposite, and others. For each relation $r$, the model learns weight matrices $W_r^1$, $W_r^2$, $W_r^3$ and calculates $S(t_i, r, t_j)$, the relation tendency score for the target pair $(t_i, t_j)$ under relation $r$. We apply the softmax function to $S(t_i, r, t_j)$, yielding $P_r(t_i, t_j)$, which represents the probability of each relation for $(t_i, t_j)$. $P_r(t_i, t_j)$ is used to calculate relation loss 1. Then, a GCN is applied on each relation graph, and the degree of influence between different relations and nodes is aggregated into the comprehensive target feature. In this step, $P_r(t_i, t_j)$ serves as the edge weight, $W_r$ and $b_r$ denote the GCN weight and bias under relation $r$, $T$ includes all targets and $R$ contains all relations.
As illustrated in Fig. 2, we use two types of loss in our model TSR-GCN: sentiment loss and relation loss, both of which are categorical losses. For sentiment loss, we use Positive, Negative, Neutral and Conflict as the ground-truth labels (the conflict label appears only in the four-category task), so every target belongs to one of three or four classes. The ground-truth sentiment labels for sentiment loss 1 and sentiment loss 2 are the same. We use cross-entropy as the categorical loss function during training. For relation loss, we feed in a one-hot relation vector as the ground truth of $P_r(t_i, t_j)$ for each target pair $(t_i, t_j)$. In our model, we design three relations: opposite, similar, and others. If one target is positive and the other is negative, the relation is opposite. If the sentiment polarity of the two targets is the same, the relation is similar. In all other cases, the relation is others. The ground-truth relation vectors for relation loss 1 and relation loss 2 are the same. For relation loss, we also use cross-entropy as the categorical loss function during training.
For both sentiment loss and relation loss, we add a weight parameter to balance the losses of the two stages. Finally, the total loss is calculated as the weighted sum of all sentiment losses and relation losses, where $\alpha$ is the weight parameter. Our goal is to minimize the total loss during model training. In our model, we set $\alpha$ to 3, following Fu et al. [Fu, Li and Ma (2019)].
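The rules for deriving relation labels from sentiment polarities can be sketched as below. The label order in `RELATIONS`, the function names, and the choice to generate ground truth only for ordered pairs of distinct targets are assumptions made for illustration; the softmax helper only shows how three relation scores would be turned into probabilities.

```python
import math

# Assumed label order for the one-hot relation vector.
RELATIONS = ("opposite", "similar", "others")


def relation_label(polarity_a, polarity_b):
    """Map the polarities of two targets to one of the three relations."""
    if {polarity_a, polarity_b} == {"positive", "negative"}:
        return "opposite"
    if polarity_a == polarity_b:
        return "similar"
    return "others"


def one_hot(relation):
    """One-hot ground-truth vector over RELATIONS."""
    return [1 if r == relation else 0 for r in RELATIONS]


def relation_targets(polarities):
    """Ground truth for every ordered pair (i, j) of distinct targets."""
    return {(i, j): one_hot(relation_label(a, b))
            for i, a in enumerate(polarities)
            for j, b in enumerate(polarities) if i != j}


def softmax(scores):
    """Turn relation tendency scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # "place" is positive and "food" is negative in the introductory example.
    print(relation_targets(["positive", "negative"]))
```

Under these rules, a pair involving a neutral or conflict target is labeled "others" unless both targets carry the same polarity.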
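One way to combine the four losses is sketched below. It assumes, following the two-phase formulation of Fu et al., that the first-stage losses are added unweighted and the second-stage losses are scaled by the weight parameter; the source text does not state which stage is weighted, so this placement, like the function names, is an assumption.

```python
import math


def cross_entropy(probs, target_index):
    """Categorical cross-entropy for one example given predicted probabilities."""
    return -math.log(probs[target_index])


def total_loss(sentiment_loss_1, relation_loss_1,
               sentiment_loss_2, relation_loss_2, alpha=3.0):
    """Sum of stage-1 losses plus alpha times the stage-2 losses (assumed form)."""
    stage_1 = sentiment_loss_1 + relation_loss_1
    stage_2 = sentiment_loss_2 + relation_loss_2
    return stage_1 + alpha * stage_2


if __name__ == "__main__":
    s1 = cross_entropy([0.7, 0.2, 0.1], 0)   # stage-1 sentiment prediction
    r1 = cross_entropy([0.6, 0.3, 0.1], 0)   # stage-1 relation prediction
    s2 = cross_entropy([0.8, 0.1, 0.1], 0)   # stage-2 sentiment prediction
    r2 = cross_entropy([0.7, 0.2, 0.1], 0)   # stage-2 relation prediction
    print(total_loss(s1, r1, s2, r2))
```

Weighting the second stage more heavily puts more emphasis on the final predictions than on the intermediate ones.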

Datasets
Tab. 1 shows the statistics of the restaurant and laptop datasets, which come from SemEval-2014 Task 4 [Pontiki, Galanis, Pavlopoulos et al. (2014)]. These two datasets are used in our experiments to verify the validity of our proposed model. A target is labeled as conflict when both positive and negative polarity are expressed towards it in a sentence. It is worth noting that "conflict" samples are few in the datasets, which makes the datasets very unbalanced. So some existing work [Tang, Qin and Liu (2016)

Experiment settings
In our experiment, we use PyTorch [Paszke, Gross, Chintala et al. (2017)] to implement all models and fine-tune the pre-trained BERT model. The hyperparameters used in the experiment are listed in Tab. 2: the dropout rate is 0.1, the batch size is 32, the learning rate is 2e-5, the max sequence length is 128, the max number of epochs is 6, and the size of the hidden layer in GCN is 256.
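For reference, the reported settings can be gathered in a single configuration object. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters reported for the experiments (field names are illustrative)."""
    dropout: float = 0.1
    batch_size: int = 32
    learning_rate: float = 2e-5
    max_seq_length: int = 128
    max_epochs: int = 6
    gcn_hidden_size: int = 256


if __name__ == "__main__":
    print(ExperimentConfig())
```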

Results
We use classification accuracy to measure the performance of our model and previous methods. In Tab. 3, 4-way stands for 4-way classification, i.e., positive, negative, neutral and conflict; 3-way keeps only 3 classes, with conflict data removed from the SemEval-2014 datasets. Results marked with "♭" are from the BERT-PT paper [Xu, Liu, Shu et al. (2019)], those with "‡" are copied from the AEN-BERT paper [Song, Wang, Jiang et al. (2019)], those with "♮" are from GCAE [Xue and ], those with "†" are retrieved from SDGCN-BERT [Zhaoa, Houb and Wua (2019)], those with "♯" are from Hazarika et al. [Hazarika, Poria, Vij et al. (2018)], and those with "℘" are from Ma et al. [Ma, Zeng, Peng et al. (2019)]. "-" indicates that the result was not reported in the original paper. For our method and for re-implementations from others' code, we run the program 10 times with random initialization and report "mean ± std" as the performance. Best and second-best scores in each column are shown in bold and underlined, respectively.
To demonstrate the effectiveness of our model, we compare it to a number of baseline methods, as shown below:
TD-LSTM [Tang, Qin, Feng et al. (2015)] uses two unidirectional LSTM networks to model the text preceding and following the target word, each including the target word itself. The left LSTM runs from the beginning of the clause to the target word, and the right LSTM runs from the end of the clause to the target word. The hidden states of the two LSTM networks are concatenated and then fed to the final classification layer.
ATAE-LSTM [Wang, Huang, Zhao et al. (2016)] concatenates the word embedding of the target word with the word embedding of each word in the sentence as the input of the LSTM network, so that the output of the LSTM network contains information from the target word at every time step. The hidden output at each time step is then concatenated with the word embedding of the target word again to train the attention weights. To some extent, the attention mechanism can capture the importance of different context words for a given target word.
MemNet [Tang, Qin and Liu (2016)] is an end-to-end deep memory network that captures the importance of each context word for a given target word through multiple computation layers.
RAM [Chen, Sun, Bing et al. (2017)] uses a multi-attention mechanism to capture sentiment features between distant words, and then combines the outputs of the multiple attentions through a recurrent neural network, thus enhancing the expressive ability of MemNet.
IAN [Ma, Li, Zhang et al. (2017)] adopts two LSTM networks to model the sentence and the target word respectively, generates attention vectors for the hidden states of the sentence and of the target word in a mutually supervised way, and finally takes the concatenation of the two attention vectors as the input of the classification layer.
GCAE [Xue and Li (2018)] is a convolutional neural network with a gating mechanism. The gated unit composed of Tanh and ReLU can selectively output the sentiment features corresponding to the given target words.
BERT-pair-QA-M [Sun, Huang and Qiu (2019)] uses the given target words to construct an auxiliary question and fine-tunes the BERT model on a sentence-pair classification task.
AEN-BERT [Song, Wang, Jiang et al. (2019)] is an attentional encoder network that avoids recurrence, using attention-based encoders to model the interaction between context and target words.
BERT-PT [Xu, Liu, Shu et al. (2019)] treats sentiment classification for specific target words as a special Machine Reading Comprehension (MRC) problem [Rajpurkar, Zhang, Lopyrev et al. (2016); Rajpurkar, Jia and Liang (2018)], in which all questions concern the sentiment polarity of a given target word.
Hazarika's model [Hazarika, Poria, Vij et al. (2018)] uses an LSTM network to process multiple targets in the first stage, and then uses another LSTM to aggregate each group of features in the second stage, indicating that the target words earlier in the text affect the target words later in the text.
Ma's model [Ma, Zeng, Peng et al. (2019)] introduces positional attention. In the first stage, the model processes the target words one by one, and in the second stage, the model integrates the multiple target words of the whole sentence.
SDGCN-BERT [Zhaoa, Houb and Wua (2019)] takes the target words as the nodes of the graph and proposes two graph construction methods: one connects each node to its left and right adjacent nodes, and the other connects all nodes in pairs globally. A graph convolutional network is then used to model the multiple target words in a sentence at the same time, and a bi-directional attention mechanism based on position encoding is introduced to obtain the representation of specific target words.
TD-BERT [Gao, Feng, Song et al. (2019)] is a BERT-based model that takes the positioned output at the target terms, a straightforward way to incorporate target information.
The model is not only simple but also very effective.
Experimental results are given in Tab. 3. We can see that BERT-based models achieve significantly higher classification accuracy than models based on word embeddings, indicating that BERT is indeed more capable of semantic expression. BERT fully considers the context of the sentence in which the target word is located during training, and its training corpus is larger. Our model TSR-GCN achieves state-of-the-art performance on the Laptop dataset in 4-way classification and on the Restaurant dataset in 3-way classification, and ranks second in the other two classification tasks. This shows that the design idea of our model is feasible, and we expect better results from a more carefully designed auxiliary task such as relation classification.
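The "mean ± std" figures reported for repeated runs can be computed as in the sketch below. It assumes the sample standard deviation; the text does not say whether the sample or population form was used, and the accuracies in the usage example are made-up placeholders, not results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Format the accuracies of repeated runs as 'mean ± std'.

    Uses the sample standard deviation (an assumption); a single run
    is reported with a standard deviation of 0.
    """
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f}"


if __name__ == "__main__":
    # Placeholder accuracies (in percent) for illustration only.
    print(summarize_runs([80.0, 82.0]))
```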

Conclusion
In this paper, we propose the TSR-GCN model to deal with multiple targets in the same sentence simultaneously. The Graph Convolutional Network captures the internal relations between the target words, and an auxiliary relation classification task forms a multi-task learning setup that further improves classification performance. Experiments show that our model achieves very good results compared with other methods on the SemEval-2014 Task 4 datasets.