Target-Guided Structured Attention Network for Target-Dependent Sentiment Analysis

Target-dependent sentiment analysis (TDSA) aims to classify the sentiment of a text towards a given target. The major challenge of this task lies in modeling the semantic relatedness between a target and its context sentence. This paper proposes a novel Target-Guided Structured Attention Network (TG-SAN), which captures target-related contexts for TDSA in a fine-to-coarse manner. Given a target and its context sentence, the proposed TG-SAN first identifies multiple semantic segments from the sentence using a target-guided structured attention mechanism. It then fuses the extracted segments based on their relatedness with the target for sentiment classification. We present comprehensive comparative experiments on three benchmarks with three major findings. First, TG-SAN outperforms the state-of-the-art by up to 1.61% and 3.58% in terms of accuracy and Macro-F1, respectively. Second, it shows a strong advantage in determining the sentiment of a target when the context sentence contains multiple semantic segments. Lastly, visualization results show that the attention scores produced by TG-SAN are highly interpretable.


Introduction
Target-dependent sentiment analysis (TDSA) is an actively studied research topic that aims to determine the sentiment polarity of a text towards a specific target. For example, given the sentence ''the food is so good and so popular that waiting can really be a nightmare'', the target-dependent sentiments of food and waiting are positive and negative, respectively.
The major challenge of TDSA lies in modeling the semantic relatedness between the target and its context sentence (Tang et al., 2016a). Most recent progress in this area benefits from the attention mechanism, which captures the relevance between the target and every other word in the sentence. Based on such word-level correlations, several models have been proposed for constructing target-related sentence representations for sentiment prediction (Wang et al., 2016; Tang et al., 2016b; Liu and Zhang, 2017; Ma et al., 2017).
One important underlying assumption in existing attention-based models is that words can be used as independent semantic units for modeling the context sentence when performing TDSA. This assumption neglects the fact that a sentence is oftentimes composed of multiple semantic segments, where each segment may contain multiple words that collectively express a certain meaning or sentiment. Furthermore, different semantic segments may contribute differently to the sentiment of a given target. Figure 1 shows an example of a restaurant review, which contains two salient semantic segments (highlighted in blue). Intuitively, a TDSA model should be able to identify both segments and determine that the second one is more relevant to the writer's sentiment towards the target [waiting]. Existing methods, however, would only attend to important words (highlighted in red) such as ''good'', ''popular'', ''really'', and ''nightmare'' individually, under the aforementioned assumption.
We hypothesize that the ability to uncover multiple semantic segments and their relatedness with the target from a context sentence will be beneficial for TDSA. In this light, we propose a fine-to-coarse TDSA framework in this paper, namely, the Target-Guided Structured Attention Network (TG-SAN). The core components of TG-SAN include a Structured Context Extraction Unit (SCU) and a Context Fusion Unit (CFU). As opposed to using word-level attention, the SCU utilizes a target-guided structured attention mechanism to encode multiple semantic segments of a sentence as a structured embedding matrix, where each vector in the matrix can be viewed as one target-related context. The CFU then fuses the extracted contexts based on their relatedness with the target to construct the ultimate context representation of the target for sentiment classification.

Figure 1: A motivating example, where darker shades denote higher contributions to the sentiment of the target [waiting]. (a) A TDSA model should be able to identify two salient segments from the sentence, and that the second one is more important for determining the target's sentiment. (b) Existing attention-based models would attend to important words individually and fail to determine their relatedness with the target.
Our contributions are summarized as follows: (1) We propose to uncover multiple semantic segments and their relatedness with the target in a sentence for TDSA.
(2) We devise a novel TG-SAN, which uses a fine-to-coarse framework to produce the context representation of the target. TG-SAN utilizes a target-guided structured attention mechanism to encode a sentence as a matrix with r rows, where each row vector can be viewed as one target-related context. The matrix is further fused into a single context vector by leveraging the contexts' relatedness with the target for sentiment classification.
(3) We empirically demonstrate that TG-SAN outperforms a variety of baselines and the state-of-the-art on three benchmarks, and that it is effective in handling sentences composed of multiple semantic segments. We also present visualization results to reveal the superior explanatory power of the proposed model.

Related Work
Given a target and its context sentence, the major challenge of TDSA lies in identifying target-related contexts in the sentence for determining the target's sentiment. Early work adopted rule-based or statistical methods to solve this problem (Ding et al., 2008; Zhao et al., 2010; Jiang et al., 2011). These methods relied either on handcrafted features, rules, or sentiment lexicons, all of which required massive manual effort. In recent years, neural networks have achieved great success in various fields owing to their strong representation capability. They have also been proven effective in modeling the relatedness between the target and its contexts. Recursive neural networks were first used by Dong et al. (2014) and Nguyen and Shirai (2015) for TDSA. Specifically, the target was first converted into the root node of a parsing tree, and its contexts were then composed based on syntactic relations in the tree. As such approaches rely strongly on dependency parsing, they fall short when analyzing non-standard texts such as comments and tweets, which are commonly used for sentiment analysis.
Another line of work applied recurrent neural networks (RNNs) and their extensions to TDSA for their natural way of encoding sentences in a sequential fashion. For instance, Tang et al. (2016a) utilized two RNNs to individually capture the left and the right contexts of the target, and then combined the two contexts for sentiment prediction. Subsequent work elaborated on this idea by using a gate to weigh the contributions of the two contexts for sentiment prediction. However, such RNN-based methods place more emphasis on the words near the target while ignoring the distant ones, regardless of whether they are target-related.
Recently, attention mechanisms have become widely used for modeling the relatedness between every context word and the target for TDSA (Wang et al., 2016; Liu and Zhang, 2017; Ma et al., 2017). For example, Wang et al. (2016) assigned an attention score to each context word according to its relevance to the target, and combined all context words with their attention scores to constitute the context representation of the target for sentiment classification.
The aforementioned attention-based methods used a single attention layer to capture target-related contexts. One drawback of this has been recently examined by Chen et al. (2017) and Li et al. (2018), who argued that using one layer of attention to attend to all context words may introduce noise and degrade classification accuracy. To alleviate this problem, Chen et al. (2017) proposed refining the attended words in an iterative manner, whereas Li et al. (2018) used a convolutional neural network to extract n-gram features whose contributions were decided by their relative positions to the target in the context sentence.

Figure 2: The architecture of TG-SAN. The SCU (Section 3.3) applies a self-attentive operation on the target memory to obtain a structured target representation $R_t$, which is used to guide the extraction of $r$ target-related segments $R_c$ from the context memory through a structured attention mechanism. The CFU (Section 3.4) generates the target vector $r_t$ through a self-attentive operation on $R_t$, and then learns the contribution of each context to obtain the ultimate context vector $r_c$. Finally, the Output Layer (Section 3.5) composes the context vector and the target vector for predicting the target's sentiment.
To the best of our knowledge, no existing study has explicitly considered uncovering a sentence's semantic segments and learning their contributions to a target's sentiment. We address this problem with a novel target-guided structured attention network in this work.

Approach
We first mathematically formulate the TDSA problem addressed in this paper, and then describe the proposed TG-SAN. Figure 2 depicts the architecture of TG-SAN.

Problem Formulation
A sentence is a sequence of words $S = \{w_1, \ldots, w_i, \ldots, w_L\}$, where $w_i$ is the one-hot representation of a word and $L$ is the length of the sequence. Given a target, the positions of its mentions in $S$ are denoted by $T = \{i_j^1, \ldots, i_j^t, \ldots, i_j^l\}_{j=1}^{m}$, where $l$ is the number of word tokens in the target and $m$ is the number of times the target appears in $S$. $L_t = l \times m$ is therefore the total number of word tokens of the target in the sentence. Note that by allowing $m \geq 1$, our problem formulation explicitly models the situation where the target has multiple mentions in a sentence, whereas existing attention-based TDSA models only addressed the single-mention situation ($m = 1$).
Given a context sentence $S$ and a target's mentions indexed by $T$, our task is to predict the sentiment polarity $y \in O$ of the target, where $O = \{-1, 0, 1\}$ denotes negative, neutral, and positive sentiments, respectively.
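For illustration with a constructed example: in the sentence ''the battery is great and the battery lasts long'' with the target battery, we have $l = 1$ and $m = 2$, $T = \{\{2\}, \{7\}\}$ records the positions of the two mentions, and hence $L_t = 2$.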

Memory Builder
The Memory Builder constructs the target memory and the context memory from the input sentence as follows. A lookup table $E \in \mathbb{R}^{d_e \times |V|}$ is first built to represent the semantics of each word by word vectors, where $d_e$ is the dimension of the word vectors and $|V|$ is the vocabulary size. The one-hot representation of the word sequence $S$ is then converted into a sequence of dense word vectors. A Bi-LSTM layer is placed on top of the word vectors to obtain their contextualized representations. The output of this Bi-LSTM layer is a sequence of hidden states $H \in \mathbb{R}^{L \times 2d_h}$, where $d_h$ denotes the dimension of each hidden state.
The sequence $H \in \mathbb{R}^{L \times 2d_h}$ is further split into a target memory $M_t$ and a context memory $M_c$ according to the positions of the target mentions $T$. $M_t \in \mathbb{R}^{L_t \times 2d_h}$ consists of the representations of the target words, while $M_c \in \mathbb{R}^{L_c \times 2d_h}$ consists of those of the context words, where $L_c = L - L_t$.
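To make this construction concrete, below is a minimal PyTorch sketch of the Memory Builder as described above. The module name MemoryBuilder, the boolean target_mask input, and the single-sentence (batch size 1) setting are our own simplifications, not details from the paper.

```python
import torch
import torch.nn as nn

class MemoryBuilder(nn.Module):
    """Embed words, encode them with a Bi-LSTM, and split the hidden
    states into a target memory M_t and a context memory M_c."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # lookup table E
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, target_mask):
        # word_ids: (L,) word indices; target_mask: (L,) bool tensor that is
        # True at every token position of the target's mentions (all m of them)
        x = self.embed(word_ids).unsqueeze(0)   # (1, L, d_e)
        H, _ = self.bilstm(x)                   # (1, L, 2*d_h)
        H = H.squeeze(0)                        # (L, 2*d_h)
        M_t = H[target_mask]                    # (L_t, 2*d_h)
        M_c = H[~target_mask]                   # (L_c, 2*d_h)
        return M_t, M_c
```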

Structured Context Extraction Unit (SCU)
Given the target memory and the context memory, the next step is to extract the target-related segments, which may appear in different parts of the context sentence. Recently, Lin et al. (2017) proposed a structured self-attention mechanism, which represents a sentence as multiple semantic segments, and applied this mechanism successfully to document-level sentiment analysis. In TDSA, however, not all semantic segments are related to the target. We therefore build on the idea of Lin et al. (2017) to devise an SCU, which is able to capture target-related segments as the contexts for determining the target's sentiment.

Structured target representation. The target memory $M_t$ is converted into a structured representation using the self-attentive operation (Lin et al., 2017) as follows:

$$A_t = \mathrm{softmax}\big(W_t^2 \tanh(W_t^1 M_t^\top)\big) \quad (4)$$

$$R_t = A_t M_t \quad (5)$$

where $A_t \in \mathbb{R}^{r \times L_t}$ is a weight matrix and $R_t \in \mathbb{R}^{r \times 2d_h}$ is the embedding matrix representing the target. $W_t^1$ and $W_t^2$ are the two parameters of the self-attentive layer. $r$ is a hyper-parameter referring to the number of rows in the target matrix; in other words, $r$ represents the number of structured representations transformed from the target memory $M_t$.
Following Lin et al. (2017), a penalization term $P$ is used in the loss function to encourage the diversity of the rows captured in $R_t$:

$$P = \big\Vert A_t A_t^\top - I \big\Vert_F^2 \quad (6)$$

where $I$ is the identity matrix and $\Vert \cdot \Vert_F$ denotes the Frobenius norm of a matrix.
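The self-attentive operation and the penalty of Equation (6) can be sketched in PyTorch as follows; the class name, the attn_dim hyper-parameter, and the single-instance setting are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """Self-attentive operation of Lin et al. (2017), Equations (4)-(6):
    maps a memory M (L x 2*d_h) to a structured matrix R (r x 2*d_h)."""
    def __init__(self, feat_dim, attn_dim, r):
        super().__init__()
        self.W1 = nn.Linear(feat_dim, attn_dim, bias=False)   # W_t^1
        self.W2 = nn.Linear(attn_dim, r, bias=False)          # W_t^2

    def forward(self, M):
        # A: (r, L) row-wise attention over the memory, Eq. (4)
        A = F.softmax(self.W2(torch.tanh(self.W1(M))).t(), dim=-1)
        R = A @ M                                             # (r, 2*d_h), Eq. (5)
        # Frobenius-norm penalty encouraging diverse rows, Eq. (6)
        I = torch.eye(A.size(0), device=A.device)
        P = torch.norm(A @ A.t() - I, p='fro') ** 2
        return R, A, P
```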
Target-guided context extraction. Given the target matrix $R_t$, target-related semantic segments are uncovered from the context memory $M_c$ as follows. A matrix $A_c \in \mathbb{R}^{r \times L_c}$ is first built to capture the relatedness between the target matrix and the context memory using a bilinear attention operation. It is then used to build a context matrix $R_c \in \mathbb{R}^{r \times 2d_h}$, where each row of the matrix can be viewed as a target-related semantic segment:

$$A_c = \mathrm{softmax}\big(R_t W_c M_c^\top\big) \quad (7)$$

$$R_c = A_c M_c \quad (8)$$

where $W_c$ is the parameter of the bilinear attention operation.
A feed-forward network is further placed on top of the context matrix $R_c$ to produce its transformed representation $\tilde{R}_c$. A residual connection (He et al., 2016) is then used to compose both matrices into the final structured context representation $\hat{R}_c$:

$$\tilde{R}_c = \mathrm{ReLU}\big(R_c W_f^1 + b_f^1\big) W_f^2 + b_f^2 \quad (9)$$

$$\hat{R}_c = \mathrm{LayerNorm}\big(R_c + \tilde{R}_c\big) \quad (10)$$

where $W_f^1$, $W_f^2$, $b_f^1$, and $b_f^2$ are the learnable parameters of the feed-forward network. The layer normalization (Ba et al., 2016) used in Equation (10) helps to prevent gradients from vanishing or exploding.
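A sketch of the target-guided extraction step of Equations (7)-(10) follows; the parameter names (Wc, ffn) and the inner dimension ffn_dim are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetGuidedExtraction(nn.Module):
    """Bilinear target-guided attention (Eqs. (7)-(8)) followed by a
    feed-forward network with a residual connection (Eqs. (9)-(10))."""
    def __init__(self, feat_dim, ffn_dim):
        super().__init__()
        self.Wc = nn.Parameter(torch.empty(feat_dim, feat_dim))
        nn.init.uniform_(self.Wc, -0.01, 0.01)
        self.ffn = nn.Sequential(nn.Linear(feat_dim, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, feat_dim))
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, R_t, M_c):
        # R_t: (r, 2*d_h) target matrix; M_c: (L_c, 2*d_h) context memory
        A_c = F.softmax(R_t @ self.Wc @ M_c.t(), dim=-1)  # (r, L_c), Eq. (7)
        R_c = A_c @ M_c                                   # (r, 2*d_h), Eq. (8)
        R_hat = self.norm(R_c + self.ffn(R_c))            # Eqs. (9)-(10)
        return R_hat, A_c
```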

Context Fusion Unit (CFU)
The CFU learns the contributions of the different extracted contexts to the target's sentiment, and produces the ultimate context vector of the target. Specifically, a self-attentive operation is utilized to fuse the target matrix $R_t$ into a target vector $r_t$:

$$a_t = \mathrm{softmax}\big(w_m^2 \tanh(W_m^1 R_t^\top)\big) \quad (11)$$

$$r_t = a_t R_t \quad (12)$$

where $w_m^2$ and $W_m^1$ are learnable parameters. Given the target vector $r_t$, the contribution of each context is then learned to produce the ultimate context vector $r_c \in \mathbb{R}^{2d_h}$:

$$r_c = \sum_{i=1}^{r} \alpha_i \hat{R}_c[i] \quad (13)$$

$$\alpha = \mathrm{softmax}\big(r_t U \hat{R}_c^\top\big) \quad (14)$$

where $U$ is a weight matrix, $\hat{R}_c[i] \in \mathbb{R}^{2d_h}$ represents the $i$-th target-related context, and $\alpha_i$ denotes its normalized contribution score.
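The CFU can be sketched in the same style; the parameter names below (W1m, w2m, U) mirror Equations (11)-(14) but are our own identifiers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFusion(nn.Module):
    """Fuse R_t into the target vector r_t (Eqs. (11)-(12)), then weight
    the extracted contexts by their relatedness to it (Eqs. (13)-(14))."""
    def __init__(self, feat_dim, attn_dim):
        super().__init__()
        self.W1m = nn.Linear(feat_dim, attn_dim, bias=False)  # W_m^1
        self.w2m = nn.Linear(attn_dim, 1, bias=False)         # w_m^2
        self.U = nn.Parameter(torch.empty(feat_dim, feat_dim))
        nn.init.uniform_(self.U, -0.01, 0.01)

    def forward(self, R_t, R_hat):
        # R_t: (r, 2*d_h) target matrix; R_hat: (r, 2*d_h) context matrix
        a_t = F.softmax(self.w2m(torch.tanh(self.W1m(R_t))).t(), dim=-1)
        r_t = (a_t @ R_t).squeeze(0)                 # (2*d_h,), Eqs. (11)-(12)
        alpha = F.softmax(r_t @ self.U @ R_hat.t(), dim=-1)  # (r,), Eq. (14)
        r_c = alpha @ R_hat                          # (2*d_h,), Eq. (13)
        return r_t, r_c, alpha
```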

Output Layer and Model Training
Consider the examples (a) ''It takes a long time to boot up'' and (b) ''The battery life is long''. Although both targets (in italics) have similar contexts, their sentiment orientations are completely different. It is therefore necessary to consider the target itself, along with its contexts, when predicting its sentiment.
In the output layer, the context vector $r_c$ and the target vector $r_t$ are concatenated and transformed via a non-linear function. The transformed vector is further combined with $r_c$ to build the final feature vector $r_{ct}$:

$$r_{ct} = f\big(W_o [r_c; r_t] + b_o\big) + r_c \quad (15)$$

where $f(\cdot)$ denotes a non-linear activation function; the ReLU function is adopted in this paper. A softmax layer is then applied to convert the feature vector into a probability distribution:

$$q = \mathrm{softmax}\big(W_q r_{ct} + b_q\big) \quad (16)$$

where $W_q \in \mathbb{R}^{|O| \times 2d_h}$ and $b_q \in \mathbb{R}^{|O|}$ are parameters of the softmax layer. For $D$ training instances, the cross-entropy loss with an $L_2$ regularization term is adopted as the loss function:

$$J(\theta) = -\sum_{i=1}^{D} \log q_i + \lambda_1 \sum_{i=1}^{D} P_i + \lambda_2 \Vert \theta \Vert_2^2 \quad (17)$$

where $y_i$ is the true sentiment label, $q_i$ is the predicted probability of the true label, $\theta$ is the set of parameters of TG-SAN, $\lambda_1$ and $\lambda_2$ are regularization coefficients, and $P_i$ is the penalization term for the $i$-th training instance (see Equation (6)).
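The output layer and the training loss can be sketched as follows; the residual composition in Equation (15) follows our reading of the text, and the helper loss_fn is our own (the $\lambda_2$ term is typically realized as optimizer weight decay).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputLayer(nn.Module):
    """Compose r_c and r_t into the feature vector r_ct (Eq. (15)) and
    produce class logits (Eq. (16); softmax is folded into the loss)."""
    def __init__(self, feat_dim, num_classes=3):
        super().__init__()
        self.proj = nn.Linear(2 * feat_dim, feat_dim)  # acts on [r_c ; r_t]
        self.clf = nn.Linear(feat_dim, num_classes)    # W_q, b_q

    def forward(self, r_c, r_t):
        r_ct = F.relu(self.proj(torch.cat([r_c, r_t], dim=-1))) + r_c
        return self.clf(r_ct)

def loss_fn(logits, label, P, lambda_1=0.1):
    """Cross-entropy plus the attention penalty of Eq. (6), as in Eq. (17)."""
    return F.cross_entropy(logits.unsqueeze(0), label.view(1)) + lambda_1 * P
```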

Experimental Setup Datasets
We evaluate the proposed TG-SAN on three public benchmark datasets, namely, Tweet, Laptop, and Restaurant. The Tweet dataset contains tweets collected from Twitter (Dong et al., 2014). The Laptop and Restaurant datasets are from the SemEval 2014 challenge (Pontiki et al., 2014), containing customer reviews on laptops and restaurants, respectively. Following previous studies, we discarded data instances labeled as ''Conflict'' in the Laptop and Restaurant datasets. Table 1 summarizes the statistics of the datasets. We use classification accuracy and macro-F1 as the evaluation metrics in all experiments.

Compared Models
To demonstrate the ability of the proposed model, we compare it with three baseline approaches, four attention-based models, and the state-of-the-art.
SVM (Kiritchenko et al., 2014): This was a top-performing system in SemEval 2014. It utilized various types of handcrafted features to build an SVM classifier.
AdaRNN (Dong et al., 2014): This utilized a recursive neural network based on dependency tree structure to iteratively compose target-related contexts from a sentence for sentiment classification.
TD-LSTM (Tang et al., 2016a): This employed two LSTMs to separately model the left and the right contexts of a given target, and concatenated their last hidden states to predict the target's sentiment.
ATAE-LSTM (Wang et al., 2016): This used an LSTM layer to model a sentence, and an attention layer to produce a weighted representation of the sentence with respect to a given target.
IAN (Ma et al., 2017): This used two LSTMs to separately model the sequence of target words and that of context words in a sentence. It then applied an interactive attention mechanism to capture the relatedness between the target and its context for sentiment classification.
MemNet (Tang et al., 2016b): This applied multiple hops of attention on the word embeddings of the context sentence, and treated the output of the last hop as the final representation of the target.
RAM (Chen et al., 2017): This proposed a recurrent neural attention mechanism to iteratively refine the context representation, and took the combination of all constructed contexts as the final representation for sentiment classification.
TNet (Li et al., 2018): This is the state-of-the-art in target-dependent sentiment analysis. It first transformed words considering their positions relative to the target, and then used a convolutional neural network to extract n-gram features from the context sentence for sentiment classification. Note that the published results of TNet were based on the authors' implementation, which contained a bug in data preprocessing. We fixed the identified bug, retrained the TNet model with the parameters suggested in the work of Li et al. (2018), and report the revised results in this paper for empirical comparison.

Experimental Settings
As no standard validation set is available for the benchmark datasets, we randomly held out 20% of the training set as the validation set for tuning the hyper-parameters of TG-SAN. Settings producing the highest validation accuracy are listed in Table 2, and are adopted in the subsequent experiments unless otherwise specified.

Parameter                                     Value
Word embedding dimension $d_e$                300
LSTM hidden dimension $d_h$                   150
Dropout rate                                  0.5
No. of structured representations $r$         2
Penalization term coefficient $\lambda_1$     0.1
Regularization term coefficient $\lambda_2$   $10^{-6}$
Batch size                                    64

Table 2: Hyper-parameter settings of TG-SAN.
Word embeddings were initialized with pre-trained GloVe vectors (Pennington et al., 2014) and fixed during the training process. The recurrent weight matrices were initialized with random orthogonal matrices. All other weight matrices were initialized by randomly sampling from the uniform distribution $U(-0.01, 0.01)$. All bias vectors were initialized to zero. RMSProp was used for network training, with the learning rate set to 0.001 and the decay rate to 0.9. Dropout (Srivastava et al., 2014) and early stopping were adopted to alleviate overfitting. Dropout was applied to the inputs of the Bi-LSTM layer and the output layer, with the dropout rate shown in Table 2.
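Under these settings, the optimizer configuration might look like the following sketch, where model stands for the assembled TG-SAN network and $\lambda_2$ is realized as weight decay.

```python
import torch

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.001,           # learning rate
    alpha=0.9,          # RMSProp decay rate
    weight_decay=1e-6,  # lambda_2, the L2 regularization coefficient
)
```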

Main Results
We report the experimental results of TG-SAN (r = 2) and the compared models in Table 3. In summary, TG-SAN outperforms all compared models on the Tweet and Restaurant datasets. On the Laptop dataset, it also achieves the best accuracy among all models, and a macro-F1 comparable to that of the best-performing model, RAM. These results demonstrate the efficacy of the proposed TG-SAN. We also observe that the attention-based models perform better than the baseline models in general. This is not surprising, as different context words can be of different importance to the sentiment of a target, a phenomenon that is naturally captured by the attention mechanism. TNet and RAM are the most competitive among all compared models, which can be attributed to their efforts in alleviating the noise produced by using a single layer of attention, as already shown in previous studies. However, we observe that their prediction abilities vary across datasets: RAM performs better than TNet on Laptop and Restaurant, and vice versa on Tweet. In contrast, TG-SAN produces satisfactory performance consistently on all datasets, demonstrating the capability of the proposed fine-to-coarse attention framework in capturing the semantic relatedness between the target and the context sentence for TDSA. To conclude, we validated the efficacy of TG-SAN through comparative experiments. The advantage of TG-SAN over existing methods confirms our hypothesis that semantic segments are the basic units for understanding target-dependent sentiments. It also shows that such segments can be effectively captured by the proposed target-guided structured attention mechanism.

Table 3: Comparison of accuracy and macro-F1 among different models. Results marked with ♯ are adopted from previous work, and those with * are adopted from the original papers. Performance improvements of the proposed TG-SAN model over the state-of-the-art, TNet (Li et al., 2018), are statistically significant at p < 0.01.

Ablation Studies
Three ablation models are designed to reveal the effectiveness of each component of TG-SAN.
w/o CFU: This ablation model uses the SCU to capture target-related segments in a sentence, and averages all context vectors to constitute the vector r c in Equation (13) without distinguishing their different contributions.
w/o SCU & CFU: In this ablation model, the combination of SCU and CFU is replaced by a simple attention layer. Specifically, the target is represented as the averaged vector of the target memory. It is then utilized to attend the most relevant words in the context sentence to build the context vector. In the output layer, the context vector and the target vector are both composed for sentiment prediction.
w/o TG: In this ablation model, the guidance of the target in the SCU is removed to explore the effect of the target on context extraction. Hence, the SCU is reduced to the one proposed by Lin et al. (2017), which extracts semantic segments from the sentence using the self-attentive mechanism.

Table 3 reports the results of the three ablation models. We observe that performance degrades when the attention layer capturing the contributions of contexts is removed in w/o CFU. This indicates that some contexts are indeed more important than others in deciding the sentiment of a target, and that the difference is well captured by the CFU. The results also show that the use of the SCU is crucial. Comparing w/o CFU and w/o SCU & CFU, the macro-F1 of the latter drops drastically by 1.66%, 4.83%, and 2.29% on Tweet, Laptop, and Restaurant, respectively. Furthermore, results worsen when the target's guidance is removed and the SCU falls back to the plain self-attentive mechanism, as in w/o TG. This indicates that not all semantic segments appearing in the sentence are related to the target, and that it is necessary to extract the related ones for TDSA.

Effects of r
One important hyper-parameter in TG-SAN is r, which refers to the number of structured representations extracted from the context sentence. In this experiment, we vary the value of r from 1 to 5 to investigate its effect on the TDSA task. It is worth noting that the attention mechanism of the model degenerates into simple attention when r is set to 1. Table 4 reports the results. TG-SAN performs best with r = 2 on the Tweet and Laptop datasets, and with r = 4 on the Restaurant dataset. In general, we conclude that the best setting of r is always greater than 1. This demonstrates that multiple contexts are indeed beneficial for predicting target-dependent sentiments, and that they are well captured by the structured attention mechanism. We also observe that when r > 1, model performance may decrease as r increases. The reason might be that a growing r increases the complexity of the model, making it more difficult to train and less generalizable.

Studies on Multi-segment Sentences
To better understand the advantage of structured attention in TDSA, we further examine a specific group of instances containing multiple semantic segments. Specifically, each instance considered in this experiment contains either multiple different targets or multiple mentions of the same target. We identified in total 38, 382, and 825 such instances in the Tweet, Laptop, and Restaurant datasets, respectively. It is worth noting that multi-segment instances are particularly common in Laptop and Restaurant, accounting for 59.78% and 73.79% of all instances, respectively.
In this experiment, we compare TG-SAN with two models relying on a simple attention mechanism. One is its degenerated version with r = 1, and the other is a baseline model (w/o SCU & CFU). Table 5 reports the comparative results.
We observe that TG-SAN outperforms the other two models on all datasets. This demonstrates that the structured attention mechanism provides a richer context representation ability to identify the target-related contexts more effectively, which is in line with our motivation.

Case Studies
We demonstrate through case studies that TG-SAN produces not only superior classification performance, but also highly interpretable results. Figure 3 presents test instances covering three different situations: (1) multiple targets, multiple segments; (2) single target, multiple segments; and (3) single target, single segment. For each instance, we plot a heat map to visualize the attention results produced by TG-SAN and a baseline model (w/o SCU & CFU) for comparison. Note that the attention score of each word in TG-SAN is produced by the product of the context weights $\alpha \in \mathbb{R}^r$ (see Equation (14)) and the word contributions of each context $A_c \in \mathbb{R}^{r \times L_c}$ (see Equation (7)), denoted by $\alpha^\top A_c$.

Figure 3: Visualization results (best viewed in color). Targets are shown in square brackets. Positive and negative sentiments are highlighted in red and green, respectively. In the visualized attention results, the darker the shading of a word, the higher the attention weight it receives from the corresponding model. In general, TG-SAN demonstrates stronger interpretability than the baseline model. It effectively uncovers all sentiment-related contexts in each case, and identifies the most important ones with respect to a specific target. In contrast, contexts captured by the baseline model are incomplete and inaccurate, as can be seen from the attention results it generates for ''waiting'' in sentence (1) and ''google'' in sentence (2).
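As a small illustration (our own snippet, reusing variable names from the earlier sketches), these word-level heat-map scores can be computed as:

```python
# alpha: (r,) context weights from Eq. (14); A_c: (r, L_c) from Eq. (7)
word_scores = alpha @ A_c   # (L_c,), i.e., the product alpha^T A_c
```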
The visualization results show that TG-SAN has a strong ability to uncover semantic segments in a sentence. It can also effectively identify the relatedness between a segment and a given target. For example, sentence (1) contains two segments expressing opposite sentiments towards the targets ''food'' and ''waiting''. TG-SAN identifies both segments, and places more emphasis on the segment ''so good'' (respectively, ''nightmare'') when predicting the sentiment of ''food'' (respectively, ''waiting''). In contrast, whereas the baseline model identifies all sentiment-related words, it fails to accurately determine the relatedness between each word and the target. As a result, it produces a wrong sentiment prediction for ''waiting''. Similar observations can be made for sentence (2): TG-SAN explicitly captures two target-related segments, whereas the baseline model identifies only one. In case (3), we observe that even when a context sentence contains only one target-related segment, TG-SAN still produces a reasonable explanation for its prediction.

Conclusions and Future Work
In this paper, we develop a novel Target-Guided Structured Attention Network (TG-SAN) for target-dependent sentiment analysis (TDSA). As opposed to the simple word-level attention mechanism used by existing models, TG-SAN uses a fine-to-coarse attention framework to uncover multiple target-related contexts and then fuses them based on their relatedness with the target for sentiment classification. The effectiveness of TG-SAN is validated through comprehensive experiments on three public benchmark datasets. It also demonstrates a superior ability in handling multi-segment sentences, which contain multiple targets or multiple mentions of the same target. In addition, the attention results it produces are highly interpretable, as shown by the visualization results.
As future work, we may extend this study in two directions. First, the SCU is currently applied once to extract target-related contexts from a sentence, but extending the fine-to-coarse framework through the iterative use of multiple SCUs is also feasible from a modeling perspective. Second, we would like to explore the effectiveness of our model in other tasks where semantic relatedness plays an important role as it does in TDSA, such as answer sentence selection in question answering.