Attention enhanced capsule network for text classification by encoding syntactic dependency trees with graph convolutional neural network

Text classification is a fundamental task in many applications such as topic labeling, sentiment analysis, and spam detection. The syntactic relationships and word sequence of a text are important and useful for text classification, and how to model and incorporate them to improve performance is one key challenge. Inspired by how humans understand text, in this paper we combine syntactic relationships, sequence structure, and semantics for text representation, and propose an attention-enhanced capsule network-based text classification model. Specifically, we use a graph convolutional neural network to encode syntactic dependency trees, build multi-head attention to encode dependency relationships in the text sequence, and finally merge them with semantic information through a capsule network. Extensive experiments on five datasets demonstrate that our approach can effectively improve the performance of text classification compared with state-of-the-art methods. The results also show that the capsule network, graph convolutional neural network, and multi-head attention have an integration effect on text classification tasks.


INTRODUCTION
Text classification is a basic task of text analysis with broad applications in topic labeling, sentiment analysis, and spam detection. Text representation is one important basis for classification. From the perspective of text structure, a text is a sequence of words arranged by certain rules, and sequences with different word orders convey different meanings, so the sequence structure of a text carries important semantic information. Moreover, words in a text are contextual: long-distance dependencies between words in the sequence structure also affect the meaning of the text. From the perspective of text composition, a text is composed of words or phrases with different syntactic functions connected by certain syntactic relations. For humans, syntactic relations are the basis for writing and reading texts; from the syntactic relations we can identify the subject, predicate, and object of a sentence and quickly grasp its semantics. For both text representation and the classification built on it, the syntactic relationships and word sequence of a text are therefore important and useful, and how to model and incorporate them to improve performance is one key challenge. In this paper, we combine syntactic relationships, sequence structure, and semantics for text representation, and propose a novel model that uses a GCN for syntactic relationships and multi-head attention for word dependencies, and incorporates them into a capsule network for text classification.
The contributions of this paper can be summarized as follows: (1) We incorporate syntactic relationships, sequence structure, and semantics for text representation. (2) We introduce a GCN to extract syntactic information for dependency relationship representation. (3) We build multi-head attention to encode the different influences of words and enhance the effect of capsule networks on text classification. (4) We show that CapsNet, GCN, and multi-head attention have an integration effect for text classification.

Deep learning-based methods
With the introduction of distributed word vector representations (word embeddings) (Mikolov et al., 2013; Pennington, Socher & Manning, 2014), neural network-based methods have substantially improved the performance of text classification by encoding text semantics. CNNs were first applied to image processing. Kim (2014) proposed the CNN-based text classification model TextCNN, which uses convolution filters to extract local semantic features and improved upon the state of the art on four out of seven tasks. Zhang, Zhao & Lecun (2015) proposed a character-level CNN model, which extracts semantic information from character-level raw signals for text classification. Conneau et al. (2017) proposed very deep convolutional networks to learn hierarchical representations for text classification. Being a spatially sensitive model, CNN pays a price for the inefficiency of replicating feature detectors on a grid. Recently, Sabour, Frosst & Hinton (2017) proposed CapsNet, which uses vector neural units and a dynamic routing update mechanism, and verified its superiority in image classification. Zhao et al. (2018) proposed a CapsNet-based text classification model (Capsule-A), which adopts CapsNet to encode text semantics, and showed that its classification effect is superior to CNN and LSTM. In CapsNet, a feature is represented by a capsule vector instead of a scalar (the activation value output by a neuron), so different dimensions of the vector can represent different properties of a feature. A text feature often carries different meanings under different semantic relations; we use capsules to represent text features in order to learn the semantic information along the different dimensions of a feature. On the other hand, the similarity between features at different levels varies: when building a high-level feature, lower-level features with high similarity to it receive higher weights, and CapsNet can learn this similarity relationship through the dynamic routing algorithm. Although CapsNet can effectively improve coding efficiency, it still has limitations in recognizing text with semantic transitions.

Figure 1: An example of a syntactic dependency tree for the sentence "The monkey eats an apple", where 'monkey' is the subject (nsubj) of the predicate 'eats' and 'apple' is its object (dobj). DOI: 10.7717/peerj-cs.831/fig-1
Attention mechanisms are widely used in tasks such as machine translation (Vaswani et al., 2017) and speech recognition (Chorowski et al., 2014). Lin et al. (2017) proposed a self-attention mechanism that can encode long-range dependencies. Vaswani et al. (2017) proposed the Transformer, a machine translation model based on multi-head attention. BERT, a pre-trained language representation model (Alaparthi & Mishra, 2020), also takes multi-head attention as its basic component. The basic unit of multi-head attention is scaled dot-product attention; multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. Attention extracts the long-distance dependencies in a text by computing the similarities between its words. A word can express different meanings in different semantic scenarios, and its representation differs across semantic spaces. Similar to ensemble learning, multi-head attention can place the text in different semantic spaces, compute attention in each, and obtain an integrated attention value. However, position information cannot be obtained by the attention mechanism alone, and position information also has an important influence on understanding text semantics. Kim, Lee & Jung (2018) proposed a text sentiment classification model combining attention and CNN, but it is still limited by the disadvantages of CNN. Although BERT has been particularly effective on many tasks, it requires a large amount of data and computing resources for pre-training; therefore, our research remains valuable.

Syntactic information-based methods
CNN, RNN, and most other deep learning-based methods rely on the local topology of words to represent a text, but word order, semantics, and syntactic information all have important influences on text classification, and some researchers have exploited text syntactic information for different tasks. A text is made up of words that play different syntactic roles, such as subject, predicate, and object, and these syntactic elements are interdependent. A syntactic dependency tree is a tree structure that describes the dependency relationships between words; Figure 1 shows an example. There are many tools (such as StanfordNLP) that generate syntactic dependency trees by analyzing syntactic dependency relations, as illustrated in the sketch below.
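For concreteness, the following is a minimal sketch of producing a dependency parse with Stanza, the Python successor to the StanfordNLP package; the package name and API shown here are our illustration of such tooling, not the exact setup of the experiments.

```python
# Minimal sketch: generate a syntactic dependency tree with Stanza.
import stanza

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("The monkey eats an apple")
for word in doc.sentences[0].words:
    # word.head is the 1-based index of the head word (0 = root)
    print(word.id, word.text, "->", word.head, word.deprel)
```

For the example sentence, this prints the dependency arcs of Fig. 1: 'monkey' attaches to 'eats' as subject and 'apple' as object.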
A syntactic dependency tree is also a kind of graph data. Bastings et al. (2017) used a GCN to encode syntactic dependency trees and combined it with a CNN for machine translation. Related work used a GCN to encode syntactic dependency trees into word representations and combined it with an LSTM for the role labeling task. These works show that GCN can effectively extract the syntactic information in syntactic dependency trees. Syntactic relationships are presented as a tree structure, and a tree is a form of graph. Sequence-based models must first linearize the syntax tree into a sequence according to some node ordering, but a sequence is directional while the nodes of a tree are not sequential; a GCN can instead consider the relationships between the nodes of a syntax tree directly.
In summary, we propose a novel model named Syntax-AT-CapsNet that uses multi-head attention to extract long-distance dependency information and a GCN to encode syntactic dependency trees to extract syntactic information, which together enhance the effect of capsule networks on text classification tasks.

SYNTAX-AT-CAPSNET MODEL
Our Syntax-AT-CapsNet model consists of the following three modules, as depicted in Fig. 2.
Attention module. It is composed of an attention layer that adopts multi-head attention; it encodes the dependency relationships between words in the text sequence and important word information to form a text representation. Syntax module. It is composed of a GCN; it encodes the syntactic dependency tree and extracts the syntactic information in the text to form a text representation. Capsule network module. It is a five-layer capsule network; based on the text representations output by the attention module and the syntax module, it further extracts semantic and structural information to classify the text.

Input
The input of the Syntax-AT-CapsNet model is defined as the sentence matrix $X = [x_1, x_2, \ldots, x_L] \in \mathbb{R}^{L \times d}$ (1), where $x_i \in \mathbb{R}^d$ is the word vector of the $i$-th word in the sentence, $L$ is the length of the sentence, and $d$ is the embedding size of the words.
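As a minimal illustration (our own, using a stand-in random embedding table in place of real word2vec vectors), the sentence matrix can be assembled as follows:

```python
# Minimal sketch: build the sentence matrix X in R^{L x d}.
import numpy as np

d = 300
words = "The monkey eats an apple".split()
rng = np.random.default_rng(0)
# Stand-in embedding table; in practice these would be pre-trained vectors.
word_vectors = {w.lower(): rng.normal(size=d) for w in words}

X = np.stack([word_vectors[w.lower()] for w in words])
assert X.shape == (len(words), d)  # (L, d)
```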

Attention module
The Attention module is shown in Fig. 3. The calculation of attention in this module can be divided into five steps.

First step, linearly transform the input sentence matrix $X$ and divide the result into three matrices:

$[Q, K, V] = \mathrm{split}(X W) \quad (2)$

where $W \in \mathbb{R}^{d \times 3d}$ is the transform matrix and $\mathrm{split}$ denotes the division operation.

Second step, linearly project the matrices $Q, K, V$ onto $h$ different linear subspaces:

$Q_i = Q W_i^{Q}, \quad K_i = K W_i^{K}, \quad V_i = V W_i^{V}, \quad i = 1, \ldots, h \quad (3)$

The purpose of this step is to compute multiple attention values in parallel; at the same time, the dimension of the input matrix is reduced, which relieves the computational pressure caused by the repeated calculations.

Third step, calculate the attention on each subspace in parallel:

$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i \quad (4)$

where $\mathrm{head}_i$ is the attention value on the $i$-th subspace and $\mathrm{softmax}$ denotes the softmax function (Vaswani et al., 2017). In fact, $Q_i$ and $K_i$ represent the sentence matrix on the subspace; the dot product is divided by $\sqrt{d}$ to prevent it from becoming too large. The weights over the sentence matrix $V_i$ on the subspace are obtained by computing the dot product of $Q_i$ and $K_i$ and applying the softmax function.

Fourth step, concatenate the attention values of all subspaces and obtain the attention value of the entire sentence through a linear transformation:

$\mathrm{Multi\_head} = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_M \quad (5)$

where $W_M \in \mathbb{R}^{d \times d}$ is the transform matrix, $\mathrm{Multi\_head}$ is the attention value of the entire sentence, and $\mathrm{concat}$ denotes the concatenation operation.

Final step, connect the attention value $\mathrm{Multi\_head}$ to the original sentence matrix $X$ to obtain the sentence matrix output by the module:

$X_1 = \mathrm{residual\_connect}(X, \mathrm{Multi\_head}) = X + \mathrm{Multi\_head} \quad (6)$

where $X_1 \in \mathbb{R}^{L \times d}$ is the output of the attention module and $\mathrm{residual\_connect}$ denotes the residual connection operation. A code sketch of this module is given below.
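The following is a minimal PyTorch sketch of the five steps above. The per-head dimension d/h and the per-head scaling factor are common conventions we assume here; Eq. (4) above writes the scale as the square root of d.

```python
# Minimal sketch of the Attention module: single QKV projection,
# h-head scaled dot-product attention, and a residual connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionModule(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0
        self.h, self.dk = h, d // h
        self.qkv = nn.Linear(d, 3 * d)  # W in R^{d x 3d}
        self.wm = nn.Linear(d, d)       # W_M in R^{d x d}

    def forward(self, x):               # x: (L, d)
        L, d = x.shape
        q, k, v = self.qkv(x).split(d, dim=-1)                 # step 1
        # step 2: project onto h subspaces by reshaping per head
        q, k, v = (t.view(L, self.h, self.dk).transpose(0, 1)
                   for t in (q, k, v))
        # step 3: scaled dot-product attention on each subspace
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5
        heads = F.softmax(scores, dim=-1) @ v                  # (h, L, dk)
        # step 4: concatenate heads and transform with W_M
        multi = self.wm(heads.transpose(0, 1).reshape(L, d))
        return x + multi                                       # step 5: residual
```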

Syntax module
The Syntax module is shown in Fig. 4. It uses a GCN to encode syntactic dependency trees, which encodes the syntactic relationships between the words of a text into the word vectors. The module first uses a natural language processing tool (we adopted StanfordNLP) to generate the syntactic dependency tree of the input sentence and construct its adjacency matrix. The adjacency matrix construction algorithm is shown in Algorithm 1. Since the syntactic relations between word nodes in the dependency tree are directional, the dependency tree is treated as a directed graph when constructing the adjacency matrix. In addition, to avoid a node being disturbed by its own word vector, nodes are not given self-loops. Figure 5 shows the adjacency matrix generated by our method for the example sentence "The monkey eats an apple" of Fig. 1. The input sentence matrix and adjacency matrix are then passed through the GCN to obtain a text representation containing syntactic information.
The calculation in the Syntax module can be divided into two steps. First step, use the StanfordNLP tool to generate the syntactic dependency tree and construct the adjacency matrix $adj$; a minimal construction sketch follows.
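The sketch below follows the stated constraints of Algorithm 1 (directed edges, no self-loops); the head-to-child edge direction is our assumption about the algorithm.

```python
# Minimal sketch of adjacency matrix construction from a dependency parse.
import numpy as np

def build_adjacency(heads: list[int]) -> np.ndarray:
    """heads[i] is the 1-based head index of word i+1 (0 = root)."""
    L = len(heads)
    adj = np.zeros((L, L), dtype=np.float32)
    for child, head in enumerate(heads):
        if head > 0:                    # skip the root's virtual head
            adj[head - 1, child] = 1.0  # directed edge head -> child
    return adj

# "The monkey eats an apple": heads of (The, monkey, eats, an, apple)
print(build_adjacency([2, 3, 0, 5, 3]))
```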
Second step, perform a two-layer graph convolution operation on the input sentence matrix $X$ and the adjacency matrix $adj$:

$G = f(adj \cdot X \cdot W_{t1}) \quad (7)$

$X_2 = f(adj \cdot G \cdot W_{t2}) \quad (8)$

where $G \in \mathbb{R}^{L \times d}$ is the output of the first graph convolution layer, $X_2 \in \mathbb{R}^{L \times d}$ is the output of the second graph convolution layer, $adj \in \mathbb{R}^{L \times L}$ is the adjacency matrix, and $W_{t1}, W_{t2} \in \mathbb{R}^{d \times d}$ are parameter matrices. In operation (7), the adjacency matrix and the sentence matrix go through the first graph convolution, capturing the relationships between directly connected nodes; operation (8) then captures the relationships between indirectly connected nodes. A minimal sketch of these two layers follows.
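The following PyTorch sketch implements Eqs. (7)-(8), assuming f is ReLU and, following the text, applying no normalization to adj.

```python
# Minimal sketch of the two-layer graph convolution of the Syntax module.
import torch
import torch.nn as nn

class SyntaxModule(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.wt1 = nn.Linear(d, d, bias=False)  # W_t1 in R^{d x d}
        self.wt2 = nn.Linear(d, d, bias=False)  # W_t2 in R^{d x d}

    def forward(self, x, adj):                  # x: (L, d), adj: (L, L)
        g = torch.relu(adj @ self.wt1(x))       # Eq. (7): direct neighbors
        x2 = torch.relu(adj @ self.wt2(g))      # Eq. (8): indirect neighbors
        return x2
```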

Capsule network module
The capsule network module is composed of a fusion layer, a convolution layer, a primary capsule layer, a convolution capsule layer, and a fully connected capsule layer. It takes the text representations output by the attention module and the syntax module as input and further extracts features. Each layer in the module extracts features at a different level; low-level features are combined into higher-level ones, finally forming a feature representation of the entire text for classification.

The first layer is the fusion layer, which puts the text representations $X_1$ and $X_2$ output by the attention module and the syntax module into a single-layer network:

$X_3 = f(W_{f1} X_1 + W_{f2} X_2) \quad (9)$

where $W_{f1}, W_{f2} \in \mathbb{R}^{L}$. The two sentence matrices are combined by linear transformation in this step.

The second layer is the convolution layer, which extracts N-gram phrase features at different positions in the text. This layer applies $k_1$ convolution filters to the sentence matrix $X_3$ to obtain the N-gram feature matrix $M$:

$M_{k_1} = [m_1, m_2, \ldots, m_{L-N+1}] \in \mathbb{R}^{L-N+1} \quad (10)$

where $M_{k_1}$ is the $k_1$-th column vector of $M$, and each element $m_i$ of this vector is obtained by

$m_i = f(W_{c1} \cdot x_{i:i+N-1} + b_1) \quad (11)$

where $f$ denotes the nonlinear activation function, $W_{c1} \in \mathbb{R}^{N \times d}$ is the $k_1$-th convolution filter, $x_{i:i+N-1}$ denotes the concatenation of $N$ word vectors of the sentence, and $b_1$ is a bias term. The original features are extracted by convolution in this step.

The third layer is the primary capsule layer, which combines the N-gram phrase features extracted at the same location into capsules. This layer uses $k_2$ transformation matrices to transform the feature matrix $M$ into the primary capsule matrix $P$:

$P_{k_2} = [p_1, p_2, \ldots, p_{L-N+1}] \in \mathbb{R}^{(L-N+1) \times l} \quad (12)$

where $P_{k_2}$ is the $k_2$-th column of capsules in $P$, and each capsule is obtained by

$p_i = g(W_{c2} M_i + b_2) \quad (13)$

where $g$ denotes the nonlinear compression (squash) function, $W_{c2} \in \mathbb{R}^{k_2 \times 1 \times l}$ is the $k_2$-th transformation matrix, $M_i$ is the $i$-th row vector of $M$, and $b_2$ is a bias term. The primary capsules are constructed by linearly transforming the original features at the same location.

The fourth layer is the convolution capsule layer, which uses shared transformation matrices to extract local capsules, similar to a convolution layer. This layer uses $k_3$ transformation matrices to perform a capsule convolution operation on $P$ and obtain the capsule matrix $U$:

$U_{k_3} = [u_1, u_2, \ldots, u_{L-N-N_1+2}] \in \mathbb{R}^{(L-N-N_1+2) \times l} \quad (14)$

where $U_{k_3}$ is the $k_3$-th column of capsules in $U$, and each capsule is obtained from the $N_1$ rows of capsules $P_{i:i+(N_1 \times k_2)-1}$ in $P$ (15). Each capsule in this window is linearly converted to obtain a prediction vector:

$\hat{u}_{j|i} = W_{c3}\, p_i + \hat{b}_i \quad (16)$

where $W_{c3} \in \mathbb{R}^{l \times l}$ is the $k_3$-th conversion matrix and $\hat{b}_i$ is the offset term. This is the same as a convolution, except that the basic units become capsules. The prediction vectors are then combined by

$u_j = g\Big(\sum_i c_i\, \hat{u}_{j|i}\Big) \quad (17)$

where $g$ is the nonlinear compression function and $c_i$ is the coupling coefficient, updated with the dynamic routing algorithm (Zhao et al., 2018). Within a window, the similarities between the primary capsules and the generated convolution capsule differ, and a primary capsule with high similarity should be given a higher weight; Eq. (17) is based on this principle.

The last layer is the fully connected capsule layer, which forms the capsules $Y$ representing the categories:

$Y = [y_1, y_2, \ldots, y_j, \ldots] \quad (18)$

where $y_j \in \mathbb{R}^l$ denotes the capsule of the $j$-th category. The capsules in $U$ are linearly transformed as in Eq. (16) to obtain the prediction vectors $\hat{u}_{j|r}$, and operation (17) is performed to obtain $y_j$. The fully connected capsule layer is shown in Fig. 6; it can also be viewed as the convolution window operation of the fourth layer with a window covering all capsules. Finally, the modulus of each category capsule in the fully connected capsule layer is taken as the probability of the text belonging to that category. A sketch of the compression function and dynamic routing used by the capsule layers follows.
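The following is a minimal sketch of the two capsule primitives used above: the compression function g and the dynamic routing update of the coupling coefficients (Sabour, Frosst & Hinton, 2017). The iteration count of 3 is an assumption.

```python
# Minimal sketch of the squash nonlinearity and dynamic routing.
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Shrink the vector norm into (0, 1) while preserving orientation.
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    # u_hat: (num_in, num_out, l) prediction vectors u_hat_{j|i}.
    b = torch.zeros(u_hat.shape[:2], device=u_hat.device)  # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=1)                             # coupling coeffs c_i
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=0))    # Eq. (17), (num_out, l)
        b = b + (u_hat * v.unsqueeze(0)).sum(dim=-1)        # agreement update
    return v
```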

Syntax-AT-CapsNet learning algorithm
The learning algorithm of Syntax-AT-CapsNet is shown in Algorithm 2. For each training batch, the model (1) linearly transforms and divides the sentence matrix X into Q, K, V and computes the multi-head attention output X_1; (2) performs the graph convolution operations on X and adj to get the sentence matrix X_2; (3) puts X_1 and X_2 into the fusion layer to get the sentence matrix X_3; (4) performs the convolution operation on X_3 to obtain the N-gram matrix M; (5) converts M to the capsule matrix P; (6) performs capsule convolution and routing on P to obtain the capsule matrix U; (7) performs the capsule calculation and routing on U to obtain the category capsules Y; and (8) calculates the loss and updates the parameters by backpropagation. During learning, the coupling coefficients are updated by the dynamic routing algorithm, and the global parameters of the model are updated by the backpropagation algorithm. A sketch of one training step follows.
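The following is a hedged sketch of one training step following Algorithm 2; `model.attention`, `model.syntax`, `model.capsnet`, and `margin_loss` are hypothetical names standing in for the modules and loss described above.

```python
# Minimal sketch of one training iteration of Algorithm 2.
import torch

def train_step(model, optimizer, X, adj, target):
    x1 = model.attention(X)            # attention module -> X_1
    x2 = model.syntax(X, adj)          # syntax module -> X_2
    y = model.capsnet(x1, x2)          # fusion, conv, capsule layers -> Y
    probs = y.norm(dim=-1)             # capsule length = class probability
    loss = margin_loss(probs, target)  # hypothetical loss helper
    optimizer.zero_grad()
    loss.backward()                    # global parameters via backprop;
    optimizer.step()                   # coupling coefficients via routing
    return loss.item()
```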

After training, the trained Syntax-AT-CapsNet model parameters are obtained. During prediction, the classification results can be obtained through the sequential calculation of each module in the model.

EXPERIMENTAL DETAILS
We conducted extensive experiments to verify the effect of the Syntax-AT-CapsNet model on single-label and multi-label text classification tasks, and designed further ablation experiments to demonstrate the role of each module.

Data sets
We chose the following four datasets for our experiments: movie reviews (MR) (Miwa & Bansal, 2016), the subjectivity dataset (Subj) (Pang & Lee, 2004), customer reviews (CR) (Hu & Liu, 2004), and Reuters-21578 (Lewis, 1992). MR, Subj, and CR have two categories and are used for the single-label classification tasks; MR and Subj are composed of movie sentiment review data, and CR is composed of product reviews from Amazon and CNET. The Reuters-21578 test set consists of Reuters news documents. We selected 10,788 news documents under the 8 category labels related to economic and financial topics in Reuters-21578 and further divided them into two sub-datasets (Reuters-Full and Reuters-Multi). In Reuters-Full, all texts are kept in the test set, and in Reuters-Multi, only multi-label texts are kept in the test set. The experimental data description is shown in Table 1.

Evaluation index
Exact Match Ratio (ER), Micro-Averaged Precision (Precision), Micro-Averaged Recall (Recall), and Micro-Averaged F1 (F1) were used as evaluation metrics in the experiments. Accuracy is used in place of ER in the single-label classification tasks; a computation sketch follows.
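These metrics can be computed, for example, with scikit-learn; on a multi-label indicator matrix, `accuracy_score` computes exactly the Exact Match Ratio (subset accuracy).

```python
# Minimal sketch of the four evaluation metrics on toy multi-label data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = np.array([[1, 0, 1], [0, 1, 0]])  # toy targets
y_pred = np.array([[1, 0, 0], [0, 1, 0]])  # toy predictions

er = accuracy_score(y_true, y_pred)        # Exact Match Ratio
p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
print(er, p, r, f1)  # 0.5, 1.0, 0.667, 0.8
```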

Parameter setting
The experimental parameters of our work are as follows. The model input is a 300-dimensional word2vec word vector (d = 300). The attention module uses two attention heads (h = 2). The first layer of the capsule network module uses 32 convolution filters (k_1 = 32) with a window size of 3 (N = 3). The second layer uses 32 transformation matrices (k_2 = 32) and 16-dimensional capsule vectors (l = 16). The third layer uses 16 conversion matrices (k_3 = 16) with a window size of 3 (N_1 = 3). The last layer uses 9 capsule vectors (j = 9) to represent the 9 classes. In model training, mini-batches of size 25 (batch size = 25) are used, training runs for 20 epochs (Epoch = 20), and the learning rate is set to 0.001. During model testing, for the single-label classification tasks the category label corresponding to the capsule vector with the largest modulus is taken; for the multi-label classification tasks, the category labels corresponding to capsule vectors with a modulus greater than 0.5 are taken.
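For reference, the settings above can be collected in a single configuration; the key names are our own.

```python
# The experimental configuration, gathered as a plain dict.
config = {
    "d": 300,              # word2vec embedding size
    "h": 2,                # attention heads
    "k1": 32, "N": 3,      # conv filters and window size
    "k2": 32, "l": 16,     # primary-capsule matrices and capsule dim
    "k3": 16, "N1": 3,     # conv-capsule matrices and window size
    "num_classes": 9,      # category capsules
    "batch_size": 25,
    "epochs": 20,
    "learning_rate": 1e-3,
    "multilabel_threshold": 0.5,
}
```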

Benchmark model
In this paper, TextCNN, Capsule-A, and AT-CapsNet are used as benchmark models for the comparative experiments. TextCNN is a classic and representative CNN-based text classification model. Capsule-A is a capsule network-based text classification model. AT-CapsNet is a multi-head attention capsule network text classification model.

Table 1: Data description: the numbers of training, validation, and test examples of the five datasets, the number of categories of each dataset, and whether the dataset is single-label or multi-label.

EXPERIMENTAL RESULTS AND DISCUSSION
Performance on single-label and multi-label classification tasks

The experimental results for single-label classification are shown in Table 2, and those for multi-label classification in Table 3. The following can be observed from the experimental results. Compared with the benchmark models (Table 2), our model achieved the best Accuracy, Recall, and F1 on the three binary classification datasets MR, Subj, and CR. Compared with AT-CapsNet (a multi-head attention capsule network that does not introduce syntax), all three datasets show a significant improvement, which demonstrates the value of introducing syntactic information. Compared with the benchmark models (Table 3), on the two multi-label datasets Reuters-Full and Reuters-Multi, our model achieved competitive results on the four evaluation metrics and the best results on ER, Recall, and F1, which indicates the effectiveness of the model for multi-label text classification.
That is, our model performs better on both single-label and multi-label classification tasks than the benchmark methods. As described in the introduction, the shortcomings of the baseline models are the focus of our research, and these results show that our model overcomes them to a certain extent, which is the purpose of our work.

Syntax module verification experiment
To show the effect of the syntax module, we conducted the following experiments; the results are shown in Table 4.
We can see that when the syntax module is added to the benchmark model, all four evaluation metrics improve significantly, which shows that our syntax module can effectively improve the effect of text classification tasks. It also proves the feasibility and value of extracting syntactic information with graph convolutional neural networks, and supports our motivation of learning from human reading behavior.

Module ablation experiment
The results of the module ablation experiment are shown in Tables 5 and 6.
From the above experimental results, we can draw the following conclusions. When a single module is controlled (Table 5), the ablation of each module causes the classification effect to decrease to varying degrees, which shows that each module in our model contributes to improving the text classification effect. In addition, comparing the decrease values shows that the capsule network has the greatest influence, followed by attention, with the syntax module smaller, which indicates the correctness and value of taking the capsule network as the core module in this paper. When two modules are controlled (Table 6), the ablation of any two modules causes the classification effect to decrease to varying degrees. Among them, ablating the syntax and capsule network modules has the greatest influence, ablating the syntax and attention modules the second greatest, and ablating the attention and capsule network modules the least, which shows that the syntax module functions to the greatest extent when combined with the other modules. It also shows that our motivation of coupling a graph neural network that encodes syntactic information with the other components is correct. It can be seen from the above that when the attention module, the syntax module, or the capsule network module is removed or partially removed, the effectiveness of the model declines to varying degrees. Since the syntax module uses graph convolutional neural networks, the above experiments also prove that graph convolutional neural networks, capsule networks, and multi-head attention have an integration effect on text classification tasks, and that the construction of our model is on the right track.

Table 6: The experimental results of the two-module ablation control. Two modules are removed from our model to verify their effect; there is no corresponding module for the first model in each control group.

CONCLUSIONS
This paper proposes an attention-enhanced capsule network text classification model, Syntax-AT-CapsNet, for text classification tasks. The model uses graph convolutional neural networks as submodules to encode syntactic dependency trees and extract the syntactic information in the text, and further integrates it with sequence information and dependency relationships, thereby improving the effect of text classification. Through the classification performance experiments, the syntax module verification experiment, and the module ablation experiments, the effect of the model on single-label and multi-label text classification tasks is verified, the function of the syntax module is demonstrated, and the integration effect of graph convolutional neural networks, capsule networks, and multi-head attention is shown. Future work will further optimize the model for other downstream tasks of text classification.