Graph Convolutional Network for Word Sense Disambiguation

Word sense disambiguation (WSD) is an important research topic in natural language processing, which is widely applied to text classification, machine translation, and information retrieval. In order to improve disambiguation accuracy, this paper proposes a WSD method based on the graph convolutional network (GCN). Words, parts of speech, and semantic categories are extracted from the context of the ambiguous word as discriminative features. The discriminative features and the sentence containing the ambiguous word are used as nodes to construct the WSD graph. The Word2Vec tool, the Doc2Vec tool, pointwise mutual information (PMI), and TF-IDF are applied to compute the embeddings of nodes and the edge weights. GCN is used to fuse the features of a node and its neighbors, and the softmax function is applied to determine the semantic category of the ambiguous word. The training corpus of SemEval-2007: Task #5 is adopted to optimize the proposed WSD classifier, and the test corpus of SemEval-2007: Task #5 is used to test its performance. Experimental results show that the average accuracy of the proposed method is improved.


Introduction
In natural language, a polysemous word has many senses. WSD determines the meaning of an ambiguous word based on its context and is widely applied to natural language processing tasks. There are many ambiguous words in the Chinese vocabulary. For example, the Chinese word 'ben' has 3 semantic categories: 'book,' 'capital,' and 'foundation.' We need to determine the meaning of 'ben' based on its context. Scholars at home and abroad are currently studying WSD. WSD methods can be divided into 3 categories: supervised, unsupervised, and semisupervised methods.
In the supervised WSD method, annotated corpus is used to train the WSD classifier. The optimized classifier is then adopted to disambiguate the test corpus [1]. In the unsupervised WSD method, the corpus need not be annotated manually. Unlabeled corpus is analyzed to reveal its inherent nature and regularities, and it is disambiguated based on the similarity between unlabeled instances [2]. In the semisupervised WSD method, annotated corpus is used to train the classifier. At the same time, a large amount of unannotated corpus is used to expand the training corpus in order to improve the performance of the WSD classifier [3].
Our work is different from the method proposed by Trask [4]. Trask gives the Sense2Vec model based on Word2Vec. The Sense2Vec model is used to select appropriate sense embeddings of the context for WSD. At the same time, he clusters ambiguous words with supervised labels. The novelty of our work is that words, parts of speech, and semantic categories from all left and right units around the ambiguous word are used as discriminative features. The WSD graph is adopted to express the discriminative features and the sentence containing the ambiguous word. The Word2Vec tool is used to vectorize the discriminative features, and the Doc2Vec tool is used to vectorize the sentence containing the ambiguous word. Word2Vec, PMI, and TF-IDF are applied to compute the weights of the relationships between discriminative features and the sentence. GCN and the softmax function are applied to WSD based on the WSD graph.
The main contributions of this article are summarized as follows:
(1) The sentence containing the ambiguous word, words, parts of speech, and semantic categories are viewed as discriminative features. The Doc2Vec tool is used to generate the feature vector of the sentence. The Word2Vec tool is adopted to generate feature vectors of words, parts of speech, and semantic categories.
(2) The WSD graph is constructed, in which discriminative features are used as nodes. Edges are constructed between a word and, respectively, a sentence, another word, a part of speech, and a semantic category. PMI, TF-IDF, and the Word2Vec tool are used to compute edge weights.
(3) The WSD graph is input into GCN to determine the semantic category of the ambiguous word.
This paper is organized as follows. Related work is reported in Section 2. WSD feature extraction is given in Section 3. GCN word sense disambiguation is described in Section 4. Experimental results are given and analyzed in Section 5. The conclusion is given in Section 6.

Related Work
WSD methods are divided into supervised, unsupervised, and semisupervised methods.
In supervised WSD methods, labeled data is used to train the WSD classifier. Zhang et al. [5] proposed two supervised WSD models based on the bidirectional long short-term memory network and the self-attention model in the biomedical field. Amancio et al. [6] solved the WSD problem by treating texts as complex networks, where the semantic category of the ambiguous word is distinguished by characterizing its local structure. Pal et al. [7] extended the baseline strategy and gave an improved supervised WSD method that builds decision trees, support vector machines, artificial neural networks, and naive Bayes models. Silva and Amancio [8] applied the framework of complex networks to the problem of supervised classification in the word disambiguation task. Tripodi and Pelillo [9] designed a WSD model based on evolutionary game theory, which determines the semantic category of the ambiguous word according to distributional information and semantic similarity. Correa et al. [10] computed the relevance of the bipartite network representing both feature words and the ambiguous word to resolve ambiguities in written texts. Kumar et al. [11] presented a supervised method that disambiguates the ambiguous word by predicting over a continuous sense embedding space, which generalizes over both seen and unseen senses. Bevilacqua and Navigli [12] embedded information of the LKB graph into a supervised neural architecture that exploits pretrained synset embeddings to predict synsets that are not in the training set.
In unsupervised WSD methods, unlabeled corpus is clustered to determine the semantic category of the ambiguous word. Alsaeedan et al. [13] proposed a hybrid WSD method that consists of a self-adaptive genetic algorithm, a max-min ant algorithm, and an ant colony algorithm. Meng et al. [14] gave a context2vec model with part of speech to differentiate meanings represented by one point in vector space. Li et al. [15] presented a WSD method based on polysemy vector representation, in which statistical polysemy, the number of word senses, and the K-means clustering algorithm are adopted to disambiguate the ambiguous word. Yuan et al. [16] transformed WSD into a text topic classification problem and designed an unsupervised WSD model based on LDA topics. Jain and Lobiyal [17] applied a graph to reveal implicit information that links words in a sentence for WSD. Zhong and Wang [18] used the multiple kernel learning approach to combine multiple feature channels, which learns different weights that reflect the different importance of feature channels for WSD. Blevins and Zettlemoyer [19] designed a bi-encoder to embed the ambiguous word with its surrounding context and dictionary definition. The encoders are jointly optimized in the same representation space, and the nearest sense is selected. Ruas et al. [20] proposed a WSD method which considers semantic effects of contexts to disambiguate and annotate the ambiguous word with its specific sense. Yang et al. [21] used domain keywords and word vectors from unlabeled data to build a WSD classifier. The proposed method can be adapted to WSD tasks in other domains when knowledge from different fields is integrated. Hung and Chen [22] gave 3 methods based on contexts of word-of-mouth documents to build SentiWordNet lexicons for WSD. Lu et al. [23] combined Chinese and English knowledge resources by mapping word senses to construct a graph and presented a graph-based WSD method with multiknowledge integration.
In semisupervised WSD methods, annotated corpus and a large amount of unannotated corpus are used to train the WSD classifier. Jain et al. [24] gave a semisupervised algorithm for constructing WordNet graphs, in which clue words are used. Saqib et al. [25] designed a framework consisting of buzz words and query words to use WordNet for detecting target words. Buzz words are defined as a 'bag-of-words' using POS, and query words have multiple meanings. Zhu [26] presented a semisupervised WSD method based on the von Neumann kernel. Semantic similarities between terms are determined with both labeled and unlabeled data by means of a diffusion process on a graph defined by lexicon and co-occurrence. The von Neumann kernel is constructed based on semantic similarity. Cardellino and Alonso Alemany [27] proposed a disjoint semisupervised learning method, in which an unsupervised model is trained on unlabeled data and its results are used by a supervised classifier. Janz and Piasecki [28] combined plWordNet and semantic links extracted from a large valency lexicon, usage examples, Wikipedia articles, and the SUMO ontology in a PageRank-based WSD algorithm. Mahmoodv and Hourali [29] used a machine learning algorithm with minimal supervision to disambiguate word senses based on features of the target word and a collaborative learning method. Başkaya and Jurgens [30] gave a semisupervised WSD method that combines a small amount of annotated data with information from word sense induction. Navigli and Velardi [31] created structural specifications of possible senses for each word in context and selected the best hypothesis with G grammar. Khapra et al. [32] adopted a bilingual bootstrapping strategy for WSD, in which a model trained on annotated data is applied to annotate untagged data and vice versa using parameter projection. Akkaya et al. [33] used a clustering and labeling strategy to generate labeled data for subjectivity WSD semiautomatically. Faralli and Navigli [34] designed a minimally-supervised framework for performing domain-driven WSD.
These 3 methods have their own shortcomings. Although the supervised WSD method can achieve better performance, it needs a large annotated corpus, which is time-consuming and laborious to build. At the same time, its performance relies on the machine learning algorithm. The unsupervised WSD method does not require a manually labeled corpus, but its disambiguation accuracy is not high. The semisupervised WSD method requires a small amount of annotated corpus and a large amount of unannotated corpus, but fitting the model with unlabeled corpus can make the WSD model worse and affect the coverage of ambiguous words.
GCN can capture global information of the graph and represent the features of nodes better. The convolutional kernels of GCN act on all nodes of the whole graph, and weight parameters are shared. This reduces the number of parameters of a single-layer network and effectively avoids the overfitting problem. Therefore, this paper proposes a WSD method based on GCN. Sentences, words, parts of speech, and semantic categories are viewed as nodes, and their relationships are used as edges to construct the WSD graph. GCN is adopted to extract effective features from the WSD graph and to better apply linguistic knowledge from the corpus to WSD.

WSD Feature Extraction
Firstly, the Chinese sentence including the ambiguous word is segmented into words. Secondly, each Chinese word is labeled with its part of speech. Thirdly, each Chinese word is annotated with its semantic category. Words, parts of speech, and semantic categories are then extracted from the context of the ambiguous word as discriminative features. For a Chinese sentence containing the ambiguous word 'biaomian,' the process of extracting discriminative features is shown in Figure 1.
Here, w is a word, p denotes part of speech, s expresses the semantic category, and d is the Chinese sentence containing the ambiguous word.
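The extraction step can be illustrated with a minimal Python sketch. It assumes the sentence has already been segmented and annotated, and that the ambiguous word itself is excluded from its own context; the `Unit` type, the `extract_features` function, and the tags and category codes in the example are hypothetical placeholders, not the tools or tag sets used in this paper.

```python
from typing import List, NamedTuple, Optional


class Unit(NamedTuple):
    word: str
    pos: str  # part-of-speech tag
    sem: str  # semantic category code


def extract_features(units: List[Unit], target: int,
                     window: Optional[int] = None) -> dict:
    """Collect w, p, and s features from the units around position `target`.

    window=None uses every left and right unit; window=k uses k units per side.
    """
    lo = 0 if window is None else max(0, target - window)
    hi = len(units) if window is None else min(len(units), target + window + 1)
    # Assumption: the ambiguous word itself is not one of its own features.
    context = [u for i, u in enumerate(units[lo:hi], start=lo) if i != target]
    return {
        "w": [u.word for u in context],
        "p": [u.pos for u in context],
        "s": [u.sem for u in context],
    }


# Hypothetical annotated sentence; tags and category codes are placeholders.
sentence = [
    Unit("w1", "n", "A"), Unit("w2", "v", "B"), Unit("biaomian", "n", "C"),
    Unit("w4", "u", "D"), Unit("w5", "n", "A"),
]
features = extract_features(sentence, target=2, window=1)
```

Calling `extract_features(sentence, target=2)` without a window returns features from all left and right units of the ambiguous word.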

WSD Based on Graph Convolutional Network
GCN is a multilayer neural network that operates directly on a graph and induces the embedding vector of a node from those of its neighbors. An undirected graph $G = (V, E)$ is defined, where $V$ ($|V| = N$) is a set of nodes and $E$ is a set of edges. Each node is assumed to be connected with itself, that is, $(v, v) \in E$ for any $v \in V$.
The WSD graph of the corpus is constructed, whose nodes are sentences, words, parts of speech, and semantic categories. The WSD graph also contains the embeddings of nodes and the edge weights. Edges are constructed between a word and, respectively, a sentence, another word, a part of speech, and a semantic category. The sets of sentence, word, part-of-speech, and semantic category nodes are denoted by $D$, $W$, $P$, and $S$, respectively. There are $N$ nodes in total in $D$, $W$, $P$, and $S$ in the WSD graph. Adjacency matrix $A \in \mathbb{R}^{N \times N}$ is constructed based on $V$. Each node has an $M$-dimensional feature vector produced by Word2Vec or Doc2Vec. The $N$ feature vectors form feature matrix $X \in \mathbb{R}^{N \times M}$. Adjacency matrix $A$ and feature matrix $X$ are input into GCN. Then, the softmax function is adopted to determine the semantic category of the ambiguous word. GCN word sense disambiguation is shown in Figure 2.
There are $t$ semantic categories $s_1, s_2, \ldots, s_t$ for the ambiguous word $w$, as shown in Figure 2. An ellipse represents a node, and a line denotes the edge between two nodes.

Discrete Dynamics in Nature and Society
Here, 'D' represents the sentence node, 'W' denotes the word node, 'P' represents the part of speech node, and 'S' denotes the semantic node. The number is used to distinguish different sentences, words, parts of speech, and semantic categories. $R(X)$ represents the embedding representation of $X$, where $X$ can be a sentence, word, part of speech, or semantic category. The edge between $R(D_1)$ and $R(W_1)$ denotes the relationship between $D_1$ and $W_1$, whose value is TF-IDF$(d_1, w_1)$. The edge between $R(W_1)$ and $R(W_2)$ represents the relationship between $W_1$ and $W_2$, whose value is PMI$(w_1, w_2)$. The edge between $R(W_1)$ and $R(P_1)$ denotes the relationship between $W_1$ and $P_1$, whose value is Word2Vec$(p_1)$. The edge between $R(W_1)$ and $R(S_1)$ represents the relationship between $W_1$ and $S_1$, whose value is Word2Vec$(s_1)$. The output of GCN is input into the softmax function to determine the semantic category of $w$.
Discriminative features are extracted from lexical units in the sentence including ambiguous word $w$. For the above example, $d_1$ represents the sentence. The Doc2Vec tool is used to vectorize $d_1$ as $v_{d_1}$. The Word2Vec tool is adopted to vectorize words, parts of speech, and semantic categories, respectively, as $v_w$, $v_p$, and $v_s$. Identical features are viewed as a single node in the WSD graph. There are 20 nodes in the WSD graph of the above sentence. A 200-dimensional feature vector is generated for each feature. Feature matrix $X \in \mathbb{R}^{20 \times 200}$ is constructed from the 20 feature vectors. The process of constructing feature matrix $X$ is shown in Figure 3.
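The merging of identical features into one node, and the stacking of one 200-dimensional vector per node into $X$, can be sketched as follows. This is only an illustration under stated assumptions: `build_nodes` and `placeholder_embedding` are hypothetical names, and the pseudo-random vectors merely stand in for real Doc2Vec and Word2Vec lookups, which are not reproduced here.

```python
import random


def build_nodes(sentence_id, words, pos_tags, sem_cats):
    """Return the ordered list of graph nodes, merging identical features."""
    nodes = [("D", sentence_id)]
    for kind, values in (("W", words), ("P", pos_tags), ("S", sem_cats)):
        for value in values:
            node = (kind, value)
            if node not in nodes:  # identical features share one node
                nodes.append(node)
    return nodes


def placeholder_embedding(node, dim=200):
    """Stand-in for a Doc2Vec/Word2Vec lookup (assumption, not the real tool)."""
    rng = random.Random(repr(node))  # deterministic per node
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Hypothetical features of one sentence d1; repeated values collapse to one node.
nodes = build_nodes("d1", ["w1", "w2", "w1"], ["n", "v", "n"], ["A", "B", "A"])
X = [placeholder_embedding(node) for node in nodes]  # N rows, M = 200 columns
```

The row order of `X` follows the order of `nodes`, which is also the order that the rows and columns of the adjacency matrix would need to use.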
We use PMI to calculate the edge weight between two word nodes. A fixed-size sliding window is adopted to collect co-occurrence statistics. The sliding process of the window is shown in Figure 4.
In Figure 4, $w$ represents a word and the number denotes its location. Suppose there are 7 words in a sentence, $w_1, w_2, \ldots, w_7$. If the size of the sliding window is set to 3, the window contains 3 words. The window slides from left to right. Each time the window slides, a new word is included and the first word of the window is discarded. The sliding process stops when the rightmost word of the window is the last word of the sentence.
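The sliding process can be written as a short Python generator. This is a minimal sketch; `sliding_windows` is a hypothetical helper name, and treating a sentence shorter than the window as a single window is an assumption that the text does not specify.

```python
def sliding_windows(words, size):
    """Yield each window of `size` consecutive words, moving one word per step."""
    if len(words) <= size:
        # Assumption: a sentence no longer than the window forms one window.
        yield list(words)
        return
    for start in range(len(words) - size + 1):
        yield words[start:start + size]


words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
windows = list(sliding_windows(words, 3))
```

For the 7-word example with window size 3, the generator yields 5 windows, from `["w1", "w2", "w3"]` to `["w5", "w6", "w7"]`.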
PMI is adopted to calculate the weight between two words. The PMI value of word pair $(w_i, w_j)$ is defined as

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \qquad p(w_i, w_j) = \frac{\#(w_i, w_j)}{\#}, \qquad p(w_i) = \frac{\#(w_i)}{\#},$$

where $\#(w_i)$ is the number of sliding windows containing $w_i$, $\#(w_i, w_j)$ is the number of sliding windows containing both $w_i$ and $w_j$, and $\#$ is the total number of sliding windows. When the PMI value is positive, the semantic relevance between $w_i$ and $w_j$ is high. Otherwise, their semantic relevance is low. An edge is added between word nodes $W_i$ and $W_j$ only when PMI$(w_i, w_j)$ is positive.
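A minimal sketch of this computation in Python is given here. It counts, for each word and each word pair, the number of windows that contain it, and keeps only pairs with positive PMI; `pmi_edges` and the toy windows are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(windows):
    """Return {(w_i, w_j): PMI} for word pairs whose PMI is positive."""
    total = len(windows)  # number of sliding windows
    single = Counter()    # windows containing w_i
    pair = Counter()      # windows containing both w_i and w_j
    for window in windows:
        vocab = sorted(set(window))  # count each word once per window
        single.update(vocab)
        pair.update(combinations(vocab, 2))
    edges = {}
    for (a, b), n_ab in pair.items():
        p_ab = n_ab / total
        p_a, p_b = single[a] / total, single[b] / total
        pmi = math.log(p_ab / (p_a * p_b))
        if pmi > 0:  # only positive PMI creates an edge
            edges[(a, b)] = pmi
    return edges


toy_windows = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"], ["d", "e", "f"]]
edges = pmi_edges(toy_windows)
```

In this toy input the pair `("c", "d")` co-occurs in 2 of 4 windows while each word occurs in 3 of 4, so its PMI is negative and no edge is created.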

WSD Graph
(1) Use the Doc2Vec tool to vectorize the sentence as the embedding of node $D$. Use the Word2Vec tool to vectorize the word, part of speech, and semantic category, respectively, as the embeddings of node $W$, node $P$, and node $S$.
(2) When PMI$(w_i, w_j)$ is positive, add an edge between nodes $W_i$ and $W_j$ whose weight is PMI$(w_i, w_j)$.
(3) Use the Word2Vec tool to vectorize part of speech $p_j$ in node $P_j$ as the edge weight between nodes $W_i$ and $P_j$. Use the Word2Vec tool to vectorize semantic category $s_j$ in node $S_j$ as the edge weight between nodes $W_i$ and $S_j$.
(4) Use TF-IDF$(d_i, w_j)$ as the edge weight between sentence node $D_i$ and word node $W_j$, where TF is the frequency with which word $w_j$ occurs in $d_i$ and IDF is the logarithm of the inverse fraction of sentences containing word $w_j$.
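The TF-IDF edge weight between a sentence node and a word node can be sketched in a few lines of Python. The sketch assumes that TF is the raw count of the word in the sentence and that IDF is the natural logarithm of the number of sentences divided by the number of sentences containing the word; the exact variant used in the experiments is not stated, and `tfidf` and `corpus` are hypothetical names.

```python
import math


def tfidf(sentences, i, word):
    """TF-IDF weight of `word` in sentence i (raw-count TF, log IDF assumed)."""
    tf = sentences[i].count(word)
    df = sum(1 for sentence in sentences if word in sentence)
    if tf == 0 or df == 0:
        return 0.0  # no edge when the word does not occur in the sentence
    return tf * math.log(len(sentences) / df)


corpus = [["a", "b", "a"], ["b", "c"], ["c", "d"]]  # toy segmented sentences
weight = tfidf(corpus, 0, "a")
```

A word that appears in every sentence receives weight 0 under this variant, so it contributes no sentence-word edge.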
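The edge weight between nodes $i$ and $j$ is given by

$$A_{ij} = \begin{cases} \mathrm{PMI}(w_i, w_j), & i, j \text{ are words and } \mathrm{PMI}(w_i, w_j) > 0, \\ \mathrm{Word2Vec}(p_j), & i \text{ is a word and } j \text{ is a part of speech}, \\ \mathrm{Word2Vec}(s_j), & i \text{ is a word and } j \text{ is a semantic category}, \\ \text{TF-IDF}(d_i, w_j), & i \text{ is a sentence and } j \text{ is a word}, \\ 1, & i = j, \\ 0, & \text{otherwise}. \end{cases}$$

A one-layer GCN obtains information from the neighbor nodes of a node through convolution operations. When multiple GCN layers are stacked, information about larger neighborhoods is integrated. Feature matrix $L^{(1)} \in \mathbb{R}^{N \times K}$ is defined as

$$L^{(1)} = \rho(\tilde{A} X W_0),$$

where $\rho$ is the activation function and $W_0 \in \mathbb{R}^{M \times K}$ is the weight matrix. The normalized symmetric adjacency matrix $\tilde{A}$ is given by

$$\tilde{A} = D^{-1/2} A D^{-1/2},$$

where $D$ here denotes the degree matrix of $A$, with $D_{ii} = \sum_j A_{ij}$ (not the set of sentence nodes). If the ordinary adjacency matrix were used in place of $\tilde{A}$, the influence of a node on itself could not be considered. At the same time, it could not be considered that a node with more neighbors has greater influence on WSD.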
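The symmetric normalization of the adjacency matrix can be sketched with NumPy. The sketch assumes scalar edge weights and the standard form in which each entry $A_{ij}$ is divided by the square root of the product of the degrees of nodes $i$ and $j$; `normalize_adjacency` and the toy matrix are hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix of A."""
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0  # guard against isolated nodes without self-loops
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    # Scale row i by d_i^{-1/2} and column j by d_j^{-1/2}.
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]


# Toy symmetric adjacency matrix with self-loops on the diagonal.
A = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
A_norm = normalize_adjacency(A)
```

Because the same scaling is applied to rows and columns, a symmetric input stays symmetric after normalization.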
GCN aggregates high-order neighborhood information by stacking multiple convolutional layers, which gives a node more information, using the following formula:

$$L^{(j+1)} = \rho(\tilde{A} L^{(j)} W_j),$$

where $j$ is the layer number of GCN and $L^{(0)} = X$. A two-layer GCN is used here. Adjacency matrix $A$ and feature matrix $X$ are input into GCN. Then, the softmax function is adopted to compute the probabilities of ambiguous word $w$ under the semantic categories, which are defined as

$$Z = \mathrm{softmax}\big(\tilde{A}\, \mathrm{ReLU}(\tilde{A} X W_0)\, W_1\big), \tag{8}$$

where ReLU is the activation function, $W_0$ is the weight matrix of the input layer, $W_1$ is the weight matrix of the output layer, $\tilde{A} X W_0$ is the feature embedding of the input layer, and $\mathrm{ReLU}(\tilde{A} X W_0) W_1$ is the feature embedding of the output layer.
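The forward pass of a two-layer GCN followed by softmax can be sketched with NumPy as follows. This is an illustration under assumptions, not the trained model of this paper: the graph, the dimensions, and the random weights are toy values, and `gcn_forward` is a hypothetical name.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def gcn_forward(A_norm, X, W0, W1):
    """Compute softmax(A_norm · ReLU(A_norm · X · W0) · W1)."""
    hidden = np.maximum(A_norm @ X @ W0, 0.0)  # first layer with ReLU
    return softmax(A_norm @ hidden @ W1)       # second layer, then softmax


rng = np.random.default_rng(0)
N, M, K, t = 5, 8, 4, 3          # nodes, input dim, hidden dim, categories
A = np.eye(N)                    # self-loops
A[0, 1] = A[1, 0] = 1.0          # two toy edges
A[1, 2] = A[2, 1] = 1.0
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))  # symmetric normalization
X = rng.normal(size=(N, M))
W0 = rng.normal(size=(M, K))
W1 = rng.normal(size=(K, t))
Z = gcn_forward(A_norm, X, W0, W1)
```

Each row of `Z` is a probability distribution over the `t` categories for one node; in the WSD setting, the rows of interest would be those of the sentence nodes.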
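The probability that ambiguous word $w$ belongs to semantic category $s_i$ is given by the softmax function:

$$P(s_i \mid w) = \frac{\exp(z_i)}{\sum_{k=1}^{t} \exp(z_k)},$$

where $z_i$ is the $i$th component of the output-layer embedding for the sentence containing $w$. The annotated training set is $U = \{(d_1, l_1), (d_2, l_2), \ldots, (d_n, l_n)\}$, where $l_i \in \{s_1, s_2, \ldots, s_t\}$ is the semantic category of sentence $d_i$ ($i = 1, 2, \ldots, n$). Loss function $L$ is defined as the cross-entropy error over all annotated instances:

$$L = -\sum_{i=1}^{n} \sum_{j=1}^{t} Y_{ij} \ln Z_{ij},$$

where $Y_{ij}$ equals 1 if $l_i = s_j$ and 0 otherwise, and $Z_{ij}$ is the predicted probability that $d_i$ belongs to semantic category $s_j$, as shown in formula (8).
Parameters $W_0$ and $W_1$ of formula (8) can be trained by the gradient descent method based on $U$.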
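The cross-entropy objective over annotated sentences can be sketched in plain Python. The sketch assumes one-hot labels, so that for each annotated sentence only the log-probability of its true category contributes; `cross_entropy`, the probabilities, and the labels are hypothetical toy values.

```python
import math


def cross_entropy(Z, labels):
    """Sum of -ln Z[i][l_i] over annotated sentences.

    Z[i][j] is the predicted probability that sentence i has category j,
    and labels[i] is the index of its annotated category.
    """
    return -sum(math.log(Z[i][label]) for i, label in enumerate(labels))


Z = [[0.7, 0.2, 0.1],   # toy predictions for two annotated sentences
     [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = cross_entropy(Z, labels)
```

Gradient descent would then adjust the weight matrices to reduce this value; the optimizer settings are not specified in the text and are therefore not sketched here.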
The semantic category $s$ of ambiguous word $w$ is determined using the following formula:

$$s = \arg\max_{s_i,\; i = 1, \ldots, t} P(s_i \mid w).$$

Experimental Results and Analysis
Average accuracy is used to evaluate the performance of the WSD classifier, which is defined as

$$p_i = \frac{m_i}{n_i}, \qquad p_{avg} = \frac{1}{N} \sum_{i=1}^{N} p_i,$$

where $N$ is the number of all ambiguous words, $m_i$ is the number of test sentences correctly classified for the $i$th ambiguous word, $n_i$ is the number of all test sentences containing the $i$th ambiguous word, $p_i$ is the disambiguation accuracy of the $i$th ambiguous word, and $p_{avg}$ is the average accuracy.
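The evaluation measure can be computed with a few lines of Python. The counts in the example are hypothetical and serve only to show the arithmetic; `average_accuracy` is an illustrative name.

```python
def average_accuracy(results):
    """results maps each ambiguous word to (m_i, n_i).

    Returns the per-word accuracies p_i and their unweighted mean p_avg.
    """
    per_word = {word: m / n for word, (m, n) in results.items()}
    p_avg = sum(per_word.values()) / len(per_word)
    return per_word, p_avg


results = {"word_a": (8, 10), "word_b": (3, 6)}  # hypothetical counts
per_word, p_avg = average_accuracy(results)
```

Note that the mean is taken over ambiguous words rather than over test sentences, so a word with few test sentences weighs as much as a word with many.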
In the first two groups of experiments, we compare several WSD methods based on deep neural networks with the proposed one. In the first group of experiments, words are extracted as discriminative features from the two left and two right units around the ambiguous word. LSTM, CNN, CNN + LSTM, and CNN + Bi-LSTM are, respectively, adopted to determine its semantic category. In GCN(1), words are extracted as discriminative features from the two left and two right units around the ambiguous word. These 4 words compose d. The discriminative features and d are used to construct the WSD graph. In GCN(2), words are extracted as discriminative features from all left and right units of the ambiguous word. These words compose d. The discriminative features and d are used to construct the WSD graph. The training corpus is used to optimize the WSD model, and the test corpus is adopted to test the accuracy of the WSD model, as shown in Table 1.
From Table 1, it can be seen that the WSD performance of GCN is higher than that of the other models. The average accuracy of GCN(2) is superior to that of GCN(1). This is because all left and right words around the ambiguous word are used in GCN(2), whereas only the two left and two right words around the ambiguous word are used in GCN(1). GCN(2) considers more context than GCN(1), so it achieves better WSD performance. The average accuracy of LSTM is slightly lower than that of CNN, which shows that CNN can extract disambiguation features better. CNN and LSTM are inferior to CNN + LSTM on WSD performance. The reason is that CNN + LSTM can extract more effective features than CNN or LSTM alone. The average accuracy of CNN + LSTM is the same as that of CNN + Bi-LSTM. This indicates that they have the same ability of feature extraction when they are used to extract disambiguation features from the two left and two right words around ambiguous words.
In the second group of experiments, words, parts of speech, and semantic categories are extracted as discriminative features from the two left and two right units around the ambiguous word. LSTM, CNN, CNN + LSTM, and CNN + Bi-LSTM are adopted to determine its semantic category. In CNN(1), frequencies of discriminative features are vectorized as disambiguation features. In GCN(3), words, parts of speech, and semantic categories from the two left and two right units around the ambiguous word are used; in GCN(4), those from all left and right units are used. The training corpus is used to optimize the WSD model, and the test corpus is adopted to test the accuracy of the WSD model, as shown in Table 2.
From Table 2, it can be seen that the WSD performance of LSTM, CNN, CNN + LSTM, CNN + Bi-LSTM, and GCN(4) is improved compared with the corresponding model in the first group of experiments. This is because words, parts of speech, and semantic categories are all used as discriminative features. It shows that part of speech and semantic category provide more discriminative information for WSD. The average accuracy of CNN is better than that of LSTM, which indicates that CNN can extract disambiguation features better from words, parts of speech, and semantic categories. CNN + LSTM is superior to CNN and LSTM on WSD performance, because it can extract more effective features than CNN or LSTM alone. The average accuracy of CNN + LSTM is lower than that of CNN + Bi-LSTM. This indicates that CNN + Bi-LSTM has the better ability of feature extraction when the two models are used to extract disambiguation features from the two left and two right words, parts of speech, and semantic categories around ambiguous words. The average accuracy of CNN(1) is higher than that of CNN, which shows that the average accuracy of CNN is improved when frequencies are vectorized as disambiguation features.
The average accuracy of GCN(4) is higher than that of GCN(3). This is because all left and right words, parts of speech, and semantic categories around the ambiguous word are used in GCN(4), whereas only those from the two left and two right units are used in GCN(3). GCN(4) considers more context than GCN(3), so it achieves better WSD performance. However, the average accuracy of GCN(3) is lower than that of GCN(1). This suggests that introducing more discriminative features into GCN hinders information transfer between nodes when there are few nodes.
Ambiguous words of GCN(2) and GCN(4) in Tables 1 and 2 are grouped according to their category number. The average accuracy of ambiguous words with the same category number is calculated, as shown in Figure 5.
From Figure 5, it can be seen that the average accuracy of ambiguous words with the same category number decreases as the category number increases. This is because the predicted results have more possibilities when the category number increases, which raises the error rate of the WSD classifier. The average accuracy of GCN(4) is better than that of GCN(2) for ambiguous words with 2, 3, and 4 categories. This is because only words are adopted as discriminative features in GCN(2), whereas words, parts of speech, and semantic categories are used as discriminative features in GCN(4). Comparing GCN(2) with GCN(4), the gain in average accuracy for ambiguous words with more categories is larger than that for ambiguous words with fewer categories. This shows that, for an ambiguous word with more categories, more discriminative information improves classification accuracy.
In the third group of experiments, words, parts of speech, and semantic categories are extracted as discriminative features from all left and right units around the ambiguous word. These words compose d. The discriminative features and d are used to construct the WSD graph, and GCN is adopted to determine the semantic category of the ambiguous word. The PMI value is applied to construct edges between word nodes in the WSD graph and to compute their weights. The PMI value depends on the size of the sliding window. By setting different window sizes, the influence of the edge weight between two words on WSD is compared. The window size is, respectively, set to 5, 8, 10, 12, 15, 20, and the length of d. The training corpus is used to optimize GCN in the third group of experiments. The test corpus is adopted to test the accuracy of the WSD model, as shown in Table 3.
From Table 3, it can be seen that the average accuracy of WSD is the highest when the window size is 10. Through the analysis of experimental results, it can be concluded that if the window size is too small, the information between some word nodes is not aggregated well, and if it is too large, edges are created between irrelevant word nodes. When the window is expanded, the accuracies of some ambiguous words increase while those of others decrease. This is caused by the varying number of words in sentences. When the window size is 10, the sliding window covers the sentence lengths in the corpus most evenly, and a better WSD effect is obtained. In order to observe the influence of window size more intuitively, a line chart of average accuracies under different window sizes is drawn, as shown in Figure 6.
In the fourth group of experiments, words, parts of speech, and semantic categories are extracted as discriminative features from all left and right units around the ambiguous word. These words compose d. The discriminative features and d are used to construct the WSD graph, and GCN is adopted to determine the semantic category of the ambiguous word. The window size is set to 10, and the number of convolutional layers of GCN is set to different values. The training corpus is used to optimize GCN. The test corpus is adopted to test the accuracy of the WSD classifier, as shown in Table 4.
From Table 4, it can be seen that the WSD classifier does not perform well with 1 convolutional layer. This is because the information between nodes cannot be fused well. In order to give a node more extensive information, multiple convolutions are needed. The performance of WSD is the highest with 2 convolutional layers. When the layer number is 3 or 4, the average accuracy of the WSD classifier decreases. This is because, as the layer number increases, the discriminative information in convolutional operations becomes more abundant: information from distant neighbors of a node is gradually gathered, and information from irrelevant nodes converges, which leads to excessive information fusion and reduces the performance of the WSD classifier.
Here, 'biaomian,' 'qixi,' 'zhongyi,' 'tiao,' 'changcheng,' and 'chi' are used as representative ambiguous words. Disambiguation accuracies under different layer numbers are shown in Figure 7.
From Figure 7, it can be seen that most ambiguous words have the highest accuracies when the layer number is 2. The accuracy of 'zhongyi' is the highest when the layer number is 3. When the layer number is 4, 'changcheng' has the highest accuracy. This shows that, for some ambiguous words, information fusion between nodes is most effective when the layer number is higher. The accuracies of other ambiguous words decrease because of excessive information fusion.
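How the layer number controls the neighborhood that reaches a node can be illustrated with a small NumPy sketch. It stacks the propagation rule on a toy path graph of 4 nodes, uses identity weights so that only the graph structure matters, and counts how many nodes have received a signal that starts at one end; `gcn_stack` and the toy graph are hypothetical and unrelated to the experimental corpus.

```python
import numpy as np


def gcn_stack(A_norm, X, weights):
    """Apply one propagation step per weight matrix.

    ReLU follows every layer except the last one.
    """
    L = X
    for j, W in enumerate(weights):
        L = A_norm @ L @ W
        if j < len(weights) - 1:
            L = np.maximum(L, 0.0)
    return L


N = 4
A = np.eye(N)                          # self-loops
for i in range(N - 1):                 # path graph 0 - 1 - 2 - 3
    A[i, i + 1] = A[i + 1, i] = 1.0
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))

X = np.zeros((N, 1))
X[0, 0] = 1.0                          # signal starts at node 0 only

# Number of nodes reached by the signal after k layers.
reach = {k: int(np.count_nonzero(gcn_stack(A_norm, X, [np.eye(1)] * k)))
         for k in (1, 2, 3)}
```

In this toy graph each additional layer lets the signal travel one more hop, which mirrors the observation that deeper stacks mix in information from increasingly distant, and possibly irrelevant, nodes.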

Conclusions
This paper proposes a WSD method based on GCN, in which words, parts of speech, and semantic categories from all left and right units around the ambiguous word are used as discriminative features. The WSD graph is constructed, whose nodes are words, parts of speech, semantic categories, and the sentence. Edges are added between a word and, respectively, a sentence, another word, a part of speech, and a semantic category. GCN is adopted to process the WSD graph, and the softmax function is used to determine the semantic category of the ambiguous word. We use the WSD graph to describe discriminative features, the sentence, and their relationships, which expresses the relationships between pieces of linguistic knowledge better than serializing them. Experimental results show that GCN has better WSD performance than CNN, LSTM, CNN + LSTM, and CNN + Bi-LSTM, regardless of whether only words or words, parts of speech, and semantic categories are used as discriminative features. GCN allows information to be transferred more fully between nodes of the WSD graph and extracts disambiguation information effectively. It can obtain better WSD performance without too many convolutional layers, which greatly reduces the amount of calculation. In the future, we will integrate more language knowledge into the WSD graph and use the attention mechanism to further improve WSD performance.

Conflicts of Interest
The authors declare that they have no conflicts of interest.