Attention Neural Network for Biomedical Word Sense Disambiguation

To improve the disambiguation accuracy of biomedical words, this paper proposes a disambiguation method based on an attention neural network. The biomedical word is viewed as the center. Morphology, part of speech, and semantic information from 4 adjacent lexical units are extracted as disambiguation features. The attention layer is used to generate a feature matrix. Average asymmetric convolutional neural networks (Av-ACNN) and bidirectional long short-term memory (Bi-LSTM) networks are utilized to extract features. The softmax function is applied to determine the semantic category of the biomedical word. For comparison, CNN, LSTM, and Bi-LSTM are also applied to biomedical word sense disambiguation (WSD). The MSH corpus is adopted to optimize CNN, LSTM, Bi-LSTM, and the proposed method and to test their disambiguation performance. Experimental results show that the average disambiguation accuracy of the proposed method is higher than that of CNN, LSTM, and Bi-LSTM and reaches 91.38%.


Introduction
Biomedical texts are now so numerous that automated tools are needed to process them effectively. However, it is difficult to process biomedical texts automatically because the biomedical field contains many ambiguous words. Determining the semantic categories of biomedical words helps the automatic processing of biomedical articles. Biomedical word sense disambiguation (WSD) is now widely applied to biomedical natural language processing tasks, such as text indexing, text categorization, and named entity extraction.
In the field of biomedicine, professional vocabulary is often polysemous. For example, the biomedical word "ADA" has two senses: "American Dental Association" and "Adenosine Deaminase." The correct meaning of a biomedical word must be determined from the relevant information in its context.
Biomedical WSD methods can be divided into 3 categories: supervised methods, unsupervised ones, and knowledge-based ones. In supervised methods, a labeled dataset along with lexical and syntactic information in context is used to train a classifier that predicts the correct senses of biomedical words in the test dataset. In unsupervised ones, unlabeled biomedical texts are used to provide sense choices for biomedical words. In knowledge-based ones, thesauri and sense inventories are applied to determine the semantic categories of biomedical words. WordNet and the Unified Medical Language System (UMLS) [1] are two important thesauri, which provide brief definitions of different senses and corresponding synonyms. Knowledge-based approaches do not use any corpus but rely solely on such thesauri or sense inventories.
This paper intends to combine neural networks and linguistic knowledge to improve the performance of biomedical WSD. The context around a biomedical word contains a lot of linguistic information that can be used to determine its semantics. However, some of this information is helpful, and some is noise. Neural networks are often used to extract discriminative information, and each neural network has its advantages and disadvantages. It is a challenge for a biomedical WSD system to combine multiple neural networks so that effective discriminative information is extracted from contexts. This paper combines Av-ACNN and Bi-LSTM to extract discriminative features from the context around the biomedical word and determine its semantics, which improves the disambiguation accuracy of the biomedical WSD system.
In this paper, we take morphology, part of speech, and semantic information of four adjacent lexical units around the biomedical acronym as disambiguation features. Word embeddings [2][3][4] are used as representations for biomedical WSD problems. Discriminative information embedded in word units is extracted by an attention mechanism to obtain features at a high level. Based on these features, average asymmetric convolutional neural networks (Av-ACNN), bidirectional long short-term memory (Bi-LSTM) networks, and the softmax function are used to determine the semantic category of the biomedical acronym. The main contributions of this paper are summarized as follows:
(1) Morphology, part of speech, and semantic information of 4 adjacent lexical units around the biomedical acronym are used as disambiguation features. We use word embedding to generate a feature vector.
(2) The attention mechanism is adopted to generate weights dynamically by capturing relationships between the left and right contiguous words of the biomedical acronym.
(3) The multiscale asymmetric convolutional neural network can reduce the amount of computation and obtain more feature information. Useful information can be obtained in the forward and backward directions using bidirectional long short-term memory networks.
The remainder of this paper is organized as follows. Related work is reported in Section 2. Section 3 describes the extraction of disambiguation features and how to generate disambiguation feature vectors. The attention neural network is given in Section 4. The process of training the attention neural network is given in Section 5. Experimental analysis is given in Section 6. Section 7 gives conclusions.

Related Work
There are 3 kinds of biomedical WSD methods: supervised methods, unsupervised ones, and knowledge-based ones.

Supervised Methods.
Supervised methods use labeled data to train the biomedical WSD classifier. Liu applies 3 machine learning algorithms to biomedical WSD, including Naive Bayes and decision lists, adaptation of decision lists, and a mixed supervised learning method. Experiments show that the hybrid supervised method with Naive Bayes performs best in biomedical WSD [5]. Stevenson extracts domain-independent linguistic features around the ambiguous word from the text. These features have been adapted for biomedical text disambiguation by adding CUIs [6] and Medical Subject Heading terms [7]. Son proposes a support vector machine with example-wise weights to solve the WSD task. Weights of training instances are adjusted according to their similarities to test data [8]. Moon builds a clinical sense inventory with 440 common abbreviations. By comparing this inventory with UMLS, Adam, and Stedman, he analyzes these clinical abbreviations and acronyms among diverse resources [9]. Yepes uses word embedding to improve traditional features. At the same time, a recurrent neural network based on LSTM nodes is used for biomedical WSD [10]. Festag applies word embedding and recurrent convolutional neural networks to medical term disambiguation, which maps medical terms to multiple concepts in UMLS [11]. Wang proposes an interactive learning algorithm in which an expert's domain knowledge is used to build a medical WSD model. An expert can provide domain knowledge in 3 ways, including labeling instances, specifying indicative words of a sense, and highlighting supporting evidence in a labeled instance [12]. Bis proposes a novel deep neural network for biomedical WSD, in which a layered bidirectional LSTM network and a max-pooling layer along multiple time steps are built in order to create a dense representation of context [13]. Wei applies CNNs and LSTMs to capture semantic and syntactic features for bioconcept disambiguation [14].
However, supervised biomedical WSD methods need a large amount of human-annotated corpora.

Unsupervised Methods.
Unsupervised methods use an unlabeled corpus to provide sense choices for a word in context. Agirre uses relations in UMLS to create a graph. At the same time, a personalized PageRank algorithm [15] is applied to rank the semantic categories of ambiguous words based on their structural importance in graphs and their relation to words in context [16]. Duan proposes a graph-based algorithm to cluster words into groups, in which the principle of finding the maximum margin between clusters is adopted [17]. Wanton gives a kernel-based method for biomedical WSD. Information in a knowledge base is used to construct an affinity matrix, and kernels are defined based on the matrix [18]. Rab applies six relation types in UMLS to build a graph for ambiguous words. At the same time, he gives a graph-based algorithm to disambiguate terms in a biomedical text [19]. Fernandez proposes a graph-based unsupervised algorithm to solve the WSD problem in a biomedical domain. When a knowledge base is built, contexts of ambiguous terms are considered [20]. Li proposes a language model based on Bi-LSTM, in which word order is considered and the entire sentential context is described. It generates high-quality context representations in an unsupervised manner [21]. Pesaranghader computes sense embeddings based on their text definitions in the Unified Medical Language System. At the same time, he proposes a network to determine the semantic category of the ambiguous term [22]. However, the performance of unsupervised biomedical WSD methods is low.
Discrete Dynamics in Nature and Society

Knowledge-Based Methods.
Knowledge-based approaches apply external lexical resources to biomedical WSD, such as machine-readable dictionaries, thesauri, and ontologies. Rais considers that the terms in context have the same weight. Then, a modified SenseRelate algorithm [23] is given. He applies semantic similarity and relatedness measures to biomedical WSD. Then, the influence of context window size on WSD is evaluated [24]. Yepes compares 4 knowledge-based WSD methods in the biomedical domain. The method that uses semantic categories assigned to concepts in the Metathesaurus performs the best [25]. Plaza studies the influence of 3 WSD algorithms on biomedical summarization, in which documents are mapped onto concepts in UMLS. The three WSD algorithms are, respectively, journal descriptor indexing, machine-readable dictionary, and automatic extracted corpus [26]. McInnes uses semantic similarity and relatedness measures to determine the semantic categories of biomedical terms, which does not require a human-annotated corpus and yields high disambiguation accuracy [27]. Garla adopts a directed concept graph to compute semantic similarity based on UMLS. Vertices represent concepts, and edges denote taxonomical relationships [28]. Based on neural word and concept embeddings, Sabbir combines cosine similarity, projection magnitude proportion, and a prior knowledge-based approach to determine the semantic category of the biomedical term [29]. Antunes uses unlabeled MEDLINE abstracts to generate word embeddings. Word embeddings are applied to compute embedding vectors in the surrounding contexts of the ambiguous term. According to the similarity between the context vector and the concept vector, the meanings of ambiguous terms are determined [30]. However, it is difficult to extract correct knowledge from lexical resources and apply it to biomedical WSD.
Each of these 3 kinds of methods has its own shortcomings. The supervised WSD method achieves better performance, but it needs a large annotated biomedical corpus. The unsupervised WSD method does not need a manually labeled biomedical corpus, but its disambiguation accuracy is not high. The knowledge-based method applies external lexical resources to biomedical WSD, but it is difficult to extract correct knowledge from these resources and apply it correctly.

Preprocessing Text.
Punctuation marks in the context of the biomedical word carry little semantic information and have little influence on determining its semantic category. At the same time, they introduce noise into the process of estimating the model's parameters. Python regular expressions are used to delete punctuation from sentences containing the biomedical word. Part of speech refers to the grammatical features of a class of words, that is, their grammatical functions. It helps to determine the relationship between two words. The NLTK package for Python is adopted to label each word in the sentence with its part of speech. Semantics is the sense of a word. Words with the same or similar sense are classified into one category. The purpose is to decrease data sparsity in the process of parameter estimation. The NLTK package for Python is also used to label each word in the sentence with its semantics.
In the sentence "A message from ADA president Feldman.", "ADA" is a biomedical word. Firstly, the punctuation mark "." is deleted. Secondly, every word is labeled with its part of speech. For "A," the part of speech is DT. For "message" and "president," it is NN. For "from," it is IN. For "ADA" and "Feldman," it is NNP. Thirdly, every word is labeled with its semantics. "A" is annotated with angstrom.n.01, "message" is labeled with message.n.01, "ADA" is annotated with adenosine_deaminase.n.01, and "president" is labeled with president.n.01. "Feldman" and "from" are annotated with "−1."
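The punctuation-removal step can be illustrated with a minimal sketch that uses only Python's standard `re` module. The function name `strip_punctuation` and the pattern are assumptions chosen for illustration and are not taken from the paper; part-of-speech and sense labeling, which the paper performs with NLTK, are not reproduced here.

```python
import re

def strip_punctuation(sentence):
    """Delete punctuation marks and split the sentence into word units."""
    # Keep word characters and whitespace; drop every other character.
    # Note: this also removes hyphens and apostrophes inside words.
    cleaned = re.sub(r"[^\w\s]", "", sentence)
    return cleaned.split()

tokens = strip_punctuation("A message from ADA president Feldman.")
print(tokens)
```

The resulting word units are the input to the part-of-speech and semantic labeling steps described above.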

Disambiguation Feature Extraction.
Words near the ambiguous word have more impact on its sense, whereas words far away from it have less. In this paper, the biomedical word is viewed as the center. Morphology, part of speech, and semantic information from the 2 left and 2 right lexical units are extracted as disambiguation features to determine its meaning. When the number of left or right contiguous vocabulary units is less than 2, the corresponding disambiguation feature is set to −1. The purpose is to ensure that each biomedical word has 12 disambiguation features.
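The windowing rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `extract_features`, the grouping of features word by word, and the string "-1" as the padding value are choices made here; the annotated triples are the ones given for the "ADA" example sentence.

```python
PAD = "-1"  # placeholder for a missing neighbor (assumed string form of -1)

def extract_features(annotated, target_index, window=2):
    """Collect (word, POS, sense) of `window` left and right neighbors.

    `annotated` is a list of (word, pos, sense) triples for one sentence.
    """
    features = []
    offsets = list(range(-window, 0)) + list(range(1, window + 1))
    for offset in offsets:
        i = target_index + offset
        if 0 <= i < len(annotated):
            features.extend(annotated[i])
        else:
            features.extend([PAD, PAD, PAD])  # neighbor does not exist
    return features

sentence = [
    ("A", "DT", "angstrom.n.01"),
    ("message", "NN", "message.n.01"),
    ("from", "IN", "-1"),
    ("ADA", "NNP", "adenosine_deaminase.n.01"),
    ("president", "NN", "president.n.01"),
    ("Feldman", "NNP", "-1"),
]
features = extract_features(sentence, target_index=3)
print(len(features), features)
```

With 2 neighbors on each side and 3 features per neighbor, every target word yields 12 features, including targets at the start or end of a sentence.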
For an English sentence containing the biomedical word "ADA," the process of extracting disambiguation features is as follows:
English sentence: A message from "ADA" president Feldman.
Part-of-speech tagging: A/DT message/NN from/IN ADA/NNP president/NN Feldman/NNP
Semantic annotation: A/DT/angstrom.n.01 message/NN/message.n.01 from/IN/-1 ADA/NNP/adenosine_deaminase.n.01 president/NN/president.n.01 Feldman/NNP/-1
The process of extracting disambiguation features is shown in Figure 1.

CBOW Model.
Word2vec's CBOW [2] is used to generate a feature vector. The input of the CBOW model is the word vectors of the context words of a target word, and its output is a probability distribution. The dimension of the output is the same as that of the input. The gradient descent method is used to update the input weights and output weights. After the training process, each word in the input layer is multiplied by the input weights to get a word vector. The size of the word window is 4. The window contains the 2 left word units and the 2 right ones. Twelve features from these 4 word units are input to the CBOW model. The word vector of "ADA" is computed as shown in Figure 2.

Generation of Disambiguation Feature Matrix.
The feature vector is a real-valued vector that maps a high-dimensional space into a low-dimensional one and can represent a large amount of potential information in a word. Twelve disambiguation features are extracted from a sentence containing the biomedical word "ADA." The CBOW model is used to convert each feature into a 100-dimensional feature vector.
In this paper, we design 3 optional methods to construct a feature matrix:
(1) The first one uses the morphology of the 4 adjacent vocabulary units around the biomedical word "ADA" to construct feature vectors. These feature vectors are used to construct a 4 × 100 feature matrix F.
(2) The second one considers the positions of the left and right adjacent words. Generally, if a context-related word is closer to the biomedical word "ADA," it is more important for determining the category of "ADA," and its weight is bigger. The weighted sum of the 12 feature vectors is used as the feature vector of the biomedical word "ADA," whose dimension is 100. This feature vector is used to construct a 10 × 10 feature matrix M.
(3) The third one does not consider the positions of the left and right adjacent words. The 12 disambiguation feature vectors of the biomedical word "ADA" are used, without position weights, to construct a 12 × 100 feature matrix V.
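Method (2) can be sketched as a weighted sum of the 12 feature vectors followed by a reshape into a 10 × 10 matrix. This is an illustrative sketch only: the paper does not list the position weights, so the values below (2.0 for the two nearest words, 1.0 for the two farther ones) and the toy all-ones vectors are hypothetical, as is the function name `weighted_feature_matrix`.

```python
def weighted_feature_matrix(vectors, weights, side=10):
    """Weighted sum of feature vectors, reshaped into a side x side matrix."""
    dim = len(vectors[0])
    if dim != side * side:
        raise ValueError("vector dimension must equal side * side")
    # Weighted sum over the feature vectors, one component at a time.
    summed = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    # Cut the 100-dimensional vector into 10 rows of 10 values.
    return [summed[r * side:(r + 1) * side] for r in range(side)]

# Feature order assumed: left-2, left-1, right-1, right-2, three features each.
# Hypothetical weights: nearer words count twice as much as farther ones.
weights = [1.0] * 3 + [2.0] * 3 + [2.0] * 3 + [1.0] * 3
vectors = [[1.0] * 100 for _ in range(12)]  # toy stand-ins for CBOW vectors

M = weighted_feature_matrix(vectors, weights)
print(len(M), len(M[0]), M[0][0])
```

Method (3) skips the weighted sum and keeps the 12 vectors as the rows of the matrix.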

Disambiguation Model Based on Attention Neural Network

Attention Layer.
The attention mechanism is used to generate weights dynamically to capture the relationship between the left and right adjacent words of the biomedical word. Feature matrix S is generated according to these weight parameters. Feature matrix V is input to the attention layer, and feature vector s_m is calculated by equations (1) to (3), where W^Q, W^K, and W^L are weight matrices, φ_mn is the weight coefficient, and s_m is the output feature vector. Here, d is the dimension of the input, which is used to scale the inner product. The indices m and n each take the values 1, 2, ..., 12.
Equation (1) is used to compute the correlation strength a_mn of two elements. Weight coefficient φ_mn is calculated from correlation strength a_mn as shown in (2), and s_m is computed as shown in (3). Feature matrix S is constructed from the vectors s_1, s_2, ..., s_12 as shown in (4). For the biomedical word "ADA," the attention operation used to generate feature matrix S is shown in Figure 3.
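The attention layer can be illustrated with a minimal pure-Python sketch. It assumes the standard scaled dot-product form: the correlation strength is the inner product of the projected vectors divided by the square root of d, the weight coefficients are a softmax over the correlation strengths, and s_m is the weighted sum of the projected vectors. The paper's exact equations may differ in detail; the function names and the small dimension used in the demo are illustrative choices.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(V, W_Q, W_K, W_L):
    """Return feature matrix S with one output vector s_m per row of V."""
    d = len(V[0])
    queries = [matvec(W_Q, v) for v in V]
    keys = [matvec(W_K, v) for v in V]
    values = [matvec(W_L, v) for v in V]
    S = []
    for q in queries:
        # Correlation strength a_mn, scaled by sqrt(d) (assumed form).
        a = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        phi = softmax(a)  # weight coefficients
        s_m = [sum(p * val[j] for p, val in zip(phi, values))
               for j in range(len(values[0]))]
        S.append(s_m)
    return S

random.seed(0)
d = 4  # toy dimension; the paper uses 100-dimensional feature vectors
V = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(12)]
identity = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
S = attention(V, identity, identity, identity)
print(len(S), len(S[0]))
```

Because the weight coefficients are recomputed from the input for every sentence, the contribution of each neighbor is set dynamically rather than by a fixed position weight.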

Convolutional Layer.
The convolutional neural network (CNN) is a deep learning model that includes an input layer, convolutional layer, pooling layer, dropout layer, and fully connected layer. CNN can extract local features from data through convolution operations. CNN has representation learning ability and, owing to its hierarchical structure, can classify input information in a shift-invariant manner. CNN shares convolution kernel weights and uses multiple feature maps, which allows it to process high-dimensional data. After a series of convolution and pooling operations, the number of model parameters and the risk of overfitting are reduced. The convolutional neural network can capture the local correlation of space-time structure and has achieved excellent performance in natural language processing, computer vision, and image processing.
Here, the asymmetric convolution proposed by Szegedy [31] is introduced, in which a k_i × h convolution is split into a k_i × 1 convolution and a 1 × h convolution. The biggest advantage of this method is that it reduces the amount of computation while its effect is similar to that of two-dimensional convolution. The size of the convolutional kernel is k_i × 1, i = 1, 2, 3. Multiple convolution kernels of different sizes are set to get different features.
The first convolution operation, which uses the 1 × h convolution kernel, is applied to s_m and generates the corresponding feature z_m^i, where m = 1, 2, ..., 12, w_i^1 is the convolution kernel in which 1 means the first convolution, i distinguishes the 3 parallel asymmetric convolution operations, b_i^1 denotes the bias, R(·) is the activation function, and c represents the net activation of the convolutional layer. R(·) is the ReLU activation function, R(c) = max(0, c). After the 1 × h convolution kernel is applied, 12 feature values are obtained, and the feature mapping Z_i is constructed from these 12 values.
The second convolution operation, which uses the k_i × 1 convolution kernel, is applied to Z_i, and feature value c_m^i is computed, where m = 1, 2, ..., 12, w_i^2 is the convolution kernel in which 2 means the second convolution, i distinguishes the 3 parallel asymmetric convolution operations, b_i^2 denotes the bias, R(·) is the activation function, and e represents the net activation of the convolutional layer. The activation function is again ReLU, R(e) = max(0, e). In the second convolution, the 3 asymmetric operations use the same number of convolution kernels but different kernel sizes, so they output the same number of feature mappings C_i. Features with the same index in the convolution window are averaged to generate feature mapping D with entries D_j, where j = 1, 2, ..., 12.
According to the index in the convolution window, feature mappings C_i of the 3 asymmetric convolution operations are concatenated with feature mapping D to generate E. The above process is called Av-ACNN. If feature mappings C_1, C_2, and C_3 are concatenated directly and input into Bi-LSTM, the process is called ACNN.
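The Av-ACNN steps can be sketched in pure Python as follows. Several details are assumptions made for illustration: each branch uses a single kernel, the k × 1 convolution is zero-padded at the end so that the output length equals the input length, D is the unweighted mean of the three branches, and the toy input has 3 rows instead of 12. The function names are hypothetical.

```python
def relu(x):
    """ReLU activation: R(x) = max(0, x)."""
    return max(0.0, x)

def conv_1xh(S, w1, b1):
    """First convolution: a 1 x h kernel maps each row s_m to one value z_m."""
    return [relu(sum(w * x for w, x in zip(w1, row)) + b1) for row in S]

def conv_kx1(Z, w2, b2):
    """Second convolution: a k x 1 kernel slides over Z (zero-padded at the end)."""
    k = len(w2)
    padded = Z + [0.0] * (k - 1)
    return [relu(sum(w2[j] * padded[m + j] for j in range(k)) + b2)
            for m in range(len(Z))]

def av_acnn(S, branches):
    """branches: list of (w1, b1, w2, b2), one per asymmetric convolution."""
    C = [conv_kx1(conv_1xh(S, w1, b1), w2, b2) for (w1, b1, w2, b2) in branches]
    # Average the values that share the same window index.
    D = [sum(col) / len(col) for col in zip(*C)]
    # Concatenate the branch outputs and their average, index by index.
    E = [list(col) + [d] for col, d in zip(zip(*C), D)]
    return C, D, E

S = [[1.0, -1.0], [2.0, 0.5], [0.0, 3.0]]  # toy feature matrix
branches = [
    ([1.0, 1.0], 0.0, [1.0], 0.0),            # second kernel size 1
    ([1.0, 1.0], 0.0, [1.0, 1.0], 0.0),       # second kernel size 2
    ([1.0, 1.0], 0.0, [1.0, 1.0, 1.0], 0.0),  # second kernel size 3
]
C, D, E = av_acnn(S, branches)
print(C)
print(D)
print(E)
```

Dropping D from the concatenation gives the plain ACNN variant that the text distinguishes from Av-ACNN.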
For feature matrix S in the above example, the process of semantic classification is shown in Figure 4. The cell structure of LSTM is shown in Figure 5.

Bi-LSTM Layer.
LSTM is a special recurrent neural network (RNN) that can capture longer-distance information. It uses a set of gate controllers to effectively alleviate the vanishing and exploding gradient problems of RNN. LSTM infers the semantics of a biomedical word from the previous input information but cannot take the subsequent information into account, which affects WSD accuracy. In fact, both the left and right contexts around the biomedical word influence the process of WSD, so access to the right context as well as the left one is very beneficial for biomedical WSD. Bi-LSTM makes up for this shortcoming of LSTM. It is composed of two LSTMs, a forward LSTM and a backward LSTM, which represent, respectively, the left context and the right context of the biomedical word. Bi-LSTM is very suitable for sequence annotation tasks with a top-down relationship. It is often used to model context information in natural language processing and provides help for biomedical WSD.
In this paper, LSTM takes input data at multiple time steps and outputs results at the last time step. The results at the last time step are added and input into a fully connected layer. The LSTM unit contains memory unit C_t and 3 gate controllers: input gate i_t, forget gate f_t, and output gate o_t. These 3 gates control the update of memory unit C_t and the output of hidden layer state h_t. In the computation of h_t, W_i, W_f, W_c, and W_o are, respectively, the weight matrices of the input gate, forget gate, candidate state, and output gate.
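One LSTM step and the bidirectional pass can be sketched as below. The sketch assumes the standard LSTM gate formulation (sigmoid gates, tanh candidate state, each gate acting on the concatenation of the previous hidden state and the current input); the paper's own equations may be written differently. The parameter layout and function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'i', 'f', 'o', 'c' to (W, b) on [h_prev; x]."""
    z = h_prev + x  # list concatenation of previous hidden state and input

    def affine(name):
        W, b = params[name]
        return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]

    i_t = [sigmoid(v) for v in affine("i")]    # input gate
    f_t = [sigmoid(v) for v in affine("f")]    # forget gate
    o_t = [sigmoid(v) for v in affine("o")]    # output gate
    g_t = [math.tanh(v) for v in affine("c")]  # candidate state
    c_t = [f * cp + i * g for f, cp, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def run_lstm(sequence, params, hidden):
    """Feed the whole sequence and return the hidden state of the last step."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h

def bi_lstm(sequence, forward_params, backward_params, hidden):
    """Forward pass over the sequence and backward pass over its reverse."""
    return (run_lstm(sequence, forward_params, hidden),
            run_lstm(sequence[::-1], backward_params, hidden))

def zero_params(hidden, inputs):
    """All-zero parameters, used here only to exercise the code."""
    return {name: ([[0.0] * (hidden + inputs) for _ in range(hidden)],
                   [0.0] * hidden) for name in "ifoc"}

sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
H_fwd, H_bwd = bi_lstm(sequence, zero_params(2, 3), zero_params(2, 3), hidden=2)
print(H_fwd, H_bwd)
```

The two returned vectors play the role of the forward and backward outputs that are passed on to the fully connected layer.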

Fully Connected Layer and Semantic Classification.
Outputs H→ and H← of Bi-LSTM are input to a fully connected layer. The softmax function is applied to map the output of each neuron to the interval (0, 1), as shown in equation (21). The purpose is to determine the semantic categories of biomedical words.
In equation (22), biomedical word c has r semantic categories and x_i represents the ith semantic class, i = 1, 2, ..., r.
From the probability distribution P(x_1|c), P(x_2|c), ..., P(x_r|c) of biomedical word c, the semantic category s with the highest probability is selected as the predicted one, as shown in the following equation:
s = \arg\max_{x_i,\; i = 1, \ldots, r} P(x_i \mid c) \quad (23)
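The classification step can be sketched as a softmax over the outputs of the fully connected layer followed by an argmax. The scores below are made-up numbers for illustration, and `predict_sense` is a hypothetical helper name; the two senses are the ones given for "ADA" in the introduction.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_sense(scores, senses):
    """Return the sense with the highest probability and the distribution."""
    probs = softmax(scores)
    best = max(range(len(senses)), key=lambda i: probs[i])
    return senses[best], probs

senses = ["American Dental Association", "Adenosine Deaminase"]
scores = [0.3, 1.7]  # hypothetical outputs of the fully connected layer
sense, probs = predict_sense(scores, senses)
print(sense, probs)
```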

Biomedical WSD Based on Attention Neural Network
The attention mechanism, Av-ACNN, and Bi-LSTM are combined to disambiguate biomedical words. Training this network includes forward propagation and backpropagation. Semantic classification is forward propagation. Gradient calculation and parameter optimization are backpropagation. The training process is as follows.
Step 1: initialize iteration number n and parameter set θ.
Step 2: matrix V is constructed based on N and input into the attention layer. Feature matrix S is built according to equation (4).
Step 3: according to equation (7), feature mappings Z_1, Z_2, and Z_3 are generated by the convolutional operation in which the size of the convolution kernel is 1 × h.
Step 4: according to equation (10), feature mappings C_1, C_2, and C_3 are generated by the convolutional operation in which the size of the convolution kernel is k_i × 1.
Step 5: according to equation (14), feature mapping E is constructed and input into the Bi-LSTM layer. Hidden layer output h_t is calculated according to equation (20).
Step 6: according to equation (23), the category of biomedical word c is determined.
Backward propagation includes gradient calculation and parameter optimization. The process of gradient computation is as follows.
Step 1: loss value J is computed, where y_1, y_2, ..., y_r are the one-hot codes of the semantic category s.
Step 2: gradient δ_Y of the output layer is calculated, where ⊙ is the product of corresponding elements.
Step 3: gradients δ_H and δ_C^l at the last time step of the forward LSTM and backward LSTM are, respectively, calculated, where l represents the last time step.
Step 4: gradients δ_h^t and δ_C^t at any time step t in LSTM are computed from the gradients at time step t + 1.
Step 5: gradient δ_i^2 of the second convolutional layer is calculated, where i is the ith asymmetric convolution and 2 is the second convolutional layer.
Step 6: gradient δ_i^1 of the first convolutional layer is computed, where 1 represents the first convolutional layer and rot180(·) is the operation of rotating by 180°.
The process of parameter update, Update_Parameter( ), is as follows.
Step 1: update the weights and bias of the fully connected layer, where α is the learning rate and H = {H→, H←}.
Step 2: update weight matrices W_y and W_c and bias terms b_y and b_c in the LSTM cell, where l is the last time step, α is the learning rate, and y ∈ {i, f, o}.
Step 3: update weight w_i^2 and bias b_m^2 in the second convolutional layer, where Z_i is the output of the first convolutional layer.
Step 4: update weight w_i^1 and bias b_m^1 in the first convolutional layer.
When the semantic category of biomedical word c is to be determined, disambiguation features are extracted from its 4 adjacent lexical units. Feature matrix V is constructed and input into the attention layer. According to equation (4), the attention layer outputs feature matrix S, which is input into the convolutional layer. Based on equation (10), feature mappings C_1, C_2, and C_3 of the asymmetric convolution are computed. According to the indices of the convolution window, C_1, C_2, and C_3 are fused twice (averaged and then concatenated) to obtain feature mapping E, which is input into the Bi-LSTM layer. The output of the Bi-LSTM layer is computed based on equation (20). According to equation (22), the probability distribution P(x_i|c) of biomedical word c under semantic category x_i is calculated. According to equation (23), its semantic category is determined.

They are, respectively, Ca and PCA, which are all selected. Sentences containing these 51 biomedical words are used as training corpus and test corpus to measure the proposed method's performance.

Experiments
Experiment Analysis.
Ten groups of experiments are carried out, and average disambiguation accuracy is used to evaluate the performance of the WSD classifier, which is defined as
p_i = \frac{m_i}{n_i}, \qquad p_{avg} = \frac{1}{N} \sum_{i=1}^{N} p_i
where N is the number of all biomedical words, m_i is the number of test sentences correctly classified for the ith biomedical word, n_i is the number of all test sentences containing the ith biomedical word, p_i is the disambiguation accuracy of the ith biomedical word, and p_avg is the average disambiguation accuracy. In Experiments 1 to 3, method (1) is used to construct feature matrix F. In Experiment 1, CNN is applied to determine the semantic category of the biomedical word. In Experiment 2, LSTM is used to determine its semantic class. In Experiment 3, Bi-LSTM is adopted to determine its semantic category. In Experiment 4, morphology, part of speech, and semantic information in 4 adjacent vocabulary units of the biomedical word are used as disambiguation features. Feature matrix V is constructed by method (3). The proposed framework is adopted to determine its semantic class. Disambiguation accuracies from Experiment 1 to Experiment 4 are shown in Table 1.
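The accuracy measure defined above is a macro average: accuracy is first computed per biomedical word and then averaged over words. A minimal sketch follows; the counts are hypothetical and are not results from the paper.

```python
def average_accuracy(results):
    """results: list of (m_i, n_i) pairs, one per ambiguous biomedical word.

    m_i is the number of correctly classified test sentences and n_i the
    number of test sentences for the ith word.
    """
    per_word = [m / n for m, n in results]
    return per_word, sum(per_word) / len(per_word)

# Hypothetical counts for three ambiguous words.
per_word, p_avg = average_accuracy([(45, 50), (30, 40), (18, 20)])
print(per_word)
print(f"{p_avg:.4f}")  # prints 0.8500
```

Because every word contributes equally, a word with few test sentences influences the average as much as a word with many.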
The average disambiguation accuracy of Experiment 2 is 1.19% higher than that of Experiment 1. Experiments show that, compared to CNN, LSTM is more suitable for the disambiguation of the MSH corpus. The average disambiguation accuracy of Experiment 3 is 1.35% higher than that of Experiment 2. The reason is that Bi-LSTM takes account of contextual information from two directions. In Experiment 4, the proposed network is used to disambiguate biomedical words. The disambiguation accuracy of Experiment 4 is 5.6% higher than that of Experiment 3. The reason is that morphology, part of speech, and semantic information are extracted as disambiguation features in Experiment 4, whereas Experiment 3 only considers morphology. The proposed network is better than Bi-LSTM.
In Experiment 5, Experiment 6, and Experiment 7, morphology, part of speech, and semantic information in 4 left and right lexical units of the biomedical word are selected as disambiguation features, and feature matrix M is constructed by method (2). In Experiment 5, ACNN and LSTM are combined to determine the semantic class of the biomedical word. In Experiment 6, ACNN and Bi-LSTM are fused to determine its semantic category. In Experiment 7, Av-ACNN and Bi-LSTM are combined to determine the semantic class of the biomedical word. The training corpus of MSH is used to optimize the WSD classifier, and the optimized WSD model is applied to classify the test corpus of MSH. Disambiguation accuracies from Experiment 5 to Experiment 7 are shown in Table 2.
The average disambiguation accuracy of Experiment 6 is 1.25% higher than that of Experiment 5. The reason is that Bi-LSTM is composed of a forward LSTM and a backward LSTM, which obtain feature information in context from two directions, whereas LSTM obtains feature information from only one direction. Therefore, the disambiguation accuracy of Bi-LSTM is higher than that of LSTM. The average disambiguation accuracy of Experiment 7 is 0.72% higher than that of Experiment 6. The reason is that, in Experiment 7, the average D of C_1, C_2, and C_3 is computed according to the indices of the convolutional window. Then, C_1, C_2, C_3, and D are concatenated according to the indices of the convolutional window to get feature mapping E. Because more disambiguation information is included in E, the disambiguation effect is improved.
In Experiment 8, morphology, part of speech, and semantic information are extracted as disambiguation features from 4 adjacent lexical units of the biomedical word. Feature matrix V is generated by method (3). Av-ACNN and Bi-LSTM are combined to determine the semantic class of the biomedical word. Disambiguation accuracies of Experiment 4, Experiment 7, and Experiment 8 are shown in Table 3. The disambiguation effect of Experiment 7 is better than that of Experiment 8. This shows that the disambiguation ability of the feature matrix constructed by method (2) is better than that of the feature matrix constructed by method (3). The reason is that the position of the adjacent word is considered in method (2). Generally, if a context-related word is closer to the biomedical word, it is more important for determining the category of the biomedical word and its weight is bigger. The disambiguation accuracy of Experiment 4 is 2.26% higher than that of Experiment 7. This is because the attention layer dynamically generates the weight coefficients that are applied to construct the feature matrix in Experiment 4, whereas in Experiment 7 the weight coefficients are set manually. Therefore, the disambiguation effect of Experiment 4 is better than that of Experiment 7. The disambiguation accuracy of Experiment 4 is 3.14% higher than that of Experiment 8, in which the attention layer is not used. This shows that the attention layer is helpful for improving the disambiguation effect.
The size of the convolution kernel has a great influence on feature extraction. If the convolution kernel is too small, it is not able to extract effective features. If the convolution kernel is too large, the amount of computation increases. Three groups of experiments have been conducted in which 3 convolution kernels with different sizes are used to obtain discriminative features. In Experiment 4, Experiment 9, and Experiment 10, morphology, part of speech, and semantic information in 4 left and right lexical units of the biomedical word are selected as disambiguation features. At the same time, method (3) is used to construct feature matrix V, and the attention network proposed in this paper is applied to determine the semantic class of the biomedical word. In Experiment 4, the second convolution kernels for the 3 asymmetric convolution operations are, respectively, 2, 3, and 5 in size. In Experiment 9, they are, respectively, 3, 4, and 5 in size. In Experiment 10, they are, respectively, 1, 2, and 3 in size. The disambiguation accuracies of Experiment 4, Experiment 9, and Experiment 10 are shown in Table 4. The average disambiguation accuracy of Experiment 4 is higher than that of Experiment 9 and Experiment 10, which shows that the convolution kernel sizes used in Experiment 4 are suitable.
From Experiment 1 to Experiment 8, the average disambiguation accuracies of biomedical words with 2, 3, 4, and 5 semantic categories are calculated. Experimental results are shown in Figure 6.
From Figure 6, it can be seen that the average disambiguation accuracy of the proposed method is better than that of the other methods for words with 2, 3, 4, and 5 semantic classes. The average disambiguation accuracy for words with 2 semantic categories is higher than that for words with 3 and 4 categories. The reason is that the fewer the semantic classes, the less difficult biomedical WSD is. However, the disambiguation accuracy for words with five categories is relatively high. The reason is that there is only one ambiguous word with five categories, and the distribution of its training corpus may match that of its test corpus.
Finally, the time cost of the proposed model is analyzed. Here, n is the sequence length, d is the representation dimension, k is the kernel size of convolutions, c is the number of categories, and s is the number of support vectors. The run-time complexity of CNN is O(k·n·d). The run-time complexity of SVM is O(s). Although the run-time complexity of the proposed method is the largest, its average disambiguation accuracy is the highest.

Conclusions
In this paper, morphology, part of speech, and semantic information are extracted as disambiguation features from 4 adjacent lexical units of the biomedical word. The attention mechanism is used to generate a feature matrix, from which Av-ACNN and Bi-LSTM extract discriminative features. Based on these discriminative features, the softmax function is applied to determine the category of the biomedical word. Experimental results show that the proposed method is more suitable for biomedical WSD than the other methods.

Conflicts of Interest
The authors declare that they have no conflicts of interest.