Multidimensional Self-Attention for Aspect Term Extraction and Biomedical Named Entity Recognition

Named entity recognition (NER) in specific fields has received wide attention. Representative tasks include aspect term extraction (ATE) in online user comments and biomedical named entity recognition (BioNER) in medical documents. Existing methods perform well only in a particular field and have difficulty maintaining an advantage in other fields. In this article, we propose a supervised learning method that can be used for many special-domain NER tasks. The model consists of two parts: a multidimensional self-attention (MDSA) network and a CNN-based model. The multidimensional self-attention mechanism calculates the importance of the context to the current word, selects the relevant context according to this importance, and updates the word vector accordingly. This update mechanism gives the subsequent CNN model a variable-length memory of the sentence context. We conduct experiments on benchmark datasets for the ATE and BioNER tasks. The results show that our model surpasses most baseline methods.


Introduction
With the rapid growth of Internet data, people urgently need to obtain valuable information from massive amounts of unstructured text. In this context, the task of named entity recognition (NER) has attracted considerable attention. However, most research is based on extracting place names or organization names from general datasets and pays little attention to specific fields. This paper explores aspect term extraction in online user comments and biomedical named entity recognition in medical documents.
Aspect term extraction (ATE) is an important subtask in aspect-based sentiment analysis [1,2]. It aims to detect opinion targets that appear explicitly in sentences. For example, in "Screen is crystal clear and the system is very responsive," the words "Screen" and "system" should be extracted as aspect terms. The target words we extract must co-occur with emotional words, so the context is very important: only entities that carry emotion are target words to be extracted. The purpose of biomedical named entity recognition (BioNER) is to extract specific entities in medical texts, such as disease, gene, and protein names. BioNER is an important subtask of medical information extraction. Comprehensive and accurate identification of entities in medical texts helps make full use of medical information. At the same time, BioNER is one of the basic tasks in constructing medical knowledge graphs [3].
In previous works, ATE and BioNER are treated as token-classification tasks. Traditional machine learning models such as the hidden Markov model (HMM) [4] and the conditional random field (CRF) [5,6] are used to solve them. Although these methods achieve reasonable performance, they rely on handcrafted features that cannot be applied to other tasks. Recently, deep learning methods have been widely used in natural language processing (NLP), and neural models have become the de facto standard for high-performance systems. Many researchers have successfully applied deep learning methods to the NER task. Similarly, the named entity recognition task in a special field can still be regarded as a token-level sequence labeling task.
However, NER in special fields remains challenging because several problems still need to be solved. First, taking the ATE task as an example, ATE requires the extraction of terms with sentimental tendencies. For instance, in the sentence "Sapphire is the only Indian restaurant I go to when I'm in NYC," we cannot extract the word "Sapphire" because it does not carry the user's emotions. The model needs to distinguish effectively whether an entity is emotionally inclined. Second, taking the BioNER task as an example, medical texts contain many abbreviations, and they are often ambiguous: their meanings change with the context. For instance, CAT can represent "chloramphenicol acetyl transferase" or "computer automated tomography" [7].
Third, the expansion of the medical exploration field and the rapid growth in medical entity terms have brought great difficulties for the biomedical named entity recognition task. For example, with the spread of COVID-19, "COVID-19" has become a medical entity with practical meaning. The key to solving the problem of NER in special fields is to better mine the relationship between entities and context. In previous studies, many researchers used convolutional neural networks (CNN) and long short-term memory networks (LSTM) to obtain good results in a particular special field. However, their mining of the relationship between entities and context is insufficient.
In this paper, we propose a multidimensional self-attention CNN (MDSA-CNN) model for the named entity recognition task in a special field. We use the self-attention mechanism to calculate the dependencies between elements in the input sequence, and the calculated results reflect the importance of the elements. Through the trained parameter matrices, the mechanism can capture the dependent elements that contribute significantly to the task, regardless of the distance between the elements in the sequence. After that, the current word vector is updated using these weights, so that the word vector of the current element better expresses its true meaning in the context.
We use two tasks, ATE and BioNER, to verify the effectiveness of our model. For the ATE task, extensive experiments on the SemEval-2014 and SemEval-2016 datasets indicate that our MDSA-CNN model is superior to other baseline methods in aspect term extraction for aspect-level sentiment analysis. For the BioNER task, we use the NCBI-disease and BC2GM datasets. The results show that our model also exceeds the benchmark scores on these medical datasets. On the ATE and BioNER tasks, which use datasets from two completely different special fields, our model achieves gratifying results.

Related Work
There is little research on NER across multiple special fields; most researchers conduct research in a single special field [8-10]. We take the ATE and BioNER tasks as examples to describe the development of NER in special fields.
ATE is the first step of aspect-based sentiment analysis. Early winners [5,6,11] of the SemEval aspect-based sentiment analysis (ABSA) challenges employed traditional sequence models, such as CRFs and maximum entropy (ME), to extract target words. They depend heavily on feature engineering. The aspect terms and opinion terms are extracted together based on complex syntax patterns [12]. Recently, neural network-based models have become the mainstream method of aspect extraction. A recurrent neural network based on semantic synthesis tasks [13] uses a syntactic analysis tree to identify the emotions of phrases and sentences in order. Irsoy and Cardie [14] apply a deep Elman-type RNN to extract opinion and aspect term expressions. LSTM-based [15] and CNN-based [16,17] methods have achieved good results. Later on, Wang et al. [18] and He et al. [19] employed the attention mechanism to select and focus on the relevant parts of the input for ATE tasks.
BioNER is an important part of a medical text understanding system and plays a decisive role in accurately understanding medical texts. Early researchers used rule-based methods to deal with BioNER tasks [20,21]. These methods rely on a large number of manual features and are difficult to adapt to the rapid development of information in the medical field. After that, CRF-based methods became mainstream [22]. The complexity of medical terms has prompted researchers to use neural networks to handle BioNER tasks. Among them, deep learning methods, especially multitask learning methods, have received extensive attention from researchers [21,23]. These methods perform well only in a particular field and have difficulty maintaining an advantage in other fields.

Problem Definition and Notations.
The NER task in a special field can be formulated as a token-level sequence labeling problem. Assuming the input is a sequence of word indexes $X = x_1, \ldots, x_T$, we should predict a label sequence $Y = y_1, \ldots, y_T$, where each $y_i$ comes from a finite label set $G = \{B, I, O\}$. Note that an aspect term can be a phrase: B and I indicate the beginning word and a nonbeginning word of a term phrase, and O indicates other words.

Model Description.
As shown in Figure 1, our model contains two key components: multidimensional self-attention and a CNN-based model.
First, we introduce the structure of multidimensional attention. The traditional attention mechanism calculates only a single score for each word vector; multidimensional attention changes this. It produces a vector with the same length as $x_i$. We denote this vector by $f(x_i, x_j)$; it represents the dependency between $x_i$ and $x_j$, which come from the same source $x$. We initialize $W, W^{(1)}, W^{(2)} \in \mathbb{R}^{d_e \times d_e}$ and use two bias terms before applying the activation function.
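The dependency function can be written out explicitly under one assumption. The following is a sketch that assumes the commonly used additive form of multidimensional attention, with an activation function $\sigma$ and two bias vectors; the symbols $b^{(1)}$ and $b$ and their exact placement are assumptions, since the text above does not state them:

```latex
f(x_i, x_j) = W^{\top} \sigma\left( W^{(1)} x_i + W^{(2)} x_j + b^{(1)} \right) + b,
\qquad f(x_i, x_j) \in \mathbb{R}^{d_e}
```

Because the output has $d_e$ components rather than one, each feature dimension $k$ of token $i$ receives its own score $[f(x_i, x_j)]_k$.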
After that, we use the softmax function to convert $f(x_i, x_j)$ into a probability distribution:

$$p(z_k = i \mid x, x_j) = \frac{\exp\left([f(x_i, x_j)]_k\right)}{\sum_{m=1}^{n} \exp\left([f(x_m, x_j)]_k\right)} \qquad (2)$$

The value of $p(z_k = i \mid x, x_j)$ represents the correlation of the $k$-th dimension of token $i$ to $x_j$. A larger $p(z_k = i \mid x, x_j)$ indicates a higher correlation between the two word vectors.
We use $P^j_{ki}$ to refer to $p(z_k = i \mid x, x_j)$. The output $s_j$ can be written as

$$s_j = \sum_{i=1}^{n} P^j_{\cdot i} \odot x_i \qquad (3)$$

where $P^j_{\cdot i}$ is the vector that collects $P^j_{ki}$ over all dimensions $k$ and $\odot$ is the elementwise product. The output of multidimensional attention for all elements from $x$ is $s = [s_1, s_2, \cdots, s_n] \in \mathbb{R}^{d_e \times n}$. We use $s_i$ to represent the word vector of each word, and $s$ is the sentence encoding output. Each word vector is updated by the attention mechanism. At this point, the word vector contains the valid information of the context and can more faithfully express the meaning of the word itself.
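The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the additive form of the score function, the choice of tanh as activation, the zero biases, and the toy dimensions are all assumptions made for the example.

```python
import math
import random


def matvec(m, v):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def score(x_i, x_j, W, W1, W2, b1, b):
    """Assumed additive form: f(x_i, x_j) = W^T tanh(W1 x_i + W2 x_j + b1) + b."""
    hidden = [math.tanh(p + q + r)
              for p, q, r in zip(matvec(W1, x_i), matvec(W2, x_j), b1)]
    W_t = [list(col) for col in zip(*W)]  # transpose of W
    return [v + c for v, c in zip(matvec(W_t, hidden), b)]


def multidim_self_attention(x, W, W1, W2, b1, b):
    """Return [s_1, ..., s_n]; each s_j has the same length as x_j."""
    n, d = len(x), len(x[0])
    outputs = []
    for x_j in x:
        f = [score(x_i, x_j, W, W1, W2, b1, b) for x_i in x]  # n score vectors
        s_j = []
        for k in range(d):
            # Softmax over tokens i, computed separately for feature dimension k.
            col = [f[i][k] for i in range(n)]
            top = max(col)
            exps = [math.exp(v - top) for v in col]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Dimension k of s_j is a weighted sum of dimension k of every token.
            s_j.append(sum(probs[i] * x[i][k] for i in range(n)))
        outputs.append(s_j)
    return outputs


def random_matrix(d, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]


rng = random.Random(0)
d_e, n = 4, 3  # toy sizes, chosen only for illustration
x = [[rng.uniform(-1.0, 1.0) for _ in range(d_e)] for _ in range(n)]
W, W1, W2 = (random_matrix(d_e, rng) for _ in range(3))
b1, b = [0.0] * d_e, [0.0] * d_e
s = multidim_self_attention(x, W, W1, W2, b1, b)
```

Each component of an updated vector is a convex combination of the same component across all tokens, which is why distance between words plays no role in the update.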
We use the word vectors processed by the attention mechanism as the input to a four-layer CNN. We employ one-dimensional convolution with kernel size $k = 2c + 1$ at each layer.
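The relation between the kernel size $k = 2c + 1$ and the context window can be illustrated with a single convolution filter. This is a sketch under stated assumptions: zero padding of $c$ positions on each side and a ReLU activation are choices made for the example, and the function name `conv1d_same` is hypothetical.

```python
def conv1d_same(seq, kernel, bias=0.0):
    """Apply one 1-D convolution filter over a sentence with zero padding.

    seq:    list of T token vectors, each of length d
    kernel: list of k = 2c + 1 weight vectors, each of length d
    Returns T activations, one per input position.
    """
    k, d = len(kernel), len(seq[0])
    assert k % 2 == 1, "kernel size must be odd: k = 2c + 1"
    c = (k - 1) // 2
    pad = [[0.0] * d for _ in range(c)]  # assumed zero padding
    padded = pad + seq + pad
    out = []
    for i in range(len(seq)):
        window = padded[i:i + k]  # tokens i - c .. i + c of the original sentence
        total = bias + sum(w * v
                           for w_vec, tok in zip(kernel, window)
                           for w, v in zip(w_vec, tok))
        out.append(max(0.0, total))  # assumed ReLU activation
    return out


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]  # T = 5
kernel_k3 = [[1.0, 1.0]] * 3  # k = 3, c = 1
kernel_k5 = [[1.0, 1.0]] * 5  # k = 5, c = 2
out_k3 = conv1d_same(sentence, kernel_k3)
out_k5 = conv1d_same(sentence, kernel_k5)
```

Both outputs have five entries, one per input token, regardless of the kernel size.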
Each filter extracts valid information about a position $i$ in a sentence. Given the kernel size, a filter extracts information about the $i$-th word and its context within a window of length $2c$. After the CNN layers, we obtain output vectors that correspond one to one with the input vectors. For the first CNN layer ($l = 1$), we use two different filter sizes. For the remaining three CNN layers, we use only one filter size. We explain the experimental parameter settings in detail in the hyperparameter section. The results of the CNN layers are processed by the output layer, and the calculated results represent the label distribution for each position. The output layer contains a fully connected layer, the softmax function, and dropout.
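The last step, turning a per-position label distribution into extracted terms, can be sketched as follows. The example assumes greedy decoding (the most probable label at each position); the sentence, the logits, and the function name `decode_terms` are hypothetical, and dropout is omitted because it is active only during training.

```python
import math

LABELS = ["B", "I", "O"]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_terms(tokens, logits_per_token):
    """Pick the most probable label per token and collect B/I spans as terms."""
    labels = []
    for logits in logits_per_token:
        probs = softmax(logits)
        labels.append(LABELS[probs.index(max(probs))])
    terms, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                terms.append(" ".join(current))
            current = [token]  # a new term starts here
        elif label == "I" and current:
            current.append(token)  # continue the open term
        else:
            # "O", or an "I" with no open term, closes any open term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return labels, terms


tokens = ["The", "battery", "life", "is", "great"]
logits = [
    [0.1, 0.0, 2.0],
    [2.5, 0.2, 0.1],
    [0.3, 2.2, 0.4],
    [0.0, 0.1, 3.0],
    [0.2, 0.1, 2.4],
]
labels, terms = decode_terms(tokens, logits)
```

With these hypothetical logits, the two tokens labeled B and I are joined into one multiword term.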

Datasets.
In our ATE experiment, we used two datasets provided by the SemEval ABSA challenges [2,24] to evaluate our model. Table 1 shows the details of the two benchmark datasets, including the number of sentences and aspects. These two datasets are from subtask 1 of SemEval-2014 Task 4 and subtask 1 of SemEval-2016 Task 5, respectively.
We use two datasets for the BioNER experiment: NCBI-disease [25] and BC2GM [26]. Table 2 shows the details of these two datasets, including the number of sentences and entities.
We used 300D GloVe 840B [27] as the initial word embedding of the model. The GloVe word vectors were pretrained on a large amount of text data by unsupervised methods. In addition, we used domain embeddings [28] as a supplement. The out-of-vocabulary (OOV) words in the training set are randomly initialized from a uniform distribution over (−0.05, 0.05). The word embeddings are fine-tuned during the training phase.
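The embedding initialization can be sketched as follows. This is a minimal illustration with a toy three-dimensional lookup table standing in for the pretrained vectors; the function name `build_embedding_matrix` and the toy vocabulary are hypothetical.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Copy pretrained vectors; draw OOV vectors uniformly from [-0.05, 0.05]."""
    rng = random.Random(seed)
    matrix, oov = [], []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            oov.append(word)
        matrix.append(list(vector))
    return matrix, oov


# Toy stand-in for a pretrained table; real GloVe vectors have 300 dimensions.
toy_pretrained = {"screen": [0.1, -0.2, 0.3], "system": [0.0, 0.4, -0.1]}
vocab = ["screen", "system", "touchpad"]
matrix, oov = build_embedding_matrix(vocab, toy_pretrained, dim=3)
```

The resulting matrix would then serve as the initial value of a trainable embedding layer, since the embeddings are fine-tuned during training.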

Baseline Methods.
We compare MDSA-CNN with the following baselines using the standard evaluation of SemEval-2014 Task 4 and SemEval-2016 Task 5:
CRF uses a conditional random field to process the GloVe word vectors [27]. No other neural network structure is involved in training.
IHS RD [5] and NLANGP [29] represent the winners of the original competitions [2,24].
WDEmb [30] feeds word embeddings, linear context embeddings, and dependency path embeddings to a CRF.
LSTM [15,31] represents the most basic Bi-LSTM model.
CNN [28] uses the basic CNN model to process GloVe word vectors for aspect term extraction. This model is included to show that the attention mechanism is important.
Bi-LSTM-CNN-CRF [32] is the mainstream method in named entity recognition (NER). We use this method to show that there is still room for optimization when the baseline method is applied to a special field.
RNCRF [33] uses a recurrent neural network built on a dependency tree to solve the problem of joint extraction of aspect and opinion terms. In addition, a CRF and manually extracted features are also used in the model.
CMLA [18] uses a multilayer coupled-attention network to complete the coextraction of aspect and opinion terms.
MIN [31] uses a multitask learning method. The framework is divided into two parts: the first part completes the extraction of aspect and opinion terms, and the second part determines whether the sentence is inductive.
DECNN [28] is a CNN model employing two types of pretrained embeddings for aspect extraction: general-purpose embeddings and domain-specific embeddings.
Benchmark [21] is the source of the benchmark performance scores for the NCBI-disease dataset.
Multitask learning approach [34] uses multitask methods to train on multiple medical datasets simultaneously.

Hyperparameter.
We set the dimension of attention to 2. The first CNN layer consists of 128 filters with kernel size $k = 3$ and 128 filters with kernel size $k = 5$. The kernel size $k$ of the next three CNN layers is 5. Each CNN layer is composed of 256 filters. For the window size $c$, when the kernel size is $k = 3$, $c = 1$, and when $k = 5$, $c = 2$. We set the dropout rate to 0.55. According to the characteristics of CNN training, we set the learning rate to 0.0001. We use the Adam optimizer [35] to optimize the model.
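These settings determine how much context the CNN stack alone can see at one position. The following sketch works through the arithmetic, assuming stride 1, no dilation, and that the two first-layer branches are concatenated along the channel axis; none of these three points is stated above, and the constant names are hypothetical.

```python
# Kernel sizes per layer; layer 1 has two branches (k = 3 and k = 5).
LAYER_KERNELS = [(3, 5), (5,), (5,), (5,)]
# Filters per kernel size in each layer, as listed in the text.
FILTERS_PER_KERNEL = [128, 256, 256, 256]


def window(k):
    """Window size c on each side of the center word for kernel size k = 2c + 1."""
    return (k - 1) // 2


def receptive_field(layer_kernels):
    """Tokens visible to one output position, using the widest branch per layer."""
    return 1 + 2 * sum(window(max(kernels)) for kernels in layer_kernels)


channels = [len(kernels) * f for kernels, f in zip(LAYER_KERNELS, FILTERS_PER_KERNEL)]
field = receptive_field(LAYER_KERNELS)
```

Under these assumptions every layer outputs 256 channels, and one output position covers 17 tokens (the word itself plus 8 on each side). Context beyond that range has to reach the word through the attention update.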

Results and Analysis
In Table 3, we find that MDSA-CNN achieves state-of-the-art results on the laptop dataset. By comparing with DECNN, we find that the introduction of multidimensional attention improves performance. We believe that the self-attention mechanism makes up for the CNN's inability to obtain long-distance entity word features and helps the CNN model capture more contextual information. In addition, the multidimensional attention mechanism ensures that the model can fuse effective features from different angles. We noticed that, after the multidimensional attention mechanism, the word embedding of the target changes significantly. The change in the word embedding comes from the emotional word vectors in the current sentence. For example, for the sentence "Super light, super sexy and everything just works," we calculated the MSE distance between the word embeddings of the entity "works" before and after the multidimensional attention mechanism and found a large difference. At the same time, we noticed that on the restaurant dataset, our results are slightly behind those of the DECNN model. We think the reason is that the restaurant dataset contains relatively few sentences and our model is more complex than DECNN, which increases the risk of overfitting. The results in Table 4 show that our multidimensional self-attention mechanism also obtains promising results on medical datasets: on both the NCBI-disease and BC2GM datasets, our model exceeds the benchmark results.
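The embedding comparison mentioned above for the word "works" reduces to a mean squared error between two vectors. The following sketch shows the computation; the two four-dimensional vectors are hypothetical values chosen for illustration and are not taken from the experiments.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# Hypothetical embeddings of one word before and after the attention update.
before = [0.20, -0.10, 0.40, 0.00]
after = [0.50, 0.30, -0.20, 0.10]
shift = mse(before, after)
```

A value of zero would mean that attention left the embedding unchanged, so larger values indicate that more context was mixed into the word vector.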

Conclusions
To solve the NER task in a specific field, we designed the MDSA-CNN network. This method introduces a multidimensional self-attention mechanism, which can obtain more comprehensive features of each word in a sentence. Because self-attention can update the features of the current word from any word in the sentence, this method does not depend on the distance between words. For NER in a specific field, we selected the ATE task in the online review field and the BioNER task in the medical field. These are two areas with large differences. We conducted experiments on both tasks. The results show that our model achieves gratifying results on both tasks, and on some datasets it achieves state-of-the-art results. This shows that our model is suitable for processing NER tasks in a specific field.

Data Availability
All data supporting this systematic review are from previously reported studies and datasets, which have been cited in the paper. The processed data are available from the first author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.

Mathematical Problems in Engineering