Nested Entity Recognition Fusing Span Relative Position and Region Information

Abstract: Span-based entity recognition methods currently focus on accurately identifying span (entity) boundaries, and they routinely ignore the relative position information of the span boundary and the information of the words inside the span region. This information can be used to improve entity recognition performance. Therefore, a nested entity recognition model that integrates the relative position information of the span and the region information within the span is proposed. The span representation is first obtained with a triaffine attention. Then, the relative position of the span boundary and the word information in the span region are fused with this span representation, and another triaffine attention produces a new label-level span representation. Finally, span (entity) recognition is carried out by a cooperative biaffine mechanism. Experiments were conducted on three public datasets: ACE2004, ACE2005 and GENIA. The proposed method achieved F1-scores of 87.66%, 86.86% and 80.90% on ACE2004, ACE2005 and GENIA, respectively. These experiments show that the method achieved state-of-the-art (SOTA) results. Moreover, the proposed model has fewer parameters, needs fewer resources, and has a lower time complexity than the existing triaffine mechanism model.


Introduction
Named Entity Recognition (NER) is a basic natural language processing task used to extract meaningful entities from text. Traditional named entity recognition, also known as Flat Named Entity Recognition (Flat NER), is usually regarded as a sequence labeling problem [1], and has been well studied. However, nested entities with multi-granular semantic information are commonly available in text in all kinds of professional fields [2]. For nested entities, a single word may have multiple labels. Therefore, it is difficult to solve this problem based on sequence labeling frameworks.
In recent years, a variety of methods for nested entity recognition have been proposed, including sequence labeling methods that assign a label to each token from a predesigned labeling scheme [3][4][5][6], hypergraph-based modeling methods for nested named entity tasks [7][8][9], sequence-to-sequence (Seq2Seq) methods that generate the starting position, span length, and label of every entity [10,11], and span-based methods that enumerate all possible entities in a sentence [12][13][14][15]. At present, the span-based method, which learns a representation of each candidate span and then classifies it, is one of the most widely used methods for nested entity recognition [12,15-19]. Among span-based methods, those that leverage large-scale pre-trained modules can achieve better results in nested entity recognition than other methods [18,19]. However, most span-based works focus on how to accurately identify the entity boundary. The relative position information of the span boundary and the information of the words in the span region are routinely ignored.
In addition, there is an increasing demand for the accurate identification of nested entities in various domains, such as healthcare, finance, and law. In the healthcare domain, for example, accurate recognition of entities such as diseases, medications, and treatment plans is crucial for clinical decision-making and disease monitoring. In the finance domain, precise identification of entities such as companies, individuals, and transactions aids in risk assessment and market intelligence analysis. In the legal field, accurate identification of nested entities is of significant importance for case analysis and legal research.
Therefore, improving the accuracy and efficiency of named entity recognition has become a focus of research in both academia and industry. Challenges in nested entity recognition, such as modeling nested entity structures, improving boundary identification accuracy, and leveraging span region information, remain pressing research problems.
To improve the performance of span-based methods, Yuan et al. [14] proposed a triaffine transformation; that is, on the basis of a biaffine transformation, the internal information of the span is represented in a third dimension. The internal region information of the span determined by the boundary is used as the span representation, as shown in Figure 1, where the triaffine transformation is used to recognize the entity "the secretary of homeland security". The left side of Figure 1 shows the span boundary represented by a biaffine transformation, in which the entity can only be represented by the embeddings of "the" and "security". The right side shows the third dimension of the triaffine transformation. Building on the biaffine transformation, the triaffine transformation takes the fusion of the span boundary and the internal information as the span representation (the embeddings of the entire phrase "the secretary of homeland security" are used to represent the entity).
In this study, a method based on a dual-triaffine attention mechanism, which integrates the relative position of the span boundary and the region information, is proposed for nested entity recognition. In the first stage, interactive learning between the span and the boundary is achieved through triaffine attention, resulting in a high-dimensional span representation. In the second stage, the span representation, relative position information, and region information are fused, and a new label-level span representation is obtained using triaffine attention, which captures the high-order interaction between the span boundary and the fused information.
Additionally, because entity boundary identification is important in nested entity recognition, we propose the collaborative prediction of span labels using both biaffine and triaffine attention. The main contributions of this paper are as follows: (1) a triaffine attention mechanism that integrates the relative position of the span boundary and the information within the span region; (2) a collaborative prediction model based on biaffine and triaffine attention, which uses a residual-like connection for span category prediction. Experimental results obtained using the ACE2004, ACE2005, and GENIA datasets demonstrate the effectiveness of the proposed dual-triaffine attention and of collaborative entity prediction using biaffine attention.
(1) Sequence labeling methods. Sequence-labeling-based methods usually view NER as a sequence labeling problem, assigning a tag to each token from a predesigned labeling scheme (such as BIO). Currently, most sequence-labeling-based methods combine CRF [20] with neural networks, such as CNN [21,22], Bi-LSTM [1], and Transformer [23,24]. However, it is difficult for them to recognize nested entities. Ju et al. [4] proposed a neural network model for nested named entity recognition by dynamically stacking flat named entity recognition layers. Fisher and Vlachos [5] accomplished this task through a merging and marking method. Wang et al. [3] enhanced this idea by applying a pyramid decoder. Shibuya and Hovy [6] explored the second-best path decoding method by excluding the influence of the best path.
(2) Hypergraph-based methods. Muis and Lu [7] first proposed a hypergraph model for nested NER by means of an index to represent possible entities. Subsequently, this method has been extensively explored [8,9], and extended to discontinuous NER models in Katiyar and Cardie's work [8] and hypergraph models in Wang and Lu's work [9].
(3) Sequence-to-sequence (Seq2Seq) methods. Gillick et al. [10] introduced the Seq2Seq model for named entity recognition (NER), in which the model takes a sentence as input and generates the starting position, span length, and label for all entities. Straková et al. [11] utilized the sequence-to-sequence architecture to improve the BILOU tagging scheme for nested entity recognition. More recently, Yan et al. [25] proposed a nested entity recognition method that uses a BART [26]-based Seq2Seq model with a pointer network to solve most NER problems in a unified framework, in which all possible sequences of entity start-end indexes and types are generated.
(4) Span-based methods. The span-based approach is widely used in nested named entity recognition (NER). This approach involves enumerating all possible spans and determining their validity as entities with corresponding types. Recent advancements in pre-trained models have made it easier to obtain span representations by concatenating boundary representation vectors [19]. These representations can then be classified using fully connected networks [9] or biaffine mechanisms [15]. Furthermore, additional features or supervision can be incorporated to enhance span-based methods. For instance, Zheng et al. [12] and Tan et al. [13] emphasized the significance of boundaries and introduced boundary detection tasks. Fu et al. [27] employed a TreeCRF method to capture the interaction between nested spans. Yuan et al. [14] proposed a triaffine mechanism that incorporates span and fused label information to calculate entity scores.
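The span enumeration step that span-based methods share can be illustrated with a minimal sketch. The function name `enumerate_spans` and the optional `max_len` cap are illustrative assumptions, not part of any cited system; spans are inclusive (start, end) token index pairs.

```python
def enumerate_spans(tokens, max_len=None):
    """Return every inclusive (start, end) index pair with start <= end.

    max_len is a hypothetical cap on span length in tokens; None means no cap.
    """
    n = len(tokens)
    spans = []
    for start in range(n):
        stop = n if max_len is None else min(n, start + max_len)
        for end in range(start, stop):
            spans.append((start, end))
    return spans


tokens = ["the", "secretary", "of", "homeland", "security"]
spans = enumerate_spans(tokens)          # all N * (N + 1) / 2 candidate spans
short_spans = enumerate_spans(tokens, 2)  # spans of at most two tokens
```

A sentence of N tokens yields N(N+1)/2 candidates, so both the outer entity (0, 4) and the nested entity (3, 4) are among the spans that a classifier then has to accept or reject.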

Word Pair Relationship Model
The word pair relationship model is constructed by modeling the relationship between boundary words and internal words. Li et al. [28] used the word pair relationship model to fuse relative position information and region information, and used dilated convolution over the two-dimensional word pair grid. The model achieved good performance, indicating that fusing relative position information and region information helps to improve nested entity recognition.

Collaborative Prediction
Collaborative prediction can be used to exploit local and global entity representations to jointly reason about relationships between entities at short and long distances. Li et al. [29] used a collaborative prediction method to extract relations, and the results show that the collaborative biaffine mechanism can enhance the prediction ability of a Multi-Layer Perceptron (MLP). Li et al. [28] used the collaborative prediction method for named entity recognition, which can improve the entity recognition performance. In this paper, the collaborative prediction method is used for entity classification.

Model
This paper proposes a nested entity recognition model that fuses span relative position and region information. The model is mainly composed of three parts: an encoding layer, a dual-triaffine attention mechanism layer, and a predictor layer, as shown in Figure 2. The output of the LSTM goes through two mappings to represent the beginning and the end of the span, which are then used as partial inputs to the biaffine and two triaffine attention modules (shown as dashed lines). The basic idea of the overall model is described as follows:
1. In the encoding layer, the pre-trained language model BERT and Bi-LSTM are used as an encoder to generate word representations containing contextual information from the input sentence.
2. In the dual-triaffine attention mechanism layer, the span representation is generated using one triaffine attention. Then, the obtained span representation is fused with the relative position information and region information. The other triaffine attention is used to interact with the span boundary and the fused information at a high order to obtain a new span representation, which provides a basis for the subsequent span classification.
3. In the predictor layer, all entity span scores are inferred jointly using the predicted values of the triaffine attention mechanisms and the predicted values of the biaffine mechanism.

Encoder Layer
BERT [30] is adopted as the feature extraction module in the encoder layer. Given a sentence $x = \{x_1, x_2, \ldots, x_N\}$, $x_i$ represents the $i$-th word in the sentence. The input to BERT is a combination of word embedding vectors, segmentation embedding vectors, and position encoding vectors. Because BERT operates on word pieces, each word in a sentence may be represented by the vectors of several pieces after the BERT calculation. To obtain word representations from the piece representations, we employ max pooling. To further improve context modeling, we follow previous studies and use a bi-directional LSTM [1] to generate the final word representations $H = \{h_1, h_2, \ldots, h_N\} \in \mathbb{R}^{N \times d_h}$ (Equation (1)), where $d_h$ denotes the dimension of a word representation and $N$ denotes the length of the sentence.
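The max pooling from piece representations to word representations can be sketched as follows. This is a minimal NumPy illustration, assuming that a list `word_ids` (a hypothetical name) maps each word piece to the index of the word it belongs to and that every word has at least one piece; it is not the authors' implementation.

```python
import numpy as np


def pool_word_pieces(piece_vectors, word_ids):
    """Element-wise max over the piece vectors that belong to the same word.

    piece_vectors: (num_pieces, dim) array of piece representations.
    word_ids: for each piece, the index of its word (assumed to cover every word).
    """
    n_words = max(word_ids) + 1
    dim = piece_vectors.shape[1]
    words = np.full((n_words, dim), -np.inf)
    for vec, w in zip(piece_vectors, word_ids):
        words[w] = np.maximum(words[w], vec)
    return words


# Toy input: word 0 is split into two pieces, word 1 is a single piece.
pieces = np.array([[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]])
word_ids = [0, 0, 1]
words = pool_word_pieces(pieces, word_ids)  # shape (2, 2)
```

The pooled matrix has one row per word, which is the shape the Bi-LSTM expects as input.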

Triaffine Transformation
The triaffine transformation takes three vectors $u, v, w \in \mathbb{R}^{d}$ and a parameter tensor $W \in \mathbb{R}^{(d+1) \times d \times (d+1)}$ as input and outputs a scalar. Distinct MLP (Multi-Layer Perceptron) transformations are applied to the input vectors, which are then multiplied with the tensor. The constant 1 is concatenated with the input vectors so that the biaffine transformation is preserved, as shown in Equations (2)-(4),
where $\mathrm{MLP}_t$ stands for a $t$-layer MLP. The tensor is initialized from a normal distribution. In the triaffine transformation, the span boundary representations are denoted as $u$ and $v$, the span internal representation as $w$, and the tensor in the triaffine attention as $W$.
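The tensor contraction at the core of the triaffine transformation can be sketched in NumPy. This is an illustration under stated assumptions: the inputs are taken to be already transformed by their MLPs, the constant 1 is appended to the two boundary vectors `u` and `v` only (which is what the tensor shape $(d+1) \times d \times (d+1)$ suggests), and the function name `triaffine` is hypothetical.

```python
import numpy as np


def triaffine(u, v, w, W):
    """Contract W of shape (d+1, d, d+1) with [u; 1], w, and [v; 1] into a scalar.

    u, v: boundary vectors of size d (assumed to be MLP outputs already).
    w: internal vector of size d.
    """
    u1 = np.append(u, 1.0)  # appending 1 keeps the lower-order (biaffine) terms
    v1 = np.append(v, 1.0)
    return np.einsum("a,b,c,abc->", u1, w, v1, W)


rng = np.random.default_rng(0)
d = 3
u, v, w = rng.normal(size=(3, d))
W = rng.normal(size=(d + 1, d, d + 1))  # normal initialization, as in the text
score = triaffine(u, v, w, W)
```

With the appended constants, the slice `W[d, :, d]` acts as a purely linear term in `w`, and the slices `W[:d, :, d]` and `W[d, :, :d]` give pairwise terms, so lower-order interactions are contained in the same tensor.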

Dual-Triaffine Attention Mechanisms
The dual-triaffine attention mechanism layer is divided into two stages. In the first stage, triaffine attention is used to obtain the span representation [14]. In the second stage, the obtained span representation is fused with the relative distance of the entity boundary and the region information within the span, and a second triaffine attention is used to generate the label-level span representation.
(1) Stage one: triaffine attention for span representation. The triaffine attention is used to realize the interaction between the entity span representation and the entity boundary. By introducing the third dimension, the triaffine attention $\alpha_{i,j,k,d_m}$ is used to calculate the entity span representation $h_{i,j,d_m}$ of the span $(i, j)$. Equations (5)-(7) denote the triaffine attention computation process. The span boundary $(h_i, h_j)$ and the embedding dimension parameters $(W_{d_m})$ are viewed as the attention queries (Q), and each word is viewed as a key (K) and a value (V). This attention mechanism allows Q and K to interact at a higher order than in general attention.
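The stage-one computation can be sketched as a masked softmax over the words inside each span. The sketch below is an assumption-laden illustration rather than the authors' code: it assumes one parameter tensor per output channel (`W` of shape (M, d+1, d, d+1), with M standing in for $d_m$), a softmax restricted to positions $i \le k \le j$, and a per-channel scalar value `values[k, m]` for each word; all names are hypothetical.

```python
import numpy as np


def triaffine_attention(head, tail, keys, values, W):
    """Return span_repr of shape (N, N, M) and attention alpha of shape (N, N, N, M).

    head, tail: (N, d) boundary representations of each word as span start / end.
    keys: (N, d) word representations used inside the triaffine score.
    values: (N, M) per-channel values to be averaged (an assumption of this sketch).
    W: (M, d+1, d, d+1) one triaffine tensor per channel.
    """
    n = head.shape[0]
    ones = np.ones((n, 1))
    head1 = np.concatenate([head, ones], axis=1)
    tail1 = np.concatenate([tail, ones], axis=1)
    # scores[i, j, k, m] = triaffine score of word k for span (i, j) in channel m
    scores = np.einsum("ia,kb,jc,mabc->ijkm", head1, keys, tail1, W)
    idx = np.arange(n)
    inside = (idx[None, None, :] >= idx[:, None, None]) & (idx[None, None, :] <= idx[None, :, None])
    mask = inside[..., None]  # True where i <= k <= j
    masked = np.where(mask, scores, -1e30)
    masked = masked - masked.max(axis=2, keepdims=True)  # numerical stability
    weights = np.exp(masked) * mask
    denom = weights.sum(axis=2, keepdims=True)
    # Invalid spans (i > j) have no inside words; their attention stays zero.
    alpha = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    span_repr = np.einsum("ijkm,km->ijm", alpha, values)
    return span_repr, alpha


rng = np.random.default_rng(0)
N, d, M = 4, 3, 2
head, tail, keys = rng.normal(size=(3, N, d))
values = rng.normal(size=(N, M))
W = rng.normal(size=(M, d + 1, d, d + 1))
span_repr, alpha = triaffine_attention(head, tail, keys, values, W)
```

Each valid span (i, j) receives a weighted average of the words it contains, with weights that depend jointly on both boundaries and on the word itself; a single-token span simply reproduces that token's value.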
(2) Stage two: fusing location information and region information for span classification. In order to enrich the information of the span grid, we introduce the relative position information of the span boundary and the region information within the span, as shown in Figure 3. The left side of the figure shows the relative position information of the word pair grid. In the span grid, whose rows are start positions and whose columns are end positions, the distance from the start position of an entity to its end position is the relative position information. In Figure 3, for the entity "the secretary of homeland security", the distance from "the" to "security" is 4. The right side of the figure shows the region information within the span. In this example, the entity "homeland security" is nested within the entity "the secretary of homeland security", so the region information is 2 in "homeland security" and 1 in the rest of the span. We map each number corresponding to the relative position information and region information to a 20-dimensional vector, represented respectively as $E_d \in \mathbb{R}^{N \times N \times d_{E_d}}$ and $E_r \in \mathbb{R}^{N \times N \times d_{E_r}}$, where $d_{E_d}$ and $d_{E_r}$ are both equal to 20. We also use $h_{i,j,d_m} \in \mathbb{R}^{N \times N \times d_m}$ as the entity span representation. These three tensors are concatenated along the third dimension to form the fused span representation $S_{i,j,d}$, as shown in Equation (8), where $d = d_m + d_{E_d} + d_{E_r}$, $d_m$ denotes the span representation dimension, $d_{E_d}$ denotes the span boundary relative position embedding dimension, and $d_{E_r}$ denotes the span region information embedding dimension.
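The fusion step can be sketched as two embedding lookups followed by a concatenation along the last axis. The sketch makes several assumptions that the text does not fix: the distance grid stores |j - i| for every cell, the region grid is supplied as integer ids (here a hypothetical layout with the value 2 on the cells covering "homeland security" and 1 elsewhere), and the embedding tables are random stand-ins for learned parameters.

```python
import numpy as np


def distance_grid(n):
    """dist[i, j] = |j - i|, the start-to-end distance of span (i, j)."""
    idx = np.arange(n)
    return np.abs(idx[None, :] - idx[:, None])


def fuse(span_repr, dist_ids, region_ids, dist_table, region_table):
    """Concatenate span representation, distance and region embeddings on the last axis."""
    e_d = dist_table[dist_ids]      # (N, N, 20)
    e_r = region_table[region_ids]  # (N, N, 20)
    return np.concatenate([span_repr, e_d, e_r], axis=-1)


rng = np.random.default_rng(0)
tokens = ["the", "secretary", "of", "homeland", "security"]
n, d_m, d_emb = len(tokens), 8, 20        # d_m is kept small for the example
span_repr = rng.normal(size=(n, n, d_m))  # stand-in for the stage-one output

dist_ids = distance_grid(n)
region_ids = np.ones((n, n), dtype=int)
region_ids[3:5, 3:5] = 2                  # assumed layout for the nested entity

dist_table = rng.normal(size=(n, d_emb))    # one row per possible distance
region_table = rng.normal(size=(3, d_emb))  # rows for region ids 0, 1, 2
fused = fuse(span_repr, dist_ids, region_ids, dist_table, region_table)
```

The cell for the span from "the" to "security" carries distance id 4, and the fused tensor has last dimension d_m + 20 + 20, matching $d = d_m + d_{E_d} + d_{E_r}$.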
In the second stage, the output of the LSTM is mapped to the span boundary representations $h_i$ and $h_j$ again. Then, together with the fused span representation $S_{i,j,d}$ built from the first-stage output, the label-level span representation is obtained using the triaffine attention, as shown in Equation (9). Similarly to the procedure described in the first stage, the span boundary and the label are used as the queries (Q) of the attention mechanism. The fused span representation serves as the keys (K) and values (V) to obtain the label-level span representation $p_{i,j,d}$.

Predictor Layer
After the label-level span representation is obtained, we use the label-level span representation output by the second stage as one predictor and the biaffine mechanism as the other to compute two independent label distributions of the span $(x_i, x_j)$, where $x_i$ and $x_j$ represent the head and tail of the span, respectively. The results of these calculations are combined as the final prediction score.

Span Prediction
Based on the label-level span representation $p_{i,j,d}$ obtained by the triaffine attention mechanism, we use an MLP to calculate the score of the span from the $i$-th word to the $j$-th word, as shown in Equation (10).

Biaffine Prediction
The input to the biaffine predictor is the output of the encoding layer. For a given sentence of length $N$, two MLPs are used to compute the head $s_i$ and tail $t_j$ representations of the span (entity), and then a biaffine classifier is used to compute the entity label score for this pair of head and tail $(x_i, x_j)$, as shown in Equations (11)-(13), where $U$, $W$, and $b$ are learnable parameters, $s_i$ and $t_j$ denote the head and tail representations of the span $(x_i, x_j)$, and $y_{i,j}$ represents the label score. The final span label probability is obtained by combining the scores of the span predictor and the biaffine predictor, as shown in Equation (14).
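The biaffine predictor and the combination of the two predictors can be sketched as follows. The sketch assumes the common biaffine form $s_i^{\top} U t_j + W[s_i; t_j] + b$ with one slice of `U` per label, and it assumes that the two score tables are added before a softmax over labels; both are assumptions about the form of Equations (11)-(14), and all function names are hypothetical.

```python
import numpy as np


def biaffine_scores(s, t, U, W, b):
    """Label scores for every (head i, tail j) pair.

    s, t: (N, d) head and tail representations (assumed to be MLP outputs).
    U: (L, d, d) bilinear weights, W: (L, 2d) linear weights, b: (L,) bias.
    Returns an (N, N, L) score table.
    """
    d = s.shape[1]
    bilinear = np.einsum("ia,lab,jb->ijl", s, U, t)
    linear = (s @ W[:, :d].T)[:, None, :] + (t @ W[:, d:].T)[None, :, :]
    return bilinear + linear + b


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def combine(span_scores, biaffine):
    """Assumed combination: add the two score tables, then normalize over labels."""
    return softmax(span_scores + biaffine, axis=-1)


rng = np.random.default_rng(0)
N, d, L = 4, 3, 5
s, t = rng.normal(size=(2, N, d))
U = rng.normal(size=(L, d, d))
W = rng.normal(size=(L, 2 * d))
b = rng.normal(size=L)
span_scores = rng.normal(size=(N, N, L))  # stand-in for the span predictor output
y_biaffine = biaffine_scores(s, t, U, W, b)
probs = combine(span_scores, y_biaffine)
```

Adding the biaffine scores to the span predictor scores behaves like a residual path: the boundary-only evidence is always available to the final decision, even when the attention-based representation is uninformative.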

Loss Function
For a sentence, the training objective is to minimize the negative log-likelihood loss with respect to the corresponding gold labels, as formulated in Equation (15), where $N$ denotes the number of words in the sentence, $\hat{y}^{l}_{i,j}$ denotes the true label of the span $(x_i, x_j)$, $y^{l}_{i,j}$ denotes the predicted value, and $l$ denotes the $l$-th label in the label set $L$. Because only the upper triangle of the span table is valid, the loss is computed over the upper triangle only.
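Restricting the loss to the upper triangle of the span table can be sketched as below. The normalization is an assumption: this sketch averages over the valid cells, which may differ from the constant used in Equation (15); the function name `span_nll` is hypothetical.

```python
import numpy as np


def span_nll(probs, gold):
    """Negative log-likelihood averaged over spans with start <= end.

    probs: (N, N, L) predicted label distribution per span.
    gold: (N, N) integer gold label id per span.
    """
    n = gold.shape[0]
    rows, cols = np.triu_indices(n)  # cells of the upper triangle, diagonal included
    picked = probs[rows, cols, gold[rows, cols]]
    return float(-np.log(picked).mean())


n, num_labels = 3, 4
probs = np.full((n, n, num_labels), 1.0 / num_labels)  # uniform predictions
gold = np.zeros((n, n), dtype=int)
loss = span_nll(probs, gold)

# Changing a lower-triangle cell (start > end) leaves the loss untouched.
probs_changed = probs.copy()
probs_changed[2, 0] = [0.97, 0.01, 0.01, 0.01]
loss_changed = span_nll(probs_changed, gold)
```

With uniform predictions over four labels, the loss equals log 4, and it is unaffected by cells below the diagonal, which correspond to spans whose end precedes their start.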

Dataset
We conducted experiments using the ACE2004, ACE2005, and GENIA datasets. The ACE2004 and ACE2005 datasets are divided into train/test/validation subsets with a ratio of 8:1:1 according to the method proposed by Shibuya and Hovy [6]. The GENIA dataset is divided into train/test subsets with an 8:2 ratio according to the method proposed by Yuan et al. [14]. To further verify the performance of the model, we also conducted experiments on the flat entity dataset Resume [31]. Precision (P), recall (R), and F1 score are used to measure model performance. The numbers of entities in the above four datasets are shown in Table 1.
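The three metrics can be illustrated with a short sketch. It assumes the usual exact-match convention, in which a predicted entity counts as correct only if its start, end, and type all match a gold entity; the example entities and their labels are hypothetical.

```python
def prf1(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities for a five-token sentence.
gold = [(0, 4, "PER"), (3, 4, "ORG")]
predicted = [(0, 4, "PER"), (1, 1, "PER")]
precision, recall, f1 = prf1(predicted, gold)
```

Because nested entities are separate triples, missing the inner entity lowers recall even when the outer entity is recognized correctly.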

Experiment Details
For the ACE2004 and ACE2005 datasets, we use BERT-large-uncased as the contextual embedding method with a learning rate of $1 \times 10^{-5}$. For the GENIA dataset, we use Bio-BERT-v1.1 as the contextual embedding method with a learning rate of $1 \times 10^{-6}$. A BiLSTM with a hidden size of 1024 is used for the token representations on the ACE2004 and ACE2005 datasets, and a hidden size of 512 is used on the GENIA dataset. The dimension of the triaffine span representation is 512. The distance embedding dimension $d_{E_d}$ is 20, the region information embedding dimension $d_{E_r}$ is 20, the biaffine dimension of the predictor layer is 1024, and the learning rate of the remaining model parameters is [...]. The parameters are mainly determined by the hidden layer dimension of the pre-trained model and the content of the dataset. The ACE datasets are publicly available. The hidden layer dimension of the pre-trained model widely used for them is 1024, so the hidden layer dimension of the LSTM layer is also relatively large. The GENIA dataset is a specialized biomedical corpus. The dataset is small and the hidden layer dimension of the pre-trained model is 768, so the hidden layer dimension of the LSTM layer is relatively small, as shown in Table 2.

Baseline Approaches
Nine baseline approaches from previous studies are considered. Yuan et al. [14] proposed a triaffine mechanism that calculates the score of entities based on span and label information. Yu et al. [15] utilized a biaffine function to determine the entity boundary and performed classification on the identified span. Xia et al. [16] developed a detector to screen candidate entity spans and used a classifier for classification purposes. Luan et al. [19] employed a multi-task learning approach to extract entities and relations simultaneously. Fu et al. [27] treated entities as nodes in a constituency tree and used a masked inside algorithm for decoding. Straková et al. [11] utilized a sequence-to-sequence approach for nested entity extraction. Wang et al. [3] introduced a pyramid layer and an inverse pyramid layer for decoding nested entities. Shibuya and Hovy [6] applied multiple CRFs in an outside-to-inside layer-wise manner for nested entity extraction. Tan et al. [13] proposed a boundary-enhanced neural span classification model and introduced an additional boundary detection method based on span classification to predict entity boundaries.

Experiments on the Nested Entity Datasets
The results on the three nested entity datasets, ACE2004, ACE2005 and GENIA, are presented in Table 3. Compared with the nine baseline approaches, our model achieves the second-best result on the GENIA dataset and the best results on the other two datasets. The F1 values of our model on the ACE2004 and ACE2005 datasets reach 87.66% and 86.86%, respectively; these results are better than those of other span-based methods [14-16,19] and the Seq2Seq method [11]. Compared with the best previous method, that of Yuan et al. [14], the F1 values are increased by 0.26 and 0.04 percentage points, respectively.
For the GENIA dataset, the F1 value of our model is 0.33 percentage points lower than that of Yuan et al.'s model [14]. In their model, when word vectors are generated, the target sentence and its previous and next sentences are fed into BERT and Bi-LSTM together to obtain better contextual embeddings. This benefits the word vector representation of biomedical corpora such as GENIA. However, the number of parameters in their model is large, and the training time is long.
For a fair comparison with Yuan et al.'s model [14], we discard the previous and next sentences of the target sentence when generating the contextual embedding and use only the target sentence as input. Table 4 shows the results of the comparison in terms of F1 score, number of parameters, and training time on the GENIA and ACE2004 datasets. Under the same conditions, the F1 score of our model is higher than that of Yuan et al.'s model, while the training time is shorter and the number of parameters is smaller. The specific parameters are listed in Table 2.

Experiments on the Flat Entity Datasets
In order to verify the performance of our model, we also conducted experiments on the flat dataset Resume. Based on the results given in Table 5, our model achieves a higher performance than most current studies. Compared with Li et al.'s model [28], our model achieves a lower F1 score, which we attribute to the high entity representation dimension of our triaffine attention mechanism. Flat entities are distributed more sparsely than nested entities, which produces sparse entity features and results in a performance decline.

Ablation Experiments
Ablation experiments were carried out on the ACE2005 dataset. The results are shown in Table 6. When the dual-triaffine attention mechanism layer is removed and the biaffine prediction is applied directly to the context embedding obtained by BERT (case 1), the performance is reduced by 1.21%. When both the relative position information embedding and the region information embedding are removed, so that the model contains only the first stage of span representation (case 2), the performance is decreased by 0.96%. When only the relative position information embedding of the span boundary is removed (case 3), the performance is decreased by 0.58%. When only the information embedding of the region within the span is removed (case 4), the performance is decreased by 0.4%.

Discussion
In this section, we discuss our model using a specific example sentence. In the sentence "a 'new york times' reporter lends cook his mobile sat phone", the phrase "a 'new york times' reporter" belongs to the person type, and "new york times" belongs to the organization type. Taking "a 'new york times' reporter" as an example, the biaffine module makes judgments based on the word embedding information of "a" and "reporter", while the triaffine module incorporates the word embedding information of the entire span "a 'new york times' reporter" into the word pair table. The relative position information of this entity is 6. The words "a" and "reporter" are within one entity, with a region information value of 1. The words "new", "york", and "times" are within two entities, with a region information value of 2. Distance information and region information are mapped to 20-dimensional vectors and incorporated into the model calculation. In this way, the model can learn the boundary information of spans of any length and the information within the spans. By considering multiple dimensions of information, the model can make more accurate predictions for entity recognition.
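The numbers in this example can be reproduced with a short sketch. It assumes that the quotation marks are separate tokens (which is what a distance of 6 implies) and that the region value of a word is the number of gold entities covering it; the helper `token_depths` is a hypothetical name, and the labels "PER" and "ORG" stand for the person and organization types.

```python
def token_depths(n_tokens, entities):
    """Count, for each token, how many entities (inclusive spans) cover it."""
    depths = [0] * n_tokens
    for start, end, _label in entities:
        for k in range(start, end + 1):
            depths[k] += 1
    return depths


tokens = ["a", "'", "new", "york", "times", "'", "reporter"]
entities = [(0, 6, "PER"), (2, 4, "ORG")]

depths = token_depths(len(tokens), entities)
outer_start, outer_end, _ = entities[0]
distance = outer_end - outer_start  # relative position of the outer entity
```

The outer entity spans seven tokens, so its start-to-end distance is 6, and the three words of the nested organization are the only ones covered by two entities.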
We have also learned some lessons from the experiments. More complex models may have stronger fitting capabilities, but they may also be more prone to over-fitting or require more computational resources. This could impose limitations on computational efficiency, model deployment, and scalability in practical applications. Additionally, when the model encounters data from specific professional domains, the performance may not be ideal.

Conclusions
This paper proposes a span-based nested entity recognition model. The model encodes the input with a pre-trained module and an LSTM, and incorporates a triaffine attention mechanism for entity span representation. The span representation is then fused with relative position and region information, and another triaffine attention is applied to generate the label-level span representation. The outputs of the triaffine attention and biaffine mechanisms are jointly used for predicting and classifying entity spans. Experimental results demonstrate that our model achieves state-of-the-art performance on three widely used datasets. Ablation experiments indicate that the dual-triaffine attention, which integrates relative position and region information, enhances nested entity recognition. Additionally, we recognize the limitations of the model's generalization capability and the trade-off between model complexity and performance. In the future, we will continue to optimize the model to improve its performance and explore the application of the triaffine transformation to discontinuous entity datasets.

Conflicts of Interest:
The authors declare no conflict of interest.