Full-span named entity recognition with boundary regression

Span classification is a popular method for nested named entity recognition. To recognise full-span named entities, span-based models should enumerate and verify all possible entity spans in a sentence, which leads to serious problems regarding computational complexity and data imbalance. In this study, we propose a boundary regression model to support full-span named entity recognition, where a regression operation is adopted to refine the spatial locations of entity spans in a sentence. Therefore, instead of exhaustively enumerating all possible spans, we need only verify a small number of them. Span boundaries are regressed to find all possible named entities in a sentence. Furthermore, for a better representation of long named entities, a multi-granular sentence representation is adopted to encode semantic features with different semantic granularities. In our experiments, even when enumerating only a small number of entity spans, our model still achieves competitive performance, with F1 scores of 87.35% and 80.85% on the ACE2005 and GENIA datasets, respectively. Analytical experiments show that our model is able to find all named entities in a sentence without exhaustively verifying all possible entity spans. It is effective in mitigating the computational complexity and data imbalance problems in full-span named entity recognition.


Introduction
A named entity is defined as a word or a phrase in a sentence that refers to an object in the world. From the perspective of natural language understanding, named entities are the most basic linguistic units of a sentence. Recognising them is the key to understanding a sentence. The task was first introduced at the sixth Message Understanding Conference (MUC-6) as a subtask of information extraction (Grishman & Sundheim, 1996). As a fundamental task, it supports a wide range of applications, e.g. knowledge graph construction (Al-Moslmi et al., 2020), machine translation (Hu et al., 2022), sentence parsing (Yu et al., 2020), question answering (Longpre et al., 2021), and so forth. Furthermore, named entities comprise the main part of out-of-vocabulary words (or new words), which are usually regarded as a considerable obstacle to automatically processing natural language. Therefore, techniques for named entity recognition also have important theoretical impacts and applications in natural language processing.
The task of recognising named entities is often formalised as a token labelling process, where each token is given a label to indicate its semantic role in a named entity, e.g. 'B' (Beginning), 'I' (Inside) or 'O' (Outside). Token labelling usually adopts a sequence algorithm to find the highest-scoring label sequence for each sentence. The main problem is that it is difficult to recognise nested named entities, in which a token may have different labels. For example, 'Guizhou University' is an organisation name, whereas 'Guizhou' is also a location name indicating the location of the university. Because nested structures are effective in expressing the semantic meaning of a named entity, they are widely used in natural languages. For example, in the ACE2005 and GENIA corpora (Doddington et al., 2004; Kim et al., 2003), 33.90% and 35.27% of named entities are mutually overlapping. Therefore, recognising nested named entities supports finer-grained semantic extraction. In natural language processing, the task has attracted considerable attention in recent years.
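The limitation of one-label-per-token schemes can be illustrated with a minimal sketch. The function name `spans_to_bio` and the 0-based, end-exclusive span format are assumptions made for this illustration only; they are not part of any model described here.

```python
def spans_to_bio(tokens, entities):
    """Convert (start, end, type) spans (0-based, end exclusive) to B/I/O labels.

    Each token can hold only one label, so any span that overlaps an
    already-labelled token is reported as a conflict instead of being encoded.
    """
    labels = ["O"] * len(tokens)
    conflicts = []
    for start, end, etype in entities:
        for i in range(start, end):
            tag = ("B-" if i == start else "I-") + etype
            if labels[i] != "O":
                conflicts.append((i, labels[i], tag))
            else:
                labels[i] = tag
    return labels, conflicts


tokens = ["Guizhou", "University"]
entities = [(0, 2, "ORG"), (0, 1, "LOC")]  # the LOC span is nested in the ORG span
labels, conflicts = spans_to_bio(tokens, entities)
print(labels)     # ['B-ORG', 'I-ORG']
print(conflicts)  # [(0, 'B-ORG', 'B-LOC')]
```

The nested location 'Guizhou' cannot be stored alongside the organisation label, which is the difficulty that span-based methods are meant to avoid.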
Span classification is a popular method to support nested named entity recognition. The core of this method is to enumerate and verify all possible entity spans in a sentence. It has the advantages of resolving nested structures and making full use of token features in a span. Span-based models can be implemented in a pipeline framework or an end-to-end framework. In a pipeline framework, the recognition task is divided into the two steps of enumerating and verifying. For example, Xu and Jiang (2016) and Sohrab and Miwa (2018) first enumerate every entity span up to a certain length, then verify them with an independent classifier. In an end-to-end framework, the steps of enumerating and verifying are unified into a single model, e.g. Chen et al. (2022). Comparing the two, the pipeline framework has the advantage of filtering impossible entity spans by using prior knowledge. In contrast, the end-to-end framework can make a global adjustment in the training process and share model parameters in the bottom layers of a deep network.
Although great success has been achieved in span classification, exhaustively enumerating and verifying all possible named entities usually suffers from two problems. First, named entities in a sentence can be of any length (e.g. the longest named entity is 49 words in the ACE2005 corpus). For example, the phrase 'human T cell leukaemia virus type I ( HTLV-I ) trans-activator ( tax1 ) antigen' is annotated as a biomedical named entity with length 15 in the GENIA corpus. Verifying all possible entity spans in a sentence leads to massive computational complexity, especially in real applications dealing with large volumes of data. Second, a sentence often contains only a small number of named entities. Exhaustively enumerating all entity spans for classification therefore suffers from a serious data imbalance problem. In related works, there are two strategies to address these problems: limiting the length of named entities, or filtering unlikely entity spans.
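Both problems can be made concrete with a small counting sketch. The helper `enumerate_spans`, the six-token example sentence and its single gold entity are illustrative assumptions; spans are written as (start, length) with 1-based starts.

```python
def enumerate_spans(n_tokens, max_len=None):
    """Return every (start, length) span of a sentence, optionally length-capped."""
    max_len = n_tokens if max_len is None else min(max_len, n_tokens)
    return [(start, length)
            for start in range(1, n_tokens + 1)
            for length in range(1, max_len + 1)
            if start + length - 1 <= n_tokens]


tokens = "She is born in New York".split()
spans = enumerate_spans(len(tokens))
gold = {(5, 2)}  # hypothetical gold annotation: 'New York'
positives = [span for span in spans if span in gold]
print(len(spans), len(positives))   # 21 1

# The candidate count grows quadratically with sentence length.
print(len(enumerate_spans(50)))     # 1275
print(len(enumerate_spans(50, 6)))  # 285 when lengths are capped at 6
```

With N tokens there are N(N+1)/2 candidate spans, so a single positive among 21 candidates in a six-token sentence already shows the imbalance, and a 50-token sentence yields 1275 candidates.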
To avoid exhaustive enumeration, some span models limit the length of recognised named entities. This setting is based on the observation that named entities with lengths less than or equal to 6 cover 95% of the named entities in many corpora (as shown in Table 2). The limitation is a compromise between performance and computational complexity. It is less influential in applications such as information retrieval (Brandsen et al., 2022). However, it has a considerable impact on tasks such as sentence understanding or interactive dialogue, where a false named entity may lead to misunderstanding a whole sentence. Another strategy to reduce the computational complexity is to verify only the entity spans that are most likely. For example, Chen et al. (2019) only verified entity spans generated with entity boundary assembling. Tan et al. (2020) filtered unlikely entity spans with a predefined threshold. This strategy of filtering entity spans also suffers from two problems. First, manually designed rules are needed to filter entity spans, which may overfit the evaluation data. Second, if a true named entity is falsely discarded in the proposal process, it loses the opportunity to be recognised.
Our method to support full-span named entity recognition is based on the intuition that a regression operation can refine incorrect span boundaries to locate true named entities in a sentence. Therefore, instead of exhaustively enumerating and verifying all possible entity spans, we only need to collect a small number of entity spans in the span proposal process. Then, a boundary regression is adopted to refine their spatial locations for identifying all named entities in a sentence. Furthermore, to reduce the vanishing gradient problem in long named entity recognition, a multi-granularity sentence representation is proposed to generate spans with different granularities. Our method has three advantages in supporting full-span named entity recognition. First, verifying a small number of entity spans is effective in reducing the computational complexity. Second, the positions of false spans in a sentence can be refined to identify true named entities. This reduces the number of negative entity spans for training, which helps to resolve the data imbalance problem. Third, classifying spans with different semantic granularities can reduce the influence of the vanishing gradient problem in recognising long named entities. The contributions of this paper are summarised as follows.
(1) A boundary regression operation is adopted to support full-span named entity recognition. It has the ability to recover true named entities from falsely enumerated entity spans. It helps to reduce the computational complexity and data imbalance problems involved in full-span named entity recognition.
(2) A multi-granular semantic end-to-end boundary regression model is designed to address the vanishing gradient problem in full-span named entity recognition. It is effective in recognising long true named entities and in locating named entities far from falsely enumerated entity spans.
(3) Several issues relating to full-span named entity recognition are analysed in the experiment section, including methods to generate multiple feature map layers and to enumerate entity spans. These analyses provide an in-depth examination of full-span named entity recognition based on boundary regression.
The remainder of this paper is structured as follows. Section 2 presents related work in named entity recognition. The boundary regression is introduced in Section 3, where the architecture of our model to support full-span named entity recognition is presented. We describe the evaluation datasets and experiment results in Section 4. Our conclusions are given in Section 5.

Related work
Our model has four characteristics for named entity recognition. First, entity locations are used as supervision information to optimise the model in the training process. Second, a multi-granularity semantic sentence representation is adopted to learn span representations with different granularities. Third, our model is a span-based deep architecture to support nested named entity recognition. Fourth, instead of exhaustively verifying all entity spans, we only verify a small number of entity spans. According to these characteristics, we roughly divide related works into four categories: learning semantic features, learning semantic dependencies, resolving nested structures and reducing redundant entity spans. In the following, related works in each category are discussed in detail.

Learning semantic features
In named entity recognition, two types of features are widely used: manually designed features and automatically learned features. In feature-based models, manually designed features such as lexical features (e.g. words, n-grams, phrases or chunks) are very important for named entity recognition (Wang et al., 2022). External resources are also helpful, e.g. a thesaurus, gazetteer, Wikipedia, or external content (Wang et al., 2021). Automatically learned features are extracted from the original input in deep neural networks (Minaee et al., 2021). For example, Yuan et al. (2022) proposed a triaffine mechanism to integrate heterogeneous features. Jaiswal et al. (2021) and Zhang et al. (2022) presented capsule networks for named entity recognition, which group neurons into capsules to detect specific features of an entity. In deep networks, pre-trained word embeddings are widely used to encode semantic features from external resources, e.g. BERT (Devlin et al., 2019). Zhang et al. (2022) and Yan et al. (2021) used pre-trained generation models to support end-to-end nested entity recognition. They are effective at encoding prior knowledge from external resources. In low-resource NER, Zhou et al. (2022) proposed a masked entity language model for data augmentation, which provides rich entity regularity knowledge. Li et al. (2020) presented a meta-learning approach for few-shot NER, which made full use of language models by separating the entire network into task-independent and task-specific parts.
In summary, these methods mainly depend on manually annotated entity types as supervision information to weight semantic features in the training process. In annotated corpora, in addition to entity types, entity locations are also annotated to indicate their spatial positions in a sentence. This information can be used to train a boundary regression module to find the location offset of a span relative to a true named entity. Our model is an end-to-end framework, which simultaneously predicts the classification score of an entity span and refines its spatial location in a sentence. Entity classification and boundary regression can share model parameters in the bottom layers of a neural network to help the model learn semantic features from the training data.

Learning semantic dependencies
Because words in a sentence are grammatically linked, learning the semantic dependencies of a sentence is important for named entity recognition. Sequence algorithms assume a Markov dependency between words and are effective at learning the semantic dependencies of a sentence. For example, Jie and Lu (2019) proposed a dependency-guided LSTM-CRF model to encode the semantic dependencies of a sentence. Among neural networks, bidirectional long short-term memory (Bi-LSTM) models are popular for named entity recognition (Chiu & Nichols, 2016). In sequence models, Zhong et al. (2020) proposed a component-based labelling scheme to overcome the problem of inconsistent label allocation. Li et al. (2021) incorporated meta-learning and adversarial training for sequence labelling, which is capable of adapting to new unseen domains. Wang et al. (2020) stacked sequential deep neural networks in a pyramid shape for nested named entity recognition. In addition, dependency trees are also widely used for capturing sentence structure information. For example, Jie et al. (2017) conducted experiments to analyse the structured information conveyed by dependency trees. Cetoli et al. (2017) implemented a Graph Convolutional Network (GCN) based on the dependency tree.
The main problems for these models are that sequence-based models usually suffer from the vanishing gradient problem, and that parsing a sentence into a dependency tree is error-prone and heavily dependent on external toolkits. In our model, a multi-granularity semantic sentence representation is adopted to encode the semantic features of a sentence. Because it compresses semantic features into dense representations, it is effective at learning contextual features of a sentence and encoding semantic dependencies between words that are far apart.

Resolving nested structures
To resolve nested structures in named entities, three strategies have been proposed to utilise sequence models: layering, cascading, and joint strategies. They use independent classifiers to recognise different types or layers of named entities (Alex et al., 2007). Some studies transform nested structures into flattened representations (e.g. hypergraphs; Lu & Roth, 2015) or use structured classification labels (Lample et al., 2016). Then, sequence models are applied to recognise nested named entities. In addition to sequence models, span classification is widely used to support nested named entity recognition. For example, Li et al. (2021) proposed a meta-learning method to recognise named entities through boundary detection. Sohrab and Miwa (2018) enumerated and verified all possible entity spans in a sentence. Wan et al. (2022) enhanced the span representation by using a span-level graph. Li et al. (2022) proposed a table filling method, which organised the spans of a sentence into a table. Yu et al. (2020) classified spans using a biaffine function between boundary representations.
In related works, the layering and cascading strategies do not make effective use of manually annotated data. They may also introduce false annotation labels when handling different entity types. Because transforming nested structures into flattened representations heavily depends on prior knowledge, migration between different domains is difficult. Our model is also a span-based model. Because we generate spans from a multi-granularity semantic sentence representation, these spans overlap one another and have different semantic granularities. They are effective at resolving the nested structure between named entities.

Reducing redundant entity spans
One problem for full-span named entity recognition is that all possible entity spans should be enumerated for classification, which leads to high computational complexity and serious data imbalance. In related works, many strategies have been proposed to reduce redundant entity spans. A direct strategy is to enumerate entity spans only up to a certain length. Another strategy to reduce the complexity is to filter entity spans with prior knowledge. In this respect, Tan et al. (2020) applied two token-level classifiers to identify the start boundary and end boundary of all named entities in a sentence. Then, start and end boundaries are combined for classification by another classifier. Li et al. (2020) and Shen et al. (2022) converted the entity recognition task into a machine reading comprehension task. They predicted the start and end positions of entities, respectively, and then classified them through an independent classifier. Lou et al. (2022) presented an entity head-aware method to enhance the performance. Zhu and Li (2022) proposed a boundary-smoothing strategy for span-based neural NER models. The boundary regression operation has also been used for nested entity recognition (Chen et al., 2022), where a feature map layer was used to generate entity spans up to a certain length.
In related works, enumerating entity spans up to a certain length is the most popular method for reducing computational complexity in span-based models. However, it cannot support full-span named entity recognition. Filtering entity spans with prior knowledge is usually implemented in a pipeline framework. It easily leads to cascading failures and cannot perform global model optimisation in the training process. In our model, we enumerate a small number of entity spans from a multi-granularity semantic sentence representation. This provides an effective approach to reduce the computational complexity and vanishing gradient problems in full-span named entity recognition.

Multi-granular BR model
The architecture of our model to support full-span named entity recognition is shown in Figure 1.
Figure 1. The architecture of our model. It is an end-to-end architecture, which can be divided into four modules: basic network, region proposal, boundary regressor and classifier. The sentence 'She is born in New York' is given as an example to show the flow path of recognising named entities. This example is introduced in detail as follows.
In Figure 1, the input sentence 'She is born in New York' contains a geographic (GPE) named entity 'New York'. A deep neural network (basic network) is first adopted to map the sentence into several representations (feature map layers) encoded with multi-granular semantics. Then, the region proposal module enumerates abstract named entity representations (textual boxes) from each feature map. The red box in the dotted line is a true named entity that was not enumerated. It is represented as a triple [5, 2, GPE], which indicates that it is a GPE starting from the 5th word with length 2. The blue box is a falsely enumerated entity span ('born in New') denoted as [3, 3, ?], where '?' means that the type of the box is unknown. The start and length offsets between the red box and the blue box are denoted as $s$ and $l$. Then, the blue textual box is fed into two boundary regressors and a classifier to predict the location offsets ($\tilde{s}$ and $\tilde{l}$) and entity type ($c$) simultaneously. In the output layer, after the offsets ($\tilde{s} = 2$ and $\tilde{l} = -1$) are learned, they enable us to locate the true named entity 'New York', which avoids the requirement to exhaustively enumerate and verify all entity spans. In the following, each module of our model is discussed in detail.
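The Figure 1 example can be traced with a minimal sketch that applies the two offsets to the falsely enumerated box. It uses the raw token-level offsets quoted in the example (start offset 2, length offset −1); the model itself regresses normalised offsets, and the helper names `apply_offsets` and `box_text` are assumptions for illustration.

```python
def apply_offsets(box, start_offset, length_offset):
    """Shift a (start, length) box by a start offset and a length offset."""
    start, length = box
    return (start + start_offset, length + length_offset)


def box_text(tokens, box):
    """Return the words covered by a (start, length) box; start is 1-based."""
    start, length = box
    return " ".join(tokens[start - 1:start - 1 + length])


tokens = "She is born in New York".split()
proposal = (3, 3)                          # the blue box [3, 3, ?]
print(box_text(tokens, proposal))          # born in New
refined = apply_offsets(proposal, 2, -1)
print(refined, box_text(tokens, refined))  # (5, 2) New York
```

A single enumerated box is therefore enough to reach the truth box [5, 2, GPE], provided the regressors predict the offsets correctly.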

Basic network
The basic network is a module used to map a sentence into an abstract representation. This is a deep neural network architecture which can be truncated from a standard architecture of a high-quality sentence labelling task (e.g. a named entity recognition task or a POS tagging task).
Let $T = [t_1, t_2, \ldots, t_N]$ represent an input sentence, where $t_i$ is a token that denotes a word. In our basic network, a token is mapped into an abstract representation $H_i$, which is composed of four segments. Each segment is discussed as follows.
(1) Every token $t_i$ is first transformed into a 1024-dimensional vector (referred to as $H^{bert}_i$) by a BERT Large network (Devlin et al., 2019). The BERT model is pretrained on external resources with unsupervised methods, and it is effective at encoding word semantic information.
(2) Each $t_i$ in $T$ is initialised and mapped to a 200-dimensional vector using pretrained word vectors, including GloVe (Pennington et al., 2014) and BioWord2Vec (Chiu et al., 2016). The output is fed into a Bi-LSTM layer, which transforms each input into a (2 × 512)-dimensional vector (referred to as $H^{rdm}_i$). The Bi-LSTM layer enables the basic network to learn semantic dependencies in a sentence. It helps to encode contextual features of named entities.
(3) To encode the syntactic features of a sentence, a POS encoding is also used in the embedding layer. In our experiments, the Stanford parser is adopted to generate the POS tag for each word in a sentence. Then, every POS tag is mapped into a 100-dimensional vector (referred to as $H^{pos}_i$). The POS embedding is also implemented by a randomly initialised lookup table.
(4) Because character embedding is effective at learning semantic features of unknown words, a character embedding is implemented on each token. Every character in a token is embedded into a 50-dimensional vector. Then, a Bi-LSTM is implemented, which outputs a (2 × 50)-dimensional vector for each word (referred to as $H^{char}_i$).
Finally, the output of the basic network is an abstract representation of a sentence. Each word is mapped onto a 2248-dimensional vector $H_i$. This is a concatenation of $H^{bert}_i$, $H^{rdm}_i$, $H^{pos}_i$ and $H^{char}_i$. The sentence is represented as
$$H = [H_1, H_2, \ldots, H_N]$$
where $H_i$ is an abstract representation of a token $t_i$, named a feature map. It may also be considered as an abstract representation of a named entity boundary. $H$ is referred to as a feature map layer or the first feature map layer. It has the same length $N$ as the input sentence.
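As a quick check of the dimensions above, the following sketch concatenates four placeholder segments of the stated sizes. The dictionary keys and the zero-filled vectors are illustrative assumptions; a real implementation would take these vectors from BERT, the Bi-LSTM layers and the POS lookup table.

```python
# Segment sizes as stated in the text: BERT, word Bi-LSTM, POS, character Bi-LSTM.
SEGMENT_DIMS = {
    "bert": 1024,
    "word_bilstm": 2 * 512,
    "pos": 100,
    "char_bilstm": 2 * 50,
}


def concat_segments(segments):
    """Concatenate the four per-token segments in a fixed order."""
    vector = []
    for name in SEGMENT_DIMS:
        vector.extend(segments[name])
    return vector


# Placeholder vectors standing in for the real encoder outputs.
token_segments = {name: [0.0] * dim for name, dim in SEGMENT_DIMS.items()}
token_vector = concat_segments(token_segments)
print(len(token_vector))  # 2248
```

The four segment sizes sum to 1024 + 1024 + 100 + 100 = 2248, matching the per-token dimension given in the text.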
As discussed in Section 1, one problem with boundary regression is that, owing to the vanishing gradient problem, it becomes difficult to find a true entity boundary when the enumerated boundary is far from it. Boundary regression is therefore relatively weak at finding true named entities that are long or far from the enumerated entity spans. To address this, we use a multi-granularity semantic representation: based on the first feature map layer, convolutional networks generate multi-granular feature maps, in which feature maps are compressed into dense representations with shorter lengths.
In our work, two strategies are proposed to generate multi-granular feature maps. The first strategy is to generate every layer from its previous layer by implementing a convolutional layer with a kernel size of 2. It is referred to as the 'stacked method'. The second is that all feature map layers are directly generated from the first layer by implementing convolutional layers with kernel sizes from 2 to 6, respectively. It is referred to as the 'parallel method'. The kth feature layer is represented as $H^k = [H^k_1, H^k_2, \ldots]$. In our experiment, six feature map layers are adopted, where k takes values from 1 to 6. In the kth feature map layer, a vector $H^k_i$ represents a k-gram abstract representation of the input. We conducted an experiment to compare the influence of the two strategies on the performance in Section 4.5. Because convolutional networks require fixed-length inputs, we set 50 as the length of sentences. Longer and shorter instances were trimmed or padded, respectively.
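The relation between the two strategies can be sketched by tracking which tokens each feature map covers, rather than by running a learned convolution. The sketch assumes stride 1 and no padding inside the convolutions, which the text does not specify; `pad_or_trim`, `stacked_layers` and `parallel_layers` are hypothetical helper names.

```python
def pad_or_trim(tokens, length=50, pad="<pad>"):
    """Force a token list to a fixed length by trimming or right-padding."""
    return (tokens + [pad] * length)[:length]


def stacked_layers(n_tokens, n_layers=6):
    """Token ranges covered when each layer applies a width-2 window to the previous layer."""
    layers = [[(i, i) for i in range(n_tokens)]]
    for _ in range(n_layers - 1):
        prev = layers[-1]
        layers.append([(prev[i][0], prev[i + 1][1]) for i in range(len(prev) - 1)])
    return layers


def parallel_layers(n_tokens, n_layers=6):
    """Token ranges covered when layer k applies a width-k window to the first layer."""
    return [[(i, i + k - 1) for i in range(n_tokens - k + 1)]
            for k in range(1, n_layers + 1)]


sentence = pad_or_trim("She is born in New York".split())
stacked = stacked_layers(len(sentence))
print([len(layer) for layer in stacked])          # [50, 49, 48, 47, 46, 45]
print(stacked == parallel_layers(len(sentence)))  # True
```

Under these assumptions both methods give the kth layer a k-gram receptive field; they differ in how the representation is computed (through k − 1 successive kernels or through one kernel of width k), not in which tokens are covered.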

Region proposal
Region proposal is a process of enumerating textual boxes (or entity spans) for classification. Before discussing the region proposal, we first provide some formalised definitions relating to textual boxes.
In our multi-granular BR model, six feature map layers are generated in the feature mapping process. The kth feature map layer is denoted as $H^k$. In the kth feature map layer, a vector $H^k_i$ represents a k-gram representation in the input $T$.
In this paper, a textual box is defined as a chunk of $H^k$, represented as $H^k_{i,j}$. This is a possible named entity representation of $T$. It is also denoted as a triple $d = [s, l, c]$, where the parameters $s$, $l$ and $c$ represent the start position, the length (or shape) and the type of a named entity. All textual boxes in $T$ are represented as a textual box set $D = \{d_1, d_2, \ldots\}$. If the position and the length of a textual box precisely match a true named entity, it is referred to as a truth box. The truth box set is denoted as $D_g = \{g_1, g_2, \ldots\}$. In a feature map layer, the strategy for generating textual boxes can be represented as a list $[x_1, x_2, \ldots, x_L]$ ($1 \le x_i \le M$), where $M$ is the maximal named entity length in an evaluation dataset. The list means that, for every feature map, we combine it with the neighbouring feature maps to generate textual boxes with lengths $[x_1, x_2, \ldots, x_L]$.
Given a feature map $H^k_i$ and a list $[x_1, x_2, \ldots, x_L]$, one textual box is generated for each length in the list. For example, let [1, 3] be a region proposal strategy. Then, for a feature map $H_i$, we generate two textual boxes as $H_{i,i}$ and $H_{i,i+3}$.
To show the ability of boundary regression to locate named entities from a sentence, in this study, we provide three strategies designed to implement the region proposal for full-span named entity recognition.
(1) Exhaustive enumeration: enumerating every possible entity span in a sentence. This strategy can also be denoted as a list [1, 2, 3, . . ., M]. It is difficult to adopt in real applications owing to its computational complexity. In our experiment, this strategy was mainly used for comparison.
(2) Interval enumeration: a simple strategy for reducing the computational complexity of exhaustive enumeration is to collect textual boxes with the same interval distance. This can be represented as a list of lengths spaced by a fixed interval determined by a parameter k; with the smallest interval, it is the same as exhaustive enumeration. In our experiments, k was set as 1 and 2, respectively.
(3) Series enumeration: an important property of the named entity length distribution is that most named entities have short lengths. Therefore, collecting more textual boxes with shorter lengths is preferable. In this case, we use a series to set the named entity length list, for example, $2^k$ ($k \in \{0, 1, 2, 3, 4, 5\}$), which gives [1, 2, 4, 8, 16, 32]. We also tried other lists such as [1, 5, 9, 15] for comparison.
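The effect of a length list on the number of proposals can be sketched as follows. The sketch covers the exhaustive and series strategies on a single layer of a 50-token sentence; the function names are assumptions, and boxes are written as (start, length) with 1-based starts.

```python
def exhaustive_lengths(max_len):
    """Length list [1, 2, ..., M] for exhaustive enumeration."""
    return list(range(1, max_len + 1))


def series_lengths(max_len):
    """Length list of powers of two that do not exceed max_len."""
    lengths, k = [], 0
    while 2 ** k <= max_len:
        lengths.append(2 ** k)
        k += 1
    return lengths


def propose_boxes(n_tokens, lengths):
    """Generate every (start, length) box that fits inside the sentence."""
    return [(start, length)
            for start in range(1, n_tokens + 1)
            for length in lengths
            if start + length - 1 <= n_tokens]


n = 50
print(len(propose_boxes(n, exhaustive_lengths(n))))  # 1275
print(series_lengths(n))                             # [1, 2, 4, 8, 16, 32]
print(len(propose_boxes(n, series_lengths(n))))      # 243
print(len(propose_boxes(n, [1, 5, 9, 15])))          # 174
```

In this setting the series list proposes roughly a fifth of the boxes that exhaustive enumeration does, while still including boxes of length 16 and 32 from which boundary regression can reach long entities.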

Boundary regression
In boundary regression, if a textual box is far away from any truth box, it contains fewer contextual features about a truth box because of the vanishing gradient and is thus less helpful for learning the offset relative to a truth box. Therefore, adjacent textual boxes are more valuable for training a regression layer. These boxes are referred to as 'neighbouring textual boxes'.
In contrast, if a textual box is far from any truth box, it is referred to as a 'remote textual box'. Given a truth box $g_j \in D_g$, the method used to collect neighbouring textual boxes is formalised as follows.
$$D_n = \{ d_i \mid d_i \in D,\ \mathrm{IoU}(d_i, g_j) \ge \gamma \} \tag{3}$$
where $\gamma$ is a predefined threshold taking values in the closed interval [0, 1]. $\mathrm{IoU}(d_i, g_j)$ is an intersection over union (IoU) function used to measure the overlapping ratio between two boxes (Everingham et al., 2010). This is represented as
$$\mathrm{IoU}(d_i, g_j) = \frac{|\mathrm{span}(d_i) \cap \mathrm{span}(g_j)|}{|\mathrm{span}(d_i) \cup \mathrm{span}(g_j)|} \tag{4}$$
In Equation (4), the function $\mathrm{span}(d_i)$ represents the range of a textual box in the feature map layer.
With the same settings, the remote textual box set can be represented as $D_r = D - D_n$, which is the complementary set of $D_n$.
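The split into neighbouring and remote boxes can be sketched with a token-level IoU. The sketch assumes that a box counts as neighbouring when its IoU with some truth box is at least the threshold; whether the comparison is strict is an assumption, as are the helper names and the threshold value 0.5.

```python
def span(box):
    """Set of 1-based token positions covered by a (start, length) box."""
    start, length = box
    return set(range(start, start + length))


def iou(box_a, box_b):
    """Intersection over union of two boxes, measured in tokens."""
    a, b = span(box_a), span(box_b)
    return len(a & b) / len(a | b)


def split_boxes(boxes, truth_boxes, gamma):
    """Split boxes into neighbouring (IoU >= gamma with some truth box) and remote."""
    neighbours = [d for d in boxes
                  if any(iou(d, g) >= gamma for g in truth_boxes)]
    remote = [d for d in boxes if d not in neighbours]
    return neighbours, remote


truth = [(5, 2)]                    # 'New York'
boxes = [(3, 3), (5, 1), (1, 2)]
print(iou((3, 3), (5, 2)))          # 0.25
neighbours, remote = split_boxes(boxes, truth, gamma=0.5)
print(neighbours)                   # [(5, 1)]
print(remote)                       # [(3, 3), (1, 2)]
```

Lowering the threshold moves more boxes into the neighbouring set; with a threshold of 0.25 the box (3, 3) would also be used to train the regression layer.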
Let $d_i$ represent a default textual box and $g_j$ represent a truth box. The normalised position offset and shape offset between $d_i$ and $g_j$ are defined in terms of $g^s_j$, $g^l_j$, $d^s_i$ and $d^l_i$, which denote the start position and the length of $g_j$ and $d_i$, respectively. Because the region proposal generates textual boxes with different lengths, for each length of textual box, a multilayer perceptron (MLP) layer is applied to learn the location offsets relative to the nearest truth box. The outputs of the MLP layer are referred to as $ds_{ij}$ and $dl_{ij}$. Therefore, given an input sentence $T$, the location loss is computed by applying the Smooth $L_1$ loss to the differences between the predicted and true offsets and dividing the sum by $N$, where $N$ is the number of boxes in a sentence. It is used to normalise the influence of entity spans in a sentence.
An indicator in the loss has value 1 if $d_i$ is paired with $g_j$. Otherwise, it has value 0. Smooth $L_1$ is a robust $L_1$ loss that quantifies the dissimilarity between $d_i$ and $g_j$ (Girshick, 2015). To support the regression operation, the length of the sentence and the positions of named entity boundaries are normalised into the range [0.0, 1.0].
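For reference, the Smooth L1 function of Girshick (2015) is quadratic for small errors and linear for large ones. The sketch below applies it to pairs of predicted and true offsets; the `location_loss` helper, its argument layout and the assumption that only boxes paired with a truth box are passed in are illustrative choices rather than the exact loss used in the model.

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def location_loss(predicted, target, n_boxes):
    """Sum Smooth L1 over (start, length) offset pairs and divide by the box count.

    `predicted` and `target` are aligned lists of (start_offset, length_offset)
    for boxes already paired with a truth box (an assumption of this sketch).
    """
    total = 0.0
    for (pred_s, pred_l), (true_s, true_l) in zip(predicted, target):
        total += smooth_l1(pred_s - true_s) + smooth_l1(pred_l - true_l)
    return total / n_boxes


print(smooth_l1(0.5))   # 0.125
print(smooth_l1(-2.0))  # 1.5
print(location_loss([(0.5, -2.0)], [(0.0, 0.0)], n_boxes=2))  # 0.8125
```

Because the loss grows only linearly for large errors, a few badly placed boxes do not dominate the gradient, which is why it is described as a robust L1 loss.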

Span classification
Given a textual box $d_i = [s, l, c]$, boundary regression outputs its location offsets relative to a truth box. The updated textual box $d_i$ is fed into a fully-connected layer and a softmax layer to predict the confidence score. Let $c_i = (c^0_i, c^1_i, \ldots, c^Z_i)$ be a one-hot vector representing the entity type of $d_i$, where $Z$ is the number of entity types and $c^0_i$ represents the negative entity type. Let $\tilde{c}_i = (\tilde{c}^0_i, \tilde{c}^1_i, \ldots, \tilde{c}^Z_i)$ be the classification confidence score output by the softmax layer. The confidence loss is computed from the softmax outputs $\tilde{c}_i$ against the one-hot labels $c_i$, where $\tilde{c}^0_i$ is the confidence score of being a negative instance. In our work, the tasks of regressing named entity boundaries and generating confidence scores are trained in an end-to-end multi-objective learning framework, which enables the network to share model parameters in the bottom layers of the model. The total loss function combines the location loss and the confidence loss, where $\alpha$ is a predefined parameter balancing the two. The training objective is to reduce the total loss of the location offset and class prediction. In the training process, we optimise the locations of textual boxes to improve their matching degree and maximise their confidence.
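A minimal sketch of the classification side is given below. It assumes a standard cross-entropy between the one-hot label and the softmax output, and it assumes that α scales the location term in the total loss; the text only states that α balances the two losses, so both choices are assumptions, as are the function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probabilities):
    """Negative log-probability of the labelled class (assumed confidence loss)."""
    return -sum(math.log(p) for t, p in zip(one_hot, probabilities) if t)


def total_loss(location_loss, confidence_loss, alpha=1.0):
    """Weighted sum of the two losses; placing alpha on the location term is an assumption."""
    return confidence_loss + alpha * location_loss


# Index 0 is the negative type; two entity types follow (Z = 2).
probs = softmax([0.0, 0.0, 0.0])
label = [0, 1, 0]
conf = cross_entropy(label, probs)
print(round(conf, 4))              # 1.0986
print(total_loss(1.0, 2.0, 0.5))   # 2.5
```

Because both losses are computed from the same shared layers, a single backward pass through the total loss updates the classifier and the regressors together, which is the end-to-end behaviour described above.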

Experiments
In our experiments, we adopted the GENIA corpus (Kim et al., 2003) and the ACE2005 English corpus (Doddington et al., 2004) to evaluate the mechanism of boundary regression to support full-span named entity recognition.
The GENIA corpus was collected from biomedical literature in MEDLINE by PubMed. It contains 2,000 abstracts on three medical subject heading terms: human, blood cells, and transcription factors. This dataset contains 36 fine-grained entity categories. The ACE2005 corpus was collected from broadcasts, newswires, and weblogs. It contains three datasets: Chinese, English, and Arabic. In our work, the English corpus is used. It contains 506 documents, annotated with 7 named entity types. The length distributions of named entities in the GENIA corpus and the ACE2005 corpus are listed in Table 1. In Table 1, the length distribution of named entities is represented as a matrix. Each element of the matrix denotes the number of named entities whose length is the sum of the row label and the column label. For example, in the ACE2005 corpus, the element in the '10' row and the '+3' column is 39. It means that there are 39 named entities with length 13. The tag '-' in an element indicates that there is no named entity of the corresponding length. As Table 1 shows, the longest lengths in the ACE2005 and GENIA corpora are 49 and 18, respectively.
To show the performance of full-span named entity recognition with different lengths, we divided the named entity lengths into three intervals: [1, 6], [7, 12] and [13, +∞). They contain named entities with short length (S-Length), middle length (M-Length) and long length (L-Length), respectively. The ratios of named entities in each part are listed in Table 2.
As Table 2 shows, lengths ranging from 1 to 6 cover most named entities, especially in the GENIA corpus. Because named entities in the GENIA corpus include terminologies such as 'protein', 'DNA', 'cell type', etc., they have shorter lengths than named entities in the ACE2005 corpus. The latter was mainly collected from everyday speech. In both corpora, only a small proportion of named entities have lengths greater than 6. However, because their lengths vary widely, it is challenging to recognise them.
In this study, we mainly evaluated the performance of full-span named entity recognition on datasets annotated with nested named entities. To assess the performance of our model on flattened named entity recognition, we also evaluated our approach on the OntoNotes 5.0 (Pradhan et al., 2013) and CoNLL 2003 English (Tjong Kim Sang & De Meulder, 2003) corpora. The OntoNotes corpus was collected from a wide variety of sources, including magazines, telephone conversations, newswires, and so forth. It contains 76,714 sentences and is annotated with 18 entity types. The CoNLL corpus consists of 22,137 sentences collected from Reuters newswire articles. It is divided into 14,987, 3,466 and 3,684 sentences for training, development, and testing, respectively.
In our experiments, to compare our approach with the state-of-the-art methods, we adopted the same settings as Lu and Roth (2015) and divided the dataset in the proportion 8:1:1 for training, development and testing. In the GENIA corpus, following related works, we also report the performance on five named entity types (DNA, RNA, protein, cell line and cell type). We used the 'AdamW' optimiser. The learning rate, weight decay, batch size and number of training epochs were set to 2e-5, 0.01, 12 and 30, respectively. Dropout regularisation with value 0.2 was applied to reduce overfitting. In this paper, the traditional precision/recall/F1 score (P/R/F) measurements were adopted to evaluate performance. Precision is the ratio of correctly recognised named entities to all named entities output by the model. Recall is the ratio of correctly recognised named entities to all true (annotated) named entities. The F1 score balances precision and recall and is computed as (2 × P × R)/(P + R).
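The P/R/F measurements above can be sketched as follows. The sketch assumes an exact-match criterion, where a predicted entity counts as correct only if its start, end and type all agree with an annotated entity; the function name `prf1` and the example spans are hypothetical.

```python
def prf1(predicted, gold):
    """Compute precision, recall and F1 over sets of (start, end, type) spans.

    Assumption: exact match on boundaries and type decides correctness.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans for one sentence; nested spans are simply separate tuples.
gold = {(0, 2, "PER"), (1, 2, "PER"), (5, 8, "ORG")}
pred = {(0, 2, "PER"), (5, 8, "ORG"), (9, 10, "LOC"), (5, 7, "ORG")}
p, r, f = prf1(pred, gold)
```

Here two of the four predictions are correct and two of the three annotated entities are found, so precision is 2/4, recall is 2/3 and F1 is 4/7.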

Comparison with related work
In this experiment, our model is first compared with related works evaluated on the ACE2005 English corpus and the GENIA corpus. The first and second feature map layers of Figure 1 were used, which achieved the highest performance in our multi-granular BR model (discussed in Section 4.5 in detail). In both the training data and the testing data, we only enumerate entity spans with lengths 1, 2 and 4.
We divide related work into two categories: token labelling and span classification. Span classification is further divided into limited-span entity recognition and full-span entity recognition. Limited-span recognition refers to recognising named entities with lengths up to a certain limit. In related work, the maximum named entity length for limited-span named entity recognition is usually set to 6. Full-span recognition supports named entities of all possible lengths in a sentence.
In token labelling models, Lu and Roth (2015) used a full-span approach based on the CRF model, where a mention hypergraph is proposed to transform nested named entities into flattened structures. Katiyar and Cardie (2018) and Wang and Lu (2018) also applied hypergraph models based on deep neural networks. Ju et al. (2018) designed a stacked Bi-LSTM layered model. Shibuya and Hovy (2020) provided a decoding method to infer entities in an outside-to-inside way. Straková et al. (2019) applied a sequence-to-sequence model to support named entity recognition. Wang et al. (2020) stacked deep neural networks in a pyramid shape.
In limited-span models, Lin et al. (2019) first detected the anchor of a named entity. Then, a point network was implemented to find the left and right boundaries of the named entity. Xia et al. (2019) proposed a detection network to enumerate entity spans, which are further evaluated by a classifier network. Sohrab and Miwa (2018) applied deep neural networks to enumerate and classify all possible regions or spans in a sentence. Lin et al. (2019) proposed an attentive neural network (ANN), which integrates externally-learned knowledge to support region-based named entity recognition.
In full-span models, Fisher and Vlachos (2019) introduced a neural network that merges tokens into entities with nested structures and then labels each of them independently. Tan et al. (2020) incorporated the named entity boundary detection task to learn span representations of named entities. Zheng et al. (2019) proposed a boundary-aware neural model, which leverages entity boundaries to predict entity categorical labels. Shen et al. (2021) also proposed a boundary regression-based model, which is implemented on a single feature map layer to identify nested named entities in a two-stage framework.
The results are shown in Table 3, where 'S-Length', 'M-Length' and 'L-Length' show the performance only on named entities with short, middle and long lengths, respectively. The performance in 'Total' is calculated on all named entities in the testing dataset.
Traditionally, token-labelling models adopt a sequence method to find a maximised label sequence. Recognising nested named entities is difficult for them because a token may have different labels belonging to different named entities. Many strategies have been proposed to improve sequence models for nested named entity recognition, such as layering, cascading, and joint strategies. Because sequence models have the advantage of encoding semantic dependencies in a sentence, the results in Table 3 show that they still achieve competitive performance in nested named entity recognition.
The S-Length row reports the performance of our model on named entities with lengths up to 6, which is the traditional setting for span-based models. In our experiment, we only enumerated named entities with lengths [1, 2, 4]. Nevertheless, the results show that our performance is comparable with that of state-of-the-art methods such as Xu and Jiang (2016) and Sohrab and Miwa (2018). The results in M-Length and L-Length show that performance worsens as the length of named entities increases. However, because long named entities account for a small proportion of the total, they have little influence on the final performance.
Because of the computational complexity of full-span classification, current models usually use prior knowledge to filter unlikely named entities, e.g. named entity boundaries (Tan et al., 2020; Zheng et al., 2019). This strategy leads to two shortcomings. First, in a pipeline model, it easily leads to the cascading failure problem. Second, dividing the [...] 2022) used a triaffine mechanism to integrate heterogeneous features. It achieved the best performance in the ACE2005 corpus. Li et al. (2020) and Shen et al. (2022) proposed machine reading comprehension-based models, where manually designed questions are required to encode named entity representations. Because these models benefit from external knowledge, they achieved the best performance on the GENIA corpus. Li et al. (2022) proposed a word-word relation classification model, which recognises flat, nested, and discontinuous named entities in a unified framework. Because the GENIA corpus is annotated with discontinuous named entities, Li et al. (2022) obtained improved performance on this corpus.
Shen et al. (2021) and Chen et al. (2022) are also boundary regression models for nested named entity recognition. The former is a pipeline framework, which enumerates entity lengths [1, 2, 3, 4, 5, 7, 9, 11, 13, 15]. The latter is an end-to-end framework, which enumerates all entity spans with lengths up to 6. Both contain a feature map layer for region proposal. Compared with them, we adopt a multi-granularity sentence representation to generate spans with different granularities. Furthermore, several enumeration strategies are proposed to make full use of boundary regression. The results show that, even with the [1, 2, 4] enumeration strategy, our model achieves competitive performance. This indicates that, supported by boundary regression, there is no need to enumerate a large number of entity spans, which effectively reduces the computational complexity of full-span named entity recognition.

The performance on unenumerated named entities
This experiment is conducted to show how well boundary regression supports full-span named entity recognition. In the region proposal process, we adopt the [1, 2, 4, 8, 16, 32] enumeration strategy, which combines a feature map with the feature maps to its right to generate six entity spans with lengths 1, 2, 4, 8, 16 and 32. In this experiment, instead of adopting multi-granular feature maps, only the first feature map layer is used to generate textual boxes. All performance is reported as F1 score. Because the whole dataset is divided 8:1:1 for training, development, and testing, some lengths do not occur in the testing data. They are indicated by the tag '-'. The results for all entity lengths are shown in Table 4, where the performance on enumerated named entity lengths (1, 2, 4, 8, 16, 32) is shown in bold. Shorter named entities clearly have more robust performance.
The reason is that longer named entities usually suffer from the vanishing gradient problem, which makes it harder to capture the semantic dependencies of long named entities in a sentence.
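One reading of the enumeration step above can be sketched as follows. The sketch assumes that a textual box is a token span with an exclusive end index, that one box is generated per start position for each listed length, and that boxes running past the end of the sentence are dropped; the paper's implementation may handle sentence ends differently (e.g. by padding). The name `enumerate_boxes` is hypothetical.

```python
def enumerate_boxes(num_tokens, lengths):
    """Return (start, end) spans, end exclusive, for each start position
    and each listed length that still fits inside the sentence."""
    boxes = []
    for start in range(num_tokens):
        for length in lengths:
            end = start + length
            if end <= num_tokens:  # assumption: no padding beyond the sentence
                boxes.append((start, end))
    return boxes


SERIES = [1, 2, 4, 8, 16, 32]
boxes = enumerate_boxes(10, SERIES)  # a hypothetical 10-token sentence
```

For a 10-token sentence only the lengths 1, 2, 4 and 8 fit, so spans of length 5 such as (3, 8) are never proposed and can only be reached by regressing the boundaries of a neighbouring box.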
As shown in Table 4, named entities of almost all lengths, except length 20, can be identified by our model. In the ACE2005 testing data, lengths 22, 23 and 27 each contain only one named entity, and length 20 contains two. In the GENIA corpus, length 15 also contains only one named entity. Therefore, these lengths show abnormal performance (0.0% or 100% F1 values). An important phenomenon is that boundary regression is also effective at locating named entities whose lengths are not enumerated as textual boxes. For example, in Table 4, non-enumerated named entities with length 5 and length 15 have higher performance than those with adjacent enumerated lengths. The results indicate that boundary regression is powerful at finding non-enumerated named entities. Without exhaustively verifying all possible entity spans, it can still support full-span named entity recognition.
After analysing the output, we found two typical errors in the recognised named entities. First, if only one textual box is enumerated near two or more nested named entities, this box can only approach one of them, which leads to the omission problem. For example, the clause 'it was compared to a highly diverged Drosophila homeodomain' contains two nested named entities, 'Drosophila homeodomain' and 'homeodomain'. In our experiment, only 'Drosophila homeodomain' is correctly recognised as a DNA entity. Second, if a true named entity is contained in an enumerated false entity span, this span is easily recognised as a true named entity because the two share the same contextual features in the sentence. This leads to a false positive error. For example, the phrase 'novel B-cell-specific enhancer element' is annotated with a DNA entity 'B-cell-specific enhancer element', but both are recognised as DNA entities.

The performance with different embeddings
To show more details about our model, using the same settings as in the above experiment, we conduct an ablation study on the ACE2005 corpus to show the influence of the basic network. The basic network contains four embedding modules that generate the abstract representation of each token. In this experiment, we build four models and compare their performance: 'w/o Token Embedding', 'w/o Character Embeddings', 'w/o POS Embeddings', and 'w/o BERT'. The embeddings are removed one at a time to show their influence on the final performance. The performance is shown in Table 5.
The results show that removing the token embedding from the basic network has less influence on the final performance. The reason is that BERT can also encode these token features for named entity recognition. Compared with token embeddings, removing character embeddings leads to lower performance. Because character embeddings can learn representations of unknown words, they are effective in compensating for the weakness of other embeddings. The performance of 'w/o POS Embeddings' shows that removing POS embeddings is also influential. Compared with other embeddings based on words, POS embeddings encode informative syntactic features of a sentence. In this experiment, the worst performance is obtained by 'w/o BERT'. The reason is that BERT is trained on external resources with unsupervised algorithms, which makes it effective at learning contextual features and semantic dependencies of a sentence.

The performance with different region proposals
Region proposal is the key to balancing computational complexity against final performance. Enumerating a large number of entity spans increases the computational complexity and makes the data imbalance problem more serious. Enumerating few entity spans decreases the computational complexity, but it also leads to a lower recall in the final performance. In this section, we conduct an experiment to analyse the influence of region proposals on the performance. In this experiment, five region proposals are implemented for comparison. The first proposal is exhaustive enumeration, which verifies every possible textual box in a feature map layer. It is referred to as [1, 2, 3, 4, ..., M]. In our experiment, the maximum named entity length (M) is set to 49 and 18 in the ACE2005 and GENIA corpora, respectively. In interval enumeration, when setting k = 1 and k = 2, the proposed textual boxes can be represented as [1, 3, 5, 7, ..., M] and [1, 4, 7, 10, ..., M], respectively. In series enumeration, the list [1, 2, 4, 8, 16, 32] is adopted to generate textual boxes. Sparse enumeration is denoted by the list [1, 5, 9, 15].
The performance of boundary regression with different region proposals is shown in Table 6, where the 'RRP' column represents the recall of region proposal, i.e. the ratio of true named entities collected by the region proposal. A smaller RRP usually means a bigger challenge in recovering true named entities. In the exhaustive enumeration strategy, all entity spans are enumerated and all textual boxes are verified. Its result is mainly used for comparison. In this setting, we directly predict a confidence score for each entity span, so there is no need to implement boundary regression. It has the highest computational complexity and the worst data imbalance problem. However, the results show that it also achieves good performance.
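The proposal lists and the RRP measure can be sketched as follows. The sketch assumes that every start position is enumerated, so a true named entity counts as collected whenever its length appears in the proposal list; this is one reading of RRP, not a definition taken from the paper. The function names and the length histogram are hypothetical.

```python
def exhaustive(max_len):
    """Exhaustive enumeration: [1, 2, 3, ..., M]."""
    return list(range(1, max_len + 1))


def interval(max_len, k):
    """Interval enumeration: skip k lengths between consecutive entries,
    so k = 1 gives [1, 3, 5, ...] and k = 2 gives [1, 4, 7, ...]."""
    return list(range(1, max_len + 1, k + 1))


SERIES = [1, 2, 4, 8, 16, 32]   # series enumeration
SPARSE = [1, 5, 9, 15]          # sparse enumeration


def region_proposal_recall(length_counts, proposal):
    """Share of true entities whose length is covered by the proposal list
    (assumed reading of the 'RRP' column)."""
    covered_lengths = set(proposal)
    total = sum(length_counts.values())
    covered = sum(count for length, count in length_counts.items()
                  if length in covered_lengths)
    return covered / total


# Hypothetical histogram {entity length: count}, not corpus statistics.
histogram = {1: 50, 2: 25, 3: 10, 4: 8, 5: 4, 7: 3}
rrp = {
    "exhaustive": region_proposal_recall(histogram, exhaustive(49)),
    "interval k=1": region_proposal_recall(histogram, interval(49, 1)),
    "series": region_proposal_recall(histogram, SERIES),
    "sparse": region_proposal_recall(histogram, SPARSE),
}
```

Because short entities dominate the histogram, even the sparse list collects more than half of the entities here, which mirrors the pattern described for Table 6.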
Comparing the two corpora, the performance on ACE2005 is higher than on GENIA, even though the lengths of named entities in ACE2005 span a wider range. The reason is that recognising named entities in the GENIA dataset is more difficult, because a large number of abbreviations are annotated. Furthermore, in the GENIA dataset, nested named entities may occur within a single word. For example, 'TCR-ligand' is an 'other_name' entity, which is nested with a 'TCR' protein entity.
In the interval enumeration strategy with k = 1, the number of enumerated textual boxes is half that of exhaustive enumeration. The region proposal contains only 71.10% and 62.94% of the truth boxes in total. However, after boundary regression, the recall increases to 87.53% and 82.40%, respectively. With k = 2, the performance remains stable. Compared with exhaustive enumeration, only half of the truth boxes are enumerated in the region proposal process, yet the final performance is only slightly affected. In series and sparse enumerations, only 6 or 4 textual boxes are generated for each feature map. Because a large number of named entities have short lengths, the collected truth boxes still cover around half of the total true named entities in series enumeration. In sparse enumeration, only a small number of textual boxes are generated for evaluation. However, the results show that both strategies have stable performance. Compared with exhaustive enumeration, only half of the truth boxes are enumerated, but the F1 score decreases by only 1.2% and 1.1% on the ACE2005 and GENIA corpora.
The results also show that the best performance on the ACE2005 dataset is achieved by exhaustive enumeration, whereas the best performance on GENIA is achieved by series enumeration. A possible reason is that exhaustive enumeration in GENIA leads to a serious data imbalance problem, which affects the final performance.
To reveal more details about boundary regression, we conduct another experiment on single and double enumeration. In single enumeration, we generate only one textual box for each feature map. For example, list [3] verifies textual boxes with length 3 only. In double enumeration, for every feature map, we combine it with two feature maps to its right to generate two textual boxes. The performance is shown in Table 7. The results in Table 7 show that list [1] has a higher precision but suffers from a lower recall. In this setting, the offsets between textual boxes are all positive values, which are not suitable for training a linear layer. Although many truth boxes are enumerated in the region proposal process, the recall remains low.
The results for lists [3], [5] and [7] are interesting. In these settings, only a small number of truth boxes can be collected, especially with list [7]. Nevertheless, they all achieve better recall. The results indicate that boundary regression is powerful at locating named entities in a sentence. On the other hand, when the length of the enumerated entities increases, the precision drops considerably. The reason for this degradation is that the location offset between enumerated boxes and truth boxes becomes larger, which makes it difficult to regress named entity boundaries.
One impressive result is shown in double enumeration, where every feature map is combined with two feature maps to its right to generate two textual boxes. It enumerates a small number of truth textual boxes, especially with lists [3, 5] and [3, 7]. However, they all achieve competitive performance. This result shows that boundary regression effectively supports full-span named entity recognition.

The performance with different granularities
Compared with sentences, images have a compression-invariant property, where a zoom operation has little influence on image classification. In traditional methods, linguistic units are usually represented as categorical values. Because a sentence contains linked words, compressing a sentence into a short representation may result in serious loss of information. Owing to deep neural networks, words can be embedded into distributed representations, which enables discontinuous semantic processing. For example, Wang et al. (2020) use convolutional networks to compress sentences into pyramid-shaped abstract representations. Then, token labelling is implemented to recognise named entities in each layer with different granular representations.
In the boundary regression model, a textual box is an abstract representation of a possible named entity. In addition to an entity type, it has two parameters indicating its location and length in a sentence. Therefore, compressing an abstract sentence representation into a shorter representation has two impacts. First, it condenses the semantic representations of named entities. Second, it compresses the location and length of a named entity in a sentence. In this section, we conduct an experiment to analyse the compressibility of textual semantics and reveal the influence of the compression operation on boundary regression.
As discussed in Section 3.1, two strategies ('Parallel' and 'Stacked') can be applied to generate feature map layers with different granularities. 'Parallel' means that feature map layers 2 ∼ 6 are generated directly from the first feature map layer by applying convolutional networks with kernel sizes 2 ∼ 5. 'Stacked' means that each feature map layer from 2 ∼ 6 is generated iteratively from its previous layer by applying a convolutional kernel of size 2.
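The structural difference between the two strategies can be sketched with a toy example. The sketch makes several assumptions that are not taken from the paper: each position is a single number rather than a vector, a fixed window average stands in for a learned convolution, and the convolution uses stride 1 without padding. All function names are hypothetical.

```python
def conv1d_mean(values, kernel_size):
    """Stride-1, unpadded 1D window average; a stand-in for a learned
    convolution (assumption for illustration only)."""
    return [sum(values[i:i + kernel_size]) / kernel_size
            for i in range(len(values) - kernel_size + 1)]


def parallel_layers(first_layer, kernel_sizes):
    """'Parallel': every coarser layer is computed directly from the
    first layer, one kernel size per layer."""
    return [conv1d_mean(first_layer, k) for k in kernel_sizes]


def stacked_layers(first_layer, depth):
    """'Stacked': each coarser layer is computed from the previous
    layer with a kernel of size 2."""
    layers, current = [], first_layer
    for _ in range(depth):
        current = conv1d_mean(current, 2)
        layers.append(current)
    return layers


first = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]      # toy first feature map layer
par = parallel_layers(first, kernel_sizes=[2, 3])
stk = stacked_layers(first, depth=2)
```

Under these assumptions the second layer is identical in both strategies, while the third layer covers the same three-token window but is one convolution away from the first layer in 'Parallel' and two convolutions away in 'Stacked'.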
In the first experiment, we independently generate textual boxes from each single feature map layer and compute the performance on each feature map layer. The region proposal list [1, 3, 5, 7, 11, 15, 20] is applied to generate textual boxes. Figures 2 and 3 show the performance of named entity recognition on the ACE2005 and GENIA corpora with the 'Parallel' and 'Stacked' strategies.
The results show that the compression operation influences the performance. After the sentence representation is compressed, the performance degrades considerably. The reason is that contextual features in a sentence are weakened by the compression operation. As shown in Figures 2 and 3, denser representations lead to lower precision. However, an interesting phenomenon is that denser representations usually increase the recall. The reason may be that denser representations help to avoid the vanishing of semantic dependencies in a sentence.
Another phenomenon is that the 'Stacked' strategy has lower performance than the 'Parallel' strategy. The reason is that, in the 'Stacked' strategy, every feature map layer is generated iteratively from its previous layer, so denser representations suffer from a more serious vanishing gradient problem.
In the multi-granular BR model, instead of independently reporting the performance on each feature map layer, the performance can also be computed from the outputs of several feature map layers. In the second experiment, several feature map layers are stacked and used simultaneously for boundary regression. In every feature map layer, the region proposal list [1, 3, 5, 7, 11, 15, 20] is applied to generate textual boxes. We collect all of their outputs to compute the final performance. Figure 4 shows the performance.
The motivation for adopting multi-granular feature map layers is that a shorter sentence representation can reduce the influence of the vanishing gradient when recognising long named entities. It also has the potential to strengthen long semantic dependencies in a sentence. The results show that, when the second feature map layer is added, both precision and recall improve.
When the number of feature map layers is increased, the recall of boundary regression improves steadily. However, if the number of feature map layers is larger, a large number of textual boxes are enumerated, which worsens the precision and affects the final performance.
The results show that the multi-granular model achieves the best performance when both the first and the second feature map layers are used. In this paper, we adopted this configuration as our default setting. It is the configuration compared with related works in our experiments.

The performance with flattened named entity recognition
In this section, the OntoNotes 5.0 (Pradhan et al., 2013) and CoNLL 2003 English (Tjong Kim Sang & De Meulder, 2003) corpora are employed to evaluate how well the BR model recognises named entities with flattened structure. The full-span model is compared with several state-of-the-art models evaluated on the OntoNotes and CoNLL corpora. In this experiment, we also use two feature map layers. The region proposal strategy [1, 3, 5, 7, 11, 15, 20] is adopted to enumerate textual boxes.
In related works, Ma and Hovy (2016) proposed a Bi-LSTM-CNNs-CRF model, which automatically encodes semantic features from words and characters. Ghaddar and Langlais (2018) also used a Bi-LSTM-CRF model, learning lexical features from word and entity type representations. Devlin et al. (2019) proposed the BERT framework, which is effective at learning semantic features from external resources. Li et al. (2020) proposed a model based on machine reading comprehension. Yu et al. (2020) used a biaffine model to encode dependency trees of sentences. Luo et al. (2020) also proposed a Bi-LSTM model, based on hierarchical contextualised representations. Li et al. (2022) proposed a general framework that treats named entity recognition as word-word relation classification and can handle flat entities. In addition, Batbaatar and Ryu (2019) used word, character, and POS feature engineering for entity recognition. We reproduce this method for comparison.
The results are shown in Table 8. In flattened named entity recognition, sequence models (e.g. Bi-LSTM) output a maximised labelling sequence and are effective at encoding semantic dependencies in a sentence. Therefore, on the OntoNotes and CoNLL corpora, sequence models achieve higher performance. Compared with them, our model also has competitive performance.
To show the influence of region proposal strategies on flattened named entity recognition, we also compare different region proposal strategies on the OntoNotes 5.0 and CoNLL 2003 English corpora. The results, shown in Table 9, follow the same trend as in nested named entity recognition (Tables 6 and 7). When the number of enumerated textual boxes is reduced, the performance decreases. Compared with exhaustive enumeration, series enumeration and interval (k = 2) enumeration reduce the computational complexity considerably while still achieving competitive performance, as do the double enumeration strategies in particular.

The computational complexity with different region proposals
The computational complexity of different region proposals can be estimated by the number of textual boxes generated in each sentence. In this section, we compare the computational complexity of different region proposals. Let L = [x_1, x_2, ..., x_L] denote a region proposal list. Given a sentence T with length N, the number of textual boxes Num_L generated by an employed region proposal strategy is computed by Equation (8). In Equation (8), the value of N influences the complexity ratio. In our experiment, the default sentence length is set to N = 50. In this paper, instead of listing the number of textual boxes in a region proposal strategy, the computational complexity of a region proposal is measured as the ratio of the number of textual boxes in the employed region proposal strategy to that in the exhaustive enumeration strategy. This ratio directly shows the decrease in computational complexity for a given region proposal strategy.
Table 10 shows the computational complexity ratio of region proposals in the ACE2005 corpus. The results in Table 10 show that double enumeration has a lower computational complexity. Compared with exhaustive enumeration, only 7.52% of the textual boxes are enumerated for classification. However, as shown in Table 7, competitive performance can still be achieved in terms of F1 score. The results indicate that, supported by boundary regression, there is no need to verify every possible entity span.
Our multi-granular boundary regression model can achieve competitive performance while considerably reducing computational complexity.
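The complexity ratio can be approximated with a simple count. Equation (8) is not reproduced in this section, so the sketch below assumes that each listed length x contributes one textual box per valid start position, i.e. N - x + 1 boxes in a sentence of N tokens; the ratios it produces therefore need not match Table 10 exactly. The function names are hypothetical.

```python
def num_boxes(sentence_len, lengths):
    """Count textual boxes, assuming one box per start position for each
    listed length that fits (N - x + 1 boxes for length x)."""
    return sum(sentence_len - x + 1 for x in lengths if x <= sentence_len)


def complexity_ratio(sentence_len, lengths, max_len):
    """Ratio of boxes in a proposal list to boxes in exhaustive
    enumeration [1, ..., max_len]."""
    exhaustive_count = num_boxes(sentence_len, range(1, max_len + 1))
    return num_boxes(sentence_len, lengths) / exhaustive_count


N = 50          # default sentence length used in the paper
M = 49          # maximum entity length set for the ACE2005 corpus
ratio_series = complexity_ratio(N, [1, 2, 4, 8, 16, 32], max_len=M)
ratio_double = complexity_ratio(N, [3, 5], max_len=M)
```

Under this assumption, the series list needs fewer than a fifth of the boxes of exhaustive enumeration, and a double list such as [3, 5] needs fewer than a tenth.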
In the following, we compare the model complexity with related works. Our full model (Multi-granular BR) is mainly used to evaluate the mechanism of boundary regression. It contains three token embedding modules and six feature map layers. For better understanding, we also implement a simple model (BERT+Single-granular) for comparison. It contains only a BERT layer in the basic network and uses one feature map layer for region proposal. The numbers of parameters in different models are shown in Table 11.
In Table 11, the model proposed by Yan et al. (2021) has a large number of parameters, because a triaffine mechanism is adopted to integrate features with different formats. The boundary regression model proposed by Chen et al. (2022) also has a high computational complexity because it simultaneously enumerates a large number of entity spans (600 spans for each sentence). Every entity span is independently regressed and classified by two linear layers and an MLP layer, which leads to a large number of parameters. Compared with them, our full model has the highest computational complexity. Because our simple model has an architecture similar to that of Chen et al. (2022), its number of parameters decreases to 372.2M, which is close to that of their model.

Conclusion and future work
In this paper, a multi-granular semantic end-to-end boundary regression model is proposed to support full-span named entity recognition. Experiments were conducted to show the effectiveness of the boundary regression mechanism. The results show that our method can reduce computational complexity and recognise named entities of any length without enumerating all possible named entities. It shows great potential to support real applications, where huge amounts of data should be processed as quickly as possible. In the future, our work can be developed in two directions. First, it is valuable to enhance boundary regression and reveal more details about this mechanism. Second, this method can be developed and applied to other NLP tasks, such as event extraction and automatic question answering.

Table 1. Distribution of different intervals.

Table 2. Distribution of different intervals.

Table 3. Evaluation in the English Corpus.

Table 4. Performance of Boundary Regression.

Table 5. Ablation Study on the ACE2005 Corpus.

Table 6. Performance on Region Proposals.

Table 7. Performance on Single and Double Enumeration.

Table 9. Influence of Region Proposal on the CoNLL 2003 and OntoNotes 5.0 Corpora.

Table 10. Complexity Ratio Between Different Region Proposals.

Table 11. The Number of Model Parameters.