Entity and relation collaborative extraction approach based on multi-head attention and gated mechanism

Entity and relation extraction has been widely studied in natural language processing, and several joint methods have been proposed in recent years. However, existing studies still suffer from two problems. Firstly, while the token space information has been fully utilised in those studies, the label space information is underutilised, even though a few preliminary works have proven that label space information can contribute to this task. Secondly, the performance of relevant entity detection is still unsatisfactory in entity and relation extraction tasks. In this paper, a new model, GANCE (Gated and Attentive Network Collaborative Extracting), is proposed to address these problems. Firstly, GANCE exploits the label space information by applying a gating mechanism, which improves the performance of relation extraction. Then, two multi-head attention modules are designed to update the token and token-label fusion representations, which addresses relevant entity detection. Experimental results demonstrate that GANCE outperforms several competitive approaches on the CoNLL04 dataset, reaching 90.32% and 73.59% for entity recognition and relation extraction, respectively. Moreover, the F1 score of relation extraction increases by 1.24% over existing approaches on the ADE dataset.


Introduction
Named entity recognition (NER) and relation extraction (RE) (Zhao et al., 2021a) are two important tasks in text mining. Given a sentence, the entities and their types are first detected by entity recognition. Take the sentence in Figure 1 as an example: "Richard Celeste" is an entity of type People (Peop), and "Ohio" is an entity of type Location (Loc). Then, the semantic relationships between the entities are determined by relation extraction. For example, there is a "Live in" relation between the entities in Figure 1. In this way, the relationship structure of entities in unstructured texts can be obtained automatically (Liu et al., 2019; Sharma et al., 2020; Zhao et al., 2020). Accordingly, entity recognition and relation extraction play essential roles in information extraction (IE). Besides, they are enablers for other natural language processing tasks, such as knowledge base population (Viswanathan et al., 2015), information retrieval (Chen et al., 2015), and question answering (Yih et al., 2016).
Initially, NER and RE were generally organised as two successive subtasks. A major disadvantage of such an approach is that errors can propagate across subtasks: errors generated in NER could be propagated to RE (Li et al., 2017). Another disadvantage is that the correlations between the two subtasks are ignored, while these correlations may benefit the coordination between them. Recently, joint models have been put forward to realise NER and RE simultaneously (Li et al., 2017; Nayak & Ng, 2020; Xiao et al., 2020), merging the two subtasks into a single task: for a given input text, joint models extract the entities and relations simultaneously. Therefore, the above two disadvantages can be eliminated in joint models, and such models have achieved gratifying results, e.g. Miwa and Bansal (2016), Katiyar and Cardie (2017), Bekoulis et al. (2018a), Nguyen and Verspoor (2019) and Eberts and Ulges (2019). However, we notice that two limitations still exist in those models: (1) Little work has been done to utilise label information in the joint model, although the entity types contained in the label information play an important role in RE. To the best of our knowledge, few studies make sufficient use of label space information in their joint models. Some exceptions are Miwa and Bansal (2016), Bekoulis et al. (2018a) and Wan et al. (2021), which fuse label space information into their models by simple feature concatenation. Those works are beneficial attempts that prove the positive contribution of label information to the joint extraction task.
(2) Several RNN- and CNN-based models, such as the tree-LSTM structured model (Miwa & Bansal, 2016) and the globally normalised CNN model (Adel & Schütze, 2017), have been proposed to realise the joint extraction task. Furthermore, Bekoulis et al. (2018a) regard joint extraction as a multi-head selection problem and propose a multi-head model. Wadden et al. (2019) extract global features of context to extend a graph neural network (GNN) for the joint task. In these models, all entities in the sentence are treated equally. However, the importance of each entity in the joint extraction task differs, so noise is likely to be introduced into the model if all entities are used indiscriminately, especially for the multi-head model.
To address these two limitations, a novel end-to-end joint entity and relation extraction model, called Gated and Attentive Network Collaborative Extracting (GANCE), is proposed in this paper.
Firstly, the label information is exploited in GANCE, and then a gating mechanism is applied to fuse token and label information dynamically. In this way, the entity types in label information are utilised by GANCE and benefit the performance of relation extraction.
Secondly, a multi-head attention module is designed to capture the attention weights between tokens. Then, another multi-head attention module is used to refine the attention weights after the label information is integrated by gating. Based on these two multi-head attention modules, the relevance of the entities can be extracted and the potential relevant entities can be detected. Since relevant entities could contribute to the RE task (Zhao et al., 2021a), it is anticipated that GANCE could achieve better performance on RE than existing models.
Finally, the correctness and feasibility of the proposed GANCE are validated on two public datasets, i.e., CoNLL04 and ADE. Furthermore, comparisons between several competitive approaches and GANCE show that GANCE achieves better performance.

Entity and relation extraction
Early entity and relation extraction is mostly configured as two subtasks, with two models (an NER model and an RE model) designed to solve them in a pipeline way (Chan & Roth, 2011; Miwa et al., 2009; Nadeau & Sekine, 2007). Yang and Cardie (2013) and Miwa and Sasaki (2014) propose joint extraction models. However, early joint models involve non-trivial feature engineering and rely heavily on NLP tools.
Recently, with the development of deep learning, joint models tend to adopt RNN-based and CNN-based structures to skip feature engineering (Alberto, 2018; Chiang et al., 2019; de Jesús Rubio, 2009; de Rubio, 2020; Furlán et al., 2020; Islas et al., 2021; Lin et al., 2020; Shen et al., 2021). Miwa and Bansal (2016) design a bidirectional tree-structured RNN model, which makes full use of dependency tree and word sequence information to extract relationships between entities. Wang et al. (2016) propose multi-level attention CNNs to extract relations. Similarly, Katiyar and Cardie (2017) introduce a traditional attention model to extract relations. A multi-round question-based method has also been designed for entity and relation extraction; it applies BERT as the core model and achieves promising performance on multiple datasets. Bekoulis et al. (2018b) propose a multi-head mechanism to predict multiple relationships, although it requires manual feature extraction and the assistance of external tools; in their model, the whole sentence is encoded by a BiLSTM and the output is fed into the multi-head mechanism.

Label space information
The information extraction problem can be regarded as a sequence labelling problem, which generates label space information (label information for short). Sequence labelling aims to assign a label to each element in a sequence. In NLP, a sequence generally refers to a sentence, and an element refers to a word in the sentence. Named entity recognition (NER) is a subtask of information extraction, which needs to locate and classify elements; its label information includes the locations and types of elements. In this paper, the BIO joint tagging method is used to tag each element with "B-X", "I-X", or "O", where "B-X" indicates the beginning of an element of type X, "I-X" indicates an inside position of an element of type X, and "O" indicates that the element does not have a type. For the entity "Richard Celeste" shown in Figure 1, "Richard" is labelled as "B-Peop" since it is the first element of an entity with type Peop, and "Celeste" is labelled as "I-Peop". Since "Celeste" is followed by a word labelled "O", it can be inferred that "Celeste" is the end boundary of this entity.
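To make the tagging scheme concrete, the conversion from entity spans to BIO tags can be sketched as follows (a minimal illustration; the function name `spans_to_bio` and the span format are our own, not from the paper):

```python
def spans_to_bio(n_tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end_exclusive, entity_type) over token indices.
    """
    tags = ["O"] * n_tokens                      # default: no entity type
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"               # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"               # inside the entity
    return tags

# "Richard Celeste ... Ohio": a Peop span over tokens 0-1, a Loc span at token 4
print(spans_to_bio(5, [(0, 2, "Peop"), (4, 5, "Loc")]))
# → ['B-Peop', 'I-Peop', 'O', 'O', 'B-Loc']
```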
Some studies have proved that label space information plays a positive role in entity and relation extraction. To facilitate zero-shot learning, label space information was first applied in computer vision (Zhang & Saligrama, 2016). Recently, label information has been widely used in other NLP tasks. Yu et al. (2019) apply label space information to the text classification task; Bekoulis et al. (2018b) and Miwa and Bansal (2016) utilise label information by simple feature concatenation in their RE models; and Wang et al. propose a new method that eliminates the different treatment of the two subtasks' (i.e. entity detection and relation classification) label spaces, using a unified label space for entity and relation extraction.

Relevant entity
Intuitively, the stronger the correlation between two entities, the more likely they are to be relevant entities, which further indicates there may be a relationship between them. For example, assume A is an entity of type "Location", B is an entity of type "People", and C is an entity of type "Organisation". For the "Live in" relation, A and B would have a stronger correlation than A and C; therefore, in one sentence, A and B are more likely to have a "Live in" relationship.
Based on this observation, we use the attention mechanism to capture the correlations between entities and the potential relevant entities, which is helpful for entity and relation extraction. In addition, traditional neural network models can only learn close-distance relevant entities and have difficulty capturing long-distance relevant entities. Hence, a multi-head attention mechanism is used in GANCE to solve this problem.

The GANCE model
This section provides the detailed design of GANCE. The overall flowchart of GANCE is illustrated in Figure 2. Firstly, the token representation is obtained by a BiLSTM and a multi-head attention module (Section 3.1.1). Then, a low-dimension label representation is obtained from randomly initialised vectors (Section 3.1.2). Next, a gating mechanism and another multi-head attention module are carefully designed to fuse and update the token and label representations (Section 3.2). Meanwhile, a conditional random field (CRF) (Lafferty et al., 2002) and the multi-head mechanism (Bekoulis et al., 2018a) are employed as the decoding modules for NER and RE, respectively (Section 3.3). Lastly, the training and inference processes are described (Section 3.4).

Token representation
Word-level encoder: Recently, distributed feature representation has been widely used in NLP, especially in deep learning methods (Luo et al., 2018). Based on distributed feature representation, the discrete words in a sentence can be mapped into continuous input embeddings. In this paper, word embeddings, character embeddings, and ELMo embeddings (Peters et al., 2018) are utilised and concatenated as the final embedding. Accordingly, given a sentence W = (w_1, ..., w_n) as a sequence of tokens, each token w_i is mapped into a real-valued embedding x_i ∈ R^{d_w}. This embedding representation implies the semantic and syntactic meanings of the token. Therefore, the sequence W is transformed into a set of embedding vectors X = (x_1, ..., x_n), where n is the length of the sentence. A BiLSTM then takes X as input, as shown in Equations (1)-(2):

→h_i = LSTM_fw(x_i, →h_{i−1})    (1)
←h_i = LSTM_bw(x_i, ←h_{i+1})    (2)
Afterwards, the outputs of the forward and backward LSTMs at each timestep are concatenated as the output H of the BiLSTM, as shown in Equation (3):

h_i = [→h_i ; ←h_i],  H = (h_1, ..., h_n)    (3)
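As a sketch of the BiLSTM encoder described by Equations (1)-(3), the following minimal NumPy forward pass concatenates the forward and backward hidden states per token (the weight layout and function names are illustrative; a real implementation would use a deep learning framework):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(X, W, U, b):
    """Run one LSTM direction over X: (n, d_in) -> hidden states (n, d_h)."""
    d_h = U.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    H = []
    for x in X:
        z = x @ W + h @ U + b                    # (4 * d_h,) stacked gates
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)               # cell state update
        h = o * np.tanh(c)                       # hidden state
        H.append(h)
    return np.stack(H)

def bilstm(X, fw_params, bw_params):
    """Concatenate forward and backward hidden states per token."""
    H_fw = lstm_forward(X, *fw_params)
    H_bw = lstm_forward(X[::-1], *bw_params)[::-1]   # reverse, run, re-reverse
    return np.concatenate([H_fw, H_bw], axis=-1)     # (n, 2 * d_h)
```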
Multi-head attention: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. In this paper, both of our attention layers are based on the multi-head attention formulation (Vaswani et al., 2017). One critical advantage of multi-head attention is its ability to model long-range dependencies, which is beneficial for extracting relevant entities. A scaled dot product is chosen as the compatibility function. Compared with the standard additive attention mechanism (Bahdanau et al., 2014), which is implemented using a one-layer feed-forward neural network, the scaled dot product enables efficient computation. Given a matrix of n query vectors Q ∈ R^{n×2d}, key vectors K ∈ R^{n×2d} and value vectors V ∈ R^{n×2d}, the multi-head attention is calculated as Equations (4)-(6):

Attention(Q, K, V) = softmax(QK^T / √d_k)V    (4)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (5)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O    (6)

where head_i refers to the ith head of the multi-head attention.
For simplicity, the above multi-head attention module is denoted as in Equation (7). As shown in Figure 3(a) and Equation (8), the output H of the BiLSTM is used as the query, key and value matrices and fed into the multi-head attention. Finally, the token representation is generated, denoted as M_t ∈ R^{n×2d}. For the parallel attention heads in GANCE, h = 8 is employed.
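A minimal NumPy sketch of the scaled dot-product multi-head attention described above (the projection matrices `Wq`, `Wk`, `Wv`, `Wo` are illustrative parameters, not the paper's trained weights):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Scaled dot-product attention with h heads (Vaswani et al., 2017).

    Q, K, V: (n, d) matrices; Wq, Wk, Wv: (h, d, d_k) per-head projections;
    Wo: (h * d_k, d) output projection.
    """
    h, _, d_k = Wq.shape
    heads = []
    for i in range(h):
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]        # (n, d_k) each
        att = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # (n, n) weights
        heads.append(att @ v)                            # (n, d_k)
    return np.concatenate(heads, axis=-1) @ Wo           # (n, d)
```

In GANCE, the BiLSTM output H would be passed as all three of `Q`, `K`, and `V` (self-attention), with h = 8 heads.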

Label representation
As illustrated in Figure 1, labels in the BIO (Beginning, Inside, Outside) format are used in this paper for NER (Zhao et al., 2021a, 2021b). Motivated by Miwa and Bansal (2016), each label is represented by a randomly initialised embedding vector, whose size is denoted as d_l.
After fine-tuning during training, the generated vector sequence L ∈ R^{n×d_l} is used as the label representation. Note that gold labels are used only during training; for inference, predicted labels are utilised. In other words, the entity labels predicted by the NER model (the CRF) are used in the process of relation inference.

Token-Label fusion representation
To further exploit the token and label information, it is necessary to fuse the token representation and the label representation. Instead of naive fusion schemes such as simple concatenation or M_F = M_t + L, a gating mechanism and a multi-head attention module are applied in GANCE to fuse and update the representations. The motivation behind this design is that the importance of each representation should be determined by the specific context; hence the fusion should be implemented in a dynamic form, which can be realised by the gate and multi-head attention mechanisms. At first, gating is used to fuse the token and label representations as Equations (9)-(10):

g = σ(M_t W_t + L W_l + b_f)    (9)
M_F = g ⊙ M_t + (1 − g) ⊙ L    (10)

where W_t, W_l ∈ R^{2d×2d}, b_f ∈ R^{2d}, and ⊙ is element-wise multiplication. Then, a multi-head attention module takes M_F as its input and outputs the updated token-label representation. As shown in Figure 3(b), the multi-head attention component feeds M_F as the query, key, and value matrices through different linear projections. Finally, the fused token-label representation M_{t−l} is computed as Equation (11).
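A NumPy sketch of the gated fusion in Equations (9)-(10), assuming the common sigmoid-gate form g = σ(M_t W_t + L W_l + b_f), M_F = g ⊙ M_t + (1 − g) ⊙ L (this specific form is an assumption consistent with the stated parameter shapes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(M_t, L, W_t, W_l, b_f):
    """Dynamically mix token and label representations.

    M_t, L: (n, 2d); W_t, W_l: (2d, 2d); b_f: (2d,).
    The gate g decides, per position and dimension, how much token
    information versus label information to keep.
    """
    g = sigmoid(M_t @ W_t + L @ W_l + b_f)   # gate in (0, 1), shape (n, 2d)
    return g * M_t + (1.0 - g) * L           # element-wise convex mix
```

Because each output entry is a convex combination of the corresponding token and label entries, the fused value always lies between them, unlike plain addition.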

NER:
To identify the entities, a CRF layer is added to GANCE. It takes the token representation M_t = (m_1^t, ..., m_n^t) as input. As Equations (12)-(13) show, the output of the CRF layer is a sequence of predicted tagging label probabilities Y = (y_1, ..., y_n), where W_n and b_n are model parameters.
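CRF decoding selects the highest-scoring tag sequence under per-token emission scores and label-to-label transition scores. A minimal NumPy Viterbi sketch (in the paper these scores are learned by the CRF layer; here they are simply given as inputs):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Best tag sequence under a linear-chain CRF's scores.

    emissions: (n, k) per-token label scores; transitions: (k, k) scores
    for moving from label a to label b. Returns the argmax label path.
    """
    n, k = emissions.shape
    score = emissions[0].copy()                  # best score ending in each label
    back = np.zeros((n, k), dtype=int)           # backpointers
    for t in range(1, n):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):                # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The transition scores are what let the CRF forbid invalid tag sequences such as "O" followed directly by "I-Peop".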

RE:
A multi-head mechanism is utilised for RE, with the token-label fusion representation M_{t−l} as input. Suppose C is the set of relation labels. The multi-head mechanism aims to give a score to each tuple (w_i, w_j, c_r), where w_i is the head token, w_j is the tail token, and c_r denotes the rth relation between w_i and w_j. Each pair of tokens <w_i, w_j> has multiple heads, and each head computes a score for one relation. Given a relation label c_r, a single-layer neural network computes the score s(m_i^{t−l}, m_j^{t−l}, c_r) of word w_j as Equation (14), which is further used as the head of w_i, where V ∈ R^z, W ∈ R^{z×2d}, U ∈ R^{z×2d}, b_r ∈ R^z, and z is the width of the layer. The probability that tokens w_i and w_j have a c_r relationship can be calculated by Equations (15)-(16), where σ is the sigmoid function. During inference, the most probable candidate tuples (w_i, w_j, c_k) are selected using threshold-based prediction.
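The pairwise scoring of Equation (14) can be sketched for a single relation label as follows (NumPy; the activation f is taken to be tanh, which is an assumption since the paper does not reproduce the equation body here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selection_scores(M, W, U, V, b):
    """Pairwise scores for one relation label (multi-head selection style).

    M: (n, 2d) fused token-label representations; W, U: (z, 2d); V: (z,);
    b: (z,). Returns an (n, n) matrix of probabilities, entry (i, j) being
    the probability that w_j is a head of w_i under this relation.
    """
    A = M @ W.T                                       # (n, z) head-token part
    B = M @ U.T                                       # (n, z) tail-token part
    F = np.tanh(A[:, None, :] + B[None, :, :] + b)    # (n, n, z); f = tanh assumed
    return sigmoid(F @ V)                             # (n, n), each entry in (0, 1)
```

Threshold-based prediction then keeps every (i, j) pair whose probability exceeds a fixed cutoff, so one token can participate in multiple relations.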

Training and inference
During the training process, the parameters are optimised to maximise the conditional likelihood in Equation (17) for NER (Zhao et al., 2021a). For RE, the cross-entropy loss L_re is calculated as shown in Equation (18), where o is the number of relations (heads). The objective of the joint entity and relation extraction task is set as Equation (19) shows, where w and θ denote the tokens and model parameters, respectively.
Similar to Kendall et al. (2018), we combine the NER and RE objectives using homoscedastic uncertainty to learn their relative weights from the data, replacing the unweighted sum L_joint(w; θ) = L_ner + L_re in Equation (19) with an uncertainty-weighted combination.
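Following Kendall et al. (2018), the uncertainty-weighted combination of the two task losses can be sketched as follows, using the log-variance parameterisation (a generic sketch of the weighting scheme, not the paper's exact training code):

```python
import math

def weighted_joint_loss(l_ner, l_re, s_ner, s_re):
    """Kendall et al. (2018) style weighting: s_k = log(sigma_k^2) is learned.

    Each task loss is scaled by exp(-s_k) and regularised by + s_k, so the
    model can down-weight the noisier task instead of using fixed weights.
    """
    return (math.exp(-s_ner) * l_ner + s_ner
            + math.exp(-s_re) * l_re + s_re)

# With s = 0 for both tasks this reduces to the plain sum L_ner + L_re.
print(weighted_joint_loss(2.0, 3.0, 0.0, 0.0))  # → 5.0
```

In practice s_ner and s_re are trainable scalars optimised jointly with the model parameters.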

Dataset
The public benchmarks CoNLL04 (Roth & Yih, 2004) and ADE (Gurulingappa et al., 2012) are used to validate the effectiveness of the proposed method. CoNLL04 consists of 910/243/288 instances for training/validation/testing. Besides, 10-fold cross-validation is adopted on the ADE dataset. Three commonly used evaluation metrics are used to evaluate the model: Precision (P), Recall (R), and F1 score (F).

Implementation details
There are 3 BiLSTM layers in GANCE, and the sizes of the hidden layer d and label embeddings d_l are set to 64 and 25, respectively. The optimiser is Adam, with the learning rate set to 0.0005. The sizes of the character embeddings, word embeddings and ELMo embeddings (Peters et al., 2018) are set to 128, 128, and 1024, respectively. Lastly, training takes 180 epochs to converge.

Performance on benchmarks
The performance of the different models on the two datasets is compared and analysed as follows.

Performance against entity distance
For two entities, the entity distance refers to the absolute character offset between the last character of the entity that appears first and the last character of the entity that appears second. The distance between related entities influences the effectiveness of relation extraction, and capturing long-distance entity dependencies is a long-standing difficulty in relation extraction. To evaluate the impact of entity distance on the performance of GANCE, experiments are conducted on the CoNLL04 dataset. According to the entity distance (i.e. 0-9, 9-19, ≥ 20), the CoNLL04 dataset is divided into three parts (Zhao et al., 2021a). Multi-head + AT (Bekoulis et al., 2018a) is set as the baseline because it uses the same decoding layer as GANCE. The performance of GANCE and the baseline under different entity distances is depicted in Figure 4. It can be seen from Figure 4 that GANCE outperforms the baseline under all entity distances. Moreover, the performance of GANCE is much better than the baseline when the distance between entities exceeds 20 characters: specifically, GANCE leads by 15.59% in the F1 score for RE. In conclusion, GANCE maintains remarkable performance even with long entity distances. This is in line with expectations, since GANCE can detect relevant entities by learning attention weights between the entities in the sentence.

Effect of homoscedastic uncertainty
The effect of the loss with and without homoscedastic uncertainty is further analysed in this section. As shown in Figure 5, GANCE based on the loss with homoscedastic uncertainty achieves better performance on the CoNLL04 dataset. Specifically, the F1 score of GANCE with the weighted loss is 0.8% and 1.08% higher than that of GANCE without the weighted loss for NER and RE, respectively. This is reasonable, since homoscedastic uncertainty can capture the correlation confidence between NER and RE; therefore, the model based on this loss can learn to balance the weights optimally. Moreover, as seen in Figure 5, the performance of the two models (with and without the weighted loss) has similar trends, and both models approach their maximum performance even in the early training epochs.

Ablation study
To further assess the impact of different modules on the performance of GANCE, an ablation study is conducted on the CoNLL04 dataset. Three modules (the gate module and the two multi-head attention modules) are evaluated in the following four ways, and the results are summarised in Table 3. (I) The gate module. The gate module is replaced with a simple feature addition scheme (M_F = M_t + L). The F1 score of GANCE on NER and RE drops to 89.44 (−0.88%) and 72.14 (−1.45%), respectively; therefore, the gate module used in GANCE is essential for capturing relevant entities. (II) The multi-head attention module in the token representation. This multi-head attention module is ablated for both tasks, after which the performance of GANCE decreases significantly: the F1 score drops by 2.1% and 1.93% on NER and RE, respectively. These results show that this multi-head module benefits the tasks by capturing self-correlations among tokens. (III) The multi-head attention module in the token-label fusion representation. This multi-head attention module is deleted and M_F is used directly for decoding. The results show that the F1 score drops by 1.67% and 1.57% on NER and RE, respectively. (IV) Both multi-head attention modules. Removing both attention modules causes worse results on NER (−4%) and RE (−3.48%), which demonstrates that the proposed attention modules play a vital role in enhancing the token representations.

Extraction cases analysis
To gain further insight into GANCE, an error analysis is provided in Table 4. For case 1, it can be observed that "Rocky Mountains", "Montana" and "Livingston" are correctly detected as Location entities. Besides, GANCE identifies the two "Located_In" relations between these entities. For case 2, GANCE cannot recognise the "Live_In" relationship between "Peter Murtha" and "U.S.". Besides, the "OrgBased_In" relation between "Justice Department" and "U.S." is also missed. The reason is that "Justice Department" and "Peter Murtha" are involved in more than one relation.

Concluding remarks
In this paper, we propose the Gated and Attentive Network Collaborative Extracting (GANCE) model for the task of joint entity and relation extraction. GANCE consists of a gating mechanism and two multi-head attention modules. Besides, homoscedastic uncertainty is introduced to weight the losses of the two tasks. Compared with existing joint methods, GANCE provides a new way to utilise label space information and detect relevant entities. Experimental results on two benchmarks demonstrate that GANCE effectively improves the performance of NER and RE by fusing label space information and detecting relevant entities. For future work, text-based entity and relation extraction can be applied in more related research fields in the Internet of Things, such as relationship prediction, link mining, image recognition (Liang, Long, et al., 2021), attack detection (Kang, 2020), QoS prediction and anti-attack protection, security defense (Liang, Ning, et al., 2021), and IP circuit protection (Liang et al., 2020). Besides, it can be combined with many new technologies, such as blockchain (Liang et al., 2020), big data (Chen, Liang, Zhou, et al., 2021), service recommendation (Chen, Liang, Xu, et al., 2021), and security risk assessment. In addition, related research on entity representation, text mining, and relation extraction can be conducted based on images, videos, and audio, which would have theoretical significance and application value for the development of the Intelligent Internet of Things (IIoT).

Disclosure statement
No potential conflict of interest was reported by the authors.