Multi-Level Attention with 2D Table-Filling for Joint Entity-Relation Extraction

Abstract: Joint entity-relation extraction is a fundamental task in the construction of large-scale knowledge graphs. This task relies not only on the semantics of the text span but also on its intricate connections, including classification and structural details that most previous models overlook. In this paper, we propose the incorporation of this information into the learning process. Specifically, we design a novel two-dimensional word-pair tagging method to define the task of entity and relation extraction. This allows type markers to focus on text tokens, gathering information for their corresponding spans. Additionally, we introduce a multi-level attention neural network to enhance its capacity to perceive structure-aware features. Our experiments show that our approach can overcome the limitations of earlier tagging methods and yield more accurate results. We evaluate our model using three different datasets: SciERC, ADE, and CoNLL04. Our model demonstrates competitive performance compared to the state-of-the-art, surpassing other approaches across the majority of evaluated metrics.


Introduction
Named Entity Recognition (NER) and Relation Extraction (RE) aim to extract structured information from plain text. They are long-standing research topics in the field of Natural Language Processing (NLP). Figure 1 gives an example of the NER and RE problem. NER aims to identify entities in text and classify them into pre-defined entity types; for example, "Reagan" should be recognized as a person (Peop) and "U.S." as a location (Loc). RE, in turn, usually builds on the entities identified by NER and combines them with contextual semantic information to assign a relation type to entity pairs. For instance, a "Live_In" relation exists between "Reagan" and "U.S.". Methods for entity and relation extraction can be categorized into pipeline and joint models. In the traditional pipeline approach [1][2][3][4], NER and RE are treated as two independent tasks: first, entities are recognized in the input sentence, and then relations are classified for pairs of extracted entities. Joint approaches [5][6][7][8][9][10] extract entities and relations in parallel and then combine them into triples, avoiding the error propagation caused by the pipeline framework.
Many joint methods focus on learning a unified representation of these two tasks to explore the correlation between NER and RE. Pretrained Language Models (PLMs) such as BERT [11] perform exceptionally well and can help mitigate problems such as limited semantic elements within a sentence, so researchers can exploit BERT to extract more complex features. Some works [3,12,13] have focused on obtaining improved span representations from pre-trained encoders. For example, ref. [13] proposes a simple and effective way to capture span representations through BERT for lightweight reasoning. Ref. [4] introduces a novel span representation approach that considers the interrelation between spans (pairs) by strategically packing markers in the encoder. These approaches often rely heavily on predefined span features, causing the model to overlook the intricate interconnections among entities and relations and thereby impeding the recognition of semantic relations between entity pairs.
To explore the common structure of the two tasks, table-filling methods have been proposed, wherein unit features are defined as the basic semantic properties of the target word pair [8,14,15,16,17]. In this approach, the (i, j, r)-th cell is assigned a label that represents the relationship between the tokens at positions i and j in the sentence. For an input sentence, the output of the method is therefore usually a three-dimensional (3D) matrix in which each entry corresponds to a classification result. Approaches built upon the table structure operate on the idea that cell labels depend on features or predictions derived from preceding or adjacent cells. Ref. [8] formulates joint extraction as a token-pair linking problem and introduces a handshaking tagging scheme that aligns the boundary tokens of entity pairs for each relation type. Ref. [14] proposes to eliminate the different treatment of the two sub-tasks' label spaces and applies a unified classifier to predict each cell's label. In their approach, entities and relations are represented by squares and rectangles in the table. Ref. [15] employs a scoring-based classifier and a relation-specific horn tagging strategy. However, none of these methods utilize the information from type markers.
In our study, we propose leveraging a pre-trained encoder to enhance the model's semantic information with features linked to the target information. This encompasses entity and relation type markers, along with structural details. Specifically, inspired by the works above and the interaction map proposed by [16], we design a new word-pair tagging method to extract all results in one step. The input of our model is a two-dimensional (2D) table, with each entry corresponding to a word pair in the sentence. A detailed description of our word-pair tagging can be found in Figure 2. Furthermore, we design a multi-level attention network for joint extraction. First, we facilitate multi-head biaffine auxiliary alignment between objects to discern correlations between units. Then, we combine table structure-aware features with sequence-aware features, thereby capturing connections between unit features while providing the model with both textual semantic information and task-related details. Our model predicts the most probable results from the word-pair tagging table by calculating the attention score. In general, our main contributions are as follows:

• We incorporate the type markers alongside text tokens in the same encoder, thus preserving task relevance rather than treating them as isolated components. Building upon a novel word-pair tagging approach, we condense our table into two dimensions.

• We propose a multi-level attention mechanism that models interactions around unit features, capturing dependencies between table structure-aware and sequence-aware features. This mechanism effectively integrates the inherent relationships between feature sequences relevant to entities or relations, while maintaining the efficiency advantage of the model.

Related Work
In recent years, many works [18][19][20] have considered the joint modeling of entity recognition and relation extraction and have largely focused on developing effective prediction models. Joint extraction of entities and relations mitigates the error propagation issue associated with the traditional pipeline approach and leverages the interaction between tasks, resulting in improved performance. In addition, the following problems have attracted much attention from researchers:

• Overlapping: Based on the different overlapping patterns of triples [21], sentences can be divided into three categories, as suggested by [22]: Normal, Entity Pair Overlap (EPO) and Single Entity Overlap (SEO). A sentence is classified as Normal if none of its triples have overlapping entities. It is categorized as EPO if some of its triples have overlapping entity pairs. Meanwhile, a sentence falls into the SEO class if some of its triples have an overlapping entity but do not have overlapping entity pairs. Note that a sentence can belong to both the EPO and SEO classes.

• Interaction: Since these tasks are closely interconnected, joint models capable of simultaneously extracting entities and their relations within a single framework can leverage inter-task correlations and dependencies, leading to potential performance improvements. Several recent efforts have aimed to exploit such inter-task correlations by jointly modeling both NER and RE tasks.
Some approaches, such as token-level models [23,24] using the BIO tagging scheme, face challenges in modeling overlapping entity mentions and often suffer from cascading errors due to sequential decoding. The span-based approach [25] identifies overlapping entities by determining the boundaries of objects and then categorizing them based on these boundaries. However, span-based models are limited by the maximal span length, and a sentence of n words contains n(n + 1)/2 candidate spans. In previous work, ref. [13] sets width embeddings that are learned through backpropagation, while ref. [3] processes span pairs with levitated markers independently, which is time-consuming and overlooks the interrelation between span pairs. Earlier work [26] in this area commonly reduces the task to a table-filling problem, which helps to address the overlapping and interaction problems. However, these methods usually require an additional, expensive decoding step to obtain globally consistent cell labels. Ref. [27] introduced a neural architecture that utilizes the table structure and repeatedly applies 2D convolutions to pool local dependency and metric-based features. Another work [28] proposed a global feature-oriented triple extraction model that fully leverages global associations: each relation's table is filled based on its refined table feature, and all triples linked to this relation are extracted from its filled table.
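To make the span-count argument concrete, the following minimal Python sketch enumerates the candidate spans of a sentence and shows how a maximum span length prunes them. The `max_len` parameter and the example sentence are illustrative assumptions, not taken from the paper.

```python
def enumerate_spans(words, max_len=None):
    """Return all inclusive (start, end) spans, optionally capped in length."""
    n = len(words)
    spans = []
    for start in range(n):
        for end in range(start, n):
            # Stop extending this start position once the cap is exceeded.
            if max_len is not None and end - start + 1 > max_len:
                break
            spans.append((start, end))
    return spans


# Hypothetical example sentence (8 words).
words = "Reagan visited the State Department in the U.S.".split()
all_spans = enumerate_spans(words)                 # n(n + 1)/2 candidates
capped_spans = enumerate_spans(words, max_len=3)   # pruned by a length cap
print(len(all_spans), len(capped_spans))
```

The uncapped count grows quadratically with sentence length, which is the cost that span-based models pay before any pair classification takes place.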
This paper introduces a two-dimensional table to represent interactions between individual words in a sentence. Our method leverages both the table structure within the 2D table representation and the sequence structure information within the text. We facilitate interaction between these elements with our multi-level attention architecture, especially considering the context of neighboring entries in the table.

Methods
In this section, we first detail the task of joint entity and relation extraction and our word-pair tagging method (Sections 3.1 and 3.2). Then, we describe our contextualized word representations based on pre-trained language models (Section 3.3) and introduce our multi-level attention for table-filling tasks (Section 3.4). Finally, we introduce the training method used to extract entities and relations (Section 3.5). Figure 2 shows a detailed description of our word-pair tagging, and Figure 3 shows an overview of our model architecture.

Task Description
Given a sentence $S$ of words $w_1, w_2, \ldots, w_n$ as input, the model is required to extract entities and identify the relation types between them, forming a set of triplets of the form $(e_1, r, e_2)$, where $e_1$ is not equal to $e_2$. An entity $e_1$/$e_2$ is a span with the pre-defined entity type $t_1$/$t_2$, and $r$ represents the relation between the entities $e_1$ and $e_2$. The task requires the model to correctly predict the boundaries of the subject entity and the object entity, as well as the relation between them.

Word-Pair Tagging
We propose a new word-pair tagging method, thereby transforming the task into one that extracts the predicted results for each word pair $(w_i, w_j)$. By concatenating text and task label types into a natural language sequence, our model can exploit their contextualized correlations and leverage the semantic knowledge learned from the pre-trained language model. The markers are explained further below:

• Diagonal markers in the purple part indicate the entity-head and entity-tail. The orange part on the right represents the connection between an entity-head and an entity type. Similarly, the orange part below the table represents the connection between an entity-tail and an entity type. When the entity-head and entity-tail share the same entity type, they form an entity. The table thus expresses exactly how to detect the correct span boundaries, as shown in Figure 2, where ("Reagan", Peop), ("State Department", Org) and ("U.S.", Loc) can be extracted.

By combining the entity and relation parts, we extract complete relational triples such as $(\text{Reagan}_{\text{Peop}}, \text{Live\_In}, \text{U.S.}_{\text{Loc}})$.
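The entity part of the tagging scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the sentence is a made-up stand-in for the Figure 1 example, the table stores binary marks, entity-type columns and rows are appended after the word positions, and an entity is decoded when a (head, tail) cell and both type links agree.

```python
# Hypothetical tokenized sentence and label inventory (not from the paper).
words = ["Reagan", "visited", "the", "State", "Department", "in", "the", "U.S."]
entity_types = ["Peop", "Org", "Loc"]
n = len(words)

# Gold entities as (head index, tail index, type), indices inclusive.
gold_entities = [(0, 0, "Peop"), (3, 4, "Org"), (7, 7, "Loc")]


def build_entity_table(entities):
    """Fill a binary table of size (n + #types) x (n + #types)."""
    size = n + len(entity_types)
    table = [[0] * size for _ in range(size)]
    for head, tail, etype in entities:
        t = n + entity_types.index(etype)
        table[head][tail] = 1   # word-word block: entity-head / entity-tail
        table[head][t] = 1      # right block: entity-head -> entity type
        table[t][tail] = 1      # bottom block: entity type -> entity-tail
    return table


def decode_entities(table):
    """Recover (text, type) pairs whose head and tail agree on a type."""
    found = []
    for head in range(n):
        for tail in range(head, n):
            if not table[head][tail]:
                continue
            for k, etype in enumerate(entity_types):
                if table[head][n + k] and table[n + k][tail]:
                    found.append((" ".join(words[head:tail + 1]), etype))
    return found


entity_table = build_entity_table(gold_entities)
entities = decode_entities(entity_table)
print(entities)
```

Because the type links live in the same table as the word pairs, a single pass over the filled table yields both span boundaries and entity types.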

Text Representation
In our approach, we enhance the input sequence by appending entity and relation type markers, which distinguishes our method from standard BERT models that process only raw text augmented with [CLS] and [SEP] tokens. Specifically, given an input sentence with $n$ words, $S = \{w_1, w_2, \ldots, w_n\}$, the entity types $E = \{t_{e1}, t_{e2}, \ldots, t_{en}\}$ and the relation types $R = \{t_{r1}, t_{r2}, \ldots, t_{rn}\}$, we feed the combined sequence of the text and the inserted type markers to the PLM (e.g., BERT) to obtain the context-aware token embeddings $H' \in \mathbb{R}^{L \times d}$. The sequence length is $L = tn + 2 + en + rn$, where $tn$ is the total number of word pieces in the sentence after segmentation (e.g., Mondrian → Mon, ##dr, ##ian), the 2 accounts for the special start and end markers [CLS] and [SEP], $en$ is the number of entity types, $rn$ is the number of relation types, and $d$ is the dimension of the hidden units in the BERT model. These markers are integrated into the input sequence, providing contextual cues that are absent in traditional BERT inputs, thereby enabling the PLM to leverage semantic and relational metadata along with textual information. After that, we compute the embedding of each word by max-pooling over its word pieces: if a word is split into multiple word pieces, the max-pooling of all piece vectors serves as its word representation, aggregating information for the associated span. Finally, the length of the sequence representation $H$ becomes $n + 2 + en + rn$.
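The length bookkeeping and the max-pooling step can be illustrated with a small NumPy sketch. It does not call a real PLM: random vectors stand in for the encoder output, and the marker strings, their order after [SEP], and the toy sentence are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Hypothetical word-piece segmentation of a 3-word sentence.
word_pieces = [["Mon", "##dr", "##ian"], ["painted"], ["it"]]
entity_markers = ["[Peop]", "[Org]", "[Loc]"]      # assumed marker strings
relation_markers = ["[Live_In]", "[Work_For]"]     # assumed marker strings

# Combined input: [CLS] + word pieces + [SEP] + type markers (assumed order).
sequence = (["[CLS]"] + [p for pieces in word_pieces for p in pieces]
            + ["[SEP]"] + entity_markers + relation_markers)

# Group token positions: one group per word, one per special token or marker.
groups, pos = [[0]], 1
for pieces in word_pieces:
    groups.append(list(range(pos, pos + len(pieces))))
    pos += len(pieces)
while pos < len(sequence):
    groups.append([pos])
    pos += 1

H_prime = rng.normal(size=(len(sequence), d))   # stand-in for PLM output, (L, d)
# Element-wise max over the pieces of each word gives one vector per word.
H = np.stack([H_prime[g].max(axis=0) for g in groups])
print(H_prime.shape, H.shape)
```

Here $tn = 5$, so $L = 5 + 2 + 3 + 2 = 12$, and after pooling the sequence has $n + 2 + en + rn = 3 + 2 + 3 + 2 = 10$ positions.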

Multi-Level Attention Encoder
Our multi-level attention encoder consists of a table structure-aware module, a context-table fusion module, and a sequence-aware module. Our model takes the sequence representation $H$ obtained in Section 3.3 as input, and its output is used to predict both entities and relations in sentences.
To ensure that text representations are shared between the entity and relation types, we adopt a table structure-aware module. Initially, we apply two multi-layer perceptrons (MLPs) to the pre-trained feature vector $H$ to obtain separate representations for the head and tail parts of an entity or relation. We split the representations $H_i$ and $H_j$ obtained from the MLPs into multiple heads. Then, a multi-head biaffine model is leveraged to obtain representations of word pairs $(h_i, h_j)$. Next, we concatenate the representations from all heads to obtain $H_T$ and apply a softmax activation function to $H_T$. The resulting $H_T$ serves as the weight information for the sequence, containing both context information and table structure. In this computation, $H_i, H_j \in \mathbb{R}^{n \times h}$, $n$ is the length of the sentence, $h$ is the hidden size, Split(·) equally splits a matrix along its last dimension, $h_j \in \mathbb{R}^{n \times h_k}$, $h_k$ is the hidden size of each head, $U$ is an $n \times r \times n$ trainable parameter, $r$ is the number of heads, and $H_T \in \mathbb{R}^{n \times n \times r}$.
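The sequence of operations in the table structure-aware module (two projections, head splitting, a biaffine product per head, and a softmax) can be sketched with NumPy as follows. Several details are assumptions of this sketch and not confirmed by the text: each MLP is reduced to one linear map with a tanh, the biaffine parameter is given the shape $(r, h_k, h_k)$ commonly used for per-head bilinear forms, and the softmax is taken over the column axis of the table.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, r = 6, 16, 12, 4      # toy sizes; h must be divisible by r
h_k = h // r                   # hidden size per head

H = rng.normal(size=(n, d))    # stand-in for the pooled PLM output

# Two single-layer "MLPs" (assumed form), one for heads and one for tails.
W_head = rng.normal(size=(d, h))
W_tail = rng.normal(size=(d, h))
H_i = np.tanh(H @ W_head)      # (n, h)
H_j = np.tanh(H @ W_tail)      # (n, h)

# Split(.): cut the last dimension into r heads of size h_k.
h_i = H_i.reshape(n, r, h_k)
h_j = H_j.reshape(n, r, h_k)

# One bilinear form per head; this parameter shape is an assumption.
U = rng.normal(size=(r, h_k, h_k))
raw_scores = np.einsum("akx,kxy,bky->abk", h_i, U, h_j)   # (n, n, r)

# Numerically stable softmax over the column axis (assumed axis).
shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
H_T = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(H_T.shape)
```

Each slice `H_T[:, :, k]` is then an $n \times n$ weight table for head $k$, which matches the stated shape $H_T \in \mathbb{R}^{n \times n \times r}$.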
We then perform multi-head attention calculations using the weight information and sequence information as our context-table fusion module, obtaining the new sequence representation $S \in \mathbb{R}^{n \times h}$. In the final sequence-aware module, we use two separate feed-forward neural network (FFNN) layers with a residual structure to encode the representation $S$.
Finally, we transform the features $S$ through the non-linear transformations $Q$ and $K$ and calculate the attention score to generate a predicted score for each relationship of the 2D word pair. The result is the interaction matrix $P \in \mathbb{R}^{n \times n}$ of prediction results, where $n$ is the length of the sentence and each entry corresponds to a word pair. Scores are passed through a sigmoid function $\sigma$, and we consider $P(\cdot)$ valid when its value exceeds the threshold ($\sigma > 0.5$). The representation $P_{ij}$ of the word pair $(x_i, x_j)$ can be considered a combination of the representation $h_i$ of $x_i$ and $h_j$ of $x_j$.
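A minimal NumPy sketch of this scoring step is given below. The specific form of the non-linear transformations (a linear map followed by tanh), the scaling by $\sqrt{h}$, and the random inputs are assumptions chosen for illustration; only the overall pattern of query-key scores, a sigmoid, and a 0.5 cut-off follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 5, 8
S = rng.normal(size=(n, h))          # stand-in for the encoded sequence

# Assumed non-linear transformations producing queries and keys.
W_q = rng.normal(size=(h, h))
W_k = rng.normal(size=(h, h))
Q = np.tanh(S @ W_q)
K = np.tanh(S @ W_k)

# Attention-style score for every word pair (scaling is an assumption).
logits = Q @ K.T / np.sqrt(h)        # (n, n)
P = 1.0 / (1.0 + np.exp(-logits))    # sigmoid

# A cell counts as a predicted mark when its score exceeds 0.5.
predicted = P > 0.5
pairs = [(int(i), int(j)) for i, j in zip(*np.nonzero(predicted))]
print(P.shape, len(pairs))
```

Since the sigmoid is monotonic, thresholding the score at 0.5 is equivalent to checking whether the underlying logit is positive.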

Training
Given the input and its gold label $y'$ (0 or 1), the binary cross-entropy loss is used for training, where $y$ is the predicted result and $n$ is the length of the sentence.
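As a rough illustration, binary cross-entropy over a word-pair table can be computed as in the pure-Python sketch below. Averaging over all $n \times n$ cells and clipping predictions with a small `eps` are assumptions of this sketch; the exact normalization used in the paper is not recoverable from the text.

```python
import math


def table_bce(pred, gold, eps=1e-12):
    """Mean binary cross-entropy over an n x n table of probabilities."""
    n = len(pred)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Clip to avoid log(0) for saturated predictions.
            y = min(max(pred[i][j], eps), 1.0 - eps)
            t = gold[i][j]
            total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / (n * n)


# Toy 2 x 2 example: an uninformative prediction of 0.5 everywhere.
gold = [[1, 0], [0, 1]]
uniform_loss = table_bce([[0.5, 0.5], [0.5, 0.5]], gold)
confident_loss = table_bce([[0.9, 0.1], [0.1, 0.9]], gold)
print(uniform_loss, confident_loss)
```

The uniform prediction yields a loss of $\ln 2$ per cell, and the loss shrinks as the predicted scores move toward the gold labels.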

Results
In this section, we present the experimental setup, including the datasets, evaluation metrics, and experiment settings used to evaluate the performance of our proposed model for entity and relation extraction. Additionally, we conduct ablation studies to further investigate the effectiveness of the model.

Datasets
To evaluate the performance of our proposed method, we tested it on three datasets from different domains, namely SciERC, ADE and CoNLL04.
SciERC: The SciERC dataset [29] is derived from 500 AI paper abstracts and defines scientific terms and relations specifically for scientific knowledge graph construction. It covers six scientific entity types (task, method, metric, material, other-scientific-term, generic) and seven relation types (compare, conjunction, evaluate-for, used-for, feature-of, part-of, hyponym-of), and includes 2687 sentences. We adopt the official training (1861 sentences)/validation (275 sentences)/testing (551 sentences) splits.
ADE: The Adverse Drug Events (ADE) dataset [30] was proposed for extracting drug-related adverse effects from medical text. It focuses on one relation category and two entity categories, drug and adverse-effect. ADE consists of 4272 sentences and 6821 relations; the sentences describe adverse effects arising from drug use. Since there are no official train-test splits, we report the mean performance over 10-fold cross-validation, as in prior work.
CoNLL04: The CoNLL04 dataset [31] contains 1441 sentences with annotated named entities and relations extracted from news articles. It has four entity categories (person, location, organization, and other) and five relation categories (Live_In, Located_In, OrgBased_In, Work_For, and Kill). We employ the training set (1153 sentences) and the test set (288 sentences), where 20% of the training set is used as a held-out development part, consistent with [13,32]. This dataset contains no overlapping entities.
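The 10-fold protocol used for ADE can be sketched in plain Python as follows. The interleaved assignment of sentences to folds is an assumption of this sketch; prior work typically relies on fixed, published folds, which are not reproduced here.

```python
def k_fold_indices(num_items, k=10):
    """Return k (train, test) index splits with near-equal test folds."""
    indices = list(range(num_items))
    # Interleaved assignment: item i goes to fold i % k (assumed scheme).
    folds = [indices[f::k] for f in range(k)]
    splits = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, test))
    return splits


# ADE has 4272 sentences according to the dataset description above.
splits = k_fold_indices(4272, k=10)
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

The reported score would then be the mean of the per-fold test scores, with each sentence used for testing exactly once.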

Evaluation Metrics
We evaluate these models on both entity recognition and relation extraction tasks, following the approach of prior work. For the NER task, an entity is considered correct if its predicted boundary and type match the ground truth. For the RE task, previous works have used different metrics: (1) boundaries evaluation (RE), where a relation is considered correct if its relation type and the boundaries of the two related entities are correct, without considering the correctness of the entity types; (2) strict evaluation (RE+), where a predicted relation is treated as a true positive only if it exactly matches a relation in the ground truth in terms of the boundaries and types of the subject and object entities and the relation type.
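The difference between the two relation metrics can be made concrete with a short Python sketch. The tuple layout `((start, end, type), relation, (start, end, type))` and the toy predictions are assumptions introduced for illustration.

```python
def prf(predicted, gold):
    """Precision, recall and F1 over two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def drop_entity_types(relations):
    """Keep only entity boundaries, as in the boundaries evaluation."""
    return [((s[0], s[1]), rel, (o[0], o[1])) for s, rel, o in relations]


# Toy data: the first prediction has the wrong object type (Org, not Loc).
gold = [((0, 0, "Peop"), "Live_In", (7, 7, "Loc")),
        ((3, 4, "Org"), "OrgBased_In", (7, 7, "Loc"))]
pred = [((0, 0, "Peop"), "Live_In", (7, 7, "Org")),
        ((3, 4, "Org"), "OrgBased_In", (7, 7, "Loc"))]

strict_scores = prf(pred, gold)                                      # strict
boundary_scores = prf(drop_entity_types(pred), drop_entity_types(gold))
print(strict_scores, boundary_scores)
```

The mistyped object entity costs a true positive under the strict metric but not under the boundaries metric, which is why strict scores are never higher than boundary scores for the same predictions.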
For ease of comparison, we report multiple evaluation metrics consistent with those used in prior work. In our experiments on these datasets, we report a micro-F1 score for the ADE and CoNLL04 datasets, and we also report the macro-F1 score.
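For readers less familiar with the two averaging schemes, the following Python sketch contrasts micro-F1 (computed from pooled counts) with macro-F1 (the mean of per-class F1 scores). The per-class counts are made up for illustration and do not come from the experiments.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class counts as (tp, fp, fn).
counts = {"Live_In": (8, 2, 2), "Work_For": (1, 1, 3)}

# Micro: pool the counts over classes, then compute a single F1.
pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
micro_f1 = f1_from_counts(*pooled)

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
print(round(micro_f1, 4), round(macro_f1, 4))
```

Micro-F1 is dominated by frequent classes, whereas macro-F1 weights every class equally, so the two can diverge noticeably on imbalanced label sets.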

Experiment Settings
For a fair comparison, we used bert-base-cased as the encoder on most datasets and replaced it with scibert-scivocab-uncased for the SciERC dataset. We fixed the length of the input sentence to 100. We employed multi-head biaffine decoding with heads = 4 and embedding size = 300. The Adam optimizer [33] is used with a linear warmup-decay learning rate schedule. We trained the entity model for 100 epochs with a learning rate of $1 \times 10^{-5}$ for all experiments. To mitigate overfitting, we applied dropout with a rate set between 0.2 and 0.4. We used a batch size of 4 for SciERC and 20 for the other datasets. We ran all experiments with five different seeds and report the average score.
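The settings above can be collected into a small configuration, together with a sketch of a linear warmup-decay schedule. The dictionary keys, the `warmup_ratio` of 0.1, and the exact shape of the schedule are assumptions of this sketch; the text only states that a linear warmup-decay schedule was used.

```python
# Hyperparameters as stated in the text; key names are hypothetical.
config = {
    "encoder": "bert-base-cased",      # scibert-scivocab-uncased for SciERC
    "max_sentence_length": 100,
    "biaffine_heads": 4,
    "embedding_size": 300,
    "epochs": 100,
    "learning_rate": 1e-5,
    "dropout": 0.2,                    # the text gives a range of 0.2 to 0.4
    "batch_size": 20,                  # 4 for SciERC
    "num_seeds": 5,
}


def linear_warmup_decay(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


lrs = [linear_warmup_decay(s, 1000, config["learning_rate"])
       for s in (0, 50, 100, 550, 1000)]
print(lrs)
```

With 1000 total steps, the learning rate peaks at step 100 and falls to half of the base rate by step 550.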

Results
Tables 1-3 present the test set evaluation results for the SciERC, ADE, and CoNLL04 datasets.
Regarding entity recognition, our model achieves an absolute F1-score improvement of +0.1% on the SciERC dataset and +0.52% on the ADE dataset, using the ALBERT PLM. In our experiments on the CoNLL04 dataset, our model demonstrates notable improvements and competitive performance across various metrics. Notably, under the macro metric, our model exhibits a precision advantage over the best-reported model [34] and achieves an F1-score improvement of 0.68% compared to the second-best model [35]. Additionally, our approach yields competitive results in terms of micro-F1 values. This demonstrates that entity-type information is useful for the entity model, and that pre-trained transformer encoders are able to capture long-range dependencies from context.
For relation extraction, our approach outperforms the best previous methods by an absolute F1 of +0.7% and +1.2% on the SciERC dataset for the RE and RE+ tasks, respectively. Additionally, we achieve +1.12% and +2.27% F1-score improvements on the ADE dataset when using the BERT and ALBERT PLMs, respectively. On the CoNLL04 dataset, our model achieves the highest precision and recall across both macro and micro metrics. Our model is competitive without using additional data. Notably, under the micro metric, our model surpasses the second-best performing model [20] with an F1-score improvement of 0.4%.
By comparing the results presented in recent papers, our proposed model attains consistently strong performance over all three datasets, from which we can observe that our word-pair tagging method and learned multi-level features are effective for entity and relation extraction.

Ablation Study
Our model consists of four modules: a max-pooling aggregation module (A), a table structure-aware module (B), a context-table fusion module (C) and a sequence-aware module (D). Table 4 reports the ablation results for the ADE and SciERC datasets, focusing on RE+, with the number of layers in the encoder block set to one. The max-pooling aggregation module has a positive effect on the F1-score on both datasets and also improves precision to a certain extent. When removing the table structure-aware and context-table fusion modules, recall is strongly affected on both datasets, decreasing by approximately 2.11-3%. When removing the sequence-aware module, the F1-score decreases by 1.65% and 0.81% on the ADE and SciERC datasets, respectively. These results indicate that, although the BERT encoder itself can capture type-specific dependencies among tokens and labels within its architecture, the joint addition of the table structure-aware, context-table fusion and sequence-aware modules has a significant effect on NER and RE performance.

Effect of Encode Layers
To investigate whether a deeper module can further model dense interactions over label spaces, we stack between 0 and 5 multi-level attention units on the ADE and SciERC datasets and analyze the performance. Figure 4 presents the results and shows how the number of stacked layers affects model performance. Increasing the number of layers from 0 to 2 leads to a significant improvement in the F1 scores for both tasks. However, the F1 score does not improve further when more layers are added. Therefore, our final model uses two layers as the optimum configuration.

Effect of Table Encoding
In this section, we explore the performance impact of several different table encoding strategies on entity and relation extraction. Each model used in these experiments has two layers. We conducted the study on the ADE dataset, and the experimental results are shown in Table 5:

• Concat: the Concat method builds each word-pair representation by concatenating the features of the two corresponding tokens. While this method collects information at the token level, it overlooks the connections between tokens, leading to coarse-grained features. Consequently, using the Concat model leads to a drop in NER and RE+ F1-score of 0.69% and 1.4%, respectively.

• Multi-head CNN: the convolutional approach is a natural method to merge all of the features, and it might be necessary to utilize all local features and predict scores on a global scale. Fusion features composed of correlations between unit features can help the model capture local sentence features and learn connections between features, and thus learn semantic structural information in sentences. We employed a two-layer CNN with a convolutional kernel size of 3 and set the output dimension of the decoder to be the same as the number of heads. CNN-based models are effective in capturing local features of adjacent cells, but struggle to capture long-distance dependencies. As shown in Table 5, using the multi-head CNN has a small negative impact, with performance declining by 0.24% and 0.46% for NER and RE, respectively.

• CLN: we use the Conditional Layer Normalization (CLN) proposed in [46], which generates a high-quality representation of the word-pair grid. The layer normalization is conducted in the feature dimension. The results, as displayed in Table 5, show a decrease of 1.83% in NER and a decrease of 0.96% in RE.
The experiments demonstrate that it is necessary to fuse the representations of the table structure to predict entities and relations. Furthermore, the application of multi-head biaffine can enhance the learning of table structural information.
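To illustrate the simplest of the compared encodings, the following NumPy sketch builds a Concat-style word-pair table, in which cell $(i, j)$ just stacks the two token vectors without any multiplicative interaction. The sizes and random features are placeholders, and this is an illustration of the general idea and not the exact baseline used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 4, 6
H = rng.normal(size=(n, h))          # stand-in for token features

# Broadcast the row token and the column token over the n x n grid.
rows = np.broadcast_to(H[:, None, :], (n, n, h))
cols = np.broadcast_to(H[None, :, :], (n, n, h))

# Cell (i, j) holds [H[i]; H[j]], so the table has shape (n, n, 2h).
pair_concat = np.concatenate([rows, cols], axis=-1)
print(pair_concat.shape)
```

Each cell depends on its two tokens only through independent copies, which is consistent with the observation above that concatenation overlooks the connections between tokens.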

Effect of Type Information
To examine the influence of the interaction between entity-type and relation-type information, we separate the entity-type and relation-type sequences from the input sentence and model the two tables independently, denoted as the separate-type model. Specifically, using the same BERT encoder, we obtain the sequence embeddings of the input sentence combined with the natural language texts of the entity types, and of the input sentence combined with the natural language texts of the relation types. From these, we generate two tables and decode them jointly. As this method takes entity and relation types as separate inputs, the network can only model the correlations of the entity part and the relation part independently, without capturing the interdependencies between the tasks. As shown in Table 5, the separate-type model shows marked performance degradation on both tasks compared to the joint-type model, with F1-scores dropping by 2.87% for NER and 1.61% for RE. The experimental results confirm the interdependencies between the type information of entities and relations, and our model benefits from unifying these elements in the modeling process. The integration of type information improves the performance of all sub-tasks.

Conclusions
In this paper, we present an effective approach for joint entity and relation extraction. Our method is able to simultaneously and efficiently recognize the boundaries and types of entities, as well as the relations among them. By utilizing our novel word-pair tagging method, we overcome the spatial and semantic limitations of previous methods, thereby generating more accurate triplets through the fusion of structural information. Our experiments demonstrate that our method is competitive with the previous state-of-the-art results on three standard benchmarks and improves over the runner-up models in a majority of the evaluated scenarios. We illustrate the feasibility of integrating entity and relation type information within the pre-trained language model, which enriches the final contextual representation of the model. At the same time, this integration alleviates the insufficient interaction between the two tasks. In future work, we plan to further study the effect of fusion representation in our framework and expand the model framework to support a wider array of information extraction tasks.

Figure 1. An illustrative example of the entity and relation extraction task.

Figure 2. The marks in the table represent the entities and relations in the sentence of Figure 1. The model outputs individual scores for each table element, which represent the relationships between word pairs.

Figure 3. An overview of our model architecture, consisting of four main modules: (1) Max-pooling aggregation module: uses a pre-trained language model (PLM) and max-pooling for contextualized representations. (2) Table structure-aware module: derives head and tail representations with MLPs and computes word-pair representations using a multi-head biaffine model. (3) Context-table fusion module: applies multi-head attention to combine the weighted and original sequences. (4) Sequence-aware module: encodes the sequence with FFNN layers and residual structures, followed by a non-linear transformation for relationship prediction.

Figure 4. Performance with respect to the number of layers on the ADE and SciERC test sets.

• Out-of-diagonal markers in the purple part indicate subjects and objects. The green part on the right represents the connection between a subject and a relation type, while the green part below the table represents the connection between an object and a relation type. If a subject and an object share the same relation type, they form a relational triple. The table can therefore express overlapping relations exactly; e.g., the location entity "U.S." participates in two relations, ("Reagan", "U.S.", Live_In) and ("State Department", "U.S.", OrgBased_In).
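The relation part of the tagging scheme, including the overlapping case in which one entity takes part in two relations, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the sentence is a made-up stand-in for the Figure 1 example, marks are binary, each entity is represented by its head word, and relation-type columns and rows are appended after the word positions.

```python
# Hypothetical tokenized sentence and label inventory (not from the paper).
words = ["Reagan", "visited", "the", "State", "Department", "in", "the", "U.S."]
relation_types = ["Live_In", "OrgBased_In"]
n = len(words)

# Gold relations as (subject head index, object head index, relation type).
gold_relations = [(0, 7, "Live_In"), (3, 7, "OrgBased_In")]


def build_relation_table(relations):
    """Fill a binary table of size (n + #relations) x (n + #relations)."""
    size = n + len(relation_types)
    table = [[0] * size for _ in range(size)]
    for subj, obj, rel in relations:
        r = n + relation_types.index(rel)
        table[subj][obj] = 1   # off-diagonal cell: subject / object pair
        table[subj][r] = 1     # right block: subject -> relation type
        table[r][obj] = 1      # bottom block: relation type -> object
    return table


def decode_triples(table):
    """Recover (subject, relation, object) where both links share a type."""
    triples = []
    for subj in range(n):
        for obj in range(n):
            if subj == obj or not table[subj][obj]:
                continue
            for k, rel in enumerate(relation_types):
                if table[subj][n + k] and table[n + k][obj]:
                    triples.append((words[subj], rel, words[obj]))
    return triples


relation_table = build_relation_table(gold_relations)
triples = decode_triples(relation_table)
print(triples)
```

Both triples share the object column of "U.S.", yet they are decoded separately because each subject row links to a different relation type.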

Table 4. Ablation study for the ADE and SciERC datasets, focusing on RE+. Each row after the first indicates the removal of a particular component.

Table 5. Study on the ADE dataset. The separate-type method employs two tables within the same model. Bold marks the highest score.