Semantic Enhanced Distantly Supervised Relation Extraction via Graph Attention Network

Abstract: Distantly Supervised relation extraction methods can automatically extract the relations between entity pairs, which is essential for the construction of a knowledge graph. However, the automatically constructed datasets contain large amounts of low-quality sentences and noisy words, and current Distantly Supervised methods ignore this noisy data, resulting in unacceptable accuracy. To mitigate this problem, we present a novel Distantly Supervised approach, SEGRE (Semantic Enhanced Graph attention networks Relation Extraction), for improved relation extraction. Our model first uses word position and entity type information to provide abundant local features and background knowledge. It then builds dependency trees to remove noisy words that are irrelevant to relations and employs Graph Attention Networks (GATs) to encode syntactic information, which also captures the important semantic features of relational words in each instance. Furthermore, to make our model more robust against noisy words, an intra-bag attention module is used to weight the bag representation and mitigate noise in the bag. Through extensive experiments on the Riedel New York Times (NYT) and Google IISc Distantly Supervised (GIDS) datasets, we demonstrate SEGRE's effectiveness.


Introduction
Relation extraction aims to extract relations between pairs of marked entities in text and is one of the fundamental tasks in natural language processing (NLP) [1][2][3]. One primary problem of traditional supervised relation extraction (RE) methods is the requirement of large-scale manual labeling, which is very time-consuming and labor-intensive. Thus, Mintz et al. [4] proposed Distantly Supervised RE, which constructs the dataset automatically by aligning a known knowledge base (KB) with sentences crawled from web pages of the New York Times (NYT). This approach assumes that, if there is a relation between two entities in the KB, then all sentences containing these two entities express the same relation. Under this assumption, the problem of incorrect labeling often occurs. Riedel et al. [5] proposed a multi-instance learning method to relax this assumption. Beyond the problem of wrong labeling, Distantly Supervised methods still suffer from low-quality sentences, which are automatically generated by crawling web pages [6]. To handle the problem of low-quality sentences, we have to address two major challenges: (1) increasing valuable auxiliary information; (2) reducing the noise of irrelevant words in the sentence. Noisy words are words that carry no semantics or that have nothing to do with the information conveyed by the text. Non-noisy words are words that carry semantics and are part of the semantics of the text; once the noisy words are removed from the text, the remaining words are non-noisy.
Information 2020, 11, 528
To use valuable auxiliary information, an ideal model should make full use of local features and external information to extract precise semantic features from low-quality sentences containing noisy words. On the one hand, encoding the position of each word in the sentence yields position features of the corpus. On the other hand, entity type information provides abundant background knowledge, which can be used to enhance semantics and the effectiveness of RE. For instance, in the sentence "[SeamlessWeb]_e1 is a symbol of the heavy time commitment demanded by many of [NewYork]_e2's professional service firms", the entity pair SeamlessWeb and NewYork has the relation /business/company/place_founded, which is difficult to extract without the information that SeamlessWeb is a company and NewYork is a location. Therefore, entity features learned from entity types serve as prior knowledge to initialize the RE model. In this paper, we use entity type information to obtain more entity semantic features.
As for the other challenge, syntactic structure information is used to capture semantic relationships and reduce the noise in the sentence. Figure 1 illustrates two methods for acquiring sentence semantics. In (a), the shortest dependency path between the entity pair is displayed in the dependency tree and highlighted in bold (edges and marks). The dependency tree expresses the dependency relationships between words in a sentence. Specifically, it analyzes and recognizes grammatical components such as "subject-predicate-object" and "fixed adverbial complement" in the sentence. The dotted line is not on the shortest path, but it also has a semantic impact on the critical path. Through the syntactic dependency graph, we can more intuitively discover the syntactic relationship between the two entities and reduce the interference of irrelevant words, which helps in understanding sentences and achieving more accurate relation extraction. In (b), the sequence structure reads the words in a sentence sequentially from left to right. The sequence method uses adjacent words to obtain sentence semantics and therefore cannot capture the direct connection between the keywords in the sentence. Weighing the advantages and disadvantages of the two methods, this paper chooses the dependency tree method, making full use of the syntactic structure to analyze the semantic connection between the entity pair and to judge the relation between the entities more reliably.
Figure 1. The example uses a dependency tree and a sequence structure to obtain sentence semantics and to assist in extracting relations between entities (indicated in red). In (a), the dependency tree clearly expresses the dependency relationships between words in the sentence; each node represents a word. In (b), the words in the sentence are read sequentially, usually from left to right, as in LSTM and GRU; there are also bidirectional sequential reading forms, such as BiLSTM and BiGRU.
In this paper, we propose a novel semantic enhanced Distantly Supervised relation extraction method, SEGRE, which utilizes additional semantic information and dependency syntax to strengthen effective semantics against noisy words and reduce inner-sentence noise. To strengthen effective semantics, SEGRE adopts word position and entity type information to provide abundant local features and background knowledge. Furthermore, it uses encoded syntactic information obtained from Graph Attention Networks along with embedded additional semantic information to improve neural relation extraction. Our contributions can be summarized as follows:

• We propose SEGRE, a novel semantic enhanced method for improving Distantly Supervised RE, which utilizes additional semantic features and knowledge learned from word position and entity type information to strengthen its robustness against low-quality corpora.

• To handle the problem of low-quality sentences, SEGRE uses Graph Attention Networks to model syntactic information and enhance the semantic features of important words, an approach that has been shown to perform competitively.

• Experimental results show that SEGRE achieves significant improvements on benchmark datasets: it improves the Precision/Recall (PR) curve area from 0.39 to 0.41 and increases P@100 by 4.7% over the state-of-the-art work.
The rest of this paper is organized as follows. Section 2 summarizes previous studies on relation extraction. Section 3 details the proposed model SEGRE and describes its various modules. Section 4 presents the experimental results. Section 5 concludes the paper.

Related Work
Relation extraction is a key component of constructing a relational knowledge graph and can be applied to structured search, sentiment analysis, question answering, and summarization. Distantly Supervised relation extraction was proposed by Mintz et al. [4] to address the lack of labeled training data. However, a sentence that mentions two entities does not necessarily express the relation recorded for them in the knowledge base, so distant supervision inevitably causes an incorrect labeling problem. Thus, multi-instance learning methods have been adopted to address this issue [5,7,8].
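The labeling assumption and the bag structure used by multi-instance learning can be illustrated with a minimal sketch. The function name `build_bags`, the toy knowledge base, and the toy sentences are hypothetical and are not taken from the datasets used in this paper; entity matching is simplified to exact token matching.

```python
from collections import defaultdict


def build_bags(kb, sentences):
    """Group sentences into bags under the distant supervision assumption.

    kb: {(head, tail): relation}; sentences: list of token lists.
    Every sentence that mentions both entities of a KB pair receives that
    pair's relation label, whether or not it actually expresses the relation.
    """
    bags = defaultdict(list)
    for tokens in sentences:
        present = set(tokens)
        for (head, tail), relation in kb.items():
            if head in present and tail in present:
                bags[(head, tail, relation)].append(tokens)
    return dict(bags)


# Toy data (hypothetical).
kb = {("Acme", "Paris"): "place_founded"}
sentences = [
    ["Acme", "was", "founded", "in", "Paris"],
    ["Acme", "opened", "a", "store", "in", "Paris"],  # noisy: labeled anyway
    ["Paris", "is", "a", "city"],                     # only one entity: skipped
]
bags = build_bags(kb, sentences)
```

The second sentence ends up in the bag even though it does not express the relation, which is exactly the incorrect labeling problem that bag-level learning tries to tolerate.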
Deep neural network models are often used to perform relation extraction tasks. Here we introduce several basic types of deep neural networks. The RNN [9] is widely used for processing time series data but struggles with long-term dependencies. The LSTM [10] model was introduced to alleviate this problem. The BiLSTM combines a forward LSTM and a backward LSTM, which can encode front-to-back and back-to-front information and capture bidirectional semantics. LSTM also has many variants, among which the most used is the GRU [11], which combines the forget gate and the input gate into a single update gate. The BiGRU combines a forward GRU and a backward GRU.
Large-scale datasets constructed automatically by crawling web pages contain a large number of low-quality sentences [12]. Additional semantic information provides abundant features and knowledge to enhance semantics against a low-quality corpus. Zeng et al. [13] adopted piecewise convolutional neural networks (PCNNs), which use the position information of words in a sentence to model the sentence representation. Yaghoobzadeh et al. [14] also tried to mitigate the noise in distant supervision by combining entity type and relation extraction models. Vashishth et al. [15] used entity type and relation alias information to impose soft constraints when predicting relations. However, the above methods ignore inner-sentence noise.
As neural networks have become widely used, an increasing number of attention-based studies have been proposed. Lin et al. [16] proposed selective attention over instances. Ji et al. [12] assigned more precise attention weights using entity descriptions. Nagarajan et al. [17] used attention to learn from multiple valid sentences. We also use attention mechanisms [18] to learn sentence and bag representations.
Moreover, features based on dependency trees are beneficial for relation extraction [4]. Xu et al. [19] adapted the neural model to encode the shortest dependency path. Zhang et al. [20] adopted a path-centric pruning strategy. He et al. [21] established Subtree Parsing (STP) to delete noisy words that are not related to relations. Graph convolutional networks (GCNs) [22] incorporate dependency-based structural information into neural models. Song et al. [23] used a GCN to directly encode the complete dependency graph. Zhang et al. [3] proposed Attention Guided Graph Convolutional Networks (AGGCNs), a soft pruning method that automatically selects useful substructures. More recently, Velickovic et al. [24] proposed graph attention networks (GATs), which use the attention mechanism to weight neighborhood states. Combining the reduction of inner-sentence noise with the use of additional semantic information can further improve the performance of relation extraction.
Recently, scholars have adopted feature extraction and text analysis methods in specific application scenarios to improve performance. Ali et al. [25] proposed a big data analytics engine based on data mining techniques, ontologies, and BiLSTM to improve healthcare monitoring accuracy. Ali et al. [26] used ensemble deep learning and feature fusion approaches to predict heart disease. Kaplan et al. [27] applied feature extraction technology to diagnose bearing vibration signals. Ayvaz et al. [28] studied how to reduce deficiencies in strategic cost management and the prediction of economic crises with deep learning methods. Distantly Supervised relation extraction is also beneficial for the construction of a knowledge graph in a specific application domain.

SEGRE Model (Semantic Enhanced GATs Relation Extraction)
An overview of the proposed SEGRE for Distantly Supervised relation extraction is illustrated in Figure 2. SEGRE consists of three modules that learn the representation of a given bag, which is then fed into a softmax classifier. Firstly, word, position, and entity type embeddings are concatenated to encode the local context of each word in the input sentence and obtain the multi-level word representation. Secondly, we encode the sentence with a Bi-GRU, construct a syntactic dependency tree over its words, and input it into the graph attention network to obtain the syntactic sentence representation. Furthermore, the sentences of a bag that share the same relation label in the training set are aggregated using the intra-bag attention module to weight the bag representation. Finally, the bag representation is fed to a softmax classifier to obtain the relation of the entity pair in the sentence. Each module is described in detail in the following subsections.
Figure 2. SEGRE first encodes each word in the sentence by concatenating word, position, and entity type information. Then the sentence representation is obtained by constructing a graph attention network over a syntactic dependency tree. Next, the bag representation is calculated by weighting sentence embeddings using intra-bag attention. Finally, the bag representation is fed to a softmax classifier to obtain the relation of the entity pair.

Multi-Level Word Representation
The multi-level word representation concatenates word information, position information, and entity type information. The word information is encoded by BERT [29] to obtain the semantics of the current word in the sentence. The position information records the position of the current word in the sentence, inspired by Zeng et al. [13]. A word in different positions carries different semantics and importance. Entity type information refers to the type to which the current word belongs. For example, the entity type of SeamlessWeb is company, so the entity type tells the model that SeamlessWeb is the name of a company. Therefore, more meaningful word semantics can be obtained by using the multi-level word representation. The specific implementation is as follows.
The inputs of the network are words, position tokens, and entity types, which are transformed into distributed representations before being fed into the neural model. We extract meaningful word representations from semantics at different levels, i.e., the word embedding $e_w(w)$, the position embedding $e_p(w)$, and the entity type embedding $e_t(w)$.
For each word $w$ in the sentence $x$, we use a $k$-dimensional BERT embedding. To integrate the relative position of tokens with respect to the target entities, we use $p$-dimensional position embeddings. Specifically, we use Pos1 and Pos2 to refer to the relative distances between the current word and the head and tail entities, respectively. For instance, in Figure 1 the relative distances of symbol from SeamlessWeb and NewYork are 3 and −9, respectively. Each relative position is then transformed into a $p$-dimensional vector.
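The Pos1 and Pos2 features can be sketched as follows. This is a minimal illustration, assuming that the offset is the token index minus the entity index and that offsets are clipped to a fixed range before the embedding lookup; the clipping range `max_dist`, the function names, and the toy sentence are hypothetical and do not come from the paper.

```python
def relative_positions(tokens, head_idx, tail_idx, max_dist=60):
    """Return (Pos1, Pos2): signed offsets of every token to the head and tail entity."""
    def clip(d):
        # Keep offsets inside [-max_dist, max_dist] so the lookup table stays small.
        return max(-max_dist, min(max_dist, d))

    pos1 = [clip(i - head_idx) for i in range(len(tokens))]
    pos2 = [clip(i - tail_idx) for i in range(len(tokens))]
    return pos1, pos2


def to_embedding_index(offset, max_dist=60):
    """Shift a signed offset to a non-negative row index of a position embedding table."""
    return offset + max_dist


# Toy sentence (hypothetical): head entity at index 2, tail entity at index 4.
tokens = ["Alice", "founded", "Acme", "in", "Paris"]
pos1, pos2 = relative_positions(tokens, head_idx=2, tail_idx=4)
```

Each offset would then index a row of a trainable table with $2 \cdot$ `max_dist` $+ 1$ rows of size $p$.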
Entity types can enforce constraints on the prediction of the relation between subject and object. For instance, in Figure 1 the relation /business/company/place_founded can only exist between a company and a location. The entity type embedding follows FIGER [30] and is $k_t$-dimensional. Note that if a word in the sentence is not an entity, its entity type embedding is filled with zeros.
The final word representation is obtained by concatenating these three embeddings:

$$v = e_w(w) \oplus e_p(w) \oplus e_t(w)$$

where $\oplus$ denotes the concatenation operation. Thus, we obtain a sequence of word vectors $\{v_t\}$.
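The concatenation can be sketched in plain Python. This is only an illustration under several assumptions: random vectors stand in for the BERT embeddings, the toy sizes `K`, `P`, `KT` replace the real $k$, $p$, $k_t$, the Pos1 and Pos2 embeddings are both appended (the paper does not state how the two are combined), and the two entity types are toy stand-ins for FIGER types.

```python
import random

K, P, KT = 8, 4, 3      # toy sizes standing in for k, p, k_t
MAX_DIST = 10           # hypothetical clipping range for relative positions
ENTITY_TYPES = ["organization", "location"]  # toy stand-ins for FIGER types

rng = random.Random(0)


def rand_vec(dim):
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


pos_table = [rand_vec(P) for _ in range(2 * MAX_DIST + 1)]
type_table = {t: rand_vec(KT) for t in ENTITY_TYPES}


def word_vector(word_emb, pos1, pos2, entity_type=None):
    """Concatenate word, position, and entity type embeddings into one vector."""
    def pos_emb(offset):
        clipped = max(-MAX_DIST, min(MAX_DIST, offset))
        return pos_table[clipped + MAX_DIST]

    e_p = pos_emb(pos1) + pos_emb(pos2)          # assumption: Pos1 and Pos2 both kept
    # Non-entity words get an all-zero type embedding, as described in the text.
    e_t = type_table[entity_type] if entity_type else [0.0] * KT
    return word_emb + e_p + e_t                  # list concatenation plays the role of the ⊕ operator


tokens = ["Acme", "opened", "in", "Paris"]
types = ["organization", None, None, "location"]
head, tail = 0, 3
sentence = [
    word_vector(rand_vec(K), i - head, i - tail, types[i])
    for i in range(len(tokens))
]
```

With these toy sizes every word vector has $K + 2P + KT = 19$ components.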

Bidirectional Gated Recurrent Unit
Based on the word vectors $\{v_t\}$, we adopt a bidirectional Gated Recurrent Unit (GRU) [11] layer to learn the semantic information of the sentence, which uses hidden state vectors $\{h_t\}$ to remember important signals. At each step, a new hidden state is computed from the previous hidden state using the same function.
In this function, $z_i$ and $r_i$ are the update gate and reset gate, $\sigma(\cdot)$ is the sigmoid function, and $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$ are parameters. $e(w_k|s)$ is the representation of $w_k$ given $s$, which comes from the hidden vectors $h_{k,t}$. Furthermore, a Bi-GRU, which runs a GRU in both the forward and reverse directions, can access long-distance semantics from both the past and the future.
The Bi-GRU state at time $t$ is the weighted combination $\alpha_t \overrightarrow{h}_t + \beta_t \overleftarrow{h}_t + b_t$, where $\alpha_t$ and $\beta_t$ are the weights of the forward hidden state $\overrightarrow{h}_t$ and the reverse hidden state $\overleftarrow{h}_t$ at time $t$, and $b_t$ is the hidden state bias at time $t$.
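For reference, the gates named above can be written in one common formulation of the GRU [11]. This is an assumed standard form using the parameter names from the text, with the time index $t$, bias terms omitted, and $\odot$ denoting element-wise multiplication; some formulations swap the roles of $z_t$ and $1 - z_t$, so the exact form used in SEGRE may differ.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z v_t + U_z h_{t-1}\right) \\
r_t &= \sigma\left(W_r v_t + U_r h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W_h v_t + U_h \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The update gate $z_t$ controls how much of the candidate state $\tilde{h}_t$ replaces the previous state, and the reset gate $r_t$ controls how much of the previous state enters the candidate.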

Graph Attention Network
Although Bi-GRU can capture local context, it fails to capture long-range dependencies that can be captured through dependency edges. We employ Graph Attention Networks for encoding features from syntactic dependency trees to improve relation extraction. The syntactic dependency tree is generated by Stanford CoreNLP [31].
We use the constructed syntactic dependency tree to form a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the nodes $\mathcal{V}$ are the words in the sentence and the edges $\mathcal{E}$ are the syntactic relations in the dependency tree. An edge from node $u$ to node $v$ with label $l_{uv}$ is represented as $(u, v, l_{uv})$. If a relation label exists between two words in the sentence, the two words are directly connected in the dependency graph. Because the dependency tree has 55 different relation labels, using all of them would make the constructed dependency graph too complicated. We therefore follow the processing method of Nguyen and Grishman [32] and use only three kinds of edge labels to represent the relation: forward (→), backward (←), and self-loop (⊥).
The input of the GATs is $h^* = \{h^*_1, h^*_2, \ldots, h^*_m\}$, where $m$ is the number of words in the sentence. The attention coefficient $e_{ij}$ represents the importance of the features of node $j$ to node $i$. We inject the structure of the dependency graph by calculating $e_{ij}$ only where node $j$ is adjacent to node $i$ in the graph. To make the coefficients easy to compare across different nodes, we normalize them with the softmax function over all choices of $j$, which yields the attention weights $\alpha_{ij}$. For each word $w_i$, the GATs embedding $h^{gat}_i$ is computed from the attention-weighted features of its neighbors, where the single-head attention mechanism that produces $\alpha_{ij}$ is a single-layer feedforward neural network that applies the LeakyReLU nonlinearity [24]. LeakyReLU is a variant of the ReLU activation function, the most commonly used activation function in neural networks. It has a small slope for negative inputs; because its derivative is always non-zero, it reduces the occurrence of silent neurons, allows gradient-based learning, and solves the problem that neurons stop learning once the ReLU input enters the negative interval.
In the multi-head case, $\alpha^k_{ij}$ are the normalized attention coefficients computed by the $k$-th attention mechanism, and $W^k$ is the weight matrix of the corresponding input linear transformation. The syntactic graph encoding from the GATs and the Bi-GRU output vector are concatenated to obtain the final sentence representation.
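A single attention head over the dependency graph can be sketched as follows. The sketch follows the general GAT formulation of Velickovic et al. [24], with $e_{ij} = \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])$ computed only for adjacent nodes and normalized by a softmax over the neighbors; it is an assumption that SEGRE uses exactly this form. The helper names, the way edge directions are assigned, the LeakyReLU slope of 0.2, and the toy tree are hypothetical. Edge labels are recorded but share one set of parameters here, and the output nonlinearity is omitted for brevity.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def build_edges(heads):
    """heads[i] is the index of token i's syntactic head, or -1 for the root.

    Returns {i: [(j, label), ...]}: the neighbours j that node i attends to,
    labeled as self-loop, forward (head to dependent), or backward.
    """
    nbrs = {i: [(i, "self")] for i in range(len(heads))}
    for dep, head in enumerate(heads):
        if head >= 0:
            nbrs[dep].append((head, "forward"))
            nbrs[head].append((dep, "backward"))
    return nbrs


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gat_layer(h, nbrs, W, a):
    """One attention head: each node aggregates W h_j from its graph neighbours."""
    Wh = [matvec(W, x) for x in h]
    out, alphas = [], []
    for i in range(len(h)):
        js = [j for j, _ in nbrs[i]]
        # Attention logits only for nodes adjacent to i (masked attention).
        e = [leaky_relu(sum(ak * v for ak, v in zip(a, Wh[i] + Wh[j]))) for j in js]
        m = max(e)
        exp_e = [math.exp(x - m) for x in e]
        total = sum(exp_e)
        alpha = [x / total for x in exp_e]       # softmax over the neighbours of i
        out.append([sum(al * Wh[j][d] for al, j in zip(alpha, js))
                    for d in range(len(W))])
        alphas.append(alpha)
    return out, alphas


# Toy tree (hypothetical): token 1 is the root, tokens 0 and 2 depend on it.
heads = [1, -1, 1]
rng = random.Random(0)
d_in, d_out = 4, 3
h = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in heads]
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [rng.uniform(-1, 1) for _ in range(2 * d_out)]
nbrs = build_edges(heads)
h_gat, alphas = gat_layer(h, nbrs, W, a)
```

A multi-head layer would run several such heads with separate `W` and `a` and concatenate their outputs per node.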

Bag Aggregation
In this section, the first step of bag aggregation is to calculate the weight of different sentences in the bag through the intra-bag attention mechanism, and the second step is to multiply the sentence embedding and its weight and then accumulate to get the bag representation. After bag aggregation, the bag representation is sent to the softmax classifier to obtain the classification of the relationship between entities.
To utilize all valid sentences, we apply the attention mechanism used by Jat et al. [33] over the sentences to obtain a representation for the entire bag. For each sentence $s_i$ in the bag, an attention weight $\alpha_i$ is calculated. The bag representation $B$ is then calculated by weighting the embedding of each sentence $s_i$ by $\alpha_i$ using intra-bag attention, which can deal with noise at the sentence level:

$$B = \sum_{i} \alpha_i s_i \quad (12)$$

Finally, the bag representation is fed to the softmax classifier to obtain the probability distribution over the relations.
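The two aggregation steps and the final classifier can be sketched as follows. The scoring function is an assumption: each sentence embedding is scored by a dot product with a query vector, which is one common choice and not necessarily the one used by Jat et al. [33]. The names `bag_representation`, `classify`, and `query`, as well as all toy numbers, are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bag_representation(sentence_embs, query):
    """Weight each sentence embedding by its attention weight and sum them."""
    # Step 1: attention weights (assumed dot-product scoring against a query vector).
    scores = [sum(s_d * q_d for s_d, q_d in zip(s, query)) for s in sentence_embs]
    alphas = softmax(scores)
    # Step 2: weighted sum of sentence embeddings gives the bag representation.
    dim = len(sentence_embs[0])
    bag = [sum(a * s[d] for a, s in zip(alphas, sentence_embs)) for d in range(dim)]
    return bag, alphas


def classify(bag, W, b):
    """Linear layer followed by softmax over the relation classes."""
    logits = [sum(w * x for w, x in zip(row, bag)) + b_r for row, b_r in zip(W, b)]
    return softmax(logits)


# Toy bag with three two-dimensional sentence embeddings.
sentence_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
bag, alphas = bag_representation(sentence_embs, query)
probs = classify(bag, W=[[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], b=[0.0, 0.0, 0.0])
```

Sentences that score low against the query receive small weights, so a mislabeled sentence contributes little to the bag vector.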

Experiments
In order to demonstrate the performance and adaptability of SEGRE, we compare several methods on two benchmark datasets, Riedel New York Times (NYT) and Google IISc Distantly Supervised (GIDS) datasets, and give implementation details and experimental results analysis.

Compared Methods
We chose seven methods to compare with the proposed SEGRE. Mintz [4] was the first to propose a multi-class logistic regression model for distant supervision. MultiR [7] uses a probabilistic graphical model for multi-instance learning. MIMLRE [8] jointly models multiple instances and multiple labels. PCNN [13] adopts a relation extraction model that combines piecewise max-pooling with a CNN. PCNN + ATT [16] uses PCNN and attention mechanisms to obtain sentence representations. BGWA [33] adopts a word-level and sentence-level attention strategy for relation extraction. RESIDE [15] applies entity type and relation alias information to impose soft constraints.
In addition, we change parts of the structure of SEGRE and compare the performance of four variations of the proposed model. Specifically, SEGRE_GAT* uses undirected edges instead of directed edges to construct the GAT; SEGRE_GCN uses a GCN instead of a GAT to embed sentence dependency information; SEGRE_type− removes the entity type information from the multi-level word representation; and SEGRE_att− implements the bag representation without an attention mechanism.

Data Sets
We evaluated SEGRE on the Riedel NYT [5] and GIDS datasets. The Riedel NYT dataset, widely used for RE, was constructed by aligning Freebase relations with the New York Times corpus, using sentences from 2005-2006 to create the training set and sentences from 2007 for the test set. The entities were annotated with the Stanford NER tool [34] and linked to Freebase.
The GIDS dataset was created by Jat et al. [33] and extends the Google relation extraction corpus with additional instances of each entity pair. The GIDS dataset guarantees the "at-least-one" assumption of multi-instance learning, which makes automatic evaluation more reliable and thereby eliminates the need for manual verification. The corpus statistics of the two datasets are shown in Table 1. There are 53 types of relations between entity pairs in the Riedel NYT dataset and 5 in the GIDS dataset. The training set (TRAIN), validation set (VALID), and testing set (TEST) follow the official splits.

Implementation Details
When a compared method was evaluated in an experimental environment identical to ours, we directly report its published results; otherwise, we reproduce the method in the setting of this paper. SEGRE is implemented with the TensorFlow libraries and Python 3. We used cross-validation to tune our model and grid search for hyperparameter optimization, and chose the best-performing setting as the final setting. We applied the Adam optimizer with learning rate decay. The GRU size is $m = 230$, the position embedding size is $p = 16$, and the entity type embedding size is $k_t = 50$. To avoid over-parameterization, we adopted the 38 coarse-grained types of FIGER's first layer instead of all 112 fine-grained entity types.

Experimental Results
To evaluate the effectiveness of the proposed SEGRE, we compared it with the methods described in Section 4.1. We use the Precision-Recall curve and the top-N precision (P@N) metric to evaluate performance in our experiments. Note that only the neural baselines are compared on the GIDS dataset.
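The P@N metric can be sketched in a few lines: the predictions are ranked by confidence, and the precision is computed over the N highest-scoring ones. The function name `precision_at_n` and the toy predictions are hypothetical and are not results from this paper.

```python
def precision_at_n(predictions, n):
    """predictions: list of (confidence, is_correct) pairs, one per predicted relation.

    Returns the fraction of correct predictions among the n most confident ones.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[:n]
    if not ranked:
        return 0.0
    return sum(1 for _, ok in ranked if ok) / len(ranked)


# Toy predictions (hypothetical).
preds = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
p_at_2 = precision_at_n(preds, 2)
p_at_4 = precision_at_n(preds, 4)
```

In this toy example the two most confident predictions are both correct, while one of the top four is wrong.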
The Precision-Recall curves on Riedel NYT and GIDS are shown in Figure 3. We found that SEGRE achieved higher precision over the entire recall range on both datasets. On the Riedel NYT dataset, the non-neural methods are not very effective, because they rely on existing NLP tools for feature extraction, which may produce errors. The PR curve areas of PCNN, PCNN + ATT, BGWA, and RESIDE are about 0.332, 0.386, 0.394, and 0.409, respectively, while SEGRE increases it to 0.417. Meanwhile, on the GIDS dataset, the PR curve areas of PCNN, PCNN + ATT, BGWA, and RESIDE are about 0.694, 0.743, 0.751, and 0.787, respectively, while SEGRE increases it to 0.791. The results indicate that SEGRE can use word position and entity type information to add semantic information and use syntactic dependency trees to eliminate unrelated noisy words in sentences, and thus obtains more accurate sentence representations for relation extraction.
Following previous works, we adopt P@N as a quantitative indicator to compare our model with the baselines for different numbers of instances under each relational tuple. P@N is the precision of the relation classification results with the top N highest probabilities in the test set. Table 2 shows the P@N values of relation extraction as the number of sentences in the bag changes. Here, One, Two, and All denote the number of sentences randomly selected from each bag, forming three test settings. The table shows the P@100, P@200, and P@300 values and their means for the SEGRE model and the compared methods on the test sets. Our proposed method achieves higher P@N values than previous work, and the P@100, P@200, and P@300 values of SEGRE are improved by 3.6%, 2.9%, and 1.7% over the state-of-the-art model, respectively.
Figure 4 shows the performance of different ablated versions of our proposed SEGRE on the Riedel NYT and GIDS datasets.
We observe that changing different components of SEGRE significantly affects the performance of the model. The PR curve area of SEGRE is 0.007 higher than that of SEGRE_GAT* on the NYT dataset and 0.023 higher on the GIDS dataset. This is because the syntactic dependency tree constructed in this paper is directional: the direction information encodes the relationships between words in the text, so directed edges in the GAT reflect the structure of the syntactic dependency tree better than undirected edges. The PR curve area of SEGRE is 0.015 higher than that of SEGRE_GCN on the NYT dataset and 0.046 higher on the GIDS dataset. This result confirms that the GAT effectively encodes grammatical information and removes irrelevant word noise in sentences. In addition, the PR curve area of SEGRE is 0.053 higher than that of SEGRE_type− on the NYT dataset and 0.128 higher on the GIDS dataset, which indicates that entity type information supplements text features and enhances relation extraction performance. Further, the PR curve area of SEGRE is 0.031 higher than that of SEGRE_att− on the NYT dataset and 0.057 higher on the GIDS dataset. This shows that the intra-bag attention mechanism helps reduce the noise between sentences. In conclusion, the entity type information has the greatest impact on the performance of the model because it provides additional semantics and is very helpful for the task.

Conclusions
In this paper, we propose SEGRE, a novel semantic enhanced approach for Distantly Supervised relation extraction. It aims to deal with low-quality datasets by adding valuable semantic information and reducing the noise of irrelevant words in the sentence. Compared with other methods, the main innovations of the proposed method are as follows. In the word representation stage, SEGRE uses a multi-level word representation, including word information, position information, and entity type information, which enriches the semantics contained in a word embedding. In the sentence representation stage, SEGRE uses a graph attention network, which extracts important information more effectively than a graph convolutional network and reduces noise in a sentence. In the bag representation stage, SEGRE adds an intra-bag attention mechanism to calculate the representation of the bag, reducing the noise in the bag. SEGRE thus adds valuable semantic information throughout all stages of the model. Experimental results show that SEGRE achieves state-of-the-art results on two benchmark datasets.
Using graph neural networks to extract sentence semantics is our preliminary study on relation extraction. We only considered semantic analysis at the sentence level, and future work should focus on the document level. The information contained in a document is richer than that in a single sentence, but it also brings more noise. Future work should therefore reduce document-level noise and improve the effective use of document-level information.