Joint Self-Attention Based Neural Networks for Semantic Relation Extraction

Abstract: Relation extraction is an important task in the NLP community. However, some models often fail to capture long-distance semantic dependencies, and the interaction between the semantics of the two entities is ignored. In this paper, we propose a novel neural network model for semantic relation classification, called joint self-attention Bi-LSTM (SA-Bi-LSTM), which models the internal structure of a sentence to obtain the importance of each word without relying on additional information and captures long-distance semantic dependencies. We conduct experiments on the SemEval-2010 Task 8 dataset. The results demonstrate that the proposed method is effective for relation classification, obtaining state-of-the-art classification accuracy with only minimal feature engineering.


Introduction
Relation extraction is a fundamental task in information extraction, with important applications in question answering, information retrieval, big data analysis, etc. Traditional approaches to relation extraction treat entity recognition as a preceding step in a pipeline and then predict relations between the given entities. In recent years, there has been a surge of interest in the relation extraction task. Traditional methods are mainly based on supervised relation extraction [Suchanek, Ifrim and Weikum (2006); Qian, Zhou, Kong et al. (2008)] and usually suffer from a lack of sufficient labelled relation-specific training data; annotating a large amount of data is time-consuming and laborious. Meanwhile, hand-crafted feature extraction relies on existing natural language processing tools, whose errors propagate and hinder the performance of the resulting systems [Bach and Badaskar (2007)].
Motivated by the issues mentioned above, we encode the text segment of each entity into its feature representation with a Bi-LSTM [Kiperwasser and Goldberg (2016)]. Then, we use a self-attention mechanism to obtain the semantic representation of the text segment related to each entity; the mechanism attends to the sentence itself to extract relevant information and capture long-distance semantic dependencies. To capture the interaction between the semantics of the two entities in a sentence, we propose joint self-attention Bi-LSTM (SA-Bi-LSTM), which models the internal structure of the sentence to obtain the importance of each word without relying on additional information. Empirical results on the SemEval-2010 Task 8 dataset show that the proposed approach, with only minimal feature engineering, obtains a state-of-the-art F1 value of about 85.3%. The main contributions of this paper can be summarized as follows:
(1) To preserve the contextual information about each entity, we encode the text segment of each entity into its semantic representation through a Bi-LSTM.
(2) To capture long-distance semantic dependencies and the interaction between the semantics of the two entities, we propose joint self-attention Bi-LSTM (SA-Bi-LSTM), which models the internal structure of the sentence to obtain the importance of each word without relying on additional information.
(3) We conduct experiments on the SemEval-2010 Task 8 dataset. The results demonstrate that the proposed SA-Bi-LSTM is effective for relation classification, reaching an F1 value of 85.3%.

Methodology
In this section, we first introduce the basic self-attention model and then describe the joint self-attention semantic relation extraction model in detail.

Self-attention
Self-attention, also called intra-attention, is a special attention mechanism. Our model mainly relies on multi-head attention, which builds on scaled dot-product attention. Scaled dot-product attention adds a scaling factor to dot-product attention. Given the query and key matrices with dimension d_k and the value matrix with dimension d_v as input, scaled dot-product attention is computed as follows:
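In the standard formulation, which we assume here, with Q, K and V denoting the query, key and value matrices (these symbols are our notation), scaled dot-product attention reads:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Dividing by \sqrt{d_k} keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.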

Figure 1: The structure of the multi-head attention
The structure of the multi-head attention is shown in Fig. 1. The specific calculation formula is as follows:
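In the standard formulation, which we assume here, each head applies scaled dot-product attention to learned linear projections of the inputs, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), and the heads are concatenated and projected once more, MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O. The following pure-Python sketch illustrates this computation for the self-attention case (Q = K = V = X). The function names, the random weights and the sizes are illustrative assumptions and not necessarily the configuration used in the experiments.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, w_o):
    """heads holds one (W_q, W_k, W_v) triple per head; w_o is the output projection."""
    per_head = []
    for w_q, w_k, w_v in heads:
        out, _ = scaled_dot_product_attention(
            matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        )
        per_head.append(out)
    # Concatenate the head outputs token by token, then project with W^O.
    concat = [[value for head in per_head for value in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or input representations."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_model, n_heads = 5, 8, 2  # illustrative sizes only
d_head = d_model // n_heads
x = random_matrix(n_tokens, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_o = random_matrix(n_heads * d_head, d_model, rng)

output = multi_head_self_attention(x, heads, w_o)
print(len(output), len(output[0]))  # one d_model-sized vector per token
```

Because every token attends to every other token in a single step, the path between two distant words has constant length, which is the property exploited for long-distance dependencies.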

Joint self-attention Bi-LSTM (SA-Bi-LSTM)
The joint self-attention semantic relation extraction model explores the internal structure of the sentence to obtain the importance of each word without relying on additional information. The model directly computes the dependencies between words regardless of the distance between them, and thereby obtains the influence of each word on the semantics of the sentence. We believe that capturing the contribution of different words to the sentence semantics is effective for improving the accuracy of relation classification. Specifically, we use two Bi-LSTM neural networks to model the two entities respectively. Suppose we have a sentence with n tokens represented as a word embedding sequence, where e_1 and e_2 denote the entities in the sentence. For the context of the entities, we use Bi-LSTMs to model the sequence of word embeddings. We use two different Bi-LSTM neural networks to encode the context, and we keep all of their hidden states, denoted H_pre and H_fol. Then we input H_pre and H_fol into the multi-head attention module to obtain the multi-layered attention representation of the sentence. We concatenate the attention matrices and feed them into a dense network to obtain the overall feature representation of the context, which is passed to a softmax layer to classify the semantic relation. An illustration of the model is shown in Fig. 2.
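The data flow after the two encoders can be sketched as follows. This is only an illustration under several assumptions that are not taken from the model description: random matrices stand in for the Bi-LSTM hidden states H_pre and H_fol, a single attention head with Q = K = V replaces the multi-head module, mean pooling over tokens is used to obtain a fixed-size vector, and a single linear layer stands in for the dense network. All names are hypothetical.

```python
import math
import random

NUM_CLASSES = 10  # nine relations plus Other, direction ignored (as in the experiments)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(h):
    """Single-head scaled dot-product self-attention with Q = K = V = h (a simplification)."""
    d = len(h[0])
    h_t = [list(col) for col in zip(*h)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(h, h_t)]
    return matmul(weights, h)


def mean_pool(m):
    """Average over tokens to get one fixed-size vector (an assumption, not from the text)."""
    return [sum(col) / len(m) for col in zip(*m)]


def classify(h_pre, h_fol, w, b):
    """Attend over each context, join the two summaries, then apply linear layer + softmax."""
    joint = mean_pool(self_attention(h_pre)) + mean_pool(self_attention(h_fol))
    logits = [
        sum(f * w_i for f, w_i in zip(joint, w_row)) + b_i
        for w_row, b_i in zip(w, b)
    ]
    return softmax(logits)


rng = random.Random(0)
hidden = 6  # illustrative Bi-LSTM output size
# Random stand-ins for the hidden states of the two Bi-LSTM encoders.
h_pre = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(4)]
h_fol = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(7)]
w = [[rng.uniform(-1, 1) for _ in range(2 * hidden)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

probs = classify(h_pre, h_fol, w, b)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted, round(sum(probs), 6))
```

The two contexts may have different lengths; pooling each attended matrix before concatenation is one simple way to make the joint representation independent of sentence length.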

Dataset and evaluation metrics
We use the SemEval-2010 Task 8 dataset [Hendrickx, Kim, Kozareva et al. (2009)] for our experiments. This dataset is public and contains a total of 10,717 annotated examples, including 8,000 training instances and 2,717 test instances. The data has nine directional relation classes and one undirected Other class. SemEval-2010 Task 8 focuses on the semantic relations between pairs of nominals. For example, thief and screwdriver are in an INSTRUMENT-AGENCY relation in 'A thief who tried to steal the truck broke the ignition with screwdriver'. In our experiments, we do not distinguish the direction of the relations, which gives 10 kinds of tags. To compare with previous research results, we use the macro-averaged F1-score as the evaluation criterion:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (9)$$
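As an illustration of the metric, the macro-averaged F1-score computes precision, recall and F1 per class and then averages the per-class F1 values with equal weight. The sketch below uses a toy set of labels and predictions invented for the example; the function name macro_f1 is hypothetical, and which classes enter the average (for instance, whether Other is included) is left to the caller through the labels argument.

```python
def macro_f1(gold, pred, labels):
    """Average the per-class F1 scores over the given labels with equal weight."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
gold = ["Cause-Effect", "Cause-Effect", "Instrument-Agency", "Other"]
pred = ["Cause-Effect", "Instrument-Agency", "Instrument-Agency", "Other"]
labels = ["Cause-Effect", "Instrument-Agency", "Other"]

score = macro_f1(gold, pred, labels)
print(round(score, 4))  # per-class F1 values are 2/3, 2/3 and 1, so the mean is 7/9
```

Because every class contributes equally, a rare relation that is classified poorly lowers the macro average as much as a frequent one would.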

Results of comparison experiments
We select several approaches as competitors to compare with our method in Tab. 1. Kambhatla [Kambhatla (2004)] uses traditional features and employs an SVM as the classifier. Gormley et al. [Gormley, Yu and Dredze (2015)] propose a combination of handcrafted features and word embeddings. Socher et al. [Socher, Huval, Manning et al. (2012)] assign a matrix to every word in the recursive procedure. Zeng et al. [Zeng, Dai, Li et al. (2018)] use a convolutional neural network to extract features, and Xu et al. [Xu, Feng, Huang et al. (2015)] learn more robust relation representations from shortest dependency paths. The model of Xu et al. [Xu, Mou, Li et al. (2015)] also takes shortest dependency paths into account, but uses a different neural network structure, the long short-term memory (LSTM) model. Our model, joint self-attention Bi-LSTM (SA-Bi-LSTM), obtains the importance of each word in the sentence without relying on additional information and captures long-distance semantic dependencies. The experiments show that this is very important for semantic relation classification: our proposed SA-Bi-LSTM model yields an F1-score of 85.3%, whereas the previous best model achieved an F1-score of 84.1% [Xu, Feng, Huang et al. (2015)]. Tab. 1 reports the macro-averaged F1 results for these competing methods along with the resources, features and classifier used by each method. Based on these results, we make the following observations. It is relatively difficult to manually choose the best feature sets, since this depends on human ingenuity and prior NLP knowledge. Socher et al. [Socher, Huval, Manning et al. (2012)] depend on the syntactic tree used in the recursive procedures, and errors in syntactic parsing inhibit the ability of such methods to learn high-quality features.
Position encoding is another way of extracting features: it encodes the position of every token in a sentence relative to each entity. With it, Zeng et al. [Zeng, Liu, Lai et al. (2014)] obtain a substantial improvement, reaching about 82.7%. The model of Gormley et al. [Gormley, Yu and Dredze (2015)] combines word embeddings with arbitrary linguistic structure, expressed as hand-crafted features, which further improves the classification result. Xu et al. [Xu, Feng, Huang et al. (2015)] and Xu et al. [Xu, Mou, Li et al. (2015)] learn more robust relation representations from the shortest dependency paths through a convolutional neural network and a long short-term memory (LSTM) network, respectively; these two models demonstrate the effectiveness and practicality of dependency paths in the semantic relation classification task. Our method achieves about 85.3%, the best performance among all compared methods. This demonstrates the effectiveness of the self-attention mechanism, which models the internal structure of the sentence to obtain the importance of each word without relying on additional information and captures long-distance semantic dependencies.

Conclusion
In this paper, we proposed a novel neural network model for semantic relation classification, called joint self-attention Bi-LSTM (SA-Bi-LSTM), which models the internal structure of the sentence to obtain the importance of each word without relying on additional information and captures long-distance semantic dependencies. We conducted experiments on the SemEval-2010 Task 8 dataset. The results demonstrate that the proposed method is effective for relation classification, obtaining state-of-the-art classification accuracy with only minimal feature engineering.
In future work, we will focus on exploring better neural network structures for feature extraction in the relation extraction task. Meanwhile, since knowledge bases are an important tool for improving relation extraction performance, we will seek better methods for mutual enhancement between knowledge base completion and relation extraction.