A Novel Chinese Entity Relationship Extraction Method Based on the Bidirectional Maximum Entropy Markov Model

School of Foreign Languages, China University of Geosciences, Wuhan 430074, China; School of Foreign Languages, Hubei University of Science and Technology, Xianning 437000, China; Information Centre, Hubei University of Science and Technology, Xianning 437000, China; School of Information Technology, Deakin University, Geelong VIC 3220, Australia; School of Public Administration, China University of Geosciences, Wuhan 430074, China


Introduction
In the age of big data, techniques for extracting valuable information from enormous quantities of text have drawn the attention of many researchers. Information extraction includes entity extraction, relationship extraction, and event extraction. As the key step in information extraction, relationship extraction provides the technical foundation for subsequent tasks such as knowledge graph construction, intelligent information retrieval, and semantic analysis. Therefore, relationship extraction techniques are valuable both for theoretical study and for practical application.
Research on techniques to extract entities and their relationships dates back to the 1960s. Among the more prominent projects is the Linguistic String Project at New York University, which constructed massive English corpora and achieved very satisfactory results when the team used these corpora to extract information from medical texts. In addition, systematic research at Yale University extracted events in domains such as "earthquake" and "strike" from news texts and promoted the research and development of entity relationship extraction. By the late 1980s, with the convening of the Message Understanding Conference, research on entity relationship extraction had started to boom. After decades of development, the theories and techniques of entity relationship extraction, from early models based on manual design and rule extraction [1,2] to later models based on machine learning [3] and deep learning [4,5], are approaching maturity. With constant improvements in accuracy and recall, extraction models are more adaptive than ever before. However, most existing extraction techniques either keep relationships and entities in their own silos, extracting them in separate steps before obtaining the mappings, or tag triples as a whole and use the "proximity principle" of reinforcement learning to extract relationships.
Existing extraction techniques fit into three categories. Firstly, the relationship can be predicted and identified from an entity pair. The premise of this idea is that the relationships are already predefined [6]. The task of relationship extraction then becomes the task of searching the predefined relationship space for the most probable relationship between a given entity pair, based on the context in which the entity pair is located. Secondly, texts can be explored by the relationship of entity pairs. This method aims at finding the maximum number of entity pairs matching the criteria of the given relationship. A common issue of these two methods is that the subtasks, entity identification and relationship identification, are completely independent of each other, resulting in extraneous information such as entities without a relationship. This, in turn, increases error rates because the entities are paired up before their relationship is determined; when no relationship is found for an entity pair, this pair becomes extraneous. Such extraneous pairs increase the error rate of the subtask and negatively impact the performance of subsequent relationship classification. Finally, some studies tag triples as a whole and use the "proximity principle" of reinforcement learning to extract relationships [7]. This method integrates low-level features into more abstract high-level features to search for distributed feature representations and thus solves the problems of manual feature selection and the propagation of feature extraction errors that haunt classical methods.
The conventional method has two drawbacks. Firstly, because most entity pairs do not hold any relationship, numerous negative cases arise and the relationship classes become imbalanced. Secondly, overlapping triples become a critical issue. Shared entities or multiple relationships between two entities make learning more complicated or even impossible, since adequate training data cannot be obtained. For instance, "Mr. Zhang was born in Hubei, a province in Central China" could be interpreted as <Mr. Zhang, was born in, Hubei>, <Mr. Zhang, was born in, China>, and <Hubei, lies in, China>. The conventional algorithm cannot identify and classify these triples properly without sufficient data.
To address these problems, this paper proposes a new method, the entity relation chain. The head entity is identified first, and then the corresponding relationship and the tail entity are predicted. For instance, in the sentence "Mr. Zhang was born in Hubei province," conventional methods usually identify E1 "Mr. Zhang" and E2 "Hubei province" first and recognize R "was born in" second. In the entity relation chain, by contrast, E1 "Mr. Zhang" is identified first, and every possible R generated from E1 serves as the criterion for finding E2 "Hubei province." In this chain, E1 is the head entity, R is the relation chain, and E2 is the tail entity.
Experiments on data sets from People's Daily indicate that the proposed method achieves high performance.
We also evaluated the scalability of the method on the English SemEval 2010 Task 8 data set; the results show that the Bi-MEMM also obtains a better F-score.
This paper is organized as follows. After introducing the research gap and our research purpose, we review and discuss entity relationship extraction and the particularity of Chinese relation extraction. Then, we develop the Bi-MEMM method for entity relation extraction. The detailed experimental evaluation is presented in Section 4, and Section 5 concludes this work and outlines directions for further research.

Definition of Entity Relationship Extraction.
The output of entity relationship extraction is usually described as entity relationship triples <E1, R, E2>, in which E1 and E2 refer to the entity types and R refers to the text describing the relation type. After the preprocessing steps of named entity recognition and relation trigger word recognition, the determined triples <E1, R, E2> are stored for further analysis or query.
According to this definition, the entity relationship extraction task can be divided into three key parts: named entity recognition, relation trigger word identification, and relation extraction. Named entity recognition identifies the text spans that denote entities with a specific meaning, mainly the names of people, places, and institutions, and other proper nouns. Relation trigger word identification classifies the words that may trigger an entity relationship, determines whether they are trigger words, and determines whether the extracted relations are positive. Relation extraction extracts the semantic relationships between the identified entities, such as location, employee, and product relations.

Features of Entity Relationship Extraction.
Compared with NLP tasks such as sentiment analysis and news classification, the extraction of relationship is unique in three aspects.
Secondly, entity relationship extraction involves heterogeneous data. Data can come from different sources, and they can be structured, semistructured, or unstructured. Deep learning [21] is usually applied to structured data; unsupervised aggregation methods [4] are usually applied to unstructured textual data because the relationship categories are unpredictable; semisupervised [17] or distantly supervised [22] methods are usually applied to semistructured data such as Wikipedia.
Lastly, entity relationship extraction needs to handle a wide variety of relationships, which easily leads to data noise. Relationships between entities are diverse, but early research often ignored multiple relationships and failed to handle latent ones. The adoption of graph structures [18] in relationship extraction in recent years ushered in a new technique for tackling overlaps of entities and relationships. To tackle data noise, it has been found that a small number of adversarial examples can prevent model overfitting, and adversarial training has been proposed to improve model performance [23].

Particularity of Chinese Relation Extraction.
Relationship extraction from Chinese texts lags behind extraction from English texts because of its complexity and difficulty. The following two characteristics of Chinese make relationship extraction more challenging than it is for English.
Chinese trigger words are abundant and hard to extract. This makes the recall of relationship extraction low. In the ACE corpus, there are 30% more Chinese trigger words than English ones [24].
In the Chinese language, words are often polysemous, sentence structures are complex and flexible, and omissions appear frequently. The fact that the same word can express completely different meanings in different contexts, or that the same meaning can be represented by many different expressions, makes the identification of relationship types particularly difficult.
In view of these problems, this paper proposes the following solutions. Firstly, a joint extraction model of entity mentions and relations, similar to Seq2Seq, is proposed, and the Bidirectional Maximum Entropy Markov Model (Bi-MEMM) is integrated into it. Secondly, unlike existing relationship extraction techniques, relationship triples are treated as an entity relationship chain: entity E1 is identified first, and then the corresponding relationship R and entity E2 are predicted based on E1. Thirdly, the validity of the proposed model is verified on Chinese data sets and its scalability is evaluated on English data sets.

Extraction Method Based on the Bi-MEMM Model
Previous solutions cannot efficiently handle problems in entity relationship extraction such as entity overlap and relationship crossover. In this paper, a Bi-MEMM model that follows a probabilistic graph similar to seq2seq is proposed to solve such problems. The seq2seq decoder is modeled in the following way:
$$P(y_1, y_2, \ldots, y_n \mid x) = P(y_1 \mid x)\, P(y_2 \mid x, y_1) \cdots P(y_n \mid x, y_1, y_2, \ldots, y_{n-1}). \tag{1}$$
In formula (1), the first word is predicted from x, the second word is predicted once the first word is known, and the process repeats until the end mark appears. Similarly, the extraction of triples can be modeled in the following way:
$$P(E_1, E_2, R \mid x) = P(E_1 \mid x)\, P(E_2 \mid x, E_1)\, P(R \mid x, E_1, E_2). \tag{2}$$
In formula (2), E1 is predicted first, and the E2 corresponding to E1 is predicted by passing in E1. Then, E1 and E2 are passed in to predict the relationship R between E1 and E2. In actual processing, the predictions of E2 and R can be combined into one step, so only two steps are needed in total: the first step predicts E1, and the second step passes in E1 to predict the E2 and R corresponding to E1. Figure 1 shows the overall structure of our Bi-MEMM model, which is detailed as follows.
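The two-step decoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `extract_triples`, `toy_heads`, and `toy_tails` are hypothetical names, and the two toy functions are rule-based stand-ins for the neural predictors of E1 and of (R, E2).

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]          # (start, end) character offsets, end exclusive
Triple = Tuple[str, str, str]   # (E1 text, relation label, E2 text)


def extract_triples(
    text: str,
    predict_heads: Callable[[str], List[Span]],
    predict_tails: Callable[[str, Span], List[Tuple[str, Span]]],
) -> List[Triple]:
    """Step 1: predict every head entity E1.
    Step 2: for each E1, predict all (R, E2) pairs conditioned on it."""
    triples = []
    for head in predict_heads(text):
        for relation, tail in predict_tails(text, head):
            triples.append((text[head[0]:head[1]], relation, text[tail[0]:tail[1]]))
    return triples


# Hypothetical rule-based stand-ins for the two learned predictors.
def toy_heads(text: str) -> List[Span]:
    i = text.find("Mr. Zhang")
    return [(i, i + len("Mr. Zhang"))] if i >= 0 else []


def toy_tails(text: str, head: Span) -> List[Tuple[str, Span]]:
    found = []
    for relation, name in [("was born in", "Hubei"), ("was born in", "China")]:
        j = text.find(name)
        if j >= 0:
            found.append((relation, (j, j + len(name))))
    return found


sentence = "Mr. Zhang was born in Hubei, a province in Central China"
result = extract_triples(sentence, toy_heads, toy_tails)
```

Because E2 and R are predicted per head entity, one head can yield several triples, which is how overlapping triples sharing the same E1 are kept apart in this sketch.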

Bi-MEMM Model.
When it comes to techniques for extracting relationships and entities, character-word embedding is necessary only for Chinese, as word embedding is sufficient for English. By segmenting Chinese texts into words, we obtain character embeddings and word embeddings. Then, we apply a matrix transformation to each word embedding and concatenate the transformed word embedding with the character embedding of each of the word's constituent characters. The result of this concatenation is the character-word embedding. For instance, "中国" has two character-word embeddings: one is the concatenation of the matrix-transformed word embedding with the embedding of character "中", and the other is the concatenation of the matrix-transformed word embedding with the embedding of character "国".
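The construction of the character-word embedding can be illustrated with a small sketch. The dimensions, the random vectors, the concatenation order, and the names `project` and `char_word_embedding` are assumptions chosen for illustration; the text does not specify them.

```python
import random

random.seed(0)
CHAR_DIM, WORD_DIM, PROJ_DIM = 4, 6, 4  # illustrative sizes only


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Toy lookup tables standing in for trained embeddings.
char_emb = {"中": rand_vec(CHAR_DIM), "国": rand_vec(CHAR_DIM)}
word_emb = {"中国": rand_vec(WORD_DIM)}
# W plays the role of the matrix transformation applied to the word embedding.
W = [rand_vec(PROJ_DIM) for _ in range(WORD_DIM)]


def project(vec, matrix):
    """Row vector times matrix: maps WORD_DIM values to PROJ_DIM values."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]


def char_word_embedding(word):
    """One vector per character: transformed word embedding + character embedding."""
    projected = project(word_emb[word], W)
    return [projected + char_emb[ch] for ch in word]


vectors = char_word_embedding("中国")
```

Both rows of `vectors` share the same transformed word part and differ only in the character part, matching the "中国" example in the text.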
Firstly, the character-word-position embedding is transformed into the coding matrix M through the Bi-LSTM Layer and the Tanh Layer/Attention Layer.
Secondly, matrix M is passed into the Bi-MEMM Layer and the Dense Layer. Sigmoid is used as the activation function of the Dense Layer. Then, the two-dimensional vector generated for each character is used to predict the head and tail positions of E1.
Thirdly, a labelled E1 is selected (one E1 is picked at random during training, and all E1's are traversed during prediction). The subsequence of M corresponding to E1 is fed into the first Self-Attention Layer, together with the position embedding at the corresponding positions, and is transformed into a vector with the same length as the input sequence.
Lastly, matrix M is sent into the Bi-MEMM Layer and the Dense Layer again. For each R corresponding to E1, the head and tail positions of E2 are also predicted by the Dense Layer with the sigmoid activation function.
From the model structure in Figure 1, we can see that it is similar to a joint extraction model with a copy mechanism. When identifying E1 in a triple <E1, R, E2>, the Bi-MEMM plays the same role as a CRF. When recognizing E2, the Bi-MEMM predicts E2 for every possible R paired with E1. If an E2 is found, the corresponding triple is kept as a candidate; otherwise, the triple is discarded.
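The head/tail prediction and the discarding of relations without an E2 can be sketched as follows. The threshold of 0.5, the rule of pairing each head with the nearest tail at or after it, the relation names, and the probability values are all assumptions for illustration; the text only states that sigmoid outputs predict head and tail positions.

```python
def decode_spans(head_probs, tail_probs, threshold=0.5):
    """Pair each predicted head position with the nearest tail position
    at or after it. Returns inclusive (start, end) index pairs."""
    spans = []
    for start, p_head in enumerate(head_probs):
        if p_head < threshold:
            continue
        for end in range(start, len(tail_probs)):
            if tail_probs[end] >= threshold:
                spans.append((start, end))
                break
    return spans


text = "张先生出生于湖北"

# Made-up sigmoid outputs for E1 (one head and one tail score per character).
e1_head = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1]
e1_tail = [0.1, 0.2, 0.9, 0.1, 0.3, 0.1, 0.1, 0.2]
e1_spans = decode_spans(e1_head, e1_tail)

# Made-up E2 scores for each candidate relation, conditioned on the chosen E1.
e2_scores = {
    "born_in": ([0.0] * 6 + [0.9, 0.1], [0.0] * 7 + [0.8]),
    "works_for": ([0.1] * 8, [0.1] * 8),
}

triples = []
for s1, t1 in e1_spans:
    for relation, (head, tail) in e2_scores.items():
        # A relation with no decoded E2 span contributes no triple.
        for s2, t2 in decode_spans(head, tail):
            triples.append((text[s1:t1 + 1], relation, text[s2:t2 + 1]))
```

In this toy run only `born_in` yields an E2 span, so the `works_for` candidate is dropped.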

Bi-MEMM Construction and the Loss Function.
In formula (1), we assume that the dependency occurs only between adjacent positions, and the following formula is obtained:
$$P(y_1, y_2, \ldots, y_n \mid x) = P(y_1 \mid x)\, P(y_2 \mid x, y_1)\, P(y_3 \mid x, y_2) \cdots P(y_n \mid x, y_{n-1}). \tag{3}$$
In formula (3), $X = (x_1, x_2, \ldots, x_n)$ is the input and $Y = (y_1, y_2, \ldots, y_n)$ is the tag sequence of the same length as X. Following the design of the linear-chain CRF (Linear Chain Conditional Random Field), the following formula is obtained from formula (3):
$$P(y_1 \mid x) = \frac{e^{f(y_1; x)}}{\sum_{y_1} e^{f(y_1; x)}}, \qquad P(y_k \mid x, y_{k-1}) = \frac{e^{g(y_{k-1}, y_k) + f(y_k; x)}}{\sum_{y_k} e^{g(y_{k-1}, y_k) + f(y_k; x)}}, \tag{4}$$
where $g(y_{k-1}, y_k)$ is called the transition matrix. At this point, the model is the MEMM. From equation (4), we can see that the MEMM decomposes the overall probability distribution into a product of stepwise distributions, so the loss can be calculated simply by summing the cross entropy of each step. Substituting equation (4) into equation (3), we obtain the MEMM probability from which the loss is computed:
$$P(y \mid x) = \frac{e^{f(y_1; x) + g(y_1, y_2) + \cdots + g(y_{n-1}, y_n) + f(y_n; x)}}{\left(\sum_{y_1} e^{f(y_1; x)}\right)\left(\sum_{y_2} e^{g(y_1, y_2) + f(y_2; x)}\right) \cdots \left(\sum_{y_n} e^{g(y_{n-1}, y_n) + f(y_n; x)}\right)}. \tag{5}$$
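A minimal sketch of the left-to-right MEMM log-probability, computed as a sum of stepwise log-softmax terms, may make the factorization concrete. The score tables `f` (per-step tag scores) and `g` (transition scores) hold made-up numbers, and the function names are hypothetical.

```python
import math
from itertools import product


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def memm_log_prob(f, g, y):
    """log P(y|x) as a product of locally normalized steps.
    f[k][t]: score of tag t at step k; g[a][b]: transition score from tag a to tag b."""
    num_tags = len(g)
    total = log_softmax(f[0])[y[0]]                      # first step: no transition
    for k in range(1, len(y)):
        step = [g[y[k - 1]][t] + f[k][t] for t in range(num_tags)]
        total += log_softmax(step)[y[k]]                 # later steps: transition + score
    return total


# Made-up scores for a 3-step sequence with 2 tags.
f = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.1]]
g = [[0.4, -0.3], [-0.2, 0.6]]

loss = -memm_log_prob(f, g, [0, 1, 1])   # summed per-step cross entropy

# Because every step is normalized locally, the probabilities of all
# tag sequences should add up to one.
total_prob = sum(
    math.exp(memm_log_prob(f, g, list(seq))) for seq in product(range(2), repeat=3)
)
```

The check on `total_prob` reflects the local normalization in each step, which is what distinguishes this model from a globally normalized CRF.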
So far, we can see that the MEMM, like seq2seq, has one significant defect: exposure bias [25]. When the model is trained, the prediction of the current step assumes that the labels of the previous step are known and correct. However, in the prediction stage, the actual labels of the previous step are unknown. If the current step is not strengthened during training, the reliability of the entire chain is greatly reduced. The probability in equation (5) is calculated from left to right. Experiments show that adding a right-to-left MEMM during modelling, by analogy with the LSTM and the Bi-LSTM, can improve its effect. The right-to-left MEMM gives a corresponding loss function, formula (6). Finally, the average of the cross entropies of formulae (5) and (6) is taken as the final loss.
This makes up for the shortcomings of the MEMM's asymmetric behaviour without increasing the number of parameters, and it also strengthens the training of the current step.
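The averaged bidirectional loss can be sketched as below. This is an assumption-laden illustration: the right-to-left pass is taken to reuse the same `f` and `g` tables (consistent with "without increasing the parameters") and to normalize each step given the tag to its right; the exact form of the authors' right-to-left formula is not reproduced here. All names and numbers are hypothetical.

```python
import math
from itertools import product


def _log_softmax(scores):
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def memm_nll(f, g, y, reverse=False):
    """Negative log-likelihood of tag sequence y under a locally normalized chain.
    reverse=False: left to right. reverse=True: right to left, reusing f and g."""
    n, num_tags = len(y), len(g)
    order = range(n - 1, -1, -1) if reverse else range(n)
    nll, prev = 0.0, None
    for k in order:
        if prev is None:
            scores = list(f[k])                          # first visited step
        elif reverse:
            # Assumed form: tag t at step k precedes the already-fixed tag `prev`.
            scores = [g[t][prev] + f[k][t] for t in range(num_tags)]
        else:
            scores = [g[prev][t] + f[k][t] for t in range(num_tags)]
        nll -= _log_softmax(scores)[y[k]]
        prev = y[k]
    return nll


def bi_memm_loss(f, g, y):
    """Average of the left-to-right and right-to-left losses."""
    return 0.5 * (memm_nll(f, g, y) + memm_nll(f, g, y, reverse=True))


f = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.1]]   # made-up per-step tag scores
g = [[0.4, -0.3], [-0.2, 0.6]]             # made-up transition scores
y = [0, 1, 1]
loss = bi_memm_loss(f, g, y)
```

Both directions share the same score tables, so the second pass adds computation but no parameters in this sketch.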

Experimental Design
Experiments are carried out to evaluate the effectiveness of the proposed method on Chinese data sets and its scalability on English data sets. For the Chinese data set, news-domain corpus data from the January issues of People's Daily are collected; the English data set is SemEval 2010 Task 8.
Several similar methods, namely Bi-LSTM + CRF [5], Att-Bi-LSTM + CRF [26], and a BERT-based model [27], were taken as baselines in the Chinese entity relationship extraction test.

Data Sets.
SemEval 2010 Task 8 annotates the semantic relationship between noun pairs in a sentence rather than between entity pairs. There are 10 classes in total (cause-effect, component-whole, entity-destination, product-producer, entity-origin, member-collection, message-topic, content-container, instrument-agency, and others), among which one class does not distinguish the order of the relationship arguments.
The People's Daily corpus mainly includes three kinds of entity relations: personal name, place name, and organization name. In this paper, spaCy [33], pyhanlp [34], and other natural language processing auxiliary tools [35] are used in the experiments.

Hyperparameters.
Because the Chinese and English data sets differ (for example, Chinese uses character embedding whereas English uses word embedding), some hyperparameters are not the same for the two languages. In this paper, the average cross entropy of formula (6) is used as the loss function to train the deep learning network with an Adam optimizer. The hyperparameters are shown in Table 1.

Evaluation Criteria.
Precision, recall, and F-measure are adopted as the basic evaluation criteria. Precision and recall trade off against each other, so the F-measure is used to evaluate performance comprehensively. Precision and recall are calculated as follows:
$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}, \qquad \text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}.$$
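As a small worked example, the criteria can be computed over sets of extracted triples. The sketch assumes the balanced F-measure (F1, the harmonic mean of precision and recall) and exact-match scoring of whole triples; the gold and predicted triples are made up, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring of triples; F1 is the harmonic mean of P and R."""
    predicted, gold = set(predicted), set(gold)
    true_positive = len(predicted & gold)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {
    ("Mr. Zhang", "was born in", "Hubei"),
    ("Mr. Zhang", "was born in", "China"),
    ("Hubei", "lies in", "China"),
}
pred = {
    ("Mr. Zhang", "was born in", "Hubei"),
    ("Hubei", "lies in", "China"),
    ("Mr. Zhang", "lies in", "China"),   # a false positive
}
p, r, f1 = precision_recall_f1(pred, gold)
```

Here two of the three predictions are correct and two of the three gold triples are recovered, so precision, recall, and F1 all equal 2/3.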

Experimental Results and Analysis.
For the Chinese entity relationship extraction data set, Bi-LSTM-CRF, Att-Bi-LSTM-CRF, and the BERT-based model are used as benchmarks for performance testing. Precision, recall, and F-score are used as the evaluation criteria. The precision of the different methods is shown in Table 2. Their recall and F-score are displayed in Figures 2 and 3. For the English entity relationship extraction data set, the F-scores of six models are listed in Table 3 to evaluate the scalability of the proposed model.
Bi-MEMM has several features that overcome the pitfalls of traditional methods in Chinese entity relationship extraction. Firstly, the MEMM model, like the CRF model, has the attractive property that its loss function is convex. The Bi-MEMM model fundamentally solves the label bias problem of the MEMM model and can make full use of context information. It can use complex, overlapping, and nonindependent information for training and inference. Compared with the CRF model, the quality of feature selection in the Bi-MEMM model no longer directly determines the level of system performance. Secondly, the entity relationship chain we propose can efficiently tackle problems such as entity overlap and relationship intersection while avoiding the following two shortcomings. The first is the error accumulation and entity redundancy caused by the mutual influence of entity recognition and relationship extraction, which can increase computational complexity; the second is the lack of interaction information caused by ignoring the internal connection and dependency between entity recognition and relationship extraction. Table 3 shows the scalability of our proposed method, which can also handle English entity relationship extraction. Moreover, our method reaches an F-score of 84.6%, which is higher overall than that of the other five methods.
The results indicate that the proposed method not only performs well on Chinese entity relationship extraction but also scales well to English.

Summary and Future Work
In this paper, a joint extraction model based on joint coding is proposed, and the Bi-MEMM is introduced into the joint extraction model and applied to entity relationship extraction tasks. Experiments show that the model performs well on Chinese data sets and scales well to English data sets. It is able to learn the internal structure of a sentence without considering the complexity of the named entities and relationships in the sentence. At the same time, we also notice that the model is still inadequate in dealing with long-distance constraints in sample sentences, implicit relations between entities, reasoning over equivalent relations such as referential and subordination relations, and inconsistent date formats. The quality of the annotated data is also an important factor that cannot be ignored. We expect that future work could proceed along the following lines: integrating natural language processing algorithms (e.g., anaphora resolution) into deep learning algorithms, and introducing external knowledge bases (e.g., thesauri, WordNet, HowNet, and knowledge map prior validation) into the model. We believe that introducing these methods in future modelling will greatly improve accuracy.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.