PERLEX: A Bilingual Persian-English Gold Dataset for Relation Extraction

Relation extraction is the task of extracting semantic relations between entities in a sentence. It is an essential component of natural language processing tasks such as information extraction, knowledge extraction, and knowledge base population. The main motivations of this research stem from the lack of a relation extraction dataset in the Persian language, as well as the necessity of extracting knowledge from the growing volume of Persian big data for various applications. In this paper, we present "PERLEX", the first Persian dataset for relation extraction, which is an expert-translated version of the "Semeval-2010-Task-8" dataset. Moreover, this paper addresses Persian relation extraction utilizing state-of-the-art language-agnostic algorithms. We employ six different models for relation extraction on the proposed bilingual dataset, including a non-neural model (as the baseline), three neural models, and two deep learning models fed by multilingual BERT contextual word representations. The experiments result in a maximum F1-score of 77.66% (achieved by the BERTEM-MTB method), which constitutes the state of the art for relation extraction in the Persian language.


Introduction
Relation Extraction (RE) is the task of identifying semantic relations between text entities and is one of the most crucial tasks in Natural Language Processing (NLP). In RE, entities are string literals which are marked in the sentence. Furthermore, the goal in RE is to detect a limited number of pre-defined relationships from the text. Knowledge base population is one of the applications of RE. A knowledge base contains a set of entities and relationships between them. There are many knowledge bases available in English at the moment, such as Yago [1], Freebase [2], DBpedia [3] and Wikidata [4]. However, the first knowledge base in the Persian language was developed recently [5], which was one of the motivations for this research.
At the outset, the first dataset for Persian RE is introduced. Then, five language-agnostic methods for the RE task are employed, and their results are compared with a baseline.
Although there are already standard RE datasets in English such as Semeval-2010-Task-8 (Multi-Way Classification of Semantic Relations Between Pairs of Nominals) [6], TACRED [7], and ACE 2005 [8], there is no dataset available for the Persian language. Thus, in this paper, we present "PERLEX", which is an expert-translated version of the Semeval-2010-Task-8 dataset.
The evaluation and comparison of the selected RE methods are carried out using the PERLEX dataset. It is reasonable to adapt existing state-of-the-art language-independent RE methods to our target language, i.e., Persian, by reimplementation. We use five neural RE models: one based on convolutional neural networks, two based on recurrent neural networks, and two BERT-based models.
Before the introduction of BERT [9], Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing (BLSTM-LET) [10] was one of the best-performing models for the RE task and was regarded as one of the state-of-the-art language-agnostic approaches. BLSTM-LET outperformed previous state-of-the-art RE methods without the use of customized linguistic features, relying solely on word embedding [11] features.
With the advent of BERT, many NLP tasks have evolved. BERT [9] is a contextual text representation model that was shown to achieve state-of-the-art results on 11 different NLP tasks. Unlike previous word representations, where each word has a fixed embedding, BERT assigns a word different embeddings in different contexts. At present, the BERTEM-MTB [12] model holds the state of the art for the RE task on both the Semeval-2010-Task-8 and TACRED datasets.
The remainder of this paper is organized as follows. Section 2 provides a summary of the related literature on the RE task. In Section 3, we elaborate on the design of our proposed dataset, PERLEX. Section 4 presents the experimental results along with further analyses of the obtained results. Finally, in Section 5, we conclude this paper and propose possible future lines of extension of this study.

Related Works
In this section, we first provide a brief review of well-known RE datasets. Then, we divide the state-of-the-art RE algorithms into two categories: deep-learning-based methods and non-deep-learning-based methods. The performance of these models on the PERLEX dataset is reported in Section 4.

Datasets
RE datasets can be classified into two general groups: distantly-supervised datasets and hand-labelled datasets.
In hand-labelled datasets, the label of each relation mention is determined by human experts. Thus, the creation of such datasets is time-consuming and expensive. Datasets such as ACE [8], Semeval-2010-Task-8 [6], TACRED [7], and FewRel [13] belong to the hand-labelled category.
In contrast, labels of relation mentions in distantly-supervised datasets are determined by the corresponding relations of the mentioned entity pairs in a knowledge base, an approach first proposed by Mintz et al. [14]. A prominent example is NYT-10 [15], in which entities in the New York Times corpus are aligned to entities in Freebase.
Distantly-supervised datasets have some advantages over hand-labelled ones. For example, they do not require the time-consuming annotation process performed by human experts. Furthermore, distantly-supervised datasets can utilize the labels already used in knowledge bases, which makes them ideal for knowledge-base-related tasks such as knowledge base population. The main disadvantage of these datasets is their noisy labels. Many approaches have been proposed to deal with the problem of noisy labels, such as multiple-instance learning [15,16,17], reinforcement learning [18,19,20], the use of knowledge base side information [21,22], and attention mechanisms [23,24]. In the Persian language, FarsBase [5] notably uses a distantly-supervised method to extract triples for its knowledge base.

Non-Deep-Learning-Based Methods
Before the advent of deep learning models, NLP tasks relied on specific NLP tools, such as dependency parsers and POS taggers, for feature extraction. Such models cannot compete with deep learning models due to the costly nature of their handcrafted features and resources. Moreover, since these features are generally extracted by NLP tools, errors made by the tools propagate into the models. These methods employ classifiers such as SVM and Maximum Entropy (MaxEnt). The state-of-the-art result for RE on the Semeval-2010-Task-8 dataset in 2010 was achieved by Rink and Harabagiu [25] using an SVM with several handcrafted features and resources, including lexical resources, dependency parses, PropBank, FrameNet, hypernyms, NomLex-Plus, n-grams, and TextRunner [26]. Their model was the best non-deep-learning-based model; however, it was later outperformed by deep-learning-based models such as CNN and RNN methods.
LightRel [27] is another non-deep-learning-based method: a fast and lightweight logistic regression classifier. In this method, a relation mention is represented as a sequence of tokens. The main idea is to transform these sequences into fixed-length vectors such that each token (or word) is represented by only four features: the word itself, its shape (a small, fixed set of character-based features), the word's cluster ID extracted from an external knowledge base, and the word's fixed-size embedding. A logistic regression classification model is then trained to predict classes from these feature vectors.
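The four-feature token representation described above can be illustrated with a minimal sketch. The function names and the hash-free embedding stub below are hypothetical simplifications, not the actual LightRel implementation; in particular, the shape feature here is a compact character-class signature, one plausible reading of "a small, fixed amount of character-based features".

```python
# Sketch of a LightRel-style token featurizer (illustrative names, not the
# original implementation). Each token yields four features: the word, its
# shape signature, a cluster ID from an external resource, and an embedding.

def word_shape(token):
    """Map characters to a shape alphabet (X=upper, x=lower, d=digit),
    collapsing consecutive repeats: 'Word2Vec' -> 'XxdXx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    collapsed = [shape[0]]
    for s in shape[1:]:
        if s != collapsed[-1]:
            collapsed.append(s)
    return "".join(collapsed)

def token_features(token, clusters, embeddings, dim=4):
    """Assemble the four per-token features; unseen tokens fall back to
    an 'UNK' cluster and a zero embedding of fixed size."""
    return {
        "word": token,
        "shape": word_shape(token),
        "cluster": clusters.get(token, "UNK"),
        "embedding": embeddings.get(token, [0.0] * dim),
    }

feats = token_features("Word2Vec", clusters={}, embeddings={})
```

Concatenating these per-token vectors over a relation mention yields the fixed-length input of the logistic regression classifier.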

Deep-Learning-Based Methods
In this section, we present and describe some of the essential features of deep-learning-based models used for RE. Each of these models was the state of the art at its time but was shortly afterwards outperformed by the next model.
Convolutional Neural Networks (CNNs) were initially used in computer vision to extract features from images, but they have since been applied to various NLP tasks. Zeng et al. [28] used CNNs to extract features from sentences and classified relations using these features. Their proposed model uses a set of convolution and pooling layers followed by two fully-connected layers and a softmax classifier to classify relations. The features of each word are the concatenation of its word embedding and position embeddings. Word embeddings are vector representations of words in a d-dimensional space, i.e., a vector of size d representing each word. Position embeddings encode the relative position of each word in the sentence with respect to the two entities. Intuitively, the convolution layer encodes every N consecutive words into a feature vector, where N is the kernel length of the convolution layer and a hyper-parameter. A max-pooling layer then extracts the most relevant features of the sentence. Finally, these feature vectors are fed into two fully-connected layers, and a softmax classifier determines the relation.
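The position features described above can be sketched in a few lines. This is an illustrative simplification, not the exact Zeng et al. preprocessing: each token receives its signed offset to each entity head (clipped to a maximum distance, a common convention), and these indices would then look up learned position-embedding vectors concatenated with the word embedding.

```python
# Sketch of relative-position features for a CNN relation classifier
# (illustrative; clipping bound max_dist is an assumed hyper-parameter).

def relative_positions(sentence_len, e1_idx, e2_idx, max_dist=30):
    """Return clipped (offset-to-e1, offset-to-e2) pairs for every token."""
    def clip(d):
        return max(-max_dist, min(max_dist, d))
    return [(clip(i - e1_idx), clip(i - e2_idx)) for i in range(sentence_len)]

# "The fire caused severe damage" -> 5 tokens, entity heads at indices 1 and 4
pos = relative_positions(5, 1, 4)
```

Each offset pair then indexes two embedding tables whose rows are trained jointly with the convolution filters.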

Attention-based Bidirectional Long Short-Term Memory Network (Att-BLSTM) is an RE model proposed by Zhou et al. [29], which is capable of surpassing many state-of-the-art models without relying on NLP tools or lexical resources for feature extraction. This model utilizes a Recurrent Neural Network (RNN) for classification. In a regular RNN, the output of each time step depends on the current input and the output of the previous time step. However, in some tasks (e.g., machine translation), the output of the current time step also depends on the outputs of future time steps. In such cases, bidirectional RNNs can be utilized. In this model, the embedding of each word is given as input to two LSTM cells, one for the forward pass and the other for the backward pass. The outputs of each pair are then concatenated to produce the output corresponding to each word. Following this LSTM layer, an attention layer produces the network's output by constructing a weighted sum of the outputs of all words. The attention layer, as its name suggests, assigns higher weights to more important words, thereby distinguishing them from the others. For instance, the word "the" is less useful than a word like "caused" for determining the Cause-Effect relation in a sentence. These attention weights are learned during training. It should be noted that the Att-BLSTM model is independent of lexical and syntactic features and relies solely on word embeddings. This model outperformed many state-of-the-art models that rely on features extracted by NLP tools on the SemEval-2010-Task-8 dataset.
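The attention pooling described above amounts to scoring each time step's BLSTM output, normalizing the scores with a softmax, and taking the weighted sum. The following is a minimal NumPy sketch under assumed shapes; the query vector w stands in for the learned attention parameters.

```python
import numpy as np

# Sketch of Att-BLSTM-style attention pooling (illustrative shapes; w is the
# learned attention vector, H the matrix of per-word BLSTM outputs).

def attention_pool(H, w):
    """H: (seq_len, hidden) BLSTM outputs; w: (hidden,) attention vector.
    Returns the attention-weighted sentence representation of size (hidden,)."""
    scores = np.tanh(H) @ w                  # one scalar score per time step
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ H                       # weighted sum over time steps

H = np.random.default_rng(0).normal(size=(6, 4))
w = np.ones(4)
sent = attention_pool(H, w)
```

The resulting vector is what the final softmax classifier consumes; during training, gradients through the weights teach the model which words matter for each relation class.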
Bidirectional LSTM with Entity-aware Attention using Latent Entity Typing (BLSTM-LET), proposed by Lee et al. [10], utilizes the self-attention mechanism introduced by Vaswani et al. [30] and latent entity typing to produce better representations of words; a bidirectional LSTM is again used for classification. Word embeddings are the input of the model. Multi-head attention then produces contextualized representations of the words. These representations are concatenated with position embeddings and latent entity type embeddings and fed into an attention layer to obtain the final representation of the sentence. Latent entity types provide information about the entities; intuitively, they are clusters to which an entity can belong. Latent entity type embeddings are learned during training through the contextualized representations of the entities [10]. When dimensionality reduction is applied to the entity type embeddings and the entities are visualized in a 2-D plot, entity pairs such as "pollution" and "virus", or "worker" and "chairman", fall into the same cluster. This model outperformed all models that did not use NLP tools to extract features, except BERT-based models.
BERT-based models have recently been applied to RE and have obtained the best results so far, outperforming previous methods.
In contrast to context-free models such as word2vec, Bidirectional Encoder Representations from Transformers (BERT) [9] is an unsupervised, context-dependent language representation model. When introduced, BERT was shown to achieve state-of-the-art results on a range of NLP tasks, including eight tasks from the GLUE benchmark. As opposed to previous word representations, where each word has a single fixed embedding vector, BERT produces different embeddings for a word in different contexts. RE was not among the tasks on which BERT was originally evaluated; however, BERT quickly found its way into this field.
Enriching Pre-trained Language Model with Entity Information (R-BERT), recently proposed by Wu and He [31], applies BERT to the RE task and was shown to be the best method on the Semeval-2010-Task-8 dataset. To encode the relation between the two entities of a sentence using BERT, R-BERT adds the special token "$" before and after the first entity and the special token "#" before and after the second entity. R-BERT also adds the special token "[CLS]" to the beginning of each sentence. The final representation of each relation is the concatenation of three hidden state vectors: the hidden state corresponding to the [CLS] token and the averages of the hidden states corresponding to the tokens of the first and second entities. A fully-connected layer followed by a softmax layer then maps the acquired relation representation to a relation class.
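The entity-marking step described above can be sketched as a string transformation. This is a simplification operating on SemEval-style <e1>/<e2> tags rather than at the tokenizer level, and the function name is hypothetical; the real R-BERT preprocessing works on BERT wordpiece sequences.

```python
# Sketch of R-BERT-style input construction (simplified to plain strings):
# wrap entity 1 in "$" markers, entity 2 in "#" markers, prepend "[CLS]".

def mark_entities(tagged_sentence):
    s = tagged_sentence
    s = s.replace("<e1>", "$ ").replace("</e1>", " $")
    s = s.replace("<e2>", "# ").replace("</e2>", " #")
    return "[CLS] " + s

marked = mark_entities("The <e1>fire</e1> caused <e2>damage</e2>.")
```

After encoding the marked sentence with BERT, the [CLS] hidden state and the averaged hidden states of the two marked spans are concatenated to form the relation representation.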

BERTEM with Matching the Blanks (BERTEM-MTB)
BERTEM-MTB, proposed by Soares et al. [12], is the most recent and current state-of-the-art approach; it is a BERT-based model very similar to R-BERT. Like R-BERT, BERTEM-MTB adds special tokens before and after the entities, but unlike R-BERT, the tokens before and after each entity are distinct. The relation representation in this model is the concatenation of the hidden states of the special tokens preceding each entity. These relation representations are then used to classify each sentence's relation. This method also adds a further training step that fine-tunes the relation representations in BERT by replacing the entities with the "[BLANK]" special token in sentences whose entity pairs match.

Construction of the PERLEX bilingual dataset
Unlike English and other resource-rich languages, the Persian language has no proper resources available for RE. In English, the Semeval-2010-Task-8 challenge provides one of the most well-known datasets for RE, which has been utilized in many studies. This dataset contains 10,717 example sentences and their corresponding relation types, of which 8,000 are for training and 2,717 for testing. In this challenge, each relation extraction algorithm must identify one of the nine predefined relationships for the pair of entities specified in each sentence. The nine predefined relations are "Cause-Effect", "Component-Whole", "Content-Container", "Entity-Destination", "Entity-Origin", "Instrument-Agency", "Member-Collection", "Message-Topic", and "Product-Producer", plus the relation "Other" for cases where there is no confirmed relationship between the two entities.
PERLEX is a parallel translation of all examples in the Semeval-2010-Task-8 dataset. With this approach, the cost of sentence selection is eliminated. Moreover, since the dataset is constructed from an original and widely-utilized dataset, it is possible to implicitly compare the results of RE methods on this dataset with those on the English dataset. Table 1 presents the statistics of the dataset.
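As a rough illustration of the data format inherited from Semeval-2010-Task-8, an example entry consists of an ID, a sentence with <e1>/<e2> entity tags, and a relation label on the following line. The parser below is a hypothetical sketch (the PERLEX files carry Persian sentence text in the same layout); it is not the official loader.

```python
import re

# Sketch of parsing one Semeval-2010-Task-8-style entry into its parts
# (illustrative; real files also carry a Comment line per example).

entry = '1\t"The <e1>fire</e1> caused severe <e2>damage</e2>."\nCause-Effect(e1,e2)'

def parse_entry(raw):
    lines = raw.strip().split("\n")
    sent = lines[0].split("\t", 1)[1].strip('"')   # drop the ID and quotes
    e1 = re.search(r"<e1>(.*?)</e1>", sent).group(1)
    e2 = re.search(r"<e2>(.*?)</e2>", sent).group(1)
    return {"sentence": sent, "e1": e1, "e2": e2, "relation": lines[1]}

ex = parse_entry(entry)
```

Because PERLEX preserves this layout, code written against the English dataset can be pointed at the Persian files with minimal changes.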

Experiments and Analyses of Relation Extraction
In this section, we report experimental settings and classification results of six different models: Baseline, CNN, Att-BLSTM, BLSTM-LET, R-BERT, and BERTEM-MTB.

Experimental Setup
In PERLEX, we adopt the nine relation classes of the Semeval-2010-Task-8 dataset mentioned in Section 2. Each class has two variations specifying the placement order of the subject and object in the sentence. For instance, the Cause-Effect class has two variants: Cause-Effect(e1,e2) and Cause-Effect(e2,e1).
Generally, the classification results can be evaluated in several ways, ranging from taking both directional variations of each class into account to using only one variation of each class (and ignoring directionality).
Moreover, there are two methods to compute the F1-score, namely micro-averaging and macro-averaging. Additionally, the pairs of entities that do not fall into any of the nine main classes are labelled as "Other" in the dataset and do not participate in the evaluation. We adopt the official evaluation method of the Semeval-2010-Task-8 dataset, which is (9+1)-way classification with a macro-averaged F1-score, taking directionality into account. "(9+1)-way" means that the nine main classes plus "Other" are used in training and testing, but "Other" is ignored when the F1-scores are calculated. In all non-BERT-based experiments, we use 300-dimensional word embeddings pre-trained by Poostchi et al. [32] and use 10% of the training set as the development set. Figure 1 illustrates the official F1-scores for each model. As expected, the results are lower for Persian than for English. This drawback is due to the many challenges of processing Persian, a free-word-order and more ambiguity-prone language. As in English, the performance of BERTEM-MTB surpasses that of the other five methods in Persian; moreover, BERTEM-MTB is the superior method in all nine classes.
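The (9+1)-way macro-averaged scoring described above can be sketched in a few lines: per-class F1 is computed over the directional relation classes, while "Other" remains a valid prediction target but is excluded from the macro-average. This is a toy re-implementation for illustration, not the official scorer.

```python
# Sketch of macro-averaged F1 with the "Other" class excluded, as in the
# official Semeval-2010-Task-8 evaluation (toy label set for illustration).

def macro_f1(gold, pred, ignore=("Other",)):
    classes = sorted(set(gold) - set(ignore))
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["Cause-Effect(e1,e2)", "Other", "Cause-Effect(e2,e1)", "Cause-Effect(e1,e2)"]
pred = ["Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)", "Other", "Cause-Effect(e1,e2)"]
score = macro_f1(gold, pred)
```

Note that predicting "Other" for a gold relation still counts as a false negative for that relation, so "Other" affects the score even though it has no per-class F1 of its own.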

Baseline
As the baseline for the Persian language, we train a Logistic Regression Classifier (LRC) with the L2R LR solver on the following features: word IDs (unique IDs for each word in the dataset), Part-of-Speech (POS) tags for each pair of entities, the words between the two entities, the POS tags of the words between the two entities, the dependency relations (and their directions) between two consecutive entities, and a bag of words.
Based on the obtained results, the official F1-score for logistic regression on the PERLEX dataset is 57.42%. It should be noted that for English, we report the results of the logistic regression baseline of LightRel [27].

CNN
For the CNN model, we use four different kernel lengths (2, 3, 4, and 5) and concatenate the outputs of these kernels. We set the number of kernels for each length to 128. We also use dropout [33] and L2 regularization to prevent over-fitting. Based on the obtained results, the official F1-score of CNN on the PERLEX dataset is 69.28%.

Att-BLSTM
We use one layer of bidirectional LSTM and set the hidden state size to 100. To prevent over-fitting, we use L2 regularization, recurrent dropout, and regular dropout. Based on the obtained results, the official F1-score of Att-BLSTM on the PERLEX dataset is 69.61%.

BLSTM-LET
We use four attention heads in the multi-head attention layer and set the layer size to 50 for each head. The hidden state size of the LSTM is set to 300. As in the previous model, recurrent and regular dropout, as well as L2 regularization, are used. Based on the obtained results, the official F1-score of BLSTM-LET on the PERLEX dataset is 70.79%.

R-BERT
We fine-tune the pre-trained BERT-base model for this method. The other hyper-parameters are given in Table 2. Based on the obtained results, the official F1-score of R-BERT on the PERLEX dataset is 75.31%.

BERTEM-MTB
We fine-tune the pre-trained BERT-base model for this method. The other hyper-parameters are given in Table 2. Based on the obtained results, the official F1-score of BERTEM-MTB on the PERLEX dataset is 77.66%.

Results Per Classes
The final results for individual classes can be seen in Table 3. As shown, the F1 measure of the BERTEM-MTB model is higher than that of the other models in all classes. In addition, in almost all classes, F1 increases steadily from the lowest value for the baseline up to the BERT-based models. However, the models behave differently on the Instrument-Agency class, where the baseline outperforms all models except BERTEM-MTB. The reason is that the baseline model uses dependency relations and their directions between two consecutive entities, while the other models do not use this information. Sentences containing the Instrument-Agency relation are very similar in terms of their dependency trees; consequently, the baseline model, which uses dependency tree information, has learned to detect this kind of relationship by observing a similar pattern.

Conclusion
In this paper, the Relation Extraction (RE) task in the Persian language is addressed for the first time. For this purpose, we first proposed a bilingual version of the Semeval-2010-Task-8 dataset, dubbed PERLEX. Then, having investigated the state-of-the-art language-agnostic methods for RE in English, we adapted and customized some of these methods for Persian. Moreover, a logistic regression algorithm with syntactic and semantic features was employed as a baseline. The experimental results not only confirmed the superiority of the BERT-based models over the baseline and the other deep learning models but also demonstrated their comparability to similar state-of-the-art methods in English. However, due to particular challenges of processing Persian, such as its free word order and its more ambiguity-prone nature compared to English, the performance of the customized methods in Persian was lower than their performance in English.
As future work, more accurate Persian word embeddings can be developed and applied to improve the results of the non-BERT-based models. Moreover, by designing training steps tailored to the characteristics of the Persian language, a novel BERT-based RE model could be proposed for Persian.

Acknowledgments
We gratefully acknowledge the active collaboration of Dr. Sayyed Ali Hossayni and Mr. Kamyar Darvishi during the conduct of this research.