Russian-English dataset and comparative analysis of algorithms for cross-language embedding-based entity alignment

The problem of fusing data from databases and knowledge graphs in different languages is becoming increasingly important. The main step of such a fusion is identifying equivalent entities in different knowledge graphs and merging their descriptions. This problem is known as identity resolution, or entity alignment. Recently, a large group of new entity alignment methods has emerged. These methods compute so-called "embeddings" of entities and establish the equivalence of entities by comparing their embeddings. This paper presents experiments with embedding-based entity alignment algorithms on a Russian-English dataset. The purpose of this work is to identify language-specific features of the entity alignment algorithms. Future research directions are also outlined.


Introduction
Knowledge graphs (KGs) are widely used as prior knowledge in applications such as recommender systems, decision-making systems, question answering, etc. The more powerful an underlying knowledge graph is, the higher the quality of the applications based on it. Therefore, the problem of finding equivalent entities in several knowledge graphs and merging them into a unified knowledge graph is becoming increasingly important. In the database context, this task is known as entity alignment, entity deduplication, and identity resolution. Various methods for establishing the similarity of the symbolic features of entities have been studied extensively [1]. In the past few years, interest in the integration of multilingual knowledge graphs has increased, and the need for data fusion from multilingual knowledge graphs has made the problem of entity alignment (EA) vitally important. The intuition behind all entity alignment algorithms is that equivalent entities should have equivalent relations and attributes in different knowledge graphs. However, real KGs have different origins and schemas, and the relations and attributes of entities can vary. For example, the English entity dbr:War_and_Peace_film_series in figure 1 has an attribute dbo:distributor, which the corresponding Russian entity Война и мир does not have. Also, information about the cinematographer of the English entity is represented by an attribute triple, while its Russian counterpart has a relational triple instead. The multiple inconsistencies between knowledge graphs call for a variety of methods that can cope with these differences.
Since the creation of new methods is based mainly on the developer's intuition, it is essential to establish a common basis for understanding and comparing these various methods. Currently, this common basis is formed by testing the algorithms on a unified dataset. For example, the OpenEA library [2] (https://github.com/nju-websoft/OpenEA) collects embedding-based entity alignment algorithms. These algorithms are analysed and compared on datasets containing alignments between English-German, English-French and English-Chinese entities of DBpedia 2016-10.
However, each language version of a knowledge graph has its own specifics, and the language-specific features of the EA algorithms need to be studied and interpreted appropriately. This paper presents the results of experiments with several groups of entity alignment algorithms tested on a Russian-English dataset.

The choice of embedding-based algorithms for entity alignment
The embedding-based entity alignment algorithms differ in the type of triples used to obtain an embedding. All currently known approaches use relational triples to construct structure embeddings, and some of them also use literal triples to generate attribute embeddings. Recent algorithms increasingly try to leverage the names of entities, represented by the rdfs:label relationships, to create name embeddings. There are three main approaches to constructing structure embeddings: translational (triple-based), path-based, and neighbourhood-based embeddings [3][4][5].

Triple-based, or translational embeddings [3][4][5]. Translational models interpret a relation as a translation vector from a head entity to a tail entity. The current methods represent entities as embeddings and look for entity alignments by evaluating the similarity between these embeddings. This approach captures the local semantics of relational triples. Each relational triple of the form tr = (head entity, relation, tail entity) is associated with three vector representations h, r, t. The vector r representing the relation is considered a geometric transformation, for example, the shift of the vector t with respect to h. The "energy" function of each relational triple is calculated as ||h + r - t||, and stochastic gradient descent is used to optimize it. The strong assumption about the translational nature of relationships makes this approach unsuitable for modelling more complex information about relationships. For example, these methods are not able to correctly handle all transitive relationships, such as "ancestor" or "descendant".
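As a minimal illustration of the translational scoring described above, the following numpy sketch computes the energy ||h + r - t|| for a toy triple; the embedding values are invented for illustration and do not come from any trained model:

```python
import numpy as np

def energy(h, r, t):
    """Translational energy ||h + r - t||: low for plausible triples."""
    return np.linalg.norm(h + r - t)

# Toy 3-dimensional embeddings (illustrative values only).
h = np.array([0.1, 0.2, 0.3])
r = np.array([0.4, 0.0, -0.1])
t = np.array([0.5, 0.2, 0.2])

good = energy(h, r, t)    # near zero: the triple fits the translation
bad = energy(h, r, -t)    # larger: a corrupted triple scores worse
```

Training minimizes this energy for observed triples while keeping it high for corrupted ones.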
Path-based embeddings [6]. This approach attempts to identify the relationships that can be represented as a combination of relationships corresponding to some directed path in the knowledge graph. For example, the facts (p1 has the son p2) and (p2 has the son p3) suggest that (p1 has the grandson p3). That is, if there is a sequence of interconnected relational triples of the form (e1, r1, e2), (e2, r2, e3), …, (en-1, rn-1, en), the path-based approach tries to find an embedding of the relation r* = comb(r1, r2, ..., rn). Addition, multiplication, etc., can be used as the combination operation. The latest variations of this approach use various modifications of recurrent neural networks (RNNs) to combine the embeddings of relations.
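Under the simplest combination operations (addition and elementwise multiplication), the path composition described above can be sketched as follows; `compose_path` and the toy relation vectors are illustrative names, not part of any cited implementation:

```python
import numpy as np

def compose_path(relations, method="add"):
    """Combine the relation vectors along a path into one vector r*."""
    if method == "add":
        return np.sum(relations, axis=0)       # r* = r1 + r2 + ... + rn
    if method == "mul":
        return np.prod(relations, axis=0)      # elementwise product
    raise ValueError(f"unknown combination method: {method}")

# Two hops of the same relation, e.g. "has son" followed by "has son".
r1 = np.array([0.2, -0.1, 0.4])
r2 = np.array([0.2, -0.1, 0.4])
r_star = compose_path([r1, r2])    # candidate embedding for "has grandson"
```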
Neighbourhood-based embeddings [7,8]. The embeddings of knowledge graph entities are constructed iteratively from the embeddings of adjacent entities. These methods are based on the assumption that equivalent entities from different KGs have similar neighbourhood structures, so similar embeddings should be generated for them. However, due to the incompleteness and sparsity of real-life knowledge graphs, these neighbourhoods can differ considerably in size and topological structure, as shown in figure 1. This approach uses various versions of graph convolutional networks (GCNs) to construct the embeddings of entities.
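A single propagation step of the GCNs mentioned above can be sketched in numpy. This is the textbook propagation rule H' = σ(D^-1/2 (A+I) D^-1/2 H W), not the exact variant used by any particular EA system:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step with self-loops and symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Each node's new features mix its own and its neighbours' features.
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

# Tiny 3-node path graph: 0-1 and 1-2 connected.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)                  # one-hot input features
W = np.full((3, 2), 0.5)       # toy weight matrix
H1 = gcn_layer(A, H, W)
```

Stacking two such layers, as GCN-Align does, lets information flow from two-hop neighbourhoods.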
For experiments with the Russian-English dataset, the following algorithms have been used. JAPE [3] combines structural and attribute embeddings to map entities from different knowledge graphs. The structural embedding is built on the basis of an "entity overlay graph" created from two knowledge graphs. The attribute embedding is built on top of a Skip-gram model that attempts to capture attribute correlations. JAPE uses information about the types of attributes but does not use their specific values.
BootEA [4] exploits a triple-based algorithm to obtain embeddings. The algorithm utilizes a special way of obtaining "negative" triples, the so-called truncated uniform negative sampling, which replaces the head or tail entity of a given positive triple with a random entity from its s nearest neighbours (s is a hyperparameter). To use the existing interlanguage links for training, new triples are created: for each pre-aligned pair of entities, every triple that includes one of the paired entities is duplicated, with that entity replaced by its counterpart from the other knowledge graph. An important property of BootEA is the iterative labelling of plausible alignments, which are then used as training data. Moreover, on subsequent iterations, entities can change their label or become unlabelled if the newly generated alignments lead to conflicts.
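The truncated negative sampling described above can be sketched roughly as follows. This is a simplification: the real BootEA sampler has further details, and the function name, toy embeddings and parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_negatives(triple, emb, s, n_neg):
    """Corrupt (h, r, t) by replacing the head or tail entity with one of
    its s nearest neighbours in embedding space."""
    h, r, t = triple
    negatives = []
    for _ in range(n_neg):
        pos = h if rng.random() < 0.5 else t
        dists = np.linalg.norm(emb - emb[pos], axis=1)
        # Indices of the s nearest entities, excluding the entity itself.
        nearest = np.argsort(dists)[1:s + 1]
        repl = int(rng.choice(nearest))
        negatives.append((repl, r, t) if pos == h else (h, r, repl))
    return negatives

emb = rng.normal(size=(10, 4))      # toy embeddings for 10 entities
negs = truncated_negatives((0, 3, 5), emb, s=3, n_neg=4)
```

Sampling negatives from near neighbours rather than uniformly makes them harder to distinguish from the positive triple, which sharpens the learned embeddings.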
MultiKE [5] builds three types of embeddings for each entity, using different "views": a name view, a relation view, and an attribute view. Each of the views is built according to its own algorithm. For example, for each word in the entity name, a vector is obtained using fastText [9]; if no such vector exists, the word vector is obtained by summing the vectors of its characters produced by a character embedding algorithm. The word embeddings are summed to obtain the name view embedding, which is directly involved in training the model. The relation view embedding is constructed as a standard translational structure embedding. The attribute view embedding is generated using a convolutional neural network (CNN). The final embedding of an entity can be obtained by combining the three views described above in different ways.
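The name-view construction described above (summed word vectors with a character-level fallback for out-of-vocabulary words) might look as follows; the tiny lookup tables stand in for real fastText word vectors and learned character embeddings:

```python
import numpy as np

# Hypothetical lookup tables (2-dimensional, illustrative values only).
word_vec = {"war": np.array([1.0, 0.0]), "peace": np.array([0.0, 1.0])}
char_vec = {c: np.full(2, 0.1) for c in "abcdefghijklmnopqrstuvwxyz"}

def name_embedding(name):
    """Sum word vectors; fall back to summed character vectors for words
    missing from the word-vector table."""
    total = np.zeros(2)
    for word in name.lower().split():
        if word in word_vec:
            total += word_vec[word]
        else:
            total += sum(char_vec.get(c, np.zeros(2)) for c in word)
    return total

v = name_embedding("War and Peace")   # "and" is out-of-vocabulary here
```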
RSN4EA [6] exploits "relational paths", alternating the chains of entities and relationships, to build the embeddings of relationships and entities. The head and tail of the relational path must be entities. A 4 conventional way to model relational paths is recurrent neural networks (RNSs). However, regular recurrent neural networks do not distinguish relationships from entities in a relational path. Therefore, RSN4EA uses a "skip" recurrent network architecture (RSN). This architecture allows the relational path entities to participate in predicting not only the relationship, but the object entity as well.
GCN-Align [7] leverages graph convolutional networks that build the embeddings of entities (graph vertices) based on vertex adjacency information. The algorithm uses two two-layer GCNs, each of which constructs embeddings for one knowledge graph in a unified vector space. This approach assigns two feature vectors to each entity in the GCN layers: a structure feature vector and an attribute feature vector. The final outputs of the two GCNs are further used to discover entity alignments. Correspondence between entities is established based on the distances between their structure and attribute embeddings.
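The final alignment decision combines the two distance channels; a minimal sketch, assuming a weighted L1 combination with a balancing hyperparameter beta (the exact weighting of the published system may differ):

```python
import numpy as np

def alignment_distance(hs1, hs2, ha1, ha2, beta=0.9):
    """Combine structure (hs) and attribute (ha) embedding distances;
    beta balances the two channels, each normalized by its dimension."""
    d_struct = np.linalg.norm(hs1 - hs2, ord=1) / hs1.size
    d_attr = np.linalg.norm(ha1 - ha2, ord=1) / ha1.size
    return beta * d_struct + (1 - beta) * d_attr

# Two toy candidate pairs: identical structure vs. identical attributes.
d = alignment_distance(np.zeros(4), np.ones(4), np.zeros(2), np.ones(2))
```

A pair is predicted as an alignment when this distance is minimal among all candidates.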
RDGCN [8]. Graph convolutional networks are not well suited to dealing with directed graphs. Therefore, the RDGCN approach uses not only the structure of the original knowledge graphs (the primal graph) to construct embeddings, but also auxiliary graphs that are dual to the original ones (dual relation graphs). The vertices of a dual relation graph are the edges of the original graph. To implement the interaction between the original knowledge graphs and the dual relation graphs, the Graph Attention Network (GAT) [10] mechanism is used. The resulting embeddings of the original graphs are then fed into graph convolutional networks to extract information about the structure of a vertex neighbourhood.
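The construction of a dual relation graph can be sketched as follows: relations of the primal graph become vertices, and two relations are connected when they share an entity. This is a simplified reading of the construction; RDGCN additionally weights these dual edges, which is omitted here:

```python
from collections import defaultdict
from itertools import combinations

def dual_relation_graph(triples):
    """Vertices of the dual graph are the relations of the primal KG;
    two relations are connected when some entity participates in both."""
    by_entity = defaultdict(set)
    for h, r, t in triples:
        by_entity[h].add(r)
        by_entity[t].add(r)
    edges = set()
    for rels in by_entity.values():
        for r1, r2 in combinations(sorted(rels), 2):
            edges.add((r1, r2))
    return edges

# Toy primal triples: "director" and "producer" share the entity p1.
triples = [("p1", "director", "f1"), ("p1", "producer", "f2"),
           ("p2", "producer", "f2")]
dual_edges = dual_relation_graph(triples)
```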

A Russian-English dataset for the EA experiments
A Russian-English dataset comprehensible to a Russian-speaking user has been created using the IDS algorithm [2]. This dataset contains 15K and 100K pre-aligned entities from the English and Russian versions of DBpedia 2016-10 (https://wiki.dbpedia.org/downloads-2016-10/). Similar to the already existing DBP15K and DBP100K datasets [2] containing the German-English, French-English and Chinese-English data, version 1 (V1) is obtained by directly applying the IDS algorithm. To construct version 2 (V2), low-degree entities (with a degree not exceeding five) in the source KG are randomly removed to double the average degree, and then IDS is applied to the new KG.
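The V2 preprocessing step can be sketched as follows; the deterministic removal below is a stand-in for the random sampling of low-degree entities described above, and the parameter names are illustrative:

```python
from collections import Counter

def drop_low_degree(triples, max_degree=5, drop_every=2):
    """Remove a share of the entities whose degree does not exceed
    max_degree, raising the average degree of the remaining KG."""
    deg = Counter()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
    low = sorted(e for e, d in deg.items() if d <= max_degree)
    removed = set(low[::drop_every])   # deterministic stand-in for random choice
    return [tr for tr in triples if tr[0] not in removed and tr[2] not in removed]

# Toy star-shaped KG: "a" has degree 6, all other entities have degree 1.
triples = [("a", "r", tail) for tail in "bcdefg"]
dense = drop_low_degree(triples)
```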

Metrics for assessing the quality of embedding-based entity alignment algorithms
The Hits@k metric is used to analyse the results of the EA algorithms. Hits@k = n% means that for n percent of entities from one knowledge graph, the equivalent entity from the second knowledge graph is among the k nearest neighbours in the embedding space. Obviously, Hits@1 is the most indicative metric, since it is equivalent to precision. If Hits@1 = 100%, the algorithm finds the exact counterparts of all the entities. Table 1 and table 2 show the final Hits@k metrics, where k = 1, 5, 10, 50, 100, and the running time of each of the selected algorithms on the sparse and dense Russian-English datasets. It can be seen that BootEA has the best Hits@k metrics; however, on the sparse dataset, its results are quite close to those of RSN4EA, GCN-Align and JAPE. On the dense Russian-English dataset, the superiority of BootEA and RSN4EA over the other solutions becomes obvious. However, these results differ from those obtained on the original DBP15K and DBP100K datasets [2], where the MultiKE and RDGCN algorithms demonstrate very good results. The reason for this difference is that the literal embeddings of MultiKE exploit pretrained word embeddings for the respective languages, and this advantage is not reproduced for the Russian-English pair. Another algorithm that demonstrated an essential quality degradation on the Russian-English dataset, compared to the standard datasets, is RDGCN. Again, RDGCN leveraged a technique that translated the Chinese, German and French entity names into English and then utilised pretrained English word vectors to construct an input entity representation for the primal graph. Similar adjustments for Russian will be a subject of further investigation.
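Given the rank of each test entity's true counterpart among its nearest neighbours, Hits@k can be computed directly; the toy rank list below is invented for illustration:

```python
def hits_at_k(ranks, k):
    """Percentage of test entities whose true counterpart appears within
    the top-k nearest neighbours (ranks are counted from 1)."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

# Toy ranks of the correct counterpart for five test entities.
ranks = [1, 3, 1, 12, 60]
h1 = hits_at_k(ranks, 1)     # Hits@1: two of five counterparts rank first
h10 = hits_at_k(ranks, 10)   # Hits@10: three of five within the top ten
```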
To gain more insight into the results of the EA algorithms, we used a program that takes the id of an English entity as input and outputs its ten nearest neighbours in the embedding space. The program has options allowing the user to choose the graph from which the nearest neighbours of a given entity are taken.
For example, the ten nearest neighbours of the entity id4 (http://dbpedia.org/resource/Hard_rock), produced by the GCN-Align algorithm, are shown in figure 2(a). The first line contains the entity whose nearest elements in the vector space are being searched for, and the next lines contain its ten nearest neighbours. It can be seen that all the nearest neighbours are related to music, but their semantic closeness is difficult to estimate for a non-musician. Another EA result produced by the GCN-Align algorithm is shown in figure 2(b). The nearest neighbours of the entity id12550 (http://dbpedia.org/resource/Vladimir_Korotkov,_born_1941) turned out to be football players and football clubs. Note that the Russian equivalent of the English entity is situated in the 4th line. One does not need to be a specialist in football to see that nobody except http://ru.dbpedia.org/resource/Коротков,_Владимир_Петрович can be a counterpart of this English entity. To understand this, it would be easier to compare the string similarity of the entities' names. Thus, we can suppose that it is necessary to investigate hybrid methods exploiting embedding-based methods in combination with conventional ones. Moreover, these hybrid combinations can depend on the particular language pair and the types of entities.
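A hybrid method of the kind suggested above could, for instance, rerank the embedding-based candidates by the string similarity of entity names; `rerank`, the weighting `alpha`, and the toy data below are hypothetical, not part of any evaluated system:

```python
import numpy as np
from difflib import SequenceMatcher

def rerank(entity_vec, entity_name, cand_vecs, cand_names, alpha=0.5):
    """Score candidates by mixing embedding distance with name
    dissimilarity; alpha weighs the two signals. Returns the index of
    the best candidate."""
    scores = []
    for vec, name in zip(cand_vecs, cand_names):
        d_emb = np.linalg.norm(entity_vec - vec)
        d_name = 1.0 - SequenceMatcher(None, entity_name, name).ratio()
        scores.append(alpha * d_emb + (1 - alpha) * d_name)
    return int(np.argmin(scores))

# Two toy candidates: the nearest one in embedding space has an
# unrelated name; string similarity breaks the near-tie.
cand_vecs = [np.array([0.0, 0.1]), np.array([0.05, 0.1])]
cand_names = ["Spartak Moscow", "Vladimir Korotkov"]
best = rerank(np.zeros(2), "Vladimir Korotkov", cand_vecs, cand_names)
```

In this toy example the name signal overrides the marginally closer but unrelated embedding neighbour, mirroring the Korotkov case discussed above.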