Link Prediction Using Multi Part Embeddings

Knowledge graph embeddings models are widely used to provide scalable and efficient link prediction for knowledge graphs. They use different techniques to model embeddings interactions, where their tensor factorisation based versions are known to provide state-of-the-art results. In recent works, developments on factorisation based knowledge graph embedding models were mostly limited to enhancing the ComplEx and the DistMult models, as they can efficiently provide predictions within linear time and space complexity. In this work, we aim to extend the works of the ComplEx and the DistMult models by proposing a new factorisation model, TriModel, which uses three part embeddings to model a combination of symmetric and asymmetric interactions between embeddings. We perform an empirical evaluation for the TriModel model compared to other tensor factorisation models on different training configurations (loss functions and regularisation terms), and we show that the TriModel model provides the state-of-the-art results in all configurations. In our experiments, we use standard benchmarking datasets (WN18, WN18RR, FB15k, FB15k-237, YAGO10) along with a new NELL based benchmarking dataset (NELL239) that we have developed.


Introduction
In recent years, knowledge graph embedding (KGE) models have witnessed rapid developments that have allowed them to excel in the task of link prediction for knowledge graphs [22]. They learn embeddings using different techniques like tensor factorisation, latent distance similarity and convolutional filters in order to rank facts in the form of (subject, predicate, object) triples according to their factuality. In this context, their tensor factorisation based versions like the DistMult [23] and the ComplEx [21] models are known to provide state-of-the-art results within linear time and space complexity [22]. The scalable and efficient predictions achieved by these models have encouraged researchers to investigate advancing the DistMult and the ComplEx models by utilising different training objectives and regularisation terms [8,9]. In this work, our objective is to propose a new factorisation based knowledge graph embedding model that extends the works of the DistMult and the ComplEx models while preserving their linear time and space complexity. We achieve that by modifying two of their main components: the embedding representation, and the embedding interaction function.
While both the DistMult and the ComplEx models use the bilinear product of the subject, the predicate and the object embeddings as an embedding interaction function to encode knowledge facts, they represent their embeddings using different systems. The DistMult model uses real values to represent its embedding vectors, which leads to learning a symmetric representation of all predicates due to the symmetric nature of the product operator on real numbers. On the other hand, the ComplEx model represents embeddings using complex numbers, where the embedding of each entity or relation is represented using two vectors (real and imaginary parts). The ComplEx model also represents entities in the object mode as the complex conjugate of their subject form [21]. This enables the ComplEx model to encode both symmetric and asymmetric predicates.
Since the embeddings of the ComplEx model are represented using two part embeddings (real and imaginary parts), their bilinear product (ComplEx's embedding interaction function) consists of different interaction components, unlike the DistMult model, which has only one bilinear product component. Each of these components is a bilinear product of a combination of real and imaginary vectors of the subject, the predicate and the object embeddings, which gives the ComplEx model its ability to model asymmetric predicates.
In this work, we investigate both the embedding representation and the embedding interaction components of the ComplEx model, where we show that the ComplEx embedding interaction components are sufficient but not necessary to model asymmetric predicates. We also show that our proposed model, TriModel, can efficiently encode both symmetric and asymmetric predicates using simple embedding interaction components that rely on embeddings of three parts. To assess our model compared to the ComplEx model, we carry out experiments on both models using different training objectives and regularisation terms, where our results show that our new model, TriModel, provides equivalent or better results than the ComplEx model on all configurations. We also propose a new NELL [12] based benchmarking dataset that contains a small number of training, validation and testing facts that can be used to facilitate fast development of new knowledge graph embedding models.

Background and Related Works
Knowledge graph embedding models learn low-rank vector representations, i.e. embeddings, for graph entities and relations. In the link prediction task, they learn embeddings in order to rank knowledge graph facts according to their factuality. The process of learning these embeddings consists of different phases. First, the embeddings are initialised using random noise. These embeddings are then used to score a set of true and false facts, where the score of a fact is generated by computing the interaction between the fact's subject, predicate and object embeddings using a model dependent scoring function. Finally, the embeddings are updated by a training loss that usually represents a min-max loss, where the objective is to maximise the scores of true facts and minimise the scores of false facts.
In this section we discuss scoring functions and training loss functions in state-of-the-art knowledge graph embedding models. We define our notation as follows: for any given knowledge graph, $E$ is the set of all entities, $R$ is the set of all relations i.e. predicates, $N_e$ and $N_r$ are the numbers of entities and relations respectively, $T$ is the set of all known true facts, $e$ and $w$ are matrices of sizes $N_e \times K$ and $N_r \times K$ respectively that represent entity and relation embeddings of rank $K$, $\phi_{spo}$ is the score of the triple $(s, p, o)$, and $\mathcal{L}$ is the model's training loss.

Scoring Functions
Knowledge graph embedding models generate scores for facts using model dependent scoring functions that compute interactions between facts' components embeddings. These functions use different approaches to compute embeddings interactions like distance between embeddings [2], embedding factorisation [21] or embeddings convolutional filters [5].
In the following, we present these approaches and specify some examples of knowledge graph embedding models that use them.
• Distance-based embeddings interactions: The Translating Embedding model (TransE) [2] is one of the early models that use the distance between embeddings to generate triple scores. It interprets a triple's embedding interactions as a linear translation of the subject to the object such that $e_s + w_p = e_o$, and generates a score for a triple as follows:
$$\phi^{\text{TransE}}_{spo} = \lVert e_s + w_p - e_o \rVert_{\ell_1/\ell_2} \tag{1}$$
where true facts have zero score and false facts have higher scores. This approach provides scalable and efficient embeddings learning as it has linear time and space complexity. However, it fails to provide an efficient representation for interactions in one-to-many, many-to-many and many-to-one predicates, as its design assumes one object per subject-predicate combination.
• Factorisation-based embedding interactions: Interactions based on embedding factorisation provide better representation for predicates with high cardinality. They have been adopted in models like DistMult [23] and ComplEx [21]. The DistMult model uses the bilinear product of embeddings of the subject, the predicate, and the object as their interaction, and its scoring function is defined as follows:
$$\phi^{\text{DistMult}}_{spo} = \sum_{k=1}^{K} e_{sk}\, w_{pk}\, e_{ok} \tag{2}$$
where $e_{sk}$ is the k-th component of the embedding vector $e_s$ of subject entity $s$. DistMult achieved a significant improvement in accuracy in the task of link prediction over models like TransE. However, the symmetry of its scoring function affects its predictive power on asymmetric predicates, as it cannot capture the direction of the predicate. On the other hand, the ComplEx model uses embeddings in a complex form to model data with asymmetry. It models embeddings interactions using the product of complex embeddings, and its scores are defined as follows:
$$\phi^{\text{ComplEx}}_{spo} = \mathrm{Re}\Big(\sum_{k=1}^{K} e_{sk}\, w_{pk}\, \bar{e}_{ok}\Big) = \sum_{k=1}^{K} \big( e^r_{sk} w^r_{pk} e^r_{ok} + e^i_{sk} w^r_{pk} e^i_{ok} + e^r_{sk} w^i_{pk} e^i_{ok} - e^i_{sk} w^i_{pk} e^r_{ok} \big) \tag{3}$$
where $\mathrm{Re}(x)$ represents the real part of complex number $x$ and all embeddings are in complex form such that $e, w \in \mathbb{C}$, $e^r$ and $e^i$ are respectively the real and imaginary parts of $e$, and $\bar{e}_o$ is the complex conjugate of the object embedding $e_o$ such that $\bar{e}_o = e^r_o - i e^i_o$, which introduces asymmetry to the scoring function. Using this notation, ComplEx can handle data with asymmetric predicates, and to keep scores in the real space it only uses the real part of the embeddings product. ComplEx preserves both linear time and linear space complexity as in TransE and DistMult; however, it surpasses their accuracy in the task of link prediction due to its ability to model a wider set of predicate types.
• Convolution-based embeddings interactions: Following the success of convolutional neural networks in image processing tasks, models like R-GCN [17] and ConvE [5] utilised convolutional networks to learn knowledge graph embeddings. The R-GCN model learns entity embeddings using a combination of convolutional filters of its neighbours, where each predicate represents a convolution filter and each neighbour entity represents an input for the corresponding predicate filter. This approach is combined with the DistMult model to perform link prediction. Meanwhile, the ConvE model concatenates subject and predicate embedding vectors into an image (a matrix form), then uses a 2D convolutional pipeline to transform this matrix into a vector and computes its interaction with the object entity embeddings to generate a corresponding score as follows:
$$\phi^{\text{ConvE}}_{spo} = f\big(\mathrm{vec}\big(f([\bar{e}_s; \bar{w}_p] * \omega)\big)\, W\big)\, e_o \tag{4}$$
where $\bar{e}_s$ and $\bar{w}_p$ denote a 2D reshaping of $e_s$ and $w_p$, $\omega$ is a convolution filter, $W$ is a linear projection matrix, $f$ denotes a non-linear function, and $\mathrm{vec}(x)$ is a transformation function that reshapes a matrix $x$ of size $m \times n$ into a vector of size $mn \times 1$.
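The three closed-form scoring functions above (TransE, DistMult and ComplEx) can be illustrated with a minimal sketch. It is not the implementation used in the experiments: plain Python lists stand in for embedding vectors, the L1 norm is assumed for TransE, and the function names and toy values are hypothetical.

```python
# Minimal sketch of three scoring functions on toy embeddings.
# Function names and values are illustrative assumptions.

def transe_score(e_s, w_p, e_o):
    """L1 distance ||e_s + w_p - e_o||; lower means more plausible."""
    return sum(abs(s + p - o) for s, p, o in zip(e_s, w_p, e_o))

def distmult_score(e_s, w_p, e_o):
    """Sum over k of e_sk * w_pk * e_ok (real-valued embeddings)."""
    return sum(s * p * o for s, p, o in zip(e_s, w_p, e_o))

def complex_score(e_s, w_p, e_o):
    """Re(sum_k e_sk * w_pk * conj(e_ok)) with complex-valued embeddings."""
    return sum(s * p * o.conjugate() for s, p, o in zip(e_s, w_p, e_o)).real

# Real-valued toy embeddings for DistMult
a, b, r = [0.5, -1.0], [2.0, 0.25], [1.5, -0.5]
# Complex-valued toy embeddings for ComplEx
ac = [0.5 + 1j, -1.0 + 0.5j]
bc = [2.0 - 1j, 0.25 + 2j]
rc = [1.5 + 0.5j, -0.5 - 1j]

# DistMult gives the same score when subject and object are swapped
print(distmult_score(a, r, b), distmult_score(b, r, a))
# ComplEx can give different scores for the two directions
print(complex_score(ac, rc, bc), complex_score(bc, rc, ac))
```

Swapping subject and object leaves the DistMult score unchanged, whereas the conjugate in the ComplEx score lets the two directions differ, which is the asymmetry discussed above.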

Loss Functions
The task of link prediction can generally be cast as a learning to rank problem where the objective is to rank knowledge graph triples according to their factuality. Thus, knowledge graph embedding models traditionally use ranking loss approaches like pairwise and pointwise loss functions, as in TransE and ComplEx respectively, to model their training loss during the learning process.
In these approaches a set of negative facts i.e. corruptions, is generated using a uniform random sample of entities to represent false facts, where training loss uses a min-max approach to maximise true facts scores and minimise false facts scores. Meanwhile, recent attempts considered using a multi-class loss to represent training error, where a triple (s, p, o) is divided into an input (s, p) and a corresponding class o and the objective is to assign class o to the (s, p) input.
In the following, we discuss these two approaches with examples from state-of-the-art knowledge graph embedding models.
• Ranking loss functions: Knowledge graph embedding models have adopted different pointwise and pairwise ranking losses like hinge loss and logistic loss to model their training loss. Hinge loss can be interpreted as a pointwise loss or a pairwise loss that minimises the scores of negative facts and maximises the scores of positive facts to reach a specific configurable value. This approach is used in HolE [15], and it is defined as:
$$\mathcal{L}^{\text{hinge}} = \sum_{x \in X} [\gamma - l(x) \cdot \phi_x]_+ \tag{5}$$
where $X$ is the set of true and corrupted training facts, $\phi_x$ is the score of fact $x$, $\gamma$ is the configurable margin, $l(x) = 1$ if $x$ is true and $-1$ otherwise, and $[c]_+$ is equal to $\max(c, 0)$. This effectively generates two different loss slopes for positive and negative scores as shown in Fig. 1.
The squared error loss can also be adopted as a pointwise ranking loss function. For example, the RESCAL [16] model uses the squared error to model its training loss with the objective of minimising the difference between model scores and their actual labels:
$$\mathcal{L}^{\text{SE}} = \frac{1}{2} \sum_{x \in X} (\phi_x - l(x))^2 \tag{6}$$
where $X$ is the set of training facts, $\phi_x$ is the score of fact $x$ and $l(x) \in \{0, 1\}$ is its label. The optimal score for true and false facts is 1 and 0, respectively, as shown in Fig. 1. Also, the squared loss requires less training time since it does not require configurable training parameters, shrinking the search space of hyperparameters compared to other losses (e.g., the margin parameter of the hinge loss).
The ComplEx [21] model uses a logistic loss, which is a smoother version of the pointwise hinge loss without the margin requirement (cf. Fig. 1). Logistic loss uses a logistic function to minimise the scores of negative triples and maximise the scores of positive triples. This is similar to hinge loss, but uses a smoother loss slope defined as:
$$\mathcal{L}^{\text{logistic}} = \sum_{x \in X} \log\big(1 + \exp(-l(x) \cdot \phi_x)\big) \tag{7}$$
where $X$ is the set of training facts, $\phi_x$ is the score of fact $x$, and $l(x)$ is the true label of fact $x$ that is equal to 1 for positive facts and is equal to $-1$ otherwise.
• Multi-class loss approach: The ConvE model proposed a new binary cross entropy multi-class loss to model its training error. In this setting, the whole vocabulary of entities is used to train each positive fact, such that for a triple $(s, p, o)$ all facts $(s, p, o')$ with $o' \in E$ and $o' \neq o$ are considered false. Despite the extra computational cost of this approach, it allowed ConvE to generalise over a larger sample of negative instances, therefore surpassing other approaches in accuracy [5]. In a recent work, Lacroix et al. [9] introduced a softmax regression loss to model the training error of the ComplEx model as a multi-class problem. In this approach, the objective for each triple $(s, p, o)$ is to minimise the following losses:
$$\mathcal{L}^{spo}_{o} = -\phi_{spo} + \log\Big(\sum_{o'} \exp(\phi_{spo'})\Big), \qquad \mathcal{L}^{spo}_{s} = -\phi_{spo} + \log\Big(\sum_{s'} \exp(\phi_{s'po})\Big) \tag{8}$$
where $s' \in E$, $s' \neq s$, $o' \in E$ and $o' \neq o$. This resembles a log-loss of the softmax value of the positive triple compared to all possible object and subject corruptions, where the objective is to maximise positive facts scores and minimise all other scores. This approach achieved a significant improvement to the prediction accuracy of the ComplEx model over all benchmark datasets [9].
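The loss functions discussed in this section can be compared side by side with a minimal sketch. The function names are hypothetical, scores are plain floats, and the softmax loss here sums over every candidate entity including the true one, which is an assumption of the sketch rather than a statement about any particular implementation.

```python
import math

def hinge_loss(scores, labels, margin=1.0):
    """Labels in {+1, -1}: sum of max(margin - l * score, 0)."""
    return sum(max(margin - l * f, 0.0) for f, l in zip(scores, labels))

def squared_loss(scores, labels):
    """Labels in {1, 0}: 0.5 * sum of (score - l)^2."""
    return 0.5 * sum((f - l) ** 2 for f, l in zip(scores, labels))

def logistic_loss(scores, labels):
    """Labels in {+1, -1}: sum of log(1 + exp(-l * score))."""
    return sum(math.log1p(math.exp(-l * f)) for f, l in zip(scores, labels))

def softmax_nll(all_scores, true_index):
    """Multi-class loss: -score_true + log(sum_j exp(score_j))."""
    m = max(all_scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(f - m) for f in all_scores))
    return log_sum - all_scores[true_index]

# One true fact scored 2.0 and one corrupted fact scored -2.0
print(hinge_loss([2.0, -2.0], [1, -1]))
print(squared_loss([2.0, -2.0], [1, 0]))
print(logistic_loss([2.0, -2.0], [1, -1]))
# Scores of all candidate objects for one (s, p) pair; index 0 is the true object
print(softmax_nll([2.0, -2.0, 0.5], 0))
```

The ranking losses operate on a sampled set of true and corrupted facts, whereas the softmax loss needs the scores of all candidate entities for each training triple, which is where its extra computational cost comes from.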

Ranking Evaluation Metrics
Learning to rank models are evaluated using different ranking measures including Mean Average Precision (MAP), Normalised Discounted Cumulative Gain (NDCG), and Mean Reciprocal Rank (MRR). In this study, we only focus on the Mean Reciprocal Rank (MRR) since it is the main metric used in previous related works.
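Mean Reciprocal Rank (MRR). The Reciprocal Rank (RR) is a statistical measure used to evaluate the response of ranking models depending on the rank of the first relevant item, and the MRR is the mean of the reciprocal ranks over a set of $n$ queries:
$$\text{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\mathrm{rank}(x_i)}$$
where $x_i$ is the highest ranked relevant item for query $q_i$ and $\mathrm{rank}(x_i)$ is its position in the ranked results. Values of RR and MRR have a maximum of 1 for queries with true items ranked first, and get closer to 0 when the first true item is ranked in lower positions.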
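As a small worked example of the metric, the sketch below computes RR and MRR from raw scores. It assumes that a higher score means a better rank and that ties count against the true item; both choices, and the function names, are assumptions of this sketch.

```python
def reciprocal_rank(scores, true_index):
    """1 / rank of the true item, where rank 1 is the highest score."""
    true_score = scores[true_index]
    # Every other item scoring at least as high pushes the true item down
    rank = 1 + sum(
        1 for i, s in enumerate(scores) if i != true_index and s >= true_score
    )
    return 1.0 / rank

def mean_reciprocal_rank(queries):
    """queries: list of (scores, true_index) pairs, one per query."""
    return sum(reciprocal_rank(s, t) for s, t in queries) / len(queries)

queries = [
    ([0.9, 0.1, 0.3], 0),  # true item ranked first, RR = 1
    ([0.2, 0.8, 0.5], 2),  # true item ranked second, RR = 1/2
]
print(mean_reciprocal_rank(queries))  # (1 + 0.5) / 2 = 0.75
```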

The TriModel Model
In this section, we motivate the design decisions of the TriModel model, and we present how it models embeddings interactions and training loss.

Motivation
Currently, models using factorisation-based knowledge graph embedding approaches like DistMult and ComplEx achieve state-of-the-art results across all benchmarking datasets [9]. In the DistMult model, embeddings interactions are modelled using a symmetric function that computes the product of embeddings of the subject, the predicate and the object. This approach was able to surpass other distance-based embedding techniques like TransE [23]. However, it failed to model facts with asymmetric predicates due to its design. The ComplEx model tackles this problem using embeddings in the complex space, where its embeddings interactions use the complex conjugate of object embeddings to break the symmetry of the interactions. This approach provided significant accuracy improvements over DistMult as it successfully models a wider range of predicates. The ComplEx embeddings interaction function (defined in Sec. 2) can be redefined as a simple set of interactions of two part embeddings as follows:
$$\phi^{\text{ComplEx}}_{spo} = \sum_{k} \big( i_1 + i_2 + i_3 - i_4 \big)$$
where $\sum_k$ is the sum over all embeddings components of index $k \in \{1, ..., K\}$, and interactions $i_1$, $i_2$, $i_3$ and $i_4$ are defined as follows:
$$i_1 = e^1_s w^1_p e^1_o, \quad i_2 = e^2_s w^1_p e^2_o, \quad i_3 = e^1_s w^2_p e^2_o, \quad i_4 = e^2_s w^2_p e^1_o$$
where $e^1$ represents embeddings part 1, and $e^2$ is part 2 (1 → real and 2 → imaginary). Following this notation, we can see that the ComplEx model is a set of two symmetric interactions $i_1$ and $i_2$ and two asymmetric interactions $i_3$ and $i_4$. This encouraged us to investigate the effect of using other forms of combined symmetric and asymmetric interactions to model embeddings interactions in knowledge graph embeddings. We investigated different combinations of interactions $i_1$, $i_2$, $i_3$ and $i_4$, and we have found that removing and/or changing the definition of one of these interactions (while maintaining that the interactions use all triple components) preserves similar or insignificantly different prediction accuracy across different benchmarking datasets (see Table 1).
This led us to investigate other forms of interactions that use a combination of symmetric and asymmetric interactions, where we found that using embeddings of three parts can lead to better predictive accuracy than the ComplEx and the DistMult models.
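The equivalence between the complex-valued ComplEx score and its two-part formulation can be checked numerically. The following sketch uses hypothetical function names and toy values; part 1 holds the real components and part 2 the imaginary components, as in the text.

```python
def complex_score(e_s, w_p, e_o):
    """Re(sum_k e_sk * w_pk * conj(e_ok)) on complex-valued embeddings."""
    return sum(s * p * o.conjugate() for s, p, o in zip(e_s, w_p, e_o)).real

def two_part_score(e1_s, e2_s, w1_p, w2_p, e1_o, e2_o):
    """Sum over k of i1 + i2 + i3 - i4 on two-part real-valued embeddings."""
    total = 0.0
    for k in range(len(e1_s)):
        i1 = e1_s[k] * w1_p[k] * e1_o[k]  # symmetric in subject and object
        i2 = e2_s[k] * w1_p[k] * e2_o[k]  # symmetric in subject and object
        i3 = e1_s[k] * w2_p[k] * e2_o[k]  # asymmetric
        i4 = e2_s[k] * w2_p[k] * e1_o[k]  # asymmetric
        total += i1 + i2 + i3 - i4
    return total

# Toy two-part embeddings (part 1 = real, part 2 = imaginary)
e1_s, e2_s = [0.5, -1.0], [1.0, 0.5]
w1_p, w2_p = [1.5, -0.5], [0.5, -1.0]
e1_o, e2_o = [2.0, 0.25], [-1.0, 2.0]

# The same embeddings written as complex numbers
to_complex = lambda real, imag: [complex(a, b) for a, b in zip(real, imag)]
e_s, w_p, e_o = to_complex(e1_s, e2_s), to_complex(w1_p, w2_p), to_complex(e1_o, e2_o)

print(complex_score(e_s, w_p, e_o))
print(two_part_score(e1_s, e2_s, w1_p, w2_p, e1_o, e2_o))
```

Both calls produce the same value, since expanding the real part of the complex product yields exactly the four interaction terms.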

TriModel Embeddings Interactions
In the TriModel model, we represent each entity and relation using three embedding vectors such that the embedding of entity $i$ is $\{e^1_i, e^2_i, e^3_i\}$ and the embedding of relation $j$ is $\{w^1_j, w^2_j, w^3_j\}$, where $e^m$ denotes the m-th part of the embeddings and $m \in \{1, 2, 3\}$ indexes the three embeddings parts.
The TriModel model is a tensor factorisation based model, where its embeddings interaction function (scoring function) is defined as follows:
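A three-part interaction function of this kind can be illustrated with a minimal sketch. The specific combination of parts used in the code (parts 1 and 3 crossed between subject and object, part 2 shared by both) is an assumption made for illustration, chosen so that the score contains one symmetric and two asymmetric terms; the function name and toy values are likewise hypothetical.

```python
def trimodel_score(e_s, w_p, e_o):
    """Score for three-part embeddings, each given as a tuple of three lists.

    Assumed form (illustrative):
        sum_k e1_s*w1_p*e3_o + e2_s*w2_p*e2_o + e3_s*w3_p*e1_o
    """
    (s1, s2, s3), (p1, p2, p3), (o1, o2, o3) = e_s, w_p, e_o
    return sum(
        s1[k] * p1[k] * o3[k]    # asymmetric term
        + s2[k] * p2[k] * o2[k]  # symmetric term
        + s3[k] * p3[k] * o1[k]  # asymmetric term
        for k in range(len(s1))
    )

s = ([1.0, 2.0], [0.5, -1.0], [3.0, 0.0])
o = ([-1.0, 1.0], [2.0, 2.0], [0.5, 1.5])
p_sym = ([1.0, 2.0], [1.0, 1.0], [1.0, 2.0])    # parts 1 and 3 are equal
p_asym = ([1.0, 2.0], [1.0, 1.0], [-1.0, 0.5])  # parts 1 and 3 differ

print(trimodel_score(s, p_sym, o), trimodel_score(o, p_sym, s))
print(trimodel_score(s, p_asym, o), trimodel_score(o, p_asym, s))
```

Under this assumed form, a predicate whose first and third parts coincide scores both directions equally, while a predicate whose parts differ can score the two directions differently, so both symmetric and asymmetric predicates can be represented with the same linear cost in the embedding size.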

Training the TriModel Model
Trouillon et al. [20] showed that despite the equivalence of the scoring functions of the HolE and ComplEx models, they produce different results as they use different loss functions. They concluded that the logistic loss version of ComplEx outperforms its hinge loss version. In addition, we have investigated other ranking losses with the ComplEx model, and we have found that the squared error loss can significantly enhance the performance of ComplEx on multiple benchmarking datasets.
The TriModel model performs its learning process using two different training loss configurations: the traditional ranking loss and the multi-class loss. In the ranking loss configuration, the TriModel model uses the squared error (Eq. 6) and the logistic loss (Eq. 7) to model its training error, where a grid search is performed to choose the optimal loss representation for each dataset. In the multi-class configuration, it uses the negative-log softmax loss (Eq. 8) with the nuclear 3-norm regularisation [9], which is defined as follows:
$$\mathcal{L}^{spo} = -\phi_{spo} + \log\Big(\sum_{o'} \exp(\phi_{spo'})\Big) - \phi_{spo} + \log\Big(\sum_{s'} \exp(\phi_{s'po})\Big) + \frac{\lambda}{3} \sum_{k=1}^{K} \sum_{m=1}^{3} \big( |e^m_{sk}|^3 + |w^m_{pk}|^3 + |e^m_{ok}|^3 \big)$$
where $m$ denotes the embedding part index, $\lambda$ denotes a configurable regularisation weight parameter and $|x|$ is the absolute value of $x$. This allows the model to answer the link prediction task in both directions: (subject, predicate, ?) and (?, predicate, object). We also consider the use of predicate reciprocals in training as described in Lacroix et al. [9], where inverses of training predicates are added to the training set and trained with their corresponding original facts as shown in the following:
$$\mathcal{L}^{spo} = -\phi_{spo} + \log\Big(\sum_{o'} \exp(\phi_{spo'})\Big) - \phi_{o(p+N_r)s} + \log\Big(\sum_{s'} \exp(\phi_{o(p+N_r)s'})\Big)$$
where predicate $p + N_r$ is the inverse of the predicate $p$, and the model learns and evaluates inverse facts using the inverses of their original predicates. For all the multi-class configurations, the TriModel model regularises the training facts embeddings using a dropout layer [18] whose probability is selected during the grid search.
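The two training additions described above, predicate reciprocals and a cubed-absolute-value penalty, can be sketched as follows. The function names are hypothetical, triples are tuples of integer indices, and any constant factor of the regulariser is assumed to be folded into the `weight` argument.

```python
def add_reciprocal_triples(triples, num_relations):
    """For each (s, p, o), add the inverse fact (o, p + num_relations, s)."""
    return list(triples) + [(o, p + num_relations, s) for s, p, o in triples]

def n3_penalty(vectors, weight):
    """weight * sum of |x|^3 over every component of the given embedding parts."""
    return weight * sum(abs(x) ** 3 for vec in vectors for x in vec)

# A toy graph with 3 relations: the inverse of predicate 1 is predicate 1 + 3 = 4
train = [(0, 1, 2), (2, 0, 1)]
print(add_reciprocal_triples(train, num_relations=3))

# Penalty over the embedding parts involved in one training triple
parts = [[1.0, -2.0], [0.5, 0.0]]
print(n3_penalty(parts, weight=0.1))
```

With reciprocals, a query of the form (?, predicate, object) is answered as an object query on the inverse predicate, so the relation embedding matrix needs twice as many rows.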

Experiments
In this section, we discuss the setup of our experiments where we present the evaluation protocol, the benchmarking datasets and our implementation details.

Data
In our experiments we use six knowledge graph benchmarking datasets: WN18, WN18RR, FB15k, FB15k-237, YAGO10 and NELL239.

Implementation
We use the TensorFlow framework (GPU) along with Python 3.5 to perform our experiments. All experiments were executed on a Linux machine with an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 32 GB RAM, and an NVIDIA Titan Xp GPU.

Experiments Setup
We perform our experiments in two different configurations: (1) Ranking loss based learning: the models are trained using a ranking based loss function, where our model chooses between squared error loss and logistic loss using grid search.
(2) Multi-class loss based learning: the models are trained using a multi-class based training function, where our model uses the softmax negative log loss functions described in Eq. 11 and Eq. 12.
In all of our experiments we initialise our embeddings using the Glorot uniform random generator [7] and we optimise the training loss using the Ada- In the evaluation process, we only consider the filtered MRR and Hits@10 metrics [2]. In addition, in the ranking loss configuration, the TriModel model uses a softmax normalisation of the scores of object and subject corruptions, such that the score of a corrupted object triple $(s, p, o_i)$ is defined as:
$$\hat{\phi}_{spo_i} = \frac{\exp(\phi_{spo_i})}{\sum_{o' \in E} \exp(\phi_{spo'})}$$
Similarly, we apply a softmax normalisation to the scores of all possible subject entities.
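The softmax normalisation and the filtered ranking used in evaluation can be sketched as follows. The function names are hypothetical, and the filtering rule (ignoring candidates that form other known true facts when ranking the test entity) follows the usual filtered setting; it is stated here as an assumption of the sketch.

```python
import math

def softmax(scores):
    """Normalise raw scores of all candidate entities so that they sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def filtered_rank(scores, true_index, known_true_indices):
    """Rank of the true entity, ignoring other entities known to be correct."""
    true_score = scores[true_index]
    better = sum(
        1
        for i, s in enumerate(scores)
        if i != true_index and i not in known_true_indices and s > true_score
    )
    return 1 + better

def hits_at(ranks, k):
    """Fraction of test queries whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

raw = [0.9, 0.5, 0.7, 0.1]  # scores of four candidate objects
probs = softmax(raw)
# Entity 0 is another known true object, so it is filtered out of the ranking
print(filtered_rank(probs, true_index=1, known_true_indices={0}))
print(hits_at([1, 2, 15], k=10))
```

Because the softmax is monotonic, the normalisation changes the scale of the scores that the ranking loss sees but not the order of the candidates.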

Results and Discussion
In this section we discuss findings and results of our experiments shown in Table 3 and Table 4, where the experiments are divided into two configurations: models with ranking loss functions and models with multi-class based loss functions.

Results of The Ranking Loss Configuration
In the ranking loss configuration (Table 3), the TriModel model achieves the best results in terms of MRR and Hits@10 on five out of six benchmarking datasets, with a margin of up to 10% as on the YAGO10 dataset. However, on the FB15k-237 dataset ConvKB [14] retains state-of-the-art results in terms of MRR and Hits@10. The results also show that factorisation based models like the DistMult, ComplEx, R-GCN and TriModel models generally outperform distance based models like the TransE and ConvKB models. However, on the FB15k-237 dataset, both distance based models outperform all factorisation based models, with a margin of up to 15% between the ConvKB and the TriModel models. In future works, we intend to perform further analysis on this dataset compared to other datasets to investigate why tensor factorisation models fail to provide state-of-the-art results on it.
Table 3. Link prediction results on standard benchmarking datasets. Results taken from [21] and our own experiments.

Results of The Multi-class Loss Configuration
Results of the multi-class based approach show that the TriModel model provides state-of-the-art results on all benchmarking datasets, where the ComplEx model provides equivalent results on 3 out of 6 datasets. Our reported results of the ComplEx model with the multi-class log-loss introduced by Lacroix et al. [9] are slightly different from their reported results, as we re-evaluated their models with the embedding size restricted to a maximum of 200. In their work they used an embedding size of 2000, which is impractical for embedding knowledge graphs in real applications. Moreover, other previous works using the TransE, DistMult, ComplEx, ConvE, and ConvKB models have limited their experiments to a maximum embedding size of 200. We therefore applied the same restriction to our model and to the models of [9] for a fair comparison.

Ranking and Multi-class Approaches
In the link prediction task, the objective of knowledge graph embedding models is to learn embeddings that rank triples according to their factuality. This is achieved by learning to rank original true triples against other negative triple instances, where the negative instances are modelled in different ways in ranking approaches and multi-class loss approaches. In the learning to rank approach, models use a ranking loss, e.g. a pointwise or pairwise loss, to rank a set of true and negative instances [4], where negative instances are generated by corrupting true training facts with a ratio of negative to positive instances [2]. This corruption happens by changing either the subject or the object of the true triple instance. In this configuration, the ratio of negative to positive instances is traditionally chosen using a grid search, where models compromise between the accuracy achieved by increasing the ratio and the runtime required for training.
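The corruption procedure used in the ranking setting can be illustrated with a minimal sketch. The function name, the equal chance of corrupting the subject or the object, and the optional filtering of known true facts are assumptions of this sketch.

```python
import random

def corrupt(triple, num_entities, num_negatives, known_true=frozenset(), rng=random):
    """Generate negatives by replacing the subject or the object of a true triple.

    Note: the loop assumes enough valid corruptions exist; it would not
    terminate if every candidate were a known true fact.
    """
    s, p, o = triple
    negatives = []
    while len(negatives) < num_negatives:
        e = rng.randrange(num_entities)  # uniform random entity index
        candidate = (e, p, o) if rng.random() < 0.5 else (s, p, e)
        # Skip the original triple and any other fact known to be true
        if candidate != triple and candidate not in known_true:
            negatives.append(candidate)
    return negatives

# Four negatives per positive, i.e. a negative to positive ratio of 4
rng = random.Random(0)
print(corrupt((0, 1, 2), num_entities=5, num_negatives=4, rng=rng))
```

The `num_negatives` argument plays the role of the negative to positive ratio discussed above: raising it gives the loss more contrast per true fact at the price of a proportionally longer training step.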
On the other hand, multi-class based models train to rank positive triples against all their possible corruptions as a multi-class problem where the range of classes is the set of all entities. For example, training on a triple (s, p, o) is achieved by learning the right classes "s" and "o" for the pairs (?, p, o) and (s, p, ?) respectively, where the set of possible classes is $E$ of size $N_e$. Despite the enhancements of the prediction accuracy achieved by such approaches [5,9], they can have scalability issues in real-world large sized knowledge graphs with large numbers of entities, due to the fact that they use the full vocabulary of entities as negative instances [13].
In summary, our model provides significantly better results than other SOTA models in the ranking setting, which is scalable and thus better-suited to real-world applications. In addition to that, our model has equivalent or slightly better performance than SOTA models on the multi-class approach.

Conclusions and Future Work
In this work, we have presented the TriModel model, a new tensor factorisation based knowledge graph embedding model that represents knowledge entities and relations using three part embeddings, where its embedding interaction function can model both symmetric and asymmetric predicates. We have shown by experiments that the TriModel model outperforms other tensor factorisation based models like the ComplEx and the DistMult models on different training objectives and across all standard benchmarking datasets. We have also introduced a new challenging small size benchmarking dataset, NELL239, that can be used to facilitate fast development of new knowledge graph embedding models.
In our future works, we intend to investigate new possible approaches to model embedding interactions of tensor factorisation models, and we intend to analyse the effects of properties of knowledge graph datasets like FB15k-237 on the efficiency of tensor factorisation based models.