MMKG: Multi-Modal Knowledge Graphs

We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities as well as entity alignments between pairs of KGs. Therefore, multi-relational link prediction and entity matching communities can benefit from this resource. We believe this data set has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs. We validate the utility of MMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning from multiple feature types.


Introduction
A large volume of human knowledge can be represented with a multi-relational graph. Binary relationships encode facts that can be represented in the form of RDF [14] type triples (head, predicate, tail), where head and tail are entities and predicate is the relation type. The combination of all triples forms a multi-relational graph, where nodes represent entities and directed edges represent relationships. The resulting multi-relational graph is often referred to as a Knowledge Graph.
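As a minimal sketch of this representation, the snippet below stores a handful of (head, predicate, tail) triples and indexes them as directed, labelled edges. The entity and relation names are illustrative and are not taken from MMKG.

```python
from collections import defaultdict

# Toy triples (head, predicate, tail); all names are illustrative.
triples = [
    ("Paris", "capitalOf", "France"),
    ("Paris", "locatedIn", "France"),
    ("France", "memberOf", "EU"),
]


def build_graph(triples):
    """Index triples as head -> list of (predicate, tail) directed edges."""
    graph = defaultdict(list)
    for head, predicate, tail in triples:
        graph[head].append((predicate, tail))
    return graph


graph = build_graph(triples)
# The node set is every entity that appears as a head or a tail.
entities = {e for h, _, t in triples for e in (h, t)}
```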
Knowledge Graphs (KGs) provide ways to efficiently organize, manage and retrieve this type of information, and are increasingly used as an external source of knowledge for problems like recommender systems [34], language modeling [2], question answering [33] or image classification [18]. While ranging from general purpose (DBPEDIA [3] or FREEBASE [4]) to domain-specific (IMDB or UNIPROTKB), KGs are often highly incomplete and, therefore, research has focused heavily on the problem of knowledge graph completion [20]. Link prediction (i.e. predicting missing relationships between the entities of the KG), relationship extraction [25] (i.e. classification of semantic relationship mentions) and ontology matching [27] (i.e. alignment and integration of entities and relationships across KGs) are some of the different ways to tackle the incompleteness problem. Novel data sets for benchmarking knowledge graph completion approaches, therefore, are important contributions to the community. This is especially true since a method that performs well on one data set might perform poorly on others [31]. With this paper we introduce MMKG (Multi-Modal Knowledge Graphs), a collection of three knowledge graphs for link prediction and entity matching research. Contrary to existing data sets, these knowledge graphs contain both numerical features and images for all entities as well as entity alignments between pairs of KGs. There is a fundamental difference between MMKG and other visual-relational resources (e.g. [15,32]). While MMKG is intended to perform relational reasoning across different entities and images, previous resources are intended to perform visual reasoning within the same image.
We use FREEBASE15K [5] as the blueprint for the multi-modal knowledge graphs we constructed. FREEBASE15K is the major benchmark data set in the recent link prediction literature. In a first step, we aligned most FB15K entities to entities from DBPEDIA and YAGO through the sameAs links contained in DBPEDIA and YAGO dumps. Since the degree of a node relates to the probability of an entity appearing in a subsampled version of a KG, we use this measure to populate our versions of DBPEDIA and YAGO with more entities. For each knowledge graph, we include entities that are highly connected to the aligned entities so that the number of entities in each KG is similar to that of FB15K. Lastly, we have populated the three knowledge graphs with numeric literals and images for (almost) all of their entities. We name the two new data sets DBPEDIA15K and YAGO15K. Although all three data sets contain a similar number of entities, this does not prevent potential users of MMKG from filtering out entities to benchmark approaches in scenarios where KGs largely differ with respect to the number of entities that they contain.
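The population step can be sketched as follows. The text does not give the exact connectivity measure, so this sketch assumes it is the number of edges a non-aligned entity shares with already aligned entities. The function names and the toy data are hypothetical.

```python
from collections import Counter


def select_additional_entities(triples, aligned, k):
    """Pick the k non-aligned entities with the most edges to aligned entities.

    Assumption: connectivity = number of triples linking the entity to an
    aligned entity, in either direction.
    """
    connectivity = Counter()
    for head, _, tail in triples:
        if head in aligned and tail not in aligned:
            connectivity[tail] += 1
        elif tail in aligned and head not in aligned:
            connectivity[head] += 1
    # Sort by count (descending), then by name so the result is deterministic.
    ranked = sorted(connectivity.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity for entity, _ in ranked[:k]]


def induced_triples(triples, entities):
    """Keep only the triples whose head and tail both belong to `entities`."""
    return [(h, r, t) for h, r, t in triples if h in entities and t in entities]


toy = [("a", "r", "x"), ("b", "r", "x"), ("a", "r", "y"), ("z", "r", "w")]
extra = select_additional_entities(toy, aligned={"a", "b"}, k=1)
```

Here `x` is selected because it has two edges to aligned entities, while `y` has only one and `z`, `w` have none.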
The contributions of the present paper are the following:
- The creation of two knowledge graphs, DBPEDIA15K and YAGO15K, that are the DBPEDIA and YAGO [29] counterparts, respectively, of FREEBASE15K. Furthermore, all three KGs are enriched with numeric literals and image information, as well as sameAs predicates linking entities from pairs of knowledge graphs. The sameAs predicates, the numerical literals, the (links to) images for entities, and the relational graph structure are released in separate files.
- We validate our hypothesis that knowledge graph completion related problems can benefit from multi-modal data:
  • We elaborate on a previous learning framework [10] and extend it by also incorporating image information. We perform completion in queries such as (head?, sameAs, tail) and (head, sameAs, tail?), where head and tail are entities, each one from a different KG. This task can be deemed something in between link prediction and entity matching.
  • We analyze the performance of the different modalities in isolation for different percentages of known aligned entities between KGs, as well as for different combinations of feature types.
The paper is organized as follows: In Section 2 we discuss the relevance of MMKG for link prediction and entity matching research. Section 3 elaborates on how the different elements of MMKG were constructed and provides relevant statistics of the resource. Section 4 presents the learning framework and our extension, which is followed by experimental evidence in Section 5 that validates our hypothesis about the need for such a data set. Finally, Section 6 presents our conclusions.

Relevance
There are a number of problems related to knowledge graph completion. Named-entity linking (NEL) [7,12] is the task of linking a named-entity mention from a text to an entity in a knowledge graph. Usually a NEL algorithm is followed by a second procedure, namely relationship extraction [19,25], which aims at linking relation mentions from text to a canonical relation type in a knowledge graph. Hence, relation extraction methods are often used in conjunction with NEL algorithms to perform KG completion from natural language content.
Link prediction and entity matching are two other popular tasks for knowledge graph completion. MMKG has been mainly created targeting these two tasks.
Link prediction. It aims at answering completion queries of the form (head?, predicate, tail) or (head, predicate, tail?), where the answer is supposed to be always within the KG.
Entity Matching. Given two KGs, the goal is to find pairs of records, one from each KG, that refer to the same entity. For instance, DBpedia:NYC ≡ FB:NewYork.

Relevance for Multi-Relational Link Prediction Research
The core of most multi-relational link prediction approaches is a scoring function. The scoring function is a (differentiable) function whose parameters are learned such that it assigns high scores to true triples and low scores to triples assumed to be false. The majority of recent work falls into one of the following two categories:
1. Relational approaches [17,11], wherein features are given as logical formulas which are evaluated in the KG to determine the feature's value. For instance, the formula ∃x (A, bornIn, x) ∧ (x, capitalOf, B) corresponds to a binary feature which is 1 if there exists a path of that type from entity A to entity B, and 0 otherwise.
2. Latent approaches [20], which learn fixed-size vector representations (embeddings) for all entities and relationships in the KG.
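The binary path feature from the relational category, ∃x (A, bornIn, x) ∧ (x, capitalOf, B), can be evaluated directly on a set of triples. The following is a small sketch with made-up triples; the function name is hypothetical.

```python
def path_feature(triples, a, b, first_rel, second_rel):
    """Return 1 if some x satisfies (a, first_rel, x) and (x, second_rel, b), else 0."""
    # All entities x reachable from a via first_rel.
    intermediates = {t for h, r, t in triples if h == a and r == first_rel}
    # Check whether any of them reaches b via second_rel.
    found = any(
        h in intermediates and r == second_rel and t == b for h, r, t in triples
    )
    return int(found)


toy_kg = [
    ("Ada", "bornIn", "London"),
    ("London", "capitalOf", "UK"),
]
```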
While previous work has almost exclusively focused on the relational structure of the graph, recent approaches have considered other feature types like numerical literals [10,24]. In addition, recent work on visual-relational knowledge graphs [23] has introduced novel visual query types such as "How are these two unseen images related to each other?" and has proposed novel machine learning methods to answer these queries. Different to the link prediction problem addressed in this work, the methods evaluated in [23] solely rely on visual data.
MMKG provides three data sets for evaluating multi-relational link prediction approaches where, in addition to the multi-relational links between entities, all entities have been associated with numerical and visual data. An interesting property of MMKG is that the three knowledge graphs are very heterogeneous (w.r.t. the number of relation types, their sparsity, and so on) as we show in Section 3. It is known that the performance of multi-relational link prediction methods depends on the characteristics of the specific knowledge graphs [31]. Therefore, MMKG is an important benchmark data set for measuring the robustness of the approaches.

Relevance for Entity Matching Research
There are numerous approaches to find sameAs links between entities of two different knowledge graphs. Though there are works [21,9] that solely incorporate the relational graph structure, there is an extensive literature on methods that perform the matching by combining relational structural information with literals of entities, where literals are used to compute prior confidence scores [28,16,22].
A large number of approaches of the entity matching literature have been evaluated as part of the Ontology Alignment Evaluation Initiative (OAEI) [1] using data sets such as YAGO, FREEBASE, and IMDB [16,28,22]. Contrary to the proposed multi-modal knowledge graph data sets, however, the OAEI does not focus on tasks with visual and numerical data. The main advantages of MMKG over existing benchmark data sets for entity matching are: (1) MMKG's entities are associated with visual and numerical data, and (2) the availability of ground truth entity alignments for a high percentage of the KG entities. The former encourages research in entity matching methods that incorporate visual and numerical data. The latter allows one to measure the robustness in performance of entity matching approaches with respect to the number of given alignments between two KGs. The benchmark KGs can also be used to evaluate different active learning strategies. Traditional active learning approaches ask a user for a small set of alignments that minimize the uncertainty and, therefore, maximize the quality of the final alignments.

MMKG: Dataset Generation
We chose FREEBASE-15K (FB15K), a data set that has been widely used in the knowledge graph completion literature, as a starting point to create the multi-modal knowledge graphs. Facts of this KG are in N-Triples format, a line-based plain text format for encoding an RDF graph. For example, the triple </ns/g.112ygbz6> </ns/type.object.type> </ns/film.film>. indicates that the entity with identifier </ns/g.112ygbz6> is connected to the entity with identifier </ns/film.film> via the relationship </ns/type.object.type>.
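A line of this shape can be split into its three terms with a small regular expression. This is a deliberately narrow sketch: it only handles lines whose subject, predicate and object are all written as <...> terms, as in the example above, and it is not a full N-Triples parser (literals and blank nodes are rejected). The names `TRIPLE_RE` and `parse_iri_triple` are hypothetical.

```python
import re

# Three <...> terms followed by the terminating dot.
TRIPLE_RE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_iri_triple(line):
    """Return (head, predicate, tail) for an IRI-only line, or None otherwise."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.groups()


example = "</ns/g.112ygbz6> </ns/type.object.type> </ns/film.film>."
parsed = parse_iri_triple(example)
```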
We create versions of DBPEDIA and YAGO, called DBPEDIA-15K (DB15K) and YAGO15K, by aligning entities in FB15K with entities in these other knowledge graphs. More concretely, for DB15K we performed the following steps.
1. SAMEAS. We extract alignments between entities of FB15K and DBPEDIA in order to create DB15K. These alignments link one entity from FB15K to one from DBPEDIA via a sameAs relation.
2. RELATIONAL GRAPH. A high percentage of entities from FB15K can be aligned with entities in DBPEDIA. However, to make the two knowledge graphs have roughly the same number of entities and to also have entities that cannot be aligned across the knowledge graphs, we include additional entities in DB15K. We chose entities with the highest connectivity to the already aligned entities to complete DB15K. We then collect all the triples where both head and tail entities belong to the set of entities of DB15K. This collection of triples forms the relational graph structure of DB15K.
3. NUMERIC LITERALS. We collect all triples that associate entities in DB15K with numerical literals. For example, the relation /location/geocode/latitude links entities to their latitude. We refer to these relation types as numerical relations. Figure 2 shows the most common numerical relationships in the knowledge graphs. In previous work [10] we have extracted numeric literals for FB15K only.
4. IMAGES. We obtain images related to each of the entities of FB15K. To do so we implemented a web crawler that is able to parse query results for the image search engines Google Images, Bing Images, and Yahoo Image Search. To minimize the amount of noise due to polysemous entity labels (for example, there are two FREEBASE entities with the text label "Paris") we extracted, for each entity in FB15K, all Wikipedia URIs from the 1.9 billion triple FREEBASE RDF dump. For instance, for Paris, we obtained URIs such as Paris(ile-de-France,France) and Paris(City of New Orleans, Louisiana). These URIs were processed and used as search queries for disambiguation purposes. We also crawled web images using other types of search queries besides the Wikipedia URIs. For example, we used i) the entity name, and ii) the entity name followed by the entity's notable type as query strings, among others. After visual inspection of polysemous entities (as they are the most problematic entities), we observed that using Wikipedia URIs as query strings was the strategy that best alleviated the polysemy problem. We used the crawler to download a number of images per entity. For each entity we stored the 20 top-ranked images retrieved by each search engine. We filtered out images with a side smaller than 224 pixels, and images with one side more than 2.5 times longer than the other. We also removed corrupted, low-quality, and duplicate images (pairs of images with a pixel-wise distance below a certain threshold). After all these steps, we kept 55.8 images per entity on average. We also scaled the images to have a maximum height or width of 500 pixels while maintaining their aspect ratio. Finally, for each entity we distribute a distinct image to FB15K and DB15K.
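The size-based image filters and the rescaling rule from the IMAGES step can be sketched on image dimensions alone. Two details are assumptions, since the text does not settle them: an aspect ratio of exactly 2.5 is kept, and images already within 500 pixels are left at their original size. The function names are hypothetical, and the corruption, quality and duplicate checks are not covered.

```python
def keep_image(width, height, min_side=224, max_ratio=2.5):
    """Apply the size filters: no side below min_side, aspect ratio at most max_ratio."""
    shorter, longer = min(width, height), max(width, height)
    if shorter < min_side:
        return False
    # Assumption: a ratio of exactly max_ratio is still accepted.
    return longer <= max_ratio * shorter


def scaled_size(width, height, max_side=500):
    """Return the size after shrinking so the longer side is at most max_side."""
    longer = max(width, height)
    if longer <= max_side:
        # Assumption: small images are not upscaled.
        return width, height
    scale = max_side / longer
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For instance, a 1000 x 500 image passes the filters (ratio 2.0) and is rescaled to 500 x 250.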
We repeat the same sequence of steps for the creation of YAGO15K with one difference. sameAs predicates from the YAGO dump align entities from that knowledge graph to DBPEDIA entities. We used them along with the previously extracted alignments between DB15K and FB15K to eventually create the alignment between YAGO and FB15K entities. Table 1 depicts the hyperlinks from which we extracted the different components for the generation of DB15K and YAGO15K.
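Chaining the two alignment sets amounts to a join on the shared DBPEDIA entity. The sketch below illustrates this with hypothetical identifiers; the function name and the pair-list input format are assumptions, not the format of the released files.

```python
def compose_alignments(yago_to_dbpedia, dbpedia_to_fb):
    """Chain YAGO->DBPEDIA and DBPEDIA->FB15K sameAs pairs into YAGO->FB15K pairs."""
    db_to_fb = {}
    for db_entity, fb_entity in dbpedia_to_fb:
        db_to_fb.setdefault(db_entity, set()).add(fb_entity)
    # A YAGO entity is aligned to every FB15K entity reachable through DBPEDIA.
    return sorted(
        (yago_entity, fb_entity)
        for yago_entity, db_entity in yago_to_dbpedia
        for fb_entity in db_to_fb.get(db_entity, ())
    )


# Illustrative identifiers only.
yago_db = [("yago:Paris", "dbr:Paris"), ("yago:Foo", "dbr:Foo")]
db_fb = [("dbr:Paris", "fb:Paris")]
yago_fb = compose_alignments(yago_db, db_fb)
```

YAGO entities whose DBPEDIA counterpart has no FB15K alignment, such as `yago:Foo` here, simply remain unaligned.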
Statistics of FB15K, DB15K and YAGO15K are depicted in Table 2. The frequencies of entities and relationships in YAGO15K and DB15K are depicted in Figures 3 and 4, respectively. Entities and relationships are sorted according to their frequency. The figures show in logarithmic scale the number of times that each entity and relationship occurs in YAGO15K and DB15K. Relationships like starring or timeZone occur quite frequently in YAGO15K, while others like animator are rare. Contrary to FB15K, the entity Male is unusual in YAGO15K, which illustrates, to a limited extent, the heterogeneity of the KGs.

Availability and Sustainability
MMKG can be found in the Github repository https://github.com/nle-ml. We will actively use Github issues to track feature requests and bug reports. The documentation of the framework has been published on the repository's Wiki as well. To guarantee the future availability of the resource, it has also been published on Zenodo. MMKG is released under the BSD-3-Clause License. The repository contains a number of files, all of them formatted following the N-Triples guidelines (https://www.w3.org/TR/n-triples/). These files contain information regarding the relational graph structure, numeric literals and visual information. Numerical information is formatted as RDF literals, while entities and relationships point to their corresponding RDF URIs. We also provide separate files that link both DB15K and YAGO15K entities to FB15K ones via sameAs predicates, also formatted as N-Triples.
To avoid copyright infringement and to guarantee access to the visual information (URLs to images are not permanent), we learn embeddings for the images through the VGG16 model. The architecture of this network is illustrated in Figure 5. We remove the softmax layer of the trained VGG16 and obtain the 4096-dimensional embeddings for all images of MMKG. We provide these embeddings in hdf5 [30] format. The Github repository contains documentation on how to access these embeddings. Alternatively, one can use the crawler (also available in the Github repository) to download the images from the different search engines.
Fig. 6. Illustration of the methods we evaluated to combine various data modalities.

Technical Quality of MMKG
We provide empirical evidence that knowledge graph completion related tasks can benefit from the multi-modal data of MMKG. Our hypothesis is that different data modalities contain complementary information beneficial for both multi-relational link prediction and entity matching. For instance, in the entity matching problem if two images are visually similar they are likely to be associated with the same entity and if two entities in two different KGs have similar numerical feature values, they are more likely to be identical. Similarly, we hypothesize that multi-relational link prediction can benefit from the different data modalities. For example, learning that the mean difference of birth years is 0.4 for the Freebase relation /people/marriage/spouse, can provide helpful evidence for the linking task. In recent years, numerous methods for merging feature types have been proposed. The most common strategy is the concatenation of either the input features or some intermediate learned representation. We compare these strategies to the recently proposed learning framework [10], which we have found to be superior to the concatenation and an ensemble type of approach.

Task: SAMEAS Link Prediction
We validate the hypothesis that different modalities are complementary for the sameAs link prediction task. Different to the standard link prediction problem, here the goal is to answer queries such as (head?, sameAs, tail) or (head, sameAs, tail?) where head and tail are entities from different KGs. We do not make the one-to-one alignment assumption, that is, the assumption that one entity in one KG is identical to exactly (at most) one in the other. A second difference is that in the evaluation of the SAMEAS prediction task, and in general in the link prediction literature, only one argument of a triple is assumed to be missing at a time. That partial knowledge of the ground truth is not given in the entity matching literature.

Model: Products of Experts
We elaborate on previous work [10] and extend it by incorporating visual information. This learning framework can be stated as a Product of Experts (PoE).
In general, a PoE's probability distribution is
\[ p(d \mid \theta_1, \dots, \theta_n) = \frac{\prod_i f_i(d \mid \theta_i)}{\sum_c \prod_i f_i(c \mid \theta_i)}, \]
where $d$ is a data vector in a discrete space, $\theta_i$ are the parameters of individual model $f_i$, and $c$ ranges over all possible vectors in the data space. In the KG context, the data vector $d$ is always a triple $d = (h, r, t)$ and the objective is to learn a PoE that assigns high probability to true triples and low probabilities to triples assumed to be false. For instance, the triple (Paris, locatedIn, France) should be assigned a high probability and the triple (Paris, locatedIn, Germany) a low probability. If $(h, r, t)$ holds in the KG, the pair's vector representations are used as positive training examples. Let $d = (h, r, t)$. We can now define one individual expert $f_{(r,F)}(d \mid \theta_{(r,F)})$ for each (relation type $r$, feature type $F$) pair:
- $f_{(r,L)}(d \mid \theta_{(r,L)})$: the embedding expert for relation type $r$
- $f_{(r,R)}(d \mid \theta_{(r,R)})$: the relational expert for relation type $r$
- $f_{(r,N)}(d \mid \theta_{(r,N)})$: the numerical expert for relation type $r$
- $f_{(\mathrm{sameAs},I)}(d \mid \theta_{(\mathrm{sameAs},I)})$: the visual expert for relation type sameAs
The joint probability for a triple $d = (h, r, t)$ of the PoE model is now
\[ p(d \mid \theta_1, \dots, \theta_n) = \frac{\prod_{F \in \{L,R,N,I\}} f_{(r,F)}(d \mid \theta_{(r,F)})}{\sum_c \prod_{F \in \{L,R,N,I\}} f_{(r^c,F)}(c \mid \theta_{(r^c,F)})}, \]
where $c$ indexes all possible triples and $r^c$ denotes the relation type of triple $c$. For information regarding the latent, relational and numerical experts, we refer the reader to [10]. Although entity names are not used to infer sameAs links in this work, one may also define an expert for such a feature.
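To make the normalisation concrete, the following toy sketch multiplies the values of two experts and normalises over an explicit, tiny candidate set. The scores are made up, and the two experts stand in for the experts of [10] rather than reproducing them. In the actual model the sum runs over all possible triples.

```python
import math


def poe_probabilities(candidates, experts):
    """Normalise the product of expert values over an explicit candidate set."""
    products = [math.prod(f(c) for f in experts) for c in candidates]
    normaliser = sum(products)
    return {c: p / normaliser for c, p in zip(candidates, products)}


# Made-up per-tail scores for two hypothetical experts.
latent_score = {"France": 2.0, "Germany": 0.5}
relational_score = {"France": 1.0, "Germany": 0.0}

experts = [
    lambda d: math.exp(latent_score[d[2]]),      # stand-in for a latent expert
    lambda d: math.exp(relational_score[d[2]]),  # stand-in for a relational expert
]

candidates = [
    ("Paris", "locatedIn", "France"),
    ("Paris", "locatedIn", "Germany"),
]
probs = poe_probabilities(candidates, experts)
```

Because every expert is an exponentiated score, the product of experts is a softmax over the summed scores, so the triple with tail France receives the higher probability.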

Visual Experts
The visual expert is only learned for the sameAs relation type. The score of the visual expert is computed as the cosine similarity between the two 4096-dimensional feature vectors of the two images.
Let $d = (h, r, t)$ be a triple. The visual expert for relation type $r$ is defined as
\[ f_{(r,I)}(d \mid \theta_{(r,I)}) = \exp(i_h \cdot i_t), \]
where $\cdot$ is the dot product and $i_h$ and $i_t$ are the embeddings of the images for the head and tail entities.
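A minimal sketch of the visual score is given below, using short toy vectors in place of the 4096-dimensional embeddings. Exponentiating the similarity so that it can enter a product of experts is an assumption of this sketch, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def visual_expert(i_h, i_t):
    """Assumed form: exponentiated cosine similarity of the two image embeddings."""
    return math.exp(cosine_similarity(i_h, i_t))
```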
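Learning
The logarithmic loss for the given training triples $T$ is defined as
\[ \mathcal{L} = - \sum_{d \in T} \log p(d \mid \theta_1, \dots, \theta_n). \]
To fit the PoE to the training triples, we follow the derivative of the log likelihood of each observed triple $d \in T$ under the PoE [10]:
\[ \frac{\partial \log p(d \mid \theta_1, \dots, \theta_n)}{\partial \theta_m} = \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{\partial \log \sum_c \prod_i f_i(c \mid \theta_i)}{\partial \theta_m}. \]
We generate for each triple $d = (h, r, t)$ a set $E$ consisting of $N$ triples $(h, r, t')$ by sampling exactly $N$ entities $t'$ uniformly at random from the set of all entities. In doing so, the right term is then approximated by
\[ \frac{\partial \log \sum_{c \in E} \prod_i f_i(c \mid \theta_i)}{\partial \theta_m}. \]
This is often referred to as negative sampling.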
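The sampling step and the resulting per-triple loss can be sketched as follows. The sketch adds the observed triple to the sampled set when computing the normaliser, which is an assumption made here so that the loss takes the form of a categorical cross-entropy. `log_score` is a hypothetical callable returning the sum of the experts' log values for a triple.

```python
import math
import random


def sample_negatives(triple, entities, n, rng):
    """Corrupt the tail of `triple` with n entities drawn uniformly at random."""
    head, relation, _ = triple
    return [(head, relation, rng.choice(entities)) for _ in range(n)]


def sampled_log_loss(triple, negatives, log_score):
    """Negative log-probability of `triple`, normalised over itself plus the negatives."""
    logits = [log_score(triple)] + [log_score(c) for c in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_normaliser = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_normaliser - logits[0]


rng = random.Random(0)
positive = ("h", "sameAs", "t")
negatives = sample_negatives(positive, ["a", "b", "c"], 5, rng)
# With a constant score every candidate is equally likely, so the loss is log(6).
uniform_loss = sampled_log_loss(positive, negatives, lambda d: 0.0)
```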

Additional Baseline Approaches
Apart from the product of experts, we also evaluate other approaches to combine various data modalities. All the evaluated approaches are illustrated in Figure 6.
Concatenation
Given pairs of aligned entities, each pair is characterized by a single vector wherein all modality features of both entities are concatenated. For each pair of aligned entities we create a number of negative alignments, each of which is also characterized by a concatenation of all modality features of both entities. A logistic regression is trained taking these vectors as input, together with their corresponding class labels (+1 and -1 for positive and negative alignments, respectively). The output of the logistic regression indicates the posterior probability of two entities being the same. In Section 5 we refer to this approach as CONCAT.
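The construction of the training vectors for CONCAT can be sketched as below. The classifier itself is omitted. How negative alignments are generated is not specified in the text, so this sketch assumes they are created by replacing the second entity with a randomly drawn one. The function names and the feature dictionaries are hypothetical.

```python
import random


def pair_vector(features_a, features_b, a, b):
    """Concatenate the feature lists of entity a (first KG) and entity b (second KG)."""
    return features_a[a] + features_b[b]


def build_training_set(alignments, features_a, features_b, n_negatives, rng):
    """Return (X, y) with label +1 for aligned pairs and -1 for corrupted pairs."""
    candidates_b = sorted(features_b)
    X, y = [], []
    for a, b in alignments:
        X.append(pair_vector(features_a, features_b, a, b))
        y.append(1)
        # Assumption: negatives replace the second entity of the aligned pair.
        others = [c for c in candidates_b if c != b]
        for negative in rng.sample(others, min(n_negatives, len(others))):
            X.append(pair_vector(features_a, features_b, a, negative))
            y.append(-1)
    return X, y


features_a = {"a1": [1.0, 2.0]}
features_b = {"b1": [3.0], "b2": [4.0], "b3": [5.0]}
X, y = build_training_set([("a1", "b1")], features_a, features_b, 2, random.Random(0))
```

The resulting `X` and `y` could then be passed to any logistic regression implementation.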

Ensemble
The ensemble approach combines the various expert models into an ensemble classifier. Instead of training the experts jointly and end-to-end, here each of the expert models is first trained independently. At test time, the scores of the expert models are added and used to rank the entities. We refer to this approach as ENSEMBLE.
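At test time, the ENSEMBLE combination reduces to summing the per-expert scores of each candidate and sorting. A small sketch with made-up scores follows; the function name and the dictionary-based score format are assumptions.

```python
def ensemble_rank(candidates, expert_scores):
    """Rank candidates by the sum of their scores under independently trained experts.

    `expert_scores` is a list with one dict (candidate -> score) per expert.
    """
    total = {c: sum(scores[c] for scores in expert_scores) for c in candidates}
    return sorted(candidates, key=lambda c: total[c], reverse=True)


ranking = ensemble_rank(
    ["x", "y", "z"],
    [
        {"x": 0.1, "y": 0.9, "z": 0.3},  # scores from one expert
        {"x": 0.8, "y": 0.2, "z": 0.1},  # scores from another expert
    ],
)
```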

Experiments
We conducted experiments on two pairs of knowledge graphs of MMKG, namely, (FB15K vs. DB15K and YAGO15K vs. FB15K). We evaluate a number of different instances of the product of experts (PoE) model, as well as the other baseline methods, in the sameAs prediction task. Because of its similarity with link prediction, we use metrics commonly used for this task. The main objective of the experiments is to demonstrate that MMKG is suitable for the task at hand, and specifically that the related problems can benefit from learning of multiple feature types.

Evaluation
MMKG allows experimenting with different percentages of aligned entities between KGs. These alignments are given by the sameAs predicates that we previously found. We evaluate the impact of the different modalities in scenarios wherein the number of given alignments P [%] between two KGs is low, medium and high. We reckon that such scenarios would correspond to 20%, 50% and 80% out of all sameAs predicates, respectively. We use these alignments along with the two KGs as part of our observed triples T, and split the remaining sameAs triples equally into validation and test sets. We use AMIE+ [8] to mine relational features for the relational experts. We used the standard settings of AMIE+ with the exception that the minimum absolute support was set to 2 and the maximum number of entities involved in a rule to four. The latter is important to guarantee that AMIE+ retrieves rules like $(x, r_1, w), (w, \mathrm{SameAs}, z), (z, r_2, y) \Rightarrow (x, \mathrm{SameAs}, y)$, wherein $r_1$ is a relationship that belongs to one KG, and $r_2$ to the other KG. One example of a rule retrieved by AMIE+ is: $(x, \mathrm{father\_of}_{\mathrm{DB15K}}, w), (w, \mathrm{SameAs}, z), (z, \mathrm{children\_of}_{\mathrm{FB15K}}, y) \Rightarrow (x, \mathrm{SameAs}, y)$. In this case both $\mathrm{father\_of}_{\mathrm{DB15K}}$ and $\mathrm{children\_of}_{\mathrm{FB15K}}$ are (almost) functional relationships. A relationship r is said to be functional if an entity can be mapped to exactly one entity via r. The relational expert will learn that the body of this rule leads to a sameAs relationship between entities x and y with a very high likelihood.
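The split of the sameAs triples described above can be sketched as follows. The shuffling, the seed handling and the rounding rule are assumptions of this sketch, and the function name is hypothetical.

```python
import random


def split_alignments(same_as, train_fraction, rng):
    """Use a fraction P of the alignments for training; halve the rest into validation/test."""
    shuffled = list(same_as)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    remaining = shuffled[n_train:]
    half = len(remaining) // 2
    return shuffled[:n_train], remaining[:half], remaining[half:]


pairs = [(f"fb:{i}", f"db:{i}") for i in range(100)]
train, valid, test = split_alignments(pairs, 0.2, random.Random(0))
```

With 100 alignments and P = 20%, this yields 20 training, 40 validation and 40 test alignments.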
We used ADAM [13] for parameter learning in a mini-batch setting with a learning rate of 0.001, the categorical cross-entropy as loss function, and the number of epochs set to 100. We validated every 5 epochs and stopped learning whenever the MRR (Mean Reciprocal Rank) values on the validation set decreased. The batch size was set to 512 and the number N of negative samples to 500 for all experiments. We follow the same evaluation procedure as previous works in the link prediction literature. Therefore, we measure the ability to answer completion queries of the form (h, SameAs, t?) and (h?, SameAs, t). For queries of the form (h, SameAs, t?), wherein h is an entity of the first KG, we replaced the tail by each of the second KG's entities in turn, sorted the triples based on the scores or probabilities, and computed the rank of the correct entity. We repeated the same process for the queries of type (h?, SameAs, t), wherein t in this case corresponds to an entity of the second KG and we iterate over the entities of the first KG to compute the scores. The mean of all computed ranks is the Mean Rank (lower is better) and the fraction of correct entities ranked in the top n is called hits@n (higher is better). We also compute the Mean Reciprocal Rank (higher is better), which is an evaluation metric that is less susceptible to outliers. Note that the filtered setting described in [5] does not make sense in this problem, since an entity can be linked to an entity via a SameAs relationship only once.
We report the performance of the PoE in its full scope in Tables 4 and 5. We also show feature ablation experiments, each of which corresponds to removing one modality from the full set. The performance of each modality in isolation is also depicted. We use the abbreviation PoE-suffix to refer to the different instances of PoE, where suffix is a combination of the letters L (Latent), R (Relational), N (Numerical) and I (Image) to indicate the inclusion of each of the four feature types. Generalizations are complicated to make, given that the performance of the PoE instances differs across percentages of aligned entities and pairs of knowledge graphs. Nevertheless, there are two instances of our PoE approach, PoE-lrni and PoE-rni, that tend to outperform all others for low and high percentages of aligned entities, respectively. Results seem to indicate that the embedding expert's response dominates over the others, and hence its addition to PoE harms the performance when that expert is not the best-performing one. Table 3 and Figure 7 provide examples of queries where numerical and visual information led to good performance, respectively. It is hard to find one specific reason that explains when adding numerical and visual information is beneficial for the task at hand. For example, there are entities with a more canonical visual representation than others. This relates to the difficulty of learning from visual data in the sameAs link prediction problem, as visual similarity largely varies across entities. Similarly, the availability of numerical attributes largely varies even for entities of the same type within a KG. However, Tables 4 and 5 provide empirical evidence of the benefit of including additional modalities. Table 6 depicts results for the best-performing instance of PoE and the baselines discussed in Section 4. The best-performing instance of PoE significantly outperforms the approaches CONCAT and ENSEMBLE. This validates the choice of the PoE approach, which can incorporate data modalities into the link prediction problem in a principled manner.
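The ranking metrics used in this evaluation (Mean Rank, Mean Reciprocal Rank and hits@n) can be computed from the list of ranks of the correct entities. The sketch below uses hypothetical function names and made-up scores. Ties are broken by the original candidate order, which is an arbitrary choice of this sketch.

```python
def rank_of(correct, candidates, score):
    """1-based rank of `correct` when candidates are sorted by descending score."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked.index(correct) + 1


def summarize(ranks, n=10):
    """Mean Rank, Mean Reciprocal Rank and hits@n over a list of ranks."""
    count = len(ranks)
    return {
        "MR": sum(ranks) / count,
        "MRR": sum(1.0 / r for r in ranks) / count,
        "hits@n": sum(r <= n for r in ranks) / count,
    }


toy_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
example_rank = rank_of("c", ["a", "b", "c"], toy_scores.get)
metrics = summarize([1, 2, 4], n=1)
```

The reciprocal in MRR bounds the contribution of each query to at most 1, which is why a single very poor rank affects it far less than it affects the Mean Rank.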

Conclusion
We present MMKG, a collection of three knowledge graphs that contain multi-modal data, to benchmark link prediction and entity matching approaches. An interesting property of MMKG is that the three knowledge graphs are very heterogeneous with respect to the number of relation types and the degree of sparsity, for instance. An extensive set of experiments validate the utility of the data set in the sameAs link prediction task.