1 Introduction

Recently, large-scale knowledge graphs have been used to represent transportation networks, e-commerce and shopper preference networks, social and communication networks, and many other real-world systems. Such graphs often contain hundreds of millions, or even billions, of vertices and edges. Many data processing methods rely on large-scale graph analytics, often based on node and graph embedding and node feature extraction, which can in turn be used in various machine learning tasks. However, state-of-the-art techniques do not scale to large graphs without losing accuracy and/or efficiency. A knowledge graph often needs to be partitioned into multiple sub-graphs, called shards, stored at multiple computing nodes, which then requires distributed or parallel graph processing.

Graph embedding, also known as network embedding, is a frequently used technique for learning low-dimensional representations of a graph’s vertices that attempt to capture and retain the graph’s structure, as well as its inherent properties. Many tasks on graphs, such as link prediction, node classification, and visualization, benefit greatly from embedding a very large, web-scale graph into a low-dimensional vector space. More specifically, we might be interested in estimating the most likely labels for nodes in a network, predicting user interests in a social network, or predicting the functional labels of proteins in a protein-protein interaction network [1]. Similarly, in a link prediction task [2], we might want to know whether a pair of nodes in a graph should be connected by an edge. Link prediction is beneficial in numerous fields. For example, in bioinformatics, it aids in the discovery of novel protein interactions [3], and in social networks it can recognize “real-world buddies” [4].

A knowledge graph (KG) is a directed graph G = (V, E) whose nodes vi ∈ V are entities and whose edges ei ∈ E are relations connecting those entities. Knowledge graphs are often represented as RDF [5] datasets, in which triples (vi, ei, vj) represent some type of semantic relationship between the connected entities; nodes/entities are identified by URIs. Target nodes in triples are either URIs or literals, and edges/relationships have types represented by URIs as well. RDFS [6] is used to define a schema for an RDF knowledge graph. KGs are closely related to Heterogeneous Information Networks (HIN) [21].
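For illustration, the sketch below (a toy example of our own, with hypothetical URIs, rather than code from the paper) shows how RDF-style triples induce a graph of entity nodes and typed, directed edges:

```python
# Toy RDF-style triples as Python tuples; the URIs are hypothetical.
triples = [
    ("http://example.org/Alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/Bob"),                       # entity-to-entity relation
    ("http://example.org/Alice", "http://xmlns.com/foaf/0.1/name",
     '"Alice"'),                                      # literal-valued object
]

# Entities become graph nodes; predicates become typed, directed edges.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples
                                      if not o.startswith('"')}
edges = [(s, p, o) for s, p, o in triples if not o.startswith('"')]
print(len(nodes), "nodes,", len(edges), "entity-to-entity edge(s)")
```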

Various approaches to graph embedding have been presented in the machine learning literature, e.g., [7,8,9]. They function well on smaller networks, but real-world knowledge graphs, which often have millions of nodes and billions of edges, present a far more difficult challenge. For example, a decade ago, Twitter’s followee-follower network had 175 million active users and approximately twenty billion edges [10]. Most existing graph embedding algorithms do not scale up to networks of this magnitude.

Knowledge graphs can be partitioned into smaller subgraphs, with the hope that many tasks can take advantage of distributed and/or parallel processing. Given a graph G = (V, E), where V is a set of vertices and E is a set of edges, and a number k > 1, a graph partitioning of G is a subdivision of the vertices of G into subsets V1, …, Vk that partition the set V. A balance constraint requires that all partitioned subgraphs be equal, or close, in size. In addition, a common objective is to minimize the total number of cut edges (min-cut), i.e., edges crossing (cutting) partition boundaries.
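As a small worked example of these two objectives, the sketch below (for illustration only; it is not part of the system described in this paper) computes the partition sizes and the cut edges for a toy graph with k = 2:

```python
# Measure partition balance and the min-cut objective for a fixed assignment.
from collections import Counter

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("c", "e"), ("e", "f")]
part  = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}   # V1 and V2

sizes = Counter(part.values())                        # balance: |V1|, |V2|
cut   = [(u, v) for u, v in edges if part[u] != part[v]]

print("partition sizes:", dict(sizes))                # {0: 4, 1: 2}
print("cut edges:", cut)                              # [('c', 'e')]
```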

In this paper, we propose PartKG2Vec, an algorithm for scalable feature learning in partitioned knowledge graphs. Our approach creates embeddings based on random walks in partitioned knowledge graphs and offers significant runtime improvements by performing the walks in parallel. This is important in random walk-based methods, especially semantics-based ones such as metapath2vec [29], due to the high cost of selecting the next node at each step of a random walk.

The rest of the paper is structured as follows. In Sect. 2, we briefly discuss related work. We present the technical details of PartKG2Vec in Sect. 3. In Sect. 4, we explain the implementation of PartKG2Vec. In Sect. 5, we empirically evaluate PartKG2Vec. We conclude with a discussion of the PartKG2Vec framework and highlight some interesting directions for future work in Sect. 6.

2 Related Work

Recently, graph representation learning has attracted a lot of attention. In general, there are two types of graph representation learning methods: unsupervised and supervised. The goal of unsupervised approaches is to learn low-dimensional representations that preserve the structure of a given graph. Supervised methods work in the same way, but are trained for a specific prediction task, such as node or graph classification. Only unsupervised approaches are discussed in this paper.

Unsupervised embedding methods map a graph’s nodes and edges into a continuous vector space. Several graph embedding techniques have been motivated by the Word2Vec algorithm [12], which originated in natural language processing. One variant of this algorithm relies on the skip-gram embedding model, in which a word’s embedding is optimized to predict its context, or adjacent words. A random walk in a graph is akin to a sequence of words in a sentence, where the nodes visited in the walk can be thought of as words.
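To make the analogy concrete, the sketch below (assuming the gensim library, version 4 or later) feeds toy random walks to a skip-gram Word2Vec model exactly as sentences of words would be fed:

```python
# Random walks play the role of sentences, node ids the role of words.
from gensim.models import Word2Vec

walks = [["a", "b", "c", "d"], ["b", "a", "d", "c"]]   # toy walks over 4 nodes
model = Word2Vec(sentences=walks, vector_size=16, window=2,
                 min_count=0, sg=1, workers=1)          # sg=1 selects skip-gram
print(model.wv["a"][:4])                                # embedding of node "a"
```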

Deepwalk [13] is one of the first approaches to embedding graph-structured data. Deepwalk relies on the parallels between graph nodes and words: its neural networks are trained to maximize the likelihood of predicting the context nodes of each target vertex in a graph, in terms of vertex proximity.

node2vec [14] is a popular unsupervised graph embedding algorithm that extends Deepwalk's sampling strategy. It utilizes biased random walks whose parameters interpolate between breadth-first and depth-first exploration, capturing both local and global community structures and resulting in more informative embeddings.

LINE [15], an acronym for Large-scale Information Network Embedding, produces embeddings that preserve first- and second-order proximity. For first-order proximity, LINE minimizes a graph regularization loss; for second-order proximity, it decodes embeddings into context conditional distributions for each node, which is computationally expensive. LINE uses negative sampling to sample negative edges based on a noisy distribution over edges. Finally, LINE combines the first- and second-order embeddings by concatenation.

HARP [16], or Hierarchical Representation Learning for Networks, lowers the number of nodes in the graph by coarsening it in a hierarchical manner. By iteratively grouping nodes into super nodes, it creates a series of graphs of decreasing size with properties similar to the original network. Existing approaches, such as LINE or Deepwalk, are then used to learn node embeddings for each coarsened graph. The random walk technique on Gt-1 uses the embedding learned for Gt as the initial embedding at time-step t. This process is repeated until each node in the original graph is embedded.

2.1 Embedding of Partitioned Graphs

PyTorch-BigGraph [17], also known as PBG, is a multi-relation embedding system that can scale to graphs with billions of nodes and trillions of edges by incorporating various improvements to existing multi-relation embedding systems. PBG can train very large embeddings on a distributed cluster using graph partitioning. The adjacency matrix is decomposed into N buckets, and the edges of each bucket are trained on individually. PBG then either performs distributed execution across multiple machines or swaps embeddings from each partition to disk to reduce memory use.

MILE [18], or Multi-Level Embedding, is a graph embedding framework that can scale to large graphs. It uses a hybrid matching technique to repeatedly coarsen the graph into smaller ones while maintaining its structure. It then applies known embedding methods to the coarsest graph and uses a graph convolution neural network to refine the embedding back to the original graph. It is independent of the underlying graph embedding technique and may be applied to a wide range of existing graph embedding methods without requiring them to be modified. It has been demonstrated that MILE dramatically improves graph embedding time (by an order of magnitude).

Accurate, Efficient and Scalable Graph Embedding [19] builds on the GCN [20] model; GCNs and their variants are strong graph embedding tools for graph classification and clustering. The work proposes a unique graph-sampling-based GCN parallelization strategy that achieves excellent scalability on very large graphs without sacrificing accuracy. To scale, it exploits parallelism within and across many sampling instances in the graph sampling step and devises an efficient data structure for concurrent accesses. Data partitioning improves cache utilization within the sampled graph. On several large datasets, its parallel graph embedding exceeds state-of-the-art approaches in terms of scalability, efficiency, and accuracy.

PartKG2Vec, presented in this paper, processes a partitioned knowledge graph in parallel to generate the random walks for the embedding. PartKG2Vec is a graph embedding system capable of handling big graphs, with a parallelization approach that achieves good scalability on very large graphs while maintaining accuracy. PartKG2Vec can be used with any random walk generator algorithm with minor adjustments.

The nodes in the knowledge graph are partitioned into sub-graphs using METIS [11]. First, random walks within the subgraphs of a partitioned graph generate partial random walks. These partial walks are then combined to form complete walks. In this way, the likelihood of preserving network neighborhoods of nodes in a d-dimensional feature space is maximized. The random walks are performed independently, starting from the initial nodes within each partition. Some of these walks will be incomplete (shorter than a desired length), as they reach a partition boundary. These walks are then completed with fragments of random walks from the neighboring partitions. The full set of complete walks is then used for representation learning to generate the knowledge graph embedding.

PBG and PartKG2Vec both use partitioning to support knowledge graphs that are too large for a single machine and to enable distributed training of the model. PBG creates buckets from the cross edges (pi, pj); these buckets are loaded and subdivided among the CPU threads for training. PartKG2Vec is different: we create metadata of cut edges for each partition, and complete or partial random walks are generated separately in each partition. Before representation learning, the partial random walks are completed by concatenation with other walk fragments (from neighboring partitions). MILE repeatedly coarsens the graph into smaller graphs, where multiple nodes are collapsed into super nodes and the edges between them are the union of the original edges; in PartKG2Vec, by contrast, we partition the knowledge graph to reduce the size of each graph while maintaining its structure.

3 Partitioned Knowledge Graph Embedding

Our method, PartKG2Vec, (1) partitions a knowledge graph into k partitions; (2) distributes the resulting shards to k computing nodes; (3) creates complete and/or partial random walks in each shard, where a walk is partial if a cut edge is encountered before the desired walk length is reached; (4) obtains a complete set of random walks from the already complete walks in individual shards and by concatenating partial walks with sub-walks from neighboring partitions, where a sub-walk is a fragment of a walk beginning with the target node of the cut edge that terminated the corresponding partial walk; and (5) uses the complete walks for graph embedding. Further downstream, the graph embedding can be used to solve other problems, including link prediction, node classification, and many other tasks.

Knowledge graphs used in the method presented in this paper are represented as RDF datasets; however, other graph representations can easily be adapted for use in PartKG2Vec. The method has a time complexity bounded by O(|V| log |V|).

3.1 Graph Indexing, Partitioning, and Segregation

To speed up the process of learning embeddings for the knowledge graph, we created indices on all triples in the knowledge graph, using Apache Lucene [22], based on their subjects, predicates, and objects. Using these indexes, triples of the form (S, P, O) can be efficiently searched, similarly to our prior work in WawPart [27] and AWAPart [28]. This indexing helps the system convert the knowledge graph into a representation suitable for graph partitioning, as the URIs used in triples are converted to numeric identifiers. This new graph representation is then partitioned into several sub-graphs using METIS [11]. We have experimented with other ways to partition the knowledge graph, including bisection methods, community detection methods, and others, but found that METIS produces the best partitions, with a low number of cut edges, in acceptable runtime.
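The id-conversion step can be sketched as follows (a simplification: the actual system uses a Lucene index, for which plain Python dictionaries stand in here):

```python
# Map triple URIs to consecutive integer ids, as METIS expects numeric vertices.
triples = [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")]

node_id, edges = {}, []
for s, _, o in triples:
    for n in (s, o):
        node_id.setdefault(n, len(node_id) + 1)    # METIS vertex ids start at 1
    edges.append((node_id[s], node_id[o]))

id_node = {i: n for n, i in node_id.items()}       # inverse map restores URIs later
print(edges)                                       # [(1, 2), (2, 3)]
```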

The partitioning outputs the list of nodes and their partition identifiers. Our system compares this list with the complete graph to produce the list of cut edges. Consequently, along with its edges, each partition stores information about its cut edges. The cut edges are replicated across the shards that share them; that is, a cut edge {u, v} is stored with both partitions to which the nodes u and v belong.
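This bookkeeping can be sketched as follows (an illustration with assumed names, not the system's actual code); the cut-edge lists are derived from the partitioner's output, and each cut edge is replicated to both shards:

```python
# Derive per-shard cut-edge metadata from a node-to-partition assignment.
edges = [(1, 2), (2, 3), (3, 4)]
part  = {1: 0, 2: 0, 3: 1, 4: 1}                   # METIS-style output

shard_cut_edges = {p: [] for p in set(part.values())}
for u, v in edges:
    if part[u] != part[v]:                         # edge crosses a boundary
        shard_cut_edges[part[u]].append((u, v))    # stored with u's shard...
        shard_cut_edges[part[v]].append((u, v))    # ...and replicated to v's
print(shard_cut_edges)                             # {0: [(2, 3)], 1: [(2, 3)]}
```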

3.2 Random Walk Generation

The created knowledge graph partitions (sub-graphs) are stored as shards at the computing nodes for processing and random walk generation. Figure 1(a) shows an example of two partitions, P1 and P2, with vertices {a, b, c, d, u, v} and {m, n, o, p, q, v, u}, respectively. The partitions are connected by the cut edge {u, v}. A modified node2vec algorithm attempts to generate random walks of length walk_length within each partition. However, if a walk in partition P1 encounters the cut edge {u, v}, transitioning to partition P2, the random walk is terminated and recorded as a partial walk. The node v is recorded as the exit node of P1 and an entry node of P2. Figure 1(b) shows such a partial random walk, interrupted because it attempted to cross to P2 via the cut edge {u, v}. The exit and entry nodes and the current walk length are stored with the partial walk. This data is later used to complete the partial walk, as described below.
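This behavior can be sketched as follows (simplified to an unbiased walk rather than node2vec's p/q-biased walk; the function and variable names are our own):

```python
# Per-shard walk generation: a walk that would cross a cut edge stops and is
# recorded as partial, together with its exit node in the other partition.
import random

def shard_walk(adj, part_of, my_part, start, walk_length):
    walk = [start]
    while len(walk) < walk_length:
        nxt = random.choice(adj[walk[-1]])
        if part_of[nxt] != my_part:        # cut edge reached: stop here
            return walk, nxt               # partial walk + its exit node
        walk.append(nxt)
    return walk, None                      # complete walk, no exit node

# Mirrors Fig. 1(a): node v belongs to P2, the remaining nodes to P1.
adj = {"a": ["b"], "b": ["a", "c", "u"], "c": ["b", "d"], "d": ["c"],
       "u": ["b", "v"]}
part_of = {"a": 1, "b": 1, "c": 1, "d": 1, "u": 1, "v": 2}
print(shard_walk(adj, part_of, 1, "a", 8))
```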

Fig. 1. Completion of walks from partial walks. (a) Nodes in partitions P1 and P2. (b) A random walk in P1 is interrupted at the exit node v. (c) Random walks W and sub-walks SW passing through the entry node v in P2. (d) Formation of a complete walk CW from a partial walk in P1 and a sub-walk in P2.

A random walk in a partition may traverse a node v that is also an endpoint of a cut edge. However, even though the node belongs to a cut edge, the walk does not terminate at v (as a partial walk) and continues within the same partition. A sub-walk is a sub-sequence of nodes in a random walk, beginning at a node v of a cut edge, but not crossing to the other partition. The modified node2vec algorithm also records and indexes all sub-walks within each partition. Figures 1(c1) and 1(c2) show random walks and sub-walks in P2. In Fig. 1(c3), no sub-walk exists, because the walk in the figure does not traverse node v, which belongs to the cut edge.
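The sub-walk index can be sketched as follows (an illustrative simplification with assumed names):

```python
# Index every suffix of a walk that begins at an entry node of a cut edge.
from collections import defaultdict

def index_sub_walks(walks, entry_nodes):
    idx = defaultdict(list)
    for w in walks:
        for i, n in enumerate(w):
            if n in entry_nodes:
                idx[n].append(w[i:])       # suffix starting at the entry node
    return idx

walks_p2 = [["m", "v", "p", "o", "n", "q"],   # traverses entry node v
            ["q", "n", "o", "p"]]             # never visits v: no sub-walk
print(dict(index_sub_walks(walks_p2, {"v"})))
# {'v': [['v', 'p', 'o', 'n', 'q']]}
```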

3.3 Accumulator and Graph Embeddings

The Random Walk Accumulator collects all the partial and complete random walks generated within all partitions. Complete walks need no further work, as they already have the desired walk length; however, partial walks must be extended to the required length. For each partial walk, a sub-walk starting at the partial walk’s exit node is randomly selected and concatenated to the partial walk to form a complete random walk. As shown in Fig. 1(d), the complete random walk a, b, d, c, u, v, p, o, n, q is created from the sub-walk in Fig. 1(c1) and the partial walk in Fig. 1(b). Once the full set of complete random walks is created, it can be used by the representation learning module.
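The stitching step can be sketched as follows (again a simplification with assumed names, not the authors' code):

```python
# Complete a partial walk with a randomly chosen matching sub-walk.
import random

def complete_walk(partial, exit_node, sub_walk_index, walk_length):
    sub = random.choice(sub_walk_index[exit_node])  # sub-walk starts at exit node
    return (partial + sub)[:walk_length]            # trim to the desired length

partial = ["a", "b", "d", "c", "u"]                 # stopped at cut edge {u, v}
index   = {"v": [["v", "p", "o", "n", "q"]]}
print(complete_walk(partial, "v", index, 10))
# ['a', 'b', 'd', 'c', 'u', 'v', 'p', 'o', 'n', 'q']  -- matches Fig. 1(d)
```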

Fig. 2. PartKG2Vec pipeline

4 Implementation

Figure 2 shows the processing pipeline used in PartKG2Vec, while Fig. 3 shows the architecture of the system. A knowledge graph is given as input and its embedding is produced as output. The KG2Index Converter indexes the knowledge graph using Lucene [22] and converts it into a format suitable for partitioning. The graph is partitioned into k partitions using METIS [11] by the Partition Engine. The partition data (the edge lists) are sent to the Graph Partition Segregator, which creates the final partitions and identifies the cut edges to be included with each partition.

The k partitioned sub-graphs (shards) are then sent to k processing nodes to produce random walks. All partial (PRW) and complete random walks (CRW) are transferred from the processing nodes to the master node. The master node runs the Accumulator, which gathers all the walks (partial walks, sub-walks, and complete walks) and other critical information. Already complete walks are simply retained, while the Accumulator uses partial walks and matching sub-walks to create complete walks of the desired length. At the end of the accumulation process, a corpus of complete random walks is finalized. This set of random walks is then used for representation learning. Finally, the Lucene index is used to restore the original node identifiers (URIs) in the knowledge graph embedding.

Fig. 3. PartKG2Vec architecture

5 Evaluation

Two popular datasets, Yago39K [23] and NELL [24], were used to evaluate PartKG2Vec. Yago39K contains a subset of the Yago knowledge base [25], which includes data extracted from Wikipedia, WordNet, and GeoNames. Yago39K contains 123,182 unique entities (nodes) and 1,084,040 edges, with 37 different relation types. NELL is a knowledge graph mined from Web documents and contains 49,869 unique nodes and 296,013 edges, with 827 relation types. The evaluation experiments discussed here were conducted on an Intel i7-based cluster.

Two experiments were used to evaluate the performance of PartKG2Vec. The first experiment was designed to evaluate the runtime of producing the embedding on the complete vs. the partitioned graph. The second experiment was intended to compare the graph embeddings produced by PartKG2Vec (based on the modified node2vec and Deepwalk algorithms) with the embeddings produced by the original algorithms on un-partitioned graphs. In both experiments, the two knowledge graphs (Yago39K and NELL) were partitioned into N = 10 partitions. We set all walk parameters to their default values, namely the number of walks per node to 10, the walk length to 80, the number of workers to 8, the window size to 10, and the walk parameters p and q both to 1.
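For reference, these settings can be collected in one place (the dictionary form is purely illustrative; it is not the system's actual configuration interface):

```python
# Walk and partitioning settings used in both experiments.
walk_params = dict(
    num_partitions=10,   # N, partitions per knowledge graph
    num_walks=10,        # walks started per node
    walk_length=80,
    workers=8,
    window_size=10,
    p=1.0, q=1.0,        # node2vec return / in-out parameters
)
```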

5.1 Experiment 1: Runtime Improvement

This experiment demonstrates the improvement in the runtime of random walk generation on the partitioned graph, as compared to random walks produced on the complete graph by the original algorithms. Figure 4 shows the runtime of node2vec and PartKG2Vec (with modified node2vec) on Yago39K and NELL, while Fig. 5 shows the runtime of Deepwalk and PartKG2Vec (with modified Deepwalk) on Yago39K and NELL. Graph preprocessing and the 10 iterations of random walk generation are shown. Node2vec did not require the steps before graph preprocessing, nor the accumulation of random walks; these extra steps were required only by PartKG2Vec, but they did not take considerable time, so we consider them insignificant. Figure 4 indicates that graph preprocessing in PartKG2Vec took 32% of the time required by node2vec (695 vs. 2175 s) on Yago39K. Ten iterations of random walk generation were used in both algorithms, but since PartKG2Vec runs in parallel, it took only 17.75% of the time required by the original node2vec (480 vs. 2702 s) on the complete knowledge graph.

Fig. 4. PartKG2Vec (node2vec) runtime comparison with node2vec on Yago39K and NELL

Fig. 5. PartKG2Vec (Deepwalk) runtime compared to Deepwalk on Yago39K and NELL

Similarly, Fig. 4 shows that the time required by PartKG2Vec for graph preprocessing was only 20.5% of the time required by node2vec (13.5 vs. 66 s) on the NELL dataset. Ten iterations of random walk generation were used in both algorithms, but since PartKG2Vec_N2V runs in parallel, it took only 6.5% of the time (51 vs. 794 s) taken by node2vec on the complete graph. Learning the graph embedding takes the same time for node2vec and PartKG2Vec, because at this point both algorithms work on a similar random walk pool.

Figure 5, with the results for the Yago39K dataset, shows that the time required for random walk generation by PartKG2Vec is only 21.6% of the time used by Deepwalk (46 vs. 213 s); for the NELL dataset, the time required by PartKG2Vec is only 28% of the time used by Deepwalk (23 vs. 82 s).

5.2 Experiment 2: Embedding Quality

This experiment evaluates the quality of the embeddings based on the random walks generated by PartKG2Vec vs. node2vec and Deepwalk. Again, the experiment used the same two knowledge graphs (NELL and Yago39K) and produced their embeddings with PartKG2Vec, node2vec, and Deepwalk, with varied dimensions d ∈ {128, 64, 32, 16}. The algorithms were executed 25 times for each dimension. To compare the produced embeddings, the average divergence scores [26] SA,d were computed. Broadly speaking, a divergence score is the result of comparing the original graph with a graph whose edges are re-created from the embedding produced for it. When comparing embeddings, a lower divergence score indicates a better embedding; conversely, a higher divergence score means that a given embedding is not as good.

Figure 6 shows the divergence scores of the embeddings of the Yago39K dataset produced by node2vec and PartKG2Vec_N2V (the modified node2vec), as well as the embeddings produced by Deepwalk and PartKG2Vec_DW (the PartKG2Vec implementation based on Deepwalk). The embeddings have very similar divergence scores at every dimension. Incidentally, node2vec (and PartKG2Vec_N2V) produces better embeddings than Deepwalk (and PartKG2Vec_DW). Comparing the embeddings produced for the NELL dataset leads to similar conclusions, as the divergence scores for node2vec vs. PartKG2Vec_N2V and for Deepwalk vs. PartKG2Vec_DW are very similar.

Fig. 6. Average divergence scores of embeddings on Yago39K and NELL produced by node2vec and Deepwalk and by their corresponding PartKG2Vec_N2V and PartKG2Vec_DW methods.

This demonstrates that the embeddings generated by node2vec on the original graph are very similar to those generated by PartKG2Vec_N2V, and that the embeddings generated by Deepwalk are similar to those from PartKG2Vec_DW.

6 Conclusions and Future Work

We have proposed PartKG2Vec, a system for creating embeddings of partitioned knowledge graphs. The method uses modified node2vec and Deepwalk random walk algorithms to take advantage of the partitioning and run in parallel. Our experiments showed that the embeddings produced on the original knowledge graphs are very similar to those produced by our method on the partitioned graphs. Importantly, PartKG2Vec offers significant performance improvements over the embedding algorithms run on the unpartitioned (original) knowledge graphs, which improves the runtime of embedding very large graphs.

In the future, we intend to study other embedding algorithms utilizing different types of random walks, especially those incorporating the semantics of knowledge graphs, such as metapath2vec [29] and RegPattern2Vec [30].