Relational Graph Convolutional Networks: A Closer Look

In this paper, we describe a reproduction of the Relational Graph Convolutional Network (RGCN). Using our reproduction, we explain the intuition behind the model. Our reproduction results empirically validate the correctness of our implementations using benchmark Knowledge Graph datasets on node classification and link prediction tasks. Our explanation provides a friendly understanding of the different components of the RGCN for both users and researchers extending the RGCN approach. Furthermore, we introduce two new configurations of the RGCN that are more parameter efficient. The code and datasets are available at https://github.com/thiviyanT/torch-rgcn.


Message Passing
We will begin by describing the basic Graph Convolutional Network (GCN) layer for directed graphs. This will serve as the basis for the RGCN in Section 3.2.
The GCN [KW16] is a graph-to-graph layer that takes a set of vectors representing nodes as input, together with the structure of the graph, and generates a new collection of representations for the nodes in the graph. A directed graph is defined as G = (V, E), where V is a set of vertices (nodes) and E is a set of tuples (i, j) indicating the presence of directed edges pointing from node i to node j.
[Figure 3.1: Message passing for a node i. The figure shows the node embedding before and after the message passing step. The neighboring nodes are labelled from a to e.]
Equation 1 shows the message passing rule of a single-layer GCN for an undirected graph, G:
$$ H = \sigma(A X W) \tag{1} $$
Here, X is a node feature matrix, W represents the weight parameters, and σ is a non-linear activation function. A is a matrix computed by row-normalizing the adjacency matrix of the graph G. The row normalization ensures that the scale of the node feature vectors does not change significantly during message passing. The node feature matrix, X, indicates the presence or absence of a particular feature on a node.
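As a minimal sketch of Equation 1, the following NumPy snippet row-normalizes a small adjacency matrix and applies one GCN layer. The function names, the toy graph, the choice of ReLU for σ, and the convention that `adj[i, j] = 1` means node i receives a message from node j are all assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def row_normalize(adj):
    """Divide each row by its sum; rows without any edge are left as zeros."""
    row_sums = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)

def gcn_layer(adj, x, w):
    """One GCN layer, H = ReLU(A X W), with A the row-normalized adjacency."""
    return np.maximum(row_normalize(adj) @ x @ w, 0.0)

# Toy graph with 3 nodes (assumed convention: row i lists the incoming neighbors of i).
adj = np.array([[0., 1., 1.],
                [1., 0., 0.],
                [0., 0., 0.]])
x = np.array([[1., 0.],
              [0., 2.],
              [4., 2.]])
w = np.eye(2)  # identity weights, so the output is just the neighbor average

h = gcn_layer(adj, x, w)
```

With identity weights, node 0 ends up with the average of the features of nodes 1 and 2, while node 2, which has no incoming edges, receives a zero vector. The latter illustrates why self-loops are usually added.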
Typically, more than a single convolutional layer is required to capture the complexity of large graphs. In these cases, the GCN layers are stacked one after another so that the output of the preceding layer, H^(l−1), is used as the input for the current layer, which produces H^(l), as shown in Equation 2. In our work, we will use superscript l to denote the current layer.
$$ H^{(l)} = \sigma\left(A H^{(l-1)} W^{(l)}\right) \tag{2} $$
If the data comes with a feature vector for each node, these can be used as the input X for the first layer of the model. If feature vectors are not available, one-hot vectors, of length N with the non-zero element indicating the node index, are often used. In this case, the input X becomes the identity matrix I, which can then be removed from Equation 1.
We can rewrite Equation 1 to make it explicit how the node representations are updated based on a node's neighbors:
$$ h_i = \sigma\left[ \frac{1}{|N_i|} \sum_{j \in N_i} x_j^T W \right] \tag{3} $$
Here, x_i is an input vector representing node i, h_i is the output vector for node i, and N_i is the collection of the incoming neighbors of i, that is, the nodes j for which there is an edge (j, i) in the graph. For simplicity, the bias term is left out of the notation, but it is usually included. We see that the GCN takes the average of i's neighboring nodes, and then applies a weight matrix W and an activation σ to the result: the terms x_j^T W are summed over all neighboring nodes, and multiplying by 1/|N_i| turns this sum into an average. This makes every convolution layer permutation equivariant, that is: if the nodes are permuted (in A and X), then the output representations are permuted in the same way. Overall, this operation has the effect of passing information about neighboring nodes to the node of interest, i, and this is called message passing. Message passing is graphically represented in Figure 3.1 for an undirected graph, where messages from neighboring nodes (a to e) are combined to generate a representation for node i. After message passing, the new representation of node i is a mixture of the vector embeddings of neighboring nodes.
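The two claims in this paragraph (the matrix form agrees with per-node neighbor averaging, and the layer is permutation equivariant) can be checked numerically. The sketch below uses a random toy graph and tanh as the activation; all names and sizes are illustrative assumptions.

```python
import numpy as np

def gcn(adj, x, w):
    """Matrix form: tanh(A X W) with A the row-normalized adjacency."""
    deg = adj.sum(axis=1, keepdims=True)
    a_norm = adj / np.maximum(deg, 1.0)  # isolated nodes keep an all-zero row
    return np.tanh(a_norm @ x @ w)

rng = np.random.default_rng(0)
n, f_in, f_out = 5, 3, 2
adj = (rng.random((n, n)) < 0.5).astype(float)  # adj[i, j] = 1: edge from j to i
np.fill_diagonal(adj, 0.0)
x = rng.normal(size=(n, f_in))
w = rng.normal(size=(f_in, f_out))
h = gcn(adj, x, w)

def node_update(i):
    """Node-wise form: average the messages x_j^T W over incoming neighbors j."""
    neighbors = np.flatnonzero(adj[i])
    if len(neighbors) == 0:
        return np.zeros(f_out)
    return np.tanh(np.mean([x[j] @ w for j in neighbors], axis=0))

h_nodewise = np.stack([node_update(i) for i in range(n)])

# Relabel the nodes: permute rows and columns of A and the rows of X.
perm = rng.permutation(n)
h_perm = gcn(adj[perm][:, perm], x[perm], w)
```

Here `h_nodewise` matches `h`, and `h_perm` matches `h[perm]`, i.e. relabelling the nodes only relabels the outputs.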
If a graph is sparsely connected, a single graph-convolution layer may suffice for a given downstream task. Using more convolutional layers encourages mixing with nodes more than one hop away; however, it can also lead to the output features being oversmoothed [LHW18]. This is an issue because the embeddings for different nodes may then become indistinguishable from each other.
In summary, GCNs perform the following two operations: 1) They replace each node representation by the average of its neighbors, and 2) They apply a linear layer with a nonlinear activation function σ. There are two issues with this definition of the GCN. First, the input representation of node i does not affect the output representation, unless the graph contains a self-loop for i. This is often solved by adding self-loops explicitly to all nodes. Second, only the representations of nodes that have incoming links to i are used in the new representation of i. In the relational setting, we can solve both problems elegantly by adding relations to the graph which we will describe in the next section.

Extending GCNs for multiple relations
In this section, we explain how the basic message passing framework can be extended to Knowledge Graphs. We define a Knowledge Graph as a directed graph with labelled vertices and edges. Formally, a KG can be defined as G = (V, E, R), where R represents the set of edge labels (relations) and E is a set of tuples (s, r, o) representing that a subject node s and an object node o are connected by the labelled edge r ∈ R.
The Relational Graph Convolutional Network extends graph convolutions to Knowledge Graphs by accounting for the directions of the edges and handling message passing for different relations separately. Equation 4 is an extension of the regular message passing rule (Equation 1):
$$ H = \sigma\left( \sum_{r=1}^{R} A_r X W_r \right) \tag{4} $$
where R is the number of relations, A_r is an adjacency matrix describing the edge connections for a given relation r, and W_r is a relation-specific weight matrix. The extended message passing rule defines how the information should be mixed together with neighboring nodes in a relational graph. In the message passing step, the embedding is summed over the different relations.
With the message passing rule discussed thus far, the problem is that for a given triple (s, r, o) a message is passed from s to o, but not from o to s. For instance, for the triple (Amsterdam, located in, The Netherlands) it would be desirable to update both Amsterdam with information from The Netherlands, and The Netherlands with information from Amsterdam, while modelling the two directions as meaning different things. To allow the model to pass messages in two directions, the graph is amended inside the RGCN layer by including inverse edges: for each existing edge (s, r, o), a new edge (o, r', s) is added, where r' is a new relation representing the inverse of r.
A second problem with the naive implementation of the (R)GCN is that the output representation for a node i does not retain any of the information from the input representation. To allow such information to be retained, a self-loop (s, r_s, s) is added to each node, where r_s is a new relation that expresses identity. Altogether, if the input graph contains R relations, the amended graph contains 2R + 1 relations:
$$ R^{+} = 2R + 1 $$
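The graph amendment described above can be sketched in a few lines of plain Python. The integer relation-id layout (originals first, then inverses, then one self-loop relation) and the function name are assumptions chosen for this example.

```python
def add_inverse_and_self_loops(triples, num_nodes, num_relations):
    """Return an amended edge list that uses 2R + 1 relation ids.

    Assumed id layout: 0..R-1 are the original relations, R..2R-1 their
    inverses, and 2R is the single self-loop (identity) relation.
    """
    inverse = [(o, r + num_relations, s) for (s, r, o) in triples]
    self_loops = [(i, 2 * num_relations, i) for i in range(num_nodes)]
    return list(triples) + inverse + self_loops

# Toy graph: 3 nodes, 2 relations, 2 edges.
triples = [(0, 0, 1), (1, 1, 2)]
amended = add_inverse_and_self_loops(triples, num_nodes=3, num_relations=2)
relation_ids = sorted({r for (_, r, _) in amended})
```

For this toy graph the amended edge list has 2 original edges, 2 inverse edges and 3 self-loops, and it uses 2R + 1 = 5 distinct relation ids.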

Reducing the number of parameters
We use N_in and N_out to represent the input and output dimensions of a layer, respectively. While the GCN [KW16] requires N_in × N_out parameters, relational message passing uses R^+ × N_in × N_out parameters. In addition to the extra parameters required for a separate GCN for every relation, we also face the problem that Knowledge Graphs do not usually come with a feature vector representing each node. As a result, as we saw in the previous section, the first layer of an RGCN model is often fed with a one-hot vector for each node. This means that for the first layer N_in is equal to the number of nodes in the graph.
In their work, Schlichtkrull et al. introduced two different weight regularisation techniques to improve parameter efficiency: 1) Basis Decomposition and 2) Block Diagonal Decomposition. Figure 3.3 shows visually how the two different regularisation techniques work.
Basis Decomposition does not create a separate weight matrix W_r for every relation. Instead, the matrices W_r are derived as linear combinations of a smaller set of B basis matrices V_b, which is shared across all relations. Each matrix W_r is then a weighted sum of the basis matrices with component weights C_rb:
$$ W_r = \sum_{b=1}^{B} C_{rb} V_b \tag{5} $$
Both the component weights and the basis matrices are learnable parameters of the model, and in total they contain fewer parameters than the full set of matrices W_r. With a lower number of basis matrices, B, the model will have reduced degrees of freedom and possibly better generalisation.
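As an illustration, the weighted sum over basis matrices maps onto a single einsum call. The sizes below are arbitrary toy values and the variable names (`comps`, `bases`) are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rel, num_bases, n_in, n_out = 7, 2, 4, 3

comps = rng.normal(size=(num_rel, num_bases))      # component weights C_rb
bases = rng.normal(size=(num_bases, n_in, n_out))  # shared basis matrices V_b

# W_r = sum_b C_rb V_b, computed for all relations at once.
weights = np.einsum('rb,bio->rio', comps, bases)

# Parameter counts with and without the decomposition.
params_full = num_rel * n_in * n_out                           # 7 * 4 * 3
params_basis = num_rel * num_bases + num_bases * n_in * n_out  # 7 * 2 + 2 * 4 * 3
```

In this toy setting the decomposition stores 38 numbers instead of 84. The saving grows with the number of relations, since each extra relation only adds B component weights.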
Block Diagonal Decomposition creates a weight matrix for each relation, W_r, by partitioning W_r into blocks of size N_in/B by N_out/B, and then fixing the off-diagonal blocks to zero (shown in Figure 3.3). This deactivates the off-diagonal blocks, such that only the diagonal blocks are updated during training. An important requirement for this decomposition method is that the width and height of W_r need to be divisible by B.
$$ W_r = \bigoplus_{b=1}^{B} Q_{rb} = \begin{bmatrix} Q_{r1} & 0 & \cdots & 0 \\ 0 & Q_{r2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Q_{rB} \end{bmatrix} \tag{6} $$
Here, B represents the number of blocks that W_r is decomposed into and Q_rb are the diagonal blocks containing the relation-specific weight parameters. Equation 6 shows that taking the direct sum of Q_rb over all the blocks gives W_r, which can also be expressed as the sum of diag(Q_rb) over all the diagonal elements in the block matrix. (The off-diagonal blocks are N_in/B by N_out/B matrices containing only zeros. In Equation 6, we present these zeroed blocks simply with 0.) The higher the number of blocks B, the lower the number of trainable weight parameters for each relation, W_r, and vice versa. Block diagonal decomposition is not applied to the weight matrix of the identity relation r_s, which we introduced in Section 3.2 to add self-loops to the graph.
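To make the block structure concrete, the following sketch assembles one relation's weight matrix from its diagonal blocks. The helper name and the toy sizes are assumptions; an efficient implementation would typically avoid materialising the zeros at all.

```python
import numpy as np

def block_diagonal(blocks):
    """Place blocks of shape (B, n_in/B, n_out/B) on the diagonal of an (n_in, n_out) matrix."""
    num_blocks, block_h, block_w = blocks.shape
    w = np.zeros((num_blocks * block_h, num_blocks * block_w))
    for b in range(num_blocks):
        w[b * block_h:(b + 1) * block_h, b * block_w:(b + 1) * block_w] = blocks[b]
    return w

n_in, n_out, num_blocks = 6, 4, 2  # both dimensions must be divisible by B
q_r = np.arange(1, 13, dtype=float).reshape(num_blocks, n_in // num_blocks, n_out // num_blocks)
w_r = block_diagonal(q_r)

trainable = q_r.size  # only the diagonal blocks hold parameters
dense = w_r.size      # a full matrix would need n_in * n_out parameters
```

Here a 6 × 4 matrix with B = 2 keeps 12 trainable entries instead of 24, which matches the observation that more blocks means fewer parameters per relation.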
In this Section, we will describe how we implemented the Relational Graph Convolutional Network. We begin by introducing crucial concepts for the implementation.

Einstein Summation
Message passing requires manipulating high-dimensional matrices and tensors using many different operations (e.g. transposing, summing, matrix-matrix multiplication, tensor contraction). We will use Einstein summation to express these operations concisely.
Einstein summation (einsum) is a notational convention for simplifying tensor operations [Kup14]. Einsum takes two arguments: 1) an equation in which characters identifying tensor dimensions provide instructions on how the tensors should be transformed and reshaped, and 2) a set of tensors that need to be transformed or reshaped. We use a notation that maps directly to the way Einstein summation is used in code, rather than the standard notation. For example, einsum(ik, jk → ij, A, B) represents the following matrix operation:
$$ C_{ij} = \sum_{k} A_{ik} B_{jk} $$
7 TensorFlow 2 is not backward compatible with TensorFlow 1 code.
The general rules of an einsum operation are that indices which are excluded from the result are summed out, and indices which are included in all terms are treated as the batch dimension. We use einsum operations in our implementation to simplify the message passing operations.
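The two rules can be tried out directly with NumPy's `einsum`, used here purely as an illustration (the arrays and their shapes are arbitrary). In the first call the index k is absent from the result and is therefore summed out; in the second call the index b appears in every term and acts as a batch dimension.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(12.0).reshape(4, 3)

# 'ik,jk->ij': k is summed out, which equals A multiplied by B transposed.
c = np.einsum('ik,jk->ij', a, b)

# 'bij,bjk->bik': b is kept in all terms, giving a batched matrix product.
x = np.arange(24.0).reshape(2, 3, 4)
y = np.arange(40.0).reshape(2, 4, 5)
z = np.einsum('bij,bjk->bik', x, y)
```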

Sparsity
Since many graphs are sparsely connected, their adjacency matrices can be efficiently stored in memory as sparse tensors. Sparse tensors are memory efficient because, unlike dense tensors, they only store non-zero values and their indices. We make use of sparse matrix multiplications. For sparse matrix operations on GPUs, the only multiplication operation that is commonly available is multiplication of a sparse matrix S by a dense matrix D, resulting in a dense matrix. We will refer to this operation as spmm(S, D). For our implementation, we endeavour to express the sparse part of the RGCN message passing operation (Equation 4), including the sum over relations, in a single sparse matrix multiplication.
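To illustrate what spmm(S, D) computes, here is a small NumPy sketch that stores S in coordinate form (only the indices and values of its non-zero entries). This is an illustrative stand-in written for this explanation, not the routine of any particular deep learning library; the function signature is an assumption.

```python
import numpy as np

def spmm(indices, values, shape, dense):
    """Multiply a sparse matrix in coordinate format by a dense matrix.

    indices: (nnz, 2) array of (row, column) positions of the non-zeros
    values:  (nnz,) array with the corresponding non-zero values
    shape:   (rows, columns) of the sparse matrix
    """
    rows, cols = indices[:, 0], indices[:, 1]
    out = np.zeros((shape[0], dense.shape[1]))
    # Each non-zero S[i, j] adds S[i, j] * D[j] to output row i.
    np.add.at(out, rows, values[:, None] * dense[cols])
    return out

indices = np.array([[0, 1], [0, 2], [2, 0]])
values = np.array([1.0, 2.0, 3.0])
dense = np.arange(6.0).reshape(3, 2)

result = spmm(indices, values, (3, 3), dense)

# Dense reference, built only to check the result.
s_dense = np.zeros((3, 3))
s_dense[indices[:, 0], indices[:, 1]] = values
expected = s_dense @ dense
```

Only three values and their positions are stored for S, yet the product is identical to the dense computation.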

Stacking Trick
Using nested loops to iteratively pass messages between all neighboring nodes in a large graph would be very inefficient. Instead, we use a trick to efficiently perform message passing for all relations in parallel.
Edge connectivity in a relational graph is represented as a three-dimensional adjacency tensor $A \in \mathbb{R}^{R^{+} \times N \times N}$, where N represents the number of nodes and R^+ represents the number of relations. Typically, message passing is performed using batch matrix multiplications as shown in Equation 4. However, at the time of writing, batch matrix operations for sparse tensors are not available in most Deep Learning libraries. Since spmm is the only efficient operation available, we stack the adjacency matrices and implement the whole RGCN in terms of this operation.
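We augment A by stacking the adjacency matrices corresponding to the different relations A_r vertically and horizontally into $A_v \in \mathbb{R}^{N R^{+} \times N}$ and $A_h \in \mathbb{R}^{N \times N R^{+}}$, respectively.
$$ A_h = \left[ A_1 \; A_2 \; \cdots \; A_{R^{+}} \right], \qquad A_v = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_{R^{+}} \end{bmatrix} $$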
Here, [·] represents a concatenation operation, and A v and A h are both sparse matrices. By stacking A r either horizontally or vertically, we can perform message passing using sparse matrix multiplications rather than expensive dense tensor multiplications. Thus, this trick helps to keep the memory usage low.
Algorithm 1 shows how message passing is performed using a series of matrix operations. All these are implementations of the same operation, but with different complexities depending on the shape of the input. 1) If the inputs X to the RGCN layer are one-hot vectors, X can be removed from the multiplication. The featureless message passing simply multiplies A with W , because the node feature matrix X is not given. Note that X, in this case, can also be modelled using an identity matrix I. However, since AW = AIW , we skip this step to reduce computational overhead.
2) In the horizontal stacking approach, X is multiplied with W. This yields the XW tensor, which is then reshaped into an N R^+ × N_out matrix. The reshaped XW matrix is then multiplied with A_h using spmm.
3) In the vertical stacking approach, X is mixed with A_v using spmm. The product is reshaped into a tensor of dimension R^+ × N × N_in. The tensor AX is then multiplied with W.
Any dense/dense tensor operation can be efficiently implemented with einsum, but sparse/dense operations only allow multiplication of a sparse matrix by a dense matrix. Therefore, we adopt the two different stacking approaches for memory efficiency. The vertical stacking approach is suitable for low-dimensional input and high-dimensional output, because the projection to high dimensions is done last. The horizontal stacking approach is good for high-dimensional input and low-dimensional output, because the projection to low dimensions is done first.
[Figure: visual illustration of these matrix operations. One symbol indicates multiplication between dense tensors, which can be implemented with an einsum operator; another refers to sparse-by-dense multiplication, for which the spmm operation is required. Black arrows indicate tensor reshaping.]
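The three variants described above can be compared on a toy example. The sketch below uses dense NumPy arrays for readability, so the products with the stacked adjacency matrices stand in for spmm; the sizes and variable names are assumptions, and activation and normalization are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rel, n, n_in, n_out = 3, 4, 5, 2
adj = (rng.random((num_rel, n, n)) < 0.4).astype(float)  # one A_r per relation
x = rng.normal(size=(n, n_in))
w = rng.normal(size=(num_rel, n_in, n_out))

# Reference: explicit sum over relations of A_r X W_r.
reference = sum(adj[r] @ x @ w[r] for r in range(num_rel))

# Horizontal stacking: A_h has shape (N, R*N); project with W first, then mix.
a_h = np.concatenate(list(adj), axis=1)
xw = np.einsum('ni,rio->rno', x, w).reshape(num_rel * n, n_out)
out_h = a_h @ xw  # in a sparse setting this product would be spmm(A_h, XW)

# Vertical stacking: A_v has shape (R*N, N); mix first, then project with W.
a_v = adj.reshape(num_rel * n, n)
ax = (a_v @ x).reshape(num_rel, n, n_in)  # spmm(A_v, X) followed by a reshape
out_v = np.einsum('rni,rio->no', ax, w)

# Featureless case: X is the identity, so it drops out and W has N input rows.
w_feat = rng.normal(size=(num_rel, n, n_out))
out_featureless = a_h @ w_feat.reshape(num_rel * n, n_out)
reference_featureless = sum(adj[r] @ np.eye(n) @ w_feat[r] for r in range(num_rel))
```

Both stacking orders reproduce the explicit sum over relations. They differ only in the size of the intermediate tensor, which is R × N × N_out for horizontal stacking and R × N × N_in for vertical stacking.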

Algorithm 1: Message Passing Layer
Thus far, we focused on how Relational Graph Convolutional layers work and how to implement them. As mentioned in Section 2, RGCNs can be used for many downstream tasks. Now, we will discuss how these graph-convolutional layers can be used as building blocks in larger neural networks to solve two downstream tasks implemented in the original RGCN paper [SKB + 18]: node classification and link prediction. In the next two sections, we detail the model setup, our reproduction experiments and new configurations of the models. We begin with node classification.

Downstream Task: Node Classification
In the node classification task, the model is trained under a transductive setting which means that the whole graph, including the nodes in the test set, must be available during training, with only the labels in the test set withheld. The majority of the nodes in the graph are unlabelled and the rest of the nodes are labelled (we call these target nodes). The goal is to infer the missing class information, for example, that Amsterdam belongs to the class City.

Model Setup
Figure 5.1 is a diagram of the node classification model with a two-layer RGCN as described in [SKB + 18]. Full-batch training is used for training the node classification model, meaning that the whole graph is represented in the input adjacency matrix A for the RGCN. The input is the unlabeled graph, the output is the class predictions, and the true labels are used to train the model. The first layer of the RGCN is ReLU activated and it embeds the relational graph to produce low-dimensional node embeddings. The second RGCN layer further mixes the node embeddings. Using softmax activation, the second layer generates a matrix with the class probabilities, $Y \in \mathbb{R}^{N \times C}$, and the most probable classes are selected for each unlabelled node in the graph. The model is trained by optimizing the categorical cross entropy loss:
$$ \mathcal{L} = - \sum_{i \in \mathcal{Y}} \sum_{k=1}^{K} t_{ik} \ln h_{ik} $$
where $\mathcal{Y}$ is the set of labelled nodes, K represents the number of classes, t_ik is the one-hot encoded ground truth label, and h_ik is the corresponding entry of the network output for node i, that is, the predicted probability of class k.
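A minimal sketch of this loss is given below: it applies a softmax to the output of the last layer and sums the negative log-probability of the true class over the labelled nodes only. The function names and the toy inputs are assumptions; whether the loss is summed or averaged over nodes is an implementation choice, and this sketch sums.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def masked_cross_entropy(logits, labels, labelled_idx):
    """Categorical cross entropy summed over the labelled (target) nodes only."""
    probs = softmax(logits[labelled_idx])
    picked = probs[np.arange(len(labelled_idx)), labels]
    return -np.log(picked).sum()

# 5 nodes, 4 classes; only nodes 0 and 2 carry a label.
logits = np.zeros((5, 4))
labelled_idx = np.array([0, 2])
labels = np.array([1, 3])
loss = masked_cross_entropy(logits, labels, labelled_idx)
```

With uniform predictions over 4 classes, each labelled node contributes ln 4 to the loss.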

e-RGCN
In GCNs [KW16], the node features are represented by a matrix $X \in \mathbb{R}^{N \times F}$, where N is the number of nodes and F is the number of node features. When node features are not available, one-hot vectors can be used instead. An alternative approach would be to represent the features with continuous values $E \in \mathbb{R}^{N \times D}$, where D is the node embedding dimension.
In the GCN setting [KW16], using one hot vectors is functionally very similar to using embedding vectors: the multiplication of the one hot vector by the first weight matrix W , essentially selects a row of W , which then functions as an embedding of that node. In the RGCN setting, the same holds, but we have a separate weight matrix for each relation, so using one-hot vectors is similar to defining a separate node embedding for each relation. When we feed the RGCN a single node embedding for each node instead, we should increase the embedding dimension D to compensate.
Initial experiments showed that this approach slightly underperforms the one-hot approach on the benchmark data used in [SKB + 18]. After some experimentation, we ended up with the following model, which we call the embedding-RGCN (e-RGCN). Its message passing rule is described in Equation 11. The weight matrix is restricted to a diagonal matrix (with all off-diagonal elements fixed to zero). The node embeddings are first multiplied by this diagonal weight matrix, and then the product is multiplied by the adjacency matrix.
$$ H = \sigma\left( \sum_{r=1}^{R} A_r E \, \mathrm{diag}(w_r) \right) \tag{11} $$
Here, E contains the node embeddings, which are broadcast across all the relations R, and w_r is a vector containing the weight parameters for relation r. diag(·) is a function that takes a vector $L \in \mathbb{R}^{Q}$ as input and outputs a diagonal matrix $N \in \mathbb{R}^{Q \times Q}$, whose diagonal elements are the elements of the original vector L.
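Because multiplying by a diagonal matrix only rescales each embedding dimension, the e-RGCN mixing step can be written without ever building the diagonal matrices. The sketch below checks this on toy data; the names and sizes are assumptions, and the activation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
num_rel, n, d = 3, 4, 5
adj = (rng.random((num_rel, n, n)) < 0.4).astype(float)  # one A_r per relation
emb = rng.normal(size=(n, d))                            # node embeddings E
w = rng.normal(size=(num_rel, d))                        # one weight vector w_r per relation

# Literal form: sum over r of A_r E diag(w_r).
literal = sum(adj[r] @ emb @ np.diag(w[r]) for r in range(num_rel))

# Equivalent form: E diag(w_r) is just a column-wise rescaling of E.
compact = np.einsum('rnm,md,rd->nd', adj, emb, w)

params_diagonal = num_rel * d  # parameters with diagonal weights
params_full = num_rel * d * d  # parameters with full D x D matrices per relation
```

The two forms agree, and the diagonal restriction reduces the per-layer weight count from R × D × D to R × D.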
Using a diagonal weight matrix improves parameter efficiency, while enabling distinction between relations. We created a new node classification model, where the first layer is an e-RGCN layer and the second layer is a standard RGCN (without regularisation) that predicts class probabilities. This model provides competitive performance with the RGCN, using only 8% of the parameters.

Node Classification Experiments
All node classification models were trained following Schlichtkrull et al. using full-batch gradient descent for 50 epochs. However, we used 100 epochs for e-RGCN on the AM dataset. Glorot uniform initialisation [GB10] was used to initialise parameters with a gain of √2, corresponding to the ReLU activation function. Kaiming initialization [HZRS15] was used to initialise the node embeddings in the e-RGCN node classification model. Basis decomposition was used for the RGCN-based node classification. All RGCN and e-RGCN models, except for the e-RGCN on the AM dataset, were trained using a GPU.

Datasets
We reproduce the node classification experiments using the benchmark datasets that were used in the original paper: AIFB [BS07], MUTAG [DLdCD + 91], BGS [dV13] and AM [DBWVG + 12]. We also evaluate e-RGCN on the same datasets. AIFB is a dataset that describes a research institute in terms of its staff, research group, and publications. AM (Amsterdam Museum) is a dataset containing information about artifacts in the museum. MUTAG is derived as an example dataset for the machine learning model toolkit about complex molecules. The BGS (British Geological Survey) dataset contains information about geological measurements in Great Britain.
Since the messages in a two-layer RGCN cannot propagate further than two hops, we can prune away the unused nodes from the graph. This significantly reduces the memory usage for large datasets (BGS & AM) without any performance deterioration. To the best of our knowledge, this was first implemented in the DGL library [WZY + 19]. For the AM and BGS datasets, the graph was pruned by removing any nodes that are more than 2 hops away from the target nodes. Pruning significantly reduces the number of entities, relations and edges and thus lowers the memory consumption of the node classification model, making it feasible to train it on a GPU with 12GB of memory. Table 1 shows the statistics for the node classification datasets. We use the same training, validation and test split as in [SKB + 18].
Table 2 shows the results of the node classification experiments in comparison to the original RGCN paper. Torch-RGCN achieves similar performance to the TF-RGCN reported in [SKB + 18]. We observed that the training times of the node classification models largely depended on the size of the graph dataset. The CPU training times varied from 45 seconds for the AIFB dataset to 20 minutes for the AM dataset. Since our implementation makes use of GPUs, we were able to run the Torch-RGCN models on a GPU and train the model within a few minutes.
Figure 6.1: A schematic visualisation of link prediction models. Edges are coloured (red and green) to indicate different edge labels. RGCN-based encoders can be seen as an extension to traditional link predictors, such as DistMult [YYH + 14] and TransE [BUGD + 13]. RGCNs enrich the node representations used by these models by mixing them along the edges of the graph, before applying the score function. In this case, removing the RGCN layers and the upstream edge sampling recovers the original DistMult. In the last step, the vectors corresponding to entities and relation are element-wise multiplied and the product x is summed. For a given triple (s, r, o), the model produces a single scalar value x which indicates how likely the triple is to be true.
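The graph pruning used for the larger node classification datasets (keeping only what lies within two hops of the target nodes) can be sketched with a breadth-first search. This is an illustrative reading of the procedure, not the code used in the experiments: the function name is hypothetical, and edges are traversed in both directions on the assumption that the RGCN layer adds inverse edges.

```python
from collections import deque

def prune_to_k_hops(triples, targets, k=2):
    """Keep only nodes within k hops of any target node, and the edges among them."""
    neighbors = {}
    for s, _, o in triples:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)  # inverse direction

    # Multi-source breadth-first search starting from all target nodes.
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for nb in neighbors.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)

    kept = [(s, r, o) for (s, r, o) in triples if s in dist and o in dist]
    return kept, set(dist)

# Chain 0 -> 1 -> 2 -> 3 -> 4 with node 0 as the only target.
chain = [(0, 0, 1), (1, 0, 2), (2, 0, 3), (3, 0, 4)]
kept_triples, kept_nodes = prune_to_k_hops(chain, targets=[0], k=2)
```

On the toy chain, nodes 3 and 4 are more than two hops from the target and are dropped together with their edges.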

Downstream Task: Link Prediction
We now turn towards the second task performed in the original paper, multi-relational link prediction. The aim is to learn a scoring function that assigns true triples high scores and false triples low scores [BUGD + 13], with the correct triple ranking the highest. After training, the model can be used to predict which missing triples might be true, or which triples in the graph are likely to be incorrect.

Model Setup
We follow the procedure outlined by Schlichtkrull et al. Figure 6.1 shows a schematic representation of the link prediction model as described in the original paper. During training, traditional link predictors [BUGD + 13, YYH + 14] simultaneously update node representations and learn a scoring function (decoder) that predicts the likelihood of the correct triple being true. RGCN-based link predictors introduce additional steps upstream.
We begin by sampling 30,000 edges from the graph using an approach called neighborhood edge sampling (see Section 6.1.2). Then, for each triple we generate 10 negative training examples, giving a batch size of 330,000 edges in total. Node embeddings $E \in \mathbb{R}^{N \times D}$ are used as input for the RGCN (see footnote 13). The RGCN performs message passing over the sampled edges and generates mixed node embeddings. Finally, the DistMult scoring function [YYH + 14] uses the mixed node embeddings to compute the likelihood of a link existing between a pair of nodes. For a given triple (s, r, o), the model is trained by scoring all potential combinations of the triple using the function:
$$ f(s, r, o) = \sum_{i} \left( e_s \odot r \odot e_o \right)_i $$
Here, e_s and e_o are the node embeddings of entities s and o, generated by the RGCN encoder, and ⊙ denotes element-wise multiplication. r is a low-dimensional vector for relation r, which is part of the DistMult decoder. As Schlichtkrull et al. highlighted in their work, the DistMult decoder can be replaced by any Knowledge Graph scoring function. We refer the reader to [RBG19] and [RBF + 21] for a comprehensive survey of state-of-the-art KGE models.
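The DistMult score amounts to an element-wise product of three vectors followed by a sum. A small sketch with hand-picked toy embeddings (all values and names are assumptions) is:

```python
import numpy as np

def distmult_score(nodes, relations, triples):
    """Score each (s, r, o) triple as the sum over i of e_s[i] * r[i] * e_o[i]."""
    s, r, o = triples[:, 0], triples[:, 1], triples[:, 2]
    return (nodes[s] * relations[r] * nodes[o]).sum(axis=1)

nodes = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [0.0, 1.0]])      # one embedding per entity
relations = np.array([[1.0, 1.0],
                      [2.0, 0.0]])  # one vector per relation
triples = np.array([[0, 0, 1],
                    [1, 1, 2]])

scores = distmult_score(nodes, relations, triples)
reversed_scores = distmult_score(nodes, relations, triples[:, ::-1])
```

Because the product is symmetric in subject and object, swapping s and o leaves the score unchanged. This is a known property of DistMult rather than of this sketch.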
Similar to previous work on link prediction [YYH + 14], the model is trained using negative training examples. For each observed (positive) triple in the training set, we randomly sample 10 negative triples (i.e. we use a negative sampling rate of 10). These samples are produced by randomly corrupting either the subject or the object of the positive example (the probability of corrupting the subject is 50%). Binary cross entropy loss (see footnote 14) is used as the optimization objective to push the model to score observable triples higher than the negative ones:
$$ \mathcal{L} = - \sum_{(s, r, o, y) \in T} \Big[ y \log l\big(f(s, r, o)\big) + (1 - y) \log\Big(1 - l\big(f(s, r, o)\big)\Big) \Big] $$
where T is the total set of positive and negative triples, l is the logistic sigmoid function, and y is an indicator set to y = 1 for positive triples and y = 0 for negative triples. f(s, r, o) includes entity embeddings from the RGCN encoder and relation embeddings from the DistMult decoder.
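The corruption scheme and the loss can be sketched as follows. The helper names are hypothetical, and the sketch makes two simplifying assumptions: replacement entities are drawn uniformly, and a corrupted triple is not checked against the set of known true triples.

```python
import numpy as np

def corrupt(triples, num_nodes, neg_rate, rng):
    """Create neg_rate negatives per positive by replacing the subject or the object."""
    neg = np.repeat(triples, neg_rate, axis=0)
    corrupt_subject = rng.random(len(neg)) < 0.5  # 50% subject, 50% object
    random_nodes = rng.integers(0, num_nodes, size=len(neg))
    neg[corrupt_subject, 0] = random_nodes[corrupt_subject]
    neg[~corrupt_subject, 2] = random_nodes[~corrupt_subject]
    return neg

def bce_loss(scores, labels):
    """Binary cross entropy on raw scores: log(1 + exp(x)) - y * x, summed."""
    return np.sum(np.logaddexp(0.0, scores) - labels * scores)

rng = np.random.default_rng(0)
positives = np.array([[0, 0, 1], [2, 1, 3], [4, 0, 5]])
negatives = corrupt(positives, num_nodes=6, neg_rate=10, rng=rng)

batch = np.concatenate([positives, negatives])
labels = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
loss_at_zero = bce_loss(np.zeros(len(batch)), labels)
```

With a negative sampling rate of 10, every positive contributes 11 triples to the batch, which is how 30,000 sampled edges become 330,000 training examples. The `logaddexp` form is the usual numerically stable way of applying the sigmoid and the logarithm in one step.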

Edge Dropout
In their work, Schlichtkrull et al. [SKB + 18] apply edge dropout to the link prediction model, which acts as an additional regularisation method. This involves randomly selecting edges and removing them from a graph. As described in Section 3, for every edge in the graph, inverse edges A_r' and self-loops A_s are added within the RGCN layer. Dropping edges after this step poses a potential data leakage issue, because inverse edges and self-loops of dropped edges will be included in the message passing step and thus invalidate the model performance. To circumvent this issue, edges are dropped from a graph before feeding it into the RGCN. Edge dropout is applied such that the dropout rates on the self-loops R_s are lower than for the data edges R and inverse edges R'. One way to think about this is that it ensures that the message from a node to itself is prioritised over incoming messages from neighboring nodes. In our implementation, we separate out A_s from A and then apply the different edge dropout rates separately. The edge dropout is performed before row-wise normalising A.

Edge Sampling
Graph batching is required for training the RGCN-based link prediction model, because it is computationally expensive to perform message passing over the entire graph due to the large number of hidden units used for the RGCN. Schlichtkrull et al. sample an edge with a probability proportional to its weight. In uniform edge sampling, equal weights are given to all the edges. However, in neighborhood edge sampling, initial weights are proportional to the node degrees of the vertices connected to the edges. Then, as edges are being sampled, the weights of their neighboring edges are increased, and this increases the probability of these edges being sampled [KW18]. This sampling approach benefits link prediction because the neighboring edges provide context information to deduce the existence of a relation between a given pair of entities. In contrast, uniform edge sampling assumes that all edges are independent of each other, which is not applicable to Knowledge Graphs.
13 In the original implementation, the embeddings are implemented as an affine operation (i.e. biases are included) and they are ReLU activated. We reproduce this behaviour, but it is not clear whether this gives any benefits over simple, unactivated embeddings (as used in the e-RGCN).
14 Schlichtkrull et al. multiply their loss by $\frac{1}{(1 + \omega)|\hat{E}|}$, where ω is the negative sampling rate and |Ê| is the number of edges sampled. We leave this term out of our implementation, because it is a constant and thus it would not affect the training.

Link Prediction Experiments
As in the original paper, the models are evaluated using Mean Reciprocal Rank (MRR) and Hits@k (k = 1, 3 or 10). The Torch-RGCN model was trained for 7,000 epochs. We monitored the training of our models by evaluating them every 500 epochs. Schlichtkrull initialisation (see Appendix ) was used to initialise all parameters in the link prediction models [SKB + 18] and in our reproductions. Schlichtkrull et al. [SKB + 18] trained their models on the CPU. Our Torch-RGCN implementation and c-RGCN models run on the GPU. Early stopping was not used. We used the hyperparameters described in [SKB + 18].
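Both metrics are simple functions of the rank that the model assigns to the correct entity for each test query, with rank 1 being best. A minimal sketch with made-up ranks (the function names and the example values are assumptions) is:

```python
import numpy as np

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test queries."""
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct answer is ranked in the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

ranks = [1, 2, 4]  # illustrative ranks for three test queries
mrr = mean_reciprocal_rank(ranks)
hits1, hits3, hits10 = (hits_at_k(ranks, k) for k in (1, 3, 10))
```

For these three queries the MRR is (1 + 1/2 + 1/4) / 3, and Hits@1, Hits@3 and Hits@10 are 1/3, 2/3 and 1 respectively.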

Datasets
To evaluate link prediction, Schlichtkrull et al. [SKB + 18] used several benchmark datasets, among them WordNet (WN18) [BUGD + 13]. We only use WN18. WN18 is a subset of WordNet, a graph which describes the lexical relations between words. To check our reproduction, we also used FB-Toy [RBG19], which was not in the original paper. FB-Toy is a dataset consisting of a subset of FB15k. In Table 3, we show the statistics corresponding to these graph datasets.

Details
For link prediction, a single-layer RGCN with basis decomposition is used for WN18, and a two-layer RGCN with block diagonal decomposition is used for FB-Toy. An L2 regularisation penalty of 0.01 is applied to the scoring function. To compute the filtered link prediction scores, triples that occur in the training, validation and test sets are filtered. The Torch-RGCN model is trained on WN18 using batched training, in which 30,000 neighboring edges are sampled at every epoch. An edge dropout rate of 0.2 is applied for self-loops and 0.5 for data edges and inverse edges. Edges are randomly sampled from a Knowledge Graph using the neighborhood edge sampling approach. In our reproduction attempts, we have found that this approach enables the model to perform better than uniform edge sampling. The RGCN is initialised using Schlichtkrull normal initialisation (see Appendix ), while the DistMult scoring function is initialised using standard normal initialisation.
We follow the standard protocol for link prediction. See [RBG19] for more details. Some of the hyperparameters used in training were not detailed in [SKB + 18]. To the furthest extent possible, we followed the same training regime as the original paper and code base, and we recovered missing hyperparameters. The hyperparameters for all experiments are provided in documented configuration files on https://github.com/thiviyanT/torch-rgcn.

Results
We verify the correctness of our implementation by reproducing the performance on a small dataset (FB-Toy) and by comparing the statistics of various intermediate tensors in the implementation with those of the reference implementation. We selected a number of intermediate tensors in the link prediction model in our implementation and the original implementation. Then, we measured the statistics of the intermediate tensors.
In Table 4 we report the statistics of the intermediate tensors for the TF-RGCN (original model) and Torch-RGCN (our implementation) link prediction models. These results suggest that the parameters used by both models came from a similar distribution, which supports the conclusion that our implementation is a one-to-one replication. After confirming that our reproduction is correct, we attempted to replicate the link prediction results on the WN18 dataset. Table 5 shows the results of the link prediction experiments carried out. The scores obtained by the Torch-RGCN implementation are lower than those of the TF-RGCN model and therefore, we were unable to duplicate the exact results reported in the original paper [SKB + 18]. We believe that the discrepancies between the scores are caused by differences in the hyperparameter configurations. The exact hyperparameters that Schlichtkrull et al. used in their experiments were not available.
Despite our best efforts, we were unable to reproduce the exact link prediction results reported in the original paper [SKB + 18]. This is due to the multitude of hyperparameters 19 , not all of which are specified in the paper, and the long time required to train the model, with runtimes of several days. We did, however, manage to show the correctness of our implementation using a small-scale experiment. We consider this an acceptable limitation of our reproduction because of the current training time of the RGCN compared to that of state-of-the-art KGE models [RBG19]. A DistMult embedding model can be trained in well under an hour on any of the standard benchmarks and, as shown in [RBG19], outperforms the RGCN by a considerable margin. Thus, the precise link prediction architecture described in [SKB + 18] is less relevant in the research landscape.
Figure 6.2: A schematic visualisation of the c-RGCN based link prediction model. Here, the encoder has a bottleneck architecture. f θ and g φ are linear layers. Prior to message passing, f θ compresses the input node embeddings, and then g φ projects the mixed node embeddings back up to their original dimensions. The red arrow indicates the residual connection. All edges from the training set are used (i.e. edge sampling is not required).

c-RGCN
The link prediction architecture presented in [SKB + 18] does not represent a realistic competitor for the state of the art and is very costly to use on large graphs. Furthermore, a problem with the original RGCN link predictor [SKB + 18] is that we need high dimensional node representations to be competitive with traditional link predictors, such as DistMult [YYH + 14], but the RGCN is expensive for high dimensions. However, we do believe that the basic idea of message passing is worth exploring further in a link prediction setting.
To show that there is promise in this direction, we offer a simplified link prediction architecture that uses a fraction of the parameters of the original implementation [SKB + 18]. This variant places a bottleneck architecture around the RGCN in the link prediction model: the embedding matrix E is projected down to a lower dimension, C, and the RGCN then performs message propagation using the compressed node embeddings. Finally, the output is projected back up to the original dimension, D, and the DistMult score is computed from the resulting high-dimensional node representations. We call this encoding network the compression-RGCN (c-RGCN). Equations 14 and 15 show the message passing rule for the first and second layer of the c-RGCN encoder, respectively. We selected a node embedding size of 128 and compressed it to a vector dimension of 16. We also add a residual connection by including E in the second layer of the c-RGCN. The residual connection allows the model, in principle, to revert to DistMult if the message passing adds no value: if all RGCN weights are set to 0, we recover the original DistMult.
where $g(X) = XW_\phi + b$ with $W_\phi \in \mathbb{R}^{E \times C}$. As shown in Table 5, the c-RGCN does not perform much worse than the original implementation. However, it is much faster, and it is memory efficient enough for full-batch evaluation on the GPU. There is a clear trade-off between the compression size of the node embeddings and the link prediction performance. While this result is far from state of the art, it serves as a proof of concept that there may be ways to configure RGCN models for a better trade-off between performance and efficiency.
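The bottleneck idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Torch-RGCN implementation: it uses dense row-normalised adjacency matrices, full (undecomposed) relation weights, two message passing layers in the compressed space, and a residual connection added after the up-projection. All function and parameter names (`c_rgcn_encode`, `W_down`, `W_up`, and so on) are hypothetical.

```python
import numpy as np


def rgcn_layer(H, adjs, rel_weights):
    """One relational message passing step: ReLU(sum_r A_r H W_r)."""
    out = sum(A @ H @ W for A, W in zip(adjs, rel_weights))
    return np.maximum(out, 0.0)


def c_rgcn_encode(E, adjs, W_down, b_down, layer_weights, W_up, b_up):
    """Bottleneck encoder: compress, pass messages, expand, add residual.

    E:             (N, D) node embedding matrix.
    adjs:          list of (N, N) row-normalised adjacency matrices,
                   one per relation (assumed dense for simplicity).
    layer_weights: one list of (C, C) relation weights per layer.
    """
    H = E @ W_down + b_down              # f: project from D down to C
    for rel_weights in layer_weights:    # message passing in C dimensions
        H = rgcn_layer(H, adjs, rel_weights)
    return E + H @ W_up + b_up           # g: project back to D, plus residual


def distmult_score(X, r_diag, s, o):
    """DistMult score of (s, r, o): sum_k X[s, k] * r[k] * X[o, k]."""
    return float(np.sum(X[s] * r_diag * X[o]))
```

In this sketch, setting the relation weights and the up-projection bias to zero makes the encoder return E unchanged, so the scores reduce to those of plain DistMult. With an embedding size of 128 compressed to 16, a full relation weight matrix in the compressed space holds 16 × 16 = 256 entries instead of 128 × 128 = 16,384.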

Discussion
We now discuss the implications for the use of the RGCN model, the performance of the new variants and the lessons learned from this reproduction.

Implications for RGCN usage
We believe that Relational Graph Convolutional Networks are still very relevant because they are among the simplest message passing models and are a good starting point for exploring machine learning on Knowledge Graphs.
RGCNs clearly perform well on node classification tasks because the task of classifying nodes benefits from message passing. This means that a class for a particular node is selected by reasoning about the classes of neighboring nodes. For example, a researcher can be categorised into a research domain by reasoning about information regarding their research group and close collaborators. Traditional Knowledge Graph Embedding (KGE) models, such as TransE and DistMult, lack the ability to perform node classification.
While the RGCN is a promising framework, we found that, in its current setting, the link prediction model proposed by Schlichtkrull et al. is not competitive with the current state of the art [RBG19]: it is considerably more expensive to train and achieves lower performance. In our paper, we clarify that RGCN-based link predictors are extensions of KGE models [RBG19]; thus, training an RGCN to predict links will always be more expensive than using a state-of-the-art KGE model. RGCN-based link predictors take several days to train, while state-of-the-art relation models run in well under an hour [RBG19].
To aid the usage of the RGCN, we presented two new configurations of the RGCN:
e-RGCN. We propose a new variant of the node classification model which uses significantly fewer parameters by exploiting a diagonal weight matrix. Our results show that it performs competitively with the model from [SKB + 18].
c-RGCN. We also present a proof-of-concept model that performs message passing over compressed graph inputs and thus improves the parameter efficiency for link prediction. The c-RGCN has several advantages over the regular RGCN link predictor: 1) it does not require sampling edges, because it is able to process the entire graph in full batch, 2) it takes a fraction of the time needed to train an RGCN link predictor, 3) it uses fewer parameters, and 4) it is straightforward to implement. Although the results for the c-RGCN are not as strong, this sets a path for further development towards efficient message passing models for relational graphs.
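To make the parameter saving of a diagonal weight matrix concrete, the following sketch shows that multiplying by a diagonal matrix is an element-wise scaling, and compares weight counts for full and diagonal relation weights. It only illustrates the general idea; the function names are hypothetical and the actual e-RGCN layer may differ in its details.

```python
import numpy as np


def diagonal_transform(H, w):
    """Apply the diagonal weight matrix diag(w) to node features H (N x D)."""
    return H * w  # broadcasting; equivalent to H @ np.diag(w)


def weight_counts(dim, num_relations):
    """Per-layer weight counts: full (R x D x D) versus diagonal (R x D)."""
    return num_relations * dim * dim, num_relations * dim


rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
w = rng.normal(size=3)
same = np.allclose(diagonal_transform(H, w), H @ np.diag(w))  # True
```

For a hypothetical layer with dimension D and R relations, the count drops from R × D × D to R × D weights, a factor of D.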

Reproduction
Evolving technologies pose several challenges for the reproducibility of research artifacts. These include frequent updates to existing frameworks, such as PyTorch and TensorFlow, which often break backward compatibility. We were in a strong position to execute this reproduction: 1) an author of this paper also worked on the original paper, 2) we contacted one of the lead authors of the original paper, who was very responsive, and 3) we were able to run the original source code 20 inside a virtual environment. Nevertheless, we found it considerably challenging to make a complete reproduction. To explain why, and to help avoid such situations in the future, we briefly outline the lessons we learned during the reproduction.
Parameter Statistics. There were discrepancies between the description of the link prediction model in [SKB + 18] and the source code. The source code reproduces values similar to the MRR scores reported in [SKB + 18]. Thus, to reproduce the results, we had to investigate the source code in depth. Using the original source, we relied on comparing the parameter statistics and tensor sizes at various points in both models. Since these statistics help to verify the correctness of an implementation, we believe this is a useful practice for aiding reproduction. For complex models with long runtimes, an overview of descriptive statistics of the parameter and output tensors for the first forward pass can help to check an implementation without running full experiments. We are publishing the statistics for intermediate products that we obtained for the link prediction models (see Table 4).
Small dataset. We found that the link prediction datasets used by [SKB + 18] were large and thus impractical for debugging the RGCN, because training on large graphs is costly. Using a smaller dataset (FB-Toy [SKB + 18]) enables quicker testing with less memory consumption. Thus, we report link prediction results on the FB-Toy dataset [RBG19] (Table 6.2).
Training times. The training times were variable and strongly depended on the size of the graph, the number of relations and the number of epochs. Schlichtkrull et al. reported the computational complexity, but not practical training times. It turns out that this is an important source of uncertainty in verifying whether re-implementations are correct. We measured the runtimes, which include training the model and using the trained model for inference. For 7000 epochs, the link prediction runtimes for Torch-RGCN and c-RGCN on the WN18 dataset are 2407 and 53 minutes, respectively. Node classification experiments took a few minutes to complete, because they only required 50-100 epochs. We encourage authors to report such concrete training times.
Hyperparameter Search. We found that the hyperparameters reflect the complexity of the individual datasets. For example, AIFB, the smallest dataset, was not prone to overfitting, whereas the larger AM dataset required basis decomposition and a reduced hidden layer size. For link prediction, we were unable to identify the optimum hyperparameters for WN18, FB15k and FB15k-237 due to the sheer size of the hyperparameter space and the long training times. We provide a detailed list of the hyperparameters we use in our reproduction. While such reporting is becoming more common in the literature, our experience serves as further evidence of its importance.
Other factors. We still faced the common challenges in software reproduction that others have long noted [FvEP + 13], including missing dependencies, outdated source code, and changing libraries. An additional challenge with machine learning models is that hardware (e.g. GPUs) can now also affect the performance of the model itself. For instance, while we were able to run the original link prediction code in TensorFlow 1.4, the models no longer seemed to benefit from the available modern GPUs. Authors should be mindful that even if legacy code remains executable for a long time, executing it efficiently on modern hardware may stop being possible much sooner. Here too, reporting results on small-scale experiments can help to test reproductions without the benefit of hardware acceleration.

Conclusion
We have presented a reproduction of Relational Graph Convolutional Networks and, using the reproduction, we provide a friendly explanation of how message passing works. While message passing is evidently useful for node classification, our findings also show that RGCN-based link predictors are currently too costly to be a practical alternative to the state of the art. However, we believe that improving the parameter efficiency of RGCNs could make them more accessible. We present two novel configurations of the RGCN: 1) e-RGCN, which introduces node embeddings into the RGCN using fewer parameters than the original RGCN implementation, and 2) c-RGCN, a proof-of-concept model which compresses node embeddings and thus speeds up link prediction. These configurations provide the foundation for future work. We believe that the techniques proposed in this paper may also be important for others implementing other message passing models. Lastly, our new implementation of the RGCN in PyTorch, Torch-RGCN, is made openly available to the community. We hope that it can serve the community in the use, development and research of this interesting model for machine learning on Knowledge Graphs.