Graph Barlow Twins: A self-supervised representation learning framework for graphs

The self-supervised learning (SSL) paradigm is an essential exploration area, which tries to eliminate the need for expensive data labeling. Despite the great success of SSL methods in computer vision and natural language processing, most of them employ contrastive learning objectives that require negative samples, which are hard to define. This becomes even more challenging in the case of graphs and is a bottleneck for achieving robust representations. To overcome such limitations, we propose a framework for self-supervised graph representation learning, Graph Barlow Twins, which utilizes a cross-correlation-based loss function instead of negative samples. Moreover, it does not rely on non-symmetric neural network architectures, in contrast to the state-of-the-art self-supervised graph representation learning method BGRL. We show that our method achieves results as competitive as the best self-supervised and fully supervised methods, while requiring fewer hyperparameters and substantially shorter computation time (ca. 30 times faster than BGRL).


Introduction
Graph representation learning has been intensively studied over the last few years, with various architectures and layers proposed, such as GCN [12], GAT [25] and GraphSAGE [9]. A substantial part of these methods was introduced in the semi-supervised learning paradigm, which requires the existence of expensive labeled data (e.g. node labels or whole-graph labels). To overcome this, the research community has been exploring unsupervised learning methods for graphs. This resulted in a variety of different approaches, including: shallow ones (DeepWalk [17], Node2vec [7], LINE [22]) that ignore feature attribute richness, focusing only on structural graph information; and graph neural network methods (DGI [26], GAE, VGAE [11]) that build representations upon node or graph features, achieving state-of-the-art performance at the time.
Recently, the self-supervised paradigm has become the most rapidly developing branch of unsupervised graph representation learning, attracting considerable interest and strenuous research effort towards better results. The most prominent methods were developed around the contrastive learning approach, such as GCA [30], GraphCL [27], GRACE [29] or DGI [26]. Although contrastive methods are popular in many machine learning areas, including computer vision and natural language processing, their fundamental limitation is the need for negative samples. Consequently, the sampling procedure for negative examples highly affects the overall quality of the embeddings. For images or texts, the definition of negative samples might not seem that problematic, but in the case of graphs there is no clear intuition. For instance, what is the negative counterpart of a particular node in the graph: should it be a node that is not a direct neighbor, or a node in a different graph component? There are multiple options available, but the right choice depends strictly on the downstream task.
Researchers have already tackled the problem of building so-called negative-sample-free methods. In computer vision, successful results were obtained with methods such as BYOL [6], SimSiam [3] or Barlow Twins [28]. These models utilize siamese network architectures with various techniques, like gradient stopping, asymmetry, or batch and layer normalization, to prevent collapsing to trivial solutions. Based on BYOL, [23] proposed the Bootstrapped Representation Learning on Graphs (BGRL) framework. It utilizes two graph encoders: an online and a target one. The former passes its embedding vectors to a predictor network, which tries to predict the embeddings from the target encoder. The loss is measured as the cosine similarity, and the gradient is backpropagated only through the online network (gradient stopping mechanism). The target encoder is updated using an exponential moving average of the online encoder weights. Such a setup has been shown to produce graph representation vectors that achieve state-of-the-art node classification performance on various benchmark datasets. However, the asymmetry between the network twins (the predictor network, gradient stopping, and the moving average of weight updates) makes the method conceptually complex.
Employing a symmetric network architecture would seem more intuitive and reasonable. Hence, in this paper, we propose Graph Barlow Twins (G-BT), a self-supervised graph representation learning framework, which computes the embedding cross-correlation matrix of two distorted views of a single graph. The approach was first introduced in image representation learning as the Barlow Twins model [28] but was not designed to handle graphs. The utilized network architecture is fully symmetric and does not need any special techniques to build non-trivial embedding vectors. The distorted graph views are passed through the same encoder, which is trained using the backpropagated gradients (in a symmetrical manner).
Our approach differs from the previous application of the Barlow Twins cost function in several respects: we adapt the Barlow Twins loss to graph data and select an appropriate encoder for this setting; we use graph-specific data augmentation functions (and their hyperparameters); we simplify the network structure by removing parts that are unnecessary in the graph case; and we provide experimental results for the batched scenario.
Our main contributions can be summarized as follows:
1. We propose a self-supervised graph representation learning framework, Graph Barlow Twins. It is built upon the recently proposed Barlow Twins loss, which utilizes the embedding cross-correlation matrix of two distorted views of a graph to optimize the representation vectors. Our framework neither requires negative samples (as opposed to most other self-supervised approaches) nor introduces any kind of asymmetry into the network architecture (like the state-of-the-art BGRL). Moreover, our architecture converges substantially faster than all other state-of-the-art methods.
2. We evaluate our framework in node classification tasks: (1) for 5 smaller benchmark datasets in a transductive setting, (2) using the ogb-arxiv dataset from the Open Graph Benchmark (also in the transductive setting), (3) for multiple graphs in the inductive setting using the PPI (Protein-Protein Interaction) dataset, and finally (4) for the large-scale graph dataset ogb-products in the inductive setting. We use both GCN-based encoders as well as a GAT-based one. We observe that our method achieves results analogous to state-of-the-art methods.
3. We ensure reproducibility by making the code of both our models and the experimental pipeline available at: https://github.com/pbielak/graph-barlow-twins.

Related works
Self-supervised learning. The idea of self-supervised learning (SSL) has a long history. Introduced in the early work of Schmidhuber [19], it now has more than 30 years of exploration and research behind it. Recently, self-supervised learning was rediscovered and found broad interest, especially in computer vision and natural language processing. One of the most prominent SSL methods for image representation learning, Bootstrap Your Own Latent (BYOL) [6], performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks. It relies on two neural networks that interact and learn from each other. From an augmented view of an image, it trains the first network to predict the target network's representation of the same image under a different view. At the same time, it updates the second network with a slow-moving average of the first network. Another approach to image representation SSL uses simple siamese networks, namely SimSiam [3]. It achieves comparable results while demanding neither negative samples, large batches, nor momentum encoders. The authors highlight collapsing solutions for the loss and structure, but show that a stop-gradient operation plays an essential role in preventing them. A recent method, Barlow Twins [28], advances the SSL field with a new objective function that naturally avoids collapse: it measures the cross-correlation matrix between the outputs of two twin, identical networks fed with distorted versions of a sample, and makes it as close to the identity matrix as possible.
Representations of distorted versions of samples are then expected to be similar, reducing the redundancy between them. What differentiates the method is that it does not require large batches or asymmetry between the network twins. It outperforms previous methods on ImageNet for semi-supervised classification.
Graph representation learning. Representation learning has also spread to other domains. The graph embedding problem has attracted much attention from the research community worldwide in recent years. Plenty of methods have been developed, each focused on a different aspect of network embeddings, such as proximity, structure, attributes, learning paradigm or scalability. There exist many shallow methods, among others DeepWalk [17], Node2vec [7] or LINE [22], that rely on simple graph encoding through random walks or on encoder-decoder objectives that optimize first- and second-order node similarity. More complex graph neural networks, such as GCN [12] or GraphSAGE [9], implement the basic encoder algorithm with various neighborhood aggregations. Extending these, the graph attention network GAT [25] leverages masked self-attentional layers to address the shortcomings of graph convolutions and their troublesome approximations.
Self-supervised graph representation learning. Inspired by the success of contrastive methods in vision and NLP, these procedures were also adapted to graphs. The early DGI [26] employs a GNN to learn node embeddings, obtains the graph embedding via a readout function, and maximizes the mutual information between node embeddings and the graph embedding by discriminating nodes in the original graph from nodes in a corrupted graph. GCA [30] studied various augmentation procedures. GRACE [29] creates two augmented versions of a graph, pulls together the representations of the same node in both graphs, and pushes apart the representations of every other node. The recent GraphCL [27] method is another representative approach using contrastive learning. All the previous methods use negative sampling approaches for embedding optimization, yet such a setting has high complexity. To overcome this, BGRL [23] proposed an approach that does not rely on negative samples. It uses two kinds of encoder networks (online and target), introducing a non-intuitive asymmetric pipeline architecture, but provides state-of-the-art SSL results. Moreover, it relies on several techniques to prevent trivial solutions (gradient stopping, momentum encoder). A concurrent approach to BGRL is DGB [2].

Proposed framework
Motivated by the emerging self-supervised learning paradigm and its recent applications in graph representation learning (BGRL [23]), we propose Graph Barlow Twins, a framework that builds node embeddings using a symmetric network architecture and an empirical cross-correlation-based loss function.
We start with an attributed graph G. Using carefully selected augmentation functions A_{p_X, p_A}(·), we compute two augmented versions of this graph, i.e., G^(1) and G^(2). Each time the augmentation function is applied, it yields a different graph version. This is due to the fact that it is performed using parameters sampled according to p_A and p_X. Next, we apply the same encoder network (which is being pretrained using our proposed framework) to both created graphs and obtain two node embedding matrices, Z^(1) and Z^(2). Finally, we compute the empirical cross-correlation matrix C of these node embeddings and compute a loss function which forces the cross-correlation matrix to be as close as possible to the identity matrix. The encoder is trained by backpropagating the gradient of this loss function.
The overall pipeline of our framework is visually shown in Figure 1, and the precise algorithm is presented in Algorithm 1. Let us now describe each framework step in detail.

Graph data. We represent a graph G with nodes V and edges E as the tuple (X, A), where X ∈ R^{|V|×k} is the node feature matrix and k is the feature dimensionality; A ∈ {0, 1}^{|V|×|V|} is the adjacency matrix, such that A_{i,j} = 1 iff (i, j) ∈ E. In the general case, a graph could also have associated edge features or graph-level features, but for simplicity we omit those here. Nevertheless, these could also be used in our framework, as long as the encoder can make use of such features.
Generating graph views via augmentation. Following other works [23,30,27,29], we select two kinds of augmentations, edge dropping and node feature masking, and generate two views of the input graph, G^(1) and G^(2) (see Eq. 1). In the edge dropping case A_{p_A}(A), we remove edges according to a generated mask of size |E| (the number of edges in the graph) with elements sampled from the Bernoulli distribution B(1 − p_A), where p_A is the probability of dropping an edge. When it comes to masking node features A_{p_X}(X), we employ a similar scheme and generate a mask of size k, also sampled from the Bernoulli distribution B(1 − p_X), where p_X is the probability of masking node features. Note that we mask node features at the scale of the whole graph, i.e. the same features are masked for each node. Other works apply different augmentation parameters p_X, p_A for each generated view, but as our framework is fully symmetrical, we postulate that it is enough to use the same parameters to generate both augmentations (see Section 5.1).

Algorithm 1: Graph Barlow Twins training. Input: attributed graph G, augmentation function A_{p_X, p_A}(·), encoder network f_θ(·), number of epochs N, learning rate η. Output: trained encoder network f_θ.
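The two augmentation functions described above can be sketched in a few lines. The snippet below is an illustrative, framework-agnostic NumPy version (the function name `augment` and the edge-list representation are ours, not taken from the paper's code):

```python
import numpy as np

def augment(X, edge_index, p_X, p_A, rng):
    """Produce one distorted view (X', E') of a graph (X, edge_index).

    - Edge dropping: each edge is kept with probability 1 - p_A.
    - Feature masking: each feature column is zeroed for ALL nodes
      with probability p_X (graph-level masking, as in the text).
    """
    num_edges = edge_index.shape[1]
    edge_keep = rng.random(num_edges) > p_A   # Bernoulli(1 - p_A) mask of size |E|
    feat_keep = rng.random(X.shape[1]) > p_X  # Bernoulli(1 - p_X) mask of size k
    X_aug = X * feat_keep[None, :]            # same column mask for every node
    return X_aug, edge_index[:, edge_keep]
```

Calling `augment` twice with the same (p_X, p_A) yields the two views G^(1) and G^(2), since fresh masks are sampled on every call.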
Encoder network for node embeddings. The main component of the proposed framework is the encoder network f_θ : G → R^{|V|×d}. It takes an augmented graph as input and computes (in our case) a d-dimensional representation vector for each node in the graph. Note that we do not specify any particular encoder network, and one may even use encoders that construct embeddings for edges or whole graphs. In our experiments, we show the application of GCN [12] and GAT [25] based encoder networks. Both augmented graph views G^(1), G^(2) are passed through the same encoder, resulting in two embedding matrices Z^(1) and Z^(2), respectively. The original Barlow Twins method also specified a projector network (implemented as an MLP) to reduce the high embedding dimensionality (of the ResNet encoder). Our approach eliminates that step, as it uses GNNs with low-dimensional embeddings.

Figure 1: Overview of our proposed Graph Barlow Twins framework. We transform an input graph G using an augmentation function and obtain two views: G^(1) and G^(2). We pass both of them through the same GNN encoder f_θ to compute two embedding matrices Z^(1), Z^(2). We build a loss function such that the embeddings' empirical cross-correlation matrix C is optimized towards the identity matrix.
Loss function. In our work, we propose to use a negative-sample-free loss function to train the encoder network. We first normalize the embedding matrices Z^(1) and Z^(2) along the batch dimension (to a mean of zero and a standard deviation of one), and then we compute the empirical cross-correlation matrix C ∈ R^{d×d}:

C_ij = ( Σ_b Z^(1)_{b,i} Z^(2)_{b,j} ) / ( sqrt(Σ_b (Z^(1)_{b,i})²) · sqrt(Σ_b (Z^(2)_{b,j})²) ),

where b indexes the batch dimension and i, j index the embedding dimensions. Such a setting was originally proposed under the name Barlow Twins. Neuroscientist H. Barlow's redundancy-reduction principle has motivated many methods in both supervised and unsupervised learning [4,20,1].
Recently, [28] has employed this principle to build a self-supervised image representation learning algorithm (we bring this idea to the domain of graph-structured data). The cross-correlation matrix C is optimized by the Barlow Twins loss function L_BT (see Equation 3) to be equal to the identity matrix:

L_BT = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} C_ij²    (3)

The loss is composed of two parts: (I) the invariance term and (II) the redundancy reduction term. The first one forces the on-diagonal elements C_ii to be equal to one, hence making the embeddings invariant to the applied augmentations. The second term optimizes the off-diagonal elements C_ij to be equal to zero, which results in decorrelated components of the embedding vectors.
The λ > 0 parameter defines the trade-off between the invariance and redundancy reduction terms when optimizing the overall loss function. In [24], the authors proposed to use λ = 1/d, which we employ in our experimental setting (see Section 5.3). Otherwise, one can perform a simple grid search to find the best λ value for a particular experiment.
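The loss can be expressed in a minimal NumPy sketch (assuming full-batch embeddings and the batch normalization described above; after that normalization the cross-correlation reduces to Z^(1)ᵀ Z^(2) / N):

```python
import numpy as np

def barlow_twins_loss(Z1, Z2, lam):
    """Barlow Twins loss on two embedding matrices of shape (N, d)."""
    # Normalize along the batch dimension: zero mean, unit std per dimension.
    Z1 = (Z1 - Z1.mean(axis=0)) / Z1.std(axis=0)
    Z2 = (Z2 - Z2.mean(axis=0)) / Z2.std(axis=0)
    N = Z1.shape[0]
    C = (Z1.T @ Z2) / N                           # empirical cross-correlation (d x d)
    invariance = ((1.0 - np.diag(C)) ** 2).sum()  # pull C_ii towards 1
    off_diag = C - np.diag(np.diag(C))
    redundancy = (off_diag ** 2).sum()            # push C_ij (i != j) towards 0
    return invariance + lam * redundancy
```

With `lam = 1/d` this matches the trade-off value used in our experiments.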
Please note that in such a setting the gradient is symmetrically backpropagated through the encoder network. We do not rely on any special techniques, like momentum encoders, gradient stopping, or predictor networks. In preliminary experiments, we also investigated the Hilbert-Schmidt Independence Criterion (due to its relation to the Barlow Twins objective [24]), but we did not observe any performance gain.

Experiments
We evaluate the performance of our model using a variety of popular benchmark datasets, including smaller ones, such as WikiCS, Amazon-Photo or Coauthor-CS, as well as larger ones, such as ogb-arxiv and ogb-products, provided by the Open Graph Benchmark [10]. In this section, we provide an overview of the utilized datasets and the experimental scenario details, as well as a discussion of the results. Overall, we use a similar experimental setup as the state-of-the-art self-supervised graph representation learning method BGRL [23], so we can perform a fair comparison to this method. To track our experiments and provide a simple way to reproduce them, we employ the Data Version Control tool (DVC [13]); for details see Appendix D. We perform all experiments on a TITAN RTX GPU with 24 GB of memory.

Datasets
In the following, we provide brief descriptions of each dataset, including the basic statistics (see Table 1) and the examined dataset split type for the node classification downstream task:

Table 1: Dataset statistics. We use small to medium-sized standard datasets together with the larger ogb-arxiv dataset in the transductive setting. We also evaluate the inductive setting using the ogb-products and PPI (multiple graphs) datasets.

• WikiCS [15] is a network of Computer Science-related Wikipedia articles with edges denoting references between those articles. Each article belongs to one of 10 subfields (classes) and has features computed as averaged GloVe embeddings of the article content. We use the provided 20 train/val/test data splits without any modifications.
• Amazon Computers, Amazon Photos [14] are two networks extracted from Amazon's co-purchase data. Nodes are products and edges denote that these products were often bought together. Each product is described using a Bag-of-Words representation (node features) based on the reviews. There are 10 and 8 product categories (node classes), respectively. For these datasets there are no data splits available, so, similarly to BGRL, we generate 20 random train/val/test splits (10%/10%/80%).
• Coauthor CS, Coauthor Physics are two networks extracted from the Microsoft Academic Graph [21]. Nodes are authors and edges denote a collaboration of two authors. Each author is described by the keywords used in their articles (Bag-of-Words representation; node features). There are 15 and 5 author research fields (node classes), respectively. Similarly to the Amazon datasets, there are no data splits provided, so we generate 20 random train/val/test splits (10%/10%/80%).
• ogb-arxiv is a larger graph from the Open Graph Benchmark [10] with about 170 thousand nodes and about 1.1 million edges. The graph was extracted from the Microsoft Academic Graph [21], where nodes represent Computer Science articles on the arXiv platform and edges denote citations between papers.
The node features are built as word2vec embeddings of the whole article content. There are 40 subject areas a node can be classified into (node label/class). The ogb-arxiv dataset provides a single train/val/test split, so we use it without any modifications, but we retrain the whole framework 20 times.
• Protein-Protein Interaction (PPI) [31] consists of 24 separate graphs. Each node in a single graph represents a protein, described by 50 biological features, and edges denote interactions among those proteins. There are 121 node labels, but note that, contrary to the other cases, PPI uses multilabel classification, i.e. a single protein can be assigned multiple labels. In line with other methods, we provide results in terms of the Micro-F1 score. For PPI, there exists a predefined data split, where 20 graphs are used for training, 2 graphs for validation and 2 graphs for testing. Note that the validation and test graphs are completely unobserved during training, which makes inference more challenging.
• ogb-products is a large-scale graph from the Open Graph Benchmark [10] with about 2.4 million nodes and 62 million edges. The graph was extracted from the Amazon product co-purchasing network. Nodes represent products from the Amazon store and edges denote whether two products were bought together. There are 100 node features, which are obtained from bag-of-words product descriptions reduced using PCA. Each product (node) can be classified into one of 47 categories (node labels). This dataset comes with a predefined data split, so we use it as is.

Evaluation protocol
Self-supervised framework training. We start the evaluation procedure by training the encoder networks using our proposed Graph Barlow Twins framework. In all scenarios, we use the AdamW optimizer [8] with a weight decay of 10^-5. The learning rate is updated using a cosine annealing strategy with a linear warmup period. Our framework uses a single set of augmentation parameters for both graph views; therefore we do not reuse the parameter values reported in other publications, which use two different sets. Instead, we perform a grid search over the range p_A, p_X ∈ {0, 0.1, . . . , 0.5} for 500 epochs with a warmup period of 50 epochs. We implement our experiments using the PyTorch Geometric [5] library. All datasets are available in this library as well. The details about the used augmentation hyperparameters, node embedding dimensions and the encoder architectures are given in Appendix A and Appendix C.
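The learning rate schedule can be sketched as below; this is a generic illustration of linear warmup followed by cosine annealing (the exact shape and final learning rate used in the experiments may differ):

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup for `warmup_steps`, then cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the schedule reaches the base learning rate exactly, then decays smoothly to zero at the last step.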
Node embedding evaluation. We follow the linear evaluation protocol proposed by [26]. We use the trained encoder network, freeze its weights and extract the node embeddings for the original graph data without any augmentations. Next, we train an L2-regularized logistic regression classifier from the Scikit-learn [16] library. We also perform a grid search over the regularization strength using the following values: {2^-10, 2^-9, . . . , 2^9, 2^10}. In the case of the larger ogb-arxiv, ogb-products and PPI datasets, the Scikit-learn implementation takes too long to converge. Hence, we implement the logistic regression classifier in PyTorch and optimize it for 1000 steps using the AdamW optimizer.
We check various weight decay values using a smaller grid search: {2^-10, 2^-8, . . . , 2^8, 2^10}. We use these classifiers to compute the classification accuracy and report means and standard deviations over 20 model initializations and splits, except for the ogb-arxiv, ogb-products and PPI datasets, where only one data split is provided; there we only re-initialize the model weights 20 times (5 times for ogb-products due to the long training time).
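The linear evaluation protocol for the smaller datasets can be sketched as follows; for brevity this illustration scores directly on the held-out split, whereas the full protocol selects the regularization strength on a validation split (note that Scikit-learn's `C` is the inverse regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_evaluation(Z_train, y_train, Z_eval, y_eval):
    """Grid-search an L2-regularized logistic regression on frozen embeddings."""
    best_acc = 0.0
    for exponent in range(-10, 11):                   # grid {2^-10, ..., 2^10}
        clf = LogisticRegression(C=2.0 ** exponent, max_iter=1000)
        clf.fit(Z_train, y_train)
        best_acc = max(best_acc, clf.score(Z_eval, y_eval))
    return best_acc
```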

Transductive experiments
We evaluate and compare our framework to other graph representation learning approaches on 6 real-world datasets in the transductive setting. The whole graph, including all node features, is observed during encoder training. The node labels (classes) are hidden at that stage (unsupervised learning). Next, we use the frozen embeddings and the labels of the training nodes to train the logistic regression classifier.

Small and medium-sized benchmark datasets
Our first experiment uses 5 small and medium sized popular benchmark datasets, namely: WikiCS, Amazon Computers, Amazon Photos, Coauthor CS and Coauthor Physics.
Encoder model. Similarly to [23], we build our encoder f_θ as a 2-layer GCN [12] network. After the first GCN layer we apply a batch normalization layer (with momentum equal to 0.01) and the PReLU activation function. Following the original Barlow Twins method, we do not apply any normalization or activation to the final layer. A graph convolution (GCN) layer applies a symmetric normalization to the adjacency matrix with added self-loops, Â = A + I, using the diagonal degree matrix D̂ of Â. Hence the propagation rule of such a layer is defined as follows:

H^(l+1) = D̂^{-1/2} Â D̂^{-1/2} H^(l) W^(l)

Note that we do not include the activation σ(·) in this definition, as we first apply the batch normalization and then the activation function.
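A single propagation step of this rule can be written out explicitly; below is a dense NumPy illustration (real implementations, such as PyTorch Geometric's GCN layer, use sparse operations instead):

```python
import numpy as np

def gcn_propagate(A, H, W):
    """One GCN propagation step: H' = D^{-1/2} (A + I) D^{-1/2} H W."""
    A_hat = A + np.eye(A.shape[0])                 # adjacency with self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1)) # diagonal of D_hat^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ H @ W
```

With no edges, only the self-loops remain and the layer reduces to a plain linear transform of the node features.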
Results and discussion. We train our framework for a total of 1000 epochs, but we observe that our model converges earlier, at about 500-900 epochs (depending on the dataset; see Appendix B). This is significantly faster than the state-of-the-art method BGRL, which converges and reports results at 10 000 epochs. Additionally, we reproduce the results of BGRL and provide values for BGRL at 1000 epochs. In Table 2 we report the mean node classification accuracy along with the standard deviations. As our experimental scenario was aligned with the BGRL one, we re-use their reported scores and compare them to our results. We observe that our proposed method outperforms the other baselines and achieves results comparable to state-of-the-art methods. Moreover, our G-BT model outperforms BGRL at 1000 epochs.

ogb-arxiv dataset
In the next experiment, we use ogb-arxiv -a larger graph from the Open Graph Benchmark [10] with about 170 thousand nodes and about 1.1 million edges.
Encoder model. Due to the larger size of the graph, we extend the encoder f_θ to a 3-layer GCN model. We employ batch normalization and PReLU activations after the first and second layers, leaving the final layer as is (i.e. without any activation or normalization). In the BGRL paper, the authors suggested using layer normalization together with weight standardization [18], yet we did not observe any performance gain; more importantly, the training procedure was unstable, with many peaks in the loss function.
Results and discussion. In Table 3 we report the mean classification accuracy along with the standard deviations. Note that we provide values for both the validation and test splits, as the provided data splits are built according to chronological order. Hence, any model will be more affected by out-of-distribution error on data samples that are further away in time. We train our model for 500 epochs, but it converges after about 300-400 epochs (further training did not give any improvements). The model achieves results that are only 1.5 pp below the state-of-the-art method BGRL, which in turn takes 10 000 epochs to converge to such a solution.

Inductive experiments
We evaluate our proposed G-BT framework in inductive tasks over single and multiple graphs.

PPI
For the inductive learning case with multiple graphs, we employ the Protein-Protein Interaction (PPI) dataset [31]. Aligned with other methods, we provide results for multilabel node classification in terms of the Micro-F1 score.
Encoder model. We employ a Graph Attention (GAT) [25] based encoder model, as previous works have shown better results of such networks compared to standard GCN layers on PPI. Specifically, we build our encoder f_θ as a 3-layer GAT network with skip connections. The first and second layers use 4 attention heads of size 256, whose outputs are concatenated, and the final layer uses 6 attention heads of size 512, whose outputs are averaged instead of concatenated. In the GAT model, an attention mechanism learns the weights that are used to aggregate information from neighboring nodes. The attention weights α_ij are computed according to the following equation:

α_ij = exp( LeakyReLU( a^T [W h_i || W h_j] ) ) / Σ_{k∈N_i} exp( LeakyReLU( a^T [W h_i || W h_k] ) ),

where N_i denotes the neighbors of node i, h_i is the attribute vector of node i, W is a trainable matrix that transforms node attributes, a is the trainable attention vector, and || denotes the concatenation operation. We use the ELU activation for the first and second layers, leaving the last layer without any activation function. We do not apply any normalization techniques in the model, as preliminary experiments showed no performance improvement.
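The attention computation can be illustrated with a dense NumPy sketch of a single head (splitting the attention vector a into source and destination halves, as is commonly done in practice; self-loops are included here for simplicity):

```python
import numpy as np

def gat_attention(H, A, W, a, slope=0.2):
    """Attention weights alpha_ij over neighbors j of node i (self-loops included)."""
    Wh = H @ W                                  # transformed node features (N, d')
    d_prime = Wh.shape[1]
    src = Wh @ a[:d_prime]                      # a^T [Wh_i || . ] contribution
    dst = Wh @ a[d_prime:]                      # a^T [ . || Wh_j] contribution
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, slope * e)           # LeakyReLU
    mask = (A + np.eye(A.shape[0])) > 0         # attend only over neighbors
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)        # numerically stable softmax
    alpha = np.exp(e)
    return alpha / alpha.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to one, and non-neighbors receive zero weight.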
Results and discussion. We train our framework using a batch size of 1 graph for a total of 500 epochs, which turned out to be enough for the model to converge (based on preliminary experiments). In Table 4, we report the mean Micro-F1 score along with the standard deviations over 20 model initializations, as this dataset provides only one data split. Training for only 500 epochs provided results on par with the state-of-the-art method BGRL: our model achieves 70.49 using a GAT encoder.

ogb-products
We study the applicability of our proposed model in the case of large-scale graphs. We select the ogb-products dataset, which has about 2.4 million nodes and 62 million edges.
Encoder model and setup. We utilize the same GAT-based encoder as for PPI. Due to the size of this dataset and the resulting training time, we decide to perform inductive node classification, i.e., during training we use only the nodes from the training set and edges among them. Moreover, as this graph does not fit into GPU memory, we selected a batched setting with neighbor sampling (as proposed in [9]) instead of the full-batch scenario. We train our model with a batch size of 512 for 100 epochs.
Results and discussion. BGRL does not report results for this dataset, so we modify the implementation of the BGRL method to accept batches instead of whole graphs and evaluate it on this dataset. We also include results from the OGB leaderboard, but note that virtually all methods reported there are trained in a semi-supervised setting, contrary to our self-supervised approach; therefore, we may expect worse results. We summarize the mean and standard deviation of the node classification accuracy in Table 5. We observe that G-BT substantially outperforms BGRL on both the validation and test sets.

Training time comparison
We compare the training time of all considered models by the duration of a single epoch (the evaluation phase is the same for all models). We run each model for 10 training epochs and report the mean and standard deviation of the time measurements (Table 6). In virtually all cases, our model takes the least time for a single training iteration due to its simple symmetrical architecture. Compared to BGRL, our method speeds up computations by about 17-42 times.

Batched processing
Our proposed method allows working in both full-batch and mini-batch settings. For most considered datasets, splitting them into batches is not required, as they fit completely into the GPU's memory. Nevertheless, we run additional experiments where we train our G-BT model on these datasets in a batched manner. Batches are created using neighbor sampling, i.e. for a k-layer encoder model, we sample the k-hop neighborhood of a node. More specifically, we first subsample the direct neighbors, then we sample neighbors of those, etc. (as proposed in [9]). We re-use the augmentation hyperparameter values and number of epochs found in the full-batch case and retrain the G-BT model for each batch size 5 times (Table 7). We observe an expected decrease in performance when using the batched scenario (subject to further fine-tuning).
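The neighbor sampling procedure can be sketched as follows, in a simplified pure-Python version of the GraphSAGE-style sampling from [9] (the adjacency here is a dict from node to neighbor list, and `fanouts` gives the number of neighbors sampled per hop; both names are ours):

```python
import random

def sample_khop(adj, seed_nodes, fanouts, rng):
    """Sample a k-hop neighborhood around `seed_nodes`.

    For each hop, keep at most `fanout` randomly chosen neighbors of the
    current frontier, then expand from the freshly sampled nodes only.
    """
    nodes = set(seed_nodes)
    frontier = set(seed_nodes)
    for fanout in fanouts:
        next_frontier = set()
        for v in frontier:
            neighbors = adj.get(v, [])
            next_frontier.update(rng.sample(neighbors, min(fanout, len(neighbors))))
        nodes |= next_frontier
        frontier = next_frontier
    return nodes
```

For a k-layer encoder, `fanouts` has k entries, so each sampled subgraph contains exactly the receptive field needed to embed the seed nodes.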

Ablation and hyperparameter sensitivity study
To gain a better understanding of our proposed method, we conduct an ablation and hyperparameter sensitivity study. In particular, we focus on the augmentation functions (types and hyperparameters) and the encoder architectures.

Augmentation hyperparameter sets
In our model, we postulate using the same augmentation function hyperparameters to generate both graph views. This is motivated by the symmetrical architecture of our model and by hyperparameter search complexity. Performing a simple grid search over the value space $\{0, 0.1, \ldots, 0.5\}$ yields in our case a total of $6^2 = 36$ evaluated combinations. In contrast, using different parameter sets for the two graph views would generate $(6^2)^2 = 1296$ combinations, which can be reduced by exploiting the symmetry of the architecture, yielding $\binom{36}{2} = 630$ combinations to evaluate. To demonstrate the impact of using the same versus different augmentation hyperparameter value sets, we provide the results in Figure 2. There is no substantial difference in terms of test accuracy.
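The search-space sizes above can be verified with a few lines of Python (the variable names are ours, for illustration):

```python
from math import comb

# The grid of candidate probabilities: {0, 0.1, ..., 0.5}.
values = [round(0.1 * i, 1) for i in range(6)]

# One shared (p_A, p_X) pair for both views: a grid over two knobs.
same_for_both_views = len(values) ** 2             # 6^2 = 36

# Independent (p_A, p_X) pairs per view: a grid over four knobs.
different_per_view = (len(values) ** 2) ** 2       # 36^2 = 1296

# A symmetric architecture makes the two views interchangeable, so only
# unordered pairs of distinct settings remain: C(36, 2) = 630.
symmetric_distinct = comb(same_for_both_views, 2)

print(same_for_both_views, different_per_view, symmetric_distinct)  # 36 1296 630
```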

Ablation study
For the ablation study, we examine the utilized augmentation functions and the encoder architectures. We consider four literature-based configurations of augmentation functions [23] and four encoders, as follows:
• for the augmentation functions:
  - without any augmentation functions,
  - using only node feature masking,
  - using only edge dropping,
  - using both augmentation functions (node feature masking and edge dropping), as in the original version of our method;
• for the encoder architectures:
  - MLP-based encoder (in contrast to a graph neural network): a three-layer model with batch normalization and PReLU activations after the first two layers, whereas the output of the third linear layer is unmodified (aligned with the other models in our paper);
  - one-layer GCN encoder: a single GCN layer without any normalization or activation functions;
  - two-layer GCN encoder: as in the original version of our method;
  - three-layer GCN encoder: the same architecture as used for the ogb-arxiv dataset.
We conducted all the ablation study experiments with the WikiCS dataset. We trained each model version for 1000 epochs (with 100 warmup epochs). All the remaining hyperparameters were borrowed from the best performing G-BT model on the WikiCS dataset, as reported in our paper.
We visualize the influence of both the augmentation functions and the encoder architectures in Figure 3. Notice that using both augmentation functions provides the best results. However, using only node feature masking already leads to decent results (1pp difference; 75.9% accuracy). We conclude that node feature masking is more expressive than edge dropping, as using only edge dropping yields a smaller boost (72% accuracy). Without any augmentation functions, we observe a noticeably lower quality (67% accuracy). In fact, with no augmentations and a sufficiently large number of training epochs, we should expect very poor quality, as the representation collapses into a constant embedding.
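The two augmentations themselves are simple stochastic transformations. A minimal sketch is given below; the function names and the convention of masking whole feature columns (shared across all nodes) are our illustrative assumptions:

```python
import random

def node_feature_masking(x, p_x, rng=None):
    """Zero out each feature dimension with probability p_x.

    x is a list of per-node feature rows; the same column mask is
    applied to every node, so a masked dimension disappears globally.
    """
    rng = rng or random.Random(0)
    n_feats = len(x[0])
    keep = [rng.random() >= p_x for _ in range(n_feats)]
    return [[v if k else 0.0 for v, k in zip(row, keep)] for row in x]

def edge_dropping(edges, p_a, rng=None):
    """Remove each edge independently with probability p_a."""
    rng = rng or random.Random(0)
    return [e for e in edges if rng.random() >= p_a]

# Two distorted views are generated with the SAME (p_A, p_X) pair:
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
edges = [(0, 1)]
view_a = (edge_dropping(edges, 0.2, random.Random(1)), node_feature_masking(x, 0.1, random.Random(1)))
view_b = (edge_dropping(edges, 0.2, random.Random(2)), node_feature_masking(x, 0.1, random.Random(2)))
```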
For the encoder architectures, we notice that using the two-layer GCN model (as evaluated in our main experimental pipeline) leads to the best results. One might expect a larger model (three-layer GCN) to work better, yet we observe a performance drop when using such an encoder. This may be related to the oversmoothing issue in Graph Neural Networks. The one-layer GCN is also a reasonable choice in comparison to the two-layer model, as its accuracy is only 1.1pp worse than the baseline. Moreover, the time and space complexity of deeper GNN models (two, three, . . . layers) tends to explode in the case of highly dense graphs. Ignoring the graph structure by using an MLP-based encoder leads to results of about 68% accuracy.

Figure 3: Visualization of the ablation study. We consider the two most important parts of our model: the augmentation functions and the encoder architecture. We evaluate different settings and conclude that: (1) using both augmentation methods (node feature masking and edge dropping) at once yields the best results; (2) the two-layer GCN model works best, whereas a deeper one (three layers) suffers from the oversmoothing issue.

Loss function trade-off parameter
The λ parameter in the loss function (see: Eq. 3) controls the trade-off between the invariance and the redundancy reduction term. We evaluate multiple choices for this parameter and report the results in Figure 4. We use a similar setup as in the ablation study (see: Section 5.2), i.e., we utilize the WikiCS dataset and train a two-layer GCN encoder for 1000 epochs with 100 warmup epochs, whereas the other hyperparameters (except λ) are borrowed from the best performing G-BT model on the WikiCS dataset. We evaluate the following values: $\lambda \in \{0, 0.001, \frac{1}{256}, 0.01, 0.1, 0.5, 1.0, 2.0\}$. Note that the value $\lambda = \frac{1}{d} = \frac{1}{256}$ corresponds to the default choice used throughout all our other experiments; this value was initially suggested in [24]. The results in Figure 4 show that our proposed G-BT model achieves the best performance for exactly this λ. Moreover, one can conclude that smaller values yield higher performance than larger ones. In particular, when using λ = 1 (both the invariance and redundancy reduction terms are equally important) the performance deteriorates and settles at about 55% accuracy, whereas using values smaller than 0.01 results in much better embeddings - about 75% accuracy.
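The role of λ is easiest to see in code. The following NumPy sketch of the Barlow Twins objective (standardize the columns of both views, form the cross-correlation matrix, push it towards the identity) follows the description in the text; the function name and the column-standardization details are our assumptions for the sketch:

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_=None):
    """Barlow Twins objective on two (N, d) embedding views.

    The invariance term penalizes diagonal entries of the cross-correlation
    matrix deviating from 1; the redundancy-reduction term, weighted by
    lambda_, penalizes nonzero off-diagonal entries. lambda_ defaults to 1/d.
    """
    n, d = z_a.shape
    if lambda_ is None:
        lambda_ = 1.0 / d
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                      # (d, d) cross-correlation matrix
    invariance = np.sum((1.0 - np.diag(c)) ** 2)
    off_diag = c - np.diag(np.diag(c))
    redundancy = np.sum(off_diag ** 2)
    return invariance + lambda_ * redundancy
```

Two identical views with already decorrelated dimensions give a loss of exactly zero, since the cross-correlation matrix is then the identity.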

Impact of projector network
In our proposed G-BT method, we omit the so-called projector network, which was utilized in the original Barlow Twins method. Its main purpose is to reduce the high embedding dimensionality (of the ResNet encoder), whereas in our approach this step is unnecessary, as we utilize GNNs with low-dimensional embeddings. We evaluate multiple choices of the projector network dimensionality (including our base case with no projector network at all) and report the results in Figure 5. We use the same setup as previously, namely as in the ablation study (see: Section 5.2) and the loss function trade-off parameter study (see: Section 5.3), i.e., we utilize the WikiCS dataset and train a two-layer GCN encoder for 1000 epochs with 100 warmup epochs, whereas the other hyperparameters are borrowed from the best performing G-BT model on the WikiCS dataset. We evaluate the following values for the projector dimensionality: {128, 256, . . . , 8192, 16384}. We did not observe any significant differences in model performance, regardless of whether we employ a projector network or omit it completely. We performed a Friedman test, and the computed p-value confirmed our observation.

Conclusions
In this work we presented Graph Barlow Twins, a self-supervised graph representation learning framework, which utilizes the embeddings' cross-correlation matrix computed from two distorted views of a particular graph. The framework is fully symmetric and does not need any special techniques to build non-trivial embedding vectors. It builds representations that are invariant to the applied augmentations and reduces the redundancy in the representation vectors by enforcing the cross-correlation matrix to be equal to the identity matrix (Barlow Twins loss). Using 8 real-world datasets, we evaluate our model in node classification tasks, both transductive and inductive, and achieve results that are on par with or better than SOTA methods in SSL graph representation learning. We also show that our model converges an order of magnitude faster than other approaches.

Figure 4: Evaluation of the impact of the loss function's λ parameter on the overall model performance. The more the redundancy reduction term is present, the lower the accuracy. Empirically, the best performing contribution is for λ = 1/d, where d is the dimensionality of the embedding vectors.
Overall, our method reduces the computation cost (faster convergence) while keeping decent performance in downstream tasks. Consequently, it can be used to process larger graph datasets and efficiently perform tasks such as node classification, link prediction or graph classification. These tasks have a crucial impact on various machine learning areas where graph-structured data is used, e.g., detection of bots or hate speech in social networks, or building graph-based recommendation engines.
Further studies can address the utilization of other negative-sample-free approaches, applications of the proposed model in further graph-related tasks, such as link prediction or graph classification, and extensions to other types of data that are more specific than graphs (e.g., texts, tabular data).

Appendix A. Augmentation hyperparameters
Our proposed framework uses a single pair of augmentation hyperparameters $p_A, p_X \in \mathbb{R}$, whereas other methods use different values to generate the two graph views; therefore, we cannot reuse the values reported in other works. We show that a single set is enough to achieve decent performance in a symmetrical network architecture like ours. We instead perform a grid search over these hyperparameters and use the values for which the model performs best (in terms of classification accuracy, or Micro-F1 score for PPI). We do not evaluate the model during training and just use the final version after training. We use the following setting:
• the framework is trained for 500 epochs,
• we set the learning rate warmup time to 50 epochs.
For values greater than 0.5, the augmentation removes too much information from the graph. In the case of the ogb-products dataset, due to its large size, we trained our model only for 10 epochs with a warmup period of 2 epochs, but evaluated the same augmentation hyperparameter values. We summarize the augmentation hyperparameters of the best performing models in Table A.8.

Appendix B. Training setup
For all datasets, we train our framework using the AdamW [8] optimizer with a weight decay of $10^{-5}$. The learning rate is adjusted using a cosine annealing strategy with a linear warmup period up to the base learning rate. During training, we set a total number of epochs and an evaluation interval, after which the frozen embeddings are evaluated in downstream tasks (using either the $\ell_2$-regularized logistic regression from scikit-learn [16] with the liblinear solver, or a custom PyTorch version with AdamW for ogb-arxiv and PPI). For instance, if we set the total number of epochs to 1000 and the evaluation interval to 500, the model will be evaluated at epochs 0, 500 and 1000 (three times in total). We report the values for the best performing model found during those evaluations. We summarize these training statistics in Table B.9.
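The learning rate schedule described above (linear warmup to the base rate, then cosine annealing) can be expressed as a small function; the name and signature are ours, for illustration:

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_steps):
    """Linear warmup from 0 to base_lr over warmup_steps,
    then cosine annealing from base_lr down to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at every step (equivalently, one can compose a warmup scheduler with a cosine-annealing scheduler in PyTorch).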

Appendix C. Encoder architecture
We compare our framework against the state-of-the-art self-supervised graph representation learning method BGRL [23]. To provide a fair comparison, we use encoder architectures similar to the ones presented in their paper. We do not use any predictor networks in our framework, so we slightly modify the encoders to be better suited for the loss function (as given in the Barlow Twins paper [28]), i.e., we do not apply any normalization (such as batch or layer normalization) or activation function in the final layer of the encoder. Note that the lack of a predictor network and of batch normalization in the final layer reduces the overall number of trainable network parameters. In all cases, we use batch normalization with a momentum of 0.01 (as in BGRL [23], where they use the equivalent decay of 0.99).
For the small- to medium-sized datasets, i.e., WikiCS, Amazon-CS, Amazon-Photo, Coauthor-CS and Coauthor-Physics, we use a 2-layer GCN [12] based encoder with the following architecture:
• GCN(k, 2d),
• BatchNorm(2d),
• PReLU(),
• GCN(2d, d),
where k is the number of node features and d is the embedding vector size.
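An illustrative NumPy sketch of this encoder's forward pass is given below. It shows the symmetrically normalized GCN propagation and the layer ordering only; BatchNorm is omitted and PReLU is approximated with a fixed slope, and the weight matrices are assumed to be given (in practice they are learned):

```python
import numpy as np

def gcn_propagate(a, h):
    """One GCN [12] propagation step: D^{-1/2} (A + I) D^{-1/2} H."""
    a_hat = a + np.eye(a.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return a_norm @ h

def encoder_forward(a, x, w1, w2):
    """Two-layer GCN encoder sketch: GCN(k, 2d) -> PReLU -> GCN(2d, d).
    BatchNorm and the learned PReLU slope are omitted for brevity;
    the final layer has no normalization or activation."""
    h = gcn_propagate(a, x @ w1)                # (n, 2d)
    h = np.where(h > 0, h, 0.25 * h)            # PReLU with fixed slope 0.25
    return gcn_propagate(a, h @ w2)             # (n, d), raw embeddings
```

Here `a` is a dense (n, n) adjacency matrix, `x` an (n, k) feature matrix, `w1` a (k, 2d) and `w2` a (2d, d) weight matrix; real implementations use sparse message passing instead of dense matrix products.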
For the ogb-arxiv dataset, we use a slightly larger model - a 3-layer GCN [12] based encoder. We tried to utilize weight standardization [18] and layer normalization, but our model did not benefit from those techniques (even though they helped in BGRL [23]): the training procedure under this setting was unstable, with various fluctuations and peaks of the loss function. The final architecture is summarized as follows:
• GCN(k, d),
• BatchNorm(d),
• PReLU(),
• GCN(d, d),
• BatchNorm(d),
• PReLU(),
• GCN(d, d).
In the inductive experiment with the PPI dataset, we use a 3-layer GAT [25] based encoder. Graph Attention Networks are known to perform better on this dataset compared to GCNs. This was also shown in BGRL [23], where their approach with GAT layers provided state-of-the-art performance in self-supervised graph representation learning for PPI. Our architecture can be summarized as follows:
• GAT(k, 256, heads=4) + Linear(k, 4 * 256),
• ELU(),
• GAT(4 * 256, 256, heads=4) + Linear(4 * 256, 4 * 256),
• ELU(),
• GAT(4 * 256, d, heads=6) + Linear(4 * 256, d).
The outputs of the attention heads in the first and second layers are concatenated, and for the last GAT layer the attention head outputs are averaged. In every layer, we utilize skip connections using linear layers to project the outputs of the previous layer (the raw features in the case of the first layer) to the desired dimensionality.
The exact values for the input feature dimension k and the embedding dimension d are given in Table C.10.
• train_bgrl_full_batch@<dataset_name> - trains and evaluates the BGRL model in the full-batch case for WikiCS, Amazon-CS, Amazon-Photo, and Coauthor-CS,
• bgrl_hps_batched@ogbn-products - runs the augmentation hyperparameter search for BGRL using the ogb-products dataset,
• bgrl_batched_train@ogbn-products - trains and evaluates the BGRL model for the ogb-products dataset,
• evaluate_features_products - evaluates the performance of ogb-products' raw node features,
• evaluate_deepwalk_products - evaluates the performance of DeepWalk on the ogb-products dataset; additionally, the case of DeepWalk features concatenated with raw node features is also evaluated.
All hyperparameters described in this Appendix are stored in configuration files in the experiments/configs/ directory, whereas the experimental Python scripts are placed in the experiments/scripts/ directory.