Evaluating graph neural networks under graph sampling scenarios

Background. In real-world settings, it is often the case that only a portion of the underlying network structure is observed. However, since most network analysis methods are built on a complete network structure, three natural questions arise: (a) how well do these methods perform with an incomplete network structure, (b) which structural observation and which network analysis method should be chosen for a specific task, and (c) is it beneficial to complete the missing structure? Methods. In this paper, we consider the incomplete network structure as one random sampling instance from a complete graph, and we choose graph neural networks (GNNs), which have achieved promising results on various graph learning tasks, as the representative network analysis method. To identify the robustness of GNNs under graph sampling scenarios, we systematically evaluated six state-of-the-art GNNs under four commonly used graph sampling methods. Results. We show that GNNs can still be applied to single static networks under graph sampling scenarios, and that simpler GNN models are able to outperform more sophisticated ones under a fair experimental procedure. More importantly, we find that completing the sampled subgraph does improve the performance of downstream tasks in most cases; however, completion is not always effective and needs to be evaluated for each specific dataset. Our code is available at https://github.com/weiqianglg/evaluate-GNNs-under-graph-sampling.


INTRODUCTION
In the last few years, graph neural networks (GNNs) have become standard tools for learning tasks on graphs. By iteratively aggregating information from neighborhoods, GNNs embed each node from its k-hop neighborhood and provide a significant improvement over traditional methods in node classification and link prediction tasks (Dwivedi et al., 2020; Shchur et al., 2018). Their powerful representation capabilities have led to GNNs being applied in areas such as social networks, computer vision, chemistry, and biology (Hou et al., 2020). However, most GNN models require a complete underlying network structure, which is often unavailable in real-world settings (Wei & Hu, 2021).
As far as we know, this is the first work to systematically evaluate the impact of graph sampling on GNNs.

RELATED WORK
In this section, we briefly review related works on graph sampling and GNNs.

Graph sampling
Graph sampling is a technique to pick a subset of nodes and/or edges from an original graph. The commonly studied sampling methods are node sampling, edge sampling, and traversal-based sampling (Al Hasan, 2016;Ahmed, Neville & Kompella, 2013). In node sampling, nodes are first selected uniformly or according to some centrality, such as degree or PageRank, then the induced subgraph among the selected nodes is extracted. In edge sampling, edges are selected directly or guided by nodes. Node sampling and edge sampling are simple and suitable for theoretical analysis, but in many real scenarios we cannot perform them due to various constraints, e.g., the whole graph is unknown (Hu & Lau, 2013). Traversal-based sampling, which extends from seed nodes to their neighborhood, is more practical. Therefore, a group of methods was developed, including breadth-first search (BFS), depth-first search (DFS), snowball sampling (SBS) (Goodman, 1961), forest fire sampling (FFS) (Leskovec, Kleinberg & Faloutsos, 2005), random walk (RW), and Metropolis-Hastings random walk (MHRW). With the numerous graph sampling methods developed, the question of how they impact GNNs still remains to be answered.
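As an illustration, the two simplest traversal-based samplers (BFS and RW) can be sketched in a few lines of Python. The adjacency-dict representation and function names below are our own illustrative choices, not the paper's implementation:

```python
import random
from collections import deque

def bfs_sample(adj, seed, target_size):
    """Breadth-first search sampling: expand layer by layer from the seed
    node until the sampled node set reaches target_size."""
    visited, queue = {seed}, deque([seed])
    while queue and len(visited) < target_size:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and len(visited) < target_size:
                visited.add(v)
                queue.append(v)
    return visited

def random_walk_sample(adj, seed, target_size):
    """Random walk sampling: move to a uniformly chosen neighbor at each
    step, collecting every visited node until target_size nodes are seen."""
    visited, current = {seed}, seed
    while len(visited) < target_size:
        current = random.choice(list(adj[current]))
        visited.add(current)
    return visited

# Toy graph as an adjacency dict: a 6-node cycle.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(sorted(bfs_sample(adj, seed=0, target_size=4)))  # → [0, 1, 4, 5]
```

Both samplers return a connected node set by construction, which is the property that makes traversal-based sampling a natural fit for GNNs.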

GNNs
After the first GNN model was developed (Bruna et al., 2014), various GNNs have been explored in the graph domain. GCN simplifies ChebNet (Defferrard, Bresson & Vandergheynst, 2016) and speeds up the graph convolution computation. GAT and MoNet extend GCN by leveraging an explicit attention mechanism (Lee et al., 2019). Due to their powerful representation capabilities, GNNs have been applied to a wide range of applications, including knowledge graphs (Zhang, Cui & Zhu, 2020), molecular graph generation (De Cao & Kipf, 2018), graph metric learning and image recognition (Kajla et al., 2021; Riba et al., 2021). Recently, graph sampling has been investigated within GNNs for scaling to larger graphs and better generalization. Layer sampling techniques have been proposed for efficient mini-batch training: GraphSage performs uniform node sampling on the previous layer's neighbors (Hamilton, Ying & Leskovec, 2017). GIN extends GraphSage with arbitrary aggregation functions on multisets and is theoretically as powerful as the Weisfeiler-Lehman test of graph isomorphism (Xu et al., 2018). In contrast to layer sampling, GraphSAINT constructs mini-batches by directly sampling the training graph, which decouples sampling from propagation (Zeng et al., 2019). However, most GNNs assume that the underlying network structure is complete without data loss, which is often not the case.
In addition, different GNNs are compared in Errica et al. (2019) and Shchur et al. (2018) with regard to graph classification and node classification tasks, respectively; a systematic evaluation of deep GNNs is presented in Zhang et al. (2021); and a reproducible framework for benchmarking GNNs is introduced in Dwivedi et al. (2020). The work most closely related to ours is Fox & Rajamanickam (2019), which studies the robustness of GIN to additional structural noise. Our work instead focuses on graph sampling, which can be regarded as random removal of structure from the original network.

Models
We focus on the robustness of GNNs under graph sampling scenarios. As shown in Fig. 1, G_O is the partially observed graph of a network G, for which complete observations are often difficult to obtain. We train GNNs on G_O and then evaluate them on three typical learning tasks: node classification, link prediction and graph classification. In this paper, we treat G_O as one of the many graphs generated by a certain sampling process from a known G; consequently, we are able to determine the robustness of GNNs in a statistical way via multiple independently sampled instances of G_O.
We denote the original network as G(V, E, X), where V and E represent the node and edge sets, respectively, and X ∈ R^{|V|×d} denotes the attribute matrix. There is no missing structure in G. The observed or sampled graph is represented by G_O(V_O, E_O, X_O). We evaluate six popular GNNs (GCN, GraphSage, GAT, MoNet, GatedGCN and GIN) with four traversal-based graph sampling methods (BFS, FFS, RW, and MHRW). The six GNN models are selected according to performance and popularity; moreover, they cover all three categories of GNN models: isotropic (GCN, GraphSage), anisotropic (GAT, MoNet, GatedGCN) and Weisfeiler-Lehman (GIN) GNNs (Dwivedi et al., 2020). We test only traversal-based sampling methods for two reasons: these methods are practical in real settings (Hu & Lau, 2013), and they extract connected subgraphs, which is a prerequisite for GNNs. In graph sampling, we iteratively pick nodes and edges starting from a random seed node until the cardinality of the sampled node set V_O reaches a given number. Apart from the original sampled subgraph G_O, we also consider the induced subgraph Ḡ_O, which has the same edges as G between the vertices in V_O; hence, Ḡ_O can be considered as a completion of G_O.
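The completion step is straightforward once the sampled node set is known. A minimal sketch, assuming an edge-list representation of G (the function name and toy graph are hypothetical):

```python
def induced_completion(edges, sampled_nodes):
    """Given the full edge list of G and the sampled node set V_O, return
    the edge set of the induced subgraph: every edge of G whose endpoints
    both lie in V_O. This acts as the 'completion' of the sampled graph,
    since traversal-based sampling may miss edges between sampled nodes."""
    V_O = set(sampled_nodes)
    return {(u, v) for u, v in edges if u in V_O and v in V_O}

# Toy example: a square with one diagonal.
G_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
V_O = [0, 1, 2]            # nodes picked by some traversal-based sampler
E_O = [(0, 1), (1, 2)]     # edges actually traversed during sampling
E_completed = induced_completion(G_edges, V_O)
print(sorted(E_completed))  # [(0, 1), (0, 2), (1, 2)] — recovers edge (0, 2)
```

Note that this completion requires knowledge of the full edge set of G, which is available in our benchmark setting but would require a separate network-completion method in the wild.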
We follow the principles of Dwivedi et al. (2020) and develop a standardized training, validation, and testing procedure for all models for fair comparisons.
In addition, we considered a multilayer perceptron (MLP) as a baseline model, which uses only node attributes and no graph structure.

EXPERIMENTS

Datasets
In our benchmark, we used nine datasets: six social networks (Cora, CiteSeer, PubMed (Yang, Cohen & Salakhutdinov, 2016), Actor (Pei et al., 2020), ARXIV and COLLAB (Hu et al., 2020)), two super-pixel networks of images (MNIST, CIFAR10 (Dwivedi et al., 2020)) and one artificial network generated from a stochastic block model (CLUSTER (Dwivedi et al., 2020)). Statistics for all datasets are shown in Table 1. We treated all networks as undirected and considered only the largest connected component; moreover, we ignored edge features in our experiments.

Setup
Setups for our experiments are summarized in Table 2. All datasets were split into training, validation, and testing data. For node classification tasks, Cora, CiteSeer and PubMed were split according to Yang, Cohen & Salakhutdinov (2016), the first of the 10 splits from Pei et al. (2020) was used for Actor, and CLUSTER was split according to Dwivedi et al. (2020); for link prediction tasks, we used a random 70%/10%/20% training/validation/test split of the positive edges in all datasets; for graph classification tasks, the splits were derived from Dwivedi et al. (2020).
In all GNNs, a linear transform was applied to the node attributes X before the hidden layers. The number of hidden layers L was set to L = 2 to avoid over-smoothing for small-scale datasets such as Actor, Cora, CiteSeer and PubMed, and we set L = 3 for ARXIV and COLLAB, and L = 4 for MNIST, CIFAR10 and CLUSTER. We added residual connections between GNN layers for medium-scale datasets (i.e., ARXIV, COLLAB, MNIST, CIFAR10 and CLUSTER), as suggested by Dwivedi et al. (2020). We chose the hidden dimension and the output dimension so that the number of parameters was almost equal for each model. The number of attention heads in GAT was set to 8, and the mean aggregation function was adopted in GraphSage. In MoNet, we set the number of Gaussian kernels to 3 and used the degrees of adjacent nodes as the input pseudo-coordinates, as proposed in Monti et al. (2017). We used the same training procedure for all GNN models for a fair comparison. Specifically, the maximum number of training epochs was set to 1,000, and we adopted Glorot (Glorot & Bengio, 2010) and zero initialization for the weights and biases, respectively. We applied the Adam optimizer (Kingma & Ba, 2015) and reduced the learning rate by a factor of 0.5 when the validation metric stopped improving after a given reduce patience. Furthermore, we stopped the training procedure early if (a) the learning rate fell below 1e-5, (b) the validation metric did not improve for 100 consecutive epochs, or (c) the training time exceeded 12 h. All model parameters were optimized with the cross-entropy loss once G_O was sampled.
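The learning-rate schedule and early-stopping rules above can be sketched in plain Python. The class and its names are our own illustrative choices (in practice, PyTorch's built-in `ReduceLROnPlateau` scheduler provides the plateau-based reduction); the sketch only shows the control logic:

```python
class TrainingController:
    """Illustrative sketch of the schedule described above: halve the
    learning rate when the validation metric plateaus, and stop when the
    LR drops below min_lr or the metric has not improved for
    stop_patience epochs. Hypothetical helper, not the authors' code."""

    def __init__(self, lr=1e-3, factor=0.5, reduce_patience=10,
                 stop_patience=100, min_lr=1e-5):
        self.lr, self.factor, self.min_lr = lr, factor, min_lr
        self.reduce_patience, self.stop_patience = reduce_patience, stop_patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, val_metric):
        """Record one epoch's validation metric; return True to stop."""
        if val_metric > self.best:
            self.best, self.bad_epochs = val_metric, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs % self.reduce_patience == 0:
                self.lr *= self.factor          # reduce LR on plateau
        return self.lr < self.min_lr or self.bad_epochs >= self.stop_patience

# Small patience values so the toy run stops quickly.
ctrl = TrainingController(reduce_patience=2, stop_patience=5)
history = [0.70, 0.75, 0.74, 0.74, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76]
stopped_at = next(i for i, m in enumerate(history) if ctrl.step(m))
print(stopped_at, round(ctrl.lr, 6))  # 9 0.000125
```

Here the metric plateaus after epoch 4, the LR is halved three times, and training stops once the metric has failed to improve for stop_patience epochs.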
We implemented all six models with the PyTorch Geometric library (Fey & Lenssen, 2019), and the four graph sampling methods based on Rozemberczki, Kiss & Sarkar (2020).

RESULTS
For each dataset, sampling method, and GNN model, we performed 4 runs with 4 different seeds and report the average metric. To answer Q1, we show the means, µ, and standard deviations, δ, of the metrics for all datasets with sampling ratio r = |V_O|/|V| ∈ [0.1, 0.5] using GCN and MHRW (Table 3). It is worth mentioning that the other GNN models and graph sampling methods showed similar results. There are a few observations to be made. First, the means, µ, increase and the standard deviations, δ, decrease as the sampling ratio increases in node classification and graph classification tasks, which aligns with our intuition. Second, the performance is acceptable on most single-graph datasets when r is relatively large; e.g., compared to the complete cases, the relative losses ℓ = 1 − µ_r/µ_complete are all less than 15% for CiteSeer, Cora, PubMed, ARXIV and COLLAB when r ≥ 0.4. This is partly because the nodes in G_O have acquired sufficient neighborhood structure to accomplish the message passing and aggregation needed by GNNs. Therefore, we can still use GNNs on most single-graph datasets under sampling scenarios, as long as the sampling ratio, r, is chosen properly. The appropriate choice of r varies with the dataset, sampling method, and GNN model. For example, in order to achieve ℓ ≤ 10% on node classification tasks, the sampling ratio should satisfy r ≥ 0.5 for Actor and PubMed, r ≥ 0.4 for Cora, and r ≥ 0.1 for CiteSeer. By contrast, the performance degradation is severe for multi-graph datasets (i.e., CLUSTER, MNIST, CIFAR10), mainly because independent random sampling destroys the intrinsic association between graphs. Hence, we cannot directly use GNNs on such datasets under independent random sampling scenarios.
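The relative loss used above is a one-line computation; the accuracies in the example are hypothetical, not taken from Table 3:

```python
def relative_loss(mu_r, mu_complete):
    """Relative performance loss ℓ = 1 − µ_r / µ_complete: the fraction of
    the complete-graph metric lost when only a sampled subgraph is used."""
    return 1.0 - mu_r / mu_complete

# Hypothetical numbers: complete-graph accuracy 0.80 vs. 0.72 at sampling
# ratio r = 0.4 gives a 10% relative loss.
print(relative_loss(0.72, 0.80))  # ≈ 0.1
```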
To answer Q2, we show µ and δ for all datasets with r fixed at 0.3 in Table 4. According to Table 4, the best-performing GNN model is consistent across different sampling methods for a given dataset, especially in node classification tasks, e.g., GatedGCN for Actor, and GCN for Cora, CiteSeer, and PubMed. This consistency suggests that datasets have a strong preference for a specific GNN model, and that there is no silver-bullet GNN for all datasets. Another observation is that some datasets show a tendency towards particular sampling methods, e.g., BFS for Actor and RW for ARXIV. To compare all GNN models and sampling methods, we consider the relative metric score, as proposed in Shchur et al. (2018). That is, for GNN models, we take the best µ over the four sampling methods as 100% for each dataset, divide the score of each model by this value, and then average the results for each model over all datasets and sampling methods. We also rank the GNN models by their performance (1 for the best, 7 for the worst, with the MLP baseline included) and compute the average rank for each model. Similarly, we calculate the score of each sampling method. The final scores for GNN models and sampling methods are summarized in Table 5. These results provide a reference for the selection of sampling methods and a guideline for sampling-based GNN training such as GraphSAINT (Zeng et al., 2019).
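The relative-score and average-rank computation can be sketched as follows. The model names and metric values in the example are hypothetical, and ties in rank are not specially handled in this sketch:

```python
def relative_scores(results):
    """results[model][dataset] holds a metric µ. For each dataset, the
    best model's µ counts as 100%, and every model's score on that
    dataset is µ / best. Scores and ranks are then averaged over
    datasets, following the relative-metric protocol of Shchur et al.
    (2018)."""
    models = list(results)
    datasets = list(next(iter(results.values())))
    scores = {m: [] for m in models}
    ranks = {m: [] for m in models}
    for d in datasets:
        vals = {m: results[m][d] for m in models}
        best = max(vals.values())
        order = sorted(models, key=lambda m: vals[m], reverse=True)
        for m in models:
            scores[m].append(100.0 * vals[m] / best)
            ranks[m].append(order.index(m) + 1)  # 1 = best
    return ({m: sum(s) / len(s) for m, s in scores.items()},
            {m: sum(r) / len(r) for m, r in ranks.items()})

# Hypothetical metrics for two models on two datasets.
res = {"GCN": {"Cora": 0.80, "Actor": 0.30},
       "GAT": {"Cora": 0.76, "Actor": 0.33}}
avg_score, avg_rank = relative_scores(res)
print(round(avg_score["GCN"], 1), avg_rank["GCN"])  # 95.5 1.5
```

Because scores are normalized per dataset before averaging, a model is not rewarded simply for performing well on datasets where every model scores highly.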
GNNs outperform MLP on average in Table 5, confirming the superiority of GNNs, which combine structural and attribute information, over methods that use attributes alone. On small datasets, GCN is the best GNN model, which supports the observation that simple methods often outperform more sophisticated ones (Dwivedi et al., 2020; Shchur et al., 2018). In addition, BFS is found to be the best sampling method for small datasets, partly because it samples node labels more uniformly than the other methods. Figure 2 compares the Kullback-Leibler divergence between the label distributions of the training and testing sets under different sampling methods on PubMed (NC); BFS has a lower score, which leads to better generalization power in GNNs. On medium datasets, the best GNN model changes to GAT, and the most competitive sampling methods are RW and MHRW. This may be because RW and MHRW capture a more macroscopic structure than BFS and FFS. To answer Q3, we considered the induced subgraph Ḡ_O as a completion of G_O. We chose the preferred GNN model for each dataset, e.g., GatedGCN for Actor, and then computed the induced relative metric improvement percentage as τ = µ̄_r/µ_r − 1, where µ̄_r and µ_r are the metrics obtained on Ḡ_O and G_O, respectively. Figure 3 shows the improvements on all datasets with r ∈ {0.1, 0.3, 0.5}. From Fig. 3 it can be seen that network completion improves performance in most cases. Comparing Figs. 3A, 3B and 3C shows that the induced improvement τ increases as the sampling ratio r decreases, especially under MHRW or RW, which indicates the necessity of network completion when r is low.
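The label-distribution comparison behind Fig. 2 can be reproduced in a few lines. The label lists below are hypothetical toy data, not the actual PubMed splits:

```python
import math
from collections import Counter

def label_kl(train_labels, test_labels, eps=1e-12):
    """KL divergence D(P_train || P_test) between the empirical label
    distributions of the training and testing nodes. A lower value means
    the sampler (e.g., BFS) produced a training set whose label
    distribution better matches the test set."""
    classes = set(train_labels) | set(test_labels)
    p, q = Counter(train_labels), Counter(test_labels)
    n_p, n_q = len(train_labels), len(test_labels)
    return sum((p[c] / n_p) * math.log((p[c] / n_p + eps) / (q[c] / n_q + eps))
               for c in classes if p[c] > 0)

# Hypothetical label lists: a balanced sampler vs. a skewed one against a
# balanced two-class test set.
test = [0] * 50 + [1] * 50
balanced = [0] * 10 + [1] * 10
skewed = [0] * 18 + [1] * 2
print(label_kl(balanced, test) < label_kl(skewed, test))  # True
```

A sampler whose training labels match the test distribution (low KL) gives the GNN less distribution shift to overcome, which is consistent with the advantage of BFS observed on small datasets.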
On the other hand, Fig. 3 reveals the complexity of datasets under sampling scenarios, indicating that network completion is not always effective. Some datasets benefit from network completion in all cases, e.g., Cora (NC), ARXIV and MNIST; other datasets seem unaffected by completion, e.g., PubMed (LP) when r ∈ {0.3, 0.5} (see Figs. 3B-3C); what is more, network completion has side effects on datasets such as COLLAB. This complexity may be partly explained by structural noise in the network (Luo et al., 2021; Zheng et al., 2020). We treat graph sampling as a structural denoising process. If the original network G contains only a small amount of structural noise, completion restores the informative edges removed by sampling, thus improving GNN performance. If, however, the structural noise in G is large, completion weakens the denoising effect of sampling and leads to performance degradation.

Table 4: The means and standard deviations of the metrics (µ ± δ (%)) for all nine datasets with sampling ratio r = 0.3. NC and LP are short for node classification and link prediction, respectively. For each dataset and graph sampling method, the best metric is marked in bold; for each dataset and GNN method, the best metric is shown in red.

CONCLUSIONS
We focused on the performance of GNNs with partially observed network structure. By treating the incomplete structure as one of the many graphs generated by a certain sampling process, we determined the robustness of GNNs in a statistical way via multiple independent random samplings. Specifically, we performed an empirical evaluation of six state-of-the-art GNNs on three network learning tasks (i.e., node classification, link prediction and graph classification) with four popular graph sampling methods. We confirmed that GNNs can still be applied under graph sampling scenarios on most single-graph datasets, but not on multi-graph datasets. We also identified the best GNN model and sampling method, namely GCN and BFS for small datasets, and GAT and RW for medium datasets, which provides a guideline for future applications. Moreover, we found that network completion can improve GNN performance in most cases; however, case-by-case analysis is needed due to the complexity of datasets under sampling scenarios, suggesting that completion and denoising should be applied only after careful evaluation. We hope this work, along with the public code, will encourage future work on understanding the relationship between structural information and GNNs.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.

Author Contributions
• Qiang Wei conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Guangmin Hu conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability: