
1 Introduction

Many types of real-world data present an inherent structure and can be modelled as sequences, graphs, or hypergraphs [2, 5, 9, 15]. Graph-structured data, in particular, are very common in practice and are at the heart of this work.

We consider the problem of graph classification. That is, given a set \(\mathcal {G} = \{G_i\}_{i = 1}^m\) of arbitrary graphs and their respective labels \(\{y_i\}_{i = 1}^m\), where \(y_i \in \{1, \ldots , C\}\) and C is the number of classes, we aim at finding a mapping \(f_\theta : \mathcal {G} \rightarrow \{1, \ldots , C\}\) that minimizes the classification error, where \(\theta \) denotes the parameters to optimize.

Graph neural networks (GNNs) and their deep learning variants, the graph convolutional networks (GCNs) [1, 7, 9, 10, 13, 17, 20, 27], have gained considerable interest recently. GNNs learn latent node representations by recursively aggregating the neighboring node features for each node, thereby capturing the structural information of a node’s neighborhood.

Despite the profusion of GNN variants, some of which achieve state-of-the-art results on tasks like node classification, graph classification, and link prediction, GNNs remain little studied from a theoretical standpoint. In particular, it is often unclear what a GNN learns and how the learned graph (or node) mapping influences its generalization performance. In a recent work, the authors of [25] present a theoretical framework to analyze the expressive power of GNNs, where a GNN’s expressiveness is defined as its ability to compute different graph representations for different graphs. They derive theoretical conditions under which a GNN is maximally expressive. Although it is reasonable to assume that a higher expressiveness would result in a higher accuracy on classification tasks, this link has not been explicitly studied so far.

In this paper, we design a principled experimental procedure to analyze the link between expressiveness and the test accuracy of GNNs. In particular:

  • We define a practical measure to estimate the expressiveness of GNNs;

  • We use this measure to define a new penalized loss function that allows training GNNs with varying expressive power.

To illustrate our experimental framework, we introduce a simple yet practical architecture, the Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN). We also present an original graph data set of metal hydrides that we use along with benchmark graph data sets to evaluate SPI-GCN.

This paper is organized as follows. Section 2 discusses the related work. Section 3 introduces preliminary notations and concepts related to graphs and GNNs. In Sect. 4, we introduce our graph neural network, SPI-GCN. In Sect. 5, we present a practical expressiveness estimator and a new expressiveness-based loss function as part of our experimental framework. Section 6 presents our results and Sect. 7 concludes the paper.

2 Related Work

Graph neural networks (GNNs) were first introduced in [11, 19]. They learn latent node representations by iteratively aggregating neighborhood information for each node. Their more recent deep learning variants, the graph convolutional networks (GCNs), generalize conventional convolutional neural networks to irregular graph domains. In [13], the authors present a GCN for node classification where the computed node representations can be interpreted as the graph coloring returned by the 1-dimensional Weisfeiler-Lehman (WL) algorithm [24]. A related GCN that is invariant to node permutation is presented in [27]. The graph convolution operator is closely related to the one in [13], and the authors introduce a permutation-invariant pooling operator that sorts the convolved nodes before feeding them to a 1-dimensional classical convolution layer for graph-level classification. A popular GCN is Patchy-san [17]. Its graph convolution operator extracts normalized local “patches” (neighborhood representations) of the graph which are then sorted and fed to a 1-dimensional traditional convolution layer for graph-level classification. The method, however, requires the definition of a node ordering and running the WL algorithm in a preprocessing step. On the other hand, the normalization of the extracted patches requires sorting the nodes again and using the external graph software Nauty [14].

Despite the success of GNNs, there are relatively few papers that analyze their properties, either mathematically or empirically. A notable exception is the recent work by [25] that studies the expressive power of GNNs. The authors prove that (i) GNNs are at most as powerful as the WL test in distinguishing graph structures and that (ii) if the graph function of a GNN—i.e. its graph embedding scheme—is injective, then the GNN is as powerful as the WL test. The authors also present the Graph Isomorphism Network (GIN), which approximates the theoretical maximally expressive GNN. In another study [4], the authors present a simple neural network defined on a set of graph augmented features and show that their architecture can be obtained by linearizing graph convolutions in GNNs.

Our work is related to [25] in that we adopt the same definition of expressiveness, that is, the ability of a GNN to compute distinct graph representations for distinct input graphs. However, we go one step further and investigate how the graph function learned by GNNs affects their generalization performance. On the other hand, our SPI-GCN extends the GCN in [13] to graph-level classification. Our SPI-GCN is also related to [27] in that we use a similar graph convolution operator inspired by [13]. Unlike [27], however, our architecture does not require any node ordering, and we only use a simple multilayer perceptron (MLP) to perform classification.

3 Some Graph Concepts

A graph G is a pair (V, E) of a set \(V = \{v_1, \ldots , v_n\}\) of vertices (or nodes) \(v_i\), and a set \(E \subseteq V \times V\) of edges \((v_i, v_j)\). In this work, we represent a graph G by two matrices: (i) an adjacency matrix \(\text {A}\in \mathbb {R}^{n\times n}\) such that \(a_{ij} = 1\) if there is an edge between nodes \(v_i\) and \(v_j\) and \(a_{ij} = 0\) otherwise,Footnote 1 and (ii) a node feature matrix \(\text {X}\in \mathbb {R}^{n\times d}\), where each row \(\text {x}_i \in \mathbb {R}^d\) contains the feature representation of node \(v_i\) and \(d\) is the dimension of the feature space. Since we only consider node features in this paper (as opposed to edge features for instance), we refer to the node feature matrix \(\text {X}\) simply as the feature matrix in the following.
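
To make the notation concrete, the following NumPy sketch (our own illustration, not taken from the original paper) builds the matrices \(\text {A}\) and \(\text {X}\) for a small undirected graph with \(n = 3\) nodes and \(d = 2\) node features.

```python
import numpy as np

# Toy undirected graph with n = 3 nodes and edges (v1, v2), (v2, v3).
# A is symmetric because the graph is undirected.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# Node feature matrix X in R^{n x d}; row i holds the d = 2 features of node v_i.
X = np.array([[0.5, 1.0],
              [0.1, 0.3],
              [0.9, 0.2]])

assert A.shape == (3, 3) and X.shape == (3, 2)
```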

An important notion in graph theory is graph isomorphism. Two graphs \(G_1 = (V_1, E_1)\) and \(G_2 = (V_2, E_2)\) are isomorphic if there exists a bijection \(g: V_1 \rightarrow V_2\) such that every edge (u, v) is in \(E_1\) if and only if the edge (g(u), g(v)) is in \(E_2\). Informally, this definition states that two graphs are isomorphic if there exists a vertex permutation such that when applied to one graph, we recover the vertex and edge sets of the other graph.

3.1 Graph Neural Networks

Consider a graph G with adjacency matrix \(\text {A}\) and feature matrix \(\text {X}\). GNNs use the graph structure (\(\text {A}\)) and the node features (\(\text {X}\)) to learn a node-level or a graph-level representation—or embedding—of G. GNNs iteratively update a node representation by aggregating its neighbors’ representations. At iteration l, a node representation captures its l-hop neighborhood’s structural information. Formally, the lth layer of a general GNN can be defined as follows:

$$\begin{aligned} \text {a}^{l + 1}_i&= \text {AGGREGATE}^l(\{\text {z}^l_j: j \in N(i)\}) \end{aligned}$$
(1)
$$\begin{aligned} \text {z}^{l + 1}_i&= \text {COMBINE}^l(\text {z}^l_i, \text {a}^{l + 1}_i) , \end{aligned}$$
(2)

where \(\text {z}_i^{l}\) is the feature vector of node \(v_i\) at layer l, with \(\text {z}_i^0 = \text {x}_i\). While \(\text {COMBINE}\) typically concatenates node representations from different layers, different—and often complex—architectures for \(\text {AGGREGATE}\) have been proposed. In [13], the presented GCN merges the \(\text {AGGREGATE}\) and \(\text {COMBINE}\) functions as follows:

$$\begin{aligned} \text {z}_i^{l + 1} = \text {ReLU}\left( \text {mean}(\{\text {z}_j^{l}: j \in N(i) \cup \{i\}\}) \cdot \text {W}^l\right) , \end{aligned}$$
(3)

where ReLU is a rectified linear unit and \(\text {W}^l\) is a trainable weight matrix. GNNs for graph classification have an additional module that aggregates the node-level representations to produce a graph-level one as follows:

$$\begin{aligned} \text {z}_G = \text {READOUT}(\{\text {z}_i^L: v_i \in V \}) , \end{aligned}$$
(4)

for a GNN with L layers. In [25], the authors discuss the impact that the choice of \(\text {AGGREGATE}^l\), \(\text {COMBINE}^l\), and \(\text {READOUT}\) has on the so-called expressiveness of the GNN, that is, its ability to map different graphs to different embeddings. They present theoretical conditions under which a GNN is maximally expressive.
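
As a rough illustration of the framework in (1), (2), and (4), the sketch below implements one mean-aggregation layer and a sum readout in NumPy. The function names mirror the notation above; the particular \(\text {COMBINE}\) used here (averaging a node's representation with its aggregated neighborhood) is only one possible illustrative choice, not a prescription from the literature.

```python
import numpy as np

def aggregate(Z, neighbors, i):
    """AGGREGATE, eq. (1): mean of the neighbors' representations of node i."""
    return Z[neighbors[i]].mean(axis=0)

def combine(z_i, a_i):
    """COMBINE, eq. (2): here, a simple average of the node's own representation
    and the aggregated neighborhood message (one possible choice among many)."""
    return 0.5 * (z_i + a_i)

def readout(Z):
    """READOUT, eq. (4): sum the final node representations into a graph embedding."""
    return Z.sum(axis=0)

# Toy example: path graph v1 - v2 - v3 with 2-dimensional node features.
X = np.array([[0.5, 1.0], [0.1, 0.3], [0.9, 0.2]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}

Z = np.stack([combine(X[i], aggregate(X, neighbors, i)) for i in range(len(X))])
z_G = readout(Z)   # graph-level representation
```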

We now present a simple yet practical GNN architecture on which we illustrate our experimental framework.

4 Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN)

Our Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN) consists of the following sequential modules: (1) a graph convolution module that encodes local graph structure and node features in a substructure feature matrix whose rows represent the nodes of the graph, (2) a sum-pooling layer as a \(\text {READOUT}\) function to produce a single-vector representation of the input graph, and (3) a prediction module consisting of dense layers that reads the vector representation of the graph and outputs predictions.

Let G be a graph represented by the adjacency matrix \(\text {A}\in \mathbb {R}^{n\times n}\) and the feature matrix \(\text {X}\in \mathbb {R}^{n\times d}\), where \(n\) and \(d\) represent the number of nodes and the dimension of the feature space respectively. Without loss of generality, we consider graphs without self-loops.

4.1 Graph Convolution Module

Given a graph G with its adjacency and feature matrices, \(\text {A}\) and \(\text {X}\), we define the first convolution layer as follows:

$$\begin{aligned} \text {Z}= f(\hat{\text {D}}^{-1} \hat{\text {A}} \text {X}\text {W}) , \end{aligned}$$
(5)

where \(\hat{\text {A}} = \text {A}+ \text {I}_{n}\) is the adjacency matrix of G with added self-loops, \(\hat{\text {D}}\) is the diagonal node degree matrix of \(\hat{\text {A}}\),Footnote 2 \(\text {W}\in \mathbb {R}^{d\times d'}\) is a trainable weight matrix, f is a nonlinear activation function, and \(\text {Z}\in \mathbb {R}^{n\times d'}\) is the convolved graph. To stack multiple convolution layers, we generalize the propagation rule in (5) as follows:

$$\begin{aligned} \text {Z}^{l + 1}= f^l(\hat{\text {D}}^{-1} \hat{\text {A}} \text {Z}^l\text {W}^l) , \end{aligned}$$
(6)

where \(\text {Z}^0 = \text {X}\), \(\text {Z}^l\) is the output of the lth convolution layer, \(\text {W}^l\) is a trainable weight matrix, and \(f^l\) is the nonlinear activation function applied at layer l. Similarly to the GCN presented in [13] from which we draw inspiration, our graph convolution module merges the \(\text {AGGREGATE}\) and \(\text {COMBINE}\) functions (see (1) and (2)), and we can rewrite (6) as:

$$\begin{aligned} \text {z}^{l + 1}_i = f^l\left( \text {mean}(\{\text {z}^l_j: j \in N(i) \cup \{i\}\}) \cdot \text {W}^l\right) , \end{aligned}$$
(7)

where \(\text {z}^{l + 1}_i\) is the ith row of \(\text {Z}^{l + 1}\).

The convolution module returns the output of the last convolution layer, that is, for a network with L convolution layers, the substructure feature matrix \(\text {Z}^L\). Note that (6) can process graphs with varying numbers of nodes.
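
A minimal NumPy sketch of the propagation rule (6) follows; the hyperbolic tangent nonlinearity and the random weight matrix are illustrative choices of ours, not part of the definition.

```python
import numpy as np

def spi_gcn_conv(A, Z, W, activation=np.tanh):
    """One graph convolution layer, eq. (6): Z^{l+1} = f(D_hat^{-1} A_hat Z^l W^l)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                          # adjacency with added self-loops
    D_hat_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse degree matrix of A_hat
    return activation(D_hat_inv @ A_hat @ Z @ W)

# Toy example: 3-node path graph, d = 2 input features, d' = 4 output features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[0.5, 1.0], [0.1, 0.3], [0.9, 0.2]])
W = 0.1 * np.random.randn(2, 4)

Z1 = spi_gcn_conv(A, X, W)   # substructure feature matrix after one layer
```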

4.2 Sum-Pooling Layer

The sum-pooling layer produces a graph-level representation \(\text {z}_G\) by summing the rows of \(\text {Z}^L\), previously returned by the convolution module. Formally:

$$\begin{aligned} \text {z}_G= \sum _{i = 1}^n\text {z}^L_i . \end{aligned}$$
(8)

The resulting vector \(\text {z}_G\in \mathbb {R}^{d_L}\) contains the final vector representation (or embedding) of the input graph G in a \(d_L\)-dimensional space. This vector representation is then used for prediction—graph classification in our case.

Using a sum-pooling operator is a simple idea that has been used in previous GNNs [1, 21]. Additionally, it makes our architecture invariant to node permutation, as stated in Theorem 1.

Theorem 1

Let G and \(G_\varsigma \) be two arbitrary isomorphic graphs. The sum-pooling layer of SPI-GCN produces the same vector representation for G and \(G_\varsigma \).

This invariance property is crucial for GNNs as it ensures that two isomorphic—and hence equivalent—graphs result in the same output. The proof of Theorem 1 is straightforward (permuting the nodes of G only permutes the rows of \(\text {Z}^L\), which leaves the sum in (8) unchanged) and is omitted due to space limitations.
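
The following NumPy sketch (our own sanity check, not the paper's proof) verifies this numerically on a toy graph: applying a random node permutation P to both \(\text {A}\) and \(\text {X}\) leaves the sum-pooled embedding unchanged.

```python
import numpy as np

def spi_gcn_embed(A, X, W, activation=np.tanh):
    """One convolution layer (6) followed by sum-pooling (8)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)
    D_hat_inv = np.diag(1.0 / A_hat.sum(axis=1))
    Z = activation(D_hat_inv @ A_hat @ X @ W)
    return Z.sum(axis=0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 4))

# Random node permutation: A' = P A P^T, X' = P X.
P = np.eye(3)[rng.permutation(3)]
z_G = spi_gcn_embed(A, X, W)
z_G_perm = spi_gcn_embed(P @ A @ P.T, P @ X, W)

assert np.allclose(z_G, z_G_perm)   # identical embeddings for isomorphic graphs
```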

4.3 Prediction Module

The prediction module of SPI-GCN is a simple MLP that takes as input the graph-level representation \(\text {z}_G\) returned by the sum-pooling layer and returns either: (i) a probability p in case of binary classification or (ii) a vector \(\text {p}\) of probabilities such that \(\sum _{i} p_i = 1\) in case of multi-class classification.

Note that SPI-GCN can be trained in an end-to-end fashion through backpropagation. Additionally, since each forward pass processes a single graph, the training complexity of SPI-GCN is linear in the number of graphs.

In the next section, we describe a practical methodology for studying the expressiveness of SPI-GCN and its connection to the generalization performance of the algorithm.

5 Investigating Expressiveness of SPI-GCN

We start here by introducing a practical definition of expressiveness. We then show how the defined measure can be used to train SPI-GCN and help understand the impact expressiveness has on its generalization performance.

5.1 Practical Measure of Expressiveness

The expressiveness of a GNN, as defined in [25], is its ability to map different graph structures to different embeddings and, therefore, reflects the injectivity of its graph embedding function. Since studying injectivity can be tedious, we characterize expressiveness—and hence injectivity—as a function of the pairwise distance between graph embeddings.

Let \(\{\text {z}_{G_i}\}_{i = 1}^m\) be the set of graph embeddings computed by a GNN \(\mathcal {A}\) for a given input graph data set \(\{G_i\}_{i = 1}^m\). We define \(\mathcal {A}\)’s expressiveness, \(\mathcal {E}(\mathcal {A})\), as follows:

$$\begin{aligned} \mathcal {E}(\mathcal {A}) = \text {mean}(\{||\text {z}_{G_i} - \text {z}_{G_j}||_2: i, j = 1, \dots , m, \ i \ne j\}) , \end{aligned}$$
(9)

that is, \(\mathcal {E}(\mathcal {A})\) is the average pairwise Euclidean distance between graph embeddings produced by \(\mathcal {A}\). While not strictly equivalent to injectivity, \(\mathcal {E}\) is a reasonable indicator thereof, as the average pairwise distance reflects the diversity among the computed graph representations, which, in turn, is expected to be higher for more diverse input graph data sets. For permutation-invariant GNNs like SPI-GCN,Footnote 3 \(\mathcal {E}\) is zero when all graphs \(\{G_i\}_{i = 1}^m\) are isomorphic.
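
A direct implementation of (9) might look as follows, assuming the graph embeddings are stacked row-wise in an \(m \times d_L\) array (illustrative sketch only).

```python
import numpy as np

def expressiveness(Z_G):
    """Eq. (9): average pairwise Euclidean distance between graph embeddings.
    Z_G is an (m, d_L) array whose i-th row is the embedding of graph G_i."""
    m = Z_G.shape[0]
    dists = [np.linalg.norm(Z_G[i] - Z_G[j])
             for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(dists))

# Toy check: identical embeddings (e.g. isomorphic graphs) give zero expressiveness.
assert expressiveness(np.ones((4, 8))) == 0.0
```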

5.2 Penalized Cross Entropy Loss

We train SPI-GCN using a penalized cross entropy loss, \(\mathcal {L}_p\), that consists of a classical cross entropy augmented with a penalty term defined as a function of the expressiveness of SPI-GCN. Formally:

$$\begin{aligned} \mathcal {L}_p&= \text {cross-entropy}(\{y_i\}_{i = 1}^m, \{\hat{y}_i\}_{i = 1}^m) - \alpha \cdot \mathcal {E}(\text {SPI-GCN}) , \end{aligned}$$
(10)

where \(\{y_i\}_{i = 1}^m\) (resp. \(\{\hat{y}_i\}_{i = 1}^m\)) is the set of real (resp. predicted) graph labels, \(\alpha \) is a non-negative penalty factor, and \(\mathcal {E}\) is defined in (9) with \(\{\text {z}_{G_i}\}_{i = 1}^m\) being the graph embeddings computed by SPI-GCN.

By adding the penalty term \(- \alpha \cdot \mathcal {E}(\text {SPI-GCN})\) to \(\mathcal {L}_p\), the expressiveness is maximized while the cross entropy is minimized during training. The penalty factor \(\alpha \) controls the importance attributed to \(\mathcal {E}(\text {SPI-GCN})\) when \(\mathcal {L}_p\) is minimized. Consequently, higher values of \(\alpha \) allow training more expressive variants of SPI-GCN, whereas for \(\alpha = 0\), only the cross entropy is minimized.
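
A PyTorch sketch of the penalized loss (10) is given below. It assumes the model outputs class logits and that the graph embeddings \(\text {z}_{G_i}\) are available as a tensor; the default value of alpha is only an example.

```python
import torch
import torch.nn.functional as F

def penalized_cross_entropy(logits, labels, graph_embeddings, alpha=0.1):
    """Eq. (10): cross entropy minus alpha times the expressiveness (9).
    logits: (m, C) class scores, labels: (m,) true class indices,
    graph_embeddings: (m, d_L) embeddings z_{G_i} computed by the GNN."""
    ce = F.cross_entropy(logits, labels)
    expressiveness = torch.pdist(graph_embeddings, p=2).mean()  # eq. (9)
    return ce - alpha * expressiveness
```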

In the next section, we assess the performance of SPI-GCN for different values of \(\alpha \). We also compare SPI-GCN with other more complex GNN architectures, including the state-of-the-art method.

6 Experiments

We carry out a first set of experiments where we compare our approach, SPI-GCN, with two recent GCNs. In a second set of experiments, we train different instances of SPI-GCN with increasing values of the penalty factor \(\alpha \) (see (10)) in an attempt to understand how the expressiveness of SPI-GCN affects its test accuracy, and whether it is the determining factor of its generalization performance, as implicitly suggested in [25]. Our code and data are available at https://github.com/asmaatamna/SPI-GCN.

6.1 Data Sets

We use nine public benchmark data sets including five bioinformatics data sets (MUTAG [6], PTC [22], ENZYMES [3], NCI1 [23], PROTEINS [8]), two social network data sets (IMDB-BINARY, IMDB-MULTI [26]), one image data set where images are represented as region adjacency graphs (COIL-RAG [18]), and one synthetic data set (SYNTHIE [16]). We also evaluate SPI-GCN on an original real-world data set collected at the ICMPE,Footnote 4 HYDRIDES, that contains metal hydrides in graph format, labelled as stable or unstable according to specific energetic properties that determine their ability to store hydrogen efficiently.

6.2 Architecture of SPI-GCN

The instance of SPI-GCN that we use in our experiments has two graph convolution layers with 128 and 32 hidden units, followed by a hyperbolic tangent activation and a per-node softmax function respectively. The sum-pooling layer is a classical sum applied row-wise; it is followed by a prediction module consisting of an MLP with one hidden layer of 256 units, a batch normalization layer, and a ReLU. We choose this architecture by trial and error and keep it unchanged throughout the experiments.
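
In PyTorch, this instance might look as follows. This is a hedged sketch based on the description above, not the authors' released code: layer sizes follow the text, while details such as processing a list of graphs per call and the output head are our own reading.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Graph convolution layer implementing the propagation rule (6)."""
    def __init__(self, in_dim, out_dim, activation):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # trainable weight matrix W^l
        self.activation = activation

    def forward(self, A, Z):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        D_hat_inv = torch.diag(1.0 / A_hat.sum(dim=1))       # inverse degree matrix
        return self.activation(self.W(D_hat_inv @ A_hat @ Z))

class SPIGCN(nn.Module):
    """Sketch of the SPI-GCN instance described above (layer sizes from the text)."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_dim, 128, activation=torch.tanh)
        self.conv2 = GraphConv(128, 32, activation=lambda z: torch.softmax(z, dim=-1))
        self.mlp = nn.Sequential(
            nn.Linear(32, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, graphs):
        """graphs: list of (A, X) pairs; returns class logits and graph embeddings."""
        z_G = torch.stack([self.conv2(A, self.conv1(A, X)).sum(dim=0)  # sum-pooling (8)
                           for A, X in graphs])
        return self.mlp(z_G), z_G
```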

6.3 Comparison with Other Methods

In these experiments, we consider the simplest variant of SPI-GCN where the penalty term in (10) is discarded by setting \(\alpha = 0\). That is, the algorithm is trained using only the cross entropy loss.

Baselines. We compare SPI-GCN with the well-known GCN, Patchy-san (PSCN) [17], the Deep Graph Convolutional Neural Network (Dgcnn) [27] that uses a similar convolution module to ours, and the recent state-of-the-art Graph Isomorphism Network (GIN) [25].

Experimental Procedure. We train SPI-GCN using the Adam optimizer [12] in full-batch mode, with cross entropy as the loss function to minimize (\(\alpha = 0\) in (10)). After preliminary experimentation, we set the training hyperparameters as follows: the algorithm is trained for 200 epochs on all data sets with a learning rate of \(10^{-3}\). To estimate the accuracy, we perform 10-fold cross validation, using 9 folds for training and one fold for testing each time. We report the average test accuracy and the corresponding standard deviation in Table 1. Note that we only use node attributes in our experiments. In particular, SPI-GCN does not exploit node or edge labels of the data sets. When node attributes are not available, we use the identity matrix as the feature matrix for each graph.
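
For concreteness, the evaluation loop might be sketched as below, under the stated settings (10-fold cross validation, full-batch Adam, 200 epochs, learning rate \(10^{-3}\)). The model interface (a callable mapping a list of (A, X) pairs to class logits and embeddings) and the helper names are our own assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F
from sklearn.model_selection import KFold

def cross_validate(make_model, graphs, labels, epochs=200, lr=1e-3):
    """10-fold cross validation with full-batch Adam.
    make_model: callable returning a fresh, untrained model for each fold.
    graphs: list of (A, X) pairs; labels: LongTensor of class indices."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(graphs):
        model = make_model()                                    # fresh model per fold
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        train_graphs = [graphs[i] for i in train_idx]
        y_train = labels[torch.as_tensor(train_idx)]
        for _ in range(epochs):                                 # full-batch training
            optimizer.zero_grad()
            logits, _ = model(train_graphs)
            F.cross_entropy(logits, y_train).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            logits, _ = model([graphs[i] for i in test_idx])
            y_test = labels[torch.as_tensor(test_idx)]
            accuracies.append((logits.argmax(dim=1) == y_test).float().mean().item())
    return sum(accuracies) / len(accuracies)
```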

We follow the same procedure for Dgcnn. We use the authors’ implementationFootnote 5 and perform 10-fold cross validation with the recommended values for training epochs, learning rate, and SortPooling parameter k, for each data set.

For PSCN, we report the results from the original paper [17] (for receptive field size \(k = 10\)), as we could not find a public implementation of the algorithm by its authors. The experiments were conducted using a procedure similar to ours.

For GIN, we also report the published results [25] (GIN-0 in the paper), as it was not straightforward to use the authors’ implementation.

Results. Table 1 shows the results for our algorithm (SPI-GCN), Dgcnn [27], PSCN [17], and the state-of-the-art GIN [25]. We observe that SPI-GCN is highly competitive with other algorithms despite using the same architecture for all data sets. The only noticeable exceptions are on the NCI1 and IMDB-BINARY data sets, where the best approach (GIN) is up to 1.28 times better. On the other hand, SPI-GCN appears to be particularly competitive on classification tasks with more than 3 classes (ENZYMES, COIL-RAG, SYNTHIE). The difference in accuracy is particularly significant on COIL-RAG (100 classes), where SPI-GCN is around 34 times better than Dgcnn, suggesting that the features extracted by SPI-GCN are better suited to characterizing the graphs at hand. SPI-GCN also achieves a very reasonable accuracy on the HYDRIDES data set and is 1.06 times better than Dgcnn on ENZYMES.

The results in Table 1 show that despite its simplicity, SPI-GCN is competitive with other practical graph algorithms and, hence, it is a reasonable architecture to consider for our next set of experiments involving expressiveness.

Table 1. Accuracy results for SPI-GCN and three other deep learning methods (Dgcnn, PSCN, GIN).
Table 2. Expressiveness experiments results. SPI-GCN is trained on the penalized cross entropy loss, \(\mathcal {L}_p\), with increasing values of the penalty factor \(\alpha \). For each data set, and for each value of \(\alpha \), we report the test accuracy (a), the expressiveness \(\mathcal {E}(\text {SPI-GCN})\) (b), and the IGED (c). Highlighted are the maximal values for each quantity.

6.4 Expressiveness Experiments

Through these experiments, we try to answer the following questions:

  • Do more expressive GNNs perform better on graph classification tasks? That is, is the injectivity of a GNN’s graph function the determining factor of its performance?

  • Can the performance be explained by another factor? If yes, what is it?

To this end, we train increasingly injective instances of SPI-GCN on the penalized cross entropy loss \(\mathcal {L}_p\) (10) by setting the penalty factor \(\alpha \) to increasingly large values. Then, for each trained instance, we investigate (i) its test accuracy, (ii) its expressiveness \(\mathcal {E}(\text {SPI-GCN})\) (9), and (iii) the average normalized Inter-class Graph Embedding Distance (IGED), defined as the average pairwise Euclidean distance between mean graph embeddings taken class-wise divided by \(\mathcal {E}(\text {SPI-GCN})\). Formally:

$$\begin{aligned} \text {IGED} = \frac{\text {mean}(\{||\text {z}^*_{c} - \text {z}^*_{c'}||_2: c, c' = 1, \dots , C, \ c \ne c'\})}{\mathcal {E}(\text {SPI-GCN})} , \end{aligned}$$
(11)

where \(\text {z}^*_c\) is the mean graph embedding of class c. The IGED can be interpreted as an estimate of how well the graph embeddings computed by SPI-GCN are separated with respect to their class.
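
A direct implementation of (11) follows, again assuming the embeddings are stacked in an \(m \times d_L\) array with one integer class label per graph (illustrative sketch only).

```python
import numpy as np

def iged(Z_G, y):
    """Eq. (11): average pairwise distance between class-wise mean embeddings,
    normalized by the expressiveness (9). Z_G: (m, d_L) embeddings, y: (m,) labels."""
    classes = np.unique(y)
    class_means = np.stack([Z_G[y == c].mean(axis=0) for c in classes])
    inter = [np.linalg.norm(class_means[i] - class_means[j])
             for i in range(len(classes)) for j in range(i + 1, len(classes))]
    expr = [np.linalg.norm(Z_G[i] - Z_G[j])                       # eq. (9)
            for i in range(len(Z_G)) for j in range(i + 1, len(Z_G))]
    return float(np.mean(inter)) / float(np.mean(expr))
```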

Experimental Procedure. We train SPI-GCN on the penalized cross entropy loss \(\mathcal {L}_p\) (10) where we sequentially choose \(\alpha \) from \(\{0, 10^{-3}, 10^{-1}, 1, 10\}\). We do so using full batch Adam optimizer that we run for 200 epochs with a learning rate of \(10^{-3}\), on all the graph data sets introduced previously. For each data set and for each value of \(\alpha \), we perform 10-fold cross validation using 9 folds for training and one fold for testing. We report in Table 2 the average and standard deviation of: (a) the test accuracy, (b) the expressiveness \(\mathcal {E}(\text {SPI-GCN})\), and (c) the IGED (11), for each value of \(\alpha \) and for each data set.

Results. We observe from Table 2 that using a penalty term in \(\mathcal {L}_p\) to maximize the expressiveness—or injectivity—of SPI-GCN helps to improve the test accuracy on some data sets, notably on MUTAG, PTC, and SYNTHIE. However, larger values of \(\mathcal {E}(\text {SPI-GCN})\) do not correspond to a higher test accuracy except for two cases (PTC, SYNTHIE). Overall, \(\mathcal {E}(\text {SPI-GCN})\) increases when \(\alpha \) increases, as expected, since the expressiveness is maximized during training when \(\alpha > 0\). The IGED, on the other hand, is correlated with the best performance in four out of ten cases (ENZYMES, IMDB-BINARY, and IMDB-MULTI), where the test accuracy is maximal when the IGED is maximal. On HYDRIDES, the difference in IGED for \(\alpha = 10^{-1}\) (highest accuracy) and \(\alpha = 1\) (highest IGED value) is negligible.

Our empirical results indicate that while optimizing the expressiveness of SPI-GCN may result in a higher test accuracy in some cases, more expressive GNNs do not systematically perform better in practice. The IGED, however, which reflects a GNN’s ability to compute graph representations that are correctly clustered according to their actual class, better explains the generalization performance of the GNN.

7 Conclusion

In this paper, we challenged the common belief that more expressive GNNs achieve a better performance. We introduced a principled experimental procedure to analyze the link between the expressiveness of a GNN and its test accuracy in a graph classification setting. To the best of our knowledge, our work is the first that explicitly studies the generalization performance of GNNs by trying to uncover the factors that control it, and paves the way for more theoretical analyses. Interesting directions for future work include the design of better expressiveness estimators, as well as different (possibly more complex) penalized loss functions.