On the approximation capability of GNNs in node classification/regression tasks

Graph Neural Networks (GNNs) are a broad class of connectionist models for graph processing. Recent studies have shown that GNNs can approximate any function on graphs, modulo the equivalence relation on graphs defined by the Weisfeiler--Lehman (WL) test. However, these results suffer from some limitations, both because they were derived using the Stone--Weierstrass theorem, which is existential in nature, and because they assume that the target function to be approximated is continuous. Furthermore, all current results are dedicated to graph classification/regression tasks, where the GNN must produce a single output for the whole graph, while node classification/regression problems, in which an output is returned for each node, are also very common. In this paper, we propose an alternative way to demonstrate the approximation capability of GNNs that overcomes these limitations. Indeed, we show that GNNs are universal approximators in probability for node classification/regression tasks, as they can approximate any measurable function that satisfies the 1--WL equivalence on nodes. The proposed theoretical framework allows the approximation of generic discontinuous target functions and also suggests the GNN architecture that can reach a desired approximation. In addition, we provide a bound on the number of GNN layers required to achieve the desired degree of approximation, namely $2r-1$, where $r$ is the maximum number of nodes for the graphs in the domain.


Introduction
Graph processing is becoming pervasive in many application domains, such as social networks, Web applications, biology and finance. Intuitively, graphs make it possible to represent patterns along with their relationships. Indeed, graphs can naturally encode rich relational information that is hard to represent with vectors or sequences, the most common data structures used in Machine Learning (ML). Graph Neural Networks (GNNs) are a class of machine learning models that can process information represented in the form of graphs. In recent years, the interest in GNNs has grown rapidly and numerous new models and applications have emerged [36]. The first GNN model was introduced in [32]. Later, several other approaches have been proposed, including Spectral Networks [8], Gated Graph Sequence Neural Networks [19], Graph Convolutional Neural Networks [16], GraphSAGE [12], Graph Attention Networks [35], and Graph Networks [5]. However, despite the differences among the various GNN models, most adopt the same computational scheme, based on a local aggregation mechanism. The information related to a node is stored into a feature vector, which is updated recursively by aggregating the feature vectors of neighboring nodes. After k iterations, the feature vector of a given node v captures both the structural information and the attributes of the nodes in v's k-hop neighborhood. At the end of the learning process, the node feature vectors can be used to classify or to cluster the objects/concepts represented by a node, by a subset of nodes, or by the whole graph.
Recently, a great effort has been devoted to the study of the expressive power of GNNs [28]. Intuitively, the local computational framework is primarily responsible for both the capabilities and the limitations of GNNs: GNNs can take into account the connectivity and the features of the neighboring nodes, but they may not be able to distinguish between nodes having similar neighborhoods. Therefore, a fundamental question is to define which graphs (nodes) can be distinguished by a GNN, i.e., for which input graphs (nodes) the GNN produces different encodings. In [37], GNNs are proved to be as powerful as the Weisfeiler-Lehman graph isomorphism test (1-WL) [18]. Such an algorithm tests whether two graphs are isomorphic or not. The 1-WL algorithm is based on a graph signature which is obtained by assigning a color to each node, where the graph coloring is achieved by iterating a local aggregation function. More generally, there exists a hierarchy of algorithms, called 1-WL, 2-WL, 3-WL, etc., which recognize larger and larger classes of graphs. It has been shown that a GNN can simulate the 1-WL test, provided that a sufficiently general aggregation function is used, but the basic GNN model cannot implement higher-order tests [25]. Consequently, the 1-WL test characterizes both the expressiveness and the limitations of GNNs, defining the classes of graphs/nodes that GNNs can distinguish.
Another important aspect is the study of the approximation capability of GNNs. Formally, in node classification/regression tasks, a GNN implements a function φ that takes as input a graph G and one of its nodes v and returns an output φ(G, v) ∈ R^m for each node. Similarly, in graph classification/regression tasks, a GNN implements a function φ with output φ(G) ∈ R^m. In both cases, the objective is to define which classes of functions can be approximated by a GNN.
In [31], the approximation capability of the original GNN model (OGNN), namely the first GNN model to be proposed, has been studied using the concepts of unfolding trees and unfolding equivalence. The unfolding tree T_v, with root node v, is constructed by unrolling the graph starting from v (see Fig. 1). Intuitively, T_v exactly describes the information used by the GNN at node v and can be employed to study the expressive power of GNNs in node classification/regression tasks. The unfolding equivalence is, in turn, an equivalence relation that holds between nodes having the same unfolding tree. In [31], it was proved that OGNNs can approximate in probability, up to any degree of precision, any measurable function τ(G, v) → R^m that respects the unfolding equivalence, namely that produces the same outputs on equivalent nodes. Currently, unfolding trees -- also termed computation graphs [10] -- are widely used to study GNN expressiveness. Universal approximation results have been proved for Linear Graph Neural Networks [4,23], Folklore Graph Neural Networks [22] and, more generally, for a large class of GNNs [37,4] that includes most of the recent architectures, also considered in this paper. Despite advances in research on approximation theory for GNNs, there are still open problems to be investigated. First of all, the most general results available on modern GNNs are based on the Stone-Weierstrass theorem and state that the functions which can be approximated by GNNs are dense in the invariant continuous function space, modulo the 1-WL test [4]. However, the Stone-Weierstrass theorem is existential in nature, so that, given a target function to be approximated, it does not allow one to construct a GNN architecture that can reach the desired approximation -- defining, for example, the number of its layers and the feature dimension required to build the approximator. Moreover, the current results apply only to continuous functions on node/edge labels, which are defined on a compact subset of R^L, a fact that may not hold in practical application domains, since, for instance, the function to be approximated may show step-wise behavior with respect to some inputs. Finally, all the results on the expressive capacity of modern GNN models are dedicated to graph classification/regression tasks, but node classification/regression problems are also widely present in practical applications and it is important to generalize the theoretical results on expressivity to them as well. In addition, it is useful to study the relationship between unfolding trees and the 1-WL test in this context. Indeed, it can be observed that the Weisfeiler-Lehman test assigns a color to all the nodes of a graph to make them distinguishable, and it can be naturally expected that the equivalence classes defined by the colors are related to those defined by the unfolding trees. In fact, it has been proved that the two mechanisms, colors and unfolding trees, produce the same profiles for graphs [9], namely the same number of nodes per equivalence class, but whether they produce exactly the same partitions with respect to single nodes, i.e., whether nodes get assigned the same equivalence class by both mechanisms, is still an open problem. A formal and precise answer to this question will allow us to use the two frameworks in a targeted or interchangeable way in the context of node classification/regression tasks.

In this work, we present an alternative approach to study the approximation capability of recent GNNs that makes it possible to answer the above questions.
The main contributions of this paper are listed below.
• We prove that, on connected graphs, modern GNNs, realizing node-focused functions, are capable of approximating, in probability and up to any precision, any measurable function that respects the 1-WL equivalence.
Intuitively, this means that GNNs are a kind of universal approximator for functions on the nodes of a graph, modulo the limits enforced by the 1-WL test. Such a result describes the GNN capability for node classification/regression tasks.
• The presented proof is the most general on the GNN approximation capability that we are aware of, since it holds for generic graphs with real feature vectors and for a broad class of GNNs, which includes most of the current models. Moreover, it is only assumed that the target function is measurable, which permits the approximation of discontinuous and more complex functions w.r.t. existing results, e.g., [13]. Finally, the proof is based on a technique that allows us to deduce information on the architecture of the GNN that can reach the desired approximation. Such information cannot be derived with the Stone-Weierstrass theorem and includes, for instance, hints on the number of iterations, the number of layers, the dimension of the hidden features, and the type of network to be used to implement the aggregation function.
• It is shown that, in order to reach any desired approximation accuracy, a single real hidden feature is sufficient, the aggregation network must contain at least one hidden layer, and the GNN must adopt at least 2r − 1 iterations, namely the GNN must include 2r − 1 layers, where r is the maximum number of nodes of any graph in the domain. The latter bound on GNN iterations/layers may be surprising, because we might expect r iterations to be sufficient to diffuse the information over the whole graph. We will clarify that such a bound is due to the nature of node classification/regression tasks. Actually, r iterations are sufficient for graph classification/regression tasks, but they are not enough for node-focused tasks, which are more expensive from a computational point of view.
The rest of the paper is organized as follows. In Section 2, some related work is described. Notation and basic concepts are introduced in Section 3, while Section 4 presents the main contribution of this paper. Finally, Section 5 collects some conclusions and presents future perspectives. To make the reading more fluid, the proofs are collected in the Appendix.

Related Work
Great attention has recently been paid to the Weisfeiler-Lehman test and its correlation with the expressiveness of GNNs. Xu et al. [37] have shown that message passing GNNs are at most as powerful as the 1-WL test; this upper bound can be overcome by injecting the node identity into the message passing procedure, as implemented in [38]. Morris et al. [25] have gone beyond the 1-WL test, implementing k-order WL tests as message passing mechanisms in GNNs. In [28], the WL test mechanism applied to GNNs is studied within the paradigm of unfolding trees (also called computational graphs), without really establishing an equivalence between the two concepts, as in [39] (where the unfolding trees are called rooted subgraphs). In [2], it is shown that the message passing mechanism tends to oversquash the information coming from distant neighbours; moreover, it is claimed that GNNs with at least K layers, where K is the diameter of the graphs in the dataset, do not suffer from under-reaching, i.e., from the fact that information cannot travel farther than K edges along the graph. Nevertheless, a theoretical proof that GNNs succeed in overcoming the under-reaching behavior is not provided.
Universal approximation properties have been demonstrated for several GNN settings. The OGNN model [32] was proved to be a universal approximator on graphs preserving the unfolding equivalence in [31]. Universal approximation is shown for GNNs with random node initialization in [1] while, in [37], GNNs are proved to be able to encode any graph with countable input features. The universal approximation property has been extended to Folklore Graph Neural Networks in [22], and to Linear Graph Neural Networks and general GNNs in [4,23], both in the invariant and the equivariant case, but without any reference to the required number of layers. A relation between the graph diameter and the computational power of GNNs has been established in [21], where GNNs are related to the so-called LOCAL models [3,20,26] and it is proved that a GNN with a number of layers larger than the diameter of the graph can compute any Turing computable function of the graph. Nevertheless, no characterization of the aggregation function is given. The generalization capability of GNNs has also been studied using different approaches, which include the Vapnik-Chervonenkis dimension for OGNNs [33], and uniform stability [40] and Rademacher complexity [10] for modern GNNs. Designing GNN architectures that provide good generalization along with good expressive power is a hot research topic (see, e.g., [27]). Moreover, an extensive survey on the theory of Graph Neural Networks can be found in [13].
The results presented in this work differ from what can be found in the literature mainly because we prove the GNN ability to approximate measurable functions with a proof that is constructive, i.e., capable of suggesting the network architecture that will guarantee a given approximation.

Preliminaries
In this section, we introduce the required notation and the basic definitions used throughout the manuscript.

Graphs
A graph G is a pair (V, E), where V is the set of vertices or nodes and E is the set of edges between nodes in V. Graphs are directed or undirected, according to whether the edge (v, u) is different from the edge (u, v) or not.Moreover, a graph is connected if there is a path from any node to any other node in the graph.In the following, we assume that graphs are undirected and connected.
The set ne[v] is the neighborhood of v, i.e., the set of nodes connected to v by an edge, while ne_i(v) denotes the i-hop neighborhood of v, that is, the set of all nodes connected to v by a path of length i. Finally, |G| denotes the cardinality of the set of vertices of G. From now on, we will always consider graphs of finite cardinality, i.e., |G| = L < ∞.
Nodes may have attached features, collected into vectors called labels, denoted by ℓ_v ∈ R^L.

Graph neural networks
Graph Neural Networks adopt a local computational mechanism to process graphs. The information related to a node v is stored into a feature vector h_v ∈ R^m, which is updated recursively by combining the feature vectors of neighboring nodes. After k iterations, the feature vector h_v^k is supposed to contain a representation of both the structural information and the node information within a k-hop neighborhood. After processing is complete, the node feature vectors can be used to classify the nodes or the entire graph.
More rigorously, in this paper, we consider GNNs that use the following general updating scheme:

a_v^(k) = AGGREGATE^(k)( {{ h_u^(k-1) : u ∈ ne[v] }} ),    (1)
h_v^(k) = COMBINE^(k)( h_v^(k-1), a_v^(k) ),    (2)

where the node feature vectors are initialized with the node labels, i.e., h_v^(0) = ℓ_v. Here, differently from other approaches, we assume that labels can contain real numbers. Moreover, AGGREGATE^(k) is a function which aggregates the node features obtained at the (k − 1)-th iteration, and COMBINE^(k) is a function that combines the aggregation of the neighborhood of a node with its own feature at the (k − 1)-th iteration. In graph classification/regression tasks, the GNN is provided with a final READOUT layer that produces the output by combining all the feature vectors at the last iteration K:

φ(G) = READOUT( {{ h_v^(K) : v ∈ V }} ),

whereas, in node classification/regression tasks, the READOUT layer produces an output for each node, based on its features:

φ(G, v) = READOUT( h_v^(K) ).

In this paper, we will focus mainly on node classification/regression tasks. The learning domain of the GNN will be denoted by the graph-node pairs D = G × V, where G is a set of graphs and V is a subset of their nodes. Therefore, the function φ implemented by the GNN takes as input a graph G and one of its nodes v, and returns an output φ(G, v) ∈ R^o, where o is the output dimension.
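To make the updating scheme concrete, the following sketch instantiates the generic AGGREGATE/COMBINE framework with hypothetical concrete choices -- element-wise sum as aggregation and concatenation as combination. These specific functions are illustrative only and are not the ones analysed in the paper's proofs.

```python
def aggregate(neigh_feats):
    # AGGREGATE^(k): here, element-wise sum over the multiset of
    # neighbor feature vectors (an illustrative choice).
    return [sum(col) for col in zip(*neigh_feats)] if neigh_feats else []

def combine(h_prev, a):
    # COMBINE^(k): here, concatenation of h_v^(k-1) with the
    # aggregated neighborhood vector (again, an illustrative choice).
    return h_prev + a

def gnn_forward(adj, labels, K):
    """Run K message-passing iterations of the scheme above.
    adj: dict node -> list of neighbors; labels: dict node -> feature list."""
    h = {v: list(labels[v]) for v in adj}          # h_v^(0) = l_v
    for _ in range(K):
        # all nodes are updated in parallel from the previous iteration
        h = {v: combine(h[v], aggregate([h[u] for u in adj[v]]))
             for v in adj}
    return h
```

On a path graph with identical labels, the two endpoints obtain identical feature vectors after any number of iterations, while the middle node obtains a different one, matching the intuition that the scheme encodes k-hop neighborhood information.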
It is worth mentioning that the OGNN model is not formally covered, both because in OGNNs the input of AGGREGATE^(k) and COMBINE^(k) contains the node labels ℓ_v and possibly also the edge features, and because the node features are not initialized to ℓ_v. Other models, such as MPNN [11], NN4G [24] and GN [5], are not included either, for similar reasons. Of course, Eq. (2) could easily be extended to include OGNNs and the models mentioned above, but here we prefer not to complicate the proposed framework, in order to keep the notation and the proofs simple.

Unfolding trees and unfolding equivalence
Unfolding trees and unfolding equivalence are two concepts that were introduced in [31] with the aim of capturing the expressive power of the OGNN model. Intuitively, an unfolding tree T_v^d is the tree obtained by unfolding the graph up to depth d, using the node v as its root. Fig. 1 shows some examples of unfolding trees. In the following, a formal recursive definition is provided.
Definition 3.3.1 (Unfolding tree). The unfolding tree T_v^d of depth d at node v is recursively defined as

T_v^d = Tree(ℓ_v) if d = 0,
T_v^d = Tree(ℓ_v, T_{ne[v]}^{d-1}) if d > 0,

where Tree(ℓ_v) is a tree constituted of a single node with label ℓ_v and Tree(ℓ_v, T_{ne[v]}^{d-1}) is the tree with the root node labeled with ℓ_v and having the depth-(d − 1) unfolding trees of the neighbors of v as sub-trees. Moreover, the unfolding tree T_v of v is obtained by merging all unfolding trees T_v^d for any d. ■

Note that, since a GNN adopts a local computation framework, its knowledge about the graph is updated step by step, every time Eq. (1) is applied. Actually, at the first step, k = 0, the feature vector h_v^0 depends only on the local label. Then, at step k, the GNN updates the feature vector h_v^k using the neighbour data, so that the node feature vector depends on the k-distant neighbourhood of v. Thus, intuitively, the unfolding tree T_v^k describes the information that is theoretically available to the GNN at node v and step k. Such an observation has been used in [31] to study the expressive power of the OGNN model and it will be used in this paper for the same purpose.
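The recursive definition above can be sketched directly in code. In this illustrative snippet, an unfolding tree is represented as a nested tuple (label, sub-trees); sorting the sub-trees is only a convenience of the sketch, making the representation independent of the neighbor ordering, and is not part of the formal definition.

```python
def unfolding_tree(adj, labels, v, d):
    """T_v^d: the unfolding tree of depth d rooted at node v.
    adj: dict node -> list of neighbors; labels: dict node -> label."""
    if d == 0:
        return (labels[v], ())                     # Tree(l_v): a single node
    # Tree(l_v, T_ne[v]^{d-1}): root l_v with the neighbors' trees as sub-trees
    subtrees = sorted(unfolding_tree(adj, labels, u, d - 1) for u in adj[v])
    return (labels[v], tuple(subtrees))
```

For instance, on a path of three nodes with identical labels, the two endpoints have the same depth-1 unfolding tree, while the middle node, having two neighbors, has a different one.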
In this context, two questions have been studied.
(1) Can GNNs compute and store into the node features a coding of the unfolding trees, namely can GNNs store all the theoretically available information?
(2) Since unfolding trees are different from the input graphs, how does this affect the GNN expressive power?
Regarding the first question, it has been shown that indeed both OGNNs and modern GNNs can compute and store in the node features a coding of the unfolding trees, provided that appropriate network architectures are used in COMBINE^(k) and AGGREGATE^(k) [28,31,37]. Regarding question (2), we can easily argue that if two nodes have the same unfolding tree, then GNNs produce the same encoding on those nodes. Such a fact highlights an evident limitation of the expressive power of GNNs. The unfolding equivalence is a formal tool designed to capture such a limit: it is an equivalence relation that brings together nodes with the same unfolding tree, namely it groups nodes that cannot be distinguished by GNNs.
Definition 3.3.2. Two nodes u, v are said to be unfolding equivalent, u ∽_ue v, if T_u = T_v. Analogously, two graphs G_1, G_2 are said to be unfolding equivalent, G_1 ∽_ue G_2, if there exists a bijection between the nodes of the graphs that respects the partition induced by the unfolding equivalence on the nodes. ■ Since GNNs have to fulfill the unfolding equivalence, the functions on graphs that they can realize also share this limit. In our results on the approximation capability of GNNs, our focus is on functions that preserve the unfolding equivalence. Such functions are otherwise generic, with the only constraint that they produce the same output on equivalent nodes.

The color refinement algorithm and the Weisfeiler-Lehman test
The first-order Weisfeiler-Lehman test (1-WL test, in short) [18] is a method to test whether two graphs are isomorphic, based on a graph coloring algorithm called color refinement. The coloring algorithm is applied in parallel on the two graphs. Each node keeps a state (or color) that is refined at each iteration by aggregating information from the states of its neighbors. The refinement stabilizes after a finite number of iterations and outputs a representation of the graph. Two graphs with different representations, i.e., with a different number of nodes for some color, are not isomorphic. Conversely, if the numbers match, then the graphs are possibly isomorphic. Note that the test is not conclusive in the case of a positive answer, as the graphs may still be non-isomorphic. Actually, the algorithm just provides an approximate solution to the problem of graph isomorphism.
There exist different versions of the coloring algorithm: in this paper, we adopt a coloring scheme in which also the node labels are considered. Since GNNs process both the structure and the labels of the graphs, it is useful to consider both these sources of information in order to analyse the GNN expressive power. Such an approach has been used, for example, in [28]. More precisely, the coloring is carried out by an iterative algorithm which, at each iteration, computes a node coloring c_v^(t) ∈ Σ, where Σ is a set of values representing the colors. The node colors are initialized on the basis of the node features and then they are updated using the coloring from the previous iteration. The algorithm is sketched in the following.
1. At iteration 0, we set c_v^(0) = HASH_0(ℓ_v), where HASH_0 is a function that bijectively codes every possible feature with a color in Σ.
2. For any iteration t > 0, we set c_v^(t) = HASH( c_v^(t-1), {{ c_u^(t-1) : u ∈ ne[v] }} ), where HASH is a function that bijectively maps the input pairs to a unique value in Σ. Moreover, we assume that the same HASH function is used for all the iterations.
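The two steps above can be sketched as follows. Instead of an abstract injective HASH into the color set Σ, this illustrative implementation uses nested Python tuples as colors, which preserves injectivity by construction (structurally different inputs yield different tuples).

```python
def color_refinement(adj, labels, iterations):
    """Return the list of colorings c^(0), ..., c^(T), each a dict node -> color.
    adj: dict node -> list of neighbors; labels: dict node -> label."""
    # c_v^(0) = HASH_0(l_v): wrap the label so that initial colors
    # cannot collide with colors produced at later iterations.
    c = {v: ("init", labels[v]) for v in adj}
    history = [c]
    for _ in range(iterations):
        # c_v^(t) = HASH(c_v^(t-1), multiset of neighbor colors);
        # the sorted tuple plays the role of the neighbor multiset.
        c = {v: (c[v], tuple(sorted(c[u] for u in adj[v]))) for v in adj}
        history.append(c)
    return history
```

On a path of three nodes with identical labels, a single refinement step already separates the endpoints (degree 1) from the middle node (degree 2).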
In order to compare two graphs, the color refinement is applied in parallel on G′ and G′′, and, at each step, the color profiles generated on the two graphs are compared, namely, {{c_u^(t) | u ∈ V′}} and {{c_m^(t) | m ∈ V′′}}. If, at any iteration, the color profiles of the two graphs are different, then the 1-WL test fails and we can conclude that the graphs are not isomorphic; otherwise, the test succeeds. The 1-WL test makes it possible to distinguish most non-isomorphic graphs, but it may succeed on some rare pairs of non-isomorphic graphs.
In this paper, we use the color refinement also to compare nodes. Thus, given two nodes u, v, which in the most general case can belong to different graphs, we compare their colors at each iteration, i.e., we check whether c_u^t = c_v^t. If, at any iteration, the node colors are different, then the 1-WL node test fails, otherwise it succeeds. Notice that the color of a node n at iteration t depends on the sub-graph G_n^t defined by the t-hop neighbourhood of n. Thus, intuitively, the 1-WL node test makes it possible to check the isomorphism of the neighbourhoods of two nodes, G_u^t ∽ G_v^t. Based on the mentioned algorithms, we can easily produce a definition of WL-equivalence for graphs and nodes.

Definition 3.4.1 (WL-equivalence). Two graphs, G′ = (V′, E′) and G′′ = (V′′, E′′), are said to be WL-equivalent if they have the same multisets of colors at each iteration of the color refinement algorithm, i.e., {{c_u^(t) | u ∈ V′}} = {{c_m^(t) | m ∈ V′′}} for any t. Analogously, two nodes, u and v, are said to be WL-equivalent, u ∽_WL v, if they have the same colors at each step of the color refinement algorithm, i.e., c_u^(t) = c_v^(t) for any t. ■

It is interesting to observe that the color refinement procedure must be iterated until a difference in colors is detected between the compared items, either graphs or nodes, or until a maximum number of iterations is reached. It is well known that the color refinement of the common Weisfeiler-Lehman test, defined for graph comparison, can be halted when the node partition defined by the colors becomes stable: if the two graphs share the same colors when stability is reached, then the equality will last forever. More precisely, let π_t(G) be the partition of the nodes of G constructed by collecting in the same class the nodes that have the same color at iteration t. It is not difficult to prove that the partitions become finer at each iteration, π_{t−1}(G) ⪰ π_t(G), and that there exists an iteration T at which they become stable, π_{T−1}(G) ≡ π_T(G). Moreover, it can be proved that r − 1, where r is the number of nodes in G, is both an upper bound and a lower bound for the number T of iterations required to reach stability [15].
Note that the stability of the node partition does not imply that the colors stop changing. Actually, if the colors are not reused, as in our definition, new colors appear at each iteration, except in the case where the graph has no connections.
Intuitively, this happens because the use, at a node u, of a new color, which has not been considered in the past, causes the algorithm to create new colors for the neighbors of u as well: thus, new colors will be generated forever. This observation can be used to explain why the upper bound on the iterations of the color refinement procedure is different in the case of node and graph equivalence. We will see that we must wait for 2r − 1 iterations before halting the procedure in the former case, whereas, as mentioned above, r − 1 iterations are sufficient in the latter.

Main results
In this section, the main results of the paper are presented and discussed. For ease of reading, the proofs of the theorems are given in the Appendix.

Unfolding and Weisfeiler-Lehman equivalence
The first proposed result regards the relationship between the unfolding and the Weisfeiler-Lehman equivalence. The following two theorems clarify that the two equivalence relations produce the same partitions of nodes and graphs. Moreover, the correspondence also holds between the intermediate equivalences defined by, respectively, the colors at each iteration of the WL algorithm and the unfolding trees of corresponding depth. Formally, let us denote by ∽_{ue_t} the unfolding equivalences, at depth t, between nodes and graphs that are defined as in Definition 3.3.2 but considering unfolding trees of depth t in place of infinite trees. Similarly, let us denote by ∽_{WL_t} the WL-equivalences, at iteration t, that are defined as in Definition 3.4.1, where only the colors of the refinement procedure up to the t-th iteration are considered.

Theorem 4.1.1. Let G = (V, E) be a labeled graph. Then, for each u, v ∈ V, u ∽_ue v holds if and only if u ∽_WL v holds. Moreover, for each integer t ≥ 0, u ∽_{ue_t} v holds if and only if u ∽_{WL_t} v holds.

Both the unfolding equivalence and the WL-equivalence have been described using a recursive definition local to nodes. Figure 2 shows an example in which the unfolding trees and the colors of two nodes are iteratively computed: in the example, the colors of the nodes become different exactly when the unfolding trees become different.
Indeed, the existence of a relationship between the equivalences appears to be a natural consequence of their definition.
In fact, it is sometimes assumed in the literature (for instance, in [23]) that the two tools can be used interchangeably but, as far as we know, there is no formal demonstration of their effective equivalence. More precisely, in [17,3,9], it has been proved that the 1-WL test and unfolding trees produce the same profile on graphs without attributes. Therefore, Theorem 4.1.2 is just an extension of those results to the case of graphs with attributes. On the other hand, Theorem 4.1.1, focused on nodes, is completely novel.
Theorems 4.1.1 and 4.1.2 are interesting since they formally confirm that the two equivalences are exactly interchangeable and can be used together to study GNNs. While the Weisfeiler-Lehman test has often been adopted to analyse the expressive power of GNNs in terms of their capability of recognizing different graphs, the unfolding equivalence and, more precisely, unfolding trees can provide a tool to understand the information that a GNN can use at each node to implement its function.
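As an informal illustration of this interchangeability, the following snippet compares, on a toy path graph, the node partition induced by depth-t unfolding trees with the one induced by t iterations of color refinement: the two coincide at every depth, as Theorem 4.1.1 predicts. This is, of course, a check on a single example, not a proof; the tree and color encodings are simplified re-implementations of the constructions of Section 3.

```python
def unfolding_tree(adj, labels, v, d):
    # T_v^d as a nested tuple; sub-trees sorted to ignore neighbor order.
    if d == 0:
        return (labels[v], ())
    return (labels[v], tuple(sorted(unfolding_tree(adj, labels, u, d - 1)
                                    for u in adj[v])))

def color(adj, labels, v, t):
    # c_v^(t) with tuples playing the role of an injective HASH.
    if t == 0:
        return ("init", labels[v])
    return (color(adj, labels, v, t - 1),
            tuple(sorted(color(adj, labels, u, t - 1) for u in adj[v])))

def partition(keys):
    """Group nodes by key; return the induced partition as a set of sets."""
    groups = {}
    for v, k in keys.items():
        groups.setdefault(k, set()).add(v)
    return frozenset(frozenset(g) for g in groups.values())

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # a path of 4 nodes
labels = {0: 1, 1: 1, 2: 1, 3: 1}
for t in range(4):
    p_ue = partition({v: unfolding_tree(adj, labels, v, t) for v in adj})
    p_wl = partition({v: color(adj, labels, v, t) for v in adj})
    assert p_ue == p_wl                            # same partition at depth t
```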
Figure 2: A graphical representation of the relationship between the color refinement and the unfolding equivalence, applied on nodes 1 and 4 of the given graph.
For example, it is well known that GNNs cannot distinguish regular graphs where nodes have the same features (see, e.g., [28]). Of course, in this case, a GNN is not able to distinguish any node, since all the unfolding trees are equal (see Figure 3a). On the one hand, when a target node has different features with respect to the others, the unfolding trees also incorporate such a difference and the nodes at different distances from this target node belong to different equivalence classes (see Figure 3b). On the other hand, if all the labels are different, then each node belongs to a different class, since all unfolding trees are different (see Figure 3c). We observe that, in principle, by adding random features to the node labels, we could make all the nodes distinguishable and improve the GNN expressive power. This fact was already mentioned for OGNNs [31] and has recently been observed also for modern GNN models [29]. Obviously, this is true only in theory, as the introduction of random features usually produces overfitting. However, some particular tasks exist where random features do not cause any overfitting, for example when these features are not related to the node content (see [31], Section IV.A), while, in other cases, it is the particular model which is able to efficiently use random labels [28].
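The behavior described above can be reproduced with a few lines of code, reusing a simplified color-refinement sketch: on a 4-node cycle (a regular graph) with identical labels, all nodes receive the same color at every iteration, while assigning distinct (e.g., random) features makes every node distinguishable. The concrete labels below are arbitrary stand-ins for random features.

```python
def color(adj, labels, v, t):
    # Simplified color refinement: nested tuples act as injective colors.
    if t == 0:
        return ("init", labels[v])
    return (color(adj, labels, v, t - 1),
            tuple(sorted(color(adj, labels, u, t - 1) for u in adj[v])))

cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # a 4-node cycle

same = {v: 1 for v in cycle}                       # identical labels
colors = {v: color(cycle, same, v, 3) for v in cycle}
assert len(set(colors.values())) == 1              # no node is distinguishable

distinct = {0: 10, 1: 20, 2: 30, 3: 40}            # stand-ins for random features
colors = {v: color(cycle, distinct, v, 1) for v in cycle}
assert len(set(colors.values())) == 4              # every node distinguishable
```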
A further important point of our analysis regards how deep the unfolding trees must be, i.e., how many iterations of color refinement are needed, in order to make the equivalence stable. Actually, Theorems 4.1.1 and 4.1.2 suggest that the unfolding and Weisfeiler-Lehman equivalences remain paired up to any depth/iteration t. Those equivalences naturally become finer and finer as the iterations proceed, i.e., ∽_{ue_{t−1}} ≻ ∽_{ue_t} and ∽_{WL_{t−1}} ≻ ∽_{WL_t}, until an iteration T at which they become stable and equal to the corresponding infinite equivalences. As already mentioned in Section 3, according to the literature [15], it is known that, for the WL-equivalence on graphs, r − 1 is both an upper and a lower bound on T, where r is the maximum number of nodes in the graphs. The following theorem, which takes inspiration from the results in [15] about covering trees, shows that, for the equivalences on nodes, the bounds are different and we must wait up to 2r − 1 iterations, i.e., consider trees of depth 2r − 1, until the equivalences become stable.
Theorem 4.1.3. The following statements hold for graphs with at most r nodes.

1. Let G and H be connected graphs and x, y be nodes of G and H, respectively. The infinite unfolding trees T_x, T_y are equal if and only if they are equal up to depth 2r − 1, i.e., T_x = T_y iff T_x^{2r−1} = T_y^{2r−1}.
2. For any r, there exist two graphs G and H with nodes x, y, respectively, such that the infinite unfolding trees T_x, T_y are different, but they are equal up to depth 2r − 16√r, i.e., T_x ≠ T_y and T_x^t = T_y^t for t ≤ 2r − 16√r.
In order to get an intuitive explanation of the reason why the bounds are different for graph and node equivalences, let us consider the case of two graphs G and H that are not equivalent, i.e., G ̸≡_WL H holds. Moreover, let us assume that the parallel application of the refinement algorithm detects the difference in colors at iteration T, namely G ̸≡_{WL_T} H, for example because a new color is generated for graph G that is not present in H. At this iteration, the WL algorithm is halted, since we have detected at least one node u in G that is different from all the nodes in H. Conversely, if we continue the color refinement, the new color of u will generate other new colors, which are not present in H, also for the neighbors of u. After at most r further iterations, the difference spreads throughout the graph, so that, finally, all the nodes in G are different from those in H. This is intuitively correct, since all the nodes in G are connected to a node that does not exist in H. Therefore, we can observe that, while the first difference between the nodes of the two graphs arises after r − 1 iterations, the diffusion of such information to all the nodes takes up to r additional steps. Obviously, a similar conclusion can be derived also considering the unfolding equivalence and the depths of the unfolding trees.
An example that illustrates this situation is depicted in Figure 4. The two graphs in (a) and (b) were proposed in [17] and satisfy the lower bound of point 2 of Theorem 4.1.3. In the example, we assume that all the nodes have the same attributes, even if, for the sake of clarity, they are displayed with different symbols according to their "role" in the coloring scheme. The graphs in (a) and (b) are constructed using copies of the subgraph modules in (c), (d) and (e), which are merged in a sequence; (a) and (b) are equal except at the top: in (a), at the end of the sequence, there is a copy of (d), while in (b) there is a copy of (e). The interesting case occurs when the sequence is long enough that 2r − 16√r > r holds. In this case, we have the following situation: graphs (a) and (b) are distinguishable by the 1-WL test in fewer than r steps; nevertheless, a number of steps t > 2r − 16√r > r is needed to distinguish the nodes u and v. Thus, intuitively, color refinement can recognize that (a) and (b) are not isomorphic, but the difference is detected only when the information about the asymmetry, which lies on one side of the sequence, reaches the other side of the sequence, where the different modules have been placed. After the different modules have been detected, the information about their difference is propagated back to nodes u and v, and to the rest of the graphs, in a number of iterations proportional to the length of the sequence.
In order to formally link the concept of unfolding trees to the computational capability of GNNs, let us now recall the definition of unfolding equivalence.
The class of functions that preserve the unfolding equivalence on D will be denoted by F(D). A characterization of F(D) is given by Theorem 4.1.5: a function f belongs to F(D) if and only if there exists a function κ, defined on trees, such that f(G, v) = κ(T_v^{2n−1}), for any node v ∈ D.
A short, formal proof can be found in Appendix A.
Theorem 4.1.5 represents an improvement of the results reported in [30]; our contribution here is to show that, by considering the unfolding tree down to depth 2n − 1, we can provide the complete information on a graph to a function f belonging to F(D).
Note that Theorem 4.1.5 suggests not only that the functions that compute the output on a node using unfolding trees preserve the unfolding equivalence, but also that the converse holds, namely that all the functions that preserve the unfolding equivalence can be computed as functions of the unfolding trees. Since GNNs can implement only functions of the unfolding trees, we may expect a tight relationship between what GNNs can do and the class F(D). Actually, in [31], it has been shown that the OGNN model can approximate in probability, up to any degree of precision, any function in F(D); a similar result will be derived for modern GNNs in this manuscript.
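To make the notion concrete, a truncated unfolding tree can be computed recursively: the tree of v at depth t has v's label at the root and, as subtrees, the depth-(t−1) unfolding trees of v's neighbors. The sketch below is our own illustration, not the paper's construction; trees are represented as nested tuples, with children sorted so that equal multisets compare equal.

```python
def unfolding_tree(adj, labels, v, depth):
    """Unfolding tree T_v^depth as a nested tuple (label, children).
    Children are sorted, so two results compare equal exactly when the
    trees are equal as unordered (multiset-of-children) trees."""
    if depth == 0:
        return (labels[v], ())
    children = tuple(sorted(unfolding_tree(adj, labels, u, depth - 1)
                            for u in adj[v]))
    return (labels[v], children)
```

On a 4-node path with identical labels, the two endpoints have equal unfolding trees at every depth (they are unfolding equivalent), while an endpoint and an inner node already differ at depth 1.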

Approximation capability
The above discussion concerns what GNNs cannot do, since we have proved that they are unable to distinguish nodes that originate equal unfolding trees. Another obvious limit is that, at each node v, a GNN considers only the part of the graph that is reachable from v and cannot implement any function depending on information inaccessible from that node. For this reason, for simplicity, we have decided to consider only connected graphs. In this section, we focus on two further, related questions: which functions can be approximated by GNNs, and whether there are any limitations other than the one related to the unfolding equivalence.
In order to address these issues, we consider the class of functions that preserve the unfolding equivalence (see Definition 4.1.4). The following theorem proves that GNNs can approximate in probability, up to any precision, any function of this class, which means that GNNs are a sort of universal approximators on graphs, modulo the limitations due to the unfolding equivalence.

Theorem 4.2.1 (Approximation by GNNs). Let D be a domain containing connected graphs with at most r nodes. For any measurable function τ ∈ F(D) preserving the unfolding equivalence, any norm ∥·∥ on R, and any probability measure P on D, there exists a GNN, defined by continuously differentiable functions COMBINE^(k), AGGREGATE^(k), ∀k ≤ r − 1, and by a function READOUT, with feature dimension m = 1 (i.e., h_v^k ∈ R), such that the function φ realized by the GNN, computed after 2r − 1 steps, satisfies the condition

P(∥τ(G, v) − φ(G, v)∥ ≤ ϵ) ≥ 1 − λ

for any reals ϵ, λ, where ϵ > 0 and 0 < λ < 1.
Theorem 4.2.1 intuitively states that, given a function τ, there exists a GNN that can approximate it. COMBINE^(k) and AGGREGATE^(k) can be any continuously differentiable functions, while no assumptions are made on READOUT. This situation does not correspond to practical cases, where the GNN adopts particular architectures and those functions are realized by neural networks or, more generally, by parametric models, for example made of layers of sums, max, average, etc. Therefore, it is of fundamental interest to clarify whether the theorem still holds when the components COMBINE^(k), AGGREGATE^(k) and READOUT are parametric models.
Let us now study the case when the employed components are sufficiently general to be able to approximate any function. We call Q this class of networks, which corresponds to GNN models with universal components. In order to simplify our discussion, we introduce the transition function f^(k) to denote the stacking of AGGREGATE^(k) and COMBINE^(k), i.e.,

f^(k)(h, h_1, …, h_s, q) = COMBINE^(k)(h, AGGREGATE^(k)(h_1, …, h_s), q).

Then, we can formally define the class Q.

Definition 4.2.2. A class Q of GNN models is said to have universal components if, for any ϵ > 0 and any continuous target functions COMBINE^(k), AGGREGATE^(k), READOUT, there exists a GNN belonging to Q, with functions COMBINE_w^(k), AGGREGATE_w^(k), READOUT_w and parameters w, such that

∥f^(k)(h, h_1, …, h_s, q) − f_w^(k)(h, h_1, …, h_s, q)∥∞ ≤ ϵ and ∥READOUT(h) − READOUT_w(h)∥∞ ≤ ϵ

hold, for any input values h, h_1, …, h_s, q. Here, the transition functions f^(k) and f_w^(k) are defined using the target functions COMBINE^(k), AGGREGATE^(k) and the GNN functions COMBINE_w^(k), AGGREGATE_w^(k), respectively, and ∥·∥∞ is the infinity norm. ■

The following result shows that Theorem 4.2.1 still holds even for GNNs with universal components.
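The stacking of AGGREGATE and COMBINE into a single transition function can be illustrated with a minimal message-passing loop. This is an illustrative sketch of ours with scalar features (m = 1), matching the setting of the theorems; the function names and the exact placement of the node attribute q inside COMBINE are our assumptions, not prescriptions from the paper.

```python
def make_transition(combine, aggregate):
    """f(h, neighbor_feats, q) = COMBINE(h, AGGREGATE(neighbor_feats), q):
    the stacked transition function used to state the universality condition."""
    return lambda h, neigh, q: combine(h, aggregate(neigh), q)

def gnn_forward(adj, attrs, transition, readout, steps):
    """Synchronous message passing for `steps` rounds, then a node-level
    READOUT applied to every node (node classification/regression)."""
    h = {v: 0.0 for v in adj}  # scalar features, m = 1
    for _ in range(steps):
        h = {v: transition(h[v], [h[u] for u in adj[v]], attrs[v]) for v in adj}
    return {v: readout(h[v]) for v in adj}
```

For instance, with AGGREGATE = sum and COMBINE(h, a, q) = a + q, two rounds on a graph with unit attributes compute deg(v) + 1 at every node.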
Theorem 4.2.3 (Approximation by neural networks). Let us assume that the hypotheses of Theorem 4.2.1 are fulfilled and that Q is a class of GNNs with universal components. Then, there exist a parameter set w and functions COMBINE_w^(k), AGGREGATE_w^(k), READOUT_w, implemented by neural networks in Q, such that the thesis of Theorem 4.2.1 holds.
The proof of Theorem 4.2.3 is included in the Appendix. However, some related topics are discussed below, to better understand some properties of GNNs.
• In the proof of Theorem 4.2.1, we first define an encoding function ▽ (see the Appendix) that maps trees to real numbers. The functions COMBINE^(k) and AGGREGATE^(k) are designed so that, at each step k, the node feature vector approximates a coding of the unfolding tree. The function READOUT decodes the unfolding tree and produces the desired outputs.
• In the proof of Theorem 4.2.3, it is shown that Theorem 4.2.1 still holds even when the transition and READOUT functions are approximated. Thus, we can use any parametric model to implement those functions. We can expect that, also for the GNNs of Theorem 4.2.3, the transition function stores into the feature vector an approximate coding of the unfolding tree, while READOUT decodes such a coding and produces the desired outputs. Obviously, in a practical case, a GNN may store only the useful information required to produce the output, rather than the whole informative content of the unfolding trees.
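A toy version of such a tree-to-real encoding can help make "coding trees as numbers" tangible: serialize the (canonically ordered) tuple tree and read its characters as digits after the point in a large base. This is our own stand-in for the paper's ▽, not its actual construction; it is injective on the NUL-free serializations produced by `repr` because the base exceeds every character code.

```python
from fractions import Fraction

def encode_tree(tree, base=128):
    """Injective toy encoding of a nested-tuple tree into a number in [0, 1):
    serialize the tree canonically, then interpret the character codes as
    the digits of a base-`base` fraction."""
    s = repr(tree)              # canonical if children are kept sorted
    n = 0
    for ch in s:
        n = n * base + ord(ch)  # each ASCII code is one base-128 digit
    return Fraction(n, base ** len(s))
```

Distinct trees serialize to distinct strings, hence to distinct rationals, which is all an encoding like ▽ needs in principle.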
The following remarks may further help to understand our results.
• GNNs with universal components. Intuitively, the universality condition means that the architectures used to implement f_w and READOUT_w must be sufficiently general to approximate any possible target function. From the theory of standard neural networks, those architectures must have at least two layers (one hidden and one output layer) [30]. Such a conclusion is similar to the one reported in [37], where a related result is described and where it is suggested that, in order to implement the 1-WL test, the GNN must use a two-layer transition function. Indeed, in this way, the GNN can implement an injective encoding of the input graph into the node features. Nonetheless, the proposed result is slightly different from the one reported in [37], as, in our theory, the encoding may fail to be injective, provided that the approximation remains sufficiently good in probability. However, the conclusion about the architecture still holds.

GNNs with transition functions f_w^(k) exploiting two-layer architectures include Graph Isomorphism Networks (GINs) [37], which were claimed to realize an injective encoding. Similarly, the OGNN model, for which a result similar to Theorem 4.2.1 was proved, adopts a two-layer architecture for the transition function: in this case, AGGREGATE_w^(k) consists of a MultiLayer Perceptron (MLP) with one hidden layer and COMBINE_w^(k) is implemented by a sum. Similar results have been devised also in [4], where a different version of the COMBINE_w^(k) function has been modeled as a sum of MLPs.

• READOUT universality. The condition on the universality of the READOUT function can be relaxed, provided that a higher dimension for the feature vector is used, namely m > 1. READOUT_w can indeed cooperate with the transition function in order to produce the output. In the limit case, the output can be completely prepared by the transition function and stored in some components of h_v^K, so that READOUT_w is just a projection function.
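As a concrete instance of a two-layer transition function, here is a scalar GIN-style update written from the description above. The MLP shape and the parameter values are our own choices, made only to keep the sketch runnable, not the configuration used in [37].

```python
def gin_layer(adj, h, eps, w1, b1, w2, b2):
    """GIN-style update with scalar features: sum-aggregate the neighborhood,
    add (1 + eps) times the node's own feature, then apply a two-layer MLP
    (one hidden ReLU unit + linear output), i.e. the minimal depth the
    universality condition calls for."""
    def mlp(x):
        hidden = max(0.0, w1 * x + b1)  # hidden layer with ReLU
        return w2 * hidden + b2         # linear output layer
    return {v: mlp((1 + eps) * h[v] + sum(h[u] for u in adj[v])) for v in adj}
```

With identity-like parameters (w1 = w2 = 1, b1 = b2 = 0, eps = 0) and unit features on a path graph, one layer maps each node to deg(v) + 1.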
• GNN architectures that are not universal approximators. Most GNN models, e.g., Graph Convolutional Neural Networks, GraphSAGE and so on, use a single-layer architecture to implement the transition function. Thus, even if they employ components such as those specified by Definition 4.2.2, they have a limited computational power with respect to two-layer architectures, and this is supported by theoretical results. In [37], Lemma 7, it is shown that, if the transition function is made up of a single layer with ReLU activation functions, the encoding function cannot be injective. A similar result was obtained for linear recursive neural networks in [6]. However, in general, it is not correct to assert that GNNs with single-layer transition functions cannot be universal approximators for functions on graphs, as this property depends on the adopted GNN model and on other architectural/training details. For example, a GNN model with a single-layer transition component can use several iterations of Eq. (1) to emulate a GNN with a deeper transition component. In the former model, the node features emulate the hidden layers of the transition network, and COMBINE must contain a self-loop, namely it must have access to the previous features of each node.
• Feature dimension. Surprisingly, Theorems 4.2.1 and 4.2.3 suggest that a feature vector of dimension m = 1 is enough to establish the universal approximation capability of GNNs. It is obvious, however, that the dimension of the feature vector plays an important role in determining the complexity of the coding function for a given domain. We expect that the larger the dimension, the smaller the complexity of the coding. This complexity, in turn, affects the complexity of the transition function, the difficulty of learning such a function, the number of patterns required for training the GNN, and so on. A GNN can employ up to r − 1 iterations/layers to diffuse all the information from one node to any other node through the message passing mechanism. After r − 1 iterations, the information stored in a node provides a sort of signature for that node, which may allow to distinguish some nodes from others. Yet, such a signature is not complete, since the first time a node "communicates" with another, it has no information about itself. Adding r further iterations/layers allows nodes to communicate again and exchange their current signatures, so as to produce more accurate signatures. It is worth noting that this reasoning also provides an intuitive explanation of why graph regression/classification tasks differ from node tasks. In graph tasks, the GNN uses a READOUT function that aggregates the features of all the nodes in the graph and can possibly carry out the work required by the second diffusion phase. In node tasks, READOUT operates only on a single node, so that the second diffusion phase is mandatory.
• The same COMBINE and AGGREGATE can be used in all the layers. Even if, for clarity, our theoretical analysis focuses on the most commonly used GNN model, which exploits different functions in each layer k, our proofs do not rely on this characteristic. Therefore, all the results also hold for those GNNs, sometimes called recursive, that use the same COMBINE and AGGREGATE functions in each layer.
Note that, throughout the manuscript, we have used the idea that the unfolding tree represents the information available to a GNN to compute its output, and we have mentioned that a similar approach has been applied by other authors as well. From a formal point of view, Theorem 4.2.1 defines a method by which a GNN can actually encode an unfolding tree into the node features, so it has been proved that all the information collected in the unfolding trees can be used by GNNs. However, the reverse implication also holds: a GNN cannot encode more information into the features than that contained in the unfolding trees. Indeed, this is a consequence of the fact that GNNs have no greater discriminatory capability than the 1-WL test (see [25], Theorem 1). Therefore, the unfolding trees collect exactly the information used by a GNN.
Finally, the following corollary provides an alternative way to describe the approximation ability of GNNs as a function of their unfolding trees.

Conclusion
In this paper, we have shown that GNNs can approximate, in probability, any function that preserves the unfolding equivalence (i.e., that passes the 1-WL test). Our proof improves on existing results both because it applies to node classification/regression tasks and because it is more general, since it holds for measurable functions. Moreover, using our theoretical framework, we have provided details on the GNN architectures that can achieve a given approximation, including the number of iterations/layers, the state dimension, and the architecture of the AGGREGATE^(k), COMBINE^(k) and READOUT networks.
Future developments may include further extensions of our results beyond the 1-WL domain, covering GNN models not considered by the framework used in this paper. Moreover, the proposed results are mainly focused on the expressive power of GNNs, but GNNs with the same expressive power may differ in other fundamental properties, e.g., the computational and memory requirements and the generalization capability. Understanding how the architecture of AGGREGATE^(k), COMBINE^(k) and READOUT impacts those properties is of fundamental importance for practical applications of GNNs.

Eq. (11) is true if and only if, by induction, Eq. (12) holds, which implies Eq. (13). Moreover, Eq. (12) means that an ordering on ne[u] and ne[v] exists such that Eq. (14) holds. Instead, by induction, Eq. (15) holds iff an ordering on ne[u] and ne[v] exists such that Eq. (16) holds. Finally, putting together Eqs. (14) and (16), we obtain the thesis. Theorem 4.1.1 is therefore proven, as its statement just rephrases the statement of Lemma A.1 in terms of the equivalence notation. Theorem 4.1.2 is the natural extension of Theorem 4.1.1 to graphs.
Proof of Theorem 4.1.3. In order to prove the theorem, we introduce the concept of universal covering, first presented in [17], which allows us to derive useful properties of the unfolding trees (see [17] for more details).
Let G = (V, E). Given a graph H = (V′, E′) and a homomorphism α from H to G, if α maps the neighborhood of each node v ∈ V′ bijectively onto the neighborhood of α(v) and preserves the node attributes, then α is called an attributed covering map and H is called a covering graph. Given a connected graph G and a vertex x ∈ V, let us define a graph U_x(G) as follows. The vertex set of U_x(G) consists of all non-backtracking walks in G starting at x, that is, of sequences (x_0, x_1, …, x_k) such that x_0 = x, x_i and x_{i+1} are adjacent, and x_{i+1} ≠ x_{i−1}. Two such walks are adjacent in U_x(G) if one of them extends the other by one component, that is, one is (x_0, …, x_k, x_{k+1}) and the other is (x_0, …, x_k). U_x(G) is a tree, and γ_G defined by γ_G(x_0, …, x_k) = x_k is a covering map from U_x(G) to G. We call U an attributed universal cover of G if U covers any covering graph of G. Therefore, U_x(G) is an attributed universal cover of G.

Remark A.2. Given that we are dealing with attributed graphs, we will drop the "attributed" adjective from now on, to make the notation lighter.
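The vertex set of U_x(G) can be enumerated directly from this definition. The following sketch is our own illustration: it lists all non-backtracking walks from x up to a given length, i.e., the vertices of the universal cover truncated at that depth.

```python
def non_backtracking_walks(adj, x, max_len):
    """All non-backtracking walks from x of length <= max_len, i.e. the
    vertices of U_x(G) up to depth max_len. A walk may never immediately
    reverse the edge it just traversed (x_{i+1} != x_{i-1})."""
    walks, frontier = [(x,)], [(x,)]
    for _ in range(max_len):
        frontier = [w + (u,) for w in frontier for u in adj[w[-1]]
                    if len(w) == 1 or u != w[-2]]
        walks.extend(frontier)
    return walks
```

On a triangle, every walk has exactly one non-backtracking extension, so there are 1 + 2 + 2 = 5 walks up to length 2, whereas on a tree U_x(G) reproduces the tree itself.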
The next lemma, which is proved in [17], shows the bijective correspondence between universal covers and colors up to a certain depth/iteration.

Lemma A.3 [17]. Let U and W be universal covers of graphs G and H, respectively. Furthermore, let α be a covering map from U to G and β be a covering map from W to H. Let x ∈ V(U) and y ∈ V(W), and let u = α(x) and v = β(y). Then, for any t, the colors of u and v at iteration t are equal if and only if U and W, truncated at depth t around x and y, are isomorphic. The equivalence is formally proved by the following lemma.

Note that, since b is independent of δ, D̃ ⊂ M holds for any δ. Since τ is integrable, there exists a continuous function that approximates τ, in probability, up to any degree of precision. Thus, without loss of generality, we can assume that τ is equi-continuous on M. By the definition of equi-continuity, a real δ̄ > 0 exists such that ∥τ(G_1, v) − τ(G_2, v)∥ ≤ ϵ holds for any node v and for any pair of graphs G_1, G_2 having the same structure and satisfying ∥ℓ_{G_1} − ℓ_{G_2}∥∞ ≤ δ̄.
Let us apply Lemma A.6 again, where the δ of the hypothesis is now set to δ̄. From now on, D̃ = {(G_i, v_i)}, 1 ≤ i ≤ n, represents the set obtained by the new application of Lemma A.6, and I_i^{b,η}, 1 ≤ i ≤ 2d, denote the corresponding intervals defined in the proof of the same lemma. Let θ : R → Z be a function that encodes reals into integers as follows: for any i and any z ∈ I_i^{b,η}, θ(z) = i. Thus, θ assigns to all the values of an interval I_i^{b,η} the index i of the interval itself. Since the intervals do not overlap and are not contiguous, θ can be continuously extended to the entire R. Moreover, θ can be extended also to vectors, θ(Z) being the vector of integers obtained by encoding all the components of Z. Finally, let Θ : G → G represent the function that transforms each graph by replacing all the feature labels with their coding, i.e., ℓ_{Θ(G)} = θ(ℓ_G). Let Ḡ_1, …, Ḡ_n be graphs, each one extracted from a different set G_i. Note that, according to points 3, 4, 5 of Lemma A.6, Θ produces an encoding of the sets G_i: for any two graphs G_1 and G_2 of D̃, Θ(G_1) = Θ(G_2) holds if and only if G_1, G_2 belong to the same set G_i (Eq. (18)). Consider, now, the problem of approximating τ ∘ Γ on the set (Θ(Ḡ_1), v_1), …, (Θ(Ḡ_n), v_n). Theorem A.7 can be applied to such a set, because it contains a finite number of graphs with integer labels. Therefore, there exists a GNN that implements a function φ such that, for each i, Eq. (19) holds. However, this means that there is also another GNN that produces the same result operating on the original graphs G_i, namely a GNN for which

φ(G_i, v_i) = φ(Θ(Ḡ_i), v_i)    (20)

holds. Actually, the graphs G_i and Ḡ_i are equal except that the former has the coding of the feature labels attached to the nodes, while the latter contains the whole feature labels. Thus, the GNN that operates on Ḡ_i is the one suggested by Theorem A.7, except that AGGREGATE^(0) preliminarily creates a coding of θ(ℓ_v).
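The interval encoder θ can be realized, for instance, by mapping each interval to its index and interpolating linearly in the gaps, which yields one continuous extension to the whole real line. The snippet below is our own illustration of this construction; the choice of linear interpolation is arbitrary, since any continuous extension works.

```python
def theta(z, intervals):
    """Encode a real z w.r.t. sorted, pairwise disjoint, non-contiguous
    closed intervals: inside the i-th interval (1-based) return i; in the
    gap between intervals i and i+1 interpolate linearly from i to i+1,
    so the extension is continuous on all of R."""
    for i, (a, b) in enumerate(intervals, start=1):
        if a <= z <= b:
            return float(i)
    for i in range(len(intervals) - 1):
        lo, hi = intervals[i][1], intervals[i + 1][0]
        if lo < z < hi:
            return (i + 1) + (z - lo) / (hi - lo)  # slide from index i+1 to i+2
    return 1.0 if z < intervals[0][0] else float(len(intervals))
```

Inside each interval the output is exactly the interval index, which is all the proof needs; the values taken between intervals are irrelevant as long as they vary continuously.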
Putting together the above equality with Eqs. (18) and (19), the thesis follows. Finally, note that we can choose functions AGGREGATE^(k), COMBINE^(k), and READOUT which produce exactly the same computations when they are applied to the graphs G_i, but that can be extended to the rest of their domain so as to be continuously differentiable. Obviously, such an extension exists since those functions are only constrained to interpolate a finite number of points.
Proof sketch of Theorem 4.2.3. Proof. As in the proof of Theorem 4.2.1, without loss of generality, we assume that the feature dimension is m = 1. First of all, note that Theorem 4.2.1 ensures that we can find COMBINE^(k), AGGREGATE^(k), ∀k ≤ N, and READOUT such that, for the corresponding function φ implemented by the GNN, Eq. (21) holds. Let us consider the corresponding transition function f^(k), defined by stacking COMBINE^(k) and AGGREGATE^(k). Since COMBINE^(k) and AGGREGATE^(k) are continuously differentiable, f^(k) is continuously differentiable. Considering that the theorem has to hold only in probability, we can also assume that the domain is bounded, so that f^(k) is bounded and has a bounded Jacobian. Let B be a bound on the Jacobian/derivative of f^(k) for any k and any input. The same argument can also be applied to the function READOUT, which is continuously differentiable w.r.t. its input and can be assumed to have a bounded Jacobian/derivative. Let us assume that B is also a bound for the Jacobian/derivative of READOUT. Moreover, let COMBINE_w^(k) and AGGREGATE_w^(k) be functions implemented by universal neural networks that approximate COMBINE^(k) and AGGREGATE^(k), ∀k ≤ r, respectively, such that

∥f^(k) − f_w^(k)∥∞ ≤ η

holds for every k and some η > 0. Let READOUT_w be the function implemented by a universal neural network that approximates READOUT, so that ∥READOUT − READOUT_w∥∞ ≤ η. In the following, it will be shown that, when η is sufficiently small, the GNN implemented by the approximating neural networks is sufficiently close to the GNN of Theorem 4.2.1, so that the thesis is proved.
Let F^k, F_w^k be the global transition functions of the two GNNs, obtained by stacking all the f^(k) and f_w^(k) for all the nodes of the input graph. The node features are computed at each step by H̄^k = F^k(H̄^{k−1}) and H^k = F_w^k(H^{k−1}), where H̄^k, H^k denote the stacking of all the node features of the graph obtained by the two transition functions, respectively. Then,

∥H̄^1 − H^1∥∞ = ∥F^1(H̄^0) − F_w^1(H^0)∥∞ ≤ ηN,

where N = |G| is the number of nodes in the input graph. Moreover,

∥H̄^2 − H^2∥∞ ≤ ∥F^2(H̄^1) − F^2(H^1)∥∞ + ∥F^2(H^1) − F_w^2(H^1)∥∞ ≤ ηN B + ηN = ηN(B + 1).
The above reasoning can then be applied recursively to prove that

∥H̄^k − H^k∥∞ ≤ ηN(B + 1)^{k−1}.

Since the output of the GNN is computed using the encoding at step N, we have

∥φ̄(G, v) − φ_w(G, v)∥ ≤ B ηN(B + 1)^{N−1} + η,

where φ̄ and φ_w denote the functions implemented by the exact and the approximating GNN, respectively. Finally, since the maximum number of nodes N can be considered bounded, we can find a GNN based on neural networks with η small enough to make ∥φ̄(G, v) − φ_w(G, v)∥ arbitrarily small, which, together with Eq. (21), produces the bound of Theorem 4.2.1.
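The recursive bound ηN(B + 1)^{k−1} can be sanity-checked numerically on a toy one-node example: take a map f with derivative bounded by B and an approximation f_w shifted by η; the observed gap then stays within the bound at every step. The specific choices f(x) = Bx and f_w(x) = Bx + η below are our own toy assumptions, used only to exercise the inequality.

```python
def gap_bound(eta, n_nodes, B, k):
    """Bound from the proof sketch: ||Hbar^k - H^k||_inf <= eta*N*(B+1)^(k-1)."""
    return eta * n_nodes * (B + 1) ** (k - 1)

def simulate_gap(eta, B, steps, h0=0.3):
    """Iterate the exact map f(x) = B*x against its eta-perturbed
    approximation f_w(x) = B*x + eta (a single node, N = 1) and return
    the feature gap after each step."""
    exact = approx = h0
    gaps = []
    for _ in range(steps):
        exact, approx = B * exact, B * approx + eta
        gaps.append(abs(exact - approx))
    return gaps
```

With B < 1 the observed gap actually saturates at η/(1 − B), well below the exponential bound, which shows the bound is loose for contractive transitions but tight in the first steps.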

Figure 1: An example of a graph with some unfolding trees. The symbols outside the nodes represent features. The two nodes on the left part of the graph are equivalent and have equivalent unfolding trees.

Figure 3: (a) A regular graph where all nodes have the same features. All unfolding trees are equal. (b) The equivalence classes when only one node has different features. (c) The equivalence classes when all nodes have different features.

Definition 4.1.4. A function f : D → R^m is said to preserve the unfolding equivalence on D if, for any nodes v, w in D, v ∽ue w implies f(G, v) = f(H, w), where G and H are the graphs containing v and w, respectively.

Theorem 4.1.5 (Functions of unfolding trees). A function f belongs to F(D) if and only if there exists a function κ, defined on trees, such that f(G, v) = κ(T_v^{2n−1}) for any node v ∈ D.

Figure 4: In (a) and (b), two graphs G, H are depicted that satisfy the lower bound of point 2 of Theorem 4.1.3. We assume that all the nodes have the same attributes, even if they are displayed with different symbols according to their "role" in the coloring scheme. The graphs in (a) and (b) are constructed by aggregating in a sequence two copies of the same subgraph (c); then, module (d) is added at the top of graph (a), while module (e) is added at the top of graph (b). It is worth noting that (a) and (b) do not satisfy the relation 2r − 16√r > r; nevertheless, by adding module (c) multiple times to the tail of both (a) and (b), we can find two graphs satisfying the requested relation.

Corollary 4.2.4. The class of functions implemented by GNNs with universal components is dense in probability in the class F(D) of functions that preserve the unfolding equivalence on the domain D of connected graphs.

Figure 5: The ATTACH operator on trees.

• Number of steps. Theorems 4.2.1 and 4.2.3 suggest that 2r − 1 steps are enough to approximate any function. Such a result is a consequence of Theorems 4.1.3 and 4.1.5. Intuitively, this bound can be explained by reusing the discussion on Theorem 4.1.3.
Lemma ([31]). Theorem 4.2.1 holds if and only if Theorem A.7 holds.

Proof. Although the proof is almost identical to that contained in [31], we report it here with the new notation. Theorem 4.2.1 is more general than Theorem A.7, which makes the first implication straightforward. Suppose instead that Theorem A.7 holds; we show that this implies Theorem 4.2.1. Let us apply Lemma A.6 with values of P and λ equal to the corresponding values of Theorem 4.2.1, δ being any positive real number. It follows that there exist a real b and a subset D̃ of D such that P(D̃) > 1 − λ. Let M be the subset of D that contains only the graphs G satisfying ∥ℓ_G∥∞ ≤ b.