Deeper Exploiting Graph Structure Information by Discrete Ricci Curvature in a Graph Transformer

Graph-structured data, serving as an abstraction of data containing nodes and interactions between nodes, is pervasive in the real world. Numerous methods are dedicated to extracting graph structure information explicitly or implicitly, but whether it has been adequately exploited remains an open question. This work goes deeper by heuristically incorporating a geometric descriptor, the discrete Ricci curvature (DRC), to uncover more graph structure information. We present a curvature-based topology-aware graph transformer, termed Curvphormer. This work expands expressiveness by using a more illuminating geometric descriptor to quantify the connections within graphs in modern models and to extract the desired structure information, such as the inherent community structure in graphs with homogeneous information. We conduct extensive experiments on datasets of various scales, including PCQM4M-LSC, ZINC, and MolHIV, and obtain remarkable performance gains on various graph-level tasks and fine-tuned tasks.


Introduction
Graph data include considerable structure information; however, existing graph-based algorithms do not fully use the inherent structural information of graphs. Real-world datasets with an inherent node-edge structure, such as citation networks [1], molecules [2], and the Internet [3], can be naturally represented by graphs. Moreover, graphs can be manually established on scattered data such as point clouds [4,5].
The vast majority of GNNs use a message passing (MP) mechanism to explore graph structure information by aggregating neighborhood information [6][7][8]; however, they unavoidably run into oversmoothing and oversquashing issues. Due to the MP mechanism, most graph convolutions in GNNs may be considered a special case of Laplacian smoothing [9]. Analogously to a random walk on graphs, smoothing operations on graphs mix the characteristics of individual nodes. Multiple smoothing steps are applied to the characteristics of individual nodes, culminating in a reduction of variability across nodes from diverse groups. This inability to classify nodes when the network is deeper is the most widely discussed defect of GNNs, i.e., oversmoothing [9,10]. Another recently discussed problem of GNNs is oversquashing [11,12], which indicates that information flowing between distant nodes encounters an unavoidable distortion. Oversmoothing and oversquashing are inevitable side effects of MP GNNs. Rong et al. [10] alleviated oversmoothing by randomly dropping a percentage of edges in the graph. Alon and Yahav [11] tried to tackle oversquashing by adding a fully adjacent layer. However, these approaches could not totally resolve these issues [13].
Graph-based transformers are another line of recent research. Transformers were originally proposed as powerful solvers for natural language processing (NLP) tasks [14] and soon became prevalent in many domains, such as computer vision [15], time series [16], and graph representation learning [17,18]. For graph-based transformers, current works mainly focus on how to integrate the graph structure into the positional encoding (PE) of transformers [18,19]. Since graph data do not have a canonical position as images and sequences do, the most widely used PE is the graph Laplacian eigenvectors, which preserve the global structure with permutation invariance [20]. Different from PE methods, Graphormer [21] added structural encodings to the self-attention module as a structure-aware bias of the attention weights. It has been experimentally shown that Graphormer is exempt from the problem of oversmoothing. Moreover, because of the self-attention mechanism in the transformer architecture, each node in the network attends to all others as if they were adjacent. Consequently, transformer-based graph learners can efficiently avoid the issue of oversquashing. Thus, it is natural to take graph transformers as the backbone architecture for graph-based models.
However, current graph structure descriptors, such as node degrees and shortest path distances (SPD), have limited expressiveness; rich information in the topology of graphs remains unexplored. Graph-based tasks rely heavily on structure information. The basic distinction between graph data and other data types, such as images or sequences, is the non-Euclidean node-edge structure. From the topological view, graphs can be treated as a discretized manifold [22]. Based on the homophily assumption of most graphs, the mainstream graph-based tasks, such as node classification, link prediction, and graph classification/regression, in essence tend to strengthen the connections between nodes with the same property and discriminate between nodes with different properties. To describe the geometric relationships of nodes from intra-/intercommunities, we draw inspiration from recent research on community detection algorithms [22][23][24] built on a geometric notion, i.e., the discrete Ricci curvature (DRC) [25].
The DRC quantifies the intensity of connections between nodes and their neighborhoods with regard to the local graph topology. Densely connected node pairs are associated with positive DRC values, while sparsely connected pairs give rise to negative DRC values. As illustrated in Figure 1, the nodes connected by a yellow edge are in the same community and have densely connected/overlapping neighborhoods, while the nodes connected by a green edge are from distinct communities with few connections/overlaps between their neighborhoods. Therefore, the DRC value of the yellow edges is 1.33, which is obviously larger than the value −0.6 of the green edges. Purple edges correspond to a scenario between the two extremes; thus, they have a DRC between −0.6 and 1.33. Intuitively, the DRC has the ability to measure the connectedness of nodes and their neighborhoods; thus, it can be integrated into graph transformers to explore deeper structure information. In this paper, we propose a novel curvature-based topology-aware graph transformer architecture, namely, Curvphormer, to exploit advanced structural information from a topological view. We evaluated the performance of our proposed algorithms on widely used testbeds such as MolHIV, PCQM4M-LSC, and ZINC. Curvphormer exceeded previous benchmarks by a significant margin.

Related Work
In this section, we highlight the most recent approaches on NN-based models working on demystifying the structure information of graph data. Then, we give prominence to some related applications of the DRC in finding the underlying structure of graphs.
Structural Encodings

On MP-GNNs
GNN methods processing graph data have natural merits from a theoretical basis. Most GNNs follow the MP mechanism and leverage random walk algorithms to explore the underlying structure of graphs with the aid of stochastic theories [9,26]. Some other GNN methods try to incorporate local structure information by utilizing a local k-hop subgraph as the structure fingerprint of its central node [27,28]. Moreover, some methods explicitly or implicitly introduce additional structure information encoded by geometric notions such as the DRC into GNNs [29,30]. However, due to the inevitable oversmoothing and oversquashing problems and the limited expressiveness of GNNs, the added structure information does not yield much improvement in performance.

On Graph-Based Transformers
The challenge of building a powerful transformer architecture for graph representation is how to properly encode structure information into a positional encoding (PE) module [18] or the self-attention module [21]. Dwivedi and Bresson [18] exploited graph structure by precomputing the Laplacian eigenvectors of the adjacency matrix to act as the PE in the vanilla transformer architecture and provide distance-aware information. Graph-BERT [19] operates on sampled linkless subgraphs for the local structure information and enhances its capability on extremely large graphs. Furthermore, Graph-BERT introduces three PE embeddings to capture the positional information of local subgraphs. Specifically, a Weisfeiler-Lehman (WL) absolute PE is leveraged to capture the global information, and an intimacy-based PE and a hop-based relative PE are introduced to extract the local information in subgraphs. It is notable that TokenGT [17] puts forward that pure transformers can attain impressive performance on graphs by means of orthonormal node identifiers and type identifiers. This suggests that the transformer architecture itself has the potential to fit the graph structure. The key to developing transformers for graphs is to extract proper graph structure information in the model. Thus, most graph transformers incorporate graph structure information by some strong graph-specific modifications. Following this guideline, further involving advanced geometric descriptors in the transformer architecture is a promising direction.

DRC in Finding Graph Structure
In light of the properties of the Ricci curvature in Riemannian geometry, its discrete version is a natural choice as a topological descriptor. Ni et al. [3] leveraged the DRC to analyze Internet topologies. Sia et al. [23] constructed a community detection algorithm by removing negatively curved edges step by step. Lai et al. [24] leveraged a DRC-based Ricci flow to deform a graph so that intracommunity nodes became closer and intercommunity nodes dispersed. The DRC is capable of finding the underlying relationships between nodes, assigning them to clusters with identical or distinct properties.

Method
In this section, we elaborate the formulation of the discrete Ricci curvature (DRC) and how to incorporate it in Curvphormer. Firstly, the basic settings are stated in Section 3.1. Then, we carefully define the Ricci curvature on graphs in Section 3.2. In Section 3.3, we propose the curvature-based topology-aware Curvphormer.

Preliminaries
Let G = (V, E) be a simple connected graph, where V = {v_1, ..., v_n} is the set of nodes and E ⊆ V × V is the set of edges. n = |V| and m = |E| are the numbers of nodes and edges, respectively. There are two kinds of information in G:
• Attribute information: It represents the attribute features carried by the datasets. For example, the signal intensity of a signal tower (which can be abstracted as a node in the network) is a kind of attribute information. Not only nodes but also edges in graphs can contain attribute information; for example, the bonds between atom pairs in a molecule can have different types, which can be included in the edge features. We denote the node features by X = (x_1, ..., x_n)^T ∈ R^{n×d} and the edge features by E = (x_{e_1}, ..., x_{e_m})^T ∈ R^{m×q}, where d and q are the dimensions of node and edge features, respectively.
• Structure information: It represents the positions and interactions of nodes. Because of the absence of a canonical node ordering, without loss of generality, position information can be viewed as a simple kind of interaction between nodes, i.e., a node is adjacent or nonadjacent to others. More complex interactions are simply represented by the node-edge form. Thus, in graphs, structure information is usually encoded by the adjacency matrix of the entire graph or of subgraphs. Let A = {a_ij} ∈ R^{n×n} denote the adjacency matrix, where a_ij = 1 when (v_i, v_j) ∈ E, and a_ij = 0 otherwise.
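As a concrete illustration of these conventions, the following minimal NumPy sketch builds the adjacency matrix A and the degrees d_i for a toy graph (the helper name `build_adjacency` is illustrative, not part of the paper's code):

```python
import numpy as np

def build_adjacency(n, edges):
    """Dense adjacency matrix A = {a_ij} of a simple undirected graph."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1  # symmetric: a_ij = a_ji = 1 for (v_i, v_j) in E
    return A

# Toy graph: a triangle (0, 1, 2) plus a pendant node 3.
A = build_adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
degrees = A.sum(axis=1)  # d_i = sum_j a_ij
print(degrees)  # -> [2 2 3 1]
```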

Discrete Ricci Curvature
The Ricci curvature is originally a geometric notion that plays a very important role in Riemannian manifold analysis: it quantifies the degree of space bending. Its discrete counterpart measures the connectedness of the neighborhoods of two nodes. There are two mainstream discretizations of the Ricci curvature, i.e., the Ollivier Ricci curvature [25,31] and the Forman Ricci curvature [32]. Since the Ollivier Ricci curvature has stronger theoretical foundations and depicts inherent structures more intrinsically [33], we applied a limit-free Ollivier Ricci curvature [24,34] as the definition of the DRC.
The Ollivier Ricci curvature is defined on the base of the transportation distance. Firstly, we define the probability distribution of nodes on the graph, which indicates the connections or information flow between one node and others, especially its adjacent neighbors.

Definition 1. Probability distribution:
For any α ∈ [0, 1] and any x ∈ V, the information flow from node x to other nodes y ∈ V can be defined as a probability distribution on V by

m_x^α(y) = α, if y = x;
m_x^α(y) = (1 − α) · γ(w_xy) / ∑_{z∼x} γ(w_xz), if y ∼ x;
m_x^α(y) = 0, otherwise. (1)

where w_xy denotes the edge weight on edge (x, y) ∈ E, y ∼ x means y is connected with x by an edge, and γ(·) is an arbitrary non-negative real-valued one-to-one function. In our experiments, we set γ(w) = w.
By virtue of this definition, m_x^α extracts the local topology of node x on the basis of the graph. The relationship between any two nodes x and y is proportional to the distance between their neighborhoods, which is defined as the transportation distance between the two distributions m_x^α and m_y^α.
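Definition 1 can be sketched directly in Python; this is a minimal illustration with hypothetical argument names (`weights` maps edges to w_xy, `neighbors` maps a node to its adjacency list), using the paper's choice γ(w) = w:

```python
def node_distribution(x, alpha, weights, neighbors):
    """m_x^alpha from Definition 1: mass alpha stays at x; the remaining
    (1 - alpha) is spread over the neighbors of x in proportion to
    gamma(w_xy), with gamma(w) = w as in the paper's experiments."""
    total = sum(weights[(x, y)] for y in neighbors[x])
    m = {x: alpha}
    for y in neighbors[x]:
        m[y] = (1 - alpha) * weights[(x, y)] / total
    return m

# Node 0 with two unit-weight neighbors and alpha = 0.5:
m = node_distribution(0, 0.5, {(0, 1): 1.0, (0, 2): 1.0}, {0: [1, 2]})
# m == {0: 0.5, 1: 0.25, 2: 0.25}
```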

Definition 2. Transportation distance: Let M : V × V → [0, 1] be a coupling between m_x^α and m_y^α, i.e., a joint distribution satisfying ∑_{v∈V} M(u, v) = m_x^α(u) and ∑_{u∈V} M(u, v) = m_y^α(v). Then, the transportation distance between the two probability distributions m_x^α and m_y^α is defined as

W(m_x^α, m_y^α) = inf_M ∑_{u,v∈V} M(u, v) d(u, v),

where d(·, ·) is a distance function.
Here, we leveraged Dijkstra's shortest path distance as d(·, ·) in this work. In order to differentiate topology structures on the basis of graph geometry, the DRC is defined as follows.

Definition 3. Ollivier Ricci curvature [25]: For x, y ∈ V,

κ_α(x, y) = 1 − W(m_x^α, m_y^α) / d(x, y),

and the Ollivier Ricci curvature is the limit

κ(x, y) = lim_{α→1} κ_α(x, y) / (1 − α).

Note that, in the computation of the Ollivier Ricci curvature, when the node pair x and y is densely connected, κ(x, y) is larger than for sparsely connected pairs. To avoid the limit operation, former works set α to 0.5 [3,22] and utilized κ_α as an approximation of κ. In this work, we leveraged another, limit-free version of the Ollivier Ricci curvature for computational convenience [34].

Definition 4. Let B : V × V → R be a coupling function, and simply denote μ_x := m_x^0. For any x, y ∈ V, we call B a *-coupling between μ_x and μ_y if B satisfies:
(i) B(x, y) > 0, and B(u, v) ≤ 0 for all (u, v) ≠ (x, y);
(ii) ∑_{u,v∈V} B(u, v) = 0;
(iii) ∑_{v∈V} B(u, v) = −μ_x(u) for all u ≠ x;
(iv) ∑_{u∈V} B(u, v) = −μ_y(v) for all v ≠ y.

Theorem 1. The *-coupling-based Ricci curvature is formulated as

κ*(x, y) = (1 / d(x, y)) sup_B ∑_{u,v∈V} B(u, v) d(u, v),

and for any x, y ∈ V, x ≠ y, the following equation holds: κ(x, y) = κ*(x, y). (Refer to [34] for the proof.)

Thus, κ* illustrates the topological characteristic of a graph as an Ollivier Ricci curvature while omitting the limit calculation. In our implementation, we leveraged this κ* curvature when computing the DRC and denote the DRC by κ for simplicity. Algorithm 1 formulates the computation of the DRC.

Algorithm 1: Computation of Discrete Ricci Curvature (DRC)
Input: A graph G = (V, E).
Output: A weighted graph G = (V, E, w, κ), where w and κ are the weights and discrete Ricci curvatures on edges, respectively.
1: Initialization: if G is unweighted, set edge weights w_e = 1 for all e ∈ E.
2: Compute the shortest path distance (SPD) d(u, v) of each pair of nodes u, v ∈ V.
3: for e = (x, y) ∈ E do
4:   κ_e = (1 / d(x, y)) sup_B ∑_{u,v∈V} B(u, v) d(u, v)
5: end for
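As an illustrative sketch of Algorithm 1's ingredients, the code below computes the classical α-approximation κ_α(x, y) = 1 − W(m_x^α, m_y^α)/d(x, y) with α = 0.5 (the approximation mentioned in Section 3.2, not the limit-free κ* that the paper actually uses), solving the transportation distance of Definition 2 as a small linear program with SciPy. Function names are hypothetical, and the SPD uses BFS since the toy graph is unweighted:

```python
import numpy as np
from scipy.optimize import linprog

def bfs_spd(A):
    """All-pairs shortest path distances of an unweighted graph via BFS."""
    n = A.shape[0]
    d = np.full((n, n), np.inf)
    for s in range(n):
        d[s, s] = 0
        queue = [s]
        while queue:
            u = queue.pop(0)
            for v in range(n):
                if A[u, v] and np.isinf(d[s, v]):
                    d[s, v] = d[s, u] + 1
                    queue.append(v)
    return d

def wasserstein(mu, nu, dist):
    """W(mu, nu): minimize sum_{u,v} M(u, v) d(u, v) over couplings M,
    written as a linear program on the flattened transport plan."""
    su = sorted(u for u in mu if mu[u] > 0)
    sv = sorted(v for v in nu if nu[v] > 0)
    c = [dist[u, v] for u in su for v in sv]
    A_eq, b_eq = [], []
    for i, u in enumerate(su):  # row marginals: sum_v M(u, v) = mu(u)
        A_eq.append([1.0 if k // len(sv) == i else 0.0 for k in range(len(c))])
        b_eq.append(mu[u])
    for j, v in enumerate(sv):  # column marginals: sum_u M(u, v) = nu(v)
        A_eq.append([1.0 if k % len(sv) == j else 0.0 for k in range(len(c))])
        b_eq.append(nu[v])
    return linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

def ollivier_curvature(A, x, y, alpha=0.5):
    """kappa_alpha(x, y) = 1 - W(m_x^alpha, m_y^alpha) / d(x, y)."""
    dist = bfs_spd(A)
    def m(v):  # m_v^alpha for an unweighted graph (gamma(w) = w = 1)
        nbrs = np.flatnonzero(A[v])
        out = {v: alpha}
        for u in nbrs:
            out[u] = (1 - alpha) / len(nbrs)
        return out
    return 1 - wasserstein(m(x), m(y), dist) / dist[x, y]
```

On two triangles joined by a bridge edge, this yields a positive κ_α inside a triangle and a negative κ_α on the bridge, matching the intuition behind Figure 1.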

Curvphormer
Curvphormer incorporates the advanced geometric information represented by the DRC into a graph-based transformer architecture. The overall architecture of Curvphormer is demonstrated in Figure 2.
Figure 2. Illustration of Curvphormer with attribute/structure encodings. The input is a combination of two types of node-level information, i.e., node features and node degree encoding. Edge-level information, i.e., encodings of edge features and curvatures, describes the interactions between node pairs; therefore, these two encodings are added to the multihead self-attention module as a bias of the attention weights.

Attribute Encoding
As mentioned before, in graph data, the attribute information consists of the features carried by nodes and edges, describing specific information in the dataset. Node features are the most important information characterizing a dataset. In Curvphormer, we leveraged the node features without any affine transformation. In many graphs, edges also have attribute features, which are essential for understanding the underlying graph structure. Provided by the dataset, edge features usually indicate the type or intensity of the interactions between nodes. Thus, for any node pair (v_i, v_j) in a graph, the correlation between v_i and v_j has to account for the edges connecting them. Let v_i and v_j be connected by a shortest path denoted by v_i ∼e_1∼ · · · ∼e_N∼ v_j. The correlation between v_i and v_j can be formulated as the mean of the embedded edge features along the path:

γ(v_i, v_j) = (1/N) ∑_{k=1}^{N} EdgeEmbedding_k(x_{e_k}),

where x_{e_k} ∈ R^q is the edge feature of e_k, EdgeEmbedding_k(x_{e_k}) = x_{e_k}^T · w_k, and w_k ∈ R^q is a learnable vector.
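The path-averaged edge encoding can be sketched in a few lines of NumPy; this is an illustrative fragment (the function name and array arguments are assumptions), with the learnable vectors w_k stacked into one matrix:

```python
import numpy as np

def edge_encoding(path_edge_feats, W):
    """gamma(v_i, v_j): mean of the embedded edge features along the
    shortest path between v_i and v_j.

    path_edge_feats -- (N, q) features x_{e_1}, ..., x_{e_N} along the path
    W               -- (N, q) learnable vectors w_1, ..., w_N
    """
    per_edge = np.einsum('kq,kq->k', path_edge_feats, W)  # x_{e_k}^T w_k
    return per_edge.mean()

feats = np.array([[1.0, 0.0], [0.0, 1.0]])  # a two-edge path, q = 2
W = np.array([[2.0, 0.0], [0.0, 4.0]])
print(edge_encoding(feats, W))  # -> 3.0
```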

Structural Encoding
Structure information here refers to the knowledge of the graph that is induced by the connectedness. As demonstrated in Figure 2, we considered two dimensions of structure information. One is the node-level information quantifying the importance of nodes in the graph. Taking a citation network as an example, the more influential a paper is, the more citations it has, and vice versa. Thus, in an abstract graph, an important node connects to more neighbors. The node degree is an intuitive choice to describe this node property, as in [21]. Let d_i = ∑_{j∈V} a_ij be the degree of node v_i. Then, we embed d_i into a vector

η(v_i) = w_{d_i},

where w_{d_i} ∈ R^d is a learnable vector indexed by the degree d_i. Then, we incorporate the node degree embedding matrix D = (η(v_1), ..., η(v_n))^T ∈ R^{n×d} with the node features as the input of the subsequent module, i.e., H^(0) = X + D.
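In code, the degree encoding is a table lookup added to the node features; a minimal NumPy sketch follows (random weights and the variable names `W_deg`, `H0` are illustrative stand-ins for the learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, max_degree = 5, 8, 16

X = rng.normal(size=(n, d))                   # node features
W_deg = rng.normal(size=(max_degree + 1, d))  # one learnable vector per degree

degrees = np.array([1, 3, 2, 2, 0])  # d_i = sum_j a_ij, precomputed
D = W_deg[degrees]                   # eta(v_i): look up the vector for d_i
H0 = X + D                           # H^(0) = X + D, input to the first layer
```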
The other is the edge-level information, which can be interpreted as the positional relationship between node pairs via the edges connecting them. Former works encoded the position information on graphs by a simple shortest path distance (SPD) [21,35]. However, the SPD can only provide a relative distance on graphs. Graphs can be viewed as a discretized manifold in Riemannian spaces; thus, the topology structure of the manifold determines the foundation of graphs. A pure SPD neglects the topology structure of the spaces in which graphs are embedded. As we stated in Section 3.2, the DRC depicts the connectedness on the basis of the nodes' neighborhoods: nodes with a positive DRC connect densely, while a negative DRC is related to sparsely connected nodes. By virtue of the expressive power of the DRC, we encode the relations of the nodes on the graph topology with

ϕ(v_i, v_j) = κ(v_i, v_j) · w_{ij},

where κ(v_i, v_j) is the DRC of the node pair and w_{ij} is a learnable scalar.

Self-Attention Mechanism
The self-attention module is the main part of the transformer architecture, which captures the global information by connecting all positions [14,21]. It computes a weighted sum of values, where the weights of the values are obtained by a query-key function. Let H = (h_1, ..., h_n)^T ∈ R^{n×d} be the input of the module. In Curvphormer, when a node attends to other nodes in the graph, the edge attribute information Γ = {γ(v_i, v_j)} as well as the DRC-based structural information Φ = {ϕ(v_i, v_j)} are added to the attention weights to provide more topology awareness. Therefore, the self-attention can be formulated as

Attn(H) = softmax(Q K^T / √d_K + Γ + Φ) V,

where Q = H W_Q, K = H W_K, V = H W_V, and W_Q, W_K ∈ R^{d×d_K}, W_V ∈ R^{d×d_V}. Thus, the correlation between nodes v_i and v_j is

A_ij = (h_i W_Q)(h_j W_K)^T / √d_K + γ(v_i, v_j) + ϕ(v_i, v_j).

The multihead self-attention is obtained by

MHA(H) = Concat(head_1, ..., head_h) W_O,

where each head is computed as above with its own projections and W_O ∈ R^{hd_V×d_model}.
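The biased attention above reduces to a few matrix operations; the following single-head NumPy sketch (illustrative names, no learned parameters) shows how Γ and Φ enter the logits before the softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable row softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def curv_attention(H, WQ, WK, WV, Gamma, Phi):
    """Single-head self-attention with the edge-feature bias Gamma and the
    DRC-based bias Phi added to the attention logits before the softmax."""
    Q, K, V = H @ WQ, H @ WK, H @ WV
    logits = Q @ K.T / np.sqrt(WQ.shape[1]) + Gamma + Phi  # (n, n)
    return softmax(logits) @ V
```

A large positive bias Phi[i, j] forces node i to attend almost exclusively to node j, which is how a strongly positive curvature can sharpen attention within a community.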

Curvphormer Structure
Curvphormer follows the basic architecture of Graphormer [21], which is a variant of the vanilla transformer encoder [14]. Each layer of Curvphormer consists of a multihead attention (MHA) module and a feed-forward network (FFN) module, with layer normalization (LN) applied before each module, as in [21]. The detailed implementation of a Curvphormer layer is formulated as

h'^(l) = MHA(LN(h^(l−1))) + h^(l−1),
h^(l) = FFN(LN(h'^(l))) + h'^(l).

Moreover, in order to enhance the ability of Curvphormer to capture the representation of the entire graph, as in [21], a virtual node is applied, which is connected to all nodes in the graph by virtual edges, and the corresponding structural encodings are set to distinct learnable variables. The training procedure of Curvphormer is mainly based on a transformer encoding module. The self-attention mechanism has a complexity of O(n^2 · d) per layer, where n is the number of nodes and d is the dimension of node features. Before training, Curvphormer computes the DRC as the input of the structural encoding. The computational complexity of the DRC is O(m · d̄^3), where m is the number of edges and d̄ is the average degree of nodes. Since computing the DRC on very large graphs is time-consuming, we precompute this valuable structure information before training.

Experiments
In this section, we conduct three experiments to intuitively clarify the motivation as well as the effectiveness of Curvphormer. Firstly, in Section 4.1, we illustrate the importance of topology information on a small dataset, i.e., Zachary's Karate Club Network [36], motivating our inclusion of the curvature as a factor. Then, in Section 4.2, we intuitively show the expressiveness of the DRC on graph structures by comparing it with the widely used graph structure descriptor SPD. Finally, in Section 4.3, we perform experiments on three real-world datasets of different scales to test the performance of Curvphormer.

Structure Information Is Crucial in Graph-Based Tasks
To illustrate the importance of graph structure information, we devised a binary node classification experiment on the small Karate Club Network (Karate). Karate is composed of two communities with 34 nodes (members of the club). The edges between nodes indicate the interactions between club members. We applied a simple two-layer GCN model [6] to learn the underlying graph structure. Moreover, the node features were designed for three cases, i.e., random numbers, the SPD, and the DRC, to test the influence of different kinds of information in a simple NN-based model.
The accuracy of these three scenarios is shown in Table 1 (best performance in 10 runs). Even though random features could not provide any useful information, the classification accuracy was still better than a random guess because of the utilization of the adjacency matrix in the model. Notice that when more structure information was provided, the performance of the model improved remarkably. Moreover, the DRC outperformed the SPD in this experimental setting, indicating that advanced topology information could extract more effective structure information than simple distance information.

Expressiveness of the DRC

Now, we intuitively show the expressiveness of the DRC compared to that of the SPD by a small graph composed of two small communities bridged by an edge, as shown in Figure 3. Though both the SPD and DRC had the ability to detect the two communities, the DRC depicted more in-depth structure information than the SPD. Note the interactions between nodes 1, 3 and nodes 1, 5. Nodes 1 and 3 were from the same community, while nodes 1 and 5 were from different communities. The relationships of these two pairs were different, yet SPD_13 = SPD_15 = 2 (highlighted by orange circles in Figure 3c). Moreover, edge e_45 was the only bridge edge connecting the two communities. However, SPD_45 = 1 (red dotted circle in Figure 3c) could not differentiate e_45 from other one-hop pairs. The SPD was incapable of describing these differences in structure. Fortunately, the DRC could amend these defects because it considers the nodes' neighborhoods. Tightly interacting pairs tend to have a larger DRC than sparsely interacting pairs: in the first case, DRC_13 = 1 was apparently larger than DRC_15 = 0.08. Meanwhile, DRC_45 = −0.83 highlighted the difference of this edge from the others.

Experiments on Real-World Datasets
In this part, we devised our experiments on three datasets of different scales, i.e., MolHIV (small), ZINC (medium), and PCQM4M-LSC (large). Statistics of the datasets are summarized in Table 2.

Figure 4. Testing the performance of Curvphormer on MolHIV for different numbers of layers. Curvphormer surpasses the baseline Graphormer by a significant margin and attains stable satisfactory performance for a varying number of layers.

Experimental Set-Up
We benchmarked Curvphormer against the non-topology-aware Graphormer baseline [21]. The basic setting of Curvphormer followed [21], but we modified some parameters for model fine-tuning. The number of attention heads and the dimension of node/edge features were set to 16. We used AdamW as the optimizer and set the hyperparameter Adam-ε to 1 × 10^−8 and Adam-(β_1, β_2) to (0.99, 0.999). The learning rate was set to 2 × 10^−4 with a lower bound of 1 × 10^−9. The batch size was set to 512. All models and tasks were trained on eight NVIDIA 3080ti GPUs for about three days. Other settings were the same as those of the baseline. We trained Curvphormer on PCQM4M-LSC and ZINC from scratch. We fine-tuned the model pretrained on ZINC on the small dataset MolHIV to test the transferability of Curvphormer. In addition, in order to test whether Curvphormer could effectively resist the performance drop caused by oversmoothing, we tested Curvphormer on the MolHIV dataset with a varying number of layers up to 20. Table 3 summarizes the performance of Curvphormer and other baselines on PCQM4M-LSC, ZINC, and MolHIV. The metrics were the mean absolute error (MAE) for the regression tasks and the AUC for the classification task. We report the MAE on the validation set (ValidMAE) for PCQM4M-LSC because its test set is not publicly available. Curvphormer achieved the best results and noticeably surpassed the previous state-of-the-art GNNs as well as the recent graph transformer models GT [18] and Graphormer [21].

Results
Next, we further tested Curvphormer's performance on the MolHIV dataset by comparing it with the baseline Graphormer. Figure 4 shows that both models were capable of resisting oversmoothing. Meanwhile, Curvphormer surpassed Graphormer by a noticeable margin for all layer configurations. It is noteworthy that when the number of layers changed from 12 to 16, the performance of Graphormer dropped from 80.51 to 70.70. In contrast, Curvphormer achieved a comparable result after only a slight drop.

Conclusions and Discussion
This work introduced Curvphormer, a topology-aware graph transformer that incorporates advanced structure information into the expressive Graphormer architecture. The DRC effectively differentiates the topology structure of graphs with the homophily property and helped our model achieve remarkable performance improvements on datasets of different scales in graph classification/regression tasks. This shows that applying more geometric descriptors to expressive graph models is rewarding. Meanwhile, the exploration of graph structure information remains challenging: for example, discovering the topology information of heterogeneous graphs still needs future endeavors. Moreover, the computational complexity of the DRC restricts its application in large dynamic systems. In a nutshell, Curvphormer inspires a better understanding of graph structure and encourages future work.

Conflicts of Interest:
The authors declare no conflict of interest.