1 Introduction

Graph structures are ubiquitous in real-world data. Graph Neural Networks (GNNs) have emerged as a powerful tool for capturing and learning low-dimensional representations of nodes in such structures. The learned vector representations can be applied to a variety of graph tasks, including vertex classification [1], relation learning [2] and link prediction [3,4,5]. Through their hierarchical architecture, GNNs enhance the expressiveness of these vectors while preserving the structural information of graphs. Wu et al.'s comprehensive survey on graph neural networks provides a broad background for the field, detailing various neighborhood aggregation and graph pooling schemes aimed at enhancing models' capabilities in understanding and representing graph information [6].

However, due to the complexity of network structures, node representations become difficult to distinguish as the number of network layers increases [7], and classifiers struggle to assign the correct label to each node. Training accuracy drops as layers are added, and the over-smoothing problem [8,9,10] arises. To address this, skip-connection strategies have been designed; representative algorithms include GCNII [11], ResGCN [12], and JK-Nets [13]. However, these methods do not account for the varying impact of neighbors at different hops in the network.

At the same time, these GNN variants cannot effectively distinguish graph structure information on simple graphs under certain conditions [14], as shown in Fig. 1 (a "simple graph" here refers to a network with fewer nodes and straightforward edge relationships, whereas a "complex graph" is a dense network with intricate connections and potentially a large number of nodes). Nodes v and \(v'\) are the central nodes, and their representations are generated by aggregating neighboring features. We analyze whether different structures can be distinguished under different aggregation functions: if two structures can be identified as different, the resulting representations of the two central nodes should differ. In Fig. 1a, the results for the two nodes aggregated by Mean and Max are [1/2,1/2] and [1,1], respectively, so neither Mean nor Max aggregation can distinguish their structures. In Fig. 1b, Max aggregation over the neighborhoods of v and \(v'\) still yields the same representation [1,1], so Max aggregation cannot distinguish their structures either. This example shows that on simple graph structures, some GNN variants [15] cannot effectively distinguish structural information with their aggregation functions under certain conditions. In addition, these variants focus on local neighboring node information during node updates and give relatively little consideration to the structure of the entire graph, so they cannot capture the relationships between nodes and the global structure well. In summary, the expressive power of existing classical GNNs is insufficient to capture different graph structures.
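For concreteness, this failure mode can be reproduced with a few lines of NumPy. The neighbor feature multisets below are illustrative stand-ins for the two panels of Fig. 1 (the exact toy graphs are assumptions), chosen so that the aggregates match the values quoted above.

```python
import numpy as np

# Fig. 1a-style case: both central nodes see one neighbor of each type,
# so Mean and Max give identical results for v and v'.
neigh_v  = np.array([[1, 0], [0, 1]])
neigh_vp = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
print(neigh_v.mean(axis=0), neigh_vp.mean(axis=0))  # both [0.5, 0.5]
print(neigh_v.max(axis=0),  neigh_vp.max(axis=0))   # both [1, 1]

# Fig. 1b-style case: the neighbor multisets differ, so Mean separates the
# two central nodes, but Max still collapses both to [1, 1].
neigh_vp = np.array([[1, 0], [1, 0], [0, 1]])
print(neigh_v.mean(axis=0), neigh_vp.mean(axis=0))  # [0.5, 0.5] vs [0.67, 0.33]
print(neigh_v.max(axis=0),  neigh_vp.max(axis=0))   # both [1, 1]
```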

GIN [14] aggregates neighboring nodes, combines the aggregation result with the node's own features, and finally applies a fully connected network, which can fit arbitrary functions, so that the mapping is injective. It introduces a learnable parameter to adjust the node's own features and adds the adjusted features to the aggregated neighbor features, ensuring that the central node and its neighbors remain distinguishable. However, this model mainly focuses on neighbor information and has a limited understanding of global information.

In our work, we propose an improved method based on GraphSAGE, which utilizes neighbor sampling and aggregation strategies to effectively capture local and global information in the graph structure. The main contributions are as follows:

  • We propose a deep learning framework, GraphSAGE++, based on GraphSAGE. It extracts the feature representation of the target node at each layer and then concatenates all layers' feature representations to obtain the final result, where the ith layer captures the i-hop (i=1,2,3,...) structural information between each target node and its i-hop neighbors. Considering the different impact of neighbors at each layer, our model assigns corresponding weights and proportions based on the role of the ith layer's feature information in the global representation. This final representation preserves neighborhood information from 1 to K hops, thus alleviating the over-smoothing problem.

  • We introduce strategies that combine double aggregation with concatenation, which enhances the distinction of structural information and improves the model's expressive capability.

  • We conduct extensive experiments on vertex classification and link prediction tasks over several real datasets. The experimental results show that our method achieves better performance on graph datasets and demonstrates significant advantages in tasks such as node classification and link prediction.

2 Related Work

Fig. 1

Motivation for this paper. a Demonstrates that when graph structures with nodes labeled [1,0] and [0,1] are aggregated using Mean and Max functions, the resulting vectors are indistinguishable, hence unable to capture unique structural information. b Shows that the Mean aggregation results in distinguishable vectors, whereas the Max function still produces an indistinguishable vector [1,1], highlighting the limitation of these aggregation functions in differentiating between graph structures. This indicates the need for more sophisticated aggregation mechanisms to accurately reflect the underlying graph topology in feature representations

In recent years, many methods have emerged in graph representation learning, aiming to improve the ability to represent graph structures.

Matrix decomposition-based methods: These methods apply matrix factorization techniques to graph data. They typically represent the graph structure in matrix form and then decompose and reduce the dimensionality of the graph matrix to obtain low-dimensional representations or features of the graph. In SVD-GCN [16, 17], singular value decomposition is performed on the adjacency matrix of a graph, decomposing it into the product of three matrices: an orthogonal matrix, a diagonal matrix, and a transposed orthogonal matrix. This decomposition extracts the singular vectors and singular values of the graph, thereby capturing its structure and important information. NetMF [18, 19] decomposes the adjacency matrix of a graph into the product of a node embedding matrix and a feature embedding matrix, where the node embedding matrix represents the position of a node in a low-dimensional space and the feature embedding matrix represents the feature representation of a node. By minimizing reconstruction errors and regularization terms, the embedding matrices are optimized and low-dimensional embeddings with good representational ability are learned. GraRep [20] integrates the obtained information by manipulating the probability matrix and captures the global structure by iteratively updating node representations. DEGREE [21] discovers nonlinear interactions between subgraphs throughout the aggregation process by decomposing the feedforward propagation process, which helps detect incorrect predictions and improve the model's credibility. As the graph grows, however, the computational and storage costs of these methods increase and information loss may occur. In contrast, methods based on graph neural networks can handle large-scale graph data more flexibly, with higher computational efficiency and scalability.

Subgraph isomorphism-based methods: These methods have a wide range of applications in graph data analysis and learning. They exploit the structural information of subgraphs to extract important graph features and support tasks such as graph similarity measurement, graph matching and graph generation. GSN [22] introduces a subgraph isomorphism counting mechanism that compares and matches local subgraphs with a set of predefined subgraph patterns; by counting the successfully matched subgraphs, the frequency or distribution of different subgraph patterns in the graph is obtained, capturing the local structure and connectivity patterns more comprehensively and enhancing the ability of graph neural networks to differentiate between subgraphs. Algorithms such as WL [23] and GraphSNN [24] determine the subgraph isomorphism of nodes by comparing their labels with the labels of their neighboring nodes; two nodes are considered isomorphic if they have the same sequence of labels. This comparison based on subgraph isomorphism allows these algorithms to identify graphs with similar local structures and is used for graph isomorphism determination and graph classification tasks. However, because WL [23] and GraphSNN [24] cannot handle the order sensitivity of node labels and global structural information, they may fail to accurately represent graphs with complex global structures.

Methods based on graph neural networks: Methods such as GCN [25, 26], SGCN [27], GIN [14], and Causal GraphSAGE [28] aggregate information over neighboring nodes through multiple convolutional layers. These methods capture the local structure of nodes but have limitations in dealing with the global structure. Some methods focus on improving neighbor sampling. GraphSAGE [29] is a classical method that aggregates information by randomly sampling neighbor nodes, which significantly reduces the computational cost and enables node embedding learning even on large graphs. Other methods introduce importance weights or attention mechanisms for neighboring nodes, such as GAT [30, 31], which uses the degrees of neighboring nodes or the similarity between nodes to assign weights and aggregates node features at each layer according to attention weights. These methods can more accurately select neighbor nodes for information aggregation and improve the quality of node representations. However, current approaches tend to treat multi-scale representation learning and neighbor sampling as independent steps that are not well integrated.

Methods based on random walks: MIRW [5] considers the mutual influence between nodes to better capture the complexity of the network. This contrasts with traditional random walk strategies, which treat all nodes and links as equally important and therefore cannot fully reflect the general structure of the graph. In addition, CSADW [4] enhances link prediction in social networks by combining a new transition matrix with structural and attribute similarity. This combination improves on traditional techniques by biasing random walks toward structurally similar nodes, thereby capturing both structural and attribute similarities. These methods perform well in capturing local node interactions and structural-attribute similarities, but they have limitations in handling global graph structures [32].

To effectively address these issues, we propose the GraphSAGE++ algorithm, which considers multi-scale graph structure information by a weighted multi-dimensional mechanism. This involves assigning different weights to different hop neighboring nodes based on their importance to redefine their dimensionality, thereby better capturing local and global graph structure information. Our method can more comprehensively model the structure of graphs and improve the performance of graph representation learning.

3 GraphSAGE++

In this section, we present our framework. We first define the relevant problem, then describe the overall architecture, and finally discuss how to integrate the global structural information of the graph more effectively.

3.1 Problem Definition

This section provides some basic definitions.

Definition 1

Given a graph \(G = (V,E,X)\), \(V=\{v_1,v_2,...,v_n\}\) denotes the set of n nodes in the graph, \(N = |V|\) is the number of nodes, and |E| is the number of edges. \(E=\{e_{i,j}\}_{i,j=1}^n\) denotes the set of all edges: \(e_{i,j}=1\) when there is an edge between \(v_i\) and \(v_j\), and \(e_{i,j}=0\) otherwise. \(X \in R^{N\times D}\) is the feature matrix of all nodes, with \(x_v \in R^D\) denoting the feature vector of node v.
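As a concrete illustration of Definition 1, the following minimal sketch builds such a graph with the PyTorch Geometric Data container (the toy graph with three nodes and D=4 is an assumption used only for illustration).

```python
import torch
from torch_geometric.data import Data

N, D = 3, 4                                    # |V| = 3 nodes, D-dimensional features
x = torch.randn(N, D)                          # feature matrix X in R^{N x D}
# Edges e_{0,1} = e_{1,2} = 1, stored once per direction for an undirected graph.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
G = Data(x=x, edge_index=edge_index)
print(G.num_nodes, G.num_edges)                # 3 nodes, 4 directed edge entries
```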

Definition 2

The feature vector of v at the i-th layer is denoted by \(h_{v}^{i}\), with \(h_{v}^{0} = X_v\). \({\mathcal {N}}(v)=\{u \in V:(u,v)\in E\}\) denotes the neighborhood set of vertex v in the graph.

3.2 GraphSAGE

The advantage of GraphSAGE [29] lies in its ability to effectively capture the local structural information of nodes and its scalability to large-scale graph data. The main idea is to obtain node representations by aggregating neighbor information through neighbor sampling. GraphSAGE is a K-layer network. For a node v, three operations are performed at each layer: neighbor sampling, neighbor information aggregation, and node representation updating. Specifically, at the ith layer the model first samples a fixed set of neighbors \({\mathcal {N}}(v)\) of node v. It then aggregates the neighbor information, as shown in Eq. (1).

$$\begin{aligned} h_{{\mathcal {N}}(v)}^i \leftarrow AGGREGATE\left( h_{u}^{i-1},u \in {\mathcal {N}}(v) \right) \end{aligned}$$
(1)

Finally, it updates the vector representation of node v, as shown in Eq. (2), where \(\sigma \) is the sigmoid function \(\frac{1}{1+e^{-x}} \), a nonlinear activation that allows the model to capture nonlinear relationships.

$$\begin{aligned} h_{v}^{i} \leftarrow \sigma \left( W_i \cdot CONCAT\left( h_{v}^{i-1},h_{{\mathcal {N}}(v)}^i\right) \right) . \end{aligned}$$
(2)

\(W_i\) is the training parameter, which is obtained by training with the following objective function:

$$\begin{aligned} L(p,q) = -\sum _{v}\left( p(v)\log q(v) + (1-p(v))\log (1-q(v))\right) \end{aligned}$$
(3)

This objective function measures the distance between the actual output distribution and the desired output distribution. The probability distribution p is the desired output and the probability distribution q is the actual output; the smaller the value, the closer the two distributions are.
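The following minimal PyTorch sketch of a single GraphSAGE layer illustrates Eqs. (1) and (2) with Mean aggregation; the toy shapes and neighbor lists are assumptions, and the loss in Eq. (3) corresponds to a standard binary cross-entropy.

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W_i acts on the concatenation [h_v^{i-1}, h_N(v)^i], hence 2 * in_dim.
        self.W = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h: (N, in_dim) node features; neighbors: one index tensor per node.
        h_neigh = torch.stack([h[nbrs].mean(dim=0) for nbrs in neighbors])   # Eq. (1)
        return torch.sigmoid(self.W(torch.cat([h, h_neigh], dim=1)))         # Eq. (2)

h = torch.randn(3, 8)
neighbors = [torch.tensor([1, 2]), torch.tensor([0]), torch.tensor([0, 1])]
print(SAGELayer(8, 16)(h, neighbors).shape)    # torch.Size([3, 16])
```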

3.3 Framework of GraphSAGE++

GraphSAGE suffers from the over-smoothing problem, and it mainly captures the local structural information of nodes, while the global structural information of the graph is not effectively captured. To address these problems, we improve GraphSAGE. Specifically, for a node v, its direct neighbors have the greatest influence on it, while the influence of i-hop neighbors gradually decreases as the hop count increases. The overall framework is shown in Fig. 2.

As shown in Fig. 2, at the ith layer the representation \(h_{v}^{i} \) of node v is retained after the three operations of neighbor sampling, neighbor information aggregation, and node representation updating. At the Kth layer, the representations from all layers are concatenated to obtain the final global representation \([h_v^1, h_v^2,..., h_v^K]\). Such a representation preserves the neighbor information from 1 to K hops, thus alleviating the over-smoothing problem.

Neighbors at different hops affect a node differently, so each layer's vector enters the global representation with a corresponding weight and proportion. The specific idea is that the dimension of the per-layer node representation decreases as the hop count increases. Let d be the dimension of the final global representation (d can be set by the user; it is 128 in this paper), let \(N_i\) be the dimension of the node representation from the ith layer, and let K be the maximum number of layers of the model (8 in this paper). The ratio of the ith layer to d is \(p_i\). We first compute \(p_i\) via Eqs. (4) and (5), and then use Eq. (6) to obtain the dimension \(N_i\).

$$\begin{aligned} \vartheta _i = (K-i+1)^{\frac{1}{2} } \end{aligned}$$
(4)
$$\begin{aligned} p_i = \frac{\vartheta _i}{ \sum _{j=1}^{K}\vartheta _j } \end{aligned}$$
(5)
$$\begin{aligned} N_i= int(d\cdot p_i) \end{aligned}$$
(6)
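A small sketch of this dimension split is given below, assuming the sum in Eq. (5) runs over all K layers and using d = 128 and K = 8 as in the paper.

```python
def layer_dims(d=128, K=8):
    theta = [(K - i + 1) ** 0.5 for i in range(1, K + 1)]    # Eq. (4)
    total = sum(theta)
    return [int(d * t / total) for t in theta]                # Eqs. (5)-(6)

dims = layer_dims()
print(dims)       # earlier hops get more dimensions: [22, 20, 19, 17, 15, 13, 11, 7]
print(sum(dims))  # 124; slightly below d because int() truncates each share
```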
Fig. 2

GraphSAGE++ architecture

Algorithm 1 shows the forward propagation process of GraphSAGE++. First, we randomly sample the neighbor nodes of v using the neighbor sampling function. Then, by means of the aggregation function, we obtain the representation \(h_{u}^i\) of the neighbors of v. Next, we perform a concatenation operation and obtain the representation of v by applying the nonlinear transformation \(\sigma \) to the result of the concatenation. In line 8, we normalize the vertex representation. Finally, after concatenating the representations of all layers, we obtain the final vector representation \(H_v\): \([h_v^1,h_v^2,...,h_v^K]\). In line 5, we assign a fixed number of dimensions to the representation at each layer using Eq. (6), ensuring that the important representation information is effectively preserved.

Algorithm 1

GraphSAGE++ forward_propagation algorithm
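Since the algorithm listing itself is given as a figure, the following condensed Python sketch conveys the same flow under stated assumptions: sample neighbors, aggregate and update with a SAGELayer-like module (cf. Sect. 3.2), L2-normalize, keep each layer's output, and concatenate at the end. The helper names and shapes are assumptions, and every node is assumed to have at least one neighbor.

```python
import random
import torch
import torch.nn.functional as F

def forward_propagation(h0, adj, layers, num_samples):
    """h0: (N, D) input features; adj: list of neighbor-index lists;
    layers: per-hop modules whose output dimensions follow Eq. (6)."""
    h, per_layer = h0, []
    for sage, s in zip(layers, num_samples):
        # Neighbor sampling (line 5 also fixes this layer's output dimension).
        sampled = [torch.tensor(random.sample(adj[v], min(s, len(adj[v]))))
                   for v in range(len(adj))]
        h = sage(h, sampled)                  # aggregate + update, Eqs. (1)-(2)
        h = F.normalize(h, p=2, dim=1)        # line 8: normalize the representation
        per_layer.append(h)
    return torch.cat(per_layer, dim=1)        # H_v = [h_v^1, h_v^2, ..., h_v^K]
```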

Algorithm 2 shows the GraphSAGE++ back-propagation process. The vector representation \(h_v\) of vertex v at each hop is obtained through the forward propagation process (line 4). Lines 3 to 5 compute the loss value: a softmax operation is first applied to each node's vector representation to convert it into a probability distribution p, the node label y is then converted to a one-hot encoding, and the loss value is computed by the loss function. Lines 6 to 7 compute the gradients and update the parameters.

Algorithm 2

GraphSAGE++ backward_propagation algorithm
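The supervised update described above can be sketched as a single training step; the classifier head mapping the concatenated representation to class scores is an assumption, since the paper does not fix its exact form.

```python
import torch.nn.functional as F

def backward_propagation(H, y, classifier, optimizer):
    """H: (N, d) global representations from Algorithm 1; y: (N,) integer labels."""
    logits = classifier(H)                    # map embeddings to class scores
    log_p = F.log_softmax(logits, dim=1)      # lines 3-5: probability distribution p
    loss = F.nll_loss(log_p, y)               # cross-entropy against one-hot labels
    optimizer.zero_grad()
    loss.backward()                           # lines 6-7: gradients and update
    optimizer.step()
    return loss.item()
```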

Algorithm 3 is the overall GraphSAGE++ training algorithm. First, the weight matrices W are initialized with Xavier initialization. Then the global vector representation \(H_v\) of v is obtained by calling the forward_propagation procedure of Algorithm 1 (lines 4 to 5). Finally, the parameters are updated by calling the back_propagation procedure of Algorithm 2, and the loop iterates until the whole model converges.

Algorithm 3

GraphSAGE++ Training parameters framework

3.4 Model Optimization

In this section, we further optimize the model. As shown in Fig. 1, the Mean and Max aggregation functions cannot always capture graph structure information effectively. Mean aggregation averages the features of a node's neighbors, so every neighbor contributes with the same weight to the final aggregation result. This can lead to information loss because it ignores the differences in importance between neighboring nodes; in some cases, mean aggregation cannot distinguish different neighbors because their features are blended together. Max aggregation takes the element-wise maximum of the neighboring features as the aggregation result. It preserves the most salient features among the neighbors but may discard other information, since it tends to focus on a few dominant neighbors while ignoring the rest. To better distinguish structural information and improve expressive power, we design three variants: GraphSAGE++ with double aggregation (GraphSAGE++DA), GraphSAGE++ with double aggregation & concatenation (GraphSAGE++DAC) and GraphSAGE++ with double aggregation & mixed concatenation (GraphSAGE++DAMC).

3.4.1 GraphSAGE++DA

GraphSAGE++DA is shown in Fig. 3. The main body consists of two GraphSAGE models that aggregate with Mean and Max, respectively; at the ith layer they share the parameter \(W_i\). The GraphSAGE model with the Mean aggregation function produces the feature vector representation \(h_{v}^{K}\) of the target node at the Kth layer, and the GraphSAGE model with the Max aggregation function produces the feature vector representation \( h_{v}^{K}{'}\) of the target node at the Kth layer. Concatenating these two feature vectors yields the final vector representation of the target node, \([h_{v} ^{K}, {h_{v}^ {K}}{'}]\). However, GraphSAGE++DA still does not effectively capture both local and global graph structure information.

Fig. 3

GraphSAGE++DA architecture
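A compact sketch of the GraphSAGE++DA idea is shown below: two parallel GraphSAGE stacks share the per-layer weights \(W_i\) but aggregate with Mean and Max respectively, and only the Kth-layer outputs are concatenated. Neighbor sampling is omitted and the shapes are assumptions.

```python
import torch
import torch.nn as nn

class SharedSAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim)     # W_i shared by both branches

    def forward(self, h, neighbors, mode):
        agg = [h[n].mean(dim=0) if mode == "mean" else h[n].max(dim=0).values
               for n in neighbors]
        h_neigh = torch.stack(agg)
        return torch.sigmoid(self.W(torch.cat([h, h_neigh], dim=1)))

def graphsage_pp_da(h0, neighbors, layers):
    h_mean, h_max = h0, h0
    for layer in layers:                             # the same W_i is applied twice
        h_mean = layer(h_mean, neighbors, "mean")
        h_max = layer(h_max, neighbors, "max")
    return torch.cat([h_mean, h_max], dim=1)         # [h_v^K, h_v^K']
```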

3.4.2 GraphSAGE++DAC

GraphSAGE++DA suffers from the same over-smoothing problem, and for this reason GraphSAGE++DAC is designed. As shown in Fig. 4, the vector representation of each layer is saved, and the dimension of each layer is determined by Eq. (6). \(h_v^i\) denotes the ith layer representation obtained with the Mean aggregation function, and \(h_v^i{'}\) denotes the ith layer representation obtained with the Max aggregation function. The 2K vectors are concatenated to obtain the final vector representation of the target node, \([h_v ^1, h_v^2,..., h_v^K, h_v^1{'}, h_v^2{'},..., h_v^K{'}]\), where the first K vectors come from the layers aggregated with Mean and the last K from the layers aggregated with Max.

3.4.3 GraphSAGE++DAMC

The only difference between GraphSAGE++DAC and GraphSAGE++DAMC is the order in which the 2K vectors are concatenated: in GraphSAGE++DAMC the final concatenated vector is \([h_{v}^{1}, h_{v}^{1}{'}, h_{v}^{2}, h_{v}^{2}{'},..., h_{v}^{K}, h_{v}^{K}{'}]\).
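The two concatenation orders can be written down directly; in the sketch below, mean_layers and max_layers stand for the per-layer lists \([h_v^1,...,h_v^K]\) and \([h_v^1{'},...,h_v^K{'}]\) (the function names are ours).

```python
import torch

def concat_dac(mean_layers, max_layers):
    # GraphSAGE++DAC: [h_v^1, ..., h_v^K, h_v^1', ..., h_v^K']
    return torch.cat(mean_layers + max_layers, dim=1)

def concat_damc(mean_layers, max_layers):
    # GraphSAGE++DAMC: [h_v^1, h_v^1', h_v^2, h_v^2', ..., h_v^K, h_v^K']
    interleaved = [t for pair in zip(mean_layers, max_layers) for t in pair]
    return torch.cat(interleaved, dim=1)
```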

4 Experiments

For the implementation of our models, we utilized the PyTorch Geometric (PyG) framework, which is specifically designed for deep learning on irregularly structured input data such as graphs. PyG is well regarded for its efficient handling of large-scale graph data and its extensive library of optimized graph neural network layers. All experiments were conducted on a machine with an Apple M1 Pro (14 Core) @ 3.20 GHz processor, which provided the necessary computational capacity for the training and evaluation phases of our graph neural network models. Our code and implementation details are available at: https://github.com/ejwww/SAGE-plus.

Fig. 4

GraphSAGE++DAC architecture

4.1 Experimental Settings

To evaluate the performance of the different GNN models, the evaluation metrics listed in Table 1 were used, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Parameter settings: For all algorithms, the size of the training set varies from 0.1 to 0.9 of the data, with the remainder used as the test set. The embedding dimension d is set to 128, the learning rate lr is set to 0.001, the batch size is set to 64, and the maximum number of hops K is set to 8. To optimize the model parameters, we employ the backpropagation algorithm combined with the Adam optimizer, which is favored in the deep learning community for its adaptive learning rates and momentum, facilitating faster convergence and robust generalization when training complex models. For regularization, we integrate \(L_2\) regularization into the training process to mitigate the risk of overfitting; this penalizes the squared magnitude of the parameters, constraining model complexity and promoting simpler, more generalizable solutions that perform better on unseen data. To further increase generalization and reduce overfitting, we introduce a dropout probability p, set to 0.5, which balances discarding too many elements (reducing the model's effective capacity) against discarding too few (increasing the risk of overfitting). The neighborhood sample sizes are \(S_1=20\) and \(S_2=15\) when K is 2. When K is extended to higher values, we set \(S_3=10\), \(S_4=5\), \(S_5=4\), \(S_6=3\), and so on, decreasing layer by layer. We thus sample more neighbors in layers near the input and fewer in layers near the output. This reduces computational complexity and helps avoid over-smoothing, preventing excessive use of computing and storage resources, especially when the network has many layers.
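For reference, the settings above can be collected in a single configuration; this is a convenience sketch rather than the paper's code, and the entries marked as assumed (the \(L_2\) strength and the sample sizes beyond \(S_6\)) are not specified in the text.

```python
config = {
    "embedding_dim": 128,          # d
    "max_hops": 8,                 # K
    "learning_rate": 1e-3,
    "batch_size": 64,
    "dropout": 0.5,                # p
    "l2_weight_decay": 5e-4,       # L2 regularization strength (value assumed)
    "num_samples": [20, 15, 10, 5, 4, 3, 2, 1],  # S_1..S_6 from the text; S_7, S_8 assumed
    "train_ratio": [0.1 * i for i in range(1, 10)],
}
```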

Table 1 The evaluation indicators used in this study

4.1.1 Datasets

We test on seven real datasets commonly used for vertex embeddings, covering social networks, protein datasets and citation networks; their statistics are shown in Table 2. Cora [33] is a standard citation network benchmark in which each vertex represents a paper, each edge represents a citation from one paper to another, the vertex features are the bag-of-words representation of the paper, and the vertex label is the paper's academic topic. Pubmed [33] consists of 19,717 scientific publications on diabetes from the PubMed database, categorized into three classes. In Amazon [33], the graph represents the relationship between users and goods: users and goods are the nodes, and purchases between them are the edges. PPI [34] is a set of subgraphs generated from human proteins, with different labels for different proteins; this dataset has fifty labels. Citeseer [33] is a citation network whose nodes represent scholarly documents, whose directed edges represent citation relationships between documents, and whose nodes carry feature information about the documents. ENZYMES [35] is a collection of graphs constructed from the protein structures of biomolecules, with 600 graphs corresponding to 600 samples (protein molecules) belonging to six structural classes. KarateClub [36] describes the social relationships among the 34 members of a karate club, with members as nodes and an edge added between two nodes if the corresponding members interact socially outside the club.

Table 2 Network datasets statistics

4.1.2 Baseline Algorithms

We use the following representative algorithms in comparative experiments.

GCN [25] is one of the earliest graph convolutional network models. It uses the adjacency matrix and the node feature matrix for feature updating and propagation; at each layer, the feature representation of the current node is updated by aggregating the features of its neighboring nodes.

SGCN [27] is a simplified graph convolutional network model that reduces the graph convolution operation to a single linear operation by removing the nonlinear transformation and polynomial fitting, lowering model complexity and computational overhead while improving interpretability and efficiency.

GraphSAGE [29] is a graph neural network algorithm that aggregates the neighbor information of vertices through an aggregation function and updates it through training; as the number of iterations increases, vertices are able to aggregate information from higher-order neighbors.

JK-Net [13] is a framework for graph neural networks that aims to improve performance on graph tasks such as node classification and graph classification. Its core idea is to integrate information from different graph convolutional layers through jumping connections, improving the model's ability to learn representations of graph data.

GIN [14]: In graph isomorphism detection, the goal is to determine whether two given graphs are isomorphic, i.e., whether their nodes can be mapped one-to-one while preserving edge relationships; GIN applies this idea to graph neural networks. It learns the graph representation by combining the features of a node with those of its neighboring nodes, which are then transformed nonlinearly by a multilayer perceptron.

GCNII [11] introduces cross-layer information transmission and an importance coefficient mechanism, which adaptively adjusts the importance of different layers, taking the information of multiple neighbor orders into account at each layer and weighting them with different importance coefficients.

GraphSNN [24] proposes a new message-passing framework, GMP, which injects local structures into aggregation schemes based on overlapping subgraphs, improving performance on different graph structures.

SVD-GCN [16] is a graph convolutional neural network model based on singular value decomposition, which obtains low-order features and high-order features by performing singular value decomposition of the adjacency matrix, and performs downscaling and representation learning of graph data by preserving low-order features and truncating high-order features.

DGCN [37] is a graph neural network model that decouples feature learning from graph structure learning: the feature learning phase exploits the similarity of node features for feature propagation and representation learning, while the graph structure learning phase focuses on the connectivity between nodes to further optimize the topology of the graph.

4.2 Weisfeiler-Lehman Test

The WL test is an iterative local feature encoding algorithm for detecting graph isomorphism. We make some adjustments to the algorithm: for each node in the graph we initialize a label, and then iteratively use GraphSAGE++ to aggregate the features of neighboring nodes. The aggregated features are mapped to new labels, and the labels of the nodes in the two subgraphs are compared; if the labels of the two subgraphs are the same, they are considered isomorphic. The aggregation functions used for the WL isomorphism test in this paper are the same Mean and Max aggregation functions used in the node classification and link prediction tasks. This section validates the effectiveness of the algorithm on the WL isomorphism test and compares it with the other baseline algorithms. The average over ten runs is reported here.
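For orientation, the classical WL refinement that this test adapts can be sketched as follows; in our adjusted version the hash of neighbor labels is replaced by the GraphSAGE++ aggregation of neighbor features, so the hashing scheme below is only a stand-in.

```python
from collections import Counter

def wl_label_multiset(adj, init_labels, iterations=3):
    """adj: list of neighbor-index lists; init_labels: one initial label per node."""
    labels = list(init_labels)
    for _ in range(iterations):
        # Refresh every label from the node's own label and its sorted neighbor labels.
        labels = [hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in range(len(adj))]
    return Counter(labels)

def maybe_isomorphic(adj1, lab1, adj2, lab2):
    # Equal label multisets are necessary (not sufficient) for isomorphism.
    return wl_label_multiset(adj1, lab1) == wl_label_multiset(adj2, lab2)
```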

Figure 5 compares the WL isomorphism detection accuracy of different algorithms on different datasets. As can be seen from the figure, at K=2 hops the methods proposed in this paper are already competitive, and GIN is comparable to GraphSAGE++DAC and GraphSAGE++DAMC on the Citeseer dataset. Compared with GraphSAGE, our schemes obtain better performance on both the Cora and Citeseer datasets; GraphSAGE++DAMC improves the isomorphism detection accuracy by seventy percent over GraphSAGE. As the number of hops increases, our schemes become even more competitive at WL isomorphism detection. When K is greater than 4, the isomorphism detection accuracies of GIN, GCNII, GraphSAGE, and SGCN all show a significant downward trend, whereas GraphSAGE++ maintains good performance and tends to stabilize, with GraphSAGE++DAMC standing out in overall performance.

Fig. 5

Accuracy results of different algorithms for WL isomorphism detection with increasing K value on different datasets, where our models can achieve higher WL-test accuracy

4.3 Experimental Results

In this section, we verify the effectiveness of our methods on link prediction, classification, and visualization tasks, and compare them with the other baseline algorithms. Throughout the tables in this section (Tables 3-12), the best results are highlighted in bold.

Table 3 f1_micro values obtained by completing the vertex classification task on the Cora dataset

4.3.1 Classification Task

The purpose of vertex classification tasks is to predict the labels of each vertex.

Table 4 f1_micro values obtained by completing the vertex classification task on the Pubmed dataset
Table 5 f1_micro values obtained by completing the vertex classification task on the PPI dataset

Tables 3, 4, 5, 6, 7 and Fig. 6 show the performance of our methods and several other algorithms on classification tasks over five real datasets. During the evaluation, we ran each program ten times and report the averages to ensure a fair comparison. The experimental results show that GraphSAGE++ achieves good results on all datasets. On the Amazon dataset, the GraphSAGE-based JK-Net performs better when the training set is 70%, and compared with GraphSAGE, GraphSAGE++DAC improves classification performance by about 13% when the training set is 90%. On the PPI dataset, GraphSAGE++DAMC improves by about 7% over GraphSAGE when the training set is 50%. On the Citeseer dataset, GraphSAGE++DAMC shows a performance improvement of about 6% over GraphSAGE++DA when the training set is 70%, and GraphSAGE++DAC also improves over GraphSAGE++DA. Overall, GraphSAGE++DAC and GraphSAGE++DAMC achieve better classification performance than the other algorithms.

Table 6 f1_micro values obtained by completing the vertex classification task on the Citeseer dataset
Table 7 f1_micro values obtained by completing the vertex classification task on the Amazon dataset
Fig. 6

Values of f1_micro and f1_macro obtained by different algorithms for classification tasks on different datasets

Figure 7 shows the time consumed by the different algorithms to complete the node classification task on the PPI dataset; due to the concatenation operations, our method still incurs a relatively large time overhead compared with the traditional algorithms. Although the proposed algorithm cannot outperform SVD-GCN, GraphSNN, DGCN, etc. in terms of time efficiency, it achieves better performance than them in most cases on the link prediction and classification tasks.

Fig. 7

Comparison of the algorithms' time consumption on the link prediction task

4.3.2 Link Prediction

The purpose of link prediction is to predict possible connections or edges based on known information about the network structure.

Tables 8, 9 and 10 display the results of the link prediction task on the Amazon, Pubmed and Cora datasets. On the Cora dataset, our scheme achieves better performance as the training set grows: compared with GraphSAGE, GraphSAGE++DAMC improves by about 7%, and it improves by about 2% over GraphSAGE++Mean and GraphSAGE++Max and by about 3% over GraphSAGE++DA (when the training set is 50%). On the Pubmed dataset, GraphSAGE++DAC improves by about 20% over GraphSAGE when the training set is 70%, and GraphSAGE++DAMC improves by 17% over GraphSAGE when the training set is 90%. It is worth noting that, overall, GraphSAGE++DAMC performs significantly better than the other algorithms.

Table 8 f1_micro values obtained by completing the link prediction task on the Amazon dataset
Table 9 f1_micro values obtained by completing the link prediction task on the Pubmed dataset

Figure 8 shows the effect of different values of K on the accuracy (acc) score of each algorithm for the classification task on the Cora, Citeseer, Amazon, and Pubmed datasets. The results show that our algorithms are not adversely affected by the value of K and exhibit an overall upward trend. As the depth increases, the DGCN algorithm controls the strengthening or weakening of its expressive power through its decoupling operation, so the effect of increasing K on DGCN is not significant. The baseline algorithms show a significant decreasing trend in the acc score once K exceeds 6. In addition, compared with the baseline algorithms, our algorithms consistently maintain high accuracy scores across various values of K, with the GraphSAGE++DAMC variant showing significant advantages. Specifically, on the Pubmed dataset, GraphSAGE++DAMC outperforms GraphSAGE++Mean, GraphSAGE++Max, and GraphSAGE++DA by 3-4% in accuracy.

Table 10 Auroc values obtained by completing the link prediction task on the Cora dataset
Fig. 8

Accuracy values of different algorithms as K values increase on different datasets

4.3.3 Visualization

We first learn the representation vectors of the vertices with the different models and then map them to a two-dimensional space using t-SNE [38], where vertices of different classes are marked with different colors; an ideal visualization therefore places vertices of the same class close together.
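A minimal sketch of this visualization step is given below, assuming the learned node embeddings H (an N x d array) and integer class labels y are already available; the random placeholders only stand in for real model outputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

H = np.random.rand(300, 128)              # placeholder embeddings (N x d)
y = np.random.randint(0, 6, size=300)     # placeholder class labels
coords = TSNE(n_components=2, random_state=0).fit_transform(H)
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10", s=8)
plt.title("t-SNE of learned node representations")
plt.show()
```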

The visualization results are shown in Fig. 9. We choose the Citeseer dataset for the visualization task. For the baseline algorithms, effective clustering occurs only in small, scattered areas, and the overall picture remains confusing and indistinguishable. Our algorithm separates points of different classes more clearly, and among all algorithms the visualization results of GraphSAGE++DAMC are the most outstanding. The experimental results show that our framework captures the global graph structure more effectively.

Fig. 9

Visualization results on the Citeseer dataset. Each dot represents a document and the color of the dot indicates the category

4.4 Ablation Experiment and Analysis

The ablation experiments aim to analyze and evaluate the impact of different components of the graph neural network model on its performance, in order to reveal the role and importance of these components in the task. In this section, we evaluate the impact of these components on the overall performance on the Cora dataset by removing the proportional weight allocation and concatenation strategies. The value of K is set to 3.

Table 11 shows the impact of the proportional weight allocation strategy on the results of the three concatenation variants of GraphSAGE++. We use Eq. (6) to calculate the dimensionality of each layer's representation. The experimental comparison shows that when the proportional weight allocation strategy is used, GraphSAGE++DA, GraphSAGE++DAC, and GraphSAGE++DAMC all achieve better results. This also confirms our viewpoint in Sect. 3.3 that neighbors at different hops have different impacts on a node, and allocating a fixed dimensionality to each layer's representation according to its weight proportion effectively preserves the important representation information in the node vector.

Table 12 shows the impact of the concatenation strategy on the results. The experiments show that the method using the concatenation strategy (GraphSAGE++) achieves a clear performance improvement over the method without it (GraphSAGE). This also confirms our viewpoint that combining double aggregation with concatenation helps enhance the distinction of structural information and improves the model's expressive capability.

Table 11 Comparison of the accuracy of different GraphSAGE++ models in ablation experiments
Table 12 Comparison of accuracy between GraphSAGE and GraphSAGE++models in ablation experiments

5 Conclusion

In this work, we have proposed the GraphSAGE++ framework, which considers multi-scale graph structure information through a weighted multi-dimensional mechanism and effectively captures both local and global structural information. The final representation preserves neighborhood information from 1 to K hops, successfully overcoming the over-smoothing issue commonly seen in deep architectures of traditional graph neural networks. In addition, strategies combining double aggregation with concatenation are introduced to better distinguish structural information. Experiments show that the method effectively improves the expressive ability of the model. The experimental results confirm that GraphSAGE++ achieves superior performance across various datasets on tasks such as vertex classification, link prediction, and visualization; notably, GraphSAGE++DAMC outperforms existing methods on multiple evaluation metrics.

Despite these results, there remain challenges and room for improvement. The proportional allocation of dimensions to each layer could be combined with the PageRank [39, 40] algorithm, which exploits the linking structure between nodes to better capture their correlation and importance. Furthermore, improving training efficiency and extending our framework to increasingly intricate graph structures remain important objectives for future research.