Correntropy-Induced Wasserstein GCN: Learning Graph Embedding via Domain Adaptation

Abstract-Graph embedding aims at learning vertex representations in a low-dimensional space by distilling information from a complex-structured graph. Recent efforts in graph embedding have been devoted to generalizing the representations from a trained graph in a source domain to a new graph in a different target domain based on information transfer. However, when the graphs are contaminated by unpredictable and complex noise in practice, this transfer problem is quite challenging because of the need to extract helpful knowledge from the source graph and to reliably transfer that knowledge to the target graph. This paper puts forward a two-step correntropy-induced Wasserstein GCN (graph convolutional network, or CW-GCN for short) architecture to facilitate robustness in cross-graph embedding. In the first step, CW-GCN introduces a correntropy-induced loss into GCN, which places bounded and smooth losses on the noisy nodes with incorrect edges or attributes. Consequently, helpful information is extracted only from the clean nodes in the source graph. In the second step, a novel Wasserstein distance is introduced to measure the difference in marginal distributions between graphs while avoiding the negative influence of noise. Afterwards, CW-GCN maps the target graph to the same embedding space as the source graph by minimizing the Wasserstein distance, so that the knowledge preserved in the first step can be reliably transferred to assist target graph analysis tasks. Extensive experiments demonstrate the significant superiority of CW-GCN over state-of-the-art methods in different noisy environments.

I. INTRODUCTION
Graph embedding refers to learning a unique low-dimensional, compact, and continuous vector representation for each graph node [1], [2]. These representations play a substantial role in the success of various tasks, including action recognition [3], object classification [4] and dimensionality reduction [5]. Recently, graph embedding has made great progress by simultaneously capturing topological structures, node attributes and node labels from nonlinearly structured graphs, where different sources of information complement each other [6], [7], [8], [9], [10], [11], [12]. To further generalize the representations from trained graphs to new (sub)graphs, graph embedding methods can be easily extended by defining the embedding as a parametric function of the feature vectors. This extension brings satisfactory generalization under the assumption that the new graph follows the same distribution as the trained graph, meaning that the two graphs come from the same domain. With these representations, if we learn classic machines (e.g., node classifiers or visualization models) using the labels from the trained graph, these machines can be directly applied to the new graph.

However, in some practical applications, it is expensive or even impossible to obtain node labels for graphs in a target domain. In this case, a mature graph from a related but different domain (i.e., the source domain), which contains plenty of label information, can be used as prior knowledge. For instance, given a newly formed social graph (target), no users have labels reflecting their interests and it is difficult to classify the users into groups based on their interests; we can utilize the abundant class information in a mature graph (source). Similarly, for a newly collected protein-protein interaction graph (target) and a well-established protein graph (source), we could classify the proteins of the target graph into different function categories for disease diagnosis by leveraging knowledge from the labeled source graph.

The difference in attribute distributions and structures between target and source graphs limits the generalization of the above representation methods. For instance, suppose the source graph is an English-language graph and the target graph is a French-language graph, where each node represents a document and the link weight is the cosine similarity. Apparently, the two graphs have very different text features, and the same link weights in different graphs indicate different relevances. Even so, the two graphs still share a lot of the same semantic information, which is difficult to uncover.
Many efforts in graph embedding have been made to realize knowledge transfer between two graphs from different domains [13], [14], [15], [16], [17]. Inspired by domain adaptation in machine learning, some recent works have shown their superiority by estimating and minimizing the distribution discrepancy between two graphs [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. Specifically, these works mainly contain two components: 1) extracting information from the source graph (i.e., topological structures, attributes and labels) using a graph embedding model, e.g., a graph convolutional network (GCN) [6] or an autoencoder with a positive pointwise mutual information (PPMI) matrix [28]; 2) transferring information to the target graph by deriving a common latent space in which the discrepancy between the representation distributions of the two graphs is minimized under a measurement, e.g., maximum mean discrepancy (MMD) [29] or the Wasserstein distance [30]. However, the source graph, which plays the key role in knowledge transfer, is often contaminated by unpredictable and severe noise in practice (e.g., attribute and edge pollution exist simultaneously), especially when the graph comes from the web. Coping with various kinds of complex noise is not easy, mainly for the following two reasons. (1) Difficult knowledge extraction. Most existing graph embedding models rely on the assumption that the graph edges and node attributes reflect the likelihood of label agreement. In other words, the nodes in a neighborhood correspond to similar attributes and the same label. When the source graph contains noisy edges and attributes, this noisy information may break this correspondence and make it difficult to extract helpful knowledge from the source graph. (2) Difficult knowledge transfer. Previous discrepancy measurements are easily affected by noise, leading to unreliable graph alignment. For instance, MMD is a nonparametric criterion in a reproducing kernel Hilbert space (RKHS), whose empirical estimate is sensitive to noise. Consequently, minimizing the gap between graphs without the influence of noise remains another central problem.
In this paper, the two aforementioned challenges are tackled by a two-step correntropy-induced Wasserstein GCN (CW-GCN) architecture. In step 1, the low-dimensional representations of the source graph are learned by: 1) detecting and suppressing the severely contaminated nodes that have noisy attributes or links, without any specific assumptions on the noise (e.g., sparsity); 2) distilling three kinds of helpful information (structures, attributes and labels) only from the clean nodes. To this end, CW-GCN introduces the correntropy-induced loss [31], [32] into GCN; this loss is derived from information theory and has a theoretical foundation for handling unpredictable noise and outliers. By introducing auxiliary variables, this step can be optimized via Half-Quadratic (HQ) [33] analysis. The auxiliary variables further pave the way for reliable knowledge transfer in step 2. Specifically, integrating the auxiliary variables with the assumption that the node representations in the same class follow a Gaussian distribution, step 2 explores a robust and efficient Wasserstein distance to measure the difference in marginal distributions between graphs without the influence of noise. By minimizing this Wasserstein distance, the target graph is mapped to the same embedding space as the source graph. Consequently, in the shared embedding space, the rich knowledge preserved from the source graph in the first step (especially the label-discriminative information) is expected to be reliably transferred to assist the target graph analysis tasks. We emphasize that in our two-step paradigm, the target mapping does not share weights with the source mapping, and the source mapping is fixed in step 2. This flexible and powerful paradigm: 1) extracts representations more specific to the target graph; 2) reduces the impact of the contaminated source nodes on the target graph mapping; 3) avoids overly complicated models. The main contributions of our work are summarized as follows.
• Based on the theoretical guarantees of correntropy, we investigate the correntropy-induced loss in graph embedding. The new loss can be generally applied in graph embedding to tackle noisy cases where the usual assumption (i.e., that the graph edges and node attributes reflect the likelihood of label agreement) does not hold.
• A novel Wasserstein distance is introduced to address the scenario in which source and target graph representations have different distributions. Compared with the previous measurements, our distance facilitates both robustness and efficiency in reducing the cross-graph bias.
• The above two components are implemented in a two-step paradigm, which allows independent and asymmetric source and target mappings. This powerful paradigm helps the proposed CW-GCN achieve robust knowledge extraction and knowledge transfer successively.
• Experiments on real-world graphs demonstrate the promise of CW-GCN through considerable improvements on different noisy tasks, compared with state-of-the-art graph embedding and domain adaptation methods.

The rest of this paper is organized as follows. We briefly review related works in Section II. Section III develops the problem definition and the proposed method CW-GCN. Then, in Section IV, we report the experimental results. Finally, concluding remarks are given in Section V.

II. RELATED WORK

A. Graph Embedding
Graph embedding or network embedding learns vector representations to reveal the semantics of the original graphs. Existing graph embedding methods can be divided into two categories. The methods in the first category focus on preserving topological proximities between nodes, e.g., local neighborhood structures and global community structures [34], [35], [36], [37]. To further analyse attributed graphs and obtain better representations, the methods in the second category aim at jointly embedding the graph structure, vertex attributes and labels, based on the assumption that two nodes connected by an edge likely have similar attributes and the same label [6], [7], [8], [9], [10], [11], [12], [38]. In this line, GCN [6] is among the most successful paradigms, which extends existing convolutional neural networks to processing graphs. Specifically, GCN directly embeds the graph structure and the node attributes with a spectral convolutional function at each layer, and minimizes the cross-entropy error over all labeled nodes. GCN has been successfully applied in broad areas [3], [4], and various models based on GCN have been proposed [11], [38], [39], [40]. For instance, GAM [38] applies an agreement model on top of GCN to propagate labels in the semi-supervised setting. The attributed graph embedding methods show great generalization capability by assuming that training and test graphs follow the same distribution. Based on the learnt representations, off-the-shelf supervised learning machines such as node classifiers or visualization models can be directly applied to the new graph using the labels from the trained graph.
The problem of realizing knowledge transfer between two graphs from different distributions has received increasing attention [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. For instance, some methods assume that some links across the two graphs or some common nodes are available and utilize these pieces of prior information for graph alignment [15], [16], [17]. Another line of works relies on shallow structures [13], [14] that perform linear transformations and are thus limited to linearly structured data. Recently, cross-graph embedding methods based on domain adaptation and end-to-end network architectures [18], [19], [20], [21], [22], [23], [24], [25], [26], [27] have been suggested to have the potential to outperform the traditional methods. For instance, CDNE [18] constructs two autoencoders for the source and target graphs respectively, where the input is the PPMI [28] matrix capturing the graph structure. The MMD distances between the two graphs are then minimized for graph-invariant representations. ACDNE [19] constructs PPMI-based autoencoders as well and employs an adversarial distance measurement as in DANN [41]. ASN [22] adopts a dual GCN model to combine local and global consistency when capturing the network topology, and also uses an adversarial distance measurement to reduce the distribution discrepancy across domains. AdaGIn [25] employs spatial GNN layers to compute node representations for the source and target graphs; afterwards, conditional adversarial networks [42] are employed to reduce the domain discrepancy. SR-GNN [21] focuses on the use of CMD [43] and MMD as distance metrics to measure the distribution discrepancy efficiently. Different from the aforementioned methods, we develop a correntropy-loss-based GCN to handle the complex uncertainty caused by unpredictable noise, resulting in clean information preservation from the noisy source graph. Furthermore, the Wasserstein distance is extended to efficiently and robustly adapt the marginal distributions between graphs.
Most recently, Graph Augmentation Learning (GAL) [44], [45], [46], [47] has provided promising solutions to restrictions in graph learning, e.g., low-quality node attributes or low-quality graph structures. Specifically, LA-GNN [46] proposes a local augmentation strategy to learn representations of nodes with few neighbors. PTDNet [44] enhances the robustness of GNNs by removing task-irrelevant edges. DropEdge [45] randomly removes graph edges in the message passing mechanism to alleviate over-smoothing. Pro-GNN [47] jointly learns a structural graph and a robust graph neural network model to defend against adversarial attacks on graphs, i.e., adding, deleting or rewiring edges. Our work differs from the GAL methods in two aspects. First, we introduce the empirical correntropy-induced loss into graph embedding to effectively handle the challenges that arise when the graph data are contaminated by complex noise in real-world applications (e.g., noisy edges and corrupted node attributes exist simultaneously). Second, our work focuses on reliably transferring knowledge from the source graph to the target graph. Therefore, a robust Wasserstein distance is explored based on the C-loss to measure the difference in marginal distributions between graphs without the influence of noise.

B. Domain Adaptation
Feature extraction based domain adaptation in machine learning aims at deriving a common latent space by minimizing the distribution discrepancy between two domains using different measurements (strategies) [48]. Recently, building a deep structure for nonlinear spaces has been regarded as a powerful way to bridge the distribution gap [41], [42], [43], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58]. In this deep family, two categories are mainly explored: matching the marginal distributions of two domains (e.g. [41], [43], [49], [50]); and implicitly or explicitly aligning the class-conditional distributions based on the target pseudo-labels provided by the source classifier (e.g. [42], [56], [58]). The p-th Wasserstein distance [59] is a distance measure between two probabilities P and Q, defined as

W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} \rho(x, y)^p \, d\gamma(x, y) \right)^{1/p},   (1)

where P, Q \in \{P : \int \rho(x, y)^p dP(x) < \infty, \forall y \in M\} are two probability measures on the set M with order p and \Gamma(P, Q) is the set of all measures on M \times M with marginals P and Q. The 1-st Wasserstein distance [30], [60] has been successfully applied in domain adaptation methods [50], [59], showing gradient superiority and theoretical advantages (e.g., a generalization guarantee) compared with other measurements. These methods utilize the 1-st Wasserstein distance to adapt the marginal distributions between domains, where the distance is estimated by constructing and training a multi-layered network in a compact space. In each step towards minimizing the distance, the multi-layered network must first be iteratively trained via weight clipping, which greatly increases the overall complexity and may cause gradient vanishing or exploding problems [61]. In addition, the applied distance is easily affected by real-world noise, leading to degraded (or even negative) information transfer.
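As a small numerical illustration of the 1-st Wasserstein distance between one-dimensional empirical distributions (a minimal sketch, not the neural estimator discussed above; the distributions and sample sizes are arbitrary):

```python
# Minimal 1-D illustration of the 1-st Wasserstein distance between two
# empirical distributions; the chosen distributions are arbitrary examples.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=1000)  # "source domain" samples
target = rng.normal(loc=0.5, scale=1.2, size=1000)  # shifted and rescaled "target" samples

# For 1-D samples the distance has a closed form; no network training is needed.
print(wasserstein_distance(source, target))
```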

C. Correntropy
In information-theoretic learning, correntropy [31] is a similarity measure between two random variables X and Y:

V(X, Y) = E[k(X - Y)],   (2)

where k(·) is a translation-invariant Mercer kernel function and E[·] denotes the expectation operator. Correntropy is essentially a correlation in an RKHS and has a close relationship with M-estimation [62]. Given a finite number of samples \{(x_i, y_i)\}_{i=1}^{n}, the empirical correntropy-induced loss function, or C-loss function, is

l_c = \sum_{i=1}^{n} \left[ 1 - k(x_i - y_i) \right].   (3)

When the Gaussian kernel is considered, the C-loss is proven to have the following nice properties [63], [64]: 1) it is Bayes consistent; 2) it embeds higher-order statistics; 3) it behaves like the L_2-norm for small error vectors and like the L_0-norm for large error vectors. As a bounded, smooth, and nonconvex loss, the C-loss has been successfully applied in many applications (e.g., face recognition, signal processing) and has been shown to be applicable under a variety of unpredictable noisy environments (e.g., missing entries, dense corruptions or heavy-tailed noise) [65], [66], [67]. In this paper, the proposed CW-GCN is designed for graphs with node-to-node interactions and dependencies; it investigates the C-loss in GCN and makes it possible to boost graph embedding performance on noisy graphs. Furthermore, CW-GCN can be optimized in an HQ manner, where the robustness is explicitly explained.
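As a minimal illustration of this behavior, the following sketch (with an assumed kernel width and normalization, not the exact formulation used in the rest of the paper) compares the Gaussian-kernel C-loss with the squared loss on toy residuals containing one gross outlier:

```python
import numpy as np

def c_loss(y_pred, y_true, mu=1.0):
    """Empirical correntropy-induced (C-)loss with a Gaussian kernel.

    Each residual e_i is penalized by 1 - exp(-e_i^2 / mu^2): roughly quadratic
    for small errors and bounded (saturating toward 1) for large ones, so gross
    outliers contribute only a limited amount to the total loss.
    """
    e = np.linalg.norm(y_pred - y_true, axis=1)      # per-sample error magnitude
    return np.sum(1.0 - np.exp(-(e ** 2) / mu ** 2))

# Toy comparison on residuals with a single large outlier in the last row.
y_true = np.zeros((5, 2))
y_pred = np.array([[0.1, 0.0], [0.0, 0.2], [0.1, 0.1], [0.0, 0.1], [10.0, 10.0]])
print("C-loss :", c_loss(y_pred, y_true, mu=1.0))    # outlier adds at most 1
print("L2 loss:", np.sum((y_pred - y_true) ** 2))    # outlier dominates
```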

III. PROPOSED METHOD

A. Problem Definition
Let G_s = (V_s, E_s) denote the labeled source graph with n nodes, where V_s is the node set and E_s is the edge set; X_s collects the d_s-dimensional attribute vectors of the source nodes, Y_s records their labels, and the topological structure of G_s is represented by an adjacency matrix A_s \in R^{n \times n}. Analogously, let G_t = (V_t, E_t) denote the unlabeled target graph with m nodes, where X_t collects the d_t-dimensional attribute vectors. d_t is assumed to be equal to d_s for simplicity, but our method can be easily generalized to the setting d_t \neq d_s. The topological structure of G_t is represented by an adjacency matrix A_t \in R^{m \times m} as well. These two related graphs have the following properties: 1) the marginal probability distributions of X_t and X_s are not equal, P_t(X_t) \neq P_s(X_s); 2) E_s and E_t indicate different relevances associated with the attributes; 3) the graphs share the same label space. The task of our method is to uncover the shared space between G_s and G_t by preserving knowledge from G_s and minimizing the graph bias. Consequently, in this shared space, if we train a graph analysis model using the labels from G_s, the model is expected to perform well on G_t. Fig. 1 shows the workflow of our architecture, which consists of the following two steps.
1) The first step aims at mapping the source nodes to low-dimensional vectors, where two fundamental questions need to be addressed: how to preserve the topological structure A_s, the content information X_s and the label information Y_s? How to detect the contaminated nodes in G_s with noisy attributes or links and eliminate their negative impact on the mapping process?
2) The second step maps the target graph to the same embedding space as the source graph by solving the question: how to efficiently and robustly match the marginal probability distributions of source and target graph representations?

Fig. 1. Framework of our proposed method. We first learn the source GCN and classifier using the labeled source graph. In the process, the noisy nodes (containing noisy links or corrupted attributes, or both) are suppressed and their negative impacts are eliminated. Next, we learn the target GCN based on the extended Wasserstein distance, which also avoids noisy source nodes. At test time, the target nodes in the shared space are classified by the source classifier learned in the first step.

B. The First Step: Representation Learning for G_s

1) Feature Extractors: To encode the graph structure A_s and the content information X_s directly, we adopt GCN [6] in this paper. Note that our graph learning framework is general to various graph embedding models. Specifically, the following spectral convolutional function is defined to build the transformation for each layer:

Z_s^{k+1} = f(Z_s^k, A_s) = \sigma\left( \tilde{D}_s^{-1/2} \tilde{A}_s \tilde{D}_s^{-1/2} Z_s^k W_s^k \right),

where \tilde{A}_s = A_s + I is the adjacency matrix with added self-connections, (\tilde{D}_s)_{ii} = \sum_j (\tilde{A}_s)_{ij}, Z_s^k is the matrix of activations in the k-th layer, and W_s^k is the parameter matrix to be learned.
Based on the well-defined spectral convolution function, arbitrarily deep convolutional neural networks can be constructed. In this work, we adopt two layers to map the node features to the node embedding space:

Z_s^1 = f(X_s, A_s), \quad Z_s^2 = f(Z_s^1, A_s).

2) The Overall Objective Function: To make Z_s^1 and Z_s^2 label-discriminative, a fully connected layer is added as the classifier,

\hat{Y}_s = \sigma\left( Z_s^2 W_s^y \right),

where \sigma is a softmax function for multi-class nodes or a sigmoid function for multi-label nodes, and W_s^y is the trainable parameter matrix in the fully connected layer. The output \hat{Y}_s = [\hat{y}_s^1, \ldots, \hat{y}_s^n], \hat{y}_s^i \in R^{c \times 1}, predicts a probability distribution for v_s^i over a set of c classes. Finally, the whole network is trained by minimizing the loss function l(\hat{Y}_s, Y_s) over all labeled source nodes. In the basic case without noise contamination, the cross-entropy function or the L_2-norm function is commonly employed as l(·). Note that the spectral convolutional function implies that the representation of each node in Z_s^1 or Z_s^2 depends only on its own attributes and the attributes of its neighbors. In this sense, minimizing the cross-entropy or L_2-norm function relies on the basic assumption that the nodes in a neighborhood correspond to the same label. That is, the quality of the learnt representations relies on the quality of G_s.
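For concreteness, a minimal PyTorch sketch of such a two-layer GCN with a fully connected classifier is given below; the hidden sizes, the ReLU activation, the dense-matrix normalization and all names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, the propagation matrix used by each layer."""
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(d.pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

class TwoLayerGCN(nn.Module):
    """Two spectral convolution layers followed by a fully connected classifier."""

    def __init__(self, d_in, d_hidden, d_embed, n_classes):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_hidden, bias=False)     # plays the role of W_s^1
        self.W2 = nn.Linear(d_hidden, d_embed, bias=False)  # plays the role of W_s^2
        self.Wy = nn.Linear(d_embed, n_classes)             # classifier weights W_s^y

    def forward(self, X, A_norm):
        Z1 = F.relu(A_norm @ self.W1(X))   # first-layer representations Z^1
        Z2 = A_norm @ self.W2(Z1)          # second-layer representations Z^2
        logits = self.Wy(Z2)               # class scores before softmax/sigmoid
        return Z1, Z2, logits
```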
In noisy conditions, the source nodes consist of two parts: the clean nodes and the contaminated nodes (i.e., the outliers). When we assume nothing about the noise, it is hard to separate these two parts. The outliers in G_s essentially contain noisy links or corrupted attributes, or both. The distinctive attributes between the outliers and their neighbors may result in different labels, thus leading to large prediction errors. However, the traditional cross-entropy or L_2-norm functions are sensitive to noise and place great emphasis on minimizing the prediction error of these contaminated nodes. Consequently, based on the learnt Z_s^1 and Z_s^2, the low-dimensional representations of the clean nodes lose the ability to preserve reliable information. To optimally eliminate the negative influence of the contaminated nodes, it is necessary to explore a robust measurement of the prediction error. We introduce the empirical C-loss into graph embedding. Defining e_i = \|\hat{y}_s^i - y_s^i\|_2 and φ(e_i) = 1 - \exp(-e_i^2/\mu^2), the C-loss in Eq. (3) with the Gaussian kernel can be written as l_c(\hat{Y}_s, Y_s) = \sum_{i=1}^{n} φ(e_i). Accordingly, the following objective with an L_2-norm regularization is obtained:

\min_{W_s^1, W_s^2, W_s^y} \ \sum_{i=1}^{n} \left( 1 - \exp\left(-\|\hat{y}_s^i - y_s^i\|_2^2 / \mu^2\right) \right) + \alpha \left( \|W_s^1\|_F^2 + \|W_s^2\|_F^2 + \|W_s^y\|_F^2 \right),   (7)

where \mu is the Gaussian kernel width and \alpha is the trade-off parameter. The column-wise error \|\hat{y}_s^i - y_s^i\|_2^2 penalizes the error corresponding to a single node as a whole; it is derived from the L_{2,1}-norm (group sparsity) and is used to control the node-specific error. In Eq. (7), the outliers with large prediction errors obtain bounded, nearly constant C-loss values and thus have only a limited impact on the minimization. To give a clear illustration of the robustness, we depict some functions φ(e) in Table I. As e increases, the curves flatten out. In this sense, φ(e) imposes bounded penalties on large outliers, where the kernel width \mu controls the slope of φ(e). The explanation of b in Table I is given in the following section. With this robust measurement of the prediction error, the learnt Z_s^1 and Z_s^2 successfully capture the structure proximity, the attribute proximity and the label-discriminative information.
3) Optimization and Robustness Analysis: In this section, we explicitly analyse the robustness of Eq. (7) and propose an iterative learning procedure. Based on the convex conjugate function theory [68], we have Lemma 1 [69].
Lemma 1: Consider a function φ that satisfies the following hypotheses: φ is even, φ is continuous near zero, and φ'(x)/(2x) admits a finite limit as x → 0. Then the function φ can be expressed as

φ(x) = \min_{b} \left\{ b x^2 + ψ(b) \right\},   (8)

where the minimum is reached at

b_x = φ'(x)/(2x) \ \text{if} \ x \neq 0, \quad b_x = \lim_{x \to 0} φ'(x)/(2x) \ \text{otherwise},   (9)

and the expression of ψ(b) is not required to compute b_x. As a consequence of Lemma 1, Proposition 1 can be derived, which enables Eq. (7) to be minimized in an HQ way.
From Proposition 1, Eq. (7) is translated into the augmented objective function

\min_{W_s^1, W_s^2, W_s^y, B} \ \sum_{i=1}^{n} \left[ b_i \|\hat{y}_s^i - y_s^i\|_2^2 + ψ(b_i) \right] + \alpha \left( \|W_s^1\|_F^2 + \|W_s^2\|_F^2 + \|W_s^y\|_F^2 \right),   (11)

where B = \{b_1, \ldots, b_n\} are called auxiliary variables. Inheriting from HQ analysis [69], Eq. (11) can be minimized by an alternating strategy. The detailed procedure is given in Algorithm 1 and contains two components: 1) updating W_s^1, W_s^2 and W_s^y using gradient descent, with ψ(b_i) treated as a constant; 2) directly obtaining the closed form of b_i following Proposition 1.

Algorithm 1 Step 1: Representation Learning for G_s

Remark 1: The auxiliary variable b in Algorithm 1 explicitly explains the robustness. A polluted node, which may have dissimilar attributes and a different label from its neighbours, will yield a large prediction error based on GCN. Consequently, the polluted node is assigned a small value of b following Proposition 1. Table I depicts some specific functions φ(e) and their corresponding auxiliary variables b. As shown in the table, the auxiliary variable decreases greatly as the error increases. Consequently, b, which acts as the weight in the loss function in Eq. (11), diminishes the negative influence of the polluted node on updating W_s^1, W_s^2 and W_s^y. Inheriting the superiority of correntropy in robust learning, b can effectively deal with noisy graphs in challenging conditions with different kinds of complex noise. We emphasize that b also plays an important role in the knowledge transfer described in the next section.
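To make the alternating procedure concrete, the following sketch (continuing the TwoLayerGCN sketch above) performs one HQ iteration; the closed form exp(-e_i^2/µ^2) for b_i is an assumed instance of Proposition 1, up to the paper's exact scaling.

```python
import torch

def hq_training_step(model, optimizer, X, A_norm, Y_onehot, mu=1.0, alpha=5e-6):
    """One alternating HQ iteration, assuming the TwoLayerGCN sketch above.

    The closed form b_i = exp(-e_i^2 / mu^2) is an assumed instance of
    Proposition 1 (up to the paper's exact scaling).
    """
    _, _, logits = model(X, A_norm)
    probs = torch.softmax(logits, dim=1)

    # (a) Auxiliary variables: b_i shrinks toward 0 for nodes with large errors.
    with torch.no_grad():
        e = torch.norm(probs - Y_onehot, dim=1)
        b = torch.exp(-(e ** 2) / mu ** 2)

    # (b) Gradient step on the weights with b fixed (psi(b_i) is then a constant).
    weighted_loss = torch.sum(b * torch.sum((probs - Y_onehot) ** 2, dim=1))
    reg = alpha * sum((p ** 2).sum() for p in model.parameters())
    loss = weighted_loss + reg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return b  # reused later as node weights in the Wasserstein step
```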

C. The Second Step: Representation Learning for G_t

1) Feature Extractors: We adopt a two-layered GCN as well, which maps G_t to the same space as G_s:

Z_t^1 = f(X_t, A_t), \quad Z_t^2 = f(Z_t^1, A_t),

where \tilde{A}_t = A_t + I, (\tilde{D}_t)_{ii} = \sum_j (\tilde{A}_t)_{ij}, and W_t^1 and W_t^2 are the parameter matrices to be learned.
2) The Wasserstein Distance: Given the node representations z_t^l \in Z_t^l and z_s^l \in Z_s^l (l = 1, 2), we further assume P_s(z_s^l) and P_t(z_t^l) to be Gaussian distributions: P_s(z_s^l) = N(m_s^l, C_s^l), P_t(z_t^l) = N(m_t^l, C_t^l), where m_{s,t}^l is the mean of the l-th layer and C_{s,t}^l is the covariance. We impose the 2-nd Wasserstein distance W_2(P, Q), as defined in Eq. (1), to measure the distribution gap. Based on the simplified form of the 2-nd Wasserstein distance under the Gaussian assumption [70], [71], the following novel 2-nd Wasserstein distance for noisy cases is introduced:

W_2\big(P_s(z_s^l), P_t(z_t^l)\big)^2 = \|m_s^l - m_t^l\|_2^2 + \|\sigma_s^l - \sigma_t^l\|_2^2,   (14)

where m_s^l = \sum_i b_i z_{s,i}^l / \sum_i b_i is the weighted mean of the nodes in G_s with the corresponding non-negative weights B, and \sigma_s^l is the unbiased estimate of the weighted sample variance [72]. Compared with the previous 1-st Wasserstein distance [30], [50], [60], the 2-nd Wasserstein distance under the Gaussian assumption circumvents training a multi-layered network and has shown its effectiveness in face recognition [70] and image generation [71]. However, the standard 1-st or 2-nd Wasserstein distance treats each data sample equally. The contaminated nodes in G_s would seriously affect the computation of the mean and covariance, and may lead to poor knowledge transfer. By contrast, Eq. (14) is based on the weights B, which have a clear foundation of robustness. The contaminated nodes with low weights contribute little to m_s^l and \sigma_s^l, and thus have limited impact on the distance measurement.
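A minimal sketch of this robust distance is given below, assuming diagonal covariances and reusing the weights b from step 1; the bias correction of the weighted variance is omitted for brevity, and all names are illustrative.

```python
import torch

def robust_w2(Z_s, Z_t, b, eps=1e-8):
    """Weighted 2-nd Wasserstein distance between layer-wise representations,
    assuming Gaussian marginals with diagonal covariance. The weights b from
    step 1 downweight contaminated source nodes; bias correction is omitted."""
    w = b / (b.sum() + eps)
    m_s = (w.unsqueeze(1) * Z_s).sum(dim=0)                  # weighted source mean
    var_s = (w.unsqueeze(1) * (Z_s - m_s) ** 2).sum(dim=0)   # weighted source variance
    m_t = Z_t.mean(dim=0)                                    # target mean
    var_t = Z_t.var(dim=0, unbiased=False)                   # target variance
    return torch.sum((m_s - m_t) ** 2) + torch.sum((var_s.sqrt() - var_t.sqrt()) ** 2)
```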
3) The Overall Objective Function and Optimization: Combining the robust Wasserstein distance in Eq. (14) with an L_2-norm regularization, the loss function for training W_t^1 and W_t^2 is

\min_{W_t^1, W_t^2} \ \sum_{l=1}^{2} W_2\big(P_s(z_s^l), P_t(z_t^l)\big)^2 + \gamma \left( \|W_t^1\|_F^2 + \|W_t^2\|_F^2 \right),   (15)

where \gamma is the trade-off parameter balancing the two terms. In Eq. (15), W_t^1 and W_t^2 have unshared, independent weights with respect to W_s^1 and W_s^2, and W_s^1 and W_s^2 are fixed during training. This flexible strategy reaps the following advantages: 1) it allows G_t to obtain feature representations more specific to its own graph properties, and thus effectively uncovers the common semantic information between the two different graphs; 2) the noisy source nodes suppressed in the first step have limited impact on training W_t^1 and W_t^2; 3) overly complicated models with degenerate solutions are avoided.
Consequently, the target model is optimally modified to match the source model, and thus the marginal distributions between graphs are adapted (i.e., P t (z l t ) ≈ P s (z l s )). We then assume that such source and target models satisfy P t (y t | z l t ) ≈ P s (y s | z l s ), which is the covariate shift assumption in transfer learning [73]. As a result, the label-discriminative information preserved in the first step is expected to be transferred from G s to assist the graph analysis tasks on G t .

Algorithm 2 Step 2: Representation Learning for G_t

The overall procedure of iteratively optimizing W_t^1 and W_t^2 is summarized in Algorithm 2. The gradients of the first term in Eq. (15) can be computed in closed form, where ϵ is a small constant. Note that since the target graph has no label access, the target model may quickly learn a degenerate solution without proper initialization. Therefore, the learnt source model is used as the initialization for the target model.
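A compact sketch of this second step, reusing the TwoLayerGCN and robust_w2 sketches above, is shown below; the epoch count, learning rate and the decision to freeze the classifier layer are illustrative assumptions rather than the paper's exact training schedule.

```python
import copy
import torch

def train_target_gcn(source_model, X_t, A_t_norm, X_s, A_s_norm, b,
                     gamma=5e-6, lr=1e-2, epochs=200):
    """Step 2 sketch: train the target GCN by minimizing the robust W2 distance only."""
    # Initialize the target model from the frozen source model to avoid
    # degenerate solutions (the target graph has no labels).
    target_model = copy.deepcopy(source_model)
    target_model.Wy.requires_grad_(False)            # classifier from step 1 stays fixed
    params = [p for p in target_model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)

    with torch.no_grad():
        Z1_s, Z2_s, _ = source_model(X_s, A_s_norm)  # fixed source representations

    for _ in range(epochs):
        Z1_t, Z2_t, _ = target_model(X_t, A_t_norm)
        loss = robust_w2(Z1_s, Z1_t, b) + robust_w2(Z2_s, Z2_t, b)
        loss = loss + gamma * sum((p ** 2).sum() for p in params)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return target_model
```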

IV. EXPERIMENTS
In this section, we systematically evaluate different properties of CW-GCN using a set of experiments on multiple real-world datasets. The experimental results demonstrate the robustness of our architecture to different types of noise on single-graph and cross-graph classification tasks, compared with the state-of-the-art methods. The source code of our method is provided at https://github.com/CocoLab-2022/CW-GCN.

A. Robust Single-Graph Node Classification
In this section, our C-loss is tested for traditional single-graph embedding under two conditions successively: no noise; noisy edges and attributes.
1) Data Preparation: Cora [74] and Citeseer [74] are two benchmark graph datasets, where each node represents an article and each edge represents the citation relationship. Each article has only one class label indicating one topic. The features of nodes are sparse bag-of-words vectors indicating whether each unique word is present in each article. These two datasets have been the de facto standard for evaluating graph node classification and their statistics are shown in Table II.

2) Comparison Methods and Experiments Settings:
Note that our C-loss function can be easily integrated into various graph embedding models. In this section, we consider three base models: GCN [6], ANRL [75] and GAT [12]. Our methods are denoted as C-GCN, C-ANRL and C-GAT, which have the same network structures as the respective base models, except that the C-loss is employed on the output. To further demonstrate the robustness of the C-loss, we also include PTDNet [44] from the category of graph augmentation learning. All the comparison methods are implemented based on the source codes provided by the authors, and we follow the same parameter settings respectively. The main hyperparameter in our models is the Gaussian kernel width µ, which we empirically search in the range [1, 10].
3) Experimental Results on Original Graphs: We use the train/validation/test splits in [6] and [10], where 20 labels per class are available during training. Table III reports the mean classification results on the test nodes over 20 runs with random weight initializations. The best results of the baselines are taken from the original papers. Note that we report Micro-F1 and Macro-F1 for ANRL and C-ANRL, as in [75], for a fair comparison. As can be seen from the table, the C-loss variants achieve results comparable to their base models, showing that the C-loss is also effective on clean graphs without noise pollution. In particular, PTDNet improves the performance of GCN by actively removing task-irrelevant edges or decreasing their weights.
4) Experimental Results on Noisy Graphs: In this section, we simultaneously add two types of complex noise to the graph datasets. (a) Noisy edges: 30 percent of the labeled training nodes are randomly chosen; their original correct edges are deleted, and then 15 random edges are added to each chosen node. (b) Noisy features: 30 percent of the labeled training nodes are randomly chosen as well; their features are replaced with i.i.d. samples from a typical heavy-tailed noise, i.e., the Cauchy distribution. The Cauchy noise is centered at 0 with scale parameter S = 1. Table IV summarizes the comparison results over 20 runs with random weight initializations and random noise pollution. By introducing the C-loss function, our methods consistently outperform the base models. For instance, C-GCN improves over GCN from 72.1% to 74.7% on Cora, and from 63.4% to 65.9% on Citeseer. This loss replacement is simple and does not incur any additional computational cost, yet brings powerful results. The severe corruptions limit the learning capability of PTDNet and may lead to suboptimal robustness and generalization. In summary, correntropy is a more effective loss for dealing with heavily polluted edges and attributes in graph embedding.

TABLE IV. NODE CLASSIFICATION RESULT ON NOISY GRAPHS

TABLE V. DATASET STATISTICS IN CROSS-GRAPH EXPERIMENTS
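For concreteness, a sketch of the noise model described above (noisy edges plus Cauchy-corrupted features) is given below; the helper name, the symmetric-adjacency assumption and the dense matrix representation are ours, not part of the released code.

```python
import numpy as np

def corrupt_graph(A, X, train_idx, frac=0.3, n_new_edges=15, cauchy_scale=1.0, seed=0):
    """Sketch of the noise model above: for a fraction of the labeled training
    nodes, drop their original edges, add random edges, and replace their
    features with heavy-tailed (Cauchy) samples."""
    rng = np.random.default_rng(seed)
    A, X = A.copy(), X.copy()
    n = A.shape[0]
    chosen = rng.choice(train_idx, size=int(frac * len(train_idx)), replace=False)
    for i in chosen:
        A[i, :] = 0
        A[:, i] = 0                                        # delete the original (correct) edges
        targets = rng.choice(n, size=n_new_edges, replace=False)
        A[i, targets] = 1
        A[targets, i] = 1                                  # add random, likely incorrect, edges
        X[i] = rng.standard_cauchy(X.shape[1]) * cauchy_scale  # heavy-tailed attribute noise
    return A, X
```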
B. Robust Cross-Graph Node Classification

1) Data Preparation: DBLPv7 (D7) [76], Citationv1 (C1) [76] and ACMv9 (A9) [76] are three public citation graphs (nodes correspond to articles and edges to citations) from ArnetMiner. They are extracted from three different original sources: the DBLP Computer Science Bibliography, the Microsoft Academic Graph and the Association for Computing Machinery. Following the previous work [18], [19], union bag-of-words features are constructed for each node, and each node has multiple labels indicating its research topics. Blog1 (B1) [77] and Blog2 (B2) [77] are two public social graphs (nodes correspond to bloggers and edges to friendships) from the BlogCatalog dataset. The keywords of the bloggers' self-descriptions compose the node attributes, and each node is associated with one label indicating its interest group. Dataset statistics are summarized in Table V. Note that, compared with the citation graphs, each node in the social graphs has a larger number of neighbors and a higher feature dimension. The attribute distributions and the node-to-node interactions across the above graphs are related but vary to some extent. By selecting pairs of graphs as source and target respectively, we construct 8 cross-graph node classification tasks: C1 → D7, D7 → C1, C1 → A9, A9 → C1, A9 → D7, D7 → A9, B1 → B2 and B2 → B1.

2) Comparison Methods and Experiments Settings: We systematically compare our method with several state-of-the-art domain adaptation and graph representation algorithms.
• Logistic Regression Classifier (LR) is trained using labeled source nodes and is directly applied for node classification in target graph.
• DANN [41] and WDGRL [50] minimize the distribution gap using node attributes and source node labels. They are not designed for graphs and neglect the powerful node-to-node interactions.
• CDNE [18] and ASN [22] jointly combine structure proximity, node attributes and source node labels for graph embedding. Meanwhile, the distance between two graphs is minimized for graph-invariant representations.
• CGDM+GCN adopts GCN as the feature extractor in the framework of the domain adaptation algorithm CGDM [52].
• CW-GCN is our proposed method for robustly preserving knowledge and minimizing the graph bias.
The proposed CW-GCN is initialized using Glorot initialization and trained using the Adam optimizer. In the first step, the weight of the L_2-norm regularization, α, is chosen from {5 × 10^-6, 5 × 10^-7}: α is set to 5 × 10^-6 on the citation graphs and to 5 × 10^-7 on the social graphs. The learning rate is 0.01, and dropout with p = 0.5 is applied to the first and second layers. The maximum number of epochs is 5000, and we stop training early when the results on the source graph reach 95%. The Gaussian kernel width µ is chosen from {10, 100, 1000}. In the second step, the weight of the L_2-norm regularization, γ, is 5 × 10^-6, the learning rate is chosen from {0.001, 0.01}, and dropout with p = 0.1 is applied to the first and second layers. We stop training if the training loss does not decrease for 1000 consecutive epochs. In both steps, the hidden dimensionality of the first layer is set to 256 and the hidden dimensionality of the second layer is set to 128.
The implementations of the comparison methods released by the original authors are used. For a fair comparison, the node representations in each baseline are also set to 128 dimensions. ASNE, ANRL, SEANO and Planetoid sample node sequences from the graph for structure preservation based on the skip-gram model [78]. Following the same setting as [18], a unified graph is constructed for these methods, where the first n nodes are from the source graph and the last m nodes are from the target graph; therefore, the sequences sampled from the target nodes provide additional information. Note that the training processes of GCN and GAT cannot utilize the target graph since there is no edge between the two graphs. For each experiment, all labeled source nodes and unlabeled target nodes are used for training.

3) Experimental Results on Citation Graphs: In this section, our method CW-GCN is evaluated under three conditions successively: no noise; noisy edges; noisy features.
In the no-noise condition, we report the Micro-F1 and Macro-F1 [79] results on the unlabeled target nodes over 10 random weight initializations in Table VI. Micro-F1 and Macro-F1 are widely used for evaluating classification performance on multi-label datasets. Results for DANN, ANRL, SEANO, GCN and CDNE are taken from the CDNE paper [18]. The results can be summarized as follows.
Firstly, the domain adaptation methods DANN and WDGRL improve the Micro-F1 and Macro-F1 of LR on all the datasets, which verifies that bridging the gap between graphs is necessary in these challenging node classification tasks. Secondly, the traditional graph embedding methods (i.e., ANRL, ASNE, SEANO, Planetoid, GCN, GAT) consistently outperform DANN and WDGRL. The results show that preserving the topological structure is important for obtaining valuable graph representations. Thirdly, the graph adaptation methods CDNE, ASN and CW-GCN always achieve higher results than CGDM+GCN. The possible reason is that the specific losses in CGDM designed for non-graph data (e.g., the self-supervised loss) are not suitable for complex graph structures. Finally, CW-GCN, ASN and CDNE obtain comparable performance; all of them successfully distill different kinds of information (i.e., structure proximity, node attributes and node labels) from the source graph while reducing the graph discrepancy.
CW-GCN is proposed with the aim of extracting and transferring reliable information from a severely polluted source graph with noisy edges or features, that is, when the graph edges and node attributes do not reflect the likelihood of label agreement. To demonstrate this ability of CW-GCN in detail, we further conduct experiments under two different noisy conditions successively. (a) 10 percent of the labeled source nodes are randomly chosen; their original correct edges are deleted, and then 100 random edges are added to each chosen node. (b) 10 percent of the labeled source nodes are randomly chosen; their features are replaced with i.i.d. samples from a Cauchy distribution centered at 0 with scale parameter S = 1. Table VII shows the results over 10 random repetitions under the first noisy condition. Note that LR, DANN and WDGRL are not affected by the noisy edges. The following observations can be drawn.
Firstly, the performances of ANRL and GAT are a little better than in the respective basic cases. A possible explanation for ANRL is that it minimizes an autoencoder loss between the reconstruction output of each node and its neighbors; inheriting from the denoising autoencoder [80], the loss between a chosen node (i.e., correct attributes) and its corrupted version (i.e., partially destroyed neighbors) may yield better performance. A possible explanation for GAT is that it assigns different importance to nodes of the same neighborhood, and thus the chosen nodes may obtain more informative neighbors from the random edges. Secondly, the noisy edges have different levels of negative impact on the performances of the other methods (i.e., ASNE, SEANO, Planetoid, GCN, CDNE, ASN and CGDM+GCN). In particular, the performance of the cross-graph method ASN greatly degrades, from 76.29% to 29.50% in average Micro-F1 and from 74.99% to 9.09% in average Macro-F1. The possible reason is that ASN employs a reconstruction loss to maintain the graph structure, while the reconstruction process is highly influenced by the noisy edges. These results demonstrate that it is hard to extract helpful information from the source graph and to minimize the graph gap under this noisy case. Finally, putting emphasis on promoting robustness, our method consistently obtains the best result by a significant margin in each of the six tasks.
The average Micro-F1 and Macro-F1 results under the second noisy condition are shown in Table VIII, from which we draw the following observations. Firstly, LR performs better than under the original conditions. The possible reason is that the distributions of the source and target graphs may overlap more with the added noise, which may be helpful for direct classification on the target graph. However, the noise makes it more difficult for DANN and WDGRL to estimate and reduce the distribution gap. Secondly, the performances of ANRL, ASNE, SEANO, Planetoid, GCN and GAT degrade to different degrees with the noisy attributes (e.g., SEANO: -40.63% in average Micro-F1 and -55.90% in average Macro-F1). Thirdly, in these challenging tasks, the performances of CDNE and CGDM+GCN deteriorate heavily (e.g., CGDM+GCN: -50.07% in average Micro-F1 and -57.89% in average Macro-F1). Finally, the performance of CW-GCN is only slightly influenced by the noise. In particular, the performances of CW-GCN on D7 → A9 and A9 → C1 are a little better than in the respective basic cases. A possible explanation is that after the elimination of the noisy nodes, the more informative data play a more important role in training, leading to improved knowledge transfer. In summary, our method achieves much higher Micro-F1 and Macro-F1 results than the other competitors on all the datasets, which illustrates the great robustness of CW-GCN in handling a variety of unpredictable noisy environments.

4) Experimental Results on Social Graphs: We emphasize that the social graphs have higher average degrees than the citation graphs (i.e., nodes have more features and neighbors). In this section, we compare CW-GCN with the graph adaptation methods CDNE, ASN and CGDM+GCN on these difficult datasets under the two noisy conditions. As mentioned in the previous section, ASN utilizes the reconstructed adjacency matrix to model the structure in graph adaptation, and this strategy may have particular difficulty in dealing with higher average degrees. Meanwhile, the limited capability of CGDM+GCN reveals that designing loss functions specific to graphs is important in graph adaptation, especially in complex noisy cases. CDNE and CW-GCN obtain comparable performances with noisy edges. The advantage of CDNE may come from using a PPMI-based model to measure the structural proximities; note that our correntropy-induced Wasserstein distance can be easily integrated into various GNN models, and we leave this possible extension to future work. However, in the noisy-attribute tasks, CDNE shows greatly degraded performance. The relative performance gains (Macro) of CW-GCN over CDNE are 37.85% and 33.96% on B1 → B2 and B2 → B1, respectively. In summary, our correntropy-induced Wasserstein distance is an effective measure for suppressing various kinds of complicated noise, resulting in reliable graph knowledge transfer.
C. Empirical Analysis

1) Robustness of C-GCN Under Different Levels of Noise: We further evaluate the robustness to three different levels of noise. The noise has the following form: p percent of the training nodes are randomly chosen, their original correct edges are deleted, and then q random edges are added to each chosen node; p percent of the training nodes are randomly chosen as well, and their features are replaced with i.i.d. samples from a Cauchy distribution. Level I (as in Table IV): p = 30%, q = 15. Level II: p = 40%, q = 15. Level III: p = 30%, q = 30. The experiments are randomly repeated 20 times, and the classification accuracy is shown in Table IX. As the noise level increases, the performances of C-GCN remain consistently better than those of GCN. These results demonstrate the effectiveness and the robustness of our C-loss function in all kinds of difficult cases.
2) Robustness of CW-GCN Under Different Levels of Noise: We consider two variants of CW-GCN to give insight into its performance. W-GCN replaces the C-loss in CW-GCN with the cross-entropy loss. CW-GCN (b = 1) treats each source node equally in the process of minimizing the Wasserstein distance, i.e., b = 1 in Eq. (14). Fig. 3 shows the robustness of the variants to different levels of noise on the D7 → A9 task (the hardest of the six tasks). Specifically, p percent of the labeled source nodes are randomly chosen, their original correct edges are deleted, and then 100 random edges are added to each chosen node, where p = 10%, 13%, 15%, 17%, 20%. (1) W-GCN achieves worse performance than the other methods across all noise levels, demonstrating that the graph quality greatly influences the embedding results and that a robust cost function is desirable to mitigate the negative influence. (2) CW-GCN (b = 1) always performs worse than CW-GCN, which reflects that robust knowledge extraction and robust knowledge transfer are both essential in cross-graph node classification. In this sense, our proposed novel Wasserstein distance shows its robustness at transferring knowledge in all kinds of difficult cases.
3) Complexity Analysis: We investigate the computational complexity of the graph adaptation models CDNE, ASN and CW-GCN. Table X reports the running time on tasks C1 → D7, D7 → A9 and A9 → C1. We find that the two-step CW-GCN achieves the highest computational efficiency; for example, it reduces the training time of ASN by over 70% on D7 → A9. On this D7 → A9 task, the running time of CW-GCN's first step is 30 seconds: in each epoch, 0.6 seconds are spent on updating and evaluating the model, and 1.0 seconds are spent on saving the node embedding files. The second step of CW-GCN takes 58 seconds. Note that CDNE suffers from a high computation cost since its source code runs on the CPU platform.

V. CONCLUSION
In graph embedding, knowledge transfer between two different but related graphs is promising and challenging. This paper proposes the CW-GCN method to uncover a common latent feature space for source and target graphs under severely noisy environments. Specifically, CW-GCN is implemented in a two-step learning paradigm, allowing independent source and target mappings. Inspired by correntropy, the first step is distinguished by integrating three important sources of information (graph structure, node attributes and node labels) from the noisy source graph. Next, our method efficiently and reliably adapts the marginal distributions between graphs by extending the existing Wasserstein distance. To the best of our knowledge, this is the first attempt in the graph embedding field to explain robustness from the correntropy perspective. Extensive experiments are conducted on five real-world graph datasets. The proposed method generalizes well across a variety of noisy cases and establishes a new state-of-the-art for single-graph and cross-graph classification.
In the future, we plan to robustly adapt both marginal and conditional distributions between graphs in our architecture.
In the challenging applications where source and target graphs are both from noisy environments, we will further investigate how to reduce the impact of the contaminated target nodes on knowledge transfer.