BiGCN: A Bi-directional Low-Pass Filtering Graph Neural Network

Graph convolutional networks (GCNs) have achieved great success on graph-structured data. Many graph convolutional networks can be thought of as low-pass filters for graph signals. In this paper, we propose a more powerful graph convolutional network, named BiGCN, that extends to bidirectional filtering. Specifically, we consider not only the original graph structure information but also the latent correlation between features, so that BiGCN can filter the signals along both the original graph and a latent feature-connection graph. Compared with most existing GCNs, BiGCN is more robust and has a stronger capacity for feature denoising. We perform node classification and link prediction in citation networks and co-purchase networks with three settings: noise-rate, noise-level and structure-mistakes. Extensive experimental results demonstrate that our model outperforms state-of-the-art graph neural networks on both clean and artificially noisy data.


INTRODUCTION
Graphs are important research objects in the field of machine learning as they are good carriers for structural data such as social networks, citation networks, and co-purchase networks. Over the past years, there has been a surge of interest in applying deep learning to various graph-based tasks. At the same time, it is well recognized that representation learning methods [1], [2], [3], [4] that thoroughly leverage graph structure and node attributes are fundamental components of the vast majority of graph learning algorithms. In particular, graph convolutional neural networks (GCNs) [5], [6], [7], [8], [9], [10], [11] have received extensive attention due to their impressive performance in various domains.
GCNs can be interpreted as varied aggregation schemes that propagate node features across graphs. From this point of view, feature noise will spread in the same way, which degrades performance. Unfortunately, in real attributed graphs, feature noise is inevitable. Take social networks as an example. To protect privacy and promote social presence, users can fake or embellish their age, gender, physical characteristics, profession, etc. Thus, the credibility of personal information provided by users is limited. In this paper, we will show how to minimize the impact of noise. Here, we first divide feature noise into two types: node-wise noise that exists in nodes and feature-wise noise that exists in attributes. The noisy case above is node-wise noise and measurement error is a typical type of feature-wise noise.
Below, we state that GCNs can denoise, but fail to cope well with feature-wise noise.
With the success of GCNs, more and more efforts are focused on the reasons why GCNs are so powerful [12]. Li et al. [13] re-examined graph convolutional networks and connected them with Laplacian smoothing. NT and Maehara [14] revisited GCNs in terms of graph signal processing. Interestingly, they found that many graph convolutions can be considered adjacency-induced low-pass filters (e.g. [5], [15]). That is, they can capture low-frequency components and remove some high-frequency node-wise noise by making connected nodes more similar. In fact, these findings are not new. Since their first appearance in [16], spectral GCNs have been closely related to graph signal processing and denoising. However, GCNs cannot handle all noise, because graph filters mix the rows (nodes) of the feature matrix whereas feature-wise noise is distributed along its columns (attributes). Furthermore, on isolated nodes or small connected components of the graph, their denoising effect is quite limited due to the lack of reliable neighbors.
To address these problems, we design a bi-directional low-pass filter and propose a more powerful graph neural network, called BiGCN (Fig. 1). The key point of BiGCN is to introduce a latent feature graph. We take feature correlations as weighted edges and represent each feature dimension as a node. If feature correlations are not available, BiGCN can learn to model latent feature correlations. Solving an optimization problem in graph signal processing, we obtain a column filter on the original graph and a row filter on the feature graph to remove node-wise noise and feature-wise noise, respectively.
Finally, we point out that unreliable structure information limits the denoising capacity of GCNs and makes them vulnerable to collapse, while BiGCN can minimize the adverse effect of unreliable structural information. As mentioned above, graph filters are derived from the Graph Fourier Transform and can be formulated as functions of the graph Laplacian, so their effect depends on the eigenvalues and eigenvectors of the Laplacian. With an incorrect adjacency matrix, most existing spectral graph convolutions are therefore no longer true low-pass filters. Worse, they block out valid information and introduce noise. In contrast, because BiGCN learns a latent feature correlation graph that is not affected by the given graph structure, it improves the model's tolerance to graph structure mistakes by setting appropriate weights for the two filters. Inaccurate connection information is more common than one might expect. For instance, it is difficult for social media to represent real-world social relationships accurately. Younger users probably have many online friends whom they do not know in real life. In addition, they may be less inclined to follow their acquaintances, such as colleagues or relatives, on social media.
We evaluate our model on two tasks: node classification and link prediction. In addition to the original graph data, we develop three cases to demonstrate the performance of our model in terms of graph signal denoising and fault tolerance: 1) NOISE-RATE: randomly adding Gaussian noise with different variances to a certain percentage of nodes; 2) NOISE-LEVEL: adding different levels of Gaussian noise to the whole graph feature matrix; 3) STRUCTURE-MISTAKES: corrupting a certain percentage of connections. The strong performance of our model in extensive experiments confirms its power and robustness on both clean and noisy data.
The main contributions of this work are summarized below.
• We propose a new framework for representation learning of attributed graphs. Instead of only considering the signals in the original graph, we take the feature correlations into account and make the model more robust to feature noise as well as structural mistakes.
• We formulate our graph neural network based on Laplacian smoothing and derive a bi-directional low-pass graph filter using the Alternating Direction Method of Multipliers (ADMM) algorithm.
• We set three cases to demonstrate the powerful denoising capacity and high fault tolerance of our model in tasks of node classification and link prediction.

RELATED WORK
We summarize the related work in the field of graph signal processing and denoising and recent work on spectral graph convolutional networks as follows.

Graph Signal Processing and Denoising
Graph-structured data is ubiquitous in the world. Graph signal processing (GSP) [17] is intended for analyzing and processing the graph signals whose values are defined on the set of graph vertices. It can be seen as a bridge between classical signal processing and spectral graph theory. One line of the research in this area is the generalization of the Fourier transform to the graph domain and the development of powerful graph filters [18], [19]. It can be applied to various tasks, such as representation learning and denoising [20]. More recently, the tools of GSP have been successfully used for the definition of spectral graph neural networks, making a strong connection between GSP and deep learning. In this work, we restart with the concepts from graph signal processing and define a new smoothing model for deep graph learning and graph denoising. It is worth mentioning that the concept of denoising/robustness in GSP is different from the defense/robustness against adversarial attacks (e.g. [21]), so we do not make comparisons with those models.

Spectral Graph Convolutional Networks
Inspired by the success of convolutional neural networks in images and other Euclidean domains, researchers also started to extend the power of deep learning to graphs. One of the earliest trends for defining the convolutional operation on graphs is the use of the Graph Fourier Transform and its definition in the spectral domain instead of the original spatial domain [16]. Defferrard et al. [22] proposed ChebyNet, which defines a filter as Chebyshev polynomials of the diagonal matrix of eigenvalues and can be exactly localized in the k-hop neighborhood. Later on, Kipf and Welling [5] simplified the Chebyshev filters using a first-order polynomial filter, which led to the well-known graph convolutional network. Recently, many new spectral graph filters have been developed. For example, rational autoregressive moving average (ARMA) graph filters [19], [11] were proposed to enhance the modeling capacity of GNNs. Compared to polynomial filters, ARMA filters are more robust and provide a more flexible graph frequency response. Feedback-looped filters [23] further improved localization and computational efficiency. There is also another type of graph convolutional network that defines convolutional operations in the spatial domain by aggregating information from neighbors. The spatial types are not closely related to our work, so they are beyond the scope of our discussion. As we will discuss later, our model is closely related to spectral graph convolutional networks. We define our graph filter from the perspective of Laplacian smoothing, and then extend it not only to the original graph but also to a latent feature graph in order to improve the capacity and robustness of the model.

BACKGROUND: GRAPH SIGNAL PROCESSING
In this section, we briefly introduce some concepts of graph signal processing (GSP), including graph smoothness, the Graph Fourier Transform and graph filters, which will be used in later sections.

Graph Laplacian and Smoothness
A graph can be represented as G = (V, E), which consists of a set of n nodes V = {1, . . . , n} and a set of edges E ⊆ V×V. In this paper, we only consider undirected attributed graphs. We denote the adjacency matrix of G as A = (a_ij) ∈ R^{n×n} and the degree matrix of G as D = diag(d(1), . . . , d(n)) ∈ R^{n×n}. In the degree matrix, d(i) represents the degree of vertex i ∈ V. We assume that each vertex i ∈ V is associated with a scalar x(i) ∈ R, which is also called a graph signal. All graph signals can be represented by x ∈ R^n. Several variants of the graph Laplacian can be defined on graph G. We denote the graph Laplacian of G as L = D − A ∈ R^{n×n}. Note that each row of the graph Laplacian L sums to zero. The smoothness of a graph signal x can be measured through the quadratic form of the graph Laplacian:
$$\Delta(x) = x^\top L x = \frac{1}{2} \sum_{i,j} a_{ij} \big(x(i) - x(j)\big)^2 .$$

Fig. 1. Illustration of the l-th BiGCN layer (diagram labels: capture correlation, column filters, row filters, ADMM). The key innovation of BiGCN is to apply a bi-directional low-pass filter on a multi-graph (i.e. the graphs of nodes and features) to incorporate node attributes, node relationships, and feature correlations. We first construct a feature graph: node d_i indicates the i-th feature dimension with the i-th column of H^{(l)} as the attribute embeddings; edges are weighted and represent given or learnable feature correlations. A bi-directional filter contains a column filter on the node graph and a row filter on the corresponding feature graph. As the input node graph differs from layer to layer, the learned/captured feature correlations, as well as the feature graph, are different.
Since x^T L x ≥ 0 for every graph signal x, L is a symmetric positive semi-definite matrix.
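The smoothness measure can be checked numerically. The following is a minimal sketch, assuming NumPy and a small path graph chosen only for illustration (the function names are hypothetical, not from the paper); it compares the quadratic form with the pairwise-difference sum for a smooth and a rough signal.

```python
import numpy as np

def laplacian(A):
    """Combinatorial Laplacian L = D - A of an undirected graph."""
    return np.diag(A.sum(axis=1)) - A

def smoothness_quadratic(A, x):
    """Smoothness as the quadratic form x^T L x."""
    return float(x @ laplacian(A) @ x)

def smoothness_pairwise(A, x):
    """Smoothness as 1/2 * sum_ij a_ij (x(i) - x(j))^2."""
    diff = x[:, None] - x[None, :]
    return float(0.5 * np.sum(A * diff ** 2))

# Illustrative path graph 0 - 1 - 2 - 3 (an assumption for this example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

x_smooth = np.array([1.0, 1.1, 1.2, 1.3])    # neighbors have similar values
x_rough = np.array([1.0, -1.0, 1.0, -1.0])   # neighbors alternate in sign

print(smoothness_quadratic(A, x_smooth), smoothness_pairwise(A, x_smooth))
print(smoothness_quadratic(A, x_rough), smoothness_pairwise(A, x_rough))
```

A small value of the quadratic form indicates that connected nodes carry similar signal values, which is the sense of "smooth" used in the rest of the paper.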

Graph Fourier Transform and Graph Filters
Decomposing the Laplacian matrix as L = UΛU^T, we obtain the orthogonal eigenvectors U as the Fourier basis and the eigenvalues Λ as the graph frequencies. The Graph Fourier Transform F : R^n → R^n is defined by F x = x̂ := U^T x. The inverse Graph Fourier Transform is defined by F^{-1} x̂ = x := U x̂. This enables us to transfer the graph signal to the spectral domain, and then define a graph filter g in the spectral domain for filtering the graph signal x:
$$g(L)\, x = U g(\Lambda) U^\top x,$$
where g(Λ) = diag(g(λ_1), ..., g(λ_n)) controls how the graph frequencies are altered.
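As an illustration of spectral filtering, here is a minimal NumPy sketch. The path graph, the test signal and the response g(λ) = 1/(1 + 2λ) are assumptions made for the example; the helper names are hypothetical.

```python
import numpy as np

def graph_fourier_basis(L):
    """Eigendecomposition L = U diag(lam) U^T of a symmetric Laplacian."""
    lam, U = np.linalg.eigh(L)
    return lam, U

def spectral_filter(L, x, g):
    """Apply the filter g(L) x = U g(Lambda) U^T x."""
    lam, U = graph_fourier_basis(L)
    x_hat = U.T @ x                 # graph Fourier transform
    return U @ (g(lam) * x_hat)     # rescale frequencies, then transform back

# Illustrative path graph 0 - 1 - 2 - 3 and a high-frequency signal.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
x = np.array([1.0, -1.0, 1.0, -1.0])

# Hypothetical low-pass response: attenuates large graph frequencies.
low_pass = lambda lam: 1.0 / (1.0 + 2.0 * lam)
y = spectral_filter(L, x, low_pass)

print("smoothness before:", x @ L @ x)
print("smoothness after: ", y @ L @ y)
```

With this particular response the spectral filter coincides with the matrix (I + 2L)^{-1}, so the same result could be obtained by solving a linear system instead of computing an eigendecomposition.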

BIGCN
The Graph Fourier Transform has been successfully used to define various low-pass filters on graph signals (column vectors of the feature matrix) and to derive spectral graph convolutional networks [22], [11], [23]. A spectral graph convolutional operation can be formulated as a function g of the Laplacian matrix L. Although it can smooth the graph and remove certain node-wise noise by assimilating neighbor nodes, it is sensitive to feature-wise noise and unreliable structure information. Note that there are potential correlations between multiple attributes. Such information is widely used in machine learning, for example in feature selection, and can be used to predict one attribute from others. Analogously, we can utilize feature correlations to "denoise" noisy attributes. At the same time, doing so can reduce the information distortion caused by structure mistakes. In practice, we construct a feature graph based on known or learned feature correlations, and define a new row-directional low-pass filter derived from the Laplacian smoothness assumption on the feature graph. In addition, similar to spectral GCNs, we apply a column-wise filter on the original graph. That is, our model, named BiGCN (shown in Fig. 1), is a general spectral GCN with bi-directional low-pass filters. To better explain the power of bi-directional graph convolution, we start with the following simple case.

From Laplacian Smoothing to Graph Convolution
Assuming that f = y_0 + η is an observation with noise η, a natural optimization problem for recovering the true graph signal y_0 is given by:
$$\min_{y} \; \|y - f\|_2^2 + \lambda\, y^\top L y,$$
where λ is a hyper-parameter and L is the (normalized) Laplacian matrix. The optimal solution to this problem, which serves as the estimate of the true graph signal, is given by
$$y = (I + \lambda L)^{-1} f.$$
If we generalize the noisy graph signal f to a noisy feature matrix F = Y_0 + N, then the true graph feature matrix Y_0 can be estimated as follows:
$$Y = \arg\min_{Y} \; \|Y - F\|_F^2 + \lambda \operatorname{trace}(Y^\top L Y) = (I + \lambda L)^{-1} F.$$
The term trace(Y^T L Y), the Laplacian regularization, imposes a smoothness assumption on the feature matrix. (I + λL)^{-1} is equivalent to a low-pass filter in the graph spectral domain, which can remove node-wise noise and can be used to define a new graph convolutional operation. Specifically, by multiplying a learnable matrix W (i.e. adding a linear layer for node feature transformation beforehand, which is similar to [15], [14]), we obtain a new graph convolutional layer as follows:
$$H^{(l+1)} = \sigma\big((I + \lambda L)^{-1} H^{(l)} W^{(l)}\big).$$
In order to reduce the computational complexity, we can simplify the propagation formulation by approximating (I + λL)^{-1} with its first-order Taylor expansion I − λL.
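The relation between the exact Laplacian-smoothing filter and its first-order approximation can be sketched as follows. This is an illustrative NumPy example, assuming a small cycle graph and random features; the function names and the values of λ are hypothetical choices, not settings from the paper.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; isolated nodes get a zero scaling factor."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def smooth_exact(F, L, lam):
    """Exact minimizer (I + lam L)^{-1} F of ||Y - F||_F^2 + lam tr(Y^T L Y)."""
    return np.linalg.solve(np.eye(L.shape[0]) + lam * L, F)

def smooth_first_order(F, L, lam):
    """First-order Taylor approximation (I - lam L) F."""
    return F - lam * (L @ F)

def objective(Y, F, L, lam):
    """Value of the Laplacian-smoothing objective."""
    return float(np.sum((Y - F) ** 2) + lam * np.trace(Y.T @ L @ Y))

# Illustrative 6-node cycle graph with random 3-dimensional features.
n = 6
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1.0
A[(idx + 1) % n, idx] = 1.0
L = normalized_laplacian(A)

rng = np.random.default_rng(0)
F = rng.normal(size=(n, 3))

for lam in (0.1, 0.5):
    gap = np.linalg.norm(smooth_exact(F, L, lam) - smooth_first_order(F, L, lam))
    print(f"lambda={lam}: ||exact - first-order|| = {gap:.4f}")
```

The gap between the two filters grows with λ, which is why the approximation is only reasonable when λ times the largest Laplacian eigenvalue stays small.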

Bi-directional Smoothing and Filtering
The key to the success of GCNs is that they incorporate both node attributes and connection information. However, they neglect the relationships between multiple node attributes, which are common in the real world and informative.
Here, we take recommendation systems as an example of utilizing feature correlations. The recommendation problem (e.g. at Netflix) is to predict the ratings of movies that users have never seen, where the rows and columns of the rating matrix represent users and movies, respectively. Additional information, such as relationships between users (based on their interpersonal relations, age, hobbies, education, etc.) and between movies (based on their genre, director, actors, country of origin, etc.), can be encoded in the form of a user graph and an item graph. This information can be exploited for prediction. Users who have a lot in common are likely to share the same taste in movies. On the other hand, users usually prefer particular genres or classes of movies and are likely to rate them similarly. Inspired by recommendation problems, we adapt GCNs to the multi-graph setting to incorporate the additional information of feature correlations. We introduce a "feature adjacency matrix" A to indicate these feature connections. In some cases, we can build a simple adjacency matrix based on rough and limited human knowledge. For instance, let the i-th, j-th and k-th feature dimensions refer to "height", "weight" and "age", respectively. Since "weight" correlates strongly with "height" but only weakly with "age", it is reasonable to assign A_ji = 1 and A_jk = 0 (here we assume A ∈ {0, 1}^{d×d} for simplicity). However, in most cases, it is hard to obtain accurate feature correlations. We therefore endow BiGCN with the ability to capture latent relationships between attributes by introducing a learnable adjacency matrix.
Then we can construct a corresponding "feature graph" G_f whose nodes represent attributes and whose edges indicate the known or learned feature correlations. Given Y ∈ R^{n×d}, the feature matrix of the original graph G, its transpose Y^T is the feature matrix of the feature graph G_f. That is, the rows and columns of Y are the embeddings of nodes and attributes, respectively.
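When noise is not only node-wise but also feature-wise, or when graph structure information is not completely reliable, it is beneficial to consider feature correlation information in order to better recover the clean feature matrix. Thus we add a Laplacian smoothness regularization on the feature graph to the optimization problem indicated above:
$$Y = \arg\min_{Y} \; \|Y - F\|_F^2 + \lambda_1 \operatorname{trace}(Y^\top L_1 Y) + \lambda_2 \operatorname{trace}(Y L_2 Y^\top).$$
Here L_1 and L_2 are the normalized Laplacian matrices of the original graph and the feature graph, and λ_1 and λ_2 are the hyper-parameters of the two Laplacian regularizations. Y L_2 Y^T is the Laplacian regularization on the feature graph, i.e. on the row vectors of the original feature matrix. The solution of this optimization problem is equal to the solution of the differential equation:
$$2(Y - F) + 2\lambda_1 L_1 Y + 2\lambda_2 Y L_2 = 0.$$
This equation, equivalent to λ_1 L_1 Y + λ_2 Y L_2 = F − Y, is a Sylvester equation. The numerical solution of Sylvester equations can be calculated using classical algorithms such as the Bartels-Stewart algorithm [24], the Hessenberg-Schur method [25] and the LAPACK algorithm [26]. However, all of them require a Schur decomposition, which involves Householder transforms and QR iteration with O(n^3) computational cost. Consequently, instead of solving the Sylvester equation directly, we transform the original problem into a bi-criteria optimization problem with an equality constraint:
$$\min_{Y_1, Y_2} \; f(Y_1) + g(Y_2) \quad \text{s.t.} \quad Y_2 - Y_1 = 0,$$
$$f(Y_1) = \tfrac{1}{2}\|Y_1 - F\|_F^2 + \lambda_1 \operatorname{trace}(Y_1^\top L_1 Y_1), \qquad g(Y_2) = \tfrac{1}{2}\|Y_2 - F\|_F^2 + \lambda_2 \operatorname{trace}(Y_2 L_2 Y_2^\top).$$
We adopt the ADMM algorithm [27] to solve this constrained convex optimization problem. The augmented Lagrangian function is:
$$L_p(Y_1, Y_2, Z) = f(Y_1) + g(Y_2) + \langle Z, Y_2 - Y_1 \rangle + \frac{p}{2}\|Y_2 - Y_1\|_F^2.$$
The update iteration of the ADMM algorithm is:
$$Y_1^{(k+1)} = \arg\min_{Y_1} L_p(Y_1, Y_2^{(k)}, Z^{(k)}), \quad Y_2^{(k+1)} = \arg\min_{Y_2} L_p(Y_1^{(k+1)}, Y_2, Z^{(k)}), \quad Z^{(k+1)} = Z^{(k)} + p\,(Y_2^{(k+1)} - Y_1^{(k+1)}).$$
We obtain the iteration formulas for Y_1 and Y_2 by computing the stationary points of L_p(Y_1, Y_2^{(k)}, Z^{(k)}) and L_p(Y_1^{(k+1)}, Y_2, Z^{(k)}):
$$Y_1^{(k+1)} = \frac{1}{1+p}\Big(I + \frac{2\lambda_1}{1+p} L_1\Big)^{-1}\big(F + p\,Y_2^{(k)} + Z^{(k)}\big), \qquad Y_2^{(k+1)} = \frac{1}{1+p}\big(F + p\,Y_1^{(k+1)} - Z^{(k)}\big)\Big(I + \frac{2\lambda_2}{1+p} L_2\Big)^{-1}.$$
To decrease the computational complexity, we can use a first-order Taylor approximation to simplify the iteration formulas, choosing appropriate hyper-parameters p and λ_1, λ_2 such that the eigenvalues of (2λ_1/(1+p)) L_1 and (2λ_2/(1+p)) L_2 all fall into [−1, 1]:
$$Y_1^{(k+1)} = \frac{1}{1+p}\Big(I - \frac{2\lambda_1}{1+p} L_1\Big)\big(F + p\,Y_2^{(k)} + Z^{(k)}\big), \qquad Y_2^{(k+1)} = \frac{1}{1+p}\big(F + p\,Y_1^{(k+1)} - Z^{(k)}\big)\Big(I - \frac{2\lambda_2}{1+p} L_2\Big).$$
In each iteration, as shown in Fig. 1, we update Y_1 by applying the column low-pass filter I − (2λ_1/(1+p)) L_1 to the previous Y_2, then update Y_2 by applying the row low-pass filter I − (2λ_2/(1+p)) L_2 to the new Y_1. To some extent, the new Y_1 is the low-frequency column component of the original Y_2 and the new Y_2 is the low-frequency row component of the new Y_1. After k iterations (in our experiments, k = 2), we take the mean of Y_1 and Y_2 as the approximate solution Y, denoted Y = ADMM(F, L_1, L_2). In this way, the output of ADMM contains two kinds of low-frequency components. Moreover, we can generalize L_2 to a learnable symmetric matrix based on the original feature matrix F (or some prior knowledge), since it is hard to give a quantitative description of feature correlations.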
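The ADMM procedure described above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: it assumes the splitting f(Y_1) = ½‖Y_1 − F‖² + λ_1 trace(Y_1^T L_1 Y_1), g(Y_2) = ½‖Y_2 − F‖² + λ_2 trace(Y_2 L_2 Y_2^T), the dual update Z ← Z + p(Y_2 − Y_1), and the initialization Y_1 = Y_2 = F, Z = 0. The helper names, the cycle graphs and the hyper-parameter values are hypothetical. The Kronecker-product solver is only a small-scale reference for the Sylvester equation λ_1 L_1 Y + λ_2 Y L_2 = F − Y.

```python
import numpy as np

def admm_bidirectional(F, L1, L2, lam1, lam2, p, k=2, first_order=True):
    """Bi-directional filtering by k ADMM iterations; returns (Y1 + Y2) / 2."""
    n, d = F.shape
    a1, a2 = 2.0 * lam1 / (1.0 + p), 2.0 * lam2 / (1.0 + p)
    if first_order:   # Taylor-approximated filters I - a L
        col = np.eye(n) - a1 * L1
        row = np.eye(d) - a2 * L2
    else:             # exact filters (I + a L)^{-1}
        col = np.linalg.inv(np.eye(n) + a1 * L1)
        row = np.linalg.inv(np.eye(d) + a2 * L2)
    Y1, Y2, Z = F.copy(), F.copy(), np.zeros_like(F)
    for _ in range(k):
        Y1 = col @ (F + p * Y2 + Z) / (1.0 + p)    # column filter (node graph)
        Y2 = (F + p * Y1 - Z) @ row / (1.0 + p)    # row filter (feature graph)
        Z = Z + p * (Y2 - Y1)                      # dual update
    return 0.5 * (Y1 + Y2)

def sylvester_direct(F, L1, L2, lam1, lam2):
    """Reference solution of Y + lam1 L1 Y + lam2 Y L2 = F via vectorization."""
    n, d = F.shape
    K = (np.eye(n * d)
         + lam1 * np.kron(np.eye(d), L1)
         + lam2 * np.kron(L2.T, np.eye(n)))
    y = np.linalg.solve(K, F.flatten(order="F"))   # column-major vec(F)
    return y.reshape((n, d), order="F")

def ring_laplacian(n):
    """Normalized Laplacian of an n-cycle (every node has degree 2)."""
    idx = np.arange(n)
    A = np.zeros((n, n))
    A[idx, (idx + 1) % n] = 1.0
    A[(idx + 1) % n, idx] = 1.0
    return np.eye(n) - A / 2.0

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 4))
L1, L2 = ring_laplacian(6), ring_laplacian(4)

Y_fast = admm_bidirectional(F, L1, L2, lam1=0.3, lam2=0.3, p=1.0, k=2)
Y_exact = admm_bidirectional(F, L1, L2, 0.3, 0.3, 1.0, k=300, first_order=False)
Y_ref = sylvester_direct(F, L1, L2, 0.3, 0.3)
print("max |ADMM(exact, k=300) - Sylvester|:", np.abs(Y_exact - Y_ref).max())
print("max |ADMM(first-order, k=2) - Sylvester|:", np.abs(Y_fast - Y_ref).max())
```

The first-order variant with k = 2 only needs matrix products, which is the design trade-off the text motivates: it gives up exact agreement with the Sylvester solution in exchange for avoiding matrix inversion or Schur decomposition.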
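In the (l+1)-th propagation layer, F = H^{(l)} is the output of the l-th layer and L_2 is a learnable symmetric matrix depending on H^{(l)}, which we therefore denote by L_2^{(l)}. The entire formulation is:
$$H^{(l+1)} = \sigma\big(\mathrm{ADMM}(H^{(l)}, L_1, L_2^{(l)})\, W^{(l)}\big).$$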

Learnable L 2
We introduce a completely learnable L 2 in our experiments.
In detail, we define L_2 as: where W is an upper-triangular matrix parameter to be optimized. To make it sparse, we also add ℓ1-regularization to L_2. For each layer, L_2 is defined differently. Note that our framework is general and in practice there may be other reasonable choices for L_2.
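To make the idea of a learnable symmetric L_2 concrete, the following NumPy sketch shows one possible parameterization. It is purely an assumption for illustration and not necessarily the definition used in the paper: an upper-triangular parameter W is symmetrized, passed through a sigmoid to obtain edge weights, and turned into a normalized Laplacian. Only the forward computation and an ℓ1 penalty are shown; in practice W would be trained with an automatic differentiation framework. All function names are hypothetical.

```python
import numpy as np

def feature_laplacian_from_w(W):
    """Hypothetical parameterization of a symmetric feature-graph Laplacian."""
    d = W.shape[0]
    W_up = np.triu(W, k=1)                  # keep the strictly upper triangle
    S = W_up + W_up.T                       # symmetric scores
    A = 1.0 / (1.0 + np.exp(-S))            # assumed: sigmoid edge weights in (0, 1)
    np.fill_diagonal(A, 0.0)                # no self-loops
    deg = A.sum(axis=1)                     # positive whenever d >= 2
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(d) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def l1_penalty(L2):
    """l1 regularization term encouraging a sparse L2."""
    return float(np.abs(L2).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5))                 # stand-in for a trainable parameter
L2 = feature_laplacian_from_w(W)
print("symmetric:", np.allclose(L2, L2.T))
print("eigenvalue range:", np.linalg.eigvalsh(L2).min(), np.linalg.eigvalsh(L2).max())
print("l1 penalty:", l1_penalty(L2))
```

Whatever parameterization is chosen, the properties that matter for the row filter are symmetry and a bounded, non-negative spectrum, which this construction guarantees by design.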

Discussion about Over-smoothing
Since our algorithm is derived from bidirectional smoothing, some may worry about the over-smoothing problem. The over-smoothing issue of GCN is explored in [13], [28], where the main claim is that when the GCN model goes very deep, it encounters the over-smoothing problem and loses its expressive power. From this perspective, our model will face the same problem when we stack many layers. However, a single BiGCN layer is just a more expressive and robust filter than a normal GCN layer. The general forward function of a single-direction low-pass filtering GCN is H^{(l+1)} = σ(g(L_1) H^{(l)} W^{(l)}). Compared with this, ADMM(H^{(l)}, L_1, L_2) combines low-frequency components of both the column and row vectors of H^{(l)}. It is more informative than g(L_1) H^{(l)} since the latter can be regarded as one part of the former to some extent. This also explains why BiGCN is more expressive than single-direction low-pass filtering GCNs. Furthermore, when we take L_2 as an identity matrix (in Equation 5), BiGCN degenerates to a single-directional GCN with the low-pass filter ((1 + λ_2)I + λ_1 L_1)^{-1}. This also illustrates that BiGCN has a more general model capacity.
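The degenerate case can be verified in a few lines. The derivation below assumes that the bi-directional objective has the form ‖Y − F‖_F² + λ_1 trace(Y^T L_1 Y) + λ_2 trace(Y L_2 Y^T), which is the form consistent with the Sylvester equation λ_1 L_1 Y + λ_2 Y L_2 = F − Y stated earlier.

```latex
\begin{aligned}
L_2 = I:\quad
  & \min_{Y}\; \|Y - F\|_F^2 + \lambda_1 \operatorname{trace}(Y^\top L_1 Y)
    + \lambda_2 \|Y\|_F^2 \\
\Rightarrow\quad
  & 2(Y - F) + 2\lambda_1 L_1 Y + 2\lambda_2 Y = 0 \\
\Rightarrow\quad
  & Y = \big((1 + \lambda_2) I + \lambda_1 L_1\big)^{-1} F .
\end{aligned}
```

In the spectral domain this filter has the response 1 / (1 + λ_2 + λ_1 μ) for an eigenvalue μ of L_1, which decreases as μ grows and is therefore low-pass.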
In practice, we can also mix the BiGCN layer with original GCN layers or use jumping knowledge [29] to alleviate the over-smoothing problem: for example, we can use BiGCN at the bottom and then stack other GCN layers above. As we will show in the experiments, the added smoothing term in the BiGCN layers does not lead to over-smoothing; instead, it improves the performance on various datasets.

Model Expressiveness
As a bi-directional low-pass filter, our model can extract more informative features from the spectral domain. To simplify the analysis, let us take just one step of ADMM (k = 1). Since Z^{(0)} = 0 and Y_1^{(0)} = Y_2^{(0)} = F, we have the final solution from Equation (10) as follows.
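As a worked instance, assume the first-order (Taylor-approximated) update rules, the dual sign convention Z ← Z + p(Y_2 − Y_1), and write α_i = 2λ_i / (1 + p). Under these assumptions, one ADMM step from the initialization above gives:

```latex
\begin{aligned}
Y_1^{(1)} &= \frac{1}{1+p}\,(I - \alpha_1 L_1)\,\big(F + pF + 0\big)
           = (I - \alpha_1 L_1)\,F, \\
Y_2^{(1)} &= \frac{1}{1+p}\,\big(F + p\,Y_1^{(1)} - 0\big)\,(I - \alpha_2 L_2)
           = \Big(I - \frac{p\,\alpha_1}{1+p}\,L_1\Big)\,F\,(I - \alpha_2 L_2), \\
Y &= \tfrac{1}{2}\big(Y_1^{(1)} + Y_2^{(1)}\big).
\end{aligned}
```

With the exact (inverse) filters instead of their Taylor approximations, the same computation goes through with (I + α_i L_i)^{-1} in place of I − α_i L_i in the first line and in the right factor of the second line.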
From this solution, we can see that Y_1 is the output of a low-pass filter that extracts low-frequency features from the original graph via L_1, while Y_2 is the output of a low-pass filter that extracts low-frequency features from the feature graph via L_2 and then applies a further transformation. Since we take the average of Y_1 and Y_2 as the output of ADMM(H, L_1, L_2), the BiGCN layer extracts low-frequency features from both graphs. That is, our model adds new information from the latent feature graph while not losing any features of the original graph. Compared to the original single-directional GCN, our model has more informative features and is more powerful in representation. When we take more than one step of ADMM, from Equation 11 we know that the additive component (I − (2λ_1/(1+p)) L_1) F is always in Y_1 (with a scaling coefficient), and the component F (I − (2λ_2/(1+p)) L_2) is always in Y_2. So the output of the BiGCN layer always contains the low-frequency features from the original graph and the feature graph, together with some additional transformed features, which leads to the same conclusion as in the one-step case.

EXPERIMENT
We test BiGCN on two graph-based tasks: semi-supervised node classification and link prediction on several benchmarks. As these datasets are usually observed and carefully collected through rigorous screening, their noise can be considered negligible. However, in many real-world datasets, noise is everywhere and cannot be ignored. To highlight the denoising capacity of the bi-directional filters, we design three cases and conduct extensive experiments on artificially noisy data. In the noise-level case, we add different levels of noise to the whole graph. In the noise-rate case, we randomly add noise to a subset of the nodes. Considering the potentially unreliable connections in the graph, and to fully verify the fault tolerance to structure information, we set a structure-mistakes case in which we change the graph structure. We compare our performance with several baselines including the original GCN [5], GraphSAGE [7], GAT [6], GIN [12], and GDC [30].

Benchmark Datasets
We conduct link prediction experiments on CITATION networks and node classification experiments on both CITATION networks and CO-PURCHASE networks. Detailed statistics are provided in Table 1.

• CITATION. A citation network dataset consists of documents as nodes and citation links as directed edges. We use three undirected citation graph datasets: CORA [31], CITESEER [32], and PUBMED [33] for both node classification and link prediction tasks as they are common in all baseline approaches. In addition, we add another citation network, DBLP [34], to the link prediction tasks.
• CO-PURCHASE. We also use two co-purchase networks, AMAZON COMPUTERS [35] and AMAZON PHOTOS [36], which take goods as nodes, to predict the respective product category of goods. The features are bag-of-words node features and the edges indicate that two goods are frequently bought together.

Baseline Models
We compare our BiGCN with several state-of-the-art GNN models:
• GCN [5]: A powerful and efficient graph convolutional network for semi-supervised node classification with a first-order approximation of spectral graph convolution. Our adaptation of GCN to link prediction tasks follows the implementation in P-GNN, whose implementation of GCN is equivalent to GAE [37].
• GraphSAGE [7]: An inductive and spatial variant of graph convolutional networks proposing a general convolution schema and several aggregation methods. In our implementation, we use MEAN(·) as the aggregation function.
• GAT [6]: A graph attention network that assigns learnable weights to known edges based on node attributes. We use a multi-head attention mechanism.
• GIN [12]: The graph isomorphism network is as powerful as the Weisfeiler Lehman graph isomorphism test. We remove the graph-level READOUT(·) component and use node embeddings learned by GIN directly for node classification and link prediction.
• GDC [30]: Graph diffusion convolution based on generalized graph diffusion. We compare one of the variants of GDC which leverages personalized PageRank graph diffusion to improve the original GCN.

Noise Case
To highlight the denoising capacity of bi-directional filters, we design the following three cases with artificial noise and mistakes. Cases of noise-level and noise-rate are to add noise to node features while the case of structure-mistakes is to confuse node connections.
• NOISE-LEVEL CASE. In this case, we add zero-mean Gaussian noise of different magnitudes to all the node features in the graph, i.e. to the feature matrix, and use the standard deviation n_l (n_l ∈ [0.1, 0.9]) of the Gaussian as the quantitative index of the noise level. Given the attribute matrix X ∈ R^{n×m}, we generate a noise matrix N ∈ R^{n×m} whose entries are sampled from N(0, n_l^2).
• NOISE-RATE CASE. In this case, we add Gaussian noise with random variance to different proportions of nodes, i.e. to some randomly chosen rows of the feature matrix, and quantitatively study how the percentage n_r (n_r ∈ {0.2, 0.4, . . . , 1}) of nodes with noisy features impacts model performance. Given the attribute matrix X ∈ R^{n×m}, we generate a random mask vector m ∈ {0, 1}^n with p(m_i = 1) = n_r and a random variance vector v. Then we generate a noise matrix N ∈ R^{n×m} in which the entries of row N_{i,·} are sampled from N(0, (m ⊙ v)_i^2).
• STRUCTURE-MISTAKES CASE. In practice, it is common and inevitable to observe wrong or interfering link information in real-world data, especially in a large-scale network such as a social network. Therefore, we artificially make random changes with a certain error ratio r in the graph structure, i.e. we remove edges or add false edges by symmetrically flipping entries of the original adjacency matrix (from 0 to 1 or from 1 to 0) to obtain an erroneous adjacency matrix. Given the adjacency matrix A ∈ R^{n×n}, we generate a random matrix M ∈ {0, 1}^{n×n} with p(M_ij = 0) = r. To obtain a "noisy" adjacency matrix Ã, we let Ã = A ⊙ M + (1 − A) ⊙ (1 − M), so that exactly the entries with M_ij = 0 are flipped.
We conduct all of the above cases on five benchmarks in node classification tasks and the first two cases on four benchmarks in link prediction tasks.
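The three corruption procedures can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the variance vector v in the noise-rate case is drawn uniformly from [0, max_std) because its distribution is not specified above, and the structure mask is symmetrized and leaves the diagonal untouched so that the corrupted adjacency matrix stays symmetric without self-loops.

```python
import numpy as np

def add_noise_level(X, n_l, rng):
    """NOISE-LEVEL: zero-mean Gaussian noise with scale n_l on every entry."""
    return X + rng.normal(0.0, n_l, size=X.shape)

def add_noise_rate(X, n_r, rng, max_std=1.0):
    """NOISE-RATE: noise on a random fraction n_r of the rows (nodes)."""
    n = X.shape[0]
    m = (rng.random(n) < n_r).astype(float)   # mask with p(m_i = 1) = n_r
    v = rng.uniform(0.0, max_std, size=n)     # assumed distribution of v
    scale = (m * v)[:, None]                  # per-row standard deviation
    return X + rng.normal(size=X.shape) * scale, m

def corrupt_structure(A, r, rng):
    """STRUCTURE-MISTAKES: flip each off-diagonal node pair with probability r."""
    n = A.shape[0]
    flip = np.triu(rng.random((n, n)) < r, k=1)   # sample the upper triangle
    flip = flip | flip.T                          # mirror to keep symmetry
    M = 1.0 - flip.astype(float)                  # M_ij = 0 marks a flipped entry
    return A * M + (1.0 - A) * (1.0 - M)

rng = np.random.default_rng(0)
X = rng.random((8, 5))
A = np.zeros((8, 8))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1.0

X_level = add_noise_level(X, n_l=0.5, rng=rng)
X_rate, mask = add_noise_rate(X, n_r=0.4, rng=rng)
A_noisy = corrupt_structure(A, r=0.2, rng=rng)
print("noisy rows:", int(mask.sum()), "of", len(mask))
print("changed adjacency entries:", int(np.abs(A_noisy - A).sum()))
```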

Model Configuration
We train a two-layer BiGCN in the same way as the other baselines, using Adam as the optimization method with a learning rate of 0.01, a weight decay of 5 × 10^{-4}, and a dropout rate of 0.5 for all benchmarks and baselines. In the node classification task, we use early stopping with patience 100 and select the best-performing models based on validation set accuracy. In the link prediction task, we train each classifier for a maximum of 100 epochs and report the test ROC-AUC selected based on the best validation set ROC-AUC, evaluated every 10 epochs. In addition, we follow the experimental setting of P-GNN (position-aware GNN), and our adaptation of GCN to link prediction tasks is consistent with the implementation in P-GNN. We set the random seed for each run and report the mean test results over 10 runs.
All the experimental datasets are taken from PyTorch Geometric. We test BiGCN and the other baselines on the whole graph, whereas GDC originally selects only the largest connected component of the graph. Thus, the results we report for GDC may not be completely consistent with those reported in the GDC paper. We found that the CITATION datasets in PyTorch Geometric are slightly different from those used in GCN, GraphSAGE, and GAT. This may be the reason why their accuracy on CITESEER and PUBMED in node classification tasks is slightly lower than reported in the original papers.
All implementations for both node classification and link prediction are based on PyTorch 1.2.0 and PyTorch Geometric (footnote 1). All experiments are run on one NVIDIA GeForce RTX 2080 Ti GPU using CUDA.

Hyper-parameters
We tune the hyper-parameters of each model using validation data. To accelerate the tedious process of hyper-parameter tuning, we set 2λ_1/(1+p) = 2λ_2/(1+p) = λ and choose a different hyper-parameter p for different datasets. For link prediction, we fix all hyper-parameters on all datasets across all experiments: we set p = 8.5, λ = 1.2, iteration k = 2, and use a two-layer BiGCN with 32 hidden units and 0.5 dropout trained with a learning rate of 0.01. For node classification, we also fix most hyper-parameters on all benchmarks: iteration k = 2 and a two-layer BiGCN with 16 hidden units and 0.5 dropout trained with a learning rate of 0.01. The values of p and λ depend on the dataset and the noise case. In the NOISE-RATE and NOISE-LEVEL cases, we set p = 3, λ = 1.8 on CITATION networks, p = 2.5, λ = 1 on AMZ COMPUTER and p = 1.5, λ = 0.8 on AMZ PHOTOS across all the noise settings. In the STRUCTURE-MISTAKES case, we set p = 0.1, λ = 0.8 on CORA and PUBMED, p = 0.05, λ = 0.8 on CITESEER and p = 0.1, λ = 1 on CO-PURCHASE networks across all structural error ratios.

Results
We evaluate BiGCN and baselines on node classification and link prediction tasks. To test their denoising capacity and robustness on noisy graphs, we artificially add feature noise and structural noise to clean benchmarks and set three types of noise cases in terms of noise level, noise rate, and structure mistakes. As far as feature noise is concerned, we expect our BiGCN to demonstrate its capabilities as a graph filter. For structural noise, we expect the latent feature graph to help correct structural errors in the original graph.
1. https://github.com/rusty1s/pytorch_geometric

Clean Data
The performance of the models on clean benchmarks in node classification and link prediction is shown in Tables 2 and 3, respectively. These results correspond to the values with noise level n_l = 0. On node classification, BiGCN improves accuracy by 0.5%, 1.1%, 2.5% and 1.8% over the best baselines on CORA, PUBMED, AMZ COMPUTER and AMZ PHOTOS, respectively. On link prediction, our model outperforms the others on all benchmarks except PUBMED.

Noise-level Case
In this case, we add noise of different levels to the node attributes. Fig. 2 and Fig. 3 show the results for node classification and link prediction.
On node classification, BiGCN provides the best performance on all datasets except CORA across all noise levels. For the co-purchase networks, the curves of all baselines as well as BiGCN are smooth and flat. A possible explanation is that the attributes (i.e. product reviews) rarely provide adequate information about product categories. One anomaly is that GIN performs worst on AMZ COMPUTER and AMZ PHOTOS. A likely reason is that GIN is designed for graph classification and does not generalize well to node classification on all datasets.
On link prediction, BiGCN improves ROC AUC by 4.3%, 3.9% and 2.8% over the best baseline on CITESEER, PUBMED and DBLP, respectively, which demonstrates its strong denoising capacity.
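As an illustration of the perturbation studied in this case, the following minimal sketch adds zero-mean Gaussian noise to every node attribute. It assumes that the noise level acts as the standard deviation of the noise; the exact noise model is not restated in this section, and the function name `add_noise_level` is hypothetical. Plain Python lists are used to keep the sketch free of dependencies.

```python
import random


def add_noise_level(features, noise_level, seed=0):
    """Return a copy of `features` (a list of rows) with Gaussian noise on every entry.

    Assumption: `noise_level` is used as the standard deviation of the noise.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, noise_level) for x in row] for row in features]


# Toy attribute matrix: 2 nodes with 3 attributes each.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
X_noisy = add_noise_level(X, noise_level=0.2)
```

With `noise_level=0.0` the function returns the clean attributes unchanged, which corresponds to the clean-data setting.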

Noise-rate Case
The results for the noise-rate case are shown in Fig. 4 and Fig. 3.
For node classification, the results are similar to those in the noise-level case: BiGCN outperforms most baselines on all datasets across all noise rates, with flatter declines. On PUBMED in particular, BiGCN improves node classification accuracy by more than 10%. For link prediction, BiGCN is at least 0.3%, 4.8%, 2.5% and 1.2% better than the best baseline on CORA, CITESEER, PUBMED and DBLP, respectively. In this case, BiGCN is more robust than the baselines.
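The noise-rate setting can be sketched in a similar way. Here a fraction of the nodes is chosen at random and only their attribute rows are perturbed, following the node-wise view of noise from the introduction. The unit noise scale, the uniform choice of nodes and the function name `add_noise_rate` are assumptions made for illustration.

```python
import random


def add_noise_rate(features, noise_rate, sigma=1.0, seed=0):
    """Perturb the attribute rows of a random fraction `noise_rate` of the nodes.

    Assumptions: noisy nodes are drawn uniformly without replacement and
    receive zero-mean Gaussian noise with standard deviation `sigma`.
    Returns the perturbed copy and the set of noisy node indices.
    """
    rng = random.Random(seed)
    num_nodes = len(features)
    num_noisy = round(noise_rate * num_nodes)
    noisy_nodes = set(rng.sample(range(num_nodes), num_noisy))
    perturbed = [
        [x + rng.gauss(0.0, sigma) for x in row] if i in noisy_nodes else list(row)
        for i, row in enumerate(features)
    ]
    return perturbed, noisy_nodes
```

Returning the indices of the noisy nodes makes it easy to check that the remaining rows are left untouched.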

Structure-mistakes Case
Structure mistakes refer to incorrect connections between nodes. In this case, we perform node classification; the results are shown in Fig. 5 and Fig. 3. BiGCN improves accuracy by at least 3.6%, 4.1%, 1.2%, 6.7% and 10.1% on CORA, CITESEER, PUBMED, AMZ COMPUTER and AMZ PHOTOS, respectively. Meanwhile, as the number of faulty connections increases, our model declines much more slowly than the baselines, demonstrating its robustness. The main reason is that the bi-directional filters can effectively use information from the latent feature graph and substantially reduce the negative impact of incorrect structural information.
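One simple way to simulate such structure mistakes is to flip a given ratio of node pairs, so that an existing edge is removed and a missing edge is added. The sketch below assumes an undirected graph stored as a set of node pairs and takes the ratio relative to all possible node pairs; both choices, and the function name `flip_edges`, are assumptions rather than the exact procedure used in the experiments.

```python
import random
from itertools import combinations


def flip_edges(edges, num_nodes, ratio, seed=0):
    """Toggle a fraction `ratio` of all unordered node pairs.

    Enumerating every pair costs O(num_nodes^2), so this is meant for
    small illustrative graphs only.
    """
    rng = random.Random(seed)
    pairs = list(combinations(range(num_nodes), 2))
    num_flips = round(ratio * len(pairs))
    # Store each undirected edge once, with the smaller index first.
    current = {tuple(sorted(edge)) for edge in edges}
    for pair in rng.sample(pairs, num_flips):
        current ^= {pair}  # symmetric difference: remove if present, add if absent
    return current
```

Because each sampled pair is toggled exactly once, the number of edges that differ from the original graph equals the number of flips.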

Sensitivity Analysis
To show how the hyper-parameters (number of ADMM iterations, $\lambda_2$, p and λ) influence BiGCN, we take Cora as an example and present node classification results under selected artificial-noise settings. First, we investigate the influence of the number of iterations and $\lambda_2$ on clean data and on three noise cases: $n_r = 0.2$ for noise-rate, $n_l = 0.2$ for noise-level and $r = 0.001$ for structure-mistakes. Fig. 6(a) shows that 2 ADMM iterations are sufficient and that the choice of $\lambda_2$ has very little impact on the results, since it can be absorbed into the learnable $L_2$.
Then we illustrate how much the performance of BiGCN depends on p and λ. The experimental results in Fig. 6(b) and Fig. 6(c) show that: 1) for all λ ≤ 1.2, performance is relatively stable over a wide range of p values (p ≥ 3); 2) when λ ≥ 1.5, the behavior for p ≥ 3 is quite different: accuracy decreases rapidly as p increases. For p ≤ 2, erratic fluctuations are observed. However, with an appropriate value of p (e.g. p = 3), we can still obtain competitive or even the best performance; 3) λ has a larger impact on performance on CORA with more noise (0.8 noise-rate and 0.8 noise-level).

Flexible Selection of L 2
In our paper, we treat the latent feature graph $L_2$ as a learnable matrix and optimize it automatically. In practice, however, it can also be given a fixed form. For example, a common way to handle latent correlation is to use a correlation graph [38]. Another special case is to define $L_2$ as an identity matrix, in which case our model degenerates to a normal (single-directional) low-pass filtering GCN: with $L_2 = I$ in Equation 5, the solution takes a form similar to the single-directional low-pass filter (Equation 3), and the BiGCN layer degenerates to a GCN layer.

Fig. 6. Sensitivity analysis of the number of ADMM iterations, $\lambda_2$, λ and p on node classification. For the iterations ((a) left) and $\lambda_2$ ((a) right), we conduct experiments on clean data and on three noise cases with 0.2 noise-rate, 0.2 noise-level and 0.1% structure-mistakes, respectively. For p and λ, in addition to these cases, we also report the performance of BiGCN on Cora with 0.8 noise-rate ((b) right) and 0.8 noise-level ((c) middle). Panel (c): p and λ analysis on two cases of noise-level and structure-mistakes.

To show the difference between different definitions of $L_2$, we design a simple approach that uses a thresholded correlation matrix for $L_2$ and compare it with the method used in our main paper. In particular, we define the edge weights $A_{ij}$ by thresholding the feature correlation matrix. Then we compute $L_2$ as the normalized Laplacian obtained from A, i.e. $L_2 = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. For a simple demonstration, we only compare the two models on CORA with feature noise. From Table 4 and Table 5, we can see that our learnable $L_2$ is better overall. However, a fixed $L_2$ can still give decent results. When the node feature dimension is large, fixing $L_2$ may be more efficient.
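The fixed alternative described above can be sketched as follows. Because the exact thresholding rule is not reproduced here, the sketch assumes a binary rule: two feature dimensions are connected when the absolute value of their Pearson correlation reaches a threshold. It further assumes that the tilde denotes added self-loops, as in the standard GCN normalization. The threshold value and the function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v) if std_u > 0 and std_v > 0 else 0.0


def fixed_feature_graph(features, threshold=0.5):
    """Build a fixed L_2 from a thresholded feature correlation matrix.

    `features` is a list of node rows; the graph is over feature dimensions.
    Assumed rule: A_ij = 1 if |corr(i, j)| >= threshold, else 0, plus self-loops.
    """
    columns = list(zip(*features))  # one tuple per feature dimension
    dim = len(columns)
    a_tilde = [
        [
            1.0 if i == j or abs(pearson(columns[i], columns[j])) >= threshold else 0.0
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    # Self-loops guarantee every degree is at least 1, so the division is safe.
    degree = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(dim)]
        for i in range(dim)
    ]
```

Computing all pairwise correlations is quadratic in the feature dimension, but it is done once before training, whereas a learnable $L_2$ has to be updated throughout optimization.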

More Discussion and Results in an Additional Case
Note that in most benchmarks node attributes are abundant or even redundant, whereas in the real world it is hard to obtain adequate information. In this section, we develop an additional case, the FEATURE-RATE case, to illustrate that even with limited attribute information, BiGCN can still capture useful and informative feature correlations and improve model performance. In this case, we keep only a portion of the feature dimensions (feature rate $r_f \in \{0.2, 0.4, 0.6, 0.8, 1\}$) as the model input and perform node classification and link prediction. The experimental results are shown in Fig. 7. They demonstrate that our learned feature adjacency matrix $L_2$ is indeed able to capture effective feature connections that, to some extent, make up for the deficiency of attribute information.

TABLE 5. Node classification accuracy in the noise-level case on the Cora dataset for two types of $L_2$.
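The feature-rate setting described above can be sketched as a random selection of feature dimensions. The sketch assumes that the retained dimensions are drawn uniformly at random and that at least one dimension is always kept; how the dimensions are actually selected is not stated in this section, and the function name `keep_feature_rate` is hypothetical.

```python
import random


def keep_feature_rate(features, feature_rate, seed=0):
    """Keep a random fraction `feature_rate` of the feature dimensions.

    Returns the reduced attribute matrix and the sorted indices of the kept
    dimensions, so the same subset can be reused for every model.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    num_keep = max(1, round(feature_rate * dim))
    kept = sorted(rng.sample(range(dim), num_keep))
    return [[row[j] for j in kept] for row in features], kept
```

Fixing the seed and reusing the returned indices keeps the comparison between BiGCN and the baselines on an identical input.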

CONCLUSION
We proposed a bi-directional low-pass filtering GCN, a more powerful and robust network than general spectral GCNs. The bi-directional filter of BiGCN can capture more informative graph signal components than the single-directional one. With the help of latent feature correlation, BiGCN also enhances the network's tolerance to noisy graph signals and unreliable edge connections. Extensive experiments show that our model achieves remarkable performance improvement on noisy graphs.