Drop Edges and Adapt: a Fairness Enforcing Fine-tuning for Graph Neural Networks

The rise of graph representation learning as the primary solution for many different network science tasks led to a surge of interest in the fairness of this family of methods. Link prediction, in particular, has a substantial social impact. However, link prediction algorithms tend to increase the segregation in social networks by disfavoring the links between individuals in specific demographic groups. This paper proposes a novel way to enforce fairness on graph neural networks with a fine-tuning strategy. We Drop the unfair Edges and, simultaneously, we Adapt the model's parameters to those modifications, DEA in short. We introduce two covariance-based constraints designed explicitly for the link prediction task. We use these constraints to guide the optimization process responsible for learning the new "fair" adjacency matrix. One novelty of DEA is that we can use a discrete yet learnable adjacency matrix in our fine-tuning. We demonstrate the effectiveness of our approach on five real-world datasets and show that we can improve both the accuracy and the fairness of the link prediction tasks. In addition, we present an in-depth ablation study demonstrating that our training algorithm for the adjacency matrix can be used to improve link prediction performance during training. Finally, we compute the relevance of each component of our framework to show that the combination of both the constraints and the training of the adjacency matrix leads to optimal performance.


Introduction
The fairness of graph representation learning algorithms is quickly becoming a crucial area of research. Of particular interest is the fairness issue associated with the link prediction task. This task is heavily applied in two of the most influential AI-powered domains of our digital life: social networks and product recommendation. Social network topologies define the stream of information we receive, often influencing our opinions McPherson et al. (2001); Halberstam and Knight (2016); Lee et al. (2019); Abbass (2018). Moreover, malicious users can modify topologies to spread false information Roy and Chahar (2021). Similarly, recommender systems suggest products tailored to our characteristics and purchase history. However, pursuing the highest accuracy has led to the discrimination of minorities in the past Corbett-Davies et al. (2017); Obermeyer et al. (2019), despite the law prohibiting unfair treatment based on sensitive traits such as race, religion, and gender. The unfairness arises even if the sensitive attributes are not used explicitly in the learning model. For example, most social networks are homophily-dominant: nodes in a local neighbourhood tend to belong to the same sensitive class, with minimal connections across nodes of differing sensitive attributes. Therefore, communities isolate themselves, polarizing the opinions expressed within them. This effect is also known as the filter bubble problem. The same issue affects the bipartite graphs of users and items used in product recommendation. In Nguyen et al. (2014), the authors concluded that recommender systems reduce the user's exposure to the available items over time. For example, streaming services may recommend movies from a particular genre to users of a specific gender. Thus, link prediction algorithms have a substantial social impact and can worsen existing biases in the data. However, enforcing the prediction of new links to be fair can mitigate the issue.
Graph neural networks (GNNs) Bronstein et al. (2017); Bacciu et al. (2020); Spinelli et al. (2021) provide state-of-the-art link prediction results with an end-to-end learning paradigm. A common approach to improve the fairness of these algorithms introduces fairness enforcing constraints during the model's training Bose and Hamilton (2019). Another strategy modifies the graph's topology to post-process the model's predictions Spinelli et al. (2022); Dai and Wang (2020); Loveland et al. (2022). Alongside this, the community is studying how to measure the actual fairness introduced in the system by these methods. Link prediction requires a dyadic fairness measure that considers the influence of both sensitive attributes associated with the connection Masrour et al. (2020). However, most works on fairness measures focus on independent and identically distributed (i.i.d.) data. A common solution consists in determining new groups defined for the edges. Then, it is possible to measure the level of equity of a new edge added to the graph by applying the known fairness metrics to these new groups.
Since training is the most expensive phase of the modern machine learning pipeline (excluding data harvesting and labelling), we designed a fine-tuning strategy named DEA, in which we learn to modify the graph's topology and adapt the parameters of the network to those modifications. Novel covariance-based constraints designed for the link prediction task guide the fine-tuning. We introduce a novel parametrization that allows the optimization of the new adjacency matrix in its discrete form. We apply a variation of the Gumbel-max trick Jang et al. (2017), paired with a small multilayer perceptron, that allows us to sample the edges from the original adjacency matrix.

Related Works
In this section, we focus on the recent contributions to the fair graph representation learning field. Despite the extensive and interdisciplinary literature Chiappa (2019); Chiappa et al. (2020) on algorithmic bias, the study of fairness in graph representation learning is recent. The surge of interest is due to the state-of-the-art results of graph neural networks (GNNs) in many graph-based tasks. Some works focused on the node embedding task, creating fair embeddings to use as the input of a downstream link prediction task. Compositional fairness constraints Bose and Hamilton (2019) learn a set of adversarial filters that remove information about particular sensitive attributes. GUIDE Song et al. (2022) maximizes overall individual fairness while minimizing group disparity of individual fairness across different groups. FairWalk Rahman et al. (2019) is an adaptation of Node2Vec Grover and Leskovec (2016) that aims to increase the fairness of the resulting embeddings. It modifies the transition probability of the random walks at each step by weighing the neighbourhood of each node according to their sensitive attributes. The recent work of Li et al. (2021) learns a fair adjacency matrix during an end-to-end link prediction task. FairAdj uses a graph variational autoencoder Kipf and Welling (2016) as base architecture and introduces two different optimization processes: one for learning a fair version of the adjacency matrix and one for the link prediction. Similarly, FairDrop Spinelli et al. (2022) modifies the adjacency matrix during training with a biased edge dropout targeting the homophily of the sensitive attribute. However, the biased procedure is non-trainable. FairMod Current et al. (2022) and FairEdit Loveland et al. (2022) consider debiasing the input graph during training with the addition of artificial nodes and edges, not just their deletion. Except for FairDrop and FairAdj, the other solutions explicitly target the task of computing node embeddings or node classification. To our knowledge, we are the first to propose a model-agnostic fine-tuning strategy to solve the link prediction task end-to-end, optimizing both the model's utility and its fairness protection. Our contribution contains two novelties. On one side, we introduce two covariance-based constraints designed explicitly to enforce the fairness of the link prediction classification. On the other, we propose a novel way to parametrize a discrete yet trainable adjacency matrix. The latter aspect is of particular interest to the community seeking to improve the quality of the messages sent across the graph Gasteiger et al. (2019); Kazi et al. (2022). DropEdge Rong et al. (2020) is a dropout mechanism which randomly removes a certain number of edges from an input graph at each training epoch to sparsify the connectivity. In Sparsified Graph Convolutional Network (SGCN) Li et al. (2022), the authors first pre-train a GCN to solve a node classification task. Then, a neural network sparsifies the graph by pruning some edges. Finally, they improve the classification performance by training a new GCN on the sparsified graph. Rather than sparsifying the topology, another approach consists in rewiring the connections. GraphSAGE Hamilton et al. (2017) performs a neighbourhood sampling with the intent of scaling to larger graphs. The solution proposed in Gasteiger et al. (2019) alleviates the problem of noisy and often arbitrary edges in real graphs by combining spectral and spatial techniques. DGM Kazi et al. (2022) and IDGL Chen et al. (2020) jointly learn the graph structure and graph embedding for a specific task. Finally, taking distance from the message passing framework and using tools from differential geometry, the authors of Topping et al. (2022) present a new curvature-based method for graph rewiring. Our solution is closely related to the first approaches, which sparsify the topology. However, in future works, we plan to rewire the graphs' topology with the same underlying objective.

Graph representation learning
In this work we consider an undirected and unweighted graph G = (V, E), where V = {1, . . . , n} is the set of node indexes and E = {(i, j) | i, j ∈ V} is the set of arcs (edges) connecting pairs of nodes. The meaning of a single node or edge depends on the application. For some tasks, a node i is endowed with a vector x_i ∈ R^d of features. Each node is also associated with a categorical sensitive attribute s_i ∈ S (e.g., political preference, ethnicity, gender), which may or may not be part of its features. Connectivity in the graph can be summarized by the adjacency matrix A ∈ {0, 1}^{n×n}. This matrix is used to build different types of operators that define the communication protocols across the graph. The vanilla operator is the symmetrically normalized graph Laplacian Kipf and Welling (2017). A graph neural network GNN(X, A) can combine node features with the structural information of the graph by solving an end-to-end optimization problem. We focus on the link prediction task, where the objective is to predict whether two nodes in a network are likely to have a link Liben-Nowell and Kleinberg (2007). The output of the GNN consists of a matrix of node embeddings H. From it, we compute a new n × n matrix containing a probability score for each possible link in the graph, Ŷ = sigmoid(HH^T). The optimization objective is a binary cross-entropy loss over a subset of positive training edges and negative ones (sampled once).
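As a concrete reference, the decoder and loss described above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming conventions, not the authors' code:

```python
import torch

def link_prediction_scores(H: torch.Tensor) -> torch.Tensor:
    # Pairwise link probabilities from node embeddings: Y_hat = sigmoid(H H^T).
    return torch.sigmoid(H @ H.T)

def link_prediction_loss(H, pos_edges, neg_edges):
    # pos_edges / neg_edges: (2, E) tensors of node-index pairs.
    scores = link_prediction_scores(H)
    pos = scores[pos_edges[0], pos_edges[1]]
    neg = scores[neg_edges[0], neg_edges[1]]
    preds = torch.cat([pos, neg])
    # Positive edges get label 1, negative samples get label 0.
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return torch.nn.functional.binary_cross_entropy(preds, labels)
```

In practice H would come from the last layer of the GNN; here any embedding matrix demonstrates the shape of the computation.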

Dyadic group fairness metrics
Fairness in decision-making is broadly defined as the absence of any advantage or discrimination towards an individual or a group based on their traits Saxena et al. (2019). Due to the broadness of the definition, there are several different fairness metrics, each focused on another type of discrimination Mehrabi et al. (2019). We focus on group fairness metrics, which measure whether the model's predictions disproportionately benefit or damage people of different groups defined by their sensitive attributes.
These measures are usually expressed in the context of a binary classification problem. In the notation of the previous section, denote by Y ∈ {0, 1} a binary target variable defined for each node of the graph, and by Ŷ = f(x) a predictor that does not exploit the graph structure. As before, we associate to each x a categorical sensitive attribute S. For simplicity's sake, we assume S to be binary, but the following definitions extend easily to the multi-class case. Two widely used criteria belonging to this group are: • Demographic Parity (DP) Dwork et al. (2012): a classifier satisfies DP if the likelihood of a positive outcome is the same regardless of the value of the sensitive attribute S.
• Equalized Odds (EO) Hardt et al. (2016): a classifier satisfies EO if it has equal rates for true positives and false positives between the two groups defined by the protected attribute S.
These definitions trivially extend to cases where the categorical sensitive attribute can have more than two values, |S| > 2. For the rest of the paper, we consider this scenario. In the link prediction task, the predictive relationship between two nodes should be independent of both sensitive attributes. Therefore, in Masrour et al. (2020) and Spinelli et al. (2022), the authors introduced three dyadic criteria to map the sensitive attributes from the nodes to the edges. The original groups defined by S generate different dyadic subgroups associated with the edges, D. The dyadic groups can be summarized as follows: • Mixed dyadic (|D| = 2): the original groups generate two dyadic groups independently of the cardinality of the sensitive attribute. An edge is in the intra-group if it connects a pair of nodes with the same sensitive attribute; otherwise, it is part of the inter-group.
• Group dyadic (|D| = |S|): creates a one-to-one mapping between the dyadic and node-level groups. Each edge is counted twice, once for every sensitive attribute involved. This dyadic definition ensures that the nodes participate in the links' creation regardless of the value of their sensitive attribute.
• Sub-group dyadic (|D| = |S|(|S| + 1)/2): every unordered pair of sensitive attribute values defines its own dyadic group, so each edge belongs to the sub-group determined by the attributes of its two endpoints.
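The mixed and group dyadic mappings can be sketched as follows. This is an illustrative NumPy snippet; the function names and array layout are our assumptions:

```python
import numpy as np

def mixed_dyadic(s, edges):
    # s: per-node sensitive attributes; edges: (E, 2) array of node pairs.
    # Returns 1 for intra-group edges (same attribute), 0 for inter-group.
    return (s[edges[:, 0]] == s[edges[:, 1]]).astype(int)

def group_dyadic(s, edges, num_groups):
    # One dyadic group per sensitive value; each edge is counted twice,
    # once for the attribute of each endpoint.
    membership = [[] for _ in range(num_groups)]
    for idx, (i, j) in enumerate(edges):
        membership[s[i]].append(idx)
        membership[s[j]].append(idx)
    return membership
```

With these mappings in hand, any standard group fairness metric can be evaluated on the edge-level groups instead of the node-level ones.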

Drop Edges and Adapt
In this work, we aim to improve the fairness of a trained GNN. In our fine-tuning strategy, we optimize the model and the adjacency matrix at the same time to solve the main task subject to a fairness constraint. To optimize the adjacency matrix, we learn a latent variable for each edge in the original graph with a neural network. The number of introduced parameters is negligible with respect to the size of the input graph, which makes our approach applicable to large-scale datasets. We focus our evaluation on the task of end-to-end link prediction and therefore design the constraint accordingly. We show the general framework of our method in Figure 1. We aim to fine-tune a trained model with an additional regularization term enforcing fairness, by changing the adjacency matrix and adapting the network weights to these modifications. To do so, we introduce a separate architecture called Sampler, containing an MLP. The Sampler takes as input the node embeddings produced by the GNN and builds representations for the edges in the graph. Then it outputs a new adjacency matrix, which the GNN uses to make its predictions. The fine-tuning loss comprises the cross-entropy loss and a fairness constraint, which update both the Sampler and the GNN. Below, we introduce each element in a separate section.

Sampler
The Sampler is one of the two key contributions of our proposed approach. We want to sample the edges from the original adjacency matrix to help the GNN produce fairer predictions. At the same time, it has to preserve the discrete nature of the graph during the training process. The Sampler contains an MLP taking as input an edge embedding, defined as the concatenation of the two node embeddings produced by the last layer of the GNN. The output of the MLP is an unnormalized probability vector z, where each element is associated with an edge of the graph. To sample the edges, we use the Gumbel-max trick Jang et al. (2017). It is a method to draw a sample from a categorical distribution, given by its unnormalized (log-)probabilities. The community proposed several extensions of this trick, including a Gumbel-sigmoid Geng et al. (2020). We apply this function to the vector z:

m_(i,j) = sigmoid((z_(i,j) + G) / τ), ∀(i, j) ∈ E,

where G is an independent Gumbel noise and τ ∈ (0, ∞) is a temperature parameter. As τ diminishes to zero, a sample from the Gumbel-sigmoid distribution becomes cold and resembles a one-hot (binary) sample. The procedure generates a new vector of soft-noisy weights m_(i,j), ∀(i, j) ∈ E. Finally, we build the new adjacency matrix M, where each element is defined as follows:

M_(i,j) = 1 if m_(i,j) ≥ 0.5, and M_(i,j) = 0 otherwise.

The flow of the gradient is guaranteed thanks to the use of a straight-through estimator Hinton et al. (2012).
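A minimal sketch of the Gumbel-sigmoid sampling with a straight-through estimator, assuming PyTorch and a single Gumbel noise term per edge as in the description above (the function name and the 0.5 threshold are our assumptions):

```python
import torch

def gumbel_sigmoid_st(z: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Draw one Gumbel noise sample per edge: G = -log(-log(U)), U ~ Uniform(0, 1).
    u = torch.rand_like(z).clamp(1e-10, 1 - 1e-10)
    g = -torch.log(-torch.log(u))
    # Soft, differentiable edge weights m_(i,j).
    m = torch.sigmoid((z + g) / tau)
    # Discretize at 0.5 to obtain the entries of the new adjacency matrix M.
    m_hard = (m >= 0.5).float()
    # Straight-through estimator: the forward pass uses the discrete values,
    # while gradients flow through the soft weights m.
    return (m_hard - m).detach() + m
```

The returned tensor is binary in the forward pass, yet `backward()` propagates gradients to z, so the MLP producing z remains trainable.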

Constraints
In Zafar et al. (2019), the authors introduce a constraint to design convex boundary-based classifiers free of disparate impact. They use the covariance between the sensitive attribute s and the signed distance from the feature vectors to the decision boundary. Even if this measure is just a proxy for the disparate impact, it led to good empirical results. Neural networks, however, are not convex boundary-based classifiers, so we cannot apply the constraint in its original formulation. To this end, we propose to exploit the prediction margin instead of the distance from the decision boundary. We recall that the prediction margin for a model parametrized by θ is defined as follows:

margin_θ(i, j) = ŷ_(i,j) − δ,

where ŷ_(i,j) is the predicted probability for the edge between node i and node j, and δ is the threshold that assigns the edge to the positive class if ŷ_(i,j) ≥ δ and to the negative class otherwise.
In our definition of the constraint, we consider the dyadic nature of the link prediction task. The first and most effortless approach consists in building a constraint replicating the mixed dyadic definition. We create a new vector in which we assign a single value to each edge: e_(i,j) = 1 if the nodes at the ends of the edge have the same sensitive attribute, and e_(i,j) = 0 otherwise. The covariance mixed dyadic constraint can be written as:

CovM = | (1/|E|) Σ_{(i,j)∈E} (e_(i,j) − ē) · margin_θ(i, j) | ≤ c,

where ē is the mean of the e vector.
We then propose a second version of the constraint, mimicking the group dyadic definition, to create a more expressive constraint. We create as many vectors as the cardinality of the sensitive attribute S. The first vector e^1 is associated with the first possible value of the sensitive attribute S, denoted as s_1, and so on. We then let e^k_(i,j) = 1 if at least one of i or j has s_k as sensitive attribute. We end up with |S| different e vectors and the same number of covariance constraints. We can minimize the constraints independently, by assigning a different threshold c to each one of them, or by averaging them together. We can express the latter approach as:

CovG = (1/|S|) Σ_{k=1}^{|S|} | (1/|E|) Σ_{(i,j)∈E} (e^k_(i,j) − ē^k) · margin_θ(i, j) | ≤ c.

In our evaluation, we opted for the second solution, leaving the first approach for future work. In the end, this last approach can be viewed as a one-vs-all fairness constraint, where we try to maximize the fairness of all groups at once.
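Both constraints reduce to a covariance between a dyadic indicator vector and the prediction margin. A hedged PyTorch sketch, with our own function names and `delta` standing for the threshold δ:

```python
import torch

def prediction_margin(y_hat: torch.Tensor, delta: float) -> torch.Tensor:
    # margin_theta(i, j) = y_hat_(i,j) - delta
    return y_hat - delta

def cov_mixed(y_hat, e, delta=0.5):
    # CovM: |covariance| between the intra/inter indicator e and the margin.
    margin = prediction_margin(y_hat, delta)
    return torch.abs(torch.mean((e - e.mean()) * margin))

def cov_group(y_hat, e_list, delta=0.5):
    # CovG: average of one-vs-all covariance terms, one per sensitive value.
    return torch.stack([cov_mixed(y_hat, e_k, delta) for e_k in e_list]).mean()
```

Both functions are differentiable in `y_hat`, so they can be added to the cross-entropy loss as the regularization term balanced by λ.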

Fine-tuning
Fine-tuning a model has several advantages over training from scratch when one is trying to impose some constraints. First, it is easy to compare the fairness of the predictions of the original model and the fine-tuned one. Secondly, it is possible to create a fairer model and obtain a more equitable prediction on the new adjacency matrix without retraining the model. Ideally, we would optimize the adjacency matrix alone. However, as shown in the ablation section, the model suffers from drastic changes to its inputs. We found that adapting the model's parameters while learning the adjacency matrix stabilizes the predictive performance while improving its fairness. We start with a trained model parameterized by θ and a threshold value δ used to assign an edge to the positive or negative class. Next, we sample a negative set of edges for the link prediction loss. For each epoch, we compute the node embeddings. The Sampler takes them as input and outputs M. The network combines this discrete and trainable adjacency matrix with the negative samples for the final feedforward step. Next, we compute the standard cross-entropy for the link prediction task and our covariance-based fairness enforcing constraint. The constraint is balanced with an additional hyperparameter λ. Finally, we update the GNN and the MLP inside the Sampler.
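Putting the pieces together, one fine-tuning step could look like the following sketch. Here `gnn` and `sampler` are stand-in callables, the constraint is a CovM-style term on the positive edges, and all names are illustrative assumptions rather than the authors' code:

```python
import torch

def dea_finetune_step(gnn, sampler, X, A, pos, neg, e, optimizer,
                      lam=0.5, delta=0.5, tau=1.0):
    # One DEA fine-tuning step: learn the adjacency and adapt the model.
    optimizer.zero_grad()
    H = gnn(X, A)                  # embeddings on the original adjacency
    M = sampler(H, A, tau)         # new discrete, trainable adjacency
    H = gnn(X, M)                  # feedforward step on the fair adjacency
    scores = torch.sigmoid(H @ H.T)
    preds = torch.cat([scores[pos[0], pos[1]], scores[neg[0], neg[1]]])
    labels = torch.cat([torch.ones(pos.shape[1]), torch.zeros(neg.shape[1])])
    bce = torch.nn.functional.binary_cross_entropy(preds, labels)
    # CovM-style constraint on the positive edges (a simplification).
    margin = scores[pos[0], pos[1]] - delta
    cov = torch.abs(torch.mean((e - e.mean()) * margin))
    loss = bce + lam * cov         # lambda balances utility and fairness
    loss.backward()
    optimizer.step()               # updates both the GNN and the Sampler's MLP
    return loss.item()
```

Any parameters reachable through `gnn` and `sampler` receive gradients, matching the joint update of the GNN and the Sampler described above.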

Experimental section
We focus our experiments on measuring the impact of our fine-tuning strategy on the fairness of the link prediction task. We use six fairness metrics (two for each dyadic group) together with the AUC and accuracy on the main task. In addition, we report the average and standard deviation of ten runs with random data splits. We monitor the Demographic Parity difference (∆DP) and the Equalized Odds difference (∆EO). The first measures the difference between the largest and the smallest group-level selection rate:

∆DP = max_d P(Ŷ = 1 | D = d) − min_d P(Ŷ = 1 | D = d).

The latter reports the maximum discrepancy between the true positive rate (TPR) difference and the false positive rate (FPR) difference between the groups:

∆EO = max(max_d TPR_d − min_d TPR_d, max_d FPR_d − min_d FPR_d).

Our evaluation comprises five datasets. We report their statistics in Table 1. DBLP is a co-authorship network built in Buyl and De Bie (2020) from the original dataset introduced in Tang et al. (2008). Nodes represent authors and are connected if they have collaborated at least once. The sensitive attribute is the continent of the author's institution, excluding Africa and Antarctica because of their under-representation in the data. Facebook (FB) Leskovec and Mcauley (2012) is a combination of ego-networks introduced in Spinelli et al. (2022), obtained from a social network. The graph encodes users as nodes, with gender as a sensitive attribute and friendships as links. These two datasets do not have feature vectors associated with the nodes; therefore, we used the eigenvectors of the Laplacian matrix as input features. We included three benchmark citation networks: Citeseer, Cora-ML, and PubMed. In these graphs, nodes are articles associated with a bag-of-words representation of the abstract. Links represent citations regardless of direction. We used the category of the article as a sensitive attribute. We recall that the value of the sensitive attribute arises naturally from the graph topology but is never used directly in the learning pipeline. We tested our fine-tuning strategy on a GCN Kipf and Welling (2017) and a GAT Veličković et al. (2018). We used an embedding size of 128 for the GCN. The GAT uses an embedding size of 16 with eight attention heads, which are concatenated. We used two layers for the citation datasets and four for the two more complex datasets. We chose the threshold for computing the accuracy and the corresponding fairness with a grid search in the interval [0.4, 0.7] for each algorithm. In our covariance constraints, we set c = 0 and choose λ to balance the regularization term with a grid search. The temperature τ of the Gumbel-sigmoid followed a linear decay from 5 to 1 for each dataset. The MLP in the Sampler has two layers of 128 units across all experiments. We trained the models using the Adam optimizer Kingma and Ba (2014) for 100 epochs on every dataset except FB, which required 200 epochs. Our fine-tuning required 100 additional epochs. We compare against competitors designed to enforce the fairness of link prediction tasks. We build upon the experimental evaluation proposed in Spinelli et al. (2022). Therefore, we include DropEdge and FairDrop as plain and biased sparsification techniques and FairAdj as a more complex approach. For the latter method, we used the two configurations suggested in the original implementation. The one with hyperparameter T2 = 20 provides a stronger regularization towards fairness than the model trained with T2 = 5, at the cost of lowering the model's utility.
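The two reported metrics can be computed on the dyadic groups as below. This is a NumPy sketch with our own function names; inputs are binary predictions and per-edge dyadic group labels:

```python
import numpy as np

def demographic_parity_diff(y_pred, groups):
    # Delta DP: gap between the largest and smallest group-level selection rate.
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def equalized_odds_diff(y_true, y_pred, groups):
    # Delta EO: the larger of the TPR gap and the FPR gap across groups.
    tprs, fprs = [], []
    for g in np.unique(groups):
        yt, yp = y_true[groups == g], y_pred[groups == g]
        tprs.append(yp[yt == 1].mean())  # true positive rate of group g
        fprs.append(yp[yt == 0].mean())  # false positive rate of group g
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

Both metrics are zero for a perfectly fair predictor and grow as the groups are treated more unequally.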

Results
We present the results in Tables 2, 3, 4, 5 and 6. DEA with CovM and CovG improves the accuracy on larger datasets and provides state-of-the-art protection against unfairness. CovM shows better fairness metrics on the mixed dyadic group, while CovG has a slight advantage on the group dyadic definition. Since each constraint's definition closely resembles the corresponding group definition, this is not surprising. CovG has a slight overall advantage, probably due to the additional expressiveness of the constraint. On the smaller datasets (Tables 2, 3 and 4), DEA provides slightly better protection than FairAdj. However, the latter loses accuracy and AUC, with severe losses in Table 4, where FairAdj drops about 15% in accuracy and 10% in AUC. FairAdj fails to solve the link prediction task on complex datasets like DBLP (Tab. 5) and FB (Tab. 6). In the end, DEA removes around 10% of the edges, considerably fewer than DropEdge and FairDrop. In Figure 2, we show the intermediate steps resulting in the final version of the fair adjacency matrix M. There is little difference between the learnt edge distribution z and its noisy version m after the Gumbel-sigmoid trick. It is also possible to see that CovM is more peaked at the extreme values. Finally, Figure 2(c) shows the number of edges removed from the original adjacency matrix to obtain a fairer link prediction.

Ablation
In this section, we perform an in-depth ablation study to shed more light on the effect of these additional epochs and of each component of the framework. In the first experiment, we train for the same number of epochs a standard GCN and one paired with the Sampler, learning a new adjacency matrix with the sole objective of maximizing accuracy. Then, we optimize the adjacency matrix and the model's parameters to solve the main task without additional fairness constraints. Finally, in Table 7, we show that those modifications to the adjacency matrix improve the link prediction performance.

In the second experiment, we focus on the various components of our architecture on the Citeseer dataset. Results are visible in Table 8. Each time, we disable a different component of our framework. In the second and third rows, we train everything from scratch instead of fine-tuning a model. In "Training w X", we feed the Sampler the concatenation of the feature vectors associated with the nodes instead of the node embeddings generated by the GNN. We then proceed to fine-tune the model while disabling some components. In "w/o Sampler", we keep the covariance constraint but remove the learning of the adjacency matrix. In "w/o CovM", we do the opposite. Finally, we fine-tune the model without any modification. The latter solution has comparable performance in terms of accuracy, but significantly worse fairness metrics. Training from scratch yields similar results. Fine-tuning with either the covariance constraint or the Sampler improves the fairness, but we obtain the best results when both are active.

Conclusions
We introduced DEA, a novel approach to improve the fairness of a GNN solving a link prediction task. In our fine-tuning strategy, we learn to modify the graph's topology and adapt the parameters of the network to those modifications. A module called Sampler learns to drop edges from the original adjacency matrix. We exploit a Gumbel-sigmoid to sample a new discrete and fair adjacency matrix. At the same time, the GNN uses this new matrix for fine-tuning. We guide both optimization processes with an additional regularization term shaped as a covariance-based constraint. We provided two different formulations: the first acts on the inter- and intra-group connections between the groups defined by the sensitive attribute; the second models each value of the sensitive attribute in a one-vs-rest paradigm. We performed an extensive experimental evaluation, demonstrating that our fine-tuning strategy provides state-of-the-art protection against unfairness while improving the model's utility on the original task. Finally, we performed an ablation study on the contribution of each component of our pipeline. In the future, we would like to learn to add new connections instead of just dropping them from the original adjacency matrix.

Figure 1 :
Figure 1: DEA schematics. The pre-trained GNN extracts the node embeddings H. The Sampler takes them as input and returns a new, fairness enforcing, discrete adjacency matrix M. The new matrix is used as input for a new feedforward step of the GNN. Finally, we update the Sampler and the GNN with a combination of the binary cross-entropy loss and our covariance-based fairness constraint.

Figure 2 :
Figure 2: Edge distribution at different stages of our pipeline. In blue, we depict the results obtained using the CovM constraint; in orange, the ones with CovG. Figure (a) shows the distribution z learnt by the MLP inside our Sampler. Figure (b) shows the approximation m after the Gumbel-sigmoid trick. Finally, Figure (c) shows the number of edges removed and kept in the new fairness-enforcing adjacency matrix M, obtained by thresholding the values in m at 0.5.

Table 2 :
Link prediction on Citeseer

Table 3 :
Link prediction on Cora

Table 4 :
Link prediction on PubMed

Table 5 :
Link prediction on DBLP

Table 6 :
Link prediction on FB

Table 7 :
Results obtained training a GCN on Citeseer with and without the Sampler optimizing the adjacency matrix.

Table 8 :
Ablation study on Citeseer with CovM