Algorithmic decision making methods for fair credit scoring

The effectiveness of machine learning in evaluating the creditworthiness of loan applicants has long been demonstrated. However, there is concern that the use of automated decision-making processes may result in unequal treatment of groups or individuals, potentially leading to discriminatory outcomes. This paper seeks to address this issue by evaluating the effectiveness of 12 leading bias mitigation methods across 5 different fairness metrics, as well as assessing their accuracy and potential profitability for financial institutions. Through our analysis, we have identified the challenges associated with achieving fairness while maintaining accuracy and profitability, and have highlighted both the most successful and least successful mitigation methods. Ultimately, our research serves to bridge the gap between experimental machine learning and its practical applications in the finance industry.


I. INTRODUCTION
Credit scoring applications play an important role in modern society, and the approval process of loans increasingly migrates from human decisions to complex algorithmic decisions. Agarwal et al. [1] discussed the important benefits of an automated decision process for financial institutions, such as growing the business while lowering costs, increasing approval rates without increasing credit risks, and providing a more seamless application process for the client.
This business reality raises questions regarding the supervision of such automated decisions. Blattner et al. [2] discussed how algorithmic governance faces a trade-off between complexity (read: performance) and oversight (read: capacity to audit). The interpretability of the models, particularly the credit scoring methods, has long been a source of concern for industry regulators (e.g., national banks, financial authorities, and governments). Their interest was related to the capability of a human to understand how the decision was taken in order to supervise, mitigate risks, and prevent the misconduct of financial institutions. An automatic decision, even if interpretable, may lead to different treatments for groups or individuals, defined by some specific attributes, eventually causing discrimination [3].
While the laws explicitly cover some discrimination factors (such as gender, race, and nationality), other unrestricted information may be used to discriminate against vulnerable groups (e.g., based on behavior and wealth). Moreover, information from external public sources such as social networks may act as a proxy, potentially revealing traits such as race or gender [4].
There is a certain concern (among researchers, but also at the government level) about the use of AI in decision making, and several reports [5], [6] indicate future regulatory guidelines for this field. Both interpretability and fairness are addressed, together with privacy (already regulated by the GDPR in the EU¹), technical robustness and accountability (with specific requirements in force through Basel III² and IFRS9³ regulations).
Our study addresses the problem of ensuring fairness in the automatic credit scoring process using machine learning algorithms. Since the research interest for ensuring fair processing in automatic decision-making started to grow, several processing methods and evaluation metrics were defined. As we will show, there are serious difficulties in implementing these fairness processors in a real-world context, causing a gap between lab development and industrialized implementation. We therefore benchmark state-of-the-art methods in this field by considering different fairness definitions from the literature. Our empirical findings were obtained by applying these methods to two datasets containing retail credit applications, one of which is a novel real-world dataset relating to the Romanian consumer loan market.
¹ The General Data Protection Regulation.
² An internationally agreed regulatory framework developed by the Basel Committee on Banking Supervision in response to the financial crisis of 2007-09.
³ Accounting regulations, requiring financial institutions to implement forward-looking estimates with respect to expected losses in their financial statements.
VOLUME 11, 2023. arXiv:2209.07912v3 [cs.LG] 22 Jun 2023
While the financial industry primarily focuses on financial risk assessments when evaluating loan applications, factors such as equal opportunity are either imposed by regulations or considered indirectly in the loan granting process. However, experiments in the literature have shown that discrimination is present in this type of process. Therefore, a framework for benchmarking different bias mitigation techniques could prove useful for systematically analyzing and addressing potential biases. This framework would allow individuals or groups who might have historically been underserved or disadvantaged to access credit on fair terms. Additionally, implementing a fair credit scoring framework would demonstrate a commitment to transparency, fairness, and ethical lending practices by financial institutions.
The main contributions of our work are as follows:
1) We help bridge the gap between lab development and industrialized machine learning by identifying the strengths and weaknesses of 12 bias mitigation methods with the help of a real-world data environment implementation, which is currently the most inclusive comparison of this type.
2) We show empirical evidence from a credit scoring case study indicating how bias mitigation methods perform from three different perspectives: the amelioration of fairness metrics, balanced accuracy of the models, and expected profit for financial institutions.
3) This study makes a new dataset in the field of credit scoring available to the public, allowing for the replication of the results and the development of novel experiments in this field where real data is scarce.
4) Finally, we review in depth the fairness metrics found in the machine-learning fairness literature, together with the most prominent fairness processors that have emerged in the last decade.
The remainder of this study is organized as follows. The literature review section describes the state of the art in the area of fairness in machine learning, with a special section dedicated to credit-scoring fairness developments. The methodology describes the general framework for applying fairness processors to data based on three routes found in the literature: pre-processing, in-processing, and post-processing. This section also describes the fairness metrics used in the experiments. Next, the study presents the experimental setup, including data pre-processing. The Results section presents and discusses the outcomes of the case study. The last section concludes the study and provides insights for future research.

II. RELATED WORK
A. FAIRNESS IN MACHINE LEARNING
Research interest in the fairness of machine learning has increased over the last decade, with several fairness measures arising in the literature. As Barocas et al. [7] note, different measures are based on various intuitions regarding fairness. The following are examples of the numerous attempts to define a comprehensive fairness metric: individual fairness, which advocates similar treatment of similar cases [8]; statistical parity, which considers an individual belonging to a protected group to bear the same risk score as all the other members [9]; accuracy fairness (or accuracy equity), which implies different treatments for different cases, assuming a perfect classifier is also fair [10]; threshold fairness, which considers that the same decision threshold should be applied for each group (protected or unprotected) [11]; and calibration, which conditions the estimates to have the same effectiveness for all individuals [12]. The majority of viewpoints concern equal treatment in general across different groups, as defined by protected attributes (also known as sensitive attributes), such as gender, nationality, and race (statistical parity), or, more specifically, equal treatment based on some constraints, such as predictive parity and threshold fairness.
Generally, the literature sources agree that not all fairness criteria can be used simultaneously to evaluate the fairness of an ML process [12]-[15], as some of them contradict others. As a result, while we review all of the criteria suggested by the literature in this field from a credit scoring standpoint, we do not expect to find a method that can satisfy all fairness criteria simultaneously.
Grouping the fairness metrics, three fairness criteria have emerged as standards for measuring group fairness: separation, independence, and sufficiency [7], [16].
The separation criterion, also known as equality of odds, requires that the false positive rate (FPR; e.g., FP: clients wrongly associated with a high risk of not repaying debt) and the false negative rate (FNR; e.g., FN: clients wrongly associated with a low risk of not repaying debt) be equal across subgroups (protected and unprotected), this being guaranteed for any cutoff used. Kozodoi et al. [17] suggested a method for measuring separation as the average absolute difference between the group-wise FP and FN rates (FPR and FNR):

$$SP = \frac{1}{2}\left|\left(FPR_{x_a=1} - FPR_{x_a=0}\right) + \left(FNR_{x_a=1} - FNR_{x_a=0}\right)\right| \tag{1}$$

where $x_a$ represents the protected attribute. In this way, a value of $SP \approx 0$ suggests perfect separation, while high values of SP suggest stronger discrimination, but without showing which group is privileged. From a slightly different perspective, Hardt et al. [9] initially proposed a measure where the difference between FP and FN rates is considered, also targeting a value close to 0 for a fair classification.
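As a rough illustration, the separation measure described above can be computed directly from hard predictions. The sketch below is not taken from [17]: it assumes that SP is half the absolute value of the summed group-wise FPR and FNR gaps, that label 1 denotes a high-risk (bad) client, and that the protected attribute is binary. The function names and toy data are hypothetical.

```python
def fpr_fnr(y_true, y_pred):
    """Return (FPR, FNR), with 1 = high risk (bad) and 0 = low risk (good)."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return false_pos / negatives, false_neg / positives


def separation(y_true, y_pred, protected):
    """Assumed form: SP = 0.5 * |(FPR_1 - FPR_0) + (FNR_1 - FNR_0)|."""
    per_group = {}
    for group in (0, 1):
        idx = [i for i, a in enumerate(protected) if a == group]
        per_group[group] = fpr_fnr([y_true[i] for i in idx],
                                   [y_pred[i] for i in idx])
    (fpr0, fnr0), (fpr1, fnr1) = per_group[0], per_group[1]
    return 0.5 * abs((fpr1 - fpr0) + (fnr1 - fnr0))


# Toy data: the first four clients belong to the protected group (x_a = 1).
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 0, 1, 1]
x_a = [1, 1, 1, 1, 0, 0, 0, 0]
sp = separation(y_true, y_pred, x_a)
```

Because the gaps are summed before the absolute value is taken, opposite-signed FPR and FNR gaps can cancel under this assumed form, which is one reason to inspect the two gaps separately as well.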
The independence criterion (also found in the literature as demographic parity and statistical parity) concerns the possible dependence of the model prediction on the protected attribute. In other words, credit scores should be distributed in the same way within each subgroup if the predictions are not affected by the protected attribute. Barocas et al. [7] measured the connection between the protected attribute and the score (mutual information) using entropy (H):

$$I(x_a; R) = H(x_a) + H(R) - H(x_a, R) \tag{2}$$

where R is the score. The independence criterion would be satisfied if $I(x_a; R) \approx 0$. However, this can be difficult to achieve in practice in the field of credit scoring because some of the protected attributes (e.g., race, gender) are linked to significant differences in wealth and income [14]. By strictly applying this criterion, the scorecard would induce a higher risk appraisal for some groups while reducing vigilance for other groups. This can lead to a significant distortion of reality due to an increase in FN and FP, which ultimately harms those who cannot truly afford the debt, denies access to loans to those who are under-appreciated and, on the sell side, makes the lending business impractical. Several studies have defined a relaxed measure of independence, attempting to make it usable [7], [17]. Instead of trying to reduce $I(x_a; R)$ to 0, the ratio could be considered acceptable, with an indication of 0.2 as a reference value [18].
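The entropy-based view of independence can be illustrated with a small computation. This is a minimal sketch, assuming the scores have already been discretized into bins (for example, score bands) and that mutual information is obtained as H(x_a) + H(R) - H(x_a, R), measured in bits. The helper names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(protected, score_bins):
    """I(x_a; R) = H(x_a) + H(R) - H(x_a, R) on discretized scores."""
    joint = list(zip(protected, score_bins))
    return entropy(protected) + entropy(score_bins) - entropy(joint)


x_a = [0, 0, 1, 1]
# Scores spread identically in both groups: no dependence on x_a.
independent = mutual_information(x_a, ["low", "high", "low", "high"])
# Scores fully determined by group membership: maximal dependence.
dependent = mutual_information(x_a, ["low", "low", "high", "high"])
```

In this toy setting the first call yields 0 bits and the second 1 bit, the two extremes for a balanced binary attribute.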
Sufficiency implies that the score already includes protected attributes when predicting the target variable [7]. This means that clients with the same credit score carry the same risk of not reimbursing the loan, regardless of whether they belong to a protected group. This interpretation is closely related to the notion of calibration by group, which is formally defined as:

$$P(Y = 1 \mid R = r, x_a = d) = r \tag{3}$$

for all scores r and groups d. For this approach to be valid, the score must be considered as a probability. Barocas et al. [7] note that, based on practical experiments, sufficiency is already satisfied to some degree owing to the inner mechanisms of machine learning, and trying to impose sufficiency when training a model will result in only trivial improvements.
Another trend in the AI fairness literature is the observation of individual fairness in data, which attempts to offset the limitations of the group fairness assessment methods discussed above. One problem with group fairness is the propagation of an error (e.g., from a small group), leading to discrimination in subgroups, even if group fairness can still be achieved [19]. Focusing on statistical fairness leads to the impossibility of achieving the desired protection for individuals. In other words, the fairness achieved by considering individual attributes will not translate into achieving fairness for the combination of two or more of those attributes. Speicher et al. [20] employed economic inequality indices to measure the outcomes of an algorithm towards individuals or groups. Their findings also suggest that eliminating the unfair treatment between groups may lead to an increase in unfairness within a group (at the individual level).
While the fairness of algorithmic decision-making has become a hot topic for debate and research, credit scoring is one of the most popular examples in the literature when referring to possible discrimination caused by machine learning methods.

B. CREDIT SCORING APPLICATION FAIRNESS
The research community studying modern machine learning techniques for credit scoring has recently turned their attention towards the implications of the distributional impact of algorithmic decisions on sensitive/vulnerable groups. While traditional methods for credit scoring (e.g., human application analysis) were outpaced by machine learning techniques in terms of profitability for financial institutions, evidence shows how the latter produces predictions with greater variance [21].
Zarsky [22] attempted to create an analytic framework for an orderly debate on the subject of algorithmic decision-making, focusing on credit scoring to substantiate the proposal. Two dimensions are considered the main triggers for concerns: the problems generated by machine learning reasoning and the attributes that exacerbate these problems. The two problems identified as debatable are the efficiency (or inefficiency) of automated decision-making and the fairness of the process. One attribute involved in the debate is the automation of the process, which can show solid results at an aggregated level but is still prone to errors because of inaccuracies in the training data or specific algorithm settings (also identified in [23] as a possible burden for the consumer). The alternative human approach is also subject to biased decision making. Transparency is considered a solution to improve automated processing, by providing the opportunity to improve or correct the data, but may also lead to increasing costs.
Kozodoi et al. [17] assess the feasibility of a fair credit scoring system in a profit-driven environment. They deal with the strong inverse proportionality between fairness and profitability and show how the possibility of reducing discrimination can be achieved at very reasonable costs for financial institutions. However, a completely fair scoring system appears to be a pipe dream, as profitability is stifled by overly strict conditions and an increasing risk of default.
Another question addressed by the fair credit scoring literature challenges the efficiency of imposing protection for specific groups. Liu et al. [24] attempted to quantify the impact of fairness-constrained classifications over time. They assume that the lender will always try to maximize the utility of the models (profit) irrespective of the fairness constraints. In other words, applying fairness criteria could have unwanted effects over time, both for a protected group (a defaulted client will worsen their creditworthiness) and for financial institutions (credit defaults considerably impact the profitability of the business). Therefore, the dynamic and temporal modeling of fairness criteria can improve the overall process. Creager et al. [25] propose a framework for causal modeling with the lending business as a foundation, demonstrating the utility of such an approach in simulating different scenarios in a profit-driven, but policy-constrained environment. The long-term vision of evaluating the implications of fairness on financial institutions and individuals seems to be a pursuit for researchers in an uncharted territory.
An auditor model is used to evaluate classification fairness, given some constraints, and taking into account the explicability factor [26]. However, even if the framework was tested on a credit risk dataset, the authors did not discuss the feasibility of the method in real-world terms, where the profitability of the lending business must be demonstrated along with fairness implementation.
The bias embedded in the data may go unobserved when using an exclusively mathematical approach. Lee and Floridi [27] argue for an approach in which the context-dependency of data is exploited. They compared different algorithms in terms of fairness capability, showing how some of the algorithms were not successful because of the association between the protected attributes and other features in the dataset (which are a proxy for the loan outcome). However, the study makes an unrealistic assumption by considering the lender's intention to maximize the value of the loans. In practice, the overall profitability of the lending business should be considered (see [28] for a profit-driven approach in credit scoring). In the same line of work, Kilbertus et al. [29] discuss the difficulties of achieving perfect fairness and maximizing profits in the context of potentially biased previous decisions. The credit scoring models are trained based on previous lending data, leading to sub-optimal performance. They proposed a stochastic decision rules system that attempts to improve decisions in terms of utility and fairness.
A group unfairness index was introduced by Szepannek and Lubke [30] to easily quantify and compare the fairness of the models. This index is based on group fairness by acceptance rate, which is a relevant fairness definition in the context of credit scoring.

III. METHODS
A pipeline with three different approaches can be depicted based on the literature on machine-learning fairness. The general consensus is to first measure the bias in algorithmic decisions by using one or several assessment metrics, and then to mitigate unfairness by employing a specific mitigation method. Finally, the results are compared in terms of the bias evaluation and performance metrics before and after the mitigation method is applied. Figure 1 shows the fairness pipeline used in this study. Next, we present the bias assessment metrics used to evaluate the fairness of the classification throughout our experiments, and the bias mitigation algorithms tested in our benchmarking environment.

A. BIAS ASSESSMENT METRICS
As shown in the literature review section, different criteria for evaluating the fairness of a classifier have recently been developed. Initially, they were built intuitively to address the unfairness of ML algorithms in classification problems. Subsequently, optimizations and new visions have emerged (like using the economic inequality indexes [20]), resulting in a set of measures generally accepted as a standard in the field.

1) Separation metrics
The average odds difference [9] is a measure of classification fairness that averages the difference between the FPR for protected and unprotected groups and the difference between the TPR for protected and unprotected groups. This is especially useful in the credit scoring context, since the TPR (clients classified as goods and indeed repaying their debt) and FPR (clients wrongly classified as goods, not repaying their debt) influence the profitability of the business model.

$$AOD = \frac{1}{2}\left[\left(FPR_{prot} - FPR_{unprot}\right) + \left(TPR_{prot} - TPR_{unprot}\right)\right] \tag{4}$$

A value close to 0 is obtained in the case of fair classification. For practical reasons, we consider a classification fair when the value lies inside the interval (-0.1, 0.1).
Equal opportunity difference is a relaxed version of the average odds difference, considering only the difference between the TPR for protected and unprotected groups:

$$EOD = TPR_{prot} - TPR_{unprot} \tag{5}$$

In the credit scoring context, this means that the classifier should correctly accept good clients at the same rate in both protected and unprotected groups. The requirement of equalized errors puts pressure on decision makers to improve the misclassification rates by optimizing models and increasing the quality of data [7]. The fairness interval considered for this metric is (-0.1, 0.1).
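The two separation metrics above can be sketched in a few lines. The example assumes that label 1 is the favorable outcome (a good client), that differences are taken as protected minus unprotected, and that the average odds difference is half the sum of the FPR and TPR gaps. The function names and toy data are hypothetical and independent of any fairness library.

```python
def rate_predicted_favorable(y_true, y_pred, actual):
    """Share of instances whose true label is `actual` that are predicted 1."""
    preds = [p for t, p in zip(y_true, y_pred) if t == actual]
    return sum(preds) / len(preds)


def tpr(y_true, y_pred):
    return rate_predicted_favorable(y_true, y_pred, 1)


def fpr(y_true, y_pred):
    return rate_predicted_favorable(y_true, y_pred, 0)


def average_odds_difference(prot, unprot):
    """`prot` and `unprot` are (y_true, y_pred) tuples, one per group."""
    fpr_gap = fpr(*prot) - fpr(*unprot)
    tpr_gap = tpr(*prot) - tpr(*unprot)
    return 0.5 * (fpr_gap + tpr_gap)


def equal_opportunity_difference(prot, unprot):
    return tpr(*prot) - tpr(*unprot)


def within_fairness_interval(value, low=-0.1, high=0.1):
    """Open interval (-0.1, 0.1) used as the practical fairness range."""
    return low < value < high


# Toy data: (true labels, predicted labels) per group.
protected = ([1, 1, 0, 0], [1, 0, 0, 0])
unprotected = ([1, 1, 0, 0], [1, 1, 1, 0])
aod = average_odds_difference(protected, unprotected)
eod = equal_opportunity_difference(protected, unprotected)
```

In this toy case both values equal -0.5, well outside the (-0.1, 0.1) interval, signalling that the protected group is treated less favorably.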

2) Independence metrics
Statistical parity difference measures the difference between the probabilities of acceptance in the protected and unprotected groups:

$$SPD = P(\hat{Y} = 1 \mid \text{protected}) - P(\hat{Y} = 1 \mid \text{unprotected}) \tag{6}$$

A value close to zero implies the same acceptance rate for both groups [8]. The fairness range for this metric is considered to be the interval (-0.1, 0.1).
Disparate impact starts from the idea of independence and calculates the ratio between the probability of acceptance for unprivileged and privileged groups:

$$DI = \frac{P(\hat{Y} = 1 \mid \text{unprivileged})}{P(\hat{Y} = 1 \mid \text{privileged})} \tag{7}$$

A value close to 1 implies an ideal degree of fairness, while values lower than 1 indicate an advantage for the privileged group and values higher than 1 indicate an advantage for the unprivileged group [18]. In a more flexible approach, the interval (0.8, 1.25) was considered acceptable for a classifier to be considered fair.
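Both independence metrics depend only on the predicted labels of each group, which makes them easy to illustrate. The following is a minimal sketch, assuming label 1 means acceptance and treating the protected group as the unprivileged one. The function names and toy predictions are hypothetical.

```python
def acceptance_rate(y_pred):
    """Share of applicants predicted as accepted (label 1)."""
    return sum(y_pred) / len(y_pred)


def statistical_parity_difference(pred_unpriv, pred_priv):
    return acceptance_rate(pred_unpriv) - acceptance_rate(pred_priv)


def disparate_impact(pred_unpriv, pred_priv):
    return acceptance_rate(pred_unpriv) / acceptance_rate(pred_priv)


def is_fair(spd, di):
    """Apply the (-0.1, 0.1) range to SPD and the (0.8, 1.25) range to DI."""
    return -0.1 < spd < 0.1 and 0.8 < di < 1.25


pred_unprivileged = [1, 0, 0, 0]   # 25% accepted
pred_privileged = [1, 1, 1, 0]     # 75% accepted
spd = statistical_parity_difference(pred_unprivileged, pred_privileged)
di = disparate_impact(pred_unprivileged, pred_privileged)
```

Here the difference is -0.5 and the ratio is one third, so both metrics flag the toy classifier as unfair toward the unprivileged group.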

3) Individual fairness metric
The Theil index belongs to the category of metrics measuring inequality in economics and is a special case of the generalized entropy index (α=1). This allows us, in the context of fairness, to observe inequalities both at the group level (between groups) and at the individual level (within group) [20].

$$T = \frac{1}{n}\sum_{i=1}^{n}\frac{b_i}{\mu}\ln\frac{b_i}{\mu}, \qquad b_i = \hat{y}_i - y_i + 1 \tag{8}$$

In formula 8, n represents the number of instances in the dataset; $b_i$ is the benefit of individual i; µ is the mean value of the benefits within a group; and ŷ and y are the individual predicted and true outcomes, respectively. A value close to 0 is used as a proxy for fair learning.
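To make the Theil index concrete, the sketch below computes it over individual benefits. It assumes the benefit definition b_i = ŷ_i - y_i + 1 from the generalized entropy index literature [20], takes µ as the mean benefit over all instances passed in, and uses the convention that a zero benefit contributes nothing to the sum. The function name and toy data are hypothetical.

```python
import math


def theil_index(y_true, y_pred):
    """Theil index (generalized entropy index with alpha = 1) of benefits.

    Assumes at least one instance has a positive benefit, so the mean
    benefit is non-zero.
    """
    benefits = [p - t + 1 for t, p in zip(y_true, y_pred)]
    mu = sum(benefits) / len(benefits)
    total = 0.0
    for b in benefits:
        if b > 0:  # x * ln(x) tends to 0 as x tends to 0
            total += (b / mu) * math.log(b / mu)
    return total / len(benefits)


# One false negative (benefit 0) and one false positive (benefit 2).
unequal = theil_index([1, 1, 0, 0], [1, 0, 1, 0])
# Perfect predictions give every individual a benefit of 1.
equal = theil_index([1, 1, 0, 0], [1, 1, 0, 0])
```

With perfect predictions all benefits are identical and the index is 0; the mixed case yields ln(2)/2, reflecting the unequal distribution of benefits among individuals.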

B. BIAS MITIGATION METHODS
Bias mitigation methods are classified into three categories based on their position in a fair AI process (see Figure 1). The pre-processing methods address fairness issues before applying the classification method, and are therefore independent of the classification algorithm. In-processing methods ensure fairness within the classification method itself. The post-processing bias mitigation methods are placed at the end of the fair processing pipeline. They make adjustments after the model is trained, considering the protected attribute restrictions. Post-processing methods, like pre-processing methods, are independent of the classification algorithm used, permitting a classifier-agnostic approach to mitigate bias. The methods presented below were chosen from among those presented in highly cited studies on this topic.

1) Pre-processing mitigation methods
Reweighing is an early stage technique in fair AI areas to mitigate bias. It is centered on the idea of reweighing data without relabeling to remove discrimination [31], [32]. The algorithm attempts to achieve fairness by assigning lower weights to favored attributes.
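The weights are assigned as the ratio between the expected and observed probabilities to see an instance with its protected attribute in a class:

$$W(s, c) = \frac{P_{exp}(S = s \wedge Class = c)}{P_{obs}(S = s \wedge Class = c)} \tag{9}$$

where X is the entire dataset over which the probabilities are estimated, S is a binary variable indicating whether an individual is a member of a protected group, and s and c denote the protected-attribute value and the class of the instance being weighted. Thus, the resulting dataset carries a fair representation of protected instances.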
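The weighting rule can be illustrated as follows. This sketch assumes that the expected probability is the product of the marginal probabilities of the group and of the class (what would be observed if the two were independent), and that the observed probability is their empirical joint frequency. The function name and toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(protected, labels):
    """One weight per instance: P(S=s) * P(Class=c) / P(S=s, Class=c)."""
    n = len(labels)
    count_s = Counter(protected)
    count_c = Counter(labels)
    count_sc = Counter(zip(protected, labels))
    return [
        (count_s[s] * count_c[c]) / (n * count_sc[(s, c)])
        for s, c in zip(protected, labels)
    ]


def weighted_favorable_rate(weights, protected, labels, group):
    """Weighted share of favorable labels (1) inside one group."""
    in_group = [(w, y) for w, s, y in zip(weights, protected, labels) if s == group]
    return sum(w for w, y in in_group if y == 1) / sum(w for w, _ in in_group)


S = [1, 1, 0, 0, 0, 0]          # 1 = protected group
labels = [1, 0, 1, 1, 1, 0]     # 1 = favorable class
weights = reweighing_weights(S, labels)
rate_protected = weighted_favorable_rate(weights, S, labels, 1)
rate_unprotected = weighted_favorable_rate(weights, S, labels, 0)
```

After weighting, both groups show the same weighted favorable rate (two thirds in this toy case), which is the sense in which the reweighted data no longer associates the class with group membership.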
Learning fair representations is a pre-processing technique that encodes data while obfuscating information about protected groups [33]. The method acts as a clustering model, building prototypes based on the requirement that one element from a protected group is mapped to a certain prototype with the same probability as an element from an unprotected group (using the statistical parity criterion described in subsection Bias assessment metrics). Formally, this is represented by

$$P(Z = k \mid x^{+} \in X^{+}) = P(Z = k \mid x^{-} \in X^{-}), \quad \forall k \tag{10}$$

where Z is a random attribute with k classes, each representing a prototype in the context of the clustering model; $x^{+}$ represents an individual from a protected group ($X^{+}$); and $x^{-}$ represents an individual from an unprotected group ($X^{-}$).
Another method, developed by Feldman et al. [18], is the Disparate Impact Remover. The algorithm detects the disparate impact (described in subsection Bias assessment metrics) and attempts to repair the data to achieve fairness. Repair is performed to preserve the predictability of the target variable and to preserve the relative per-attribute ordering (ranking). The synthetic dataset maintains the original values for the protected attribute(s) and target variables. A more relaxed version of the algorithm considers a trade-off between fairness and utility (accuracy) and performs partial repair. Depending on the problem addressed and the classifier performance, a compromise can be achieved.

2) In-processing mitigation methods
By adding adversarial learning to the predictor, Zhang et al. [34] introduced a framework in which the target is to increase the chances of the predictor for classification, while minimizing the adversary's chances of predicting the protected feature. This method uses equality of odds as a fairness metric.
The term gerrymandering in the context of AI fairness questions the ability of a fairness constraint applied to a certain protected group to ensure fairness at the individual level. This implies that an apparently fair classification at the group level does not necessarily imply fairness for all individuals. The Gerryfair method [19] consists of two algorithms, a learner and an auditor, which operate through a cost-sensitive classification oracle using linear methods. The fairness metrics used are false-positive rates, false-negative rates, and statistical parity.
Another cost-sensitive method for mitigating bias is the Exponentiated gradient reduction [35]. The algorithm consists of a sequence of two reductions aimed at yielding a classifier with the lowest error in the context of the defined constraints. The metrics considered for fairness optimization were statistical parity and equalized odds. This method was designed only for binary classifications.
Building on the work of Agarwal et al. [35], the Grid search reduction was developed to predict continuous outcomes instead of discrete classification [36]. This direction was chosen because of the practical need for numerical prediction (e.g., quantifying the risk of default in the credit scoring setting). The fairness metrics used by the method are the statistical parity and bounded group loss, which is essentially the control of the prediction error for the protected group. A separate optimization algorithm was built for each metric.
One approach that attempts to satisfy several fairness metrics is the so-called Meta classifier [37]. The central concept of this method is to develop an algorithm for a large family of classification problems. The current implementation supports only two metrics: the false discovery rate and disparate impact. A very practical feature is the possibility of varying the constraint (τ ) to achieve a reasonable trade-off between fairness and accuracy.
The prejudice remover is one of the methods developed in the early stages of AI fairness evolution, proposing a regularization approach applied to logistic regression [38]. It defines prejudice as the statistical dependence between protected attributes and other information. Further, it regularizes learners' behavior regarding sensitive attributes by enforcing independence. The regularization parameter can be adjusted based on the accuracy-fairness trade-off.

3) Post-processing mitigation methods
One of the first methods developed for fairness post-processing was the reject option classification [39]. This method uses posterior probabilities from a classifier to label instances to neutralize discrimination. Rejected instances situated in a so-called critical region are assigned special labels depending on the protected group membership. These are considered to be easily influenced by bias. The method relabels the instances using two cost-sensitive matrices for deprived and favored groups by optimizing loss functions (L) for the two categories, according to a discrimination-accuracy trade-off coefficient θ (Eq. 11).
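The relabeling idea can be illustrated with a simplified threshold rule. The sketch below does not implement the cost-sensitive loss optimization mentioned above. It assumes the commonly described form of reject option classification, in which an instance falls in the critical region when max(p, 1 - p) ≤ θ, deprived-group instances in that region receive the favorable label, and favored-group instances receive the unfavorable one. The function name, the value of θ, and the toy data are hypothetical.

```python
def reject_option_labels(prob_favorable, deprived, theta=0.6):
    """Relabel uncertain predictions according to group membership.

    prob_favorable: posterior probability of the favorable class (label 1)
    deprived:       1 if the instance belongs to the deprived group, else 0
    theta:          width of the critical region, with 0.5 < theta < 1
    """
    labels = []
    for p, is_deprived in zip(prob_favorable, deprived):
        if max(p, 1 - p) <= theta:
            # Critical region: the decision is close to the boundary.
            labels.append(1 if is_deprived else 0)
        else:
            # Confident prediction: keep the standard decision rule.
            labels.append(1 if p >= 0.5 else 0)
    return labels


probs = [0.9, 0.55, 0.45, 0.55, 0.1]
group = [0, 1, 1, 0, 1]
new_labels = reject_option_labels(probs, group)
```

Only the three instances with probabilities near 0.5 are affected; the confident predictions (0.9 and 0.1) keep their original labels regardless of group, which is how a larger θ trades accuracy for reduced discrimination.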
Calibrated equalized odds post-processing is another method that changes the output labels after classification to preserve fairness. Introduced by Pleiss et al. [40], the method builds upon the work of Hardt et al. [9], who introduced the equalized odds fairness measure. Calibration is added to the method of mitigating bias by providing the possibility of ensuring fairness for both protected and unprotected groups without leaving the option of incentivizing the algorithm when considering the sensitive feature. The method gives the practitioner the freedom to choose the level of fairness constraint, an adjustment needed when the classification accuracy suffers after calibration.

IV. EXPERIMENTS
To benchmark the different mitigation methods described above and evaluate them according to classification performance indicators and bias assessment metrics, we used two datasets. The first is the well-known German credit dataset available from the UCI Machine Learning Repository⁴ [41]. The other dataset contains information on customers applying for personal loans, obtained from a Romanian bank⁵. We performed our experiments according to the pipeline described in Figure 1 using the framework of AI Fairness 360 [42].

A. DATA
We used the German credit dataset, which is one of the most popular datasets used for benchmarking in the field and has also been included in other research on fair AI, such as the works of [17], [30], [43]. Real-world datasets in the field of credit scoring are rather rare and imply business-specific challenges (such as class imbalance and profit ratios). We therefore considered it useful to bring to the attention of the community a novel dataset containing consumer loan data. A summary of the statistics for both datasets is provided in Table 1.
Both datasets consist of samples of loan applications, with the target attribute being the outcome of the loan (good or bad). In the industry standard, clients repaying their loans are called goods, while those not reimbursing the debt are called bads.
The independent attributes in the datasets can be divided into three categories: sociodemographic attributes, including information about age, education, and profession; customer history, providing information regarding the relationship between the customer and the bank (e.g., other products owned, previous loans); and economic information, such as the client's income or loan amount.
A detailed view of the attributes of the consumer loan dataset is provided in Table 6 in the Appendix.
We considered the age of the applicant as a protected attribute, as suggested in other studies [17], [43], [44]. The threshold at the age of 25 years was considered to differentiate between the vulnerable group (under 25) and invulnerable group (25 and over) [45]. Moreover, when testing the importance of variables in relation to the target variable (Default Flag) for the consumer loan dataset, age had the highest score, −Log(p) = 73.736. The weight of evidence for this score suggests a split in the data at age 23. However, the threshold was maintained at 25 for reasons of comparability with other studies. Another possible vulnerable attribute, marital status, followed age in terms of importance, with a score of 56.937.
Even if other works [3] suggest the use of gender as a protected attribute, we consider it inappropriate, since the use of attributes such as gender and race is explicitly prohibited by laws worldwide, ensuring fairness through unawareness.

B. EXPERIMENT DESIGN
We conducted our experiments starting with data preprocessing. As expected, the workload for cleaning the data was high in the case of the consumer loan dataset, whereas the German credit dataset was already curated.

Curating the consumer loans dataset
We started by dropping irrelevant attributes such as ID, birthplace, and profession, the latter having too many distinct classes, which made the attribute more noisy than useful.
For categorical attributes with missing values, we added a missing class. The behavior of attributes with missing values can be interesting to observe when the missing values have a certain significance. For example, a missing value for an attribute such as workplace seniority can be explained if the loan applicant is already retired. Other missing information can be related to data collection issues. The pre-processing of categorical attributes was finalized with one-hot encoding for the transformation into numerical values, a condition for being able to run all algorithms in the benchmarking phase.
Missing data for the numerical attributes was imputed with the median value for each attribute. This method was used because the distributions were skewed.
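The curation steps described above can be illustrated with a minimal sketch. It uses only the Python standard library, and the column names and values are hypothetical toy data, not taken from the consumer loans dataset; the helper names (add_missing_class, one_hot, impute_median) are ours.

```python
from statistics import median

def add_missing_class(values, label="missing"):
    """Replace None in a categorical column with an explicit 'missing' class."""
    return [label if v is None else v for v in values]

def one_hot(values):
    """One-hot encode a categorical column; returns (categories, encoded rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def impute_median(values):
    """Fill None in a numerical column with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns (illustrative only).
seniority = ["<1y", None, "5y+", "<1y"]
income = [1200.0, None, 900.0, 15000.0]

categories, encoded = one_hot(add_missing_class(seniority))
income_filled = impute_median(income)

print(categories)
print(encoded)
print(income_filled)
```

In this toy example the median is unaffected by the single very large income, which is the reason a median is usually preferred over a mean for skewed distributions.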

Experiments setup
Next, for the consumer loans, the data was partitioned randomly into a training set (70%), validation set (15%), and test set (15%), considering a stratification that assigns the instances to each set based on the target variable distribution. Because the German credit data is rather small (1000 instances) we partitioned the data into training (70%) and testing (30%).
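A stratified 70/15/15 partition of the kind described above can be sketched as follows. This is an illustrative standard-library implementation, not the code used in the experiments; the function name stratified_split, the seed, and the toy label vector are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return one list of indices per fraction, preserving the label distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)

    parts = [[] for _ in fractions]
    for indices in by_label.values():
        rng.shuffle(indices)
        n = len(indices)
        start = 0
        cumulative = 0.0
        for k, frac in enumerate(fractions):
            cumulative += frac
            # The last part takes the remainder so that no index is lost.
            end = n if k == len(fractions) - 1 else round(cumulative * n)
            parts[k].extend(indices[start:end])
            start = end
    return parts

# Hypothetical imbalanced target: 940 instances of one class, 60 of the other.
labels = [0] * 940 + [1] * 60
train, valid, test = stratified_split(labels)

for name, part in (("train", train), ("valid", valid), ("test", test)):
    rate = sum(labels[i] for i in part) / len(part)
    print(name, len(part), round(rate, 3))
```

Because each class is split separately, the minority-class rate is the same in all three partitions of this toy example, which a purely random split would not guarantee.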
After splitting the data, we verified how the split affected the difference in the mean outcomes (statistical parity) between the protected and unprotected groups. A large difference would indicate significantly different conditions between the training and testing environments. Table 2 lists the values for the two datasets. The protected group was defined as the applicants whose age attribute is less than 25. The favorable label for the target variable was set to represent loan applicants' good behavior. Note that the real-world consumer loans dataset encodes the target variable as Default Flag, meaning that the favorable label is, in this case, 0.
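The check described above can be sketched as a difference in favorable-outcome rates between the two age groups. The sign convention (protected minus unprotected) and the toy data are assumptions made for illustration; the age threshold of 25 and the favorable label 0 follow the text.

```python
def statistical_parity_difference(ages, labels, favorable=0, age_threshold=25):
    """Favorable-outcome rate of the protected group minus that of the unprotected group.

    Assumed convention: protected = age below the threshold; a value of 0 means parity.
    """
    protected = [y for a, y in zip(ages, labels) if a < age_threshold]
    unprotected = [y for a, y in zip(ages, labels) if a >= age_threshold]

    def favorable_rate(ys):
        return sum(1 for y in ys if y == favorable) / len(ys)

    return favorable_rate(protected) - favorable_rate(unprotected)

# Hypothetical toy split: label 0 = loan repaid (favorable), 1 = default.
ages = [22, 23, 24, 21, 30, 40, 50, 35]
labels = [0, 1, 1, 0, 0, 0, 0, 1]

spd = statistical_parity_difference(ages, labels)
print(spd)
```

Computing this value separately on the training and test partitions gives the comparison the paragraph refers to; similar values suggest that both partitions present comparable conditions.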
Next, we developed experiments in accordance with the pipeline shown in Figure 1. For each bias mitigation method, we chose the corresponding path according to the class it belonged to (pre-processing, in-processing, post-processing).
For the pre-processing and post-processing methods, we used logistic regression as the non-sensitive classifier to be trained. In the case of in-processing methods, we used the same classifiers to initially test the fairness as those used by the methods themselves. This is one of the disadvantages of in-processing debiasing methods. As mentioned in the Methodology section, this type of bias mitigation algorithm does not usually allow the user to select different classifiers for training because the methods are built around certain classifiers, which does not permit the same independence as the pre-processing and post-processing methods. As we will see in the results section, the outcomes may be quite different among the three categories.
To evaluate the performance of the different bias mitigation methods, we used fairness-specific metrics, a general classification metric, and a profit metric (for a synthetic view, see Table 3). The fairness-specific metrics are described in more detail in the Methodology section. The general classification metric employed was balanced accuracy, which represents the average of sensitivity and specificity (see eq. 12). This metric is particularly useful for imbalanced datasets. Although most of the studies introducing fairness processing methods used a simple accuracy rate, we considered it insufficient in our context.
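The difference between simple and balanced accuracy on imbalanced data can be shown with a short sketch. The label vectors are hypothetical and only mimic a low default rate; the function names are ours.

```python
def accuracy(y_true, y_pred):
    """Share of instances classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred, positive=1):
    """Average of sensitivity (TPR) and specificity (TNR) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical data: 6 defaults (1) among 100 loans; the classifier predicts
# "no default" for every applicant.
y_true = [1] * 6 + [0] * 94
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
print(acc, bacc)
```

A classifier that never predicts the minority class still reaches a high simple accuracy here, whereas its balanced accuracy equals that of random guessing. For binary labels the balanced accuracy does not depend on which class is designated as positive.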
For profit calculations, we used the ROI measure [46] to estimate the outcome of the correct classification of a good client. The value used in the experiments was ROI=0.34, and the loss was accounted for at a rate of 0.9. We determined these values by considering the specifics of the Romanian market during on-site discussions with the bank representatives. The ROI estimation follows the same process as reported by [28] by observing the behavior of clients fully reimbursing their loans. Some of the loans are reimbursed earlier than initially scheduled, causing a lower than anticipated ROI. Therefore, the ROI was computed by multiplying the interest rate by the loan term and adjusting it by an early repayment coefficient (ERC) (eq. 13).
The loss incurred by loans that are not repaid (false positives in the context of our experimental setup) does not necessarily mean a total loss, as the clients made some payments before default. Note that most application scorecards are built considering the probability of default during a one-year time horizon after failing to repay for 90 consecutive days. The loss coefficient (LC) was set at 0.9. The profit of the model was computed by adjusting the TPR with the ROI and subtracting the losses caused by FP misclassification.
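The profit calculation can be illustrated with a hedged sketch. The ROI and LC values are those reported in the text; the exact functional form (counts of true and false positives, normalized by the number of applicants and expressed per unit of principal) is our assumption, as are the toy counts. Here a positive prediction means that the loan is approved.

```python
ROI = 0.34  # return on an approved loan that is repaid (value reported in the text)
LC = 0.9    # loss coefficient on an approved loan that defaults (value reported in the text)

def profit_per_applicant(tp, fp, n_applicants, roi=ROI, lc=LC):
    """Assumed form: gains on approved good clients minus losses on approved bad clients.

    tp: good clients that were approved (true positives)
    fp: bad clients that were approved (false positives)
    The result is expressed per applicant and per unit of principal.
    """
    return (tp * roi - fp * lc) / n_applicants

# Hypothetical confusion counts for two classifiers scored on 100 applicants.
baseline = profit_per_applicant(tp=80, fp=5, n_applicants=100)
debiased = profit_per_applicant(tp=70, fp=4, n_applicants=100)
print(round(baseline, 3), round(debiased, 3))
```

With LC close to 1 and ROI around one third, one approved defaulter offsets the gain from nearly three repaid loans, which is why a small change in false positives can outweigh a larger change in true positives.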
Because the German credit dataset does not have any information regarding the ROI, we used the same values for calculating profits for both datasets, a practice also observed in the work of Kozodoi et al. [17]. Table 3 summarizes the metrics employed in evaluating the fairness, classification accuracy, and profit achieved by each method. The results provided represent the 10-fold cross-validation average for each measure.

V. RESULTS AND DISCUSSION
This section presents the results obtained after running our experiments on the two datasets; the best values for each criterion are underlined (see Tables 4 and 5).
By analyzing the results, we note an overall loss of accuracy and profit when bias is mitigated, for all fairness processors.
In several cases, our findings contradict our expectation (based on the literature review) that all fairness constraints cannot be satisfied at the same time. In the case of the consumer loan dataset, methods such as Learning Fair Representations, Disparate Impact Remover, and Exponentiated Gradient Reduction managed to achieve fairness for all five metrics. For the German credit data, Grid Search Reduction achieved comparable performance. Two methods failed to achieve consistent results for each tested dataset. As will be shown later, this may have been caused by bias in the methods.
The mixed results obtained after applying the methods to the two datasets show a relevant connection between the method and data quality. For example, only one method managed to reduce the Theil index significantly in the case of the German credit dataset, which could be a consequence of the small amount of data (1000 instances). The consumer loan dataset was very imbalanced (5.7% defaulted loans), causing classification accuracy problems. In this context, balanced accuracy is particularly important for differentiating classification performance. Figures 2 and 3 in Appendix B provide a visual representation of the biased and de-biased values across the multiple mitigation methods. Each sub-chart displays the relationship between the balanced accuracy and a specific fairness measure, highlighting the impact of different mitigation methods on model fairness and accuracy. The effectiveness of the mitigation methods in reducing bias can be assessed by observing the grey areas, which represent the desired range for fairness, that is, the ideal performance zone.
Next, we review the performance of each fairness processor and discuss the difficulties and special processing required during the experiments.
The pre-processing method of Reweighing the examples in each group clearly performs well for most of the bias indicators. The classification performance is not significantly affected, which is an advantage over other more sophisticated methods, but a Theil index value higher than the recommended one might suggest unfairness at the individual level. In the case of the German dataset, although the algorithm clearly improves the fairness metrics, the results are very volatile owing to the small sample size.
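For illustration, a commonly used formulation of reweighing assigns each instance the ratio between the expected and the observed frequency of its (group, label) combination, so that group membership and label become independent in the weighted data. The sketch below assumes this formulation; the text does not state the exact variant used in the experiments, and the toy data are hypothetical.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight of an instance = P(group) * P(label) / P(group, label)."""
    n = len(labels)
    n_group = Counter(groups)
    n_label = Counter(labels)
    n_joint = Counter(zip(groups, labels))
    return [
        (n_group[g] * n_label[y]) / (n * n_joint[(g, y)])
        for g, y in zip(groups, labels)
    ]

def weighted_favorable_rate(groups, labels, weights, group, favorable=1):
    """Weighted share of favorable labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    fav = sum(w for g, y, w in zip(groups, labels, weights)
              if g == group and y == favorable)
    return fav / total

# Hypothetical toy data: 1 = favorable outcome.
groups = ["young"] * 4 + ["old"] * 6
labels = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0]

weights = reweighing_weights(groups, labels)
rate_young = weighted_favorable_rate(groups, labels, weights, "young")
rate_old = weighted_favorable_rate(groups, labels, weights, "old")
print(weights[0], rate_young, rate_old)
```

In the unweighted toy data the favorable rate is 0.25 for the young group and about 0.67 for the old group; after reweighing, both weighted rates equal the overall rate, while no label or feature is modified. This may help explain why classification performance is only mildly affected.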
While the Learning Fair Representations method might seem difficult to set up because of the various values and combinations of the parameters (two fairness parameters, classification threshold, and loss tolerance), the results were promising for both datasets. Because this is a pre-processing bias mitigation method, the operator has the freedom to choose more powerful classifiers.
When applying the Disparate Impact Remover, the algorithm optimizes the value of the disparate impact but with a high cost for accuracy and profits, leaving no room for compromise between fairness and accuracy or profits.
The Optimized Pre-processing algorithm was found to be extremely slow when dealing with a large number of attributes. We had to reduce the number of attributes by selecting only the top 5 attributes by information value, along with the protected attribute and the target attribute, to create a testing environment similar to that described by Calmon et al. [47]. Some of the fairness metrics remained at unwanted levels, whereas accuracy decreased significantly.
Adversarial Debiasing is implemented using TensorFlow for classification, employing a predictor and an adversary model to achieve fairness. The method achieves fairness on almost all metrics, with a loss in accuracy. The slightly better results compared to the other methods might be due to the use of a TensorFlow-based classifier instead of logistic regression.
The GerryFair algorithm is an in-processing method for mitigating bias in datasets that considers the individual level of unfairness. The dataset needs to be specially pre-processed in order to fit its requirements, in a manner similar to the Optimized Pre-processing method. It also has several hyperparameters to be tuned before it can perform suitably on a given dataset. These parameters include the fairness target (γ), the maximum number of iterations, the maximum L1-norm to be used for the dual variables, and the learner to be used. The learner must be a regressor. The logistic regression used as a classifier in our experiments was not supported by this method; therefore, we tried the learners recommended by Kearns et al. [19]. We found it difficult to tune the parameters in order to obtain a reasonable trade-off between fairness and balanced accuracy. This method tends to overfit when dealing with the German dataset because of its small size.
When using Meta Fair Classifier, the user defines the importance of the fairness metric (false discovery ratio or disparate impact) as one of the inputs for the algorithm. The classifier showed promising results, but was not sufficiently stable. When running the algorithm multiple times on both datasets, the stability issue makes the option of averaging the results unusable, as sometimes fairness constraints are not achieved. The stabilization problem was also discussed by [48] and [49] in their work.
When testing the Exponentiated Gradient Reduction algorithm, several fairness constraints were considered as parameters (Equalized Odds, True Positive Rate Difference, Demographic Parity, and Error Rate Ratio). Among these, the True Positive Rate Difference was associated with the best balanced accuracy and profit, while the fairness metrics achieved the targeted values for each constraint used in the case of the consumer loan dataset. However, for the German credit dataset, the best results considering the constraints were obtained when Demographic Parity was used as a parameter. This shows how tied the results are to the data quality, and stresses the importance of experimenting when adopting a solution.
Grid Search Reduction allows the user (along with the fairness constraint) to define the constraint weight to achieve a reasonable compromise between accuracy and fairness. This proves to be a convenient feature for an in-processing algorithm, allowing the practitioner to experiment and decide on the most convenient solution. However, the transition from unfair to fair classification does not seem to be smooth enough to have many options for a compromise between accuracy and fairness. The results showed a significant penalty in accuracy when fairness was achieved.
The last tested in-processing method, Prejudice Remover, either failed to achieve fairness or overfitted on both datasets. None of the variations in the fairness parameter of the algorithm improved the results. A possible cause was reported by Kamishima et al. [50] to be in the design of the method, which is based on a hypothetical distribution.
The post-processing method Reject Option Classification showed some of the best results in terms of balanced accuracy and profit in a fairness constrained context. The fairness constraints allowed include statistical parity difference, average odds difference, and equal opportunity difference. The relatively high values of the Theil index indicate that unfairness persists at the individual level.
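To make the individual-level reading of the Theil index concrete, the sketch below assumes the benefit-based definition commonly used in fairness toolkits, in which each individual receives the benefit b_i = ŷ_i − y_i + 1 and the index is the generalized entropy of the benefits with α = 1. This definition is an assumption here; the one used in the study is given in the Methodology section. Labels are coded as 1 for the favorable outcome.

```python
import math

def theil_index(y_true, y_pred):
    """Theil index of the benefits b_i = y_pred_i - y_true_i + 1 (assumed definition).

    Returns 0 when every individual receives the same benefit.
    Undefined (division by zero) if all benefits are 0.
    """
    benefits = [p - t + 1 for t, p in zip(y_true, y_pred)]
    mean_benefit = sum(benefits) / len(benefits)
    total = 0.0
    for b in benefits:
        ratio = b / mean_benefit
        if ratio > 0:  # the term tends to 0 as the ratio tends to 0
            total += ratio * math.log(ratio)
    return total / len(benefits)

# Hypothetical examples.
ti_perfect = theil_index([1, 0, 1], [1, 0, 1])  # identical benefits for everyone
ti_unequal = theil_index([1, 0], [0, 1])        # one individual loses, one gains
print(ti_perfect, ti_unequal)
```

Under this definition the index depends only on how prediction errors are distributed across individuals, not on group membership, so a method can satisfy the group-level constraints while still leaving a relatively high Theil index.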
The Calibrated Odds-Equalizing post-processing algorithm struggled to mitigate bias in the consumer loan dataset. It was tested with full and reduced data (feature selection applied); however, the fairness metrics did not show significant improvements over the unrestricted setup. At the same time, the balanced accuracy was reduced after post-processing, in the case of consumer loans, from 0.755 to approximately 0.56, depending on the parameters applied to the algorithm (e.g., cost constraints can be chosen from the FNR, TNR, or weighted).

VI. CONCLUSION
This study adds to the literature on fair AI decision-making by benchmarking 12 bias mitigation methods against five fairness metrics and evaluating them in the context of credit scoring data, both in terms of balanced accuracy and profits. We based our experiments on two datasets, the classical German credit dataset and a novel consumer loans dataset from a Romanian bank, to show the challenges in implementing these methods in a real-world setup.
Almost every bias mitigation method benchmarked in our study managed to increase overall fairness in a potentially automated decision context. Considering the 5 fairness metrics covering (virtually) the entire spectrum of fairness definitions, we identified the strengths and weaknesses of each method. None of the fairness processors can be considered a leader in this area or a universal panacea for treating unfairness while simultaneously satisfying the accuracy and profit criteria. For this reason, a practitioner willing to mitigate bias in their decisions should employ a group of methods and choose the most convenient one at a later stage, based on cost limitations.
Abbreviations: DI = disparate impact; SP = statistical parity; AOD = average odds difference; EOD = equal opportunity difference; TI = Theil index; BAcc = balanced accuracy; P = profit. * unstable results
However, contrary to our expectation, raised by the literature review, that a fairness processor cannot satisfy all the fairness criteria, we found several methods that were able to reduce unfairness on every metric, although they incurred significant costs.
When tested on a real-world consumer loan dataset, some of the bias mitigation methods underperformed in accuracy compared to the results reported by the studies introducing them. We also noted that the vast majority of these studies considered simple accuracy as the performance criterion, which in our case was unusable because of the highly class-imbalanced data. This shows the difficulty of implementing fairness constraints in a real-world environment with highly imbalanced data (the typical scenario for credit risk analysis), where the loss in accuracy can translate to severely increased costs or diminished profits.
Our methodology describes the differences in the ways input data needs to be prepared before being fed to the different methods: different encodings of the sensitive attributes, the necessity to transform categorical features through one-hot encoding, or the importance of reducing the number of features in the dataset because of convergence time. One of the limitations of our work is that it is observational and limited to case studies, and therefore cannot provide answers regarding the causes that may generate fairness gaps. This could be the object of a qualitative (rather than quantitative) study of the lending business, which looks outside the statistical definitions of fairness.
The accuracy results provided in this study can be optimized by employing different classifiers in a case-by-case manner. However, this optimization was not included in our scope, with the focus being on the bias processors and fairness metrics. We used logistic regression to ensure that the methods were truly comparable, as most in-processing models use this algorithm for classification, which is also widely known as the industry standard [51], [52]. Wherever possible, after choosing one or several bias mitigation methods, a practitioner should test several classifiers to achieve the best results.
In the wake of cost-sensitive classification methods [53] aiming to improve the effectiveness of different processes, among which is the lending business, the addition of fairness constraints to a cost-sensitive framework for credit scoring could be a possible solution to the still open challenge of managing the trade-off between profit, risk and fairness.
Because some bias mitigation methods are computationally intensive, one may consider benchmarking them by calculating the associated energy and carbon footprint, which is becoming of interest in the field of data science [54]. In conjunction with the other results, this may be used as a criterion for deciding the bias processor to be used in practice.
Table 6 lists and describes the features of the consumer loans dataset.

APPENDIX B
The charts in Figures 2 and 3 provide a visual representation of the debiased and non-debiased values across multiple mitigation methods for both datasets.