A novel bias-alleviated hybrid ensemble model based on over-sampling and post-processing for fair classification

With the rapid development of machine learning in the field of classification, classification fairness has become a research emphasis second only to prediction accuracy. However, the data bias and algorithmic discrimination that affect the fair classification of models have not been well resolved, and may harm or benefit specific groups defined by sensitive attributes (e.g. age, race, and gender). To alleviate the unfairness of classification models, this study proposes a novel bias-alleviated hybrid ensemble model (BAHEM) based on over-sampling and post-processing. First, a new clustering-based over-sampling method is proposed to reduce the data bias caused by imbalance in the label and the sensitive attribute. Then, a stacking-based ensemble learning method is employed to improve the performance and robustness of the BAHEM. Finally, a new classification with alternating normalisation (CAN)-based post-processing method is proposed to further improve the fairness of the BAHEM while maintaining its accuracy. Three datasets with different sensitive attributes and four evaluation metrics were used to evaluate the prediction accuracy and fairness of the BAHEM. The experimental results verify the superior fairness of the BAHEM with little reduction in accuracy.


Introduction
With the extensive application of machine learning models for solving various classification problems, including credit scoring (Dastile et al., 2020), crime prediction (Kim et al., 2018), software performance prediction (Liu et al., 2021), and loan applications (Wang et al., 2020), social fairness has received increasing attention. Fair classification can be affected by data bias and algorithmic discrimination. The raw datasets used to train machine learning models may contain human biases, intended or unintended, such as biases related to gender, race, or age. Models trained on biased datasets will learn and reproduce the data bias and discriminate against certain groups. Therefore, fair machine learning models are urgently required.
The fair classification problem was first formalised by Kamiran and Calders (2009); it mainly results from the unbalanced training of some sensitive attributes in machine learning models due to the imbalanced distribution of training data or data bias. For example, if gender is taken as a sensitive attribute and the male-to-female ratio in a dataset is 1:100, the female samples will dominate the training stage of the model, making the model's predictions for the male samples inaccurate. In the field of credit scoring, traditional machine learning methods only consider the distribution of labels and tend to make full use of all available information to maximise prediction accuracy. If the distribution of sensitive attributes is imbalanced, the classification results may be biased towards the majority class, leading to unfairness. Therefore, a model that can handle the trade-off between accuracy and fairness is needed, and its validity should be verified by experiments on real datasets. The aim of research on the fair classification problem is to improve the fairness of models, with little reduction in prediction accuracy, by alleviating data bias and algorithmic discrimination.
On the one hand, real-world datasets are commonly imbalanced in both labels and sensitive attributes. Imbalanced labels may reduce the prediction accuracy of machine learning models, and imbalanced sensitive attributes may cause a discriminative impact on unprivileged groups, affecting the classification fairness of models. Researchers have proposed several methods to balance datasets, including the synthetic minority over-sampling technique (Chawla et al., 2002), adaptive synthetic sampling (ADASYN; He et al., 2008), and balance cascade (Liu et al., 2008). However, existing sampling methods only consider the imbalance in labels while ignoring the imbalance in sensitive attributes. Therefore, a sampling method that considers imbalance in both the labels and the sensitive attributes is required.
On the other hand, ensemble learning methods are widely adopted to improve the performance and robustness of machine learning models, including extreme gradient boosting (XGBoost; Chen & Guestrin, 2016), light gradient boosting machine (LightGBM; Ke et al., 2017), and gradient boosting decision tree (GBDT; Friedman, 2001). Although deep learning methods have been widely used to solve a variety of classification problems with excellent performance, such as gesture recognition (Qi et al., 2021) and sarcasm identification (Onan & Toçoğlu, 2021), it is difficult to analyse the fairness of deep learning models because of their black-box nature. Therefore, this study analyses the fairness of machine learning models and uses ensemble learning to enhance their fairness while maintaining their accuracy. In addition, classification with alternating normalisation (CAN) has been adopted to readjust predicted results and thereby further improve the classification performance of machine learning models (Jia et al., 2021). However, these methods focus only on the performance of machine learning models while ignoring their classification fairness.
Therefore, the motivation of this study is to provide an ensemble model that alleviates data bias and algorithmic discrimination for fair classification. The main contributions of this study are listed as follows: (1) A novel bias-alleviated hybrid ensemble model (BAHEM) based on over-sampling and post-processing is proposed to enhance the classification fairness of ensemble models while maintaining their accuracy. (2) A new clustering-based over-sampling method is proposed to balance the label and the sensitive attribute automatically by generating new samples according to the data distributions; the clustering step improves the sampling efficiency by separating the dataset into several subsets. (3) A stacking-based ensemble learning method is employed to adaptively select and integrate competent base classifiers, output from the first layer of stacking, with higher average rankings of accuracy and fairness, thereby improving the performance and robustness of the proposed BAHEM. (4) A new CAN-based post-processing method is proposed to further improve the fairness of the BAHEM while maintaining its accuracy, by modifying the prediction results that have higher uncertainty and correspond to the majority of the sensitive attribute. (5) Three datasets with different sensitive attributes and four evaluation metrics (two traditional performance metrics and two fairness metrics) are adopted to evaluate the classification performance and fairness of the BAHEM.
The remainder of this study is organised as follows. In Section 2, related work on data sampling methods, ensemble learning methods, and fair classification methods is reviewed. In Section 3, details of the proposed BAHEM are presented. The experimental settings, including the datasets, evaluation metrics, and parameter settings of the models, are presented in Section 4. In Section 5, the experimental investigation and comparison of the performance of the BAHEM and other benchmark models are described. The conclusions and suggestions for future work are presented in Section 6.

Related work
The proposed BAHEM in this study primarily involves three aspects: data sampling methods, ensemble learning methods, and fair classification methods. The literature on these three aspects is reviewed in this section.

Data sampling methods
The problem of imbalanced datasets is one of the greatest challenges in training machine learning models, because models trained on imbalanced datasets may be biased and favour the majority class (Thabtah et al., 2020). To reduce the negative influence of imbalanced datasets, two families of data sampling methods are mainly used in current research: under-sampling methods and over-sampling methods. Under-sampling methods balance the datasets by removing some of the majority samples. For example, Onan (2019) proposed a consensus-clustering-based under-sampling approach that combined five different clustering algorithms to balance the dataset. Devi et al. (2019) analysed the effects of data imbalance in machine learning models and proposed a Tomek-link under-sampling algorithm to address it. Guzmán-Ponce et al. (2021) proposed a two-stage under-sampling method that combines a clustering method for filtering the majority samples with a graph-based procedure for determining the appropriate imbalance ratio (IR) for each subset. Jiang et al. (2022) proposed a boosting random forest with static under-sampling and ensemble methods to reduce the overlap between classes. However, under-sampling methods may lose potentially informative data while removing the majority samples.
In contrast to under-sampling methods, over-sampling methods balance the datasets by generating minority samples. For instance, Tao et al. (2019) proposed a real-value negative selection over-sampling method that can generate minority samples without reusing minority samples from the original dataset and avoids generating noise samples. Puntumapon et al. (2016) proposed a clustering-based over-sampling method to reduce model overfitting and improve the generalisation of the generated minority samples. However, the over-sampling methods in these studies mainly consider how to improve the prediction accuracy of the model, while ignoring its classification fairness.
Therefore, in this study, a new clustering-based over-sampling method is proposed that automatically balances the label and the sensitive attribute in the datasets according to their IRs, thereby improving the adaptability to imbalanced datasets and the classification fairness of the model.

Ensemble learning methods
Ensemble learning methods, one of the most effective ways to improve the performance of base classifiers, have been widely adopted and extended by researchers to solve various classification problems, such as text classification (Onan, 2018) and sentiment classification (Onan et al., 2017). Ensemble learning methods mainly include bootstrap aggregation (bagging; Breiman, 1996), boosting (Freund & Schapire, 1996), and stacking (Wolpert, 1992). Onan et al. (2016) integrated statistical keyword extraction methods by bagging and boosting and verified the effectiveness of ensemble methods in the field of text classification. Among these methods, stacking has proved to be an efficient and flexible approach, which integrates the prediction results of base classifiers to obtain final predictions with increased accuracy. As an example, Xu et al. (2020) combined k-means clustering and ensemble learning to forecast stock market prices. Potha et al. (2021) proposed a sophisticated extrinsic random-based ensemble method to detect malware and demonstrated the effectiveness of ensemble learning methods. Xia et al. (2021) proposed a weighted stacking ensemble with sparsity regularisation, which adjusts the weights of the base classifiers according to the label correlations in multi-label classification problems. Li and Li (2022) improved the adaptive boosting (AdaBoost) algorithm with weight adjustment factors to handle imbalanced data classification with minority samples. In our previous study, Zhang et al. (2021) proposed a stacking ensemble method that combined outlier detection and sampling methods to boost the prediction accuracy and generalisation ability of the model.
However, existing stacking ensemble methods select and integrate the base classifiers with higher prediction accuracy while ignoring the classification fairness of the base classifiers, which may cause the obtained ensemble model to be biased toward certain groups.Therefore, in this study, a stacking-based ensemble learning method is employed to select competent base classifiers with higher accuracy and fairness to improve the classification fairness of the model.

Fair classification methods
The methods used to improve the classification fairness of the model can be separated into three categories: pre-processing methods, in-processing methods, and post-processing methods.
Pre-processing methods aim to reduce data bias and ensure the classification fairness of the model by transforming the distribution of the datasets (d'Alessandro et al., 2017). Among the existing pre-processing methods, the most popular ones are instance sampling (Iosifidis et al., 2019), transformation (Calmon et al., 2017), and label swapping (Kamiran & Calders, 2012). For instance, Petrović et al. (2022) developed a sample re-weighting method to reduce data bias and improve classification fairness by learning sample weighting functions using adversarial training algorithms.
In-processing methods modify state-of-the-art machine learning algorithms by changing the constraints to enhance the classification fairness of the model (Mehrabi et al., 2021). For instance, Iosifidis and Ntoutsi (2019) extended AdaBoost to a fairness-aware classifier that considers the classification fairness of each classifier while updating the sample weights. Zafar et al. (2017) proposed a notion of fairness as the constraint in the objective function of machine learning models.
Post-processing methods modify prediction results to enhance the classification fairness of the model (Iosifidis et al., 2019). As an example, Fish et al. (2016) designed a boosting classifier that improves classification fairness by modifying the decision boundary of the classifiers to protect unprivileged groups. Lohia et al. (2019) proposed a fairness post-processing method that ranks samples using a bias reduction algorithm to enhance both group fairness and individual fairness.
However, the fair classification methods in these three categories improve the classification fairness of the model at the cost of considerable prediction accuracy, which may cause unexpected losses when the model is used for decision making. Therefore, inspired by the pre-processing and post-processing methods, a novel BAHEM based on over-sampling and post-processing is proposed in this study, which includes a new clustering-based over-sampling method as the pre-processing step and a new CAN-based post-processing method, to alleviate data bias and improve the classification fairness of the model with little reduction in prediction accuracy.

Model
In this study, a novel BAHEM based on over-sampling and post-processing is proposed to ensure the fairness and alleviate the data bias of the classification model. Figure 1 shows the framework of the proposed BAHEM. Three methods (i.e. the clustering-based over-sampling method, the stacking-based ensemble learning method, and the CAN-based post-processing method) constitute the three stages of the BAHEM. First, the original dataset is separated into two parts: training data and testing data. The training set and validation set are then obtained by further separating the training data. Second, the training set is clustered into subsets, which are over-sampled to obtain a balanced training set. Third, the base classifiers trained on the balanced training set are selected according to the accuracy and fairness of each classifier and integrated into the stacked ensemble model, which is used to predict the testing data. Finally, the prediction results are modified based on the uncertainty score of each result, and the modified prediction results are regarded as the final prediction results. The details of each stage are presented in the following sub-sections.

Clustering-based over-sampling method
Data imbalance is a common problem in real-world datasets, existing not only in labels but also in sensitive attributes. Machine learning models trained using imbalanced data will be biased against some of the classes. Just as label imbalance affects the accuracy of the model, sensitive-attribute imbalance affects its fairness. A clustering algorithm gathers the samples into different clusters so that the samples within each cluster are as similar as possible and the samples in different clusters are as different as possible; the data imbalance within each cluster is thereby reduced. Clustering algorithms have been widely used to address the imbalance problem of datasets (Onan, 2019; Xu et al., 2020). In this study, a new clustering-based over-sampling method that considers the bias in both the label and the sensitive attribute is proposed to balance them by comparing their respective IRs.
As exhibited in Figure 2, the feature weights of the training set are first adjusted. For example, F1 to Fn represent all features of the training set, and Fs represents the sensitive attribute. Fs is duplicated, and the copy is appended to the training set, resulting in the adjusted training set. The weight of the sensitive attribute Fs is thereby increased, causing the subsequent clustering algorithm to pay more attention to the sensitive attribute. Subsequently, a clustering algorithm separates the adjusted training set into several subsets (e.g. subset 1 and subset 2) to improve the sampling efficiency. The IRs of the label and the sensitive attribute in each subset are then calculated and compared, where the IR is the ratio of the number of majority-class samples to the number of minority-class samples; IR_L represents the IR of the label, and IR_SA represents the IR of the sensitive attribute. The proposed method calculates and compares IR_L and IR_SA in each subset and employs a popular over-sampling method (i.e. ADASYN) to balance them: if IR_SA is greater than or equal to IR_L (e.g. in subset 1), ADASYN is used to balance the sensitive attribute; if IR_SA is less than IR_L (e.g. in subset 2), ADASYN is used to balance the label. Finally, after all the subsets are over-sampled, they are merged to produce a balanced training set.

Stacking-based ensemble learning method
Base classifiers commonly have problems such as low accuracy and fairness, which can be alleviated through classifier integration.Therefore, a stacking-based ensemble learning method is employed in the proposed BAHEM to integrate multiple base classifiers with higher accuracy and fairness.
As shown in Figure 3, multiple base classifiers (Clf 1, Clf 2, ..., Clf m) in the base classifier pool are trained using the balanced training set. Because accuracy (ACC; Stehman, 1997) and average odds difference (AOD; Bellamy et al., 2019) are the most commonly used indicators, they are used to select the competent base classifiers. After the base classifiers are trained, the ACC and AOD of each base classifier on the validation set are calculated and ranked, respectively. The top k competent classifiers are then selected according to the average ranking of ACC and AOD. The k selected competent base classifiers (Sclf 1, Sclf 2, ..., Sclf k) are further permuted and combined into several ensemble classifiers whose prediction results are used as new features to train the meta classifier. Because Xia et al. (2018) demonstrated the superior performance of logistic regression (LR) as a meta classifier in stacking methods, LR is adopted as the meta classifier. Finally, the stacked ensemble model is obtained.
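A minimal sketch of this selection-and-stacking stage, using scikit-learn on synthetic data, might look as follows. The classifier pool, the value of k, and the inline |AOD| helper are illustrative assumptions (the paper's pool also includes XGBoost, LightGBM, and SVM, and it additionally permutes and combines the selected classifiers; here we simply stack them once).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def abs_aod(y_true, y_pred, s):
    # |AOD|: average of the TPR and FPR gaps between the SA=1 and SA=0 groups.
    def rates(g):
        yt, yp = y_true[s == g], y_pred[s == g]
        return yp[yt == 1].mean(), yp[yt == 0].mean()  # (TPR, FPR)
    tpr1, fpr1 = rates(1)
    tpr0, fpr0 = rates(0)
    return abs(0.5 * ((fpr1 - fpr0) + (tpr1 - tpr0)))

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
sa = (X[:, 0] > 0).astype(int)  # hypothetical binary sensitive attribute
X_tr, X_val, y_tr, y_val, sa_tr, sa_val = train_test_split(
    X, y, sa, test_size=0.25, random_state=0)

pool = {
    "rf": RandomForestClassifier(random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
    "ada": AdaBoostClassifier(random_state=0),
    "lr": LogisticRegression(max_iter=1000),
}
names = list(pool)
acc = np.array([accuracy_score(y_val, pool[n].fit(X_tr, y_tr).predict(X_val))
                for n in names])
aod = np.array([abs_aod(y_val, pool[n].predict(X_val), sa_val) for n in names])

# Rank by ACC (higher is better) and |AOD| (lower is better), then average.
acc_rank = np.argsort(np.argsort(-acc))
aod_rank = np.argsort(np.argsort(aod))
k = 3
selected = [names[i] for i in np.argsort(acc_rank + aod_rank)[:k]]

# Stack the k competent classifiers with LR as the meta classifier.
stack = StackingClassifier(
    estimators=[(n, pool[n]) for n in selected],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)
```

The double-`argsort` idiom converts each score vector into 0-based ranks, so the sum of the two rank vectors implements the "average ranking of ACC and AOD" criterion directly.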

CAN-based post-processing method
Traditional machine learning models are commonly trained using training data, and their predictions on testing data are used to evaluate their classification performance. Jia et al. (2021) showed that classification performance can be improved by readjusting the prediction results of challenging samples, and proposed the CAN method for this purpose. In a post-processing method, how to select and modify the appropriate prediction results is the key to improving the accuracy and fairness of the model. Therefore, a new CAN-based post-processing method is proposed in this study. In contrast to the original CAN method, which only improves the accuracy of the model, the proposed method ensures the classification fairness of the model while maintaining its accuracy.
To evaluate the fairness of the model intuitively, the data can be divided into an unprivileged group and a privileged group according to a sensitive attribute; the differences in the true positive rate and the false positive rate between these two groups are calculated, and their average is taken as the fairness of the model. To improve fairness, these differences between the unprivileged and privileged groups should be made as small as possible. The CAN-based post-processing method proposed in this study selects and modifies the prediction results with higher uncertainty to reduce these differences, thereby improving the fairness of the model.
As depicted in Figure 4, the stacked ensemble model is first used to obtain the original prediction results by predicting the testing data. Then, the CAN method, which defines entropy as the uncertainty score, is adopted to calculate the uncertainty score of each prediction result. After all the uncertainty scores have been calculated, a threshold is given, and the prediction results with uncertainty scores higher than the threshold are selected. Because the prediction results of a model trained on an imbalanced dataset will be biased towards the majority class, datasets are usually balanced by reducing the majority class or increasing the minority class. Similarly, when analysing the fairness of the model, the prediction results will be biased towards the group that constitutes the majority of the sensitive attribute. To ensure that the modifications improve the fairness of the proposed BAHEM, only the prediction results corresponding to the majority of the sensitive attribute are selected for modification. For example, assume that the sensitive attribute and the prediction results are both binary and that "1" is the majority value of the sensitive attribute. If the threshold is set to 0.5, all prediction results with uncertainty scores greater than or equal to 0.5 are selected. Among these, if the corresponding sensitive attribute is "1", the prediction result is flipped (i.e. "1" to "0" and "0" to "1"). Finally, the modified prediction results are taken as the final prediction results.
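The selection-and-flip rule above can be written compactly. The function below is our own sketch: it assumes the model outputs P(y = 1), uses the binary entropy in bits (so the score lies in [0, 1]) as the uncertainty measure, and flips only the selected predictions belonging to the majority value of the sensitive attribute.

```python
import numpy as np

def can_postprocess(proba_pos, y_pred, sa, threshold=0.5):
    """Flip high-uncertainty predictions for the majority sensitive group."""
    p = np.clip(proba_pos, 1e-12, 1 - 1e-12)
    # Binary entropy in bits: 1.0 at p = 0.5, approaching 0.0 at p = 0 or 1.
    uncertainty = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    majority = int(np.mean(sa) >= 0.5)  # majority value of the binary SA
    flip = (uncertainty >= threshold) & (sa == majority)
    y_new = np.asarray(y_pred).copy()
    y_new[flip] = 1 - y_new[flip]  # "1" -> "0", "0" -> "1"
    return y_new
```

For instance, with probabilities `[0.95, 0.55, 0.45, 0.05]` and a majority sensitive value of 1, only the second prediction (entropy about 0.99, sensitive attribute 1) is flipped; the third is equally uncertain but belongs to the minority group and is left untouched.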

Datasets description
In this study, three standard datasets, namely Adult (Kohavi, 1996), Bank (Moro et al., 2014), and German (Asuncion & Newman, 2007), from the UC Irvine (UCI) machine learning repository are used to estimate the classification performance and fairness of the proposed BAHEM. Table 1 lists the details of the datasets, including the sample size, the numbers of positive and negative samples, the number of features, and the sensitive attributes. In addition, the code for this study is available on GitHub.

Evaluation metrics
Four evaluation metrics, namely, ACC, AOD, balanced accuracy (BA; Brodersen et al., 2010), and equal opportunity difference (EOD; Hardt et al., 2016) were adopted in this study in order to evaluate the classification performance and fairness of the proposed BAHEM.
ACC is the basic evaluation metric used to indicate the overall performance of a model. The formula for ACC is given in Equation (1), where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. A higher ACC indicates higher prediction accuracy.
BA is an evaluation metric commonly adopted to evaluate the performance of a model trained on an imbalanced dataset. BA is calculated from the true positive rate (TPR) and true negative rate (TNR); the formulas for BA, TPR, and TNR are given in Equations (2), (3), and (4), respectively. A higher BA indicates higher prediction accuracy.
AOD and EOD are used to evaluate the classification fairness of the model, and AOD is calculated using Equation (5), where SA = 1 denotes the unprivileged group and SA = 0 denotes the privileged group. The formula for FPR is given in Equation (6). TPR_(SA=1) and FPR_(SA=1) denote the TPR and FPR in the unprivileged group, and TPR_(SA=0) and FPR_(SA=0) denote the TPR and FPR in the privileged group, respectively. The formula for EOD is presented in Equation (7). For a comprehensive comparison, the absolute value of EOD, i.e. |EOD|, is used in the experimental comparison. A lower AOD or |EOD| indicates higher classification fairness.
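Since the numbered equations are not reproduced here, the four metrics can be stated directly in code from their standard definitions; the Equation (n) comments map our functions to the equation numbers referenced above, assuming the usual definitions from the cited sources.

```python
import numpy as np

def _counts(y_true, y_pred):
    # Confusion-matrix counts for binary labels.
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

def acc(y_true, y_pred):                     # Equation (1)
    tp, tn, fp, fn = _counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def tpr(y_true, y_pred):                     # Equation (3)
    tp, _, _, fn = _counts(y_true, y_pred)
    return tp / (tp + fn)

def tnr(y_true, y_pred):                     # Equation (4)
    _, tn, fp, _ = _counts(y_true, y_pred)
    return tn / (tn + fp)

def fpr(y_true, y_pred):                     # Equation (6)
    _, tn, fp, _ = _counts(y_true, y_pred)
    return fp / (fp + tn)

def ba(y_true, y_pred):                      # Equation (2)
    return (tpr(y_true, y_pred) + tnr(y_true, y_pred)) / 2

def aod(y_true, y_pred, sa):                 # Equation (5)
    # Mean of the FPR and TPR gaps, unprivileged (SA=1) minus privileged (SA=0).
    u, p = sa == 1, sa == 0
    return 0.5 * ((fpr(y_true[u], y_pred[u]) - fpr(y_true[p], y_pred[p]))
                  + (tpr(y_true[u], y_pred[u]) - tpr(y_true[p], y_pred[p])))

def eod(y_true, y_pred, sa):                 # Equation (7)
    # TPR gap between the unprivileged and privileged groups.
    u, p = sa == 1, sa == 0
    return tpr(y_true[u], y_pred[u]) - tpr(y_true[p], y_pred[p])
```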

Parameter settings
The original dataset was randomly separated as follows: 20% was used as the testing data and 80% as the training data. Of the training data, 80% was used as the training set and the remainder as the validation set. In the clustering-based over-sampling method, k-means was used as the clustering method with the number of cluster centres set to 2, and ADASYN was used as the over-sampling method; k-means and ADASYN were executed using the Python modules "sklearn" and "imblearn", respectively. In the stacking-based ensemble learning method, XGBoost, GBDT, AdaBoost, random forest (RF), support vector machine (SVM), LR, and LightGBM were used as the base classifiers. XGBoost and LightGBM were executed using the Python modules "xgboost" and "lightgbm", respectively; GBDT, AdaBoost, RF, SVM, and LR were executed using the Python module "sklearn". In the CAN-based post-processing method, inspired by Agrawal et al. (2019), Kirar and Agrawal (2019), and Kirar et al. (2022), the ACC and AOD for each dataset using different thresholds of uncertainty scores are shown in Figure 5. Considering that the threshold is generally set to 0.5 by default in classification problems, this experiment compared threshold values around the default, namely 0.4, 0.5, and 0.6. Although the ACC rises as the threshold increases for all datasets, the AOD for "Adult-race," "Bank-age," "German-age," and "German-sex" performed best when the thresholds were set to 0.6, 0.5, 0.5, and 0.6, respectively; these values were therefore adopted through trial-run experiments. For a fair comparison, all the parameters of the clustering methods, over-sampling methods, and base classifiers were set to their defaults.
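The nested 80/20 splits described above can be reproduced with scikit-learn; the placeholder data and random seed below are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)                 # placeholder features
y = np.random.default_rng(0).integers(0, 2, 1000)  # placeholder labels

# 80% training data / 20% testing data
X_train_all, X_test, y_train_all, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Of the training data: 80% training set / 20% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_all, y_train_all, test_size=0.2, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 640 160 200
```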

Experimental analysis
In this study, three datasets with different sensitive attributes and four evaluation metrics were adopted to evaluate the classification performance and fairness of the proposed BAHEM. Each experiment was run 10 times, and each evaluation metric was calculated in every run to ensure the reliability of the results; the average of each evaluation metric was reported as the performance of the BAHEM. All experiments were run on the Microsoft Windows 10 operating system using Python 3.7.

Performance evaluation of baseline classifiers
To evaluate the classification performance and fairness of the BAHEM, baseline results on the three datasets with different sensitive attributes were obtained using the four evaluation metrics. The performance of the baseline classifiers is presented in Table 2. The German dataset can be treated as two datasets by selecting different sensitive attributes. To clarify the sensitive attributes of the datasets in the following experimental analysis, the datasets with different sensitive attributes are renamed "Adult-race," "Bank-age," "German-age," and "German-sex," respectively.

Performance comparison between base classifiers and the proposed BAHEM
To demonstrate that the proposed BAHEM can improve classification fairness with little reduction in prediction accuracy, the performance of the BAHEM is compared with that of the seven base classifiers on the four evaluation metrics. Histograms of the performance comparison between the BAHEM and the base classifiers are presented in Figure 6. As shown in Figure 6, although the ACC and BA of the BAHEM are slightly reduced compared with the base classifiers on each dataset, the fairness metrics (i.e. AOD and EOD) of the BAHEM are improved, indicating that the proposed BAHEM can effectively improve fairness without sacrificing too much prediction accuracy.

Performance comparison between benchmark models and the proposed BAHEM
To further demonstrate the higher classification performance and fairness of the proposed BAHEM, it is compared with the benchmark models proposed by Kearns et al. (2019). The source codes of the related benchmark models are publicly available, and for a fair comparison all experiments were conducted under the same experimental settings. The comparison results are shown in Table 6. Certain evaluation metrics were not adopted in the benchmark models; the corresponding indicators are marked as "/" in Table 6.

Conclusion and future work
In this study, a novel BAHEM based on over-sampling and post-processing is proposed to alleviate data bias and improve classification fairness without sacrificing too much prediction accuracy. The proposed BAHEM makes two main methodological contributions. First, a new clustering-based over-sampling method is proposed, which generates subsets with clustering methods and automatically balances the label and the sensitive attribute to improve the adaptability to imbalanced datasets and the classification fairness of the model. Second, a new CAN-based post-processing method is proposed to select prediction results with higher uncertainty and modify them to further enhance the fairness of the BAHEM while maintaining its prediction accuracy. Three datasets (i.e. Adult, Bank, and German) with different sensitive attributes and four evaluation metrics (i.e. ACC, AOD, BA, and EOD) were used to evaluate the classification performance and fairness of the BAHEM. The experimental results show that the classification performance and fairness of the BAHEM outperform those of other benchmark models. However, the proposed BAHEM has some shortcomings in both method and practical application. In future work, the BAHEM can be further improved by building the ensemble from base classifiers that are processed using in-processing methods. For the sampling method, the threshold could be adjusted adaptively according to different sample distributions. For the post-processing method, more comprehensive indicators beyond uncertainty could be considered when evaluating and modifying the prediction results to obtain higher classification performance and fairness. For practical application, it is necessary to further enhance the interpretability of the model and the robustness of its results.

Figure 3 .
Figure 3. Schematic diagram of stacking-based ensemble learning method.

Figure 4 .
Figure 4. Schematic diagram of CAN-based post-processing method.

Figure 5 .
Figure 5. Performance comparison between different thresholds on each dataset.

Figure 6 .
Figure 6. Histograms of performance comparison between base classifiers and BAHEM.

Table 1 .
Detailed information of the datasets.

Table 4 .
Competent base classifiers selected for different datasets.

Table 5 .
Performance comparison between the stacked ensemble model and BAHEM. The results of BAHEM are highlighted in bold when the evaluation metric of the BAHEM is better than that of the stacked ensemble model. As shown in Table 5, although the ACC and BA of the BAHEM on some datasets (e.g. Bank-age and German-age) are slightly worse than those of the stacked ensemble model, the BAHEM shows a significant improvement in the fairness metrics (i.e. AOD and EOD). The experimental results indicate that the stacking-based ensemble learning method and the CAN-based post-processing method can effectively improve the classification fairness of the model without sacrificing too much prediction accuracy.

Table 6 .
Performance comparison between benchmark models and the proposed BAHEM. For evaluation metrics that were not adopted in the benchmark models, the corresponding indicators are marked as "/" in this table. As the table shows, the ACC of the BAHEM is slightly reduced compared with most other benchmark models, but the BAHEM outperforms the other models in AOD and EOD on most datasets, demonstrating that the proposed BAHEM can improve fairness without sacrificing too much prediction accuracy.