FAL-CUR: Fair Active Learning using Uncertainty and Representativeness on Fair Clustering

Active Learning (AL) techniques have proven to be highly effective in reducing data labeling costs across a range of machine learning tasks. Nevertheless, one known challenge of these methods is their potential to introduce unfairness towards sensitive attributes. Although recent approaches have focused on enhancing fairness in AL, they tend to reduce the model's accuracy. To address this issue, we propose a novel strategy, named Fair Active Learning using fair Clustering, Uncertainty, and Representativeness (FAL-CUR), to improve fairness in AL. FAL-CUR tackles the fairness problem in AL by combining fair clustering with an acquisition function that determines which samples to query based on their uncertainty and representativeness scores. We evaluate the performance of FAL-CUR on four real-world datasets, and the results demonstrate that FAL-CUR achieves a 15% - 20% improvement in fairness compared to the best state-of-the-art method in terms of equalized odds while maintaining stable accuracy scores. Furthermore, an ablation study highlights the crucial roles of fair clustering in preserving fairness and the acquisition function in stabilizing the accuracy performance.


I. INTRODUCTION
Recently, the world has experienced rapid advances in deep neural networks, ranging from natural language processing [8], [13] to natural image generation [34]. While showing great results, many deep learning approaches rely on large labeled training sets, which are costly and time-consuming to produce. Many works have introduced active learning [36] frameworks to reduce the cost of data annotation. Active learning (AL) methods aim to select the most informative or representative samples to be labeled by human experts.
In initial works, the active learning research community mainly focused on maximizing accuracy given a labeling budget. Consequently, there is little prior work on fair active learning approaches that address the disparity in algorithm performance across under-represented groups [5], [37], for example, whether underprivileged groups, such as African-Americans, are adequately represented in the query set. This issue calls for a fair(er) active learning approach that improves group fairness so that protected and unprotected groups are equally selected during the active learning acquisition phase. Group discrimination during active learning data acquisition can lead to misrepresented labeled data for the training phase, which in turn can induce bias in a machine learning model. However, it has been observed that the performance of machine learning models, including those used in active learning, decreases when they aim to satisfy fairness constraints [32], [40]. For instance, recent work by Anahideh et al. [5] shows that their active learning model faces a downward trend in performance when fairness is improved. Thus, it remains a central challenge to develop an active learning framework that improves the fairness score while keeping the performance stable.
This paper discusses fairness for active learning and presents a novel approach to jointly address the fairness and performance problems in active learning. As in other active learning techniques, the method's core is the proposed acquisition function, which formalizes the criteria for selecting items for the oracle. We address the current research gap in fairness for active learning by applying fairness as a constraint on the acquisition function. We present a fair active learning method, called Fair Active Learning using fair Clustering, Uncertainty and Representativeness (FAL-CUR), consisting of two stages. In the first stage, we apply a fair clustering algorithm to group uncertain samples while maintaining fairness constraints. The second stage is the selection algorithm, which picks the most informative samples based on a fairness-aware ranking that combines representativeness and uncertainty inside the fair clusters.
The proposed FAL-CUR method addresses the problem of choosing data from an unlabeled pool (U) for AL while maintaining fairness constraints, measured by the Equal Opportunity, Equalized Odds, and Statistical Parity metrics. To assess its accuracy and fairness, we conducted extensive experiments and in-depth analysis on several real-world datasets. The experimental results show that FAL-CUR achieves high fairness while keeping the active learning model stable compared to the baselines. The FAL-CUR acquisition function shows stable accuracy with an improved fairness score. The ablation analysis shows that all components of the proposed model are equally important to collectively achieve high fairness as well as high performance in active learning.
The rest of the paper is organized as follows. Section II introduces the lines of work related to this paper. Sections III and IV present the problem definition and the proposed methodology, respectively. Section V discusses the experimental set-up. In Section VI, we experimentally investigate the performance of FAL-CUR, and finally, Section VII concludes the paper.

II. RELATED WORK
This section summarizes works relevant to our study, including active learning, fair machine learning, and fair active learning.

A. Active Learning
The goal of active learning is to reduce the cost of annotation by labeling only high-quality samples [12], [17], [23]. It is particularly useful in situations where a huge amount of training data is available but the cost of annotation is high [23], [33].
Active learning approaches can be categorized based on the way they process items in the database: (i) Sequential Active Learning (SAL), in which one data instance is selected per algorithm iteration, and (ii) Batch-mode Active Learning (BAL), in which a batch of informative instances is processed. The shortcoming of SAL compared to BAL is that it requires re-training after each iteration, which can be quite costly [21], [27]. The main goal of BAL is to select a batch of unlabeled samples in each iteration, trading off between uncertainty and representativeness [23]. Uncertain samples are likely to help refine the decision boundary, while representativeness (according to some surrogate criterion) strives to capture the data structure of the unlabeled set. There are two main active learning approaches to achieve this trade-off: (i) objective-driven methods and (ii) screening approaches. Objective-driven methods formulate this trade-off as a single objective, such that the most informative/representative batch of samples is found by solving an optimization problem [22], [35]. These approaches are often theoretically well-grounded and have demonstrated good performance in practice. However, they typically do not scale well to big datasets [31]. While these works show promising results, most of them are evaluated on balanced datasets that do not reflect the data distributions found in real-world applications.
Early work on imbalanced active learning was done by Ertekin et al. [15]. Their framework addressed the class imbalance problem in active learning by providing more balanced samples to the active learning classifier. The study demonstrated an efficient technique to select informative samples from a smaller pool for active learning. The authors pointed out that implementing early stopping criteria enables active learning to reach a fast solution with competitive performance under imbalanced class distributions. Other works [3], [30] instead balance the samples after the batch-mode active learning algorithm has made its selection. Bhattacharjee et al. [7] proposed a more recent active learning framework for addressing imbalanced class distributions; the study introduced two algorithms that use a novel sampling strategy and an anomaly detection method.

B. Fair Machine Learning
There are plenty of distortions that make it hard to process data fairly [11], including bias encoded in the data and the effect of minimizing average error, which fits the majority population. Fair Machine Learning (FML) aims to build machine learning models that are free of bias towards sensitive groups. Fairness in machine learning has attracted tremendous research interest over several years and has been explored in numerous data mining and machine learning problems. One of the key works on fairness was introduced by Dwork et al. [14], who introduced two notions of fairness: (i) individual fairness and (ii) group fairness. Individual fairness focuses on consistent treatment and strives to achieve configurations where similar objects are assigned similar outcomes. On the other hand, group fairness ensures that outcomes are equally distributed across all subgroups defined by sensitive attributes, such as gender, race, ethnicity, nationality, and religion. Many research works have highlighted the bias of ML models with respect to different sensitive attributes, which therefore must be protected [19], [24]. Following these two directions, fair clustering has also been widely studied by the FML community. Individual fairness in clustering focuses on how to treat similar individuals similarly; representative studies ensure that similar samples receive similar cluster placements [9] or similar cluster distances [38]. Meanwhile, group fairness in clustering focuses on treating each sample in a group fairly with respect to how other groups are treated. One of the key ideas is to balance group representation in clustering algorithms, e.g., spectral clustering [25] and correlation clustering [4].

C. Fair Active learning
Recently, the active learning community has become interested in investigating fairness during active learning sample selection. The first fair active learning method was proposed by Anahideh et al. [5]. This model targets fairness in the group setting and focuses on selecting fair samples to be labeled. They designed an active learning algorithm that selects data points to be labeled by balancing model accuracy and fairness, where fairness is measured as the expected fairness over all unlabeled data and the newly added labeled data. Their FAL method uses an accuracy-fairness optimizer to select samples to be labeled, with three strategies for the optimizer: FAL α-aggregate, FAL Nested, and FAL Nested Append. The FAL method notably reduced unfairness without significantly impacting model accuracy, and the FAL Nested Append optimizer showed the best performance across different experiments and fairness models. In more recent work, Sharaf et al. [37] formulated fairness-constrained active learning as a bi-level optimization problem targeted at group fairness. The first level of the optimization trains a classifier on a subset of labeled examples, while the second level is the selection policy that chooses the subset of data achieving the desired fairness and accuracy on the trained classifier. To achieve this goal, they used forward-backward splitting for both optimization problems. The study showed promising results in terms of accuracy; however, its fairness performance was not able to outmatch the earlier fairlearn [2] method.

III. PROBLEM FORMULATION
We consider pool-based active learning, where we have access to a small labeled set L and a large pool of unlabeled data U. The labeled set consists of feature vectors together with their labels, L = {(x_1, y_1), ..., (x_n, y_n)}, while U consists of feature vectors {u_1, ..., u_n} whose labels are not yet known, and C denotes a classifier. Obviously, L ∩ U = ∅ and L ∪ U = X, where X is the whole set of feature vectors. The standard active learning pipeline can be described as the interplay among three parts: an oracle, a predictor, and an acquisition function. The oracle provides a set of labeled data for the predictor to learn from. Then, the output of the predictor (predictive uncertainties) goes to the so-called acquisition function, which guides the oracle on which points to label and restarts the cycle. The Acquisition Function (AF) in active learning is a mapping from U to some ordered set V, thus defining the sample-selection strategy by setting priorities or rankings for every x ∈ U. Some works have proposed replacing the acquisition function with another neural network; however, this replacement could slow down the AL cycle. In every iteration of this learning model, a classifier built on L is used to select the most informative samples (i.e., the most uncertain samples, which are usually close to the classifier boundary) from U, and the selected samples are sent to an oracle (usually a human expert) that labels them with the corresponding class. A good active learning method aims to achieve a reasonably good classifier with minimal labeling help from the oracle. Thus, every instance x_i^L ∈ L is associated with a label y_i^L revealed by a domain expert, whereas the labels associated with x_i^U are not yet known. Acquisition functions are designed such that they attain their maxima at locations of the input space U with a high level of uncertainty, as well as in regions that have not been sufficiently explored before. Typically, an AF is defined as the composition of a feature extraction function f(x) : X → Y, x ∈ X, which extracts features from the vectors {u_1, ..., u_n} into a set Y, and a score function g(y) : Y → V, y ∈ Y, which calculates the score of a sample in V, so that AF(x) = g(f(x)). The query set Q ⊂ U then consists of the m ∈ N, m ≤ |U|, highest-scoring vectors.
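For illustration, the following is a minimal Python sketch (not the authors' implementation; the function names are ours) of such an acquisition step, using the raw features as f and predictive entropy as the score function g, with the m highest-scoring pool samples forming the query set Q:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def acquisition_scores(clf, X_pool):
    """g(f(x)): score every pool vector by predictive entropy (higher = more uncertain)."""
    proba = clf.predict_proba(X_pool)                   # predictor output on the pool U
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

def select_query_set(clf, X_pool, m):
    """Return indices of the m highest-scoring samples, i.e. the query set Q."""
    scores = acquisition_scores(clf, X_pool)
    return np.argsort(scores)[::-1][:m]

# Toy usage: fit on a small labeled set L, then rank the pool U.
rng = np.random.default_rng(0)
X_L, y_L = rng.normal(size=(20, 4)), rng.integers(0, 2, size=20)
X_U = rng.normal(size=(200, 4))
clf = LogisticRegression().fit(X_L, y_L)
print(select_query_set(clf, X_U, m=5))                  # indices sent to the oracle
```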
Frequently, data for the oracle are processed in queries where one instance is handled at a time to update the classification model. However, this manual labeling routine can be very daunting and inefficient, and it does not allow the labeling process to be parallelized. A possible solution to these issues is the Batch-mode Active Learning (BMAL) approach [10], [18], [22], a class of algorithms that query multiple instances at once, i.e., in batches. Even so, BMAL remains a resource-consuming task because most real-world labeling is performed by human experts; a good BMAL algorithm should therefore describe how this parallel processing is organized. Specifically, given the labeled data L and unlabeled data U, BMAL interactively selects a batch B of samples, satisfying B ⊂ U and |B| = b, where the batch size b is determined by the human labeling capacity. The AL method selects unlabeled samples from U to be labeled by a domain expert in each iteration until a stopping criterion is reached, such as a model accuracy threshold or the exhaustion of the labeling budget fixed beforehand. The cycle of selecting samples for the oracle can thus be described by the relation B_t = arg max_{B ⊂ U_t, |B| = b} Σ_{x ∈ B} AF(x), followed by L_{t+1} = L_t ∪ oracle(B_t) and U_{t+1} = U_t \ B_t. We consider that the unlabeled data U is imbalanced, i.e., the target class may be under-represented. In this case, parallel processing of the batch B, and the whole procedure of selecting it, becomes harder as soon as we require the batch to represent all possible classes and all groups in the unlabeled data. Therefore, the problem emerges of designing an algorithm that selects, as equally as possible, data from every class and group into the batch while controlling fairness in the selection, i.e., of designing methods for fairness in batch mode. Fairness-aware machine learning research can be developed in several directions due to the variety of possible definitions of fairness; since we study fairness for active learning, we deal in this paper with the most commonly considered statistical (group) fairness. The goal of FAL-CUR is to select samples for the batch B that are representative enough to keep the classifier performance good while keeping the model fair towards sensitive groups. Thus, we can frame our problem as a multi-objective optimization (MOP), X* = arg max_{X ⊆ U, |X| = b} ( A(X), −UF(X) ), where the first objective is to maximize the accuracy A (the performance of the model), the second objective is to minimize the unfairness UF, and X* is the set of selected samples drawn from the unlabeled data U. Formalizing the algorithm requires modeling sensitive attributes such as race or gender. For each unlabeled data point x ∈ U, let S denote the set of sensitive coordinates, and let S_i denote the sensitive attribute of u_i ∈ U. In this work, we consider three established group fairness measures: Equal Opportunity [20], Equalized Odds [20], and Statistical Parity [14].
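As a concrete reference point, a minimal sketch of this BMAL cycle (assuming an entropy-based acquisition function and an `oracle` callable standing in for the human expert; these names are ours) could look as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bmal_cycle(X_L, y_L, X_U, oracle, batch_size, budget):
    """Batch-mode AL loop sketch: score the pool, query a batch B, retrain, repeat
    until the labeling budget is exhausted or the pool is empty."""
    clf = LogisticRegression(max_iter=1000).fit(X_L, y_L)
    labeled = 0
    while labeled < budget and len(X_U) > 0:
        proba = clf.predict_proba(X_U)
        entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
        batch = np.argsort(entropy)[::-1][:batch_size]    # batch B with |B| = b
        y_new = oracle(X_U[batch])                        # the human expert labels B
        X_L = np.vstack([X_L, X_U[batch]])
        y_L = np.concatenate([y_L, y_new])
        X_U = np.delete(X_U, batch, axis=0)               # remove B from U
        clf = LogisticRegression(max_iter=1000).fit(X_L, y_L)
        labeled += len(batch)
    return clf, X_L, y_L
```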

IV. THE PROPOSED METHOD: FAL-CUR
1) Fair Clustering: Since we face both class imbalance and fairness constraints, it is essential to select samples that represent the minority class and have a high fairness score. Fig. 1 illustrates the sample selection of the proposed FAL-CUR method. Our FAL-CUR model consists of two main parts. In the first part, we cluster the uncertain samples using fairness-aware clustering [1]; in the second part, the method focuses on selecting samples from inside the resulting fair clusters. The fair clustering objective combines a clustering term with a fairness term,

F(C, X) = F_kmeans(C, X) + λ · F_fairness(C, X),

where F_kmeans is the K-Means term that calculates the K-Means loss over the clusters in C,

F_kmeans(C, X) = Σ_{i=1..|C|} Σ_{x_j ∈ C_i} (x_j − µ_i)²,

with µ_i, i = 1..|C|, the centers of mass of every cluster in C, and where the distance is calculated only over the features that do not belong to the sensitive attributes S. The other term, F_fairness, is a fairness loss that tweaks the assignment towards fairness on the attributes in S, defined as the deviation dev_S(C, X) between the representation of each sensitive value inside a cluster and in the whole dataset. Using domain cardinality normalization, this deviation can be computed from [1]

Fr_S^C(t) = |{x | x ∈ C, x_S = t}| / |C|   and   Fr_S^X(t) = |{x | x ∈ X, x_S = t}| / |X|,

the short-hands for the fractional representation of S = t in C and in X, respectively. The parameter λ controls the balance between K-Means coherency and fairness performance.
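A minimal sketch of how the two terms of such an objective could be computed is given below; it is an illustration, not the implementation of [1], and the absolute-gap aggregation of the representation deviation is an assumption made here:

```python
import numpy as np

def kmeans_loss(X_nonsensitive, labels, centroids):
    """Sum of squared distances to the assigned centroid, over non-sensitive features only."""
    return sum(np.sum((X_nonsensitive[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

def fairness_deviation(sensitive, labels, n_clusters):
    """dev_S(C, X): gap between each cluster's sensitive-value fractions Fr_S^C(t)
    and the global fractions Fr_S^X(t). Absolute-gap aggregation is illustrative."""
    values, counts = np.unique(sensitive, return_counts=True)
    global_frac = counts / len(sensitive)                        # Fr_S^X(t)
    dev = 0.0
    for k in range(n_clusters):
        in_k = sensitive[labels == k]
        if len(in_k) == 0:
            continue
        cluster_frac = np.array([(in_k == t).mean() for t in values])  # Fr_S^C(t)
        dev += np.abs(cluster_frac - global_frac).sum()
    return dev

def fair_clustering_objective(X_ns, sensitive, labels, centroids, lam):
    """F = F_kmeans + lambda * F_fairness."""
    return kmeans_loss(X_ns, labels, centroids) + lam * fairness_deviation(
        sensitive, labels, len(centroids))
```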
2) Sample Selection: Having enforced fairness through fair clustering, the next challenge is to select candidate samples from the unlabeled pool to be labeled so that the model improves under a limited budget. Thus, it is essential to search for the best representative samples inside each fair cluster. We consider a distance-based candidate selection that meets several criteria, such as having a high probability of belonging to the minority class and reducing the misclassification error, i.e., the model uncertainty. The candidate we are searching for could reside in several locations, such as near the centroid of the fair cluster or far from it; the objective function therefore combines two scores. a) Uncertainty Score: We define the uncertainty as the entropy of the classifier for a specific sample and its sensitive attribute: H_t(y | x_i, s_i) is the entropy of the sample (x_i, s_i) ∈ U with respect to the classifier trained at iteration t, H_t(y | x_i, s_i) = − Σ_{y ∈ Y} P_t(y | x_i, s_i) log P_t(y | x_i, s_i), where P_t denotes the class posterior of the current classifier. b) Representativeness Score: Although fair clustering reduces the sample search space, it remains a question which samples should be selected for labeling. In this work, we use representativeness together with uncertainty for sample selection. After measuring the uncertainty, the next step is to calculate the representativeness score of each sample. For this, we compute the function Rep, which identifies the sample most similar to all other samples using Euclidean distance. Let x_i^s ∈ U be a sample grouped by fair clustering whose ground-truth class label is unknown; we consider x_i^s representative if it shares a large similarity, measured by the Euclidean distance d, with the other samples inside its cluster. Thus, we measure the representativeness score of x_i^s through a pairwise comparison with all other samples in the same fair cluster.

c) FAL-CUR Sample Acquisition:
The final score for sample selection combines the two components, Score(x) = β · Rep(x) + (1 − β) · H_t(y | x, s). We rank this score and select the samples with the highest values. The β parameter weighs the impact of the two scores on the final score: a small β makes the model favor uncertainty, while a higher β means the model prefers representativeness. In our experiments, we fixed β to 0.6, which balanced the two scores.
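For illustration, a compact sketch of this per-cluster scoring is given below; it is not the authors' code, and the min-max normalization of the two scores before mixing is an assumption made so that the β weighting is meaningful:

```python
import numpy as np

def fal_cur_acquisition(clf, X_cluster, beta=0.6, batch_size=10):
    """Rank samples in one fair cluster by beta * Rep + (1 - beta) * uncertainty."""
    proba = clf.predict_proba(X_cluster)
    unc = -np.sum(proba * np.log(proba + 1e-12), axis=1)        # entropy H_t(y | x, s)
    dists = np.sqrt(((X_cluster[:, None, :] - X_cluster[None, :, :]) ** 2).sum(-1))
    avg_dist = dists.sum(axis=1) / max(len(X_cluster) - 1, 1)   # mean Euclidean distance d
    rep = -avg_dist                                             # closer to the others = more representative
    norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-12)        # min-max rescale (assumption)
    final = beta * norm(rep) + (1.0 - beta) * norm(unc)
    return np.argsort(final)[::-1][:batch_size]                 # highest-scoring samples first
```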
Calculate entropy of x ∈ C(x) using Equation 6;
while b < |B| do
    initialize k clusters randomly;
    set cluster prototypes as cluster centroids;
    while not converged do
        ∀x ∈ X set cluster(x);
        update cluster prototypes;
        reinstate fairness using Equation 4;
        return cluster(x);
    end
    Select representative samples X* using Equation 7;
    Add label (x*, y) to L and remove X* from U;
    Update the model C_t using L;
end
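To show how the pieces fit together, here is a hedged sketch of one acquisition round in Python; plain KMeans stands in for the fair clustering step of [1], and the helper and parameter names (oracle, per_cluster) are illustrative rather than the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def fal_cur_round(clf, X_L, y_L, X_U, sensitive_U, oracle, k=5, per_cluster=4, beta=0.6):
    """One FAL-CUR-style round: cluster the pool, score samples in each cluster,
    query the oracle, and retrain. A true fair-clustering step would also use
    sensitive_U to balance group representation [1]; KMeans is only a placeholder."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_U)
    picked = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        Xc = X_U[idx]
        proba = clf.predict_proba(Xc)
        unc = -np.sum(proba * np.log(proba + 1e-12), axis=1)
        dists = np.sqrt(((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(-1))
        rep = -dists.sum(axis=1) / max(len(Xc) - 1, 1)
        norm = lambda v: (v - v.min()) / (np.ptp(v) + 1e-12)
        score = beta * norm(rep) + (1 - beta) * norm(unc)
        picked.extend(idx[np.argsort(score)[::-1][:per_cluster]])
    picked = np.array(picked)
    X_L = np.vstack([X_L, X_U[picked]])
    y_L = np.concatenate([y_L, oracle(X_U[picked])])     # oracle = human labeling step
    X_U = np.delete(X_U, picked, axis=0)
    return LogisticRegression(max_iter=1000).fit(X_L, y_L), X_L, y_L, X_U
```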

V. EXPERIMENTAL SET-UP
In this section, we discuss the experimental set-up for our analysis, including datasets, evaluation methods, and baselines. We use four real-world datasets for the analysis, and their characteristics are presented in Table I. The datasets vary in dimensionality and class imbalance and are therefore interesting for comparative evaluation. Next, we describe them in detail.

A. Datasets
1) Adult census income [26]: This dataset contains 32,560 records extracted from the 1994 census data. The attributes include age, occupation, education, and gender. The class label is defined based on income: it is 1 if the income is higher than $50K and 0 otherwise. The sensitive attribute is gender, with s = Female being the protected group.
2) Compas dataset [6]: The Compas ProPublica dataset includes attributes describing the sex, age, race, juvenile felony and misdemeanor counts, number of adult priors, and charge degree (felony or misdemeanor) of defendants in the state of Florida, USA. We filtered the dataset to restrict the race attribute (the protected attribute) to White and African-American defendants, as discussed in [5]. The final dataset consists of 4,483 records, and the two-year recidivism feature is used as the output label. In the Compas dataset, 26% of the data belong to the minority class.
3) Loan Data: This dataset is extracted from loan application data and is available at https://github.com/h2oai/app-consumer-loan. We preprocess the dataset, including removing null values and scaling, and the final cleaned dataset has a total of 5,000 records collected from 2007 to 2015. The main features include the loan amount, purpose, and gender. The gender feature is used to define the protected group. The task is to determine whether a loan application has a good or bad risk.
4) OULAD [28]: The Open University Learning Analytics Dataset (OULAD) contains information about students and their activities in a virtual learning environment (VLE) for 7 courses. The dataset contains information on 32,593 students characterized by 12 attributes (7 categorical, 2 binary, and 3 numerical attributes). The prediction target is the student's final result (pass or fail). We use the cleaned dataset with 21,562 instances obtained after removing missing values and rows with the final result "withdrawn". Gender is used as the protected attribute, and the male:female ratio is 11,568:9,994 (53.6%:46.4%).

B. Evaluation Criteria
The proposed method is evaluated using two kinds of evaluation criteria, i.e., the model performance and fairness.
1) Performance Measures: The model performance is measured by accuracy, F1 score, and GMeans. While accuracy is a standard evaluation protocol in machine learning, it can be misleading under an imbalanced class distribution, since accuracy may be high while being biased towards one class, i.e., the majority. Therefore, to measure the model performance on imbalanced datasets, we compute the F1 score and GMeans in addition to accuracy. All these measures are defined below.
Accuracy = (TP + TN) / (TP + TN + FP + FN),
F1 = 2TP / (2TP + FP + FN),
GMeans = sqrt( TP/(TP + FN) × TN/(TN + FP) ),
where TP (True Positive), FP (False Positive), TN (True Negative), and FN (False Negative) are defined as in the confusion matrix.
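A small helper, assuming binary labels with 1 as the positive class, shows how these measures follow directly from the confusion-matrix counts:

```python
import numpy as np

def performance_measures(y_true, y_pred):
    """Accuracy, F1, and GMeans from binary predictions (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    gmeans = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp))) if (tp + fn) and (tn + fp) else 0.0
    return accuracy, f1, gmeans
```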
2) Fairness Evaluation: The methods are compared based on the three fairness metrics defined below.
• Equal Opportunity Difference: It measures the difference in the true positive rate of a binary predictor Ŷ between the unprivileged and privileged groups defined by S and Y: E Opp = P(Ŷ = 1 | S = unprivileged, Y = 1) − P(Ŷ = 1 | S = privileged, Y = 1).
• Average Equalized Odds Difference: It computes the average of the differences in false positive rate (FP/(FP + TN)) and true positive rate (TP/(TP + FN)) between the protected and non-protected groups: E Odd = 1/2 [ (FPR_unprivileged − FPR_privileged) + (TPR_unprivileged − TPR_privileged) ].
• Statistical Parity Difference: Statistical parity rewards the classifier for classifying each group as positive at the same rate. The statistical parity difference of a binary predictor Ŷ is SP = P(Ŷ = 1 | S = unprivileged) − P(Ŷ = 1 | S = privileged).
For all fairness metrics, a value closer to zero indicates a small difference between the protected and non-protected groups, i.e., a fairer prediction. Negative values indicate that the model is biased against the unprivileged group, while positive values indicate the opposite.
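A minimal sketch of computing these three differences from predictions and a binary sensitive attribute (unprivileged minus privileged, following the sign convention above) is:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, s, unpriv=0, priv=1):
    """Equal opportunity, average equalized odds, and statistical parity differences."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))

    def rate(mask):                      # P(Y_hat = 1) over the masked subgroup
        return y_pred[mask].mean() if mask.any() else 0.0

    tpr_u = rate((s == unpriv) & (y_true == 1))
    tpr_p = rate((s == priv) & (y_true == 1))
    fpr_u = rate((s == unpriv) & (y_true == 0))
    fpr_p = rate((s == priv) & (y_true == 0))
    eq_opp = tpr_u - tpr_p
    eq_odds = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    stat_parity = rate(s == unpriv) - rate(s == priv)
    return eq_opp, eq_odds, stat_parity
```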

C. Baselines Methods
We compared our method with the following active learning and fair active learning methods.
• EADA [39]. EADA (Energy-based Active Domain Adaptation) is a recent active learning approach that queries groups of target data that incorporate both domain characteristics and instance uncertainty into every selection round.
• AL [29]. We use a standard active learning algorithm that selects the most uncertain samples, ranked by entropy, as the candidates to be labeled by a domain expert.
• BW [3]. The BW (Balance Weighting) method consists of two parts: first, an acquisition function that selects samples using a neural network pre-trained in the source domain; second, a balancing step added to the acquisition function to reduce the imbalance of the labeled subset.
• FAL [5]. Fair Active Learning (FAL) is the first approach that takes fairness into account for active learning. FAL focuses on developing active learning for group fairness, specifically improving demographic parity scores. FAL consists of two main components: an accuracy optimizer that selects the samples expected to reduce the misclassification error, and a decision approach that approximates each sample's expected unfairness reduction, i.e., the point that is expected to impart the largest reduction of the current model's unfairness after acquiring its label.
• ALOD-RE [7]. ALOD-RE (Active Learning with Outlier Detection Techniques and Resampling Methods) was proposed for active learning in imbalanced domains. This model combines active learning with outlier detection to select the most suitable samples to be labeled. ALOD-RE splits the query strategy by selecting 70% of the candidate samples by either uncertainty sampling or query-by-committee, while the remaining 30% are selected by outlier detection. It also adds a re-sampling method to balance the candidate samples before labeling.

D. Implementation Details
Standard data cleaning, including removing null values and normalization, is used to preprocess the datasets. For a fair comparison with the previous method [5], we use logistic regression as the classifier in the implementation. We divide each dataset into three disjoint sets: 10% training, 20% test, and 70% unlabeled data. We conduct experiments on all datasets using all baselines, and the final score is the average over ten runs. We use 180 as the batch size for sample selection because a real-lab experiment showed that humans were able to produce 180 labels per hour [16]. All experimental results are computed with this batch size.
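A sketch of this splitting protocol, assuming scikit-learn's train_test_split and constants that simply mirror the description above, could look as follows:

```python
from sklearn.model_selection import train_test_split

def make_splits(X, y, seed):
    """10% labeled training / 20% test / 70% unlabeled, as described above."""
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_train, X_unlab, y_train, y_unlab = train_test_split(
        X_rest, y_rest, train_size=0.125, random_state=seed)   # 0.125 * 0.8 = 10% of the data
    return (X_train, y_train), (X_test, y_test), (X_unlab, y_unlab)

# Scores are averaged over ten runs with the fixed batch size of 180.
BATCH_SIZE, N_RUNS = 180, 10
```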
The code for our implementation, along with other experimental material, is publicly available on GitHub.

VI. RESULTS
In this section, we discuss experimental results to demonstrate the fairness and accuracy of the proposed method as compared to baselines.We also study the impact of each component of the proposed method on its overall performance.

A. Fairness Evaluation
We first evaluate the fairness metrics, i.e., Statistical Parity (SP), Equal Opportunity (E Opp), and Equalized Odds (E Odd), for the different methods; the results are shown in Figure 2. We plot each model's accuracy against its unfairness to measure how fairness correlates with the model's performance. An unfairness value closer to 0 indicates a fairer model. The proposed FAL-CUR method provides a better trade-off between performance and fairness. In most cases (especially for the Adult, Loan, and OULAD datasets), FAL-CUR provides better accuracy and fairness than the state-of-the-art fair active learning methods. ALOD-RE is the latest fair active learning method and performs second-best or best in terms of fairness, though it provides much lower accuracy. FAL comes next among the fair models, followed by AL and EADA. On the Adult and Compas datasets, the performance of state-of-the-art AL methods, such as EADA and BW, is fairly close; however, in almost all cases EADA provides the highest accuracy among the baselines. The FAL-CUR method outperforms EADA in terms of all fairness metrics on all datasets, as well as in terms of accuracy on some of the datasets. We believe the key explanation for this result is that EADA focuses on selecting samples that are predicted to increase active learning performance; however, the chosen samples may be biased towards specific protected groups, reducing its fairness score. Furthermore, this indicates that the proposed model works well and is fair in its labeling decisions, which is supported by its fair sample selection.

B. Performance Evaluation
In this experiment, we compare how the performance of the FAL-CUR method changes with the number of queries, as shown in Figure 3. Previous work [5] showed that model performance drops when fairness is maintained, which we also observed in our experiments. In Figure 3, it is evident that unfair methods perform better than fair methods, although they have lower fairness performance. Therefore, with the proposed FAL-CUR method, we aim to improve fairness without greatly harming the model performance. From the figure, we can summarize that EADA, the latest state-of-the-art model, shows superior performance compared to the other models. However, its fairness score is lower than that of the other models, as shown in Figure 2. The proposed FAL-CUR method is able to maintain high fairness without losing much classification performance. For example, on the Compas dataset, the accuracy of FAL-CUR is only 0.02 points lower than that of EADA and the traditional AL method, while the fairness performance of FAL-CUR is the highest among the state-of-the-art models.

C. Ablation Study
To investigate the efficacy of the main components of the proposed FAL-CUR method, we conduct an extensive ablation study with several variants: (i) replacing the fair clustering with the original KMeans, (ii) FAL-CUR with representativeness only, by removing uncertainty sampling from the sample selection process, and (iii) FAL-CUR with uncertainty only, by removing representativeness sampling from the sample acquisition.
Figure 4 illustrates the results of the ablation study; it is clear that removing fair clustering from the method increases the performance but reduces the fairness. In contrast, selecting only uncertain samples from the fair clusters improves the fairness score but reduces the performance, while using only representative samples shows a stable GMeans performance with a slightly reduced fairness score. Therefore, these results justify the importance of all components, i.e., uncertainty and representativeness selection on top of fair clustering, which together provide a good balance between performance and fairness.
1) Analyzing the Impact of β: Finally, we investigate the effect of the β parameter on the performance and fairness results. The β value controls the balance between the representativeness and uncertainty scores; a higher β indicates that the model prefers representativeness over uncertainty. Table II illustrates the influence of the β parameter on the performance for the Compas dataset. From the table, one can infer that the performance, i.e., GMeans, increases when β is higher; however, the unfairness metrics also increase. For example, when β = 1, GMeans achieves the highest value of 0.97, but the unfairness score is also the highest, indicating that the model is less fair to the sensitive class. In contrast, when β = 0, the model shows an improved fairness score with a decrease in performance. Based on our extensive experimental analysis, we suggest using β = 0.6, as it provides a high fairness score while maintaining the model performance. It is also interesting to note that there is not much difference in fairness when β is set to less than 0.4, although GMeans shows a decreasing pattern. The ablation results on the other datasets are similar to those for the Compas dataset; therefore, we show the results only for the Compas dataset.

VII. CONCLUSION
Active learning has proven to be a valuable method for reducing the annotation cost in machine learning. However, many active learning models neglect the fact that the sample selection method might be unfair to certain protected groups. In this work, we proposed a fair active learning method, called FAL-CUR, that preserves fairness while maintaining model performance. The method is built on fair clustering and a balanced scoring mechanism that takes the representativeness and uncertainty of samples into account. The ablation analysis showed that a fair sample selection method is as important as fair clustering for achieving fairness in active learning. We further observed that, in sample selection, using only uncertainty sampling might increase the performance but reduce the fairness, while representative sampling behaves the opposite way. Therefore, the proposed scoring mechanism, based on uncertainty and representative sampling on top of fair clustering, achieves a better fairness score than other state-of-the-art models while maintaining high performance.
In the future, we would like to propose a fair active learning method that also considers the individual fairness of samples, as group fairness does not consider individual merits and might result in treating similar instances differently.

Fig. 1. Illustration of the FAL-CUR method compared to uncertainty and representativeness sampling. The uncertainty sampling method selects samples that the classifier is uncertain about, while representativeness sampling picks the samples that are representative of all samples inside the cluster. FAL-CUR uses a weighted score of uncertainty and representativeness.

Fig. 2. Fairness comparison of the FAL-CUR method with baseline methods on all datasets (fairness metrics against model accuracy).

Fig. 3. Comparison of the performance of FAL-CUR with baseline methods on multiple datasets (metrics against number of queries).

Fig. 4. Ablation study on different variants using only fair clustering, uncertainty, and representativeness on the Compas dataset.

TABLE II. β parameter experiments on the Compas dataset.