Dataset Bias Prediction for Few-Shot Image Classification

Abstract: Dataset bias is a significant obstacle that degrades image classification performance, especially in few-shot learning, where datasets have limited samples per class. However, few studies have focused on this issue. To address it, we propose a bias prediction network that recovers biases such as color from the extracted features of image data. If the network can easily recover the bias, the extracted features are likely to contain it. Therefore, the whole framework is trained to extract features from which the bias prediction network has difficulty recovering the bias. We evaluate our method by integrating it with several existing few-shot learning models across multiple benchmark datasets. The results show that the proposed network improves performance in different scenarios by reducing the negative effect of dataset bias. The proposed bias prediction model is easily compatible with other few-shot learning models and applicable to various real-world applications where biased samples are prevalent, such as VR/AR systems and computer vision applications.


Introduction
Few-shot learning, which trains a model with only a few training samples, is a challenging area of machine learning. Numerous models have been proposed for few-shot learning [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. Generally, in few-shot learning, it is essential to adapt quickly to new classes because there are a large number of classes to classify. To fulfill this requirement, many recent works have utilized meta-learning as an effective technique for training and classifying samples in few-shot learning [4][5][6][11][15].
Few-shot learning is an important technique for machine learning systems based on VR (Virtual Reality) and AR (Augmented Reality), as these systems often rely on limited or incomplete datasets. In this respect, few-shot learning methods allow for the rapid adaptation of the system to new classes or objects, without requiring large amounts of labeled data. However, one of the major challenges in few-shot learning is the problem of dataset bias. This occurs when the distribution of the data used for training the model is not representative of the real-world distribution of the data that the model will be applied to. As a result, the model may perform poorly on unseen data. This is a particularly acute problem in VR/AR applications, where the virtual environment may be vastly different from the real-world environment, leading to a significant mismatch between the training and testing data. Addressing this bias is critical to the success of few-shot learning in VR/AR applications, and requires careful selection and curation of training data, as well as robust evaluation metrics to ensure that the system is performing well on a wide range of inputs.
Few-shot image classification is a subtopic of few-shot learning. In this type of classification, only a few classes are provided for training, and each class has only a few image samples. For a classification method to be practical, it needs to be able to classify
Numerous studies have addressed class imbalance or bias mitigation in image classification tasks [17][18][19]. Traditional approaches, including those proposed in studies [20,21], have primarily targeted problems related to data bias. Several studies have also proposed few-shot image classification models [22][23][24]. However, to the best of our knowledge, only a handful of studies have addressed dataset bias within the framework of few-shot image classification. Our study presents a novel approach by proposing an add-on network structure, which incorporates a bias prediction (BP) network into existing few-shot learning models. The primary objective of this integration is to enhance the performance of these models by effectively mitigating biases. Moreover, the proposed architecture integrates with various existing few-shot learning models, making it flexible for handling various issues in few-shot learning tasks.
In supervised learning, having a larger number of training samples typically results in lower dataset bias. However, in few-shot image classification, where only five or fewer training samples are available per task, the likelihood of dataset bias in the training set significantly increases. This can lead to the model learning to incorporate the bias as important information, which can negatively impact its performance on new data. To alleviate this issue, the model requires an additional mechanism that can facilitate embedding of features that exclude the bias. To address this, we propose a bias prediction network, focusing particularly on color bias, as illustrated in Figure 1. Our approach can be extended to other types of bias that can be represented as predictive targets. We train the bias prediction network to recover the bias of the raw image, such as color, from the embedded features. If the bias prediction network is able to almost fully recover the color bias, the embedded features are assumed to be highly dependent on the color components of the raw samples, indicating the presence of color bias in the features. Conversely, if the model is trained to embed features that are difficult for the bias prediction network to recover from, the bias in the embedded features can be reduced. This can result in a performance improvement of few-shot image classification. It implies that incorporating the bias prediction network into few-shot learning task can contribute to a performance improvement in various real-world applications where biased datasets are prevalent, including VR/AR systems and computer vision applications. This approach will not only help improve model performance but will also ensure that the models are fair and unbiased in their decision making, providing more reliable and trustworthy results.
The major contributions of this study are summarized as follows:
• We present a novel approach for few-shot image classification that utilizes adversarial learning to train a bias prediction network. Since only a few samples are available for each class, our approach accounts for the presence of color bias in each label, and aims to minimize its impact on classification.
• The proposed network is compatible with other models and can be easily integrated with them.
• Our experiments demonstrate that incorporating the bias prediction network into few-shot learning models improves performance, indicating the potential of our proposed approach to enhance other few-shot learning tasks across various domains.

Few-Shot Learning
Few-shot learning (FSL) is a challenging research area that focuses on training models to learn new concepts or classes from only a few examples. One of the popular topics of FSL is few-shot image classification (FSIC), which classifies images using FSL methods.
Generally, FSIC is accomplished through meta-learning. The FSIC model is trained using a chain of training tasks, with each task containing only a few data samples. The FSIC problem is known as N-way K-shot problem, where N represents the number of classes (labels), and K represents the number of data samples from each class. Each training task consists of a support set and a query set. The support set is used to learn to classify accurately and is formed by randomly selecting N classes and K data samples from each class, resulting in an N × K support set. In addition, to form the query set, some data points are randomly selected from the N classes, but none of these points should be identical to any data in the support set. The query set is then used to evaluate the performance of the model on the task.
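The construction of a single N-way K-shot task described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `sample_episode`, the `n_query` parameter, and the toy class and image identifiers are assumptions made for the example, not part of any referenced model.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task: a support set and a disjoint query set."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw K + n_query distinct samples so support and query never overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy data: 10 hypothetical classes with 20 sample ids each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

Drawing the support and query samples of a class in a single call without replacement is one simple way to guarantee that no query sample is identical to a support sample.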
During the training process, the FSIC model extracts features from the support set and generates classifiers. The model evaluates its performance on the query set and updates its parameters accordingly. In the next task, the model randomly selects N new classes, and the evaluation and updating process is repeated. Finally, the model is tested on data from classes that were not included in the training set. Despite not having learned these new classes before, the model can accurately classify the data.
The FSIC models can be categorized into two groups: distance-based methods and graph-based methods. Distance-based methods compare two feature vectors using metrics such as the L1 distance. One example is the Siamese network [1], which extracts feature vectors from pairs of images randomly selected from the support set and then compares them using a trainable L1 distance. On the other hand, matching networks [25] learn the distance function between the support vectors and the query vector. Prototypical networks [2] embed images to extract feature vectors and calculate the prototype vector for each class. Then, when a feature vector is extracted from a query image, the image is classified as belonging to the class whose prototype vector is the closest.
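As an illustration of the distance-based idea, the prototype step can be sketched in a few lines. This is a simplified sketch, assuming that feature vectors have already been extracted and that plain Euclidean distance is used; the helper names `prototypes` and `classify` are hypothetical.

```python
import math

def prototypes(support):
    """Mean feature vector per class from (feature, label) pairs."""
    sums, counts = {}, {}
    for feat, label in support:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for d, v in enumerate(feat):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(query_feat, protos):
    """Return the label of the nearest prototype (Euclidean distance)."""
    return min(protos, key=lambda lab: math.dist(query_feat, protos[lab]))

# Two classes with two 2-D support features each (toy values).
support = [([0.0, 0.0], "a"), ([0.0, 2.0], "a"),
           ([4.0, 4.0], "b"), ([6.0, 4.0], "b")]
protos = prototypes(support)
pred = classify([1.0, 1.5], protos)
```

Here the prototype of class "a" is the mean of its two support features, and the query is assigned to "a" because that prototype is the closest.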
Graph-based methods, such as the Graph Neural Network (GNN) model [26], represent each image as a node, with edges connecting nodes based on the similarities between their corresponding feature vectors. The edge weights are employed to compute the weighted average vector of the neighboring nodes along with their respective features. This weighted average vector is then aggregated with the node feature vector. EGNN [27] also utilizes the GNN architectures, where each edge between two nodes has the value of the similarity between the two nodes, and predicts whether the two images belong to the same class. Recently, methods focusing on classifiers have also been studied. MetaOptNet [6] utilizes linear classifiers trained using a linear support vector machine (SVM).

Bias Prediction
Dataset bias refers to a situation where the data are not a true representation of the real-world situation or phenomenon that the model is intended to learn from, which can lead to biased predictions or incorrect inferences. Dataset bias can occur for many reasons, such as the way the data were collected, the characteristics of the study population, or the limitations of the measurement tools. For example, a facial analysis dataset that is largely composed of lighter-skinned subjects may cause errors when analyzing darker-skinned subjects [28]. It is important to identify and minimize dataset bias to ensure that machine learning models are fair, accurate, and reliable.
Bias prediction is a method used to predict bias in a dataset and mitigate the impact of bias in a dataset. One approach to bias prediction utilized generative adversarial networks [29], as demonstrated in a previous study [30]. If dataset bias exists, the labels can be predicted based on the biased samples. This means that the mutual information between the bias of the training samples and labels will be high, indicating that the corresponding labels are closely related to the dataset bias. For instance, if the dataset bias is a biased color distribution (e.g., human face skins in samples are all white or black), we can calculate the entropy of the embedded features from the feature embedding networks. If the bias prediction results reveal a clear color distribution of the training samples, the entropy will be low. Conversely, if a clear color distribution cannot be found, the entropy will be high. Therefore, the networks are trained to maximize entropy resulting in reducing the effects of color bias. Another approach, Just Train Twice (JTT) [20], is also a straightforward method that mitigates the dataset bias. The algorithm of JTT consists of two stages: identification and upweighting. In the first stage, the framework collects misclassified data from its identification model and makes an error set. Then, in the second stage, the framework upweights the error set and trains the final model using the training data with the upweighted error set. Data augmentation could be a solution for dataset bias. The method proposed in [21] attempts to weaken the correlations among two or more attributes. For example, in the human face dataset, a person wearing a hat tends to also wear glasses. Thus, the method generates images with persons only wearing either a hat or glasses. In this way, the dataset bias in the attributes can be mitigated.
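The two JTT stages described above (identification and upweighting) can be illustrated with a small sketch. It assumes that the predictions of the identification model are already available; the function name `jtt_weights`, the `upweight` factor, and the toy labels are hypothetical and serve only to show the bookkeeping.

```python
def jtt_weights(labels, id_predictions, upweight):
    """Stage 1: collect the error set. Stage 2: build per-sample training weights."""
    error_set = [i for i, (y, p) in enumerate(zip(labels, id_predictions))
                 if y != p]
    weights = [1.0] * len(labels)
    for i in error_set:
        weights[i] = float(upweight)   # misclassified samples count more
    return error_set, weights

labels = [0, 1, 1, 0, 1]
id_predictions = [0, 1, 0, 0, 0]       # hypothetical identification-model output
error_set, weights = jtt_weights(labels, id_predictions, upweight=5)
```

The final model would then be trained on all samples with these weights, so that the samples the identification model got wrong have a larger influence on the loss.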
Overall, the few-shot image classification model is able to classify unseen classes, making it beneficial in many practical scenarios where acquiring a large amount of labeled data is not feasible. However, the common issue of dataset bias is required to be addressed to ensure generalization to diverse data domains. Our approach aims to improve the performance of few-shot classification tasks, despite the presence of biased samples in the dataset.

The Bias Prediction Network
Consider a given training set of pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ represents the i-th image in the set, and $y_i \in Y$ is the corresponding label associated with the image. Few-shot image classification models typically consist of two deep neural networks: a feature embedding network and a classifier. The feature embedding network, denoted as $\phi_f$, extracts features from the raw image. The classifier, denoted as $\phi_c$, takes the extracted features as input and outputs the classification result. The output of the feature embedding network for a raw image $x$ is $\phi_f(x)$.
Here, we detail the incorporation of our novel bias prediction network into the original networks $\phi_f$ and $\phi_c$, as well as its role within the overall model. We note that the original networks $\phi_f$ and $\phi_c$ represent the few-shot image classification model. The overall architecture is illustrated in Figure 2, and the pseudo code is described in Algorithm 1. Our proposed bias prediction network, denoted as $\phi_b$, is designed to take the embedded features generated by $\phi_f$ as input. The output from the bias prediction network, $\phi_b(\phi_f(x))$, is subsequently passed through a softmax function. We define the output before the softmax function as $Z$, and the output after the softmax as $\sigma(Z)$.

Algorithm 1: Networks optimization with the bias prediction network
Input: Training image $x_i$ and the corresponding label $y_i \in Y$
Output: Optimized weights of the feature extraction network $\phi_f$, the classification network $\phi_c$, and the bias prediction network $\phi_b$
    Calculate the classification loss from the classifier $\phi_c(\phi_f(x))$: $L_{class}$
    Output the bias prediction network result: $Z \leftarrow \phi_b(\phi_f(x))$
    Calculate the total loss using Equation (1): $L_{total} \leftarrow L_{class} - \lambda H(\sigma(Z))$
    Update $\phi_f$ and $\phi_c$ by minimizing $L_{total}$
    Extract the true color labels from $x_i$: $C$
    Calculate the bias prediction loss using Equation (2): $L_{bias}$
    Update $\phi_b$ by minimizing $L_{bias}$

We are particularly focused on color bias and prioritize the independence between the color distribution of the input and the corresponding features. Therefore, the bias prediction network is designed to evaluate the dependency between the color distribution and the features. This evaluation is achieved by computing the cross-entropy between the color distribution and the output of the bias prediction network, which is described in Section 3.3. If the cross-entropy is low, the distribution and the output are correlated, which means that the distribution and the features are dependent. On the other hand, if the cross-entropy is high, it can be assumed that the distribution and the features are barely dependent.
Although we utilize the bias prediction network architecture proposed in [30], we specifically target color bias within the context of few-shot image classification. Our network predicts color bias and is designed to be compatible with various existing few-shot image classification models. This aligns with our belief that color distribution in a dataset, including attributes such as skin or hair color, represents a prevalent form of dataset bias.
In scenarios where the attribute we aim to classify has minimal connection with colors, the features extracted by $\phi_f$ should show no correlation with colors. As such, our bias prediction network is tasked with not only predicting but also alleviating this color bias.
The algorithm introduces two main loss functions: the total loss $L_{total}$ and the bias prediction loss $L_{bias}$. The classification loss, denoted as $L_{class}$, is computed from the classifier $\phi_c(\phi_f(x))$, which signifies the result of few-shot image classification, and is used to compute the total loss $L_{total}$. First, we calculate $L_{total}$ and update the original model ($\phi_f$ and $\phi_c$). Subsequently, we calculate $L_{bias}$ and update $\phi_b$. These two updates alternate during the training procedure.
Figure 2. The overall architecture consists of a feature embedding network $\phi_f$, a classifier $\phi_c$, and a bias prediction network $\phi_b$ that follows $\phi_f$. We note that $\phi_f$ represents a CNN architecture, and the features embedded by $\phi_f$ are fed as input to the bias prediction network, which outputs the bias prediction result. Based on the result, the bias prediction loss $L_{bias}$ is obtained by computing the cross-entropy between the resized raw image and the bias prediction result. Meanwhile, the classification loss $L_{class}$ is calculated by the original model and is used to compute the total loss $L_{total}$. The image samples used in this architecture are from the Adience dataset.

The Total Loss
The total loss function is defined as:
$$L_{total} = L_{class} - \lambda H(\sigma(Z)), \tag{1}$$
where $L_{class}$ is the classification loss, $H(\cdot)$ is the entropy function, and $\lambda$ is the hyperparameter for entropy regularization. We note that the classification loss, $L_{class}$, is pre-defined by the original few-shot classification model and is typically represented as a cross-entropy function. The goal of this loss function is to encourage the feature embedding network $\phi_f$ to generate features that are less dependent on the color distribution of the input image. Specifically, the output of the bias prediction network $\phi_b$, denoted as $\sigma(Z)$, represents the color labels of a resized image that the network recovers from the embedded features. If the entropy of the output $\sigma(Z)$ is low, the bias prediction network can easily predict the colors of the resized image from the embedded features, indicating that the features depend strongly on the color distribution of the input image. By maximizing the entropy of the output of the bias prediction network, $H(\sigma(Z))$, we encourage the feature embedding network to generate features that are less dependent on the color distribution of the input image, resulting in a more robust and generalizable model.
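To make the effect of the entropy term concrete, the following sketch evaluates the total loss for a single vector of color-bin logits. It is only a worked example under simplifying assumptions: a fixed classification loss of 1.0, four bins instead of eight, and illustrative values for the logits and for λ (`lam`); none of these values come from the experiments.

```python
import math

def softmax(z):
    m = max(z)                               # subtract the max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

def total_loss(class_loss, z, lam):
    """L_total = L_class - lam * H(softmax(z)) for one vector of bin logits."""
    return class_loss - lam * entropy(softmax(z))

# The bias prediction network is sure of the color bin: low entropy.
confident = total_loss(1.0, [8.0, 0.0, 0.0, 0.0], lam=0.5)
# The bias prediction network cannot tell the bins apart: maximal entropy.
uncertain = total_loss(1.0, [0.0, 0.0, 0.0, 0.0], lam=0.5)
```

With the same classification loss, the uniform prediction yields the smaller total loss, which is why minimizing the total loss favors features from which the colors are hard to recover.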

The Bias Prediction Loss
Predicting dataset bias is critical when the features of a dataset fail to accurately represent the corresponding class labels. An example is a dataset where most cats have white fur and most dogs have black fur. In such cases, the model's performance would be poor when classifying cats and dogs with other fur colors, because fur color alone is not sufficient to represent cats and dogs. Consequently, when the features of a dataset are not representative of the classes, the model can produce incorrect and unreliable predictions.
In this study, we aim to predict color bias in raw images. Due to the limited number of training samples, it is challenging to represent all possible colors for each object. Therefore, we assume that a color bias exists for each label and design the bias prediction network $\phi_b$ to recover the color components of the input image from the embedded features. To evaluate the performance of the bias prediction network, we compare the color labels of the resized raw image and the output $\sigma(Z)$ of the network. The bias prediction loss is defined using a cross-entropy function:
$$L_{bias} = H(C, \sigma(Z)), \tag{2}$$
where $H(\cdot, \cdot)$ denotes a cross-entropy function and $C$ is a matrix of true color labels. Figure 3 presents the detailed architecture of the bias prediction network and how the network is trained with respect to the color bias. Assuming that the output of the feature embedding network is 128 × 21 × 21, the features are passed through two CNN layers to obtain the bias prediction results with a size of 8 × 21 × 21. To compare the predicted color bias with the true color labels, the sizes of both must match. Consequently, the raw image is resized so that both its height and its width are 21 pixels. To accommodate the range of values of the red, green, and blue channels, a color binning technique is applied to split the range into eight equal intervals ([0, 31], [32, 63], . . . , [224, 255]). The compressed data of the raw images are represented by $C$.
Figure 3. The detailed architecture of the bias prediction network $\phi_b$ for predicting color bias. First, a raw input image is resized and split into three color channels. Next, the pixel values of each channel are grouped into eight intervals to match the size of the output of the bias prediction network. Then, the bias prediction loss is calculated by computing the cross-entropy between the output generated by the network and the true color labels of the resized image.
A higher bias prediction loss indicates that the bias prediction network is less successful in predicting the color bias from the embedded features. Together with the entropy term in Equation (1), the bias prediction loss encourages the model to learn features that are less dependent on the color distribution of the input image. The loss in Equation (2) is calculated using the following equation:
$$L_{bias} = \frac{1}{3}\left(L^{R}_{bias} + L^{G}_{bias} + L^{B}_{bias}\right), \qquad L^{(\cdot)}_{bias} = -\sum_{i=1}^{8}\sum_{j=1}^{21}\sum_{k=1}^{21} c_{ijk} \log \sigma(z_{ijk}),$$
where $L^{R}_{bias}$, $L^{G}_{bias}$, and $L^{B}_{bias}$ represent the color bias prediction losses for the red, green, and blue channels, respectively, and $c_{ijk}$ and $z_{ijk}$ ($i = 1, \ldots, 8$, $j = 1, \ldots, 21$, $k = 1, \ldots, 21$) are elements of $C$ and $Z$, respectively. The bias prediction loss is calculated as the average of the losses across the three color channels. Each of these losses is determined by the cross-entropy between $C$ for each respective channel and $Z$.
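The color binning and the channel-averaged cross-entropy can be sketched as follows. This is a simplified illustration under stated assumptions: a 2 × 2 toy image replaces the 21 × 21 resized image, the per-pixel losses are summed without any normalization, and each channel is given its own logits argument (passing the same logits three times corresponds to a shared $Z$). The helper names are hypothetical.

```python
import math

NUM_BINS = 8
BIN_WIDTH = 256 // NUM_BINS  # 32, giving [0, 31], [32, 63], ..., [224, 255]

def color_bin(value):
    """Map an 8-bit channel value to its bin index 0..7."""
    return value // BIN_WIDTH

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def channel_loss(channel, logits):
    """Cross-entropy between binned pixels and bin logits, summed over pixels.

    channel: H x W grid of ints in 0..255; logits: H x W grid of 8-element lists.
    """
    loss = 0.0
    for row_px, row_z in zip(channel, logits):
        for px, z in zip(row_px, row_z):
            # The true label is one-hot, so only the true bin contributes.
            loss -= log_softmax(z)[color_bin(px)]
    return loss

def bias_loss(rgb_channels, rgb_logits):
    """Average of the red, green, and blue channel losses."""
    per_channel = [channel_loss(c, z) for c, z in zip(rgb_channels, rgb_logits)]
    return sum(per_channel) / 3.0

# Toy 2 x 2 channel with uniform (all-zero) logits for every pixel.
channel = [[0, 31], [32, 255]]
uniform = [[[0.0] * NUM_BINS for _ in range(2)] for _ in range(2)]
loss = bias_loss([channel] * 3, [uniform] * 3)
```

With uniform logits, every pixel contributes log 8 to the loss, which is the value obtained when the network has no information about the colors.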

Training Procedure
The original networks $\phi_f$ and $\phi_c$ are trained to minimize $L_{total}$, whereas the additional bias prediction network $\phi_b$ is trained to minimize $L_{bias}$. The classification loss $L_{class}$ depends on the loss function of the original networks, while $H(\sigma(Z))$ is related to $\phi_b$. Meanwhile, $L_{bias}$ is evaluated when $\phi_b$ predicts the color labels of the raw image from the embedded features.
In Equation (1), $L_{total}$ becomes smaller if the entropy $H(\sigma(Z))$ is higher. Therefore, $\phi_f$ attempts to embed features from which $\phi_b$ can hardly infer the color components of the raw image. After $\phi_f$ is updated, $\phi_b$ tries to recover the color components of the raw image from the embedded features to minimize $L_{bias}$. Subsequently, $\phi_b$ is updated. This procedure is repeated for another training sample. The corresponding pseudo code is shown in Algorithm 1.
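The alternating updates can be illustrated with a deliberately tiny toy model. This is a sketch under strong assumptions, not the networks used in this work: $\phi_f$ is a two-parameter linear map to a scalar feature, $\phi_c$ and $\phi_b$ are logistic units, the color has only two bins, gradients are approximated by finite differences, and updates use the whole toy batch rather than one sample at a time. All names, data values, and hyperparameters are hypothetical.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def bce(target, p):
    p = min(max(p, 1e-9), 1.0 - 1e-9)          # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def losses(main, adv, data, lam):
    """Return (L_total, L_bias) averaged over the data."""
    a, b, w_c, b_c = main                      # phi_f: (a, b), phi_c: (w_c, b_c)
    w_b, b_b = adv                             # phi_b
    l_class = l_bias = ent = 0.0
    for x_color, x_shape, color, label in data:
        f = a * x_color + b * x_shape          # embedded (scalar) feature
        p = sigmoid(w_c * f + b_c)             # class probability
        q = sigmoid(w_b * f + b_b)             # predicted color probability
        l_class += bce(label, p)
        l_bias += bce(color, q)
        ent += bce(q, q)                       # binary entropy H(q)
    n = len(data)
    return (l_class - lam * ent) / n, l_bias / n

def num_grad(fn, params, eps=1e-5):
    """Central finite-difference gradient of fn at params."""
    grad = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad

def train(data, lam=1.0, lr=0.1, steps=200):
    main, adv = [0.5, 0.5, 1.0, 0.0], [1.0, 0.0]
    for _ in range(steps):
        # Step 1: update phi_f and phi_c by minimizing L_total (phi_b fixed).
        g = num_grad(lambda m: losses(m, adv, data, lam)[0], main)
        main = [p - lr * d for p, d in zip(main, g)]
        # Step 2: update phi_b by minimizing L_bias (phi_f and phi_c fixed).
        g = num_grad(lambda v: losses(main, v, data, lam)[1], adv)
        adv = [p - lr * d for p, d in zip(adv, g)]
    return main, adv

# Fully color-biased toy data: (x_color, x_shape, color_label, class_label).
data = [(1.0, 0.9, 1, 1), (1.0, 0.7, 1, 1), (0.0, -0.8, 0, 0), (0.0, -0.6, 0, 0)]
main, adv = train(data)
```

The point of the sketch is the ordering of the two updates inside the loop; in practice the gradients would come from automatic differentiation in a deep learning framework.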

Experimental Setup
Our study aims to enhance the performance of existing few-shot classification models by mitigating color bias. To demonstrate the effectiveness of our proposed bias prediction (BP) network, we compare the performances of multiple existing models both with and without integrating the BP network. We evaluated the performance of existing few-shot classification models, such as EGNN [27], MetaOptNet [6], DeepEMD [9], and SetFeat [31], using four benchmark datasets as follows:
• miniImageNet [25] is the most general few-shot learning dataset. It is derived from the ILSVRC-12 dataset [32]. All images have a size of 84 × 84 pixels. The dataset has 100 classes, and each class has 600 image samples.
• CIFAR100 Few-Shots (CIFAR-FS) [33] is randomly split from the CIFAR-100 [34] dataset. It consists of 64 training classes, 16 validation classes, and 20 test classes. Each class contains 600 images, and each image has a resolution of 32 × 32 pixels.
• Fewshot-CIFAR100 (FC-100) [35] is another split of CIFAR-100 for few-shot learning. It contains 12 categories for training, 4 categories for validation, and 4 categories for testing. These contain 60, 20, and 20 low-level classes, respectively, and each class has 600 images of size 32 × 32 pixels.
• Adience [16] contains about 20,000 human face images with various genders, ages, and races. All images are aligned and have a size of 84 × 84 pixels. To perform more difficult classification tasks, we divide the data into two or four age groups: Infant (approximately 0-2 years old), Juvenile (8-12 years old), Young (25-32 years old), and Old (60-100 years old).

Effectiveness of BP on Multiple Few-Shot Learning Models and Datasets
In this section, we present the experimental results obtained by applying the BP network to the existing few-shot classification models mentioned above. We evaluated the effectiveness of the bias prediction network across multiple benchmark datasets, including miniImageNet, CIFAR-FS, FC-100, and Adience, employing five-way one-shot and five-way five-shot learning settings.
The performance (accuracy) of each dataset and model was evaluated with and without the integration of the bias prediction network. Results presented in Tables 1-3 demonstrate that the bias prediction network effectively enhances the performance of most original few-shot learning models in our study. Overall, our study highlights the potential of the bias prediction network as a tool for improving the performance of few-shot classification models where biased samples are prevalent. The results show the importance of considering and addressing bias in the development of few-shot learning models to ensure fairness and accuracy in their predictions.
To confirm the statistical significance of the performance improvements, we performed a paired t-test comparing each model's performance with and without the integration of the BP network. We found that the integration of the BP network improved the performance of few-shot image classification models, yielding statistically significant results (p-value < 0.05). However, we observed an exception with the EGNN model on the miniImageNet dataset in the five-way five-shot learning scenario, where the improvement was not statistically significant (p-value = 0.1872). Despite this, in five-way one-shot learning on the same dataset, and across all learning tasks on the other datasets, the EGNN model showed significant improvements with the integration of the BP network. Therefore, we can conclude that the integration of the BP network generally enhances the performance of few-shot image classification models, including EGNN, under varying conditions and datasets.
Table 3. Performance comparison of few-shot classification models with and without the bias prediction network on the FC-100 dataset.
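The paired t-test mentioned above can be sketched with the Python standard library. The accuracy values below are hypothetical placeholders chosen only to show the computation; they are not results from our experiments, and the function name `paired_t_statistic` is an assumption of this example.

```python
import math
import statistics

def paired_t_statistic(with_bp, without_bp):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(with_bp, without_bp)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)              # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n)), n - 1

# Hypothetical accuracies (%) from five paired runs; not results from the paper.
with_bp = [62.0, 63.0, 61.0, 64.0, 65.0]
without_bp = [61.0, 61.0, 60.0, 62.0, 63.0]
t_stat, dof = paired_t_statistic(with_bp, without_bp)
```

For these placeholder values the statistic is about 6.53 with 4 degrees of freedom, which exceeds the two-sided 5% critical value of 2.776. A p-value can be obtained from the t distribution, for example with `scipy.stats.ttest_rel` if SciPy is available.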

Effectiveness of BP on Skin Color Biased Dataset
We conducted an additional experiment using the Adience dataset to evaluate the effectiveness of our bias prediction network when the training and test sets were biased by the color of human skin. As illustrated in Figure 1, the training set comprised young white individuals (white circles) and old black individuals (black circles), while the test set consisted of young black individuals (black squares) and old white individuals (white squares). The task is a two-way five-shot learning problem for classifying young (approximately 25-32 years old) vs. old (approximately 60-100 years old) groups using EGNN as the few-shot learning model. The left part of Figure 4 presents the results, which demonstrate that integrating the bias prediction network enhances the performance of the EGNN model. To further investigate the effectiveness of the bias prediction network with more than two classes, we tested with two additional biased classes: Juvenile (approximately 8-12 years old) and Infant (approximately 0-2 years old). We trained the model with the black-skinned Juvenile class and the white-skinned Infant class and tested it with the white-skinned Juvenile class and the black-skinned Infant class, resulting in a four-way five-shot learning task. The right part of Figure 4 shows the results, demonstrating the effectiveness of the bias prediction network in enhancing the original EGNN model's performance.

Effectiveness of BP on Color-Filtered Datasets
In this experiment, we demonstrated the effectiveness of the bias prediction network on different color-filtered versions of the miniImageNet dataset. We generated grayscale, red channel, green channel, and blue channel versions of the original dataset, as shown in Figure 5a, and used them during the training stage. The test set consisted of samples from the original dataset, and we evaluated the image classification results on each color-filtered dataset. We conducted few-shot image classification using MetaOptNet as the base model, under the five-way five-shot learning task. The results, presented in Figure 5b, demonstrate that the proposed model with the bias prediction network improved on the performance of the model without it (λ = 0) on the red, green, and blue datasets. However, the bias prediction network did not improve the performance on the grayscale dataset, since the color bias of the dataset is significantly reduced when images are converted to grayscale and all channels have the same value. In most cases, our experiments demonstrated that integrating the bias prediction network resulted in improved performance for various few-shot learning models across multiple datasets and scenarios. Overall, our results showed the positive impact of the bias prediction network in enhancing the performance of few-shot classification models where biased samples are present.
Figure 5. (a) Examples of the color-filtered miniImageNet dataset. (b) Results on each color-filtered dataset; the horizontal axis refers to Equation (1), while the vertical axis represents the improvement in performance by the bias prediction network.
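One possible way to produce such color-filtered versions is sketched below. The exact filtering used to build the datasets is not specified here, so this example makes two assumptions that should be read as such: grayscale is taken as the plain mean of the three channels, and a single-channel version keeps one channel while setting the other two to zero. The function name `filter_image` is hypothetical.

```python
def filter_image(image, mode):
    """Return a color-filtered copy of an H x W grid of (r, g, b) tuples."""
    keep = {"red": 0, "green": 1, "blue": 2}
    out = []
    for row in image:
        new_row = []
        for pixel in row:
            if mode == "gray":
                g = round(sum(pixel) / 3)      # plain channel mean (assumption)
                new_row.append((g, g, g))
            else:
                idx = keep[mode]               # zero every channel except one
                new_row.append(tuple(v if c == idx else 0
                                     for c, v in enumerate(pixel)))
        out.append(new_row)
    return out

image = [[(255, 0, 0), (30, 60, 90)]]
gray = filter_image(image, "gray")
red = filter_image(image, "red")
```

In the grayscale version all three channels carry the same value, which is consistent with the observation that little color bias remains for the bias prediction network to remove.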

Impact of Different Dataset Sizes
In this experiment, we investigated the impact of dataset size on the effectiveness of the bias prediction network in few-shot learning. Since smaller datasets tend to exhibit larger biases, we evaluated the performance of the bias prediction network on the miniImageNet dataset with varying numbers of samples per class: 600, 300, and 100. We experimented with the five-way five-shot learning task and evaluated the performance of the EGNN model with and without the bias prediction network. The results presented in Figure 6 indicate that the performance improvement achieved by the bias prediction network was largest when the number of samples per class was smallest (100 samples). Our findings suggest that datasets with fewer samples per class have a higher likelihood of exhibiting color bias, and that the integration of the bias prediction network can be especially effective in mitigating such bias.
Figure 6. Performance of the bias prediction network with varying numbers of samples per class in the miniImageNet dataset. We evaluated the prediction performance using EGNN. The horizontal axis represents the number of samples in each class. The vertical axis represents accuracy with and without the bias prediction network. The results indicate that the bias prediction network had the greatest impact on datasets with a smaller number of samples per class.

Conclusions
In this work, we addressed the challenge of reducing bias in few-shot learning. To tackle this problem, we proposed a bias prediction network for few-shot image classification, focusing on color bias. Our approach utilizes adversarial learning: the feature embedding network generates features from input images, which are then fed into the bias prediction network to recover the color labels of the original image. If the training set is color biased, the feature embeddings are likely to be highly dependent on the color values of the training samples, making it easy for the bias prediction network to recover the colors of the original image. Accordingly, we introduced a loss function that encourages the feature embedding network to produce embeddings that are less dependent on the color values. Our experimental results demonstrate that the proposed bias prediction network is effective in improving the performance of various existing few-shot learning models across multiple benchmark datasets. The findings suggest that the proposed model has the potential to enhance other few-shot learning tasks across various domains where the number of samples is limited and biased datasets are prevalent.