HANDLING SEVERE DATA IMBALANCE IN CHEST X-RAY IMAGE CLASSIFICATION WITH TRANSFER LEARNING USING SWAV SELF-SUPERVISED PRE-TRAINING

Abstract. Ever since the COVID-19 outbreak, numerous researchers have attempted to train accurate Deep Learning (DL) models, especially Convolutional Neural Networks (CNN), to assist medical personnel in diagnosing COVID-19 infections from Chest X-Ray (CXR) images. However, data imbalance and small dataset sizes have been persistent issues in training DL models for medical image classification tasks, and while most researchers have focused on complex novel methods, few have explored this problem. In this research, we demonstrated how Self-Supervised Learning (SSL) can assist DL models during pre-training.


INTRODUCTION
Ever since its outbreak in December 2019, the coronavirus disease 2019 (COVID-19) has become a burden for humanity due to the massive global incidence of the disease, perpetuated by its contagious and rapid spread. The symptoms of COVID-19 depend on the host and the variant. According to the U.S. Centers for Disease Control and Prevention (CDC) in 2020, people who developed noticeable symptoms mainly experienced mild or moderate symptoms (81%), while the rest developed severe (14%) and critical (5%) symptoms [1]. The moderate symptoms were often followed by mild pneumonia, while severe and critical symptoms commonly involved hypoxia, dyspnea, respiratory failure, or multiorgan dysfunction. Chest X-Ray imaging also showed more than 50% lung involvement [2]. To detect infections, the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test has been widely considered the standard to confirm COVID-19 infection using samples from nasopharyngeal swabs [3].
Aside from RT-PCR, imaging detection can also be utilized for COVID-19 screening, such as chest X-Ray (CXR) imaging and computed tomography [4,5]. In most cases, the CXR-based diagnosis of pneumonia caused by COVID-19 was determined based on the presence of rounded morphology with ground-glass opacities in the chest imaging, which is characterized by bilateral peripheral distribution associated with crazy-paving patterns [6]. In some peak stages of COVID-19 infection, architectural distortion marked with subpleural bands also occurred [7]. Those features are typically used to differentiate cases of pneumonia caused by COVID-19 from those caused by other factors. Using those features, radiologists can manually interpret the images to annotate the presence of COVID-19 pneumonia. As manual diagnoses are not efficient, researchers across the globe have attempted to implement computer-aided COVID-19 diagnosis using Machine Learning (ML), especially the Deep Learning (DL) approach using Convolutional Neural Networks (CNN), which can increase the efficiency and accuracy of early diagnoses and assist radiologists in rapidly finding suspicious patterns in lung images [8].
Supervised learning is the most straightforward DL approach for early detection of COVID-19 from CXR images, training DL models on labeled images [9]. The models are then evaluated on an 'unseen' test set with the assumption that both the training and test sets are obtained from the same data distribution. However, DL requires enormous volumes of data to be accurate [10,11] and to minimize the risk of overfitting, which means that it may be ineffective when the amount of available data is insufficient. Moreover, training the models from scratch for a single specific task is laborious, which prompts researchers to deploy the Transfer Learning (TL) approach. For tasks related to medical images, the approach utilizes the learned knowledge of models pre-trained on a large computer vision dataset such as ImageNet and repurposes that knowledge for another computer vision task [12]. This transferred knowledge can then be fine-tuned using zero, partial, or full network adaptation [13]. TL has been extensively implemented in many research publications on CXR-based COVID-19 classification. For example, in our previous work, we developed a CXR-based classification model using a DenseNet-121 backbone fine-tuned to classify lung diseases by combining two public CXR datasets, and obtained AUC scores above 80% for different task configurations [14].
Another existing challenge in training DL models on medical datasets is the imbalanced condition of the datasets [15]. As high-risk patients may be rarer, it is common for severe imbalance to occur, which influences the performance of ML models: the models' predictions may lean towards the classes with more samples [16]. In most studies related to DL for analyzing CXR images, researchers typically preferred either augmenting the minority classes as a method of oversampling or performing Random Undersampling (RUS) [17,18,19].
However, both have their own weaknesses. RUS may remove valuable data randomly and cause sampling bias. In addition, the reduced amount of data may be insufficient to train DL models. On the other hand, oversampling consumes more training time and may cause the models to overfit if the augmented samples are too similar [20]. In other words, when the imbalance is too severe and too many augmented images are generated, the model may focus too much on specific features of the images from the minority class and fail to capture the relevant general features, resulting in accuracy degradation when evaluated on unseen samples of the minority class.
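A lightweight alternative to resampling, and the mechanism this paper's experiments rely on for the imbalanced scenario, is class weighting: scaling each class's contribution to the loss inversely to its sample count. A minimal PyTorch sketch, using the class counts of the dataset described in Section 3 (the helper name is ours):

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(class_counts):
    """Weight each class inversely to its sample count; a perfectly balanced
    dataset would yield a weight of 1.0 for every class. Rare classes
    (e.g. viral pneumonia) get larger weights, so their errors contribute
    more to the loss."""
    counts = torch.tensor(class_counts, dtype=torch.float32)
    return counts.sum() / (len(counts) * counts)

# Counts from the COVID-19 Radiography Database:
# COVID, normal, lung opacity, viral pneumonia.
weights = inverse_frequency_weights([3616, 10192, 6012, 1345])
criterion = nn.CrossEntropyLoss(weight=weights)
```

Unlike oversampling, this adds no training time and generates no near-duplicate images, though it cannot add information about the minority classes either.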
More recent studies have delved into several pre-training algorithms for TL. Self-Supervised Learning (SSL), and the Swapping Assignments between Views (SwAV) algorithm in particular, has been conspicuous, proving that it can allow CNN models to attain better accuracy on downstream tasks [21,22]. This means that performing TL using models pre-trained with SwAV can yield better results, implying that they are more capable of determining detailed distinguishing features of objects. A previous study demonstrated this by pre-training a CNN on a CXR dataset using SwAV and performing TL to train it on another CXR dataset, showing that this method outperformed regular TL models pre-trained in a supervised manner [23]. However, few have explored other potentials of this method, including whether it is more robust against data imbalance. In this research, we evaluated two ResNet-50 CNN models, one pre-trained using SwAV and the other using standard supervised learning, on classifying CXR images without oversampling. As the dataset is also imbalanced, the model pre-trained using SwAV may generalize better, as it possesses more knowledge of distinguishing similar features without regard to the classes. This paper is organized as follows: section 2 presents some previous works related to CXR classification using TL, section 3 describes the details of how the experiments in this study were conducted, section 4 presents the obtained results and analyses, and section 5 presents the conclusion of this research.

RELATED WORKS
Research dedicated to the analysis of CXR images to diagnose COVID-19 using TL became more prevalent during the peak of the COVID-19 pandemic. In 2020, Loey et al. used a Generative Adversarial Network (GAN) to augment CXR images from the only available public CXR COVID-19 dataset at the time for retraining three different pre-trained models (AlexNet, GoogLeNet, and ResNet18). Their findings showed that GoogLeNet achieved an accuracy of 0.806 in classifying four classes of CXR images [24]. Similarly, Rahaman et al. compared 15 pre-trained CNN architectures in classifying normal, regular pneumonia, and COVID-19 CXR images, where the VGG19 model obtained an accuracy of 0.893 and an F1 score of 0.90, albeit augmentation was still involved [25]. In another similar study, Minaee et al. evaluated four pre-trained CNNs, namely ResNet18, ResNet50, SqueezeNet, and DenseNet-121, in detecting COVID-19 infections from collected CXR images whose labels were determined by certified radiologists. Using TL and augmentation, the CNN models obtained specificity rates of around 0.9 [26]. Overall, standard TL had been successful in classifying CXR images when trained using augmentations.
Chouhan et al. utilized an ensemble of CNN models after TL to classify pneumonia in CXR images using five popular pre-trained CNN models. The outputs of the individual models were then combined to produce the final output. Using augmented images from the Guangzhou Women and Children's Medical Center dataset, the ensemble achieved an accuracy of 0.964 and a recall of 0.9962 [27]. In another study, a similar method was performed using four public CXR datasets, where three pre-trained ResNet models were trained to perform binary classifications using three datasets (Normal-COVID, Pneumonia-COVID, Normal-Pneumonia).
The models were then ensembled and further fine-tuned using the other dataset. The proposed method outperformed the individual models with a precision of 0.94 and a recall of 1.0 [28].
However, such methods require huge computation workloads as standard DL is already computationally expensive [29], not to mention the massive amount of data required to train the models and the augmented images.
On the other hand, various SSL-based approaches have been exploited in recent studies of CXR image classification. Liu et al. elaborated on self-supervised mean-teacher model pre-training with semi-supervised fine-tuning, a method called S2MTS2, which was evaluated on CXR datasets to perform multilabel classification. Using different proportions of labeled and unlabeled data, their method produced results on the CXR dataset similar to those of supervised approaches [30]. Azizi et al. proved that self-supervised pre-training on ImageNet followed by further self-supervised pre-training on unlabeled CXR images improved the model's performance on CXR classification test sets. Combining it with a variant of contrastive learning called Multi-Instance Contrastive Learning, they showed that the method can beat the performance of supervised approaches on the test sets [31].
More recent studies have demonstrated how SSL can be exploited for CXR-based COVID-19 classification. Abbas et al. proposed a TL approach to repurpose large-scale image classification tasks for COVID-19 detection on CXR using a self-supervised sample decomposition method. The approach, called 4S-DT, can deal with imbalanced class distributions in the dataset, and their method attained high accuracies for the classification task [32]. In another study, Gazda et al. utilized self-supervised pre-training of a deep CNN on CheXpert images with their labels removed, using contrastive learning approaches. The pre-trained models were then used to classify pneumonia types and recognize COVID-19 on different datasets. The models produced results comparable with supervised methods without using a large number of labeled samples [33], which further proved the prowess of SSL. All in all, SSL and SwAV have yet to be fully explored for CXR image classification.

RESEARCH METHODOLOGY
3.1. Dataset. This research utilized the COVID-19 Radiography Database as the training dataset for the CXR image classification task [18,34]. The acquired data are split into four classes, containing 3616 images of positive COVID-19 cases, 10192 normal cases, 6012 lung opacity cases (non-COVID lung infections), and 1345 cases of viral pneumonia. The creator of the dataset also provides lung segmentation masks for all images, enabling segmentation to be done before training the DL models. Figure 1 shows the original image samples along with their respective segmentation masks for each class provided by the dataset. Images with irregularities are shown in the figure, such as texts and an arrow in Figure 1(A), an arrow in Figure 1(B), black-padded different-sized images in Figure 1(C), and cropped lung images in Figure 1(D). Therefore, the segmentation masks were used to segment the images before feeding them into the DL models, allowing the models to focus only on the lung areas of the images.

3.3. Transfer Learning from Self-Supervised Learning. In cases where the size of the dataset is insufficient to allow DL models to generalize, TL has been a broadly adopted method to assist the training of the models [35,36]. Its main goal is to aid the models in obtaining a better target predictive function f_T(.) for the target task T_T based on the target domain D_T.
The models are first trained on a pretext task using the source domain D_S and source task T_S, and the knowledge they obtained is transferred by retraining the models using D_T and T_T for a downstream task. Mathematically, TL can be formulated as follows: a domain is defined as D = {X, P(X)} and a task as T = {Y, f(.)}, where X is the feature space with marginal distribution P(X), Y is the label space, and f(.) is the predictive function learned from training pairs (x, y), where x and y denote the input data and their labels, respectively. The goal is to improve f_T(.) by using the transferred knowledge from D_S and T_S [37]. In this research, the heterogeneous TL method was adopted, in which D_S ≠ D_T and T_S ≠ T_T. The ImageNet dataset [38] in particular, which contains more than three million images, has been widely used to pre-train proposed DL models through supervised learning [39,40,41]. In general, this method has proven effective in various DL studies [36,42,43].
Aside from the supervised learning approach, SSL in the form of Contrastive Learning (CL) has been a prominent approach for pre-training DL models on pretext tasks [23]. CL trains DL models to cluster samples, allowing them to identify the same object from different augmented views [21]. This means that models trained using CL should be capable of distinguishing different representations of the same object, making them more robust compared to standard supervised models. The SwAV algorithm in particular is one of the most successful CL algorithms, achieving the best accuracy among similar CL algorithms, albeit still slightly inferior to standard supervised models.
Inspired by contrastive instance learning, SwAV trains the models to differentiate various views of an image by comparing the cluster assignments produced instead of their features. This is done by utilizing the multi-crop strategy and trainable prototypes. Figure 3 illustrates how SwAV is performed. First, the multi-crop augmentation is performed to generate various views of the input image X, resulting in the randomly cropped X_1 and X_2, which are further augmented using random horizontal flips. Color distortion and Gaussian blurring are then applied to X_1 and X_2, which are later fed into the model F_θ. The output embeddings Z_1 and Z_2 are produced, and dot product operations are performed on Z_1 and Z_2 with the prototype vectors C to produce the scores Z_1·C and Z_2·C. The Sinkhorn-Knopp algorithm is used to assign the clusters from Z_1·C and Z_2·C, resulting in the codes Q_1 and Q_2.
In computing the loss, the assignments are then swapped, as described in the equations below:

L(Z_1, Z_2) = l(Z_1, Q_2) + l(Z_2, Q_1)    (1)

P^(k) = exp(Z·C_k / τ) / Σ_k' exp(Z·C_k' / τ)    (2)

l(Z, Q) = -Σ_k Q^(k) log P^(k)    (3)

where k indexes the prototypes and τ is a temperature variable used for softening the scores. As shown in equation (3), the loss function used is the Cross-Entropy (CE) loss; the two swapped terms are averaged, and the mean CE loss is used in back-propagating the model's parameters as well as C [22]. On most computer vision downstream tasks, performing TL on models trained using SwAV has produced better results compared to standard supervised learning [21,22]. In a recent study, TL from SwAV was also used to distinguish COVID and normal lung X-Ray images and proved to be superior to other TL models [23]. However, that implementation was still limited to binary classification and had yet to be tested on independent test sets. Therefore, deeper analyses of TL using SwAV for X-Ray image classification were conducted in this research using four classes of chest X-Ray images. The results were also compared to those of TL using the standard supervised method.
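The swapped-prediction loss can be sketched in PyTorch as follows. This is a simplified illustration under stated assumptions (the helper name, tensor shapes, and the fact that the codes Q_1, Q_2 are taken as given from Sinkhorn-Knopp are ours), not the reference SwAV implementation:

```python
import torch
import torch.nn.functional as F

def swav_loss(z1, z2, q1, q2, prototypes, temperature=0.1):
    """Swapped-prediction loss: each view's embedding must predict the cluster
    assignment (code) of the *other* view.
    z1, z2: L2-normalized embeddings, shape (B, D);
    q1, q2: codes from Sinkhorn-Knopp, shape (B, K), rows sum to 1;
    prototypes: trainable matrix C, shape (D, K)."""
    # Softened softmax over prototype scores Z·C / τ.
    p1 = F.log_softmax(z1 @ prototypes / temperature, dim=1)  # (B, K)
    p2 = F.log_softmax(z2 @ prototypes / temperature, dim=1)
    # Swap: the code of view 2 supervises view 1, and vice versa; the two
    # cross-entropy terms are averaged into the final loss.
    return -0.5 * (torch.sum(q2 * p1, dim=1) + torch.sum(q1 * p2, dim=1)).mean()
```

Back-propagating this loss updates both the encoder F_θ and the prototypes C, matching the training procedure described above.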
In this research, two ResNet-50 models were trained on the COVID-19 Radiography Database. Both models were pre-trained on the ImageNet dataset: one pre-trained using SwAV, which will be referred to as SwAV-TL in the following sections of this paper, and the other pre-trained using standard supervised learning. The evaluation metrics used in this research are listed in Table 1. As the dataset used is imbalanced, the accuracy metric is not calculated in this experiment.
It should also be noted that the models were trained on the imbalanced dataset without resampling, meaning that it is possible for some of the models to have null precision in some classes. Therefore, only the recall/sensitivity/True Positive Rate (TPR), specificity/True Negative Rate (TNR), miss rate/False Negative Rate (FNR), and fall-out/False Positive Rate (FPR) are calculated in this research. These metrics are calculated as follows:

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FNR = FN / (TP + FN) = 1 - TPR
FPR = FP / (TN + FP) = 1 - TNR

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
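As a sketch, these one-vs-rest rates can be derived from a multiclass confusion matrix and macro-averaged (the helper below is illustrative, not code from the study):

```python
import numpy as np

def rates_from_confusion(cm):
    """Macro-averaged TPR, TNR, FNR, FPR from a (C, C) confusion matrix whose
    rows are true classes and columns are predicted classes. Each class is
    treated one-vs-rest, then the per-class rates are averaged."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # true class i, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # predicted class i, true elsewhere
    tn = cm.sum() - tp - fn - fp
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return {"TPR": tpr.mean(), "TNR": tnr.mean(),
            "FNR": 1.0 - tpr.mean(), "FPR": 1.0 - tnr.mean()}
```

As a sanity check on a four-class test set with equal class sizes, a collapsed model that assigns every sample to a single class yields a macro TPR of exactly 0.25 and a macro TNR of 0.75, the degenerate pattern discussed in the results.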
The precision and F1 score were only calculated for the best model of the research to compare it with the models of previous studies. These two metrics were calculated as follows:

Precision = TP / (TP + FP)
F1 = 2 · (Precision · TPR) / (Precision + TPR)

3.4.2. AUROC. In addition to the metrics explained above, the AUROC has also been a reliable metric for evaluating models trained on imbalanced datasets.

Experiment Setup.
The experiments were conducted using the Python programming language and the PyTorch DL framework. In tuning the hyperparameters, the HyperOpt library [46] was utilized to perform a grid search to determine the optimal learning rate lr and L2 regularizer weight decay λ, where lr ∈ {1e-2, 1e-3, 1e-4, 1e-5} and λ ∈ {0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6}. All of the models were trained for 50 epochs using the Adam optimizer and CE loss. For each type of TL, the models were trained under two scenarios: (a) using the original imbalanced dataset and (b) using the undersampled balanced dataset.
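The search described above amounts to an exhaustive 4 × 7 grid over lr and λ. A hedged sketch of such a loop is shown below; the `train_one_config` callback is hypothetical (standing in for a 50-epoch training run that returns a validation loss), and the paper itself drove the search through the HyperOpt library rather than a hand-written loop:

```python
import itertools
import torch

# Search space from the experiment setup.
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5]
WEIGHT_DECAYS = [0.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]

def grid_search(build_model, train_one_config):
    """Try every (lr, weight_decay) pair with a fresh model and Adam optimizer;
    return the pair with the lowest validation loss."""
    best_loss, best_config = float("inf"), None
    for lr, wd in itertools.product(LEARNING_RATES, WEIGHT_DECAYS):
        model = build_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
        val_loss = train_one_config(model, optimizer)
        if val_loss < best_loss:
            best_loss, best_config = val_loss, (lr, wd)
    return best_config
```

Note that Adam's `weight_decay` argument implements the L2 penalty λ, so no separate regularization term needs to be added to the CE loss.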
This means that a total of 20 models were trained, 10 for each TL type. Class weighting was utilized in training the models on the imbalanced dataset. As a multiclass classification task was performed in this research, all of the reported metrics had been macro-averaged. The training results are visualized in Figure 2.

RESULTS

To further validate whether TL using SwAV is more robust to overfitting, the models' performance on the test data had to be analyzed. The evaluation results of the models trained on the imbalanced dataset are compiled in Table 2. From the AUROC alone, it is clear that the SwAV-TL models outperform the supervised models on this downstream task, a result that parallels the findings of previous studies [23,21,22]. This means that SwAV-TL is confirmed to be more robust to overfitting, even when trained on the imbalanced dataset. Further observation of the TPR and TNR also implies that the supervised models were greatly affected by the class imbalance. Most of the supervised models obtained exactly 0.25 TPR and 0.75 TNR in the four-class classification, which implies that the models classify all of the test samples into a single class. Such results are expected in cases where the training data are severely imbalanced, as the "Normal" class contains almost eight times more samples than the class with the fewest samples, which is the "Viral Pneumonia" class. Even the class with the second largest number of samples, "Lung Opacity", only contains 58.98% of the number of samples in the "Normal" class.
On the other hand, the SwAV-TL models obtained considerably better results, with the best model acquiring 0.952 AUROC. This model, which has only one unfrozen block, obtained an astounding performance with 0.808 TPR and 0.938 TNR. These results are vastly better than those of the rest of the models, as the other SwAV-TL models failed to even achieve 0.5 TPR despite their high TNR. The low TPR means that the models are prone to false negatives, which can lead to many undetected infections. Such results parallel a similar study on TL, which showed that albeit models with more unfrozen blocks can achieve lower validation losses, they may not be the best on an independent test dataset [42].
In spite of its high AUROC, the supervised model with two unfrozen blocks still possessed lower TPR and TNR than the worst SwAV-TL model, the one with zero unfrozen blocks, which achieved 0.886 AUROC. Such results may be attributed to the fact that the SwAV-TL models were previously trained to cluster similar features of augmented images together [21], meaning that such models may have better general knowledge in grouping images with details and features that are generally similar on the downstream tasks while being robust to image transformations [22]. In simpler terms, the models may have better knowledge in grouping similar images by highlighting detailed features that are more general, compared to the supervised models, which were trained to focus on extracting detailed distinguishing features of each class. Therefore, it can be inferred that TL using SwAV pre-training can result in better performance for models trained on imbalanced datasets.

Undersampling Resulted in Generally Better Results. To further verify whether TL using SwAV is only advantageous on imbalanced datasets, experiments were also conducted on the undersampled version of the dataset. However, the number of samples used in training was significantly reduced, and sampling bias may have affected the models due to the usage of RUS. The evaluation results are listed in Table 3. Although the performance of most of the models greatly improved compared to training on the imbalanced dataset, the SwAV-TL models still managed to outperform all of their supervised counterparts. The best results were achieved by the SwAV-TL model with 3 unfrozen blocks, which maintained 0.948 AUROC, only slightly lower than that of the best SwAV-TL model on the imbalanced dataset. The difference is that on the imbalanced dataset, the other SwAV-TL models were unable to obtain at least 0.5 TPR, whereas on the balanced dataset most of the models managed to obtain more than 0.76 TPR. Such results were expected, as an imbalanced distribution of data can severely affect the models.

4.4. The Impact of Dataset Size and Class Imbalance. Figure 5 presents the confusion matrices of the best models. Due to the severe data imbalance, the supervised models trained on the imbalanced dataset were unable to classify the images as COVID or Viral Pneumonia, which corresponds with the results in Table 2 and is an expected behavior for the models as the imbalance is severe. On the contrary, the SwAV-TL model still managed to correctly classify most of the images despite the imbalanced training dataset, further supporting the idea that models trained using SwAV may be more robust to data imbalance on downstream tasks. When trained on the balanced dataset, the supervised model managed to make more correct predictions for COVID, lung opacity, and viral pneumonia cases. However, its TPR plummeted as many false negatives for the normal class emerged.
Such occurrences may be attributed to the sampling bias and the insufficient amount of data, as the balanced dataset only contained 25.43% of the data from the original dataset. More details on the classification metrics of the models are listed in Table 4, which shows that the SwAV-TL model trained on the imbalanced dataset is the overall best model with a 0.821 F1 score. Evaluation results of the supervised model trained on the imbalanced dataset were not included, as the precisions were null for the COVID and Viral Pneumonia classes. Figure 6 visualizes the ROC curves for the best SwAV-TL (trained on the imbalanced dataset) and supervised (trained on the balanced dataset) models. However, it should be noted that most similar studies utilizing this dataset did not use all four available classes [23,18]. The "Lung Opacity" class was rarely used, and many previous studies that utilized the COVID-19 Radiography Database combined it with X-Ray images from other datasets to allow DL models to learn from more data and enhance their accuracy.
In studies that included the lung opacity class, some performance degradation may be noticeable, signified by the reduction in F1 scores as shown in Table 5. Table 5 summarizes the comparison of the SwAV-TL model with models deployed in previous studies using the COVID-19 Radiography Database. It can be seen that the results obtained by the SwAV-TL model are inferior to those of all of the listed studies. However, it should be noted that the datasets were configured differently; some studies even undersampled the test set [23,18], which may be affected by sampling bias. Additionally, different proportions of the dataset subsets will produce different results due to the different numbers of training samples. In one of the cited studies, 70% of the images in the dataset were used for training the models, and the training set was further augmented to handle the data imbalance, which yielded an astounding F1 score of 0.9275 [48]. In one study that is similar to this research, no further details were provided regarding the proportions or model architectures, albeit the authors stated that no augmentation was performed in one of the experiment scenarios, whose results were included in Table 5. To summarize, the proposed SwAV-TL model is inferior to the ones in previous studies, but further experiments are still required, as slightly different training configurations can greatly affect the model. In future studies, TL using SwAV can be further tested by training the models on oversampled training sets, and more modifications to the model may be conducted, as no architectural modifications nor additional hidden layers were implemented in this experiment. Specifically, model compression methods can also be considered in future studies to improve the efficiency of the models in the deployment stage.

CONCLUSION
Overall, TL using models pre-trained through SwAV brought positive impacts to the task of classifying chest X-Ray images, even when the available datasets are imbalanced or small. The ResNet-50 models used in this study proved to be more robust to the severe data imbalance and attained a great AUROC value when pre-trained using SwAV, showing that SwAV pre-training is superior to standard supervised pre-training. Even though TL using SwAV pre-training allowed the models to perform better, further experiments are required to discover to what extent it can improve the accuracy. The experiments conducted in this research are limited to the training of ResNet-50 without resampling and with undersampling. In the future, oversampling with augmentations can be tested with TL using SwAV, and smaller models may be deployed, as large models such as ResNet-50 require massive volumes of data.