Mixed Re-Sampled Class-Imbalanced Semi-Supervised Learning for Skin Lesion Classification

Skin cancer is one of the most common types of cancer in the world, and melanoma is considered the deadliest type of skin cancer. Automated skin lesion classification in dermoscopy images has recently become an active and challenging research topic because it offers an essential way to improve diagnostic performance and thus reduce melanoma deaths. Among supervised classification techniques, convolutional neural networks (CNNs) are at the heart of this promising performance. However, these successes rely heavily on large amounts of class-balanced, clearly labeled samples, which are expensive to obtain for skin lesion classification in the real world. To address this issue, we propose a mixed re-sampled (MRS) class-imbalanced semi-supervised learning method for skin lesion classification, which consists of two phases: re-sampling and multiple mixing methods. To counter the class imbalance problem, a re-sampling method for semi-supervised learning is proposed, and focal loss is introduced into semi-supervised learning to improve classification performance. To make full use of unlabeled data, Fmix and Mixup are used to mix labeled data with pseudo-labeled unlabeled data. Experiments on class-imbalanced datasets demonstrate the effectiveness of the proposed method compared with other state-of-the-art semi-supervised methods.


Introduction
Skin cancer is one of the major types of cancer, with an increasing incidence over the past decades and over 5 million newly diagnosed cases every year [1,2]. Malignant melanoma is the most lethal type and accounts for the majority of skin cancer deaths [3]. Although the mortality is significant, an early-stage melanoma can be cured through a simple excision, and the estimated 5-year survival exceeds 95% [4]. Consequently, accurate discrimination of malignant skin lesions from benign lesions such as seborrheic keratoses or benign nevi is crucial [5].
Dermoscopy [6] is a recent visual inspection technique that has been established as an imaging modality that both magnifies the skin and eliminates surface reflection. Compared with unaided visual inspection, it is one of the essential means to improve diagnostic performance and reduce melanoma deaths. Automatic classification of skin lesions, particularly melanoma, in dermoscopy images is a significant computer-aided diagnosis task [7]. It is a very challenging task, since the accuracy of skin lesion classification suffers from inter-class similarity and intra-class variation. As shown in Fig. 1, different skin lesion categories are visually similar in shape and color, which makes them difficult to distinguish.
Recently, convolutional neural networks (CNNs) trained end-to-end have been widely used and have achieved remarkable success in a variety of visual recognition tasks [8][9][10][11]. Many researchers have advanced skin lesion classification and have shown decent results [12,13]. Esteva et al. [14] trained the InceptionV3 architecture on 129,450 clinical images and reported performance comparable to that of 21 trained dermatologists. However, the success of this method is partly due to the existence of large labeled datasets. It is generally difficult to obtain a large number of well-refined annotated images in the field of skin lesion classification. Besides, collected skin lesion images are usually unlabeled, and adding high-quality annotations manually requires professional knowledge, so accurately labeling skin lesion images is difficult and time-consuming. To alleviate this annotation burden, several semi-supervised learning algorithms have been proposed to improve model performance by utilizing the information contained in unlabeled data [15][16][17]. Most semi-supervised learning algorithms assume that each class of the training data has almost the same number of samples, whether labeled or unlabeled. In practice, however, due to the difficulty of data acquisition and annotation, the class distribution of data in the medical field is usually imbalanced. In semi-supervised learning algorithms, using such data can degrade performance on the minority classes. Class-imbalanced learning addresses this problem with various methods, including re-sampling [18], re-weighting [19], and meta metric learning [20]. However, to the best of our knowledge, studies on class-imbalanced learning in the field of medical imaging focus entirely on supervised learning and have not considered semi-supervised learning.
In this paper, we propose a mixed re-sampled (MRS) class-imbalanced semi-supervised learning method for skin lesion classification. Mixed sample data augmentation was originally proposed to improve the performance of classification tasks and has obtained state-of-the-art results in multiple supervised classification tasks. ICT [17] and MixMatch [15] introduce Mixup [21], one of the mixed sample data augmentation methods, into semi-supervised learning, which further improves the recognition performance of the model. Inspired by this, our MRS uses a variety of mixed sample data augmentation methods [22] to mix labeled and unlabeled samples. However, mixed sample data augmentation mainly benefits uniformly distributed samples, and its effect is not obvious for samples with unevenly distributed categories. To solve this problem, we introduce a re-sampling technique. At the beginning of training, we ensure that the labeled samples fed to the semi-supervised learning model are evenly distributed, and we increase the proportion of labeled data from the majority classes as training progresses.
Hence, in our work, a new training procedure is introduced to improve the performance of semi-supervised learning on a class-imbalanced dataset. First, for each training batch, the labeled data is re-sampled so that the model initially learns from uniformly distributed data and acquires general knowledge across the data distribution. Then, the labeled data is mixed with the pseudo-labeled unlabeled data by Mixup [21] and Fmix [22]. Finally, the model parameters are updated using the mixed data. We evaluated the proposed MRS method on the ISIC skin 2019 dataset [23][24][25], which is the largest publicly available skin dermoscopy image dataset, and achieved state-of-the-art performance compared with other semi-supervised learning methods.
The main contributions of this paper are summarized as follows:
1. We define a class-imbalanced semi-supervised learning task for skin lesion classification, which reflects a more realistic situation, and propose a method to solve it.
2. We introduce re-sampling into class-imbalanced semi-supervised learning, which improves the classification performance of semi-supervised learning on class-imbalanced data.
3. Based on mixed sample data augmentation, we use Mixup and Fmix to mix the labeled data with pseudo-labeled unlabeled data, which further improves the generalization performance of the semi-supervised learning model.
4. The proposed class-imbalanced semi-supervised learning method adopts an end-to-end learning style and achieves state-of-the-art results on the ISIC skin 2019 dataset.

Methodology
In this section, we introduce our proposed MRS method, which consists of a re-sampling strategy to balance the class-imbalanced data and a mixed sample data augmentation strategy that mixes labeled data with pseudo-labeled unlabeled data to improve the model's performance in skin lesion classification. An overview of MRS is presented in Fig. 2 and Algorithm 1.
Algorithm 1
Input: set of labeled and unlabeled samples
Output: trained model $f_\theta$
1: for epoch = 1 to num_epochs do
2: Get a batch of class-balanced labeled data $\{X, Y\}$ and a batch of unlabeled data $\{U\}$. At epochs 0, 80, 120, and 160, adjust the labeled data ratio of major to minor classes in a batch to 1:1, 3:1, 5:1, and 1:1, respectively.

Framework
In general, we have a small class-imbalanced labeled data set $D_L$, which contains $N_L$ labeled instances, and a large-scale unlabeled data set $D_U$. Our goal is to train the model using the class-imbalanced training data set $D_L \cup D_U$ to classify unseen instances.
The main steps are as follows. In the first step, the labeled samples are re-sampled so that the labeled samples fed to the model at the beginning of training are class-balanced. Details of the labeled data sampling are presented in Section 2.2. At the same time, the same number of unlabeled samples is drawn at random. Two rounds of stochastic data augmentation are applied to both labeled and unlabeled samples. Let $\{X, Y\}$ and $\{U\}$ be a batch of the original labeled and unlabeled data, and let $\{X_1, Y\}$, $\{X_2, Y\}$, $\{U_1\}$, and $\{U_2\}$ be the results of the two random augmentations. The current model computes probabilistic predictions for $\{U_1\}$ and $\{U_2\}$, and the mean of the two predictions is temperature-sharpened to obtain the pseudo-labels $Q$, giving $\{U_1, Q\}$ and $\{U_2, Q\}$. Then, $\{X_1, Y\}$ and $\{U_1, Q\}$ are merged and shuffled to obtain the merged set $\{W_1\}$; likewise, $\{X_2, Y\}$ and $\{U_2, Q\}$ are merged and shuffled to obtain $\{W_2\}$. After that, $\{X_1, Y\}$ and $\{U_1, Q\}$ are mixed with $\{W_1\}$, and $\{X_2, Y\}$ and $\{U_2, Q\}$ are mixed with $\{W_2\}$; the specific mixing strategy is introduced later. In the last step, the mixed data is used to calculate separate labeled and unlabeled loss terms and to update the model parameters.

Re-sampling Training Data
The number of samples for different categories in the labeled data set is usually different. Generally, a category with a larger number of samples is defined as a major class, and a category with a smaller number of samples is defined as a minor class. This class imbalance also exists in the field of skin lesion classification. To solve the class imbalance problem in semi-supervised learning for skin lesion classification, we introduce a novel re-sample data training (RDT) strategy for model training. Unlike other re-sampling-based methods, in which the majority classes are down-sampled or the minority classes are over-sampled to obtain a uniform distribution, our RDT avoids both the drawback of under-sampling methods, which usually discard many examples of the majority classes, and the tendency of over-sampling methods to cause overfitting.
In RDT, the model is initially trained on class-balanced labeled data, which is achieved by strictly controlling the number of samples of each category in each batch fed to the model. In other words, the same number of samples is taken from each category to form a batch for training. Then, as training progresses, the class ratio of the input data in each batch is gradually changed, increasing the share of the major classes and decreasing the share of the minor classes. In this case, there is no need to down-sample the major classes. However, the minor classes may face a risk of overfitting.
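The batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ratio_for_epoch` and `sample_batch`, the `per_minor` parameter, and the use of NumPy are assumptions, and the default milestones simply reuse the epochs and ratios listed in Algorithm 1.

```python
import numpy as np

def ratio_for_epoch(epoch, milestones=((0, 1), (80, 3), (120, 5), (160, 1))):
    """Return the major:minor ratio in force at `epoch` (hypothetical helper)."""
    ratio = milestones[0][1]
    for start, value in milestones:
        if epoch >= start:
            ratio = value
    return ratio

def sample_batch(labels, major_classes, ratio, per_minor, rng):
    """Draw indices with ratio * per_minor samples per major class
    and per_minor samples per minor class."""
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        n = per_minor * ratio if int(c) in major_classes else per_minor
        # Sample with replacement only when a class has too few examples.
        picked.append(rng.choice(pool, size=n, replace=len(pool) < n))
    idx = np.concatenate(picked)
    rng.shuffle(idx)
    return idx
```

With a ratio of 1 every class contributes the same number of samples, which corresponds to the class-balanced start of training; larger ratios shift each batch toward the major classes without discarding any major-class data from the pool.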
To reduce the overfitting of minor classes, we adopt the following three strategies. First, RandAugment, which is based on AutoAugment, is used to augment the training data. AutoAugment uses reinforcement learning to learn an augmentation strategy built from transformations in the Python Imaging Library (PIL). This requires a large number of labeled images to learn the augmentation pipeline. However, we do not have enough data to learn such an augmentation strategy for the skin lesion classification task. As a result, RandAugment, a variant of AutoAugment that does not require the augmentation strategy to be learned ahead of time with labeled data, is adopted to address the overfitting of minor classes in our task. At the end of each augmentation pass, we also apply the Cutout strategy to strengthen the augmentation effect.
Second, to further prevent overfitting of the minor classes, we introduce focal loss [26] into the semi-supervised learning loss function. Formally, the combined loss function $L$ for semi-supervised learning is computed as: where $H(p, q)$ is the cross-entropy between distributions $p$ and $q$, $\theta$ denotes the model weights, $X'$ is a batch of RandAugment-augmented labeled samples, and the unsupervised loss weight is a hyperparameter described below. $\zeta_1$ and $\zeta_2$ are different random noise applied to the input $u$. The consistency constraint penalizes the difference between the predicted probabilities $f(\theta, u + \zeta_1)$ and $f(\theta, u + \zeta_2)$. $R$ measures the distance between two vectors and is typically the mean squared error (MSE) or the Kullback-Leibler (KL) divergence.
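To make the role of focal loss concrete, the sketch below applies the focal modulating factor of [26], $(1 - p)^{\gamma}$, to a cross-entropy term. It is an illustration under stated assumptions rather than the loss used in the paper: the function name `focal_cross_entropy`, the value $\gamma = 2$, and the extension to soft targets are all assumptions, since the source does not give them.

```python
import numpy as np

def focal_cross_entropy(probs, targets, gamma=2.0, eps=1e-12):
    """Mean focal loss over a batch.

    probs:   (n, k) predicted class probabilities.
    targets: (n, k) one-hot or soft labels (soft targets are an assumption).
    """
    probs = np.clip(probs, eps, 1.0)
    # Down-weight classes that are already predicted with high probability.
    modulating = (1.0 - probs) ** gamma
    per_sample = np.sum(-targets * modulating * np.log(probs), axis=1)
    return float(np.mean(per_sample))
```

Setting `gamma=0` recovers the ordinary cross-entropy, so the factor only changes how much well-classified samples contribute to the loss.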
In our loss term, we introduce focal loss into the standard semi-supervised loss function. The focal semi-supervised learning loss is computed as: where $q$ is the result of the "sharpening" function applied to the average prediction across all RandAugment views of $u$. The "sharpening" function is defined as:
$$\mathrm{Sharpen}(p, T)_i = \frac{p_i^{1/T}}{\sum_{j} p_j^{1/T}},$$
where $p$ is the predicted class distribution of an unlabeled sample and $T$ is the temperature of the sharpened distribution. As $T$ goes to 0, the output of $\mathrm{Sharpen}(p, T)$ approaches a Dirac ("one-hot") distribution. The last but equally important strategy is mixed sample data augmentation, which is described in detail in Section 2.3.
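The pseudo-labeling step can be illustrated with a short sketch, assuming the temperature sharpening used by MixMatch (each probability raised to $1/T$ and renormalized). The function names `sharpen` and `pseudo_label` are hypothetical, and the default $T = 0.5$ is the value reported in the implementation details.

```python
import numpy as np

def sharpen(p, T):
    """Raise each probability to 1/T and renormalize along the class axis."""
    powered = np.asarray(p, dtype=float) ** (1.0 / T)
    return powered / powered.sum(axis=-1, keepdims=True)

def pseudo_label(pred_1, pred_2, T=0.5):
    """Average the predictions for two augmented views, then sharpen."""
    mean_pred = (np.asarray(pred_1) + np.asarray(pred_2)) / 2.0
    return sharpen(mean_pred, T)
```

With $T = 1$ the distribution is unchanged, while smaller temperatures push the pseudo-label toward the most probable class.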

Mixed Sample Data Augmentation
To further improve the performance of class-imbalanced semi-supervised learning for skin lesion classification, we integrate a training strategy named mixed sample data augmentation (MSDA). Recently, a plethora of MSDA approaches have been proposed and have obtained state-of-the-art results in supervised classification tasks. One of the most popular is Mixup [21], proposed by Zhang et al. as a regularization technique to encourage high-margin decision boundaries and used in semi-supervised learning by ICT, MixMatch, and RealMix. In our MSDA, we use Mixup to mix the labeled data $\{X_1, Y\}$ and the pseudo-labeled unlabeled data $\{U_1, Q\}$ with $\{W_1\}$. The Mixup [21] function generates a new sample $(x', y')$ as follows:
$$\lambda' \sim \mathrm{Beta}(\alpha, \alpha), \quad x' = \lambda' x_1 + (1 - \lambda') x_2, \quad y' = \lambda' p_1 + (1 - \lambda') p_2,$$
where $(x_1, p_1) \in \{X_1, Y\} \cup \{U_1, Q\}$, $(x_2, p_2) \in \{W_1\}$, and $\alpha$ is a hyperparameter. To summarize, after using Mixup [21], we collect half of all mixed labeled samples and their labels into $\dot{X}'$, and half of all augmentations of all unlabeled samples with their pseudo-labels into $\dot{U}'$. For the other labeled data $\{X_2, Y\}$ and the pseudo-labeled unlabeled data $\{U_2, Q\}$, we use the Fmix strategy [22], which was proposed by Harris et al. for supervised classification. In Fmix, a random complex tensor $Z \in \mathbb{C}^{w \times h}$, whose real and imaginary parts are independent and Gaussian, is sampled first. Then, each component of $Z$ is scaled according to its frequency via the decay power $\delta$. After that, an inverse Fourier transform is applied to the complex tensor and the real part is taken to obtain a grayscale image. Finally, a binary mask $M$ is obtained by setting the top proportion of the image to the value "1" and the rest to the value "0". To use Fmix for semi-supervised learning, we apply it to both labeled samples and pseudo-labeled unlabeled samples, similar to how MixMatch uses Mixup.
For the obtained binary mask $M$, we record the proportion of entries with value "1" as $\lambda''$, while ensuring that $\lambda''$ is greater than 0.5. If $\lambda'' < 0.5$, we set $M = |1 - M|$ and $\lambda'' = 1 - \lambda''$. Then $\{X_2, Y\}$ and $\{U_2, Q\}$ are mixed with $\{W_2\}$ as follows:
$$x'' = M \odot x_1 + (1 - M) \odot x_2, \quad y'' = \lambda'' p_1 + (1 - \lambda'') p_2,$$
where $(x_1, p_1) \in \{X_2, Y\} \cup \{U_2, Q\}$ and $(x_2, p_2) \in \{W_2\}$. We provide some example Mixup and Fmix images for skin lesions in Figs. 3 and 4. Finally, $\dot{X}'$ and $\dot{X}''$ are used to calculate the supervised loss term $L_X$ using Eq. (2a), $\dot{U}'$ and $\dot{U}''$ are used to calculate the unsupervised loss term $L_U$ using Eq. (2b), and the total loss is the sum of the two, calculated using Eq. (1a).
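The two ingredients introduced in this paragraph, Mixup and the Fmix mask, can be sketched as below. This is a simplified illustration, not the reference Fmix code: the function names, the decay power of 3, the way the zero frequency is clamped, and the full (rather than half-spectrum) FFT are assumptions made to keep the example short.

```python
import numpy as np

def mixup(x1, p1, x2, p2, alpha=0.75, rng=None):
    """Convex combination of two samples and their (soft) labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    p = lam * p1 + (1.0 - lam) * p2
    return x, p, lam

def fmix_mask(height, width, lam, decay=3.0, rng=None):
    """Binary mask from low-pass filtered Fourier noise (simplified sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Complex Gaussian spectrum with independent real and imaginary parts.
    z = rng.standard_normal((height, width)) + 1j * rng.standard_normal((height, width))
    # Frequency magnitude of each FFT bin; clamp the zero frequency.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = 1.0 / max(height, width)
    # Attenuate high frequencies, then return to image space.
    gray = np.real(np.fft.ifft2(z / freq ** decay))
    # Set the top `lam` proportion of pixels to 1 and the rest to 0.
    k = int(round(lam * height * width))
    mask = np.zeros(height * width)
    mask[np.argsort(gray, axis=None)[::-1][:k]] = 1.0
    return mask.reshape(height, width)
```

Because the noise is dominated by low frequencies, the resulting mask consists of a few large irregular regions rather than scattered pixels, which is what distinguishes Fmix from a pixel-wise random mask.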
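The mask handling and mixing described above can be sketched as follows. The helper names `ensure_majority` and `fmix_pair` are hypothetical, images are assumed to be channels-first arrays of shape (C, H, W) so that an (H, W) mask broadcasts over channels, and the label mixing by the mask proportion follows the usual Fmix convention.

```python
import numpy as np

def ensure_majority(mask):
    """Return (mask, lam) where lam is the fraction of ones, flipped so lam >= 0.5."""
    lam = float(mask.mean())
    if lam < 0.5:
        mask, lam = np.abs(1.0 - mask), 1.0 - lam
    return mask, lam

def fmix_pair(x1, p1, x2, p2, mask):
    """Mix two channels-first images with a binary mask and mix their labels."""
    mask, lam = ensure_majority(mask)
    x = mask * x1 + (1.0 - mask) * x2  # the (H, W) mask broadcasts over channels
    p = lam * p1 + (1.0 - lam) * p2
    return x, p, lam
```

Flipping the mask when fewer than half of its entries are 1 keeps the first sample dominant in every mixed pair, so a mixed labeled sample stays closer to its original label.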

Experiments and Results
To demonstrate the effectiveness of MRS in automatic skin lesion classification, we perform our experiments on the International Skin Imaging Collaboration 2019 skin lesion classification (ISIC-skin 2019) dataset, which is the largest publicly available skin dermoscopy image dataset. We first introduce the training details and the ISIC-skin 2019 dataset and then conduct semi-supervised learning experiments with part of the labeled training data. Finally, the proposed method is compared with several state-of-the-art semi-supervised learning methods and the results are discussed.

Implementation Details
Unless otherwise stated, in all our experiments we use the "ResNeXt-101-32x8d" architecture of Xie et al. [27], pre-trained on ImageNet, as the backbone of our network. Further details of the model are available in Xie et al. [27]. Formally, we replace the last 1000-dimensional fully connected (FC) layer of ResNeXt-101-32x8d with an 8-dimensional FC layer. During the training phase, we set the batch size to 8 and the training epoch to $2^{14}$. The model is trained for 200 epochs, and the labeled images input in the first 50 epochs are class-balanced. After 50 epochs, the ratio of labeled images from the major classes is gradually increased, with the distribution ratio raised every 25 epochs. At the end of the 25 epochs, the ratio between the major class and the minor class in 4 batches is 5:1. The optimizer for our model is Adam with an initial learning rate of $10^{-5}$, and the learning rate is divided by 5 every 50 epochs. We set the Adam parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, and we evaluate the model using an exponential moving average of its parameters with a decay rate of 0.999. We set the sharpening temperature $T = 0.5$ and the Beta distribution parameter in Mixup [21] to $\alpha = 0.75$. The two unsupervised loss weights are each increased from 0 to 1 over the first 16 epochs. We use weight decay as a regularization method in all models, decaying weights by 0.02 at each update for the ResNeXt-101-32x8d.
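The schedules mentioned in this paragraph can be written as small helper functions. This is only a sketch: the function names are hypothetical, the ramp-up of the unsupervised loss weight is assumed to be linear (the text gives only its endpoints), and parameters are represented as plain lists of floats for brevity.

```python
def learning_rate(epoch, base_lr=1e-5, factor=5.0, step=50):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / factor ** (epoch // step)

def unsup_weight(epoch, ramp_epochs=16):
    """Ramp the unsupervised loss weight from 0 to 1 (linear shape is an assumption)."""
    return min(1.0, epoch / ramp_epochs)

def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over a list of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Evaluating with the averaged parameters rather than the latest ones smooths out the noise of individual updates, which matters with a batch size as small as 8.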

Dataset
We evaluate the proposed method on the ISIC-skin 2019 dataset, which consists of 25331 training images across 8 categories: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC), and squamous cell carcinoma (SCC). The distribution of training samples is heavily imbalanced. Since the test set has no public labels, we take 100 images from each category, 800 in total, as a validation set to verify the effectiveness of the method. We then divide the remaining data into labeled data and unlabeled data. Tab. 1 lists the details of the ISIC-skin 2019 dataset used in our experiments; the type and distribution of the unlabeled data are unknown to the model during training. Finally, we choose the model that performs best on the validation set, apply it to the test set of 8238 images, and report its performance as the experimental result.
In particular, to fit the model, each image in ISIC-skin 2019 is resized to 256 × 256 and processed using the RandAugment strategy. The augmented images are then center-cropped to 224 × 224 and finally fed to the model.

Metrics
To quantitatively evaluate the proposed MRS method, we used sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC), and normalized multi-class accuracy (NMCA) as the performance metrics. Sensitivity, specificity, and accuracy are defined as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}, \quad \mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP},$$
where $TP$, $FN$, $TN$, $FP$, $t_{pr}$, $f_{pr}$, and $P$ represent the number of true positives, false negatives, true negatives, false positives, the true positive rate, the false positive rate, and positives, respectively, $X_0$ is the confidence score for a negative instance, and $X_1$ is the confidence score for a positive instance. The NMCA value is defined as the

Comparison with Baseline Methods
Since MRS is a semi-supervised learning method, we consider three methods, Mean Teacher, ICT, and MixMatch, as baselines for comparison. We also perform supervised learning on the labeled data as a further baseline. So that these four baseline methods generalize well on class-imbalanced distributions, we oversample the minor-class labeled data. We reimplemented each of these methods in the same codebase and applied them to the same model to ensure a fair comparison. The experimental results are shown in Tabs. 2 and 3. From Tab. 2, we can make several observations. For all methods, even the worst-performing one, classification is far better than pure chance, which confirms that semi-supervised learning methods can be successfully applied to skin lesion classification. Compared with the baseline methods, our MRS method achieves the highest AUC in all classes except DF, VASC, and SCC; the highest ACC in MEL, NV, AK, and DF and the second-highest ACC in BKL; the highest sensitivity in NV and BCC and the second-highest sensitivity in MEL, VASC, and SCC; and the highest specificity in AK and DF.
Tab. 3 shows that our MRS achieved the highest overall AUC, ACC, sensitivity, and NMCA, and the second-highest specificity among the compared methods, with only a small gap to the highest specificity. Overall, the specificity remains at a high level across all experiments with only minor variations. Fig. 5 shows the receiver operating characteristic (ROC) curves of our method and the baseline methods. Our method achieves better performance than the other methods: in Fig. 5, the area under the ROC curve of our method is larger than that of the baseline methods. The experimental results confirm that our method has better generalization capability.
It is worth noting that in most respects the supervised method performs better than the MixMatch and Mean Teacher methods. The reason is that MixMatch and Mean Teacher are semi-supervised learning methods designed for uniformly distributed data. When both labeled and unlabeled data are unevenly distributed, it is difficult for the classifier to extract valid features from unlabeled data, so the classifier cannot be improved by the distribution of the unlabeled data. However, the ICT method outperforms the supervised method. This is because, compared with MixMatch, ICT uses unlabeled data only once in a batch, so it has less impact on the distribution of a batch of samples. At the same time, Mixup can completely mix unlabeled data in ICT. In general, our proposed MRS method mixes labeled and unlabeled data using Mixup and Fmix, which has less effect on the distribution of the re-sampled labeled data. Therefore, the unlabeled data can be fully utilized to improve the performance of the classifier when categories are imbalanced.

Comparison with Challenge Records
In this part, we compare the performance of MRS with the seven top-ranking entries that do not use external data on the ISIC-2019 skin lesion classification challenge leaderboard. These reported results on the ISIC-2019 challenge dataset reflect state-of-the-art performance in the skin lesion classification task.
Since almost all of the seven top-ranking methods on the ISIC-2019 skin lesion classification challenge leaderboard use an ensemble model to obtain better generalization performance, in this experiment we selected part of the labeled data as supervised data and trained two independent ResNeXt models. Tab. 4 lists the specific numbers of labeled and unlabeled samples for the two ResNeXt models. The labeled data here is a subset of the labeled data in Tab. 1. After training, we obtained two ResNeXt models; in Section 3.3 we also obtained a ResNeXt model. We then use an ensemble of these three models for the experimental comparison. As there are UNKNOWN images in the test dataset but no such category in the training dataset, we simply assign the images whose top-1 probability is below 0.25 to the UNKNOWN class. The experimental results are shown in Tabs. 5 and 6. Tab. 5 shows that our MRS method, which was trained on the ISIC-2019 training dataset with only 4200 labeled images, achieves the highest ACC on BCC and DF. Moreover, the results obtained by our method differ little from the leaderboard records in the other categories in terms of ACC, AUC, sensitivity, and specificity. Meanwhile, almost all methods, whether ours or the leaderboard entries that do not use additional data, are unsatisfactory in identifying UNK.
From Tab. 6, comparing the results of our MRS method with the challenge records, we can conclude that our method obtains a balanced accuracy of 55.3% according to the ranking rule of the challenge, an average AUC of 0.865, an ACC of 0.916, a sensitivity of 0.449, and a specificity of 0.969, with a small gap to the challenge records. It is also worth noting that we used only 4200 labeled images, whereas the other methods used 25331 labeled images. Fig. 6 shows the receiver operating characteristic (ROC) curve of the ensemble approach we performed.
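The UNKNOWN assignment rule can be sketched as below. How the three models are combined is not stated in the text, so averaging their predicted probabilities is an assumption here, as are the function name `ensemble_predict` and the use of index 8 for the UNKNOWN class.

```python
import numpy as np

def ensemble_predict(prob_list, threshold=0.25, unknown_index=8):
    """Average per-model probabilities and flag low-confidence samples as UNKNOWN.

    prob_list: list of (n_samples, n_classes) arrays, one per model.
    """
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    labels = mean_probs.argmax(axis=1)
    # Samples whose top-1 probability is below the threshold become UNKNOWN.
    labels[mean_probs.max(axis=1) < threshold] = unknown_index
    return labels
```

With 8 known classes, a threshold of 0.25 is twice the probability of a uniform guess, so only predictions that are spread across several classes are routed to UNKNOWN.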

Ablation Study
Since our MRS method combines various optimization and augmentation techniques, we perform an extensive ablation study to better understand why it obtains strong results. Specifically, we measure the effect of removing, one at a time, re-sampling, RandAugment, Fmix, Mixup, and focal loss. We find that each component contributes to the performance of MRS. Among them, the contribution of RandAugment is the largest, the contribution of re-sampling is the second largest, and the contribution of focal loss is the smallest.

Conclusion
In this paper, we presented a mixed re-sampled class-imbalanced semi-supervised learning method for skin lesion classification. The proposed approach has been evaluated on the ISIC-skin 2019 dataset with a considerably small set of labeled images. Despite using only 4800 labeled images, our method shows only a small gap to the seven top-ranking entries on the ISIC-2019 skin classification challenge leaderboard, which use all 25331 labeled images. The results show that our method can significantly improve performance compared with other semi-supervised methods on the same task. Achieving state-of-the-art performance, this research confirms previous findings and contributes to our understanding of semi-supervised learning methods for skin lesion classification. A natural progression of this work is to improve the recognition performance on unknown classes. Further research should concentrate on incorporating additional ideas from the semi-supervised and class-imbalanced learning literature into our method.
Funding Statement: This research was funded by the Fundamental Research Funds for the Central Universities (3072020CFQ0602, 3072020CF0604, 3072020CFP0601) and the 2019 Industrial Internet Innovation and Development Engineering (KY1060020002, KY 10600200008).

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.