Progressive Weighted Self-Training Ensemble for Multi-Type Skin Lesion Semantic Segmentation

In this study, we propose the Progressive Weighted Self-training Ensemble (PWStE) method, which improves the efficiency of labeled data for multi-type skin lesion semantic segmentation. Generating labeled data for multi-type skin lesions is extremely expensive, as it must be performed by dermatologists due to the small pixel variations and irregularly shaped characteristics of the lesions. For this reason, labeled data for training skin lesion segmentation models is severely insufficient in practice. The core idea of the proposed PWStE method is to minimize the transfer of uncertainty in the training phase of general SSL by progressively increasing the amount of pseudo-labeled data referenced in training. PWStE uses procedures such as the Progressive Selector, Ensemble, and Pseudo Labeler, designed on conventional Semi-Supervised Learning (SSL) concepts, to more accurately carry the detailed features of skin lesions from unlabeled data into pseudo-labeled data. We performed ensembles using combinations of models (U-Net, FPN, LinkNet, PSPNet) and backbones (ResNet50, EfficientNet-b3, InceptionV3, DenseNet121, SE-ResNet101, SE-ResNeXt101). Validation was performed on our Multi-Type Skin Lesion Label Database (MSLD) dataset in comparison with conventional SSL methods. The experiments show that a model trained with PWStE achieves results similar to a model trained on the entire labeled dataset with the Supervised Learning (SL) method, even with 30% less labeled data. These results show that our proposed PWStE can increase the efficiency of the given labeled data even in the multi-type skin lesion field.


I. INTRODUCTION
Semantic segmentation is one of the fundamental topics in computer vision, performing pixel-wise image classification for biomedical image analysis tasks. Recently, encoder-decoder based Deep Convolutional Neural Network (DCNN) models such as U-Net [1], FPN [2], LinkNet [3], and PSPNet [4] have been predominantly used in the biomedical field and have achieved state-of-the-art results using Supervised Learning (SL). As the value of data is highlighted by the development of technologies such as those mentioned above, many hospitals are accumulating raw data. However, the performance of these techniques cannot be fully exploited due to the lack of labeled data in the skin lesion segmentation domain. The biggest reason is that it is expensive to create skin lesion segmentation labels from raw data. In general, skin lesions have characteristics that make annotation difficult, such as small pixel variations and irregular shapes, and several types of skin lesions with these characteristics may overlap or lie close to each other. Even the cost of having a dermatologist distinguish lesions is not negligible.

(The associate editor coordinating the review of this manuscript and approving it for publication was R. K. Tripathy.)
Semi-Supervised Learning (SSL) is one alternative for solving the drought of labeled data in the sea of raw data. The main strategy of SSL is to train using labeled as well as unlabeled data [5]. The SSL research trend is that consistency-based methods perform better than pseudo-labeling methods on multi-class problems, while pseudo-labeling methods show superior performance on multi-label problems. The multi-class problem is the task of classifying among multiple classes where each image contains one class. The multi-label problem, such as multi-type skin lesion segmentation, is the more realistic and difficult task of classifying multiple classes within a single image. However, the pseudo-labeling method does not work well for the multi-type skin lesion semantic segmentation task. Multi-type skin lesions contain factors that make them difficult to distinguish, such as small pixel variations due to lesion severity, overlapping regions, and irregular shapes. Because of these factors, pseudo-labeled data contain errors, as shown in Figure 1, and these errors are amplified through iterative training with the SSL method. For this reason, SSL does not necessarily improve on the performance of SL even when unlabeled data is added in the form of pseudo-labels; performance is sometimes even compromised.
In this study, we propose the Progressive Weighted Self-training Ensemble (PWStE) method, which can reduce the labeling cost through SSL on multi-type skin lesion data. The proposed idea consists of a Progressive Selector, a self-ensemble model, and a Pseudo Labeler. The Progressive Selector suppresses error amplification by scaling the ratio of labeled, unlabeled, and pseudo-labeled data, progressively increasing the amount of pseudo-labeled data used for training. The self-ensemble model is a strategy for building encoder-decoder DCNNs by combining various models and backbones; it improves pixel-wise classification accuracy and provides tolerance for uncertain pseudo-labeled data. Finally, the Pseudo Labeler transforms the self-ensemble model's inference results on the unlabeled data requested by the Progressive Selector into pseudo-labeled data.
Experimental results show that the proposed method achieved similar performance using a smaller dataset than the SL method.
The overall contributions of our study can be summarized as follows: 1) We designed the Progressive Weighted Self-training Ensemble (PWStE) method to make the pseudo-labeling SSL method usable in the multi-type skin lesion semantic segmentation field. 2) We proposed a simple weighting method called the Progressive Selector, which improves fine-feature segmentation performance by reducing the error amplification of pseudo-label semi-supervised learning.
3) The proposed method can be easily attached to conventional SOTA backbones and architectures.

II. RELATED WORK

A. PREVIOUS WEIGHTING METHODS
SSL trains a model to cover unlabeled data based on the available labeled data. The conventional approach reduces uncertainty by weighting how much confidence to place in the features extracted from the unlabeled data. Consistency-based methods directly weight the supervised loss and the consistency loss based on consistency relationships found in data amplified by augmentation [11], [12], [13]. The pseudo-labeling method applies weights to the labeled data and the generated pseudo-labeled data at the retraining stage [5], [14]. For example, the training dataset D consists of labeled data x_l and y_l and unlabeled data x_u with pseudo-labels \hat{y}_u. Here n and m are the numbers of instances of labeled and unlabeled data, respectively, and w is a weight parameter for the pseudo-labeled data, adjusted to a value between 0 and 1 according to its uncertainty. The dataset D of typical SSL is expressed by the following formula:

D = \{(x_l, y_l)\}_{l=1}^{n} \cup w \cdot \{(x_u, \hat{y}_u)\}_{u=1}^{m} \quad (1)

This approach is very common and works reasonably well: the generated pseudo-labeled data has a lower impact than the labeled data according to the weight w. However, since this method uses all of the unlabeled data at once regardless of the ratio of labeled data, it may cause sample imbalance. In this paper, we argue that a factor in performance degradation is the amplification of error due to such sample imbalance and the low variance between lesions in multi-type skin lesion segmentation.
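As a minimal sketch of this weighting scheme, assuming per-sample weights as one common realization of Equation 1 (`build_ssl_dataset` is a hypothetical helper, not from the paper):

```python
def build_ssl_dataset(labeled, pseudo, w=0.5):
    """Conventional weighted SSL dataset in the spirit of Equation 1:
    labeled samples carry weight 1.0, pseudo-labeled samples the fixed
    weight w in [0, 1]. Note that all m pseudo-labeled samples enter
    training at once, regardless of the labeled/unlabeled ratio -- the
    source of the sample imbalance discussed above."""
    return ([(x, y, 1.0) for x, y in labeled] +
            [(x, y_hat, w) for x, y_hat in pseudo])
```

With one labeled and two pseudo-labeled samples, the dataset holds three weighted triples regardless of the labeled/unlabeled ratio.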

B. SELF-ENSEMBLE MODEL
The ensemble method generally yields better predictions from multiple neural networks than from a single network.
In SSL, the ensemble method has been approached in various ways. Reference [5] reduces the error rate by using different augmentations and dropout regularization across multiple sub-networks. Reference [14] uses a multi-resolution ensemble of stacked dilated U-Nets to improve segmentation accuracy. Reference [15] adopts a deformable part-based model to capture a stable global structure and salient objects. Reference [16] proposed an iterative robust semi-supervised-based imputation. Reference [17] proposed a method to ensemble SL and Unsupervised Learning (UL) based on fuzzy information. In SSL, the ensemble technique can mitigate noise and uncertainty about the ground truth, and it also makes generalization performance robust.

III. METHODOLOGY
The goal of the Progressive Weighted Self-training Ensemble (PWStE) is to reduce the cost of labeled data for multi-type skin lesion segmentation, with its features such as small pixel variations and irregular shapes, by using SSL theory. More specifically, the goal is to use unlabeled data to generate pseudo-labeled data that is meaningful for training through self-training. PWStE consists of the Progressive Selector, Self-Ensemble, and Pseudo Labeler; an overview is shown in Figure 2.

A. PROGRESSIVE SELECTOR
Our proposed Progressive Selector is one way to respond effectively to the sample imbalance problem. In SSL, the sample balance problem is the balance between labeled data and unlabeled data. The training weights are strongly influenced by the samples; therefore, methods that solve the imbalance problem have a significant impact on the performance of deep learning models. In general, SSL consists of a mechanism that first infers on unlabeled data with a model trained on labeled data, and second re-trains on the inferred results. For example, [5], [11], [12], [13], [14], and [18] are designed in such a way that the sample balance greatly affects the learning result, as shown in Equation 1. Of course, the uncertainty and noise of inaccurate unlabeled data are controlled through weights at the augmentation and loss stages, but the uncertainty grows with the number of unlabeled samples. Especially in tasks where every class must be predicted on a per-pixel basis, such as semantic segmentation, the number of samples increases enormously, and uncertainty is amplified by the repetition of training. We propose the simple but effective method of progressively increasing the proportion of pseudo-labeled data, diluting the weights over the samples. The amount of unlabeled data is progressively increased at every training phase. Although this simple method requires heavier computation, it can generate more accurate pseudo-labeled data because it leverages the features of the unlabeled data as well as the labeled data for training. Error amplification can be suppressed by dividing the pseudo-labeled data into mini-batches when classifying multi-type skin lesions for segmentation, in accordance with consistency regularization theory. This is expressed by the following formula:

D_{PWStE} = \{(x_l, y_l)\}_{l=1}^{n} \cup \{(x_u, \hat{y}_u)\}_{u=1}^{\lfloor w \cdot m \rfloor} \quad (2)

Here x_l is the image of the labeled data, y_l is the annotation information of the labeled data, and n is the number of samples in the labeled data.
x_u is an image of the unlabeled data, and \hat{y}_u is the pseudo-annotation generated for the unlabeled data. m is the number of unlabeled samples, and w is the weight that adjusts the number of unlabeled samples used. We use w as the ratio of unlabeled data used, increasing it according to the loss and average pixel accuracy from an initial value of 0.01. This method minimizes the effect of initially non-optimized weights by providing only a small unlabeled set in the early stages of training.
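A sketch of how the Progressive Selector's ratio w might be scheduled and applied; the step size and the exact improvement test are assumptions, since the text only fixes the initial value 0.01 and states that w grows according to the loss and average pixel accuracy:

```python
import math

def update_ratio(w, prev_loss, loss, prev_acc, acc, step=0.05):
    """Grow the unlabeled-data ratio w (hypothetical rule): increase by
    `step` only when both the loss and the average pixel accuracy have
    improved since the previous round, capped at 1.0."""
    if loss < prev_loss and acc > prev_acc:
        return min(1.0, w + step)
    return w

def select_unlabeled(unlabeled, w):
    """Hand the first floor(w * m) unlabeled samples to this round,
    matching the upper summation bound of Equation 2."""
    return unlabeled[: math.floor(w * len(unlabeled))]
```

Starting from w = 0.01, only 1% of the unlabeled pool is pseudo-labeled in the first round.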

B. SELF-ENSEMBLE
The Self-Ensemble obtains meaningful pseudo-labeled data through a bagging-style ensemble of encoder-decoder combinations. The proposed self-ensemble focuses on exploiting the ground truth by generating inference results from various perspectives through different combinations of models and backbones, as shown in Figure 3. We believe that the effect of using various feature filters can be obtained with SOTA backbones such as ResNet50 [6], EfficientNet-b3 [7], InceptionV3 [8], DenseNet121 [9], SE-ResNet101 [10], and SE-ResNeXt101 [10]. Also, resolution-related viewpoints can be secured through various segmentation models such as U-Net [1], FPN [2], LinkNet [3], and PSPNet [4]. Each generated viewpoint is evaluated with the per-class Intersection over Union (IoU) metric on its inference results. The proposed self-ensemble model is also designed to train while effectively controlling sample imbalance and class imbalance. Sample imbalance is handled through a structure that learns in concert with the Progressive Selector, which gradually increases the ratio of unlabeled data from a very small amount. Class imbalance is the foreground-background balance problem traditionally dealt with in semantic segmentation tasks; in general, since the background contains more pixels than the foreground, reducing the bias from the background can improve per-class classification accuracy. We therefore combine BCE loss [19] and Dice loss [20] to form our loss function L:

L = \lambda \cdot L_{bce} + (1 - \lambda) \cdot L_{dice} \quad (5)

Our loss L in Equation 5 enables high-tolerance pixel classification for multi-type skin lesions with complex color and gradient distributions. λ is a hyper-parameter with a value between 0 and 1; 0.3 gave the best results on our MSLD dataset. Training is performed through iterations of the training and pseudo-labeling procedures.
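The combined loss of Equation 5 can be sketched on flat per-pixel probabilities (pure Python for clarity; a real implementation would use tensor operations, and λ = 0.3 as reported for MSLD):

```python
import math

def bce_dice_loss(pred, target, lam=0.3, eps=1e-7):
    """L = lam * BCE + (1 - lam) * Dice (Equation 5).
    pred:   per-pixel foreground probabilities in [0, 1]
    target: per-pixel binary ground-truth labels"""
    n = len(pred)
    # Binary cross-entropy averaged over pixels
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / n
    # Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|)
    inter = sum(p * t for p, t in zip(pred, target))
    dice = 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)
    return lam * bce + (1 - lam) * dice
```

A perfect prediction drives both terms toward zero, while a fully wrong prediction is penalized by both the BCE and the Dice term.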
The training procedure learns D_PWStE passed from the Progressive Selector with model combinations indexed by the classifier parameter i. The training data are labeled data with annotations and unlabeled data. After training, the training procedure transfers the weights and their IoU evaluation results to the pseudo-labeling procedure. The pseudo-labeling procedure infers on the unannotated data in D_PWStE using the received model information (weights and IoU). The prediction result and the class IoU are then combined by a dot product. This result is sent to the Pseudo Labeler, and the annotation \hat{y}_u is generated from the pixel-wise highest confidence through an argmax operation. Finally, when all unlabeled data has been used for training, the pseudo-labeled data is rearranged using the final weights. The PWStE algorithm is shown in Algorithm 1.
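The training/pseudo-labeling iteration described above can be sketched as follows; `train`, `evaluate_iou`, `fuse_pseudo_labels`, and `schedule` are hypothetical callbacks standing in for the procedures of Algorithm 1:

```python
import math

def pwste_rounds(labeled, unlabeled, classifiers, rounds,
                 train, evaluate_iou, fuse_pseudo_labels, schedule):
    """One possible shape of the PWStE loop: train each ensemble
    member on labeled + current pseudo-labeled data, evaluate per-class
    IoU, pseudo-label a progressively larger slice of the unlabeled
    pool, and grow the ratio w for the next round."""
    pseudo, w = [], 0.01                      # initial ratio from the paper
    for _ in range(rounds):
        data = labeled + pseudo               # D_PWStE for this round
        ious = []
        for clf in classifiers:
            train(clf, data)
            ious.append(evaluate_iou(clf))
        k = math.floor(w * len(unlabeled))    # Equation 2's upper bound
        pseudo = fuse_pseudo_labels(classifiers, ious, unlabeled[:k])
        w = schedule(w)
    return pseudo
```

Because the pseudo-labeled slice is rebuilt each round from the freshly trained ensemble, earlier noisy pseudo-labels do not accumulate unchecked.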

C. PSEUDO LABELER
We define the Pseudo Labeler as a module that determines the class from the weighted prediction maps of the self-ensemble model. The Pseudo Labeler uses a strategy that collects the opinions of all active classifiers while trusting maps with higher per-class IoU. The prediction result is scaled by the dot product of the class IoU with each class channel, expressed by the following formula:

\hat{y}_u = \underset{c}{\operatorname{argmax}} \left( \frac{1}{n} \sum_{i=1}^{n} pm_i \odot pi_i \right) \quad (6)

Here pm is the prediction map composed of the h, w, and c axes, where h is height, w is width, and c is the class; pi is the predicted class IoU vector constructed from class-specific IoU values; and n is the number of ensemble models. For example, if the IoU of the n-th class is 0.53, only 53% of the prediction result for the n-th class is referenced for sharpening. The same operation is performed on the results of all classifiers. The results of the activated classifiers, after the IoU dot product operation, are merged using a sum operation for each channel. Next, the weighted prediction map is normalized by dividing by the number of active classifiers. Finally, predicting the class for each pixel via the argmax operation produces the pseudo-label data \hat{y}_u.
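A pure-Python sketch of this fusion; the list layout ([h][w][c] maps and length-c IoU vectors) is an assumption for illustration:

```python
def pseudo_label(pred_maps, class_ious):
    """Fuse n classifiers in the spirit of Equation 6: scale each
    prediction map channel-wise by its classifier's per-class IoU,
    sum over the classifiers, normalize by n, then take the
    per-pixel argmax."""
    n = len(pred_maps)
    h, w = len(pred_maps[0]), len(pred_maps[0][0])
    c = len(class_ious[0])
    labels = []
    for i in range(h):
        row = []
        for j in range(w):
            fused = [sum(pm[i][j][k] * iou[k]
                         for pm, iou in zip(pred_maps, class_ious)) / n
                     for k in range(c)]
            row.append(max(range(c), key=fused.__getitem__))
        labels.append(row)
    return labels
```

For instance, a pixel scored [0.5, 0.5] by a classifier whose per-class IoUs are [0.9, 0.1] is sharpened toward class 0.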

B. EVALUATION OF PROGRESSIVE SELECTOR
We evaluated the performance of the Progressive Selector on the MSLD dataset. The evaluation compared the progressive method of Equation 2, a non-progressive method, and the previous weighting method of Equation 1. The progressive method trains by progressively increasing the amount of data used for learning from 1% according to the Progressive Selector. The non-progressive method trains with a fixed increase of 10% of the data per training round. The previous weighting method uses all of the data for training and weights the data itself; the experiments tried weights from 1 to 10 and report the best results. We compare the IoU of the progressive, non-progressive, and previous weighting methods. This comparison evaluates the performance for hair, normal skin, melasma, PIH, freckle, papule, and dark circle, and shows that IoU performance for skin lesions can be improved by solving the sample imbalance problem.
The model and backbone were U-Net and EfficientNet-b3, and no ensemble was used. The progressive method resulted in IoU reductions of 8% and 1% in the normal skin and hair regions, respectively. However, it shows better results than the previous weighting method for the target classes: melasma 6%, PIH 14%, freckle 3%, papule 2%, and dark circle 10%, as shown in Figure 6. Additionally, we conducted an experiment comparing the progressive and non-progressive methods to demonstrate the amplification of errors triggered by sample imbalance. The results showed that the progressive method performed better on skin lesion segmentation than the non-progressive method, and the non-progressive method in turn showed better results than the previous weighting method. In short, the sample imbalance problem does affect the accuracy of the model, which means accuracy can be improved by controlling the ratio of labeled to unlabeled data.

C. EVALUATION OF PROGRESSIVE WEIGHTED SELF-TRAINING ENSEMBLE
We could not find a public dataset containing multi-type skin lesions with which to evaluate the performance of the proposed PWStE, and the MSLD dataset is difficult to disclose due to the privacy policy covering its subjects. The training models of this experiment were implemented through modification of Segmentation Models [21]. The random seed for all cases was fixed at 110 for consistency. We used the U-Net, FPN, LinkNet, and PSPNet models and the ResNet50, EfficientNet-b3, InceptionV3, DenseNet121, SE-ResNet101, and SE-ResNeXt101 backbones.
We use the pixel accuracy and mean Intersection over Union (mIoU) metrics for evaluation. Pixel accuracy is an extension of the commonly known accuracy metric to pixel units, as shown in Equation 7; mIoU averages the per-class IoU:

\text{Pixel Accuracy} = \frac{\sum_{i=1}^{N} PR_{ii}}{\sum_{i=1}^{N} GT_i} \quad (7)

\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{PR_{ii}}{GT_i + \sum_{j=1}^{N} PR_{ji} - PR_{ii}} \quad (8)

Here N represents the number of classes, and the term GT_i in the denominator is the total number of pixels of the i-th class in the ground truth. The numerator term PR_ii is the number of pixels predicted as class i whose ground truth is also class i, while PR_ji is the set of pixels with ground truth j that are incorrectly classified as class i.
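Both metrics can be computed from an N×N confusion matrix; in this sketch `conf[j][i]` counts pixels whose ground truth is class j and whose prediction is class i (the layout is an assumption for illustration):

```python
def pixel_accuracy(conf):
    """Pixel accuracy: correctly classified pixels over all pixels."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total

def mean_iou(conf):
    """mIoU: mean over classes of PR_ii / (GT_i + sum_j PR_ji - PR_ii)."""
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        gt = sum(conf[i])                          # GT_i: all ground-truth-i pixels
        pred = sum(conf[j][i] for j in range(n))   # sum_j PR_ji: all predicted-i pixels
        denom = gt + pred - tp
        ious.append(tp / denom if denom else 0.0)
    return sum(ious) / n
```

On a two-class matrix [[3, 1], [1, 5]], pixel accuracy is 8/10 and the class IoUs are 3/5 and 5/7.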
Since the MSLD dataset is not public, we present the supervised learning results for all combinations in Table 1 to aid understanding. The ratio of labeled data in the train dataset is the percentage of labeled data used for training; 100% is the same as the SL method, which uses 1,000 images for training, and 50% means that only 500 labeled images are used for training. The best result in the 100% case is 61.35% with the U-Net and EfficientNet-B3 combination, and the average result over all cases is 55.43%.

TABLE 3. The comparison of mIoU performance between PWStE and the previous SSL ensemble methods. 50% means the ratio of labeled to unlabeled data within the 1,000 images.

Table 2 shows the performance of the Progressive Weighted Self-training Ensemble (PWStE) according to the number of activated classifiers; the 50% case means the model used 50% of the data as labeled data and the remaining 50% as unlabeled data. The number of ensemble classifiers is adjusted according to the hyperparameter i. We randomly select combinations of different backbones and models, stack them up to i, and report the average over 10 runs. The best results are obtained when i is 7: mIoU 51.16 and pixel accuracy 58.97 at 50%, which is 3.97% better than the mIoU of 47.19 achieved by the best combination (U-Net and SE-ResNeXt101) in Table 1. The 70% case shows mIoU 58.97 and pixel accuracy 88.22, which is 5.25% better than the mIoU of 53.72 of the best combination in Table 1 at 70%. We argued that using multiple viewpoints and feature filters across ensembles would help generate pseudo-labeled data for multi-type skin lesion semantic segmentation, and this experiment shows that different combinations of ensemble structures lead to mIoU enhancement. Of course, simply stacking ensembles indefinitely does not keep improving results: on the MSLD dataset, we confirmed that saturation occurs for i between 5 and 8.

FIGURE 6. Comparison of pseudo-label data quality generated by PWStE and the Π-model [5]. Image is a cropped RGB image of a part of MSLD. Ground truth is labeled data annotated by dermatologists. Π-model is the pseudo-labeled data generated through the Π-model. PWStE n = 5 is the result of generating pseudo-labeled data using 5 classifiers.
Finally, we compare the performance of supervised learning, previous semi-supervised learning, and the PWStE method in Table 3. For the supervised learning results, we selected the U-Net+EfficientNet-B3 combination and the U-Net+SE-ResNeXt101 combination, which gave the best results at 50% and 70% in Table 1; these SL entries used only 50% and 70% of the total data, respectively. For comparison with previous SSL, we implemented the Π-model [5] and the Multi-resolution model [14] and trained them on our data. The proposed PWStE is reported according to the number of combinations being ensembled. Here, unlike SL, the previous SSL and PWStE methods additionally used the remaining unlabeled data for learning. The SSL methods (previous SSL and PWStE) showed better results than the SL method by an average of 4.86% in the 50% case and 5.08% in the 70% case. PWStE with classifier i = 7 shows the best result, mIoU 58.97, in the 70% case, compared to 55.19 for the Π-model and 57.26 for the Multi-resolution model. Nevertheless, in the 50% case, the results are slightly lower than those of the compared previous SSL methods; this is because self-ensembles need enough data to work well. At ratios of 70% and above, the results were better than the performance of 100% SL and previous SSL.

V. CONCLUSION
We proposed the Progressive Weighted Self-training Ensemble (PWStE) method, which can reduce the labeling cost by utilizing unlabeled data in multi-type skin lesion semantic segmentation, where lesions show small pixel variations and irregular shapes. The Progressive Selector attenuated the amplification of errors in an environment where lesions overlap, and the ensemble model reduced the error of the pseudo-labeled data. We verified through experiments that PWStE obtains results similar to those of supervised learning with 100% of the labeled data, even when using 30% less labeled data. This means that it is more accurate to generate pseudo-labeled data by ensembling the inference results of multiple models without sample and class imbalance. In fact, most deep learning deployment environments face unlabeled-data processing issues, and we expect that our PWStE will help solve these issues, especially in the multi-type skin lesion field with its large annotation cost. However, the computation time increases by about five times compared to previous methods. In the future, we plan to design a lightweight model considering the time complexity of PWStE. In addition, there remains the problem that SSL does not exceed the performance of SL when the same amount of data is used.