MobileGAN: Skin Lesion Segmentation Using a Lightweight Generative Adversarial Network

Abstract. Skin lesion segmentation in dermoscopic images is challenging because lesion boundaries are blurry and irregular. Most deep-learning segmentation approaches are time- and memory-consuming because they have hundreds of millions of parameters. Consequently, it is difficult to deploy them on real dermatoscope devices with limited GPU and memory resources. In this paper, we propose a lightweight and efficient Generative Adversarial Network (GAN) model, called MobileGAN, for skin lesion segmentation. More precisely, MobileGAN combines 1D non-bottleneck factorization networks with position and channel attention modules in a GAN model. The proposed model is evaluated on the test dataset of the ISBI 2017 challenge and the validation dataset of the ISIC 2018 challenge. Although the proposed network has only 2.35 million parameters, it remains comparable with the state of the art: the experimental results show that MobileGAN achieves an accuracy of 97.61%.


Introduction
Skin cancer is one of the most widespread cancers. According to the WHO, there were 1.04 million cases in 2018. Over the last decades, the incidence of both melanoma and non-melanoma skin cancers has increased rapidly [3]. Melanoma is the most dangerous type of skin cancer, and 75% of skin cancer deaths are related to it [4]. Dermoscopy, a computerized non-invasive image analysis technique, is becoming very important for physicians to inspect pigmented skin lesions and detect malignant melanoma at an early stage [10], in order to improve the survival rate and reduce cost. Consequently, a Computer-Aided Diagnosis (CAD) system is essential to support dermatologists in examining dermoscopic images and segmenting melanomas as precisely as possible.
Several melanoma segmentation methods have been proposed in the literature [2]. The main challenges in segmenting pigmented skin lesions include the huge diversity in color, shape, texture and size, the low contrast between skin tissues, the irregular and fuzzy boundaries, and the presence of blood vessels and hairs [2]. Several methods cope with these challenges using traditional image processing algorithms, such as histogram thresholding, unsupervised clustering, and supervised segmentation (see the overview in [7]). However, these approaches yield inaccurate segmentation results when the skin lesions have fuzzy boundaries [7]. In addition, their performance relies heavily on pre-processing algorithms, such as hair removal and contrast enhancement.
With the rapid progress in deep learning, many skin lesion segmentation approaches have been introduced that increase segmentation accuracy. For instance, the SLSDeep model was proposed in [17] to segment skin lesions using feature pyramid pooling. In [2], a full resolution convolutional network (FrCN) was introduced to directly learn the full resolution features of each pixel of the input image without the need for pre- or post-processing operations. Besides, a GAN with a multi-scale loss function, called SegAN, has also been proposed for skin lesion segmentation in [18]. All of the methods mentioned above provide high precision. However, they have tens or hundreds of millions of parameters.
In this paper, we propose a lightweight GAN model, named MobileGAN, for skin lesion segmentation in dermoscopic images. In the proposed model, we extract low-level features with multi-scale convolutional networks. To reduce the computational cost, the proposed model uses a 1D non-bottleneck factorization network. Moreover, position and channel attention modules are used to improve the feature representation in both the spatial and channel dimensions. The contributions are: 1) to cope with shadows by assuming that only true lesion regions appear at multiple scales (consequently, a multi-scale block is introduced to aggregate the coarse-to-fine features of dermoscopic images), 2) to reduce the computational cost by using a 1D non-bottleneck factorized network [15], 3) to enhance the discriminant ability of feature representations in the spatial and channel dimensions by using both position and channel attention modules [11], and 4) to use a combination of the binary cross entropy, Jaccard and L1-norm losses as the loss function for training the modified GAN model.
The generative adversarial network pix2pix [12] has been used in different tasks, such as synthetic image generation and medical image segmentation. It consists of two main networks: a generator G and a discriminator D.
The generator is an encoder-decoder architecture that learns the mapping from domain A (the skin image) to domain B (the segmented lesion). The discriminator compares the generated segmentation masks with the real segmented images. Figure 1 presents the architecture of the proposed model, which, like the pix2pix model, consists of the G and D networks. Recall that, to alleviate false detections due to shadows, a multi-scale block is used to aggregate the coarse-to-fine features of dermoscopic images. Below, we explain the encoder and decoder networks of the generator and the discriminator network in detail.
The encoder network: The input images of the encoder of the generator network G are scaled to four resolutions (i.e., the original input size and three other resolutions), as shown in Figure 1. The four resolutions are fed into four convolutional blocks to generate 4 × 16 feature maps. The four convolutional blocks are followed by four channel attention modules (CAM) to capture visual feature dependencies in the channel dimension (for more details, see the supplementary A.6). Afterward, we upsample the feature maps of the three rescaled inputs to the size of the original input image using bilinear interpolation, and then average the feature maps of the four scales to generate 1 × 16 feature maps. In this way, the encoder network extracts low-level features at different scales in order to cope with shadows. In addition, the resulting feature maps are created in both the spatial and frequency domains. The resulting 16 feature maps are fed into two Convolutional-Downsampling-Attention (CDA) layers. Each CDA layer comprises a convolutional block followed by a max pooling of 2 and then a Position Attention Module (PAM) to capture the spatial features (for more details, see the supplementary A.5). The two layers produce 64 feature maps that are fed into the next four factorized-attention (FCA) layers. Each FCA layer consists of a non-bottleneck factorized block followed by a CAM. The resulting feature maps are fed into a CDA layer to obtain 128 feature maps, which are fed into eight FCA layers. The result of the eighth FCA layer is fed into a non-bottleneck factorized block followed by two parallel attention blocks, a CAM and a PAM, whose outputs are summed to capture high-level visual features independently of the position and channel dimensions. The final 128 feature maps are fed into the decoder to construct the segmented image.
The decoder network: We upsample the final output of the encoder to feed both streams. Each stream consists of one Deconvolutional-Upsampling-Attention (DUA) layer and two FCA layers. The final feature maps are upsampled to obtain the segmented image. In all layers of the encoder and decoder networks, we used convolutional and deconvolutional filters with a kernel size of 3 × 3, a stride of 2 and a padding of 1 (for more details, see the supplementary A.4). In the testing phase, the trained generator network G is used to produce the segmentation mask for each test image.
The discriminator network: It comprises four convolutional and downsampling layers. The four convolutional layers use a kernel of 4 × 4, a stride of 2, and a padding of 1. In the second layer, a PAM block is added after the convolutional block, while in the third layer, a CAM block is added.
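To make the cost argument concrete, the following sketch counts the weights of a standard 3 × 3 convolution and of its 1D factorization (a 3 × 1 convolution followed by a 1 × 3 convolution), and applies the usual convolution output-size formula to the four discriminator layers described above. The function names, the 64-channel example and the 128 × 128 input are illustrative assumptions, bias terms are ignored, and the sketch is not the authors' implementation.

```python
def conv2d_weights(in_ch, out_ch, k_h, k_w):
    """Number of weights in a 2D convolution (bias terms ignored)."""
    return in_ch * out_ch * k_h * k_w


def standard_3x3_weights(channels):
    """One 3x3 convolution that keeps the channel count unchanged."""
    return conv2d_weights(channels, channels, 3, 3)


def factorized_1d_weights(channels):
    """A 3x1 convolution followed by a 1x3 convolution, same channel count."""
    return (conv2d_weights(channels, channels, 3, 1)
            + conv2d_weights(channels, channels, 1, 3))


def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


channels = 64  # hypothetical channel count, chosen only for illustration
full = standard_3x3_weights(channels)
factorized = factorized_1d_weights(channels)
print("3x3 weights:", full)
print("3x1 + 1x3 weights:", factorized)
print("fraction kept:", factorized / full)

size = 128  # hypothetical input resolution
for layer in range(1, 5):
    # Discriminator layers: 4x4 kernel, stride 2, padding 1 (halves the size).
    size = conv_output_size(size, kernel=4, stride=2, padding=1)
    print("discriminator layer", layer, "output size:", size)
```

For equal input and output channel counts, the factorized pair keeps two thirds of the weights of the 3 × 3 convolution, whatever the channel count.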

Model training
The G and D networks are trained alternately by back-propagation in an adversarial fashion: we first fix G and train D for one step using gradients computed from the loss function, and then fix D and train G for one step using gradients computed from the same loss function, passed from D to G. Assume x is a dermoscopic image containing a lesion, y is the ground-truth segmentation of that lesion, and G(x, z) and D(x, G(x, z)) are the outputs of the generator and the discriminator, respectively. The loss function of the generator G comprises three terms: a binary cross entropy loss, an L1-norm loss to boost the outliers, and a Jaccard loss to increase the intersection. The Jaccard and L1 terms are weighted by the empirical factors λ and α, respectively. The variable z is a random variable introduced as dropout in the decoding layers in both the training and testing phases, which helps to generalize the learning process and avoid overfitting. The L1 loss is also necessary to speed up the learning process, which may otherwise be too slow because the adversarial loss term alone may not steer the gradient towards the expected shape of the segmented lesion. In addition, we optimize the Jaccard loss (JL) for the lesion class (for more details, see the supplementary A.1).
If the generator network is optimized properly, the values of D(x, G(x, z)) approach 1.0, meaning that the discriminator cannot differentiate the generated segmentation mask from the ground truth, while the L1 and Jaccard losses approach 0.0, indicating that every generated mask matches the corresponding ground truth mask both in overall pixel-to-pixel distance (L1) and in the tight convex surrogate (JL) of the Intersection-over-Union (IoU).
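Written out, a generator loss consistent with the three terms described above could take the following form. The symbol names \ell_{Gen}, \ell_{L1} and \ell_{JL} are assumptions, as is the assignment of λ to the Jaccard term and α to the L1 term, which follows the order in which the two weighting factors are listed in the implementation details; the exact definition of the Jaccard surrogate is left to the supplementary A.1.

```latex
\ell_{Gen}(G, D) =
    \mathbb{E}_{x,y,z}\big[ -\log D(x, G(x, z)) \big]
  + \alpha \, \mathbb{E}_{x,y,z}\big[ \ell_{L1}(y, G(x, z)) \big]
  + \lambda \, \mathbb{E}_{x,y,z}\big[ \ell_{JL}(y, G(x, z)) \big],
\qquad
\ell_{L1}(y, G(x, z)) = \lVert y - G(x, z) \rVert_{1}
```

Under this form, the first term vanishes as D(x, G(x, z)) approaches 1.0, and the remaining two terms vanish when the generated mask coincides with the ground truth.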
The discriminator loss function can be formulated as follows:
$$\ell_{Dis}(G, D) = \mathbb{E}_{x,y}\big[-\log D(x, y)\big] + \mathbb{E}_{x,y,z}\big[-\log\big(1 - D(x, G(x, z))\big)\big]$$
The optimizer fits D to maximize its output for ground truth images (by minimizing −log(D(x, y))) and to minimize its output for predicted masks (by minimizing −log(1 − D(x, G(x, z)))). These two terms compute the binary cross entropy (BCE) loss on both kinds of images, assuming that the expected class for ground truth and generated images is 1 and 0, respectively.
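The behaviour of these losses can be checked numerically with a minimal sketch in plain Python. Masks are flattened lists of values in [0, 1], the discriminator outputs are passed in as plain numbers, and all function names are hypothetical. The soft Jaccard loss used here is a simple stand-in and not the convex surrogate of the supplementary A.1; the default weights mirror the reported values of 0.1 (Jaccard) and 0.5 (L1).

```python
import math

EPS = 1e-7  # numerical guard for logarithms and divisions


def discriminator_loss(d_real, d_fake):
    """BCE with target 1 for D(x, y) and target 0 for D(x, G(x, z))."""
    real_term = -math.log(max(d_real, EPS))
    fake_term = -math.log(max(1.0 - d_fake, EPS))
    return real_term + fake_term


def l1_loss(target, pred):
    """Mean absolute pixel-to-pixel difference."""
    return sum(abs(t - p) for t, p in zip(target, pred)) / len(target)


def soft_jaccard_loss(target, pred):
    """1 - soft IoU (stand-in for the Jaccard surrogate, an assumption)."""
    inter = sum(t * p for t, p in zip(target, pred))
    union = sum(target) + sum(pred) - inter
    return 1.0 - (inter + EPS) / (union + EPS)


def generator_loss(d_fake, target, pred, jaccard_weight=0.1, l1_weight=0.5):
    """Adversarial BCE term plus weighted L1 and Jaccard terms."""
    adversarial = -math.log(max(d_fake, EPS))
    return (adversarial
            + l1_weight * l1_loss(target, pred)
            + jaccard_weight * soft_jaccard_loss(target, pred))


mask = [0.0, 1.0, 1.0, 0.0]
poor_prediction = [1.0, 0.0, 1.0, 0.0]

# A perfect mask that fully fools the discriminator gives a loss of zero.
print(generator_loss(1.0, mask, mask))
# A poor mask that the discriminator rejects is penalised by all three terms.
print(generator_loss(0.1, mask, poor_prediction))
# An undecided discriminator (output 0.5 for both inputs) pays 2 * log(2).
print(discriminator_loss(0.5, 0.5))
```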

Experiments
Datasets: The efficacy of the proposed model is assessed on two publicly available benchmark datasets of dermoscopic images for skin lesion analysis: ISIC 2018 (Skin Lesion Analysis Towards Melanoma Detection grand challenge datasets) [8] and ISBI 2017 (IEEE International Symposium on Biomedical Imaging 2017 grand challenge datasets) [9]. The ISIC 2018 training set includes 2,594 images with the corresponding ground truth masks annotated by expert dermatologists. The validation and testing sets contain 100 and 1,000 images, respectively, without ground truth (they are evaluated online only). In our experiments, we used 80% of the training set of the ISIC 2018 dataset for training and 20% for validation, as proposed in [2]. In turn, the ISBI 2017 dataset is divided into training, validation and testing sets with 2,000, 150 and 600 images, respectively. Note that we trained our model on the ISIC 2018 training set and evaluated it on the ISBI 2017 test set and the ISIC 2018 validation set.
Evaluation Metrics: Five evaluation metrics are used to assess the performance of our model on the ISBI 2017 test dataset: Jaccard index (JAC), Dice coefficient (DIC), Accuracy (ACC), Specificity (SPE) and Sensitivity (SEN) [9]. We used the threshold Jaccard index JAC_th to evaluate our model on the ISIC 2018 validation dataset (for further details, see the supplementary A.2).
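As an illustration of these metrics, the sketch below computes them from the pixel-wise confusion counts of a flattened binary mask. The function names are hypothetical. The threshold of 0.65 in the threshold Jaccard index is the value used by the ISIC 2018 challenge and is stated here as an assumption, since the paper defers its definition to the supplementary material.

```python
def confusion_counts(pred, truth):
    """Pixel-wise TP, TN, FP, FN for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return 0.0 instead of failing when a denominator is empty."""
    return num / den if den else 0.0


def segmentation_metrics(pred, truth):
    """JAC, DIC, ACC, SEN and SPE for one predicted mask."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return {
        "JAC": safe_div(tp, tp + fp + fn),
        "DIC": safe_div(2 * tp, 2 * tp + fp + fn),
        "ACC": safe_div(tp + tn, tp + tn + fp + fn),
        "SEN": safe_div(tp, tp + fn),
        "SPE": safe_div(tn, tn + fp),
    }


def threshold_jaccard(jaccards, threshold=0.65):
    """Mean Jaccard index where per-image scores below the threshold count as 0."""
    kept = [j if j >= threshold else 0.0 for j in jaccards]
    return sum(kept) / len(kept)


pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(segmentation_metrics(pred, truth))
print(threshold_jaccard([0.9, 0.6, 0.7]))
```

The thresholded variant penalises images whose segmentation fails outright, which a plain mean Jaccard index would partly hide.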
Data augmentation: To achieve accurate segmentation results, we augment the two datasets by flipping the images horizontally and vertically, applying gamma reconstruction, and changing the contrast of the original RGB images using contrast-limited adaptive histogram equalization (CLAHE) with different values.
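A minimal sketch of the flipping and gamma steps is given below for a single-channel image stored as nested lists with values in [0, 1]. It interprets gamma reconstruction as a power-law intensity adjustment, which is an assumption, as are the function names and the gamma values; CLAHE is omitted because it needs an image processing library. Geometric transformations are applied to the image and its mask together, whereas intensity changes leave the mask untouched.

```python
def flip_horizontal(image):
    """Mirror every row (left-right flip)."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (up-down flip)."""
    return [list(row) for row in reversed(image)]


def adjust_gamma(image, gamma):
    """Power-law intensity mapping for pixel values in [0, 1]."""
    return [[pixel ** gamma for pixel in row] for row in image]


def augment(image, mask, gammas=(0.8, 1.2)):
    """Return a list of (image, mask) pairs including the original pair."""
    pairs = [
        (image, mask),
        (flip_horizontal(image), flip_horizontal(mask)),
        (flip_vertical(image), flip_vertical(mask)),
    ]
    # Intensity changes do not move pixels, so the mask is reused as is.
    pairs += [(adjust_gamma(image, g), mask) for g in gammas]
    return pairs


image = [[0.0, 0.25], [0.5, 1.0]]
mask = [[0, 0], [1, 1]]
for aug_image, aug_mask in augment(image, mask):
    print(aug_image, aug_mask)
```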
Implementation: We trained the model with the Adam optimizer [13] and achieved the best results with the parameters β1 = 0.5 and β2 = 0.999. The learning rate was set to 0.0002 with a batch size of 8. The weighting factors of the Jaccard loss and the L1-norm loss (λ and α) were set to 0.1 and 0.5, respectively. Our experiments were carried out on an NVIDIA 1080Ti GPU with 11 GB of memory, and training the network took around 8 hours. The model is implemented with the PyTorch deep learning library.
Experimental results: The size of the images ranges from 542 × 718 to 2848 × 4288 pixels, which is very large for training the proposed model. Each input image was therefore resized to q × q pixels to speed up the training process. We trained and tested our model with different image sizes (64 × 64, 128 × 128 and 256 × 256). The best segmentation results were obtained with an input size of 128 × 128 (for detailed results, see the supplementary A.3). The quantitative results are reported in Table 1 and Table 2. With the ISBI 2017 test dataset, we compared MobileGAN with five skin lesion segmentation methods (FCN [14], U-Net [16], SegNet [5], FrCN [2] and SLSDeep [17]) and an adversarial network, SegAN [18]. We took the test results of FCN, U-Net, SegNet and FrCN from [2], which used the same dataset. As shown, the proposed MobileGAN model yields the best results in terms of ACC and SPE. MobileGAN improves the ACC score by 3.51% over the SegAN model and the SPE score by 1.62% over the SLSDeep model. In turn, the SLSDeep model yields slightly better results, with an improvement of 0.17% over our model. The SegAN model gives a better JAC score than our model, with an improvement of 0.52%. Also, the FrCN model achieves a SEN score 6.9% higher than that of our model. Regarding the ISIC 2018 validation dataset, we compared MobileGAN to the FCN, U-Net, SegNet, FrCN and GAN-FCN models, as shown in Table 2. We took the validation results of FCN, U-Net, SegNet and FrCN from [1]. Our model achieves the highest JAC_th score, with an improvement of 0.6% over the GAN-FCN model and of 26% over the U-Net model.
In addition, we compared the MobileGAN model to the FCN, U-Net, SegNet, FrCN, SegAN and GAN-FCN models in terms of the number of parameters. MobileGAN has only 2.35 million parameters, while the closest model, GAN-FCN, has 10.61 million. In turn, SegAN, which uses the traditional GAN model, is the largest model with 382.17 million parameters. It is evident that using non-bottleneck factorized blocks together with position and channel attention modules significantly reduces the number of parameters of the MobileGAN model. Besides, MobileGAN has 57×, 5×, 4×, 6× and 19× fewer parameters than the FCN, U-Net, SegNet, FrCN and SLSDeep models, respectively.
Fig. 2 shows qualitative segmentation results of the MobileGAN model on some examples from the ISBI 2017 test dataset. As shown in Fig. 2 (left), although the tested images exhibit a high similarity between the colors of the lesion and skin regions, fuzzy boundaries and even very small lesions, the MobileGAN model accurately segments the boundary of each skin lesion, with an accuracy of about 95%. In contrast, in the four images shown in Fig. 2 (right), the skin regions (the background) are very small compared to the lesion regions, which occupy most of the image and intersect three of its margins. In these cases, MobileGAN yields inaccurate segmentations, because it is difficult to delineate the boundaries of the tumors accurately. This means that our model needs the complete shape of the lesion area to properly segment the boundaries of the lesion regions.

Table 1. Evaluation of the proposed model on the ISBI 2017 test dataset

Table 2. Evaluation of the proposed model on the ISIC 2018 validation dataset

Conclusions
In this paper, we have proposed a lightweight yet efficient GAN model (MobileGAN) for skin lesion segmentation. MobileGAN is built by adapting the GAN model with 1D non-bottleneck factorization networks and position and channel attention blocks. Compared to state-of-the-art skin melanoma segmentation models, the number of parameters of the MobileGAN model is significantly reduced, to only 2.35 million. The MobileGAN model has been evaluated on the ISBI 2017 test and ISIC 2018 validation datasets. With the ISBI 2017 test dataset, it yields appropriate segmentation results with an accuracy of 97.61% and a specificity of 99.92%. The proposed model also provides a Jaccard index and a sensitivity of 77.98% and 78.50%, respectively, which is comparable to the state of the art. The proposed model achieves a threshold Jaccard score of 78.4% on the ISIC 2018 validation dataset. Future work will aim to implement a mobile application based on the MobileGAN model to segment skin lesions in images captured by a low-resolution camera.