Skin Lesion Segmentation Based on Edge Attention Vnet with Balanced Focal Tversky Loss

Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Computer Engineering/School of Natural and Applied Sciences, Gazi University, Ankara, Turkey
Department of Computer Science, College of Computer Engineering and Sciences in Al-Kharj, Prince Sattam Bin Abdulaziz University, P.O. Box 151, Al-Kharj 11942, Saudi Arabia
Department of Electrical and Electronics Engineering, Faculty of Engineering, Bolu Abant Izzet Baysal University, 14280 Bolu, Turkey


Introduction
Melanoma, a deadly skin cancer, is predicted to be the fifth most frequently diagnosed cancer for men (57,180 cases) and women (42,600 cases) in 2022 [1]. In the United States, the annual cost of treating skin cancers is $8.1 billion: about $4.8 billion for nonmelanoma skin cancers and $3.3 billion for melanoma [2]. The number of diagnoses and treatments for nonmelanoma skin cancers in the USA increased by 77% between 1994 and 2014 [3]. Approximately 90% of nonmelanoma skin cancers are associated with ultraviolet (UV) radiation from the sun [4]. Board-certified dermatologists very often use dermoscopy, a noninvasive technique that is helpful in diagnosing skin lesions. A dermoscope is a magnifying device that uses either a liquid applied between the device and the patient's skin or cross-polarized light to make the epidermis translucent, allowing the structures in the epidermis and superficial dermis to be visualized. Seeing very small skin lesions that are not visible to the naked eye is essential to a physician's decision. Dermoscopy provides better performance than traditional methods thanks to the ABCD criteria [5]. However, diagnosing the lesioned area with a dermoscopy device is time-consuming and challenging due to artifacts such as skin hairs and blood vessels and to the similarity and low contrast between normal and lesioned skin. Therefore, deep convolutional neural networks (DCNNs), which identify images automatically, have been used to overcome this difficult task. Such systems try to determine lesion boundaries and make high-accuracy decisions based on skin lesion segmentation [6][7][8][9][10][11].
Convolutional neural networks (CNNs) have achieved high performance in many segmentation tasks [12]. However, DCNNs require large amounts of data for high performance. In addition, there is often no clear boundary between the lesion and the surrounding skin, and since the sizes and shapes of the lesions vary, it is not easy to characterize them. Ink marks and air bubbles are further difficulties [13]. In this study, an edge attention Vnet (ET-Vnet) model is proposed to overcome these difficulties. ET-Vnet combines two attention mechanisms to better extract information about the lesion boundary. Specifically, ET-Vnet contains two paths in its decoder, each embedded with an attention module. In addition, the balanced binary cross-entropy (BBCE) and focal Tversky loss (FTL) functions are combined as a novel approach to increase the performance of the proposed model.
Czajkowska et al. [14] proposed a model consisting of two DeepLab v3+ models with a ResNet-50 backbone and a fuzzy connectivity analysis module for fine segmentation. Hu et al. [6] proposed a new attention synergy network (AS-Net) that improves the discriminative ability for skin lesion segmentation by combining spatial and channel attention mechanisms. Arora et al. [15] proposed a modified U-Net-based model for automatic skin lesion segmentation. Their model used group normalization (GN) instead of batch normalization (BN) in both encoder and decoder layers [16,17]. In addition, it used attention gates (AG), which focus on minute details in the skip connections, and the Tversky loss (TL) function, which performs better under class imbalance. Gu et al. [18] proposed a comprehensive attention-based modified U-Net (CA-Net) that adds multiple attention modules between the layers of the U-Net architecture so that the network is aware of the most critical spatial locations, channels, and scales. A new channel attention module was used to further focus the model on the feature maps of the lesion area.
A scale attention module was also presented to highlight the most prominent feature maps. Goyal et al. [19] proposed an ensemble of R-CNN and DeepLab v3+ methods to achieve high sensitivity and specificity in lesion boundary segmentation. The proposed ensemble method, Ensemble-S, achieved better performance than FrCN, FCNs, U-Net, and SegNet. Jiang et al. [20] proposed the CSARM-CNN (channel and spatial attention residual module) model for automatic skin lesion segmentation based on deep learning. In each convolutional layer, CSARM forms a new attention module by combining channel attention and spatial attention to make segmentation training more effective. A spatial pyramid pooling layer acquires multiscale input images. Finally, two different cross-entropy losses were fused to obtain the final loss function and improve segmentation training. Lei et al. [21] proposed a generative adversarial network (GAN) to overcome the challenges of automated lesion segmentation. This network includes a densely connected U-Net-based module with skip connections (U-Net-SCDC) and a double discrimination (DD) module. U-Net-SCDC uses lower-resolution up-sampling convolutional layers that preserve fine-grained information, whereas the DD module improves training through two discriminators that check each other from opposite directions. Thus, the two components work as if each were trying to find the other's faults, focusing on the points where the other is wrong. With a conditional discriminative loss, this mutually checking model was reported to provide superior performance compared to other models.
Shan et al. [22] proposed a novel segmentation method named FC-DPN. The proposed method combines a fully convolutional network (FCN) and a dual-path network (DPN). DPN is a model that enables the feature maps of previous layers to be reused through residual and densely connected paths. Sub-DPN projection blocks and sub-DPN processing blocks were added in place of the dense layers in fully convolutional DenseNets (FC-DenseNets). It was stated that this allows FC-DPN to acquire more representative and distinctive features and thus perform a more robust segmentation.
Dosovitskiy et al. removed the CNN-based encoder and replaced it with the vision transformer to improve the image recognition performance of the network [23]. Sarker et al. proposed SLSNet, a network that reduces the computational cost by using 1-D kernel factorized networks for accurate skin lesion segmentation with minimal resources [24]. Wu et al. [25] proposed FAT-Net, a new feature adaptive transformer network based on the classical encoder-decoder architecture.
The organization of this manuscript is as follows. Section 2 provides information about image augmentation and preprocessing, the proposed method, and the evaluation metrics. Section 3 gives details about the computational analysis of the application. Section 4 presents the assessment of the proposed model, comparing its results with those of other methods. Section 5 analyzes and discusses the results and makes various suggestions. Finally, the conclusion is given in Section 6.

Preparing Dataset.
The proposed model was trained and tested in this study using the ISIC 2018 Lesion Boundary Segmentation dataset. The dataset is provided by the International Skin Imaging Collaboration (ISIC) archive [12,26]. It consists of 8-bit RGB dermoscopic images ranging in size from 767 × 576 to 6682 × 4401. The dataset contains 2594 training images obtained from different institutes and covering various diagnostic challenges.

Data Processing and Augmentation.
Deep learning models perform poorly on datasets with few samples, so a large number of samples is needed to train them. The ISIC 2018 dataset used for training the model consists of 2594 images. First, the 2594 images were divided into training (2075) and test (519) sets. Then, the amount of data was increased by applying boundary data augmentation together with horizontal and vertical flips, random rotation, random distortion, elastic transformation, and scaling and clipping to the training samples. The boundary data augmentation method improved the detection of fine-grained features at the borders of the lesioned region. In the random rotation method, augmented images are obtained by rotating the original image horizontally and vertically across each row and column. With the applied data augmentation methods, 72,000 training images were obtained. The training images were also preprocessed in order to give more reliable and robust results. Contrast stretching made the lesions more prominent. In addition, a sharpening algorithm was applied to the data during the contrast stretching. With sharpening and contrast stretching, the blurred edges of the lesions were made more prominent. All images were resized to 512 × 512 because the images have different sizes and the proposed model expects a uniform input size. Figure 1 shows examples of preprocessed and enhanced images.
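The split and two of the simpler steps described above (contrast stretching and flips) can be sketched as follows. This is a minimal illustration in plain Python and not the authors' pipeline: the function names, the fixed random seed, the 20% test fraction (inferred from the 2075/519 split), and the representation of an image as a nested list of intensities are all assumptions made for the example.

```python
import random
from typing import List, Tuple

Image = List[List[float]]  # one channel, stored as rows of pixel intensities


def split_indices(n_images: int, test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle image indices and split them into (train, test) lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    n_test = round(n_images * test_fraction)
    return indices[n_test:], indices[:n_test]


def contrast_stretch(img: Image, out_min: float = 0.0,
                     out_max: float = 255.0) -> Image:
    """Linearly rescale intensities so they span [out_min, out_max]."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [[out_min for _ in row] for row in img]
    scale = (out_max - out_min) / (hi - lo)
    return [[out_min + (p - lo) * scale for p in row] for row in img]


def flip_horizontal(img: Image) -> Image:
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img: Image) -> Image:
    """Reverse the row order (up-down flip)."""
    return [row[:] for row in img[::-1]]


def flipped_pairs(image: Image, mask: Image) -> List[Tuple[Image, Image]]:
    """Return the original pair plus three flipped copies.

    The image and its mask always receive the same transform so that
    the labels stay aligned with the pixels.
    """
    transforms = [
        lambda x: x,
        flip_horizontal,
        flip_vertical,
        lambda x: flip_vertical(flip_horizontal(x)),
    ]
    return [(t(image), t(mask)) for t in transforms]
```

Applying every geometric transform identically to the image and its mask is the key design point; intensity transforms such as contrast stretching are applied to the image only.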

Proposed Method.
Milletari et al. [27] proposed the Vnet architecture for volumetric, fully convolutional 3D image segmentation. The ET-Vnet 2D model is shown in Table 1.

The Encoder Stage.
Xavier weight initialization was used for the model, and ReLU was used as the activation function for the layers [28]. In addition, Adam was used as the optimizer [29]. In the ET-Vnet 2D model, each convolution layer consists of 3 × 3 convolutions with GN to normalize the features in the channels [30]. The 512 × 512 × 1 input image is fed to the first block, as shown in Table 1. The ET-Net architecture, with an edge guidance module (EGM) that highlights low-level features in the feature maps and a weighted aggregation module (WAM) that emphasizes high-level features, was proposed by Zhang et al. [31] for high-performance organ segmentation. The EGM and WAM modules of that architecture have been applied to V-Net 2D. Here, the EGM emphasizes low-level edge features in segmentation, while the WAM emphasizes high-level features, allowing the proposed model to perform better segmentation. Also, the number of convolution filters doubles from 32 to 512 across the blocks [32].

The Decoder Stage.
The decoder stage takes only high-level features from the encoder and upsamples them to the input size. As shown in Table 1, deconvolution is performed to upsample the feature maps from the downsampling stage. A series of convolutional operations is applied at the decoder stage to enlarge the low-resolution feature map from the encoder to the original input resolution. Thanks to the skip connections from the encoder, the encoder and decoder features are aligned and compared, which is intended to improve the segmentation performance of the network. In the decoder stages of the proposed model, the same convolution blocks used in the encoder are used in reverse order. The ReLU activation function is used in each layer, as in the encoder. In the last layer, a 1 × 1 convolution with the sigmoid activation function is used, since binary (black and white) image segmentation involves two classes. Figure 2.
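To make the shape bookkeeping of the two stages concrete, the sketch below lists the feature-map size and filter count per block and shows how a sigmoid output can be turned into a binary mask. It is only an illustration under stated assumptions: the halving of the spatial resolution between blocks, the 0.5 decision threshold, and all function names are hypothetical and are not taken from Table 1.

```python
import math
from typing import List, Tuple


def encoder_schedule(input_size: int = 512, first_filters: int = 32,
                     last_filters: int = 512) -> List[Tuple[int, int]]:
    """Return (spatial_size, filters) for each encoder block.

    Assumes the filter count doubles and the resolution halves
    from one block to the next.
    """
    stages = []
    size, filters = input_size, first_filters
    while filters <= last_filters:
        stages.append((size, filters))
        size //= 2
        filters *= 2
    return stages


def decoder_schedule(stages: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Mirror the encoder blocks in reverse order, excluding the deepest one."""
    return list(reversed(stages[:-1]))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_binary_mask(logits: List[List[float]],
                   threshold: float = 0.5) -> List[List[int]]:
    """Label a pixel as lesion (1) when its sigmoid probability reaches the threshold."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in logits]


if __name__ == "__main__":
    enc = encoder_schedule()
    print("encoder:", enc)
    print("decoder:", decoder_schedule(enc))
    print(to_binary_mask([[-2.0, 0.0], [3.0, -0.1]]))
```

Under these assumptions the encoder has five blocks, and each decoder block restores the resolution of the encoder block whose skip connection it receives.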

Evaluation Metrics.
The metrics used to score the performance of the models in the ISIC 2018 challenge were also used to test the performance of the proposed model. The first of these metrics is the Sørensen-Dice coefficient (DSC), shown in equation (1). Dice is a simple measure of the similarity of two samples. The Jaccard index (Jaccard), defined in equation (2), is used to calculate the similarity and diversity of sample sets. Jaccard can also be defined as the intersection over union of two sets. Accuracy (ACC), shown in equation (3), represents the percentage of correct predictions out of all predictions made by the model. Sensitivity (Sens), defined in equation (4), measures the proportion of lesion pixels that are correctly predicted as lesion (true positives, TP); it is also known as recall. Specificity (Spec), shown in equation (5), measures the proportion of nonlesion pixels in an image that are correctly predicted. One drawback of the Dice coefficient is that false positives (FP) and false negatives (FN) are weighted equally. While this increases the precision, it decreases the recall rate. For this reason, Dice degrades test performance on unbalanced datasets. The way to deal with this situation is to weigh FN more than FP. The Tversky similarity index, shown in equation (6), is a generalization of the Dice coefficient that provides flexibility in balancing FPs and FNs.
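For reference, the standard forms of the metrics referred to as equations (1)-(6) can be written in terms of pixel counts as below. The assignment of α to FN and β to FP in the Tversky index is an assumption made here, since conventions differ across the literature; with α = β = 0.5 the Tversky index reduces to the Dice coefficient.

```latex
\begin{align}
\mathrm{DSC}     &= \frac{2\,TP}{2\,TP + FP + FN} \tag{1}\\
\mathrm{Jaccard} &= \frac{TP}{TP + FP + FN} \tag{2}\\
\mathrm{ACC}     &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3}\\
\mathrm{Sens}    &= \frac{TP}{TP + FN} \tag{4}\\
\mathrm{Spec}    &= \frac{TN}{TN + FP} \tag{5}\\
\mathrm{TI}      &= \frac{TP}{TP + \alpha\,FN + \beta\,FP} \tag{6}
\end{align}
```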
where true positive (TP) represents the correctly predicted lesion pixels, false positive (FP) represents the nonlesion pixels incorrectly predicted as lesion, true negative (TN) represents the correctly predicted nonlesion pixels, and false negative (FN) represents the lesion pixels incorrectly predicted as nonlesion. In equation (6), α and β are adjustable parameters used to increase the weight of the recall rate in unbalanced datasets.
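As a worked illustration of these definitions, the following sketch computes all six metrics from a predicted and a ground-truth binary mask given as flat lists. It assumes the standard count-based formulas; the function names, the default weights α = 0.7 (on FN) and β = 0.3 (on FP), and the choice to return 0 for an empty denominator are illustrative assumptions rather than values from this study.

```python
from typing import Dict, Sequence


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN over two equally long binary masks (1 = lesion)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            counts["tp"] += 1
        elif p == 1 and t == 0:
            counts["fp"] += 1
        elif p == 0 and t == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def _ratio(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def segmentation_metrics(tp: int, fp: int, tn: int, fn: int,
                         alpha: float = 0.7, beta: float = 0.3) -> Dict[str, float]:
    """Dice, Jaccard, accuracy, sensitivity, specificity, and Tversky index.

    alpha weights FN and beta weights FP (an assumed convention).
    """
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": _ratio(tp, tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "tversky": _ratio(tp, tp + alpha * fn + beta * fp),
    }


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0, 0, 0]
    truth = [1, 0, 0, 0, 1, 1, 0, 0]
    for name, value in segmentation_metrics(**confusion_counts(pred, truth)).items():
        print(f"{name}: {value:.3f}")
```

For a single pair of masks, Dice and Jaccard are tied by Dice = 2J / (1 + J), so either one can be recovered from the other.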

Balanced Focal Tversky Loss Function (BFTL).
The BFTL loss function proposed in this study is developed from the loss function proposed by Zhou et al. [33]. The FTL loss function, explained in equation (7), effectively solves the problems of Dice loss under class imbalance [34,35]. The FTL function shown in equation (8) is presented as a solution to data imbalance. In the proposed model, a loss function is proposed that both accounts for pixel-wise losses and addresses data imbalance.
The Tversky index is adapted to a loss function (TL) in [36] by minimizing $\sum_c (1 - TI_c)$.
The BFTL function L is formulated in equation (9).
where $L_{bbce}$ is the BBCE loss [37]. Unlike binary cross-entropy, BBCE includes a weight β, where β is the number of negative samples divided by the total number of samples. In other words, β is the proportion of the dominant class in the data set, and 1 − β denotes the fraction of the other class. In addition, the
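One way to read the loss components described above is sketched below in plain Python. It is a hedged illustration, not the authors' implementation: combining the two terms as an unweighted sum, the focal exponent γ = 4/3, the Tversky weights 0.7 (FN) and 0.3 (FP), the smoothing constant, and all function and parameter names are assumptions. The BBCE weight is computed as described in the text, as the fraction of negative (nonlesion) pixels.

```python
import math
from typing import Sequence


def balanced_bce(probs: Sequence[float], targets: Sequence[int],
                 eps: float = 1e-7) -> float:
    """Balanced binary cross-entropy, averaged over pixels.

    The positive term is weighted by the fraction of negative pixels
    (neg_fraction) and the negative term by 1 - neg_fraction.
    """
    n = len(targets)
    neg_fraction = sum(1 for t in targets if t == 0) / n
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= neg_fraction * t * math.log(p)
        total -= (1.0 - neg_fraction) * (1 - t) * math.log(1.0 - p)
    return total / n


def tversky_index(probs: Sequence[float], targets: Sequence[int],
                  w_fn: float = 0.7, w_fp: float = 0.3,
                  smooth: float = 1.0) -> float:
    """Soft Tversky index computed from predicted probabilities."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fn = sum((1.0 - p) * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    return (tp + smooth) / (tp + w_fn * fn + w_fp * fp + smooth)


def focal_tversky_loss(probs: Sequence[float], targets: Sequence[int],
                       gamma: float = 4.0 / 3.0) -> float:
    """Focal Tversky loss: (1 - TI) raised to the power 1 / gamma."""
    base = max(0.0, 1.0 - tversky_index(probs, targets))
    return base ** (1.0 / gamma)


def bftl(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Assumed combination: unweighted sum of BBCE and focal Tversky loss."""
    return balanced_bce(probs, targets) + focal_tversky_loss(probs, targets)


if __name__ == "__main__":
    targets = [1, 0, 0, 0, 0, 0, 1, 0]
    good = [0.9, 0.1, 0.1, 0.2, 0.1, 0.1, 0.8, 0.1]
    bad = [0.2, 0.6, 0.1, 0.7, 0.1, 0.1, 0.3, 0.1]
    print("good prediction:", round(bftl(good, targets), 4))
    print("bad prediction: ", round(bftl(bad, targets), 4))
```

The cross-entropy term penalizes each pixel individually, while the Tversky term scores the overlap of the whole mask, which is why the two are complementary for small lesions on a large background.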

Application Details.
A computer with an Intel i5-8300H processor, 32 GB of RAM, and an Nvidia GTX 1080 Ti graphics card was used to train the proposed model. The computer's operating system is 64-bit Windows 10. The proposed deep learning model was created using the Python 3.6 programming language. Based on experiments, the learning rate was set to 1e-3 and the batch size to 4. The model was trained for 100,000 epochs on the 72,000 augmented training images. Vnet 2D and ResUnet 2D models were also trained separately on the augmented dataset derived from the ISIC 2018 dataset. The graphical analysis of the tests performed is shown in Figure 3. Some sample test images, their ground truths, and the segmentation results predicted by the proposed models are shown in Figure 4. As can be seen from Figure 4, when the predicted results are compared with the ground truths, the model exhibits very high performance.
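The training settings reported above can be collected in a small configuration object, as in the sketch below. The class and field names are hypothetical and chosen only for illustration; the values are those stated in the text, and the helper simply computes how many batches one pass over the augmented training set requires.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the text (field names are illustrative)."""
    learning_rate: float = 1e-3
    batch_size: int = 4
    num_train_images: int = 72_000
    input_size: int = 512
    optimizer: str = "adam"
    weight_init: str = "xavier"

    def steps_per_pass(self) -> int:
        """Number of batches needed to see every training image once."""
        return math.ceil(self.num_train_images / self.batch_size)


if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg)
    print("batches per pass over the training set:", cfg.steps_per_pass())
```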

Qualitative Analysis.
The proposed model's segmentation and final results are shown in Figure 4. boundary data augmentation, and predicts complex lesion boundaries with great precision.

Hardware Analysis.
The proposed model was compared with other cutting-edge models in terms of the number of model parameters, storage requirements, and inference speed using an Nvidia GeForce GTX 1080 Ti graphics card. In addition, since the data set is stable, there is no need for pretraining scoring. The training of the model took approximately 8 hours, and prediction on the test data set took 52 seconds.

Prediction Results
The proposed segmentation model was tested on ISIC 2018 Task 1 dermoscopic lesion images. The BFTL loss function, which combines the BBCE and FTL functions, is used as a new approach. One of the main reasons for applying the boundary data augmentation method is to enable the model to better recognize the lesion borders that separate the lesioned area from the skin. Table 2 shows the comparative results of the proposed model with other models on the ISIC 2018 dataset. The proposed model surpassed recent methods from the literature on the ISIC 2018 dataset, achieving a Dice score of 0.91, a Jaccard score of 0.83, and a sensitivity of 0.95. Figure 4 shows the comparative results of the proposed model with other models on the ISIC 2018 dataset. Figure 5 shows the predicted outputs for some of the complex samples among the 519 test images held out from training. The figure compares the ground truth and the predicted image for each original input image. Each row in the figure shows the results for one test input image from the data set. The third column shows the segmentation result predicted by the proposed fusion model for the test input image presented in the first column.

Discussion
In the proposed model, we obtain the ET-Vnet 2D network by fusing the EGM and WAM modules from the ET-Net 2D network with the Vnet 2D network. The ET-Vnet 2D network training was completed in 8 hours. EGM and WAM from the ET-Net 2D network significantly improved the performance of the Vnet 2D network. Also, in the proposed model, the balanced binary cross-entropy and focal Tversky loss functions are combined into a hybrid loss. The proposed hybrid loss function slightly prolonged the training time, but it enabled the model to give more robust results.

Conclusion
In this study, the EGM and WAM modules used in the ET-Net network are combined with Vnet 2D to create the proposed ET-Vnet 2D model, which was tested on the ISIC 2018 Lesion Boundary Segmentation dataset. The proposed model consists of data set preprocessing and data augmentation, lesion identification, and prediction operations. In the dataset preprocessing stage, artifacts such as color inconsistency, exposure problems, and poor visibility of lesion borders are corrected. The limited number of labeled skin lesion images was made sufficient to train the proposed model, mainly thanks to the boundary and other data augmentation methods. In addition, a new loss function is proposed by combining BBCE, which calculates pixel-wise losses, and FTL, which is presented as a solution to class imbalance. The loss function proposed as a novel approach significantly improves the performance of the proposed model. The proposed hybrid loss function is thought to play a crucial role in challenging segmentation tasks. The Dice and Jaccard similarity metrics on the data set were recorded as 0.91 and 0.83, surpassing the latest segmentation techniques. In future studies, the proposed model will be tested on other organ segmentation tasks to prove its robustness.

Data Availability
The datasets are available from the authors upon request.

Ethical Approval
This article does not contain any studies with human participants. No animal studies were involved in this review.