Diagnosis of Various Skin Cancer Lesions Based on Fine-Tuned ResNet50 Deep Network

With the massive success of deep networks, there have been significant efforts to analyze cancer diseases, especially skin cancer. For this purpose, this work investigates the capability of deep networks in diagnosing a variety of dermoscopic lesion images. This paper aims to develop and fine-tune a deep learning architecture to diagnose different skin cancer grades based on dermatoscopic images. Fine-tuning is a powerful method to obtain enhanced classification results from a customized pre-trained network. Regularization, batch normalization, and hyperparameter optimization are performed for fine-tuning the proposed deep network. The proposed fine-tuned ResNet50 model successfully classified the seven classes of dermoscopic lesions using the publicly available HAM10000 dataset. The developed deep model was compared against two powerful models, i.e., InceptionV3 and VGG16, using the Dice similarity coefficient (DSC) and the area under the curve (AUC). The evaluation results show that the proposed model achieved higher results than some recent and robust models.


Introduction
The American Cancer Society estimates that melanoma accounts for 75% of total skin cancer deaths and puts new melanoma cases at 100,000 in 2020 [1]. An image-based computer-aided diagnosis (CAD) system could be used to classify different skin lesions based on image features. A higher-accuracy CAD system could be used in the early diagnosis of skin cancer [2]. Early and accurate detection aided by deep learning techniques can make treatment more effective [3]. Deep neural networks (DNNs) have shown great performance in many fields [2]. Moreover, the networks were pre-trained multiple times with different fine-tuning settings to achieve a more stable classification performance for skin lesion categorization. Compared to traditional methods where each network architecture is used once, the proposed fine-tuning framework guided the final results' performance. To date, the reported classification schemes have not shown a significant improvement. The study aimed to improve skin cancer diagnosis accuracy by using novel deep learning (DL) algorithms. This research proposes a new method for automated skin lesion diagnosis that overcomes generalization error, overfitting, and vanishing and exploding gradient problems via a novel deep learning approach that recognizes the skin lesions' significant visual features. The proposed deep architecture was trained using the HAM10000 dataset, which contains 10015 dermoscopic images for seven different diagnostic categories. The remainder of the paper is organized as follows. Section 2 discusses some current related work. In Section 3, the proposed model and the datasets used are described in more detail. Section 4 elucidates the experimental results. Finally, the discussion and the conclusion are presented in Section 5.

Related Work
Several attempts have been made to overcome the aforementioned challenges with the aid of deep learning architectures. For example, Walker et al. [19] proposed a two-stage skin cancer diagnosis system. They first used two deep network architectures: a convolutional neural network (CNN) and an inception network. They employed the Caffe library to train the inception parameters using stochastic gradient descent. They also augmented the data to expand the available training images by applying translations and rotations at random rotation angles. Second, the input images are mapped into a feature representation to be used in sonification. In the sonification step, a raw audio classifier uses a 1-dimensional CNN with convolutional and max pooling layers, and finally a fully connected and softmax layer with two neurons for binary classification of dermoscopic images, i.e., malignant from benign. They evaluated their method using the publicly available ISIC 2017 dataset of 2361 images labeled as melanoma or benign lesion. They obtained an accuracy of 86.6% in the classification of cancerous dermoscopic images.
Mahbod et al. [9] proposed a hybrid deep network approach for skin lesion classification that combines two network architectures, i.e., intra- and inter-network fusion architectures. They first pre-trained CNNs on ImageNet and then fine-tuned them on the dermoscopic lesion image dataset. The deep feature output of the last few fully connected layers is fed to a support vector machine (SVM) classifier for classifying the lesion type. They fine-tuned the pre-trained networks with different settings for better performance in classifying skin lesions. They evaluated their approach on the ISIC 2017 dataset as a binary classification task. They achieved an average area under the curve (AUC) of 87.3% for malignant melanoma vs. all and 95.5% for seborrheic keratosis vs. all. They utilized ResNet-18 with random weight initialization to obtain the optimal hyperparameters of the individual components for the classification results.
Hekler et al. [5] used a pre-trained ResNet CNN for the classification of histopathological melanoma images. They controlled the hyperparameters by modifying the weights to reduce the loss, given the difference between the predicted class labels and the actual class labels. They evaluated their method on 595 histopathologic slides from a dataset of 595 individual patients (300 nevi and 295 melanoma). They evaluated their deep classification technique's performance using a test set of 100 images with known class labels, obtaining a mean accuracy of 68% for binary classification of melanoma from nevi. They presented an invasive technique with a limited number of images of limited resolution. Another limitation is the binary nature of their technique for melanoma vs. nevi.
Kassem et al. [23] proposed a deep learning model for the classification of skin lesions. They used a deep convolutional GoogleNet architecture based on the inception module, which utilizes a sparse CNN with conventional dense construction. They utilized transfer learning and domain adaptation to improve generalization. They evaluated their model on the ISIC 2019 dataset to test its ability to classify different kinds of skin lesions, with 94.92% classification accuracy. They used the traditional multiclass SVM machine learning method, which may result in lower performance measurements.
Adegun et al. [4] proposed a deep learning-based approach for the automatic detection of melanoma. Their deep network comprises connected encoder and decoder sub-networks, which brings the encoder feature maps closer to the decoder feature maps to obtain efficiently learned features. Their system is multi-stage and uses softmax for melanoma lesion classification. They also used a multi-scale system to handle skin lesion images of various sizes. They evaluated their system on two skin lesion datasets, ISIC 2017 and Hospital Pedro Hispano (PH2). They used 2000 and 600 dermoscopy images for training and testing on ISIC 2017, and 200 and 60 dermoscopic images for training and testing on the PH2 dataset. Their results showed an average accuracy of 95%. They adopted only the binary case of classification for melanoma vs. non-melanoma.
Brinker et al. [22] used a CNN to classify skin cancer images. The convolutional architecture utilized the ResNet model to distinguish benign from malignant skin lesions. They also employed an ensemble of residual nets to achieve a lower error rate than that of GoogLeNet. They utilized stochastic gradient descent with restarts (SGDR) to escape local minima by means of sudden increases in the learning rate. They trained the CNN image classifier using the ISBI 2016 dataset, which includes 18170 nevi and 2132 melanomas. They evaluated the results using 379 test images of ISBI 2016, with an area under the ROC curve of 0.85.
Several studies in the literature focused only on the binary case of classification for melanoma vs. non-melanoma [4,9,19,22]. Moreover, Kassem et al. [23] investigated the problem of multi-lesion diagnosis. They used the traditional multiclass machine learning method, resulting in lower performance measurements. For these reasons, the main objective is to investigate the problem of multi skin cancer lesion diagnosis for more effective treatment. The ISIC 2019 dataset, with nine different diagnostic categories, will be used to train and test the multi-diagnostic technique. To this end, this paper investigates deep learning architecture and hyperparameter optimization to improve the diagnosis results. This work uses the ResNet50 deep network to overcome generalization error, overfitting, and vanishing and exploding gradient problems. ResNet50 can reformulate network layers in terms of residual learning functions with a mapping reference to the input layer by fitting the stacked layers to the residual mapping. ResNet50 uses the identity mapping to predict the residual required to reach the final prediction from the previous layer's outputs, which decreases the vanishing gradient effect by providing an alternate shortcut path. The identity mapping allows the model to bypass unnecessary layers. This helps the model to overcome overfitting to the training set. This approach extends the previous work by using a multi-level feature extraction technique to improve the classification results. Moreover, the proposed network was pre-trained multiple times with different fine-tuning settings to achieve a more stable classification performance for skin lesion categorization. The proposed deep model utilizes various fine-tuning techniques, i.e., regularization, hyper-parameter tuning and batch normalization, transfer learning, cross-entropy optimization, and the Adam optimizer. The proposed novel deep learning classification scheme achieves a significant improvement in skin cancer diagnosis accuracy.

Dataset
HAM10000 (Human Against Machine) (HAM) is a publicly available dataset [25]. The HAM dataset comprises 10015 dermatoscopic images available through the ISIC archive. The HAM dermatoscopic images were collected from different populations using different modalities. This dataset provides a diagnosis for seven pigmented lesion categories: actinic keratosis (AKIEC), benign keratosis lesion (BKL), vascular lesions (VASC), basal cell carcinoma (BCC), dermatofibroma (DF), melanocytic nevi (NV), and melanoma (MEL). Most of these lesions are confirmed through histopathology. The lesion_id in the metadata file tracks the lesions within the dataset. The lesion images are diverse. They were acquired with different dermatoscopy types from various anatomic sites, i.e., nails and mucosa, from a sample of skin cancer patients at several different institutions. Images were acquired with the approval of the Ethics Review Committees of the Medical University of Vienna and the University of Queensland. The HAM10000 dataset is also utilized as the training set for the ISIC 2018 challenge [25]. The distribution of the seven pigmented lesions of the HAM dermoscopic images is shown in Tab. 1.

Pre-Processing
Preprocessing steps are applied to cleanse and organize the data before it is fed into the model. Dataset images vary between high and low pixel ranges. Images with higher pixel values can produce different loss values from those in the lower range. Sharing the same model, weights, and learning rate requires normalizing the dataset. Image pixels are scaled before the training phase of the deep learning architecture. Within the experiments, images are rescaled to (224, 224, 3) via the ImageDataGenerator class. Image pixel values are normalized to unify the image samples: pixel values in the range [0, 255] are normalized to the range [0, 1]. Without scaling, images with a high pixel range would have a disproportionate influence on the weight updates [22].
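The pixel normalization described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline (which uses the ImageDataGenerator class); the function name `scale_pixels` and the tiny nested-list image are hypothetical stand-ins for a resized 224 × 224 × 3 image.

```python
def scale_pixels(image):
    """Map 8-bit pixel values in [0, 255] to floats in [0, 1]."""
    return [[[value / 255.0 for value in pixel] for pixel in row] for row in image]


# A tiny 1 x 2 RGB image stands in for a resized dermoscopic image (assumption).
tiny_image = [[[0, 128, 255], [255, 64, 0]]]
scaled = scale_pixels(tiny_image)
print(scaled[0][0])
```

Dividing by 255 keeps every image on the same numeric scale, so that no image dominates the weight updates merely because its raw intensities are larger.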

Training the Deep Network
Deep CNNs can learn hierarchically from low- to high-level features automatically. Stacking more layers (depth) can enrich the levels of features. A deeper network can solve complex tasks and improve classification/recognition accuracy [26]. However, training a deeper network can face some difficulties, such as saturation/degradation of accuracy and vanishing/exploding gradients [27]. Utilizing a deep residual pre-trained architecture can solve both of these problems. A pre-trained architecture facilitates training networks deeper than the framework used in [26]. ResNet50 was previously trained on ImageNet, which is composed of around 1.5 million natural scene images [27]. ResNet can reformulate network layers in terms of residual learning functions with a mapping reference to the input layer. Within ResNet, the stacked layers fit a residual mapping rather than directly fitting the desired mapping [28].
The key idea of ResNet50 is to use identity mapping to predict the residual required to reach the final prediction from the previous layer's outputs [27]. ResNet50 decreases the vanishing gradient effect by providing an alternate shortcut path. The identity mapping allows the model to bypass unnecessary layers. This helps the model to overcome overfitting to the training set [29].
Let the desired mapping be represented as H(x). The stacked layers fit the mapping F(x) := H(x) − x, so that the original mapping is recast as F(x) + x. Optimizing the residual mapping can be easier than optimizing the original (unreferenced) mapping [26]. If an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. Compared to VGG-16, ResNet50 has an additional identity mapping [27], as shown in Fig. 1.
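To make the relation H(x) = F(x) + x concrete, the following is a simplified sketch of a single residual block. It assumes small fully connected layers in place of ResNet50's convolutions and omits batch normalization; the names `residual_block`, `linear`, `w1`, and `w2` are hypothetical.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    # F(x): the residual function learned by the stacked layers.
    residual = linear(w2, relu(linear(w1, x)))
    # Shortcut connection: add the unchanged input, giving F(x) + x.
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, 2.0, 0.5]
zero = [[0.0] * 3 for _ in range(3)]
# With all weights at zero, F(x) = 0 and the block reduces to the identity.
identity_out = residual_block(x, zero, zero)
print(identity_out)
```

The zero-weight case shows why an identity mapping is easy to represent: the stacked layers only need to drive F(x) to zero, and the shortcut carries the input through unchanged.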

Fine-Tuning the Network
Optimization has gained great importance, especially in deep learning, with the exponential growth of data. The large number of parameters within a deep network makes adjusting the network's parameters a challenge [30]. Optimization algorithms aim to fine-tune the results by utilizing various optimization techniques [31]. The hyper-parameter settings affect the performance of the deep model. Developments in optimization have brought many efficient ways to adjust the hyper-parameters automatically [32]. Optimization with the Adam optimizer has a significant influence on learning performance [33]. Fine-tuning the deep architecture can have a considerable effect on the performance of a model [34]. Fine-tuning the deep architecture covers the choice of deep training network, the layers involved within the architecture and the hyper-parameters for each layer, as well as the optimizers used to enhance performance [27]. The proposed deep model utilizes various fine-tuning techniques, i.e., regularization, hyper-parameter tuning and batch normalization, transfer learning, cross-entropy optimization, and the Adam optimizer.
Regularization
Regularization techniques are applied to enhance the learnability of the network. Image data augmentation is used to synthesize new data to expand the dataset. Creating new training data from existing training data can improve the performance and the generalization ability of the model [35]. Image augmentation increases the amount of available data by applying domain-specific techniques for creating transformed versions of images. These transformations can be flips, zooms, shifts, and much more. Augmentation can also help the model learn features that are invariant to transforms, such as left-to-right and top-to-bottom ordering and light levels in photographs [36].
Data preparation operations, i.e., image resizing and pixel scaling, are distinct from image augmentation. Image augmentation is applied only to the training dataset, not to the validation or test dataset. Data preparation, however, must be performed consistently across all the model datasets [30]. A combination of affine image transformations, i.e., rotation, shifting, scaling (zoom in/out), and flipping, was then performed to synthesize new data. The number of image samples was increased to a total of 12519 images.
Hyperparameter Tuning and Batch Normalization
Batch normalization achieves better optimization performance for convolutional networks [37]. Fixing the distributions of the inputs can remove the effects of internal covariate shift, reduce the number of epochs required, and decrease generalization error [36]. Batch normalization handles the internal covariate shift problem by standardizing the inputs to deep layers for each mini-batch. This stabilizes the learning process and can eventually reduce the number of training epochs required in training deep networks [38].
During training, batch normalization is performed by computing the mean and standard deviation of each layer's input variables per mini-batch and using them for standardization. After training, the mean and standard deviation can be set to the mean values observed over the mini-batches of the training dataset [38]. The mean and standard deviation of the activations are calculated to normalize features by Eqs. (1) and (2), respectively [37].
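As an illustration of the flip and shift transformations named above, here is a minimal sketch that operates on a single-channel image stored as nested lists. It is not the augmentation pipeline used in the experiments; the helper names, the 0.5 flip probability, and the one-pixel shift range are assumptions chosen for brevity.

```python
import random


def flip_horizontal(image):
    """Mirror each row (left-to-right flip)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the row order (top-to-bottom flip)."""
    return image[::-1]


def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]


def augment(image, rng):
    """Apply a random combination of flips and a small shift."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    if rng.random() < 0.5:
        out = flip_vertical(out)
    return shift_right(out, rng.randint(0, 1))


image = [[1, 2, 3], [4, 5, 6]]
augmented = augment(image, random.Random(0))
print(augmented)
```

In keeping with the text, such a function would be applied to training images only, while resizing and pixel scaling are applied to every split.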
where m represents the size of a mini-batch and $x_{i,f}$ is the f-th feature of the i-th sample. Using the mini-batch mean and standard deviation, features can be normalized using Eq. (3) [37].
where ξ represents a small positive constant added for numerical stability. This standardization can be applied to the inputs of the first hidden layer or to the hidden layer activations of deeper layers [38]. In practice, during training, batch normalization uses two learnable parameters, $\beta_f$ and $\gamma_f$, for each feature f; these allow automatic shifting and scaling of the standardized layer inputs [37].
The backpropagation algorithm operates on the transformed inputs and adjusts the new scale and shift parameters to reduce the model's error [36]. Batch normalization makes the network more stable by maintaining an adequate distribution of activation values throughout training. Initialization of weights prior to training deep networks is a challenging problem. The training stability achieved by batch normalization makes the choice of weight initialization less critical [38]. Batch normalization can also be used as data preparation to standardize raw input data that have different scales [36]. Batch normalization has been widely used for training CNNs to improve the distribution of the original inputs through scaling and shifting steps [37].
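The training-time computation described above can be sketched as follows. The exact forms of Eqs. (1)-(3) are not reproduced here, so this sketch assumes the common formulation in which the stability constant (ξ in the text, `xi` below) is added to the variance inside the square root; the function name `batch_norm` and the toy mini-batch are hypothetical.

```python
import math


def batch_norm(batch, gamma, beta, xi=1e-5):
    """Standardize each feature over a mini-batch, then scale and shift.

    batch: list of m samples, each a list of feature values x[i][f].
    gamma, beta: learnable per-feature scale and shift parameters.
    xi: small positive constant for numerical stability (assumed placement).
    """
    m = len(batch)
    n_features = len(batch[0])
    means = [sum(sample[f] for sample in batch) / m for f in range(n_features)]
    variances = [
        sum((sample[f] - means[f]) ** 2 for sample in batch) / m
        for f in range(n_features)
    ]
    return [
        [
            gamma[f] * (sample[f] - means[f]) / math.sqrt(variances[f] + xi) + beta[f]
            for f in range(n_features)
        ]
        for sample in batch
    ]


# Two samples with two features on very different scales.
mini_batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(mini_batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)
```

With gamma set to one and beta to zero, both features end up close to -1 and +1 despite their different raw scales; during training the network is free to learn other values of gamma and beta.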
Transfer Learning
A linear activation cannot learn complex mapping functions; it is used only in the output layer to predict a quantity, i.e., in regression problems. Nonlinear activation functions, i.e., sigmoid and hyperbolic tangent, allow the nodes to learn more complex structures in the data [39]. A common problem with the sigmoid and hyperbolic tangent functions is that they saturate. They saturate to a high value for large positive inputs and to a low value for large negative inputs, and they are sensitive to the input only when z is near 0 [40].
In large networks with many layers, layers using the sigmoid and tanh functions fail to receive useful gradient information. The error used to update the weights through backpropagation decreases with each additional layer [41]. This results in the vanishing gradient problem, which prevents deep networks from learning effectively or finding suitable parameters to improve the cost function [39].
To train networks with many layers, a specific activation function is needed. This activation function must act like a linear function so that it remains sensitive to the activation input sum, yet it must be nonlinear to allow the complex relationships within the data to be learned, and it must avoid easy saturation [39]. To permit deep network development, major algorithms have significantly improved their performance by replacing hidden sigmoid units with piecewise linear hidden units, known as rectified linear units (ReLU) [41]. Because the ReLU is linear for positive inputs and outputs zero for the others, it is recognized as a piecewise linear function [40].

a. ReLU
A ReLU activation is applied to all hidden layers. Three fully connected layers follow the max pooling layers. To achieve high training accuracy, a dropout layer and a SoftMax classifier are connected as the last layers. After the dropout, the results are smoothed and fully connected via SoftMax. The feature mapping includes convolutions, the ReLU activation function, and batch normalization. To retain a bounded feature map, the model is divided into separate blocks with stacked layers that reduce the feature map's dimension. The model was trained for 100 epochs on the dataset. ReLU is the most common activation function used in networks with many layers. ReLU overcomes several problems, e.g., the vanishing gradient problem [41]. ReLU is given by Eq. (5). The weights are updated using the update rule in Eq. (6).
where $w^*$ is the new weight computed from the current value w, η is the learning rate, and $\partial E^2 / \partial w$ is the partial derivative of the error with respect to w [40]. The derivative of the error term describes the sensitivity of the error $E^2$ to the weight. The derivative term can be evaluated using the chain rule. A vanishing gradient problem is encountered if the partial derivative yields very small updates. The exploding gradient problem is the opposite of the vanishing problem: the values of the weights increase rapidly [40]. The ReLU activation for input z is zero if z is less than zero and equal to z if z is greater than or equal to zero. ReLU is superior when developing CNNs, allowing the model to learn faster and perform better [39].
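The ReLU definition and a single weight update can be sketched for one neuron as follows. The sketch assumes the usual gradient descent form of the update, in which the new weight is the old weight minus the learning rate times $\partial E^2 / \partial w$, with a squared error between the output and a target; the function names and the numeric values are hypothetical.

```python
def relu(z):
    """Return z for z >= 0 and zero otherwise."""
    return z if z >= 0 else 0.0


def relu_derivative(z):
    # Slope is 1 for z >= 0 and 0 otherwise (the value at z = 0 is a convention).
    return 1.0 if z >= 0 else 0.0


def update_weight(w, x, target, learning_rate):
    """One gradient descent step for a single ReLU neuron with squared error."""
    z = w * x
    error = relu(z) - target
    # Chain rule: d(E^2)/dw = 2 * E * relu'(z) * x
    gradient = 2.0 * error * relu_derivative(z) * x
    return w - learning_rate * gradient


w_new = update_weight(w=0.5, x=2.0, target=3.0, learning_rate=0.1)
print(w_new)
```

For active units the ReLU derivative is exactly one, so the factor contributed by the activation does not shrink the gradient as it passes backward through many layers, which is the property the text credits with easing the vanishing gradient problem.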

b. SoftMax
SoftMax enables the model to map certain classes to certain logits by maximizing the target classes' logit values. It also generates a discrete probability distribution over the class outcomes [36]. This can lead to an effective training process and a useful machine-learning model. Besides its normalization properties, SoftMax can be very useful for optimizing the network model [42]. SoftMax is a squashing function that produces vectors whose elements lie in the range (0, 1) and sum to one. These elements are regarded as scores that represent class probabilities in multiclass classification [36]. Let the output scores be denoted as s. The SoftMax output for a class depends on the scores of all classes, not only on that class's own score $S_i$. The SoftMax function for an individual class $S_i$ is given by Eq. (7) [4].
where $S_j$ are the scores inferred by the net for all classes. SoftMax ensures that the outputs of the last layer of the network, to which no other activation function is applied, are nonnegative real-valued probabilities that sum to one [36].
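A minimal sketch of the SoftMax computation is given below, using the standard definition in which each score is exponentiated and divided by the sum of the exponentiated scores of all classes. Subtracting the maximum score before exponentiation is an implementation choice for numerical stability and does not change the result; the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for seven lesion classes.
class_scores = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

Because every output is divided by the same sum, raising one class score lowers the probabilities of all other classes, which is why the function depends on all scores rather than on one score alone.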
Optimization
During the iterative training process, the predictions are compared with the targets and summarized in a loss value. The improvement for the backpropagation is computed according to the loss value [36]. Performance improvement is subsequently performed using the optimizer and its idiosyncrasies. The iterative process stops when the model no longer achieves a significant improvement in performance [4]. For this purpose, optimization techniques such as the cross-entropy loss and the Adam optimizer are used within the network architecture. For each trainable parameter, the optimizer adapts the parameter with respect to the loss and the intermediate layers [32]. The goal of the optimization problem is to find the optimal mapping function f(x) that minimizes the loss function L over the N training samples (Eq. (8)) [30].
where θ is the parameter of the mapping function, $x_i$ is the feature vector of the i-th sample, and $y_i$ is the corresponding label. Stochastic gradient descent (SGD) outperforms batch gradient descent for large-scale data [31]. SGD uses only one random sample to update the gradient in each iteration instead of computing the exact gradient over all samples. The cost of SGD is independent of the number of samples, and it can converge faster [30]. SGD decreases the update time for large data samples and removes computational redundancy. The loss function is represented by Eq. (9) [33]. The loss function L for a randomly selected sample i in SGD is represented by Eq. (10) [30]. SGD updates the gradient using a random sample i in each iteration rather than all samples [33].
where η is the learning rate and θ is the parameter, updated from its previous value. A straightforward enhancement to SGD is AdaGrad (Adaptive Gradient Method). AdaGrad adjusts the learning rate automatically based on the previous iterations. The update of the gradient in AdaGrad is given in [33], where $\theta_t$ is the value of parameter θ at iteration t, η is the learning rate, $g_t$ is the gradient of parameter θ at iteration t, and $V_t$ is the accumulated historical gradient of parameter θ at iteration t. An improvement on AdaGrad resolves its radically diminishing learning rates by calculating the second-order cumulative momentum [33].
where β represents the exponential decay parameter.
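The difference between AdaGrad's accumulated gradient history and an exponentially decayed history can be sketched for a single parameter. The update equations themselves are not reproduced in the text, so the forms below are assumptions based on the common formulations: AdaGrad sums squared gradients, while the decayed variant (RMSProp-style) weights them with the decay parameter β. The placement of the stability constant `eps` and all function names are hypothetical.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad: divide the step by the root of the summed squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (math.sqrt(accum) + eps)
    return theta, accum


def decayed_step(theta, grad, accum, lr=0.1, beta=0.9, eps=1e-8):
    """Decayed variant: old squared gradients fade at rate beta."""
    accum = beta * accum + (1.0 - beta) * grad ** 2
    theta = theta - lr * grad / (math.sqrt(accum) + eps)
    return theta, accum


# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta_a, accum_a = adagrad_step(theta=1.0, grad=2.0, accum=0.0)
theta_d, accum_d = decayed_step(theta=1.0, grad=2.0, accum=0.0)
print(theta_a, theta_d)
```

Because AdaGrad's accumulator only grows, its effective learning rate keeps shrinking; the decayed accumulator forgets old gradients, which is the remedy for diminishing learning rates that the text describes.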

a. Adaptive Moment Estimation (Adam)
Adam is a further advance on the SGD technique. Adam computes an adaptive learning rate for each parameter, integrating the adaptive learning method with the momentum method [32]. In addition to storing an exponentially decaying average of past squared gradients $V_t$, Adam keeps an exponentially decaying average of past gradients $m_t$ (Eq. (14)), similar to the momentum method [33].
where $\beta_1$ and $\beta_2$ are exponential decay rates. Thus, the final form for the parameter θ is given by Eq. (16) [33]. Most implementations use default values of 0.9, 0.999 and $10^{-8}$ for $\beta_1$, $\beta_2$ and ε, respectively. Adam performs well in practice compared with other adaptive learning rate algorithms [31].
b. Cross-Entropy Loss
The choice of the loss function is also regarded as a significant part of the optimization. The model's loss function is used to estimate the current state of the model repeatedly. The choice of loss function steers the weights in a suitable direction to reduce the loss at the next evaluation. The cross-entropy loss function is typically used for multiclass classification problems, where the target values are assigned integer values. The assigned integer target values are treated as categorical within the experiments [43]. Cross-entropy calculates a score that summarizes the average difference between the actual and predicted values for all classes. The cross-entropy score is minimized towards the optimal value of 0. Categorical cross-entropy L is defined by Eq. (17) [44].
where $y_c$ is the output based on input x and weight $w_c$, c is the index running over the classes, and $t_c$ is the number of occurrences of c. The cross-entropy loss function is derived mathematically under the inference framework of maximum likelihood. Maximizing the likelihood of the training set minimizes the cross-entropy loss, as in Eq. (18) [44].
where $y_c$ is the corresponding target value, $\hat{y}_c$ is the scalar value with index c in the model output, and log indicates the log-likelihood. Choosing the cross-entropy loss function instead of the sum-of-squares for a classification problem leads to better training as well as improved generalization performance [43].
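A single Adam update for one parameter can be sketched as follows. The sketch follows the widely used formulation of Adam, including its bias-correction terms, and uses the default decay rates quoted above; whether it matches Eqs. (14)-(16) term by term is an assumption, and the function name `adam_step` and the learning rate are hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at iteration t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = theta^2 starting from theta = 1 (gradient is 2 * theta).
theta, m, v = 1.0, 0.0, 0.0
history = [theta]
for step in range(1, 6):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, step)
    history.append(theta)
print(history)
```

On the first iteration the corrected ratio of the two averages has magnitude close to one, so the step size is approximately the learning rate regardless of the gradient's scale.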
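The categorical cross-entropy for one sample can be sketched as below, assuming the standard form in which the loss is the negative sum over classes of the target value times the logarithm of the predicted probability. The clipping constant `eps` and the example vectors are hypothetical additions to keep the logarithm finite.

```python
import math


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# One-hot target for the second of three classes (toy example).
target = [0.0, 1.0, 0.0]
confident = categorical_cross_entropy(target, [0.1, 0.8, 0.1])
uncertain = categorical_cross_entropy(target, [0.4, 0.2, 0.4])
print(confident, uncertain)
```

With a one-hot target, only the predicted probability of the true class contributes, so the loss approaches zero as that probability approaches one and grows as it falls.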

Implementation
Experiments are performed using Google Colaboratory (Colab) for accelerating GPU-centric deep learning applications. The hardware configuration of the Colab accelerated runtime used to execute the code is an Nvidia K80 GPU with 12 GB RAM and 2496 CUDA cores.
The dermoscopic images of the HAM10000 dataset were first resized to 32 × 32 and sampled for scale augmentation. A combination of affine image transformations, i.e., rotation, shifting, scaling (zoom in/out), and flipping, was then performed to synthesize new data. The number of image samples was increased to a total of 12519 images by data augmentation. Batch normalization was adopted after each convolution and before activation. SGD was also used with a mini-batch size of 256. The learning rate was reduced from 0.1 to 0.01 when the error plateaued. The model was trained for 60 × 10^4 iterations over 100 epochs. The weight decay was set to 0.0001 and the momentum to 0.9. The images of the dataset were split into 80% for training and 20% for testing. For comparison, the same model was also trained using another splitting ratio of 70% for training and 30% for testing. For best results, various improvement techniques were adopted, i.e., regularization, hyperparameter tuning, and optimization using the Adam optimizer and the categorical cross-entropy loss function.

Performance Evaluation
The evaluation results of the trained model are calculated using different performance metrics that are defined by Eqs. (19)-(24).
$$\text{Sen.} = \frac{TP}{TP + FN} \qquad (23)$$
$$\text{Spec.} = \frac{TN}{TN + FP} \qquad (24)$$
where TP denotes the number of positive instances that are labeled correctly, FP denotes the number of negative instances that are mislabeled as positive, TN denotes the number of negative instances that are labeled correctly, and FN denotes the number of positive instances that are mislabeled as negative [45]. Precision and recall provide the positive predictive value and the true positive rate, respectively. DSC provides the harmonic mean of precision and recall. ROC curves, which give a graphical representation of the relationship between the sensitivity and specificity measures, are also used in the evaluation along with the associated AUC values [9].
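For a multiclass problem, the four counts can be read from the confusion matrix one class at a time (one-vs-rest), after which the metrics follow directly. The sketch below assumes rows are true classes and columns are predicted classes, and uses the standard definitions of accuracy, precision, recall (sensitivity), specificity, and DSC; the 3 × 3 matrix is a made-up toy example, not data from the experiments.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class `cls` (rows: true class, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # true cls, predicted other
    fp = sum(confusion[r][cls] for r in range(n)) - tp   # true other, predicted cls
    tn = total - tp - fn - fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # sensitivity
        "specificity": tn / (tn + fp),
        "dsc": 2 * tp / (2 * tp + fp + fn),
    }


# Toy confusion matrix for three classes (hypothetical counts).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
metrics = per_class_metrics(toy_confusion, cls=0)
print(metrics)
```

Averaging these per-class values over all classes gives the macro-averaged figures; the sketch does not guard against empty classes, where a denominator would be zero.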

Experimental Results
The rst run was done by splitting the HAM dataset, which contains 12519 dermoscopic images of seven respective classes, into 80% for training and 20% for testing. The rst run of training is done using 9514 dermoscopic images and 3005 for testing the HAM dataset concerning the seven classes: AKIEC, BKL, BCC, VASC, DF, MEL, and NV. The number of epochs for which the model was trained was 100. To evaluate the performance of the proposed framework, four performance metrics are computed for each class separately. Therefore, the average for these values is computed. The confusion matrix of this experiment is shown in Fig. 3.
The four outcomes of the confusion matrix for the rst run experiment are used to evaluate the performance metrics results. The ned-tuned model's performance in terms of the three performance metrics, i.e., precision, recall, and DSC, are outlined in Tab. 2.
The performance metrics average is 98%, 96%, 97%, and 96.5% for accuracy, precision, recall, and DSC, respectively. Hence, the average for sensitivity and speci city is 98% and 100%, respectively. The second run is performed using the same conditions and architecture with another splitting ratio of 70% for training and 30% for testing using the hold-out method. The average values of performance metrics are computed. These values are 96.00% for the average accuracy of the seven classes, 90% for average accuracy of precision, 95.25% for recall, and 92.85% for average DSC. The confusion matrix of this experiment is shown in Fig. 4.   Figure 4: The confusion matrix for the second run The confusion matrix outcomes for the second run experiment are used to evaluate the results in terms of performance metrics. The performance of the ned-tuned model in terms of the three performance metrics, i.e., precision, recall, and DSC are outlined in Tab. 3. The performance metrics averages are 98%, 98%, 97%, and 96% for accuracy, precision, recall, and DSC, respectively. Hence, the averages for sensitivity and speci city are 98% and 100%, respectively. ROC curve is constructed to visualize the performance for the ne-tuned deep network in the classi cation of seven classes of skin cancer images. ROC curve for the ne-tuned deep and the seven respective classes is created by plotting the true positive rate (TPR) against false positive rate (FPR). Fig. 5a shows the relationship between sensitivity and speci city for the multiclass model. Moreover, for balanced results, macro and micro averaged ROC AUC scores are also calculated. Other deep models of pre-trained networks on ImageNet are utilized for comparison, i.e., Inception v3 and VGG-16. Inception v3 and VGG-16 networks were re-trained using HAM dataset along with its seven respective classes. 
For Inception v3, the top layers are replaced with one average-pooling layer that averages the channel values across the 2D feature map. The model ends with two fully connected layers and a softmax layer that categorizes the results into the seven classes. The inception module contains filters of 3 × 3 kernel size and various combinations of convolutions. Smart factorization of convolutions using 3 × 3 convolutions, 5 × 5 convolutions, then 2 × 2 convolutions was also applied to reduce computational complexity. The input images were resized to (299, 299) to be compatible with the model. Hyperparameters are fine-tuned and set to optimal values. The learning rate was set to 0.0001, and SGD was set to decay. Model optimization was performed using the Adam optimizer with momentum 0.9.
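To make the average-pooling step concrete, the sketch below reduces a feature map to one mean value per channel, which is what a global average-pooling layer does before the fully connected layers. It assumes a channels-last layout (`feature_map[h][w][c]`); the function name and the tiny 2 × 2 × 2 input are hypothetical.

```python
from typing import List


def global_average_pool(feature_map: List[List[List[float]]]) -> List[float]:
    """Collapse an (H, W, C) feature map into a length-C vector of channel means."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [t / (height * width) for t in totals]


# Tiny 2 x 2 feature map with 2 channels (made-up values).
toy_map = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
toy_pooled = global_average_pool(toy_map)
print(toy_pooled)
```

Because the output length depends only on the number of channels, this layer adds no trainable parameters ahead of the classification head.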
The VGG-16 network consists of 16 weight layers: 13 convolution layers with a small receptive field of 3 × 3, followed by three fully connected layers. It has five max-pooling layers of size 2 × 2, and the three fully connected layers follow the last max-pooling layer at the end of the network. The input images were resized to (224, 224). Hyperparameter fine-tuning is performed by setting the learning rate to 0.0007 and SGD to decay. The model was also optimized using the Adam optimizer with momentum 0.9. The average results of the two runs are compared against the Inception v3 and VGG-16 deep models. The comparison in terms of averaged precision, DSC, and ROC AUC for each deep network is summarized in Tab. 4.
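The effect of the five 2 × 2 max-pooling layers in VGG-16 on the spatial resolution can be traced with a short calculation. The sketch assumes the standard VGG-16 configuration, which the text does not spell out: each pooling layer uses stride 2, and the 3 × 3 convolutions are padded so that they preserve the spatial size. The function name is hypothetical.

```python
from typing import List


def pooled_sizes(input_size: int, num_pools: int = 5, pool: int = 2) -> List[int]:
    """Spatial side length after each pooling layer (assumes stride equals pool size)."""
    sizes = [input_size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // pool)
    return sizes


vgg_sizes = pooled_sizes(224)
print(vgg_sizes)
```

Under these assumptions, a (224, 224) input is halved five times before it reaches the fully connected layers.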

Discussion and Conclusion
Tabs. 3 and 4 show the proposed deep network model's performance in two different runs that test the model using two hold-out splits. The proposed model's performance is presented for the seven classes and shows promising results in recognizing different skin cancer lesions. The two runs' average results are listed in Tab. 5, which shows superior results compared with the Inception v3 and VGG-16 deep models. The proposed deep model achieved a weighted average precision of 88.77%, an average DSC of 87.55%, and an average ROC AUC of 99%. In comparison, the Inception v3 and VGG-16 deep models achieved 83.53% and 84.75% for precision, respectively. For DSC, they achieved 83.22% and 85.15%, respectively. For ROC AUC, they achieved 98.2% and 98.32%, respectively.
The ROC AUC values for each class shown in Tab. 5 are as follows. For the melanoma and AKIEC classes, the proposed deep network scored the best with 99% and 100%, respectively. For the typical nevi and benign keratosis categories, it also achieved higher AUC values of 98% and 100%, respectively. For basal cell carcinoma, VGG16 achieved a higher AUC of 99.1% than the proposed deep network and InceptionV3, with values of 99% and 98.6%, respectively. For dermatofibroma cases, VGG16 achieved a higher AUC of 99.8% than the proposed deep network and InceptionV3, with values of 99% and 99%, respectively. For the vascular lesions category, the proposed deep network, VGG16, and InceptionV3 all achieved 100%. The ROC curve for the fine-tuned deep network in the classification of seven classes of skin cancer images in Fig. 5a shows superior performance compared to the ROC curves for the comparative deep network models in the other subfigures.
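The weighted averages quoted above can be illustrated with a short sketch. The text does not state the weighting scheme, so the code assumes the common choice for imbalanced data of weighting each class by its number of test samples (its support); the function name and all numbers are made up for illustration.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], supports: Sequence[int]) -> float:
    """Average per-class values, weighting each class by its number of test samples."""
    if len(values) != len(supports):
        raise ValueError("values and supports must have the same length")
    total = sum(supports)
    if total == 0:
        raise ValueError("at least one class must have samples")
    return sum(v * n for v, n in zip(values, supports)) / total


# Made-up per-class precision values and class sizes (not from the paper).
toy_precision = [0.90, 0.60, 0.80]
toy_support = [100, 20, 80]
toy_weighted = weighted_average(toy_precision, toy_support)
toy_unweighted = sum(toy_precision) / len(toy_precision)  # plain mean, for contrast
print(round(toy_weighted, 4), round(toy_unweighted, 4))
```

With unbalanced classes, the support-weighted value leans toward the performance on the largest classes, so it can sit above or below the plain per-class mean.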
This study investigated the capability of deep learning in the multiclass classification of 7 primary skin lesions. In the performance evaluation on HAM dermoscopic images (12519 in total), the pre-trained ResNet50 network outperforms other robust networks. A variety of fine-tuning techniques have been investigated for enhancing diagnosis performance, such as regularization, batch normalization, and hyperparameter optimization. The Adam optimizer and the cross-entropy loss function are also utilized with optimal parameters. For evaluation, the developed deep model was compared against two powerful models, i.e., InceptionV3 and VGG16. The results for the proposed fine-tuned deep learning model show that fine-tuning networks can achieve better diagnostic accuracy than other powerful techniques. Although the utilized dataset is highly unbalanced, the model obtained promising results. These models can be easily implemented to assist dermatologists. For future work, a dataset with more diverse skin lesion categories can be investigated. Also, using the image metadata may help enhance diagnostic accuracy.