Biomedical Signal Processing and Control

In this paper, we propose a deep convolutional neural network framework to classify dermoscopy images into seven classes. Taking advantage of the fact that these classes can be merged into two (healthy/diseased) ones, we can train one part of the network on this binary task only. The confidences of the binary classification are then used to tune the multi-class confidence values provided by the other part of the network, since the binary task can be solved more accurately. For both classification tasks we used GoogLeNet Inception-v3; however, any CNN architecture could be applied for these purposes. The whole network is trained in the usual way, and as our experimental results on skin lesion image classification show, the accuracy of the multi-class problem is remarkably raised (by 7% in terms of balanced multi-class accuracy) via embedding the more reliable binary classification outcomes.


Introduction
Skin cancer is a common and locally destructive cancerous growth of the skin. It originates from the cells that line up along the membrane that separates the superficial layer of the skin from the deeper ones. As pigmented lesions occur on the surface of the skin, malignant behavior (e.g. melanoma) can be recognized early via visual inspection performed by a clinical expert. Dermoscopy is an imaging technique that eliminates the surface reflection of the skin. By removing surface reflection, more visual information can be obtained from the deeper levels of the skin.
In the last few years, computer aided diagnosis (CAD) has become more and more important in skin cancer detection [1]. As affordable mobile dermatoscopes are becoming available as smartphone attachments, automated assessment is expected to positively influence corresponding patient care for a wide population. Given the widespread availability of high-resolution cameras, algorithms that can improve our ability to assess suspicious lesions can be of great value.
There is a long history of computer aided dermoscopy image analysis, and its literature is accordingly extensive. The common protocol is to apply some pre-processing for image enhancement and artifact removal and then perform classification based on certain extracted features [2]. The current trend is to consider deep learning-based approaches; our expectation was that classification accuracy can be raised by exploiting a smaller number of merged classes in a multi-class scenario, and our empirical results have justified this expectation. In [7,8], a type of assisted learning was introduced for emotion recognition: the binary classification outcome there determined the further allowed labels in the multi-class problem, which is basically a classic decision tree-based method. In contrast to this approach, we propose a network architecture that embeds binary classification in the training process. The connected convolutional neural networks learn simultaneously and set their parameters to optimize the common loss function at the ensemble-system level, based on the developed mathematical background.
The rest of the paper is organized as follows. In Section 2, we describe our novel methodology, presenting first a 7-class skin lesion dataset. We introduce our network architecture together with a formal description of applying binary classification support during training. Our experimental results are presented for the ISIC2018 challenge data in Section 3. We discuss our hardware and training setups in Section 4, and also how the proposed basic model can be improved further to increase its competitiveness in skin lesion classification. Finally, in Section 5 some conclusions are drawn.

Data
The organizers of the challenge International Skin Imaging Collaboration (ISIC) 2018: Skin Lesion Analysis Towards Melanoma Detection called for participation in developing efficient methods to classify skin lesion images into seven classes. Namely, the images are labeled according to the following classes of skin lesions:
• C_BKL: benign keratosis (solar lentigo/seborrheic keratosis/lichen planus-like keratosis),
• C_DF: dermatofibroma,
• C_NV: melanocytic nevus,
• C_AKIE: actinic keratosis or Bowen's disease,
• C_BCC: basal cell carcinoma,
• C_MEL: melanoma,
• C_VASC: vascular lesion.
Our data was extracted from the ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection grand challenge datasets [9,10]. The ISIC2018 challenge has released an image set collected by Tschandl et al. [9] at the Department of Dermatology of the Medical University of Vienna and at the skin cancer practice of Cliff Rosendahl in Queensland, Australia. The authors classified the collected images into seven generic classes for simplicity and because more than 95% of the lesions appearing in clinical practice fall into one of the seven diagnostic categories [9]. The published dataset contains 10,015 images for training, 193 for validation, and 1512 for testing purposes. The training set consists of images with manual annotations regarding the seven different classes in the following proportions: 6705 images with nevus lesions, 1113 with melanoma, 514 with basal cell carcinoma, 327 with actinic keratosis, 1099 with any benign keratosis, 115 with dermatofibroma, and 142 with vascular lesions. The number of images in these classes also reflects the prevalence of the classes in the population. Table 1 summarizes the number of images in the HAM10000 training set according to diagnosis, and Fig. 1 depicts the distribution of the images among the classes.
The number of images in certain classes (excluding the nevus one) is not sufficiently large for training deep CNNs. To increase the number of training images, while also avoiding over-fitting of the network and reducing the differences between the amounts of images in the different classes, we have followed the commonly proposed solution [11] for the augmentation of the training dataset, such as cropping random samples from the images or horizontally flipping or rotating them at different angles. We also note that augmentation strategies to increase the size of the training dataset need to be applied only after a careful understanding of the problem domain. That is, in order to avoid any modification of the characteristic texture of the different types of lesions, we could not use the arbitrary scaling or aspect-ratio changes that are typically used. As the resolution of the images in the dataset is 450 × 600 pixels, but the applied CNN architectures originally require input images of spatial resolution 299 × 299 pixels, we randomly cropped subimages of the required size from the original ones instead of using scaling. Moreover, on these extracted images we applied rotation by an angle randomly selected from the set {90°, 180°, 270°} and horizontal/vertical flipping. In order to create a more heterogeneous dataset, we applied random brightness and contrast factors to set the brightness and the contrast of the images. Using these procedures, we have generated modified training images and increased the number of sample images to 4452 for melanoma, 4626 for basal cell carcinoma, 4251 for actinic keratosis, 4396 for benign keratosis, 2415 for dermatofibroma, and 2982 for vascular lesions.
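The augmentation steps above (random crop instead of scaling, 90°-multiple rotation, flips, brightness/contrast jitter) can be sketched as follows. This is a minimal illustration, not the authors' code; the jitter ranges are assumptions for demonstration only.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly crop, rotate, flip, and jitter one dermoscopy image.

    img: H x W x 3 uint8 array (e.g. 450 x 600 x 3); the output is
    299 x 299 x 3, matching the Inception-v3 input size used in the paper.
    """
    crop = 299
    h, w = img.shape[:2]
    # random crop instead of scaling, so lesion texture is preserved
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop].astype(np.float32)
    # rotate by 90, 180, or 270 degrees (the angles named in the text)
    out = np.rot90(out, k=rng.integers(1, 4))
    # horizontal / vertical flips, each applied with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1]
    if rng.random() < 0.5:
        out = out[::-1, :]
    # random brightness (additive) and contrast (multiplicative) jitter;
    # these ranges are assumed, the paper does not specify them
    brightness = rng.uniform(-20.0, 20.0)
    contrast = rng.uniform(0.8, 1.2)
    out = np.clip((out - 128.0) * contrast + 128.0 + brightness, 0, 255)
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(450, 600, 3), dtype=np.uint8)
aug = augment(sample, rng)
print(aug.shape)  # (299, 299, 3)
```

In practice such a function would be applied repeatedly per source image to reach the per-class sample counts reported above.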
The seven classes C_1, ..., C_7 of skin lesions can be further grouped as negative/positive (benign/malignant) ones as

C_NEG = C_BKL ∪ C_DF ∪ C_NV ∪ C_VASC,  C_POS = C_AKIE ∪ C_BCC ∪ C_MEL.  (1)

With this formulation we can set up a binary classification problem besides the original multi-class (7-class) one. Our motivation for adding the binary classification problem to the original one is that we can take advantage of the output of the binary classifier to make a finer labeling according to the 7 classes, since the simpler binary task (where the number of training samples per class is larger) can be solved more accurately. For an impression of the characteristics of the lesions belonging to the seven classes, see also Fig. 2.
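The benign/malignant grouping can be expressed as a simple label mapping. Note that the exact membership of the two groups was lost from the extracted text of equation (1), so the split below (benign: BKL, DF, NV, VASC; malignant: AKIE, BCC, MEL) is an assumption based on the clinical nature of the classes.

```python
# Assumed class-to-group mapping for the binary (benign/malignant) task.
NEGATIVE = {"BKL", "DF", "NV", "VASC"}   # assumed benign classes
POSITIVE = {"AKIE", "BCC", "MEL"}        # assumed malignant classes

def to_binary(label: str) -> int:
    """Map a 7-class label to the binary task: 0 = negative, 1 = positive."""
    if label in NEGATIVE:
        return 0
    if label in POSITIVE:
        return 1
    raise ValueError(f"unknown class label: {label}")

print([to_binary(c) for c in ("NV", "MEL", "BKL", "BCC")])  # [0, 1, 0, 1]
```

With such a mapping, the binary training labels are obtained directly from the 7-class annotations, without any extra labeling effort.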

Network architecture
As for the design of our network architecture, our main motivation was to let the more reliable binary classifier influence the output of the multi-class one. As the backbone architecture for both the binary and multi-class classifiers we have considered GoogLeNet Inception-v3 [12], as GoogLeNet is reported to show solid performance in skin lesion classification [3]. As is a common requirement when designing deep convolutional architectures, we also had to take care to incorporate functional elements that support training via efficient backpropagation. Consequently, we had to involve a function whose derivative can be given in closed form to describe the influence of the binary classifier.
To realize the above aims, we have considered the GoogLeNet Inception-v3 model pre-trained on ImageNet [13], and its layers have been fine-tuned twice: once for the binary and once for the 7-class task. Then, these two fine-tuned models have been composed into one network architecture as shown in Fig. 3. Namely, the proposed network can be divided into two main CNN branches, one of them dedicated to binary and the other to multi-class classification. Thus, the two branches result in respective 2D and 7D softmax layers, which are merged in a Support Training Layer (STL). Then, as the last layer of the network, a 7D softmax layer is considered to address the original multi-class classification problem. At the STL layer, the probability values found by the binary classifier are used to refine the corresponding multi-class probabilities via keeping/dropping them (CASE I) or via a simple multiplication (CASE II). More formally, let us suppose that at the STL layer we have class confidences p_NEG and p_POS with p_NEG + p_POS = 1 from the binary classifier, while p_BKL + p_DF + p_NV + p_AKIE + p_BCC + p_MEL + p_VASC = 1 from the multi-class classifier regarding the corresponding classes. Moreover, let [p] = round(p) denote the round operator, which provides 0 or 1 for the binary probabilities. Then, the confidence values for the final 7D softmax layer are calculated as

CASE I: p'_k = [p_NEG] · p_k if C_k ⊆ C_NEG, and p'_k = [p_POS] · p_k if C_k ⊆ C_POS,
CASE II: p'_k = p_NEG · p_k if C_k ⊆ C_NEG, and p'_k = p_POS · p_k if C_k ⊆ C_POS,

with a consequent normalization to sum them up to 1 in both cases.
To better interpret the differences between CASE I and II, notice that with CASE I we basically follow a classic decision rule/tree model [7,8], excluding the classes corresponding to the binary class having lower confidence; technically, the round operator provides a 0 multiplier accordingly. As a refined approach, CASE II follows a more dynamic way by tuning the 7-class confidences with the binary ones. Though in our experiments CASE II has led to a remarkable improvement, we also enclose the performance of the CASE I approach as our initial attempt. In either case, for efficient computation we had to be able to embed our approach in the common training protocol of backpropagation, whose description is given next.
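The STL forward pass for both cases can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation; the index grouping (BKL, DF, NV, VASC as negative; AKIE, BCC, MEL as positive) is an assumption, since equation (1) was lost from the extracted text.

```python
import numpy as np

# Positions of the 7-class confidences in the assumed order
# (BKL, DF, NV, AKIE, BCC, MEL, VASC) that belong to each binary group.
NEG_IDX = np.array([0, 1, 2, 6])   # assumed benign classes
POS_IDX = np.array([3, 4, 5])      # assumed malignant classes

def stl_merge(p_bin, p_multi, case="II"):
    """Sketch of the Support Training Layer forward pass.

    p_bin:   [p_NEG, p_POS], summing to 1.
    p_multi: 7 class confidences, summing to 1.
    CASE I rounds the binary confidences to 0/1 (hard gating);
    CASE II multiplies by them directly (soft tuning).
    """
    w = np.round(p_bin) if case == "I" else np.asarray(p_bin, dtype=float)
    z = np.array(p_multi, dtype=float)
    z[NEG_IDX] *= w[0]
    z[POS_IDX] *= w[1]
    return z / z.sum()      # renormalize so the outputs sum to 1

p_bin = [0.8, 0.2]
p_multi = [0.10, 0.05, 0.40, 0.10, 0.10, 0.20, 0.05]
print(stl_merge(p_bin, p_multi, case="I"))   # positive classes zeroed out
print(stl_merge(p_bin, p_multi, case="II"))  # all classes kept, re-weighted
```

The hard gating of CASE I discards the malignant classes entirely in this example, while CASE II merely shifts probability mass toward the benign group.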

Training with backpropagation
In order to code the part of backpropagation corresponding to the STL, we have to compute the derivatives of the loss function with respect to the input data of this layer. For a formal description, let x = (x_1, ..., x_9) ∈ R^9 be the input of the STL layer, where x_1 = p_NEG, x_2 = p_POS are the confidence results of the binary classifier, while the remaining 7 coordinates x_3 = p_BKL, x_4 = p_DF, x_5 = p_NV, x_6 = p_AKIE, x_7 = p_BCC, x_8 = p_MEL, x_9 = p_VASC of x describe the final probabilities provided by the 7-class CNN classifier branch of the network. When z ∈ R^7 is the output of the STL layer/the whole network, and L is the loss function, then we have to compute the derivatives ∂L/∂x_i, i = 1, ..., 9. In CASE II, during the forward pass the vector z is calculated as

z_i = x_b(i) · x_(i+2),  i = 1, ..., 7,

where b(i) = 1 if C_i ⊆ C_NEG and b(i) = 2 if C_i ⊆ C_POS. To calculate the required derivatives, we can use the chain rule, hence

∂L/∂x_(i+2) = ∂L/∂z_i · x_b(i),  i = 1, ..., 7,
∂L/∂x_1 = Σ_{i: b(i)=1} ∂L/∂z_i · x_(i+2),  ∂L/∂x_2 = Σ_{i: b(i)=2} ∂L/∂z_i · x_(i+2),

where ∇_x L and ∇_z L denote the gradients of L with respect to x and z.
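The chain-rule derivatives above can be verified numerically. The sketch below implements the CASE II forward pass (normalization omitted for readability) and the analytic gradients, then checks them against finite differences; the benign/malignant index grouping is again an assumption.

```python
import numpy as np

NEG_IDX = [0, 1, 2, 6]   # assumed benign positions (BKL, DF, NV, VASC)
b = [1 if i in NEG_IDX else 2 for i in range(7)]   # b(i) per class

def forward(x):
    # x[0] = p_NEG, x[1] = p_POS, x[2:9] = 7-class confidences
    # (CASE II, normalization omitted to keep the derivatives readable)
    return np.array([x[b[i] - 1] * x[i + 2] for i in range(7)])

def backward(x, dL_dz):
    # chain rule: dL/dx_{i+2} = dL/dz_i * x_{b(i)};
    # dL/dx_1 and dL/dx_2 collect the terms of their respective groups
    g = np.zeros(9)
    for i in range(7):
        g[i + 2] = dL_dz[i] * x[b[i] - 1]
        g[b[i] - 1] += dL_dz[i] * x[i + 2]
    return g

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 0.9, 9)
dL_dz = rng.normal(size=7)            # upstream gradient from the loss
analytic = backward(x, dL_dz)

# finite-difference check of the analytic gradient
eps = 1e-6
numeric = np.zeros(9)
for j in range(9):
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    numeric[j] = (forward(xp) @ dL_dz - forward(xm) @ dL_dz) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

Such a gradient check is a standard sanity test when implementing a custom layer like the STL by hand.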
In CASE I, using the previous notations, the vector z is given by

z_i = [x_b(i)] · x_(i+2),  i = 1, ..., 7.

Since the derivative of the round function is 0 everywhere except at 0.5, the partial derivatives of L with respect to x_1 and x_2 are equal to 0. It means that in this case the binary classifier, as part of the whole network, cannot be trained. In other words, the binary classifier branch of the system remains fixed during the whole training process, which lends a less flexible characteristic to CASE I. As a minor technical issue, notice that the derivative of the round function does not exist at 0.5; however, similarly to the same phenomenon for ReLU, this event occurs with probability 0.

Experimental results
In this section, we summarize our experimental findings, with a special motivation to observe the effect of the binary classification assistance on the multi-class predictor for both CASE I and II. To be able to observe the improvement gained by this support, we also give the quantitative results for the initial multi-class classifier without binary support. We start our presentation with this original setup.
The models are evaluated on the test set consisting of 1512 images provided by the ISIC 2018 challenge organizers. The evaluations have been performed on the official challenge web site according to the performance measures prescribed there. The submitted solutions are primarily ranked by balanced multi-class accuracy (BMA), which is a commonly used measure in multi-class classification problems concerning imbalanced datasets. BMA is defined as the average sensitivity value obtained for the 7 classes. Moreover, common performance measures, namely accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC), have been calculated for each individual class as well.
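Since BMA is just the mean of the per-class sensitivities, it can be computed directly from a confusion matrix. The toy matrix below is for illustration only, not challenge data.

```python
import numpy as np

def balanced_multiclass_accuracy(cm: np.ndarray) -> float:
    """BMA = mean per-class sensitivity (recall), computed from a
    confusion matrix with rows = true classes, columns = predictions."""
    per_class_recall = cm.diagonal() / cm.sum(axis=1)
    return float(per_class_recall.mean())

# toy 3-class confusion matrix for illustration (not challenge data)
cm = np.array([[80, 10, 10],
               [20, 60, 20],
               [ 5,  5, 90]])
print(balanced_multiclass_accuracy(cm))  # (0.8 + 0.6 + 0.9) / 3
```

Unlike plain accuracy, BMA weights every class equally, so a classifier cannot score well by favoring the dominant nevus class.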
For the sake of completeness, we enclose the detailed results of the models, presenting also their confusion matrices. Here, the diagonal elements represent the number of true positive cases for each class normalized by the cardinality of the given class (a.k.a. sensitivity), while the off-diagonal entries correspond to elements mislabeled by the given classifier. Since the official challenge web site does not support this type of performance evaluation, we have considered 20% of the augmented training set (5964 images) for validation to compute these confusion matrices. This is the reason why the sensitivity figures of the models regarding the different classes slightly differ between the tables and the corresponding confusion matrices. However, this is indeed a minor technical issue, since the trends naturally coincide.

Results with no binary classification support
The simplest attempt to address this specific 7-class skin lesion classification task is to consider a single backbone CNN network with a final 7D softmax layer. For our current specific architecture, this can be realized by restricting the network shown in Fig. 3 to its lower branch only, i.e., a single GoogLeNet Inception-v3 with 7-class output. The BMA of this model has been found to be 0.602, with the additional performance measures shown in Table 2 for each class. To provide a more comprehensive analysis, the corresponding confusion matrix is also given in Fig. 4.
Moreover, since simple binary (benign/malignant) classification can be highly informative for users of skin-related CAD systems, in Table 3 we enclose the corresponding results. Notice that these binary classification results are derived by merging the 7 skin lesion classes according to (1) into a negative (healthy) and a positive (diseased) class, and considering the upper branch of our architecture from Fig. 3, i.e., a single GoogLeNet Inception-v3 with binary output, for this task. For the sake of completeness, the confusion matrix for the simple binary classification setup is also given in Fig. 5. The comparison of Tables 2 and 3 reflects our initiative to exploit the better binary classification performance in the multi-class problem.

Results with binary classification support
Now we turn to the presentation of our experimental results corresponding to the main purpose of the current work: including binary classification support in the multi-class task. To be able to observe the continuous improvement in our approach, we present the experimental outcomes for both CASE I and II described in Section 2.2, that is, when binary support is involved in a rather drastic way (CASE I) and when it is applied to tune the multi-class predictions (CASE II).
As for CASE I, by multiplying the class probabilities of the 7 skin lesion classes by either 0 or 1 (the rounded class probabilities of the binary classifier), we follow a simple decision-tree-like model. In Table 4, we present the performance figures regarding the 7-class task solved by this approach according to the final softmax layer of the whole network. For more detailed comparative purposes, we enclose the corresponding confusion matrix in Fig. 6, as well.
Comparing Table 4 with Table 2, we can see that the CASE I approach has already outperformed the original single GoogLeNet Inception-v3 one regarding both its BMA = 0.639 value and the other measures. This result suggests that even the one-directional influence of the binary classifier on the multi-class one is capable of leading to a slight improvement. However, notice that the derivative of the round function is zero everywhere (except at 0.5), and during backpropagation the binary classifier branch cannot fine-tune its parameters. Moreover, when the binary classifier misses the class label, the support sets each output neuron belonging to the group of the true class label to zero.
Next, we have applied the binary support in a more refined way according to CASE II, keeping the original probabilities of the binary classifier to tune the 7-class probabilities. The BMA of this model according to the official ISIC 2018 challenge evaluation is 0.677. Similarly to the previous cases, we give the corresponding performance values and confusion matrix in Table 5 and Fig. 7, respectively.
Comparing Table 5 (CASE II) with Table 4 (CASE I), and Table 2 (no binary support), we can clearly observe that our initial purpose to improve multi-class classification performance by the binary one is finally successfully realized; overall, we could reach a 7% raise in balanced multi-class accuracy.
For the sake of completeness, we have made some comparative measurements regarding state-of-the-art approaches that were used and evaluated during the ISIC 2018 Challenge: Skin Lesion Analysis Towards Melanoma Detection. Since many well-known CNNs were used as proposed methodologies for skin lesion classification, we have considered them as state-of-the-art solutions and included them in the quantitative comparison. Moreover, to show that our proposed method can be considered a general framework which can increase classification accuracy whenever classes are mergeable, we have involved other state-of-the-art CNN architectures in our final evaluation. Namely, some participants used the pre-trained ResNet-101 [4] and DenseNet-201 [14] in their solutions. Since superior results were achieved by DenseNet-201, we have also inserted this model into our framework to combine the seven-class classifier with its binary classifier version using the proposed support training layer. After training DenseNet-201 on the ISIC 2018 Challenge dataset, we have evaluated its performance both with and without the binary classifier support using the ISIC Live 2018.3: Lesion Diagnosis automated evaluation system. The respective performances are also included in the quantitative comparison (see Table 6). As can be seen from Table 6, the proposed method has remarkably improved the final classification accuracy for all the investigated CNN architectures, including DenseNet-201.

Discussion
As for the hardware environment, training has been performed on a computer equipped with an NVIDIA TITAN X GPU card with 7 TFlops of single precision, 336.5 GB/s of memory bandwidth, 3072 CUDA cores, and 12 GB memory. The convolutional filters of the network were found by a stochastic gradient descent algorithm iterated through 21 epochs, until the validation accuracy started to drop, as depicted in Fig. 8. The reason for this approach lies in the fact that, because of the small size of the dataset, a variant of GoogLeNet Inception-v3 pre-trained on ImageNet [13] has been considered, and its layers have been fine-tuned separately for the binary and 7-class tasks. Then, these pre-trained branches were fed into our assisted training architecture. This procedure explains why the learning curve starts at a relatively high level in Fig. 8, then drops a bit before the network learns the supporting phenomenon. The mini-batch size has been set to 100 and the learning rate to 0.0001; other learning rates have also been tested, but the iterations finished earlier with lower accuracy.
As loss function we have selected cross-entropy, a common choice for multi-class problems. Our framework, which supports finer classification with a coarser one, could be further tuned for the specific task. Hyperparameter optimization is one possibility to increase accuracy, which, beyond the classic CNN parameters (stride, padding, number of layers/filters, dropout level, etc.), may include a transformation of the simple approach of multiplying the multi-class confidences with the corresponding binary ones. Moreover, ensemble-based systems are regularly reported to raise classification accuracy (see e.g. [15] for the same field). In our model, this approach could be realized by including more, either binary or multi-class, CNN components and selecting aggregation rules according to their outcomes and the appropriate way of providing the assistance by the binary classifier(s).
Since the original task was a 7-class classification one, we have not focused on the possible improvement of the binary classification outcome by merging the corresponding classes as described in Section 2.1. However, notice that it is possible to revert the direction of the support and let the 7-class branch assist the binary one. As for technical realization, this could be achieved by multiplying the p_NEG and p_POS probabilities with the normalized corresponding 7-class ones, so that backpropagation can take place.
For the sake of an exhaustive analysis, we have also extracted the 7-class branch from the STL layer; the corresponding performance values are enclosed in Table 7. With this analysis we could check (see Tables 2 and 7) whether the 7-class branch itself was able to improve during the assisted training as a standalone classifier. On the other hand, the whole architecture has naturally outperformed its 7-class branch regarding the classification task (see Tables 5 and 7).

Conclusion
In this paper, we have proposed a deep convolutional neural network architecture that supports multi-class classification by including the more reliable binary classification outcome in the final class probabilities. To realize this idea, we have trained the same CNN architecture (GoogLeNet Inception-v3) for both a binary and a multi-class task simultaneously, merging their softmax outputs in a support training layer by multiplying the multi-class confidences with the corresponding binary ones. In this way, we have achieved a remarkable improvement in a 7-class classification problem regarding skin lesions.
There is a natural limitation of our approach, namely when the classes cannot be merged directly into a smaller number of groups. However, this issue can be addressed by applying some unsupervised technique like k-means clustering. This approach can lead to further generalizations by assigning dedicated branches in our ensemble for each recommended number of clusters, determined e.g. by the elbow method in k-means. Then, we can optimize assisted learning for the number of clusters corresponding to the specific task. Such ensembles consisting of several branches of CNNs can be optimized further by including a penalization term in the loss function for coinciding labelings, to make the members more diverse while keeping up overall classification accuracy.

Author contributions
Balazs Harangi: Conception and design of study, drafting the manuscript, approval of the final version of the manuscript. Agnes Baran: Conception and design of study, drafting the manuscript, approval of the final version of the manuscript. Andras Hajdu: Conception and design of study, revising the manuscript critically for important intellectual content, approval of the final version of the manuscript.