Classification and Research of Skin Lesions Based on Machine Learning

Abstract: Classification of skin lesions is a complex identification challenge. Because skin lesions are highly varied, doctors must spend considerable time and effort judging lesion images magnified by dermatoscopy. Diagnosis in which an algorithm for identifying pathological images assists doctors is attracting more and more attention. With the development of deep learning, the field of image recognition has made substantial progress, and convolutional neural network models now recognize images better than traditional image recognition techniques. In this work, we classify seven kinds of lesion images using various deep learning models and methods. Common convolutional neural network models in the field of image classification include ResNet, DenseNet and SENet. We fine-tune these models with a multi-layer perceptron head, apply data expansion based on multiple cropping on the validation set and test set, and use an ensemble of five models to produce the final results. The experimental results show that the approach improves the sensitivity of skin lesion diagnosis.


Introduction
Skin lesions are a major threat to people's health, with more than 5 million cases occurring every year. There are many kinds of skin lesions; common ones include psoriasis, eczema, vitiligo and melanoma, and some of them, such as melanoma, are particularly fatal [Pathan, Prabhu and Siddalingaswamy (2018)]. If the doctor can detect early which skin lesion the patient is suffering from, and whether it is benign or malignant, follow-up treatment benefits greatly. In the early stage of some skin diseases, the lesions are small and therefore difficult for doctors to observe. With the development of medical technology, a visual inspection technique has become available that makes observation easier: dermatoscopy. Dermatoscopy not only magnifies the skin lesion [Tschandl and Wiesner (2018)], but also eliminates some light interference. For doctors, the probability of misdiagnosis is reduced, and the accuracy of diagnosing the type of skin lesion is improved to some extent. In actual diagnosis, dermatoscopy images are often examined by multiple doctors to determine the type of skin lesion, and the doctors spend a lot of time and energy on diagnosis. Algorithms that assist doctors are also beginning to appear [Rebouças Filho, Peixoto, da Nóbrega et al. (2018)].

Related works
Traditional image classification usually follows a general pipeline of image preprocessing, segmentation, feature extraction and classification. Manerkar et al. [Manerkar, Snekhalatha, Harsh et al. (2016)] use C-means and the watershed algorithm to segment skin lesion images, then use the gray level co-occurrence matrix to extract features, and finally select a Support Vector Machine as the classifier. Nezhadian et al. [Nezhadian and Rashidi (2017)] extract texture features from melanoma lesion images and combine them with a Support Vector Machine classifier to distinguish benign from malignant melanoma. George et al. [George, Aldeen and Garnavi (2017)] carry out five color space transformations on psoriasis lesion data and select the voting result as the final diagnosis. To improve performance, Albay et al. [Albay and Kamaşak (2015)] extract Fourier features of the lesion boundary after segmenting the skin lesion and use them as classification features. The traditional feature extraction methods have achieved good results. However, they still cannot exceed the level of human recognition. In recent years, with the rapid development of deep learning, more and more new models have emerged, breaking one record after another. Using deep learning models to train on image data has become the first choice in many studies. Rajesh [Rajesh (2017)] uses a feed-forward neural network combined with the ABCD rules to classify melanoma and benign skin lesions. Islam et al. [Islam, Gallardo-Alvarado, Abu et al. (2017)] aim to identify eczema, impetigo and psoriasis; after data preprocessing and feature extraction, an artificial neural network (ANN) is used to classify the data, achieving faster diagnosis and recognition than human doctors. Ge et al. [Ge, Demyanov, Bozorgtabar et al. (2017)] apply different deep neural network models to skin data, adopt the bilinear pooling technique for some models, and finally use a Support Vector Machine as the classifier to achieve better results.

Organization
This paper is organized as follows: In Section 2.1, we discuss the skin lesion dataset and its categories. In Section 2.2, we describe the hardware setup and pursue our classification goals in three steps: model selection, multiple cropping and ensembling. In Section 2.3, we show our experimental results. In Section 3, we summarize the paper and give an outlook.

Data preparation
Our study is based on the ISIC 2018 grand challenge dataset, Skin Lesion Analysis Towards Melanoma Detection [Codella, Gutman, Celebi et al. (2018); Tschandl, Rosendahl and Kittler (2018)], which includes seven types of skin lesions. Fig. 1 shows the seven different skin lesions. There are 10015 images in the entire data set, all of them 3-channel color images with a uniform size of 450×600. Tab. 1 lists the seven types in the data set and the corresponding number of pictures. The data volume of the classes is extremely unbalanced, which makes classification more challenging: the class with the most data has 58 times as many images as the class with the least. Such large differences can strongly affect model training, and this is one of the problems we want to solve.

Methods
We randomly divide the data set into a training set, a validation set and a test set in the ratio 8:1:1. The images in the original data set are 450×600. Our network is a classification model pre-trained on ImageNet, and we train it with a batch size of 20 on 4 TITAN X GPUs. For data augmentation, we use random horizontal/vertical flipping, normalization, and random 224×224 crops from the original 450×600 image. Our learning rate is 0.0001, and the whole training runs for 150 epochs. Considering the huge difference in size between the 7 classes in the dataset, we choose a class-weighted loss for the unbalanced classes. Eq. (1) represents this method, where $\omega_i$ represents the optimization weight of class $i$ and $x_i$ represents the number of samples in class $i$; we then use softmax for classification. During training, we save the model parameters with the best sensitivity, and we load them when evaluating on the validation set and test set in order to obtain higher sensitivity. Fig. 2 shows the whole process of the algorithm.
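The class-weighted loss can be illustrated with a minimal sketch. The weighting rule used here (inverse class frequency, $\omega_i = \text{total} / (7 \cdot x_i)$) is an assumption chosen for illustration, since the exact form of Eq. (1) is not reproduced in this text. The class counts are the publicly reported per-class totals for the ISIC 2018 classification data and are also an assumption; Tab. 1 remains the authoritative source.

```python
import math

# Assumed per-class image counts (publicly reported ISIC 2018 totals);
# Tab. 1 of the paper is the authoritative source.
CLASS_COUNTS = {
    "MEL": 1113, "NV": 6705, "BCC": 514, "AKIEC": 327,
    "BKL": 1099, "DF": 115, "VASC": 142,
}


def inverse_frequency_weights(counts):
    """Assumed weighting rule: w_i = total / (num_classes * x_i)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * x) for name, x in counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_cross_entropy(logits, target_index, weights):
    """Loss for one sample: -w[target] * log(softmax(logits)[target])."""
    probs = softmax(logits)
    return -weights[target_index] * math.log(probs[target_index])


weights_by_name = inverse_frequency_weights(CLASS_COUNTS)
weight_list = list(weights_by_name.values())  # same order as CLASS_COUNTS
```

With these weights, a misclassified sample from the rarest class contributes about 58 times more loss than one from the largest class, which mirrors the imbalance ratio quoted in the data description.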

Figure 2: Algorithm process diagram
In the course of the research, we use the following three steps.
Step 1: Model Selection and Fine-tuning. Deep learning uses convolutional neural network (CNN) models to identify images, and many excellent models have emerged as the field has developed. We briefly introduce several models and the reasons why we choose them. Firstly, the ResNet [He, Zhang, Ren et al. (2016)] model was proposed when image recognition with deep learning had hit a bottleneck; it introduces the shortcut connection and builds it into the model, which solves the problem that accuracy decreases as the network gets deeper. Secondly, the DenseNet [Huang, Liu, Van Der Maaten et al. (2017)] model builds on the concept of ResNet. The difference is that DenseNet connects each layer to all preceding layers through shortcut connections along the channel dimension; this structure reduces the number of parameters and computations and improves resistance to over-fitting. Next, the SENet [Hu, Shen and Sun (2017)] model proposes a new block structure named the SE block, which uses a squeeze operation to compress each feature map and an excitation operation to capture dependencies between feature channels; the SE block can be combined with many existing models. The ResNeXt [Xie, Girshick, Dollár et al. (2017)] model does not gradually deepen the network like other models. Instead, it divides one large filter into a corresponding number of small filters, which reduces the number of hyperparameters and improves accuracy. In this paper, we use the ResNeXt model combined with the SE block. Finally, the DPN [Chen, Li, Xiao et al. (2017)] model combines the characteristics of the ResNeXt and DenseNet models; its improvement in accuracy is not particularly large, but it mainly reduces the parameter count and computational overhead.
We select the above five models, which have performed well in the classification field [Cui, McIntosh and Sun (2018)] in recent years, because they are deep and have enough parameters to learn image features more fully. In addition, some models can reduce the influence of over-fitting and under-fitting, and some can drop less important parameters to reduce the amount of computation. Tab. 2 shows the performance of these models on ImageNet. During fine-tuning, we take the ResNet101 model as an example and compare learning rates and optimizers, as Fig. 3 shows. We finally choose a learning rate of 0.0001 and the Adam optimizer. In addition, we use zero-initialization for the ResNet models: the last BatchNorm layer in each residual branch is zero-initialized, which slightly increases the accuracy and sensitivity of the model. We also make a small change to the model structure. Fig. 4 shows a ResNet50 structure commonly used in classification, together with the change of the fully connected layer into a multilayer perceptron (MLP) structure for fine-tuning.
Step 2: Multiple Cropping for Validation and Test set.
On the validation set and test set, we use a data expansion method based on multiple cropping: we cut N crops at different positions around the center of the original image and average the N results to obtain the final prediction. As Fig. 5 shows, we compare no multiple cropping with N = 16, 36 and 64, and N = 36 performs best. With this method, the various indicators improve considerably compared with those obtained from a single random crop. By cutting each image in the validation set and test set around the center into multiple copies, 36 images at different positions are obtained. Multiple cropping reduces the randomness of a single crop, and recognition becomes more accurate for some lesion images. With this method, we obtain various metrics for different models on the validation set and test set and calculate the mean. The results of DenseNet121, DenseNet169, DenseNet201, ResNet50, ResNet101, ResNet152, SENet154, SE-ResNeXt101, DPN68b and ResNet101MLP are obtained in turn, as Tab. 3 shows. Among these models, the best overall model is SENet154, and the model with the best sensitivity is ResNet101 with the MLP structure. However, we want good overall results and higher sensitivity at the same time, so we use a model ensemble to combine the advantages of each model.
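The multiple-cropping evaluation described in Step 2 can be sketched as follows. The crop layout used here, an evenly spaced $\sqrt{N} \times \sqrt{N}$ grid of 224×224 windows over the 450×600 image, is an assumption made for illustration; the paper's exact crop offsets are not reproduced in this text. The function names are hypothetical.

```python
import math


def grid_crop_boxes(height, width, crop, n):
    """Return n crop boxes (top, left, bottom, right) on a sqrt(n) x sqrt(n) grid.

    Assumption: crop positions are evenly spaced over the image; the exact
    offsets used in the paper are not known.
    """
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    if crop > height or crop > width:
        raise ValueError("crop is larger than the image")

    def offsets(extent):
        slack = extent - crop  # how far the window can slide along this axis
        if side == 1:
            return [slack // 2]  # single centered crop
        return [round(i * slack / (side - 1)) for i in range(side)]

    return [(top, left, top + crop, left + crop)
            for top in offsets(height) for left in offsets(width)]


def average_predictions(prob_vectors):
    """Average class-probability vectors predicted for the N crops."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


boxes = grid_crop_boxes(450, 600, 224, 36)
```

Each box would be cropped from the image, passed through the network, and the resulting probability vectors combined with `average_predictions`, so that one image yields one averaged prediction.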
Step 3: Grid Search and Support Vector Machine classifier. For the results on the validation set, we adopt an ensemble approach combined with a Bayesian optimization search strategy. From the above models, we select the following five for the ensemble: DenseNet201, SENet154, SE-ResNeXt101, DPN68b and ResNet101MLP. The output vectors are extracted before the softmax module; we concatenate the vectors that the validation set produces through these five models and train a Support Vector Machine or RandomForest classifier on them. Common methods for automatic hyperparameter search include grid search, random search and Bayesian optimization search. The first two are not efficient, so we adopt Bayesian optimization search; Algorithm 1 outlines its implementation.
In the pseudocode of Bayesian optimization, $f$ is the unknown objective function, $X$ is the input data, $S$ is the acquisition function, and $M$ is the model assumed for the input data. We first obtain an initial data set from the input data and then loop $T$ times to select parameters. The model we choose is a Gaussian process, which is determined by its mean $\mu$ and covariance $K(x, x_*)$ and is written $f \sim GP(\mu, K)$. When the Gaussian process is used as a prior for Bayesian inference, the posterior can be used to predict new data. We suppose $y$ is the function value known from the training data, $y_*$ is the function value at the test input $x_*$, $\mu$ is the mean of the training set, $\mu_*$ is the mean of the test set, $\Sigma_*$ is the covariance of the training set, and $\Sigma_{**}$ is the covariance of the test set.
The Gaussian process extends the multivariate Gaussian distribution to infinite dimension, so a training set $y$ can be represented as a sample taken from a multivariate Gaussian distribution: $y = [y_1, y_2, \ldots, y_n]^T$. We set the mean of the Gaussian process to 0, and the most common choice for the covariance is the squared exponential, see Eq. (6). Due to the existence of noise, we express the formula as Eqs. (7), (8), where $\delta(x, x')$ is the Kronecker delta function. In addition to calculating the covariance of the training set $K$, see Eq. (9), we also need to calculate the covariance between the new independent variable and the training set independent variables $K_*$, Eq. (10), and the covariance of the new independent variable $K_{**}$, Eq. (11). Then Eq. (12) shows the relationship.
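For reference, the standard Gaussian process regression expressions that are consistent with the definitions above can be written as follows. This is a sketch under assumptions: the noise standard deviation symbol $\sigma_n$, the scalar-input form of the kernel, and the mapping of these lines onto Eqs. (6) to (12) are not taken from the paper.

```latex
\begin{align}
k(x, x') &= \sigma_f^2 \exp\!\left(-\frac{(x - x')^2}{2 l^2}\right)
  && \text{squared exponential} \\
y &= f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)
  && \text{noisy observation} \\
k(x, x') &= \sigma_f^2 \exp\!\left(-\frac{(x - x')^2}{2 l^2}\right)
  + \sigma_n^2\, \delta(x, x')
  && \text{kernel with noise} \\
K &= \begin{bmatrix}
  k(x_1, x_1) & \cdots & k(x_1, x_n) \\
  \vdots & \ddots & \vdots \\
  k(x_n, x_1) & \cdots & k(x_n, x_n)
\end{bmatrix} \\
K_* &= \begin{bmatrix} k(x_*, x_1) & \cdots & k(x_*, x_n) \end{bmatrix} \\
K_{**} &= k(x_*, x_*) \\
\begin{bmatrix} y \\ y_* \end{bmatrix} &\sim
\mathcal{N}\!\left(0,\;
\begin{bmatrix} K & K_*^T \\ K_* & K_{**} \end{bmatrix}\right)
\end{align}
```

Here $\sigma_f^2$ and $l$ are the signal variance and length scale that make up the hyperparameter vector $\theta$.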
The training set obeys a multidimensional normal distribution. According to $K$, the posterior probability of the test value $y_*$ is $p(y_* \mid y) \sim N(\mu, \sigma^2)$, where the mean $\mu$ and variance $\sigma^2$ of $y_*$ are expressed in Eq. (13) and Eq. (14). The hyperparameter to be determined is $\theta = [\sigma_f^2, l]$; since the training set obeys a multidimensional normal distribution, the likelihood function is Eq. (15). Bayesian optimization maps $x$ to the real space $R$ through the acquisition function, which indicates the probability that the objective function value at the point can be larger than the current optimal value. Two main types of acquisition function are commonly used. The first is the probability of improvement (POI), see Eq. (16), where $f(X)$ is the value of the objective function at $X$, $f(X^+)$ is the best objective function value so far, $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation of the objective function obtained by the Gaussian process, respectively, and $\xi$ is a trade-off factor that adjusts how strongly points around $X^+$ are favored. In general, we use the Monte Carlo simulation method to find the $X$ that maximizes $POI(X)$.
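The probability of improvement can be sketched in a few lines. The closed form used here, $\Phi\big((\mu(x) - f(X^+) - \xi)/\sigma(x)\big)$, is the standard expression and is assumed to match Eq. (16); `posterior` is a hypothetical stand-in for the Gaussian process prediction, and the default value of `xi` is arbitrary.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best, xi=0.01):
    """P(f(x) >= best + xi) when f(x) ~ N(mu, sigma^2).

    sigma is the posterior standard deviation at x.
    """
    if sigma <= 0.0:
        # Degenerate posterior: the outcome is certain.
        return 1.0 if mu >= best + xi else 0.0
    return normal_cdf((mu - best - xi) / sigma)


def argmax_poi(candidates, posterior, best, xi=0.01):
    """Return the candidate with the largest POI.

    posterior(x) -> (mu, sigma) is a hypothetical surrogate model.
    """
    return max(
        candidates,
        key=lambda x: probability_of_improvement(*posterior(x), best, xi),
    )
```

In a Monte Carlo search, `candidates` would be points sampled at random from the hyperparameter space, and the point returned by `argmax_poi` would be evaluated next.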
The second is expected improvement (EI). The POI is a probability, so it only considers the probability that $f(x)$ is larger than $f(x^+)$, whereas expected improvement is an expectation, so it considers by how much $f(x)$ is larger than $f(x^+)$. We obtain $x$ from Eq. (17), where $D_t$ is the first $t$ samples. Under the premise of a normal distribution, we obtain Eq. (18).
For the classifier model, we use SVM and RandomForest. The SVM configuration is as follows: we use 10-fold cross-validation, we apply class-balanced weighting for the unbalanced data set, and we use sensitivity as the main evaluation indicator. The search parameters are C and kernel, with the following search ranges: C=[0.1, 1000], kernel=['linear', 'poly', 'rbf', 'sigmoid']. The SVM classifier is controlled mainly by Eq. (19). By duality, this formula is equivalent to Eq. (20), where $e$ is the vector of all ones and $Q$ is an $n$ by $n$ positive semidefinite matrix. The validation set vectors are implicitly mapped into a higher dimensional space by the function $\varphi$. The decision function is Eq. (21), where $C > 0$ is the upper bound; we vary its value to get the best results.
Next, we consider the random forest model. The random forest model combines bagging with decision trees and creates multiple subtrees by splitting on features. The difference is that decision trees usually generate nodes and rules by calculating the information gain and the Gini index, whereas random forests choose features at random. Deeper decision trees tend to over-fit, while random forests prevent this in most cases by creating random subsets of features and using them to build smaller trees, which then form the subtrees. In this experiment, we search the following hyperparameter ranges: N (the number of trees in the forest)=[10, 500], MinSamplesSplit (the minimum number of samples required to split an internal node)=[2, 100], MaxFeatures (the number of features to consider when looking for the best split)=[0.1, 0.999], MaxDepth (the maximum depth of the tree)=[5, 80]. Bayesian optimization combined with the random forest model is used to find the best sensitivity value.
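One proposal step of the search over the random forest ranges can be sketched with the expected improvement acquisition function. This is an illustrative sketch under assumptions: the closed form of EI is the standard one under a normal posterior and is assumed to match Eq. (18), the parameter names follow scikit-learn conventions rather than the paper, and `surrogate` is a hypothetical function standing in for the fitted Gaussian process.

```python
import math
import random

# Search ranges quoted in the text; the key names are assumed
# (scikit-learn style), not taken from the paper.
RF_SPACE = {
    "n_estimators": (10, 500),       # N
    "min_samples_split": (2, 100),   # MinSamplesSplit
    "max_features": (0.1, 0.999),    # MaxFeatures
    "max_depth": (5, 80),            # MaxDepth
}
INTEGER_PARAMS = {"n_estimators", "min_samples_split", "max_depth"}


def sample_config(rng):
    """Draw one random configuration from RF_SPACE."""
    config = {}
    for name, (low, high) in RF_SPACE.items():
        if name in INTEGER_PARAMS:
            config[name] = rng.randint(low, high)
        else:
            config[name] = rng.uniform(low, high)
    return config


def expected_improvement(mu, sigma, best):
    """EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def propose_next(surrogate, best, n_candidates=200, seed=0):
    """Pick the sampled configuration with the largest EI.

    surrogate(config) -> (mu, sigma) is a hypothetical posterior over
    validation sensitivity.
    """
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(n_candidates)]
    return max(candidates,
               key=lambda c: expected_improvement(*surrogate(c), best))
```

In a full loop, the proposed configuration would be trained and scored with cross-validated sensitivity, the surrogate would be refitted with the new observation, and the step would repeat for the chosen number of iterations.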

Results
Using Bayesian optimization to search the SVM hyperparameters C and kernel, the best-performing setting is C=10 and kernel='rbf'. With this SVM classifier, the test set results are SEN=0.834, ACC=0.842, F1=0.837, SPE=0.966, AUC=0.900. For the random forest classifier, the best-performing setting is N=255, MaxFeatures=0.1385, MaxDepth=75 and MinSamplesSplit=100, and the test set results are SEN=0.846, ACC=0.840, F1=0.833, SPE=0.968, AUC=0.907. Comparing the two classifiers, we select the random forest classifier, which has the better sensitivity, as the final result. The confusion matrix is shown below:
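As a side note on how metrics such as SEN, SPE, F1 and ACC relate to a confusion matrix, the following sketch computes them for a small made-up 3-class matrix. The matrix is purely illustrative and is not the paper's confusion matrix. Macro averaging over classes is an assumption, since the averaging scheme is not stated in the text.

```python
def per_class_counts(cm, k):
    """Return (tp, fn, fp, tn) for class k; rows are true, columns predicted."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp, fn, fp, tn


def macro_metrics(cm):
    """Macro-averaged sensitivity, specificity and F1, plus overall accuracy."""
    n = len(cm)
    sens, spec, f1s = [], [], []
    for k in range(n):
        tp, fn, fp, tn = per_class_counts(cm, k)
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    total = sum(map(sum, cm))
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return {
        "SEN": sum(sens) / n,
        "SPE": sum(spec) / n,
        "F1": sum(f1s) / n,
        "ACC": accuracy,
    }


# Made-up example matrix, NOT the paper's results.
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = macro_metrics(toy_cm)
```

Because sensitivity is averaged per class, a rare class counts as much as a common one, which is why it is a natural selection criterion for unbalanced data.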

Discussion and conclusion
In this work, we have classified unbalanced skin lesion data. We finally choose an ensemble of multiple models. During training, we use a weighted loss for the unbalanced data, which greatly improves correct classification by the models. On the validation and test sets, we use multiple cropping: each 450×600 image is cropped around the center into multiple 224×224 copies whose predictions are averaged, and the evaluation results of the various metrics improve noticeably. Finally, we concatenate the vectors that the validation set produces through multiple models and train the classifier with the best sensitivity by searching hyperparameters with Bayesian optimization. The classification of skin lesions deserves further study, and there are still many shortcomings in our work. In the future, we will do further research to improve our classification methods, and we are committed to helping doctors reduce the fatigue caused by diagnosis.