Machine Learning Approaches for Skin Neoplasm Diagnosis

Approaches for skin neoplasm diagnosis include physical exam, skin biopsy, lab tests of biopsy samples, and image analyses. These approaches often involve error-prone and time-consuming processes. Recent studies suggest that machine learning can effectively classify skin images into categories such as melanoma and melanocytic nevi. In this work, we investigate machine learning approaches to enhance the performance of computer-aided diagnosis (CADx) systems for skin diseases. In the proposed CADx system, a generative adversarial network (GAN) discriminator is used to identify (and remove) fake images. Exploratory data analysis (EDA) is applied to normalize the original data set to prevent model overfitting. The synthetic minority oversampling technique (SMOTE) is employed to rectify class imbalances in the original data set. To accurately classify skin images, the following machine learning models are utilized: linear discriminant analysis (LDA), support vector machine (SVM), convolutional neural network (CNN), and an ensemble CNN–SVM. Experimental results using the HAM10000 data set demonstrate the ability of the machine learning models to improve CADx performance in diagnosing skin neoplasm. Initially, the LDA, SVM, CNN, and ensemble CNN–SVM show 49%, 72%, 77%, and 79% accuracy, respectively. After applying GAN (discriminator) and SMOTE, the LDA, SVM, CNN, and ensemble CNN–SVM show 76%, 83%, 87%, and 94% accuracy, respectively. We plan to explore other machine learning models and data sets in future work.


INTRODUCTION
One of the primary applications of classification techniques is in computer-aided diagnostic (CADx) systems.4,5 These systems typically involve training algorithms on large data sets of annotated medical images to learn the characteristic features associated with different types of skin lesions.6,7 Machine learning-based CADx systems offer several advantages including early detection, improved accuracy, and personalized medicine. It should be noted that machine learning-based approaches also face challenges: the need for large and diverse training data sets, potential biases in data collection and annotation, interpretability issues with complex models, and regulatory considerations regarding clinical validation and deployment. Despite these implementation challenges, machine learning holds great promise for the diagnosis of skin neoplasms.
However, these traditional approaches are often time-consuming and expensive. To overcome these limitations, CADx systems have emerged as valuable tools for assisting in the detection of skin malignancies.−19 False positives can lead to unnecessary biopsies with many follow-up procedures, and false negatives can lead to delays in treatment, making the disease more difficult to treat. Recent studies suggest that CADx systems have demonstrated notable advancements in enhancing diagnostic accuracy by applying various machine learning concepts and techniques.
This work aims to investigate contemporary popular machine learning approaches to CADx systems to better identify dermatological abnormalities in skin images and to reduce the number of false negative and false positive readouts. This study involves the use of various classification algorithms, including LDA, SVM, CNN, and an ensemble CNN−SVM model. To enhance CADx performance, fake images are identified (and removed) using a GAN. The original data set is normalized using EDA to avoid model overfitting. Class imbalances in the original data set are rectified using SMOTE.
This manuscript is organized as follows. Related work is reviewed in section 2. The proposed CADx system with machine learning techniques is described in section 3. Experimental details are presented in section 4. Experimental results are discussed in section 5.

RELATED WORK
2.1. Conventional CADx Systems. Conventional CADx systems utilize computer algorithms to aid healthcare professionals in diagnosing diseases.4,20,21 These systems operate by furnishing clinicians with information regarding potential disease presence and by scrutinizing data derived from medical imaging modalities such as X-rays, magnetic resonance imaging (MRI), and computed tomography (CT) scans. Recognized as valuable tools, conventional CADx systems enhance the accuracy and efficiency of medical diagnoses.22,23 They follow a series of steps to analyze medical images, encompassing preprocessing, segmentation, feature extraction, feature selection, and classification, as illustrated in Figure 1.
The preprocessing step aims to improve the quality of the medical image by applying various techniques such as noise removal, geometric transformation, cropping, resizing, and adjusting the color balance of the images.24,25 The preprocessed image proceeds to the segmentation step. The segmentation step aims to identify and isolate areas of interest within the image.26−28 Figure 2 shows an example of noise removal in the preprocessing stage and thresholding in the segmentation stage.
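As a rough illustration of the noise removal and thresholding steps described above, the following sketch applies a 3 × 3 median filter and Otsu's threshold to a synthetic grayscale image using NumPy. The filter size, the choice of Otsu's method, and the synthetic image are assumptions made for illustration; the text does not state which filter or thresholding rule produced Figure 2.

```python
import numpy as np

def median_filter3(gray):
    """3x3 median filter with edge padding (simple noise removal)."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the nine shifted views of the image and take the per-pixel median.
    windows = [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0)

def otsu_threshold(gray, bins=256):
    """Return the bin center that maximizes the between-class variance."""
    hist, edges = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(prob)                 # weight of the darker class
    w1 = 1.0 - w0                        # weight of the brighter class
    m0 = np.cumsum(prob * centers)       # cumulative (unnormalized) mean
    total_mean = m0[-1]
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros_like(w0)
    between[valid] = (total_mean * w0[valid] - m0[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(between)]

# Synthetic example (hypothetical data): a dark square "lesion" on brighter skin.
rng = np.random.default_rng(0)
image = np.full((32, 32), 0.8)
image[8:24, 8:24] = 0.2
noisy = image.copy()
noisy[rng.random(image.shape) < 0.02] = 1.0   # sparse salt noise

denoised = median_filter3(noisy)              # preprocessing
threshold = otsu_threshold(denoised)          # segmentation threshold
mask = denoised <= threshold                  # True where the lesion is
```

The resulting binary mask plays the role of panel (c) in Figure 2; a production system would typically rely on an image processing library rather than these hand-written helpers.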
Feature extraction and selection, two important steps, are usually grouped in CADx systems.−31 The goal of feature extraction is to minimize the dimensionality of the data so that it can be processed more easily and quickly while maintaining as much information as possible.−34 The purpose of feature selection is to minimize the dimensionality of the feature space and remove redundant characteristics that might affect CADx system performance. The important features (also known as regions of interest) are used as input in the classification step.
The choice of a classification method for a CADx system, from the three major methods shown in Table 1, depends on the trade-offs between accuracy, complexity, and interpretability, as well as the specific requirements of the application at hand. Rule-based algorithms rely on explicit rules and conditions defined by experts. Statistical methods extract quantitative features from images. Machine learning models learn patterns from labeled data to make predictions.
In our work, we aim to improve the performance of CADx systems by removing fake images, resampling data, and using an ensemble of CNN and SVM models.

Data Set for Studying Skin Neoplasm.
In this work, we use the Human Against Machine 10000 (HAM10000) data set, a popular one for skin neoplasm study.−38 The HAM10000 data set has more than 10 000 training images for detection of pigmented skin lesions in the following seven classes. Melanoma (MEL) is considered the most serious and potentially life-threatening type of skin cancer, which develops from pigment-producing cells. Basal cell carcinoma (BCC) is the most common type of skin cancer and typically appears as a waxy bump or lesion on the skin. Vascular lesion (VAS) covers a variety of skin conditions caused by abnormal blood vessels, including birthmarks and hemangiomas. Actinic keratoses (AKIEC) is a precancerous lesion that appears as a scaly or crusty growth and is typically caused by sun damage. Benign keratosis-like lesions (BKL) represent benign skin growths that resemble actinic keratoses but have different characteristics. Dermatofibroma (DF) is a benign skin lesion that appears as a firm, round bump and is typically brown or reddish-brown. Melanocytic nevi (NV) are usually benign and are commonly known as moles. It should be noted that there are no fake images in the HAM10000 data set.
[Table 1, recovered rows: (label missing) typically used for applications where accuracy is important, but the system needs to be easy to understand and comprehend; machine learning: typically used for applications where accuracy is critical, and the system can be complex.]

Ideal Size of Images.
The optimal size of images for machine learning algorithms depends on several factors, including the complexity of the model, the computing power at hand, the size and complexity of the data set, and the application's specific requirements. Larger images generally contain more detailed information, providing the potential for improved feature extraction and better representation of the image characteristics.39,40 However, employing larger images requires more memory and computing power to process and analyze the larger amount of data in each image. Smaller images demand less memory and processing power, making them more suitable for systems with limited resources. However, reducing the image size may lead to the loss of some crucial details or characteristics of the skin lesions, potentially affecting the classification accuracy.39,41 To determine the optimal image size, an empirical approach is often adopted. Researchers and developers usually experiment with training and testing the machine learning classification model on images of varying sizes, monitoring performance metrics to determine which image size produces the best results. This empirical exploration allows for a data-driven decision on the ideal image size that balances accuracy, computational efficiency, and memory requirements for the CADx system and data set.
In this study, we resize the images from 600 × 450 pixels to 64 × 64 pixels. This adjustment is made to expedite training and mitigate the risk of overfitting.
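The effect of this resizing can be sketched as follows. The interpolation method is not stated in the text, so nearest-neighbor sampling in NumPy is assumed here purely for illustration, and the blank input array stands in for a real dermatoscopic image.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Resize an (H, W, C) image by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = image.shape[:2]
    rows = (np.arange(out_h) * in_h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * in_w) // out_w   # source column for each output column
    return image[rows[:, None], cols[None, :]]

# Placeholder image with the original size: 450 rows (height) x 600 columns (width), RGB.
original = np.zeros((450, 600, 3), dtype=np.uint8)
small = resize_nearest(original, 64, 64)

# Ratio of pixel counts before and after resizing.
pixel_ratio = (original.shape[0] * original.shape[1]) / (small.shape[0] * small.shape[1])
```

Under these assumptions each resized image holds roughly 66 times fewer pixels than the original, which is the source of the training speedup. Mapping a 4:3 image to a square also changes its aspect ratio unless the image is cropped or padded first.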

Resampling.
Resampling is used to generate a more representative sample of the population, improve the performance of a model, or balance an unbalanced data set. Resampling is a statistical approach that randomly extracts samples from a data set to build a new data set with fewer or more samples that have the same distribution as the original data set.42,43 Resampling is particularly useful when dealing with imbalanced data sets, where one class significantly outnumbers the others, leading to potential biases in the model's predictions. Table 2 shows the differences between the two types of resampling techniques.
Resampling can help machine learning models increase their accuracy, especially when the data are imbalanced. However, resampling approaches must be used with caution since they might introduce bias and overfitting into the model. To make sure that the resampling procedure does not yield overly optimistic findings or degrade the model's generalizability, proper validation and evaluation of the model performance are required. Resampling offers a practical way to deal with class imbalance and enhance the effectiveness of machine learning models in CADx systems. By utilizing either an undersampling or an oversampling approach, the model can be better equipped to detect patterns and make accurate predictions for both majority and minority classes, ultimately contributing to more robust and reliable skin disease detection and diagnosis.
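The two resampling directions can be illustrated with a minimal random resampler. This is a sketch under stated assumptions rather than the procedure used in this work: the function name, the target count, and the toy data are hypothetical.

```python
import numpy as np

def random_resample(features, labels, target_per_class, seed=0):
    """Bring every class to target_per_class samples by random resampling."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Undersample without replacement when the class is large enough;
        # oversample with replacement (duplicates) when it is too small.
        replace = idx.size < target_per_class
        chosen.append(rng.choice(idx, size=target_per_class, replace=replace))
    chosen = np.concatenate(chosen)
    return features[chosen], labels[chosen]

# Toy imbalanced data (hypothetical): 90 majority samples and 10 minority samples.
rng = np.random.default_rng(1)
features = rng.normal(size=(100, 4))
labels = np.array([0] * 90 + [1] * 10)

balanced_x, balanced_y = random_resample(features, labels, target_per_class=50)
```

Because random oversampling only duplicates existing samples, it adds no new variation to the minority class.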
We have applied SMOTE 44−46 to improve the performance of CADx systems by addressing the class imbalance issues. SMOTE is chosen over random oversampling techniques because those techniques may not introduce enough variation, especially if the minority class is significantly underrepresented. SMOTE provides a more sophisticated way to generate synthetic samples by creating new data points in the feature space, which can lead to better model performance. SMOTE does not generate realistic images; instead, it creates new samples based on extracted image features. For the feature vectors of two skin cancer images X1 and X2, SMOTE interpolates between the vectors to generate a new synthetic sample X3. The interpolated sample X3 is derived using the formula shown in eq 1.
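Equation 1 is not reproduced in this text. Assuming the standard SMOTE interpolation rule, with diff(X1, X2) taken to be the element-wise difference X2 − X1 and λ drawn from the interval [0, 1] (both assumptions), it would read:

```latex
X_3 = X_1 + \lambda \cdot \operatorname{diff}(X_1, X_2),
\qquad \operatorname{diff}(X_1, X_2) = X_2 - X_1,
\qquad \lambda \in [0, 1]
```

Under this form, λ = 0 returns X1, λ = 1 returns X2, and intermediate values place X3 on the line segment between the two feature vectors.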
Here, the parameter λ controls the extent of interpolation. The diff function operates in the feature space derived from convolutional neural networks, and the resulting vector captures the difference between the feature representations of the two samples. It is essential to note that the synthetic sample X3 is not an actual image from the data set but rather a new feature vector created through interpolation.
SMOTE increases the representation of minority classes by generating synthetic samples, thereby offsetting the negative consequences associated with skewed data sets. The resampled data set with synthetic samples can then be used to train the machine learning model, enhancing its ability to generalize and make more accurate predictions for both majority and minority classes. SMOTE thus helps CADx systems detect underrepresented skin lesions.
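A minimal NumPy sketch of this interpolation is given below. It follows the usual SMOTE recipe of interpolating between a minority sample and one of its k nearest minority neighbors. The value of k, the random feature vectors, and the function names are assumptions for illustration, not details taken from this work.

```python
import numpy as np

def interpolate(x1, x2, lam):
    """Synthetic sample on the segment from x1 to x2; lam in [0, 1]."""
    return x1 + lam * (x2 - x1)

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic feature vectors from minority-class features."""
    rng = np.random.default_rng(seed)
    n = minority.shape[0]
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances between minority samples.
    d2 = ((minority[:, None, :] - minority[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                  # a sample is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]     # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)         # random base sample per synthetic point
    partner = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                  # interpolation factor per synthetic point
    return interpolate(minority[base], minority[partner], lam)

# Hypothetical feature vectors standing in for CNN-derived features.
rng = np.random.default_rng(2)
minority = rng.normal(size=(10, 3))
synthetic = smote(minority, n_new=25)
midpoint = interpolate(np.array([0.0, 0.0]), np.array([2.0, 4.0]), 0.5)
```

Every synthetic vector lies between two existing minority vectors, which is why SMOTE adds variation without producing a displayable image.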
2.5. GAN: Generator and Discriminator. The generator creates fake images that resemble actual images, while the discriminator distinguishes actual images from fake ones. During training, the generator learns to make more realistic images, while the discriminator learns to distinguish between actual and fake images, making it more challenging for the generator to produce convincing fakes. GAN (generator) is often used to generate fake data comparable to the original data. GAN (discriminator) is often used to filter out fake images. This filtering helps ensure that machine learning models do not learn inaccurate patterns from synthetic data, which could lead to poor performance on actual data. Nonetheless, it is critical to recognize that training GANs can be computationally costly and necessitates rigorous hyperparameter tuning. Regardless of this limitation, the use of GAN is expected to improve the dependability and effectiveness of skin cancer classification models.
2.6. Machine Learning Techniques. Machine learning classification models, including LDA, SVM, CNN, and an ensemble CNN−SVM, have been employed with CADx systems for analyzing images. LDA assumes that the features follow a normal distribution, which is commonly observed in medical image data. Furthermore, LDA seeks to enhance the separation between groups, making it particularly effective when there are clear differences between skin lesions. In addition, LDA offers a framework for classification, which can be valuable in clinical decision making. However, LDA assumes that all classes have the same covariance matrix, which may not always be true in complex medical imaging situations. SVM is very effective in handling high-dimensional features, which are common in medical image data sets. By using different kernel functions, SVMs can separate classes even when the data are not linearly separable, allowing them to capture complex relationships. Additionally, SVMs inherently incorporate the concept of margin, thus potentially leading to better generalization performance. However, SVMs can be computationally intensive, especially with large data sets, which may impact their real-time applicability in clinical settings.
CNNs are capable of handling large, high-resolution images well, making them suitable for skin analysis. CNNs excel at extracting relevant features from images, eliminating the need for manual feature engineering. This capability is crucial in medical imaging, where intricate patterns and subtle details are vital for accurate diagnosis. Additionally, CNNs are adept at capturing spatial relationships in data, allowing them to discern patterns that may be challenging for traditional machine learning models. However, CNNs demand substantial computing resources, especially when dealing with deep architectures or large data sets such as the HAM10000 data set.
Ensemble CNN−SVM models offer a powerful approach for skin cancer classification, leveraging the strengths of both models. The CNN excels in extracting features from images and capturing complex patterns necessary for accurate diagnosis, while the SVM excels at making explicit separations between classes, which is useful in situations where distinct boundaries are needed. In our ensemble model, there is no order in which the CNN and SVM operate; instead, they work in parallel, and the results are combined through a weighted sum of their outputs. This approach can achieve higher classification accuracy compared to individual models, providing redundancy and robustness against overfitting due to the different learning processes and feature extraction methods of the two models. However, implementing the ensemble model requires careful tuning and integration as well as substantial computational resources for training. Ensuring compatibility and coherence between the CNN and SVM components is crucial, and finding the right balance between their contributions can be a complex process.
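The weighted-sum combination described above can be sketched as follows. The weight of 0.6 for the CNN and the probability values are hypothetical, since the actual weights are not given in this section; the sketch also assumes that both models output per-class probabilities over the same seven classes.

```python
import numpy as np

CLASSES = ["MEL", "BCC", "VAS", "AKIEC", "BKL", "DF", "NV"]

def ensemble_predict(cnn_probs, svm_probs, cnn_weight=0.6):
    """Weighted sum of two probability matrices of shape (n_samples, n_classes)."""
    combined = cnn_weight * cnn_probs + (1.0 - cnn_weight) * svm_probs
    return combined.argmax(axis=1), combined

# Hypothetical outputs for two images; each row sums to 1.
cnn_probs = np.array([[0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.60],
                      [0.40, 0.10, 0.05, 0.05, 0.30, 0.05, 0.05]])
svm_probs = np.array([[0.10, 0.05, 0.05, 0.05, 0.05, 0.10, 0.60],
                      [0.20, 0.10, 0.05, 0.05, 0.50, 0.05, 0.05]])

labels, combined = ensemble_predict(cnn_probs, svm_probs)
predicted = [CLASSES[i] for i in labels]
```

In the second row the two models disagree (the CNN favors MEL, the SVM favors BKL), and the weighted sum resolves the disagreement in favor of BKL. This is the kind of inconsistent prediction the weighting mechanism is meant to handle.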

MACHINE LEARNING MODELS FOR CADX SYSTEMS
In this work, we introduce GAN, EDA, SMOTE, and an ensemble CNN−SVM model in CADx systems to enhance the performance of skin neoplasm diagnosis. Figure 3 illustrates the major steps of the proposed CADx system. It should be noted that the preprocessing, segmentation, feature extraction, and feature selection steps are similar to those used in typical CADx systems.

Application of GAN Discriminator.
To address concerns regarding the authenticity of the data set and enhance the reliability of the CADx system, we use GAN (discriminator) as a tool for validating the data set. The GAN (discriminator) examines the data set for any fake images. Through this GAN-based analysis, the CADx system can detect and eliminate fraudulent samples, ensuring that the model is not trained on deceptive or erroneous data. This validation procedure plays a crucial role in upholding the integrity of the data set and enhancing the CADx system's capability to accurately diagnose and classify skin lesions. Consequently, it contributes to more dependable outcomes in dermatological research. In this work, we aim to develop a methodology that can effectively detect and remove fake images in the original data set. However, the original HAM10000 data set does not contain any fake images. Therefore, GAN (generator) is used to add approximately 25% additional synthetic images to the original data set, amounting to 360 fake images for each skin lesion category.
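The filtering step can be pictured as thresholding the discriminator's output. In the sketch below, the scores are hypothetical stand-ins for the output of a trained discriminator (higher meaning more likely to be an actual image), and the 0.5 threshold is an assumption; neither is specified in this section.

```python
import numpy as np

def filter_real(images, realness_scores, threshold=0.5):
    """Keep only images whose discriminator score reaches the threshold."""
    keep = realness_scores >= threshold
    return images[keep], keep

# Ten placeholder 64 x 64 RGB images with hypothetical discriminator scores.
rng = np.random.default_rng(0)
images = rng.random((10, 64, 64, 3))
scores = np.array([0.90, 0.80, 0.20, 0.95, 0.40, 0.70, 0.10, 0.85, 0.60, 0.30])

kept_images, keep_mask = filter_real(images, scores)
```

Only the retained images would then be passed on to the EDA, resampling, and classification steps.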

Application of EDA.
To evaluate and standardize the original data set and thereby mitigate the risk of overfitting, we integrate EDA into the proposed CADx system. Beginning with a set of 10 015 images sourced from the HAM10000 data set, our approach augments the data set with approximately 25% newly generated (fake) images, resulting in a total of 12 535 images. Given that the HAM10000 data set encompasses seven distinct types, we generate 360 images for each type using a GAN (generator). The outcomes of the EDA process are depicted in Table 3, illustrating variations among the different types. Incorporating these images into the data set facilitates the customization and training of machine learning models to effectively discern and accommodate the unique characteristics and patterns exhibited by each category.

Application of SMOTE.
Correction of class imbalances within the data set is imperative for precise skin lesion classification. To address this issue, the resampling method SMOTE is utilized. SMOTE generates synthetic samples for the minority class by interpolating between existing ones, thereby balancing the data set. Table 4 shows the distribution of images within the data set after the application of the SMOTE technique. In the original HAM10000 data set, Class NV has more than 6000 images, whereas Class DF has only 115 images. Since we strive for equal representation of the classes, using fewer than 1000 images per class would significantly impact NV, which would result in the loss of valuable data and diversity.
On the other hand, using exactly 1000 images would slightly affect DF, which could introduce redundancy and potential overfitting. We experimented with different numbers (more or less than 1000) of synthetic samples and found that 1000 samples provide a good balance, resulting in reasonable accuracy. This approach enables the CADx system to learn from a broader and more representative data set, consequently mitigating overfitting and enhancing generalization capabilities.
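To make the trade-off concrete, the sketch below computes how many samples each class would gain or lose if every class were brought to 1000 images. The per-class counts are the commonly reported HAM10000 figures and are an assumption here; this section confirms only that NV exceeds 6000 and that DF has 115. Treating classes above the target as undersampled is likewise an assumption.

```python
# Commonly reported HAM10000 class counts (assumed, not taken from this section).
HAM10000_COUNTS = {"NV": 6705, "MEL": 1113, "BKL": 1099, "BCC": 514,
                   "AKIEC": 327, "VAS": 142, "DF": 115}
TARGET_PER_CLASS = 1000

def balance_plan(counts, target):
    """Return, per class, the resampling action and the number of samples involved."""
    plan = {}
    for name, n in counts.items():
        if n > target:
            plan[name] = ("undersample", n - target)   # samples to drop
        elif n < target:
            plan[name] = ("oversample", target - n)    # synthetic samples to add
        else:
            plan[name] = ("keep", 0)
    return plan

plan = balance_plan(HAM10000_COUNTS, TARGET_PER_CLASS)
total_images = sum(HAM10000_COUNTS.values())
```

With these counts, DF would need 885 synthetic samples, almost eight times its original size, while NV would lose 5705 images. This illustrates why both a much lower and a much higher target are unattractive.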

Ensemble CNN−SVM in Classification.
We introduce an ensemble CNN−SVM method in this work to attain enhanced classification accuracy of CADx systems. While SVM excels in managing high-dimensional data, CNN adeptly captures spatial features and hierarchical patterns within images. The ensemble CNN−SVM method harnesses the complementary strengths of CNN and SVM, exploiting both techniques' prowess to achieve superior generalization and resilience. By implementing these advanced classification techniques, the CADx system enhances its ability to accurately diagnose and classify various skin lesions, providing crucial support to dermatologists and facilitating more effective diagnosis of skin neoplasm.

Data Set Used.
In this study, we use 10 015 images from the HAM10000 data set (in seven types) and add approximately 25% additional synthetic images using GAN. Table 5 shows samples of the seven diagnostic categories of images within the HAM10000 data set. In Table 5, the original and fake images look similar but have differences at the pixel level. The total data set is split into 67.5% for training, 7.5% for validation, and 25% for testing.
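The data set sizes implied by these figures can be checked with a few lines of arithmetic. The sketch assumes that the split is applied to all 12 535 images and that counts are rounded to whole images; whether the split is stratified by class is not stated here.

```python
ORIGINAL_IMAGES = 10_015
NUM_CLASSES = 7
FAKE_PER_CLASS = 360

fake_images = NUM_CLASSES * FAKE_PER_CLASS          # generated images in total
total_images = ORIGINAL_IMAGES + fake_images        # original plus generated
fake_fraction = fake_images / ORIGINAL_IMAGES       # share relative to the original set

def split_counts(n, train=0.675, val=0.075):
    """Split n images into train/validation/test counts; test takes the remainder."""
    n_train = round(n * train)
    n_val = round(n * val)
    return n_train, n_val, n - n_train - n_val

n_train, n_val, n_test = split_counts(total_images)
```

Under these assumptions the 2520 generated images amount to about 25.2% of the original set, and the split yields 8461 training, 940 validation, and 3134 test images.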

Model Architecture.
In the classification models used in our CADx system, we strategically consider vital hyperparameters to optimize each model's performance. For LDA, the selection of solver, the application of shrinkage, and the determination of the number of components are pivotal in shaping classification capabilities. Likewise, for SVM, kernel selection, fine-tuning of the regularization parameter (C), and the degree of the polynomial are pivotal for optimal decision boundaries. The CNN design encompasses key hyperparameters such as the numbers of convolutional, dropout, and hidden layers; filter size; max pooling pool size; dropout rate; activation function; L2 (i.e., Euclidean norm) regularization rate; and training epochs. In crafting our ensemble model, we place significant emphasis on the weighting mechanism for inconsistent predictions to achieve a harmonious fusion of these algorithms. The hyperparameters of the four classification algorithms are summarized in Table 6. For each algorithm, we have tried different hyperparameter values. The values in the table give the best performance.
The CNN model consists of five convolutional layers with filter sizes of (3, 3) in each layer, facilitating the extraction of hierarchical features from the input images. After each convolutional layer, max pooling layers with a pool size of (3, 2) are employed to downsample the extracted features, aiding in reducing spatial dimensions. Table 7 shows the shape and number of parameters of the CNN classification algorithm. For the CNN model, we have tried different shapes and parameters. The values in the table give the best performance.
To mitigate overfitting, dropout layers with a dropout rate of 0.27 are strategically inserted after each max pooling layer. The depth of the feature maps increases progressively through the network, starting with 32 filters in the first layer and reaching 256 filters in the final convolutional layer. A global average pooling layer is incorporated to transform the spatial dimensions into a vector of length 256, contributing to reducing the total number of parameters. After the convolutional layers, there are two fully connected dense layers with 127 and 7 units, respectively. The first dense layer employs rectified linear unit (ReLU) activation and L2 regularization with a rate of 0.003, enhancing the model's ability to learn intricate patterns in data. The entire model comprises a total of 1 635 143 parameters, which include weights and biases.
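The parameter counts of individual layers follow from standard formulas, sketched below for the layers whose sizes are stated above. The three-channel (RGB) input of the first convolutional layer is an assumption. The filter counts of the intermediate convolutional layers are not given in this section, so the sketch does not attempt to reproduce the reported total.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolutional layer."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (in_units + 1) * out_units

# First convolutional layer: 3 x 3 kernels, 32 filters, RGB input assumed.
first_conv = conv2d_params(3, 3, 3, 32)

# Dense layers after global average pooling (vector of length 256).
dense_hidden = dense_params(256, 127)   # 127-unit ReLU layer
dense_output = dense_params(127, 7)     # 7-class output layer
```

Under these assumptions the first convolutional layer has 896 parameters, and the two dense layers contribute 32 639 and 896 parameters.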
During training, the model undergoes 60 epochs with early stopping and learning rate reduction callbacks, ensuring effective convergence and preventing overfitting.
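The interplay of the epoch limit, early stopping, and learning rate reduction can be sketched with a small simulation over a list of validation losses. The patience values, the reduction factor, the initial learning rate, and the loss values are all hypothetical; only the 60-epoch limit comes from the text. This is not the training code used in this work.

```python
def simulate_training(val_losses, max_epochs=60, patience=3,
                      lr=1e-3, lr_patience=2, lr_factor=0.5):
    """Return (best_epoch, last_epoch_run, final_lr) for a validation-loss sequence."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
            continue
        wait += 1
        if wait % lr_patience == 0:
            lr *= lr_factor          # reduce the learning rate on a plateau
        if wait >= patience:
            break                    # early stopping
    return best_epoch, last_epoch, lr

# Hypothetical validation losses that stop improving after epoch 3.
losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69, 0.70]
best_epoch, last_epoch, final_lr = simulate_training(losses)
```

In this example the loss stops improving after epoch 3, the learning rate is halved once, and training halts at epoch 6 rather than running to the epoch limit.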

RESULTS AND DISCUSSION
In this section, we discuss experimental results obtained from the proposed CADx system to assess the impact of removing fake data, applying resampled data, and using the ensemble CNN−SVM model. Performance metrics such as precision, recall, F1-score, and accuracy are used for evaluation.
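For reference, these metrics can be computed from a confusion matrix as sketched below. The three-class matrix is a made-up example, with rows taken as true classes and columns as predicted classes, and the unweighted mean is assumed for the class averages; the averaging scheme behind the averages reported in the tables is not stated in this section.

```python
import numpy as np

def classification_metrics(confusion):
    """Per-class precision, recall, F1-score, and overall accuracy."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                 # correctly classified samples per class
    predicted = confusion.sum(axis=0)       # samples predicted as each class
    actual = confusion.sum(axis=1)          # samples truly in each class
    # A class that is never predicted (or never present) gets a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / confusion.sum()
    return precision, recall, f1, accuracy

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
confusion = [[8, 2, 0],
             [1, 5, 4],
             [0, 0, 10]]
precision, recall, f1, accuracy = classification_metrics(confusion)
macro_f1 = f1.mean()                        # unweighted average over classes
```

A class that the model never predicts receives zero precision, recall, and F1-score under this convention, while overall accuracy can remain high if a dominant class is classified well.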

Baseline Performance of the Simulated CADx System.
First, we present performance metrics obtained from a typical CADx system (without using GAN) by utilizing images from the original HAM10000 data set plus an additional 25% generated/fake images. The distribution of images among the seven different classes of skin lesions is summarized in Table 3, shedding light on the representation and prevalence of each skin lesion type. The classification performances of LDA, SVM, and CNN without the implementation of GAN are presented in Tables 8, 9, and 10, respectively.
The LDA model exhibits notable disparities in its performance across different skin lesion classes (as shown in Table 8). Particularly noteworthy is the poor performance for the DF class, where both precision and recall are reported as 0.001, resulting in an F1-score of 0.001. This indicates significant challenges in correctly identifying and classifying instances of DF, highlighting an area where the model may need improvement. In contrast, the model performs relatively well for the NV class, with moderate precision (0.690), recall (0.667), and F1-score (0.674). The model performs poorly for the VAS (F1-score 0.141), AKIEC (0.156), BCC (0.185), BKL (0.194), and MEL (0.197) classes, contributing to the model's average precision of 0.211, average recall of 0.244, average F1-score of 0.221, and a moderate overall accuracy of 0.491.
The SVM model displays varying performance across different skin lesion classes (as shown in Table 9). Notably, it achieves high precision (0.763), recall (0.979), and F1-score (0.856) for the NV class, contributing to the model's overall accuracy of 0.722. However, the model's performance varies across other classes, with notable discrepancies in precision, recall, and F1-score, particularly for the DF class, where precision, recall, and F1-score are reported as zero. The average precision (0.461), recall (0.344), and F1-score (0.368) suggest that there is room for performance improvement.
The CNN model demonstrates varied performance across different skin lesion classes (as shown in Table 10). Notably, it achieves high precision (0.856), recall (0.943), and F1-score (0.891) for the NV class, contributing to the model's impressive overall accuracy of 0.774. However, the model's performance varies across other classes, with some challenges in correctly classifying instances of DF, where the recall is reported as 0.123, resulting in an F1-score of 0.226. The average precision (0.652), recall (0.477), and F1-score (0.522) indicate a reasonable balance between precision and recall across the skin lesion categories.
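To gain insights into the CNN model's training process, the key visualization plot of accuracy versus epoch (shown in Figure 4) is generated. The observed plateau in accuracy around epoch 35 suggests that the model has reached a saturation point in learning from the training data, indicating that further training beyond this point may not significantly improve performance on the test set. The training and validation accuracies level off at 0.864 and 0.774, respectively, which indicates that the model generalizes well to unseen data but has reached a point of diminishing returns in learning from the training set.
Table 11 shows the performance metrics obtained by applying the ensemble CNN−SVM model in the CADx system. The ensemble model exhibits notable strengths in accurately classifying most skin lesions, particularly achieving high precision, recall, and F1-score for the NV class with values of 0.974, 0.965, and 0.975, respectively. The results contribute to an impressive overall accuracy of 0.792. However, there are areas for improvement, especially in handling less prevalent classes like AKIEC, where the precision, recall, and F1-score are relatively lower at 0.392, 0.151, and 0.222, respectively.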

Impact of Detecting and Removing Fake Images Using GAN (Discriminator).
Second, we present performance metrics acquired with a GAN in the CADx system for the original HAM10000 data set plus 25% generated/fake images. The performance of LDA, SVM, and CNN with GAN is presented in Tables 12, 13, and 14, respectively.
The LDA model demonstrates relatively balanced performance across various skin lesion classes, as shown in Table 12, achieving notable precision, recall, and F1-score for DF with values of 0.891, 0.740, and 0.816, respectively. However, the model's overall accuracy is moderate at 0.524. While the model excels in certain areas, there is room for improvement in precision, recall, and F1-score for other classes, such as MEL, BKL, and BCC.
The SVM model showcases better performance across various skin lesion classes, as shown in Table 13, achieving particularly high precision, recall, and F1-score for NV with values of 0.851, 0.978, and 0.911, respectively. The model's overall accuracy is commendable at 0.763, indicating its effectiveness in correctly classifying skin lesions. The balanced precision, recall, and F1-score across multiple classes suggest the model's capability to generalize well to diverse skin conditions.
The CNN model exhibits promising performance across various skin lesion classes, as shown in Table 14, with notable precision, recall, and F1-score values. Particularly impressive is the perfect precision (1.000), moderate recall (0.761), and high F1-score (0.861) for the DF class, showcasing the model's ability to accurately identify this specific skin lesion type. The overall accuracy of 0.801 is indicative of the model's success in correctly classifying skin lesions across diverse categories. The balanced precision, recall, and F1-score across multiple classes emphasize the CNN model's effectiveness in providing accurate predictions for various skin conditions. The key visualization plot of accuracy versus epoch is also generated.
The experimental results show significant advancements across various classification models, particularly when employing a GAN (discriminator) with the resampled data set. The notable increase in accuracy underscores the efficacy of mitigating class imbalance through resampling techniques. Initially, LDA demonstrates the lowest accuracy (0.491), followed by SVM (0.722), CNN (0.774), and the ensemble CNN−SVM model (0.792). The incorporation of GAN (discriminator) effectively addresses challenges associated with fake images, leading to improvements in precision, recall, F1-score, and overall accuracy across all classification algorithms. The resampled data set, generated using SMOTE, consistently outperforms the original data set, emphasizing the critical role of addressing class imbalance in facilitating accurate predictions for all skin lesion classes. Finally, with GAN (discriminator) and SMOTE applied, LDA exhibits a moderate accuracy of 0.763, followed by SVM (0.833), CNN (0.872), and the ensemble CNN−SVM model (0.941, the highest accuracy). In future studies, we plan to explore the integration of advanced machine learning techniques, additional data processing methods, and more sophisticated ensemble models.

Figure 1. Major steps of a conventional CADx system.

Figure 2. Noise removal and segmentation. (a) An original image from the HAM10000 data set, (b) preprocessed image of the original after noise removal, and (c) binary impression of the preprocessed image after segmentation.

Figure 3. Major steps of the proposed CADx system.

Table 1. Major Classification Methods

Table 2. Two Types of Resampling Techniques

Table 3. Distribution of the Dataset

Table 5. Diagnostic Categories of HAM10000 Dataset

Table 6. Classification Algorithms and Hyperparameters

Table 7. Shape and Number of Parameters of the CNN Model

Table 8. LDA Performance without GAN (with Fake Data)

Table 9. SVM Performance without GAN (with Fake Data)

Table 11. Ensemble CNN−SVM Model Performance without GAN (with Fake Data)

Table 12. LDA Performance with GAN (without Fake Data)

Table 13. SVM Performance with GAN (without Fake Data)

Table 14. CNN Performance with GAN (without Fake Data)