Artificial Intelligence-Based Image Classification for Diagnosis of Skin Cancer: Challenges and Opportunities

Recently, there has been great interest in developing Artificial Intelligence (AI) enabled computer-aided diagnostics solutions for the diagnosis of skin cancer. With the increasing incidence of skin cancers, low awareness among a growing population, and a lack of adequate clinical expertise and services, there is an immediate need for AI systems to assist clinicians in this domain. A large number of skin lesion datasets are available publicly, and researchers have developed AI-based image classification solutions, particularly deep learning algorithms, to distinguish malignant skin lesions from benign lesions in different image modalities such as dermoscopic, clinical, and histopathology images. Despite the various claims of AI systems achieving higher accuracy than dermatologists in the classification of different skin lesions, these AI systems are still in the very early stages of clinical application in terms of being ready to aid clinicians in the diagnosis of skin cancers. In this review, we discuss advancements in the digital image-based AI solutions for the diagnosis of skin cancer, along with some challenges and future opportunities to improve these AI systems to support dermatologists and enhance their ability to diagnose skin cancer.


Introduction
According to the Skin Cancer Foundation, the global incidence of skin cancer continues to increase [1]. In 2019, it is estimated that 192,310 cases of melanoma will be diagnosed in the United States [2]. However, the most common forms of skin cancer are non-melanocytic, such as Basal Cell Carcinoma (BCC) and Squamous Cell Carcinoma (SCC). Non-melanoma skin cancer is the most commonly occurring cancer in men and women, with over 4.3 million cases of BCC and 1 million cases of SCC diagnosed each year in the United States, although these numbers are likely to be an underestimate [3]. Early diagnosis of skin cancer is a cornerstone to improving outcomes and is correlated with 99% overall survival (OS). However, once disease progresses beyond the skin, survival is poor [4,5].
In current medical practice, dermatologists examine patients by visual inspection with the assistance of polarized light magnification via dermoscopy.
Medical diagnosis often depends on the patient's history, ethnicity, social habits, and exposure to the sun. Lesions of concern are biopsied in an office setting, submitted to the laboratory, processed as permanent paraffin sections, and examined as representative glass slides by a pathologist to render a diagnosis.
AI-enabled computer-aided diagnostics (CAD) solutions are poised to revolutionize medicine and health care, especially in medical imaging. Medical imaging, including ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI), is used extensively in clinical practice. In the dermatological realm, dermoscopy or, less frequently, confocal microscopy, allows for more detailed in vivo visualization of lesion features and risk stratification [6,7,8,9,10]. In various studies, AI-based image classification algorithms match or exceed clinician performance for disease detection in medical imaging [11,12]. Recently, deep learning has provided various end-to-end solutions in the detection of abnormalities such as breast cancer, brain tumors, lung cancer, esophageal cancer, skin lesions, and foot ulcers across multiple image modalities of medical imaging [13,14,15,16,17].
Over the last decade, advances in technology have led to greater accessibility to advanced imaging techniques such as 3D whole body photoimaging/scanning, dermoscopy, high-resolution cameras, and whole-slide digital scanners that are used to collect high-quality skin cancer data from patients across the world [18,19]. The International Skin Imaging Collaboration (ISIC) is a driving force that provides digital datasets of skin lesion images with expert annotations for digital dermatology for the diagnosis of melanoma and other skin cancers. A wide research interest in AI-based image classification solutions for skin cancer diagnosis is facilitated by affordable and high-speed internet, computing power, and secure cloud storage to manage and share skin cancer datasets. These algorithms can be scaled to multiple devices, platforms, and operating systems, turning them into modern medical instruments [20].
The purpose of this review is to provide the reader with an update on the performance of artificial intelligence algorithms used for the diagnosis of skin cancer across various modalities of skin lesion datasets, especially in terms of the comparative studies on the performance of AI-based image classification algorithms and dermatologists/dermatopathologists. We dedicated separate sub-sections to arrange these studies according to the types of imaging modality used, including clinical photographs, dermoscopy images, and whole-slide pathology scanning. Specifically, we seek to discuss the technical challenges in digital dermatology and opportunities to improve the current AI-based image classification solutions so that they can be used as a support tool for clinicians to enhance their efficiency in diagnosing skin cancers.

Artificial Intelligence for Skin Cancer
The major advances in this field came from the work of Esteva et al. [12]

Artificial Intelligence in Dermoscopic Images
Dermoscopy is the inspection/examination of skin lesions with a dermatoscope device consisting of a high-quality magnifying lens and a (polarizable) illumination system. Dermoscopic images are captured with high-resolution digital single-lens reflex (DSLR) or smartphone camera attachments.

Artificial Intelligence in Clinical Images
Clinical images of different skin lesions are routinely captured with mobile cameras for remote examination and incorporation into patient medical records, as shown in Fig. 2. Since clinical images are captured with different cameras with variable backgrounds, illuminance, and color, these images provide different insights from dermoscopic images.

Artificial Intelligence in Histopathology Images
The total discordance with the histopathologist was 18% for melanoma, 20% for nevi, and 19% for the full set of images.
2. Jiang et al. [44] proposed the use of a deep learning algorithm on smartphone-captured digital histopathology images (MOI) for the detection of BCC.
They found that the performance of the algorithm on MOI and Whole

Challenges in Artificial Intelligence
With deep learning algorithms surpassing the benchmarks of popular computer vision datasets in a short period, the same trend could be expected in the skin lesion diagnosis challenge as well. However, as we further explore the skin lesion diagnosis challenge, this task appears to be less straightforward than non-medical challenges such as ImageNet and MS-COCO [47,48]. There are inter-class similarities and intra-class dissimilarities in the visual appearance of skin lesions regarding color, texture, size, and location. Deep learning algorithms generally require a substantial amount of diverse, balanced, and high-quality training data that represent each class of skin lesions to improve diagnostic accuracy. For skin lesion datasets of various modalities, there are many more issues related to the diagnosis of skin cancer with AI solutions, as discussed below.

Performance of Deep Learning and Imbalanced Datasets
The performance of deep learning algorithms depends mostly on the quality of image datasets rather than on tuning the hyper-parameters of networks, as is commonly seen with the different publicly available skin lesion datasets. These datasets generally contain more cases of benign skin lesions than malignant lesions. Most deep learning architectures are designed on a balanced dataset, such as ImageNet, which consists of roughly 1,000 images per class across 1,000 classes [47]. Hence, the performance of a deep learning algorithm usually suffers on imbalanced datasets, despite tuning tricks such as custom loss functions that penalize false negatives in a minority skin lesion class during training.
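The custom loss functions mentioned above can take many forms. The following is a minimal sketch of one common variant, a class-weighted cross-entropy with inverse-frequency weights, written with NumPy. The function names, the toy label counts, and the weighting scheme are illustrative assumptions and are not taken from any of the cited studies.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by the inverse of its frequency in the labels."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = np.zeros(num_classes)
    present = counts > 0  # classes absent from the labels keep weight 0
    weights[present] = counts.sum() / (num_classes * counts[present])
    return weights

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Cross-entropy where each sample is scaled by its true class weight."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    per_sample = -np.log(np.clip(true_class_probs, eps, 1.0))
    sample_weights = class_weights[labels]
    return float(np.sum(sample_weights * per_sample) / np.sum(sample_weights))

# Hypothetical toy set: class 0 = benign (8 cases), class 1 = malignant (2 cases).
labels = np.array([0] * 8 + [1] * 2)
weights = inverse_frequency_weights(labels, num_classes=2)

# A model that always leans towards "benign" with probability 0.9.
probs = np.tile([0.9, 0.1], (len(labels), 1))
unweighted = weighted_cross_entropy(probs, labels, np.ones(2))
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, unweighted, weighted)
```

In this sketch the weighted loss is larger than the unweighted one, because the missed malignant cases count more heavily. As noted above, such re-weighting does not replace a better balanced dataset.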

Patients' Medical History and Clinical Meta-data
Patients' medical history, social habits, and clinical meta-data are considered when making a skin cancer diagnosis. It is very important to know the diagnostic meta-data, such as patient and family history of skin cancer, age, ethnicity, sex, general anatomic site, size and structure of the skin lesion, while performing a visual inspection of a suspected skin lesion with dermoscopy. Hence, deep learning algorithms that rely only on images for the diagnosis of skin cancer miss key aspects of patient and clinical information. A previous study [37] showed that the performance of both beginner and skilled dermatologists improved with the availability of clinical information and that they performed better than deep learning algorithms. Unfortunately, both patient history and clinical meta-data are missing in most publicly available skin lesion datasets.

ABCDE Rule and Time-line Datasets
In

Biopsy is a Must
Even if skin cancer is confirmed by AI solutions with a high confidence rate, a biopsy and histological test must still be undertaken to confirm a diagnosis.
The diagnostic accuracy of deep learning algorithms could be misleading as well.
For example, if a testing set consists of 20 melanoma and 80 nevi cases, and the overall diagnostic accuracy is 90% (100% in nevi and 50% in melanoma cases), it is dangerous for a deep learning algorithm to be used in this case as a means to deliver a diagnosis of melanoma. As misdiagnosis of a cancer patient by a deep learning algorithm could risk a fatality, a biopsy should be taken to ensure safety and confirm the algorithm's diagnosis.
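The arithmetic behind this example can be made explicit. The short sketch below recomputes the figures from the counts given in the text (20 melanomas of which half are detected, 80 nevi all correctly classified); the function name and the choice of balanced accuracy as an additional metric are illustrative assumptions.

```python
def binary_metrics(tp, fn, tn, fp):
    """Summary metrics for a binary melanoma-versus-nevus test set."""
    sensitivity = tp / (tp + fn)  # fraction of melanomas detected
    specificity = tn / (tn + fp)  # fraction of nevi correctly cleared
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Counts from the example in the text: 10 of 20 melanomas found, 80 of 80 nevi correct.
metrics = binary_metrics(tp=10, fn=10, tn=80, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The overall accuracy is (10 + 80) / 100 = 90%, yet the sensitivity for melanoma is only 50%. Reporting sensitivity and specificity alongside accuracy makes this kind of failure visible.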

Inter-class Similarities (Mimics of Skin Lesions)
A number of skin lesions can mimic skin cancer in both clinical and microscopic settings, which could result in misdiagnosis. For example, in clinical and dermoscopic images, seborrheic keratosis can mimic skin cancers including BCC, SCC, and melanoma. In histopathology images, there are many histologic mimics of BCC such as SCC, benign follicular tumors, basaloid follicular hamartoma, a tumor of follicular infundibulum, syringoma, and microcystic adnexal carcinoma [49]. Hence, deep learning algorithms, when trained on limited classes of skin lesions in a dataset, do not reliably distinguish skin cancers from their known mimics.

Intra-class Dissimilarities
Several skin lesions have intra-class dissimilarities in terms of color, attribute, texture, size, and site. Hence, these skin lesions are further categorized into many sub-categories based on visual appearance. For example, most melanomas are black because of the dark pigment melanin, but certain melanomas are skin-colored, reddish, or pinkish.
Similarly, BCC has many subcategories, such as nodular BCC, superficial BCC, morphoeic BCC, and basosquamous BCC, which look completely different from one another, ranging from white to red in color, as shown in Fig. 4.

Noisy Real-life Data with Heterogeneous Data Sources
In the current datasets of skin lesions, the dermoscopic images are captured with high-resolution DSLR cameras and in an optimal environment of lighting and distance of capture. A deep learning algorithm trained on these high-quality dermoscopic datasets, achieving a reasonable diagnostic accuracy, could potentially be scaled to smartphone vision applications. However, when this model is tested on smartphone images captured by different cameras in different lighting conditions and at different distances, the same diagnostic accuracy is hard to achieve.
Deep learning algorithms are found to be highly sensitive to which camera devices are used to capture the data, and their performance degrades if a different type of camera device is used for testing. Patient-provided self-captured skin images are frequently of low quality and are not suitable for digital dermatology [50,51].

Race, Ethnicity and Population
Most of the cases in the current skin lesion datasets belong to fair-skinned individuals rather than brown or dark-skinned persons. Although the risk of developing skin cancer is indeed relatively high among the fair-skinned population, people with dark skin can also develop skin cancer and are frequently diagnosed at later stages [52]. Skin cancer represents 4 to 5%, 2 to 4%, and 1 to 2% of all cancers in Hispanics, Asians, and Blacks, respectively [53]. Hence, deep learning frameworks validated for the diagnosis of skin cancer in fair-skinned people have a greater risk of misdiagnosing those with darker skin [54]. In a recent study, Han et al. [28] trained a deep learning algorithm on an Asan training dataset consisting of skin lesions from Asians. They reported an accuracy of 81% on the Asian testing set, whereas they reported an accuracy of only 56% on the Dermofit dataset, which consists of skin lesions of Caucasian people.
Therefore, this drop in accuracy signifies a lack of transferability of the learned features of deep learning algorithms across datasets that contain persons of a different race, ethnicity, or population.

Rare Skin Cancer and Other Skin Conditions
BCC, SCC and melanoma collectively comprise 98% of all skin cancers.
However, there are other skin cancers, including Merkel cell carcinoma (MCC), appendageal carcinomas, cutaneous lymphoma, sarcoma, Kaposi sarcoma, and cutaneous secondaries, that are ignored by most algorithms. Besides these rare skin cancers, there are certain other skin conditions, such as ulcers, skin infections, neoplasms, and non-infectious granulomas, that could mimic skin lesions.
If deep learning algorithms are trained on datasets that do not have adequate cases of these rare skin cancers and other mentioned skin conditions, there is a high risk of misdiagnosis when they are tested on these skin conditions.

Incomplete Diagnosis Pipeline for Artificial Intelligence
In the clinical setting, the diagnosis of skin cancer is made by inspecting the skin lesion with or without dermoscopy, followed by confirmatory biopsy and pathological examination. The major issue with the current publicly available skin lesion datasets is that they lack complete labels related to the diagnosis performed by a dermatologist. Nevertheless, most of the classification labels for dermoscopic skin lesion images are determined by pathological examination.
Still, these dermoscopic and clinical skin lesion datasets do not have corresponding pathological classification labels to develop a complete diagnosis pipeline for AI.

Opportunities
AI researchers invariably claim their systems exceed the performance of dermatologists for the diagnosis of skin cancer. But this picture is far from reality, as these experiments are performed in closed systems with a defined set of rules.
Given the many challenges mentioned in the above section, these reported performance evaluations are far removed from the real-life diagnosis performed by clinicians treating skin cancer. Often, deep learning algorithms are deemed opaque, as they only learn from pixel values of imaging datasets and do not have any domain knowledge or perform logical inferences to establish the relationship between different types of skin lesions [54]. In the future, however, deep learning could do very well for the diagnosis of skin cancer given the opportunities listed below.

Balanced Dataset and Selection of Cases
A balanced dataset is critical for the good performance of deep learning algorithms used for classification tasks. Hence, balanced datasets are required with a selection of cases that completely represent the category of that particular skin lesion, and the input of experienced dermatologists could be very helpful for this selection.

Color Constancy for Illumination and Heterogeneous Data Sources
In publicly available clinical and dermoscopic datasets, the skin lesion images are acquired with different illumination settings and acquisition devices, which could reduce the performance of the AI systems. Several studies have shown that color constancy algorithms such as Shades of Gray and max-RGB can improve the performance of AI algorithms for the classification of multisource images [55,56].
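As an illustration of how a color constancy step can be applied before classification, the sketch below implements a basic Shades of Gray correction with NumPy. The illuminant is estimated per channel with a Minkowski p-norm, and the channels are rescaled so that the estimate becomes gray. The norm order p = 6, the value range [0, 1], and the synthetic test image are assumptions made for this example; the cited studies may use different settings.

```python
import numpy as np

def shades_of_gray(image, p=6):
    """Shades of Gray color constancy for an RGB image with values in [0, 1]."""
    img = image.astype(np.float64)
    # Minkowski p-norm mean of each channel estimates the illuminant color.
    illuminant = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illuminant = np.maximum(illuminant, 1e-12)  # guard against division by zero
    # Rescale channels so that the estimated illuminant becomes achromatic.
    gains = illuminant.mean() / illuminant
    return np.clip(img * gains, 0.0, 1.0)

# Synthetic example: a gray-level pattern photographed under a warm color cast.
rng = np.random.default_rng(0)
base = rng.uniform(0.2, 0.8, size=(32, 32, 1))
cast = base * np.array([1.0, 0.8, 0.6])
corrected = shades_of_gray(cast)
print(cast.mean(axis=(0, 1)), corrected.mean(axis=(0, 1)))
```

With p = 1 this estimate reduces to the gray-world assumption, and very large values of p approach max-RGB, so the same function covers both families with a single parameter.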

Diverse Datasets
Deep learning networks are often criticised for social biases due to most of the imaging data belonging to fair-skinned persons. Skin lesion datasets need to have racial diversity, i.e., they must add equally distributed skin lesion cases from fair-skinned and dark-skinned people to reduce social or ethnic bias in deep learning models. The same concern can be extended to age, especially when the degree of skin aging or surrounding solar damage can influence the dataset and decision-making.

Data-Augmentation
Data augmentation techniques may mitigate many limitations of datasets, such as imbalanced data among the classes of skin lesions and heterogeneous sources of data, by adding augmented samples with different image transformations, such as rotation, random crop, horizontal and vertical flip, translation, shear, color jitter, and color space changes. Several studies have shown that data augmentation improves the diagnosis of skin cancer [57,58]. In the HAM10000 dataset [22], the skin lesion images were captured at different magnifications or angles or with different cameras, a process known as natural data augmentation.
Notably, Goyal et al. [59] used a deep learning architecture called Faster R-CNN to develop an algorithm that generates augmented copies similar to natural data augmentation for other skin lesion datasets.
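A minimal sketch of geometric augmentation is given below, using only NumPy and restricted to 90-degree rotations and flips. It oversamples a minority class until a target count is reached. The function names, the target count, and the random placeholder images are hypothetical; this is not the Faster R-CNN based method of Goyal et al. [59], and practical pipelines usually add crops, shear, and color jitter.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly rotated (multiple of 90 degrees) and flipped copy."""
    out = np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        out = np.flip(out, axis=0)  # vertical flip
    return np.ascontiguousarray(out)

def oversample_minority(images, target_count, seed=0):
    """Add augmented copies of the given images until target_count is reached."""
    rng = np.random.default_rng(seed)
    augmented = list(images)
    while len(augmented) < target_count:
        source = images[int(rng.integers(0, len(images)))]
        augmented.append(augment(source, rng))
    return augmented

# Hypothetical minority class: three square 8x8 RGB placeholder images.
# Square inputs keep their shape under 90-degree rotation.
data_rng = np.random.default_rng(1)
minority = [data_rng.random((8, 8, 3)) for _ in range(3)]
balanced = oversample_minority(minority, target_count=10)
print(len(balanced), balanced[-1].shape)
```

Rotations and flips only rearrange pixels, so lesion color and texture statistics are preserved while the network sees the lesion in new orientations.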

Generative Adversarial Networks
Generative adversarial networks (GAN) are recently developed deep learning architectures that are attracting interest in the medical imaging community.
GAN is mainly used to generate high-quality fake imaging data to overcome a limited dataset [60,61,62]. For skin cancer, GAN can be used to generate realistic synthetic skin lesion images to overcome the lack of annotated data [63].
The distribution of skin lesions in publicly available datasets is heavily skewed by each class's prevalence among patients, and GAN can be used to generate imaging data for under-represented skin lesion classes or rare classes of skin cancer, such as MCC, sebaceous carcinoma, or Kaposi sarcoma.

Identifying Sub-categories
There could be many visual intra-class dissimilarities in the appearance of skin lesions in terms of texture, color, and size. In most of the publicly available datasets, the collection of skin lesions belongs to each superclass rather than dividing them into sub-categories. Given the many intra-class dissimilarities and inter-class similarities (mimics of a skin lesion) in skin lesion datasets, it is challenging for deep learning algorithms to classify or differentiate such lesions. As a possible solution to deal with this issue, sub-categories of each skin lesion should be treated as different classes in the dataset used for training deep learning algorithms. However, this will require a greater volume of training images and it will also be more challenging to translate into clinical practice. Therefore, subcategorization would require a certain degree of suspicion or a reasonable pre-test probability to adequately aid the clinician in choosing the algorithm.

Semantic Explanation of Prediction
To assist clinicians, deep learning algorithms need to provide a semantic explanation rather than just a confidence score for the prediction of skin lesions.
One possible solution could be for deep learning networks using longitudinal datasets to provide a semantic explanation of networks' prediction according to ABCDE criteria (asymmetry, border, color, diameter, evolution) or 7-point skin lesion malignancy checklist (pigment network, regression structures, pigmentation, vascular structures, streaks, dots and globules, blue whitish veil) [34].

Multiple Models for Diagnosis of Skin Cancer
Rather than relying on a single AI solution for the diagnosis of skin cancer, multiple deep learning models can evaluate different features or aspects of skin lesions, submit predictions, and generate a final conclusion. In this regard, cloud computational power and storage are becoming more affordable, and it will be possible to host multiple models that assist dermatologists around the world in the diagnosis of skin cancer in parallel (or in synchrony).
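One simple way to combine several models into a final conclusion is soft voting, in which the predicted class probabilities are averaged. The sketch below assumes three hypothetical models that each output probabilities for two classes (nevus, melanoma); the model names, the probability values, and the equal default weights are assumptions made for illustration.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Average class probabilities from several models and pick the top class."""
    probs = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so outputs stay probabilities
    averaged = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return averaged, averaged.argmax(axis=1)

# Hypothetical outputs for two lesions; columns are [nevus, melanoma].
model_a = np.array([[0.7, 0.3], [0.4, 0.6]])
model_b = np.array([[0.6, 0.4], [0.2, 0.8]])
model_c = np.array([[0.4, 0.6], [0.3, 0.7]])

averaged, predictions = soft_vote([model_a, model_b, model_c])
print(averaged, predictions)
```

Passing `weights` allows models that performed better on a validation set to count more, which is one possible design choice among several (majority voting and stacking are alternatives).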

Combining Clinical Information and Deep Features
Clinical meta-data and patient history are considered clinically important in

Rigorous Clinical Validation
It is a well-known fact, for both clinicians and AI researchers, that mistakes can inform future decision-making. Since we cannot afford misdiagnosis by technology, it is better to keep AI solutions in the background, where they can be rigorously validated on noisy data coming from real patients and their predictions improved, until they are finally validated to provide useful insights into the diagnosis of skin cancer and assist clinicians in both hospital and remote settings.

Conclusion
Research involving AI is making encouraging progress in the diagnosis of skin cancer. Despite the various claims of deep learning algorithms surpassing clinicians' performance in the diagnosis of skin cancer, there are far more challenges faced by these algorithms to become a complete diagnostic system.
Because such experiments are performed in controlled settings, algorithms are never tested in the real-life diagnosis of skin cancer patients. The real-world diagnosis process requires taking into account a patient's ethnicity, skin, hair and eye color, occupation, illness, medicines, existing sun damage, the number of nevi, and lifestyle habits (such as sun exposure, smoking, and alcohol intake), clinical history, the response to previous treatments, and other information from the patient's medical records. However, current deep learning models predominantly rely on only patients' imaging data. Moreover, such systems often risk a misdiagnosis whenever they are applied to skin lesions or conditions that are not present in the training dataset. This paper further explores opportunities to build robust algorithms to assist clinicians in the diagnosis of skin cancer.
Computer vision and dermatologist societies need to work together to improve current AI solutions and enhance the diagnostic accuracy of methods used for the diagnosis of skin cancer. AI has the potential to deliver a paradigm shift in the diagnosis of skin cancer, and thus a cost-effective, remotely accessible, and accurate healthcare solution for digital dermatology.

Search strategy and selection criteria
We used Google Scholar and PubMed to find relevant manuscripts. We restricted our search to papers published in English between Jan 1, 2012, and Nov