Using Multiple Dermoscopic Photographs of One Lesion Improves Melanoma Classification via Deep Learning: A Prognostic Diagnostic Accuracy Study

Background: Convolutional neural network (CNN)-based melanoma classifiers face several challenges that limit their usefulness in clinical practice. Objective: To investigate the impact of multiple real-world dermoscopic views of a single lesion of interest on a CNN-based melanoma classifier. Methods: This study evaluated 656 suspected melanoma lesions. Classifier performance was measured using area under the receiver operating characteristic curve (AUROC), expected calibration error (ECE) and maximum confidence change (MCC) for (I) a single-view scenario, (II) a multiview scenario using multiple artificially modified images per lesion and (III) a multiview scenario with multiple real-world images per lesion. Results: The multiview approach with real-world images significantly increased the AUROC from 0.905 (95% CI, 0.879-0.929) in the single-view approach to 0.930 (95% CI, 0.909-0.951). ECE and MCC also improved significantly from 0.131 (95% CI, 0.105-0.159) to 0.072 (95% CI: 0.052-0.093) and from 0.149 (95% CI, 0.125-0.171) to 0.115 (95% CI: 0.099-0.131), respectively. Comparing multiview real-world to artificially modified images showed comparable diagnostic accuracy and uncertainty estimation, but significantly worse robustness for the latter. Conclusion: Using multiple real-world images is an inexpensive method to positively impact the performance of a CNN-based melanoma classifier.


Introduction
Recent projections indicate a substantial increase in global melanoma incidence by 2040, with estimates suggesting increases of up to 50% in cases and 68% in associated death rates. 1 This emphasizes the urgent need for more accurate and efficient diagnostic tools to facilitate early melanoma detection and treatment.
Recent advances in artificial intelligence (AI), particularly in deep learning and convolutional neural networks (CNNs), show promise in assisting clinicians with melanocytic lesion diagnosis. 3-5 However, good diagnostic accuracy is just one of many aspects required for successful clinical implementation. CNNs still face limitations such as robustness 6,7 and uncertainty issues, 8,9 affecting their reliability.
Robustness refers to how well CNN-based melanoma classifiers maintain accuracy and reliability when input data undergoes alterations like changes in image orientation, lighting and position.
Previous studies have shown that when the input images are subject to artificial modifications, the performance of AI-based melanoma detection systems is affected. 10,11 However, such changes are inevitable in routine clinical practice, e.g., there is no standardized orientation for photographing a skin lesion. Hence, it is crucial for CNN-based melanoma classifiers to remain stable and accurate under such conditions. Additionally, the model should effectively communicate the confidence level of its predictions.
Model outputs should not be interpreted as probabilities since modern neural networks tend to be overconfident, resulting in incorrect predictions made with high confidence. 8 Whenever a model generates an uncertain prediction, it is important that dermatologists are made aware of this uncertainty, allowing them to interpret the results with appropriate caution. Proper confidence estimates are critical in fostering trust between clinicians and AI-based diagnostic tools, as they encourage a collaborative decision-making process that leverages both human expertise and AI capabilities. 12
To address these limitations, a common technique is to incorporate multiple artificially generated views of the same lesion into the decision-making process. This involves digitally transforming the original lesion image through standard computer vision operations such as rotation, translation, or brightness variations, and averaging the predictions from these transformed images. This technique, known as test-time augmentation, has shown improvements not only in diagnostic accuracy but also in uncertainty estimation and robustness. 10,13,14 However, challenges related to overconfidence and robustness persist.
In this study, we aim to investigate whether incorporating multiple real-world images of a lesion, rather than artificially modified images, can enhance this technique. The rationale behind this approach is that real-world images captured from multiple perspectives provide a more comprehensive and nuanced understanding of lesion morphology, reducing the influence of imaging artifacts, such as reflections and partial occlusions. To ensure reliable findings, we utilize data from a prospective multicenter study involving eight university hospitals while focusing on multiple clinically relevant endpoints (i.e., diagnostic accuracy, uncertainty estimation and robustness).

Study design
This prospective, multicenter study was approved by the respective institutional review boards of the participating hospitals/centers and adhered to the Declaration of Helsinki guidelines. STARD 2015 reporting standards were followed and written informed consent was obtained from all participating patients. 15 Dermoscopic images and patient metadata (e.g., age, Fitzpatrick skin type, lesion localization and diameter) of clinically suspected melanoma were prospectively collected from eight university hospitals in Germany (Berlin, Dresden, Erlangen, Essen, Mannheim, Munich, Regensburg, Wuerzburg) between April 2021 and October 2022 during routine clinical care. For each lesion, a dermatologist captured six dermoscopic images during clinical examination while randomly varying the orientation/angle, position and mode of the dermatoscope (i.e., polarized or nonpolarized). To minimize the effect of confounding factors, dermatologists were instructed to avoid well-known artifacts (e.g., skin markings). All images were obtained using one of four hardware settings that were available across the participating centers (see Supplementary Methods). This dataset will be referred to as SCP2 in the following text.
Subsequently, we trained a binary melanoma-nevus classifier on publicly available dermoscopic images and evaluated its performance on the external SCP2 dataset with respect to three clinically relevant endpoints: diagnostic accuracy, uncertainty estimation and robustness. For model prediction, we evaluated three different methods. The first method, called Single-View, represents the baseline scenario in which only one "original" image per lesion is available and the prediction is performed from that image. For the second method, referred to as multiview-artificial (MV-Artificial), the "original" image is accompanied by artificially modified duplicates generated by applying various image processing techniques such as rotation, zoom and brightness to the "original" image (see Supplementary Methods). For the final method, referred to as multiview-real (MV-Real), the "original" image is accompanied by multiple real-world images (i.e., photographs taken in the clinic). At test-time, the model therefore provides a prediction for every image, and these predictions are subsequently combined into an overall prediction (see Figure 1). The setup described above is feasible because the SCP2 dataset contains six real-world images per lesion. To ensure that the comparisons and statistical tests in this study were based on the same test data for all three methods, we randomly sampled one image per lesion and labeled this image as the "original" image (referred to as the downsampling step). The remaining five images were set aside. Thus, all three prediction methods were evaluated on the same test set, with each image corresponding to a unique lesion. During test-time, the Single-View method received no further images, while MV-Artificial and MV-Real each received five additional images (modified duplicates and real-world images, respectively).
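The combination step described above can be illustrated with a minimal Python sketch. It assumes that each image yields a single melanoma probability and that the combination is an unweighted mean (as stated in the Figure 1 caption); the function name and the probability values are hypothetical and serve only as an illustration.

```python
from statistics import fmean


def multiview_prediction(view_probs):
    """Average per-image melanoma probabilities into one lesion-level score."""
    if not view_probs:
        raise ValueError("at least one image prediction is required")
    return fmean(view_probs)


# Hypothetical classifier outputs (melanoma probability) for one lesion.
original = 0.62
additional_views = [0.71, 0.55, 0.68, 0.74, 0.60]  # five additional images

# Single-View uses only the "original" image.
single_view_score = multiview_prediction([original])

# MV-Artificial and MV-Real average the original and the additional images;
# they differ only in where the additional images come from.
multiview_score = multiview_prediction([original] + additional_views)
```

In this sketch, Single-View is simply the special case of a one-element list, which keeps the three methods directly comparable on the same "original" image.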

Participants
Participants were required to be at least 18 years old and have melanoma-suspicious skin lesions that were excised following dermoscopic examination. The suspicious lesions must not have been previously biopsied nor located near the eye or under the fingernails or toenails.
Additionally, due to data privacy concerns, lesions with person-identifying features (e.g., tattoos) in their immediate vicinity were excluded from the study. All lesions were histopathologically confirmed by at least one reference pathologist at the corresponding clinic as part of routine clinical practice. In the end, only histopathologically verified melanoma or nevus lesions, recorded until October 2022, were included in this study.

Model training and evaluation
We trained a CNN with a state-of-the-art ConvNeXt architecture on publicly available melanoma and nevus images from two well-established datasets, HAM10000 16 and BCN20000 17 (see Supplementary Methods). At test-time, the Single-View, MV-Artificial and MV-Real approaches were used on the trained model, using the external SCP2 dataset for evaluation. Both training and inference were implemented using PyTorch 1.10.1, 18 CUDA 11.0 and fastai 2.7.10. 19

Statistical analysis
The performance of our classifier was evaluated based on three endpoints: diagnostic accuracy, uncertainty estimation and robustness. Diagnostic accuracy was measured using the area under the receiver operating characteristic curve (AUROC), while uncertainty estimation was quantified by the expected calibration error (ECE). 20 The ECE assesses the calibration of predicted probabilities against observed outcomes, with lower values indicating better calibration.
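As an illustration of how an ECE of this kind can be computed, the following Python sketch uses the common binned formulation: predictions are grouped by confidence, and the gap between mean confidence and accuracy is averaged over bins, weighted by bin size. The number of bins (10), the equal-width binning and the use of the predicted-class probability as confidence are assumptions of this sketch; the study's exact settings are not given in the text.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for a binary classifier.

    probs  -- predicted probability of the positive class (melanoma)
    labels -- ground-truth labels, 1 for melanoma and 0 for nevus
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")

    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        confidence = p if predicted == 1 else 1.0 - p
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, predicted == y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_confidence)
    return ece
```

A perfectly calibrated model (e.g., 90% of its 0.9-confidence predictions are correct) yields an ECE close to zero, whereas a model that is always fully confident but right only half of the time yields 0.5.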
Robustness was evaluated by analyzing the consistency of the classifier's predictions across a series of images per lesion, detecting fluctuations in the model's diagnosis (see Supplementary Figure 1). We therefore computed the mean maximum confidence change (MCC), which measures the difference between the model's highest and lowest confidence scores for a series of images. Larger MCC values are worse, as the model's predictions are less consistent. As analyzing robustness requires a series of images per lesion across which to measure fluctuations, we constructed image series of either two or three images per lesion by using the five additional images that were previously set aside during the downsampling step (see Study Design).
However, this meant we also had to reduce the number of images used for MV-Real to three and two images, respectively. To keep the comparison fair, MV-Artificial was adjusted accordingly.
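The MCC definition above translates into a short calculation. The following Python sketch assumes that the "confidence score" is the predicted melanoma probability for each element of a lesion's image series; the function names and example values are hypothetical.

```python
from statistics import fmean


def max_confidence_change(series):
    """Difference between the highest and lowest confidence in one series."""
    if len(series) < 2:
        raise ValueError("a series needs at least two predictions")
    return max(series) - min(series)


def mean_max_confidence_change(series_per_lesion):
    """Mean MCC over all lesions; larger values mean less consistent output."""
    return fmean([max_confidence_change(s) for s in series_per_lesion])


# Hypothetical confidence scores for two lesions, three predictions each.
example_series = [
    [0.60, 0.70, 0.90],  # fluctuating predictions, change of 0.30
    [0.20, 0.25, 0.20],  # stable predictions, change of 0.05
]
example_mcc = mean_max_confidence_change(example_series)
```

For the two example lesions the per-lesion changes are 0.30 and 0.05, so the mean MCC is 0.175.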
To reduce the impact of stochastic events, mean values for each metric were calculated using 1000 bootstrap iterations on our test sets. The corresponding 95% confidence intervals (CIs) were determined using the non-parametric percentile method. Statistical testing was conducted for all three hypotheses to identify significant differences between results with our proposed technique (i.e., MV-Real) and those with either the baseline (i.e., Single-View) or the traditional multiview technique (i.e., MV-Artificial). For each endpoint, pairwise Wilcoxon signed-rank tests were used to compare the respective metrics. The significance level of p<0.05 was adjusted to p<0.025 according to the Bonferroni correction (m=2), which controls the family-wise error rate. In addition, we repeated the downsampling step (see Study Design) and all subsequent analysis steps five times in order to ensure that our findings were not based on an unfavorable sample. Statistical analysis was performed using SciPy 1.7.1.
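The bootstrap procedure with percentile confidence intervals can be sketched as follows. This is a dependency-free Python illustration, not the study's SciPy-based code: the pairwise AUROC implementation, the fixed random seed and the choice to redraw resamples that contain only one class are assumptions of this sketch.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_ci(scores, labels, metric, n_iter=1000, alpha=0.05, seed=0):
    """Bootstrap mean and percentile CI of a metric over resampled lesions."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_iter:
        indices = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        resampled_labels = [labels[i] for i in indices]
        if len(set(resampled_labels)) < 2:
            continue  # redraw if one class is missing (assumption of this sketch)
        values.append(metric([scores[i] for i in indices], resampled_labels))
    values.sort()
    lower = values[round(alpha / 2 * n_iter)]
    upper = values[round((1 - alpha / 2) * n_iter) - 1]
    return sum(values) / n_iter, lower, upper
```

With `alpha=0.05` and 1000 iterations, the interval bounds are taken near the 2.5th and 97.5th percentiles of the sorted bootstrap values. Under the Bonferroni correction with m=2, each of the two pairwise comparisons would then be tested against 0.05/2 = 0.025.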

Patient characteristics
A total of 617 patients with 656 skin lesions clinically suspected to be melanoma were enrolled in this study. The patient characteristics of the study sample are summarized in Table 1. Of the participants, 44.6% were female. The patients' ages at diagnosis ranged from 18 to 95 years, with a median age of 61 years. The distribution of Fitzpatrick skin types was as follows: type I (8.8%), type II (60.0%), type III (26.4%), type IV (1.3%) and unknown type (3.5%). For 39 patients (6.3%), two different melanoma-suspicious lesions were included in the study, resulting in a total of 656 unique lesions.

MV-Real improves diagnostic accuracy and uncertainty estimation compared to Single-View and MV-Artificial
To determine the performance impact of increasing the number of images per lesion, we evaluated our model using three approaches: Single-View, MV-Artificial and MV-Real. Our findings show that MV-Real improves both the diagnostic accuracy and the uncertainty estimation when compared to the Single-View approach. The AUROC significantly increases from 0.905 (95% CI, 0.879-0.929) to 0.930 (95% CI, 0.909-0.951; p<0.001), with the ECE significantly decreasing from 0.131 (95% CI, 0.105-0.159) to 0.072 (95% CI, 0.052-0.093; p<0.001, see Figure 2). These findings were consistent across all five repeated downsamplings (see Supplementary Tables 1a-1e). Similarly, MV-Real also outperformed MV-Artificial, which had a significantly lower AUROC of 0.929 (95% CI, 0.908-0.948; p<0.001) and a significantly higher ECE of 0.086 (95% CI, 0.064-0.110; p<0.001). These findings were only somewhat consistent across the five repeated downsamplings, with diagnostic accuracy sometimes being on par or slightly better for MV-Artificial, indicating that there is no practical difference in diagnostic accuracy between the two approaches (see

MV-Real improves robustness compared to Single-View and MV-Artificial
The robustness of our model was analyzed across a series of either two or three images per lesion. For the series of three images, the robustness with MV-Real improved substantially over that with Single-View, as the MCC significantly decreased from 0.149 (95% CI, 0.125-0.171) to 0.115 (95% CI, 0.099-0.131; p<0.001). Similarly, robustness also improved across a series of two images, as the MCC significantly decreased from 0.094 (95% CI, 0.077-0.112) for Single-View to 0.066 (95% CI, 0.056-0.076; p<0.001) for MV-Real. Surprisingly, the MV-Artificial method resulted in no robustness improvement at all, having greater MCC values than the Single-View and MV-Real approaches (see Figure 3). These findings were consistent across all five repeated downsamplings (see Supplementary Tables 1a-1e).

Principal findings
Traditionally, AI-based models make predictions on a single input image, but this approach has limitations due to the brittleness of AI. 10,21,22 Supplying additional images at test-time and subsequently combining the model's predictions into one is an easy-to-implement and already established technique in machine learning. 23 This method, commonly referred to as test-time augmentation, is built on the idea that multiple different images of the same lesion may provide different perspectives on what the model perceives. However, these additional images are typically artificially generated duplicates, offering no new information.
In this study, we investigated whether the performance of an AI-based melanoma classifier could be improved by using real-world images (i.e., photographs taken in the clinic) instead of artificially generated duplicates. Our evaluation focused on three clinically relevant aspects: diagnostic accuracy, uncertainty estimation and robustness.
Our results demonstrate that supplying additional real-world images at test-time (MV-Real) enhances the classifier's performance compared to the traditional case without additional images (Single-View). This improvement was observed across diagnostic accuracy, uncertainty estimation and robustness. This was expected, as MV-Real is a modified version of MV-Artificial, which previously showed improvements in these aspects. 10,13,14 Interestingly, when comparing MV-Artificial to Single-View, MV-Artificial performed better in diagnostic accuracy and uncertainty estimation but worse in robustness. This result is surprising since previous studies have indicated robustness improvements with this technique. 10 However, it appears that in this study these improvements were only observed when the image series used to measure robustness was artificially created and in-distribution. In contrast, when the image series consisted of natural real-world images from an external test set, the improvements were less pronounced or absent.
Considering our test set also contains natural image series from external test sets, this could indicate that MV-Artificial has limited generalization capabilities regarding robustness and/or that testing on real-world data is substantially different to simulated environments.
Comparing both multiview options, using real-world images (MV-Real) significantly outperformed artificially generated images (MV-Artificial) in uncertainty estimation and robustness, while their diagnostic accuracy was comparable. This performance difference can be attributed to the fact that MV-Real provides the model with genuinely new information rather than just variations of old information. Multiple real-world images allow different angles and parts of the lesion to be captured, which is particularly useful for larger lesions or those that are partially occluded (e.g., by hair) or difficult to photograph (e.g., on the ear). Furthermore, the model's classification is not solely based on the quality of a single image. Multiple real-world images increase the chances of having high-quality images or at least having images that are somewhat complementary to each other. As the six images collected for every lesion were a mixture of polarized and non-polarized dermoscopic images, some of the improvements seen with MV-Real could be attributed to this mixture.

Feasibility in clinical practice
While incorporating a single additional image into the classification already improves the diagnostic accuracy and uncertainty estimation of the classifier, the benefits of including further images are even more pronounced (see Supplementary Figure 2). However, asking physicians to take multiple photographs for every lesion and patient is time-consuming and impractical; therefore, future work should look into optimizing this process.

Limitations
The evaluation of our classifier was limited to a binary setting, which does not reflect clinical reality. However, due to the inclusion criteria of our study (i.e., melanoma-suspicious skin lesions), the majority of lesions collected were either melanoma or nevus, leaving a large variety of other diagnostic classes with very small sample sizes. Therefore, our findings may not translate to a multiclass setting.

Conclusion
Using multiple real-world images of a lesion improves the overall performance of an AI-based melanoma classifier compared to more traditional approaches. As our proposed approach only requires additional photographs, it is easy to implement and cost-effective. We therefore recommend integrating it into future clinical workflows that make use of AI-based computer vision.
trained on all 29,562 images (i.e., inclusion of validation set). Subsequently, the model was evaluated on the external SCP2 dataset containing out-of-distribution images.

Implementation of MV-Artificial
The MV-Artificial approach requires that the original image is duplicated n times and digitally modified before all images are classified by the model. In our case, the digital modifications consisted of rotation, zoom, changes in brightness and warp. Each of these modifications was applied to an image with a probability of 75%.
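To make the sampling logic concrete, the following Python sketch draws, for each of the n duplicates, which of the four modifications are applied, each independently with a probability of 75%. It only plans the modifications and does not transform any pixels; the function names and the fixed seed are hypothetical, and the independence of the four draws is an assumption of this sketch. The strength of each modification is not modeled here.

```python
import random

MODIFICATIONS = ("rotation", "zoom", "brightness", "warp")


def sample_modifications(rng, p=0.75):
    """Select each modification independently with probability p."""
    return [name for name in MODIFICATIONS if rng.random() < p]


def plan_artificial_views(n_copies, p=0.75, seed=0):
    """Return one list of selected modifications per duplicated image."""
    rng = random.Random(seed)
    return [sample_modifications(rng, p) for _ in range(n_copies)]


# Five duplicates, matching the five additional images used for MV-Artificial.
plans = plan_artificial_views(n_copies=5)
```

Under these assumptions, a duplicate receives all four modifications with probability 0.75^4 (about 32%) and none at all with probability 0.25^4 (about 0.4%), so an unmodified copy of the original is rare.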

Figure 1. Illustration of the multiview approach. Top) For both the MV-Artificial and MV-Real methods, the model makes its final classification based on one original image accompanied by additional lesion images. For MV-Real, the additional images are actual dermoscopic photographs taken in the clinic by the physician (top row). For MV-Artificial, the additional images are artificially created from the original image by applying various image processing techniques such as rotation, zoom and brightness. Bottom) For both approaches, the classifier makes a prediction for the original and each of the additional images. All predictions are subsequently averaged into a single prediction. MV-Artificial: multiview-artificial, MV-Real: multiview-real.

Figure 2. MV-Real outperforms both Single-View and MV-Artificial with respect to diagnostic accuracy and uncertainty estimation. The AUROC (diagnostic accuracy) and ECE (uncertainty estimation) are plotted for the three investigated methods on the left and right, respectively. Each box extends from the lower to the upper quartile of the 1000 bootstrap iterations, with a line at the median. In addition, whiskers and fliers indicate the range and any outliers. AUROC: area under the receiver operating characteristic curve, ECE: expected calibration error, MV-Artificial: multiview-artificial, MV-Real: multiview-real.

Figure 3. MV-Real outperforms both Single-View and MV-Artificial with respect to robustness. Robustness was measured by the maximum change in the classifier's confidence (MCC) across a series of either three (left) or two (right) images. Each box extends from the lower to the upper quartile of the 1000 bootstrap iterations, with a line at the median. In addition, whiskers and fliers indicate the range and any outliers. AUROC: area under the receiver operating characteristic curve, ECE: expected calibration error, MV: multiview, MV-Artificial: multiview-artificial, MV-Real: multiview-real.

Table 1. Patient characteristics of the study sample.
Distributions of the age at diagnosis, lesion location and lesion diameter are reported.

Table 1a. Replication of Results for Different Test Set Samples.
The strength of each modification varied as we considered five different setups: mild, moderate, strong, severe and extreme. The mild setup was considered the default setup and is simply referred to as MV-Artificial in the main manuscript. We used fastai's built-in test-time augmentation function (TTA) with beta set to None as we wanted an unweighted average of all image predictions. The parameters for each setup are listed in the table below.