Accepted for/Published in: JMIR Dermatology
Date Submitted: Nov 23, 2021
Date Accepted: Aug 3, 2022
Date Submitted to PubMed: Aug 26, 2023
Generalisability of Deep Learning Models Trained on Standardised and Non-Standardised Images: Retrospective Comparative Study
ABSTRACT
Background:
Convolutional neural networks (CNNs) are a type of artificial intelligence (AI) that shows promise as a diagnostic aid for skin cancer. However, most are trained on retrospective image datasets with varying degrees of image capture standardisation.
Objective:
The primary objective of our study was to train CNN models sharing the same architecture on image sets acquired either with a single image capture device and technique (standardised) or with varied devices and capture techniques (non-standardised), and to test how their performance varies when classifying skin cancer images from different populations.
Methods:
Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 images from the International Skin Imaging Collaboration (ISIC) taken with different image capture devices (non-standardised). CNN-S was trained on 235,268 MoleMap images taken with the same capture device (standardised), and CNN-S2 was trained on a subset of 25,331 standardised MoleMap images (matched to CNN-NS for number and classes of training images). These three models were then tested on three external test sets: 569 Danish images, the publicly available ISIC 2020 dataset of 33,126 images, and a UQ dataset of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Tele-dermatology assessments available for the Danish dataset were used to compare model performance with that of tele-dermatologists.
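For readers less familiar with the primary outcome measures, the sketch below shows how sensitivity, specificity, and AUROC are conventionally computed from binary labels and classifier scores. This is an illustrative, self-contained example only, not the authors' evaluation code; the threshold and toy data are assumptions.

```python
# Illustrative sketch (not the study's pipeline): computing the abstract's
# primary outcome measures from binary labels (1 = malignant) and model scores.

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP) at a fixed threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): three positives and three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
sens, spec = sensitivity_specificity(labels, scores)
area = auroc(labels, scores)  # 8 of 9 positive-negative pairs correctly ranked
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`) would be used, but the rank-based definition above is what it computes.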
Results:
When tested on the 569 Danish images, the standardised models outperformed the non-standardised model: CNN-S achieved an AUROC of 0.861 (CI 0.830-0.889; P=.001) and CNN-S2 an AUROC of 0.831 (CI 0.798-0.861; P=.009), versus 0.759 (CI 0.722-0.794) for CNN-NS (Figure 3). When tested on the two additional datasets, CNN-S and CNN-S2 again outperformed CNN-NS (ISIC 2020: P<.001, P<.001; UQ: P=.076, P=.347). When the CNNs were matched to the mean sensitivity and specificity of the tele-dermatologists on the Danish dataset, the tele-dermatologists surpassed the models' resultant sensitivities and specificities (Table 5); however, the differences from CNN-S were not statistically significant (P=.10, P=.053). Performance of all CNN models, as well as of the tele-dermatologists, was influenced by image quality.
Conclusions:
CNNs trained on standardised images had improved performance and therefore greater generalisability in skin cancer classification when applied to unseen datasets. This is an important consideration for future algorithm development, regulation and approval.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.