
Accepted for/Published in: JMIR Dermatology

Date Submitted: Nov 23, 2021
Date Accepted: Aug 3, 2022
Date Submitted to PubMed: Aug 26, 2023

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Generalizability of Deep Learning Models Trained on Standardized and Nonstandardized Images and Their Performance Against Teledermatologists: Retrospective Comparative Study

Oloruntoba I, Vestergaard T, Nguyen TD, Yu Z, Sashindranath M, Betz-Stablein B, Soyer HP, Ge Z, Mar V

JMIR Dermatol 2022;5(3):e35150

DOI: 10.2196/35150

PMID: 27739475

PMCID: 5064390

Generalisability of Deep Learning Models Trained on Standardised and Non-Standardised Images: Retrospective Comparative Study

  • Ibukun Oloruntoba; 
  • Tine Vestergaard; 
  • Toan D Nguyen; 
  • Zhen Yu; 
  • Maithili Sashindranath; 
  • Brigid Betz-Stablein; 
  • H. Peter Soyer; 
  • Zongyuan Ge; 
  • Victoria Mar

ABSTRACT

Background:

Convolutional neural networks (CNNs) are a type of artificial intelligence (AI) that shows promise as a diagnostic aid for skin cancer. However, most are trained on retrospective image datasets with varying degrees of image capture standardisation.

Objective:

The primary objective of our study was to train CNN models with the same architecture on image sets acquired either with a single image capture device and technique (standardised) or with varied devices and capture techniques (non-standardised), and to test for variability in performance when classifying skin cancer images in different populations.

Methods:

Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 images from the International Skin Imaging Collaboration taken with different image capture devices (non-standardised). CNN-S was trained on 235,268 MoleMap images taken with the same capture device (standardised), and CNN-S2 was trained on a subset of 25,331 standardised MoleMap images (matched to CNN-NS for the number and classes of training images). These three models were then tested on three external test sets: 569 Danish images; the publicly available ISIC 2020 dataset of 33,126 images; and a UQ dataset of 422 images. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Teledermatology assessments available for the Danish dataset were used to compare model performance with that of teledermatologists.
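The primary outcome measures named above (sensitivity, specificity, and AUROC) can be sketched in plain Python. This is an illustrative implementation only, not code from the study; the rank-sum (Mann-Whitney) formulation shown here is one standard way to compute AUROC from labels and predicted scores.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # find the block of tied scores starting at position i
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j + 1) / 2.0  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

In practice a library such as scikit-learn would compute these metrics, but the definitions above are what the reported numbers mean: AUROC summarises ranking quality across all thresholds, while sensitivity and specificity describe one chosen operating point.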

Results:

When tested on the 569 Danish images, CNN-S achieved an AUROC of 0.861 (CI 0.830-0.889; P=.001) and CNN-S2 achieved an AUROC of 0.831 (CI 0.798-0.861; P=.009); both standardised models outperformed CNN-NS (the non-standardised model), which achieved an AUROC of 0.759 (CI 0.722-0.794; P=.001 and P=.009, respectively) (Figure 3). When tested on the two additional datasets (ISIC 2020 and UQ), CNN-S and CNN-S2 again outperformed CNN-NS (P=0.00, P=0.00 and P=.076, P=.347). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists on the Danish dataset, the models' resulting sensitivities and specificities were surpassed by the teledermatologists (Table 5). However, when compared with CNN-S, the differences were not statistically significant (P=.10 and P=.053). Performance across all CNN models, as well as the teledermatologists, was influenced by image quality.
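Matching a CNN to the clinicians' mean sensitivity, as described above, amounts to choosing an operating point on the ROC curve: sweep the score threshold until the model's sensitivity is as close as possible to the target, then read off the specificity at that threshold. The helper below is a hypothetical sketch of that threshold search, not the study's code.

```python
def threshold_for_sensitivity(labels, scores, target_sens):
    """Pick the score threshold whose sensitivity is closest to a target value.

    Illustrative only: a real analysis would also handle ties and report the
    specificity achieved at the chosen threshold.
    """
    pos_scores = [s for y, s in zip(labels, scores) if y == 1]
    best_t, best_gap = None, float("inf")
    for t in sorted(set(scores), reverse=True):
        sens = sum(1 for s in pos_scores if s >= t) / len(pos_scores)
        gap = abs(sens - target_sens)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```

Because sensitivity and specificity trade off along the ROC curve, fixing one (to match the teledermatologists) determines the other, which is what makes the paired comparison in Table 5 possible.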

Conclusions:

CNNs trained on standardised images had improved performance and therefore greater generalisability in skin cancer classification when applied to unseen datasets. This is an important consideration for future algorithm development, regulation and approval.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.
