Development and validation of a deep learning model for improving detection of nonmelanoma skin cancers treated with Mohs micrographic surgery

Background: Real-time review of frozen sections underpins the quality of Mohs surgery. There is an unmet need for low-cost techniques that can improve Mohs surgery by reliably corroborating cancerous regions of interest and surgical margin proximity. Objective: To test whether deep learning models can identify nonmelanoma skin cancer regions in Mohs frozen section specimens. Methods: Deep learning models were developed on archival images of focused microscopic views (FMVs) containing regions of annotated, invasive nonmelanoma skin cancer collected between 2015 and 2018, then validated on prospectively collected images in a temporal cohort (2019-2021). Results: The tile-based classification models were derived using 1423 FMV images from 154 patients and tested on 374 images from 66 patients. The best models detected basal cell carcinomas with a median average precision of 0.966 and a median area under the receiver operating characteristic curve of 0.889 at 100× magnification (0.943 and 0.922, respectively, at 40× magnification). For invasive squamous cell carcinomas, a high median average precision of 0.904 was achieved at 100× magnification. Limitations: Single-institution study with limited cases of squamous cell carcinoma and rare nonmelanoma skin cancers. Conclusion: Deep learning appears highly accurate for detecting skin cancers in Mohs frozen sections, supporting its potential for enhancing surgical margin control and increasing operational efficiency.


INTRODUCTION
Mohs micrographic surgery (MMS) is the recommended procedure for localized high-risk basal cell carcinomas (BCCs) and squamous cell carcinomas (SCCs).1,2 Although excellent concordance exists between MMS surgeons and dermatopathologists,3,4 interobserver discordance may still occur.5 Variability is seen even among experienced surgeons, particularly in complex tumors with challenging pathology.6 As Mohs surgeons often operate in an individual or small group setting,7 variable interpretation of frozen sections may also arise from operator fatigue and/or inconsistent tissue preparation techniques.
Deep learning has demonstrated high accuracy in a range of image-based diagnostic tasks in dermatology.12-16 It also has the potential to assist Mohs surgeons in optimizing intraoperative margin control through reduction of interobserver discordance,17 but few studies have examined the application of deep learning to Mohs frozen section images.18 Here we report on the development of deep learning models that detect nonmelanoma skin cancers (NMSCs) on frozen sections obtained during MMS. These models were trained on a dataset consisting of archival microscope images from routine diagnostic setups and procedures. The primary objective was to develop and validate these models as a first step toward the longer-term objective of enhancing the Mohs surgical workflow.

METHODS

Study design
This study of diagnostic model development was approved by the St. Vincent's Hospital Human Research Ethics Committee (2021/ETH00647). This research is reported in accord with the CLEAR Derm Consensus guidelines for artificial intelligence (AI) algorithm reports in dermatology.19 All pathology slides were prepared using a progressive Mayer's hematoxylin and eosin staining protocol21 on an Epredia Linistat Linear Stainer (Fisher Scientific, Pittsburgh, PA). Images of focused microscope view (FMV) were acquired using a Leica ICC50 W 5.0-megapixel camera attached to a DM1000 microscope and captured via the Leica software (LAS-EZ v3.0). Full-color images (1600 × 1200 pixels resolution) were acquired at 40× and 100× magnification levels. The diagnosis was made by the lead author at the time of surgery. Air bubbles and freezing artifacts were left unprocessed during model training to enhance procedural generalizability.
Image processing and training of deep learning models. The pipeline was designed to localize regions of malignant lesions on digitized FMV images by producing an overlaid saliency map indicating the probability of containing tumor. Tumor detection was performed through a tile-based classification method that utilized a sliding prediction window technique. By incorporating overlapping predictions, the sliding window technique aims to produce more accurate contours for the predicted regions and minimize potential omissions. All models were based on a convolutional neural network architecture, utilizing the feature extraction layers from EfficientNet B0.

Primary and exploratory analyses. Model set 1 - In this primary analysis, fully supervised learning models were trained on images with tumor locations manually segmented by an experienced Mohs surgeon (E.T.). All models were trained on split tiles of 224 × 224 pixels. Each square tile was labeled as "positive" if tumors occupied ≥10% of its area, or was otherwise categorized as a control. Extra controls were included from images of other diagnoses (eg, SCC to train BCC models and vice versa, actinic keratoses, and normal skin).
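As an illustrative sketch only (not the authors' implementation), the sliding-window tiling described above can be assembled as follows; the 224 × 224 tile size matches the paper, while the 50%-overlap stride and the generic `predict_tile` classifier interface are assumptions:

```python
import numpy as np

TILE = 224    # tile size used by the classifiers
STRIDE = 112  # assumed 50% overlap between prediction windows

def saliency_map(image, predict_tile, tile=TILE, stride=STRIDE):
    """Slide a window over the image and average overlapping tile
    probabilities into a per-pixel tumor saliency map."""
    h, w = image.shape[:2]
    probs = np.zeros((h, w), dtype=float)
    counts = np.zeros((h, w), dtype=float)
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            p = predict_tile(image[y:y + tile, x:x + tile])  # scalar prob
            probs[y:y + tile, x:x + tile] += p
            counts[y:y + tile, x:x + tile] += 1
    return probs / np.maximum(counts, 1)
```

Averaging overlapping windows is what smooths the predicted contours and reduces omissions at tile boundaries, as the paragraph above describes.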
To ensure that the ground truth was reproducible, concordance was estimated by comparing the segmentation masks prepared by 2 experienced Mohs surgeons (E.T.; D. Lim) and an anatomical pathologist (D. Lamont) on 25 randomly selected images. Interrater agreement was measured by Fleiss' kappa for identifying ≥5% tumor presence in each 50 × 50 pixel tile generated from the masks.
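For illustration, Fleiss' kappa on such a tile-level rating table can be computed directly from its standard definition (a generic sketch, not the authors' code; the input layout is an assumption):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table, where
    counts[i, j] = number of raters assigning tile i to category j."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                     # raters per tile (constant)
    p_j = counts.sum(axis=0) / counts.sum()       # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-tile agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()  # observed vs chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

With 3 raters per tile, perfect agreement across mixed categories yields kappa = 1, and chance-level disagreement yields a negative kappa.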
Model set 2 - Considering the labor-intensive manual segmentation process, we further evaluated in an exploratory analysis whether models trained through weakly supervised learning (WSL) can achieve similar performance in region-of-interest detection in the subset of BCC 100× images; WSL has been shown to achieve clinical-grade precision without requiring human annotations.24 In this 2-stage approach, classifiers were first trained using all tiles labeled with the top-level diagnosis, without expert annotations. The second stage of training selectively included only the tiles with high inferred probabilities from the first step. The control image tiles remained unchanged during this process.
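The 2-stage procedure above could be sketched as follows, assuming a generic classifier object with `fit`/`predict_proba` methods (hypothetical interface; the feature representation and threshold handling are illustrative, not the authors' code):

```python
import numpy as np

def weakly_supervised_relabel(pos_tiles, neg_tiles, clf, t=0.5):
    """Two-stage WSL sketch: stage 1 trains on image-level labels only;
    stage 2 keeps only positive tiles the stage 1 model scores >= t
    (controls are kept unchanged), then retrains."""
    X = np.vstack([pos_tiles, neg_tiles])
    y = np.r_[np.ones(len(pos_tiles)), np.zeros(len(neg_tiles))]
    clf.fit(X, y)                                    # stage 1: top-level labels
    probs = clf.predict_proba(pos_tiles)[:, 1]       # inferred tile probabilities
    refined = pos_tiles[probs >= t]                  # filtering threshold t
    X2 = np.vstack([refined, neg_tiles])
    y2 = np.r_[np.ones(len(refined)), np.zeros(len(neg_tiles))]
    clf.fit(X2, y2)                                  # stage 2: refined labels
    return clf, refined
```

The filtering step is what removes non-tumor tiles that inherited a "positive" image-level label in stage 1.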
Performance metrics and validation studies. The primary performance metric was the pixel-level area under the precision-recall curve (AUPRC), estimating the average precision across all prediction thresholds for each tumor type at the 40× and 100× magnifications. Secondary metrics included the area under the receiver-operating characteristic curve (AUROC), the highest Dice coefficient, and folds enrichment of precision (AUPRC divided by the proportion of positive pixels). To accurately assess these metrics on the probability maps, the ground truth masks were reshaped to fill the same boundaries prescribed by the deep learning algorithm. Additional explanation is provided in Supplementary Fig 4, available via Mendeley at https://data.mendeley.com/datasets/fh7sk5ksmk/2.
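These pixel-level metrics can be reproduced with standard library calls on the flattened truth mask and probability map; scikit-learn here is an assumption for illustration (the paper used customized scripts):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def pixel_metrics(truth_mask, prob_map):
    """AUPRC (average precision), AUROC, and folds enrichment of
    precision (AUPRC divided by positive-pixel prevalence)."""
    y = truth_mask.ravel().astype(int)
    p = prob_map.ravel()
    auprc = average_precision_score(y, p)
    auroc = roc_auc_score(y, p)
    fep = auprc / y.mean()  # prevalence of positive pixels
    return auprc, auroc, fep
```

Folds enrichment of precision contextualizes AUPRC against the class imbalance: an FEP of 2, for instance, means the model's average precision doubles what random pixel selection would achieve.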
The internal model validity was assessed on model set 1 using leave-one-out cross-validation (LOOCV) on the subset of images where the ground truth had been annotated. To avoid leakage of information between images obtained from the same patient, each test fold was limited to a single patient (patient-level LOOCV). The external validity was estimated on the temporal validation cohort and stratified by magnification level and model architecture. The CIs were estimated using ordinary bootstrap resampling over 10,000 iterations. The paired Wilcoxon test was used to compare metrics between models on the same images; the type I error rate was set at 0.01, adjusted for multiple testing.
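A minimal percentile-bootstrap sketch under the paper's 10,000-iteration setting follows; the per-image metric values, the percentile method, and the seed are illustrative assumptions:

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Ordinary (percentile) bootstrap CI for the mean of per-image
    metric values, resampling with replacement n_boot times."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)       # one mean per bootstrap replicate
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

For patient-level folds, the same resampling would be drawn over patients rather than images, mirroring the grouping used in the LOOCV.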
Software. All models were built on the TensorFlow framework (v2.7.0). The preprocessing and evaluation tools were implemented using customized computer scripts. Descriptive statistics and the bootstrapping procedures were performed using the R Statistical Environment (v4.0).

RESULTS

Cohort and tumor characteristics
Two hundred fifty-eight patients over 293 visits were screened. Digital FMV images were available for 220 patients, and 1836 images were retrieved. The median age of the patient population was 61 years (interquartile range, 58-76). One hundred eighty-seven patients (85%) were diagnosed with BCC, with nodular (n = 88), infiltrative (n = 49), and superficial (n = 20) being the most common subtypes. Thirty-three patients (15%) with SCC were identified, including 1 patient with perineural invasion (Table I). Three patients (1%) had metachronous BCC and SCC.

Abbreviations used:

AI: artificial intelligence
AUPRC: area under the precision-recall curve
AUROC: area under the receiver-operating characteristic curve
BCC: basal cell carcinoma
FMV: focused microscopic view
LOOCV: leave-one-out cross-validation
MMS: Mohs micrographic surgery
NMSC: nonmelanoma skin cancer
SCC: squamous cell carcinoma
WSL: weakly supervised learning

Most tumors were located on the nasal region (n = 93, 42%), with the nasal ala and tip being the most frequent sites involved (n = 59, 27%), followed by the dorsum and sidewall (n = 27, 12%). A median of 1 surgical procedure per patient was performed (range, 1-5), with a median of 2 MMS stages (range, 1-8). The median size of lesions before MMS was 1.4 cm (range, 0.2-4.5). The full cohort and tumor characteristics are shown in Table I.
Based on the date of project initiation, 154 patients (70%) were assigned to the development cohort and 66 patients (30%) to the temporal validation cohort (Fig 1). The validation cohort was similar to the training cohort with respect to demographics, pathology, and tumor subtypes, except for a lower number of MMS stages per operation (P < .001, χ² test).

Concordance of tumor localization in frozen section images
Estimated on 25 FMV images (23 BCC and 2 SCC), the interannotator agreement between segmentation masks was high (mean Fleiss' kappa 0.877; 95% CI, 0.842-0.911). Each segmentation task required 20 to 30 minutes to complete. Further discussion on the concordance among specialists is presented in Supplementary Fig 5, available via Mendeley at https://data.mendeley.com/datasets/fh7sk5ksmk/2.

Exploratory analysis on the temporal validation set for model set 2
The WSL models showed high precision in region-of-interest detection, with an AUPRC >0.9, albeit with lower AUPRC and AUROC compared with models in model set 1 (Fig 3). For example, WSL models at the filtering threshold t = 0.5 showed lower

DISCUSSION
The central finding of this study is that deep learning locates NMSCs on digitized images of Mohs frozen sections with high accuracy and efficiency, supporting the use of AI-assisted margin control as a "second read" to reduce human error.5 A recent study has shown comparable performance on whole slide images of BCC acquired using a digital slide scanner.18 Despite their commercial availability, the cost of scanners remains prohibitive for many clinics and is not justifiable for low-volume centers. A software-assisted review of FMV images could be implemented in any Mohs laboratory at a fraction of that cost, reusing only inexpensive microscope systems. Such an AI approach would be expected to have maximum utility when integrated with a built-in slide scanner, in effect eliminating the need to examine frozen section slides altogether, though further evidence will be needed to corroborate the value and safety of this approach.25,26

The present study provides several valuable insights into AI data quality. First, reviewing discrepancies among specialists is necessary to understand how over- or under-diagnosis occurs during single-pass pathology review. For example, distinguishing between a hair follicle and tumor in BCC, or between inflammatory cells and tumor in SCC, can be challenging. In the real-world setting, the Mohs surgeon and pathologist may review multiple wafers or different pathologic levels. Transparent annotations are imperative for understanding the constraints of applicability of AI-based models, given that variations in diagnostic labels are likely to hinder the reproducibility of results.27

Second, although validations at external sites are planned, the work presented here supports the feasibility of site-specific diagnostic models based on reutilization of archival images from medical records. Third, our research highlights the importance of involving surgeons and pathologists in the creation of accurate ground truth annotations to achieve good classification results. This manual semantic segmentation is a tedious task that often acts as the rate-limiting step in model development because of the high level of precision required and the need for expert knowledge in making a diagnosis. Further investigation into alternative methods for automation, such as WSL, may overcome this limitation.
Our study has several limitations. Microscopic images were taken in a single Mohs unit; variations in surgical techniques, sample preparation, and image acquisition methods could thus impact model accuracy at other sites. Furthermore, only microscopic views at the standard resolution used by Mohs surgeons were analyzed; visualizing the complete section at high resolution through whole slide analysis could improve margin control. Moreover, in contrast to the good performance of the BCC models, insufficient data hampered SCC model performance at 40× magnification in our study, highlighting the importance of adequate training data. Image sharing across Mohs units may facilitate collaborative model development for tumor types seen less frequently.
Immunohistochemical staining was not performed in this Mohs unit. Future studies should include this assessment, particularly given the increasing interest in managing melanoma with MMS. Lastly, our tile-based classification paired with custom evaluation metrics was designed to maximally leverage the 3-dimensional spatial context of tumors to identify hidden or unseen tumor areas. This spatial understanding is vital for Mohs surgery, as unexpected inflammation or fibrosis may be a harbinger of tumor necessitating another stage.28,29 Future studies will benchmark our models against segmentation approaches such as U-Net30 and assess their effectiveness in prospective trials of overall margin control beyond accurate localization.
In conclusion, this study supports the accuracy of deep learning models for detecting NMSCs on Mohs frozen sections.Prospective studies in accordance with CONSORT-AI and SPIRIT-AI guidelines will determine the acceptable bar for adoption as a diagnostic support tool during real-time tumor removal.
Examples of the saliency maps are shown in Fig 3.

Fig 2. Performance of deep learning models stratified by model sets, architecture, training data, and training method. AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; BCC, basal cell carcinoma; Dice, Dice coefficient; FEP, folds enrichment of precision; SCC, squamous cell carcinoma; WSL, weakly supervised learning.
This research is also reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement20 (Supplementary Tables I and II, available via Mendeley at https://data.mendeley.com/datasets/fh7sk5ksmk/2).

Table I. Characteristics of the development and validation cohorts (N = 220).

Table II. Performance of the best performing deep learning models for detecting regions of interest in the temporal validation cohort. AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; FEP, folds enrichment of precision.