Model Learning Analysis of 3D Optoacoustic Mesoscopic Images for the Classification of Atopic Dermatitis

Abstract: Atopic dermatitis (AD) is a skin inflammatory disease affecting 10% of the population worldwide. Raster-scanning optoacoustic mesoscopy (RSOM) has recently shown promise in dermatological imaging. We conducted a comprehensive analysis using three machine-learning models, Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Network (CNN), for classifying healthy versus AD conditions and sub-classifying different AD severities using RSOM images and clinical information. The CNN model differentiated healthy subjects from AD patients with 97% accuracy. With limited data, RF achieved 65% accuracy in sub-classifying AD patients into mild versus moderate-severe cases. Identification of disease severities is vital in managing AD treatment. © 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

systems are available to assess the severity of AD, such as the Eczema Area Severity Index (EASI), Scoring Atopic Dermatitis (SCORAD), modified EASI (mEASI) and others [8][9][10]. However, these scoring systems are semi-quantitative at best, as they are designed around metrics such as itchiness, redness and scale of the affected skin region, while others include the patients' quality of life. In addition, these scoring systems are based on visual inspection, and it has been reported that visual skin assessments can only differentiate severities of AD in 25% of cases when self-assessed by patients [11]. Furthermore, clinicians need experience and training to make visual assessments, which subjects these scoring systems to inter-rater variability [12]. It is desirable to have a non-invasive, objective scoring tool that reflects the true AD severity throughout therapeutic interventions and AD clinical trials, especially for distinguishing mild from non-mild severities.
Raster-Scanning Optoacoustic Mesoscopy (RSOM), first introduced in 2013, is an emerging hybrid optical and ultrasound imaging technique that offers non-invasive, deep-penetration imaging and provides high-resolution images [13]. RSOM provides structural imaging of the skin up to 1-2 mm beneath the surface, with resolutions of ~7 μm axially and ~30 μm laterally [14]. With these resolutions, the vascular remodeling of the skin in various clinical severities of AD can be detected by RSOM imaging [12,15,16]. Differential diagnosis is thus critical to achieve accurate AD diagnosis [17]. Considerable efforts have been made to explore the efficacy of RSOM in assessing skin inflammatory diseases. For example, skin-specific metrics derived from RSOM, such as total blood volume (TBV) and epidermis thickness (ET), have shown a substantial difference between control and skin inflammatory conditions [18,19]. In another study, Li et al. investigated the feasibility of using RSOM-derived skin-specific metrics in populations with different skin phenotypes [20].
Machine learning models have shown significant success in diagnosing skin diseases from dermatological images of superficial skin conditions [21][22][23][24][25][26][27][28]. However, these models have not been applied to RSOM images, except in the work by Yew et al. [18]. In that work, the authors proposed an objective AD severity evaluation metric, the Eczema Vascular Severity Index (EVSI), using a Support Vector Machine (SVM) [18]. Handcrafted skin-specific features derived from RSOM images, such as TBV, low-high frequency ratio (LHFR) and ET, were used to train the model [18]. However, Convolutional Neural Networks (CNNs) were not utilized to automatically extract useful features from 3D RSOM images. In this study, we explore the use of CNNs for automatic extraction of useful features from 3D RSOM images and combine these features with the handcrafted features proposed by Yew et al. [18].
We conducted a comprehensive analysis using three machine learning (ML) methods, SVM, Random Forest (RF) and CNNs, in classifying healthy and various AD conditions. We performed two analyses: (i) healthy vs. AD and (ii) mild vs. moderate-severe AD conditions. The motivation for the second analysis is that distinguishing mild from more serious AD conditions allows patient-specific clinical care to be provided for better treatment outcomes.
To the best of the authors' knowledge, this study is the first effort to employ raw 3D RSOM images for the classification of AD conditions using a deep learning model. The objective of the study is to evaluate the performance of SVM, RF and CNNs in classifying healthy vs. AD conditions and mild vs. moderate-severe AD conditions. We designed an optimal neural network architecture that receives 3D RSOM images and other handcrafted features and successfully combines them in the network.

Overview
In this study, we performed a thorough analysis by applying three different ML models to 3D RSOM images and compared the performance of each model using different combinations of inputs. We utilized raw 3D RSOM images and four handcrafted features as inputs to train the ML models. Three handcrafted features, proposed by Yew et al. [18], were derived from RSOM images, namely TBV, ET and LHFR. The fourth feature is transepidermal water loss (TEWL), which reflects skin barrier dysfunction and has been shown to be affected in AD [29].
The workflow of analysis is as follows: First, we evaluated the performance of traditional ML models such as SVM and RF using different combinations of the following features: TBV, ET, LHFR and TEWL. Secondly, we adopted CNN and used raw 3D RSOM images as inputs to train the network. Thirdly, we employed both 3D RSOM images and handcrafted features information to train the CNN model. The performance of models for every combination of inputs was compared and reported.
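The first step of the workflow, evaluating models on different combinations of the four handcrafted features, can be sketched as a simple enumeration. This is a minimal illustration that assumes every non-empty subset of features is of interest; the source does not state which combinations were actually evaluated, and the function name `feature_subsets` is hypothetical.

```python
from itertools import combinations

# The four handcrafted features named in the text.
FEATURES = ("TBV", "ET", "LHFR", "TEWL")


def feature_subsets(features=FEATURES, min_size=1):
    """Yield every subset of `features` with at least `min_size` members.

    Assumption: all non-empty subsets are candidates; the study may have
    evaluated only some of them.
    """
    for size in range(min_size, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


if __name__ == "__main__":
    # Each printed line is one candidate input set for RF or SVM.
    for subset in feature_subsets():
        print(" + ".join(subset))
```

With four features this yields 15 candidate input sets, each of which would be passed to the same cross-validation routine so that accuracies are comparable.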

Subjects
This study was approved by the Domain Specific Review Board (DSRB) of the National Healthcare Group, Singapore (Ref No. 2017/00932). Patients were imaged in compliance with our institutional approvals and informed consent was obtained. Study participants were recruited from AD patients visiting the National Skin Centre, Singapore. The diagnosis of AD was made based on the Hanifin and Rajka diagnostic criteria [30]. This study also included healthy controls, defined as not having AD, any form of inflammatory skin disease, or any atopic co-morbidity such as asthma, allergic rhinitis or allergic conjunctivitis. In total, 76 participants were recruited for this study: 53 AD patients and 23 healthy controls. All 53 AD participants had their disease severity assessed by an experienced dermatologist using SCORAD. There were 19, 26 and 8 patients suffering from mild, moderate and severe AD, respectively. The SCORAD criteria for AD severity are as follows: below 25 is defined as mild, between 25 and 50 as moderate, and greater than 50 as severe [8].
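The SCORAD thresholds above, together with the mild vs. moderate-severe grouping used in the second analysis, can be written as a small lookup. This is an illustrative sketch only: the function names are hypothetical, and assigning the boundary scores 25 and 50 to the moderate class is an assumption based on the wording "between 25 and 50".

```python
def scorad_severity(score: float) -> str:
    """Map a SCORAD score to a severity label.

    Thresholds follow the text: below 25 is mild, 25 to 50 is moderate,
    above 50 is severe. Treating exactly 25 and exactly 50 as moderate
    is an assumption.
    """
    if score < 0:
        raise ValueError("SCORAD score cannot be negative")
    if score < 25:
        return "mild"
    if score <= 50:
        return "moderate"
    return "severe"


def binary_severity(score: float) -> str:
    """Collapse the three labels into mild vs. moderate-severe."""
    return "mild" if scorad_severity(score) == "mild" else "moderate-severe"
```

Applied to the cohort described above, this grouping gives 19 mild and 34 (26 + 8) moderate-severe patients.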

Image Acquisition
The 3D RSOM images were collected using the RSOM Explorer C50 system (iThera Medical GmbH, Germany). The RSOM system uses one diode-pumped solid-state laser (DPSS, Nd:YAG, 532 nm) that provides pulses shorter than 1 ns with a per-pulse energy of up to 125 µJ at a repetition rate of 270 Hz. The flexible articulated arm of the RSOM system allows raster scanning of a 5 mm × 3 mm area on the skin in about 2.5 minutes. An in-tandem illumination-detector element is located at the focal point of the transducer, which raster-scans the two-dimensional (2D) region of interest (ROI) on the skin in a regularly spaced acquisition grid and collects ultrasound signals from 11 to 99 MHz. This non-invasive RSOM system can provide good-quality 3D images with high resolution and deep penetration from the skin surface. The 3D RSOM images were visualized with two frequency sub-bands, high-frequency (HF) (33-99 MHz) in green and low-frequency (LF) (11-33 MHz) in red, representing the small and big vascular structures, respectively (Fig. 1).

Features
In addition to one feature m (TBV), 2) Ep and 4) Transe Technologies using a simila = ΣN × dV, maximum val as the distanc epidermis. Th the z-axis and optoacoustic maximum (FW computed as t image in the d

Machine
Two tradition vs. AD condi employed for up to a depth library scikitwere used as these features "Balanced" m This mode ad frequency in t during trainin All the an two groups (t
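The fragment above mentions a "Balanced" mode that adjusts weights according to class frequency during training. Assuming this refers to the `class_weight="balanced"` option in scikit-learn (which the truncated text does not confirm), each class weight is computed as n_samples / (n_classes × class count). The sketch below reproduces that formula with the standard library only; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weights inversely proportional to class frequency.

    Uses the formula n_samples / (n_classes * count), which is the
    heuristic scikit-learn documents for class_weight="balanced".
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Cohort sizes taken from the Subjects section: 53 AD and 23 healthy.
labels = ["AD"] * 53 + ["healthy"] * 23
weights = balanced_class_weights(labels)
```

With these cohort sizes the minority (healthy) class receives the larger weight, so that both classes contribute equally to the weighted total.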

Model Training and Evaluation
The network was trained for 200 epochs using the Adam optimizer with a learning rate of 1×10⁻⁵, a learning rate decay of 0.05 and a learning step decay of two. The batch size was set to four.
All computations were carried out on a Linux workstation with an Intel(R) Core(TM) i7-4790 CPU at 3.6 GHz, 16 GB of RAM and a GeForce GTX TITAN X GPU. On this workstation, one epoch took approximately 9 min, for a total of about 30 hours to train the model. The TensorFlow 1.12 [35] implementation was used in our study. Three CNN models were trained using different combinations of inputs. The first CNN model used only LF and HF 3D RSOM images (23 healthy and 53 AD cases) as inputs. The second CNN model used 3D RSOM images and three features (TBV, LHFR, and ET) added at the bottleneck layer (Fig. 2). The third CNN model used 3D RSOM images and four features (ET, TBV, LHFR, and TEWL). For the second and third analyses, 6 of the 23 healthy cases did not have complete feature information, leaving 53 AD and 17 healthy cases. Validation data was used to evaluate the models' prediction accuracy. Since the cropping pipeline produces more than one sample per patient, majority voting was performed to determine the final prediction for each patient. If a patient had no final prediction because the prediction outcomes were evenly split, one additional sample was randomly cropped from the 3D RSOM images and evaluated to obtain the final prediction. Table 2 tabulates the average and standard deviation of the validation accuracy of RF, SVM and CNN over six-fold cross-validation. Figure 4 shows the confusion matrices for the three models evaluated on the validation datasets. The confusion matrices shown are for the models that yielded the highest validation accuracy reported in Table 2.
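The per-patient majority vote with a tie-break can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `majority_vote` and its parameters are hypothetical names, `predict_extra_crop` stands in for cropping one more random sample and running the model on it, and the `max_extra` cap is an added safeguard that the source does not describe.

```python
from collections import Counter


def majority_vote(crop_predictions, predict_extra_crop, max_extra=5):
    """Return the patient-level label from per-crop predictions.

    crop_predictions: labels predicted for each cropped sample.
    predict_extra_crop: zero-argument callable returning the label of one
        additional randomly cropped sample (hypothetical stand-in).
    max_extra: cap on extra crops, an assumption added for safety.
    """
    votes = list(crop_predictions)
    if not votes:
        raise ValueError("at least one prediction is required")
    extra_used = 0
    while True:
        top = Counter(votes).most_common(2)
        # A winner exists if only one label was seen or the lead is strict.
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]
        if extra_used == max_extra:
            raise RuntimeError("tie not resolved within max_extra crops")
        # Tie: evaluate one more randomly cropped sample and recount.
        votes.append(predict_extra_crop())
        extra_used += 1
```

For a two-class problem a single extra crop always breaks the tie, which matches the description above; the loop only matters if the same helper is reused with more than two classes.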

Results
In the healthy vs. AD analysis, CNN achieved the highest performance among the three models, giving a validation accuracy of 97% using all four features. However, when using only LF and HF 3D RSOM images, CNN yielded only 48% validation accuracy in classifying healthy vs. AD conditions. Adding three handcrafted features (TBV, LHFR and ET) to the model increased the validation accuracy to 94%. The performance of CNN improved by a further 3 percentage points when TEWL was added to the model, reaching 97% accuracy. Among the traditional ML models, RF performed better than SVM in all analyses performed for different combinations of features.
In the mild vs. moderate-severe analysis, RF gave the highest accuracy of 65% in severity prediction among all three models, using all four features. The SVM model, on the other hand, showed a validation accuracy of 59% when TBV, LHFR, and ET were used. Lastly, CNN exhibited a slightly lower accuracy of 56% in predicting severity compared to RF and SVM, using 3D RSOM images and the three handcrafted features derived from RSOM images (TBV, LHFR, ET) as inputs.

Discussion
In recent year classification. on 3D optoac RSOM image Using a s compared to t images) [38]. smaller data s affected by th for preparing sets can be a b We extend study, we ad moderate-sev comprehensiv for the genera especially wh higher predic improvement Using all achieve a dia  Table 2). When raw 3D RSOM images were added to the pipeline using CNN, the diagnostic accuracy was improved to 97% and the model demonstrated more stability in prediction compared to RF and SVM, judging from the low standard deviation. This suggested that the CNN model indeed extracted useful features from 3D RSOM images, which aided in enhancing the CNN's diagnostic accuracy. Even though CNN showed very high diagnostic accuracy in classifying healthy and AD conditions, it did not achieve similar performance in classifying mild vs. moderate-severe AD conditions. The CNN's highest diagnostic accuracy for mild vs moderate-severe AD classification was 56%. A similar prediction accuracy was observed in RF and SVM, where the average diagnostic accuracy was ~60%.
From Figure 5, the RSOM cross-sectional images for healthy and AD cases were visibly distinguishable. The CNN model was thus able to extract useful features for classifying healthy and AD cases even though the data sets were small. For the mild vs. moderate-severe AD conditions, using both raw 3D RSOM images and handcrafted features did not improve the CNN model's accuracy as we had observed in the healthy vs. AD classification. We believe this was because it is challenging to differentiate between mild and moderate-severe AD RSOM images, as shown in Figures 5 and 6. There are several reasons for the erroneous classification in Figure 6. Firstly, since pathological and physiological features form the basis for determining the severity of AD in this study, any deviation in the features will affect the CNN's prediction accuracy. Notably, the mild representative case in Figure 6 exhibited a TEWL value of 31 g/m²h, far higher than that of severe AD cases. Similarly, the TEWL value of the moderate representative case in Figure 6 was lower than that of healthy subjects, possibly causing the case to be wrongly classified as 'mild'. Secondly, if the structural features of the RSOM images are lost due to skin barrier dysfunction in severe AD cases, feature quantification is challenging since the boundary between the epidermis and dermis is not delineated. For the severe case in Figure 6, the ET calculation yields a value of 148 μm, similar to that of healthy subjects, which leads it to be wrongly classified. The limited amount of training data adds to the difficulty, resulting in inaccurate classification on the validation data set. As shown in Table 1, there are a total of 41 AD and 19 healthy subjects in our data set for classifying healthy vs. AD cases. However, to predict mild vs. moderate-severe AD subjects, fewer samples are available: 15 mild AD subjects and 26 moderate-severe AD subjects.
The limited number of samples is another reason why classifying mild vs. moderate-severe AD subjects is harder. CNN models in general require many more samples in order to learn to extract useful features for classification. Retraining the model with a larger data set should help mitigate this problem.
During ML training, it is important to balance the classes (e.g. healthy vs. disease) so that the number of samples from each class is similar. It is particularly important to split the data at the patient level to avoid potential data leakage from training data to validation data [38]. We have developed a CNN-based pipeline, which includes data preparation, data augmentation and model training, to recognize various AD severity conditions using raw 3D RSOM images and handcrafted features. The pipeline handles data splitting at the patient level and is not limited to AD classification. It is designed in a modularized manner and has the flexibility to be applied to the classification of other skin inflammatory diseases, such as rosacea and psoriasis, and to other 3D imaging modalities such as optical coherence tomography, multispectral optoacoustic tomography and multispectral optoacoustic mesoscopy. We have shared our code at https://github.com/davidc9320sg/rsom-dermatitis-cnn/.
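Patient-level splitting means that all cropped samples from one patient fall on the same side of each train/validation split. The sketch below illustrates the idea with the standard library; it is not taken from the shared repository, the function name `patient_level_folds` is hypothetical, and the default of six folds simply mirrors the six-fold cross-validation described earlier. It does not stratify by class, which a full pipeline would likely also need.

```python
import random


def patient_level_folds(sample_patient_ids, n_folds=6, seed=0):
    """Build cross-validation folds that never split a patient.

    sample_patient_ids: one patient identifier per cropped sample.
    Returns a list of (train_indices, val_indices) pairs over samples.
    """
    patients = sorted(set(sample_patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Assign whole patients, not individual samples, to folds.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = []
    for fold in range(n_folds):
        val = [i for i, p in enumerate(sample_patient_ids)
               if fold_of_patient[p] == fold]
        train = [i for i, p in enumerate(sample_patient_ids)
                 if fold_of_patient[p] != fold]
        folds.append((train, val))
    return folds
```

Because fold membership is decided per patient, near-identical crops of the same skin region cannot appear in both training and validation data, which is the leakage the paragraph above warns about.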
It is crucial to diagnose AD severity accurately to monitor the treatment response and plan effective clinical care for patients. We proposed a network architecture suitable for classifying AD conditions from 3D optoacoustic images, which can be used as an objective evaluation tool for assessing AD conditions in clinics. In its current state, our CNN model is unable to achieve the desired diagnostic accuracy in classifying mild vs. moderate-severe AD conditions. One reason could be that the AD severities were determined from SCORAD scoring, which is subject to inter- and intra-observer variability; the model's accuracy may therefore suffer from discrepancies in the SCORAD results. While SCORAD and EASI take into account the presentation and frequency of AD symptoms, the subsurface inflammation physiology of the skin lies outside their scoring framework. With the naked eye, it may be possible to observe that superficial symptoms improve over time with treatment, but inflammation may persist under the skin, which can significantly impact the way we classify the disease severity. With more data collected and consensus among multiple diagnoses for each patient, the model can be re-trained using the current framework as a baseline to further enhance its accuracy.
There are several limitations in this study. Firstly, the data set was small (76 patients) and the population was mainly an Asian cohort. Through an on-going collaboration, we aim to expand this study by including patients with lower Fitzpatrick scores (I-II). Secondly, the size of the cropped sample was set at 64 × 64 pixels due to insufficient GPU memory. A larger cropped sample might improve the CNN models since it provides more information to the network. The AD classification model in this paper can be an adjunctive diagnostic tool to aid clinical decisions, especially in differentiating between mild and non-mild AD severities. As even clinicians with experience in optoacoustic imaging may find RSOM images challenging to interpret, our classification model aims to classify AD severity with higher sensitivity by extracting features from the volumetric vascular structure in 3D RSOM images rather than from one-plane imaging features. The proposed pipeline provides the foundation for an AI-aided AD diagnosis and treatment platform.

Conclusion
To conclude, we have evaluated the performance of three ML models in classifying AD conditions using 3D RSOM images, handcrafted features derived from RSOM images, and transepidermal water loss. Our results showed that CNN models yielded the highest accuracy (97%) in classifying healthy vs. AD conditions, while RF achieved the highest accuracy (65%) in classifying mild vs. moderate-severe AD conditions. This is the first study to classify AD severity using 3D RSOM images. We developed a pipeline to prepare 3D RSOM images for training a CNN model and showed that the use of raw RSOM voxel values can be advantageous over handcrafted features. Our method can easily be extended to other inflammatory skin diseases such as rosacea and psoriasis.