Automatic classification of dual-modality, smartphone-based oral dysplasia and malignancy images using deep learning.

Abstract: With the goal of screening high-risk populations for oral cancer in low- and middle-income countries (LMICs), we have developed a low-cost, portable, easy-to-use smartphone-based intraoral dual-modality imaging platform. In this paper we present an image classification approach based on autofluorescence and white light images using deep learning methods. The information from the autofluorescence and white light image pair is extracted, calculated, and fused to feed the deep learning neural networks. We have investigated and compared the performance of different convolutional neural networks, transfer learning, and several regularization techniques for oral cancer classification. Our experimental results demonstrate the effectiveness of deep learning methods in classifying dual-modal images for oral cancer detection.

Oral cancer can be screened through imaging [9]. Autofluorescence differs between malignant and healthy tissue, and autofluorescence visualization (AFV) can help clinicians identify suspicious lesions and assist in diagnosis [11].

Data acquisition
Dual-modal image pairs were collected at an oral oncology clinic and labeled as 'normal' or 'suspicious'; image pairs degraded by saliva or defocus were not used. Figure 3 shows example image pairs from the palate and buccal mucosa of a patient; the autofluorescence images appear yellowish, in part an artifact of the smartphone sensor.
We observed consistent differences between suspicious and healthy tissue across the whole data set: suspicious regions show reduced autofluorescence and a different color in white light images, and both cues can be learned by the CNN. To train on our small data set, we applied transfer learning with networks pretrained on large natural-image data sets [28,50]. This training procedure carries some dependence on the original classification task; the pretrained fully connected layers do not work with our training data, and therefore a new fully connected layer is trained from scratch on our images.
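As an illustration of this transfer-learning setup, the following sketch loads an ImageNet-pretrained network, freezes its convolutional layers, and replaces the final fully connected layer with a new two-class head. It uses PyTorch with torchvision's VGG-16 as a stand-in, since VGG-CNN-M is not bundled with torchvision; the freezing policy shown is an assumption, not the exact training recipe of this study.

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet (VGG-16 here as a stand-in
# for VGG-CNN-M, which torchvision does not provide).
net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# Freeze the pretrained convolutional layers (illustrative policy).
for p in net.features.parameters():
    p.requires_grad = False

# Replace the last fully connected layer with a new two-class head
# ('normal' vs. 'suspicious') trained from scratch.
net.classifier[6] = nn.Linear(net.classifier[6].in_features, 2)
```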

Weight decay
Weight decay is a widely used regularization technique [51]. It limits the growth of the network weights by adding a penalty term to the error function: $E = E_0 + \frac{\lambda}{2}\sum_i w_i^2$, where $E_0$ is the original error function, $\lambda$ is the weight decay parameter, and $w_i$ are the weights of the network.
In the study, we have compared the performance of different networks, applied data augmentation and regularization techniques, and compared our fused dual-modal method with single-modal WLI or AFI. For each comparison, all settings except the one under investigation are fixed.
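For reference, this penalty is available directly in common frameworks; the sketch below shows it in PyTorch, where the optimizer's weight_decay argument corresponds to λ (the placeholder model, learning rate, and momentum are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder network for illustration

# SGD's weight_decay adds lambda * w to each gradient, which is
# equivalent to minimizing E_0 + (lambda/2) * sum_i(w_i^2).
# lambda = 0.001 is the best-performing value reported in Fig. 6(c).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=0.001)
```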

Architecture
While increasing the depth of a network is typically beneficial for classification accuracy [48], accuracy can become worse when training on a small image data set [52]. To achieve the best performance with our small data set, we have evaluated a number of neural network architectures; Figure 6(a) compares their results. In VGG-CNN-M the stride in layer conv2 is two and the max-pooling window in layers conv1 and conv5 is 2x2. In VGG-CNN-S the stride in layer conv2 is one, and the max-pooling window in layers conv1 and conv5 is 3x3. VGG-CNN-M has the best performance, and we use this architecture in our study.
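A minimal sketch of a VGG-CNN-M-style feature extractor is shown below, following the published VGG-CNN-M channel widths and the conv2 stride and 2x2 pooling described above; the padding choices are assumptions, and local response normalization is omitted for brevity:

```python
import torch.nn as nn

# Five convolutional layers in the style of VGG-CNN-M.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # conv1: 2x2 pool
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # conv2: stride 2
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # conv5: 2x2 pool
)
```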

Data augmentation
Data augmentation is implemented to increase the data set size. Figure 6(b) shows the increased accuracy from using data augmentation. As expected, the accuracy increases with a larger data set.
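The exact augmentation operations are not listed here, so the pipeline below is only an illustrative example of how such augmentation is commonly implemented; the specific flips and rotations are assumptions:

```python
from torchvision import transforms

# Illustrative augmentation pipeline: each training image is randomly
# flipped and rotated, effectively enlarging the training set.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),
    transforms.ToTensor(),
])
```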

Regularization
Various weight decay parameters have been explored and λ = 0.001 has the best performance as shown in Fig. 6(c). Additional studies could help to find the optimal weight decay value. As shown in Fig. 6(d), we can further improve the performance by applying the dropout technique.
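As a sketch of how dropout fits into such a network, the classifier head below zeroes a random half of its activations during training; the dropout probability of 0.5 and the layer widths are illustrative assumptions:

```python
import torch.nn as nn

# Fully connected head with dropout between the hidden layer and the
# two-class output; dropout is active only in training mode.
classifier = nn.Sequential(
    nn.Linear(512 * 6 * 6, 4096),  # input size is an assumed example
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),             # randomly zeroes 50% of activations
    nn.Linear(4096, 2),            # 'normal' vs. 'suspicious'
)
```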

Dual-modal classification
To compare classification performance, we trained the network with single-modal AFI, single-modal WLI, and fused dual-modal images, keeping all other settings fixed. Using dual-modal images to classify has the best performance; this result uses transfer learning, and the dual-modal network achieves higher sensitivity than either single-modal network on the same test images (Fig. 7).
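For reference, the sensitivity reported in this comparison, together with specificity, can be computed from binary predictions as sketched below (the helper name and label convention are illustrative):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary labels
    (1 = 'suspicious', 0 = 'normal')."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    return tp / (tp + fn), tn / (tn + fp)
```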

Discussion
The images in this study were captured from the buccal mucosa, gingiva, soft palate, vestibule, floor of mouth, and tongue. One challenge associated with the current platform is that the different structural and biochemical compositions of tissue at different anatomical regions produce different autofluorescence characteristics, which potentially impacts the performance of the trained neural network. This problem could be addressed in the future, when we have sufficient image data at every anatomical region to train neural networks and classify each anatomical region separately.
Because our goal in this study is to develop a platform for community screening in LMICs, we currently focus on distinguishing dysplastic and malignant tissue from normal tissue. While in clinical practice it would be of interest for clinicians to distinguish different stages of cancerous tissue, this work provides the platform for further studies toward multi-stage classification in the future.
One remaining question is whether the images marked as clinically normal or suspicious reflect the true condition of the patients, because the diagnoses from the specialists, rather than histopathological results, are used as the ground truth. The device will act as a screening tool in rural communities, and patients will be referred to hospitals and clinics for further testing and treatment if a suspicious result appears.

Conclusion
Smartphones integrate state-of-the-art technologies, such as fast CPUs, high-resolution digital cameras, and user-friendly interfaces. An estimated six billion phone subscriptions exist worldwide, with over 70% of the subscribers living in LMICs [53]. With many smartphone-based healthcare devices being developed for various applications, perhaps the biggest challenge for mobile health is the volume of information these devices will produce. Because adequately training specialists to fully understand and diagnose the images is challenging, automatic image analysis methods, such as machine learning, are urgently needed.
In this paper, we present a deep learning image classification method based on dual-modal images captured from our low-cost, dual-mode, smartphone-based intraoral screening device. To improve the accuracy, we fuse AFI and WLI information into one three-channel image. The performance with fused data is better than using either white light or autofluorescence image alone.
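The sketch below illustrates one way such a three-channel fusion can be assembled from an image pair; the specific channel choices (WLI intensity, AFI green channel, and AFI red-to-green ratio) are assumptions for illustration, not the exact fusion scheme of this work:

```python
import numpy as np

def fuse_pair(wli, afi):
    """Fuse a white light (WLI) / autofluorescence (AFI) RGB image pair
    into one three-channel array; channel choices are illustrative."""
    wli = wli.astype(np.float32)
    afi = afi.astype(np.float32)
    wli_gray = wli.mean(axis=2)                    # overall WLI intensity
    afi_green = afi[:, :, 1]                       # AFI green channel
    ratio = afi[:, :, 0] / (afi[:, :, 1] + 1e-6)   # AFI red/green ratio
    return np.stack([wli_gray, afi_green, ratio], axis=2)
```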
We have compared convolutional networks of different complexities and depths. The smaller network with five convolutional layers works better than a very deep network containing 13 convolutional layers, most likely because the added complexity causes overfitting when training on a small data set. We have also investigated several methods to improve learning performance and address the overfitting problem. Transfer learning can greatly reduce the number of parameters and the number of images needed for training. Data augmentation, used to increase the training data set size, and regularization techniques such as weight decay and dropout are effective in avoiding overfitting.
In the future, we will capture more images and increase the size of our data set. With a bigger image data set, the overfitting problem will be reduced and more complex deep learning models can be applied for more accurate and robust performance. Furthermore, because of the variation in tissue types across the oral cavity, performance is potentially degraded when all regions are classified together. With additional data, we will be able to perform more detailed multi-class (different anatomical regions) and multi-stage classification using deep learning.

Funding
National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health (UH2EB022623); National Institutes of Health (NIH) (S10OD018061); National Institutes of Health Biomedical Imaging and Spectroscopy Training (T32EB000809).