Multistage classification of oral histopathological images using improved residual network

Abstract: Oral cancer is a prevalent disease of the head and neck region. Due to its high incidence and serious consequences, accurate diagnosis of malignant oral tumors is a major priority, and early diagnosis allows the patient a prompt response to treatment. The most effective way to diagnose oral cancer is histopathological imaging, which provides a detailed view of the cells. Accurate, automatic classification of oral histopathological images remains difficult owing to the complex nature of cell images, staining methods, and imaging conditions. Deep learning applied to imaging techniques and computational diagnostics can assist doctors and physicians in analysing Oral Squamous Cell Carcinoma biopsy images in a timely and efficient manner, reducing the operational workload of the pathologist and enhancing patient management. Training deeper neural networks takes considerable time and requires substantial computing resources, due to the complexity of the network and the gradient diffusion problem. With this motivation, and inspired by ResNet's significant success in handling the gradient diffusion problem, in this study we propose a novel improved ResNet-based model for the automated multistage classification of oral histopathology images. Three prospective candidate model blocks are presented and analyzed, and the best candidate is chosen as the optimal model, which efficiently classifies oral lesions into well-differentiated, moderately-differentiated, and poorly-differentiated in significantly reduced time, with 97.59% accuracy.


Introduction
Oral cancer is a chronic disease of the head and neck region, which includes the oral cavity, nasopharynx, and pharynx [1]. When patients neglect precancerous signs, the disease can progress to the carcinoma stage in a short duration and become life-threatening. Thus, early detection of oral tumors is essential for providing a better treatment plan and increasing the survival rate [2]. With the advent of computer-aided diagnostic (CAD) systems and tissue histopathology, digital histopathological images are accumulating rapidly, creating a need for their automated analysis. Many computer-aided methodologies have been developed for oral cancer diagnosis using machine learning and deep learning methods [3,4]. Deep learning is widely adopted by researchers for analyzing such data for classification or prediction.
For oral cancer analysis, biopsy sections are taken and placed on a glass slide. For accurate diagnosis, the slides are usually stained with hematoxylin and eosin (H&E) and subsequently examined under a microscope. A pathologist analyzes the distortions (i.e., the numerous elements on the slide, such as cell organization, cell size, form and shape, and so on) to identify cancer, and an oncologist then diagnoses the patient based on the pathologist's report. For that reason, the study should be as precise as possible. However, manual feature analysis of each slide is arduous, time-consuming, and requires a significant amount of domain knowledge; furthermore, the report may be skewed by the observer [5]. An automated version of this procedure is therefore needed, one that reduces bias and time by improving the feature evaluation process. Additionally, as digital histopathological images have specific characteristics, special processing techniques are required for their analysis. An automated diagnostic system for oral cancer screening can overcome the issues of the manual observation process. It would also be beneficial in large-scale observatories and cancer screening camps, where the vast majority of cases are benign, letting the pathologist concentrate on the cases the system has flagged as cancerous [6]. Computer-aided histopathology research has been performed for various cancer diagnosis and grading applications [7,8], and many machine learning and deep learning models have since been developed that effectively predict the cancer type by extracting and identifying patterns and associations from datasets [9][10][11].
Oral malignancy comes in a variety of forms, such as "squamous cell carcinoma", "verrucous carcinoma", "minor carcinoma of the salivary gland", "lymph epithelial carcinoma", etc. From the literature [12], squamous cell carcinoma (SCC) accounts for about 90% of oral disorders. In a healthy state, the throat and mouth are typically lined with stratified squamous cells attached to a single basement membrane layer. This adhesion to one layer protects the inherent stability of the epithelium, and that stability is lost when cancer develops. SCC is categorized into three types: "WDSCC (well-differentiated SCC)", "MDSCC (moderately-differentiated SCC)", and "PDSCC (poorly-differentiated SCC)" [13]. Well-differentiated cells are low grade (grade I) tumors; they are well-organized and have the appearance of healthy tissue. The tumor cells of high grade (grade III) are poorly differentiated. Moderately differentiated cancer cells are those that appear neither well-differentiated nor poorly-differentiated.
The progress of deep learning techniques has dramatically improved performance on various tasks in the medical domain [14,15]. To learn better representations, a deeper network has to be trained, and the training and optimization of a deeper network are more complicated than for a shallow one [16]. Although convolutional neural networks have achieved many breakthroughs in image classification tasks [17][18][19], they are difficult to train for two reasons. First, due to the exponential decrease of the gradient, the front layers train very slowly. Second, Convolutional Neural Network (CNN) models have many parameters, increasing the complexity of the network and thus requiring a longer training time [20]. Recent advances in deep learning can be summarized as learning methodologies, initializations, and activation functions for training more complex architectures. Though activation functions and batch normalization have made tremendous progress in reducing the influence of exploding/vanishing gradients, optimizing a neural network with a deep architecture remains a challenge [21]. To handle this challenge, we propose residual network architectures trained from scratch to classify oral cancer histopathological images into three categories.
The following are the key contributions of this article: 1. We propose three ResNet architectures for the multistage classification of OSCC into well-differentiated, moderately-differentiated, and poorly-differentiated. These architectures work well by providing better accuracy and preserving information across layers. 2. The study focuses on training the different proposed ResNet variants from scratch and comparing their efficiency for the defined problem statement. 3. The proposed ResNet architectures provide faster convergence with less computational complexity, delivering superior performance on a small dataset. 4. To prevent overfitting, data augmentation techniques, activation functions, and parameter optimizations are employed.
This article is arranged as follows. In Section 2, the relevant related works are discussed. The materials and methods for the experimentation, as well as the model structures and parameters, are presented in Section 3. The comparison among different variants of ResNet and the experimental results are discussed in Section 4. Finally, the conclusion and future work are presented in Section 5.

Related work
Researchers have performed oral cancer classification using machine learning and deep learning techniques. Kim et al. [20] and Mohd et al. [9] conducted retrospective studies for the early diagnosis of oral cancer and predicted the survival rate. Much of the research emphasizes the application of machine learning methods to detect oral submucous fibrosis (OSF) [11,22,23]; the chronic nature of OSF can lead to oral cancer on progression. The studies by Rahman et al. [24] and Rahman et al. [25] are based on the binary categorization of OSCC. They reported 100% classification accuracy using Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) classifiers with texture, color, and shape features. The conventional approach followed by pathologists is subjective in nature. Apart from that, significant challenges for manual assessment include the variability of the microscope, the quality of the stains/slides, the domain knowledge of the pathologist, the time allotted for each observation, etc. These factors may result in a diagnostic mistake or a delay in follow-up [26]. In contrast, automated systems applying deep learning techniques eliminate the need for domain proficiency and explicit feature engineering. Deep learning algorithms can now achieve superior classification accuracy for histopathological images without requiring manual feature representation of the input data, owing to the availability of high-performance GPUs [27,28]. Santisudha et al. suggested deep learning-based models for the binary categorization of OSCC [29,30] that outperformed other baseline models in this domain. For making appropriate clinical decisions, the biopsy image is regarded as the gold standard in pathology [31]. Many studies in the literature apply deep learning to biopsy images for classification tasks [11,18].
Thus, deep learning approaches are expected to outperform conventional machine learning methods without the need for selective feature engineering.
Navarun et al. [32] have employed transfer learning with four pre-trained models and compared them with a proposed CNN model to classify four types of OSCC. They have replaced the fully connected layers for random weight initialization to relearn from oral cancer histopathological images. The literature reveals the importance of using machine learning and deep learning methods for the classification of oral cancer.
Lamia H. et al. [33] have developed an automatic segmentation of brain tumor from MRI images by using Deep Residual Network. They have achieved superior performance with 3% faster computation time compared to other DCNN models. Hao Zhu et al. [34] have proposed a ResNet model for the remote sensing classification and achieved competitive performance. There are many articles highlighting the importance of ResNet in various domains such as malaria detection [35], mask detection [36], machine health monitoring [37] etc.
An issue associated with training deeper networks is the degradation problem, in which the accuracy of a deeper neural network increases initially until reaching saturation and then diminishes as the depth is increased [38]. To overcome this challenge, as well as the large number of model parameters, we propose simple and improved residual networks trained on a smaller dataset. To our knowledge, no other study has used histological images of oral cancer to demonstrate the efficiency of residual networks.
The proposed study considers the multistage classification, or grading, of oral histopathological images using different variations and depths of the ResNet architecture. The outcomes of all variations are compared, and the best-performing approach for classification is determined.

Materials and methods
For the present study, we have used oral biopsy images to evaluate and analyze the influence of using various residual blocks. The variants of residual blocks are compared and validated. First, the oral cancer histopathological image dataset is generated.
An expert pathologist performed ground-truth labeling by identifying the region of interest (ROI), from which the study dataset was created by extracting image patches. The dataset is then split into 70% training, 10% validation, and 20% testing based on the train-test strategy [16]. Thereafter, data pre-processing and data augmentation are applied. Finally, classification is performed with the different candidate residual blocks. The diagrammatic depiction of the study approach is shown in Figure 1.
The training dataset is used to fit the model, i.e., its weights and biases; the validation dataset is used to tune the model hyperparameters, i.e., the architecture of the model; and the test dataset is used to evaluate the performance of the trained model.
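As a concrete illustration, the 70%/10%/20% split described above can be sketched with scikit-learn's `train_test_split`; the stratification by class and the random seed are our own assumptions, not the paper's stated procedure:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in indices for the 1200 image patches (400 per class).
X = np.arange(1200)
y = np.repeat([0, 1, 2], 400)  # well-, moderately-, poorly-differentiated

# First carve out the 20% test set, then take 10% of the whole dataset
# (i.e., 1/8 of the remaining 80%) as the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 840 120 240
```

The 840 training patches match the count that the augmentation step below expands to 8400.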

Image dataset
Histological hematoxylin and eosin-stained sections of cancerous oral lesions were obtained from the Institute of Dental Sciences (IDS), SUM Hospital, Bhubaneswar, India. The data was gathered with consideration for the patients' concerns as well as ethics committee approval (Ref No./DMR/IMS.SH/SOA/1800040). The slides were viewed under a Lawrence and Mayo research microscope, and images were captured with a 5 MP CMOS camera at 100x and stored digitally on the computer as high-quality JPEG images. 400 image patches were collected for each category of oral cancer stage: "well-differentiated", "moderately-differentiated", and "poorly-differentiated". The image patches were extracted at a size of 256 × 256. Sample images for each class are shown in Figure 2.

Data pre-processing
A median filter of size 3 × 3 was applied to the image patches to reduce noise, such as the bright and dark pixels associated with capturing images through the microscope [39,40].
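A minimal sketch of this filtering step, using SciPy's `median_filter` (the paper does not specify the implementation library):

```python
import numpy as np
from scipy.ndimage import median_filter

# Toy 8-bit grayscale patch with one "salt" pixel, standing in for a
# bright capture artifact from the microscope camera.
patch = np.full((5, 5), 100, dtype=np.uint8)
patch[2, 2] = 255  # bright noise pixel

denoised = median_filter(patch, size=3)  # 3 x 3 median filter
print(denoised[2, 2])  # the outlier is replaced by the local median: 100
```

Because the median of each 3 × 3 neighborhood replaces the center pixel, isolated bright or dark pixels vanish while edges are largely preserved, which is why a median filter is preferred over a mean filter for this kind of impulse noise.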

Data augmentation
Data augmentation is a simple yet effective strategy for preventing overfitting and improving final results [41]. More image augmentations improve the model's generalization capability. Different transformations are selected, including flipping, cropping, scaling, shifting, and rotating (30°, 45°, 60°, 90°, 105°). Due to the small dataset available, we applied heavy data augmentation: the aforementioned nine modifications are applied to the 840 (70%) training image patches, which, together with the original set, yields 8400 training image patches.
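Part of this augmentation pipeline can be sketched as follows, covering the flips and the five rotation angles listed above; the cropping, scaling, and shifting transforms and their exact parameters are omitted, and the helper `augment` is our own illustration, not the paper's code:

```python
import numpy as np
from scipy.ndimage import rotate

def augment(patch):
    """Yield augmented variants of a patch: horizontal/vertical flips
    plus the five rotations used in the paper (30-105 degrees)."""
    yield np.fliplr(patch)  # horizontal flip
    yield np.flipud(patch)  # vertical flip
    for angle in (30, 45, 60, 90, 105):
        # reshape=False keeps the output at the original patch size;
        # mode='reflect' fills the corners exposed by rotation.
        yield rotate(patch, angle, reshape=False, mode='reflect')

patch = np.random.rand(256, 256)
variants = list(augment(patch))
print(len(variants))  # 7 transformed copies per patch
```

With all nine modifications, each of the 840 training patches yields ten images (original plus nine transforms), giving the 8400 patches mentioned above.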

The Proposed ResNet models
Deep convolutional neural networks have made several notable advancements in image classification [42]. However, when training deeper networks for convergence, the degradation problem poses a challenge in developing deep learning models.
The accuracy of a deeper neural network increases initially until reaching saturation, then diminishes as the depth is increased further [38]. It is recognized that this effect is not caused by overfitting, as the error grows in both the training and test sets. ResNet, proposed by He et al. [43] with 50 residual layers (ResNet50), won the ImageNet competition in 2015 and allows deep neural networks with 150+ layers to be trained successfully. Comparing various deep learning models and evaluating deeper ResNets over a significant period of time, the researchers concluded that ResNet can enhance accuracy by increasing depth and outperforms other models on classification tasks. Extracting the key visual features of OSCC, such as the structural variances of the epithelial layers and the formation of keratin pearls, is the challenging task in oral cancer classification. Here, the ResNet needs to be trained on the very small oral histopathological image dataset available for this study. The primary motive of this study is to extract the key features of OSCC with a limited dataset and to resolve the degradation issue of the neural network. There are numerous types of residual components, which can be designed according to the problem's needs. The residual components employed in this study, which effectively overcome the degradation problem, are shown in Figure 3. Figure 4 represents the common layered architecture of the ResNet for all three variants of Figure 3 (a), (b), and (c). The residual component in Figure 3(a) is composed of three convolution layers in the main path and one convolution layer in the skip path; the convolutions have sizes 1 × 1, 3 × 3, and 1 × 1. The Figure 3(b) residual block uses two convolutions of size 3 × 3 in the direct path and one convolution of size 1 × 1 in the skip path. The Figure 3(c) residual block is similar to Figure 3(a) but without batch normalization.
In all three models, a depth of 13 convolutions is used, with a convolution kernel size of 3 × 3. We experimented with larger numbers of convolution layers and observed that 13 is the minimum number required for optimal performance. Thus, the models are referred to as ResNet13-A, ResNet13-B, and ResNet13-C in the following sections. For ResNet13-A and ResNet13-C, the corresponding residual blocks are repeated three times, whereas for ResNet13-B the residual block is repeated four times to make the depth the same in all cases, so that the three models are comparable. The direct path and skip path in the residual module have the same output dimensions and can be combined directly. Figure 5 depicts the two-stacked-layer building block, which is determined as: H(x) = R(x) + x, where the input and output vectors of the residual block are x and H(x) respectively, and R(x) denotes the residual mapping to be learned. The rectified linear unit (ReLU) is used as the activation function for our deep learning models [44]. The softmax function is used in the prediction layer to convert the output layer's results to a probability distribution for the multi-class classification task [45].
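A hedged Keras sketch of a Figure 3(a)-style residual block follows. The filter count and the exact ordering of batch normalization and ReLU are assumptions, since the text specifies only the convolution sizes (1 × 1, 3 × 3, 1 × 1 in the main path, 1 × 1 in the skip path), batch normalization, ReLU, and the mapping H(x) = R(x) + x:

```python
from tensorflow.keras import Input, Model, layers

def residual_block_a(x, filters):
    """Sketch of the Figure 3(a)-style block: a 1x1 / 3x3 / 1x1 main
    path with batch normalization, a 1x1 convolution in the skip path,
    and H(x) = R(x) + x via element-wise addition."""
    shortcut = layers.Conv2D(filters, 1, padding='same')(x)  # skip path
    r = layers.Conv2D(filters, 1, padding='same')(x)
    r = layers.BatchNormalization()(r)
    r = layers.Activation('relu')(r)
    r = layers.Conv2D(filters, 3, padding='same')(r)
    r = layers.BatchNormalization()(r)
    r = layers.Activation('relu')(r)
    r = layers.Conv2D(filters, 1, padding='same')(r)
    r = layers.BatchNormalization()(r)
    h = layers.Add()([r, shortcut])  # H(x) = R(x) + x
    return layers.Activation('relu')(h)

inp = Input(shape=(256, 256, 3))
out = residual_block_a(inp, 32)
model = Model(inp, out)
print(model.output_shape)  # (None, 256, 256, 32)
```

The Figure 3(c) variant would simply drop the `BatchNormalization` calls, which is what makes it prone to the overfitting discussed in the results section.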

Experimental setup
The proposed models were run on a system with the following specifications: an NVIDIA Quadro P5200 GPU (16 GB GDDR5 graphics memory, 2560 CUDA cores) with a six-core i7 processor and 32 GB of RAM. In addition, Keras (a high-level neural network library running on TensorFlow or Theano) with the Python interface was used to implement the experimental framework of the Residual Network (ResNet) models for oral lesion classification.

Results and discussion
During this study, various experiments have been carried out to evaluate ResNet's performance in detecting different stages of oral cancer.
The suggested architectures come with default hyperparameter values; however, hyperparameter tuning is performed to obtain the optimum configuration for our task. ResNet is trained with an initial learning rate of 0.01 for 50 epochs, and the network's mini-batch size is set to 32. Stochastic Gradient Descent with Momentum (SGDM) [46] and Adam [47] are chosen as our optimization techniques.
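This training configuration can be sketched in Keras as follows. The stand-in classifier and the momentum value (0.9) are our own assumptions; the learning rate, batch size, epoch count, and optimizer choices come from the text:

```python
from tensorflow.keras import layers, models, optimizers

# Minimal stand-in classifier; the actual ResNet13 models would be used here.
model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(16, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(3, activation='softmax'),  # three OSCC grades
])

# Paper's settings: initial learning rate 0.01; SGDM (momentum value is
# our assumption) and Adam are the optimizer candidates.
sgdm = optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=sgdm, loss='categorical_crossentropy',
              metrics=['accuracy'])

# Training would then use the paper's batch size and epoch count:
# model.fit(train_ds, validation_data=val_ds, epochs=50, batch_size=32)
```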
It is observed from the literature that convergence is inhibited and degradation occurs as the number of layers in a CNN grows larger [48]. Residual networks are computationally more efficient and can achieve a higher degree of accuracy from significantly increased depth. ResNet has received much research interest due to its acceptance and effectiveness in image classification [35,36], and many ResNet variations have been proposed. The performance of the different variants proposed in this study is compared in Figure 6.
Model comparison and results - All the variants of the proposed residual network models are validated and compared, and the results are discussed in terms of (i) learning curves, (ii) evaluation metrics, and (iii) comparison with other cutting-edge models.

i. Learning curves
Learning curves are a visual representation of the incremental, epoch-based evaluation of a classifier's learning performance; here they are the accuracy and loss curves for the training and validation sets. Figure 6 shows the accuracy and loss curves of the training progression for the proposed models over 50 epochs. It indicates that after each epoch, the training loss steadily decreases and the training accuracy increases. Model ResNet13-C does not generalize as well as the other two models, due to the absence of batch normalization, as observed in Figure 6(c): its training accuracy is much higher than its validation accuracy, so it overfits the training dataset and fails to generalize to unseen data. Batch normalization is a method for addressing the difficulties of training deep neural networks: it stabilizes the learning process and significantly reduces the number of training epochs needed [21]. The accuracy and error rates of the different variants are compared in Figure 7. The accuracy and error rate differ among the variants, and model ResNet13-A performs best, achieving an accuracy of 97.59%. Model ResNet13-B becomes stable after 40 epochs with an accuracy of 96.67%, which is comparable to ResNet13-A with slightly lower performance. Including ReLU after the addition makes only a small performance difference. Conversely, the difference in validation accuracy between the Batch Normalization (BN) and non-BN networks is ≈ 15%, making it evident that the BN network's generalization capability is substantially higher than that of its non-BN counterpart. This evaluation shows that BN helps a network generalize more effectively. On average, ResNet13-A has a lower error rate than ResNet13-B and ResNet13-C.
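A typical way to produce such learning curves from a Keras training run is sketched below; the history values shown are hypothetical placeholders, not the paper's results, and in practice the dictionary would come from `model.fit(...).history`:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Hypothetical Keras History.history dict (placeholder values only).
history = {
    'accuracy':     [0.60, 0.75, 0.85, 0.93, 0.97],
    'val_accuracy': [0.55, 0.70, 0.82, 0.90, 0.95],
    'loss':         [1.05, 0.70, 0.45, 0.25, 0.12],
    'val_loss':     [1.10, 0.80, 0.55, 0.35, 0.20],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history['accuracy'], label='train')
ax1.plot(history['val_accuracy'], label='validation')
ax1.set(xlabel='epoch', ylabel='accuracy')
ax1.legend()
ax2.plot(history['loss'], label='train')
ax2.plot(history['val_loss'], label='validation')
ax2.set(xlabel='epoch', ylabel='loss')
ax2.legend()
fig.savefig('learning_curves.png')
```

A widening gap between the two accuracy curves, as seen for ResNet13-C, is the visual signature of overfitting.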

ii. Evaluation metrics
To assess the effectiveness of our proposed model, benchmark classification performance measures, namely sensitivity, specificity, accuracy, precision, and F-measure, are used in this study. With TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives, sensitivity (true positive rate), specificity (true negative rate), and accuracy are defined as: Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP), and Accuracy = (TP + TN) / (TP + TN + FP + FN). In the medical domain, both sensitivity and specificity contribute significantly to assessing the strength of a model, and accuracy correlates with both when evaluating performance. From Table 1, it can be observed that the sensitivity and specificity values are well balanced, which is required for an outstanding model. Thus, ResNet13-A is considered the best-performing model for the multiclass classification of oral lesions.
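For a multiclass problem, these metrics are computed per class from the confusion matrix in a one-vs-rest fashion. The sketch below illustrates this; the labels and predictions are hypothetical, not the paper's test results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a 3-class problem (0 = WDSCC, 1 = MDSCC,
# 2 = PDSCC); real values would come from the trained model's test run.
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 0, 1, 2, 1])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
for k in range(3):
    tp = cm[k, k]
    fn = cm[k].sum() - tp       # class-k samples predicted as other classes
    fp = cm[:, k].sum() - tp    # other-class samples predicted as class k
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    print(k, round(sensitivity, 2), round(specificity, 2), round(precision, 2))

accuracy = np.trace(cm) / cm.sum()
print('accuracy:', accuracy)  # 0.8
```

Precision and sensitivity per class also yield the F-measure as their harmonic mean, 2 * P * R / (P + R).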

iii. Comparison with other cutting-edge models
A novel multistage cancer detection approach based on ResNet is proposed in this work, which handles the degradation issue. We compare the effectiveness of our model with machine learning and deep learning baseline architectures. The ResNet13-A model consists of convolutional layers, batch normalization, rectified linear activation units, and an identity mapping, making its training procedure more robust than that of a plain CNN. Furthermore, the ability to extract higher-level, more complex data characteristics distinguishes the ResNet13-A architecture from others. This comparison of different methodologies sheds some light on the advantages of adopting ResNet over other techniques. In identifying distinct phases of cancer cells, the developed ResNet13-A model achieved the best average accuracy of 97.59%, which is comparable with various cutting-edge methods.
The efficiency of the suggested model is compared to earlier research approaches in terms of accuracy. While the majority of the literature uses machine learning techniques, the suggested model, along with CNN and CapsuleNet, is a deep learning architecture. Traditional machine learning involves expert-dependent handcrafted feature extraction and used small datasets (between 10 and 500 images), whereas deep learning extracts coarse features automatically, and its efficiency improves as the data size grows. The SVM model by Krishnan et al. [40] and the backpropagation-based ANN model by Belvin et al. [41] outperformed our result by 2.07% and 0.33% respectively, since the features considered in those studies are well-defined handcrafted features and the datasets evaluated are substantially smaller than in the suggested method. Even though different datasets were used in the existing literature, the aim of each study is the same, and each examined oral cancer histopathological images. From this perspective, Table 2 shows that our method is, to the best of our knowledge, highly comparable with current work in the field of OSCC, demonstrating the practicality of the proposed solution. This indicates that the proposed method for detecting multistage malignant tissue would be useful for computer-assisted diagnosis of OSCC. The ResNet design incorporates skip connections to speed up computation and reduce training error compared to existing architectures.

Conclusion and future work
In this work, we have proposed a novel model for the classification of oral cancer into multiple stages based on histopathological image data. Three prospective candidate model blocks are trained from scratch, and the best candidate is chosen as the optimal ResNet model (ResNet13-A), an automated computer-aided method that obtains high-performance results with low computational complexity on a small dataset. Furthermore, performance metrics, specifically accuracy, sensitivity, specificity, precision, F-measure, and loss rates, have been studied. The suggested model achieved 97.59% accuracy for multistage classification, which is comparable with several state-of-the-art approaches. Therefore, the proposed ResNet model is an efficient model for detecting multistage oral cancer and can be utilized as a diagnostic tool to help physicians in daily clinical screening.
Our present implementation does not employ dropout regularization, which has been demonstrated to improve performance in deep networks; we will train the networks with dropout regularization in the future and assess its outcome. Furthermore, we intend to test our network on a suitably large dataset, which will broaden the scope of the findings presented in this paper.