Explainable AI for Interpretation of Ovarian Tumor Classification Using Enhanced ResNet50

Deep learning architectures like ResNet and Inception have produced accurate predictions for classifying benign and malignant tumors in the healthcare domain. This enables healthcare institutions to make data-driven decisions and potentially enable early detection of malignancy by employing computer-vision-based deep learning algorithms. These CNN algorithms, in addition to requiring huge amounts of data, can identify higher- and lower-level features that are significant while classifying tumors into benign or malignant. However, the existing literature is limited in terms of the explainability of the resultant classification, and identifying the exact features that are of importance, which is essential in the decision-making process for healthcare practitioners. Thus, the motivation of this work is to implement a custom classifier on the ovarian tumor dataset, which exhibits high classification performance and subsequently interpret the classification results qualitatively, using various Explainable AI methods, to identify which pixels or regions of interest are given highest importance by the model for classification. The dataset comprises CT scanned images of ovarian tumors taken from to the axial, saggital and coronal planes. State-of-the-art architectures, including a modified ResNet50 derived from the standard pre-trained ResNet50, are implemented in the paper. When compared to the existing state-of-the-art techniques, the proposed modified ResNet50 exhibited a classification accuracy of 97.5 % on the test dataset without increasing the the complexity of the architecture. The results then were carried for interpretation using several explainable AI techniques. The results show that the shape and localized nature of the tumors play important roles for qualitatively determining the ability of the tumor to metastasize and thereafter to be classified as benign or malignant.


Introduction
Ovarian cancer is one of the most common types of cancer in women all around the world [1].Early detection remains one of the best possible means of cure.Due to recent advancements in the Computer Vision domain, healthcare experts are leveraging CNN-based architectures to classify tumors into benign or malignant based on MRI or CT scanned images, thus enabling them to make early diagnosis of the disease.In addition to domain knowledge, experts can now rely on the capability of artificial intelligence to detect signs of cancer at an early stage, which would potentially transform the diagnosis of various types of cancer, with ovarian cancer being the most prominent [2].
The leverage of AI in the healthcare domain is increasing by the day, and healthcare institutions are relying on the usage of Computer Vision-based classification and detection deep learning algorithms for early diagnosis of diseases like cancer, which is not intuitive to predict by human professionals.This can potentially revolutionize the healthcare and disease prediction domain, and save or prolong countless lives through early diagnosis and treatment for fatal diseases like cancer.Deep learning convolutional neural network-based algorithms have proven to be one of the most accurate and efficient methods of detection of tumors from CT scans or MRI scan images.However, due to the black-box nature of these algorithms, it is difficult to interpret the results of the algorithm and understand why the model predicted a certain class.Retracing the results of the algorithm back to the individual layers can be a challenging task, and it is important to understand why a neural network made a certain prediction, or, in other words, which features it relied on the most for the final predicted output.The traditional Machine Learning algorithms like Support Vector Machines and Decision Trees provide explainability; however, the problems of image classification and localization can be solved with a greater accuracy and efficiency with deep convolutional neural networks.Sometimes, due to a class imbalance in the dataset, the network can be inclined towards a certain output and place importance on the wrong features.Hence, in this scenario, without some explainability methods, it would be difficult to rely on these algorithms.Class activation maps and feature importance techniques prove to be excellent for highlighting the regions of interest and, thus, ensuring that the algorithm results are fair and unbiased, as well as take into consideration the right features for prediction.This is crucial for healthcare practitioners to continue to rely on and have faith in deep learning algorithms to help in their decision-making processes.Thus, to interpret the results of these models, explainable AI methods are being implemented.Inspecting intermediate features at each layer is one of the popular methods to interpret model representation at various layers.The second approach is attention-based, which highlights the input features that are given most importance during prediction.Explainable AI methods, on the other hand, help to analyze the decision making process of neural networks with the help of class activation maps and local explanations, and have been implemented this paper.In some binary or multiclass classification problems, there also a scope of creating an ensemble of neural networks combined with decision tree or rule based approaches to enhance the interpretability of predictions.Finally, introducing dropout or regularization in networks, and thus increasing the sparsity of the networks, can help with identifying the features that are most important towards the final predicted output.
Recent advancements in deep learning convolutional neural network-based architectures have proven to be efficient in identifying higher-and lower-level features during classification of tumors into benign and malignant categories at earlier stages of detection, which is nearly impossible to predict with the human eye [3].These architectures provide enhanced performance when compared to traditional Machine Learning classification algorithms like Support Vector Machines and Logistic Regression due to the consumption of huge amounts of training data and the complexity of the network architecture.However, unlike traditional Machine Learning algorithms like Logistic Regression and Decision Trees, the deep learning CNN architectures provide less explainability and interpretability of the results, despite outperforming these algorithms manyfold in terms of classification accuracy [4].Hence, it is crucial to understand the way CNN algorithms work for making a classification and the importance of the features considered for the final classification result for the healthcare experts to make a data-driven diagnosis with more confidence and prescribe the most effective treatment [5].
In this paper, we have developed a ResNet60 model, which is a modified ResNet50 model derived from the standard ResNet50, for the classification of ovarian tumors into benign and malignant.The modified ResNet50 model is derived by adding more Convolutional layers to ResNet50, resulting in a ResNet60 model, and the detailed architecture is discussed in the methodology section of the paper.The motivation behind designing a custom ResNet50 was to attain higher classification performance while ensuring that the complexity of the network does not increase by adding more layers, as in ResNet101 or ResNet151, which could lead to vanishing and exploding gradient problems during train-ing.The derived ResNet50 showed higher classification performance than the standard ResNet50 by an increased 7.5% on the test dataset without increasing the complexity of the network to a large extent.This model was also compared with several state-of-the-art architectures, and has outperformed the other architectures with a classification accuracy of 97.50% on the ovarian tumor test dataset.Therefore, the results of the derived ResNet50 classifier are carried forward for interpretability.
To explain the results of the modified ResNet50 classifier, we have implemented the following explainable AI methods: LIME, Saliency Map, Occlusion Analysis, Grad-CAM, SHAP, SmoothGrad.The Explainable AI methods used in this paper are the ones that are most commonly used in existing literature related to Explainable AI for a qualitative interpretation of the model results.
LIME is a model-agnostic method that aims to explain the important features behind the classification results by creating several artificial data samples and thereby predicting the class of each such artificial data point.The Saliency Map highlights the pixels that most affect the class scores.Occlusion analysis evaluates the importance of each input dimension by analyzing the model performance with that input dimension missing and its impact on the model performance.Grad-CAM weighs the gradient in the final CNN layer against the output of this layer to determine which features are the determining factors behind the class predicted.SHAP is a feature importance technique, while SmoothGrad adds a small Gaussian noise and highlights the important pixels contributing to the final predicted class across all the training images.Using these explainable AI methods, we interpret and explain the results of the modified ResNet50 by highlighting the important features and pixels considered by the classifier while classifying ovarian tumors into benign and malignant.Furthermore, it helps us understand the key characteristics of benign ovarian tumors, which distinguish them from malignant tumors.
The main contributions of this paper are summarized below: • A modified custom ResNet50 (or, in terms of convolutional layers, a ResNet60) architecture is proposed for the classification of ovarian tumors as benign and malignant.The modified ResNet50 classifier performs the classification task with 97.5% accuracy on the test dataset, which is higher than other state-of-the-art architectures implemented-GoogLeNet (Inception-v1), Inception-v4, ResNeXt, Inception-ResNet, ResNet16, Xception, VGG16, VGG19, ResNet50, EfficientNetB0.

•
The results of the modified ResNet50 are then provided for interpretation to the explainable AI methods LIME, Saliency Map, Grad-CAM, Occlusion Analysis, SHAP and SmoothGrad.In order to have a meaningful and accurate interpretation of the classification output, the explainable AI methods were implemented on the classification results of the modified ResNet50 architecture, since it had the best classification accuracy on the test dataset.

•
The above explainable AI methods highlighted the features and regions of interest that were given most importance by the proposed model in obtaining the final classification output, such as the shape and neighboring area of the tumor.The results show qualitatively that the tumors that have a fixed boundary and localized nature are classified as benign by the proposed model, while those with an irregular shape and boundary are classified as malignant by the model, indicating that the localization, boundary, and shape are important characteristics to understand qualitatively from the image as to whether the tumor has a tendency to metastasize to neighboring organs and tissues, subsequently leading to the benign or malignant nature.
The organization of the paper is as follows: Section 2 presents an overview of the related work in tumor classification and various explainable AI approaches like LIME, SHAP, Grad-CAM, Occlusion Analysis, SmoothGrad for image classification in general, as well as in the medical imaging domain.Section 3 discusses the gap in the existing literature for leveraging explainable AI methods used in classifying ovarian tumor images and the motivation behind this work.Section 4 discusses the proposed methodology.Section 5 describes the results of the proposed ResNet60 (modified ResNet50) architecture and an overview of the explainable AI methods used for the interpretation of the results.It also presents a comparison of the performance of the proposed custom ResNet50 with stateof-the-art architectures, and shows that the proposed architecture yields the highest test accuracy in the classification of tumors and, hence, the results are interpreted further using the explainable AI methods as mentioned above.Finally, it discusses the scope of future work in this domain.Section 6 discusses the conclusion, along with the key takeaways from this research work.

Existing Work Explainability of Classification Models in the Medical Imaging Domain
In the healthcare domain, the use of AI to classify images of tumors, obtained from CT scans or MRI scans, as benign or malignant is increasingly becoming a useful tool for the early diagnosis of cancer.Also, with the advent of COVID-19, we see research work on the classification of chest X-ray images for presence or absence of COVID-19 or pneumonia.Some of these use cases are illustrated by the following research papers.For example, Manali Gupta et al. [6] implemented and evaluated the performance of a scratch CNN method with VGG-16 for classification of brain tumor MRI images as cancerous or non-cancerous.Samir S. Yadav et al. [7] evaluated convolutional neural network-based architectures for the classification of pneumonia presence on a chest X-ray images dataset.Although deep neural networks have proved to be effective in the medical image classification task, with an increasing number of layers, the training process can slow down and become less effective due to the problems of a vanishing and exploding gradient.This problem is solved by introducing a residual network, commonly known as ResNet, which enhances the training of deep neural networks.Jiazhi Liang et al. [8], in their research paper, illustrated the ResNet architecture for image classification using deep neural network models.

Existing Work on Explainability of CNN models in the Medical Imaging Domain
Post training, it is important to interpret the CNN classification output, which is difficult due to the black box nature of these deep learning algorithms.Hence, explainable AI methods are crucial at this point to interpret and explain the classification predictions obtained from the neural network.The following are some surveys and review papers on explainable AI methods, used for reference.Saranya, A. et al. [9] discussed some of the commonly used explainable AI methods like LIME and SHAP, and several versions of the same.Baehrens, D. et al. [10] used local explanation vectors for explaining classification results.Xu, F. et al. [11] performed a detailed survey of the history and various methods in the field of explainable AI.Yang, W. et al. [12] discussed various explainable AI approaches, along with their limitations and use cases.Samek, W. et al. [13] presented recent developments in the field of explainable AI and discussed two specific methods for the same.Singh, A. et al. [14] discussed the explainable AI methods in the medical image analysis domain.Velden, B.H.M. et al. [15] presented a survey report of existing explainable AI techniques in the deep learning-based medical image domain, and discussed future prospects for the same.Linardatos, P. et al. [16] provided a review of various interpretability methods in the Machine Learning domain.Among the explainable AI methods, the most popular ones are LIME, Saliency Map, Occlusion Analysis, Grad-CAM, SHAP, and SmoothGrad.The following are references to the research papers proposing the above methods on various datasets for interpreting CNN classification output.Ribeiro, M.T. et al. [17] proposed the LIME technique for demonstrating the explainability of the Machine Learning classifiers.Junkang An at al. [18] propose a LIME-based explainable AI technique for interpreting the results of deep learning models using feature importance and partial dependency plots (PDPs).Alqaraawi, A. et al. [19] explored the Saliency Map method in detail using public datasets.Simonyan, K. et al. [20] used two methods, one using a class score and another using a Saliency Map, for analyzing the results of a deep CNN.Xiao-Hui Li et al. [21] evaluated and compared various explainable AI methods based on defined evaluation metrics.Resta, M et al. [22] provided an occlusion-based technique for the explanation of deep recurrent neural networks for biomedical signals.Selvaraju, R.R. et al. [23] used the Grad-CAM technique for visual explanations of the classification predictions made by deep convolutional neural networks.Cao, Q.H. et al. [24] proposes a novel explainable AI method, Segmentation -Class Activation Mapping (SeCAM), that combines the best features of LIME, CAM, and the GradCAM methods for explanation of CNN prediction results.Ruigang Fu et al. [25] proposed an axiom-based Grad-CAM to satisfy the axioms of sensitivity and conservation.Ioannis Kakogeorgiou et al. [26] evaluated various explainable AI methods in the context of deep learning multi-label classification for remote sensing.Lundberg, S. et al. [27] proposed the SHAP method for interpreting the model predictions using feature importance values for a particular prediction.Bach, S. et al. [28] described an approach based on pixel-wise contributions to explain the classification predictions made by non-linear classifiers.Hooker, S. et al. [29] evaluated the performance of feature importance estimation methods used by interpretable or explainable AI methods.Ishikawa, S. et al. [30] proposed a method of explainable artificial intelligence to verify the reliability of a deep learning model for remote sensing image classification tasks.Jogani, V. et al. [31] used various explainable AI techniques to interpret the results of CNNs for classification of lung cancer from histopathological images.Montavon, G. et al. [32] used Deep Taylor Decomposition to explain the results of non-linear classifiers.Shrikumar, A et al. [33] proposed a method, Deep Learning Important Features (DeepLIFT), which assigns contributions towards the final output to the neurons of the deep neural network and also illustrates positive and negative contributions.Smilkov, D. et al. [34] proposed a method called SmoothGrad for explaining the results of a deep neural network.Soltani, S. et al. [35] provided enhanced explainable AI algorithms based on cognitive theory.Springenberg, J.T. et al. [36] proposed a deconvolutional approach for interpreting the classification results of the CNNs.Sundararajan, M. et al. [37] proposed the Integrated Gradients approach for explaining the deep neural network predictions.Vermeire, T. et al. [38] proposed the SEDC method, which is model agnostic and is able to provide counterfactual explanations for the predictions of deep convolutional neural networks.Zeiler, M.D. et al. [39] demonstrated the performance contribution of different layers of the ImageNet used for classification.Zhou, B. et al. [40] proposed a modification of the global average pooling layer and implement the Class Activation Mapping (CAM) technique to improve the CNN's performance in classification of the images without being trained specifically for the same.Wu, B. et al. [41] proposed and evaluated an attention-based model for large-scale image classification, and thereby explain the classification results of this approach.Szegedy, C. et al. [42] and LeCun, Y. et al. [43] discuss approaches for enhancing CNNs.A comparative analysis of the existing work with our research work is shown in Tables 1 and 2.

Research Gap and Motivation
Deep learning models have proved to be accurate to classify tumors as benign or malignant, and there is plenty of literature demonstrating the same.However, due to the black-box nature of deep learning models, it is difficult to interpret and accurately pinpoint the features or pixels responsible for the final model output, and to distinguish one class from another.There is limited literature on developing explainable AI methods to interpret and explain the classification output of these models for better data-driven decision-making in the healthcare domain.Moreover, there is a lack of adequate research work on leveraging explainable AI techniques to interpret the results of deep learning models used for classification of tumors into benign and malignant in the healthcare domain for the healthcare experts to take better data-driven decisions using AI and thus helping in early diagnosis of the disease.The existing literature on tumor classification does not highlight the features or regions of the tumors that are considered significant by the model during classification.The lack of qualitative analysis of the model results poses a challenge in decision making due to the black-box nature of the deep learning models and their reliability, despite their high classification performance.It is essential to understand that the classifier recognizes the right patterns during classification (for example, the shape, size and other characteristics of tumors) that determine whether they have the tendency to be benign or malignant.Hence, in this paper, we use several explainable AI techniques to explain the results of a custom enhanced ResNet50 classifier that classifies the ovarian tumors as benign and malignant with 97.50% accuracy on the test dataset, and outperforms the state-of-the-art ILSVRC winning architectures.The interpretation of the results show that the shape, boundary, and ability of the tumor to be localized in nature are important factors considered by the proposed model during classification, and this qualitative analysis also supports the basic definition of benign and malignant tumors in medical science based on their ability to metastasize and spread to neighboring tissues and organs.

Methodology
The proposed enhanced ResNet50 model has been trained on the dataset of cropped ovarian tumor images for the classification of the tumors as benign or malignant.The training dataset comprises 3644 images, with 2633 belonging to benign and 1011 belonging to the malignant classes.The model has been validated on a dataset of 1561 images, comprising 1128 benign images and 433 malignant images.The model has also been tested using a test dataset of 520 images, comprising 376 benign and 144 malignant images, on which it achieved an accuracy of 97.50% on the test dataset.There is a class imbalance in the datasets, where benign images are higher in number than malignant images.Figure 1 shows the methodology implemented.

Enhanced ResNet50 (Proposed ResNet60) Architecture
The proposed architecture considers inputs images of size 224 × 224 × 3, and the various layers are shown in Figure 2.
While the complete architecture is shown in Figure 2, the structure of each convolutional and identity block is displayed in Figure 3. Figure 4 shows the proposed system comprising the various layers of the ResNet60 architecture, obtained by adding additional layers to a pretrained ResNet50, followed by the model explanations.As the deep learning architectures continued to add more layers for higher accuracy of the predicted output, the vanishing and exploding gradient problems started becoming more common with increasingly deep layers, thereby slowing down the learning and reducing the accuracy of the predicted output.This problem was solved by ResNet, where residual blocks containing skip connections were introduced, which connect activations of a layer to deeper layers, skipping some layers in between.
The fundamental building blocks of the ResNet architecture and image classification is described in Algorithm 1.
To interpret the classification results of the modified ResNet50 model, we have implemented the above-mentioned explainable AI methods.The objective of this research is to determine the important features or pixels responsible for the proposed system to produce the final classification output.We also aimed to understand what features distinguish a benign ovarian tumor from a malignant ovarian tumor based on the CT scanned image, and how the neural network captures the same.
For the l-th Residual Unit with K layers, x l denotes the input feature, and W l = W l,k | 1≤k≤K denotes the set of weights and biases.The residual function is indicated by F.

Convolutional Block:
Convolutional 2D Layer (number of filters, kernel size, strides, padding, kernel initializer, kernel regularizer) For inputs x i , z being the output of each neuron, g() being the linear transformation of the neuron, w the weights, b the bias, and f () the activation function, the Batch Norm produces the output z N ,with mean m z and standard deviation s z , and learning parameters γ and β.After applying the ReLU activation function on the Batch Norm output, a Conv2D layer is applied with kernel ω.F(( f (z), y) denotes the input patch of image where convolution is applied and G out (z, y) denotes the convolution product, y being the output of the Conv2D layer.
Bottleneck Residual Unit: The residual units comprise stacked Batch Normalization on the inputs BN(x), followed by ReLU activation and convolutional layers c 3 and c 1 with filters of shape 1 and The Final Classification Layer: ; i=1 w jk x i + w j0 ; ; For input X, the output is defined by indices (i, j), channel index k, horizontal and vertical strides s x and s y , and filter sizes f x and f y for the pooling window with center (i, j).Ff (x, y) denotes the flattened layer, with the weights in the CNN being 4D filters W ∈ R C×X×Y×F , C, number of input channels C, spacial dimensions X and Y of the filer.f denotes the index of the output channel with stride 1, with input map as I ∈ R C×N×M×F with spacial dimensions and channel as N, M and C, respectively.Ŵf , the unit rank filter can be expressed as: where α f , β f and γ f denote the lateral, vertical and horizontal filters respectively, convolving features across channels, Y dimension and X dimension, respectively.Finally, y jk (x) denotes the output of the final dense layer or the fully connected layer denoted by f

Explainable AI Methods
To interpret the classification results of the enhanced ResNet50 model, we have implemented the above-mentioned explainable AI methods.The objective of this research is to determine the important features or pixels responsible for the proposed model to produce the final classification output.We also aimed to understand what features distinguish a benign ovarian tumor from a malignant ovarian tumor based on the CT scanned image, and how the neural network captures the same.LIME, which stands for Local Interpretable Model-agnostic Explanations, is an Explainable AI method that aims to explain the results of a model using a local linear approach.LIME creates several artificial data points in the vicinity of the data point, whose classification results it aims to explain.It uses a local linear classifier to predict the class of each of the artificially created data points.After this, the cosine similarity is calculated between each artificial data point and the original data point to determine the importance or weight of each of the artificial data points in classifying the real data point.Since the dataset consists of images, the data points are pixels, and the cosine distance is computed between the pixels to compute their weight in determining the class.Using the artificially weighted data points, a linear regression model is fitted, and the features or data points with the higher coefficients in the fitted linear regression model contribute the most in determining the prediction of the model being evaluated.
Riberio et al. [17] described a LIME-generated explanation as: where g indicates the model used for approximating the function f, Π x indicates the distance measured between z and the neighborhood around x, with z and x being instances.The complexity of explanation of model outputs using g is depicted by Ω(g).
While LIME uses feature importance to explain the importance of nearby pixels, the Saliency Map is another explainable AI method to explain the classification results of Computer Vision models.The Saliency Map highlights the region of interest, comprising the pixels that are the determining factors behind the model's output.The Saliency Map calculates the derivative of the class score with respect to the image, and helps us identify which pixels, when changed, cause the least change in the class score.
Simonyan et al. [20] has proposed the gradient of the output class calculated with respect to the input pixels of the image.The Saliency Map for a class c can be described as: f c (I ′ ) can be approximated with a linear function in the neighborhood of I by the first- order Taylor expansion: The weights of this approximated linear function are the generated Saliency Map.
Occlusion Analysis is an attribution technique that determines the importance of each feature or dimension by evaluating the model performance by having that dimension missing.The region or patch of pixels which, when dropped, causing the highest drop in the model's performance tends to have the most contribution towards the prediction results of the model.
Loannis Kakogeorgiou et al. [26] describe occlusion as: This technique calculates the change in the approximated function f c (x) by replaced contiguous rectangular patches p i ∈ P of the input image with a given baseline (e.g., all-zero patch).
Grad-CAM, standing for Gradient Class Activation Map, is an Explainable AI technique that produces a heatmap that shows the importance of each of the classes for the model being evaluated.To achieve this, the gradient with respect to the final layer of the CNN is computed and weighed against the output class of that layer.The heatmap is essentially a spatial map of the weight or importance of each channel towards the final output class.
Ruigang Fu et al. [25] describe the general form of the Class Activation Mapping (CAM) as: where c denotes the class c of the input image and the CAM can be expressed as a linear combination of the features in the target layer.The SHAPley Additive Explanation (SHAP) is an Explainable AI method that calculates the feature importance based on SHAPley values derived from cooperative game theory.The feature importance calculated using the SHAP value indicates the magnitude as well as the extent of the positive or negative effect of the feature on the final class predicted by the model, and also takes into account the presence of multicollinearity.
For calculating the importance of each feature, the model is retrained on all possible subsets of features S ⊆ F from the set of all features F. The importance of each feature is thus assigned, which indicates the effect of that particular feature on the model prediction.
For the f S∪{i} x S∪{i} − f S (x S ), two models are trained, one trained with the features being present, denoted by f S∪{i} , and another trained without including the features, denoted by f S .For a set S of input features, the values are denoted by x S .Since the effect of withdrawal of one feature depends on the subset of the other model features, the differences are calculated for all the subsets S ⊆ F\{i}.Then, the SHAPley feature importance values are computed as a weighted average of the differences.Scott M. Lundberg et al. [27] describes the computation of SHAPley values as: SmoothGrad is an explainable AI method that is gradient-based, similar to the Saliency Map.It improves the gradient visualizations by adding a small Gaussian noise to the input image.This helps in obtaining a clean gradient visualization, free from noise, and only highlights the important pixels that are activated across all the sampled images.These important pixels determine the final classification output of the model.
Given a neural network that classifies an image into a class from a set C of classes, for each input image x, a class activation function, denoted by S c , is calculated for each class c ∈ C. The final predicted class (x) is the one with the highest score defined below (Szegedy et al. [42], 2016; LeCun et al. [43], 1998): For locating the pixels in the image of higher importance, a sensitivity map, denoted by M c (x) is constructed as below: where M c (x) is the derivative of M c with respect to x, the input, and ∂S c represents the derivative (i.e., gradient) of S c .
The sensitivity maps are improved by using a Gaussian kernel to smoothen ∂S c and taking an average of the sensitivity maps of random samples in the neighborhood of an input x.Smilkov et al. [34] provided the below formulation for SmoothGrad: where the number of samples is denoted by n, and the Gaussian noise with standard deviation σ is denoted by N 0, σ 2 .

Data Source and Description
The dataset comprises a set of 5725 CT scan images of ovarian tumors, which have been annotated, that belong to benign and malignant classes for 53 unique patients from the SDM College of Medical Sciences and Hospital located in Dharwad, Karnataka, India.The dataset contains reconstructed images as per the three different planes, namely axial, sagittal, and coronal.Annotation is performed to highlight the boundaries of the tumors by encircling them with nearly-circular-shaped polygonals.Table 3 depicts the number of instances of benign and malignant images that form the training, validation, and test datasets.Figures 5 and 6 show benign and malignant tumor image samples from the dataset.

Data Preprocessing and Dataset Preparation for Training and Evaluation
The dataset is affected by class imbalance problem, where the benign sample instances are comparatively higher than the malignant ones.To mitigate the class imbalance issue, the images chosen for training the model were subsetted from the originally collected dataset.No specific augmentation techniques were applied to handle the class imbalance issue.For training and validation, images that have been annotated were cropped according to the annotation boundary that encircles the tumors, allowing the classification models to efficiently extract and learn the features of the localized tumor.Figures 7 and 8. display illustrations of the cropping and localization applied on annotated images:

Classification Results
The aforementioned architectures were trained and tested on the dataset as discussed earlier.The below hyperparameters were applied to the models: input size: 224 × 224 × 3; optimizer: Adam; number of epochs for training: 200; steps per epoch: 10; validation steps: 5; loss: sparse categorical cross entropy (for inception modules) and binary cross entropy (for other modules).Table 4 shows the different layers along with filters, strides and activation functions in each layer, for the enhanced ResNet50 architecture.Based on experiments on the train and test loss over the epochs, the learning steadily continues till 200 epochs and then slows down.On hyperparameter tuning, it was observed that the optimal number of steps per epoch that yields highest validation accuracy was 10, with a constant number of validation steps of 5. On comparing the optimizers, Adam and SGD showed the most rapid decrease in training error over the epochs, out of which Adam showed a consistent decrease beyond 150 epochs and, hence, the Adam optimizer was selected for training.
Table 5 shows the hyperparameter setting of various model architectures implemented for ovarian tumor classification.Table 6 outlines the training and test outcomes (accuracy and loss) using 200 training epochs for various architectures.After an equal number of training epochs, it is evident that the suggested ResNet60 (or enhanced ResNet50) architecture exhibits the maximum accuracy for the train as well as test datasets.In terms of accuracy, it is followed by Xception, Inception-ResNet, GoogLeNet, ResNet16, ResNeXt, and ResNet50, which exhibit test accuracy of 90% or above.During testing, an accuracy of 97.50% is obtained using the given ResNet60 model.Cross-validation of train and validation samples reveals that ResNet60 typically outperforms the other designs in terms of performance.Therefore, for the given CT scan dataset, our suggested ResNet60 design fits optimally for classifying tumors as benign or malignant.It is noted that the train and validation losses incline toward convergence, and sway about a constant value that is extremely near to zero as the epochs go by.The train and validation accuracies, at the start, show a rising trend and swing between 60 and 70%.Following 50 epochs, the accuracy rises steadily, and varies between 80 and 90%, until it stabilizes at a point where the training accuracy is between 90% and 95% and the validation accuracy is between 95% and 100%.The results demonstrate that the proposed ResNet60 has steady learning, while being free of bias and variance concerns, since it can fit the training dataset incredibly well and generalize effectively on the validation and test datasets.The confusion matrix and other detailed classification metrics for the proposed ResNet60 (modified ResNet50) model on the test dataset are shown in Figure 9 and Table 7, respectively.A precision of 96.66% indicates that, out of the total benign predictions, in 96.66% of the cases, the model was able to correctly identify it, and in only 3.34% of the predicted benign cases, it falsely predicted a malignant tumor as benign.A recall of 100% suggests that the model is able to correctly identify all benign tumors in the dataset, and there is no benign tumor falsely classified as malignant.An overall F1-Score of 98.30% suggests a good balance of the precision and recall performance.A specificity of 90.97% indicates that, out of the total number malignant tumor images in the dataset, the model was able to correctly identify the tumor in 90.97% of the cases, and in the remaining 9.03% of the actual malignant cases, the model falsely classified a malignant tumor as benign.Overall, an accuracy of 97.50% indicates that in the test dataset, the model was able to correctly predict the nature of the tumor in 97.50% of the cases.A sample of benign and malignant images of ovarian tumors from the test dataset were fed into the below-mentioned explainable AI methods for interpretation.

LIME Results
The results of LIME on sample benign and malignant images are shown below in Figure 10a,b.For interpreting the results of LIME, the cropped tumor images are fed into the method, to reduce noise from the background and only focus on the data points in the vicinity of the tumor.
From the LIME results, we see why the ResNet60 model classifies the image as benign or malignant.The yellow highlighted region indicates the superpixels where the boundary of the tumor is present.This means that the model considers the pixels on the boundary of the tumor and the neighboring pixels to determine the shape and pattern of the tumor, which helps in classifying the image as benign or malignant.These super-pixels are responsible for the final classification output result for the tumor as benign or malignant.On the right image, the area of the super-pixels colored in green indicates that these pixels cause the probability of the model predicting the predicted output class (benign or malignant) increase, while the super-pixels colored in red are the ones that decrease the probability of the model predicting the final predicted output as benign or malignant.Thus, the model can identify the special features (shape, pattern, etc.) and characteristics of the tumor that are specific for the tumor to be benign or malignant, and this is highlighted by the superpixels identified by LIME, as well as the areas of the super-pixels which increase or decrease the probability of arriving at the final predicted output classification.

Saliency Map Results
The results of the Saliency Map as shown in Figure 11a

Occlusion Analysis Results
From the results of the occlusion analysis, as shown in Figure 12a,b, we observe that the highlighted regions of the heatmaps generated for the benign and malignant samples indicate the most important input dimensions towards generating the final classification, since the difference between the model outputs with and without those dimensions are higher compared to the other input dimensions.The following are the hyperparameters used for the occlusion analysis experiment towards the benign and malignant samples: occluding size: 5; occluding pixel: 0; occluding stride: 1.
We also see that the occlusion analysis method highlights the shape of the tumor and the neighboring region surrounding the tumor; hence, these are important for the final predicted output.

Grad-CAM Results
From the results of Grad-CAM, as shown in Figure 13a,b, we see the heatmaps for the images of the tumor.The size of the heatmap is determined by the spatial dimensions of the activation maps in the last convolutional layer of the network.On projecting the heatmap to the original image, we see the areas of the image that are highlighted, and these areas have been assigned the most importance for the model to predict the final output (benign or malignant) for the image.It is observed that the network took the area surrounding the tumor as a determining factor for the classification of the tumor as benign and malignant.This is a reasonable explanation for the model, because the neighboring areas of the tumor indicate whether the tumor has spread to the adjacent regions, which indicates the tumor is malignant, else it is benign.Also, another characteristic of the tumor considered by the model for classification as benign or malignant is the shape of the tumor; if it has a smooth shape, which has not yet spread to the neighboring areas and is concentrated in a single region, then it is a benign tumor.Otherwise, if the tumor has an irregular shape, it indicates that the tumor has disintegrated and spread into the neighboring areas, which is the characteristic of a malignant tumor.Since CNNs are feature extractors, and deeper layers of the network operate in increasingly abstract spaces, the activation maps generated by the Grad-CAM method provide useful insights into which features had the most importance in determining the model's final output class.These features are not just individual pixels, but a region of interest obtained by taking activation maps of deeper convolutional layers.

SHAP and SmoothGrad Results
The results for SHAP and SmoothGrad in Figure 14a

Comparison of the Explainable AI Results
It is observed that LIME can identify the superpixels responsible for the final classification output, as well as the neighboring pixels positively and negatively impacting the classification, which proves that the shape and area of concentration of the tumor is a crucial factor for the classification.The Saliency Map highlights the region of interest of the image responsible for the classification output.We observe that this region of interest is also mostly concentrated around the edge and immediate neighboring region of the tumor, which proves that these play a significant role in the final classification output.The Grad-CAM results highlight the boundary of the tumor and its adjacent regions, indicating, once again, that these play a crucial role in the classification of the tumors as benign or malignant.The red-blue heatmap outputs for SHAP and SmoothGrad indicate that the red pixels increase the model's confidence towards the final predicted output, and these pixels are mostly considered around the boundary and neighboring area of the tumor, thus re-affirming that the shape, localization, and ability of the tumor to spread to the neighboring areas have been given the most importance by the ResNet60 model for generating the final classification output.The heatmaps generated by the Occlusion Analysis attribution technique indicate that the shape and neighboring region of the tumor are more important towards the final output than the other input dimensions.Although Occlusion Analysis is very efficient at determining the marginal effect of each input dimension if the dimensions are independent, it has a high computational complexity, since we have to evaluate the model for each of the perturbed inputs, and if the input image has a higher size, the time required to generate the heatmaps increases manyfold.
The Explainable AI methods used in this paper for interpreting the classification results are qualitative in nature.They highlight the pixels or regions of interest in the image, which are considered as most significant by the ResNet60 classifier for classification of ovarian tumors as benign or malignant.One of the limitations of these methods is not being able to provide a quantitative explanation or metric for interpretation.There is a lot of research scope in designing metrics for higher-level feature importance measures, such as size, shape, boundary, localization, etc.The results of LIME showed that the model has given importance to the boundary and shape of the tumor, which is an important factor to consider for the classification of benign from malignant images.Results of the Saliency Map show the region of interest that the model has focused on the most for the generation of the output.The Grad-CAM results show that the model has given the most importance to the neighboring areas of the tumor as well as the shape and boundary of the tumor for distinguishing a benign tumor from a malignant one.

Scope of Future Work
There is a lot of scope for utilizing gradient-based approaches and feature-importance techniques for explaining deep neural networks better.As a next step, the deconvolution and guided backpropagation techniques can be implemented on the dataset of ovarian images for better interpretation.PatternNet, a newly emerging deep neural network architecture, is being used for identifying and visualizing patterns in the image data, which can explain the classification results of convolutional neural networks on these images.Mining visual patterns in images has the potential to reveal useful information about the intrinsic properties of objects and images, which influence neural networks to arrive at a certain classification decision.

Conclusions
In this paper, we have implemented several state-of-the-art techniques for the classification of tumors in the ovary as benign or malignant, and observed that a custom ResNet60 (or an enhanced ResNet50) architecture yields the best results on the test dataset.Thus, we have evaluated and interpreted the results of the modified ResNet50 classifier on a dataset of benign and malignant ovarian tumor images using different types of explainable AI techniques.Since the model has a 97.5% accuracy on the test dataset, the results of the explainable AI techniques show that the custom ResNet50 model has given importance to the correct features and regions of interest in determining whether an image of an ovarian tumor is a benign or malignant one.Thus, with the help of these Explainable AI methods, we interpreted the results of the proposed model, and understood which features or regions of interest it has given importance for generating the final classification output.
The source code for CNN architectures and Explainable AI methods in this paper are available at: https://github.com/srirupa-guha/Explainable_AI_Ovarian_Tumor_Classification.git(accessed on 30 April 2024).

Figure 1 .
Figure 1.Methodology of model selection for ovarian tumor classification and qualitative explanation of results.

Figure 2 .
Figure 2. ResNet60 (enhanced ResNet50) architecture for classification of ovarian tumor images into benign and malignant.

Figure 4 .
Figure 4. Proposed system comprising the enhanced ResNet50 (or ResNet60) architecture in detail and Explainable AI methods.

Figure 7 .
Figure 7. Original image in dataset (left) and cropped benign tumor image (right).

Figure 8 .
Figure 8. Original image in dataset (left) and cropped malignant tumor image (right).

Figure 9 .
Figure 9. Confusion matrix for ResNet60 classification results on test dataset.

Figure 10 .
Figure 10.Results of LIME on benign and malignant samples of cropped ovarian tumors data.Original images (a) and highlighted regions with the LIME results (b).
,b show the highlighted pixels or the region of interest that are important in the final classification of the tumor into benign and malignant classes by the ResNet60 model.The calculation of saliency is similar to backpropagation.Here, the derivative of the class score is taken with respect to the image.This helps us identifying which pixels, if changed the least, affect the class scores the most.

Figure 11 .
Figure 11.Results of Saliency Map on benign and malignant samples of cropped ovarian tumors data.Original images (a) and highlighted regions with the Saliency Map results (b).

Figure 12 .
Figure 12. Results of Occlusion Analysis on benign and malignant samples of cropped ovarian tumors data.Original images (a) and highlighted pixels with the Occlusion Analysis results (b).

Figure 13 .
Figure 13.Results of Grad-CAM on benign and malignant samples of cropped ovarian tumors data.Original images (a) and highlighted regions with the Grad-CAM results (b).
,b show red-blue heatmaps on the benign and malignant images based on the SHAPley values.The red-colored pixels positively impact the model's final prediction output, by increasing the model's confidence for the final predicted class.The blue-colored pixels negatively impact the model's final prediction output, by decreasing the model's confidence for the final predicted class, benign or malignant, as shown in the figures.We see that the red pixels are mostly concentrated around the boundary and neighboring region of the tumor, indicating that the shape and localization of the tumor, as well as the ability to spread to the neighboring regions, is taken into consideration by the ResNet60 model for the classification.

Figure 14 .
Figure 14.Results of SHAP and SmoothGrad on benign and malignant samples of cropped ovarian tumors data.Original images (a) and highlighted important pixels with the SHAP and SmoothGrad results (b).

Table 1 .
Comparative analysis of existing work for tumor classification using CNNs.

Table 2 .
Comparative analysis of existing work for explainability of CNNs.

Table 3 .
Dataset distribution: number of images across train, validation and test datasets.

Table 5 .
Hyperparameters of architectures implemented for classification.

Table 6 .
Comparison of model performances on the CT Scan dataset for train and test.