Visual explanation of black-box model: Similarity Difference and Uniqueness (SIDU) method

Explainable Artificial Intelligence (XAI) has in recent years become a well-suited framework to generate human-understandable explanations of "black-box" models. In this paper, a novel XAI visual explanation algorithm known as the Similarity Difference and Uniqueness (SIDU) method, which can effectively localize entire object regions responsible for a prediction, is presented in full detail. The robustness and effectiveness of the SIDU algorithm are analyzed through various computational and human subject experiments. In particular, the SIDU algorithm is assessed using three different types of evaluations (Application-, Human- and Functionally-Grounded) to demonstrate its superior performance. The robustness of SIDU is further studied in the presence of adversarial attacks on "black-box" models to better understand its performance. Our code is available at: https://github.com/satyamahesh84/SIDU_XAI_CODE.


Introduction
In recent years, deep neural networks (DNNs) have delivered ground-breaking performance on many complex and long-standing problems of artificial intelligence (AI). In particular, employing DNN architectures in tasks such as object detection [1], image classification [2] and medical imaging [3] has received great attention within the AI research field. As a result, it is no surprise that DNNs have become a favored solution for applications involving big data analysis. As human dependency on these solutions increases on a daily basis, it is crucial from both research and business standpoints to understand the underlying processes by which DNNs output a certain decision. As reported in recent works [4,5], such decisions result from the complex inner stacked layers of the DNN, which is typically referred to as a 'black-box' model. The use of the term 'black-box' indicates how challenging it is to understand which inner features of the model are the major contributors to the accuracy of the output [6].
The ability to interpret a 'black-box' DNN provides a transparent explanation and an auditable model output, which is crucial for sensitive domains such as medicine or risk analysis [7,8]. Consequently, a new paradigm addressing the explainability of these models has emerged in AI research, namely Explainable AI (XAI) [9]. XAI attempts to provide further insight into black-box models and their internal interactions, enabling humans to understand a machine-generated output. Furthermore, for end-users in sensitive domains, XAI gives the ability to interpret model features at the 'group level' or 'instance level' of the input, which results in greater trust when validating the outcome of deployed AI models. Although there is no standard consensus in the literature regarding how to define a human-interpretable explanation method for the black-box model, a widely adopted and popular approach is to form a visual saliency map of the input data showing which parts of the input influence the final prediction. This is motivated by the fact that visual explanation methods align closely with human intuition. For instance, it is straightforward for an end-user in the medical domain to compare the visual saliency map produced by a DNN model on a medical image with those generated by actual clinicians. A number of visual explanation algorithms have been proposed, among which LIME [10], GRAD-CAM [11] and RISE [12] are the most used examples. While each of these methods can be justified in one way or another, apart from challenges such as gradient computation of the DNN architecture (e.g., Grad-CAM) or visualizing all the perturbation modes (e.g., RISE), the generated visual explanations suffer from a failure to localize the entire salient region of an object, which is often required for higher classification scores.
Following our prior identification of this research gap, we further address it by proposing a new visual explanation approach known as SIDU [13] that tackles issues relating to salient region localization. SIDU stands for 'Similarity Difference and Uniqueness'; the method estimates pixel saliency by extracting the last convolutional layer of a deep CNN model and creating similarity difference and uniqueness masks that are eventually combined to form a final map for generating the visual explanation of the prediction. We briefly showed, through both quantitative and qualitative analysis, how SIDU can provide greater trust for the end-user in sensitive domains. The algorithm provides improved localization of the object class in question (see, for example, Figure 1(d)), which helps human experts gain the trust needed to rely on the deep model. This paper aims at providing a more general framework for the SIDU method by presenting it in further detail while exploring its characteristics via various experimental studies. Concretely, the studies investigate SIDU's visual explanations through the three main levels of evaluation proposed in [14]. Since these evaluation methods have different pros and cons, the performance of SIDU can be investigated in depth to provide a deeper level of insight. To the best of our knowledge, our comprehensive experimental studies across these different evaluation levels are the first in the context of XAI. Moreover, the ability of an XAI method to generalize its explanations of the black-box in different deployment scenarios can establish further trust. As evident in recent work, one example where black-box models are subject to less generalization is the presence of adversarial attacks, especially in sensitive domains with a wider scope of trust [15]. Therefore, we investigate how XAI can handle such a potential threat and guard against it.
Our main contributions in this work can be summarized as follows: 1. We provide a step-by-step detailed explanation of the SIDU algorithm, which yields a visual explanation map that enables localization of entire object classes within an image of interest.
2. We conducted three different types of experimental evaluations to thoroughly assess SIDU, coined (1) 'Human-Grounded', (2) 'Functionally-Grounded' and (3) 'Application-Grounded'. Lastly, section 5 concludes the study and discusses future work.

Related Work
In this work, we follow three main research directions of XAI: a) visual explanation methods developed to explain black-box models such as deep CNNs, b) validity and evaluation of the explanations generated by XAI methods, and c) vulnerability of black-box explanation methods to adversarial attacks.
The literature of each direction is presented in the following subsections.

Visual Explanation
For an end-user, visual explanation methods make it easier to understand the prediction output of a black-box model. One common approach to generating such a visualization is via saliency maps [16,17], and such algorithms may be divided into three categories: 'back-propagation-based' methods, 'perturbation-based' methods and 'approximation-based' methods. Back-propagation methods: back-propagation methods spread a feature signal from an output neuron backwards through the layers of a model to the input in a single pass, making them efficient. 'Layer-wise Relevance Propagation' [18] and 'DeConvNet' [19] are examples of this category. The network weights and feature activation maps of a CNN model at a specific layer, e.g., the CNN's last layer, can also be used as an effective saliency method for generating visual explanations. Class Activation Mapping (CAM) [20], which visually highlights the discriminative region for the image class prediction, is an example of this family.
In addition, the gradient, or a modified version of it, in the back-propagation algorithm can be employed to visualize the derivative of the CNN's output w.r.t. its input, as in Grad-CAM [11]. An improved method to produce input images that effectively activate a neuron was proposed in [21]. That work focused on generating class-specific saliency maps by performing gradient ascent in pixel space to reach a maximum. The synthesized image serves as a class-specific visualization that aids comprehension of how a given CNN models a class. Perturbation-based methods: here, the input is perturbed while keeping track of the resultant changes to the output. In some work, the change occurs at intermediate layers of the model.
The state-of-the-art RISE [12] algorithm belongs to this category. Meaningful Perturbations [22] optimizes a spatial perturbation mask that maximally affects a model's output, identifying which regions of an image most affect the output when perturbed. Approximation-based methods: methods of this class attempt to explain a complex black-box model by utilizing an easier-to-understand and more interpretable model such as decision trees or linear regression. Beyond these simple models, a good example that is widely applied to visual input is the LIME algorithm [10].

Evaluation of Explanation Methods
Since it is rather challenging to establish a unique and generalized evaluation metric that can be applied to any task, the authors in [14] proposed three different types of evaluations to measure the effectiveness of explanations. These are presented in the following. 1. Application-Grounded evaluation: explanations are evaluated on real tasks from the target application with domain experts in the loop. 2. Human-Grounded evaluation: simpler, generic tasks are evaluated with lay humans rather than domain experts. 3. Functionally-Grounded evaluation: this method utilizes numeric metrics or proxies such as 'local fidelity' to evaluate explanations across different applications. The main advantage of this evaluation is that it is free from human bias, which effectively saves time and resources. Most state-of-the-art methods fall into this category [19,22]. For example, the authors in [12] proposed the causal metrics insertion and deletion, which are independent of humans, to evaluate the faithfulness of XAI methods.

Adversarial Attacks
In the context of XAI, adversarial attack generators can be divided into 'white-box' attacks and 'black-box' attacks. The Fast Gradient Sign Method (FGSM) [24] and Projected Gradient Descent (PGD) [25] are well-known examples of white-box attacks. To establish greater trust, it is essential for XAI algorithms to be not only effective but also robust against adversarial attacks [15]. Analyzing how a black-box explanation method (like SIDU) can handle such a potential problem helps the end-user guard against a possibly disastrous outcome from the classifier when an adversarial attack is present.

SIDU: Proposed Method
Recent XAI methods have shown that deeper representations in CNN models capture higher-level visual features [5]. A recent approach, Grad-CAM [11], interprets the importance of each neuron responsible for a decision of interest by computing gradient information from the last convolutional layer of the CNN. Alternatively, the authors in [12] proposed a method titled RISE, which measures the effect of selectively inserting or deleting parts of the input (perturbation-based) on the CNN model's output prediction. Perturbation-based methods have been found to provide more accurate visual explanation saliency maps than gradient-based methods; however, they fail to visualize all the perturbations in order to determine which are the most relevant. To overcome the challenges of the most recent state-of-the-art methods, we propose an XAI method that consequently provides a better explanation of any given CNN model. The proposed method uses the last convolution layer to generate masks; from these masks, Similarity Difference and Uniqueness scores are computed to explain the CNN model's decision, hence the acronym SIDU. An overview of the proposed method is presented in Figure 2. Our method is composed of three steps. First, we extract the last convolution layer of the CNN to generate the feature image masks. Second, we compute the similarity differences for each mask with respect to a predicted class. Finally, we compute the weight of each mask and combine them into a final map that shows the explanation of the prediction. Each step is described in subsections 3.1-3.3. Note that the CNN model F used is the same for all steps.

Step 1: Generating Feature Activation Image Masks
To provide a visual explanation of the predicted output class c of a CNN model F, we first extract the N feature maps f_i^c of the model's last convolution layer. Each feature map is converted into a binary mask by thresholding: B_i^c(x, y) = 1 if f_i^c(x, y) > τ and 0 otherwise, where τ is the threshold. In our experiments we use τ = 0.5. Note that we found experimentally that choosing different threshold values in the mask binarization step has almost no effect on the final explanation heatmap of the input image. The binary mask B_i^c is then up-sampled by bi-linear interpolation to the size Width × Height of a given input image I. The up-sampled mask M_i^c has values in [0, 1] and is no longer binary. The up-sampled masks are also known as feature activation masks and are shown in Figure 3. Finally, point-wise multiplication is performed between the feature activation mask (up-sampled binary mask) M_i^c and the input image I to calculate the feature activation image mask A_i^c = M_i^c ⊙ I, where A_i^c is the feature activation image mask of feature map f_i^c and i = 1, ..., N. The procedure of generating feature activation image masks is shown in Figure 3, where we illustrate some of the feature activation image masks from the total number of masks N. The feature activation image masks A^c of object class c are used to obtain prediction scores, which is explained in detail in the following subsection.

Step 2: Computing Similarity Differences and Uniqueness

Once the prediction score vectors are computed for all feature activation image masks and the original input image, we compute the similarity difference between each feature activation image mask prediction score P_i^c and the prediction score P_org^c of the original input image I. The similarity difference between these two vectors gives the relevance of the feature activation image mask with respect to the original input image. The intuition behind computing the relevance of a feature map is to measure how the prediction changes if the feature is not known, i.e., the similarity difference between prediction scores.
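The mask-generation procedure of Step 1 can be sketched in plain NumPy as follows; the bilinear up-sampling helper and the default threshold are illustrative choices for this sketch, not the authors' exact implementation:

```python
import numpy as np

def upsample_bilinear(b, H, W):
    """Bilinear up-sampling of a 2-D map to (H, W) with plain NumPy."""
    h, w = b.shape
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = b[y0][:, x0] * (1 - wx) + b[y0][:, x1] * wx
    bot = b[y1][:, x0] * (1 - wx) + b[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def feature_activation_image_masks(feature_maps, image, tau=0.5):
    """SIDU Step 1: threshold each feature map, up-sample the binary
    mask to the input size, and point-wise multiply with the image."""
    h, w, n = feature_maps.shape
    H, W, _ = image.shape
    masks, image_masks = [], []
    for i in range(n):
        # Binary mask B_i: 1 where the activation exceeds the threshold tau.
        b = (feature_maps[:, :, i] > tau).astype(np.float64)
        # Up-sampled mask M_i takes values in [0, 1] (no longer binary).
        m = upsample_bilinear(b, H, W)
        masks.append(m)
        # Feature activation image mask A_i = M_i * I (point-wise).
        image_masks.append(image * m[:, :, None])
    return np.stack(masks), np.stack(image_masks)
```

Each returned image mask A_i is then fed through the model F to obtain its prediction score vector P_i.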
The relevance value of the feature activation image mask will be high if its prediction is similar to that of the predicted class, and low if dissimilar. The similarity difference measure between the prediction vector of the original input image I, P_org^c, and the i-th feature activation image mask prediction, P_i^c, is given by SD_i^c = exp(−||P_org^c − P_i^c||^2 / (2σ^2)), where σ is a controlling parameter. Note that P_i^c is the prediction vector for the feature activation image mask A_i^c generated from the last convolution layer of the CNN model F. This is illustrated in Figure 4. Moreover, the similarity measure is inspired by the Gaussian kernel function, which is a suitable metric for weighting observations, as opposed to the Euclidean distance: the kernel function decreases with distance and lies between zero and one, whereas the Euclidean distance increases with distance and provides only an absolute difference between two vectors. After computing the similarity difference measure, we also compute a uniqueness measure U^c between the prediction score vectors of the feature activation image masks. It is a popular assumption that image regions which stand out from the other regions grab our attention, and such regions should therefore be labeled as highly salient. We therefore evaluate how different each feature mask is from all other feature masks constituting an image. The reason is to suppress false regions with low weights and highlight, with higher weights, the actual regions responsible for the prediction. The uniqueness measure for the i-th feature image mask of object class c is U_i^c = (1/N) Σ_{j=1}^{N} ||P_i^c − P_j^c||, where N is the total number of feature activation image masks.
Finally, the feature importance weight W_i^c is computed as the product of the similarity difference SD_i^c and the uniqueness measure U_i^c, i.e., W_i^c = SD_i^c · U_i^c, where SD_i^c and U_i^c are the similarity difference and uniqueness values for the feature activation image mask A_i^c of the object class c. The total number of feature importance weights equals the total number of masks N. The feature importance weight is high for a feature that has more influence in predicting the actual object class c and low for a feature with little influence.
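Step 2 can be sketched as below, assuming the prediction score vectors have already been obtained by running the model on the original image and on each feature activation image mask; the default value of σ is an assumption for illustration:

```python
import numpy as np

def sidu_weights(p_org, p_masks, sigma=0.25):
    """SIDU Step 2: similarity difference (Gaussian kernel), uniqueness
    (mean pairwise distance), and their product as importance weights."""
    # Similarity difference SD_i = exp(-||P_org - P_i||^2 / (2 * sigma^2)).
    d = np.linalg.norm(p_masks - p_org[None, :], axis=1)
    sd = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    # Uniqueness U_i = (1/N) * sum_j ||P_i - P_j||.
    pairwise = np.linalg.norm(p_masks[:, None, :] - p_masks[None, :, :], axis=2)
    u = pairwise.mean(axis=1)
    # Feature importance W_i = SD_i * U_i.
    return sd * u
```

A mask whose prediction vector matches the original image's prediction receives SD close to one, so its weight is governed mainly by how unique its response is among the N masks.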

Step 3: Visual Explanations for the Prediction
To obtain the visual explanation (saliency map) of the predicted output class c of a CNN model F, we perform a weighted sum of the feature activation masks M_i^c with the feature importance weights: S^c = Σ_{i=1}^{N} W_i^c · M_i^c. The weighted combination of feature activation masks that produces the final visual explanation (saliency map) of the prediction for different classes is illustrated in Figure 6.
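The weighted combination of Step 3 can be sketched as follows; the [0, 1] normalisation for display is our addition:

```python
import numpy as np

def sidu_explanation(masks, weights):
    """SIDU Step 3: weighted sum of the N feature activation masks,
    normalised to [0, 1] for display as a heatmap."""
    sal = np.tensordot(weights, masks, axes=1)   # (H, W) saliency map
    sal -= sal.min()
    if sal.max() > 0:
        sal /= sal.max()
    return sal
```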

Evaluation
In this section we evaluate the performance of SIDU. We conducted a comprehensive set of experiments to study the correlation of the visual explanation with the model prediction in order to evaluate faithfulness. SIDU is evaluated using all three categories of evaluation detailed previously [14], i.e., Functionally-Grounded, Application-Grounded and Human-Grounded. The evaluation results were compared with the most recent state-of-the-art methods, namely RISE [12] and GRAD-CAM [11]. A good explanation method should not only provide an appropriate explanation for the prediction but also be robust against adversarial noise. To this end, the proposed method is additionally evaluated on adversarial samples and compared with the same state-of-the-art methods, RISE [12] and GRAD-CAM [11].

Functionally-Grounded evaluation
To perform the Functionally-Grounded evaluation we chose the two automatic causal metrics, insertion and deletion, as proposed by [12]. The deletion metric deletes the salient region in the image which is responsible for higher classification scores and forces the CNN model to change its decision. It estimates the decrease in the classification probability score as more pixels are removed from the salient region. For the deletion metric, a good explanation shows a sharp drop in the predicted score, and the area under the probability curve will be low. The insertion metric, conversely, measures the increase in the predicted score: as more pixels are inserted into the image, a higher Area Under Curve (AUC) indicates a more effective explanation. The procedure of computing AUC using insertion and deletion is illustrated in Figure 7. These metrics were selected since they are independent of human subjects and bias-free, and hence increase transparency when evaluating the XAI methods.
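The deletion metric can be sketched as below; `predict` is a hypothetical callable standing in for the black-box model's predicted-class probability, and pixels are "deleted" by setting them to zero (insertion works analogously, starting from a degraded image and re-inserting the most salient pixels first):

```python
import numpy as np

def deletion_auc(image, saliency, predict, n_steps=50):
    """Deletion metric: remove the most salient pixels first and record
    the model's predicted-class probability after every step. A sharp
    drop (low area under the curve) indicates a faithful explanation."""
    H, W = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]      # most salient first
    step = int(np.ceil(H * W / n_steps))
    img = image.copy()
    probs = [predict(img)]
    for k in range(n_steps):
        idx = order[k * step:(k + 1) * step]
        img.reshape(-1, img.shape[-1])[idx] = 0.0   # 'delete' the pixels
        probs.append(predict(img))
    # Trapezoidal area under the probability-vs-fraction-removed curve.
    return (np.sum(probs) - 0.5 * (probs[0] + probs[-1])) / n_steps
```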
In order to evaluate the performance of the SIDU explanation method we chose two datasets with different characteristics. The first is the ImageNet [27] dataset of natural images with 1000 classes, from which we used 2000 images randomly collected from the validation set. The other is the Retinal Fundus Image Quality Assessment (RFIQA) dataset from the medical domain, consisting of 9,945 images with two levels of quality, 'Good' and 'Bad'; the retinal images were collected from a large number of patients with retinal diseases [28]. To do a fair evaluation, we chose two existing standard CNN models, ResNet-50 [29] and VGG-16 [30], that had been pre-trained on the ImageNet dataset [27]. Table 1 summarizes the results obtained on ResNet-50 for the proposed method and compares it to the most recent works RISE [12] and GRAD-CAM [11].
It was observed that the proposed method achieved improved performance for both metrics, followed by RISE [12] and GRAD-CAM [11]. Table 1 also summarizes the results obtained on the VGG-16 model, where it can be seen that the proposed method, SIDU, achieved the best performance. In our proposed method, the generated masks come from the last feature activation maps of the CNN model; as a result, the final explanation map localizes the entire region of interest (object class), as illustrated in Figure 6.
We also conducted a second experiment on the medical image dataset, which has totally different characteristics. We trained the existing ResNet-50 [29], with two additional FC layers and a softmax layer, on the RFIQA dataset [28]; the CNN model achieved 94% accuracy. The proposed explanation method uses the trained model to explain the predictions on the RFIQA test subset of 1028 images. The results of the proposed method, RISE [12] and GRAD-CAM [11] are summarized in Table 2. We can observe that GRAD-CAM achieves a slightly higher AUC for insertion and a lower AUC for deletion, followed by SIDU. RISE [12] shows the lowest performance on both metrics. This can be explained by the fact that the RISE method generates N random masks, and the weights predicted for these masks can assign high importance to false regions, which makes the final RISE map noisy. The visual explanations of the proposed method (SIDU), RISE [12] and GRAD-CAM [11] on the RFIQA test dataset are shown in Figure 10 (b), (c), (d).

Human-Grounded evaluation
Human-Grounded evaluation is most appropriate when one aims at testing general notions of explanation quality. For generic applications in the AI domain, such as object detection and object recognition, it may therefore be sufficient to inspect the degree to which a non-expert human can understand the cause of a decision generated by a black-box model. One effective way to measure and compare the correlation between the visual explanations of a human subject and the black-box is to use an eye-tracker that records the non-expert subject's fixations in interactive test settings. This approach is chosen because of its similarity to XAI visual explanation methods: both generate heatmaps representing salient areas of an object in an image.
An eye-tracker was used to gather eye-tracking data from human subjects to gain an understanding of visual perception [31]. Using eye-tracking data to understand human visual attention is useful and has received great attention from UX researchers [32]. For example, the authors in [33] conducted an experimental study and gathered 'human attention' data in Visual Question Answering. In our study, we investigated how the explanations generated by non-expert subjects via the eye-tracker compare with those generated by XAI visual explanation methods on natural images for recognizing object classes. To this end, we follow the data collection protocol discussed in detail in section 4.2.1.

Database of eye tracking data
We randomly sampled 100 images from 10 different classes of the ImageNet [27] benchmark validation dataset. All the collected images are RGB and are resized to 224 × 224 pixels.

Data collection protocol
In order to collect eye-fixations, 5 human subjects participated in an interactive test procedure using a Tobii-X120 eye-tracker, with the following main steps: 1. The subject was seated in front of a computer screen, the eye-tracker was made ready to record visual fixations, and the system was calibrated.
2. Each image from the dataset was shown in a random order for 3 seconds and corresponding fixations of the subject were recorded.
3. We divided all 100 images into 4 equally sized blocks, with a break between blocks in order to reduce the burden on each subject.
We further added a fixation-cross image between two stimuli to reset the subject's visual fixation on the screen when changing from one image to the next.

Comparison Metrics
Evaluating the models against human fixations using only one metric is not enough to achieve a valid and reliable outcome [36]. We therefore used three metrics to compare the XAI- and eye-tracker-generated heatmaps [37]: (1) Area Under Curve (AUC), (2) Kullback-Leibler Divergence (KL-DIV) and (3) Spearman's Correlation Coefficient (SCC). 2. Kullback-Leibler Divergence (KL-DIV): KL-DIV is a metric used to measure the dissimilarity between two probability density functions [37]. For evaluating the XAI methods, the eye-fixation maps and the visual explanation maps produced by the model are used as the two distributions: FM represents the heatmap probability distribution from the eye-tracking data, and EM the visual explanation map probability distribution. These distributions are normalized as FM = FM / (Σ_X FM + ε) and EM = EM / (Σ_X EM + ε), where X is the number of pixels and ε is a regularization constant to avoid division by zero. The KL-DIV measure is computed between these two distributions to determine whether the visual explanation map computed by the XAI method matches human fixations. It is a non-linear measure and generally ranges from zero to infinity. The lower the KL-DIV between EM and FM, the better EM approximates the human eye-fixation map FM.
3. Spearman's Correlation Coefficient (SCC): Spearman's correlation is a non-parametric measure that analyses how well the relationship between two variables can be described using a monotonic function [38]. It is a statistical method mainly used for measuring the correlation or dependency between two variables. The metric varies between −1 and 1, where a score of 0 represents no correlation. The SCC between two variables is high when observations have similar ranks (with a correlation close to 1) and low when observations have dissimilar ranks (with a correlation close to −1) [38].
It is an appropriate measure for both continuous and discrete ordinal variables [38]. FM represents the heatmap from the eye-tracking data, whereas EM is the visual explanation map. The SCC between the two maps FM and EM is given by SCC = cov(EM, FM) / (σ(EM) σ(FM)), where cov(EM, FM) is the covariance of EM and FM, and σ(EM) and σ(FM) are the standard deviations of EM and FM, respectively.
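The two distribution-comparison metrics above can be sketched as follows; the ε default and the simple rank transform (which does not average tied ranks) are simplifying assumptions of this sketch:

```python
import numpy as np

def kl_div(em, fm, eps=1e-7):
    """KL divergence between an explanation map EM and a fixation map FM.
    Both maps are first normalised into probability distributions; eps is
    the regularization constant that avoids division by zero."""
    p = em / (em.sum() + eps)   # EM distribution
    q = fm / (fm.sum() + eps)   # FM distribution
    # Lower KL-DIV means EM better approximates the human fixation map.
    return float(np.sum(q * np.log(eps + q / (p + eps))))

def spearman_cc(em, fm):
    """Spearman correlation: Pearson correlation computed on the
    rank-transformed pixel values of the two maps."""
    def ranks(x):
        order = np.argsort(x.ravel())
        r = np.empty(order.size, dtype=float)
        r[order] = np.arange(order.size)
        return r
    re_, rf = ranks(em), ranks(fm)
    cov = np.mean((re_ - re_.mean()) * (rf - rf.mean()))
    return float(cov / (re_.std() * rf.std()))
```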

Comparing SIDU and state-of-the-art methods with human attention for recognizing object classes
In this experiment, we use the eye-tracking recordings on ImageNet images described in section 4.2.1 to generate and evaluate the explanations produced by the XAI algorithms. To this end, we first generate ground-truth heatmaps by applying Gaussian distributions to the human eye-fixations. These heatmaps are then compared with the XAI heatmaps using the AUC, SCC and KL-DIV evaluation metrics. We finally calculate the mean AUC, SCC and KL-DIV over all the images in the dataset. Table 3 summarizes the results obtained by SIDU and the two state-of-the-art XAI methods, RISE [12] and GRAD-CAM [11], on our ImageNet eye-tracking data.
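Generating a ground-truth heatmap from recorded fixations can be sketched as below, with the Gaussian width σ (in pixels) as an assumed parameter:

```python
import numpy as np

def fixation_heatmap(fixations, H, W, sigma=20.0):
    """Ground-truth heatmap: place a 2-D Gaussian at each recorded
    eye-fixation (row, col) and sum the contributions."""
    ys, xs = np.mgrid[0:H, 0:W]
    heat = np.zeros((H, W))
    for fy, fx in fixations:
        heat += np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2.0 * sigma ** 2))
    # Normalise to [0, 1] so the map is comparable with XAI heatmaps.
    return heat / heat.max() if heat.max() > 0 else heat
```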
We can observe that SIDU outperforms GRAD-CAM and RISE on all three metrics. Therefore, we can conclude that SIDU's explanations are a closer match to the human explanations (heatmaps) for recognizing the object class. This is further illustrated by the example explanations in Figure 9.

Application-Grounded evaluation
Application-Grounded evaluation involves conducting experiments within a real application to assess the trust of the black-box models. We chose a medical case as the test application, namely the task of retinal fundus image quality assessment [28]. The application is used in screening for retinal diseases, where poor-quality retinal images do not allow an accurate medical diagnosis. Generally, in sensitive domains such as clinical settings, the domain experts (here, clinicians) are skeptical about explanations generated by AI diagnostic tools in cases involving high risk.
In our experimental setup at a local hospital, two ophthalmologists participated in testing to evaluate which visual explanation resulted in more trust and aligned best with the actual physical examination performed in the clinic.
This experiment assesses the effectiveness of the proposed method, with respect to state-of-the-art methods, in terms of localizing the exact region for predicting the retinal fundus image quality. Here, the visual explanation heatmaps generated by the RISE algorithm were used for comparison, following a similar setting as discussed in [11], i.e., using both the proposed SIDU method and RISE.

Robustness Against Adversarial Attacks

To perform this experiment, we chose one of the most successful classes of white-box attacks, namely gradient-based attacks; the Fast Gradient Sign Method (FGSM) [24] and Projected Gradient Descent (PGD) [25] are examples of such attacks.
PGD is an iterative application of FGSM, making it more complex and time-consuming. Therefore, FGSM was selected for its simplicity and effectiveness. The adversarial image is generated with FGSM by adding noise to the original image.
The direction of this noise is the same as the gradient of the cost with respect to the input data. The amount of noise is controlled by a coefficient ε.
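This update can be sketched as follows; obtaining the input gradient `grad` requires a backward pass through the attacked model and is framework-specific, so it is assumed given here:

```python
import numpy as np

def fgsm_attack(image, grad, eps=0.1):
    """FGSM: step of size eps in the direction of the sign of the cost
    gradient w.r.t. the input, then clip back to a valid image range."""
    adv = image + eps * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)
```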
When this coefficient is applied properly, it changes the model's predictions while remaining undetectable to a human observer. Figure 11 shows different levels of FGSM adversarial noise added to an original image. Two different experiments were conducted using adversarial noise to demonstrate the effectiveness of SIDU compared to the state-of-the-art methods RISE and GRAD-CAM. The noise levels were chosen so as to pass unnoticed by the human eye. We extracted the visual explanation heatmaps using the proposed method SIDU, RISE [12] and GRAD-CAM [11].
In the first experiment, the heatmaps generated by SIDU, RISE and GRAD-CAM were compared with the human-generated visual explanations from the eye-tracker, as described in section 4.2.1, using the three evaluation metrics AUC, SCC and KL-DIV. We also observe that the performance of all XAI methods decreases as the noise level increases. In the second experiment, we extracted the visual explanation heatmaps using the proposed method (SIDU), RISE [12] and GRAD-CAM [11] as applied to the original images without noise and with noise ε = 0.1. The heatmaps generated for the noisy images were then compared with the visual explanations of the original images, again using AUC, SCC and KL-DIV, to see how far the explanations deviate when adversarial noise is added. Table 6 summarizes the mean AUC, SCC and KL-DIV results obtained by the XAI methods. From the table we can observe that SIDU outperforms GRAD-CAM and RISE for all three evaluation metrics. From Figure 12, it can be observed that the proposed method (SIDU) does not deviate in localizing the object class responsible for the prediction. From these two adversarial noise experiments it can therefore be concluded that the proposed method exhibits higher robustness against adversarial noise.

Conclusion and Future work
In this work, a novel method titled 'Similarity Difference and Uniqueness' (SIDU) was proposed for explaining CNN models. Although various experimental studies for evaluating XAI methods were conducted, we acknowledge that the experiments involving an eye-tracker are limited to single-object classification over ten classes. This is due to the various methodological challenges associated with eye-tracking (e.g., subject training, hardware calibration, etc.) that make it difficult to access subjects who are willing to participate in data collection for several different scenarios. However, we believe that demonstrating the potential of generating valid and reliable explanations via user interaction with an eye-tracker holds great value for the research community. Future work involves extending SIDU to spatio-temporal CNN models to provide visual explanations for video application tasks such as video classification and action recognition. Furthermore, we will explore the possibility of extending our method to explain decisions made by other neural network architectures (e.g., LSTMs and Vision Transformers) and in other domains (e.g., Natural Language Processing). We also aim to extend our eye-tracking experimental evaluation to multi-object classification tasks. Our code is available at: https://github.com/satyamahesh84/SIDU_XAI_CODE.