Abstract

Explainable artificial intelligence is proposed to provide explanations for the reasoning performed by artificial intelligence systems. There is no consensus on how to evaluate the quality of these explanations, since even the definition of explanation itself is not clear in the literature. In particular, for the widely known local linear explanations, there are qualitative proposals for the evaluation of explanations, although they suffer from theoretical inconsistencies. The case of images is even more problematic, where a visual explanation may seem to explain a decision while it really only detects edges. There is a large number of metrics in the literature specialized in quantitatively measuring different qualitative aspects, so we should be able to develop metrics capable of measuring the desirable aspects of explanations in a robust and correct way. Some previous papers have attempted to develop new measures for this purpose. However, these measures suffer from a lack of objectivity or a lack of mathematical consistency, such as saturation or lack of smoothness. In this paper, we propose a procedure called REVEL to evaluate different aspects concerning the quality of explanations, with a theoretically coherent development that does not have the problems of the previous measures. This procedure offers several advances over the state of the art: it standardizes the concept of explanation and develops a series of metrics that allow not only comparing explanations with each other but also obtaining absolute information about the explanation itself. The experiments have been carried out on four image datasets as benchmarks, where we show REVEL’s descriptive and analytical power.

1. Introduction

In recent years, artificial intelligence (AI) has experienced a huge development, providing solutions to many real-life problems. Unfortunately, these systems remain characteristically opaque, which is known as the black-box problem. To tackle the comprehension of the black box, several explainable AI (XAI) techniques have been proposed [1]. In general, the aim is to extract knowledge from black-box models so that they become understandable by a human, but also to show the risks of not using the XAI perspective [2].

In the literature, there is a clear separation between model-agnostic and model-specific explanations. Explanations designed to be agnostic do not require knowledge of the model’s internal structure [3, 4]. One of the most widely used and simplest is the local linear explanation (LLE).

All proposed explanations are based on different notions of what constitutes an explanation and, therefore, are not directly comparable. In the literature, there are several proposals to compare explanations. In [5], different desirable qualitative aspects of an explanation are proposed, without including ways to measure them. In [6], the LEAF framework is proposed, designed for the evaluation and comparison of explanations. This framework has four different metrics to evaluate different desirable qualitative aspects of explanations. However, these metrics have several design inconsistencies which make them incomplete and biased.

Although there are different measurement proposals, there is no consensus in the XAI literature on how to evaluate explanations, since there is no definition of what constitutes a good explanation [7]. Moreover, these measures have theoretical inconsistencies, and although they are useful for comparing explanations, they do not provide absolute information on the explanation itself. Therefore, a set of robust metrics that are theoretically correct and that represent characteristic behaviors of the method in practice is necessary. We also want to emphasize the difficulty of analyzing the different factors that inherently modify the explanation, such as the specific task covered by an AI or the type of data on which the explanation is generated.

Although there is no consensus within the literature on how we should create or even measure explanations, there are different state-of-the-art tools available that, combined with robust mathematical development, can provide a more generalizable and reliable analysis of the black-box generated explanations.

This work focuses on the proposal of the REVEL framework (Robust Evaluation VEctorized Local-linear-explanation), whose main contribution is to offer a consistent and theoretically robust analysis of the black-box generated explanations, as well as being useful at a practical level for the evaluation of explanations. REVEL takes advantage of the existing state of the art and develops a series of theoretical improvements on the generation and evaluation methods. In addition, it redefines and proposes different quantitative measures to robustly assess different qualitative aspects of the explanations. These measures emerge naturally and are well defined, so that we can extract not only comparative information among explanations but also get an absolute idea about the quality of an explanation on its own.

Although the theoretical study is generalizable to any kind of data and any kind of task, we focus on image classification in order to simplify the final discussion of the article. In addition, it is easier to work with images for the purpose of the analysis in this article, since it is simpler to generate different numbers of features with this data type.

The experimental section has been designed to show the analytical and descriptive potential of REVEL. We have designed three different scenarios in which to use REVEL. These scenarios are as follows:
(i) We analyze, within LIME, how much the number of black-box evaluations affects the quality of the explanations.
(ii) Within LIME, we also analyze how the number of features into which we split an image affects the explanations.
(iii) We compare the two well-known state-of-the-art black-box explanation generators, LIME and SHAP, to demonstrate the comparative capability of REVEL.

The rest of the paper is organised as follows. Section 2 provides a survey of the motivations and basic concepts of LLEs and describes the two main methods that we will compare, LIME and SHAP. Section 3 proposes the REVEL framework and highlights its strengths with respect to other methods of evaluating explanations at a theoretical level. Section 4 develops a generic experimental pipeline for the comparison of explanations, which we use in Section 5 to compare different aspects of LIME and SHAP on four image classification benchmarks. Finally, the concluding remarks and future work are reported in Section 6.

This paper is based on a preprint version published in [8].

2. Preliminaries: Considerations to Generate Local Linear Explanations

In this section, we review the type of explanations named LLE, also called feature importance models, additive feature attribution methods, or linear proxy models. These methods are called LLE because they are a local linear approximation of the black box. We focus on LLEs because of their rigor and simplicity, which helps when developing possible metrics. Other more complex explanations would make this task more difficult.

This section starts with a theoretical description of LLEs and describes the two state-of-the-art LLE methods, LIME and SHAP. We then discuss four fundamental aspects of generating feature-importance explanations: the different notions of importance and how to compare them, how to generate the neighborhood of examples used for the LLE regression, considerations about the type of data we work on, and the specific task we tackle.

2.1. Local Linear Explanations

Formally, let $X$ be the input dataset. Let $f: X \rightarrow Y$ be the original black-box model, where $m$ is the dimension of the output space $Y$. Previous works define the explainer as a function that relies on just one output component, but in tasks such as nonbinary classification problems, the model output is a vector of probabilities where each component depends on all the others. Let $x \in X$ be the input to be explained. A white-box LLE explainer is a function $g$ defined as

$g(z) = Wz + b;$

in other words, $g$ is a linear map from the feature space to the output space.

Intuitively, the weights of both $W$ and $b$ are linked to the importance of each feature. More precisely, each weight $W_{ij}$ of the matrix $W$ is linked to the importance of feature $j$ to output $i$. Also, each bias $b_i$ is linked to the general importance of output $i$.

The different LLE methods fit $g$ by linear regression, minimizing the weighted error

$\sum_{z \in N(x)} \pi_x(z)\,\|f(z) - g(z)\|^2,$

where the selection of the weight function $\pi_x$ depends on each particular method. Another factor to consider is how the neighbors are sampled. The original proposals consider a Bernoulli experiment for each feature, that is, each feature has the same probability of being present in the generated neighbor. On the other hand, there are newer proposals that consider a smart perturbation generation [9], where examples that contribute more to the explainability white-box model are more likely to be generated. For each LLE method, we use the sample-wise approach.
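In practice, this weighted regression can be solved with off-the-shelf tools. The following minimal sketch shows the general shape of the fit; `black_box`, `sample_neighbors`, and `kernel_weights` are hypothetical placeholders for the method-specific pieces, not functions of any particular library:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_lle(black_box, x_binary, sample_neighbors, kernel_weights, n_samples=800):
    """Fit a local linear explanation g(z) = W z + b around x_binary.

    black_box:        maps a binary presence vector z to an output vector f(z).
    sample_neighbors: draws n_samples binary vectors around x_binary.
    kernel_weights:   assigns the method-specific weight pi_x(z) to each neighbor.
    """
    Z = sample_neighbors(x_binary, n_samples)        # (n_samples, F) binary matrix
    Y = np.stack([black_box(z) for z in Z])          # (n_samples, m) black-box outputs
    w = kernel_weights(x_binary, Z)                  # (n_samples,) regression weights
    reg = Ridge(alpha=1.0)                           # weighted least squares with L2 penalty
    reg.fit(Z, Y, sample_weight=w)
    return reg.coef_, reg.intercept_                 # W with shape (m, F) and b with shape (m,)
```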

2.2. Models of Local Linear Explanations: LIME and SHAP

Having explained what LLEs are, we describe the two main state-of-the-art LLE methods, local interpretable model-agnostic explanations (LIME) and Shapley additive explanations (SHAP). Although both are LLEs, they have clear differences in how they perform the black-box regression. Below, we describe how each method works and the main differences between them.

2.2.1. LIME

The LIME method [10] adopts the concept of local importance, which means that a feature that produces significant changes in the neighborhood of $x$ is considered very important. Therefore, features that are important for the classification of $x$ but do not produce significant changes in its neighborhood end up being discarded as important features.

Formally, LIME builds an LLE model $g$ by linear regression over a neighborhood $N(x)$ of the original datapoint $x$. The definition of this neighborhood is not trivial, due to the different nature of each dataset. In order to find the LLE $g$, LIME fits a ridge regression to $f$ on $N(x)$ by weighted linear least squares with the default kernel

$\pi_x(z) = \exp\left(-\frac{d(x, z)^2}{\sigma^2}\right),$

where $d$ is the Euclidean distance and $\sigma$ is a regularization factor (the kernel width).

The neighborhood $N(x)$ is generated by sampling, for each neighbor, a value from an exponential distribution governed by $\sigma$, the parameter selected for the LIME kernel; in the hypothetical case where this value exceeds $F$, the total number of features, it is capped at $F$. The sampled value determines how many features are randomly selected to be excluded in this sample.
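A minimal sketch of this kernel and of one neighbor draw is shown below; the exact distribution and capping rule used by LIME are an assumption on our part, so the sampler should be read as illustrative only:

```python
import numpy as np

def lime_kernel(x, z, sigma):
    """Exponential kernel on the Euclidean distance between x and a neighbor z."""
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(z, float))
    return np.exp(-(d ** 2) / sigma ** 2)

def lime_sample_neighbor(x_binary, sigma, rng):
    """Draw one neighbor: choose how many features to drop (here, from an
    exponential distribution tied to sigma, capped at F) and drop them at random."""
    F = len(x_binary)
    k = min(int(np.ceil(rng.exponential(scale=sigma))), F)
    z = np.array(x_binary, copy=True)
    z[rng.choice(F, size=k, replace=False)] = 0
    return z
```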

2.2.2. SHAP

The SHAP method [11] considers a feature to be important for the classification of an example if it produces significant changes when compared to background values.

Formally, SHAP builds an LLE model $g$ by computing the contribution of each feature to the prediction from a game theory approximation. This method tries to find the LLE $g$ as a regression with the SHAP kernel, defined as

$\pi(z') = \frac{F - 1}{\binom{F}{|z'|}\,|z'|\,(F - |z'|)},$

where $z'$ is a binary vector representing the presence of each of the $F$ features in the example, $|z'|$ is the number of features present, and $\binom{F}{|z'|}$ is the number of ways of choosing $|z'|$ elements from $F$ possibilities without replacement.

This method can obtain an exact explanation if we evaluate all the possible feature coalitions, that is, $2^F$ evaluations of the black box $f$. As the number of evaluations required grows exponentially with the number of features, this nonstochastic approximation is unaffordable. That is why the general use of this method also relies on a stochastic approximation, generating a list of different examples and solving the linear ridge regression as LIME does.

The generation of the neighborhood is performed by sampling a value $k$ from a discrete random variable $K$ whose distribution is

$P(K = k) \propto \binom{F}{k}\,\pi_k,$

where $\pi_k$ is the SHAP kernel weight of a coalition excluding exactly $k$ features; that is, $K$ assigns to each $k$ a probability proportional to the total weight that SHAP assigns to all the instances that exclude exactly $k$ features. The sampled value $k$ is then used to select at random $k$ features to exclude in this sample.
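The following sketch illustrates the SHAP kernel weight and the sampling of the number of excluded features described above; how the degenerate coalitions (all features present or all absent) are handled is an implementation detail we leave out:

```python
import numpy as np
from math import comb

def shap_kernel(F, n_present):
    """SHAP kernel weight for a coalition with n_present of the F features present."""
    return (F - 1) / (comb(F, n_present) * n_present * (F - n_present))

def shap_sample_num_excluded(F, rng):
    """Sample k, the number of features to exclude, proportionally to the total
    kernel weight of all coalitions that exclude exactly k features."""
    ks = np.arange(1, F)                  # k = 0 and k = F are left to the exact part
    weights = np.array([comb(F, k) * shap_kernel(F, F - k) for k in ks])
    return int(rng.choice(ks, p=weights / weights.sum()))
```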

2.3. How to Define Features for LLE in Nontabular Data

For an explanation based on feature importance, it is very important to define what a feature is. In tabular data, a feature is defined naturally by the dataset itself. However, other types of data do not have this convenience, e.g., time series or images. In the case of time series, the minimum amount of information is obtained at each measurement timestep. In the case of images, we get it from each pixel. This has several associated problems:
(i) Generating exact explanations becomes an unaffordable task. In the case of SHAP, for $F$ features, $2^F$ evaluations of the black box are needed to generate the nonprobabilistic explanation. A generic ImageNet image has a size of $224 \times 224$ pixels, resulting in $2^{224 \times 224}$ black-box evaluations in SHAP. Even in its probabilistic versions, a regression needs a large number of these evaluations to be reliable.
(ii) Explanations lose perspective. For a human being, a single pixel means nothing. In order to make a meaningful explanation, several pixels must be grouped together.

To solve these problems, some works use a division of the image into squares of the same size [12] while others use an unsupervised segmentation method to generate larger segment-size features [13].
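As an illustration, a minimal sketch of the square-patch approach, which simply records the top-left corner of each non-overlapping patch as one interpretable feature:

```python
def patch_features(img, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches; each patch is
    one interpretable feature, identified by the (row, col) of its top-left corner."""
    H, W, _ = img.shape
    return [(i, j) for i in range(0, H - patch_size + 1, patch_size)
                   for j in range(0, W - patch_size + 1, patch_size)]
```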

2.4. How to Explain with LLE in Different Machine Learning Tasks

To explain an artificial intelligence model, it is necessary to take into account the task for which the model has been designed.
(i) In the regression task, each element of the output can be explained separately. Thanks to this, no output depends on any other and a separate analysis can be performed.
(ii) In the classification task, the output is usually a vector of probabilities with clear constraints that must be satisfied (each element must be greater than or equal to 0 and the sum of all of them must be 1). Furthermore, it is not just the class into which the example is classified that matters, but also the degree of certainty with which it is classified into each class. Since the outputs are dependent on each other in this case, a joint analysis of the output must be carried out.
(iii) In the clustering task, an explanation can be carried out simply by some example or by some rule for each cluster [14]. Therefore, it is necessary to unify the concept of explanation within the clustering task.

Therefore, for each specific task, a different method of explanation must be developed. From now on, we focus on the task of classification, described formally in the following.

2.4.1. Classification Task Specifications

Let $g(z) = Wz + b$ be a local linear white-box model defined over the logit space. We define the signed importance matrix $A$ as the derivative matrix of $g$ over the logit space. It should be noted that $A = W$.

To obtain the probability vector, we need to apply the softmax function, that is, $p(z) = \mathrm{softmax}(g(z))$. We define $P = D(\mathrm{softmax} \circ g)(x)$, where $D$ is the derivative operator.

The $(i, j)$ components of the matrices $A$ and $P$ refer to the importance of feature $j$ for class $i$ over the logit and probability spaces, respectively.

Both matrices give us important and complementary information about the behavior of the white box $g$. The matrix $A$ gives us absolute information about how the logits of all classes respond to the original features. Additionally, the matrix $P$ gives us information about the classes that the example is potentially most likely to be classified as, disregarding the least likely ones. This may provide apparently contradictory information, as we show in the following example:
(i) Let $g$ be the white-box linear model of a multiclass problem with three classes on the logit regression and let $x$ be the original example, with logit vector $g(x)$ and associated probability vector $\mathrm{softmax}(g(x))$.
(ii) We now consider $x'$, a neighbor of $x$ with a perturbation on one feature, which produces a new logit vector $g(x')$ and probability vector $\mathrm{softmax}(g(x'))$.
(iii) If we consider exclusively the logit approximation, the change may be interpreted as the feature influencing classes 1 and 2 positively and class 3 negatively, with approximately the same intensity.
(iv) If we consider exclusively the probability approximation, the same feature may have a positive influence on class 1, a negative influence on class 2, and, much less significantly, a negative influence on class 3.

From a global viewpoint, each perspective has its own impact on the analysis. Thus, we define a new matrix $E$, the importance matrix, obtained from the matrices $A$ and $P$ as

$E_{ij} = \mathrm{sign}(A_{ij})\,\sqrt{|A_{ij}|\,|P_{ij}|},$

which combines the information of both matrices $A$ and $P$: it has the sign of the logit matrix and the geometric mean of the intensities of importance of both matrices.

From the importance matrix $E$, we define the relative importance matrix $R = E / \max_{i,j} |E_{ij}|$, the normalized matrix that maintains 0 as 0 and maps the value with the greatest absolute value to 1 or $-1$, depending on the original sign of that specific value.

We define the absolute importance matrix as $|R|$, the matrix whose coefficients are the terms of $R$ in absolute value, that is, $|R_{ij}|$ for each coefficient $R_{ij}$ of matrix $R$. Each term $|R_{ij}|$ is the absolute importance of feature $j$ for class $i$.
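The following sketch summarizes the construction of these matrices for a white box $g(z) = Wz + b$; the notation ($A$, $P$, $E$, $R$) follows the definitions above:

```python
import numpy as np

def importance_matrices(W, b, x):
    """Build the importance matrices of a local linear white box g(z) = W z + b
    (rows are classes, columns are features), following the definitions above."""
    logits = W @ x + b
    s = np.exp(logits - logits.max()); s /= s.sum()    # softmax probabilities at g(x)
    J = np.diag(s) - np.outer(s, s)                    # Jacobian of softmax at g(x)
    A = W                                              # signed importance over the logit space
    P = J @ W                                          # importance over the probability space
    E = np.sign(A) * np.sqrt(np.abs(A) * np.abs(P))    # sign of A, geometric mean of magnitudes
    R = E / np.max(np.abs(E))                          # relative importance, largest |value| -> +-1
    return A, P, E, R, np.abs(R)                       # last one: absolute importance matrix
```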

2.5. Proposed Frameworks to Compute LLE: Qualitative and Quantitative Approaches

All proposed explanations are based on different notions of what constitutes an explanation and, therefore, are not directly comparable. In the literature, there are several proposals to compare explanations. In [15], another set of metrics is proposed to measure the quality of explanations; however, they are specialized in rule-based explanations. In [6], the LEAF framework is proposed, also with four different metrics to evaluate explanations agnostically, independently of the explanation generation method, and it offers a practical example of their use by evaluating the quality of different explanations. However, the theoretical development of this framework is not mathematically consistent, which leads to biased conclusions. On the one hand, its qualitative measures are not objective; on the other hand, its poorly calibrated measures, in which the worst or best score is not achieved only by the worst or best option, respectively, result in saturations that equalize part of the examples, removing relevant information. In addition, it is desirable that these measures have good mathematical properties, such as smoothness, so that there are no anomalies in practical cases that were not considered.

It is in this scenario where the need for a mathematically consistent and unbiased explanation evaluation framework arises. In addition, this framework must also provide a measure not only comparative but also giving an absolute idea of the good behavior of the explanation itself.

3. REVEL Framework

In this section, we propose a new explanation evaluation framework called REVEL, presenting five new metrics for assessing the quality of an explanation. In particular, for each proposed metric, we describe the qualitative aspect we want to measure and show how the formal definition measures this aspect. We also provide a guideline on how to interpret the metric. Finally, for each qualitative aspect, we make a theoretical comparison of our metric with other proposed metrics.

In Table 1, we summarize the metrics we propose and the qualitative aspect they measure.

3.1. Local Concordance

There are LLE methods that guarantee that the white-box explanation and the black-box model match on the specific datapoint. However, these methods have a strong computational constraint, since they require a large number of evaluations of the black-box model. Other methods do not ensure the coincidence between the white-box explanation and the black-box model. Since the concordance between both is not guaranteed, it is possible that they propose different classes, which means that the proposed explanations end up being inconsistent. We want to measure how similar the explanation and the model are.

On the classification task of more than two classes, it is also necessary to consider jointly the whole probability vector. Our proposal also attempts to measure the smoothness from the min to the max concordance values, that is, only the min concordance should have a score of 0 and the max concordance should have a score of 1 on this metric.

We define the local concordance between the explanation $g$ and the black box $f$ at $x$ in terms of vector distances among probability vectors:

$\mathrm{local\ concordance}(g, x) = 1 - \frac{\|f(x) - g(x)\|}{C},$

where $\|\cdot\|$ is a chosen norm (1-norm, 2-norm, inf-norm, …) and $C$ is the maximum distance between two possible probability vectors. This maximum exists and is reached because the probability space is compact and the norm is continuous. Moreover, $C$ is computed as $C = \|p_1 - p_2\|$, where $p_1 = (1, 0, \ldots, 0)$ and $p_2 = (0, 1, 0, \ldots, 0)$, regardless of the norm.

This metric has the following qualities:
(i) Using C as the normalization factor makes our score well defined in the interval [0, 1], with the maximum concordance achieving 1 and the minimum concordance achieving 0, regardless of the number of classes in the dataset.
(ii) This metric considers the whole probability vector jointly and not just one coordinate of the probability vector.
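A minimal sketch of this metric, computing the normalization factor $C$ from two one-hot probability vectors as described above:

```python
import numpy as np

def local_concordance(f_x, g_x, ord=2):
    """1 - ||f(x) - g(x)|| / C, where C is the largest possible distance between
    two probability vectors (realised by two one-hot vectors, whatever the norm)."""
    n = len(f_x)
    p1, p2 = np.zeros(n), np.zeros(n)
    p1[0], p2[1] = 1.0, 1.0
    C = np.linalg.norm(p1 - p2, ord=ord)
    return 1.0 - np.linalg.norm(np.asarray(f_x, float) - np.asarray(g_x, float), ord=ord) / C
```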

3.1.1. Guideline

This metric measures how similar the explanation is to the black box in the original example. It is very important that this metric is close to 1. Otherwise, the proposed explanation does not explain what happens in the example itself.

3.1.2. Comparison

The analogous LEAF metric, local concordance, is defined through the hinge loss function [16]. In contrast to our proposal, the use of the hinge function makes it nonsmooth. It also does not ensure that only the maximum discordance reaches the worst value of the metric. In conclusion, the LEAF proposal has inconsistencies that our proposal overcomes.

3.2. Local Fidelity

Local fidelity treats the problem not as a classification task but as a regression one. The main idea of this metric is how closely the white box $g$ approximates the probabilities obtained by the black box $f$. We propose the mean concordance between the probabilities of $f$ and $g$ over the neighborhood $N(x)$, that is,

$\mathrm{local\ fidelity}(g, x) = \frac{1}{|N(x)|} \sum_{z \in N(x)} \left(1 - \frac{\|f(z) - g(z)\|}{C}\right).$

This metric is an extension of the local concordance on $x$ to its neighborhood $N(x)$. It is also well defined on the interval [0, 1], regardless of the number of classes in the dataset.
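A minimal sketch, reusing the `local_concordance` helper from the previous sketch; `f` and `g` are assumed to map a neighbor to its probability vector:

```python
def local_fidelity(f, g, neighborhood, ord=2):
    """Mean local concordance between the black box f and the white box g over N(x)."""
    return sum(local_concordance(f(z), g(z), ord) for z in neighborhood) / len(neighborhood)
```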

3.2.1. Guideline

This metric measures the similarity between the explanation and the black box in the neighborhood. This metric is essential to check that the tendency of the explanation is similar to the tendency of the black box. It must be close to 1 to obtain a good explanation.

3.2.2. Comparison

The analogous LEAF metric proposes to evaluate the resemblance between the white-box explanation and the black-box model in the proposed neighborhood $N(x)$ using the F1 metric.
(i) The LEAF proposal is a measure designed to evaluate classification problems. Since $N(x)$ is a neighborhood of $x$, most examples will, by continuity, be of the same class as $x$, resulting in an imbalance in $N(x)$.
(ii) This metric presents problems at decision borders. In a binary problem with threshold 0.5, let $z$ be an example of $N(x)$ where $f(z)$ and $g(z)$ fall just on opposite sides of the threshold (e.g., 0.49 and 0.51). The F1 metric will penalize this example, while actually the white box $g$ mimics almost perfectly the undecidability of the black box $f$.

Our proposal has no problem with the imbalanced dataset generated by $N(x)$ for the metric evaluation. Also, our metric is not biased by a threshold selection.

3.3. Prescriptivity

The main idea of prescriptivity is to test whether the white-box explanation has correctly predicted the changes needed in the original example in order to change the original class.

Mathematically, let $x$ be the original example, $f$ be the black-box model, $g$ be the white-box model mimicking $f$, and $x'$ be the example obtained by applying to $x$ the changes needed to change the class predicted by the white box $g$. We propose the following prescriptivity metric:

$\mathrm{prescriptivity}(g, x) = 1 - \frac{\|f(x') - g(x')\|}{C},$

where $C$ is a normalization factor, the same one used in the local concordance metric.

In our proposal, $x'$ is obtained by removing, one at a time, the presence of the most important positive features of the class predicted by the white box $g$ on the example $x$. The algorithm ends when $g$ assigns a different class to $x'$ than to $x$, that is, $\arg\max g(x') \neq \arg\max g(x)$.

This metric has the following properties:
(i) This prescriptivity proposal is defined as a vectorized proposal, so the metric has a global view of the whole output.
(ii) This metric obtains the maximum value 1 when the vectors $f(x')$ and $g(x')$ are equal and obtains the minimum value 0 when both vectors are in the maximum possible disagreement on this prescriptivity scenario. It is designed not to depend on the dimensions of the probability vectors either, so the metric is independent of the number of classes in the dataset.
(iii) This metric is not dependent on a boundary selection, nor is it dependent on a specified neighborhood $N(x)$.
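The following sketch illustrates the greedy construction of $x'$ and the scoring step; it reuses the `local_concordance` helper and assumes the importance matrix is indexed as classes by features. It is an illustrative reading of the procedure above, not a reference implementation:

```python
import numpy as np

def prescriptivity(f, g_predict, importance, x_binary, ord=2):
    """Greedily occlude the features most positively important for the class
    predicted by the white box until its prediction changes, then score how
    closely f and g agree on the modified example x'."""
    cls = int(np.argmax(g_predict(x_binary)))
    x_new = np.array(x_binary, copy=True)
    for feat in np.argsort(-importance[cls]):       # most important features first
        if importance[cls, feat] <= 0:
            break                                   # only positively important features are removed
        x_new[feat] = 0
        if int(np.argmax(g_predict(x_new))) != cls:
            break                                   # the white box now predicts another class
    return local_concordance(f(x_new), g_predict(x_new), ord)
```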

3.3.1. Guideline

Prescriptivity challenges the explanation to propose an example far enough from the original to change the prediction of the model, but without losing predictive quality at that point. Indirectly, each explanation proposes an example $x'$ different from the original example $x$ whose prediction must be markedly different from that of $x$. Although the best possible score for this metric is 1, it is understandable that explanations do not reach it, and the metric serves more as a comparison between different explanation methods.

3.3.2. Comparison with LEAF

The prescriptivity metric is formally proposed in LEAF for a binary classification problem, where a fixed decision boundary is chosen. This decision boundary is the set of points in the domain whose prediction by the white box $g$ is exactly 0.5.

In the LEAF proposal, $x'$ is obtained as the closest projection of our example $x$ onto this boundary. In reality, this is only possible if the selected features are real-valued. In the case of binary data, this projection cannot be achieved because each feature cannot take an arbitrary real value. The proposal is also dependent on the selection of a boundary value.

LEAF proposes as its prescriptivity metric a function based on the hinge loss and a normalization factor, so that 1 means that $x'$ lies at the boundary and 0 means that $x'$ is at the furthest possible distance from the boundary. One may observe that, by taking the absolute value, the measure counts both overshooting and undershooting the boundary as a loss of prescriptivity.

The LEAF proposal has different problems:
(i) This metric is designed for a single output variable. For classification problems, it is usual to obtain a vector of probabilities whose components are linked to each other and whose analysis must be done jointly.
(ii) Choosing a fixed boundary value does not guarantee the change of class in nonbinary classification problems. In a classification problem with more than two classes, the majority class could have a 50% probability and the other classes could share the rest of the probability equally. This results in a neighbor of $x$ whose changes do not change the original class.
(iii) The proposed norm is restricted to the interval [0, 1], but not smoothly. Even if a normalization parameter is used, it is not clear whether only the maximum possible disagreement results in a score of 0 on this metric, or whether that score is even reachable. It is reasonable for this kind of metric to guarantee that the maximum disagreement obtains 0 as the worst score and that, as agreement increases, the metric increases smoothly up to 1, the maximum score.

Our proposal does not show all of the different problems detected in the LEAF prescriptivity proposal, since our metric jointly measures the full probability vector, is not boundary dependent, and is well defined in the interval [0, 1], where it changes smoothly from worst case to the best one.

3.4. Conciseness

The conciseness measure aims to evaluate the brevity of the explanation: the fewer relevant features our explanation has, the more concise it is.

We propose the following conciseness metric based on the absolute importance matrix, in particular on the importance vector of each feature. Let $v_j$ be the importance vector of feature $j$, whose coefficients form the $j$-th column of the absolute importance matrix, and let $r_j \in [0, 1]$ be its norm, normalized so that the number of classes is taken into account. We define the conciseness of the explanation proposed by the white box $g$ as

$\mathrm{conciseness}(g) = \frac{\sum_{j=1}^{F} (1 - r_j)}{F - 1},$

which can be described as the mean irrelevance of the features. If we considered $r_j$ instead of $1 - r_j$, we would have the mean relevance of the features and the most concise method would obtain the lowest score; that is why we have reversed this term.

This metric has the following qualities:
(i) It rewards the use of few features with a high weight.
(ii) It gives us a general idea of how many features are important in the white box.
(iii) The best possible score is obtained if we have only one feature with absolute importance 1 and the rest with 0 absolute importance, in which case we obtain 1 as conciseness. The worst case is obtained when all the features have 1 as absolute importance, in which case we obtain 0 as conciseness. In addition, the number of classes has been taken into account, so that, regardless of the number of classes in the dataset, these maximum and minimum values are reachable.
(iv) We can compare explanations with different numbers of features taken into account.
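A minimal sketch of this metric; how the per-feature relevance is aggregated over classes is not fully specified above, so taking the maximum over classes is an assumption of this sketch:

```python
def conciseness(abs_importance):
    """Mean irrelevance of the features, rescaled so that a single fully important
    feature scores 1 and all features being fully important score 0.
    abs_importance: absolute importance matrix (classes x features), values in [0, 1]."""
    F = abs_importance.shape[1]
    relevance = abs_importance.max(axis=0)   # per-feature relevance (here: max over classes)
    return (F - relevance.sum()) / (F - 1)
```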

3.4.1. Guideline

This metric evaluates the ability of the explanation to focus on the most important features of an example and discard the less important ones. Depending on the complexity of the explanation we want, we may prefer greater or lesser conciseness. For instance, in image classification, we may want the explanation to dismiss a large part of the image, but not a single pixel to explain the complete decision of the model.

3.4.2. Comparison

LEAF proposes conciseness as a constraint for explanations, where it requires that explanations use exclusively $k$ features. In the case of LIME, conciseness is a variable that we supply to the algorithm so that it restricts itself to choosing a given number of features with nonzero importance. On the other hand, in the case of SHAP, the algorithm uses by default all available features and gives each of them an importance. In order to compare both methods, the LEAF framework proposes to select a default conciseness parameter k, the number of features to be used in the white-box explanation, and to restrict both LIME and SHAP to use the top-k most important features.

As mentioned in the previous paragraph, the proposed conciseness is not a metric but a constraint on white-box explanation models. Moreover, the LEAF proposal does not let the white-box models decide whether a particular decision has been influenced by more or fewer features.

Our proposal, instead of a constraint, provides a metric to evaluate the conciseness of each white-box explanation.

3.5. Robustness over Explanations

A key point to consider is the variability of the methods used to generate explanations. It is desirable that independent explanations generated by the same method be as similar as possible, since very different or even contradictory explanations would lead to mistrusting the method. In the case of deterministic methods, this is ensured since there is just one proposed explanation. In the case of nondeterministic methods, there are several proposed explanations, and therefore, we need to ensure that the explanations do not differ from or even contradict each other.

To measure how two explanations $e_1$ and $e_2$ differ, we propose two possible measures:
(i) First, we propose the cosine similarity between $R_1$ and $R_2$, the relative importance matrices of $e_1$ and $e_2$, respectively:

$\mathrm{sim}_{\cos}(R_1, R_2) = \frac{\langle R_1, R_2 \rangle}{\|R_1\|\,\|R_2\|},$

where $\langle \cdot, \cdot \rangle$ is the scalar product.
(ii) The metric proposed before, based on the cosine similarity, takes into account the direction of the matrices $R_1$ and $R_2$ but not their magnitude. To take the magnitude also into account, we propose a second similarity measure that considers both the direction and the magnitude of the explanations. In the case of equal magnitudes, this similarity function is exactly the cosine similarity. In the case of different magnitudes, it gives a lower score than the cosine similarity for a positive scalar product; for a negative scalar product, its absolute value is also lower than that of the cosine similarity. In the case of perpendicular explanation vectors, both measures give a score of 0.

In both cases, we propose as robustness the mathematical expectation of the chosen similarity between two different explanations $e_1$ and $e_2$, that is,

$\mathrm{robustness} = \mathbb{E}_{e_1, e_2 \in \mathcal{E}}\left[\mathrm{sim}(e_1, e_2)\right],$

where $\mathcal{E}$ is the set of all explanations that could be proposed by a certain explanation method, such as LIME or SHAP. The expectation can be approximated by generating a given number of explanations and computing the mean of the pairwise similarities among them.
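The following sketch computes both similarities and the approximated robustness. The exact form of the magnitude-aware similarity is an assumption on our part, chosen so that it satisfies the properties listed above (it equals the cosine for equal magnitudes and shrinks toward 0 otherwise):

```python
import numpy as np
from itertools import combinations

def similarity(R1, R2, use_magnitude=True):
    """Similarity between two relative importance matrices: plain cosine, or a
    variant that also shrinks when the two magnitudes differ."""
    v1, v2 = R1.ravel(), R2.ravel()
    dot = float(v1 @ v2)
    if use_magnitude:
        return 2.0 * dot / (v1 @ v1 + v2 @ v2)
    return dot / (np.linalg.norm(v1) * np.linalg.norm(v2))

def robustness(explanations, use_magnitude=True):
    """Mean pairwise similarity over explanations generated by the same method."""
    pairs = list(combinations(explanations, 2))
    return sum(similarity(a, b, use_magnitude) for a, b in pairs) / len(pairs)
```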

These metrics have the following qualities:
(i) Both take into account the weights of all features, so two explanations $e_1$ and $e_2$ choosing a different most important feature are penalized by both metrics.
(ii) The second metric takes into account the magnitude of the importance matrix.
(iii) As both robustness metrics use bounded similarity functions, they consistently achieve the maximum and minimum possible values in the best and worst case scenarios, respectively, regardless of the number of classes in the dataset.

3.5.1. Guideline

This metric does not evaluate a specific explanation but the method that generates them. All deterministic methods will score 1 in this metric since they always generate the same explanation. Therefore, this metric is designed to evaluate the robustness of nondeterministic methods. The closer this metric is to 1, the less the explanations generated by this method vary. It should be noted that this metric, due to the way it is designed, can give negative scores, which would indicate that the proposed explanations are contradictory.

3.5.2. Comparison

LEAF proposes the reiteration similarity metric, which measures how much two explanations generated by the same method vary by measuring the difference between the top-k features over several explanations.
(i) This metric depends directly on the conciseness constraint of the LEAF proposal.
(ii) This metric does not consider the magnitude of a feature's importance, since disagreements on very important features and on barely relevant ones are penalized equally.
(iii) This metric does not penalize choosing a “positive” important feature as “negative” and vice versa. Two different explanations can use the same feature in their explanations but attribute positive importance to it in the first explanation and negative importance in the second, which is a clear contradiction. The reiteration similarity proposal does not see this case as a contradiction and does not penalize it.

Our proposed robustness metric does not depend on external constraints and does not have the shortcomings described above while still measuring the variation between generated explanations.

4. Experimental Setup

In this section, we describe the experimental setup used in this work. The objective of this experimental section is to show how to implement the proposed measures, not to demonstrate the best performance of our measures compared to others; that demonstration must be done at the theoretical level, as we do in Section 3. We first select four image datasets as benchmarks, on which we train the models to be explained. We then fix some hyperparameters to compare different LLE aspects with the REVEL framework.

4.1. Benchmark Selection

The datasets selected as benchmarks are CIFAR10 [17], CIFAR100 [18], FashionMNIST [19], and EMNIST-balanced [20], which is a benchmark already used in [2] for explainability tasks. Table 2 shows a short description of each dataset.

4.2. General Training Pipeline

For this experiment, we chose the EfficientNet-B2 model [21] pretrained on the ImageNet dataset. The network has then been fine-tuned on each benchmark dataset for 100 epochs, with 32 images per batch, using the Adam optimizer [22] with learning rate 1e − 5, weight decay = 0.001, and AMSGrad = True. We randomly selected 10% of the training set as a validation subset, which is held out from training. Among the models obtained over the 100 epochs, we select the one with the best performance on this validation subset. As the objective of this work is the analysis of the metric behavior, we will not go deeper into the training of the network and we set these parameters as default. In Table 3, we show the performance obtained by the model on the test sets of the datasets used as benchmarks.
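The configuration above can be reproduced, for instance, with torchvision; the paper does not state which implementation was used, so the following sketch is only illustrative, and `num_classes` depends on the benchmark:

```python
import torch
from torchvision.models import efficientnet_b2, EfficientNet_B2_Weights

num_classes = 10  # e.g., CIFAR10 or FashionMNIST; 100 for CIFAR100, 47 for EMNIST-balanced

# EfficientNet-B2 pretrained on ImageNet, with the classifier head replaced for the benchmark.
model = efficientnet_b2(weights=EfficientNet_B2_Weights.IMAGENET1K_V1)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             weight_decay=0.001, amsgrad=True)
criterion = torch.nn.CrossEntropyLoss()
```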

4.3. Local Linear Explanation Pipeline

In this section, with the purpose of generating a fair comparison, we fix as default some shared hyperparameters of the LLE generation models, explained below. A sketch of the feature occlusion step follows the list.
Number of neighbors (N): for each example of the test split, we generate a number N of neighbor examples to explain the original example; the values of N considered are those studied in Section 5.2.
Neighbor generation (N(x)): we use a smart perturbation generator, where each neighbor is generated with a probability proportional to the weight associated with it in each explanation generation method.
Number of explanations generated (E): for each LLE method and each instance to be explained, we generate 5 different explanations.
Number of features of each image (F): we divide each image into square patches of equal size, so that each image has F features; the values of F considered are those studied in Section 5.1.
Feature occlusion: to set a feature as occluded, we change the original patch from its original value to a neutral grey patch, that is, we set all pixels of the patch to 0.5 on each RGB channel.
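As an illustration of the occlusion step, the following sketch greys out the patches marked as absent in a binary presence vector; `corners` is assumed to come from a patch-splitting helper such as the one sketched in Section 2.3, and pixel values are assumed to be normalized to [0, 1]:

```python
import numpy as np

def occlude(img, z, corners, patch_size):
    """Return a copy of img (H, W, 3), values in [0, 1], where every patch whose
    entry in the binary presence vector z is 0 is replaced by neutral grey (0.5)."""
    out = np.array(img, dtype=float, copy=True)
    for present, (i, j) in zip(z, corners):
        if not present:
            out[i:i + patch_size, j:j + patch_size, :] = 0.5
    return out
```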

4.4. On the Comparison between LEAF and REVEL

This paper presents REVEL as a proposal of theoretically robust measures for the evaluation of LLE explanations. The comparison with other measurement proposals, such as LEAF, should be carried out theoretically and not practically, since the measurements offered by the different proposals are not directly related to each other. That is why the comparison in this work is made exclusively at the level of the theoretical proposal and not on practical use cases.

5. Assessing Explanations Using REVEL: Use Cases

In this section, we propose three different scenarios in which REVEL can be used, thus demonstrating its analytical potential. These scenarios are as follows:
(i) Dependence of LIME on the number of features (Section 5.1): in this scenario, we study how much the number of patches into which we divide the original image influences the explanations, and whether there is an ideal partition of the images.
(ii) Dependence of LIME on the number of black-box evaluations (Section 5.2): in this scenario, we analyze the number of black-box evaluations needed to generate a good-quality explanation. We also evaluate the trade-off between the quality of an explanation and the time needed to generate it.
(iii) LIME vs. SHAP (Section 5.3): we compare the results obtained by the two state-of-the-art explanation generator models, LIME and SHAP, with the best configuration determined by the above scenarios. This scenario provides an idea of which explanation generator can offer better explanations depending on their scores in each of the proposed metrics.

To better support our analysis of the experiments in each of the previous scenarios, we use the Shapiro test and the Wilcoxon test from the SciPy stats library [23]. In particular, the Shapiro test has been used to check whether each experiment follows a normal distribution and thus whether using a parametric test is appropriate (Tables 4–6). Since, with the exception of a single experiment in Table 4, we can discard that the distributions are normal, we use the nonparametric Wilcoxon test to check whether the distributions of different experiments have essentially different results. We considered a p value of 0.05 to discard the null hypothesis. For each use case, the Wilcoxon test tables used to perform the comparisons are cited.
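A minimal sketch of this statistical check with SciPy, applied to the per-example scores of two experimental configurations:

```python
from scipy import stats

def compare_configurations(scores_a, scores_b, alpha=0.05):
    """Shapiro normality check for each configuration's per-example scores,
    followed by a paired Wilcoxon test between the two configurations."""
    normal_a = stats.shapiro(scores_a).pvalue >= alpha
    normal_b = stats.shapiro(scores_b).pvalue >= alpha
    different = stats.wilcoxon(scores_a, scores_b).pvalue < alpha
    return normal_a, normal_b, different
```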

5.1. Dependence of LIME on the Number of Features

In this section, we compare how LIME performs with different numbers of features. This comparison allows us to perform both a general study and a study focused on the image data type. At a general level, we analyze how the number of features influences the quality of the explanation. In the case of images, we use this study to determine the granularity that performs best.

5.1.1. Local Concordance

In Figure 1, we note that, as a tendency, the local concordance score increases as more features are used. As the number of features increases, the explanation method has more parameters to fit; therefore, the model improves at mimicking the black box on the original example. The differences remain significant (Tables 7–10).

5.1.2. Local Fidelity

In Figure 2, we note a tendency similar to that of local concordance, that is, local fidelity increases the more features we use. This is natural, since the neighbors on which we evaluate local fidelity are closer to the original example the more features we use. The differences remain significant except between 121 and 144 features (Tables 11–14).

5.1.3. Prescriptivity

In Figure 3, in contrast to the local concordance and local fidelity metrics, a different pattern arises: as the number of features increases, the prescriptivity metric gets worse. Prescriptivity not only evaluates how well the explanation mimics the black box in areas near the original example but also evaluates the changes proposed by the white box. The fewer the features considered in the explanation, the fewer the changes necessary to change the predicted class; thus, the explanation has fewer problems finding the features needed for the class to change. On the other hand, we must pay attention to where the differences in this metric are significant. In CIFAR10, FashionMNIST, and EMNIST, the results are significantly different, while in CIFAR100, there are cases where they are not (Tables 15–18).

5.1.4. Conciseness

In Figure 4, we note a tendency for conciseness to increase as the granularity increases. However, we observe that, before this increase, conciseness decreases at 64 features. This seems to indicate that the higher the number of features, the better the performance; however, it can also be interpreted as overfitting of the explanation: the minimum amount of information that can be obtained from the image may correspond to separating it into 64 different features, with higher granularities overfitting the model. It should be added that in almost all cases there are significant differences (Tables 19–22). Even so, a study with images of various resolutions should be done, because the result could depend on the information contained in each patch.

5.1.5. Robustness

In Figure 5, we observe that the more features the models use, the more unstable the method becomes. Having more features to evaluate leads to more uncertainty in the choice of explanations. All the experiments present significant differences (Tables 23–26).

5.1.6. Global Conclusion

We observe that the higher the number of features, the better the local performance. This is an expected result, since it is biased by the neighborhood we have chosen to calculate the local fidelity. Therefore, we should focus on the rest of the metrics. In the prescriptivity calculation, we see that the more features we use, the worse the result obtained. In contrast, the more features we use, the more concise the methods are, discarding more unimportant features. Finally, we observe that LIME loses robustness the more features we use. This is due to the fact that the more features we use, the more likely it is that the explanation will use a larger set of features.

5.2. Dependence of LIME on the Number of Black-Box Evaluations

In this section, we evaluate how important the number of black-box evaluations is for the LIME method. This study is critical, since black-box evaluations are considered the biggest bottleneck of black-box explainability methods. Although it is desirable to be able to evaluate the black-box function as many times as possible, there must be a trade-off between the quality of the explanation and the time it takes to generate it.

5.2.1. Local Concordance

In Figure 6, we can see that increasing the number of black-box evaluations does not change the local concordance score significantly. Also, if we look at absolute values, we realize that we obtain notably high values. This is due to the fact that the sampling used by LIME is very stable in picking neighbors close to the original example. The fact that most experiments show no significant differences between them corroborates this statement (Tables 27–30).

5.2.2. Local Fidelity

In Figure 7, we see that, in this case, the more evaluations of the black box, the better the result. We may expect that by randomly generating more neighbors, we obtain a better score in the neighborhood of the original example. However, as with the local concordance metric, the differences between the experiments are not significant, corroborating the hypothesis that LIME is very stable near the example to be explained (Tables 31–34).

5.2.3. Prescriptivity

In Figure 8, we observe that the number of evaluations is not a differentiating factor. LIME proposes a series of changes that consistently change the prediction of the model by approximately the same amount. This hypothesis is corroborated by the fact that the experiments show no significant differences between them (Tables 35–38).

5.2.4. Conciseness

In Figure 9, we observe that the conciseness metric is influenced by the number of evaluations of the black box, which makes it less variable. Thus, LIME methods propose on average the same percentage of important features, although increasing the number of evaluations tends to produce less variable results, which is the main goal of increasing the maximum number of black-box evaluations. Differences between experiments end up being significant when there is a sufficiently large difference in the number of evaluations between one experiment and another (Tables 39–42).

5.2.5. Robustness

In Figure 10, we observe that as the number of black-box evaluations increases, LIME methods become more consistent, although at the cost of using more computational time. Depending on the desired robustness or the time limit requirements, we can estimate how much an explanation can change. All experiments show significant differences between them (Tables 43–46).

5.2.6. Global Conclusion

In this case, the robustness metric is the one that stands out the most. Such results are expected, since the more examples we use from the neighborhood, the less variable the generated explanation will be. Thanks to this analysis, we will be able to see the time cost associated with a particular level of robustness.

5.3. LIME vs. SHAP: General Analysis over the Explanation Generators

In this section, we evaluate the performance on each proposed metric of the LLE methods: LIME with different values of the kernel width $\sigma$ and SHAP, in its local and global versions. For this comparison, we considered the results of the above scenarios to choose the best number of features and the maximum number of black-box evaluations. In our case, we pick 64 features and 800 black-box evaluations.

5.3.1. Local Concordance

In Figure 11, we show the performance of the local concordance metric over all datasets. We observe that LIME with larger $\sigma$ performs worse. The $\sigma$ parameter controls the width of the generated neighborhood, making the original example less relevant. On the other hand, local SHAP and global SHAP obtain stable results, comparable to those obtained by LIME with small $\sigma$, because in each SHAP regression the relative importance of the original example remains constant with respect to the rest of the generated neighbors. Most of the LIME experiments show significant pairwise differences. However, we cannot discard that local SHAP and global SHAP behave in the same way (Tables 47–50).

5.3.2. Local Fidelity

In Figure 12, we note the same behavior for the LIME methods as for the local concordance metric, i.e., the score of this metric decreases as $\sigma$ increases, since the larger the generated neighborhood, the less importance is given to the direct surroundings of the example. We also note that the SHAP methods obtain worse results than LIME, which would mean that the behavior of SHAP gets worse as it moves away from the original example. Most of the LIME experiments show significant pairwise differences. However, we cannot discard that local SHAP and global SHAP behave in the same way (Tables 51–54).

5.3.3. Prescriptivity

In Figure 13, we note that the different LIME methods show similar performance regardless of $\sigma$, with slight variations between datasets. On the other hand, there is a noticeable loss for local SHAP. This is partly due to the fact that SHAP gives significant weight to the original example when there is a large number of features and does not extrapolate to more distant examples. Global SHAP, in turn, performs slightly worse than the LIME methods: it pays attention not only to the examples closest to the original one but also to the farthest possible examples. In CIFAR10 and FashionMNIST, all experiments show significant pairwise differences; however, this is not the case in CIFAR100 or EMNIST. Local SHAP and global SHAP always behave with significant differences (Tables 55–58).

5.3.4. Conciseness

In Figure 14, we note that the LIME methods behave similarly across the different configurations, obtaining slightly different results depending on the dataset. On the other hand, the global SHAP method shows worse results, which tells us that global SHAP spreads its attention over too many features. In contrast, local SHAP obtains a score comparable to the different LIME configurations, which means that both methods spread their attention over almost the same number of features. In this case, all the datasets have experiments with significant differences, except CIFAR10 in the experiments with $\sigma$ greater than 4 (Tables 59–62).

5.3.5. Robustness

In Figure 15, we note that the best scores are obtained in this case by the SHAP models. This is due to the fact that the SHAP methods choose neighbors in a stable way. The LIME methods generate examples less stably as we increase the $\sigma$ parameter: increasing $\sigma$ also increases the size of the neighborhood and, therefore, the diversity of the generated neighbors. All experiments show significant differences with the rest of the experiments (Tables 63–66).

5.4. Global Analysis and Lessons Learned

Once we have analyzed the performance of each metric separately, we can extract lessons learned about each of the methods evaluated thanks to the auditing potential of the REVEL framework.
(i) SHAP: it focuses too much on the concrete example to be explained and does not generalize well in the synthetic neighborhood. Local concordance is good, although local fidelity, in comparison with LIME, is worse than expected, and the prescriptivity results are very poor. Although SHAP methods are very stable, as we observe in the robustness metric, we may establish, in conjunction with the previous conclusions, that they are in fact methods whose neighborhood is too small and that they therefore use almost the same examples every time to generate explanations.
(ii) Global SHAP vs. local SHAP: the main difference between local and global SHAP is found in prescriptivity and conciseness. Local SHAP is able to discard unimportant features, while global SHAP hardly does so. The reason for this behavior is that local SHAP uses only the neighborhood near the instance to be analyzed, while global SHAP also uses instances of completely empty images except for some particular patches. In other types of data, this approach is correct (e.g., in tabular data, to see whether any particular feature biases the overall result), but in the case of images, an almost entirely grey image does not give much information.
(iii) LIME: this method focuses on the local neighborhood of the example to be explained. We observe that the $\sigma$ parameter establishes the size of the neighborhood and that, as it increases, LIME obtains worse results in the local environment but has greater generalization power. We deduce this because the local concordance and local fidelity metrics worsen with increasing $\sigma$, while prescriptivity remains stable or even increases. The increase in neighborhood size also results in slightly more attention being paid to diverse features and causes a more diverse generation of neighbors, as we see in the conciseness and robustness metrics, respectively.

In conclusion, we may establish that SHAP focuses too much on the example to be explained while LIME is able to generalize better on these datasets.

Finally, the most important lesson learned is the exhaustive and mathematically robust study we performed for the development of REVEL. Thanks to this study, we have not only been able to establish comparative measures between explanations but also show that these measures serve as absolute measures, without the need to compare with others.

6. Concluding Remarks

In this paper, we present REVEL, a novel framework specialized in analysis and comparison of explanations. We provide a theoretical guideline for the use of REVEL. We also provide a practical illustration of usage of REVEL by comparing LIME and SHAP methods in four different benchmarks.

As lessons learned, we want to remark that having bounded metrics with well-defined limits gives us absolute information on every evaluation aspect and not only a comparative one. This is useful to dismiss explanations by themselves even if there is no baseline to compare with. For the development of future metrics, this characteristic is desirable.

Regarding the developed metrics themselves, we can extract the following lessons. Local metrics can help us to detect biases; together with prescriptivity, conciseness provides information about whether an explanation is useful or not through the percentage of discarded features; and robustness provides information on the stability of the explanations.

From the above analysis, we can establish that, within the black-box methods of explanation proposals over the image classification task, LIME behaves better than SHAP because SHAP focuses too much on the locality of the example to be explained, while LIME is able to generalize much better.

Once the method of explanation has been chosen for a particular model, we emphasize that the analysis should not stop there but analyze different aspects such as the number of features considered or the number of evaluations of the black box necessary for a robust explanation.

Finally, we consider that the base case on which to work rigorously in XAI is LLE. As future work, and based on the study already done, we leave the extension of the proposed metrics to other types of explanations, such as those based on decision trees or knowledge graphs.

Data Availability

The image data supporting this study are from previously reported studies and datasets, which have been cited. Datasets are available from the torchvision datasets library.

Disclosure

This paper is based on the 2022 preprint “REVEL Framework to Measure Local Linear Explanations for Black-Box Models: Deep Learning Image Classification Case of Study,” available at the following link: https://arxiv.org/abs/2211.06154 [8].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Spanish Ministry of Science and Technology under project PID2020-119478GB-I00 financed by MCIN/AEI/10.13039/501100011033. This work was also partially supported by the Contract UGR-AM OTRI-6717, the Contract UGR-AM OTRI-5987, and project P18-FR-4961 of Proyectos I+D+i Junta de Andalucia 2018. The hardware used in this work is supported by the project with reference EQC2018-005084-P granted by Spain’s Ministry of Science and Innovation and the European Regional Development Fund (ERDF) and the project with reference SOMM17/6110/UGR granted by the Andalusian “Consejería de Conocimiento, Investigación y Universidades” and the European Regional Development Fund (ERDF).