XAI-TRIS: non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance

The field of ‘explainable’ artificial intelligence (XAI) has produced highly acclaimed methods that seek to make the decisions of complex machine learning (ML) methods ‘understandable’ to humans, for example by attributing ‘importance’ scores to input features. Yet, a lack of formal underpinning leaves it unclear what conclusions can safely be drawn from the results of a given XAI method, and has so far also hindered the theoretical verification and empirical validation of XAI methods. This means that challenging non-linear problems, typically solved by deep neural networks, presently lack appropriate remedies. Here, we craft benchmark datasets for one linear and three different non-linear classification scenarios, in which the important class-conditional features are known by design, serving as ground truth explanations. Using novel quantitative metrics, we benchmark the explanation performance of a wide set of XAI methods across three deep learning model architectures. We show that popular XAI methods are often unable to significantly outperform random performance baselines and edge detection methods, attributing false-positive importance to features with no statistical relationship to the prediction target rather than to truly important features. Moreover, we demonstrate that explanations derived from different model architectures can differ vastly and are thus prone to misinterpretation even under controlled conditions.


Introduction
Only recently, a trend towards the objective empirical validation of XAI methods using ground truth data has been observed (Tjoa and Guan, 2020; Li et al, 2021; Zhou et al, 2022; Arras et al, 2022; Gevaert et al, 2022; Agarwal et al, 2022). These studies are, however, limited in the extent to which they permit a quantitative assessment of explanation performance, in the breadth of XAI methods evaluated, and in the difficulty of the posed 'explanation' problems. In particular, most published benchmark datasets are constructed in a way such that realistic correlations between class-dependent (e.g., the foreground or object of an image) and class-agnostic (e.g., the image background) features are excluded. In practice, such dependencies can give rise to features acting as suppressor variables. Briefly, suppressor variables have no statistical association with the prediction target on their own, yet including them may allow an ML model to remove unwanted signals (noise), which can lead to improved predictions. In the context of image or photography data, suppressor variables could be parts of the background that capture the general lighting conditions. A model can use such information to normalize the illumination of the object and, thereby, improve object detection. More details on the principles of suppressor variables can be found in Conger (1974); Friedman and Wall (2005); Haufe et al (2014); Wilming et al (2022). Here we adopt the formal requirement that an input feature should only be considered important if it has a statistical association with the prediction target, or is associated with it by construction. In that sense, it is undesirable to attribute importance to pure suppressor features.
Yet, Wilming et al (2022) have shown that some of the most popular model-agnostic XAI methods are susceptible to the influence of suppressor variables, even in a linear setting. Using synthetic linearly separable data defining an explicit ground truth for XAI methods and linear models, Wilming et al (2022) showed that a significant amount of feature importance is incorrectly attributed to suppressor variables. They proposed quantitative performance metrics for an objective validation of XAI methods, but limited their study to linearly separable problems and linear models. They demonstrated that methods based on so-called activation patterns (that is, univariate mappings from predictions to input features), based on the work of Haufe et al (2014), provide the best explanations. Wilming et al (2023) took this one step further and presented a minimal two-dimensional linear example, analytically showing that many popular XAI methods attribute arbitrarily high importance to suppressor variables. However, it is unclear to what extent these results transfer to various non-linear settings.
Thus, well-designed non-linear ground truth data comprising realistic correlations between important and unimportant features are needed to study the influence of suppressor variables on XAI explanations in non-trivial settings, which is the purpose of this paper. We go beyond existing work in the following ways: First, we design one linear and three non-linear binary image classification problems, in which different types and combinations of tetrominoes (Golomb, 1996), overlaid on a noisy background, need to be distinguished. In all cases, ground truth explanations are explicitly known through the location of the tetrominoes. Apart from the linear case, these classification problems require (different types of) non-linear predictive models to be solved effectively.
Second, based on signal detection theory and optimal transport, we define three suitable quantitative metrics of 'explanation performance' designed to handle the case of few important features.
Third, using three different types of background noise (white, correlated, imagenet), we invoke the presence of suppressor variables in a controlled manner and study their effect on explanation performance.
Fourth, we evaluate the explanation performance of no less than sixteen of the most popular model-agnostic and model-specific XAI methods, across three different machine learning architectures.
Finally, we propose four model-agnostic baselines that can serve as null models for explanation performance.

Data generation
For each scenario, we construct an individual dataset of 64 × 64 images, D = {(x^(n), y^(n))}_{n=1}^N, consisting of i.i.d. observations x^(n) ∈ R^D, y^(n) ∈ {0, 1}, where the feature dimension is D = 64² = 4096 and N = 40,000. Here, x^(n) and y^(n) are realizations of the random variables X and Y, with joint probability density function p_{X,Y}(x, y).
In each scenario, we generate a sample x^(n) as a combination of a signal pattern a^(n) ∈ R^D, carrying the set of truly important features that form the ground truth for an ideal explanation, with some background noise η^(n) ∈ R^D. We follow two different generative models depending on whether the two components are combined additively or multiplicatively.

Additive generation process
For additive scenarios, we define the data generation process for the n-th sample as follows. The signal pattern a^(n) = a(y^(n)) carries differently shaped tetromino patterns depending on the binary class label y^(n) ∼ Bernoulli(1/2). We apply a 2D Gaussian spatial smoothing filter H : R^D → R^D to the signal component to smooth the integration of the pattern's edges into the background, with smoothing parameter (spatial standard deviation of the Gaussian) σ_smooth = 1.5. The Gaussian filter H can technically provide infinite support to a^(n), so in practice we threshold the support at 5% of the maximum level. White Gaussian noise η^(n) ∼ N(0, I_D), representing a non-informative background, is sampled from a multivariate normal distribution with zero mean and identity covariance I_D. For each classification problem, we define a second background scenario, denoted CORR, in which we apply a separate 2D Gaussian spatial smoothing filter G : R^D → R^D to the noise component η^(n). Here, we set the smoothing parameter to σ_smooth = 10. The third background type consists of samples from the ImageNet database (Deng et al, 2009), denoted IMAGENET. We scale and crop images to be 64 × 64 px in size, preserving the original aspect ratio.
Each 3-channel RGB image is converted to a single-channel gray-scale image using the built-in Python Imaging Library (PIL) functions and is zero-centered by subtraction of the sample's mean value.
As alluded to below, we also analyze a scenario in which the signal pattern a^(n) undergoes a random spatial rigid-body transformation (translation and rotation) R^(n) : R^D → R^D. All other scenarios use the identity transformation R^(n) = I. Signal and background components are then normalized by the Frobenius norms of A and E, the matrices collecting the signal and noise components of all samples, where the Frobenius norm of a matrix A is defined as ‖A‖_F = (Σ_{n,i} A_{n,i}²)^{1/2}. Finally, a weighted sum of the signal and background components is calculated,

x^(n) = α R^(n)(H a^(n)) / ‖A‖_F + (1 − α) G η^(n) / ‖E‖_F,   (1)

where the scalar parameter α ∈ [0, 1] determines the signal-to-noise ratio (SNR).
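The additive process can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's exact pipeline: the binary `pattern` argument stands in for the tetromino signal a^(n), smoothing is implemented as a separable convolution, and normalization is applied per sample rather than via the dataset-wide matrices A and E.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable 2D Gaussian smoothing, numpy only."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    # convolve rows, then columns
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)
    return img

def additive_sample(pattern, alpha, sigma_signal=1.5, sigma_noise=None, rng=None):
    """One additive sample: alpha-weighted sum of smoothed signal and noise.

    Per-sample Frobenius normalization is a simplification of the
    dataset-wide normalization described in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = gaussian_smooth(pattern.astype(float), sigma_signal)
    # threshold the smoothed pattern's support at 5% of its maximum
    signal[signal < 0.05 * signal.max()] = 0.0
    noise = rng.standard_normal(pattern.shape)   # WHITE background
    if sigma_noise is not None:                  # CORR background
        noise = gaussian_smooth(noise, sigma_noise)
    signal /= np.linalg.norm(signal)             # Frobenius norms
    noise /= np.linalg.norm(noise)
    return alpha * signal + (1 - alpha) * noise
```

Setting `sigma_noise=10` reproduces the CORR background variant; leaving it at `None` yields white noise.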

Multiplicative generation process
For multiplicative scenarios, the scaled signal pattern modulates the background element-wise rather than being added to it,

x^(n) = (1 + α R^(n)(H a^(n)) / ‖A‖_F) ⊙ (1 − α) G η^(n) / ‖E‖_F,   (2)

where H and G are defined as above, A and E are Frobenius-normalized, ⊙ denotes the element-wise product, and 1 ∈ R^D is the all-ones vector.
For data generated via either process, we scale each sample as x^(n) ← x^(n) / max |x|, where max |x| is the maximum absolute value of any feature across the dataset.

Emergence of suppressors
Note that the correlated background noise scenario induces the presence of suppressor variables, both in the additive and the multiplicative data generation processes. A suppressor here would be a pixel that is not part of the foreground R^(n)(H a^(n)), but whose activity is correlated with a pixel of the foreground by virtue of the smoothing operator G. Based on previously reported characteristics of suppressor variables (Conger, 1974; Friedman and Wall, 2005; Haufe et al, 2014; Wilming et al, 2022), we expect that XAI methods may be prone to attributing importance to suppressor features in the considered linear and non-linear settings, leading to drops in explanation performance as compared to the white noise background setting.
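The suppressor mechanism can be illustrated with a minimal two-variable sketch in the spirit of Haufe et al (2014): a 'background' variable carrying only noise receives a large model weight because it helps cancel the noise in the informative variable. Variable names here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
y = rng.choice([-1.0, 1.0], size=N)   # class label (the "signal")
noise = rng.standard_normal(N)        # shared distractor

x1 = y + noise    # "foreground" pixel: signal plus noise
x2 = noise        # "suppressor" pixel: noise only, no association with y

X = np.stack([x1, x2], axis=1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares readout

# x2 carries no information about y on its own ...
corr = np.corrcoef(x2, y)[0, 1]
# ... yet the optimal model assigns it a large negative weight,
# because subtracting x2 removes the noise from x1: y = x1 - x2 exactly.
print(corr, w)   # corr ~ 0, w ~ [1, -1]
```

Any importance attribution that reflects the model weights will flag x2 as highly 'important', even though it has no statistical association with the target, which is precisely the failure mode studied here.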

Scenarios
We make use of tetrominoes (Golomb, 1996), geometric shapes consisting of four blocks (each block here being 8 × 8 pixels), to define each signal pattern a^(n) ∈ R^{64×64}. We choose these as the basis for signal patterns as they allow a fixed and controllable number of features (pixels) per sample, and specifically the 'T'-shaped and 'L'-shaped tetrominoes due to their four unique appearances under each 90-degree rotation. These induce statistical associations between features and target in four different binary classification problems:

Linear (LIN) and multiplicative (MULT)
For the linear case, we use the additive generation model Eq. (1), and for the multiplicative case, we instead use the multiplicative generation model. In both, signal patterns are defined as a 'T'-shaped tetromino pattern a^T near the top-left corner if y = 0 and an 'L'-shaped tetromino pattern a^L near the bottom-right corner if y = 1, leading to the binary classification problem. Each pattern is encoded such that a^{T/L}_{i,j} = 1 for each pixel in the tetromino pattern, positioned at the i-th row and j-th column of a^{T/L}, and zero otherwise.
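A minimal sketch of such fixed-position patterns, assuming illustrative block offsets and corner positions (the exact coordinates used in the paper are not specified here):

```python
import numpy as np

BLOCK = 8  # each tetromino block is 8x8 pixels

def tetromino(shape, top_left):
    """Place a 4-block tetromino on a 64x64 grid.

    `shape` lists block offsets (row, col) in block units; the offsets
    below are illustrative, not the paper's exact layout.
    """
    a = np.zeros((64, 64))
    r0, c0 = top_left
    for dr, dc in shape:
        r, c = r0 + dr * BLOCK, c0 + dc * BLOCK
        a[r:r + BLOCK, c:c + BLOCK] = 1.0
    return a

# 'T' near the top-left corner (class y = 0), 'L' near the bottom-right (y = 1)
T_BLOCKS = [(0, 0), (0, 1), (0, 2), (1, 1)]
L_BLOCKS = [(0, 0), (1, 0), (2, 0), (2, 1)]
a_T = tetromino(T_BLOCKS, (8, 8))
a_L = tetromino(L_BLOCKS, (32, 40))
```

Each pattern covers exactly 4 × 64 = 256 pixels, giving the fixed, controllable number of important features per sample mentioned above.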

Translations and rotations (RIGID)
In this scenario, the tetrominoes a^{T/L} defining each class are no longer in fixed positions but are randomly translated and rotated by multiples of 90 degrees according to a rigid-body transform R^(n), constrained such that the entire tetromino is contained within the image. In contrast to the other scenarios, we use a 4-pixel-thick tetromino here to enable a larger set of transformations, and thus increase the complexity of the problem. This is an additive manipulation in accordance with Eq. (1).

XOR
The final scenario is that of an additive XOR problem, where we use both tetromino variants a^{T/L} in every sample. The transformation R^(n) is, once again, the identity transform here. Class membership is defined such that members of the first class, where y = 0, combine both tetrominoes with the background of the image either positively or negatively, such that a^{XOR++} = a^T + a^L and a^{XOR−−} = −a^T − a^L. Members of the opposing class, where y = 1, imprint one shape positively and the other negatively, such that a^{XOR+−} = a^T − a^L and a^{XOR−+} = −a^T + a^L. Each of the four XOR cases is equally frequently represented across the dataset.
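The XOR labeling rule can be written compactly; `xor_pattern` is a hypothetical helper name, and the class label simply encodes whether the two imprint signs agree:

```python
def xor_pattern(a_T, a_L, s_T, s_L):
    """Combine both tetromino patterns with signs s_T, s_L in {+1, -1}.

    Same signs (++ or --) -> class y = 0; opposite signs -> y = 1,
    i.e., y is the XOR of the two sign bits.
    """
    y = 0 if s_T == s_L else 1
    return s_T * a_T + s_L * a_L, y
```

No linear readout of any single pixel separates the two classes, since each pixel's sign flips with equal probability within each class; this is what makes XOR the hardest of the four scenarios for linear models.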
Figure 1 shows two examples from each class of each classification problem and for the three background types: Gaussian white noise (WHITE), smoothed Gaussian white noise (CORR), and ImageNet samples (IMAGENET). Figure B1 in the supplementary material shows examples of each of the 12 scenarios across four signal-to-noise ratios (SNRs).
With each classification scenario defined, we can form the ground-truth set of important pixels for a given input based on the positions of its tetromino pixels,

F+(x^(n)) = {i : (R^(n)(H a^(n)))_i ≠ 0},   (3)

i.e., the set of all pixels at which the transformed, smoothed signal pattern is non-zero. For the LIN and MULT scenarios, each sample contains either a 'T' or an 'L' tetromino at a fixed position, corresponding to the fixed patterns a^T and a^L. Since the absence of a tetromino at one location is just as informative as the presence of the other at its location, we augment the set of important pixels for these two settings to comprise the supports of both patterns a^T and a^L. Note that this definition is equivalent to Eq. (3) for the XOR scenario. Moreover, it is equivalent to an operationalization of feature importance put forward by Wilming et al (2022) for the three static scenarios LIN, MULT, and XOR, which defines any feature as important if it has a statistical dependency to the prediction target across the studied sample. In all cases, an ideal explanation method should attribute importance only to members of the set F+(x^(n)).
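A sketch of both ground-truth definitions, assuming the 5% support threshold used for the filter H also defines the pattern's support:

```python
import numpy as np

def ground_truth_mask(smoothed_pattern, threshold=0.05):
    """F+(x): pixels where the smoothed signal pattern is (effectively) non-zero.

    The 5% support threshold mirrors the truncation applied to the
    Gaussian filter H during data generation.
    """
    p = np.abs(smoothed_pattern)
    return p >= threshold * p.max()

def augmented_mask(smoothed_T, smoothed_L, threshold=0.05):
    """LIN/MULT: the union of both patterns' supports, since the absence of
    one tetromino is as informative as the presence of the other."""
    return (ground_truth_mask(smoothed_T, threshold)
            | ground_truth_mask(smoothed_L, threshold))
```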
For training each model and the subsequent analyses, we divide each dataset threefold by a 90/5/5 split into a training set D train , a validation set D val , and a test set D test .

Classifiers
We use three architectures to model each classification problem. Firstly, a Linear Logistic Regression (LLR) model, which is a single-layer neural network with two output neurons and a softmax activation function. Secondly, a Multi-Layer Perceptron (MLP) with four fully-connected layers, where each hidden layer uses Rectified Linear Unit (ReLU) activations; the two-neuron output layer is once again softmax-activated. Finally, we define a Convolutional Neural Network (CNN) with four blocks of ReLU-activated convolutional layers, each followed by a max-pooling operation, and a softmax-activated two-neuron output layer. The convolutional layers are specified with a progressively increasing number of filters per layer [4, 8, 16, 32], a kernel size of four, a stride of one, and zero-padding. The max-pooling layers are defined with a kernel size of two and a stride of one.
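As a quick sanity check of the stated architecture, the spatial feature-map sizes can be computed with standard convolution arithmetic. We assume here that 'zero-padding' means a padding width of zero (no padding); if the authors instead meant padding with zeros of some width, the sizes would differ.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

n = 64
sizes = []
for _ in range(4):                # four conv + max-pool blocks
    n = out_size(n, kernel=4)     # conv: kernel 4, stride 1, no padding
    sizes.append(n)
    n = out_size(n, kernel=2)     # max-pool: kernel 2, stride 1
    sizes.append(n)
print(sizes)  # [61, 60, 57, 56, 53, 52, 49, 48]
```

Under this assumption, the final block yields 48 × 48 feature maps with 32 channels before the output layer.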
We train a given classifier f_θ : R^D → Y with parameters θ on D_train. Each network is trained for 500 epochs using the Adam optimizer without regularization, with a learning rate of 0.0005. The validation dataset D_val is used to monitor how well the model generalizes: validation loss is calculated at each epoch, and the model state with minimum validation loss is stored, which also prevents the use of an overfit model. Finally, the test dataset D_test is used to calculate the resulting model performance and is used in the evaluation of XAI methods. We consider a classifier to have generalized the given classification problem when the resulting test accuracy is at or above a threshold of 80%.
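The model-selection logic (keeping the state with minimum validation loss) can be sketched framework-agnostically; `train_step` and `val_loss` are hypothetical stand-ins for one Adam epoch over D_train and the loss on D_val, respectively.

```python
import copy

def train_with_early_selection(model, epochs, train_step, val_loss):
    """Train for a fixed number of epochs, returning the parameters
    observed at the minimum validation loss (sketch)."""
    best_loss, best_state = float("inf"), copy.deepcopy(model)
    for _ in range(epochs):
        train_step(model)
        loss = val_loss(model)
        if loss < best_loss:                       # store state at the minimum
            best_loss, best_state = loss, copy.deepcopy(model)
    return best_state, best_loss
```

Unlike early stopping, training always runs the full epoch budget; only the checkpoint selection uses the validation loss.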
Each network is implemented in PyTorch, and also in Keras with a TensorFlow backend, so as to experiment with a wider variety of XAI methods implemented in either the Captum (Kokhlikyan et al, 2020) or iNNvestigate (Alber et al, 2018) frameworks.
The main text focuses on the former.

XAI methods and performance baselines
We compare sixteen popular XAI methods in our analysis. The main text focuses on the results of four: Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al, 2016), Layer-wise Relevance Propagation (LRP) (Bach et al, 2015), SHapley Additive exPlanations (SHAP) (Lundberg and Lee, 2017), and Integrated Gradients (Sundararajan et al, 2017).
The full list is detailed in Appendix B.5, which briefly summarizes each method and specifies which library was used for its implementation, Captum (Kokhlikyan et al, 2020) or iNNvestigate (Alber et al, 2018), as well as the specific parameterization for each method. Generally, we follow each method's default parameterization.
Where necessary, we specify the baseline b as the zero input b = 0, a common choice in the field (Mamalakis et al, 2022).
The input to an XAI method is a model f_θ : R^D → R, trained according to parameterization θ over D_train, the n-th test sample to be explained, x^(n)_test, as well as the baseline reference point b = 0 for relevant methods. The method produces an 'explanation' s(f_θ, x^(n)_test) ∈ R^D, attributing an importance score to each input feature. We include four model-ignorant methods to generate 'baseline' importance maps for comparison with the aforementioned XAI methods. Firstly, we consider the Sobel filter, which uses a horizontal and a vertical filter kernel to approximate first-order spatial derivatives of the data. Secondly, we use the Laplace filter, which uses a single symmetric kernel to approximate second-order derivatives. Both are edge-detection operators and are applied directly to each test sample. Thirdly, we use a sample from a random uniform distribution U((−1, 1)^D). Finally, we use the rectified test sample x^(n)_test itself as an importance map.
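The two edge-detection baselines can be sketched with their standard 3 × 3 kernels; the exact border handling used in the paper is not specified, so a 'valid' convolution is assumed here.

```python
import numpy as np

LAPLACE = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=float)
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def conv2d_valid(img, k):
    """Plain 'valid' 2D correlation with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def sobel_map(img):
    """Gradient-magnitude baseline from horizontal and vertical Sobel kernels."""
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_X.T)
    return np.hypot(gx, gy)

def laplace_map(img):
    """Second-derivative (Laplacian) baseline."""
    return np.abs(conv2d_valid(img, LAPLACE))
```

Both maps are entirely model-ignorant: any XAI method that fails to beat them on a benchmark is effectively no more informative than a generic edge detector.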

Explanation performance metrics
Based on the well-defined ground-truth set of class-dependent features for a given sample, F+(x^(n)), we can readily form quantitative metrics to evaluate the quality of an explanation.

Precision
Omitting the sample dependence in the notation, we define precision as the fraction of the k = |F+| features of s with the highest absolute importance scores that are contained in the set F+ itself. We report these results in the appendices and focus on the results and analyses for the two metrics defined next.
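A minimal implementation of this top-k precision:

```python
import numpy as np

def precision_at_k(s, mask):
    """Fraction of the k = |F+| largest |importance| scores inside F+."""
    s, mask = np.abs(s).ravel(), mask.ravel().astype(bool)
    k = mask.sum()
    top_k = np.argsort(s)[-k:]   # indices of the k largest scores
    return mask[top_k].mean()
```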

Earth mover's distance (EMD)
The Earth mover's distance (EMD), also known as the Wasserstein metric, measures the optimal cost required to transform one distribution into another. We apply this to the cost required to transform a continuous-valued importance map s into F+, where both are normalized to have the same mass. The Euclidean distance between pixels is used as the ground metric for calculating the EMD, with OT(s, F+) denoting the cost of the optimal transport from explanation s to ground truth F+. This follows the algorithm proposed by Bonneel et al (2011) and the implementation of the Python Optimal Transport library (Flamary et al, 2021). We define a normalized EMD performance score as

EMD(s, F+) = 1 − OT(s, F+) / δ_max,

where δ_max is the maximum Euclidean distance between any two pixels. Remark.
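For intuition, the normalized score can be reproduced on a toy grid by solving the transport problem as a generic linear program. The paper itself uses the dedicated solver of the Python Optimal Transport library, which would be required at full 64 × 64 resolution; the LP formulation below is only practical for small grids.

```python
import numpy as np
from scipy.optimize import linprog

def emd_score(s, mask):
    """Normalized EMD score 1 - OT(s, F+)/delta_max on a small grid (sketch)."""
    h, w = s.shape
    a = np.abs(s).ravel().astype(float); a /= a.sum()
    b = mask.ravel().astype(float);      b /= b.sum()
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    P = np.stack([yy.ravel(), xx.ravel()], 1).astype(float)
    C = np.linalg.norm(P[:, None] - P[None, :], axis=2)  # Euclidean ground metric
    n = h * w
    # LP: minimize <C, T> s.t. rows of T sum to a, columns sum to b, T >= 0
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1   # row sums
        A_eq[n + i, i::n] = 1            # column sums
    b_eq = np.concatenate([a, b])
    # one constraint is redundant (masses are equal); drop it for full rank
    res = linprog(C.ravel(), A_eq=A_eq[:-1], b_eq=b_eq[:-1], bounds=(0, None))
    return 1.0 - res.fun / C.max()
```

An explanation whose mass already sits on the ground truth scores 1; shifting the mass by a few pixels lowers the score in proportion to the transport distance.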
Note that the ground truth F+(x) defines the set of important pixels based on the data generation process. It is conceivable, though, that a model uses only a subset of these for its prediction, which must be considered equally correct. The above explanation performance metrics do not fully achieve invariance in that respect. However, both are designed to de-emphasize the impact of false-negative omissions of features in the ground truth on performance, while emphasizing the impact of false-positive attributions of importance to pixels not contained in the ground truth.

Importance Mass Accuracy
Because of this, we consider a third metric, Importance Mass Accuracy (IMA). It is calculated as the sum of importance attributed to the ground-truth features over the total attribution in the image,

IMA(s) = Σ_{i ∈ F+} |s_i| / Σ_{i=1}^D |s_i|,

and is akin to 'relevance mass accuracy' as defined by Arras et al (2022). This metric does not penalize false-negative attribution restricted to a subset of the pixels in F+(x), and it utilizes the whole attribution rather than only the 'top-k' scores, as Precision does. Moreover, it is a direct measure of false-positive attribution, where a score of 1 signals a perfect explanation highlighting only ground-truth features as important. We use this metric to complement the strengths of EMD while also presenting an alternative perspective on quantifying explanation performance.
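IMA reduces to a single ratio:

```python
import numpy as np

def importance_mass_accuracy(s, mask):
    """Share of total absolute importance attributed to ground-truth pixels."""
    s = np.abs(s).ravel()
    return s[mask.ravel()].sum() / s.sum()
```

A uniform importance map scores |F+|/D, which gives a natural chance level for each scenario.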

Experiments
Our experiments aim to answer four main questions: 1. Which XAI methods are best at identifying truly important features as defined by the sets F + (x)?
2. Does explanation performance for each method remain consistent when moving from explaining a linear classification problem to problems with different degrees of non-linearity?
3. Does adding correlations to the background noise, through smoothing with the Gaussian convolution filter, negatively impact explanation performance?
4. How does the choice of model architecture impact explanation performance?
We generate a dataset for each scenario across a range of 20 choices of α, finding the 'sweet spot' at which the average test accuracy over 10 trained models is at or above 80%. Table 1 shows the resulting α values as well as the average test accuracy over five model trainings for each scenario, with datasets of size N = 40,000. Experiments were run on an internal CPU and GPU cluster, with a total runtime in the order of hours.

Results
Figure 2 depicts examples of absolute-valued importance maps produced for a random correctly-predicted sample for each scenario and model. Shown are results for four XAI methods (Gradient SHAP, LIME, LRP, and PatternNet, respectively) for each of the three models (LLR, MLP, CNN, respectively), followed by the model-ignorant Laplace filter. Appendix B.7.1 expands on the qualitative results of the main text, and Figure B4 shows the absolute-valued global importance heatmaps for the LIN, MULT, and XOR scenarios, given as the mean of all explanations for every correctly-predicted sample of the given scenario and XAI method. As the RIGID scenario has no static ground truth pattern, calculating a global importance map is not possible.
Figure 3 shows explanation performance of individual sample-based importance maps produced by the selected XAI and baseline methods, across five models trained for each scenario-architecture parameterization, in terms of the EMD and IMA metrics. Appendix B.7.2 expands on the quantitative results of the main text, detailing results for all 16 methods studied and for our Precision metric. In a few cases, performance tends to decrease as model complexity increases (from the simple LLR to the complex CNN architecture). One notable exception is the RIGID scenario, where the CNN outperforms the other models, as expected. However, in this setting nearly all XAI methods are outperformed by a simple Laplace edge-detection filter for correlated backgrounds. Here, the discrepancy between MLP and CNN performance is amplified for the IMA metric, with the CNN performing relatively better for a few XAI methods. The CNN also performs well in the case of the more complicated IMAGENET backgrounds.
Within most scenario-architecture parameterizations, the performances of the studied XAI methods are relatively homogeneous, with a few exceptions. In most cases, correlated backgrounds (CORR) lead to worse explanation performance than their white noise (WHITE) counterparts, suggesting that suppressors in the smoothed background are difficult to distinguish from the class-dependent variables for most XAI methods. This effect can be most strongly observed when comparing RIGID WHITE to RIGID CORR for the IMA metric, suggesting that correlations in the background do indeed increase false-positive attribution in model explanations.
Baseline methods tend to perform similarly to one another. Interestingly, their performance is on par with or even superior to various XAI methods in certain scenarios. Most notably, a simple Laplace edge-detection filter outperforms nearly all other methods in the RIGID as well as the XOR scenarios when used in combination with correlated backgrounds (CORR). IMA results for baseline methods in the RIGID scenario show far less variance in the boxplots of Figure 3b than in the EMD equivalents of Figure 3a.
The results for the RIGID scenarios may be taken with a pinch of salt, as the high signal-to-noise ratios (SNRs) lead to highly salient tetrominoes in sample images. Notably, explanations produced for CNNs in this case tend to perform very well for both the EMD and IMA metrics compared to most results for any other model architecture and problem scenario. While this problem itself (identifying a pattern with rotation and translation invariance) is the most realistic of the four presented here, particularly when applied to CNNs, the high saliency of the tetrominoes is perhaps not wholly akin to realistic problem settings, where the relative saliency of individual objects of interest is usually far lower. The high saliency of the tetrominoes derives from our experimental choice to adjust SNRs to achieve a predefined minimal classification performance threshold, which required a high SNR in this setting. An alternative approach could be to reverse this and fix the SNR for all scenarios and background types. To revisit the stated questions from the start of Section 3: 1. Which XAI methods are best at identifying truly important features as defined by the sets F+(x)?
The results show massive variability in performance for all methods across all problems and model architectures, so we cannot declare one specific 'best' method.
2. Does explanation performance for each method remain consistent when moving from explaining a linear classification problem to problems with different degrees of non-linearity?
Here we see again that some methods vary in performance depending on the type of non-linearity (most perform better for MULT, with its fixed-position non-linearity, than for RIGID), with a larger spread of EMD and IMA scores (seen in the size of the boxes and whiskers of Figure 3) for non-linear scenarios than for LIN.
PatternNet and PatternAttribution (Kindermans et al, 2018), whose results are shown in the appendix (Figures B5, B6, B7, B13, and B14), were proposed in part to address the suppressor problem, yet we can see that this is not necessarily always achieved. These methods show strong performance for LIN as proposed, and as was seen in Wilming et al (2022), but do not appear to generalize as well in most non-linear scenarios.
Notably, when the pattern signal is not in a fixed position (i.e., RIGID), these methods perform worse than when the signal is in a fixed position (i.e., MULT and XOR). More specifically, they appear to learn the complete pattern signal (i.e., the tetromino shapes of both classes), so in the XOR case, where both shapes are present and fixed in each sample, they outright perform the best, as one might expect.
3. Does adding correlations to the background noise, through smoothing with the Gaussian convolution filter, negatively impact explanation performance?
Comparing results from WHITE to CORR, we see a decrease in performance and an increase in spread in most cases. This can be attributed to the imposed correlations (induced through Gaussian smoothing) between background pixels and pixels overlapping with F+, which cause background pixels to act as suppressor variables. The strength of this effect can be controlled by increasing or decreasing the Gaussian smoothing parameter σ.
4. How does the choice of model architecture impact explanation performance?
For LIN, explanation performance of all methods is similar across architectures in most cases. Moving to the non-linear scenarios, there is little consistency in how architectures perform: the CNN performs best in the RIGID case, but the MLP performs relatively better for the fixed-tetromino-position cases of MULT and XOR. This can perhaps be explained by the CNN architecture lending itself well to rotation/translation invariance, whereas the properties of the MLP work better for a fixed-position ground-truth class-conditional distribution.
We also note that when multiple models achieve similar classification performance on a task, a user may assume, or simply not realize, that explanation performance can nevertheless be vastly different, as seen in the MLP vs. CNN results for RIGID in Figure 3, and qualitatively in Figure 2 across all architectures.

Discussion
Experimental results confirm our main hypothesis that explanation performance is lower in cases where the class-specific signal is combined with a highly auto-correlated class-agnostic background (CORR) compared to a white noise background (WHITE). The difficulty of XAI methods to correctly highlight the truly important features in this setting can be attributed to the emergence of suppressor variables. Importantly, the misleading attribution of importance by an XAI method can lead to misinterpretations regarding the functioning of the predictive model, which could have severe consequences in practice. Such consequences include unjustified mistrust in the model's decisions, unjustified conclusions regarding the features related to a certain outcome (e.g., in the context of medical diagnosis), and a reinforcement of such false beliefs in human-computer interaction loops.
We have also seen that when multiple ML architectures can be used interchangeably to appropriately solve a classification problem (here, with classification accuracy required to be above 80%), they may still produce disparate explanations. Architectures not only differed with respect to the selection of pixels within the correct set of important features, but also showed different patterns of false-positive attributions of importance to unimportant background features. If one cannot produce consistent and sensible results for multiple seemingly appropriate ML architectures, the risk of model mistrust may be especially pronounced.
A recent survey showed that one in three XAI papers evaluates methods exclusively with anecdotal evidence, and one in five with user studies (Nauta et al, 2023). Other work in the field tends to focus on secondary criteria (such as stability and robustness (Rosenfeld et al, 2021; Hedström et al, 2022)) or subjective or potentially circular criteria (such as fidelity and faithfulness (Gevaert et al, 2022; Nauta et al, 2023)). It was recently shown in Wilming et al (2023) that faithfulness as a concept can be treated as an XAI method in itself, and when done so, it is also prone to the attribution of arbitrarily high importance to suppressor variables. We therefore doubt that such secondary validation approaches can fully replace metrics assessing objective notions of 'correctness' of explanations, considering that XAI methods are widely intended to be used as means of quality assurance for machine learning systems in critical applications. Thus, the development of specific formal problems to be addressed by XAI methods, and the theoretical and empirical validation of respective methods to address specific problems, is necessary. In practice, a stakeholder may often (explicitly or implicitly) expect that a given XAI method identifies features that are truly related to the prediction target. In contrast to other notions of faithfulness, this is an objectively quantifiable property of an XAI method, and we here propose various non-linear types of ground-truth data along with appropriate metrics to directly measure explanation performance according to this definition. While our work is not the first to provide quantitative XAI benchmarks (see Tjoa and Guan, 2020; Li et al, 2021; Zhou et al, 2022; Arras et al, 2022; Gevaert et al, 2022; Agarwal et al, 2022), it differs from most published papers in that it allows users to quantitatively assess potential misinterpretations caused by the presence of suppressor variables in data.
One potential limitation of the EMD metric is the strictness of limiting the ground truth feature set F+ to the specific pixels of the tetrominoes a^{T/L}, compared to, say, the set of features outlining a^{T/L}. Alternative definitions of F+ could be conceived to more flexibly adapt to different potential 'explanation strategies'. Figure B3 in the appendices outlines four 'explanation strategies' and how the EMD metric varies with each. Notably, an 'outline' explanation performs worse than an explanation highlighting a subset of F+. This highlights two interesting features of our novel metric. Firstly, a strongly performing 'subset' explanation shows that EMD does not penalize false negatives (not attributing high importance to some truly important features) as harshly as Precision and other 'top-k' metrics do. Secondly, the 'outline' explanation functions in a presumably similar way to some model-ignorant edge detection methods, and performs the worst of any explanation strategy shown in Figure B3. Yet, we have shown such edge detection methods to be capable of outperforming many XAI methods in some problem scenarios. Our IMA metric also complements this potential limitation of EMD, as it does not matter whether the attribution of importance to features of F+ is spread across all features or more intensely attributed to a subset. This metric directly measures false-positive attribution of importance to features outside of F+, and assists the user in understanding the role that suppressors play in model explanations.
While we compare a total of 16 XAI methods, the space of possible neural network architectures is too vast to be represented; we therefore only compared one MLP and one CNN architecture here. However, our experiments hopefully serve as a showcase for our benchmarking framework, which can easily be extended to other architectures. Finally, our framework serves much-needed validation purposes for methods that are themselves conceived to play a role in the quality assurance of AI. As such, we expect that the benefits of our work far outweigh potential negative implications for society, if any. A possible risk, even if far-fetched, would be that one may reject a fit-for-purpose XAI method based on empirical benchmarks such as ours, which do not necessarily reflect the real-world setting and may hence be too strict.

Conclusion
We have used a data-driven generative definition of feature importance to create synthetic data with well-defined ground truth explanations, and have used these to provide an objective assessment of XAI methods when applied to various classification problems. Furthermore, we have defined new quantitative metrics of explanation performance and demonstrated that many popular XAI methods do not behave in an ideal way when moving from linear to non-linear scenarios. Our results show that XAI methods can even be outperformed by simple model-ignorant edge detection filters in the RIGID use case, in which the object of interest is not located in a static position. Finally, we show that XAI methods may provide inconsistent explanations when using different model architectures under equivalent conditions. Future work will be to develop dedicated performance benchmarks in more complex and application-specific problem settings such as medical imaging.
7 Declarations
All authors commented on and edited all previous versions of the manuscript. All authors read, edited, and approved the final manuscript.

Our benchmark comprises 4 image classification problem scenarios, 3 background types, 3 model architectures, 16 explanation methods, 4 performance baselines, and 3 metrics. We carefully construct the important class-conditional features in each problem, which can serve as ground truth explanations. We assess many popular post-hoc XAI methods and quantify their 'explanation performance' using metrics from signal detection theory, such as the Earth mover's distance (EMD), IMA, and precision, and show that such methods attribute importance to suppressor variables, which can lead to misleading interpretations.
Through our experimental results we observe, among other behaviors, that explanations produced for different, equally performing ML architectures can be very inconsistent. We show that popular explanation methods are sometimes unable to outperform random performance baselines and edge detection methods on our developed performance metrics. We discuss, drawing on related literature, that secondary metrics such as faithfulness are currently not sufficient to assess ML explanation quality compared to objective metrics focused on the 'correctness' of explanations, such as those presented here.
What papers by other authors make the most closely related contributions, and how is your paper related to them?
Several works in the XAI field have moved towards quantitative evaluation of XAI methods using ground truth data (Tjoa and Guan, 2020; Li et al, 2021; Zhou et al, 2022; Arras et al, 2022; Gevaert et al, 2022; Agarwal et al, 2022). However, these studies are limited in the extent to which they perform quantitative assessment, and many do not construct their benchmark datasets in a way that includes realistic correlations between class-dependent and class-agnostic features (i.e., the foreground/object in an image vs. the background). In practice, these correlations can give rise to features acting as suppressor variables. These works do not focus on such variables; our previous work is the only one to do so. Wilming et al (2022), published at ECML 2022, took a similar approach to that shown here, yet focused on a linear problem for one model architecture, and did not make use of random performance baselines against which to compare XAI methods. Wilming et al (2023) also looked into quantifying explanation performance in the presence of suppressors using a two-dimensional linear example; however, the focus there was on analytically deriving the exact influence of suppressors on produced explanations.
Have you published parts of your paper before, for instance in a conference? If so, give details of your previous paper(s) and a precise statement detailing how your paper provides a significant contribution beyond the previous paper(s).
The content of this paper is entirely original. Some ideas discussed in this paper have already been voiced in our prior work (Haufe et al, 2014; Wilming et al, 2022, 2023). However, our current paper goes beyond these by focusing on an extensive set of empirical experiments across 4 image classification problem scenarios, 3 background types, 3 model architectures, 16 explanation methods, 4 performance baselines, and 3 metrics.
Each field can be accessed programmatically via its name; for example, DataRecord.x_test returns the test data x_test of the dataset. The masks field contains the tetromino pattern masks, which form the ground truth for explanations.
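As a rough illustration of this access pattern, the following sketch models the record as a namedtuple; apart from x_test and masks, the field names and contents are hypothetical placeholders, not taken from the actual repository.

```python
from collections import namedtuple

# Hypothetical layout: only x_test and masks are named in the text;
# the remaining fields are illustrative assumptions.
DataRecord = namedtuple(
    "DataRecord",
    ["x_train", "y_train", "x_val", "y_val", "x_test", "y_test", "masks"],
)

record = DataRecord(
    x_train=[[0.1]], y_train=[0], x_val=[[0.2]], y_val=[1],
    x_test=[[0.3]], y_test=[0], masks=[[1]],
)

print(record.x_test)   # programmatic access by field name
print(record.masks)    # tetromino ground-truth masks for explanations
```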

B.3 Compute
Experiments were run on a cluster consisting of four Nvidia A40 GPUs, where each model training took roughly between three and twenty minutes to complete, depending on architecture. Compute time for running the XAI methods is harder to estimate and depends on each method, but in total, for all models and methods on a given scenario's N = 2,000 test set, this took between 24 and 48 hours of compute time per GPU on the cluster. Quantitative analysis took roughly a further 24 hours of compute per scenario on a cluster of AMD EPYC 7702 CPUs, with six threads used for each of the 12 scenarios. For those who want to explore the code and data with smaller compute requirements, the 8 × 8-px data shown in supplementary materials Section B.8 is also representative of a strong benchmark for XAI methods. Code and instructions to run it are provided in the GitHub repository linked in supplementary materials Section B.2.

B.4 Data
Here, we expand on Figure 1 with Figure B1, which shows an example of each scenario across four choices of signal-to-noise ratio (SNR), parameterized by α.
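For intuition on how α controls the SNR, the following is a hedged sketch: one common way to parameterize SNR is a convex blend of a fixed signal pattern and background noise; the benchmark's exact generative formula may differ, and the bar pattern below is a stand-in, not an actual tetromino from the datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def blend(signal, noise, alpha):
    """Convex combination of signal and noise; larger alpha means
    higher SNR (an assumed parameterization for illustration)."""
    return alpha * signal + (1.0 - alpha) * noise

signal = np.zeros((8, 8))
signal[2:5, 3] = 1.0                       # toy bar standing in for a tetromino
noise = rng.standard_normal((8, 8))        # white background noise

# Four choices of alpha, mirroring the four SNRs shown in Figure B1.
samples = [blend(signal, noise, a) for a in (0.1, 0.3, 0.6, 0.9)]
# Higher alpha -> the pattern increasingly dominates the background.
```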

B.5 Explanation Methods and Model Training
Here, we detail the full suite of 16 XAI methods used in our analysis, each with a brief description along with the reference and any parameterization details. In the main text, we focus on XAI methods available within the Captum (Kokhlikyan et al, 2020) framework for explaining PyTorch models. We also make use of methods available in the iNNvestigate (Alber et al, 2018) library, by training equivalent models for the Keras framework.
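To illustrate what a simple gradient-based attribution computes, here is a minimal numpy sketch (not the paper's code, and deliberately framework-free) of the 'gradient x input' rule that libraries such as Captum and iNNvestigate implement for real networks. For a linear logistic model, the gradient of a class logit with respect to the input is simply the corresponding weight row; the weights and input below are made up.

```python
import numpy as np

# 2 classes x 3 input features: a tiny linear logistic model.
W = np.array([[ 0.5, -1.0, 0.0],
              [-0.5,  1.0, 2.0]])
x = np.array([1.0, 2.0, 0.5])
target = 1

grad = W[target]           # d logit_target / d x for a linear model
attribution = grad * x     # 'gradient x input' importance map
print(attribution)         # attribution values: [-0.5, 2.0, 1.0]
```

For deep non-linear models, the gradient is obtained by backpropagation instead, but the resulting attribution map is interpreted in the same per-feature way.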
Table B1: XAI Methods used with a brief description of each method and the implementation details, including the software framework used and any specific parameterization including the baseline input used, if applicable.

Guided GradCAM
Computes the element-wise product of guided backpropagation attributions with a class-discriminative localization map in the final convolutional layer of a CNN. This produces a coarse importance map for the target class as an explanation, the same size as the convolutional feature map, rather than a pixel-wise map over the whole image.

B.7 Results
This section further elaborates the results of our experiments on validating the performance of XAI methods. In Figures B5 and B7 we also show methods available in the iNNvestigate (Alber et al, 2018) library, obtained by training equivalent models for the Keras framework. We note that there were some convergence issues for CNN models on the XOR scenarios with the required Keras framework, even under seemingly equivalent conditions such as fixed random seeds and He-normal weight initialization. As such, we do not show the corresponding results for these methods (PatternNet, PatternAttribution, Deep Taylor Decomposition) in the XOR-CNN problem setting, so as to promote a fair comparison of methods. Our model architectures have been chosen as a showcase of the datasets and benchmarks of this work; other architectures may fare better or worse under the same XAI methods, but this was not a focus of this work.

Fig. B3 :
Fig. B3: EMD scores for the 8 × 8 ground truth as well as four 'explanation strategies'. Here, we can see that the EMD metric does not penalize an explanation highlighting a subset of truly important features compared to an explanation highlighting the outline of the ground truth. This shows that the EMD penalizes false negatives (not attributing high importance to truly important features) less than a 'top-k' metric like Precision would. The 'outline' strategy in the third column produces an explanation presumably similar to a model-ignorant edge detector, and has the lowest EMD score of the strategies shown; yet we have shown such edge detectors can outperform many XAI methods in some problem scenarios.

B.7.1 Qualitative Results
In Figure B4, we can see absolute-valued global importance maps for selected XAI methods and baselines, calculated as the mean importance value over all correctly predicted samples. RIGID scenarios involving translations and rotations of the tetromino signal pattern are not included, as they have no fixed ground truth position.
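This aggregation step can be sketched in a few lines; the toy attribution maps and the boolean correctness mask below are illustrative, not taken from the experiments.

```python
import numpy as np

# Per-sample attribution maps (3 samples of a 2x2 image) and a mask
# indicating which samples were correctly predicted.
attributions = np.array([
    [[ 1.0, -2.0], [0.0, 0.5]],
    [[-1.0,  2.0], [0.0, 0.5]],
    [[ 9.0,  9.0], [9.0, 9.0]],   # misclassified sample, filtered out
])
correct = np.array([True, True, False])

# Global map: mean of absolute importance over correct samples only.
global_map = np.abs(attributions[correct]).mean(axis=0)
print(global_map)   # values: [[1.0, 2.0], [0.0, 0.5]]
```

Taking absolute values first means positive and negative attributions reinforce rather than cancel, which is the convention used for the importance maps shown in the figures.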

B.7.2 Quantitative Results
In Figures B5, B6, and B7 we can see the full quantitative results for the EMD, IMA, and Precision metrics, respectively, across all XAI methods and baselines. We also show results for the PatternNet, PatternAttribution, and Deep Taylor Decomposition (DTD) methods, which are part of the Keras-based iNNvestigate framework (Alber et al, 2018).

B.8 8 × 8-px Benchmarks
The benchmark was originally designed around 8 × 8-px tetromino images, and was scaled up to 64 × 64-px with the inclusion of the ImageNet data as a third background type. This was done to improve the robustness and real-world applicability of the datasets and benchmarks presented in this work. The original results for the 8 × 8-px data with 1-px-thick tetrominoes can be seen in this section. Figure B8 shows example data for both classes across a range of four α values. For CORR backgrounds, we set σ_smooth = 3.0 for the smoothing filter, and no pattern smoothing was incorporated. Here, each scenario was constructed with sample size N = 10,000 and an 80/10/10 train/val/test split, with 25 datasets per scenario being used for analyses.
The Linear Logistic Regression (LLR) model in these experiments was the same single-layer neural network with two output neurons and a softmax activation function. The Multi-Layer Perceptron (MLP) similarly has four fully-connected layers with Rectified Linear Unit (ReLU) activations, where each fully-connected hidden layer halves the input size, i.e. [64, 32, 16, 8]. The two-neuron output layer was once again softmax-activated. Finally, the Convolutional Neural Network (CNN) was defined as four blocks of ReLU-activated convolutional layers, each followed by a max-pooling operation, with a softmax-activated two-neuron output layer. The convolutional layers are specified with four filters, a kernel size of two, a stride of one, and padding such that the input and output shapes match. This padding technique was used to improve pixel utilization across each convolution, as well as to mitigate shrinking outputs of the already relatively small images, by adding extra filler pixels (set to values of zero) around the edge of each image. The max-pooling layers are defined with a kernel size of two and a stride of two. As with the CNN architecture of the main text, some popular CNN architecture features (such as batch normalization) are unavailable here due to lack of implementation support by some XAI methods.
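The described MLP can be sketched as a forward pass in plain numpy. This is one plausible reading of the description, with dimensions 64 → 32 → 16 → 8 realized by ReLU layers before the two-neuron softmax output (the exact layer count is ambiguous in the prose); the weights are random placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

# Hidden dimensions halving from the flattened 8x8 input: [64, 32, 16, 8].
sizes = [64, 32, 16, 8]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1
           for i in range(len(sizes) - 1)]
w_out = rng.standard_normal((2, sizes[-1])) * 0.1  # two-neuron output layer

x = rng.standard_normal(64)     # a flattened 8x8 image
h = x
for w in weights:
    h = relu(w @ h)             # ReLU-activated hidden layers
probs = softmax(w_out @ h)      # class probabilities over the two classes
```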
Figure B9 shows the training results across ten α values, alongside Table B2, which lists the chosen α values used for analysis. Each network was trained over 500 epochs using the Adam optimizer without regularization, with a learning rate of 0.004 for the LIN, MULT, and XOR scenarios, and 0.0004 for the RIGID scenario.
Figures B10 and B11 show qualitative results for local and global explanations, respectively, and Figures B13 and B14 show quantitative results for the EMD and Precision metrics, respectively.

Fig. B7 :
Fig. B7: Precision score for every XAI method tested, separated by model architecture and depicted as boxplots of median and quartile performance scores. Most methods outperform the baseline methods for most model-scenario parameterization pairs. The 'x' method, using the input data as the reference point of explanation, performs better for scenarios with higher signal-to-noise ratio (SNR), as the tetromino patterns will, on average, be more salient in the data there, and thus yield higher precision on average. Namely, the RIGID WHITE and IMAGENET scenarios generally require a higher SNR to be appropriately modeled. PatternNet and PatternAttribution, designed to nullify the influence of suppressor variables, generally perform well in the LIN and XOR WHITE cases, similar to the results shown by Wilming et al (2022); however, these methods struggle in various other non-linear problem scenarios. LIME struggles across all scenarios, but performs better in the results shown in supplementary materials Section B.8, with the smaller 8 × 8-px image benchmark. Similarly to the results of Figure B5, no XAI method performs outright the best across all scenarios.

Fig. 1 :
Fig. 1: Examples of data for each scenario, showing differences between samples of each class.

Fig. 2 :
Fig. 2: Absolute-valued importance maps obtained for a random correctly-predicted data sample, for selected XAI methods and baselines. Recovery of the ground truth pattern across all scenarios is best shown by XAI methods applied to a Linear Logistic Regression (LLR) model. The Multi-Layer Perceptron (MLP) tends to focus on noise in the case of ImageNet backgrounds, and LIME often fails to produce meaningful explanations across all model architectures.
Fig. 3: Quantitative explanation performance of individual sample-based feature importance maps produced by various XAI approaches and baseline methods on correctly-predicted test samples, as per the EMD (top) and IMA (bottom) metrics. Depicted are boxplots of median explanation performance, with upper and lower quartiles as well as outliers shown. The white areas (left) show results for white background noise (WHITE), whereas the light gray shaded areas (middle) show results for the correlated background noise (CORR) scenarios and the darker gray areas (right) for ImageNet (IMAGENET) backgrounds.

Fig. B1 :
Fig. B1: Examples of generated data samples for each scenario, showing how a generated sample of Class #0 (where y=0) for each scenario varies across four signal-to-noise ratios (SNRs) α.

Fig. B2 :
Fig. B2: Average test accuracy over 10 model trainings for each problem scenario and model architecture, for a fixed range of signal-to-noise ratios (SNRs). As expected, the Linear Logistic Regression (LLR) model cannot perform above chance level for non-linear scenarios. The Convolutional Neural Network (CNN) outperforms the Multi-Layer Perceptron (MLP) for the RIGID (translations and rotations of tetrominoes) scenarios as expected, perhaps due to this architecture's invariance under these transformations.

Fig. B4 :
Fig. B4: Absolute-valued global importance maps calculated as the mean importance value over all correctly predicted samples, for selected XAI methods and baselines. RIGID scenarios involving translations and rotations of the tetromino signal pattern are not included, as they have no fixed ground truth position. CORR scenarios with correlated background can be seen to produce noisier global importance maps, suggesting that this setting induces suppressor variables in the background, which are difficult for XAI methods to distinguish from the true signal pattern. Results for the ImageNet background also tend to show noisier global explanations, suggesting that the complicated and variable features of this background type present a challenge to the models and corresponding XAI methods. LIME fails to produce any meaningful explanations yet again, suggesting an issue with this scale of image. The results of supplementary materials Section B.8 show better performance for LIME with the smaller 8 × 8-px image benchmark.

Fig. B5 :
Fig. B5: EMD metric based on the Earth Mover's Distance (EMD) for every XAI method tested, separated by model architecture and depicted as boxplots of median and quartile performance scores. Guided GradCAM is only implemented for CNN architectures, and the Keras models required for PatternNet, PatternAttribution, and Deep Taylor Decomposition (DTD) struggled to converge for the XOR scenarios as stated above, so these are excluded from the corresponding sub-plots. Some methods see a drop in explanation performance as model complexity increases from the Linear Logistic Regression (LLR) model to a Convolutional Neural Network (CNN). In the RIGID CORR case, the model-ignorant Laplace filter outright performs the best for explanations of MLP decisions and nearly so for the CNN. The SHAP variants DeepSHAP, GradSHAP, and Shapley Value Sampling perform very similarly to one another in most cases across all model types, despite being formulated to target particular problems. No XAI method performs outright the best across all scenarios.

Fig. B6 :
Fig. B6: IMA metric results for every XAI method tested, separated by model architecture and depicted as boxplots of median and quartile performance scores. Guided GradCAM is only implemented for CNN architectures, and the Keras models required for PatternNet, PatternAttribution, and Deep Taylor Decomposition (DTD) struggled to converge for the XOR scenarios as stated above, so these are excluded from the corresponding sub-plots. For the most part, results are relatively consistent with the EMD results of Figure B5. Some methods see a drop in explanation performance as model complexity increases from the Linear Logistic Regression (LLR) model to a Convolutional Neural Network (CNN). In the RIGID CORR case, the model-ignorant Laplace filter outright performs the best for explanations of MLP decisions and nearly so for the CNN. The SHAP variants DeepSHAP, GradSHAP, and Shapley Value Sampling perform very similarly to one another in most cases across all model types, despite being formulated to target particular problems. One noticeable difference between the EMD results of Figure B5 and the results shown here is that PatternAttribution performs outright best for LIN WHITE under the LLR and MLP, and for XOR WHITE under the MLP. In contrast, PFI performs strongly for many scenarios under the CNN, but poorly under the MLP. No XAI method performs outright the best across all scenarios.
(a) One generated sample of Class #0 (where y=0) for four different SNRs α. (b) Two generated samples of each class per scenario.

Fig. B8 :
Fig. B8: Examples of generated 8 × 8-px data samples for each scenario, showing how an example for each scenario varies across four signal-to-noise ratios (SNRs) α (top).

Fig. B9 :
Fig. B9: Average test accuracy over 10 model trainings for each problem scenario and model architecture of the 8 × 8-px setting, for a fixed range of signal-to-noise ratios (SNRs). As expected, the Linear Logistic Regression (LLR) model cannot perform above chance level for non-linear scenarios. The Convolutional Neural Network (CNN) would be expected to outperform the Multi-Layer Perceptron (MLP) for the RIGID (translations and rotations of tetrominoes) scenarios due to this architecture's invariance under these transformations. However, performance is comparable, with the MLP obtaining an average test accuracy above the 80% threshold at a lower SNR than the CNN. This may be partially due to the compromise in the architecture of the CNN, where we were not able to use Batch Normalization due to incompatibility with some XAI frameworks and methods.

Fig. B10 :
Fig. B10: Absolute-valued importance maps obtained for a random correctly-predicted 8 × 8-px data sample, for selected XAI methods and baselines. Recovery of the ground truth pattern across all scenarios is best shown by XAI methods applied to a Linear Logistic Regression (LLR) model.

Fig. B11 :
Fig. B11: Absolute-valued global importance maps calculated as the mean importance value over all correctly predicted 8 × 8-px scenario samples, for selected XAI methods and baselines. RIGID scenarios involving translations and rotations of the tetromino signal pattern are not included, as they have no fixed ground truth position. CORR scenarios with correlated background can be seen to produce noisier global importance maps, suggesting that this setting induces suppressor variables in the background, which are difficult for XAI methods to distinguish from the true signal pattern.

Fig. B12 :
Fig. B12: Quantitative explanation performance of individual sample-based feature importance maps produced by various XAI approaches and baseline methods on correctly-predicted 8 × 8-px scenario test samples, as per the EMD metric. Depicted are boxplots of median explanation performance, with upper and lower quartiles as well as outliers shown. The white area (left) shows results for white background noise (WHITE), whereas the gray shaded area (right) shows results for the correlated background noise (CORR) scenarios. Explanation performance decreases as model complexity increases (from LLR to MLP to CNN), with the exception of the RIGID scenarios, where the CNN is better suited to the non-static ground truth patterns present. Unlike results seen for linear data (Wilming et al, 2022), PatternNet and PatternAttribution do not outright outperform other XAI methods for most configurations.

Fig. B13 :
Fig. B13: EMD metric based on the Earth Mover's Distance (EMD) for every XAI method tested in the 8 × 8-px setting, separated by model architecture and depicted as boxplots of median and quartile performance scores. Consistent with the results of Figure B12, explanation performance tends to decrease as model complexity increases, from the Linear Logistic Regression (LLR) model to a Convolutional Neural Network (CNN). An exception is seen for the RIGID scenarios, where most XAI methods applied to the CNN outperform their Multi-Layer Perceptron (MLP) equivalents. In this case, the model-ignorant Laplace filter performs the best across both architectures.

Fig. B14 :
Fig. B14: Precision score for every XAI method tested in the 8 × 8-px setting, separated by model architecture and depicted as mean and standard deviation performance scores. Most methods outperform the baseline methods for most model-scenario parameterization pairs. The 'x' method, using the input data as the reference point of explanation, performs better for scenarios with higher signal-to-noise ratio (SNR), as the tetromino patterns will, on average, be more salient in the data there, and thus yield higher precision on average. Namely, the RIGID and WHITE scenarios generally require a higher SNR to be appropriately modeled. Outside of this, performance of the XAI methods for the Convolutional Neural Network (CNN) is comparable to the baseline methods.

Table 1 :
Table 1: Results of the model training process for each classification setting, model architecture, and background type. These results are depicted as chosen signal-to-noise ratios (SNRs), parameterized by α, as well as the average test accuracy (ACC, %). Experiments were split into a training set D_train, a validation set D_val, and a test set D_test. From this, we compute absolute-valued importance maps |s| for the intersection of test data D_test correctly predicted by every appropriate classifier. The full table of training results for finding appropriate SNRs can be seen in Appendix B.5.

Table B2 :
Results of the model training process for each classification setting, model architecture, and background type in the 8 × 8-px setting. These results are depicted as chosen signal-to-noise ratios (SNRs), parameterized by α, as well as the average test accuracy (ACC, %).