Preemptively Pruning Clever-Hans Strategies in Deep Neural Networks

Explainable AI has become a popular tool for validating machine learning models. Mismatches between the explained model’s decision strategy and the user’s domain knowledge (e.g. Clever Hans effects) have also been recognized as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that user acceptance of explanations does not guarantee that a machine learning model functions well; in particular, some Clever Hans effects may remain undetected. Such hidden flaws of the model can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), which preemptively prunes variations in the ML model that have not been the subject of positive explanation feedback. Experiments demonstrate that our approach leads to models that strongly reduce their reliance on hidden Clever Hans strategies and consequently achieve higher accuracy on new data.


Introduction
Machine learning (ML) models such as deep neural networks have been shown to be capable of converting large datasets into highly nonlinear predictive models [44,8,54,77,86,18]. As ML systems are increasingly being considered for high-stakes decision making, such as autonomous driving [4] or medical diagnosis [17,90,64,38,81], building them in a way that they reliably maintain their prediction accuracy on new data is crucial.
Proper data splitting and evaluation of trained models on hold-out test sets have long been recognized as an essential part of the validation process (e.g. in [11]), but unfortunately, such techniques cannot detect all flaws of a model [45,66,27]. Misspecified loss functions, spurious correlations, or biased datasets can potentially compromise attempts to build well-generalizing models, without altering the measured accuracy. A failure to address these more elusive flaws might lead to catastrophic failures, as has been demonstrated numerous times (e.g. [62,26,88,45,31,90,10,64]), which has spurred efforts to find potential causes of such failures (e.g. [25,45,58,89,14,84,60]). Furthermore, in modern real-world scenarios, data is often non-i.i.d. due to intrinsic heterogeneities in the data generating process (different users, locations, sensors, etc.) [35] and plagued with spurious correlations [16]. The task of robustification against the use of spurious features, also known as the Clever Hans (CH) effect [45] or shortcut learning [27], is an especially challenging endeavor because, to the model, such CH features are indistinguishable from truly generalizing ones.
Explainable AI (XAI) [30,71,9,70,34] is a natural starting point for robustification because it places a human in the loop: explanation techniques seek to describe a model to the human in an intelligible way, e.g. based on features that can be visualized or that have a specific meaning. Using such methods, human experts can identify hidden flaws in the model [14] and provide useful feedback for model improvement [7,56,83,74,3]. For example, if a local explanation reveals the use of spuriously correlated features [45] (Clever Hans effect), the expert may take action to correct this flawed prediction strategy. More training examples that do not exhibit the spurious correlation may be provided [60], or the model may be trained to match the users' ground-truth explanations [65,63,79,83,74,3].
However, the perhaps more common case, where the explanation returned to the user is correct, i.e. it agrees with the knowledge of the human expert, is so far little explored. In this paper, we demonstrate that a model validated with classical XAI pipelines is still not guaranteed to perform well on new data. Specifically, we find that CH strategies may remain undetected, especially when the data available for validation is limited or incomplete, supporting recent findings by Adebayo et al. [1]. Given the rising popularity of foundation models [13] (multi-purpose models made available in pretrained form by a third party), and given that such models do not always come with the full dataset used for training, this scenario is not an academic exercise but a timely concern.

(Figure 1, caption excerpt: The user receives the model from the third party. Because the user has limited data (in particular, no data with a copyright tag), the flaw of the model cannot be easily detected with XAI methods and the model appears to be both accurate and right for the right reasons. Right: Because of the undetected flaw, a naive deployment of the model is likely to result in prediction errors on new data (e.g. cats with copyright tags predicted to be horses). Our proposed approach preemptively reduces exposure to unseen features (the copyright tag), thereby avoiding these incorrect predictions.)
To address the problem of undetected CH strategies, we propose to refine the original third-party model in a way that its overall feature exposure is reduced, subject to the constraint that explanations presented to the user remain the same. Specifically, we contribute a method, called Explanation-Guided Exposure Minimization (EGEM), that formulates an optimization problem weighting the feature exposure and explanation constraints. With mild approximations, our formulation simplifies to easy-to-implement soft-pruning rules that enable the removal or mitigation of undetected CH strategies. Crucially, our refinement method only requires the data points whose predictions and explanations have been approved by the user. Neither prior knowledge about the spurious feature, nor data containing it, is needed. Our proposal, as well as the context in which it operates, are illustrated in Fig. 1.
To evaluate our approach, we simulate a number of scenarios of a user receiving a third-party model and possessing a subset of the data on which no CH strategies can be detected by classical XAI pipelines (e.g. LRP/SpRAy [6,45]).
Results on image data demonstrate that our proposed EGEM approach (and its extension PCA-EGEM) delivers models with a much lower reliance on CH strategies, thereby achieving more stable prediction accuracy, especially when considering data with spurious features. Our approach also outperforms a number of existing and contributed baselines.

Related Work
In this section, we present related work on validating ML models that goes beyond classical validation techniques such as holdout or cross-validation [11] in order to address statistical artifacts such as domain shifts and spurious correlations. We make a distinction between methods relying on Explainable AI and users' explanatory feedback (Section 2.1), and a broader set of methods addressing domain shift and spurious correlations by statistical means (Section 2.2).

Explainable AI and Clever Hans
Explainable AI (XAI) [29,53,71,93,70] is a major development in machine learning which has enabled insights into a broad range of black-box ML models. It has been shown to be successful at explaining complex state-of-the-art neural network classifiers [6,62,75,92,73], as well as a broader set of ML techniques such as unsupervised learning (e.g. [51,39]). While most XAI methods generate an explanation for individual instances, solutions have been proposed to aggregate them into dataset-wide explanations that can be concisely delivered to the user [45].
Notably, XAI techniques have been successful at revealing CH features in ML models [45,60]. Knowledge about the CH features can be used to desensitize the model to these features (e.g. via retraining [60] or layer-specific adaptations [3]). If ground-truth explanations are available (e.g. provided by a human expert), the model may be regularized to match these explanations [65,63,79], e.g. by minimizing the error on the explanation via gradient descent. Such adaptations to the users' expectations have also been shown to be effective in interactive settings [83,74]. Our approach differs from these works in that we address the case where the available data does not contain CH features, hence making them undiscoverable by local explanations, and where the model is pretrained and thus cannot be regularized during training.
A different approach is DORA [15], which attempts to find potential CH features in a data-agnostic way and subsequently uses the discovered candidate features to detect faulty decision strategies at deployment time. In contrast, we attempt to robustify the network with no need for further post-processing at deployment. Furthermore, we examine the scenario where a limited amount of clean data is available, allowing us to employ conceptually different criteria besides outlierness.

Robustness to Spurious Correlations
Our work is part of a larger body of literature concerned with domain shift and how to design models robust to it. Yet, it concerns itself with a decidedly specialized and rather recent part of this area: unlearning or avoiding the use of spurious features in deep neural networks. Previous work attempting to create models that are robust against spurious correlations approached the problem from the angle of optimizing worst-group loss [22,36,68,69,80,37,43,57]. This approach has been shown to be effective in reducing reliance on CH features. Yet, these methods require access to samples containing the CH features and a labeling of groups in the data induced by these features. In particular, as previously pointed out by Kirichenko et al. [43], Group-DRO (distributionally robust optimization) [36], subsampling approaches [69,37], and DFR (deep feature reweighting) [43] assume group labels on the training or validation data, and even methods that do away with these assumptions need to rely on group labels for hyper-parameter tuning [49,22,37]. Our setting is different from the ones above in that we assume that a pretrained model is to be robustified post hoc with limited data and that data from the groups containing the CH feature are not available at all. We believe this is a highly relevant scenario, considering the increasing prevalence of pretrained third-party models that have been trained on datasets that are unavailable or too large to fully characterize.

Explanation-Guided Exposure Minimization (EGEM)
Let us restate the scenario of interest in this paper, highlighted in Fig. 1: (1) a model provided in pretrained form by a third party, (2) a user who has limited data available to validate the third-party model and who concludes that the predictions and the associated decision strategies (as revealed by XAI) on this limited data are correct. As argued before, in spite of the positive validation outcome, there is no guarantee that the model's decision strategy remains correct in regions of the input space not covered by the available data.
As a solution to the scenario above, we propose a preemptive model refinement approach, which we call Explanation-Guided Exposure Minimization (EGEM). Technically, our approach is a particular form of knowledge distillation where the refined (or distilled) model should reproduce observed prediction strategies (i.e. predictions and explanations) of the original model on the available data. At the same time, the refined model should minimize its overall exposure to variations in the input domain so that undetected (potentially flawed) decision strategies are not incorporated into the overall decision strategy.
Let the original and refined model have the same architecture but distinct parameters $\theta_{\text{old}}$ and $\theta$. We denote by $f(x, \theta_{\text{old}})$ and $f(x, \theta)$ the predictions produced by the two models, and the explanations associated to their predictions as $R(x, \theta_{\text{old}})$ and $R(x, \theta)$ respectively. We then define the learning objective as
$$\min_{\theta}\;\mathbb{E}\big[\,\|R(x,\theta) - R(x,\theta_{\text{old}})\|^2 + \lambda\,\Omega(x,\theta)\,\big]$$
where the expectation is computed over the available data, and where $\Omega(x, \theta)$ is a function that quantifies the exposure of the model to the input variation when evaluated at the data point $x$.
Although the aforementioned formulation is general, it is not practical, because it would require optimizing a highly nonlinear and non-convex objective. Moreover, the objective depends on explanation functions which themselves may depend on multiple model evaluations, thereby making the optimization procedure intractable.

A Practical Formulation for EGEM
To make the concept of explanation-guided exposure minimization effective, we will restrict our analysis to XAI methods that can attribute onto any layer of the model and whose produced scores have a particular structure. Specifically, we require the score assigned to a neuron $i$ at a given layer to be decomposable in terms of the neurons $j$ in the layer above, i.e. $R_i = \sum_j R_{ij}$, and terms of the decomposition should have the structure
$$R_{ij} = a_i\,\rho(w_{ij})\,d_j$$
where $a_i$ denotes the activation of neuron $i$, $w_{ij}$ is the weight connecting neuron $i$ to neuron $j$ in the next layer, $\rho$ is an increasing function satisfying $\rho(0) = 0$ (e.g. the identity function), and $d_j$ is a term that only indirectly depends on the parameters in the given layer and that is reasonable to approximate as constant locally. Explanation techniques that produce explanation scores with such structure include backpropagation methods such as Layer-wise Relevance Propagation (LRP) [6,55] and gradient-based techniques such as Gradient × Input (GI) and Integrated Gradients (IG). (See Supplementary Note A for derivations.)

(Figure 2, caption excerpt: The refined model only retains the dependence on $a_2$ (a neuron detecting the actual horse) and removes its reliance on $a_3$ (a neuron responsive to spurious copyright tags). Bottom: Qualitative behavior of PCA-EGEM on ML models trained on real datasets. The models produced by our approach become robust to spurious features not seen at inspection time but occurring at deployment time. Pixel-wise explanations are computed using the zennit package [2].)
We now present our practical formulation of explanation-guided exposure minimization. First, it imposes explanation similarity between the original and the refined model on the messages $R_{ij}$ at a specified layer of the network, which is not necessarily the input layer. Furthermore, it restricts the search for refined parameters to the weights of the same layer. Hence, our scheme can be interpreted as changing the parameters of a given layer so that the overall model has minimized its exposure, subject to the explanation at that layer remaining the same. Specifically, we solve:
$$\min_{w}\;\mathbb{E}\Big[\sum_{ij}\big(a_i\,\rho(w_{ij})\,d_j - a_i\,\rho(w_{ij}^{\text{old}})\,d_j\big)^2 + \lambda \sum_{ij} s_{ij}(w_{ij})^2\Big]$$
where the expectation is taken over the available data. The first squared term constrains the explanations of the refined model to be close to those of the original model. The second squared term corresponds to the penalty used for exposure minimization. The quantity $s_{ij}(w_{ij}) = \rho(w_{ij})\,d_j$ which we use for this purpose can be interpreted as the way by which the refined model responds to the activation of neuron $i$ through neuron $j$; in particular, if $(s_{ij})_j$ becomes zero, the model becomes unresponsive to the activation of neuron $i$. An advantage of this formulation is that it has the closed-form solution:
$$\rho(w_{ij}) = \frac{\mathbb{E}[a_i^2\, d_j^2]}{\mathbb{E}[a_i^2\, d_j^2] + \lambda\,\mathbb{E}[d_j^2]}\;\rho(w_{ij}^{\text{old}})$$
See Supplementary Note B for a derivation. In other words, the refined model can be seen as a soft-pruned version of the original model where the pruning strength depends on how frequently and to what magnitude the input neuron is activated and how the model responds to the output neuron. If we further assume that $a_i$ and $d_j$ are independent (or more weakly that their squared magnitudes are decorrelated), then $d_j$ vanishes from Equation (4). Furthermore, the refined model can be obtained by keeping the weights intact and inserting a layer directly after the activations that performs the scaling
$$a_i \;\leftarrow\; c_i\, a_i \qquad \text{with} \qquad c_i = \frac{\mathbb{E}[a_i^2]}{\mathbb{E}[a_i^2] + \lambda}.$$
The pruning of the neural network architecture and the resulting loss of dependence on the CH feature are depicted in Fig. 2 (top).
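To make the soft-pruning rule concrete, the following sketch estimates per-neuron scaling coefficients from activations collected on the user-approved refinement data. It assumes the ridge-like form $c_i = \mathbb{E}[a_i^2]/(\mathbb{E}[a_i^2]+\lambda)$; the function names and the NumPy-based setup are our own illustration, not the paper's reference implementation.

```python
import numpy as np

def egem_scaling(acts, lam):
    """Per-neuron soft-pruning coefficients c_i = E[a_i^2] / (E[a_i^2] + lam).

    acts: (n_samples, n_neurons) activations of one layer, collected on the
    approved refinement data; lam: strength of the exposure penalty.
    """
    second_moment = np.mean(acts ** 2, axis=0)
    return second_moment / (second_moment + lam)

def apply_egem(acts, coeffs):
    """Virtual scaling layer inserted after the activations: a_i <- c_i * a_i."""
    return acts * coeffs
```

A neuron that is rarely or weakly activated on the refinement data receives a coefficient close to zero and is thus soft-pruned, while neurons strongly supported by the approved data are largely preserved.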
The same approach can also be applied to convolutional layers. To calculate the scaling parameters $c_i$, the activations of each channel are summed up along the spatial dimensions. For refinement, the pruning coefficients are then applied to all activations of the corresponding feature map (cf. Eq. 5). Such a pruning strategy for convolutional layers can be derived exactly from Eq. (3) when assuming activation maps of infinite size (or circular convolutions) and stride 1. For the majority of convolution layers used in practice, Eq. (5) only derives from the objective formulation approximately.

Pruning in PCA Space
Within the EGEM soft-pruning strategy, each dimension of a layer is pruned individually. In practice, this only allows eliminating undetected flawed strategies that use a set of neurons that is disjoint from the validated strategies. Because a given neuron may contribute both to detected and to undetected strategies, the standard version of EGEM may not be able to carry out the exposure minimization task optimally. To address this limitation, we propose PCA-EGEM, which inserts a virtual layer, mapping activations to the PCA space (computed from the available data) and back (cf. Fig. 2). PCA-EGEM then applies soft-pruning as in Eq. (5), but in PCA space, that is:
$$a \;\leftarrow\; \bar{a} + \sum_{k=1}^{K} U_k\, c_k\, U_k^{\top}(a - \bar{a}) \qquad \text{with} \qquad c_k = \frac{\mathbb{E}\big[(U_k^{\top}(a - \bar{a}))^2\big]}{\mathbb{E}\big[(U_k^{\top}(a - \bar{a}))^2\big] + \lambda}$$
Here, $\{U_k\}_{k=1}^{K}$ is the basis of PCA eigenvectors and $\bar{a}$ is the mean of the activations over the available data. The motivation for such a mapping to the PCA space is that activation patterns that support observed strategies will be represented in the top PCA components. PCA-EGEM can therefore separate them better from the unobserved strategies that are likely not spanned by the top principal components.
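A minimal sketch of the PCA-EGEM virtual layer, assuming the same ridge-like coefficient form as for EGEM but applied to the second moments of the PCA projections. The PCA basis is obtained here via an SVD of the centered refinement activations; all names are our own illustration.

```python
import numpy as np

def pca_egem_layer(acts_refine, lam, K=None):
    """Fit a PCA-EGEM virtual layer on refinement activations.

    Returns (mean, PCA basis U, coefficients c); pruning acts on the PCA
    projections rather than on individual neurons.
    """
    a_bar = acts_refine.mean(axis=0)
    centered = acts_refine - a_bar
    # PCA basis from an SVD of the centered refinement activations
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    U = Vt.T if K is None else Vt.T[:, :K]
    proj = centered @ U                  # activations in PCA space
    m2 = np.mean(proj ** 2, axis=0)      # second moment per component
    c = m2 / (m2 + lam)                  # soft-pruning in PCA space
    return a_bar, U, c

def apply_pca_egem(acts, a_bar, U, c):
    """Map to PCA space, scale, and map back: a <- a_bar + U diag(c) U^T (a - a_bar)."""
    return a_bar + (((acts - a_bar) @ U) * c) @ U.T
```

Components strongly expressed on the approved data keep coefficients near one, while directions unsupported by the refinement data are attenuated, even when they share neurons with validated strategies.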
While using PCA to find principal directions of interpretable features in GANs [28] was proposed previously by Härkönen et al. [32] and has found application beyond that [76,19], to our knowledge its use for the purpose of identifying a basis for exposure minimization is novel.

Experimental Evaluation
In this section, we evaluate the efficacy of the approaches introduced in Section 3 on various datasets that either naturally contain spurious correlations, giving rise to Clever Hans decision strategies, or have been modified to introduce such spurious correlations. After introducing the datasets, we demonstrate that the proposed approaches can mitigate the effect of CH behavior learned by various pretrained models. We do this by evaluating our approaches on test datasets where the correlation of the CH feature and the true class is manipulated, i.e. a distribution shift is introduced. Additionally, we empirically explore the effect of the number of samples used for refinement and discuss the challenges of hyper-parameter selection in our setting in Sections 4.4 and 4.5. A more qualitative evaluation on the CelebA dataset [50] follows in Section 5.

Datasets
We introduce here the datasets used to evaluate the proposed methods in Sections 4.3 to 4.5: variants of the MNIST dataset [46], the ImageNet dataset [23,67], and the ISIC dataset [20,85,21]. Details on the preprocessing and the neural networks used for each dataset can be found in the Supplemental Notes.
Modified MNIST. The original MNIST dataset [46] contains 70,000 images of hand-written digits, 10,000 of which are test data. We create a variant in which digits of the class '8' are superimposed with a small artifact in the top-left corner, with a probability of 0.7 (see Fig. 2). In order to generate a natural yet biased split of the training data that separates an artifact-free set of refinement data, we train a variational autoencoder [42] on this modified dataset and manually choose a threshold along a latent dimension such that the samples affected by the artifact fall only on one side. This defines a subset of 39,942 samples from which clean refinement datasets are sampled and leaves a systematically biased subset (containing all modified '8' samples) only accessible during training. We train a small neural network (2 convolutional and 2 fully connected layers) on this dataset using binary cross-entropy loss over all ten classes on the whole training data.
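The artifact insertion step can be sketched as follows. The exact patch used in the paper is not specified here, so the 3×3 white square, its placement, and the function names are assumptions for illustration only.

```python
import numpy as np

def add_artifact(images, labels, target_class=8, p=0.7, size=3, value=1.0,
                 rng=None):
    """Superimpose a small top-left patch (a hypothetical 3x3 white square)
    on images of the target class with probability p.

    images: (n, H, W) array in [0, 1]. Returns a modified copy plus a
    boolean mask marking which images were altered.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = images.copy()
    # only target-class images are candidates; each is hit with probability p
    hit = (labels == target_class) & (rng.random(len(images)) < p)
    out[hit, :size, :size] = value
    return out, hit
```

The returned mask makes it easy to later separate artifact-free images for a clean refinement split, mirroring the dataset construction described above.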
ImageNet. We use the ILSVRC 2012 subset of the ImageNet dataset [23,67], containing 1.2M natural images, each associated with one of 1000 classes, for training. For evaluation, we use the 50 labeled samples per class that are contained in the validation set. Previous work has identified multiple spurious correlations potentially affecting a model's output [3]. In our experiments, we use a watermark and web address on images of the 'carton' class and a gray frame around images of the 'mountain bike' class as Clever Hans features (see Fig. 2). We vary their frequency in the test set by pasting these features on images (details in Supplementary Note C). The selected classes are evaluated in a binary classification setting against the most similar classes in terms of the output probabilities. Training set images used for refinement that do not contain the CH feature are selected manually for the 'carton' experiments and automatically for the 'mountain bike' experiment. For experiments on this dataset, we make use of the pretrained ResNet50 [33] (for the 'carton' class) and VGG-16 [78] (for the 'mountain bike' class) networks available in PyTorch.
ISIC. The ISIC 2019 dataset [20,85,21] consists of images containing skin lesions that are associated with one of eight medical diagnoses. The data is split into 22,797 samples for training and 2,534 for evaluation. We fine-tune a neural network based on a VGG-16 pretrained on ImageNet for this classification task using a cross-entropy loss. Some images of the class 'Melanocytic nevus' are contaminated with colored patches (see Fig. 2), which have been recognized as a potential CH feature [52,63,12,3]. We manually remove all contaminated images after training and use this clean dataset for refinement. Images in the test set are contaminated at the desired ratio by pasting one extracted colored patch onto other images.

Methods
We compare several methods for the mitigation of the Clever Hans effect. The most basic baseline is the original pretrained model (Original). We evaluate both EGEM and PCA-EGEM, as well as what we will call response-guided exposure minimization (RGEM), which attempts to maintain the last-layer responses, rather than the explanations, while minimizing exposure by penalizing large weights. Specifically, RGEM solves
$$\min_{w}\;\mathbb{E}\big[(f(x, \theta_{\text{old}}) - w^{\top} a(x))^2\big] + \lambda\,\|w\|^2$$
where the expectation is computed over the available data and $a(x)$ denotes the last-layer activations. See Supplementary Note D for more details. Furthermore, we evaluate a version of the original model fine-tuned on the refinement set (Retrain) and a version of the original model where the last layer has been replaced by weights learned via ridge regression on the refinement data (Ridge). The latter is equivalent to linear probing or DFR, which has been shown to be effective in mitigating accuracy loss due to subpopulation shifts [72] and the Clever Hans effect, when hyper-parameter selection based on worst-group accuracy optimization is possible [43]. The formulation for Ridge can also be retrieved by replacing the output of the original model, $f(x, \theta_{\text{old}})$, in the formulation of RGEM in Supplementary Note D with the ground-truth labels.
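Under our reading of the formulation, RGEM and the Ridge baseline share the same ridge-regression closed form on the last-layer activations and differ only in the regression targets (original model outputs vs. ground-truth labels). A sketch of that shared solution:

```python
import numpy as np

def rgem_last_layer(acts, targets, lam):
    """Ridge solution  W = (A^T A + lam I)^{-1} A^T T.

    Acts as RGEM when `targets` are the original model's outputs
    f(x, theta_old), and as the Ridge baseline when `targets` are
    ground-truth labels.  acts: (n, d) last-layer activations.
    """
    n_features = acts.shape[1]
    gram = acts.T @ acts + lam * np.eye(n_features)
    return np.linalg.solve(gram, acts.T @ targets)
```

Larger values of lam shrink the new last-layer weights toward zero, which is what reduces the model's exposure to weakly supported features.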

Results
To evaluate robustness against spurious features, we generate a fully poisoned test set by adding the CH artifact uniformly over all test images of all classes. This serves as a scenario where the correlation of the CH feature and the target class breaks. Such a distribution shift could, for example, happen in medical applications where a classifier might be trained on data in which the mode of data collection or the population characteristics of subjects are correlated with the outcome, but this correlation does not hold in the general case [64]. Note that while this poisoning scenario is an extreme case, it is not the worst case, as the class which was contaminated during training will also be modified with artifacts during testing. For refinement, 700 correctly predicted samples per class are used, oversampling images if fewer than 700 correctly predicted samples are in the available refinement data. For the modified MNIST and the ISIC dataset, we use 1000 randomly chosen test samples for each run of the evaluation; for ImageNet, we use all available validation samples.
We evaluate the various models for all tasks under 0% and 100% uniform poisoning. Classification accuracy for intermediate levels of poisoning can be obtained by linear interpolation of these extremes. Figure 3 shows the obtained accuracy on those two levels of poisoning. An ideal model would obtain high accuracy with only a very small difference between clean and poisoned data. It should be invariant to the spurious feature and at most react to possible interference with other features, e.g. the spurious feature being pasted on top of a relevant part of the image, while not losing accuracy on the clean data. As expected, across all datasets increased poisoning reduces the accuracy of the original model. Importantly, this drop in accuracy cannot be detected without access to samples containing the CH feature.
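The linear-interpolation property follows from the fact that, under uniform poisoning at rate p, each test image is poisoned independently with probability p, so the expected accuracy is a convex combination of the two extremes. A minimal sketch:

```python
def accuracy_at_poison_rate(acc_clean, acc_poisoned, p):
    """Expected accuracy when each test image is independently poisoned
    with probability p: a linear interpolation between the 0% and 100%
    poisoning extremes."""
    return (1.0 - p) * acc_clean + p * acc_poisoned
```

This is why reporting only the two endpoint accuracies suffices to characterize the whole poisoning curve.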
On the modified MNIST dataset, the original model loses about 30% of its clean-data accuracy when evaluated at the 100% poisoning level. All other models achieve both clean-data and 100%-poisoned accuracy levels within 4% of the original model's clean-data accuracy. While explanation-based methods lose slightly more clean-data accuracy than the other baselines, they display virtually no gap between clean-data accuracy and 100%-poisoned accuracy, making them the most predictable when no poisoned data is available.
On the more complex ISIC dataset, it can be observed that exposure to the CH feature cannot be completely removed by any of the methods. EGEM and PCA-EGEM still provide fairly robust models, with the highest poisoned-data accuracy and the smallest gap between 0%-poisoned and 100%-poisoned accuracy. PCA-EGEM retains clean-data accuracy while being the only method improving poisoned-data accuracy by more than 10 percentage points. The dataset provides a challenge for all other methods. Even though Retrain is the only method that improves clean-data accuracy in the refinement process, its poisoned-data accuracy is virtually the same as the original model's, indicating that the absence of a feature in the refinement data is not enough to remove it from the model, given only a limited amount of samples.
On the ImageNet tasks containing the 'carton' class, PCA-EGEM is the most robust refinement method. It is only outperformed in the 100%-poisoned setting of the 'carton/envelope' task, where Retrain achieves the highest clean-data and poisoned-data accuracy. As we will see in Section 4.4, the inferior 100%-poisoning accuracy of PCA-EGEM is a result of the hyper-parameter selection procedure and not fundamentally due to the pruning-based nature of the method. In the 100%-poisoned setting of the 'mountain bike' task, no refinement method is able to achieve accuracy gains over the original model. This might be due to the small magnitude of the CH effect, resulting in the clean-data loss due to refinement outweighing the robustness gain. This case also demonstrates that refinement is not beneficial in all scenarios and might not even lead to an improved 100%-poisoned accuracy. Whether or not to refine should be decided based on whether the loss of clean-data accuracy can be tolerated.
Overall, this section's experiments demonstrate that the proposed refinement methods can preemptively robustify a pretrained model against Clever Hans effects, even if the latter cannot be observed from the limited available data. We could clearly establish that the attempt to robustify against CH behavior in the absence of the associated artifact or knowledge thereof is not a hopeless endeavor and can be addressed with relatively simple methods. Yet, the trade-off between clean-data accuracy and poisoned-data accuracy cannot be directly observed and thus needs to be resolved heuristically. We explore this aspect in the next section.

Hyper-parameter Selection
The hyper-parameters optimized in the experiments in this section are the number of epochs for 'Retrain' and the regularization factor λ for all other refinement methods. For the deep exposure-based approaches, EGEM and PCA-EGEM, we do not optimize λ for each layer directly; rather, we employ an approach inspired by the triangular method of Ashouri et al. [5] and earlier work [61], where the pruning strength increases with the layer index. To this end, we define thresholds $\tau_l$ that denote the desired average pruning ratio for layer $l$ within the set of $L$ layers to be refined, let them increase with the layer index as a function of a single parameter $\alpha$, and optimize $\alpha$; this reduces the number of parameters to optimize to one. $\lambda_l$ is then set such that the average pruning factor $\mathbb{E}_{x}\,\mathbb{E}_{j}\,c_j$ from Eq. (4) for layer $l$ reaches at least $\tau_l$. The search for $\lambda_l$ given $\tau_l$ can be easily implemented as an exponential search.
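The search for $\lambda_l$ can be sketched as follows, assuming ridge-like retention coefficients $c_i = m_i/(m_i+\lambda)$ (with $m_i$ the second moment of activation $a_i$) and defining the average pruning ratio as one minus the mean coefficient. The exponential-growth-plus-bisection scheme and its tolerances are our own illustration of "exponential search".

```python
import numpy as np

def retention_coeffs(second_moment, lam):
    """Retention coefficients c_i = m_i / (m_i + lam) for one layer."""
    return second_moment / (second_moment + lam)

def find_lambda(second_moment, tau, lo=1e-8, hi=1e-8, iters=200):
    """Find the smallest lambda whose average pruning ratio
    1 - mean(c) reaches the per-layer target tau."""
    prune_ratio = lambda lam: 1.0 - retention_coeffs(second_moment, lam).mean()
    # grow hi exponentially until the target pruning ratio is reached
    while prune_ratio(hi) < tau:
        hi *= 2.0
    # then bisect between the last failing and first succeeding value
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if prune_ratio(mid) < tau:
            lo = mid
        else:
            hi = mid
    return hi
```

Because the pruning ratio increases monotonically with lambda, the combination of doubling and bisection converges quickly and needs only the per-layer activation statistics.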
Ideally, the hyper-parameters should be set such that classification loss is minimized while exposure to the spurious artifact is negligible. While classification loss on clean data can be readily approximated by evaluating the loss function on the refinement data, exposure to the spurious artifact is a more elusive quantity and cannot be measured without a priori knowledge of the spurious artifact. In previous work (e.g. [36,68,69,22,80,37,43,57]) it is assumed that for each class a set of samples with and without the spurious artifact is given, and in most cases that the worst-group accuracy can be directly optimized or at least used for hyper-parameter selection, circumventing this problem. Since in our problem setting access to samples with the artifact is not given, this metric for parameter selection is not available and we need to establish a heuristic approach. Assuming that the classification loss on clean data can be approximated accurately, one option is to pick the strongest refinement hyper-parameter (i.e. highest number of epochs, largest λ, or smallest α) from a pre-defined set (see Supplementary Note F) for which the validation accuracy after refinement is at least as high as the one achieved by the original model.
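This selection heuristic can be sketched as follows. The slack argument (in accuracy points) is an optional relaxation of the strict rule; setting it to zero recovers the rule just described. Function names and the candidate ordering are our own assumptions.

```python
def select_hyperparameter(candidates, val_accuracy, baseline_acc, slack=0.0):
    """Pick the strongest refinement setting whose clean-validation accuracy
    is within `slack` accuracy points of the original model.

    candidates: refinement settings ordered weakest -> strongest;
    val_accuracy: callable mapping a candidate to its validation accuracy;
    baseline_acc: validation accuracy of the original (unrefined) model.
    """
    chosen = candidates[0]
    for cand in candidates:  # later candidates = stronger refinement
        if val_accuracy(cand) >= baseline_acc - slack:
            chosen = cand
    return chosen
```

Because no poisoned samples are available, this rule only uses clean validation accuracy and therefore cannot observe the robustness gain directly; it simply pushes refinement as far as the clean-data accuracy budget allows.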
As it is possible that strong refinement also impairs the use of generalizing features, there may be a trade-off between clean-data accuracy and robustness to spurious features. That optimizing clean-data accuracy is generally not the best proxy for deployment-time accuracy is highlighted by the fact that other works optimize worst-group accuracy, as mentioned above. We explore the accuracy trade-off in Fig. 4 by introducing a 'slack' parameter s into the hyper-parameter selection for PCA-EGEM. We refer to Supplementary Note G for the results of all methods. The refinement hyper-parameter is then chosen as the strongest regularization for which the validation accuracy is at most s% smaller than the one achieved by the original model. The idea is that minimizing the loss of classification accuracy on the refinement data prevents removing too much exposure to useful features, yet allowing for some slack counteracts the tendency to choose trivial least-refinement solutions. We suspect that in the simple case of the modified MNIST dataset, the model only learned a few important high-level features and that the CH feature is close to being disentangled in some layer of the network. This scenario is a natural fit for pruning methods, which could simply remove the outgoing connections of the node corresponding to the CH feature. Stronger refinement risks pruning useful features as well, an effect that can be observed in Fig. 4. For most datasets, we can observe that the accuracy curves first converge to or maintain a minimal 0%-100% poisoning gap. In this regime, PCA-EGEM prunes unused or CH features. After crossing a certain level of slack, both accuracy values deteriorate as features necessary for correct classifications are pruned as well.
The results previously presented in Fig. 3 are the outcomes for s = 5%. This is a heuristic, and it can be seen from Fig. 4 that different values of slack may be beneficial to increase robustness, depending on the dataset. We also show in Supplementary Notes G and H that PCA-EGEM provides the most robust refinement over a large range of slack values. As slack cannot be optimized w.r.t. the true deployment-time accuracy, we propose to set s between 1% and 5% as a rule of thumb.
In principle, the choice of layers to refine is another hyper-parameter. Knowledge of the type of Clever Hans effect could potentially guide this choice [3,47], as the layer in which a concept is best represented may differ across concepts [41]. Since we do not assume such knowledge in our experiments, we simply refine the activations after every ResNet50 or VGG-16 block for the parts of the models derived from those architectures, and additionally after every ReLU following a fully connected or convolutional layer that is not contained in a ResNet or VGG block. For 'Retrain' we fine-tune the whole network, while RGEM and Ridge are restricted to the last layer.

The Effect of the Sample Size
As the number of instances available for refinement is limited, a natural question is what impact the sample size has on the efficacy of refinement, and whether refining with too few instances can be detrimental. In this section, we repeat the experiment from Section 4.3 with refinement datasets containing 25, 50, 200, 500, and 700 instances per class, for 0% and 100% uniform poisoning. Slack is again set to 5%. If fewer correctly classified instances are available for some class, these are over-sampled to reach the desired number.
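The over-sampling step can be sketched as follows. This is a hypothetical helper (the paper does not specify the implementation): classes with enough correctly classified instances are subsampled without replacement, and classes that fall short keep all available instances and draw the remainder with replacement:

```python
import random

def build_refinement_set(per_class_pool, n_per_class, seed=0):
    """Assemble a refinement set with a fixed number of instances per class,
    over-sampling classes whose pool of correctly classified instances is
    too small."""
    rng = random.Random(seed)
    refinement = {}
    for label, pool in per_class_pool.items():
        if len(pool) >= n_per_class:
            refinement[label] = rng.sample(pool, n_per_class)
        else:
            # keep every available instance, then draw the remainder
            # with replacement
            extra = rng.choices(pool, k=n_per_class - len(pool))
            refinement[label] = list(pool) + extra
    return refinement
```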
The effect of varying sample size is shown in Figure 5 (see Supplementary Note H for all other methods). It can be seen that, especially in the low-sample regime, the positive effect of refinement is modulated by the number of instances. While refinement with a small sample appears in most cases to be remarkably effective at increasing 100%-poisoned accuracy, clean-data accuracy tends to suffer, as the sample does not cover all of the features necessary to generalize, some of which are thus pruned away. For this reason, a larger refinement sample is in most cases beneficial, in particular for preserving clean-data accuracy. Yet, two cases stand out as breaking this rule: the modified MNIST dataset and the 'carton/envelope' task. In both cases, the gap between 0%- and 100%-poisoned accuracy is close to constant, suggesting that the drop in accuracy stems from a loss of generalizing features rather than a loss of robustness, as could be induced e.g. by samples contaminated with CH features. Considering the effect of slack, displayed in Fig. 4, we can also see that these two scenarios are the cases for which 5% slack is not optimal. We hypothesize that here the negative effect of increasing sample size stems from the interrelation between sample size and refinement strength. In particular, for EGEM and PCA-EGEM, using fewer instances generally means less coverage of the feature space, which leads to more zero or near-zero coefficients in the pruning procedure (cf. Eq. 5). Hence, for EGEM and PCA-EGEM, larger sample sizes potentially lead to weaker refinement, which can be similar in effect to a decrease in slack.
Since clean-data accuracy can be evaluated on held-out data, it is notable that applying PCA-EGEM results in fairly predictable 100%-poisoned performance across a wide range of sample sizes, i.e. the spread between clean-data and poisoned-data accuracy is small, as demonstrated by the relatively small shaded area in Fig. 5.

Use Case on CelebA: Reducing Bias
In this section, we take a closer look at the effect of applying PCA-EGEM to a model trained on the CelebA dataset [50]. In contrast to the previous experiments, we do not evaluate with respect to a specific known CH feature, but rather conduct the analysis in an exploratory manner, uncovering subpopulations for which a learned CH behavior leads to biased classifications. In practice, such an analysis could be done in hindsight, e.g. when PCA-EGEM has been applied before deployment and its effect is later evaluated on new samples collected during deployment.
The CelebA dataset contains 202,599 portrait images of celebrities, each associated with 40 binary attributes. The existence of spurious correlations in CelebA has been documented previously [69,91,40,68], and Supplementary Note C.2 shows that the attributes in the training set are correlated to various degrees. We train a convolutional neural network (details in Supplementary Note E) on the 'train' split of CelebA using cross-entropy loss on a 'blond hair'-vs-not classification task. The training data is stratified, and we achieve a binary test accuracy of 93%, comparable to accuracies reported in other works, e.g. Sagawa et al. [68]. We regard this classifier as a model given to the user by a third party.
In the following, we assume a scenario where the user seeks to use the third-party classifier to retrieve blond people from a set of images available during deployment. They wish this retrieval process to be accurate and not biased against subgroups of the population. To analyze the impact of applying PCA-EGEM on such a retrieval task, we simulate a validation set in which the user has a limited subset of 'clean' examples, specifically 200 examples of each class, that are correctly predicted by the model and whose explanations highlight the actual blond hair, as determined by LRP scores falling dominantly within the area of the image where the hair is located (see Supplementary Note F). These explanations (considered by the user to be all valid) are then fed to PCA-EGEM in order to produce a model that is more robust to potential unobserved Clever Hans effects. As in the previous experiments, we use 5% slack, which here translates to α = 0.01.
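The criterion of LRP scores falling dominantly within the hair region can be sketched as a simple mask test. This is our own illustrative reading of the criterion (the function name, the positive-relevance restriction, and the 0.5 threshold are assumptions, not taken from Supplementary Note F):

```python
import numpy as np

def is_clean_explanation(relevance, hair_mask, min_fraction=0.5):
    """Accept an example if its positive relevance falls dominantly inside
    the hair region, given as a binary mask of the same shape."""
    pos = np.clip(relevance, 0.0, None)  # keep positive evidence only
    total = pos.sum()
    if total == 0:
        return False  # no positive evidence at all: cannot validate
    return pos[hair_mask].sum() / total >= min_fraction
```

An example would then be admitted into the refinement set only if it is correctly classified and `is_clean_explanation` returns `True` for its LRP heatmap.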

PCA-EGEM Reduces Exposure to Shirt Collars
After the model is deployed, the decision strategies of the original and refined models can be reexamined in light of the newly available data. Fig. 6 shows explanations for some retrieved images, specifically the evidence for them being predicted as blond. We observe that pixels displaying hair are considered relevant and remain so after refinement.
In contrast, one can identify a significant change of strategy between the original and refined models in the lower part of the image: the original model appears to make heavy use of shirt and suit collars as a feature inhibiting the detection of blond hair, whereas this inhibiting effect is much milder in the refined model. This observation suggests that PCA-EGEM has effectively mitigated a previously unobserved Clever Hans strategy present in the original model and, as a result, aided the retrieval of images containing collars.

PCA-EGEM Balances Recall Across Subgroups
We now analyze the implications of the Clever Hans effect reduction by PCA-EGEM for specific subgroups, specifically whether certain subgroups benefit from the model refinement in terms of recall of members with the attribute 'blond'.
To this end, for every attribute in the dataset, we randomly sample a subset of 5000 images from the test data that contains only samples exhibiting this attribute. If fewer images are available for some attribute, we use all available samples. We evaluate the classifier with and without the application of PCA-EGEM on each of these subgroups.
Figure 7 shows recall scores for each subgroup before and after application of PCA-EGEM. We observe a substantial increase in recall on low-recall subgroups, such as 'Wearing Necktie', 'Goatee', and 'Male'. Most high-recall groups see only minuscule negative effects. Overall, while having almost no effect on dataset-wide recall, the application of PCA-EGEM rebalances recall in favor of under-recalled subsets. Our investigation thus demonstrates that a model bias responsible for under-detecting blond hair in these subgroups has been mitigated by applying PCA-EGEM, leading to a set of retrieved images that is more representative of the different subgroups and more diverse.
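The per-subgroup evaluation amounts to computing recall twice on each attribute-conditioned subset. A minimal sketch (our own helper names; `groups` maps an attribute to parallel lists of labels and of the two models' predictions):

```python
def recall(y_true, y_pred):
    """Fraction of positive instances that are retrieved."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
    pos = sum(y_true)
    return tp / pos if pos else 0.0

def compare_recall(groups):
    """groups: attribute -> (labels, preds_original, preds_refined).
    Returns per-attribute (recall before, recall after) pairs."""
    return {attr: (recall(y, p0), recall(y, p1))
            for attr, (y, p0, p1) in groups.items()}
```

A rebalancing effect like the one in Figure 7 would show up as the second entry of each pair rising on under-recalled attributes while staying roughly constant elsewhere.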
It is of theoretical interest to ask whether such a rebalancing effect would generalize to other scenarios. One argument is that the underrepresentation of certain subgroups in the retrieved set is mainly caused by subgroups with a low prevalence of the class of interest being actively suppressed by the model in order to optimize its accuracy. In practice, such suppression can be achieved by identifying features specific to the subgroup and, although they are causally unrelated to the task, making these features contribute negatively to the output score. Our PCA-EGEM technique, by removing such task-irrelevant Clever Hans features, redresses the decision function in favor of these low-prevalence subgroups, thereby leading to a more balanced set of retrieved instances.
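This suppression mechanism can be made concrete with a two-feature toy scorer. The example is entirely illustrative (the weights, feature values, and threshold are made up): a subgroup-specific feature with a negative weight pushes members of that subgroup below the retrieval threshold, and zeroing it restores their retrieval:

```python
import numpy as np

# Toy linear scorer over two features: [hair evidence, collar evidence].
# The original model gives the subgroup-specific collar feature a negative
# weight, suppressing the 'blond' score for that subgroup.
w_original = np.array([1.0, -0.8])
w_refined = np.array([1.0, 0.0])  # collar exposure removed by refinement

def retrieved(w, x, threshold=0.5):
    """Return True if the example's score clears the retrieval threshold."""
    return float(w @ x) > threshold

blond_no_collar = np.array([0.6, 0.0])
blond_with_collar = np.array([0.6, 1.0])
```

Under the original weights, the blond person with a collar scores 0.6 - 0.8 = -0.2 and is suppressed; after removing the collar weight, both examples score 0.6 and are retrieved, mirroring the recall rebalancing described above.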
Two outliers to the overall rebalancing effect can however be noted in Fig. 7: 'Wearing Hat' and 'Blurry'. Interestingly, these are two subgroups in which the feature of interest (the hair) is occluded or made less visible. In other words, in these two subgroups, only weakly correlated features are available for detection, and their removal by PCA-EGEM consequently reduces recall. An underlying assumption behind the rebalancing effect is therefore that the true features are detectable in the input image without resorting to weakly or spuriously correlated features.
Overall, our CelebA use case demonstrates that PCA-EGEM can be useful beyond raising accuracy on disadvantageous test-set distributions. Specifically, we have shown that our PCA-EGEM approach enables the retrieval of a more diverse set of positive instances from a large heterogeneous dataset.

Open Questions
We have demonstrated the efficacy of the proposed methods for mitigating Clever Hans effects in Sections 4.3 and 5; however, it can also be observed that 1) a complete removal of the model's response to the spurious (CH) feature is usually not achieved, and 2) classification accuracy on clean data may suffer. We suggest that there are multiple reasons for these undesired effects.
Firstly, in deep neural networks, CH features are generally not neatly disentangled from generalizing features. This means that either entangled well-generalizing features may suffer from pruning, reducing clean-data accuracy, or CH features may escape pruning because they are entangled with a feature present in the clean dataset, which would inhibit robustification against the CH feature. While the PCA-EGEM extension we have proposed achieves some basic form of disentanglement, more refined disentanglement methods, based for example on finding independent components, could be considered in future work.
A second open question is posed by the fact that the number of examples for which one collects explanatory feedback is limited. Thus, not all generalizing features may be present in the refinement data, leading to these features being pruned away. Methods to draw more extensively from the user's explanatory feedback (e.g. rendering explanations in a more intuitive way so that more examples can be inspected more precisely, or presenting to the user in priority the examples considered most informative) should be the focus of further investigation.
As pointed out in previous work [40], removing spurious (CH) features can hurt performance on data where the spurious correlation holds. Thus, un-biasing methods such as the refinement approaches introduced in this paper should be used with the understanding that they may hurt classification accuracy on biased data.
We also point out that our technique relies on the refinement set not containing CH features. This may naturally be satisfied in many cases where the samples for refinement originate from a different data source than the training data. More generally, our methods rely on post-hoc attribution methods to validate that the given samples are clean. While concerns have been raised about the effectiveness of attribution methods in some scenarios [1], these concerns could be addressed in future work by moving beyond pixel-wise attribution for explanation validation, e.g. using concept-based or counterfactual explanations [82,87,24].

Conclusion
Sensitivity to distribution shifts, such as those induced by spurious correlations (so-called Clever Hans effects), has long been an Achilles heel of machine learning approaches such as deep learning. The problem becomes even more pronounced with the increasing adoption of foundation models, whose training data may not be public and is thus closed to scrutiny. Explanation techniques have the potential to uncover such deficiencies by putting a human in the loop [48,45,56,71]. Previous work in XAI has mainly focused on improving explanations or on fixing flaws in the model that the user has identified from such explanations. In contrast, we have considered the under-explored case where the human and the explanation agree, but where the model may remain sensitive to unobserved spurious features. While recent work has shown that XAI-based validation techniques may fail to detect some of the Clever Hans strategies employed by a model [1], we have argued that one can nevertheless reduce the model's exposure to some of these hidden strategies, and demonstrated this via our contributed Explanation-Guided Exposure Minimization approach.
Our approach, while formulated as an optimization problem, reduces to simple pruning rules applied in intermediate layers, making our method easily applicable, without retraining, to complex deep neural network models such as those used in computer vision. Our method systematically improved prediction performance on a variety of complex classification problems, outperforming existing and contributed baselines.
Concluding this paper, we would like to emphasize the novelty of our approach, which constitutes an early attempt to leverage correct explanations for producing refined ML models and tackles the realistic scenario where Clever Hans features are not accessible. We believe that in future work the utility derived from explanations via refinement can be expanded further, e.g. by letting the user specify what is correct and what is incorrect in an explanation so that the two components can be treated separately, or by identifying the sets of examples that are most useful to present to the user for model refinement, for example by ensuring that they cover the feature space adequately or by active learning schemes.
Derivation of the EGEM Closed-Form Solution

In this section we derive a closed-form solution for the EGEM method, which we stated as Eq. (11) in Section 3 of the main paper, for the purpose of solving the EGEM objective. The objective consists of two terms, with expectations computed over the available refinement data: the first term ensures the reproduction of the model output on the refinement data; the second term, weighted by the regularization parameter λ, penalizes overall model exposure, i.e. it forces the model not to be too complex. Substituting the exposure terms into the objective, we observe that each term of the resulting sum depends on its own parameter w_ij; hence, each term can be minimized separately. Considering one such term, we compute its gradient and find where it is zero. Our derivation uses the fact that ρ(w_ij) and its derivative do not depend on the data and can therefore be taken out of the expectation, which yields Eq. (11).

For the linear case, where the teacher and student models are given by f(x, w_old) = w_old^T x and f(x, w) = w^T x respectively, we get the closed-form solution

w = (Σ + λI)^{-1} Σ w_old,  where Σ = E[xx^T].

This equation resembles the ridge regression solution, the difference being that the cross-covariance between data and targets, E[xy], appearing in the original model is replaced by the term Σ w_old. This term realigns the pretrained model's weights along the refinement data. The realignment desensitizes the model to directions in feature space that are not expressed in the available data and that the user could not verify, giving some level of immunity against a possible CH effect in the classifier.

For neural networks with a final linear projection layer, response-guided exposure minimization (RGEM) can be applied to this last layer. We can then rewrite Eq. (25) as ridge regression on the predictions f(X, w_old), where f(X, w_old) is the vector of outputs of the original model on the refinement data.
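The linear closed-form solution can be verified numerically. In this sketch (our own construction, not from the paper's code), the second feature direction is never expressed in the refinement data, playing the role of a hidden CH direction that the user could not verify:

```python
import numpy as np

rng = np.random.default_rng(0)

# Refinement data: a feature the user can verify (column 0) and a feature
# that never varies in the available data (column 1, a hypothetical hidden
# CH direction).
X = np.column_stack([rng.normal(size=200), np.zeros(200)])

w_old = np.array([1.0, 2.0])   # original model relies on both features
Sigma = X.T @ X / len(X)       # empirical second-moment matrix E[xx^T]
lam = 0.1

# Closed-form EGEM solution for the linear case:
# w = (Sigma + lam * I)^{-1} Sigma w_old
w_new = np.linalg.solve(Sigma + lam * np.eye(2), Sigma @ w_old)
```

The weight on the unexpressed direction is driven exactly to zero, while the weight on the verified direction is only mildly shrunk by the regularization, illustrating the desensitization described above.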

Figure 1: Cartoon comparison of a naive XAI-based validation/deployment pipeline and our proposed approach incorporating an additional exposure minimization step. Left: A third party trains a flawed (Clever Hans) model which exploits a spurious correlation in the data (images of the horse class have a copyright tag in the bottom-left corner). Middle: The user receives the model from the third party. Because the user has limited data (in particular, no data with copyright tag), the flaw of the model cannot be easily detected with XAI methods and the model appears to be both accurate and right for the right reasons. Right: Because of the undetected flaw, a naive deployment of the model is likely to result in prediction errors on new data (e.g. cats with copyright tags predicted to be horses). Our proposed approach preemptively reduces exposure to unseen features (the copyright tag), thereby avoiding these incorrect predictions.

Figure 2: Top: Cartoon depicting the removal of unseen Clever Hans strategies via our proposed exposure minimization approaches (EGEM and PCA-EGEM). The refined model retains only the dependence on a2 (a neuron detecting the actual horse) and removes its reliance on a3 (a neuron responsive to spurious copyright tags). Bottom: Qualitative behavior of PCA-EGEM on ML models trained on real datasets. The models produced by our approach become robust to spurious features not seen at inspection time but occurring at deployment time. (Pixel-wise explanations are computed using the zennit package [2].)

Figure 3: Accuracy for 0% (lighter shade) and uniform 100% (darker shade) poisoning with the spurious feature. The last four bars for each method refer to the binary tasks constructed from the ImageNet dataset. Solid lines show average 100%-poisoned accuracy and dashed lines show average clean-data accuracy over all datasets. The results shown are the mean accuracy and standard deviation over five runs, obtained using 700 refinement samples per class on the respective test sets.

Figure 4: Accuracy under variations of the slack parameter. Higher slack means a higher refinement-data loss is accepted when selecting the refinement hyper-parameter. The dotted line indicates clean-data accuracy, the solid line 100%-poisoned-data accuracy. Mean and standard deviation are computed over 5 runs.

Figure 5: Effect of the number of instances used for refinement. The dotted line indicates clean-data accuracy, the solid line 100%-poisoned-data accuracy. Mean and standard deviation are computed over 5 runs.

Figure 6: Test set images that exhibit strong changes in the detection of blond hair, with corresponding LRP explanations before and after refinement. Red indicates positive and blue negative contribution to the detection of blond hair. Shirt collars and similar features appear to inhibit the prediction of blond hair in the original model but less so in the refined one.

Figure 7: Comparison of recall of the pretrained model before (Original) and after refinement (PCA-EGEM). Recall is calculated on subsets of CelebA containing only samples exhibiting the attribute on the x-axis. 'All' is sampled from the whole test set.

Figure 1: Correlation matrix of attributes in the CelebA training data.

Figure 4: Accuracy for uniform 0% (lighter shade) and 100% (darker shade) poisoning with the spurious artifact under variations of the slack parameter. Solid lines show average 100%-poisoned accuracy and dashed lines show average clean-data accuracy over all datasets. The results shown are the mean accuracy and standard deviation over five runs, obtained using 700 refinement samples per class on the respective test sets.

Figure 5: Accuracy under variations of the number of available samples for each refinement method.

Figure 6: Comparison of precision and recall of the pretrained model before (Original) and after refinement (PCA-EGEM). The metrics are calculated on subsets of CelebA containing only samples exhibiting the attribute on the x-axis. 'All' is sampled from the whole test set.

Figure 7: Examples of images drawn from the CelebA dataset before and after occlusion.

Table 1: Overview of datasets, classification problems (poisoned class in bold, number of classes in brackets), and spurious features (CH).