Finding and Removing Clever Hans: Using Explanation Methods to Debug and Improve Deep Models

Contemporary learning models for computer vision are typically trained on very large (benchmark) datasets with millions of samples. These may, however, contain biases, artifacts, or errors that have gone unnoticed and are exploitable by the model. In the worst case, the trained model does not learn a valid and generalizable strategy to solve the problem it was trained for, and becomes a ‘Clever-Hans’ predictor that bases its decisions on spurious correlations in the training data, potentially yielding an unrepresentative or unfair, and possibly even hazardous predictor. In this paper, we contribute by providing a comprehensive analysis framework based on a scalable statistical analysis of attributions from explanation methods for large data corpora. Based on a recent technique – Spectral Relevance Analysis – we propose the following technical contributions and resulting findings: (a) a scalable quantification of artifactual and poisoned classes where the machine learning models under study exhibit Clever-Hans behavior, (b) several approaches we collectively denote as Class Artifact Compensation, which are able to effectively and significantly reduce a model’s Clever Hans behavior. I.e., we are able to un-Hans models trained on (poisoned) datasets, such as the popular ImageNet data corpus. We demonstrate that Class Artifact Compensation, defined in a simple theoretical framework, may be implemented as part of a Neural Network’s training or fine-tuning process, or in a post-hoc manner by injecting additional layers, preventing any further propagation of undesired Clever Hans features, into the network architecture. Using our proposed methods, we provide qualitative and quantitative analyses of the biases and artifacts in, e.g., the ImageNet dataset, the Adience benchmark dataset of unfiltered faces and the ISIC 2019 skin lesion analysis dataset. We demonstrate that these insights can give rise to improved, more representative and fairer models operating on implicitly cleaned data corpora.


Introduction
Throughout the last decade, Deep Neural Networks (DNNs) have enabled impressive performance leaps in a wide range of domains, from solving classification problems [1,2], over playing and winning games competitively [3,4] (some in real time [5,6]), to enabling the understanding of quantum-chemical many-body systems [7] and finding improved solutions to the notoriously difficult task of protein structure prediction [8]. These models are typically (pre-)trained on very large datasets, e.g., ImageNet [9], with millions of samples. Recently, it was discovered that biases, spurious correlations, as well as errors in the training dataset [10] may have a detrimental effect on the training and/or result in 'Clever-Hans' predictors [11,12], which only superficially solve the task they have been trained for, leading to potentially unfair and hazardous model behavior. Unfortunately, due to the immense size of today's datasets, a direct manual inspection and removal of artifactual samples can be regarded as hopeless. Analyzing the biases and artifacts in the model instead may provide insights about the training data indirectly. This, however, requires inspecting the learning models beyond treating them as black boxes.
Only recently methods of eXplainable Artificial Intelligence (XAI) (cf. [13,14] for an overview) were developed. They provide deeper insights into how a Machine Learning (ML) classifier arrives at its decisions and potentially help to unmask Clever-Hans predictors. XAI methods can be roughly categorized into two groups: methods providing local (e.g. [15][16][17][18][19][20][21][22][23]) explanations and those providing global (e.g. [24][25][26][27]) explanations [28]. Current approaches are of limited use when scaling the search for biases, spurious correlations, and errors in the training dataset, as this would require intense 'semantic' human labor. A recent technique, the Spectral Relevance Analysis (SpRAy) [12], aims to bridge the gap between local and global XAI approaches by introducing automation into the analysis of large sets of local explanations. The method however still involves a considerable amount of manual analyses, especially in context of contemporary datasets with high numbers of classes and samples such as ImageNet [9].
One of the main goals of ML is to learn accurate decision systems to automate tasks that otherwise may only be solved manually. As such, specific inference behavior on the available data often is expected from the learned models, e.g., within well-defined expert domains. As a recent body of research however has demonstrated, deviations from the anticipated are very likely (and must be expected to) appear in practice. In our paper, we propose a series of methods constituting a pipeline for the identification, description and suppression of those deviations in model inference, i.e., a set of tools to bring the model "back on track": We introduce a novel framework we collectively denote as Class Artifact Compensation (ClArC) to enable (a) large-scale analyses of a model's inference behavior on datasets with hundreds of classes and millions of samples for a semi-automated discovery of undesirable Clever-Hans effects that are embedded into data and model; here we rely on an extension of SpRAy, which increases the automation potential on such large datasets. (b) In addition, we provide an intuition for Clever Hans artifacts and the desensitization of a trained model to their influence. In this manner, ClArC provides (c) a well-controlled quantitative strategy to detect (Figure 1 (Ic)), model and validate (Figure 1 (II)), and consequently remove the influence of such artifacts from the model (Figure 1 (III)). We showcase the steps of our approach on a modified MNIST [29,30] dataset with color-based Clever Hans (CH) information, the ImageNet [1] dataset, the challenging Adience [31] benchmark dataset of unfiltered faces and the ISIC 2019 [32][33][34] skin lesion analysis dataset, and discuss the intricacies of (informed intervention in) the decision-making of end-to-end learned predictors. These extensive analyses allow interesting findings that are illuminating beyond our specific technical approach.  
Figure 1: The workflow of our Class Artifact Compensation framework. (I) We first aim to identify spurious confounders in the data as learned by the model. (Ia) A direct analysis of the training data is infeasible due to the missing correspondence to the features used by the model during inference. (Ib) Explanations from local XAI methods may provide this information. However, manual analysis requires the evaluation of extreme amounts of explanations (per class), which does not scale well to large datasets and is labor intensive. (Ic) We therefore propose an automation of this process, based on an extension of the SpRAy [12] algorithm. (Id) While the application of globally operating XAI techniques is disqualified in the identification phase, as the artifacts to be evaluated must be known or expected prior to analysis, (II) these techniques find application in the modeling of an artifact estimator in our approach: (IIa) While an artifact model can be built explicitly after identification, e.g. from expert domain knowledge, (IIb) it can also be learned from representative data, e.g. as CAVs. (III) With a known model of the artifact at the layer of its most distinct representation within the DNN, one can attempt to remove its influence on the network. To this end, we present the following two approaches: P-ClArC aims at the selective deactivation of the artifact signal, and, as a largely training-free approach, leaves the remainder of the model unaltered. A-ClArC on the other hand strategically augments the training data (of all classes) with the artifact signal in order to minimize its class-specific informative value, to force the model to adapt to other (benign) features in continued training.

Related Work
There is an increased awareness that ML models need to be interpretable to their users in order to assess the validity of the decision making of the predictor [35,36], especially in high risk settings, such as in medical applications [37][38][39][40][41]. Transparency in model predictions could point at anomalous or blundering decision behavior before harm is caused in a later usage as a diagnostic tool. Consequently, numerous approaches to understand aspects of state-of-the-art Artificial Intelligence (AI) predictors have been developed in recent years (cf. [14] for an overview) in the emerging field of eXplainable Artificial Intelligence (XAI). In the following paragraphs, we discuss related work and introduce the terminology from the field of XAI that is important to this paper.
The Clever Hans Effect Clever Hans (CH) was a horse from Berlin, Germany, that allegedly was able to do math, a media sensation of the early 1900s. In 1907, it was discovered that Hans would read the examiner's body language instead of performing arithmetic, and in this manner give the right answer, but for the wrong reason [11]. "Clever Hans Strategies" or "Clever Hans Effects" for ML predictors [12,42] are accordingly named in homage to this infamous horse, and describe decision making that is learned and executed based on biases and spurious correlations in the training data, instead of valid (i.e., intended or expected) features and relations.
As such, there is a notable distinction to make between CH artifacts, Backdoor (BD) Attacks [43,44] and attacks based on Adversarial Examples [45]. Adversarial attacks are specifically generated for individual data points in order to cause a misprediction, and are as a consequence ineffective when used on other samples. BD Attacks and CH artifacts on the other hand are systematically learned and exploited by the model. BDs are generally injected with (malicious) intent during training, into samples of multiple classes via added "trigger patterns" (e.g. a gray pixel at a specific location) while overriding the targeted samples' true training labels [43,46]. BDs are usually no longer part of the original training data once training is finished. CH type artifacts, however, are "naturally occurring" phenomena in the training data corpus correlating with only single (or few) ground truth labels, providing shortcuts around more complex connections in the training data [47]. In contrast to Backdoor Attacks, which, if present, cause the model to override its prediction making on valid features, CH artifacts almost always appear alongside benign indicators for a class, and thus exert a significantly weaker influence on the model. Further, the decision whether a characteristic in the data is indeed a CH, or merely a benign feature, often is subject to the expectation of the model's behavior and expert domain knowledge [12,41,48]. They are consequently, in addition to their unexpected nature, more difficult to detect, as experimentally highlighted in Section 3.1. Unlike BDs, CH artifacts are part of the features in some of the original training samples, and may thus be identified during a joint analysis of the available data and the model's utilization of it, as described throughout this paper. The particular difference between datasets with CH artifacts and datasets with BDs is illustrated in Figure 2.
In literature, numerous CH strategies have been identified and collected, e.g., with the help of techniques from XAI, in a surprising number of current and former state-of-the-art ML models, in part invalidating their reported (benchmark) performance as a measure of generalization capability [10,12,41,47,48,49,50,51].

Figure 2: Difference between datasets with Clever Hans (left) and Backdoor (right) artifacts visualized for colored MNIST. The artifact feature that separates afflicted samples (red frame) from unaffected ones (yellow frame) here is for both types of artifacts the color blue (different from the standard color white). In the case of CH artifacts, the artifact feature will only ever appear in samples alongside features for a single class. For BD attacks, the artifact feature appears in samples among features for all (other) classes except the target class, making the artifact the only discriminative feature in affected samples distinctive for its target class.
features, concepts or data transformations by systematically evaluating the model's reaction to varying exposure thereto, using (larger) sets of real or artificially generated samples [24][25][26][27]. Other approaches aim at understanding predictors by identifying important neurons and their interactions [57], and visualizing learned feature encodings by e.g. synthesizing preferred inputs to hidden filters within a neural network model, e.g. [58][59][60][61].
Bridging the gap Both the local and global approaches to XAI suffer from a (human) investigator bias during analysis and thus are on their own of only limited use for searching and exploring for biases, spurious correlations and errors learned by the model from the training data. Global methods can only measure the impact of predetermined, expected or a priori known features or effects (cf. [26,27]), which limits their applicability when aiming for the discovery of yet unknown behavioral facets of a model. Local methods, on the other hand, have the potential to provide much more detailed information per sample, but the task of compiling information about model behavior over thousands (or even millions) of samples and explanations is tiring and laborious for a human investigator: the success of such an analysis depends on the examiner's keen perception and domain knowledge, limiting the potential for knowledge discovery about model behavior.
A recent technique, the Spectral Relevance Analysis (SpRAy) [12], aims at bridging the gap between local and global XAI approaches by introducing automation into the analysis of large sets of local explanations. SpRAy has been applied in a recent set of works, e.g., [12,48], which however mainly operate on smaller datasets, each containing only hundreds of samples. The procedure described in [12], however, still involves a considerable amount of manual analysis, especially in the context of contemporary datasets with high numbers of classes and samples, such as ImageNet [9]. In our work, we purposefully extend the SpRAy technique and bring it to scale for robustly analyzing extensive datasets, in Section 2.4.
Feature unlearning The awareness of CH predictors has invigorated research with the intent to improve models by unlearning unwanted inference patterns. The most naive approach to unlearn a concept that can be found in a subset of samples in the training set is to remove those samples altogether, and to retrain the model from scratch on the reduced training set. While this approach is straightforward and easy to implement, it comes at the cost of also removing desirable features the model could benefit from, along with the characteristics in the data deemed problematic. This may be especially harmful if only few training data are available to begin with. Furthermore, in some cases the initial model training may have been extremely costly, and an approach to fine-tune the model instead would be more desirable.
Several approaches have thus been developed to unlearn unwanted predictive behavior from existing models [62][63][64] or to guide the model during training by providing information about the expected explanations [48,62,65]. eXplanatory Interactive Learning (XIL) [48,63] presents local explanations to a human observer during training, who in turn provides feedback to the model by replicating samples affected by CH phenomena and replacing the contained artifactual features with noise or otherwise generated patterns. The work of Kim et al. [64] introduces a model regularization scheme, in which an additional "artifact detector" learning specific biasing features is attached to the original predictor. The original model is then driven to minimize the shared information with the dedicated bias predictor, and thus to unlearn to use artifactual features for inference. Ross et al. [65] aim to guide the model towards the correct behavior by penalizing high attribution scores in undesired regions by extending the optimization function with a "Right for the Right Reasons (RRR)" loss term. Similarly, Rieger et al. [62] propose Contextual Decomposition Explanation Penalization (CDEP), a method for regularizing model behavior based on explanations obtained from Contextual Decomposition (CD) [66], by complementing the classification error of the loss function with an explanation error term. Recent work, however, has shown that models can be manipulated in such a way that the produced attribution maps may be arbitrary, while the prediction of the model is unchanged [67]. Consequently, there is in general no guarantee that unlearning approaches based on extensions of the loss function effectively correct the model's use of the input features.

Spectral Signature
For the detection of BD-type artifacts used by DNNs, Tran et al. [44] propose the Spectral Signature (SpeSig) method. Given some dataset $X$ that is poisoned with a BD and a model $f$ trained on this data, let $X_y = \{x_1, \dots, x_n\}$ be the subset of samples corresponding to a target label $y$. Tran et al. [44] apply the following method separately for all $y$ in the dataset, since the aim is to identify all (previously unknown) BD samples within $X$: For each sample $x_i$, the model $f$ provides a feature representation $a(x_i)$. From these representations, one computes the covariance matrix
$$M = \frac{1}{n} \sum_{i=1}^{n} \left(a(x_i) - \hat{a}\right)\left(a(x_i) - \hat{a}\right)^\top$$
with the mean representation $\hat{a} = \frac{1}{n} \sum_{i=1}^{n} a(x_i)$ and $n = |X_y|$. For each sample, an outlier score $\tau_i$ is then computed using the top right singular vector $v$ of $M$:
$$\tau_i = \left(\left(a(x_i) - \hat{a}\right)^\top v\right)^2$$
Samples with a high $\tau_i$ are more likely to be outliers, allowing for the $k$ samples with the largest $\tau_i$ to be detected as poisoned. Note that since SpeSig detects outliers w.r.t. samples of one class label $y$, the found BDs are usually images that originally belonged to other classes, and thus do not fit into the manifold of $X_y$. More concisely, SpeSig does not detect the poisoned artifact itself, but the "odd" samples within $X_y$. Tran et al. [44] then propose to remove the detected outliers and retrain the model, thereby defending against the BD attack. In Section 3.1, we apply the SpeSig method not only to identify BDs, but also to a dataset containing CH artifacts in order to assess their conceptual differences.
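The outlier scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of Tran et al. [44]: the function names and the assumption that the feature representations $a(x_i)$ of one class are already available as an `(n, d)` array are ours.

```python
import numpy as np

def spectral_signature_scores(features):
    """Outlier scores tau_i for the samples of one class.

    `features` is assumed to be an (n, d) array whose rows are the
    representations a(x_i) of all samples with the same label y.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # The top right singular vector of the centered feature matrix is
    # also the leading eigenvector of the covariance matrix M.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]
    return (centered @ v) ** 2

def flag_outliers(features, k):
    """Indices of the k samples with the largest outlier score."""
    scores = spectral_signature_scores(features)
    return np.argsort(scores)[-k:]
```

In the SpeSig defense, the flagged samples would be removed before retraining; the choice of `k` is left to the user.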

Concept Activation Vectors
Kim et al. [26] introduce CAVs as a means to provide an interpretation of a DNN's internal state in terms of human-understandable concepts. Given two sets of samples $X^+$ and $X^-$, where the samples in $X^+$ all exhibit a specific property $c$ (e.g. $X^+$ contains images showing striped objects) which is not present in $X^-$, a CAV is trained as a linear classifier separating the hidden representations of the samples from $X^+$ and $X^-$ at some layer $l$ within the DNN. The weight vector $v^l_c$ learned in this way then represents the direction in latent space encoding the concept $c$ unique to $X^+$.
Kim et al. [26] use CAVs as directional derivatives in order to test the sensitivities of neural network models w.r.t. a priori known concepts. We apply CAVs in two ways throughout our paper. Similar to [26], we use CAVs as a means to verify the sensitivity of the model to the CH artifacts, e.g., those identified via SpRAy, as shown for example in Section 4.2. Further, we use CAV directions specific to CH effects in the context of the ClArC unlearning framework, as a means to remove specific behavioral facets from the DNN's inference process.
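To make the construction of a CAV concrete, the following sketch fits a linear classifier on precomputed layer activations and returns its normalized weight vector. It is only an illustration under assumptions: logistic regression trained with plain gradient descent stands in for whichever linear classifier is used in practice, and the function name and hyperparameters are hypothetical.

```python
import numpy as np

def fit_cav(acts_pos, acts_neg, lr=0.1, steps=500):
    """Fit a linear classifier separating activations of X+ from X-.

    acts_pos, acts_neg: arrays of shape (n_pos, d) and (n_neg, d) holding
    the hidden representations at one layer l. Returns the unit-norm
    weight vector, i.e. the concept direction v_c^l.
    """
    x = np.vstack([acts_pos, acts_neg])
    t = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(concept present)
        grad = p - t                             # gradient of the log-loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(t)
        b -= lr * grad.mean()
    return w / np.linalg.norm(w)
```

The returned direction can then be compared against gradients of the model output at the same layer to test sensitivity, in the spirit of [26].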

Layer-wise Relevance Propagation
Layer-wise Relevance Propagation (LRP) [16] is a local XAI approach that iterates in reverse over the layered structure of a neural network to produce an explanation. Consider the neural network
$$f(x) = f_L \circ \dots \circ f_1(x).$$
In a forward pass, activations are computed at each layer of the neural network. The activation score in the output layer forms the prediction, which is then backpropagated and redistributed, layer by layer, until the input is reached. The redistribution process follows a conservation principle analogous to Kirchhoff's laws in electrical circuits, i.e. all relevance assigned to any neuron during the process of backpropagation will be further distributed towards its inputs in the layer below without loss.
Various propagation rules have been proposed in literature [16,68,69]. For example, the LRP-γ rule [68] is defined as
$$R_{j \leftarrow k} = \frac{a_j \left(w_{jk} + \gamma w^+_{jk}\right)}{\sum_{j'} a_{j'} \left(w_{j'k} + \gamma w^+_{j'k}\right)} R_k, \qquad (4)$$
where $a_j$ is the layer's input activation at the $j$-th neuron, $w_{jk}$ are the learned parameters mapping the $j$-th input activation to the $k$-th layer output, and $w^+_{jk} = \max(0, w_{jk})$ is the positive part of the learned weights. The variable $\gamma \geq 0$ is a free parameter to tune the decomposition rule. Equation (4) redistributes $R_k$ based on the contribution of lower-layer neurons to the given neuron activation, with a preference for positive contributions over negative contributions. This makes it particularly robust and suitable for the lower-layer convolutions.
Other propagation rules such as LRP-ε, LRP-αβ or LRP-$z^{\mathcal{B}}$ are suitable for other application scenarios and layer types [68,69] and have been shown to work well in practice [70].
After the step of relevance decomposition, lower-layer neuron relevance is aggregated from incoming relevance messages as $R_j = \sum_k R_{j \leftarrow k}$. For a technical overview of LRP including a discussion of the various propagation rules and further recent heuristics, see [68]. In all our experiments, we compute LRP attribution scores using LRP-ε (near the model output), LRP-γ (in intermediate layers) and LRP-$z^{\mathcal{B}}$ (near the input), as described in [71].
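As an illustration of the LRP-γ rule and the subsequent aggregation of relevance messages, the sketch below propagates relevance through a single dense layer. It is a simplified example under assumptions of ours: the layer has no bias term, and a small stabilizer `eps` (not part of the rule as stated above) guards against division by zero.

```python
import numpy as np

def lrp_gamma_dense(a, w, r_out, gamma=0.25, eps=1e-9):
    """LRP-gamma for one bias-free dense layer.

    a:     (J,) input activations a_j
    w:     (J, K) weights w_jk
    r_out: (K,) relevance R_k of the layer outputs
    Returns the (J,) input relevance R_j = sum_k R_{j<-k}.
    """
    w_mod = w + gamma * np.maximum(w, 0.0)            # w_jk + gamma * w_jk^+
    z = a @ w_mod                                      # denominators, one per output k
    z = z + eps * np.where(z >= 0, 1.0, -1.0)          # stabilizer (assumption)
    messages = (a[:, None] * w_mod) * (r_out / z)[None, :]  # R_{j<-k}
    return messages.sum(axis=1)
```

Up to the stabilizer, the returned relevance sums to the relevance that entered the layer, which reflects the conservation principle described above.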

Spectral Relevance Analysis
Spectral Relevance Analysis (SpRAy) [12] is a meta-analysis tool for finding patterns in model behavior, given sets of instance-based explanatory attribution maps. The SpRAy algorithm has its core in Spectral Clustering (SC) [72,73] and -via the use of attribution maps as input -enables the analysis of the input data from the model's perspective for finding (hidden) characteristics of specific classes, which however are exploited by the model.
The SpRAy algorithm, as introduced in [12], initializes by computing the sparse affinity structure over the input attribution maps, considering all pair-wise similarities between the given samples. A (normalized, symmetric and) positive semi-definite graph Laplacian $L_{\mathrm{sym}}$ [12,74] is then computed from the affinity matrix $A$, and provided as input to SC (cf. [74]). As output, SpRAy yields a spectral embedding $\Phi$ of the input attributions and the corresponding spectrum of eigenvalues $\Lambda = \{\lambda_i\}_{i=1 \dots q}$. Lapuschkin et al. [12] follow [74] and (manually) read the structure (i.e. number and nesting) of clusters from the eigenvalue spectrum $\Lambda$, via the spectral gap or eigengap [74], e.g., for ranking a set of analyzed classes w.r.t. their potential for exhibiting CH phenomena [12]. For further visual analysis, the affinity matrix $A$ is then used together with a suitable number of cluster labels inferred from $\Lambda$ as a basis for an embedding into $\mathbb{R}^2$, e.g., by using t-SNE [75]. Figure 3 provides an overview of the procedure outlined above, where arrows and symbols in black describe the workflow of SpRAy from [12], and arrows and symbols in red distinguish our own extensions and adaptations of the algorithm described below.
Spectral Relevance Analysis brought to scale We extend the SpRAy algorithm by drawing proper utility from the spectral embedding Φ, an intermediate result of the SC algorithm, which so far has remained unused in [12].
While the $q \leq n$ most significant eigenvectors of the singular value decomposition on the graph Laplacian $L_{\mathrm{sym}}$ constitute the columns of the $(n \times q)$ shaped spectral embedding $\Phi$, each of the matrix's rows corresponds to exactly one of the $n$ input attribution maps. We therefore use the rows of $\Phi$ (instead of $A$) as an input to mapping and embedding algorithms such as t-SNE [75] or UMAP [76] for projecting the spectral analysis results (instead of the preprocessed data representation $A$) into $\mathbb{R}^2$ for further visual inspection. Note that the final algorithmic step of SC is the assignment of cluster labels to input samples. For this purpose, one usually applies any other suitable clustering algorithm (e.g. k-Means [77] or DBSCAN [78]) on top of the data represented by the already well-structured embeddings in $\Phi$. The use of $\Phi$ as a source for computing embeddings in $\mathbb{R}^2$ thus leads to a close correspondence of the visualized cluster groupings to the assigned cluster labels.
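The computation of the spectral embedding $\Phi$ and the spectrum $\Lambda$ from an affinity matrix can be sketched as follows. This is a minimal dense-matrix illustration, assuming a symmetric, non-negative affinity matrix with strictly positive row sums; the construction of the sparse affinity structure itself is not shown, and, as is common in spectral clustering, "most significant" is interpreted here as the eigenvectors belonging to the smallest eigenvalues of $L_{\mathrm{sym}}$.

```python
import numpy as np

def spectral_embedding(affinity, q):
    """Spectrum and embedding from a symmetric affinity matrix.

    affinity: (n, n) symmetric, non-negative, with positive row sums.
    Returns (eigvals, phi): the q smallest eigenvalues of the normalized
    symmetric graph Laplacian and the (n, q) spectral embedding whose
    i-th row belongs to the i-th attribution map.
    """
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    return eigvals[:q], eigvecs[:, :q]
```

The rows of the returned embedding are what would be passed on to a clustering algorithm or to a two-dimensional projection for visual inspection.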
Figure 3: (Black paths): Steps followed by the SpRAy procedure as defined in [12]. (Red paths): Our extensions and changes to the SpRAy algorithm to increase the automation potential and applicability to very large datasets. (a) From a set of local attribution maps, a sparse affinity matrix is computed in (b). (c) The affinity data is then passed as input for analysis with SC [72,73] in the form of a positive semi-definite graph Laplacian, resulting in a spectrum of eigenvalues $\Lambda$, the spectral embedding $\Phi$ corresponding to the input data (see (e) and (g)), as well as sets of proposed cluster labels $y_c$. (d) Lapuschkin et al. [12] perform to a large extent direct manual analyses on the eigenvalue spectrum $\Lambda$, within and between analyzed classes, for the identification of CH behavior and distinct cluster groupings, and embed the sparse affinity structure of the data given the estimated cluster labels $y_c$ for visualization. Our extensions rely on the already expressive spectral embedding $\Phi$ (together with cluster labels $y_c$) for (e) visualizing the analyzed data groupings, and (f) the automation and quantification of rating clusters and classes for "Clever Hans'ness" $\tau$, via the computation of separability scores, from, e.g., FDA.
A critical decision in clustering approaches is the number of desired clusters. While for small datasets like Pascal VOC [79] it suffices to analyze the per-class eigen-spectrum [12], datasets with a large number of classes cannot be feasibly analyzed by manual comparison and ranking of the eigen-spectra of all classes to identify those exhibiting spurious model behavior. In order to automate this process, we propose Fisher Discriminant Analysis (FDA) to rank all class-wise clusterings by their respective (linear) separability as the quantity $\tau$. FDA [80,81] is a widely popular method for classification as well as class- (or cluster-) structure preserving dimensionality reduction. FDA finds an embedding space by maximizing between-class scatter $S^{(b)}$ and minimizing within-class scatter $S^{(w)}$, given by
$$S^{(b)} = \sum_{k=1}^{K} |c^K_k| \left(\mu_k - \mu\right)\left(\mu_k - \mu\right)^\top, \qquad S^{(w)} = \sum_{k=1}^{K} \sum_{x \in c^K_k} \left(x - \mu_k\right)\left(x - \mu_k\right)^\top.$$
Here, $C_K$ is a clustering with $K$ clusters $c^K_k$ with $k \in \{1, \dots, K\}$, $\mu_k$ the sample mean of cluster $k$ and $\mu$ the mean over the whole set of samples. The solution of FDA can be understood as directions of maximal separability between clusters, and, when normalized and plugged into the original objective, gives scores of separability $R(C_K)$. In our specific use case, for each class we compute separability scores $R(C_K)$ on the spectral embedding $\Phi$ and each clustering $C_K$ in a set of clusterings $\mathcal{K} = \{C_K\}$. We then define the class-separability score $\tau$ as an aggregate of the scores $R(C_K)$ over all clusterings $C_K \in \mathcal{K}$, which may then be used to compare classes w.r.t. their "Clever Hans'ness". In the SpRAy setting, large $\tau$ denote outlierness in the predictor's attribution, as indicators for artifact candidates, whereas low $\tau$ does not indicate any strikingly "irregular" prediction behavior. Clearly, any algorithm quantifying the separability of two or more sets of labelled samples may be used as an alternative to compute $\tau$, although we see FDA as one of the more intuitive approaches. Algorithm 1 provides a complete algorithmic description of the extended SpRAy technique, while the red arrows and symbols in Figure 3 distinguish our approach from SpRAy in [12].
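A separability score in the spirit of FDA can be sketched as below. The scatter matrices follow the standard definitions with cluster-size weighting; the score itself is the largest Fisher ratio $w^\top S^{(b)} w / w^\top S^{(w)} w$, which is one possible choice and an assumption on our part rather than the exact normalization used for $R(C_K)$. The small ridge term `reg` is likewise an assumption, added for numerical stability.

```python
import numpy as np

def scatter_matrices(phi, labels):
    """Between- and within-cluster scatter of the rows of phi."""
    mu = phi.mean(axis=0)
    d = phi.shape[1]
    s_b = np.zeros((d, d))
    s_w = np.zeros((d, d))
    for k in np.unique(labels):
        members = phi[labels == k]
        mu_k = members.mean(axis=0)
        diff = (mu_k - mu)[:, None]
        s_b += len(members) * (diff @ diff.T)   # weighted by cluster size
        centered = members - mu_k
        s_w += centered.T @ centered
    return s_b, s_w

def fisher_separability(phi, labels, reg=1e-6):
    """Largest Fisher ratio for the given clustering (hypothetical score)."""
    s_b, s_w = scatter_matrices(phi, labels)
    ratios = np.linalg.eigvals(np.linalg.solve(s_w + reg * np.eye(len(s_w)), s_b))
    return float(np.max(ratios.real))
```

Evaluating such a score for several clusterings of one class and aggregating the results yields a single number per class that can be ranked across classes.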

Class Artifact Compensation
Assume we have a set of atomic features $\mathcal{F}$. A concept $c \in 2^{\mathcal{F}}$ may be any combination of atomic features to describe an abstract property, where $2^{\mathcal{F}}$ is the power set of $\mathcal{F}$. We may define an $M$-tuple of concepts $C = (c_1, c_2, \dots, c_M)$ with $c_i \in 2^{\mathcal{F}}$ for $i \in \{1, \dots, M\}$. Given the set of concepts $\mathcal{C} = \bigcup_{i=1}^{M} \{c_i\}$, assume a set of untangled data points that can be constructed by a combination of concepts
$$\mathcal{D} = \Big\{ \bigcup_{c \in \mathfrak{c}} c \;\Big|\; \mathfrak{c} \in 2^{\mathcal{C}} \Big\}.$$
Each untangled data point $\alpha \in \mathcal{D}$ is, like a concept, also a combination of atomic features, i.e., $\alpha \in 2^{\mathcal{F}}$. We may now, given $\alpha$, construct a signal vector $s(\alpha) \in \{0, 1\}^M$ using
$$s(\alpha)_i = \delta_{c_i,\, c_i \cap \alpha}$$
with the Kronecker delta $\delta$, where each entry at index $i$ is 1 if $c_i \subseteq \alpha$. In other words, $s(\alpha)$ is a binary encoding of $\alpha$ given concepts $C$. Now assume we have an $N$-tuple of untangled data points $D = (\alpha_1, \alpha_2, \dots, \alpha_N)$ with $\alpha_i \in \mathcal{D}$ for $i \in \{1, \dots, N\}$. We may now construct a corresponding $N$-tuple of tangled data points $X = (x_1, x_2, \dots, x_N)$ based on $D$, where each sample $x_i$ is a mixture of concepts given a pattern matrix $A \in \mathbb{R}^{d \times M}$:
$$x_i = A\, s(\alpha_i).$$
Suppose we call concept $c_k$ at index $k \in \{1, 2, \dots, M\}$ an artifact. A set of labels $t_i$ that indicate whether a data point $\alpha_i$ contains the artifact $c_k$ can then be defined as
$$t_i = s(\alpha_i)_k.$$
Assuming we have a function $f: \mathbb{R}^d \to \mathbb{R}^d$ on the tangled data points $X$, there are two questions we seek answers for: (1) Is $f$ sensitive to artifact $c_k$? (2) How can $f$ be modified such that it is insensitive to artifact $c_k$?
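The definitions above can be made tangible with a toy example. In the sketch below, concepts and untangled data points are represented as Python `frozenset`s of atomic features; the feature names, the pattern matrix and the function names are purely hypothetical and only serve to illustrate the binary encoding $s(\alpha)$, the linear mixing into a tangled sample, and the artifact label.

```python
import numpy as np

def signal_vector(alpha, concepts):
    """Binary encoding s(alpha): entry i is 1 iff concept c_i is a subset of alpha."""
    return np.array([1 if c <= alpha else 0 for c in concepts])

def tangle(alpha, concepts, pattern):
    """Tangled sample x = A s(alpha) for a (d, M) pattern matrix A."""
    return pattern @ signal_vector(alpha, concepts)

def artifact_label(alpha, concepts, k):
    """Label t: 1 if alpha contains the artifact concept at index k, else 0."""
    return int(signal_vector(alpha, concepts)[k])
```

For instance, with concepts `({"stroke"}, {"blue"}, {"stroke", "loop"})` and a data point made of the features `"stroke"` and `"blue"`, the signal vector marks the first two concepts as present and the third as absent.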
Concept Sensitivity of Functions To measure the sensitivity to artifact $c_k$ with labels $t_i \in \{0, 1\}$, one needs to compare the behavior of function $f$ on non-artifact samples $X^- = \{x_i \mid i \in \{1, 2, \dots, N\},\ t_i = 0\}$ and artifact samples $X^+ = \{x_i \mid i \in \{1, 2, \dots, N\},\ t_i = 1\}$. A naive approach may be, for example, to compare sufficient statistics of $f$ on the artifact samples, e.g., the mean $\mu^+ = \frac{1}{|X^+|} \sum_{x \in X^+} f(x)$, with their non-artifact counterparts, where $|X^+|$ is the cardinality of $X^+$. This may, however, not give any decisive results when the number of samples is limited. As another drawback, the function cannot be analyzed on a per-sample basis in this way.
Another approach is to explicitly estimate an artifact model h : R d → R d , which, given a non-artifact sample We can formulate the artifact model with the objectivê whereθ are the optimal hyperparameters of h. The artifact estimator h is thus the function h with hyperparametersθ that produces the minimal 2 -distance between mapped non-artifact samples h(x − ) with x − ∈ X − and artifact samples x + with x + ∈ X + . The sensitivity of function f to a concept c k modeled with h may then be estimated using Intuitively, the addition of a concept may be more feasible to estimate than the removal. Take, for example, the introduction of an opaque watermark in an image. This operation is not invertible as we destroyed the pixel information under the watermark. While Equations (13) and (14) assume the transformation of a non-artifact sample to an artifact sample in a forward artifact model, they may equivalently be formulated with a removal of the concept in a backward artifact model h b witĥ The sensitivity of a function to a concept backward modeled by h b may then be measured using Concept Desensitization Depending on the type of function f , there may be multiple possible approaches to obtain a desensitized function f . If f is for example a function with learned parameters ω, it may be possible to learn f by modifying its training data. If there is enough data available, the most naive approach to reduce the sensitivity to an artifact c k , is to remove all samples X + that contain the artifact from training. Depending on the amount of available training data, this may not always be preferred, since these samples often contain other concepts that may be valuable for training. In contrast, if the number of samples with the artifact concept is larger than the number of samples without the artifact concept, one may instead discard all samples without the artifact to obtain an artifact-insensitive function. 
Of course, care must be taken not to change the data so much that the original problem can no longer be solved. A better approach may be to transform individual samples, such that either all samples or none contain the artifact. Assuming the addition of an artifact is non-invertible, we may prefer to transform all samples to contain the artifact. This may be done by estimating a forward artifact model $h$, as defined in Equation (13). The model $\tilde{f}$ may then be trained with the transformed dataset $\tilde{X} = (\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_N)$, with $\tilde{x}_i = h(x_i)$ if $t_i = 0$ and $\tilde{x}_i = x_i$ otherwise. A simplification arises when the task is to solve a classification problem. Since the model is trained to produce logits for multiple classes, one may simply balance the number of samples between classes, such that for each class an identical number of samples with an artifact is put into the training set by transforming non-artifact samples.
Another simplification arises when a regularization term is introduced into the objective of the artifact model, such that $h$ acts as the identity for artifact samples $x^+ \in X^+$. With this regularization term, the error caused by transforming an already-artifact sample is minimized.
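To make the verbal description of the artifact model objective concrete, one form that is consistent with it is sketched below. The squared norm, the normalization constants and the regularization weight $\lambda$ are assumptions made for illustration; they are not taken from the original equations.

```latex
% Forward artifact model objective (assumed form):
\hat{\theta} = \arg\min_{\theta}
  \frac{1}{|X^-|\,|X^+|} \sum_{x^- \in X^-} \sum_{x^+ \in X^+}
  \left\| h_\theta(x^-) - x^+ \right\|_2^2

% Variant with an identity regularizer on artifact samples
% (assumed form, \lambda \ge 0 is a hypothetical weight):
\hat{\theta} = \arg\min_{\theta}
  \frac{1}{|X^-|\,|X^+|} \sum_{x^- \in X^-} \sum_{x^+ \in X^+}
  \left\| h_\theta(x^-) - x^+ \right\|_2^2
  + \frac{\lambda}{|X^+|} \sum_{x^+ \in X^+}
  \left\| h_\theta(x^+) - x^+ \right\|_2^2
```

Under the first form, a purely additive model $h_\theta(x) = x + v$ is minimized by $v = \frac{1}{|X^+|}\sum_{x^+ \in X^+} x^+ - \frac{1}{|X^-|}\sum_{x^- \in X^-} x^-$, i.e. by the mean shift between the two sets, since setting the gradient with respect to $v$ to zero gives exactly this difference of means.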
Application on Logistic Regression
To build a better intuition for the problem, we introduce a logistic regression model $f(x) = \sigma(w^\top x + b)$ with sigmoid non-linearity $\sigma(x) = \frac{1}{1 + \exp(-x)}$. The parameters $w$ and $b$ are obtained by minimizing the loss function $L(f)$ with labels $y_i \in \{-1, +1\}$, where $y_i = -1$ for samples of class A, using Stochastic Gradient Descent (SGD). We first consider the case $X^+ = \emptyset$, which is visualized in Figure 4 in the panel titled "Clean". In the panel, we see samples of two classes, A (blue) and B (orange), scattered along the y-axis. The green lines visualize the decision hyperplane of $f$ over 25 epochs of training. We can see that the final decision hyperplane (dark green) has converged orthogonal to the signal direction on the y-axis, separating classes A and B perfectly along their center. In panel "Artifact" of Figure 4, we introduce an artifact concept into some of the samples of class A, i.e. $|X^+| > 0$, which manifests as an increased value along the x-axis. The artifact samples are well on the right side of the panel. When now minimizing $L(f)$, the converged decision hyperplane, to which $w$ is normal, has rotated. While all samples are still classified correctly, we can see that the introduction of an additional concept has changed the model. Based on this observation and the previous discussion, we introduce two approaches under the common name of Class Artifact Compensation to compensate for class-specific Clever Hans artifacts in SGD-trained inner-product + non-linearity type models such as logistic regression or neural networks.

Figure 4: Logistic regression on data with, among possibly others, a discriminative signal direction and an artifact direction which is only represented in one of the two classes. The decision hyperplane is shown over the SGD-based training process of 25 epochs in shades of green, with: Clean: no artifact in the data; Artifact: a Clever Hans artifact in Class A (blue); A-ClArC: with artifact, but training is continued with the mean difference between clean samples and artifact samples in Class A added to some samples of Class B (orange). The introduction of an artifact to samples from Class A changes the decision boundary. By introducing the same artifact direction to samples from Class B and retraining, this effect can be reduced significantly. P-ClArC: with artifact, but the model is modified such that data points are projected onto the hyperplane at position $z$ to which the estimated artifact direction $v$ is normal, with $\|v\| = 1$ and zero reference $z$ chosen as the mean of clean samples of Class A. The resulting decision hyperplane ignores artifact direction $v$ and sits at the same position where the original hyperplane lay between classes A and B, thus leaving the function output unchanged for clean samples. Reference $z$ may be chosen as the mean of both clean and artifact samples of Class A to move the resulting decision hyperplane towards the middle of both classes.
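The toy setting of Figure 4 and the augmentation summarized in its A-ClArC panel can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the loss is assumed to be the mean logistic loss $\log(1 + \exp(-y_i(w^\top x_i + b)))$, full-batch gradient descent replaces SGD to keep the run deterministic, and the data layout, step counts and all function names are hypothetical.

```python
import numpy as np


def train_logreg(X, y, w=None, b=0.0, lr=0.1, steps=300):
    """Full-batch gradient descent on the mean loss log(1 + exp(-y * (w.x + b)))."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        margin = y * (X @ w + b)
        # per-sample derivative of the loss with respect to the logit
        coeff = -y / (1.0 + np.exp(margin))
        w -= lr * (coeff[:, None] * X).mean(axis=0)
        b -= lr * coeff.mean()
    return w, b


def make_toy(n=200, shift=4.0, frac=0.5, seed=0):
    """Two classes separated along the y-axis; an artifact offsets part of class A along x."""
    rng = np.random.default_rng(seed)
    X_a = rng.normal([0.0, -2.0], 0.3, size=(n, 2))  # class A, label -1
    X_b = rng.normal([0.0, +2.0], 0.3, size=(n, 2))  # class B, label +1
    t = np.zeros(2 * n, dtype=bool)
    t[: int(frac * n)] = True                        # artifact labels t_i (class A only)
    X = np.vstack([X_a, X_b])
    X[t, 0] += shift
    y = np.concatenate([-np.ones(n), np.ones(n)])
    return X, y, t


def a_clarc_augment(X, y, t, frac=0.5, seed=1):
    """Shift a random subset of class B by v = mean(X+) - mean(X-), both taken within class A."""
    is_a = y < 0
    v = X[is_a & t].mean(axis=0) - X[is_a & ~t].mean(axis=0)
    rng = np.random.default_rng(seed)
    idx_b = np.flatnonzero(~is_a)
    chosen = rng.choice(idx_b, size=int(frac * idx_b.size), replace=False)
    X_aug = X.copy()
    X_aug[chosen] += v
    return X_aug, v


X, y, t = make_toy()
w_art, b_art = train_logreg(X, y)                      # "Artifact" panel
X_aug, v = a_clarc_augment(X, y, t)
w_fix, b_fix = train_logreg(X_aug, y, w=w_art, b=b_art, steps=600)  # continued training
tilt_before = abs(w_art[0] / w_art[1])                 # weight on the artifact axis
tilt_after = abs(w_fix[0] / w_fix[1])
```

The ratio of the weight on the artifact axis to the weight on the signal axis serves here as a simple proxy for how far the decision hyperplane has rotated; continued training on the augmented data is expected to shrink it.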

Augmentative Class Artifact Compensation
The goal of A-ClArC is to augment samples in such a way that the SGD-trained classifier becomes insensitive to an artifact, given artifact labels $t_i$. Given these labels, we estimate a forward artifact model $h$, which for our logistic regression toy model we define as purely additive, with $h(x) = x + v$. Given the objective from Equation (13), we can see that the optimal value for parameter $v$ is the difference between the mean of the artifact samples $X^+$ and the mean of the non-artifact samples $X^-$ of class A, i.e. the shift between non-artifact samples and artifact samples in class A. This is visualized in panel "A-ClArC" in Figure 4. Some samples of class B, $x^B_i \in \{x_i \mid i \in \{1, 2, \dots, N\}, y_i = +1\}$, are then modified with this artifact model, i.e. shifted by $v$. The modified samples are visualized in Figure 4 in a brighter shade of orange, shifted to the right. The model training is then continued with the transformed samples; the resulting hyperplanes over the epochs are visualized as purple lines. We can observe that the converged hyperplane resembles the one obtained by the model trained on artifact-free data in panel "Clean" of Figure 4.

Beyond this example, in our experiments with image data we assume artifacts are objects that are blended into the image. Therefore, we may parameterize the artifact model as an alpha blending with one blending coefficient per pixel index in $\{1, \dots, d\}$, where $z \in [0, 1]^d$ are the RGB values of the static image artifact pixels, here each for simplicity represented by a single value. Taking CAVs as a motivation, we use an alternative parameterization for the forward artifact models in our experiments on feature representations in a neural network. Explicitly, we train a linear soft-margin SVM $g$ with hinge loss, with $v \in \mathbb{R}^d$, regularization constant $\eta$ and bias term $\beta$. We then design the artifact model explicitly by pushing samples over the decision boundary relative to some fixed position $z$.
We choose $z$ as the mean artifact reference point, i.e. the mean of the artifact samples $X^+$. The forward artifact model $h$ is then chosen as an affine transformation (Equation (26)).

Projective Class Artifact Compensation
While A-ClArC addresses the problem of desensitization by augmenting the underlying training data of a prediction model $f$ using a forward artifact model $h$, P-ClArC instead aims to correct the model without retraining by incorporating a backward artifact model $h_b$ directly into the prediction model. The approach is again motivated by CAVs and uses the same parameterization for the backward artifact model as for the forward model in Equation (26), with $v$ given in Equation (24). However, the artifact reference point $z$ here becomes the non-artifact reference point, which we now choose as the center of the non-artifact samples $X^-$. This moves all points along $v$ to a fixed position, while leaving all orthogonal directions untouched. A strong assumption of this approach is that all other concepts are encoded in the directions orthogonal to $v$. Given this assumption, however, we may further assume that for all non-artifact examples there is no variance along the artifact CAV, which simplifies the backward artifact model further. Given the logistic regression model $f$ in the "P-ClArC" panel of Figure 4, we obtain the model $\tilde{f}$, corrected for insensitivity against the artifact modeled by $h_b$, using Equation (31). The "P-ClArC" panel shows the decision hyperplane of the original model $f$ in green, along with the parameters $v$ and $z$ of the backward artifact model $h_b$, as well as the corrected decision hyperplane according to Equation (31). Note that the non-artifact reference $z$ is chosen such that the decision hyperplane of $\tilde{f}$ is at the same position exactly between classes A and B, resulting in a decision hyperplane that is somewhat shifted towards class A. An alternative $z$ may be chosen as the mean of all samples of class A to correct for this difference.
However, a constraint of this approach was to leave function values unchanged for non-artifact samples, which results in this shift. We can transfer this approach directly to the neural network models in the experiments section due to their piecewise-linear nature. A detailed algorithm for both A-ClArC and P-ClArC on neural networks is shown in Algorithm 2 under the common name of Class Artifact Compensation.
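The projection described above, which moves every point along $v$ onto the hyperplane through the reference $z$ and leaves orthogonal directions untouched, can be sketched for the logistic regression case as follows. The explicit form $h_b(x) = x - vv^\top(x - z)$ and both function names are assumptions made for illustration.

```python
import numpy as np


def backward_artifact_model(X, v, z):
    """Move every row of X along the unit direction v onto the hyperplane through z."""
    v = v / np.linalg.norm(v)
    return X - np.outer((X - z) @ v, v)


def p_clarc_logreg(w, b, v, z):
    """Fold the projection into the logit parameters, so that
    w_tilde.x + b_tilde equals w.h_b(x) + b for every x."""
    v = v / np.linalg.norm(v)
    w_tilde = w - (w @ v) * v            # remove the component of w along v
    b_tilde = b + (w @ v) * (z @ v)      # constant contribution of the reference point
    return w_tilde, b_tilde
```

Because the logit is linear, the corrected model needs no extra layer in this toy case: the projection is absorbed into `w_tilde` and `b_tilde`. For a neural network, the same map would instead be applied to the activations at the chosen layer.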

Algorithm 2: Class Artifact Compensation
Data: Model f operating on X, with accessible layer l (and subnetwork f_l)
For A-ClArC: data D, epochs E for training, poison rate p ∈ [0, 1]
Result: predictor f desensitized to artifact c
/* obtain feature representations of data at layer l */

Experiments -Clever Hans Identification
The goal of this section is to explicitly find artifact models, given sets of labels on our dataset for CH artifacts in the training set that were learned by the analyzed neural network model. We therefore start with an experiment that investigates the relation and the difference between the detection of CH and BD artifacts within feature representations of neural networks [44]. The corresponding results point towards the necessity of deeper insight into the model. Such insight is promised by SpRAy [12], which we subsequently verify on a specially designed version of Colored MNIST using our separability score extension. We then proceed to verify the proposed separability score τ on a VGG16 model [82] trained on ILSVRC2012 by comparing the scores of classes for which we have manually found CH artifact candidates. A description of the training procedures and architectures of all models used in this section can be found in Appendix A. We proceed to visualize some promising CH artifact candidates, found in an algorithm-assisted dataset exploration with SpRAy, which provides us with a set of positive and negative labels on samples for each artifact candidate. The exploration is conducted both in input space and in feature space at various layers of our model, and we compare the resulting separability scores. The previously obtained sets of labels may then be used to fit or construct an artifact model, which will be verified and used as a prerequisite to remove the corresponding artifact from a model using A-ClArC and P-ClArC in the following Section 4.

Relation of Clever Hans and Backdoor Artifacts
In this section, we conduct an empirical demonstration on the difficulty of detecting CH artifacts compared to BD attacks by analysing a neural network's hidden activations. We prepare two modified instances of the CIFAR-10 dataset [83], one poisoned by introducing a CH artifact, the other by adding a BD. In both cases, the trigger pattern is a static (3 × 3)-sized grey pixel patch applied to a subset of the training set. For the CH, this trigger is introduced into 25% of all samples of class "airplane". For the BD, it is introduced into 10% of all samples, with the class label of each poisoned sample changed to "airplane". A simple convolutional network is then trained on each training set instance. This network achieves an unpoisoned validation accuracy of 49.1% when trained using the CH artifact, and 46.6% with the BD-poisoned dataset. As suggested by Tran et al. [44], the SpeSig method (cf. Section 2.1) is used to detect poisoned samples as outliers. While Tran et al. [44] use this outlier score only to detect BD samples, we also attempt to detect samples affected by the related CH effect in order to compare these two types of dataset poisoning in terms of their induced feature representation.
For each sample, an outlier score is thus obtained, yielding an implicit ordering of samples, with the highest score denoting the most outlying samples. For the datasets poisoned by a BD and CH, respectively, we then compare this ordering to the ground truth "poison labels". The results of this comparison are depicted as Receiver Operating Characteristic (ROC) curves in Figure 5. Coinciding with the findings of [44], the BD candidates suggested by the outlier score correspond extremely well to the ground truth ( Figure 5 (right)), with an Area Under Curve (AUC) of 1.0. However, for the CH case in Figure 5 (left) this comparison yields almost random results, with an AUC that is only marginally above 0.5.
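The outlier scoring and its comparison against the ground-truth poison labels can be sketched as below. The score follows the spectral-signature idea as we understand it (squared projection of the centered feature representation onto its top singular direction); the AUC is computed here from pairwise rankings rather than from 1000 evenly spaced thresholds, and the function names are hypothetical.

```python
import numpy as np


def spectral_signature_scores(R):
    """Outlier score per row of R: squared projection of the centered
    representation onto the top right singular vector."""
    centered = R - R.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2


def roc_auc(scores, labels):
    """AUC as the fraction of (poisoned, clean) pairs ranked correctly; ties count one half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

A score of this kind only looks at the single direction of largest variance, which is why it can single out a backdoor feature that dominates the representation while remaining close to chance for a CH feature that co-occurs with stronger class evidence.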
This experiment highlights the difference between BDs and CH artifacts, and emphasizes the additional issues that are present when dealing with the latter. Intuitively, features introduced by BD artifacts will be the only feature in their respective sample to correlate with the target label, making them for many samples the only indicator usable for a valid prediction. Additionally, for a correct prediction they must be a stronger indicator than all features that correlate with labels different from the BD target label. This may very well be the reason they can be detected so clearly with SpeSig, using only the direction of the largest variance in feature space over the dataset. In contrast, features introduced by CH artifacts will always appear alongside other features in their respective sample that correlate even more strongly with the target label. This means that, in theory, they are not necessary for a correct prediction at all.
To detect CH artifacts more reliably, deeper insight into the predictor is necessary. A promising direction is thus XAI, which is utilized in SpRAy to detect these elusive CH artifacts in the rest of this section.
It is interesting to note that FDA can be understood as an extension of simply finding the direction of the largest variance, as done in SpeSig: given a set of labels, FDA finds the direction of largest variance between labels and smallest variance within labels.

Figure 5: Differences in the detection of CH artifacts (top, panel "Clever Hans") and BDs (bottom, panel "Backdoor Artifact"). In both cases, the introduced artifact consists of a small white pixel patch in the top right corner. (Left): A subset of the samples that were identified as outliers via SpeSig. All samples considered as outliers in the BD setting do in fact contain the BD feature. The same evaluation performed in the CH setting leads to a significant amount of false positives for the detection of the CH artifact. (Right): This is further confirmed by the ROC curves comparing poisoned samples detected by SpeSig to the ground truth. Note that in both cases, 1000 evenly spaced thresholds were used for the AUC/ROC computation.

Spectral Relevance Analysis in Input Space
We explore SpRAy for the identification of Clever Hans artifacts in input space. We start with a verification of the algorithm by constructing a modified version of MNIST where an artifact is introduced as a distinct color. We then proceed to analyze the applicability of SpRAy in input attribution space on the ILSVRC2012 dataset.

Spectral Relevance Analysis on Colored MNIST
The SpRAy framework is applied to a colored MNIST setup as follows. For each of the 10 MNIST classes, we create a dataset where, for the corresponding class, samples are colored with a probability of 20 percent in a distinct color as shown in Figure 6. The rest of the samples are left in their original white color. On each of these datasets, a simple feed-forward convolutional neural network is trained (cf. A.2). We can then verify for each model how strongly it has learned the color as a distinct feature for the corresponding class, by evaluating the model accuracy and the fractions of the predicted classes on a validation set which has been completely colored in the color of the artifact. Subsequently, we perform a Spectral Relevance Analysis, using 4 neighbors to build an affinity graph of the attributions and computing the spectral embeddings reduced to the dimensions corresponding to the 2 smallest eigenvalues. Note that we did not sum over the color channels of the attributions, as is often done for visualization purposes, since the color plays an important role in this experiment. We perform a simple agglomerative clustering with 2 clusters on the spectral embedding, and compute its separability score τ. The aforementioned results are visualized in Figure 7.

Figure 7: Top: Accuracy on poisoned dataset (left) and separability score τ of 2-cluster agglomerative clustering where the class included a Clever Hans. Bottom: Spectral Embedding (left) with 4 neighbors and 2 eigenvalues of each individual class on its corresponding dataset, where orange crosses are colored artifact samples and blue crosses are uncolored clean samples. The predicted class fractions are shown for each class to the right of its Spectral Embedding. For each of the 10 classes, a modified MNIST dataset was prepared where 20 percent of the samples in that particular class were colored to act as artifact samples. One model was trained on each of these datasets. The poisoned accuracy is the accuracy of each of these models on the validation set with each sample colored in the same artifactual color. The colors used were the same as the ones shown in Figure 6. Models with a high poisoned accuracy and a low separability score indicate that the model has not learned the artifact. Models with a low poisoned accuracy and a high separability score indicate that the artifact was learned. The spectral embeddings show a clear split for models where the artifact was learned.

In the spectral embeddings at the bottom of Figure 7, where the attribution can be well separated, clean and artifact samples move towards the opposite ends of the crescent. This is visible for classes 0, 1, 3, 6, 7 and 9. These classes also show a high separability score τ when compared to the scores of the other classes. With the exception of class 1, all of these classes also show a low performance on the poisoned validation set, with an accuracy below 30 percent. In contrast, again with the exception of class 1, all classes with an accuracy above 50 percent show a separability score close to 0. The predicted class fractions for classes with a high separability score show a strong tendency of the model to predict poisoned samples as the artifact class, especially for classes 0, 3, 6, 7 and 9. For classes 2 and 4, the model seems not to have picked up the artifact as a class-relevant feature, or only barely. Class 8 shows a high confusion even though the separability score is close to 0. In contrast, class 1 shows comparatively low confusion even though its separability score is high. It is worth noting that all models show a reduced accuracy on the poisoned validation set compared to the accuracy of 98 to 99 percent on a clean validation set, even for class 2, where the confusion does not seem to focus on the artifact class. This means that even though we may confuse models by coloring all samples of the whole validation set, we cannot detect the artifact in some of these models using SpRAy. Only part of the reason for this seems to be that the model has not picked up the artifact during training, since for example class 8 shows a relatively high tendency to confuse colored samples for their corresponding artifact class, yet SpRAy does not give any indication of an artifact in the class.
Concluding this experiment, the assigned importance of an artifact may vary greatly between models and classes, and even though we may not find the artifact in all instances where the model has in fact picked up an artifact as an important feature, SpRAy pointed out most artifacts in this setup.
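The embedding step used in this experiment (a neighbor-based affinity graph over the attributions, followed by the eigenvectors belonging to the smallest Laplacian eigenvalues) can be sketched as follows. This is an illustration under assumptions: a binary, symmetrized k-nearest-neighbor affinity and the unnormalized Laplacian are chosen here for simplicity, and the function name is hypothetical.

```python
import numpy as np


def spectral_embedding(A, n_neighbors=4, n_components=2):
    """Embed the rows of A (flattened float attributions) with the eigenvectors that
    belong to the smallest eigenvalues of a k-nearest-neighbor graph Laplacian."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                          # exclude self-matches
    nn = np.argsort(sq, axis=1)[:, :n_neighbors]
    W = np.zeros_like(sq)
    W[np.repeat(np.arange(len(A)), n_neighbors), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                                # symmetrize the affinity graph
    L = np.diag(W.sum(axis=1)) - W                        # unnormalized Laplacian (assumption)
    eigvals, eigvecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    return eigvals[:n_components], eigvecs[:, :n_components]
```

The rows of the returned eigenvector matrix are the embedded samples. A 2-cluster agglomerative clustering, for instance scikit-learn's `AgglomerativeClustering`, would then be run on these rows before computing the separability score.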
Quantifying Clever Hans Candidates on ImageNet
We examine ILSVRC2012 for CH candidates by applying SpRAy with various clustering approaches, for which we compute cluster separability scores τ (Eq. (7)) for each class. Figure 8 lists a ranking of the ImageNet classes with the highest and lowest τ values, with a striking result for class "laptop", due to a large cluster with copies of almost the same image (see the UMAP of its spectral embedding in Figure 10).

We inspect the validity of the class ranking for CH candidates generated by FDA in a small experiment, by screening a subset of all 1000 ImageNet classes, namely (1) those with the 20 highest τ scores, (2) those with the 20 lowest τ scores and (3) 63 randomly picked classes. In all three cases, we treat a large value of τ as a positive CH "prediction" for the class. We then produce "ground truth" labels via manual assessment of the existence of a CH candidate. We would like to remark that this "ground truth" has been established based on the class label description in the taxonomy of the ImageNet dataset and our subjective human understanding of the image content. Using this information, we produce ROC curves and corresponding AUC values.
The results show a clear picture, validating that a high τ score is indeed a strong indicator for the presence of CH candidates (Figure 9 (left), high AUC). Both the randomly selected and the bottom 20 classes (Figure 9 (mid, right)) yield essentially random AUC scores due to only sporadically encountered CH. However, the non-zero AUC values here also show that even a τ rating in the lowest 2-percentile does not guarantee a class to be free of CH behavior. In summary, a large τ is an excellent indicator for CH behavior, but a small τ is no ultimate guarantee for its absence, so further research will be needed here, ideally to bring forward indicators that can provide a theoretical bound for the absence of CH behavior.
Inspecting and Isolating Clever Hans Candidates
Based on the ordering by FDA and τ established in the previous section, we manually investigate whether the CH candidate classes show the prominent CH artifacts to be expected. As a side effect, the SpRAy framework also provides (through its spectral embedding space Φ) a basis for visualizing clusters of heatmaps, for which we use UMAP here. Promising clusters are often located far away from the rest of the datapoints in the UMAP embedding; see e.g. the UMAP scatter plot of class "garbage truck" in Figure 10 (center left). There, the red cluster members all show examples of images of the same watermark with high attribution in LRP. Another intriguing example is the top middle UMAP plot of class "stole": while not as separated as for other examples, we find a cluster of mannequins wearing stoles, with high attribution scores on the mannequin's "head". For class "carton", we can even see two artifacts at the same time: a watermark written with Hanzi in the center of the image, as well as a watermark in Latin characters in the bottom right. The bottom right watermark is in fact not only present in the carton class.
Based on the clustering labels provided by SpRAy, we may extract for each artifact a set of labels that indicate whether a sample is affected by the artifact candidate. Using these labels along with the corresponding samples, we may estimate an artifact map according to Section 2.5, which, given a clean sample, creates a poisoned version of the sample with the artifact present. This may be done, for example, by training a generative model conditioned on the presence or absence of an artifact, by manually extracting a watermark from an affected image using an image manipulation framework, or by something as simple as fitting a linear regression model. For A-ClArC in input space, we manually extract the artifact from samples labeled as poisoned, such that we can apply it to samples by a simple affine transformation $h(x) = (I - \mathrm{diag}(\alpha))x + \mathrm{diag}(\alpha)z$, where $z$ is a vector with the pixel values of the watermark, and $\alpha$ is an alpha channel with one entry per pixel, which is zero for all pixels except the ones where the watermark is present. For P-ClArC, we instead use the labels to train a linear classifier $f(x) = v^\top x + b$ with $\|v\| = 1$, which is used to estimate an inverse artifact map as an affine transformation $h(x) = (I - vv^\top)x + vv^\top z$, where $z$ is chosen as the mean over all clean samples of the class, as highlighted in Section 2.5.
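The input-space transformation $h(x) = (I - \mathrm{diag}(\alpha))x + \mathrm{diag}(\alpha)z$ amounts to an elementwise alpha blend, which the following minimal sketch illustrates. The function name and the tiny example values are hypothetical.

```python
import numpy as np


def blend_artifact(x, z, alpha):
    """h(x) = (I - diag(alpha)) x + diag(alpha) z, applied elementwise.
    x, z and alpha share one shape; alpha is 0 wherever the artifact is absent."""
    return (1.0 - alpha) * x + alpha * z


# Example: a 4-pixel image in which only the last pixel is covered by the artifact.
x = np.array([0.2, 0.4, 0.6, 0.8])
z = np.array([0.0, 0.0, 0.0, 1.0])
alpha = np.array([0.0, 0.0, 0.0, 1.0])
x_poisoned = blend_artifact(x, z, alpha)
```

Pixels with `alpha` equal to zero are passed through unchanged, so the same map can be applied to any sample without touching image content outside the watermark region.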

Spectral Relevance Analysis on ImageNet in Feature Space
Until now we have based our SpRAy analysis solely on model attributions in input space. While this has not been explored by Lapuschkin et al. [12], we attempt to base the analysis on model attributions in feature space for additional insight, and compare the obtained separability scores over the various intermediate representations at different model depths. The motivation behind using intermediate representations is that the model must encode increasingly invariant representations of concepts towards its classification task in higher levels, which may not be detectable with the contribution scores in input space. We investigate which clusters of samples contribute the most towards the separability score of a given class. To this end, we compute the score τ as many times as there are clusters, with samples from one cluster withheld in each iteration. In this setting, the cluster group with the lowest separability score will have left out the cluster of samples with the highest contribution to the outlierness of the class. Complete class separability scores, along with samples of clusters with the highest outlier score, are reported in Figure 10. The shown ImageNet classes are laptop (top), mountain bike (middle) and swimming trunks (bottom). Note that the measured absolute magnitude of the separability score τ might be different between the three classes, so that only relative within-class comparisons can be inferred here. The scores τ vary strongly over various layers for different classes. E.g., the FDA score for "laptop" is comparatively large at the input layer, but then decreases with increasing depth of the layer. "Swimming trunks", on the other hand, seems to separate best at layer 4. For this layer of maximum separation score, examples of the top three separating clusters are shown on the right, revealing possible CH artifacts.
In this figure, the above analysis is shown for the three example classes "laptop", "mountain bike", and "swimming trunks" (top to bottom). Within each panel, a relative comparison of the separability score τ over layers, i.e., the input layer (layer 0) and various intermediate layers obtained from the model's convolutional feature extractor (layers 2-10) can be found to the (left). At each layer, measurements vary over the chosen number of clusters K ∈ {2, · · · , 32}, with the respective mean shown as a colored dot. However, a high τ does not necessarily only occur due to the presence of a CH, although if a CH is present and well represented, a high separability score is likely. Thus, correspondingly on the (right) side of each panel, for K = 32 and the layer with the highest mean τ in Figure 11 (left), samples of the top three clusters in terms of contribution to separability (i.e., the separability score decreased the most when this cluster was left out) are visualized. The most contributing cluster is shown in the (top row), decreasing towards the (bottom row).
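The leave-one-cluster-out ranking described above can be sketched independently of the concrete score. In the sketch below, `score_fn` is a placeholder for the separability score τ, and `variance_ratio` is only a hypothetical stand-in (between-cluster over within-cluster scatter) so that the example is self-contained; it is not the τ used in this work.

```python
import numpy as np


def cluster_contributions(X, clusters, score_fn):
    """Drop in score when each cluster is withheld; a larger drop means the
    cluster contributes more to the separability of the class."""
    full = score_fn(X, clusters)
    drops = {}
    for c in np.unique(clusters):
        keep = clusters != c
        drops[int(c)] = full - score_fn(X[keep], clusters[keep])
    return drops


def variance_ratio(X, clusters):
    """Hypothetical stand-in for tau: between-cluster over within-cluster scatter."""
    mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(clusters):
        Xc = X[clusters == c]
        between += len(Xc) * np.sum((Xc.mean(axis=0) - mean) ** 2)
        within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
    return between / within
```

Sorting the returned dictionary by value in descending order gives the ranking of clusters whose samples would be inspected first.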
We find that the separability scores vary significantly with the layers: for the "laptop" class, the clearly highest separability score appears at the input layer. Here, a cluster showing laptop lids has the largest separability contribution, showing the same laptop (albeit with different patterns printed on its lid), digitally rendered from the same angle in each sample in front of a white background. Thus, this cluster seems to describe a CH artifact. Results for the "mountain bike" class behave in a similar manner. Again, the highest separability score is found at the input layer, and, correspondingly, the cluster with the highest separability contribution there seems to contain a CH in the form of a distinctive gray border and a watermark.
In contrast to the first two examples, the largest mean τ value for the class "swimming trunks" occurs not at the input layer, but at intermediate layer 4 of the model instead. Again, the top contributing cluster consists of relatively similar samples, however, they are all perfectly valid examples of "swimming trunks", with no distinguishable artifact between them. The same seems to be the case for the second most contributing cluster. Interestingly, the third most separable cluster is extremely dissimilar to the first two, with every sample containing male upper bodies -a feature that, while often appearing alongside "swimming trunks" should not indicate this class in any way. In other words, a CH.
This last example demonstrates why it may be difficult to automate the process of CH identification: While a CH is in fact present in the class, it is not the top separating cluster, but has the third highest contribution (of 32 total clusters) to the τ score instead. More concisely, the most separable cluster is not necessarily a CH, and a high separability score does not guarantee the presence of a CH. Thus, SpRAy offers an indication of which clusters in which classes are CH candidates, but -in accordance with the property of CH artifacts of requiring expert domain knowledge to detect (Section 1.1) -human judgement is still required for a final decision. We further note that the CHs found in the first two examples are relatively simple features. They can, in fact, be expressed as an affine transformation in input space. Correspondingly, the highest separability score for these classes occurs in input space. In contrast, the third presented example, where the "upper body" CH was identified, is far more complex, but the highest τ score is also found at a deeper intermediate layer.
Thus, there seems to be a correlation between the complexity of an artifact, and the depth of the layer at which it separates best from the rest of the class.

Experiments -Concept Desensitization
In the previous section, we obtained cluster labels for (potential) CH artifacts, and are correspondingly able to estimate artifact models for CH candidates in ILSVRC2012 according to Section 2.5. The goal of this section is to verify the impact of these artifact candidates on our classification model and at the same time reduce their impact by using either A-ClArC or P-ClArC. We first verify A-ClArC empirically by introducing a controlled setup based on a variation of MNIST, where artifacts are introduced as colors. With this verification established, we proceed to attempt to unlearn CH artifact candidates using A-ClArC, given the artifact estimators modeled after Section 3, first in input space, then in feature space, and at the same time measure their respective impact on the classification model. We then proceed to verify P-ClArC empirically on a setup similar to the previous one, on a variation of colored MNIST. Subsequently, an extensive analysis using P-ClArC on ILSVRC2012 is presented, followed by an analysis on the ISIC 2019 dataset. Finally, we report results on the Adience dataset using P-ClArC, touching upon the issues of fairness and robustness in machine learning.

Unlearning Concepts with Augmentative Class Artifact Compensation
After identifying several CH artifacts in ILSVRC2012 in Sections 3.2 and 3.3, we aim to desensitize models to them in the following experiments, firstly by employing the proposed A-ClArC method. CH artifacts appear, by definition, alongside desired features of a class. Furthermore, each CH only natively occurs within one class and helps a model predict this class correctly. As such, if unlearning is successful, a decrease in the measured accuracy (as opposed to the true generalization accuracy) is to be expected, making it difficult to distinguish from simply confusing the network. Due to these unique properties of CH artifacts, our method for evaluating the experiments is two-fold: a quantitative evaluation of whether A-ClArC leads to a desensitization against a concept representation, combined with a qualitative assessment of whether this representation corresponds to the target concept and leads to an unlearning thereof.
Augmentative Class Artifact Compensation on Colored MNIST
As an empirical verification of the method, A-ClArC is applied to a simple convolutional feed-forward type network (cf. A.2) on the previously described MNIST dataset with color artifacts. Here, we train three variants of the model: (1) For the first model, of the 10 different classes, the samples of one class are colored with a probability of 20 percent during training. We call this the native model. (2) Another model is trained, but in addition to coloring the same single class as before, we also color samples of all other classes with a probability of 20 percent. We call this model a priori ClArC. (3) For the third model, we continue training from the learned native model, but, as for the a priori ClArC model, also color samples of all classes with a probability of 20 percent. We call this model a posteriori ClArC.
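The coloring that distinguishes the three training variants can be sketched with a single helper. The function name, the array layout and the way the color is applied (multiplying the white digit by an RGB triple) are assumptions for illustration.

```python
import numpy as np


def colorize(images, labels, color, classes, p=0.2, seed=0):
    """Turn grayscale images (N, H, W) in [0, 1] into RGB (N, H, W, 3). Samples whose
    label is in `classes` are tinted with `color` with probability p; all others stay white."""
    rng = np.random.default_rng(seed)
    rgb = np.repeat(images[..., None], 3, axis=-1).astype(float)
    colored = np.isin(labels, classes) & (rng.random(len(labels)) < p)
    rgb[colored] *= np.asarray(color, dtype=float)
    return rgb, colored
```

With this helper, the native model would be trained with `classes` set to the single CH class, the a priori and a posteriori ClArC variants with `classes` covering all ten digits, and a fully colored evaluation set corresponds to all ten classes with `p=1.0`.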
To evaluate the influence of the color-based CH, we introduce two test modes. The first test mode measures the performance of the models on the realistic dataset, where samples of the CH class are colored with a probability of 20 percent. The second test mode measures the performance of the models on a maximally poisoned dataset, where every sample is colored. By comparing these two performances, we get a measure of the error caused by the CH. Note that in this toy setting we can actually measure the performance of the model on the clean, CH-free dataset, which would normally not be available. For the native model, the performance on the realistic dataset is, as one would expect, marginally better (around 0.02 percent) than the performance on the clean dataset. However, relative to the performance on the fully poisoned dataset, these two quantities do not differ very much, and thus we compare the realistic setting to the fully poisoned setting.
Figure 12: Accuracy on a realistic test set (x-axis) vs. accuracy on a fully poisoned test set (y-axis) on colored MNIST. Red crosses describe the baseline, a native model which has seen a CH artifact during training for 20 percent of the samples of one class. Blue dots describe a fine-tuned version according to ClArC, the a posteriori ClArC, of the aforementioned models. The red and blue ellipses describe the confidence of the points. For visualization purposes, the ellipses are drawn with 40σ in x- and y-direction for a posteriori ClArC (poisoned) and 40σ in x-direction and 1.4σ in y-direction for the native models (baseline), where σ is the standard deviation of the accuracies in the respective direction.
Figure 12 shows the accuracy on the realistic test set on the x-axis and the accuracy on a maximally poisoned dataset (all samples colored) on the y-axis. The native models (baseline) are represented by red crosses, while the a posteriori ClArC models (poisoned) are represented by blue dots. All models achieve an accuracy of about 99 percent on the realistic test set. As one would expect, the native models perform considerably worse on the fully poisoned test set. Some models are only slightly impacted by the poisoning, which means they do not pay as much attention to the CH artifact (color). Other models, however, drop to as little as 10 percent accuracy, which means that these models predict the class based only on the CH artifact. Fine-tuning according to ClArC, as done for the a posteriori ClArC model, results in all models performing nearly as well on the fully poisoned test set as on the realistic dataset. The models have thus successfully been fine-tuned to ignore the CH artifact. We have therefore empirically shown both the effect of CH artifacts on the model and the effectiveness of ClArC.

Augmentative Class Artifact Compensation on ImageNet
We conduct an experiment on ILSVRC2012 with a setup similar to the one used on Colored MNIST. Due to the size of the dataset, we only use the previously described native model, which is the model trained on the original training set with all natural artifacts included, and the a posteriori A-ClArC model, which is a fine-tuned version of the native model. Additionally, we introduce a baseline model, which is fine-tuned with the same hyperparameters as the a posteriori A-ClArC model but on the unmodified training set. Furthermore, we restrict training to a subset of 100 classes of the original ILSVRC2012. One a posteriori A-ClArC model and one baseline model are trained for each artifact candidate model we have identified in Section 3. We fine-tune the original model for a total of 10 epochs and report, in Figure 13, the model accuracies under the two previously introduced test modes, using the original validation set (0% poisoned) as well as the original validation set with the artifact introduced into all samples (100% poisoned). The model performances for these two test modes are also compared in scatter plots in Figure 14. As an additional way to evaluate whether the importance of the artifact for each model's prediction was reduced, we visualize in Figure 13 the difference between the attribution of the original model and that of either the a posteriori A-ClArC or the baseline model in each epoch. Judging by Figure 13, the performance of both the A-ClArC and the baseline model does not seem to change considerably for the artifacts "stole" and "garbage truck". This can be seen from the very similar confidence ellipses in Figure 14 over all epochs and, for class "garbage truck", under an additional poisoning of the training data at 50%. Class "stole mannequin" in Figure 13, which corresponds to "stole mannequin head" in Figure 14, shows however a slight improvement in the poisoned validation mode in the latter.
Class "carton Hanzi" in Figure 13, which corresponds to "carton Hanzi" in Figure 14, shows a clear improvement over the baseline model on the poisoned validation set. We can see a strong collapse in the performance on the poisoned dataset for "jigsaw puzzle" for the baseline model, likely caused by the fact that the artifact is a very clear indicator for the class. The A-ClArC model, however, returns to about 50% accuracy on the poisoned validation set. For none of the artifacts in Figure 13 do we see the A-ClArC model perform worse than the baseline model. By investigating the heatmap differences of the A-ClArC and baseline models to the original model, we can see that the A-ClArC model consistently decreases the amount of relevance assigned to the artifact location in the input image. For "garbage truck", there is a watermark in the bottom left corner of the image. The difference between the A-ClArC attribution and the original attribution shows strong positive values at the watermark location, indicated by the red color. The baseline attribution difference to the original model seems to decrease and increase the relevance of pixels more generally, focused on edges in the image. Even though the baseline model also seems to weakly reduce the relevance on the watermark in the second epoch, this is not as targeted and consistent as for the A-ClArC model. A similar behavior can be seen for the class "stole mannequin", where the A-ClArC model consistently reduces the relevance on the mannequin, while the baseline partly even reduces the relevance on the stole itself. For the "carton Hanzi" artifact, which here corresponds to the Hanzi in the center of the image, the A-ClArC model also reduces the relevance on the characters, mostly concentrated in the higher-contrast area on the right-hand side of the image. The baseline model even increases the relevance on the location of the watermark and decreases the relevance on the actual cartons compared to the original model.
While somewhat harder to see, for "jigsaw puzzle" the A-ClArC model seems to reduce the relevance of the jigsaw pattern away from the object of interest more than the baseline model does.
Figure 14 gives similar insights for "carton chinese watermark" and "jigsaw puzzle cutting pattern", where all A-ClArC models (poisoned) perform comparably to the baseline on the original dataset, but outperform it significantly on the poisoned validation set. With "stole mannequin head", A-ClArC outperforms the baseline slightly. "carton alibaba watermark" seems to affect both models only weakly, with no visible improvement for A-ClArC. The alibaba watermark is found not only in class "carton", but in many classes of ILSVRC2012, and is a rather small artifact in the bottom right of the image that is possibly cropped most of the time during training, which is why it may not be a very strong artifact for class "carton" alone. "stole rounded edges" is also a very small artifact, located at the corner of only a few samples in class "stole". Presumably for this reason, we do not see either model particularly impacted by poisoning the validation set. The "garbage truck" result is somewhat surprising: both models seem to be only slightly affected by the poisoned dataset, with at best a very slight improvement of the A-ClArC model over the baseline model. We may therefore conclude that A-ClArC in input space does seem to work for artifacts that are very significant in input space, but may not show any significant effect otherwise.
Figure 14: Original (x-axis) vs. poisoned (y-axis) validation set accuracy of the baseline models (red) and A-ClArC models (blue) over training poison rate (bright 20%, dark 50%). All points are below the line of equal accuracy on poisoned and original data, which means the models consistently perform better on the clean dataset. For "carton hanzi" and "jigsaw puzzle", the unlearned models perform significantly better on the poisoned validation set than the baseline. This can also be seen, less significantly, for "stole mannequin". In all cases, the accuracy of the unlearned model does not visibly decline compared to the baseline model. In the case of "jigsaw puzzle", the artifact is for many samples the only class-defining feature, and poisoning therefore extremely confuses the model on the poisoned validation set.

Augmentative Class Artifact Compensation in Feature Space
Analogously to the previous section, we may instead choose to fine-tune with A-ClArC using the artifact representations that we found in Section 3.3 in the feature space of a layer of the neural network. There, we noted that these intermediate representations of artifacts differ significantly in how well they can be separated via SpRAy. For each artifact, a different layer depth seems to allow for optimal separability, and this depth seems to correlate with the complexity of the respective artifact. Building on those observations, we conduct another experiment, similar to the one in Augmentative Class Artifact Compensation on ImageNet, applying A-ClArC using feature space representations of the target CH concept. In contrast to the input space variant of A-ClArC, the target CH concept is represented here via CAVs.
We again compare an unlearned model (corresponding to the a posteriori A-ClArC model introduced previously), which is fine-tuned with A-ClArC for 10 epochs on a subset of ILSVRC2012 consisting of 100 classes, to a baseline model that is fine-tuned in the same manner, but without A-ClArC. Both models are initialized from the native model, which is the same VGG-16 as in Augmentative Class Artifact Compensation on ImageNet. For A-ClArC, the target CH artifact (described by a CAV, i.e., a direction in feature space) is added during fine-tuning to the activations at the respective layer $l$, with a probability $p$ of 50% and a contribution of 50%. The contribution denotes the ratio in which the original activations and the added CAV are mixed. This method of introducing the CAV to the activations corresponds to $a_i = 0.5 \;\forall i \in \{1, 2, \dots, d\}$ (w.r.t. Equation 23), and is used multiple times in Section 4.1 (for concept desensitization and evaluation) and Section 4.2 (only for evaluation). As described in Section 2.5, the parameters of layers $\{1, \dots, l\}$ are not altered during training, to keep the feature representation of the target concept static. Again, we employ the two test modes described previously, reporting accuracies on the original (0% poisoned) and a poisoned validation set (100% poisoned). The poisoning process, however, is executed in feature space for this experiment, using the computed CAV to poison the activations at layer $l$ instead of introducing the artifact in input space. This experiment is repeated for feature extractor layers $l \in \{0, 4, 10\}$, with layer 0 denoting the input layer.
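To make the feature-space augmentation concrete, the following NumPy sketch mixes a CAV into flattened layer activations. It assumes one particular reading of the "contribution", namely a convex combination of the original activations and the CAV; Equation 23 is not reproduced in this section, so this reading, the function name `add_cav`, and the random stand-in data are assumptions rather than the authors' definition.

```python
import numpy as np

def add_cav(activations, cav, contribution=0.5, p=0.5, rng=None):
    """Mix a concept direction into the activations of a random subset of samples.

    activations: (N, d) flattened activations at layer l
    cav:         (d,)   concept activation vector
    contribution: assumed mixing ratio between original activations and CAV
    p:            probability with which a sample is augmented
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = rng.random(len(activations)) < p
    mixed = (1.0 - contribution) * activations + contribution * cav
    out = np.where(selected[:, None], mixed, activations)
    return out, selected

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16))   # stand-in activations
cav = rng.normal(size=16)         # stand-in CAV

# fine-tuning: augment each sample with probability 0.5
train_acts, selected = add_cav(acts, cav, contribution=0.5, p=0.5, rng=rng)
# evaluation on the 100% poisoned validation set: augment every sample
test_acts, _ = add_cav(acts, cav, contribution=0.5, p=1.0, rng=rng)
```

In an actual network, such a function would be applied to the output of layer l before it is passed on to the remaining, trainable layers.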
The results of this experiment are summarized in Figure 15. The four CHs shown there ("pattern", "border", "colored pattern", "mannequin head") are chosen to range from relatively simple to quite complex concepts.
At the bottom right of each panel of this figure, the validation accuracy after the final epoch is visualized for the three investigated layers. Results for both previously discussed test modes are shown. On the left, i.e., for the 0% poisoning setting, we note that the unlearned model performs as well as the baseline. The application of A-ClArC thus did not negatively affect the model's accuracy on unpoisoned data, indicating that it does not confuse a model unnecessarily or introduce any unfair biases. On the right, the 100% poisoning setting is reported. Here, the unlearned model vastly outperforms the baseline model for every single CH example. However, its accuracy varies with the layer at which A-ClArC is applied, as could be expected based on the results from Section 3.3, where we found the separability score τ of samples containing a CH artifact from clean samples to be quite dependent on the layer where SpRAy is applied. Moreover, the layer where performance is best seems to correlate with the perceived complexity of the CH. For example, for the "pattern" artifact in class "jigsaw puzzle", which can easily be represented via an affine transformation in input space and is thus a relatively simple CH, the unlearned model performs best at the input layer. In contrast, for the far more complex "mannequin head" from the "stole" class, the highest accuracy is retained for intermediate layer 10. In fact, for all CHs investigated more closely, the performance of the unlearned model at the "optimal" layer in the poisoned setting is almost on par with the performance of both the unlearned and baseline models in the unpoisoned setting, demonstrating a significant gain in invariance against the concept described by the CAV after applying A-ClArC.
However, since we employ the computed CAV to poison the validation data, the above evaluation only shows that invariance against the target concept is gained if the CAV represents that concept correctly. Thus, to ascertain whether the target concept is in fact unlearned via A-ClArC, we interpret the attribution maps in Figure 15. For class "jigsaw", at the input layer, we note that the model employing A-ClArC successfully reduces the relevance of exactly the target artifact, consisting mainly of three distinct puzzle piece shapes in the upper right as well as the bottom left of the image, while gaining relevance on the panda's head. The baseline model, in contrast, slightly reduces the relevance of the upper-right puzzle piece (although not as significantly as the unlearned model does), but barely has an effect on the lower-left puzzle pieces.
Figure 15 (caption fragment): [...] initial starting model before training) that used A-ClArC are shown over training for the respective best performing layer, i.e., the layer where the accuracy of the unlearned model in a poisoned setting is largest. Here, the red areas of the attribution maps are used less than by the original model for prediction; the blue areas more.
For the "border" artifact in the "mountain bike" class, we find a similar behavior, with the unlearned model precisely reducing the relevance of the "border", while simultaneously putting more emphasis on the desired features of the mountain bike and its driver. However, here we additionally observe another interesting effect: The unlearned model is relatively stable in terms of which features receive more or less relevance over the course of fine-tuning, instead only varying in intensity, not locality, pointing to a goal-oriented behavior. The same is not true for the baseline model, on the other hand, which seems to vary w.r.t. both.
While this observation with regards to training stability is also confirmed for the "ocarina" class, neither the A-ClArC model nor the baseline manage to correctly decrease relevance for the full "colored pattern" artifact. It seems that the computed CAV representation for that artifact may not sufficiently capture the artifact direction in this instance. This could be either due to the high variability in terms of how this artifact appears for different samples, or because of the examples offered for computing the CAV vector not describing the target CH precisely enough.
For the "mannequin head" concept, however, the correct concept seems to be unlearned by the A-ClArC model, and with high stability. At first glance, the baseline model seems to reduce the relevance of features similar to those affected by the unlearned model. On closer inspection, however, we find that the A-ClArC model reduces the relevance of the "mannequin head" with higher precision and more completely, and simultaneously loses less relevance on actually desirable features, i.e., the lower part of the blue stole.
Although there are cases where unlearning in feature space via A-ClArC is not successful (for instance, due to the computed CAV not representing the correct concept), it generally performs extremely well, gaining significant invariance against a target concept. Moreover, the method performs in an extremely stable manner, showing improvements over a baseline model both quantitatively and qualitatively. We were further able to confirm our findings from Section 3.3, demonstrating a connection between artifact complexity and the layer at which the artifact can be unlearned with the best results.
However, the application of A-ClArC still requires tedious and time-consuming fine-tuning. In contrast, the second proposed method for concept removal, P-ClArC, is far more efficient in that respect. Keep in mind, though, that in contrast to A-ClArC, P-ClArC does not perform true unlearning, since it does not give the network an opportunity to adapt its weights and instead merely suppresses artifacts. Due to its promising properties with regard to efficiency, the following experiments are dedicated to evaluating the P-ClArC method, and to whether it can keep up with A-ClArC in performance.

Unlearning Concepts with Projective Class Artifact Compensation
After identifying several CH-type artifacts used by models trained on the ILSVRC2012 dataset (see Sections 3.2 and 3.3), we have successfully demonstrated the removal of their influence on the model in the previous paragraphs, using A-ClArC. However, as A-ClArC requires the model to be fine-tuned, it is not very efficient and might even become tedious in an iterative artifact identification and removal process. The P-ClArC method proposed in Section 2.5, on the other hand, does not require any further training after the modeling of the artifact, but conversely does not allow the model to adapt its weights and strictly unlearn, as A-ClArC does. Instead, it acts as a filter and removes a concept's contribution to the output. Whether the concept suppression of P-ClArC is successful and comparable to A-ClArC is evaluated experimentally in the following paragraphs.
First, we measure the performance of P-ClArC in a toy setting on Colored MNIST, before proceeding to the more complex ILSVRC2012 domain. Finally, we touch upon the subjects of fairness and reliability in machine learning by showing that P-ClArC is able to increase the robustness of predictions on biased real-world datasets, namely the ISIC 2019 dataset in a skin lesion classification setting, and the Adience face classification dataset with a DNN trained to predict biological gender.
Projective Class Artifact Compensation on Colored MNIST
To assess the validity of the proposed P-ClArC method, we first apply it in a toy setting with relatively simple (CH-type) concepts in the dataset. More precisely, as described earlier in Section 3.2, we add color-based CH artifacts to the MNIST dataset [29,30] by distinctly changing the tint of 20% of the samples per class. While simple, the resulting concept is complex enough not to have a pixel-wise localizable representation in input space. We train a simple convolutional network as described in A.
For one color concept and intermediate layer l at a time, we "unlearn" the target concept without re-training by using P-ClArC, and evaluate the success of this unlearning procedure using an altered (or poisoned) test set, in which the target concept color is applied to a certain percentage of samples from the (whole) test set. On this poisoned test data, we then compare the performance of the original model to that of the model desensitized to the color concept via P-ClArC, as shown in Figure 16 (top). The accuracy (y-axis) of the original model (blue) and the corrected model (orange) is compared for the poison rates 0% (uncolored MNIST), 50%, and 100% (left to right), averaged over all ten classes. This comparison is visualized for the input layer and the first convolutional layer of the feature extractor (x-axis).
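The evaluation protocol above (accuracy as a function of the poison rate) can be sketched as a small harness. Everything in this example is a toy assumption: the two-feature data, the hand-written `original` and `corrected` predictors, and the `poison` function merely stand in for the trained network, its P-ClArC-corrected counterpart, and the color artifact.

```python
import numpy as np

def accuracy_at_poison_rate(predict, x, y, poison, rate, rng):
    """Accuracy of `predict` when a fraction `rate` of the test samples is poisoned."""
    x = x.copy()
    n_poisoned = int(round(rate * len(x)))
    idx = rng.choice(len(x), size=n_poisoned, replace=False)
    x[idx] = poison(x[idx])
    return float(np.mean(predict(x) == y))

rng = np.random.default_rng(0)
# toy data: feature 0 is the genuine signal, feature 1 is the (absent) artifact
x = np.stack([rng.normal(size=1000), np.zeros(1000)], axis=1)
y = (x[:, 0] > 0).astype(int)

def poison(batch):
    batch = batch.copy()
    batch[:, 1] = 1.0               # switch the artifact on
    return batch

def original(batch):                # relies heavily on the artifact feature
    return (batch[:, 0] + 3.0 * batch[:, 1] > 0).astype(int)

def corrected(batch):               # ignores the artifact feature
    return (batch[:, 0] > 0).astype(int)

results = {
    rate: (accuracy_at_poison_rate(original, x, y, poison, rate, rng),
           accuracy_at_poison_rate(corrected, x, y, poison, rate, rng))
    for rate in (0.0, 0.5, 1.0)
}
```

In this toy setting, the predictor that depends on the artifact degrades as the poison rate grows, while the predictor that ignores it does not, which is the pattern the comparison in Figure 16 is designed to reveal.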
With increasing dataset poisoning, the model to which P-ClArC is applied outperforms the baseline model. However, the accuracy of both models decreases slightly on average with higher poison rates, showing that while P-ClArC makes the model more robust against the CH artifact specifically, the model is not completely unaffected otherwise. Note, however, that the CAV is computed only from samples within one class, due to the class-specific properties of CH artifacts. This evaluation may therefore suffer from generalization issues of that CAV when it is applied to other classes, which would explain the high variance of the performance after applying P-ClArC. As a sanity check, we further perform the same evaluation using randomly generated CAVs, as shown in Figure 16 (bottom). As expected, the model to which P-ClArC was applied does not outperform the baseline model in this instance, and instead achieves a considerably lower accuracy due to the arbitrary and not data-specific projection of the features. In combination with Figure 16 (top), this shows not only that the computed CAVs describe the targeted color concept in a meaningful manner, but also that the proposed P-ClArC method is able to exploit the CAV representation successfully in order to make a model more robust against the target concept.
Figure 16 (caption fragment): [...] For the evaluation, the CH artifact is added in the input space. The accuracy of a baseline model (blue) and a corrected model employing P-ClArC (orange) on these datasets is compared for CAVs obtained after the input layer (0th layer) and the first convolutional layer (1st layer). Measurements are taken from the separate unlearning of all CH artifacts in the Colored MNIST dataset. In the top row, the corrected model uses meaningful CAVs that are computed from two distinct sets of data samples, as described in Section 2.5. In contrast, a random vector is used for the bottom row. While the meaningful CAV leads to an improvement of the corrected model over the baseline for the poisoned datasets, the random vector has an extremely detrimental effect on model performance in every case.
The above method of evaluation, however, again requires the addition of concepts in input space (since the colors are introduced to the test samples in input space) and may thus not be suitable for arbitrary (especially more complex) concepts. In particular, "naturally occurring" artifacts known (and in this paper discovered) to appear in various popular datasets, e.g., CH artifacts like the mannequin head in ILSVRC2012 [9], colored band-aids in ISIC 2019 [32][33][34], or shirt collars in Adience [31], often do not have a singular, pixel-wise representation in input space, and, as such, the performance of P-ClArC on these artifacts would be difficult to assess using the above method of poisoning data in input space. Thus, we propose the following alternative: as previously established, CAVs offer a representation of a concept in feature space. Instead of altering test samples in input space, we can thus poison the test data by adding the CAV corresponding to a target concept to the latent activations of a certain percentage of samples at layer l during inference, and again compare the predictions of the model before and after applying P-ClArC. As such, this evaluation is not restricted to the input space. Its validity is, however, dependent on whether the obtained CAV actually denotes the correct concept. Therefore, we also aim to validate whether the CAV correctly describes the targeted concept. To this end, we discard all network layers after layer l, and model the network output with the CAV classifier receiving its inputs from layer l. We thus obtain a network that classifies whether a given input sample contains the concept described by the CAV or not. In the following, this network is called the CAV-predictor. After applying LRP to this CAV classification network, the resulting relevance maps can be evaluated in terms of whether they correspond to the expected target concept.
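The idea of a CAV-predictor, i.e., a linear concept classifier placed on top of the activations at layer l, can be sketched as follows. The sketch uses plain logistic regression fitted by gradient descent as a stand-in for whichever linear model Section 2.5 prescribes; the function names, hyperparameters, and synthetic activations are all assumptions for illustration.

```python
import numpy as np

def fit_cav(acts_with, acts_without, lr=0.1, steps=1000):
    """Fit a logistic-regression concept classifier on layer activations.

    Returns (w, b); w is normal to the separating hyperplane and plays the
    role of the CAV direction in this sketch.
    """
    x = np.vstack([acts_with, acts_without])
    t = np.concatenate([np.ones(len(acts_with)), np.zeros(len(acts_without))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = prob - t                      # gradient of the log-loss w.r.t. the score
        w -= lr * (x.T @ grad) / len(x)
        b -= lr * grad.mean()
    return w, b

def cav_predictor(acts, w, b):
    """True where the concept is predicted to be present."""
    return acts @ w + b > 0

rng = np.random.default_rng(0)
direction = np.zeros(8)
direction[0] = 1.0                                           # synthetic concept direction
acts_without = rng.normal(size=(200, 8))                     # activations of clean samples
acts_with = rng.normal(size=(200, 8)) + 4.0 * direction      # activations with the concept

w, b = fit_cav(acts_with, acts_without)
accuracy = 0.5 * (cav_predictor(acts_with, w, b).mean()
                  + (~cav_predictor(acts_without, w, b)).mean())
cosine = float(w @ direction / np.linalg.norm(w))
```

Because the predictor is a single linear layer on top of layers 1 to l, an attribution method such as LRP can be propagated through it just like through the original classification head.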
With the second proposed method of evaluation, shown in Figure 17 (I), both models lose accuracy, especially for higher rates of dataset poisoning and in comparison to Figure 16. However, the P-ClArC-corrected model significantly outperforms the baseline on the poisoned validation dataset, and more so when the artifact has been modeled on latent feature representations. Furthermore, the CAVs seem to describe their respective color artifact with high precision: in Figure 17 (II), the distribution of LRP relevances for the CAV-predictor is visualized across the three color channels for classes "0" (left) and "5" (right), with the colors blue and orange, respectively. Higher relevance is mostly attributed to the color channels that describe the target color concept. For the "blue" artifact, for example, high relevance is attributed equally to the green and red channels, and less to the blue channel: due to the additive RGB color system, red and green are the channels altered when introducing a blue artifact. In addition, Figure 17 (II) shows the absolute amount of relevance attributed to each color channel, confirming that the CAV indeed describes the target CH. However, as indicated in both parts of Figure 17, the disentanglement of benign and artifactual features seems to work better for layer 0 than for layer 1, implying that the CAV encodes the coloring more precisely there. Apparently, the coloring is in fact a relatively simple CH (i.e., static w.r.t. its embedding into the input dimensions) that is still most accurately represented in input space.
Figure 17 (caption fragment): [...] Validation of the concept that is described by the computed CAV. The distribution of CAV-predictor (LRP) relevance over color channels is shown for the classes 0 (left) and 5 (right), with the introduced CH concepts blue and orange. The bar plots above show the sum of (unsigned) relevances across color channels. The CAV-predictor assigns most relevance to the color channels that differentiate the poisoned samples from the original samples (e.g., red and green for the blue artifact).
Projective Class Artifact Compensation on ImageNet
With the above toy example showing promising results, we further apply and evaluate P-ClArC in the more complex setting of ILSVRC2012, where various CH-type artifacts were identified using SpRAy, as described in Sections 3.2 and 3.3.
Analogously to the corresponding experiments with A-ClArC in Section 4.1, we use the VGG-16 model with the pretrained weights obtained from the PyTorch model zoo. P-ClArC is performed at layers 0, 4, and 10 of the model's convolutional feature extractor in separate experiments. We evaluate on a subset of 100 (randomly chosen) ILSVRC2012 classes that includes the class where a CH occurs in the data (called "target class" in the following). Again, we compare a corrected model that employs P-ClArC to a baseline model that does not. For this purpose, we use an unpoisoned and a poisoned validation dataset, with the latter being augmented by adding the CAV that encodes the target CH to the activations of all samples at the respective layer (100% poisoning). To assess how well P-ClArC suppresses the target CH concept, we again employ the previously established two-fold evaluation method that does not rely on the introduction of artifacts in input space, combining a quantitative comparison between the two models' outputs with a qualitative analysis of the difference in attributed relevances. The CH artifacts that are inspected more closely were identified using SpRAy and range from simple artifacts with static placement in pixel space (e.g., laptop: "lid") to relatively complex and non-static concepts (e.g., swimming trunks: "upper body"). Since dataset poisoning for the purpose of evaluation is achieved by adding the computed CAV to activations at the respective intermediate layer, it is not sufficient to show that P-ClArC successfully counteracts this poisoning, since the same CAV is used in its projection step. Rather, we first need to establish that the CAV actually encodes a meaningful feature of the target class. Furthermore, to be valid, P-ClArC should be concept-specific, and thus ideally not have any effect on the network's inference for samples that do not contain the target artifact.
The results of a quantitative analysis of these three properties are shown in Figure 18 for the classes "laptop" and "stole", with the CHs "lid" and "mannequin head", as examples of a relatively simple and a more complex CH, respectively.
Note that in this figure, class-wise (normalized) logits are visualized as opposed to the final softmax probabilities, since slight changes may not be easily registered in the latter, due to the high number (1000) of classes in the ILSVRC2012 dataset and thus in the model's output. However, as the model is originally trained to optimize softmax probabilities, it is sufficient to compare only the relative relationship between class outputs, due to the shift invariance of the softmax function. As shown on the left side of Figure 18, P-ClArC preserves a model's performance if applied to unpoisoned data. Note that since the 100 validation classes also contain the target class, a very slight change in performance can be found for, e.g., class "laptop" at layer 0 or class "stole". However, the mean logit values never vary by more than 0.03 between the baseline and the corrected model, with the ratio of true label logits to target label logits barely changing.
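The shift invariance of the softmax function invoked above can be checked numerically with a few lines of NumPy. The logit values and the shift constant are arbitrary illustrative numbers.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
shifted = logits + 7.3            # add the same constant to every logit

probs = softmax(logits)
probs_shifted = softmax(shifted)
same = bool(np.allclose(probs, probs_shifted))
```

Adding a constant to all logits leaves the probabilities unchanged, so only the differences between class logits carry information, which is why comparing the relative relationship between class outputs suffices.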
Since the projection step of P-ClArC relies on the computed CAV precisely representing the targeted CH concept, we next assess whether the CAV is meaningful w.r.t. the target class, i.e., whether it describes a feature specific to the target class. We do this by observing how the mean logit values of the true and target class labels change when the model's inference process is poisoned by adding the computed CAV to the activations of the respective layer, thereby shifting these activations in the CH direction as described by the CAV. The corresponding results are shown in the top right of each panel of Figure 18. Here, we note a significant decrease in the mean logit values of the true class labels when poisoning the validation data. At the same time, the mean logit values of the target labels mostly increase. For class "stole" at layer 4, for example, the true label logit mean value diminishes from 0.91 to 0.48 due to poisoning, while that of the target label increases from 0.11 to 0.53. An exception is the class "laptop" at layer 0, where values decrease for both (sets of) classes. However, the ratio between true label and target label logits always changes in favor of the target label with poisoning. Since adding the computed CAV to the activations at the respective layer increases the model's confidence in the target class relative to the true class, we deduce that the CAV encodes a feature that is specific to the target class. Note that this does not necessarily imply that the CAV describes the exact target CH concept, a question that we investigate further with Figure 19 a few paragraphs below.
In the poisoned setting, the baseline model consistently assigns a larger logit value to the target class than to the true class, in contrast to the unpoisoned setting. Observed exceptions to this rule are layer 0 for class "laptop" and layer 10 for class "stole" (cf. the bottom right parts of the panels in Figure 18). This can be explained by the relative complexity of the respective artifacts and their (attempted) point of encoding in the network. Artifacts best expressed statically in pixel space (here, the laptop's lid) are more readily encoded by a CAV trained in pixel space, compared to later layers, where the model has developed invariances against pixel-specific encodings. Conversely, more complex and semantic concepts such as the mannequin head, which as a feature appears in multiple locations and poses over the dataset, are more readily encoded in invariant latent representations later in the model.
When employing P-ClArC, the model manages to correct this skewed distribution successfully, and assigns larger logit values to the true class than to the target class. E.g., for layer 4 of class "stole", the baseline model infers a normalized logit mean value of 0.48 for the true class, but 0.53 for the target class. The corrected model, however, shifts this distribution in favor of the true class by projecting the activations beyond the CAV-predictor's hyperplane, assigning a normalized logit mean value of 0.87 to the true class and 0.11 to the target class. Note that in this example, the previous unpoisoned mean logit values (top right of each panel, blue) are almost perfectly restored. P-ClArC thereby successfully counteracts the introduced poisoning. P-ClArC projects samples beyond the hyperplane separating samples within a class that contain a target CH concept from samples that do not. Thus, its performance is entirely dependent on how well the learned CAV, i.e., the vector orthogonal to that hyperplane, describes the target CH. The quantitative analysis of Figure 18, however, is only able to show that the CAV describes some feature specific to the target class, whose influence on the model's inference can be increased by adding it to the activations in feature space (and subsequently decreased again by applying P-ClArC). Up to this point, however, we have not yet shown that the computed CAV describes exactly the target CH feature, or that P-ClArC is able to successfully remove a CH concept that is not added artificially, but naturally occurs in the data.
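To make the projection step more concrete, the following sketch shows one plausible form of it: the component of an activation along the unit-normalized CAV is replaced by the corresponding component of a reference point on the artifact-free side of the hyperplane. Whether this matches the exact formulation used in this paper is an assumption; `project_out`, the choice of the artifact-free mean as reference, and all values are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def project_out(
    activation: Sequence[float],
    cav: Sequence[float],
    reference: Sequence[float],
) -> Vector:
    """Replace the CAV component of `activation` with that of `reference`.

    All components orthogonal to the CAV are left untouched.
    """
    v = unit(cav)
    shift = dot(reference, v) - dot(activation, v)
    return [a + shift * vi for a, vi in zip(activation, v)]


clean_mean = [1.0, 0.0]  # assumed mean activation of artifact-free samples
cav = [0.0, 2.0]         # toy CAV; only its direction matters
poisoned = [1.2, 3.0]    # activation with a strong component along the CAV

corrected = project_out(poisoned, cav, clean_mean)
```

Because the operation only edits activations and leaves all weights unchanged, it can be applied at inference time without any retraining.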
For these two purposes, relevance maps computed via LRP are shown in Figure 19, with each panel dedicated to one specific CH concept. These CHs range (in the order of left to right, and top to bottom) from simple artifacts that are present in roughly the same pixels of each affected sample (e.g., laptop – "lid") to far more complex features (e.g., swimming trunks – "upper body") that present differently in each sample and can thus not be described in input space in a uniform manner. For three example images of each CH, and for the model's intermediate layers 4 and 10, (1) relevance maps of the CAV-predictor (left) and (2) the difference in relevance attribution between the baseline model and the corrected model employing P-ClArC are shown.
The former (1) visualizes which features are important to the decision hyperplane of the linear CAV-predictor in classifying whether a sample contains a CH or not, with red highlighting positive relevance and blue negative relevance, and thereby offers an estimation of how well the computed CAV encodes the correct concept. The latter (2), on the other hand, shows how the importance of features for a certain decision changes when employing P-ClArC, with features that are more relevant to the model after applying P-ClArC in blue, and features whose relevance is reduced in red, and thus indicates how successfully the target concept's influence on the model's prediction is removed by P-ClArC.
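The relevance difference maps of type (2) can be illustrated with a small sketch. It assumes that the difference is taken as baseline minus corrected relevance, so that positive values (red) mark reduced relevance and negative values (blue) mark increased relevance, and that the result is scaled by its largest absolute value. The function name and the toy maps are hypothetical.

```python
from typing import List

Heatmap = List[List[float]]


def relevance_difference(baseline: Heatmap, corrected: Heatmap) -> Heatmap:
    """Baseline minus corrected relevance, scaled to [-1, 1].

    Positive entries: relevance was reduced by the correction (shown in red).
    Negative entries: relevance was increased by the correction (shown in blue).
    """
    diff = [[b - c for b, c in zip(rb, rc)] for rb, rc in zip(baseline, corrected)]
    peak = max(abs(x) for row in diff for x in row)
    if peak == 0.0:
        return diff  # identical maps, nothing to scale
    return [[x / peak for x in row] for row in diff]


# Toy 2x2 relevance maps: the top-left pixel plays the role of the artifact.
baseline = [[0.8, 0.1], [0.0, 0.1]]
corrected = [[0.2, 0.1], [0.0, 0.4]]

diff = relevance_difference(baseline, corrected)
```

Since the scaling is relative to the single largest change, one dominant region can visually suppress smaller changes elsewhere in the map.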
For all CH examples, the CAV seems to correctly encode the targeted artifact, as the correct features are used to identify samples as containing the CH. However, there are notable variations in the precision of the CAV-predictor on the correct features between CHs and, for the same CH, between layers. E.g., for layer 4 of the laptop – "lid" artifact, the outline of a laptop backside, digitally rendered from a specific angle by the image creator, is clearly visible. However, for layer 10, only the corners of the same outline seem to be relevant, indicating that the artifact is encoded by the CAV less precisely and completely. A similar trend can be observed for other relatively simple CHs, i.e., mountain bike – "border". For stole – "rounded corners", the CAV seems to be on point for both layers. In contrast, the CAV-predictor for the more complex stole – "mannequin head" seems to focus on the mannequin head artifact as well as the correct class features of the stole itself in layer 4, whereas in layer 10, it seems to single out the mannequin head artifact almost exclusively. A similar effect occurs with swimming trunks – "upper body", where the layer 4 heatmaps are relatively diffuse, while for layer 10 only the human upper bodies are assigned a large positive relevance.
In a similar manner, the relevance difference heatmaps show that the artifact is less impactful on the model's decision-making after applying P-ClArC. Again, the success of this artifact removal varies with the specific CH and layer, and this variation seems to behave in the same way as for the CAV-predictor heatmaps, as described above, although some differences exist. E.g., for layer 4 of the laptop – "lid" artifact, where the CAV-predictor seems to be most precisely learned, the corrected model assigns far less relevance to the outline of the laptop's lid. In contrast, at layer 10, parts of the image imprinted on the lid are suddenly removed as well. Mountain bike – "border" behaves in a similar manner. For stole – "rounded corners", however, while both CAV-predictor and relevance difference heatmaps somewhat coincide at layer 4, with the corners being removed correctly, at layer 10 mainly the blue stole itself receives less relevance, and relevance on the rounded corners actually increases. This makes sense, as the rounded corners are an extremely simple artifact and are thus removed more successfully in earlier layers, in accordance with our previous findings. However, it also shows that a CAV which seems to describe the artifact correctly does not guarantee an unlearning result that corresponds to it exactly. Note, however, that since these heatmaps are normalized w.r.t. the largest absolute relevance value, the rounded corners may only be assigned an extremely large relevance, and the some smaller relevance value. In fact, this example further showcases another interesting problem: The samples of the class stole that contain the "rounded corners" artifact also always contain the same person and the same blue stole. The CH is thus ill-defined here, since the "rounded corners" cannot be described by only using example images, making CAVs apparently not the ideal choice of representation for this specific artifact.
Since the resulting CAV would encode both, in a way, this is thus both a simple and a complex CH, with P-ClArC removing the simple part ("rounded corners") at the earlier layer, and the more complex "blue stole" at the later layer.
Matching these interpretations, the complex "mannequin head", "colored pattern", and "upper body" artifacts are removed far more successfully at layer 10. Note especially the class "swimming trunks", where not only the relevance of the upper body decreases, but also relevance on the swimming trunks themselves is increased. The same effect is also visible for the "mannequin head" artifact.
To summarize, there seems to be an intermediate layer, differing for each respective artifact, where the computed CAVs not only encode the correct and intended CH concept, but where the CH correction is also more precise, leading not only to a lessened impact of the targeted artifact on the model's prediction, but often also to an increased relevance of the correct non-CH class features. In fact, this layer largely coincides with the complexity of the targeted artifact, confirming expectations and our findings from Section 3.3. We observe, however, that in comparison to Section 3.3, the best performing layers are shifted towards later layers of the network, e.g., for the "lid" CH, this optimum seems to be at layer 4 instead of layer 0 when applying P-ClArC, possibly due to exploitation of the model's feature space representation at later layers being more invariant.
Figure 19: Effects of P-ClArC on ILSVRC2012. In every panel, P-ClArC was applied after layers 4 and 10. For each of these, the LRP relevances of the CAV-predictor and the relevance difference between the baseline and the corrected model are visualized. In the relevance difference images, the corrected model focuses less on the areas highlighted in red compared to the baseline model, but more on the parts highlighted in blue. While the first three CHs ("lid", "border", "rounded corners") occupy the same pixels between samples, the last three CHs ("mannequin head", "colored pattern", "upper body") consist of more complex features. In line with this complexity, P-ClArC seems to perform better on earlier layers of the feature extractor for the first group, with the heatmaps corresponding more to the target concept, as indicated by the green border. The opposite seems to be the case for the second group, concurring with the separability scores in Figure 11.
Further taking the results of Figure 18 into account, where we showed how P-ClArC not only counteracts poisoning and shifts the prediction towards the true class, but also does not affect performance on unpoisoned data in a significant manner, we thus surmise that P-ClArC is an efficient but powerful tool for concept removal. Note, however, that P-ClArC will not lead to an increased generalization performance, since the model never has a chance to adapt its weights for learning other features and thus correct its faulty prediction reasoning.
Nevertheless, its strengths lie in its ability to offer a fairer estimation of a model's generalization performance, untainted by features that should not contribute to the decision-making. P-ClArC is able to successfully reduce the impact of CH artifacts on a model's prediction, as its application to ILSVRC2012 demonstrates. This dataset may not be sufficient, however, for showcasing how powerful P-ClArC can be towards the solution of some pressing problems hindering the application of ML methods in real-world scenarios. For this reason, the following paragraphs will offer two examples, where P-ClArC is employed to avoid predictions for the wrong reasons with dangerous consequences, and to increase classification fairness on biased data.
Unlearning with Projective Class Artifact Compensation on ISIC 2019
In the previous sections, we have confirmed the success of P-ClArC applications on toy examples and more complex settings on real photographic images, i.e., the ILSVRC2012 dataset.
In this (and the following) section, we will apply P-ClArC to more domain-specific datasets in order to solve practically relevant issues. Here, we demonstrate that P-ClArC can be used to increase the trustworthiness of models trained for skin lesion classification on the ISIC 2019 dataset. As is common practice, we fine-tune a neural network (here a VGG-16 model) pretrained on ILSVRC2012 on the ISIC 2019 [32][33][34] skin lesion classification dataset for 100 epochs, using the weights from the PyTorch model zoo for initialization. Since ISIC 2019 does not have a pre-defined labeled test set, 10% of the original training set was split off instead to evaluate the model's performance. Our model achieves a final test accuracy of 82.15%.
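A 10% hold-out split of this kind might look as follows. The text does not state whether the split was random or stratified, nor which seed was used, so the uniform shuffle, the seed, and the sample count of 1000 below are placeholders rather than the actual setup.

```python
import random
from typing import List, Tuple


def holdout_split(
    n_samples: int,
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split off a held-out test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Placeholder dataset size, not the real number of ISIC 2019 images.
train_idx, test_idx = holdout_split(n_samples=1000)
```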
It is known, however, that the ISIC 2019 dataset contains several issues and confounders. First and foremost, there is a significant data artifact that only occurs in the largest class, i.e., colorful band-aids next to the photographed skin alteration. Since this artifact is again limited to one class, it constitutes a CH-type artifact. For the purpose of skin lesion classification, aimed to be applied in the medical field to assist medical personnel or allow mobile diagnoses [84], CHs like these can have serious consequences, as they may easily lead to a misclassification, affecting the resulting diagnosis, and, as such, the life of a patient. This is especially critical since the affected class, "melanocytic nevus", is a benign form of skin alteration, so that the CH may lead to fatal false negatives in terms of skin cancer diagnosis.
With this in mind, we aim to mitigate the effect that the "colorful band-aids" CH has on the model's prediction by employing P-ClArC. For this purpose, we again compare the model to which P-ClArC is applied and the original model in terms of predictions and LRP relevance maps. Results are shown in Figure 20. Here, as opposed to the corresponding evaluations for ILSVRC2012 (Figure 18), where normalized mean logit values were considered (due to the high number of classes), we measure the more stable mean softmax probabilities, since ISIC 2019 only contains 9 distinct classes, whereas ILSVRC2012 contains 1000. Due to the missing test set labels, the (whole) training set is used for the quantitative evaluations in panels (I) and (II) of this figure. However, since the application of P-ClArC does not involve any further training, the model never has the opportunity to adapt to the performed alterations in any way, e.g., by shifting its inference strategy to features which prior to CH removal had a merely supporting function.
In Figure 20 (I), for layers 0 (i.e., the input layer), 4, and 10, the effect of adding the CAV computed (for later usage during P-ClArC) to the activations at the respective layer is measured. If the CAV encodes a feature that is specific to the target class "melanocytic nevus", one would expect the softmax probability of that class to increase when poisoning the samples in that way, while confidence in the actual true class label would decrease simultaneously. Note that due to the true class changing from sample to sample, the sum of (mean) true class and target class probabilities may exceed 1 in this figure. For layer 0, we observe a decrease both for true and target labels, indicating a generally confusing effect of the CAV-poisoning on the model, as could be expected to some degree: In input space, the encoding of CHs via CAVs may not be feasible, because the data is too complex in its raw form, and no invariant representation learned by the model has been applied yet. In contrast, for layer 4 and even more so for layer 10, the softmax probabilities exhibit precisely the expected effect. E.g., for layer 10, they change from 0.97 to 0.11 for the true class, but rise from 0.51 to 0.94 for class "melanocytic nevus". As indicated by the green border, this effect is most prominent in layer 10. We thus infer that the computed CAV indeed denotes a concept specific to the target class -at least for layers 4 and 10.
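The remark that the mean true-class and mean target-class probabilities may sum to more than 1 can be checked with a small example: for every sample whose true class is the target class, the same probability enters both means. The sketch below uses toy logits and a hypothetical target index; it is not the evaluation code behind Figure 20.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_class_probability(
    logit_rows: Sequence[Sequence[float]],
    class_ids: Sequence[int],
) -> float:
    """Mean softmax probability of the class listed per sample in class_ids."""
    probs = [softmax(row)[c] for row, c in zip(logit_rows, class_ids)]
    return sum(probs) / len(probs)


TARGET = 0  # hypothetical index of the target class

# Three toy samples with three classes; the first two belong to the target class.
logit_rows = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
true_ids = [0, 0, 1]

mean_true = mean_class_probability(logit_rows, true_ids)
mean_target = mean_class_probability(logit_rows, [TARGET] * len(logit_rows))
```

Here both means are valid probabilities on their own, yet their sum exceeds 1 because two of the three samples are counted in both.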
Building on that assertion, the next step is to validate whether the P-ClArC method is able to counteract said poisoning. Figure 20 (II) shows the corresponding results. The inference results of the baseline model and the model employing P-ClArC are compared in the form of mean softmax probabilities. With the data poisoned in the same manner as in Figure 20 (I), a successful removal of an artifact should not only decrease confidence in the target class, but also increase confidence in the true class, restoring the predicted probabilities of an unpoisoned setting as closely as possible. As visible throughout Figure 20 (I) to (II), this is barely the case for layer 0, partly due to the probabilities already decreasing both for the target and the true class because of the poisoning. Even so, P-ClArC manages to almost restore the original confidences, with the true label probability growing from 0.38 to 0.83 (unpoisoned 0.97) and target label probability from 0.00 to 0.54 (unpoisoned 0.51). Although the CAV at layer 0 is not meaningful, P-ClArC can still mitigate the poisoning, showcasing again the need for our two-part quantitative evaluation, which validates not only that the concept suppression is successful, but also that the CAV encodes a target-class-specific concept. For layers 4 and 10, P-ClArC restores predictions even more closely to the original values shown in Figure 20 (I), increasing confidence in the true class, while decreasing confidence for "melanocytic nevus" simultaneously. E.g., for layer 4, the former rises from 0.30 to 0.96 (unpoisoned 0.97), while the latter falls from 0.94 to 0.52 (unpoisoned 0.51). The same result is obtained for layer 10. Since the poisoned probabilities there deviate more extremely from those obtained on unpoisoned data, however, we find an application of P-ClArC at layer 10 to be even more successful in counteracting poisoning (see green border).
Figure 20 (III): Example images and corresponding CAV-predictor and LRP relevance difference heatmaps for layers 4 and 10, where the quantifications of (I) and (II) yielded positive results. In CAV-predictor heatmaps, red areas indicate high relevance, i.e., highlight features indicative for the CAV direction. In the difference heatmaps, red areas were attributed less relevance, and thus used less by the model, after applying P-ClArC. The focus of the CAV-predictor seems to be relatively diffuse in layer 4, and only partly located on the targeted band-aids, supported by the only partial success of the concept mitigation, where sometimes even desired features are diminished. In contrast, both heatmap types are extremely precise at layer 10, and not only is the relevance of the CH reduced, but the melanoma itself becomes more important for the model's decision.
In Figure 20 (III), we aim to confirm the above assertions for layers 4 and 10 by means of three sample images of the class "melanocytic nevus" that contain the targeted "colored band-aid" CH. The results here are obtained from the unperturbed data of the ISIC 2019 dataset, as opposed to the artificially poisoned setting of Figure 20 (I) and (II). For each sample and layer, a heatmap computed for the CAV-predictor is shown, highlighting areas which speak for the presence of the concept described by the CAV in red, and areas speaking against it in blue. Furthermore, to the right of the CAV-predictor heatmaps, the difference in relevances between the model to which P-ClArC is applied and the original model is visualized, with red areas denoting a decreased relevance after the application of P-ClArC. Conversely, blue areas identify features which are increasingly used by the model. For layer 4, the computed CAV seems to encode the "colored band-aid" concept only relatively diffusely, with some portion of the positive relevance being attributed to the nevus (i.e., the desired feature) itself, as can be seen for the middle example. Similarly, the heatmap for the CAV-predictor also attributes negative relevance to the CH features. The subsequent CH correction results suffer from similar issues: While relevance is decreased on the "colored band-aids" themselves, the nevus often also receives less relevance, e.g., as observable with the first and second examples.
In contrast, the CAV-predictor heatmaps mark the confounding features far more precisely in layer 10, with not only the CH being extremely relevant, but the desired features also being a seemingly neutral (black color in heatmap) or even a negative indicator (blue color in heatmap) for the presence of the encoded concept. The accompanying difference maps show a strong decrease in relevance for the CH areas, and a simultaneous increase in the relevance of the desired features, showing not only that P-ClArC in layer 10 successfully corrects the faulty usage of the "colored band-aid" as an important feature for the model to decide for the "melanocytic nevus" class, but also that it further shifts the model's focus to the actually desired features, i.e., the nevi themselves.
Since the computed CAV is not only meaningful w.r.t. the target class, but also exactly describes the targeted CH artifact (at least for layers 4 and 10), and since P-ClArC is able to unlearn that concept, we can thus surmise the -albeit layer-dependent -success of the P-ClArC method on the ISIC 2019 skin lesion classification dataset for mitigating the effects of training data containing CH artifacts. Due to the corrected model using desired features preferably to the CH features, its trustworthiness increases, reducing the risk of costly misclassifications caused by the CH.
Unlearning with Projective Class Artifact Compensation on the Adience dataset of unfiltered faces
As opposed to the medical setting of ISIC 2019, we now apply P-ClArC to a gender classification task with the Adience dataset [31]. This dataset has various known problems, e.g., a relatively high class imbalance, as well as a multitude of biases within the data that models tend to quickly overfit on, as in part identified in [85] via LRP.
In the gender classification setting, one of these bias concepts is the presence of shirt collars in the class of male faces. Samples labelled as "male" with a shirt collar are a common occurrence within the dataset, whereas samples labelled as "female" showing a shirt collar are quite rare. Thus, models trained on the Adience dataset often use this confounding feature as a CH for the class defining the appearance of male faces, thereby short-cutting (the learning of) more complex features. This is also the case for the VGG-16 model we trained for gender classification. Using the pretrained ILSVRC2012 weights provided by PyTorch for initialization, the model was trained over 100 epochs on folds 1-4, keeping fold 0 for testing. The final accuracy achieved by this model was 94.02%.
However, the reliance of this model on CHs like the shirt collar concept may lead to unfair predictions, e.g., when a woman is predicted as "male" due to wearing clothes associated by the model with the class "male", i.e., here, a shirt collar. The impact of this is especially high in real-world applications, when stereotypes that are apparently present in the available training data are propagated into the inference of machine learning solutions. Here, we thus employ P-ClArC, with the aim of obtaining fairer gender predictions on the Adience dataset w.r.t. the "shirt collar" CH. Figure 21 shows the results of this experiment on the fold 0 test set, comparing the original model to the model employing P-ClArC to suppress the targeted "shirt collar" CH. To compute the corresponding CAVs, two hand-selected subsets of the samples representing class "male" were used, one containing samples with shirt collars, and one without. Figure 21 (I) and (II) show for intermediate layers 0, 4, and 10, similar to Figures 18 and 20, a quantitative evaluation of the change in mean softmax probabilities when using the computed CAV to poison activations at the respective layer (Figure 21 (I)) and when applying P-ClArC to mitigate that poisoning (Figure 21 (II)). This change is measured for both class labels of the dataset, i.e., "female" and "male", with the latter being the target class, that is, the class for which the CH "shirt collar" is used by the model as an indicative feature. Similarly to the results obtained for the ISIC 2019 dataset, we find that for layer 0, the computed CAV does not seem to be able to concisely describe a feature specific to the target class, since poisoning activations with it leads to a decrease in the softmax probability of class "male". To reiterate, a meaningful CAV direction, i.e., a CAV that encodes a feature of the target class, would lead to an increase in the model's confidence in that class.
However, this is not the case here, with scores for class "male" dropping from 0.51 to 0.32, probably because the raw input data, which has not yet been affected by any learned invariant internal representation of the model, is too complex for the CAV to describe successfully. Note that the softmax scores for class "female" simultaneously increase in this setting (from 0.49 to 0.68). That is, however, a byproduct of the confidence decrease for class "male" due to the binary classification task. The poisoning counteraction of P-ClArC for layer 0 is comparatively successful (Figure 21 (II)), with the original probabilities from Figure 21 (I) being restored to within a margin of 0.04. But since the computed CAV is not fully meaningful, the direction removed by P-ClArC cannot correspond to the target concept for layer 0. In contrast, for layer 4 and even more so for layer 10, as indicated by the green border, the poisoning in Figure 21 (I) yields the expected results, increasing the predicted probability of class "male" on average for, e.g., layer 10, from 0.51 to 0.94, while decreasing it for class "female" from 0.49 to 0.06. In Figure 21 (II), however, the removal of CAV-poisoning seems to overreach for layers 4 and 10: The original predicted probabilities of 0.49 for class "female" and 0.51 for class "male" are not exactly restored. Instead, e.g., for layer 10 the confidence of class "female" rises from 0.06 to 0.60 (instead of 0.49 for a perfect recovery), while it drops from 0.94 to 0.40 (instead of 0.51) for class "male", a discrepancy of 0.11 compared to the original values. Keeping in mind that Figure 21 (I) shows that the CAV is meaningful w.r.t. the target class, we thus infer that either the CAVs for layers 4 and 10 encode the targeted concept, and removing it affects the prediction so much because the model strongly relies on that feature, or the CAV encodes not only the shirt collar, but additionally other (possibly valid) features for class "male" that appear alongside shirt collars with a relatively large correlation. In any case, layer 10 is marked with a green border, since the concept suppression effect is strongest there.
For this layer 10, Figure 21 (III) shows samples for both classes "male" and "female", both with and without the target CH "shirt collar", respectively. Each sample is accompanied by two types of LRP relevance maps. The first, on the left, shows in red which features are important for the CAV-predictor, i.e., which features indicate the presence of the target concept as it is represented via the computed CAV, while features speaking against it are highlighted in blue. The second relevance map shows features of the respective sample that are used less by the model for its predictions after the application of P-ClArC in red, and features that are used more in blue. On images of the target class "male" that contain the target CH (top left), at first glance positive relevances in the CAV-predictor heatmap seem to focus on the actual shirt collar, indicating that the computed CAV does encode the target concept. In the relevance difference maps, however, while the relevance of the shirt collar decreases with an application of P-ClArC and that of the facial features (i.e., the features desired to be used by the model, naively summarized) increases, some other features, e.g., visible and uncovered ears, seem to also be suppressed. In the CAV-predictor heatmap, these are assigned a small positive relevance. As found by [85], specifically the visible ears also tend to be learned by models as an indicator for class "male" and possibly even constitute a CH. Apparently, these features often appear alongside the positive examples for the "shirt collar" concept, thereby leading to the computed CAV not only encoding "shirt collar" features, but additionally other (possibly CH) features of the class "male", further confirming our suspicions regarding the large shift in mean softmax probabilities when P-ClArC is applied at layers 4 and 10 in Figure 21 (II).
As the Adience dataset is an extremely complex dataset with highly biased data, a noisy CAV encoding is to be expected, especially, since the precision of the CAV is highly dependent on the samples chosen for its computation.
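How strongly a CAV depends on the chosen samples becomes clearer when looking at how such a vector may be fitted. The sketch below trains a small logistic-regression classifier on two sets of activations (with and without the concept) and returns its unit-normalized weight vector as the CAV. The classifier type, the optimizer settings, and the toy data are assumptions made for illustration; this section does not specify how the CAV-predictor was actually fitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def fit_cav(
    with_concept: Sequence[Sequence[float]],
    without_concept: Sequence[Sequence[float]],
    steps: int = 500,
    lr: float = 0.5,
) -> Tuple[Vector, float]:
    """Fit a logistic-regression concept classifier by gradient descent.

    Returns the unit-normalized weight vector (the CAV) and the rescaled bias.
    """
    samples = list(with_concept) + list(without_concept)
    labels = [1.0] * len(with_concept) + [0.0] * len(without_concept)
    dim, n = len(samples[0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted minus true label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w], b / norm


# Toy activations: the concept is expressed along the second dimension only.
with_collar = [[0.1, 2.0], [-0.2, 2.2], [0.1, 1.9]]
without_collar = [[0.0, 0.1], [-0.1, -0.1], [0.1, 0.0]]

cav, bias = fit_cav(with_collar, without_collar)
```

Any feature that happens to co-occur with the concept in the positive set would be picked up by the same fit, which is one way a CAV can end up encoding more than the intended concept.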
In contrast, when the target CH is not present (Figure 21 (III) (bottom left)), correctly no collar is identified, although, again, uncovered ears seem to receive partial positive relevance. For the "female" class, however, even though the shirt collar is identified by the CAV-predictor relevance maps (Figure 21 (III) (top right); albeit by far not as precisely as for class "male" – "collar"), it does not seem to diminish reliably after applying P-ClArC. Instead, e.g., in the top example, its relevance in the prediction process even increases, and the concept removal seems to focus mostly on the eyes and hairline. Contrary to class "male", an application of P-ClArC is not as successful for samples from class "female". This brings up a possible issue with using CAVs to represent the target CH artifacts that we have previously only briefly touched upon: within the Adience dataset, the "shirt collar" CH only has a significant presence within class "male", leading to positive and negative "shirt collar" examples for the CAV computation only being obtainable in a reliable manner from samples of class "male". However, because the CAV is only computed using samples from one class, and because its ability to distinguish a concept relies entirely on the data used for fitting the corresponding linear classifier, it does not necessarily encode the target CH as precisely when faced with samples from class "female", since the domain changes for the CAV model. For the samples from class "female" without shirt collar features, no shirt collar is found and consequently none is removed (similar to the corresponding "male" samples). In the second example in Figure 21 (III) (bottom right), the shape of the long hair seems to be identified as a shirt collar, showcasing another issue for this specific CH among samples belonging to class "female". To summarize, while the concept suppression of P-ClArC seems to have a similar success on the "male" class as we previously found for CHs in other datasets, albeit slightly more noisy due to the complex nature of the Adience dataset, applying it to the "female" class sheds light on various issues, e.g., a relatively strong domain dependence of the computed CAVs.
Figure 21: Application of P-ClArC on the Adience dataset, with the aim to obtain less stereotypical and fairer predictions. The target CH is the "shirt collar" concept used by the model to predict in favor of the class "male". (I): By adding the computed CAV to activations at the respective intermediate layer, the prediction can be affected in such a way that confidence in class "male" increases, showing that the CAV describes a concept specific to "male". The layer where this works best is marked by a green border. (II): Using P-ClArC, poisoning via the computed CH is easily mitigated. As a result, the softmax probabilities on the class "female" increase, while they decrease for "male". (III): CAV-predictor and LRP relevance difference heatmaps at layer 10 (the best performing layer according to (I) and (II)) for examples of both genders, with and without the target CH "shirt collar" each. The artifact is predicted and suppressed successfully if present in class "male"; in the class "female", however, this is not always the case. (IV): Analysis of transitions between true positive and false negative predictions when applying P-ClArC. Examples for layer 10 of which the predicted class is flipped are shown to the (left), together with softmax probabilities of each sample before and after using P-ClArC and the corresponding change in relevances. The table to the (right) shows the percentage of original true positives that change to false negatives, and vice versa. Generally, a higher percentage of false negatives is corrected than true positives are confused. Due to the original model being 94% accurate, however, a larger absolute number of samples are changed from true positives to false negatives, leading to an overall decrease in accuracy.
Even though the previous results are relatively mixed, we evaluate the ability of P-ClArC to achieve fairer predictions in Figure 21 (IV). Here, the table to the right shows for layers 0, 4, and 10 and both classes the percentage of previously mispredicted (false negatives, i.e., FN) and correctly predicted samples (true positives, i.e., TP) of which the predicted class changed after an application of P-ClArC, turning them into true positives and false negatives, respectively. In relative terms, more false negatives turn into true positives when P-ClArC is applied. For layer 0, where we found the computed CAV to not be meaningful w.r.t. the target class, the FN to TP rate is comparatively high with 15.7% for class "male" and 7.6% for class "female". At the same time, however, the TP to FN rate is also significant, with 0.7% for class "male" and 2.1% for class "female". In layer 4, they decrease to 7.2%, 0.3%, 5.3%, and 1.4%, respectively. In layer 10, an interesting phenomenon occurs, with the rates growing to 8.4% (FN to TP) and 0.4% (TP to FN) for class "male", but still diminishing for class "female", to 3.1% and 0.9%. A large number of samples changing from TP to FN and vice versa is not necessarily a sufficient measurement on its own, because many alterations to the model's inference process would have that effect, especially since with an accuracy of 94.02%, there are far more TP than FN in absolute terms. E.g., this seems to happen for P-ClArC with a badly encoded CH, as is the case for layer 0, according to our findings in Figure 21 (I)-(III). However, both (TP to FN) and (FN to TP) rates seem to steadily diminish with higher layers, presumably due to alterations later in the network not being propagated as far and thus having a lessened effect, except, as noted above, for layer 10 of (only) the class "male", where a sudden increase occurs.
This observation corresponds to our two previous assertions that the layer 10 CAV and P-ClArC process for class "male" is quite precise w.r.t. the target concept "shirt collar", although some other correlating distinct "male" features are also affected. For class "female", however, the same artifact does not seem to be as well defined.
In any case, a closer look at the affected samples is needed to come to a conclusion. For this purpose, Figure 21 (IV) (left) shows examples for which the prediction switched after applying P-ClArC in layer 4, along with the softmax probabilities of the respective samples before and after the attempted correction w.r.t. the CH concepts, together with the corresponding attribution difference maps, for classes "male" and "female" and both types of prediction change. For the class "male", samples seem to change from TP to FN (bottom left) due to the target concept, i.e., "shirt collar" or correlating male features like uncovered ears, being suppressed successfully. The accompanying change in softmax probabilities is quite significant, especially for the first example. Interestingly, in the second example, the female face visible in the image gains in attributed relevance due to the removal of features corresponding to class "male". Furthermore, the change from FN to TP (top left) appears to happen due to more significance being attributed to facial features, and less to surrounding features. Interestingly, in the top example, part of a "shirt collar" is removed, but confidence for "male" is increased, perhaps due to the colorful expression of the visible clothing item. Again, we note significant changes in the predicted class probabilities. In contrast, for class "female", probabilities often seem to change only slightly, and due to the model having difficulties classifying the sample in the first place, as is the case, e.g., for small children (top right and bottom right, first sample each). However, we also observe changes from FN to TP due to a shirt collar feature being withheld from the model (top right, second image), although the shirt collar removed here is a misinterpreted pearl necklace, and the corresponding alterations in relevance are by far not as distinct as for the examples labelled as "male".
Even so, the accompanying discrepancies in softmax probabilities are notably higher for examples such as this, where the classification changes due to valid (w.r.t. the targeted CH) reasons.
To summarize, on the Adience dataset (which is admittedly quite difficult to solve, due to its various inherent biases and imbalances), we found that the influence of even highly complex CH, e.g., the "shirt collar" of class "male", can be successfully mitigated via P-ClArC, although not quite as precisely and significantly as achieved for, e.g., the ISIC 2019 dataset. In particular, the issue of P-ClArC not being transferable between classes without losing precision in the CH correction becomes clear if a concept is present within multiple classes but the CAV representation is only learned from samples of a single class. This, however, seems to be a problem of the representation only being computed from samples of one class (because a sufficient number of examples expressing the CH sufficiently well is only available from the target class), not of the P-ClArC method itself. Finding more accurate and generalizing representations is subject to future work. In terms of fairness, we conclude that for the target class, the predictions after applying P-ClArC become more focused on the desired features, leading to classifications for the right reasons. For the other class, this is not always the case due to the representation issue stated above; however, if the concept is detected and suppressed correctly, the resulting difference in predicted probabilities is far more significant.

Conclusion
Deep Learning models have gained high practical usability by pre-training on large corpora and then reusing the learned representation for transferring to novel related data. A prerequisite for this practice is the availability of large sets of rather standardized and, most importantly, representative data. If artifacts or biases are present in data, then the representations formed are prone to inherit these flaws. This is clearly to be avoided; however, it requires either clean data or the detection and subsequent removal of the influence of artifacts, biases, etc. in datasets that would otherwise cause dysfunctional representation learning. In this paper we have used techniques from eXplainable Artificial Intelligence (e.g., LRP [16] and SpRAy [12] with several meaningful extensions), and introduced the Class Artifact Compensation framework to scalably and automatically detect, validate and alleviate Clever Hans behavior in multiple recent and large data corpora. While we mainly used LRP, the proposed ClArC framework is independent of the particular XAI method. ClArC encompasses a first simple intuition of how artifacts may harm generalization. As this intuitive model is based on logistic regression, it is rather crude, but it already shows the main effect caused by artifacts: deterioration of generalization ability. For neural networks it may, however, still serve as a reasonable guideline, and indeed our large-scale experiments on various datasets show analogous effects that can result in a dramatic drop of generalization for some classes. Based on the ClArC model of artifactual features, we have introduced two concrete algorithms to implement the desensitization and unlearning of undesired features in a deep neural network: First, we proposed A-ClArC, an approach building on strategic augmentation of the data and subsequent fine-tuning of the model in order to remove the influence of artifactual confounders from inference.
Second, with P-ClArC, we aim at suppressing the representation of an artifact as a feature to prevent its use in inference. While the latter approach is extremely efficient, as it does not involve any training beyond the modeling of the artifact itself, the former can drive the model to adapt to a different, benign set of features. Both approaches can be applied to artifact representations obtained in input space as well as in latent space.
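To illustrate the idea of suppressing an artifact representation without retraining, the following is a minimal NumPy sketch. It assumes that the artifact is modeled as a single linear direction in activation space (e.g., a CAV) and that suppression replaces each activation's component along that direction with the component of a reference point (e.g., the mean activation of artifact-free samples). The function name `suppress_direction`, its arguments, and this particular projection are illustrative assumptions and not necessarily the exact formulation used in the experiments.

```python
import numpy as np


def suppress_direction(activations, cav, reference):
    """Replace the component of each activation along `cav` with that of `reference`.

    activations: array of shape (n_samples, n_features)
    cav:         array of shape (n_features,), assumed artifact direction
    reference:   array of shape (n_features,), assumed artifact-free reference point
    """
    v = cav / np.linalg.norm(cav)            # unit-norm artifact direction
    shift = (activations - reference) @ v    # signed offset along v, per sample
    return activations - np.outer(shift, v)  # remove that offset; the rest is unchanged
```

After this operation, all samples share the same coordinate along the assumed artifact direction, so subsequent layers can no longer separate them by it, while components orthogonal to the direction pass through unmodified.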
Let us discuss the main experimental findings. Based on an extended SpRAy technique, we could verify artificially created Clever Hans artifacts in toy settings, and automatically detect some rather unexpected Clever Hans strategies of a popular pre-trained VGG-16 deep learning model on ILSVRC2012. These are caused by a zoo of artifacts and biases isolated by our framework in the corpus, encompassing copyright tags, unusual image formatting, specific co-occurrences of unrelated objects, and cropping artifacts, just to name a few. Detecting this zoo gives not only insight but also the possibility of relieving models and datasets of their Clever Hans moments, i.e., based on our theoretical findings, we are now able, using ClArC, to implicitly un-Hans large reference datasets such as the ImageNet corpus and thus provide a more consistent basis for pre-trained models. We demonstrated this in unlearning experiments for several artifactual features on ImageNet, and in practical application scenarios, i.e., the ISIC 2019 skin lesion prediction dataset and the Adience benchmark dataset of unfiltered faces, yielding more representative predictors for the tasks. In all scenarios, we observe that a precise modeling of the artifact, i.e., the availability and use of representative data distinguishing artifactual features from desired ones, has a beneficial effect on the success of both ClArC variants.
Let us reiterate that without removing, or at least considering, such data artifacts, learning models are prone to adopt Clever Hans strategies [12], thus giving the correct prediction for an artifactual/wrong reason. Once these artifacts are absent or appear in unusual combination with other features in the wild, such Clever Hans models will experience significant loss in generalization (see, e.g., Figures 14, 20 and 21). This makes them especially vulnerable to adversarial attacks that can harvest all such artifactual issues in a data corpus [86].
Future work will therefore focus on the important intersection between security and functional cleaning of data corpora, e.g., to lower the attack risk when building on top of pre-trained models.

A.1 CIFAR-10 Training
The simple convolutional model trained on CIFAR-10 in Section 3.1 consists of two ReLU-activated convolutional-pooling blocks (filter sizes 16 and 32), followed by two dense layers (512 and 10 outputs, respectively). The model is trained for 5 epochs using SGD with a learning rate of 0.01 and a momentum of 0.9.
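A model of this shape could be sketched in PyTorch as follows. Several details are not given in the text and are assumptions here: "filter sizes 16 and 32" is read as the number of filters per convolution, the kernels are 3x3 with padding 1, pooling is 2x2 max-pooling, and a ReLU is placed between the two dense layers. The class name `SimpleCifarNet` is hypothetical.

```python
import torch
from torch import nn


class SimpleCifarNet(nn.Module):
    """Two conv-pool blocks followed by two dense layers (sketch, see assumptions)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # assumed 3x3 kernels
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 512),
            nn.ReLU(),                                    # assumed activation
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = SimpleCifarNet()
# Optimizer settings as stated in the text: SGD, learning rate 0.01, momentum 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```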

A.2 Colored MNIST Training
All models on colored MNIST in Sections 3.2 and 4.1 are trained using the AdaDelta algorithm with a learning rate of 1.0, which is multiplied by 0.7 after each epoch, for 10 epochs. The a posteriori ClArC is trained for 10 epochs on top of the native model, which has also been trained for 10 epochs. The network consists of 2 convolutional layers, followed by a max-pooling layer, and finally 2 fully connected layers. Dropout is used after the max-pooling layer and after the first fully connected layer, with 25 percent and 50 percent dropout probabilities, respectively. ReLU activations follow all linear layers except the final one.
The model used in Section 4.2 is trained with SGD and a learning rate of 0.001 for 5 epochs. The architecture, however, is the same as for the other colored MNIST models.
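The described architecture and learning rate schedule could be sketched in PyTorch as follows. The channel counts (32 and 64), the 3x3 kernels, the hidden width of 128, and the 3x28x28 input size are assumptions not stated in the text, and the class name `ColoredMnistNet` is hypothetical.

```python
import torch
from torch import nn


class ColoredMnistNet(nn.Module):
    """Two conv layers, max-pooling, two fully connected layers (sketch)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3),    # 28x28 -> 26x26 (assumed sizes)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3),   # 26x26 -> 24x24
            nn.ReLU(),
            nn.MaxPool2d(2),                    # 24x24 -> 12x12
            nn.Dropout(0.25),                   # dropout after max-pooling
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128),
            nn.ReLU(),
            nn.Dropout(0.5),                    # dropout after first dense layer
            nn.Linear(128, num_classes),        # no activation after final layer
        )

    def forward(self, x):
        return self.net(x)


model = ColoredMnistNet()
# AdaDelta with learning rate 1.0, multiplied by 0.7 after each epoch.
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.7)
```

In a training loop, `scheduler.step()` would be called once per epoch after the optimizer updates. For the SGD-trained variant, the optimizer could instead be created as `torch.optim.SGD(model.parameters(), lr=0.001)`, with no momentum assumed since none is stated.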

A.3 A-ClArC on ImageNet
In Section 4.1 we employ A-ClArC using a VGG-16 model with the pretrained weights obtained from the PyTorch model zoo. For the input space A-ClArC experiment, we use an Adam optimizer with learning rate 0.0001 for fine-tuning. During feature space A-ClArC, an SGD optimizer with learning rate 0.001 and momentum 0.9 is applied. In both cases, we fine-tune over 10 epochs.

A.4 P-ClArC on ISIC 2019 and Adience Training
We again employ the VGG-16 model in Section 4.2 with the pretrained weights obtained from the PyTorch model zoo to train on both the ISIC 2019 and Adience datasets, replacing the last fully connected layer of the classifier to fit the number of classes, i.e., 9 and 2, respectively. Both models are then trained over 100 epochs, using an SGD optimizer with learning rate 0.001 and momentum 0.9.