PredDiff: Explanations and Interactions from Conditional Expectations

PredDiff is a model-agnostic, local attribution method that is firmly rooted in probability theory. Its simple intuition is to measure prediction changes while marginalizing features. In this work, we clarify properties of PredDiff and its close connection to Shapley values. We stress important differences between classification and regression, which require a specific treatment within both formalisms. We extend PredDiff by introducing a new, well-founded measure for interaction effects between arbitrary feature subsets. The study of interaction effects represents an inevitable step towards a comprehensive understanding of black-box models and is particularly important for science applications. Equipped with our novel interaction measure, PredDiff is a promising model-agnostic approach for obtaining reliable, numerically inexpensive and theoretically sound attributions.


Introduction
Understanding complex machine learning models is fundamental for high-stakes applications, e.g., in healthcare or criminal justice. To this end, the Explainable AI (XAI) community has put forward a plethora of different attribution methods, see [7,30,35,37,40] for reviews. Most methods summarize the complex, non-linear interactions that a single feature undergoes while traversing a machine learning model into a single attribution score. While this approach can provide invaluable, informative heatmaps, feature-wise relevances do not provide access to feature interactions [51,52] and can even be misleading, as interaction effects are implicitly distributed onto single-feature relevances [10].
We envision various applications where understanding interaction effects is instrumental for extracting knowledge about underlying mechanisms from a machine learning model. We exemplify the prospects for such methods in two domains: firstly, in the natural sciences and, secondly, in healthcare. In the first case, consider a model that is trained to infer protein-protein interactions, framed as a binary classification task given both primary protein sequences as input. Interpretability methods that quantify interaction effects would then enable the identification of corresponding binding sites in both sequences. In the second case, we consider a medical risk prediction model, which infers the mortality risk based on multiple demographic features and lab values. Here, relying only on single-feature importance might paint a misleadingly simple picture, as multiple risk factors interact and hence aggravate or alleviate the mortality risk (such as age and sex in the simplest case). Thus, interaction measures are necessary to capture the complex underlying physiological reality.
In this work, we revisit Prediction Difference analysis (PredDiff), which was originally introduced in [39]. In our opinion, the beauty of PredDiff lies in its simplicity and strong connection to probability theory. The whole formalism is fixed by marginalizing variables and measuring prediction differences. It has been successfully applied to various image classification tasks [15,50,55,59] and also in Natural Language Processing, where it is referred to as input marginalization [17,26]. However, all previous studies lack a comprehensive treatment in a well-controlled setting that tests the analytical and experimental limits of PredDiff. The unifying perspective on perturbation-based attribution methods in [7] shows how PredDiff is closely connected to Shapley values [35,46] and other approaches of this category. In particular, PredDiff encompasses single-shot attribution methods, such as occlusion [57] or inpainting image parts with generative models [3,32]. These are, however, not covered by the foundations of PredDiff and are potentially unreliable.
Our main contribution is a novel interaction measure for PredDiff. It is well-founded and allows decomposing feature relevances into main and joint effects. Importantly, our decomposition is applicable to any interaction order and obeys a completeness relation. Incorporating and quantifying feature interactions has very recently attracted interest within the XAI community, see [52] for a review. First works on interaction effects appeared in [10,11,22,41,50]. Additionally, interaction measures have been proposed for Shapley value-based approaches [13,34,47,58], global ALE-plots [5], and other perturbation-based approaches [24,53]. PredDiff has the particular advantage that it allows quantifying interaction effects for arbitrary (non-overlapping) feature sets, while retaining an optimal, linear scaling.
Additionally, we investigate PredDiff's theoretical properties and demonstrate its intimate relationship to Shapley values. In particular, we shed light on the intricacies of classification due to the inherent connection between the classifier and the underlying data distribution. We present the consequences for PredDiff and Shapley values, both on the level of relevances and interactions. Finally, we present experimental evidence for the soundness of our framework and find qualitative agreement with the popular Shapley Interaction Index. In particular, we quantify feature interactions for an image classifier, a task that is already intractable for many competing methods with a less favorable scaling than PredDiff.
To summarize, our main contributions are (i) an investigation of the theoretical properties of PredDiff and its relation to Shapley values, (ii) a novel interaction measure based on a proper functional decomposition, satisfying an interaction completeness property on the relevance level, (iii) an analysis of the intricacies of classification due to the inherent connection between classifier and data distribution, and (iv) experimental validation on analytic, synthetic and real-world datasets for classification and regression.

PredDiff : a local, model-agnostic, probabilistically sound attribution method
We specify our notation as follows. $\mathcal{F} = \{X_1, \ldots, X_N\}$ is our set of features. Uppercase letters ($X_s$) denote the features themselves (with unspecified values) and lowercase letters ($x_s$) refer to a specific instance. We denote the complement of a feature set $s$ by $\tilde s = \mathcal{F} \setminus s$. Additionally, we routinely split all features into pairwise disjoint subsets $s$, $t$ and $r$ with $\mathcal{F} = s \cup t \cup r$. Typically, we assess the interaction relevance between the feature sets $s$ and $t$ in the presence of the remaining set of features $r$.

Relevances for classification and regression tasks
We consider a classification task, where a classifier provides access to the conditional probability of class $c$, i.e., $f_c(x_s, x_{\tilde s}) := q(c \mid x_s, x_{\tilde s})$. One way of assessing the relevance of a particular set of features $x_s$ is to compare the original prediction $q(c \mid x_s, x_{\tilde s})$ to the prediction $q(c \mid x_{\tilde s})$, where the feature(s) $x_s$ has/have been removed. For an arbitrary classifier, this can be implemented in a probabilistically sound manner by marginalizing [39] via
$$f_c(x_{\tilde s}) = \int \mathrm{d}x_s \; p(x_s \mid x_{\tilde s})\, f_c(x_s, x_{\tilde s}). \qquad (1)$$
Here, $p(x_s \mid x_{\tilde s})$ represents the true generative distribution for reconstructing $x_s$ given the remaining features evaluated at $x_{\tilde s}$. In practice, we typically draw a fixed number of random samples from an empirical imputer distribution $\hat p(x_s \mid x_{\tilde s})$ that approximates $p(x_s \mid x_{\tilde s})$, see Appendix B for more numerical details. Therefore, PredDiff does not suffer from an unfavorable factorial scaling with the number of involved features. Additionally, one straightforwardly obtains confidence intervals for relevance scores via empirical bootstrapping. In this sense, the approach is completely domain- and task-agnostic, provided an appropriate generative model for imputation. In terms of imputer distributions $\hat p(x_s \mid x_{\tilde s})$, one can broadly distinguish between marginal imputer distributions, which completely neglect the dependence on $x_{\tilde s}$ and therefore in general inevitably produce off-manifold samples, and $x_{\tilde s}$-dependent conditional imputer distributions.
It is worth noting that all perturbation-based attribution methods have to deal with this issue. For Shapley values, this is captured in the recent discussion on interventional as compared to observational Shapley values [23,28,48]. In our experiments, we always present results for a conditional as well as a marginal imputer to give the reader a qualitative impression of the impact of the imputer choice. A detailed comparison is deferred to future work. As a final remark, we stress that the probabilistic interpretation of Eq. (1) clearly requires the use of a conditional imputer distribution. In general, PredDiff relevances are obtained by comparing the occluded prediction to the sample prediction. Several possibilities have been proposed in the literature [39]. Here, we compare logarithmic differences, which are interpreted as the information difference conveyed by $x_s$, i.e.,
$$\bar m_c^{s} = \log_2 f_c(x_s, x_{\tilde s}) - \log_2 f_c(x_{\tilde s}), \qquad (2)$$
see Sec. 2.2 for a novel argument favoring this choice. We stress that this equally applies to other attribution methods, see Sec. 2.3. We avoid issues with vanishing probabilities, as in [39], by means of a Laplace correction, i.e., by mapping $q \to (q\,n + 1)/(n + K)$, where $n$ is the number of training instances and $K$ the number of classes. As a second remark, the probabilistic interpretation of Eq. (1) relies on the identification of $f_c$ with a proper probability distribution. On general grounds, being well-calibrated is desirable for any classifier. In this work, we achieve this via temperature scaling, which rescales the pre-softmax activations by a single global factor in order to shift the prediction confidence appropriately [16].
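The occlusion-and-compare scheme of Eqs. (1) and (2) can be sketched by Monte-Carlo marginalization. In the following sketch, the model, the marginal imputer samples and all function names are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def laplace_correct(q, n_train, n_classes):
    """Laplace correction q -> (q*n + 1)/(n + K) to avoid vanishing probabilities."""
    return (q * n_train + 1.0) / (n_train + n_classes)

def preddiff_relevance(predict_proba, x, s_idx, imputer_samples, n_train, n_classes):
    """Monte-Carlo estimate of the logarithmic PredDiff relevance of feature set s_idx.

    predict_proba maps a batch of inputs to class probabilities; imputer_samples
    are draws replacing the occluded features (here from a marginal imputer).
    """
    x = np.asarray(x, dtype=float)
    p_full = laplace_correct(predict_proba(x[None])[0], n_train, n_classes)
    # occluded prediction: average the model output over imputations (Eq. 1)
    batch = np.repeat(x[None], len(imputer_samples), axis=0)
    batch[:, s_idx] = imputer_samples
    p_occl = laplace_correct(predict_proba(batch).mean(axis=0), n_train, n_classes)
    return np.log2(p_full) - np.log2(p_occl)  # information difference per class
```

For a classifier that ignores a feature, occluding that feature leaves the averaged prediction unchanged and the estimated relevance vanishes.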
Turning to regression problems, we infer relevances with respect to a particular model $f(x_s, x_{\tilde s})$. Hence, only the class subscript $c$ in Eq. (1) is suppressed, i.e.,
$$f(x_{\tilde s}) = \int \mathrm{d}x_s \; p(x_s \mid x_{\tilde s})\, f(x_s, x_{\tilde s}). \qquad (3)$$
Here, the regression target is directly meaningful and we therefore directly consider centered $\bar m$-values via [33,35,46]
$$\bar m^{s} = f(x_s, x_{\tilde s}) - f(x_{\tilde s}), \qquad (4)$$
with a slight abuse of notation in order to unify regression and classification tasks as far as possible.
We discuss different properties of PredDiff attributions in Appendix D. PredDiff satisfies the classic Shapley axioms of sensitivity, linearity and symmetry. The completeness axiom, i.e., that summing up all feature relevances equals the prediction up to a reference value, only holds under particular circumstances. It does hold, however, in the case where it is indispensable, namely for linear models with independent features. Here, PredDiff relevances in fact coincide with Shapley values, see Appendix E. We present a comprehensive discussion of the completeness axiom in Sec. 2.3.4.

Decomposition and completeness relation
We start in a regression setting, which is conceptually slightly simpler. The intuition behind our approach is to decompose the model prediction into its main, additive components $f^s$/$f^t$ and its interactive, non-additive part $f^{st}$. Subsequently, this induces a similar decomposition on the level of relevances, which we use to measure interaction effects between the features $x_s$ and $x_t$. This is achieved by using the anchored expansion from [29] with the sample $(x_s, x_t, x_r)$ as anchor point. This results in a decomposition of the form
$$f(X_s, X_t, x_r) = f^{\emptyset} + f^{s}(X_s) + f^{t}(X_t) + f^{st}(X_s, X_t), \qquad (5)$$
where all components are already evaluated at $X_r = x_r$, and $f^s$ is only a function of $X_s$ (analogously for the remaining components). The decomposition is unique in the sense that it is the only decomposition that fulfills the annihilation property, i.e., a component vanishes if any feature set it depends on is set to its anchor-point value. The decomposition is minimal in the sense that it avoids unnecessary higher-order terms as far as possible [29], which is a desired property in our case [56]. Now, we can use Eq. (4) to compute PredDiff relevances for the components in Eq. (5) and obtain a completeness relation, which constitutes the heart of our formalism, i.e.,
$$\bar m^{st} = \bar m^{f^s} + \bar m^{f^t} + \bar m^{f^{st}}, \qquad (6)$$
where each component relevance is evaluated via Eq. (4); here we used that $\bar m^{f^\emptyset} = 0$ and that occluding $x_s$ leaves $f^t$ unchanged (as $f^t$ by definition does not depend on $X_s$). The interpretation of the different terms will be discussed in Sec. 2.2.2. Using Eq. (4), the quantity of interest $\bar m^{f^{st}}$ is thus explicitly given by
$$\bar m^{f^{st}} = f^{st}(x_s, x_t) - \int \mathrm{d}x_s'\,\mathrm{d}x_t'\; p(x_s', x_t' \mid x_r)\, f^{st}(x_s', x_t'). \qquad (7)$$
In particular, $\bar m^{f^{st}}$ vanishes in the case of a non-interacting regressor of the form $f(x_s, x_t, x_r) = h(x_s, x_r) + g(x_t, x_r)$. We refer to this property as the no-interaction property. As an important remark, the different constituents on the right-hand side of Eq. (6) inherit the computational complexity of the original PredDiff relevances. Anchoring the decomposition at the sample point $(x_s, x_t, x_r)$ is the only consistent choice within the PredDiff framework, see Appendix G.2 for a detailed discussion. Generalizing the decomposition Eq. (5) to an arbitrary number of interacting feature sets, i.e., higher-order effects, is straightforward and leads to the interaction completeness property Eq. (55), which is analogous to Eq. (6), see Appendix G.3 and G.4 for details.
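The anchored expansion can be made concrete for two scalar feature sets with $x_r$ held fixed; the following is a minimal sketch of the construction (function names are illustrative):

```python
import numpy as np

def anchored_decomposition(f, anchor):
    """Anchored expansion of a two-argument model f(x_s, x_t) around a sample anchor.

    Returns components (f_empty, f_s, f_t, f_st); each non-constant component
    vanishes as soon as one of its arguments is set to its anchor value
    (annihilation property), and the four components sum back to f.
    """
    a_s, a_t = anchor
    f_empty = f(a_s, a_t)
    f_s = lambda x_s: f(x_s, a_t) - f_empty
    f_t = lambda x_t: f(a_s, x_t) - f_empty
    f_st = lambda x_s, x_t: f(x_s, x_t) - f_s(x_s) - f_t(x_t) - f_empty
    return f_empty, f_s, f_t, f_st
```

By construction, the interaction component `f_st` vanishes identically for an additive model, which is the no-interaction property on the level of the decomposition.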

Interpretation
We now work out an interpretation for the individual terms of the interaction completeness relation Eq. (6). The left-hand side of the equation relates to the prediction change, i.e., the loss or gain of information, when both feature sets $x_s$, $x_t$ are occluded. The interpretation of the raw PredDiff effects is therefore as follows: • (raw) main effect $\bar m^{f^s}$: prediction difference corresponding to solely occluding $x_s$ with knowledge of all other features. Hence, it contains all higher-order joint effects at fixed values of the interaction partners $X_t = x_t$.
• (raw) joint effect $\bar m^{f^{st}}$: interactive prediction difference corresponding to jointly occluding $x_s$ and $x_t$ in a way that is not covered by a single corresponding main effect, i.e., by keeping either feature set fixed at $x_s$ or $x_t$, respectively.
Finally, we refer to $\bar m^{s}$ as (raw) relevances, which agree with the corresponding main effect $\bar m^{f^s}$ up to the used conditioning. For regression, we can additionally define shielded counterparts, which specifically exclude the combined feature effects from the main effect. This point of view relies on regrouping terms in Eq. (6) and leads to an alternative decomposition of $\bar m^{st}$ of the form
$$\bar m^{st} = \bar m^{s\setminus t} + \bar m^{t\setminus s} + \bar m^{st\setminus}. \qquad (8)$$
The terms in Eq. (8) have the following interpretation: • shielded main effect $\bar m^{s\setminus t}$: prediction difference corresponding to solely occluding $x_s$ without the presence of $x_t$. Hence, it is shielded from the joint effect between $x_s$ and $x_t$.
• shielded joint effect $\bar m^{st\setminus}$: interactive prediction difference corresponding to jointly occluding $x_s$ and $x_t$, i.e., the super-additive part with respect to the shielded main effects.
We show how shielded effects can be constructed for third-order interactions in Appendix G.5. To build a better intuition, note that, under the assumption of a factorizing imputer distribution $p(x_s, x_t \mid x_r) = p(x_s \mid x_r)\, p(x_t \mid x_r)$, we can write
$$\bar m^{s\setminus t} = f^{\setminus t}(x_s, x_r) - \int \mathrm{d}x_s'\; p(x_s' \mid x_r)\, f^{\setminus t}(x_s', x_r), \qquad (9)$$
where $f^{\setminus t}(x_s, x_r) = \int \mathrm{d}x_t\, f(x_s, x_t, x_r)\, p(x_t \mid x_r)$. The shielded main effect is, therefore, nothing but the main effect of the model in which $x_t$ has been marginalized.
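For scalar feature sets and a marginal (factorizing) imputer, with $x_r$ suppressed, the shielded effects of Eq. (8) can be sketched directly from their definitions; sample arrays and names are illustrative:

```python
import numpy as np

def shielded_effects(f, x_s, x_t, s_samples, t_samples):
    """Shielded main and joint effects (Eq. 8) under a factorizing imputer
    p(x_s, x_t | x_r) = p(x_s | x_r) p(x_t | x_r); x_r is left implicit.

    f must be vectorized over both arguments; *_samples are imputer draws.
    """
    # f_marg_t(x_s) = E_{x_t}[ f(x_s, x_t) ]: the model with x_t marginalized
    f_marg_t = lambda xs: np.mean(f(xs, t_samples))
    f_marg_s = lambda xt: np.mean(f(s_samples, xt))
    mean_pred = np.mean(f(s_samples[:, None], t_samples[None, :]))
    m_st = f(x_s, x_t) - mean_pred                   # occlude s and t jointly
    m_s_shield = f_marg_t(x_s) - mean_pred           # shielded main effect of s
    m_t_shield = f_marg_s(x_t) - mean_pred           # shielded main effect of t
    m_joint_shield = m_st - m_s_shield - m_t_shield  # shielded joint effect
    return m_s_shield, m_t_shield, m_joint_shield
```

For an additive model the shielded joint effect vanishes (no-interaction property), while for a purely multiplicative model the entire effect is assigned to the joint term.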

Classification
For classification settings, the situation is more intricate due to the fact that the model's class-conditional probabilities and the data distribution are implicitly tied, as both relate to the joint distribution of labels and input features. We rely on PredDiff relevances for classification as additive information differences and postulate that the completeness relation Eq. (6) remains valid in the classification setting, i.e., upon replacing $f$ by $\log_2 f_c$. We support this argument by investigating the no-interaction property, as a necessary condition for any sensible interaction measure, which entails that a non-interacting classifier yields a vanishing interaction relevance. To define a non-interacting classifier, we consider a generalization of informative conditional interactions [18,21], which implies that there is no label $c$ and residual features $x_r$ such that the feature sets $x_s$ and $x_t$ interact directly. Thus, we define a classifier where $x_s$ and $x_t$ do not interact by
$$p(x_s, x_t \mid c, x_r) = p(x_s \mid c, x_r)\, p(x_t \mid c, x_r). \qquad (10)$$
If one works out PredDiff relevances under this assumption, see Appendix H, one is led to a joint effect with two contributions: the first term, $\log_2\!\big[\,p(x_s, x_t \mid x_r) / \big(p(x_s \mid x_r)\, p(x_t \mid x_r)\big)\big]$, is conventionally referred to as the local conditional mutual information. The second term relates to the conditioning, i.e., to using the main effect $\bar m^{f^t}$ instead of the relevance $\bar m^{t}$, which is inevitable if one insists on comparing only objects that share a common conditioning, as was done in the regression case, see Appendix H for details. The occurrence of the first term is naturally explained by the fact that classifier and data distribution are tied (through the constraint on the joint distribution, Eq. (10)), in the sense that the information difference on the left-hand side also yields a term that just quantifies the information difference on the level of the input features. These terms are not specific to PredDiff but naturally appear also in other formalisms such as Shapley values in a classification setting, see Sec. 2.3.
Lastly, it is worth stressing that the no-interaction property singles out logarithmic differences in Eq. (2) and does not hold for other popular difference measures, such as raw probabilities or log-odds [39], see Appendix H.
At this point, there are different ways to ensure $\bar m^{f^{st}}_c = 0$ for a non-interacting classifier as defined by Eq. (10). For conventional discriminative models, one would use a separate generative model (imputer) $\hat p(x_s, x_t \mid x_r)$ to approximately sample from $p(x_s, x_t \mid x_r)$, which unties the relation between the output probabilities and the data distribution. Here, we proceed by noting that both terms on the right-hand side vanish upon using a factorizing imputer distribution $\hat p(x_s, x_t \mid x_r) = \hat p(x_s \mid x_r)\, \hat p(x_t \mid x_r)$, which also implies $\hat p(x_s \mid x_r) = \hat p(x_s \mid x_t, x_r)$ and $\hat p(x_t \mid x_r) = \hat p(x_t \mid x_s, x_r)$. This can be implemented by sampling two copies $(x_s^1, x_t^1)$ and $(x_s^2, x_t^2)$ from $\hat p(x_s, x_t \mid x_r)$ and using $(x_s^1, x_t^2)$ and $(x_s^2, x_t^1)$, see Appendix B for details. Firstly, this exposes the classifier to samples that are off-manifold to a slight degree, as the connection between $x_s$ and $x_t$ has been broken; secondly, it induces a sampling error due to the fact that the sampling distribution $\hat p(x_s \mid x_r)\, \hat p(x_t \mid x_r)$ does not capture the implicit relation between $x_s$ and $x_t$ in $p(x_s, x_t \mid x_r)$. We see this as a minor issue, as this sampling error will most likely still be smaller than the inherent approximation error arising from training the imputer $\hat p(x_s, x_t \mid x_r)$ to match $p(x_s, x_t \mid x_r)$ based on a limited amount of data. Alternatively, for hybrid models that provide access to the joint probability $p(c, x_s, x_r)$, such as [14], or imputers that provide an exact sampling probability (notwithstanding the inevitable mismatch between imputer and data distribution), such as normalizing flows [27], one option would be to compute the terms on the right-hand side and subtract them from the left-hand side in order to define the joint effect.
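The copy-swap construction can be sketched as follows; recombining two copies is implemented here via a random permutation of paired imputer draws, which is an illustrative variant, not the reference implementation:

```python
import numpy as np

def factorized_pairs(joint_samples, rng):
    """Approximate draws from p(x_s|x_r) p(x_t|x_r) given paired samples
    from the joint imputer p(x_s, x_t|x_r).

    Copy-swap trick: take two copies (s1, t1), (s2, t2) of the joint draws
    and recombine them as (s1, t2) and (s2, t1), breaking the s-t coupling
    while leaving both marginals intact.
    """
    s, t = joint_samples           # arrays of equal length: paired draws
    idx = rng.permutation(len(s))  # second copy: a reshuffled version
    s1, t1 = s, t
    s2, t2 = s[idx], t[idx]
    return np.concatenate([s1, s2]), np.concatenate([t2, t1])
```

The recombined pairs preserve the marginal distributions of $x_s$ and $x_t$ exactly, while their empirical correlation is destroyed.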

Connection between Shapley and PredDiff
Shapley values are a popular tool for local model-agnostic attribution [35,46] based on game theory [42]. In general, Shapley values are given by
$$\phi_i = \sum_{S \subseteq \mathcal{F} \setminus \{i\}} \frac{|S|!\,(N - |S| - 1)!}{N!}\, \big[v(S \cup \{i\}) - v(S)\big].$$
The remaining ambiguity is to specify a connection between a model $f$, an instance $x$ and the value function $v(S)$. A common choice uses an observational (conditional) distribution to occlude the redundant features $x_i$ with $i \notin S$, i.e.,
$$v^{\text{reg}}_x(S) = \int \mathrm{d}x_{\bar S}\; p(x_{\bar S} \mid x_S)\, f(x_S, x_{\bar S}).$$
One then identifies the first Shapley term, $S = \mathcal{F} \setminus \{i\}$, with the PredDiff relevance $\bar m^{i}$. This reveals an intimate connection between both formalisms. However, there is an ongoing debate whether one should replace the observational by an interventional (marginal) distribution, see [23,28,48]. This would break the previous correspondence. In general, marginal distributions generate illegitimate, out-of-distribution samples, questioning the reliability of the resulting attributions. Additionally, ignoring feature dependencies unavoidably leads to simple adversarial attack strategies [4,44].
Turning to feature interactions, we consider the relation to the 'Shapley Interaction Index' [34], which was proposed as an explicit measure for interactions based on game theory [12]. Interestingly, we can map PredDiff's shielded joint effect onto their central object, a discrete second-order derivative, i.e.,
$$\bar m^{st\setminus} = v(S \cup \{i, j\}) - v(S \cup \{i\}) - v(S \cup \{j\}) + v(S)$$
for $S = \mathcal{F} \setminus \{i, j\}$ and $s$/$t$ restricted to a single feature each. In the same setting, PredDiff's shielded main effects, such as $\bar m^{s\setminus t}$, can be identified with the second Shapley term. This reiterates the close connection between both formalisms, which we expect to hold at higher orders as well. A different proposed interaction measure within the Shapley value formalism is the 'Shapley Taylor Interaction Index' [47]. It is centered around a general discrete derivative formula, which allows incorporating arbitrary interaction orders. Here, we point out that these discrete derivatives are identical to the general decomposition underlying Eq. (5) up to a global sign, see Appendix G.1 for details.
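For two features and a marginal imputer, the discrete second-order derivative of the value function can be sketched directly and coincides with the shielded joint effect in this setting; names and imputer samples are illustrative:

```python
import numpy as np

def discrete_2nd_derivative(f, x1, x2, samples1, samples2):
    """Discrete second-order derivative of the value function v(S) = E[f | x_S]
    (marginal imputer, two features, S = empty set), the central object of the
    Shapley Interaction Index."""
    v_12 = f(x1, x2)                                        # both features kept
    v_1 = np.mean(f(x1, samples2))                          # x2 occluded
    v_2 = np.mean(f(samples1, x2))                          # x1 occluded
    v_0 = np.mean(f(samples1[:, None], samples2[None, :]))  # both occluded
    return v_12 - v_1 - v_2 + v_0
```

For an additive model the derivative vanishes, while for a product model it reduces to the interaction term itself.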

Common challenges for classification
Here, we leverage our insights from the PredDiff discussion on classification in Sec. 2.2.3 and revisit the foundations of Shapley values within a classification setting. To the best of our knowledge, this topic has so far received scarce attention in the literature, and there is no rigorous argument for either measure, see [46] for possible choices. In the following, we introduce a novel argument based on the no-interaction property, which clearly favors a logarithmic value function.
As already stated previously, there is no fundamental rule connecting a classifier $q(c \mid x)$ to the Shapley value function $v_{c,x}(S)$. As for PredDiff, the occluded raw probabilities $q(c \mid x_S)$ are the most natural object to base the value function on. Drawing further inspiration from PredDiff, we propose to use logarithmic value functions, i.e.,
$$v_{c,x}(S) = \log_2 q(c \mid x_S). \qquad (15)$$
To demonstrate the benefits of this choice, consider a non-interacting classifier $q(c \mid x)$ with $\mathcal{F} = \{X_1, X_2\}$ and $p(x_1, x_2 \mid c) = p(x_1 \mid c)\, p(x_2 \mid c)$ [18,21], as a simplified version of Eq. (10). In this case, the attribution of feature $X_1$ is supposed to not depend on $q(c \mid x_2)$. Indeed, one easily derives the Shapley value for $X_1$,
$$\phi_1 = \tfrac{1}{2}\big[v(\{1\}) - v(\emptyset)\big] + \tfrac{1}{2}\big[v(\{1,2\}) - v(\{2\})\big] = \log_2 \frac{q(c \mid x_1)}{q(c)} + \frac{1}{2} \log_2 \frac{p(x_1)\, p(x_2)}{p(x_1, x_2)}, \qquad (16)$$
where we obtained the second equality by inserting the definition of the value function Eq. (15) and used Bayes' rule in conjunction with the non-interacting classifier ansatz to write $q(c \mid x_1, x_2) = \frac{q(c \mid x_1)\, q(c \mid x_2)}{q(c)} \frac{p(x_1)\, p(x_2)}{p(x_1, x_2)}$. Due to the occurrence of the logarithm, the second independent classifier $q(c \mid x_2)$ cancels from the final expression, as required. Hence, the Shapley values now independently rely on the respective classifier and the corresponding data distribution. The last term is inevitable and a consequence of the fact that predictors for different classes are inherently tied, see the discussion in Sec. 2.2.3.
It is worth stressing that such a cancellation does not take place upon using the value function from the regression setting, i.e., $v_{c,x}(S) = q(c \mid x_S)$ with $v_{c,x}(\emptyset) = q(c)$. Here, both independent classifiers interactively define the single-feature Shapley values. This value function relates Shapley values to differences of probabilities, see the first line of Eq. (16), which leads to difficulties, as for classification the notion of additivity relates to independent, hence factorizing, feature contributions. This clearly invalidates the use of the regression value function in the classification setting. As a further remark, other value functions such as $v_{c,x}(S) = \mathbb{E}_{x_{\bar S} \mid x_S} \log_2 q(c \mid x_{\bar S}, x_S)$, which have been used in related contexts [6], break the natural connection to the occluded raw probabilities but do not resolve this issue. We point out that this value function coincides with Eq. (15) if the expectation value is approximated by a single sample, as is conventionally done for Shapley values.
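The cancellation in Eq. (16) can be verified numerically for a discrete toy distribution with conditionally independent features; the joint table and all names below are illustrative:

```python
import numpy as np

def shapley_phi1(p, c, x1, x2):
    """Shapley value of feature x1 under the logarithmic value function
    v(S) = log2 q(c | x_S), for a discrete joint distribution p[c, x1, x2]."""
    p = np.asarray(p, dtype=float)
    q_c = p.sum((1, 2))[c]                        # q(c)
    q_c_1 = p.sum(2)[c, x1] / p.sum((0, 2))[x1]   # q(c | x1)
    q_c_2 = p.sum(1)[c, x2] / p.sum((0, 1))[x2]   # q(c | x2)
    q_c_12 = p[c, x1, x2] / p.sum(0)[x1, x2]      # q(c | x1, x2)
    v0, v1, v2, v12 = map(np.log2, (q_c, q_c_1, q_c_2, q_c_12))
    return 0.5 * (v1 - v0) + 0.5 * (v12 - v2)
```

For a joint of the form $p(c)\,p(x_1\mid c)\,p(x_2\mid c)$, the result agrees with the closed form of Eq. (16), in which $q(c \mid x_2)$ no longer appears.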
To summarize, the classification setting poses similar challenges in the Shapley value framework as for PredDiff, already at the level of two single features rather than entire feature sets as for PredDiff. The choice of the PredDiff relevance measure in Eq. (2) translates into the choice of the value function for Shapley values.

Consequences of no-interaction properties
In Appendix F.2, we explicitly evaluate the 'Shapley Interaction Index' w.r.t. the no-interaction property. Here, we summarize the main findings: For both regression and classification, the main issue is that Shapley values are obtained by aggregating attributions obtained from different conditional distributions. In the regression setting, one is directly left with differences of conditional distributions. Consequently, the no-interaction property can only be satisfied upon using an interventional (marginal) distribution. For classification, the no-interaction property induces an additional constraint on the classifier level, $p(x_s, x_t \mid c) = p(x_s \mid c)\, p(x_t \mid c)$, which is in general not satisfied. Thus, the 'Shapley Interaction Index' does not satisfy the no-interaction property in the classification setting.

Comparing PredDiff's interaction completeness relation vs. Shapley's completeness axiom
Equipped with insights from PredDiff interaction attributions, including the interaction completeness Eq. (6), and the intimate relation to Shapley values, it is worthwhile revisiting the Shapley completeness axiom, which was already briefly discussed in Sec. 2.1.

Off-manifold Evaluation
The completeness axiom enforces the Shapley formalism to use the complete set of coalitions. This generally leads to off-manifold evaluations of the underlying predictor. Here, using an interventional distribution leads to maximally off-manifold samples. In principle, this issue could be mitigated through the use of a conditional distribution. However, from a practical point of view, devising high-quality imputers, which produce on-manifold samples upon imputing a large fraction of input variables, remains very challenging, see Appendix M for explicit visualizations. Hence, for all practical purposes, this still leads to a certain degree of off-manifold evaluation and consequently unreliable attributions. In this respect, PredDiff takes the least invasive approach, as it only requires imputing the variables of interest, i.e., the feature set for which a user wants to compute the relevance.
Recovering Completeness
As a second aspect, we would like to stress that the completeness axiom is not lost within the PredDiff framework but recovered after including all interaction effects. Importantly, through explicitly including interaction effects, PredDiff can circumvent potential inconsistencies related to solely considering additive explanations, see Sec. 3.1 and [13]. This also motivates the phrasing interaction completeness property, Eq. (6) or Eq. (55), which decomposes the relevance into main effects and higher-order interaction effects. Here, we focus on the case of three variables $x_s$, $x_t$, $x_u$ and stress that this argument generalizes to more variables in a straightforward way. In this case, the interaction completeness relation Eq. (55) yields (with $r = \emptyset$)
$$\bar m^{f^s} + \bar m^{f^t} + \bar m^{f^u} + \bar m^{f^{st}} + \bar m^{f^{su}} + \bar m^{f^{tu}} + \bar m^{f^{stu}} = f(x_s, x_t, x_u) - f_0, \qquad (17)$$
where $f_0 = \int f(x_s, x_t, x_u)\, p(x_s, x_t, x_u)\, \mathrm{d}x_s\, \mathrm{d}x_t\, \mathrm{d}x_u$ equals the mean prediction. This implies that the sum of all-order PredDiff effects (left-hand side) yields the difference between actual and mean prediction (right-hand side). Hence, the completeness axiom is recovered upon including all PredDiff interaction terms. However, evaluating all $\binom{N}{1} + \ldots + \binom{N}{N} = 2^N - 1$ terms in the case of $N$ features becomes computationally infeasible for large $N$. This problem is well known from the Shapley value literature [36]. In addition, measuring higher-order interactions is potentially numerically unreliable. The advantage of PredDiff lies in the fact that it allows terminating at a given interaction order. The fact that the right-hand side of Eq. (17) agrees with the Shapley result reiterates that PredDiff, upon including interaction effects, represents a different way of combining Shapley terms and reveals the close relationship between both formalisms.
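The telescoping behind Eq. (17) can be checked by brute force for small $N$. The following sketch uses a marginal imputer and an inclusion-exclusion over occluded predictions; it illustrates the identity and is not the paper's estimator:

```python
import numpy as np
from itertools import chain, combinations

def all_order_effects(f, x, samples):
    """Brute-force PredDiff-style effects of all interaction orders for one instance x.

    F(S) is the Monte-Carlo prediction with all features outside S occluded by a
    marginal imputer (rows of `samples`); the effects are the inclusion-exclusion
    (Moebius) terms over F. Their sum telescopes to F(all) - F(empty), i.e.,
    prediction minus mean prediction, mirroring Eq. (17).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    def powerset(it):
        it = list(it)
        return chain.from_iterable(combinations(it, k) for k in range(len(it) + 1))
    def F(S):
        batch = samples.copy()
        batch[:, list(S)] = x[list(S)]   # keep features in S, impute the rest
        return float(np.mean(f(batch)))
    Fvals = {S: F(S) for S in powerset(range(n))}
    effects = {S: sum((-1) ** (len(S) - len(T)) * Fvals[T] for T in powerset(S))
               for S in Fvals if S}
    return effects, Fvals[()]
```

Summing all $2^N - 1$ effects reproduces $f(x)$ minus the mean prediction exactly, by construction, while non-interacting feature pairs receive a vanishing second-order effect.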

Favorable computational scaling and practical considerations
In this section, we comment on practical considerations for interaction measures and discuss the superior numerical scaling of PredDiff relevances and joint effects. In general, the analysis of feature interactions is inherently hindered by the combinatorics of combining all features. For $N$ features and a binary interaction measure, this scales as $\mathcal{O}(N^2)$. Here, this issue is circumvented by two effects:
1. Grouping features into semantically meaningful sets, e.g., superpixels obtained from classical segmentation algorithms as in [55], eases this problem significantly. Further, this renders relevances and interactions based on these feature sets more interpretable. This is trivially incorporated in PredDiff and is in principle also possible for Shapley value-based approaches [25].
[Table 1: comparison of scaling behaviors; exact Shapley value approaches exhibit exponential scaling ($\mathcal{O}(N!)$).]
[Table 2: PredDiff raw (left side) and shielded (right side) main and joint effects for $f(x_1, x_2) = x_1 \lor x_2$ and a uniform data distribution (up to a constant $1/4$). As a consequence of the interaction completeness relations Eq. (6) and Eq. (8), the column totals on the left side equal those on the right side.]
2. Another advantage of PredDiff joint effects is the application to a targeted subset of features. We are not bound to evaluate all possible feature combinations but can instead focus on specific features, e.g., selecting reference features with high feature relevance and investigating interactions among them and/or with all other features. We demonstrate such an approach in Sec. 3.4 and 3.5 for image applications. Alternatively, one could rely on heuristics to group and select interesting combinations of features, as is for example done in [24,53]. We propose such a procedure within a regression setting in Sec. 3.2. We stress that the previous considerations apply to all interaction measures and are not specific to PredDiff joint effects.
Next, we compare the computational cost of PredDiff to Shapley value-based approaches, which are its most direct competitors. We summarize the different scaling behaviors in Table 1. Due to the completeness axiom, Shapley value-based methods need to correlate all feature attributions. This is done either via sampling feature coalitions or by dealing with all features simultaneously (KernelSHAP). In contrast, PredDiff directly isolates feature attributions and therefore scales optimally with the number of features $N$. Note that the popular occlusion attributions are a one-shot approximation of PredDiff, i.e., they use a single model call per feature attribution [3,32,57]. Consequently, PredDiff relevances and interactions achieve the most favorable scaling possible for model-agnostic, perturbation-based approaches, i.e., they scale only with the number of imputations. We explicitly demonstrate PredDiff's computational advantage in Sec. 3.4.
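As a back-of-the-envelope comparison of model calls per explained instance (illustrative counting only, not the exact bookkeeping of any particular implementation):

```python
def preddiff_model_calls(n_sets, n_imputations):
    """Model evaluations for PredDiff relevances: one occluded batch per feature
    set of interest plus the original prediction; linear in n_sets."""
    return n_sets * n_imputations + 1

def exact_shapley_model_calls(n_features, n_imputations):
    """Model evaluations for exact Shapley values: every one of the 2**N
    coalitions requires an occluded value-function evaluation."""
    return 2 ** n_features * n_imputations
```

Even for a modest number of features, the exponential coalition count dwarfs the linear PredDiff budget.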

Analytic example
We revisit a famous example from [45,46], which has been used as an argument against approaches along the line of PredDiff. We consider two binary input variables $x_1$ and $x_2$ that are sampled uniformly, i.e., are subject to the data distribution $p(x_1, x_2) = \frac{1}{4}$. The function under consideration is $f(x_1, x_2) = x_1 \lor x_2$. For consistency with the literature, we work in a regression setting, but the same qualitative conclusions can be drawn from a classification setting.
The apparent paradox arises from the fact that the single-feature relevances vanish if the other, conditioned variable is set to 1. Explicitly, this means that $\bar m^{1} = 0$ if $x_2 = 1$, as the outcome of $x_1 \lor x_2$ is already completely specified for $x_2 = 1$. The same applies to $\bar m^{2}$ for $x_1 = 1$, from which [45,46] incorrectly conclude that neither $x_1$ nor $x_2$ is relevant for the prediction in this case. This apparent contradiction is resolved by incorporating interaction effects, see Table 2: Firstly, we note that all shielded main effects are positive (negative) for a feature value of one (zero). Secondly, the shielded joint effect is only positive in the exclusive-or combinations. In Appendix I, we additionally demonstrate that $x_1 \lor x_2$, $x_1 \land x_2$ and $x_1 \veebar x_2$ share the same shielded joint effects up to a constant factor, which has already been demonstrated on a global interaction level in [31]. This, at first sight, slightly unintuitive result illustrates the danger of inferring intuitive ground-truth relevances and interactions for seemingly simple functions.
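Since the data distribution is uniform over four points, the shielded side of Table 2 can be computed exactly; a short sketch:

```python
import numpy as np

def or_effects():
    """Shielded effects of f(x1, x2) = x1 OR x2 under the uniform distribution
    over {0,1}^2 (cf. Table 2); everything is exact, no sampling needed."""
    grid = [(a, b) for a in (0, 1) for b in (0, 1)]
    f = lambda a, b: float(a | b)
    mean_pred = np.mean([f(a, b) for a, b in grid])         # 3/4
    f_marg2 = lambda a: np.mean([f(a, b) for b in (0, 1)])  # marginalize x2
    f_marg1 = lambda b: np.mean([f(a, b) for a in (0, 1)])  # marginalize x1
    table = {}
    for a, b in grid:
        m12 = f(a, b) - mean_pred                 # occlude both features
        m1_shield = f_marg2(a) - mean_pred        # shielded main effect of x1
        m2_shield = f_marg1(b) - mean_pred        # shielded main effect of x2
        table[(a, b)] = (m1_shield, m2_shield, m12 - m1_shield - m2_shield)
    return table
```

The output reproduces the structure discussed above: shielded main effects are $\pm 1/4$ according to the feature value, and the shielded joint effect is positive exactly for the exclusive-or combinations.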

Regression: Synthetic datasets
This section aims to validate the definitions of both single-feature contributions and feature interactions based on a synthetic regression task. The main message we want to convey is that PredDiff successfully captures the relevant contributions for a model-agnostic interpretation.
We consider a synthetic dataset with four independent features $\mathcal{F} = \{X_1, X_2, X_3, X_4\}$, generated by a Gaussian distribution with mean zero. Additionally, we defined a target function combining sinusoidal, cubic and non-additive (absolute-value/sign) terms. At this point, we want to stress that this choice is rather arbitrary. However, we believe that the results and conclusions are generic and invite the reader to try different functional forms in the accompanying notebook. In this section, we present results for a Random Forest regressor trained on 3600 samples. In Fig. 1, we show a possible workflow for the interaction analysis: On the left side, the raw PredDiff attributions are shown. We observe that all individual contributions are recovered correctly, e.g., the sinusoidal or cubic functional form is immediately recognizable. Examining the two remaining features next, we observe additional structure superimposed onto the underlying raw additive feature contribution. In particular, the main effect for one of them shows two distinct branches, which clearly indicates the presence of an interaction. Importantly, single-feature attribution methods are restricted to this analysis depth. However, investigating the PredDiff joint effect of these two features allows us to go one step further. To this end, we show the shielded PredDiff attributions given by Eq. (8) on the right side of Fig. 1. In the top panel, the shielded main effects are shown. They correspond to the feature contribution without specifying the other feature. Consequently, only the pure additive feature contribution remains. In the lower panel, we show the color-encoded shielded joint effect of the two features. Here, we immediately recognize the absolute-value contribution of one feature combined with a sharp transition at zero induced by the sign operation. The latter also explains the spurious jump at the origin in the shielded main contribution of the sign-carrying feature: due to the sign operation, this feature has a monotonic positive or negative effect, and hence a part of the interaction term can solely be attributed to it, causing the discontinuity at the origin. Please note that this is in contrast to its interaction partner, for which all interaction contributions are removed and only the linear dependence remains. These results, both for main as well as joint effects, are in qualitative agreement with the results obtained with the Shapley Interaction Index [34], see Appendix J. Finally, in Appendix J.1, we also repeat the experiment with correlated Gaussian features. This setting reveals differences between both approaches: Whereas Shapley values distribute relevance evenly in the limiting case of perfectly correlated features, PredDiff single-feature relevances tend to zero in this case. This can be seen as a sign of higher reliability of PredDiff relevances, since positive/negative attributions are guaranteed to be caused by the model and are not inflated by the conditional dependence.
[Figure (NHANES): top left: ranking of the five most important features. Top center: relevance for the most important feature, age. Top right: relevance for the second most important feature, systolic blood pressure. Bottom left: ranking of the five most important feature interactions. Bottom center: interaction relevance between systolic blood pressure and age. Bottom right: interaction relevance between age and sex, revealing a pronounced age dependence.]
In summary, PredDiff has successfully disentangled all relevant contributions. Importantly, and in contrast to, e.g., the Shapley Interaction Index, there was no need to calculate all possible interactions. By manual inspection, we could select the relevant features and calculate the shielded effects with linear computational cost. From a more general point of view, this touches upon the problem of efficiently identifying interacting feature sets and potentially combining them in a hierarchical fashion, see [24,53] for approaches in this direction, which we leave as future work.
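The core mechanics of this workflow can be sketched in a few lines. The following is a minimal, self-contained illustration of PredDiff relevances on a synthetic regression setting with a marginal train-set imputer; the target function, model stand-in and all sizes are illustrative, not the exact setup used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four independent, zero-mean Gaussian features (sizes are illustrative).
X_train = rng.normal(size=(3600, 4))

# Hypothetical black-box model standing in for the trained Random Forest:
# additive terms plus one interaction between the last two features.
def model(X):
    return X[:, 0] ** 3 + np.sin(X[:, 1]) + np.abs(X[:, 2]) * np.sign(X[:, 3])

def relevance(model, X_train, x, features, n_imputations=200):
    """PredDiff m-value of a feature set: prediction at x minus its mean
    when the features are imputed from the (marginal) train distribution."""
    idx = rng.integers(0, len(X_train), size=n_imputations)
    X_imp = np.tile(x, (n_imputations, 1))
    X_imp[:, features] = X_train[idx][:, features]
    return model(x[None, :])[0] - model(X_imp).mean()

x = X_train[0]
r0 = relevance(model, X_train, x, [0])   # should track the cubic term
```

For the cubic feature, the relevance tracks the cubic term up to the (vanishing) marginal mean; the shielded effects discussed in the text additionally impute the respective partner feature.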

Regression: Real-world dataset NHANES
To demonstrate that PredDiff can also be applied to real-world regression datasets, we revisit the NHANES dataset [8], which was discussed at length in [34]. It is a healthcare dataset with 14,407 individuals. The prediction task is to infer the (log relative) risk of mortality based on 79 features. We train a Random Forest and compute relevances for all individual features via Eq. (4). Here, we use the (conditional) Mahalanobis imputer [1] and show results for the (marginal) train set imputer in Appendix K.
The results on feature relevances are shown in the top panel of Fig. 2. We infer global feature importances by computing the mean of the absolute relevance score for each feature across the whole test set. The three most important attributes agree with previous investigations based on the SHAP TreeExplainer [34], although the ordering of the features sex and systolic blood pressure is interchanged. Also, the single-feature relevances for the two most important continuous features, age and systolic blood pressure, are in qualitative agreement with earlier investigations. Next, we turn to interaction relevances in the bottom panel of Fig. 2. Similarly to the feature relevances, we assess the global interaction relevance from the mean absolute interaction relevances across the whole test set. Here, we consider all pairs of interactions between the five most important features identified in the first step. Systolic blood pressure and age, as well as age and sex, show pronounced interaction effects, and the corresponding interaction relevances on a per-sample basis are again in qualitative agreement with literature results [34].

Classification: MNIST
In the previous sections, we have established intuitive global interpretations using local PredDiff attributions. In this section, we move forward and analyze instance-wise attribution maps. To showcase the abilities of PredDiff, we use the MNIST dataset, which allows for an intuitive interpretation of the resulting attribution maps and is non-trivial both in terms of dataset size and input dimensionality. We train a fully-connected classifier (hidden layers of 1000 and 500 units) and achieve an accuracy of 97.8% after 10 epochs of training. To enforce a proper probabilistic interpretation, we calibrate the network using temperature scaling as proposed in [16]. This is the natural way of dealing with potential saturation issues without the need to adjust the original formalism as in [15].
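Temperature scaling as in [16] amounts to a one-parameter post-hoc calibration: divide the logits by a temperature T fitted on held-out data by minimizing the negative log-likelihood. A minimal sketch, using a simple grid search instead of a dedicated optimizer; all data here is synthetic:

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood under temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.1, 5.0, 200)):
    """Pick the single temperature parameter minimizing the held-out NLL."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic, artificially overconfident logits: calibration should pick T > 1.
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=500)
logits = 10.0 * (np.eye(3)[labels] + 0.5 * rng.normal(size=(500, 3)))
T = fit_temperature(logits, labels)
```

Since the softmax ordering is unchanged by T, accuracy is unaffected; only the confidence of the predicted probabilities is adjusted.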

Meaningful relevance and interaction attributions
We analyze this calibrated model using PredDiff in the next step. To this end, we obtain ∼50 superpixels via the Simple Linear Iterative Clustering (SLIC) algorithm [2]. However, PredDiff imposes no restrictions on this selection, see [50,55] for similar approaches and Appendix L for results with more fine-grained superpixels. Importantly, this flexibility is retained by our novel interaction measure. Here, we use a (conditional) variational autoencoder (VAE) imputer with pseudo-Gibbs sampling [38], see Appendix C for details. Additionally, we show results for a (marginal) train set imputer in Appendix L. In Fig. 3 (a) we show the corresponding attributions for four different digits, chosen as a representative subset of the complete test set. The attributions are visually reasonable: e.g., the characteristic white space for the four and the five is highlighted, as are the characteristic parts of the eight and the nine. This demonstrates PredDiff's ability to produce intuitively meaningful attributions.
In the next step, we analyze the interaction measure. To this end, we calculate the joint effect between all superpixels with respect to the superpixel with the highest relevance and show the resulting heatmap in Fig. 3 (b). The first thing to note is that the heatmaps are sparse and hence informative. Our measure clearly highlights intuitively related figure parts such as neighboring pixels. In contrast, if we measured the overall effect of both superpixels, the resulting heatmap would be blurry and covered up by the main effects. Additionally, we note that joint effects are particularly pronounced for meaningful combinations of superpixels. For example, consider the digit five: here, the enclosing corner is highly connected to the characteristic (reference) white space. This means that the model jointly leverages the information of both superpixels, i.e., a corner combined with an open white space is likely a five. Similar conclusions can be drawn from the other digits, e.g., the digit four is characterized by the centered white space enclosed by a vertical stroke.
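The raw joint effect underlying such heatmaps is the difference between the m-value of the union of two feature sets and the two individual m-values. A minimal numerical sketch with a toy model whose first two "patches" interact multiplicatively (model, sizes and sampling counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model over 6 "superpixels": patches 0 and 1 interact multiplicatively,
# the remaining patches enter additively.
def model(X):
    return X[:, 0] * X[:, 1] + X[:, 2:].sum(axis=1)

X_train = rng.normal(size=(2000, 6))

def m_value(x, features, n=500):
    idx = rng.integers(0, len(X_train), size=n)
    X_imp = np.tile(x, (n, 1))
    X_imp[:, features] = X_train[idx][:, features]
    return model(x[None])[0] - model(X_imp).mean()

def joint_effect(x, u, v):
    """Raw joint effect: m-value of the union minus the individual m-values."""
    return m_value(x, u + v) - m_value(x, u) - m_value(x, v)

x = np.ones(6)
ref = [0]                                         # reference patch
heatmap = np.array([joint_effect(x, ref, [j]) for j in range(1, 6)])
```

Only the multiplicatively interacting partner patch shows a pronounced joint effect; the additive patches produce entries close to zero, which is what makes the heatmaps sparse.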

Comparison to Shapley Interaction Index
We now compare the PredDiff joint effect to the Shapley Interaction Index [34], which directly builds upon the popular Shapley value concept. To this end, we use a custom Shapley implementation, which approximates the true Shapley values via subsampling coalitions. This approach was proposed in [45] for traditional Shapley values and can straightforwardly be extended to the Shapley Interaction Index. Importantly, our custom implementation allows comparing PredDiff vs. Shapley values based on identical (conditional or marginal) imputer distributions. To ensure a consistent comparison, we need to account for a global sign between the Shapley Interaction Index and the PredDiff raw joint effect, cf. Eq. (14) for further details.
Within this setting, we provide attributions for randomly selected examples in Fig. 4. We find qualitative agreement between PredDiff and Shapley value attributions, both in terms of relevances and in terms of interaction measures. Due to the close relationship between PredDiff and Shapley values, both approaches allow for similar qualitative insights. Interestingly, in most cases for which the relevances do not fully align, the difference between both heatmaps is at least partially compensated for by the corresponding PredDiff interaction effect. This observation aligns with the completeness axiom discussion in Sec. 2.3.4 and potentially allows for a low-cost Shapley approximation based on the interaction completeness relation; the latter needs to be investigated in a dedicated follow-up study. These findings are robust against using a (marginal) train set imputer, as shown in Appendix L. In summary, PredDiff joint effects are capable of extracting information on feature interactions in a scalable and model-agnostic fashion. Importantly, this kind of analysis can easily be extended to large-scale image datasets, see Sec. 3.5.
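For reference, the coalition-sampling idea of [45] can be sketched as a permutation-based Monte-Carlo estimator of Shapley values; the baseline-imputation variant below is a simplified stand-in for the actual value function used in the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def shapley_mc(model, x, baseline, n_perm=500):
    """Permutation-sampling Shapley estimator: average marginal contributions
    of each feature over random feature orderings; absent features are fixed
    to a baseline here instead of a learned imputer."""
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_perm):
        z = baseline.copy()
        prev = model(z[None])[0]
        for i in rng.permutation(d):
            z[i] = x[i]                      # add feature i to the coalition
            cur = model(z[None])[0]
            phi[i] += cur - prev
            prev = cur
    return phi / n_perm

# Sanity check on a linear model, where phi_i = beta_i * (x_i - baseline_i).
beta = np.array([1.0, -2.0, 0.5])
model = lambda X: X @ beta
phi = shapley_mc(model, np.ones(3), np.zeros(3))
```

For a linear model, every marginal contribution is identical across orderings, so the estimator recovers the exact Shapley values regardless of the number of sampled permutations.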

PredDiff's superior computational scaling
We now investigate the numerical fidelity of PredDiff and compare it to Shapley values. Previously, in Sec. 2.4, we theoretically established that PredDiff provides the optimal linear scaling in terms of the number of model calls. Fig. 5 experimentally supports this claim: therein, we compare the numerical convergence properties of PredDiff vs. Shapley values, both for relevance and interaction attributions. To this end, we first compute a numerically expensive, high-fidelity baseline attribution for both approaches using 1200 model calls. For this reason, we restrict ourselves to the (marginal) train set imputer in this particular experiment. However, as the previous comparison around Fig. 4 indicates, these findings straightforwardly generalize to other imputers. For interactions, we stick to the comparison established around Fig. 4 and calculate an interaction heatmap with respect to the superpixel of highest PredDiff baseline relevance. Subsequently, these high-fidelity baselines are compared to approximate heatmaps based on a smaller number of model calls. We measure the approximation fidelity via the cosine similarity between the flattened heatmaps. Consequently, a cosine similarity of one reflects optimal alignment, i.e., perfect convergence, whereas lower values indicate noisy attributions. From Fig. 5 it is clear that PredDiff attributions converge rapidly to the high-fidelity baseline (within roughly 50 model calls). In contrast, Shapley values do not fully converge and remain limited to a noisy approximation of the baseline. Importantly, these findings are independent of whether one considers relevance or interaction attributions. Arguably, this is fundamentally related to the necessity of sampling all possible coalitions, which is a possible source of numerical noise. In summary, PredDiff attributions are less noisy and effectively easier to access in real-world applications.
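The fidelity metric used here is straightforward to reproduce; a sketch of the cosine similarity between flattened heatmaps (the toy heatmaps below are placeholders):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened attribution heatmaps."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
baseline_map = rng.normal(size=(7, 7))           # stand-in high-fidelity heatmap
noisy_map = baseline_map + 0.1 * rng.normal(size=(7, 7))
fidelity = cosine_similarity(baseline_map, noisy_map)
```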
The previous results indicate that PredDiff rapidly converges towards its own high-fidelity baseline, but they do not allow for conclusions on the quality of the resulting attributions. Inspired by [9,20], we report in Appendix J.2 on a synthetic experiment where the ground-truth relevances are known by construction. Here, PredDiff main and joint effects show a considerably better overlap with the ground-truth attributions than sampled Shapley values, irrespective of the number of model evaluations.

Classification: CUB Birds
As a proof of concept demonstrating that PredDiff is applicable to high-resolution, real-world datasets, we present results on the CUB-200-2011 birds dataset [54]. More specifically, we finetune a vgg16 [43] model, pretrained on ImageNet, on the CUB dataset, while excluding a small number of overlapping samples from the CUB test set. For PredDiff, we work with superpixels determined using the Simple Linear Iterative Clustering (SLIC) algorithm [2] and with a (conditional) histogram imputer [55]. A dedicated study on the imputer dependence of the results is deferred to future work.
In Fig. 6, we show results for five randomly selected test set samples. In the top row, we visualize the two most positive (negative) raw joint effects for the three reference superpixels with the highest relevance. This is to be contrasted with the bottom row, where three random reference superpixels are chosen. First of all, the results reveal that interaction effects do exist. These cannot be captured by the predominantly used single-pixel attribution methods, which implicitly distribute them onto single-feature relevances [10]. As an interesting observation, the largest interactions occur between the individually most relevant superpixels. On the contrary, the interaction between random superpixels typically remains small; as one expects, these random superpixels do not show a joint effect on the model prediction. Strong interactions between spatially separated superpixels, which are visible in several examples in the top row, could be interpreted as signs of more complex reasoning patterns, which remain to be uncovered in detail in the future. We close by stressing that the direct measurement of interaction effects in large-scale datasets, such as the CUB dataset, is impossible with most competing attribution methods, which show a less favorable scaling compared to PredDiff.

Conclusion
In this work, we revisited PredDiff as a model-agnostic attribution method that is firmly rooted in probability theory. We carefully analyzed its theoretical properties and demonstrated its close relation to Shapley values. Both rely on the same foundations, but PredDiff only evaluates a minimal subset of the terms considered by Shapley values, which enables a favorable linear scaling behavior. The main focus of our investigation lies in the analysis of feature interactions. Here, we presented an interaction completeness property, which allows decomposing the relevance, for a given set of features, into main effects and joint (interaction) effects. Crucially, this enables a targeted in-depth analysis to substantially increase model understanding. Secondly, we shed new light on the foundations of model-agnostic interpretability methods for classification and proposed a novel argument based on the no-interaction property. In conclusion, the argumentation clearly favors logarithmic differences as the appropriate attribution measure, as they correctly disentangle the conditional classifier distribution from the underlying data distribution. We discussed consequences for both PredDiff and Shapley values. For the reader's convenience, we concisely summarize the main properties and advantages of PredDiff in Appendix A.
In our experiments, we demonstrated how interaction effects can resolve apparent paradoxes and lead to a better understanding of model behavior. Due to the favorable scaling of PredDiff, for both relevances as well as interaction measures, it is applicable in real-world scenarios. As a first step in this direction, we analyzed the interaction effects of an image classifier. The results clearly indicate that the classifier jointly exploits different image patches; such in-depth insights are not possible via conventional feature-wise attribution methods. The foundations laid out in this work pave the way towards systematic investigations of interaction effects in more realistic use-cases and datasets. From our point of view, a sensible next step in this direction would be a systematic study of the imputer dependence of both relevances and PredDiff joint interaction effects on a large image dataset such as ImageNet.

A. Summary: Main properties of PredDiff
A code repository to reproduce the experiments reported in the main text can be found at https://github.com/AI4HealthUOL/preddiff-interactions. We summarize the most important properties of PredDiff:
• Conceptual simplicity: For well-calibrated classifiers, PredDiff is deeply grounded in probability theory, see Eq. (1). Additionally, interaction effects provide a novel argument in favor of logarithmic differences as a relevance measure.
• Arbitrary feature sets: PredDiff can adaptively evaluate relevances for arbitrary sets of features. These relevances naturally include all interaction effects (i.e., are inherently non-additive).
• Error estimates: PredDiff provides an uncertainty estimate for relevances on a per-sample basis via bootstrapping.
• Imputation/On-manifold: The imputation process, which is a necessary component of all perturbation-based approaches, is completely transparent through an exchangeable imputer. In addition, using conditional rather than marginal probabilities for imputation alleviates the common problem of evaluating the classifier far from the data manifold.
• Linear Scaling: Most crucially for practical applications, both PredDiff relevances and interactions enjoy a linear scaling with the number of feature sets for which relevances/interactions are to be evaluated. The scaling coefficient can readily be adjusted by varying the number of imputations, see Fig. 7. Additionally, in practical applications, semantically meaningful feature combinations, rather than individual features themselves, are often the true objects of interest [55].
• Quantifying interaction effects: PredDiff provides a decomposition formula for relevances into main and joint effects, see Eq. (6) and Eq. (55) for the generalization beyond two feature sets, in the form of an interaction completeness property.
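As an illustration of the error-estimate property listed above, per-sample error bars can be obtained by bootstrapping over the imputation draws that already exist, without additional model calls. A minimal sketch with synthetic model outputs:

```python
import numpy as np

rng = np.random.default_rng(5)

def relevance_with_error(pred_x, pred_imputed, n_boot=1000):
    """m-value plus a bootstrap standard error: since the relevance is the
    prediction minus a mean over imputation draws, resampling those draws
    yields per-sample error bars without any additional model calls."""
    m = pred_x - pred_imputed.mean()
    boot = [pred_x - rng.choice(pred_imputed, size=len(pred_imputed)).mean()
            for _ in range(n_boot)]
    return m, float(np.std(boot))

# Toy model outputs under 100 imputation draws (illustrative numbers).
pred_imputed = rng.normal(loc=2.0, scale=1.0, size=100)
m, err = relevance_with_error(3.0, pred_imputed)
```

The error bar shrinks with the square root of the number of imputations, which makes the trade-off between statistical accuracy and computational cost explicit.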

B. Approximation using finite samples
The PredDiff relevances, Eq. (1), can be approximated by sampling from the respective conditional distributions, also for a potentially multidimensional feature set. As discussed in the main text, many perturbation-based attribution methods can be understood as single-sample approximations of PredDiff. In Fig. 7 we show that the general trend is easily recovered with few samples, but more samples are needed for high-fidelity attributions. Importantly, suppressed interaction signals are immediately visible via measuring the joint effect of features. In contrast to other attribution methods, PredDiff offers meaningful error bars without any additional overhead via bootstrapping. This is particularly important to balance the trade-off between statistical accuracy and computational cost. Turning to the interaction relevance, Eq. (6), we first consider a regression setting, for which the joint effect can be rewritten in a numerically more convenient form. This identity allows reusing imputations for every m-value evaluation and consequently reduces numerical noise significantly.
We now turn to a classification setting. Here, we are bound to explicitly intervene on the two feature sets and break their dependence. In Sec. 2.2.3 we propose to sample from the joint conditional distribution for all centered m-values in Eq. (6). For the main effects, one discards the redundant imputed features. For the joint effect, one intervenes and shuffles the imputations of the two feature sets, thereby sampling from the product of their conditional distributions.
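The shuffling intervention described above can be sketched as follows: given joint imputation samples for two feature blocks, independently permuting one block breaks their dependence while preserving the marginals (toy scalar blocks for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

def break_dependence(imputations_uv):
    """Given joint imputation samples (x_u, x_v) ~ q(x_u, x_v | x_rest),
    independently permuting the x_v column preserves both marginals but
    yields samples from the product q(x_u | x_rest) q(x_v | x_rest)."""
    shuffled = imputations_uv.copy()
    shuffled[:, 1] = rng.permutation(shuffled[:, 1])
    return shuffled

# Strongly dependent toy imputations for two scalar feature blocks.
u = rng.normal(size=5000)
joint = np.column_stack([u, u + 0.1 * rng.normal(size=5000)])
indep = break_dependence(joint)
```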

C. Imputation algorithms
In this work, we make use of the following imputation algorithms. Train Set Imputer: The Train Set Imputer uses randomly sampled instances from the training set to impute the respective values of the target features. This was among the imputers proposed in the original PredDiff publication [39]. In line with our discussion in Sec. 2.3.3, we employ a factorizing train set imputer distribution, i.e., each segment is imputed with an independent train set sample.
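A minimal sketch of such a factorizing train set imputer (interface and names are illustrative, not the repository's actual API):

```python
import numpy as np

class TrainSetImputer:
    """Marginal imputer: fills each target segment with the value of a
    randomly drawn training instance, one independent draw per segment."""

    def __init__(self, X_train, seed=0):
        self.X_train = np.asarray(X_train)
        self.rng = np.random.default_rng(seed)

    def impute(self, x, features, n_imputations):
        X_imp = np.tile(x, (n_imputations, 1))
        for f in features:                       # factorizing across segments
            idx = self.rng.integers(0, len(self.X_train), size=n_imputations)
            X_imp[:, f] = self.X_train[idx, f]
        return X_imp

imputer = TrainSetImputer(np.arange(20.0).reshape(10, 2))
X_imp = imputer.impute(np.array([100.0, 200.0]), features=[0], n_imputations=5)
```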
Mahalanobis Imputer: The Mahalanobis Imputer [1] can be seen as a generalization of the Train Set Imputer. It also returns training set samples of the respective features to be imputed but additionally provides a weighting factor. These weights are obtained from a kernel estimator based on the Mahalanobis distance.

Multivariate Gaussian Imputer: The Multivariate Gaussian Imputer samples from a multivariate conditional Gaussian distribution that is conditioned on the values of the features that are not to be imputed. In a PredDiff application in computer vision, a similar imputer was used in [59].
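The conditional sampling step can be sketched with the standard Gaussian conditioning formulas; this is a generic implementation, not the exact one referenced in the text:

```python
import numpy as np

def conditional_gaussian_samples(mu, cov, obs_idx, obs_val, n, seed=0):
    """Sample the features to be imputed from the Gaussian conditioned on the
    remaining features, via the standard Schur-complement formulas
        mu_a|b = mu_a + S_ab S_bb^{-1} (x_b - mu_b)
        S_a|b  = S_aa - S_ab S_bb^{-1} S_ba."""
    rng = np.random.default_rng(seed)
    a = np.setdiff1d(np.arange(len(mu)), obs_idx)        # indices to impute
    b = np.asarray(obs_idx)
    S_ab = cov[np.ix_(a, b)]
    S_bb_inv = np.linalg.inv(cov[np.ix_(b, b)])
    mu_cond = mu[a] + S_ab @ S_bb_inv @ (obs_val - mu[b])
    S_cond = cov[np.ix_(a, a)] - S_ab @ S_bb_inv @ S_ab.T
    return rng.multivariate_normal(mu_cond, S_cond, size=n)

mu = np.zeros(2)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
samples = conditional_gaussian_samples(mu, cov, [1], np.array([2.0]), n=20000)
```

With correlation 0.8 and an observed value of 2.0, the conditional distribution has mean 1.6 and standard deviation 0.6, so the imputations stay close to the data manifold.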
Variational Autoencoder with Pseudo-Gibbs Sampling: A trained variational autoencoder can be used for imputation by iteratively passing the sample through the encoder and decoder. After each iteration, the values of the features not to be imputed are restored. This procedure was shown to approximately sample from the desired conditional distribution [38]. In the MNIST example, we use fully connected encoders and decoders with hidden layers of 500 and 256 units.
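The pseudo-Gibbs loop itself is simple to sketch: iterate the autoencoder round trip and clamp the observed features back after each pass. The "autoencoder" below is a trivial stand-in; only the clamping logic mirrors the procedure of [38]:

```python
import numpy as np

def pseudo_gibbs_impute(autoencode, x, observed, n_iter=50, noise=0.1, seed=0):
    """Pseudo-Gibbs imputation: repeatedly pass the sample through the
    autoencoder round trip and restore the observed features to their
    true values after every iteration."""
    rng = np.random.default_rng(seed)
    z = np.where(observed, x, rng.normal(size=x.shape))  # random init for missing
    for _ in range(n_iter):
        z = autoencode(z) + noise * rng.normal(size=z.shape)
        z = np.where(observed, x, z)                     # restore observed features
    return z

# Trivial stand-in "autoencoder" that shrinks towards the data mean (1.0).
autoencode = lambda z: 0.5 * z + 0.5
x = np.array([3.0, 0.0])
z = pseudo_gibbs_impute(autoencode, x, observed=np.array([True, False]))
```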
Color Histogram Imputer: The Color Histogram Imputer was introduced in [55] and is based on sampling from the colors present in the image. To this end, one generates a histogram of all RGB values within an image and subsequently, imputes with a color sampled from this histogram, which is interpreted as a probability distribution. Importantly, the imputed patches are uni-color.
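A minimal sketch of this histogram imputer (treating every pixel as one histogram entry; the patch geometry is omitted for brevity):

```python
import numpy as np

def histogram_imputer(image, n_patches, seed=0):
    """Sample one uni-color RGB value per patch from the empirical color
    distribution of the image (every pixel is one histogram entry)."""
    rng = np.random.default_rng(seed)
    colors = image.reshape(-1, 3)
    idx = rng.integers(0, len(colors), size=n_patches)
    return colors[idx]

image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:2] = [255, 0, 0]                       # upper half red, lower half black
patch_colors = histogram_imputer(image, n_patches=100)
```

Each imputed patch receives a single color drawn in proportion to how often that color appears in the image.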

D.1. Properties of PredDiff relevances
We discuss basic properties of PredDiff relevances based on the five axioms investigated in [49]. In particular, these include the classic Shapley axioms [42]: completeness, linearity, symmetry and null player. Properties of attribution methods are typically investigated in a regression setting and rarely in a classification setting. The PredDiff formalism provides an explicit definition of the relevance in terms of calibrated class-wise output probabilities and therefore allows verifying properties explicitly in the classification setting.

Completeness/Efficiency/Additivity/Local accuracy
The completeness axiom states that the summed relevances of all individual features should yield the difference between the function value and a reference value, $f(x) = \phi_0 + \sum_i \phi_i$. In the PredDiff framework, relevances for individual features are not distinguished from those of arbitrary combinations of features. In particular, there is no global reference value, which is either set explicitly as for Integrated Gradients [49] or implicitly as for Shapley values. Instead, for every sample and feature combination there is a separate reference point for which the relevance vanishes. Note that the completeness axiom is satisfied for linear models with independent features, see Appendix E. Finally, it is worth stressing that the completeness axiom is recovered from PredDiff's interaction completeness property upon including all interaction effects, see Sec. 2.3.4 for a detailed discussion.

Sensitivity/Dummy/Null Player/Missingness
Consider a function $f(x_s, x_{\bar s}) = g(x_{\bar s})$ that does not depend on the features $x_s$. One directly finds that the prediction is unchanged upon imputing $x_s$, and hence the corresponding relevance vanishes, $\bar m = 0$: if $f$ does not depend on $x_s$, the relevance of $x_s$ is zero. This property holds both for classification and regression.

Linearity
For regression, one easily verifies linearity in the model function: the relevance of $\alpha f_1 + \beta f_2$ equals $\alpha$ times the relevance of $f_1$ plus $\beta$ times the relevance of $f_2$. In a classification setting, linearity in the output probabilities themselves is not a natural assumption and the property is also not satisfied. Note that even for factorizing functions, i.e., additive log-probabilities, the relevances in general do not decompose into two separate contributions.

Symmetry
For a function $f(x, y, z)$ that is symmetric with respect to exchanging $x$ and $y$, one easily verifies that the relevances coincide, i.e., $\bar m_x = \bar m_y$ if evaluated at $x = y$, provided that the data distribution $p(x, y, z)$ shares the same symmetry with respect to exchanging $x$ and $y$. The additional requirement on the data distribution is unavoidable for approaches that explicitly depend on the data distribution, as also realized in [23,48] in slightly different contexts. This property holds both for classification and regression.

Implementation Invariance
The relevance is trivially independent of the way the function is implemented, as PredDiff is model-agnostic and only depends on the model outputs.

D.2. Properties of PredDiff interaction relevances/joint effects
In this subsection, we discuss basic properties of the PredDiff interaction relevance.

No Interaction
In an additive regression setting, i.e., if $f$ decomposes into a sum of two terms $f(x, y, z) = g(x, z) + h(y, z)$, the joint effect between the variables $x$ and $y$ vanishes, $\bar m^{xy} = 0$. In a classification setting, we require a vanishing joint effect in the case of generalized informative conditional interactions, as specified in Eq. (10), where we additionally require a factorizing imputer distribution, i.e., $q(x, y\,|\,z) = q(x\,|\,z)\, q(y\,|\,z)$, see the discussion in Sec. 2.2.3. In this case, one can show that the joint effect vanishes, see Appendix H for a detailed derivation.
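The regression part of this property is easy to verify numerically: for an additive model, the sampled joint effect vanishes up to Monte-Carlo noise. A sketch with an illustrative additive function and a marginal imputer:

```python
import numpy as np

rng = np.random.default_rng(7)

# Additive model f(x, y, z) = g(x, z) + h(y, z): x and y do not interact.
def f(X):
    return np.tanh(X[:, 0] * X[:, 2]) + (X[:, 1] ** 2 - X[:, 2])

X_train = rng.normal(size=(4000, 3))

def m_value(x, features, n=4000):
    idx = rng.integers(0, len(X_train), size=n)
    X_imp = np.tile(x, (n, 1))
    X_imp[:, features] = X_train[idx][:, features]
    return f(x[None])[0] - f(X_imp).mean()

x = np.array([0.7, -1.2, 0.4])
joint = m_value(x, [0, 1]) - m_value(x, [0]) - m_value(x, [1])
```

Note that both summands may still interact with the third feature $z$; only the joint effect between $x$ and $y$ is required to vanish.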

Null Player
If $f$ does not depend on $x$, the null player property for PredDiff relevances implies that the relevance of $x$ vanishes and that the relevance of the combined set $\{x, y\}$ coincides with that of $y$ alone. Hence, the joint effect between $x$ and $y$ also vanishes. This property holds both for classification and regression.

Linearity

In a regression setting, one easily verifies that the joint effect is linear in the model function, in analogy to the linearity property for PredDiff relevances. As in that case, linearity is not a sensible assumption in the classification setting and is also not satisfied in the PredDiff formalism.

Symmetry
By construction, the interaction relevance is symmetric with respect to its arguments, i.e., $\bar m^{xy} = \bar m^{yx}$. This property holds both for classification and regression.

E. PredDiff for linear models and elementary multiplicative interactions
It is insightful to compute PredDiff relevances for linear models, i.e., $f(x) = \beta_0 + \sum_i \beta_i x_i$. For a given subset $s$ of features, one now straightforwardly evaluates m-values, with the complement set of features $\bar s$ evaluated at the sample point. This leads to centered m-values/relevances of the form $\bar m_s = \sum_{i \in s} \beta_i\,(x_i - \mathbb{E}[x_i\,|\,x_{\bar s}])$. For a single variable, i.e., $s = \{i\}$, this yields the relevance $\beta_i\,(x_i - \mathbb{E}[x_i\,|\,x_{\bar s}])$, which is in line with the expectation that for linear models the relevance should scale with the corresponding coefficient of the variable under consideration (after appropriate centering).

In particular, this implies that for linear models PredDiff also satisfies the completeness axiom, which is also the situation where it is most desirable. This also follows explicitly from Eq. (25), where we assumed independent features in order to obtain a constant reference value $\beta_0 + \sum_i \beta_i\, \mathbb{E}[x_i]$. It is worth noting that this very expression is also obtained within the formalism of Shapley values [28], which directly follows from the fact that Shapley values are uniquely characterized by satisfying sensitivity, linearity, symmetry and completeness.

We also consider the second explicit example from [28], the simplest multiplicative interaction $f(x) = \prod_i x_i$, again under the assumption of independent features as above. For a given subset of features, one again straightforwardly evaluates m-values. As before, for a single variable, i.e., $s = \{i\}$, this yields the relevance $(x_i - \mathbb{E}[x_i]) \prod_{j \neq i} x_j$. In particular, for centered variables, this reduces to $\prod_j x_j$, which, again, coincides with the result from the Shapley formalism [28] up to a global factor. However, contrary to the argumentation in [28], we do not see it as a contradiction that all features obtain the same relevance, as opposed to assigning a larger relevance to features with a larger absolute numerical value, since we are dealing with an inherent interaction effect that cannot be distributed in a simple fashion.
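Both closed-form results are easy to check numerically. The sketch below verifies, for an illustrative linear model with independent features, that single-feature relevances under a marginal imputer approach $\beta_i (x_i - \mathbb{E}[x_i])$ and that they sum to the completeness difference:

```python
import numpy as np

rng = np.random.default_rng(8)

# Linear model with independent features drawn around E[x_i] = 1.
beta = np.array([2.0, -1.0, 0.5])
f = lambda X: X @ beta + 3.0
X_train = rng.normal(loc=1.0, size=(50000, 3))

def m_value(x, features, n=50000):
    idx = rng.integers(0, len(X_train), size=n)
    X_imp = np.tile(x, (n, 1))
    X_imp[:, features] = X_train[idx][:, features]
    return f(x[None])[0] - f(X_imp).mean()

x = np.array([2.0, 0.0, 1.0])
rel = np.array([m_value(x, [i]) for i in range(3)])
expected = beta * (x - 1.0)              # beta_i * (x_i - E[x_i])
```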

F. Shapley values

F.1. Classification
We first give the Shapley values based on the regression value function, Eq. (13). Since all terms appear additively, it is not clear how one should leverage the multiplicative no-interaction property. Consequently, all single-feature contributions remain mixed, which clearly highlights the need for a special treatment of classification tasks.

F.2. Shapley interaction index
We now move forward and consider how interactions are explicitly treated in the Shapley formalism. In [34], the 'Shapley Interaction Index' is proposed, an interaction measure based on game theory [12], defined for pairs of distinct features. In this section, we restrict ourselves to evaluating this interaction measure with respect to the no-interaction properties introduced in Sec. 2.2.

Regression
Here, we consider the additive function $f(x, y, z) = h(x, z) + g(y, z)$, for which $x$ and $y$ are clearly non-interacting. In contrast to PredDiff, we need to restrict to a single remaining feature $z$ for an analytically tractable analysis. We then have two possible subsets $S \in \{\emptyset, \{z\}\}$, for which we can calculate the interaction contributions. We observe that using different imputer distributions has a non-trivial effect on the resulting attribution. It is clear that using an interventional (marginal) definition for the value function would resolve this problem and lead to vanishing interaction contributions.

Classification
We consider a classifier $p(c\,|\,x, y, z)$ that obeys the no-interaction property Eq. (10), i.e., $p(x, y\,|\,c, z) = p(x\,|\,c, z)\, p(y\,|\,c, z)$. The Shapley values are based on the classification value function Eq. (15). Otherwise we use the same setting as for regression, and the interaction contributions read $\delta(\emptyset) = \log_2 p(c\,|\,x, y) - \log_2 p(c\,|\,x) - \log_2 p(c\,|\,y) + \log_2 p(c)$ and $\delta(\{z\})$, the analogous expression with all probabilities additionally conditioned on $z$. The contribution $\delta(\{z\})$ is identical to the joint PredDiff effect up to a global sign and a slightly different conditioning. We start by discussing Eq. (36), where the first term vanishes due to the no-interaction property. The second term relates to the mutual information dilemma, which we discuss in Appendix H. However, unlike for PredDiff, for which only the analogue of Eq. (36) applies, the 'Shapley Interaction Index' produces a conditional independence condition with respect to all subsets $S$. In the given case, this means that $\delta(\emptyset)$ introduces two additional factorization conditions of the same form but with different conditioning sets. The second term could in principle be avoided upon using a fully factorizing, marginal imputer distribution, which potentially leads to off-manifold evaluations. However, one factorization condition remains as an additional constraint that has to be imposed for a non-interacting classifier in the Shapley case. In general, this condition is not fulfilled; thus, the 'Shapley Interaction Index' does not satisfy the no-interaction property in its most general form.

G. Anchored decomposition and interactions

G.1. Two-point interactions
In this section, we focus our discussion on the simplest non-trivial case, where we are interested in quantifying interaction effects between two sets of features $u = \{u_1, \dots\}$ and $v = \{v_1, \dots\}$ in the presence of the remaining features $w = \{w_1, \dots\}$. We aim to decompose the model function into terms that depend only on subsets of the set of feature sets $\{u, v, w\}$. The anchored expansion from [29], with an anchor point covering all features, yields such a decomposition: each term is only a function of the features contained in one subset and is obtained via projections that freeze the remaining features at their anchor-point values. It is the unique decomposition of this form that satisfies the annihilating property, i.e., each term vanishes whenever any of its arguments is evaluated at the anchor point. It can be shown that the decomposition is minimal, meaning that it never introduces unnecessary terms [29].
We can recombine the terms of the anchored decomposition, Eq. (37), into contributions depending on $u$, $v$ and $w$ alone, on pairs of these sets, and on all three sets together.

An alternative, related approach would be to use a functional ANOVA decomposition with the data distribution $p(x_u, x_v, x_w)$ as a weight to avoid off-manifold evaluation, which would in principle provide a similar decomposition. However, the projection would then require numerous high-dimensional integrations instead of mere function evaluations as in the case of the anchored decomposition, with additional complications in the case of correlated features [19]. Both issues prevent this approach from being widely applicable in real-world applications.
where we have used the definitions above. We note that main effects and joint effects are shifted between the terms upon varying the anchor point of the decomposition. We demonstrate this by evaluating them for $f(x_1, x_2) = x_1 + x_2 + x_1 x_2$ for independent features with $p(x_i) = \mathcal{N}(0, \sigma_i)$, where we find for the main effect of $x_1$ the anchored term $(1 + a_2)(x_1 - a_1)$. This illustrates that the expansion point $(a_1, a_2)$ allows shifting relevances between main effects and joint effects. This is a well-known effect that has been observed already in linear models with multiplicative interactions, see for example the discussion in [31]. Here, we argue that fixing the expansion point to the sample itself, i.e., $(a_1, a_2) = (x_1, x_2)$ in the example above, is the only consistent choice in the PredDiff formalism for the following reasons:

1. An evaluation point different from the sample itself is inconsistent with the original definition of PredDiff relevances, in the sense that relevances computed with and without conditioning on an additional feature subset no longer coincide when that subset carries no additional information, i.e., when $p(x_s \mid x_t) = p(x_s \mid x_t, x_r)$.
2. An evaluation point different from the sample itself requires evaluating the model off the data manifold. This is exemplified in Eq. (46), where the first summand is in general not contained in the data manifold. Note that the integral in the second summand involves a conditional probability that is not conditioned on all remaining features. This might still lead to an off-manifold evaluation in the case of strongly correlated features, which is, however, inevitable.
3. There is no other distinguished evaluation point apart from the sample itself. A different choice would require imposing a condition at the sample or the global level, necessitating additional optimization procedures that would most likely render the approach impractical for real-world applications.

This construction generalizes beyond three feature sets. Analogously to the case of three sets, we can decompose an arbitrary function following Eq. (42) (already evaluating at the anchor point $a = x$ for simplicity):

Table 3: PredDiff raw main and joint effects for $f = x_1 \wedge x_2$, $g = x_1 \vee x_2$ and $h = x_1 ⊻ x_2$ and a uniform data distribution (up to a constant $1/4$).
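The anchor dependence discussed above can be verified numerically. The following is a minimal sketch for the bivariate example $f(x_1, x_2) = x_1 + x_2 + x_1 x_2$; `anchored_terms` is a hypothetical helper implementing the anchored (cut-HDMR) decomposition of a bivariate function:

```python
def anchored_terms(f, x, a):
    """Anchored decomposition of a bivariate f around anchor a:
    f(x) = f0 + g1(x1) + g2(x2) + g12(x1, x2)."""
    f0 = f(*a)                     # constant term at the anchor
    g1 = f(x[0], a[1]) - f0        # main effect of x1
    g2 = f(a[0], x[1]) - f0        # main effect of x2
    g12 = f(*x) - g1 - g2 - f0     # joint (interaction) term
    return f0, g1, g2, g12

f = lambda x1, x2: x1 + x2 + x1 * x2
x = (1.0, 2.0)
# For this f, the main effect of x1 is (1 + a2) * (x1 - a1):
print(anchored_terms(f, x, (0.0, 0.0)))  # g1 = x1
print(anchored_terms(f, x, (0.0, 1.0)))  # shifting a2 moves part of x1*x2 into g1
```

Shifting the anchor in the second component visibly moves relevance from the joint term into the main effect of $x_1$, while the four terms always sum back to $f(x)$.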

G.3. Three- and n-point interactions
from which the PredDiff joint effect follows. This term is conventionally referred to as the local conditional mutual information. The local mutual information is closely related to the mutual information, which is obtained by taking its expectation value with respect to the joint distribution. It measures the joint information content of the two feature sets and vanishes if they are independent. Importantly, we cannot avoid dealing with this term by specializing to cases for which the local mutual information vanishes, e.g., conditionally independent data distributions.
The simplest way of achieving this is via $p(x_s \mid x_t, x_r) = p(x_s \mid x_r)$ or, equivalently, $p(x_s, x_t \mid x_r) = p(x_s \mid x_r)\, p(x_t \mid x_r)$ (requiring this for either $x_s$ or $x_t$ is sufficient). However, this renders either $x_s$ or $x_t$ uninformative for the prediction. We dub this the local mutual information dilemma: either we explicitly calculate the local mutual information, which is difficult in practice, or, alternatively, we break the feature dependencies and thereby inevitably evaluate the model off-manifold.
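The relation between local and global mutual information can be made concrete on a toy distribution. The following sketch uses a hypothetical joint distribution over two binary features (the probability values are chosen purely for illustration); the local mutual information can be negative pointwise, while its expectation, the mutual information, is non-negative and vanishes only for independent features:

```python
import math

# Hypothetical joint distribution over two binary features x1, x2.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {v: sum(p[(v, y)] for y in (0, 1)) for v in (0, 1)}  # marginal of x1
py = {v: sum(p[(x, v)] for x in (0, 1)) for v in (0, 1)}  # marginal of x2

def local_mi(x, y):
    """Local (pointwise) mutual information log p(x, y) / (p(x) p(y))."""
    return math.log(p[(x, y)] / (px[x] * py[y]))

# Mutual information = expectation of the local mutual information.
mi = sum(p[(x, y)] * local_mi(x, y) for x in (0, 1) for y in (0, 1))
print([round(local_mi(x, y), 3) for x in (0, 1) for y in (0, 1)], round(mi, 3))
```

For the uniform (independent) distribution $p(x_1, x_2) = 1/4$, every local term and hence the mutual information would vanish.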

I. AND, OR, XOR regression examples
We consider two binary input variables $x_1$ and $x_2$ that are sampled uniformly and independently, i.e., subject to the data distribution $p(x_1, x_2) = 1/4$. For the three functions $f(x_1, x_2) = x_1 \wedge x_2$, $g(x_1, x_2) = x_1 \vee x_2$ and $h(x_1, x_2) = x_1 ⊻ x_2$, we work out the raw and shielded PredDiff effects in Table 3 and Table 4.
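The raw effects of Table 3 can be reproduced by direct enumeration. The following is a minimal sketch; `relevance` is a hypothetical helper computing the raw PredDiff effect of a feature subset via exact marginalization over the uniform distribution, and the joint effect is the difference between the combined effect and the sum of single-feature effects:

```python
from itertools import product

def relevance(f, x, subset):
    """Raw PredDiff effect: f(x) minus the expectation of f with the
    features in `subset` marginalized over the uniform distribution."""
    vals = []
    for repl in product([0, 1], repeat=len(subset)):
        xp = list(x)
        for i, v in zip(subset, repl):
            xp[i] = v
        vals.append(f(*xp))
    return f(*x) - sum(vals) / len(vals)

AND = lambda a, b: a & b
OR  = lambda a, b: a | b
XOR = lambda a, b: a ^ b

for x in product([0, 1], repeat=2):
    for name, fn in [("AND", AND), ("OR", OR), ("XOR", XOR)]:
        m1, m2 = relevance(fn, x, [0]), relevance(fn, x, [1])
        m12 = relevance(fn, x, [0, 1])
        print(name, x, m1, m2, m12, m12 - m1 - m2)  # raw main and joint effects
```

Running this enumeration, the raw joint effects of AND and OR are pointwise proportional to that of XOR (with factors $-1/2$ and $1/2$, respectively), consistent with the discussion below.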
Because $x_1 \vee x_2$, $x_1 \wedge x_2$ and $x_1 ⊻ x_2$ share the same shielded joint effects up to a constant factor, and the shielded main effects vanish for $x_1 ⊻ x_2$, we can understand $x_1 \vee x_2$ and $x_1 \wedge x_2$ as versions of $x_1 ⊻ x_2$ modified by main effects, as already demonstrated in [31]. This result is slightly unintuitive at first and illustrates the danger of inferring supposedly intuitive ground-truth relevances and interactions for seemingly simple functions.

Table 4: PredDiff shielded main and joint effects for $f = x_1 \wedge x_2$, $g = x_1 \vee x_2$ and $h = x_1 ⊻ x_2$ and a uniform data distribution (up to a constant $1/4$).

J. Additional plots: synthetic dataset
For the reader's convenience, we present attributions for two alternative model categories: (i) a fully-connected neural network (Fig. 9) and (ii) a Gaussian process (Fig. 10).

J.1. Relevances in the presence of correlated features
We repeat the synthetic regression task with correlated features with unit variance and correlation $\rho = 0.7$. To avoid ambiguities due to model training, we directly use the analytic function Eq. (18). For correlated features, PredDiff and Shapley values show qualitatively different behavior. In the limit of perfectly correlated features, attributions are ambiguous without additional causal assumptions. In this setting, PredDiff single-feature attributions for the correlated features in question tend towards zero. In contrast, Shapley values distribute relevance evenly across all features. This can be seen as a sign of higher reliability of PredDiff relevances, since positive/negative attributions are guaranteed to be caused by the model. In this sense, PredDiff is true to the model and true to the data. In Fig. 12, the Shapley values for the correlated features are tilted in comparison to the uncorrelated setting in Fig. 8. For the Shapley Interaction Index, the same effect occludes the true interaction. In contrast, PredDiff attributions in Fig. 11 are structurally equivalent to the independent-feature setting, i.e., the functional form of the sine and the interaction are still clearly recognizable. However, this comes at the price of partially less pronounced attributions.
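The shrinking of PredDiff single-feature relevances under correlation can be illustrated with a minimal sketch. We assume a standard bivariate Gaussian with correlation `rho` and the toy function $f(x_1, x_2) = x_1 + x_2$ (not the function Eq. (18) of the paper); `conditional_relevance` is a hypothetical helper that uses a conditional Gaussian imputer, for which analytically $\bar m_1 = x_1 - \rho x_2$:

```python
import random
import statistics

def conditional_relevance(f, x1, x2, rho, n=20000, seed=0):
    """PredDiff relevance of x1 with a conditional (Gaussian) imputer:
    m1 = f(x1, x2) - E_{x1' ~ p(x1'|x2)}[f(x1', x2)], where for a standard
    bivariate Gaussian x1'|x2 ~ N(rho * x2, 1 - rho**2)."""
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5
    draws = [f(rng.gauss(rho * x2, sd), x2) for _ in range(n)]
    return f(x1, x2) - statistics.mean(draws)

f = lambda a, b: a + b
x = (1.0, 1.0)  # on the diagonal, where perfectly correlated samples lie
for rho in (0.0, 0.7, 0.99):
    print(rho, round(conditional_relevance(f, *x, rho), 2))
```

As $\rho \to 1$ on the diagonal, the relevance of the single feature $x_1$ tends to zero, mirroring the behavior described above.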

J.2. Comparing convergence using a white box regressor: PredDiff vs. Shapley values
In the following, we build on [9,20]. The overall target function is given by a sparse sum of pairwise feature interaction terms; note that arbitrary interactions generally induce corresponding main effects, see [31]. This task singles out a well-defined ground truth, both on the level of relevances and of pairwise interactions, i.e., binary masks on features and pairs of features. To obtain a more challenging task, we include additive white noise based on all non-contributing pairwise features with variance $\sigma^2 = 0.01$. Within this setup, we compute attributions for 200 random samples. The absolute values of main and interaction relevances are compared to the respective ground truth. As in [20], we base our analysis on precision and recall to assess whether all features identified as salient were in fact informative (precision) and whether all informative features were identified (recall), over a range of thresholds. To summarize the behavior in a single number, we choose the average precision score,⁴ which quantifies the area under the precision-recall curve, and for completeness also state the AUC-ROC score. We repeat this experiment three times to obtain error estimates. The results for varying computational costs (i.e., numbers of function evaluations) are summarized in Table 5. Since this model is inherently additive, PredDiff main effects perfectly recover the relevant features with minimal computational effort. In contrast, Shapley values need to sample many coalitions to reveal the simple underlying structure. Revealing the sparse interactive structure is challenging for both methods. However, PredDiff consistently outperforms Shapley values independently of the number of model calls.

K. Additional plots: NHANES

Figure 13: PredDiff (interaction) relevances for a Random Forest trained on the NHANES dataset using a (marginal) train set imputer [1], in qualitative agreement with existing methods [34]. Top left: ranking of the five most important features. Top center: relevance for the most important feature, age. Top right: relevance for the second most important feature, systolic blood pressure. Bottom left: ranking of the five most important feature interactions. Bottom center: interaction relevance between systolic blood pressure and age. Bottom right: interaction relevance between age and sex, revealing a pronounced age dependence.
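The average precision score used in the evaluation of Appendix J.2 can be sketched as a simplified self-contained helper (the attribution magnitudes and the binary ground-truth saliency mask below are hypothetical illustration values):

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve (AP): rank items by score
    and average the precision at each position of a true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap, n_pos = 0, 0.0, sum(labels)
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / rank  # precision at this recall level
    return ap / n_pos

# Hypothetical attribution magnitudes and ground-truth saliency mask.
scores = [0.9, 0.05, 0.8, 0.1, 0.02]
mask = [1, 0, 1, 0, 0]
print(average_precision(scores, mask))  # perfect ranking -> 1.0
```

In practice, library implementations such as scikit-learn's `average_precision_score` can be used instead of this minimal version.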

L. Additional plots: MNIST
We show attributions for a (marginal) train set imputer in Fig. 14 and Fig. 17. PredDiff relevance attributions are qualitatively very similar to those for the (conditional) VAE imputer in Fig. 3 and Fig. 4. Interactions, as measured by the joint effect, are also similar. However, the VAE joint effects are more pronounced and sparse, which makes them easier to interpret; e.g., consider digits four and nine in Fig. 3, for which all background attributions are removed. Overall, however, the important, highly interacting superpixels do not change. Additionally, we show attributions for the marginal and conditional imputers with more fine-grained superpixels in Fig. 15 and Fig. 16, respectively. To further highlight the qualitative differences between both imputers, we show example imputations in Fig. 18. As expected, the VAE imputations are more realistic but consequently less diverse. This is the reason for their more targeted attributions.

M. Additional plots: CUB Birds
We show attributions for a train set imputer in Fig. 19. We used the same number of imputations (100) as for the results based on the histogram imputer in Fig. 6. While PredDiff relevance attributions are qualitatively similar between the two imputers, the relevance is less concentrated on the central object for the train set imputer. The highest interaction effects with the most relevant superpixels as reference points are mostly similar between the train set and histogram imputers. Importantly, the observation that the interaction between random superpixels is small is confirmed for the train set imputer.

Figure 18: Imputed samples generated with a (marginal) train set imputer compared to a (conditional) VAE imputer for three independent patches. Digits are identical to Fig. 3.
Finally, in Fig. 20 we visualize the practical challenges for conditional imputers arising from imputing a large fraction of superpixels. This supports the argument of potential off-manifold model evaluations in these cases: PredDiff only requires a few imputed superpixels, whereas a typical Shapley coalition covers 50% of all features, which in practice leads to increasingly off-manifold samples.