Shapley values for feature selection: The good, the bad, and the axioms

The Shapley value has become popular in the Explainable AI (XAI) literature, thanks, to a large extent, to a solid theoretical foundation, including four"favourable and fair"axioms for attribution in transferable utility games. The Shapley value is provably the only solution concept satisfying these axioms. In this paper, we introduce the Shapley value and draw attention to its recent uses as a feature selection tool. We call into question this use of the Shapley value, using simple, abstract"toy"counterexamples to illustrate that the axioms may work against the goals of feature selection. From this, we develop a number of insights that are then investigated in concrete simulation settings, with a variety of Shapley value formulations, including SHapley Additive exPlanations (SHAP) and Shapley Additive Global importancE (SAGE).


I. INTRODUCTION
T HE problem of feature selection in Machine Learning (ML) constitutes selecting some subset S of a set F of |F | = d feature indices, such that the submodel formed from the features indexed by S will maximise some evaluation function C(S) of the submodel, while minimising a cost (or complexity), which is increasing in |S|. The model chosen by this procedure is the selected model.
A similar (and more general) problem -model selectionhas deep roots in computational statistics [1], where attention is paid to inferential nuances like quantification of uncertainty, significance testing, confounding predictors, collinearity, and the design of experiments. It was in this literature that the Shapley value was first applied to linear regression models, with its own history of discourse (see [2]- [7] and the more critical [8], which traces development to [9]- [11], with reinventions by [12] and [13]).
The Shapley value has, over recent years, become a popular method for interpretable feature attribution in fitted ML models (cf. [6], [14]- [26]), holding promise for the development of Explainable Artificial Intelligence (XAI). Attribution (or credit allocation), here, is the determination of the contribution by each feature to the performance of a model -often the selected model. The ML methods that stand out in terms of popularity are SHapley Additive exPlanations (SHAP) [15], [17], Shapley Effects [6] and Shapley Additive Global importancE (SAGE) [24], though the Shapley value itself carries a rich history of investigation in the context of game theory -Lloyd Shapley's 1953 seminal paper [27] has over 9000 citations, and the concept has attracted the attention of various Nobel prize winning economists [28]- [34].
Particular praise is given, in both the game theory and ML literature, to a small set of "favourable and fair" axioms, commonly known as efficiency, null player, symmetry and additivity, under which the Shapley value can be uniquely defined. We will introduce these axioms in section II-B. While much attention has been paid by game theorists and economists to interpreting and reformulating the axioms, and towards investigating axiom sets (and game formulations) that lead to Shapley value alternatives [28], [29], [31], we found comparatively little attention to these matters in ML [24], [35], [36]. There have, however, been a number of recent criticisms of the Shapley value in ML [37]- [39], suggesting to us that the Shapley value and its alternatives may be further developed considerably over the coming years.
This paper is not an exhaustive theoretical or empirical study of the worth of various Shapley value methods in ML, neither in general nor for feature selection. Our goal is to draw scrutiny towards the Shapley value axioms, and attention towards the generality of the game theoretic formulation. We do this in the specific context of feature selection, since this direction is particularly underdeveloped, and because in both academic and industrial settings we have encountered what we consider to be an over-reliance on axiomatic "guarantees" (e.g., of "fairness") when appropriating Shapley based feature attribution methods for feature selection (see, e.g., [40]- [49]). Through the arguments in this paper, the authors are convinced of two things: • The axioms do not in general provide any guarantee that the Shapley value is suited to feature selection, and may, in some cases, imply the opposite. • The relevance of the Shapley value to the feature selection task (indeed, to any ML task) is governed by the specific game formulation, and the justification of this relevance from the axioms is non-trivial.
In Section II, we introduce the Shapley value and game formulation (Section II-A), the axioms (Section II-B), we give a brief overview of feature selection in general (Section II-C), and then a brief overview of Shapley value feature selection (Section II-D). In Section III, we investigate the significance of the Shapley value axioms in the context of feature selection, introducing general "toy" examples to illustrate. Then, in Section IV we perform simulation studies on more concrete data sets, with various evaluation functions and game formulations, including the popular mean absolute SHAP and SAGE formulations.
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question. . . " -John Tukey [50].

A. The Shapley value
In Definition II.1, we define a Transferable Utility (TU) game. This definition captures the general scenario where a set of objects, denoted by F , has some associated evaluation, C(F ), and, for any subset S of F , the evaluation C(S) is also well-defined. This captures a typical feature selection scenario, in which F = {1, . . . , d} represents the indices of all features in the full model of dimension d, and S represents the indices of features in some submodel of dimension |S|. In a TU game, we are guaranteed that the worth of every submodel can be evaluated, as C(S).
In the following, we use 2 F to denote the set of all possible subsets of the objects F , which include F itself and the empty set, denoted by ∅.
Definition II.1 (TU game). A TU game is a pair (F, C), where F = {1, . . . , d} is a set of indices called players and the characteristic function C : 2 F → R assigns a non-negative real value C(S) to every coalition S ⊆ F . Furthermore, C assigns the value zero to the empty coalition ∅, i.e. C(∅) = 0.
For readability, we will often refer to C as an evaluation function, since it evaluates the worth of each coalition, and we will refer to the players as features.
The Shapley value (Definition II.3) is intuitively appealing for feature selection. At first glance, it extracts (or compresses) information from the evaluation function C, to assign a single value ϕ i representing the worth of each feature in the modelling task. Here, the meaning of worth depends strongly on the choice of evaluation function, and (less obviously) on the manner in which features are understood to be removed from the model (see [36] and [24,Section 4]). Generalising the terminology in [36], we refer to these choices as the Shapley value game formulation.
An attractive notion is that, with a suitable game formulation, the Shapley value could guarantee a principled method of feature selection. However, while such a method of feature selection is principled via the axioms that are discussed in Section II-B, it is, as we will see, the principles (or axioms) themselves that are not in general suited to feature selection, and must be scrutinised in both the context of the specific game formulation, and the context of the outcome that is desired from the feature selection task.
Central to the definition of the Shapley value is the notion of a marginal contribution, which can be understood as the amount by which the evaluation of a given submodel increases, upon introducing a given feature to the submodel.
Definition II.2 (Marginal contribution). The marginal contribution of feature i to submodel S is defined as the difference in evaluation when i is added to the submodel: Definition II.3 (The Shapley value). The Shapley value of feature i is defined as a weighted average over all marginal contributions by feature i. That is, over M i (S) for every subset S of F that excludes i.
The formula for the Shapley value, or at least that for its weights, may not immediately lend itself to intuition. Historically, much attention has been paid instead to the small set of axioms (in Section II-B) from which Definition II.3 can be uniquely derived. An axiom is a principle, usually taken to be self evident, from which other truths may be derived. The simple and intuitive nature of the Shapley value axioms encourages an assessment that the Shapley value is an explainable or interpretable approach to computing the importance of features. However, great care must be exercised in establishing the exact meaning of importance, both in the general Shapley value context and in the specific contexts in which the Shapley value is applied.

B. The Shapley value axioms
In keeping with the machine learning literature, and as a matter of preference, we present the following four axioms as a unique characterisation of the Shapley value. However, there are a number of alternative axiomatisations available that may provide varying levels of insight [28].
Axiom 1 (Efficiency). In a TU game (C, F ), the worth of the full model C(F ) is distributed in a lossless manner among the features: Axiom 2 (Null player). In a TU game (C, F ), if feature i contributes nothing to each submodel it enters, then its Shapley value is zero: Axiom 3 (Symmetry). In a TU game (C, F ), any two features i, j that play equal roles have equal Shapley values: Axiom 4 (Additivity). Given two TU games (C, F ), (K, F ), the Shapley value of feature i preserves addition of the evaluation functions: In Axiom 4, the notation ϕ i (C) denotes the Shapley value of player i using the evaluation function C, and the addition of two evaluation functions is defined as the natural (pointwise) addition (C + K)(S) = C(S) + K(S).
Axiom 5 (Balanced contributions). Let C i denote the game produced by restricting the feature set F to F \{i}. Then, To paraphrase [33], balanced contributions is a principle of fairness in cooperation. It states that every pair of features should share equally the gain (or loss) received from their cooperation.
In a TU game, Axioms 1-4 are sufficient to uniquely define the exact Shapley value. However, this value is only unique up to the choice of characteristic function and game formulation. Between such choices, the Shapley value will vary greatly. Also, it should be noted that in many practical settings, the Shapley value is only approximated, since the complexity of (1) is exponential in the number of features (see, e.g., [15] and [24, Section 6.2]).
In algorithmic feature selection the search space generally involves 2 d submodels. Existing feature selection methods all take some approach to avoiding the exhaustive search of the model space.

C. Feature selection in general
If an evaluation function C is used for feature selection, it should correspond to a specific goal. Feature selection may target a number of typically exclusive goals, between which there is generally understood to be a trade-off in performance. These goals include, but may not be limited to, succinctly describing a data generating process; improving the predictive performance of the model; maximising the overall significance or power of the model, or of its parameters, with respect to a hypothesis; and producing a more cost-effective or computationally efficient predictor. There are two prominent categories of feature selection techniques identified by the surveys of [51], [52]: • Wrapper methods evaluate a number of trained submodels selected via sequential (e.g., stepwise forward/backward) elimination, or via a heuristic search algorithm. While these model-driven evaluations are extrinsic to the data, they are native to the specific modelling task. • Filter methods are general model-independent frameworks that avoid the computational burden of model training, present in wrapper methods. These methods rank features via empirical estimates of intrinsic properties of the data, such as covariance or mutual information. Both of the above methods can be applied in the context of Shapley values, given appropriate choices for the evaluation function and game formulation (see Section II-D). Regardless of the specific goal, the feature selection task is motivated by the notion that a submodel exists that is of higher value than the full model, in the sense of Definition II.4.
Definition II.4 (Monotonicity). In a TU game (F, C), the evaluation function C : 2 F → R is called monotonic when C(S) does not decrease whenever new features are added to S.
Popular examples of non-monotonic evaluation functions in feature selection are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), often used in conjunction with a stepwise procedure [1].

D. Shapley values for feature selection
The following is the simplest general Shapley value feature selection procedure. An alternative to the final step in Algorithm 1 is to select features i for which ϕ i lies above some threshold. Upon review of the literature, we found a number of articles suggesting to use a variant of Algorithm 1 [44], [45], [48], [49], two completely applied uses of Algorithm 1 [46], [47], and one paper critical of the algorithm [53].
Alternatives to Algorithm 1 are found in [40]- [43]. In [40], [41] the considered coalition sizes are restricted, and a stepwise selection procedure is performed. A genetic algorithm is described in [42]. In [43], the problem of model averaging (see Section III-A) is discussed, and it is claimed that the method avoids the influence of unselected features via a decomposition of the Shapley value into high-order interaction components. We do not investigate these methods, but we expect that controlling the coalition sizes may have a positive impact, at least on the problem considered in Example 2 and Section IV-C.

III. THE MEANING OF THE AXIOMS
In this section we use "toy" examples to interrogate some general consequences of the axioms, and to gain insight into how feature selection may be impacted by them. The insights discussed in this section are too general for drawing conclusions about the value of any specific Shapley value feature selection procedure, in practice.

A. The meaning of model averaging
As a consequence of Axioms 1 and 4 (efficiency and additivity), Shapley values are a model averaging procedure, being the weighted average (1) of marginal contributions. In statistics, model averaging has been used to combine the strengths of several candidate selected models [1]. These candidate models may arise from perturbations, e.g., when the result of a model selection procedure is recognised to be sensitive to sample effects or other conditions, or in circumstances where there is no clear optimum of the evaluation function. In any case, model averaging procedures are traditionally not used for feature selection, but to give a weighted average of estimates or predictions associated with a number of different models.
There is a compelling reason for caution around the direct use of model averaging for feature selection: The average performance of a feature across all submodels may not be indicative of the particular performance of that feature in the set of optimal submodels. Ideally, one would select all features explicitly on the weight of their contribution to submodels that are optimal. We illustrate this with the following example.
Example 1 (Taxicab payoff ). Consider a game with d = 3 players and "taxicab" style payoff with characteristic function given by if 3 ∈ S and 2 ∈ S, 3, if 2, 3 ∈ S and 1 ∈ S, 0, otherwise. ( This game can be pictured as one taxicab ride, where the homes of players 1 and 2 lie on route to the home of player 3. In Example 1, players 1 and 2 are not "null players", in the sense of the Axiom 2, but they contribute nothing to the optimal model. As the example demonstrates, performance of features may increase within submodels due to an interaction with a dominant feature. Averaging across all submodels takes these superfluous performances into account, when ideally the presence of the dominant feature should be fixed. Indeed, a possible solution to this problem may be to identify and fix the presence of such dominating features, prior to computing the Shapley value -though at this stage feature selection may no longer be required. A concrete manifestation of Example 1 is explored in Section IV-D.

B. The meaning of efficiency
Axiom 1 (efficiency), states that the evaluation of the full model, C(F ), should be distributed losslessly amongst the features. This axiom narrows the scope of possible model averaging procedures to those that treat the full model, not the selected model, as the final outcome. When the objective function is non-monotonic, this becomes especially distinguished from the problems discussed in Section III-A. Since the full model is not generally the target, non-monotonic evaluation functions imply that efficient payoffs may waste value.
Non-monotonic evaluation functions (Definition II.4) are those that may decrease as the number of features increases, i.e., there exist submodels T, S such that C(T ) < C(S) for some T ⊃ S. This means, for example, that we may have for some S ⊂ F. In this case, the Shapley values do not sum to the maximum possible value of the feature set, over all subsets of features. In other words, the Shapley values sum to the payoff for the full model, but they don't sum to a payoff that is optimal. Note that, at least for the exact Shapley value, evaluation of all 2 F submodels is an intermediate step in the calculation. From these evaluations, the optimal model can be computed. In practice, for some approximation to optimal efficiency, it may be preferable to estimate the optimal model prior to approximating the Shapley value -at which stage feature selection is no longer required.

C. The meaning of balanced contributions
Axiom 5 (balanced contributions), can be substituted for Axioms 2-4 (null player, symmetry and additivity). Axiom 5 captures a notion of fairness in the rewards of cooperation between players in a TU game. The symmetry axiom alone may be undesirable in certain contexts. In the case of two strongly correlated features, the symmetry axiom dictates that both should receive approximately the same attribution. However, in feature selection, the high correlation implies that one of the two features is redundant. Here, we regard a feature X i as redundant in the presence other features X = {X j , j ∈ S}, if feature X i is conditionally independent of Y , conditional on X , for which we write Y ⊥ ⊥ X i |X . Redundancy is investigated in more precise contexts in Sections IV-A and IV-B.
A second consequence that may be attributed to Axiom 5, is that the earnings received by a coalition, after discounting the earnings of its subcoalitions, are shared equally amongst the players in that coalition. To better understand this, see the formulation of the Shapley value in terms of Harsanyi dividends [28], which we do not enter into here. In particular, from (1), a player's contribution to teams of size k is averaged across all d k teams of size k. Since the binomial coefficient decreases in |k − d/2| for fixed d, a player's single contribution to a team at the extremes of the spectrum of team sizes will be weighted higher than to team sizes close to d/2. As a consequence, poor performance of a feature on particularly small submodels, such as the singleton models, may be weighted highly even if such small submodels are not attractive for the model selection task. We illustrate this in Example 2 Example 2. One should be wary of deciding that the features with high Shapley values are also the strongest contributors to model performance. Consider the scenario with d = 3 and a "secret holder" style payoff where player 1 is alone worthless, but has the "secret" that endows any team with the maximum possible payoff. Specifically, suppose we have the payoff lattice in Figure 1 we would do well to discard player 2 or 3. This example is investigated further in Section IV-C.

IV. EXPERIMENTATION
In this section, where the Data Generating Process (DGP) is known to us, we investigate, through simulation, a number of situations in which the results produced by Algorithm 1 may be undesirable. In Sections IV-A and IV-B, we reproduce the results of [53, Theorem 8 and 9] and extend them via simulation to include the SHAP FSelection formulation [48] (i.e., feature selection by ranking mean absolute SHAP values of model predictions), and the SAGE formulation. Accurate definitions of SHAP and SAGE methods are very detailed, so rather than reproducing them here, we encourage the reader to access [17] (SHAP) and [49] (SAGE).

A. Markov boundary experiment 1
In this experiment we consider a scenario where we wish to predict a response variable Y from four features X 1 , X 2 , X 3 , Z, with the following DGP suggested in [53,Theorem 8], where ε, γ ∼ N (0, 4) introduce irreducible uncertainty into the relationships defining Y and Z, respectively, via normal perturbations with means zero and variances 4. The causal graph is shown in fig. 2a, where X 1 , X 2 , X 3 → Y and X 1 , X 2 , X 3 → Z. Regardless of any implied causal relationships, we can interpret it as a case where Z is separated from Y via two terms of uncertainty (ε and γ), while the set {X 1 , X 2 , X 3 } is separated from Y by only one term of uncertainty (ε). It follows that Y is conditionally independent of Z, given the values {X 1 , X 2 , X 3 }, but not vice versa. Furthermore, {X 1 , X 2 , X 3 } is the minimal set with the property that the remaining features are conditionally independent of Y . Thus, X 1 , X 2 , X 3 are referred to as Markov boundary members of the DGP. However, as also shown by [53], when we employ the R 2 evaluation function, the Shapley values are The three features X 1 , X 2 , X 3 , the Markov boundary members, all have smaller Shapley values than Z, the non-Markov boundary member. This is not a peculiarity of that particular game formulation. Simulating a data set with sample size n = 10 6 from DGP (3), and training an XGBoost regression model, the SHAP FSelection method sorts the features as (Z, X 1 , X 2 , X 3 ) with corresponding mean absolute SHAP values (1.4, 1.1, 1.1, 1.1). In both cases, a top-3 feature selection procedure will select (Z, X 1 , X 2 ), rather than the preferred triple (X 1 , X 2 , X 3 ), thus introducing an unnecessary term of uncertainty (γ) into the model.
The SAGE values, on the other hand, which compute a global Shapley value of model loss, representing the predictive power associated with each feature in a model, produce feature importance scores of (1, 4, 4, 4) for the features (Z, X 1 , X 2 , X 3 ). Thus, while the SHAP FSelection and R 2 formulations produce poor results for model selection, the SAGE values successfully highlight the appropriate features in this scenario.
Here, despite introducing unnecessary irreducible error into the model (see Figure 2b), the non-Markov boundary variable X 1 is given the highest preference. Simulating a data set of sample size n = 10 6 , and training a simple XGBoost classification model, for = 0.05, the SHAP FSelection method also sorts the features as (X 1 , X 2 , X 3 ), with respective mean absolute SHAP values (1.43, 0.50, 0.40). On the other hand, SAGE sorts the features as (X 2 , X 3 , X 1 ) with values (0.0797, 0.0787, 0.0003). Thus, as in Section IV-A, the SAGE values sidestep the issues of the SHAP FSelection and m formulations, instead producing a correct ranking for feature selection.
To determine the relationship of the parameter ∈ (0, 1) to the pathology, we simulate 20 more data sets, equally spaced on the grid 0.05 ≤ ≤ 0.95, and calculate SHAP FSelection and SAGE values for each value of . The variation of the SAGE values with is shown in fig. 3a. Calculating the differences ϕ 1 − ϕ 2 in Shapley values of the variables X 1 and X 2 , for the SHAP, SAGE and m formulations, yields fig. 3b. Similar behaviour is realised in the differences ϕ 1 − ϕ 3 . From this, we see that SAGE performs admirably over the investigated parameter space, while the SHAP and m formulations perform poorly for approximately | − 1/2| > 0.3.
The features X i , i ∈ {1, 2, 3} are generated as independent standard normal random variables. From (6), we simulate a total of 6561 data sets of sample size n = 1000, producing one data set for each position on the 81 × 81 parameter grid (t 1 , t 2 ) with −2 ≤ t i ≤ 2 and increments of length 0.05. On each data set we compute Shapley values for the conditional log likelihood evaluation function where θ is estimated via least squares, using the closest submodel of the true model (6), given the vector x S of features indexed by S. In Figure 4 we present the results of highlighting all (t 1 , t 2 ) such that the characteristic function is pathological in the sense of matching Example 2. From this we see that pathologies occur at approximately t 2 = ±(|t 1 | + α) for 0 < α < 0.4.   Choosing t 1 = 2, t 2 = 2.2, we compute the characteristic function for (7): Thus, although the results are quite close, we see that X 2 and X 3 are favoured in feature selection, despite X 1 being the "secret holder" and a member of both optimal submodels of size 2.
The corresponding mean absolute SHAP values are (0.97, 1.52, 1.48) and the SAGE values are (4.6, 5.8, 5.9). SHAP and SAGE disagree regarding whether feature X 2 or X 3 is the most important, but neither method gives precedence to X 1 , which is in fact most important.

D. A taxicab experiment
The following scenario is a concrete realisation of Example 1. Consider d variables generated as where a 1 < a 2 < · · · < a d . The generative model is Y = max{X 1 , X 2 , . . . , X d } + ε , where ε ∼ N (0, 1). The predictive model that we apply to this data is the correct model, which is simplŷ Y = max{X 1 , X 2 , . . . , X d } .
We evaluate this model using as evaluation function the following difference between the mean squared errors We see that the scenario in Example 1 has been generated in this more concrete setting.

V. DISCUSSION
Our investigations of the axioms in Section III prompted a number of experiments on potentially pathological DGPs, presented in Section IV. In each experiment, we were able to define a reasonable evaluation function and game formulation for which the Shapley value behaved in an undesirable way for feature selection. When these experiments were applied to the mean absolute SHAP (i.e., SHAP FSelect) and SAGE formulations, the former performed poorly in all of the three experiments (Sections IV-A to IV-C), while the latter performed favourably in two out of those three experiments.
Our results confirm that the axioms do not in general provide any guarantee that the Shapley value is suited to feature selection, and may in some cases imply the opposite. However, more importantly, the relevance of the Shapley value to the feature selection task (indeed, to any ML task) is governed by the specific game formulation, and the justification of this relevance from the axioms is non-trivial. Crucially, it is the authors' perception that abstract general axioms presented as "favourable and fair" may introduce a significant potential for magical thinking [54] in XAI. It is our intention to caution against this, and to emphasize the nuance in any application of Shapley values.
There are a large variety of alternatives to the Shapley value provided in the literature on game theory [28]. As an example, [28, p.55] suggests (under Games with Hierarchies), an inessential player axiom, dictating that players that are not essential should receive zero value.
Future work is needed to thoroughly explore the application of game theoretic solution concepts to feature selection and attribution. Extensive empirical studies are needed to understand the game formulations and axioms that are suited to the large variety of practical and frequently occurring feature selection tasks in XAI and ML.