On the Robustness of Sparse Counterfactual Explanations to Adverse Perturbations

Counterfactual explanations (CEs) are a powerful means for understanding how decisions made by algorithms can be changed. Researchers have proposed a number of desiderata that CEs should meet to be practically useful, such as requiring minimal effort to enact, or complying with causal models. We consider a further aspect to improve the usability of CEs: robustness to adverse perturbations, which may naturally happen due to unfortunate circumstances. Since CEs typically prescribe a sparse form of intervention (i.e., only a subset of the features should be changed), we study the effect of addressing robustness separately for the features that are recommended to be changed and those that are not. Our definitions are workable in that they can be incorporated as penalty terms in the loss functions that are used for discovering CEs. To experiment with robustness, we create and release code where five data sets (commonly used in the field of fair and explainable machine learning) have been enriched with feature-specific annotations that can be used to sample meaningful perturbations. Our experiments show that CEs are often not robust and, if adverse perturbations take place (even if not worst-case), the intervention they prescribe may require a much larger cost than anticipated, or even become impossible. However, accounting for robustness in the search process, which can be done rather easily, allows discovering robust CEs systematically. Robust CEs make additional intervention to contrast perturbations much less costly than non-robust CEs. We also find that robustness is easier to achieve for the features to change, posing an important point of consideration for the choice of what counterfactual explanation is best for the user. Our code is available at: https://github.com/marcovirgolin/robust-counterfactuals.


Introduction
Modern Artificial Intelligence (AI) systems often rely on machine learning models such as ensembles of decision trees and deep neural networks [1,2,3], which contain from thousands to billions of parameters. These large models are appealing because, under proper training and regularization regimes, they are often unmatched by smaller models [4,5]. However, as large models perform myriads of computations, it can be very difficult to interpret and predict their behavior. Because of this, large models are often called black-box models, and ensuring that their use in high-stakes applications (e.g., in medicine and finance) is fair and responsible can be challenging [6,7]. The field of eXplainable AI (XAI) studies methods to dissect and analyze black-box models [8,9] (as well as methods to generate interpretable models when possible [10]). Famous methods of XAI include feature relevance attribution [11,12], explanation by analogy with prototypes [13,14], and, of focus in this work, counterfactual explanations. Counterfactual explanations enable reasoning by contrast rather than by analogy, as they show in what ways the input given to a black-box model needs to be changed for the model to make a different decision [15,16]. A classic example of counterfactual explanation is: "Your loan request has been rejected. If your salary was 60 000$ instead of 50 000$ and your debt was 2500$ instead of 5000$, your request would have been approved." A user who obtains an unfavourable decision can attempt to overturn it by intervening according to the counterfactual explanation.
Normally, the search for counterfactual explanations is formulated as an optimization problem (see Sec. 2.1 for a formal description). Given the feature values that describe the user as starting point, we seek the minimal changes to those feature values that result in a point for which the black-box model makes a different (and oftentimes, a specific favourable) decision. We wish the changes to be minimal for two reasons: one, to learn about the behavior of the black-box model for a neighborhood of data points, e.g., to assess its fairness (although this is not guaranteed in general, see e.g., [17]); two, in the hope that putting the counterfactual explanation into practice by means of real-life intervention will require minimal effort too. For counterfactual explanations to be most useful, more desiderata than requiring minimal feature changes may need to be taken into account (see Sec. 9) [18].
In this paper, we consider a desideratum that can be very important for the usability of counterfactual explanations: robustness to adverse perturbations.
By adverse perturbations we mean changes in feature values that happen due to unforeseen circumstances beyond the user's control, making reaching the desired outcome no longer possible, or requiring the user to put more effort than originally anticipated. These unforeseen circumstances can have various origins, e.g., time delays, measurement corrections, biological processes, and so on. For example, if a counterfactual explanation for improving a patient's heart condition prescribes lowering the patient's blood pressure, the chosen treatment may need to be employed for longer, or even turn out to be futile, if the patient has a genetic predisposition to resist that treatment (for more examples, see Sec. 5.1 and choices made in the coding of our experiments, in robust_cfe/dataproc.py).
We show that, if adverse perturbations might happen, one can and should seek counterfactual explanations that are robust to such perturbations. A particular novelty of our work is that we distinguish between whether perturbations impact the features that counterfactual explanations prescribe to change or to keep as they are (note that some features may be irrelevant and can be changed differently than how prescribed by a counterfactual explanation; we address this in Sec. 2.3). This is because counterfactual explanations are normally required to be sparse in terms of the intervention they prescribe (i.e., only a subset of the features should be changed), for better usability (see Sec. 2.1). As will be shown, making this distinction allows us to improve the effectiveness and efficiency with which robustness can be accounted for. Consequently, one might need to consider carefully which counterfactual explanation to pursue, based on whether it is robust with respect to the features to change or to those to keep as they are. In summary, this paper makes the following contributions:
1. We propose two workable definitions of robustness of counterfactual explanations that concern, respectively, the features prescribed to be changed and those to be kept as they are;
2. We release code to support further investigations, where five existing data sets are annotated with perturbations and plausibility constraints that are tailored to the features and type of user seeking recourse;
3. We provide experimental evidence that accounting for robustness is important to prevent adverse perturbations from making it very hard or impossible to achieve recourse through counterfactual explanations, when adverse perturbations are sampled from a distribution (i.e., they are not necessarily worst-case ones);
4. We show that robustness for the features to change is far more reliable and computationally efficient to account for than robustness for the features to keep as they are;
5. Additionally, we propose a simple but effective genetic algorithm that outperforms several existing gradient-free search algorithms for the discovery of counterfactual explanations. The algorithm supports plausibility constraints and implements the proposed definitions of robustness.

Preliminaries
In the following, we introduce preliminary concepts for reasoning about robustness of counterfactual explanations in a sparse sense. In particular, we (i) describe the problem statement of searching for counterfactual explanations, (ii) present the notions of perturbation and robustness in general terms, and (iii) introduce the definitions of C and K, which are sets that partition the features of a counterfactual explanation. The following Secs. 3 and 4 will then present the main contribution of this paper: notions of robustness that are tailored to sparse counterfactual explanations, i.e., specific to C and K.

Problem statement
Let us assume we are given a point x = (x_1, ..., x_d), where d is the number of features. Each feature takes values either in (a subset of) R, in which case we call it a numerical feature, or in (a subset of) N, in which case we call it a categorical feature. For categorical features, we use natural numbers as a convenient way to identify their categories, but disregard ordering. For example, for the categorical feature gender, 0 might mean male, 1 might mean female, and 2 might mean non-binary. Thus, x ∈ R^{d_1} × N^{d_2}, where d_1 + d_2 = d.
A counterfactual example for a point x is a point z ∈ R^{d_1} × N^{d_2} such that, given a classification (black-box) machine learning model f : R^{d_1} × N^{d_2} → {c_1, c_2, ...} (where c_i is a decision or class), f(z) ≠ f(x). (Many authors use x' to represent a counterfactual example for x, instead of z; we chose z so as not to overload the notation with superscripts later on in the manuscript, for readability.) We wish z to be close to x under a meaningful distance function δ that is problem-specific and meets several desiderata (see Sec. 9). For example, commonly-used distances that are capable of handling both numerical and categorical features are variants of Gower's distance [19] (see Eq. (9) and, e.g., [20] for a variant thereof). Often, when dealing with more than two classes, we also impose f(z) = t, i.e., the target class we desire z to be of. Other times, we wish to find a set of counterfactual examples {z_1, ..., z_k}, possibly of different classes, to obtain multiple means of recourse or simply gain information on the decision boundary of f near x (e.g., to explain f's local behavior) [15,21,22].
For the sake of readability, we provide formal definitions only for the case when all features are numerical (i.e., x ∈ R^d, d_1 = d, d_2 = 0). For completeness, we include explanations of how to deal with categorical features in the running text. Furthermore, we assume feature independence. While this assumption is rarely entirely met in real-world practice, it is commonly made in the literature due to the lack of causal models (e.g., only four works consider causality in Sec. 9), and it allows us to greatly simplify the introduction of the concepts hereby presented. We discuss the limitations that arise from this assumption in Sec. 8.
A counterfactual explanation is represented by a description of how x needs to be changed to obtain z. In other words, a counterfactual explanation is a prescription on what interventions should be made to reach the respective counterfactual example. For example, under the assumption of independence and all-numerical features, the difference z − x is typically considered the counterfactual explanation for how to reach z from x. What particular form counterfactual explanations take is not crucial to our discourse, and we will use z − x for simplicity.
We proceed by considering the following traditional setting where, for simplicity of exposition and without loss of generality, we assume that features are pre-processed so that a difference of one unit in feature i is equivalent to a difference of one unit in feature j (i.e., the user's effort is commensurate across different features). Alternatively, one can account for this in the computation of the distance (see, e.g., Eq. (9)). We seek the (explanation relative to an) optimal z⋆ with:
$$z^\star \in \operatorname*{arg\,min}_{z} \; \delta(z, x), \quad \text{where } \delta(z, x) = \lVert z - x \rVert_1 + \lambda \lVert z - x \rVert_0, \quad \text{subject to } f(z) = t \text{ and } z - x \in P. \tag{1}$$
In other words, δ is a linear combination, weighed by λ, of the sum of absolute distances between the feature values of x and z, and the count of feature values that are different between x and z. Note that z⋆ need not be unique, i.e., multiple optima may exist. Moreover, the difference z − x must abide by some plausibility constraints specified in a collection P. We model plausibility constraints as a set of specifications, each relative to a feature i, concerning whether z_i − x_i is allowed to be > 0, < 0, or ≠ 0, i.e., whether a feature can increase, decrease, or change at all (for categorical features, we only consider the latter).
For example, for a private individual who wishes to be granted a loan, one such constraint may specify that they cannot reasonably intervene to change the value of a currency (such a feature is called mutable but not actionable), i.e., counterfactual explanations must have z_i − x_i = 0 for i representing currency value. Similarly, the individual's age may increase but not decrease, i.e., z_i − x_i ≥ 0 for i representing age.

We particularly consider the L1-norm (i.e., the term ||·||_1 of δ in Eq. (1)) because it is reasonable to think that, for independent features, the total cost of intervention (i.e., the effort the user must put in) is the sum of the costs of intervention for each feature separately, and that these costs grow linearly. Some works (e.g., [20,23]) choose the L2-norm (||·||_2, also known as the Euclidean norm) instead of the L1-norm; the definitions of robustness given in this paper can be easily adapted to the L2-norm. Regarding the L0-norm (i.e., the term ||·||_0 of δ in Eq. (1)), this term explicitly promotes a form of sparsity, as it seeks to minimize how many features have a different value between z and x. This is desirable because, oftentimes, the user can only reasonably focus on, and intervene upon, a limited number of features (even if this amounts to a larger total cost in terms of L1 compared to intervention on all the features) [24].
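To make the objective of Eq. (1) concrete, the following is a minimal sketch (for the all-numerical case) of the distance δ and of a plausibility check. The function names, the default value of λ, and the string encoding of the constraints are illustrative assumptions, not part of our released code.

```python
import numpy as np

def delta(z, x, lam=0.1):
    # Eq. (1): L1 distance plus lambda-weighted L0 "sparsity" term (all-numerical case).
    diff = z - x
    return np.abs(diff).sum() + lam * np.count_nonzero(diff)

def is_plausible(z, x, constraints):
    # constraints[i] is one of ">=", "<=", "=", or None (no constraint on feature i).
    diff = z - x
    for i, c in enumerate(constraints):
        if c == ">=" and diff[i] < 0:
            return False
        if c == "<=" and diff[i] > 0:
            return False
        if c == "=" and diff[i] != 0:
            return False
    return True
```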

Perturbations & robustness
After a counterfactual explanation has been provided, unforeseen circumstances (e.g., inflation) might lead to more or different intervention being needed, compared to what was originally prescribed by the counterfactual explanation (e.g., increase savings by 1000$ to be granted credit access). Thus, instead of reaching z as intended by the counterfactual explanation, a different point z' is obtained. Note that while the effects of unforeseen circumstances can impact feature values, the circumstances themselves need not be encoded as feature values. In fact, we will only focus on the extent by which feature values may be perturbed by such circumstances.
Let us define the vector w = z' − z as a perturbation for the counterfactual example z. We assume that perturbations that impact feature i are sampled from some distribution P(w_i), and we are interested in controlling for, or being robust to, large-magnitude perturbations that have reasonable risk. For example, for normally-distributed perturbations, we might want to consider the values that can be sampled at the 95th or 99th percentile. We will therefore assume that we can define a vector p whose i-th element specifies the maximal magnitude (under reasonable risk) by which feature i can be perturbed; where needed, separate magnitudes p_i^- and p_i^+ can be specified for negative and positive perturbations, respectively. For categorical features, we will assume that p contains elements that represent what categorical perturbations are possible for that feature, i.e., p_i will be a set of indices that represent categories.
Under the problem setting we considered in Sec. 2.1, perturbations that may impact a counterfactual explanation define a box (hyper-rectangle) of all possible points z' that can be reached from z due to perturbations. An example is illustrated in Fig. 1. We define the concept of p-neighborhood of z as follows:

Definition 1. (p-neighborhood and p-neighbors of a counterfactual example) Given a model f, a point x, a respective counterfactual example z, and a vector of possible perturbations p, the p-neighborhood of z is the set
$$\mathcal{N}(z, p) = \{\, z' \;:\; z_i - p_i^- \leq z'_i \leq z_i + p_i^+ \;\; \forall i \,\}.$$
A point z' ∈ N such that z' ≠ z is called a p-neighbor of z.

Figure 1 (caption): z_b is a counterfactual example for treating blood pressure, z_v for treating vitamin deficiency, and z_{b,v} for treating both. We assume to know the maximal extent of perturbation (under reasonable risk) for blood pressure and vitamin deficiency due to natural physiological events. This allows us to define the blue areas surrounding each counterfactual example. Perturbations w to one of the counterfactual examples can lead to any other point in the blue area. z_v is the best of the three in terms of proximity to x, but its blue area partly overlaps with the red area. This means that there exist w such that z_v + w leads to a point in the red area, invalidating the counterfactual explanation. In such cases, it is important to estimate if additional intervention is possible so that z_v can still be reached, and at what cost.
Not all perturbations are problematic. Our goal is to study robustness to adverse perturbations, i.e., those that invalidate the counterfactual explanation or make enacting it more costly. Unfortunately, if f is assumed to be a general model (e.g., not necessarily a linear one), then the following argument holds.

Proposition 1. For a general f, information on the classification of a p-neighbor (e.g., that f(z') = f(z) for z' on the boundary of N) provides no information about the classification of another p-neighbor (e.g., that of z'' in the interior of N).
Proof. We cannot preclude that the model f is, for example, a neural network.
Under the universal approximation theorem [25], f may represent any function. This proposition means that if no information on, e.g., regularity or smoothness of f is available, then we must check each and every p-neighbor z' of z to assess whether some of them may invalidate the explanation, i.e., whether ∃z' such that f(z') ≠ t. Checking all neighbors is typically not feasible, e.g., as soon as some of the features are real-valued. Thus, the best one can do is to take an approximate approach. For example, a Monte-Carlo sampling approach can be used where a batch of random points within N is considered, hoping that the batch is representative of all points in N. As we will show in the next sections, a better strategy can be designed if sparsity is considered.
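As a rough illustration of such a Monte-Carlo check, the sketch below samples points uniformly in the box around z and queries f on each of them. A single symmetric bound p_i per feature is assumed for simplicity, and the helper name is hypothetical.

```python
import numpy as np

def neighborhood_seems_safe(f, z, p, target, n_samples=1000, seed=0):
    # Monte-Carlo check: sample points in the p-neighborhood (box) of z and
    # verify that all of them are classified as the target class. This only
    # approximates robustness: it cannot certify it for a general black-box f.
    rng = np.random.default_rng(seed)
    neighbors = rng.uniform(z - p, z + p, size=(n_samples, z.shape[0]))
    return all(f(nb) == target for nb in neighbors)
```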
We conclude this section by noting that perturbations, as described so far, are absolute, i.e., independent of the starting point x, the counterfactual in consideration z, or the intervention entailed by the counterfactual explanation z−x.
Perturbations to feature i might however depend on x_i and z_i, i.e., be sampled from a distribution P(w_i | x_i, z_i). For example, due to market fluctuations, a return on investment may be smaller than anticipated by 5% of the expected value. Such relative perturbations entail different p-neighborhoods for different x and z. For simplicity and without loss of generality, we will proceed by assuming that perturbations can only be absolute. We explain how we also included relative perturbations in the annotations used for our experiments in Sec. 5.

Sparsity, features in C and K
We use the form of sparsity mentioned in Sec. 2.1 to partition the features into two sets. As mentioned before, sparsity is an important desideratum because the user can typically focus on, and intervene upon, only a limited number of features. We denote by C the set of (indices of) features that the counterfactual explanation prescribes to change (i.e., i ∈ C iff z_i ≠ x_i), and by K the set of features that are prescribed to be kept at their current value (i.e., i ∈ K iff z_i = x_i). Note that the proposed partitioning between C and K implicitly assumes that all features are relevant to the counterfactual explanation. If certain features are irrelevant, perturbations to those features will have no effect on f's decision, and thus those features need not be accounted for when assessing robustness. This means that accounting for irrelevant features makes assessing robustness more computationally expensive than needed. However, as f is considered a black box, we cautiously assume that all features are relevant for assessing robustness. We will proceed by accounting for perturbations and respective robustness separately for features in C and K. Accounting for robustness separately is important because, as we will show, assessing robustness for features in C can be done far more efficiently and be more effective than for features in K. Knowing this, if multiple counterfactual explanations can be found, the user may want to choose the counterfactual explanation that fits them best based on the robustness it exhibits in terms of C and K. In the next section, we present our first notion of robustness, which concerns C.
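For concreteness, the partition into C and K can be computed directly from x and z, as in this minimal sketch (numerical features only; the helper name is hypothetical):

```python
import numpy as np

def get_C_and_K(x, z, atol=1e-9):
    # C: indices of features the explanation prescribes to change (z_i != x_i);
    # K: indices of features prescribed to be kept at their current value.
    changed = ~np.isclose(z, x, atol=atol)
    return np.flatnonzero(changed), np.flatnonzero(~changed)
```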

Robustness for C
We begin by focusing on the features that the counterfactual explanation instructs to change, i.e., the features (whose indices are) in C. Recall that we assume that a vector of maximal perturbation magnitudes p can be defined.
This leads us to the following definition.
Definition 2. (C-perturbation) Given a counterfactual example z for x, with C the set of features prescribed to be changed, and a vector of possible perturbations p, a C-perturbation is a vector w^c such that w^c_i = 0 for every i ∉ C, |w^c_i| ≤ p_i for every i ∈ C, and such that ∃i : |w^c_i| > 0, i.e., w^c is not the zero-vector.

In other words, a C-perturbation is a perturbation that acts only on features in C, and at least on one of such features. Next, we use the concept of C-perturbations to introduce that of C-setbacks.
Definition 3. (C-setback) A C-setback, denoted by w^{c,s}, is a C-perturbation such that every non-zero element w^{c,s}_i has sign opposite to that of the prescribed intervention z_i − x_i.

Figure 2 (caption, fragment): if one accounts for the possibility that maximal C-setbacks may take place, then the total cost (original intervention + additional intervention to remedy the perturbation, i.e., 3 + 2.5) to reach z_v becomes larger than the total cost to reach z_b (i.e., 4 + 1).
In words, a C-setback is a C-perturbation where each and every element of the perturbation w^{c,s}_i is of opposite sign to the counterfactual explanation z_i − x_i. We can interpret C-setbacks w^{c,s} as vectors that push the user away from z and back towards x along the direction of intervention.
Furthermore, we call a maximal C-setback, denoted by w^{c,s}_max, a C-setback whose elements that correspond to features in C have maximal magnitude, i.e., w^{c,s}_{max,i} = −sign(z_i − x_i) p_i for every i ∈ C. An example is given in Fig. 2. C-setbacks are arguably more interesting than C-perturbations because C-setbacks are the subset of these perturbations that plays against the user. In fact, certain C-perturbations might be advantageous, enabling the user to reach z with less intervention than originally provisioned (i.e., when the sign of w^c_i matches that of z_i − x_i). To account for robustness, we are interested in understanding whether perturbations can prevent us from reaching z, hence we will proceed by focusing exclusively on C-setbacks. It is important to note that even C-setbacks can be advantageous: if one allows their perturbations to be of larger magnitude than the intervention itself, a C-setback can lead to a point that "precedes" x in terms of the direction of intervention. For that point, the intervention may be less costly than the one that was originally planned, or entirely not needed because the point is of the target class (see, e.g., Fig. 3).
Advantageous situations are not interesting for robustness, and counterfactual explanations that can be overturned by perturbations may well not be interesting to pursue. We therefore consider any C-setback to have elements capped, in magnitude, by the intervention itself, i.e., |w^{c,s}_i| ≤ |z_i − x_i| for every i ∈ C. Perhaps the most interesting scenario for considering C-setbacks is when dealing with z⋆, since a counterfactual example that is optimal (i.e., one that minimizes Eq. (1)) is an ideal outcome. The following simple result holds for z⋆:

Proposition 2. For an optimal counterfactual example z⋆ and any C-setback w^{c,s}, it holds that f(z⋆ + w^{c,s}) ≠ t.

Proof. We use reductio ad absurdum. Let us assume the opposite of what is stated in Proposition 2, i.e., there exists w^{c,s} such that f(z⋆ + w^{c,s}) = t.

Let z' := z⋆ + w^{c,s}, and so f(z') = t. By construction of w^{c,s}, δ(z', x) = δ(z⋆ + w^{c,s}, x) < δ(z⋆, x). In other words, z' is of the target class and is closer to x than z⋆ is. This contradicts the fact that z⋆ is optimal.

Now, because of Proposition 2, we are guaranteed that if a C-setback w^{c,s} happens to z⋆, the resulting point will no longer be classified as t. Intuitively, this is a natural consequence of the fact that optimal counterfactual examples lie on the decision boundary, as otherwise they would not be optimal. Also, since z⋆ is optimal, the respective L0 component of the distance between x and z⋆ is minimal, i.e., all features in C, and thus in w^{c,s}, are relevant for the classification. Given these premises, it becomes important to understand whether invalidation of z⋆ can be averted with additional intervention and, if so, whether the cost of such intervention can be minimized.
It is important to note that invalidation of a counterfactual explanation due to a C-setback can always be averted, i.e., additional intervention to reach the intended z_i for all i ∈ C is always possible. To see this, consider the fact that the intervention entailed by the counterfactual explanation z − x must adhere to the plausibility constraints specified in P (else, z − x would not be a possible counterfactual explanation). Since C-setbacks are aligned with the direction of the original intervention, the difference z + w^{c,s} − x (where z + w^{c,s} lies between x and z) must also meet P. It therefore suffices to apply additional intervention along the originally-intended direction to recover the desired counterfactual example.
Under the L1-norm (as per the choice of δ in Eq. (1)), the cost associated with the additional intervention needed to overcome a C-setback w^{c,s} is simply ||w^{c,s}||_1.
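Putting the pieces together, a maximal C-setback and the L1 cost of the additional intervention needed to undo it can be computed as in this sketch. It assumes a single bound p_i per feature and the capping by the intervention magnitude discussed above; the function name is hypothetical.

```python
import numpy as np

def maximal_C_setback(x, z, p):
    # Opposes the intervention on every changed feature (i.e., every i in C),
    # capped by the maximal perturbation magnitude p_i and by the intervention
    # magnitude |z_i - x_i|; features in K get a zero entry (sign(0) = 0).
    diff = z - x
    return -np.sign(diff) * np.minimum(p, np.abs(diff))

# Cost of the additional intervention needed to undo the setback (L1-norm):
# cost = np.abs(maximal_C_setback(x, z, p)).sum()
```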
Since invalidation due to C-setbacks can be dealt with by additional intervention, we are interested in finding counterfactual explanations for which the cost of the additional intervention needed to contrast C-setbacks is minimal. To this end, we can use Proposition 2 in order to seek counterfactual examples that are optimal (i.e., require minimum intervention cost) when the additional cost to contrast maximal C-setbacks w^{c,s}_max is factored in. In the following definition, to highlight that C-setbacks depend on the specific z and x (as they determine C) and to avoid confusion, we write w^{c,s}_max(z, x), and call the result an optimal counterfactual example under C-setbacks.

Definition 4. (Optimal counterfactual example under C-setbacks) Given f, x, a target class t, plausibility constraints P, and a vector of possible perturbations p, an optimal counterfactual example under C-setbacks is
$$z^{\star}_{c,s} \in \operatorname*{arg\,min}_{z} \; \delta(z, x) + \lVert w^{c,s}_{\max}(z, x) \rVert_1, \quad \text{subject to } f(z) = t \text{ and } z - x \in P. \tag{6}$$
This definition gives us a way to seek a counterfactual explanation (multiple may exist) that entails minimal intervention cost when accounting for maximal C-setbacks. Indeed, it suffices to equip a given search algorithm with Eq. (6) in place of Eq. (1) as the objective to optimize.
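A minimal sketch of such an augmented objective is given below. It reuses delta and maximal_C_setback from the earlier sketches, handles the constraint f(z) = t as a simple penalty for illustration, and the penalty weight is an assumption.

```python
import numpy as np

def c_robust_loss(z, x, p, f, target, lam=0.1, invalid_penalty=1.0):
    # Cost of the counterfactual explanation plus the L1 cost of undoing a
    # maximal C-setback (in the spirit of Definition 4 / Eq. (6)).
    w_max = maximal_C_setback(x, z, p)
    loss = delta(z, x, lam) + np.abs(w_max).sum()
    if f(z) != target:
        loss += invalid_penalty  # relaxation of the hard constraint f(z) = t
    return loss
```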

Robustness for K
We now consider K, i.e., the set concerning the features that should be kept at their current value. Mirroring the notion of C-perturbation (Definition 2), we can define a K-perturbation to be a vector w^k such that w^k_i = 0 for every i ∉ K, |w^k_i| ≤ p_i for every i ∈ K, and w^k is not the zero-vector. Similarly, we can cast the concept of neighborhood from Definition 1 to consider only K-perturbations, leading to:

Definition 5. (K-neighborhood and K-neighbors of a counterfactual example) Given a model f, a point x, a respective counterfactual example z, and a vector of possible perturbations p, the K-neighborhood of z under p is the set
$$\mathcal{K}(z, p) = \{\, z' \;:\; z'_i = z_i \;\forall i \in C, \;\; |z'_i - z_i| \leq p_i \;\forall i \in K \,\}.$$
A point z' ∈ K such that z' ≠ z is called a K-neighbor of z. For a categorical feature i ∈ K, the neighborhood can be built by swapping z_i with one of the possibilities listed in p_i ∈ p, where p_i is a set containing the categories that perturbations can lead to.
Next, we use K to define the concept of vulnerability to K-perturbations:

Definition 6. (Vulnerability to K-perturbations) Given a model f, a point x, and a vector p, a counterfactual example z is vulnerable to K-perturbations if ∃ z' ∈ K(z, p) such that f(z') ≠ t.

Informally, this definition says that z is vulnerable to K-perturbations if the decision boundary surrounding z is not sufficiently loose with respect to the features in K. Fig. 4 shows an example. The reason why vulnerability to K-perturbations is particularly important is that, differently from the case of C-perturbations, a K-perturbation can invalidate the counterfactual explanation permanently. In fact, a K-perturbation changes z along a different direction than the one of intervention. Thus, a K-perturbation can lead to a point z' from which there exists no plausible intervention to reach the originally-intended z.

Figure 4 (caption, fragment): z_v is vulnerable to K-perturbations because these can lead to the red area; the same is not true for z_b. If it is not plausible to reduce blood pressure, then K-perturbations to z_v can lead to permanent invalidity. Else, they can be resolved with additional intervention, in terms of blood pressure.
For example, consider feature i to represent inflation as a mutable but not actionable feature, i.e., a feature that can be changed (e.g., by global market trends) but not by the user. P will state that no (user) intervention can exist to change i, i.e., P imposes z_i − x_i = 0. However, an unforeseen circumstance such as the financial crisis of 2008 may lead to a large inflation increase (p_i^+ > 0). Consequently, it may become impossible for the user to obtain the desired loan, e.g., because the bank does not hand out certain loans when inflation is too high. Now, recall that the reason why Definition 4 can be used for the case of C-perturbations is that Proposition 2 holds, i.e., there cannot exist points of class t between x and an optimal counterfactual example z⋆. The same does not hold for K-perturbations: since the features in K are orthogonal to the direction of intervention, it can happen that the maximal perturbation to a feature i ∈ K leads to a point z' for which f(z') = t, while a non-maximal perturbation to the same feature leads to a point z'' for which f(z'') ≠ t. Thus, checking for maximal perturbations is no longer sufficient: we must instead check all points in the K-neighborhood K.
As mentioned in Sec. 2.2, checking each and every point in a neighborhood may not be feasible. Thus, we propose to approximate the assessment of how K-robust (i.e., non-vulnerable to perturbations in K) counterfactual explanations are with Monte-Carlo sampling. Let 1_{f(z)} : K → {0, 1} be the indicator function that returns 1 for K-neighbors that share the same class as z (i.e., f(z)), and 0 for those that do not. Taking a random sample of m K-neighbors z'_1, ..., z'_m, we define the following score:
$$\text{K-robustness score}(z, m) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}_{f(z)}(z'_j). \tag{8}$$
We remark that even if K-robustness score(z, m) = 1, we are not guaranteed that z is K-robust, because the score is an approximation. Still, this score can be used to determine which counterfactual examples are preferable to pursue, in that they are associated with a smaller risk that adverse perturbations will invalidate them (permanently or not).
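The score can be estimated in a few lines, as in the sketch below (numerical features, a symmetric bound p_i per feature, and a hypothetical function name are assumed):

```python
import numpy as np

def k_robustness_score(f, z, x, p, m, seed=0):
    # Fraction of m randomly sampled K-neighbors that keep the same class as z.
    rng = np.random.default_rng(seed)
    K = np.flatnonzero(np.isclose(z, x))  # features prescribed to keep their value
    same_class = 0
    for _ in range(m):
        z_prime = z.copy()
        z_prime[K] += rng.uniform(-p[K], p[K])  # perturb only features in K
        same_class += int(f(z_prime) == f(z))
    return same_class / m
```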

Experimental setup
In this section, we firstly describe the preparation of the data sets used in our experiments. Secondly, we describe the search algorithms considered for finding near-optimal counterfactual explanations. Lastly, we describe the loss function considered, as well as how to incorporate the proposed notions of robustness into it.

Data sets
Table 1 summarizes the data sets we consider. For each data set, we make an assumption on the type of user who seeks recourse, e.g., the user could be a private individual seeking to increase their income, or a company seeking to improve the productivity of its employees. Based on this, we manually define the target class t, the set of plausibility constraints P on what interventions are reasonably plausible, and the collection p of maximal magnitudes from which perturbations can be sampled (we will consider uniform and normal distributions). We named the data sets in Table 1 to represent their purpose. Originally, Credit risk (abbreviated to Cre) is known as South German Credit Data [27], which is a recent update that corrects inconsistencies in the popular Statlog German Credit Data [28]. Income (Inc) is often called Adult or Census income [29,30]. Housing price (Hou) is also known as Boston housing [31] and is often used for research on fairness and interpretability because one of its features raises ethical concerns [32]. Productivity (Pro) concerns the productivity levels of employees producing garments [33]. Lastly, Recidivism (Rec) is a data set collected by an investigation of ProPublica about possible racial bias in the commercial software COMPAS, which intends to estimate the risk that an inmate will re-offend [34]. Examples of recent works on fair and explainable machine learning that adopted (some of) these data sets are [20,35,36,37,38,39,40].
We pre-process the data sets similarly to how it is often done in the literature; we elaborate on this in Sec. 8. We sample the amount of perturbation using a uniform or normal distribution, as indicated in Sec. 7. Sec. 5.1 shows some examples of the maximal perturbations we annotated. As mentioned before, we also define plausibility constraints P for each data set. Each constraint is specific to a feature. For the i-th numerical feature, possible constraints specify that the feature may only increase (z_i − x_i ≥ 0), must remain equal (z_i − x_i = 0), or may only decrease (z_i − x_i ≤ 0).

Table 1 (caption): The column t is the target class for the (simulated) user. Plausib. constr. reports the number of plausibility constraints that allow features to only increase (≥), remain equal (=), and decrease (≤). The column Perturb. reports the number of perturbations concerning numerical (N) and categorical (C) features. Finally, Acc. rf and Acc. nn report the average (across five folds) test accuracy of the random forest and neural network models.
Table 2 (excerpt): examples of annotated maximal perturbations.
Data set (abbrev.) | Feature | Perturb. (−) | Perturb. (+) | Rationale
 | Savings | 10% | 10% | Might happen to save less or more relative to what intended.
Hou | Crime rate | 1% | 5% | Relative, might increase more than decrease.
Pro | Overtime | 3 | 3 | Up to 3 more or less days of overtime might be needed.
Rec | Age | 0 | 2 | Judicial system delays for up to 2 years.

We take relative perturbations (those with %) with respect to the value of the feature in the intended counterfactual example z in consideration by the search algorithm.

Black-box models
We consider random forests and neural networks (with a standard multi-layer perceptron architecture) as black-box machine learning models f. We use Scikit-learn's implementations [69]. We assume that we can only access the predictions of f.

Search algorithms

The settings used for the search algorithms are reported in Table 3. We describe the algorithms below. Note that all of the algorithms are heuristics with no guarantee of discovering optimal (i.e., minimal-distance) counterfactual examples, given the nature of the search problem (general, black-box f).

DiCE

For DiCE, we obtain a single counterfactual example by generating a set of candidates, ranking them according to the loss (Sec. 5.4), and picking the best-ranking point. We will further consider two different configurations of DiCE:

• Configuration "a" uses the default settings except for allowing a larger number of iterations, to match the same computational budget given to the other algorithms.
• Configuration "b" uses custom settings that are aligned to be similar to those used for CoGS, since both DiCE and CoGS are genetic algorithms.

GrSp
GrSp is a greedy algorithm that iteratively samples neighbors of the starting point x at increasing distances, until a point that is classified as the target class is found. Categorical features are handled by encoding them with numbers; note that this is sub-optimal because an artificial ordering is introduced between categories.

LORE
LORE learns a local decision tree around x and extracts from it a counterfactual rule; a counterfactual example is then obtained by applying the rule to the starting point x (e.g., setting x's age and salary to 3.4 and high, respectively).
We found (confirmed by a discussion with the authors) that applying LORE's rules may result in points that are not actually classified as t. When that happens, we perform up to 15 attempts at generating a counterfactual example from the (shortest returned) rule, by focusing on numerical features that are prescribed to be >, ≥ (or <, ≤) than a certain value. In particular, when applying such part of the rule to x, we add (or subtract) to the prescribed value a term ϵ, which is initially set to 10^−3 and is doubled at every attempt. Moreover, since we found LORE to be computationally expensive to run (see Fig. 5), we used a fraction of the computation budget allowed for the other algorithms (see Table 3).

NeMe
NeMe is a classic simplex-based algorithm for gradient-free optimization.

CoGS
We design CoGS as a relatively standard genetic algorithm, adapted to the search space of counterfactual examples (mixed numerical and categorical features, plausibility constraints). After crossover and mutation, the quality (fitness) of offspring solutions is evaluated using the loss function (Eq. (9)) as fitness function (minimization is sought). Finally, we use tournament selection [67] to form the population for the next generation. We set CoGS to allow plausibility constraints (P) to be specified. If plausibility constraints are used, then mutation is restricted to plausible changes (e.g., the feature that represents age can only increase). If mutation makes a numerical feature obtain a value bigger than max_i (resp., smaller than min_i), then the value of that feature is set to max_i (resp., min_i).

CoGS is written in Python, and relies heavily on NumPy [68] for speeding up key computations. For example, the population is encoded as a NumPy matrix, and crossover and mutation are implemented with matrix operations.
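As an illustration of this style of implementation, the sketch below performs one vectorized mutation step on a (pop_size, d) matrix of candidates while enforcing plausibility constraints and feature ranges. The Gaussian step size and all argument names are assumptions for illustration, not CoGS's exact operators.

```python
import numpy as np

def mutate(population, x, mut_prob, feat_min, feat_max, only_increase, only_decrease, rng):
    # population: (pop_size, d) matrix of candidate counterfactual examples (numerical features).
    pop_size, d = population.shape
    mask = rng.random((pop_size, d)) < mut_prob
    step = rng.normal(0.0, 0.1 * (feat_max - feat_min), size=(pop_size, d))
    mutated = population + mask * step
    # Plausibility w.r.t. the starting point x: some features may only increase
    # (z_i >= x_i) or only decrease (z_i <= x_i).
    mutated = np.where(only_increase, np.maximum(mutated, x), mutated)
    mutated = np.where(only_decrease, np.minimum(mutated, x), mutated)
    # Keep feature values within their allowed ranges.
    return np.clip(mutated, feat_min, feat_max)
```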

Loss
We use a loss (Eq. (9)) to drive the search of counterfactual examples: its first term is Gower's distance γ(z, x) [19,46] between z and x, which handles both numerical and categorical features, and its last term is ||f(z) − t||_0, which penalizes candidates that are not classified as the target class t (f(z) and t are treated as integers).

GrSp also performs competitively, especially on data sets where (almost all) features are numerical. Although LORE supports both numerical and categorical features, it does not perform better than GrSp on most data sets; at least for the limited number of runs conducted with LORE due to excessive runtime, as explained before. Lastly, NeMe often performs substantially worse than all other algorithms.

Quality of discovered counterfactual examples
As the last part of our benchmarking effort, we consider which algorithm manages to produce near-optimal counterfactual examples (i.e., those with smallest loss). In particular, we report the relative change in loss for the best-found counterfactual example with respect to the loss obtained by CoGS, only for success cases. Since we consider only successes, the last term of the loss (Eq. (9)) is always null, i.e., ||f(z) − t||_0 = 0. The relative change in loss with respect to CoGS for another algorithm Alg is computed as (loss of Alg − loss of CoGS) / (loss of CoGS).

LORE has a worse success rate than GrSp, and NeMe the worst of all. Therefore, we use CoGS for the following experiments on robustness.
We remark that DiCE, like CoGS, supports the specification of plausibility constraints. We show that CoGS performs better than DiCE also under plausibility constraints in Appendix B.

Experimental Results: Robustness
We proceed with presenting the experimental results regarding robustness to perturbations in C, in K, and jointly. We focus on results that allow us to answer our research questions.

Table 5 shows the frequency with which robust counterfactual examples are discovered accidentally. To realize this, we compare the best-found counterfactual example that is discovered by CoGS when robustness is not accounted for, and the one that is found when C- or K-robustness is accounted for (as indicated in Sec. 5.4.1). We take the frequency with which the two match as an indication of whether robust counterfactual examples can be discovered by accident.
Since numerical feature values may differ only slightly between two best-found counterfactual examples, we consider the values to match if they are sufficiently close to each other, according to a tolerance level of 1%, 5%, or 10% of the range of that feature. As is reasonable to expect, the results show that the larger the tolerance level, the more often a z⋆ discovered when not accounting for robustness matches the respective one that is discovered when accounting for robustness.
In general, the result depends on the data set in consideration, and also (albeit arguably less so) on whether random forest or a neural network is used as black-box model f .
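A minimal sketch of this tolerance-based matching (numerical features only; for categorical features one would require exact equality) is:

```python
import numpy as np

def counterfactuals_match(z_a, z_b, feat_min, feat_max, tol=0.05):
    # Two best-found counterfactual examples are considered to match if every
    # feature value differs by at most `tol` (e.g., 1%, 5%, 10%) of that
    # feature's range.
    return bool(np.all(np.abs(z_a - z_b) <= tol * (feat_max - feat_min)))
```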

For brevity, we now focus on the tolerance level of 5% and random forest.
On Inc, best-found counterfactual examples rarely match those discovered when accounting for C-robustness (4% on average for the tolerance of 5%), while the opposite happens on Hou (84% on average for the same tolerance). The last row shows how often best-found counterfactual examples happen to be both robust to perturbations to C and to K. The frequencies are clearly always lower than for the previous triplets of rows. Hou is the only data set for which the frequency of discovering a counterfactual example that happens to be both robust w.r.t. C and K by chance is relatively large (e.g., above 50% for the tolerance of 5%).
When using the neural network instead of the random forest, the trends mentioned before remain the same, but the specific magnitudes can differ. For example, the accidental discovery of robust counterfactual examples w.r.t. C and/or K is lower on Cre with the neural network compared to random forest, but the opposite holds for Hou (with some exceptions, e.g., the tolerance level of 1% when both C- and K-robustness are sought).
Overall, this result indicates that, except for lucky cases (e.g., Hou with f being the neural network), it is unlikely to discover robust counterfactual examples by chance. Hence, if one wishes to achieve robustness, the search must be explicitly instructed to that end. In the next sections, we investigate whether achieving robustness can actually be important.
Overall, these results show that accounting for robustness can be crucial to en-745 sure that, if perturbations happen, additional intervention to obtain t remains possible.

(RQ3) Are robust counterfactual explanations advantageous in terms of additional intervention cost?
We present the following results in terms of a relative cost, namely, the total intervention cost that is ultimately incurred divided by the ideal cost, i.e., the cost of the best-found counterfactual example when robustness is not accounted for and no perturbations take place.

Figure 11 (caption): Cost of accounting for robustness relative to not accounting for robustness (i.e., ideal cost) when no perturbations take place. Note that the (rare) relative costs smaller than 1 are due to a lack of optimality of the search algorithm.
Accounting for C-robustness (second blue box from the left in each plot) counters perturbations to C very well on all the data sets. On Inc in particular, the relative cost improves by two orders of magnitude. As found in Sec. 7.2, accounting for perturbations to K with the K-robustness score can remain insufficient, as can be observed on Cre and Inc for both types of f. Again, this is likely a limitation of using a simple heuristic such as the K-robustness score to deal with K-robustness. Accounting for robustness w.r.t. C (resp., K) does not, in general, lead to smaller relative cost under perturbations to K (resp., C).
We confirm this general trend with statistical testing in Appendix C.

Lastly, accounting for both C- and K-robustness (right-most triplets of boxes in each plot) offers protection (lower relative cost) from situations in which both types of perturbations take place. In general, across data sets and types of f, the distribution of relative costs for when perturbations to both C and K take place and both C- and K-robustness are accounted for (right-most green box in each plot) is better than the distribution for when the same perturbations take place but no notion of robustness is accounted for (left-most green box in each plot).
Since the ideal cost is computed when no notion of robustness is accounted for, part of the relative cost for when robustness is accounted for comes from the fact that robust counterfactual examples are generally farther away from x than non-robust ones. Fig. 11 shows the cost increase that comes solely from accounting for robustness on the considered data sets and types of f, without any perturbation taking place. We remark that values smaller than 1 happen only because the discovered counterfactual examples can be suboptimal.
Importantly, we find that the cost when accounting for robustness is between 1× and 7× the ideal cost, i.e., the cost when not accounting for robustness. In general, this is significantly smaller than the increase incurred when perturbations take place and robustness is not accounted for, as reported before (generally between 5× and 10× the ideal cost, with up to 100×).
Lastly, Fig. 12 shows what part of the cost increase comes from counterfactuals becoming less sparse. For certain data sets (e.g., Cre for both random forest and neural network), robust counterfactual explanations are as sparse as non-robust ones. In general, however, robust counterfactual explanations tend to be less sparse, depending on the choice of f and the type of robustness that is accounted for. The reduction in sparsity can be moderate or substantial.

For example, less than 10% more of the features need to change to account for K-perturbations on Hou with random forest (i.e., approximately one feature). Instead, up to 30% more of the features need to change to account for K-perturbations on Rec with the neural network (i.e., approximately three features). The fact that sparsity decreases when seeking robust counterfactuals is a natural consequence of the additional requirements that robustness imposes. Still, as reported above, when perturbations do take place, robust counterfactual explanations require much less costly additional intervention than non-robust ones.

Discussion
Our experimental results provide a positive answer to all three research questions. In general, counterfactual explanations are not robust, be it in terms of the features whose value is prescribed to be changed (C-robustness) or of those whose value is prescribed to be kept as is (K-robustness). Moreover, non-robust counterfactual explanations are more likely to leave the user unable to remedy perturbations by additional intervention, and the cost of additional intervention is larger for non-robust counterfactual explanations than for robust ones. Ultimately, it is clear that accounting for robustness is important.
Our experimental results suggest that accounting for robustness for features in C tempers perturbations to C, and, similarly, accounting for robustness for features in K tempers perturbations to K. Moreover, even though f can learn non-linear feature interactions, accounting for C (or K) has limited effect on contrasting perturbations to K (resp., C). Only in some cases (e.g., on Hou), robustness w.r.t. C has substantial repercussions on the effect of perturbations to K or vice versa.
In addition to this, even if a counterfactual search algorithm does not guarantee that the discovered counterfactual example will be optimal, we experimentally find that accounting for the proposed notions of robustness remains effective.

What our results also show is that seeking robustness with respect to features in K is problematic. This is because of Proposition 1 and the fact that features in K are not aligned with the direction of intervention. Thus, we proposed to control for K-robustness using an approximation, i.e., the K-robustness score.
We found that seeking counterfactual examples that maximize the K-robustness score is often, but not always, sufficient to obtain good resilience to perturbations to the features in K. Moreover, the K-robustness score requires sampling (and evaluating with f) multiple points, which is far more expensive than computing Definition 4. Therefore, future work should consider whether a better method than the K-robustness score can be used. For example, if information on f is available, that information may be used to provide guarantees on the neighborhood of z (see, e.g., Theorem 2 in [40] for linear f).
The assumption that features are independent is simplistic but often made in the literature, because only a small number of works assume a causal model is available (e.g., [47,48]). Future work could consider relaxing this assumption, as well as using different functions to define the maximal extent of perturbation, which may, e.g., account for the distribution of feature values. For example, for denser areas of feature i, p_i^+ and p_i^- should be smaller than for less dense areas. Other desiderata may need to be included when seeking counterfactuals in practice (see, e.g., [49,50]), including accounting for multiple types of robustness at the same time, such as those related to uncertainties of f [51,52].

Lastly, we made subjective choices to define perturbations (p) and plausibility constraints (P) in the data sets. We made these choices as best as we could, based on reading the meta-information in web sources and the papers that describe the data sets. We have no doubt that domain experts would make much better choices than ours. Nevertheless, we argue that this is not an important limitation because, as long as the community agrees that our choices are reasonable, they suffice to provide a sensible test bed for benchmarking robustness. Hopefully, other researchers will find our annotations to be useful for future experiments on the robustness of counterfactual explanations. Similarly, we hope that other researchers will find CoGS to be an interesting algorithm to benchmark against.

Related work
A number of works in the literature propose several new desiderata that are largely orthogonal to our notions of robustness but can be important to enhance the practical usability of counterfactual explanations. For example, Dandl et al. [49] consider, besides proximity of z to x according to different distances, whether other training points x' are sufficiently close to z for it to reasonably belong to the training data distribution. A similar desideratum is considered in [21] and [53]; the latter work employs neural autoencoders to that end.
[54] remarks the importance of sparsity for explanations, with the concepts of pertinent negatives (the minimal features that should be different to (more) confidently predict the given class) and pertinent positives (the minimal features that help correctly identify the class). Laugel et al. [36,50] require that z can always be reached from a training point x' without having to cross the decision boundary of f, for z not to be the result of an artifact in the decision boundary of f. In [47] and [55], counterfactual explanations are studied through the lens of causality. For recent surveys on counterfactual explanations, the reader is referred to [56,16,57].
We now focus on works that deal with some notion of robustness and/or perturbations. [48] extends [47] to consider possible uncertainties in causal modelling. In [17], it is shown that a malicious actor can, in principle, jointly optimize small perturbations and the model f such that, when applying the perturbations to points of a specific group (e.g., white males), the respective counterfactual explanations are much less costly than normal (in fact, counterfactual explanations are conceptually similar to adversarial examples, see, e.g., [59,60,61]). Some works consider forms of robustness of counterfactual explanations with respect to changes of f (e.g., whether z is still classified as t if f' is used instead of f) [51,62] or updates to f (e.g., after data distribution shift of temporal or geospatial nature) [63,52]. In [64], robustness of counterfactual explanations is studied in the context of differentially-private support vector machines. Dominguez et al. [40] consider whether counterfactual explanations remain valid in presence of uncertainty on x, and also account for causality.
We also note that Dominguez et al. consider a neighborhood of uncertainty around x which is akin to Definition 1; in fact, such neighborhoods are common tools in post-hoc explanation methods, e.g., the Anchor explainer by [65] seeks representative points for a class by assessing that the prediction of f does not change within a neighborhood. To the best of our knowledge, there exists no other work prior to ours that attempts to exploit sparsity when assessing robustness, although sparsity is an important property for counterfactual explanations. Moreover, existing works typically consider whether robustness helps preventing counterfactual explanations from becoming invalid, while we further consider that additional intervention may be possible, and assess the associated cost.

Conclusion
Counterfactual explanations can help us understand how black-box AI systems reach certain decisions, as well as what intervention is possible to alter such decisions. For counterfactual explanations to be most useful in practice, we studied how they can be made robust to adverse perturbations that may naturally happen due to unforeseen circumstances, to ensure that the intervention they prescribe remains valid, and that the potential additional intervention cost that may be needed remains limited. We presented novel notions of robustness, which concern adverse perturbations to the features that a counterfactual explanation prescribes to change (C-robustness) and to keep as they are (K-robustness), respectively. We have annotated five existing data sets with reasonable perturbations and plausibility constraints, and developed a competitive counterfactual search algorithm, CoGS, that supports plausibility constraints and implements the proposed notions of robustness.

The performance of the tuned random forest on all folds is shown in Table A.

As can be seen, DiCE comes close to the performance of CoGS only on Rec.
Thus, overall, CoGS remains superior to DiCE also when plausibility constraints are enforced.
We attribute this to the fact that, differently from CoGS, DiCE is inherently designed to discover a diverse set of counterfactuals instead of a single, closest-possible counterfactual.

Regarding perturbations to (features in) C (i.e., C-setbacks), recall that accounting for the respective notion of robustness is intended to provide counterfactual explanations with minimal additional intervention cost, should the maximal C-setback take place. Ideally, the returned counterfactual example should still be optimal, i.e., as near to x as possible; in practice, however, the best-found example sometimes is not on the boundary, and the C-setback is too small to cross the boundary.
The frequency of this phenomenon depends on the data set. Also, while accounting for robustness w.r.t. C should not, in theory, decrease the invalidity rate but only make additional intervention less costly (as confirmed in Sec. 7.3), we find that accounting for robustness w.r.t. C lowers the invalidity rate on Hou (e.g., most evident for both types of f with normally-distributed perturbations).
When K-robustness is accounted for, the best-found counterfactual explanation is supposed to be in a region such that the decision boundary is relatively loose with respect to the features in K. Consequently, accounting for K-robustness should, in fact, counter invalidity, as we do not wish to risk that it becomes impossible to carry out additional intervention due to the plausibility constraints. The figure shows that, in general, there can be a substantial gain in lowering invalidity by accounting for K-robustness. At times, accounting for K-robustness allows reaching almost zero invalidity; see the cell that corresponds to robustness for K and perturbations to K on Inc, Hou, and Pro, for both types of f and sampling distributions. However, it is not always the case that K-robustness helps, due to the heuristic nature of the K-robustness score: see, e.g., Cre.
Lastly, we observe that the frequency of invalidity can rise when both notions of robustness are accounted for at the same time (e.g., on Inc for uniformly-distributed perturbations when using the neural network). Note that this is not necessarily a problem because invalidity from perturbations to C is expected to be high, as the goal of robustness w.r.t. C is to be able to minimize additional intervention cost.
Appendix B.3. Setting m for K-robustness

We report results on setting the hyper-parameter m for computing K-robustness scores (see Eq. (8)). In particular, we run CoGS accounting for K-robustness in the loss function, for m ∈ {0, 4, 16, 64}. Note that using m = 0 corresponds to not accounting for K-robustness.
Appendix B.3.1. Achieved K-robustness

We consider how increasing m improves K-robustness, using an approximated ground truth. We approximate the ground truth of the true K-robustness by calculating the K-robustness score over 1000 samples for the counterfactual example discovered using a specific m. Even small values of m already improve the achieved K-robustness with respect to m = 0 (see, e.g., Cre). Further increasing m has diminishing returns (note that m is increased exponentially). Accounting for C-robustness is largely orthogonal, meaning that it has no effect in terms of K-robustness.
Appendix B.3.2. Additional required runtime

We compare results for different values of m. We do not find major differences based on the setting of m for computing the K-robustness score, except for the tails of the respective distributions on Hou, and slightly less so on Pro (for both types of f). Accounting for C- and K-robustness at the same time leads to larger costs than accounting for only one of the two, as is reasonable to expect. On average, the cost that comes from accounting for robustness alone is limited (up to 6.5× the ideal cost, see Inc), especially in light of the results found for when perturbations take place, described in Sec. 7.3 (additional intervention due to perturbations can lead to 100× larger costs for non-robust counterfactual explanations, see Inc in Fig. 9).

Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.