Supervised Feature Compression based on Counterfactual Analysis

Counterfactual Explanations are becoming a de-facto standard in post-hoc interpretable machine learning. For a given classifier and an instance classified in an undesired class, a counterfactual explanation corresponds to a small perturbation of that instance that allows changing the classification outcome. This work aims to leverage Counterfactual Explanations to detect the important decision boundaries of a pre-trained black-box model. This information is used to build a supervised discretization of the features in the dataset with a tunable granularity. Using the discretized dataset, an optimal Decision Tree can be trained that resembles the black-box model, but that is interpretable and compact. Numerical results on real-world datasets show the effectiveness of the approach in terms of accuracy and sparsity.


Introduction
Classification systems based on Machine Learning algorithms are often used to support decision-making in real-world applications such as healthcare (Babic et al., 2021), credit approval (Silva et al., 2022; Kozodoi et al., 2022; Bastos and Matos, 2022; Dumitrescu et al., 2022), or criminal justice (Ridgeway, 2013). These systems often act as black boxes that lack interpretability. Making Machine Learning systems trustworthy has become imperative, and interpretability, robustness, and fairness are often essential requirements for deployment (European Commission, 2020; Goodman and Flaxman, 2017; Rudin et al., 2022).
This paper is devoted to enhancing the interpretability of black-box classifiers. Without loss of generality, we will focus on binary classification problems. We are given a black-box classification model, hereafter the target model. The goal is to discretize the features by detecting their most critical values. Discretization techniques have often been proposed in the literature as a preprocessing step that allows the transformation of continuous data into categorical data (Dougherty et al., 1995; Dash et al., 2011; García et al., 2013; Ramírez-Gallego et al., 2016); the objective is to make the representation of the knowledge more concise. Note that a trivial way to discretize continuous features consists of creating a dummy variable for each split point along a feature, where a split point is defined as the middle value between two consecutive points on a feature. However, this results in a very large number of dummy variables. In our approach, we aim at a meaningful compressed feature representation, which gives us a more interpretable view of the data, since the only relevant values are the critical ones defining the discretization. In addition, this procedure acts as a Feature Selection technique (Piramuthu, 2004): features for which no critical values are extracted are considered unimportant for detecting the input-output relationship and can thus be filtered out. Discretizing features can also help reduce noise that can cause overfitting to training data. Finally, we make it more affordable to train an optimal classification tree (Carrizosa et al., 2021a) as the interpretable surrogate of the target model, and to certify its optimality, thanks to the reduction in the number of thresholds that need to be considered, namely, the critical ones.
In this paper, the feature discretization will be supervised by a Counterfactual Analysis on the target model. Counterfactual Analysis is a post-hoc local explainability technique (Karimi et al., 2022; Martens and Provost, 2014; Molnar et al., 2020; Wachter et al., 2017) that has gained a lot of traction, especially in Supervised Classification. The starting point is an already trained classification model and an observation that has been classified by the model in the undesired class, e.g., as a bad payer in a credit approval application (Fethi and Pasiouras, 2010; Doumpos et al., 2023). Counterfactual Analysis provides feedback on how to change the features of the observation in order to change the prediction given by the model to the desired class, e.g., as a good payer in the credit approval application above. The Counterfactual Explanation depends on the cost function used to measure the changes, which is minimized, and on the constraints imposed on the explanation, such as upper bounds on the changes to continuous variables or the correct modeling of changes to categorical features.
We propose to use Counterfactual Analysis to detect decision boundaries of the target model. For this, we extract a set of univariate decision boundaries, i.e., axis-parallel hyperplanes, from Counterfactual Explanations. We use observations that have been correctly classified by the target model. For each of these, we compute its counterfactual explanation, yielding a set of axis-parallel hyperplanes. By putting together all these hyperplanes from all the counterfactual explanations, we can identify for each feature a set of thresholds. This step allows us to derive a meaningful supervised discretization of the original dataset, where the cutting points on each feature are the thresholds, and hence are related to the decision boundary of the target model. With this, we can reproduce an equivalent decision boundary by means of a classification tree.
A major strength of our approach is that our supervised discretization is based on an optimization problem, the counterfactual problem. First, this allows us to control the importance of the features extracted, and hence the granularity of the discretization, through the counterfactual cost function. Furthermore, we can impose desirable properties of the boundary by modeling constraints on the counterfactuals, like plausibility or fairness (Carrizosa et al., 2023; Maragno et al., 2022). In summary, given a black-box classification model, Counterfactual Explanations can help us find a supervised discretization where the compression focuses on the important decision boundaries of the target model. With this new representation of the data we can build an interpretable surrogate model, namely an optimal univariate decision tree. Our approach has the following advantages: We produce a supervised discretization of the original dataset whose granularity can be tuned. The discretization is driven by an optimization problem, the counterfactual problem, that allows us to control the importance of the thresholds extracted and to impose properties on the boundary.
The interpretable model that we build uses thresholds extracted from the black-box model. This guarantees the use of features and thresholds that are relevant for the problem, since they represent the decision boundaries of a more complex classification model.
Our methodology is suitable for any type of classifier for which the counterfactual problem can be solved efficiently, such as tree ensemble models (e.g., random forests, gradient boosting) or linear models (e.g., logistic regression, linear support vector machines).
Our approach allows us to strongly compress the original dataset, making it feasible to train an optimal classification tree and to certify its optimality (Lin et al., 2020). The reduced dataset can also be used to train other machine learning models.
Once all the thresholds have been computed using Counterfactual Analysis, the user can easily explore the accuracy-granularity tradeoff by changing one parameter of the discretization. Tuning the granularity simply amounts to selecting a subset of the thresholds computed by the Counterfactual Analysis and is thus computationally cheap.
Our discretization shows a good tradeoff between out-of-sample accuracy and standard metrics for evaluating discretization procedures (García et al., 2013), such as compression rate and inconsistency rate.
The remainder of the paper is structured as follows. In Section 2 we analyze the existing literature on Decision Trees, Counterfactual Explanations and discretization procedures. In Section 3 we formalize our method. In Section 4 we describe the experimental setup and the obtained results. Finally, in Section 5 we draw some conclusions and propose some lines for future research.

Literature Review
The most popular and inherently interpretable models are univariate Decision Trees (Carrizosa et al., 2021a). Training an optimal Decision Tree is known to be an NP-complete problem (Laurent and Rivest, 1976). Many approaches for training Decision Trees thus rely on heuristics; the most widely used heuristic is a top-down greedy strategy in which the tree structure is recursively grown from the root to the terminal nodes (Breiman et al., 1984; Quinlan, 1986, 2014). However, because of their greedy nature, these heuristics may result in poor generalization capabilities.
Recently, there has been an increasing interest in developing mathematical optimization formulations and numerical solution approaches to train Optimal Classification Trees (Carrizosa et al., 2021a). One of the first approaches for Optimal Classification Trees was proposed in Bertsimas and Dunn (2017), where a Mixed-Integer Linear Programming (MILP) problem is formulated for both univariate and multivariate classification trees. This formulation, however, has a weak linear relaxation and is thus hard to solve to optimality on real-sized instances; the authors therefore propose a local search algorithm to try to overcome this issue, returning a local solution that has no guarantee of being the global optimum (Dunn, 2018). Recently, new approaches based on binary features have been proposed, and ad-hoc algorithms have been designed with the purpose of being scalable and fast (Verwer and Zhang, 2019; Lin et al., 2020; Günlük et al., 2021; Aghaei et al., 2021). Furthermore, the dynamic programming approach proposed in Lin et al. (2020) is better suited for realistic datasets with a few continuous features, thanks to a new representation of the dynamic programming search space.
Still, the number of continuous features that can be handled is limited, as each continuous feature is associated with a large number of dummy variables, thus enlarging the search space and slowing down the optimization process. In order to extend the applicability of the procedure, the authors propose some strategies to exploit knowledge extracted from black-box models to make the Decision Tree's optimization faster (McTavish et al., 2022). The most effective strategy is "Guessing Thresholds", which allows them to limit the number of dummy variables modeling continuous features. "Guessing Thresholds" uses a Tree Ensemble as the black-box model and considers as thresholds the ones used in the various splits of each tree in the ensemble; the thresholds are sorted by their importance (e.g., their Gini index) so that the least important can be removed; the Tree Ensemble is then re-fitted with the remaining features, and the procedure is repeated until the drop in training performance becomes too large. The idea of leveraging black-box models, and in particular Tree Ensembles, to build Decision Trees is also pursued in Vidal and Schiffer (2020), in which the authors design a dynamic programming approach where a single Decision Tree is built to reproduce exactly the decision function of a Tree Ensemble; however, in general, the resulting Decision Tree can be large and thus lose some interpretability.
Our research is positioned between the approach in Vidal and Schiffer (2020) and the one in McTavish et al. (2022). We use optimization as in Vidal and Schiffer (2020), but we do not reproduce exactly the original black-box model. Instead, we exploit the boundary of the black-box model to compress the original dataset in a meaningful way by means of optimization. Then, we use the compressed dataset to efficiently train an optimal decision tree as in McTavish et al. (2022). We enforce interpretability by limiting the depth of the optimal decision tree. Our discretization is driven by an optimization-based procedure for computing Counterfactual Explanations, which are local explainability techniques that can be used to explain a single decision of a black-box model. Indeed, Counterfactual Explanations allow providing feedback to users on how to change their features in order to change the outcome of the decision (Karimi et al., 2022; Guidotti, 2022). Formally, the Counterfactual Explanation of a datapoint x0, namely xCE, is defined as the perturbation of minimal cost (w.r.t. some cost function) that allows changing the classification outcome. An optimization problem for computing the Counterfactual Explanation xCE of a given point x0, namely the counterfactual problem, was proposed in Wachter et al. (2017):

min_x C(x, x0)  s.t.  f(x) = yCE,     (1)

where C is a cost function, f is the classification function and yCE ≠ f(x0) is the required label for the Counterfactual Explanation. The problem is then reformulated as an unconstrained problem with a differentiable objective function composed of two terms: the first term requires the classification function to be as close as possible to the required label yCE, while the second term minimizes the distance between the Counterfactual Explanation and the initial point. Later in the literature, Verma et al.
(2022) identify some additional constraints Counterfactual Explanations should satisfy. Proximity requires that a valid Counterfactual Explanation must be a small change with respect to the initial point. Actionability implies that the Counterfactual Explanation can modify some features (e.g., income), while others must be immutable (e.g., sex, race). Sparsity requires a Counterfactual Explanation to be sparse, i.e., as few features as possible should change. This makes the Counterfactual Explanation more effective, because simpler explanations can be better understood by users. Data Manifold Closeness suggests that a Counterfactual Explanation should be realistic, meaning that it should be close to the training data. Finally, Causality requires that the Counterfactual Explanation adheres to observed correlations between features. Examples of constraints modeling domain knowledge and actionability can be found in Parmentier and Vidal (2021); Maragno et al. (2022). Robustness is also an important requirement, as it imposes that Counterfactual Explanations remain valid after retraining the model or slightly perturbing the input features (Forel et al., 2022; Fernández et al., 2022; Maragno et al., 2023).
For some classifiers, such as linear Support Vector Machines (SVMs) or Tree Ensembles, it is possible to derive an explicit expression of the classification function f, allowing us to write problem (1) directly as a Linear Programming problem or at most a convex quadratic problem. Integer variables can be added to represent the l0-norm in the objective function. For SVMs with non-linear kernels or for Neural Networks this is not possible. To overcome this issue, Maragno et al. (2022) suggest viewing the Counterfactual Explanation problem as a special case of Optimization with Constraint Learning, in which some of the constraints are learnt through a predictive model.
Recent literature states that Counterfactual Explanations can provide useful insights into the classification model, allowing them to be used not only for post-hoc explainability but also for debugging and detecting bias in models (Sokol and Flach, 2019). For example, if it turns out that, without imposing the actionability constraints, the Counterfactual Explanation changes a sensitive feature (e.g., gender) by saying that a woman would receive the loan if she were a man, then the classification model may be biased. This observation highlights that Counterfactual Explanations can be used to detect biases in Machine Learning models, opening the possibility of designing new fairness metrics that rely on Counterfactual Explanations (Kusner et al., 2017; Goethals et al., 2023). Some recent literature (Kuppa and Le-Khac, 2021; Mothilal et al., 2020; Aïvodji et al., 2020; Zhao et al., 2021) focuses on the use of Counterfactual Explanations in an adversarial setting for detecting the decision boundaries of machine learning models.
In this work, we propose to use Counterfactual Explanations to enhance the interpretability of the model itself. We show that using Counterfactual Explanations to detect the important decision boundaries of a black-box model allows us to design a supervised discretization technique that helps to build an optimal Decision Tree. Using a discretization technique as a preprocessing step can lead to many advantages: (1) some Machine Learning algorithms prefer categorical variables, e.g., the Naive Bayes classifier (Yang and Webb, 2009; Flores et al., 2011); (2) discretized data are easier to understand and to explain; (3) discretization can decrease the granularity of data, potentially decreasing the noise in the dataset (García et al., 2013). Nevertheless, any discretization process generally leads to a loss of information, making it crucial to minimize this loss. In García et al. (2013), a taxonomy for categorizing discretization methods is introduced. The effectiveness of a discretization procedure can then be evaluated according to different aspects: (a) the discretization should compress the information as much as possible, by detecting as few intervals as possible; (b) the inconsistency rate produced by the discretization, i.e., the unavoidable error due to multiple points mapped to the same discretization but with different labels, should be kept low.

Method
We assume we have at hand a binary-classification training set D_tr = {(x_i, y_i) : i = 1, ..., n}, with x_i ∈ [0, 1]^m and y_i ∈ {0, 1}, taken from a sample of n individuals. This procedure can, however, be easily extended to multi-class problems. For simplicity, we assume w.l.o.g. all features to be scaled between 0 and 1.
We train a black-box model T using D_tr, which we use as the target model for our procedure. For our purpose, T can implement any classification algorithm, as long as we can efficiently compute the Counterfactual Explanation associated with each instance x ∈ D_tr with respect to model T. Our objective is to train a small, compact and interpretable decision tree that acts as a surrogate for T. At each branch node s, a univariate Decision Tree takes a univariate decision: if feature j of the input x is less than or equal to a threshold τ_s, the point is directed to the left child of s, otherwise to the right child. By following the path of each point x from the root of the tree to the last level, a set of points is assigned to each leaf l of the tree; the classification outcome at each leaf l is the most frequent label of the points assigned to l. Training a Decision Tree amounts to choosing which feature to consider at each node s and the value of the threshold τ_s.
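The routing rule above can be sketched in a few lines; the nested-tuple node representation here is purely illustrative, not the encoding used in our implementation.

```python
def route(x, node):
    """Route a point through a univariate decision tree.
    A node is either ('leaf', label) or ('split', j, tau, left, right):
    go left when x[j] <= tau, right otherwise."""
    while node[0] == 'split':
        _, j, tau, left, right = node
        node = left if x[j] <= tau else right
    return node[1]  # label stored at the reached leaf
```

For instance, a depth-1 tree splitting feature 0 at τ = 0.5 sends a point with x[0] = 0.3 to the left leaf and one with x[0] = 0.7 to the right leaf.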
As shown in Figure 1, the intuition behind our procedure is that a Counterfactual Explanation is very close (±ε) to some decision boundary of the target model. So, if we generate a large set of Counterfactual Explanations, we should be able to identify the most critical decision boundaries of the model. Each Counterfactual Explanation xCE perturbs just a subset of the features of the corresponding initial point x0 (Figure 2a); the values these features assume can be used to mimic some decision boundaries of the target model, i.e., these values can be used as splitting values in the nodes of the surrogate Decision Tree (Figure 2b). Following Carrizosa et al. (2021b) for computing Counterfactual Explanations, we solve a problem derived from (1) under the assumption that the target model is a Tree Ensemble. This problem takes the following parameters: yCE ∈ {0, 1}, with yCE ≠ y0, is the required outcome for the Counterfactual Explanation; T is the set of trees in the Tree Ensemble; L_t and N_t denote respectively the leaves and the internal nodes of each tree t in the ensemble; v_{t,s} and c_{t,s} denote respectively which feature and which threshold is used to split at node s of tree t; A_L(t, l) and A_R(t, l) denote respectively the ancestors s of leaf l in tree t whose left/right branch lies on the path from s to l. We denote by p_k(x) the probability for point x of being in class k; the specific expression of p_k(x) depends on the tree ensemble method used (e.g., Random Forest (Breiman, 2001) or Gradient Boosting (Chen et al., 2015)). The decision variables are the point xCE and, for each tree of the ensemble, binary variables z_{t,l} that assign xCE to one of its leaves. The resulting formulation is problem (2)-(7). In the objective function (2), we consider a weighted combination (with nonnegative coefficients) of the l0-, l1- and l2-norms of xCE − x0; the l0-norm is used for feature sparsity, while the l1- and l2-norms are used to measure proximity. The l0- and l1-norms are modeled in a standard way by introducing binary variables and linear constraints, respectively. By increasing the value of parameter λ0, we encourage changing as few features as possible. Both big-M constants M0 and M1 can be set to 1, since the features are assumed to be scaled between 0 and 1. Constraint (7) imposes that the Counterfactual Explanation should belong to a set X0 that represents a plausibility set for the initial point x0; actionability, data manifold closeness, causality and other requirements can thus be expressed via additional constraints. Constraints (3)-(6) impose that the label assigned to the Counterfactual Explanation is the required label yCE; in this set of constraints, the binary variables z model the assignment of xCE to one of the leaves, for each tree in the ensemble. Note that constraints (3) and (4) depend on a threshold ε_j for each feature j ∈ [1..m]. In our paper, we set the value of ε_j as the smallest difference between two consecutive values that feature j assumes on the datapoints of D_tr. For the value of ε that appears in constraint (6), we use a fixed value, while Forel et al. (2022) analyze how to set it to enforce robustness of Counterfactual Explanations in Tree Ensembles. The expression of constraint (6) depends on the specific tree ensemble method: for Random Forest it involves the classification weights w_{t,l,k} of leaf l in tree t for class k, while for Gradient Boosting it involves the initial prediction p_0 (computed as the fraction of points in the majority class of the training set), the learning rate lr, and the values w_{t,l} predicted by leaf l of tree t.
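For the special case of an ensemble of depth-1 trees (stumps), the relevant values of each feature can be enumerated, so the counterfactual problem can be solved by brute force instead of the MILP (2)-(7). The sketch below illustrates this on a toy Gradient-Boosting-style scorer; the function names and the stump encoding are our own illustrative choices, and the enumeration is tractable only for small numbers of features and thresholds.

```python
import itertools

def stump_score(x, stumps, base, lr):
    """Raw score of a boosted ensemble of stumps.
    stumps: list of (feature, threshold, w_left, w_right); class 1 iff score > 0."""
    return base + lr * sum(wl if x[j] <= c else wr for j, c, wl, wr in stumps)

def brute_force_counterfactual(x0, stumps, base, lr, lam0=0.1, lam1=1.0, eps=1e-4):
    """Cheapest point (l0/l1 cost) whose predicted class differs from x0's."""
    y0 = stump_score(x0, stumps, base, lr) > 0
    # candidate values per feature: the original value, or just past any threshold
    cands = []
    for j in range(len(x0)):
        vals = {x0[j]}
        vals.update(v for f, c, _, _ in stumps if f == j for v in (c - eps, c + eps))
        cands.append(sorted(vals))
    best, best_cost = None, float("inf")
    for x in itertools.product(*cands):
        if (stump_score(x, stumps, base, lr) > 0) == y0:
            continue  # label not flipped: infeasible
        diff = [abs(a - b) for a, b in zip(x, x0)]
        cost = lam0 * sum(d > eps for d in diff) + lam1 * sum(diff)
        if cost < best_cost:
            best, best_cost = list(x), cost
    return best, best_cost
```

With λ0 > 0, the returned explanation tends to change as few features as possible, mirroring the role of the l0 term in the objective (2).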
For each counterfactual couple (x0, xCE), composed of a point x0 and its counterfactual explanation xCE, we restrict our attention to the features that change significantly, i.e., |x0_j − xCE_j| > ε_j. For each such feature j, we can compute a possible splitting threshold, given in (10). Generating a set of counterfactual couples thus results in computing, for each feature j, a set of thresholds τ_j. The set of thresholds across all features is denoted by τ = ∪_{j∈[1..m]} τ_j. Our objective is to identify the most important decision boundaries of T. The importance of a feature j can be measured by the number of counterfactual couples for which a threshold of feature j is extracted, that is, the size of τ_j. Note that increasing parameter λ0 in the cost function (2) results in principle in selecting the most important features. Let us denote by π_{t,j} the multiplicity of each threshold t ∈ τ_j. After fixing a quantile value Q, we restrict τ to the subset τ^Q of thresholds whose multiplicity is above the Q-quantile of the multiplicities, and we use the thresholds in τ^Q as splitting values in the nodes of the surrogate Decision Tree. This can be translated into a Feature Discretization procedure on the data in D_tr. Parameter Q thus allows us to control the number of thresholds considered in τ^Q across all features.
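The threshold-extraction and quantile-filtering steps can be sketched as follows. The midpoint (x0_j + xCE_j)/2 is used here as the splitting value, and thresholds are kept when their multiplicity reaches the Q-quantile of all multiplicities; both choices are our reading of the procedure, and the function name is illustrative.

```python
from collections import Counter
import numpy as np

def extract_thresholds(couples, eps, Q):
    """couples: list of (x0, x_ce) pairs; eps: per-feature tolerances eps_j.
    Returns a list tau[j] of thresholds kept for each feature j."""
    m = len(eps)
    per_feature = [[] for _ in range(m)]
    for x0, xce in couples:
        for j in range(m):
            if abs(x0[j] - xce[j]) > eps[j]:  # feature j changed significantly
                per_feature[j].append(round((x0[j] + xce[j]) / 2.0, 6))
    counts = Counter((j, t) for j, ts in enumerate(per_feature) for t in ts)
    if not counts:
        return [[] for _ in range(m)]
    cutoff = np.quantile(list(counts.values()), Q)  # keep only frequent thresholds
    return [sorted({t for t in ts if counts[(j, t)] >= cutoff})
            for j, ts in enumerate(per_feature)]
```

With Q = 0 every extracted threshold survives; raising Q prunes the rarely recurring ones first.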
Each feature j can then be replaced with a set of binary variables, one for each threshold t ∈ τ^Q_j. The binary variable b_{i,t} associated with each threshold t ∈ τ^Q_j represents whether feature j of data point x_i lies before or after t. With this procedure we transform all the numerical features into binary variables. If no threshold is extracted for a feature, we remove it from the discretized dataset. Note that the higher Q is, the more compressed the discretized dataset becomes.
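A minimal sketch of the resulting binarization, using the convention b = 1 when the feature value exceeds the threshold (the opposite convention works equally well):

```python
import numpy as np

def discretize(X, taus):
    """Replace each feature j by one binary column per threshold t in taus[j]:
    1 if x_j > t, else 0. Features with no threshold are dropped."""
    cols, names = [], []
    for j, ts in enumerate(taus):
        for t in ts:
            cols.append((X[:, j] > t).astype(int))
            names.append(f"x{j}>{t}")
    return np.column_stack(cols), names
```

A feature with two kept thresholds thus yields two dummy columns, i.e., three distinct discretized values.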

Algorithm
In this section, we describe the details of our procedure, namely FCCA (Feature Compression based on Counterfactual Analysis). As target system T we can consider any black-box model for which it is possible to solve the counterfactual problem (1).
After training the target system, the second step is to extract from D_tr a large set of points M for which to compute Counterfactual Explanations. Our numerical experiments showed that the time for solving problem (2)-(7) strongly depends on the closeness of the initial point to the decision boundary of T. In fact, if the initial point is far from the decision boundary of T, a large perturbation may be needed to cross the decision boundary, i.e., to change the classification outcome. A proxy for the time needed for solving problem (2)-(7) is thus the classification probability. On the other hand, considering points too close to the decision boundary (i.e., classified with a very low probability) can introduce some noise and make the procedure less robust. We thus define M as the set of points in the training set D_tr that are correctly classified and whose classification probability is bounded between two values 0.5 ≤ p_0 ≤ 1 and p_0 ≤ p_1 ≤ 1:

M = {x_i ∈ D_tr : f_T(x_i) = y_i, p_0 ≤ Π_T(x_i) ≤ p_1},

where f_T(x_i) and Π_T(x_i) return respectively the classification label and the classification probability for x_i. We then compute the set C of Counterfactual Explanations for all points in M.
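The selection of M can be sketched with scikit-learn; the synthetic dataset and the probability bounds below are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
T = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                               learning_rate=0.1).fit(X, y)

p0, p1 = 0.5, 1.0
proba = T.predict_proba(X).max(axis=1)   # Pi_T(x): probability of the predicted class
correct = T.predict(X) == y              # f_T(x) = y: correctly classified points
M = X[correct & (proba >= p0) & (proba <= p1)]
```

With p0 = 0.5 and p1 = 1.0 this keeps exactly the correctly classified points, since the predicted-class probability of a binary classifier is always at least 0.5.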
We can use equation (10) to extract, from all couples in (M, C), the set of thresholds τ. We set a value of Q between 0 and 1 and use it to restrict τ to the most frequent thresholds τ^Q. The output of the procedure is thus the discretization of the data in both the training and test set, namely D′_tr(τ^Q) and D′_ts(τ^Q). The discretized data can later be used to train and test the performance of a Decision Tree, trained either with the heuristic CART algorithm or with an optimal approach such as the one proposed in Lin et al. (2020). Note that training a Decision Tree on the discretized dataset implies that the decision tree uses exactly some of the thresholds found by the counterfactual computation.
The described procedure is summarized in Algorithm 1.

Discretization effectiveness
As a side product, our procedure returns different discretizations as Q changes. In order to evaluate the effectiveness of these discretizations in terms of compression ability, we introduce two metrics.

Compression rate. When we apply the discretization, some points collapse to the same discretized representation. The compression rate is defined as η = 1 − r, where r is the ratio between the number of distinct discretized points in D_tr and the total number of points in D_tr.
Inconsistency rate. As a downside of the compression, when multiple points collapse to the same discretized representation, it may happen that not all of them have the same label. For each feature j, we denote by ξ_j the number of thresholds in τ^Q_j; the number of possible values that each discretized point assumes on feature j is thus ξ_j + 1. The number of possible discretized points N_{τ^Q} thus depends on how many thresholds we have for each feature:

N_{τ^Q} = ∏_{j∈[1..m]} (ξ_j + 1).

For each possible discretization l ∈ [1..N_{τ^Q}], we denote by Ω_l ⊆ D_tr the set of points that fall into that discretization, and we set Ω^0_l = {x_i ∈ Ω_l : y_i = 0} and Ω^1_l = {x_i ∈ Ω_l : y_i = 1}. The number of inconsistencies δ_l in Ω_l is then the number of points in Ω_l carrying the minority label:

δ_l = min(|Ω^0_l|, |Ω^1_l|).

The inconsistency rate produced by the discretization procedure is thus

δ = (1/n) ∑_l δ_l.

The inconsistency rate represents an irreducible error; thus, 1 − δ is an upper bound on the accuracy that any classifier built on the discretized dataset is able to achieve. Note that in principle it is always possible to train on the discretized dataset a univariate decision tree, without limiting the depth, achieving an accuracy on the training set equal to 1 − δ.
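Both metrics can be computed directly from the discretized training set; the sketch below groups points by their discretized representation (function name illustrative).

```python
def compression_and_inconsistency(Xb, y):
    """Xb: binarized data (n rows), y: binary labels.
    Returns (eta, delta): compression rate and inconsistency rate."""
    groups = {}
    for row, yi in zip(map(tuple, Xb), y):
        groups.setdefault(row, []).append(yi)
    n = len(y)
    eta = 1.0 - len(groups) / n                    # eta = 1 - (#distinct cells)/n
    delta = sum(min(ls.count(0), ls.count(1))      # minority labels per cell
                for ls in groups.values()) / n
    return eta, delta
```

For example, four points collapsing into two cells give η = 0.5; if one of those cells mixes labels 0 and 1, every minority-label point in it counts toward δ.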
Both the compression rate and the inconsistency rate depend on the value of Q. For high values of Q we obtain a low-granularity discretization that results in a high compression rate, related to high interpretability; on the other hand, the discretization could produce a large number of inconsistencies, which represent a lower bound on the error rate of any classification method on the discretized dataset. The objective is thus to choose Q so as to keep a good trade-off between the compression rate and the inconsistency rate. Summarizing, our methodology has two parameters that control the quality of the discretized dataset: the parameter λ0 in the cost function (2) influences the number of features involved in the discretization (feature sparsity), whereas the parameter Q controls the number of thresholds involved across all the features (threshold sparsity).

Experimental Setup
We tested the procedure summarized by Algorithm 1 on several binary classification datasets with continuous features, whose characteristics are summarized in Table 1. As target black-box model, we can use any black-box model for which it is possible to efficiently solve problem (1). Possible choices are Tree Ensembles (e.g., Random Forest, Gradient Boosting) or Support Vector Machines with a linear kernel. Since the objective is to learn the decision boundaries of T, it is important to choose a model with good performance on the dataset considered.
In our analysis, we use Gradient Boosting with 100 estimators of depth 1 and a learning rate equal to 0.1; the performance of this algorithm on the datasets, computed in k-fold cross-validation, is shown in Table 2. The advantage of using such a simple algorithm is that solving problem (2)-(7) is extremely fast (on the order of 10^-2 seconds).
The experiments have been run on an Intel i7-1165G7 2.80GHz CPU with 16GB of available RAM, running Windows 11. The procedure was implemented in Python, using scikit-learn v1.2.1 for training our models; the optimization problem (2)-(7) for computing Counterfactual Explanations was solved with Gurobi 10.0.1 (Gurobi, 2021). The value of ε used in constraint (6) is set to 10^-4. The code of the experiments is available at https://github.com/ceciliasalvatore/sFCCA.git. The experimental settings used in our experiments are described below. As shown in Table 1, we considered both small datasets with less than 5000 data points (boston, arrhythmia and ionosphere) and big datasets with more than 5000 data points (magic, particle, vehicle). For each dataset we compute the performance in a 5-fold cross-validation. For big datasets we restrict the dataset to a random subset of 5000 data points; the remaining ones are used as an external test set to additionally validate the performance of the system.
The procedure requires setting a group of parameters:
• p_0: lower bound on the classification probability for points in the set M. We recommend using p_0 = 0.5. A higher value can be set if the dataset is very noisy or unbalanced.
• p_1: upper bound on the classification probability for points in the set M. We recommend using p_1 = 1.0. A lower value can be used to reduce the number of points for which the Counterfactual Explanation is computed if a) the cardinality of M is large (more than 1000 points), or b) the target model T is complex (e.g., it is an ensemble of trees with high depth), so that the time for computing each Counterfactual Explanation can be high.
• λ_0, λ_1, λ_2: hyperparameters for the Counterfactual Explanation problem (2)-(7). We recommend setting λ_2 = 0 in order to significantly reduce the computational time for the Counterfactual Explanation problem. λ_0 and λ_1 must be set to trade off sparsity against proximity; we can set λ_1 = 1 without loss of generality, so that we only need to tune λ_0. By setting λ_0 = 0.1 and λ_1 = 1, we mean that changing one additional feature has the same weight as changing one feature by 0.1 in absolute value, i.e., 10% of the scale range (recall that we assume features to be scaled between 0 and 1).
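The stated equivalence can be checked with a direct implementation of the cost function (symbols as in (2); the helper below is illustrative):

```python
def cf_cost(x0, x, lam0=0.1, lam1=1.0, lam2=0.0, eps=1e-4):
    """Weighted l0/l1/l2 counterfactual cost between x0 and x."""
    diff = [abs(a - b) for a, b in zip(x0, x)]
    return (lam0 * sum(d > eps for d in diff)      # l0: number of changed features
            + lam1 * sum(diff)                     # l1: total absolute change
            + lam2 * sum(d * d for d in diff))     # l2: squared change (off here)
```

Changing one feature by 0.1 costs 0.1 (l0) + 0.1 (l1) = 0.2: the fixed per-feature penalty indeed equals a 10% absolute change.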
• Q: the value of Q represents the granularity for the thresholds selected.The standard value is Q = 0, meaning that we consider all the thresholds extracted from the Counterfactual Analysis.Increasing the value of Q makes the dataset more sparse, but, at the same time, it can degrade the input-output relationship.The trade-off between compression rate and inconsistency rate must be carefully considered when Q is increased.The Target model T used is the Gradient Boosting algorithm with 100 estimators and depth 1.We also report the computational time needed to run the FCCA procedure on a single fold of the dataset.
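As a sketch of how Q acts on the frequency distribution π of the extracted thresholds, the hypothetical helper below keeps only the thresholds whose frequency reaches the Q-quantile of π (the paper's exact selection rule may differ in its details):

```python
import numpy as np

def select_thresholds(threshold_counts, Q=0.0):
    """Keep only the thresholds whose frequency in the Counterfactual
    Analysis is at least the Q-quantile of the frequency distribution pi.

    threshold_counts: dict mapping (feature, threshold) -> frequency.
    Sketch of the tunable-granularity step; Q = 0 keeps every threshold."""
    freqs = np.array(list(threshold_counts.values()), dtype=float)
    cutoff = np.quantile(freqs, Q)
    return {k: v for k, v in threshold_counts.items() if v >= cutoff}

# Toy frequency distribution over three (feature, threshold) pairs.
counts = {("f1", 0.3): 10, ("f1", 0.7): 2, ("f2", 0.5): 6}
kept = select_thresholds(counts, Q=0.7)  # only the most frequent survives
```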

Performance Evaluation
In order to evaluate the effectiveness of the FCCA procedure, we look at two points of view: • The performance of a classification model trained on the discretized dataset.
In particular, our objective is to train a small Decision Tree; Decision Trees can be trained either with a heuristic algorithm such as CART or with an optimal approach. In the latter case, the discretization procedure allows us to use the GOSDT approach proposed by Lin et al. (2020), which requires binary input features. For both CART and GOSDT, we limit the maximum depth to 3. For GOSDT, we also set the regularization parameter to 10/n_tr, where n_tr is the number of training points; in our experiments, in fact, we noticed that this value produces Decision Trees that are shallow but can properly represent the input-output relationship.
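The discretization itself amounts to turning each retained (feature, threshold) pair into a dummy variable, which is the binary input format GOSDT requires. A minimal sketch (the column-naming convention is our own):

```python
import pandas as pd

def binarize(X, thresholds):
    """Turn continuous features into the binary dataset used by CART/GOSDT.

    thresholds: iterable of (feature_name, value) pairs; each pair becomes a
    dummy column "feature<=value". Sketch of the discretization step."""
    cols = {f"{f}<={t:g}": (X[f] <= t).astype(int) for f, t in thresholds}
    return pd.DataFrame(cols, index=X.index)

# Toy example: two continuous features, one threshold each.
X = pd.DataFrame({"age": [25, 40, 60], "income": [0.2, 0.5, 0.9]})
Xb = binarize(X, [("age", 30), ("income", 0.4)])
```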
• The compression and inconsistency rates of the discretized dataset.
Results obtained by the FCCA procedure are compared with those obtained on the initial dataset with continuous features and on the dataset discretized following the "Guessing Thresholds" procedure proposed in McTavish et al. (2022), which we will refer to as GTRE (Guessing Thresholds via Reference Ensemble). The GTRE procedure is also based on the idea of leveraging a Tree Ensemble model to extract relevant thresholds that discretize the input dataset in a meaningful way; as Tree Ensemble, the GTRE procedure uses the Gradient Boosting algorithm. In all the experiments, we run the two methods with the same parameters for the Gradient Boosting (100 trees of depth 1 and learning rate 0.1) to ensure a fair comparison.
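For intuition, extracting the split thresholds used by a reference Gradient Boosting ensemble of depth-1 stumps, the configuration used in our experiments, can be sketched with scikit-learn as follows (a simplified illustration, not the GTRE implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Collect every (feature, threshold) split used by the reference ensemble.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gb = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                learning_rate=0.1).fit(X, y)

splits = set()
for stump in gb.estimators_.ravel():        # one DecisionTreeRegressor each
    tree = stump.tree_
    for node in range(tree.node_count):
        if tree.children_left[node] != -1:  # internal node -> real split
            splits.add((tree.feature[node], float(tree.threshold[node])))

print(f"{len(splits)} distinct (feature, threshold) pairs extracted")
```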

Results
In this section, we analyze the results obtained in our experiments in terms of accuracy, sparsity, compression rate and inconsistency rate.

Accuracy
In Figure 3 we present the accuracy results on the benchmark datasets obtained by training both CART and GOSDT on the initial dataset with continuous features, on the dataset discretized with the GTRE procedure, and on the dataset discretized with the FCCA procedure at different levels of the quantile Q. Please note that it is not possible to apply the GOSDT approach directly to the dataset with continuous features: in this case it would be necessary to introduce a threshold in the middle of any two consecutive data points for each feature; the resulting dataset is too big, making GOSDT computationally unaffordable. On all datasets, we can notice that using a discretization technique is beneficial because it allows us to compute the optimal tree with the GOSDT algorithm, which results in a more robust, accurate and sparse (Figure 4) decision tree with respect to the heuristic CART. Combining GOSDT with the two discretization techniques GTRE and FCCA for Q = 0 achieves similar performance on all datasets; increasing the value of Q to 0.7 preserves a high accuracy level on almost all datasets, except for magic.

Compression and Inconsistency Rate
In Figures 5 and 6, we plot respectively the compression rate η and the inconsistency rate δ of the initial dataset with continuous features, of the dataset discretized with the GTRE procedure, and of the dataset discretized with the FCCA procedure at different levels of the quantile Q. On all datasets, we can notice that the initial dataset with continuous features has zero compression and zero inconsistency, while both GTRE and FCCA achieve a significant compression rate that, as a downside, leads to a non-zero inconsistency. For Q = 0, the compression and inconsistency rates of FCCA are very similar to those obtained by GTRE. In the FCCA procedure, compression and inconsistency rates grow with the quantile value Q: when Q tends to 1, the compression rate η also tends to 1. A high compression rate is positive because it implies that we are able to summarize the data using information at a low granularity; it is therefore easier to build a small and interpretable decision tree for learning the input-output relationship on this representation of the dataset. As a downside, however, the inconsistency rate also increases when Q tends to 1. The trade-off between compression rate and inconsistency rate can help us in deciding whether it is possible to increase the level of Q in order to obtain a sparser dataset without affecting accuracy. In fact, in all datasets except for magic we can observe that increasing Q to 0.7 leads to an acceptable level of inconsistency when compared to the accuracy level that is generally reached on that dataset. On the contrary, on magic the inconsistency rate is quite high already for Q = 0, and it reaches almost 20% for Q = 0.7: for this reason, for Q = 0.7 we record a decrease in accuracy (Figure 3d).
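To make the inconsistency metric concrete, one plausible implementation counts the points whose binary representation coincides with that of a point of the opposite class; note that this is our reading of the inconsistency rate δ, and the paper's exact definition may differ:

```python
import pandas as pd

def inconsistency_rate(Xb, y):
    """Fraction of points whose discretized representation collides with a
    point of the other class (one plausible definition of delta)."""
    df = Xb.copy()
    df["_y"] = list(y)
    cols = list(Xb.columns)
    # Groups of identical binary rows containing both labels are inconsistent.
    mixed = df.groupby(cols)["_y"].transform("nunique") > 1
    return float(mixed.mean())

# Two pairs of identical binary rows; only the first pair has mixed labels.
Xb = pd.DataFrame({"a": [1, 1, 0, 0], "b": [0, 0, 1, 1]})
delta = inconsistency_rate(Xb, [0, 1, 1, 1])
```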

Sparsity
As already stated, increasing the value of Q in the FCCA procedure leads to selecting the thresholds that, according to the Counterfactual Analysis, are most relevant in describing the input-output relationship. The resulting discretized dataset is smaller because it is described through a smaller set of binary features. If Q is too high, we can lose the ability to represent the input-output relationship, ending up with low-performance surrogate classification models. In our previous analysis, we observed that Q = 0.7 is a good value on almost all of the benchmark datasets considered. In Figures 7-12, we analyze the thresholds extracted by the FCCA procedure at different levels of Q and compare them to the thresholds extracted by the GTRE procedure. In these figures, we represent the thresholds extracted by each discretization procedure through a heatmap; for the FCCA procedure, the heatmap also represents the quantile. Please note that thresholds that appear at a certain quantile also appear at lower quantile levels. We can notice that the heatmaps obtained through the GTRE procedure and those obtained through FCCA with Q = 0 are very similar, meaning that the two approaches extract very similar sets of thresholds. When Q grows, instead, the set of thresholds extracted by the FCCA procedure is much smaller.

Other Target algorithms
In this section, we briefly analyze what happens when changing the algorithm implemented by the Target model T. As alternatives to Gradient Boosting, we consider Random Forest and Linear Support Vector Machines. For the sake of brevity, we only show this comparison on one benchmark dataset, boston. The hyperparameters for both Random Forest and linear SVMs are computed in crossvalidation: we thus set the maximum depth of the Random Forest to 4 and the value of C for the SVM to 1. The other parameters used in this experiment are analogous to the ones used in Section 4.2.

Figure 13 presents the comparison of the FCCA method with different Target algorithms in terms of accuracy, sparsity, compression and inconsistency, while Figure 14 presents the heatmap of the thresholds extracted by the FCCA method starting from the Random Forest and Linear SVM algorithms. We can notice that when Q = 0, the three Target models all lead to similar results in terms of accuracy; however, comparing Figures 7 and 14, we can notice that the sets of thresholds extracted are quite different. In particular, while the behaviour for Gradient Boosting and Random Forest is quite similar, for the SVM the thresholds extracted are distributed over almost the whole interval of values; this is due to the continuous nature of SVMs. For SVMs, Counterfactual Analysis thus mainly acts as a feature selection technique.

Figure 7 :
Figure 7: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on boston. For the GTRE procedure, the heatmap only represents whether, in a given interval, there are some thresholds (Y) or not (N). For the FCCA procedure, in case there are thresholds in the interval, the heatmap also represents the quantile. Please note that any threshold selected for a given quantile Q will also be selected for lower values of the quantile.

Conclusions
In this paper, we use Counterfactual Analysis to derive a supervised discretization of a dataset, driven by a black-box model's classification function. Counterfactual Explanations are computed by solving an optimization problem that identifies the most relevant features and the corresponding cutting points for the black-box model. The Counterfactual Explanations provide a set of univariate decision boundaries that allows discretizing the original dataset into a set of binary variables. Having a compact dataset of binary variables makes it affordable to train an interpretable optimal decision tree with the recent approach proposed in Lin et al. (2020). Our procedure allows us to discretize the dataset with a tunable granularity, controlled by means of the parameter Q, with 0 ≤ Q ≤ 1. A high granularity, corresponding to low values of Q, results in a high level of detail in the dataset, where we consider a larger number of features, each represented by a large number of binary variables. A lower granularity, corresponding to high values of Q, results in a sparse dataset, where we only select the most relevant features and model each of them with few binary variables.
Tuning the value of Q is needed to trade off between the performance of a classification model built on this dataset (roughly inversely proportional to Q) and its sparsity (a measure of interpretability, directly proportional to Q).
In the numerical section, we demonstrate the viability of our method on several datasets of different size, both in terms of the number of data points and the number of features. We compare our approach with the one proposed in McTavish et al. (2022), which discretizes the dataset by extracting thresholds while retraining smaller and smaller versions of the original black-box model until the accuracy drops. Our method has the advantage of detecting the thresholds by solving an optimization problem, allowing us to tune both the sparsity in terms of features, by controlling the cost function of the optimization problem, and the sparsity in terms of the number of thresholds, by controlling the parameter Q.
As a future line of research, we plan to use the optimization problem defined in Carrizosa et al. (2023) for computing counterfactuals of groups of individuals. We can add fairness and plausibility constraints to this problem to extract "fair" decision boundaries, and hence indirectly impose fairness on the optimal decision tree built by means of those boundaries.
Finally, we plan to formulate the problem of finding an optimal supervised discretization as a combinatorial optimization problem and to study efficient algorithms for solving it.

Figure 1 :
Figure 1: Closeness of a Counterfactual Explanation to the Decision Boundaries of the Target model produced by a Random Forest projected on two features (Income and # years of credit).
(a) An initial point x 0 and its counterfactual explanation (b) An example of a decision tree

Figure 2 :
Figure 2: Assume that x_CE is computed by solving problem (1) with x_0 as input, for a given Tree Ensemble acting as Target model. A perturbation (±ϵ) of the values of the features where x_0 and x_CE differ can be used as splitting values for the nodes of a univariate Decision Tree.

Figure 3 :
Figure 3: Accuracy results on the benchmark datasets. We compare the performance of CART and GOSDT trained on the initial dataset with continuous features, the dataset discretized with the GTRE procedure, and the dataset discretized with the FCCA procedure. It is not possible to apply GOSDT directly to the dataset with continuous features. As reported in Table 1, for datasets with few observations (boston, arrhythmia and ionosphere) the accuracy is computed in a k-fold crossvalidation, while for datasets with many observations (magic, particle and vehicle) the accuracy is computed as the average result of the k classifiers trained in k-fold crossvalidation on the external test set.

Figure 4 :
Figure 4: Number of features used on the benchmark datasets. We compare the performance of CART and GOSDT trained on the initial dataset with continuous features, the dataset discretized with the GTRE procedure, and the dataset discretized with the FCCA procedure.

Figure 5 :
Figure 5: Compression rate on the benchmark datasets. We compare the performance of CART and GOSDT trained on the initial dataset with continuous features, the dataset discretized with the GTRE procedure, and the dataset discretized with the FCCA procedure.

Figure 6 :
Figure 6: Inconsistency rate on the benchmark datasets. We compare the performance of CART and GOSDT trained on the initial dataset with continuous features, the dataset discretized with the GTRE procedure, and the dataset discretized with the FCCA procedure.

Figure 8 :
Figure 8: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on arrhythmia.

Figure 9 :
Figure 9: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on ionosphere.

Figure 10 :
Figure 10: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on magic.

Figure 11 :
Figure 11: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on particle.

Figure 12 :
Figure 12: Heatmap of the thresholds extracted by the GTRE procedure and the FCCA procedure on vehicle.

Figure 13 :
Figure 13: Comparison on the use of different Target algorithms in the FCCA procedure in terms of accuracy, sparsity, compression and inconsistency on boston.

Figure 14 :
Figure 14: Heatmap of the thresholds extracted by using the Random Forest and Linear SVM algorithms in the FCCA procedure on boston.

Table 1 :
Summary of the datasets used in the experimental phase. We report the dataset name; the total number of data points n; the number of features m; the number of data points used as training set; the value of k used for the k-fold crossvalidation; and the number of data points used as an external test set.

Table 2 :
Experimental settings for the different datasets.