Calibrated Explanations: with Uncertainty Information and Counterfactuals

While local explanations for AI models can offer insights into individual predictions, such as feature importance, they are plagued by issues like instability. The unreliability of feature weights, often skewed due to poorly calibrated ML models, deepens these challenges. Moreover, the critical aspect of feature importance uncertainty remains mostly unaddressed in Explainable AI (XAI). The novel feature importance explanation method presented in this paper, called Calibrated Explanations (CE), is designed to tackle these issues head-on. Built on the foundation of Venn-Abers, CE not only calibrates the underlying model but also delivers reliable feature importance explanations with an exact definition of the feature weights. CE goes beyond conventional solutions by addressing output uncertainty. It accomplishes this by providing uncertainty quantification for both feature weights and the model's probability estimates. Additionally, CE is model-agnostic, featuring easily comprehensible conditional rules and the ability to generate counterfactual explanations with embedded uncertainty quantification. Results from an evaluation with 25 benchmark datasets underscore the efficacy of CE, establishing it as a fast, reliable, stable, and robust solution.


Introduction
In recent years, Artificial Intelligence (AI) has become a pervasive component of Decision Support Systems (DSSs) across various domains, including, e.g., retail, sports, and defence (Zhou et al., 2021a). However, the predictive models used in AI-based DSSs are generally not designed for transparency, being limited to only presenting a probable outcome (David Gunning, 2017; Ribeiro et al., 2016), which can lead to either misuse (when user reliance is higher than appropriate) or disuse (when user reliance is lower than appropriate) (Alvarado-Valencia & Barrero, 2014; Buçinca et al., 2020).
Due to this lack of transparency, predictions of DSSs often require an explanation. In explainable artificial intelligence (XAI), the goal is to create AI systems that can explain their rationale to a human user, making it possible to detect erroneous predictions, e.g., a wrong medical diagnosis (Gunning & Aha, 2019). An explanation should characterise the strengths and weaknesses of the underlying model and communicate how it will behave in the future (David Gunning, 2017; Dimanov et al., 2020).
Explanations in XAI can be divided into two categories: local and global.
Local explanations offer information about the causes of individual predictions, while global explanations provide information about the model as a whole (Guidotti et al., 2018; Moradi & Samwald, 2021; Martens & Foster, 2014). However, local explanations have several drawbacks, such as instability, where slight variations in the instance can lead to significantly different explanations (Slack et al., 2021; Rahnama & Boström, 2019). This instability creates issues in evaluating the quality of explanations, as metrics like fidelity (how accurately an explanation captures the behaviour of the underlying model) do not provide an accurate picture of explanation quality, since they depend heavily on the implementation details of the explanation method (Slack et al., 2021; Moradi & Samwald, 2021; Hoffman et al., 2018; Carvalho et al., 2019; Adadi & Berrada, 2018; Wang et al., 2019; Mueller et al., 2019; Agarwal et al., 2022). Additionally, state-of-the-art explanation techniques offer limited insights into model uncertainty and reliability.
In local explanation methods, the probability estimate that most ML models output is commonly used as an indicator of the likelihood of each class.
However, it is well-known that ML models are often poorly calibrated, meaning that the probability estimates do not correspond to the actual probabilities of being correct. To address this, calibration methods like Platt scaling (Platt et al., 1999) and Venn-Abers (VA) (Vovk & Petej, 2012) have been developed.
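As a concrete illustration, Platt scaling is available in scikit-learn; the sketch below (the dataset and model choices are ours, not the paper's) calibrates a random forest with a logistic sigmoid fitted on held-out folds:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Platt scaling: fit a logistic sigmoid to the model's scores on
# held-out folds, mapping raw scores to calibrated probabilities.
platt = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                               method="sigmoid", cv=3).fit(X, y)
probs = platt.predict_proba(X)[:, 1]
```

Note that such parametric methods output a single calibrated estimate per prediction, whereas VA, discussed next, additionally provides an interval.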
VA produces a probability interval for each prediction, which can be aggregated into a calibrated probability estimate through regularization, enabling comparison with other calibration methods or with the underlying model's probability estimate. The calibrated probability estimate has been shown to be at least as well-calibrated as those of other calibration techniques (Pereira et al., 2020).
When using VA for decision-making, it is crucial to note that the technique also provides intervals for each class, quantifying the uncertainty in the probability estimate, which is useful from an explanatory perspective. The width of the interval reflects the model's uncertainty: a narrower interval indicates less uncertainty in the probability estimate, and a wider interval indicates more. Uncertainty information can also be applied to the features, since the feature weights are based on the prediction's probability estimate. In recent years, uncertainty estimation has been identified as an essential part of an explanation in order to make the underlying model more transparent (Bhatt et al., 2021; Slack et al., 2021). Although well-calibrated uncertainty is highlighted as a key factor in transparent decision-making, Bhatt et al. (2021) point out the cost of obtaining calibrated uncertainty estimates for complex problems, and the focus (Slack et al., 2021) has mostly been on using a well-calibrated (Bayesian) underlying model instead of applying a calibration technique. In this paper, we use the simple and time-efficient calibration technique VA to estimate the uncertainty. The presented method is also model-agnostic, since it is applied on top of the underlying model. Consequently, uncertainty estimation can improve the quality and usefulness of explanations in XAI.
In this paper, we propose a new feature importance explanation method, Calibrated Explanations (CE). The proposed method is based on VA and has the following characteristics:
• Fast, reliable, stable and robust feature importance explanations.
• Calibration of the underlying model to ensure that probability estimates are closer to reality.
• Uncertainty quantification of the probability estimates from the underlying model and the feature importance weights.
• Rules with straightforward interpretation in relation to the feature weights.
• Possibility to generate counterfactual rules with uncertainty quantification of the expected probability estimates achieved.
The paper is organized as follows: the next section reviews fundamental concepts related to explanation methods and VA predictors and provides an introduction to Calibrated Explanations. Section 3 provides an overview of relevant work tackling uncertainty estimation in explanation methods. Section 4 outlines the experimental setup, while the results are presented in Section 5. The paper ends with a discussion followed by concluding remarks.

Post-Hoc Explanation Methods
Research in explainable artificial intelligence (XAI) can be broadly classified into two categories: developing inherently interpretable and transparent models, and developing post-hoc methods to explain opaque models. Post-hoc explanation methods aim to create simplified and interpretable models that explain the relationship between feature values and the model's prediction. These explanations can be local or global and often use visual support, such as pixel representations, feature importance plots, or word clouds, to highlight the most critical features, pixels, or words responsible for the model's prediction (Molnar, 2022; Moradi & Samwald, 2021).
Post-hoc explanations can be either factual (this feature value causes this effect on the prediction) or counterfactual (what happens to the prediction if this feature's value changes) (Mothilal et al., 2020; Guidotti, 2022; Wachter et al., 2017). Worth noting is that counterfactual explanations are always local. Counterfactual explanations are also seen as highly human-friendly, since they mirror how humans reason (Molnar, 2022).

Essential Characteristics of Explanations
Creating high-quality explanations in XAI is a multidisciplinary effort, drawing knowledge from both HCI and ML. Because of this interdisciplinarity, the quality of an explanation method depends on the goals it addresses, which may vary. For example, assessing how users appreciate the explanation interface is distinct from evaluating whether the explanation accurately mirrors the underlying model (Löfström et al., 2022). Nonetheless, specific characteristics are universally desirable for post-hoc explanation methods.
It is critical that an explanation method reliably mirrors the underlying model, which is closely connected to the concept that an explanation method should have a high level of fidelity to the underlying model (Slack et al., 2021). In that sense, a trustworthy explanation must have feature weights that accurately correspond to their actual impact on the probability estimates, thereby reflecting the model's behaviour, i.e., be well-calibrated (Bhatt et al., 2021).
Stability and robustness are two additional critical features of explanation methods (Dimanov et al., 2020; Agarwal et al., 2022; Alvarez-Melis & Jaakkola, 2018). Stability refers to the consistency of the explanations: the same instance and model should produce identical explanations across multiple runs (Slack et al., 2021; Carvalho et al., 2019). In contrast, robustness refers to the ability of an explanation method to produce consistent results even when an instance undergoes small perturbations (Dimanov et al., 2020). Consequently, the essential characteristics of an explanation method in XAI are that it should be reliable, stable, and robust.

Venn-Abers predictors
A probabilistic predictor generates a predicted class label and a probability distribution over all possible labels. To determine the validity of the predicted probability distributions, statistical tests are performed based on observation of the actual labels. Achieving validity for probabilistic predictions is generally impossible, as discussed in previous research (Vovk et al., 2005). However, this paper focuses on calibration, which can be defined as

P(c | p_c) = p_c,

where p_c represents the probability estimate for a particular class label c. A well-calibrated model produces predicted probabilities that match observed accuracy. For example, if a model assigns a probability estimate of 0.9 to a specific label, that label should be correct approximately 90% of the time. However, many predictive models produce poorly calibrated probability estimates. When a model is poorly calibrated, an external calibration method can be applied, using a separate portion of the labelled data called the calibration set to adjust the predicted probabilities.
Venn predictors are probabilistic predictors (Vovk et al., 2004) that, for each label, output multi-probabilistic predictions.These multi-probabilistic predictions are converted into probability intervals where the interval size indicates confidence in the estimation.
In inductive Venn prediction (Lambrou et al., 2015), the underlying model is used to divide the calibration instances into a number of categories based on a Venn taxonomy. Within each category, the estimated probability for test instances falling into that category is the relative frequency of each class label among all calibration instances in the category. Validity is achieved by including the test instance in the calculation. Since the correct label is unknown for the test instances, every possible label is tried, resulting in the multi-probabilistic prediction.
Choosing an appropriate taxonomy when using Venn predictors can be challenging. One alternative is Venn-Abers predictors (VA) (Vovk & Petej, 2012), where the taxonomy is automatically optimized using isotonic regression. Isotonic regression is a non-parametric regression technique that fits a piecewise constant function to the data, such that the function is monotonically increasing or decreasing. VA predictors are used together with scoring classifiers, and since VA predictors are Venn predictors, they inherit the validity guarantees.
The output of a two-class scoring classifier is a prediction score s(x_i) when predicting a test object x_i. A higher value of s(x_i) signals a more pronounced belief in the positive class, i.e., class label 1. To obtain the predicted class label from a scoring classifier, the score is compared to a fixed threshold value t; the prediction is 1 if s(x_i) > t, and 0 otherwise.
When using VA with scoring classifiers, the threshold is not a fixed value t. Instead, an isotonic regression model is fitted to a number of prediction scores with known true targets, creating an increasing function g(s(x)) (Zadrozny & Elkan, 2001), interpreted as the probability that the label of x is 1, i.e., a calibrator.
Thus, an inductive VA predictor produces a multi-probabilistic prediction as follows:
1. Define a training set {z_1, ..., z_n}, where n = l + q. Each instance z_i = (x_i, y_i) consists of two parts: an object x_i and a label y_i.
2. Divide the training set into a proper training set Z_T with q instances and a calibration set {z_1, ..., z_l}.
3. Use the proper training set Z_T to train a scoring classifier to produce the prediction scores s for {x_1, ..., x_l, x_{n+1}}.
4. Fit one isotonic regression model g_0 to the calibration scores augmented with (s(x_{n+1}), 0), and another model g_1 to the calibration scores augmented with (s(x_{n+1}), 1).
5. Let the probability interval for y_{n+1} = 1 be [g_0(s(x_{n+1})), g_1(s(x_{n+1}))] (henceforth referred to as [p_0, p_1]).
6. When a regular probability estimate is needed, the probability interval [p_0, p_1] is aggregated into a single calibrated probability estimate by following the recommendation from Vovk & Petej (2012) to get a regularized value:

p = p_1 / (1 − p_0 + p_1).    (1)
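The steps above can be sketched with scikit-learn's isotonic regression; this is a minimal illustration of the inductive VA procedure for a single test object, not the implementation used in the paper:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers(scores_cal, y_cal, score_test):
    """Minimal inductive Venn-Abers predictor for one test object.

    scores_cal : prediction scores s(x_1), ..., s(x_l) on the calibration set
    y_cal      : true 0/1 labels of the calibration set
    score_test : score s(x_{n+1}) for the test object

    Returns the interval [p0, p1] and the regularized estimate p.
    """
    p = []
    for label in (0, 1):
        # Append the test score with each tentative label in turn and
        # fit an isotonic (monotonically increasing) calibrator g.
        s = np.append(scores_cal, score_test)
        y = np.append(y_cal, label)
        g = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        g.fit(s, y)
        p.append(g.predict([score_test])[0])
    p0, p1 = p
    # Regularization recommended by Vovk & Petej (2012), equation (1).
    return p0, p1, p1 / (1.0 - p0 + p1)
```

For example, venn_abers([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 0.85) yields the interval [0.5, 1.0] and the regularized estimate 1/1.5.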

Calibrated Explanations
This section introduces a new feature importance explanation method based on VA, called Calibrated Explanations (CE). At the heart of the proposed method lies the realization that the feature weights must have an intuitive meaning to facilitate understanding. Consequently, the feature weights are defined as the amount each feature contributes to the calibrated probability estimate for the positive class. Furthermore, the method also provides an estimate of the amount of uncertainty underlying each feature weight.
The method is implemented in Python with the ambition of being straightforward to use for anyone familiar with explanation methods like LIME or SHAP.
Assume that a scoring classifier, trained using the proper training set Z_T, exists, for which a local explanation for test instance x_{n+1} is wanted. Separate all features into categorical features C and numerical features N, so that C ∪ N = F, the full set of features. By default, the following steps are pursued to achieve an explanation for the test instance:
1. Define the probability interval [p_0, p_1] and the calibrated probability estimate p for x_{n+1} using steps 4-6 in the description of VA above.
2. Define a discretizer for numerical features that defines thresholds and smaller- or greater-than conditions (≤, >) for these features. The discretizer logic is the same as in LIME and is further discussed below. Rules for categorical features are defined using identity conditions (=).
3. For each feature f ∈ F:
If f is categorical:
(a) Iterate over all possible values v ∈ V_f, creating one perturbed instance per value by exchanging the feature value of x_{n+1}, setting x_{n+1}^f = v, where f_v denotes the index of value v for feature f. In the following, let f_v specifically denote the index of the feature value of the test instance itself, i.e., x_{n+1}^f = v_{f_v}.
(b) Calculate and store the probability interval [p_0^{f_v}, p_1^{f_v}] and the calibrated probability estimate p^{f_v} for each perturbed instance.
If f is numerical:
(a) Use the thresholds of the discretizer to identify the closest lower and/or upper threshold(s) surrounding the feature value x_{n+1}^f (the number of groups depends on the type of discretizer as well as on the feature value). Divide all possible feature values in the calibration set for feature f into the two or three groups V_f separated by the lower and/or upper threshold(s).
(b) Within each group, extract the 25th, 50th and 75th percentiles.
i. For each group, iterate over the percentile values pv, creating one perturbed instance per percentile by exchanging the feature value of x_{n+1}, setting x_{n+1}^f = pv.
ii. Apply VA to each perturbed instance and store the probability interval [p_0^{f_pv}, p_1^{f_pv}] and the calibrated probability estimate p^{f_pv}.
iii. Before moving on to the next group, average over all percentile values within the group, creating a probability interval [p_0^{f_v}, p_1^{f_v}] and a calibrated probability estimate p^{f_v} for each group.
(c) Let f_v denote the index of the group including the feature value x_{n+1}^f of the test instance.
(Finalizing step 3) Calculate the feature weight (and interval weights) for feature f as the difference between p (and p_0 and p_1, respectively) and the average of all p^{f_v} (and p_0^{f_v} and p_1^{f_v}), excluding the value or group f_v of the test instance. The weights for the calibrated prediction and the lower and upper bounds are calculated as follows:

w_f = p − (1 / (|V_f| − 1)) Σ_{v ∈ V_f \ {v_{f_v}}} p^{f_v},    (2)
w_f^{low} = p_0 − (1 / (|V_f| − 1)) Σ_{v ∈ V_f \ {v_{f_v}}} p_0^{f_v},    (3)
w_f^{high} = p_1 − (1 / (|V_f| − 1)) Σ_{v ∈ V_f \ {v_{f_v}}} p_1^{f_v}.    (4)

It is essential to discuss the importance of the discretizer for the creation of unambiguously interpretable rules. As the feature weights are the difference between the calibrated probability estimate and the average of all the groups not including the feature value of the test instance, it would obscure the meaning of the feature weight if the feature rule could be of the form 0 < feature f ≤ 2. The reason is that it is often reasonable to assume that probabilities for values below the interval (feature f ≤ 0) may differ significantly from probabilities for values above the interval (feature f > 2), making an average of the two hard to interpret. Consequently, the discretizer is meant to be binary for normal use of CE (even if other discretizers are allowed). We have added two binary discretizers to the set of discretizers provided by LIME:
1. A simple binary discretizer (BinaryDiscretizer), setting the threshold to the median of all values in the calibration set for any numerical feature.
This discretizer is identical to the QuartileDiscretizer (and DecileDiscretizer) in LIME, except that only the median is used.
2. A binary entropy discretizer (BinaryEntropyDiscretizer, which is the default discretizer), equal to the EntropyDiscretizer in LIME in every aspect except that the tree depth is limited to 1, to force a binary split based on the threshold defined by an ordinary decision tree applied to the feature values in the calibration set.
Note that when using a binary discretizer for numeric features, only two groups will be formed, one of which will represent f_v, simplifying equation (2) to w_f = p − p^{¬f_v} (and equations (3) and (4) correspondingly).
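The threshold of such a binary entropy split can be illustrated with a depth-1 decision tree over the calibration values; this is a sketch of the idea, not the library's BinaryEntropyDiscretizer code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def binary_entropy_threshold(x_cal, y_cal):
    """Threshold for one numerical feature, chosen where a depth-1
    entropy decision tree would split the calibration set."""
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=1,
                                  random_state=0)
    tree.fit(np.asarray(x_cal).reshape(-1, 1), y_cal)
    # The root node of the fitted depth-1 tree holds the split threshold.
    return float(tree.tree_.threshold[0])
```

With clearly separated classes, e.g. values [1, 2, 3, 10, 11, 12] with labels [0, 0, 0, 1, 1, 1], the threshold falls between 3 and 10, producing a single ≤/> rule per numerical feature.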
By definition, CE calibrates the underlying model, creating well-calibrated predictions and explanations. VA provides uncertainty quantification of both the probability estimates from the underlying model and the feature importance weights through the intervals. By using equality rules for categorical features and binary rules for numerical features (as recommended above), interpreting the feature weights in relation to the rules and the instance values is straightforward and unambiguous. The explanations are reliable because the rules straightforwardly define the relationship between the calibrated probability estimate and the feature weight. The explanations are robust, i.e., consistent, as long as the feature rules cover the small perturbations; the method does not guarantee robustness for perturbations violating a feature rule condition. The explanations are perfectly stable as long as the same calibration set and model are used. Finally, depending on the size of the calibration set, which is used to train a few isotonic regression models per feature, the generation of well-calibrated explanations is, in most cases, fast compared to existing solutions such as LIME and SHAP.
The main shortcoming of the calibrated explanation method is that it only considers one feature at a time. This means that the method will not detect when multiple features have a combined impact on the prediction or the probability estimate. At the risk of dramatically increasing the computational cost, this shortcoming could partly be addressed by creating conjunctive rules combining two or more of the current rules. Conjunctive rules would still be clearly interpretable, and the computational cost could be reduced by using weight pruning to limit the number of rules considered for combination.
This is left as future work.
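Putting steps 3(a)-(b) and equation (2) together for a single categorical feature can be sketched as follows; va_predict is an assumed helper returning (p_0, p_1, p) for one instance, standing in for the VA procedure above, and is not part of the library API:

```python
import numpy as np

def categorical_feature_weight(x, f, values, va_predict):
    """Sketch of steps 3(a)-(b) and equation (2) for one categorical
    feature f; not the library implementation.

    x          : the test instance as a list of feature values
    f          : index of the categorical feature
    values     : all values V_f for feature f in the calibration set
    va_predict : assumed helper returning (p0, p1, p) for an instance
    """
    _, _, p = va_predict(x)
    perturbed = {}
    for v in values:
        x_pert = list(x)
        x_pert[f] = v                 # exchange the feature value with v
        perturbed[v] = va_predict(x_pert)
    # Equation (2): calibrated estimate minus the average over all
    # alternative values (the test instance's own value is excluded).
    others = [perturbed[v][2] for v in values if v != x[f]]
    weight = p - float(np.mean(others))
    return perturbed, weight
```

A positive weight then means the instance's own value pushes the calibrated probability of the positive class above the average of its alternatives.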

Counterfactual Calibrated Explanations
Using the definition of CE above, it becomes straightforward to generate counterfactual rules. The main difference when using Counterfactual Calibrated Explanations (CCE) is that non-binary discretizers are recommended for numerical features, as they will often allow both ≤-rules and >-rules to be formed. The recommended discretizer for CCE is the EntropyDiscretizer defined in LIME. Probability intervals [p_0^{f_v}, p_1^{f_v}] are defined following the CE procedure described in step 3 above. Based on these results, it is trivial to define a rule for each of the alternative values, with the expected probability interval for each rule already defined. The feature weights defined in equation (2) are only used for sorting counterfactual rules based on impact.
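The impact-based ordering can be sketched in a few lines; the (condition, interval, weight) tuple is our illustrative structure, not the library's:

```python
def sort_counterfactual_rules(rules):
    """Order counterfactual rules by impact: largest absolute feature
    weight (equation (2)) first.  Each rule is assumed to be a tuple
    (condition, (p0, p1), weight)."""
    return sorted(rules, key=lambda rule: abs(rule[2]), reverse=True)

rules = [("BMI > 27.85", (0.5, 0.8), 0.10),
         ("Pregnancies > 2.5", (0.6, 0.9), -0.25)]
ordered = sort_counterfactual_rules(rules)
```

Here the Pregnancies rule is listed first, since its absolute weight is larger, regardless of sign.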

Multi-class CE and CCE
As VA is limited to binary classification, CE and CCE are by default also defined for binary classification. However, following the recommendations by Johansson et al. (2021), extending CE and CCE to provide well-calibrated explanations for multi-class problems is trivial; this is also left for future work.

Uncertainty Estimation in Explanations
The importance of uncertainty for achieving transparency in models has been discussed earlier in Bhatt et al. (2021). In that paper, the authors emphasize the need for well-calibrated uncertainty to help humans understand the models better. Additionally, the authors suggest that providing high levels of transparency through uncertainty information can assist in building trust in the system.
Different techniques for calibrating uncertainty information are compared in Pereira et al. (2020); VA is identified as the preferred method for complementing predictions with a measure of uncertainty.The simplicity of the VA approach makes it preferable to the other calibration methods discussed in the study.
Uncertainty is also addressed in Slack et al. (2021), where the authors develop a new method based on Naive Bayes. However, they do not address the underlying model's calibration level, although Naive Bayes produces rather well-calibrated models.
Venn predictors are used in Alkhatib et al. (2022) to quantify the uncertainty of rule-based explanations, introducing several metrics for rule explanation quality. The authors also identify uncertainty quantification for additive feature importance methods as an attractive research direction, while noting the challenges in evaluating this type of explanation.

Explanation quality
Perturbation-based explanation methods such as LIME can produce misleading explanations causing explanation instability, i.e., repeated runs of the explanation algorithm under the same conditions do not yield the same results (Zhou et al., 2021b). Stability is pointed out as one of the essential characteristics of any explanation technique in that article, which notes that an unstable method provides little insight into how the model works and is considered unreliable. The problem with instability in LIME has been addressed in several studies (Rahnama & Boström, 2019; Zhou et al., 2021b; Agarwal et al., 2022).
The effects of calibration on explanation methods such as LIME have been studied earlier by Löfström et al. (2023). Results from the study show that explanations of better-calibrated models are themselves better calibrated, with the ECE and log loss of the explanations conforming more closely to the ECE and log loss of the model after calibration. The conclusion was that calibration improves models and explanations alike by representing reality more accurately.

Counterfactual Explanations
In recent years, counterfactual explanations have seen increased interest from researchers. Guidotti (2022) compares existing counterfactual explanation methods, highlighting several essential aspects, e.g., stability and running time. The author also points to the challenge of evaluating counterfactual explanations, and explainability methods in general, adding that there is no standard agreement on how to perform an objective evaluation.

Method
The evaluation consists of two parts. The first part is an experiment evaluating VA's ability to improve calibration and the potential impact of the intervals.
The second part is focused on presenting and evaluating the CE explanations.

Calibration of the Underlying Model with Venn-Abers
The first part is an experiment that evaluates the effect of calibration on the underlying models, showing how models are affected by calibration and the potential of the VA intervals. Three different setups per model were used:
• UC: The original uncalibrated model, trained using the entire training set.
• VA: A model calibrated using VA, where the underlying model was trained using 2/3 of the training set and calibrated using 1/3.The calibrated probability estimate (equation 1) is used for comparison.
• (VA): Cheating VA results, included to illustrate the additional information contained in the VA interval. These results show the performance obtained by substituting the calibrated probability estimate with the lower end of the VA interval for instances belonging to class 0 and the upper end for instances belonging to class 1. The same VA intervals as above were used. Obviously, these results cannot be used in practice since they require knowledge of the true class label. Since the intervals span the whole range of possible outcomes identified by VA, these results illustrate the interval's potential compared to only using the uncalibrated or calibrated probability estimates.
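The (VA) substitution can be expressed directly; a small numpy sketch of our own formulation:

```python
import numpy as np

def cheating_probability(p0, p1, y_true):
    """The '(VA)' setup: replace the regularized estimate with the
    interval end matching the true class, i.e. the lower end p0 for
    class 0 and the upper end p1 for class 1.  Requires the true
    labels, so it only illustrates the interval's potential."""
    return np.where(np.asarray(y_true) == 1,
                    np.asarray(p1), np.asarray(p0))
```

For instance, with intervals [0.2, 0.6] and [0.3, 0.7] and true labels 0 and 1, the substituted estimates are 0.2 and 0.7.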
Accuracy and area under the ROC curve (AUC) were used to measure predictive performance. To investigate the quality of the calibration, log loss and the expected calibration error (ECE) are reported. The log loss is calculated as

log loss = −(1/N) Σ_{i=1}^{N} log p_i,

where log is the binary logarithm and p_i is the probability estimate for the correct label of instance i.
To avoid infinite results, the log loss function used (from scikit-learn) clips the probabilities to ensure that they are never exactly 0 or 1. Simply put, log loss penalises overconfident models making incorrect predictions.
When calculating ECE, the probability estimates for the positive class are divided into M equally sized bins (here M = 10) before taking a weighted average of the absolute differences between the fraction of positive (fop) predictions and the mean of the probabilities for the positive class (mopp):

ECE = Σ_{i=1}^{M} (#B_i / n) |fop(B_i) − mopp(B_i)|,

where n is the size of the data set and #B_i represents the number of instances in bin i.
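The ECE computation can be sketched as follows, assuming equal-width bins over [0, 1] (if "equally sized bins" instead means equal-frequency bins, only the binning line would change):

```python
import numpy as np

def ece(p_pos, y, n_bins=10):
    """Expected calibration error over n_bins equal-width bins."""
    p_pos, y = np.asarray(p_pos, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, total = len(y), 0.0
    for i in range(n_bins):
        # Instances whose positive-class estimate falls in bin i
        # (the last bin is closed on the right to include 1.0).
        in_bin = (p_pos >= edges[i]) & (
            (p_pos < edges[i + 1]) if i < n_bins - 1 else (p_pos <= 1.0))
        if not in_bin.any():
            continue
        fop = y[in_bin].mean()        # fraction of positives in the bin
        mopp = p_pos[in_bin].mean()   # mean predicted positive probability
        total += in_bin.sum() / n * abs(fop - mopp)
    return total
```

A perfectly calibrated predictor yields an ECE of 0; e.g., probabilities [0.1, 0.9, 0.8, 0.3] against labels [0, 1, 1, 0] give an ECE of 0.175.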

Evaluation of Calibrated Explanations
CE can be analysed from different perspectives. In this paper, three different kinds of plots are presented. The first two are used when CE is used to extract regular explanations:
• Regular explanations, providing well-calibrated explanations without any uncertainty information. These explanations are directly comparable to other feature importance explanation techniques, like LIME. Even if the structure is similar to SHAP, the meaning of the weights differs.
• Uncertainty explanations, providing well-calibrated explanations including uncertainty intervals to highlight both the importance of a feature and the amount of uncertainty connected with its estimated importance.
These plots are inspired by LIME; especially the rules in LIME have been seen as valuable information in the explanations. For the reasons given above, CE is meant to use binary rules for regular and uncertainty explanations (even if all discretizers used by LIME can also be used by CE). One noteworthy aspect of CE is that the feature weights only show how each feature separately affects the prediction estimate. Therefore, it is not possible to combine the feature weights to see their collected impact on the prediction estimate (as can be done with SHAP).
The third kind of plot is a counterfactual plot showing how the features affect the probability estimate when other feature values are used.
When plotting CE explanations, the user can choose to limit the number of rules to show, which equals the number of features for regular and uncertainty plots, as there is one rule per feature.However, in counterfactual explanations, where CE creates as many counterfactual rules as possible, rules are ordered based on impact, starting with the most impactful rules.
To further enhance the usability of CE, the functions as_SHAP and as_LIME can be used to transform the CE feature weight structure into SHAP and LIME explanation objects, making it possible to use the plotting libraries of either explanation method. However, interpreting SHAP plots, defined for Shapley values, must be done with some care. These functions also make it trivial to replace LIME or SHAP with CE in existing code.

Experimental Setup and Data Sets
The evaluation is divided into four parts: an experiment evaluating model calibration; an evaluation of calibrated explanations through regular and uncertainty plots; an evaluation of counterfactual calibrated explanations, also through plots; and, finally, an evaluation of the execution time of CE.
In the first experiment, standard 10x10-fold stratified cross-validation was used; thus, all results are averaged over 100 folds. Furthermore, 25 binary classification problems were used, publicly available from the UCI repository (Dua & Graff, 2017), among other sources. The evaluations of calibrated explanations and counterfactual calibrated explanations are performed using plots.
In the final evaluation, the computational efficiency of CE, LIME, and SHAP is compared, starting both from the underlying uncalibrated model and from a VA-calibrated model.

Evaluation of the Model Calibration
Before presenting the explanations from Calibrated Explanations, we look at the results from experiment 1 and the calibration of the underlying models in Table 3. UC is the uncalibrated model, VA is the regularized Venn-Abers, and (VA) is the cheating Venn-Abers.
In general, even though VA was trained using 2/3 of the instances, accuracy and AUC are only marginally affected. At the same time, ECE and log loss show that the calibration clearly improved the models; the differences were more pronounced for XGBoost. The cheating VA clearly improved the accuracy, AUC and log loss results, while ECE was only marginally affected. The interpretation of the calibration results is that the calibration errors made in each of the ten ECE bins do not affect the overall results much, but that fewer large errors are made, resulting in much lower log loss. The cheating VA results clearly show that the intervals incorporate a much greater potential than the regularized values reveal.

Calibrated Explanations
Below, regular and uncertainty plots for calibrated explanations are introduced and discussed.

Regular Calibrated Explanations
Regular CE explanations are similar to LIME explanations in several ways.

Counterfactual Calibrated Explanations
In CCE, the plots do not show feature weights. Instead, they focus on the VA probability intervals. Each rule shows the alternative VA probability interval resulting from changing the feature value to a value covered by the counterfactual rule condition. Numerical features can result in at most two counterfactual rules (above or below the thresholds surrounding the feature value), whereas one counterfactual rule is created for each alternative categorical value. In the plots, only the 10 most influential counterfactuals are shown. Just as with the uncertainty plot, the counterfactual plot shows that the two features Pregnancies and BMI affect the prediction the most. However, in the plot, the rules show that the influence of the Pregnancies feature starts already after two pregnancies (Pregnancies > 2.5). Although it increases the probability of being predicted as having diabetes, it also increases the uncertainty notably. The rule BMI > 27.85 increases both the probability of being diabetic and the uncertainty. Furthermore, we can also see that the same features occur in several rules, indicating the impact of being either below or above the current value.
In figure 6, the chosen instance is the same as in figure 2, with a clear prediction of the congressman being a Republican. The plot could, e.g., be read as "what happens to the prediction if the value of the feature physician-fee-freeze changes to zero". There are some interesting features to point out for this instance. First, just as seen earlier in figure 2, the first four features are highly important for the prediction, and changing their values would modify the prediction.
The fifth feature would increase the uncertainty dramatically, without clearly indicating whether the model would predict Democrat or Republican. Using any non-binary discretizer for creating counterfactual rules might be at most 50% slower on numerical features (due to the possibility of adding either a ≤-rule or a >-rule); categorical features are not affected.

Conclusions
We have in this paper presented a new explanation method named Calibrated Explanations (CE), which simultaneously calibrates the underlying model and generates explanations with uncertainty information. The weights are defined as the amount each feature contributes to the calibrated probability estimate for the positive class. Each feature and weight is explained using a conditional rule that is straightforward to interpret.
Furthermore, the method incorporates all information necessary to create counterfactual rules. Consequently, the possibility to extract counterfactual rules is inherent in the method. The counterfactual explanations convey an estimate of the uncertainty of the outcome of each counterfactual rule.
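The weight definition above can be illustrated with a minimal sketch: a feature's weight is read as the difference between the calibrated probability for the instance and the (average) calibrated probability when that feature's value is altered. This is our paraphrase of the definition, not the paper's exact algorithm, and all names below are hypothetical:

```python
import numpy as np

def feature_weights(predict_calibrated, x, alternatives):
    """Illustrative feature weights against a calibrated predictor.

    predict_calibrated: callable mapping an instance (dict) to the
        calibrated probability of the positive class.
    x: the instance to explain, as {feature: value}.
    alternatives: {feature: [alternative values]} to perturb with.
    """
    p = predict_calibrated(x)
    weights = {}
    for feature, values in alternatives.items():
        perturbed = []
        for v in values:
            x_mod = dict(x)          # copy so the instance is unchanged
            x_mod[feature] = v
            perturbed.append(predict_calibrated(x_mod))
        # weight: how much the current value contributes relative to
        # the average over the alternative values
        weights[feature] = p - float(np.mean(perturbed))
    return weights
```

A positive weight then means the feature's current value pushes the calibrated probability toward the positive class, consistent with how the bar plots are read.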
The method includes three types of plots: Regular, Uncertainty, and Counterfactual. The Regular and Uncertainty plots show straightforwardly how each feature affects the probability estimate, with (Uncertainty plot) or without (Regular plot) uncertainty information. The Regular plot focuses on how the features affect the prediction estimate, while the Uncertainty plot focuses on how each feature's uncertainty affects the probability estimate. Finally, the Counterfactual plot illustrates the expected probability estimate, and corresponding uncertainty, that each counterfactual rule would result in. The method is designed to produce stable rules. Reliability follows from the inherent calibration, resulting in both predictions and explanations becoming better representations of the true underlying distribution. The procedure for rule creation prioritizes robustness within the coverage of the rule conditions.
Finally, we evaluated the time cost of generating the explanations and found that CE outperforms both LIME and SHAP for calibrated models, even though SHAP was more efficient for the tree ensembles when no calibration was used. In short, Calibrated Explanations shows the characteristics of a high-quality explanation method.
Three directions for future work include conjunctive rules, support for multiclass problems (as mentioned above), and support for regression.

Figure 1 shows a regular CE plot from the diabetes data set with a low calibrated probability estimate of about 0.11 for the positive class.

Figure 1: Regular CE barplot with continuous feature values from the data set Diabetes.

Figure 3: Uncertainty CE barplot with intervals from the data set Diabetes.

Figure 4: Uncertainty CE barplot with intervals from the data set Liver.

Table 4: Computational cost of creating explanations. The results are from extracting explanations from 10 instances.
The most striking results stem from SHAP, which differs greatly between extracting rules from an uncalibrated random forest or xgBoost model, for which it is clearly optimized, and explaining a Venn-Abers predictor. In fact, if calibration is considered unnecessary (UC), SHAP is clearly the fastest explanation method, several times faster than CE and especially LIME. However, when looking at calibrated explanations (either explaining a calibrated underlying model, as for LIME-VA and SHAP-VA, or calibrating and explaining the underlying model simultaneously, as CE does), CE is clearly the fastest algorithm. It is worth mentioning that the results for CE are for the default discretizer, BinaryEntropyDiscretizer.