teex: A toolbox for the evaluation of explanations

We present teex, a Python toolbox for the evaluation of explanations. teex focuses on the evaluation of local explanations of the predictions of machine learning models by comparing them to ground-truth explanations. It supports several types of explanations: feature importance vectors, saliency maps, decision rules, and word importance maps. A collection of evaluation metrics is provided for each type. Real-world datasets and generators of synthetic data with ground-truth explanations are also contained within the library. teex contributes to research on explainable AI by providing tested, streamlined, user-friendly tools to compute quality metrics for the evaluation of explanation methods. Source code and a basic overview can be found at github.com/chus-chus/teex, and tutorials and full API documentation are at teex.readthedocs.io.


Introduction
Explainable Artificial Intelligence (XAI) is the field dedicated to making AI models human-understandable. Explainer methods are an important part of this field: by generating explanations, they give users a general overview of a model's functioning (global explanations) or of the reasoning behind a single prediction (local explanations). teex is a tool designed to evaluate explainer methods in XAI, particularly those that generate local explanations for classifications made by machine learning models (such as LIME [1]). teex provides an extensible collection of metrics that enable comparison between post-hoc and ground-truth local explanations. It also provides built-in support for multiple explanation types (saliency maps, decision rules, feature importance vectors, and word importance vectors) while aiming to be extensible in this regard. Although its use is not strictly bound to the availability of ground-truth explanations (e.g., it can be used to compare explanations generated by different methods), teex contains multiple, easy-to-access real-world and artificial datasets [2][3][4][5] with ground-truth explanations to enable benchmark comparisons. In the case of the real-world data included, expert annotations are provided as ground-truth explanations. To enable integration with related software, we provide wrappers for the extraction and usage of local explanations from popular Python XAI libraries.
teex supports the usage of XAI evaluation methods in a (1) general, (2) extensible, and (3) simple way:
• By allowing evaluation of the most frequently used explanation types in a model- and explainer-independent manner.
• By clearly encapsulating functionality: evaluation and data generation methods exist within distinct modules, inside a sub-package for each explanation type. APIs are standardized across all modules and the architectural structure is clearly laid out.
• By providing single-line evaluation APIs (as shown in the example below) and comprehensive documentation, including tutorials and use cases. This enables seamless integration with evaluation pipelines.

Evaluating the quality of explanations is not straightforward. In particular, when no ground-truth explanations are available, evaluation is bound to indirect metrics related to the underlying model's behavior, usually measuring fidelity, sensitivity, complexity, or other aspects. While this form of evaluation is valid, it is desirable to streamline automatic evaluation against ground truths. Although this approach requires data with expert annotations, it is straightforward to use and understand, and it is additionally model- and explainer-independent. It also provides a way to establish whether a model produces correct classifications for the right reasons.
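As a minimal illustration of the single-line evaluation API, the sketch below compares a set of generated feature importance vectors against ground truths. It assumes the feature_importance_scores helper and the metric identifiers documented at teex.readthedocs.io; exact names and signatures should be checked there.

import numpy as np
# assumed location of the evaluation helper, per the teex documentation
from teex.featureImportance.eval import feature_importance_scores

gts   = np.array([[1.0, 0.0, 0.5], [0.2, 0.9, 0.0]])   # ground-truth importances (ê)
preds = np.array([[0.8, 0.1, 0.6], [0.1, 1.0, 0.2]])   # generated importances (e)

# a single call yields the selected quality metrics for the generated explanations
scores = feature_importance_scores(gts, preds, metrics=['fscore', 'cs', 'auc'])
print(scores)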
We believe that providing a tool that implements this approach to evaluating explanations is an important step for the community. There are libraries solely specializing in generating explanations (Alibi [6], dalex [7], iNNvestigate [8], zennit [9]) and libraries that include some evaluation metrics (Captum [10], AIX360 [11], TorchRay [12]), but there is relatively little comprehensive tool support for the streamlined evaluation of XAI techniques. The only other dedicated library, Quantus [13], does not focus on evaluating against ground-truth explanations. Important features of these libraries are compared in Table 2.

Software description
To evaluate explanation quality, the required elements are depicted in Fig. 3, where e is the explanation generated by an explanation method: e explains the prediction f(x) made by a black-box model f for a given input x. To evaluate the quality of e, a ground-truth explanation ê is needed. Given an evaluation function Q, we can then compute Q(e, ê). See Fig. 2 for a concrete example of the evaluation process.
teex makes the evaluation process convenient by:
• Providing a collection of metrics Q that are commonly used in the literature for evaluating explanations.
• Providing easy access to ê for a collection of machine learning datasets where the ground-truth explanations for individual instances are available. This information is difficult to collect in practice and is not available in many traditional datasets.
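For instance, a dataset with ground-truth saliency maps can be loaded in a few lines. The sketch below assumes the Kahikatea loader in teex.saliencyMap.data, analogous to the CUB200 loader shown in the image data section.

from teex.saliencyMap.data import Kahikatea

# each observation comes with a label and a ground-truth explanation (ê)
X, y, exps = Kahikatea()[:]
img, label, gt_mask = X[0], y[0], exps[0]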

Datasets
We provide datasets with ground-truth explanations for four explanation representations, including both real datasets and synthetic ones.
All datasets share the same user API. In particular, teex includes, as of now:
• Image data with saliency maps as explanations.
• Text data with word importance as explanations.
• Tabular data with rules as explanations.
• Tabular data with feature importance as explanations.

Image data
We provide several image classification datasets with ground-truth saliency maps that are, e.g., suitable for the evaluation of the explanations of classifications obtained from convolutional neural networks:
• Kahikatea contains images for Kahikatea classification. The Kahikatea is an indigenous plant in New Zealand. The data has expert annotations marking the image regions that contain Kahikatea trees, which serve as ground-truth saliency maps. An example image from the Kahikatea data and a corresponding explanation can be found in Fig. 1.

Fig. 2. Explanation evaluation procedure for a saliency map image explanation, given an expert explanation and an explanation generated by an external method. (1) First, the expert explanation is transformed into a binary 2D matrix, where each entry corresponds to a pixel and is set to 1 or 0 depending on whether it contains the object or not. (2) Then, the generated explanation is transformed into a 2D matrix, where each entry is the normalized attribution (from 0 to 1) of the corresponding pixel. Depending on the quality metric that the user chooses, this matrix may need to be binarized by choosing a value threshold. (3) After this, both matrices are flattened into 1D vectors. (4) Finally, both vectors can be quantitatively compared using a selected metric Q.
• CUB-200-2011 and Oxford-IIIT Pet are well-known datasets frequently used for evaluating the accuracy of image classification techniques, and both are also available in teex. Together they contain over 19,000 images and more than 230 distinct classes. An example image from the CUB-200-2011 data and a corresponding explanation can be found in Fig. 4.
• The included synthetic image data generation method, adapted from [5], can produce an arbitrary number of images with pixel-level explanations of one class. An example can be found in Fig. 5. Given some parameters, first, a pattern image (yellow pixels in Fig. 5) is generated; then the images are generated cell by cell, with the pattern embedded in images of the positive class, and the pattern pixels constitute the ground-truth explanation.

Text data

We also provide text data with word importance explanations, based on the 20 Newsgroups dataset (https://www.kaggle.com/datasets/crawford/20-newsgroups). A word importance explanation maps the relevant words in a document to numerical importance scores, e.g., {'electronics': 1.0, 'flash': 0.5}.

Tabular data
Evaluating explanations on tabular data is another important task. We provide synthetic tabular data generation methods with two types of explanations, decision rules and feature importance, based on different underlying transparent models.
The case of feature importance is similar to the above example of word importance for text data. That is, an observation is a numerical vector with an associated class, and its corresponding explanation is a vector of numerical importances for each feature, bounded between −1 (inversely correlated with the observation's class) and 1 (positively correlated). The data points are sampled from normal distributions, labeled by thresholding randomly generated linear functions, and explained using their gradients. The observations are analogous in the case of the synthetic decision rule data, but the explanations contain conditional rules for the classification of the observation into a class instead of importances. The data points are sampled from normally distributed clusters, and their explanations are generated by parsing decision trees trained on them. See the original Ref. [5] for an in-depth explanation of these methods. Below is an example of what a decision rule explanation looks like.
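As an illustration, the sketch below (our own minimal code, not teex's implementation; the helper rule_for and the printed format are illustrative) generates synthetic tabular data as described above, trains a shallow decision tree, and parses the decision path of one observation into a rule explanation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # points sampled from a normal distribution
y = (X @ np.array([1.5, -2.0]) > 0).astype(int)    # labels from a thresholded linear function

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

def rule_for(x):
    """Collect the conditions along the decision path of a single observation."""
    node, conditions, t = 0, [], tree.tree_
    while t.children_left[node] != -1:             # walk down the tree until a leaf
        feat, thr = t.feature[node], t.threshold[node]
        if x[feat] <= thr:
            conditions.append(f"x{feat} <= {thr:.2f}")
            node = t.children_left[node]
        else:
            conditions.append(f"x{feat} > {thr:.2f}")
            node = t.children_right[node]
    return "IF " + " AND ".join(conditions) + f" THEN class = {tree.predict([x])[0]}"

print(rule_for(X[0]))   # e.g. "IF x1 > 0.13 AND x0 <= -0.52 THEN class = 0"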

Metrics
Here we present an overview of the current quality metrics included in teex.

Feature importance
• Cosine Similarity [14]. If the explanations are vectors of feature importance, regardless of whether the values are binary or in the range [0, 1], we can measure the explanation quality using Cosine Similarity:

Q(e, ê) = (e ⋅ ê) / (‖e‖ ‖ê‖),

where e ⋅ ê is the dot product and ‖e‖ is the L2-norm of e. The closer the metric is to 1, the greater the explanation quality of e.
• Precision, Recall, F1 score. For these metrics, both the ground truth and the generated explanation are binarized according to a user-defined threshold. Once this has been done, explanation quality can be measured by the well-known precision, recall, and F1 score metrics. Precision measures how many of the features marked as important by the generated explanation are truly important in the ground truth.
Recall measures how many truly important features are selected. Its value is 1 if all features with non-zero importance in the ground-truth explanation are also non-zero in the generated explanation. Note that this can be trivially achieved by an explanation that assigns non-zero importance to all features.
The closer the F1 score, the harmonic mean of precision and recall, is to 1, the greater the explanation quality of e.
• AUC. The area under the ROC curve provides an alternative to the above metrics. For this metric, only the ground truth is binarized, and the generated explanation's importance scores are used to obtain the ranking for the calculation of the area under the curve.
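A schematic computation of these metrics on a single pair of explanation vectors, using numpy and scikit-learn directly rather than teex's helpers (the example vectors are made up):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

e_hat = np.array([1.0, 0.0, 0.7, 0.0])     # ground-truth importances (ê)
e     = np.array([0.9, 0.1, 0.4, 0.2])     # generated importances (e)

# cosine similarity: Q(e, ê) = (e · ê) / (‖e‖ ‖ê‖)
cos_sim = e @ e_hat / (np.linalg.norm(e) * np.linalg.norm(e_hat))

# precision / recall / F1: binarize both vectors with a user-defined threshold
thr = 0.5
e_bin, gt_bin = (e >= thr).astype(int), (e_hat >= thr).astype(int)
prec, rec, f1 = precision_score(gt_bin, e_bin), recall_score(gt_bin, e_bin), f1_score(gt_bin, e_bin)

# AUC: only the ground truth is binarized; the generated scores provide the ranking
auc = roc_auc_score(gt_bin, e)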

Saliency maps
In the context of image classification, a saliency map explanation e for a prediction f(x) is represented as a two-dimensional array of the same size as the input image, where each entry of e is a real number providing the attribution of the corresponding pixel in x. For the evaluation of saliency maps, we provide the same metrics as in the case of feature importance. In this case, each pixel of an image is considered to be a feature: a saliency map explanation of size (h, w) is flattened into a feature importance vector of length h × w.
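Concretely, this amounts to flattening both the ground-truth and the generated maps before applying the feature importance metrics above; a minimal numpy sketch with made-up shapes and values:

import numpy as np

h, w = 224, 224
gt_map  = np.zeros((h, w)); gt_map[50:120, 60:150] = 1.0   # binary ground-truth saliency map (ê)
gen_map = np.random.rand(h, w)                              # generated attributions, normalized to [0, 1]

# flatten the (h, w) maps into feature importance vectors of length h * w
gt_vec, gen_vec = gt_map.ravel(), gen_map.ravel()
# gt_vec and gen_vec can now be passed to the feature importance metrics above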
Table 3. In 3.a (left), we report the average metrics over the test set, comparing the explanations extracted with each of the specified methods to the ground truth. In 3.b (right), the same explanations are compared to those extracted via Integrated Gradients (the scores for Integrated Gradients are left blank because they would be compared to themselves). Note that the explanations are binarized where necessary in order to compute the F1 score, precision, recall, cosine similarity, and AUC (in 3.b, only the explanation taken as ground truth, i.e., that of Integrated Gradients, is binarized).

Decision rules
In the case of tabular data, teex can also process explanations in the form of decision rules, not just feature importance scores.
• Complete Rule Quality. Each rule explanation can be converted into a vector where, for the i-th feature in the observed domain, the values of its lower and upper bounds are reported. For example, the explanation x0 > 5, x1 < 2 can be converted to the bounds (5, ∞) for x0 and (−∞, 2) for x1. Then the explanation quality can be measured as

crq(e, ê) = (1/N) Σᵢ 1(|eᵢ − êᵢ| ≤ δ),

where the sum runs over the bounds that are neither ∞ nor −∞ in both e and ê, N is the number of such bounds, 1(⋅) is the indicator function, and δ is the similarity threshold. This means that the closer a rule explanation's lower and upper limits are to the real lower and upper limits, the better the explanation is; and the more limits that are close, the closer the explanation is to being accurate.
• All metrics available for feature importance are also available for evaluating rules, where a transformation of the rule into a feature importance vector is performed first. The feature importance vector has the same number of entries as the number of features in the domain, and each entry is either 1 or 0 depending on whether the corresponding feature appears in the rule.
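A minimal sketch of this computation (our own code, not teex's implementation; the dictionary-based rule encoding and the helper names are illustrative):

import numpy as np

INF = np.inf

def rule_to_bounds(rule, n_features):
    """Convert {feature: (lower, upper)} conditions into a flat bound vector."""
    bounds = []
    for i in range(n_features):
        lo, up = rule.get(i, (-INF, INF))
        bounds.extend([lo, up])
    return np.array(bounds)

def complete_rule_quality(gt_rule, rule, n_features, delta=0.1):
    e_hat, e = rule_to_bounds(gt_rule, n_features), rule_to_bounds(rule, n_features)
    finite = np.isfinite(e_hat) & np.isfinite(e)     # ignore bounds that are ±inf in either rule
    if not finite.any():
        return 0.0
    return float(np.mean(np.abs(e_hat[finite] - e[finite]) <= delta))

# "x0 > 5, x1 < 2" expressed as {feature: (lower, upper)}
gt  = {0: (5.0, INF), 1: (-INF, 2.0)}
gen = {0: (4.95, INF), 1: (-INF, 2.4)}
print(complete_rule_quality(gt, gen, n_features=2, delta=0.1))   # 0.5: one of the two finite bounds is close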

Word importance
For word importance explanations, the same metrics as for feature importance are available: the vocabulary is considered to be the feature space, and a word importance explanation may contain any subset of the words in the vocabulary.
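For example, a word importance explanation can be mapped onto the vocabulary to obtain a feature importance vector; a minimal sketch with an illustrative vocabulary and explanation:

import numpy as np

vocab = ['camera', 'electronics', 'flash', 'price']            # the feature space
explanation = {'electronics': 1.0, 'flash': 0.5}               # generated word importances

# words absent from the explanation receive an importance of 0
e = np.array([explanation.get(word, 0.0) for word in vocab])   # -> [0.0, 1.0, 0.5, 0.0]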

Experiments
To demonstrate teex's usage, we present a benchmark comparison. A classification model (a pre-trained SqueezeNet [15] from PyTorch [16]) has been fine-tuned on the Kahikatea dataset, obtaining a 0.82 F1 score on the validation data and a 0.65 F1 score on the test data when evaluating classification performance. From this model, we extract local explanations for 40 positively labeled observations of the test set, using the following methods: Gradient SHAP [17], GradCAM [18], DeepLIFT [19], Guided Backpropagation [20], Occlusion [21], and Integrated Gradients [22]. Then, we use teex to quantitatively evaluate them. The results are shown in Table 3. Code is available in the Saliency Map demo on teex's GitHub.
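The extraction step can be sketched with Captum's attribution API, here for Integrated Gradients; we assume a fine-tuned model and a preprocessed image tensor x (both placeholders), and the resulting map can then be compared to the ground-truth saliency map with the metrics described earlier:

from captum.attr import IntegratedGradients

# model: the fine-tuned SqueezeNet; x: a preprocessed image batch of shape (1, 3, H, W)
ig = IntegratedGradients(model)
attributions = ig.attribute(x, target=1)             # attributions for the positive class

# collapse the channel dimension and normalize to [0, 1] so that the map can be
# binarized and compared against the binary ground-truth saliency map
saliency = attributions.abs().sum(dim=1).squeeze(0)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-12)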
Inspecting the results in Table 3.a, all explainers achieve similar scores for the various metrics when compared to the ground truth, although there are some differences, particularly in precision and recall. Also, the scores are not high, which indicates that the model has not entirely learned the particular features of the Kahikatea trees. Table 3.b reflects a characteristic of our evaluation procedure: the binarization threshold chosen for the evaluation directly influences the results. In this case, the cosine similarity scores indicate that the explanations are almost identical to those of Integrated Gradients, but the other scores do not. This is due to how the threshold (0.5) interacts with the distributions of attributions, and it needs to be taken into account by the user, particularly when the explanations being compared appear to be very similar to each other.
In this simple experiment, we have used our tool to quickly obtain relevant information about the model's behavior, as well as to compare the performance of various explainer methods on it. This can help iterate on the model during development and select the best-performing explainer for a particular use case. Combining teex with other explanation evaluation libraries would provide an even more complete assessment.

Summary
teex is a Python library comprising tools to help researchers and end-users evaluate the quality of local explanations against ground-truth explanations, provided by human experts or generated algorithmically in the case of synthetic data. It includes a comprehensive set of quality metrics that can be applied to different explanation types, and it also aims to serve as a hub for datasets with ground-truth local explanations, which are notoriously hard to find. teex has been conceived as an effort to make XAI evaluation a more streamlined, reproducible, simple, and clear procedure, with ease of use and flexibility in mind, and it can be used in tandem with other libraries for the generation and evaluation of explanations.

Future improvements
Future improvements will focus on providing access to more datasets with ground-truth explanations, as well as additional metrics. Beyond the immediate scope of the project, it may also be possible to extend the library to other predictive tasks such as regression or clustering.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 3 .
Fig. 3. Evaluating explanation e (generated with any explanation method) by contrasting it against the ground truth ê.

Loading the CUB-200-2011 data in teex:

from teex.saliencyMap.data import CUB200
X, y, exps = CUB200()[:]

Fig. 5 .
Fig. 5. Example observation from the Synthetic image dataset generation method.