Rule Extraction in Unsupervised Anomaly Detection for Model Explainability: Application to OneClass SVM

OneClass SVM is a popular method for unsupervised anomaly detection. Like many other methods, it suffers from the black box problem: it is difficult to justify, in an intuitive and simple manner, why the decision frontier identifies data points as anomalous or non-anomalous. This type of problem is being widely addressed for supervised models. However, it is still an uncharted area for unsupervised learning. In this paper, we evaluate several rule extraction techniques over OneClass SVM models, and present alternative designs for some of those algorithms. Together with that, we propose algorithms to compute metrics related to eXplainable Artificial Intelligence (XAI) regarding the "comprehensibility", "representativeness", "stability" and "diversity" of the extracted rules. We evaluate our proposals with different datasets, including real-world data coming from industry. With this, our proposal contributes to extending XAI techniques to unsupervised machine learning models.


Introduction
Responsible Artificial Intelligence (RAI) is defined as the set of AI principles that should be considered when developing and deploying real applications based on AI [1]. RAI serves as a methodological framework both to identify the core aspects (or AI principles) that should be considered when developing AI solutions and to propose how to implement them. These AI principles include aspects such as Fairness, Explainability, Security, Privacy and Human-centric design [2].
The AI principle of Explainability is addressed through the use of Explainable AI (XAI) techniques, which can be applied to black-box models in order to obtain post-hoc explanations based on the information that they provide. In the literature, there are many XAI proposals for supervised ML models. However, some of the most recent and thorough reviews on XAI [3,4,1,5] do not mention many applications of those techniques to unsupervised learning.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Outlier detection is one of the tasks where unsupervised learning is applied. It is defined as the process of detecting anomalous observations within a dataset, and sometimes removing them as a first step within data-mining applications [6]. There is often no prior information about outliers in a dataset, hence unsupervised ML algorithms offer the chance to infer patterns and detect anomalies. However, it is important not only to detect outliers, but also to explain them. Explanations can help to understand why a particular datapoint has been labelled anomalous (and what changes in the feature values would lead to classifying it as an inlier), and how the model behaves globally (for instance, which features have the most influence on classifying a datapoint as an outlier).
The output of an unsupervised ML model for anomaly detection can be seen as binary (an observation may be an "outlier" or an "inlier"). Thus, surrogate post-hoc XAI techniques can yield explanations in the same way as for a supervised binary classifier whose two possible outputs are imbalanced. Hence, the explanations for the model can be obtained by using XAI techniques already designed for supervised ML binary classifiers. This is already addressed in the literature, particularly by using feature-relevance XAI techniques [7,8].
Among the different model-agnostic post-hoc XAI techniques that can be applied, rule extraction offers the possibility to provide both global and local explanations, as indicated by the recent literature [1]. This is achieved by using an "IF...THEN" schema that explains both the output for a particular datapoint and the global behaviour of the original model. In the case of outlier detection, rules can explain both a particular outlier and how the features of the whole model contribute to identifying points as outliers or inliers. Even though there are some examples of this in the literature, particularly for the case of OCSVM [9], to the best of our knowledge there are not many studies covering it.
There is a particularity in the use of rule extraction for explaining anomalies. An outlier detection system that uses rules as explanations may be more interested in explaining faithfully why a datapoint is an outlier, and what should have happened for it to be an inlier, than in covering all possible scenarios with explanations that may be wrong. This means that the extracted rules need to have 100% precision (P@1): rules that classify datapoints from one class (e.g. "outliers") without including datapoints from the other one. Considering the example of extracted rules that cover inliers, this is important because the counterfactual explanation of how to turn an outlier into an inlier should lead to a scenario where the model will always classify it as an inlier.
This is linked to another aspect of XAI. Even though there are many model-agnostic post-hoc rule extraction techniques that can be used for explaining an ML model in general, and an unsupervised one for anomaly detection in particular, one question remains: which technique provides the best explanations? This leads to an open issue within the XAI literature: how to evaluate the quality of explanations. Here, the literature suggests some concepts to consider while designing new metrics and algorithms. The metrics need to consider the type of explanations provided (rule-based in this case) and the type of data used. For instance, in our case we work on anomaly detection without a prior ground truth. Hence, some XAI metrics (like those that measure the accuracy of the rule predictions over a test set) are not applicable. In addition, other particularities of the problem addressed may influence which metrics are more important. In our case, since we are using P@1 rules, measuring the fidelity of the explanations is not necessary (since the comparison will only be possible against the model output). However, other metrics gain more relevance, such as stability. With that, some relevant aspects to measure for this case are:
• "Comprehensibility": Are explanations easy enough to understand?
• "Representativeness": Are explanations relevant? Do they explain all possible cases?
• "Stability": Do explanations match the predictions of the model? Or are there inconsistencies?
• "Diversity": Are explanations sufficiently different from each other? Or are they redundant?
There are only a few implementations of these concepts as metrics to measure and compare rule extraction techniques (and even then, mainly for rule extraction techniques over supervised ML).
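The P@1 requirement can be illustrated with a minimal sketch. It assumes that a rule is a hyper-rectangle over the features and that precision is measured against the model output rather than against a ground truth; the names `covers` and `rule_precision` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A rule is modelled as a hyper-rectangle: feature name -> (lower bound, upper bound).
Rule = Dict[str, Tuple[float, float]]
Point = Dict[str, float]


def covers(rule: Rule, point: Point) -> bool:
    """Return True if the point satisfies every condition of the rule."""
    return all(low <= point[name] <= high for name, (low, high) in rule.items())


def rule_precision(rule: Rule, points: List[Point], model_labels: List[int], target: int) -> float:
    """Fraction of the covered points that the model assigns to the target class."""
    covered = [label for p, label in zip(points, model_labels) if covers(rule, p)]
    if not covered:
        return 0.0
    return sum(label == target for label in covered) / len(covered)


# Hypothetical toy data: +1 = inlier, -1 = outlier (labels come from the model).
points = [{"x": 1.0, "y": 1.0}, {"x": 2.0, "y": 1.5}, {"x": 9.0, "y": 9.0}]
model_labels = [1, 1, -1]

tight_rule = {"x": (0.0, 3.0), "y": (0.0, 2.0)}     # covers only inliers
loose_rule = {"x": (0.0, 10.0), "y": (0.0, 10.0)}   # also covers the outlier

tight_precision = rule_precision(tight_rule, points, model_labels, target=1)
loose_precision = rule_precision(loose_rule, points, model_labels, target=1)
```

In this sketch only `tight_rule` would qualify as a P@1 rule for inliers, because `loose_rule` also covers a point that the model labels as an outlier.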
Following this, the main contributions of our work are:
• Applying different model-agnostic rule extraction techniques to explain the anomalies detected by an unsupervised OCSVM ML model through P@1 rules and for different types of kernels (RBF and Linear). We use existing rule extraction techniques and propose some variations on the one described in [10].
• Quantifying the quality of the generated P@1 rule explanations for unsupervised anomaly detection with OCSVM using XAI metrics that measure comprehensibility, representativeness, stability and diversity. In particular, we propose novel ways to implement stability and diversity.
• Evaluating how the comprehensibility aspect regarding the number of P@1 rules generated varies significantly depending on the kernel considered (Linear vs RBF) and on whether the explanations are for outliers or for inliers.
The empirical evaluations carried out use both open datasets as well as real data from Telefónica. Our evaluation consists in analysing the results for the aforementioned metrics using different rule extraction algorithms over OCSVM models with two kernel configurations, Linear and RBF.
The rest of the paper is organized as follows. First, we describe some related work in the area of XAI and rule extraction applied to SVM. That section also introduces the rule extraction techniques considered, as well as literature related to XAI metrics and the psychology of explanations. After identifying research opportunities derived from these works, the paper introduces some alternatives to some of the rule extraction techniques, as well as algorithms to compute the metrics described before. Following this, we present an empirical evaluation of our algorithms with several datasets. We then conclude, also outlining potential future lines of research.

Related Work
This section reviews unsupervised ML models used for anomaly detection, and reviews previous work on rule extraction in SVM that is relevant for our proposal.

Unsupervised ML for Anomaly Detection
The review of [7] provides an extensive analysis of the SOTA of ML models for anomaly detection, including unsupervised ones. Unsupervised ML models for anomaly detection can be differentiated according to their feature map, or according to the type of model used (in terms of how the decision frontier is obtained). Regarding the feature map, there are two possible types: shallow models (e.g. Minimum Volume Ellipsoid) and deep ones (e.g. Generative Adversarial Networks). Regarding the type of model, four types are mentioned: classification (e.g. OCSVM), probabilistic (e.g. Kernel Density Estimation), reconstruction (e.g. Principal Component Analysis, Deep AutoEncoders) and distance-based (e.g. IsolationForest, Local Outlier Factor). OCSVM is a type of kernel-based One-Class Classification anomaly detection model that is well-suited for multimodal, nonlinear and nonconvex datasets. OCSVM is also an algorithm that, since its original formulation [11], has been developed with many variations.
OCSVM has advantages in terms of computational performance [12]. One of the reasons is that it creates a decision frontier using only the support vectors (like general supervised SVM). Another advantage is that model training always leads to the same solution because the optimization problem is convex. However, SVM (and hence OCSVM) algorithms are difficult to explain due to the mathematically complex method that obtains the decision frontier [1].
From a theoretical point of view, SVM for classification maps the data points available in the dataset to a space of higher dimension than the one determined by their features, so that the separation among classes may be done linearly. It uses a hyperplane obtained from data points from all of the classes. These data points, known as support vectors, are the ones from different classes that lie closest to each other, and they are the only ones needed to determine the decision frontier. However, it is not really necessary to map to a higher dimension, because the equation that appears in the optimization of the algorithm only uses a dot product of those mapped points. Because of that, the only thing to be calculated is that dot product, something that can be accomplished with the well-known kernel trick. Hence, instead of explicitly calculating the mapping to a higher dimension, the equation is solved using a kernel function.
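As a small illustration of the kernel trick described above, the following sketch checks numerically that a homogeneous polynomial kernel of degree 2 in two dimensions gives the same value as the dot product of explicitly mapped points. The choice of kernel, the feature map `phi` and the sample points are assumptions made for the example; they are not the kernels evaluated in the paper.

```python
import math


def dot(a, b):
    """Plain dot product of two equally sized vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(v):
    """Explicit feature map to 3 dimensions matching the kernel (a . b)^2 in 2-D."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    """Kernel function: computes the same value without building phi(a) or phi(b)."""
    return dot(a, b) ** 2


a = (1.0, 2.0)
b = (3.0, 0.5)

explicit = dot(phi(a), phi(b))   # dot product in the mapped space
implicit = poly_kernel(a, b)     # kernel evaluated in the original space
```

Both values coincide (up to floating point error), which is why the optimization only needs kernel evaluations and never the mapped points themselves.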
In OCSVM there are no labels. Hence, all data points are initially considered to belong to the same class. The decision frontier is computed by trying to separate the region of the hyperspace that has a high number of data points close to each other from the region that has a low density, considering the points in the latter as anomalies. To do so, the algorithm tries to define a decision frontier that maximizes the distance to the origin of the hyperspace and that, at the same time, separates from it the maximum number of data points. The compromise between those two factors defines the optimization problem of the algorithm and allows obtaining the optimal decision frontier. The data points that are separated from the origin are labeled as non-anomalous (+1) and the others are labeled as anomalous (-1).
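The optimization problem is reflected in the following equations, which follow the original formulation of [11]:

$$\min_{w,\,\xi,\,\rho} \; \frac{1}{2}\lVert w \rVert^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho$$

subject to:

$$\left(w \cdot \Phi(x_i)\right) \geq \rho - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \dots, n$$

In these equations, $n$ is the number of data points, $\Phi$ is the feature map, $w$ and $\rho$ define the separating hyperplane, $\xi_i$ are slack variables, and $\nu$ is a hyper-parameter known as the rejection rate, which needs to be selected by the user. It sets an upper bound on the fraction of anomalies that can be considered, and also defines a lower bound on the fraction of support vectors that can be considered. Using Lagrange techniques, the decision frontier obtained is the following one:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i K(x_i, x) - \rho\right)$$

where $\alpha_i$ are the Lagrange multipliers and $K$ is the kernel function. Hence, the hyper-parameters that must be defined in this method are the rejection rate, $\nu$, and the type of kernel used.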
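To make the decision frontier more concrete, the sketch below evaluates the standard OCSVM decision function of [11], f(x) = sgn(Σ α_i K(x_i, x) − ρ), with an RBF kernel. The support vectors, the multipliers `alphas`, the offset `rho` and the kernel width `gamma` are hand-picked hypothetical values, not the result of actually solving the optimization problem.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def ocsvm_decision(x, support_vectors, alphas, rho, gamma):
    """Return +1 (non-anomalous) or -1 (anomalous) from the kernel expansion."""
    score = sum(
        alpha * rbf_kernel(sv, x, gamma)
        for sv, alpha in zip(support_vectors, alphas)
    ) - rho
    return 1 if score >= 0 else -1


# Hypothetical values for illustration only (not fitted to any data).
support_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
alphas = [0.4, 0.3, 0.3]
rho = 0.5
gamma = 1.0

near_label = ocsvm_decision((0.3, 0.3), support_vectors, alphas, rho, gamma)
far_label = ocsvm_decision((5.0, 5.0), support_vectors, alphas, rho, gamma)
```

A point close to the support vectors obtains a positive score and is labeled +1, while a distant point has a kernel expansion close to zero and falls below the offset, so it is labeled -1.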

Rule Extraction techniques in XAI
Rule extraction belongs to the group of post-hoc XAI techniques [1]. This group of techniques is applied over an already trained ML model (generally a blackbox one) in order to explain the decision frontier that the model inferred from the input features to obtain its predictions. Rule extraction techniques are further differentiated into two subgroups: model-specific and model-agnostic. Model-specific techniques generate the rules based on specific information from the trained model, while model-agnostic ones only use the input and output information from the trained model, hence they can be applied to any other model. Post-hoc XAI techniques in general are then differentiated depending on whether they provide local explanations (explanations for a particular data point) or global ones (explanations for the whole model). Most rule extraction techniques have the advantage of providing explanations for both cases at the same time.
[13] offers a review of rule extraction techniques for SVM, including the ones that are model-specific. Regarding model-specific techniques, they highlight three different types of algorithms. The first type comprises rule extraction algorithms that use the support vectors from the original model as an input source for generating the rules. This is the case of SQRex-SVM [14], where the authors propose using a subset of the support vectors to infer the rules with a modified sequential covering algorithm. The second type of algorithm uses information from the support vectors together with information from the separating hyperplane. This is the case of RulExSVM [15], where the authors propose a technique applicable to SVM with an RBF kernel. The algorithm uses the support vectors in order to build hyper-rectangles that intersect with the separating hyperplane. Finally, the last type of technique uses the support vectors, the separating hyperplane, and the training data. The training data is used to define the regions in the hyperspace, and the support vectors and the hyperplane define the size of those regions. The proposal of [10] falls within this category.

Model Specific Rule Extraction techniques in XAI for SVM
The authors of [10] propose a technique called SVM+ Prototypes that can be considered model-agnostic or model-specific depending on how it is implemented. The general intuition consists in finding hypercubes (or hyperspheres) using the centroids (or prototypes) of the data points of each class. Then, it can use as vertices either the support vectors from the SVM model, or the data points from that area of the hyperspace that are farthest away from the centroid. With the first alternative, the proposal is model-specific, since it focuses on a specific component of the model itself (the support vectors). The second one is model-agnostic, since it does not use any information that is specific only to SVM models. After this, it infers a rule from the values of the vertices of the hypercube, which contain the limits of all the points inside it, creating one rule for each hypercube.
For example, a dataset that contains two numerical features X and Y will be defined in a 2-dimensional space. The algorithm will create a square that contains the data points on each of the classes, as shown in Figure 1. The rule that justifies that a data point belongs to class 2 is: The generated hypercubes may wrongly include points from the other class when the decision frontier is not linear or spherical, as shown in Figure 2. In this case, the algorithm considers an additional number of clusters trying to include the points into a smaller hypercube, as shown in Figure 3.
A rule will be generated for each hypercube, considering all those scenarios as independent, leading to this output:
• Group 1: CLASS 1 IF X...
• Group 2: CLASS 1 IF X...
There are some downsides to that method in supervised classification tasks, especially when the problem is not simply a binary classification or when the algorithm is performing a regression. For instance, the number of rules may grow immensely because a set of rules will be generated for each category and each set may contain a huge number of rule groups, leading to an incomprehensible output.
However, in OCSVM these difficulties may be mitigated for two reasons. On the one hand, the explanations are reduced to rules that explain when a data point is not an anomaly (so there would be no need to define rules for the anomalies). On the other hand, the algorithm tries to group all non-anomalous points together, setting them apart from the outliers. Because of this, the chance of defining a hypercube that does not contain a point from the other class may be higher than in a standard classification task. Both the inherently unbalanced nature of data points in anomaly detection (few anomalies vs. many more non-anomalous data points) and the fact that non-anomalous points tend to be closer to each other may help to achieve good results with this method.
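The hypercube intuition can be sketched in a few lines. This is a deliberately simplified illustration, not the SVM+ Prototypes algorithm of [10]: it builds a single axis-aligned bounding box around the points that the model labels as inliers (no clustering, no prototypes, no support vectors) and checks whether any outlier falls inside it. All function names and data points are hypothetical.

```python
def bounding_box_rule(points):
    """Axis-aligned hypercube (per-feature min and max) enclosing the points."""
    features = points[0].keys()
    return {f: (min(p[f] for p in points), max(p[f] for p in points)) for f in features}


def inside(rule, point):
    """True if the point lies within the hypercube on every feature."""
    return all(low <= point[f] <= high for f, (low, high) in rule.items())


def rule_to_text(rule, label):
    """Render the hypercube as an IF-style rule."""
    conditions = " AND ".join(f"{low} <= {f} <= {high}" for f, (low, high) in rule.items())
    return f"{label} IF {conditions}"


# Hypothetical points, labelled by the anomaly detection model.
inliers = [{"X": 1.0, "Y": 2.0}, {"X": 2.0, "Y": 3.0}, {"X": 1.5, "Y": 2.5}]
outliers = [{"X": 8.0, "Y": 0.5}, {"X": 1.8, "Y": 2.2}]

rule = bounding_box_rule(inliers)
violations = [p for p in outliers if inside(rule, p)]  # outliers wrongly covered
rule_text = rule_to_text(rule, "INLIER")
```

Here one outlier lies inside the hypercube, so `violations` is not empty. This is the situation in which the inliers would have to be split into more clusters, each with its own smaller hypercube, until no outlier is covered.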

Model-agnostic Rule Extraction techniques in XAI
Many rule extraction proposals contribute to XAI without the need to use any specific information from a particular type of model [1]. The only information necessary for building the rules is the input features and the model outputs. Some techniques use all the training data, while others need only a few input instances, or can even generate artificial datapoints to infer the decision frontier. Even though the techniques were initially conceived for supervised ML, they can be extended to unsupervised ML for anomaly detection, since their output is analogous to that of a binary classifier where the classes are heavily imbalanced, as discussed in Section 1.
A general way to approximate any blackbox model globally is by using a surrogate supervised model trained over the same dataset but, instead of using the real labels (the ones used for the blackbox model), trained over the predictions of that blackbox model [5]. This may be accomplished with any ML model, but it is useful to do it with a whitebox model that can be directly interpreted. Some of these whitebox models may be used for rule extraction. An example is a Decision Tree (DT) model. A DT explains the classification logic of the blackbox model through rules, which can even be used for classifying new instances. The advantages of using a DT as a global surrogate model are its flexibility (it can be applied over any model in an agnostic way) and its simplicity (it is a solution that is easy to explain). However, this approximation ends up explaining a proxy model, and not the actual data, since the surrogate model never sees the true target values.
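The surrogate idea can be illustrated with a minimal sketch in which the whitebox model is a depth-1 decision tree (a single threshold on one feature) fitted to the predictions of a blackbox. The `black_box` function, the data and the brute-force `fit_stump` search are hypothetical stand-ins chosen to keep the example self-contained; a real surrogate would normally be a deeper tree trained with a standard library.

```python
def black_box(x):
    """Hypothetical opaque model: inlier (+1) inside an interval, outlier (-1) outside."""
    return 1 if 2.0 <= x <= 6.0 else -1


def fit_stump(xs, targets):
    """Pick the threshold and orientation that best reproduce the targets."""
    best = None
    for threshold in sorted(set(xs)):
        for left_label in (1, -1):
            preds = [left_label if x <= threshold else -left_label for x in xs]
            agreement = sum(p == t for p, t in zip(preds, targets)) / len(xs)
            if best is None or agreement > best[0]:
                best = (agreement, threshold, left_label)
    return best


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

# The surrogate is trained on the blackbox predictions, not on true labels.
surrogate_targets = [black_box(x) for x in xs]
fidelity, threshold, left_label = fit_stump(xs, surrogate_targets)
```

The best stump is the rule "OUTLIER IF x <= 1.0, ELSE INLIER", which agrees with the blackbox on 7 of the 8 points. The remaining disagreement shows in miniature why a surrogate explains a proxy of the model and why its agreement with the model has to be checked.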
Anchors [16] is a model-agnostic XAI technique that extracts rule explanations for individual data points. The purpose of Anchors is to find a decision rule that approximates the decision function of the blackbox model around that individual data point. This rule "anchors" the prediction of that data point, so that any perturbation of the features of that point that remains inside the rule will always return the same output from the blackbox model. The approach is as follows. First, the algorithm generates candidate rules that may explain the data point. Then, it evaluates those candidate rules. In order to do that, Anchors generates perturbations around the data point (data points similar to the original one) and checks whether they yield the same result. The result is evaluated by calling the blackbox model (the oracle) and obtaining the classification for each generated data point. In order to optimize the exploration-exploitation trade-off of generating and evaluating data points, it uses a reinforcement learning approach with a Multi-Armed Bandit (MAB) approximation. In this MAB, each arm of the bandit problem is a candidate rule, and the data points generated, after obtaining their classification result from the blackbox model, are used to compute a precision metric that evaluates the candidate rule's payoff. This reinforcement learning approach helps to minimize the number of calls to the model in order to reduce the computational cost of the algorithm. Among all the candidate rules, the algorithm then checks whether the best one matches a predefined convergence criterion. To do that, it filters rules according to a precision threshold, and selects from the remaining ones the one with the highest coverage. That rule is used to explain the original data point. If no rule matches the convergence criterion, the algorithm keeps iterating (using a beam search approach), using the B best rules from the previous step to generate new candidate rules for the following one. In those following steps, Anchors keeps extending the rules with more features (in the first step, it only uses one feature per candidate rule).
Thus, Anchors offers a model-agnostic approach that generates IF-THEN rules that are easy to interpret and that are obtained efficiently thanks to the usage of reinforcement learning (MAB), which can be parallelised. However, Anchors is very sensitive to its initial configuration, like many perturbation-based algorithms, such as LIME [17]. Another important consideration is that, while Anchors keeps the calls to the oracle to a minimum (thanks to MAB), it still requires a lot of calls, and that can affect the runtime of the algorithm.
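The precision estimate at the core of this procedure can be sketched as follows. This is only the sampling step, under simplifying assumptions: perturbations are drawn uniformly inside each candidate rule, and there is no multi-armed bandit, beam search or coverage computation, so it should not be read as the Anchors algorithm of [16]. The `black_box` function, the feature ranges and the two candidate rules are hypothetical.

```python
import random


def black_box(point):
    """Hypothetical model: inlier (+1) when both features are below 5."""
    return 1 if point["x"] < 5.0 and point["y"] < 5.0 else -1


def sample_inside(rule, free_ranges, rng):
    """Perturbed point: features in the rule stay within its bounds, the rest vary freely."""
    return {f: rng.uniform(*rule.get(f, free_ranges[f])) for f in free_ranges}


def estimate_precision(rule, target, free_ranges, n_samples, rng):
    """Share of perturbed points covered by the rule that keep the target prediction."""
    hits = sum(
        black_box(sample_inside(rule, free_ranges, rng)) == target
        for _ in range(n_samples)
    )
    return hits / n_samples


rng = random.Random(0)  # fixed seed so the sketch is reproducible
free_ranges = {"x": (0.0, 10.0), "y": (0.0, 10.0)}

one_feature_rule = {"x": (0.0, 4.0)}                    # first step: one feature
two_feature_rule = {"x": (0.0, 4.0), "y": (0.0, 4.0)}   # extended in a later step

p_one = estimate_precision(one_feature_rule, 1, free_ranges, 2000, rng)
p_two = estimate_precision(two_feature_rule, 1, free_ranges, 2000, rng)
```

The one-feature rule leaves `y` unconstrained, so roughly half of its perturbations change the prediction. Extending the rule with a condition on `y` raises the estimated precision to 1, which mirrors how candidate rules are extended with more features until the precision threshold is met.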
RuleFit [18] is a model-agnostic surrogate model that learns a linear regression model (Lasso) that uses as features both the original features of the model and newly generated features that represent decision rules. In order to accomplish that, first, a tree model is trained over the output and the input features, and the decision paths between the tree levels are turned into decision rules, except for the ones that lead to the leaf nodes, which are not considered. These rules are used as additional features, along with the original ones, in the Lasso surrogate model. Thanks to this, RuleFit yields both rules and their contribution, measured through the coefficients of the Lasso model. In summary, RuleFit generates a white-box model that includes rules as features and that can be interpreted as a standard linear regression one. The only caveat is that, for the original features, the predicted outcome changes by $\beta_j$ if feature $x_j$ changes by one unit while the other features remain unchanged, whereas for a rule feature $r_k$ it is different: for regression, if all the conditions of $r_k$ are met, the predicted outcome changes by $\alpha_k$ (the weight associated with that rule). Similarly, for classification tasks, when the conditions of $r_k$ are met, the odds for event vs. no-event change by a factor of $\exp(\alpha_k)$.
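For reference, the model that RuleFit fits can be written compactly as below. The notation is an assumption chosen to match the coefficients mentioned above: $p$ original features, $K$ binary rule features, a loss $L$ and a Lasso penalty weight $\lambda$, none of which are defined in the text itself.

```latex
% RuleFit surrogate: linear terms plus binary rule indicators
\hat{f}(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \sum_{k=1}^{K} \alpha_k r_k(x),
\qquad r_k(x) \in \{0, 1\}

% Lasso estimation of the coefficients over n training points
\min_{\beta, \alpha} \; \sum_{i=1}^{n} L\big(y_i, \hat{f}(x_i)\big)
+ \lambda \Big( \sum_{j=1}^{p} \lvert \beta_j \rvert + \sum_{k=1}^{K} \lvert \alpha_k \rvert \Big)

% Classification with a logistic link: \hat{f}(x) models the log-odds, so
% satisfying rule r_k multiplies the odds by \exp(\alpha_k)
\frac{\mathrm{odds}(x \mid r_k = 1)}{\mathrm{odds}(x \mid r_k = 0)} = \exp(\alpha_k)
```

Because the penalty drives many $\alpha_k$ to exactly zero, the rules that survive with non-zero weight are the ones reported as explanations.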
Similarly to RuleFit, SkopeRules [5] is another way to generate rules from tree ensemble techniques. The two differ, however, in how they obtain the rules. First, SkopeRules generates the rules using surrogate tree ensembles trained using the input features and the target variable. Then, it applies a filtering step in which, using thresholds for Precision and Recall, some rules are removed and some are kept. This step makes it possible to select only high-performing rules and to remove the ones that do not yield good results. The last step is known as "semantic rule deduplication". This step eliminates duplicate rules (rules that are the same as, or very similar to, other ones). It also eliminates low-performing rules once more, based on their F1 score. This makes it possible to obtain rules that are both high-performing and heterogeneous. The final set of rules is the output of SkopeRules, which differs from RuleFit in that it does not use a Lasso model to aggregate all the rules.
Falling Rule Lists (FRL) [19] are classification models that generate a sorted list of IF-THEN rules; thus, they can serve as a model-agnostic global post-hoc rule extraction technique. The rules are binary, and are checked one after the other in order to see whether a particular datapoint can be classified into one of the classes. The rules are sorted according to the probability of classifying a datapoint into that class using that rule. Due to that, FRL offers a list of IF-ELSE IF rules associated with a particular class with a decreasing probability score. This is inspired by the concept of healthcare triage: patients are classified into risk level groups, and the highest-risk ones should be considered first. The particular algorithm cited and used in this paper learns the list with a Bayesian framework instead of a greedy decision tree learning method, and is named Bayesian Falling Rule Lists (BFRL).
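How a falling rule list is read at prediction time can be sketched as follows. The sketch only shows the evaluation of an already learned list (the Bayesian learning procedure of [19] is not reproduced), and the conditions, feature names and probabilities are hypothetical.

```python
# Ordered IF / ELSE IF rules, each paired with the probability of the positive class.
falling_rule_list = [
    (lambda p: p["temperature"] > 90.0, 0.90),
    (lambda p: p["vibration"] > 5.0, 0.60),
    (lambda p: p["load"] > 0.8, 0.30),
]
default_probability = 0.05  # applies when no rule fires (final ELSE)


def is_falling(rule_list, default):
    """Check that the probabilities never increase down the list."""
    probs = [prob for _, prob in rule_list] + [default]
    return all(earlier >= later for earlier, later in zip(probs, probs[1:]))


def predict_probability(rule_list, default, point):
    """Return the probability attached to the first rule that the point satisfies."""
    for condition, probability in rule_list:
        if condition(point):
            return probability
    return default


point_a = {"temperature": 95.0, "vibration": 6.0, "load": 0.9}
point_b = {"temperature": 70.0, "vibration": 2.0, "load": 0.85}
point_c = {"temperature": 70.0, "vibration": 2.0, "load": 0.1}

prob_a = predict_probability(falling_rule_list, default_probability, point_a)
prob_b = predict_probability(falling_rule_list, default_probability, point_b)
prob_c = predict_probability(falling_rule_list, default_probability, point_c)
```

`point_a` satisfies all three conditions but is assigned the probability of the first one, which is the triage behaviour described above: the highest-risk rule that applies decides the outcome.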
Boolean Decision Rules via Column Generation (BRCG) [20] also provides a binary classifier, in disjunctive normal form (DNF, OR-of-ANDs) or conjunctive normal form (CNF, AND-of-ORs), built from interpretable rules. In the case of DNF (the one used in this paper), it provides an unordered set of decision rules that classify a datapoint into the positive category if at least one of the rules is satisfied. This is different from other methods already mentioned, such as BFRL, where the rules are ordered in an IF-ELSE IF schema, or the surrogate DT model, which provides the rules in a tree structure. In this article, we use the BRCG-light approximation from [21], which replaces the integer programming solver used in the original paper with a heuristic beam search.
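The way a DNF rule set classifies a datapoint can be sketched as follows. Only the evaluation of a given rule set is shown (column generation and the beam search of BRCG-light are not reproduced), and the clauses, feature names and thresholds are hypothetical.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}

# OR-of-ANDs: each inner list is a conjunction of (feature, operator, threshold).
dnf_rules = [
    [("x", ">", 8.0)],
    [("x", "<=", 2.0), ("y", ">", 5.0)],
]


def clause_holds(clause, point):
    """A conjunction holds only if every one of its conditions is met."""
    return all(OPS[op](point[feature], value) for feature, op, value in clause)


def dnf_predict(rules, point):
    """Positive class if at least one conjunction is satisfied; order is irrelevant."""
    return any(clause_holds(clause, point) for clause in rules)


first_clause_hit = dnf_predict(dnf_rules, {"x": 9.0, "y": 0.0})
second_clause_hit = dnf_predict(dnf_rules, {"x": 1.0, "y": 6.0})
no_clause_hit = dnf_predict(dnf_rules, {"x": 1.0, "y": 1.0})
```

Unlike in a falling rule list, reordering the clauses of `dnf_rules` does not change any prediction, since a single satisfied conjunction is enough for the positive class.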
Generalized Linear Rule Models (GLRM) [22] generate decision rules and combine them within a linear model (generalized additive model, GAM). Thus, they provide nonlinear modelling, thanks to the decision rules, while keeping interpretability by using a linear model that ensembles them. However, as [23] note, while it is feasible to interpret linear combinations of rules, if the number of rules increases too much there is a risk of losing the interpretability of the model. The authors of the original paper highlight that, in order to reduce the number of rules generated and not lose interpretability, they use a rule selection technique based on column generation (CG). CG searches the space of rules and generates them only when they are needed, and then fits the GLM model again. This allows old rules to be analysed again, re-weighted, and discarded when they are no longer needed. This differs from other methods used in the literature, which mainly either pre-select a subset of candidate rules using optimization techniques, or follow a greedy optimization approach that adds rules one by one using sequential covering or boosting techniques.

XAI for Anomaly Detection
XAI is useful both for explaining an anomaly detection model from a global perspective and for explaining the identification of particular instances as outliers. At the global explanation level, [24] use two anomaly detection ML algorithms (Decision Tree and DeepLog) to detect outliers over log data. Together with that, they use Shapley values in order to generate model-agnostic feature relevance explanations that help to see which features contribute more to predicting outliers, by looking at the individual contribution of each feature to the general outlier probability.
XAI has also been used in anomaly detection for predictive maintenance [8]. The authors highlight that even when an anomaly detection model is very accurate, the operators that receive the model prediction may not trust it if it remains a blackbox that does not provide any insight into its decisions. Because of that, they propose an anomaly detection system where the explanations are generated thanks to the usage of a whitebox model (ElasticNet Logistic Regression). They thus provide explanations in terms of feature relevance, focusing on explaining what contributes to an anomalous state. With that, they highlight that explanations for anomaly detection can be generated in a similar way to those of a supervised ML model for binary classification (even though anomaly detection models provide a heavily imbalanced output).
Shapley values for explaining anomalies also appear in [25], where the SHAP algorithm is used to generate feature relevance explanations in order to explain what contributes to specimen mix-up. For the anomaly detection, they use a Gradient Boosting Tree in order to learn efficiently from highly unbalanced data while yielding good predictions. The authors highlight the importance of having a highly accurate model that is able to correctly predict the specimen mix-up, because this is a crucial problem that may lead to an incorrect diagnosis or an inappropriate therapy.
An additional recent reference is [7]. Within their review of the SOTA of anomaly detection, the authors also cover the importance of using XAI in order to have a deeper understanding of the model. Their focus on explanations is mainly on unsupervised deep learning (DL) models, where the explanations can be produced by model-agnostic post-hoc techniques for feature relevance (LIME) or by using model-specific algorithms (LRP). One of the usages of XAI that they describe is the improvement of the model based on the explanations provided. They show an example for anomaly detection based on images, where XAI helps to check whether the pixels used for making the decision are actually the correct ones.
The analysis of the literature highlights how detecting anomalies is critical within some domains and, because of that, their detection needs to be very precise. However, being able to detect anomalies is not enough, and explanations are needed both for understanding the model better (and seeing whether it can be trusted or improved) and for explaining the model to other audiences so that they can judge whether to rely on its predictions (something connected to the generation of explanations for different user profiles [1]). A model may apparently perform very well, and yet explanations may reveal that it is taking its decisions using features that are not relevant [5]; in that case, the model may ultimately not be trusted. This shows that XAI can complement the classical evaluation of models based only on their performance.
However, after assessing a model and seeing that it behaves correctly (from both the XAI and the performance point of view), and before providing explanations to some user profiles, it is important to ensure that the explanations are aligned with what the model predicts and do not show any contradictory information. A way to do this within the rule extraction scenario is by using P@1 rules with respect to the model output. With that, even though the rules may not explain the whole model or be able to explain every instance, their explanations will be completely aligned with the model.
Feature relevance explanations, as the literature shows, help to see how the features contribute to the positive class (outliers in anomaly detection). Rule extraction explanations can help to explain outliers with respect to what would turn that outlier into an inlier. Considering this, the explanations will target the inlier class, so the outliers can be explained with a counterfactual approach with respect to the non-anomalous subspace (for local explanations). For global explanations that help to see which feature values are normally associated with outlier situations, the explanations would still target the outlier class.
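The counterfactual reading of an inlier rule can be sketched as follows, assuming that the rule is a hyper-rectangle and that it is a P@1 rule for inliers (so that any point moved inside it is classified as an inlier by the model). The function name, the rule bounds and the outlier are hypothetical.

```python
def counterfactual_changes(rule, point):
    """Smallest per-feature shift that moves the point inside the hypercube rule."""
    changes = {}
    for feature, (low, high) in rule.items():
        value = point[feature]
        if value < low:
            changes[feature] = low - value      # needs to increase
        elif value > high:
            changes[feature] = high - value     # needs to decrease
    return changes


# Hypothetical P@1 inlier rule and an outlier to be explained against it.
inlier_rule = {"x": (1.0, 4.0), "y": (0.0, 2.0)}
outlier = {"x": 6.5, "y": 1.0}

changes = counterfactual_changes(inlier_rule, outlier)
repaired = {f: v + changes.get(f, 0.0) for f, v in outlier.items()}
```

The resulting explanation reads "this point is an outlier because x is 2.5 units above the inlier region". It is only trustworthy if the rule has 100% precision with respect to the model, which is the reason for requiring P@1 rules.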

XAI for OCSVM
Searching for the terms "XAI", "explainable" or "interpretable", together with "OC-SVM" or "OCSVM", within titles, abstracts and/or keywords in Scopus® yields only 4 results. One of them is the work of [26]. Here, the authors propose a model-specific method based on the idea that OCSVM models can be rewritten as pooling neural networks. Due to the asymmetry between inliers and outliers, they model outliers with a min-pooling over distances and inliers with a max-pooling over similarities. Once the OCSVM model has been turned into a neural network, they apply a deep Taylor decomposition (DTD) to obtain explanations in terms of input features. DTD serves as a framework to apply layer-wise relevance propagation (LRP) in order to obtain the contribution of the input features to a predicted output. The authors extend the explanations generated so that they can use either input features or support vectors.
In [27] the authors benchmark different unsupervised ML algorithms for anomaly detection (IsolationForests, OCSVM, Cluster Support Vector Data Description and One-Class decision Tree, OC-Tree), and analyse them over data from the medical domain. They indicate that OC-Tree provides the best results. OC-Tree has the advantage of being a hybrid method that combines a first step of kernel density estimation for anomaly detection with a decision tree that automatically provides rules explaining that first model. The benchmark of the models is performed in terms of predictive performance, and the authors mention that OC-Tree is also preferable for that use case since it directly provides explanations.
In [28] the authors use OCSVM and Variational Autoencoders for detecting engine faults in 2.4L diesel engines. The faults, which may belong to two types, are precisely the anomalies. For that, they use 130 feature parameters. Together with that, they include a post-hoc explainability layer by using LIME (thus explaining the models in terms of feature relevance).
[9] also shows the combination of OCSVM with XAI. For the XAI part, they use the Ripper algorithm for rule induction, feeding it with the information from the support vectors of the OCSVM. In the evaluations, they use three different datasets and measure the performance of the extracted rules in terms of Precision, Recall and F1 metrics over the ground truth of the real anomalies. They also train OCSVM models with an RBF kernel.
The previous analysis of the literature shows that even though there are some works regarding XAI and OCSVM, they are either focused on a particular data field, or they do not compare many rule extraction methods (for the previous literature that uses OCSVM with rule extraction). Due to that, there is still an open area regarding the benchmark of rule extraction techniques over OCSVM models for anomaly detection.

Metrics for XAI
Beyond indicating the importance of both detecting outliers and being able to explain how the decision took place, it is also crucial to quantify the quality of those explanations. There are some recent reviews in the literature that deal with the challenge of providing metrics in XAI, such as [29]. In that article, the authors analyse the literature and define a taxonomy of properties that should be considered in the individual explanations generated by XAI techniques. Even though the paper deals with quantifying the quality of the explanations for an individual datapoint, some of them are also applicable for global explanations.
• Accuracy: It is related to the usage of the explanations to predict the output using unseen data by the model.
• Fidelity: It refers to how well the explanations approximate the underlying model. The explanations will have high fidelity if their predictions are constantly similar to the ones obtained by the blackbox model. The authors mention how accuracy and fidelity are intertwined: If the explanations have high fidelity (thus, approximate the model well) and the model has high accuracy, the explanations will also have high accuracy. However, the explanations may have high accuracy (because they predict very well over unseen data) while having low fidelity (because they do not approximate well the original model).
• Consistency: It refers to the similarity of the explanations obtained over two different models trained over the same input dataset. High consistency appears when the explanations obtained from the two models are similar. However, a low consistency may not be a bad result since the models may be extracting different valid patterns from the same dataset due to the "Rashomon Effect" (seemingly contradictory information is in fact telling the same story from different perspectives).
• Stability: It measures how similar the explanations obtained are for similar datapoints. As opposed to consistency, stability measures the similarity of explanations using the same underlying model.
• Comprehensibility: This metric is related to how well a human will understand the explanation. Due to this, it is a very difficult metric to define mathematically, since it is affected by many subjective elements related to human perception (such as context, background, prior knowledge, etc.). However, there are some objective elements that can be considered in order to measure "comprehensibility", such as whether the explanations are based on the original features (or based on synthetic ones generated after them), the length of the explanations (how many features they include), or the number of explanations generated (i.e. in the case of global explanations). In general terms, using the original features, while keeping the number of explanations generated and the features used to a minimum, will increase comprehensibility.
• Certainty: It refers to whether the explanations include the certainty of the model about the prediction or not (i.e. a metric score).
• Importance: Some XAI methods that use features for their explanations include a weight associated with the relative importance of each of those features.
• Novelty: Some explanations may include whether the datapoint to be explained comes from a region of the feature space that is far away from the distribution of the training data. This is something important to consider in many cases, since the explanation may not be reliable due to the fact that the datapoint to be explained is very different from the ones used to generate the explanations.
• Representativeness: It measures how many instances are covered by the explanation. Explanations can go from explaining a whole model (i.e. weights in linear regression) to only being able to explain one datapoint.
Considering the case of rule extraction techniques, the outputs (rules) for the whole dataset can be analyzed from the perspective of global explanations. In this context, one additional aspect to consider is diversity, a metric that indicates whether the explanations are redundant or repetitive and can already be mostly covered by another explanation, or if they provide insights that are not deducible from the other explanations available.
From among all these metrics, [13] already commented on the importance of comprehensibility, accuracy and fidelity for rule extraction techniques that explain an SVM model. The metrics are defined as:
$$\text{Accuracy} = \frac{\text{No. of instances classified correctly by the rules}}{\text{Length of test set}}$$
$$\text{Fidelity} = \frac{\text{No. of instances where the rule predictions match the model predictions}}{\text{Length of test set}}$$
For "comprehensibility", the metrics are the No. of rules and the No. of antecedents (analogous to rule size).
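The two ratios above can be illustrated with a minimal sketch. It assumes a rule is a dictionary mapping each feature to a closed interval, that a point is an inlier (+1) when at least one rule covers it, and that outliers are labelled -1; the representation, the helper names and the toy data are illustrative assumptions, not the definitions used in [13].

```python
def rule_predict(rules, point):
    """Return 1 (inlier) if any rule covers the point, else -1 (outlier)."""
    for rule in rules:
        if all(low <= point[name] <= high for name, (low, high) in rule.items()):
            return 1
    return -1


def agreement(rules, points, labels):
    """Fraction of points whose rule prediction equals the given label."""
    hits = sum(rule_predict(rules, p) == y for p, y in zip(points, labels))
    return hits / len(points)


# Hypothetical toy data: one rule and four test points.
rules = [{"x": (0.0, 5.0), "y": (0.0, 5.0)}]
test_points = [{"x": 1, "y": 1}, {"x": 4, "y": 6}, {"x": 9, "y": 9}, {"x": 2, "y": 3}]
y_true = [1, 1, -1, 1]    # ground truth, assumed available for this example
y_model = [1, -1, -1, 1]  # predictions of the black-box model

accuracy = agreement(rules, test_points, y_true)   # rules vs. ground truth
fidelity = agreement(rules, test_points, y_model)  # rules vs. model
print(accuracy, fidelity)
```

In this toy case the rules reproduce every model prediction, so fidelity is perfect, while accuracy is limited by the one point that the model itself misclassifies.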
[30] analyses XAI metrics for Random Forests by using an Interpretability Matrix that shows the relationship between Rule Coverage, Rule Certainty and Feature Relevance. [31] presents a model-agnostic comparison of rule extraction algorithms using C4.5Rule-PANE, REFNE, RxREN and TREPAN. For that, they use 8 datasets of up to 8124 total instances and 40 features. As blackbox models they use Neural Network models for classification (with different configurations). Finally, they propose several metrics for measuring the quality of the explanations.
• Completeness: Percentage of input instances covered by rules over total input instances. Analogous to "Representativeness".
• Correctness: Percentage of input instances correctly classified by rules over total input instances. Analogous to "Accuracy".
• Fidelity: Percentage of input instances on which the predictions of model and rules agree over total instances.
• Robustness: Applying small perturbations over the datapoints that do not change the prediction of the model, the sum of differences between the original prediction and the new prediction, divided by the number of instances analyzed. It is analogous to the concept of "Stability". Robustness is further analysed in [32], where the authors evaluate it for feature relevance model-agnostic post-hoc XAI techniques (LIME and SHAP).
• Number of rules and Average rule length, similar to [13].
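As a rough illustration of two of the metrics in the list above, the sketch below computes completeness and one possible reading of robustness. The interval-based rule format, the fixed shift `delta` used as perturbation and the stand-in `model` function are assumptions made for the example; [31] may define the perturbation differently.

```python
def covers(rule, point):
    """True if the point satisfies every interval of the rule."""
    return all(low <= point[name] <= high for name, (low, high) in rule.items())


def rule_predict(rules, point):
    """1 if at least one rule covers the point, otherwise 0."""
    return 1 if any(covers(r, point) for r in rules) else 0


def completeness(rules, points):
    """Share of input instances covered by at least one rule."""
    return sum(rule_predict(rules, p) for p in points) / len(points)


def robustness(rules, model, points, delta):
    """Mean change in the rule prediction after shifting every feature by delta.

    Only perturbations that leave the model prediction unchanged are analysed.
    """
    total, analysed = 0, 0
    for p in points:
        shifted = {name: value + delta for name, value in p.items()}
        if model(shifted) != model(p):
            continue  # the perturbation changed the model output, so skip it
        total += abs(rule_predict(rules, p) - rule_predict(rules, shifted))
        analysed += 1
    return total / analysed if analysed else 0.0


# Hypothetical one-dimensional example.
rules = [{"x": (0.0, 5.0)}]
model = lambda p: 1 if p["x"] <= 6.0 else 0  # stand-in for the black-box model
points = [{"x": 1.0}, {"x": 4.8}, {"x": 5.5}, {"x": 8.0}]

print(completeness(rules, points))
print(robustness(rules, model, points, delta=0.5))
```

Under this reading a lower robustness value is better: it means that fewer rule predictions flip under perturbations that the model itself treats as irrelevant.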
They apply these metrics and find, using Friedman's test, that C4.5Rule-PANE has significantly superior results over all of the datasets considering all of the metrics, followed by TREPAN. The remaining papers yielded by the aforementioned query either do not deal with metrics for rule extraction techniques, or only focus on the "Accuracy" aspect.
For our research regarding rule extraction, we will focus on analysing the degree of "comprehensibility" of the rules, the coverage by those rules of the datapoints available ("representativeness"), whether the rules approximate the underlying model ("stability"), and whether they overlap with each other and are redundant ("diversity"). The advantage of these metrics for unsupervised ML is that they do not need any ground truth information about the "correct" output that the model should have, as opposed to other metrics like "accuracy". This is interesting because many times that ground truth is not available. This is applicable to our use case, since we want to detect anomalies over real industry datasets belonging to Telefónica where there is no prior information about the anomalies.
The challenge here is defining how to quantify the metrics of stability and diversity (since comprehensibility and representativeness for rule extraction are already defined in the previous SOTA). Stability was also defined previously, but it is important to minimize the number of data points to consider in order to generalize the metric for large datasets (where generating perturbations for every input datapoint would be computationally expensive). This is especially important when working with unsupervised ML models, where many times there is no reduced test set available, and the evaluations need to be done using only the training data. Also, when working with P@1 rules, where the fidelity to the original model will always be perfect, it is important to measure the stability in order to see how the explanations behave with unseen data (even more so when there is no test set available). Due to that, we will propose and use algorithms to quantify them in a rule extraction scenario applied to unsupervised ML for anomaly detection.

The psychology of explanations
In ML, explanations are the "key" to open blackbox algorithms, and are therefore likely to play an important role in possible future AI regulations. It is expected that in certain domains or applications, whenever an algorithm takes an autonomous decision or provides a recommendation that has a significant impact on people's lives, some kind of explanation is required [33]. In this paper, we have quantified several quality parameters of explanations generated by different methods for non-supervised learning algorithms, including the comprehensibility of the generated explanations. But for explanations to be comprehensible and effective for people in different situations, we also need to consider the consumer of the explanations: the people.
How do people understand explanations? In psychology, explanations are seen as crucial for human knowledge and learning, as they are considered proofs of understanding. According to Wilkinson [34], there are three kinds of views of explanations: the formal-logical view (an explanation is like a deductive proof given some propositions), the ontological view (events -state of affairs-explain other events), and the pragmatic view (an explanation needs to be understandable by the "demander"). Explanations that are sound from a formal-logical or ontological view, but leave the demander in the dark, are not considered good explanations. For example, a very long chain of logical steps or events (e.g. hundreds) without any additional structure can hardly be considered a good explanation for a person, simply because he or she will lose track.
Wilkinson introduces two more concepts to define the adequacy of explanations for demanders. The level of explanation refers to whether the explanation is given at a high level or at a more detailed level. The right level depends on the knowledge and the need of the demander: he or she may be satisfied with some parts of the explanation happening at the higher level, while other parts need to be at a more detailed level. The kind of explanation refers to notions like causal explanations and mechanistic explanations. Causal explanations provide the causal relationship between events but without explaining how they come about (a kind of "why" question). For instance, smoking causes cancer. A mechanistic explanation would explain the mechanism whereby smoking causes cancer (a kind of "how" question). Causal explanations can be further divided into common-cause explanations (a single cause has several consequences), common-effect explanations (several causes converge to one consequence), and simple linear chain explanations (one cause leads to one consequence) [35].
As said, a satisfactory explanation does not exist by itself, but depends on the demander's need. In the context of ML algorithms, we can distinguish between several typical demanders of explainable algorithms [1]:
• Domain experts: those are the "professional" users of the model, such as medical doctors who have a need to understand the workings of the model before they can accept and use the model.
• Regulators, external and internal auditors: like the domain experts, those demanders need to understand the workings of the model in order to certify its compliance with company policies or existing laws and regulations.
• Practitioners: professionals that use the model in the field, where they take the users' input, apply the model, and subsequently communicate the result to the users, in situations such as loan applications.
• Redress authorities: the designated competent authority to verify that an algorithmic decision for a specific case is compliant with the existing laws and regulations.
• Users: people to whom the algorithms are applied and that need an explanation of the result.
• Data scientists, developers: technical people who develop or reuse the models and need to understand the inner workings in detail.
In summary, for explainable AI to be effective, the final consumers (people) of the explanations need to be duly considered when designing XAI systems.

Our Proposal
We first describe the intuition behind our rule extraction approach from an OCSVM model for anomaly detection. Then, we describe in detail the algorithm implementation.

Algorithm Intuition
We propose using rule extraction techniques within OCSVM models for anomaly detection, by generating hypercubes that encapsulate the non-anomalous data points, and using their vertices as rules that explain when a data point is considered non-anomalous. As already mentioned in the introduction, [10] proposes an algorithm to extract rules from an SVM model by performing clustering over the datapoints that belong to one of the classes. The clustered datapoints are used to obtain a geometric surface that encloses the rest of the datapoints. There are two ways to accomplish this: building hypercubes or building hyperspheres. This paper focuses on the first approach: building hypercubes. It also focuses on the model-agnostic variant, where the algorithm takes the furthermost datapoints of the cluster as vertices for the hypercube, so that they enclose the rest of the datapoints of that category (the model-specific alternative uses the support vectors). If the generated hypercube encloses points from the other category, then the number of clusters is increased, aiming to obtain smaller cubes that fit the data without including points from the other class. This is done iteratively until no points from the other class are inside the hypercubes, or a predefined maximum number of iterations is reached. During the process, if a hypercube does not contain points from the other class, then that hypercube is translated into a rule, and its datapoints are removed from the following iteration steps. Images 4 and 5 show an example application of this algorithm for a 2D space. Image 4 shows the initial scenario, where the first step in the iteration process consists in applying one cluster over the datapoints of one of the classes (the blue ones). However, with one cluster, the 2D square that encloses the datapoints contains points from the other class, so more clusters need to be applied.
As Image 5 shows, iteration 3 (with 3 clusters) is the first one with squares without red points, so those subspaces are turned into rules and the points inside them are removed from the iteration process, which starts again with one cluster for the remaining datapoints. Iteration 6 is the last one, and 5 rules have been extracted up to that point.
[Figure caption, partially recovered: ... [10], the number of clusters keeps increasing until no points from the other class are inside, and then that hypercube is translated into a rule.]
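The core check of this intuition (enclose a group of inliers in an axis-aligned hypercube, test whether any outlier falls inside, and read the hypercube as a rule) can be sketched as follows. The helper names and the toy points are hypothetical, and the sketch uses the per-feature minimum and maximum of the group as a simplification of the vertex selection described above.

```python
def bounding_box(points):
    """Per-feature (min, max) interval enclosing all the given points."""
    n_dims = len(points[0])
    return [(min(p[d] for p in points), max(p[d] for p in points)) for d in range(n_dims)]


def inside(box, point):
    """True if the point lies within every interval of the box."""
    return all(low <= value <= high for (low, high), value in zip(box, point))


def box_to_rule(box, names):
    """Render the box as a human-readable conjunction of conditions."""
    return " AND ".join(f"{low} <= {name} <= {high}" for name, (low, high) in zip(names, box))


# Hypothetical 2D example: three inliers and two outliers.
inliers = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
outliers = [(2.5, 2.5), (8.0, 8.0)]

box = bounding_box(inliers)
trapped = [o for o in outliers if inside(box, o)]
rule_text = box_to_rule(box, ["X", "Y"])

print(rule_text)
print("outliers inside the box:", trapped)
```

Here one outlier is trapped inside the single square, which is the situation where the procedure would move on to a larger number of clusters instead of accepting the rule.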
The approach described above is not the only one that can be applied in order to extract the rules. Image 6 shows one of our alternative proposals over the method in [10]. Instead of removing the datapoints that are inside a rule without points from the other class, the process always keeps all datapoints in every iteration, since there could be clustering patterns that can only be found if all points are together. In this approach, the number of clusters is constantly increased until no datapoints from the other class are inside the hypercubes, or the maximum number of iterations is reached. We will refer to this method as "keep" in the remainder of the paper. In contrast, the method in [10] will be referred to as "keep_reset".
Figure 6: Keeping all datapoints in every iteration could lead to a reduced number of clusters since there may be data patterns that could only be found in this scenario.
Another proposal that we include in this paper over [10] is splitting the subspaces in a binary partition scheme. This is an alternative to the original proposal, which constantly increases the number of clusters until one rule has only datapoints from the same class, and then restarts the clustering process from the beginning for the remaining ones. We will refer to this method as "split" for the remainder of the paper. Image 7 shows the same 2D example using this approach.
According to the taxonomy for XAI in [5], our method has the following characteristics:
• Post-hoc: Explainability is achieved using external techniques.
• Global and individual: Explanations serve to explain how the whole model works, as well as why a specific data point is considered anomalous or non-anomalous.
• Model-agnostic: As with other techniques for global explanations [5], the only information needed to build the explanations are the input features and the outcomes of the system after fitting the model.
• Counterfactual: The explanations for why a data point is anomalous also include information on the changes that should take place in the feature values in order to consider that data point as non-anomalous.
Since the explanation algorithm is model-agnostic, it can work for any blackbox model. The only information needed is the train dataset and the outputs from the model. To illustrate it, this paper will show evaluations over OCSVM models with different kernels: radial basis function (RBF) and linear kernel.
Regarding the clustering technique itself, potentially any algorithm could be used, both for [10] and for either of our two proposals over it from this paper. However, there is a caveat that should be considered. The clustering algorithm needs to take into account whether the features are only numerical, only categorical (non ordinal), or both.
One algorithm that will be used in this paper for extracting the hypercubes is K-Means++ [36]. However, the standard version of this clustering algorithm is designed for numerical features, and categorical ones should be treated differently. In that case, the approach would be to extract a rule for each of the possible combinations of categorical values among the data points that are not considered anomalous. Considering again the aforementioned 2-dimensional example, with variable X being binary categorical, a dataset may look like the one in Figure 8. In that case, two rules would be extracted, one for each of the possible states of X.
Generally speaking, the algorithm logic can be summarised as:
• Apply OCSVM to the dataset to create the model.
• Depending on the characteristics of the variables, do:
- Case 1. Numerical only: Iteratively create clusters in the non-anomalous data (starting with one cluster) and create a hypercube using the centroid and the points furthest away from it. Check whether the hypercube contains any data point from the anomalous group; if it does, repeat using one more cluster than before. End when no anomalies are contained in the generated hypercubes. If there are anomalies and the data points in a cluster are fewer than the number of vertices needed for the hypercube, complete the missing vertices with artificial datapoints and end when there are no anomalies or when the convergence criterion is reached.
- Case 2. Categorical only: The rules will correspond directly to the different value states contained in the dataset of non-anomalous points.
- Case 3. Both numerical and categorical: This case would be analogous to Case 1, but data points will be filtered for each of the combinations of the categorical variable states. For each combination, there will be a set of rules for the numerical features.
• Use these vertices to obtain the boundaries of that hypercube and directly extract rules from them.
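Cases 2 and 3 of the summary above can be sketched in a few lines. The sketch assumes that the inliers are given as dictionaries and, as a simplification, builds a single interval per numerical feature for each combination of categorical values instead of running the iterative clustering of Case 1; `group_by_categories` and `mixed_rules` are hypothetical helper names.

```python
def group_by_categories(rows, cat_cols):
    """Group rows by their combination of categorical values."""
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in cat_cols)
        groups.setdefault(key, []).append(row)
    return groups


def mixed_rules(inliers, cat_cols, num_cols):
    """One rule per categorical combination found among the inliers.

    With an empty num_cols this reduces to Case 2 (categorical only).
    """
    rules = []
    for key, rows in group_by_categories(inliers, cat_cols).items():
        rule = dict(zip(cat_cols, key))
        for col in num_cols:
            values = [r[col] for r in rows]
            rule[col] = (min(values), max(values))  # simplification: one box per group
        rules.append(rule)
    return rules


# Hypothetical data: X is a binary categorical feature, Y is numerical.
inliers = [
    {"X": 0, "Y": 1.0},
    {"X": 0, "Y": 2.0},
    {"X": 1, "Y": 5.0},
    {"X": 1, "Y": 7.0},
]

rules = mixed_rules(inliers, cat_cols=["X"], num_cols=["Y"])
print(rules)
```

As in the binary example discussed above, two rules come out, one for each state of X, each carrying its own interval for Y.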
Besides K-Means++, there are other clustering algorithms that could be applied. In this paper we will also analyse the rules obtained by applying K-Prototypes [37]. The advantage of using K-Prototypes is that it can work directly with both categorical and numerical features.

Algorithm Description
Algorithm 1 contains the proposal for rule extraction for an OCSVM model that may be applied over a dataset with either categorical or numerical variables (or both). ocsvm_rule_extract is the main function of the algorithm. Regarding input parameters, X is the input data frame with the features, d_f is a dictionary with two lists (l_n, a list with the numerical columns, and l_c, a list with the categorical columns), and d_p is a dictionary with the hyperparameters for OCSVM (the kernel type; ν, an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors; and the kernel coefficient, γ). This function starts with the feature scaling of the numerical features (function featureScaling), followed by the encoding of the categorical ones (function featureEncoding). After that, it fits an OCSVM model with all the data available and detects the anomalies within it, generating two datasets, X_y with the anomalous data points and X_n with the rest (function filterAnomalies).
The next step is checking the type of features available. If all the features are categorical, then the rules for non-anomalous data points will simply be the unique combinations of values for them. If there are both categorical and numerical features, the algorithm obtains the hypercubes (as mentioned for numerical features only) for the subset of data points associated with each combination of categorical values. Function getR() calls different subfunctions depending on the value of the t parameter, but in all cases the approach is similar: clustering non-anomalous data points in a set of hypercubes that do not contain any anomalous data points.
Algorithm 1
    [...]
    distances ← model.decisionFunction(X)
    X_y, X_n ← filterAnomalies(X, preds)
    if len(l_c) = 0 then
        rules ← getR(X_n, X_y, X, d_f, m, t)
    else if len(l_n) = 0 then
        rules ← getUnique(X_n, l_c)
    else
        cat ← getUnique(X_n, l_c)
        rules ← empty list
        for c ∈ cat do
            X_nf, X_yf ← filterCat(X_n, X_y, c)
            rules.append(getR(X_nf, X_yf, d_f, m, t))
        end for
    end if
    rules ← featureUnscaling(rules, l_n)
    rules ← pruneRules(rules, d_f)
    return rules
end procedure
The "keep" approach, described in Algorithm 2, iteratively increases the number of clusters (hypercubes) until there are no anomalous points within any hypercube. The function outPosition checks whether the rules defined based on the vertices of the hypercube do not include any data point from the anomalous subset, X_y. getRulesKeep then calls function getVertex (described in Algorithm 4) with a specific number of clusters, n_cl. This function performs the clustering over the non-anomalous data points, X_n, using the function getClusters, which returns the label of the cluster for each data point, as well as the centroid position for each cluster, using the specified clustering algorithm.
If the algorithm is K-Prototypes, then it considers both categorical and numerical features (using the getKP function). If it is K-Means++, then it applies the clustering over numerical features only (using the getKM function).
Then, it iterates through each cluster and obtains the subset of data points for that cluster, X_nc, with the function insideCluster. After that, if there are enough data points in that cluster (more data points than the vertices of the hypercube), it computes the distance of each of them to the centroid with getDist and uses the furthest n_v datapoints for obtaining the vertices that enclose the cluster using the getVertex function. n_v is a value that represents the hyperspace dimensionality, and is obtained with the hyperDimension function. If there are fewer datapoints than the number of vertices that a hypercube of that dimensionality has, then all of them are used for obtaining the vertices. This last scenario does not stop the iterations, since a hypercube in this situation could still include outliers, needing further splitting. As long as there are no outliers inside the rules, they are stored in the rules list. However, as soon as there is one rule with outliers inside, the whole process is repeated with one more cluster. This keeps taking place until no outliers are inside the rules or the maximum number of iterations is reached.
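The iteration just described can be sketched end to end for numerical features. This is a simplified stand-in under stated assumptions, not the paper's implementation: the clustering is a plain Lloyd iteration with a deterministic farthest-point start (not K-Means++ or K-Prototypes), each hypercube is the per-feature minimum and maximum of its cluster rather than the furthest n_v points, and `kmeans_labels` and `get_rules_keep` are hypothetical names.

```python
import math


def kmeans_labels(points, k, iters=20):
    """Cluster labels from Lloyd's algorithm with farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest current centroid.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centroids[j]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # empty clusters keep their previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return [nearest(p) for p in points]


def bounding_box(points):
    return [(min(axis), max(axis)) for axis in zip(*points)]


def inside(box, point):
    return all(low <= value <= high for (low, high), value in zip(box, point))


def get_rules_keep(inliers, outliers, max_iter=10):
    """Increase the number of clusters until no box contains an outlier."""
    clean = []
    for n_cl in range(1, max_iter + 1):
        labels = kmeans_labels(inliers, n_cl)
        boxes = []
        for j in range(n_cl):
            members = [p for p, lab in zip(inliers, labels) if lab == j]
            if members:
                boxes.append(bounding_box(members))
        clean = [b for b in boxes if not any(inside(b, o) for o in outliers)]
        if len(clean) == len(boxes):
            break  # every hypercube is free of outliers
    return clean


# Hypothetical data: two groups of inliers with one outlier between them.
inliers = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0),
           (5.0, 5.0), (6.0, 6.0), (5.0, 6.0), (6.0, 5.0)]
outliers = [(3.0, 3.0)]

rules = get_rules_keep(inliers, outliers)
print(rules)
```

With one cluster the single square swallows the outlier, so the loop moves on to two clusters, where both squares are clean and the iteration stops.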
Algorithm 2 Rule Extraction - Keeping all datapoints
procedure GETRULESKEEP(X_n, X_y, m, d_f)
    max_iter ← reference value
    [...]
    check ← True
    n_cl ← 0
    while check do
        rules ← empty list
        if n_cl > max_iter then
            check ← False
        else
            n_cl ← n_cl + 1
            vInfo ← getVertex(X_n, X, d_f, m, n_cl)
            for iterValue ∈ vInfo do
                rules_cluster ← iterValue[0]
                X_nc ← iterValue[1]
                l_y ← outPosition(rules_cluster, X_y)
                if len(l_y) = 0 then
                    rules.append(rules_cluster)
                [...]
    return rules
end procedure
The "split" approach is defined in Algorithm 3. This function is similar to Algorithm 2, with the following differences. Instead of increasing the number of clusters in every iteration, n_cl is always 2. Also, l_sub receives the data after every split. Initially, l_sub contains only one dataset, the inliers X_n. However, after each iteration, its value is set to the data from the clusters in which the rules did contain some outlier.
In any of the three methods, after obtaining the rules, function featureUnscaling is used to express the rules in their original values (not the scaled ones used for the ML models). Function pruneRules then checks whether there are rules that may be included inside others; that is, for each rule it checks whether there is another one with a bigger scope that includes it as a subset case.
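The subset check performed by pruneRules can be sketched as follows, assuming that every rule is a dictionary with the same numerical features, each mapped to a closed interval. The function names are hypothetical, and keeping the first of two identical rules is a design choice of this sketch.

```python
def is_subset(inner, outer):
    """True if every interval of `inner` lies within the same interval of `outer`."""
    return all(outer[name][0] <= low and high <= outer[name][1]
               for name, (low, high) in inner.items())


def prune_rules(rules):
    """Drop rules that are fully contained in another rule."""
    kept = []
    for i, rule in enumerate(rules):
        dominated = any(
            j != i
            and is_subset(rule, other)
            # For identical rules, only the later copy is dropped.
            and (not is_subset(other, rule) or j < i)
            for j, other in enumerate(rules)
        )
        if not dominated:
            kept.append(rule)
    return kept


# Hypothetical rules: the second one is enclosed by the first one.
rules = [
    {"x": (0, 10), "y": (0, 10)},
    {"x": (2, 4), "y": (3, 5)},
    {"x": (8, 12), "y": (0, 1)},
]

pruned = prune_rules(rules)
print(pruned)
```

The third rule survives because it extends beyond the first one on x, even though the two overlap.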

Influence of the kernel
As mentioned before, OCSVM models are configured using mainly three hyperparameters: ν, γ and the kernel type. Depending on the kernel type, the construction of the decision frontier to differentiate between outliers and inliers changes. In particular, the Radial Basis Function (RBF) kernel will find hyperspheres (one or more) that enclose the inliers, leaving outliers outside.
The different density of outliers versus inliers highlights that there may be differences in the rules depending on which class they enclose. Mainly, since the decision function is a hypersphere, the intuition is that it will be easier to find rules that enclose all the inliers. Figure 9 illustrates this idea.

Algorithms for metrics
Algorithm 4 Additional functions
procedure GETVERTEX(X_n, d_f, m, n_cl)
    n_v ← hyperDimension(X_n, d_f)
    [...]
    d_bounds ← empty list
    d_points ← empty list
    if m = kprototypes then
        labels, centroids ← getKP(X_n, l_n, l_c, n_cl)
    else
        labels, centroids ← getKM(X_n, l_n, n_cl)
    end if
    for c ∈ n_cl do
        X_nc ← insideCluster(labels, X_n)
        if len(X_nc) > n_v then
        [...]
    return d_bounds, d_points
end procedure
As mentioned before, the metrics considered in this paper are divided into four subsets: comprehensibility, representativeness, stability and diversity. Since, to the best of our knowledge, some of these metrics are not implemented within the main XAI frameworks, we propose within this paper a set of algorithms to compute them in a rule extraction scenario. We apply the metrics for the case of unsupervised anomaly detection using OCSVM models, but they could be applied to any model that has a binary output and that is explained through rule-extraction techniques.
• Metrics for representativeness: Percentage of datapoints explained with P@1 rules (per_p1) and the median percentage coverage of datapoints by each rule (p1_coverage).
• Metrics for stability: How many artificial points (similar to a subset of prototypes from the dataset) are classified by the rules with the same predictions yielded by original blackbox model (precision_vs_model).
• Metrics for diversity: Degree of hyperspace overlapping between all the rules (score_intersect).
Comprehensibility: The metrics for "comprehensibility" are directly analyzed from the rules themselves; n_rules is computed by counting the number of rules generated, and size_rules is computed by counting the elements that define the rule (i.e. X > 3 AND X < 7 AND Y > 1 has size_rules = 3, while X > 3 has size_rules = 1). This proposal already appears in [13].
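Both counts can be read directly from a list of rules. The sketch below uses the two rules of the example above and assumes that each rule is stored as a list of (feature, operator, value) conditions, which is only one of several possible representations.

```python
# Each rule is a list of (feature, operator, value) conditions.
rules = [
    [("X", ">", 3), ("X", "<", 7), ("Y", ">", 1)],  # X > 3 AND X < 7 AND Y > 1
    [("X", ">", 3)],                                # X > 3
]

n_rules = len(rules)                          # number of rules generated
size_rules = [len(rule) for rule in rules]    # conditions per rule

print(n_rules, size_rules)
```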
Representativeness: The metric per_p1 for "representativeness" simply checks the percentage of datapoints of the target class explained with P@1 rules. The other metric in this group is p1_coverage. It checks the median performance of the rules themselves: it computes the median percentage of coverage of the target class by each rule. These proposals are similar to [31], with the particularity of focusing on P@1 rules.
Figure 9: With an RBF kernel, the correct hypercube will be the one that encloses the points that are not anomalies, since the OCSVM algorithm will try to enclose most of the points inside the decision frontier and leave anomalies outside.
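A minimal sketch of per_p1 and p1_coverage follows. It assumes that a P@1 rule is a rule that covers no datapoint of the other class, and that rules are dictionaries of closed intervals; both the helper names and the one-dimensional toy data are hypothetical.

```python
from statistics import median


def covers(rule, point):
    return all(low <= point[name] <= high for name, (low, high) in rule.items())


def p1_rules(rules, others):
    """Rules with precision 1: no point of the other class falls inside."""
    return [r for r in rules if not any(covers(r, o) for o in others)]


def per_p1(rules, targets, others):
    """Percentage of target-class points covered by at least one P@1 rule."""
    good = p1_rules(rules, others)
    hit = sum(any(covers(r, t) for r in good) for t in targets)
    return 100 * hit / len(targets)


def p1_coverage(rules, targets, others):
    """Median percentage of target-class points covered by each P@1 rule."""
    good = p1_rules(rules, others)
    if not good:
        return 0.0
    return median(100 * sum(covers(r, t) for t in targets) / len(targets) for r in good)


# Hypothetical one-dimensional example: eight inliers and one outlier.
targets = [{"x": v} for v in (1, 2, 3, 4, 6, 7, 9, 9.5)]
others = [{"x": 5}]
rules = [{"x": (0, 4)}, {"x": (6, 7)}, {"x": (4.2, 10)}]  # the last one contains the outlier

print(per_p1(rules, targets, others))
print(p1_coverage(rules, targets, others))
```

The third rule is ignored because it encloses the outlier, so the two points that only it covers count as unexplained.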

Stability:
The metric precision_vs_model computes the "stability" of the hypercubes. The first step is obtaining the prototypes from the dataset and generating random samples near them. The next step is obtaining the prediction of the original model for those artificial samples and checking whether the predictions using the rules are the same. The steps for this metric are described below, and the detailed pseudocode appears in Algorithm 5.
Model agreement:
• Choose N prototypes that represent the original hyperspace of data, using the Protodash algorithm [38].
• Generate M samples close to each of those N prototypes; the hypothesis is that close points should generally be predicted as belonging to the same class.
• For each of those N*M datapoints (M datapoints for each of the N prototypes), check whether the rules (all of them) predict it as inlier or outlier. The datapoints that come into the function are either outliers or inliers. If they are inliers, then the rules identify an artificial datapoint (of those M*N) as inlier if it is inside at least one rule. If the datapoints are outliers, it is the same reversed: a datapoint is an inlier if no rule includes it.
• Check whether the predictions using the rules for those artificial datapoints are the same as the ones provided by the original model.
• With that, compute the percentage of predictions for the aforementioned artificial datapoints that are the same between the rules and the original OCSVM model.
Algorithm 5 receives the dataset X of inliers/outliers (depending on whether the rules are computed for inliers or outliers), the rules X_r and the fitted OCSVM model clf. It then obtains the prototypes with the ProtodashExplainer() function and generates the random samples X_s near them with randomNear(), where upper and lower limits (th_s, th_l) can be defined for how close those points are to the prototypes. Then, it checks which rules enclose each datapoint with checkInR(), and if at least one of them encloses the datapoint, it is considered that it can be classified using the rules. The metric precision_vs_model is stored in the n_precision variable, which holds the percentage of agreement between the classifications using the rules and the ones from the model, obtained through the checkInModel() function.
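The agreement computation can be sketched without any external library under a few assumptions: the prototypes are passed in directly instead of being selected with Protodash, the black-box model is a stand-in function rather than a fitted OCSVM, the rules describe inliers, and `random_near` is a hypothetical helper that offsets each feature by a magnitude between th_l and th_s.

```python
import random


def inside(box, point):
    return all(low <= value <= high for (low, high), value in zip(box, point))


def rule_predict(rules, point):
    """Rules describe inliers: 1 if some rule encloses the point, else -1."""
    return 1 if any(inside(box, point) for box in rules) else -1


def random_near(proto, th_l, th_s, m, rng):
    """m samples whose per-feature offset magnitude lies in [th_l, th_s]."""
    return [
        tuple(v + rng.choice((-1, 1)) * rng.uniform(th_l, th_s) for v in proto)
        for _ in range(m)
    ]


def precision_vs_model(prototypes, rules, model_predict, m=25, th_l=0.0, th_s=0.1, seed=0):
    """Share of artificial points where rules and model give the same label."""
    rng = random.Random(seed)
    samples = [s for p in prototypes for s in random_near(p, th_l, th_s, m, rng)]
    agree = sum(rule_predict(rules, s) == model_predict(s) for s in samples)
    return agree / len(samples)


def model_predict(point):
    """Stand-in model: inliers lie within a circle of radius 2."""
    return 1 if point[0] ** 2 + point[1] ** 2 <= 4.0 else -1


rules = [[(-1.0, 1.0), (-1.0, 1.0)]]     # one square inside the circle
prototypes = [(0.0, 0.0), (1.3, 0.0)]    # assumed to be given

score = precision_vs_model(prototypes, rules, model_predict)
print(score)
```

Samples around the first prototype fall inside both the square and the circle, while samples around the second one are inliers for the model but lie outside the rule, so the rules agree with the model on half of the artificial points.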
Algorithm 5 Stability
1: procedure GETAGREEMENT(X, X_r, clf)
2: X_p ← ProtodashExplainer(X)
3: for p ∈ X_p do
5: X_s ← X_s.append(randomNear(p, th_l, th_s))
n_precision ← n_precision/len(X_s)
21: return n_precision
22: end procedure
Diversity: The metric used to measure "diversity" is score_intersect, which analyses whether the rules are different from each other, with few overlapping concepts. It is computed by checking how much of the area of each rule's hypercube overlaps with another one. To check this, we look at the 2D planes of each hypercube (keeping two degrees of freedom for the features in the hyperplane coordinates; n-2 features are kept fixed and the other two are varied between their max/min values in order to obtain the vertices of that 2D plane). Then, the area of the 2D planes of overlapping rules is obtained, and each of those 2D areas is turned into a score between 0 and 1 with the Jaccard similarity index, that is, by dividing the area of intersection of the 2D planes by their area of union.
The pseudocode for this metric appears in Algorithm 6. Algorithm 6 receives the dataset X of inliers or outliers (depending on whether the rules are computed for inliers or outliers), the rules X_r, the list of columns for numerical features l_n and the list for categorical features l_c. The first step is to obtain all the two-element combinations of numerical features, using the combinations() function. After that, it obtains the combinations of categorical values with the unique() function. The algorithm then analyses separately the rules that belong to each combination of categorical values. For each of those subsets of rules X_r_i, if there are at least two rules, it defines the tuples of possible rule combinations, combR.
Then, it iterates over each combination of two numerical features. These are the two features that will be varied, leaving the rest of the features (l_fix) fixed, in order to extract 2D planes from the hypercubes with get2D; those planes are stored in the polys variable. The planes are used to obtain the Jaccard similarity index with the scorePolys() function. If in some iteration one of the two dimensions has the same value at both ends (detected with checkEqual(pair_f)), the iteration is skipped, since the area will be 0.
Figure 10 describes the process for an example in a 3D space. Since every rule translates into a hypercube, we can choose two features at a time (leaving the rest fixed) and obtain the coordinates of those 2D planes (using the values of their vertices). Then, for two rules, we can obtain the area of overlap between those 2D planes, as well as their area of union. With those areas, we obtain the Jaccard similarity index. Since the Jaccard similarity index (score_i) yields a value between 0 and 1 (0 when there is no overlap, and 1 when the area of intersection equals the area of union, that is, a total overlap), we can turn it into a score by computing 1 − score_i, so that a perfect score corresponds to no overlap between the rules. This is repeated for all 2D planes of the hypercubes, and we compute the mean of all the individual scores in order to have one final metric (final_score) that is still between 0 and 1, with 1 the perfect score and 0 the worst.
Figure 10: The overlapping between rules (hypercubes) approximated using their 2D planes' area of intersection.
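The computation just described can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the authors' code: it handles numerical features only (no grouping by categorical values), represents each rule as a dictionary from feature name to (low, high) bounds, and returns 1.0 when no valid pair of planes exists. The function names are hypothetical.

```python
from itertools import combinations


def rect_area(rect):
    """Area of an axis-aligned rectangle ((x_low, x_high), (y_low, y_high))."""
    (x0, x1), (y0, y1) = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def jaccard_2d(a, b):
    """Intersection area divided by union area of two axis-aligned rectangles."""
    (ax0, ax1), (ay0, ay1) = a
    (bx0, bx1), (by0, by1) = b
    width = min(ax1, bx1) - max(ax0, bx0)
    height = min(ay1, by1) - max(ay0, by0)
    inter = max(0.0, width) * max(0.0, height)
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union > 0 else 0.0


def diversity_score(rules, features):
    """Mean of (1 - Jaccard) over all rule pairs and all feature pairs."""
    scores = []
    for r1, r2 in combinations(rules, 2):
        for f1, f2 in combinations(features, 2):
            plane1 = (r1[f1], r1[f2])
            plane2 = (r2[f1], r2[f2])
            # Skip degenerate planes: a zero-width side gives zero area.
            if rect_area(plane1) == 0 or rect_area(plane2) == 0:
                continue
            scores.append(1.0 - jaccard_2d(plane1, plane2))
    # Assumption: with nothing to compare, report no overlap.
    return sum(scores) / len(scores) if scores else 1.0
```

Because the hypercubes are axis-aligned, each 2D plane is a rectangle, so the intersection area can be computed directly from the bounds without a general polygon library.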
All the algorithms that we have proposed for computing XAI metrics yield XAI-specific metrics: metrics that are specific to a particular type of XAI technique (in this case, rule extraction).

Pruning rules
Many of the rules obtained with the methods described above are suboptimal, since they can be enclosed in another, larger rule. In order to reduce the number of rules and remove redundancies, we apply a simple pruning technique before computing and evaluating the metrics. We check every generated hypercube to see whether its limits are inside any other rule. If they are, we eliminate that rule from the set of rules. We check this for every rule against every other rule in the set, and we repeat the check in a loop until no more rules are eliminated.
Algorithm 6 (fragment)
if len(X_r_i) > 2 then
9: combR ← combinations(X_r_i, 2)
10: for pair_f ∈ l_free do
11: if checkEqual(pair_f) then
final_score ← mean(score)
23: return final_score
24: end procedure
3.6. From local to global rules
Anchors [16] is one way of extracting rules for XAI. However, Anchors, as mentioned before, is a local method that explains one datapoint with rules. To be able to use it in the evaluation carried out in this paper, we need to turn Anchors into a global method. A simple way to do that is to extract rules for each datapoint of the input dataset and prune those rules in order to keep the most relevant ones (as done in Section 3.5). This way, the whole dataset is explained, and the results can be compared with the remaining algorithms.
Regarding computational cost, since obtaining Anchors rules for each datapoint is expensive, we propose using Protodash [38] for scenarios where the dataset is too large. In this case, Protodash selects the relevant prototypes from the dataset, and Anchors obtains the rules only for those points.
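The containment-based pruning that is applied to the extracted rules (both here and for the other methods) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: every rule is assumed to be a dictionary that bounds the same set of features with non-strict (low, high) limits, and the function names contains and prune are hypothetical.

```python
def contains(outer, inner):
    """True if every bound of `inner` lies within the bounds of `outer`."""
    return all(
        outer[f][0] <= inner[f][0] and inner[f][1] <= outer[f][1]
        for f in inner
    )


def prune(rules):
    """Repeatedly drop rules enclosed by another rule until none is removed."""
    rules = list(rules)
    changed = True
    while changed:
        changed = False
        for i, rule in enumerate(rules):
            enclosed = any(
                j != i and contains(other, rule)
                for j, other in enumerate(rules)
            )
            if enclosed:
                # Remove one rule per pass so that exact duplicates
                # do not eliminate each other.
                del rules[i]
                changed = True
                break
    return rules
```

Removing a single rule per pass and then restarting keeps one copy when two rules are identical, at the cost of some extra comparisons.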

Combining everything
A question arises at this point: which set of rules is better? One with better results in "comprehensibility", or one with better results in, for instance, "diversity"? When a trade-off is needed, which criterion should be prioritized? The answer depends heavily on the needs of the domain. However, in general terms, all the metrics can be combined into a single one that offers a unified view over them. This can be done with a metric of the form final_metric = f(C, R, S, D), with C representing the comprehensibility metrics, R the representativeness, S the stability and D the diversity. There is another aspect to consider when creating a function that encapsulates all the metrics. In general, lower values are better for the comprehensibility metrics (fewer rules, smaller rule size), since that improves comprehensibility. For the rest of the metrics, higher values are better. Thus, a simple way to compute the combined metric is to add the results for representativeness, stability and diversity (adjusting their relative importance with a set of weights) and subtract the comprehensibility results. Since the comprehensibility metrics are the only ones that are not in the range 0 to 1, we scale them before computing the combined metric so that all values are in the same range: the number of rules is divided by the number of inliers or outliers, and the rule size by a value based on the number of features.
With this, a higher final value will be better. This is expressed in Equation 4.
N is equal to 5 in this case since we are considering 5 metrics. The parameters α, β, γ and θ can be adjusted in order to weight the different metrics when some of them are more important than others. Our proposed method to compute a general metric is a very naive approach, and more sophisticated ways could be explored. However, it is important to highlight the need to analyse everything together for some use cases, since there are many XAI aspects to measure and this can make the comparison between XAI techniques difficult.
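To make the idea concrete, the following Python sketch shows one possible way of aggregating the metrics along the lines described above. It is only a reading of the textual description and should not be taken as Equation 4 itself: the choice of five input terms, the assignment of α to the comprehensibility terms and of β, γ and θ to representativeness, stability and diversity, and the division by the number of terms are all assumptions.

```python
def final_metric(n_rules_scaled, rule_size_scaled, representativeness,
                 stability, diversity,
                 alpha=1.0, beta=1.0, gamma=1.0, theta=1.0):
    """Weighted sum of the 'higher is better' metrics minus the scaled
    comprehensibility terms, averaged over the number of terms.

    All inputs are assumed to be already scaled to the range [0, 1].
    """
    n_terms = 5  # assumption: two comprehensibility terms plus R, S and D
    penalty = alpha * (n_rules_scaled + rule_size_scaled)
    reward = beta * representativeness + gamma * stability + theta * diversity
    return (reward - penalty) / n_terms
```

Under these assumptions, a rule set with very few, very short rules and perfect scores elsewhere approaches the maximum value, and increasing a weight such as theta makes the corresponding metric count more in the comparison.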

Evaluation
We apply our algorithms to different datasets (both public and from Telefónica's real data) to evaluate the following hypotheses:
• Hypothesis 1 (H1): The rule extraction method of [10] and our proposed variations, applied over OCSVM for anomaly detection using an RBF kernel, yield significantly fewer P@1 rules when explaining inliers than when explaining outliers or when using a linear kernel.
• Hypothesis 2 (H2): Our proposed variations over [10] yield similar results to [10] in terms of explainability for P@1 rules that explain the inliers of an OCSVM anomaly detection model, regardless of the kernel (considering linear and RBF).
• Hypothesis 3 (H3): The rule extraction method of [10] and our proposed variations yield better results in terms of explainability than other rule extraction techniques for P@1 rules that explain the inliers of an OCSVM anomaly detection model, regardless of the kernel (considering linear and RBF).
As mentioned in Section 3, explanations based on rule extraction for anomaly detection may help to see, from a counterfactual point of view, what would turn an outlier into an inlier by explaining the inlier class (for local explanations). To explain which feature values are normally associated with outlier datapoints (global explanations), the explanations target the outlier class instead. This is why hypothesis 1 checks whether the RBF kernel, by grouping datapoints inside its hypersphere, helps to explain them with fewer rules.
For the hypothesis checks, we consider the results yielded by the XAI rule extraction methods over different datasets (Section 4.2), together with the type of kernel used for the OCSVM and the type of datapoints explained (outliers or inliers). Thus, we have N datasets × 2 types of kernel × 2 types of datapoints. These results are used to perform a hypothesis test based on the Wilcoxon signed-rank test [39], since it has proven useful for comparing the metric results of different ML models over several datasets, for both classification [40] and regression tasks [41].
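For readers unfamiliar with the test, the sketch below shows how a two-sided Wilcoxon signed-rank test on paired metric values can be computed exactly when the number of pairs is small, as is the case when there are only a few dataset, kernel and datapoint-type combinations. It is a didactic illustration written for this purpose, not the procedure used to produce the results in this paper; the function names are hypothetical, and in practice a library routine such as scipy.stats.wilcoxon would normally be used.

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Return (statistic, exact two-sided p-value) for paired samples a, b.

    Zero differences are discarded. The p-value is obtained by enumerating
    all sign assignments, so this is only practical for small samples.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= stat:
            hits += 1
    return stat, hits / 2 ** len(ranks)
```

The two arguments would be, for example, the values of one metric for two methods over the same list of dataset, kernel and datapoint-type combinations; the null hypothesis of no difference is rejected when the returned p-value falls below the chosen threshold.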

Datasets
The datasets used belong to different domains and have different sizes and different numbers of features (both categorical and numerical), as indicated in Table 1.
We ran the experiments with the following infrastructure: the implementations of the OCSVM algorithm, the K-Means++ clustering and the DT algorithms are based on Scikit-Learn [45]. The rest of the code described in Algorithms 1 and 2 was developed from scratch and is available on GitHub [46], together with the algorithms used for measuring the "stability" and the "diversity" of the rules.

Results
In this section we check our hypotheses. We refer to the K-Means approach as KM and to K-Prototypes as KP. Thus, for instance, K-Means with the "split" method is identified as KM_split.
Figure 13 and Table 2 provide the results associated with hypothesis 1. Here, we want to check whether there are significantly fewer P@1 rules for inliers using an RBF kernel, compared to using a linear kernel for inliers or the same RBF kernel for outliers. For the Wilcoxon signed-rank tests we compare only combinations of method, kernel and inliers/outliers for datasets that have at least one P@1 rule (since, as Figure 13 shows, not every method is able to yield P@1 rules in every dataset). Since the comparisons involve few datapoints in some cases, we use a significance threshold of 0.1 for the p-value. Under this threshold, only KM_split and KM_keep show significant differences in the number of rules. In those cases, H1 is actually rejected: RBF for inliers yields more rules than either RBF for outliers or linear for inliers. For the other methods, the results are not statistically strong enough to conclude anything. These results are shown in Table 2. Figure 13 shows that for datasets D3 and/or D5 (the ones with more features) the number of rules increases abruptly for some of the methods (the "split" ones and KM_keep_reset). Thus, even though it does not hold for every method, these rule extraction methods tend to generate more rules when applied to inliers with an RBF kernel than in the other cases.
After comparing those rule extraction methods in terms of the number of rules, in order to see significant differences depending on the type of datapoints (inliers/outliers) and the type of kernel (RBF/linear), we proceed to check H2. Here, we compare the methods considering all the XAI metrics proposed previously. This is done by computing every metric over every combination of dataset, kernel and type of data (inliers/outliers), and performing a Wilcoxon signed-rank test to see whether there are significant differences between the methods for each of the metrics. Since there are more datapoints in this case than for H1, we use a significance threshold of 0.05 for the p-value. For H2 there is no need to check the size of the rules, since it is the same for all the methods using K-Means and for all the methods using K-Prototypes. We only compare the dataset, kernel and type-of-data combinations that exist in both of the methods considered. Thus, the means for the KM methods may vary depending on whether they are compared among themselves or against the KP ones (and vice versa).
Table 3 shows the methods and metrics that have significant differences according to the Wilcoxon signed-rank test. There are some cases where the metrics do differ significantly, as some methods yield better results. This is the case of KM_split. This method outperforms every other one regarding the percentage of datapoints covered by its P@1 rules (per_p1). It does so at the expense of yielding a greater number of rules than some of the other methods. Thus, it increases "representativeness" while losing "comprehensibility". In general, KM methods cover more datapoints with P@1 rules than their KP counterparts. Considering the other "representativeness" metric, p1_coverage, there are no significant differences between KM_split and KM_keep_reset, but both methods yield better results than KM_keep. Thus, the P@1 rules that they yield are usually able to cover more datapoints. This is logical, since the algorithm that yields the rules in the case of KM_keep tends to generate smaller hypercubes. An example of this can be seen in Figure 11 for dataset D1, where KM_keep indeed yields rules that are smaller than the ones from the other methods.
Regarding "diversity" (diversity_score), there are no significant differences between KM methods, but all of them outperform all the KP ones. In terms of "stability", we see no significant difference between any of the methods. Finally, the general metric (final_metric) shows that KM_keep actually outperforms KM_split and KM_keep_reset. Thus, even though KM_keep had worse results in terms of "representativeness" than the other KM methods, this is compensated by the other metrics. With this analysis, we see that KM methods appear to be better than KP ones for P@1 rules and for explaining anomalies over an OCSVM model. However, the comparison among KM methods is less clear-cut; they have similar results in some metrics (KM_keep_reset and KM_discard are very similar to each other), while being different in others (mainly compared to KM_keep in terms of "representativeness"). Thus, H2 is partially supported.
Figure 11: K-Means based rule extraction methods (for inliers) over D1 dataset with RBF kernel.
A visual analysis of the metrics is provided in Figure 14. It shows the aforementioned XAI metrics for the clustering-based rule extraction methods and for every combination of dataset, kernel and type of data considered in this paper.
Finally, we check H3. Since the techniques compared for H2 yield similar results, we focus only on KM_split and KM_keep, and benchmark them against the remaining rule extraction techniques covered in this paper. Figure 12 shows visualizations for some of these methods over D1 (when using an RBF kernel).
Figure 12: Visualizations for the rules extracted over D1 with RBF kernel with DT and Anchors (for inliers) and SkopeRules and RuleFit (for outliers).
The results appear in Table 5. Here, we see that KM_split is generally better for every metric except for the ones related to "comprehensibility". In particular, KM_split covers significantly more datapoints from the target class with P@1 rules (per_p1) than any of the other methods, and also yields rules that have better coverage (p1_coverage) than FRL and brlg. However, the mean coverage per rule is not significantly different from that of the other methods. Regarding "stability", KM_split outperforms brlg, but brlg has a better result in terms of "diversity". KM_split improves on FRL and Anchors in "stability", and on DT in "diversity". Finally, considering the general metric, KM_split has significantly better results than any of the other methods, with the exception of DT, against which it does not show significant differences. For KM_keep, the results are similar, as shown in Table 5. One difference is that, since KM_keep has a better general metric than KM_split, it is also able to significantly outperform DT in that aspect. Also, KM_keep is not outperformed in terms of "diversity" by brlg, as opposed to KM_split.
In conclusion, we see that the solution of [10] with K-Means++ (with the modification for generating rules for categorical features) and the variations considered in this paper (also with K-Means++ as clustering method) yield similar results for most of the XAI metrics considered in this paper when explaining the results of an OCSVM anomaly detection model using P@1 rules. The results in terms of comprehensibility (number of rules) are influenced by the type of kernel used and by whether the rules explain inliers or outliers. Finally, comparing these techniques with other rule extraction methods, we saw a trade-off between comprehensibility and the remaining XAI metrics. The clustering-based rule extraction techniques used in this paper explain the results of an OCSVM model better with P@1 rules (considering the datasets and kernels of this paper) in terms of "representativeness", "stability" and "diversity", but at the expense of "comprehensibility", which is penalized.

Conclusion
In this paper we have analysed the application of XAI techniques over unsupervised outlier detection models through the usage of rule extraction methods applied to OCSVM models. Among the rule extraction techniques, we used algorithms from the literature as well as new alternatives that we propose and evaluate together with them. Our first aim was to analyse the quality of the extracted rules from an XAI perspective. We have done this by defining metrics for different aspects related to XAI (comprehensibility, representativeness, stability and diversity), as well as proposing a function to aggregate all those metrics. We evaluated those metrics over different datasets, both from public sources and from Telefónica, using communications and IoT generated data for that purpose. The results for the metrics show that clustering-based techniques yield results that are similar to each other (when using K-Means++ clustering). When comparing these methods with other rule extraction techniques over different datasets and different kernels, we saw that, when working with P@1 rules, the clustering-based methods yielded better results in terms of "representativeness", "stability" and "diversity", at the expense of "comprehensibility", which is penalized because they use more and larger rules than other methods. Our evaluation considered model-agnostic techniques that can be applied over any black-box model. In order to check this empirically, we have used OCSVM models with different kernel configurations (linear and RBF). We saw that, indeed, all rule extraction techniques provide similar results regardless of the kernel used.

Limitations of our Approach
Regarding the XAI metrics and evaluations, the first limitation to consider is that the only model used is OCSVM (with two types of kernel). Even though the algorithms used and the metric definitions are model-agnostic and may be potentially applied over other outlier detection models, the results may differ if other unsupervised models or kernels are used. Together with this, our suggestion of a function that aggregates every metric is a simple baseline that can be further improved.
Also, the analyses carried out are focused on P@1 rules. Thus, the conclusions may be different if all the extracted rules are used, regardless of their precision value.
Finally, all the rules (for cluster-based methods) and all the checks of whether a data point is inside a hypercube (for all methods) are defined with non-strict inequalities (≤, ≥). Because of that, the results may differ if strict inequalities are used instead, which would allow values from the other class to lie exactly on the limit of a hypercube.

Future Work
There are several research lines that can be pursued following the work presented in this paper. The first one is benchmarking these results against other rule extraction techniques that are not covered in this paper, such as the G-Rex algorithms [52]. Another research line is analysing the metrics of the rule extraction techniques over other unsupervised anomaly detection models, such as IsolationForests [53] or LOF [54], as well as using other kernel configurations in OCSVM (such as a polynomial one). Also, while we have proposed a vanilla function to incorporate the metrics belonging to different XAI areas, there is much room for improvement in order to find an optimal function that weights every term appropriately. Finally, rule extraction should also be designed to consider all types of comparisons (≥, ≤, > and <), and this is something that could also be incorporated into the cluster-based methods developed.