Towards Explaining Anomalies: A Deep Taylor Decomposition of One-Class Models

A common machine learning task is to discriminate between normal and anomalous data points. In practice, it is not always sufficient to reach high accuracy at this task; one would also like to understand why a given data point has been predicted in a certain way. We present a new principled approach for one-class SVMs that decomposes outlier predictions in terms of input variables. The method first recomposes the one-class model as a neural network with distance functions and min-pooling, and then performs a deep Taylor decomposition (DTD) of the model output. The proposed One-Class DTD is applicable to a number of common distance-based SVM kernels and is able to reliably explain a wide set of data anomalies. Furthermore, it outperforms baselines such as sensitivity analysis, nearest neighbor, or simple edge detection.


Introduction
Novelty detection, or outlier detection, is a well-studied and well-formalized machine learning problem with numerous practical applications. One such application is intrusion detection in computer systems, where data points are typically digital messages transmitted over a network, and messages that are detected as outliers are considered likely to carry a threat [13,17]. Another application is obstacle detection in autonomous car driving [18]. The ability to detect outliers is also important in scientific applications, where points detected as such are intrinsically more interesting than inliers, and should therefore be given more attention [59,28]. A number of techniques can be used for outlier detection [12,21,36,41,51]. In practice, it is not only important to detect outliers and inliers with high accuracy; one would also like to be able to explain why a machine learning model considers a sample an inlier or an outlier. Interpretable explanatory feedback can indeed be used by a human operator for appropriate decision making. The data point could either be considered benign and possibly incorporated into the dataset, or appropriate action might be taken. The problem of outlier explanation is shown schematically in Figure 1. A dataset, here one class from the MNIST data set of handwritten digits, is fitted by a one-class model from which outlier scores can be obtained. These scores must then be traced back to interpretable quantities such as the input variables of the model.
Interpretability of machine learning models has received growing attention, especially in scientific applications [42,26,48,19,52] and for systems that interact with humans [10,24,31,7]. A number of generally applicable techniques for interpreting machine learning models have been proposed [45,14,57,2,10]. Most of them have been developed in the context of supervised learning. The present work therefore addresses the current lack of interpretability of unsupervised machine learning models, and provides a practical solution in the context of one-class SVMs for outlier detection. We will first argue that the problems of explaining inlier and outlier decisions are qualitatively different, and need to be treated in distinct ways. Inliers will be best explained in terms of contributions of support vectors, whereas outliers will be better explained in terms of contributions of input variables. We propose fairly general conditions for inlierness and outlierness that can be reconciled with many common models.
This distinction is reflected in two distinct compositions of the one-class SVM. The first one performs a sum-pooling over similarity scores. This architecture enables the interpretation of inlierness. The second one performs a min-pooling over distances, which provides an interpretation of outlierness. In particular, we propose in this paper a deep Taylor decomposition/integrated gradients approach [34,49]. The proposed method can be applied to a number of outlier detection models, namely those of RBF-type [25,51,20,5,22,56,9,50]. For that, the model does not have to be modified, and neither re-training nor access to training data is required for the presented explanation method. Instead, only the detection model needs to be known and an appropriate measure of outlierness has to be constructed. The latter will be formally defined in Section 3. In Section 8, we will show empirically that the proposed technique provides meaningful explanations.
Figure 1: Left: Data is generated from an unknown distribution; we are, for example, interested in potential outliers. Middle: Unsupervised machine learning techniques estimate the data generating distribution and assign an outlier score o(x) to unlikely data points. Right: Our explanation method assigns a relevance score to every input variable that reflects the contribution of input variable x_i to the model decision. We apply dithering to all heatmaps for printing reliability.

Related work
A number of studies have considered the problem of outlier explanations: Schwenk and Bach [43] applied structured one-class SVMs to explaining anomalies in Media-Cloud applications, and proposed a technique to decompose their predictions in terms of input variables for sum-decomposable kernels. We extend the previous work by proposing a Taylor-based decomposition framework applicable to various non-decomposable RBF-type kernels. Liu et al. [32] use the decision of a complex outlier detection model to train a set of simple detectors that separate outliers linearly from clusters of nearby training patterns. Subsequently, the linear weights are used for interpretation of the outlier. Micenková et al. [33] heuristically remove features from detected outliers and return a subset of features that maximizes separability of the outlier from the surrounding training patterns. These methods rely on (1) the existence of a hypothetical outlier class that is approximated by sampling in the vicinity of the supposed outliers and (2) access to the training data in the explanation stage. On the other hand, the methods are implementation invariant and model agnostic and can be applied to any outlier detection model.
We take a different approach, where we look at the model as a mathematical function and identify marginal contributions of input variables to the produced detection score.

One-Class SVM
In one-class learning, we are trying to separate patterns that are generated by one common distribution from the rest of the input domain. Schölkopf et al. [41] proposed the one-class SVM as an algorithm that learns the tails of a high-dimensional distribution, which is sufficient for the separation task. For a set of training data $x_1, \dots, x_n \in X$ and some feature map $\Phi : X \to F$, the primal one-class SVM problem takes the form
$$\underset{w \in F,\ \rho \in \mathbb{R},\ \xi \in \mathbb{R}^n}{\text{minimize}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \xi_i - \rho \quad \text{subject to} \quad \langle w, \Phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0,$$
where $\nu \in [0, 1]$ controls the fraction of outliers that are extracted by the model [41]. Given an explicit map $\Phi$ with interpretable features (e.g. BoW or pixel intensities), the one-class SVM is readily interpretable in feature space $F$ by means of the linear weight vector $w \in F$. For RBF kernels, the optimization is performed in the dual formulation, which does not provide an explicit representation of the weight vector, but a set of Lagrangian multipliers, taken as coefficients of radial basis functions, centered at support vectors. Let $k : \mathbb{R} \to \mathbb{R}$ be an RBF kernel that acts on the distance between two points and produces large output for patterns that are similar and small output for distinct patterns. The one-class SVM extracts a small set of support vectors $u_1, \dots, u_m$ with $m \le n$ from the training set together with coefficients $\alpha_1, \dots, \alpha_m$ such that
$$g(x) = \sum_{j=1}^{m} \alpha_j\, k(\|x - u_j\|)$$
is large for data points $x \in X$ that are typical in terms of the training data and the chosen similarity measure $k$. For anomalous points, $g(x)$ will be small.
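The discriminant described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the kernel is taken to be Gaussian, and the names `rbf_kernel`, `discriminant`, `support_vectors` and `alphas`, as well as the toy values, are hypothetical and not taken from the paper.

```python
import math

def rbf_kernel(dist, sigma=1.0):
    """Gaussian RBF kernel acting on a distance: large for small distances."""
    return math.exp(-0.5 * (dist / sigma) ** 2)

def discriminant(x, support_vectors, alphas, kernel=rbf_kernel):
    """g(x) = sum_j alpha_j * k(||x - u_j||)."""
    return sum(a * kernel(math.dist(x, u)) for u, a in zip(support_vectors, alphas))

# Hypothetical toy model: two support vectors with coefficients summing to one.
support_vectors = [(0.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]

g_typical = discriminant((0.5, 0.0), support_vectors, alphas)      # close to the data
g_anomalous = discriminant((10.0, 10.0), support_vectors, alphas)  # far from the data
```

In this toy setting, `g_typical` is large and `g_anomalous` is close to zero, matching the qualitative behavior stated in the text.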

Inlierness and Outlierness
Having introduced the one-class SVM, we now take a more abstract look at the problem. We will characterize inliers and outliers by answering: (1) what is an appropriate compositional structure for these two quantities, and (2) how should inlierness and outlierness be quantified. Our answers to these questions will provide a theoretical basis for the design of our explanation method described in Sections 5 and 6.

Modeling Inlierness and Outlierness
Any complex prediction task requires a set of function classes to choose from. These functions are preselected based on some prior knowledge about the problem, and incorporate properties such as linearity, smoothness, and more general types of equivariances or invariances. Practically, these function classes can be implemented by a model which can be, for example, a composition of multiple layers.
The compositional structure of the model differs substantially between the types of prediction tasks. For example, a model that detects "airplanes" in an image will typically consist of multiple detectors that test for the presence of an airplane at various locations in the image. The detection decision can be expressed as: "Decide 'airplane' if any airplane template is matched." An appropriate architecture for this problem would therefore be a collection of similarity functions in the first layer, followed by a max-pooling operation in the second layer. This structure of the prediction function is prototypical for state-of-the-art classification architectures such as the deep convolutional neural network, where detection layers are interleaved with max-pooling layers [8].
Max-pooling architectures are also particularly suitable for the problem of detecting inliers. A first layer will detect the similarity to every individual airplane in the data, and a second layer will retain the maximum similarity scores obtained in the previous layer. Here, each airplane detector measures the similarity to an airplane or a group of airplanes in the data. The inlier decision can in that case also be expressed as "Decide 'inlier' if any airplane template is matched.", i.e. in the same way as for the detection task. An appropriate composition of the inlierness function is therefore of type $i(x) = \max_j k(\|x - u_j\|)$, where the first layer maps the input data to the similarity function scores, and the second layer applies some max-pooling operation or a soft variant of it. This structure is visualized in Figure 2 (left).
On the other hand, if we were to use the same max-pooling approach for outlier detection, we would need to build as many detectors as there are possible inputs without an airplane. There is an exponential number of them. Instead, the outlier detection problem is better expressed as follows: "Decide 'outlier' if all airplane templates are unmatched." In that case, the first layer models the level of distance functions, and the second layer becomes a min-pooling operation. An appropriate composition of the outlierness function will be of type $o(x) = \min_j \|x - u_j\|$, where the first layer maps the input data to the distances, and the second layer applies some min-pooling operation or a soft variant of it. This new structure is visualized in Figure 2 (right).
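The two compositions can be contrasted with a minimal sketch. It assumes hard max- and min-pooling and a Gaussian similarity function; the function names and the toy templates are hypothetical and only serve to illustrate the two decision rules.

```python
import math

def inlierness_maxpool(x, templates, sigma=1.0):
    """i(x) = max_j k(||x - u_j||): 'inlier' if any template is matched."""
    return max(math.exp(-0.5 * (math.dist(x, u) / sigma) ** 2) for u in templates)

def outlierness_minpool(x, templates):
    """o(x) = min_j ||x - u_j||: 'outlier' if all templates are unmatched."""
    return min(math.dist(x, u) for u in templates)

templates = [(0.0, 0.0), (3.0, 4.0)]   # hypothetical templates u_j
near = (3.0, 4.0)                      # coincides with one template
far = (30.0, 40.0)                     # far from every template
```

For `near`, the inlierness is at its maximum and the outlierness is zero. For `far`, the inlierness has saturated at (numerically) zero, while the outlierness keeps track of how far the point is from the closest template.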

Quantifying Inlierness and Outlierness
In problems such as classification and regression, the output of the model can be readily interpreted, e.g. as the probability of membership to a given class, or as the expected value of the target variable, respectively. When using, e.g., a one-class SVM, such an interpretation is not obvious: The discriminant function g(x) does provide an ordering from the most to the least anomalous point (cf. Harmeling et al. [20]); however, it only answers which of two data points is more anomalous, and not the absolute level of anomaly of a given data point. We propose the following axiomatic definitions for inlierness and outlierness, and then briefly discuss how common machine learning outlier detection models fulfill or violate these definitions:
Definition 1. A measure of inlierness $i : X \to \mathbb{R}$ must fulfill the following two conditions for all $x \in X$:
1. It is bounded by zero and some positive number $u$: $0 \le i(x) \le u$, and
2. It converges asymptotically to zero: $\lim_{t \to \infty} i(tx) = 0$.
For example, the Gaussian mixture model, which is sometimes used for inlier/outlier detection (e.g. [50]), associates to each input point a probability score representing the likelihood of that point being generated from the underlying distribution [6]. This probability score is bounded between 0 and 1 and converges to 0 when moving away from the data. Thus, these probability scores fulfill our definition of inlierness. Similarly, the discriminant function of the one-class SVM with RBF kernel is upper bounded by the kernel bound, and converges to zero as we move away from the data. These quantities are however not suitable as an outlierness model: they asymptote to 0 as x moves away from the data, which does not capture the fact that the degree of outlierness continues to increase. Outlierness is instead better defined by the following set of axioms:
To reflect the Euclidean geometry of the input space, the norm in the denominator will be assumed to be a 2-norm. Examples of functions that satisfy Definition 2 are the distance to the mean, or the neg-log-likelihood under an isotropic probability distribution, e.g. $N(\mu, \sigma^2 I)$. These functions are typically used in machine learning for measuring error.
Figure 2: Left: similarity functions; decide inlier if any template is matched. Right: distance functions; decide outlier if all templates are unmatched.
As a counterexample, the neg-log-likelihood of a general Gaussian distribution $N(\mu, \Sigma)$ learned from the data does not satisfy Definition 2: The latter is indeed not suitable for measuring outlierness, as the learned covariance $\Sigma$ overrides the natural metric of input space on which the outlier decision should be based.
Having defined inlierness and outlierness, we now provide measures for the one-class SVM of interest in this paper. These measures are based on the discriminant g(x) defined above. In general, there may be more than one measure of inlierness or outlierness, and we shall here apply a principle of parsimony.
Exponential kernels. The first class of kernels we consider are exponential kernels, which can be parametrized as
$$k(\|x - x'\|) = \exp\big(-(\|x - x'\| / \sigma)^q / q\big).$$
The parameter $\sigma$ is the bandwidth of the kernel. For q = 1, the kernel is called the Laplacian kernel, and for q = 2 the Gaussian kernel.
The simplest measures of inlierness and outlierness that satisfy Definitions 1 and 2 would be:
$$i(x) = g(x) \quad \text{and} \quad o(x) = -\log g(x).$$
A proof that the outlierness meets Definition 2 can be found in Appendix B.1.
t-Student kernels. The second class of kernels that we consider are t-Student kernels:
$$k(\|x - x'\|) = \big(a + \|x - x'\|^q\big)^{-1}.$$
The parameter $a$ is positive and often set to 1. When the norm is also scaled by a bandwidth, the kernel is also referred to as the Cauchy kernel. Inlierness and outlierness will be measured by the following functions:
$$i(x) = g(x) \quad \text{and} \quad o(x) = m\, g(x)^{-1}.$$
A proof for the agreement of o(x) with Definition 2 is in Appendix B.1.
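To make the difference between the two kinds of measure tangible, the following sketch evaluates the discriminant g(x) together with the outlierness measures that Sections 6.1 and 6.2 associate with each kernel family, namely $o(x) = -\log g(x)$ for exponential kernels and $o(x) = m\, g(x)^{-1}$ for t-Student kernels. The function names and the toy support vectors are assumptions made for illustration.

```python
import math

def exp_kernel(d, sigma=1.0, q=2):
    """Exponential kernel exp(-(d / sigma)^q / q); q = 2 is the Gaussian case."""
    return math.exp(-((d / sigma) ** q) / q)

def t_student_kernel(d, a=1.0, q=2):
    """Generalized t-Student kernel (a + d^q)^-1."""
    return 1.0 / (a + d ** q)

def g(x, svs, alphas, kernel):
    """One-class SVM discriminant, used here as the measure of inlierness."""
    return sum(al * kernel(math.dist(x, u)) for u, al in zip(svs, alphas))

def outlierness_exp(x, svs, alphas):
    return -math.log(g(x, svs, alphas, exp_kernel))

def outlierness_t(x, svs, alphas):
    return len(svs) / g(x, svs, alphas, t_student_kernel)

# Hypothetical toy model and two points at increasing distance from the data.
svs, alphas = [(0.0, 0.0), (1.0, 0.0)], [0.5, 0.5]
near, far = (0.0, 5.0), (0.0, 10.0)
for x in (near, far):
    print(x, g(x, svs, alphas, exp_kernel), outlierness_exp(x, svs, alphas),
          outlierness_t(x, svs, alphas))
```

Moving from `near` to `far`, the discriminant shrinks towards zero, whereas both outlierness measures keep growing, which is the behavior the outlierness axioms are meant to capture.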

Explaining Machine Learning Decisions
In this section, we review several techniques to explain the predictions of a machine learning classifier in terms of input variables. Let $x \in \mathbb{R}^d$ be an input example and $f(x) \in \mathbb{R}$ be its prediction, where $f$ is a function learned from the data. The goal of an explanation is to assign a relevance score $R_i$ to each feature $x_i$ that reflects the importance of that feature for the prediction.
Sensitivity Analysis. The simplest technique for explanation is to attribute relevance to the input variables to which the prediction is locally most sensitive [63,16,3]. That is, for a given prediction, we define the importance score for each input variable i as:
$$R_i = \Big(\frac{\partial f}{\partial x_i}(x)\Big)^2,$$
that is, the squared locally evaluated partial derivatives. A limitation of sensitivity analysis is that it is an explanation of the function variation rather than of the function value. Considering a single distance norm $\|x - u_j\|$, we observe that the gradient does not grow with the distance, implying that sensitivity analysis does not capture the amount of outlierness that a pattern holds. Another observation is that the gradient vanishes between modes of the data, so that sensitivity analysis assigns zero importance to variables that occupy a local maximum of outlierness. These weaknesses of sensitivity analysis have led to the development of more precise explanation techniques, which we take up in the following.
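The observation that the gradient of a distance does not grow with the distance can be checked numerically. The sketch below uses central finite differences as a stand-in for the analytic gradient; the helper name `sensitivity` and the toy points are hypothetical.

```python
import math

def sensitivity(f, x, eps=1e-6):
    """R_i = (df/dx_i)^2, with central finite differences standing in for the gradient."""
    scores = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        scores.append(((f(hi) - f(lo)) / (2 * eps)) ** 2)
    return scores

u = (0.0, 0.0)                       # a single hypothetical support vector
distance = lambda x: math.dist(x, u)

close = sensitivity(distance, (0.3, 0.4))     # point at distance 0.5
far = sensitivity(distance, (30.0, 40.0))     # same direction, distance 50
```

Both points lie in the same direction from `u`, and their sensitivity scores coincide up to numerical error even though the second point is a hundred times farther away. The scores sum to one in both cases, so they carry no information about the amount of outlierness.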
Simple Taylor Decomposition. Taylor decomposition [4,2] seeks to determine the importance of input variables for a certain prediction $f(x)$ by performing an expansion of the function $f$ at a certain reference point $\tilde{x}$:
$$f(x) = \underbrace{f(\tilde{x})}_{(1)} + \underbrace{\sum_{i=1}^{d} \frac{\partial f}{\partial x_i}\Big|_{x = \tilde{x}} (x_i - \tilde{x}_i)}_{(2)} + \underbrace{\text{higher-order terms}}_{(3)}$$
It then identifies as importance for a given variable the various terms of the expansion that are bound to it. In the equation above, (1) is the function value at the reference point, (2) contains linear contributions, (3) contains all higher-order terms, including interdependence relations between input variables. Simple Taylor decomposition focuses on the term (2), where the summands are bound to a given input variable. Thus, we define the relevance scores $(R_i)_i$ for the prediction $f(x)$ as:
$$R_i = \frac{\partial f}{\partial x_i}\Big|_{x = \tilde{x}} (x_i - \tilde{x}_i).$$
In our analysis, we will also choose functions and reference points such that the term (1) is zero, i.e. contains no information on the model's prediction, and (3) is small. In that case, we obtain the relevance conservation property $\sum_i R_i = f(x)$, which guarantees that the explanation matches in magnitude the amount of predicted inlierness or outlierness. A limitation of simple Taylor decomposition is the need to find a root point $\tilde{x}$ in the vicinity of $x$, which can be time-consuming. Further, the reference point might jump to a different mode as the input pattern moves from one mode to another mode of the distribution. This may cause two nearly equivalent data points with nearly equivalent predictions to receive a different explanation. Stated differently, the explanation as a function of $x$ is discontinuous.
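As a minimal sketch of the first-order term and of the conservation property, consider a hypothetical linear toy function, for which the nearest root point has a closed form and the higher-order terms vanish exactly. The function `f`, its weights, and the helper names are assumptions for illustration; for nonlinear models the root search is the difficult part and is not covered here.

```python
def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient (a stand-in for an analytic gradient)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def taylor_relevance(f, x, x_root):
    """R_i = df/dx_i(x_root) * (x_i - x_root_i): the first-order term of the expansion."""
    return [g * (xi - ri) for g, xi, ri in zip(gradient(f, x_root), x, x_root)]

# Hypothetical linear toy function, so that higher-order terms vanish exactly.
w, b = (2.0, 3.0), -1.0
f = lambda x: w[0] * x[0] + w[1] * x[1] + b

x = (2.0, 1.0)
# Nearest root point of a linear function: project x onto the hyperplane f = 0.
scale = f(x) / (w[0] ** 2 + w[1] ** 2)
x_root = (x[0] - scale * w[0], x[1] - scale * w[1])

R = taylor_relevance(f, x, x_root)
```

Because `f(x_root)` is zero and the function is linear, the relevance scores in `R` add up to `f(x)`.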
Integrated Gradients. Another approach for setting importance scores of inputs to a prediction has been proposed by Sundararajan et al. [49]. For some reference point $\tilde{x}$, a prediction is explained by summing over a finite number of small steps of first-order simple Taylor decompositions between the input $x$ and the reference point $\tilde{x}$. In the limit, the attribution can be written in terms of the integral
$$R_i = (x_i - \tilde{x}_i) \int_0^1 \frac{\partial f}{\partial x_i}\big(\tilde{x} + t\,(x - \tilde{x})\big)\, dt,$$
which typically has to be evaluated numerically, but may also have an analytical solution for simpler models. As for simple Taylor decomposition, one needs to choose an appropriate reference point. The advantage of integrated gradients is the absence of second- and higher-order residual terms. In Section 6.3 we apply the method to convex functions for which the integral has an analytical solution.
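A numerical evaluation of integrated gradients can be sketched as follows, assuming a midpoint Riemann sum along the straight path from the reference point to the input and finite-difference gradients. The helper name, the number of steps, and the squared-distance toy function are illustrative choices rather than anything prescribed by [49].

```python
def integrated_gradients(f, x, x_ref, steps=100, eps=1e-6):
    """Approximate R_i = (x_i - x_ref_i) * integral_0^1 df/dx_i(x_ref + t (x - x_ref)) dt."""
    n = len(x)
    avg_grad = [0.0] * n
    for s in range(steps):
        t = (s + 0.5) / steps                              # midpoint rule
        p = [r + t * (xi - r) for xi, r in zip(x, x_ref)]  # point on the path
        for i in range(n):
            hi, lo = list(p), list(p)
            hi[i] += eps
            lo[i] -= eps
            avg_grad[i] += (f(hi) - f(lo)) / (2 * eps) / steps
    return [(xi - r) * g for xi, r, g in zip(x, x_ref, avg_grad)]

u = (0.0, 0.0)                                   # hypothetical support vector
f = lambda x: sum((xi - ui) ** 2 for xi, ui in zip(x, u))  # squared distance to u

x = (3.0, 4.0)
R = integrated_gradients(f, x, x_ref=u)
```

For this convex toy function the attributions equal the squared coordinate differences, and they sum to `f(x) - f(u)` with no residual term.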
Deep Taylor Decomposition. Deep Taylor decomposition (DTD) is a method for decomposing the prediction of a neural network on its input variables [34]. The decomposition is obtained by propagating the model output into the neural network graph by means of redistribution rules, until the input variables are reached. As such, it belongs to a broader class of propagation techniques [2,27,44,58]. A distinctive feature of DTD is that the propagation rules are derived from a Taylor decomposition performed at each neuron of the network. The decomposition process starts from the top neuron, whose activation is redistributed into relevance scores of neurons in the previous layer. The previous layer relevance scores are then expressed as a function of the activations of the layer before, which enables another step of redistribution. The Taylor decomposition process is iterated from the top layer down to the input layer where the decomposition in each layer has a closed form for known compositions. The procedure leads ultimately to a relevance score for each input variable. Like for the forward pass or standard gradient propagation, DTD can be quickly computed in O(#connections).
The original DTD method uses Taylor decomposition as a unit of explanation at each neuron. However, our adaptation of DTD in the context of one-class SVM leads to the observation that, for certain neuron types, e.g. mapping on the kernel basis function, this unit of explanation can be advantageously substituted by other analyses such as integrated gradients. Overall, the method we present in this paper generalizes DTD to a "deep decomposition" where we use standard Taylor decomposition or integrated gradients as unit of explanation at various layers.
Other methods. A number of other methods have been proposed for explanation. These include methods based on locally sampling the decision function [37], local perturbations [57,62], other types of propagation techniques [57,47], as well as explanation methods supported by specific choices of architectures [60,10].

Explaining Inlierness
In this section, we present the decomposition of the measure of inlierness $i(x)$ defined in Section 3.2. As argued in Section 3, inlierness is best modeled by a detection-max-pooling architecture. Such an architecture is common in convolutional neural networks, where max-pools are composed of outputs from different detectors that were applied to the same lower-level features. A two-layer neural network that implements the measure of inlierness is given by:
$$s_j = \alpha_j\, k(\|x - u_j\|) \quad \text{(layer 1)}, \qquad i = \sum_j s_j \quad \text{(layer 2)},$$
where the first layer computes the weighted similarities to the support vectors measured by the kernel, and the second layer performs a sum-pooling, which can be viewed as a soft variant of max-pooling. Deep Taylor decomposition applies as a first step the decomposition of the output $i(x)$ on the first-layer activations $(s_j)_j$ that we call "effective similarities" due to the weighting term $\alpha_j$. The two-layer architecture and the process of relevance redistribution from the top layer to the intermediate layer is shown in Figure 3. The input (e.g. a handwritten digit) is first propagated into the neural network, to compute the inlier score. Then, this score is redistributed from the top layer to the hidden layer, which gives a decomposition of inlierness in terms of support vectors. Technically, we perform a Taylor expansion of the inlierness as a function of the hidden layer activations $i((s_j)_j)$. Relevance scores are then given by:
$$R_j = \frac{\partial i}{\partial s_j}\Big|_{(\tilde{s}_j)_j} \cdot (s_j - \tilde{s}_j). \qquad (3)$$
Due to the linearity of the sum-pooling function, there is no second-order term. In order to satisfy the conservation property $i = \sum_j R_j$, we further need to have $i((\tilde{s}_j)_j) = 0$, i.e. we need to perform the Taylor expansion at a root point of the function. Here, we choose the root point $(\tilde{s}_j)_j = (0)_j$, because it is the only admissible root point in the space of activations.
The deep Taylor decomposition method [34] does not require the root point to have a preimage in the lower layer. However, in this particular case, one can still interpret the segment $[(s_j)_j, (0)_j]$ as moving in some direction orthogonal to the data manifold in the input domain.
Injecting the root point $(0)_j$ in Equation (3) gives the relevance score:
$$R_j = s_j.$$
That is, the relevance of support vector $u_j$ corresponds to its hidden neuron activation $s_j$. This operation can be interpreted as a "max-takes-most" redistribution.
We now ask if it is sufficient for explanation to stop at this layer, or if relevance should be further propagated to the input variables. For this, consider the simplest inlier model $i(x) = \alpha \cdot k(\|x - u\|)$ composed of a single support vector $u$. Consider the most inlier point $x^\star = \arg\max_x i(x) = u$. At this location, it is easy to conclude that $u$ has contributed to the inlierness of $x^\star$. However, because the kernel is RBF and $x^\star$ lies at the maximum, it is impossible to assign a directional explanation for such inlierness. Indeed, $i(x)$ looks exactly the same along each direction. Based on this prototypical example, one concludes that explanations for inlierness are better given in terms of support vectors than input directions.
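The resulting decomposition of inlierness onto support vectors amounts to reading off the hidden activations. A minimal sketch, assuming a Gaussian kernel and a hypothetical two-support-vector model (function and variable names are illustrative):

```python
import math

def inlier_relevance(x, svs, alphas, sigma=1.0):
    """Return the inlier score i(x) and the per-support-vector relevances R_j = s_j."""
    s = [al * math.exp(-0.5 * (math.dist(x, u) / sigma) ** 2)
         for u, al in zip(svs, alphas)]       # layer 1: effective similarities
    return sum(s), s                          # layer 2: sum-pooling; R_j = s_j

svs = [(0.0, 0.0), (4.0, 0.0)]                # hypothetical support vectors
alphas = [0.5, 0.5]
score, R = inlier_relevance((0.5, 0.0), svs, alphas)
```

The relevances in `R` sum to `score` by construction, and the support vector closest to the input receives most of the relevance, which is the "max-takes-most" behavior described above.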

Explaining Outlierness
As discussed in Section 3, outlier detection is more naturally described as a min-pooling over local distances. Unlike the explanation of inliers, the analysis here will depend on the choice of kernel. For each kernel family, one needs to find a suitable model composition, and appropriate root points for the explanation. In this section, two classes of kernels are considered. These kernels are frequently encountered in practical applications.

t-Student Kernels
The first kernel we focus on is the generalized t-Student kernel given by $k(\|x - x'\|) = (a + \|x - x'\|^q)^{-1}$. We compute the one-class SVM discriminant $g(x) = \sum_j \alpha_j k(\|x - u_j\|)$, and apply the measure of outlierness $o(x) = m\, g(x)^{-1}$ proposed in Section 3.2 for this kernel. The measure of outlierness $o(x)$ can be implemented by the following two-layer neural network (see Appendix B.2 for a proof):
$$h_j = \frac{a + \|x - u_j\|^q}{\alpha_j} \quad \text{(layer 1)}, \qquad o = H\big((h_j)_j\big) = \frac{m}{\sum_j h_j^{-1}} \quad \text{(layer 2)}.$$
The first layer can be interpreted as a mapping to the effective distances $h_j$ from each support vector. By effective distance, we mean the distance as perceived by the data point $x$, i.e. modulated by the support vector coefficients $\alpha_j$. The second layer computes the harmonic mean $H$, which implements a soft min-pooling.
We would now like to redistribute the output $o$ to the lower layer. We let $o$ depend on the hidden layer activations so that a Taylor decomposition can be performed on the previous layer. Specifically, we choose the root point $(\tilde{h}_j)_j = (0)_j$ and perform a first-order Taylor decomposition at that point. It can be shown that higher-order terms sum to zero in the Taylor expansion for that root (see Appendix B.3). Relevance scores are given by:
$$R_j = h_j \cdot c_j, \qquad (4)$$
where $h_j$ is the first-layer activation representing effective distances, and $c_j$ is a factor that only retains support vectors that are active in the min-pooling operation (i.e. those with the lowest effective distance). In the input domain, $c_j$ can be interpreted as a localization term. A large relevance score $R_j$ therefore results from an effective distance $h_j$ that is large in absolute terms, but low in comparison to the other effective distances $(h_j)_j$ in the pool. In Appendix B.3, we show that the decomposition is conservative, i.e. $\sum_j R_j = o$. In Section 6.3 we will show how to redistribute $R_j$ to the input layer.
Figure 3: Neural network equivalent of the one-class SVM for inlier detection, and the relevance redistribution from the top layer to the intermediate layer.
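The following sketch illustrates this decomposition under explicit assumptions. The effective distances are taken as $h_j = (a + \|x - u_j\|^q)/\alpha_j$, which makes their harmonic mean equal to $m\, g(x)^{-1}$, and the factor $c_j$ is taken to be the partial derivative of the harmonic mean with respect to $h_j$. The latter is our reading of the text (it is invariant under rescaling of the activations, as stated in Section 6.3) and not necessarily the exact expression used in the paper. All names are hypothetical.

```python
import math

def t_student_outlier_network(x, svs, alphas, a=1.0, q=2):
    # Layer 1: effective distances h_j = (a + ||x - u_j||^q) / alpha_j
    h = [(a + math.dist(x, u) ** q) / al for u, al in zip(svs, alphas)]
    # Layer 2: harmonic mean of the h_j (a soft min-pooling)
    inv_sum = sum(1.0 / hj for hj in h)
    o = len(h) / inv_sum
    # Assumed c_j: partial derivative of the harmonic mean with respect to h_j
    c = [len(h) / (hj * inv_sum) ** 2 for hj in h]
    # Relevance of each support vector: R_j = h_j * c_j
    R = [hj * cj for hj, cj in zip(h, c)]
    return o, h, c, R

svs, alphas = [(0.0, 0.0), (4.0, 0.0)], [0.5, 0.5]   # hypothetical model
o, h, c, R = t_student_outlier_network((1.0, 0.0), svs, alphas)
```

With this choice the relevances sum to the outlier score, and the support vector with the lowest effective distance receives the largest share.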

Exponential Kernels
In this section we consider the family of kernels of type $k(\|x - x'\|) = \exp\big(-(\|x - x'\|/\sigma)^q / q\big)$. Unlike the kernels of Section 6.1, this family of kernels implements stronger locality. The Laplacian and Gaussian kernels are special cases for q = 1 and q = 2 respectively. As in the previous section, we compute the SVM discriminant $g(x) = \sum_j \alpha_j k(\|x - u_j\|)$. However, we apply a different measure of outlierness, $o(x) = -\log g(x)$, proposed in Section 3.2 for this kernel. The function $o(x)$ can be mapped to the following two-layer neural network (proven in Appendix B.2):
$$h_j = \frac{(\|x - u_j\|/\sigma)^q}{q} - \log \alpha_j \quad \text{(layer 1)}, \qquad o = -\log \sum_j \exp(-h_j) \quad \text{(layer 2)},$$
a set of radial basis distance functions followed by a flipped log-sum-exp computation which implements a soft min-pooling. We let the neural output depend on the hidden layer, and choose the root point $(\tilde{h}_j)_j = (h_j - o)_j$, i.e. we subtract the output of the model from each dimension of the vector of activations. Relevance scores on the hidden layer are obtained by Taylor decomposition:
$$R_j = (h_j + \varepsilon_j) \cdot p_j. \qquad (5)$$
One can also show that this decomposition is conservative (see Appendix B.3).
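A sketch of this network and of the hidden-layer decomposition is given below. The layer equations follow from $o(x) = -\log g(x)$ with the kernel above. The quantities `p` (soft-min weights) and `eps` (offsets $o - h_j$) are derived here from a first-order Taylor expansion at the stated root point, so that $R_j = (h_j + \varepsilon_j)\, p_j$; they are our own derivation and should be checked against the paper's Appendix B.3. All names are hypothetical.

```python
import math

def exp_outlier_network(x, svs, alphas, sigma=1.0, q=2):
    # Layer 1: h_j = (||x - u_j|| / sigma)^q / q - log(alpha_j)
    h = [(math.dist(x, u) / sigma) ** q / q - math.log(al)
         for u, al in zip(svs, alphas)]
    # Layer 2: o = -log sum_j exp(-h_j), computed stably around min(h)
    h_min = min(h)
    weights = [math.exp(-(hj - h_min)) for hj in h]
    o = h_min - math.log(sum(weights))
    # Soft-min weights p_j (gradient of o) and offsets eps_j = o - h_j
    p = [wj / sum(weights) for wj in weights]
    eps = [o - hj for hj in h]
    # Relevance on the hidden layer: R_j = (h_j + eps_j) * p_j
    R = [(hj + ej) * pj for hj, ej, pj in zip(h, eps, p)]
    return o, h, p, eps, R

svs, alphas = [(0.0, 0.0), (4.0, 0.0)], [0.5, 0.5]   # hypothetical model
o, h, p, eps, R = exp_outlier_network((1.0, 0.0), svs, alphas)
```

In this sketch the relevances sum to the outlier score, and when one support vector dominates, its weight `p[j]` approaches one and its offset `eps[j]` approaches zero.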

Redistribution on the Input Layer
In the inlier detection case, it was sufficient to perform redistribution on the domain of support vectors. We argue that explaining an outlier in terms of support vectors does not provide much interpretability. Take a prototypical outlier, which is very far from the data. From this distance, two distinct support vectors will look very similar, and the main information about outlierness is not contained in which point it is closest to, but in the distance and direction between the outlier and the data. Thus, motivated by this prototypical-case argument, we now look at how to backpropagate the outlier explanation $(R_j)_j$ one layer below onto the input domain.
In Sections 6.1 and 6.2, support vector relevance was given by $R_j = h_j \cdot c_j$ and $R_j = (h_j + \varepsilon_j) \cdot p_j$. Redistribution on the input domain requires expressing $R_j$ as a function of $x$. We will first show that $c_j$, $\varepsilon_j$ and $p_j$ are approximately constant: When $h_j / h_{j'} \approx 0$ for all $j' \ne j$, i.e. when support vector $u_j$ dominates locally, then $c_j, p_j \approx 1$ and $\varepsilon_j \approx 0$. Furthermore, $c_j$ is constant under any rescaling of activations $(h_j)_j$, and $p_j$, $\varepsilon_j$ are constant under any increment of activations by a constant value. A proof for the invariances can be found in Appendix B.4. These transformations also describe the path where we look for the root point. Thus, considering these terms as effectively constant, $R_j$ can be modeled locally as an affine transformation of activations, which are themselves an affine transformation of distances. We write:
$$R_j(x) = C_j\, \|x - u_j\|^q + D_j,$$
where $C_j > 0$ and $D_j \in \mathbb{R}$ are constant. This quantity is redistributed on the input dimensions by means of integrated gradients [49]. A detailed derivation of integrated gradients of $R_j$ can be found in Appendix B.5. The attribution on the input variables is given in vector form by Equation (6), where, as in the original paper [34], we have summed over the relevance received from all higher-layer units. The expression $D_j^+$ denotes $\max(0, D_j)$ and the integral is the vector of individual integrals of $\partial R_j / \partial x_i$.
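The following sketch illustrates one property of this last step and does not reproduce the paper's Equation (6). It assumes the local model $R_j(x) = C_j \|x - u_j\|^q + D_j$ and a reference point located on the segment between $u_j$ and $x$, parametrized by a hypothetical fraction `s`. Under these assumptions, integrated gradients has a closed form in which the constant $D_j$ drops out of the gradient and each input dimension receives a share proportional to $(x_i - u_{j,i})^2$. How the paper places the reference point (the role of $D_j^+$) is not modeled here.

```python
import math

def redistribute_to_inputs(x, u, C, q, s=0.0):
    """Integrated gradients of R(x) = C * ||x - u||^q + D along the segment from
    the reference u + s * (x - u) to x, with 0 <= s < 1 (s is an assumption).
    Returns one attribution per input dimension."""
    d = math.dist(x, u)
    total = C * d ** q * (1.0 - s ** q)        # R(x) - R(reference)
    return [total * (xi - ui) ** 2 / d ** 2 for xi, ui in zip(x, u)]

x, u = (3.0, 4.0), (0.0, 0.0)                  # hypothetical input and support vector
from_sv = redistribute_to_inputs(x, u, C=2.0, q=2)          # reference at u
from_mid = redistribute_to_inputs(x, u, C=2.0, q=2, s=0.5)  # reference halfway
```

In both cases the attributions sum to the difference between the relevance at the input and at the reference point, and the relative shares across input dimensions are identical.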
The whole process of layer-wise redistribution from the top layer down to the input layer is shown in Figure 4. The data point (e.g. a handwritten digit) is given as input to the neural network. The network implements the outlier function as a soft min-pooling over support vector distances. The outlier score obtained at the output of the network is redistributed using deep Taylor decomposition: it is first redistributed using Taylor decomposition on the support vectors, and then further propagated to the input domain using integrated gradients.

Extension for Sequential Data
When applied to sequential data such as images or time series, one-class models based on RBF kernels become affected by the curse of dimensionality. Thus, it is sometimes preferable to apply these models to small sequences or patches of the input [29,30,15]. The scores computed for all patches are then pooled to compute a global score for the sequence. Let $(i_t)_t$ and $(o_t)_t$ be the inlier and outlier scores associated with a collection of patches or segments taken from the input sequence. One measure of outlierness that satisfies Definition 2 is obtained by summing all outlier scores, thus forming a third layer of representation:
$$o = \sum_t o_t.$$
This composition resembles the max-pooling layer 2 from Section 5. Choosing the root $(\tilde{o}_t)_t = (0)_t$ therefore amounts to a max-takes-most redistribution on the spatial locations, $R_t = o_t$, from where on we proceed as explained in Section 6, by first redistributing location relevance to support vectors, $R_{tj}$, by Equation (4) or Equation (5), and then performing a final redistribution on input variables by Equation (6).
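The pooling over patches can be sketched for a one-dimensional sequence. As a loudly labeled simplification, the per-patch outlier score below is the hard min-pooled distance from Section 3 rather than the SVM-based measures of Section 6; the helper names, templates and sequence are hypothetical.

```python
import math

def patches(seq, size):
    """All contiguous windows of the given size."""
    return [tuple(seq[t:t + size]) for t in range(len(seq) - size + 1)]

def patch_outlierness(p, templates):
    """Stand-in outlier score o_t: distance to the closest template."""
    return min(math.dist(p, u) for u in templates)

def sequence_outlierness(seq, templates, size):
    """Third layer: o = sum_t o_t, with location relevance R_t = o_t."""
    scores = [patch_outlierness(p, templates) for p in patches(seq, size)]
    return sum(scores), scores

templates = [(0.0, 1.0), (1.0, 0.0)]           # hypothetical "normal" patches
seq = [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 0.0, 1.0]  # one anomalous value (5.0)
total, scores = sequence_outlierness(seq, templates, size=2)
```

Only the two windows that contain the anomalous value receive non-zero relevance, and their scores add up to the global outlier score.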

Experiments
We first test our deep Taylor decomposition (DTD)-based method for outlier explanation on large images, where we use the sequential model of Section 7. Figure 5 shows heatmaps for images taken from various image datasets. These heatmaps are compared to a simple baseline edge detector.
Figure 5: A One-Class SVM is trained on small 7 × 7 patches of the very image itself. Parameter ν = 0.1 is set to allow at most 10% outliers. Images are from a texture data set [11] (rows one, two and four) and PatternNet [61]; the top image was altered by us. For every image, we show Left: input image; Middle: decomposition of the one-class SVM; Right: Sobel filter for reference. All images were resized to a width of 256 pixels.

All models are trained on 7 × 7 patches from the single image itself, thus heatmaps should highlight unusual statistics in the image. The function that the model implements depends solely on the model parameters: (1) the fraction of outliers ν, here chosen as 0.1, (2) the degree of the Euclidean distance q, here set to 2, (3) the kernel bandwidth σ, chosen as the 0.1 quantile of one-nearest-neighbor distances for the exponential kernel, and (4) the patch size. Having these parameters fixed, the one-class SVM has a unique solution and explanation. Examples in Figure 5 are generated with a Gaussian kernel. We rescale images to a common width of 256 pixels and apply anti-aliasing in the rescaling, because we observed that the method is sensitive to aliasing artifacts.
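The bandwidth heuristic mentioned above (σ as the 0.1 quantile of one-nearest-neighbor distances) could be computed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, distances are Euclidean, and the empirical quantile is taken by a simple index rule, which is only one of several common conventions and may differ from the one used in the experiments.

```python
import math


def nearest_neighbor_distances(points):
    """For every point, return the Euclidean distance to its closest other point."""
    distances = []
    for i, p in enumerate(points):
        closest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(closest)
    return distances


def empirical_quantile(values, level):
    """Quantile via index floor(level * (n - 1)) into the sorted values (assumed convention)."""
    ordered = sorted(values)
    return ordered[int(level * (len(ordered) - 1))]


def bandwidth_from_patches(patches, level=0.1):
    """Kernel bandwidth as a low quantile of the one-nearest-neighbor distances."""
    return empirical_quantile(nearest_neighbor_distances(patches), level)
```

The brute-force neighbor search is quadratic in the number of patches, so for a full image one would typically subsample patches or use a spatial index.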
The One-Class DTD grounds anomalies to individual pixels. The first row in Figure 5 shows a modified image from the class "grid" from the Describable Textures Dataset [11]. We perturbed the clean grid by a circle in the grid that is invisible to the human eye. The outlier edges that are due to our modification are indeed discovered by our method, as we can see in the heatmap right next to the grid image. The Sobel edge detector is not able to detect these special edges. The second image has a small defect in the middle of the lower-right quadrant. It is not obvious that this defect can be detected in the presence of other distractions, such as the lamps, which are detected as well. The first three images show that our One-Class DTD is able to discard recurring patterns, e.g. grid lines, wood lines or parking lines. In the fourth image, we see that the method is also robust to some amount of noisy patterns. While the Sobel filter detects edges reliably, One-Class DTD puts emphasis on edges that stand out on a small scale. The scale on which outlierness is detected is parameterized by the patch size.

External Validation
The following experiment tests the ability of One-Class DTD to produce correct explanations on an artificial problem where we have ground truth information on the input features that cause outlierness.
We build a dataset composed of two horizontally concatenated images of size 28 × 28. Inliers consist of an MNIST digit of a particular digit class on the left, e.g. the class "0", and a blank image on the right. A simple one-class SVM with no extension for sequential inputs is trained with a Gaussian kernel, σ = 400 and ν = 0.01. After training, the following three cases are considered for explanation: (1) Inlier: A test image from the training class is presented. (2) Type I outlier: Structure of the inliers is present (i.e. a test example from the training class appears in the left panel) together with some distraction. As distraction, we replace the right panel with a random sample from another random class. (3) Type II outlier: Structure of the inliers is distracted on both sides; the left panel of type I outliers is replaced by a random sample from another random class. Figure 6 (left) shows some example data for the class of zeros in a 2D PCA embedding. The ground truth explanation for inliers contains no relevance at all, because the measure of outlierness should detect no evidence for outlierness in these images. For type I outliers, the ground truth only contains relevance in the right side of the image. Consider a growing amount of outlierness in the right side only: an explanation of the left side should not be affected by these distractions. For type II outliers, relevance should fall in both sides of the image: the left side contains relevance for deviation from the training digit, and the relevance on the right side explains deviation from the blank image. If we consider an input with a growing amount of outlierness in the left panel, relevance should also increase in the left panel only, and vice versa.
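The construction of the three test cases can be sketched as follows. The function names and the `GROUND_TRUTH` table are illustrative assumptions, and images are represented as plain lists of rows. The table only encodes which panel is expected to carry relevance, as described above.

```python
def blank_image(height, width):
    """An all-zero image used as the right panel of inliers."""
    return [[0] * width for _ in range(height)]


def concat_horizontal(left, right):
    """Concatenate two images of equal height side by side."""
    return [list(l_row) + list(r_row) for l_row, r_row in zip(left, right)]


def make_cases(train_digit, distractor_right, distractor_left):
    """Build one inlier, one type I outlier and one type II outlier."""
    height, width = len(train_digit), len(train_digit[0])
    return {
        # Training-class digit on the left, blank panel on the right.
        "inlier": concat_horizontal(train_digit, blank_image(height, width)),
        # Inlier structure on the left, a digit from another class on the right.
        "type_1": concat_horizontal(train_digit, distractor_right),
        # Digits from other classes in both panels.
        "type_2": concat_horizontal(distractor_left, distractor_right),
    }


# Expected presence of relevance as (left panel, right panel).
GROUND_TRUTH = {
    "inlier": (False, False),
    "type_1": (False, True),
    "type_2": (True, True),
}
```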
As a baseline, we compare the relevance attribution with the maximum likelihood estimate of a multivariate normal density (MVN) of the training data with no off-diagonal covariances [35,5]. The maximum likelihood estimate for the variances over the $n$ training points is given by

$$\sigma_i^2 = \frac{1}{n} \sum_{k=1}^{n} (x_{k,i} - \mu_i)^2 + \lambda,$$

with $\mu$ being the mean of the training data and a regularization term $\lambda \in \mathbb{R}$. The negative log-likelihood of the MVN, although not a measure of outlierness in the strict sense of Definition 2, nevertheless provides a natural decomposition on input features as

$$-\log p(x) = \frac{1}{2} \sum_i \log(2\pi\sigma_i^2) + \sum_i \frac{(x_i - \mu_i)^2}{2\sigma_i^2},$$

where the first term is a non-decomposable zero-order term and the terms of the sum determine the relevance of input features. Figure 6 collects the outcomes of the experiment. In the bottom plots, every sample is represented by one dot.
On the x-axis, we plot the amount of relevance that falls in the right side of the image, $H_{\text{right}} = \sum_{i \in \text{right}} H_i$. On the y-axis, the relevance of the left image, $H_{\text{left}}$, is plotted. One-class DTD and MVN are both able to explain inliers and type I outliers reliably, although they both attribute a small amount of outlierness to the inlier data points. The effect of growing outlierness on the right side leading to more relevance in that area can still be observed by looking at the blue dots in Figure 6 (right). One-class SVM is better able to explain the outlierness of type II outliers, because it reacts equally strongly to permutations over the input dimensions. The MVN, in contrast, largely ignores outlying patterns in the left panel and thus produces a partial explanation. The incorrect behavior of MVN explanations stems from the fact that the MVN negative log-likelihood is not a true measure of outlierness in the sense of Definition 2, as it distorts the natural metric of the input space.
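The MVN baseline and the left/right aggregation of relevance can be sketched as below. The sketch assumes that the regularizer λ is added to each per-feature variance and that the relevance of feature i is the quadratic term of the diagonal Gaussian negative log-likelihood. The function names are hypothetical, and images are assumed to be flattened row by row.

```python
import math


def fit_diagonal_mvn(data, lam):
    """Per-feature mean and regularized maximum likelihood variance (assumption: var + lam)."""
    n, dim = len(data), len(data[0])
    mean = [sum(row[i] for row in data) / n for i in range(dim)]
    var = [sum((row[i] - mean[i]) ** 2 for row in data) / n + lam for i in range(dim)]
    return mean, var


def nll_decomposition(x, mean, var):
    """Split the negative log-likelihood into a zero-order term and per-feature relevances."""
    zero_order = 0.5 * sum(math.log(2 * math.pi * v) for v in var)
    relevance = [(xi - m) ** 2 / (2 * v) for xi, m, v in zip(x, mean, var)]
    return zero_order, relevance


def side_relevance(relevance, width):
    """Sum the relevance of a row-major flattened image over its left and right halves."""
    half = width // 2
    h_left = sum(r for idx, r in enumerate(relevance) if idx % width < half)
    h_right = sum(r for idx, r in enumerate(relevance) if idx % width >= half)
    return h_left, h_right
```

Dividing each squared deviation by its variance is what rescales the input dimensions differently, which is the distortion of the input metric discussed above.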

Internal Validation
In the following experiments, we consider the output of the one-class SVM as a ground-truth model for outlierness. This allows us to perform validation on datasets for which we do not have a priori knowledge of which features are causing outlierness.
The deep Taylor decomposition method will be compared to a number of other explanation techniques. Sensitivity analysis uses the same trained model but assigns relevance based on the locally evaluated gradient. Other analyses assign relevance based on a simple decomposition of the distance to data, or on the output of some image filter.

For evaluation of explanation quality, we consider the pixel-flipping approach described by Samek et al. [39] in the context of DNN classifiers. The approach consists of gradually destroying pixels from most to least relevant, and measuring how quickly the prediction score decreases.
In the context of outlier detection, however, destroying a pixel does not reduce evidence for outlierness and might even create more of it. Thus, the original pixel-flipping method must be adapted to the specific outlier detection problem. Our approach will consist of performing the flipping procedure not in the pixel space directly, but in some feature space containing all component-wise differences to support vectors. The one-class SVM can be rewritten in terms of elements of this feature space as $g(\Psi) = \sum_j \alpha_j k(\Psi_{:,j})$, and similarly the outlier function can be written as $o(\Psi)$.
Our modified procedure reduces the dimensionality of the data one dimension at a time. Once dimensionality 0 is reached, the pattern is necessarily an inlier, because no deviation from the support vectors exists anymore. This method makes the removal of outlierness computationally feasible. The ordering of variables is inferred from the relevance scores assigned to each input dimension (cf. Appendix A.4 for pseudo-code). Also note that we seek to provide a global explanation of the outlierness of pattern x. Except for trivial cases (with only one support vector), no single pattern in X can represent a minimizer of all detectors that the model is composed of.

We train a one-class SVM on the CIFAR-10 data set. The data set consists of 50,000 images in the space $\mathbb{R}^{32 \times 32 \times 3}$ with values ranging from 0 to 255. The images are divided into 625 patches of adjacent pixels each, where a patch is of dimensionality 7 × 7 × 3. This leads to more than 31 million training vectors $x_i \in \mathbb{N}^{147}$. To speed up training, we randomly select 30,000 patches from the data set and train a one-class SVM on these patches. The outlier scores are summed over all patches of an image, as described in Section 7, to get a measure of outlierness for the whole image. Figure 7 shows an example image and heatmaps from all attribution methods.
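The modified flipping procedure in the feature space of differences to support vectors can be illustrated with a small sketch. It is not the pseudo-code of Appendix A.4. The Gaussian kernel parameterization, the choice of −log g as the outlier function, and all function names are assumptions made for illustration only.

```python
import math


def gaussian_kernel(diff, sigma):
    """Gaussian kernel evaluated on a difference vector (assumed parameterization)."""
    return math.exp(-sum(d * d for d in diff) / (2 * sigma ** 2))


def svm_output(psi, alphas, sigma):
    """g(Psi) = sum_j alpha_j k(Psi[:, j]), where psi[i][j] = x_i - u_j[i]."""
    columns = ([row[j] for row in psi] for j in range(len(alphas)))
    return sum(a * gaussian_kernel(col, sigma) for a, col in zip(alphas, columns))


def outlier_score(psi, alphas, sigma):
    """Assumed outlier function: negative log of the one-class SVM output."""
    return -math.log(svm_output(psi, alphas, sigma))


def flipping_curve(x, support_vectors, alphas, relevance, sigma):
    """Remove input dimensions from most to least relevant and track the outlier score."""
    psi = [[x[i] - u[i] for u in support_vectors] for i in range(len(x))]
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    curve = [outlier_score(psi, alphas, sigma)]
    for i in order:
        # Flipping dimension i removes every deviation from the support vectors along it.
        psi[i] = [0.0] * len(support_vectors)
        curve.append(outlier_score(psi, alphas, sigma))
    return curve
```

With coefficients that sum to one, the curve ends at zero once all dimensions are removed. A good explanation is one whose ordering makes the curve drop quickly.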
We consider as baselines for explanation sensitivity analysis (SA) as defined in Section 4, the squared difference to the nearest neighbor (NN), the squared difference to the expected pixel value (EV) and the Sobel filter. The squared difference to the nearest neighbor support vector (NN), with $u_{\mathrm{NN}} = \arg\min_{u_j} \|x - u_j\|$, is similar to DTD but performs a min-take-all redistribution instead of min-take-most. This yields discontinuities in the explanation under perturbations of the inputs. This issue is reduced in the sequential model, however, due to the overlap of patches. The EV baseline is inferred from the support vectors by $R_{\mathrm{EV}} = (x - \bar{u})^2$, where $\bar{u} = \sum_{j=1}^{m} \alpha_j u_j$. Finally, a random pixel ordering is considered as a completely uninformed baseline method.

Figure 7 shows the results of the pixel-flipping experiment for all methods and several kernels. Deep Taylor decomposition is indeed superior for all considered kernels. Sensitivity analysis can be interpreted as the explanation of local variation of the detection function in the vicinity of the pattern in question. We can see that the local gradient is not as well suited for explanation as DTD. In particular, sensitivity analysis is unable to detect the truly relevant pixels that cause the outlier score to be large. Instead, it assigns the most relevance to pixels to which the model is locally sensitive (cf. Samek et al. [39]). As mentioned before, nearest neighbor support vectors provide an explanation that is discontinuous under perturbations of the inputs. Explanations from the NN procedure are more complete than those from SA. As we see in the right plots of Figure 7, however, the pixel ordering still diverges from the explanation that is produced by DTD. The EV baseline corresponds roughly to a squared difference to the data mean, and is even more global than DTD. We see that its performance in the pixel-flipping experiment is still better than Sobel and random flipping. Nevertheless, the remaining baselines (EV, Sobel, Random) fail to produce a competitive explanation.
Results for more kernels can be found in Appendix C.
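For concreteness, the NN and EV baselines described in this section can be written as a short sketch. The function names are hypothetical, and the nearest support vector is assumed to be selected by Euclidean distance.

```python
import math


def nn_relevance(x, support_vectors):
    """Squared difference to the nearest support vector (min-take-all redistribution)."""
    u_nn = min(support_vectors, key=lambda u: math.dist(x, u))
    return [(xi - ui) ** 2 for xi, ui in zip(x, u_nn)]


def ev_relevance(x, support_vectors, alphas):
    """Squared difference to the alpha-weighted mean of the support vectors."""
    u_bar = [
        sum(a * u[i] for a, u in zip(alphas, support_vectors))
        for i in range(len(x))
    ]
    return [(xi - ui) ** 2 for xi, ui in zip(x, u_bar)]
```

Since `nn_relevance` depends on a single selected support vector, a small change of `x` that switches the nearest neighbor changes the whole explanation at once, which is the discontinuity noted above.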

Intrusion detection
One-class SVMs have been applied to network intrusion detection and malware detection [17,54,38,53,13]. Having interpretable model outputs can help to identify the intent or the method of an attack. We take up this idea in a simpler setting where no domain knowledge is necessary and where it is arguably possible to detect outlierness on a symbolic level, which can be compared to an attack. In particular, we train a one-class SVM on the personal attacks corpus from the Detox data set [55]. In this data set, documents are labeled by up to ten annotators as either 0 (neutral) or 1 (personal attack).
A dictionary is constructed from stemmed terms that appear in at least five documents and binary features are extracted as a vectorial representation of documents. No stop words are removed and no document frequencies are used for feature extraction. The model is trained on samples with label mean 0 with a Gaussian kernel and ν = 0.3. Parameter σ is set to 10, which is a soft assumption of an expected difference in 10 terms for similar documents.
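The feature extraction described above can be sketched as follows. Stemming and tokenization are omitted here (documents are assumed to be given as lists of already stemmed terms), and the function names are hypothetical.

```python
from collections import Counter


def build_dictionary(tokenized_docs, min_docs=5):
    """Keep terms that appear in at least min_docs documents; no stop-word removal."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, count in doc_freq.items() if count >= min_docs)


def binary_features(tokens, dictionary):
    """Binary presence vector of a document over the dictionary terms."""
    present = set(tokens)
    return [1 if term in present else 0 for term in dictionary]
```

With binary features, the squared Euclidean distance between two documents equals the number of terms in which they differ, which is what makes a bandwidth expressed in numbers of terms interpretable.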
Interpretable outputs are produced in terms of term relevance scores [1,23]. Figure 8 shows the explanation for two example documents. As one would expect, common terms have no or low relevance in the document, and terms that would not be expected in a neutral message receive more relevance. Due to the RBF property, relevance will also be assigned to terms that do not appear in a document. These terms can be interpreted as being benign and expected to appear in a typical example. This quantity can be of interest in text analysis and could not be derived from, e.g., a linear model. Note the ironic use of the word fantastic. The term receives the most relevance, simply because it is not used frequently in neutral messages. The interpretation that the term is detected because of its ironic use cannot be justified for such a symbolic model. The property of assigning high relevance to rare events still holds. Rare events, here, are the presence of terms that appear rarely or the absence of terms that usually appear. Outlierness continues to grow if more rare events appear [40].

Conclusion
In this paper, we have addressed the problem of anomaly explanation. Technically, we have proposed a deep Taylor decomposition of the one-class SVM. It is applicable to a number of commonly used kernels, and produces explanations in terms of support vectors or input variables. Our empirical analysis has demonstrated that the proposed method is able to reliably explain a wide range of outliers, and that these explanations are more robust than those obtained by sensitivity analysis or nearest neighbor.
A crucial aspect of our explanation method is that it required us to elicit a natural neural network architecture for the problem at hand. Achieving this in the context of the one-class SVM model has highlighted the asymmetry between the problems of inlier and outlier detection, where the former can be modeled as a max-pooling over similarities, and the latter is better modeled as a min-pooling over distances. This novel insight into the structure of the outlier detection problem might inspire the design of deeper and more structured outlier detection models.
Next, we show the convergence for exponential kernels. Condition 1 follows from $g(x)$ being upper bounded by 1 for the exponential kernels. The proof of the second condition follows below.

We show that the neural network from Section 6.1 implements the measure of outlierness for the t-Student kernel that is proposed in Section 3.2. Let $k$ be the t-Student kernel and $h_j = \frac{1}{\alpha_j}(a + \|x - u_j\|^q)$, then .
Next, we show the equivalence for exponential kernels. Let therefore $k$ be the exponential kernel and $h_j = -\log(\alpha_j) +$ [...] gives the decomposition [...]

Of course, the root $\tilde{z}_j = r z_j$ only exists if $D_j$ and $C_j$ have opposite sign. When $D_j$ is positive, $R_j$ has no root and the global minimum at $x_j = u_j$ serves as the reference for decomposition. This decomposition will not be conservative, because the term $R_j(\tilde{x}_j) = R_j(u_j) = D_j$ will be strictly positive. Collecting all cases, the decomposition of $R_j$ on the input will be given by [...] where $D_j^+ = \max(0, D_j)$. Since $R = \sum_j R_j$ (cf. [2] for the details), the full input relevance can be written as [...]

Appendix C. Quantitative results for several more kernels

In this section we collect the results of the selectivity experiments from Section 8.2 for several more kernels. In particular, for both exponential and t-Student kernels, we show results for q = 1, 2, 4. Note that for q = 4 or larger, the kernel matrix tends to be singular. We observe a clear tendency of NN and DTD converging to the same performance for the exponential kernels. Overall, DTD performs most reliably over all cases that we consider here. Figure C.9 shows the results on the CIFAR-10 dataset as described in the main paper.