A Unifying Review of Deep and Shallow Anomaly Detection

Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic 'shallow' and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability techniques, and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in anomaly detection.


I. INTRODUCTION
An anomaly is an observation that deviates considerably from some concept of normality. Also known as an outlier or novelty, such an observation may be termed unusual, irregular, atypical, inconsistent, unexpected, rare, erroneous, faulty, fraudulent, malicious, unnatural, or simply strange, depending on the situation. Anomaly detection (or outlier detection or novelty detection) is the research area that studies the detection of such anomalous observations through methods, models, and algorithms based on data. Classic approaches to anomaly detection include Principal Component Analysis (PCA) [1]-[5], the One-Class Support Vector Machine (OC-SVM) [6], Support Vector Data Description (SVDD) [7], nearest neighbor algorithms [8]-[10], and Kernel Density Estimation (KDE) [11], [12].
What the above methods have in common is that they are all unsupervised, which constitutes the predominant approach to anomaly detection. This is because labeled anomalous data is often non-existent, and when available, it is usually insufficient to represent the diversity of all potential anomalies, which renders a supervised approach infeasible or ineffective. Instead, a central idea in anomaly detection is to learn a model of normality from normal data in an unsupervised manner, so that anomalies become detectable through deviations from such a model.
The study of anomaly detection has a long history and spans multiple disciplines including engineering, machine learning, data mining, and statistics. While the first formal definitions of so-called 'discordant observations' date back to the 19th century [13], the problem of anomaly detection has likely been studied informally even earlier, since anomalies are phenomena that naturally occur in diverse academic disciplines such as medicine and the natural sciences. Anomalous data may be useless, for example when caused by measurement errors, or may be extremely informative and hold the key to new insights, such as very long surviving cancer patients. Kuhn [14] claims that persistent anomalies drive scientific revolutions (cf., section VI 'Anomaly and the Emergence of Scientific Discoveries' in [14]).
Deep learning [81]-[83] follows the idea of learning effective representations from the data itself by training flexible, multi-layered ('deep') neural networks and has greatly improved the state of the art in many applications that involve complex data types. Deep neural networks provide the most successful solutions for many tasks in domains such as computer vision [84]-[93], speech recognition [94]-[103], and natural language processing [104]-[113], and have contributed to the sciences [114]-[123]. Methods based on deep neural networks are able to exploit the hierarchical or latent structure that is often inherent to data through their multi-layered, distributed feature representations. Advances in parallel computation, stochastic gradient descent optimization, and automated differentiation make it possible to apply deep learning at scale using large datasets.
Recently, there has been a rapidly growing interest in developing deep learning approaches for anomaly detection. This is motivated by a lack of effective methods for anomaly detection tasks which involve complex data, for instance cancer detection from multi-gigapixel whole-slide images in histopathology. As in other adoptions of deep learning, the ambition of deep anomaly detection is to mitigate the burden of manual feature engineering and to enable effective as well as scalable solutions. However, unlike supervised deep learning, it is less clear what characterizes an effective learning objective for anomaly detection and which signals should be used for learning a representation due to the mostly unsupervised nature of the problem.
Due to the long history and diversity of anomaly detection research, there exists a wealth of review and survey literature [157]-[176] as well as books [177]-[179] on the topic. Some very recent surveys focus specifically on deep anomaly detection [180]-[182], but these works consider the deep learning approaches in isolation. An integrated treatment of deep learning methods in the overall context of anomaly detection research, in particular its kernel-based learning part [6], [183], [184], is still missing.
In this review paper, our aim is to fill exactly this gap by presenting a unifying view that connects traditional shallow and novel deep learning approaches. We will summarize recent exciting developments, present different classes of anomaly detection methods, provide theoretical insights, and highlight current best practices when applying anomaly detection. Note, finally, that we do not attempt an encyclopedic treatment of all available anomaly detection literature; rather, we present a slightly biased point of view that illustrates the main ideas (often drawing from the work of the authors) and provides ample references to related work for further reading.

II. AN INTRODUCTION TO ANOMALY DETECTION
A. Why Should We Care About Anomaly Detection?
Anomaly detection is part of our daily lives. Operating mostly unnoticed, anomaly detection algorithms are continuously monitoring our credit card payments, our login behaviors, and companies' communication networks. If they detect an abnormally expensive purchase made on our credit card, several unsuccessful login attempts made from an alien device in a distant country, or unusual ftp requests made to our computer, they will issue an alarm. While warnings such as "someone is trying to login to your account" can be annoying when you are on a business trip abroad and just want to check your e-mails from the hotel computer, the ability to detect such anomalous patterns is vital for a large number of today's applications and services, and even small improvements in anomaly detection can lead to immense monetary savings.
In addition, the ability to detect anomalies is also considered an important ingredient in ensuring fail-safe and robust design of deep learning-based systems, e.g. in medical applications or autonomous driving. Various international standardization initiatives have been launched towards this goal (e.g., ITU/WHO FG-AI4H, ISO/IEC CD TR 24029-1, or IEEE P7009).
Despite its importance, discovering a reliable distinction between 'normal' and 'anomalous' events is a challenging task. First, the variability within the normal data can be very large, resulting in misclassifying normal samples as anomalous (type I error) or not identifying the anomalous ones (type II error). Especially in biological or biomedical datasets, the variability within the normal data (e.g., person-to-person variability) is often as large as or even larger than the distance to anomalous samples (e.g., patients). Preprocessing, normalization, and feature selection are potential means to reduce this variability and improve detectability. Second, anomalous events are often very rare, which results in highly imbalanced training datasets. Even worse, in most cases the datasets are unlabeled, so that it remains unclear which data points are regarded as anomalies and why. Hence, the anomaly detection problem reduces to an unsupervised learning task with the goal of learning a valid model of the majority of data points. Finally, anomalies themselves can be very diverse, so that it becomes difficult to learn a complete model of them. Here, too, the solution is to learn a model of the normal samples and treat deviations from it as anomalies. However, this approach can be problematic if the distribution of the (normal) data changes (non-stationarity), either intrinsically or due to environmental changes (e.g., lighting conditions, recording devices from different manufacturers, etc.).
As exemplified and discussed above, we note that anomaly detection has a broad practical relevance and impact. Moreover, (accidentally) detecting the unknown unknowns [185] has always been a strong driving force in the sciences. If applied to these disciplines, anomaly detection can help us to identify new, previously unknown patterns in data, which can lead to novel scientific insights and hypotheses.

B. A Formal Definition of the Problem
In the following, we formally introduce the anomaly detection problem. We first define in probabilistic terms what an anomaly is, explain what types of anomalies there are, and delineate the subtle differences between an anomaly, an outlier, and a novelty. Finally, we present a fundamental principle in anomaly detection, the so-called concentration assumption, and give a theoretical problem formulation that corresponds to density level set estimation.
1) What is an Anomaly?: We opened this review with the following definition: An anomaly is an observation that deviates considerably from some concept of normality.
Let X ⊆ R^D be the data space given by some task or application. We define a concept of normality as the distribution P+ on X that captures the ground-truth law of normal behavior in a given task or application. An observation that deviates considerably from such a law of normality, an anomaly, is then a data point x ∈ X (or set of points) that lies in a low probability region under P+. Assuming that P+ has a corresponding probability density function (pdf) p+(x), we can define the set of anomalies as

A = {x ∈ X | p+(x) ≤ τ}, τ ≥ 0,    (1)

where τ is some threshold such that the probability of A under P+ is 'sufficiently small', which we will discuss in further detail below.
2) Types of Anomalies: Various types of anomalies have been identified in the literature [161], [179]. These include point anomalies, conditional or contextual anomalies [169], [171], [190]-[194], and group or collective anomalies [146], [192], [195]-[198]. We extend these three established types by further adding low-level, sensory anomalies and high-level, semantic anomalies [199], a distinction that is particularly relevant for choosing between deep and shallow feature maps.

Fig. 2. A group anomaly can be a cluster of anomalies or some series of related points that is anomalous under the joint series distribution (contextual group anomaly). Note that both contextual anomalies shown have values that fall into the global (time-integrated) range of normal values. A low-level, sensory anomaly deviates in the low-level features, here a cut in the fabric texture of a carpet [189]. A semantic anomaly deviates in high-level factors of variation or semantic concepts, here a dog among the normal class of cats. Note that the white cat is more similar to the dog than to the other cats in low-level pixel space.
A point anomaly is an individual anomalous data point x ∈ A, for example an illegal transaction in fraud detection or an image of a damaged product in manufacturing. This is arguably the most commonly studied type in anomaly detection research.
A conditional or contextual anomaly is a data instance that is anomalous in a specific context such as time, space, or the connections in a graph. A price of $1 per Apple Inc. stock might have been normal before 1997, but as of today (2020) would be an anomaly. A mean daily temperature below freezing point would be an anomaly in the Amazon rainforest, but not in the Antarctic desert. For this anomaly type, the normal law P + is more precisely a conditional distribution P + ≡ P + X|T with conditional pdf p + (x | t) that depends on some contextual variable T . Time series anomalies [169], [194], [200]- [203] are the most prominent example of contextual anomalies. Other examples include spatial [204], [205], spatio-temporal [191], or graph-based [171], [206], [207] anomalies.
A group or collective anomaly is a set of related or dependent points {x_j ∈ X | j ∈ J} that is anomalous, where J ⊆ N is an index set that captures some relation or dependency. For instance, a cluster of similar or related network attacks in cybersecurity forms a collective anomaly [18], [207], [208]. Often, collective anomalies are also contextual, such as anomalous time (sub-)series or biological (sub-)sequences, for example, some series or sequence {x_t, . . . , x_{t+s−1}} of length s ∈ N. It is important to note that although each individual point x_j in such a series or sequence might be normal under the time-integrated marginal p+(x) = ∫ p+(x, t) dt or under the sequence-integrated, time-conditional marginal p+(x | t), the full series or sequence {x_t, . . . , x_{t+s−1}} can be anomalous under the joint conditional density p+(x_t, . . . , x_{t+s−1} | t), which properly describes the distribution of the collective series or sequences.
In the wake of deep learning, the distinction between low-level, sensory anomalies and high-level, semantic anomalies [199] has become important. Low and high here refer to the level in the feature hierarchy of some hierarchical distribution, for instance, the hierarchy from pixel-level features such as edges and textures to high-level objects and scenes in images, or the hierarchy from individual characters and words to semantic concepts and topics in texts. It is commonly assumed that data with such a hierarchical structure is generated from some semantic latent variables Z and Y that describe higher-level factors of variation Z (e.g., the shape, size, or orientation of an object) and concepts Y (e.g., the object class identity) [80], [209]. We can express this via a normal law with conditional pdf p+(x | z, y), where we usually assume Z to be continuous and Y to be discrete. Low-level anomalies can be texture defects or artifacts in images, for example, or character typos in words. In comparison, semantic anomalies can be images of objects from non-normal classes [199], for instance, or misposted reviews and news articles [139]. Note that semantic anomalies can be very close to normal instances in the raw feature space X. For example, a dog with a fur texture and color similar to that of some cat can be closer to that cat in raw pixel space than various cat breeds are among themselves (cf., Fig. 2). Similarly, low-level background statistics can also result in a high similarity in raw pixel space even when objects in the foreground are completely different [199]. Detecting semantic anomalies is thus innately tied to finding a semantic feature representation (e.g., extracting the semantic features of cats such as whiskers, slit pupils, triangular snout, etc.), which is an inherently difficult task in an unsupervised setting [209].
3) Anomaly, Outlier, or Novelty?: Some works make a more subtle distinction between what is an anomaly, an outlier, or a novelty. While all three refer to instances from low probability regions under P+ (i.e., are elements of A), an anomaly is often characterized as an instance from a distinct distribution other than P+ (e.g., when anomalous points are generated by a different process than the normal points), an outlier as a rare or low-probability instance from P+, and a novelty as an instance from some new region or mode of an evolving, non-stationary P+. Under the distribution P+ of cats, for instance, a dog would be an anomaly, a rare breed of cats such as the LaPerm would be an outlier, and a new breed of cats would be a novelty. Such a distinction between anomaly, outlier, and novelty may reflect slightly different objectives in an application: whereas anomalies are often the data points of interest (e.g., a long-term survivor of a disease), outliers are frequently regarded as 'noise' or 'measurement error' that should be removed in a data preprocessing step ('outlier removal'), and novelties are new observations that require models to be updated to the 'new normal'. The methods for detecting points from low probability regions, whether termed anomaly, outlier, or novelty, are usually the same, however. For this reason, we do not make such a distinction here and refer to any instance x ∈ A as an anomaly.
4) The Concentration Assumption: In general, the data space X ⊆ R^D can be unbounded. A fundamental assumption in anomaly detection, however, is that the region where the normal data lives can be bounded. That is, there exists some threshold τ ≥ 0 such that

X ∖ A = {x ∈ X | p+(x) > τ}

is non-empty and small (typically in the Lebesgue-measure sense). This is known as the so-called concentration or cluster assumption [210]-[212]. Note that the concentration assumption does not imply that the full support supp(p+) = {x ∈ X | p+(x) > 0} of the normal law P+ must be bounded; only that some high-density subset of the support is bounded. A standard univariate Gaussian is supported on the full real axis, for example, but approximately 95% of its probability mass is covered by the bounded interval [−1.96, 1.96]. In contrast, the set of anomalies A need not be concentrated and can be unbounded.
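For a quick numerical check of this Gaussian example, the following sketch (using SciPy; the numbers are the ones quoted above) computes the smallest interval covering 95% of the probability mass:

```python
from scipy.stats import norm

alpha = 0.05
# For a unimodal, symmetric density the smallest set covering
# 1 - alpha of the mass is the central interval.
lo, hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
print(round(lo, 2), round(hi, 2))        # -1.96 1.96
print(norm.cdf(hi) - norm.cdf(lo))       # 0.95
# The Gaussian's support is all of R, but this high-density subset
# is bounded -- exactly what the concentration assumption requires.
```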

5) Density Level Set Estimation:
A law of normality P + is only known in a few application settings, such as for certain laws of physics. Sometimes a concept of normality might also be user-specified (as in juridical laws). In most cases, however, the ground-truth law of normality P + is unknown because the underlying process is too complex. For this reason, we must estimate P + from data.
Let P be the ground-truth data-generating distribution on data space X ⊆ R D with corresponding density p(x). For now, we assume that this data-generating distribution exactly matches the normal data distribution, i.e. P ≡ P + and p ≡ p + . This assumption is often invalid in practice, of course, as the data-generating process might be subject to noise or contamination as we will discuss in the next section.
Given data points x_1, . . . , x_n ∈ X generated by P (usually assumed to be drawn from i.i.d. random variables following P), the goal of anomaly detection is to learn a model that allows us to predict whether a new test instance x̃ ∈ X (or set of test instances) is an anomaly or not, i.e., whether x̃ ∈ A. Thus the anomaly detection objective is to (explicitly or implicitly) estimate the low-density regions (or equivalently the high-density regions) in data space X under the normal law P+. We can formally express this objective as the problem of density level set estimation [213]-[216], which is an instance of minimum volume set estimation [217]-[219] for the special case of density-based sets. The density level set of P for some threshold τ ≥ 0 is given by C = {x ∈ X | p(x) > τ}. For some fixed level α ∈ [0, 1], the α-density level set Cα of distribution P is then defined as the smallest density level set C that has a probability of at least 1 − α under P, i.e.,

Cα = argmin_C { λ(C) | P(C) ≥ 1 − α }    (2)
   = {x ∈ X | p(x) > τα},    (3)
where τα ≥ 0 denotes the corresponding threshold and λ typically is the Lebesgue measure, the standard measure of volume in Euclidean space. The extreme cases of α = 0 and α → 1 result in the full support C0 = {x ∈ X | p(x) > 0} = supp(p) and the most likely modes argmax_x p(x) of P, respectively. If the aforementioned concentration assumption holds, there always exists some level α such that a corresponding level set Cα exists and can be bounded. Fig. 3 illustrates some density level sets for the case that P is the familiar standard Gaussian distribution. Given a level set Cα, we can define a corresponding threshold anomaly detector cα : X → {±1} as

cα(x) = +1 if x ∈ Cα, and cα(x) = −1 if x ∉ Cα.    (4)

6) Density Estimation for Level Set Estimation: An obvious approach to density level set estimation is through density estimation. Given some estimated density model p̂(x) = p̂(x; x_1, . . . , x_n) ≈ p(x) and some target level α ∈ [0, 1], one can estimate a corresponding threshold τ̂α via the empirical p-value function:

τ̂α = inf{ τ ≥ 0 | (1/n) Σ_{i=1}^{n} 1_{[0,τ]}(p̂(x_i)) ≥ α },    (5)

where 1_A(·) denotes the indicator function for some set A; that is, τ̂α is the empirical α-quantile of the estimated training densities. Using τ̂α and p̂(x) in (3) yields the plug-in density level set estimator Ĉα, which in turn can be used in (4) to obtain the plug-in threshold detector ĉα(x). Note, however, that density estimation is generally the most costly approach to density level set estimation (in terms of samples required), since estimating the full density is equivalent to first estimating the entire family of level sets {Cα : α ∈ [0, 1]}, from which the desired level set for some fixed α ∈ [0, 1] is then selected [220], [221]. If there are insufficient samples, this density estimate can be biased. This has also motivated the development of one-class classification methods that aim to estimate subfamilies [221] or single level sets [6], [7], [183], [222] directly, which we will explain in more detail in section IV.

7) Threshold vs. Score: The previous approach to level set estimation through density estimation is more costly, yet generally results in a more informative model that can rank inliers and anomalies according to their estimated density. In comparison, a pure threshold detector as in (4) only yields a binary prediction. Menon and Williamson [223] propose a compromise by learning a density outside the level set boundary. Many anomaly detection methods also target some strictly increasing transformation T : [0, ∞) → R of the density for estimating a model (e.g., log-likelihood instead of likelihood). The resulting target T(p(x)) is often no longer a proper density but still preserves the density order [224], [225]. An anomaly score s : X → R can then be defined by applying an additional order-reversing transformation, for example s(x) = −T(p(x)) (e.g., negative log-likelihood), so that high scores reflect low density values and vice versa. Having such a score that indicates the 'degree of anomalousness' is important in many anomaly detection applications. As with the density in (5), of course, we can always derive a threshold from the empirical distribution of anomaly scores if needed.
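As a minimal illustration of the plug-in approach, the sketch below fits a KDE model and derives the threshold as the empirical α-quantile of the training densities, mirroring (4) and (5); the data, bandwidth, and level α are hypothetical choices:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Plug-in level set estimation sketch: fit a density model, set the
# threshold via the empirical alpha-quantile of training densities
# (cf. (5)), and threshold test points (cf. (4)).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))        # assumed normal data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)
alpha = 0.05
log_tau = np.quantile(kde.score_samples(X_train), alpha)  # log-threshold

def c_alpha(X):
    """Plug-in threshold detector: +1 = normal, -1 = anomaly."""
    return np.where(kde.score_samples(X) > log_tau, +1, -1)

def anomaly_score(X):
    """Order-reversing transform of the log-density: NLL score."""
    return -kde.score_samples(X)

print(c_alpha(np.array([[0.0, 0.0], [5.0, 5.0]])))  # likely [+1, -1]
```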
8) Selecting a Level α: As we will show, there are many degrees of freedom when attacking the anomaly detection problem outlined in this section, which inevitably requires making various modeling assumptions and choices. Setting the level α is one of these choices and depends on the specific application. As the value of α increases, the anomaly detector focuses only on the most likely regions of P. Such a detector can be desirable in applications where missed anomalies are costly (e.g., in medical diagnosis or fraud detection). On the other hand, a large α will result in high false alarm rates, which can be undesirable in online settings where lots of data is generated (e.g., in monitoring tasks). We will provide practical guidelines for selecting α in section VIII. Choosing α also involves further assumptions about the data-generating process P, which we have assumed here to match the normal data distribution P+. In the next section, we discuss the data settings that can occur in anomaly detection and that may alter this assumption.

C. Dataset Settings and Data Properties
The dataset settings and data properties that occur in real-world anomaly detection problems can be diverse. We here characterize these settings, which range from the most common unsupervised setting to semi-supervised and supervised settings, and list further data properties that are relevant for modeling an anomaly detection problem. Before we elaborate on these, we first observe that the assumptions made (often implicitly) about the distribution of anomalies are also crucial to the problem.
1) A Distribution of Anomalies?: Let P− denote the ground-truth anomaly distribution, also on X ⊆ R^D. As mentioned above, the common concentration assumption implies that some high-density regions of the normal data distribution are concentrated, whereas anomalies are assumed to be not concentrated [210], [211]. This assumption may be modeled by an anomaly distribution P− that follows a uniform distribution over the (bounded) data space X [183]. Some well-known unsupervised methods such as KDE [12] or the OC-SVM [6], for example, implicitly make the assumption that P− follows a uniform distribution, which can be interpreted as a default uninformative prior on the anomalous distribution [211]. This prior assumes that there are no anomalous modes and that anomalies are equally likely to occur over the valid data space X. Semi-supervised or supervised anomaly detection approaches often depart from this uninformed prior and try to make a more informed a-priori assumption about the anomalous distribution P− [211]. If faithful to P−, such a model based on a more informed anomaly prior can achieve better detection performance. Modeling anomalous modes can also be beneficial in certain applications, for example, for typical failure modes in industrial machines or known disorders in medical diagnosis. We remark that these prior assumptions about the anomaly distribution P− are often expressed only implicitly in the literature, though such assumptions are critical to an anomaly detection model.
2) The Unsupervised Setting: The unsupervised anomaly detection setting is the case in which only unlabeled data

x_1, . . . , x_n ∈ X    (6)

is available for training a model. This setting is arguably the most common in anomaly detection [159], [161], [165], [168]. We will usually assume that the data points have been drawn in an i.i.d. fashion from the data-generating distribution P. For simplicity, we have so far assumed that the data-generating distribution is the same as the normal data distribution, P ≡ P+. This is often summarized by the statement that the training data is 'clean'. In practice, however, the data-generating distribution P might be subject to noise and contamination [183]. Noise, in the classical sense, is some inherent source of randomness ε that is added to the actual signal in the data-generating process; that is, samples from P have the form x + ε, where x ∼ P+. Noise might be present due to irreducible measurement uncertainties in an application, for example. The greater the noise, the harder it becomes to accurately estimate the ground-truth level sets of P+, since characteristic normal features get obfuscated [165]. This is because added noise generally expands the regions covered by the observed data in input space X. A standard assumption about noise is that it is symmetric and unbiased, i.e., E[ε] = 0.
In addition to noise, the contamination or pollution of the unlabeled data with undetected anomalies is another critical source of disturbance. For instance, some unnoticed anomalous errors of a machine might have already occurred during the data collection process. In this case, the data-generating distribution P is a mixture of the normal data distribution and the anomaly distribution, i.e., P ≡ (1 − η)P+ + η P− for some contamination or pollution rate η ∈ (0, 1). The greater the contamination, the more likely it is that the normal data decision boundary will be distorted by including the anomalous points.
In summary, a more general and realistic assumption is that samples from the data-generating distribution P have the form x + ε, where x ∼ (1 − η)P+ + η P− and ε is random noise. Assumptions on both the noise distribution ε and the contamination rate η are crucial for modeling a specific anomaly detection problem. Robust methods [5], [126], [226] specifically aim to account for these sources of disturbance. Note also that by increasing the level α in the density level set definition above, a corresponding model generally becomes more robust, since the target decision boundary becomes tighter and excludes the contamination.
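The following sketch makes this data model concrete by sampling from x + ε with x ∼ (1 − η)P+ + η P−; the specific manifold, contamination rate, and noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, sigma = 1000, 0.05, 0.1   # hypothetical contamination / noise levels

# Normal law P+: points on a circle (a stand-in for some manifold).
angles = rng.uniform(0.0, 2 * np.pi, n)
x_normal = np.c_[np.cos(angles), np.sin(angles)]

# Anomaly law P-: uniform over a bounded data space [-2, 2]^2.
x_anomalous = rng.uniform(-2.0, 2.0, (n, 2))

# Mixture (1 - eta) P+ + eta P-, then additive noise eps.
is_anomaly = rng.random(n) < eta
x = np.where(is_anomaly[:, None], x_anomalous, x_normal)
x_observed = x + rng.normal(0.0, sigma, (n, 2))
```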
3) The Semi-Supervised Setting: The semi-supervised anomaly detection setting is the case in which both unlabeled and labeled data are available for training a model, that is,

x_1, . . . , x_n ∈ X and (x̃_1, ỹ_1), . . . , (x̃_m, ỹ_m) ∈ X × Y,    (7)

with Y = {±1}, where we denote ỹ = +1 for normal and ỹ = −1 for anomalous points respectively. Usually, we have m ≪ n in the semi-supervised setting, that is, mainly unlabeled and only a few labeled instances are available, since labels are often costly to obtain in terms of resources (time, money, etc.). Labeling might for instance require domain experts such as medical professionals (e.g., pathologists) or technical experts (e.g., aerospace engineers). Anomalous instances in particular are also infrequent by nature (e.g., rare medical conditions) or very expensive (e.g., the failure of some industrial machine). The deliberate generation of anomalies is rarely an option. However, including known anomalous examples, if available, can significantly improve the detection performance of a model [143], [183], [227]-[230]. Labels are also sometimes available in the online setting where alarms raised by the anomaly detector have been investigated to determine whether they were correct. Some unsupervised anomaly detection methods can be incrementally updated when such labels become available [231]. Verifying unlabeled samples as indeed being normal can often be easier due to the more frequent nature of normal data. For this reason, among others, the special semi-supervised case of Learning from Positive and Unlabeled Examples (LPUE) [232]-[234], i.e., learning from labeled normal and unlabeled examples, is also studied specifically in the anomaly detection literature [148], [161], [235]-[237].
Previous work [161] has also referred to the special case of learning exclusively from positive examples as the semi-supervised anomaly detection setting, which is confusing terminology. Although meticulously curated normal data can sometimes be available (e.g., in open category detection [238]), a setting in which entirely (and confidently) labeled normal examples are available is rather rare in practice. The analysis of this setting is instead again justified by the assumption that most of the given (unlabeled) training data is normal, not by absolute certainty thereof. This makes the setting effectively equivalent to the unsupervised setting from a modeling perspective, apart from possibly weakened assumptions on the level of noise or contamination, which previous works also point out [161]. We therefore refer to the more general setting as presented in (7) as the semi-supervised anomaly detection setting, which incorporates both labeled normal as well as anomalous examples in addition to unlabeled instances, since this setting is relevant and occurs in practice. If some labeled anomalies are available, the modeling assumptions about the anomalous distribution P−, as mentioned in section II-C1, become critical for effectively incorporating anomalies into training. These include, for instance, whether modes or clusters are expected among the anomalies (e.g., group anomalies).

4) The Supervised Setting: The supervised anomaly detection setting is the case in which completely labeled data is available for training a model, where again Y = {±1} with ỹ = +1 denoting normal instances and ỹ = −1 denoting anomalies respectively. If both the normal and the anomalous data points are assumed to be representative of the normal data distribution P+ and the anomaly distribution P− respectively, this learning problem is equivalent to supervised binary classification. Such a setting would thus not be an anomaly detection problem in the strict sense, but rather a classification task. Although anomalous modes or clusters might exist, i.e., some anomalies might be more likely to occur than others, anything not normal is by definition an anomaly. Labeled anomalies are therefore rarely representative of some 'anomaly class'. This distinction is also reflected in modeling: whereas in classification the objective is to learn a (well generalizing) decision boundary that best separates the data according to some (closed set of) class labels, the objective in anomaly detection remains the estimation of the normal density level set boundaries. Hence, we should interpret supervised anomaly detection problems as label-informed density level set estimation in which confident normal (in-distribution) and anomalous (out-of-distribution) training examples are available. Due to the costs usually involved with labeling, as mentioned before, the supervised anomaly detection setting is the most uncommon setting in practice.

5) Further Data Properties: Besides the settings described above, the intrinsic properties of the data itself are also crucial for modeling a specific anomaly detection problem. We give a list of relevant data properties in Table I and present a toy dataset with a specific realization of these properties in Fig. 4, which will serve as a running example.

Fig. 4. A two-dimensional Big Moon, Small Moon toy example with real-valued ground-truth normal law P+ that is composed of two one-dimensional manifolds (bimodal, two-scale, non-convex). The unlabeled training data (n = 1,000, m = 0) is generated from P = P+ + ε, which is subject to Gaussian noise ε. This toy data is non-hierarchical, context-free, and stationary. Anomalies are off-manifold points that may occur uniformly over the displayed range.

The assumptions about these data properties should be reflected in modeling choices such as adding context or deciding among suitable deep or shallow feature maps, which can be challenging. We outline these and further challenges in anomaly detection next.

D. Challenges in Anomaly Detection
We conclude our introduction by briefly highlighting some notable challenges in anomaly detection, some of which directly arise from the definition and data characteristics detailed above. Certainly, the fundamental challenge in anomaly detection is the mostly unsupervised nature of the problem, which necessarily requires assumptions to be made about the specific task, the domain, and the given data. These include assumptions about the relevant types of anomalies (cf., II-B2), possible prior assumptions about the anomaly distribution (cf., II-C1) and, if labeled data is available, the challenge of how to incorporate labeled instances in a generalizing way (cf., II-C3 and II-C4). Further questions include whether a specific task requires an anomaly score or a threshold (cf., II-B7), and what level α (cf., II-B8) strikes a balance between false alarms and missed anomalies that is reasonable for the task. Is the data-generating process subject to noise or contamination (cf., II-C2), i.e., is robustness a critical aspect? Moreover, identifying and including the data properties given in Table I in a method and model can pose challenges as well. The computational complexity in both the dataset size n + m and the dimensionality D, as well as the memory cost of a model at training time, but also at test time, can be a limiting factor (e.g., for data streams or in real-time monitoring). Is the data-generating process assumed to be non-stationary [239]-[241] and are distributional shifts expected at test time? For (truly) high-dimensional data, the curse of dimensionality and the resulting concentration of distances can be a major issue [165]. Here, finding a representation that captures the features that are relevant for the task and meaningful for the data and domain becomes vital. Deep anomaly detection methods further entail new challenges, such as an increased number of hyperparameters and the selection of a suitable network architecture and optimization parameters (learning rate, batch sizes, etc.). In addition, the more complex the data or a model is, the greater the challenges of interpretability (e.g., [242]-[245]), transparency, and explaining anomalies become. We illustrate these various practical challenges and provide guidelines with worked-through examples in section VIII.
Given all these facets of the anomaly detection problem we covered in this introduction, it is not surprising that there is such a wealth of literature and approaches on the topic. We turn to these approaches in the following sections, where we first examine density estimation and probabilistic models (section III), followed by one-class classification methods (section IV), and finally reconstruction models (section V). In these sections, we will point out the connections between deep and shallow methods. Afterwards, we present our unifying view in section VI, which will enable us to systematically identify open challenges and paths for future research.

III. DENSITY ESTIMATION AND PROBABILISTIC MODELS
The first category of methods predicts anomalies through the intermediate step of estimating the whole probability distribution of the data. A wealth of existing probabilistic models are therefore direct candidates for the task of anomaly detection. This includes classic density estimation methods [246] as well as deep statistical models. In the following, we describe the adaptation of these techniques to anomaly detection.

A. Classic Density Estimation
One of the most basic approaches to multivariate anomaly detection is to compute the Mahalanobis distance from a test point to the training data mean [247]. This is equivalent to fitting a multivariate Gaussian distribution to the training data and evaluating the log-likelihood of a test point according to that model [248]. Compared to modeling each dimension of the data independently, fitting a multivariate Gaussian can capture linear interactions between multiple dimensions. To model more complex distributions, nonparametric density estimators have been introduced, including kernel density estimators (KDE) [12], [246], histogram estimators, and Gaussian mixture models (GMMs) [249], [250]. The kernel density estimator is arguably the most widely used nonparametric density estimator due to theoretical advantages over histograms [251] and the practical issues with fitting and parameter selection for GMMs [252]. The standard kernel density estimator, along with a more recent adaptation that can deal with modest levels of outliers in the training data [253], [254], is therefore a popular approach to anomaly detection.
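As a minimal sketch of the Mahalanobis approach described above (the data here is synthetic, and the covariance is assumed to be non-singular):

```python
import numpy as np

# Gaussian / Mahalanobis scoring: fit a multivariate Gaussian and
# score by squared Mahalanobis distance, a monotone (order-preserving)
# transform of the negative log-likelihood under that model.
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], 1000)

mu = X_train.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

def mahalanobis_score(X):
    d = X - mu
    return np.einsum("ij,jk,ik->i", d, Sigma_inv, d)

print(mahalanobis_score(np.array([[0.0, 0.0], [6.0, -6.0]])))
```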
While classic nonparametric density estimators perform fairly well for low dimensional problems, they suffer notoriously from the curse of dimensionality: the sample size required to attain a fixed level of accuracy grows exponentially in the dimension of the feature space. One goal of deep statistical models is to overcome this challenge.

B. Energy-Based Models
Some of the earliest deep statistical models are energy-based models (EBMs) [255]-[257]. An EBM is a model whose density is characterized by an energy function E_θ(x) as

p_θ(x) = exp(−E_θ(x)) / Z(θ),

where Z(θ) = ∫ exp(−E_θ(x)) dx is the so-called partition function that ensures that p_θ integrates to 1. These models are typically trained via gradient descent, approximating the log-likelihood gradient ∇_θ log p_θ(x) via Markov chain Monte Carlo (MCMC) [258] or Stochastic Gradient Langevin Dynamics (SGLD) [259], [260]. While one typically cannot evaluate the density p_θ directly due to the intractability of the partition function Z(θ), the energy E_θ(x) can be used as an anomaly score, since it is a monotonically decreasing function of the density p_θ(x).
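A minimal sketch of how an energy function yields an anomaly score is given below; the architecture is a hypothetical choice, and the MCMC/SGLD-based training loop is omitted:

```python
import torch
import torch.nn as nn

# Energy network E_theta: X -> R; lower energy = higher density.
class EnergyNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

E = EnergyNet(dim=2)
x_test = torch.randn(5, 2)
scores = E(x_test)   # usable directly as anomaly scores: the intractable
                     # partition function Z(theta) shifts log p_theta only
                     # by a constant, so the density order is preserved.
```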
Early deep EBMs such as Deep Belief Networks [261] and Deep Boltzmann Machines [262] are graphical models consisting of layers of latent states followed by an observed output layer that models the training data. Here, the energy function depends not only on the input x, but also on the latent state z, so the energy function has the form E_θ(x, z). While these approaches can richly model latent probabilistic dependencies in data distributions, they are not particularly amenable to anomaly detection since one must marginalize out the latent variables to recover some value related to the likelihood. Later works replaced the probabilistic latent layers with deterministic ones [263], allowing for the practical use of E_θ(x) as an anomaly score. This sort of model has been successfully used for deep anomaly detection [145]. Recently, EBMs have also been suggested as a framework to reinterpret deep classifiers, where energy-based training has been shown to improve robustness and out-of-distribution detection performance [260].

C. Neural Generative Models (VAEs and GANs)
Neural generative models aim to learn a neural network that maps vectors sampled from a simple predefined source distribution Q, usually a Gaussian or uniform distribution, to the actual input distribution P + . More formally, the objective is to train the network so that φ ω (Q) ≈ P + where φ ω (Q) is the distribution that results from pushing the source distribution Q forward through neural network φ ω . The two most established neural generative models are variational autoencoders (VAEs) [264]- [266] and generative adversarial networks (GANs) [267].
1) VAEs: A variational autoencoder learns a deep latent-variable model in which the data points x are parameterized on latent samples z ∼ Q via some neural network, so that it learns a distribution p_θ(x | z) such that p_θ(x) ≈ p+(x). For example, a common instantiation is to let Q be an isotropic multivariate Gaussian distribution and let the neural network φ_{d,ω} = (µ_ω, σ_ω) (the decoder) with weights ω parameterize the mean and variance of an isotropic Gaussian:

p_θ(x | z) = N(x; µ_ω(z), σ_ω(z)² I).

Performing maximum likelihood estimation on θ is typically intractable. To remedy this, an additional neural network φ_{e,ω} (the encoder) is introduced to parameterize a variational distribution q_θ(z | x), with θ encapsulated by the output of φ_{e,ω}, to approximate the latent posterior p(z | x). The full model is then optimized via the evidence lower bound (ELBO) in a variational Bayes manner:

max_θ  E_{q_θ(z|x)}[ log p_θ(x | z) ] − D_KL( q_θ(z | x) ‖ p(z) ).

Optimization proceeds using Stochastic Gradient Variational Bayes [264]. Given a trained VAE, one can estimate p_θ(x) via Monte Carlo sampling from the prior p(z) and computing E_{z∼p(z)}[p_θ(x | z)]. Using this score directly for anomaly detection has a nice theoretical interpretation, but experiments have shown that it tends to perform worse [268], [269] than alternatively using the reconstruction probability [270], which conditions on the input and evaluates E_{z∼q_θ(z|x)}[log p_θ(x | z)].

2) GANs: GANs pose the problem of learning the target distribution as a zero-sum game: a generative model is trained in competition with an adversary that challenges it to generate samples whose distribution is similar to the training distribution. A GAN consists of two neural networks, a generator network φ_ω : Z → X and a discriminator network ψ_ω : X → (0, 1), which are pitted against each other: the discriminator is trained to discriminate between φ_ω(z) and x ∼ P+, where z ∼ Q. The generator is trained to fool the discriminator, thereby encouraging it to produce samples more similar to the target distribution. This is done using the following objective:

min_{φ_ω} max_{ψ_ω}  E_{x∼P+}[ log ψ_ω(x) ] + E_{z∼Q}[ log(1 − ψ_ω(φ_ω(z))) ].

Training is typically done via an alternating optimization scheme, which is notoriously finicky [271]. There exist many GAN variants, including the Wasserstein GAN [272], [273], which is frequently used in anomaly detection methods based on GANs, and StyleGAN, which has produced impressive high-resolution photorealistic images [274]. Due to their construction, GAN models offer no way to assign a likelihood to points in the input space. Using the discriminator directly has been suggested as one approach to use GANs for anomaly detection [137]. Other approaches apply optimization to find a point z̃ in latent space Z such that x̃ ≈ φ_ω(z̃) for a test point x̃. The authors of AnoGAN [51] recommend using an intermediate layer of the discriminator, f_ω, and setting the anomaly score to be a convex combination of the reconstruction loss ‖x̃ − φ_ω(z̃)‖ and the discrimination loss ‖f_ω(x̃) − f_ω(φ_ω(z̃))‖. In AD-GAN [147], the authors recommend initializing the search for latent points multiple times to find a collection of m latent points z̃_1, . . . , z̃_m, while simultaneously adapting the network parameters ω_i individually for each z̃_i to improve the reconstruction, and using the mean reconstruction loss

(1/m) Σ_{i=1}^{m} ‖x̃ − φ_{ω_i}(z̃_i)‖

as an anomaly score. Other adaptations include an encoder network that is trained to find the latent point z̃ and is used in a variety of ways, usually incorporating the reconstruction error [57], [148], [151], [152].
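The sketch below illustrates the latent-search idea in the spirit of AnoGAN, using only the reconstruction term for brevity (the full method adds the discriminator feature term, as described above); the generator here is a hypothetical stand-in for a pre-trained one:

```python
import torch

# Latent-search anomaly score: find z whose generation phi_omega(z)
# best reconstructs the test point; the residual is the score.
def latent_search_score(x, generator, latent_dim, steps=500, lr=1e-2):
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm(x - generator(z))    # reconstruction loss
        loss.backward()
        opt.step()
    return torch.norm(x - generator(z)).item() # final residual = score

# Usage with a stand-in generator (normally a pre-trained GAN generator):
gen = torch.nn.Linear(8, 2)
score = latent_search_score(torch.randn(1, 2), gen, latent_dim=8)
```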

D. Normalizing Flows
Like neural generative models, normalizing flows [275]-[277] attempt to map data points from a source distribution z ∼ Q (usually called the base distribution in the context of normalizing flows) to the input distribution, so that φ_ω(Q) ≈ P+.

Fig. 5. Density estimation models on the Big Moon, Small Moon toy example (cf., Fig. 4). The parametric Gauss model is limited to an ellipsoidal (convex, unimodal) density. KDE with an RBF kernel is more flexible, yet tends to underfit the (multi-scale) distribution due to a uniform kernel scale. RealNVP is the most flexible model, yet flow architectures induce biases as well, here a connected support caused by the affine coupling layers in RealNVP.
A distinguishing characteristic of normalizing flows is that the latent samples are D-dimensional, having the same dimensionality as the input space, and the network consists of a composition of L invertible layers φ_ω = φ_{L,ω_L} ◦ · · · ◦ φ_{1,ω_1}, where each φ_{i,ω_i} is designed to be invertible for all ω_i, thereby making the entire network invertible. The benefit of this formulation is that the probability density of x can be calculated exactly via the change-of-variables formula

p_x(x) = q( φ_ω^{-1}(x) ) · | det ∂φ_ω^{-1}(x)/∂x |.

Normalizing flow models are typically optimized to maximize the likelihood of the training data. Evaluating each layer's Jacobian and its determinant can be very expensive for general flow models. Consequently, the networks of flow models are usually designed so that the Jacobian is guaranteed to be upper (or lower) triangular, or has some other nice structure, such that one does not need to compute the full Jacobian and evaluating the determinant is efficient [275], [278], [279]; see [280] for an application in physics.
An advantage of these models over other methods is that one can calculate the likelihood of a point directly without any approximation while also being able to sample reasonably efficiently. Because the density p x (x) can be computed exactly, normalizing flow models can be applied directly for anomaly detection [281], [282].
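To make the change-of-variables computation concrete, the following sketch uses a single invertible affine map as a toy stand-in for a deep flow and evaluates the exact negative log-likelihood as an anomaly score; all parameters here are hypothetical:

```python
import torch

# Toy flow: x = phi(z) = z * exp(s) + t (elementwise, hence invertible
# with a diagonal Jacobian). Deep flows such as RealNVP stack many
# such invertible layers; the density formula is the same in spirit.
D = 2
s = torch.randn(D, requires_grad=True)   # log-scales
t = torch.randn(D, requires_grad=True)   # shifts
base = torch.distributions.Normal(torch.zeros(D), torch.ones(D))

def log_px(x):
    z = (x - t) * torch.exp(-s)          # inverse map phi^{-1}(x)
    # log p_x(x) = log q(z) + log|det J_{phi^{-1}}(x)| = log q(z) - sum(s)
    return base.log_prob(z).sum(-1) - s.sum()

x = torch.randn(5, D)
anomaly_score = -log_px(x)               # exact negative log-likelihood
# Training would maximize log_px over the data via gradient ascent.
```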
A drawback of these models is that they do not perform any dimensionality reduction, which argues against applying them to images where the true (effective) dimensionality is much smaller than the image dimensionality. It has been observed that these models often assign high likelihood to anomalous instances [269]. Despite present limits, we have included them here because we believe that they may provide an elegant and promising direction for future anomaly detection methods. We will come back to this in our outlook in section IX.

E. Discussion
While we have focused on the case of density estimation on i.i.d. samples of low dimensional data and images, it is worth noting that there exist many deep statistical models for other settings. When performing conditional anomaly detection, for example, one can use GAN [283], VAE [284], and normalizing flow [285] variants which perform conditional density estimation. Likewise there exist many deep generative models for virtually all data types including time series data [284], [286], text [287], [288], and graphs [289]- [291], all of which may potentially be used for anomaly detection.
It has been argued that full density estimation is not needed for solving the anomaly detection problem, since one learns all density level sets simultaneously when one really only needs a single density level set. This violates Vapnik's Principle: "[W]hen limited amount of data is available, one should avoid solving a more general problem as an intermediate step to solve the original problem" [292]. The methods in the next section seek to compute only a single density level set, that is, they perform one-class classification.

IV. ONE-CLASS CLASSIFICATION
One-class classification [183], [222], [293]- [295], occasionally also called single-class classification [296], [297], adopts a discriminative approach to anomaly detection. Methods based on one-class classification try to avoid a full estimation of the density as an intermediate step. Instead, these methods aim to directly learn a decision boundary that corresponds to a desired density level set of the normal data distribution P + , or more generally, to produce a decision boundary that yields a low cost when applied to unseen data.

A. The One-Class Classification Objective
We can see one-class classification as a particularly tricky classification problem, namely binary classification where we only have (or almost only have) access to data from one class, the normal class. Given this imbalanced setting, the one-class classification objective is to learn a one-class decision boundary that minimizes (i) falsely raised alarms for true normal instances (i.e., the false alarm rate or type I error) and (ii) undetected or missed true anomalies (i.e., the miss rate or type II error). Achieving a low (or zero) false alarm rate is conceptually simple: given enough normal data points, one could just draw some boundary that encloses all the points, for example a sufficiently large ball that contains all data instances. The crux, of course, is to simultaneously keep the miss rate low, that is, to not draw this boundary too loosely. For this reason, one usually a priori specifies some target false alarm rate α ∈ [0, 1] for which the miss rate is then sought to be minimized. Note that this precisely corresponds to the idea of estimating an α-density level set for some a priori fixed level α ∈ [0, 1]. The key question in one-class classification thus is how to minimize the miss rate for some given target false alarm rate with access to no (or only few) anomalies.
We can express the rationale above in terms of the binary classification risk [211], [221]. Let Y ∈ {±1} be the class random variable, where again Y = +1 denotes normal and Y = −1 denotes anomalous points, so that we can identify the normal data distribution as P+ ≡ P_{X|Y=+1} and the anomaly distribution as P− ≡ P_{X|Y=−1} respectively. Furthermore, let ℓ : R × {±1} → R be a binary classification loss and f : X → R be some real-valued score function. The classification risk of a scorer f under loss ℓ is then given by

E_{x∼P+}[ ℓ(f(x), +1) ] + E_{x∼P−}[ ℓ(f(x), −1) ].    (14)

Minimizing the second term, the expected loss of classifying true anomalies as normal, corresponds to minimizing the (expected) miss rate. Given some unlabeled data x_1, . . . , x_n ∈ X, and potentially some additional labeled data (x̃_1, ỹ_1), . . . , (x̃_m, ỹ_m), we can apply the principle of empirical risk minimization to obtain

min_f  (1/n) Σ_{i=1}^{n} ℓ(f(x_i), +1) + (1/m) Σ_{j=1}^{m} ℓ(f(x̃_j), ỹ_j) + R.    (15)

This solidifies the empirical one-class classification objective. Note that the second term is an empty sum in the unsupervised setting. Without any additional constraints or regularization, the empirical objective (15) would then be ill-posed. We add R as an additional term to denote and capture regularization, which may take various forms depending on the assumptions about f, but critically also about P−. Generally, the regularization R = R(f) aims to minimize the miss rate (e.g., via volume minimization and assumptions about P−) and to improve generalization (e.g., via smoothing of f). Further note that the pseudo-labeling of y = +1 in the first term incorporates the assumption that the n unlabeled training data points are normal. This assumption can be adjusted, however, through specific choices of the loss (e.g., hinge) and regularization. For example, one may require some fraction of the unlabeled data to be misclassified in order to include an assumption about the contamination rate η or to achieve some target false alarm rate α, as we will see below.

B. One-Class Classification in Input Space
As an illustrative example that conveys useful intuition, consider the previous simple idea of fitting a data-enclosing ball as a one-class model. Given x_1, . . . , x_n ∈ X, we can define the following objective:

min_{R,c,ξ}  R² + (1/(νn)) Σ_{i=1}^{n} ξ_i   s.t.  ‖x_i − c‖² ≤ R² + ξ_i, ξ_i ≥ 0, for all i.    (16)

In words, we aim to find a hypersphere with radius R > 0 and center c ∈ X that encloses the data (‖x_i − c‖² ≤ R²). To control the miss rate, we minimize the volume of this hypersphere by minimizing R² to achieve a tight spherical boundary. Slack variables ξ_i ≥ 0 allow some points to fall outside the sphere, thus making the boundary soft, where the hyperparameter ν ∈ (0, 1] balances this trade-off. Objective (16) exactly corresponds to Support Vector Data Description (SVDD) applied in the input space X, motivated above as in [7], [183], [222].

Equivalently, we can derive (16) from the binary classification risk. Consider the (shifted, cost-weighted) hinge loss ℓ(s, y) defined by ℓ(s, +1) = (1/(1+ν)) max(0, s) and ℓ(s, −1) = (ν/(1+ν)) max(0, −s) [221]. Then, for a hypersphere model f_θ(x) = ‖x − c‖² − R² with parameters θ = (R, c), the corresponding classification risk objective (14) is given by

min_{R,c}  (1/(1+ν)) E_{x∼P+}[ max(0, ‖x − c‖² − R²) ] + (ν/(1+ν)) E_{x∼P−}[ max(0, R² − ‖x − c‖²) ].    (17)

We can estimate the first term in (17) empirically from x_1, . . . , x_n, again assuming (most of) these points have been drawn from P+. If labeled anomalies are absent, we can still make an assumption about their distribution P−. Following the basic, uninformed prior assumption that anomalies may occur uniformly on X (i.e., P− ≡ U(X)), we can examine the expected value in the second term analytically:

E_{x∼U(X)}[ max(0, R² − ‖x − c‖²) ] = (1/λ(X)) ∫_{B_R(c)} (R² − ‖x − c‖²) dλ(x) ≤ R² λ(B_R(c)) / λ(X),    (18)

where B_R(c) denotes the ball centered at c with radius R and λ is again the standard (Lebesgue) measure of volume. This shows that the minimum volume principle [217], [219] naturally arises in one-class classification through seeking to minimize the risk of missing anomalies, here illustrated under the assumption that the anomaly distribution P− follows a uniform distribution. Overall, from (17) we thus can derive the empirical objective

min_{R,c}  R² + (1/(νn)) Σ_{i=1}^{n} max(0, ‖x_i − c‖² − R²),    (19)

which corresponds to (16) with the constraints directly incorporated into the objective function. We remark that the cost-weighting hyperparameter ν ∈ (0, 1] is purposefully chosen here, since it is an upper bound on the fraction of points falling outside the sphere and a lower bound on the fraction of points on or outside its boundary [6], [136]. We can therefore see ν as an approximation of the false alarm rate, that is, ν ≈ α.

A sphere in the input space X is of course a very limited model that only matches a limited class of distributions P+ (e.g., an isotropic Gaussian). Minimum Volume Ellipsoids (MVE) [178], [298] and the Minimum Covariance Determinant (MCD) estimator [299] are a generalization to non-isotropic distributions with elliptical support. Nonparametric methods such as One-Class Neighbor Machines [300] provide additional freedom to model multi-modal distributions having non-convex support. Extending the objective and principles above to general feature spaces (e.g., [210], [292], [301]) further increases the flexibility of one-class models and enables decision boundaries for more complex distributions.
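A minimal sketch of the unconstrained soft-boundary objective (19), minimized here by plain gradient descent as a simple stand-in for the quadratic-program solvers typically used for SVDD (data and hyperparameters are hypothetical):

```python
import torch

# Soft-boundary hypersphere in input space, objective (19).
torch.manual_seed(0)
X = torch.randn(1000, 2)                         # assumed normal data
nu = 0.1
c = X.mean(0).clone().requires_grad_(True)       # center
R2 = torch.tensor(1.0, requires_grad=True)       # squared radius R^2
opt = torch.optim.SGD([c, R2], lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    dist2 = ((X - c) ** 2).sum(1)
    # R^2 + (1 / (nu * n)) * sum_i max(0, ||x_i - c||^2 - R^2)
    loss = R2 + (1.0 / (nu * len(X))) * torch.clamp(dist2 - R2, min=0).sum()
    loss.backward()
    opt.step()

scores = ((X - c) ** 2).sum(1) - R2              # f_theta(x) > 0: outside
```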

C. Kernel-based One-Class Classification
The kernel-based OC-SVM [6], [302] and SVDD [7], [183] are perhaps the most well-known one-class classification methods. Let k : X × X → R be some positive semi-definite (PSD) kernel with associated RKHS F_k and corresponding feature map φ_k : X → F_k, so that k(x, x̃) = ⟨φ_k(x), φ_k(x̃)⟩ for all x, x̃ ∈ X. The objective of (kernel) SVDD is again to find a data-enclosing hypersphere of minimum volume. The SVDD primal problem is the one given in (16), but with the hypersphere model f_θ(x) = ‖φ_k(x) − c‖² − R² in feature space F_k. In comparison, the OC-SVM objective is to find a hyperplane w ∈ F_k that separates the data in feature space F_k with maximum margin from the origin:

min_{w,ρ,ξ}  (1/2)‖w‖² − ρ + (1/(νn)) Σ_{i=1}^{n} ξ_i   s.t.  ⟨w, φ_k(x_i)⟩ ≥ ρ − ξ_i, ξ_i ≥ 0, for all i.    (20)

So the OC-SVM uses a linear model f_θ(x) = ρ − ⟨w, φ_k(x)⟩ in feature space F_k. The margin to the origin is given by ρ/‖w‖ and is maximized via maximizing ρ, where ‖w‖ acts as a normalizer.
The OC-SVM and SVDD both can be solved in their respective dual formulations, which are quadratic programs that only involve dot products (the feature map φ_k remains implicit). For the standard Gaussian kernel (or any kernel with constant norm k(x, x) = c > 0), the OC-SVM and SVDD are equivalent [183]. In this case, the corresponding density level set estimator, defined by

Ĉ_ν = {x ∈ X | ⟨ŵ, φ_k(x)⟩ ≥ ρ̂},

is in fact an asymptotically consistent ν-density level set estimator [303]. The solution paths of the hyperparameter ν have been analyzed for both the OC-SVM [304] and SVDD [305]. Kernel-induced feature spaces considerably improve the expressive power of one-class methods and allow learning well-performing models in multi-modal, non-convex, and non-linear data settings. Many variants of kernel one-class classification have been proposed and studied over the years, such as hierarchical formulations for nested density level set estimation [306], [307], Multi-Sphere SVDD [308], Multiple Kernel Learning for OC-SVM [309], [310], OC-SVM for group anomaly detection [196], boosting via L1-norm regularized OC-SVM [311], One-class Kernel Fisher Discriminants [312]-[314], Bayesian Data Description [315], and robust variants [316].
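In practice, kernel one-class classification is readily available off the shelf; a minimal OC-SVM sketch with scikit-learn, where the data and the hyperparameters ν and γ are hypothetical choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))        # assumed normal data

# nu approximates the target false alarm rate alpha (see above);
# gamma is the Gaussian (RBF) kernel scale.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)

X_test = rng.normal(size=(10, 2))
labels = ocsvm.predict(X_test)             # +1 normal, -1 anomaly
scores = -ocsvm.decision_function(X_test)  # higher = more anomalous
```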
D. Deep One-Class Classification

As a simpler variant compared to using a neural hypersphere model in (16), the One-Class Deep SVDD [136], [320] has been introduced, which poses the following objective:

min_ω  (1/n) Σ_{i=1}^n ||φ_ω(x_i) − c||^2 + R,

where φ_ω is a neural network feature map with weights ω, c is a fixed hypersphere center in feature space, and R denotes weight regularization (e.g., weight decay).

Fig. 6. One-class classification models on the Big Moon, Small Moon toy example (cf., Fig. 4): MVE (AUC=74.7), SVDD (AUC=90.9), Deep SVDD (AUC=97.5). A Minimum Volume Ellipsoid (MVE) in input space is limited to enclose an ellipsoidal, convex region. By (implicitly) fitting a hypersphere in kernel feature space, SVDD enables non-convex support estimation. Deep SVDD learns an (explicit) neural feature map (here with smooth ELU activations) that extracts multiple data scales to fit a hypersphere model in feature space for support description.
Deep one-class classification methods generally offer a greater modeling flexibility and enable learning or transfer of task-relevant features for complex data. They usually require more data to be effective though, or must rely on some informative domain prior (e.g., some pre-trained network). The underlying principle of one-class classification methodstargeting a discriminative one-class boundary in learningremains unaltered, regardless of whether a deep or shallow feature map is used.
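A minimal PyTorch sketch of the One-Class Deep SVDD objective above follows; the bias-free two-layer network, toy data, and optimization settings are illustrative assumptions (omitting bias terms helps guard against a trivial constant feature map, cf. [136]), and weight decay plays the role of the regularizer R.

```python
import torch
import torch.nn as nn

phi = nn.Sequential(                      # simple bias-free feature map phi_omega
    nn.Linear(2, 32, bias=False), nn.ELU(),
    nn.Linear(32, 2, bias=False),
)

X = torch.randn(500, 2)                   # stand-in for normal training data
with torch.no_grad():
    c = phi(X).mean(dim=0)                # fix center c as the mean of initial embeddings

opt = torch.optim.Adam(phi.parameters(), lr=1e-3, weight_decay=1e-4)
for _ in range(200):
    opt.zero_grad()
    loss = ((phi(X) - c) ** 2).sum(dim=1).mean()  # mean squared distance to the center
    loss.backward()
    opt.step()

scores = ((phi(X) - c) ** 2).sum(dim=1).detach()  # anomaly score: distance to center
```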

E. Negative Examples
One-class classifiers can usually incorporate labeled negative examples (y = −1) in a direct manner due to their close connection to binary classification as explained above. Such negative examples can facilitate an empirical estimation of the miss rate (cf., (14) and (15)). We here recognize three qualitative types of negative examples that have been studied in the literature, which we distinguish as artificial, auxiliary, and true negative examples, in increasing order of informativeness.
The idea to approach unsupervised learning problems through generating artificial data points has been around for some time (cf., section 14.2.4 in [328]). If we assume that the anomaly distribution P− has some form that we can generate examples from, one idea would be to simply train a binary classifier to discern between the normal and the artificial negative examples. For the uniform prior P− ≡ U(X), this approach yields an asymptotically consistent density level set estimator [211]. Classification against uniformly drawn points from a hypercube, however, quickly becomes ineffective in higher dimensions. To improve over artificial uniform sampling, more informed sampling strategies have been proposed [329] such as resampling schemes [330], manifold sampling [331], and sampling based on local density estimation [332], [333], as well as active learning strategies [334]-[336]. Another recent idea is to treat the enormous quantities of data that are publicly available in some domains as auxiliary negative examples [325], for example images from photo sharing sites for computer vision tasks and the English Wikipedia for NLP tasks. Such auxiliary examples provide more informative domain knowledge, for instance about the distribution of natural images or the English language in general, as opposed to sampling random pixels or words. This approach, called Outlier Exposure [325], can significantly improve deep anomaly detection performance in some domains [154], [325]. Finally, the most informative labeled negative examples are true anomalies, for example verified by some domain expert. Access to even a few labeled anomalies has been shown to improve detection performance significantly [143], [183], [228]. There also have been active learning algorithms proposed that include subjective user feedback (e.g., from an expert) to learn about the user-specific informativeness of particular anomalies in an application [337].

V. RECONSTRUCTION MODELS
Models that are trained on a reconstruction objective are among the earliest [338], [339] and most common [180], [182] neural network approaches to anomaly detection. Reconstruction-based methods learn a model that is optimized to reconstruct normal data instances well, aiming to detect anomalies through their comparatively poor reconstruction under the learned model. Most of these methods have a purely geometric motivation (e.g., PCA or deterministic autoencoders), yet some probabilistic variants reveal a connection to density (level set) estimation. In this section, we define the general reconstruction learning objective, highlight common underlying assumptions, present standard reconstruction-based methods, and discuss their variants.

A. The Reconstruction Objective
Let φ_θ : X → X, x ↦ φ_θ(x) be a feature map from the data space X onto itself that is composed of an encoding function φ_e : X → Z (the encoder) and a decoding function φ_d : Z → X (the decoder), that is, φ_θ ≡ (φ_d ∘ φ_e)_θ where θ holds the parameters of both the encoder and decoder. We call Z the latent space and φ_e(x) = z the latent representation (or embedding or code) of x. The reconstruction objective then is to learn φ_θ such that φ_θ(x) = φ_d(φ_e(x)) = x̂ ≈ x, that is, to find some encoding and decoding transformation so that x is reconstructed with minimal error, usually measured in Euclidean distance. Given unlabeled data x_1, …, x_n ∈ X, the reconstruction objective is given by

min_θ  (1/n) Σ_{i=1}^n ||x_i − φ_θ(x_i)||^2 + R,   (23)

where R again denotes the different forms of regularization that various methods introduce, for example on the parameters θ, the structure of the encoding and decoding transformations, or the geometry of latent space Z. Without any restrictions, the reconstruction objective (23) would be optimally solved by the identity map φ_θ ≡ id, but then of course nothing would be learned from the data. In order to learn something useful, structural assumptions about the data-generating process are therefore necessary. We here identify two principal assumptions: the manifold and the prototype assumptions.
1) The Manifold Assumption: The manifold assumption asserts that the data lives (approximately) on some lower-dimensional (possibly non-linear and non-convex) manifold M that is embedded within the data space X, that is, M ⊂ X with dim(M) < dim(X). In this case, X is sometimes also called the ambient or observation space. For natural images observed in pixel space, for instance, the manifold captures the structure of scenes as well as variation due to rotation and translation, changes in color, shape, size, texture, and so on. For human voices observed in audio signal space, the manifold captures variation due to the words being spoken as well as person-to-person variation in the anatomy and physiology of the vocal folds. The (approximate) manifold assumption implies that there exists a lower-dimensional latent space Z and functions φ_e : X → Z and φ_d : Z → X such that for all x ∈ X, x ≈ φ_d(φ_e(x)). Consequently, the generating distribution P can be represented as the pushforward through φ_d of a latent distribution P_Z. Equivalently, the latent distribution P_Z is the pushforward of P through φ_e.
The goal of learning is therefore to learn the pair of functions φ_e and φ_d so that φ_d(φ_e(X)) ≈ M ⊂ X. Methods that incorporate the manifold assumption usually restrict the latent space Z ⊆ R^d to have much lower dimensionality d than the data space X ⊆ R^D (i.e., d ≪ D). The manifold assumption is also widespread in related unsupervised learning tasks such as manifold learning itself [340], [341], dimensionality reduction [3], [342]-[344], disentanglement [209], [345], and representation learning in general [80], [346].
2) The Prototype Assumption: The prototype assumption asserts that there exists a finite number of prototypical elements in the data space X that characterize the data well. We can model this assumption in terms of a data-generating distribution that depends on a discrete latent categorical variable Z ∈ Z = {1, …, K} that captures some K prototypes or modes of the data distribution. This prototype assumption is also common in clustering and classification when we assume a collection of prototypical instances represent clusters or classes well. With the reconstruction objective under the prototype assumption, we aim to learn an encoding function that for x ∈ X identifies φ_e(x) = k ∈ {1, …, K} and a decoding function k ↦ φ_d(k) = c_k that maps to some k-th prototype (or some prototypical distribution or mixture of prototypes more generally) such that the reconstruction error ||x − c_k|| becomes minimal. In contrast to the manifold assumption, where we aim to describe the data by some continuous mapping, under the (most basic) prototype assumption we characterize the data by a discrete set of vectors {c_1, …, c_K} ⊆ X. The method of representing a data distribution by a set of prototype vectors is also known as Vector Quantization (VQ) [347], [348].
3) The Reconstruction Anomaly Score: A model that is trained on the reconstruction objective must extract salient features and characteristic patterns from the data (e.g., feature correlations and dependencies, recurring patterns, cluster structure, statistical redundancy) in its encoding, subject to imposed model assumptions, so that its decoding from the compressed latent representation achieves low reconstruction error. Assuming that the training data x_1, …, x_n ∈ X includes mostly normal points, we therefore expect a reconstruction-based model to produce a low reconstruction error for normal instances and a high reconstruction error for anomalies. For this reason, the anomaly score is usually also directly defined by the reconstruction error,

s(x) = ||x − φ_θ(x)||^2.

For models that have learned some truthful manifold structure or prototypical representation, a high reconstruction error would then detect off-manifold or non-prototypical instances. Most reconstruction methods do not follow any probabilistic motivation, and a point x gets flagged anomalous simply because it does not conform to its 'idealized' representation φ_d(φ_e(x)) = x̂ under the encoding and decoding process. However, some reconstruction methods also have probabilistic interpretations, for instance PCA [349], or are even derived from probabilistic objectives such as Bayesian PCA [350] or VAEs [264]. Such methods are again related to density (level set) estimation (under specific assumptions about some latent structure), usually in the sense that a high reconstruction error indicates low density regions and vice versa.

B. Principal Component Analysis
A common way to formulate the Principal Component Analysis (PCA) objective is to seek an orthogonal basis W in data space X ⊆ R^D that maximizes the empirical variance of the projected (centered) data x_1, …, x_n ∈ X:

max_W  (1/n) Σ_{i=1}^n ||W x_i||^2   s.t.   W W^⊤ = I.

Solving this objective results in a well-known eigenvalue problem, since the optimal basis is given by the eigenvectors of the empirical covariance matrix, where the respective eigenvalues correspond to the component-wise variances [351]. The d ≤ D components that explain most of the variance, the principal components, are then given by the d eigenvectors that have the largest eigenvalues. Several works have adapted PCA for anomaly detection [77], [352]-[357], which can be considered the default reconstruction baseline. From a reconstruction perspective, the objective to find an orthogonal projection W^⊤W to a d-dimensional linear subspace (which is the case for W ∈ R^{d×D} with W W^⊤ = I) such that the mean squared reconstruction error

(1/n) Σ_{i=1}^n ||x_i − W^⊤W x_i||^2

is minimized yields exactly the same PCA solution. So PCA optimally solves the reconstruction objective (23) for a linear encoder φ_e(x) = W x = z and transposed linear decoder φ_d(z) = W^⊤z with constraint W W^⊤ = I. For linear PCA, we can also readily identify its probabilistic interpretation [349], namely that the data distribution follows from the linear transformation X = W^⊤Z + ε of a d-dimensional latent Gaussian Z ∼ N(0, I), possibly with added noise ε ∼ N(0, σ^2 I), so that P ≡ N(0, W^⊤W + σ^2 I). Maximizing the likelihood of this Gaussian over the encoding and decoding parameter W again yields PCA as the optimal solution [349]. Hence, PCA assumes the data lives on a d-dimensional ellipsoid embedded in data space X ⊆ R^D. Standard PCA therefore provides an illustrative example of the connections between density estimation and reconstruction.
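A short sketch of PCA-based anomaly scoring with scikit-learn (toy data and the number of components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))             # stand-in for normal data

pca = PCA(n_components=3).fit(X_train)           # d = 3 principal components
X_rec = pca.inverse_transform(pca.transform(X_train))
scores = np.sum((X_train - X_rec) ** 2, axis=1)  # squared reconstruction error as score
```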
Of course, linear PCA is limited to data encodings that can only exploit linear feature correlations. Kernel PCA [3] introduced a non-linear generalization of component analysis by extending the PCA objective to non-linear kernel feature maps and taking advantage of the 'kernel trick'. For a PSD kernel k(x, x̃) with feature map φ_k : X → F_k, kernel PCA solves the reconstruction objective in feature space,

min_W  (1/n) Σ_{i=1}^n ||φ_k(x_i) − W^⊤W φ_k(x_i)||^2,   (27)

which results in an eigenvalue problem of the kernel matrix [3]. For kernel PCA, the reconstruction error can again serve as an anomaly score. It can be computed implicitly via the dual [4]. This reconstruction from linear principal components in feature space F_k corresponds to a reconstruction from some non-linear subspace or manifold in input space X [358]. Replacing the reconstruction W^⊤W φ_k(x) in (27) with a prototype c ∈ F_k yields a reconstruction model that considers the squared error to the kernel mean, since the prototype is optimally solved by c = (1/n) Σ_{i=1}^n φ_k(x_i) for the L_2-distance. For RBF kernels, this prototype model is (up to a multiplicative constant) equivalent to kernel density estimation [4], which provides a link between kernel reconstruction and nonparametric density estimation methods. Finally, Robust PCA variants have been introduced as well [359]-[362], which extend PCA to account for data contamination or noise (cf., II-C2).
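A hedged sketch of kernel PCA reconstruction error as an anomaly score: scikit-learn's KernelPCA reconstructs an approximate pre-image in input space (rather than evaluating the feature-space error via the dual as in [4]), which serves the same purpose for scoring; kernel, gamma, and toy data are illustrative.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))

kpca = KernelPCA(n_components=5, kernel="rbf", gamma=1.0,
                 fit_inverse_transform=True).fit(X_train)   # enables pre-image reconstruction
X_rec = kpca.inverse_transform(kpca.transform(X_train))
scores = np.sum((X_train - X_rec) ** 2, axis=1)             # input-space reconstruction error
```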

C. Autoencoders
Autoencoders are reconstruction models that use neural networks for the encoding and decoding of data. They were originally introduced during the 80s [363]-[366], primarily as methods to perform non-linear dimensionality reduction [367], [368], yet they have also been studied early on for anomaly detection [338], [339]. Today, deep autoencoders are among the most widely adopted methods for deep anomaly detection in the literature [44], [52], [55], [124]-[134], likely owing to their long history and easy-to-use standard variants.

Fig. 7. Reconstruction models on the Big Moon, Small Moon toy example (cf., Fig. 4): PCA (AUC=66.8), kernel PCA (AUC=94.0), AE (AUC=97.9). PCA finds the linear subspace with the lowest reconstruction error under an orthogonal projection of the data. Kernel PCA solves (linear) component analysis in kernel feature space, which enables an optimal reconstruction from (kernel-induced) non-linear components in input space. An autoencoder (AE) with a one-dimensional latent code learns a one-dimensional, non-linear manifold in input space having minimal reconstruction error.

The standard autoencoder objective is given by

min_ω  (1/n) Σ_{i=1}^n ||x_i − (φ_d ∘ φ_e)_ω(x_i)||^2 + R,   (28)

where the optimization is carried out over the neural network weights ω of the encoder and decoder. A common way to regularize autoencoders is by mapping to a lower-dimensional 'bottleneck' representation φ_e(x) = z ∈ Z through the encoder network, which enforces data compression and effectively limits the dimensionality of the manifold or subspace to be learned. If linear networks are used, such an autoencoder in fact recovers the same optimal subspace as spanned by the PCA eigenvectors [369], [370]. Apart from a 'bottleneck', a number of different ways to regularize autoencoders have been introduced in the literature. Following ideas of sparse coding [371]-[374], sparse autoencoders [375], [376] regularize the (possibly higher-dimensional, over-complete) latent code towards sparsity, for example via L_1 Lasso penalization [377]. Denoising autoencoders (DAEs) [378], [379] explicitly feed noise-corrupted inputs x̃ = x + ε into the network, which is then trained to reconstruct the original inputs x. DAEs thus provide a way to specify a noise model for ε (cf., II-C2), which has been applied for noise-robust acoustic novelty detection [42], for instance. For situations in which the training data is already corrupted with noise or unknown anomalies, robust deep autoencoders [126], which split the data into well-represented and corrupted parts similar to robust PCA [361], have been proposed. Contractive autoencoders (CAEs) [380] propose to penalize the Frobenius norm of the Jacobian of the encoder activations with respect to the inputs to obtain a smoother and more robust latent representation. Such ways of regularization influence the geometry and shape of the subspace or manifold that is learned by an autoencoder, for example by imposing some degree of smoothness or introducing invariances towards certain types of input corruptions or transformations [130]. Hence, these regularization choices should again reflect the specific assumptions of a given anomaly detection task.
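A minimal PyTorch sketch of objective (28) with a bottleneck autoencoder, using the reconstruction error as the anomaly score; the architecture, toy data, and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # bottleneck dim 2
decoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 10))

X = torch.randn(500, 10)                       # stand-in for normal training data
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for _ in range(200):
    opt.zero_grad()
    loss = ((X - decoder(encoder(X))) ** 2).sum(dim=1).mean()  # objective (28); R = bottleneck
    loss.backward()
    opt.step()

scores = ((X - decoder(encoder(X))) ** 2).sum(dim=1).detach()  # reconstruction-error score
```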
Besides the deterministic variants above, probabilistic autoencoders have also been proposed, which again establish a connection to density estimation. The most explored class of probabilistic autoencoders are Variational Autoencoders (VAEs) [264]-[266], as introduced in section III-C1 through the lens of neural generative models, which approximately maximize the data likelihood (or evidence) by maximizing the evidence lower bound (ELBO). From a reconstruction perspective, VAEs adopt a stochastic autoencoding process, which is realized by encoding and decoding the parameters of distributions (e.g., Gaussians) through the encoder and decoder networks, from which the latent code and reconstruction then can be sampled. For a standard Gaussian VAE, for example, where q(z|x) ∼ N(μ_x, diag(σ_x^2)), p(z) ∼ N(0, I), and p(x|z) ∼ N(μ_z, I) with encoder φ_{e,ω}(x) = (μ_x, σ_x) and decoder φ_{d,ω}(z) = μ_z, the empirical ELBO objective (10) becomes

min_ω  (1/n) Σ_{i=1}^n [ (1/M) Σ_{m=1}^M (1/2) ||x_i − μ_{z_im}||^2 + D_KL( N(μ_{x_i}, diag(σ_{x_i}^2)) || N(0, I) ) ],

where z_i1, …, z_iM are M Monte Carlo samples drawn from the encoding distribution z ∼ q(z|x_i) of x_i. Hence, such a VAE is trained to minimize the mean reconstruction error over samples from an encoded latent Gaussian that is regularized to be close to a standard isotropic Gaussian. VAEs have been used in various forms for anomaly detection [268], [270], [381], for instance on multimodal sequential data with LSTM networks for anomaly detection in robot-assisted feeding [382] and for new physics mining at the Large Hadron Collider [74]. Another class of probabilistic autoencoders that has been applied to anomaly detection are Adversarial Autoencoders (AAEs) [44], [52], [383]. By employing an adversarial loss to regularize and match the latent encoding distribution, AAEs can employ any arbitrary prior p(z), as long as sampling is feasible. Finally, other autoencoder variants that have been applied to anomaly detection include RNN-based autoencoders [193], [230], [384], [385], convolutional autoencoders [55], autoencoder ensembles [125], [385], and variants that actively control the topology of the latent code [386]. Autoencoders have also been employed in two-step approaches that utilize autoencoders for dimensionality reduction and apply traditional methods on the learned embeddings [135], [387], [388].
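For concreteness, a sketch of the per-batch negative ELBO for the Gaussian VAE setup just described; the small encoder/decoder networks are hypothetical stand-ins, and M is the number of Monte Carlo samples.

```python
import torch
import torch.nn as nn

class Enc(nn.Module):
    """Hypothetical encoder producing (mu_x, log sigma_x) of the Gaussian q(z|x)."""
    def __init__(self, d_in=10, d_z=2):
        super().__init__()
        self.mu, self.log_sig = nn.Linear(d_in, d_z), nn.Linear(d_in, d_z)
    def forward(self, x):
        return self.mu(x), self.log_sig(x)

def neg_elbo(x, enc, dec, M=1):
    mu_x, log_sig_x = enc(x)
    rec = 0.0
    for _ in range(M):                                    # M Monte Carlo samples z ~ q(z|x)
        z = mu_x + torch.exp(log_sig_x) * torch.randn_like(mu_x)
        rec = rec + 0.5 * ((x - dec(z)) ** 2).sum(dim=1)  # -log p(x|z) up to a constant
    rec = rec / M
    kl = 0.5 * (torch.exp(2 * log_sig_x) + mu_x ** 2      # KL(q(z|x) || N(0, I)), closed form
                - 1.0 - 2 * log_sig_x).sum(dim=1)
    return (rec + kl).mean()

enc, dec = Enc(), nn.Linear(2, 10)                        # decoder outputs mu_z
loss = neg_elbo(torch.randn(64, 10), enc, dec, M=1)
```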

D. Prototypical Clustering
Clustering methods that make the prototype assumption provide another approach to reconstruction-based anomaly detection. As mentioned above, the reconstruction error here is usually given by the distance of a point to its nearest prototype, which ideally has been learned to represent a distinct mode of the normal data distribution. Prototypical clustering methods [389] include the well-known Vector Quantization (VQ) algorithms k-means, k-medians, and k-medoids, which define a Voronoi partitioning [390], [391] over the metric space where they are applied -typically the input space X . Kernel variants of k-means have also been studied [392] and considered for anomaly detection [308]. More recently, deep learning approaches to clustering have also been introduced [393]- [396], some also based on k-means [397], and adopted for anomaly detection [128], [387], [398]. As in deep one-class classification (cf., section IV-D), a persistent question in deep clustering is how to effectively regularize against a feature map collapse [399]. Note that whereas for deep clustering methods the reconstruction error is measured in latent space Z, for deep autoencoders it is measured in the input space X after decoding. Thus, a latent feature collapse (i.e., a constant encoder φ e ≡ c ∈ Z) would result in a constant decoding (the data mean at optimum) for an autoencoder, which generally is a suboptimal solution of (28). For this reason, autoencoders seem less susceptible to a feature collapse, though they have also been observed to converge to bad local optima under SGD optimization, specifically if they employ bias terms [136].
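A short sketch of prototype-based anomaly scoring with k-means (scikit-learn), where the score is the distance of a point to its nearest learned prototype; K and the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))          # stand-in for normal data

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
dists = km.transform(X_train)                # distances to all K centroids
scores = dists.min(axis=1)                   # distance to the nearest prototype c_k
```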

VI. A UNIFYING VIEW OF ANOMALY DETECTION
In this section, we present a unifying view on the anomaly detection problem. We identify specific anomaly detection modeling components that allow us to organize and characterize the vast collection of discussed anomaly detection methods in a systematic way. Importantly, this view shows connections that enable the transfer of algorithmic ideas between existing anomaly detection methods. Thus it reveals promising directions for future research such as transferring concepts and ideas from kernel-based anomaly detection to deep methods and vice versa.

A. Modeling Dimensions of the Anomaly Detection Problem
We identify the following five components or modeling dimensions for anomaly detection:

D1 Loss: the (scalar) loss function ℓ that is applied to the output of some model f_θ(x). Semi-supervised or supervised methods apply loss functions that also incorporate labels, but for the many unsupervised anomaly detection methods we usually have ℓ(s, y) = ℓ(s).

D2 Model: the specific model f_θ that maps an input x ∈ X to some scalar value that is evaluated by the loss. We have aligned our previous three sections along this major modeling dimension, where we covered certain groups of methods that formulate models based on common principles, namely probabilistic modeling, one-class classification, and reconstruction. Due to the close link between anomaly detection and density estimation (cf., II-B5), many of the methods formulate a likelihood model f_θ(x) = p_θ(x | D_n) with negative log-loss ℓ(s) = − log(s), that is, they have a negative log-likelihood objective, where D_n = {x_1, …, x_n} denotes the training data.

D3 Feature Map: the feature map x ↦ φ(x) that is used in a model. This could be an (implicit) feature map φ_k(x) defined by some given kernel k, for example, or an (explicit) neural network feature map φ_ω(x) that is learned and parameterized with network weights ω.

D4 Regularization: the various forms of regularization R(f, φ, θ) of the model f, the feature map φ, and their parameters θ in a broader sense. Note that θ here may include both model parameters and feature map parameters, that is, θ = (θ_f, θ_φ) in general. θ_f could be the distributional parameters of a parametric density model, for instance, and θ_φ the weights of a neural network.

D5 Inference Mode: whether a method performs Bayesian inference.

The identification of the above modeling dimensions enables us to formulate a general anomaly detection learning objective that applies to a broad range of anomaly detection methods:

min_θ  (1/n) Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + R(f, φ, θ).   ( * )

Denoting a minimizer of ( * ) by θ*, the anomaly score of a test input x̃ is computed via the model f_θ*(x̃). In the Bayesian case, when the objective in ( * ) is the negative log-likelihood of a posterior p(θ | D_n) induced by a prior distribution p(θ), we can predict in a fully Bayesian fashion via the expected model E_{θ∼p(θ | D_n)}[f_θ(x̃)]. We describe many well-known anomaly detection methods within our unified view in Table II.

B. Distance-based Anomaly Detection
Our unifying view focuses on anomaly detection methods that formulate some learning objective. Apart from these methods, there also exists a rich literature on purely 'distance-based' anomaly detection methods and algorithms that have been studied extensively in the data mining community in particular. Many of these algorithms follow a lazy learning paradigm, in which there is no a priori training phase of learning a model, but instead new test points are evaluated with respect to the training instances only as they occur. We here group these methods as 'distance-based' without further granularity, but remark that various taxonomies for these types of methods have been proposed [161], [179]. Examples of such methods include nearest-neighbor-based methods [8], [9], [404]-[406] such as LOF [10] and partitioning tree-based methods [407] such as Isolation Forest [408], [409]. These methods usually also aim to capture the high-density regions of the data in some manner, for instance by scaling distances in relation to local neighborhoods [10], and thus are mostly consistent with the formal anomaly detection problem definition presented in section II. The majority of these algorithms have been studied and applied in the original input space X. Few of them have been considered in the context of deep learning, but some hybrid anomaly detection approaches apply distance-based algorithms on top of deep neural feature maps from pre-trained networks (e.g., [410]).

VII. EVALUATION AND EXPLANATION

A. Building Anomaly Detection Benchmarks
Unlike standard supervised datasets, there is an intrinsic difficulty in building anomaly detection benchmarks: Anomalies are rare and some of them may have never been observed before they manifest themselves in practice. Existing anomaly benchmarks typically rely on one of the following strategies:

1) k-classes-out: Start from a binary or multi-class dataset and declare one or more classes to be normal and the rest to be anomalous (see the sketch after this list). Due to the semantic homogeneity of the resulting 'anomalies,' such a benchmark may not be a good simulacrum of real anomalies. For example, simple low-level anomalies (e.g., additive noise) may not be tested for.

2) Synthetic: Start from an existing supervised or unsupervised dataset and generate synthetic anomalies (e.g., [411]-[413]). Having full control over anomalies is desirable from a statistical viewpoint, to get robust error estimates. However, the characteristics of real anomalies may be unknown or difficult to generate.

3) Real-world: Consider a dataset that contains anomalies and have them labeled by a human expert. This is the ideal case. In addition to the anomaly label, the human can augment a sample with an annotation of which exact features are responsible for the anomaly (e.g., a segmentation mask in the context of image data).
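As referenced in strategy 1, a minimal sketch of building a k-classes-out benchmark from any labeled dataset (X, y); the split fraction and toy usage are illustrative assumptions.

```python
import numpy as np

def k_classes_out(X, y, normal_class=0, train_frac=0.8, seed=0):
    """Train on (a split of) the normal class only; test on the rest, with anomaly labels."""
    rng = np.random.default_rng(seed)
    normal_idx = rng.permutation(np.flatnonzero(y == normal_class))
    n_train = int(train_frac * len(normal_idx))
    train_idx = normal_idx[:n_train]
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    y_test = (y[test_idx] != normal_class).astype(int)   # 1 = anomaly, 0 = normal
    return X[train_idx], X[test_idx], y_test

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 3, size=100)  # stand-in labeled data
X_train, X_test, y_test = k_classes_out(X, y, normal_class=0)
```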
We provide examples of anomaly detection benchmarks and datasets falling into these three categories in Table III. Although all three approaches are capable of producing anomalous data, we note that real anomalies may exhibit much wider and finer variations compared to those in the dataset. In adversarial cases, anomalies may be designed maliciously to avoid detection (e.g., in fraud and cybersecurity scenarios [203], [335], [414]- [417]).

B. Evaluating Anomaly Detectors
Most applications come with different costs for false alarms (type I error) and missed anomalies (type II error). Hence, it is common to consider the decision function

decide(x) = 'anomaly' if s(x) ≥ τ, and 'normal' if s(x) < τ,

where s denotes the anomaly score, and adjust the decision threshold τ in a way that (i) minimizes the costs associated to the type I and type II errors on the collected validation data, or (ii) accommodates the hard constraints of the environment in which the anomaly detection system will be deployed.
To illustrate this, consider an example in financial fraud detection: anomaly alarms are typically sent to a fraud analyst who must decide whether to open an investigation into the potentially fraudulent activity. There is typically a fixed number of analysts. Suppose they can only handle k alarms per day, that is, the k examples with the highest predicted anomaly score. In this scenario, the measure to optimize is the 'precision@k', since we want to maximize the number of anomalies contained in those k alarms.
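A small sketch of the precision@k measure from this scenario, counting the fraction of true anomalies among the k highest-scoring alarms (toy arrays are illustrative):

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    """scores: anomaly scores; y_true: 1 for anomaly, 0 for normal."""
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return y_true[top_k].mean()

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
y_true = np.array([1, 0, 0, 0, 1])
print(precision_at_k(scores, y_true, k=2))  # 2 alarms, 1 true anomaly -> 0.5
```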
In contrast, consider a credit card company that places an automatic hold on a credit card when an anomaly alarm is reported. False alarms result in angry customers and reduced revenue, so the goal is to maximize the number of true alarms subject to a constraint on the percentage of false alarms. The corresponding measure is to maximize 'recall@k', where k is the number of false alarms.
However, it is often the case that application-related costs and constraints are not fully specified or vary over time. With such restrictions, it is desirable to have a measure that evaluates the performance of anomaly detection models under a broad range of possible application scenarios, or analogously, a broad range of decision thresholds τ . The Area Under the ROC Curve (AUROC or simply AUC) computes the fraction of detected anomalies, averaged over the full range of decision thresholds. AUC is the standard performance measure used in anomaly detection [429], [433]- [436]. Another commonly employed measure is the Area Under the Precision-Recall Curve (AUPRC) [199].
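Both threshold-free measures are readily computed with scikit-learn (toy labels and scores shown for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_test = np.array([0, 0, 0, 1, 1])              # 1 = anomaly
scores = np.array([0.1, 0.4, 0.2, 0.9, 0.7])    # higher = more anomalous

auroc = roc_auc_score(y_test, scores)            # area under the ROC curve
auprc = average_precision_score(y_test, scores)  # area under the precision-recall curve
```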

C. A Comparison on MNIST-C and MVTec-AD
In the following, we apply the AUC measure to compare a selection of anomaly detection methods from the three major approaches (probabilistic, one-class, reconstruction) and three types of feature representation (raw input, kernel, and neural network). We perform the comparison on the synthetic MNIST-C and real-world MVTec-AD datasets. MNIST-C is MNIST extended with a set of fifteen types of corruptions (e.g., blurring, added stripes, impulse noise). MVTec-AD consists of fifteen image sets from industrial production, where anomalies correspond to manufacturing defects. These image sets cover textures (e.g., wood, grid) as well as objects (e.g., toothbrush, screw). For MNIST-C, models are trained on the standard MNIST training set and then tested on each corruption separately. We measure the AUC separating the corrupted from the uncorrupted test set. For MVTec-AD, we train distinct models on each of the fifteen image sets and measure the AUC on the corresponding test set. Results for each model are shown in Tables IV and V. We provide the training details of each model in Appendix B. A first striking observation is the heterogeneity in performance of the various methods on the different corruptions and defect classes. For example, the AGAN performs generally well on MNIST-C but is systematically outperformed by the Deep One-Class Classification model (DOCC) on MVTec-AD. Also, the more powerful nonlinear models are not better on every class, and simple 'shallow' models occasionally outperform their deeper counterparts. For instance, the simple Gaussian model reaches top performance on MNIST-C:Spatter, linear PCA ranks highest on MVTec-AD:Toothbrush, and KDE ranks highest on MVTec-AD:Wood. The fact that some of the simplest models sometimes perform well highlights the strong differences in modeling structure of each anomaly detection model. However, what is still unclear is whether the measured model performance faithfully reflects the performance on a broader set of anomalies (i.e., the generalization performance) or whether some methods only benefit from the specific (possibly non-representative) types of anomalies that have been collected in the test set. In other words, assuming that all models achieve 100% test accuracy (e.g., on MNIST-C:Stripes), can we conclude that all models will perform well on a broader range of such anomalies? This problem has already been highlighted in the context of supervised learning, where explanation methods can be applied to uncover potential hidden weaknesses of models, also known as 'Clever Hanses' [244].

D. Explaining Anomalies
To gain further insight into the detection strategies used by different anomaly models, and in turn to also address some of the limitations of classical validation procedures, many practitioners wish to augment anomaly predictions with an 'explanation.' Producing explanations of model predictions is already common in supervised learning, and this field is often referred to as Explainable AI (or XAI) [245]. Popular XAI methods include LIME [437], (Guided) Grad-CAM [438], integrated gradients [439], [440], and Layer-wise Relevance Propagation (LRP) [441]. Grad-CAM and LRP rely on the structure of the network to produce a robust explanation.
Explainable AI has recently also been brought to unsupervised learning and, in particular, anomaly detection [38], [322], [326], [442]-[444]. Unlike supervised learning, which is largely dominated by neural networks [81], [84], [445], state-of-the-art methods for unsupervised learning are much more heterogeneous, including neural networks but also kernel-based, centroid-based, or probability-based models. In such a heterogeneous setting, it is difficult to build explanation methods that allow for a consistent comparison of detection strategies of the multiple anomaly detection models. Two directions to achieve such consistent explanations are particularly promising:

1) Model-agnostic explanation techniques (e.g., sampling-based) that apply transparently to any model, whether it is a neural network or something different (e.g., [442]).

2) A conversion of non-neural network models into functionally equivalent neural networks, or 'neuralization', so that existing approaches for explaining neural networks, e.g., LRP [441], can be applied [322], [444].
In the following, we demonstrate a neuralization approach. It has been shown that numerous anomaly detection models, in particular kernel-based models such as KDE or one-class SVMs, can be rewritten as strictly equivalent neural networks [322], [444]. Examples of neuralized models are shown in Fig. 8. They typically organize into a three-layer architecture, from left to right: feature extraction, distance computation, and pooling. For KDE, for instance, the anomaly score can be written as

−log p(x) = smin_j { ||x − x_j||^2 / (2σ^2) } + const,

where smin is a soft min-pooling of the type logsumexp.
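A minimal sketch of this neuralized view for KDE, in which the anomaly score is a distance layer followed by soft min-pooling via logsumexp (bandwidth and toy data are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def kde_anomaly_score(X_train, x, sigma=1.0):
    """Neuralized KDE score: distance layer, then soft min-pooling (logsumexp)."""
    dist2 = np.sum((X_train - x) ** 2, axis=1)           # distance computation layer
    # -log( (1/n) sum_j exp(-||x - x_j||^2 / (2 sigma^2)) ), up to the Gaussian
    # normalization constant; equals a soft min over the scaled distances
    return -(logsumexp(-dist2 / (2 * sigma**2)) - np.log(len(X_train)))

X_train = np.random.default_rng(0).normal(size=(100, 2))
print(kde_anomaly_score(X_train, np.array([4.0, 4.0])))  # far-away point scores high
```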
Once the model has been converted into a neural network, we can apply explanation techniques such as LRP [441] to produce an explanation of the anomaly prediction. In this case, the LRP algorithm will take the score at the output of the model, propagate to 'winners' in the pool, then assign the score to directions in the input or feature space that contribute the most to the distance, and if necessary propagate the signal further down the feature hierarchy (cf., the Supplement of [322] for how this is done exactly). Fig. 9 shows from left to right an anomaly from the MNIST-C dataset, the ground-truth explanation (the squared difference between the digit before and after corruption) as well as LRP explanations for three anomaly detection models (KDE, DOCC, and AE).

Fig. 9. Explaining anomaly prediction (panels from left to right: input, ground truth, KDE, DOCC, AE): Highlighting the input features that are most relevant for the prediction helps to understand the model's decision strategy, here on MNIST-C:Stripes.
Although all models predict accurately on the stripe data, the strategies are very different: The kernel density estimator highlights the anomaly, but also some regions of the digit itself. The deep one-class classifier strongly emphasizes vertical edges. The autoencoder produces a result similar to KDE but with decision artifacts in the corners of the image and on the digit itself.
From these observations, it is clear that each model, although predicting with 100% accuracy on the current data, will have different generalization properties and vulnerabilities when encountering subsequent anomalies. (In section VIII-B we will work through an example showing how explanations can help to diagnose and improve a detection model.) To conclude, we emphasize that a standard quantitative evaluation can be imprecise or even misleading when the available data is not fully representative, and in that case, explanations can be produced to more comprehensively assess the quality of an anomaly detection model.

VIII. WORKED-THROUGH EXAMPLES
In this section, we work through two specific, real-world examples to exemplify the modeling and evaluation process and provide some best practices.

A. Example 1: Thyroid Disease Detection
In the first example, our goal is to learn a model to detect thyroid gland dysfunctions such as hyperthyroidism. The Thyroid dataset (available from the ODDS Library [430] at http://odds.cs.stonybrook.edu/) includes n = 3772 data instances with D = 6 real-valued features. It contains a total of 93 (∼2.5%) anomalies. For a quantitative evaluation, we consider a dataset split of 60:10:30 corresponding to the training, validation, and test sets respectively, while preserving the ratio of ∼2.5% anomalies in each of the sets.
We choose the OC-SVM [6] with standard RBF kernel k(x,x) = exp(−γ x −x 2 ) as a method for this task since the data is real-valued, low-dimensional, and the OC-SVM scales sufficiently well for this comparatively small dataset. In addition, the ν-parameter formulation (cf., Eq. (20)) enables us to use our prior knowledge and thus approximately control the false alarm rate α and with it implicitly also the miss rate, which leads to our first recommendation:

Assess the risks of false alarms and missed anomalies
Calibrating the false alarm rate and miss rate of a detection model can decide over life and death in a medical context such as disease detection. Though the consequences need not always be as dramatic as in a medical setting, it is important to carefully consider the risks and costs involved with type I and type II errors in advance. In our example, a false alarm would suggest a thyroid dysfunction although the patient is healthy. On the other hand, a missed alarm would occur if the model recognizes a patient with a dysfunction as healthy. Such asymmetric risks, with a greater expected loss for anomalies that go undetected, are very common in medical diagnosis [446]-[449]. Given only D = 6 measurements per data record, we therefore seek to learn a detector with a miss rate ideally close to zero, at the cost of an increased false alarm rate. Patients falsely ascribed with a dysfunction by such a detector could then undergo further, more elaborate clinical testing to verify the disease. Assuming our data is representative and ∼12% of the population is at risk of thyroid dysfunction, we choose a slightly higher ν = 0.15 to further increase the robustness against potential data contamination (here, the training set contains ∼2.5% contamination in the form of unlabeled anomalies). We then train the model and choose the kernel scale γ according to the best AUC we observe on the small, labeled validation set which includes 9 labeled anomalies. We select γ from γ ∈ {(2^i D)^{−1} | i = −5, …, 5}, that is, from a log_2 span that accounts for the dimensionality D.
Following the above, we observe a rather poor best validation set AUC of 83.9% at γ = (2^{−5} D)^{−1}, which is the largest value from the hyperparameter range. This is an indication that we forgot an important preprocessing step, namely:

Apply feature scaling to normalize value ranges

Any method, including kernel methods, that relies on computing distances requires the features to be scaled to similar ranges to prevent features with wider value ranges from dominating the computed distances. If this is not done, anomalies that deviate on smaller-scale features can go undetected. Similar reasoning also holds for clustering and classification (see, e.g., the discussion in [450]). Min-max normalization or standardization are common choices, but since we assume there may be some contamination, we apply robust feature scaling via the median and interquartile range. Remember that scaling parameters should be computed using only information from the training data and then applied to all of the data. After we have scaled the features, we observe a much improved best validation set AUC of 98.6% at γ = (2^2 D)^{−1}. The so-trained and selected model finally achieves a test set AUC of 99.2%, a false alarm rate of 14.8% (i.e., close to our a priori specified ν = 0.15), and a miss rate of zero.
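A hedged sketch of the resulting pipeline in scikit-learn: robust scaling via median and interquartile range, followed by an RBF OC-SVM with ν = 0.15 and the selected γ = (2^2 D)^{−1}; dataset loading is assumed, so stand-in data is used here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import OneClassSVM

D = 6
X_train = np.random.default_rng(0).normal(size=(2263, D))  # stand-in for the training split

gamma = 1.0 / (2**2 * D)                    # the selected (2^2 * D)^{-1} from the text
model = make_pipeline(
    RobustScaler(),                         # median/IQR scaling, fit on training data only
    OneClassSVM(kernel="rbf", gamma=gamma, nu=0.15),
).fit(X_train)

scores = -model.decision_function(X_train)  # higher score = more anomalous
```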

B. Example 2: MVTec Industrial Inspection
For our second example, we consider the task of detecting anomalies in wood images drawn from the MVTec-AD dataset. Unlike the first worked-through example, the MVTec data is high-dimensional and corresponds to arrays of pixel values. Hence, all input features are already on a similar scale (between −1 and +1) and therefore we do not need to apply feature rescaling.
Following the standard model training / validation procedure, we train a set of models on the training data, select their hyperparameters on held-out data (e.g., a few inliers and anomalies extracted from the test set), and then evaluate their performance on the remaining part of the test set. The AUC performance of the nine models in our benchmark is shown in Table VI. We observe that the best performing model is the kernel density estimator (KDE). This is particularly surprising, because this model does not compute the kinds of higher-level image features that deep models, such as DOCC, learn and apply. Examination of the dataset shows that the anomalies involve properties such as small perforations and stains that do not require high-level semantic information to be detected. But is that the only reason why the KDE performance is so high? In order to get insight into the strategy used by KDE to arrive at its prediction, we employ the neuralization/LRP approach presented in section VII-D.
Apply XAI to analyze model predictions

Fig. 10 shows an example of an image along with its ground-truth pixel-level anomaly as well as the computed pixel-wise explanation for KDE.
Ideally, we would like the model to make its decision based on the actual anomaly (here, the three drill holes), and therefore, we would expect the ground-truth annotation and the KDE explanation to coincide. However, it is clear from inspection of the explanation that KDE is not looking at the true cause of the anomaly and is looking instead at the vertical stripes present everywhere in the input image. This discrepancy between the explanation and the ground truth can be observed on other images of the 'wood' class.

Fig. 10. Input image, ground-truth source of anomaly (here, three drill holes), and explanation of the KDE anomaly prediction. The KDE model assigns high relevance to the wood stripes instead of the drill holes. This discrepancy between ground truth and model explanation reveals a 'Clever Hans' strategy used by the KDE model.
The high AUC score of KDE thus must be due to a spurious correlation in the test set between the reaction of the model to these stripes and the presence of anomalies. We call this a 'Clever Hans' effect [244], because just like the horse Clever Hans, the model appears to work because of a spurious correlation. Obviously, the KDE model is unlikely to generalize well when the anomalies and the stripes become decoupled (e.g., as we observe more data or under some adversarial manipulation). This illustrates the importance of generating explanations to identify these kinds of failures. Once we have identified the problem, how can we change our anomaly detection strategy so that it is more robust and generalizes better?

Improve the model based on explanations
In practice, there are various approaches to improve the model based on explanation feedback:

1) Data extension: We can extend the data with missing training cases, e.g., anomalous wood examples that lack stripes or normal wood examples that have stripes, to break the spurious correlation between stripes and anomalies. When further data collection is not possible, synthetic data extension schemes such as blurring or sharpening can also be considered.

2) Model extension: If the first approach is not sufficient, or if the model is simply not capable of implementing the necessary prediction structure, the model itself can be changed (e.g., using a more flexible deep model). In other cases, the model may have enough representation power but is statistically inefficient (e.g., subject to the curse of dimensionality). In that case, adding structure (e.g., convolutions) or regularization can also help to learn a model with an appropriate prediction strategy.

3) Ensembles: If all considered models have their own strengths and weaknesses, ensemble approaches can be considered. Ensembles have a conceptual justification in the context of anomaly detection [322], and they have been shown to work well empirically [451], [452].
Once the model has been improved based on these strategies, explanations can be recomputed and examined to verify that the decision strategy has been corrected. If that is not the case, the process can be iterated until we reach a satisfactory model.

IX. CONCLUDING REMARKS, OPEN CHALLENGES, AND FUTURE RESEARCH PATHS
Anomaly detection is a blossoming field of broad theoretical and practical interest across the disciplines. In this work, we have given a review of the past and present state of anomaly detection research, established a systematic unifying view, and discussed many practical aspects. While we have included some of our own contributions, we hope that we have fulfilled our aim of providing a balanced and comprehensive snapshot of this exciting research field. Focus was given to a solid theoretical basis, which then allowed us to put today's two main lines of development into perspective: the more classical kernel world and the more recent world of deep learning and representation learning for anomaly detection.
We will conclude our review by turning to what lies ahead. Below, we highlight some critical open challenges, of which there are many, and identify a number of potential avenues for future research that we hope will provide useful guidance.

A. Unexplored Combinations of Modeling Dimensions
As can be seen in Fig. 1 and Table II, there is a zoo of different anomaly detection algorithms that have historically been explored along various dimensions. This review has shown conceptual similarities between anomaly detection members from kernel methods and deep learning. Note, however, that the exploration of novel algorithms has been substantially different in both domains, which offers unique possibilities to explore new methodology: steps that have been pursued in kernel learning but not in deep anomaly detection could be transferred (or vice versa), and powerful new directions, corresponding to novel combinations in our unified view in Fig. 1, could emerge.
Let us now discuss some specific opportunities to clarify this point. Consider the problem of robustness to noise and contamination. For shallow methods, the problem is well studied, and we have many effective methods [5], [253], [316], [359], [361], [362]. In deep anomaly detection, very little work has addressed this problem. A second example is the application of Bayesian methods. Bayesian inference has been mostly considered for shallow methods [315], [350], owing to the prohibitive cost or intractability of exact Bayesian inference in deep neural networks. Recent progress in approximate Bayesian inference and Bayesian neural networks [403], [453]-[456] raises the possibility of developing methods that complement anomaly scores with uncertainty estimates or uncertainty estimates of their respective explanations [457]. In the area of semi-supervised anomaly detection, ideas have already been successfully transferred from kernel learning [183], [228] to deep methods [143] for one-class classification. But probabilistic and reconstruction methods that can make use of labeled anomalies remain unexplored. For time series anomaly detection [169], [200]-[202], where forecasting (i.e., conditional density estimation) models are practical and widely deployed, semi-supervised extensions of such methods could lead to significant improvements in applications in which some labeled examples are available (e.g., learning from failure cases in monitoring tasks). Concepts from density ratio estimation [458] or noise contrastive estimation [459] could lead to novel semi-supervised methods in principled ways. Finally, active learning strategies for anomaly detection [334]-[337], which identify informative instances for labeling, have primarily been explored only for shallow detectors and could be extended to deep learning approaches. This is a partial list of opportunities that we have noticed. Further analysis of our framework will likely expose additional directions for innovation.

B. Bridging Related Lines of Research on Robustness
Other recent lines of research on robust deep learning are closely related to anomaly detection or may even be interpreted as special instances of the problem. These include out-of-distribution detection, model calibration, uncertainty estimation, and adversarial examples or attacks. Bridging these lines of research by working out the nuances of the specific problem formulations can be insightful for connecting concepts and transferring ideas to jointly advance research.
A basic approach to creating robust classifiers is to endow them with the ability to reject input objects that are likely to be misclassified. This is known as the problem of classification with a reject option, and it has been studied extensively [460]- [466]. However, this work focuses on objects that fall near the decision boundary where the classifier is uncertain.
Recent work has begun to address other reasons for rejecting an input object. Out-of-distribution (OOD) detection considers cases where the object is drawn from a distribution different from the training distribution P + [470], [472], [474]- [477]. From a formal standpoint, it is impossible to determine whether an input x is drawn from one of two distributions P 1 and P 2 if both distributions have support at x. Consequently, the OOD problem reduces to determining whether x lies outside regions of high density in P + , which is exactly the anomaly detection problem we have described in this review.
A second reason to reject an input object is because it belongs to a class that was not part of the training data. This is the problem of open set recognition. Such objects can also be regarded as being generated by a distribution P − , so this problem also fits within our framework and can be addressed with the algorithms described here. Nonetheless, researchers have developed a separate set of methods for open set recognition [238], [478]- [481], and an important goal for future research is to evaluate these methods from the anomaly detection perspective and to evaluate anomaly detection algorithms from the open set perspective.
In rejection, out-of-distribution, and open set recognition problems, there is an additional source of information that is not available in standard anomaly detection problems: the class labels of the objects. Hence, the learning task combines classification with anomaly detection. Formally, the goal is to train a classifier on labeled data (x_1, y_1), …, (x_n, y_n) with class labels y ∈ {1, …, k} while also developing some measure to decide whether an unlabeled test point x̃ should be rejected (for any of the reasons listed above). The class label information tells us about the structure of P+ and allows us to model it as a joint distribution P+ ≡ P_{X,Y}. Methods for rejection, out-of-distribution, and open set recognition all take advantage of this additional structure. Note that the labels y are different from the labels that mark normal or anomalous points in supervised or semi-supervised anomaly detection (cf., section II-C).
Research on the unresolved and fundamental issue of adversarial examples and attacks [482]-[491] is related to anomaly detection as well. We may interpret adversarial attacks as extremely hard-to-detect out-of-distribution samples [454], as they are specifically crafted to target the decision boundary and confidence of a learned classifier. Standard adversarial attacks find a small perturbation δ for an input x so that x̃ = x + δ yields some class prediction desired by the attacker. For instance, a perturbed image of a dog may be indistinguishable from the original to the human eye, yet the predicted label changes from 'dog' to 'cat'. Note that such an adversarial example x̃ still likely is (and probably should be) normal under the data marginal P_X (an imperceptibly perturbed image of a dog shows a dog after all!), but the pair (x̃, 'cat') should be anomalous under the joint P_{X,Y} [492]. Methods for OOD detection have been found to also increase adversarial robustness [154], [454], [477], [493], [494], some of which model the class conditional distributions for detection [476], [492], for the reason just described.
The above highlights the connection of these lines of research towards the general goal of robust deep models. Hence, we believe that connecting ideas and concepts in these lines (e.g., the use of spherical losses in both anomaly detection [136], [156] and OOD [493], [495]) may help them to advance together. Finally, the assessment of the robustness of neural networks and their fail-safe design and integration are topics of high practical relevance that have recently found their way into international standardization initiatives (e.g., ITU/WHO FG-AI4H, ISO/IEC CD TR 24029-1, or IEEE P7009). Beyond doubt, understanding the brittleness of deep networks (also in the context of their explanations [496]) will also be critical for their adoption in anomaly detection applications that involve malicious attackers such as fraudsters or network intruders.

C. Interpretability and Trustworthiness
Much of anomaly detection research has been devoted to developing new methods that improve detection accuracy. In most applications, however, accuracy alone is not sufficient [322], [497], and further criteria such as interpretability (e.g., [243], [498]) and trustworthiness [456], [499], [500] are equally critical, as demonstrated in sections VII and VIII. For researchers and practitioners alike [501], it is vital to understand the underlying reasons for how a specific anomaly detection model reaches a particular prediction. Interpretable, explanatory feedback enhances model transparency, which is indispensable for accountable decision-making [502], for uncovering model failures such as Clever Hans behavior [244], [322], and for understanding model vulnerabilities that can be insightful for improving a model or system. This is especially relevant in safety-critical environments [503], [504]. Existing work on interpretable anomaly detection has considered finding subspaces of anomaly-discriminative features [442], [505]-[509], deducing sequential feature explanations [443], the use of feature-wise reconstruction errors [57], [189], utilizing fully convolutional architectures [326], integrated gradients [38], and explaining anomalies via LRP [322], [444]. In relation to the vast body of literature, though, research on interpretability and trustworthiness in anomaly detection has seen comparatively little attention. The fact that anomalies may not share similar patterns (i.e., the heterogeneity of anomalies) poses a challenge for their explanation, which also distinguishes this setting from interpreting supervised classification models. Furthermore, anomalies might arise due to the presence of abnormal patterns, but conversely also due to a lack of normal patterns. While for the first case an explanation that highlights the abnormal features is satisfactory, how should an explanation for missing features be conceptualized? For example, given the MNIST dataset of digits, what should an explanation of an anomalous all-black image be? The matters of interpretability and trustworthiness become more pressing as the task and data become more complex. Effective solutions of complex tasks will necessarily require more powerful methods, for which explanations become generally harder to interpret. We thus believe that future research in this direction will be imperative.

D. The Need for Challenging and Open Datasets
Challenging problems with clearly defined evaluation criteria on publicly available benchmark datasets are invaluable for measuring progress and moving a field forward. The significance of the ImageNet database [510], together with corresponding competitions and challenges [511], for progressing computer vision and supervised deep learning in the last decade gives a prime example of this. Currently, the standard evaluation practices in deep anomaly detection [129], [134], [136], [140], [143], [148], [153]-[156], [325], [512], out-of-distribution detection [269], [470], [474]-[477], [513], [514], and open set recognition [238], [478]-[481] still extensively repurpose classification datasets by deeming some dataset classes to be anomalous or considering in-distribution vs. out-of-distribution dataset combinations (e.g., training a model on Fashion-MNIST clothing items and regarding MNIST digits to be anomalous). Although these synthetic protocols have some value, it has been questioned how well they reflect real progress on challenging anomaly detection tasks [199], [320]. Moreover, we find the tendency that only a few methods seem to dominate most of the benchmark datasets in the work cited above alarming, since it suggests a bias towards evaluating only the upsides of newly proposed methods, while often critically leaving out an analysis of their downsides and limitations. This situation suggests a lack of diversity in the current evaluation practices and the benchmarks being used. In the spirit of all models are wrong [515], we stress that more research effort should go into studying when and how certain models are wrong and behave like Clever Hanses. We need to understand the trade-offs that different methods make. For example, some methods are likely making a trade-off between detecting low-level vs. high-level, semantic anomalies (cf., section II-B2 and [199]). The availability of more diverse and challenging datasets would be of great benefit in this regard. Recent datasets such as MVTec-AD [189] and competitions such as the Medical Out-of-Distribution Analysis Challenge [422] provide excellent examples, but the field needs many more challenging open datasets to foster progress.

E. Weak Supervision and Self-Supervised Learning
The bulk of anomaly detection research has studied the problem in the absence of any kind of supervision, that is, in an unsupervised setting (cf., section II-C2). Recent work suggests, however, that significant performance improvements on complex detection tasks seem achievable through various forms of weak supervision and self-supervised learning.
Weak supervision or weakly supervised learning describes learning from imperfectly or scarcely labeled data [516]-[518]. Labels might be inaccurate (e.g., due to labeling errors or uncertainty) or incomplete (e.g., covering only a few normal modes or specific anomalies). Current work on semi-supervised anomaly detection indicates that including even only a few labeled anomalies can already yield remarkable performance improvements on complex data [61], [143], [320], [324], [326], [519]. A key challenge here is to formulate and optimize such methods so that they generalize well to novel anomalies. Combining these semi-supervised methods with active learning techniques helps to identify informative candidates for labeling [334]-[337]. It is an effective strategy for designing anomaly detection systems that continuously improve via expert feedback loops [443], [520]. This approach has not yet been explored for deep detectors, though. Outlier exposure [325], that is, using massive amounts of publicly available data from some domains (e.g., stock photos for computer vision or the English Wikipedia for NLP) as auxiliary negative samples (cf., section IV-E), can also be viewed as a form of weak supervision (imperfectly labeled anomalies). Though such negative samples may not coincide with ground-truth anomalies, we believe such contrasting can be beneficial for learning characteristic representations of normal concepts in many domains (e.g., using auxiliary log data to accurately characterize the normal logs of a specific computer system [521]). So far, this has been little explored in applications. Transfer learning approaches to anomaly detection also follow the idea of distilling more domain knowledge into a model, for example, by using and possibly fine-tuning pre-trained (supervised) models [138], [141], [322], [410], [522]. Overall, weak forms of supervision or domain priors may be essential for achieving effective solutions in semantic anomaly detection tasks that involve high-dimensional data, as has also been found in other unsupervised learning tasks such as disentanglement [209], [523], [524]. Hence, we think that developing effective methods for weakly supervised anomaly detection will contribute to advancing the state of the art.
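As a rough illustration of the outlier exposure idea, the following PyTorch sketch adds a term that pushes auxiliary negative samples toward a uniform softmax; the loss weight `lam` and the random tensors are placeholders, and the cited papers' exact objectives differ.

```python
# Sketch of an outlier-exposure-style objective: auxiliary data that is merely
# unlikely to be normal is pushed toward a uniform prediction, which sharpens
# the model's concept of normality.
import torch
import torch.nn.functional as F

def oe_loss(logits_in, targets_in, logits_out, lam=0.5):
    # Standard loss on (pseudo-)labeled in-distribution data ...
    loss_in = F.cross_entropy(logits_in, targets_in)
    # ... plus cross-entropy to the uniform distribution on auxiliary outliers,
    # which equals (up to a constant) the negative mean log-softmax over classes.
    loss_out = -F.log_softmax(logits_out, dim=1).mean()
    return loss_in + lam * loss_out

# Smoke test with random tensors standing in for real batches.
logits_in = torch.randn(8, 10)
logits_out = torch.randn(8, 10)
targets_in = torch.randint(0, 10, (8,))
print(oe_loss(logits_in, targets_in, logits_out))
```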
Self-supervised learning describes the learning of representations through solving auxiliary tasks, for example, next sentence and masked words prediction [111], future frame prediction in videos [525], or the prediction of transformations applied to images [526] such as colorization [527], cropping [528], [529], or rotation [530]. These auxiliary prediction tasks do not require (ground-truth) labels for learning and can thus be applied to unlabeled data, which makes self-supervised learning particularly appealing for anomaly detection. Self-supervised methods that have been introduced for visual anomaly detection train multi-class classification models based on pseudo labels that correspond to various geometric transformations (e.g., flips, translations, rotations, etc.) [153]-[155]. An anomaly score can then be derived from the softmax activation statistics of such a classifier, assuming that a high prediction uncertainty (close to uniform) indicates anomalies. These methods have shown significant performance improvements on the common k-classes-out image benchmarks (cf., Table III). Bergman and Hoshen [156] have recently proposed a generalization of this idea to non-image data, called GOAD, which is based on random affine transformations. We can identify GOAD and self-supervised methods based on geometric transformations (GT) as classification-based approaches within our unifying view (cf., Table II). In a broader context, the interesting question will be to what extent self-supervision can facilitate the learning of semantic representations. There is some evidence that self-supervised learning helps to improve the detection of semantic anomalies and thus exhibits inductive biases towards semantic representations [199]. On the other hand, there also exists evidence showing that self-supervision mainly improves the learning of effective feature representations for low-level statistics [531]. Hence, this research question remains to be answered, but it bears great potential for many domains where large amounts of unlabeled data are available.
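A minimal sketch of this transformation-prediction scheme follows, assuming 90-degree rotations as the transformation set and a toy classifier; it is meant to show the pseudo-label construction and softmax-based scoring, not the cited architectures or training setups.

```python
# Sketch of self-supervision via geometric transformations: build a 4-way
# rotation-prediction task from unlabeled images, train a classifier on it,
# and score anomalies via its softmax statistics.
import torch
import torch.nn as nn

def rotations(x):
    # x: (N, C, H, W) -> (4N, C, H, W) with pseudo-labels 0..3 (k * 90 degrees)
    xs = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.shape[0])
    return torch.cat(xs), labels

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_train = torch.randn(32, 1, 28, 28)         # stands in for normal images
for _ in range(10):                          # tiny illustrative training loop
    xt, yt = rotations(x_train)
    loss = nn.functional.cross_entropy(model(xt), yt)
    opt.zero_grad()
    loss.backward()
    opt.step()

def anomaly_score(x):
    # Average negative log-probability assigned to the correct rotation;
    # a near-uniform softmax (high score) is taken to indicate an anomaly.
    with torch.no_grad():
        xt, yt = rotations(x)
        logp = nn.functional.log_softmax(model(xt), dim=1)
        return -logp[torch.arange(len(yt)), yt].view(4, -1).mean(dim=0)

print(anomaly_score(torch.randn(8, 1, 28, 28)))
```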

F. Foundation and Theory
The recent progress in anomaly detection research has also raised more fundamental questions. These include open questions about the out-of-distribution generalization properties of the various methods presented in this review, the definition of anomalies in high-dimensional spaces, and information-theoretic interpretations of the problem.
Nalisnick et al. [269] have recently observed that deep generative models (DGMs) such as normalizing flows, VAEs, or autoregressive models (cf., section III) can often assign higher likelihood to anomalies than to in-distribution samples. For example, models trained on Fashion-MNIST clothing items can systematically assign higher likelihood to MNIST digits [269]. This counter-intuitive finding, which has been replicated in subsequent work [149], [260], [325], [513], [514], [532], revealed that there is a critical lack of theoretical understanding of these models. Solidifying evidence [513], [514], [533], [534] indicates that one reason seems to be that the likelihood in current DGMs is still largely biased towards low-level background statistics. Consequently, simpler data points attain higher likelihood (e.g., MNIST digits under models trained on Fashion-MNIST, but not vice versa). Another critical remark in this context is that for (truly) high-dimensional data, the region of highest likelihood need not coincide with the region of highest probability mass (called the 'typical set'), that is, the region where data points most likely occur [532]. For instance, while the highest density of a D-dimensional standard Gaussian is attained at the origin, points sampled from the distribution concentrate around an annulus of radius √D for large D [535]. Therefore, points close to the origin have high density, but are very unlikely to occur. This mismatch questions the standard theoretical density (level set) problem formulation (cf., section II-B) and the use of likelihood-based anomaly detectors in some settings. Hence, theoretical research aimed at understanding the above phenomenon and DGMs themselves presents an exciting research opportunity.
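The concentration phenomenon is easy to verify numerically; the following snippet estimates the norm of standard Gaussian samples for growing D and shows both the √D scaling and the shrinking relative spread.

```python
# Numerical check of the typical-set argument: samples from a D-dimensional
# standard Gaussian concentrate on an annulus of radius ~ sqrt(D), far from
# the highest-density point at the origin.
import numpy as np

rng = np.random.default_rng(0)
for D in (2, 100, 10000):
    norms = np.linalg.norm(rng.normal(size=(5000, D)), axis=1)
    print(f"D={D:>5}: mean ||x|| = {norms.mean():8.2f}"
          f"  (sqrt(D) = {np.sqrt(D):8.2f}),"
          f"  relative std = {norms.std() / norms.mean():.3f}")
```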
Similar observations suggest that reconstruction-based models can systematically reconstruct simpler out-of-distribution points well when these sit within the convex hull of the data. For example, an anomalous all-black image can be well reconstructed by an autoencoder trained on MNIST digits [536]. An even simpler example is the perfect reconstruction of points that lie within the linear subspace spanned by the principal components of a PCA model, even in regions far away from the normal training data (e.g., along the principal component in Fig. 7). While such out-of-distribution generalization properties might be desirable for representation learning in general [537], such behavior can be critically undesirable for anomaly detection. Therefore, we stress that more theoretical research on understanding such out-of-distribution generalization properties or biases, especially for more complex models, will be necessary.
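A small numerical illustration of the PCA case: the model assigns (near-)zero reconstruction error to any point in its principal subspace, no matter how far from the training data. The data and subspace dimension below are arbitrary choices for demonstration.

```python
# PCA perfectly reconstructs any point in its principal subspace, so its
# reconstruction error can fail to flag far-away out-of-distribution points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.1, 10)  # anisotropic data

mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
W = Vt[:2]                                   # 2-dimensional principal subspace

def recon_error(x):
    x_hat = mu + (x - mu) @ W.T @ W          # project onto the subspace
    return np.linalg.norm(x - x_hat)

far_in_subspace = mu + 1000.0 * W[0]         # extreme point along the 1st component
print(recon_error(far_in_subspace))          # ~0: "anomaly" with zero error
print(recon_error(mu + 5.0 * Vt[-1]))        # off-subspace point: large error
```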
Finally, the push towards deep learning also presents new opportunities to interpret and analyze the anomaly detection problem from different theoretical angles. For example, from the perspective of information theory [538], autoencoders can be understood as adhering to the Infomax principle [539]-[541] by implicitly maximizing the mutual information between the input and the latent code, subject to structural constraints or regularization of the code (e.g., 'bottleneck', latent prior, sparsity, etc.), via the reconstruction objective [378]. Similarly, information-theoretic perspectives of VAEs have been formulated showing that these models can be viewed as making a rate-distortion trade-off [542] when balancing the latent compression (negative rate) and reconstruction accuracy (distortion) [543], [544]. This view has recently been employed to draw a connection between VAEs and Deep SVDD, where the latter can be seen as a special case that only seeks to minimize the rate (maximize compression) [545]. Overall, anomaly detection has been studied comparatively less from an information-theoretic perspective [546], [547], yet we think this could be fertile ground for building a better theoretical understanding of representation learning for anomaly detection.
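In standard VAE notation (written here in generic form, not verbatim from the cited works), the decomposition underlying this rate-distortion view is the split of the negative ELBO into a distortion and a rate term:

```latex
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right]}_{\text{distortion}}
\;+\;
\underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{rate}}
\;\geq\; -\log p_\theta(x) .
```

That is, the negative ELBO upper-bounds the negative log-likelihood by the sum of a reconstruction (distortion) term and a compression (rate) term; in this decomposition, Deep SVDD corresponds to retaining only the rate term [545].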
In conclusion, we firmly believe that anomaly detection, in all its exciting variants, will remain an indispensable practical tool in the quest to obtain robust learning models that perform well on complex data.

APPENDIX A NOTATION AND ABBREVIATIONS
For reference, we provide the notation and abbreviations used in this work in Tables VII and VIII respectively.

TABLE VII: NOTATION

R            The real numbers
D            The input data dimensionality, D ∈ N
X            The input data space, X ⊆ R^D
Y            The labels, Y = {±1} (+1: normal; −1: anomaly)
x            A vector, e.g., a data point x ∈ X
D_n          An unlabeled dataset D_n = {x_1, ..., x_n} of size n
P, p         The data-generating distribution and pdf
P+, p+       The normal data distribution and pdf
P−, p−       The anomaly distribution and pdf
p̂            An estimated pdf
ε            An error or noise distribution
supp(p)      The support of a data distribution P with density p, i.e., {x ∈ X | p(x) > 0}
A            The set of anomalies
C_α          An α-density level set
Ĉ_α          An α-density level set estimator
τ_α          The threshold τ_α ≥ 0 corresponding to C_α
c_α(x)       The threshold anomaly detector corresponding to C_α
s(x)         An anomaly score function s : X → R
1_A(x)       The indicator function for some set A
ℓ(s, y)      A loss function ℓ : R × {±1} → R
f_θ(x)       A model f_θ : X → R with parameters θ
k(x, x̃)      A kernel k : X × X → R
F_k          The RKHS or feature space of kernel k
φ_k(x)       The feature map φ_k : X → F_k of kernel k
φ_ω(x)       A neural network x → φ_ω(x) with weights ω

APPENDIX B DETAILS OF TRAINING
For PCA, we compute the reconstruction error while maintaining 90% of the variance of the training data. We do the same for kPCA, and additionally choose the kernel width such that 50% of the neighbors capture 50% of the total similarity scores. For MVE, we use the fast minimum covariance determinant estimator [299] with a default support fraction of 0.9 and a contamination rate parameter of 0.01. To facilitate MVE computation on MVTec-AD, we first reduce the dimensionality via PCA, retaining 90% of the variance. For KDE, we choose the bandwidth parameter to maximize the likelihood of a small hold-out set from the training data. For SVDD, we consider ν ∈ {0.01, 0.05, 0.1, 0.2} and select the kernel scale using a small labeled hold-out set. The deep one-class classifier applies a whitening transform on the representations after the first fully connected layer of a pre-trained VGG16 model (on MVTec-AD) or of a CNN classifier trained on the EMNIST letter subset (on MNIST-C). For the AE on MNIST-C, we use a LeNet-type encoder that has two convolutional layers with max-pooling followed by two fully connected layers that map to an encoding of 64 dimensions, and construct the decoder symmetrically. On MVTec-AD, we use an encoder-decoder architecture as presented in [130] which maps to a bottleneck of 512 dimensions. Both the encoder and the decoder here consist of four blocks, each having two 3×3 convolutional layers followed by max-pooling or upsampling respectively. We train the AE such that the reconstruction error of a small training hold-out set is minimized. For AGAN, we use the AE encoder and decoder architectures for the discriminator and generator networks respectively, and train the GAN until it converges to a stable equilibrium.
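For reference, a hedged PyTorch sketch of the MNIST-C autoencoder described above follows; channel widths and kernel sizes are our assumptions, as the text only specifies the layer counts, pooling, and the 64-dimensional code.

```python
# Sketch of a LeNet-type autoencoder: two conv + max-pool blocks and two fully
# connected layers mapping to a 64-dimensional code, with a symmetric decoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 64),                                           # 64-dim code
)
decoder = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 32 * 7 * 7), nn.ReLU(),
    nn.Unflatten(1, (32, 7, 7)),
    nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 5, padding=2), nn.ReLU(),
    nn.Upsample(scale_factor=2), nn.Conv2d(16, 1, 5, padding=2),
)

x = torch.randn(8, 1, 28, 28)                 # stands in for MNIST-C images
assert decoder(encoder(x)).shape == x.shape
# Per-sample reconstruction error serves as the anomaly score.
recon_error = ((decoder(encoder(x)) - x) ** 2).flatten(1).sum(dim=1)
print(recon_error)
```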