Improving Novelty Detection using the Reconstructions of Nearest Neighbours

We show that using nearest neighbours in the latent space of autoencoders (AE) significantly improves performance of semi-supervised novelty detection in both single and multi-class contexts. Autoencoding methods detect novelty by learning to differentiate between the non-novel training class(es) and all other unseen classes. Our method harnesses a combination of the reconstructions of the nearest neighbours and the latent-neighbour distances of a given input's latent representation. We demonstrate that our nearest-latent-neighbours (NLN) algorithm is memory and time efficient, requires no significant data augmentation, and does not rely on pre-trained networks. Furthermore, we show that the NLN algorithm is easily applicable to multiple datasets without modification. Additionally, the proposed algorithm is agnostic to autoencoder architecture and reconstruction error method. We validate our method across several standard datasets for a variety of different autoencoding architectures such as vanilla, adversarial and variational autoencoders using either reconstruction, residual or feature consistent losses. The results show that the NLN algorithm grants up to a 17% increase in Area Under the Receiver Operating Characteristic (AUROC) curve performance for the multi-class case and 8% for single-class novelty detection.


Introduction
Novelty detection is an important field of research, as identifying previously unknown behaviours in systems is critical for their maintenance and smooth operation. It is the procedure by which a model identifies new classes of data that it has not been exposed to before. Novelty detection is a far-reaching topic, having been applied extensively in fields such as manufacturing [1], cyber-security [2], biomedical analysis [3,4], astronomy [5] and many more [6].
Novelty, anomaly, outlier, abnormality and out-of-distribution (OOD) detection are closely related topics [7]. The distinction between them is vague across a variety of studies in the literature [8,9,10,3,11]. For clarity, we consider novelty detection to be the overarching paradigm, since it makes contextual sense to have novel abnormalities/anomalies/outliers but the converse does not apply.
Approaches for novelty and anomaly detection can be divided into a number of categories [12,13,6,14]. In this work we exclusively focus on autoencoder-based novelty detection, as it offers a data-agnostic method that does not rely on significant data augmentation [15], finding negative samples [16,17] or pretraining on large labelled datasets [18] such as ImageNet [19].
Autoencoders (AEs) are widely used as novelty detectors [8,20,21,9,10,22,23,24,25]. The underlying mechanism that governs the AE's detection ability is that it is first trained on data without abnormal, anomalous or outlying samples. During inference, the AE is then exposed to novel samples, which result in higher errors, thus enabling novelty detection. Methods such as mean-square-error (MSE) [22], residual error [3], structural-similarity (SSIM) [26] or feature consistency [27] are used to calculate the pixel-wise difference.
A common problem with using autoencoding methods for novelty detection is that AEs can generalise to unseen classes, thereby performing poorly as novelty detectors [28]. In [9], this issue is addressed by placing a classifier in the training path of a multi-discriminator based autoencoder, which results in a fairly complicated and costly training procedure. On the contrary, we propose the Nearest-Latent-Neighbours (NLN) algorithm, which uses the reconstructions of the nearest neighbours in the latent space of autoencoders in order to combat the aforementioned generalisation problem.
Unlike existing nearest neighbours methods [29], our NLN algorithm uses both the reconstruction error between a given sample and its neighbours in the latent space as well as the average latent distance to its neighbours. Figure 1 illustrates how a vanilla autoencoder generalises to reconstruct unseen samples, whereas the reconstructions of an input's nearest-latent-neighbours more closely resemble the non-novel training set, thereby offering improved novelty detection.
We evaluate the proposed method using the novelty detection framework described in [30] and prove its effectiveness in a two-stage testing strategy: firstly, by comparing different architectures' performance with and without the use of our NLN algorithm; secondly, by comparing our best performing model with the current state-of-the-art AEs. We show that NLN is competitive with the state-of-the-art methods across a number of datasets.

Figure 1: Comparison between the MSE and the Nearest-Latent-Neighbours (NLN) based error using a vanilla AE. The top row shows an AE trained on the non-anomalous screw images of the MVTec-AD dataset, with the anomaly in the input circled in red, and the bottom row illustrates an AE trained on all MNIST digits except for "2". The first column is the input to the AE, the second is the AE's output and the following two columns show the reconstructions of the input's NLNs. The final two columns show the difference between the MSE and the NLN-based error, where in this work maximising the error on novel classes effectively performs novelty detection. It is clear that the AE learns to reconstruct unseen classes whereas the reconstructions of the NLNs do not.
In summary, the main contributions of this paper are: (1) a novel nearest-neighbour based algorithm that harnesses the reconstruction error of a given sample's nearest-latent-neighbours and their latent-neighbour distances. (2) The formulation of the NLN algorithm applied to a variety of autoencoding architectures using several different error calculation methods. (3) Improved performance over the state-of-the-art autoencoders using NLN, a fairly simple, cheap and intuitive method, across a number of standard datasets.

Background and Related Work
In novelty, anomaly, outlier, abnormality and OOD detection one or more of the following steps are required for the detection of a novel, anomalous or outlying sample: (1) a model of the distribution of the (non-anomalous/non-novel) data. (2) A suitable measure of fitness describing whether a given sample lies within the modelled distribution. (3) A decision rule to determine whether the measure is above or below a threshold [13].
A critical distinction among all related work is whether supervised, semi-supervised, or unsupervised methods have been used [31]. Supervised methods typically relate more strongly to anomaly detection scenarios, where both the normal and anomalous data classes are known a-priori [13]. However, supervision is not applicable in many settings, as anomalous classes are either underrepresented or just not known [13]. In the unsupervised setup, we have no a-priori information if the available data contains normal or abnormal samples [31].
Semi-supervised methods are the most common in practice, as normal (non-anomalous, non-novel) data is most easily collected from most systems [6]. In this case, models are designed to represent the expected operating conditions of a system and any deviations from these are considered novel. These deviations may, in some cases, be considered anomalous; however, this is dependent on the context of operation [7]. For the remainder of the paper we focus on semi-supervised methods, where we designate particular classes of a dataset as novel and all others as expected.

Reconstruction based novelty detection
Semi-supervised reconstruction based methods leverage the fact that a model trained on the normal data cannot suitably reconstruct novel samples. In effect, the difference between the input and reconstructed output can be used as a novelty detector. Autoencoders (AE) are commonly used for reconstruction based novelty and anomaly detection [8,20,21,9,10,22,23,24,25]. They operate by jointly learning latent representations and reconstructions of the training data. Once trained, a reconstruction error can be calculated between the input sample and the model's decoded output. These models achieve improved performance when regularising the latent space [21] using variational AEs (VAE) [32] or adversarial losses [33].
It has been demonstrated that reconstruction-error based methods alone are not particularly robust to noise, changing backgrounds and viewing angles [7]. In generative autoencoding models such as VAEs, reconstruction probability or attention mechanisms are used to improve performance [21,34,25]. Furthermore, Generative Adversarial Networks (GANs) [35] are used for reconstruction-error based anomaly detection [3,36,37]. Here the residual error is calculated as the difference between the training and generated images using their intermediate representations provided by the discriminator. More recently, self-supervised learning (SSL) has been applied to AEs and offers improved performance in novelty detection by using inpainting [23] or position prediction [38] pretext tasks.
In our work we use the reconstructions of a given sample's latent neighbours in conjunction with their latent-distances. We show that this offers performance increases across a variety of architectures and datasets. Additionally, we utilise several autoencoding models and show that the NLN algorithm offers performance improvements irrespective of architecture.

Statistical methods
Statistical methods typically focus on modelling a distribution of inliers through learning their distribution parameters [10]. In effect, all expected/normal samples should lie in high-density regions of the distribution, while outliers should have a low probability under the learnt distribution. Works such as One-class SVM [39], KNN [31] and isolation forests [40] have been shown to be suitable anomaly detectors applied in this context. Furthermore, [30] demonstrates that using discriminative measures in the latent space of AEs improves accuracy over reconstruction error. Here, discriminative novelty measures such as One-class SVM and Local Outlier Factor (LOF) [41] are applied to the latent space of AEs.
In our work, we propose a hybrid approach, where we combine a nearest neighbours based approach, as typically used in distance-based anomaly detection, with a reconstruction error-based approach. This is done by considering the reconstruction error between neighbouring points in the latent space and a given input query. We show that our work enables robustness and improvements over existing state-of-the-art research.

Single and Multi-class novelty detection
In the context of deep learning, one-class (single-class) novelty detection [9,42,43,10,44,45] is the paradigm where a single class is considered normal and all other classes are novel. In practice, a model is trained on a dataset consisting of only a single class and during inference the novelty detector is exposed to all classes and should identify all unseen classes as novel.
For multi-class novelty detection, multiple classes are considered inliers and a single class is considered novel [8,20,27,36]. This is an inherently more challenging evaluation framework as the model should be able to generalise to multiple classes and still be capable of detecting novel samples. In this work we evaluate our NLN-enabled models in both Multiple-Inlier-Single-Outlier (MISO) and Single-Inlier-Multiple-Outlier (SIMO) contexts as defined by [30].

NLN: Nearest Latent Neighbours
Here we present our novelty detection framework for autoencoders. We show that using a simple addition to existing autoencoding architectures we can significantly increase their novelty detection performance.

Motivation
In [9] and [28] the limitations of using AEs for one-class novelty detection are demonstrated. They show that when an AE is trained on the relatively complex digit-8 class of the MNIST dataset [46], the AE is able to implicitly learn the representations of digit classes such as the 1, 3, 6 and 7. In effect, reconstruction-based novelty detectors are prone to misidentifying these implicitly learnt classes.
In order to solve this problem, [9] propose placing a classifier in the training path of a multi-discriminator-based AE to decrease the training signal for the reconstructions of implicitly learnt novel classes. Conversely, we show that if we consider both the distance to, and the reconstruction of, a given sample's nearest latent neighbours we can effectively mitigate this issue, as demonstrated in Figure 1.
Furthermore, we motivate our focus on AEs for novelty detection as they are applicable to a variety of datasets without significant augmentation [15,17], do not need pretraining on large labelled datasets [18,47] and require far fewer network parameters [48]. Additionally, their structure provides segmentation maps for free, without the need for many small patches [38] that result in a significantly more expensive KNN search, or additional networks for segmentation [15].

Problem formulation and approach
Consider an autoencoding model with encoder $f$ and decoder $g$, where $x$ is the input, $z$ is the input's latent representation and $\theta_f$ are the parameters of the encoder. Additionally, $\mathbb{R}^p$ is the $p$-dimensional image space and $\mathbb{R}^l$ is the $l$-dimensional latent space, such that

$$z = f(x;\theta_f), \qquad f: \mathbb{R}^p \rightarrow \mathbb{R}^l. \tag{1}$$

Now consider the decoder with an input $z$ and a reconstructed output $\hat{x}$ such that

$$\hat{x} = g(z;\theta_g), \qquad g: \mathbb{R}^l \rightarrow \mathbb{R}^p, \tag{2}$$

where $\theta_g$ are the decoder's parameters, such that the decoder maps from the $l$-dimensional latent space to the $p$-dimensional image space. The encoder and decoder pair is trained in an end-to-end manner using a loss function such as Mean-Square-Error (MSE) or Binary-Cross-Entropy (BCE). Once trained, the AE's novelty score ($\eta$) is computed for the $i$th sample using

$$\eta(x_i) = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\left(x_i^{(n,m)} - \hat{x}_i^{(n,m)}\right)^2, \tag{3}$$

where $n$ and $m$ are the pixel indexes for an image of size $N \times M$. This score is typically thresholded in order to determine whether a sample is novel, and the threshold is calculated using AUROC-based methods that are explained in more detail in Section 4.
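For concreteness, the per-sample reconstruction-error score can be sketched in numpy. This is a minimal toy illustration in which the encoder and decoder are abstracted away and two 4×4 "images" are scored directly; it is not the convolutional model used in the paper:

```python
import numpy as np

def novelty_score(x, x_hat):
    """Reconstruction-error novelty score: the mean-squared error
    over all N x M pixels, computed per sample."""
    return np.mean((x - x_hat) ** 2, axis=(1, 2))

# Toy example: two 4x4 "images"; the second is reconstructed poorly,
# as a novel sample would be.
x = np.stack([np.ones((4, 4)), np.ones((4, 4))])
x_hat = np.stack([np.ones((4, 4)), np.zeros((4, 4))])
eta = novelty_score(x, x_hat)  # eta -> [0., 1.]
```

The poorly reconstructed (novel) sample receives the higher score, which is then thresholded.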
In order to motivate our use of nearest neighbours, we assume that the high-dimensional training data is concentrated on a low-dimensional data manifold in $\mathbb{R}^l$ that we attempt to learn using an autoencoder [49]. The learnt manifold is illustrated in Figure 2. Here we demonstrate that closely-connected regions on the learnt manifold contain points similar to non-anomalous inputs and dissimilar to those which are novel. We exploit this fact to improve the robustness of the anomaly score by including the nearest-latent-neighbours in the reconstruction error. This is done by including the neighbours of the $i$th test sample in the latent space $\mathbb{R}^l$ in the calculation of the novelty score ($\eta_{nln}$), such that

$$\eta_{nln}(x_i) = \frac{\alpha}{K}\sum_{k=1}^{K}\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\left(x_i^{(n,m)} - \hat{x}_i^{k\,(n,m)}\right)^2 + \frac{1-\alpha}{K}\sum_{k=1}^{K}\left\lVert z_i - z_i^k \right\rVert_2, \tag{4}$$

where $k$ is the neighbour index such that $z_i^k$ is the $k$th of $z_i$'s nearest neighbours in the latent space and $\hat{x}_i^k = g(z_i^k;\theta_g)$ is that neighbour's reconstruction. $K$ is the maximum number of latent neighbours and $\alpha \in [0,1]$ is the hyper-parameter used to tune the contribution of the latent-space and image-space based distances respectively. It must be noted that Equation 4 shows the critical difference between [18,29] and our work. We propose using the reconstruction error in the image space, $\mathbb{R}^p$, whereas they only use the difference of extracted feature vectors in $\mathbb{R}^l$.
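The NLN score amounts to a KNN search in the latent space followed by an image-space comparison against the neighbours' decodings. A minimal numpy sketch, using a hypothetical rank-one `decode` function in place of the trained decoder $g$ and brute-force neighbour search:

```python
import numpy as np

def nln_score(z_query, x_query, z_train, decode, K=3, alpha=0.8):
    """Sketch of the NLN novelty score: the mean image-space error
    against the K nearest latent neighbours' decodings (weight alpha)
    plus the mean latent distance to those neighbours (weight 1-alpha)."""
    # Brute-force KNN in the latent space of the training set.
    dists = np.linalg.norm(z_train - z_query, axis=1)
    nn = np.argsort(dists)[:K]
    # Image-space error between the query and its neighbours' decodings.
    recon_err = np.mean([(x_query - decode(z_train[k])) ** 2 for k in nn])
    # Average latent distance to the neighbours.
    latent_err = np.mean(dists[nn])
    return alpha * recon_err + (1 - alpha) * latent_err

# Illustration with a hypothetical rank-one "decoder" on 2-D latents.
decode = lambda z: np.outer(z, z)          # maps R^2 -> a 2x2 "image"
z_train = np.array([[0., 1.], [1., 0.], [1., 1.]])
s_in  = nln_score(np.array([0., 1.]), decode(np.array([0., 1.])), z_train, decode, K=2)
s_out = nln_score(np.array([3., 3.]), decode(np.array([3., 3.])), z_train, decode, K=2)
# The out-of-distribution query receives a much higher score than the
# query that coincides with a training point.
```

Note that with alpha = 1 the score reduces to a pure neighbour-reconstruction error, and with alpha = 0 to a pure latent-distance detector.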

Discriminative Considerations
Discriminative autoencoding models use discriminators in the training of autoencoders. This is done to either improve the realism of the AE's outputs or to regularise the latent space to a prior distribution. In this work we focus on the former case. Given a discriminator $d_x$, trained on inputs $x$ and $\hat{x} = g(z;\theta_g)$, then

$$d_x : \mathbb{R}^p \rightarrow [0, 1], \tag{5}$$

where the discriminator on $x$ maps between the image space and a value on the interval between 0 and 1. It returns 1 or 0 based on whether the sample $x$ is taken from the training set or generated by the decoder $g$. The discriminator's training objective is stated as [35]

$$\min_{g}\max_{d_x}\; \mathbb{E}_{x}\left[\log d_x(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - d_x(g(z;\theta_g))\right)\right]. \tag{6}$$

In addition to improving the regularisation, discriminators can also be used for novelty detection. Novelty is calculated through the difference between the representations of a sample $x_i$, and its respective decoded output $\hat{x}_i$, from an intermediate layer, $q$, of $d_x$. This is also referred to as the residual error [3], and we include the nearest-latent-neighbours by

$$\eta_{res}(x_i) = \frac{\alpha}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H}\left(q(x_i)^{(h)} - q(\hat{x}_i^k)^{(h)}\right)^2 + \frac{1-\alpha}{K}\sum_{k=1}^{K}\left\lVert z_i - z_i^k \right\rVert_2, \tag{7}$$

where $h$ is an index of the output from an intermediate layer $q$ with size $H$.
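The residual variant replaces the image-space comparison with a feature-space one. In the sketch below an identity feature map stands in for the intermediate discriminator layer $q$ (a toy assumption; in the paper $q$ is a trained layer of $d_x$):

```python
import numpy as np

def residual_nln_score(q, x, x_nln, z, z_nln, alpha=0.8):
    """Sketch of the residual NLN error: compare features q(x) of the
    input against q of each neighbour's decoding, then mix in the
    average latent distance to the neighbours."""
    feats = q(x)                                        # shape (H,)
    res = np.mean([(feats - q(xk)) ** 2 for xk in x_nln])
    latent = np.mean(np.linalg.norm(z_nln - z, axis=1))
    return alpha * res + (1 - alpha) * latent

# Toy illustration with identity "features" standing in for layer q.
q = lambda img: img.ravel()
x = np.ones((2, 2))
score = residual_nln_score(q, x,
                           [np.ones((2, 2)), np.zeros((2, 2))],  # neighbour decodings
                           np.array([0., 0.]),                   # query latent
                           np.array([[0., 1.], [1., 0.]]))       # neighbour latents
```

The same structure carries over to the feature-consistent score, with the additional encoder in place of $q$.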

Feature Consistency
It has been shown in [8] that adding an additional encoder in the training path of the autoencoder improves performance. This paradigm is referred to as feature consistency [27] and can be integrated into our nearest-latent-neighbours method by

$$\eta_{con}(x_i) = \frac{\alpha}{K}\sum_{k=1}^{K}\frac{1}{L}\sum_{l=1}^{L}\left(f_{con}(x_i;\theta_{f_{con}})^{(l)} - f_{con}(\hat{x}_i^k;\theta_{f_{con}})^{(l)}\right)^2 + \frac{1-\alpha}{K}\sum_{k=1}^{K}\left\lVert z_i - z_i^k \right\rVert_2, \tag{8}$$

where $f_{con}$ is the additional encoder that takes $\hat{x}$ as an input, with parameters $\theta_{f_{con}}$. Furthermore, $L$ is the latent-space dimensionality, which is maintained between the first encoder, $f$, and the second encoder, $f_{con}$, and is indexed by $l$. The encoder is trained jointly with the rest of the discriminative autoencoder as described in [8].

The NLN algorithm
Our work concerns the integration of the NLN technique into existing autoencoding models. For this reason we explain three different modes of operation, with three different novelty scores. In the first case, a vanilla autoencoding model is used with a standard reconstruction error, as shown in Equation 4. The second uses the autoencoding architecture in [8] and the feature consistency error in Equation 8. Finally, the third makes use of a discriminative autoencoding architecture and the residual error in Equation 7.
In all cases, an autoencoding model is first trained on a dataset with some novel class(es) removed. During testing, a sample is randomly chosen (which may be novel or not) and is input into the encoder. Then the nearest neighbours of the encoded sample are found in the latent space generated by the training data. This process is represented by the left-most half of Figure 3.
In the first mode of operation, the error is computed between the test sample and both the decodings and positions of its latent neighbours in the non-novel latent space. When discriminative methods are used, the error is computed between the intermediate representation from the discriminator d_x of the test sample and all its decoded latent neighbours in the training data. In the feature-consistent case, the error is computed between the encoding via f_con of the given sample and all its nearest-latent-neighbours in the training data. In Figure 3 these three operations are represented by the error-computation operator.
When performing novelty detection, one of the three methods' errors is aggregated over all neighbours and normalised, after which it is added to the aggregated and normalised latent-neighbour distance vector. The result is then thresholded to produce an anomaly score and a segmentation map. The threshold is determined by the AUROC method described in Section 4. This methodology is illustrated in the right half of Figure 3.
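The aggregation, normalisation and thresholding step can be sketched as follows. The min-max normalisation and the fixed threshold here are illustrative assumptions; in the paper the threshold is selected via the AUROC method of Section 4:

```python
import numpy as np

def detect(recon_errors, latent_dists, alpha=0.8, threshold=0.5):
    """Sketch of the final NLN decision step: per-sample errors are
    aggregated (averaged) over the K neighbours, min-max normalised
    across the test set, combined with the normalised latent-distance
    term, and thresholded into binary novelty labels."""
    def norm(v):
        return (v - v.min()) / (v.max() - v.min() + 1e-12)
    scores = alpha * norm(recon_errors.mean(axis=1)) \
             + (1 - alpha) * norm(latent_dists.mean(axis=1))
    return scores, scores > threshold

# Two test samples, each with K=2 neighbours; the second is novel.
recon = np.array([[0.1, 0.1], [0.9, 1.1]])
dists = np.array([[0.2, 0.2], [1.5, 1.7]])
scores, flags = detect(recon, dists)
```

Only the relative ordering of the scores matters for AUROC; the thresholded labels are what a deployed detector would report.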

Experiments
We evaluate our method experimentally in both multi-class and single-class novelty detection contexts as outlined in [30]. Furthermore, we compare our best performing NLN-enabled autoencoder with state-of-the-art autoencoders using both pixel-level and image-level anomaly detection metrics on the MVTec-AD dataset.

Evaluation methodology
To measure the performance of the NLN-enabled models, they are trained multiple times on a specific dataset, each time removing a different class or classes from the training set, thereby testing the novelty detection performance on every class present in a given dataset. We do this according to [30], such that both the single-class or Single-Inlier-Multiple-Outlier (SIMO) and the multi-class or Multiple-Inlier-Single-Outlier (MISO) performance are evaluated.
We use the Area Under the Receiver Operating Characteristic (AUROC) score to evaluate and compare the performance of the NLN algorithm. The AUROC metric measures the area under the ROC curve of true positive rates and false positive rates for different threshold values. Furthermore, we evaluate the per-pixel detection performance of our NLN-enabled models using the Intersection over Union (IoU) score. The IoU metric is a measure of the overlap between the predicted regions and their corresponding ground truth. (Source code is available at: https://github.com/mesarcik/NLN.)
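Both metrics are standard. For reference, AUROC can be computed in closed form from ranks (the Mann-Whitney statistic, equivalent to sweeping thresholds over the ROC curve when scores are untied), and IoU directly from binary masks:

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen novel
    sample (label 1) scores above a randomly chosen non-novel one
    (label 0). Assumes untied scores for simplicity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos, neg = labels == 1, labels == 0
    n_pos, n_neg = pos.sum(), neg.sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def iou(pred_mask, gt_mask):
    """Intersection over Union between a predicted segmentation map
    and its ground truth."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union

# A detector that perfectly separates the two novel samples scores 1.0.
a = auroc(np.array([0.1, 0.2, 0.8, 0.9]), np.array([0, 0, 1, 1]))
# A prediction covering the anomaly plus one extra pixel: IoU = 0.5.
j = iou(np.array([[1, 1], [0, 0]], bool), np.array([[1, 0], [0, 0]], bool))
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`, which also handles ties) would be used; the sketch is only meant to make the metric concrete.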
We limit our evaluation to autoencoders, as methods that rely on SSL [38,50,16], pretrained feature extractors [51,52,16,50] or computationally expensive inference [38] are not easily comparable on AUROC alone across multiple datasets. It has been well documented that using pretrained feature extractors and SSL losses results in improved performance. However, they typically require orders of magnitude more parameters [48], and are not easily applicable across datasets or evaluation strategies. Furthermore, we regard the simplicity of AEs as a crucial attribute. This is in contrast with the significant augmentation found in [15] and the challenge of applying patch-dependent methods [38] to different datasets of varying resolutions and anomaly types.

Datasets
We evaluate our work on four different datasets, namely MNIST [46], CIFAR-10 [53], Fashion-MNIST [54] and MVTec-AD [24]. MNIST is a dataset consisting of 28 × 28 × 1 handwritten digits between 0 and 9. The complexity of the dataset is low and therefore our method performs best on it. Similarly, Fashion-MNIST is composed of 28 × 28 × 1 images of different types of articles of clothing. This dataset is used as an intermediary difficulty, between MNIST and CIFAR-10. CIFAR-10 is an object recognition dataset consisting of 32 × 32 × 3 images of 10 different classes. It is the most challenging dataset for novelty detection as each of the semantic classes may appear at different scales, viewing angles and have changing backgrounds [7]. The MVTec-AD dataset is an industrial anomaly detection dataset consisting of 15 different classes in two categories: objects and textures. The 10 object classes contain regularly positioned objects photographed in high resolution from the same viewing angle, and the 5 texture classes contain repetitive patterns. For training on the MVTec-AD dataset we follow the augmentation scheme proposed in [24], where random rotations and crops are applied to the dataset that is broken into 128 × 128 patches. For more details about the dataset's composition and the augmentation performed see [24].
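The non-overlapping 128 × 128 tiling of MVTec-AD images can be sketched as follows. Edge remainders are dropped here for simplicity, and the random rotations and crops of [24] are omitted, so this is only an illustrative approximation of that scheme:

```python
import numpy as np

def extract_patches(image, patch=128):
    """Tile an H x W x C image into non-overlapping patch x patch
    pieces, dropping any remainder at the right/bottom edges."""
    H, W = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, H - patch + 1, patch)
            for c in range(0, W - patch + 1, patch)]

# A 256 x 300 image yields a 2 x 2 grid of full patches.
patches = extract_patches(np.zeros((256, 300, 3)))
```

Each patch is then treated as an independent training sample for the patch-based MVTec-AD models.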

Model and parameter selection
In order to evaluate our work across a number of different datasets we adapt our models accordingly. We adopt the autoencoding architecture specified in [24] for the evaluation of the NLN algorithm on the MVTec-AD dataset. For MNIST, CIFAR-10 and F-MNIST we modify a LeNet [55] based autoencoding architecture. The encoder consists of 3 convolutional layers and the decoder has 3 transposed-convolutional layers. A base number of filters of 32 is used for the AE and is increased or decreased on each subsequent layer by a factor of 2. We use ReLU activations for all models and they are trained for 50 epochs using ADAM [56] with a learning rate of 1 × 10⁻⁴. The image-based discriminators d_x use the same architecture as the encoder, except for the final layer, which is a dense layer with a sigmoid activation. The latent discriminator for the AAE consists of 3 dense layers with Leaky ReLU activations and a dropout rate of 0.3. The base layer size is 64 and is increased by a factor of 2 for each subsequent layer. Furthermore, we treat the maximum number of neighbours, K, the latent dimensionality, L, and the NLN contribution, α, as hyper-parameters of our algorithm.

Results
We evaluate the performance increase of the NLN algorithm for a variety of autoencoding models across a number of different datasets, in the MISO context in Table 1 and the SIMO context in Table 2. Here the best performing reconstruction error-based AUROC is compared with the best performing NLN-enabled model for each architecture. The NLN-based AEs achieve a performance increase between 1% and 17% across the three MISO datasets and between 3% and 8% for the SIMO case. We suspect the low performance gains in the SIMO case of the NLN-enabled AEs are due to there being fewer latent neighbours to select from, thereby reducing performance.
In effect, the MSE between non-novel images of the same class can be greater than that of novel images, thereby reducing the efficacy of MSE-based novelty detectors on CIFAR-10.
We present the class-averaged AUROC scores for the SIMO-based evaluation in Table 4. Here the optimal method for MNIST is a discriminative AE, with LD = 128, K = 3 and α = 1.0, and for CIFAR-10 we find the optimal method to be a vanilla AE with LD = 256, K = 1 and α = 0.75. Furthermore, we find the best performing method on F-MNIST to be a VAE with LD = 32, K = 3 and α = 0.9. For the MVTec-AD dataset we use a discriminative AE with LD = 128, K = 1 and α = 0.8. It is clear that the attention-guided VAE (CAVAGA) [34] method performs best on MNIST, whereas DKNN [18] performs best on CIFAR-10. However, it is evident that the NLN-enabled autoencoding models offer increased performance over existing autoencoding and ResNet-based architectures for both the F-MNIST and MVTec-AD datasets in the SIMO context.
In Figure 5 we show the effect of varying L and K on AUROC scores for the vanilla AE in the SIMO context when α = 0.8. For F-MNIST and MNIST a maximum AUROC score is found for L = 128 and K > 3, whereas for CIFAR-10 the optimum is found when L = 256 and K = 1. Finally, it is shown that the vanilla AE offers the best image-based AUROC performance when L = 256 and K = 3.

We evaluate the pixel-level anomaly detection performance in Table 5, and illustrate the model outputs in Figure 4 for both texture and object classes. In all cases we use a vanilla AE with K = 1, L = 128 and α = 0.6. It is clear that the NLN-enabled AE demonstrates performance increases in the object classes of MVTec-AD. However, this is not the case for the texture classes. We suspect that this is due to our NLN-enabled AE not being able to distinguish between different texture patches. This behaviour is similarly demonstrated in [24], and we believe that it is an inherent weakness of standard autoencoding architectures.
In Figure 6 we illustrate the effect of varying α for the NLN-enabled autoencoding models used for the MVTec-AD dataset. Here it is demonstrated that the NLN-based models obtain optimal AUROC segmentation performance when 0.25 < α < 0.8, whereas the optimal AUROC detection performance occurs when α > 0.6. Finally, we illustrate that the optimal IoU value is obtained at α = 0.8, thus demonstrating the benefit of including the reconstructions of nearest neighbours in the calculation of the anomaly score.

Time and memory efficiency
The NLN algorithm requires a forward pass through an encoder, a KNN search of the latent space generated by the training samples, and a forward pass of a given point's nearest neighbours through a decoder. We evaluate the models on an NVIDIA T4, where a forward pass of a single image from the MVTec-AD dataset takes 7.41 ms for the encoder and 9.63 ms for the decoder. In comparison, a ResNet50 used in [18,57,58] requires 43.3 ms for a forward pass of a single image. This means that our method is between 1.3× and 2.5× more efficient for a forward pass, depending on the architecture used.
For the KNN search we use a k-d tree implementation, which has an inference time complexity of O(KL log N), where K is the number of neighbours, L is the latent dimensionality and N is the number of points in the training set. In the case of the NLN-enabled models presented in this work, we find a latent dimensionality of 128 sufficient, whereas the ResNet50 in [18] uses a 2048-dimensional latent space. This means that our work offers a 16× reduction in KNN search inference time in comparison with [18].
Finally, our method has storage requirements comparable to other AE-based models [48] in terms of the number of trainable parameters. For comparison, the AE-con model used for MVTec-AD has 1.79 million parameters, whereas the ResNet-50 from [18] has 25.58 million parameters. The only storage-based overhead of the NLN algorithm is the requirement of amortising the embeddings of the training set, as suggested in [18]. In the case of the bottle class of the MVTec-AD dataset, this amounts to an additional storage requirement of 6.85 MB.

Ablation study
The AUROC performance of the NLN algorithm is demonstrated in Table 6 when the loss function is varied. The term in the first column, $\mathcal{L}_{recon}$, represents the standard reconstruction error given by Equation 3, and $\mathcal{L}_{NLN}$ shows the NLN-based reconstruction loss given in the first half of Equation 4. $\mathcal{L}_{con}$ represents the feature-consistent adaption given by the first half of Equation 8 and $\mathcal{L}_{total}$ is equivalent to the score obtained from Equation 8. It can be seen that through the utilisation of all terms in the NLN loss formulation we obtain optimal performance.

Discussion and conclusions
Autoencoders learn to generalise to unseen classes, which is a problem when they are used for novelty detection. In this work, we demonstrate that when the reconstructions of a model's nearest-latent-neighbours are harnessed, we can more effectively and efficiently mitigate this problem in comparison with the state-of-the-art. This is achieved through a fairly simple algorithm that is agnostic to both the AE's architecture and its error method. We experimentally prove that the addition of the NLN algorithm consistently yields performance increases for various autoencoding architectures and various datasets, and is competitive with the state-of-the-art autoencoding models. This is achieved without complex augmentation, pretrained networks or computationally expensive inference. We note that the complexity of CIFAR-10 and the texture classes of MVTec-AD results in modest performance, but we expect this can be addressed using more robust error functions or using SSL to obtain even better latent representations.