Adaptive Adversarial Latent Space for Novelty Detection

Novelty detection is the challenging task of identifying whether a new sample belongs to a known class. The boundary between normal and novel is not clear enough in existing works, which results either from reconstructing novel samples too well or from reconstructing normal samples too crudely. To tackle these issues, we propose a general framework named Adaptive Adversarial Latent Space (AALS), which mainly consists of two components: an Adaptive Latent Space Generator (ALSG) and a Constrained-based Adversarial AutoEncoder (CAAE). ALSG is established to obtain the real latent space distribution through an adaptive mapper and adversarial learning. Moreover, CAAE is presented to obtain a better boundary between normal and novel through Global and Local Channel-wise Attention (GLCA), which is proposed to re-weight the global and local information in different channels. Given a sample, its latent representation $z$ is first obtained by the encoder with GLCA. Next, the output of ALSG is applied to constrain the latent space via a latent adversarial loss. Further, $z$ is fed into the decoder with GLCA to reconstruct the given sample. Experiments on three available datasets demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance compared with other methods.

Novelty detection is the task of distinguishing normal from novel samples, with applications such as video surveillance and medical diagnosis. Different from other tasks in computer vision, its key problem is how to ensure intra-class consistency and inter-class distinction given the normal class only, and its crucial challenge is how to find an appropriate boundary between normal and novel when only normal samples are available.
The convolutional autoencoder (AE) structure and its variants are widely applied in novelty detection and related fields [1]-[7]. Compared with traditional methods, AE extracts better features and achieves better reconstruction results. In this case, the reconstruction error of a novel input may still be low when it is close to the normal class. For better performance, adversarial learning methods are introduced to detect novelties [8]-[12]. In general, they not only change the training mode of the network, but also take the output of the discriminator as one of the criteria of the novelty score.
(The associate editor coordinating the review of this manuscript and approving it for publication was John See.)

Nevertheless, the above-mentioned works are all based on an assumption that may not always hold: novel samples cannot be reconstructed well. The source of the problem is the ''generalized'' representation capacity of the overall network. In order to constrain the expression ability of the network, OCGAN [13] presents a network structure that constrains the latent space. It adopts one discriminator to constrain the latent space to a known prior distribution, and another discriminator to boost the quality of the generated images. To obtain a better boundary, a sampling technique based on gradient descent is introduced, so that samples drawn from the prior distribution generate normal-class examples as much as possible, which has achieved good results. However, OCGAN relies on many convolutions and sophisticated operations, leading to an inelegant implementation of the model.
We note that it is sometimes not effective to assume that the latent space follows a known prior distribution. In this paper, we propose a general unsupervised novelty detection model, AALS, which makes a trade-off between the expression ability of the whole network and image quality. Specifically, AALS includes an Adaptive Latent Space Generator (ALSG) and a Constrained-based Adversarial AutoEncoder (CAAE). Firstly, a number of samples are randomly drawn from the normal distribution to train ALSG, after which its parameters are fixed. Notably, ALSG contains an adaptive mapper, a generator and a discriminator. Secondly, CAAE utilizes the output of the adaptive mapper to generate the adaptive latent space distribution of the known class. To reconstruct a given input, the overall network is optimized by employing a pixel adversarial loss and a pixel reconstruction loss. Note that channel attention is often utilized in convolution layers to better extract features, as in SENet. However, channel attention usually takes the feature map as a whole, which is not necessarily a good decision. Therefore, GLCA is proposed to extract features better by integrating global and local information, and its effectiveness is proved in subsequent experiments. In addition, we introduce a feature reconstruction loss with the intention of generating higher-quality images. Finally, in the test phase, the sample is only fed into CAAE, and the novelty score is composed of the pixel-feature reconstruction loss. Compared with other methods, our model requires no assumptions about the data or the latent space, which makes it elegant and efficient.
The key contributions of this paper are four-fold: • An adaptive adversarial model named AALS is proposed for novelty detection. Different from other approaches, AALS achieves a better trade-off between the expression ability of the whole network and image quality.
• Global and Local Channel Attention (GLCA) is proposed to obtain better features in CAAE. GLCA integrates the global and local information of features, which is more convincing and reasonable than existing channel attention mechanisms.
• An adaptive latent space generator is proposed, which constructs an appropriate latent space distribution for each class, so that our model requires no assumptions about the latent space.
• We achieve state-of-the-art performance on three popular benchmark datasets [14]-[16]. As a side contribution, we will release the code.

II. RELATED WORK
Novelty detection is a field closely related to outlier detection, anomaly detection, etc. There are many traditional methods for novelty detection and related fields, such as OCSVM [17] and PCA [18]. With the advent of deep neural networks, most of the methods utilized in novelty detection can be summarized as follows: Auto Encoder (AE) based methods, Adversarial AutoEncoder (AAE) based methods and Latent Constraint (LC) based methods.

A. AE-BASED METHODS
End-to-end models have recently achieved promising results, attracting more and more attention in novelty detection. Sakurada and Yairi [19] first utilize autoencoders in anomaly detection and prove their effectiveness. In [20], stacked denoising autoencoders are combined with an SVM. Chong and Tay [21] then propose a spatiotemporal autoencoder for abnormal event detection in the video domain.
By splitting the input of autoencoders, Zhou and Paffenroth [22] present robust deep autoencoders to improve detection efficiency. Thereafter, Zong et al. [23] integrate autoencoders and a Gaussian mixture model for anomaly detection. Li et al. [24] also propose a multi-scale method to extract spatiotemporal information with a Gaussian mixture model. The latest AE-based methods are IAE [25] and MBC [26]: the former introduces a channel attention mechanism into novelty detection and proposes a method of minimizing cross entropy to limit the latent space, while the latter describes the quality of reconstruction from the perspective of gradients. However, although AE adopts the reconstruction loss, it is not adept at reconstructing images, regardless of whether the input is normal or not, leading to unacceptable performance. In contrast, our method aims to generate high-quality reconstruction results by utilizing the pixel-feature reconstruction loss.

B. AAE-BASED METHODS
Increasingly, adversarial models [4], [27] are proposed in novelty detection to address the defects of the AE structure. Schlegl et al. [8] first apply DCGAN [28] to anomaly detection, and propose a mapping from image to latent space by back-propagation in the inference phase. Unfortunately, there is no mapping between image and latent space in the training phase, causing a lousy mapping in inference. Therefore, Zenati et al. [12] propose ALAD, a bi-directional GAN structure that ensures the mapping between the image and the latent space so as to improve the detection results. However, the training process of their network structure is complicated and requires much adjustment to achieve training stability. Ravanbakhsh et al. [9] adopt Flow2Pixel and Pixel2Flow generators for training. During testing, the two networks take each frame as input and their outputs jointly determine whether it is novel or not. Li et al. [29] propose a new anomaly score function and a spatio-temporal framework combining U-Net and adversarial learning. Similarly, Dong et al. [30] propose a new approach with a dual-discriminator generative adversarial network and a U-Net structure. ALOCC [10] combines a denoising AE and an adversarial AE, based on the assumption that novel samples may not get a high discriminator score; consequently, ALOCC applies the output of the discriminator as the novelty score. More recently, Salehi et al. [31] make the network structure more effective by randomly combining the input of the network into several blocks. Both AE- and AAE-based methods rely on the following assumption: normal samples achieve favorable reconstruction and novel samples do not. In many cases, this assumption may not hold. To this end, the proposed approach alleviates these problems by providing a constraint on the latent space in the AE or AAE structure.

C. LC-BASED METHODS
With the development of science and technology, researchers have turned their attention to the latent space of the AE structure, aiming to constrain the expression ability of the whole network by constraining the latent space.

FIGURE 1. The overall framework of AALS, which contains two components, ALSG and CAAE. During training, ALSG is pre-trained and fixed before CAAE. When ALSG and CAAE are trained, the VGG layer calculates the pixel-feature loss using the generated image and the ground truth. During testing, the sample is only fed into CAAE to obtain the pixel-feature reconstruction loss, which is combined into a novelty score. In addition, the latent representation z is constrained by the output of the Adaptive Mapper. All details of the network structure are given in Table 5.

GPND [32] first presents a latent constraint, which constrains the latent space by assuming that it follows a known prior distribution. Moreover, OCGAN [13] and GPND [32] are based on the same assumption; the difference is that OCGAN designs a classifier to constrain the expression ability of the latent space to the known (normal) class as much as possible. However, they all suffer from a common problem: the assumption that the latent space follows a prior distribution does not necessarily hold. Meanwhile, LSA [33] constrains the expression ability by using autoregression in the latent space, which is very complicated and not necessarily effective. MemAE [34] first stores some typical samples in a memory module during the training stage; during testing, it maps the latent representation to the nearest sample in the memory module. The memory module adds extra parameters, and the capacity of the method depends on the quality of the memory module. Different from the aforementioned works, our method requires no redundant training process, no complex network structure and, thanks to the adaptive mapper, no assumptions about the latent space.

D. CHANNEL ATTENTION
Since Hu et al. [35] proposed SENet for the classification task, a large number of channel attention mechanisms have been proposed in various fields of computer vision. Woo et al. [36] propose a general attention framework that integrates channel attention and spatial attention. Huang et al. [37] combine channel attention and U-Net, applying the attention mechanism in the skip connection. Guo et al. [25] introduce channel attention into the field of novelty detection. Lee and Cho [38] propose locally adaptive channel attention for denoising images. However, the above-mentioned works consider either global information or local information only, which is incomplete. Therefore, we propose Global and Local Channel Attention, which utilizes global and local information simultaneously.

III. PROPOSED METHOD
The proposed AALS model consists of two major components: the Adaptive Latent Space Generator (ALSG) and the Constrained-based Adversarial AutoEncoder (CAAE). AALS aims to obtain a clear boundary between the normal and novel classes by making a trade-off between the expression ability of the whole network and image quality. As depicted in Figure 1, ALSG is pre-trained and fixed using data sampled from a normal distribution. In CAAE, given an input, we first feed it into the encoder to obtain the latent representation $z$, which is constrained by the output of the Adaptive Mapper in ALSG. The reconstructed output is then obtained through the decoder. In the following sections, we elaborate the design of ALSG and CAAE in detail.

A. PROBLEM DEFINITION
The key problem of novelty detection is how to distinguish the normal class from the novel class on the test set when training on only the normal class. Similar to OCGAN [13], we train a model that limits the expression ability of the network to the known class in order to distinguish normal and novel. Once the model is trained, any new sample is reconstructed into the known class as much as possible. Therefore, the reconstruction loss for the normal class is very small, while that of the novel class is the opposite, so it is easy to distinguish between normal and novel samples. As shown in Figure 3, assume the normal class is digit 8: AALS tends to generate a reconstruction of the known class ''8'' regardless of the digit of the real test sample, so the reconstruction loss for the novel class is large. In CAAE, the function of the encoder is to obtain the latent representation of the image. Due to the constraint from ALSG, the encoder tends to generate the latent space of the known (normal) class. In addition, since the input of the decoder is a constrained latent representation, the decoder generates images that are close to the normal class.
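The scoring rule described above can be sketched numerically; the function name and the weights below are illustrative, not the paper's implementation:

```python
import numpy as np

def novelty_score(x, x_hat, feats, feats_hat, alpha1=1.0, alpha2=0.1):
    """Pixel-feature reconstruction error used as a novelty score (sketch).

    x, x_hat: input image and its reconstruction, arrays of equal shape.
    feats, feats_hat: lists of feature maps (e.g. VGG activations) of the
    input and the reconstruction. The weights here are illustrative.
    """
    pixel = np.abs(x - x_hat).mean()  # L1 pixel reconstruction error
    feature = sum(np.abs(f - fh).mean() for f, fh in zip(feats, feats_hat))
    return alpha1 * pixel + alpha2 * feature

# A normal sample reconstructs well (low score); a novel one does not.
x = np.ones((32, 32, 1))
low = novelty_score(x, x, [x], [x])                       # perfect reconstruction
high = novelty_score(x, np.zeros_like(x), [x], [np.zeros_like(x)])
assert low < high
```

A sample is flagged as novel when this score exceeds a threshold chosen on the score distribution.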

B. ADAPTIVE LATENT SPACE GENERATOR
It is unreasonable to assume that the latent space follows a prior distribution, so we propose an adaptive mapper which uses the prior distribution to obtain a specific distribution. Similar to StyleGAN [39], a sample $s$ is randomly drawn from the prior distribution $\mathcal{N}(0, 1)$, and the sample $w$ of an adaptive specific distribution is obtained through the adaptive mapper. Thereafter, $w$ is fed into the generator to obtain an image $\hat{x}$. The overall process is as follows:

$$\hat{x} = f_g(f_w(s; \theta_w); \theta_g),$$

where $\theta_w$ and $\theta_g$ denote the parameters of the adaptive mapper $f_w(\cdot)$ and the generator $f_g(\cdot)$, respectively. After training, only $f_w(\cdot)$ is used to generate the adaptive latent space in CAAE. Although our structure differs from the general GAN, the parameters of the two networks can still be trained in an end-to-end manner. Specifically, we aim to capture the true distribution of normal-class images through the adversarial training of the discriminator and the generator. Pixel adversarial learning is defined as follows:

$$\mathcal{L}_{pixel\text{-}adv} = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{s \sim \mathcal{N}(0, 1)}[\log(1 - D(f_g(f_w(s))))],$$

where $x$ is an image sampled from the dataset, $s$ is a sample from the prior distribution $\mathcal{N}(0, 1)$, and $f_w(\cdot)$ and $f_g(\cdot)$ are used to generate images close to the real distribution. $D$ aims to distinguish the generated image from the real image. Lacking direct constraints on the pixels and features of the image, the whole network is time-consuming to train and the quality of the images is unsatisfactory. In addition, many current generation tasks constrain images in feature space to improve performance, such as style transfer [40] and image inpainting [6]. In order to improve the image quality and training speed, we introduce the pixel and feature reconstruction loss [6], [40], which is defined as follows:

$$\mathcal{L}_{rec} = \alpha_1 \| x - \hat{x} \|_1 + \alpha_2 \sum_i \| f_i(x) - f_i(\hat{x}) \|_1,$$

where $x$ is the input image, $\hat{x}$ is the output of the overall network and $f_i(\cdot)$ is the $i$-th layer output of a pre-trained VGG-19 [41] network. $\alpha_1$ and $\alpha_2$ are the weights of the two reconstruction losses.
Similar to other computer vision tasks [40], we set $\alpha_1$ and $\alpha_2$ to 1 and 0.1, respectively. Similar to the aforementioned work [6], the final loss function is balanced by the hyper-parameters $\lambda_{a1} = 0.1$ and $\lambda_{a2} = 10$ and is calculated as follows:

$$\mathcal{L}_{ALSG} = \lambda_{a1} \mathcal{L}_{pixel\text{-}adv} + \lambda_{a2} \mathcal{L}_{rec}.$$

C. CONSTRAINED-BASED ADVERSARIAL AUTO ENCODER

CAAE is proposed to limit the expression ability of the network and generate normal-class images with higher quality.
Each sample $x$ in the training set is fed into the encoder $f_e(\cdot)$ to get the corresponding latent representation $z$. At the same time, a random sample $s$ is drawn from $\mathcal{N}(0, 1)$, and $w$ is obtained by the adaptive mapper. The encoder is optimized by the following latent adversarial constraint loss:

$$\mathcal{L}_{latent\text{-}adv} = \mathbb{E}_{w \sim P_w}[\log D_z(w)] + \mathbb{E}_{x \sim P_{data}}[\log(1 - D_z(f_e(x)))],$$

where $x$ is an image from the dataset, $P_{data}$ is the distribution of normal examples, $P_w$ is the distribution of normal latent examples, $w$ is an example randomly sampled from $P_w$, and $D_z$ is the latent discriminator. Previous methods adopt a known prior distribution, such as a normal or uniform distribution, to constrain the latent space. However, it is unreasonable to constrain the latent space to such a fixed prior. Therefore, we feed $s$ into the adaptive mapper with fixed parameters to get the adaptive distribution. In this way, we obtain a specific and appropriate distribution for each class, instead of a single prior distribution. Then, $z$ is fed into the decoder $f_d(\cdot)$ to get the reconstructed image $\hat{x}$. Following the previous works [4], we utilize the pixel adversarial loss:

$$\mathcal{L}_{pixel\text{-}adv} = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{x \sim P_{data}}[\log(1 - D(f_d(z)))],$$

where $z$ denotes the output of the encoder, and the other variables are defined as in the pixel adversarial loss of ALSG. Due to the poor quality of the normal-class images generated by the above-mentioned works, we adopt the pixel and feature reconstruction loss $\mathcal{L}_{rec}$ to constrain the whole network. Most methods point out that high-quality images are unnecessary in novelty detection, leading to the neglect of image quality. In CAAE, we introduce the pixel-feature reconstruction loss to boost the quality of the reconstructed image. Moreover, we combine it with the latent and pixel adversarial losses to widen the difference between normal and novel samples. The whole loss is defined as follows:

$$\mathcal{L}_{CAAE} = \lambda_{c1} \mathcal{L}_{latent\text{-}adv} + \lambda_{c2} \mathcal{L}_{pixel\text{-}adv} + \lambda_{c3} \mathcal{L}_{rec},$$

where $\lambda_{c1}$, $\lambda_{c2}$ and $\lambda_{c3}$ are hyper-parameters used in training. For good reconstructions, a larger weight is assigned to the pixel-feature reconstruction loss, which is similar to OCGAN [13]. The other coefficients are chosen empirically based on the quality of reconstruction.
In practice, $\lambda_{c1} = 0.1$, $\lambda_{c2} = 1$ and $\lambda_{c3} = 10$ achieve desirable results in all our experiments.

D. GLOBAL AND LOCAL CHANNEL ATTENTION

1) GLOBAL CHANNEL ATTENTION
The essence of channel attention is to re-weight each feature map. Global channel attention (GCA) treats a feature map as a whole, so each point in the feature map is given the same weight. GCA is expressed as follows:

$$gca = \sigma(W_2\, \alpha(W_1\, GP(f))),$$

where $\alpha$ is the ReLU function, $\sigma$ is the sigmoid function, $W_1$ and $W_2$ are the parameters of two fully connected layers, and $GP(f) \in \mathbb{R}^{1 \times 1 \times C}$ is obtained by globally pooling the feature maps $f$. For simplicity, the combination of the two FC layers and two activation functions is called the attention calculation module. The global channel attention obtained by the above formula is the one proposed in SENet [35].
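As a concrete illustration, GCA can be sketched with NumPy; the weight shapes and the reduction ratio below are assumptions for the example, not the paper's configuration:

```python
import numpy as np

def global_channel_attention(f, W1, W2):
    """SENet-style global channel attention (illustrative sketch).

    f: feature maps of shape (H, W, C); W1, W2 stand in for the two
    fully connected layers of the attention calculation module."""
    gp = f.mean(axis=(0, 1))                     # global average pooling -> (C,)
    hidden = np.maximum(W1 @ gp, 0)              # FC + ReLU
    gca = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))   # FC + sigmoid -> one weight per channel
    return f * gca                               # re-weight each channel uniformly

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8, 16))
W1 = rng.standard_normal((4, 16))                # reduction ratio 4 (assumed)
W2 = rng.standard_normal((16, 4))
out = global_channel_attention(f, W1, W2)
assert out.shape == f.shape
```

Because a single scalar scales each channel, every spatial position inside a channel receives the same weight, which is exactly the limitation the local branch addresses.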

2) LOCAL CHANNEL ATTENTION
Different from global channel attention (GCA), local channel attention (LCA) takes a small region of the feature map as a whole: each small region has a single weight, and different regions have different weights. LCA is defined as follows:

$$lca = \sigma(W_2\, \alpha(W_1\, \rho(AP(f)))),$$

where $\alpha$ is the ReLU function, $\sigma$ is the sigmoid function, and $W_1$ and $W_2$ are the parameters of two fully connected layers. In order to give a different weight to each small region, the whole feature map is divided into $k \times k$ small regions by an adaptive pooling (AP) operation. We define $AP(f)$ as $z$ for more precise expression, and the combination of the two FC layers and two activation functions is called the attention calculation module. In order to make it suitable for the fully connected layers, a resize operation $\rho$ is introduced to convert $z \in \mathbb{R}^{k \times k \times C}$ to $\rho(z) \in \mathbb{R}^{1 \times 1 \times k^2 C}$. Finally, $lca \in \mathbb{R}^{1 \times 1 \times k^2 C}$ is obtained by the two FC layers and two activation functions.

3) ATTENTION FUSION MODULE
In this section, we show the integration of global attention and local attention. Assume that the original feature maps are $f \in \mathbb{R}^{H \times W \times C}$. Firstly, the global scaling operation extends $gca \in \mathbb{R}^{1 \times 1 \times C}$ to $\mathbb{R}^{H \times W \times C}$ by copying the value of each channel. Secondly, the local scaling operation converts $lca \in \mathbb{R}^{1 \times 1 \times k^2 C}$ to $\mathbb{R}^{H \times W \times C}$ by the corresponding scaling in each region. Finally, the product of $gca$ and $lca$ is activated by the sigmoid function and re-weights $f$. The above process is defined as follows:

$$f' = f \otimes \sigma(\delta_1(gca) \otimes \delta_2(lca)),$$

where $gca$ and $lca$ represent the two kinds of attention respectively, and $\otimes$ is element-wise multiplication. Moreover, $\delta_1$ is the global scaling operation, $\delta_2$ is the local scaling operation and $\sigma$ is the sigmoid function. The above-mentioned three attention processes are depicted in Figure 2.
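The local branch and the fusion step can likewise be sketched end-to-end; the grid size k and the weight matrices below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glca(f, k, Wg1, Wg2, Wl1, Wl2):
    """Global and Local Channel Attention fusion (illustrative sketch).

    f: (H, W, C) feature maps; k: local grid size (H, W divisible by k).
    Wg*, Wl* stand in for the FC layers of the two attention branches."""
    H, W, C = f.shape
    # global branch: one weight per channel
    gca = sigmoid(Wg2 @ np.maximum(Wg1 @ f.mean(axis=(0, 1)), 0))     # (C,)
    # local branch: adaptive-pool to k x k regions, one weight per region/channel
    pooled = f.reshape(k, H // k, k, W // k, C).mean(axis=(1, 3))     # AP(f): (k, k, C)
    z = pooled.reshape(-1)                                            # resize rho: (k*k*C,)
    lca = sigmoid(Wl2 @ np.maximum(Wl1 @ z, 0)).reshape(k, k, C)
    # fusion: scale both to (H, W, C), multiply, activate, re-weight f
    g_full = np.broadcast_to(gca, (H, W, C))                          # delta_1
    l_full = np.repeat(np.repeat(lca, H // k, axis=0), W // k, axis=1)  # delta_2
    return f * sigmoid(g_full * l_full)

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8, 4))
k, C = 2, 4
Wg1, Wg2 = rng.standard_normal((2, C)), rng.standard_normal((C, 2))
Wl1, Wl2 = rng.standard_normal((8, k * k * C)), rng.standard_normal((k * k * C, 8))
out = glca(f, k, Wg1, Wg2, Wl1, Wl2)
assert out.shape == f.shape
```

Since the fused attention passes through a sigmoid, every position is scaled by a factor in (0, 1), with the factor varying both per channel and per local region.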

IV. EXPERIMENT
In this section, the proposed method is evaluated on three popular benchmark datasets: MNIST [14], Fashion-MNIST [15] and CIFAR-10 [16]. All datasets are applied to novelty detection by setting one class from 0 to 9 as normal and the rest as novel. Similar to previous methods, we adopt two experimental protocols [13]. Finally, we present an ablation study of the proposed method.
1) PROTOCOL 1
The 10 classes take turns being regarded as the normal class, while the other 9 classes are novel. All the normal data in the dataset are randomly shuffled, and the novel data are shuffled at the same time. We take 80% of the normal data as the training set and the rest as the normal test set. In addition, a novel test set of the same size as the normal test set is randomly extracted from the novel data. The final test set consists of the normal test set and the equally sized novel test set.
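The split described above can be sketched as follows; the function and variable names are ours, for illustration only:

```python
import numpy as np

def protocol1_split(normal, novel, seed=0):
    """Protocol 1: 80% of the normal data for training, the rest as the
    normal test set, plus an equally sized random novel test set."""
    rng = np.random.default_rng(seed)
    normal = rng.permutation(normal)          # shuffle normal data
    novel = rng.permutation(novel)            # shuffle novel data
    n_train = int(0.8 * len(normal))
    train = normal[:n_train]
    test_normal = normal[n_train:]
    test_novel = novel[:len(test_normal)]     # equal-size novel test set
    return train, test_normal, test_novel

normal = np.arange(100)
novel = np.arange(1000, 1900)
train, test_normal, test_novel = protocol1_split(normal, novel)
assert len(train) == 80 and len(test_normal) == 20 and len(test_novel) == 20
```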

2) PROTOCOL 2
In this protocol, we use the original training/test split of the dataset. After processing, the training set contains only the normal class of the original training set; in other words, only 1/10 of the training data is used for training. For the test set, we first relabel all the data into two classes: normal and novel. The normal class corresponds to the class used for training, while the novel class contains all the remaining classes. For example, if 8 is the normal class, then the remaining classes (0-7 and 9) form the novel class.
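The relabeling step of protocol 2 amounts to a one-liner; the helper name is illustrative:

```python
import numpy as np

def protocol2_labels(y, normal_class):
    """Protocol 2 relabeling: the training class stays normal (0),
    every other class becomes novel (1)."""
    return (np.asarray(y) != normal_class).astype(int)

# With 8 as the normal class, all other digits are relabeled as novel.
y_test = [8, 3, 8, 0, 9]
assert protocol2_labels(y_test, normal_class=8).tolist() == [0, 1, 0, 1, 1]
```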
For better comparison with previous works, Fashion-MNIST is evaluated only under protocol 1 and CIFAR-10 only under protocol 2, while MNIST adopts both protocols in this work.

B. IMPLEMENTATION DETAILS
For convenience, we resize all images to 32 × 32 and then map all values to [0, 1] through standardization and normalization. The proposed method is optimized by Adam [42] with a learning rate of 0.0002. Training stops when the number of epochs reaches 100 or the loss does not drop significantly for 10 epochs; the batch size is set to 64. All experiments are implemented on an Intel Core i5-8600K CPU@3.6GHz with 64GB RAM and an NVIDIA GeForce GTX 1080Ti 11GB GPU, using PyTorch 1.1.0 (Python 3.6.0).
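The preprocessing step can be sketched as below; a nearest-neighbour resize stands in for whatever interpolation the actual implementation uses, and min-max scaling stands in for the normalization:

```python
import numpy as np

def preprocess(img, size=32):
    """Resize an image to size x size (nearest-neighbour, dependency-free
    stand-in) and map its values to [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size        # nearest source row per target row
    cols = np.arange(size) * w // size
    resized = img[rows][:, cols]
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo) if hi > lo else np.zeros_like(resized)

out = preprocess(np.arange(28 * 28).reshape(28, 28))
assert out.shape == (32, 32)
assert out.min() == 0.0 and out.max() == 1.0
```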

C. EVALUATION METRICS
Since the class distribution of the test data is imbalanced, like most methods we use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) as evaluation criteria. The ROC curve is determined by the true positive rate (TPR) and the false positive rate (FPR):

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN},$$

where TP means the real label is true and the prediction is positive, FN means the real label is true and the prediction is negative, FP means the real label is false and the prediction is positive, and TN means the real label is false and the prediction is negative. AUC is defined as the area under the ROC curve; its value ranges from 0 to 1.
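For reference, AUC can also be computed directly from novelty scores via the rank statistic, which is mathematically equivalent to integrating the ROC curve:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC from novelty scores (higher score = more novel): the probability
    that a random novel sample scores above a random normal sample."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]   # novel / normal
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()           # ties count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Novel samples (label 1) should receive larger reconstruction errors.
assert auc_from_scores([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 1.0   # perfect
assert auc_from_scores([0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0]) == 0.75  # one inversion
```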
• VAE [21]: As a generative model comparable to GAN, VAE combines the advantages of Bayesian methods and deep learning.
• Pixel CNN [5]: An image density model based on pixel-level CNN, which generates images well. PixelCNN can be applied to AE or VAE as a powerful decoder.
• GAN [8]: Generator and Discriminator constitute a dynamic game process. We operate the GAN in the same way as AnoGAN [8].
• AnoGAN [8]: AnoGAN is based on DCGAN structure, trying to find the mapping from image to latent space through gradient propagation during testing. The anomaly scores are weighted by reconstruction error and the output of discriminator.
• ALOCC [10]: ALOCC utilizes both reconstruction loss and adversarial loss to train the network. During inference, the output of the discriminator is adopted to represent the novelty scores.
• GPND [32]: GPND constrains the latent space using a prior distribution and computes the probability distribution as the novelty score in testing.
• LSA [33]: Similarly, LSA limits the expression ability of network through latent space autoregression. Novelty scores are carried out by summing the reconstruction loss and the log-likelihood loss.
• OCGAN [13]: OCGAN aims to improve the reconstruction loss of novelty samples by generating normal class images as much as possible. Therefore, they apply reconstruction loss as novelty scores.
• IAE [25]: IAE is based on the autoencoder structure, introducing channel attention and entropy minimization into novelty detection.
• MBC [26]: MBC is a model-based characterization of neural networks which characterizes novelty from the model perspective using gradients.
• PAE [31]: PAE utilizes a new AAE framework that is trained based on solving puzzles on randomly permuted image patches.

1) MNIST
MNIST contains 70,000 digit samples from 0-9 at 28 × 28 resolution, with each pixel represented by a gray value. We evaluate the performance on MNIST using both protocols. On MNIST, the proposed model AALS obtains state-of-the-art performance compared with ALOCC, GPND, LSA, OCGAN, etc., achieving mean AUCs of 0.984 and 0.9794 under the two protocols. In protocol 2, the proposed method achieves state-of-the-art performance on six classes (1, 2, 3, 6, 8, 9) and is close to the best performance on the other classes. The proposed method behaves similarly to OCGAN: the generated data for novel classes are also close to the known (normal) class, as shown in Figure 3.

2) FASHION-MNIST
Fashion-MNIST is a dataset of 28 × 28 clothing images, whose scenes are more complex than those of MNIST. For Fashion-MNIST, Table 1 lists the corresponding mean AUC values, which demonstrate the optimal performance of the proposed method. Compared with GPND and OCGAN, the DCAE and ALOCC methods perform poorly on this dataset. Moreover, the average AUC score of the proposed method is 0.945, which improves on the latest method OCGAN by 0.021 and proves the effectiveness of the proposed method.

3) CIFAR-10
CIFAR-10 is a color image dataset closer to real-world objects, containing 10 classes of 32 × 32 RGB images. Compared with MNIST, CIFAR-10 contains real-world objects with considerable noise and varied characteristics. As a result, novelty detection results are comparatively weaker for all methods on CIFAR-10. Figure 3 shows our visual inspection results, and Table 3 lists the corresponding mean AUC scores for VAE, Pixel CNN, GAN, AnoGAN, LSA, OCGAN and our method. Specifically, our proposed model outperforms the other methods in the AUC scores of seven classes and in the average AUC score. Moreover, the gap to the optimal performance on the remaining classes is only about 2% AUC. Figure 3 shows that normal images obtain better reconstructions, while novel images tend to be pulled toward the normal class. In summary, our method shows superior performance on this complex dataset, improving the mean AUC score by 7.18% compared with the latest method OCGAN.

E. ABLATION STUDY
In this section, in order to prove the effectiveness of each component, we design a set of ablation experiments on the CIFAR-10 dataset, as shown in Table 4. Our first experiment considers only the AAE structure, which is the backbone of our method. Secondly, we use AAE with the latent constraint (LC). Thirdly, we combine the feature reconstruction loss (FL) with AAE. Fourthly, LC and FL are combined with AAE. To better test the role of GLCA, we next combine SENet [35] with our model. In the final scenario, we adopt the full proposed model AALS. Table 4 shows the mean AUC over all classes of CIFAR-10. Compared with the baseline, each part brings a corresponding performance improvement. More concretely, the model without the feature loss cannot obtain high-quality images, while the model without the latent space constraint cannot distinguish normal and novel well, as depicted in Figure 4. Meanwhile, this proves that our model makes a trade-off between the expression ability of the network and the image quality. Moreover, GLCA also brings an obvious improvement compared with SENet [35]. Finally, we conduct experiments on the efficiency of the model on CIFAR-10. AALS consumes 2.9 ms and 0.31 ms per image in the training and testing stages respectively, which proves that our model has good real-time performance.

V. CONCLUSION
In this paper, we propose a general Adaptive Adversarial Latent Space (AALS) network for novelty detection. Specifically, the proposed method consists of two core components: the Adaptive Latent Space Generator (ALSG) and the Constrained-based Adversarial AutoEncoder (CAAE). ALSG is designed to generate an adaptive latent space from a prior distribution. CAAE utilizes the adaptive latent space to constrain the latent space of the AAE and adopts GLCA to obtain better feature maps. The above analysis and experiments prove the effectiveness of GLCA. Meanwhile, we adopt the pixel-latent adversarial losses and the pixel-feature reconstruction loss for novelty detection, and the pixel-feature loss also serves as the criterion for novelty scores in testing. Experimental results on three datasets demonstrate state-of-the-art performance in novelty detection. Further, the results show the effectiveness and generalization capability of the proposed method without making any assumptions about the data or the latent space. In the future, we will investigate how to detect novelty in video data and how to obtain a better boundary between normal and novel.