An Approach to Detect Anomaly in Video Using Deep Generative Network

Anomaly detection in video has recently gained attention due to its importance in intelligent surveillance systems. Even though the performance of state-of-the-art methods is competitive on benchmark datasets, the trade-off between computational resources and detection accuracy must be considered. In this paper, we present a framework to detect anomalies in video. We propose a "multi-scale U-Net" network architecture for unsupervised video anomaly detection based on the generative adversarial network (GAN) structure. Shortcut Inception Modules (SIMs) and residual skip connections are employed in the generator network to increase its training and testing capability. Asymmetric convolutions are applied instead of traditional convolution layers to decrease the number of training parameters without a penalty in detection accuracy. In the training phase, the generator network is trained to generate normal events, making the generated image and the ground truth as similar as possible. The multi-scale U-Net retains useful image features that would otherwise be lost to the convolution operators during training. The generator network is trained by minimizing the reconstruction error on normal data; in the testing phase, the reconstruction error serves as an indicator of anomalies. Our proposed framework has been evaluated on three benchmark datasets: UCSD Pedestrian, CUHK Avenue, and ShanghaiTech. The proposed framework surpasses state-of-the-art learning-based methods on these datasets, achieving 95.7%, 86.9%, and 73.0% AUC, respectively. Moreover, the numbers of training and testing parameters in our framework are reduced compared to the baseline network architecture, while the detection accuracy is still improved.


I. INTRODUCTION
Currently, video from Closed-Circuit Television (CCTV) cameras is generated every minute as the number of cameras in public places grows to improve efficiency, safety, and security against criminal and terrorist attacks. CCTV cameras monitor areas such as shopping malls, hotels, streets, banks, and government buildings. However, monitoring anomalous events across hundreds of surveillance cameras with human labor alone is impractical. To overcome this problem, developing intelligent computer vision algorithms that automatically detect events in a video scene is a viable solution. Anomaly detection in video has attracted researchers in many fields, especially computer vision. The term ''anomaly'' refers to an event that deviates from normality. The challenge in detecting an anomalous event is to distinguish the pattern of object movement, i.e., normal or anomalous, since the video scene captured by surveillance cameras may contain movement over time. Real-world anomalous events are complicated, and it is difficult to define every specific event. Although anomaly detection algorithms have reached acceptable accuracy under certain conditions, they may still be affected by external and internal variations such as illumination, direction of object movement, motion velocity, occlusion, and similar object motion.
Video data is high-dimensional, containing noise, high variation, and complex interactions, which makes analyzing and defining anomalous events in a scene challenging. Many successful studies in action recognition [1]- [5] attempt to label video in which events and actions are clearly defined. However, human labeling of every event type in real-world video is extremely costly or impossible. Training videos also cannot cover every anomalous event, since some anomalies occur rarely or never. Three issues make detecting anomalies in video challenging: 1) anomalies are diverse and highly contextual. For example, people running in a park is considered a normal event, whereas running in a bank would be considered an anomaly [6]- [8].
2) The spatial and temporal dimensions of video data make processing each video frame computationally complex and costly. In practice, when an anomaly occurs, the algorithm should raise an alarm in time with acceptable accuracy and false alarm rates. 3) Environmental conditions are diverse, e.g., illumination, crowd density, object shadows, object occlusions, and complex backgrounds.
Among related works, several anomaly detection approaches based on either a convolutional autoencoder (Conv-AE) [9]- [11] or a U-Net [12] detect anomalies in different ways. These approaches learn normal patterns from training videos and then detect abnormal events that do not correspond to the learned model [10], [13]- [15]. Deep learning has been proposed for anomaly detection [9], [10], [12]- [19] in both supervised and unsupervised settings; deep learning-based methods have improved accuracy as well as reduced the false alarm rate. Additionally, the Generative Adversarial Network (GAN) [20] has been used to generate a realistic image from a given input image sequence [12]. However, the generator network in such anomaly detection frameworks still loses some features during training, which may affect detection accuracy.
In this paper, we propose a framework for video anomaly detection using the GAN structure. Spatial and temporal features are extracted through the multi-scale U-Net architecture. In training, PatchGAN [21] is utilized to distinguish the ground truth image from the output image of the generator. We also use optical flow during training to optimize the training parameters. In the testing phase, the detector refers to the regularity score of an image. A lower score indicates an anomalous situation, in which the error between a frame generated from the normal training data and the ground truth image is large. This paper is an extended version of work published in [22]. To summarize, our contributions in this paper are: • We employ the Shortcut Inception Module (SIM) and residual skip connections in the generator network, called ''multi-scale U-Net'', to make the network learn higher-level features.
• We apply the idea of asymmetric convolution layers and increase the width of the network architecture in order to attain both a small model size and high training efficiency.
• The proposed multi-scale U-Net reduces the number of training and testing parameters, while the anomaly detection accuracy is still significantly improved.
• We evaluate our proposed framework with three benchmark datasets of different scene scenarios. Experiments on the benchmark datasets show the effectiveness of our proposed framework for video anomaly detection.
The rest of this paper is organized as follows. We provide an overview of related works of anomaly detection approaches in Section II. Then details of our proposed framework are described in Section III. The experimental results are provided in Section IV followed by our final conclusion in Section V.

II. RELATED WORKS
Anomaly detection approaches can be classified into two main categories: hand-crafted feature based and learning-based methods. An existing hand-crafted feature based method [14] learns a set of sparse combinations to model normal events in a video scene. However, hand-crafted feature based methods require prior knowledge to define specific parameters for every possible abnormal pattern, which is difficult to adapt to the huge variation across different video scenes in real-time anomaly detection.
Learning-based methods currently achieve significant performance in a wide range of computer vision applications, improving the accuracy and reducing the false alarm rate of detection and recognition. Motion features are required to model object movement in a video. A 2D convolution layer outputs an image that loses the temporal features of the video signal; only 3D convolution can extract temporal features and output a volume. Xu et al. [17] presented a novel Appearance and Motion DeepNet (AMDN). Ionescu et al. [15] presented abnormal event detection based on a two-stage outlier elimination algorithm, which eliminates outliers using k-means clustering and classifies by training a one-class SVM. Sultani et al. [19] proposed a framework for training on anomalous and normal videos using Multiple Instance Learning (MIL), dividing videos and video segments into bags and instances. A deep anomaly ranking model is used to predict high anomaly scores, which reflect anomalous events. In these approaches, learned features are extracted in two streams (spatial and temporal), which may consume considerable time for training and testing. Fan et al. [18] proposed a video anomaly detection method based on a supervised learning approach called the Gaussian Mixture Variational Autoencoder.
The assumption is that an anomalous sample does not belong to any Gaussian component of a Gaussian Mixture Model (GMM). Recurrent neural networks (RNN) and long short-term memory (LSTM) are applied in anomaly detection to model temporal patterns. One work [9] leverages a convolutional LSTM based autoencoder (ConvLSTM-AE) to model both appearance and motion information. Another work [16] iteratively updates sparse coefficients via a stacked RNN to detect anomalies in videos.
A generative network is one of the learning-based methods proposed to generate more realistic data for anomaly detection. The generative network aims to infer the data distribution in order to generate new images that could belong to the same set as the training data. Chong and Tay proposed an end-to-end architecture for learning video representation [11], which includes two main components, one for the spatial component and the other for the temporal component. The structure of this network architecture is based on an autoencoder that aims to reconstruct the input image. Liu et al. proposed a framework for anomaly detection based on GAN [12]. To generate a more realistic future frame, a U-Net was used as the primary prediction network (a.k.a. the generator network). A motion feature is used in training by enforcing the optical flow between predicted images and ground truth images to be consistent. A skip connection is applied in each layer of the U-Net architecture [23] to improve the quality of the reconstructed image. At the end of the training phase, the discriminator network is used to distinguish an image created by the generator from the ground truth image. However, some features are lost in these generator networks during training due to the convolution operators in each layer. To resolve this issue, we present a framework that consists of a multi-scale generator network and residual skip connections to make the network learn higher-level image features.

III. PROPOSED APPROACH
In this section, details of our proposed framework are described, as illustrated in Fig. 1. First, given an input image sequence, a multi-scale U-Net is utilized as a generator network G to extract spatial features. Fig. 2 shows the structure of the multi-scale U-Net. We employ SIMs inside the multi-scale U-Net to make the network learn features at different scales. Instead of the traditional skip connections of the U-Net architecture, residual skip connections are applied in our architecture to propagate the spatial information lost during the convolution operations from encoder to decoder. These residual skip connections are beneficial for learning higher-level image features. In the training phase, we enforce the optical flow of the generated image to be close to that of the ground truth image in order to optimize the network parameters. Further, we use PatchGAN [21] as a discriminator network D in our framework to distinguish between the generated image and the ground truth image. We describe the details of each part in the subsections below.

A. FEATURE EXTRACTION USING INCEPTION ARCHITECTURE
U-Net is a convolutional neural network (CNN) architecture introduced for biomedical image segmentation [23]. The U-Net architecture mainly consists of two parts, the encoder and the decoder. The encoder captures the context of the image by compressing the features into a small vector, called the latent vector. The decoder, in turn, extracts features and recovers image detail from this vector, where upsampling layers are applied to increase the feature size. Typically, the encoder involves a sequence of two consecutive 3 × 3 convolution layers followed by a max-pooling operation. As explained in [24], a sequence of two 3 × 3 convolution layers resembles a 5 × 5 convolution operation with the same input and output sizes. To improve the learning efficiency of the U-Net architecture with feature learning at different scales, a viable way is to integrate 5 × 5 and 7 × 7 convolution operations in parallel with the 3 × 3 convolution operation. Another option for improving detection performance is to increase the size of the network architecture in terms of depth and width [25], [26].
In this work, we use U-Net as the base network architecture for the generator network. Unlike [23], we modify and replace the original convolution layers with SIMs and increase the width of the network so that it can learn higher-level features of the input image. Fig. 3 (c) illustrates the proposed SIM inspired by the idea of the inception module [24] (Fig. 3 (a)). As described above, we can replace the convolution layers with inception blocks. Although a performance gain can be expected from introducing larger convolution operations such as 5 × 5 and 7 × 7, the parallel network structure consumes considerable computational capacity. In the same manner as [24], we factorize the larger 5 × 5 and 7 × 7 convolution operators into stacks of 3 × 3 convolution operations; the outputs of the last two 3 × 3 convolutions approximate the 5 × 5 and 7 × 7 convolutions, as shown in Fig. 3 (b). We also take advantage of feature concatenation to extract features at different scales [25]. We then add a shortcut connection with an additional 1 × 1 convolution layer, which adds non-linearity to enhance the representation as well as reducing the network's dimension without a performance penalty [27]. In [22], traditional convolution layers are utilized in the inception module. Unlike [22], our proposed block uses the idea of the asymmetric convolution operation, which factorizes a standard two-dimensional convolution kernel into two one-dimensional convolution kernels. For example, a 3 × 3 convolution is equivalent to a 3 × 1 convolution followed by a 1 × 3 convolution, which reduces the size of the model and increases training efficiency [24], [28], [29].
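The asymmetric factorization is exact whenever the 2-D kernel is separable (rank-1). The following numpy sketch, which is not the paper's code, illustrates the equivalence for a separable kernel:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2-D 'valid' cross-correlation of a single-channel image x with kernel k."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# A rank-1 (separable) 3x3 kernel: the outer product of a 3x1 column and a 1x3 row.
col = np.array([1.0, 2.0, 1.0]).reshape(3, 1)    # 3x1 kernel (3 weights)
row = np.array([1.0, 0.0, -1.0]).reshape(1, 3)   # 1x3 kernel (3 weights)
k33 = col @ row                                  # equivalent 3x3 kernel (9 weights)

x = np.random.rand(8, 8)
full = conv2d_valid(x, k33)                        # one 3x3 convolution
stacked = conv2d_valid(conv2d_valid(x, col), row)  # 3x1 followed by 1x3

assert np.allclose(full, stacked)  # identical outputs, 6 weights instead of 9
```

In the network the 3 × 1 and 1 × 3 kernels are learned independently, so the stacked pair is a constrained approximation of a general 3 × 3 convolution rather than a drop-in replacement.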
Details of the proposed SIM are summarized in Table 1. We assign W_j to control the number of filters used in our proposed module at each depth j. Inside the SIM, the filters are assigned to the branches according to the combination that achieved the best results in our preliminary experiment. W_j is computed from N_j, the number of filters at the corresponding depth j of the multi-scale U-Net, scaled by a scalar coefficient α. Typically, the number of filters should increase gradually with depth, so that the memory usage of the earlier, higher-resolution layers does not grow as the network deepens. Therefore, the number of filters N_j of the network architecture at depth j is set to 2^(5+j). We selected α = 1.5, as it keeps the number of parameters slightly below that of the original U-Net.
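As a back-of-the-envelope check, the filter schedule N_j = 2^(5+j) and the weight savings of the asymmetric factorization can be tabulated in a few lines of Python. The single-stage layer shapes here are illustrative assumptions, not the exact SIM configuration:

```python
def n_filters(j):
    """Number of filters N_j = 2^(5+j) at depth j, per the schedule in the text."""
    return 2 ** (5 + j)

for j in range(4):
    n_in = n_out = n_filters(j)
    std = 3 * 3 * n_in * n_out              # weights of one standard 3x3 conv
    asym = (3 * 1 + 1 * 3) * n_in * n_out   # weights of a 3x1 conv + 1x3 conv pair
    print(f"depth {j}: N_j={n_out:4d}  3x3 weights={std:8d}  "
          f"3x1+1x3 weights={asym:8d}  ratio={asym / std:.2f}")
```

The ratio is 6/9 ≈ 0.67 per factorized stage, which is where the overall parameter reduction of the multi-scale U-Net comes from.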

B. RESIDUAL SKIP CONNECTION
The U-Net architecture [23] also proposed the idea of using skip connections between the encoder, after the max-pooling operation, and the decoder, before the deconvolution layer. The aim of the skip connection is to propagate the spatial information lost in every convolution operation from the encoder to the decoder, which is beneficial for recovering a clean image. By the design of the U-Net architecture, the features from the encoder are low-level features, while the features from the decoder are higher-level, as they are computed deeper in the network. Thus, fusing these two sets of features directly could hamper feature learning and affect the reconstruction output. Following the deep residual network [30], which proposed the residual learning block shown in Fig. 4 (a), we introduce residual skip connection blocks into our proposed generator network. As illustrated in Fig. 4 (b), the proposed residual skip connection block consists of an asymmetric convolution path, a 3 × 1 convolution layer followed by a 1 × 3 convolution layer, and a shortcut connection with a 1 × 1 convolution layer, which allows the network to learn additional information from the input. In the generator network, instead of concatenating the feature maps from the encoder directly to the decoder, we pass the encoder features through a chain of residual skip connection blocks, and the output of the chain is concatenated with the decoder features. A significant amount of image detail can be lost or corrupted by additional convolution layers [31]. The residual skip connections therefore make it possible to keep useful features lost by the convolution operations, and they are beneficial for training a deep network while still having fewer parameters. We denote a residual skip connection block as RB_i,j, where i is the number of blocks used at each depth j.
Since the spatial size of the feature maps in the encoder decreases as the image is down-sampled at every step by the max-pooling layer, we gradually decrease the number of blocks RB_i,j, with i = 4, 3, 2, 1 at each depth j, respectively. The number of filters in each block is set to the same N_j as the corresponding depth j.
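A minimal numpy sketch of one such block, assuming untrained random weights and a toy feature map purely to show the data flow of the asymmetric path plus the 1 × 1 shortcut, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x1(x, w):
    """Vertical 3x1 convolution with 'same' padding. x: (H, W, Cin), w: (3, Cin, Cout)."""
    xp = np.pad(x, ((1, 1), (0, 0), (0, 0)))
    H = x.shape[0]
    return sum(np.tensordot(xp[a:a + H], w[a], axes=([2], [0])) for a in range(3))

def conv1x3(x, w):
    """Horizontal 1x3 convolution with 'same' padding. w: (3, Cin, Cout)."""
    xp = np.pad(x, ((0, 0), (1, 1), (0, 0)))
    W = x.shape[1]
    return sum(np.tensordot(xp[:, b:b + W], w[b], axes=([2], [0])) for b in range(3))

def residual_skip_block(x, w_v, w_h, w_sc):
    """One RB: an asymmetric 3x1 -> 1x3 convolution path plus a 1x1 shortcut, summed."""
    path = conv1x3(conv3x1(x, w_v), w_h)
    shortcut = x @ w_sc            # a 1x1 convolution is per-pixel channel mixing
    return path + shortcut

# Toy encoder feature map; chain 4 blocks, matching the schedule i = 4 at depth j = 0.
C = 32
x = rng.standard_normal((16, 16, C))
for _ in range(4):
    w_v = rng.standard_normal((3, C, C)) * 0.1
    w_h = rng.standard_normal((3, C, C)) * 0.1
    w_sc = rng.standard_normal((C, C)) * 0.1
    x = residual_skip_block(x, w_v, w_h, w_sc)

assert x.shape == (16, 16, C)      # spatial size and channel count are preserved
```

The output of the chain keeps the encoder feature map's shape, so it can be concatenated with the decoder features exactly where a plain skip connection would be.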

C. OBJECTIVE FUNCTIONS
At training time, G learns to make the ground truth image I and the generated image Î consistent. The intensity loss L_int and the gradient loss L_gd between the two images along the two spatial dimensions are used to minimize the reconstruction error between I and Î. They can be computed as follows:

L_int(Î, I) = ||Î − I||_2^2,

L_gd(Î, I) = Σ_{i,j} ( | |Î_{i,j} − Î_{i−1,j}| − |I_{i,j} − I_{i−1,j}| | + | |Î_{i,j} − Î_{i,j−1}| − |I_{i,j} − I_{i,j−1}| | ),

where i, j denote the spatial indexes of a video frame.
The optical flow loss is applied to capture motion information and optimize the training parameters; we use FlowNet [32] to estimate the optical flow. Following [9], we apply the L1 distance to calculate the motion penalty:

L_flow(F̂, F) = ||F̂ − F||_1,

where F is the ground truth optical flow estimated from two consecutive frames I_t and I_t+1, and F̂ is the optical flow calculated from I_t and the generated image Î_t+1. In addition to the loss functions described above, we use an adversarial loss based on the Generative Adversarial Network (GAN) [20] to constrain the training process and improve model performance [12]. Given an input image sequence, the proposed multi-scale generator G is trained with the adversarial loss, which encourages the generator to generate a more realistic image. A discriminator network D is used to optimize the model parameters so that the generated image Î becomes indistinguishable from the ground truth image I. We utilize PatchGAN [21] as the discriminator network. PatchGAN maps Î to small patches, where the discriminator takes each individual patch and predicts whether the patch comes from I or Î. The discriminator outputs a scalar that classifies a patch from I as class 1 and a patch from Î as class 0. The goal of training G is to generate an image that D classifies as class 1. A mean squared error loss function L_MSE and the adversarial loss L_adv are used as objective functions, which can be calculated as follows:

L_adv^D(Î, I) = Σ_{m,n} ( ½ L_MSE(D(I)_{m,n}, 1) + ½ L_MSE(D(Î)_{m,n}, 0) ),

L_adv^G(Î) = Σ_{m,n} ½ L_MSE(D(Î)_{m,n}, 1),
where D(·)_{m,n} is the output of the discriminator network for patch (m, n), and L_MSE(Ŷ, Y) = (Ŷ − Y)^2 is the mean squared error between a prediction and its target. Finally, our proposed final objective function L for training the generator can be computed as follows:

L = λ_int L_int(Î, I) + λ_gd L_gd(Î, I) + λ_adv L_adv^G(Î) + λ_flow L_flow(F̂, F),

where λ_int, λ_gd, λ_adv, and λ_flow are the weights of each loss.
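The reconstruction-side loss terms can be sketched in numpy. The forms below follow the common choices in [12] (squared L2 intensity, L1 gradient difference, L1 flow); the adversarial term is passed in precomputed, and the λ values are placeholders, not the paper's settings:

```python
import numpy as np

def intensity_loss(pred, gt):
    """L_int: squared L2 distance between the generated and ground-truth frames."""
    return np.mean((pred - gt) ** 2)

def gradient_loss(pred, gt):
    """L_gd: L1 distance between the absolute image gradients along both spatial axes."""
    dy_p, dy_g = np.abs(np.diff(pred, axis=0)), np.abs(np.diff(gt, axis=0))
    dx_p, dx_g = np.abs(np.diff(pred, axis=1)), np.abs(np.diff(gt, axis=1))
    return np.mean(np.abs(dy_p - dy_g)) + np.mean(np.abs(dx_p - dx_g))

def flow_loss(flow_pred, flow_gt):
    """L_flow: L1 distance between predicted and ground-truth optical flow fields."""
    return np.mean(np.abs(flow_pred - flow_gt))

def total_loss(pred, gt, flow_pred, flow_gt, l_adv,
               lam_int=1.0, lam_gd=1.0, lam_adv=0.05, lam_flow=2.0):
    """Weighted sum L = lam_int*L_int + lam_gd*L_gd + lam_adv*L_adv + lam_flow*L_flow.
    The lam_* defaults are illustrative placeholders."""
    return (lam_int * intensity_loss(pred, gt)
            + lam_gd * gradient_loss(pred, gt)
            + lam_adv * l_adv
            + lam_flow * flow_loss(flow_pred, flow_gt))
```

With a perfect generator (pred equal to gt and matching flow), every reconstruction term vanishes and only the adversarial term remains.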

D. ANOMALY DETECTION USING REGULARITY SCORE
In testing, we compute the anomaly score for every frame of the testing video by measuring the similarity between the ground truth image and the generated image. In the same manner as [12], the Peak Signal-to-Noise Ratio (PSNR) is utilized as the detection score in our framework. The PSNR measures image quality, where a low PSNR value means that the frame is likely to be abnormal. PSNR is defined as

PSNR(I, Î) = 10 log_10 ( max_Î^2 / ( (1/N) Σ_{i=0}^{N−1} (p_t(i) − p̂_t(i))^2 ) ),

where max_Î represents the maximum intensity value in the generated image Î, p_t(i) and p̂_t(i) are the pixel intensities at index i in I and Î, respectively, and N denotes the total number of pixels in the image. Then, we obtain a regularity score R(t) for each frame t in the video by normalizing the PSNR to the range [0, 1] as follows:

R(t) = (PSNR_t − min PSNR) / (max PSNR − min PSNR),

where min PSNR and max PSNR are the minimum and maximum PSNR values over all frames of the corresponding test video. Finally, we detect an anomalous event in a frame by thresholding the regularity score R(t).
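The detection step reduces to a few lines. This numpy sketch assumes per-video min-max normalization of the PSNR values and a hypothetical threshold of 0.5:

```python
import numpy as np

def psnr(gt, pred):
    """PSNR between a ground-truth and a generated frame; lower suggests an anomaly."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return np.inf                      # identical frames
    peak = pred.max()                      # maximum intensity of the generated frame
    return 10.0 * np.log10(peak ** 2 / mse)

def regularity_scores(psnr_values):
    """Min-max normalize the per-frame PSNR of one test video to [0, 1]."""
    p = np.asarray(psnr_values, dtype=np.float64)
    return (p - p.min()) / (p.max() - p.min())

def detect_anomalies(psnr_values, threshold=0.5):
    """Flag frames whose regularity score R(t) falls below the chosen threshold."""
    return regularity_scores(psnr_values) < threshold
```

Because the normalization uses the min and max over each test video, the scores of different videos are comparable only through the common threshold, not in absolute PSNR terms.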

IV. EXPERIMENTAL RESULTS

A. EXPERIMENTAL SETUP
Our implementation is based on the TensorFlow framework [33] using Python 3.7. The network architecture was trained and tested on an NVIDIA GeForce RTX 2080 GPU. Training uses the Adam optimizer, and the batch size is fixed to 4. In training and testing, the input images of the network architecture are resized to 256 × 256 with 3 color channels and normalized to the range [−1, 1]. To be consistent with [12], we set 4 consecutive images as the input image sequence.
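A sketch of this preprocessing, using nearest-neighbor resizing as a stand-in for whatever interpolation the actual pipeline uses:

```python
import numpy as np

def resize_nearest(img, size=(256, 256)):
    """Nearest-neighbor resize of an (H, W, C) image; a stand-in for the real resizer."""
    H, W = img.shape[:2]
    rows = np.arange(size[0]) * H // size[0]
    cols = np.arange(size[1]) * W // size[1]
    return img[rows][:, cols]

def normalize(img):
    """Map 8-bit pixel intensities [0, 255] to [-1, 1]."""
    return img.astype(np.float32) / 127.5 - 1.0

# 4 consecutive RGB frames form one input sequence, as described in the text.
frames = [np.random.randint(0, 256, (240, 360, 3), dtype=np.uint8) for _ in range(4)]
batch = np.stack([normalize(resize_nearest(f)) for f in frames])

assert batch.shape == (4, 256, 256, 3)
assert batch.min() >= -1.0 and batch.max() <= 1.0
```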

B. DATASETS
We evaluated and compared our framework with state-of-the-art learning-based methods on three benchmark datasets for video anomaly detection: CUHK Avenue [14], UCSD Pedestrian [13], and ShanghaiTech [12]. The CUHK Avenue dataset contains 16 training and 21 testing videos with abnormal events such as abnormal objects, throwing, and strange actions. The UCSD Pedestrian dataset provides two subsets, Ped1 and Ped2. The Ped1 subset contains 34 training and 36 testing videos, and Ped2 contains 16 training and 12 testing videos. The anomalous events in the UCSD dataset include cars, scooters, wheelchairs, and bicycles. The ShanghaiTech dataset covers challenging scenarios for video anomaly detection due to large variations in appearance and viewpoint; it consists of 13 scenes with 330 training and 107 testing videos.

C. EVALUATION METRIC
Following the framework described in [12], we use the receiver operating characteristic (ROC) curve, the Equal Error Rate (EER), and the corresponding area under the curve (AUC) to evaluate detection performance for quantitative comparison. The ROC curve visualizes the performance of a binary classifier by plotting the trade-off between the true positive rate (TPR) and the false positive rate (FPR) as the discrimination threshold varies. The AUC, used in most previous works [9]- [12], [16], [18], measures the entire two-dimensional area under the ROC curve, providing an aggregate measure of performance across all possible classification thresholds. In this study, higher AUC values and lower EER values indicate better anomaly detection performance.
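For reference, ROC, AUC, and EER can be computed from per-frame scores without external dependencies. This sketch assumes higher scores mean more anomalous (e.g., 1 − R(t)) and ignores score ties:

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR at every threshold; labels: 1 = anomaly, scores: higher = more anomalous."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order]
    tpr = np.cumsum(y) / max(y.sum(), 1)
    fpr = np.cumsum(1 - y) / max((1 - y).sum(), 1)
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def eer(fpr, tpr):
    """Equal Error Rate: the operating point where FPR equals the miss rate 1 - TPR."""
    fnr = 1 - tpr
    idx = np.argmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2)
```

With perfectly separated scores the AUC is 1.0 and the EER is 0.0; production evaluations would typically use a vetted implementation such as scikit-learn's `roc_curve`.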

D. QUANTITATIVE RESULTS
We compared our proposed framework with several anomaly detection methods based on deep learning, including autoencoders [10], [11], [18], LSTM [9], [16], and GAN [12]. The experimental results summarized in Table 2 demonstrate that our proposed framework achieves better AUC and EER on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets. Specifically, compared to Liu et al. [12], which serves as the baseline for our proposed framework, our approach achieves relative AUC and EER improvements over the original U-Net on all these datasets. The improvement on the ShanghaiTech dataset is slight, since ShanghaiTech is a larger-scale dataset that contains several types of anomalies and complicated movement. The performance on UCSD Ped1 is 85.3%, whereas the best result of 94.9% is achieved by Fan et al. [18], who employed a two-stream network that combines the appearance and motion of anomalies, which may incur considerable computational cost. The overall results illustrate that the design of our proposed framework is able to capture appearance and motion information to detect anomalies in real-world scenes.

E. QUALITATIVE RESULTS
The qualitative results of our proposed framework on two testing videos from the CUHK Avenue and UCSD Ped2 datasets are illustrated in Fig. 5. We can see that the generated images tend to achieve a high regularity score, which decreases when an anomaly occurs (e.g., running, bicycle intrusion). We also show the output of the proposed generator network in Fig. 6, where the top row shows ground truth images, the middle row shows generated images, and the bottom row shows the image difference between ground truth and generated images. The generated images and the image differences compared to the ground truth indicate that the image quality in the anomalous areas is blurred and distorted because the generator network could not reconstruct the unseen objects from the learned model (i.e., ''throwing'', ''car and bicycle approaching'', ''strange action''), which results in a lower regularity score in these video scenes.

F. ABLATION STUDY
Table 3 summarizes the performance evaluation of our proposed framework for confirming each individual contribution. The experiments are performed on the CUHK Avenue dataset using two components added to the original U-Net architecture as a baseline: the inception module and the residual skip connection. First, the traditional skip connections were replaced by our residual skip connections. Second, the two consecutive 3 × 3 convolution layers were replaced by our inception module. The results show that the residual skip connection slightly improves performance when included in the U-Net architecture, while employing the inception module is even more effective for detection performance. Combining the inception module and the residual skip connection achieves the best detection performance compared to the original U-Net architecture.

G. RUNNING TIME ANALYSIS
A comparison of the number of parameters in the original U-Net architecture and the proposed one is presented in Table 4, demonstrating that our multi-scale U-Net reduces the number of parameters for training and testing anomaly detection while improving accuracy. We also evaluated the computational cost of the proposed framework on the ShanghaiTech dataset. The running times were measured on an NVIDIA GeForce RTX 2080 Ti GPU with 24 GB of RAM. Our proposed framework takes 0.041 seconds per frame on average. Hence, it can run at 24 frames per second (fps) for the entire pipeline, which is on par with or slightly better than the baseline network architecture [23], which runs at about 22 fps.

V. CONCLUSION
In this work, we presented a framework based on the multi-scale U-Net architecture for anomaly detection in video. Inception modules are employed instead of the traditional convolution layers utilized in the original U-Net, giving our multi-scale U-Net the ability to learn image features at different scales. The skip connections were replaced by our proposed residual skip connections, whose shortcut connections increase the ability to train a deeper network while still having fewer parameters. In the feature extraction part, asymmetric convolution kernels are applied to reduce the number of network parameters while maintaining detection accuracy. As the qualitative and quantitative results show, our proposed framework based on the multi-scale U-Net achieves better performance with a lightweight model and less memory usage compared to other learning-based anomaly detection approaches. However, the generator network is unable to distinguish ambiguous anomalous objects in a scene. In future work, we will experiment with applying two-stream inputs for feature extraction to our model to capture both the appearance and the motion characteristics of objects and to further enhance anomaly detection performance.