A New Unsupervised Video Anomaly Detection Using Multi-Scale Feature Memorization and Multipath Temporal Information Prediction

Anomaly detection in video is an advanced computer vision challenge that involves recognizing video segments containing out-of-the-ordinary motions or objects. Most recent techniques in video anomaly detection have focused on reconstruction and prediction methods; however, in practice, frame reconstruction methods deliver suboptimal results because the strong generalization ability of convolutional neural networks allows abnormal frames to be reconstructed accurately as well. Meanwhile, frame prediction methods have drawn much attention and are a powerful way of simulating the dynamics of natural scenes. This paper provides a new unsupervised frame prediction-based algorithm for anomaly detection that improves overall performance. Our suggested strategy follows a U-Net-like architecture that employs a Time-distributed 2D CNN-based encoder and a 2D CNN-based decoder. A memory module is used in the design to retrieve and store the most relevant prototypical patterns of the normal scenario in the memory slots during training, giving our model the capacity to produce poor predictions for unusual inputs. For the memory module to fully retain normal semantic patterns at multiple scales, we propose an upstream multi-branch structure composed of dilated convolutions to extract contextual information. We also provide a multi-path structure that, as an effective substitute for the optical flow loss function, directly incorporates temporal information into the network design. Experiments on the UCSD Ped1, UCSD Ped2, and CUHK Avenue benchmark datasets revealed that our design outperforms most competing models.


I. INTRODUCTION
With the rapid development of surveillance equipment and growing demands for social security, automated detection of abnormality in surveillance video has become more critical than ever. Nevertheless, most security cameras are not constantly monitored by human operators; even when they are, active live monitoring consumes a tremendous amount of human labor. These factors have prompted the development of reliable and intelligent video surveillance algorithms. However, contrary to most supervised video analysis tasks, detecting abnormal events in video content is challenging for three reasons [1]. First, abnormal occurrences happen infrequently in real-world scenarios, leading to a class imbalance in the training set. Furthermore, multiple potential anomalous behaviors might occur in the scene, making it very time-consuming to acquire every type of anomaly and address the issue using two-class classification-based approaches [2]. As a result, earlier studies have explored unsupervised video anomaly detection approaches in which model training is conducted using only regular data to learn normality; any deviation from that learned normality is classified as an abnormality [3]. The third reason is that video abnormalities are scene-dependent [1]: behavior that is unusual in one scenario could be seen as usual in another scene, so defining a consistent notion of irregular events is challenging. For example, driving would be seen as normal on a highway, whereas it would be regarded as an anomaly on a sidewalk. The present paper focuses on video anomaly detection in a single scene since this covers extensive use cases and allows us to detect abnormal behaviors specific to the scene.

Traditional video anomaly detection approaches usually describe regular motion or appearance patterns based on extracting task-specific features. However, the key restriction of these approaches is their limited representational power in complex surveillance scenarios and the need for prior knowledge to design descriptors. Among traditional approaches, sparse reconstruction-based methods [4] have reached state-of-the-art performance, but their major weakness is the high computational cost of determining combination coefficients.
On the other hand, deep neural networks have proven their superiority in anomaly detection [5]. Some studies have used deep networks merely as feature extractors, attempting to find a decision boundary to distinguish normal features from abnormal ones using k-means or one-class SVM [6], [7], [8].
Other studies train deep networks end to end; such methods broadly fall into probability-based, reconstruction-based, and prediction-based categories. Methods in the first category attempt to train a probabilistic model that restricts the latent representation to accord with a Gaussian distribution, assuming that all normal samples follow that distribution; at inference time, abnormal samples are then distinguished by their lower probability. Although these approaches provide a more adaptable anomaly detection framework, uncertain domain knowledge and limited computational resources may cause the model to fail to mimic the complicated distributions of diverse regular events. Consequently, the model should generalize to unseen regular occurrences while remaining sensitive to strange, abnormal occurrences [18].
Many approaches follow the concept of reconstruction: they train a deep autoencoder to recover the original input frames from their compacted representation. However, the generalization ability of deep convolutional neural networks is high, so these strategies do not always ensure a significant reconstruction error for abnormal events [14]; hence, determining a threshold for separating normal samples from abnormal ones can be difficult. Based on the same concept, future frame prediction methods compare the predicted frame to its ground truth and identify anomalies as events that do not match the expectation. Compared with frame reconstruction approaches, frame prediction considers the anomaly in both appearance and location and encourages the model to exploit time dependency. Moreover, prediction-based methods can also partly alleviate the problem of abnormal events being reconstructed with low error [18].
Earlier prediction-based techniques usually employ a U-Net architecture as the frame predictor network. During model training, they impose a constraint on appearance by specifying intensity and gradient losses, and another constraint on motion by defining an optical flow loss [14], [19]. Meanwhile, generative adversarial networks were incorporated into the design to enhance the predictive power of U-Net for constructing a more accurate and realistic future frame [14]. However, despite realistic frame generation, adversarial networks make training inefficient [15]. Moreover, the optical flow loss does not adequately introduce temporal-domain information into the model, and calculating optical flow during training is time-consuming [15]. Therefore, considering the advantages of future frame prediction, in this paper we introduce a novel Multi-scale Multi-path network architecture, MsMp-net, inspired by U-Net [20] and trained in an unsupervised manner. The main idea is to memorize normal semantic information more efficiently while capturing temporal information, and to include both directly in the network design for video anomaly detection.
The MsMp-net structure comprises a Time-distributed 2D encoder, a 2D decoder, a multi-path predictor, and a multi-scale memorizer module, all trainable in an end-to-end manner. The overall design of MsMp-net is illustrated in Fig. 1. On the one hand, high-level feature maps contain highly discriminative information, but their spatial resolution is relatively low. On the other hand, low-level feature maps in the early layers of the encoder keep a higher spatial resolution, enabling us to model the finer motion dependency of dynamic objects in single-scene surveillance videos [21]. Consequently, different from the work in [15], we propose to employ a multi-path design based purely on low-level spatial feature maps to explicitly memorize the temporal dependency of objects at different resolutions via multiple ConvGRU modules [21]. From another point of view, as the network depth increases, the spatial information in the feature maps decreases, and semantic information becomes essential at this stage. Therefore, we introduce the multi-scale memorizer module to store the prototypical high-level features more effectively, which consists of two steps: in the first stage, semantic features are extracted at multiple scales through a multi-branch structure of dilated convolutions called the context module; in the second stage, these multi-scale regular patterns are stored through a memory module. The context module and memory module, together defined as the multi-scale-memorizer depicted in Fig. 1, work well together to collect and memorize multi-scale semantic characteristics of the regular scene. For the video anomaly detection task, it is worth noting that the regular scene may contain both small and large objects depending on their distance from the fixed camera; with the help of the proposed context module, the model can gain insight into varied-size objects in the scene. Meanwhile, we take advantage of the memory module [22], [23], [24] as the second part of the multi-scale-memorizer, previously used in frame-reconstruction-based methods to alleviate the over-generalization issue of autoencoders [22]. With the help of the memory module, the score gap between abnormal and normal frames can be enlarged by selecting only a few learned memory records while generating the subsequent frame.
During the training process, instead of concentrating solely on low-level per-pixel differences via the MSE loss function, we also adopt the DSSIM loss, which measures structural similarity, to more effectively establish the mapping from several consecutive past frames to the subsequent frame. At the decision stage, we propose a refined score based on the correlation of adjacent future frames for anomaly detection, which increases detection performance to a certain extent. Altogether, the benefit of our suggested framework is improved memorization of multi-scale features for both spatial and temporal characteristics, while the training process is simplified by not resorting to additional complicated adversarial and optical flow losses. In summary, the main contributions of this paper are as follows:
• A well-designed U-shaped structure with a Time-Distributed 2D encoder and a 2D decoder is proposed that can learn and memorize the multi-scale context of high-level semantic features thanks to the cascaded structure of the context module and memory module, i.e., the multi-scale-memorizer. To the best of our knowledge, this is the first effort at anomaly detection that extracts multi-scale characteristics of the scene and then retrieves them via the memory network.
• We incorporate temporal-domain information into the model using the low-level feature maps extracted in the early layers of the encoder to model finer motion variations. The proposed multi-path structure not only preserves the movements of dynamic objects in the normal scene but also simultaneously considers multi-scale information in the temporal domain.
• During training, we omit the optical flow loss function and generative adversarial networks to reduce the training cost, and add a DSSIM loss to optimize the network from a complementary perspective.
• We used an assessment strategy that considers the continuous nature of video sequences in calculating anomaly scores.

II. RELATED WORK
The objective of anomaly detection, an unsupervised learning task, is to find aberrant patterns or motions in data that are, by nature, infrequent occurrences [5]. We can define the unsupervised anomaly detection task as a two-step process: given a training set $\{x_i^{train}, y_i^{train}\}$, $i \in [1, N]$, in which samples are only available with negative labels, and test samples $\{x_i^{test}, y_i^{test}\}$, which can carry both positive and negative labels, the first stage determines a mapping model of the input data $f_\theta(x^{train})$, for instance using statistical models or convolutional deep neural networks, and the second stage finds the decision rule $S(f_\theta(x^{test})) \in \{0, 1\}$ to detect anomalies. In the literature, video anomaly detection methods can be categorized into traditional and deep learning-based methods. In the following subsections, we review the subject in more detail under these two categories.

A. TRADITIONAL VIDEO ANOMALY DETECTION
Numerous studies have been conducted over the years to detect anomalies. Traditional methods have largely explored hand-crafted feature extraction and sparse reconstruction techniques. In hand-crafted feature-extraction-based methods, a descriptor is manually created based on object trajectory features [25], [26] or low-level appearance features taken from either 3D video cuboids or 2D frame patches. Methods based on object trajectory have the advantages of simple implementation and quick execution, but their effectiveness may degrade significantly under occlusion. In the area of low-level features, many techniques, such as 3D gradients [27], Histograms of Oriented Gradients (HOG) [28], and Histograms of Optical Flow (HOF) [29], have been proposed to cope with scene complexity and occlusion. Mehran et al. [30] considered the interaction forces between moving particles by proposing a Social Force Model (SFM) to identify abnormal behavior. Kim and Grauman [31] exploited local optical flow information and modeled it with a Mixture of Probabilistic PCA (MPPCA); the trained model, combined with a Markov Random Field (MRF), then defined the degree of normality. Mahadevan et al. [32] used a Mixture of Dynamic Textures (MDT) to describe the regular dynamics and appearance features of crowds, and an extended version of MDT was proposed by the same authors in later work [33].
In the literature, multiple kernel learning [34] has also been used to train a classifier for abnormal event detection by simultaneously modeling motion dynamics and appearance, and several inference models have been presented in [35] to represent the activity of complex scenes online. On the sparse-reconstruction side, the authors of [4] and [36] constructed a sparse dictionary of bases derived from regular crowded patterns, where a patch is categorized as normal or abnormal based on its Sparse Reconstruction Cost (SRC) over the previously learned normal dictionary.

B. DEEP LEARNING-BASED VIDEO ANOMALY DETECTION
Many deep learning techniques have been developed to alleviate the restrictions of task-specific features. In [8], Xu et al. used stacked denoising autoencoders to automatically acquire motion and appearance feature representations (AMDN) and then generated anomaly scores by feeding these descriptors into multiple one-class SVM classifiers. An advance over AMDN, called DeepOC, was introduced in [37], optimizing feature learning and one-class classification in a single stage. In another work [6], the authors used a collection of pre-trained ConvNets to extract high-level features that were subsequently used to train a set of classifiers for anomaly detection. Ionescu et al. [38] suggested a multi-stage learning technique combining one-versus-rest classification and clustering after detecting the objects of interest. Reconstruction-based approaches for anomaly detection have also proliferated through AutoEncoders (AE) [11], [12], [39], [40], [41] and Generative Adversarial Networks (GANs) [42], [43]. For example, Hasan et al. [40] developed a 2D Convolutional Autoencoder (2D-ConvAE) to capture the structure of regular frames. Later, conventional autoencoders were extended with 3D convolutions [41] or convolutional LSTM modules [11], [39] to extract temporal cues of video occurrences in addition to spatial information. Nguyen and Meunier [12] proposed regenerating an input frame and translating it into its optical flow frame simultaneously via a two-stream network, combining a ConvAE for image reconstruction and a U-Net for image translation. A Deep Spatiotemporal Translation Network (DSTN) based on GAN and edge wrapping was presented by Ganokratanaa et al. [44] for video anomaly detection and localization. They introduced a novel feature collection method to provide a more suitable appearance and motion representation for the model input; the anomaly score was then determined by calculating the reconstruction error between the generated and actual dense optical flow. Later, they introduced a more accurate approach over DSTN [45] that determines abnormality based on the anomaly scores of the detected objects in the frame.
The assumption of a significant reconstruction error for abnormal inputs in reconstruction-based models is not necessarily valid. To mitigate this drawback, the authors of [22] proposed a memory-augmented AE architecture with a specialized memory bank that retrieves the most related regular patterns for reconstruction, leading to significantly different outputs for anomalous inputs. On the other hand, Ravanbakhsh et al. [43] used an autoencoder structure as the generator of a GAN for reconstruction, after which the produced outputs of both the video frame and its optical flow were used for anomaly detection. The authors of [42] designed two separate networks to accomplish different goals and identified abnormalities using a generative adversarial network. Among probability-based approaches, variational autoencoders [10], [46] and adversarial autoencoder networks [9], [47] have been widely used. Abati et al. [10] suggested combining an autoencoder and an auto-regressive architecture into a unified framework for directly learning the data distribution. Li et al. [9] suggested a two-stream, two-stage architecture cascading a 3D Convolutional Autoencoder and a 3D Adversarial Autoencoder. Cascading a ConvAE and a seq2seq LSTM for appearance and temporal feature extraction was another strategy, described by Pawar and Attar [48]; the decision boundary was then derived by fitting the seq2seq LSTM encodings to a Gaussian distribution using an RBF kernel. A frame prediction approach for anomaly detection was first recommended by Liu et al. [14], who used U-Net as the generator of a GAN. Their method was further extended by [16], [17], and [19], which took advantage of a reconstruction module alongside the prediction module. Recently, the authors of [13] slightly modified the U-Net architecture for future frame prediction to address the challenges of the baseline method mentioned above [14], proposing Msm-net by combining spatio-temporal information in the bottleneck of the U-Net architecture. Wang et al. [15] also suggested a robust multi-path technique that utilizes ConvGRU modules and non-local blocks to integrate local and global spatio-temporal information.

III. PROPOSED METHOD
A. FUTURE FRAME PREDICTION
Next-frame prediction attempts to predict future frame(s) by interpreting historical information. Mathematically, the current generated frame $\hat{I}_t$ is modeled as a function of $T$ consecutive previous frames in an unsupervised manner, i.e., $\hat{I}_t = f(I_{t-T}, \ldots, I_{t-2}, I_{t-1})$. At inference time, the generated output frame $\hat{I}_t$ is compared to the ground truth frame $I_t$ to identify the existence of irregularity in the current frame through the abnormality score function, i.e., $S_t = A(\hat{I}_t, I_t) \in [0, 1]$. Finally, each frame in the test set receives a binary label by applying a threshold and utilizing the sliding-window approach.

1) MULTISCALE MULTI-PATH U-NET
The overall architecture of MsMp-net is shown in Fig. 1. MsMp-net comprises four critical phases: encoding, multi-scale feature memorization, multi-path prediction, and decoding. The input pattern of our model is such that T consecutive frames are sent to the network separately. In contrast to the design suggested in [14], our method avoids the collapse of temporal information after the first 2D convolution layer of the encoder owing to the Time-distributed 2D encoder, which extracts the spatial information of each frame separately. As shown in Fig. 1, each Time-distributed 2D encoder block extracts T consecutive feature maps. These encoded features with different spatial resolutions are fed into three Predictor modules and, finally, into the multi-scale-memorizer module. The low-level sequential feature maps derived from the early layers of the encoder, with resolutions of 256, 128, and 64, are spatiotemporally remembered in each prediction path using a ConvGRU layer. The consecutive high-level features from the last encoder block are fed into the context module after temporal concatenation to achieve multi-scale semantic feature extraction. We then complete the memorization of these features by passing the multi-scale encoding from the context module into the memory module.
In other words, the context module supports the memory module in recovering more informative semantic patterns in the memory bank. To our knowledge, this is the first effort to combine a context module with a memory module to improve the effectiveness of memory-based normal pattern memorization. The addressed output features of the multi-scale-memorizer, along with the spatio-temporal output features of the prediction paths, are then delivered to the decoder to return to the initial resolution and generate the next frame $\hat{I}_t$ of the input video sequence $I_{t-T}, \ldots, I_{t-2}, I_{t-1}$. During training, the model learns to predict the output from regular input sequences by minimizing the differences between the generated output frame and the ground truth in different aspects. During testing, the learned memory components are fixed, and the output of the memory module is gathered from a number of chosen memory slots and then delivered to the decoder to generate the future frame. To assess the similarity of each generated output to its ground-truth frame, we utilize the Peak Signal-to-Noise Ratio (PSNR) metric. The final abnormality score is then determined from the PSNR of the current frame and the weighted average PSNR of P future frames. This approach to calculating abnormality scores is successful because it accounts for the temporal continuity of abnormal events in the video sequence. If the input sequence contains regular patterns, the predicted frame tends to be very close to its ground truth; hence, the differences will likely be small and the frame labeled as normal.
On the other hand, if the input sequence contains an abnormality, the memory module will convert the abnormal encoding features into normal ones before feeding them to the decoder. The output frame is therefore always constructed from normal encodings and will usually resemble a normal frame. Consequently, the difference between the predicted frame and the ground truth is likely significant, making anomalies separable and yielding an abnormal label.
In the remainder of this section, we first describe the constituent elements of the model: the encoder $f_{enc}(\cdot)$, the decoder $f_{dec}(\cdot)$, the predictor $f_{pred}(\cdot)$, and the multi-scale-memorizer $f_{Ms\text{-}Mem}(\cdot)$. Then, we define the loss function and explain how to compute the refined abnormality score.

2) ENCODER
The encoder receives a 5-dimensional hypercube of shape (B, T, C, H, W) as input, where B and T denote the minibatch size during training and the number of consecutive input frames, respectively, and C, H, and W denote the number of channels, height, and width of each frame. The encoder consists of four 2D convolutional blocks wrapped inside a time-distributed layer so that the same encoder is applied to every input frame. The 5-dimensional input sequences are thereby converted into 5-dimensional feature maps. Each convolutional block involves two or three consecutive 3 × 3 convolutions accompanied by batch normalization and a ReLU activation function. Table 1 shows the details of the encoder structure, where $C_{in}$ and $C_{out}$ indicate the input and output channels of each ConvBlock, and $N_c$ shows the number of consecutive convolution layers in each ConvBlock. $N_k$ and $N_s$ stand for kernel size and stride, respectively.
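To make the time-distributed behavior concrete, the sketch below folds the temporal axis into the batch axis before applying a shared 2D ConvBlock. This is only an illustrative reading of the design: the layer sizes and the max-pooling downsampling are our assumptions, not the exact Table 1 configuration.

```python
import torch
import torch.nn as nn

class TimeDistributed(nn.Module):
    """Applies the same 2D module to every frame of a (B, T, C, H, W) clip."""
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        y = self.module(x.reshape(b * t, c, h, w))   # fold time into batch
        return y.reshape(b, t, *y.shape[1:])         # restore (B, T, C', H', W')

def conv_block(c_in: int, c_out: int, n_c: int = 2) -> nn.Sequential:
    """N_c consecutive 3x3 conv + BN + ReLU, mirroring the encoder ConvBlocks."""
    layers = []
    for i in range(n_c):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Example: one encoder stage followed by 2x2 pooling for downsampling (assumed).
encoder_stage1 = TimeDistributed(nn.Sequential(conv_block(3, 64), nn.MaxPool2d(2)))
clip = torch.randn(2, 8, 3, 256, 256)                # B=2, T=8 frames
feats = encoder_stage1(clip)                         # -> (2, 8, 64, 128, 128)
```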

3) PREDICTOR
In most normal events, the temporal coherence of spatial features matters just as much as the features themselves and can be captured effectively by RNN blocks without extra components or complex designs [14]. Recent video anomaly detection methods that rely on recurrent structures use ConvLSTM [49] or ConvGRU [13], [50] to model temporal dependencies. Unlike ConvLSTM, ConvGRU employs only two gate structures, with fewer parameters, faster training, and similar modeling capacity. As mentioned before, the input sequence of $T$ consecutive frames is converted into $T$ consecutive feature maps at different resolutions, which act as time steps fed into each predictor. The ConvGRU module, acting as a predictor, allows these $T$ consecutive feature maps to be modeled in the temporal domain along three separate paths while simultaneously preserving spatial and temporal information. Since low-level feature maps have a greater spatial resolution, we believe they are better suited to excavate the temporal dependency of moving objects. As shown in Fig. 1, we adopt the feature maps with larger resolutions and send them to the predictors. More specifically, the ConvGRU module takes a feature map $e_t$ at the current time step $t$ and the hidden state $h_{t-1}$ from the previous time step as input. The hidden state $h_t$ is the output that records historical information; it is determined by the previous hidden state $h_{t-1}$, the candidate hidden state $\tilde{h}_t$, the update gate $z_t$, and the reset gate $r_t$. The reset gate $r_t$ allows the cell to discard irrelevant information in the hidden state to focus on what is more crucial, while the update gate $z_t$ decides how much of the candidate hidden state and the previous hidden state contribute to the current hidden state [21]. The corresponding equations for the ConvGRU in each prediction path $L$ (where $L = 1, 2, 3$) are given by:

$$z_t = \sigma(W_z * e_t + U_z * h_{t-1} + b_z) \tag{1}$$

$$r_t = \sigma(W_r * e_t + U_r * h_{t-1} + b_r) \tag{2}$$

$$\tilde{h}_t = \tanh\big(W_h * e_t + U_h * (r_t \odot h_{t-1}) + b_h\big) \tag{3}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

where $*$, $\sigma$, and $\odot$ denote the 2D convolution operation, the sigmoid activation function, and the Hadamard product, respectively; $W_z$, $U_z$, $W_r$, $U_r$, $W_h$, and $U_h$ are learnable weight matrices; $b_z$, $b_r$, and $b_h$ are learnable biases; and $L$ indexes the paths equipped with a prediction module. Fig. 2 depicts the particular structure of each prediction path. Each path has a shallow one-layer ConvGRU cell with a 3×3 convolution kernel, and the number of channels of the hidden state equals the number of channels of the corresponding feature maps. The state information is updated by the input, achieving modeling along the time dimension. Finally, we feed the hidden state $h_T$ at the final time step as the output of the predictor into the decoding process. It should be noted that passing low-level encoding information through the shallow ConvGRU module helps the model recover a constant background while simultaneously focusing on motion features.
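A minimal PyTorch sketch of one such ConvGRU cell is given below. It implements Eqs. (1)-(4), realizing the paired W/U convolutions as a single convolution over the concatenated input and hidden state, which is mathematically equivalent; it is an illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Single ConvGRU cell implementing Eqs. (1)-(4); hidden-state channels
    equal the input channels, as stated in the text."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        p = kernel_size // 2
        # Gates act on the concatenation [e_t, h_{t-1}] to realize W*e_t + U*h_{t-1}.
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=p)
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=p)

    def forward(self, e_t, h_prev):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([e_t, h_prev], 1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([e_t, r * h_prev], 1)))  # Eq. (3)
        return (1 - z) * h_prev + z * h_tilde                             # Eq. (4)

# Example: run one prediction path over T=8 low-level feature maps.
cell = ConvGRUCell(channels=64)
feats = torch.randn(2, 8, 64, 128, 128)               # (B, T, C, H, W)
h = torch.zeros(2, 64, 128, 128)
for t in range(feats.size(1)):
    h = cell(feats[:, t], h)                          # h after the loop is h_T
```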

4) MULTI-SCALE-MEMORIZER
The frame prediction network is required not merely to generate reliable frames in normal scenarios but also to widen the gap between prediction and ground truth for abnormal inputs by generating output significantly different from the anomalous input. To accomplish this, as part of the multi-scale-memorizer, we employ a memory module in our design. Furthermore, we propose a new context module on top of the memory module [13] to provide multi-scale semantic information to the memory module so that better contextual patterns can be recovered. We describe each of them in detail in the following two subsections.

B. CONTEXT MODULE
Recently, exploiting multi-scale information in images has shown promising results for various computer vision tasks, especially dense prediction problems [51], [52]. In surveillance videos, the same object can be captured at different spatial resolutions. This observation has led us to use a structure that automatically excavates multi-scale semantic characteristics. Inspired by previous research [13], we suggest a new context module made of dilated convolutions that gradually builds contextual knowledge over multiple receptive fields without sacrificing resolution, by incorporating six distinct branches of 3 × 3 dilated convolution with 512 channels, as shown in Fig. 3. To fully preserve the high-level spatial information provided by the last encoder block, we concatenate the temporal slices of the 5-dimensional hypercube along the channel axis, yielding a 4-dimensional hypercube with the new shape (B, T×C, 32, 32). The concatenated temporal feature maps are then sent into the context module for processing. The first branch extracts the original feature representation, while the following five branches extract multi-scale contextual information. The outputs of the six branches are then summed after a ReLU activation function, collecting more detailed semantic information about regular scenes.
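The following sketch illustrates the multi-branch dilated-convolution idea. The specific dilation rates are our assumption, since the text only specifies six 3 × 3 branches with growing receptive fields; setting padding equal to the dilation rate keeps the 32 × 32 resolution intact.

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """Six parallel 3x3 dilated-convolution branches whose ReLU outputs are
    summed; dilation rates (1, 2, 4, 8, 16, 32) are illustrative assumptions."""
    def __init__(self, c_in: int, c_out: int = 512,
                 dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=d, dilation=d) for d in dilations)

    def forward(self, x):
        # Branch 1 (dilation 1) keeps the original representation; the others
        # collect multi-scale context. Padding = dilation preserves resolution.
        return sum(torch.relu(b(x)) for b in self.branches)

# Example: T=8 high-level maps of C=512 channels concatenated along channels.
x = torch.randn(2, 8 * 512, 32, 32)                  # (B, T*C, 32, 32)
ctx = ContextModule(c_in=8 * 512)(x)                 # -> (2, 512, 32, 32)
```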

C. MEMORY MODULE
The memory module directly retrieves and stores the detailed semantic normal information supplied by the upstream context module into the memory slots during training. Instead of delivering the encoding directly to the decoder, the main idea is to convert an anomalous encoding to the closest normal one and pass an aggregated combination of memory slots to the decoder; in the testing phase, even if the input queries contain abnormalities, this mechanism generates output queries from the normal patterns stored in the memory bank. Specifically, attention-based memory addressing is used to implement this process. The memory module consists of two main parts: an external memory, $M$, to record the prototypical semantic patterns, and an attention-based addressing operator, $w$, for accessing the memory slots. As shown in Fig. 4, the memory is a matrix $M \in \mathbb{R}^{N \times F}$, where $N$ denotes the memory size and $F$ represents the size of the features stored in each memory slot $m_i$, $i \in \{1, 2, \ldots, N\}$. For simplicity, $F$ is set to the same size as the dimension of the query $q_m$. The memory, as a content-addressable module [23], computes the attention weight vector $w$ according to the cosine similarity of each memory element $m_i$ and the query $q_m$. Each weight $w_i$ is calculated through a SoftMax operation using the following formula:

$$w_i = \frac{\exp\big(d(q_m, m_i)\big)}{\sum_{j=1}^{N} \exp\big(d(q_m, m_j)\big)} \tag{5}$$

where $d(\cdot, \cdot)$ denotes cosine similarity and $w_i$ is non-negative and is the $i$-th entry of $w$; the elements of the attention weight vector $w \in \mathbb{R}^{1 \times N}$ sum to one. Using $w$, the output query $\hat{q}_m$ is calculated over the entire memory bank by the following formula:

$$\hat{q}_m = wM = \sum_{i=1}^{N} w_i m_i \tag{6}$$

As shown in Fig. 4 and according to (5) and (6), during training, the memory items most similar to the input query $q_m$ are retrieved using the addressing weight vector $w$ to generate the corresponding normal output representation $\hat{q}_m$. However, some abnormal queries may still be reconstructed well using linear combinations over memory slots. To address this concern, we limit access to the memory items while computing $\hat{q}_m$. As depicted in Fig. 1, during training, we drop some entries of $w$ by applying a sparsity constraint through the hard shrinkage operation:

$$\hat{w}_i = \begin{cases} w_i, & w_i > \lambda \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where $\lambda$ denotes the shrinkage threshold; in practice, values in the interval $[1/N, 3/N]$ are acceptable. Moreover, we rewrite the discontinuous function (7) in the form of a continuous ReLU-based expression so that it remains differentiable for the backward pass. Finally, $w$ is re-normalized after the hard shrinkage operation. By applying the hard shrinkage threshold, the model is encouraged to record the most important normal patterns in the limited number of memory slots during training. Backpropagation is used to train the memory module and the rest of the model parameters in an end-to-end manner.
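For clarity, a compact sketch of this addressing scheme in the spirit of MemAE [22] is given below. It is illustrative rather than the exact implementation; the small ε terms are added only for numerical stability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """Cosine-similarity memory addressing with hard shrinkage, per Eqs. (5)-(7)."""
    def __init__(self, mem_size: int = 1000, feat_dim: int = 512, lam: float = 0.002):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_size, feat_dim))  # M in R^{N x F}
        self.lam = lam                                               # shrinkage threshold

    def forward(self, q):                        # q: (B, F) flattened queries
        # Eq. (5): softmax over cosine similarities between query and memory slots.
        sim = F.cosine_similarity(q.unsqueeze(1), self.memory.unsqueeze(0), dim=2)
        w = F.softmax(sim, dim=1)
        # Eq. (7), continuous ReLU-based form, then re-normalization.
        w = F.relu(w - self.lam) * w / (torch.abs(w - self.lam) + 1e-12)
        w = w / (w.sum(dim=1, keepdim=True) + 1e-12)
        return w @ self.memory                   # Eq. (6): addressed output q_hat

# Example: address a batch of 4 query vectors.
mem = MemoryModule(mem_size=1000, feat_dim=512)
q_hat = mem(torch.randn(4, 512))                 # -> (4, 512)
```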

5) DECODER
As the name suggests, the decoder combines extracted features at multiple resolutions. It consists of a series of UpsampleConvBlock$_i$ (where $i = 1, 2, 3$) that generate the final output frame. The input of each UpsampleConvBlock$_i$ comprises two components joined through a channel concatenation operation: the first comes from the output of the corresponding Predictor block, and the second comes from the output of the previous decoding block, i.e., UpsampleConvBlock$_{i-1}$. As shown in Fig. 1, the multi-scale-memorizer output and the corresponding prediction module output are concatenated and input to the first decoder block. To perform upsampling at the beginning of every decoder block, we use 3 × 3 transposed convolutions rather than interpolation operations, which helps decrease checkerboard artifacts to some extent. Each convolutional block in the decoder involves a sequence of two or three consecutive 3 × 3 convolutions accompanied by batch normalization and a ReLU activation function. The details of the 2D decoder are given in Table 2, where $C_{in}$ and $C_{out}$ indicate the input and output channels of each UpsampleConvBlock, $N_c$ represents the number of successive convolution operations in each decoder block, $N_k$ and $N_s$ denote the kernel size and stride, respectively, and Shape$_{out}$ is the output shape of a block, with B the minibatch size during training. We apply only a 1×1 convolution in the last decoder block to produce the predicted frame, with tanh as the final activation function to scale the output values to [−1, 1].

In a nutshell, our goal is to generate the next future frame at time $t$ from the $T$ preceding frames. We define the encoder as a function $f_{enc}(\cdot)$ that produces $L = 4$ distinct feature maps at multiple resolutions as follows:

$$\{e^L_{t-T}, \ldots, e^L_{t-1}\} = f_{enc}(I_{t-T}, \ldots, I_{t-1}), \quad L = 1, 2, 3, 4 \tag{8}$$

The first three feature maps from the encoder blocks, which have a larger spatial resolution, are input into the prediction modules to complete the temporal modeling:

$$h^L_T = f_{pred}\big(e^L_{t-T}, \ldots, e^L_{t-1}\big), \quad L = 1, 2, 3 \tag{9}$$

Afterward, the high-level spatial information from the last path ($L = 4$) is concatenated and fed into the multi-scale-memorizer module for multi-scale semantic feature memorization, i.e.,

$$\hat{q}_m = f_{Ms\text{-}Mem}\big([e^4_{t-T}, \ldots, e^4_{t-1}]\big) \tag{10}$$
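As an illustration of one decoder stage, the sketch below combines transposed-convolution upsampling with skip concatenation. The channel counts are examples, not the exact Table 2 values.

```python
import torch
import torch.nn as nn

class UpsampleConvBlock(nn.Module):
    """One decoder stage: transposed-conv upsampling, channel concatenation
    with the corresponding predictor output, then conv + BN + ReLU."""
    def __init__(self, c_in: int, c_skip: int, c_out: int):
        super().__init__()
        # 3x3 transposed convolution doubles the spatial resolution.
        self.up = nn.ConvTranspose2d(c_in, c_out, 3, stride=2, padding=1, output_padding=1)
        self.conv = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                             # upsample to the skip resolution
        return self.conv(torch.cat([x, skip], 1))  # fuse with predictor features

# Example: fuse the memorizer output (512 ch, 32x32) with predictor features (256 ch, 64x64).
block = UpsampleConvBlock(c_in=512, c_skip=256, c_out=256)
out = block(torch.randn(2, 512, 32, 32), torch.randn(2, 256, 64, 64))  # -> (2, 256, 64, 64)
```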

D. TRAINING PHASE
The design of the loss function directly impacts the final performance of the model. As an objective function for training, we consider the similarity between the ground truth frame and the predicted output frame from three different viewpoints. First, the per-pixel difference between the two frames is measured by the intensity loss:

$$L_{int}(\hat{I}_t, I_t) = \big\lVert \hat{I}_t - I_t \big\rVert_2^2 \tag{11}$$

Similar to previous studies [51], we also compute the mean absolute difference of edges between the generated and ground truth frames as the gradient loss:

$$L_{gd}(\hat{I}_t, I_t) = \sum_{i,j} \Big\lVert \big|\hat{I}_{i,j} - \hat{I}_{i-1,j}\big| - \big|I_{i,j} - I_{i-1,j}\big| \Big\rVert_1 + \Big\lVert \big|\hat{I}_{i,j} - \hat{I}_{i,j-1}\big| - \big|I_{i,j} - I_{i,j-1}\big| \Big\rVert_1 \tag{12}$$

As indicated in (12), the simplest possible image gradient is chosen to reduce training time; it only evaluates the intensity differences of neighboring pixels in the vertical and horizontal directions. However, MSE may overestimate low-level differences and fail to reveal the high-level structural differences between the two frames. Consequently, we use DSSIM as a complementary similarity metric to MSE, which compares the brightness, contrast, and structure of two frames [50]:

$$SSIM(I_t, \hat{I}_t) = \frac{\big(2\mu_{I_t}\mu_{\hat{I}_t} + C_1\big)\big(2\sigma_{I_t \hat{I}_t} + C_2\big)}{\big(\mu_{I_t}^2 + \mu_{\hat{I}_t}^2 + C_1\big)\big(\sigma_{I_t}^2 + \sigma_{\hat{I}_t}^2 + C_2\big)} \tag{13}$$

where $(\mu_{I_t}, \sigma_{I_t})$ and $(\mu_{\hat{I}_t}, \sigma_{\hat{I}_t})$ denote the mean and standard deviation of $I_t$ and $\hat{I}_t$, respectively, $\sigma_{I_t \hat{I}_t}$ is their covariance, and $C_1$ and $C_2$ are stabilizers. We use the structural dissimilarity (DSSIM) as a loss function, i.e.:

$$L_{DSSIM}(\hat{I}_t, I_t) = \frac{1 - SSIM(I_t, \hat{I}_t)}{2} \tag{14}$$
Finally, the overall objective function for parameter learning of our model is computed as a weighted combination of $L_{int}$, $L_{DSSIM}$, and $L_{gd}$:

$$L = \lambda_{int} L_{int}(\hat{I}_t, I_t) + \lambda_{gd} L_{gd}(\hat{I}_t, I_t) + \lambda_{DSSIM} L_{DSSIM}(\hat{I}_t, I_t) \tag{16}$$

where $\lambda_{int}$, $\lambda_{DSSIM}$, and $\lambda_{gd}$ are hyper-parameters that determine the contribution of each loss term to the overall loss function.
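A possible PyTorch realization of Eqs. (11)-(14) and (16) is sketched below. The SSIM constants C1 and C2 and the use of global per-frame statistics (rather than a windowed SSIM) are our assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2):
    """Frame-level SSIM per Eq. (13), using global per-frame means/variances.
    c1, c2 are assumed stabilizer values."""
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def gradient_loss(pred, gt):
    """Mean absolute difference of horizontal/vertical gradients, per Eq. (12)."""
    dx = lambda img: (img[..., :, 1:] - img[..., :, :-1]).abs()
    dy = lambda img: (img[..., 1:, :] - img[..., :-1, :]).abs()
    return (dx(pred) - dx(gt)).abs().mean() + (dy(pred) - dy(gt)).abs().mean()

def total_loss(pred, gt, l_int=1.0, l_gd=0.1, l_dssim=0.5):
    """Weighted combination of Eq. (16) with the paper's hyper-parameters."""
    intensity = F.mse_loss(pred, gt)                      # Eq. (11)
    dssim = ((1.0 - ssim(pred, gt)) / 2.0).mean()         # Eq. (14)
    return l_int * intensity + l_gd * gradient_loss(pred, gt) + l_dssim * dssim

# Example with random frames scaled to [-1, 1].
loss = total_loss(torch.rand(2, 3, 256, 256) * 2 - 1, torch.rand(2, 3, 256, 256) * 2 - 1)
```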

E. ABNORMALITY ESTIMATION
During the test period, we assess the detection performance of our proposed model. Similar to previous works [14], [54], we use PSNR as a metric to calculate the abnormality score of each video frame. First, we compute the MSE between the generated frame $\hat{I}_t$ and its ground truth $I_t$ as the Euclidean distance in intensity space over all pixels. The PSNR is then computed using (17):

$$PSNR(I_t, \hat{I}_t) = 10 \log_{10} \frac{\big[\max_{I_t}\big]^2}{\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(I_t^{i,j} - \hat{I}_t^{i,j}\big)^2} \tag{17}$$

where $\max_{I_t}$ denotes the maximum possible intensity value of $I_t$, and $H$ and $W$ specify the dimensions of the frame. However, in the initial moments of an abnormal occurrence, the abnormal event appears only in a small part of the frame. For example, in the first moments that a non-pedestrian object breaks into the scene, the abnormality could be missed due to the averaging operation over all pixels in the error map. As a solution, we can leverage the continuous nature of video events and refine the abnormality score obtained from the PSNR metric. Inspired by recent work [49], we propose a score estimation scheme in which the abnormality score is based on the PSNR of the current frame and the weighted average PSNR of the $P$ future frames:

$$S_{refined}(t) = PSNR(I_t, \hat{I}_t) + \sum_{j=1}^{P} \omega_j \, PSNR(I_{t+j}, \hat{I}_{t+j}) \tag{18}$$

where $\omega_j$ is the weight assigned to the $j$-th future frame. After calculating the refined score for each frame of the corresponding dataset via (18), the obtained scores are normalized to the range [0, 1], and the final abnormality score is computed using (19):

$$S_{final}(t) = 1 - \frac{S_{refined}(t) - \min_t\big(S_{refined}(t)\big)}{\max_t\big(S_{refined}(t)\big) - \min_t\big(S_{refined}(t)\big)} \tag{19}$$

The higher the abnormality score, the higher the possibility of an abnormality in the frame at time $t$.
where $\min_t(S_{refined}(t))$ and $\max_t(S_{refined}(t))$ denote the minimum and maximum of the refined scores, respectively. It is noteworthy that calculating $\min_t(S_{refined}(t))$ and $\max_t(S_{refined}(t))$ is impractical in real-time applications, implying that the two values should be determined experimentally from historical information in a real-time anomaly detection system.
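The NumPy sketch below walks through Eqs. (17)-(19). The uniform weights ω_j and the handling of the last P frames of a video are our own assumptions, as the paper leaves them open.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (17): PSNR between a predicted frame and its ground truth."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(gt.max() ** 2 / (mse + 1e-12))

def refined_scores(psnrs: np.ndarray, P: int = 2, weights=None) -> np.ndarray:
    """Eqs. (18)-(19): augment each frame's PSNR with a weighted average of the
    next P frames, then min-max normalize and invert so that high = abnormal."""
    weights = np.ones(P) / P if weights is None else np.asarray(weights, dtype=float)
    s = np.empty_like(psnrs, dtype=float)
    for t in range(len(psnrs)):
        fut = psnrs[t + 1: t + 1 + P]          # future neighbors (shorter at video end)
        w = weights[:len(fut)]
        # Assumed tail handling: fall back to the current PSNR when no future frames exist.
        s[t] = psnrs[t] + (np.dot(w, fut) / w.sum() if len(fut) else psnrs[t])
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)   # normalize to [0, 1]
    return 1.0 - s                                     # abnormality score, Eq. (19)

# Example: per-frame PSNRs of one test video.
scores = refined_scores(np.random.uniform(20, 40, size=200), P=2)
```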

IV. EXPERIMENTS
In this section, we perform experiments on the CUHK Avenue [52] and UCSD Pedestrian datasets [32]. We also compare the performance of our proposed MsMp-net with several existing state-of-the-art shallow and deep models. Then we provide quantitative and qualitative results and present further evaluation and analysis of the proposed model.

A. EXPERIMENTAL SETUP
1) DATASETS
The UCSD Ped1 and Ped2 datasets [32] are the most popular benchmark datasets for video anomaly detection. Each contains footage from a different viewing angle of a static camera positioned above a pedestrian walkway, where population density can be high enough to cause significant occlusions. In UCSDPed1, each of the 34 training and 36 test videos contains 200 grayscale frames at a resolution of 158 × 238 pixels. UCSD Ped2 includes 16 training and 12 testing videos with a slightly higher resolution of 240 × 360 pixels and 120 to 200 frames per video. Any non-pedestrian object, as well as odd pedestrian motion, is regarded as an abnormality; ''biker,'' ''skater,'' ''vehicle,'' ''wheelchair,'' and ''walk over'' are examples of anomalies in these two sets. However, because of different camera angles, the spatial properties of identical abnormal incidents alter noticeably across the two sets, and UCSDPed2 has clearer abnormal appearance features in comparison with UCSDPed1.
CUHK Avenue [52] is a collection of brief video clips taken by a single camera mounted on the side of a building near some pedestrian walkways. People walking, entering, and exiting the building are the key activities in this dataset. The dataset includes 16 training and 21 testing videos with a resolution of 640 × 360 pixels. There are 47 unusual incidents overall, including ''throwing papers,'' ''throwing bag,'' ''kid skipping,'' ''wrong direction,'' ''running,'' and ''bag on the grass.''

2) EVALUATION CRITERIA
Similar to earlier studies by [14] and [50], after acquiring frame-level anomaly scores, Receiver Operating Characteristic (ROC) curves are generated for the video frames in each test set by plotting TPR against FPR while varying the threshold applied to the anomaly scores. We calculate the Area Under the ROC Curve (AUC) and the Equal Error Rate (EER) to summarize the ROC curve and use them as our main evaluation metrics. TPR and FPR are calculated as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Here, FP stands for false positive samples, FN for false negative samples, TN for true negative samples, and TP for true positive samples. Accordingly, TPR and FPR can be interpreted as the portion of positive samples (abnormal frames) in the test dataset that have been correctly detected and the portion of negative samples (normal frames) that have been incorrectly identified, respectively. The metric EER [48] is calculated as follows:

$$EER = \frac{FP + FN}{TC}$$

where TC denotes the total number of frames in the test dataset [48], evaluated at the threshold where the false positive rate equals the false negative rate. Therefore, a model with a higher AUC and a lower EER has better discrimination capability, yielding better performance. Fig. 5 shows an example of ROC curves and how the EER is read directly from the ROC curve.
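For reference, the frame-level AUC and EER can be computed as sketched below; reading the EER off the ROC curve at the point where FPR equals FNR matches the description of Fig. 5.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_auc_eer(scores: np.ndarray, labels: np.ndarray):
    """Frame-level AUC and EER from abnormality scores (1 = abnormal)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    eer_idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))  # point where FPR = FNR
    return roc_auc, fpr[eer_idx]

# Example with synthetic scores correlated with the labels.
labels = np.random.randint(0, 2, size=1000)
scores = np.clip(labels * 0.6 + np.random.rand(1000) * 0.5, 0, 1)
print(frame_level_auc_eer(scores, labels))
```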

3) IMPLEMENTATION DETAILS
We normalized the pixel values to the range of [−1, 1] and resized each input frame to be 256 × 256 pixels in resolution.
For the model input, we utilized T = 8 consecutive frames. We chose the size N of the memory M and the shrinkage threshold λ to be 1000 and 0.002, respectively. For all datasets, the hyperparameters in (16) were empirically set to λ_int = 1, λ_gd = 0.1, and λ_DSSIM = 0.5. The minibatch size was set to 8, and the initial learning rate and weight decay were set to 0.0002 and 0.0001, respectively. To reduce the training time and prevent overfitting, we initialized the encoder weights with pre-trained VGG16 [53] weights learned on ImageNet. The model was trained for 80 epochs on the UCSD datasets and 100 epochs on the Avenue dataset using the Adam optimizer [54] with β1 = 0.9 and β2 = 0.999. The models were trained end-to-end in the PyTorch [55] framework on a GeForce GTX 1080 Ti GPU, with 32 GB of RAM and an Intel E5-2697 CPU.

B. RESULTS AND ANALYSIS
1) COMPARISON WITH EXISTING APPROACHES
We compare our proposed method with several state-of-the-art video anomaly detection techniques trained with unsupervised learning approaches. The results of the comparison are summarized in Table 3. Compared to reconstruction- and prediction-based methods, as well as traditional shallow methods, our approach outperforms almost all recent methods in terms of AUC and EER. More specifically, the results for the UCSDPed1, UCSDPed2, and Avenue datasets surpass the recent frame prediction method ROADMAP [15] by 0.4%, 1.3%, and 0.7%, respectively. In addition, the examination of the three datasets revealed that MsMp-net performed better than the approach proposed by Liu et al. [14], who provided the preliminary findings on frame prediction methods. The EER values of our method on the UCSDPed2 and Avenue datasets, 22.2% and 6.6%, are among the best. The AUC and EER of our approach on UCSDPed1 were 83.8% and 22.2%, respectively, which outperforms all other approaches except that of Ganokratanaa et al. [44], who reached outstanding AUC and EER values of 94.9% and 5.2%, respectively, while producing worse AUC and EER values on both the UCSDPed2 and Avenue datasets. In particular, Ganokratanaa et al. used a novel feature collection scheme based on background removal and a dense inverse search optical flow algorithm, allowing their model to extract detailed target appearance and motion features; hence, they could identify the ambiguous abnormal object missed by our method in the UCSDPed2 dataset. On the other hand, their model's reliance on image patches could require more pre-processing time per frame. In Section IV-B.10, we compare the time complexity of our approach with the method proposed by Ganokratanaa et al. and show that our approach is slightly faster than theirs.

2) CASE STUDIES INVOLVING VISUALIZATION
This section illustrates the abnormality score curves of our suggested method, along with selected important frames containing either normal or anomalous events. As shown in Fig. 6, abnormal events are separated from regular events based on the abnormality score calculated in (19): the score is low for regular occurrences and high for unusual ones. The first eight frames of each video are discarded because all prediction-based techniques require these frames as the initial input to the model, which is a limitation of all frame prediction-based approaches.

3) QUALITATIVE RESULT
After training the model, we can also use the pixel-wise differences between the generated output and the ground-truth frame to create a prediction error map, allowing us to examine the error at every pixel and spot anomalous regions in the frame. For example, Fig. 8 illustrates abnormal regions found by our model on CUHK Avenue, UCSD Ped2, and UCSD Ped1. The left column shows input frames containing irregularity; the first two frames on the right show the ground truth and the generated frame, and the next two show the error maps and the abnormal regions superimposed on the frame. As Fig. 8 visualizes, regular regions are predicted almost perfectly, whereas abnormal regions are blurry or distorted.

4) THE EFFECT OF MEMORY SIZE
We used the UCSD-Ped2 dataset to examine the performance of the memory module by varying the memory size, i.e., the number of memory slots N, and reported the results in terms of AUC, as shown in Fig. 7. As the memory capacity increases, the AUC improves continuously until the memory size reaches 1000 slots; beyond 1000 slots, performance remains essentially unchanged. Meanwhile, a bigger memory would incur more computation; therefore, N was set to 1000 to achieve the best overall trade-off.

5) THE EFFECT OF MSMP-NET ON ANOMALY DETECTION
In this section, we investigate the functionality of the different MsMp-net components by including and excluding them, validating the effectiveness of their collaboration on all three datasets. As a baseline, we adopted the model from Liu et al. [14] with a 2D encoder and 2D decoder structure, trained it with the intensity, gradient, and DSSIM losses, without the optical flow loss or adversarial learning components, and named it the simplified 2D U-Net; we compare its performance with our model in terms of AUC. Table 4 presents the whole set of results. The significance of the upstream context module working together with the memory module is indicated by the AUC improvement: when the context module is included, the performance on the UCSDPed1, UCSDPed2, and Avenue datasets increases by 0.9%, 0.2%, and 0.4%, respectively. We therefore conclude that these two components are essential for memorizing multi-scale semantic features and must collaborate to improve detection results. Additionally, we demonstrate how the multi-path ConvGRU modules at multiple resolutions affect MsMp-net performance: the AUC increased by 2.6%, 1.8%, and 2.9% after adding the predictor to the design for the UCSDPed1, UCSDPed2, and Avenue datasets, respectively. Our findings suggest that leveraging recurrent neural networks to model temporal variation from spatial characteristics at multiple scales is highly effective. In summary, these modifications allowed the simplified 2D U-Net to improve enormously on all three datasets.

6) QUANTITATIVE COMPARISON FOR ANOMALY DETECTION
To demonstrate the superiority of our proposed architecture, we also measured the difference between the average score of abnormal frames and that of normal frames for each test set. The score gap $\Delta s$ on test set $D$ is defined as:

$$\Delta s = \frac{1}{|D_a|} \sum_{t \in D_a} S_{final}(t) - \frac{1}{|D_n|} \sum_{t \in D_n} S_{final}(t)$$

where $D_a$ and $D_n$ denote the sets of all abnormal and all normal frames in $D$, respectively, and $S_{final}(t)$ is the abnormality score of the frame at time $t$. A larger score gap indicates that the model can better distinguish anomalous patterns, resulting in greater detection performance. According to Table 5, our method produces comparatively wide score gaps compared to Liu et al. [14] as well as the earlier frame prediction methods proposed by [13] and [15]. We believe that introducing the multi-scale-memorizer module and the refined abnormality score is the primary factor in improving the discriminative power of our model.

FIGURE 6. Examples of per-frame anomaly scores of our model on the three datasets. The blue curve represents the estimated anomaly score; the red and white sections indicate ground-truth anomalous and normal frames, respectively. Bounding boxes highlight the abnormal events for visualization purposes.

7) EFFECT OF PROPOSED ABNORMALITY SCORE FUNCTION
Given that video frames are temporally continuous, we can refine the anomaly score of the current frame with the abnormality scores of its P temporally neighboring future frames. To better investigate the impact of the value of P on the detection result, we conducted experiments on the three datasets with varying P values. Fig. 9 shows the AUC scores for various P configurations over the three datasets. The slight improvement in AUC for Ped1 and Avenue is clear at P = 2 and stays roughly constant for P values greater than 2. For Ped2, the optimal value of P is about 4; however, after assessing the trade-off between performance and inference time, we fixed P to 2 for all datasets.
Consequently, this approach does not considerably delay the inference process. We also compared two scoring techniques: the standard scoring function, generated by normalizing the prediction error of the current frame, and our proposed score function, which additionally considers the weighted average over future frames. Score gaps were used alongside the AUC and EER metrics to highlight the quantitative effect of the refined score function. Table 6 shows that the suggested score function improved AUC and EER by about 0.3% and 0.4%, respectively, with a slight improvement in the score gap.

8) THE EFFECT OF MULTIPLE INPUT FRAMES ON PERFORMANCE AND RUNNING TIME
We conducted experiments to evaluate the effect of the number of input frames on running speed and model performance. Recent studies have employed different numbers of input frames, such as four frames [14], [18] or 16 frames [10], [22]. However, grouping too many frames may lead to over-fitting on small datasets; in contrast, supplying relatively few frames might lead to poor motion feature extraction, since the model cannot receive sufficient long-term temporal features. In Table 7, we provide the AUC score for the initial configuration T = 8 and its variations for stacked input values of 2, 4, 6, 10, 12, and 16 on the Avenue and UCSD datasets. In addition, the average runtime for different numbers of stacked input frames is reported, including the time required for both single-frame prediction and anomaly detection. Our investigation revealed that increasing the number of input frames results in only minor improvements on the three datasets.
On the other hand, when we set the number of input frames to 16, the model performed even worse on UCSDPed2 and UCSDPed1. We believe this is because stacking that many frames considerably reduces the number of training samples in the relatively small UCSD datasets. Finally, T was set to 8 in our experiments as an appropriate compromise between running speed and AUC performance.

9) ANALYSIS OF LOSS FUNCTION
As mentioned earlier, we designed the model to capture temporal dependency within the architecture and removed the time-consuming optical flow loss used in [14]. Instead, we apply an additional constraint from a different aspect with a new loss function, the DSSIM loss. As shown in Table 8, we trained our model with different loss-function combinations and then evaluated and compared the results based on AUC and score gap for the Avenue dataset. It is clear that including the optical flow loss in our framework would not improve detection results significantly. The added DSSIM constraint, on the other hand, produces the best performance, implying that a model constrained from more complementary aspects can capture the information better. Furthermore, as shown in Table 9, the time required to calculate $L_{int} + L_{gd} + L_{OF}$ was, on average, 0.0927 seconds per batch during training, which is considerably longer than the time required to calculate $L_{int} + L_{gd} + L_{DSSIM}$. The results indicate that the training process with the optical flow loss takes much longer, whereas training with the DSSIM loss takes less time and produces better results.

10) ANALYSIS ON TIME COMPLEXITY
The average running time, or frames-per-second (FPS) rate, is commonly employed to assess time complexity in the inference stage. In this section, we analyze the average runtime of MsMp-net and compare it with state-of-the-art methodologies, especially recent methods based on frame prediction. The average computational time includes both frame prediction and anomaly detection for a single frame and is calculated by dividing the sum of the running times by the number of frames in each test set. Moreover, the computational time is identical across the three datasets because we used the same MsMp-net model and input configuration. As these methods do not provide their original implementations, we report the results and environments from the respective papers. As illustrated in Table 10, the average inference time of MsMp-net is 0.0891 s (11 FPS), which is faster than DSC [60], MDT [32], and DSTN [44] and on par with or slightly faster than ROADMAP [15] and sRNN-AE [56].
Nevertheless, our approach has a longer average running time than [13], [14], [57], [58], and [59], likely due to the multi-path ConvGRU architecture with T = 8 time steps. However, in terms of AUC, we obtained better results than all these works, as indicated in Table 3. Based on the experiment in Table 7, if we set T to 4, we can achieve a real-time running speed of 24 FPS while losing only a small amount of AUC performance. For instance, by reducing T to 4, our model can outperform Msm-net [13] in terms of both AUC performance and running time. It can be concluded that the proposed MsMp-net has good overall performance for surveillance videos, given its remarkable accuracy and favorable running time. Furthermore, operating GPUs in parallel could alleviate the comparatively long inference time of our technique for real-time applications.

V. DISCUSSION
Our method memorizes multi-scale semantic features through the proposed multi-scale-memorizer module and uses multi-resolution features to fully mine the consistency of motion variation, which is very advantageous for addressing varied-size objects in the scene; however, our method still has several weaknesses that could be improved in future work. As seen in Fig. 8, when an anomalous target in the error map is relatively small, unwanted noise in the corresponding error map can damage performance and prevent the model from detecting that ambiguous or distant abnormality. Although the UCSDPed1 and Ped2 datasets share the same nature, noise in the error map is the primary cause of performance degradation on Ped1.
In the same direction, the refined anomaly scoring function was introduced to exploit inter-frame dependencies and help the model detect small abnormal objects entering the scene. Even though this method was able to address the issue in part, the presence of noise prevented significant improvements in the values obtained by the refined abnormality score. In subsequent work, we suggest strengthening our MsMp-net by adopting an attention-driven loss function that focuses on the objects in the scene during training, and an attention-driven scoring function that suppresses the influence of background noise at inference time. In addition, further research is needed to determine the best adaptive techniques for adjusting the hyperparameters, as the optimal model setup may not be appropriate for all potential datasets.

VI. CONCLUSION
This study introduced MsMp-net, a novel unsupervised framework for video anomaly detection based on frame prediction. The proposed model can be trained end-to-end without resorting to additional adversarial or optical flow losses. We built its multi-path predictor and added multi-scale low-level temporal characteristics to the design to model finer motion patterns, which is crucial for predicting the future frame. We also integrated the context module and the memory module into a multi-scale-memorizer module that extracts the multi-scale context of high-level normal patterns and simultaneously retrieves them from the memory slots.
Our comprehensive study shows that the proposed method achieves the highest AUC in Avenue and UCSDPed2 and fairly decent results in UCSDPed1.