Anomaly Event Detection in Security Surveillance Using Two-Stream Based Model

Anomaly event detection has been extensively researched in computer vision in recent years. Most conventional anomaly event detection methods leverage only single-modal cues and cannot exploit the complementary information carried by other modalities in videos. To address this issue, we propose a novel two-stream convolutional network model for anomaly detection in surveillance videos. Specifically, the proposed model consists of an RGB stream network and a Flow stream network, and the final anomaly event detection score is the fusion of the scores of the two networks. Furthermore, we consider two fusion settings: fusing two streams with the same number of layers and fusing two streams with different numbers of layers. The design insight is to fully leverage both the information within each stream and the complementary cues between the RGB and Flow streams. Two datasets (UCF-Crime and ShanghaiTech) are used to validate the effectiveness of the proposed solution.


Introduction
Security surveillance is increasingly deployed in public places such as streets, hospitals, intersections, shopping malls, and banks to guarantee public safety. However, the monitoring capacity of law enforcement agencies has not kept pace with this growth. Consequently, there are obvious deficiencies in how surveillance cameras are used. Anomaly event detection in surveillance videos is an important research topic in computer vision and has been widely used in many security-related scenarios, including traffic accident investigation, surveillance of crimes or illegal activities, forensic investigation, and violence alerting [1]. Because anomalous events rarely appear in real life, behavioral or appearance patterns that deviate from normal patterns are often defined as anomalies [1][2][3].
Most existing research on anomaly event detection focuses on the RGB modality when extracting video features. In this work, we propose a two-stream-based model that handles the anomaly event detection problem by using two convolutional neural networks (ConvNets), one for RGB and one for Flow, to extract video features. The RGB stream performs anomaly event detection from video frames, whilst the Flow stream is trained to detect anomalies from motion based on dense optical flow. Moreover, the proposed framework is able to utilize all frames in the video, while introducing almost no additional computation at inference compared to [13]. The main reason is that the final number of features used to train the anomaly model is the same. Specifically, each video is divided into several clips, and the features of all frames in a clip are averaged to obtain the clip-level feature.
There are noticeable advantages to our two-stream-based anomaly event detection. Instead of considering only RGB features for MIL models, in this work we propose TAEDM, a two-stream model that leverages information from both the RGB and Flow modalities. Specifically, the information from the RGB modality consists of the static features of still images, such as the color, shape, and appearance of objects or people in the event. The information from the Flow modality consists of the motion features of the event. As a result, TAEDM can fully capture the complementary information from still images (RGB stream) and from the motion between images (Flow stream) in a video. We evaluate the developed approach on two benchmark datasets of different scales, UCF-Crime [13] and ShanghaiTech [14]. Extensive ablation studies demonstrate that our model obtains state-of-the-art performance.
The main contributions of this work are summarized as follows: (i) A novel two-stream-based anomaly event detection model is proposed for anomaly detection in surveillance videos. Furthermore, a dense feature extraction method is proposed to obtain video-level features. (ii) The proposed models are tested on the benchmark datasets UCF-Crime [13] and ShanghaiTech [14], and the results on both datasets show better performance than existing works.
The rest of this paper is organized as follows: Section 2 reviews the state-of-the-art research in anomaly event detection. Section 3 proposes a two-stream-based anomaly event detection model. Experimental results are elaborated in Section 4 and further discussion is presented in Section 5. Section 6 concludes this paper.

Related Work
This section reviews the most recent research results related to anomaly event detection, covering anomaly detection, ranking, and two-stream action recognition.

Anomaly Detection.
In computer vision, anomaly event detection is one of the most challenging problems and has attracted considerable research effort in the past decades [15][16][17][18][19][20][21]. The commonly used detection methods can be roughly categorized into the following three groups. The first category of anomaly detection methods rests on the hypothesis that anomalies are rare, so behaviors that differ substantially from normal patterns are regarded as anomalous. In these methods, the regular patterns are encoded through various statistical models, such as Gaussian process based models [22,23], the social force model [24], Hidden Markov-based models [15,25], spatial-temporal Markov random field based models [26,27], and combinations of dynamic models [21], and anomalies are treated as outliers.
The second category of anomaly detection approaches is sparse reconstruction [3,14,16,28], which is utilized to learn usual patterns. Specifically, a dictionary for normal behavior is constructed by employing sparse representation, and the samples with high reconstruction error are detected as anomalies. Recently, with the promising breakthrough of deep learning, some researchers have constructed deep neural networks for anomaly detection, including video prediction learning [29] and abstraction feature learning [6,30,31]. The third group consists of hybrid methods that model both normal and anomalous behavior [13,32,33], in which multiple instance learning (MIL) is utilized under a weakly supervised setting to model motion patterns [13,33]. For example, Sultani et al. developed an MIL-based classifier [13] to detect anomalies, together with a deep ranking model to predict anomaly scores.
To leverage the strength of Sultani's work, which considers both normal and anomalous videos, in this work we rebuild the model using weakly labelled supervised learning.

Ranking.
Learning to rank is a popular research problem in machine learning, and many research efforts have been devoted to it, including [7,11,34-38]. These approaches aim at improving the relative scores of items rather than their individual scores. Rank-SVM [7] was proposed to enhance the retrieval performance of search engines. The algorithm proposed in [34] solves multiple-instance ranking problems through successive linear programming. This method has been utilized in computational chemistry to solve the hydrogen abstraction problem. More recently, researchers have proposed deep ranking networks for computer vision-related applications and achieved promising success, such as highlight detection [35], person reidentification [11], feature learning [36], Graphics Interchange Format (GIF) generation [37], face detection and verification [38], and metric learning and image retrieval [39]. All the above deep ranking approaches need extensive annotations of both positive and negative samples. In contrast, in this work, a ranking model is proposed by reformulating the anomaly detection problem as a regression problem under the ranking framework, based on both normal and anomalous samples. The proposed model uses MIL on weakly supervised data to train the anomaly model and localizes anomalies at the video segment level during testing. Unlike the conventional multiple instance learning (MIL) setting, the proposed ranking component enforces ranking only between the two segments with the highest anomaly scores in the negative and positive bags.

Two-Stream Action Recognition.
Video-based action recognition has been extensively researched and has attracted considerable attention recently. Among the existing methods, two-stream-based action recognition is superior [40][41][42]. Inspired by neuroscience, one family of action recognition methods introduced a two-stream neural network architecture [40][41][42] to perform RGB and Flow feature extraction in parallel. The final action classification score is obtained by fusing the results of the two paths. In order to further enhance action recognition performance, Wang et al. developed a novel Temporal Segment Network (TSN) [41], which focuses on modelling the long-range temporal structure in videos. Further, various extensions of the two-stream model [40] that explore convolutional fusion [42] and residual connections [43,44] were developed. The model in [43] established residual connections between the RGB and Flow streams. The STDDCN [44] integrated multiscale information into residual connections via dense-connectivity interaction and contained a new knowledge distillation module.
Two-stream-based methods have been widely employed in video tasks such as action recognition [40-43, 45, 46]. However, two-stream-based methods are rarely applied to anomaly event detection. Inspired by the two-stream-based action recognition architectures, which leverage the complementary information of the RGB and Flow modalities underlying actions, we design a novel two-stream anomaly event detection model. Compared with action recognition, anomaly event detection must both identify the kind of behavior (normal or abnormal) and locate the time range of the anomaly, which makes this problem more difficult to solve.

Two-Stream Anomaly Event Detection Network
This section details the proposed two-stream-based anomaly event detection model shown in Figure 1. Abnormal and normal videos are first divided into multiple temporal video clips, from which the two-stream features (RGB stream and Flow stream) are extracted. A fully connected neural network is then trained using a ranking loss function, which is computed on the highest-scoring instances (shown in blue), and the fusion operation is performed last.
A video clip can be naturally split into synchronous spatial and temporal parts. The spatial component, carried by the individual frame images, consists of scene and object information in the video. The temporal component, hidden in the motion across frames, carries the movement between the objects and the observer. We designed our anomaly detection model accordingly and decomposed it into two streams, as illustrated in Figure 1. Each stream is realized via a deep convolutional network (ConvNet), and the anomaly detection scores of the two streams are combined by late fusion.
In the proposed model, video segments that obtain high anomaly scores are marked as anomaly events. Each video is split into an equal number of nonoverlapping segments. A video containing an anomalous segment is labelled as positive, and a video without any anomalous segment is labelled as negative. A positive/negative video is treated as a positive/negative bag, and its segments are treated as instances in multiple instance learning. Through the ranking method, an anomaly score is obtained for each video segment, and the video segments with high anomaly scores are regarded as anomaly events.
Figure 1: Given the abnormal and normal videos, we first divide them into multiple temporal video clips. Second, we extract the two-stream features (RGB stream and Flow stream) of the video clips and train a fully connected neural network using a ranking loss function, which is computed on the highest-scoring instances (shown in blue). The fusion operation is performed in the last step.

Problem Formulation.
In the past decade, a number of pattern learning methods have been developed [10,15,19,25], most of which assume that any pattern violating the learned normal pattern is abnormal. In fact, it is impossible to define a full set of normal patterns, because normal patterns may contain too many different events and behaviors. Defining anomaly events is another challenge, since anomaly events may also contain many similar events and behaviors.
To handle the above issues, the proposed method formulates each anomaly detection task (RGB branch and Flow branch) as a regression problem, which is realized under the ranking framework by leveraging both normal and anomalous data. To achieve more precise segment-level labels, a weakly supervised deep multiple instance learning (MIL) ranking is employed. Specifically, weak supervision means that, during training, the model only knows whether a video contains an abnormal event, not the category of the anomaly event or the time at which it occurs. The proposed pattern learning method differs from those in [10,15,19,25] in that our model utilizes both normal and anomalous data, whereas those studies use only normal data. Furthermore, our model is formulated as a regression problem, which means that a segment is considered an abnormal event based on its regression prediction score rather than on whether its probability falls below a certain threshold.

Data Formulation.
To align the data with the deep MIL setting in anomaly detection, each source video is first split into an equal number of nonoverlapping segments during training. All segments of the same video are denoted as a bag, and each segment acts as an instance. The videos form two kinds of bags, positive and negative: the segments of an anomalous video are treated as a positive bag and those of a normal video as a negative bag. Moreover, as our insight is based on leveraging the complementary information of the RGB and Flow streams, the video clips in each bag are all decomposed into RGB and Flow components. Each kind of component is fed into the corresponding branch network separately.
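The bag construction described above can be sketched as follows. This is a minimal illustration, assuming per-frame feature vectors are already available as plain Python lists; the helper name `segment_features` and the integer boundary rule are assumptions for illustration, not details taken from the paper.

```python
from typing import List


def segment_features(frame_features: List[List[float]],
                     num_segments: int) -> List[List[float]]:
    """Split per-frame features into contiguous, nonoverlapping segments
    and average the frames inside each segment (one instance per segment)."""
    n_frames = len(frame_features)
    if num_segments <= 0 or n_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    dim = len(frame_features[0])
    bag = []
    for s in range(num_segments):
        # Integer boundaries chosen so every frame lands in exactly one segment.
        start = s * n_frames // num_segments
        end = (s + 1) * n_frames // num_segments
        chunk = frame_features[start:end]
        bag.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return bag


# Toy video: 6 frames with 2-D features, split into 3 segments.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
          [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
bag = segment_features(frames, 3)
```

The same routine would be applied separately to the RGB features and the Flow features of a video, yielding one bag per stream.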

Network Architecture.
The deep MIL framework includes two main branch deep MIL ranking networks, RGB and Flow, as shown in Figure 1. Each branch contains a feature extraction part and an instance scoring part. For feature extraction, ResNet [47] is chosen as the backbone because of its superiority in both efficiency and effectiveness.

Spatial Branch ConvNet.
The spatial branch ConvNet focuses on single video frames, effectively conducting anomaly event detection from still images. The static RGB stream by itself contains useful information, since some anomaly events are closely associated with specific objects. In fact, as will be reported in the experiments section, anomaly event detection from still images (the RGB anomaly event detection stream) is quite competitive by itself.
Security and Communication Networks

Temporal Branch ConvNet.
Unlike conventional ConvNet models, the input of the proposed temporal anomaly event detection stream is the stack of optical flow displacement fields among several adjacent frames. This input explicitly models the motion between video clip images, which makes anomaly event detection easier, as no implicit motion estimation is required. The dense optical flow is formed from a group of displacement vector fields $v_t$ between adjacent frames $t$ and $t + 1$. Further, $v_t(m, n)$ indicates the displacement vector at point $(m, n)$ in frame $t$, which represents the movement of point $(m, n)$ from frame $t$ to frame $t + 1$. Moreover, the displacement vector $v_t$ contains two components, a horizontal one and a vertical one, which are denoted $v_t^x$ and $v_t^y$, respectively. $v_t^x$ and $v_t^y$ are treated as image channels (as shown in Figure 2) and can be fed into the temporal anomaly event detection stream network.

Loss Function.
To pursue better performance, we employ the following loss function (adapted from [13]) to train each branch network:
$$L(W) = l(S_p, S_n) + \|W\|_F, \tag{1}$$
where $S_p$ and $S_n$ denote positive and negative bags, respectively, $l(S_p, S_n)$ indicates the loss over these two kinds of bags, and $\|W\|_F$ denotes the Frobenius-norm regularization on the weights of the model, which boosts its generalization. Here, $l(S_p, S_n)$ is defined as
$$l(S_p, S_n) = l_{rank} + \lambda_a l_{smooth} + \lambda_b l_{sparsity}, \tag{2}$$
in which $l_{rank}$ denotes the ranking loss, $l_{smooth}$ denotes the temporal smoothness constraint, and $l_{sparsity}$ represents the sparsity constraint. $\lambda_a$ and $\lambda_b$ are two hyper-parameters which balance the strengths of the corresponding terms. Among them, $l_{rank}$ is formulated as
$$l_{rank} = \max\left(0,\; 1 - \max_{i \in S_p} f(C_a^i) + \max_{i \in S_n} f(C_n^i)\right), \tag{3}$$
where $S_p$ and $S_n$ share the same meanings as in equation (1), $C_n$ and $C_a$ indicate normal and anomalous video instances, and $f(C_a^i)$ and $f(C_n^i)$ denote the predicted scores for the corresponding video instances. $l_{rank}$ enforces ranking only between the two segments with the highest anomaly scores in the negative and positive bags, rather than between every pair of segments in the bags. Thus, the max operation is performed over all instances in each bag. The reason for this setting is the absence of video segment-level annotations in the anomaly event detection task.
The $l_{rank}$ loss is well suited to the anomaly detection task for two reasons. First, it forces the anomalous video segments to achieve higher anomaly scores than normal ones. Second, it separates the positive instances from the negative instances based on the anomaly score.
On the other hand, $l_{smooth}$ and $l_{sparsity}$ are defined as
$$l_{smooth} = \sum_{i=1}^{n-1} \left(f(C_a^i) - f(C_a^{i+1})\right)^2, \qquad l_{sparsity} = \sum_{i=1}^{n} f(C_a^i), \tag{4}$$
where $n$ is the number of instances in the specific bag. $l_{smooth}$ is utilized to guarantee temporal smoothness by minimizing the difference in anomaly scores between neighboring video instances in a bag. $l_{sparsity}$ is employed to enforce the sparsity of scores in the anomalous bag. The reason for introducing the $l_{sparsity}$ loss is that only a few segments may involve the anomaly event.
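To make the three loss terms concrete, the following is a minimal sketch of the bag-level loss for one positive bag and one negative bag. It assumes the hinge form with margin 1 used in [13]; the function name `mil_ranking_loss` and the default values of `lambda_a` and `lambda_b` are illustrative placeholders, not values reported in this paper. The weight regularization term is omitted because it depends on the network parameters rather than on the scores.

```python
from typing import Sequence


def mil_ranking_loss(pos_scores: Sequence[float],
                     neg_scores: Sequence[float],
                     lambda_a: float = 1e-4,   # placeholder weight (assumption)
                     lambda_b: float = 1e-4,   # placeholder weight (assumption)
                     margin: float = 1.0) -> float:
    """Bag-level loss: ranking hinge + temporal smoothness + sparsity.

    pos_scores: predicted anomaly scores of the segments in an anomalous bag,
                in temporal order.
    neg_scores: predicted anomaly scores of the segments in a normal bag.
    """
    # Ranking term: compare only the highest-scoring instance of each bag.
    l_rank = max(0.0, margin - max(pos_scores) + max(neg_scores))
    # Smoothness term: squared differences between neighboring segments.
    l_smooth = sum((a - b) ** 2 for a, b in zip(pos_scores, pos_scores[1:]))
    # Sparsity term: sum of scores in the anomalous bag.
    l_sparsity = sum(pos_scores)
    return l_rank + lambda_a * l_smooth + lambda_b * l_sparsity


# Toy bags with three segments each.
pos = [0.1, 0.9, 0.2]
neg = [0.3, 0.2, 0.1]
loss_rank_only = mil_ranking_loss(pos, neg, lambda_a=0.0, lambda_b=0.0)
loss_full = mil_ranking_loss(pos, neg, lambda_a=1.0, lambda_b=1.0)
```

In a training loop this quantity would be averaged over pairs of positive and negative bags and minimized separately for the RGB branch and the Flow branch.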

Experiments
In this section, we describe our experiments in detail, covering the datasets, implementation details, evaluation metrics, and quantitative and qualitative results.
UCF-Crime [13] and ShanghaiTech [14] are two popular benchmark datasets commonly used in the anomaly detection task. In this work, we use both datasets to validate the superiority of our proposed anomaly event detection model. With the following steps, the proposed model can also support other datasets: (1) extract the RGB images and optical flow images of each video in the dataset; (2) extract their corresponding features; and (3) feed both the RGB and Flow stream features into the corresponding branch subnetworks of the proposed model for training and for obtaining the expected test results. Before introducing the details, we first briefly introduce the two benchmark datasets as follows.
ShanghaiTech [14] is a medium-scale dataset with a total of 437 videos, which contains 130 abnormal events in 13 scenes. This dataset cannot be utilized directly to perform anomaly event detection because the training set has no abnormal videos. To tackle this problem, Zhong et al. [48] rebuilt the dataset by randomly choosing abnormal test videos and putting them into the training data, and vice versa. Both the training and test sets still contain all 13 scenes. This new organization of the dataset makes it suitable for the anomaly event detection task. Thus, we perform the same operation as in [48] before executing the experiments.

Implementation Details and Evaluation Metrics.
To implement the proposed model, we first extract features of the RGB and Flow images from the last fully connected (FC) layer of the ResNet network [47]. For the RGB stream, ResNet features are computed for every frame. The video segment-level feature is obtained by averaging all frame features in the corresponding video segment. For the Flow stream, features are extracted in the same way as for the RGB stream. The only difference between these two streams is that each frame in the Flow stream contains two directional flow images, namely the horizontal ($v_t^x$) and vertical ($v_t^y$) images as stated above, which ResNet cannot process directly. Specifically, $v_t^x$ and $v_t^y$ are both grayscale images with only one channel (their concatenation has only two channels), while the input of the feature extraction network (ResNet) needs three channels. To handle this problem, we concatenate the two directional flow images with their average to form a three-channel flow input for the feature extraction network.
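The three-channel flow input described above can be sketched as follows. This is a minimal illustration on nested Python lists, assuming the two flow images have already been computed and have the same size; the helper name `flow_to_three_channels` and the channel-first layout are assumptions for illustration.

```python
from typing import List

Image = List[List[float]]  # one grayscale channel, indexed as [row][column]


def flow_to_three_channels(vx: Image, vy: Image) -> List[Image]:
    """Stack horizontal flow, vertical flow, and their pixel-wise average
    into a 3-channel input in [channel][row][column] order."""
    if len(vx) != len(vy) or any(len(rx) != len(ry) for rx, ry in zip(vx, vy)):
        raise ValueError("vx and vy must have the same shape")
    # Third channel: pixel-wise mean of the two flow components.
    avg = [[(a + b) / 2.0 for a, b in zip(rx, ry)] for rx, ry in zip(vx, vy)]
    return [vx, vy, avg]


# Toy 2x2 flow fields.
vx = [[0.0, 2.0], [4.0, 6.0]]
vy = [[2.0, 2.0], [0.0, -6.0]]
flow_input = flow_to_three_channels(vx, vy)
```

With array libraries the same operation is a single stack along the channel axis; the list version is used here only to keep the sketch dependency-free.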
After obtaining the segment-level RGB and Flow ResNet features, we feed them (2048D) into a three-layer FC neural network as in [13]. Further, the Adagrad optimizer is utilized with an initial learning rate of 0.001. To perform a fair comparison, the smoothness constraint, the sparsity constraint, and the number of segments per video are the same as those of [13]. We stop training at 20,000 iterations. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are adopted as evaluation metrics to validate the performance of our model. We use ROC and AUC because they are two popular metrics for anomaly event detection tasks [13,21,48], which allows a fair comparison with other works.
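As a reminder of what the AUC metric measures, the sketch below computes it directly from labels and scores using its pairwise interpretation: the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name `roc_auc` and the toy data are illustrative; this is not the evaluation code used in the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive
    (label 1) and negative (label 0) scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # anomalous sample ranked above normal sample
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 marks an anomalous sample, 0 a normal one.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

The quadratic loop is fine for small examples; for large test sets a sorting-based implementation or an existing library routine would be preferable.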

Evaluation of the Proposed Model.
To validate the performance of the proposed method, we compare its results with those of state-of-the-art models on UCF-Crime [13] and ShanghaiTech [14]. The ROC curves are compared in Figure 3, where RGB, Flow, and Two denote the anomaly event detection results of the models based on the RGB stream network, the Flow stream network, and their fusion, respectively. Figure 3 illustrates that RGB, Flow, and Two obtain better results than the other models, validating that the dense feature extraction is effective. Further, Two yields better results than RGB and Flow, which verifies the superiority of the proposed model. The AUC results of the different models on UCF-Crime [13] and ShanghaiTech [14] are displayed in Tables 1 and 2, respectively. These results are consistent with those of Figure 3, which further validates the effectiveness of our model.

Ablation Studies.
In this section, several ablation studies are designed to demonstrate the effectiveness of the proposed model.

(1) Evaluation of the Generalization Capacity of the Model.
To validate the generalization of the proposed method, we present its results with backbones of different depths and architectures, namely ResNet50, ResNet100, ResNet150, and VGG16, as shown in Tables 3 and 4. The results in Tables 3 and 4 illustrate that model Two achieves better results than the corresponding RGB and Flow models in all cases, which verifies the generalization capacity of the proposed model in terms of model depth and architecture.
Additionally, the ROC curves of ResNet50, ResNet100, ResNet150, and VGG16 are exhibited in Figures 4 and 5. Among them, Figures 4(a)-4(c) and 5(a)-5(c) report the ROC curves of the RGB, Flow, and Two networks for the ResNet50, ResNet100, ResNet150, and VGG16 models. Figures 4 and 5 further validate the generalization capacity of our method with respect to model depth and architecture.
(2) Evaluation of the Fusion of Two Streams. As stated above, different backbone feature extraction models are employed to assess the proposed method. This naturally raises two evaluations: the fusion of two streams with the same number of layers and the fusion of two streams with different numbers of layers.
(1) Fusion of two streams with the same number of layers: To validate the effectiveness of fusing two streams with the same number of layers, we utilize an identical network (ResNet50, ResNet100, ResNet150, or VGG16) to perform both RGB and Flow stream feature extraction. Comparison results are presented in Tables 1 and 2. Tables 1 and 2 show that model Two obtains uniformly better results than the corresponding RGB and Flow models, which illustrates the effectiveness of the proposed method under the same-layer fusion setting (dubbed the $\text{Fusion}_{same}$ setting).
(2) Fusion of two streams with different numbers of layers: Tables 5 and 6 illustrate that the performance of model Two is consistently superior to that of the corresponding RGB and Flow models, which validates the superiority of the proposed method under the different-layer fusion setting (dubbed the $\text{Fusion}_{dif}$ setting). Further, an appealing conclusion can be drawn that $\text{Fusion}_{dif}$ surpasses $\text{Fusion}_{same}$ in most cases. In addition, the combination of an RGB stream with ResNet50 and a Flow stream with VGG16 yields our best anomaly event detection results, which again verifies the effectiveness of fusion across different model architectures and depths.
(3) Evaluation of Fusion Proportion. This paper obtains the final anomaly detection score by fusing the two streams via the following equation: $\text{Score} = \beta \cdot \text{Score}_{Flow} + (1 - \beta) \cdot \text{Score}_{RGB}$. To validate the effect of the fusion proportion $\beta$ between the two streams, we perform anomaly event detection with fusion proportions ranging from 0.1 to 0.9 with a step size of 0.1.
(1) Results of fusion proportion with the same number of layers: Figure 6 reports the results of $\text{Fusion}_{same}$ with different fusion proportions. From Figure 6, we can see that each ResNet backbone (ResNet50, ResNet100, and ResNet150) obtains similar fused anomaly detection results under different fusion proportions, with differences ranging from 0. In other words, whichever stream performs better, the fusion result improves as the proportion of that stream increases. The proportion for the stream with better results is 0.8 in most cases.
Figure 8 shows that, as the fusion proportion $\beta$ increases, almost all AUC curves in the four subfigures increase monotonically, while three curves decrease almost monotonically. The reason is that the Flow streams of the increasing curves obtain higher anomaly detection scores than their corresponding RGB streams, whereas the Flow streams of the decreasing curves are worse than or comparable to their corresponding RGB streams. On this dataset, we reach the similar conclusion that, for the stream with the better result, the fusion result improves as its proportion increases. The proportion for the stream with better results is 0.9 in most cases.
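The late fusion rule and the sweep over the fusion proportion can be sketched as follows. This is a minimal illustration, assuming each stream has already produced one anomaly score per video segment; the function name `fuse_scores` and the toy scores are hypothetical.

```python
from typing import List, Sequence


def fuse_scores(score_rgb: Sequence[float],
                score_flow: Sequence[float],
                beta: float) -> List[float]:
    """Per-segment late fusion: beta * Flow score + (1 - beta) * RGB score."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(score_rgb) != len(score_flow):
        raise ValueError("both streams must score the same segments")
    return [beta * f + (1.0 - beta) * r for r, f in zip(score_rgb, score_flow)]


# Fusion proportions from 0.1 to 0.9 with step size 0.1.
betas = [round(0.1 * k, 1) for k in range(1, 10)]

# Toy per-segment scores from the two streams.
rgb = [0.2, 0.8, 0.5]
flow = [0.6, 0.4, 0.5]
fused_by_beta = {beta: fuse_scores(rgb, flow, beta) for beta in betas}
```

Each fused score list would then be evaluated with the AUC metric to produce one point per value of beta on curves such as those in Figures 6 and 8.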

Qualitative Results.
Figure 3: ROC curves of different models on UCF-Crime [13] and ShanghaiTech [14]. (a) Results on UCF-Crime [13]. (b) Results on ShanghaiTech [14].
To provide a more intuitive perception of the proposed model, we present the per-segment anomaly scores of a video obtained via our method. Meanwhile, the corresponding events with the highest or lower abnormal event scores in the video are also presented, with results shown in Figure 9 for UCF-Crime and Figure 10 for ShanghaiTech. Specifically, three example events are displayed in each of Figures 9 and 10. The first row of Figures 9 and 10 shows the visualization results obtained by our best model variant, and the green blocks in the gray rectangle in Figure 9 or purple rectangles in

Discussion
It is noted that the RGB stream focuses on appearance information, while the Flow stream concentrates on the motion cues underlying a video. The fusion of these two streams with the same number of layers boosts anomaly event detection performance effectively, as Tables 1 and 2 show. The reason is that this fusion leverages the complementary spatiotemporal information at the same scale underlying the videos. In addition, the fusion of two streams with different numbers of layers achieves better results than same-layer fusion. The reason is that different-layer fusion not only utilizes the complementary information between the two streams but also leverages the multiscale information at different layers, as Tables 5 and 6 show. Thus, fusing the RGB and Flow streams is beneficial for the anomaly event detection task. The benefit of our proposed solution is that it can significantly improve the performance of anomaly event detection by leveraging the complementary information of the RGB and Flow streams.
Figure 9: The visualization results of our method on testing videos from UCF-Crime. The first row shows the visualization results obtained by our best model variant, and the green block in the gray rectangle represents the ground-truth time period in which the anomaly event occurred. The second row presents the visualization results of different variants of our model, including results of ResNet50, ResNet50, ResNet150, and the best model variants, respectively.

Conclusion
This paper proposes a novel two-stream-based model for anomaly event detection. Specifically, the model consists of two branch networks, RGB and Flow, and the final anomaly detection score is the fusion of the scores of the two networks. We consider two fusion strategies: fusing two streams with the same number of layers and fusing two streams with different numbers of layers. The proposed model can utilize the complementary information of the two streams hidden in the video, which improves the performance of anomaly event detection. Ablation studies on the two benchmark datasets UCF-Crime and ShanghaiTech have validated the effectiveness of the proposed model. Future work should focus more on effective feature extraction methods for improved anomaly event detection using new inputs [49] in edge computing environments [50][51][52].

Conflicts of Interest
The authors declare no conflicts of interest.