Histogram of Maximal Optical Flow Projection for Abnormal Events Detection in Crowded Scenes

Abnormal events detection plays an important role in video surveillance and remains a challenging subject in intelligent detection. In this paper, we propose an algorithm to detect abnormal events in crowded scenes based on a novel motion feature descriptor, the histogram of maximal optical flow projection (HMOFP). After the HMOFP features of the training frames are extracted, a one-class support vector machine (SVM) is used to detect abnormality in the testing frames. Experiments on several benchmark datasets show that, compared with other methods based on optical flow, our algorithm is effective and gives satisfactory results.


Introduction
Nowadays, more and more surveillance cameras are used in public places, and behavior analysis in crowded scenes [1][2][3][4][5] has become increasingly important for public safety. To eliminate the world representation layer, which can be a significant source of modeling errors, an approach that models directly at the pixel level was described in [6]. In [7,8], the social force model was used for abnormal crowd behavior detection. In [9,10], a social attribute-aware force model was proposed, which takes the social characteristics of crowd behavior into account in order to improve performance on interaction behavior in the crowd.
In [11], SIFT features were extracted for the Bag of Words (BoW) model with the Spatial Pyramid Matching (SPM) kernel, and an SVM classifier was then used for cross-scene abnormal events detection. In [12], based on the observation that abnormal events occur rarely while frequently occurring events are normal in general human perception, proximity clustering for abnormal events detection in video sequences was proposed. In [13], for the case where labeled information about normal events is limited and information about abnormal events is not available, a projection subspace associated with detectors was discovered by using both labeled and unlabeled segments. In wireless sensor networks, it has been observed that most abnormal events are not transient but persist over a considerable period of time; a technique for handling data in a segment-based manner was therefore introduced in [14]. Without using any tracking or motion features, a feature extraction and events detection method was presented in [15], where features were extracted from foreground blobs and then used in SVM based models for real-time events detection.
Unlike most existing approaches to abnormal events detection, sparse representation based approaches have attracted many researchers in recent years. In [16], a method to detect abnormal events by sparse subspace clustering was proposed. In [17,18], a model based on optical flow was described, which uses the sparse reconstruction cost (SRC) over a normal dictionary to measure the normalness of the tested samples. Optical flow is the approximate motion vector at each pixel location, which can reflect the relative distances of moving objects. It is therefore important and useful in video surveillance and abnormal events detection. Other methods based on the histogram of optical flow were described in [19][20][21]; that descriptor is improved and used in this paper. Although the above approaches can detect abnormal events successfully, they are limited in some respects: some models are complicated to build, and others take a long time in the detection process. We therefore propose a novel detection model for crowded scenes that is relatively simple and fast to compute. Similar to the approach introduced in [21], our algorithm is mainly based on a suitable processing method in the optical flow field.
The rest of the paper is organized as follows. Section 2 presents how the motion features are acquired. Section 3 reviews the theory of one-class SVM. Section 4 introduces the abnormal events detection algorithm in detail. Section 5 presents our experimental results. Finally, conclusions are given in Section 6.

Motion Feature Extraction
The optical flow field describes the apparent movement on the surface of grayscale images and reflects the movement information between two consecutive frames. Optical flow provides the direction and amplitude of the moving objects in a scene, which can describe the behavior of people very well. Optical flow is derived from the following basic equation:
$$I_x u + I_y v + I_t = 0, \tag{1}$$
where $I_x$, $I_y$, and $I_t$ are the partial derivatives of the image grayscale value along the $x$, $y$, and $t$ dimensions, respectively; $u$ and $v$ are the horizontal ($x$ dimension) and vertical ($y$ dimension) components of the optical flow. Equation (1) is an ill-posed problem, since it is a single equation in the two unknowns $u$ and $v$. In [22], Horn and Schunck proposed an algorithm, known as the HS algorithm, that computes the optical flow by introducing a global smoothness constraint, which amounts to an additional condition on $\nabla^2 u$ and $\nabla^2 v$, the Laplacians of $u$ and $v$. The optical flow problem can then be formulated as follows:
$$\min_{u,v} \iint \left[ \left( I_x u + I_y v + I_t \right)^2 + \lambda \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx\,dy,$$
where $\lambda$ is the parameter that weights the regularization term. The corresponding Euler-Lagrange equations are solved with the Gauss-Seidel method, which gives an iterative scheme for computing the optical flow:
$$u^{k+1} = \bar{u}^{k} - \frac{I_x \left( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right)}{\lambda + I_x^2 + I_y^2}, \qquad v^{k+1} = \bar{v}^{k} - \frac{I_y \left( I_x \bar{u}^{k} + I_y \bar{v}^{k} + I_t \right)}{\lambda + I_x^2 + I_y^2},$$
where $\bar{u}$ and $\bar{v}$ are the weighted averages of $u$ and $v$, respectively, calculated in a neighborhood around the pixel location, and $k$ denotes the iteration number.
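As a minimal sketch of the iterative update described above, the following plain-Python function applies one Horn-Schunck step at a single pixel: the neighborhood averages of the flow are corrected along the image gradient by the residual of the optical flow constraint. The names `hs_update`, `constraint_residual`, and `lam` (the weight of the smoothness term) are illustrative assumptions, not taken from the paper, and the formula is the standard Horn-Schunck update rather than a confirmed transcription of the authors' implementation.

```python
def hs_update(ix, iy, it, u_avg, v_avg, lam):
    """One Horn-Schunck update at a single pixel.

    ix, iy, it   : partial derivatives of the image intensity (x, y, time)
    u_avg, v_avg : neighborhood averages of the current flow estimate
    lam          : weight of the smoothness (regularization) term (assumed name)
    """
    residual = ix * u_avg + iy * v_avg + it
    denom = lam + ix * ix + iy * iy
    if denom == 0.0:
        # No gradient and no regularization: keep the averaged flow.
        return u_avg, v_avg
    u = u_avg - ix * residual / denom
    v = v_avg - iy * residual / denom
    return u, v


def constraint_residual(ix, iy, it, u, v):
    """Left-hand side of the optical flow constraint ix*u + iy*v + it."""
    return ix * u + iy * v + it


# Illustrative values only: one update starting from a zero flow estimate.
u, v = hs_update(ix=2.0, iy=1.0, it=-3.0, u_avg=0.0, v_avg=0.0, lam=1.0)
print(u, v, constraint_residual(2.0, 1.0, -3.0, u, v))
```

After one update the constraint residual is scaled by `lam / (lam + ix**2 + iy**2)`, so a smaller `lam` enforces the brightness constraint more strictly, while a larger `lam` keeps the flow closer to its neighborhood average.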
In this paper, we propose a novel motion feature descriptor, called histogram of maximal optical flow projection (HMOFP). Figure 1 briefly shows the process for computing the HMOFP.
As shown in Figure 2, the optical flow field of frame $t$ is divided into $n$ overlapping image patches, each containing $m \times m$ pixels. The optical flow in each patch is then processed as follows. The range 0° to 360° is segmented into $p$ bins. Within an image patch, the optical flow vector of each pixel is assigned to a bin according to its direction, so each bin may contain several optical flow vectors. We project all optical flow vectors in the same bin onto the angle bisector of this bin, and the maximal projection is selected as the feature descriptor of the bin. For example, in Figure 3(a), two vectors $\vec{a}_1$ and $\vec{a}_2$ fall into the first bin. The projection of $\vec{a}_2$ is longer than that of $\vec{a}_1$, so the length of the projection of $\vec{a}_2$ is selected as the feature descriptor of the first bin. After processing all $n$ patches, we obtain the feature descriptor vector of each image patch,
$$h_i = [h_{i,1}, h_{i,2}, \ldots, h_{i,p}], \quad 1 \le i \le n.$$
For the $i$th patch, $h_{i,j}$, $1 \le i \le n$, $1 \le j \le p$, denotes the maximal amplitude among all projection vectors in the $j$th bin. As shown in Figure 3(b), we take the concatenation of the $n$ feature descriptor vectors, denoted $H_t$, as the global HMOFP feature of frame $t$.
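The descriptor described above can be sketched in plain Python as follows. This is an illustrative reading of the text, not the authors' code: the function names, the choice to give empty bins the value 0, the handling of zero vectors, and the rule that only patches lying entirely inside the frame are used are all assumptions. Angles are measured with `atan2(v, u)`, ignoring the downward-pointing vertical axis of image coordinates.

```python
import math


def hmofp_patch(flow_vectors, num_bins):
    """Per direction bin, the maximal projection length of the flow vectors
    in that bin onto the bin's angle bisector (0.0 for an empty bin)."""
    bin_width = 2.0 * math.pi / num_bins
    hist = [0.0] * num_bins
    for u, v in flow_vectors:
        if u == 0.0 and v == 0.0:
            continue  # a zero vector has no direction
        angle = math.atan2(v, u) % (2.0 * math.pi)
        b = min(int(angle / bin_width), num_bins - 1)  # guard rounding at 2*pi
        bisector = (b + 0.5) * bin_width
        projection = math.hypot(u, v) * math.cos(angle - bisector)
        hist[b] = max(hist[b], projection)
    return hist


def patch_origins(length, patch, step):
    """Top-left coordinates of patches that fit entirely along one axis."""
    return list(range(0, length - patch + 1, step))


def hmofp_frame(u, v, patch, step, num_bins):
    """Concatenate the per-patch histograms of a flow field.

    u, v are 2-D lists (rows x cols) holding the flow components."""
    rows, cols = len(u), len(u[0])
    feature = []
    for r0 in patch_origins(rows, patch, step):
        for c0 in patch_origins(cols, patch, step):
            vectors = [(u[r][c], v[r][c])
                       for r in range(r0, r0 + patch)
                       for c in range(c0, c0 + patch)]
            feature.extend(hmofp_patch(vectors, num_bins))
    return feature


# Two vectors in the first of 8 bins; the second lies on the bisector (22.5 deg).
vectors = [(1.0, 0.0), (2.0 * math.cos(math.pi / 8), 2.0 * math.sin(math.pi / 8))]
print(hmofp_patch(vectors, 8)[0])  # the longer projection, about 2.0
```

With a step of half the patch size, `hmofp_frame` yields the 50% overlap between neighboring patches; the returned list has one entry per bin per patch.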
To describe a crowd scene well, sufficient crowd movement information is required. On the other hand, to distinguish two different scenes, they must be compared in detail and the useless information in them should be eliminated. In the classification process, overlapping block division increases the number of significant motion features in two different frames, which makes the frames more distinguishable. It is therefore adopted in our algorithm so that the optical flow information is fully used. Moreover, to describe the motion of a crowd, we need two factors: explicit directions and the moving distance along each direction. Segmenting the $2\pi$ space into bins provides ample information to describe the directions of moving people. To make the direction in each bin unique, we select the angle bisector as the reference direction. Since there may be far more than one optical flow vector in each bin, we select the maximal vector projection, rather than the sum of all the vector projections on the bisector, as the motion feature descriptor in order to enhance the distinction between the normal scene and the abnormal scene. If we ignore the background area, the amplitudes of the motion vectors in a normal frame are very small, whereas the motion vectors in the abnormal area of an abnormal frame are large. Usually, the normal motion vectors far outnumber those of the abnormal area. If we used the sum of all projections on the angle bisector as the feature descriptor of each bin, the accumulation of the many small motion vectors in the normal frame could mask the few large motion vectors in the abnormal frame; that is, the sum of all projections on the angle bisector in each bin of the normal frame is likely to be close to that of the abnormal frame. Thus, to improve the distinguishability between abnormal and normal frames, we select the maximal projection as the feature descriptor of each bin, as demonstrated in Figure 3.
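The argument for the maximum over the sum can be illustrated with a toy calculation. The projection lengths below are made-up numbers chosen only to show the effect; they do not come from the paper's data, and the function names are illustrative.

```python
# Hypothetical projection lengths onto the bisector of one bin (made-up values).
normal_bin = [0.1] * 200   # many slowly moving pedestrians
abnormal_bin = [4.0] * 5   # a few people running


def sum_descriptor(projections):
    """Sum of all projections in the bin (the alternative that is rejected)."""
    return sum(projections)


def max_descriptor(projections):
    """Maximal projection in the bin (the HMOFP choice); 0.0 for an empty bin."""
    return max(projections, default=0.0)


# The sums are nearly identical, so the sum cannot separate the two frames,
# whereas the maxima differ by a large factor.
print(sum_descriptor(normal_bin), sum_descriptor(abnormal_bin))
print(max_descriptor(normal_bin), max_descriptor(abnormal_bin))
```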

One-Class SVM
SVM was initiated by Vapnik and Lerner [23]. Since kernel methods were introduced, SVM has been applied extensively to nonlinear classification problems [24][25][26]. In the one-class classification problem, the essential task is to determine a boundary, that is, an appropriate region in the data space $\mathcal{X}$ that contains most of the samples drawn from an unknown probability distribution. This goal can be realized by searching for an optimal decision hyperplane in the feature space, which is a Hilbert space $\mathcal{H}$. This hyperplane maximizes its distance from the origin, while only a small part of the data falls between the hyperplane and the origin [27]. The relationship between $\mathcal{X}$ and $\mathcal{H}$ is shown in Figure 4. The one-class SVM problem can be presented as an optimization model:
$$\min_{\mathbf{w}, \xi, \rho} \; \frac{1}{2}\|\mathbf{w}\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i - \rho \quad \text{subject to} \quad \mathbf{w}^T \Phi(\mathbf{x}_i) \ge \rho - \xi_i, \;\; \xi_i \ge 0,$$
where $\mathbf{x}_i \in \mathcal{X}$, $i \in [1 \cdots N]$, are training samples in the input data space $\mathcal{X}$ and $\Phi : \mathcal{X} \rightarrow \mathcal{H}$ maps a vector $\mathbf{x}_i$ into the feature space $\mathcal{H}$. $\mathbf{w}^T \Phi(\mathbf{x}_i) - \rho = 0$ is the decision hyperplane. $\xi_i$ is the slack variable for penalizing the outliers. $\nu \in (0, 1]$ is the hyperparameter that weights the slack variables and tunes the number of acceptable outliers. The mapping function $\Phi$ provides a way to solve the nonlinear classification problem in the space $\mathcal{X}$ by a linear solution in the space $\mathcal{H}$. By calculating the dot product in $\mathcal{H}$, the kernel function is defined as $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \Phi^T(\mathbf{x}_i)\Phi(\mathbf{x}_j)$. The decision function in the space $\mathcal{X}$ with Lagrangian multipliers $\alpha_i$ is defined as
$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i \, \kappa(\mathbf{x}_i, \mathbf{x}) - \rho \right).$$
According to [28], with appropriately selected parameters, polynomial and sigmoid kernels give results similar to those of the Gaussian kernel. We choose the Gaussian kernel in our algorithm. This kernel is defined as
$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right),$$
where $\mathbf{x}_i$ and $\mathbf{x}_j$ belong to the space $\mathcal{X}$ and $\sigma$ is the scale factor at which the data should be clustered. In our method, one-class SVM is used as follows. First, the training set is used to establish a model, which determines an appropriate boundary in the data space. New incoming frames are then classified by the following rule: if the HMOFP feature of the testing frame falls inside the boundary, the frame is classified as normal; otherwise, it is abnormal.
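The decision rule can be made concrete with a small sketch that evaluates the Gaussian kernel and the sign of the kernel expansion for a given set of support vectors. The support vectors, multipliers, offset `rho`, and scale `sigma` below are hypothetical values chosen for illustration; in practice they result from training. Returning 1 when the score is exactly zero is likewise an assumed convention.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision(x, support_vectors, alphas, rho, sigma):
    """Return 1 (normal) if x falls inside the learned boundary, else -1."""
    score = sum(alpha * gaussian_kernel(sv, x, sigma)
                for sv, alpha in zip(support_vectors, alphas))
    return 1 if score - rho >= 0.0 else -1


# Hypothetical trained model (illustrative values, not from the paper).
support_vectors = [(0.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]
rho = 0.5
sigma = 1.0

print(decision((0.5, 0.0), support_vectors, alphas, rho, sigma))  # close to the training data
print(decision((5.0, 5.0), support_vectors, alphas, rho, sigma))  # far from the training data
```

A library implementation such as scikit-learn's `OneClassSVM` with `kernel="rbf"` could be used to learn the multipliers and the offset; its `gamma` parameter corresponds to $1/(2\sigma^2)$ under the kernel definition used here.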

Abnormal Events Detection
In this section, an algorithm for abnormal events detection in surveillance video is described in detail. Suppose that, for a given scene, there is a set of training frames $[I_1, \ldots, I_N]$, which describe the normal behavior of the crowd. The general procedure for abnormal events detection based on the histogram of maximal optical flow projection (HMOFP) is as follows.
Step 1. Calculate the optical flow, that is, $[\mathrm{OP}_1, \ldots, \mathrm{OP}_{N-1}]$, by the HS method at each pixel of the first $N-1$ frames:
$$[I_1, \ldots, I_N] \rightarrow [\mathrm{OP}_1, \ldots, \mathrm{OP}_{N-1}], \quad \mathrm{OP}_i \in \mathbb{R}^{a \times b \times 2}, \tag{8}$$
where $a \times b$ is the size of the frame image and $N$ is the number of frames in the training set. The optical flow is computed from two consecutive frames and is attributed to the first frame of each pair, so the maximal subscript on the right-hand side of (8) is $N-1$.
Step 2. Extract the motion features of the first $N-1$ training frames to obtain their HMOFP feature vectors, denoted as the set $[H_1, \ldots, H_{N-1}]^T$.
Step 3. Based on the HMOFP features, one-class SVM is used to calculate the optimal boundary of the set $[H_1, \ldots, H_{N-1}]^T$, which corresponds to the set of support vectors or the optimal hyperplane in the feature space.
Step 4. Classify the HMOFP features of the testing frames with the model trained on the motion features of the first $N-1$ training frames.
The whole procedure is illustrated in Figure 5.

Experimental Results
In this section, we evaluate our method for abnormal event detection on the UMN dataset [29] and the PETS2009 dataset [30]. The image patch size is set to 64 × 64 for the UMN dataset and 128 × 128 for the PETS2009 dataset. The range 0° to 360° is divided into 18 bins. The overlap between two neighboring blocks is 50%. In the UMN dataset, each frame has a resolution of 320 × 240, and the length of its HMOFP feature is 972. In the PETS2009 dataset, the resolution of each frame is 768 × 576, and the length of the HMOFP feature is 1584.
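The reported feature lengths can be checked with a short calculation. The sketch below assumes that patches are laid out on a regular grid with a stride equal to the non-overlapping part of the patch and that patches that do not fit entirely inside the frame are discarded. This convention is an assumption that is not stated in the text, although it reproduces both reported lengths. The function name `hmofp_length` is illustrative.

```python
def hmofp_length(width, height, patch, overlap, bins):
    """Length of the concatenated feature: patches per frame times bins.

    Assumes a regular grid of square patches with the given overlap ratio,
    keeping only patches that lie entirely inside the frame."""
    step = int(patch * (1.0 - overlap))
    patches_x = (width - patch) // step + 1
    patches_y = (height - patch) // step + 1
    return patches_x * patches_y * bins


print(hmofp_length(320, 240, 64, 0.5, 18))    # UMN setting: 9 * 6 * 18 = 972
print(hmofp_length(768, 576, 128, 0.5, 18))   # PETS2009 setting: 11 * 8 * 18 = 1584
```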

Experiments on the UMN Dataset.
There are three different crowded scenes in the UMN dataset, which are named lawn, indoor, and plaza, respectively. In our experiments, we select a part of the normal frames of each scene as the training set and take the rest of the video sequence as the testing set.

Detection in the Lawn Scene.
The video sequence of the lawn scene contains 1453 frames in total. The first 480 frames are taken as the training set. As shown in Figure 6, in the lawn scene, the normal event is that individuals walk in different directions. The abnormal event is that individuals suddenly run away. The detection results of the lawn scene are shown in Figure 7. The accuracy of the detection results is 95.5141%.

Detection in the Indoor Scene.
The video sequence of the indoor scene contains 4144 frames in total. The first 319 frames are taken as the training set. As shown in Figure 8, in the indoor scene, the normal event is that some people stand and talk in a relatively fixed location while others walk along the road in the hall. The abnormal event is that people suddenly run out of the doors. The detection results of the indoor scene are shown in Figure 9. The accuracy of the detection results is 91.2857%.

Detection in the Plaza Scene.
In the plaza scene, the normal event is that people walk around the center of the square. The abnormal event is that people suddenly run away from the square. The detection results of the plaza scene are shown in Figure 11. The accuracy of the detection results is 94.3352%.

Experiments on the PETS2009 Dataset.
In the following experiments, we choose specific scenes of interest as the targets of the detection process. In the PETS2009 dataset, we first select the training set and the normal testing set from the same scene. Then a video clip from a different scene is taken as the corresponding abnormal testing set. Our experiments and the detection results are as follows. For the experiment illustrated in Figure 12, the accuracy of the detection results is 97.5%, as shown in Figure 13. For the experiment illustrated in Figure 18, the accuracy of the detection results is 96.1538%, as shown in Figure 19.

Comparison.
We compared our algorithm with the histogram of optical flow orientation (HOFO) method proposed in [21], as shown in Table 1. Most results of our algorithm are better than those of HOFO.

Conclusion
In this paper, we proposed an algorithm for abnormal events detection in crowded scenes at the global-frame scale. Our method contains two main procedures. First, the histogram of maximal optical flow projection (HMOFP) descriptor of the input video sequence is computed. Second, a one-class SVM classifier is used for nonlinear classification of the testing sets. The proposed method has been tested on several surveillance video datasets with good detection accuracy.

Figure 19: The detection results of the sequence Time 14-31 (axis: frame label). "1" means normal and "−1" means abnormal.