A New Semantic and Statistical Distance-Based Anomaly Detection in Crowd Video Surveillance

Recently, attention toward autonomous surveillance has intensified, and anomaly detection in crowded scenes is one of the most significant surveillance tasks. Traditional approaches extract handcrafted features and then require a subsequent model-learning step. They mostly extract low-level spatiotemporal features of videos and neglect semantic information. Deep learning (DL) methods have recently emerged in various domains, especially CNNs for visual problems, with the ability to extract high-level information at the higher layers of their architectures. On the other hand, topic modeling-based approaches like NMF can extract more semantic representations. Here, we investigate a new hybrid visual embedding method based on deep features and a topic model for anomaly detection. Features per frame are computed hierarchically through a pretrained deep model, and in parallel, topic distributions are learned through multilayer nonnegative matrix factorization that incorporates information from the extracted deep features. Training uses normal samples only. Thereafter, K-means is applied to find typical normal clusters. At test time, after the feature representation of each test frame is obtained through the deep model and its topic distribution is computed, the earth mover distance (EMD), a statistical distance, is evaluated between the normal cluster centroids and the test topic distribution. A distance that exceeds a threshold is detected as an anomaly. Experimental results on the benchmark UCSD Ped1 and Ped2 datasets demonstrate the effectiveness of our proposed method in anomaly detection.


Introduction
Automatic video surveillance has recently attracted the attention of researchers, since the large number of cameras installed in public places makes error-free human-based surveillance impractical. Thus, computer vision and machine learning help analyze the output videos for various tasks of automatic recognition and anomaly detection. Originally, information was extracted from raw signals through machine learning techniques [1]. However, the high dimensionality of video signals captured by high-resolution cameras makes traditional methods computationally complex. Therefore, to combat the curse of dimensionality, dimensionality reduction techniques have received more attention. Linear and nonlinear dimensionality reduction approaches can be applied as task-dependent techniques; PCA, MDS, LLE, and autoencoders are a few examples.
Generally speaking, all computer vision-based feature extraction methods, such as handcrafted features (SIFT, HOG, etc.), can also be considered a kind of dimensionality reduction.
Newly emerging embedding methods, originally introduced in natural language processing (NLP), map the original high-dimensional signals to embedding spaces and thereby capture high-level information, so that, besides compression, the semantic relations of the signals are also preserved [2,3]. Embedding techniques in NLP represent each word as a vector in a vector space model. The elementary one-hot encoding fails to preserve semantic relations, since the orthogonality between word vectors neglects any coherence between the words. Topic-based representations such as LSA, probabilistic LSA, LDA, and NMF try to capture semantics [3].
Embedding can also be applied to vision tasks to bridge the semantic gap in image or video analysis. Recently, deep learning architectures (CNN, RNN, AE, RBM, etc.) have been well studied for anomaly detection [4]. By exploiting high-level features, they have shown considerable results in comparison to handcrafted features. Supervised CNNs consist of convolutional layers for feature extraction and fully connected (FC) layers for classification/recognition. Most of the parameters of a CNN reside in the final FC layers, which may cause overfitting when training from scratch on limited datasets. Therefore, attention has shifted toward using only the pretrained convolutional layers for feature extraction and powerful image representations, putting aside the FC layers.
In most studies, anomaly detection is investigated by defining one or more models on normal samples and detecting anomalies as deviations from this normality. This deviation can be measured either by likelihood or by similarity. In [5], an anomaly was defined based on interaction forces between pedestrians using the social force model (SFM), and LDA was used to compute the likelihood of the test set to evaluate deviation from a normal model in a probabilistic framework. In [6,7], by contrast, normal training samples were used to create a dictionary model, and deviation was calculated as a high sparse reconstruction cost between an original test sample and its reconstruction through a linear combination of normal bases in the Euclidean space.
In this paper, we investigate a combination of a deep model, a topic model, and a statistical distance for anomaly detection. In contrast to previous methods, which were based on either handcrafted or deep features and neglected semantic and interpretable information, we combine a deep model with a topic model hierarchically to produce a semantic representation. We apply a pretrained deep model for hierarchical feature extraction from different layer levels for each training image. Thereafter, we take advantage of nonnegative matrix factorization (NMF) as a topic modeling approach for capturing semantic features. Specifically, we apply a multilayer NMF for hierarchical topic representation, injecting information extracted from hierarchical layers of the deep model into the hierarchical decompositions. After learning the topic distribution per frame in the training stage, we apply K-means clustering to compute cluster centroids as typical normal topic-based representations. At test time, the semantic representation of each test frame is calculated through the same feature extraction pipeline as in training and compared to the typical normal topic distributions through a statistical distance metric. Here, the earth mover distance (EMD) is chosen as the distance metric since it has shown efficient performance in comparing distributions.
Our main contributions are as follows: (1) We take advantage of both the deep model (pretrained VGG-Net) and the topic model (multilayer NMF), hierarchically and in combination, to reach a high-level and semantic frame representation. (2) Since topic distributions are extracted at the final level as the frame representations, K-means clustering yields a set of representative topic distributions for normality, and the EMD statistical distance metric is then applied in a clustering-based anomaly detection framework.
The rest of this paper is organized as follows: a literature review in the three domains of anomaly detection, topic modeling, and statistical distance is provided in Section 2. Section 3 introduces our proposed pipeline for crowd anomaly detection. Experimental results are reported in Section 4. Finally, Section 5 concludes this paper.

Literature Review
In this section, we review research on anomaly detection, topic modeling, and statistical distance separately.
2.1. Anomaly Detection. Video surveillance studies on anomaly detection began with traditional handcrafted feature extraction and model learning and have improved over the years through end-to-end deep architectures. Formerly, low-level features such as color, texture and its variants like the mixture of dynamic textures (MDT), SIFT, SURF, optical flow, and trajectories were extracted from appearance, motion, or both, depending on the anomaly definition. At the model learning stage, binary classifiers like SVM, decision trees, and NN were applied in supervised scenarios [1]. In semisupervised and unsupervised scenarios, however, only normal videos are given at the training stage, so a model of normal behavior is created and an anomaly is detected as a deviation from this model. This has been done, for instance, with a one-class SVM (OCSVM) or by fitting a Gaussian model to normal samples. Some researchers exploited the inherent sparsity of vision: a dictionary was learned from normal samples, and at test time, a large reconstruction error was interpreted as an anomaly. Reconstruction was done as a linear combination of dictionary bases that are representative of all normal samples. The dictionary can be learned offline through codebook generation or online by updating it as new normal samples are observed [8].
Recently, deep learning methods have entered practical domains such as vision, text, and speech. The intermediate image representations learned through CNNs, especially when trained on large-scale datasets like ImageNet, have proven to be powerful image descriptors.
In [9], anomalous behaviors were captured through a novel concept of aggregation of ensembles (AOE), based on fine-tuning different pretrained ConvNets and a pool of classifiers. The authors assumed that different CNN architectures learn different levels of representation from crowd videos, so that an ensemble of CNNs enables enriched feature sets to be extracted. Autoencoder-based architectures were also studied, where a large reconstruction error was taken as the anomaly score. The autoencoder can reduce dimensionality and is widely used in unsupervised learning problems or as the preliminary stage of a supervised task [10]. In particular, after training an AE or sparse AE on normal samples, the bottleneck layer can serve as a feature extraction layer for any test sample. Some researchers tried to incorporate both handcrafted and deep features in a unified configuration. In [11], a trajectory-pooled deep convolutional descriptor was introduced that combines dense trajectories and convolutional feature maps, which results in highly discriminative features. Convolutional networks outperform both traditional low-level features and their compositional forms like BoW, Fisher kernel, and VLAD [12], although the two are sometimes used together. In [12], features extracted from the inner layers of a convolutional network were encoded with VLAD to compress the data and subsequently fed to an SVM for classification. Wimmer et al. [13] applied Fisher vector encoding to the output feature maps of a CNN to find a fixed-length representation for image classification.
Sabokrou et al. investigated video anomaly detection through different deep architectures [14-21]. Autoencoder-based anomaly detection and localization using sparsity was introduced in [14,15]. An architecture based on a deep 3D autoencoder, a deeper 3D convolutional neural network (CNN), and a cascade of two classifiers was proposed in [16] for anomaly detection. Fast and accurate detection and localization of anomalies were achieved in [18] using fully convolutional neural networks (FCNs) and cascaded outlier detection. Some studies applied generative adversarial networks and their variants to image anomaly detection [17,19,22]. Semisupervised anomaly detection was analyzed in [23] based on information theory. A novel self-supervised representation learning method, based on integrating a neighbourhood-relational encoding (NRE) among the training data with an encoder-decoder structure, was proposed in [20]. In [21], an adversarial training approach was proposed to detect out-of-distribution samples in an end-to-end model by jointly training two deep neural networks that collaborate at test time to detect novelties.

2.2. Topic Modeling.
Topic modeling is an unsupervised method that was originally introduced for text analysis but has also attracted attention in vision. It is based on the idea that documents with similar content are likely to use a similar set of words, which are grouped into topics. Given an unlabeled collection of documents made up of words, topic modeling discovers patterns as a low-dimensional latent representation. pLSA, LDA, and NMF are among the most common topic modeling approaches [24-26]. Topic models take as input a set of documents $J$, a set of words $V$, and a co-occurrence matrix of words and documents $F = \|n_{wj}\|_{w \in V, j \in J}$ (or a BoVW representation), and produce a set of topics $T$, or more specifically $p(w \mid k)$ and $p(k \mid j)$ for $w \in V$, $j \in J$, $k \in T$, as the word distribution per topic and the topic distribution per document, respectively. Here, $n_{wj}$ is the number of times the word $w$ appears in document $j$, and documents can be represented as mixtures of topics. $F$ can be decomposed into two matrices, $F = \Phi\Theta$, where $\Phi = \{\phi_{wk}\}_{w \in V, k \in T}$ is a word-topic matrix with $\phi_{wk} = p(w \mid k)$ and $\phi_k = \{\phi_{wk}\}_{w \in V}$, and $\Theta = \{\theta_{kj}\}_{k \in T, j \in J}$ is a topic-document matrix with $\theta_{kj} = p(k \mid j)$ and $\theta_j = \{\theta_{kj}\}_{k \in T}$. The decomposition can be solved by various topic model algorithms that make different assumptions. For instance, LDA uses a predefined number of topics, whereas the hierarchical Dirichlet process (HDP) [27] estimates the best number of topics from the training dataset.
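The mixture interpretation of the factorization can be checked numerically. The following is a minimal sketch, assuming toy sizes (5 words, 2 topics, 3 documents) and random matrices that are hypothetical and not taken from the paper; it only illustrates that multiplying a word-topic matrix by a topic-document matrix, both with columns that sum to 1, again gives a valid word distribution per document.

```python
import numpy as np

# Toy sizes (hypothetical): 5 words, 2 topics, 3 documents.
rng = np.random.default_rng(0)

def column_stochastic(a):
    """Scale each column of a nonnegative matrix so that it sums to 1."""
    return a / a.sum(axis=0, keepdims=True)

phi = column_stochastic(rng.random((5, 2)))    # phi[w, k]   = p(w | k)
theta = column_stochastic(rng.random((2, 3)))  # theta[k, j] = p(k | j)

# p(w | j) = sum_k p(w | k) p(k | j): each document is a mixture of topics.
p_word_given_doc = phi @ theta
```

Each column of `p_word_given_doc` is a convex combination of the topic columns of `phi`, weighted by the corresponding column of `theta`.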
In [28], Niebles et al. studied the application of latent topic models, namely, pLSA and LDA, to action categorization. Specifically, they extracted spatiotemporal interest points from the input volumes, followed by codebook generation. In an unsupervised fashion, they succeeded in detecting and localizing actions, which were considered latent topics. New learning algorithms based on EM and variational Bayes inference were proposed in [29] for activity analysis in videos, where activities and behaviors were described by a dynamic topic model. The authors also evaluated anomaly localization procedures in the topic modeling framework. In [30], scene classification was performed by discovering the objects in each image in an unsupervised fashion using pLSA; the object distribution of each image was subsequently used for scene classification with a supervised kNN. Topic modeling-based abnormal behavior recognition has previously been investigated in [5,31]. In almost all cases, a low likelihood corresponds to an abnormal test sample. Unsupervised topic model (pLSA) anomaly detection and localization was studied in [32], which added location and size information to quantized spatiotemporal gradient descriptors to create a more informative vocabulary over visual clips. Each document (frame) is fully described by a corresponding distribution over topics.

2.3. Statistical Distance.
Statistical distances measure the distance between two statistical objects, and when they satisfy the metric axioms, including symmetry, they are known as metrics. In the anomaly detection area, distance measures such as the Jensen-Shannon divergence or the Z-score were applied to compare a query observation with patterns extracted from normal samples [33]. An anomaly is detected by comparing this distance with a threshold. As a powerful statistical distance, the earth mover distance (EMD), also known as the Wasserstein metric, was applied in the image domain [34,35] to compare two probability distributions, mainly based on low-level features like color or texture. It computes the statistical distance between two signatures. A typical signature consists of a list of pairs $((x_1, m_1), \dots, (x_n, m_n))$, where each $x_i$ is a certain feature and $m_i$ is its mass (how many times that feature occurs in the record). Consider two signatures $P$ and $Q$ that contain $m$ and $n$ clusters, respectively, where $p_i$ ($q_i$) is the cluster representative and $w_{p_i}$ ($w_{q_i}$) is the weight of cluster $i$. Also, consider $D = [d_{ij}]$ as the ground distance between clusters $p_i$ and $q_j$. It can be chosen or learned according to the problem at hand.
Wireless Communications and Mobile Computing
The aim is to find the flow matrix $F = [f_{ij}]$, where $f_{ij}$ is the flow between $p_i$ and $q_j$, such that the overall cost below is minimized subject to its constraints:
$$\min_{F} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$$
subject to
$$f_{ij} \ge 0, \quad \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \quad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left(\sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{n} w_{q_j}\right).$$
This optimization is a kind of transportation problem and can be solved via linear programming. Once the flow $F$ is calculated, the EMD is defined as the work normalized by the total flow:
$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}.$$
EMD suffers from a high computational complexity of $O(N^3 \log N)$. Wavelet EMD was proposed in [36] as a linear-time algorithm that approximates the EMD for low-dimensional histograms using the sum of absolute values of the weighted wavelet coefficients of the difference histogram.
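For intuition, the transportation problem has a closed form in one special case. The sketch below is not the wavelet EMD of [36] and not a general linear-programming solver; it assumes two 1D histograms defined on the same ordered bins, normalized to equal total mass, with ground distance $|i - j|$ between bins $i$ and $j$. Under those assumptions the minimal work equals the L1 distance between the two cumulative sums. The function name `emd_1d` is hypothetical.

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two 1D histograms on the same ordered bins.

    Assumes both histograms are normalized to unit mass and that the ground
    distance between bins i and j is |i - j|. In this special case the
    optimal transport cost is the L1 distance between the cumulative sums.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.abs(np.cumsum(p - q)).sum())

d_same = emd_1d([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])  # identical histograms
d_shift = emd_1d([1, 0, 0], [0, 0, 1])             # unit mass moved two bins
```

Because the total flow is 1 after normalization, the normalized EMD and the raw work coincide here. For general signatures with an arbitrary ground distance, a linear-programming solver or an approximation such as wavelet EMD is needed.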
Few studies have exploited EMD in anomaly detection. To the best of our knowledge, only in [7] was wavelet EMD applied, in conjunction with sparse representation, in place of the Euclidean distance because of its robustness. In this paper, we investigate wavelet EMD within our proposed clustering-based anomaly detection.

Proposed Method
In this paper, we analyze anomaly detection at the frame level in crowded scenes. Our proposed architecture is shown in Figure 1. The pipeline consists of two stages: (1) feature extraction and (2) anomaly detection. The feature extraction stage itself consists of two parts that are coupled with each other: (1) hierarchical feature extraction through a pretrained VGG-Net [37] and (2) hierarchical latent representation from multilayer NMF. Both architectures start from low-level features and progress in depth to high-level information, resulting in the ultimate representation.
In the second stage, we apply clustering-based anomaly detection. Specifically, K-means is applied to the ultimate representations of all processed training samples to create typical normal clusters. Since the training dataset consists of only normal samples, the cluster centroids are representatives of normal frames. At test time, test frames are mapped into the topic space learned in the training stage and compared to each cluster centroid. A large statistical distance from all centroids is detected as an anomaly. In the following, we explain each part in more detail.

3.1. Preprocessing and Feature Extraction.
The dataset is separated into two subsets, a training set and a test set. Let $X_{train} = [x_1, x_2, \dots, x_{n_{Train}}]^T \in \mathbb{R}^{n_{Train} \times B_0}$, where $n_{Train}$ is the number of frames in the training set, $B_0 = m \times n \times c$, and $m$, $n$, and $c$ are the width, height, and number of channels, respectively, of the original captured image.
3.1.1. Deep Representation. A pretrained model is applied for feature extraction when training data are scarce, since training from scratch may result in overfitting. As higher-layer feature maps are task specific, we extract more general features from lower layers. We resize each frame to a size compatible with the input of the VGG-Net model ($m_0 \times n_0 \times c_0$) and extract features hierarchically from different depths of the architecture. Let $a_0 = x \in \mathbb{R}^{m_0 \times n_0 \times c_0}$ be a typical training image resized for the VGG input layer. Then, $a_l \in \mathbb{R}^{m_l \times n_l \times c_l}$ is the output feature map from layer $l$, computed from $a_{l-1}$, where $w_{l-1}$ and $b_{l-1}$ are the pretrained VGG weights and biases, respectively, for layer $l$, $m_l \times n_l$ is the spatial size of the feature map, and $c_l$ is the depth of the feature map at layer $l$. We extract feature maps from $L$ different depths ($l = 1, \dots, L$); the feature map at each layer $l$ is then separately fed to a global average pooling (GAP) layer to obtain a representation in vector format. A GAP layer takes an input volume of size $m_l \times n_l \times c_l$ and creates a $1 \times c_l$ dimensional vector by spatial averaging. Therefore, for each frame $x$, we now have $L$ vector representations, $f_{D_l} \in \mathbb{R}^{c_l}$ ($l = 1, 2, \dots, L$). Considering all training samples, we have $L$ matrices of different sizes, $M_l \in \mathbb{R}^{n_{Train} \times c_l}$.
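The pooling step can be sketched in a few lines. This is a minimal illustration, assuming random arrays as stand-ins for the VGG activations (no network is loaded) and using the feature-map sizes reported in the experiments; the helper name `global_average_pool` and the toy value of `n_train` are hypothetical.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average an (m_l, n_l, c_l) feature map over its two spatial axes."""
    return feature_map.mean(axis=(0, 1))

# Random stand-ins for the activations of one frame at L = 3 depths.
rng = np.random.default_rng(0)
shapes = [(56, 56, 128), (28, 28, 256), (14, 14, 512)]
feature_maps = [rng.random(s) for s in shapes]
descriptors = [global_average_pool(a) for a in feature_maps]  # f_D1, f_D2, f_D3

# Stacking the per-frame descriptors row-wise over the training frames
# gives the matrix M_l (here l = 1 with a toy n_train).
n_train = 4
M_1 = np.stack([global_average_pool(rng.random(shapes[0])) for _ in range(n_train)])
```

Each descriptor has one entry per channel, so the spatial resolution of the layer no longer affects the length of the representation.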

3.1.2. Topic-Based Representation.
In parallel, we capture semantic information based on the topic model. Specifically, we apply multilayer NMF, since multiple layers have been shown to improve performance by capturing more semantic features [38]. We adopt an approach similar to [39] by considering a frame as a document and extracting a topic distribution per document. However, we apply multilayer NMF for hierarchical topic modeling. Single-layer NMF decomposes a nonnegative matrix $V$ into two low-rank nonnegative matrices, the basis matrix $W$ and the coefficient matrix $H$:
$$V \approx WH,$$
where $H$ is the new low-dimensional representation of $V$. The decomposition is solved as an optimization problem through a multiplicative update approach. In multilayer NMF, the latent representation computed in preceding layers is decomposed hierarchically in subsequent layers. Consider $X_{train-pca} = \mathrm{PCA}(X_{train-vec})$ with $X_{train-pca} = [x_1, x_2, \dots, x_{n_{Train}}]^T \in \mathbb{R}^{n_{Train} \times D_0}$, where PCA is applied to each vectorized frame to decrease its dimensionality from $m_0 \times n_0$ to $D_0 < m_0 \times n_0$, and the result is rescaled to the range $[0, 1]$. Let $H_0 = X_{train-pca}$ be the input to the first stage of multilayer NMF. Then, it can be decomposed as $H_0 = W_1 H_1$. Instead of directly applying the second NMF to $H_1$ as the new low-dimensional representation, $H_1$ is processed into $V_1$ before being introduced to the next layer. $V_l$ is computed as $V_l = f(H_l, M_l)$, $l = 1, \dots, L$, where $f(\cdot)$ is a nonlinear function, like softmax, and $M_l$ is the feature representation from the pretrained VGG-Net at layer $l$.
Here, we use softmax as the nonlinear function to obtain a distribution-like representation. Since the ReLU activation function is applied in the deep architecture, nonnegativity is preserved. Bringing the matrices $M_l$ into the multilayer NMF decomposition yields both high-level and semantic information, which can improve the performance of the subsequent tasks.
By decomposing $V_l$ in the next layer, we force the architecture to learn how to combine information from the previous layer; therefore, $D_l < D_{l-1}$. By training each NMF layer separately to learn $W_l$ and $H_l$, the ultimate data representation $V_L$ is acquired. Finally, $V_L$ integrates features from both the deep model and the topic model.
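A single NMF layer with multiplicative updates can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard Lee-Seung updates for the Frobenius objective, a random toy matrix, and hypothetical names and hyperparameters (`nmf`, `rank`, `n_iter`, `eps`). The coupling with the deep features through $f(H_l, M_l)$ is not shown, because its exact form is not specified in the text.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (n x d) as W (n x rank) @ H (rank x d)
    with Lee-Seung multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, d = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, d)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay >= 0.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

rng = np.random.default_rng(1)
V = rng.random((30, 20))          # toy nonnegative data
W, H, errors = nmf(V, rank=5)
```

In a multilayer setting, the representation produced by one layer (after the nonlinear processing described above) would be passed to another call of the same routine with a smaller rank.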

3.2. Anomaly Detection.
Upon completion of training, $V_L \in \mathbb{R}^{n_{Train} \times D_L}$ is acquired from the normal frames in the training set. We apply the K-means algorithm to $V_L$ to find $K$ cluster centroids as representatives of normality. We therefore have $K$ cluster centroids $s_i$, $i = 1, \dots, K$, which are used in cluster-based anomaly detection. Each test frame $x_{test}$ is fed to the feature extraction block learned in the training phase, and the ultimate representation $V_{L,test}$ is acquired. $V_{L,test}$ can be considered the final topic distribution of $x_{test}$. $V_{L,test}$ is compared to each $s_i$, and a wavelet EMD (WEMD) distance that exceeds the threshold $th$ for all centroids is detected as an anomaly:
$$\text{if } \min_{i = 1, \dots, K} \mathrm{WEMD}(V_{L,test}, s_i) > th, \text{ then } x_{test} \text{ is an abnormal frame.}$$
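The decision rule can be sketched as a nearest-centroid test. The code below is an illustration only: it substitutes a simple closed-form 1D EMD (L1 distance between cumulative sums, assuming ordered bins and unit mass) for the wavelet EMD used in the paper, and the centroids, test vectors, and threshold value are hypothetical.

```python
import numpy as np

def emd_1d(p, q):
    """Stand-in distance: closed-form 1D EMD (L1 distance between CDFs)."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.abs(np.cumsum(p - q)).sum())

def is_anomalous(v_test, centroids, th, dist=emd_1d):
    """Flag a frame when its distance to the nearest centroid exceeds th."""
    nearest = min(dist(v_test, s) for s in centroids)
    return nearest > th, nearest

# Hypothetical normal centroids s_i (K = 2) over D_L = 4 topics.
centroids = np.array([[0.7, 0.2, 0.05, 0.05],
                      [0.6, 0.3, 0.05, 0.05]])
normal_frame = np.array([0.65, 0.25, 0.05, 0.05])
odd_frame = np.array([0.05, 0.05, 0.2, 0.7])

flag_normal, d_normal = is_anomalous(normal_frame, centroids, th=0.5)
flag_odd, d_odd = is_anomalous(odd_frame, centroids, th=0.5)
```

Taking the minimum over centroids means a frame is flagged only when it is far from every normal cluster. Any other distance with the same call signature can be passed through the `dist` argument.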

Results and Discussion
We conducted experimental analysis on the UCSD dataset, one of the benchmark datasets in crowd anomaly detection, introduced in [40] and recorded with a static camera at 10 fps. This dataset contains two scenes, Ped1 and Ped2, each of which is split into train and test sequences. Nonpedestrian objects, like bikers, skaters, and small carts, are considered anomalies. More details about this dataset are provided in Table 1. Typical normal and abnormal sample frames for the Ped1 and Ped2 datasets are shown in Figure 2. When originally introduced, VGG [37] was trained on the ImageNet dataset, which consists only of object classes; more recently, a VGG model pretrained on both the ImageNet and Places datasets has been provided, which also covers scene classes. The 1000 classes from ImageNet and the 365 classes from Places365-Standard [41] were merged to train a VGG16-based model (Hybrid1365-VGG [42]). We use this VGG model, pretrained on both the ImageNet and Places datasets, to improve the capability of our deep feature extraction block in capturing both object and scene features. For this paper, our algorithms were implemented in Python and run on a PC with a 2.9 GHz Core i5 CPU, a GTX1080 GPU, and 16 GB of RAM. Original frames are resized to be compatible with VGG, which accepts input of size 224 × 224 × 3. Feature maps from different depths, namely, block2_pool, block3_pool, and block4_pool of the VGG architecture, were extracted, resulting in (56 × 56 × 128), (28 × 28 × 256), and (14 × 14 × 512) feature maps, respectively. Then, we applied global average pooling to each feature map separately, which results in the representation vectors $f_{D_1}$ (128-D), $f_{D_2}$ (256-D), and $f_{D_3}$ (512-D) in hierarchical order. In parallel, we applied multilayer NMF with $L = 3$ to our training set after dimensionality reduction by PCA (a 2000-D vector per frame). $W_1$, $W_2$, and $W_3$ are learned separately with multiplicative updates. $D_1$, $D_2$, and $D_3$ are chosen as 512, 256, and 128, respectively.
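The feature-map sizes quoted above follow directly from the VGG16 layout. The short sketch below reproduces the arithmetic, assuming the standard configuration in which 3 × 3 convolutions preserve the spatial size and every block ends with a 2 × 2 max-pool of stride 2; the layer names follow the Keras naming convention, which is an assumption about the implementation, and `pool_output_shape` is a hypothetical helper.

```python
# Output channels of the five VGG16 convolutional blocks.
BLOCK_CHANNELS = [64, 128, 256, 512, 512]

def pool_output_shape(input_size, block):
    """Shape after the max-pool that closes `block` (1-indexed).

    Assumes size-preserving 3x3 convolutions and a 2x2 max-pool with
    stride 2 at the end of every block, so each block halves the side.
    """
    side = input_size // (2 ** block)
    return (side, side, BLOCK_CHANNELS[block - 1])

shapes = {f"block{b}_pool": pool_output_shape(224, b) for b in (2, 3, 4)}
```

After global average pooling, only the last entry of each shape (the channel count) determines the length of the descriptor.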
K-means clustering with $K = 50$ is applied to the final representation $V_L \in \mathbb{R}^{n_{Train} \times D_L}$ to generate typical representative centroids. In the UCSD dataset, there are $n_{Train} = 6800$ training frames for Ped1 and $n_{Train} = 2550$ for Ped2.
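The clustering step can be sketched with plain Lloyd iterations. This is a minimal illustration, not the implementation used in the experiments: the initialization (random training rows), the iteration count, the toy data drawn from a Dirichlet distribution, and the small value of `k` are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

rng = np.random.default_rng(0)
V_L = rng.dirichlet(np.ones(8), size=200)  # 200 toy rows, each summing to 1
centroids, labels = kmeans(V_L, k=5)
```

Because each centroid is an average of rows that sum to 1, the centroids are themselves distribution-like, which is what allows them to be compared with test topic distributions through a statistical distance.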
In our experiments, we investigated the values of several parameters and fixed them after evaluation. These parameters are shown in Table 2.
VGG16 consists of several layers (C11-C12-P1-C21-C22-P2-C31-C32-C33-P3-C41-C42-C43-P4-C51-C52-C53-P5-FC1-FC2-FC3). The convolutional and fully connected layers have trainable parameters. The last three fully connected layers provide task-specific features, so we focus on the five convolutional blocks. We chose $L = 3$ to achieve a trade-off between accuracy and complexity. The number of clusters in K-means clustering was evaluated for $K = 30, 40, 50, 60$ and chosen as $K = 50$ based on accuracy. We set the threshold for the WEMD comparison based on the average distance of the training sample representations, since the training dataset consists only of normal samples.
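One way to derive a threshold from normal training data is sketched below. The exact rule is not given in the text, so this is an assumption: the threshold is taken as the mean nearest-centroid distance over the training representations, multiplied by a hypothetical `scale` factor. An L1 distance is used as a placeholder for WEMD, and the toy data are invented for the example.

```python
import numpy as np

def l1(a, b):
    """Placeholder distance (L1); WEMD would be used in its place."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def threshold_from_training(train_reps, centroids, dist=l1, scale=1.0):
    """Mean nearest-centroid distance over normal frames, times `scale`."""
    nearest = [min(dist(v, s) for s in centroids) for v in train_reps]
    return scale * float(np.mean(nearest))

# Toy normal representations and a single toy centroid.
train_reps = np.array([[0.5, 0.5], [0.9, 0.1]])
centroids = np.array([[0.5, 0.5]])
th = threshold_from_training(train_reps, centroids)
```

A `scale` above 1 makes the detector more conservative, since fewer normal frames would exceed the threshold; the appropriate value would have to be validated on held-out data.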
For Ped1, we compare our proposed approach both to traditional methods (SRC [6], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (AVID [19], Sabokrou [8], and deep cascade [16]). As introduced and calculated in [26], evaluation metrics such as the equal error rate (EER) and the area under the curve (AUC) are computed at the frame level and compared to the state-of-the-art methods. The EER is the operating point at which the false positive rate equals the false negative rate; the lower the EER, the higher the accuracy. The EER of our proposed approach was compared with that of the previous methods for Ped1.
The Ped1 dataset suffers from perspective distortion. For this reason, most studies have been conducted on Ped2. For Ped2, we compare our proposed approach both to traditional methods (SF [5], MPPCA [43], and MDT [40]) and to high-level deep learning-based methods (Conv-AE [44], AVID [19], deep anomaly [18], deep cascade [16], ALOCC [17], and ST-AE [45]). A comparison of the EER of our proposed approach with the previous methods is shown in Table 4 for Ped2. The results show that the performance of our proposed method is comparable. In addition, the AUC is computed and compared to the state of the art; the results show that our proposed approach outperforms the compared methods in AUC.
Moreover, we evaluated accuracy as
$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}.$$
The results, shown in Table 5 for the Ped1 and Ped2 datasets, indicate the high performance of our proposed method.
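The frame-level metrics can be sketched as follows. This is an illustration with invented scores and labels, not the evaluation code behind the reported tables: the EER is approximated by sweeping the threshold over the observed scores and taking the point where the false positive and false negative rates are closest, and both classes are assumed to be present. The function names are hypothetical.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified frames."""
    return (tp + tn) / (tp + fp + fn + tn)

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping the threshold over the observed scores.

    labels: 1 = anomalous, 0 = normal; a higher score means more anomalous.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, None
    for th in np.unique(scores):
        pred = scores >= th
        fpr = np.mean(pred[labels == 0])    # normal frames flagged
        fnr = np.mean(~pred[labels == 1])   # anomalous frames missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return float(eer)

# Invented anomaly scores for eight frames.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
eer = equal_error_rate(scores, labels)
acc = accuracy(tp=40, tn=50, fp=5, fn=5)
```

With finitely many frames the two error rates may never be exactly equal, which is why the sketch reports the average of the two rates at the closest point.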

Conclusions
In this paper, we discussed a new semantic and statistical distance-based approach to crowd anomaly detection at the frame level. In particular, inspired by the earth mover distance metric previously applied to low-level vision features, we applied this statistical distance to features learned hierarchically through a pretrained deep convolutional neural network and a topic model. Features from VGG-Net, pretrained on a hybrid dataset (the Places and ImageNet datasets), and from multilayer NMF, as semantic interpretable features, were combined into a hierarchical representation and used in clustering-based anomaly detection with the wavelet EMD statistical distance. Experimental results show that our proposed approach achieves a comparable EER and a higher AUC than previous methods. In the future, we will investigate anomaly localization by patch analysis through the convolutional kernel network (CKN) [46] and EMD in a similar framework.

Data Availability
Readers can access the UCSD Ped1 and Ped2 datasets at http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm.

Table 4: Comparison of EER performance for the UCSD Ped2 dataset at the frame level.