Elsevier

Neurocomputing

Volume 371, 2 January 2020, Pages 188-198

Crowd anomaly detection using Aggregation of Ensembles of fine-tuned ConvNets

https://doi.org/10.1016/j.neucom.2019.08.059

Abstract

Anomaly detection in crowded scenes plays a crucial role in automatic video surveillance, helping to avert casualties in areas with heavy footfall. The key challenge in automatically classifying anomalies in crowd images is to find a feature set and techniques that can be replicated across every crowded scenario. In this paper, we propose a novel concept of Aggregation of Ensembles (AOE) for detecting anomalies in videos of crowded scenes, which leverages the existing capability of pre-trained ConvNets and a pool of classifiers. The proposed approach uses an ensemble of different fine-tuned Convolutional Neural Networks (CNNs), based on the hypothesis that different CNN architectures learn different levels of semantic representation from crowd videos, so that an ensemble of CNNs enables richer feature sets to be extracted. The proposed AOE concept uses the fine-tuned ConvNets as fixed feature extractors to train variants of the SVM classifier, and the resulting posterior probabilities are then fused to predict anomalies in crowd frame sequences. Experimental results on benchmark datasets show that the proposed Aggregation of Ensembles of fine-tuned CNNs with various architectures achieves higher accuracy than other established methods.

Introduction

Anomaly detection is concerned with classifying patterns of usual and unusual behaviour in a given domain. In simple terms, any deviation from expected, usual behaviour is regarded as an anomaly in the system. Depending on the application domain, anomalies are also known as outliers, discordant observations, unusual observations, exceptions, aberrations, surprises, peculiarities or contaminants. Anomaly detection reaches into diverse fields such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance of enemy activities. In medical imaging, anomaly detection plays a crucial role, since specialists diagnosing complex diseases need assistance in detecting anomalies in the routine behaviour of the observed condition before reaching a logical conclusion. For example, an anomalous MRI image may indicate the presence of malignant tumours [1], anomalies in credit card transaction data could indicate credit card or identity theft [2], and suspicious values in experimental data may be termed outliers [3]. Any deviation from the conforming, usual behaviour of domain-specific subjects is categorized as abnormal. A straightforward approach is to define a region representing normal behaviour and declare any data that deviate from this region as anomalous. How much deviation counts as anomalous is defined differently in different domains: what is abnormal in the medical domain may be normal in a field such as finance. Thus, every domain has its own scale for measuring abnormality.
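The "normal region" idea described above can be made concrete with a minimal sketch. It assumes numeric feature vectors, estimates the normal region from normal samples only, and uses a per-feature z-score with a tunable threshold standing in for the domain-specific abnormality scale. The helpers `fit_normal_region` and `is_anomalous`, the threshold of 3.0 and the toy data are hypothetical illustrations, not the method proposed in this paper.

```python
from statistics import mean, pstdev

def fit_normal_region(normal_samples):
    """Estimate per-feature mean and spread from normal samples only (hypothetical helper)."""
    columns = list(zip(*normal_samples))
    mus = [mean(col) for col in columns]
    # A constant feature has zero spread; use a tiny value to avoid dividing by zero.
    sigmas = [pstdev(col) or 1e-9 for col in columns]
    return mus, sigmas

def is_anomalous(sample, region, threshold=3.0):
    """Return True if any feature lies more than `threshold` deviations from the normal mean.

    The threshold plays the role of the domain-specific abnormality scale:
    a value that suits one domain may be too strict or too loose for another.
    """
    mus, sigmas = region
    largest_z = max(abs(x - m) / s for x, m, s in zip(sample, mus, sigmas))
    return largest_z > threshold

# Toy two-feature data, invented purely for illustration.
normal = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [1.1, 10.5], [0.9, 9.5]]
region = fit_normal_region(normal)
print(is_anomalous([1.0, 10.2], region))  # False: inside the normal region
print(is_anomalous([5.0, 10.0], region))  # True: first feature deviates strongly
```

Changing `threshold` is the simplest way to express that each domain tolerates a different amount of deviation before calling a sample anomalous.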

Crowd dynamics recognition has become an important field of computer vision. Crowd analysis is of particular interest because of its applications in many aspects of daily life, such as ensuring the security of people at crowded places like railway stations, stadiums and religious monuments [4]. Computer vision and machine learning algorithms are being used together to automate crowd surveillance systems that can deliver state-of-the-art performance in real time. Individual human behaviour has been studied, and human threat perception has been decoded, in the past. The problem becomes more complex when individuals gather into groups, and large groups form a crowd. Crowd analysis covers various aspects, such as crowd density estimation, crowd motion detection, crowd tracking, crowd behaviour analysis and crowd counting. All of these tasks share the common step of interpreting images and videos for reliable cues to crowd behaviour. Crowd behaviour analysis can be a valuable component of intelligent transportation systems, supporting safety and security as well as real-time supply/demand management of public transportation.

The key objective of crowd anomaly detection is to identify atypical crowd behaviour, such as a motor vehicle on a footpath, people suddenly running in an unusual pattern because of some mishap, unusual overcrowding hotspots, or a person walking on a motor vehicle path (jaywalking). In Fig. 1, the truck on the pedestrian walkway, highlighted in red, represents an anomaly in the context of its surroundings.

However, anomaly detection in crowded scenes is a challenging task because: (a) lighting conditions in crowded scenes are uneven, (b) anomalous event samples are rare and often subtle, (c) fast-moving objects degrade the performance of event modelling, (d) modelling the variety of normal and anomalous events is difficult, and (e) the definition of normal and anomalous events is vague and highly dependent on changing visual contexts.

Convolutional Neural Networks (CNNs) are the most sought-after deep learning technique because of their excellent recognition performance in various computer vision applications [5], [6]. ConvNets are multi-layered neural networks capable of extracting discriminative features at multiple levels of abstraction. Training a ConvNet from scratch is computationally intensive and requires a large labelled training dataset. This issue is especially prominent in crowd anomaly detection, where anomalous event samples are sparse. In this paper, we address this issue by eliminating the need to train ConvNets from scratch and instead experimenting with transfer learning of pre-trained CNNs. We propose a novel concept of Aggregation of Ensembles (AOE), which leverages the existing capability of pre-trained ConvNets and a pool of classifiers. The term Aggregation of Ensembles combines two terms, “Aggregation” and “Ensemble”: the former refers to the anomaly classification scheme and the latter to quality feature extraction. The proposed approach uses an ensemble of different fine-tuned CNN architectures, based on the hypothesis that different CNN architectures learn different levels of semantic image representation. The various CNNs in our ensemble allow us to extract distinctive crowd features that characterise the distinct and subtle differences between normal and anomalous events in crowded scenes. The proposed methodology helps the ConvNets adapt the generic features learned from natural images into features specific to the crowd domain. The novel contributions of this paper are as follows:

  • We introduce a novel concept of Aggregation of Ensembles for high-accuracy anomaly detection in crowd scenes.

  • To the best of our knowledge, the proposed Aggregation of Ensembles methodology is the first of its kind for anomaly detection in crowd videos. Through various fine-tuned CNN ensembles, AOE adapts the generic features learned from natural images to the crowd anomaly domain, thereby enabling the extraction of higher-quality features at different semantic levels.

  • We present a paradigm that eliminates the need to train ConvNets from scratch for crowd anomaly detection, where anomalous training samples are sparse.

  • We analyse the effect of different optimization methodologies on fine-tuning, particularly in crowd behaviour analysis.

The rest of this paper is organized as follows: Section 2 discusses the related work. Section 3 introduces the proposed methodology. The experimental results and discussion are presented in Section 4. The conclusion of the paper is drawn in Section 5.


Related work

Anomaly detection in crowded scenes has attracted a surge of interest from computer vision researchers in the last decade [7]. Classifying anomalous events is a cumbersome task, chiefly because of the lack of training examples for the different types of normal and anomalous events. The initial methods used object trajectories [8], [9] to detect anomalies: the trajectories serve as features, and abnormal trajectories are those that occur much more rarely than normal ones.
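The observation that abnormal trajectories are rarer than normal ones can be illustrated with a minimal sketch. It is not the method of [8] or [9]; it assumes each trajectory is a list of (x, y) image positions, resamples every track to a fixed length, and uses the mean distance to the k nearest other tracks as a rarity score. The helpers `resample`, `track_distance` and `rarity_scores`, the resampling length of 8, k = 2 and the toy tracks are all hypothetical choices.

```python
import math

def resample(traj, n=8):
    """Linearly interpolate a trajectory [(x, y), ...] to exactly n points."""
    out = []
    for i in range(n):
        t = i * (len(traj) - 1) / (n - 1)   # fractional index into the track
        lo = int(t)
        hi = min(lo + 1, len(traj) - 1)
        a = t - lo
        out.append((traj[lo][0] * (1 - a) + traj[hi][0] * a,
                    traj[lo][1] * (1 - a) + traj[hi][1] * a))
    return out

def track_distance(t1, t2):
    """Mean point-wise Euclidean distance between two equal-length tracks."""
    return sum(math.dist(p, q) for p, q in zip(t1, t2)) / len(t1)

def rarity_scores(trajectories, k=2):
    """Mean distance to the k nearest other tracks; high scores mark rare patterns."""
    tracks = [resample(t) for t in trajectories]
    scores = []
    for i, t in enumerate(tracks):
        d = sorted(track_distance(t, u) for j, u in enumerate(tracks) if j != i)
        scores.append(sum(d[:k]) / k)
    return scores

# Four near-parallel walking tracks and one diagonal crossing (toy data).
tracks = [
    [(0, 0), (5, 0), (10, 0)],
    [(0, 1), (5, 1), (10, 1)],
    [(0, 2), (5, 2), (10, 2)],
    [(0, 1.5), (5, 1.5), (10, 1.5)],
    [(0, 0), (5, 5), (10, 10)],
]
scores = rarity_scores(tracks)
print(max(range(len(tracks)), key=scores.__getitem__))  # 4: the diagonal track
```

A nearest-neighbour score is used here only because it needs no training; clustering or a one-class classifier over trajectory features would express the same "rare means abnormal" assumption.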

Proposed methodology

We introduce a novel approach named AOE (Aggregation of Ensembles), in which an ensemble of pre-trained CNNs (AlexNet [5], GoogLeNet [29] and VGGNet [30]) is used and, on top of it, different classifiers are aggregated to reach the best overall classification decision. The proposed method uses CNNs pre-trained on natural images, and to obtain pronounced results we applied two-way transfer learning to the existing CNNs, i.e., (i) fine-tuning the ConvNets on crowd datasets and
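The abstract states that the fine-tuned ConvNets serve as fixed feature extractors for variants of the SVM classifier, whose posterior probabilities are then fused. The snippet above does not say how SVM outputs become probabilities or which fusion rule is used, so the following is only a minimal sketch under two assumptions: Platt-style sigmoid calibration of each SVM decision value, and either mean (sum rule) or renormalised product fusion. The decision values, the sigmoid parameters, the SVM variant names and the helpers `platt_posterior` and `aggregate` are hypothetical.

```python
import math

def platt_posterior(decision_value, a, b):
    """P(anomaly | x) from an SVM decision value via a Platt-style sigmoid.

    Assumption: a and b would normally be fitted on held-out data; here they
    are hand-picked for illustration.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

def aggregate(posteriors, rule="mean"):
    """Fuse per-classifier posteriors P(anomaly | frame) into a single score.

    "mean" is the sum (averaging) rule; "product" multiplies the per-classifier
    probabilities for each class and renormalises. Neither is confirmed as the
    paper's own rule.
    """
    if rule == "mean":
        return sum(posteriors) / len(posteriors)
    if rule == "product":
        p_anomaly = math.prod(posteriors)
        p_normal = math.prod(1.0 - p for p in posteriors)
        return p_anomaly / (p_anomaly + p_normal)
    raise ValueError(f"unknown fusion rule: {rule}")

# Hypothetical SVM decision values for one frame, one per (CNN features, SVM variant) pair.
decision_values = {
    ("AlexNet", "linear SVM"): 1.2,
    ("GoogLeNet", "RBF SVM"): 0.4,
    ("VGGNet", "linear SVM"): -0.3,
}
posteriors = [platt_posterior(v, a=-2.0, b=0.0) for v in decision_values.values()]
fused = aggregate(posteriors)
label = "anomalous" if fused >= 0.5 else "normal"
print(round(fused, 3), label)  # 0.654 anomalous
```

Averaging is more tolerant of a single badly calibrated classifier, while the product rule rewards agreement and pushes the fused score towards 0 or 1 (about 0.93 for the same three posteriors).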

Experimental results and discussion

In this section, a comprehensive qualitative and quantitative comparison of the proposed algorithm with state-of-the-art algorithms is presented. The proposed anomaly detection algorithm is implemented using Caffe's Python bindings on the Ubuntu operating system, and experiments were conducted on a standard machine (Intel Core i5, 8 GB RAM) equipped with an NVIDIA TITAN X GPU.

Conclusion

Aggregation of Ensembles, the concept introduced in this paper, is a way to leverage the existing capability of pre-trained ConvNets and extract high-level features that can be used to train a pool of classifiers. We present a novel crowd anomaly detection method that is efficient in discriminating between anomalous and normal events and can benefit the surveillance of crowded places. The ensemble of fine-tuned ConvNets cooperates to detect anomalies. Altogether, our approach has

Declaration of Competing Interest

There is no conflict of interest.


References (40)

  • K. Simonyan et al. Two-stream convolutional networks for action recognition in videos.
  • G. Tripathi et al. Convolutional neural networks for crowd behaviour analysis: a survey. Vis. Comput. (2018).
  • C. Piciarelli et al. Trajectory-based anomalous event detection. IEEE Trans. Circuits Syst. Video Technol. (2008).
  • M. Sabokrou et al. Deep-cascade: cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Trans. Image Process. (2017).
  • V. Mahadevan et al. Anomaly detection in crowded scenes.
  • W. Li et al. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. (2014).
  • R. Mehran et al. Abnormal crowd behavior detection using social force model.
  • J. Kim et al. Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates.
  • F. Zhijun et al. Abnormal event detection in crowded scenes based on deep learning. Multimed. Tools Appl. (2016).
  • D. Xu et al. Learning deep representations of appearance and motion for anomalous event detection.


Kuldeep Singh (M’16) received his B.E. degree in Electronics & Comm. from Govt. Engg. College, India, in 2004, his M.Tech. degree in Signal Processing from Delhi University in 2006, and his Ph.D. degree in Computer Vision from Delhi Technological University, India, in 2016. He is currently an Assistant Professor with the Dept. of Electronics & Comm. Engg., Malaviya National Institute of Technology, Jaipur, India. Previously, he was a Senior Scientist with the Central Research Lab, Bharat Electronics Ltd., India. His research interests include Computer Vision, Deep Learning applications, Crowd Behaviour Analysis, Medical Imaging, Sparse Representation, Human Action/Activity Recognition and Visual Tracking.

Shantanu Rajora is currently pursuing his Bachelor of Technology in Information Technology at Delhi Technological University (formerly Delhi College of Engineering). His areas of research include Deep Learning, Computer Vision and Digital Image Processing.

Dinesh Kumar Vishwakarma (M’16, SM’19) received the B.Tech. degree from Dr. Ram Manohar Lohia Avadh University, Faizabad, India, in 2002, the M.Tech. degree from the Motilal Nehru National Institute of Technology, Allahabad, India, in 2005, and the Ph.D. degree from Delhi Technological University, New Delhi, India, in 2016. He is currently an Associate Professor with the Department of Information Technology, Delhi Technological University, New Delhi. His current research interests include Computer Vision, Machine Learning, Deep Learning, Sentiment Analysis, Fake News and Rumour Analysis, Crowd Behaviour Analysis, Person Re-Identification, and Human Action and Activity Recognition. He is a reviewer for various IEEE, Elsevier and Springer journals and transactions. He received the “Premium Research Award” from Delhi Technological University, Delhi, India, in 2018.

Gaurav Tripathi received his B.Tech. degree in Computer Science from VBS Purvanchal University, India, in 2003 and his M.Tech. degree in Information Technology from the Indian Institute of Information Technology, Allahabad, in 2007. He is currently pursuing a Ph.D. at Delhi Technological University, India, and works as a Scientist at Bharat Electronics Ltd., India. His research interests include the Internet of Things, Deep Learning based Computer Vision and Fog Computing.

Sandeep Kumar received his B.Tech. in Electronics and Communication from Kurukshetra University, India, in 2004 and his Master of Engineering in Electronics and Communication from Thapar University, Patiala, India, in 2007. He is pursuing a Ph.D. at Delhi Technological University, Delhi, India. He is currently working as Member (Senior Research Staff) at the Central Research Laboratory, Bharat Electronics Limited, Ghaziabad, India. His research interests include the study of wireless channels, performance modeling of fading channels and cognitive radio networks. He also serves as a reviewer for IEEE, Elsevier and Springer journals.

Gurjit Singh Walia (M’16) received his B.E. degree in Electronics & Comm. from Guru Nanak Dev Engineering College, Ludhiana, in 1998, his M.E. degree in Electronics Engg. from Punjab Engineering College, Chandigarh, in 2000, and his Ph.D. degree in Computer Vision from Delhi Technological University, Delhi, in 2016. He works as a Senior Scientist in the Defence Research and Development Organization, Ministry of Defence, India. His research interests include machine learning, pattern recognition and visual tracking.
