A dataset for automatic violence detection in videos

The automatic detection of violence and crimes in videos is gaining attention, specifically as a tool to unburden security officers and authorities from the need to watch hours of footage to identify events lasting a few seconds. So far, most of the available datasets have been composed of few clips, in low resolution, often built on overly specific cases (e.g. hockey fights). While high resolution datasets are emerging, there is still a need for datasets that test the robustness of violence detection techniques to false positives caused by behaviours which might resemble violent actions. To this end, we propose a dataset composed of 350 clips (MP4 video files, 1920 × 1080 pixels, 30 fps), labelled as non-violent (120 clips) when representing non-violent behaviours, and violent (230 clips) when representing violent behaviours. In particular, the non-violent clips include behaviours (hugs, claps, exulting, etc.) that can cause false positives in the violence detection task, due to fast movements and the similarity with violent behaviours. The clips were performed by non-professional actors, varying from 2 to 4 per clip.



Value of the Data
• As interest in the automatic detection of violence and crimes in video is increasing, the clips in the presented dataset are intended to train and benchmark techniques for automatic violence detection in videos.
• In the short and mid term, researchers can use the Full HD clips as an additional open dataset to train and test their algorithms. In the long term, law enforcement authorities and the entire community might benefit from fine-tuned algorithms capable of reducing the decision time in violence and crime detection.
• A specific goal of the dataset is to verify the robustness of violence detection techniques to false positives. Thus, experiments assessing the classification accuracy of algorithms can consider this specific feature in the evaluation phase.
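One way to take robustness to false positives into account during evaluation is to report the false positive rate on the non-violent clips alongside the overall accuracy. The following is a minimal sketch, not part of the dataset or its repository; the function name and the label encoding (0 for non-violent, 1 for violent) are assumptions made for illustration.

```python
from typing import Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of non-violent clips that were predicted as violent.

    Assumed encoding (not defined by the dataset): 0 = non-violent, 1 = violent.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # False positive: a non-violent clip classified as violent.
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # True negative: a non-violent clip classified as non-violent.
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    negatives = false_pos + true_neg
    if negatives == 0:
        raise ValueError("y_true contains no non-violent clips")
    return false_pos / negatives
```

With the 120 non-violent clips of this dataset, each misclassified non-violent clip raises this rate by 1/120, i.e. roughly 0.83 percentage points.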

Data Description
The pervasiveness of video surveillance cameras and the need to watch footage and make decisions in a very short time [1] have boosted the interest of researchers in techniques for the automatic detection of violence and crimes in videos. In fact, both techniques based on handcrafted features [2,3] and deep learning [4,5] have demonstrated their accuracy for automatic violence detection on open datasets such as the Hockey Fight Dataset [6], the Movie Fight Dataset [6], and the Crowd Violence Dataset [7]. However, such datasets include few low-resolution videos, sometimes in overly specific environments (e.g. hockey arenas). These issues have been addressed by RWF-2000 [8], a dataset including 2000 clips from real video surveillance cameras. Nevertheless, in terms of accuracy, especially for the prevention of false positives, there is still a need to understand the effectiveness of violence detection techniques on clips showing rapid movements (hugs, claps, high-fives, etc.) which are not violent. To this end, we present a dataset for violence detection specifically designed to include, as non-violent clips, scenes which can cause false positives.
The dataset is composed of 350 clips which are MP4 video files (H.264 codec) of an average length of 5.63 s, with the shortest video lasting 2 s and the longest 14 s. For all the clips, the resolution is 1920 × 1080 pixels and the frame rate is 30 fps. The dataset is organized into directories as shown in Fig. 1.
The dataset is split into two main directories, "non-violent" and "violent", labelling the included clips as showing non-violent behaviours and violent behaviours respectively. Each directory is split into two subdirectories, "cam1" and "cam2":
• "non-violent/cam1" includes 60 clips representing non-violent behaviours;
• "non-violent/cam2" includes 60 clips with the same non-violent behaviours in "non-violent/cam1" but recorded with a different camera and from a different point of view;
• "violent/cam1" includes 115 clips representing violent behaviours;
• "violent/cam2" includes 115 clips with the same violent behaviours in "violent/cam1" but recorded with a different camera and from a different point of view.
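Because the labels are encoded in the directory names, the clips can be indexed by walking the four subdirectories. The sketch below assumes a local copy of the repository in which `root` is the directory containing "non-violent" and "violent"; the `Clip` type and the `index_clips` function are hypothetical names introduced only for this example.

```python
from pathlib import Path
from typing import List, NamedTuple


class Clip(NamedTuple):
    path: Path
    label: str   # "non-violent" or "violent", taken from the directory name
    camera: str  # "cam1" or "cam2", taken from the subdirectory name


def index_clips(root: Path) -> List[Clip]:
    """Collect every MP4 clip under root/<label>/<camera>/."""
    clips = []
    for label in ("non-violent", "violent"):
        for camera in ("cam1", "cam2"):
            # sorted() only makes the order reproducible across runs.
            for path in sorted((root / label / camera).glob("*.mp4")):
                clips.append(Clip(path, label, camera))
    return clips
```

On a complete copy of the dataset, the resulting list should contain 350 entries, 120 labelled "non-violent" and 230 labelled "violent".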
The clips were performed by a group of non-professional actors, varying from 2 to 4 per clip. For the violent clips (Fig. 2), the actors were asked to simulate actions frequent in brawls, such as kicks, punches, slapping, clubbing (beating with a cane), stabbing, and gunshots. For the non-violent clips (Fig. 3), the actors were asked to simulate actions which can result in false positives by violence detection techniques due to the speed of movements or the similarity with violent actions. Specifically, the non-violent clips include actions such as hugging, giving high fives, clapping, exulting, and gesticulating. An additional labelling is provided in three CSV files available in the main data repository directory:
- "action-class-occurrences.csv" lists all the actions recorded in the clips, with the number of times each action occurs in the dataset and a label to explain if the action is violent (y) or not (n). All the actions recorded in the clips are listed in Table 1;
- "non-violent-action-class.csv" lists the actions included in each non-violent clip;
- "violent-action-class.csv" lists the actions included in each violent clip.
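As an illustration of how the occurrence file could be consumed, the sketch below sums the occurrences of violent (y) and non-violent (n) actions. The column names `occurrences` and `violent` are assumptions: the actual header of "action-class-occurrences.csv" is not described here and should be checked before use.

```python
import csv
from typing import Dict, TextIO


def occurrences_by_label(handle: TextIO,
                         count_col: str = "occurrences",  # assumed column name
                         violent_col: str = "violent"     # assumed column name
                         ) -> Dict[str, int]:
    """Sum action occurrences separately for violent (y) and non-violent (n) actions."""
    totals = {"y": 0, "n": 0}
    for row in csv.DictReader(handle):
        flag = row[violent_col].strip().lower()
        if flag not in totals:
            raise ValueError(f"unexpected violence label: {flag!r}")
        totals[flag] += int(row[count_col])
    return totals
```

The function takes an open text handle, so it can be called as `occurrences_by_label(open(path, newline=""))` once the column names have been adapted to the real file.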

Experimental Design, Materials and Methods
As highlighted in a previous study [5], violence detection techniques can fail due to actions and behaviours which are wrongly interpreted as violent, due to fast movements and similarity with violent behaviours. To this end, the non-violent clips were recorded specifically to challenge such techniques and help prevent false positives, even with datasets unbalanced towards the violent clips, such as the one proposed in this paper. For the clips representing violent behaviours, in addition to kicks, punches and slapping, a plastic toy gun, a plastic toy knife, and a wooden cane wrapped in bubble wrap sheets were used to simulate actions involving weapons such as gunshots, stabbing, and beating.
The clips were recorded with two cameras placed in two different spots, building a dataset with videos from two different points of view. The cameras are:
• The front camera of the Asus Zenfone Selfie ZD551KL (13 MP, Auto Focus, f/2.2).
All the clips were recorded in the same room, with natural lighting conditions. The Asus Zenfone was placed in the top left corner in front of the door, while the Action Cam was placed in the top right corner on the door side. All the performed actions and behaviours were recorded with both cameras. Therefore, all the clips with the same label and name, but in different final directories (for example "non-violent/cam1/1.mp4" and "non-violent/cam2/1.mp4"), represent the same action, recorded from two different perspectives (the "cam1" directory identifies the Asus Zenfone, while the "cam2" directory identifies the Action Cam).
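Since clips with the same label and file name show the same action from two viewpoints, the two views can be paired by file name. The following is a minimal sketch under the assumption that `root` is a local directory containing the "non-violent" and "violent" directories; `pair_views` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def pair_views(root: Path, label: str) -> List[Tuple[Path, Path]]:
    """Return (cam1, cam2) path pairs for clips sharing the same file name."""
    cam1_dir = root / label / "cam1"
    cam2_dir = root / label / "cam2"
    pairs = []
    for first in sorted(cam1_dir.glob("*.mp4")):
        second = cam2_dir / first.name
        # Keep only actions for which both viewpoints are present locally.
        if second.exists():
            pairs.append((first, second))
    return pairs
```

A possible use of such pairs is to keep both views of an action in the same partition when splitting the clips into training and test sets, so that the same performance does not appear on both sides of the split.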
In addition to the main classification of the clips into violent and non-violent, we manually annotated the actions performed in each clip. This annotation can be used for further classification experiments with violence detection techniques, to train and test algorithms capable of performing action recognition.

Fig. 1. The structure of the data repository with the 350 clips of the dataset, split into non-violent (120 clips) and violent (230 clips).
© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Data source location: Dipartimento di Ingegneria dell'Informazione, Università Politecnica delle Marche, Ancona, Italy.
Data accessibility: Public repository: GitHub (https://github.com). Repository name: A Dataset for Automatic Violence Detection in Videos. Direct URL to data: https://github.com/airtlab/A-Dataset-for-Automatic-Violence-Detection-in-Videos

Table 1
The list of the recorded actions, with the number of occurrences in the dataset.