1 Introduction

Surveillance cameras are ubiquitous in modern society, and they play a crucial role in maintaining public safety. However, the sheer volume of footage captured by these cameras can be overwhelming for human analysts to review. CCTV operators face multiple challenges in their work, such as a high camera-to-operator ratio, distractions in the workplace, and long working hours [1, 2], making it difficult to concentrate on all cameras at all times. As a result, there is a pressing need for automated tools that can not only quickly and accurately detect but also localize instances of violence in surveillance footage. Such a system would benefit from providing meaningful results that explicitly indicate the individuals involved and the groups they are part of.

Existing methods for Violence Detection (VD) often depend on training data that is staged or covers very specific situations, such as hockey fights or kickboxing [3, 4], with larger surveillance datasets emerging only in recent years [5, 6]. Additionally, the focus either lies on detection without localization, or localization is not explicitly based on persons or groups, leaving a gap between frameworks and interpretable real-world applications.

We therefore propose to incorporate subgroups into the task of VD, serving two purposes: it increases the interpretability of the outcome by indicating where violence was detected, and it increases the suitability for safety systems by enabling the tracking and analysis of subgroups for VD. Furthermore, since it is an add-on module, it can be combined with any model or system. Acknowledging the importance of both interpersonal [7] and contextual [8] information, we combine full video analysis with subgroup analysis, automatically extracting and tracking subgroups. To the best of our knowledge, this is the first work that integrates subgroups into the task of violence detection by tracking multiple subgroups across frames and analyzing them in parallel with full video analysis. Furthermore, the focus lies on the applicability of the system, which produces a highly interpretable output in real time that can be used in various safety systems.

Our contributions are as follows:

  • We propose the integration of a subgroup analysis module into real-time violence detection methods, addressing the need for increased interpretability and applicability of safety systems.

  • This subgroup module detects and localizes violent events by automatically extracting and tracking groups across frames, while slightly improving or maintaining performance on the overall task, as supported by ablation studies.

  • The system generalizes well to unseen surveillance data, further underlining its utility for safety systems working with real-world data and its potential to reduce the workload of human analysts.

The remainder of this paper is structured as follows. A brief summary of related work is provided in Section 2, followed by an overview of the methods employed in Section 3. Experiments and their results are discussed in Section 4, after which the paper is concluded in Section 5.

2 Related work

2.1 Video violence detection

The VD research field is rapidly expanding, with recent developments including the creation of specialized datasets for violence recognition from surveillance footage. In 2019, Aktı et al. [5] proposed the SCFD dataset, together with a framework using a CNN for feature extraction and a bidirectional LSTM with an attention layer for classification. Cheng et al. [6] introduced the RWF-2000 dataset and a pipeline combining RGB and optical flow data, emphasizing movement in RGB areas. Optical flow is also utilized in Ullah et al. [9] for CNN-based anomaly detection in IoT environments, whereas Islam et al. [10] work with cost-effective alternatives of optical flow. Another approach for VD on small devices is described in Vijeikis et al. [11], combining a spatial feature extractor with an LSTM temporal feature extractor. Kang et al. [12] apply 2D CNNs, merging three consecutive frames into one by averaging the RGB channels per frame, with a temporal attention module. The work of Tan and Liu [13] proposes to employ anomaly detection to find training data for supervised action recognition, thereby using both tasks iteratively. Su et al. [14] highlight the importance of hyperparameter tuning, with an efficient general action recognition CNN outperforming techniques created specifically for VD. Kwan-Loo et al. [4] track individuals across frames by finding the largest Intersection over Union of bounding boxes, combining pose information from past and current frames for classification.

2.2 Violence localization

While most studies primarily address VD, recent research has shown a growing interest in violence localization. In Roman and Chávez [15], a masking model generates motion saliency masks from dynamic images, merging salient regions near detected individuals to identify the main violent area. Mohammadi and Nazerfard [16] employ a reinforcement learning model to assess the significance of RGB and optical flow frame regions, recursively cropping the highest-scoring area and classifying the final cropped region. Asad et al. [17] use spatiotemporal attention modules and bidirectional convolutional LSTMs to learn masks for each video, creating a heatmap overlay indicating important classification regions. A similar approach is taken in the previously discussed Su et al. [14] and Kang et al. [12], where Grad-CAM [18] is employed to visualize the regions on which the classification model focuses.

2.3 Subgroup analysis for VD

Some research has explored subgroups within VD. Chang et al. [19] track individuals using multiple cameras for non-violent group actions, such as the formation of groups, grouping them per frame. VD is a separate component, for which motion features are extracted from the foreground of frames and fed to an SVM for classification. Freire-Obregón et al. [8] investigate the influence of context by tracking individuals and applying a threshold to the percentage of overlap between two people’s bounding boxes: pairs with sufficient overlap are kept, effectively grouping people together, while the rest is removed as background. Finally, in the work of Rota et al. [7], violence is both detected and localized by only considering movements happening in the space between two people. When criteria for this interpersonal movement are met, visual features are classified with an SVM.

2.4 Comparing previous work to the proposed framework

Most studies on surveillance VD do not incorporate localization, making the output less interpretable [5, 6, 9,10,11,12,13,14]. Those that do either provide heatmaps that are not guaranteed to have social meaning or to include people at all [12, 14, 17], or are unable to point out more than one violent area [16, 17]. Research involving explicit group inclusion either groups individuals per frame using multi-camera tracking [19] or focuses solely on groups or the entire video, without combining these aspects [7, 8]. Bridging these gaps, our proposed system detects and localizes violence from single-camera surveillance footage by incorporating the full video as well as cropped subgroups from that video. This enables the generation of a socially aware output, where multiple subgroups are tracked throughout the video and classified as depicting violence or not.

3 Methodology

An overview of the model is presented in Fig. 1. Each video serves as input for two streams: one for full-video violence recognition, and one for subgroup violence recognition. For the latter, location and optical flow information are extracted as features for each detected person. These features, and thus the individuals they are retrieved from, are then clustered per frame. People are tracked across frames, to go from frame-level subgroups to video-level subgroups. Each subgroup is then fed to a VD network, the output being fused with the violence prediction of the entire video. This section will discuss each of these steps in more detail, along with general design choices and the datasets employed.

Fig. 1: Overview of the proposed method

3.1 General design choices

Incorporating information from both the entire scene and cropped subgroups within a video is motivated by their individual significance in VD [7, 8], and the interpretability gained through subgroup-based analysis. Analyzing groups, instead of individuals, is chosen due to previous work showing that visual information from the interpersonal space of multiple people greatly contributes to behavior classification [7]. Furthermore, it is worth noting that analyzing solitary individuals adds little value to the analysis of individuals close together [8], since violence tends to involve multiple people. We focus on detection over prediction to avoid prediction bias, a serious issue where present inequalities are projected onto the future [12, 20]. Our subgroup detection approach is similar to that of Veltmeijer et al. [21], with the main difference being our use of motion features from videos instead of face features from images, reducing privacy concerns and potential data biases [12]. Furthermore, the present work classifies violence rather than emotion, serving practical applicability as well as reducing subjectivity.

3.2 Data

3.2.1 Datasets

The model was trained and evaluated on two datasets, the Surveillance Camera Fight Dataset (SCFD) [5] and the Real-World Fighting (RWF-2000) dataset [6], two of the main violence datasets that consist of general surveillance footage only [22]. Both contain real-world fights, indoors as well as outdoors, as recorded by surveillance cameras. They are collected from YouTube (SCFD, RWF-2000) and, to a lesser extent, from existing datasets (SCFD). SCFD contains 300 surveillance video clips, 150 of which contain violence and 150 of which do not. Each clip is 2 seconds long. RWF-2000 contains 2000 surveillance video clips, 1000 violent and 1000 non-violent, each lasting 5 seconds. An example of both datasets is given in Figs. 2 and 3. It is worth noting that many of the violent clips in these datasets have non-violent counterparts, meaning that the period of time before and after a fight is part of the non-violent class. This reduces possible biases in the dataset, since the same environment is present in both classes. Since both datasets mainly contain physical fights, the words ‘violence’ and ‘fight’ will be used interchangeably in the remainder of this work.

Fig. 2: Example frames from video in SCFD dataset [5]

3.2.2 Data annotation

We extend the existing dataset annotations by assigning labels to subgroups from videos in the fight classes. Since the non-fight videos are already labeled as such, we assume that neither these videos nor their subgroups contain fights. For the fight videos, on the other hand, individual subgroups are not guaranteed to contain fights. A label of fight or non-fight is assigned to each individual subgroup detected within such a video. For annotating, violence is defined as an intentional action by individuals or groups to cause harm to the environment or others through physical contact. The SCFD subgroups are annotated by two different annotators. From this, an inter-rater reliability (IRR) is calculated, giving a Cohen’s kappa value of \(\kappa \)=0.90 [23]. This indicates an ‘almost perfect’ [24] or ‘strong’ [25] agreement, allowing the remainder of the data to be annotated by a single annotator. This results in 94 (SCFD) and 481 (RWF-2000) subgroups being labeled as violent, versus 205 (SCFD) and 594 (RWF-2000) labeled as non-violent.
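
For illustration, the IRR computation can be reproduced with scikit-learn’s cohen_kappa_score; the label encoding and the example annotation lists below are hypothetical, as the annotation files themselves are not part of this paper.

```python
# Minimal sketch of the inter-rater reliability check (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

# Example subgroup labels from two annotators: 1 = fight, 0 = non-fight.
annotator_a = [1, 0, 0, 1, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 1, 0, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```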

3.2.3 Data preprocessing

Since the classification network requires a square input, data is resized in two different ways: center cropping and zero padding, both combined with resizing the resulting square frame to 160\(\times \)160. For the full videos, all training data is augmented in both ways, resulting in two square videos per original video. SCFD results are reported on test data that is cropped, RWF-2000 results on test data that is padded, and generalization results on both combined. Subgroup videos from both datasets are padded only, so as not to further crop the already cropped subgroup. The only other data augmentation applied is flipping, which is done for the subgroup videos (SCFD) and for the full videos and subgroup videos (RWF-2000) of the training sets.
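
A minimal per-frame sketch of the two resizing strategies, assuming OpenCV frames as NumPy arrays; the padding placement (centered) and interpolation settings are assumptions.

```python
import cv2
import numpy as np

TARGET = 160  # square input size required by the classification network

def center_crop_resize(frame: np.ndarray) -> np.ndarray:
    """Crop the largest centered square, then resize to TARGET x TARGET."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = frame[top:top + side, left:left + side]
    return cv2.resize(square, (TARGET, TARGET))

def zero_pad_resize(frame: np.ndarray) -> np.ndarray:
    """Zero-pad the shorter side to a square, then resize to TARGET x TARGET."""
    h, w = frame.shape[:2]
    side = max(h, w)
    padded = np.zeros((side, side, frame.shape[2]), dtype=frame.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    padded[top:top + h, left:left + w] = frame
    return cv2.resize(padded, (TARGET, TARGET))
```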

3.3 Frame-level subgroup formation

3.3.1 Feature extraction

For each video, we extract frames and dense optical flow information (pixel-wise angle and magnitude) using OpenCV [26] and sample eight frames per video. Pose estimation is performed on these sampled frames using AlphaPose [27], a recognized framework for precise and robust keypoint detection [28, 29]. Frames are retained if they contain detected individuals with assigned poses.

We use the center coordinates of the bounding boxes inferred by the pose estimation as location features, following the methodology of Veltmeijer et al. [21]. We aim to extract meaningful motion features, where individuals moving in a similar direction have more similar motion features. To achieve this, we calculate the average gradient over all pixels within each individual’s bounding box in the optical flow output and use it as the motion feature.
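
A minimal sketch of the motion feature computation, assuming grayscale frames and integer pixel bounding boxes. Farneback optical flow is used here as a stand-in for whichever dense optical flow implementation the OpenCV-based pipeline uses, and the ‘gradient’ is taken to be the flow angle, in line with Section 3.3.2.

```python
import cv2
import numpy as np

def motion_feature(prev_gray: np.ndarray, gray: np.ndarray, bbox: tuple) -> float:
    """Average optical-flow angle (degrees) over the pixels in a person's bounding box."""
    # Dense optical flow between two consecutive grayscale frames (Farneback).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    _, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1], angleInDegrees=True)
    x1, y1, x2, y2 = bbox  # bounding box from the pose estimation, in pixels
    return float(np.mean(angle[y1:y2, x1:x2]))
```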

3.3.2 Clustering

Following the procedure and implementation of Veltmeijer et al. [21], we employ hierarchical clustering for forming meaningful subgroups from the individuals in each frame. The coordinates of each person together with the movement information form the elements of the feature vector to be used for clustering. The feature vector of each individual consists of three elements: [\(x\), \(y\), gradient]. The center coordinates x and y are normalized by dividing them by the image size in their respective dimension, and the gradient value is converted to range from 0 to 360 degrees. The optimal number of clusters for each frame is determined from the resulting dendrogram in an automated fashion.
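
As an illustration, a minimal sketch of the per-frame clustering step, assuming the feature vectors are already assembled as described. The gap-based selection of the number of clusters is a stand-in for the dendrogram-based procedure of Veltmeijer et al. [21], which is not detailed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_frame(features: np.ndarray) -> np.ndarray:
    """Hierarchically cluster per-frame feature vectors [x, y, gradient].

    x and y are assumed to be normalized to [0, 1]; gradient is in degrees.
    The number of clusters is chosen at the largest gap between successive
    merge distances, as a stand-in for the dendrogram-based selection.
    """
    if len(features) < 2:
        return np.ones(len(features), dtype=int)
    Z = linkage(features, method="ward")
    merge_dists = Z[:, 2]
    gaps = np.diff(merge_dists)
    # Cutting the dendrogram just before the largest gap yields k clusters.
    k = len(features) - (np.argmax(gaps) + 1) if len(gaps) else 1
    return fcluster(Z, t=max(k, 1), criterion="maxclust")
```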

For each cluster, we save a bounding box encompassing all individuals within that cluster, based on their original pose estimation bounding boxes. To ensure accuracy for training, a bounding box should only contain individuals assigned to that specific subgroup. However, given the real-world application in which the resulting clusters are observed visually, it may be possible for the bounding box of a cluster to include individuals not part of that cluster as well. Therefore, if one subgroup is interrupted by another, we split the interrupted subgroup to eliminate the interruption, focusing on interruptions along the horizontal axis (x-values). Specifically, cluster A is split up when a cluster B exists with:

$$\begin{aligned} B_x \subset A_x \end{aligned}$$
(1)

In this case, cluster \(A\) is split up into smaller clusters recursively until none of the clusters derived from \(A\) is a superset of any other cluster in terms of its x-values. After the initial merging of individuals and clusters through hierarchical clustering, followed by this cluster splitting procedure, we are left with meaningful subgroups that maintain their integrity when observed through their bounding boxes.
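
A minimal sketch of this splitting rule; the cluster representation (lists of center coordinates) and the choice of split point (the horizontal midpoint of the interrupting cluster) are assumptions, since the paper only specifies condition (1).

```python
def x_extent(members):
    """Horizontal extent (min x, max x) of a cluster's member centers."""
    xs = [p[0] for p in members]
    return min(xs), max(xs)

def split_interrupted(groups):
    """Recursively split clusters whose x-extent strictly contains another cluster's x-extent.

    `groups` is a list of clusters, each a list of (x, y) person centers.
    """
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(groups):
            if len(a) < 2:
                continue
            a_min, a_max = x_extent(a)
            for b in groups:
                if b is a:
                    continue
                b_min, b_max = x_extent(b)
                if a_min < b_min and b_max < a_max:      # condition (1): B_x strictly inside A_x
                    mid = (b_min + b_max) / 2
                    left = [p for p in a if p[0] <= mid]
                    right = [p for p in a if p[0] > mid]
                    if left and right:
                        groups[i:i + 1] = [left, right]  # replace A with its two halves
                        changed = True
                        break
            if changed:
                break
    return groups
```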

Fig. 3: Example frames from video in RWF-2000 dataset [6]

3.4 Video-level subgroup formation

So far, we discussed person detection and subgroup formation at frame level. The next step is to track individuals throughout the video and find video-level subgroups. Establishing video-level subgroups based on the clustered subgroups per frame smooths the frame-level predictions. While AlphaPose [27] offers tracking options, we find them to be unstable for our purpose due to temporal consistency issues. Tracking methods often involve calculating the smallest distance in a person’s location between two frames [4, 19]. However, in scenes with clutter or fast movements, often encountered in fight scenes, this can result in frequent mismatches. Consequently, tracking and analyzing movement in violent scenarios remain challenging, with existing algorithms frequently falling short [7, 22].

Instead, in this work individuals are tracked by predicting their future location and finding the closest match, if possible, to that predicted location. For location, we use the center coordinates of an individual’s bounding box, rather than the (size of the) bounding box itself. This is done to increase stability, as center coordinates are found to be more robust to changes than bounding boxes. An example of this is a person spreading their arms, possibly to punch someone, causing a relatively big change in their bounding box but a smaller change in their center coordinates. For all individuals with center location \(C_n=(x_n,y_n)\) in frame n, the optical flow gradient and magnitude are used to calculate the predicted center location \(P_{n+1}=(\hat{x}_{n+1},\hat{y}_{n+1})\) of that individual in frame \(n+1\). A distance matrix \(M_{\text {dist}}\) contains the Euclidean distance between the predicted center location \(P_{n+1}\) of individuals detected in frame n and the actual center location \(C_{n+1}=(x_{n+1},y_{n+1})\) of individuals detected in frame \(n+1\). The distance matrix is then used for matching individuals between frame n and frame \(n+1\), by solving it as a linear sum assignment problem, implemented with SciPy [30, 31]. It is possible that an individual in frame n is not detected or present in frame \(n+1\), or vice versa. To account for this, the distance matrix is padded with a threshold value, to ensure that individuals that have no plausible match are matched to a padded value instead. This threshold value is set to 40, which is empirically found to generally distinguish between the same person moving and two people close together moving. When no match is found for an individual in frame \(n+1\), a match is sought with a non-matched individual in frame \(n-1\). In this case, the threshold is set to 50, meaning that a match can only be made if the Euclidean distance is below 50. If an individual is not matched for two consecutive frames, the tracking of that individual stops. In the end, only tracked individuals that have been identified in at least four of the eight frames are considered further.
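
A minimal sketch of the matching step, assuming predicted and detected centers are available as NumPy arrays; the exact padding construction is one possible reading of the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MATCH_THRESHOLD = 40  # padding value; 50 is used when re-matching with frame n-1

def match_people(predicted, detected, threshold=MATCH_THRESHOLD):
    """Match predicted centers (from frame n) to detected centers (in frame n+1).

    `predicted` and `detected` are float arrays of shape (k, 2) and (m, 2).
    The cost matrix is padded with the threshold so that a person whose nearest
    candidate is farther away than the threshold is assigned to a pad instead.
    Returns (index in frame n, index in frame n+1) pairs of accepted matches.
    """
    k, m = len(predicted), len(detected)
    cost = np.full((k + m, k + m), float(threshold))
    if k and m:
        diff = predicted[:, None, :] - detected[None, :, :]
        cost[:k, :m] = np.linalg.norm(diff, axis=-1)  # Euclidean distances
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols)
            if r < k and c < m and cost[r, c] < threshold]
```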

After tracking and grouping individuals per frame, we use pairwise majority voting to infer video-level subgroups, in a way similar to Veltmeijer et al. [21]. Instead of finding consensus across multiple subgroup annotations for the same image, we propose to reach subgroup consensus over multiple frames. Following their approach, we consider a pair of individuals to be in the same video-level subgroup if they are part of the same frame-level subgroups for a majority of frames. This is visualized in Fig. 4.
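
A minimal sketch of this voting step, assuming frame-level subgroups are available as sets of tracked person ids; the final merging of all majority pairs via union-find is our reading of the consensus procedure.

```python
from collections import defaultdict
from itertools import combinations

def video_level_subgroups(frame_groups, n_frames):
    """Pairwise majority voting over frame-level subgroups.

    `frame_groups` is a list (one entry per frame) of lists of person-id sets,
    e.g. [[{1, 2}, {3}], [{1, 2}, {3}, {4}], ...]. Two people end up in the
    same video-level subgroup if they share a frame-level subgroup in more
    than half of the frames.
    """
    together = defaultdict(int)
    for groups in frame_groups:
        for group in groups:
            for a, b in combinations(sorted(group), 2):
                together[(a, b)] += 1

    # Union-find-style merge of all pairs that pass the majority vote.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in together.items():
        if count > n_frames / 2:
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for person in {p for groups in frame_groups for s in groups for p in s}:
        merged[find(person)].add(person)
    return list(merged.values())
```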

Fig. 4: Approach for merging subgroups across frames. Person 1 and 2 are in the same subgroup for a majority of frames, and are therefore put together in the same video-level subgroup. Person 3 and 4 are on their own for a majority of frames, the latter not being detected in frame 3

3.5 Network training

For the detection of violence in both the full videos and the subgroup videos, we employ the X3D network [32], an efficient network for video recognition, as adjusted and provided by Su et al. [14], Su [33] and pretrained on the Kinetics dataset [34]. The main motivation for choosing X3D is its lightweight architecture. When compared to high-performing models for action recognition, such as transformers [35], X3D shows training latencies much lower than its transformer-based counterparts [36]. Since the goal of this study is to construct an efficient framework working in real time rather than reaching top accuracies, we follow Su et al. [14] in their choice of network. For the same reason, rather than performing an extensive hyperparameter selection, we keep hyperparameter settings as they are to the extent possible. For training, we change clip_len to 8 and set the sampling stride to 1 to correspond to the number of frames extracted from each video (8) and the resulting frame rate (1). The number of training epochs is changed from 1 to 10, saving the model after the epoch in which it performs best on the test set, to allow for optimal training. The learning rate is kept at \(1 \times 10^{-4}\). The network is trained separately for detecting violence in the full videos and for detecting violence in the subgroup videos, keeping everything but the input the same. Preprocessing of the input differs slightly for the full videos and the subgroup videos, as discussed in Section 3.2.
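
For reference, a hedged summary of the training settings listed above, collected in one place; the key names are ours and are not guaranteed to match the configuration fields of the X3D code of Su et al. [14].

```python
# Illustrative summary of the training settings described above; key names
# are ours and may differ from the actual configuration of the X3D codebase.
train_config = {
    "clip_len": 8,          # frames sampled per video
    "sampling_stride": 1,   # stride between sampled frames
    "epochs": 10,           # keep the checkpoint that performs best
    "learning_rate": 1e-4,  # unchanged from the original implementation
    "input_size": 160,      # square frame size after cropping/padding
}
```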

For SCFD, we perform 5-fold cross-validation, using the same fold division as Su et al. [14] to ensure a fair comparison. For this division, videos were randomly divided into the folds while ensuring a fair balance between fight and non-fight clips, and assigning all videos originating from the same original (longer) video to the same fold to avoid data leakage. For RWF-2000, we use the training and test sets as originally provided by Cheng et al. [6], which also prevent data leakage by assigning all clips from the same original video to the same fold.

3.6 Prediction fusion

The prediction resulting from the model trained on the full videos serves as a baseline, and as the foundation on which the trained subgroup model should build. Since not all people are recognized or tracked successfully, the absence of violent subgroups is not a reliable indicator that the original video does not depict any violence. The opposite, however, does hold: once one of the subgroups from a video is classified as showing violence, we hypothesize that this is a strong indicator that the full video depicts violence [16]. A non-fight prediction for a video, \(y_{\text {video}}=0\), can therefore change to a fight prediction, \(y_{\text {video}}=1\), if the probability \(p_{\text {subgr}}\) of at least one of the subgroups from that video showing violence is 0.8 or higher. This means that

$$\begin{aligned} y_{\text {video}} = {\left\{ \begin{array}{ll} 1, &{} \text {if } p_{\text {video}} \ge 0.5\ \text { or }\ \text {maxProb}(p_{\text {subgr}}) \ge 0.8 \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)
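
A direct reading of (2) as code; the function name and input types are illustrative.

```python
def fuse_predictions(p_video: float, p_subgroups: list) -> int:
    """Late fusion of the full-video and subgroup predictions, following (2).

    Returns 1 (fight) if the full-video probability reaches 0.5 or any
    subgroup probability reaches 0.8, and 0 (non-fight) otherwise.
    """
    max_subgroup = max(p_subgroups, default=0.0)
    return int(p_video >= 0.5 or max_subgroup >= 0.8)
```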

4 Experiments

4.1 SCFD

This section will describe and discuss the performed experiments and their results on SCFD, the main dataset under analysis on which most experiments are performed. For the baseline, violence detection is performed based on the full video analysis only, results of which are given in the first row of Table 1. These values show that the network is already performing quite well on the task of VD in the SCFD dataset, which is similar to the results and findings presented in Su et al. [14].

Next, full video predictions are combined with subgroup predictions in the way described in Section 3.6, as shown in the bottom row of Table 1. These results show that adding subgroups to the full video analysis improves performance for three of the folds and performs on par with the full video only for the other two folds. When considering the violent class only, the performance increase is larger, but this is tempered by a rise in false positives that lowers performance on the non-violent class. Output of the system is visualized in Fig. 5, depicting violent subgroups with red bounding boxes.

Furthermore, two ablation experiments are performed. For the first of these, the full video component is removed, analyzing only subgroup videos. Results are presented in the second row of Table 1. Note that these performances cannot be directly compared to those of the full model, since a subgroup originating from a ‘fight’ video might be part of the ‘non-fight’ subgroup set when none of the individuals in that subgroup are displaying violent behavior.

To validate that any performance gain obtained from combining the full video with subgroup prediction is a result of the subgroups themselves, rather than simply of analyzing separate crops of the video, we perform a second ablation experiment. For this experiment, we replace each subgroup video with a randomly cropped video of the original video, with the same size as the original subgroup video. The best-performing trained subgroup model is then employed to evaluate these random rectangle videos, and the results are combined with the full video predictions as described in Section 3.6. The fold-wise results are included in Table 1, showing that the outcome is the same as for the baseline model. This makes clear that even though the overall increase in accuracy for the proposed model is quite small, it is consistent throughout different folds and can be attributed to the nature of the subgroups, making them a meaningful addition.
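
A minimal sketch of how such a random crop could be sampled; the uniform sampling of the crop location is an assumption, as the paper does not specify the sampling scheme.

```python
import random

def random_crop_box(frame_w: int, frame_h: int, crop_w: int, crop_h: int):
    """Pick a random crop of the same size as a subgroup's bounding box.

    Returns (x1, y1, x2, y2) in pixels, sampled uniformly within the frame.
    """
    x = random.randint(0, max(frame_w - crop_w, 0))
    y = random.randint(0, max(frame_h - crop_h, 0))
    return x, y, x + crop_w, y + crop_h
```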

Table 1 Weighted accuracy per fold and average weighted accuracy and F1-score of the baseline model (full video), ablation models (random rectangle model, subgroups only model), and proposed model (full video and subgroups)

4.2 RWF-2000

This section will describe and discuss the experiments on RWF-2000. To validate the proposed system, it is also trained and tested on the RWF-2000 surveillance dataset. Results are presented in Table 2, indicating a minuscule difference in performance between the baseline and the proposed model, with the addition of subgroups leading to a performance loss of 0.3%. The subgroup component is also tested by itself as an ablation experiment, showing a performance loss of 12.1% compared to the proposed model. Since the original dataset comes with a fixed training and testing split, no cross-validation is performed, to enable valid comparisons to other frameworks. It should be noted that the proposed architecture was built for SCFD, the main dataset used in this work. This could explain why the performance gain found for SCFD does not carry over when training and testing on RWF-2000. Moreover, the RWF-2000 dataset contains longer videos than SCFD (5 seconds vs. 2 seconds) while the same number of frames is extracted, effectively reducing the frame rate. RWF-2000 is generally considered to be more challenging, not only because of the highly variable resolution of the surveillance footage, which also holds for SCFD, but also because some videos are modified versions rather than raw surveillance footage [11]. Furthermore, while annotating the subgroups, the researchers noticed that the video-level annotations are not fully correct: all clips originating from the same video have been given the same video-level annotation of violence, while clips from the start or end of the original video do not show violence themselves. No video-level annotations were changed, for the sake of comparison, but future work could analyze this to further improve this important and relevant dataset.

Fig. 5: Output of the proposed method, violent subgroups are indicated with red bounding boxes

Table 2 Weighted accuracies and F1-scores of the baseline model, ablation model, and proposed model on the RWF-2000 dataset

4.3 Benchmark results

While the main goal of this study is to qualitatively improve the task of VD by increasing its interpretability and enabling easy integration into existing frameworks, it is still important to consider the model’s quantitative performance as well. Performance of our proposed combined model is therefore compared to the baseline and to other benchmark results in Table 3. It can be seen that the baseline model, similar to the implementation of Su et al. [14], already scores relatively high. For both datasets, multiple clips originate from the same original video, meaning that a random split is likely to place clips from the same video in both the training and test set. This is called data leakage, and it causes the model to overfit, thereby giving a distorted view of the actual performance of the model. It should be noted that for SCFD, Su et al. [14] are the first and, to the best of our knowledge, only ones to have explicitly mentioned this, splitting the dataset into five balanced folds while actively preventing data leakage. We have used their division into folds. This means that for none of the other works in Table 3 reporting on SCFD can we state with certainty that there was no data leakage; therefore, our work can only be fairly compared to that of Su et al. [14]. This is indicated in Table 3 by placing comparable works under the horizontal line. We encourage researchers to use the same fold division and to share more information on the data splits used, for fair comparison as well as data leakage prevention. RWF-2000 was published with a predefined training and test set. The papers that mention using this split, as was done in this study, are placed below the row "Transfer model proposed model (ours)" in Table 3.

Table 3 Accuracy of previously published models on violence detection, as reported in their respective papers, compared to our baseline and proposed subgroup model

4.4 Generalizability to unseen datasets

For a VD framework to be applicable in real-life scenarios, it is important that it is robust to highly variable inputs. Specifically, this means that it should be able to generalize to unseen surveillance data. To test the generalizability of the proposed framework, we perform two experiments: training on SCFD data and testing on RWF-2000 data, and vice versa. Results are presented in the two rows just above the line in Table 3, showing that the model generalizes exceptionally well. Trained on (all of) RWF-2000 and tested on (all of) SCFD, accuracy even increases by 2% when adding the subgroup module, to a decent score of 88.3%. The model that is trained on (all of) SCFD and tested on (all of) RWF-2000 shows a pattern similar to that of the model both trained and tested on RWF-2000, in that the addition of the subgroup module leads to a slight drop in performance. However, it is noteworthy that even when trained on the relatively small SCFD, performance on RWF-2000 still reaches an accuracy of around 83%. It should be noted that our transfer models do not fully meet the aforementioned criterion of data leakage prevention: they are trained on one entire dataset and tested on the other entire dataset. They are therefore placed above the horizontal line.

4.5 Speed analysis

To underline the practical utility of the proposed model, the processing time for each component of the system is measured. Tests are run on an Nvidia GeForce RTX 4070 Ti GPU paired with an Intel Core i9-13900F processor featuring 32 cores. In the current setup, no more than 8 of these cores are used concurrently, processing the 8 frames of a video in parallel for non-inference tasks. This leaves room for further optimization or for increasing the frame rate, allowing the framework to be customized to the user’s requirements when embedded into existing infrastructures.

Results are shown in Table 4. Assuming simultaneous inference on the full video and the subgroup videos, as indicated by the two streams in Fig. 1, the total processing time is 0.561 seconds. Performing both analyses one after the other, adding their individual processing times together, results in a total processing time of 0.738 seconds. Given the original length of the videos, 2 seconds (SCFD) or 5 seconds (RWF-2000), our proposed framework allows for real-time processing. This further underscores our system’s suitability for easy integration into established safety systems.

Table 4 Processing time per component in seconds

5 Conclusion

In this paper, an innovative approach for incorporating subgroup analysis into VD was proposed. While there is a vast body of research on VD, little work has been done on socially meaningful and interpretable VD in surveillance footage. Our adaptable add-on module automatically extracts and tracks multiple subgroups across frames for violence detection and localization in safety systems. As such, it can help bridge the gap between theory and practice, by alerting when violence is detected and indicating which group or groups of people are involved. This is considered to be the main contribution of this work: performing real-time VD in an interpretable and generalizable way, with a system that can be easily integrated into existing frameworks or systems.

We trained an efficient network, X3D, and performed experiments on two of the main violence surveillance datasets: SCFD and RWF-2000. Results indicate that our subgroup module consistently increases or maintains performance on SCFD, by +1.3% on average, while leading to a small decrease in performance on RWF-2000 (-0.3%), requiring as few as eight frames per video. This is supported by ablation experiments, showing the contribution of each component to the final performance. Furthermore, results indicate that the proposed model generalizes well to unseen datasets. All of this is done in real time, as illustrated by a speed analysis of the individual components.

In the proposed system, subgroup analysis can only change a non-violent label to a violent one, increasing true positives and false positives for the violence class. In practical use cases, we recommend tolerating false positives over false negatives, since having to rewatch a fragment has fewer consequences than missing a violent event. Models trained on either dataset demonstrate strong generalizability to the other, promising broader applicability. Figure 5 illustrates model outputs, showing meaningful and actionable information for CCTV operators. These results suggest the framework’s potential to enhance public safety and reduce the workload of human analysts.

A main limitation is that subgroup formation, and therefore system performance, relies heavily on the initial pose estimation. Individuals without assigned keypoints will not be part of any subgroup and thus not influence the final prediction. This poses challenges in VD, as fighting individuals are often blurred, occluded, or in less detectable positions. We observe that missed keypoints are typically associated with individuals engaged in fights, who are crucial for violence detection. This motivated the decision to only allow the module to change a non-fight label to a fight label, as the absence of detected fighting subgroups may indicate missed individuals rather than their absence from the entire video. Improving person detection could pave the way for a more refined subgroup analysis influencing the overall prediction.

While the current work focused on building the subgroup module and its influence on the baseline, reaching state-of-the-art performance was not a main motive. Future work should experiment with adding this module to other frameworks, including those outperforming the models described here. Another direction for future work is to include temporal attention for detecting which frames of a video are most important for the final classification. While our system could readily be extended to this by feeding it multiple blocks of eight consecutive frames from the same video, integrating it into the framework would further enhance its practical utility.