Threshold Active Learning Approach for Physical Violence Detection on Images Obtained from Video (Frame-Level) Using Pre-Trained Deep Learning Neural Network Models

Abstract: Public authorities and private companies use video cameras as part of surveillance systems, and one of their objectives is the rapid detection of physically violent actions. This task is usually performed by human visual inspection, which is labor-intensive. For this reason, different deep learning models have been implemented to remove the human eye from this task, yielding positive results. One of the main problems in detecting physical violence in videos is the variety of possible scenarios: models are trained on specific datasets, so they detect physical violence in only one or a few types of videos. In this work, we present an approach for physical violence detection on images obtained from video based on threshold active learning, which increases the classifier's robustness in environments where it was not trained. The proposed approach consists of two stages. In the first stage, pre-trained neural network models are trained on initial datasets, and we use a threshold (µ) to identify those images that the classifier considers ambiguous or hard to classify. These images are then included in the training dataset, and the model is retrained to improve its classification performance. In the second stage, we test the model with video images from other environments, and we again employ µ to detect ambiguous images, which a human expert analyzes to determine their real class and resolve the ambiguity. The ambiguous images are then added to the original training set, and the classifier is retrained; this process is repeated while ambiguous images exist. The model is a hybrid neural network that uses transfer learning and the threshold µ to successfully detect physical violence in images obtained from video files.
Through this active learning process, the classifier can detect physical violence in different environments. The main contribution is the method used to obtain the threshold µ (based on the neural network outputs), which allows human experts to contribute to the classification process, yielding more robust neural networks and high-quality datasets. The experimental results show the proposed approach's effectiveness in detecting physical violence: it is trained using an initial dataset, and new images are added to improve its robustness in diverse environments.


Introduction
In recent years, the automatic detection of physical violence has been a challenge in human activity recognition because it involves a continuous analysis of human behavior [1]. Frame-level analysis of video has been widely studied for detecting physical violence. It is a process in which each individual frame of a video is examined to identify signs of violent or aggressive behavior. This analysis involves looking for specific visual indicators, such as sudden movements, punching gestures, physical contact between people, or other patterns that may suggest a violent action. By performing this on a frame-by-frame basis, the system attempts to detect and categorize instances of violence as they occur over time, allowing for a detailed and granular understanding of the activity in the video without extracting temporal features [2].
Since the early 2000s, frame-level analysis of video has demonstrated that it is possible to detect violence in images using only spatial feature extraction, with techniques such as LBP [3], HOG [4], FAST [5], SURF [6], SIFT [7], BRISK [8], and ORB [9], because of their ability to capture robust and distinctive spatial features that are invariant to common transformations such as scale, rotation, and illumination. These techniques identify key points in images that reflect crucial local details, such as edges, corners, and textures, which are essential for recognizing complex visual patterns associated with violence.
The bag-of-words (BoW) technique, which works at the frame level of a video, has been used to compactly and robustly represent the extracted local features, allowing capture of the distribution of local visual patterns that characterize physical violence without the need for exact spatial arrangement. Wang [10] used BoW to identify violence in a dataset containing 500 images with violent scenes and 1500 images without violence. When applying BoW to SIFT features, they achieved an accuracy of 85.7% ± 1.4%. When using HOG features, the accuracy was 84.3% ± 1.6%. In contrast, with LBP features, they achieved an accuracy of 90.1% ± 1.5%, demonstrating that combining robust spatial features with an effective BoW representation enables accurate detection of physical violence in images.
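To make the BoW pipeline concrete, the following sketch builds a small visual vocabulary with a naive k-means and represents one frame as a normalized histogram of visual words. It is a minimal illustration, not the cited implementation: synthetic 32-D descriptors stand in for SIFT/HOG/LBP features, and the vocabulary size of 16 is arbitrary.

```python
import numpy as np

# Minimal bag-of-words (BoW) sketch over local image descriptors.
# Real systems would use SIFT/HOG/LBP descriptors; here we draw
# synthetic 32-D descriptors so the example is self-contained.
rng = np.random.default_rng(0)

def build_codebook(descriptors, k, iters=10):
    """Naive k-means to build a visual vocabulary of k 'visual words'."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Represent an image as a normalized histogram of visual-word counts."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

train_desc = rng.normal(size=(500, 32))  # pooled descriptors from training frames
codebook = build_codebook(train_desc, k=16)
image_desc = rng.normal(size=(40, 32))   # descriptors of one query frame
h = bow_histogram(image_desc, codebook)  # fixed-length feature for a classifier
```

The resulting histogram `h` is the fixed-length vector that a classifier such as an SVM would consume, regardless of how many key points the frame contains.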
However, despite the success achieved by frame-level analysis, detecting physical violence in videos is still a significant challenge. Recently, it has been addressed using deep learning (DL) techniques, mainly artificial neural networks (ANNs). This approach aims to enhance the efficiency of classifying physical violence scenes extracted from various video sources, such as closed-circuit television (CCTV) systems, smartphones, and other digital cameras used for surveillance tasks. Consequently, developing such methodologies has expanded the available tools, reducing the human effort required to identify physical violence situations accurately. Although the current literature and state-of-the-art research offer various works focused on achieving high accuracy rates in violence detection within videos, typically ranging between 90% and 99% (see Ref. [11]), it is worth noting that many of these models are tailored to specific datasets and work only in those particular scenarios.
Strides in violence detection can be attributed to the use of pre-trained neural networks and the emergence of hybrid models. One prevalent approach employs these pre-trained models for spatial feature extraction, followed by a secondary network responsible for classification. Such methodologies have demonstrated precision percentages reaching up to 96% [16]. Another noteworthy implementation integrates ConvLSTM for temporal feature extraction and 3D ResNet50, 3D ResNet101, and 3D ResNet152 architectures for spatial feature extraction in conjunction with the UCF-Crime dataset, achieving a precision rate of 96% [17]. Recent advancements include a convolutional block attention module (CBAM) to discern the dynamics among the individuals to be detected [18]. Other methods explored for violence detection involve feature extraction through optical flow, as demonstrated in Ref. [19]. Rendón-Segador et al. [20] employ optical flow as input, preceding the utilization of a spatial-temporal codifier with DenseNet121 and a bidirectional 2D convolutional LSTM layer (BiConvLSTM2D), resulting in an accuracy of 99%. Similarly, Wang [21] experimented on four public violence detection datasets: Movies, Hockey, Crowd, and RWF-2000. The authors adopt a comparable approach, extracting two optical flows, concatenating them, and feeding them into the classifier, achieving precision rates of 0.86 and 1.0 for the Movies and RWF-2000 datasets, respectively.
Furthermore, several studies have addressed the violence detection challenge through a multi-class approach, utilizing the UCF-Crime dataset to distinguish more precisely between various types of violence [22]. Similarly, Vosta [23] examines four variants of the same dataset (UCF-Crime): Binary, where the 13 abnormal events are grouped as one anomalous class; the original UCF-Crime dataset containing 14 classes; 4MajorCat; and NREF. Likewise, Yousaf [24] experimented with cartoon animated clips to classify violence into three classes: safe, fantasy violence, and sexual nudity.
The utilization of data sourced from a variety of channels, including online social networks, public datasets, and video surveillance cameras (with or without sound), has emerged as a trend aimed at cultivating more robust models across diverse environments [2,16,25]. Nevertheless, as new videos in different scenarios are generated daily, relying solely on robust yet static models trained with diverse datasets remains insufficient.
The unprecedented volume of available data presents a significant opportunity for developing deep learning models; however, it also poses challenges such as the scalability of traditional models, usability, and adaptability [26]. Moreover, a critical challenge associated with this abundance of data is converting raw data into high-quality labeled data, which is essential for enhancing the accuracy of predictive models. It is crucial to recognize that constructing a training set entails a potentially computationally intensive process in terms of both time and resources [27].
Active learning has emerged as a collaborative strategy that involves human participation alongside machine learning algorithms to iteratively construct predictive models and training sets. This iterative approach mitigates the costs of acquiring labeled data while concurrently enhancing the prediction accuracy of machine learning models [28]. Moreover, active learning facilitates the incorporation of new information into the datasets utilized by automatic violence detection models. Although active learning has been studied for several years, there have been relatively few studies, even though it is a very important topic in machine learning, especially now that abundant data exist but high-quality datasets are scarce [29]. Today, it remains a focus of study in the academic research community (for example, see Refs. [30-32]) because of its utility in building robust machine learning models and obtaining high-quality labeled datasets.
The paradigm of human-machine interaction in which human involvement spans various stages, such as the training, optimization, or evaluation of machine learning models, is referred to as "human in the loop" or "human in the circuit". The aim is to harness human cognitive abilities and expertise to augment the performance of machine learning models. This hybrid approach makes human participation in data labeling feasible: an algorithm selects the class-unlabeled data deemed most pertinent for human expert labeling [33]. Subsequently, these labeled data are incorporated into the training dataset, enabling the training of models with diverse data to foster robustness across different environments.
This paper introduces a threshold active learning approach to construct robust models for physical violence detection on images obtained from video across diverse environments. Our objective is to effectively identify instances of physical violence within video images depicting diverse environments sourced from heterogeneous datasets.
The proposed method offers two key advantages: firstly, it facilitates the expansion of the training dataset by incorporating labeled data deemed relevant by human experts, thereby enhancing model performance. Secondly, it makes the process of building new datasets from varied environments more efficient, thereby reducing the associated effort. Notably, our approach is characterized by its simplicity and versatility, making it adaptable to a wide range of sophisticated deep learning models.
The main contribution of the proposed approach is the process used to obtain the threshold µ (a process based on the neural network outputs), which allows human experts to contribute to the classification process to obtain more robust neural networks and high-quality datasets.

Related Works
The challenge inherent in any image identification system is striving for 100% efficiency, a goal often elusive due to intrinsic complexities. This is particularly true in endeavors such as identifying physical violence, where false negatives are regrettably common. To address this challenge, our research adopts an active learning approach, combining the collaborative efforts of human experts and machine learning algorithms. Active learning serves as a strategic framework aimed at refining classifiers to generalize well on instances not encountered within the problem domain [33].
Active learning finds application in various domains, including image classification and object detection. Li and Guo [34] propose a method that employs a measure of uncertainty, coupled with information density and an adaptive information framework, to select instances from unlabeled sets based on their informativeness. By combining measures of uncertainty with adaptive information density, this approach effectively guides the selection of instances for labeling, thereby enriching the labeled dataset.
Beluch et al. [35] investigate and compare different active learning methods in the context of image classification using high-dimensional data and CNNs. The paper evaluates ensemble-based methods against Monte Carlo dropout and geometric approaches, finding that ensembles perform better and provide more calibrated predictive uncertainties, which are crucial for active learning algorithms. The study includes experiments on various datasets, including MNIST, CIFAR-10, and ImageNet, as well as a large, class-imbalanced diabetic retinopathy dataset. The results demonstrate that ensemble-based active learning is particularly effective in managing class imbalance during the acquisition phase.
In [36], Sener and Savarese introduce a core-set selection approach for active learning in CNNs, providing a theoretical framework to evaluate subset performance and demonstrating significant improvements in image classification tasks over existing methods. This method involves choosing a subset of data points such that a model trained on this subset performs competitively on the remaining data.
Pan et al. [37] build on the principle that if two deep neural networks (DNNs) of the same structure, trained on the same dataset, produce significantly different results for a given sample, that sample should be selected for additional training. Their dual active sampling on batch-incremental active learning approach has proven useful. It simplifies implementation and reduces computational requirements compared to other state-of-the-art methods, leading to better results on the CIFAR-10 dataset with reduced computational time compared to the core-set method.
Carbonneau et al. [38] introduce new methods for bag-level aggregation in multiple-instance active learning (MIAL), a technique used in instance classification problems. The study focuses on reducing labeling costs by identifying the most informative instances and querying the expert for their labels. The proposed methods, including aggregated informativeness and cluster-based aggregative sampling, outperform existing methods in various application domains. This research is significant because it addresses the limitations of single-instance active learning methods in multiple-instance learning problems.
Recent efforts, exemplified by Chen et al. [30], emphasize using large datasets without full labeling. Chen et al. propose a method for object detection that amalgamates active and supervised learning techniques. Initially, labeled data are utilized to train a detection model in a semi-supervised manner. Subsequently, the stability of the unlabeled data is assessed: low-stability instances are manually labeled, while those with high uncertainty are pseudo-labeled using predictions from the detector. Integrating manual and pseudo-labeling contributes to refining the detection model, yielding promising results with a reported mean average precision (mAP) of 79.2%.
Another notable approach, presented by Li et al. [31], uses competence-based active learning to classify unlabeled images by their complexity. By utilizing a competence function in conjunction with prediction probabilities, this method effectively identifies images with varying degrees of difficulty, iteratively augmenting the dataset. Training the model with stochastic gradient descent (SGD) and a learning rate of 0.1, Li et al. evaluated three active learning methods (selection based on similarity, probability-based similarity selection, and competence-based active learning), achieving a final accuracy of 92%.
Additionally, Mohammadi et al. [32] introduce a framework utilizing pre-trained vision transformers trained on the RWF dataset as "experts", alongside a routing system and a reinforcement learning-based classifier. This approach dynamically activates experts for each video clip, determining its category (fight or no-fight) through various selections and categorizations. The router module orchestrates the classification process, deciding whether to select a clip's category immediately or to gather further information by activating vision transformers, thus concluding the classification process by selecting the appropriate category.

Theoretical Foundations
An artificial neural network (ANN) comprises numerous interconnected layers, each composed of multiple neurons. There are two primary types of ANNs: feed-forward and recurrent. The distinction lies in their connectivity patterns. In recurrent neural networks (RNNs), neurons may be connected to other neurons within the same layer or to neurons in layers that are neither previous nor subsequent to it. Conversely, in feed-forward neural networks (FNNs), neurons in a given layer are only connected to neurons in the subsequent layer via synaptic weights (w) [39].
In the feed-forward process, the input s_i^l to a neuron i in the l-th layer is determined by the product of the outputs of the activation function φ(·) of the preceding layer (l − 1) and the weight vector w of layer l, which includes the bias weight b^l (Equation (1)). The output of the neuron corresponds to a spatial transformation of s_i^l by the activation function (Equation (2)) [40].
The output j of the last layer of the ANN is z_j = f(x, w) (Equation (3)); it depends on the parameters w of all hidden layers, and it estimates the transformation f : R^N → R^C, which partitions the input space R^N into C classification regions (R^C) [41] by using a linear combination of the φ_h(x).
Here, x is the input vector (x ∈ R^N) and, when l = 1, s_i = Σ_h w_hi x_h (i.e., s_i is the i-th neuron of the first hidden layer); w_hi^l is the weight connecting the h-th neuron of the (l − 1)-th hidden layer to the i-th neuron of the l-th layer; φ_h^{l−1} is the h-th output node of the (l − 1)-th hidden layer; and L is the total number of layers in the ANN.
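The feed-forward computation described above can be sketched in a few lines of numpy. This is an illustrative toy network, not the paper's models: the layer sizes, the tanh activation, and the linear last layer are assumptions made for the example.

```python
import numpy as np

# Sketch of the feed-forward computations in Equations (1)-(3):
# s_i^l = sum_h w_hi^l * phi_h^(l-1) + b^l, followed by phi(s_i^l),
# with the last layer producing z = f(x, w). Sizes are illustrative.
rng = np.random.default_rng(1)

def phi(s):                      # activation function phi(.)
    return np.tanh(s)

def forward(x, weights, biases):
    a = x                        # phi^0 = x (the network input)
    for l, (W, b) in enumerate(zip(weights, biases)):
        s = a @ W + b            # Equation (1): weighted sum plus bias
        a = phi(s) if l < len(weights) - 1 else s  # Equation (2); last layer linear
    return a                     # z = f(x, w), Equation (3)

N, H, C = 4, 8, 2                # input dim, hidden units, classes (PV / NPV)
weights = [rng.normal(size=(N, H)), rng.normal(size=(H, C))]
biases = [np.zeros(H), np.zeros(C)]
x = rng.normal(size=N)
z = forward(x, weights, biases)  # one output per classification region
```

Note that `z` has one component per class C, which is exactly the output vector the threshold µ will later inspect.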

Convolutional Neural Network
A convolutional neural network (CNN) is a type of ANN in which the hidden layers consist of a set of convolutions that are typically a combination of linear and nonlinear operations (i.e., a convolution operation and an activation function) [42]. In addition, a CNN usually includes pooling and fully connected layers.
The convolutional layer constructs a feature map to extract key features from the input data. Here, convolutional kernels function as local filters, operating on sequential data to produce non-invariant local features. Conversely, the pooling layer plays a fundamental role in reducing the dimensionality of the feature vector, preserving only the most pertinent features. By extracting essential features within fixed-length windows, the pooling layer makes the subsequent processing steps more efficient [43,44]. Moreover, fully connected layers serve as linear classifiers, facilitating the assignment of predicted classes to specific inputs x_p. CNNs are formidable tools for feature extraction, particularly adept at tasks involving image data modeling, summarization, and classification [45,46].
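The two core operations above, convolution and pooling, can be written out explicitly. The following is a minimal sketch on a tiny synthetic image; the 3 × 3 kernel, ReLU nonlinearity, and 2 × 2 max pooling are assumptions chosen for illustration, not the configuration of any model in this work.

```python
import numpy as np

# Sketch of a single convolution + max-pooling step, the two core
# CNN operations described above (the kernel acts as a local filter;
# pooling keeps only the strongest response in each window).
rng = np.random.default_rng(2)

def conv2d(img, kernel):
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    H, W = fmap.shape
    out = np.empty((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

img = rng.normal(size=(8, 8))                # a tiny grayscale "frame"
kernel = rng.normal(size=(3, 3))             # learned local filter
fmap = np.maximum(conv2d(img, kernel), 0.0)  # convolution + ReLU
pooled = max_pool(fmap, size=2)              # dimensionality reduction
```

The 8 × 8 input becomes a 6 × 6 feature map after the valid convolution and a 3 × 3 map after pooling, illustrating how each stage shrinks the representation while keeping the strongest local responses.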
In recent years, the utilization of pre-trained models for specific tasks has gained prominence, a practice known as transfer learning. This approach is invaluable when the resources for training a neural network are limited or inadequate, allowing a model to be adapted to different classification problems. Notable pre-trained CNN models such as DenseNet121, EfficientNetB0, InceptionV3, MobileNetV2, ResNet50, and VGG16 offer the advantage of requiring fewer filters and parameters while yielding favorable results. Consequently, these models have found application in diverse domains, from processing biomedical images to violence detection in videos [20,47,48]. Below, we provide a brief overview of the pre-trained models used in our experimentation.

• EfficientNetB0. This convolutional neural network model is trained on the ImageNet database for classification tasks across 1000 object categories. One of its key innovations is "compound scaling", a technique designed to enhance model performance by increasing the resolution of input images. This increase in resolution necessitates adjustments in both the width (number of channels) and depth (number of layers) of the network to effectively capture the additional details provided by higher-resolution images. The approach applies a fixed set of coefficients to uniformly scale the network's width, depth, and resolution, and was specifically developed to improve the scalability of CNNs for better model performance and efficiency. EfficientNetB0 serves as the baseline model, characterized by 5.3 million parameters, and accepts input images of size 224 × 224. Subsequent variants, denoted by increasing numbers (e.g., B1, B2, B3), further adjust width, depth, and resolution to optimize performance across different computational constraints and task requirements [50].

• MobileNetV2. The MobileNetV2 architecture consists of two main types of building blocks: residual blocks with a stride of 1 and blocks with a stride of 2 for downsizing. Each block comprises three layers: the first is a 1 × 1 convolutional layer followed by a rectified linear unit 6 (ReLU6) activation function, the second is a depthwise convolutional layer with a ReLU6 activation function, and the third is another 1 × 1 convolutional layer used to linearly combine the output channels of the depthwise convolution [51].
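The transfer-learning pattern used with these models, freezing the pre-trained convolutional base and training only a small classification head, can be sketched without any deep learning framework. Here a fixed random projection stands in for the frozen backbone (e.g., EfficientNetB0 or MobileNetV2 without its top layers); all sizes, the logistic head, and the toy labels are assumptions made for the example.

```python
import numpy as np

# Transfer-learning sketch: the "pre-trained" feature extractor is
# frozen (its weights are never updated) and only a small logistic
# classification head is trained on top of the extracted features.
rng = np.random.default_rng(3)

D_IN, D_FEAT = 64, 16                       # flattened frame size, feature size
W_frozen = rng.normal(size=(D_IN, D_FEAT))  # stand-in for pre-trained weights

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)    # frozen backbone, never updated

w_head = np.zeros(D_FEAT)                   # trainable classification head

def train_head(X, y, lr=0.1, epochs=200):
    global w_head
    F = extract_features(X)                 # features computed once: base is frozen
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-F @ w_head))
        w_head -= lr * F.T @ (p - y) / len(y)  # only the head is updated

X = rng.normal(size=(100, D_IN))
y = (X[:, 0] > 0).astype(float)             # toy PV / NPV labels
train_head(X, y)
preds = (extract_features(X) @ w_head > 0).astype(float)
```

Because only `w_head` is updated, training is cheap even when the backbone is large, which is the main practical appeal of transfer learning in this setting.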

Physical Violence
Physical violence occurs when a person violates the bodily space of another without their consent, for example, by hitting, pulling, or pushing them. It may also include causing physical injuries with objects (weapons or other objects) or with the aggressor's body (punching, pulling, kicking, pushing). Physical violence has emotional and health consequences. The General Law on Women's Access to a Life Free of Violence [52], in its Article 6, Section II, defines physical violence as "a type of violence referring to any act that inflicts non-accidental harm, using physical force or some type of weapon or object that may or may not cause injuries, whether internal, external, or both".
This work is focused on improving the automatic identification of physical violence (PV) or non-physical violence (NPV) in images obtained from video in diverse environments using a threshold active learning approach.

Proposed Approach
In this paper, we introduce an active learning approach based on a threshold parameter (µ) to enhance classifier performance, construct robust models capable of operating effectively across diverse environments for detecting physical violence in videos by analyzing each frame, and build high-quality datasets with the intervention of a human expert. This approach comprises two stages, presented in Sections 4.1 and 4.2 and outlined in Algorithms 1-4.
Algorithm 1 initiates the active learning process by first constructing the initial classifier and then executing the active learning approach. Algorithm 2 delineates the key steps of the threshold-based active learning approach, including classification by a neural network model and human expert intervention, facilitating iterative improvement of the classifier through selective labeling of data. Algorithm 3 focuses on training and testing the models using both the original and updated datasets (DB), incorporating ambiguous samples that have been labeled and added to DB. Finally, Algorithm 4 details the process used to obtain the threshold µ.

Algorithm 1 Proposed active learning approach
Input: DB, M_0, T[]; /* DB is the training dataset, M_0 is the initial model, and T is a set of unlabeled datasets. */
1: STAGE 1:
2: j ← 0;
3: µ ← 0; /* Default value for µ is zero. */
4: M_j ← OPTIMIZE(M_j, DB, µ); /* Algorithm 3 is performed to build the initial model M_0 using DB, which is a labeled dataset. */
5: µ ← THRESHOLD(M_j, DB); /* Algorithm 4 is employed to obtain the best value for the threshold µ. */
6: STAGE 2:
7: while new unlabeled datasets T_j exist do /* T_j is an unlabeled and unknown dataset. */
8:     M_j ← ACTIVE-LEARNING(M_j, DB, T_j); /* Algorithm 2 */
9:     j ← j + 1;
10: end while

Algorithm 2 ACTIVE-LEARNING(M_j, DB, T_j)
1: for q = 0 to ∥T_j∥ do
2:     z ← M_j(x_q); /* z is the neural network output for sample x_q. */
3:     if |z[0] − z[1]| < µ then
4:         AM_k ← AM_k ∪ {x_q}; /* x_q is considered ambiguous, so it is included in AM_k. */
5:     else
6:         x_q ← MODEL_LABEL(x_q); /* Set the label generated by the classifier to x_q. */
7:     end if
8: end for
9: if ∥AM_k∥ > 0 then
10:     AM_k ← EXPERT_LABEL(AM_k); /* The human expert labels the ambiguous samples. */
11:     DB ← DB ∪ AM_k;
12:     M_j ← OPTIMIZE(M_j, DB, µ); /* Retrain the model on the updated DB. */
13: end if
14: return M_j;

Algorithm 4 THRESHOLD(M_j, DB)
1: µ_ini ← 0;
2: for q = 0 to ∥DB∥ do
3:     z_q ← M_j(x_q); /* z_q is the neural network output for sample x_q. */
4:     d_q ← LABEL(x_q, DB); /* d_q is the correct label for sample x_q. */
5:     if z_q <> d_q then
6:         µ_ini ← µ_ini + |z_q[0] − z_q[1]|; /* Accumulate per Equation (4). */
7:     end if
8: end for
9: µ ← µ_ini / ∥DB∥ + △; /* Equation (5) */
10: return µ;
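The two-stage loop described above can be sketched end to end in plain Python. This is a minimal simulation, not the paper's implementation: a tiny logistic model stands in for the deep classifier, `true_label` simulates the human expert, and all dataset sizes and the value of △ (here 0.05) are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of the two-stage threshold active-learning loop
# (Algorithms 1, 2, and 4). A tiny logistic model stands in for the
# deep classifier, and `true_label` simulates the expert's labeling.
rng = np.random.default_rng(4)

def train(X, y, epochs=300, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def outputs(w, X):
    """Two-class output z = [z[0], z[1]], as in the paper."""
    p = 1 / (1 + np.exp(-X @ w))
    return np.stack([p, 1 - p], axis=1)

def threshold(w, X, y, delta=0.05):
    """Algorithm 4: output gap summed over misclassified samples, plus delta."""
    z = outputs(w, X)
    wrong = (z[:, 0] > 0.5).astype(float) != y
    return np.abs(z[wrong, 0] - z[wrong, 1]).sum() / len(X) + delta

def true_label(x):                     # stands in for the human expert
    return float(x[0] + x[1] > 0)

# Stage 1: build the initial model and the threshold mu on labeled DB.
DB_X = rng.normal(size=(200, 2))
DB_y = np.array([true_label(x) for x in DB_X])
w = train(DB_X, DB_y)
mu = threshold(w, DB_X, DB_y)

# Stage 2: on each new unlabeled dataset T_j, route ambiguous samples
# (|z[0] - z[1]| < mu) to the expert, add them to DB, and retrain.
for _ in range(3):
    T = rng.normal(size=(100, 2))
    z = outputs(w, T)
    ambiguous = np.abs(z[:, 0] - z[:, 1]) < mu      # Algorithm 2, line 3
    AM_X = T[ambiguous]
    AM_y = np.array([true_label(x) for x in AM_X])  # expert labels AM_k
    DB_X = np.vstack([DB_X, AM_X])
    DB_y = np.concatenate([DB_y, AM_y])
    w = train(DB_X, DB_y)                           # retrain on updated DB
```

Each pass through the loop grows DB only with the frames the model could not confidently decide, which is what keeps the expert's labeling effort small.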

Stage 1
In the first stage, the initial model is built using a labeled dataset DB (Algorithm 1, line 4). Immediately afterwards, the threshold parameter µ (Algorithm 4) is calculated (Algorithm 1, line 5). Observe that in the first stage the initial value of µ is zero (Algorithm 1, line 3); i.e., in the first stage, the training of the model is not biased by the value of µ. This stage comprises lines 2-5 of Algorithm 1.

Stage 2
The second stage is enclosed in lines 7-12 of Algorithm 1, in which the model is tested using unseen and unknown datasets (T_j). T_j differs from the training dataset (DB ≠ T_j), ensuring an evaluation of the model's performance in novel environments. The threshold parameter µ (obtained in the first stage) is utilized to identify images (x_q) that the classifier deems ambiguous or challenging to classify (Algorithm 2, line 3). These ambiguous images represent scenarios where discerning physical violence from non-physical violence may be particularly difficult, such as instances involving strong hugs or warm greetings, among others.
The selected P ambiguous images (x_p) are stored in a subset AM_k, where AM_k = ∪_{p=1}^{P} {x_p} (Algorithm 2, line 4). When all samples in T_j have been processed by the model (i.e., when q ≥ ∥T_j∥), AM_k is forwarded to a human expert to be labeled with its correct class: physical violence or non-physical violence (Algorithm 2, line 10). Subsequently, the dataset DB is updated with the labeled samples, i.e., DB = DB ∪ AM_k (Algorithm 2, line 11), and the model is retrained using DB (Algorithm 2, line 12). This process continues until no unseen datasets are available (Algorithm 1, lines 7-12).
The OPTIMIZE procedure (Algorithm 3) is used to build the optimized artificial neural network model M_j. It employs the µ parameter to improve classifier performance by duplicating the samples the model finds difficult to learn. In this procedure, the model is retrained k = 0, 1, 2, ..., K times; i.e., the process is repeated for a total of K epochs or while the AUC increases. The human expert does not intervene in this process because the datasets employed are already labeled.
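The duplication idea behind OPTIMIZE can be sketched as follows. This is an illustrative stand-in, not Algorithm 3 itself: `fit` is a toy logistic model replacing the pre-trained networks, and the stopping rule here is simply "no ambiguous samples remain" rather than the paper's AUC criterion.

```python
import numpy as np

# Sketch of the OPTIMIZE idea (Algorithm 3): between retraining
# rounds, duplicate the samples whose outputs fall inside the
# ambiguity band |z[0] - z[1]| < mu, so the model sees them more often.
rng = np.random.default_rng(5)

def fit(X, y, epochs=200, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def optimize(X, y, mu, rounds=3):
    w = fit(X, y)
    for _ in range(rounds):              # up to K retraining rounds
        p = 1 / (1 + np.exp(-X @ w))
        hard = np.abs(p - (1 - p)) < mu  # ambiguous under threshold mu
        if not hard.any():
            break
        X = np.vstack([X, X[hard]])      # duplicate the hard samples
        y = np.concatenate([y, y[hard]])
        w = fit(X, y)                    # retrain on the augmented set
    return w

X = rng.normal(size=(150, 2))
y = (X.sum(axis=1) > 0).astype(float)
w = optimize(X, y, mu=0.2)
```

Duplicating hard samples is a simple form of oversampling: it raises their weight in the loss without requiring any new labels, which is why no expert is needed at this stage.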
The two stages above (Sections 4.1 and 4.2) simulate an online system in which the classifier is initially constructed using a labeled dataset and subsequently deployed in a real-world environment. When the model encounters ambiguity in classifying an image, the image is referred to a human expert to reduce classification errors. This approach takes advantage of ambiguous images to enhance the model's performance in unseen scenarios.

Threshold µ
An artificial neural network serves as a mathematical model mapping input x to output z, typically represented by a function f(W), where f : R^N → R^C. Here, x ∈ R^N and z ∈ R^C denote the input and output vectors, respectively [39].
Vector z contains the neural network outputs, and its dimensionality corresponds to the number of classes C. In this work, the dimensionality of z is two because we classify only two classes: physical violence (PV) and non-physical violence (NPV). Each position c in z_q is assigned to a particular class; in our case, z_q[0] corresponds to the PV class and z_q[1] to the NPV class. Thus, z_q[c] is the likelihood that a sample x_q belongs to class c, where c = 0, 1, ..., C − 1. The classifier assigns the class c to sample x_q if z_q[c] contains the maximum value of z_q [40].
In binary classification tasks, when one output is significantly larger than the other (e.g., z_q[0] ≫ z_q[1]), the class that the classifier should assign to x_q is clear, but when z_q[0] ≈ z_q[1], the classifier's decision could be wrong. When the outputs are significantly skewed towards one class, the classifier's decision is straightforward; however, when these outputs are close, the classifier may struggle to make a definitive decision. This uncertainty motivates the use of a threshold to identify ambiguous samples, where the difference between the outputs is negligible, potentially leading to misclassifications.
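The ambiguity rule just described reduces to one comparison on the output vector. The values below are illustrative, not taken from the experiments.

```python
import numpy as np

# A frame is flagged for the human expert when the two outputs are
# too close to call, i.e., |z[0] - z[1]| < mu.
def is_ambiguous(z, mu):
    """z = [z[0], z[1]]: the two-class outputs of the network's final layer."""
    return abs(z[0] - z[1]) < mu

mu = 0.2  # illustrative threshold value
assert is_ambiguous(np.array([0.52, 0.48]), mu)      # near-tie: send to expert
assert not is_ambiguous(np.array([0.95, 0.05]), mu)  # confident: auto-label
```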
Prior studies have explored the effectiveness of employing thresholds to improve classification accuracy in scenarios where ambiguous samples are prevalent [53,54]. Determining an appropriate threshold value µ involves an iterative process, typically conducted using the initial dataset DB. Through trial and error, a threshold value that effectively distinguishes between unambiguous and ambiguous samples can be set, aiding in the refinement of classification models. This iterative process is described as follows:
1. The model M_j is built using DB.
2. M_j is tested with DB.
3. We choose the outputs from the samples of DB classified incorrectly by M_j and use Equation (4) to find the initial threshold µ_ini.
Equation (5) obtains the final threshold µ, in which the initial µ_ini is increased by a constant △ (obtained in a trial-and-error process performed by a human expert) to give a certain level of tolerance to the threshold µ. The process for setting the △ value is performed only once, where △ is a small value (0 < △ ≪ 1). Algorithm 4 summarizes the process employed to obtain the best value for the µ threshold.
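The threshold computation can be sketched as below. The gap-based accumulation over misclassified samples is our reading of Equation (4), and the outputs, labels, and △ = 0.05 are illustrative values, not the paper's data.

```python
import numpy as np

# Sketch of Equations (4)-(5) / Algorithm 4: accumulate the output
# gap over the samples the model misclassifies on DB, normalize by
# the dataset size, and add a small tolerance delta.
def compute_mu(Z, labels, delta=0.05):
    """Z: (n, 2) network outputs; labels: true classes in {0, 1}."""
    preds = Z.argmax(axis=1)
    wrong = preds != labels
    mu_ini = np.abs(Z[wrong, 0] - Z[wrong, 1]).sum()  # Equation (4)
    return mu_ini / len(Z) + delta                    # Equation (5)

Z = np.array([[0.9, 0.1], [0.55, 0.45], [0.4, 0.6], [0.48, 0.52]])
labels = np.array([0, 1, 1, 0])  # the 2nd and 4th samples are misclassified
mu = compute_mu(Z, labels)
```

Note that only misclassified samples contribute to µ_ini, so a model that is already accurate on DB yields a small µ and flags few frames as ambiguous.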
Using the threshold µ takes advantage of the meaning of the neural network outputs to identify ambiguous images, i.e., those images that the classifier could misclassify. Each neural network output can be seen as the likelihood that a specific image or sample belongs to a given class [39,40]. Thus, if the likelihoods of the outputs differ markedly from each other, the likelihood of classifier error may be reduced, as has been studied in Refs. [53,54].

Computational Complexity Analysis of the Threshold Active Learning Approach
The time complexity (big-O notation) of Algorithm 1 is expressed in Equation (6), where O(OPT(DB, T_j)) is the complexity of the OPTIMIZE procedure (Algorithm 3), J is the number of unlabeled datasets, and O(AL(M_{j−1}, DB, T_j)) is the complexity of the ACTIVE-LEARNING procedure (Algorithm 2):

O(MAIN()) = O(OPT(DB, T_j)) + J · O(AL(M_{j−1}, DB, T_j)).    (6)
The complexity O(OPT(DB, T_j)) of the OPTIMIZE sub-algorithm (Algorithm 3) can be approximated by considering the main operations within the while loop and the nested for loop, as follows:
• The complexity of training the model, LEARNING(M_j, DB), depends on the learning algorithm used based on the pre-trained model; suppose it is O(L(DB)).
• The complexity of testing the model, TEST(M_j, T_j), depends on the size of the test set; suppose it is O(T(T_j)).
• The for loop has a complexity of O(∥DB∥) to iterate over the dataset and compute the neural network output.
Therefore, the complexity of the OPTIMIZE sub-algorithm for k iterations, where K is the maximum number of epochs, is approximately Similarly, to calculate the complexity O(AL(M j−1 , DB, T j )) of the ACTIVE-LEARNING sub-algorithm, we can consider the main operations within the for loop and the invoked functions: • For loop: -The complexity of iterating over the test dataset T j is O(∥T j ∥).-Within the loop, the operation M j (x q ) has a complexity dependent on the selected pre-trained model, which can be considered O(M).
Then, to determine the general big-O order of the complexity of the MAIN algorithm (Algorithm 1), we need to observe the dominant terms in the total complexity expression, O(MAIN()), and simplify them:
1. Complexity of training and testing (OPTIMIZE): O(OPT(DB, T_j)) ≈ K · (O(L(DB)) + O(T(T_j)) + O(∥DB∥)).
2. Complexity of ACTIVE-LEARNING: O(AL(M_{j−1}, DB, T_j)) ≈ O(∥T_j∥ · M) + O(∥AM_k∥) + O(OPT(DB)).
Generally, training (and thus the OPTIMIZE procedure) will be the most expensive operation. Assuming that L(DB), T(T_j), and the operations on the data (∥DB∥ and ∥T_j∥) are of the same order of magnitude, we can consider the dominant terms:
1. Complexity of training (in the OPTIMIZE procedure): O(K · L(DB)).
2. Complexity of the while loop (in the ACTIVE-LEARNING procedure): O(J · K · L(DB)).
Considering that J is the number of iterations of the while loop and K is the number of epochs, the overall complexity can be summarized and simplified in terms of the most dominant components. If we assume that the operations on the data (∥DB∥ and ∥T_j∥) are of lower order compared to the model iterations, the most dominant term, considering that K and J can be large, is O(J · K · L(DB)). This implies that the complexity order is dominated by the number of iterations of the while loop J, the number of epochs K, and the complexity of the training process L(DB), i.e., the LEARNING procedure in Algorithm 3.
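As an illustration of how these costs compose, the following sketch evaluates a toy cost model of the MAIN algorithm. All unit costs are hypothetical placeholders; only the functional form follows the approximations above:

```python
def total_cost(J, K, L_db, T_tj, n_db, n_tj, M):
    """Toy cost model: one OPTIMIZE call plus J active-learning rounds,
    each of which scans T_j with the model and then re-runs OPTIMIZE."""
    optimize = K * (L_db + T_tj + n_db)   # K epochs of train + test + dataset scan
    active = n_tj * M + optimize          # forward passes on T_j, then retraining
    return optimize + J * active

# With training dominant (L_db >> T_tj, n_db, n_tj, M), cost ~ J * K * L_db
cost = total_cost(J=5, K=10, L_db=10**6, T_tj=10**3, n_db=10**4, n_tj=10**3, M=10)
```

With these placeholder magnitudes, the result stays within a small constant factor of J · K · L(DB), which is the dominant term identified above.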
As can be seen, the total complexity of the proposed approach essentially depends on the selected pre-trained model. That is, in the worst case, each of the evaluated models will dominate the total complexity. Although they all have unique characteristics, their complexity O(L(DB)) = O(L(DB)_1) + O(L(DB)_2) essentially depends on the convolutional (Equation (12)) and fully connected (Equation (13)) layers [55,56],
where d is the depth of the convolutional network, l_n is the length of the output feature map, f_n is the number of filters in the n-th layer, s_n is the length of the filter, c_{n−1} is the number of input channels of the n-th layer, r is the learning rate, and b is the batch size.

Equation (13), in turn, depends on l, the depth of the fully connected layer; D, the dimension of the input/output channel; W, the width of the input; H, the height of the input; and N, the number of outputs.
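For reference, a plausible form of Equations (12) and (13), consistent with the variable definitions above and with standard per-layer cost analyses [55,56], is sketched below; the exact placement of the batch size b (and the per-layer exponents) is an assumption, and the learning rate r would influence the number of iterations required for convergence rather than the per-iteration cost:

```latex
O\big(L(DB)_1\big) \approx O\!\left(\frac{\lVert DB\rVert}{b}\sum_{n=1}^{d} l_n^{2}\, s_n^{2}\, c_{n-1}\, f_n\right),
\qquad
O\big(L(DB)_2\big) \approx O\big(l \cdot D \cdot W \cdot H \cdot N\big)
```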

Summary of the Proposed Approach
Finally, in order to better understand the proposed method, we present a visual summary of its main components in Figure 1. The first neural network model is built using a labeled dataset DB; afterwards, the model works in an unknown environment. Unlabeled datasets (T_j) arrive at the classifier; thus, images or data samples are forwarded through the model, which must assign a label. If the model is uncertain about the correct label, it considers the sample ambiguous and adds it to AM_k.
Otherwise, the label obtained by the model is assigned to the image. When enough ambiguous images exist in AM_k, the set is sent to a human expert for labeling. The labeled AM_k is added to DB, and the model is retrained with the updated DB, which now contains all ambiguous images correctly classified by human experts.
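The loop just described can be sketched as follows; `predict` and `human_label` are hypothetical stand-ins for the trained model's forward pass and for the expert labeling step:

```python
import numpy as np

def active_learning_round(predict, DB, T_j, mu, human_label):
    """One pass over an unlabeled dataset T_j using the threshold mu.

    `predict(x)` returns the two softmax likelihoods for sample x, and
    `human_label(samples)` stands in for the expert labeling step,
    returning (sample, label) pairs; both are hypothetical helpers.
    """
    AM_k, auto_labeled = [], []
    for x in T_j:
        p = predict(x)
        if abs(p[0] - p[1]) <= mu:          # small output gap: ambiguous
            AM_k.append(x)                  # defer to the human expert
        else:                               # confident: keep the model's label
            auto_labeled.append((x, int(np.argmax(p))))
    DB = DB + human_label(AM_k)             # DB <- DB U labeled(AM_k); retrain next
    return DB, auto_labeled

# Toy run: `predict` just returns the sample itself as its two likelihoods
T_j = [np.array([0.52, 0.48]), np.array([0.95, 0.05])]
DB, auto = active_learning_round(lambda x: x, [], T_j, mu=0.2,
                                 human_label=lambda s: [(x, 1) for x in s])
```

The near-tie sample (gap 0.04) is routed to the expert, while the confident one (gap 0.90) keeps the model's own label, mirroring the flow in Figure 1.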

Data Sources
Utilizing three distinct publicly available datasets enriches the diversity and realism of the training and evaluation process in our work. By incorporating diverse datasets, our study aims to train and evaluate the physical violence detection model across various scenarios, enhancing its adaptability and robustness in real-world applications. The datasets are:
1. AIRTLab dataset: This dataset presents scenes depicting everyday situations that may be misinterpreted by the algorithms as violent actions, such as hugs, clapping, and greetings, performed by nonprofessional actors. Including such scenarios challenges the classifier to discern subtle differences between benign and aggressive behaviors [57].

2. Real-Life Violence Situations (RVLS) dataset: This dataset comprises images extracted from videos depicting real-life situations and provides a comprehensive collection of diverse scenarios encountered in various environments [58]. By focusing specifically on instances of violence, this dataset facilitates the classifier's training in identifying and classifying violent behaviors accurately. The images in this dataset have been carefully selected and captured from multiple perspectives, incorporating a diversity of angles and different contexts or backgrounds to ensure wide variability and improve the effectiveness of recognition and analysis algorithms.

3. Pexels dataset: Pexels [59] is an online platform renowned for its vast collection of free stock photos and videos. The Pexels dataset was obtained by carefully selecting images from this platform in the context of this study. These images were specifically chosen to represent real-life scenes that do not contain physical violence, thus complementing the RVLS dataset, which focuses on violence images. Including Pexels images in the study is essential to evaluate the deep learning model's ability to distinguish between violent and non-violent scenes. By training and testing the model with images from both categories, researchers can assess its accuracy and robustness in classifying physical violence in various contexts. The diversity of images available on Pexels allows researchers to select a representative sample of non-violent scenes that reflect the complexity of the real world. This ensures that the model learns to identify physical violence and to differentiate it from everyday situations that could be misinterpreted as violent.
4. Smart-City CCTV Violence Detection (SCVD) dataset: The SCVD dataset offers a unique compilation of video footage captured through closed-circuit television (CCTV) systems deployed in urban settings [60]. This dataset accounts for variations in viewing angles, image quality, and recording contexts inherent to CCTV footage, which can significantly impact the effectiveness of violence detection algorithms. Notably, the SCVD dataset introduces a specific category dedicated to the identification of weapons, broadening the scope of violence detection to include any objects that could be employed harmfully. This expanded perspective enhances the classifier's ability to detect threats in urban environments and provides a different perspective on potential risks. Additionally, licensing the SCVD dataset under the "CC BY-NC-SA 4.0" license promotes collaboration and knowledge sharing within the research community while ensuring proper attribution to the original authors.
The heterogeneity in the collection of images is intended to simulate the variability of situations that artificial intelligence systems could face in real-world applications, from security systems to developments in artificial intelligence aimed at understanding complex social contexts.
It is important to note that all images included in the datasets used in this work are in the public domain. This means that they have been released from any existing copyright, allowing their use, redistribution, modification, and exploitation without restrictions or the need to request permission from the original authors. This feature facilitates their incorporation into research projects, software development, and other academic or commercial initiatives, providing a solid foundation free of legal restrictions for the advancement of studies in violence detection and behavioral analysis through computer vision video technologies.
Table 1 shows details regarding the original public datasets used in this work, including "size" or number of the images or frames in the dataset, "resolution", and the number of samples in each class of interest for this work: physical violence and non-physical violence.
In order to give more information about the datasets used, Figure 2 shows some examples of the frames contained in the AIRTLab, RVLS/Pexels, and SCVD datasets, where the left column of Figure 2 presents examples of images of physical violence and the right column presents examples of non-physical violence.

From the original AIRTLab video dataset, we select a subset of Q images to form the initial dataset. This dataset is then partitioned into two distinct subsets: the training set, denoted as DB, and the test set, denoted as T. The training set is utilized to construct and optimize the model, while the test set is reserved for evaluating the performance of the proposed approach. The subsets contain 70% (training) and 30% (testing) of the total data, ensuring a balanced distribution of samples for training and evaluation purposes. It is important to note that the intersection between the training set (DB) and the test set (T) is empty, ensuring that no data overlap occurs between these two subsets (DB ∩ T = ∅).

During the initial stage (refer to Section 4.1), only the training set (DB) derived from the AIRTLab dataset is employed. Subsequently, in stage 2 (detailed in Section 4.2), datasets originating from RVLS, Pexels, and SCVD are utilized as the test set (T). Unlike the training set, these datasets are not split further and are used in their entirety to evaluate the performance of the developed model. This approach ensures that the model is assessed rigorously across diverse environments, facilitating a robust evaluation of its effectiveness and generalization capabilities.
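A minimal sketch of this 70/30 disjoint split using scikit-learn's `train_test_split` is shown below; the placeholder arrays stand in for the actual AIRTLab frames:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder stand-ins for the Q frames sampled from AIRTLab:
# X holds frame identifiers, y holds labels (0 = NPV, 1 = PV)
rng = np.random.default_rng(0)
X = np.arange(100).reshape(-1, 1)
y = rng.integers(0, 2, size=100)

# 70% training (DB) and 30% testing (T); stratify keeps the class balance
X_db, X_t, y_db, y_t = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# DB and T are disjoint by construction: DB ∩ T = ∅
assert set(X_db.ravel()).isdisjoint(set(X_t.ravel()))
```

Stratifying on the labels preserves the PV/NPV proportions in both subsets, which matches the balanced-distribution requirement stated above.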

Neural Network Parameters
The experiments were conducted on a workstation equipped with an Intel Xeon E-2186G (12) @ 4.700 GHz processor and an NVIDIA GeForce RTX 3060 graphics card featuring 6 GB of video memory. The system had 64 GB of RAM. The NVIDIA-SMI 535.86.05 driver version and CUDA version 12.2 were utilized for GPU acceleration. For software, TensorFlow 2.14.0, scikit-learn 1.3.0, Python 3.10.11, pandas 2.1.0, numpy 1.23.5, jupyterlab 3.6.3, notebook 6.5.4, ipykernel 6.25.2, and ipython 8.15.0 were employed.
In this work, we chose the DenseNet121, EfficientNetB0, and MobileNetV2 pre-trained models because they are mature, effective, and robust, but mainly because they are well-known models for image classification (see Section 3.1). Also, their computational efficiency has been tested in terms of floating point operations (FLOPs); see, for example, Ref. [61]. This allows us to focus only on evaluating the performance of our active learning approach (see Section 4) in building robust models for physical violence detection in images obtained from video in diverse environments. In addition, these models are still the focus of research on image classification tasks.
The setup details of the pre-trained models are summarized in Table 2. "Pooling" represents the pooling method applied, while "dense" indicates the configuration of the dense layer, with the first number representing the number of nodes and the second indicating the activation function used. "Dropout" refers to the probability of dropping out nodes in the hidden layers. "BN" denotes the batch normalization method. "η" signifies the initial learning rate, "epoch" represents the number of iterations executed by the optimizer, and "freezing" denotes the number of layers frozen to preserve and reuse fundamental layers, focusing learning on more specific and new tasks, such as the detection of physical violence in images from videos. "µ" represents the computed threshold, as presented in Section 4.3. The parameters presented in Table 2 were determined through a trial-and-error process. Initially, 70% of DB was used to train different parameter configurations, while the remaining 30% was reserved for testing. Different values of the learning rate, batch size, dropout, number of epochs, and threshold µ were explored to identify the most appropriate settings. The performance of each configuration was evaluated based on the area under the receiver operating characteristic (ROC) curve, as described in Section 5.3.
Once the parameters yielding the best results were identified, the model was retrained on the full DB with these optimal settings. It is important to note that the test set T was not used during the trial-and-error process but was reserved solely for evaluation, as outlined in Section 4.1. In the output layer, a softmax function [40] with two neurons was employed for classification, and the Adam optimization method [62] was chosen for training the model.
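A minimal sketch of such a transfer-learning classifier in TensorFlow/Keras follows; the layer sizes, freezing depth, dropout, and learning rate are illustrative placeholders rather than the tuned values from Table 2 (weights are also left uninitialized here, whereas ImageNet weights would be loaded in practice):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(n_freeze=100, dropout=0.3, lr=1e-4):
    """Transfer-learning classifier sketch; hyperparameters are placeholders."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights=None,       # weights="imagenet" in practice
        input_shape=(224, 224, 3))
    for layer in base.layers[:n_freeze]:       # "freezing": keep early layers fixed
        layer.trainable = False
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),       # "pooling"
        layers.Dense(128, activation="relu"),  # "dense"
        layers.BatchNormalization(),           # "BN"
        layers.Dropout(dropout),               # "dropout"
        layers.Dense(2, activation="softmax"), # two-neuron softmax output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model
```

Freezing the early layers preserves the generic ImageNet features, so only the task-specific head and the later layers adapt to the violence detection task.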

Classifier Performance
The performance of the studied models is evaluated using the area under the ROC curve, commonly referred to as AUC. This metric is widely used in binary classification problems, as it measures a model's ability to discriminate between positive and negative classes regardless of the threshold chosen for classification. Unlike accuracy, AUC considers the trade-off between sensitivity (true positive rate) and specificity (true negative rate), making it less susceptible to imbalanced datasets.
Recall, or sensitivity, is defined as the proportion of physical violence data correctly identified (true positives), i.e., how effectively a classifier identifies positive labels; specificity is the proportion of non-physical violence data correctly labeled (true negatives), i.e., the effectiveness of a classifier at identifying negative labels. In the context of binary classification, recall (or sensitivity) and specificity are important performance metrics that provide insight into how well a classifier distinguishes between positive and negative classes. These metrics are typically derived from a confusion matrix. The formal definitions of AUC, sensitivity, and specificity can be found in Ref. [63].

Results and Discussion
The experimental results are presented in this section. Firstly, we present the evaluation of the first stage of the proposed method, followed by the results of the second stage, and finally, an overall analysis of the method's performance.
Table 3 presents the results after the first stage (see Section 4.1), while Tables 4 and 5 present the second stage of the proposed method. The rows in Tables 3-5 correspond to the pre-trained model employed: DenseNet121, EfficientNetB0, or MobileNetV2 (see Section 3.1). The columns are "epoch", the number of epochs performed in the proposed approach (see Algorithm 3), and the metrics used to evaluate the built models: "recall", "specificity", and "AUC" (see Section 5.3). Additionally, the tables include the number of physical violence ("PV") and non-physical violence ("NPV") scenes considered by the classifier as ambiguous or hard to classify in the respective epoch (see Section 4.3). Finally, the "size" column indicates the number of images in the training dataset (∥DB∥), including those added to DB because they were considered ambiguous.
Figures 3-5 show representative scenes where the proposed approach is uncertain about the correct label in two contexts: uncertainty when the model labeled the image as containing physical violence (Figures 3a, 4a and 5a), and uncertainty when the model labeled the image as containing non-physical violence (Figures 3b, 4b and 5b). These figures are intended to illustrate which scenes are confusing or ambiguous for the model and could be misclassified.
Table 3 presents the results of the initial model built with the AIRTLab dataset (dataset base, DB) in epoch 0, followed by subsequent epochs (epochs 1, 2, ..., 5), where the model is retrained using ambiguous images to enhance its robustness. Retraining the model with ambiguous scenes generally improves classifier performance compared to the initial model in epoch 0 (Figure 3 shows some samples of images identified as ambiguous by the neural network model). Additionally, it is observed that the number of ambiguous samples decreases as the number of epochs increases. Moreover, it is noteworthy that, overall, the specificity values are lower than the recall values. The latter indicates that the classifier correctly identifies more images as physical violence (PV) than as non-physical violence (NPV). However, it is notable that the model selects a more significant number of ambiguous samples in the PV class than in the NPV class.
Also, note in the "size" column of Table 3 that increasing the number of epochs implies an increase in the size of the dataset base (DB). This is because the ambiguous samples are added to the training dataset in each iteration to strengthen the model's performance, as explained in Section 4. In Figure 3a,b, examples of images (from the AIRTLab dataset) selected by the classifier as ambiguous or hard to classify are shown to clarify the concept of ambiguity. The left and right columns display physical violence and non-physical violence scenes, respectively; it is very unclear whether physical violence exists in these images, which means the proposed threshold works properly.
The results of the second stage are presented in Tables 4 and 5. In this stage, the last model obtained in epoch 5 of the first stage is tested with an unknown dataset, simulating an online system akin to a camera surveillance system. It is worth noting that while the primary objective of this work is to demonstrate the effectiveness of the proposed approach (as outlined in Section 4), it does not aim to implement a camera surveillance system for real-time operation. Instead, the focus is on evaluating the performance of the proposed method under simulated real-world conditions.
Table 4 presents the experimental results obtained after testing the model with the RVLS/Pexels dataset, which contains images markedly different from those used in the first stage (see Figure 2) or those employed to build the model. The objective is to evaluate the model's performance in scenarios with diverse characteristics. The RVLS/Pexels dataset, as described in Section 5.1, comprises confusing images in class PV and clearly non-physical violence images in class NPV (see Figure 2c,d). Despite being a small dataset, the model trained in epoch 5 of the first stage demonstrates robustness in classifying this dataset. The initial performance of the model in epoch 0 varies across architectures. For instance, DenseNet121 achieves an AUC value of 0.892, while EfficientNetB0's performance is notably lower, with an AUC of 0.489. However, the classifier's performance is significantly improved upon retraining the model with ambiguous images (epochs 1 through 5). For instance, the performance of the EfficientNetB0 model in epoch 5 reaches an AUC of 0.982, suggesting the effectiveness of the proposed approach in enhancing classification accuracy. Consider that the model was retrained using less than approximately 21% of the RVLS/Pexels dataset, and its performance increased notably. DenseNet121 and MobileNetV2 are more robust models in dealing with unknown environments, and EfficientNetB0's performance was also improved. In terms of recall and specificity, Table 4 shows a trend similar to the first stage, where the models perform better in classifying scenes as physical violence than as non-physical violence, although the trend is not clearly defined. Regarding ambiguous images (see Figure 4), the maximum number of images considered ambiguous was 76 in this dataset (with MobileNetV2), i.e., about 21% of the samples from the RVLS/Pexels dataset (see Table 1). In other words, the proposed approach uses fewer samples than traditional methods that mix different datasets to obtain the best performance; see, for example, the studies in Refs. [2,16,25]. In contrast, Table 4 does not exhibit a trend in terms of which class is more confusing for the classifier, as was observed in Table 3. Images selected by the classifier using the threshold µ could be clearly identified as physical violence or non-physical violence by a human expert (see Figures 2c and 4b), but for the model this is not true. However, adding these ambiguous images increases the classifier's performance; in other words, using the threshold µ helps the classifier to improve its effectiveness.

Table 5 exhibits the results after testing, on a new and unknown dataset (in this case, the SCVD dataset), the last model retrained with the AIRTLab dataset and ambiguous samples from the RVLS/Pexels dataset, which corresponds to the models developed in epoch 5 of the preceding phase. The classifier performance is relatively low, with the DenseNet121 model achieving the highest AUC value of 0.444. However, the SCVD dataset differs significantly from the AIRTLab and RVLS/Pexels datasets, as it contains more diverse scenarios with more people in the frame and less defined scenes (see Figure 2). Despite these challenges, retraining the classifier leads to an improvement in effectiveness, with AUC values ranging from 0.811 to 0.914.
The number of ambiguous scenes selected by the model is higher compared to the other datasets (see Tables 3 and 4), indicating that these images are more difficult for the classifier to learn than those used previously (see Figure 2). Furthermore, the images selected as ambiguous in this phase are confusing for both the classifier and the human expert. Some examples of such ambiguous scenarios are depicted in Figure 5a,b. There is a tendency in recall and specificity to classify non-physical violence images better than physical violence ones, even though the distinction is unclear.

The performance of the neural network models studied in this work could still be improved despite the inclusion of a threshold active learning approach. Tables 3-5 show that a margin for improvement still exists because all AUC values are lower than 1. In other words, the model misclassifies some images about whose actual classes it has no doubts (it does not consider them ambiguous), but it is wrong. Figure 6 presents some images misclassified by the neural network model, which highlights the challenge of identifying physical violence in images obtained from video across several environments. In addition, there is no clear trend in specificity or recall in Tables 3-5, because these values sometimes decrease or increase after a new training iteration. This behavior is related to the new samples included in the training dataset in each iteration. Thus, the learning process has different conditions in each iteration, which is reflected in the experimental results for specificity and recall.
Finally, Table 6 compares the performance of the initial and final models. The initial model in the AIRTLab column was built solely using the AIRTLab dataset and ambiguous samples from AIRTLab; the initial model for RVLS/Pexels was constructed with the AIRTLab dataset and ambiguous samples from the RVLS/Pexels dataset; and the initial model for SCVD was built with the AIRTLab dataset and ambiguous samples from the RVLS/Pexels and SCVD datasets. The final model is trained with AIRTLab along with ambiguous images from RVLS/Pexels and SCVD, representing the last model of the second stage. Observe that the initial and final models are the same for the SCVD dataset (last column in Table 6).
Each model (initial and final) was tested with the original datasets: AIRTLab, RVLS/Pexels, and SCVD. Table 6 shows that the final model is highly robust on the AIRTLab test dataset; also, for most models, the AUC values increase or only a minimal decrease is observed. However, when the models are tested with the RVLS/Pexels dataset, the performance is affected, with AUC values reduced to 0.836 (for the MobileNetV2 model), while the other models are less affected. This reduction in performance indicates that the model's ability to classify new images from different environments is diminished, but it still retains acceptable performance on previously learned tasks. The experimental results on the RVLS/Pexels dataset differ significantly from the other datasets because it represents a sudden environmental change. The reasonable response of the model to that dataset indicates the proposed approach's ability to adapt to changing scenarios, although it is limited when the scenarios are very different (see Figure 2). However, the advantage of the proposed model is that it continuously learns from the environment, and the classifier will adapt in future iterations (see Algorithm 1) to improve its performance.
The performance on the SCVD dataset remains consistent with that obtained by the final model, as in this phase the initial and final models are equivalent. Another advantage of our proposed approach is that it is only retrained with samples that are hard for the classifier to learn, and those samples are confirmed by a human expert.
The experimental results presented in this section show that the proposed approach's performance is appropriate for working in changing environments. The key is the classifier's ability to retain the knowledge already learned while it acquires new knowledge verified by a human expert. In other words, the main idea is to maintain acceptable classification performance on previously learned images while new images are learned from unknown environments.
The additional effort in terms of human expert hours associated with the proposed threshold active learning (see Section 4) is justified. The model cannot remain static because the nature of the environment is changeable: new videos in different scenarios are generated daily. Consequently, the classifier should be rebuilt to adapt to new scenarios (which is a challenge for static classifiers). For this, new high-quality datasets are needed; thus, the contribution of human experts in this process is of great importance because, unlike static classifiers, the proposed approach only uses those samples where the classifier has uncertainty about the correct label.

State of the Art: The Best-Performing Techniques
The state-of-the-art best-performing techniques for violence detection in surveillance videos include two main approaches: the sequence level and the frame level. Oriented Violent Flows (OViF), SVM classifiers, and the spatiotemporal autocorrelation of gradients (STACOG) for feature extraction are some relevant representatives of the first approach. Deep learning methods such as CNNs, LSTMs, and 3D ConvNets have shown promising results in violence detection at the frame level [64].
In Table 7, we present the results of similar works that employed various models at the frame level to detect violence in videos. These works consistently report good performance. However, our research emphasizes the use of active learning, which, while not yielding substantially superior results compared to other authors, does enhance performance compared to models without active learning. Specifically, our approach with active learning improves the pre-trained models' performance in nearly all cases (see Table 6). Furthermore, it is important to highlight that no modifications or adaptations to the model architecture were necessary. This indicates the potential of active learning to optimize model efficiency and performance, even within established frameworks, and distinguishes our approach, since we have not identified references where active learning is part of the method.

The purpose of conducting research using different techniques and models for violence detection in videos is to improve surveillance systems' accuracy, efficiency, and reliability in identifying and preventing physical violence incidents. By exploring various approaches, researchers aim to develop robust algorithms that can effectively detect physically violent behavior in different scenarios, such as crowd violence, school violence, and aggressive behaviors. Additionally, comparing and evaluating various methods helps identify the strengths and limitations of each approach, leading to the advancement of technology for enhancing security and safety through automated violence detection in surveillance videos. The goal is to create more effective and reliable systems for detecting and responding to physical violence, thereby contributing to the overall security and well-being of individuals and communities.

Some challenges in accurately detecting violence in surveillance videos include the complexity of identifying features that describe behavior accurately, the impact of scene changes and appearance variations on object behavior analysis, the lack of publicly available datasets for violence detection, data imbalance between positive and negative samples, and the computational and time-consuming costs of feature representation during video violence detection, among others [64,66]. This work analyzes active learning as a tool that could be applied to different models by combining human expert input and an automatic threshold parameter; the approach is also able to manage diverse scene variations thanks to the retraining process with human-labeled images during the active learning stage, which was explored using five well-known and well-performing models. Additionally, to our knowledge, no other works in the field of violence detection in videos include active learning as a strategy to improve the training process.
We have discussed the main experimental results obtained by the proposed approach using pre-trained models. Still, it is necessary to clarify that the presented method could be applied to other neural network models. We are interested in testing its performance in other advanced deep learning models, such as transformers, in the future. The main goal of this work is to highlight the effectiveness of the threshold active learning approach in building robust models for violence detection in images obtained from video in diverse environments, regardless of which pre-trained model is used or which one attains the best classification performance.

Conclusions
This work analyzes the effectiveness of using a threshold active learning approach to develop robust neural network models capable of classifying physical violence or non-physical violence scenes in video frames from unknown environments. The main contribution of this approach is the process used to obtain the threshold µ (a process based on the neural network output), which allows human experts to contribute to the classification process to obtain more robust neural networks and high-quality datasets.
The proposed approach consists of two stages. In the first stage, the neural network model is trained with an initial dataset and tested with unseen images from the same dataset (training and testing subsets). The selected pre-trained models usually perform well, as presented in the experimental results section. In the second stage, the model is exposed to unknown environments (images from other datasets); here, the model's classification performance decreases, but it still retains acceptable performance on both previously learned tasks and the new, unlearned tasks. However, the classifier's performance improves if the model is retrained with the images selected as ambiguous (using the threshold µ). This implies that the proposed approach continuously learns from the environment, i.e., it constantly adapts to new scenarios and improves its classification performance in unknown environments.
Moving forward, there are several areas of opportunity for further exploration and improvement. One area of interest is the identification of the onset of physical violence, which would enable the model to issue more precise alerts. By pinpointing the exact moment when physical violence begins within a video sequence, the model could provide timely interventions or alerts, increasing its practical utility in real-world scenarios.

---
The operations of comparison and updating the ambiguous set have a constant complexity O(1).
• Human labeling and updating DB: The function HUMAN_LABEL(AM_k) will depend on the time taken by the human experts but can be considered O(∥AM_k∥) in the worst case. The operation of updating the dataset, DB ← DB ∪ AM_k, has a complexity of O(∥AM_k∥).
• Re-optimization of the model: The complexity of the function OPTIMIZE(M_j, DB) is the same as that of the OPTIMIZE sub-algorithm, denoted O(OPT(DB)), shown previously.
Therefore, the complexity of the ACTIVE-LEARNING sub-algorithm is approximately O(AL(M_{j−1}, DB, T_j)) ≈ O(∥T_j∥ · M) + O(∥AM_k∥) + O(OPT(DB)).

Figure 1. Summary of the main components of the proposed active learning approach.

Figure 2. Some examples of images extracted from the datasets used, showing physical violence (left column) and non-physical violence (right column) scenes.

Figure 3. Some examples of images selected by the classifier as ambiguous or hard to classify in the AIRTLab dataset. (a) and (b) display physical violence and non-physical violence scenes, respectively.
(a) Physical violence scene in RVLS; (b) non-physical violence scene in Pexels.

Figure 4. Some examples of images selected by the classifier as ambiguous or hard to classify. (a) and (b) display physical violence and non-physical violence scenes, respectively.

Figure 5. Two examples of images selected by the classifier as ambiguous or hard to classify from the SCVD dataset. (a) and (b) display physical violence and non-physical violence scenes, respectively.

Figure 6. Examples of images misclassified by the model; it does not identify them as ambiguous or hard to classify but assigns them a wrong label. The first row displays non-physical violence scenes and the second row physical violence scenes.
DBTrain ← DBTrain ∪ AM_k ; /* Dataset is updated with ambiguous samples */
Obtaining the threshold µ. /* M_j is the recent model, and DB the training dataset */ Input: M_j, DB.
• O(M) is the complexity of obtaining the model output, which may depend on the number of layers and parameters of the neural network model.
• O(∥DB∥) and O(∥T_j∥) are proportional to the sizes of the training and test datasets.
• O(T(DB)) is the complexity of training (the LEARNING procedure in Algorithm 3), which can be very high if the model is complex.
• O(T(T_j)) is the complexity of testing, generally lower than that of training.
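The full derivation of µ is not reproduced in this excerpt. As one plausible reading of "a process based on the neural network output", the sketch below derives µ from the distribution of output margins on the training set; the quantile rule, function name, and toy values are assumptions for illustration only.

```python
import numpy as np

def obtain_mu(outputs, quantile=0.10):
    """Hypothetical rule: µ is the margin to the 0.5 decision boundary
    below which the closest `quantile` fraction of the model's outputs falls."""
    margins = np.abs(np.asarray(outputs, dtype=float) - 0.5)
    return float(np.quantile(margins, quantile))

# Toy sigmoid outputs of the model over a training set (hypothetical values)
outputs = [0.01, 0.05, 0.40, 0.52, 0.60, 0.95, 0.99, 0.45]
mu = obtain_mu(outputs)
# Frames with |p - 0.5| <= µ would then be flagged as ambiguous
```

Computing the margins is O(∥DB∥), consistent with the cost terms listed above; the quantile step adds a sort over the same set.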

Table 1. Video datasets used to build the models presented in this work.

Table 2. Setup summary of the pre-trained models studied in this work.

Table 3. Experimental results on the recall, specificity, and AUC metrics for the AIRTLab dataset using our proposed active learning approach. These results were obtained in the first stage (Section 4.1).

Table 4. Experimental results on the recall, specificity, and AUC metrics for the RVLS/Pexels dataset using our proposed active learning approach. These results were obtained in the second stage (Section 4.2).

Table 5. Experimental results on the recall, specificity, and AUC metrics for the SCVD dataset using our proposed active learning approach. These results were obtained in the second stage (Section 4.2).

Table 6. Comparison of the initial and final models' performance (AUC metric) with respect to the original test sets of the AIRTLab, RVLS/Pexels, and SCVD datasets.

Table 7. Comparison with related research; all referenced works operated at the frame level of the video.