Improving video surveillance systems in banks using deep learning techniques

In the contemporary world, security and safety are significant concerns for any country that wants to succeed in tourism, attracting investors, and economics. Manually, guards monitoring 24/7 for robberies or crimes becomes an exhaustive task, and real-time response is essential and helpful for preventing armed robberies at banks, casinos, houses, and ATMs. This paper presents a study based on real-time object detection systems for weapons auto-detection in video surveillance systems. We propose an early weapon detection framework using state-of-the-art, real-time object detection systems such as YOLO and SSD (Single Shot Multi-Box Detector). In addition, we considered closely reducing the number of false alarms in order to employ the model in real-life applications. The model is suitable for indoor surveillance cameras in banks, supermarkets, malls, gas stations, and so forth. The model can be employed as a precautionary system to prevent robberies by implying the model in outdoor surveillance cameras.

www.nature.com/scientificreports/ detection models in video surveillance cameras 12 . In 2007, L. Ward et al. presented for the first time the notion of identifying firearms using real-time object detection techniques 13 . Furthermore, it should also be highlighted that there are other promising methods for identifying people with dangerous weapons. According to research by Yong et al., microwave swept-frequency radar can be used to detect metal objects, such as weapons and knives 14 . Also, authors used a YOLOv3 model combined with a human pose to detect handguns 15 . An approach that combines color-based segmentation and an interest point detector has been utilized to automatically identify weapons. The purpose is to detect objects and analyze their features in order to compare them to descriptions of guns 16 .
Moreover, Olmos et al. has suggested techniques to reduce the number of false positives and false negatives, such as employing a symmetric dual camera system to enhance the selection of potential features, thereby boosting the model's accuracy 17 . The study 18 presents different transfer learning models, namely AlexNet, VGG16, and VGG19, used to train.gun and knife detection models.
A study on the detection of firearms and knives presented models that inform security guards when handguns or knives are identified in video surveillance systems 17 . Another study demonstrates a hybrid weapon detection model designed to identify guns and knives; it uses fuzzy logic and other parameters to enhance the result and decrease false alarms 18 .

Methodology
It is essential to introduce a conceptual framework. This entails understanding the problem, the trade-off between time response and model performance, and pointing out deficiencies in the proposed model including data quality. Figure 1 presents conceptual framework approach for the proposed project, showing the location of each neural network models.
In this research, we are looking to train two models using neural networks, the first model to allow people with faces uncovered to pass the front gate, and the second model to detect any dangerous items that indicate criminals or robberies inside the bank. The process of detecting balaclava and weapons can be broken down into two main parts, locating the bounding box of the target object (e.g. weapon) and classifying it. Our approach for weapon detection is applying transfer learning using STATE-OF-THE-ART object detection models such as YOLO and SSD. Figure 2 shows a flowchart of training and developing proposed neural network models.
However, we build five weapon detection models by using transfer learning and fine-tuning from pre-trained models. The models are SSD Inception V2, SSD MobileNet V2, SSD ResNet50 V1, YOLOv4, and YOLOv5. All models were trained on a Colab notebook using TensorFlow and GPU accelerated. Starting with SSD Inception V2 models, it uses VGG-16 as a feature extractor, the batch size is 32, and the input layer receives images of resolution 300 × 300. One of the solutions to decrease time response that trains a model on grayscale data, therefore, it yields a lightweight model that is appropriate for CCTV cameras.

Results
In surveillance applications, it is expected that the input images are of relatively high quality 19 . Processing time and high precision are essential factors for evaluating real-time weapon detectors. Accurate, relevant, and highquality images are crucial in building and developing any computer vision model. Research has been conducted on armed robbery and states that the most frequent weapon used was firearms 20 .

Datasets.
Our study only focuses on four classes: pistol, knife, rifle, and robber mask. We include revolvers and handguns in pistol class. Also, we include shotguns in rifle class. However, many objects are most likely to be www.nature.com/scientificreports/ confused with a pistol, such as wallets, cell phones, selfie sticks, money, and so forth. So, it would be a great idea to label them as a non-weapon class, hence reducing the number of false positives and false negatives, therefore increasing the overall accuracy. We have made two datasets, which are explained below. Dataset 1 consists of four classes: pistol, knife, rifle, and robber mask. We have 1160 images in total. The majority of images belong to the pistol class because nearly 95% of weapons used in robberies are either pistols or revolvers. Datasets were separated into train and test with split size shown in Table 1. Dataset 2 is used to build a weapon detection binary classifier, so two classes were made, knife and pistol. There are 5000 images, with 4250 images in the train set and 750 in the test set. Figure 3 shows the distribution of the four classes in dataset 1, which is a severely imbalanced dataset due to the high number of pistol images. Meanwhile, a distribution of two classes in dataset 2 is shown in Fig. 4.
Training results. In this research, we trained weapon detection models by using transfer learning and finetuning from pre-trained models. The models are SSD Inception V2, SSD MobileNet V2, SSD ResNet50 V1, YOLOv4, and YOLOv5. All models were trained on a Colab notebook using TensorFlow and GPU accelerated. Starting with SSD Inception V2 models, it uses VGG-16 as a feature extractor, the batch size is 32, and the input  www.nature.com/scientificreports/ layer receives images of resolution 300 × 300. The num_step parameter is 10,000 and defines how many training steps it will run. The model uses RMSprop as an optimizer designed for deep neural networks. The training was performed with different training steps for each model; the SSD Inception, SSD MobilNet, and SSD ResNet models were trained for 10,000 training steps, with batch sizes ranging from 8 to 32, depending on the dimensions of the input images. Meanwhile, YOLOv4 has trained for 6000-8000 training steps only to prevent the overfitting problem. YOLOv5 was trained on dataset 1 for only 270 epochs and dataset 2 for 832 epochs since no improvement was observed in the last 100 epochs. Training results are shown below in Figs. 5 and 6.
The behavior of total loss drops in models for dataset 1 and dataset 2 is quite similar. As the loss starts to decrease through training, the model's accuracy increases until it stagnates. After 10,000 training steps, the value of the total loss of the SSD MobileNet model on dataset 1 is around 0.16, and the learning rate has vanished to zero. However, the model with the highest total loss is SSD Inception in datasets 1 and 2.
The total loss is the sum of classification loss, localization loss, and regularization loss in SSD models. However, the loss function for YOLO models composes of classification loss, localization loss, and confidence loss (objectness loss). Figures 7 and 8 show the loss function and mAP chart for the YOLOv4 model throughout the training phase on datasets 1 and 2, respectively.
The blue curve represents the training loss, which decreases as the number of iterations increases. The red line is the mean average precision when the intersection-over-union threshold is 0.5. The most critical hyperparameter while configuring a neural network is the learning rate, which determines how much to change the model based on the estimated error every time the model weights are updated.  www.nature.com/scientificreports/  www.nature.com/scientificreports/ Figure 9 shows the learning rate behavior for the SSD MobileNet model on dataset 1 and the SSD ResNet50 model on dataset 2.
The images used in building weapon detectors should be cleaned, preprocessed, and appropriately annotated to achieve high precision. Unfortunately, the process of collecting images and labeling them is delicate and tough.
Evaluation results. Test data was created to assess the accuracy of the newly weapon detection models, where images represent various weapons in different scenarios. This section presents the evaluation results for all experiments in tables and graphical representations.
All object detection models were evaluated on mAP@0.5 (Pascal VOC), mAP@0.5:0.95 (COCO), and mAP@0.75 (strict metric). Tables 2 and 3 show evaluation results for dataset 1 and dataset 2, respectively. Our  www.nature.com/scientificreports/ primary metric used to compare the models is AP@IoU = 0.5:0.95, which is the average precision for IoU thresholds from 0.5 to 0.95 with a step size of 0.05. A few more evaluation metrics have been used in this paper, like AP for evaluating detection for different weapon sizes inside the image and determining whether the model is suitable for only large weapons, small weapons, or both. SSD models are evaluated on three different sizes, small, medium, and large, where small objects have area an (h*w) less than 32 2 pixel scale, medium objects have an area more than 32 2 and less than 96 2 pixels, and large objects have an area more than 96 2 pixels. However, the average precision for IoU threshold of 0.5 (AP@IoU = 0.5) is used to evaluate weapon detection models based on YOLO algorithms. The average recall metric is also used in the evaluation, given a fixed number of detections per image.
According to Tables 2 and 3, the average precision (AP) of the SSD MobileNet model surpasses the other two SSD models in detecting small weapons for datasets 1 and 2. For the average recall (AR) metric, the SSD ResNet50 model performs better than the SSD MobileNet and SSD Inception models in detecting small weapons on dataset 1. In contrast, the SSD MobileNet model performs better on dataset 2. However, for detecting medium and large objects, SSD MobileNet achieves better performance than other SSD models in both datasets 1 and 2. The primary metric used in comparing weapon detection models is AP@IoU = 0.5; by comparing the models, it is obvious that YOLOv4 surpasses other models with AP 0.795 on dataset 1, while the least efficient model is SSD Inception with AP around 0.375. In contrast with dataset 2, YOLOv5s is the best model for weapon detection with an average precision of 0.82, and the lowest-performing model is SSD Inception with an AP of 0.279. Figures 10 and 11 show detection model comparisons on datasets 1 and 2, respectively.
As mentioned previously, there are different model sizes for YOLOv5. In this paper, we trained both datasets on YOLOv5s, where "s" stands for a small-size model. Meanwhile, the x-large YOLOv5 model (YOLOv5x) achieves higher performance than YOLOv5s, with a mAP around 0.762 on dataset 1.
Confusion matrix. Dataset 1 contains 240 test images with four classes: knife, mask burglar, pistol, and rifle.
Moreover, dataset 2 contains 750 test images with two classes, pistol and knife. The confusion matrix in Fig. 12 shows that the mask burglar class had the highest rate of successful weapon detection (0.84), while the knife class had the lowest rate (0.4). Similarly, the most successful weapon detection rates as shown in confusion matrix in Fig. 13 were obtained for the pistol class at 0.84, and the least successful rates were for the knife class at 0.47.  www.nature.com/scientificreports/ It can be seen that around half of the knife weapons in both datasets are confused with the background. Also, around a fifth of the pistol images in both datasets was misclassified as background. In addition, 0.12 of rifle images were misclassified as pistol class in dataset 1. In comparison, 0.08 of rifle images were misclassified as knife images, and 0.04 of knife images were misclassified as pistol images.
In the confusion matrix, background FP indicates the model may think detection is a weapon when it is not, and background FN indicates the model misses detecting real weapons. By looking at previous confusion matrices, background FP refers to the model incorrectly predicting the weapons, and background FN refers to the model incorrectly predicting the negative class. Therefore, we can summarize our weapon detection model in the bank using a 2 × 2 confusion matrix in Fig. 14 that depicts all four possible outcomes.
Recall × precision. The precision-recall curve depicts the trade-off between precision and recall for different classes. A broad area under the curve indicates high recall and precision. A high recall is linked to a low false negative rate, but high accuracy is linked to a low false positive rate. The precision-recall graph for the YOLOv5s model on datasets 1 and 2 is shown in Figs. 15 and 16, respectively.
According to Fig. 15, it can be seen that the model is returning accurate results for detecting burglar mask items, with a mAP of 0.914. The model also performs quite well-detecting pistols and rifles, with a mAP of around 0.82. Meanwhile, the model poorly detects knife items with mAP 0.341 in real-time. Moving on to Fig. 16, it is evident that the performance model of detecting pistol items is higher than detecting knife class due to the quality of knife images in the dataset. F1 × confidence. The confidence score shows how probable it is that the bounding box contains the target object and how confident the weapon detection model is about it. The confidence score will be zero when no weapon items are detected in the bounding box. As a reminder, the IoU value can be obtained by dividing the shared area between the predicted bounding box and the ground-truth bounding box by the area of their union.  A bounding box, class, and confidence score represent the outcome of an object detector. Figures 17 and  18 show that precision increases when the confidence score increases while recall decreases. The detection is considered valid (positive) when its confidence is higher than a confidence threshold. Otherwise, it is a negative detection. The number of TPs and FPs decreases when the confidence threshold increases. Conversely, the recall measure decreased as the confidence threshold increased due to an increase in false negatives. For example, according to the F1-confidence curve shown in Fig. 21, the confidence value that optimizes the precision and recall is 0.345 for the YOLOv5 model on dataset 1, and the best F1 score for all classes is 0.74. Similarly, to Fig. 22, the best confidence value that optimizes precision and recall is 0.045 for the same model on dataset 2, and the best F1 score is 0.77.

Discussion
Weapon detection models were trained using several configurations based on model architecture. Many factors have an impact on model performance and results, such as different image resolutions, batch sizes, training steps, rescaling, and data augmentation options. The ideal parameters for each model should be investigated using a hyperparameter optimization pipeline in order to balance out or improve the results of weapon detectors.
The evaluation part is next; weapon detectors are evaluated based on the model's primary purpose. For example, if we want to use the proposed model as a weapon recognizer without considering box overlap, In that case, we can merely minimize the IoU threshold to close to zero while keeping the number of false negatives low and false positives high. In other words, the recall score will be high and the precision low. However, in the case of building a weapon detector instead of a weapon recognizer, we consider the location of a predicted box. Therefore, it would be recommended to evaluate the model using the COCO evaluation metric (AP@IoU = 0.5:0.95). This metric describes how the model works in the area of overlapping that we are most interested in.
Different models and classes all need a certain confidence threshold to boost their performance and thus maximize it. Figures 17, 18, 19 and 20 offer a tradeoff between confidence, precision, and recall to find the best confidence threshold for each class and model. It is hard to select a model based on a single metric. Thus, many metrics were used in the evaluation process. The proposed models fail to detect weapon items in images, as shown in Fig. 23. Figure 24 shows weapon detection results of footages from monitoring cameras in real life scenarios. www.nature.com/scientificreports/ Generally, low-quality images are one of the core problems of deep learning models. For example, almost all the CCTV footage in our dataset is grainy, blurry, monochromatic, and of poor quality. The main reason CCTV footage is of low quality is that the amount of data required to record at higher quality would be enormous; hence, more data storage is needed. In addition, outdated cameras, a weak internet connection, and poor lens conditions are factors responsible for low-quality images.
Besides poor-quality images, many challenges appear regarding the quality of the dataset. One of these challenges is background clutter, where the target object may blend into the environment, making it difficult to identify. Also, occlusion issue when a small portion of the target object is visible. Furthermore, viewpoint variation challenge where a target object can be rotated in any direction with respect to the observer or camera. Lastly, scale variation occurs when an object appears in images of different sizes.

Conclusion
To wrap up the results and come to a conclusion, the most suitable weapon detector for the task would be YOLOv4 for dataset 1 and YOLOv5s for dataset 2. This is because YOLO models perform better than SSD models in terms of detection speed and average precision. In addition, YOLO models are better options when the weapon size is small. This work studied several weapon detection models (SSD and YOLO) applied to the  www.nature.com/scientificreports/ video surveillance system. The two main objectives of the study were to compare the performance of five different models (SSD Inception, SSD MobileNet, SSD ResNet50, YOLOv4, and YOLOv5) and to examine the improvement of the detection quality by training on two datasets with different classes and number of images (Supplementary files 1, 2). The evaluation of the results in the five experiments was carried out on 240 test images in dataset 1 and 750 test images in dataset 2. According to the results in Chapter 4, we find that those weapon detectors using YOLOv4 outperform YOLOv5 and SSD models on dataset 1 with a mean average precision of 0.795. Meanwhile, YOLOv5 is the best option for dataset 2, with a mean average precision of around 0.82. Finally, unlike pistols in both datasets, the model's performance is poor in detecting knife items.       www.nature.com/scientificreports/