Intelligent Ammunition Detection and Classification System Using Convolutional Neural Network

Abstract: Security is a significant issue for everyone due to new and creative ways to commit cybercrime. Closed-Circuit Television (CCTV) systems are being installed in offices, houses, shopping malls, and on streets to protect lives. Operators monitor CCTV; however, it is difficult for a single person to monitor the actions of multiple people at one time. Consequently, there is a dire need for an automated monitoring system that detects a person with ammunition or any other harmful material. Based on our research and the findings of this study, we have designed a new Intelligent Ammunition Detection and Classification (IADC) system using a Convolutional Neural Network (CNN). The proposed system is designed to identify persons carrying weapons and ammunition using CCTV cameras. When weapons are identified, an alarm is sounded. In the proposed IADC system, a CNN was used to detect firearms and ammunition. The CNN is a Deep Learning technique, most commonly applied to analyzing visual imagery, that has gained popularity for classifying unstructured data such as images and videos. Additionally, this system generates an early warning through detection of ammunition before conditions become critical. Hence, the faster and earlier the prediction, the lower the response time, the losses, and the number of potential victims. The proposed IADC system provides better results than earlier published models like VGGNet, OverFeat-1, OverFeat-2, and OverFeat-3.


Introduction
In the current era, security has become a vital issue. Under the prevailing conditions of poor security, which have caused fear among people, everyone wants to secure their resources, their premises, and their organization's employees and clients. Consequently, CCTV systems are becoming more popular for providing additional security. They are typically installed in many public places such as roads, highways, offices, housing complexes, and shopping malls. Generally, CCTV systems can see weapons or other harmful items in a person's hand. However, the potential threat must be identified by a remote operator, who then may trigger a police response [1]. Moreover, the images recorded by CCTV might not be examined until after a criminal incident. For this reason, the system proposed in this study is designed to enable CCTV cameras to detect weapons and potential threats in real time.
The primary aim of this research was to minimize or eliminate the threat to security posed by weapons (e.g., pistols, automatic weapons, and knives) or explosives frequently used in criminal activities. In this research, we proposed an automated system to detect any dangerous weapons in the hand of an attacker or a terrorist. This system sounds an alarm that alerts the CCTV operator, who immediately informs the police or other agencies. It can help citizens before conditions worsen or the crime is carried out.
The rate of crimes involving weapons is of increasing global concern, specifically in those countries where gun ownership is legal. Early detection of weapons is needed to allow law enforcement agencies to take immediate action. One of the advanced solutions to this problem is to supplement surveillance or control CCTV cameras with an automatic system that detects a pistol, gun, or revolver and sounds an alarm once it detects any harmful object in an attacker's hand. CNN is also used to detect a firearm using a video recorder [2]. The ammunition detection system includes a firearm and a device for communication. The firearm contains a transceiver circuit, which detects the release of an ammunition round. The transceiver circuit produces an electromagnetic signal synchronized with the ammunition discharge. The communication device, which may be a mobile device or any handheld device like a radio set, detects the electromagnetic signal. The mode of communication can be simplex, half-duplex, or full-duplex. The communication device is paired with a geographic location sensor, for example, a GPS receiver [3]. The communication device contains software that generates an informational message with geographic information, which is transmitted to a remote device. Later, a converted image is obtained. Another visible light image, obtained using a transforming system, is included in this transformed image [4]. The object of interest is thereby identified in the field [5]. On identification, an indicator shows the object's location in the area of interest.
Techniques for monitoring a firearm and generating a warning when the firearm is outside the designated permit area have been defined for various scenarios. A location device, such as a Global Positioning System (GPS) receiver coupled to a transmitter, is connected to or associated with a firearm. The identified location of the weapon is transmitted to a location service module via the location device. The location service module compares at least one designated authorized location with the current firearm location and determines whether the firearm is within the authorized permit area. If it is not, an alert is generated. A multi-modal security checkpoint is a protection technique that can detect any concealed metallic (weapon or shrapnel), non-metallic (explosives or Improvised Explosive Device (IED)), or radioactive nuclear threats. Furthermore, long-range facial recognition of potential terrorists can be carried out at the security checkpoint. The checkpoint integrates many threat-detection technologies into a single unit designed to be stable and to detect a wide range of threats, including concealed weapons, explosives, bombs, and other threats [6].
The importance of security can never be neglected, especially when technological advances have broadened criminals' opportunities to commit crimes. After evaluating their security issues, many corporations have started investing in systems to secure their facilities. New visual systems that include handheld and web platforms to reduce response times are required to enhance security. Such systems emphasize the user's perceived load and working-memory efficiency: visualizations are optimized when the perceived load is reduced and working-memory efficiency increases [7]. In Cloud computing, security-as-a-service is the most significant change in the field of information and corporate security. The accuracy of live streaming data depends on several factors.
Consequently, many high-technology security systems are available. The Security Information and Event Management (SIEM) system was developed to collect, analyze, aggregate, normalize, store, and purify event logs. It also correlates data from traditional security systems such as intrusion detection/avoidance, firewalls, anti-malware, and others installed in both the host and network domains [8].

Previous Studies
Almotaeryi conducted research on automated CCTV surveillance [9] in which he compared different solutions to data augmentation in image classification using deep learning. Perez et al. developed a method, called neural augmentation, in which a neural network learns augmentations that improve the classification process. There is an extensive discussion of this technique's pros and cons on different datasets [10]. Deperlıoglu analyzed an effective and successful method to diagnose diabetic retinopathy from retinal fundus images using image processing and deep learning techniques. A CNN was used to classify the images [11]. In the information communication system, the denoising of the image is a modern and practical approach due to the image-filtering algorithm. However, that algorithm is not always efficient for the nature of the noise spectrum. Sheremet et al. presented the possibilities of denoising using a convolutional neural network while transferring graphical content in the information communication system proposed in their study. They concluded that using a denoising convolutional neural network creates the correct signal but sends noisy images [12].
Zhang et al. [13] presented the feedforward De-noising Convolutional Neural Network (DnCNN), which builds on progress in deep architectures, learning algorithms, and regularization techniques. Training is accelerated by residual learning and batch normalization strategies. Their DnCNN model uses a residual learning strategy that accommodates many image denoising tasks. Cha et al. [14] discussed a vision-based technique for recognizing concrete cracks without calculating imperfect features, with the help of a CNN architecture, since a CNN can learn image features automatically. This technique extracts features without the help of Image Processing Techniques (IPT); its efficiency was compared with that of the traditional Sobel and Canny edge recognition techniques, and the CNN performed significantly better.
Yang et al. [15] presented a new technique for super-resolution called the multi super-resolution convolutional neural network, inspired by the development of the GoogLeNet architecture. This method applies parallel convolution filters of various sizes to low-resolution license plate imagery, with a concatenation layer that blends the features. Finally, the method rebuilds the high-resolution image using nonlinear mapping.
Tsoutsa et al. [16] are working on artistic styles using a neural algorithm that can separate and recombine the image content and the style of natural images. Recently, Leon A. Gatys and Alexander S. Ecker described Image Style Transfer Using Convolutional Neural Networks and feature extraction. Handa et al. [17] and Simonyan et al. [18] worked on convolutional network depth and its impact on accuracy in the large-scale image setting. They used tiny (3 × 3) convolutional filters and improved on prior-art configurations by pushing the depth to 16-19 weight layers.
Deep learning techniques play a vital role as an essential alternative, overcoming the difficulties of feature-based approaches. Araújo et al. [19] described a method that classified hematoxylin and eosin-stained breast biopsy images using a CNN. A Faster R-CNN model has been trained on bigger datasets [20]. In the last two years, deep learning methods have improved rapidly for general object detection. Various methods of facial recognition are still based on R-CNN, resulting in limited accuracy and processing speed. Jiang et al. [21] analyzed the Faster R-CNN implementation, which produced impressive results in various object detection benchmarks [17,22].

Proposed Intelligent Ammunition Detection and Classification (IADC) System
The proposed Intelligent Ammunition Detection and Classification (IADC) system uses a Convolutional Neural Network. Fig. 1 illustrates the image stream acquired from various CCTV cameras at different locations. These CCTV cameras transmit captured images to the object layer through the Cloud. Due to moving objects, the captured images may be blurred or noisy, so a preprocessing layer is required for image enhancement to convert them into high-quality images. The object layer therefore sends the images to the preprocessing layer, which in turn sends the enhanced image streams to the Convolutional Neural Network model. The CNN model classifies the object as either with ammunition or without ammunition. If ammunition is detected, the model notifies the observer of the object and sounds an alarm. If no weapon or ammunition is detected, there is no alarm.
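The data flow described above can be outlined in a few lines of Python. This is only an illustrative sketch: the names `preprocess` and `monitor` are hypothetical, the preprocessing step is a placeholder, and the classifier passed in is a stand-in stub rather than the trained CNN.

```python
from typing import Callable, Iterable, List

import numpy as np

Frame = np.ndarray  # one image held as an array


def preprocess(frame: Frame) -> Frame:
    """Placeholder for the preprocessing layer: scale pixel values to [0, 1]."""
    return frame.astype(np.float64) / 255.0


def monitor(frames: Iterable[Frame],
            classify: Callable[[Frame], bool],
            alarm: Callable[[int], None]) -> List[bool]:
    """Pass every frame through preprocessing and classification.

    `classify` returns True when ammunition is detected. In that case
    `alarm` is called with the frame index; otherwise nothing happens.
    """
    results = []
    for index, frame in enumerate(frames):
        detected = classify(preprocess(frame))
        if detected:
            alarm(index)
        results.append(detected)
    return results


if __name__ == "__main__":
    # Stand-in classifier (NOT the CNN): flags frames with a high mean intensity.
    def stub_classifier(image: Frame) -> bool:
        return bool(image.mean() > 0.5)

    frames = [np.zeros((4, 4), dtype=np.uint8),
              np.full((4, 4), 255, dtype=np.uint8)]
    alerts: List[int] = []
    print(monitor(frames, stub_classifier, alerts.append), alerts)
```

Passing the classifier and the alarm as arguments keeps the sketch independent of any particular model or notification mechanism.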

Sensing Layer
There are N-sensing layers located in different positions, and each layer contains multiple cameras. The N-sensing layers transmit captured image streams to the object layer through the Cloud.

Object Layer
The captured image stream needs to be stored in a specific location. Image streams coming from different sensing layers are stored in object layers. The object layers combine all stream data inputs at a single point.

Pre-Processing Layer
The input image stream may contain noise and blurriness as a result of low quality. Such a stream consists of raw data that cannot produce good results in image classification. The preprocessing layer transforms the raw images into high-quality images by removing noise and blurriness. Different filters are used to remove this noise and blurriness, and the filtered images are the inputs of the CNN. Fig. 2 shows the preprocessing of the input raw data stream.

Image Noise Model
Image streams may be blurred or noisy. The additive and multiplicative noisy image models are given in Eqs. (1) and (2), respectively.
$$w(x) = s(x) + n(x) \tag{1}$$
$$w(x) = s(x) \times n(x) \tag{2}$$
where $s(x)$ is the original image, $n(x)$ is the noise, and $w(x)$ is the noisy image.
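As a minimal sketch of the two noise models, the NumPy code below writes `s` for the original image, `n` for the noise, and `w` for the noisy image; these symbol names, the toy image, and the noise parameters are assumptions made for illustration only.

```python
import numpy as np


def additive_noise(s: np.ndarray, n: np.ndarray) -> np.ndarray:
    """Additive model: w = s + n."""
    return s + n


def multiplicative_noise(s: np.ndarray, n: np.ndarray) -> np.ndarray:
    """Multiplicative model: w = s * n (element-wise)."""
    return s * n


if __name__ == "__main__":
    rng = np.random.default_rng(0)           # fixed seed, for repeatability only
    s = np.full((3, 3), 100.0)               # toy "original image"
    n_add = rng.normal(0.0, 5.0, s.shape)    # zero-mean noise for the additive case
    n_mul = rng.normal(1.0, 0.05, s.shape)   # unit-mean noise for the multiplicative case
    print(additive_noise(s, n_add))
    print(multiplicative_noise(s, n_mul))
```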

Gaussian Noise
The Gaussian noise model is very popular because of its simple application. When other noise models fail, the Gaussian noise model can be applied. Eq. (3) is the mathematical representation of the Gaussian noise model.
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{3}$$
where $x$ is the gray value, $\sigma$ is the standard deviation, and $\mu$ is the mean.
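The Gaussian density can be evaluated directly from its standard form, with `x` the gray value, `mu` the mean, and `sigma` the standard deviation. The following is a small sketch; the parameter values in the demonstration are arbitrary.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density of gray value x with mean mu and std sigma."""
    x = np.asarray(x, dtype=np.float64)
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))


if __name__ == "__main__":
    gray_levels = np.arange(256)                       # 8-bit gray values
    density = gaussian_pdf(gray_levels, mu=128.0, sigma=20.0)
    print(gray_levels[np.argmax(density)])             # location of the peak
    print(density.sum())                               # approximate area under the curve
```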

Impulse Valued Noise
The black and white dots on the image are called salt and pepper or impulse valued noise. In Fig. 3, the centered value 200 is replaced by the value 0. Progressively, dark pixel values are replaced by white pixel values and vice versa.
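Impulse noise of this kind, and one common way of removing it, can be sketched as follows. The 3 × 3 median filter is chosen here purely for illustration; the text does not state which filters the preprocessing layer uses, and the function names are hypothetical.

```python
import numpy as np


def add_salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of the pixels to 0 (pepper) or 255 (salt)."""
    noisy = image.copy()
    mask = rng.random(image.shape)          # uniform values in [0, 1)
    noisy[mask < amount / 2] = 0            # pepper
    noisy[mask > 1 - amount / 2] = 255      # salt
    return noisy


def median_filter_3x3(image):
    """3 x 3 median filter; borders are handled by replicating edge pixels."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # Stack the nine shifted views of the image and take the per-pixel median.
    windows = np.stack([padded[r:r + h, c:c + w]
                        for r in range(3) for c in range(3)])
    return np.median(windows, axis=0).astype(image.dtype)


if __name__ == "__main__":
    image = np.full((5, 5), 200, dtype=np.uint8)
    image[2, 2] = 0                          # a single "pepper" pixel at the center
    print(median_filter_3x3(image))
```

The demonstration mirrors the situation described for Fig. 3, where a center value of 200 has been replaced by 0; the median of the neighborhood restores it.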

Convolutional Neural Network Model
Today, authorities attempt to resolve most issues by seeking help from computer professionals using Artificial Intelligence (AI) methods. AI is a broad field applied in almost every aspect of life. Machine learning, a subset of AI, solves different problems by applying various algorithms such as k-nearest neighbors, linear regression, decision trees, logistic regression, Support Vector Machines, random forests, and neural networks. Deep learning, a subset of machine learning, is used for image classification. Within deep learning, the CNN is a powerful model for object classification. It is a network of different sequentially connected layers.
These include convolution layers, in which the convolution process occurs. A CNN typically has multiple convolution and pooling layers, and normalization layers may also be included; the layers do not necessarily follow a fixed order. Fig. 4 shows the complete model used in the IADC system with a CNN.
To extract the features or objects for the next layer, the convolutional layer uses a kernel (filter) matrix. There are different methods to obtain the features of an image using a kernel. The feature map values are computed as the sum of the element-wise products of the input matrix and the kernel, which is the dot product of the kernel with the image patch it currently covers:
$$net(i, j) = (\xi * \phi)(i, j) = \sum_{m}\sum_{n} \xi(i + m, j + n)\, \phi(m, n) \tag{4}$$
where $net(i, j)$ represents the output image, $\xi$ is the input image, $\phi$ is the kernel or filter matrix, and $*$ is the convolution. The core building block of a CNN is the convolutional layer, which is used for feature detection.
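The sliding-window computation of a feature map can be sketched in NumPy as below, for a single-channel image without padding. Like most deep learning libraries, the sketch does not flip the kernel (strictly speaking it is a cross-correlation); flipping the kernel first with `kernel[::-1, ::-1]` gives convolution in the strict sense. The function name `conv2d_valid` is hypothetical.

```python
import numpy as np


def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel
    return out


if __name__ == "__main__":
    image = np.arange(49, dtype=np.float64).reshape(7, 7)   # toy 7 x 7 input
    kernel = np.ones((3, 3)) / 9.0                          # 3 x 3 averaging kernel
    feature_map = conv2d_valid(image, kernel)
    print(feature_map.shape)
```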
Let's assume the image is a 7 × 7 matrix with RGB channels. A kernel (also called a feature detector or window) of size 5 × 5 with three (R, G, B) channels and stride 1 is scanned over the image.
In Fig. 6, a 3 × 3 kernel moves over the 7 × 7 image matrix with a stride of one. The dimension of the output feature map can then be calculated by Eq. (5).
$$O = \frac{W - F + 2P}{S} + 1 \tag{5}$$
where $O$ is the output feature map size, $W$ is the image size, $F$ is the kernel size, $P$ is the padding, and $S$ is the stride.
Figure 6: Producing feature mapping results with 3 × 3 kernel size
With $W = 7$, $F = 3$, $P = 0$, and $S = 1$, Eq. (5) gives $(7 - 3 + 0)/1 + 1 = 5$; therefore, the dimension of the output feature map is 5 × 5.
The Rectified Linear Unit (ReLU) is an activation function commonly used in CNNs. It is defined mathematically, as shown in Eqs. (6), (7).
$$f(x) = \max(0, x) \tag{6}$$
$$f(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{7}$$
One drawback of the ReLU function is that it is not differentiable at the origin; therefore, backpropagation training must assign its derivative at that point by convention.
After the convolutional layers, a max-pooling layer was used to reduce the spatial dimensions of the input stream; that is, the height and width of the images were reduced. As shown in Fig. 8, the CNN layer of the IADC system used max-pooling with a 2 × 2 filter size and stride 2.
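The three operations discussed here (the feature-map size of Eq. (5), the ReLU activation, and 2 × 2 max-pooling with stride 2) can be sketched in NumPy as follows. The function names are hypothetical, and setting the ReLU derivative at the origin to 0 is a common convention assumed here, not something stated in the text.

```python
import numpy as np


def output_size(w, f, p=0, s=1):
    """Feature-map size along one dimension: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def relu(x):
    """ReLU activation: max(0, x), applied element-wise."""
    return np.maximum(0, x)


def relu_grad(x):
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return (np.asarray(x) > 0).astype(np.float64)


def max_pool_2x2(x):
    """2 x 2 max-pooling with stride 2 (height and width must be even)."""
    h, w = x.shape
    # Split into non-overlapping 2 x 2 blocks, then take each block's maximum.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


if __name__ == "__main__":
    print(output_size(7, 3, p=0, s=1))                  # 7 x 7 image, 3 x 3 kernel
    activations = relu(np.array([[-1.0, 2.0, 0.5, -3.0],
                                 [4.0, -2.0, 1.0, 0.0],
                                 [0.0, 1.5, -0.5, 2.5],
                                 [3.0, -1.0, 0.2, 0.1]]))
    print(max_pool_2x2(activations))                    # halves height and width
```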
Finally, the output of the fully connected layer becomes the input for the SoftMax layer, which produces the results of the classification layer.

Mathematical Model of CNN Loss
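Let $y$ and $Y$ denote the vector of values estimated by the convolutional network and the vector of original (target) outputs, respectively. The difference between $y_i$ and $Y_i$ is called the loss, which may be calculated by different methods. The loss is used in backpropagation to update the weights, because the calculated values $y_i$ are required to be close to the original outputs $Y_i$. In the mathematical model, the aim is to backpropagate by taking the derivatives of Eq. (8) with respect to the weights or filters, $\partial L / \partial W$, and the bias, $\partial L / \partial b$. The new weights and bias are then obtained using the gradient descent algorithm,
$$W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}, \qquad b_{new} = b_{old} - \alpha \frac{\partial L}{\partial b}$$
where $\alpha$ is the learning rate parameter. The cross-entropy loss is used for classification:
$$L = -\sum_{i=1}^{c} Y_i \log(y_i) \tag{8}$$
where $c$ is the number of classes, depending upon the implementation.
The SoftMax conversion is given in Eq. (9),
$$y_i = \frac{e^{Z_i}}{\sum_{k=1}^{c} e^{Z_k}} \tag{9}$$
where $Z_i$ represents the logits or output units, which are converted into probabilities via the SoftMax conversion.
Each logit $Z_l$ is obtained from the weights $W_{jl}$ and the inputs $X_j$:
$$Z_l = \sum_{j=1}^{n_{out}} (W_{jl} * X_j) \tag{10}$$
The derivative of the loss with respect to a weight follows from the chain rule as a product of derivatives,
$$\frac{\partial L}{\partial W_{jl}} = \sum_{i=1}^{c} \frac{\partial L}{\partial y_i}\, \frac{\partial y_i}{\partial Z_l}\, \frac{\partial Z_l}{\partial W_{jl}} \tag{11}$$
where $\partial y_i / \partial Z_l$ is the SoftMax derivative. Collecting these derivatives for $j = 1$ to $n_{out}$ and for $l = 1$ to $c$ gives the complete weight gradient.
In Eq. (8), the loss takes $y_i$ as its argument and is therefore only indirectly related to $Z_l$, through Eq. (12).
$$y_i = \frac{e^{Z_i}}{\sum_{k=1}^{c} e^{Z_k}}, \qquad Z_l = \sum_{j=1}^{n_{out}} (W_{jl} * X_j) \tag{12}$$
Two cases must be distinguished: Case 1, $i = l$, and Case 2, $i \neq l$. Here $l$ is the pivot neuron among the SoftMax output neurons; typically, the pivot neuron has the highest value and the rest are close to zero, so a unit with $i \neq l$ has a low probability.
Case 1 ($i = l$): The derivative of Eq. (12) with respect to $Z_l$, via the quotient rule, is
$$\frac{\partial y_i}{\partial Z_l} = \frac{e^{Z_l} \sum_{k=1}^{c} e^{Z_k} - e^{Z_l} e^{Z_l}}{\left(\sum_{k=1}^{c} e^{Z_k}\right)^2} = y_l (1 - y_l) \tag{13}$$
Case 2 ($i \neq l$): The derivative can be written as
$$\frac{\partial y_i}{\partial Z_l} = \frac{0 - e^{Z_i} e^{Z_l}}{\left(\sum_{k=1}^{c} e^{Z_k}\right)^2} = -y_i\, y_l \tag{14}$$
We can summarize Eqs. (13), (14) as
$$\frac{\partial y_i}{\partial Z_l} = \begin{cases} y_i (1 - y_l), & i = l \\ -y_i\, y_l, & i \neq l \end{cases}$$
Because the cross-entropy has no explicit component of $Z_l$, the partial derivative of $\log(y_k)$ with respect to $Z_l$ is taken. Taking the derivative of the cross-entropy loss, the equation becomes
$$\frac{\partial L}{\partial Z_l} = -\sum_{k=1}^{c} Y_k \frac{\partial \log(y_k)}{\partial Z_l} = -\sum_{k=1}^{c} \frac{Y_k}{y_k}\, \frac{\partial y_k}{\partial Z_l} \tag{15}$$
where $\partial y_k / \partial Z_l$ was computed earlier for the SoftMax gradient. Again there are two cases, $k = l$ and $k \neq l$, so Eq. (15) is divided into two parts:
$$\frac{\partial L}{\partial Z_l} = -\frac{Y_l}{y_l} * y_l (1 - y_l) + \sum_{k \neq l} \frac{Y_k}{y_k} * y_k\, y_l \tag{16}$$
We can simplify this as
$$\frac{\partial L}{\partial Z_l} = -Y_l + Y_l\, y_l + \sum_{k \neq l} Y_k\, y_l = -Y_l + y_l \sum_{k=1}^{c} Y_k$$
Since the target values sum to one, $\sum_{k=1}^{c} Y_k = 1$, we can further simplify this as $\partial L / \partial Z_l = y_l - Y_l$. Combining this with $\partial Z_l / \partial W_{jl} = X_j$ gives
$$\frac{\partial L}{\partial W_{jl}} = (y_l - Y_l)\, X_j \tag{17}$$
Eq. (17) is the derivative of the loss with respect to the weights of the fully connected layer. Once $\partial L / \partial W_{jl}$ is obtained, the updated weights are achieved by applying gradient descent to the fully connected layer.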
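The result of this derivation can be checked numerically. For a SoftMax output with cross-entropy loss, the standard closed form of the weight gradient is $\partial L / \partial W_{jl} = (y_l - Y_l) X_j$. The sketch below compares that expression with a central finite-difference estimate and then takes one gradient descent step. It assumes a bias-free fully connected layer; the input values, the layer size, and the learning rate are arbitrary choices for the demonstration.

```python
import numpy as np


def softmax(z):
    """SoftMax conversion of logits into probabilities."""
    e = np.exp(z - np.max(z))      # shifting by max(z) avoids overflow
    return e / e.sum()


def cross_entropy(y, target):
    """Cross-entropy loss L = -sum_i Y_i * log(y_i)."""
    return -np.sum(target * np.log(y))


def loss_from_weights(W, x, target):
    """Loss of a bias-free fully connected layer with Z_l = sum_j W[j, l] * x[j]."""
    return cross_entropy(softmax(x @ W), target)


def analytic_grad(W, x, target):
    """Closed-form gradient: dL/dW[j, l] = (y_l - Y_l) * x_j."""
    y = softmax(x @ W)
    return np.outer(x, y - target)


def numeric_grad(W, x, target, eps=1e-6):
    """Central finite-difference estimate of dL/dW, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (loss_from_weights(W + step, x, target)
                     - loss_from_weights(W - step, x, target)) / (2 * eps)
    return grad


if __name__ == "__main__":
    x = np.array([1.0, 2.0, -1.0, 0.5])    # inputs X_j to the layer (arbitrary)
    W = np.zeros((4, 2))                   # weights W_jl for c = 2 classes
    target = np.array([1.0, 0.0])          # one-hot target Y
    g = analytic_grad(W, x, target)
    print(np.allclose(g, numeric_grad(W, x, target), atol=1e-6))
    alpha = 0.1                            # learning rate, arbitrary for the demo
    W_new = W - alpha * g                  # one gradient descent update
    print(loss_from_weights(W, x, target), loss_from_weights(W_new, x, target))
```

Agreement between the two gradients gives some confidence in the closed form, and the second line of output shows the loss before and after a single update.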

Simulation and Results
MATLAB was used to simulate the proposed IADC system using a CNN. The dataset used for the simulation contained 920 images showing persons with and without weapons. The dataset was further divided into training (700) and validation (220) sets. Figs. 9 and 10 show the accuracy and extent of loss in the proposed system. Training accuracy achieved a level of 99.41%; validation accuracy was 96.74%. Fig. 10 shows the training and validation performance of the proposed IADC system related to iteration and loss. We observed that the proposed system resulted in loss values of 0.01 during training and 0.09 during validation. Fig. 11 shows randomly selected labeled output images produced by the proposed system. The results showed that the proposed IADC system classified persons into two classes: with ammunition and without ammunition. Tab. 1 compares the performance of the proposed IADC system with previously published models. The results show that, of the previously published models, the OverFeat-3 system [5] offered the highest accuracy, with 93% during training and 89% during testing. In comparison, the proposed IADC system achieved 99.41% during training and 96.74% during testing. Thus, the IADC system provided more accurate results than previously published methods like VGGNet, OverFeat-1, OverFeat-2, and OverFeat-3 [5].

Conclusions and Future Studies
In conclusion, the application of a Convolutional Neural Network in security systems offers more precise detection of armed persons and weapons. Additionally, it has given more accurate results than previously published methods such as VGGNet, OverFeat-1, OverFeat-2, and OverFeat-3. The proposed IADC system achieved a 96.74% accuracy rate and a 3.26% miss rate during validation.
In future studies, the YOLO model can be evaluated and compared with the proposed system to obtain more precise results. Moreover, improved precision in detecting and classifying a wider variety of weapons and ammunition can be pursued.