Identifying threat objects using faster region-based convolutional neural networks (Faster R-CNN)

Automated detection of threat objects in security X-ray images is vital to preventing unwanted incidents in busy places such as airports, train stations, and malls. The manual method of threat object detection is time-consuming and tedious. Also, the person on duty can overlook threat objects due to the limited time for checking every person's belongings. As a solution, this paper presents a faster region-based convolutional neural network (Faster R-CNN) object detector that automatically identifies threat objects in an X-ray image using the IEDXray dataset. The dataset is composed of scanned X-ray images of improvised explosive device (IED) replicas without the main charge. This paper extensively evaluates the Faster R-CNN architecture in threat object detection to determine which configuration improves detection performance. Our findings show that the proposed method can identify three classes of threat objects in X-ray images. In addition, the mean average precision (mAP) of the threat object detector can be improved by increasing the input image's resolution, at the cost of detection speed. The threat object detector achieved 77.59% mAP and recorded an inference time of 208.96 ms when the input image was resized to 900 × 1536 resolution. Results also showed that increasing the number of bounding box proposals did not significantly improve detection performance. Using 150 bounding box proposals achieved only 75.65% mAP, and doubling the number of proposals reduced the mAP to 72.22%.


Introduction
Terrorist attacks in many countries result in injuries and deaths among civilians and even military personnel [1]. In the Philippines, this problem is also prominent due to recent terrorist attacks [2] involving improvised explosive devices (IEDs). An IED is a homemade explosive device designed by perpetrators to harm people. Generally, an IED contains a power source, a switch, an initiator, wires, and a main charge. The power source, commonly a 9-volt battery, supplies power to the initiator (electric or non-electric) to start the detonation of the main charge. The arming or firing of the IED is controlled by the switch.
In the Global Terrorism Index 2022, the Philippines was listed among the top 20 countries most impacted by terrorism [3]. As a safety measure, tightened security is strictly implemented in public transport systems such as airport terminals and train stations, as well as in commercial establishments. Pieces of baggage are scanned using an X-ray machine to identify the objects inside and look for threats such as explosives and bladed weapons. Although this process is effective, the possibility of missed detection is high during rush hour because of the limited time to scan thousands of bags and identify threat objects [4]. As a solution, this paper uses the Faster Region-based Convolutional Neural Network (Faster R-CNN) to identify threat objects (e.g., battery, mortar, wires) in an X-ray image to aid the operator in deciding whether a piece of baggage poses a threat. Faster R-CNN [5] is a deep learning-based object detector from the family of region-based convolutional neural networks that introduces the Region Proposal Network (RPN). This network accepts a feature map and outputs object proposals (bounding boxes) with corresponding objectness scores.
To date, several studies in the computer vision field have explored Faster R-CNN in many different applications such as vehicle detection [6], disease detection [7], [8], face detection [9], [10], ship detection [11], [12], metal object detection [13], radar images [14], defect detection [15], [16], object detection in medical images [17], [18], and autonomous driving [19]. Although many researchers have successfully implemented Faster R-CNN for object detection, few studies [20] have explored this detector for X-ray images due to the limited data available and the complicated procedures for collecting X-ray images. Some researchers used different approaches [21], such as improved Mask R-CNN [22], X-ray proposal and discriminative networks [23], and a multi-view branch-and-bound search algorithm [24], for object detection in X-ray images. Researchers in [25] and [26] were able to implement a deep learning-based object detector for identifying threat objects such as IEDs. However, a detailed evaluation is still needed to determine the right configuration and trade-offs.
The contributions of this paper are as follows: (a) an extensive evaluation of the Faster R-CNN architecture in threat object detection, (b) an investigation of how the number of bounding box proposals and the image resolution affect the performance of the threat object detector, and (c) experiments on how to improve the performance of the threat object detector in terms of mean average precision (mAP) and speed.

Method
The overview of the Faster R-CNN architecture for identifying threat objects is shown in Fig. 1. The input is an X-ray image with corresponding class labels and bounding boxes. X-ray images undergo preprocessing steps such as resizing and augmentation before feature extraction. Data augmentation performs random geometric transformations on the image to increase the training data. Features are extracted using a CNN via transfer learning with ResNet-101 [27] as the base network. The RPN module accepts anchor boxes and looks for possible objects in the image. The anchor boxes serve as references at multiple scales (e.g., 64 × 64, 128 × 128, and 256 × 256) and aspect ratios (e.g., 1:1, 2:1, 1:2). Each sliding window contains nine anchor boxes centered at every position. The RPN module then computes an objectness score for each anchor and proposes regions where objects are likely located. The objectness score measures the probability that an anchor contains an object. The output of the RPN module is a set of bounding box proposals, each with an objectness score. The region of interest (ROI) pooling module accepts the top N proposals from the RPN module and extracts fixed-size windows of ROI features from the feature maps. N was varied from 10 to 450 to determine its effect on detection performance. The ROI pooling module resizes the feature map to 14 × 14 × D, where D is the depth of the feature map. When max pooling is applied with a stride of 2, the result is a 7 × 7 × D feature vector that is fed to two fully connected (FC) layers and then to two sibling output layers that yield the class label and bounding box. The class label output C has four dimensions (3 classes + 1 background), namely battery, mortar, and wires, while the bounding box output has twelve dimensions (4 coordinates × 3 classes).
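The nine anchors per sliding-window position described above can be sketched as follows. This is an illustrative implementation, not the paper's code; the function name and the height/width parameterization (area preserved at scale², ratio taken as height/width) are our assumptions.

```python
# Illustrative sketch: 3 scales x 3 aspect ratios = 9 anchor boxes
# centered at one sliding-window position, as described in the text.
def generate_anchors(cx, cy, scales=(64, 128, 256), ratios=(1.0, 2.0, 0.5)):
    """Return 9 anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor keeps the area scale x scale; `ratio` is height/width,
    so ratios (1.0, 2.0, 0.5) correspond to 1:1, 2:1, and 1:2 boxes.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / (r ** 0.5)  # width shrinks as ratio (h/w) grows,
            h = s * (r ** 0.5)  # so that w * h == s * s for every ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

boxes = generate_anchors(32, 32)
print(len(boxes))  # 9 anchors for this position
```

The RPN scores each of these anchors for objectness and regresses them toward nearby ground-truth boxes; repeating this at every feature-map position yields the proposal set.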

Dataset
Dataset collection was done using a dual-view X-ray machine. A video recorder was used to capture the X-ray images projected on the computer monitor. The images were collected by extracting one out of every five frames (20%) in a given video file to ensure that each extracted image differed from the previous one. For example, from a 60-second video with a frame rate of 30 frames per second (fps), 360 images are extracted. Once extracted, the images were manually selected based on their clarity and quality. Finally, the images were labeled according to class using LabelImg [28]. The dataset, called IEDXray [25] and shown in Fig. 2, is composed of X-ray images of IED replicas without the main charge. The left part of the figure shows the one-channel (grayscale) histogram of a sample X-ray image. The histogram shows that the pixel intensities of the image are concentrated approximately between 200 and 255 (white pixels). This dataset contains the basic circuitry of an IED without explosive material. Six IED types were scanned in the X-ray machine.
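The one-out-of-five frame-sampling rule above can be sketched as a simple index computation. The helper name is ours; the actual extraction in the study was performed on the recorded video files.

```python
# Illustrative sketch of the frame-sampling rule: keep 1 of every 5 frames.
def sampled_frame_indices(duration_s, fps, keep_every=5):
    """Indices of the frames to keep when sampling 1 of every `keep_every`."""
    total_frames = duration_s * fps
    return list(range(0, total_frames, keep_every))

indices = sampled_frame_indices(60, 30)
print(len(indices))  # 360 images from a 60-second, 30-fps video
```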

Training and Evaluation
Faster R-CNN was trained using stochastic gradient descent (SGD) with momentum. Momentum is a method used to improve convergence speed and reduce oscillation [29]. Several hyperparameter values were tried during the experiments using a manual search. The highest mAP was achieved using the following hyperparameter values: learning rate = 0.0003, momentum = 0.9, batch size = 1. Regularization was also added to the model to increase the mAP by augmenting the data passed into the network for training; data augmentation served as an implicit regularizer [30]. Each experiment was trained for 20,000 steps. The IEDXray dataset was divided into train and test sets consisting of 1,209 and 134 images, respectively. The evaluation metric used to measure the performance of Faster R-CNN in threat object detection was based on the PASCAL VOC metric [31], which uses equations (1), (2), and (3) to compute the mean average precision (mAP).
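The SGD-with-momentum update used for training (learning rate 0.0003, momentum 0.9) can be sketched on a single scalar parameter. This is a minimal illustration of the update rule, not the training code; the toy objective f(w) = w² is our own example.

```python
# Minimal sketch of one SGD-with-momentum update:
#   v <- momentum * v - lr * grad;   w <- w + v
def sgd_momentum_step(w, grad, velocity, lr=0.0003, momentum=0.9):
    """Apply one momentum update and return the new (w, velocity)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy example: minimize f(w) = w^2, whose gradient is 2w.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, 2.0 * w, v)
print(w)  # w moves toward the minimum at 0
```

The accumulated velocity term is what dampens oscillation and speeds convergence relative to plain SGD.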
Intersection over Union (IoU) is calculated by dividing the area of intersection between the ground-truth box XG and the predicted bounding box XP by the area of their union, as shown in (1). To be considered a correct detection or true positive (TP), a detection must have IoU > 0.5 [31]; otherwise, it is a false positive (FP). A false negative (FN) is recorded for each undetected ground truth. Given that the average precision AP is the precision P averaged across all recall R values between 0 and 1, the mAP in (4) is computed by averaging the AP over all C classes (3 classes). The classes were battery, mortar, and wires.
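The IoU computation in (1) can be written directly; boxes here are (x1, y1, x2, y2) tuples, a representation we assume for illustration.

```python
# Sketch of equation (1): intersection area divided by union area.
def iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping by half: IoU = 50 / 150 ≈ 0.333,
# below the 0.5 threshold, so this would count as a false positive.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```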

Hardware and Software Setup
All of the experiments were conducted on a desktop computer with an Intel Core i7-9700K 3.6 GHz 8-core processor, 16 GB RAM, and an NVIDIA RTX 2070 8 GB graphics processing unit, running Ubuntu 18.04 LTS with the TensorFlow framework.

Results and Discussion
In this research, two important parameters of Faster R-CNN were investigated: the number of bounding box proposals generated by the RPN and the resolution of the input image, using the IEDXray dataset described in the previous section.

Bounding Box Proposals
In this experiment, the number of bounding box proposals was varied between 10 and 450 to explore the trade-off. Table 1 shows the mAP and evaluation time (per image) of Faster R-CNN for different numbers of bounding box proposals. The mean average precision (mAP) was calculated at the last training step (20,000), while the evaluation time was measured by averaging the time taken to evaluate the test data. The notation APbattery (and similarly for the other classes) denotes the average precision of each class. The table shows that changing the number of bounding box proposals in each training run results in different mAP values. The highest value was achieved using 150 bounding box proposals (75.65%), with a small difference from 75 bounding box proposals (75.10%). Interestingly, using 75 bounding box proposals reduces the evaluation time by 29.85 ms (22.22%) while maintaining a mAP comparable to 150 bounding box proposals. On the other hand, increasing the number of bounding box proposals from 150 to 450 resulted in a 1.16% decrease in mAP. Therefore, increasing the number of bounding box proposals does not always improve the mAP of the object detector. The precision and recall for each class using 150 bounding box proposals are shown in Table 2. Faster R-CNN detected the mortar with high precision (96.67%) and high recall (100%), whereas the wires were detected less accurately, with 87.41% precision and 65.10% recall. The performance of Faster R-CNN for each number of bounding box proposals during evaluation is shown in Fig. 3. Using only 10 bounding box proposals significantly reduces the performance of the object detector. The inference time for each number of bounding box proposals was also evaluated. The comparison of mAP versus time for different numbers of bounding box proposals is presented in Fig. 4.
Using 450 bounding box proposals gives the slowest inference time, while 10 bounding box proposals give the fastest inference but the lowest mAP. The graph indicates that 75 bounding box proposals provide the best trade-off between speed and mAP.

Image Resolutions
In this experiment, the input image resolution was varied between 150 × 256 and 900 × 1536, and Faster R-CNN was trained at each resolution. Table 3 shows the performance of Faster R-CNN at different image resolutions. The aspect ratio of all resolutions was fixed (75/128), while the number of proposals was 300. The table shows that the mAP increases as the image resolution increases. The highest mAP was achieved using the 900 × 1536 resolution (77.59%) in exchange for lower speed. Increasing the resolution by a factor of 2 (from 150 × 256 to 300 × 512) increases the mAP by 16.42%, while increasing it by a factor of 4 (from 150 × 256 to 600 × 1024) increases the mAP by 27.09%. In addition, there is no change in evaluation time between the 150 × 256 and 300 × 512 resolutions, yet the mAP at 300 × 512 is higher than at 150 × 256. The table clearly shows that image resolution can significantly impact the mAP of the object detector. The precision and recall for each class using the 900 × 1536 resolution are shown in Table 4. Faster R-CNN detected the mortar with high precision (93.55%) and high recall (100%), whereas the wires were detected less accurately, with 77.84% precision and 75% recall. The mAP plot for different image resolutions is shown in Fig. 5. Interestingly, the image size was observed to affect the performance of the object detector: increasing the image size also increases the mAP. As in the bounding box proposal experiment, the inference time at different image resolutions was also examined. The comparison of mAP versus time at different image resolutions is presented in Fig. 6. Higher mAP can be achieved by sacrificing the speed of the object detector: every 150-pixel increase in the shorter edge (with a corresponding 256-pixel increase in the longer edge) of the input image increases the mAP while slowing down evaluation.
After training and evaluating Faster R-CNN, the trained model was tested on an X-ray image to verify its detection performance. A Python script was developed that accepts an input image, performs inference, and outputs the bounding box coordinates and corresponding class labels of the threat objects. The detection output of Faster R-CNN is shown in Fig. 7. The class label and class score of each detected object are shown above the bounding box. The model was able to detect three classes of IED components, namely battery, mortar, and wires.
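The post-inference step of such a script can be sketched as follows: filtering the detector's raw outputs by a confidence threshold and pairing each surviving box with its class label. The function name, output format, and threshold are illustrative assumptions; the actual script also loads the trained Faster R-CNN model and runs it on the image.

```python
# Hypothetical post-processing of raw detector outputs: keep detections
# whose class score passes a confidence threshold, then attach labels.
CLASSES = {1: "battery", 2: "mortar", 3: "wires"}  # the 3 IEDXray classes

def filter_detections(boxes, scores, class_ids, score_thresh=0.5):
    """Return labeled detections whose score is at least `score_thresh`."""
    kept = []
    for box, score, cid in zip(boxes, scores, class_ids):
        if score >= score_thresh:
            kept.append({"label": CLASSES[cid], "score": score, "box": box})
    return kept

# Toy outputs: two confident detections survive, one low-score box is dropped.
dets = filter_detections(
    boxes=[(10, 10, 50, 50), (60, 20, 90, 80), (5, 5, 8, 8)],
    scores=[0.98, 0.91, 0.12],
    class_ids=[2, 1, 3],
)
for d in dets:
    print(d["label"], d["score"])
```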

Conclusion
This study extensively evaluated Faster R-CNN in identifying threat objects in an X-ray image dataset. Different experiments were conducted to increase the performance of the threat object detector by changing the number of bounding box proposals and the resolution of the input image. These experiments confirmed that increasing the number of bounding box proposals may lower the mean average precision (mAP) and slow down detection. The research has also shown that increasing the input image's size positively impacts the mAP at the cost of speed. When using Faster R-CNN, it is recommended to find the best trade-off between mAP and speed by balancing the number of bounding box proposals and the image size. Overall, the experimental results show that the proposed method can reliably identify threat objects in an X-ray image.
This study can be further improved by adding more X-ray images to the training data. The data should ideally include objects other than IED components, which may increase the generalizability of the IED detector model and reduce false positives and false negatives. If acquiring additional data is not possible, another option is to generate synthetic X-ray images using machine learning frameworks such as generative adversarial networks (GANs) and variational autoencoders (VAEs).