Fast Vehicle and Pedestrian Detection Using Improved Mask R-CNN

School of Physics and Electronic Information, Anhui Normal University, Wuhu 241002, China
Anhui Provincial Engineering Laboratory on Information Fusion and Control of Intelligent Robot, Wuhu Anhui, 241002, China
School of Communications and Information Engineering, Xi’an University of Posts & Telecommunications, Xi’an 710061, China
School of Mathematics and Statistics, Anhui Normal University, Wuhu, 241002, China


Introduction
To improve driving safety and reduce driver fatigue, research is being conducted on the development of intelligent driving technology [1]. In intelligent driving, human safety must be guaranteed first; therefore, the assisted driving system (ADS) [2], which aims to improve safety, is a hot spot in intelligent driving research. The collision avoidance warning system (CAWS) [3] is particularly important for ADS in smart cars. One key issue of CAWS is awareness of the driver's surroundings. The images of vehicles and pedestrians captured by car cameras must be recognized, detected, and segmented by object detection technology, which faces challenges due to complex scene information. The two main methods for vehicle and pedestrian detection are machine-learning-based [4] approaches and deep-learning-based [5] approaches. Machine learning approaches first define features using a feature descriptor such as the histogram of oriented gradients (HOG) [6] and then perform classification using a technique such as a support vector machine (SVM) [7]. The HOG + SVM approach has shown strong performance but suffers from low mean average precision (mAP) and is not suitable for multistage feature extraction [8]. Deep learning systems, such as convolutional neural networks (CNNs), show superiority in object detection because they aim to discover discriminative features from raw data [9]. The CNN was developed in the 1980s and 1990s [10], and since a resurgence of interest [11] in 2012, it has established a foothold in the field of computer vision and has grown at a rapid pace.
As the requirements for algorithm accuracy and speed continue to increase, vehicle and pedestrian recognition

Mask R-CNN
Mask R-CNN is a conceptually simple, flexible, and general framework for object recognition, detection, and instance segmentation, which can efficiently detect objects in an image while generating a high-quality segmentation mask for each instance. Feature pyramid networks (FPNs) for object detection [24], the first block of Mask R-CNN, are responsible for feature extraction. The regional proposal network (RPN) [25], the second block of Mask R-CNN, shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals [26]. Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. The RPN is used in Mask R-CNN instead of selective search [27] so that it can share the convolutional features of the full image with the detection network. It predicts both boundary positions and object scores at each location, and it is also a fully convolutional network (FCN) [28]. As shown in Table 1, Faster R-CNN uses the RPN as a region generation network to generate candidate regions. Building on the Fast R-CNN algorithm, it reaches up to 5 FPS, and its mAP tested on VOC 2012 also increases to 70.4% [14].
To further improve the detection accuracy, Mask R-CNN replaces the ROI pool [29] of Faster R-CNN with region of interest (ROI) align, which is based on bilinear interpolation. The ROI align layer removes the harsh quantization of the ROI pool and properly aligns the extracted features with the input. It avoids any quantization of the ROI boundaries or bins: bilinear interpolation [30] is used to compute the exact values of the input features at four regularly sampled locations in each ROI bin, and the results are aggregated. This method improves the accuracy of Mask R-CNN by 10% [15].
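As an illustration of ROI align, the following minimal NumPy sketch samples each bin at regularly spaced real-valued points with bilinear interpolation and never rounds a coordinate. The function names, the single-channel feature map, and the choice of averaging as the aggregation step are assumptions made for this example; they are not taken from the authors' implementation.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at the real-valued point (y, x)."""
    h, w = feature.shape
    # Clamp the sampling point so that its four neighbours lie inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool one ROI (y1, x1, y2, x2 in feature-map coordinates) to out_size x out_size.

    Each bin is sampled on a regular samples x samples grid (four points when
    samples=2) and the sampled values are averaged (assumed aggregation).
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        for j in range(out_size):
            values = []
            for a in range(samples):
                for b in range(samples):
                    # Regularly spaced sample centres inside bin (i, j).
                    sy = y1 + (i + (a + 0.5) / samples) * bin_h
                    sx = x1 + (j + (b + 0.5) / samples) * bin_w
                    values.append(bilinear_sample(feature, sy, sx))
            out[i, j] = np.mean(values)
    return out
```

Because the ROI boundaries stay real-valued, a fractional ROI such as (0.5, 0.5, 2.5, 2.5) is pooled without any snapping to the pixel grid, which is the property that distinguishes ROI align from ROI pool.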
To implement the mask function, Mask R-CNN adds a mask branch that achieves high-precision instance segmentation through pixel-to-pixel alignment. Mask R-CNN can accomplish three tasks: target recognition, detection, and segmentation. Its detection speed can still reach 5 FPS. The flowchart of Mask R-CNN is shown in Figure 1. At the input, after the image passes through the FPN, five sets of feature maps of different sizes are generated, and the candidate frame areas are generated by the RPN. After the candidate regions are combined with the feature maps, the system can detect, classify, and mask the target. Further improving the computing speed of the algorithm would allow it to meet the real-time requirements of the intelligent driving anticollision warning system.
Based on Mask R-CNN, we propose a method to improve detection accuracy and speed through SF-FPN with Resnet-86. In this study, the dataset, FPN structure, and RPN parameter settings are improved. The improved method proposed in this study can realize the recognition, detection, and segmentation of the target at the same time.

Feature Pyramid Networks for Object Detection (FPN).
Feature extraction is an important part of the field of machine vision. With the development of machine learning, methods based on neural network feature extraction, including featurized image pyramid [31], single feature map [32], pyramidal feature hierarchy [33], and feature pyramid network (FPN), have been proposed.
As shown in Figure 2(a), the featurized image pyramid passes an input image through different convolutional layers with convolution kernels of different sizes, generating multiscale feature maps and outputting feature maps of different sizes. Although this method obtains feature maps at different scales, it adds a large amount of calculation time, and the semantic information in the feature maps is not sufficient. Figure 2(b) shows single feature map feature extraction. The idea is to input an image and pass it through different convolutional layers from bottom to top, with the output of the last convolutional layer used as the final feature output of the network. This method offers faster operation and utilizes the semantic information on each layer. For this reason, it has been previously applied to SPP-net, Fast R-CNN, and Faster R-CNN. However, its performance in multiscale target detection is poor. Figure 2(c) shows the pyramidal feature hierarchy method, which also passes an input picture through different convolutional layers from bottom to top but extracts features at different scales from different layers as predictions, which obtains multiscale features without increasing the amount of calculation. Although the pyramidal feature hierarchy method maintains speed and generates multiscale feature information at the same time, it can neither make full use of lower level semantic information nor achieve good results in small target detection. To this end, the FPN adds side links to the pyramidal feature hierarchy algorithm: the input image passes through different convolutional layers from the bottom up, and side links along a top-down pathway then combine low-resolution, semantically strong features with high-resolution, semantically weak features. The FPN not only maintains the original calculation speed but also generates accurate multiscale feature information.
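The top-down pathway with side links can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the 1 × 1 convolutions are written as channel-mixing matrix products, upsampling is nearest-neighbour by a factor of two, and any smoothing convolution after the merge is omitted. The function names and weight matrices are hypothetical.

```python
import numpy as np


def conv1x1(feature, weight):
    """1 x 1 convolution on a (C_in, H, W) map using a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, feature)


def upsample2x(feature):
    """Nearest-neighbour upsampling of a (C, H, W) map by a factor of two."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(bottom_up, lateral_weights):
    """Merge bottom-up maps (fine to coarse) into pyramid maps of equal channel depth.

    bottom_up: list of (C_i, H_i, W_i) arrays, each half the resolution of the previous.
    lateral_weights: one (D, C_i) matrix per level for the 1 x 1 side-link convolution.
    Returns the merged maps in the same fine-to-coarse order.
    """
    # Start from the coarsest, semantically strongest level.
    merged = [conv1x1(bottom_up[-1], lateral_weights[-1])]
    for feature, weight in zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1])):
        lateral = conv1x1(feature, weight)
        # Upsample the coarser merged map and add the side-link features.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]
```

Each output level therefore carries the semantics of every coarser level while keeping its own spatial resolution.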
The top-down pyramid [34] algorithm in Figure 2(d) first uses convolutional downsampling to reduce the size of the feature map and then uses upsampling to enlarge it. The network has no horizontal connections, that is, the top-down process does not integrate the original features, so the location information of the target becomes less accurate after multiple downsampling and upsampling steps.
The finest level only [35] algorithm shown in Figure 2(e) takes only the last layer P2 of the FPN as the output and does not produce multiscale output. Sliding the RPN window over different layers of the pyramid increases robustness to scale change, so the FPN is significantly more robust than the finest level only algorithm in identifying targets of different sizes.
As can be seen from Table 2, compared to the outputs of C4 and C5 of the featurized image pyramid algorithm, the FPN algorithm improves the accuracy by nearly 21.7%. Particularly in small target detection, it gains a significant advantage of 12.9 points. From Figure 2, we can also see that the FPN adds top-to-bottom side links and multiscale output compared with the single feature map and pyramidal feature hierarchy algorithms. In this way, low-resolution, semantically strong features can be fully integrated with high-resolution, semantically weak features. From the data table, we can see that when 1000 anchors are generated as predictions, the AR is improved by 6. [...] targets have improved by 9.8 points, which greatly improves the robustness.
The FPN in Mask R-CNN uses Resnet-101 as the backbone. The design of the deep residual network overcomes the problem that learning efficiency drops as the network deepens and accuracy can no longer be improved effectively. The deep residual network divides the training into blocks so that the error of each block is minimized in order to achieve the smallest overall error. Resnet-101 is a widely used classical deep residual network. It can be roughly divided into five stages of convolutional layers. The output scale is reduced by half at each stage.
This FPN + Resnet network is robust and adaptable: it not only passes high-level features down to low-level features but also makes full use of both high-level and low-level feature information through side links, thereby improving feature extraction. The FPN is the first part of Mask R-CNN and produces the feature maps. As shown in Figure 3, the FPN is built on the basis of image pyramids [36]. The input image passes through the convolutional layers to produce five sets of feature maps (C1, C2, C3, C4, and C5), and all five are raised or reduced to 256 dimensions by a 1 × 1 convolution kernel. Since the upsampled C5 has the same size as C4, it is merged directly with the dimensionality-reduced C4. This connection method links the high-level features, which have low resolution and strong semantic information, with the low-level features, which have high resolution and weak semantic information, from top to bottom so that the features at all scales carry rich semantic information. The same is true for the connection method of P5, P4, P3, and P2. To improve the detection of large targets, P5 undergoes max pooling at the end to form a feature map P6 of size 16 × 16.
Table note: The first column is the feature extraction algorithm, where the second and third rows are the two-layer outputs of the algorithm, and the fourth row contains the single feature map and pyramidal feature hierarchy. The second column is the name of the output layer, where the "{}" symbol indicates the independent prediction at each layer. The evaluation standard uses average recall (AR). The number in the upper right corner of the AR indicates the number of anchors generated by each image. The letters "s," "m," and "l" in the bottom right corner denote the small goal, medium goal, and large goal, respectively.
The COCO dataset contains 81 categories. The FPN in Mask R-CNN uses Resnet-101 as the backbone to detect 81 types of targets.
However, since the detection targets in this study comprise only three categories (person, car, and bus), using Resnet-101 to detect them leads to parameter redundancy. To reduce the redundancy and improve the computing speed, this study designs Resnet-86, with only 86 layers, as the backbone of the network and FPN to detect the three types of targets. As can be seen from Table 3, the Resnet-86, Resnet-50, and Resnet-101 structures are all composed of five convolutional stages. The numbers of residual blocks of Resnet-50 and Resnet-101 at Conv_4 are 6 and 23, respectively. In this study, the number of residual blocks of Resnet-86 at Conv_4 is set to 18. It can be seen from the experimental results in Table 4 and Figure 4 that although the Resnet-50 structure is faster in recognition, its recognition accuracy cannot meet our requirements. Compared with Resnet-101, Resnet-86 not only increases the computing speed by about 7.94% but also reduces the weight memory by 9.43%. This can effectively promote the use of deep learning in embedded development. Therefore, this study uses Resnet-86 as the backbone of Mask R-CNN to extract image features.

Side Fusion FPN (SF-FPN).
In view of the excellent feature extraction performance of the FPN, researchers in the field of machine vision have in the past two years successively proposed models such as the path aggregation network (PANet) [37], neural architecture search FPN (NAS-FPN) [38], and bidirectional feature pyramid network (BiFPN) [39,40] and applied them to image recognition, detection, and segmentation in various scenarios. We propose the Side Fusion FPN, the main idea of which is to make full use of feature semantic information in feature fusion and feature extraction while increasing the amount of calculation as little as possible.
The aim is to take full advantage of the feature maps with high semantic information, that is, P2 to P6 mentioned above, and combine them side by side.
As shown in Figure 5, as a pioneering method for feature extraction, the FPN proposes a top-down and side-to-side connection method to combine multiscale features. Following this idea, PANet was proposed, which adds an extra bottom-up path aggregation network on top of the FPN. This method further combines feature semantic information for better feature extraction. The NAS-FPN uses a neural architecture search to obtain irregular feature network topologies. It fuses features across scales and uses neural architecture search to form a new feature pyramid structure. Although the NAS-FPN can achieve better performance, it requires thousands of GPU hours in the search process, and the generated feature network is irregular and therefore difficult to interpret. The next method to emerge was BiFPN. It uses two-way cross-scale connections and weighted feature fusion to improve the detection accuracy, but compared with the FPN, it still requires a good deal of calculation.
This is why we propose the SF-FPN algorithm. It is based on the FPN algorithm but, without adding any output, keeps the amount of calculation as low as possible while making full use of high-semantic feature information: it adds six fusion lines, performs the fusion among P2 to P5, and makes P6 the final fusion output. We also propose a fully connected FPN as a comparison, in which, on the basis of PANet, all the semantic feature maps are directly connected to one another.
From the classic FPN structure, we know that, compared to C2−C5, P2−P6 carry rich feature semantic information, so it is cost-effective to fuse these five feature maps. Therefore, this paper designs the Side Fusion FPN by adding only six side fusion curves to the FPN. Curve 1: transfer P5 feature semantic information to P3. Curve 2: transfer P5 feature semantic information to P2. Curve 3: transfer P4 feature semantic information to P2. This article uses the proposed Side Fusion FPN as the first part of Mask R-CNN, with our deep residual network Resnet-86 as the backbone, to obtain feature maps at five scales. As can be seen from Tables 5 and 6, with the SF-FPN algorithm in the entire network framework, the amount of calculation increases by only 2.54 × 10^−7. Although this increase is almost negligible, the mAP in the test results increases by 2.77 points, an obvious improvement in accuracy.
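The three curves listed above can be illustrated with a small NumPy sketch. The text does not state which operator performs the fusion, so this example assumes nearest-neighbour upsampling of the source map followed by element-wise addition to the target map; the function names and the dictionary layout are hypothetical, and only the three curves named in the text are covered.

```python
import numpy as np


def upsample(feature, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feature.repeat(factor, axis=1).repeat(factor, axis=2)


# (source level, target level) pairs for Curves 1 to 3 in the text.
SIDE_FUSION_CURVES = [("P5", "P3"), ("P5", "P2"), ("P4", "P2")]


def side_fuse(pyramid, curves=SIDE_FUSION_CURVES):
    """Add each source map, upsampled to the target resolution, to its target map.

    pyramid: dict mapping level names to (C, H, W) arrays with equal C.
    Element-wise addition is an assumption; the text does not name the operator.
    """
    fused = dict(pyramid)  # levels without an incoming curve are passed through
    for source, target in curves:
        factor = pyramid[target].shape[1] // pyramid[source].shape[1]
        fused[target] = fused[target] + upsample(pyramid[source], factor)
    return fused
```

Under this assumption, each curve costs one upsampling and one addition and introduces no new learned parameters, which is in line with the small increase in calculation reported in the text.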

Regional Proposal Network (RPN).
The five sets of feature maps generated by the FPN are sent to the RPN. As can be seen from Figure 6, the RPN slides a small window across the five sets of feature maps to produce 785,664 boxes and form region proposals. We now introduce the algorithm by which the RPN generates candidate frames. As shown in Figure 6, suppose there are n feature maps in total, and the width and height of the i-th feature map are $P_{W_i}$ and $P_{H_i}$; the i-th feature map then has $P_{W_i} \cdot P_{H_i}$ pixels. Using each pixel as an anchor point, candidate frames of three sizes and three ratios are generated at the same time, that is, nine candidate frames are generated for each pixel. The number of candidate frames for each feature map is therefore $9 \cdot (P_{W_i} \cdot P_{H_i})$, and the total number of candidate frames generated by the RPN over all feature maps is

$$\sum_{i=1}^{n} 9 \cdot (P_{W_i} \cdot P_{H_i}).$$

Table note: The words "person," "car," and "bus" in the bottom right corner mean the detection of the single category. Min_train_loss refers to the loss value of each model after training.
From the above, this paper generates five sets of feature maps, of size 16 × 16, 32 × 32, 64 × 64, 128 × 128, and 256 × 256. Substituting these sizes into the above formula gives 9 · (256 + 1,024 + 4,096 + 16,384 + 65,536) = 785,664 boxes generated in the RPN phase.
Next, the network filters the 785,664 boxes through nonmaximum suppression (NMS) [38]. The network sorts the boxes by score from large to small, according to the four factors of color, texture, total area after merging, and the total area of the merged box in its bounding box, and retains 2000 boxes for training and 1000 boxes for inference. A large amount of calculation is required both for NMS to select these 3000 boxes and for the subsequent training and prediction on them. This is therefore the most time-consuming part of the RPN. The 3000 boxes are used to detect the 81 types of targets in the COCO dataset, but there are only three categories of detection targets in this study. To increase the speed of the network without affecting the detection accuracy, we retain 500 boxes for training and 250 boxes for inference.
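The candidate frame count and the NMS screening can be checked with a short NumPy sketch. It evaluates nine frames per pixel over five square maps of side 16 to 256, which gives 9 × 87,296 = 785,664, and then applies a standard greedy NMS with a cap on the number of retained boxes. The function names, the (y1, x1, y2, x2) box layout, the use of one score per box, and the IoU threshold of 0.7 are assumptions for illustration; only the retention limits (for example, 500 boxes for training) come from the text.

```python
import numpy as np


def count_candidate_frames(map_sizes, frames_per_pixel=9):
    """Total candidate frames when every pixel of every map anchors nine frames."""
    return sum(frames_per_pixel * width * height for width, height in map_sizes)


def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.7, max_keep=500):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0 and len(keep) < max_keep:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Only boxes that do not overlap the kept box too strongly survive.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```

Lowering `max_keep` from 2000 to 500 shortens both the NMS loop and every later stage that processes the retained boxes, which is the speed-up this section targets.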

Dataset Improvement.
The detection objects of this paper are cars, buses, and pedestrians. The Microsoft COCO public dataset contains 81 categories and 82,081 image samples. The parameters in mask_rcnn_COCO.h5, obtained by training Mask R-CNN on this dataset, are for detecting 81 kinds of targets. Using these weights directly to detect vehicles and pedestrians makes the calculation unnecessarily complicated. Therefore, two changes have been made to the COCO dataset. For the first change, we screened the images of three categories in the COCO dataset (car, bus, and person) and formed a new dataset, which we named the COCO_pcb dataset. For the second change, we labeled the new dataset with an open-source image annotation tool called VGG Image Annotator (VIA). The annotation file is named via_pcb_data.json. The dataset uses 1000 images as the training set, 100 images as the validation set, and 50 images as the test set. VIA was developed by the Visual Geometry Group and can be used online or offline. As shown in Figure 7, targets can be labeled with rectangles, circles, ellipses, polygons, points, and lines. In this way, we not only make every sample a valid sample of the three categories of car, bus, and person but also ensure that the dataset is sufficiently large. This assists precise training in the experimental part and improves the accuracy of target detection. We used VIA to annotate the original images with polygons and then sent the resulting annotation file via_pcb_data.json to the Mask R-CNN network for 160,000 iterations. We named the resulting weights mask_rcnn_via.h5. Finally, we used mask_rcnn_via.h5 as the initial weights for transfer learning. We used the COCO_pcb dataset, which has only the three categories of car, bus, and person, to perform 160,000 iterations of training with FPN + Resnet101, FPN + Resnet50, and our designed FPN + Resnet86. Under the same environment, 1,000 iterations were performed for SF-FPN + Resnet101, SF-FPN + Resnet50, and SF-FPN + Resnet86, respectively.
Figure 6: The RPN operation uses each pixel as an anchor point on each feature map and simultaneously generates candidate frames of three sizes and three ratios. All candidate frames are then subjected to NMS screening according to the score, and a certain number of candidate frames are selected and saved, which are used for subsequent training and prediction.
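As a rough illustration of how a VIA polygon annotation file such as via_pcb_data.json could be read, the sketch below converts each polygon region into a class label and a bounding box. It assumes the usual VIA JSON export layout (entries with `filename` and `regions`, each region holding `shape_attributes` and `region_attributes`). The attribute key `class` and the embedded sample entry are hypothetical; the actual key depends on how the annotation project was configured.

```python
import json


def polygon_boxes(via_annotations, label_key="class"):
    """Return {filename: [(label, (y1, x1, y2, x2)), ...]} for polygon regions."""
    result = {}
    for entry in via_annotations.values():
        regions = entry.get("regions", [])
        # Older VIA exports store regions as a dict, newer ones as a list.
        if isinstance(regions, dict):
            regions = list(regions.values())
        boxes = []
        for region in regions:
            shape = region["shape_attributes"]
            if shape.get("name") != "polygon":
                continue  # this sketch handles polygon annotations only
            xs, ys = shape["all_points_x"], shape["all_points_y"]
            label = region.get("region_attributes", {}).get(label_key)
            boxes.append((label, (min(ys), min(xs), max(ys), max(xs))))
        result[entry["filename"]] = boxes
    return result


# Hypothetical one-image annotation in the assumed VIA export layout.
SAMPLE = json.loads("""
{"img1.jpg123": {"filename": "img1.jpg", "size": 123,
  "regions": [{"shape_attributes": {"name": "polygon",
      "all_points_x": [10, 40, 25], "all_points_y": [5, 5, 30]},
    "region_attributes": {"class": "car"}}]}}
""")
```

For a real file, the same function would be applied to the result of `json.load` on the opened annotation file instead of `SAMPLE`.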

Experiment and Results Analysis
This experiment was conducted with FPN + Resnet101, FPN + Resnet86, FPN + Resnet50, SF-FPN + Resnet101, SF-FPN + Resnet86, and SF-FPN + Resnet50 as backbones. We trained them on the COCO dataset containing 81 classes and the COCO_pcb dataset containing three classes. In addition, we used the SF-FPN designed in this article as a backbone to train on the COCO_pcb dataset containing only three classes. The number of epochs for each of the above 12 sets of experiments is set to 160. Each epoch contains 1000 iterations, so the total number of iterations is 1,920,000. During the experiments, we recorded the memory size of the weights, the total training time, the average detection time on the 4,952 test images, and the mAP on the 4,952 test images. We also measured the params and FLOP values of the 12 groups of experiments. As shown in Figures 8 and 9, we used TensorBoard to plot the loss curves. Finally, we selected the smallest loss value for each set of experiments and recorded it in Tables 4 and 5. We tested the 4,952 images with the best weights in each set of experiments and calculated the detection time for each network.
As can be seen from Table 4, total params is the first quantity compared for the six network structures. Params refers to the number of parameters included in the network structure. For example, for the i-th layer of an n-layer convolutional neural network, let the kernel width be $C_{W_i}$, the kernel height be $C_{H_i}$, the number of input channels be $C_{IN_i}$, and the number of output channels be $C_{OUT_i}$. The memory parameters of each convolutional layer are then determined by these four quantities. Usually, a complete neural network structure also includes a final fully connected layer. We assume that the input size of the fully connected layer is $N_{IN}$ and its output size is $N_{OUT}$, which determine the memory parameters of this layer. The number of total params is obtained by adding the parameters of all n convolutional layers and of the fully connected layer.
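The parameter count can be sketched as follows, using the standard counting rule for convolutional and fully connected layers: kernel width × kernel height × input channels × output channels for a convolution, and inputs × outputs for a fully connected layer. Whether the counts in Table 4 include bias terms is not stated in the text, so the `bias` flag and the function names are assumptions of this example.

```python
def conv_params(kernel_w, kernel_h, in_channels, out_channels, bias=True):
    """Weights (and optional biases) of one convolutional layer."""
    weights = kernel_w * kernel_h * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def fc_params(n_in, n_out, bias=True):
    """Weights (and optional biases) of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def total_params(conv_layers, fc_layer, bias=True):
    """Sum over all convolutional layers plus the final fully connected layer.

    conv_layers: iterable of (kernel_w, kernel_h, in_channels, out_channels).
    fc_layer: (n_in, n_out) of the last fully connected layer.
    """
    conv_total = sum(conv_params(*layer, bias=bias) for layer in conv_layers)
    return conv_total + fc_params(*fc_layer, bias=bias)
```

For instance, a 3 × 3 convolution with 64 input and 64 output channels holds 36,864 weights, or 36,928 parameters when biases are counted.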

Experimental Data Analysis of FPN + Resnet-50, FPN + Resnet-86, and FPN + Resnet-101.
As can be seen from Table 4, compared to Resnet101_81, the params value of Resnet86_3 is reduced by 6,023,374 and the FLOP value by 12,199,439, which saves nearly 7.55 hours of training time. When testing 4,952 images, the average detection time was reduced by 0.17 seconds, and the detection speed increased by 7.94%. Resnet86_3 is thus faster in both training and detection, and its weight memory is reduced by 24.3 MB, which favors its implementation in embedded development. As can be seen from Table 5, the minimum loss value of Resnet86_3 is 0.7138, and its mAP is 15.73 points higher than that of Resnet101_81, so it also improves accuracy. Compared with Resnet86_81, Resnet86_3's params value is reduced by 414,414 and its FLOP value by 827,904, which shortens the training time by 1.35 hours. When testing 4,952 images, the average detection time is shortened by 0.04 seconds. In terms of loss, it is 0.1954 lower than Resnet86_81, and its mAP is 16.4 points higher. The comparison between Resnet86_3 and Resnet86_81 reflects the impact of the number of detection categories on the network.
Compared with Resnet50_81, Resnet86_3 does not display an advantage in params, FLOPs, or speed; however, we can see from Figure 8 that Resnet50_81 is not suited for our purpose. This is because its training yields a high minimum loss (1.287) and an mAP (28.68) lower than that of Resnet86_3, which does not meet our standard for detection accuracy. We can also see in Figure 8 that the smallest loss of Resnet50_3 is high (0.8643) and its mAP (7.75) is lower than that of Resnet86_3.
Compared with Resnet101_3, Resnet86_3's params are reduced by 5,608,960 and its FLOPs by 11,371,535, which saves 2.58 hours of training time. When testing 2,895 images, the average test time per image was reduced by 0.13 seconds, and the test speed increased by 6.19%. The weight memory of Resnet86_3 was also 22.6 MB smaller than that of Resnet101_3, and its mAP was only 0.85 lower than that of Resnet101_3.
It can be seen from Figure 8 that when Resnet-86 is used as the backbone in Mask R-CNN to detect the three major categories of car, bus, and person, its training effect is significantly better than that of the original Mask R-CNN with Resnet-101 as the backbone. Although the mAP value of Resnet86_3 is slightly lower than that of Resnet101_3, its detection speed is faster, which meets the goal of our design. Therefore, in the end, we used Resnet86_3 as the backbone of Mask R-CNN and applied it to the target detection algorithm.
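The layer counts behind the three backbone names can be reproduced with a short calculation. The Conv_4 block counts (6, 18, and 23) are taken from the text; the block counts of the other stages are the standard ResNet-50/101 values and are assumed here to be unchanged in Resnet-86, as is the three-convolution bottleneck block.

```python
# Residual blocks per stage Conv_2 to Conv_5. The Conv_4 counts (6, 18, 23) are
# from the text; the other stage counts are the standard ResNet-50/101 values
# and are assumed to be unchanged in Resnet-86.
BLOCKS = {
    "Resnet-50": (3, 4, 6, 3),
    "Resnet-86": (3, 4, 18, 3),
    "Resnet-101": (3, 4, 23, 3),
}


def layer_count(blocks_per_stage, convs_per_block=3):
    """Stem convolution + three convolutions per bottleneck block + final FC layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


LAYERS = {name: layer_count(blocks) for name, blocks in BLOCKS.items()}
```

Under these assumptions, 18 blocks at Conv_4 give 1 + 3 × (3 + 4 + 18 + 3) + 1 = 86 layers, which matches the name Resnet-86; removing five Conv_4 blocks from Resnet-101 removes 15 convolutional layers.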

Analysis of FPN and SF-FPN Experimental Data.
By comparing the experimental data of the 12 groups in Table 4 [...] A comparison of the loss curves for the two groups of experiments selected from Figures 8 and 9 is shown in Figure 10. When the dataset, the number of network layers, the number of detection categories, the training rate, and the number of training iterations are the same, we find that the SF-FPN clearly outperforms the FPN: the loss value on the SF-FPN's loss curve is always lower than that of the FPN. Particularly in the early stage, it quickly reaches a lower loss value and stays ahead of the FPN throughout training.
Taking into account these qualities of the SF-FPN, we finally adopted the SF-FPN as the feature extraction structure, with Resnet86 as the network structure, for object recognition, detection, and classification of the three categories of car, bus, and person. Compared with the original FPN + Resnet101 structure recognizing 81 classes, the network structure SF-FPN + Resnet86_3 that we designed improves the training speed by 26.98% and the mAP by 17.53 points. As shown in Figure 11, we tested two images with the above two algorithms. The FPN + Resnet101_81 algorithm missed the small target vehicle in the red frame in the first image, while the SF-FPN + Resnet86_3 algorithm proposed in this paper accurately detected it. On the second image, the FPN + Resnet101_81 algorithm does not segment the part in the red frame accurately, while our SF-FPN + Resnet86_3 algorithm segments it accurately, distinguishing between the vehicle and nonvehicle parts. At the same time, the network structure framework we designed can be easily migrated to other network structure models, such as Faster R-CNN, SSD, and YOLOv3.
At the end of this section, we provide a short comparative remark on our method and some state-of-the-art methods. After testing, the mAP value of the SSD-based vehicle detection algorithm [41] is only 50.4%, which is 26.24 points lower than that of our new algorithm. As shown in Figure 12(a), in the detection of the first test image, the target at the lower right corner is missed. The mAP value of the vehicle detection algorithm [42] based on YOLOv3 is 57.9%, which is 18.74 points lower than that of the new algorithm. It can be seen from Figure 12(b) that the detection and positioning of the leftmost target "car" in the second image is not accurate enough. The mAP value of the Faster R-CNN vehicle detection algorithm [43] is 59.1%, which is still 17.54 points lower than that of the new algorithm. It can be seen from Figure 12 [...] algorithm [44], the bibox regression for pedestrian detection algorithm [45], and other algorithms are also compared, and the experimental results show that vehicle and pedestrian detection based on the improved Mask R-CNN is slightly more accurate in the task of instance segmentation.

Conclusions
The main research content of this study is how to make the Mask R-CNN algorithm detect and segment cars, buses, and persons on the road more accurately and more quickly in the anticollision warning system. To increase the accuracy of network training, we filtered and supplemented the dataset. To meet the real-time requirements of smart driving, we designed the Resnet-86 network and used it as the network backbone. To further increase the detection speed of the network, we modified the number of reserved RPN candidate frames. For greater accuracy, we designed the SF-FPN algorithm for feature extraction.
Through improving the dataset, FPN, and RPN, our network improved the detection speed by 7.94% and increased the mAP by 17.53 points over the original Mask R-CNN network. Based on Mask R-CNN, we improved the network to integrate the functions of image recognition, detection, and segmentation. As we can see from Figure 13, the improved network can accurately detect targets at a distance of about 200 m, even when the target is 95% occluded. The network can therefore be applied to the intelligent driving anticollision warning system to identify the car, bus, and person ahead.
Although the recognition speed of this network has reached 5 FPS, for some real-time system applications, the recognition speed still needs to be increased and the hardware configuration requirements need to be reduced. For example, in the vehicle tracking task, target detection needs to be completed faster; in the precision instrument segmentation task, the target needs to be segmented more accurately; in the vehicle emergency braking device, the target needs to be detected faster to complete emergency braking.
In the future, to improve the detection accuracy, we can further improve the design of the feature extraction algorithm to make the feature semantic information richer. We can also further optimize the deep residual network, reduce the loss function value, and improve the network training effect. To improve the detection speed, we can combine and optimize the deep convolutional neural network algorithm and reduce the computing redundancy of the network. The network can also be combined with the hardware configuration to enhance its computing ability and further improve the speed of target detection. Therefore, how to improve the speed and accuracy of target detection remains our key research work in the future.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 6 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.