SSD Object Detection Model Based on Multi-Frequency Feature Theory

To further improve the accuracy and real-time performance of the traditional Single Shot MultiBox Detector (SSD), an improved SSD multi-object detection model is proposed. First, to address the weak correlation between the predicted object score and the positioning accuracy in the traditional SSD model, an Intersection over Union (IoU) prediction loss branch is added to strengthen the correlation between the two. Second, to reduce the spatial redundancy of the traditional SSD model, a multi-frequency feature component convolution module is designed, which greatly reduces the computational and hardware overhead of the original model. Finally, to accelerate the convergence of the improved model, the Adaptive and Momental Bound (AdaMod) optimizer is introduced to trim abnormally large adaptive learning rates during training. Experimental results show that the improved model has stronger detection capability, better overall detection results, and improved accuracy and real-time performance.


I. INTRODUCTION
With the continuous improvement and development of object detection technology, the user experience has become more ideal and the range of applications wider. In traffic [1], object detection can efficiently assist traffic management by detecting pedestrians, vehicles, road signs, traffic lights, and other objects on the road. In the medical field [2], object detection is often applied to the pathological detection and recognition of images, making a significant contribution to the prevention and cure of diseases. These applications show that object detection performs work in place of humans very efficiently and conveniently. In the military field [3], object detection is usually used to track missiles hitting their targets, which plays an important role in the intelligent development of the military field. In the field of security [4], object detection is used to track suspicious vehicles and detect illegal and criminal behavior in real time, providing a favorable guarantee for social security protection. At present, there are two kinds of common object detection algorithms: two-stage and single-stage. In two-stage object detection algorithms, candidate box regions are first generated, and then the candidate boxes are classified and adjusted. The main representatives are the R-CNN series of algorithms. Such algorithms have high accuracy, but suffer from slow detection speed and poor real-time performance. In contrast, single-stage object detection algorithms generate object bounding boxes and class probabilities directly, without generating candidate box regions, which improves detection speed at some cost in accuracy.
(The associate editor coordinating the review of this manuscript and approving it for publication was Zhihan Lv.)
Representative object detection algorithms based on single-stage and two-stage processes include SSD [5] and Faster R-CNN [6].
In order to solve the problems in object detection, many scholars at home and abroad have done a great deal of research. In 2014, Ross Girshick et al. proposed the R-CNN (Region-based Convolutional Neural Networks) [7] algorithm, which introduced the deep learning [8] model CNN (Convolutional Neural Network) into the object detection field for the first time; its detection effect was significantly better than that of traditional object detection algorithms. However, the R-CNN algorithm needs to extract features for about 2000 RoIs (Regions of Interest) obtained by selective search [9] in both the training and prediction stages, which makes the detection speed of the network model slow. In addition, the feature extraction process cannot be updated end to end, because the CNN feature extractor is separate from the SVM used for prediction. Therefore, Ross Girshick proposed the Fast R-CNN [10] algorithm in 2015, which addressed these defects of R-CNN and improved its training and prediction speed to some extent. After that, Faster R-CNN was proposed by Shaoqing Ren et al. Based on Fast R-CNN, this algorithm constructed a Region Proposal Network (RPN) that directly generates region proposals instead of the RoIs obtained by selective search; with the help of the RPN, the detection speed of Faster R-CNN was further improved. Because R-CNN needs to obtain a large number of proposals, and the heavily overlapping proposals cause a lot of unnecessary repeated computation, You Only Look Once (YOLO) [11] abandoned the proposal-based prediction idea of the R-CNN series, dividing the input image into several small cells and making a prediction in each cell. The YOLO algorithm achieved end-to-end detection and improved the detection rate, but its detection accuracy was deficient due to its coarse granularity.
The SSD algorithm draws on the cell idea of YOLO and the anchor mechanism of Faster R-CNN; it is a multi-object detection algorithm that directly predicts object categories and bounding boxes. SSD generates default boxes and applies convolutional prediction, comprehensively utilizing the output feature maps of different convolutional layers to achieve multi-object detection. The algorithm generates multiple default boxes with different aspect ratios and sizes for each predicted position in the output feature maps. During prediction, SSD generates category scores for the object in each default box and adjusts the corresponding default boxes. However, the SSD detection algorithm suffers from high spatial redundancy in the network and high redundancy between feature maps, which is not conducive to accurately locating the objects in the input image. Therefore, Fu et al. improved the original algorithm by combining a stronger feature extraction network and adding more context information through a deconvolutional module, proposing the Deconvolutional Single Shot Detector (DSSD) [12] model. However, with the replacement of the feature extraction network and the addition of the deconvolution module, the real-time detection performance of the model is greatly reduced. In order to improve the detection accuracy of the SSD object detection algorithm, Jeong et al. [13] improved the feature fusion method so as to make full use of the features of each output layer. Li et al. proposed the Feature Fusion Single Shot Multibox Detector (FSSD) [14] model, which obtains more details of the output feature layers through feature fusion and down-sampling, thereby improving the detection accuracy of the model.
In order to improve the detection accuracy and real-time performance of the traditional SSD object detection algorithm, this paper makes the following contributions. First, to enhance the correlation between the object score and the positioning accuracy, an IoU prediction loss branch is added to the improved model. Second, to reduce the spatial redundancy of the SSD model, a multi-frequency feature component convolution module is designed for the original model. Finally, to accelerate the convergence of the improved model, abnormal adaptive learning rates during training are modified by the AdaMod optimizer [15].

II. RELATED WORK
A. MULTI-FREQUENCY FEATURE THEORY
With the continuous improvement and development of related technologies in the field of computer vision, convolutional neural networks have been applied in many fields such as object detection, image recognition [16], and semantic segmentation [17], and have achieved great success. Although convolutional neural networks have made some progress in recent years in reducing the redundancy of model parameters [18]-[20] and of the channel dimension of feature maps [21]-[24], the output feature maps they generate still contain a lot of redundant information in the spatial dimension. In the output feature maps generated by the network model, each location separately stores the feature information of its own position. However, the feature information stored in adjacent locations is often similar, and this common information could be stored and processed together. The existence of a large amount of redundant information reduces the execution efficiency of the network model.
Applying a Fourier transform to a natural image yields two parts: low frequency and high frequency. Regions of the grayscale image with slow changes correspond to the low-frequency part, while regions with drastic changes correspond to the high-frequency part. The low-frequency region represents the overall structure of the natural image, while the high-frequency part captures its detailed edges. Inspired by this, the multi-frequency feature theory [25] divides the output feature maps of a convolutional neural network into high-frequency and low-frequency regions. In order to reduce spatially redundant information, the low-frequency information, which changes more gently, is stored in a tensor of lower resolution. On the premise of supporting information exchange and update between different frequencies, the low-frequency and high-frequency regions of the feature maps are convolved separately. The relevant flow of multi-frequency feature representation is shown in figure 1. As can be seen from figure 1, the multi-frequency feature theory extends a general convolutional neural network by dividing its feature maps into high-frequency and low-frequency feature maps. By sharing adjacent location information, the spatial features of the low-frequency group are reduced and stored in a low-resolution tensor, thereby reducing spatial redundancy. With the reduction of spatially redundant information, the memory consumption and computational cost of the convolutional neural network are greatly reduced.

B. ADAMOD OPTIMIZER
In order to accelerate the convergence of training, the Adam [26] algorithm is widely used at present, but due to its poor convergence behavior, not only is the applicability of the algorithm limited, but the convergence result is often not ideal. Therefore, to obtain better experimental results, the Stochastic Gradient Descent (SGD) [27] algorithm is still widely used in sample classification prediction, although its better results come at the expense of speed. The AdaMod optimizer, based on the Adam algorithm, addresses this trade-off. During training, AdaMod does not need a warm-up phase and is not sensitive to the learning rate of the network model. By calculating an exponential average of the adaptive learning rate, abnormal learning rates in the training process are modified, thus improving the convergence of the optimizer. The principle of the AdaMod optimizer is as follows. The related parameters of the optimizer are set, including the step length ε, the moment estimation exponential decay rates ρ1, ρ2 ∈ [0,1), a small constant δ for numerical stability, and the initial parameter θ. The first moment variable s, the second moment variable r, and the time step t are initialized. A minibatch of m samples {x1, x2, x3, ..., xm} is randomly selected from the training data set, with corresponding targets yi. The gradient over the sampled data is calculated by equation (1) and the time step is updated.
Through equations (2) and (3), the first moment and second moment estimates are updated.
The deviation of the first moment and the second moment is corrected by equations (4) and (5).
Equations (6) and (7) are used to update the parameter θ.
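The bodies of equations (1)-(7) do not survive in this copy. Based on the parameter definitions above (step length ε, decay rates ρ1 and ρ2, constant δ, moment variables s and r), the standard Adam updates they describe can be reconstructed as follows; the exact notation of the original is an assumption:

```latex
\begin{aligned}
g &\leftarrow \frac{1}{m}\nabla_{\theta}\sum_{i=1}^{m} L\bigl(f(x_i;\theta),\,y_i\bigr), \qquad t \leftarrow t+1 && (1)\\
s &\leftarrow \rho_1 s + (1-\rho_1)\,g && (2)\\
r &\leftarrow \rho_2 r + (1-\rho_2)\,g \odot g && (3)\\
\hat{s} &\leftarrow \frac{s}{1-\rho_1^{\,t}} && (4)\\
\hat{r} &\leftarrow \frac{r}{1-\rho_2^{\,t}} && (5)\\
\Delta\theta &= -\,\varepsilon\,\frac{\hat{s}}{\sqrt{\hat{r}}+\delta} && (6)\\
\theta &\leftarrow \theta + \Delta\theta && (7)
\end{aligned}
```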
The above process expresses the optimization principle of the Adam optimizer. Based on the Adam optimizer, the AdaMod optimizer adds a hyper-parameter ρ3 to describe the length of memory during model training. Before the parameter θ is updated, an update of the smoothed value u is added, as shown in equation (8).
In equation (8), ρ3 is the measure of memory length, and 1/(1−ρ3) reflects the effective range of the exponential moving average. The closer the value of ρ3 is to 1, the longer the memory of the optimizer. After the smoothed value u_t is calculated, it is compared with the adaptive learning rate v_t calculated by the optimizer. In order to avoid an excessively high learning rate during training, the smaller of the two is selected to update the parameter θ, as expressed by equation (9).
As described in the related literature of AdaBound [28], abnormal learning rate and the fluctuation of learning rate generally appear at the end of training, which is not conducive to the convergence and generalization of the optimizer.
Referring to the idea of the exponential moving average, the AdaMod optimizer first calculates the low-order moment values of the gradient. Second, the hyper-parameter ρ3 is introduced to describe the length of memory during model training. Through ρ3, long-term memory of the training process is carried into the next optimizer step, preventing the optimizer from falling into a bad state and trimming excessively large adaptive learning rates; thus the generalization and convergence of the optimizer are improved. In addition, the AdaMod optimizer controls the change of the adaptive learning rate at the beginning of training to ensure the stability of the start and of the whole training process, eliminating the ''warm-up'' phase, so that the optimizer converges to a better result, converges faster, and performs better overall.
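The update rule described above can be sketched in a few lines. The following is a minimal NumPy implementation for a scalar parameter; the function and variable names are our own, and it is a sketch of the published AdaMod rule rather than the authors' code:

```python
import numpy as np

def adamod_step(theta, grad, state, eps=0.001, rho1=0.9, rho2=0.999,
                rho3=0.99, delta=1e-8):
    # One AdaMod update; state carries s, r (moment estimates),
    # u (smoothed adaptive learning rate) and the step counter t.
    state["t"] += 1
    state["s"] = rho1 * state["s"] + (1 - rho1) * grad          # eq. (2)
    state["r"] = rho2 * state["r"] + (1 - rho2) * grad * grad   # eq. (3)
    s_hat = state["s"] / (1 - rho1 ** state["t"])               # eq. (4)
    r_hat = state["r"] / (1 - rho2 ** state["t"])               # eq. (5)
    v = eps / (np.sqrt(r_hat) + delta)       # per-parameter adaptive step size
    state["u"] = rho3 * state["u"] + (1 - rho3) * v             # eq. (8)
    v = np.minimum(v, state["u"])            # eq. (9): trim abnormal rates
    return theta - v * s_hat, state

# toy check: minimize f(theta) = theta^2
theta = 5.0
state = {"s": 0.0, "r": 0.0, "u": 0.0, "t": 0}
for _ in range(500):
    theta, state = adamod_step(theta, 2.0 * theta, state)
```

Because the smoothed value u starts at zero, the first updates are strongly damped, which is exactly the built-in warm-up behavior the text describes.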

III. ALGORITHM DESIGN
A. LOSS FUNCTION
As a typical single-stage object detector, the SSD model has the advantages of simplicity and efficiency and has been widely used in many fields [29], [30]. However, the low correlation between the predicted object category score and the accuracy of object positioning degrades the performance of the SSD model.
In terms of the loss function, the improved SSD model enhances the correlation between the object classification score and the object positioning accuracy. The improved model uses Visual Geometry Group 16 (VGG-16) [31] as the basic network and adds an IoU prediction loss branch to predict the IoU value between the default bounding box and the ground-truth bounding box. The classification score of the predicted object is multiplied by the predicted IoU value, and the result is used as the detection confidence of the improved model. The improved model includes a classification loss branch, a regression loss branch, and an IoU prediction loss branch; the loss structure of the improved SSD model is shown in figure 2. In order to ensure the effectiveness of the improved scheme without affecting the efficiency of the model, the head of the IoU prediction branch is similar to the other branches, containing only a 3 × 3 convolution layer, and a sigmoid activation layer ensures that the predicted IoU value lies in [0,1]. Since the IoU prediction value lies in [0,1], the Binary Cross Entropy (BCE) loss is used as the IoU prediction loss, as expressed by equation (10). In addition, the classification loss of the improved model adopts the Cross Entropy (CE) loss, while the regression loss adopts the smooth L1 loss, represented by equations (11) and (12) respectively. During training, the three loss branches are trained together.
The total loss of the model can be expressed by equation (13).
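As a concrete reference for equations (10)-(12), the three losses can be written out in NumPy as below; the equal weighting in the total loss is our assumption, since the paper's weighting coefficients are not reproduced here:

```python
import numpy as np

def bce_loss(p, t):
    # eq. (10): binary cross entropy for the IoU prediction branch
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

def ce_loss(logits, labels):
    # eq. (11): cross entropy over class logits (stable log-softmax)
    z = logits - logits.max(axis=1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(labels)), labels])

def smooth_l1(pred, target):
    # eq. (12): smooth L1 for bounding-box regression
    d = np.abs(pred - target)
    return np.mean(np.where(d < 1.0, 0.5 * d * d, d - 0.5))

def total_loss(l_cls, l_reg, l_iou):
    # eq. (13): combined loss; equal weights are an assumption
    return l_cls + l_reg + l_iou
```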
Before outputting the prediction boxes of the objects to be detected, the final score of each default box is calculated by equation (14), in which the parameter ρ controls the relative weight of the category score and the IoU value. Compared with a single classification score, this score calculation strengthens the correlation between detection confidence and positioning accuracy. Using the result for ranking in the NMS process better suppresses poorly localized detections.
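One common way to realize equation (14) is a geometric weighting of the two terms. The exact form below, S = p^ρ · IoU^(1−ρ), is an assumption consistent with the description, with ρ = 0.3 taken from the experiments later in the paper:

```python
import numpy as np

def detection_confidence(cls_score, iou_pred, rho=0.3):
    # rho -> 1 relies only on the category score;
    # rho -> 0 relies only on the predicted IoU.
    return cls_score ** rho * iou_pred ** (1 - rho)

# the ambiguous case from the detection-effect analysis: a box with a
# higher class score (0.97) but worse predicted localization (IoU 0.5)
loose = detection_confidence(0.97, 0.5)
tight = detection_confidence(0.91, 0.9)
```

Ranking NMS by this fused score lets the better-localized 0.91 box suppress the loosely fitted 0.97 box.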
B. MULTI-FREQUENCY FEATURE COMPONENT CONVOLUTION
1) DECOMPOSITION OF HIGH AND LOW FREQUENCY FEATURE MAPS
Based on scale space theory, a series of Gaussian filters can be used to process an image; as the size of the Gaussian filter changes, representations of the image at different scales are obtained. Assuming that H(x, y) represents a two-dimensional image and G(x, y; t) is the two-dimensional Gaussian function, the linear scale space of the image can be obtained by convolving the two, as shown in equation (15).
where t = σ² is the variance of the Gaussian filter, called the scale parameter. The larger the value of t, the stronger the smoothing of the image; when t = 0 the image is not smoothed and the convolution result is the image itself. Similarly, the output feature maps of a convolutional layer can be divided into two parts: high-frequency features and low-frequency features. The low-frequency feature components of the feature maps are obtained with a Gaussian filter with t = 2; the components not processed by the Gaussian filter are called the high-frequency feature components. Due to the redundancy of the low-frequency features, the spatial resolution of the low-frequency feature maps is halved relative to that of the high-frequency feature maps.
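The scale-space decomposition in equation (15) can be illustrated with a plain NumPy separable Gaussian filter; treating the residual after smoothing as the high-frequency component is our reading of the text:

```python
import numpy as np

def gaussian_kernel(t, radius=4):
    # t = sigma^2 is the scale parameter from eq. (15)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x * x / (2.0 * t))
    return k / k.sum()

def gaussian_blur(img, t=2.0):
    # separable convolution: filter rows, then columns, with edge padding
    k = gaussian_kernel(t)
    pad = len(k) // 2
    conv = lambda v: np.convolve(np.pad(v, pad, mode="edge"), k, mode="valid")
    return np.apply_along_axis(conv, 0, np.apply_along_axis(conv, 1, img))

np.random.seed(0)
img = np.random.rand(16, 16)
low = gaussian_blur(img, t=2.0)   # low-frequency component (t = 2)
high = img - low                  # residual treated as high frequency
```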
Suppose the input feature tensor of a convolutional layer of the improved SSD model is X ∈ R c×h×w, where c is the number of feature maps and h and w are the spatial dimensions of the input tensor. X is divided into a high-frequency feature component X H ∈ R (1−β)c×h×w and a low-frequency feature component X L ∈ R βc×(h/2)×(w/2), where β ∈ [0,1] represents the proportion of channels allocated to the low-frequency component.
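The split of X into X H and X L can be sketched as follows; storing the low-frequency channels at half resolution via 2×2 average pooling is one natural realization of the halving described above (shapes and names are illustrative):

```python
import numpy as np

def split_features(X, beta=0.25):
    # Split a (c, h, w) tensor into X_H with (1-beta)*c maps at full
    # resolution and X_L with beta*c maps stored at half resolution.
    c, h, w = X.shape
    c_low = int(beta * c)
    X_H = X[c_low:]                      # high-frequency part: (1-beta)c, h, w
    low = X[:c_low]
    # 2x2 average pooling compresses the low-frequency spatial resolution
    X_L = low.reshape(c_low, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return X_H, X_L

X = np.random.rand(8, 32, 32)
X_H, X_L = split_features(X, beta=0.25)
```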

2) CONVOLUTION OPERATION BASED ON HIGH AND LOW FREQUENCY FEATURE MAPS
Although the decomposition of the feature maps can effectively reduce spatial redundancy, it brings corresponding problems. Because of the difference in spatial resolution between the low-frequency and high-frequency features, a traditional convolution cannot operate directly on the decomposed input feature tensor. In order to act directly on the decomposed feature tensor X = {X H, X L} and avoid extra computing cost and hardware overhead, the following strategies are adopted.
Assume that W ∈ R c×k×k represents the k×k convolution kernel of the improved model, and X, Y ∈ R c×h×w represent the input and output of the convolution calculation. Based on the decomposition in the preceding subsection, the input feature tensor X is divided into high- and low-frequency components, X = {X H, X L}, with corresponding output Y = {Y H, Y L}. Suppose the input and output of the convolution have the same channel dimension c, that is, c_in = c_out = c. In order to obtain the convolution result Y, the convolution kernel W is divided into W H and W L, as shown in figure 3, and the corresponding convolution calculation process is represented by equations (16) and (17).
The Y H→H part in equation (16) is updated using a normal convolution calculation. For the Y L→H part, the low-frequency feature component X L is first up-sampled and then convolved. Similarly, for Y H→L in equation (17), X H is first processed by average pooling. In equations (16) and (17), (p, q) denotes a position coordinate, and the convolution is taken over the local neighborhood {(i, j) : i, j ∈ {−(k−1)/2, ..., (k−1)/2}}. Y H→H and Y L→L represent the internal information updates of the high- and low-frequency feature components, while Y H→L and Y L→H represent the information exchange between them.
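The four computation paths of equations (16) and (17) can be sketched with 1×1 kernels, a simplification that keeps the shapes and information flow while avoiding a full k×k convolution in NumPy; names and shapes are illustrative:

```python
import numpy as np

def conv1x1(X, W):
    # X: (c_in, h, w); W: (c_out, c_in) -> (c_out, h, w)
    return np.tensordot(W, X, axes=([1], [0]))

def upsample2(X):   # nearest-neighbour upsampling for the L->H path
    return X.repeat(2, axis=1).repeat(2, axis=2)

def avgpool2(X):    # average pooling for the H->L path
    c, h, w = X.shape
    return X.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def octave_conv(X_H, X_L, W):
    # eq. (16): Y^H = Y^{H->H} + Y^{L->H} (up-sample X^L, then convolve)
    Y_H = conv1x1(X_H, W["HH"]) + conv1x1(upsample2(X_L), W["LH"])
    # eq. (17): Y^L = Y^{L->L} + Y^{H->L} (pool X^H, then convolve)
    Y_L = conv1x1(X_L, W["LL"]) + conv1x1(avgpool2(X_H), W["HL"])
    return Y_H, Y_L

rng = np.random.default_rng(0)
X_H, X_L = rng.random((6, 32, 32)), rng.random((2, 16, 16))
W = {"HH": rng.random((6, 6)), "LH": rng.random((6, 2)),
     "LL": rng.random((2, 2)), "HL": rng.random((2, 6))}
Y_H, Y_L = octave_conv(X_H, X_L, W)
```

The two cross paths are exactly the information-exchange terms discussed above: removing either W["LH"] or W["HL"] severs the communication between frequencies.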

3) THE CONVOLUTION OPERATION MODULE AND COMPATIBILITY PROCESSING
The traditional SSD object detection model takes VGG-16 as the basic network and converts its fully connected layers fc6 and fc7 into convolutional layers Conv6 and Conv7, then increases the convolution depth by adding the detection layers Conv8_2, Conv9_2, Conv10_2, and Conv11_2. The detection model combines Conv4_3, Conv7, Conv8_2, and other convolutional layers to detect and identify objects.
To enhance the detection efficiency of the model and reduce its computational and hardware overhead, the improved model processes the ordinary convolution layers of the traditional SSD detection algorithm: the input feature tensor is decomposed, and the spatial resolution of the low-frequency feature component is compressed, thereby reducing the computational overhead and the related memory overhead of the model. In addition, through the setting of the switch control parameter β, the processed ordinary convolution layers are well compatible with the convolution layers participating in object detection; the relevant process is shown in figure 4.
In the improved SSD detection algorithm, in order to convert ordinary features into a multi-frequency feature component representation, the algorithm sets β_out = 1 and β_in = 0 in the Conv1_1 layer. Except for the convolutional layers used for object detection, the remaining convolutional layers are all set to β_out = β_in = β. In order to ensure the compatibility of the multi-frequency feature component convolution modules with the object detection layers, the multi-frequency outputs of the general convolution layers must be transformed back into the ordinary feature representation. Therefore, before object detection, β_out is set to 0. At this point, the convolution paths that produce low-frequency feature outputs are disabled, so that a single full-feature output is generated for detecting the objects in the input image.
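The compatibility switch can be sketched as a layer whose low-frequency output path is disabled when β_out = 0, so the detection layers receive a single full-feature tensor. This is a shape-level illustration with random 1×1 weights, not the authors' implementation:

```python
import numpy as np

def multifreq_layer(X_H, X_L, c_out, beta_out, rng):
    # beta_out splits the c_out output channels between frequencies
    c_low = int(beta_out * c_out)
    c_high = c_out - c_low
    # high-frequency output: internal update plus up-sampled low path
    Y_H = np.tensordot(rng.random((c_high, X_H.shape[0])), X_H, axes=([1], [0]))
    if X_L is not None:
        up = X_L.repeat(2, axis=1).repeat(2, axis=2)
        Y_H = Y_H + np.tensordot(rng.random((c_high, X_L.shape[0])), up, axes=([1], [0]))
    if c_low == 0:
        return Y_H, None   # beta_out = 0: low-frequency path disabled
    # otherwise also produce a low-frequency output at half resolution
    c, h, w = X_H.shape
    pooled = X_H.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    Y_L = np.tensordot(rng.random((c_low, c)), pooled, axes=([1], [0]))
    if X_L is not None:
        Y_L = Y_L + np.tensordot(rng.random((c_low, X_L.shape[0])), X_L, axes=([1], [0]))
    return Y_H, Y_L

rng = np.random.default_rng(0)
X_H, X_L = rng.random((6, 32, 32)), rng.random((2, 16, 16))
Y_full, Y_none = multifreq_layer(X_H, X_L, 8, beta_out=0.0, rng=rng)  # detection layer
Y_H2, Y_L2 = multifreq_layer(X_H, X_L, 8, beta_out=0.25, rng=rng)     # hidden layer
```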

C. OPTIMIZATION OF TRAINING PROCESS
To further improve the real-time detection of the SSD model and accelerate its convergence, the improved model uses the AdaMod optimizer instead of the traditional SGD algorithm. The AdaMod optimizer adjusts abnormal adaptive learning rates during training of the improved model, which guarantees the stability of the training process. It introduces a hyper-parameter to describe the length of memory during model training, which improves the generalization and convergence of the SSD model and accelerates its convergence speed.

IV. EXPERIMENTS AND ANALYSIS
A. EXPERIMENTAL DATA SETS AND EVALUATION INDICATORS
The related experiments are based on the MS COCO data set, which contains approximately 118,000 training images, 5,000 validation images, and 20,000 unlabeled test images, with 500,000 labeled objects from 80 object categories. In addition, in order to verify the generality of the improved SSD algorithm, the experiments were extended to the PASCAL VOC2012 data set, which is composed of 17,125 training images and 5,138 test images.
The evaluation indicators of the improved model cover two aspects. On the one hand, the detection accuracy of the improved SSD model is measured by four AP values: AP 0.9 with the IoU threshold set to 0.9, AP 0.75 with the threshold set to 0.75, AP 0.5 with the threshold set to 0.5, and the average AP value (the mean of AP 0.5, AP 0.75, and AP 0.9). On the other hand, the FPS value is used to evaluate the real-time performance of the improved model.

B. ANALYSIS OF PARAMETER SETTINGS
1) RESEARCH ON SETTING PARAMETER ρ IN THE IoU PREDICTION BRANCH
The detection confidence of the model depends on the category score and the IoU value, and the relative contribution of each to the detection confidence depends on the parameter ρ. It can be seen from table 1 that when ρ is set to 1, the average accuracy AP of the model increases slightly, by 1.4%, compared with the original SSD algorithm, indicating that the IoU prediction loss branch of the improved SSD model is beneficial to model performance. When ρ is set to 0.4, the average accuracy AP of the improved model reaches its best, 4.36% higher than that of the original SSD algorithm. In addition, according to the experimental data in table 2, when ρ is 1 the average accuracy AP improves slightly, by 0.86%, compared with the original algorithm; when ρ is 0.3, the average accuracy AP is 56.8%, an improvement of about 4%.
The experimental data in tables 1 and 2 are based on the MS COCO and PASCAL VOC 2012 data sets respectively, verifying that the IoU prediction loss branch generalizes well across data sets and effectively improves model accuracy. In addition, as ρ decreases from 1 to 0.3, the contribution of the predicted IoU value to the detection confidence continuously increases and the AP values of the improved SSD model tend to increase, which clearly indicates the relationship between the IoU prediction loss branch and the positioning accuracy of the model. The change of the AP value is shown in figure 5.

2) RESEARCH ON SETTING HYPER-PARAMETER β IN MULTI-FREQUENCY CONVOLUTION
When decomposing the output feature maps of the convolutional layers, the computational cost and memory consumption of the improved SSD model are closely related to the parameter β. The optimal setting of β is explored on the PASCAL VOC 2012 data set, with the parameter ρ of the IoU prediction loss branch set to 0.3. The changes in computational cost and memory consumption of the improved model are shown in table 3. It can be seen from table 3 that increasing β continuously increases the proportion of low-frequency feature components in the improved model, so that more low-frequency feature components are compressed, and the computational cost and memory consumption are significantly reduced. With the continuous compression of the low-frequency feature space, the corresponding accuracy changes of the improved SSD model are shown in table 4. When β is 0.125, the detection accuracy of the model improves by 0.4% while the computational cost drops significantly; the experimental data show that compressing the low-frequency features does not cause the loss of important features in the image. As the proportion of low-frequency feature components continues to increase, the detection accuracy of the improved model remains higher than that of the original SSD object detection model until β reaches 0.75 (table 2: the AP 0.5 test result of the original SSD model is 75.8%). When the proportion of low-frequency feature components is 75%, the accuracy drops by only 1.3%, while the other performance indicators improve greatly. The improved SSD model effectively reduces spatially redundant information and improves model efficiency, demonstrating the effectiveness of the improvement. Figure 6 shows the effect of the hyper-parameter β on the various indicators of the improved model.
Y H→L and Y L→H denote the two information communication paths between the high- and low-frequency feature components, which have an important impact on the accuracy of the improved SSD model; deleting either path reduces the performance of the improved model. With β set to 0.25, the relevant experimental results are shown in table 5.

C. COMPARISON OF RELATED MODELS
Based on the MS COCO and PASCAL VOC2012 data sets, the improved SSD model and the related comparison models are trained and tested. In the training stage of the improved model, in order to accelerate convergence, the traditional SGD algorithm is no longer used and the training process is optimized with the AdaMod optimizer. In addition, based on the results of the preceding experiments, the parameter ρ of the IoU prediction loss branch is set to 0.3, and the hyper-parameter β in the multi-frequency convolution operation is set to 0.25. The parameters of the AdaMod optimizer are set as follows: the step length ε is 0.001, the moment estimation exponential decay rates ρ1 and ρ2 are 0.9 and 0.999, the small constant δ used for numerical stability is 10^-8, and the memory length parameter ρ3 is 0.99. All models involved in the experiments are trained with sufficient iterations, and the trained models are evaluated on the test sets of the relevant data sets. The test results and experimental analysis are as follows:

1) COMPARISON OF EXPERIMENTAL RESULTS BASED ON MS COCO DATA SET
Based on the MS COCO data set, the improved model and the related comparison models are trained and tested; the comparison results are shown in table 6. It can be seen from the data in the table that the improved model achieves better real-time detection and higher detection accuracy than the Faster R-CNN, YOLO v2, SSD, FSSD, and DSSD detector models. On the MS COCO test set, the average of AP 0.5 and AP 0.75 for the improved SSD 300 algorithm reaches 39.55%; compared with the FSSD model, the average accuracy of our algorithm is improved by 1.8%, and compared with the traditional SSD detection model it is increased by 5.1%, a significant improvement over the original SSD model. In terms of real-time detection, the FPS value of the improved SSD 300 model reaches 61, which is sufficient for real-time detection. The performance improvement can be explained from two aspects. On the one hand, the introduction of the IoU prediction branch locates the objects in the input image more accurately, improving the positioning effect and reducing the missed detection of small and medium-sized objects. On the other hand, the AdaMod optimizer makes the model converge faster. In addition, the improved algorithm performs convolution based on multi-frequency feature maps and compresses the low-frequency components of the convolutional layer outputs, reducing the spatially redundant information of the improved SSD algorithm and the interference of irrelevant information. In the end, both the accuracy and the real-time performance of the improved model are well improved; the experiments fully demonstrate the advantages of the improved model and the effectiveness of the algorithm improvements.
The AP 0.5 changes of the relevant models over iterative training are shown in figure 7.

2) COMPARISON OF EXPERIMENTAL RESULTS BASED ON PASCAL VOC 2012 DATA SET
In order to verify the generality of the improved detection algorithm, the improved model and the related comparison models were trained and tested again on the PASCAL VOC 2012 data set. The experimental results are shown in table 7.
According to the experimental data in Table 7, the AP 0.5 of the improved SSD 300 algorithm on the PASCAL VOC 2012 data set reaches 79.2%, which is 3.4% higher than the detection accuracy of the original SSD 300 algorithm. Compared with DSSD, FSSD and other algorithms, the improved algorithm still holds a clear accuracy advantage on PASCAL VOC 2012. In terms of real-time detection, the improved model is much faster than Faster R-CNN, SSD, FSSD and DSSD. We consider that performing convolution on multi-frequency feature maps and compressing the correlated low-frequency feature components play a key role in this speed-up. The AP 0.5 curves of the related models over iterative training are shown in Figure 8. Combining the experimental results on the MS COCO and PASCAL VOC data sets, the improved model shows good generality across different data sets.
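The multi-frequency feature decomposition behind this speed-up stores part of the channels as a low-frequency component at reduced spatial resolution, in the spirit of octave convolution. A minimal sketch of the channel split (the function name, the `alpha` split ratio, and the CxHxW layout are illustrative assumptions, not the paper's exact module):

```python
import numpy as np

def split_frequencies(x, alpha=0.5):
    """Split a CxHxW feature map into high- and low-frequency components.

    The low-frequency branch keeps a fraction `alpha` of the channels and is
    stored at half the spatial resolution via 2x2 average pooling; storing
    and convolving these channels at quarter area is where the memory and
    compute savings come from.
    """
    c = x.shape[0]
    c_low = int(alpha * c)
    high = x[c_low:]                 # high-frequency part, full resolution
    low = x[:c_low]                  # low-frequency part, to be downsampled
    h, w = low.shape[1] // 2, low.shape[2] // 2
    # 2x2 average pooling implemented by reshaping into 2x2 blocks.
    low = low[:, :2 * h, :2 * w].reshape(c_low, h, 2, w, 2).mean(axis=(2, 4))
    return high, low
```

With `alpha=0.5`, half the channels occupy only a quarter of the spatial positions, so the subsequent convolutions over the low-frequency branch cost roughly a quarter of the usual multiply-adds for those channels.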

D. DETECTION EFFECT ANALYSIS
The advantages of the improved SSD algorithm are mainly reflected in three aspects. Firstly, our algorithm reduces both the repeated detection of multiple parts of the same object and the merging of multiple objects into a single detection. For example, in Figure 9 (a1), the traditional SSD detection model detects the same object repeatedly, which is obviously incorrect; the comparison in (a2) shows that the improved algorithm significantly alleviates this phenomenon. Similarly, in Figure 9 (c1), the traditional SSD algorithm detects multiple objects as one object, while the real scene contains two objects. Secondly, as the comparison of (d1) and (d2) in Figure 9 shows, the improved algorithm detects small and medium-sized objects better than the traditional SSD algorithm: by introducing the IoU prediction loss branch, the improved model locates the objects in the input image more accurately and successfully detects more small and medium-sized objects. Finally, the traditional detection algorithm relies solely on bounding box regression to position the objects, without considering the ambiguity of the ground-truth bounding box. Intuitively, a bounding box with a higher classification score should be more accurately regressed, but this is not always the case. As shown in Figure 9 (d1), the first person on the left receives two prediction boxes, and the box with the higher score (0.97) is positioned worse than the box with the lower score (0.91). The improved algorithm effectively remedies this situation by exploring the optimal value of the hyper-parameter ρ in the IoU prediction branch.
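The role of ρ can be illustrated with the power-weighted fusion of classification score and predicted IoU used in IoU-aware detectors; the paper does not spell out its exact scoring formula here, so this geometric-mean form is an assumption for illustration only:

```python
def detection_confidence(cls_score, iou_pred, rho=0.5):
    """Fuse classification score and predicted IoU into one ranking score.

    rho = 0 ranks boxes purely by classification score (the traditional SSD
    behavior); rho = 1 ranks purely by predicted localization quality.
    """
    return (cls_score ** (1.0 - rho)) * (iou_pred ** rho)
```

Under this fusion, a box scored 0.91 with a predicted IoU of 0.9 outranks a box scored 0.97 with a predicted IoU of 0.5, matching the situation described for Figure 9 (d1) where the lower-scored box is actually the better-localized one.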

V. CONCLUSION
The improved SSD multi-object detection model improves both detection accuracy and efficiency while reducing the computational cost and related hardware cost of the model. Its contributions are mainly reflected in the following aspects: (1) Aiming at the weak correlation between the predicted object category score and the object positioning accuracy in the traditional SSD model, the improved model enhances this correlation by adding the IoU prediction loss branch, thereby improving the detection accuracy of the model. (2) To improve the real-time performance of the algorithm and reduce the spatial redundancy of the model, a multi-frequency feature component convolution module is designed for the traditional SSD object detection model, which reduces the calculation overhead and hardware overhead. (3) To accelerate the convergence of the improved model, the AdaMod optimizer is introduced to bound the excessively large adaptive learning rates during training. Based on the authoritative MS COCO and PASCAL VOC 2012 data sets, it is verified that the improved model performs well across different data sets.

JIANGUO JU is currently pursuing the Ph.D. degree with the School of Information Science and Technology, Northwest University, Xi'an. His research interests include deep learning, data mining, and computer vision.

VOLUME 8, 2020