A Rich Feature Fusion Single-Stage Object Detector

Single-stage object detectors are fast and highly accurate. Depending on how the training model is developed, single-stage object detectors either adopt a pre-trained backbone network model or a model trained from scratch. A pre-trained backbone network model responds differently to classification and to detection, which leads to deviations in the learning goals and results in an architecture that is constrained by the classification network and hence difficult to modify. Training from scratch is not as efficient as using a pre-trained network, mainly due to the limitations of the predefined network structure. In this paper, we combine these two approaches to overcome the above-mentioned shortcomings. In our proposed method, a top-down concatenated feature pyramid is built upon a basic FSSD network. The experiments in this paper are conducted on the MS COCO and PASCAL VOC data sets. We apply VGG16 as the backbone network; our proposed method reaches 33.1 AP on the MS COCO benchmark, which further indicates its effectiveness.


I. INTRODUCTION
Object detection is a rapidly developing research area, as it is used in a wide range of applications. Object detection techniques can be divided into single-stage and two-stage categories. Single-stage object detectors, see, e.g., [1]-[6], extract candidate regions and produce the final detection result in a single pass; because extraction and detection are combined, single-stage detectors are faster than two-stage detectors. Two-stage object detectors, see, e.g., [7]-[12], use a cascaded network: the first stage proposes candidate boxes, and the second stage performs detection on those candidates.
There are two types of single-stage object detectors: (i) detectors based on a pre-trained convolutional neural network, see, e.g., [13]-[15], and (ii) detectors trained from scratch. YOLOv3 [16] utilizes a CNN model to achieve end-to-end object detection: the input image is first resized to a fixed size, then fed to the CNN network, and the network predictions are processed to obtain the detected objects. SSD [1] inherits the idea of converting detection into a regression problem, as in YOLO, and directly completes target positioning and classification; inspired by the anchors in Faster R-CNN [8], it proposes similar prior boxes and, by adding a feature pyramid, predicts targets on feature maps with different receptive fields. FSSD [17] improves the SSD model by adopting shallow feature fusion to improve the recognition of small objects. In object detection, positive and negative sample areas are extremely unbalanced, and the detection loss is easily dominated by the many negative samples. This issue is addressed in RetinaNet [6], which proposes the focal loss, and also in RefineDet [18], which starts from the network structure and combines the advantages of one-stage and two-stage detection algorithms to redesign an object detector with both state-of-the-art accuracy and speed. RefineDet first filters out negative anchors to reduce the classifier search space, and then roughly adjusts the positions and sizes of the anchors to provide a better initialization for subsequent regression.
Subsequent modules use the refined anchors as input to further improve the regression and enable prediction of multi-level labels. The RFBNet [19] model designs the RFB block, which enhances the distinguishability of features as well as the robustness of the model by simulating the relationship between the size of the receptive field (RF) and eccentricity in the human visual system; the RFB block is then added to the SSD method. GBFPNSSD [20] adds the SE module as a gate to the top-down and bottom-up feature pyramid networks. This dynamically readjusts feature weights so that only informative features are transmitted; GBFPNSSD then combines the two feature pyramids to improve detection performance. Another technique, FFBNET [21], builds a dense feature pyramid based on FSSD, which improves the accuracy of small-object recognition while only slightly increasing the number of model parameters. A pre-trained model is generally trained on a classification dataset such as ImageNet [22] and may not necessarily transfer to the detection dataset. Besides, its structure is fixed and not easy to modify. The training target of a pre-trained classification network is generally inconsistent with object detection, so the pre-trained model may be a suboptimal choice for a detection algorithm.
An object detector trained from scratch is proposed in DSOD [23]. It uses a proposal-free method to ensure convergence of the network, draws on the design principle of DenseNet [13], and uses dense blocks to avoid vanishing gradients. The Dense Prediction Structure proposed in DSOD greatly reduces the number of model parameters, and the extracted features contain a large amount of information. DSOD also uses a stem structure, which has the advantage of reducing the loss of input image information. ScratchDet [24] introduces a new Root-ResNet backbone, which greatly improves detection accuracy, especially for small objects, and also illustrates the importance of BatchNorm in the network structure. Our focus is on general object detection, mainly detecting common objects such as people and animals. The input is taken from RGB images captured by a camera, with only three channels. Images with hyperspectral information are mainly used in domains such as remote sensing and aviation; such an image is composed of an array of many channels (tens or even hundreds), so each pixel is described by many numbers, where the grey value on a single channel reflects the subject's reflection of light. Since a hyperspectral image contains more information about the captured target, recognition and classification tasks such as face recognition and object detection [25], [26] achieve higher accuracy with hyperspectral images than with RGB images. Rotation is a key issue in object detection. In this article, we use common rotation methods as well as the rotation methods proposed for hyperspectral images [27], [28], whose performance has been improved to a certain extent. We use the Mixup method [29] to augment the data. Essentially, Mixup trains a neural network on convex combinations of pairs of samples and their labels.
In the Mixup technique, two random samples are mixed in proportion, and the classification labels are distributed in the same proportion. Mixup fuses positive and negative samples into a new set of samples, doubling the sample size. Cutout [30] randomly cuts out part of the sample area and fills it with zero-valued pixels, leaving the classification label unchanged. CutMix [31] also cuts out part of the area, but instead of filling it with zeros, it fills the region with pixel values taken from other images in the training set; the classification labels are then distributed in proportion. The differences between these three data augmentation techniques are as follows: Cutout and CutMix differ in the pixel values of the filled area; Mixup interpolates two whole images in proportion, whereas CutMix mixes images by cutting a region from one image and patching it onto another, so the result does not suffer from the unnatural appearance of blended images. The main idea of AutoAugment [32] is to create a search space of data augmentation strategies and directly evaluate the quality of specific strategies on a given dataset.
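As a concrete illustration of Mixup, the following sketch (our own minimal NumPy version, not the paper's implementation) mixes two samples and their one-hot labels by a single Beta-distributed coefficient:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two samples and their one-hot labels in the same random proportion.

    `alpha` shapes the Beta distribution that draws the mixing coefficient;
    the value 0.2 is an illustrative choice, not taken from the paper.
    """
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2     # convex combination of the images
    y = lam * y1 + (1.0 - lam) * y2     # labels distributed in proportion
    return x, y, lam
```

A mixed label such as (0.7, 0.3) keeps the classifier's targets consistent with the blended image, which is what makes the augmented samples usable for training.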
The scratch network used in this paper has a small number of convolution operations, so the extracted features have rich location information. The pre-trained model generally contains a deep convolutional network structure, so its extracted features are relatively abstract and carry rich semantic information. In our proposed technique, the FSSD method is used as the pre-trained model, and the proposed Concatenated Feature Pyramid (CFP) then combines FSSD with the scratch network, so that the high-level semantic information of the deep feature maps is extended to the shallow layers of the neural network. We thus propose an object detector that combines a scratch network and a pre-trained model to enrich the semantic information in the middle and shallow layers of the network, which improves the detection performance for small objects.
We conducted experiments on two datasets, MS COCO [33] and PASCAL VOC [34], and compared the performance of our proposed object detector with several existing detectors. The experiments indicate that our proposed method outperforms the existing methods. Compared with the benchmark on the MS COCO dataset, detection accuracy is greatly improved, especially for small targets. For an input size of 512 × 512, our method achieves 33.1% AP within 42 ms, exceeding the performance of object detectors such as YOLOv3 and RefineDet (see Fig. 1 and Table 1).

II. BASELINE DETECTION FRAMEWORK
In our work, we use FSSD as the baseline structure of the object detector. FSSD [17] is a fast and accurate single-stage object detector and an improved version of the SSD algorithm. To extract features, FSSD utilizes VGG-16 [14] as the backbone network, and then uses feature maps of different resolutions at different stages for prediction. Compared to SSD, FSSD draws on the FPN concept: it constructs a feature fusion method that introduces lower-layer features to the upper layers of the network.
Limitation: Although FSSD adds feature fusion to SSD, the semantic information in the shallow feature maps is still limited, so the recall rate for small targets is not high. Missed detections are therefore likely, and the detection accuracy for small-scale objects is low. Existing works address this: for example, a top-down feature pyramid network [11] combines deeper feature maps carrying high-level semantic information with shallower layers carrying more accurate location information. The use of an image pyramid network is also proposed in [35]. Although such networks perform well, they may require a relatively large amount of computation.

FIGURE 1. On the MS COCO dataset, accuracy (AP) and speed (ms) are compared with existing single-stage methods; we show both overall AP and the performance on small objects. Except for YOLOv3 (608 × 608), DSOD (300 × 300), and ScratchDet (300 × 300), the detectors use an input size of about 512 × 512. Our method, like the others, is based on the VGG-16 [14] backbone. For a fair comparison, speed is measured on a single 1080Ti GPU.

III. OUR APPROACH
Here we first present the proposed framework of the algorithm, in which our scratch network is combined with the FSSD algorithm to supplement the information. We then introduce the Concatenated Feature Pyramid which, similar to a top-down feature pyramid network, supplements the shallow layers with the high-level semantic information of the deep layers.

A. OVERALL ARCHITECTURE
The overall architecture of the proposed technique is presented in Fig. 2. It includes three main modules: FSSD as the baseline detection framework, the scratch network (SN), and the Concatenated Feature Pyramid. As its backbone, FSSD uses the image classification network VGG-16, pre-trained on the ImageNet dataset. The feature maps extracted by FSSD and the scratch network are combined in the Concatenated Feature Pyramid through sum and batch-norm operations for direct prediction.

B. FSSD
The basic network architecture, FSSD, is an object detection algorithm that uses a pre-trained VGG16 backbone network. FSSD fuses the feature maps, including Conv4_3, FC7, and Conv_8, extracted from the original SSD backbone network. We examined two fusion methods, concatenation and ele-sum, where ele-sum denotes pixel-by-pixel addition of the feature maps. Our experiments indicate that concatenated fusion outperforms ele-sum. We then add convolutional and ReLU layers to obtain multi-scale feature maps, as in SSD, with different numbers of channels. Finally, the feature maps are input to the prediction layer to generate the results. The fusion of shallow features is beneficial to object detection and improves the accuracy of small-object detection.
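The two fusion options can be sketched as follows (a toy NumPy version with nearest-neighbour resizing standing in for the interpolation the real network uses; all shapes and names are illustrative):

```python
import numpy as np

def nearest_resize(fmap, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map to (C, out_h, out_w)."""
    c, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[:, rows][:, :, cols]

def fuse_concat(maps, out_h, out_w):
    """Concatenated fusion: resize every map to a common size, stack channels."""
    return np.concatenate([nearest_resize(m, out_h, out_w) for m in maps], axis=0)

def fuse_elesum(maps, out_h, out_w):
    """Ele-sum fusion: resize, then add pixel-by-pixel (channel counts must match)."""
    resized = [nearest_resize(m, out_h, out_w) for m in maps]
    return np.sum(resized, axis=0)
```

Concat preserves each source map's channels at the cost of a wider tensor, while ele-sum collapses them; this difference may explain why concatenation fared better in the experiments.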

C. THE SCRATCH NETWORK
The scratch network is similar to an image pyramid network. It preserves most of the original image information after a few convolution operations, although it carries less high-level semantic information than a deep network. The scratch network uses a small number of convolutional layers to ensure that the location information of targets in the feature maps is rich enough. As seen in Fig. 2, this architecture directly generates fixed-size feature maps by max-pooling the image, and then generates feature maps of different scales through the convolutional network. The shallow and intermediate feature maps obtained by combining FSSD and the scratch network therefore have rich location information, which is conducive to detecting small targets and appropriately improves the shallow semantic information.
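A toy sketch of this idea follows (pooling only; the few convolutional layers the real scratch network interleaves are omitted, and all sizes are illustrative):

```python
import numpy as np

def max_pool(x, k):
    """k x k max-pooling with stride k on a (C, H, W) array (H, W divisible by k)."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def scratch_pyramid(image, base_pool=4, num_scales=3):
    """Generate a fixed-size base map straight from the image, then halve it
    repeatedly, mimicking how the scratch network keeps location information
    with very little computation."""
    maps = [max_pool(image, base_pool)]
    for _ in range(num_scales - 1):
        maps.append(max_pool(maps[-1], 2))
    return maps
```

Because each map is derived directly from the pooled image rather than from deep convolutions, spatial positions in the maps correspond closely to positions in the input.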

D. CONCATENATED FEATURE PYRAMID
The Concatenated Feature Pyramid (CFP) is equivalent to a top-down scheme. The specific operation is

F'_i = Cat(Conv(F_i), U(F_{i+1})),

where F_i is the feature map at level i, Cat represents the Concat operation, U(·) represents the up-sampling operation, and Conv(·) represents a 1 × 1 convolution.
There are two fusion methods, Concat and ele-sum; through experiments, we found that Concat outperforms ele-sum. As shown in Fig. 2, in the CFP the deep feature maps are enlarged to the same size as the shallower layer through bilinear up-sampling, while the shallower feature maps pass through a 1 × 1 convolution. The two are then combined using the Concat operation, which extends the high-level semantic information of the deep feature maps to the shallow features. This improves the detection accuracy for smaller objects.
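One top-down CFP step can be sketched as follows (NumPy; nearest-neighbour up-sampling stands in for the bilinear up-sampling used in the paper, and the 1 × 1 convolution is written as a per-pixel channel mixing):

```python
import numpy as np

def conv1x1(fmap, weight):
    """1 x 1 convolution: a per-pixel linear map over channels.
    fmap: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, fmap)

def upsample2x(fmap):
    """Factor-2 nearest-neighbour up-sampling (stand-in for bilinear)."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def cfp_step(shallow, deep, weight):
    """Cat(Conv(shallow), U(deep)): extend deep semantics down one level."""
    return np.concatenate([conv1x1(shallow, weight), upsample2x(deep)], axis=0)
```

Applying this step level by level, from the deepest map toward the shallowest, carries high-level semantic channels into every shallower feature map.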
Deep networks are known to be effective for classification, while shallow networks are efficient for localization. The deeper the network, the more abstract the extracted features and the less location information remains in the generated feature maps. More convolutional layers produce a smaller feature map with a larger corresponding receptive field, which enables the detection of larger objects, while a larger shallow feature map can detect small objects. However, because shallow layers carry little semantic information, the detection performance for smaller objects is rather low. This paper proposes a combination of pre-training and training from scratch: training from scratch through image max-pooling significantly reduces the number of parameters and provides a degree of invariance to transformations such as rotation, translation, and scaling, so that the network trained from scratch retains more image position information. A small number of convolutional layers also ensures that the feature maps preserve more location information, while the deeper feature maps generated by the pre-trained model are rich in semantic information. The two are combined to improve the detection of small objects. Besides, our top-down concatenated feature pyramid transmits deep semantic information to the shallow network, so that the features at all scales have rich semantic information. The introduced top-down concatenated feature pyramid module offers the advantage of fewer parameters while providing improved performance.

IV. EXPERIMENT

A. DATASETS
In this paper we use two datasets, MS COCO and PASCAL VOC. The PASCAL VOC dataset is divided into 4 major categories, vehicle, household, animal, and person, with a total of 20 sub-categories. This dataset can be used for evaluating networks on tasks such as target classification, detection, and image segmentation. We use the VOC2007 and VOC2012 train+val sets (16551 images in total) for training, and the VOC2007 test set (4952 images) to test detection performance. The evaluation standard of PASCAL VOC is mAP (mean average precision). The MS COCO dataset is used for object detection tasks and comprises a total of 80 classes and 123287 images, including 118287 images in the training set and 5000 images in the validation set. For the object detection task, it has an average of 5 targets per image. The target positions are constructed according to the key points, and location annotation is performed; after training on these annotations, we perform object detection. The evaluation method of MS COCO differs from that of PASCAL VOC: a series of IoU thresholds from 0.5 to 0.95, with an interval of 0.05, is used to calculate the AR (average recall) and AP (average precision), and the average over these thresholds is taken as the final AR and AP.
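The COCO-style averaging can be sketched as follows (our own minimal version; the per-threshold AP values passed to `coco_ap` are hypothetical placeholders, not results from the paper):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# COCO evaluates at IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
COCO_THRESHOLDS = np.arange(0.5, 0.96, 0.05)

def coco_ap(ap_at_each_threshold):
    """Average the per-threshold AP values into the single reported AP figure."""
    return float(np.mean(ap_at_each_threshold))
```

A detection counts as correct at a given threshold only if its IoU with a ground-truth box exceeds that threshold, so the averaged AP rewards tight localization as well as correct classification.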

B. PASCAL VOC
In the conducted experiment, the rich feature fusion single-stage detector was trained on the combined Pascal VOC2007 and Pascal VOC2012 dataset. We use VOC2007 and VOC2012 train+val (16551 images) for training, and then VOC2007 (4952 images) to test performance. For an input size of 300 × 300, we set the batch size to 32 during training and the total number of epochs to 250, with an initial learning rate of 0.003. For stable training, we use a warm-up strategy that gradually increases the learning rate from 1 × 10^-6 to 3 × 10^-3 within the first 6 epochs. Subsequently, the learning rate follows the original schedule, divided by 10 at epochs 150, 200, and 240. In our experiments, the weight decay is set to 0.0005 and the momentum to 0.9. For an input size of 512 × 512, the total number of epochs is set to 200 and the batch size to 16; the other settings are the same as for 300 × 300.
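The schedule above can be sketched as follows (the milestone epochs and learning-rate values are from the paper; the linear shape of the warm-up is our assumption, since the text only says the rate increases gradually):

```python
def learning_rate(epoch, base_lr=3e-3, warmup_start=1e-6,
                  warmup_epochs=6, milestones=(150, 200, 240), gamma=0.1):
    """Per-epoch learning rate: linear warm-up, then step decay at milestones."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start at epoch 0 up toward base_lr.
        frac = epoch / warmup_epochs
        return warmup_start + frac * (base_lr - warmup_start)
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma  # divide by 10 at each milestone
    return lr
```

The same function covers the COCO experiments by changing the arguments, e.g. `warmup_epochs=5, milestones=(80, 100, 140)`.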
The ablation experiment is shown in Table 2. For the input size of 300×300, the value of the mAP in our method reaches 79.7, and for the input size of 512 × 512, the mAP value for our method reaches 81.8. Note that the inference speed of our proposed method is the fastest among the comparison methods.

C. MS COCO
In this experiment, the rich feature fusion single-stage detector was trained on the MS COCO2017 dataset, which contains 118287 images in the training set and 5000 images in the validation set. Multiple GPUs are used for training. For an input size of 300 × 300, we set the batch size to 31 per GPU during training, for a total batch size of 93, and the total number of epochs to 150. At the beginning of training, we apply the warm-up technique, gradually increasing the learning rate from 1 × 10^-6 to 3 × 10^-3 during the first five epochs; we then decrease it by a factor of 10 after epochs 80 and 100, with the last decrease at epoch 140. In our experiments, the weight decay is set to 0.0005 and the momentum to 0.9. For an input size of 512 × 512 we use 4 GPUs with a total batch size of 38; the other settings are the same as for 300 × 300.
The ablation experiment is shown in Table 3. For the input size of 300 × 300, the AP of our method reaches 28, and for the input size of 512 × 512, the AP of our method reaches 33.1.

D. COMPARATIVE ANALYSIS OF EXPERIMENTS
Our method and FSSD, YOLOv3, RefineDet, and RFBNet are all object detectors that use multi-scale prediction. FSSD is an improved version of the SSD detector aimed at better detection of small targets. Small targets are usually predicted by shallow layers, which have insufficient feature abstraction capability and lack semantic information; moreover, small-object detection usually relies heavily on context information. FPN was therefore proposed to fuse shallow and deep features so that the shallow features are better supported for object detection, improving small-target detection. The scratch network used in our detector passes the image through simple max-pooling and a small number of convolutional layers to obtain shallow feature maps with rich location information, and then uses a top-down concatenated feature pyramid to improve small-target detection.
YOLOv3 is a multi-scale object detector based on the Darknet-53 network structure. The network draws on the practice of residual networks, with shortcut connections set between some layers. It is both accurate and fast, extracts image features well, and further uses multi-scale prediction (similar to FPN).
RefineDet is an object detector that combines the advantages of one-stage and two-stage detectors. It borrows the coarse-to-fine regression idea of two-stage detectors (first obtain coarse box information through an RPN-like network, then pass it to a conventional regression branch for further regression to obtain more accurate box information), and it borrows the feature fusion idea of FPN to improve the detection of small targets. The detection framework of RefineDet is still SSD; the difference from SSD is that it applies the RPN idea to multi-feature-map detection.
The RFBNet detector also uses a multi-scale detection framework SSD. The RFB module is embedded to make the lightweight backbone SSD network faster and more accurate.
RFB is a multi-branch convolution module similar to the Inception module. Its internal structure can be divided into two components: a multi-branch convolution layer and a subsequently expanded convolution layer.
As shown in Table 4, on the MS COCO dataset, our method is compared with FSSD, RefineDet, YOLOv3, and RFBNet in terms of accuracy (AP) and speed (ms). The table reports AP, AP at IoU 0.5, AP at IoU 0.75, and AP on large, medium, and small objects; our results rank first or second on every metric. In detail, our method outperforms FSSD and SSD in all respects; it outperforms RefineDet except on the AP50 and AP75 indicators; it outperforms RFBNet in the detection of small and medium targets; and it outperforms YOLOv3 except on the AP50 and APs indicators. Besides, our method is also faster than RefineDet and YOLOv3.
Ablation Study: To evaluate the effectiveness of CFP and the scratch network in our method, we also conducted a series of ablation experiments summarized in Table 5. To be fair, we used the same training strategy and input size (300 × 300 and 512 × 512) in all experiments.

V. CONCLUSION
In this paper we proposed an object detector that combines a scratch network and a pre-trained model to enrich the semantic information in the middle and shallow layers of the neural network. In our proposed technique, the FSSD method was used as the pre-trained model, followed by a Concatenated Feature Pyramid that extends the high-level semantic information of the deep feature maps to the shallow layers of the network. This improved the detection performance for small objects. Our experiments on the Pascal VOC and MS COCO datasets indicate that the proposed method outperforms single-stage object detectors including YOLOv3, RefineDet, and ScratchDet.