Weakly perceived object detection based on an improved CenterNet

Abstract: Object detection methods based on deep neural networks are now widely applied in autonomous driving and intelligent robot systems. However, weakly perceived objects of small size in complex scenes have too few features to be detected reliably, which lowers the detection accuracy. To improve the performance of the detection model in complex scenes, this work develops an improved CenterNet detector that enhances the feature representation of weakly perceived objects. Specifically, we replace ResNet50 with ResNext50 as the backbone network to strengthen the feature extraction of the model. Then, we add a lateral connection structure and dilated convolution to the feature enhancement layer of CenterNet, which enriches the features and enlarges the receptive fields for weakly sensed objects. Finally, we apply an attention mechanism in the detection head of the network to enhance the key information of weakly perceived objects. To demonstrate the effectiveness, we evaluate the proposed model on the KITTI dataset and the COCO dataset. Compared with the original model, the mean average precision (mAP) of the improved CenterNet over the vehicles and pedestrians in the KITTI dataset increased by 5.37%, and the average precision of weakly perceived pedestrians increased by 9.30%. Moreover, the average precision of small objects (AP_S) in the COCO dataset increased by 7.4%. The experiments show that the improved CenterNet significantly improves the average detection precision for weakly perceived objects.


Introduction
With the popularization of artificial intelligence technology, automatic driving technology continues to develop rapidly [1]. For automatic driving systems, object detection technology plays a vital role in environmental awareness tasks [2]. Nowadays, the traditional object detection algorithms based on handcrafted features are being gradually replaced by detection technology based on a deep neural network.
The Region-based Convolutional Neural Network (R-CNN) [3] was the first object detection framework built on convolutional neural networks (CNNs), and it uses CNN features to improve the feature representation for detection. Faster R-CNN [4] generates region proposals via a region proposal network (RPN) and introduces an anchor mechanism to regress the objects, which established the framework of anchor-based detection and improved detection performance. Since it was proposed, the anchor mechanism has played a critical role in popular detectors such as YOLOv2 [5], YOLOv3 [6], YOLOv4 [7], Libra R-CNN [8] and Cascade R-CNN [9].
However, owing to the development of object detection technology, the drawbacks of the anchor mechanism cannot be ignored. For instance, the detection performance of the anchor-based model is greatly affected by the size, aspect ratio and number of anchor boxes [10]. Moreover, due to the fixed size and aspect ratio of the anchor boxes, it is difficult for the anchor-based models to detect objects with large-scale variations; thus, the models need to reset the anchor boxes with different sizes and aspect ratios for the new detection task. In addition, the anchor-based models need to put dense anchor boxes on the input image to obtain a higher recall rate, which brings large amounts of hyperparameters and increases the computational complexity of the model [11]; meanwhile, most anchor boxes are considered as negative samples and only a few are considered as positive samples, which aggravates the imbalance between positive and negative samples.
Therefore, object detection algorithms that are anchor-free have attracted lots of attention in recent years, and they do not rely on pre-set anchor boxes. Compared with the traditional anchor-based method, the anchor-free detector with the simpler structure has no hyperparameters related to the anchor boxes, and it has the potential to surpass anchor-based methods in detection speed and accuracy.
Anchor-free detectors are usually divided into key point-based methods and center-based methods. A key point-based method first locates pre-defined or self-learned key points and then generates bounding boxes to detect objects; examples include CPNDet [12], RepPoints [13] and YOLOX [14]. A center-based method regards the central area (center point or region) of the object as the foreground and then predicts the distances from it to the four sides of the object; examples include LSNet [15], GA-RPN [16] and FSAF [17].
Though those approaches achieve great detection performance, the detection performance falls when confronting complex scenes with lots of weakly perceived small objects. Therefore, we propose a novel and effective detector based on the anchor-free mechanism to detect weakly perceived objects accurately. Concretely, we first replace the backbone ResNet50 of CenterNet with ResNext50 to acquire various levels of features, and then we integrate the multi-level features of the bottom-up pathway into the corresponding features with the same scale in the top-down pathway by using the lateral connections of a feature pyramid network (FPN) [18], obtaining plentiful information on weakly sensed small objects. Simultaneously, we add the dilated encoder following the input features in the top-down pathway to enlarge the receptive fields of the weakly sensed small objects. Finally, we append the squeeze-and-excitation (SE) attention module in the detection head to enhance the key point knowledge of the weakly sensed objects.
In summary, the main contributions of this work are as follows:
1) We propose an improved anchor-free detector based on the CenterNet to elevate the detection performance for weakly perceived objects in complicated scenes.
2) We improve the feature enhancement layer of the CenterNet by adding the lateral connection structure and the dilated convolution to enrich the information of the features and enlarge the receptive fields for weakly sensed small objects.
3) We modify the detection head of the CenterNet with the SE attention module to enhance the key information of weakly sensed objects with small sizes.

Related Work
Anchor-free detectors. The anchor-free detectors require no pre-set anchor box, which makes the network structure simpler and the model more generalized. Anchor-free detectors mainly consist of two streams of center-based detectors [19] and key point-based detectors. The center-based detectors use the central region or the central point of the object to determine the positive sample and then regress its distance to the bounding box. The typical center-based works are FCOS [20], SAPD [21], etc. Another stream of key point-based detectors identifies the key points of the object to regress the bounding boxes of the objects. For instance, CornerNet [22] converts the object detection problem into the detection of a pair of key points for the object without anchors and then uses a top-left corner and a bottom-right corner of the object to predict the bounding box of the object. Referring to CornerNet, ExtremeNet [23] detects four extreme points (top-most, left-most, bottom-most, right-most) and one center point of the object to generate the bounding box. Different from the above models that detect multiple key points, CenterNet [24] regards the object as a key point to determine the center coordinates of the object through Gaussian operation and then regresses the size and position of the object.
Multi-scale feature-enhancement methods. SSD [25] is the first detector to adopt multi-scale feature enhancement and multi-level feature stratification to detect objects of various sizes; it allocates objects to feature layers according to their size, and each layer is responsible for predicting objects of the corresponding scale. The features of shallow layers, which carry more detailed information, are suitable for learning small objects, while the features of deep layers, which carry more global semantic information, are appropriate for predicting large objects. DSSD [26] argues that the insufficient semantic information and the considerable noise in the shallow layers of SSD weaken the classification ability of the detection network. Thus, DSSD adopts ResNet101 and integrates the global semantic information of deep layers into shallow layers. Liu et al. [27] presented RFBNet, which is based on SSD and adds a receptive field module to improve the detection of weakly sensed small objects. The receptive field module is composed of multi-branch convolution and dilated convolution, which expands the width of the network and enhances its adaptability to multi-scale objects.
Attention mechanism. Since each channel of a feature contributes differently to the detection performance, Hu et al. [28] proposed a channel-attention model, SENet, which learns the weights of different channels so that the network pays more attention to key channel information. Inspired by SENet, ECANet [29] adopts 1D convolution to implement local cross-channel interaction, maintaining the detection performance while reducing the parameters of the model. Different from SENet and ECANet, CBAM [30] exploits both spatial- and channel-wise attention mechanisms to heighten the focus on important parts, and it contains two main components: a channel attention module and a spatial attention module. The channel attention module attends to the important channels of the feature, and the spatial attention module focuses on the key location information of objects. Mnih et al. [31] introduced the attention mechanism to extract more small-scale features, strengthening the focus on small-scale objects and improving the detection performance for small objects.

CenterNet
As an anchor-free detector, CenterNet directly predicts the category and coordinates of the object on feature maps without numerous pre-setting anchor boxes, leading to fewer hyperparameters. In addition, the CenterNet determines center points via key point estimation and then regresses the object properties of location and size. As shown in Figure 1, the CenterNet is composed of three parts: backbone network, feature enhancement layer and detection head. First, CenterNet extracts the preliminary features from the input image via the backbone network, and then it uses a feature enhancement layer to strengthen the semantic contexts of the features to obtain high-resolution features. Finally, the high-resolution features are applied to classify and regress in the detection head to predict the bounding boxes of objects.

Backbone network
The standard CenterNet model takes ResNet50 as the backbone network for feature extraction. ResNet introduces residual learning into the deep network, which brings the shallow information into the deep layers of the network by performing an identity mapping operation, solving the problem of deep network degradation. However, the feature extraction effect of ResNet50 is still insufficient; thus, the grouped convolution [32] is introduced into ResNext50 [33] to extract multiple levels of features.
Concretely, ResNext50 divides the input feature into several groups and applies a block composed of several 1 × 1 and 3 × 3 convolutions to update each group; the updated features are then concatenated to enhance the feature information, and a shortcut connection, as in the ResNet structure, is applied to prevent network degradation. The building blocks of ResNet50 and ResNext50 are illustrated in Figure 2, where Figure 2a) is the block of ResNet50 and Figure 2b) is the block of ResNext50. As shown in Figure 2b), ResNext50 divides the input features into 32 groups by grouped convolution to widen the network and applies a 1 × 1 convolution to the grouped features to reduce their channel dimensions; it then applies one 3 × 3 convolution to refine semantic contexts and a 1 × 1 convolution to raise the channel dimensions. Finally, the features of each group and the original input features are aggregated to obtain the output feature of the residual block. By increasing the network width, ResNext50 extracts rich features at different levels of the network, which improves the performance of the network while keeping the parameters at the same level as ResNet50. Because weakly sensed objects require sufficient information for detection, we employ ResNext50 as the backbone network to improve the feature-extraction capability of the CenterNet model.
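The claim that grouped convolution widens the block while keeping the parameters at the same level can be checked with a short count. The sketch below is illustrative only: the channel widths (256 in and out, bottleneck width 64 for ResNet50 and 128 for a 32-group block) are assumptions taken from the commonly used first-stage configurations, not values stated in this paper, and biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias and normalization ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def bottleneck_params(c_io, width, groups=1):
    """1x1 reduce -> 3x3 (optionally grouped) -> 1x1 expand."""
    return (conv_params(c_io, width, 1)
            + conv_params(width, width, 3, groups)
            + conv_params(width, c_io, 1))


# Assumed widths: ResNet50 block 256 -> 64 -> 256, grouped block 256 -> 128 -> 256.
resnet_block = bottleneck_params(256, 64)
resnext_block = bottleneck_params(256, 128, groups=32)

print(resnet_block, resnext_block)  # 69632 70144
```

Under these assumed widths the grouped block is twice as wide in its 3 × 3 layer yet has almost the same number of weights, because each of the 32 groups only connects 4 input channels to its share of the outputs.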

Improvement of the feature-enhancement layer
After the preliminary feature is obtained from the ResNext50 backbone, the feature-enhancement layer of the basic CenterNet in Figure 1 passes it through three upsampling layers to generate a high-resolution feature. However, this high-resolution feature originates only from the preliminary feature, and the stacked upsampling layers of the basic CenterNet in Figure 1 cause information loss, resulting in insufficient detailed information on weakly sensed objects in the feature map P2. To solve this problem, inspired by the FPN structure, we improve the feature-enhancement layer with lateral connections that aggregate the features of different scales from the bottom-up pathway, providing abundant detail about weakly perceived objects in the top-down pathway. We then integrate the dilated encoder module [34] to enlarge the receptive field of the features and acquire more global semantic context for weakly sensed objects.
The architecture of the improved feature-enhancement layer is shown in Figure 3; it involves a bottom-up pathway, a dilated encoder, a top-down pathway and the lateral connections referring to the FPN structure.
The bottom-up pathway, with its numerous convolution layers, is the feed-forward computation of the ResNext50 backbone, which computes a feature hierarchy consisting of features at different scales. We regard layers that produce output maps of the same size as belonging to the same network stage. Thus, for the bottom-up pathway on the ResNext50 backbone, we define four stages, S2, S3, S4 and S5, according to the sizes of the output maps. We denote the output features of the stages {S2, S3, S4, S5} as {C2, C3, C4, C5}; they are input to the top-down pathway of the improved CenterNet model, forming a multiple-input structure, in contrast to the single-input top-down pathway of the original CenterNet model. With the multi-scale features C2, C3, C4 and C5, the multiple-input structure merges various input features so that the top-down pathway obtains sufficient information.
Similar to the bottom-up pathway, we define four feature maps of different scales in the top-down pathway as {P2, P3, P4, P5}. P5 is produced first, by passing the preliminary feature map C5 of the ResNext50 backbone through a dilated encoder, which enlarges the receptive field of C5 to acquire more semantic context. Then, following the FPN structure, we merge the feature maps {C2, C3, C4} of the bottom-up pathway into the features of the same spatial size in the top-down pathway to enrich the detailed information of the top-down output. Specifically, the bottom-up features C2, C3 and C4, each followed by a 1 × 1 convolution, are integrated into the corresponding feature maps in the top-down pathway via the top-down connections of the FPN, producing the features P2, P3 and P4, as shown in Figure 3.
With the adoption of the dilated encoder and lateral connections, the detailed information of the features with various scales in the top-down pathway is strengthened and the corresponding receptive fields are enlarged. As a result, the feature P2 with a large receptive field and rich contextual semantic global knowledge is obtained as the output feature of the top-down pathway.
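The top-down merge can be illustrated with a toy, single-channel sketch in plain Python. It assumes, following the usual FPN design, that each top-down map is upsampled by a factor of two (nearest neighbour here) and added element-wise to the lateral map of the same size; the 1 × 1 lateral convolutions and the dilated encoder are omitted, so the inputs stand for already-projected features. The function names and map sizes are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(top, lateral):
    """Element-wise sum of the upsampled top-down map and a lateral map."""
    up = upsample2x(top)
    assert len(up) == len(lateral) and len(up[0]) == len(lateral[0])
    return [[u + l for u, l in zip(u_row, l_row)]
            for u_row, l_row in zip(up, lateral)]


def top_down(c2, c3, c4, p5):
    """Build P4, P3 and P2 in turn and return only P2, the single output map."""
    p4 = merge(p5, c4)
    p3 = merge(p4, c3)
    return merge(p3, c2)


def const_map(size, value):
    return [[value] * size for _ in range(size)]


# Toy maps whose side length halves at every stage (8, 4, 2, 1).
p2 = top_down(const_map(8, 1.0), const_map(4, 10.0),
              const_map(2, 100.0), const_map(1, 1000.0))
print(len(p2), len(p2[0]), p2[0][0])  # 8 8 1111.0
```

Every cell of the toy P2 carries a contribution from all four stages, which is the point of the multiple-input structure: the finest map keeps its own detail and also receives the context of the coarser maps.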
The dilated encoder in the top-down pathway contains two main components: the projector and the residual blocks, as shown in Figure 4. The projection layer first applies a 1 × 1 convolution to reduce the channel dimensions, and utilizes a 3 × 3 convolution to refine semantic contexts. Then, the output of the projector is fed into the residual blocks consisting of 1 × 1 convolutions and 3 × 3 dilated convolutions with different dilation rates to obtain the multiple receptive fields. Specifically, in each residual block, the channel dimension of the projector output is reduced via a 1 × 1 convolution, and a 3 × 3 dilated convolution is applied to improve the receptive field of the features; then, a 1 × 1 convolution is adopted to increase the channel dimensions. And then, the four successive dilated residual blocks are stacked to generate output features with multiple receptive fields.
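To see how the stacked dilated blocks enlarge the receptive field, the growth can be computed directly: at stride 1, a k × k convolution with dilation d adds (k − 1)·d to the receptive field, and 1 × 1 convolutions add nothing. The dilation rates below (2, 4, 6, 8) are an assumption chosen for illustration; the text only states that the four blocks use different rates.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in cells of the input map) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf


projector = [1]               # the single 3 x 3 convolution of the projector
dilated_blocks = [2, 4, 6, 8]  # assumed dilation rates of the four residual blocks
plain_blocks = [1, 1, 1, 1]    # the same depth without dilation, for comparison

print(receptive_field(projector))                   # 3
print(receptive_field(projector + plain_blocks))    # 11
print(receptive_field(projector + dilated_blocks))  # 43
```

Because each block is residual, the output mixes paths that skip some blocks, so under these assumptions the resulting feature covers several receptive-field sizes up to 43 × 43 cells of C5 rather than a single one.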
Meanwhile, to reduce the calculation cost of the network, we retain the output structure of the feature-enhancement layer of the original CenterNet model and output only Feature P2 for the classification and regression of the detection head; then, the non-maximal suppression and other postprocessing steps are eliminated to reduce the calculation cost of the model.
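One way to obtain boxes without non-maximal suppression is to keep only the heatmap cells that are local maxima and decode each of them with the offset and size predictions. The sketch below is a plain-Python illustration of that idea, not the paper's implementation: the 3 × 3 neighbourhood, the score threshold, the stride of 4 and the convention that sizes are expressed in feature-map cells are all assumptions.

```python
def find_peaks(heat, threshold=0.3):
    """Return (y, x, score) for cells that are the maximum of their 3x3 neighbourhood."""
    h, w = len(heat), len(heat[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            score = heat[y][x]
            if score < threshold:
                continue
            neighbours = [heat[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
            if score >= max(neighbours):
                peaks.append((y, x, score))
    return peaks


def decode(peaks, offset, size, stride=4):
    """Convert peaks to (x1, y1, x2, y2, score) boxes in input-image pixels."""
    boxes = []
    for y, x, score in peaks:
        dx, dy = offset[y][x]
        bw, bh = size[y][x]
        cx, cy = (x + dx) * stride, (y + dy) * stride
        half_w, half_h = bw * stride / 2, bh * stride / 2
        boxes.append((cx - half_w, cy - half_h, cx + half_w, cy + half_h, score))
    return boxes


# Toy 6 x 6 single-class heatmap with one strong response and a weaker neighbour.
heat = [[0.0] * 6 for _ in range(6)]
heat[2][2], heat[2][1] = 0.9, 0.5
offset = [[(0.0, 0.0)] * 6 for _ in range(6)]
size = [[(0.0, 0.0)] * 6 for _ in range(6)]
offset[2][2] = (0.5, 0.25)
size[2][2] = (2.0, 3.0)

boxes = decode(find_peaks(heat), offset, size)
print(boxes)  # [(6.0, 3.0, 14.0, 15.0, 0.9)]
```

The weaker neighbouring response is discarded because it is not the maximum of its neighbourhood, so a single box is produced per object centre without comparing box overlaps.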

Embedding of attention mechanism
In the detection of weakly perceived objects, texture features play an important role. To strengthen the network's attention to the texture features of weakly sensed objects, we apply the channel attention mechanism of the SE module [28], shown in Figure 5, to the CenterNet model. In the SE block, a sigmoid function takes the embedding vector as input and produces a set of channel-wise weights. We multiply these weights with the input feature P2 to generate the attention-weighted feature P as the output of the SE block, which guides the network to focus on key channels; the overall structure of the improved CenterNet is shown in Figure 6. Finally, the attention-weighted feature P is input to the detection head for classification and regression to generate the refined object bounding boxes. The detection head includes three branches: a heatmap branch, an offset branch and a size branch. In the heatmap branch, key point heatmaps are generated by applying the Gaussian operation to the feature P, and the heatmaps are then used to predict the categories and center coordinates of objects. In the offset branch, the offsets of the center coordinates are estimated. In the size branch, the sizes of the object bounding boxes are regressed.
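The channel re-weighting performed by the SE block can be sketched in plain Python. The sketch follows the general squeeze-and-excitation design of [28] (global average pooling, two fully connected layers, sigmoid, channel-wise scaling); the weight matrices, the reduction ratio of 2 and the omission of biases are illustrative assumptions, not values from this paper.

```python
import math


def se_block(feature, w1, w2):
    """feature: C channels of 2D lists; w1: (C/r) x C weights; w2: C x (C/r) weights."""
    # Squeeze: one average value per channel (the embedding vector).
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
               for row in w2]
    # Scale: multiply every channel by its weight.
    out = [[[v * s for v in row] for row in ch] for ch, s in zip(feature, weights)]
    return out, weights


# Hypothetical 4-channel, 2 x 2 feature and hand-picked weights (reduction ratio 2).
feature = [[[float(c)] * 2 for _ in range(2)] for c in (1, 2, 3, 4)]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.5, 0.5, -0.5, 0.5]]
w2 = [[1.0, -1.0], [0.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]

out, weights = se_block(feature, w1, w2)
print([round(w, 3) for w in weights])
```

Each weight lies between 0 and 1, so channels judged informative are passed almost unchanged while the others are attenuated; the spatial layout of the feature is untouched.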

Loss function
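The improved CenterNet proposed in this work is trained with a total loss that consists of three parts: the key point loss (L_k), the offset loss (L_off) and the size loss (L_size), corresponding to the three branches of the detection head. The total loss is formulated as
$$L = L_k + L_{off} + L_{size} \quad (1)$$
The key point loss is computed on the key point heatmaps to learn the categories and center coordinates of objects, as follows:
$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases} \quad (2)$$
where $Y_{xyc}$ represents the ground-truth value at the (x, y) coordinates for category c on the key point heatmaps, which is generated by the Gaussian operation on the input feature of the heatmap branch, $\hat{Y}_{xyc}$ is the corresponding predicted value, N is the number of key points and α and β are the hyperparameters of the loss.
Further, to recover the bias caused by the downsampling of the convolution operation, we utilize the offset loss to learn the offsets of the center coordinates; it is constructed with the L1 loss, as follows:
$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R}-\tilde{p}\right)\right| \quad (3)$$
where p is the center in the original input image of the network, $\tilde{p}$ is the center on the scaled-down feature map produced by the convolution operation, R is the stride of the convolution operation and $\hat{O}_{\tilde{p}}$ is the predicted offset; the real bias is then $\frac{p}{R}-\tilde{p}$.
In addition, for the k-th object, its bounding box is denoted as $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$, so that its size is $s_k = (x_2^{(k)}-x_1^{(k)},\; y_2^{(k)}-y_1^{(k)})$, and the size loss is
$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right| \quad (4)$$
where $\hat{S}_{p_k}$ is the predicted regression value.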
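The key point and offset terms can be made concrete with a small plain-Python sketch. It assumes the penalty-reduced focal loss of the original CenterNet [24] for the key point term, with α = 2 and β = 4 as illustrative hyperparameters, and a stride of R = 4; none of these values is stated in this section, and the function names are hypothetical.

```python
import math


def keypoint_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """pred, gt: flat lists of heatmap values in [0, 1]; gt == 1 marks an object centre."""
    loss, n_pos = 0.0, 0
    for p, y in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        if y == 1.0:
            loss -= (1.0 - p) ** alpha * math.log(p)
            n_pos += 1
        else:
            # Cells near a centre (y close to 1) are penalised less.
            loss -= (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(n_pos, 1)


def offset_target(center, stride=4):
    """Real bias p / R - floor(p / R) lost when the centre is mapped to the feature grid."""
    return tuple(c / stride - math.floor(c / stride) for c in center)


def l1_loss(pred, target):
    """Sum of absolute differences over all objects, divided by the number of objects."""
    n = max(len(pred), 1)
    return sum(abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t)) / n


gt = [1.0, 0.5, 0.0]            # centre, Gaussian shoulder, background
good = [0.9, 0.4, 0.05]
bad = [0.2, 0.9, 0.8]
print(keypoint_focal_loss(good, gt) < keypoint_focal_loss(bad, gt))  # True
print(offset_target((10, 9)))                                        # (0.5, 0.25)
print(l1_loss([(0.4, 0.25)], [offset_target((10, 9))]))
```

The same `l1_loss` helper applies to the size term if the pairs hold predicted and ground-truth widths and heights instead of offsets.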

Dataset
We evaluate the proposed weakly sensed object detection network on the KITTI dataset [35] and the COCO dataset [36]. KITTI is one of the most popular datasets in the autonomous driving field, and it provides a large number of images of complex environments, including urban areas, roads and rural scenes. The KITTI dataset contains a total of 7481 images, including 33,252 vehicles and 6340 pedestrians, in eight categories: Car, Van, Truck, Pedestrian, Person-sitting, Cyclist, Tram and Misc. Pedestrians are small and lack abundant feature information; thus, they are usually regarded as weakly sensed small objects. At the same time, many small vehicles are weakly sensed because of distance, truncation and occlusion, which makes them difficult to perceive and detect. Therefore, we formed two categories from the KITTI dataset, pedestrians (Pedestrian, Person-sitting and Cyclist) and vehicles (Car, Van, Truck and Tram), to evaluate the effectiveness of our improved CenterNet for weakly perceived objects. We divided the dataset into a training set, a validation set and a test set in the proportions of 0.8, 0.1 and 0.1, respectively. In contrast to the KITTI dataset, the MS COCO dataset has more samples, providing 330,000 images, 1.5 million object instances and 80 object categories. The images in the COCO dataset were mainly captured from complex daily scenes in real environments and contain numerous objects of various types; each image contains 3.5 object categories on average.
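The category grouping and the 0.8/0.1/0.1 split can be sketched as follows. The grouping mirrors the two categories described above; the random shuffle, the seed and the rounding rule (train and validation sizes rounded down, the remainder going to the test set) are assumptions, since the text gives only the proportions.

```python
import random

# KITTI label -> evaluation category; "Misc" is not assigned to either group.
GROUPS = {
    "Pedestrian": "pedestrian", "Person-sitting": "pedestrian", "Cyclist": "pedestrian",
    "Car": "vehicle", "Van": "vehicle", "Truck": "vehicle", "Tram": "vehicle",
}


def split_ids(ids, train_ratio=0.8, val_ratio=0.1, seed=0):
    """Shuffle the image ids and cut them into train / validation / test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    n_val = int(len(ids) * val_ratio)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train, val, test = split_ids(range(7481))
print(len(train), len(val), len(test))  # 5984 748 749
print(GROUPS.get("Cyclist"), GROUPS.get("Misc"))  # pedestrian None
```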
COCO divides objects into three scales when measuring the average precision (AP): large objects, with sizes greater than 96 × 96 pixels; medium objects, with sizes from 32 × 32 pixels to 96 × 96 pixels; and small objects, with sizes less than 32 × 32 pixels. Large objects account for 24%, medium objects for 34% and small objects for 41%; that is, almost half of the objects in the COCO dataset are small, and they lack sufficient information to be perceived. The MS COCO dataset, with its numerous small objects, is therefore well suited to evaluating the detection of weakly sensed objects. To clearly demonstrate the improvement of our proposed detector for weakly perceived objects, we evaluated it separately on the large objects (L), the medium objects (M) and the small objects (S) in the COCO dataset; we discuss its ability to detect weakly sensed objects based on the detection accuracy for the small objects (S).
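The three scale buckets reduce to two area thresholds, as the short sketch below shows. It uses the bounding-box area as the object size and assigns the boundary values to the medium bucket; both choices are simplifying assumptions made for illustration.

```python
SMALL_MAX = 32 * 32   # areas below this are small (S)
MEDIUM_MAX = 96 * 96  # areas above this are large (L)


def coco_scale(width, height):
    """Return 'S', 'M' or 'L' for an object of the given size in pixels."""
    area = width * height
    if area < SMALL_MAX:
        return "S"
    if area <= MEDIUM_MAX:
        return "M"
    return "L"


print(coco_scale(20, 20), coco_scale(50, 50), coco_scale(120, 100))  # S M L
```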

Training details and evaluation metrics
We trained our proposed model in an end-to-end manner with the Adam optimizer; the pretraining weights of ResNet50 and ResNext50 were obtained from ImageNet.
The overall training process consists of a frozen stage and an unfrozen stage. During the first 50 epochs (the frozen stage), the parameters of the backbone are frozen and not updated, while the parameters of the other parts of the network are updated. In this stage, for the KITTI dataset, we trained the network with a batch size of 8 and a learning rate of 0.001 on a GTX 1660 Ti GPU; for the COCO dataset, the network was trained with a batch size of 16 and a learning rate of 0.001 on two RTX 2080 Ti GPUs. After 50 epochs (the unfrozen stage), the parameters of the whole network were updated. In this stage, the network was trained with a batch size of 4 and a learning rate of 0.0001 for the KITTI dataset, and with a batch size of 8 and a learning rate of 0.0001 for the COCO dataset.
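The two-stage schedule can be written down as a small lookup, which makes the per-dataset settings easy to compare. The values are those given above; the dictionary layout, the function name and the 0-based epoch convention are hypothetical.

```python
FREEZE_EPOCHS = 50

SCHEDULE = {
    "KITTI": {"frozen": {"batch_size": 8, "lr": 1e-3},
              "unfrozen": {"batch_size": 4, "lr": 1e-4}},
    "COCO": {"frozen": {"batch_size": 16, "lr": 1e-3},
             "unfrozen": {"batch_size": 8, "lr": 1e-4}},
}


def stage_config(dataset, epoch):
    """Settings for a 0-based epoch; the backbone is trainable only after the frozen stage."""
    stage = "frozen" if epoch < FREEZE_EPOCHS else "unfrozen"
    config = dict(SCHEDULE[dataset][stage])
    config["train_backbone"] = stage == "unfrozen"
    return config


print(stage_config("KITTI", 0))
print(stage_config("COCO", 50))
```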
To illustrate the effectiveness of our proposed detection network, we adopted the AP and mAP as evaluation metrics. The AP is the average precision of a single category and measures the model performance on that category; the mAP is the average of the AP values over multiple categories and measures the performance of the model over all categories. The AP is defined as
$$AP = \int_{0}^{1} u(v)\,dv$$
where u denotes the precision and v denotes the recall rate.
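A minimal numerical version of these metrics is sketched below. It approximates the area under the precision-recall curve with a step rule over recall points sorted in ascending order and does not apply the precision interpolation used by some benchmarks; that simplification, the function names and the toy numbers are assumptions and are unrelated to the results reported in this paper.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve by a step rule (recalls ascending)."""
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_category):
    """Average of the per-category AP values."""
    return sum(ap_per_category.values()) / len(ap_per_category)


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap)  # 0.75
print(mean_average_precision({"vehicle": ap, "pedestrian": 0.25}))  # 0.5
```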

Evaluation and analysis
To select appropriate components for the improved CenterNet, we conducted a series of experiments on the KITTI dataset; we then trained the proposed network and evaluated it on the KITTI and COCO datasets.
Backbone. To select a suitable backbone network, we compared the performance of CenterNet with ResNext50 and with ResNet50. As shown in Table 1, compared with ResNet50, the mAP of the CenterNet with ResNext50 increased by 1.81%, with gains of 0.96% for vehicles and 2.66% for the weakly perceived small objects of pedestrians. Since the CenterNet with ResNext50 performed better than the CenterNet with ResNet50, we selected ResNext50 as the backbone for further experiments.
Feature-enhancement layer. Based on the ResNext50 backbone, we compared the improved feature-enhancement layer with the original structure of the CenterNet. The experimental results in Table 2 indicate that the improved structure achieved an mAP gain of 2.26%; the AP of vehicles increased by 0.59% and the AP of the weakly perceived small objects of pedestrians increased by 3.94%. The detection accuracy for weakly perceived small pedestrians was therefore significantly improved by the improved feature-enhancement layer.
Attention module. According to the results in Table 2, we selected ResNext50 with the improved feature-enhancement layer as the baseline and appended various attention modules to it for comparative experiments. Specifically, we added the SE module, the convolutional block attention module (CBAM) and the efficient channel attention (ECA) module to the detection head of the baseline, in turn, to verify the effect of the attention mechanism on the model. As displayed in Table 3, the AP of the weakly perceived small objects of pedestrians for the CenterNet with the SE module ranked first among all of the models. Moreover, compared with the baseline model, the mAP of the CenterNet model with the SE module increased by 1.30%; the AP of the weakly perceived pedestrians increased by 2.7%, but the AP of vehicles dropped slightly. According to the above discussion, our improved CenterNet consists of the ResNext50 backbone, the improved feature-enhancement layer and the detection head with the SE module.
Pedestrians are usually small and possess too little information to be perceived, which makes them hard to detect; a series of detectors with multi-scale structures has been proposed to strengthen the features of small weakly sensed objects, such as FCOS, YOLOv4, YOLOv3, YOLOX and SSD. Therefore, to verify the effect of our proposed detector on the weakly sensed pedestrians of the KITTI dataset, we compared our improved CenterNet with different state-of-the-art multi-scale detectors: the anchor-free detectors FCOS and YOLOX; the representative anchor-based detectors SSD, YOLOv3 and YOLOv4; and the latest anchor-based algorithms YOLOv3-SPP, YOLOv3-SPP-ASFF, YOLOv3-SPP-ASFF-SE and ResNext-SSD. The results are shown in Table 4.
Table 4 shows that our improved CenterNet improved the mAP by 5.37% relative to the basic CenterNet; the AP gain for vehicles was 1.46%, and the AP for weakly sensed pedestrians increased remarkably, by 9.3%. In addition, compared with previous state-of-the-art multi-scale detectors, our proposed model obtained the highest mAP over vehicles and pedestrians, and its AP for weakly perceived pedestrians was higher than that of the other detectors, by at least 0.9%, with the exception of YOLOv4. Although our AP for pedestrians was 0.4% lower than that of YOLOv4, our mAP was 2.22% higher.
In summary, our proposed model appears to achieve excellent performance for the detection of weakly perceived small pedestrians in the KITTI dataset, which has numerous small pedestrians and certain weakly sensed vehicles of small sizes.
To further validate the effectiveness of our detector in detecting weakly perceived small objects, we evaluated our method on the objects of the three scales of large (L), medium (M) and small (S) in the COCO dataset, respectively; the experimental results are shown in Table 5. Meanwhile, Table 5 shows the detection accuracy of other state-of-the-art detectors on the three scales of objects. As shown in Table 5, the AP value of our proposed improved CenterNet ranked first among all of the methods, leading to a 3.2% AP gain over the basic CenterNet. Meanwhile, for the detection of small objects, our proposed detector achieved a 7.4% AP_S gain over the standard CenterNet, and at least 1% AP_S improvement over other detectors, which demonstrates that our detectors can effectively promote the detection performance for weakly sensed objects of small size. Furthermore, for the detection of the objects with large scales and medium scales, the AP_M and AP_L of our framework have slight improvements over the basic CenterNet, achieving 1.1 and 1.2% gains, respectively. We deem that the added attention module in the detection head and the proposed improved feature-enhancement layer allow weakly sensed small objects with insufficient information to yield abundant knowledge and attract more attention for detection. By contrast, the large-scale and the medium-scale objects already have rich features; hence, our improvements on CenterNet yield little effect on these objects. Overall, the proposed detection framework in this paper effectively improves the detection precision and achieves great detection performance for weakly sensed small objects from the COCO dataset.

Ablation study
Based on the above experimental results, we designed ablation experiments on the KITTI dataset to verify the effectiveness of the various components of the improved CenterNet. In the ablation experiments, the backbone network, the feature-enhancement layer and the attention module in the detection head of the improved CenterNet were analyzed; the experimental results are shown in Table 6, where "+" indicates the model with the improved feature-enhancement layer and "√" represents the model using the SE attention module in the detection head. Table 6 shows that the improved feature-enhancement layer improves the performance of the original CenterNet model more than the changes to the backbone network and the SE module do. Specifically, with the ResNet50 and ResNext50 backbones, the mAP of the CenterNet model with the improved feature-enhancement layer increased by 3.86 and 4.07%, respectively, whereas the mAP of the CenterNet model with the SE module increased by only 0.49 and 2.24%, respectively. With both the improved feature-enhancement layer and the SE module in the detection head, the improved model achieved mAP gains of 4.65 and 5.37% for the ResNet50 and ResNext50 backbones, respectively. Meanwhile, the results in Table 6 show that the improved models with ResNext50 performed better than those with ResNet50.
Table 6. Results of ablation study using the KITTI dataset ("√": adding the attention module). The evaluation metric is the AP, with an IoU threshold of 0.5 for pedestrians and vehicles; the maximum AP value or mAP value in each column is bolded.
Subsequently, as shown in Figure 7, we compared the training losses of the models listed in Table 6; the loss of each improved model in Cases 1 to 7 was lower than that of the original model, and the losses of the models with the improved feature-enhancement layer (Cases 1, 3, 5 and 7) decreased faster than those of the other models. This indicates that the improved feature-enhancement layer effectively accelerates the convergence of the CenterNet model. The model in Case 7 reached the lowest training loss, which indicates its strong learning ability.
Figure 8. a) PR curve of vehicle; b) PR curve of pedestrian.
In Figure 8, we present the precision-recall (PR) curves of the models listed in Table 6; the PR curves for vehicles are displayed in Figure 8a), and the PR curves for pedestrians are shown in Figure 8b). All of the PR curves indicate that the precision and recall values of the models in Cases 1 to 7 of Table 6 are significantly higher than those of the original CenterNet. Among all of the cases in Table 6, the model of Case 7, with the improved feature-enhancement layer and the SE block, had the largest area under the PR curve and achieved the best performance.

Qualitative results
Some qualitative results on the KITTI and COCO datasets, obtained with our proposed detector and with the standard CenterNet for weakly perceived objects, are displayed in Figure 9. As shown in Figure 9a), our detector detected the pedestrians that appear small because of their distance in the complex environment, whereas the original CenterNet did not detect them. The poorly illuminated pedestrian in the shadow on the left of Figure 9b) was ignored by the original model but detected by our detector. In Figure 9c), our detector accurately detected the occluded vehicle, which was missed by the original detector. Moreover, for the COCO dataset, the original model ignored the occluded persons in Figure 9d), whereas our detector detected them. In Figure 9e) and f), our detector successfully located objects that are small because of occlusion or distance, which the basic CenterNet failed to detect.
In summary, our proposed detector given as the improved CenterNet can detect weakly perceived pedestrians and other weakly perceived objects with small sizes caused by occlusion, truncation and distance.

Conclusions
To address the problem of missed detections of weakly perceived small objects in complex environments, we have proposed an improved CenterNet based on the anchor-free mechanism. First, ResNext50 was adopted as the backbone network instead of ResNet50, as it improves the feature extraction. Second, the feature-enhancement layer was improved by combining the FPN structure and the dilated convolution module, which strengthens the semantic information and enlarges the receptive fields for weakly sensed objects. Finally, an attention module was appended to the detection head to enhance the key information of weakly sensed small objects. The experimental results show that the improved model raises the detection precision and accelerates the convergence of the original model, achieving a good effect on the detection of weakly sensed objects.