1. Introduction
Object detection plays a crucial role in robotics. For instance, in the context of household service robots, achieving an accurate and reliable grasp of objects requires the robot to acquire the precise locations of objects [1]. Object detection can also be used in the field of industrial robots to assist with tasks such as item sorting, component assembly, and work area confirmation [2]. Over the years, numerous studies have focused on creating accurate and fast detectors to meet the needs of robotics and other domains. Enhancing the interpretability and accuracy of detectors by optimizing their structure, as well as improving their performance in detecting and segmenting small objects, remain critical and challenging issues that current algorithms strive to solve.
Object detectors generally fall into two categories, namely, two-stage object detectors and one-stage object detectors. Standard two-stage object detectors locate all possible object positions by maximizing recall in the first stage and then classify the objects at these positions according to their likelihood scores in the second stage. The optimization objectives of the two stages are distinct, which results in a lack of probabilistic interpretation and structural redundancy in standard two-stage object detectors. One-stage detectors maximize the likelihood of annotated ground-truth objects during training and rely on the likelihood scores as the basis for inference. They form a probabilistically sound framework, but insufficient accuracy may arise from the impact of imbalanced positive and negative samples. CenterNet2 [3] modified the structure of the standard two-stage detector and developed a probabilistic two-stage detection framework by maximizing a lower bound on a joint probabilistic objective across both stages. However, CenterNet2 still has limitations. For example, the localization quality score and the classification score are trained separately but are used together during inference in the first stage; this inconsistency between training and prediction leads to insufficient interpretability and low efficiency of the model. The positive-sample selection approach during training is relatively simple, which can result in lower-quality proposal boxes from the first stage, ultimately affecting the performance of the model. Generalized focal loss (GFL) [4] and adaptive training sample selection (ATSS) [5] have addressed the aforementioned issues to some extent, but they still lack strong prior guidance during training and inference, which can result in an incomplete probabilistic interpretation of the model and relatively weak stability. In summary, a detector with a complete probabilistic interpretation and a compact structure is the current focus of research.
The precise detection of small targets is another important issue in the field of object detection, and numerous works have aimed to solve it. The feature pyramid network (FPN) [6] is the pioneer among them; it has been widely adopted due to its capability of improving detection accuracy for small targets and enhancing adaptability to multi-scale objects. Path aggregation network (PANet) [7], NAS-FPN [8], and other studies [9,10,11] have furthered the progress of network architectures for cross-scale feature integration. How to effectively integrate features from different layers, explore the correlations between them, and preserve and restore image details is the current research focus. Attention mechanisms have emerged as another means of mining and preserving detailed information in recent years [12]. They enhance the accuracy and efficiency of a neural network by weighting the input data and highlighting its important parts. Recent research [13] implies that there are interdependent relationships among pixels, and these dependencies are not limited to adjacent pixels: pixels that are far apart also have interdependencies. For example, in an image of a cat, the shape of the tail may depend on the position of the ears, even if they are far apart. Another example is the relationship between the background color and the color of a foreground object, which can impact the overall visual coherence of the image. Leveraging this type of long-range dependency has the potential to enhance performance. However, such methods require a significant amount of computational resources, and methods that rely exclusively on convolutions demonstrate limited capability in capturing long-range dependencies. Only a minority of approaches have endeavored to exploit features across varying levels to capture long-range dependencies, and most of them still struggle to adequately address the computational burden involved [14]. Therefore, balancing the demands of enhancing algorithms’ ability to integrate and extract detailed features, improving their capacity for detecting and segmenting small targets, and ensuring computational efficiency is a major challenge in current research.
In this paper, we propose a probabilistic two-stage detector that has a sound probabilistic interpretation and a compact structure, enabling accurate object detection. With the addition of a simple segmentation head, our detector further achieves precise instance segmentation. Notably, our detector exhibits strong control over fine details, demonstrating exceptional performance in object detection.
Specifically, we first introduced a robust single-stage object detector as a replacement for the region proposal network (RPN) in standard two-stage detectors. We trained both stages simultaneously to maximize the likelihood of ground-truth objects, which is then used as the detection score during inference. Secondly, we enhanced the method of ground-truth matching and improved the first-stage proposal generator by coupling the classification branch with the box generation branch and incorporating a better prior for the box regression branch. This resulted in a more stable first stage and a more comprehensive probabilistic interpretation. Thirdly, we proposed an effective pyramid non-local attention (PNA) module: we incorporated the non-local attention mechanism into the FPN to capture non-local dependencies across multiple levels, and embedded a pyramid sampling module into every non-local block, which significantly reduces computational overhead while preserving semantic features. Finally, we made minor modifications to BiFPN, resulting in improved accuracy. Our main contributions can be summarized as follows:
1. We built a probabilistic two-stage detector that achieves higher accuracy with a more reasonable probability interpretation.
2. We proposed a strong proposal generator by coupling different branches and providing a prior for box regression. This makes the first stage more stable and interpretable, thus improving the overall accuracy of the network at almost no cost.
3. We proposed a pyramid non-local attention (PNA) module, which enhances the network’s ability to extract detailed features, significantly improving its detection capabilities, especially for small objects.
The rest of this paper is organized as follows. In Section 2, we summarize relevant work. In Section 3, we elaborate on the structure of the object detector, including the design of the strong proposal generator and the PNA module in detail. Section 4 presents the experimental results. Finally, we present our conclusions and outline prospective research directions.
2. Related Works
Object detectors: Two-stage detectors, such as the regions with CNN features (RCNN) series [15,16,17], employ an RPN to generate coarse object proposals, followed by a specialized head that refines and classifies each region. Cascade RCNN [18] improved localization accuracy by repeating the detection head of Faster RCNN multiple times, each time utilizing different threshold values. To further improve the feature flow between stages in Cascade RCNN, hybrid task cascade (HTC) [19] incorporated extra annotations for both instance and semantic segmentation. Mask RCNN [20] is an extension of Faster RCNN that includes an instance segmentation branch for generating precise masks of the objects. Task-aware spatial disentanglement (TSD) [21] separated the localization and classification branches for each region of interest (ROI). Libra RCNN [22] and the gradient harmonizing mechanism (GHM RCNN) [23] proposed new loss functions, optimizing the performance of detectors across different scales, difficulty levels, and object categories. Ammar et al. [24] enhanced model accuracy by exploiting temporally redundant information. Two-stage object detectors still achieve high accuracy nowadays, but their efficiency is low due to weak proposal generators that produce numerous but low-quality proposals [3]. In addition, the optimization objectives of the two stages differ, and there are discrepancies between training and evaluation metrics, resulting in a significant degradation of overall detector performance.
One-stage detectors, such as the you-only-look-once (YOLO) series [25,26,27,28,29,30], simultaneously predict the object’s location and its class. The YOLO series of detectors utilize a grid-based approach for class prediction and bounding box regression. Betti and Tucci [31] optimized the parameters of YOLO, further reducing the computational cost. The fully convolutional one-stage object detector (FCOS) [32] and CenterNet [33] abandoned the use of numerous anchors per pixel and determine foreground/background by location. ATSS [5] and probabilistic anchor assignment (PAA) [34], which are derived from FCOS, revised the definition of foreground and background to make the allocation of positive and negative samples more reasonable. GFL [4] provided a weighted representation of category ground-truth values and takes into account the uncertainty of bounding boxes under occlusion, further increasing the interpretability of the algorithm. CornerNet [35] detected the two diagonal corners of an object; ExtremeNet [36] detected four extreme points of an object and used an additional center point to group them. RepPoints [37] and Dense RepPoints [38] utilized a set of points to represent the boundaries of bounding boxes, and the features of these points were employed to classify the objects. This type of detector often has a comprehensive probabilistic interpretation, but still lacks accuracy. For example, under the same training conditions, Faster RCNN outperforms the single shot multibox detector (SSD) by five points on the COCO dataset, and Cascade RCNN outperforms RetinaNet by 3.7 points.
In recent years, there has been strong research interest in vision transformers. The vision transformer (ViT) [39] attempted to apply the standard Transformer structure directly to images by splitting the entire image into small patches and then using the linear embedding sequence of these patches as the input to the Transformer network for training. Data-efficient image transformers (DeiT) [40] improved the training strategy of ViT, reducing the computational resources required during training. The detection transformer (DETR) [41] replaced traditional object detection components such as the RPN and ROI pooling with Transformer networks, greatly simplifying the object detection pipeline. Deformable DETR [42] added deformable convolution modules to DETR to adapt to changes in object shape and size. Sparse RCNN [43] used sparse attention mechanisms to compute only the regions relevant to the object. DETR with improved denoising anchor boxes (DINO) [44] achieved feature extraction and classification by using a self-attention mechanism. The use of attention mechanisms and Transformers can greatly improve the performance of an algorithm, but it also requires a large amount of computing power; balancing accuracy and computational cost is the current focus of research.
Feature pyramid: The utilization of a feature pyramid can enhance the network’s resolution, improving the detection accuracy of small objects. One of the primary challenges is to efficiently encode and handle features across multiple scales. FPN [6] proposed a top-down feature fusion structure, which greatly improves the performance of the network. Following the idea of FPN, PAN [7] added a bottom-up feature aggregation path on top of FPN, allowing for more comprehensive feature fusion. Han et al. [45] combined super-resolution with YOLOv5 to achieve improved accuracy in safety helmet detection. The scale-transferrable detection network (STDN) [46] introduced a transfer module for extracting features from different scales, and SNIPER [47] added a weakly supervised mechanism on top of FPN; the addition of an attention mechanism enables the network to achieve higher accuracy under the same time complexity. M2Det [48] used a U-shaped module to process feature fusion at different scales. The gated feedback refinement network (G-FRNet) [49] introduced gate units to regulate the flow of information between features. NAS-FPN and NAS-FPN+ [50] can automatically search for the optimal network structure but require thousands of GPU hours during the training phase. BiFPN [51] utilized bidirectional feature fusion to merge feature maps of different levels, which balances speed and performance better than NAS-FPN. The ultimate goal of all the above methods is to fully explore valuable information from different levels and fuse it more comprehensively.
Attention mechanism: The attention mechanism plays an important role in human visual perception. In 2017, Vaswani et al. [12] introduced this mechanism into the field of machine learning, and it has been widely applied since. Wang et al. [52] proposed a network that incorporates an encoder and a decoder to implement attention mechanisms, while Hu et al. [53] leveraged a squeeze-and-excitation module to exploit the inter-channel relationships of the network. These approaches yielded a notable improvement in accuracy. Similarly, Chen et al. [54] utilized weight matrices to amplify salient features and suppress irrelevant ones, resulting in increased accuracy and sensitivity to small targets. Meanwhile, the convolutional block attention module (CBAM) [55] and DANet [56] combined spatial and channel attention. Despite their effectiveness in enhancing performance, all these methods were limited to a single scale.
Recent studies have also focused on making sufficient use of long-range dependencies. Wang et al. [13] proposed a non-local attention module in 2018, which was initially used for image denoising and later applied to image super-resolution in 2020 [57]. Zhang et al. [58] introduced a self-attention generative adversarial network, which uses non-local attention mechanisms to improve the details and texture of the image. Residual non-local attention networks (RNAN) [59] adopted a network structure based on residual blocks and introduced non-local attention modules to capture long-range dependencies in the image, achieving excellent performance in multiple image restoration tasks. Zhou et al. [60] used non-local attention mechanisms for multi-organ semantic segmentation in 2019, greatly improving the accuracy and robustness of the segmentation. Many studies have shown that non-local attention mechanisms can enhance a network’s ability to extract details, but there is still relatively little research on applying them to object detection and segmentation, and even fewer studies consider the combined use of non-local attention mechanisms and multi-scale information.
3. Materials and Methods
The architecture of our proposed object detector is shown in Figure 1. The input image is processed by a backbone network to extract features and then downsampled to generate five features of different scales. These features are fused through a repeated feature pyramid structure, which is based on the structure proposed in EfficientDet [51] but has been improved to further consider the importance of different channels. The resulting features are then passed through a PNA block, which will be detailed in later sections, to fuse global information across different scales, yielding the final five features of different scales.
Based on these features, we then use a robust proposal generator, which will also be detailed in later sections, to generate a series of proposals. The proposals generated by this module are fed into the cascade heads, which consist of three heads that use different thresholds for bounding box regression and filtering, to obtain the final results.
3.1. Probabilistic Two-Stage Detector Framework
Our probabilistically interpretable framework draws inspiration from CenterNet2 [3]. The aim of an object detector is to locate objects with bounding boxes and provide a class-specific likelihood score for them. Different detectors have similar methods for regressing the bounding boxes, with no fundamental difference among them; the core difference lies in how they handle the class likelihood.
One-stage object detectors directly predict the location of the object and its class likelihood. Let P(c_i = c) represent the probability that the ith candidate object belongs to the cth class (c ∈ C ∪ {bg}, where C represents the set of all annotated object classes and bg means the background class). Although different one-stage object detectors may have different definitions of object and background classes, their overall logic is the same: they maximize the likelihood during training and use the class probability to score boxes during inference. One-stage object detectors are a simple, clear, and probabilistically complete framework for object detection.
Two-stage object detectors try to explore as many potential object regions as possible in the first stage, then re-extract features of these regions in the second stage and determine their category. Let O_i = 1 denote that the ith potential object location contains an object, and let c_i = c mean that it belongs to the cth class (c ∈ C). The goal of the first stage is to maximize the recall of positions with O_i = 1; the goal of the second stage is to maximize the likelihood P(c_i = c | O_i = 1). During training, the two stages have different criteria for defining positive samples: the standard in the first stage is loose, while the standard in the second stage is strict. During inference, only the classification scores of the second stage are used. There is no reasonable probabilistic interpretation for the overall detector, for the two stages are disjointed and the training and inference stages are inconsistent.
For the two-stage object detector, a reasonable probability distribution should be Equation (1):

P(c_i = c) = Σ_{o ∈ {0,1}} P(c_i = c | O_i = o) P(O_i = o),

where P(c_i = c | O_i = 0) = 1 if c = bg and 0 otherwise. It is obvious that the places where O_i = 0 always lead to the background category. Therefore, the above formula can be further simplified as Equation (2):

P(c_i = c) = P(c_i = c | O_i = 1) P(O_i = 1)  for c ≠ bg,
P(c_i = bg) = P(bg | O_i = 1) P(O_i = 1) + P(O_i = 0).

We used maximum likelihood estimation to train our detectors in our framework. For annotated objects, our goal is to maximize the log-likelihood as in Equation (3):

log P(c_i = c) = log P(O_i = 1) + log P(c_i = c | O_i = 1).

The two terms in the above formula correspond exactly to the first and second stages of the detector, respectively. For the background, the maximum-likelihood goal should be Equation (4):

log P(c_i = bg) = log( P(bg | O_i = 1) P(O_i = 1) + 1 − P(O_i = 1) ).

However, this objective involves both stages and does not factorize, which in practice causes difficulties in the back-propagation of gradients. Using Jensen’s inequality as in Equation (5):

log( αx + (1 − α)y ) ≥ α log x + (1 − α) log y,

with α = P(O_i = 1), x = P(bg | O_i = 1), and y = 1, we can get Equation (6):

log P(c_i = bg) ≥ P(O_i = 1) log P(bg | O_i = 1).

It is a tight bound when P(O_i = 1) = 1 or P(bg | O_i = 1) = 1, and then we add another lower bound, tight when P(bg | O_i = 1) = 0, as in Equation (7):

log P(c_i = bg) ≥ log(1 − P(O_i = 1)).

The two boundaries mentioned above will be optimized together, so the actual optimization objective for the background class is Equation (8):

max [ P(O_i = 1) log P(bg | O_i = 1) + log(1 − P(O_i = 1)) ].
With Equations (2) and (8), our first-stage maximum-likelihood objective places positive labels at annotated objects and negative labels at all other locations. The first stage of our detector is only used to predict whether there is an object at location O_i, while the second stage further distinguishes the category to which the object belongs. The difference between our detector and traditional two-stage object detectors is that, during training, our definition of positive samples is the same for both stages, achieving true end-to-end training. During prediction, we use the scores from both stages to comprehensively evaluate the boxes. The objectives of both stages are maximum-likelihood estimation, which gives good consistency and a relatively complete probabilistic interpretation.
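The objectives above can be sketched numerically. The following is a minimal NumPy sketch (function names are ours, and probabilities are scalars for clarity) of the object log-likelihood in Equation (3) and the combined background surrogate in Equation (8), alongside the exact but non-factorizing objective in Equation (4):

```python
import numpy as np

def object_log_likelihood(p_obj, p_cls_given_obj):
    """Eq. (3): log P(c) = log P(O = 1) + log P(c | O = 1) for an annotated object."""
    return np.log(p_obj) + np.log(p_cls_given_obj)

def background_exact(p_obj, p_bg_given_obj):
    """Eq. (4): exact log-likelihood of the background class (does not factorize)."""
    return np.log(p_obj * p_bg_given_obj + (1.0 - p_obj))

def background_bound(p_obj, p_bg_given_obj):
    """Eq. (8): sum of the two Jensen-style lower bounds, used as the
    trainable surrogate for Eq. (4)."""
    return p_obj * np.log(p_bg_given_obj) + np.log(1.0 - p_obj)
```

Since each individual bound never exceeds the exact value and both are non-positive, their sum always lower-bounds Equation (4), so maximizing the surrogate pushes up the true background likelihood.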
3.2. Feature Pyramid
Our feature fusion section references EfficientDet [51] and makes some improvements. It aggregates features from different levels so that high-level feature maps contain geometric features from the bottom level, resulting in higher detector performance.
Similar to EfficientDet, our feature pyramid is composed of a single block repeated multiple times. The size of each feature map is half that of the previous one, and all feature maps have the same number of channels. In this paper, we use two forms of feature pyramid, three-layer and five-layer; the blocks that compose them are shown in Figure 2. For the five-layer feature pyramid, the features of the first three layers are taken from the backbone network, while the features of the last two layers are obtained by downsampling the third-layer feature twice; the blocks in Figure 2 are repeated three times. For the three-layer feature pyramid, all the features are taken from the backbone network, and the blocks are repeated four times.
In terms of feature fusion, we take a block of the five-layer feature pyramid as an example. Intermediate features produced in the middle of the fusion process are combined with the input features to yield the fused output features, as described in Equation (9); a batch normalization module and an activation module follow each convolution. No convolution changes the size of the feature map, and the number of channels in all feature maps is the same.
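Since the exact form of Equation (9) did not survive extraction here, the following is only a sketch of the weighted fusion step in the BiFPN style that this section builds on; the convolution, batch normalization, and activation that follow each fusion are abstracted away, and the fast-normalization scheme is an assumption:

```python
import numpy as np

def weighted_fusion(inputs, weights, eps=1e-4):
    """Fuse equally-shaped feature maps with non-negative normalized weights,
    as in BiFPN-style fast normalized fusion. A conv + BN + activation would
    follow this step in the actual pyramid block."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # clamp weights to >= 0
    w = w / (w.sum() + eps)                                # normalize to sum ~1
    return sum(wi * x for wi, x in zip(w, inputs))
```

With equal weights, the fusion reduces to a simple average of the incoming feature maps, which is the degenerate case before the weights are learned.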
As shown in Figure 1, we add a channel attention module to the feature pyramid, because the importance of the information contained in different feature layers differs. By leveraging the significance of inter-channel maps, we can enhance the feature representation of specific semantics, thereby improving the detector’s ability to accurately predict the category of small objects. The channel attention mechanism used in this paper is shown in Figure 3.
We apply the input to a max pooling layer and an average pooling layer separately, with the pooling operation performed along both the width and height axes, resulting in the extraction of features X and Y; we then sum them up. We use convolution layers instead of fully connected layers to embed the features, thus reducing the computational cost. After two rounds of convolution, we obtain the feature W, which represents the importance of each channel. For normalization, we divide all elements in W by the maximum value of W instead of using a sigmoid, which also reduces computational complexity. To clarify, channel attention is not applied to every repeated FPN block but only appears in specific blocks, to balance accuracy and time: for the five-layer feature pyramid, this module only appears in the second block; for the three-layer feature pyramid, it appears in the second and fourth blocks.
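The channel attention step can be sketched as follows (a minimal NumPy sketch; the 1 × 1 convolutions are modeled as plain matrices `w1`/`w2`, which are our placeholder names, and the max-normalization follows the description above):

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """feat: (C, H, W). Max-pool and average-pool over the spatial axes,
    sum the two descriptors, embed them with two 1x1 convolutions
    (matrices here, with a ReLU between), then divide by the maximum
    instead of applying a sigmoid."""
    x = feat.max(axis=(1, 2))            # X: global max pooling -> (C,)
    y = feat.mean(axis=(1, 2))           # Y: global average pooling -> (C,)
    z = x + y                            # summed channel descriptor
    w = w2 @ np.maximum(w1 @ z, 0.0)     # two embedding convolutions
    w = w / w.max()                      # max-normalization (cheaper than sigmoid)
    return feat * w[:, None, None]       # reweight each channel
```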
3.3. PNA Module
The pyramid non-local attention (PNA) module is the core module of our method, which effectively utilizes the multi-scale and multi-level features generated by the feature pyramid, and establishes dependencies between different locations based on this.
Firstly, let us revisit the definition of the non-local attention block, as shown in Figure 4. The input feature map X ∈ R^{c×h×w} goes through three 1 × 1 convolutional layers, respectively, to obtain three embeddings, namely, θ(X), φ(X), and g(X), each with c′ channels, where c′ means the channel number after convolution. Then, the three embeddings are flattened to get θ′, φ′, and g′, whose sizes are c′ × hw. The similarity matrix M ∈ R^{hw×hw} is calculated as Equation (10):

M = softmax( θ′ᵀ φ′ ).

Finally, we can get the output Y as Equation (11):

Y = W_z( g′ Mᵀ ) + X,

where the convolution operation W_z is used to adjust the importance of the non-local operation and restore the channel number of the feature map to c.
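A minimal sketch of this block, with the 1 × 1 convolutions modeled as channel-mixing matrices (the names `w_theta`, `w_phi`, `w_g`, and `w_out` are ours), might look like:

```python
import numpy as np

def softmax(m, axis=-1):
    e = np.exp(m - m.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """x: (c, h, w). Embed, flatten, build the (hw x hw) similarity matrix,
    attend over the values, restore the channel count, add the residual."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    theta, phi, g = w_theta @ flat, w_phi @ flat, w_g @ flat   # (c', hw) each
    m = softmax(theta.T @ phi, axis=-1)                        # similarity, (hw, hw)
    y = g @ m.T                                                # attended values, (c', hw)
    return (w_out @ y).reshape(c, h, w) + x                    # back to (c, h, w)
```

With a zero output projection, only the residual path survives, which is the usual initialization trick for inserting such blocks into a pretrained network.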
From a spatial perspective, the essence of the non-local attention mechanism is to establish connections between different pixels and regions, as shown in Figure 5a. The output before performing the convolution and resize operations is denoted as Ŷ; for a single location y_j in Ŷ, when we choose sigmoid σ as the normalization method, its relationship with the input X is as Equation (12), where x_i means the ith location in the input X:

y_j = Σ_i σ( θ(x_j)ᵀ φ(x_i) ) g(x_i).
The response y_j can incorporate information from all features. However, images of different scales contain varying types of information. For example, reducing the size of an image can filter out some noise and provide purer information. Although the aforementioned operation is effective in capturing long-range correlations, it only extracts information at a single scale. To break this scale constraint, Mei et al. [14] proposed scale-agnostic attention, as shown in Figure 5b, which computes the affinities between a target feature and regions to capture correlations across scales. Let x^s be the feature map obtained by down-sampling x by a factor of s. Then, x_j^s can be viewed as the region descriptor of the s × s neighborhood centred at index j on input x. The improved formula is as Equation (13):

y_j = Σ_s Σ_i σ( θ(x_j)ᵀ φ(x_i^s) ) g(x_i^s).
However, the information that can be obtained only by scaling the image is limited. Inspired by this method, as shown in Figure 5c, we fuse scale-agnostic attention with the feature pyramid to achieve a cross-scale non-local attention mechanism. Compared with scaling operations, a feature pyramid can better fuse neighborhood features, extract more abstract and advanced information, and filter out useless noise. The representation of our method is similar to scale-agnostic attention, as in Equation (14), where F represents the different feature maps and f_i represents the features corresponding to F:

y_j = Σ_F Σ_i σ( θ(x_j)ᵀ φ(f_i) ) g(f_i).
Our detector uses up to five layers of the feature pyramid; due to the high computational cost of the non-local attention mechanism, directly computing attention for every point in every feature map would be prohibitively expensive. Looking back at the process of the non-local attention mechanism, we can see that Equations (10) and (11) are the main causes of the high computational cost, as both involve the multiplication of two large matrices. The changes in matrix sizes are as Equation (15):

(hw × c′) · (c′ × hw) → (hw × hw),   (c′ × hw) · (hw × hw) → (c′ × hw).

It can be noticed that the highlighted parts (the hw dimensions contributed by φ′ and g′, i.e., the number of key/value positions) do not affect the size of the output Ŷ; therefore, if we adopt some methods to compress these dimensions, the computational cost can be greatly reduced.
In our method, we use a spatial pyramid pooling (SPP) [61] module, as shown in Figure 6, to compress these dimensions. For the non-local attention mechanism on a single feature layer, we first pass φ(X) and g(X) through four pooling layers to obtain four feature maps of different sizes (1∗1, 3∗3, 6∗6, and 8∗8). Then, we flatten and concatenate them to obtain pooled key and value features of size c′ × n, where n = 1 + 9 + 36 + 64 = 110. This greatly reduces the computational cost. Of course, this does not affect the computational effect, because it is essentially the same as scale-agnostic attention; only the value of s in the s × s neighborhood has changed.
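The pooling-based compression can be sketched as follows; `adaptive_avg_pool` is our stand-in for the pooling layers, and the output length of 110 follows directly from the four pooled sizes:

```python
import numpy as np

def adaptive_avg_pool(x, s):
    """Average-pool a (c, h, w) map down to (c, s, s)."""
    c, h, w = x.shape
    hb = np.linspace(0, h, s + 1).astype(int)   # bin edges along height
    wb = np.linspace(0, w, s + 1).astype(int)   # bin edges along width
    out = np.empty((c, s, s))
    for i in range(s):
        for j in range(s):
            out[:, i, j] = x[:, hb[i]:hb[i + 1], wb[j]:wb[j + 1]].mean(axis=(1, 2))
    return out

def spp_compress(x, sizes=(1, 3, 6, 8)):
    """Flatten and concatenate the pooled maps: the key/value length becomes
    1 + 9 + 36 + 64 = 110, independent of the input resolution h*w."""
    c = x.shape[0]
    return np.concatenate([adaptive_avg_pool(x, s).reshape(c, -1) for s in sizes], axis=1)
```

With keys and values compressed from hw to 110 columns, the similarity matrix in Equation (10) shrinks from hw × hw to hw × 110.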
The structure of the entire PNA module is shown in Figure 7. Each feature map in the middle layers is fused with its two adjacent layers, while the top-layer feature is only fused with the layer directly below it. The bottom-layer feature is first upsampled once through bilinear interpolation and then undergoes the subsequent feature fusion. Take one feature map as an example: it enters the PNA module together with its two adjacent feature maps. These three features go through the embedding convolutions φ and g, respectively. Afterward, the resulting embeddings each pass through the spatial pyramid pooling (SPP) module, where each feature map first generates pooling results at four different scales. The pooling results of each map are then concatenated in order, yielding key and value features of size c′ × 110 per scale, and the features of the three scales are concatenated again to obtain the final key and value features of size c′ × 330. The query feature is computed in the same way as in the conventional non-local attention mechanism: the feature map first goes through a 1 × 1 convolutional layer θ and is then flattened. Obviously, the change in the shape of the M matrix does not affect the shape of the final result; although a single PNA module involves three scales at the same time, the value 330 is still far smaller than hw. If the SPP module were not used, our computational complexity would double.
3.4. Proposal Generator
The proposal generator in this paper integrates the advantages of various excellent algorithms. The structure of our proposal generator is shown in
Figure 1, where the generated feature maps at five scales are fed into the heatmap branch and bbox distribution branch, similar to the GFL [
4] algorithm. Considering the issue of blurry boundaries, we generate the distribution of the components related to the box and obtain the final box from the distribution. However, we do not directly generate the four quantities of
, but generate them based on the prior anchor boxes, making the network more stable. Subsequently, we encode the distribution of the box and couple it with the heatmap branch to correct the heatmap score. The difference between our proposal generator and the traditional RPN is that we generate fewer but higher-quality proposals and the generated proposals have scores, which plays a role in both training and prediction.
Firstly, for the generation of prior anchor boxes, we conduct k-means clustering on the bounding boxes in the training set to automatically find good priors instead of choosing priors by hand, which is similar to the YOLO [
27] series. We adopt the IOU between the prior anchor boxes and the ground truth boxes as the distance metric for clustering to eliminate the influence of box sizes on the error, as in (
16). Finally, we assign the automatically generated anchor boxes to different feature pyramids, with higher levels corresponding to larger proposals.
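A sketch of this clustering, operating on widths and heights only with box corners aligned (as in the YOLO anchor clustering; the function names and the deterministic initialization are ours):

```python
import numpy as np

def iou_wh(wh, centroids):
    """IOU between (N, 2) boxes and (K, 2) centroids given as width/height
    pairs, assuming their top-left corners coincide."""
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
             np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    union = (wh[:, 0] * wh[:, 1])[:, None] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=25):
    """k-means over box shapes where the distance is d = 1 - IOU (Eq. 16)."""
    order = np.argsort(wh[:, 0] * wh[:, 1])                  # deterministic init:
    idx = order[np.linspace(0, len(wh) - 1, k).astype(int)]  # spread seeds by area
    centroids = wh[idx].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centroids), axis=1)    # nearest = highest IOU
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = wh[assign == j].mean(axis=0)
    return centroids
```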
Regarding the allocation of ground-truth boxes, we use adaptive training sample selection [5]. At each level of the feature pyramid, we choose the k boxes whose centers are closest to the center of the ground-truth box as the candidate positive samples. After determining the candidate positive samples, we calculate their IOU with the corresponding ground-truth boxes and denote the set of all IOU values as D_g. We calculate the mean and standard deviation of D_g, denoted as m_g and v_g, respectively, and set the IOU threshold as t_g = m_g + v_g. The prior anchor boxes whose IOU with the ground-truth box is greater than or equal to t_g are considered positive samples, as shown in Figure 8. If a prior anchor box satisfies the condition for multiple ground-truth boxes, it is assigned to the ground-truth box with the highest IOU value.
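The selection rule can be sketched as follows (assuming `candidate_ious` holds the IOU values of the candidate anchors for one ground-truth box):

```python
import numpy as np

def atss_positives(candidate_ious):
    """Adaptive threshold t_g = mean + standard deviation of the candidate
    IOUs; anchors at or above t_g become positive samples."""
    t_g = candidate_ious.mean() + candidate_ious.std()
    return candidate_ious >= t_g
```

A large spread in the candidate IOUs (a few clearly better anchors) raises the threshold and keeps only those anchors, while a uniform candidate set admits more of them.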
In complex scenes, the mutual occlusion of objects and blurriness of the main image can lead to uncertainty in the borders, as shown in Figure 9. In this paper, we regress the distributions of the four offset values l, t, r, and b relative to the borders, and their joint distribution can reflect the clarity of the boundaries. For example, in Figure 9a, when all borders are very clear, the joint distribution of t and b, and the joint distribution of l and r, will both have a sharp peak. When one of the upper and lower borders becomes blurry, as in Figure 9b, the peak of the joint distribution of t and b will no longer be obvious, and the same goes for the left and right borders. In Figure 9c,d, when the target shows two possible borders, the joint distribution will have two relatively indistinct peaks.
We denote the distribution we predict as F(x), where F(x) satisfies ∫F(x)dx = 1. Let the ground truth be y, and the predicted value be ŷ = ∫xF(x)dx. We cannot perform calculations and regression on x in the continuous domain, so we artificially add lower and upper boundaries x_0 and x_n and discretize x to {x_0, x_1, …, x_n} to be compatible with the convolutional neural network, as shown in Equation (17); in the practical algorithm, we use the softmax function as F(x).
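After discretization, the prediction reduces to the expectation ŷ = Σ F(x_i)·x_i over the softmax distribution. A minimal sketch (the bin range and count below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_offset(logits, x_min=0.0, x_max=16.0):
    """Discretize the offset range [x_min, x_max] into n+1 points and take the
    expectation of the softmax distribution as the predicted offset."""
    n = len(logits) - 1
    xs = np.linspace(x_min, x_max, n + 1)  # {x_0, x_1, ..., x_n}
    probs = softmax(logits)                # F(x_i), sums to 1
    return float((probs * xs).sum())       # y_hat = sum_i F(x_i) * x_i
```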
During training, we want ŷ to converge to a value close to y as soon as possible, but we cannot directly compute a loss between ŷ and y; otherwise, regressing ŷ through the distribution would lose its meaning. Moreover, the ground truth y is not necessarily exactly one of {x_0, x_1, …, x_n}. Therefore, we instead make the distribution as close as possible to the two discretization points y_i and y_{i+1} adjacent to y (y_i ≤ y ≤ y_{i+1}). Taking the joint distribution of t and b as an example, assuming the ground truth falls between (t_i, t_{i+1}) and between (b_j, b_{j+1}), we want the joint distribution of t and b to converge to (t_i, t_{i+1}) and (b_j, b_{j+1}) as soon as possible. The loss function is designed as Equation (18):

L = −(w_i · log S_i + w_{i+1} · log S_{i+1}),

where w_i = (y_{i+1} − y)/(y_{i+1} − y_i), w_{i+1} = (y − y_i)/(y_{i+1} − y_i), and S_i, S_{i+1} are the probabilities the softmax distribution assigns to y_i and y_{i+1}.
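This objective follows the distribution focal loss style of GFL: the closer the continuous target y lies to a discretization point, the more probability mass that point is asked to carry. A minimal sketch (names are ours; assumes xs is sorted and xs[0] ≤ y ≤ xs[-1]):

```python
import numpy as np

def dfl(probs, xs, y):
    """Distribution-focal-style loss for one offset: push probability mass
    toward the two discretization points bracketing the continuous target y."""
    i = np.searchsorted(xs, y) - 1                       # xs[i] <= y < xs[i+1]
    w_left = (xs[i + 1] - y) / (xs[i + 1] - xs[i])       # weight for left point
    w_right = (y - xs[i]) / (xs[i + 1] - xs[i])          # weight for right point
    return -(w_left * np.log(probs[i]) + w_right * np.log(probs[i + 1]))
```

A distribution that concentrates its mass on the two bins bracketing y achieves a low loss, while mass placed on distant bins is penalized heavily.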
For the heatmap branch, we use soft one-hot encoding to label the ground truth, unlike the traditional method in which all positive sample points are labeled 1 and all negative sample points are labeled 0. We assign a value of 0 < y ≤ 1 to each positive sample point, where y is the IOU score of that point: the larger the IOU between the anchor and the ground truth at the point, the larger the value of y. The advantage of this approach is that it establishes a connection between position and IOU, improving the consistency of the network between training and prediction. At the same time, positive samples with a higher ground-truth IOU contribute more weight, thereby improving the performance of the network.
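Generating such soft targets can be sketched as follows (a minimal illustration with our own names, assuming the positive mask and per-point IOUs are already computed):

```python
import numpy as np

def soft_heatmap_targets(anchor_ious, pos_mask):
    """Soft one-hot heatmap targets: each positive point is labeled with its
    IOU against the matched ground truth (0 < y <= 1) instead of a hard 1;
    negative points stay 0."""
    targets = np.zeros_like(anchor_ious)
    targets[pos_mask] = anchor_ious[pos_mask]
    return targets
```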
Subsequently, we encode the output of the border distribution branch and apply the result to the heatmap; the specific process is shown in Figure 1. First, we select the top k values of the discrete distribution and feed them into two FC layers and an activation layer to generate a weight, which is multiplied with the corresponding point on the heatmap. The rationale is that the distribution of the bounding box is strongly correlated with the IOU score; coupling the two branches further improves the accuracy of the heatmap and reduces the difficulty of training, making the proposal scores more accurate.
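The coupling can be sketched as below. The hidden width and the sigmoid activation are our own illustrative assumptions (the paper only specifies two FC layers and an activation layer), and a real implementation would of course learn the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DistributionToHeatmapWeight:
    """Top-k values of the discretized box distributions -> two FC layers and
    an activation -> a scalar weight that rescales the heatmap score."""
    def __init__(self, k=4, hidden=16):
        self.k = k
        self.w1 = rng.normal(0, 0.1, (4 * k, hidden))  # 4 box sides, top-k each
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, side_probs, heat_score):
        # side_probs: (4, n_bins) distributions for the l, t, r, b offsets
        topk = np.sort(side_probs, axis=1)[:, -self.k:]            # top-k per side
        h = np.maximum(topk.reshape(-1) @ self.w1 + self.b1, 0.0)  # FC + ReLU
        weight = sigmoid(h @ self.w2 + self.b2)[0]                 # FC + sigmoid
        return heat_score * weight
```

A sharp distribution (large top-k values) can thus push the heatmap score up, while a flat, uncertain distribution suppresses it.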
3.5. Cascade Heads
In this paper, we adopt cascade heads as the second stage of our detector, which decompose the regression of categories and bounding boxes into multiple stages; each stage takes the bounding boxes from the previous stage along with the feature map as inputs, and outputs the classification and a new distribution of bounding boxes. The detailed structure of cascade heads is illustrated in
Figure 10.
Regarding the bounding box regression part, it relies on a cascade of specialized regressors, as depicted in Equation (20):

f(x, b) = f_T ∘ f_{T−1} ∘ ⋯ ∘ f_1(x, b),

where x represents the input feature map, b the bounding box, and T the total number of stages; in this paper, we set T = 3. Each stage has an independent regressor with independent parameters, instead of simply repeating the same f multiple times. The cascaded regression is a resampling procedure that changes the distribution of hypotheses processed by the different stages. Likewise, each regressor f in the cascade is optimized for the sample distribution that arrives at the corresponding stage, rather than for the initial distribution of boxes. The cascade progressively refines the hypotheses. The cascade heads use the same structure and parameters during both training and inference, which provides a more reasonable probabilistic interpretation, with no discrepancy between the training and inference distributions.
As the number of regression stages increases, the quality of the bounding boxes improves; in other words, the cascade regression begins with an initial set of examples and iteratively resamples an example distribution with a higher IOU. Therefore, to keep the number of positive samples relatively balanced and to eliminate outliers as far as possible, enabling a better-trained sequence of specialized detectors, the regressors in different stages should use different, gradually increasing IOU thresholds. In practical training, our three regressors use IOU thresholds of {0.5, 0.6, 0.7}, which is consistent with the original Cascade R-CNN paper.
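The composition f_T ∘ ⋯ ∘ f_1 can be illustrated with a deliberately tiny 1-D toy (in the real detector each stage has its own learned regressor, while here a single toy function is repeated purely to show the progressive refinement):

```python
def cascade_refine(x, boxes, regressors):
    """Cascade regression: each stage's regressor consumes the boxes produced
    (resampled) by the previous stage."""
    for f in regressors:
        boxes = [f(x, b) for b in boxes]
    return boxes

# Toy stand-in: x plays the role of the true coordinate, and each stage
# halves the remaining gap between the hypothesis b and x.
halve_gap = lambda x, b: b + 0.5 * (x - b)
```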
As for the classification part, each cascade head has an independent classification branch with its own parameters, which outputs the probability of the target belonging to each class. Unlike the bounding box regression part, the classification result of each stage is not affected by the result of the previous stage. The cascade heads are learned by minimizing the loss in Equation (21), where g is the ground-truth label and h_t is the classifier of the t-th cascade head.
During the prediction phase, we also couple the two stages. Specifically, the score of the final bounding box is obtained by multiplying the score of the first stage with the score of each cascade. This is one of the essential differences between our method and traditional two-stage object detectors, as the two stages of our detector are not separate.
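The scoring rule described above reduces to a product over stages (a one-line sketch of the coupling, with our own function name):

```python
import numpy as np

def final_score(proposal_score, stage_scores):
    """Couple the two stages at inference: the detection score is the
    first-stage proposal score multiplied by the classification score of
    each cascade head."""
    return proposal_score * np.prod(stage_scores)
```

Because the first-stage score enters the product, a weak proposal cannot be rescued by a confident second stage alone, which is exactly the probabilistic coupling that distinguishes this detector from a traditional two-stage pipeline.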