1 Introduction

As one of the important research directions of computer vision, object detection has long attracted the attention of researchers. With the development of deep learning, object detection has advanced rapidly, and many object detection models have been proposed. Object detectors can be divided into anchor-based and anchor-free detectors according to whether they use anchors, and into one-stage and two-stage detectors according to the network structure. Despite the wide variety of object detectors, the key task is always to locate and classify targets in images/videos.

The localization of the target by an object detector is mainly obtained through regression learning of the bounding box. Early object detection algorithms usually used the \({\mathcal {L}}_n\)-norm loss [1] for bounding box regression, but recent work commonly uses Intersection over Union (IoU) [2, 3] and its improved variants. According to previous research, IoU loss has a scale-invariance property for bounding boxes, which is helpful for training object detectors. However, when the predicted box does not overlap with the target box, IoU suffers from vanishing gradients, which has prompted rapid development of research on IoU losses, including GIoU [4], DIoU [5], CIoU [5], Alpha-IoU [6], EIoU [7], GGIoU [8] and RIoU [9], etc.

Based on experiments, we find that the current IoU loss and its improved variants are still insufficiently accurate in bounding box regression, especially for high-IoU samples. Moreover, IoU is simply the result of introducing the Jaccard similarity coefficient into the regression loss calculation, so introducing other mathematical concepts may yield better evaluation metrics. In fact, the existing power-transformation improvements of IoU losses and the application of the Dice coefficient in the segmentation field provide a reference for designing a new measure for bounding box regression. We believe a new gradient-regulation method can be introduced, on top of the IoU localization performance index, to improve learning efficiency in the bounding box regression process.

To address these issues, we propose a new measure that is superior to the IoU measure and deduce three properties of the new measure. Based on these, we propose the higher-performance N-IoU measure and a family of loss functions based on it. To verify the effectiveness of the proposed method, N-IoU loss is applied to several common detection algorithms: YOLOv3 [10], SSD [11], Faster R-CNN [12], YOLOX [13], YOLOv8(s) [14] and DETR [15]. Meanwhile, we evaluate our method on two mainstream benchmark datasets, PASCAL VOC 2007 [16] and MS COCO 2017 [17].

The main contributions of this paper are the following four points:

  1. A new measure for regression evaluation is proposed for the first time, and the basic characteristics of the new measure are defined and demonstrated. The idea can also be used to guide improvements in other application fields of deep learning;

  2. A new family of N-IoU losses is proposed, and experiments prove that N-IoU achieves higher regression accuracy than existing IoU-based losses;

  3. We analyze the properties of N-IoU loss and find that the proposed new loss can describe a variety of existing IoU-based loss functions in a unified way;

  4. Experiments show that our proposed loss outperforms existing IoU-based loss functions on multiple common object detection datasets and models. In particular, it has better robustness on lightweight object detectors.

2 Related work

In this section, we review the evolution of target localization methods in object detectors and summarize the technical route of bounding box regression learning. The section is divided into two parts: object detection algorithms and bounding box regression loss functions.

2.1 Object detection algorithm

Since the proposal of the RCNN [18] algorithm, the research on object detection algorithms has entered the era of deep learning. From the perspective of network structure, the existing object detector models can be divided into one-stage and two-stage. Representative works of one-stage models are SSD series [11, 19, 20], YOLO series [10, 13, 21,22,23], RetinaNet [24], FCOS [25], CornerNet [26], CenterNet [27], etc. Representative works of two-stage models are RCNN series [1, 12, 18, 28, 29], SPPNet [30], HTC [31], TSD [32], CPNDet [33], CenterNet2 [34], etc.

Compared with the one-stage model, the two-stage model adds a region proposal network (RPN), which generates a large number of foreground and background region proposals to preliminarily separate target and background regions and thereby mitigate the imbalance of positive and negative samples. This is also why the accuracy of the two-stage model is higher than that of the one-stage model, although its computation speed is relatively slow.

From the bbox regression perspective, an important distinction is whether the model uses anchor boxes; existing models can thus be divided into anchor-based and anchor-free detectors. Anchor-based detectors pre-define anchor boxes with certain scales and aspect ratios for training. Positive and negative samples are assigned based on the differences between these predetermined anchor boxes and the ground-truth target boxes, and these differences are then learned. Anchor-free detectors do not pre-set anchor boxes; they allocate positive and negative samples according to key points of the target box and learn the differences between these key points and the target box to complete training. There are also works, e.g., FSAF [35] and GA-RPN [36], that mix anchor-based and anchor-free methods.

Regardless of the type of object detector, the localization of targets in images/videos currently remains inseparable from bounding box regression learning. Therefore, designing an efficient bounding box regression loss to guide the learning of the model's localization branch is extremely important.

2.2 Bounding box regression loss functions

Along with the development of object detection algorithms, the bounding box regression loss functions used to learn target locations have also evolved. Classic \({\mathcal {L}}_n\)-norm regression losses include the \({\mathcal {L}}_1\) and \({\mathcal {L}}_2\) losses; the former is not smooth and slow to learn, while the latter is sensitive to outliers. YOLOv1 [21] therefore regresses the square roots of w and h to alleviate the impact of the \({\mathcal {L}}_n\)-norm loss, and YOLOv3 [10] uses a \(2-wh\) weighting to alleviate this problem. Faster R-CNN [1] uses the smooth \({\mathcal {L}}_1\) loss for bbox regression. In practical applications, logistic regression losses (e.g., BCE loss [37]) are also used for bbox regression.

However, recent work mostly uses IoU and its variants for bbox regression. The advantage of IoU is its scale invariance; the disadvantage is that the gradient vanishes on non-overlapping samples. GIoU [4] was therefore proposed to solve the vanishing gradient on non-overlapping samples, but it still suffers from slow convergence and low accuracy. DIoU and CIoU [5] further consider the overlapping area, center-point distance and aspect ratio as regularization terms in the IoU loss, greatly improving regression accuracy and convergence speed.

In recent related studies, Focal-EIoU [7] further optimizes the penalty term of CIoU, and Rectified IoU [9] constructs a more complex IoU-based loss function with a hyperbolic gradient shape. Pseudo-IoU [38] proposes an IoU calculation method suitable for anchor-free object detectors. DIR [39] proposes decoupled IoU regression, which separates the complex mapping between a bounding box and its IoU into two clearer mappings, purity and integrity, modeled independently.

On the other hand, making the model pay more attention to high-IoU samples is also an important research direction. Current methods mainly apply a power transformation to the IoU value or combine a hard-to-easy sample balancing strategy (Focal Loss [24]); this is reflected in recent works such as Focal-EIoU [7], GGIoU [8], Rectified IoU [9] and Alpha-IoU [6].

3 Formulation of hypothesis

According to the analysis of commonly used loss functions in Sect. 2.2, there are two main categories of loss functions used for bounding box regression: \({\mathcal {L}}_n\)-norm losses and regression losses based on Intersection over Union, of which the IoU-based losses are more widely studied and applied.

3.1 Analysis to IoU-based loss

$$\begin{aligned} {\text {IoU}}= & {} \frac{\vert B\cap B^{gt} \vert }{\vert B\vert + \vert B^{gt}\vert -\vert B{\cap }B^{gt}\vert } \end{aligned}$$
(1)
$$\begin{aligned} {\mathcal {L}}_{{\text {IoU}}}= & {} 1-IoU \end{aligned}$$
(2)

IoU loss first introduced the Jaccard similarity coefficient between sets into regression calculation to evaluate the loss between the predicted box and the target box, where \(B=(x,y,w,h)\) and \(B^{gt}=(x^{gt},y^{gt},w^{gt},h^{gt})\) are the predicted box and target box. Equation (1) is the calculation of the IoU value, and Eq. (2) is the IoU loss function, which suffers from vanishing gradients when the predicted box and the target box have no overlapping area. Subsequent studies found that adding penalty terms helps solve the vanishing gradient problem of IoU loss, which gave rise to the IoU-based loss function family, whose general form is as follows:

$$\begin{aligned} {\mathcal {L}}_{{\text {IoU-based}}}=1-IoU+{\mathcal {R}}_i\left( B,B^{gt} \right) \end{aligned}$$
(3)

where \({\mathcal {R}}_i\left( B,B^{gt} \right)\) is the penalty item.
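As a minimal sketch (not the paper's code), Eqs. (1)-(3) can be written for axis-aligned boxes. We assume a corner format (x1, y1, x2, y2) here purely for brevity; the paper parameterizes boxes as (x, y, w, h), and the penalty `R_i` is left as a pluggable stub since each IoU-based variant defines its own:

```python
def iou(b, b_gt):
    """Intersection over Union of two corner-format boxes (Eq. 1)."""
    ix1, iy1 = max(b[0], b_gt[0]), max(b[1], b_gt[1])
    ix2, iy2 = min(b[2], b_gt[2]), min(b[3], b_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (b[2] - b[0]) * (b[3] - b[1])
    area_gt = (b_gt[2] - b_gt[0]) * (b_gt[3] - b_gt[1])
    return inter / (area + area_gt - inter)

def iou_based_loss(b, b_gt, penalty=lambda b, b_gt: 0.0):
    """General form of Eq. (3): 1 - IoU + R_i(B, B_gt)."""
    return 1.0 - iou(b, b_gt) + penalty(b, b_gt)
```

For two disjoint boxes, `iou` returns 0 and the plain IoU loss is a constant 1, which is exactly the vanishing-gradient problem the penalty terms address.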

3.1.1 GIoU, DIoU and CIoU

$$\begin{aligned} {\mathcal {L}}_{GIoU}= & {} 1-IoU+\frac{\vert C-(B{\cup }B^{gt})\vert }{\vert C\vert } \end{aligned}$$
(4)
$$\begin{aligned} {\mathcal {L}}_{DIoU}= & {} 1-IoU+\frac{\rho ^2(b,b^{gt})}{c^2} \end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {L}}_{CIoU}= & {} 1-IoU+\frac{\rho ^2(b,b^{gt})}{c^2}+\alpha \upsilon \end{aligned}$$
(6)

Equations (4), (5) and (6) are the GIoU [4], DIoU and CIoU [5] losses based on the IoU loss by adding different penalty terms. GIoU solves the gradient vanishing problem of IoU loss for the first time, but the regression accuracy and speed are poor. DIoU and CIoU improve the regression accuracy and speed while solving the problem of gradient disappearance. These are the loss functions commonly used in object detectors today.
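The three penalty terms above can be sketched as follows, again for corner-format boxes. This is an illustrative reading of Eqs. (4)-(6), not the authors' implementation; the helper `_iou_and_areas` and the small epsilon in the CIoU trade-off weight are our additions:

```python
import math

def _iou_and_areas(b, g):
    """IoU and union area of two corner-format boxes (helper)."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union, union

def giou_loss(b, g):
    """Eq. (4): IoU loss plus the enclosing-box area penalty."""
    iou, union = _iou_and_areas(b, g)
    cw = max(b[2], g[2]) - min(b[0], g[0])  # smallest enclosing box C
    ch = max(b[3], g[3]) - min(b[1], g[1])
    return 1.0 - iou + (cw * ch - union) / (cw * ch)

def diou_loss(b, g):
    """Eq. (5): squared center distance over squared enclosing diagonal."""
    iou, _ = _iou_and_areas(b, g)
    rho2 = ((b[0] + b[2]) / 2 - (g[0] + g[2]) / 2) ** 2 \
         + ((b[1] + b[3]) / 2 - (g[1] + g[3]) / 2) ** 2
    cw = max(b[2], g[2]) - min(b[0], g[0])
    ch = max(b[3], g[3]) - min(b[1], g[1])
    return 1.0 - iou + rho2 / (cw ** 2 + ch ** 2)

def ciou_loss(b, g):
    """Eq. (6): DIoU plus the aspect-ratio consistency term alpha * v."""
    iou, _ = _iou_and_areas(b, g)
    w, h = b[2] - b[0], b[3] - b[1]
    wg, hg = g[2] - g[0], g[3] - g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v + 1e-9)  # trade-off weight; epsilon avoids 0/0
    return diou_loss(b, g) + alpha * v
```

All three losses vanish for a perfect prediction, while for disjoint boxes the GIoU penalty keeps the loss above 1 and thus keeps a usable gradient.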

3.1.2 EIoU

$$\begin{aligned} {\mathcal {L}}_{{\text {EIoU}}}=1-IoU+\frac{\rho ^2(b,b^{gt})}{c^2}+\frac{\rho ^2(w,w^{gt})}{C_w^2}+\frac{\rho ^2(h,h^{gt})}{C_h^2} \end{aligned}$$
(7)

The penalty term of the EIoU [7] loss function contains three parts: overlap loss, center distance loss and width and height loss. The first two parts are the same as in CIoU, while the width and height loss directly minimizes the difference between the width and height of the predicted box and the target box, accelerating model convergence during training. Here, \(C_w\) and \(C_h\) are the width and height of the smallest enclosing box of the predicted box and the target box.

3.1.3 Balanced-IoU

Balanced-IoU (BIoU) [40] loss function considers the parameterized distance between the centers and the minimum and maximum edges of the bounding boxes to address the localization problem.

$$\begin{aligned} {\mathcal {L}}_{BIoU}= & {} 1-IoU+R\left( b^P,b^G \right) \end{aligned}$$
(8)
$$\begin{aligned} R\left( b^P,b^G \right)= & {} \frac{{\text {WC}}+{\text {HC}}+{\text {MNE}}+{\text {MXE}}}{C^2} \end{aligned}$$
(9)

3.1.4 Diagonal-based IoU

Diag-IoU (DiagIoU) [41] loss retains the superior part of CIoU loss, i.e., the loss of the center distance. It uses \(L_{Diag}\) to replace \(L_{\frac{w}{h}}\) of CIoU loss, which directly minimizes the difference in size and scale between two boxes by the difference of the diagonal vector.

$$\begin{aligned} {\mathcal {L}} _{\mathrm{{Diag\_IoU}}}=1-IoU+\frac{\rho ^2\left( b,b^{gt} \right) }{c^2}+\frac{\rho ^2\left( d,d^{gt} \right) }{\vert d \vert \vert d^{gt} \vert } \end{aligned}$$
(10)

3.1.5 Manhattan-distance IoU

By using the Manhattan distance, Manhattan-distance IoU (MIoU) [42] solves the instability of the Euclidean-distance term caused by its huge gradient in the early stage of regression, and sets the denominator of the Euclidean-distance term to a normalized coefficient that does not participate in backpropagation, which effectively improves the convergence speed.

$$\begin{aligned} {\mathcal {L}} _{MIoU-C}= & {} 1-IoU+\frac{\rho ^2\left( b,b^{gt} \right) }{c^2}+\alpha \nu \nonumber \\{} & {} +\frac{\delta _x\left( b,b^{gt} \right) }{c_w}+\frac{\delta _y\left( b,b^{gt} \right) }{c_h} \end{aligned}$$
(11)

3.1.6 MPDIoU

MPDIoU [43] is a bounding box similarity comparison measure based on minimum point distance, which contains all of the relevant factors considered in the existing loss functions, namely, overlapping or non-overlapping area, center distance and deviation of width and height, while simplifying the calculation process.

$$\begin{aligned} {\mathcal {L}} _{MPDIoU}=1-IoU+\frac{\mathrm{{d}}_{1}^{2}}{h^2+w^2}+\frac{\mathrm{{d}}_{2}^{2}}{h^2+w^2} \end{aligned}$$
(12)

3.1.7 Scylla-IoU

Scylla-IoU (SIoU) [44] divides the penalty term into three parts: angle cost, distance cost and shape cost. The angle cost describes the minimum angle between the line connecting the center points and the x–y axes:

$$\begin{aligned} \varLambda= & {} \sin \left( 2\sin ^{-1}\frac{\min \left( \vert x-x_{gt} \vert ,\vert y-y_{gt} \vert \right) }{\sqrt{\left( x-x_{gt} \right) ^2+\left( y-y_{gt} \right) ^2}+\epsilon } \right) \end{aligned}$$
(13)
$$\begin{aligned} \varDelta= & {} \frac{1}{2}\sum _{t=w,h}{\left( 1-e^{-\gamma \rho _t} \right) }, \gamma =2-\varLambda , \rho _x=\left( \frac{x-x_{gt}}{W_g} \right) ^2,\nonumber \\{} & {} \rho _y=\left( \frac{y-y_{gt}}{H_g} \right) ^2 \end{aligned}$$
(14)
$$\begin{aligned} \varOmega= & {} \frac{1}{2}\sum _{t=w,h}{\left( 1-e^{-\omega _t} \right) ^{\theta }}, \theta =4, \omega _w=\frac{\vert w-w_{gt} \vert }{\max \left( w,w_{gt} \right) },\nonumber \\{} & {} \omega _h=\frac{\vert h-h_{gt} \vert }{\max \left( h,h_{gt} \right) } \end{aligned}$$
(15)
$$\begin{aligned} {\mathcal {L}}_{\mathrm{{SIoU}}}= & {} 1-IoU+\frac{1}{2}{\mathcal {R}} _{SIoU}=1-IoU+\frac{\varDelta +\varOmega }{2} \end{aligned}$$
(16)

The penalty term of SIoU is composed of the distance cost and shape cost, which enables the model to converge quickly with smaller regression errors during training. Its limitation, however, is that it is more complex to compute than other IoU-based losses.

3.1.8 Focal-EIoU

Focal-EIoU [7] loss combines EIoU with Focal Loss to propose a more balanced regression loss.

$$\begin{aligned} {\mathcal {L}}_{\mathrm{{Focal-EIoU}}}=IoU^\gamma {\mathcal {L}}_{EIoU} \end{aligned}$$
(17)

3.1.9 Gaussian-guided IoU

Gaussian-guided IoU (GGIoU) [8] focuses more attention on the closeness of the predicted box's center to the target box's center.

$$\begin{aligned} D_c= & {} e^{-\frac{1}{2}\left( \frac{\left( x_{c}^{a}-x_{c}^{gt} \right) ^2}{\sigma _{1}^{2}}+\frac{\left( y_{c}^{a}-y_{c}^{gt} \right) ^2}{\sigma _{2}^{2}} \right) }, \sigma _1=\beta \times w^{gt}, \sigma _2=\beta \times h^{gt} \end{aligned}$$
(18)
$$\begin{aligned} GGIoU= & {} IoU^{1-\alpha }D_c^\alpha \end{aligned}$$
(19)

3.1.10 Rectified IoU

Rectified IoU (RIoU) [9] increases the gradients of the many easy samples (large IoU) so the network pays more attention to them, while suppressing the gradients of the few hard samples (small IoU). The contribution of each type of sample is thus more balanced, and the training process is more efficient and stable.

$$\begin{aligned} {\mathcal {L}}_{RIoU}=1-\left( \frac{a}{2}IoU^2+bIoU+k\ln \vert IoU-c \vert +t\right) \end{aligned}$$
(20)

3.1.11 Alpha-IoU

Alpha-IoU (\(\alpha\)-IoU) [6] adds the power parameter \(\alpha\) to the IoU and penalty terms of the loss function; by adjusting \(\alpha\), the gradient importance of bounding box regression at different levels can be varied. \(\alpha\)-IoU is more robust to small datasets and noise.

$$\begin{aligned} {\mathcal {L}}_{\alpha \text{- }IoU}= & {} 1-IoU^\alpha \end{aligned}$$
(21)
$$\begin{aligned} {\mathcal {L}}_{\alpha \text{- }CIoU}= & {} 1-IoU^\alpha +\frac{\rho ^{2\alpha }(b,b^{gt})}{c^{2\alpha }}+(\beta \upsilon )^\alpha \end{aligned}$$
(22)
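The effect of the power transform in Eq. (21) can be checked numerically. For \(L=1-IoU^\alpha\), the gradient magnitude with respect to IoU is \(\alpha \, IoU^{\alpha -1}\), versus a constant 1 for the plain IoU loss; the function below (our naming, not from the paper) makes the amplify-low/suppress-high behavior concrete:

```python
def alpha_iou_grad(iou, alpha=0.5):
    """|dL/dIoU| for the alpha-IoU loss L = 1 - IoU**alpha (Eq. 21)."""
    return alpha * iou ** (alpha - 1.0)
```

With \(\alpha =0.5\), the gradient exceeds 1 for IoU below 0.25 (hard, low-IoU samples) and falls below 1 as IoU approaches 1 (easy, high-IoU samples).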

3.1.12 Wise-IoU

Wise-IoU (WIoU) [45] is a dynamic non-monotonic FM loss function based on IoU. The dynamic non-monotonic FM uses the outlier degree instead of IoU to evaluate the quality of anchor boxes and provides a wise gradient gain allocation strategy. This strategy reduces the competitiveness of high-quality anchor boxes while also reducing the harmful gradient generated by low-quality examples. This allows WIoU to focus on ordinary-quality anchor boxes and improve the detector’s overall performance.

$$\begin{aligned} {\mathcal {L}} _{WIoUv1}=\exp \left( \frac{\left( x-x_{gt} \right) ^2+\left( y-y_{gt} \right) ^2}{W_{g}^{2}+H_{g}^{2}} \right) {\mathcal {L}} _{IoU} \end{aligned}$$
(23)

From these research results, there are two main technical routes for the evolution of IoU-based loss functions: (1) improvement of the penalty term, along which GIoU, DIoU, CIoU, EIoU, BIoU, Diag-IoU, MIoU and MPDIoU have all evolved; and (2) weighting the IoU term or adding a power transform to optimize the gradient behavior, along which Focal-EIoU, GGIoU, RIoU, Alpha-IoU and WIoU have developed, with some of these losses also improving the penalty term. However, these studies all remain improvements within the framework of the Jaccard similarity coefficient and do not explore whether there is a more suitable evaluation measure for bounding box regression than IoU. Our research starts from this point.

3.2 New measure

The Dice coefficient is a widely used metric in the computer vision community for calculating the similarity between two images. In some studies [46,47,48,49], Dice loss [50] has been extended for image segmentation. Dice loss uses the Dice coefficient, a set similarity measure, as an image segmentation evaluation.

$$\begin{aligned} \mathrm{{DiceIndex}}\left( y,{\hat{p}} \right)= & {} \frac{2y{\hat{p}}}{y+{\hat{p}}},\nonumber \\ \mathrm{{DiceLoss}}\left( y,{\hat{p}} \right)= & {} 1-\frac{2y{\hat{p}}+1}{y+{\hat{p}}+1} \end{aligned}$$
(24)

Here, 1 is added to the numerator and denominator to ensure that the function is defined in edge cases such as \(y={\hat{p}}=0\).

The Tversky index can be seen as a generalization of the Dice coefficient; it weights false positives and false negatives via the \(\beta\) coefficient. The Tversky index and Tversky loss are defined as follows:

$$\begin{aligned}{} & {} \mathrm{{TverskyIndex}}\left( p,{\hat{p}} \right) \nonumber \\{} & {} \quad =\frac{p{\hat{p}}}{p{\hat{p}}+\beta \left( 1-p \right) {\hat{p}}+\left( 1-\beta \right) p\left( 1-{\hat{p}} \right) } \end{aligned}$$
(25)
$$\begin{aligned}{} & {} \mathrm{{TverskyLoss}}\left( p,{\hat{p}} \right) \nonumber \\{} & {} \quad =1-\frac{1+p{\hat{p}}}{1+p{\hat{p}}+\beta \left( 1-p \right) {\hat{p}}+\left( 1-\beta \right) p\left( 1-{\hat{p}} \right) } \end{aligned}$$
(26)
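A quick numeric sketch of Eq. (25) (scalar probabilities, our function names): with \(\beta =0.5\) the false-positive and false-negative terms are weighted equally and the denominator collapses to \((p+{\hat{p}})/2\), so the Tversky index reduces exactly to the Dice index:

```python
def tversky_index(p, p_hat, beta=0.5):
    """Eq. (25) for scalar p, p_hat; beta trades off FP vs FN weight."""
    return p * p_hat / (p * p_hat
                        + beta * (1 - p) * p_hat
                        + (1 - beta) * p * (1 - p_hat))

def dice_index(p, p_hat):
    """Dice index 2*p*p_hat / (p + p_hat) from Eq. (24)."""
    return 2 * p * p_hat / (p + p_hat)
```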

We can directly extend Dice loss to the field of bounding box regression. The resulting Dice loss can be defined as follows:

$$\begin{aligned} \mathrm{{Dice}}= & {} \frac{2\vert B{\cap }B^{gt}\vert }{\vert B\vert +\vert B^{gt}\vert }=\frac{2\vert B{\cap }B^{gt}\vert }{\vert B{\cup }B^{gt}\vert +\vert B{\cap }B^{gt}\vert } \end{aligned}$$
(27)
$$\begin{aligned} {\mathcal {L}}_{{\text {Diceloss}}}= & {} 1-\frac{2\vert B{\cap }B^{gt}\vert }{\vert B\vert +\vert B^{gt}\vert } \end{aligned}$$
(28)

Experiments show that this Dice loss achieves better regression accuracy than IoU loss, which indicates that a better regression evaluation metric than IoU exists.
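The box-level Dice of Eqs. (27)-(28) can be sketched as below (corner-format boxes, our helper names). Since \(\vert B\vert +\vert B^{gt}\vert =\vert B{\cup }B^{gt}\vert +\vert B{\cap }B^{gt}\vert\), Dice is a monotone transform of IoU, \(\mathrm{Dice}=2\,IoU/(1+IoU)\), which we also verify:

```python
def box_dice(b, g):
    """Eq. (27): 2 * intersection / (area(B) + area(B_gt))."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (b[2] - b[0]) * (b[3] - b[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    return 2 * inter / (area + area_g)

def dice_from_iou(iou):
    """Closed-form relation implied by Eq. (27): Dice = 2*IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)
```

For boxes with IoU = 1/7 the Dice value is 0.25, illustrating how Dice lifts low IoU values while both measures agree at 0 and 1.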

We argue that such a new measure should satisfy three properties:

  1. The new measure should have the same scale-invariance property as IoU but be independent of the Jaccard similarity coefficient;

  2. It should appropriately amplify the regression loss gradient of low-IoU samples, suppress the regression loss gradient of high-IoU samples and improve the robustness and accuracy of the model;

  3. Since \(IoU\in [0,1]\), the new measure should be bounded, continuous and differentiable on [0,1] to facilitate backpropagation.

Further, this control idea of amplifying the loss gradient of low IoU samples and suppressing the loss gradient of high IoU samples can not only be used for the loss function design of the regression branch, but also has guiding significance for the design of other types of loss functions.

Fig. 1

Relationship between IoU and possible new measures

In Fig. 1, the black line \(L_1\) is the IoU loss, and the red line \(L_2\) is the possible new measures. \((\theta , L)\) in the figure represents the IoU value and losses at a certain moment. In the high loss area, IoU=0.1 is selected as the starting time of parameter learning. Model parameters are initialized to \(\theta _1\). The initial state of parameter learning when using \(L_1\) loss is \((\theta _1, L_1^1)\). The learning end state is \((\theta _1^\prime , L_1^{1^\prime })\). When using \(L_2\), the initial state is \((\theta _1, L_2^1)\), and the end state is \((\theta _1^{\prime \prime }, L_2^{1^\prime })\). The IoU of the learning start and end times selected in the low loss area is 0.6 and 0.8, respectively. The parameters are initialized to \(\theta _2\) at the beginning of training, and the states before and after learning are from \((\theta _2, L_1^2)\) to \((\theta _2^\prime , L_1^{2^\prime })\), and \((\theta _2, L_2^2)\) to \((\theta _2^{\prime \prime }, L_2^{2^\prime })\), respectively.

Under the same learning rate \(\eta\) and without considering the loss coefficient \(\mathcal {\lambda }\), in the parameter update of the low IoU area, when the loss is \(L_1\), the loss changes to \(\sigma _1=L_1^{1^\prime }-L_1^1\) before and after learning. The gradient of the parameter update is \({\hat{g}}={\nabla }_{\theta _1}\sigma _1\). The updated parameter is \(\theta _1^\prime =\theta _1-\eta {\nabla }_{\theta _1}\sigma _1\). When the loss is \(L_2\), the loss changes to \(\sigma _2=L_2^{1^\prime }-L_2^1\), and the updated parameter is \(\theta _1^{\prime \prime }=\theta _1-\eta {\nabla }_{\theta _1}\sigma _2\). Obviously \(\sigma _2>\sigma _1\), and \(\vert \theta _1^{\prime \prime }-\theta _1\vert >\vert \theta _1^\prime -\theta _1\vert\), which indicates that for high loss samples, \(L_2\) enriches the learned gradient changes of the model. When using the same calculation method, in the high IoU area, \(\theta _2^\prime =\theta _2-\eta {\nabla }_{\theta _2}(L_1^{2^\prime }-L_1^2)\) when the loss is \(L_1\), \(\theta _2^{\prime \prime }=\theta _2-\eta {\nabla }_{\theta _2}(L_2^{2^\prime }-L_2^2)\) when the loss is \(L_2\), and \(\vert L_1^{2^\prime }-L_1^2\vert >\vert L_2^{2^\prime }-L_2^2\vert\) at this time, so \(\vert \theta _2^{\prime \prime }-\theta _2\vert >\vert \theta _2^\prime -\theta _2\vert\). It can be found that for low loss samples, \(L_2\) makes the learned gradient of the model change more finely.

The above analysis shows that as long as we can find a new measure whose curve against IoU has the shape of the \(L_2\) curve in Fig. 1, the new measure will conform to the three properties proposed above.

Fig. 2

a The relationship between \({\mathcal {L}}_{N\text{- }IoU}\), \({\mathcal {L}}_{\alpha \text{- }IoU}\) and the IoU. b The relationship between gradient and IoU. For the sake of brevity, only the correlation curve of \({\mathcal {L}}_{\alpha \text{- }IoU}\) when \(\alpha =0.5\) is drawn

Some studies, such as Focal-EIoU [7], Gaussian-guided IoU (GGIoU) [8], Rectified IoU (RIoU) [9] and Alpha-IoU [6], share a common feature: they use a power function to modulate the IoU to obtain a better loss. For the power function \(y=x^\alpha\) with \(0\le \alpha \le 1\) and \(x\in [0,1]\), when \(x<0.5\), \({\nabla }y>{\nabla }x\); when \(x>0.5\), \({\nabla }y<{\nabla }x\); and \(\alpha >1\) gives the opposite behavior. Modulating IoU with a power function, i.e., taking IoU as the independent variable, can thus amplify the regression loss gradient of low-IoU samples and suppress that of high-IoU samples. The IoU curve modulated by the power function is shown in Fig. 2. These works all use power functions to regulate the loss gradients of high- and low-IoU samples to improve regression accuracy and speed. From this, we conclude that this method conforms to the second and third properties of the proposed new measure.

The Dice coefficient is equivalent to adding the area of the intersection region to both the numerator and denominator of IoU. The Dice loss curve is shown in Fig. 2; it enlarges the regression loss difference of low-IoU samples to speed up regression and reduces the loss difference of high-IoU samples to improve regression accuracy. This transformation is bounded and continuously differentiable, clearly meeting all the properties of the new measure, and its loss calculation is simpler than the power transformation.

4 Proposed N-IoU

IoU, the Intersection over Union, is the intersection area of overlapping samples divided by their union area, i.e., numerically a fraction. A simple observation is that adding an equal quantity to its numerator and denominator changes its value relative to the original IoU. Based on this, we propose N-IoU, defined as follows:

$$\begin{aligned} N\text{- }IoU= & {} \frac{\vert B{\cap }B^{gt}\vert +n\vert B{\cap }B^{gt}\vert }{\vert B\vert +\vert B^{gt}\vert -\vert B{\cap }B^{gt}\vert +n\vert B{\cap }B^{gt}\vert }\nonumber \\= & {} \frac{\vert B{\cap }B^{gt}\vert +n\vert B{\cap }B^{gt}\vert }{\vert B{\cup }B^{gt}\vert +n\vert B{\cap }B^{gt}\vert } \end{aligned}$$
(29)

where B and \(B^{gt}\) are still the predicted box and the target box, and \(\vert \cdot \vert\) represents the area. N-IoU is obtained by adding n times the intersection area to both the numerator and denominator of IoU. According to this construction, for the same set of overlapping samples, when \(n>0\), the value of N-IoU is greater than IoU; otherwise, it is smaller.
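Equation (29) can be sketched in two equivalent forms (our function names). Dividing numerator and denominator by the union area gives the closed form \((1+n)\,IoU/(1+n\,IoU)\), which reduces to IoU at \(n=0\) and to the Dice index of Eq. (27) at \(n=1\):

```python
def n_iou_from_areas(inter, union, n):
    """Eq. (29): add n times the intersection to numerator and denominator."""
    return (inter + n * inter) / (union + n * inter)

def n_iou_from_iou(iou, n):
    """Equivalent closed form after dividing through by the union area."""
    return (1 + n) * iou / (1 + n * iou)
```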

$$\begin{aligned} {\mathcal {L}}_{N\text{- }IoU}= & {} 1-N\text{- }IoU \end{aligned}$$
(30)
$$\begin{aligned} {\mathcal {L}}_{N\text{- }CIoU}= & {} 1-N\text{- }IoU+\frac{\rho ^2(b,b^{gt})}{c^2}+\alpha \upsilon \end{aligned}$$
(31)

Equations (30) and (31) are the N-IoU and N-CIoU losses, obtained by replacing the IoU term in the corresponding IoU-based loss function with N-IoU. To explore the influence of the parameter n on the performance of N-IoU, we intuitively analyze how N-IoU and the power function adjust IoU. We plot \({\mathcal {L}}_{N\text{- }IoU}\) and \({\mathcal {L}}_{IoU}\) versus the IoU value for several values of n from 0 to 15. The results are shown in Fig. 2; we also plot the curve of \({\mathcal {L}}_{\alpha \text{- }IoU}\) in Eq. (21) and calculate the gradients of \({\mathcal {L}}_{N\text{- }IoU}\) and \({\mathcal {L}}_{\alpha \text{- }IoU}\) with respect to the IoU.

As can be seen from the curves in Fig. 2, when \(n=0\), \({\mathcal {L}}_{N\text{- }IoU}\) and \({\mathcal {L}}_{IoU}\) coincide, and when \(n=1\), \({\mathcal {L}}_{N\text{- }IoU}\) is the Dice loss. As n increases, the gradient of \({\mathcal {L}}_{N\text{- }IoU}\) increases and N-IoU modulates IoU more strongly; \({\mathcal {L}}_{\alpha \text{- }IoU}\) shows similar characteristics.

According to the analysis in Sect. 3.2, when \(n>0\), N-IoU amplifies the loss gradient of low-IoU samples and reduces the loss gradient of high-IoU samples, and the opposite holds when \(n<0\). This matches the gradient curves in Fig. 2b. In addition, comparing \({\mathcal {L}}_{\alpha \text{- }IoU}\) and \({\mathcal {L}}_{N\text{- }IoU}\) shows that N-IoU both suppresses the loss gradient of high-IoU samples and enhances that of low-IoU samples in a more balanced way, whereas Alpha-IoU strongly amplifies the gradient for low-IoU losses but only weakly suppresses it for high-IoU losses.
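This gradient behavior follows directly from the closed form of Eq. (29). For \({\mathcal {L}}_{N\text{- }IoU}(I)=1-(1+n)I/(1+nI)\), the gradient magnitude is \((1+n)/(1+nI)^2\): \((1+n)\) at \(I=0\) (amplified relative to the constant 1 of plain IoU loss) and \(1/(1+n)\) at \(I=1\) (suppressed). A small numeric check (our derivation, not code from the paper):

```python
def n_iou_loss_grad(iou, n):
    """|dL/dIoU| for L = 1 - (1+n)*IoU / (1 + n*IoU), i.e. (1+n)/(1+n*IoU)**2."""
    return (1 + n) / (1 + n * iou) ** 2
```

For the paper's best-performing setting \(n=9\), the low-IoU gradient is amplified tenfold while the high-IoU gradient is reduced to one tenth.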


5 Simulation experiment

5.1 Implementation details

This paper proposes replacing the IoU term in the IoU, GIoU, DIoU and CIoU loss functions with N-IoU, forming a new family of IoU-based loss functions. However, N-IoU loss and existing regression losses cannot be compared intuitively from the detection accuracy of a detection model alone. We therefore design a set of randomly configured bbox regression simulations to compare IoU, GIoU, DIoU, CIoU, SIoU, WIoU, Diag-IoU, MIoU, Dice loss, Alpha-CIoU and \({\mathcal {L}}_{N\text{- }IoU}\) for different values of n. Compared with the Alpha-CIoU of Eq. (22), we add the power transformation only to the IoU term so as to simplify the computation. The pseudocode of the single-group bbox regression simulation experiment is shown in Algorithm 1.

Fig. 3

Use IoU, GIoU, DIoU, CIoU, SIoU, WIoU, Diag-IoU, MIoU, Alpha-CIoU, Dice loss, and N-CIoU with \(n=1\), \(n=5\) and \(n=9\), respectively, as loss functions, and observe the regression process of the predicted box's center coordinates, width and height. In particular, SIoU has the slowest convergence for the same learning rate, and its iteration number must be set to T=1500 to observe the overall impact of SIoU loss on the regression

During the simulation, given a loss function, we use gradient descent to iteratively update the position of the predicted box. \(B^t=(x^t,y^t,w^t,h^t)\) is the predicted box after t iterations; \({\nabla }B^{t-1}\) is the gradient of the regression loss with respect to the predicted box \(B^{t-1}=(x^{t-1},y^{t-1},w^{t-1},h^{t-1})\) at iteration \(t-1\), and \(\eta\) is the learning step size. To ensure that each of the above regression losses can reach a converged state, so that the convergence time and accuracy of each regression can be compared precisely, we set the number of iterations to \(T=200\).
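This update loop can be sketched as a toy simulation. We assume the paper's (x, y, w, h) center parameterization and use central finite differences in place of an analytic gradient; the learning rate, step count and epsilon are illustrative choices of ours, not the paper's settings:

```python
def iou_xywh(b, g):
    """IoU of two boxes given as (center x, center y, width, height)."""
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    gx1, gy1, gx2, gy2 = g[0] - g[2] / 2, g[1] - g[3] / 2, g[0] + g[2] / 2, g[1] + g[3] / 2
    iw = max(0.0, min(bx2, gx2) - max(bx1, gx1))
    ih = max(0.0, min(by2, gy2) - max(by1, gy1))
    inter = iw * ih
    return inter / (b[2] * b[3] + g[2] * g[3] - inter)

def simulate(b0, g, lr=0.05, steps=300, eps=1e-4):
    """Gradient descent on the IoU loss via central finite differences."""
    b = list(b0)
    for _ in range(steps):
        grad = []
        for i in range(4):
            bp, bm = list(b), list(b)
            bp[i] += eps
            bm[i] -= eps
            loss_p = 1.0 - iou_xywh(bp, g)
            loss_m = 1.0 - iou_xywh(bm, g)
            grad.append((loss_p - loss_m) / (2 * eps))
        b = [bi - lr * gi for bi, gi in zip(b, grad)]  # B^t = B^{t-1} - eta * grad
    return b
```

Starting from an overlapping initial box, the predicted box drifts toward the target and the IoU rises; swapping the loss expression for an N-IoU or CIoU variant reproduces the kind of comparison run in Algorithm 1.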

Fig. 4

The regression of a single group of bboxes with different losses. a IoU loss, b GIoU loss, c DIoU loss, d CIoU loss, e SIoU, f WIoU, g Diag-IoU, h MIoU, i Alpha-CIoU loss, j Dice loss, k N-CIoU (\(n=5\)) and l N-CIoU (\(n=9\)). The green box is the target box, the black box is the initial predicted box, and the red boxes are the regressed predicted boxes at different iteration cycles. The first row shows the results after 10 iterations of regression with the different losses, and the second to fourth rows show the results after 50, 100 and 200 iterations, respectively. In particular, due to the slow convergence of SIoU, its bounding box regression iterations are set to 50, 500, 1000 and 1500

The regression simulation of a single sample is not enough to illustrate the performance of the N-IoU loss, and analyzing the bounding box regression process from model detection results is also very difficult. To reflect the performance of N-IoU loss at different distances, scales and aspect ratios, we use more complex simulation experiments. Figure 5a shows the experimental scheme we adopted. To fully cover the complex bbox situations that may arise during model training, we choose seven target boxes of area 1 with aspect ratios 1:4, 1:3, 1:2, 1:1, 2:1, 3:1 and 4:1, centered at (3, 3). The centers of the predicted boxes are distributed over N points in a square region centered at (3, 3) with side length 4, and the areas of the predicted boxes at each point are set to 0.5, 0.67, 0.75, 1, 1.33, 1.5 and 2. The predicted boxes of each area also take the same seven aspect ratios as the target boxes.

Fig. 5

a Large-scale bbox regression simulation experiment. b and c Regression errors when N is 10 and 1000 in large-scale bbox regression simulation experiments

In the experiment, N is set to 10 (sparsely distributed predicted boxes) and 1000 (densely distributed predicted boxes). Considering that Dice loss is a special case of N-IoU, that Alpha-CIoU and N-CIoU both conform to the properties of the proposed new measure, and that IoU, GIoU, DIoU and CIoU are widely used regression loss functions, only the regression errors of IoU, GIoU, DIoU, CIoU, Alpha-CIoU and N-CIoU are compared in our experiment. The pseudocode of the large-scale bbox regression simulation experiment is shown in Algorithm 2.
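A minimal self-contained sketch of the procedure that Algorithm 2 formalizes is given below, assuming the final error is accumulated as the \(\ell _1\) distance between each converged predicted box and its target box; the loss function is passed in as a callable, and the finite-difference gradient is an illustrative simplification.

```python
import numpy as np

def l1_error(box, target):
    """Sum of absolute coordinate differences between two (x, y, w, h) boxes."""
    return float(np.abs(np.asarray(box) - np.asarray(target)).sum())

def total_regression_error(preds, targets, loss_fn, eta=0.05, T=200, eps=1e-6):
    """Regress every predicted box toward every target box for T gradient steps
    and accumulate the final l1 errors, as in the large-scale simulation."""
    total = 0.0
    for tgt in targets:
        for b0 in preds:
            b = np.array(b0, dtype=float)  # fresh copy per pair
            for _ in range(T):
                g = np.zeros(4)
                for i in range(4):  # central-difference gradient per coordinate
                    d = np.zeros(4)
                    d[i] = eps
                    g[i] = (loss_fn(b + d, tgt) - loss_fn(b - d, tgt)) / (2 * eps)
                b -= eta * g
            total += l1_error(b, tgt)
    return total
```

Running this once per compared loss yields one aggregate error per loss, which is how the curves in Fig. 5b, c can be read.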

5.2 Simulation results

Figure 3 shows how the predicted box evolves in the single-group bbox regression simulation experiment. The iteration curves show that SIoU loss converges the slowest and its regression accuracy is also poor; IoU, GIoU and WIoU bring a small improvement in convergence speed; DIoU, CIoU, Diag-IoU and MIoU converge faster but their convergence accuracy remains poor. Only the final convergence values of DIoU, CIoU and Diag-IoU oscillate periodically; those of the other losses oscillate irregularly. With Alpha-CIoU and Dice loss, the oscillation of the final convergence value is suppressed, and the N-IoU loss suppresses this oscillation best.

Table 1 Evaluation results of Faster R-CNN on PASCAL VOC 2007 test data, parameter \(n=9\) in N-IoU series loss

The experiments in Fig. 3 also show that although a larger value of n makes N-CIoU optimize better than the other losses, it lengthens the time required for the regression to converge. When \(n=9\), the performance of \({\mathcal {L}}_{N\text{- }CIoU}\) is optimal: both the convergence speed and the accuracy of the regression process meet expectations.

Figure 4 shows the regression of the predicted boxes at selected iteration cycles in the single-group bbox regression simulation experiment. By comparison, N-CIoU with \(n=9\) converges fastest and achieves the highest accuracy.

Figure 5 shows the results of the large-scale bbox regression simulation experiment, and the partially enlarged area shows the final regression error of several of the losses. With \(7\times 7\times 7\times 10\) random predicted boxes, the final error of CIoU loss is 8.07, while that of N-CIoU with parameter \(n=9\) is 1.09. With \(7\times 7\times 7\times 1000\) random predicted boxes, the final errors of the two are 813.3 and 108.5, respectively. The experiments thus show that N-CIoU reduces the regression error by about 86.6\(\%\) compared to CIoU in bbox regression. At the cost of a negligible loss in regression speed, the N-CIoU loss proposed in this paper achieves higher regression accuracy.

6 Performance evaluation experiment

In this section, on the mainstream object detection datasets PASCAL VOC [16] and MS COCO [17], we use the proposed N-CIoU in common object detection models to evaluate the performance of this novel loss. The main algorithms we compare are the one-stage YOLOv3 and SSD and the two-stage Faster R-CNN. In addition, we also evaluate its performance on the lightweight version of YOLOX [13], on YOLOv8(s) [14] and on DETR [15].

6.1 Training and evaluation

The MS COCO 2017 dataset is a relatively difficult and complex large-scale standard dataset for object detection evaluation, comprising more than 118K training images and 5K evaluation images. Since the labels of the test data are not public, COCO 2017-val is used as the evaluation data in the experiments. The PASCAL VOC dataset is one of the mainstream object detection datasets. The experiments train on VOC07+12 (the union of VOC 2007 trainval and VOC 2012 trainval), which contains 20 categories and 16,551 images, and performance is evaluated on the VOC 2007 test set with 4952 images. The models trained and evaluated on MS COCO data are YOLOv3, SSD, YOLOX(nano), YOLOv8(s) and DETR; the model trained and evaluated on PASCAL VOC data is Faster R-CNN.

To simplify the experiments, we use only the N-GIoU, N-CIoU, GIoU and CIoU losses to train the models, using the stochastic gradient descent (SGD) optimizer. The initial learning rate is 0.1, the momentum is 0.9 and the weight decay is 4e-5. YOLOv3, YOLOv8 and DETR are trained for 200 epochs with a mini-batch size of 64; SSD is trained for 200 epochs with a mini-batch size of 128; YOLOX is trained for 300 epochs with a mini-batch size of 128; and Faster R-CNN is trained for 300 epochs with a mini-batch size of 128.

The performance metric for evaluation on MS COCO 2017-val is mAP0.5:0.95: the IoU threshold is varied from 50\(\%\) to 95\(\%\) in steps of 5\(\%\), and the mean of the AP values at these thresholds is computed. The performance metrics for evaluation on the PASCAL VOC 2007 test data are AP, the average of mAP over the same 10 IoU thresholds, i.e., AP = (AP50 \(+\) AP55 \(+\)... \(+\) AP95)/10, and AP75 (mAP at IoU = 0.75).
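Both metrics reduce to the same threshold averaging; a small sketch (the function name and dictionary input format are ours):

```python
# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap):
    """Mean of per-threshold AP values, e.g. ap = {0.50: 0.61, 0.55: 0.58, ...}."""
    return sum(ap[t] for t in THRESHOLDS) / len(THRESHOLDS)
```

AP75 is then simply `ap[0.75]`, the single entry at the 75\(\%\) threshold.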

6.2 Faster R-CNN on PASCAL VOC

The Faster R-CNNFootnote 1 model is a recent PyTorch implementation; the input image size is \(600\times 600\), and the backbone network is ResNet50-FPN. Table 1 reports the AP and AP75 metrics over all targets of the model on the VOC 2007 test set. The evaluation results in Table 1 show that, for Faster R-CNN, AP and AP75 improve by 0.58 and 0.82, respectively, when using N-GIoU loss compared to GIoU loss, and by 1.11 and 1.26 when using N-CIoU loss compared to CIoU loss.

Table 2 Performance evaluation of YOLOv3 and SSD on MS COCO 2017-val data, parameter \(n=9\) in N-IoU series loss

6.3 YOLOv3 and SSD on MS COCO

The YOLOv3Footnote 2 and SSDFootnote 3 models used in the experiments are recent PyTorch implementations. Table 2 reports the detection AP values of these two models for large, medium and small objects over all evaluation images under each loss function, together with mAP0.5:0.95 over the entire evaluation set, giving a quantitative comparison of the inference accuracy of YOLOv3 and SSD on MS COCO 2017-val. The experiments show that, for both YOLOv3 and SSD, N-GIoU brings a clear performance improvement over GIoU, as does N-CIoU over CIoU. Compared with GIoU loss, the mAP0.5:0.95 of YOLOv3 and SSD improves by 0.91 and 1.08 when using N-GIoU loss, and by 1.17 and 1.28 when using N-CIoU loss compared to CIoU loss.

Table 3 Evaluation results of YOLOX(nano) lightweight version, YOLOv8(s) and DETR on MS COCO 2017-val, parameter \(n=9\) in N-IoU series loss

6.4 YOLOX(nano), YOLOv8(s) and DETR on MS COCO

The YOLOX(nano)Footnote 4, YOLOv8(s)Footnote 5 and DETRFootnote 6 models are also recent PyTorch implementations that incorporate the important improvements of YOLOX [13], YOLOv8 [14] and DETR [15]. Table 3 reports the large-, medium- and small-target AP values over all images and the mAP0.5:0.95 over all images in COCO 2017-val for these models. The evaluation results in Table 3 show that mAP0.5:0.95 improves by 0.87 for YOLOX(nano) when using N-GIoU compared to GIoU loss, and by 0.91 and 0.82 for YOLOv8(s) and DETR, respectively. When using N-CIoU compared to CIoU loss, the three models improve by 1.01, 1.21 and 1.18, respectively. This comparison shows that N-CIoU performs particularly well on lightweight models.

In the above series of experiments, we find two phenomena in the application of the N-IoU measure: (1) when an N-IoU-based loss function is used to train a model, convergence is significantly slower than with an IoU-based loss; (2) when the N-IoU loss is used for the YOLOv8 model, the performance improvement on small-object detection is slightly higher than for the other models. This indicates that measures must be taken to improve both the convergence speed of the model under the N-IoU measure and its accuracy on small targets. A feasible scheme is to combine the N-IoU-based loss function with Focal Loss or GFocal Loss, as in the loss function design of YOLOv8.

7 Conclusion

In this paper, we propose a new bounding box regression measure that outperforms and can replace the IoU measure; we define the general properties of this new measure and propose the N-IoU loss family. This family of losses is flexible and simple to compute, and the optimal regression loss for a given application scenario can be obtained by tuning the parameter n.

We demonstrate the correctness of the proposed theory through mathematical reasoning and analysis of previous work, and we design simulation experiments to demonstrate the superiority of N-IoU. From comparative experiments on multiple detector models and standard datasets, four conclusions can be drawn: (1) N-IoU performs better than the commonly used IoU measure; (2) it better optimizes the regression learning of highly overlapping instances; (3) it can be applied more widely to object detection and image segmentation; and (4) it has better optimization ability for lightweight detectors.

In future research, we will evaluate and improve the proposed N-IoU in key application domains, such as small-target scenarios [51]. We will further enrich the theory of deriving novel loss functions from IoU composite modulation, explore other functional forms that better match its characteristics, and study new generalization formulas for other metric-derived loss functions, such as boundary-based losses, whose representative works include boundary (BD) loss [52] and Hausdorff distance (HD) loss, or hybrid losses such as combo loss [53] and exponential logarithmic loss [54].