A non-negative feedback self-distillation method for salient object detection

Self-distillation methods utilize Kullback-Leibler divergence (KL) loss to transfer the knowledge from the network itself, which can improve the model performance without increasing computational resources and complexity. However, when applied to salient object detection (SOD), it is difficult to effectively transfer knowledge using KL. In order to improve SOD model performance without increasing computational resources, a non-negative feedback self-distillation method is proposed. Firstly, a virtual teacher self-distillation method is proposed to enhance the model generalization, which achieves good results in pixel-wise classification task but has less improvement in SOD. Secondly, to understand the behavior of the self-distillation loss, the gradient directions of KL and Cross Entropy (CE) loss are analyzed. It is found that KL can create inconsistent gradients with the opposite direction to CE in SOD. Finally, a non-negative feedback loss is proposed for SOD, which uses different ways to calculate the distillation loss of the foreground and background respectively, to ensure that the teacher network transfers only positive knowledge to the student. The experiments on five datasets show that the proposed self-distillation methods can effectively improve the performance of SOD models, and the average Fβ is increased by about 2.7% compared with the baseline network.


INTRODUCTION
Salient object detection (SOD) aims to estimate the visually salient region of an image and is an important computer vision task (Ali et al., 2019). In recent years, with the rapid development of deep neural networks, the performance of SOD has greatly improved. However, high-performance SOD networks usually require large network structures and substantial computing resources.
To address the problem of overly large network structures, Hinton, Vinyals & Dean (2015) proposed knowledge distillation to improve the performance of lightweight networks. Knowledge distillation uses the knowledge transferred from a teacher network to guide the training of a student network, which improves the performance of the lightweight student. Traditional knowledge distillation methods need to train a large, well-performing teacher network in advance; the lightweight student network then improves by learning the knowledge transferred from the teacher. However, the pre-trained teacher network still has a complex structure, so Zhang et al. (2019a) proposed self-distillation to solve this problem. Self-distillation methods do not require an independent teacher network and improve performance by distilling knowledge from the student network itself.
To solve the SOD problem, many researchers (Tang, Li & Zou, 2020; Zhang et al., 2019b) apply knowledge distillation but pay less attention to self-distillation. Current research on self-distillation mainly focuses on classification tasks such as image classification and semantic segmentation (Ji et al., 2021). To improve the performance of SOD without increasing the network size, we introduce self-distillation into SOD and propose a non-negative feedback self-distillation method. The output of a classification task is a category probability distribution, whereas the output of SOD is a single category probability (Pang et al., 2022). Because the outputs differ, the self-distillation methods used for classification tasks may not be suitable for SOD.
In classification tasks, the knowledge distillation structure usually uses Cross Entropy (CE) loss and Kullback-Leibler divergence (KL) loss to guide the student network training (Kim et al., 2021). CE generates classification loss, which guides the student network to match the ground truth of training samples. KL produces distillation loss, which guides the student to match the prediction probability of the teacher network. KL divergence can effectively measure the similarity between two distributions, and drive the student network to imitate the performance of the teacher network. However, when KL divergence formula is directly applied to SOD as distillation loss function of the output layer, it is found that the signs of the distillation loss and its derivative will be opposite to CE loss. That is, the optimization direction is not consistent between KL loss and CE loss. The negative feedback will be generated and attenuate the performance of SOD network.
To solve this problem, we propose a new non-negative feedback self-distillation method. First, inspired by regularization, we construct a pixel-wise virtual teacher model based on the ground truth. We find that the virtual teacher model achieves good performance in classification tasks but is not suitable for SOD. Then, we analyze KL divergence and identify the cause of the negative feedback. Finally, drawing on the ideas of KL and Focal Loss (Lin et al., 2017), we propose a non-negative feedback distillation loss, which calculates the distillation losses of the foreground and background in different ways and is suitable for SOD. The proposed loss drives the transfer of the teacher's knowledge to the student network. The main contributions of this paper are as follows:
(1) The reason why the KL divergence formula is not a suitable distillation loss for SOD is analyzed. The optimization directions of KL and CE are inconsistent, which affects the network training.
(2) An improved distillation loss with non-negative feedback is proposed, which modifies the formulation of KL to eliminate the negative feedback and keeps the same optimization direction as CE.
(3) Experiments on five datasets show that the self-distillation architecture can be applied to SOD. Our self-distillation method improves the average E by about 3.8% compared with the baseline network. Compared with other self-distillation methods, our method obtains the best results, improving the average Fβ by about 2% over the second-best method (BYOT).

Salient object detection
Traditional SOD methods mainly use color (Cheng et al., 2014b), boundary priors (Zhu et al., 2014), and sparsity (Li, Sun & Yu, 2015) to obtain the saliency map. These methods capture detail information well but struggle to obtain high-quality semantic information. Recently developed deep learning methods can obtain both detail and semantic information, so more and more scholars (Zhao et al., 2020; Wu, Su & Huang, 2019; Mao et al., 2021) use deep learning to solve the SOD problem. One method divided the feature map into a body map and a detail map and refined the two parts separately. Another adopted feature interweaved aggregation, self-refinement, head attention and global context flow modules to construct a global context-aware progressive aggregation network. Zhao et al. (2020) designed a gated dual-branch structure in which collaboration between features from different layers improves the discriminability of the entire network. Another approach mixed features from different layers through cross feature modules and cascaded feedback decoders to generate better saliency maps. Wu, Su & Huang (2019) discarded the features from shallow layers to improve computing efficiency and refined the features from deep layers to improve their representation ability. A further method introduced initial prediction, side-output residual learning and top-down reverse attention to address the problem of complex architectures. Mao et al. (2021) used a Swin Transformer structure to mix multi-layer features and an attention mechanism to strengthen feature representation. However, these methods usually have large network structures and are difficult to apply directly in practice.
Recently, Zhang et al. (2019b) applied knowledge distillation (KD) to SOD: they reduced the number of channels to construct the student network and adopted a multi-scale scheme to transfer knowledge from the teacher to the student. In addition, Piao et al. (2020) introduced cross-modal distillation for RGB-D based SOD, distilling the depth information through an adaptive distiller. However, little attention has been paid to self-distillation for SOD.

Self-distillation
Ordinary knowledge distillation requires an additional teacher network to guide the learning of the student network. Self-distillation methods do not need an additional teacher network; the student network learns knowledge from itself. Self-distillation methods mainly construct auxiliary branches to transfer knowledge to the student network. Ji et al. (2021) used an auxiliary teacher network to refine the soft-label and feature-map knowledge, which better preserves local information. Hou et al. (2019) let shallow features learn the deep feature expression to strengthen the overall feature expression ability. Zhang et al. (2019a) divided the original network into several shallow networks according to the network structure and distilled the separated shallow networks individually. Another method used a consistency strategy to match the knowledge extracted from auxiliary branches. It has also been shown that self-distillation is a special label smoothing regularization method. Yun et al. (2020) used the predicted distribution of training samples as distillation knowledge. These works, including Yun et al. (2020), therefore regard self-distillation as a special regularization method.
At the same time, Xu & Liu (2019) and Lee, Hwang & Shin (2020) considered that self-distillation is a special data augmentation method. Xu & Liu (2019) used the distorted versions of training samples as distillation knowledge. Lee, Hwang & Shin (2020) used the transformations of training samples as distillation knowledge.
The regularization method does not require complex branch structures and matching strategies, and can further simplify the scale of network parameters. We use pixel-wise regularized distribution to construct a self-distillation framework.

METHOD
In this section, we firstly introduce a virtual teacher self-distillation architecture; secondly analyze the distillation loss calculated by KL in classification task, and compare it with CE; thirdly analyze the distillation loss calculated by KL formula in SOD; finally propose our non-negative feedback distillation loss.
As the principles of multi-class and binary classification tasks are the same, we take the binary classification task as an example for analysis. Given a training sample $X = \{x_i \mid i = 1, \ldots, W \times H\}$, $x_i$ is the $i$-th pixel in the sample, $W$ and $H$ are the width and height of the sample, and $Y = \{y_i \mid i = 1, \ldots, W \times H\}$ is the corresponding ground truth. For ease of description, $y_i = 1$ means the pixel belongs to the object category and $y_i = 0$ means the pixel belongs to the background category. The output for the sample is $p^s$ in the student network and $p^t$ in the teacher network.

Virtual teacher self-distillation architecture
Prior work used the regularized category probability as virtual knowledge to construct a teacher model and manually set the output probability of the teacher network. Based on this, we use the regularized probability distribution to construct a pixel-wise virtual teacher model. Unlike that work, which addressed multi-class classification problems, we extend the approach to pixel-wise segmentation. The self-distillation structure is shown in Fig. 1. The output probability of the teacher network is as follows:

$$p^t(a_i) = \begin{cases} \mu, & a_i = y_i \\ (1 - \mu)/(K - 1), & a_i \neq y_i \end{cases}$$

where $K$ is the total number of categories ($K = 2$ in SOD), $y_i$ is the correct label, $a_i$ is the predicted label, and $\mu$ is the predicted probability of correct pixel classification. Usually $\mu \geq 0.9$ is set to ensure that the probability of correct pixel classification is far greater than that of wrong classification; we set $\mu = 0.99$. When the pixel is labeled foreground, the output probability of the virtual teacher is 0.99; when the pixel is labeled background, it is 0.01. If the value of $\mu$ is small, the predicted probability of the student will exceed that of the teacher on easy, well-classified pixels. On these pixels the teacher cannot transfer knowledge to the student and has a negative effect on network training. We therefore set $\mu$ to 0.99, a choice that has also been validated experimentally in prior work. We use the backbone network of F3Net as the student network (the right box in Fig. 1) and build a self-distillation learning framework on this basis. The virtual teacher provides correct knowledge to the student network and guides its optimization. Unlike hard-label learning, which expects the outputs of the teacher and student to be identical, this self-distillation method expects the output distribution of the student to fit that of the teacher. The virtual teacher self-distillation method thus provides more distribution information while making the student results match the teacher.
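The virtual teacher described above can be illustrated with a minimal sketch for the binary case, assuming the ground truth is a 2-D grid of 0/1 labels. The function name `virtual_teacher` and the nested-list representation are hypothetical and chosen only for illustration; only the values µ = 0.99 (foreground) and 1 − µ = 0.01 (background) come from the text.

```python
def virtual_teacher(mask, mu=0.99):
    """Per-pixel foreground probability of a hypothetical virtual teacher.

    mask: 2-D list of ground-truth labels (1 = foreground, 0 = background).
    With two classes, the teacher assigns mu to the correct class, so the
    foreground probability is mu on foreground pixels and 1 - mu elsewhere.
    """
    if not 0.5 < mu < 1.0:
        raise ValueError("mu must favour the correct class (0.5 < mu < 1)")
    return [[mu if y == 1 else 1.0 - mu for y in row] for row in mask]


# Illustrative 2 x 3 ground-truth mask (not taken from any dataset).
mask = [[0, 1, 1],
        [0, 0, 1]]
teacher = virtual_teacher(mask)
```

Because the teacher is computed directly from the ground truth, this sketch adds no trainable parameters, which is in line with the stated goal of not increasing the network size.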
For a training set with $N$ samples, $X_j$ is the $j$-th sample, $H$ and $W$ are the height and width of the sample, and $Y_j$ is the corresponding ground truth. $W_M = \{W_m \mid m = 1, \ldots, L\}$ represents the learnable weight matrices of an $L$-layer neural network. The training goal of the neural network is to learn a mapping function $f(W_m; X): X \rightarrow Y$. The most common training method is Empirical Risk Minimization (Wang, 2021). The neural network parameters $W_m$ can be adjusted by optimizing the following objective.

$$\mathop{\arg\min}_{W_M} \; L_{mt}$$
where $L_{mt}$ is the total loss over all training samples. In the self-distillation framework for SOD, the loss function is determined by the pixel-wise CE and KL losses.
Here, $p_{ij}^s$ is the prediction probability of the student network for the $i$-th pixel in the $j$-th sample; $p_{ij}^t$ is the prediction probability of the corresponding teacher network; and $y_{ij}$ is the corresponding annotated label, which is 1 when the pixel belongs to the foreground and 0 when it belongs to the background. We find that this virtual teacher self-distillation architecture improves model performance for classification tasks but achieves little improvement for SOD. On easily classified datasets with significant differences between foreground and background in particular, the model performance can hardly be improved. To apply the virtual teacher self-distillation method to SOD, we analyze the KL loss and propose a new loss to replace it.

Distillation loss analysis in classification task
In classification tasks, the self-distillation structure usually uses the Cross Entropy loss ($L_{CE}$) and the Kullback-Leibler divergence loss ($L_{KL}$) to guide the student network training (Kim et al., 2021); the sample loss $L$ combines the two. In a binary classification task, Cross Entropy (CE) is the Binary Cross Entropy. Hossain, Betts & Paplinski (2021) showed that when the pixel belongs to the object, the CE loss is positive and the loss derivative is negative; when the pixel belongs to the background, the CE loss is positive and the loss derivative is positive. The formulas of the CE loss and its derivative are shown in Table 1, the loss curves in Fig. 2, and the loss derivative curves in Fig. 3. The optimization direction of a good distillation loss should be consistent with CE. If it is not, negative feedback is generated, which disturbs the network optimization and leads to poor performance.
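The sign pattern of the binary CE loss stated above can be checked numerically. The sketch below assumes the standard per-pixel binary cross entropy, $-\log p$ for an object pixel and $-\log(1 - p)$ for a background pixel, where $p$ is the predicted object probability; the function names and the probe values are illustrative only.

```python
import math


def bce_loss(p, y):
    """Binary cross entropy for predicted object probability p and label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def bce_grad(p, y):
    """Derivative of the binary cross entropy with respect to p."""
    return -1.0 / p if y == 1 else 1.0 / (1.0 - p)


# Probe a few predicted probabilities strictly inside (0, 1).
probs = [0.1, 0.3, 0.5, 0.7, 0.9]
fore = [(bce_loss(p, 1), bce_grad(p, 1)) for p in probs]  # object pixels
back = [(bce_loss(p, 0), bce_grad(p, 0)) for p in probs]  # background pixels

for p, (loss, grad) in zip(probs, fore):
    print(f"object     p={p:.1f} loss={loss:.3f} grad={grad:.3f}")
for p, (loss, grad) in zip(probs, back):
    print(f"background p={p:.1f} loss={loss:.3f} grad={grad:.3f}")
```

For every probed value the object-pixel loss is positive with a negative derivative, and the background-pixel loss is positive with a positive derivative, which is the reference direction a distillation loss is expected to share.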
KL divergence $L_{KL}$ is calculated by Eq. (7). As the temperature parameter does not affect the optimization direction of the loss and its derivative, its influence is not considered.
where $p_i^s$ is the output of the student network, a $1 \times K$ dimensional array; $K$ is the number of categories, which is 2 in a binary classification task; and $p_i^t$ is the output of the corresponding teacher network.

Figure 2 Relationship between pixels' predicted probability and loss. The horizontal coordinate represents the pixel predicted probability; the vertical coordinate represents the pixel loss value. CEFore and CEBack represent the Cross Entropy loss; ClassFore and ClassBack represent the distillation loss in the binary classification task; SalientFore and SalientBack represent the distillation loss in SOD; OurFore and OurBack represent the non-negative feedback distillation loss. DOI: 10.7717/peerjcs.1435/fig-2
where $p1_i^s, p2_i^s, \ldots, pK_i^s$ are the output probabilities predicted by the softmax function, and their sum is 1. In a binary classification task, the output of the student network can be expressed as $[\,p_i^s,\; 1 - p_i^s\,]$, where $p_i^s$ is the probability that the student network predicts pixel $i$ as the object. The KL loss of pixel $i$ is calculated as follows:

$$L_{KL}^i = p_i^t \log\frac{p_i^t}{p_i^s} + (1 - p_i^t)\log\frac{1 - p_i^t}{1 - p_i^s}$$

where $p_i^t$ is the probability that the teacher network predicts pixel $i$ as the object, with $p_i^t \in (0, 1)$. The derivative of $L_{KL}^i$ is as follows:

$$\frac{dL_{KL}^i}{dp_i^s} = \frac{p_i^s - p_i^t}{p_i^s\,(1 - p_i^s)}$$

In self-distillation frameworks, most works construct auxiliary teacher branches to generate refined knowledge, or adopt deep-level knowledge to guide the shallow-level training. The purpose of these works is to construct a teacher whose performance is better than that of the student network (Hinton, Vinyals & Dean, 2015). When the pixel belongs to the object, the output probability of the teacher network is greater than that of the student network, that is, $p_i^t \geq p_i^s$. So $dL_{KL}^i/dp_i^s \leq 0$, and $L_{KL}^i$ is monotonically decreasing over this range. Since $p_i^t \geq p_i^s$, the maximum value of $p_i^s$ is $p_i^t$, where $L_{KL}^i(p_i^s = p_i^t) = 0$; the minimum value of $L_{KL}^i$ is therefore 0, so $L_{KL}^i$ is greater than 0. Therefore, when the pixel belongs to the object, the distillation loss calculated by KL divergence is greater than 0 and the loss derivative is less than 0. For object pixels, the optimization directions of the distillation loss and its derivative are consistent with CE.

Figure 3 Relationship between pixels' predicted probability and loss derivative. The horizontal coordinate represents the pixel predicted probability; the vertical coordinate represents the pixel loss derivative value. CEFore and CEBack represent the Cross Entropy loss; ClassFore and ClassBack represent the distillation loss in the binary classification task; SalientFore and SalientBack represent the distillation loss in SOD; OurFore and OurBack represent the non-negative feedback distillation loss. DOI: 10.7717/peerjcs.1435/fig-3
When the pixel belongs to the background, the output probability of the teacher network is less than that of the student network, that is, $p_i^t \leq p_i^s$. So $dL_{KL}^i/dp_i^s \geq 0$, and $L_{KL}^i$ is monotonically increasing over this range. Since $p_i^t \leq p_i^s$, the minimum value of $p_i^s$ is $p_i^t$, where $L_{KL}^i(p_i^s = p_i^t) = 0$; the minimum value of $L_{KL}^i$ is therefore 0, so $L_{KL}^i$ is greater than 0. Therefore, when the pixel belongs to the background, the distillation loss calculated by KL divergence is greater than 0 and the loss derivative is greater than 0. For background pixels, the optimization directions of the distillation loss and its derivative are consistent with CE.
For a more intuitive presentation, we assume that the teacher network output probability $p_i^t$ is 0.99 for object pixels and 0.01 for background pixels. The resulting distillation loss and loss derivative formulas are shown in Table 2, the loss curves in Fig. 2, and the loss derivative curves in Fig. 3.
From Figs. 2 and 3, it can be seen that the signs of the distillation loss and its derivative are consistent with CE. As the optimization direction of the distillation loss calculated by KL divergence is consistent with CE, the distillation loss can guide the student network training well in the binary classification task.
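The classification-case analysis can be reproduced with a short numerical check. The sketch assumes the standard two-class KL divergence between the teacher distribution $[p^t, 1 - p^t]$ and the student distribution $[p^s, 1 - p^s]$, whose derivative with respect to $p^s$ simplifies to $(p^s - p^t) / (p^s (1 - p^s))$; the function names are hypothetical and the teacher values 0.99 and 0.01 follow the text.

```python
import math


def kl_binary(p_s, p_t):
    """Two-class KL divergence KL(teacher || student) for one pixel."""
    return (p_t * math.log(p_t / p_s)
            + (1.0 - p_t) * math.log((1.0 - p_t) / (1.0 - p_s)))


def kl_binary_grad(p_s, p_t):
    """Analytic derivative of kl_binary with respect to the student output."""
    return (p_s - p_t) / (p_s * (1.0 - p_s))


student_probs = [0.1, 0.5, 0.9]

# Object pixels: teacher outputs 0.99, student is below the teacher.
obj = [(kl_binary(p, 0.99), kl_binary_grad(p, 0.99)) for p in student_probs]
# Background pixels: teacher outputs 0.01, student is above the teacher.
bkg = [(kl_binary(p, 0.01), kl_binary_grad(p, 0.01)) for p in student_probs]

# Central finite difference as a sanity check of the analytic derivative.
h = 1e-6
numeric = (kl_binary(0.5 + h, 0.99) - kl_binary(0.5 - h, 0.99)) / (2 * h)
analytic = kl_binary_grad(0.5, 0.99)
```

In this sketch the loss is positive in both regions, with a negative derivative for object pixels and a positive derivative for background pixels, matching the signs of CE.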

Distillation loss analysis in salient object detection
Similar to the classification task, we also use the CE loss ($L_{CE}$) and a distillation loss ($L_{KD}$) to guide the student network training in SOD; the sample loss $L$ combines the two. In SOD, CE is also the binary CE. That is, when the pixel belongs to the foreground, the CE loss is positive and the loss derivative is negative; when the pixel belongs to the background, the CE loss is positive and the loss derivative is positive. When the KL formula is used to calculate the distillation loss in SOD, $y_i = 0$ means the pixel belongs to the background, $y_i = 1$ means the pixel belongs to the foreground, and $p_i^s$ and $p_i^t$ are the outputs of the student and teacher networks, which are 1-dimensional values. The KL loss of pixel $i$ is calculated as follows:

$$L_{KL}^i(p_i^s, p_i^t) = p_i^s \log\frac{p_i^s}{p_i^t}$$

The derivative of $L_{KL}^i$ is as follows:

$$\frac{dL_{KL}^i}{dp_i^s} = \log\frac{p_i^s}{p_i^t} + 1$$

In self-distillation frameworks, most works construct auxiliary teacher branches to generate refined knowledge, or adopt deep-level knowledge to guide the shallow-level training. The purpose of these works is to construct a teacher whose performance is better than that of the student network (Hinton, Vinyals & Dean, 2015). When the pixel belongs to the foreground, the output probability of the teacher network is greater than that of the student network, $p_i^t \geq p_i^s$, so $L_{KL}^i(p_i^s, p_i^t)$ is less than 0. The sign of the distillation loss is then inconsistent with CE. When $0 < p_i^s/p_i^t \leq 1/e$, $dL_{KL}^i/dp_i^s \leq 0$, and the sign of the loss derivative is consistent with CE. When $1/e < p_i^s/p_i^t \leq 1$, $dL_{KL}^i/dp_i^s \geq 0$, and the sign of the loss derivative is inconsistent with CE. Therefore, in the foreground, the optimization direction of the distillation loss will be inconsistent with CE, resulting in negative feedback that harms the student network performance.
When the pixel belongs to the background, the output probability of the teacher network is less than that of the student network, $p_i^t \leq p_i^s$, so $L_{KL}^i(p_i^s, p_i^t)$ is greater than 0 and $dL_{KL}^i/dp_i^s$ is greater than 1. The optimization directions of the distillation loss and its derivative are then consistent with CE. For a more intuitive presentation, we assume that the teacher network output probability $p_i^t$ is 0.99 for foreground pixels and 0.01 for background pixels. The resulting distillation loss and loss derivative formulas are shown in Table 3, the loss curves in Fig. 2, and the loss derivative curves in Fig. 3.
Combining Figs. 2 and 3 with the above analysis, it can be seen that when the KL formula is used to calculate the distillation loss in SOD, the optimization directions of the distillation loss and its derivative are inconsistent with CE in the foreground. The performance improvement of the student network is therefore limited, and directly using the KL formula to calculate the distillation loss in SOD is flawed.
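The negative feedback in the foreground can be made concrete with a small numerical sketch. It assumes the single-output form $p^s \log(p^s / p^t)$ with derivative $\log(p^s / p^t) + 1$, which is the form consistent with the $1/e$ threshold and the "greater than 1" background derivative stated above; the function names and probe values are illustrative.

```python
import math


def kl_single(p_s, p_t):
    """KL formula applied to a single-valued (1-dimensional) output."""
    return p_s * math.log(p_s / p_t)


def kl_single_grad(p_s, p_t):
    """Derivative of kl_single with respect to the student output p_s."""
    return math.log(p_s / p_t) + 1.0


P_T_FORE, P_T_BACK = 0.99, 0.01  # teacher outputs assumed in the text

# Foreground, ratio p_s / p_t below 1/e: loss < 0, derivative < 0.
low = (kl_single(0.2, P_T_FORE), kl_single_grad(0.2, P_T_FORE))
# Foreground, ratio above 1/e: loss < 0, derivative > 0 (opposite to CE).
high = (kl_single(0.6, P_T_FORE), kl_single_grad(0.6, P_T_FORE))
# Background: loss > 0, derivative > 1 (same direction as CE).
back = (kl_single(0.5, P_T_BACK), kl_single_grad(0.5, P_T_BACK))

# The derivative changes sign where p_s / p_t equals 1 / e.
threshold = P_T_FORE / math.e
```

Under these assumptions the foreground loss is negative everywhere below the teacher output, and for student outputs above roughly 0.364 the derivative is positive, so gradient descent would push a foreground prediction downward, against the CE term.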

Non-negative feedback distillation loss (NKL)
In order to transfer the knowledge, the optimization direction of distillation loss should be consistent with CE. In SOD, when the pixel belongs to the foreground, the loss is greater than 0, and the loss derivative is less than 0; when the pixel belongs to the background, the loss is greater than 0, and the loss derivative is greater than 0.
Inspired by KL, CE and Focal loss (Lin et al., 2017), we propose a non-negative feedback distillation loss, which uses different formulas for the foreground and background distillation losses. The loss is given by Eq. (16), in which $y_i = 0$ means the pixel belongs to the background and $y_i = 1$ means the pixel belongs to the foreground; $\alpha$ is a hyperparameter greater than 0 and less than 1, which is determined by experiments and set to 0.3 here. When the pixel belongs to the foreground, the distillation loss of pixel $i$ is denoted $L_{NKL}^{i+}$, and $1 \geq p_i^t \geq p_i^s \geq 0$. Therefore, $L_{NKL}^{i+}(p^s, p^t)$ is greater than 0 and $dL_{NKL}^{i+}/dp_i^s$ is less than 0. That is, when the pixel belongs to the foreground, our distillation loss is greater than 0 and the loss derivative is less than 0, so the optimization directions of our distillation loss and its derivative are consistent with CE.
When the pixel belongs to the background, the distillation loss of pixel $i$ is denoted $L_{NKL}^{i-}$, and $0 \leq p_i^t \leq p_i^s \leq 1$. Therefore, $L_{NKL}^{i-}(p^s, p^t)$ is greater than 0 and $dL_{NKL}^{i-}/dp_i^s$ is greater than 0. That is, when the pixel belongs to the background, our distillation loss is greater than 0 and the loss derivative is greater than 0, so the optimization directions of our distillation loss and its derivative are consistent with CE. For a more intuitive presentation, we assume that the teacher network output probability $p_i^t$ is 0.99 for foreground pixels and 0.01 for background pixels. The resulting distillation loss and loss derivative formulas are shown in Table 4, the loss curves in Fig. 2, and the loss derivative curves in Fig. 3.
As the optimization direction of our non-negative feedback distillation loss is consistent with CE in both the foreground and the background, it can better guide the student network training in SOD. So, in the virtual teacher self-distillation architecture, we replace Eq. (4) with Eq. (21).
To verify the generality of our method, we also apply our non-negative feedback distillation loss to other self-distillation frameworks. The experiments in 'Comparison with recent self-distillation methods' show that our non-negative feedback loss function works in other self-distillation methods and works better in our virtual teacher method. The main reason is that, in other self-distillation methods, the teacher network may perform worse than the student on easy, well-classified pixels. With the non-negative feedback loss function, a worse-performing teacher cannot guide the student training, which limits the improvement of the student network. In the virtual teacher method, by contrast, the teacher always performs better than the student.

Model
Table 4 The formulas of non-negative feedback distillation loss and loss derivative, listed for foreground pixels ($y_i = 1$) and background pixels ($y_i = 0$).

We use the backbone network of F3Net, which is based on ResNet-50 (He et al., 2016), as the student network. For convenient analysis, we remove the branches of F3Net and use only its backbone, trained with the CE loss, as the baseline network. During training, the maximum learning rate is 0.005, and warmup and linear decay strategies are used to dynamically adjust the learning rate. The network is trained with Stochastic Gradient Descent (SGD); the momentum and weight decay are set to 0.9 and 0.0005, respectively. In the experiments, the batch size is 32, the maximum number of epochs is 32, and all images are resized to 352 × 352.
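A warmup followed by linear decay can be sketched as a function of the training step. Only the maximum learning rate of 0.005 comes from the text; the warmup length, the per-step granularity, the decay to (almost) zero and the function name `learning_rate` are assumptions made for illustration and may differ from the schedule actually used.

```python
def learning_rate(step, total_steps, max_lr=0.005, warmup_steps=100):
    """Linear warmup to max_lr, then linear decay towards zero.

    step is zero-based; warmup_steps is a hypothetical setting.
    """
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Decay linearly over the remaining steps.
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)


# Illustrative run length of 1000 steps (not taken from the paper).
schedule = [learning_rate(s, 1000) for s in range(1000)]
```

The rate peaks at the end of the warmup phase and then decreases monotonically, so the largest updates occur early and the final steps fine-tune the weights.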

Datasets
We conducted experiments on five challenging datasets with salient or camouflaged objects: COD (Fan et al., 2020), DUT-O (Yang et al., 2013), THUR (Cheng et al., 2014a), PASCAL-S and HKU-IS. COD is a dataset of natural camouflaged objects, including 6,066 natural images and the corresponding pixel-wise annotation images. DUT-O, THUR, PASCAL-S and HKU-IS are salient object datasets, which contain 4,447, 5,168, 850 and 1,447 images, respectively, with corresponding annotation images. COD is divided into training and testing sets by its default setting; DUT-O and THUR are divided into training and testing sets in proportions of 0.6 and 0.4; PASCAL-S and HKU-IS are divided in proportions of 0.8 and 0.2.

Metrics
We use the $F_\beta$ measure (F) (Fan et al., 2020), the mean absolute error (MAE) (Yang et al., 2021), the E-measure (E) (Kang & Kang, 2021) and the precision-recall (PR) curve (Xian et al., 2022) to evaluate the network performance. F is the weighted harmonic mean of precision and recall, calculated as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$

where $\beta^2$ is the weight, usually set to 0.3; precision reflects the accuracy of the object detection and recall reflects its completeness. MAE is calculated as follows:

$$MAE = \frac{1}{H \times W} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right|$$

where $H$ and $W$ are the height and width of the sample, $P$ is the prediction result of the network, and $G$ is the ground truth. The larger F and E are and the smaller MAE is, the better the network performance.
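As an illustration of the two simpler metrics, the sketch below computes MAE and $F_\beta$ for one saliency map. It assumes the conventional weight $\beta^2 = 0.3$ and binarizes the prediction at a fixed threshold of 0.5; the threshold, the function names and the toy 2 × 2 maps are assumptions for illustration, and evaluation toolkits often use adaptive or swept thresholds instead.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    total = sum(abs(p - g) for p_row, g_row in zip(pred, gt)
                for p, g in zip(p_row, g_row))
    return total / (len(pred) * len(pred[0]))


def f_beta(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of the thresholded prediction (beta2 is beta squared)."""
    tp = fp = fn = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            positive = p >= threshold
            if positive and g == 1:
                tp += 1
            elif positive and g == 0:
                fp += 1
            elif not positive and g == 1:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


# Toy example: one background pixel (0.6) is wrongly predicted as salient.
pred = [[0.9, 0.8],
        [0.2, 0.6]]
gt = [[1, 1],
      [0, 0]]
mae_value = mae(pred, gt)
f_value = f_beta(pred, gt)
```

In this toy case precision is 2/3 and recall is 1; because $\beta^2 < 1$ weights precision more heavily, the single false positive lowers F noticeably.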

Hyperparameter selection
We discuss the selection of α in Eq. (16). We select different α to test the model performance.
Through experiments, we choose α = 0.3, at which the model achieves the best performance. The results are shown in Table 5, from which the following conclusions are drawn.
(1) The model performance improves for all tested values of α.
(2) When α exceeds a certain value, the model performance begins to decline. This shows that a larger influence of the output probability does not always yield better model performance. When the influence of misclassified pixels is too great, the model may be dominated by these pixels and obtain optimal results only on them.

Comparison with different one-dimensional distance metric methods
In SOD, the output is a one-dimensional category probability, and the aim of self-distillation is for the student network to produce the same distribution as the teacher. Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE) and Cosine Similarity (CS) are common one-dimensional distance metric functions that are widely used as loss functions. Therefore, we use these functions and the KL formula (KL) as distillation loss functions and compare them with our method (NKL). In the experiment, the virtual teacher self-distillation architecture remains unchanged and only the distillation loss function is changed. The experimental results are shown in Table 6.
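The compared distance functions can be written compactly for flattened student and teacher outputs. This is a sketch under stated assumptions: the outputs are plain lists of probabilities, the function names are hypothetical, and cosine similarity is turned into a loss as one minus the similarity, which is a common choice but not one the text confirms.

```python
import math


def mse(student, teacher):
    """Mean squared error between two equally long probability lists."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def rmse(student, teacher):
    """Root mean square error."""
    return math.sqrt(mse(student, teacher))


def mae_loss(student, teacher):
    """Mean absolute error."""
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)


def cosine_loss(student, teacher):
    """One minus cosine similarity (assumed loss form)."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


# Student outputs for one background and one foreground pixel, compared
# with the virtual teacher values 0.01 and 0.99.
student = [0.2, 0.9]
teacher = [0.01, 0.99]
losses = {
    "MSE": mse(student, teacher),
    "RMSE": rmse(student, teacher),
    "MAE": mae_loss(student, teacher),
    "CS": cosine_loss(student, teacher),
}
```

All four functions are symmetric in the foreground and background errors, unlike a loss that treats the two regions with different formulas.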
The following conclusions can be drawn from Table 6.
(1) Not all one-dimensional distance metric methods can be used as the distillation loss function. In terms of the mean performance over the five datasets, when RMSE is used as the distillation loss, the network performance after distillation is poorer than before distillation. In terms of MAE, when the KL formula is used as the distillation loss, the network performance after distillation is also poorer than before distillation.
(2) Our method (NKL) transfers the knowledge well and achieves the best detection results on all five datasets. In particular, in terms of the mean E, our method improves by 2.2% compared with the second-best method (KL); in terms of the mean MAE, our method reduces the error by 1.6% compared with KL. These results show that our method is effective.

Comparison with recent self-distillation methods
First, we use the backbone network of F3Net as the baseline and take it as the student network. Then, the FR (Ji et al., 2021), SA, BYOT (Zhang et al., 2019a) and DHM self-distillation methods are introduced into the baseline. Finally, the KL formula and our non-negative feedback loss function (NKL) are each used as the distillation loss to train the network. The experimental results are shown in Table 7. SA directly uses pixel-wise attention features in the backbone network as distillation knowledge. FR uses a Bi-directional Feature Pyramid Network (BiFPN) to generate pixel-wise knowledge. Therefore, SA and FR can be directly applied to SOD. BYOT and DHM cannot directly generate pixel-wise knowledge, so we modify them so that they can be applied to the self-distillation framework for SOD. We make the following two modifications. (1) We change the stride of the first convolution layer in the bottleneck module from 2 to 1, which maintains the spatial size of the feature in the bottleneck module of the auxiliary branch. (2) We replace the fully connected layer of the auxiliary branch with a convolution layer, which generates pixel-wise knowledge that can be transferred to the backbone network.
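The effect of the two modifications can be illustrated with standard convolution arithmetic. The sketch below uses the usual output-size formula and shows that a layer applying the same linear map at every pixel (the behaviour of a 1 × 1 convolution) keeps the spatial layout, whereas a fully connected layer would collapse it. The 3 × 3 kernel with padding 1, the 88-pixel feature size and the function names are illustrative assumptions, not settings taken from BYOT or DHM.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pointwise_conv(features, weights, bias):
    """Apply one linear map at every pixel, as a 1x1 convolution does.

    features: H x W x C_in nested lists; weights: C_out x C_in; bias: C_out.
    Returns an H x W x C_out nested list, so the spatial layout is kept.
    """
    return [[[sum(w * x for w, x in zip(w_row, pixel)) + b
              for w_row, b in zip(weights, bias)]
             for pixel in row]
            for row in features]


# Stride 2 halves an 88-pixel feature map; stride 1 preserves it.
halved = conv_output_size(88, kernel=3, stride=2, padding=1)
kept = conv_output_size(88, kernel=3, stride=1, padding=1)

# A 1 x 2 feature map with two channels mapped to one output channel.
features = [[[1.0, 2.0], [0.0, 1.0]]]
out = pointwise_conv(features, weights=[[1.0, -1.0]], bias=[0.5])
```

With the spatial size preserved and a per-pixel output, the auxiliary branch produces a map that can be compared pixel by pixel with the backbone prediction.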
From Table 7, we draw the following conclusions.
(1) Self-distillation methods are also suitable for SOD. The results in the table show that self-distillation methods improve the network performance. Compared with the baseline, our method improves the average F by nearly 2% and the average E by nearly 3.8%.
(2) Our virtual teacher model is better than the other self-distillation methods. On the five datasets, our virtual teacher model achieves the best detection results. In particular, in terms of the mean F, our method improves by nearly 2% compared with the second-best method (BYOT); in terms of the mean E, it improves by nearly 3% compared with BYOT.
(3) Our non-negative feedback loss function (NKL) achieves better results than the KL formula across different self-distillation methods in SOD. In DHM, NKL achieves better detection results on all five datasets. In the other methods, NKL achieves better results on at least three datasets, and KL outperforms our method mainly on COD. As the foreground and background are similar in camouflaged images, the prediction of the teacher network may be worse than that of the student network in other self-distillation methods. With NKL, a worse-performing teacher cannot guide the student training, which limits the improvement of the network performance.
(4) NKL works better in our virtual teacher method than in other self-distillation methods. In terms of the mean E, NKL improves by nearly 2% compared with KL in the virtual teacher method, but by only about 0.3% in other self-distillation methods. The main reason is that, in other self-distillation methods, the teacher network may perform worse than the student on easy, well-classified pixels. With NKL, a worse-performing teacher cannot guide the student training, which limits the improvement of the student network.
All methods use their default learning rate, momentum, weight decay and maximum number of epochs. Table 8 quantitatively shows the detection results of the different methods. Our method achieves good detection results on the five datasets and the best detection results on THUR and HKU-IS. In terms of the mean performance over the five datasets, our method also achieves the best detection results; in terms of the mean E, it improves by nearly 3% compared with CPD. Figure 4 shows the precision-recall curves of the different methods. Our curves are higher than those of the other methods on COD, THUR and HKU-IS, which shows that our method achieves good performance. Table 9 shows the detection efficiency of the different methods, compared in terms of model parameter size and detection speed. Our method has the smallest parameter scale and the fastest detection speed while its performance is similar to that of the other methods. In terms of parameter count, our method is nearly 100M smaller than GateNet.

CONCLUSIONS
Self-distillation has been shown to improve the performance of lightweight networks and is widely used in computer vision tasks. However, when self-distillation is applied to SOD, the common distillation loss function (KL divergence) generates negative feedback. To solve this problem, a non-negative feedback distillation loss is proposed. The experimental results show that our method improves the network performance. Given the advantages of self-distillation, more and more tasks are likely to make use of it in the future. Our method further expands the application scope of self-distillation and provides a new attempt at adopting self-distillation for new tasks.