Reverse Self-Distillation: Overcoming the Self-Distillation Barrier

With limited training data, deep neural networks generally cannot extract enough helpful information for image classification, resulting in poor performance. Self-distillation, a novel knowledge distillation technique, integrates the roles of teacher and student into a single network to address this problem. A better understanding of why self-distillation works is critical to its advancement. In this article, we provide a new perspective: the effectiveness of self-distillation comes not only from distillation but also from the supervisory information provided by the shallow networks. At the same time, we identify a barrier that limits the effectiveness of self-distillation. Based on this, reverse self-distillation is proposed. In contrast to self-distillation, its internal knowledge flows in the opposite direction. Experimental results show that reverse self-distillation can break the barrier of self-distillation and further improve network accuracy. On average, accuracy gains of 2.8% and 3.2% are observed on CIFAR100 and Tiny-ImageNet, respectively.

In other words, part of the benefit of self-distillation comes from the supervisory information provided by the shallow classifiers. In short, the joint training (JT) of a deep classifier and multiple shallow classifiers, without any distillation, is equally effective in improving performance. To verify this idea, we carefully compare the effect of joint training with self-distillation (SD) in frameworks of different scales. The experimental results show that self-distillation becomes less effective as the framework scale increases. Based on this, we propose a new self-distillation method: reverse self-distillation (RSD). Each network is treated as a deep classifier and divided into multiple parts according to the network structure. Additional modules are added to all parts to construct multiple shallow classifiers, which transfer knowledge to the deep classifier. All classifiers are trained simultaneously. The details of the training methods are shown in Fig. 1. Finally, RSD is compared with SD and JT, and all experimental results are analyzed in detail.
The primary contributions of this article are as follows:
1) We compare the training effects of JT and SD across different framework scales and identify the true sources of SD's effectiveness. We find that a framework with too many parameters can impede SD.
2) We propose a new distillation method that swaps the teacher-student relationship within the framework. RSD mitigates the negative impact of the framework on distillation and achieves higher performance.
3) We find that the more shallow classifiers there are, the higher the average accuracy of the deep classifier under all three training methods, and that shallow classifiers located at the tail position reduce the effectiveness of all three methods. Our proposed method achieves the best performance.

II. RELATED WORK
A. KNOWLEDGE DISTILLATION
Knowledge distillation is one of the model compression schemes [22], [23], [24], [25], [26], [27]. It is also a training method for improving model performance. Hinton et al. [13] proposed a single-teacher-single-student training framework in which the student model improves its performance by acquiring knowledge of inter-class similarities from the teacher model. In this architecture, the teacher and student models must handle a common task. In contrast, knowledge amalgamation [28], [29] and multi-teacher learning [30], [31] are two types of multi-teacher-single-student training. In the first case, the student model handles the tasks of several teacher models simultaneously. In the second case, the student model utilizes the knowledge of multiple teacher models to improve performance on a single task. During distillation, the teacher-student gap may reduce the student's ability to learn, and teacher-assistant methods [32], [33] have been proposed to address this problem. In addition to teachers, classmates can also serve as a source of knowledge: mutual distillation [34] enhances the performance of all student models by transferring knowledge among them.

B. MULTI-EXIT ARCHITECTURE
For classification networks, the difficulty of classifying images differs by category. As a classical multi-exit classification network, MSDNet [35] selects different classification exits based on the difficulty of the image. Shallow-Deep Networks [36] utilize confidence scores to address overfitting, balancing accuracy and inference cost. Lee et al. [37] investigated the influence of DNNs with single and multiple exits on inference accuracy and latency in edge computing. The multi-exit architecture is also widely employed in knowledge distillation. Zhang et al. [19] suggest that deep exits can guide the training of shallow exits, providing diverse knowledge. The distillation method proposed by Phuong et al. [38] encourages early exits to mimic later exits by matching output probabilities. Wang et al. [39] encouraged each exit to learn from all of its later exits. All of these works help a single neural network acquire knowledge more efficiently and improve performance. However, previous research on multi-exit distillation mainly took deep exits as the sources of knowledge, ignoring shallow exits. In this article, we utilize the knowledge of shallow exits to enhance network performance.

III. PROPOSED METHOD
A. RSD
RSD inverts the teacher-student relationship of SD: the deep classifier functions as the student, while the three shallow classifiers act as the teachers that convey knowledge. As shown in Fig. 2, ResNet [5], as the deep classifier, is divided into four parts according to the structure of its ResBlock modules, and three shallow classifiers are constructed by adding an external module (a bottleneck) and a fully connected layer (FC layer) after each of the first three parts in turn.
As shown in Fig. 3, we design three different bottlenecks for ResNet18, corresponding to the distillation frameworks S, M, and L, respectively. All three frameworks adopt the structure shown in Fig. 2 and differ only in their number of parameters. Changing the convolution structure of the bottlenecks adjusts the framework's parameters, thereby building different frameworks. Building each framework does not affect the original structure of ResNet18. The S framework has the fewest parameters, the M framework has a moderate number, and the L framework has the most. When constructing the X (X = S, M, L) framework, the bottleneck structure uses ResNet18_X. In the S and M frameworks, the number of convolutional layers for bottleneck_i (i
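To make the construction concrete, the following is a minimal PyTorch sketch of how a shallow classifier could be attached to an intermediate feature map. It is an illustration under assumptions, not the paper's exact configuration: the channel progression, the use of one strided 3 × 3 convolution per stage, and all names are ours.

```python
import torch.nn as nn

class ShallowClassifier(nn.Module):
    """Illustrative shallow classifier: a bottleneck of strided conv stages
    that reduces an intermediate feature map to the shape of the final
    ResBlock's output, followed by global pooling and an FC layer."""
    def __init__(self, in_ch, num_stages, out_ch=512, num_classes=100):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(num_stages):          # each stage halves the spatial size
            nxt = min(ch * 2, out_ch)
            layers += [nn.Conv2d(ch, nxt, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(nxt), nn.ReLU(inplace=True)]
            ch = nxt
        self.bottleneck = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(ch, num_classes)

    def forward(self, feat):
        hint = self.bottleneck(feat)         # "hint" used by the L2 feature loss
        logits = self.fc(self.pool(hint).flatten(1))
        return logits, hint
```

For a CIFAR-style ResNet18 (ResBlock outputs of 64 × 32 × 32, 128 × 16 × 16, 256 × 8 × 8, and 512 × 4 × 4), attaching `ShallowClassifier(64, 3)`, `ShallowClassifier(128, 2)`, and `ShallowClassifier(256, 1)` after the first three ResBlocks yields hints with the same 512 × 4 × 4 shape as the ResBlock4 output, so the hint loss in Section III-B is well defined.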

B. FORMULATION
Assume the RSD framework is trained on M samples X and contains a total of p classifiers. The deep classifier is denoted $\theta_p$, and the shallow classifiers are denoted $\{\theta_i\}_{i=1}^{p-1}$. A SoftMax layer is placed after each classifier. A temperature T is introduced into the SoftMax, so the output of each classifier can be softened by adjusting the temperature:

$$q_i^c = \frac{\exp(z_i^c / T)}{\sum_j \exp(z_j^c / T)} \tag{1}$$

Here $z_i^c$ is the output for class i after the fully connected layer of classifier $\theta_c$, and $q_i^c$ represents the output probability of class i of classifier $\theta_c$. T denotes the temperature. When T is set to 1, (1) reduces to the standard SoftMax function. The larger T is, the softer the output probability distribution becomes.
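As a quick illustration (a minimal sketch, not the authors' code; the temperature values used here are arbitrary), (1) is a one-liner on top of the logits:

```python
import torch
import torch.nn.functional as F

def soften(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Temperature-softened SoftMax from (1); T = 1 recovers the standard SoftMax."""
    return F.softmax(logits / T, dim=1)

logits = torch.randn(4, 100)                  # batch of 4, 100 classes (as in CIFAR100)
print(soften(logits, 1.0).max(dim=1).values)  # sharper distribution
print(soften(logits, 4.0).max(dim=1).values)  # softer: closer to uniform
```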
In RSD, the deep classifier is supervised by three sources: labels, the outputs of the shallow classifiers, and the hints of the shallow classifiers. The output of a bottleneck is a hint. The three sources are balanced with two hyper-parameters α and β. α is used to adjust the loss proportions for labels and the outputs of shallow classifiers. As the value of α increases, the weight of the loss of outputs from shallow classifiers increases, and the importance of the loss of labels decreases. β is used to adjust the loss proportion of hints. The larger the β is, the higher the proportion becomes.
Loss_CE: The first loss is computed from the SoftMax output of each classifier and the labels in the dataset:

$$L_{CE} = (1-\alpha) \sum_{i=1}^{p} Cr(q_i, y) \tag{2}$$

Here Cr is the cross-entropy loss function, $q_i$ denotes the output of the SoftMax layer in the shallow classifier $\theta_i$, and y are the labels; the SoftMax layer's output in the deep classifier is $q_p$. In this way, the dataset's knowledge is introduced directly to all classifiers.

Loss_KD: The second loss is the Kullback-Leibler divergence between the deep classifier and each shallow classifier:

$$L_{KD} = \alpha \sum_{i=1}^{p-1} KL(q_i \,\|\, q_p) \tag{3}$$

KL is the Kullback-Leibler divergence, $q_i$ denotes the output of the SoftMax layer in the shallow classifier $\theta_i$, and $q_p$ represents the output of the SoftMax layer in the deep classifier. By introducing the SoftMax output of each shallow classifier into the SoftMax layer of the deep classifier, the knowledge learned by each teacher model is transferred to the student model. Note that α in (2) and α in (3) have the same meaning and setting.

Loss_F: The third loss is the L2 loss between the output of each bottleneck and the output of ResBlock4:

$$L_{F} = \beta \sum_{i=1}^{p-1} \| F_i - F_p \|_2^2 \tag{4}$$

Here $F_p$ and $F_i$ represent the outputs of the hidden layer in the deep classifier and of the bottleneck in the shallow classifier $\theta_i$, respectively. In this way, each bottleneck's output is introduced into the deep classifier's hidden layer (ResBlock4).

The total loss function combines the three components listed above:

$$L = L_{CE} + L_{KD} + L_{F} \tag{5}$$
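Putting (2)-(5) together, the following is a minimal PyTorch sketch of the RSD objective under our assumptions: the `detach()` calls (stopping gradients on the teacher side), the standard T² scaling of the KL term, and all variable names (`deep_logits`, `shallow_logits`, `deep_feat`, `hints`) are ours, not necessarily the authors' implementation.

```python
import torch.nn.functional as F

def rsd_loss(deep_logits, shallow_logits, deep_feat, hints, labels,
             alpha=0.5, beta=0.03, T=3.0):
    """RSD total loss (5) = label CE (2) + reverse KD (3) + hint L2 (4)."""
    # (2): cross-entropy with the labels for the deep and all shallow classifiers
    ce = F.cross_entropy(deep_logits, labels)
    ce = ce + sum(F.cross_entropy(s, labels) for s in shallow_logits)
    loss = (1.0 - alpha) * ce

    # (3): the deep classifier (student) matches each shallow classifier (teacher)
    log_q_p = F.log_softmax(deep_logits / T, dim=1)
    for s in shallow_logits:
        q_i = F.softmax(s.detach() / T, dim=1)     # teacher side: no gradient
        loss = loss + alpha * T * T * F.kl_div(log_q_p, q_i, reduction="batchmean")

    # (4): L2 between each bottleneck hint and the ResBlock4 feature map
    for h in hints:
        loss = loss + beta * F.mse_loss(deep_feat, h.detach())

    return loss
```

Because the shallow classifiers are trained by their own cross-entropy terms, detaching them in the KD and hint terms makes the knowledge flow strictly from shallow to deep, which is the "reverse" direction RSD describes.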

C. THEORETICAL ANALYSIS OF RSD
We explain why SD can fail and why RSD is effective from the perspective of generalized distillation [40]. Generalized distillation holds that the teacher can provide additional sample information to the student, which is why distillation methods are effective; in SD, the deep classifier provides this additional information. Assume that the framework contains only one shallow classifier $g_s \in G_s$ and one deep classifier $g_d \in G_d$, and let $g \in G$ be the target function. Let $H(\cdot)$ be a suitable capacity measure of a function class, such as the VC-dimension, and let $A(\cdot)$ denote the generalization error; the $O(\cdot)$ terms below bound the estimation error $E(\cdot)$. The shallow and deep classifiers learn from the target function:

$$A(g_s) \le O\!\left(\frac{H(G_s)}{\sqrt{m}}\right) + \sigma_s \tag{6}$$

$$A(g_d) \le O\!\left(\frac{H(G_d)}{\sqrt{m}}\right) + \sigma_d \tag{7}$$

The approximation errors of the function classes $G_s$ and $G_d$ learning towards $g$ are $\sigma_s$ and $\sigma_d$, respectively, and $m$ is the total amount of learning data. In SD, the shallow classifier learns from the deep classifier:

$$A(g_s \,|\, g_d) \le O\!\left(\frac{H(G_s)}{m^{\gamma}}\right) + \sigma_l \tag{8}$$

Here $A(g_s \,|\, g_d)$ is the error of $g_s$ measured with respect to $g_d$, and $\sigma_l$ is the approximation error of $G_s$ learning towards $g_d$. Generalized distillation assumes that the rate at which the student learns from the teacher lies between the rate of learning from the target and the fastest achievable rate, so $\gamma$ ranges between 1/2 and 1. Combining (7) and (8) yields:

$$A(g_s) \le O\!\left(\frac{H(G_s)}{m^{\gamma}}\right) + O\!\left(\frac{H(G_d)}{\sqrt{m}}\right) + \sigma_l + \sigma_d \tag{9}$$

Distillation is effective when the bound in (9) is tighter than the bound in (6), i.e., when

$$O\!\left(\frac{H(G_s)}{m^{\gamma}}\right) + O\!\left(\frac{H(G_d)}{\sqrt{m}}\right) + \sigma_l + \sigma_d \le O\!\left(\frac{H(G_s)}{\sqrt{m}}\right) + \sigma_s \tag{10}$$

In SD, the performance of the deep classifier is much better than that of the shallow classifier, so $\sigma_s \gg \sigma_l + \sigma_d$ and (10) holds. When the performance of the shallow classifier exceeds that of the deep classifier, (10) may no longer hold, which can cause self-distillation to fail. If the deep classifier instead learns from the shallow classifier at this point, the corresponding inequality

$$O\!\left(\frac{H(G_d)}{m^{\gamma}}\right) + O\!\left(\frac{H(G_s)}{\sqrt{m}}\right) + \sigma_L + \sigma_s \le O\!\left(\frac{H(G_d)}{\sqrt{m}}\right) + \sigma_d \tag{11}$$

holds, where $\sigma_L$ is the approximation error of $G_d$ learning towards $g_s$. This is why reverse self-distillation is effective.
IV. EXPERIMENTS
A. IMPLEMENTATION DETAILS
All experiments are implemented in PyTorch on GPU devices and use various networks (VGG8BN [3], VGG11BN [3], VGG13BN [3], VGG16BN [3], ResNet8 [5], ResNet18 [5], SqueezeNet [45], MobileNetV1 [46], MobileNetV2 [25], ResNet50 [5]). SGD with learning rate decay and momentum is used to optimize the neural networks. The weight decay is 5e-4, and the momentum is 0.9. The recommended values for α and β are 0.5 and 0.03, respectively. The batch size is 128, and the initial learning rate is 0.1. On the CIFAR100, CUB, and MIT datasets, neural networks are trained for 200 epochs, with the learning rate divided by 10 at the 66th, 133rd, and 190th epochs. On the Tiny-ImageNet dataset, neural networks are trained for 100 epochs, with the learning rate divided by 10 at the 33rd, 66th, and 90th epochs. The distillation frameworks are classified into three types: the L framework with the most parameters, the M framework with moderate parameters, and the S framework with the fewest. We evaluate the accuracy of the networks under JT, SD, and RSD in the different frameworks.
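The recipe above corresponds to the following PyTorch sketch (the torchvision ResNet18 stands in for any of the evaluated networks; the elided loop body is a standard supervised epoch):

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)   # stand-in network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# CIFAR100/CUB/MIT: 200 epochs, LR divided by 10 at epochs 66, 133, 190;
# Tiny-ImageNet: 100 epochs with milestones [33, 66, 90].
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[66, 133, 190], gamma=0.1)

for epoch in range(200):
    ...  # one training epoch over CIFAR100 with batch size 128
    scheduler.step()
```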

B. DATASETS
The CIFAR100 dataset consists of 60,000 RGB images of 32 × 32 pixels, divided into 100 categories with 600 images per category, and split into 50,000 training images and 10,000 testing images. The Tiny-ImageNet dataset is divided into 200 categories. Each class has 500 training images, 50 validation images, and 50 test images. Each image is 64 × 64 pixels.
CUB and MIT are common benchmark datasets for fine-grained visual recognition (FGVR). The CUB dataset contains 11,788 bird images divided into 200 categories, while the MIT dataset contains 15,620 indoor images divided into 67 categories.

C. EVALUATION METRICS
The evaluation metrics employed in the experiments are as follows:
1) Accuracy is the percentage of correct predictions among all testing samples, which measures the performance of the model:

$$Accuracy = \frac{Cor}{TeN} \times 100\% \tag{12}$$

where Cor and TeN denote the number of correctly predicted testing samples and the total number of testing samples, respectively.
2) Storage refers to the number of parameters in the model, which depends mainly on the convolutional and fully connected layers. The parameters of a convolutional layer are calculated as:

$$CH_{in} \times K_s \times K_s \times CH_{out} \tag{13}$$

Here $CH_{in}$ and $CH_{out}$ represent the number of input feature map channels and the number of convolutional filters, respectively, and $K_s$ is the size of the convolutional kernel. The parameters of a fully connected layer are calculated as:

$$C_{in} \times C_{out} \tag{14}$$

where $C_{in}$ and $C_{out}$ denote the number of input and output channels, respectively.
3) Train duration is the time required to complete the training of the model:

$$Train\ duration = \frac{TN \times Epoch}{v} \tag{15}$$

Here TN and Epoch represent the total number of samples in the dataset and the number of times the dataset is traversed, respectively, and v is the training speed determined by the experimental equipment. In this article, all models are trained on a Tesla P100.
4) FLOPs are floating-point operations, which depend mainly on the convolutional and fully connected layers. For a convolutional layer:

$$H_{map} \times W_{map} \times CH_{in} \times K_s \times K_s \times CH_{out} \tag{16}$$

where $H_{map}$ and $W_{map}$ represent the height and width of the output feature map, respectively; the other parameters are the same as in (13). The FLOPs of a fully connected layer are calculated in the same way as its parameters, as shown in (14).
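As a sanity check on (13), (14), and (16), the formulas translate directly into a few helper functions (a sketch; bias terms are ignored, matching the formulas above):

```python
def conv_params(ch_in: int, ch_out: int, k: int) -> int:
    """Parameters of a conv layer per (13): CH_in * K_s^2 * CH_out (bias ignored)."""
    return ch_in * k * k * ch_out

def conv_flops(ch_in: int, ch_out: int, k: int, h_map: int, w_map: int) -> int:
    """FLOPs of a conv layer per (16): one multiply per weight per output position."""
    return h_map * w_map * conv_params(ch_in, ch_out, k)

def fc_params(c_in: int, c_out: int) -> int:
    """Parameters (and FLOPs) of an FC layer per (14)."""
    return c_in * c_out

# Example: a 3x3 conv mapping 3 -> 64 channels with a 32x32 output feature map
print(conv_params(3, 64, 3))         # 1728
print(conv_flops(3, 64, 3, 32, 32))  # 1769472
```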

D. RESULTS ON CIFAR100 AND TINY-IMAGENET
According to Table 1, all three training methods can improve each classification network's performance. RSD is optimal in all L frameworks. Only in the S frameworks does SD outperform JT. The performance improvements of JT and SD decline as the number of framework parameters increases, and the accuracy of the networks under RSD begins to saturate. Overall, JT performs better than SD. In particular, the accuracy of VGG16BN, ResNet18, and SqueezeNet under SD in the L frameworks is worse than their baselines. According to Table 2, the accuracy of VGG16BN under SD and RSD in the S framework is lower than the baseline on Tiny-ImageNet. The performance of SqueezeNet does not improve significantly under RSD and SD in the S framework. Therefore, certain networks cannot improve their performance under SD and RSD in a framework with few parameters. As the number of parameters in the frameworks increases, the accuracy of the networks under RSD starts to saturate, and the performance improvement that JT and SD can achieve gradually decreases. Overall, RSD is superior to the other two training methods.

E. RESULTS OF RSD ON OTHER DATASETS
To further validate the effectiveness of RSD, we select the CUB and MIT datasets. The CUB dataset has a higher level of classification difficulty. The MIT dataset is an indoor scene detection dataset with instructive value for the real-world application of RSD. The experimental results are shown in Table 3. On average, RSD improves the accuracy by 6.7% on the CUB dataset and 3.5% on the MIT dataset. Surprisingly, VGG13BN achieves an accuracy improvement of 12.6% on the CUB dataset.

F. IMPACT OF HYPER-PARAMETERS ON RSD
We build the RSD framework using ResNet18 to investigate the impact of hyper-parameters on its performance. The framework structure is shown in Fig. 2, and the framework type is M. The output of SoftMax is the output of the deep classifier, and the output of SoftMax I (I = 1, 2, 3) is the output of shallow classifier I. α and β balance the three supervision sources for the deep classifier. Following previous work [19], α ranges from 0.1 to 0.9 and β from 0.01 to 0.09. First, we fix β at 0.05 and investigate the impact of different values of α. Then, we select the optimal value of α and explore the effects of different values of β. All experimental results are shown in Table 4. When β is 0.05, the deep classifier achieves the highest accuracy with α values of 0.1 and 0.5. As α increases, the deep classifier's performance decreases slightly, while the performance of the shallow classifiers drops significantly. When α is fixed, the average accuracy of the four classifiers first increases and then decreases as β grows. Moreover, all cases with α = 0.5 perform better than those with α = 0.1. The optimal combination of hyper-parameters is α = 0.5 and β = 0.03.
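The two-stage search described above amounts to a simple coordinate sweep. Below is a sketch; `train_and_eval` is a hypothetical function that trains the M framework with the given hyper-parameters and returns the deep classifier's accuracy.

```python
def coordinate_sweep(train_and_eval):
    # Stage 1: fix beta = 0.05 and sweep alpha over 0.1 .. 0.9
    alphas = [round(0.1 * k, 1) for k in range(1, 10)]
    best_alpha = max(alphas, key=lambda a: train_and_eval(alpha=a, beta=0.05))
    # Stage 2: fix the chosen alpha and sweep beta over 0.01 .. 0.09
    betas = [round(0.01 * k, 2) for k in range(1, 10)]
    best_beta = max(betas, key=lambda b: train_and_eval(alpha=best_alpha, beta=b))
    return best_alpha, best_beta   # the paper reports (0.5, 0.03) as optimal
```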

G. COMPARISON WITH THE OTHER DISTILLATION METHODS
The comparison between the proposed RSD and five different types of distillation methods on CIFAR100 is shown in Table 5. RSD achieves the highest accuracy among the compared methods. On average, our method yields a 2.9% performance gain on CIFAR100, which is 0.7% higher than the average of the best results of the other methods. Table 6 compares RSD with the five distillation methods on Tiny-ImageNet. RSD again outperforms the other methods: on average, it improves accuracy by 3.3%, which is 0.9% higher than the average of the second-best methods.
In addition, we select ResNet50 as the teacher network and ResNet18 as the student network to discuss RSD's computational complexity and resource requirements compared to other distillation methods. Our evaluation metrics for the model training phase include accuracy, storage, and training duration. As shown in Table 7, the proposed method achieves the highest accuracy with lower resource requirements and shorter training duration.

H. TRADE-OFF BETWEEN ACCURACY, STORAGE, AND COMPUTATIONAL EFFICIENCY ON RSD
We use VGG11BN and VGG16BN as deep classifiers to discuss the contribution of RSD to model compression and inference acceleration. The evaluation metrics include accuracy, FLOPs, and storage, as shown in Table 8. When deploying VGG11BN and VGG16BN, all shallow classifiers can be removed, so the FLOPs and storage of deep classifiers remain unchanged. On average, all deep classifiers show an improvement in accuracy of 1.1%, while all shallow classifiers exhibit a storage reduction of 90.8% and a 28.3% decrease in FLOPs. For VGG11BN, all shallow classifiers have an average accuracy improvement of 2.0% compared to the baseline. For VGG16BN, shallow classifiers 2 and 3 have an average accuracy improvement of 1.0% compared to the baseline.

V. DISCUSSION
As shown in Fig. 4, to investigate the effect of the number and locations of shallow classifiers on the performance of a deep classifier, seven locations (A-G) are defined according to the ResNet18 structure for building shallow classifiers, with ResNet18 serving as the deep classifier. Two convolutional layers lie between each adjacent pair of positions from A to G. When building a shallow classifier at position P (P = A, B, ..., G), a bottleneck and a fully connected layer (FC) are added on top of all the convolutional layers before position P. When P is A or B, bottleneck1 is used in constructing the shallow classifier; when P is C or D, bottleneck2 is used; when P is E or F, bottleneck3 is used; and when P is G, bottleneck4 is used. Please refer to Table 9 for the specific construction method. When the number of shallow classifiers is I, there are $C(7, I) = \frac{7!}{I!\,(7-I)!}$ possible distribution patterns for the shallow classifiers across the seven locations. In Table 9, we assign a position identifier, TI-Number, to each distribution pattern: I is the total number of shallow classifiers, and Number distinguishes the different distribution patterns with the same number of shallow classifiers. Location gives the specific locations of the shallow classifiers for each position identifier. Experiments are conducted on the CIFAR100 dataset.
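Enumerating the distribution patterns is straightforward with the standard library. The sketch below illustrates one plausible lexicographic numbering of the TI-Number scheme (it matches, e.g., T1-4 = D and T2-6 = AG, but Table 9's exact ordering is the authority):

```python
from itertools import combinations
from math import comb

positions = "ABCDEFG"
for I in range(1, 8):
    patterns = list(combinations(positions, I))
    assert len(patterns) == comb(7, I)            # C(7, I) = 7! / (I! * (7 - I)!)
    for number, locs in enumerate(patterns, start=1):
        print(f"T{I}-{number}: {''.join(locs)}")  # e.g. T1-4: D, T2-6: AG
```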

A. IMPACT OF SHALLOW CLASSIFIERS ON JT
As shown in Fig. 5, the deep classifier reaches its highest accuracy of 79.6% in case T5-5 and its lowest of 75.5% in case T2-12. When only one shallow classifier is used in the framework, the deep classifier's accuracy is greater with the shallow classifier located at a middle position than at the head or tail. For example, the accuracy of the deep classifier is higher when the shallow classifier is in position D (T1-4) than when it is in position A (T1-1) or G (T1-7). On the other hand, when there are two shallow classifiers in the framework, the accuracy of the deep classifier with both shallow classifiers located in the middle positions (T2-12) is the lowest, and is even lower than the baseline. When the two shallow classifiers are located at the head position A and the tail position G (T2-6), the accuracy of the deep classifier is comparable to the baseline and shows no significant improvement. In the remaining cases, the accuracy of the deep classifier exceeds 78.0%. The impact of three or more shallow classifiers on the deep classifier is almost the same, with the minimum accuracy of the deep classifier reaching 78.0% and the maximum hovering around 79.5%. It is worth noting that in all of the highest-accuracy cases (T3-2, T4-1, T5-5, T6-5, T7-1), the shallow classifiers are always located at positions A and B. See Table 9 for the specific details of each distribution pattern.

B. IMPACT OF SHALLOW CLASSIFIERS ON SD
As shown in Fig. 6, the deep classifier has its highest accuracy of 79.5% (T3-1) and its lowest of 75.7% (T1-7). As the number of shallow classifiers increases, the highest accuracy of the deep classifier first rises and then falls; the optimal number of shallow classifiers is three. In all cases with a single shallow classifier, the accuracy of the deep classifier is higher when the shallow classifier is in a middle position than at the head or tail, and it is significantly lower than the baseline when the single shallow classifier is located at the tail position G (T1-7). When there are two to five shallow classifiers in the framework, a shallow classifier located at the tail position G impedes the improvement of the accuracy of the deep classifier. For example, the deep classifier accuracy in T2-6 and T3-5 is lower than the baseline, while in T4-30 and T5-15 it is approximately equal to the baseline. When there are six or seven shallow classifiers in the framework, the deep classifier achieves its highest accuracy if no shallow classifier is positioned at the tail position G (T6-7). As a result, a shallow classifier located at the tail position G can have a negative impact on the self-distillation effect.

C. IMPACT OF SHALLOW CLASSIFIERS ON RSD
As shown in Fig. 7, in all cases with a single shallow classifier, shallow classifiers located at positions E, F, and G decrease the performance of the deep classifier. The deep classifier performs better with several shallow classifiers. Overall, the deep classifier with six shallow classifiers achieves the highest accuracy under RSD training: the highest accuracy is 80.9% (T6-6) and the lowest is 79.7% (T6-4), both higher than the corresponding highest and lowest accuracies under other numbers of shallow classifiers. Additionally, we find that lower accuracy of the deep classifier often occurs when shallow classifiers are located at the tail positions F and G (T3-35, T4-34, T5-21, T6-4).

D. COMPARISON OF THREE TRAINING METHODS
In general, the more shallow classifiers there are, the higher the average accuracy of the deep classifier under all three training methods, as shown in Fig. 8. The average accuracy of deep classifiers is higher under JT than under SD, indicating that joint training is itself a source of SD's effectiveness. When there are multiple shallow classifiers in the framework, RSD performs better than SD and JT, demonstrating its significant impact on model performance. Note that shallow classifiers located at the tail position reduce the effectiveness of all three training methods.

VI. CONCLUSION
In this article, we find that the effectiveness of self-distillation does not come from distillation alone: joint training within the framework, even without any distillation, can improve the performance of classification networks. In addition, the effect of self-distillation can be reduced by a training framework with too many parameters. Based on these findings, we propose a distillation method called RSD and demonstrate through comparisons that it is superior to both joint training and self-distillation. Furthermore, through extensive experiments on two datasets, RSD matches or exceeds existing strong distillation methods. Finally, by analyzing the effects of the number and positions of shallow classifiers on the three methods, we determine the optimal configuration of shallow classifiers for each method, further demonstrating the advantages of RSD.
Limitations: When a single shallow classifier is used for training, the average accuracy of the deep classifier under RSD is lower than under JT and SD. When multiple shallow classifiers are used, accuracy improves further, but at the cost of additional storage and computational complexity.
Future work: When multiple shallow classifiers are used for RSD training, we will investigate whether the shallow classifiers can be made lighter while preserving performance, to improve the efficiency of distillation. When a single shallow classifier is used, SD and JT perform better, so dynamically adjusting the training method is also a direction for future research.