Introduction

Deep neural networks (DNNs) have shown impressive results on tasks such as image classification [1,2,3], object detection [4,5,6], and natural language processing [7]. As network models have developed, building ever more sophisticated networks to achieve higher accuracy has become a trend [8, 9]. These large, complex networks, however, do not run effectively on mobile or Internet of Things devices. Therefore, alleviating a model's computational burden while maintaining high accuracy is one of the main problems facing DNNs. Pruning, knowledge distillation, quantization, and lightweight network design have become practical ways to reduce the considerable computational resources required [10].

Pruning methods [11] make a model simpler and more efficient by eliminating redundant parameters or connections according to some importance measure, which is why pruning is a popular technique. Although pruning can reduce model size while largely maintaining performance, some precision loss is unavoidable. The goal of knowledge distillation is to guide student learning with a more robust teacher model, enabling a simpler student to acquire a certain degree of the teacher's capability [12]. This, however, requires researchers to manually select both the teacher and student models.

Knowledge distillation is an effective way to compensate for the precision loss caused by pruning. However, applying knowledge distillation only after pruning has been completed, rather than while it is in progress, may result in suboptimal model performance. In addition, most previous studies have focused on improving the performance of unstructured pruning [13, 14], while there have been few studies on structured pruning [15, 16]. In fact, unstructured pruning needs special software libraries or hardware to speed up the network model, whereas structured pruning can compress the network without such support [17]. Therefore, combining structured pruning with distillation learning is more practical.

To address the image classification problem, we present a new progressive multi-level distillation method for structured pruning. In this paper, the original and pruned networks are treated as the teacher and student models, respectively, avoiding the need to manually select a teacher model. Moreover, we take full advantage of the characteristics of structured pruning, using each pruned block of the student network and its corresponding teacher block as inputs to a feature-representation-based distillation loss. The number of such block pairs gradually increases as pruning progresses, forming a progressive distillation. In addition to feature learning, our proposed multi-level distillation includes response-based learning, which lets the student mimic the logits output of the teacher model. In this way, our approach effectively reduces accuracy loss, allowing the pruned network to minimize model size and computational cost within an acceptable range of accuracy degradation.

The contributions of this paper are as follows.

  1. This paper proposes a progressive multi-level distillation learning approach for structured pruning networks. We also validate the proposed method on different pruning rates, pruning methods, network models, and three public datasets (CIFAR-10/100 and Tiny-ImageNet).

  2. Compared with other knowledge distillation methods, our proposed method can better restore the structured pruning network's accuracy and improve the model's performance after each pruning.

  3. We conduct ablation experiments to further understand the contribution of each loss to our proposed framework.

  4. We show that distillation learning during pruning, rather than after pruning, improves model performance without additional inference time.

Related work

Network pruning

In earlier studies on pruning, the focus was more on pruning at the granularity of individual weights or neurons, i.e., unstructured pruning. Optimal Brain Damage [18] and Optimal Brain Surgeon [19] assessed the significance of weights based on second-order derivative information of the loss function. More directly, Han et al. [20] judged parameters to be significant or insignificant depending on whether they were larger or smaller than a given threshold. While these methods lead to high compression ratios, they only change the weight matrix from dense to sparse, and unstructured pruning does not yield the expected results without specialized software libraries or hardware support [17].

On the other hand, the pruning granularity of structured pruning is an entire structure. For example, Li et al. [21] ranked the filters of each layer according to the sum of the absolute filter weights (i.e., the L1-norm) to determine their importance. Zhuang et al. [22] considered sparse filters non-critical; removing unimportant filters by imposing a scaling factor on Batch Normalization (BN) layers has also proven efficient [23]. In a recent study, Lin et al. [24] concluded that the rank of the feature map is more representative of the amount of information contained in a filter, which can lead to promising results.
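As an illustration of this family of criteria, the following is a minimal PyTorch sketch of L1-norm filter ranking in the spirit of Li et al. [21]; it is a simplified reimplementation for exposition, not the authors' original code, and the `prune_ratio` argument is an illustrative parameter.

```python
import torch
import torch.nn as nn

def l1_filter_scores(conv: nn.Conv2d) -> torch.Tensor:
    # Score each output filter by the sum of its absolute weights (L1-norm);
    # smaller scores indicate less important filters.
    # conv.weight has shape (out_channels, in_channels, kH, kW).
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def filters_to_prune(conv: nn.Conv2d, prune_ratio: float = 0.6) -> torch.Tensor:
    # Return the indices of the filters with the smallest L1-norm.
    scores = l1_filter_scores(conv)
    num_prune = int(prune_ratio * scores.numel())
    return torch.argsort(scores)[:num_prune]
```

The returned indices would then be used to remove whole filters (and the matching input channels of the following layer), which is what lets structured pruning accelerate the network without sparse-computation support.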

Knowledge distillation

The initial knowledge distillation work [25] argued that one-hot labels limit the performance of the network model and that the soft labels of a more robust network provide richer information, allowing knowledge to be transferred from a larger teacher network to a smaller one and thereby bridging the gap between them. Besides extracting knowledge from the logits output, intermediate representations inside the teacher, in the form of feature maps, can also be learned by the student model. FitNet [26] first proposed distilling knowledge from a single intermediate layer. AT [27] extended this idea by extracting knowledge from multiple intermediate layers of the teacher model to guide student learning, applying L2 normalization to each feature map so that every pair of feature maps has consistent dimensions. However, knowledge from deeper intermediate layers may provide students with overly standardized guidance, while knowledge from shallower layers may not provide enough guidance [12], which results in inefficient knowledge transfer. Relation-based distillation learning further explores knowledge-transfer relationships between different layers or data samples. Yim et al. [28] used the relationships between layers of the teacher network as the learning target for the student model. SP [29] aims to preserve the student's pairwise similarities rather than mimicking the teacher's representation space, so that the student can better capture the relationships between instances. Furthermore, in addition to the classification tasks mentioned above, knowledge distillation methods have also proven effective in more complex tasks such as object detection [30, 31].

Model pruning and knowledge distillation are two independent parts of model compression, and how to combine them is a question worth discussing. The simplest way is to apply knowledge distillation after pruning is completed [15, 32]. However, we show that using distillation learning during the fine-tuning stage of pruning yields better results, as demonstrated in "Two combined strategies". Furthermore, it is also necessary to validate the efficacy of distillation learning for structured pruning networks on various model architectures and public datasets.

Quantization

The memory footprint and inference cost of a model can be effectively reduced by lowering the number of bits used to represent the original weights; this technique is known as quantization. Gong et al. [33] showed that quantizing the weights with K-means clustering could compress a network model by a factor of 8–16 with minimal or no performance impairment. In extreme cases, weights can even be represented as one-bit values, yielding a binarized network [34] and significantly reducing computational consumption. Han et al. [35] integrated pruning, quantization, and Huffman coding for deep model compression, providing a solution for deployment on devices with low energy consumption.
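For concreteness, a rough sketch of K-means weight sharing in the spirit of Gong et al. [33] is shown below; the cluster count and the use of scikit-learn's KMeans are illustrative assumptions, not details taken from the original work.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_quantize(weight: torch.Tensor, n_clusters: int = 16):
    # Replace every weight with its nearest cluster centroid, so that only the
    # centroids and per-weight cluster indices need to be stored.
    flat = weight.detach().cpu().reshape(-1, 1).numpy()
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    centroids = torch.from_numpy(km.cluster_centers_.squeeze(1)).to(weight)
    indices = torch.from_numpy(km.labels_).long().to(weight.device)
    quantized = centroids[indices].reshape(weight.shape)
    return quantized, centroids, indices
```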

The proposed method

Figure 1 gives an overview of the progressive multi-level distillation learning approach for structured pruning. During structured pruning, the original network and the pruned network are treated as the teacher and student models, respectively, and the proposed method is applied in the fine-tuning process. Compared with using knowledge distillation only after pruning is completed, our approach increases training time but improves model performance. Although the structure of the network model (i.e., the number of channels) changes continually with pruning, we show that performance can be improved after every pruning step without adjusting the hyperparameters. The algorithm flow is illustrated in Algorithm 1. The proposed approach is described in more detail in the following sections.

Fig. 1

An overview of the proposed progressive multi-level distillation method for structured pruned networks. In the figure, the network is divided into three blocks; the network after pruning each block serves as the student and the well-trained network as the teacher. Students #1 to #3 represent the student models obtained after sequentially pruning the first to the third block of the teacher model. As pruning progresses, the number of intermediate features from which knowledge is extracted increases, maximizing the use of the pruning procedure for distillation learning. The adaptation layer makes the feature-map dimensions of a student block match those of the corresponding teacher block. Note that a block can contain more than one convolutional layer or sub-block

Algorithm 1
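For orientation, the sketch below shows how the overall loop of Algorithm 1 might be organized in PyTorch. `prune_block`, `total_loss`, and the `return_features=True` interface are hypothetical placeholders standing in for the operations described in the text, not the paper's released implementation.

```python
import torch

def progressive_prune_and_distill(teacher, student, blocks, train_loader,
                                  prune_block, total_loss, finetune_epochs=40):
    teacher.eval()                                # the teacher stays frozen
    pruned_blocks = []
    for b in blocks:                              # prune one block at a time
        prune_block(student, b)                   # structured pruning of block b
        pruned_blocks.append(b)
        optimizer = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9)
        for _ in range(finetune_epochs):          # fine-tune with distillation
            for images, labels in train_loader:
                with torch.no_grad():
                    t_feats, t_logits = teacher(images, return_features=True)
                s_feats, s_logits = student(images, return_features=True)
                # only the blocks pruned so far contribute feature-level losses
                loss = total_loss(s_feats, t_feats, s_logits, t_logits,
                                  labels, pruned_blocks)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```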

Progressive feature distillation

As mentioned in "Knowledge distillation", intermediate knowledge from deep layers can easily over-regularize the student model, while intermediate knowledge from shallow layers may fail to provide guidance. Therefore, effectively transferring the teacher's knowledge to the student is a critical issue. As shown in Fig. 2, unlike FitNet [26] and AT [27], which distill knowledge from fixed intermediate blocks, we exploit the characteristic of structured pruning that each block is pruned in turn, so that each pruned block becomes a mentee and the corresponding unpruned block in the teacher model becomes its mentor. Although there is a significant deviation between the pruned block and the original one, the corresponding feature pairs can effectively transfer intermediate knowledge and achieve better performance recovery. As illustrated in Fig. 1, when pruning begins, the number of pruned blocks is small, and only shallow intermediate knowledge is used as a guide. As the number of pruned blocks increases, the corresponding loss of information grows, so the deep intermediate knowledge becomes useful, avoiding over-regularization of the student model and compensating for the loss of representational power caused by pruning.

Fig. 2

a and b illustrate the extraction of a single intermediate feature (FitNet [26]) and of all intermediate features (AT [27]), respectively

During pruning of the student model, structured pruning removes insignificant channels, which leads to a mismatch in the number of channels between the two models. Using an adaptation layer consisting of a pointwise convolution (1 × 1 kernel) and a BN layer, we map the student channels to their corresponding teacher counterparts, allowing for more efficient knowledge extraction and reducing the differences in feature maps between the pruned and the original model. We present the distillation loss of an individual block as follows,

$$ l_{\text{block}}^{\text{intermediate}} = D_{p}\left(\mathbf{F}_{t}, r\left(\mathbf{F}_{s}\right)\right), $$
(1)

where \(\mathbf{F}_{s}\) denotes a feature map of the student model and \(\mathbf{F}_{t}\) denotes the feature map of the corresponding teacher block. \(r(\cdot)\) is a regressor consisting of a 1 × 1 convolutional layer and a BN layer, and \(D_{p}\) measures the \(L_{2}\) distance between the student and teacher feature maps. The overall distillation loss based on feature representation can be expressed as follows,

$$ L^{\text{intermediate}} = \sum_{b = 1}^{B} l_{b}^{\text{intermediate}}, $$
(2)

where \(B\) is the number of pruned blocks. This loss allows the student model to learn the features of the teacher model efficiently during the structured pruning process.
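A minimal PyTorch sketch of the adaptation layer and the losses in Eqs. (1) and (2) is given below, assuming the feature maps of the blocks pruned so far are collected in lists; whether \(D_{p}\) is a mean or a sum of squared differences is a convention the text does not pin down, and mean squared error is used here.

```python
import torch.nn as nn
import torch.nn.functional as F

class AdaptationLayer(nn.Module):
    # 1x1 convolution + BN mapping student channels to teacher channels,
    # corresponding to the regressor r(.) in Eq. (1).
    def __init__(self, s_channels: int, t_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(s_channels, t_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(t_channels)

    def forward(self, x):
        return self.bn(self.conv(x))

def feature_distillation_loss(t_feats, s_feats, adapters):
    # Accumulate the L2 distances between teacher features and adapted student
    # features over the blocks pruned so far (Eq. (2)).
    loss = 0.0
    for f_t, f_s, adapt in zip(t_feats, s_feats, adapters):
        loss = loss + F.mse_loss(adapt(f_s), f_t)
    return loss
```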

Output logits distillation learning

Multi-level distillation learning has been shown to perform better than single knowledge distillation methods for image classification [36] and object detection [37]. Therefore, we extend this concept to the pruning process. Apart from the feature-representation-based knowledge distillation described above, our approach also includes distillation learning that mimics the output logits. Imitating the softened teacher outputs allows the student to learn more from the teacher model. We use the Kullback–Leibler divergence between the student and teacher outputs as the distillation loss for output imitation, with a temperature \(\tau\) softening the outputs of each student–teacher pair. This enables the student model to learn the predictions of the high-performance teacher model more efficiently, which can significantly reduce the classification error rate. The softened softmax function and the overall output imitation loss are shown below,

$$ X_{ij} = \frac{\exp\left(x_{ij}/\tau\right)}{\sum_{c = 1}^{C}\exp\left(x_{ic}/\tau\right)}, $$
(3)
$$ L^{\text{output}} = \sum_{i = 1}^{S}\sum_{j = 1}^{C} X_{ij}^{T}\log\left(X_{ij}^{T}/X_{ij}\right), $$
(4)

where \(x_{ij}\) is the student's logit for the \(j\)th class of the \(i\)th batch sample, \(S\) is the number of samples in a mini-batch, and \(C\) is the number of classes. \(X_{ij}\) and \(X_{ij}^{T}\) are the softened softmax outputs of the student and teacher models, respectively, where the superscript \(T\) denotes the teacher; \(X_{ij}^{T}\) is also computed with Eq. (3). The temperature hyperparameter \(\tau\) controls the degree of softening.
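The softened-output loss of Eqs. (3) and (4) corresponds to a standard temperature-scaled KL divergence, sketched below; the temperature value shown is illustrative, since the text does not report the value used in the experiments.

```python
import torch.nn.functional as F

def output_distillation_loss(student_logits, teacher_logits, tau: float = 4.0):
    # KL divergence between temperature-softened teacher and student outputs
    # (Eqs. (3) and (4)); 'sum' matches the double sum over samples and classes.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='sum')
```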

Total loss

In addition to the feature and output imitation learning described above, each student model is also trained with the classical cross-entropy loss between the ground-truth labels and the student's output logits, which helps the model learn the given dataset, as shown in the following equations,

$$ \widetilde{X}_{ij} = \frac{\exp\left(x_{ij}\right)}{\sum_{c = 1}^{C}\exp\left(x_{ic}\right)}, $$
(5)
$$ L^{CE} = -\sum_{i = 1}^{S}\sum_{j = 1}^{C} Y_{ij}\log\left(\widetilde{X}_{ij}\right), $$
(6)

where \(\widetilde{X}_{ij}\) represents the student's softmax output for the \(j\)th class of the \(i\)th batch sample, and \(Y_{ij}\) denotes the corresponding ground-truth label.

Our proposed progressive multi-level distillation learning is a weighted combination of the three losses described above; only the parameters of the student network are updated during the training phase, allowing better accuracy recovery of the pruned model. It is expressed as follows,

$$ L = \alpha L^{\text{intermediate}} + \beta L^{\text{output}} + \gamma L^{CE}. $$
(7)

A grid search gives optimal weights of \(\alpha = 0.25\), \(\beta = 0.1\), and \(\gamma = 0.9\), and we use these hyperparameters in all subsequent experiments. Note that the proposed method does not increase the inference time of the model, and it is orthogonal to techniques such as quantization.
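Putting the pieces together, a sketch of the combined objective in Eq. (7) with the grid-searched weights might look as follows; it reuses the illustrative `feature_distillation_loss` and `output_distillation_loss` sketches above, and the reduction convention (mean versus sum over the batch) is left as in the standard PyTorch losses.

```python
import torch.nn.functional as F

def total_loss(s_feats, t_feats, s_logits, t_logits, labels, adapters,
               alpha=0.25, beta=0.1, gamma=0.9, tau=4.0):
    # Weighted combination of the three losses in Eq. (7).
    l_feat = feature_distillation_loss(t_feats, s_feats, adapters)   # Eq. (2)
    l_out = output_distillation_loss(s_logits, t_logits, tau)        # Eq. (4)
    l_ce = F.cross_entropy(s_logits, labels)                         # Eq. (6)
    return alpha * l_feat + beta * l_out + gamma * l_ce
```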

Experiments

The effectiveness of this method is evaluated by comparing it with existing methods. Implementation details are given in "Implementation details". In "Main results", the superiority of our approach is demonstrated on publicly available datasets. The ablation experiments and the effects of different combination strategies are discussed in the later sections.

Implementation details

We apply L1-norm pruning [21] and HRank pruning [24] to VGGNet-16 [38], ResNet-56 [39], and GoogLeNet [40]. The locations of the selected feature-distillation blocks are shown in Fig. 3. In addition, to enable a more comprehensive assessment of the usability of the proposed method, we also validate it under different per-layer pruning rates: 60%, 70%, and the appropriate pruning rate (APR) given by HRank, as shown in Table 1. All experiments are performed with PyTorch on an NVIDIA GeForce GTX 1080Ti GPU. The resource costs of the models at various pruning rates on CIFAR-10 are shown in Table 2.

Fig. 3

Block selection locations of the network models. From left to right, VGGNet with plain block, GoogLeNet with inception block, and ResNet with residual block

Table 1 The appropriate pruning rate of models
Table 2 Comparison of the number of parameters and FLOPs for the model at various pruning rates in the CIFAR-10 dataset

In order to demonstrate the effectiveness of the proposed approach, we compare it with the following representative approaches. Baseline is the result of pruning without the use of a distillation method. Details are as follows.

  (a) KD [25]: Uses KL divergence to bring the softmax outputs of the teacher and student closer, transferring knowledge and reducing the classification error of the student model.

  (b) FitNet [26]: Extracts the knowledge of a single intermediate layer of the well-trained teacher network and uses it to guide the student's learning. Knowledge distillation is accomplished by optimizing the distance between the student's and teacher's intermediate-layer features.

  (c) AT [27]: Improves student network performance by transferring the attention maps of the teacher network so that the student can learn more useful information.

  (d) SP [29]: Uses pairwise activation similarities within each mini-batch to train the student, encouraging the student model to preserve pairwise similarities in its own representation space rather than mimicking the teacher's representation space (a brief sketch of this pairwise-similarity loss is given after this list).
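As referenced in item (d), a simplified sketch of the similarity-preserving loss of SP [29] is shown below; it is a reimplementation of the published formulation rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def sp_loss(f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
    # Match the row-normalized pairwise activation-similarity matrices of a
    # mini-batch instead of the features themselves.
    b = f_student.size(0)
    a_s = f_student.view(b, -1)
    a_t = f_teacher.view(b, -1)
    g_s = F.normalize(a_s @ a_s.t(), p=2, dim=1)
    g_t = F.normalize(a_t @ a_t.t(), p=2, dim=1)
    return (g_s - g_t).pow(2).sum() / (b * b)
```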

Main results

CIFAR-10/100

CIFAR-10 [41] has 50,000 training and 10,000 test images divided into 10 classes. CIFAR-100 [41] has the same numbers of training and test images as CIFAR-10, but the images are classified into 100 categories. To obtain the pre-trained model to be pruned, we use an SGD optimizer with a momentum of 0.9, a weight decay of 0.0005, and an initial learning rate of 0.1, train for 350 epochs, and multiply the learning rate by 0.1 at epochs 175 and 262. The batch size is set to 64. After pruning each layer, we fine-tune for 40 epochs with a learning rate of 0.01, divided by 10 at epochs 5 and 10.
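For reference, the reported pre-training schedule corresponds to a standard SGD plus MultiStepLR setup like the following sketch, where `model` is a placeholder for the network being pre-trained (e.g., VGGNet-16).

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)  # placeholder for VGGNet-16/ResNet-56/GoogLeNet

# SGD with momentum 0.9, weight decay 5e-4, initial learning rate 0.1;
# the rate is multiplied by 0.1 at epochs 175 and 262 of 350.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[175, 262], gamma=0.1)
```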

Table 3 shows that even on the more easily classified CIFAR-10 dataset, pruning still impairs model performance, even at lower pruning rates. Compared with other distillation methods, ours allows the model to recover the most lost accuracy during the fine-tuning phase. In particular, on GoogLeNet with a 60% pruning rate, almost lossless pruning is achieved (only a 0.04% drop in accuracy compared with the teacher). Our method improves accuracy on ResNet-56 at a 60% pruning rate by only 0.28%, but the other methods improve it by at most 0.11%. GoogLeNet achieves the best improvement of 0.93% at APR, while the other methods achieve at most 0.4%. These results provide evidence that our framework transfers knowledge more effectively.

Table 3 The other methods are compared with ours in the case of different pruning rates on CIFAR-10

Figure 4 compares the running time of VGGNet with different knowledge distillation approaches at a 60% pruning rate on CIFAR-10. FitNet [26] has a running time close to that of the proposed approach, but it only improves performance by 0.15% (ours improves it by 0.64%). SP [29] and AT [27] do not effectively compensate for the loss of accuracy while consuming significant runtime resources. The KD [25] method requires less runtime but offers relatively limited precision recovery. Figure 4 shows that our approach obtains the best results with fewer resources.

Fig. 4

Time comparison of various methods for VGGNet at 60% pruning rate of CIFAR-10. Baseline indicates that no knowledge distillation method is used

Table 4 shows that on the more challenging CIFAR-100 dataset, our method achieves its most considerable boost on VGGNet, 2.09% over the baseline (at a 60% pruning rate). While ResNet-56 gains only 0.41% accuracy at a 70% pruning rate, the other methods improve by at most 0.13%. Compared with the results on CIFAR-10, the proposed method generally improves more on CIFAR-100, which may be explained by the larger performance gap between the teacher and student models. However, as described in [42], a better-performing teacher model does not necessarily yield better distillation, as we observe on ResNet-56 at a 70% pruning rate for both datasets (note that the performance gap between teacher and baseline is much larger on CIFAR-100).

Table 4 The other methods are compared with ours in the case of different pruning rates on CIFAR-100

Both Tables 3 and 4 show that our method achieves superior results across different pruning rates, models, and pruning methods, and improves performance consistently.

Tiny-ImageNet

Tiny-ImageNet [43] consists of 100,000 training and 10,000 validation images covering 200 classes, and we resize the inputs to 32 × 32. To obtain a pre-trained model, we use an SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005, train for 120 epochs with a learning rate of 0.01, and multiply it by 0.1 at epochs 30, 60, and 90. The fine-tuning strategy after pruning is the same as for CIFAR-10/100.

As shown in Table 5, on the larger Tiny-ImageNet dataset, some knowledge distillation methods do not work as well as on CIFAR-10/100. However, our method still recovers the lost performance, as is clearly observed at a 60% pruning rate. The proposed method also obtains better results than the other methods at a 70% pruning rate, allowing the model to recover 0.52% accuracy. This result shows that our method remains effective even on larger datasets that are more difficult to classify.

Table 5 The other methods are compared with ours in the case of different pruning rates on Tiny-ImageNet

Improving each pruning

As described in the previous section, our approach achieves promising results at the end of pruning. However, we also hope it improves performance after each pruning step. Figures 5, 6, and 7 illustrate the performance changes of VGGNet-16, GoogLeNet, and ResNet-56 after each pruning step on the CIFAR-100 dataset at different pruning rates, respectively. The blue line indicates the baseline accuracy, i.e., the results obtained without distillation. The orange line shows the results obtained with the proposed distillation. The dotted line represents the accuracy of the unpruned teacher model. Pruning each convolutional layer removes filters, which lowers accuracy. As shown in Figs. 5, 6 and 7, our method is effective throughout the pruning process, which means the proposed approach remains practical even before pruning is completed. Moreover, the performance of the student even exceeds that of the teacher model during the early stages of pruning, which further demonstrates that our approach combines structured pruning and distillation learning effectively.

Fig. 5

Performance change of VGGNet after pruning per layer on CIFAR-100. a, b and c represent the different pruning rates, respectively

Fig. 6

Performance change of GoogLeNet after pruning per layer on CIFAR-100. a, b and c represent the different pruning rates, respectively

Fig. 7

Performance change of ResNet-56 after pruning per layer on CIFAR-100. a, b and c represent the different pruning rates, respectively

Ablation study

To further analyze the contribution of each of our proposed losses, we add the ablated components step by step and observe their effects. We perform experiments on VGGNet-16 with a 70% pruning rate on CIFAR-10, as shown in Table 6, where the baseline refers to the pruning process without our method. It can be observed that the feature distillation proposed in "Progressive feature distillation" yields the largest performance improvement, and the progressive mechanism built on the pruning process further improves feature distillation. In conclusion, the weighted combination of the proposed components compensates for the performance loss due to pruning as much as possible.

Table 6 Ablation experiments of the proposed method

Two combined strategies

The strategy for combining distillation learning and pruning can be broadly divided into two categories: applying distillation after pruning is completed and applying it during pruning. Figure 8 compares the performance and training time of our proposed method under these two strategies. Using distillation during the pruning process leads to higher precision recovery, but it also requires more training time. Thanks to the progressive distillation design, our method takes less time and achieves greater performance gains than AT [27]. FitNet [26], although faster to train, achieves an accuracy improvement that is not even as good as applying our distillation method only after pruning is completed. These results demonstrate that our method can recover the lost accuracy within a relatively short training time without affecting the inference speed of the model.

Fig. 8

Comparison of the performance and training time of the proposed approach when applied only after pruning is finished versus during pruning. a, b, and c represent the VGGNet-16, GoogLeNet, and ResNet-56 models under APR on CIFAR-100, respectively

Conclusion and future work

We propose a progressive multi-level distillation learning method to alleviate the accuracy drop caused by structured pruning. This method takes advantage of the characteristics of structured pruning, allowing the pruned network to learn more information from the teacher network. Experiments on different datasets, model architectures, and pruning rates show that the proposed approach achieves better performance than other approaches and that the model's accuracy is improved after every pruning step. Further experiments demonstrate that applying the proposed method during the pruning process enhances model performance more effectively. Our approach is efficient in training time and does not influence inference time. Our study provides a valuable way to better integrate pruning and distillation learning.

In future work, we hope to extend this idea to more complicated tasks such as object detection and semantic segmentation. Furthermore, it is also worth exploring how to better combine pruning, knowledge distillation, and other compression techniques such as quantization.