1 Introduction

In recent years, deep neural networks (DNNs) have achieved substantial performance gains in various computer vision tasks such as image classification, image segmentation and object detection [1,2,3,4,5,6]. However, these gains come at the price of heavy computation and large models with huge numbers of network parameters. The large model size has become a bottleneck for edge-computing devices, whose data storage and computational resources are often too constrained to accommodate large deep network models.

Researchers have investigated different approaches for compressing deep network models. The existing works fall into four broad categories: (1) network pruning [7,8,9,10], (2) network quantization [11, 12], (3) compact network design [13, 14], and (4) knowledge transfer [15,16,17,18,19,20,21]. Network pruning and network quantization deal with existing large networks by removing less informative network parameters and reducing the number of bits used to represent the network parameters, respectively. Compact network design takes a more direct route, designing light and efficient network architectures (e.g., MobileNet) from scratch.

Fig. 1

The architecture of the proposed adversarial-based multifarious knowledge transfer network MKTN. First, the first teacher (Teacher1) is trained for image reconstruction, aiming to learn generative and low-level features such as the bird's boundary, while the second teacher (Teacher2) is trained on the same task as the student, aiming to learn discriminative and task-specific features such as the bird's head. Second, the two teacher networks transfer the learned features to the student under the guidance of an adversarial loss and a feature loss, respectively. Here, 'C' denotes the ConvBlock module for feature alignment

Knowledge transfer (KT) compresses networks by transferring distilled knowledge from one or multiple teacher networks to a compact student network. Inspired by Hinton's pioneering work on network distillation [22], several works have sought to address the constraints of its softened-softmax formulation, e.g., by transferring intermediate features [15,16,17, 23] or optimizing the student initialization [24, 25]. However, these methods typically employ a single teacher network, which tends to provide homogeneous rather than multifarious knowledge. Several recent works employ multiple teacher networks [26,27,28] or an assistant teacher [29] to learn richer knowledge, but they train the teacher networks under similar objectives and thus still tend to learn homogeneous rather than complementary and multifarious features.

We present an adversarial-based multifarious knowledge transfer network (MKTN) that transfers multifarious features from two complementary teachers to train a compact yet powerful student network. The design is motivated by the observation that a student often masters a subject by absorbing comprehensive rather than solely subject-specific knowledge, e.g., many champions of the International Physics Olympiad have very strong foundations in mathematics. We hence utilize two teacher networks in MKTN as illustrated in Fig. 1, one pre-trained under an image reconstruction objective and the other pre-trained under the same objective as the student network. The hypothesis is that the reconstruction task drives the teacher to learn more generative features about scene layouts and object structures (e.g., the bird's contour), which complement the discriminative and task-specific features (e.g., the bird's head). For knowledge transfer to the student, we introduce adversarial learning to transfer spatial and structural features as well as robustness from the reconstruction teacher, and distill discriminative features before the fully connected layer from the second teacher. Extensive experiments demonstrate the effectiveness of the proposed multifarious knowledge transfer network.

The contributions of this work can be summarized in three aspects. First, it proposes a multifarious knowledge transfer network that employs two complementary teachers pre-trained under different objectives to train a compact yet powerful student. Second, it introduces an adversarial learning strategy for effective knowledge transfer from a teacher network pre-trained under a reconstruction objective. Third, extensive evaluations verify our hypothesis that generative features can complement discriminative features in image classification and semantic segmentation tasks, and that combining the two types of complementary features helps train a compact and accurate student network effectively.

2 Related Works

2.1 Offline Knowledge Transfer

Knowledge transfer (KT) aims to train a compact yet powerful student network by transferring knowledge from one or multiple powerful teachers. There has been plenty of KT research based on one teacher and one student, such as attention transfer (AT) [16], neuron selectivity transfer (NST) [30] and factor transfer (FT) [17], which will not be covered in detail here. Our work is more related to KT methods that employ multiple teachers. For comparison, Yin et al. [26] pre-train multiple teachers with the same architecture for policy distillation on specific tasks, and Zhang et al. [28] distill knowledge of multiple self-supervised teacher models from soft probability distributions and internal representations. In [29], Mirzadeh et al. employ a teacher assistant to achieve multi-step knowledge distillation. While these methods provide fairly good performance improvements, they neglect the complementarity and diversity of the transferred features. To alleviate this problem, we propose a novel MKTN that transfers multifarious knowledge from two complementary teacher networks to empower a compact yet powerful student.

2.2 Reconstruction Learning

Auto-encoders, as neural-network-based feature extractors, have achieved great success in capturing abstract image features. Hinton [31] proposes that an auto-encoder can learn sufficient information for reconstructing the input images. Recently, many studies have shown that auto-encoders can improve classification accuracy by learning generative features. For example, Shin et al. [32] demonstrate that stacked auto-encoders can learn classification-related features effectively on complex datasets, and Ng et al. [33] introduce an auto-encoder-based method for learning a set of features with better classification capabilities. In addition, existing works [34,35,36] have demonstrated that a reconstruction task can improve classification accuracy by providing additional detailed features in large-scale image classification [34], domain adaptation [35] and open-set classification [36]. Inspired by these works, we incorporate a reconstruction module in the Teacher1 network to reconstruct the input image, aiming to learn reconstruction-sensitive yet classification-related features in an unsupervised way.

2.3 Adversarial Learning

Generative adversarial learning [37] was proposed to generate realistic-looking images by training a generator and a discriminator that compete with each other. Several studies [38,39,40] employ an adversarial loss to measure the feature difference between teacher and student networks trained on the same task. However, the teacher and student models in [38, 40] must have the same number of blocks, which limits their applicability. In this work, we introduce a discriminator to distinguish the transferred generative features of Teacher1 from the student's task-specific output. An advantage of adversarial learning is that the generator, i.e., the student network in the proposed MKTN, attempts to produce features so similar to Teacher1's that the discriminator cannot differentiate them.

In addition, existing works [41,42,43,44,45,46] apply adversarial learning for learning and transferring robustness. In particular, Liu et al. [41] propose an Adversarial Collaborative Knowledge Distillation (ACKD) method to build a more powerful student with an attention mechanism. Tang et al. [42] introduce Adversarial Variational Knowledge Distillation (AVKD) by estimating the KL-divergence between a prior p(x) over the latent variables and an approximate generative model \(q(x\vert y)\). Wang et al. [43] design a Harmonized Dense Knowledge Distillation (HDKD) training method for multi-exit architectures that incorporates all possible beneficial supervision information. Maroto et al. [44] present Adversarial Knowledge Distillation (AKD), which boosts a model's robustness by adversarially training a student on a mixture of the original labels and the teacher outputs. Dong et al. [45] employ an adversarial learning strategy as supervision to guide the lightweight student network to recover the knowledge of teacher networks, while the discriminator module learns to distinguish teacher features from student features. Ham et al. [46] propose a new knowledge distillation method, NEO-KD, which reduces adversarial transferability in the network while guiding the outputs of adversarial examples to closely follow the ensemble outputs of the neighboring exits on clean data, significantly improving overall adversarial test accuracy.

Inspired by the relevant work above, the discriminator in our MKTN is trained to empower the student with both the distilled knowledge and improved robustness.

3 Proposed Method

Compared with the deeper teacher network, the student does not have sufficient capacity to capture rich and comprehensive image features. Taking image classification as an example, the student could classify images better if it were equipped with generative yet latent features in addition to classification-sensitive features.

Following these intuitions, we build the MKTN architecture in three steps. (1) A candidate teacher network is first modified by adding reconstruction modules and a ConvBlock module, yielding Teacher1 and Teacher2 as shown in Fig. 1. (2) The two teacher networks are pre-trained with a reconstruction loss and a task-specific loss, respectively. The reconstruction loss, computed by minimizing the distance between the input and the generated image, drives Teacher1 to learn generative reconstruction representations in an unsupervised manner, while the task-specific loss drives Teacher2 to learn task-specific features with discriminative information. (3) Once the teacher networks are pre-trained, the distilled diverse yet complementary features are matched against the corresponding student outputs via an adversarial loss and a feature loss, enabling the student to absorb this knowledge and mimic the performance of its teachers. A sketch of the pre-training step is shown below; the individual losses and the student training step are detailed in Sects. 3.1-3.3.
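For illustration, a minimal PyTorch sketch of the teacher pre-training in step (2) might look as follows. The optimizer choice (SGD with momentum 0.9), the epoch count and the network interfaces are assumptions; the two loss callables correspond to Eq. (4) and Eq. (5) sketched in the following subsections.

```python
import torch

def pretrain_teachers(teacher1, teacher2, loader, recon_loss, cls_loss,
                      lr=0.1, epochs=1):
    # Step (2): pre-train Teacher1 for reconstruction and Teacher2 for the
    # task-specific objective. Optimizer settings here are assumptions.
    opt1 = torch.optim.SGD(teacher1.parameters(), lr=lr, momentum=0.9)
    opt2 = torch.optim.SGD(teacher2.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x_rec = teacher1(x)            # reconstructed image, Sect. 3.1
            logits = teacher2(x)           # classification output, Sect. 3.2
            for opt, loss in ((opt1, recon_loss(x, x_rec)),
                              (opt2, cls_loss(logits, y))):
                opt.zero_grad()
                loss.backward()
                opt.step()
```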

3.1 Generative Features Learning

Given a labeled dataset (X, Y), \(\tilde{X_{i}} = T_{1}(X_{i})\) denotes the reconstruction output of Teacher1, which has the same size as the input image \(X_{i}\). Here, \(i = 1,2,\dots ,M\), and M is the number of images.

Instead of using the \(L_{1}\) or \(L_{2}\) distance to directly measure the pixel-level difference between \(\tilde{X_{i}}\) and \(X_{i}\), we model their respective feature spaces as probability distributions before evaluating their similarity. As proposed by van der Maaten et al. [47], the conditional probability distribution expresses, for each sample, the probability of selecting each of its neighbors; applying this formulation is expected to better facilitate image reconstruction by describing the local relations between \(\tilde{X_{i}}\) and \(X_{i}\). We employ a cosine-similarity-based affinity metric as the kernel \(K_{\mathrm{cosine}}\), formulated as:

$$\begin{aligned} K_{\mathrm{cosine}}(m, n; \sigma ) = \frac{1}{2}\left( \frac{m^{T}n}{\Vert m\Vert _{2} \Vert n\Vert _{2}}+1\right) \in [0,1]. \end{aligned}$$
(1)

Therefore, the conditional probability distribution for the input image \(X_{i}\) is defined as:

$$\begin{aligned} p_{i \mid j} = \frac{K_{\mathrm{cosine}}(X_i, X_j)}{\sum _{m=1, m\ne j}^{M}K_{\mathrm{cosine}}(X_{m}, X_j)} \in [0,1], \end{aligned}$$
(2)

while for the generated image \(\tilde{X_{i}}\) as:

$$\begin{aligned} q_{i \mid j} = \frac{K_{\mathrm{cosine}}(\tilde{X_{i}}, \tilde{X_{j}})}{\sum _{m=1, m\ne j}^{M}K_{\mathrm{cosine}}(\tilde{X_{m}}, \tilde{X_{j}})} \in [0,1]. \end{aligned}$$
(3)

The conditional probabilities lie in [0, 1] and sum to 1, i.e., \(\sum ^{M}_{i=1, i\ne j}p_{i \mid j} = 1\) and \(\sum ^{M}_{i=1, i\ne j}q_{i \mid j} = 1\). \(\sigma\) denotes the parameter of the kernel K, and \(X_{m}\) denotes the mth input image. While training \(T_{1}\), the Kullback–Leibler (KL) divergence is applied as the reconstruction loss, formulated as:

$$\begin{aligned} L_{T_{1}} = \sum _{i=1}^{M} \sum _{j=1, i\ne j}^{M} p_{j \mid i}\log \left( \frac{p_{j \mid i}}{q_{j \mid i}}\right) . \end{aligned}$$
(4)
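A minimal PyTorch sketch of Eqs. (1)-(4) is given below, assuming the M images within a mini-batch play the role of the samples, images are flattened into vectors, and a small constant eps guards against division by zero and log of zero; the function names are ours.

```python
import torch
import torch.nn.functional as F

def cosine_kernel(x):
    # x: (M, ...) images; returns the (M, M) affinity matrix of Eq. (1),
    # with every entry in [0, 1].
    x = F.normalize(x.flatten(1), dim=1)
    return 0.5 * (x @ x.t() + 1.0)

def conditional_distribution(x, eps=1e-8):
    # Column-normalised affinities with the m == j terms excluded, Eqs. (2)-(3):
    # entry (i, j) approximates p_{i|j} (or q_{i|j} for the reconstructed images).
    k = cosine_kernel(x)
    k = k - torch.diag(torch.diag(k))          # remove self-affinities
    return k / (k.sum(dim=0, keepdim=True) + eps)

def reconstruction_loss(x, x_rec, eps=1e-8):
    # Kullback-Leibler divergence between the two distributions, Eq. (4).
    p = conditional_distribution(x)
    q = conditional_distribution(x_rec)
    mask = ~torch.eye(x.size(0), dtype=torch.bool, device=x.device)
    return (p[mask] * torch.log((p[mask] + eps) / (q[mask] + eps))).sum()
```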

3.2 Discriminative Features Learning

In the Teacher2 network, we first add a convolutional layer with batch normalization to align the transferred features. This is followed by an average pooling layer and a fully connected layer that produce the classification probabilities. Following the conventional practice in classification, we apply the cross-entropy function C against the labels Y to evaluate the classification result:

$$\begin{aligned} L_{T_{2}} = C(T_{2}(X), Y), \end{aligned}$$
(5)

where \(T_{2}(X)\) denotes the output of the fully connected layer in the Teacher2 network and \(L_{T_{2}}\) is the task-specific training loss for the Teacher2 network.
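A minimal sketch of the Teacher2 head described above is shown below; the 1x1 kernel size, channel width and class/argument names are assumptions, not part of the original specification.

```python
import torch
import torch.nn as nn

class ConvBlockHead(nn.Module):
    # Sketch of the Teacher2 head: a convolutional layer with batch
    # normalization for feature alignment ('C' in Fig. 1), followed by
    # average pooling and a fully connected classifier.
    def __init__(self, in_channels, aligned_channels, num_classes):
        super().__init__()
        self.align = nn.Sequential(
            nn.Conv2d(in_channels, aligned_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(aligned_channels),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(aligned_channels, num_classes)

    def forward(self, backbone_feat):
        aligned = self.align(backbone_feat)   # T2^f(X), later transferred to the student
        logits = self.fc(self.pool(aligned).flatten(1))
        return aligned, logits

# Eq. (5): task-specific loss of Teacher2, e.g.
# _, logits = head(backbone(x)); loss_t2 = nn.functional.cross_entropy(logits, y)
```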

3.3 Complementary Features Transferring

Once \(T_{1}\) and \(T_{2}\) converge, their parameters are frozen and the student network S is trained with the transferred knowledge, which corresponds to the generative reconstruction features \(T^{f}_{1}(X)\) before the deconvolution layers in \(T_{1}\) and the informative classification features \(T^{f}_{2}(X)\) before the pooling layer in \(T_{2}\). As illustrated in Fig. 1, the student S is trained with the adversarial loss, the feature loss and the task-specific loss simultaneously.

To compute the adversarial loss \(L_{D}^{s}\) between \(T_{1}\) and S, an adversarial learning strategy is introduced to assimilate the distilled knowledge \(T^{f}_{1}(X)\). The discriminator D attempts to classify its input \(\bar{x}\) by maximizing the following objective [37]:

$$\begin{aligned} L_{D}^{s} = E_{\bar{x}\sim P_{S}} \log (1-D(\bar{x})) + E_{\bar{x}\sim P_{T1}} \log D(\bar{x}), \end{aligned}$$
(6)

where \(\bar{x}\) is drawn from the concatenation of \(T_{1}^{f}(X)\) and the corresponding output \(S^{f}(X)\) before the fully connected layer in S. At the same time, S attempts to generate similar features that fool the discriminator by minimizing \(L_{D}^{s}\). D consists of three fully connected layers with ReLU activations, which provide more informative gradients for S. The last layer, with 2 output neurons, identifies whether the input features come from \(T_{1}\) or S.
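The sketch below illustrates one possible realisation of the discriminator and of Eq. (6). The hidden width (128), the use of a softmax over the 2 output neurons to obtain a probability, and the way the teacher and student features are fed to D are our assumptions/interpretation rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    # Three fully connected layers with ReLU; the last layer has 2 output
    # neurons identifying whether a feature comes from Teacher1 or the student.
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, feat):
        # Probability that the input feature originates from Teacher1.
        return F.softmax(self.net(feat), dim=1)[:, 1]

def adversarial_loss(disc, t1_feat, s_feat, eps=1e-8):
    # L_D^s of Eq. (6): the discriminator maximises it, the student minimises it.
    d_t1 = disc(t1_feat.flatten(1))
    d_s = disc(s_feat.flatten(1))
    return torch.log(1.0 - d_s + eps).mean() + torch.log(d_t1 + eps).mean()
```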

To compute the feature loss \(L_{\mathrm{fea}}^{s}\) between \(T_{2}\) and S, we first normalize (denoted as \(\eta (\cdot )\)) the transferred knowledge \(T^{f}_{2}(X)\) and the student feature \(S^{f}(X)\). The feature metric is then formulated as:

$$\begin{aligned} L_{\mathrm{fea}}^{s} = d(\eta (T^{f}_{2}(X)), \eta (S^{f}(X))), \end{aligned}$$
(7)

where the feature metric d can be evaluated by either \(L_{1}\) or \(L_{2}\) distance.
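A compact sketch of Eq. (7) follows; per-sample L2 normalisation of flattened features is our assumed choice for \(\eta(\cdot)\), and the function name is ours.

```python
import torch.nn.functional as F

def feature_loss(t2_feat, s_feat, p=1):
    # Eq. (7): distance between the normalised teacher and student features.
    # p=1 gives the L1 metric, p=2 the L2 metric.
    t = F.normalize(t2_feat.flatten(1), dim=1)
    s = F.normalize(s_feat.flatten(1), dim=1)
    return (t - s).abs().pow(p).sum(dim=1).pow(1.0 / p).mean()
```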

Therefore, the student can be trained with adversarial loss \(L_{D}^{s}\), feature loss \(L_{\mathrm{fea}}^{s}\) and task-specific loss \(L_{\mathrm{cls}}^{s}\) as follows:

$$\begin{aligned} L_{\mathrm{cls}}^{s}= & {} C(S(X), Y), \end{aligned}$$
(8)
$$\begin{aligned} L_{S}= & {} \alpha L_{D}^{s} + \beta L_{\mathrm{fea}}^{s} + L_{\mathrm{cls}}^{s}, \end{aligned}$$
(9)

where \(\alpha\) and \(\beta\) are weight parameters, and C(S(X), Y) is the standard cross-entropy between the student output S(X) and the ground-truth labels Y. During the student's learning process, the gradients of Eq. (9) are computed and back-propagated within S, guiding it to absorb the two teachers' knowledge.
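Putting the pieces together, a minimal sketch of one student update under Eq. (9) is shown below. The interfaces (student returning its pre-classifier feature and logits, a `.features(·)` accessor on the frozen teachers) are assumptions; `adversarial_loss` and `feature_loss` refer to the sketches above, and the discriminator would be updated in a separate step to maximise Eq. (6).

```python
import torch
import torch.nn.functional as F

def student_step(student, teacher1, teacher2, disc, optimizer, x, y, alpha, beta):
    # One student update under Eq. (9), with both teachers frozen (Sect. 3.3).
    with torch.no_grad():
        t1_feat = teacher1.features(x)   # T1^f(X), before the deconvolution layers
        t2_feat = teacher2.features(x)   # T2^f(X), before the pooling layer
    s_feat, logits = student(x)          # S^f(X) and classification output S(X)

    loss = (alpha * adversarial_loss(disc, t1_feat, s_feat)   # Eq. (6)
            + beta * feature_loss(t2_feat, s_feat)            # Eq. (7)
            + F.cross_entropy(logits, y))                     # Eq. (8)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```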

4 Experiments and Analysis

This section verifies the effectiveness of the proposed knowledge transfer network. It covers (1) datasets and evaluation metrics, (2) implementation details, (3) comparisons with the state-of-the-art, (4) ablation studies and (5) a discussion, as described in the following subsections.

4.1 Datasets and Evaluation Metrics

The proposed MKTN is evaluated on four datasets that have been widely used to study knowledge transfer [17,18,19, 21, 38, 48, 49]. CIFAR10 [50] and CIFAR100 [51] are two publicly accessible classification datasets. Both have 50,000 training images and 10,000 test images, with 10 and 100 image classes, respectively; all images are 32 \(\times\) 32 RGB pixels. ImageNet refers to the LSVRC 2015 classification dataset [52], which consists of 1.2 million training images and 50,000 validation images of 1000 classes. The PASCAL VOC 2012 dataset [53] is a standard semantic segmentation benchmark with 1,464 pixel-level image annotations for training. We use the augmented dataset provided by the extra annotations in [54], as employed in the baseline paper [55].

For evaluation metrics, we use the Top-1/5 mean classification error (%) for classification and the mean Intersection over Union (mIoU) for semantic segmentation. To measure the computational cost at the model inference stage, the number of floating point operations (FLOPs) is adopted in the discussion section.
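For reference, the two metrics can be computed as in the following sketch (standard definitions; the function names and the ignore-index convention are ours).

```python
import torch

def topk_error(logits, labels, k=1):
    # Mean Top-k classification error (%).
    topk = logits.topk(k, dim=1).indices
    correct = (topk == labels.unsqueeze(1)).any(dim=1).float().mean()
    return 100.0 * (1.0 - correct).item()

def mean_iou(pred, target, num_classes, ignore_index=255):
    # Mean Intersection over Union for semantic segmentation label maps.
    ious, valid = [], target != ignore_index
    for c in range(num_classes):
        p, t = (pred == c) & valid, (target == c) & valid
        union = (p | t).sum().item()
        if union > 0:
            ious.append((p & t).sum().item() / union)
    return sum(ious) / max(len(ious), 1)
```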

4.2 Implementation Details

We compare MKTN with several state-of-the-art KT methods, including knowledge distillation (KD) [22], attention transfer (AT) [16], neuron selectivity transfer (NST) [30], factor transfer (FT) [17], activation boundaries (AB) [25] and OFD [56]. For KD, the temperature of the softened softmax is fixed to 4 as in [22]. Following [16, 30], \(\beta\) of AT and NST is set to 1000 and 0.01, respectively. In MKTN, the balance weight \(\alpha\) of the adversarial loss is set to 100 consistently, and \(\beta\) of the feature loss is set to 100 for CIFAR and 10 for ImageNet. On the classification datasets, the teacher networks are pre-trained with an initial learning rate of 0.1 and a batch size of 64. On the segmentation dataset, all models are trained for 50 epochs, and the learning rate schedule follows the baseline paper [55]. All experiments are implemented in PyTorch on GPU devices.
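The MKTN hyper-parameters stated above can be collected into a small reference configuration as in the sketch below; the dictionary layout is ours, and settings not specified in this subsection (optimizer type, learning-rate schedule on classification data) are deliberately omitted.

```python
# Hyper-parameters of MKTN as reported in this subsection.
MKTN_CONFIG = {
    "alpha_adversarial": 100,                          # weight of L_D^s, all datasets
    "beta_feature": {"cifar": 100, "imagenet": 10},    # weight of L_fea^s
    "teacher_pretrain": {"lr": 0.1, "batch_size": 64}, # classification datasets
    "segmentation_epochs": 50,                         # schedule follows [55]
}
```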

4.3 Comparisons with the State-of-the-Art

CIFAR10 and CIFAR100: Several experiments are designed to compare the proposed MKTN with the state-of-the-art methods on CIFAR10 and CIFAR100. Specifically, different combinations of the backbone architectures ResNet [3], Wide ResNet (WRN) [57] and PyramidNet (PYN) [58] are employed to test various situations, as shown in Tables 1 and 2, where the numbers in parentheses are the network parameter sizes in millions. Teacher1, marked with *, is modified with reconstruction modules based on the same backbone as Teacher2.

Table 1 Comparison results of Top-1 classification error rate (%) with unitary features transferred methods KD [22], AT [16], FT [17], AB [25], NST [30] and OFD [56] over the dataset CIFAR10
Table 2 Comparison results of Top-1 classification error rate (%) with unitary features transferred methods AT [16], KD [22], FT [17], AB [25] and OFD [56] over the dataset CIFAR100

According to Tables 1 and 2, four conclusions can be drawn. (1) When trained from scratch, Teacher2\(\dagger\) obtains a lower Top-1 mean classification error rate than Student\(\dagger\), as expected, largely due to its deeper network structure and/or larger number of network parameters. (2) Students trained with the compared transfer methods AT, KD, NST, FT, AB and OFD (i.e., trained by transferring unitary features from a single teacher) perform better than the corresponding 'Student\(\dagger\)', demonstrating that the student can significantly improve its performance when equipped with more knowledge and features. On the other hand, the compared models perform better or worse than one another depending on the network pair used. (3) On CIFAR10, as shown in Table 1, MKTN outperforms all the compared methods consistently regardless of the networks used, whether the student and teacher networks have different depths (ResNet20/ResNet56), different types (ResNet20/WRN40-1), or a large depth gap (WRN16-1/WRN40-1, WRN16-2/WRN40-2). (4) On CIFAR100, as shown in Table 2, when the depth of the student network is reduced from 56 to 32, both MKTN-trained students outperform Teacher2\(\dagger\), which has a much deeper network architecture. This shows that a small network trained with proper knowledge distillation can have similar or even better representation capacity than a large one, and clearly suggests the scalability of the proposed knowledge transfer architecture. These results are largely due to the complementary instead of unitary feature transfer in MKTN, where the student effectively learns multifarious and complementary knowledge including both generative and discriminative features.

Table 3 Comparison results of Top-1/5 classification error rate (%) with unitary features transferred methods AT [16], FT [17], AB [25] and OFD [56] over the dataset ImageNet

ImageNet: To demonstrate the potential of MKTN to transfer more complex information, we conduct a large-scale experiment on the ImageNet LSVRC 2015 classification task. The student's performance is evaluated by the Top-1 and Top-5 mean classification error rates, as shown in Table 3. By absorbing the complementary features from the two diverse teachers, the MKTN-trained student consistently performs better; in particular, the MKTN-trained ResNet18 lowers the Top-1 error by about 1.82% compared to the same network trained from scratch (Student\(\dagger\)). This clearly demonstrates the adaptability of the proposed MKTN, which yields promising performance even on this more complex dataset.

4.4 Ablation Studies

Two sets of experiments are conducted to evaluate MKTN: the first studies the advantages of transferring knowledge from two complementary teachers, and the second studies the effects of different loss compositions in knowledge transfer.

Transfer with One or Two Teachers:

Table 4 Ablation studies of transferring with one or two teachers over CIFAR100
Table 5 Ablation studies of different transfer loss composition over CIFAR10 and CIFAR100

To make the study rigorous, we evaluate transfer performance in different scenarios, where the student and teacher networks have the same architecture but different depths (ResNet56/ResNet110) or have different architectures (VGG13/WRN46-4). As Table 4 shows, in both setups the MKTN(T1,T2)-trained student, which learns from both teachers, (1) clearly outperforms the MKTN(T2)- and MKTN(T1)-trained students that learn from a single teacher alone, and (2) achieves even lower error than Teacher\(\dagger\), which uses a larger network architecture. These promising results indicate the benefits of transferring both the generative features from Teacher1 and the discriminative task-specific features from Teacher2, which complement each other in guiding a compact student network effectively.

Transfer Losses and Transfer Strategies: When including the reconstruction Teacher1 only, as shown in the top part of Table 5, using a discriminator to compute the adversarial loss (denoted as \(L_{D}\)) between Teacher1 and student features achieves better performance than directly using \(L_1\) or \(L_2\) to compute pixel-level distances. This is largely attributed to the discriminator, which empowers the student model with robustness and captures the spatial information in the transferred features. When including the task-specific Teacher2 only, as shown in the middle of Table 5, using \(L_{1}\) to compute the feature loss before the fully connected layer clearly outperforms \(L_{2}\), with a 0.39 decrease in Top-1 classification error on CIFAR10. Nevertheless, performance becomes clearly worse if \(L_{1}\) is used to compute the feature loss after the fully connected layer (denoted as '\(L_{1}^{\diamond }\)'), illustrating that the dimension reduction in the fully connected layer results in a loss of informative knowledge and features. Overall, the lowest error rate is achieved when the adversarial loss \(L_{D}\) and the \(L_{1}\) loss are used together for knowledge transfer.

4.5 Discussion

Inference efficiency: MKTN is efficient during both training and inference. During training on CIFAR100, the MKTN-trained student learns multifarious features and knowledge and converges much faster than the student trained from scratch, as illustrated in Fig. 2. During inference, the MKTN-trained students (ResNet56 and ResNet32) achieve similar or even better accuracy than the teacher (ResNet101) while consuming much less energy (lower FLOPs) and shorter inference time, as shown in Table 6.

Fig. 2

Comparison of training loss curves on CIFAR100 when the student networks are trained with the proposed MKTN or from scratch

Table 6 Comparison results of #FLOPs and Inference time of each image on CIFAR100
Fig. 3

Visualizations of validation images from the ImageNet dataset by t-SNE. We randomly sample 10 classes within 1000 classes. Left is the single model result trained from scratch. Right is the result of our MKTN-trained student

Feature visualization: We visualize the distribution of features from the MKTN-trained student and the plain student trained from scratch. Figure 3 shows the t-SNE visualization [59] of 10 randomly selected ImageNet classes. The distribution of the MKTN-trained student features exhibits smaller intra-class variations and larger inter-class distances compared with the plain student (Student*). Figure 4 further illustrates the learnt features on four sample images. As Fig. 4 shows, Teacher1 learns more generative features (e.g., the bird's boundary, the car's outline) while Teacher2 learns more discriminative features (e.g., the bird's head, the car's wheels). The MKTN-trained student learns multifarious features that capture more useful information than those of the plain student (Student*). These illustrations align well with the quantitative image classification results in Tables 1, 2 and 3.

Fig. 4

Activation feature maps from different teachers and students for the ResNet56/ResNet110 pair. The results in the Teacher1, Teacher2, MKTN-Student and Student\(\dagger\) columns correspond to the output before the deconvolution modules in Teacher1, and the outputs before the classifier layer in Teacher2, the MKTN-trained student and the student trained from scratch, respectively

Semantic segmentation: We select DeepLabV3+ [55] as the base model to perform semantic segmentation on PASCAL VOC 2012 [53]. Specifically, DeepLabV3+ with a ResNet101 backbone is used as the teacher network, and DeepLabV3+ with ResNet18 or MobileNetV2 backbones as the student networks. Similar to previous work, the students are initialized with weights pre-trained on ImageNet. The results in Table 7 show that MKTN significantly improves the performance of the student networks; in particular, the MKTN-trained MobileNetV2 improves by 2.39 in mIoU. Beyond classification, MKTN thus also shows promising results in semantic segmentation.

Table 7 Semantic segmentation based on DeepLabV3+ [55] on the PASCAL VOC 2012 test data [53]

Comparison against KT methods using multiple teachers: We do not compare with [26, 28] because these two methods tackle deep reinforcement learning and video classification tasks. Instead, we perform a new experiment comparing MKTN with TAKD [29], which employs assistant teachers and similarly tackles image classification. As the results in Table 8 show, with ResNet26 as the teacher, the TAKD-trained ResNet8 and ResNet14 (with ResNet14 and ResNet20 as assistant teachers) obtain 11.99% and 8.77% classification error on CIFAR100, respectively. Our MKTN-trained ResNet8 and ResNet14 obtain lower classification error rates of 11.02% and 7.59%, respectively, under the same setup. This again demonstrates the capability of our knowledge transfer architecture MKTN, equipped with complementary teacher networks, to transfer multifarious features.

Table 8 Comparison with KT methods that employ multiple teachers on CIFAR100

5 Conclusion

A reconstruction task learns generative and low-level image representations, whereas a recognition task learns discriminative and task-specific representations. The features learned by the two tasks capture different characteristics of images and are usually complementary to each other. This paper presents a multifarious knowledge transfer network (MKTN) that employs two complementary teachers to transfer generative and discriminative features to train a compact yet powerful student network. The distilled features from the teacher networks are effectively transferred to the student network under the proposed adversarial loss and feature loss, which guide the student to learn spatial-level and pixel-level information, respectively. Extensive experiments show that the MKTN-trained student achieves superior performance despite its much smaller model size. We will adapt the MKTN idea to other vision tasks such as object detection and recognition in future work.