Zero-Shot Knowledge Distillation Using Label-Free Adversarial Perturbation With Taylor Approximation

Knowledge distillation (KD) is one of the most effective neural network light-weighting techniques when training data is available. However, KD is seldom applicable in environments where training data is difficult or impossible to access. To solve this problem, complete zero-shot KD (C-ZSKD) based on adversarial learning has recently been proposed, but the so-called biased sample generation problem limits its performance. To overcome this limitation, this paper proposes a novel C-ZSKD algorithm that utilizes a label-free adversarial perturbation. The proposed adversarial perturbation derives a squared-gradient-norm constraint by using the convolution of probability distributions and a second-order Taylor series approximation. The constraint serves to increase the variance of the adversarial sample distribution, which allows the student model to learn the decision boundary of the teacher model more accurately without labeled data. Through an analysis of the distribution of adversarial samples in the embedding space, this paper also provides insight into the characteristics of adversarial samples that are effective for adversarial learning-based C-ZSKD.


I. INTRODUCTION
With the advent of effective solutions [1], [2] to the gradient vanishing problem, deep neural networks that provide high recognition performance have developed rapidly. However, the still-high computational cost of deep neural networks hinders the practical use of deep learning. Hinton et al. first introduced the concept of knowledge distillation (KD) to lighten neural networks effectively [3]. KD is a technique that transfers knowledge from a large network performing a similar task to a relatively small network. KD thus not only allows small networks to overcome their training limitations, but ultimately aims to give them the same performance as large networks. Conventional KD techniques implicitly assume that training data is always available. However, it is often impossible to access training data due to legal factors such as personal information protection. In this case, conventional KD techniques cannot be used. To overcome this limitation, zero-shot KD (ZSKD) algorithms have recently been proposed. One ZSKD approach is generalized ZSKD (G-ZSKD) [4], [5], which utilizes public data or meta data instead of the original training data. The other is complete ZSKD (C-ZSKD) [6], [7], which does not use external training data at all. A G-ZSKD method treats information on layer-wise activations recorded when training the teacher model as meta data [4]. The meta data is converted into knowledge that is used for training the student model. C-ZSKD instead creates the training data itself rather than relying on external datasets such as meta data and public datasets. Nayak et al. generated a pseudo dataset similar to the dataset used to train the teacher model using a parametric distribution, and then trained the student model [6]. Micaelli et al.
adopted the adversarial learning (AL) of the generator and student model to transfer the decision boundary information of the teacher model to the student model [7]. Micaelli's method is relatively easy to apply to various computer vision tasks such as segmentation and object detection because it has no statistical restriction, unlike Nayak's.
However, C-ZSKD based on AL suffers from the so-called biased sample generation problem, in which adversarial samples produced by the generator are mapped only to particular decision boundaries of the teacher model. Figure 1 illustrates the phenomenon: the output distribution of the teacher model receiving adversarial samples is biased toward a specific class. This prevents the student model from accurately learning the teacher model's decision boundary through adversarial samples. To address such bias, conventional adversarial attack techniques based on adversarial samples apply perturbations in the label space [8]–[10]. However, these techniques require labeled data and cannot be used in zero-shot learning. To solve this problem, we propose a label-free adversarial perturbation. Specifically, by estimating the perturbed sample distribution using a Taylor approximation, the sample variance is increased without labels. In addition, to guarantee the convergence of learning with perturbed adversarial samples, a constraint that yields a tight upper bound is presented. As a result, effective learning is realized because the adversarial samples generated by the proposed method delineate the decision boundary of the teacher network more clearly. The contributions of this paper are summarized as follows.
• To solve an inherent biased sample generation problem of AL-based C-ZSKD, we propose a method to increase the variance of the adversarial sample distribution by using the convolution of probability distributions and Taylor series approximation.
• By analyzing the distribution of adversarial samples in the embedding space, this paper provides an insight into the characteristics of adversarial samples that are useful for AL-based C-ZSKD.

II. RELATED WORK

A. COMPLETE ZERO-SHOT KNOWLEDGE DISTILLATION
Unlike G-ZSKD, C-ZSKD can be considered a highly scalable method because it operates even in environments where access to training data is completely blocked. C-ZSKD is categorized into two approaches. The first approach focuses on reproducing information about the training data from the teacher model (T) by generating a pseudo dataset similar to the training data of T. For example, Nayak et al. estimated the number of classes from the weight W of the final layer of T [6]. Then, assuming that the label of each class follows a Dirichlet distribution (D), the concentration parameter of D was derived from W. Next, pseudo labels (ŷ) were sampled from D, and the corresponding pseudo image (x*) was generated according to Eq. (1):

x* = argmin_x L_CE(ŷ, T(x)).   (1)
L_CE(·, ·) in Eq. (1) denotes the cross-entropy loss. The pseudo dataset is generated by repeating the above process, and the student model (S) is finally trained on it using the conventional KD technique [3]. The second approach creates adversarial samples to transfer the decision boundary information of T to S [7]. Specifically, AL was applied to the generator (G) and S according to Eqs. (2) and (3):

W_G* = argmax_{W_G} E_z[ D_KL(T(G(z)) || S(G(z))) ],   (2)

W_S* = argmin_{W_S} E_z[ D_KL(T(G(z)) || S(G(z))) ].   (3)
Here, W_G and W_S indicate the weights of G and S, respectively. Also, D_KL denotes the Kullback-Leibler (KL) divergence, and z is the latent code sampled from N(0, I). The AL proceeds as follows. First, according to Eq. (2), G is trained so that S and T output different logits. Next, S receives the decision boundary information from T through the adversarial samples, as in Eq. (3). Since the pseudo dataset generation method [6] must choose a parametric probability distribution suitable for each task, S is subject to statistical restrictions imposed by the chosen parameters, so this method can hardly reflect the actual data distribution in full. On the other hand, the AL-based method [7] need not select a parametric probability distribution for each task. Above all, it has the advantage of reflecting the actual data distribution because it can learn the probability distribution of the adversarial samples located near the decision boundary of the teacher model without statistical limitations. Therefore, this paper adopts the AL-based C-ZSKD approach for more effective training of the student model.
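The min-max pair in Eqs. (2) and (3) can be sketched numerically. The linear softmax "teacher" and "student" below are illustrative stand-ins only (the actual models are deep networks), and the batch x stands in for generated samples G(z):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    """D_KL(p || q) per sample, for rows of class probabilities."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

rng = np.random.default_rng(0)
# Illustrative stand-ins: linear teacher/student heads over 8-dim inputs, 10 classes.
W_T = rng.normal(size=(8, 10))
W_S = rng.normal(size=(8, 10))
x = rng.normal(size=(32, 8))   # a batch standing in for G(z)

div = kl_divergence(softmax(x @ W_T), softmax(x @ W_S)).mean()
# Eq. (2): the generator ascends `div` (make T and S disagree);
# Eq. (3): the student descends `div` (match T on those samples).
loss_G = -div   # minimized by the generator optimizer
loss_S = div    # minimized by the student optimizer
```

The adversarial game alternates between these two objectives over the same divergence, which is why adversarial samples end up concentrated near the teacher's decision boundary.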

B. ADVERSARIAL PERTURBATION
Since adversarial perturbation (AP) was introduced in [9] as a means of attacking neural networks, many AP generation techniques have been developed. In general, if arbitrary data is input to a neural network together with an AP, the data can move from its original classification region to another classification region [10]. Based on this property, Heo et al. generated adversarial samples using AP and then applied them to KD [11]. Since the adversarial samples generated by [11] exist near the decision boundary of the teacher model, the decision boundary information of the teacher model could be easily transferred to the student model. As a result, Heo et al.'s method showed effective KD performance even in environments with little data. Thus, many cases have been reported in which AP's ability to prevent correct classification has a rather positive effect.
We also try to solve the biased sample generation problem by utilizing AP. Unfortunately, previous AP techniques cannot be used in an environment without label information [10], [12], [13]. Therefore, we propose a new method that induces effects similar to AP even without label data, and we use this method to solve the biased sample generation problem.

III. METHOD
This section describes the overall flow of the proposed method. First, to solve the biased sample generation problem, we propose applying an adversarial perturbation in which a Taylor approximation and the convolution of probability distributions are used to apply the perturbation without label data (Section III.A). Next, we suggest a constraint that tightens the upper bound of the perturbed loss function to guarantee its convergence (Section III.B).

A. BIASED SAMPLE GENERATION PROBLEM AND A SOLUTION
A state-of-the-art AL-based C-ZSKD [7] realized AL using Eqs. (2) and (3). AL forces adversarial samples to exist near the decision boundary of the teacher model. So, if the optimal weight for the student model is found by Eq. (3) using adversarial samples, the decision boundary information of the teacher model can be transferred to the student model. On the other hand, D_KL(T(G(z)) || S(G(z))) on the right-hand side of Eq. (2) is used as the generator loss function and is re-defined by Eq. (4):

W_G* = argmin_{W_G} E_{x∼Q}[ φ(x) ],   (4)
where x = G(z), Q denotes the adversarial sample distribution, and φ(x) indicates −D_KL(T(x) || S(x)). However, when the generator is trained through AL, we can observe that adversarial samples tend to be intensively concentrated on some particular class decision boundaries that make Eq. (4) easy to minimize, as shown in Fig. 2(a). In addition, Fig. 1 shows that adversarial samples are biased toward some classes in the early and middle periods of training. This phenomenon makes it difficult to deliver accurate decision boundary information of the teacher model to the student model. It could be resolved by moving the adversarial samples with the AP of Section II.B, but that AP cannot be used as-is because it requires label data. Therefore, this paper proposes a solution that increases the variance of the adversarial sample distribution by using a Taylor series approximation and the convolution of probability distributions. We will increase the variance of Q through the convolution of Q and a normal distribution Λ. However, since Q is learned implicitly, the exact probability distribution of Q is unknown, so it is impossible to directly calculate the convolution of the two probability distributions. Thus, we propose a bypass solution using the equivalence relationship in Eq. (5) [15]:

E_{x∼Q*}[ φ(x) ] = E_{x∼Q} E_{ζ∼Λ}[ φ(x + ζ) ],  where Q* = Q ∗ Λ.   (5)
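Why convolving with a Gaussian spreads Q out can be sanity-checked by Monte Carlo: adding ζ ∼ N(0, γ) to samples x ∼ Q draws from Q* = Q ∗ Λ, and independence gives Var(Q*) = Var(Q) + γ per coordinate. The one-dimensional bimodal Q below is purely illustrative (a stand-in for samples clustered near two decision boundaries), and γ = 2.5 is the CIFAR-10 value quoted later in Section IV:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 2.5
n = 200_000

# Illustrative 1-D surrogate for the implicit adversarial sample distribution Q:
# a tight two-mode mixture (samples clustered near two decision boundaries).
q_samples = np.where(rng.random(n) < 0.5,
                     rng.normal(-1.0, 0.1, n),
                     rng.normal(+1.0, 0.1, n))

# Sampling from Q* = Q * Lambda amounts to adding noise zeta ~ N(0, gamma).
zeta = rng.normal(0.0, np.sqrt(gamma), n)
q_star_samples = q_samples + zeta

var_q, var_q_star = q_samples.var(), q_star_samples.var()
# Independence gives Var(Q*) = Var(Q) + gamma, so Q* is strictly more spread out.
```

The generator never has to know the density of Q to obtain this effect; that is exactly what Eq. (5) exploits.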

[Algorithm 1: training procedure of the proposed C-ZSKD. Recoverable details from the listing: the decay factor is d_f = cos(iter/(2π N_total)); A_t^l and A_s^l denote the activation maps of the l-th layer of the teacher and student models, respectively; the total student loss is L_S_total = L_S + L_AT; the generator and student weights are updated via ∇_{W_G} L_G and ∇_{W_S} L_S_total, respectively.]
Here, γ is a hyperparameter, and λ and q are the PDFs of Λ and Q, respectively; also, ζ ∼ Λ and x ∼ Q. Eq. (5) can be proved using the definition of the convolution integral, the symmetry property of the Gaussian distribution, and the commutative property that follows from the independence of x and ζ [15]. Next, applying the second-order Taylor series approximation near ζ = 0 to φ(x + ζ) in Eq. (5) yields Eq. (6):

φ(x + ζ) ≈ φ(x) + [∇_x φ(x)]^T ζ + (1/2) ζ^T H ζ,   (6)

where H is the Hessian of φ at x.
Then, taking the expectation with respect to Λ on both sides of Eq. (6), with Λ = N(0, γI) (so that E[ζ] = 0 and E[ζ ζ^T] = γI), gives Eq. (7):

E_{ζ∼Λ}[ φ(x + ζ) ] ≈ φ(x) + (γ/2) Tr(H).   (7)

And according to [16], the Hessian can be represented through the Jacobian J and an approximation error R as in Eq. (8):

H = J^T J + R.   (8)
However, referring to [16], since R is negligibly small, the right-hand side of Eq. (7) can be represented as Eq. (9):

E_{ζ∼Λ}[ φ(x + ζ) ] ≈ φ(x) + (γ/2) Tr(J^T J).   (9)
Tr(J^T J) is the squared norm of the gradient, so applying the expectation with respect to Q to the right-hand side of Eq. (9) turns Eq. (7) into Eq. (10):

E_{x∼Q*}[ φ(x) ] ≈ E_{x∼Q}[ φ(x) ] + (γ/2) E_{x∼Q}[ ||∇_x φ(x)||² ].   (10)
Finally, Eq. (5) is converted into Eq. (10) through Eqs. (6) to (9), so the integral operation of Eq. (5) is transformed into the addition of a squared gradient norm to E_Q[φ(x)]. Because the adversarial sample distribution is trained through AL and the exact form of Q is unknown, the convolution is realized through Eq. (10) rather than computed directly. As a result, the adversarial sample distribution becomes Q*, which is more spread out than the original distribution Q (see Fig. 2(b)). Since the adversarial samples are then distributed near the decision boundaries of various classes, the biased sample generation problem can be alleviated.
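The key step, Eq. (7), is exact for a quadratic φ (the second-order expansion then has no remainder), which makes it easy to verify by Monte Carlo. The quadratic below is an illustrative stand-in for φ; the Tr(J^T J) substitution of Eqs. (8)–(9) is the approximation from [16] and is not checked here:

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 5, 0.5
A = rng.normal(size=(d, d))
A = A @ A.T                      # symmetric PSD matrix playing the role of the Hessian H
b = rng.normal(size=d)
x = rng.normal(size=d)

def phi(xs):
    """Illustrative quadratic stand-in for phi; rows of xs are inputs."""
    return 0.5 * np.einsum('ni,ij,nj->n', xs, A, xs) + xs @ b

# Monte Carlo estimate of E_{zeta ~ N(0, gamma*I)}[phi(x + zeta)]
zeta = rng.normal(0.0, np.sqrt(gamma), size=(400_000, d))
lhs = phi(x + zeta).mean()

# Eq. (7): phi(x) + (gamma/2) * Tr(H)
rhs = phi(x[None, :])[0] + 0.5 * gamma * np.trace(A)
# lhs and rhs agree up to Monte Carlo error
```

The (γ/2)·Tr(H) correction is precisely the term that Eq. (10) rewrites as the squared gradient norm.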

B. PROPOSED C-ZSKD
Next, we propose a more stable way to learn perturbed samples. The sample distribution estimated by the Taylor approximation shows a clearer decision boundary because it generally has a higher variance; on the other hand, convergence is not guaranteed. To solve this problem, we present a way to tighten the upper bound through an additional constraint. As shown in Eq. (11), the second-order Taylor approximation has an upper bound t(l) that depends on the loss function l [17]. Based on this, the approximation constraint is derived as follows. First, using Eq. (6) and Eq. (11), a triangle inequality like Eq. (12) is obtained.

[Table 1. Student accuracy of the proposed method vs. [7] for the teacher model trained with the CIFAR-10 dataset. Note that the proposed method improves on [7] by an average of 0.52%; numerical figures in the last row indicate improvements over [7].]
[Table 2. Student accuracy vs. [7] on FashionMNIST. Note that the proposed method improves the performance by 0.78% on average in comparison to [7].]
[Table 3. Student accuracy vs. [7] on CIFAR-100. Note that the proposed method is 0.58% better than [7].]
Then, the second term on the right hand side of Eq. (12) has a deterministic upper bound t(φ) for φ. As a result, Eq. (13) is derived from Eq. (12).
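One way to read the bounding step, with the first-order and second-order terms of Eq. (6) separated (an interpretive sketch: Eqs. (11)–(13) are not reproduced in this excerpt, and the grouping of terms below is an assumption):

```latex
\left|\phi(x+\zeta)-\phi(x)\right|
  \;\le\; \bigl|[\nabla_x \phi(x)]^{T}\zeta\bigr|
          \;+\; \tfrac{1}{2}\bigl|\zeta^{T}H\zeta\bigr|
  \;\le\; \bigl|[\nabla_x \phi(x)]^{T}\zeta\bigr| \;+\; t(\phi).
```

Under this reading, the second-order term is absorbed by the deterministic bound t(φ) of [17], leaving only the first-order term to be suppressed.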
Because t(φ) is a constant predetermined by φ [17], we compensate for the error of the Taylor approximation by minimizing [∇_x φ(x)]^T ζ. Therefore, the final loss function of the generator is defined by Eq. (14), where α and σ are hyperparameters. Here, the hyperparameters are determined according to the intrinsic dimensionality of the latent space [18], [19] of the dataset. The learning process of C-ZSKD with the proposed method is summarized in Algorithm 1. Prior to learning, the number of AL iterations (N_total) and the update ratio N_S between the generator and student model weights are set. Then, α, γ, and σ are determined for each dataset (see Section IV). Regular training of the generator proceeds as follows. Since the biased sample generation problem mainly occurs in the early and middle stages of training, the variance of Λ is lowered by applying the decay factor d_f at every iteration. Next, the KL divergence of T and S (L_1) is calculated. Also, L_2 is calculated using d_f and L_1. After calculating the constraint (L_3) for the second-order Taylor approximation, the total loss function of the generator (L_G) is obtained by adding L_1, L_2, and L_3. Finally, the generator weights are updated through the gradient of L_G. As in [7], attention transfer [14] was used as a constraint term for training stability, and the distillation loss of [3] was used jointly. The model trained by the proposed algorithm achieves high performance by stably learning adversarial samples with lower bias than the conventional method. The experimental verification is covered in the next section.
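The generator step just described can be mocked up with toy linear models and a numerical gradient, as a rough sketch only: the stand-in models, the placement of α, the sign conventions, and the printed form of the decay factor are all illustrative assumptions, and the constraint L_3 is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Mean D_KL(p || q) over a batch of probability rows."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

rng = np.random.default_rng(0)
W_T, W_S = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))  # toy T and S heads

def phi(x):
    # phi(x) = -D_KL(T(x) || S(x)) as in Eq. (4); toy linear heads stand in for T, S.
    return -kl(softmax(x @ W_T), softmax(x @ W_S))

def grad_phi(x, eps=1e-4):
    """Central-difference gradient of phi w.r.t. the generated batch x."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        g[idx] = (phi(x + step) - phi(x - step)) / (2 * eps)
    return g

# One generator-side loss evaluation (the student step is analogous).
N_total, iteration = 16_000, 1_000
alpha, gamma = 5.0, 2.5                        # CIFAR-10 settings from Section IV.A
x = rng.normal(size=(4, 8))                    # stands in for a batch G(z)

d_f = np.cos(iteration / (2 * np.pi * N_total))    # decay factor as printed in Alg. 1
L1 = phi(x)                                        # = -D_KL; minimizing L1 maximizes the divergence
L2 = d_f * (gamma / 2) * np.sum(grad_phi(x) ** 2)  # squared-gradient-norm term of Eq. (10)
L_G = L1 + alpha * L2   # weighting by alpha is an assumption; L3 (Section III.B) omitted
```

In the real algorithm the gradient of L_G with respect to W_G would be backpropagated through the generator; the finite-difference gradient here only illustrates how the squared-norm term of Eq. (10) enters the loss.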

IV. EXPERIMENTS
Datasets. For fair quantitative evaluation of the proposed method, three datasets were adopted: CIFAR-10 [20], FashionMNIST [21], and CIFAR-100 [22]. Additionally, MNIST [23] was used in the experiment for the analysis of the adversarial sample distribution. CIFAR-10 has a training set of 50,000 RGB images of size 32 × 32 in 10 classes, and a test set of 10,000 RGB images. FashionMNIST consists of 60,000 gray-scale training images of resolution 28 × 28 in 10 classes, and 10,000 gray-scale test images. CIFAR-100 has a training set of 50,000 RGB images of size 32 × 32 in 100 classes, and a test set of 10,000 RGB images. MNIST has a training set of 60,000 gray-scale images of resolution 28 × 28 in 10 classes, and a test set of 10,000 gray-scale images.

[Fig. 3 caption fragment: the tops of (b) to (d) show the distribution of adversarial samples generated by [7]; the bottoms of (b) to (d) show the distribution of adversarial samples generated by the proposed method; the bar on the right in each figure represents density.]
Implementation Detail. We adopted WideResNet-40-2 [8] as the teacher model. When training the generator and student model of [7] as well as the proposed method, AdamW [24] was used as the optimizer. For AlexNet and LeNet, the initial learning rates of the generator and student model were set to 5 × 10^-3 and 10^-3, respectively; for the other networks, both initial learning rates were set to 2 × 10^-3. β_1 and β_2 were set to 0.9 and 0.999, respectively, and the weight decay was 10^-2. The learning rate was gradually decreased every iteration following the cosine annealing of [25]. The batch size was set to 512, and the number of AL iterations was 16,000. N_S, which determines the AL update ratio between the generator and student model, was set to 5. When training the student model through the teacher model trained on MNIST, the batch size was set to 64 and a total of 4,000 iterations were run.
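The cosine-annealing schedule referenced above ([25]) is commonly written as follows; the exact variant (restarts, minimum rate) is not specified in the text, so this sketch assumes the standard single-cycle form decaying to zero:

```python
import math

def cosine_annealing_lr(lr_init, iteration, total_iters, lr_min=0.0):
    """Single-cycle cosine annealing: decays lr_init toward lr_min over total_iters."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * iteration / total_iters))
    return lr_min + (lr_init - lr_min) * cos_term

# e.g. the generator learning rate sampled over the 16,000 AL iterations described above
lrs = [cosine_annealing_lr(2e-3, it, 16_000) for it in range(0, 16_001, 4_000)]
```

The rate starts at the initial value, falls slowly at first, and reaches the floor exactly at the final iteration, matching the "gradually decreased every iteration" behavior described above.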
Note that the generator structures for CIFAR-10 and CIFAR-100 differ from those for MNIST and FashionMNIST. In the proposed C-ZSKD, the purpose of the generator is to increase the difference between the logit distributions of the teacher and student models, as shown in Eq. (2). MNIST and FashionMNIST consist of relatively monotonous images compared to CIFAR-10 and CIFAR-100, and a dataset of monotonous images has a relatively low-dimensional latent space according to the manifold hypothesis. So, if a generator learns against a teacher model trained on such a monotonous dataset, the generator is likely to suffer mode collapse. Therefore, when training through a teacher model trained on MNIST or FashionMNIST, we reduced the parameter size of the generator and added an average pooling layer to the generator to suppress this phenomenon effectively. Also, σ was set to 10^-3, which is very small compared to the CIFAR-10 and CIFAR-100 settings, to improve training stability.

A. QUANTITATIVE RESULTS
In order to quantitatively evaluate the proposed method, the classification accuracy on three test datasets, i.e., CIFAR-10, CIFAR-100, and FashionMNIST, was measured. The hyperparameters α, γ, and σ in Eq. (14) were determined according to the dataset used for the teacher model. In detail, γ was set proportionally to the inter-class variability of the dataset to mitigate the biased sample generation problem, α was set in proportion to γ to maintain the validity of the second-order Taylor expansion, and σ was determined experimentally considering training stability. As a result, α = 5, γ = 2.5, and σ = 1 for CIFAR-10; α = 20, γ = 4, and σ = 1 for CIFAR-100; and α = 1, γ = 2, and σ = 10^-3 for FashionMNIST. Here, the hyperparameters were set empirically. Since they are somewhat sensitive, inappropriate settings can cause performance degradation; this is verified in the ablation study. Using the same seeds, a total of three training runs were performed for every teacher-student pair, and the average accuracy is reported as the student accuracy.
Tables 1 to 3 compare the student accuracy of the proposed method and the conventional C-ZSKD techniques on CIFAR-10, FashionMNIST, and CIFAR-100. First, the proposed method provides 0.52%, 0.78%, and 0.58% better accuracy on average than the conventional state-of-the-art technique [7] on the three datasets, respectively, with maximum improvements of 0.86%, 2.06%, and 0.90%. We can observe that the performance improvement of the proposed method becomes more noticeable as the channel width of the student model shrinks, regardless of the dataset. In other words, the smaller the number of channels in the student model, the more serious the biased sample generation problem. As a result, the biased sample generation problem was significantly alleviated by the proposed method.
Next, consider the performance difference between the teacher and student models. Comparing Table 1 and Table 3, we can observe that the performance gap between the two models grows as the number of classes increases. When training with CIFAR-100 (see Table 3), the gap amounts to 35.56% on average. Such a large gap means that the adversarial samples do not accurately transfer the decision boundary of the teacher model trained on CIFAR-100; in other words, it is more difficult to transfer the exact decision boundary of a teacher model trained on a dataset with many classes. This phenomenon presumably occurs because the adversarial samples may omit the decision boundary information of some classes. In the end, the more complex the decision boundary, as with CIFAR-100, the more serious the biased sample generation problem. However, note that this problem is mitigated when the adversarial samples generated by the proposed method are applied to knowledge distillation: as shown in Table 3, the proposed method delivers the decision boundary of the teacher model to the student model more accurately.

B. ANALYSIS FOR EFFECTIVE ADVERSARIAL SAMPLES
This section analyzes the characteristics of adversarial samples that are effective for AL-based C-ZSKD. To this end, the distribution in the embedding space of the network pre-trained on the MNIST dataset was visualized. Fig. 3(a) shows the distribution of the MNIST test dataset embedded by the modified WideResNet trained on the MNIST training dataset. The farther a sample is from the origin (0, 0), the more certain its class. Figs. 3(b), (c), and (d) show the distributions of adversarial samples, represented in the same coordinate system as Fig. 3(a), at several iterations. We can observe that the adversarial samples generated by the proposed method are more concentrated around the origin, whereas the adversarial samples generated by [7] show a relatively biased distribution. This indicates that the proposed method produces more useful adversarial samples than [7]. Therefore, it was experimentally demonstrated that the proposed method greatly mitigates the biased sample generation problem, an inherent problem of AL-based C-ZSKD.
Based on the experiments of Sections IV.A and IV.B, we can interpret effective adversarial samples as ambiguous ones, since they lie close to the origin of Fig. 3(a). This characteristic contributes significantly to transferring the decision boundary information of as many diverse classes as possible.

C. ABLATION STUDY
This section describes the ablation study for each hyperparameter of the proposed method. The dataset used in this experiment was CIFAR-10, and WRN-40-2 and WRN-16-2 were used as the teacher and student networks, respectively. Table 4 shows that performance deteriorates when each hyperparameter deviates from its optimal value. This indicates that, as mentioned above, the proposed method is somewhat sensitive to hyperparameters. In particular, since the sensitivity to γ is high, careful adjustment is required. This sensitivity to hyperparameter settings is admittedly a weakness of the proposed method. Nevertheless, the proposed method is valuable because it provides sufficiently higher performance than the existing C-ZSKD.

V. CONCLUSION
This paper aims to solve the biased sample generation problem of the complete zero-shot knowledge distillation (C-ZSKD) based on adversarial learning (AL). Inspired by the conventional C-ZSKD based on adversarial learning, we have devised a novel method that can have a similar effect even in an environment without label data. The proposed method increases the variance of the adversarial sample distribution by using the squared gradient norm of the generator loss function. As a result, the adversarial sample distribution is widened, so the student model is trained by receiving more accurate decision boundary information from the teacher model. The experiments on various datasets showed that the student model trained by the proposed method could provide a high performance improvement of up to 2.06%. Additionally, by analyzing the distribution of adversarial samples on the embedding space, the characteristic of the most effective adversarial samples for AL-based C-ZSKD is qualitatively demonstrated. In the future, we will expand the AL-based C-ZSKD study in the direction of generating adversarial samples with high entropy in the embedding space.