Binarized Neural Network With Parameterized Weight Clipping and Quantization Gap Minimization for Online Knowledge Distillation

As applications of artificial intelligence grow rapidly, numerous network compression algorithms have been developed for platforms with restricted computing resources, such as smartphones, edge, and IoT devices. Knowledge distillation (KD) transfers soft labels derived from a teacher model to a less parameterized model, achieving high accuracy with a reduced computational burden. Moreover, online KD enables parallel computing through collaborative learning between teacher and student networks, thus enhancing the training speed. A binarized neural network (BNN) offers an intriguing opportunity for aggressive compression, at the expense of drastically degraded accuracy. In this study, two performance improvements are proposed for online KD when a BNN is applied as a student network: 1) parameterized weight clipping (PWC) to reduce dead weights in the student network and 2) quantization gap-aware adaptive temperature scheduling between the teacher and student networks. In contrast to constant weight clipping (CWC), PWC uses trainable weight clipping to decrease the gradient mismatch, yielding a 3.78% top-1 test accuracy enhancement on the CIFAR-10 dataset. Furthermore, the quantization gap-aware temperature scheduling increases the top-1 test accuracy by 0.08% over online KD at a constant temperature. By aggregating both methodologies, the top-1 test accuracy on the CIFAR-10 dataset was 94.60%, and that on the Tiny-ImageNet dataset was comparable to that of the 32-bit full-precision neural network.


I. INTRODUCTION
Over the past decade, artificial neural network-based deep learning technology has been successfully applied in diverse fields. However, as networks become deeper and broader, real-world solutions require consideration of the computational cost. For example, a representative autoregressive language model, GPT-3 [1], has 175 billion parameters, which significantly amplifies the computational burden. To alleviate these problems, a variety of studies on neural network compression with minimal performance degradation have been conducted. Through aggressive reduction of parameters to a data width of 1 bit at the expense of considerable accuracy loss, binarized neural networks (BNNs) demonstrate significant benefits in terms of memory footprint and computational speed. Various studies, such as XNOR-Net and Bi-real [2], [3], have addressed the accuracy loss of BNNs. Nonetheless, there is still an inherent limit in improving BNN performance through parameter processing or modulation.
The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.
Knowledge distillation (KD) is a widely applicable technique for compressing neural networks [4]. The key idea behind KD is to supervise the student network by imitating the teacher network via soft probabilities, which exposes more information than the class label and helps the student network learn. KD performance is primarily determined by the different characteristics of the teacher and student networks, such as the data widths, topologies, and hyperparameter configuration. Denser knowledge can be acquired as the depth of the teacher network increases, whereas as a soft label approaches a hard label, it becomes too taxing for the student network to emulate the teacher network owing to insufficient capacity [5], [6]. A low-bit network has been applied to the student network to boost the KD compression efficiency [2], [7], [8]. However, in KD, where harmony between the teacher and student networks is emphasized, the quantization gap between the two networks causes negative side effects. Furthermore, the data width parameter is an essential consideration because the difference in performance between the two networks is closely related to the number of recognizable classes.
In this study, gradient mismatch mitigation in a BNN and a KD composed of a binarized student network are addressed. Conventional constant weight clipping (CWC) causes a gradient mismatch in BNNs because fixed clipping values cannot cope with the dynamics of weight distribution. Although the dynamic clipping range has been suggested as an alternative, it has enormous complexity. To address this problem, trainable clipping values used for easing gradient mismatches were introduced.
Next, we propose a method to alleviate the capacity shortage of binarized student networks caused by data width difference between the teacher and student networks. Numerous studies [2], [6], [7] have revealed that a difference between teacher and student networks greater than the effective range results in insufficient knowledge transfer between them. Hence, adaptive scheduling based on the quantization gap is required to balance the knowledge proportion of each network. Inspired by this observation, information entropy was developed to assess the difference between the two networks appropriately.
The contributions of this study are as follows:
• To mitigate the drawback of CWC that causes gradient mismatch in BNNs, we utilize a trainable weight clipping function adaptable to the dynamic weight distribution.
• In online distillation, information entropy-based temperature scheduling is introduced to overcome the problems caused by i) a poorly trained teacher network at the beginning of learning and ii) a capacity shortage of the student network.
• The effectiveness is demonstrated by aggregating the proposed approaches, which can be employed in diverse network models. Furthermore, the binarized student network trained with the proposed methods exhibits a top-1 test accuracy comparable to that of the baseline CNN.
The remainder of this paper is organized as follows. The related studies are described in Section II. Section III presents the main algorithm flow of learnable weight clipping and details the manner in which information entropy is applied to the technique. The simulation results and an analysis of several network topologies and datasets are presented in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORK
Various types of neural network compression have been proposed for deployment on resource-constrained computing devices. BNN and KD are representative network compression schemes based on data width conversion and loss function reinforcement, respectively.

A. BINARIZED NEURAL NETWORK
Courbariaux et al. [9] exploited a straight-through estimator (STE) [10] as a gradient approximation to overcome zero gradients at all locations in the sign function. However, the expressive ability of BNNs in binary space is restricted, resulting in a significant loss of accuracy. To reduce the disparity in accuracy between a BNN and its single-precision 32-bit floating-point (FP32) counterpart, XNOR-Net [11] introduced a scaling factor derived from the L1-norm of the weights or activations to minimize the quantization error.
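The scaling-factor idea attributed to XNOR-Net above can be illustrated with a minimal sketch. It assumes the commonly used form in which the scale is the L1-norm of the weights divided by their count, and it treats sign(0) as +1; the function names `binarize_with_scale` and `quantization_error` are hypothetical and do not come from the paper.

```python
def binarize_with_scale(weights):
    """Return (scale, binary) with binary in {+1, -1} and scale = ||w||_1 / n."""
    scale = sum(abs(w) for w in weights) / len(weights)
    # Assumption: sign(0) is mapped to +1 so every weight gets a binary value.
    binary = [1.0 if w >= 0 else -1.0 for w in weights]
    return scale, binary


def quantization_error(weights, scale, binary):
    """Squared error between the FP32 weights and scale * binary."""
    return sum((w - scale * b) ** 2 for w, b in zip(weights, binary))


weights = [0.4, -0.2, 0.1, -0.5]
scale, binary = binarize_with_scale(weights)
err_scaled = quantization_error(weights, scale, binary)  # with scaling factor
err_plain = quantization_error(weights, 1.0, binary)     # plain sign(w)
print(scale, err_scaled, err_plain)
```

For this toy weight vector, the scaled binarization yields a smaller squared error than the unscaled sign function, which is the motivation stated in the text.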
The academic community has extensively explored enhancements in the accuracy of BNNs by building gradient estimation functions or designing binarization-friendly network architectures. For example, various BNN schemes [12], [13], [14] have aimed to apply a continuous activation gradient that approximates the sign function to refine the existing STE. ABC-Net [15] was constructed by utilizing more binary bases for weighting and activation. Qin et al. [16] applied an error attenuation estimator to minimize backpropagation information loss on the gradient. Additionally, ReActNet [17] was applied to formulate an activation function that was translated to fit the weight distribution.
Moreover, several studies have focused on gradient improvement, which is used for predicting the variation and scale of weight parameters. In addition, Xu et al. [18] investigated the gradient mismatch problem of STE when used as a gradient approximation in BNNs. By standardizing dead weights, whose gradients were not defined by STE, the authors contributed to BNN performance. Liu et al. [19] revealed that the Adam optimizer is superior to other optimizers in BNNs. Dead weights were reactivated not only from the regularization effect of the second-order momentum in the Adam optimizer but also because dead weights decreased through weight decay. STE is accountable for gradient approximation in backpropagation by providing a customized gradient to non-differentiable sign functions. However, the dead weight problems underlying gradient approximation are yet to be discussed. Thus, a weight clipping function is required to revive dead weights and simultaneously reduce the quantization error between FP32 and binarized weights.

B. KNOWLEDGE DISTILLATION
Low-precision numeric parameters and KD have common features that remarkably reduce computational requirements and memory footprints. Because the two techniques are different, a cumulative effect is expected if they are applied in parallel. Usually, KD with a low-bit student network scheme focuses only on the layer depth disparity while neglecting the effect of a quantization gap between the two networks.
In [2], the accuracy of 2-bit ResNet 20 [20] was increased by 1.4% with joint training, mimicking the prediction probability of the teacher network on CIFAR-10 dataset. In addition, Cho et al. [6] enhanced the efficacy of KD by transferring amenable knowledge from early stopped teachers.
Shin et al. [7] emphasized the importance of a suitable teacher model and hyperparameter selection for optimizing the performance of a student network using KD; however, they did not address adaptive temperature scheduling. According to several recent studies [21], [22], [23], the use of adaptive temperature scheduling in online KD has the potential to achieve higher accuracy in student networks.

III. METHODOLOGY
First, parameterized weight clipping (PWC) is introduced to efficiently reduce dead weights in a BNN through gradient descent. In addition, information entropy-based temperature scheduling is proposed to alleviate the quantization gap between teacher and student networks for online KD.

A. PARAMETERIZED WEIGHT CLIPPING
CWC prevents a case in which the binary weights are not updated in backpropagation when the absolute value of FP32 activation is greater than one in the BNN. As shown in Table 1, we calculated the differences in accuracy between BNNs with and without weight clipping for various network models in PyTorch [26] to correctly determine the effect of weight clipping on the performance of BNNs. Although the increase in accuracy differed depending on the network topology and weight clipping, the accuracies of all three networks increased. In particular, ResNet 20 exhibited the highest increase in accuracy (5.03%).
In the BNN, the FP32 weight parameter set W was binarized using (1) and (2) for the forward propagation. Conversely, in the backpropagation, (3) was applied as an STE for the non-differentiable sign function. In backward propagation, if the FP32 weight exceeds the fixed clipping range, disagreement between the presumed and actual gradient functions occurs, resulting in a dead weight. Dead weights hinder correct weight updates during backpropagation. To minimize the dead weights caused by the CWC, PWC with gradient approximation was applied while keeping the overhead minimal. To make the clipping values learnable as the weights change, we allocated a gradient for the clipping functions α and β, as given in (4), (5), and (6), where L represents the loss function, and ∂L/∂|W|_c represents the gradient from the deeper layer to the scaled sign function. Equations (4), (5), and (6) describe the approximated gradient equations for the clipping functions α and β. First, the clipping function in (4) returns a value of 1 if the given weight lies between α and β. If the weight is greater than the negative clipping value α, it returns a value of ∂L/∂α = 1, as shown in (5). The gradient ∂L/∂β for β can also be computed using the STE to estimate a value of 1 for ∂L/∂|W|_c, as in (6). Consequently, gradient-descent-based training can adjust the clipping range to update the weights dynamically.
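The role of the clipping range in the STE can be sketched as follows. This is only an illustration under stated assumptions: the mask is taken to be 1 on the closed interval [α, β] and 0 elsewhere, as described for (4), and the gradients of α and β from (5) and (6) are not reproduced because those equations are not available in this text. They are passed in as plain numbers to a generic gradient-descent step. All function names are hypothetical.

```python
def ste_mask(w, alpha, beta):
    """Pass-through indicator: 1.0 inside [alpha, beta], 0.0 outside (assumed closed interval)."""
    return 1.0 if alpha <= w <= beta else 0.0


def weight_gradients(weights, upstream, alpha, beta):
    """Gradient reaching each FP32 weight: upstream gradient times the STE mask."""
    return [g * ste_mask(w, alpha, beta) for w, g in zip(weights, upstream)]


def dead_weight_fraction(weights, alpha, beta):
    """Share of weights whose gradient is zeroed by the clipping range."""
    dead = sum(1 for w in weights if ste_mask(w, alpha, beta) == 0.0)
    return dead / len(weights)


def update_bounds(alpha, beta, grad_alpha, grad_beta, lr):
    """Plain gradient-descent step on the trainable clipping values.

    grad_alpha and grad_beta are placeholders for the values given by (5) and (6).
    """
    return alpha - lr * grad_alpha, beta - lr * grad_beta


weights = [-1.3, -0.6, 0.2, 0.9, 1.4]
upstream = [0.5] * len(weights)
grads_fixed = weight_gradients(weights, upstream, -1.0, 1.0)
dead_fixed = dead_weight_fraction(weights, -1.0, 1.0)
dead_other = dead_weight_fraction(weights, -1.5, 1.5)
print(grads_fixed, dead_fixed, dead_other)
```

The toy example only shows that the set of masked weights depends on the chosen range; how the range should evolve during training is determined by the learned gradients in the paper, not by this sketch.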
Because the weight values satisfy the range of (−1, +1) through initialization, the default clipping values of α = −1 and β = 1 include all weights within the clipping range. Each weight clipping value was adjusted from the initial value to narrow the range based on the PWC. Accordingly, the clipping range was modified for every training step to prevent the generation of dead weight. Backpropagation of the trainable clipping variables α and β was applied in the direction of the dashed arrow, as shown in Fig. 1.
Fig. 2 shows the change in performance of the binarized student network as a function of the data width change in the teacher network. The blue bar and line represent the configuration with n-bit weights and 1-bit activations, and the red bar and line represent the configuration with n-bit weights and n-bit activations. It was observed that the increase in data width in the teacher network was not directly related to the increase in performance. Therefore, the differences in data width between teacher and student networks must be considered when optimizing KD.

B. INFORMATION ENTROPY DISTANCE-BASED TEMPERATURE SCHEDULING
In KD, (7) is used for the output layer that generates the soft logits z_i^T and z_i^S for the teacher and student networks, respectively, and regularizes the probability of each class according to the hyperparameter τ, which denotes the temperature. The loss function with the scaling factor S_f of the student network is given by (8), which includes the Kullback-Leibler divergence (KLdiv) between student and teacher distributions. However, considering student capacity, teacher knowledge close to the hard label, which results from low temperatures, can overload the student. In addition, a strict teacher is required to maximize the effect of KD on the student network [4]. Therefore, the class predictions of poorly trained teacher networks in the initial stage of learning can negatively affect the training process of the student network.
Fig. 3 illustrates the effect of the temperature on the loss value of the student networks during learning. The student network imitates knowledge that changes from a hard label to a soft label as the temperature gradually increases, as shown in Fig. 3(a). In contrast, Fig. 3(b) shows the loss in the student network when the knowledge changes from soft labels to hard labels under gradually decreasing temperatures. Gradually decreasing the temperature resulted in a 12.85% lower loss than gradually increasing the temperature. Therefore, in online KD, a soft label (i.e., probability smoothing) should be actively adopted in the early stages of learning. In contrast, in the latter half of learning, where the performance of the teacher network is guaranteed, knowledge close to the hard label should be provided at a low temperature.
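The effect of the temperature τ on the soft labels can be sketched with the standard temperature-scaled SoftMax and a KL-divergence term. This is a hedged illustration: the exact forms of (7) and (8) are not available in this text, so the `scale` argument below is only a stand-in for the scaling factor S_f, and the function names are hypothetical.

```python
import math


def softmax_t(logits, tau):
    """Temperature-scaled SoftMax; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_div(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, tau, scale=1.0):
    """Scaled KL divergence between teacher and student soft labels (assumed form)."""
    p_teacher = softmax_t(teacher_logits, tau)
    p_student = softmax_t(student_logits, tau)
    return scale * kl_div(p_teacher, p_student)


teacher_logits = [4.0, 1.0, 0.0]
student_logits = [2.0, 1.5, 0.5]
soft_low = softmax_t(teacher_logits, 1.0)   # closer to a hard label
soft_high = softmax_t(teacher_logits, 3.0)  # smoother soft label
loss = kd_loss(student_logits, teacher_logits, tau=3.0)
print(soft_low, soft_high, loss)
```

With the low temperature the teacher distribution is sharply peaked (close to a hard label), whereas the high temperature spreads probability over the other classes, which is the behavior the paragraph above relies on.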
As previously mentioned, it is difficult for the teacher network to predict the correct class during the early stage of online KD learning. The variation in learning speed depending on the quantization gap between the teacher and student is shown in Fig. 3. Therefore, it is preferable to use a temperature scheduling technique that reflects the performance variation between the two networks instead of using a constant temperature for the entire learning process.
As learning progresses, the student network requires more accurate hard label knowledge. Thus, as illustrated in Fig. 3(b), the temperature should be gradually decreased to cope with the hard label. Therefore, we chose an information entropy distance, which measures the amount of information in the network and gradually decreases during training. Using a low temperature at the beginning of the training interferes with the student network training owing to the knowledge of a poorly trained teacher network. Conversely, with a high temperature, the student cannot completely mimic the encyclopedic knowledge of the teacher network in the late stages of learning. The information entropies of the two networks were calculated using (9) and (10), where the sets T and S are the outputs of the branched SoftMax in the teacher and student networks, respectively, attached to the convolution layer containing the largest number of channels, as shown in Fig. 4. The distance between the two information entropies was calculated using (11).
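The entropy-based comparison of the two networks can be sketched as below. Because (9), (10), and (11) are not available in this text, the sketch assumes the Shannon entropy with the natural logarithm and takes the distance to be the absolute difference of the two entropies; both choices, and the function names, are assumptions rather than the paper's definitions.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_distance(teacher_probs, student_probs):
    """Assumed distance: absolute difference between the two entropies."""
    return abs(entropy(teacher_probs) - entropy(student_probs))


teacher_probs = [0.7, 0.1, 0.1, 0.1]      # fairly confident branched SoftMax output
student_probs = [0.25, 0.25, 0.25, 0.25]  # uninformative output
distance = entropy_distance(teacher_probs, student_probs)
print(entropy(teacher_probs), entropy(student_probs), distance)
```

A uniform output has the maximum entropy log(n), and a one-hot output has zero entropy, so the distance shrinks as the two networks become similarly confident.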
Hence, adaptive temperature scheduling based on the performance difference between the two networks was formulated, as shown in (12), using D_distance with the normalization factor λ.
To summarize, two techniques were developed for online KD, which comprises a binarized student network. First, the PWC lessens the dead weight problem of the CWC in backward propagation. Moreover, considering the characteristics of online KD, students learn the prediction probability of a poorly trained teacher at the beginning of the training process by implementing a soft label with a high temperature. Conversely, when a well-trained teacher is ready, temperature scheduling increases student performance through hard labels with a reliable teacher prediction probability at a low temperature.
The overheads of the PWC are the added gradient values corresponding to α and β of every layer, except for the first and last layers, with l representing the number of layers. In addition, the required clipping values are expressed as 2 · (l − 2). Taking ResNet 20 as an example, two clipping values per layer are required for 18 of the layers. Thus, only 36 parameters are added, for a total of 0.27M parameters.
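The parameter overhead stated above follows directly from the expression 2 · (l − 2). The short sketch below reproduces the ResNet 20 figure; the function name is hypothetical, and the 0.27M total is the value quoted in the text.

```python
def pwc_extra_parameters(num_layers):
    """Two clipping values (alpha, beta) per layer, excluding the first and last layers."""
    return 2 * (num_layers - 2)


extra = pwc_extra_parameters(20)  # ResNet 20
total_params = 0.27e6             # total parameter count quoted in the text
relative_overhead = extra / total_params
print(extra, relative_overhead)
```

For ResNet 20 this gives 36 additional parameters, a negligible fraction of the 0.27M total.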
The pseudocode for binarized student network training, which includes PWC and temperature scheduling, is described in Algorithm 1.
Algorithm 1 (fragment): (1) Forward computation. 3: Run forward computation of M_T and M_S simultaneously.

IV. EVALUATION
The benefits of parameterized weight clipping and information entropy distance-based temperature scheduling were validated by independently estimating and jointly evaluating the overall increase in accuracy compared with the various clipping functions and KDs. Furthermore, Table 2 presents the top-1 accuracy for the baseline network on the CIFAR-10 to further clarify the performance enhancement brought about by both proposed schemes.
Training ran for a total of 300 epochs with a weight decay of 1e-4, and the learning rate was set to 1e-1, 1e-2, and 1e-3 at the 1st, 150th, and 225th epochs, respectively.
Table 3 lists the top-1 accuracy using weight clipping for each network model on the CIFAR-10 dataset. No-weight clipping (NWC) indicates that no weight clipping was applied prior to binarization, and CWC denotes the fixed clipping range of (−1, +1) used in the existing XNOR-Net.
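The step learning-rate schedule described for the 300-epoch training can be written as a small function. The sketch assumes 1-indexed epochs and that each new rate takes effect at the start of the named epoch; the function name is hypothetical.

```python
def learning_rate(epoch):
    """Step schedule: 1e-1 from epoch 1, 1e-2 from epoch 150, 1e-3 from epoch 225."""
    if not 1 <= epoch <= 300:
        raise ValueError("epoch must be in 1..300")
    if epoch < 150:
        return 1e-1
    if epoch < 225:
        return 1e-2
    return 1e-3


schedule = [learning_rate(e) for e in range(1, 301)]
print(schedule[0], schedule[149], schedule[224])
```

The list `schedule` holds one learning rate per epoch, so `schedule[149]` corresponds to the 150th epoch.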

B. WEIGHT CLIPPING COMPARISON
For the PWC, the weight clipping value was adjusted based on the gradient descent training. The positive clipping value β was updated by the corresponding gradients from 1.28 to 0.5 to decrease the dead weight as shown in Fig. 5. Overall, the PWC improved the accuracy of all network models; in particular, the accuracy of WRN 22 × 4 increased by 3.78% compared with CWC.

C. KD TEMPERATURE SCHEDULING
The information entropy distance was used to determine the difference between the teacher and student networks for temperature scheduling. First, the temperature change based on λ was checked to match the different scales of loss and distance, based on (12). As shown in Fig. 6, when λ was fixed to 1, the temperature was configured from three to one. Based on this λ value, in the temperature scheduling experiment employing CWC, WRN 22 × 4 exhibited an accuracy of 94.56%, which is an improvement of up to 2.51% over VGG-small in comparison with τ = 3, as presented in Table 4.

D. COMPARISON WITH SOTA METHODS
Knowledge transfer to quantized networks (particularly 1-bit CNNs) from networks composed of FP32 weights and activations has rarely been explored in previous KD methods. Therefore, KD for a quantized neural network was chosen as a counterpart in this experiment to compare the KD performance with respect to the quantization gap between the two networks. Table 5 presents the experimental results for CIFAR-10 and CIFAR-100 using the proposed method integrated with PWC and information entropy distance-based temperature scheduling. In the counterparts, the student network was configured using a 2-bit neural network. In contrast, our method deploys a binarized student network and still achieves accuracy superior to that of the state-of-the-art KD for quantized deep neural networks. Specifically, on the CIFAR-100 dataset, our strategy surpassed the 2-bit student network with a 9.89% improvement in accuracy. In the Tiny-ImageNet experiment, the binarized student network significantly outperformed the BNN trained alone and showed accuracy comparable to that of the FP32 teacher network based on the ResNet 18 network model, as shown in Table 6. This is because the hard labels were reflected in the temperature scheduling.
Information entropy-based temperature scheduling applied to online distillation shows a relatively faster training speed than its offline distillation counterparts [2], [7], [8], which consist of a two-stage training process. Moreover, compared with the baseline online distillation, the computation overhead for the information entropy-based temperature in ResNet 20 is only 0.26%, whereas ResNet 110 has an overhead of 0.07% because this overhead decreases as the network becomes deeper.
To visualize the performance improvement of the proposed techniques over the baselines (FP32 and 1-bit), attention maps were depicted as qualitative results, as shown in Fig. 7. In the attention maps, a color closer to red indicates a weight concentration in the network. The 1-bit ResNet 18 with the proposed method classifies more clearly than the 1-bit baseline, and it can be seen that the performance for some images matched that of the FP32 baseline.

V. CONCLUSION
KD achieves high accuracy with a relaxed network depth by using soft labels derived from a teacher model for a less parameterized model. In contrast, a BNN can achieve a high compression rate by incorporating an aggressive reduction in the data width; however, it has an adverse effect on the accuracy. This study developed techniques to enhance the accuracy of online KD by using a BNN as a student network. Specifically, a PWC was applied to diminish the dead weights missing the gradient, and a temperature scheduling method was proposed to assess the quantization gap between the teacher and student networks. Consequently, for CIFAR-100 dataset, the accuracy of our technique increased by 9.89% in comparison with offline 2-bit student KD.
BNNs can be advantageous in resource-constrained mobile and edge devices where energy efficiency is the primary concern. However, the low capacity and performance originating from binarization make it challenging to apply BNNs to a wide range of tasks. Therefore, further investigation of BNNs should include more challenging applications (complex vision tasks such as object detection and unsupervised learning).