Knowledge distillation in deep learning and its applications

Deep learning based models are relatively large, and it is hard to deploy such models on resource-limited devices such as mobile phones and embedded devices. One possible solution is knowledge distillation, whereby a smaller model (the student model) is trained using information from a larger model (the teacher model). In this paper, we present a survey of knowledge distillation techniques applied to deep learning models. To compare the performance of different techniques, we propose a new metric, called the distillation metric, which compares knowledge distillation solutions based on their model sizes and accuracy scores. Based on the survey, we draw and present some conclusions, including the current challenges and possible research directions.


INTRODUCTION
Deep learning has succeeded in several fields such as Computer Vision (CV) and Natural Language Processing (NLP). This is due to the fact that deep learning models are relatively large and can capture complex patterns and features in data. At the same time, however, their large size makes them difficult to deploy on end devices.

To solve this issue, researchers and practitioners have applied knowledge distillation to deep learning approaches for model compression. It should be emphasized that knowledge distillation is different from transfer learning. The goal of knowledge distillation is to provide smaller models that solve the same task as larger models (Hinton et al., 2015) (see figure 1), whereas the goal of transfer learning is to reduce the training time of models that solve a task similar to the task solved by some other model (cf. Pan and Yang, 2010). In this paper, we survey knowledge distillation techniques applied to deep learning models and propose a distillation metric that compares student and teacher models in size and performance. The paper also discusses some of the recent developments in the field that help in understanding the knowledge distillation process and the challenges that need to be addressed. The rest of the paper is organized as follows: In Section 3, we provide a background on knowledge distillation. In Section 4, we present and discuss our proposed distillation metric. Section 5 contains the surveyed approaches, and Section 6 contains some applications of knowledge distillation. We provide our discussion of the surveyed approaches and an outlook on knowledge distillation in Section 7. Finally, we present our conclusions in Section 8.

BACKGROUND

The feature extractor part of a network, i.e., the stack of convolution layers, is referred to as the backbone. There are no conventions that guide student models' sizes. For example, two practitioners might have student models of different sizes even though they use the same teacher model. This situation is caused by different requirements in different domains, e.g., the maximum allowed model size on some device.

There exist some knowledge distillation methods that target teacher and student networks of the same size (e.g., Yim et al. (2017)). In such cases, the knowledge distillation process is referred to as self-distillation, and its purpose is to further improve performance by learning additional features that could be missing in the student model due to random initialization (Allen-Zhu and Li, 2020). Although an algorithm may be developed to distill knowledge from a teacher model to a student model of the same size, the algorithm can be used to distill knowledge from a teacher to a smaller student as well. This is because, based on our survey, there is no restriction on model sizes, and it is up to model designers to map the teacher's activations to the student's. So, in general settings, knowledge distillation is utilized to provide smaller student models that have accuracy scores comparable to their corresponding teacher models.

The distillation process can be performed in an offline or online manner. In offline distillation, knowledge is distilled from a pre-trained teacher model, whereas online distillation refers to methods that perform knowledge distillation while training the teacher model. The two subcategories are illustrated in figure 2.

DISTILLATION METRIC

We propose the distillation metric to compare different knowledge distillation methods and to select a suitable model for deployment from a number of student models of various sizes. The metric incorporates the ratio of the student's size to the teacher's size and the ratio of the student's accuracy score to the teacher's accuracy score. To achieve a good reduction in size, the first ratio should be as small as possible. For a distillation method to maintain accuracy well, the second ratio should be as close to 1 as possible. To satisfy these requirements, we develop the following equation:

DS = α × (student_s / teacher_s) + (1 − α) × (1 − student_a / teacher_a)
where DS stands for the distillation score, student_s and student_a are the student's size and accuracy respectively, and teacher_s and teacher_a are the teacher's size and accuracy respectively. The parameter α ∈ [0, 1] is a weight that indicates the relative importance of the first and second ratios, i.e., size and accuracy. The weight is assigned by distillation designers based on their system's requirements. For example, if a system's requirements favor small model sizes over maintaining accuracy, designers might choose α > 0.5 to best satisfy their requirements.

It should be noted that when a student's accuracy is better than its teacher's, the second ratio is greater than 1. This causes the right operand of the addition (i.e., 1 − second ratio) to evaluate to a negative value. Hence, DS is decreased, and it can even drop below zero, especially if the weight of the second ratio is large. This is a valid result, since it indicates a very small value for the first ratio compared to the second ratio. In other words, this behaviour indicates a large reduction in model size while, at the same time, achieving better accuracy than the teacher model. As presented in Section 5, a student model with better accuracy than its teacher is not a common case. It could be achieved, for example, by using an ensemble of student models.
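As an illustration, the following Python snippet computes the distillation score for a hypothetical teacher-student pair; the model sizes, accuracy values, and the choice of α are made-up numbers used only for demonstration.

    def distillation_score(student_size, student_acc, teacher_size, teacher_acc, alpha=0.5):
        # Distillation score DS: lower is better, 0 is optimal.
        # alpha in [0, 1] weights size reduction against accuracy retention.
        size_ratio = student_size / teacher_size      # first ratio: should be small
        acc_ratio = student_acc / teacher_acc         # second ratio: should be close to 1
        return alpha * size_ratio + (1 - alpha) * (1 - acc_ratio)

    # Hypothetical example: a 100 MB teacher at 95% accuracy distilled into
    # a 10 MB student at 92% accuracy, with equal weight on size and accuracy.
    print(distillation_score(10, 0.92, 100, 0.95, alpha=0.5))   # ~0.066, i.e., close to 0
    # A student that outperforms its teacher (acc_ratio > 1) can make DS negative.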

Regarding the behaviour of the distillation metric: the closer the distillation score is to 0, the better the knowledge distillation. To illustrate, an optimal knowledge distillation algorithm would produce a value very close to 0 for the first ratio (e.g., the student's size is very small compared to the teacher's size) and a value of 1 for the second ratio (e.g., the student and teacher models have the same accuracy score). As a result, the distillation score approaches 0 as the first ratio approaches 0 and the second ratio approaches 1.

… combining the teachers' output distributions and training the student on the individual output distributions.

The authors argued that this would help the student model to observe the input data from different angles 176 and would help the model to generalize better.
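The description of this multi-teacher approach is partially truncated above, so the following is only a generic sketch of combining several teachers' output distributions into a soft target, not the authors' exact method; the temperature-based softening and the simple averaging are assumptions.

    import torch
    import torch.nn.functional as F

    def combined_soft_targets(teacher_logits_list, temperature=4.0):
        # Average the softened output distributions of several teachers.
        # teacher_logits_list: list of tensors of shape (batch, num_classes).
        probs = [F.softmax(logits / temperature, dim=-1) for logits in teacher_logits_list]
        return torch.stack(probs, dim=0).mean(dim=0)

    def multi_teacher_distillation_loss(student_logits, teacher_logits_list, temperature=4.0):
        # Cross-entropy of the student's softened outputs against the combined distribution.
        targets = combined_soft_targets(teacher_logits_list, temperature)
        log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()

The same student could additionally be trained against each teacher's individual distribution, as the text suggests, by applying the second function to one teacher at a time.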

The loss function for the student's network has two components: the cross-entropy loss between the output of the student's network and the hard labels, and the cross-entropy loss between the student's output and the teacher's targets.
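A minimal sketch of such a two-component loss is given below; the weighting factor lam and the optional temperature softening are illustrative assumptions rather than details taken from the surveyed work.

    import torch.nn.functional as F

    def two_component_kd_loss(student_logits, teacher_logits, hard_labels, lam=0.5, temperature=1.0):
        # Component 1: cross-entropy against the ground-truth (hard) labels.
        hard_loss = F.cross_entropy(student_logits, hard_labels)
        # Component 2: cross-entropy against the teacher's (optionally softened) targets.
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        soft_loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()
        return lam * hard_loss + (1.0 - lam) * soft_loss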

Training a compact student network to mimic a well-trained and converged teacher model can be challenging. The same rationale can be found in school curricula, where students at early stages are taught easy courses, with the difficulty increasing as they approach later stages. Based on this observation, Jin et al. (2019) proposed that instead of training student models to mimic converged teacher models, student models should be trained on different checkpoints of teacher models until the teacher models converge.

For selecting checkpoints, a greedy search strategy was proposed that finds efficient checkpoints that are easy for the student to learn. Once the checkpoints were selected, the student model's parameters were optimized sequentially across checkpoints, while the training data was split across the different stages depending on its hardness, as defined by a hardness metric proposed by the authors.

Another surveyed work proposed a two-stage distillation for CNNs. The first stage defines two matrices between the activations of two non-consecutive layers: the first matrix corresponds to the teacher network, the second matrix corresponds to the student network, and the student is trained to mimic the teacher's matrix. After that, the second stage begins by training the student normally.

Another approach trained a student model by comparing its soft labels to the teacher's labels and the ground truth; moreover, the student also compares its encoder outputs to those of the teacher.

In another work, the student is trained to match the compressed feature maps of the teacher model. Additionally, the student is trained to match its feature-map affinity matrix to that of the teacher model. This is needed because the student network cannot capture long-term dependencies due to its relatively small size.
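Several of the approaches above match intermediate representations instead of, or in addition to, output labels. The snippet below is a generic, hedged sketch of this idea rather than any specific surveyed method: it computes an inter-layer activation matrix for a pair of layers in the teacher and in the student and penalizes the difference. The layer pairing, the normalization, and the assumption that the chosen teacher and student layers have matching channel counts are all illustrative; real methods may insert adapter layers to align dimensions.

    import torch
    import torch.nn.functional as F

    def interlayer_matrix(feat_a, feat_b):
        # Inter-layer activation matrix between two feature maps of the same network.
        # feat_a: (batch, c1, h, w), feat_b: (batch, c2, h, w), same spatial size.
        b, c1, h, w = feat_a.shape
        c2 = feat_b.shape[1]
        a = feat_a.reshape(b, c1, h * w)
        bmat = feat_b.reshape(b, c2, h * w)
        return torch.bmm(a, bmat.transpose(1, 2)) / (h * w)   # (batch, c1, c2)

    def representation_matching_loss(t_feat1, t_feat2, s_feat1, s_feat2):
        # Train the student so that its inter-layer matrix mimics the teacher's.
        with torch.no_grad():
            target = interlayer_matrix(t_feat1, t_feat2)
        return F.mse_loss(interlayer_matrix(s_feat1, s_feat2), target)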

In the LSL framework, some intermediate layers are selected in both the teacher and the student networks. The selection process is done by feeding data to the teacher model and calculating the inter-layer Gram matrices …

Table 2 provides a summary of the presented works. It shows that the best approach in terms of size …

Table 2. Summary of knowledge distillation approaches that distill knowledge from parts other than, or in addition to, the soft labels of the teacher models for training the student models. In the case of several students, the results of the student with the largest size reduction are reported. In the case of several datasets, the dataset associated with the lowest accuracy reduction is recorded. Baseline models have the same size as the corresponding student models, but they were trained without the teacher models.

… many other situations where delay is not tolerable or data privacy is a concern. Moreover, unpredictable network connections between the cloud and the device can also pose significant challenges. Thus, running … as illustrated in Figure 5. In this section we present some typical applications of knowledge distillation based on the recent literature.

Reporting the reduction in model size as well as the change in accuracy of a student model compared to the corresponding teacher model is, in our opinion, useful. Although most authors report this information, some do not report either of the two. Moreover, comparing the performance of a student model to a baseline model (e.g., a trained-from-scratch model of comparable size to the student model) is also very informative, and we believe it should be reported by authors.

… knowledge distillation on such models so that relatively small and high-performance models could be developed.

The idea that knowledge distillation is a one-way approach to improving the performance of a student model has been questioned by recent work showing that certain classes tend to under-perform on the compressed models. Thus, it seems that the systems' bias gets further amplified, which can be a major concern in many sensitive domains where these technologies will eventually be deployed, such as healthcare and hiring. In addition, compressed models are less robust to changes in data. Addressing these concerns will be an important research direction in the area of model compression, including knowledge distillation. One implication of this work is to report class-level performances instead of comparing a single overall performance measure for the system, such as accuracy. Macro-averaged F1 scores across all the classes may be a more useful performance measure than accuracy. Other appropriate measures that can compare fairness and bias across models also need to be used for evaluation; the authors presented two such measures in their work. Furthermore, it will be important to investigate these issues in more domains, as the current papers looked mainly at image classification problems.

One approach that might mitigate the above-mentioned problems is to use a modified loss function during the distillation process that penalizes label misalignment between the teacher and the student models.
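Since this modified loss is only a suggested mitigation, the following is a speculative sketch of one way such a label-misalignment penalty could look; the per-sample KL weighting and the coefficient beta are assumptions rather than details from any surveyed work.

    import torch
    import torch.nn.functional as F

    def label_misalignment_penalty(student_logits, teacher_logits):
        # Extra loss on samples whose student prediction disagrees with the teacher's prediction.
        with torch.no_grad():
            misaligned = (student_logits.argmax(dim=-1)
                          != teacher_logits.argmax(dim=-1)).float()      # (batch,)
        per_sample_kl = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="none",
        ).sum(dim=-1)                                                     # (batch,)
        return (misaligned * per_sample_kl).mean()

    # Usage sketch: total_loss = distillation_loss + beta * label_misalignment_penalty(s_logits, t_logits)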

Allen-Zhu and Li (2020), in a recent paper, argue that knowledge distillation in neural networks works fundamentally differently compared to traditional random feature mappings. The authors put forward the idea of 'multiple views' of a concept, in the sense that a neural network, with its hierarchical learning, learns multiple aspects of a class. Some or all of these concepts are present in a given class sample. A distilled model is forced to learn most of these concepts from a teacher model using …

… should be used in evaluating a technique, and using the accuracy measure may not be sufficient by itself.

Some of the challenges in the area were discussed in this paper in addition to possible future directions.

Last but not least, we also discussed in this paper some of the practical applications of knowledge distillation to real-world problems.