Knowledge Distillation in Image Classification: The Impact of Datasets

As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.


Introduction
In the ever-evolving landscape of computer vision, image classification is a fundamental and challenging task with many applications [1][2][3][4]. In the recent literature, machine learning (ML) models based on deep neural networks (DNNs) have proven to be the most effective for computer vision, particularly image analysis [5][6][7]. To achieve this efficiency, several DNN architectures have been proposed in the literature, with different processes including knowledge distillation [8][9][10]. Knowledge distillation in deep neural networks is a crucial process in the ML field [11]. As the demand for more efficient and lightweight models grows, the concept of knowledge distillation (KD) has emerged as a promising avenue to transfer knowledge from complex, high-capacity models (teachers) to simpler, more deployable counterparts (students) [8,12]. This transfer of knowledge from the teacher to the student through a training paradigm typically involves the following steps.

1. Teacher model training: The first step is to train a large and complex model (the teacher) on a given dataset to achieve high accuracy.

2. Generation of soft targets: The trained teacher model is then used to make predictions on the training data, producing probability distributions (soft targets) over the possible classes. These soft targets contain more information than the hard targets (i.e., the actual labels), as they reflect the relative confidence of the teacher model in its predictions. The soft targets can be obtained using a softmax function with temperature:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $q_i$ is the output probability for class $i$, $z_i$ is the logit for class $i$, and $T$ is the temperature parameter.

3. Student model training: The smaller student model is trained using a combination of the original true labels and the soft targets generated by the teacher model. The loss function typically includes a component for the standard classification loss and another component for the distillation loss, which measures the difference between the student and teacher probability distributions. The Kullback-Leibler (KL) divergence is usually used for the distillation loss. It is defined as

$$KL(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

where $P$ and $Q$ are probability distributions defined on the same sample space $X$.

The final loss is a weighted combination of these two components, where $L_{classification} = -\sum_i y_i \log(p_i)$ and $L_{distillation} = KL(q^{T}_{teacher} \parallel q^{T}_{student})$. This approach makes it possible to compress and generalize the information learned by complex deep neural networks, facilitating their deployment on resource-limited devices [8,13]. KD not only facilitates model compression but also enhances the generalization capabilities of the student model [10]. The success of KD is inherently tied to the quality and diversity of the datasets used during the training step, as well as to the wide range of applications of KD-based learning processes [1,12,14-19].
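The three steps above can be illustrated with a small numerical sketch in plain Python. It is not the implementation used in this study: the function names are hypothetical, and the way the two loss terms are weighted (here $(1-\alpha)$ on the classification term and $\alpha T^2$ on the distillation term, a common convention) is an assumption, since the weighting is not spelled out above.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(one_hot, probs, eps=1e-12):
    """Classification loss: -sum_i y_i * log(p_i)."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))

def kd_loss(student_logits, teacher_logits, one_hot, T=4.0, alpha=0.7):
    """Weighted sum of classification and distillation losses.

    Assumption: alpha weights the distillation term and the T*T factor
    rescales its gradients; other weightings are equally plausible.
    """
    p_student = softmax_with_temperature(student_logits, 1.0)
    q_teacher = softmax_with_temperature(teacher_logits, T)
    q_student = softmax_with_temperature(student_logits, T)
    l_classification = cross_entropy(one_hot, p_student)
    l_distillation = kl_divergence(q_teacher, q_student)
    return (1 - alpha) * l_classification + alpha * T * T * l_distillation

# Hypothetical logits for a 3-class problem.
teacher_logits = [4.0, 1.5, 0.2]
student_logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(teacher_logits, T=1.0))  # sharp distribution
print(softmax_with_temperature(teacher_logits, T=4.0))  # softer distribution
print(kd_loss(student_logits, teacher_logits, [1, 0, 0]))
```

Raising the temperature flattens the teacher distribution, which is what exposes the relative confidence between non-target classes to the student.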
The effectiveness of KD in DNNs could depend largely on the complexity (quality, quantity, etc.) of the data used. Thus, datasets play a pivotal role in shaping the learning process, influencing the model's ability to discern patterns and generalize to unseen features [15][16][17]. While extensive research has been conducted on KD and image classification independently, a comprehensive understanding of how various datasets impact the effectiveness of KD remains an open and critical area of investigation. However, although many studies have been published on this method, few have explored in depth how the characteristics and properties of the data would influence this knowledge transfer process. This research gap raises a crucial question: How do data characteristics, such as complexity, diversity, and distribution, impact the efficiency of KD in a deep neural network? Answering this question will enable us to better understand the challenges and opportunities for KD applications related to the use of different data sources, paving the way for more efficient and robust techniques for transfer learning in deep neural networks.
This study seeks to address this gap by systematically examining the impact of different datasets on KD in image classification. As datasets vary in terms of size, domain specificity, and inherent biases, their influence on the transfer of knowledge from teacher to student models warrants meticulous exploration. Through a series of experiments, we aim to unravel the intricate relationship between dataset characteristics and the performance gains achieved through knowledge distillation. In the subsequent sections, we delve into the relevant literature, providing insights into the existing landscape of KD and its application in image classification. Following this, we elucidate our methodology, detailing the datasets chosen for experimentation, model architectures, and the KD process. The results and their implications are then discussed, shedding light on the nuanced impact of datasets. Ultimately, this study aims to contribute valuable insights for researchers and practitioners navigating the intersection of knowledge distillation and image classification, offering guidance on optimizing model performance through judicious dataset selection.
The remainder of this paper is organized as follows. Section 2 discusses previous work on knowledge distillation in deep neural networks. Sections 3 and 4 describe the proposed research approach and analyze the obtained results, respectively. Finally, Sections 5 and 6 discuss the results and conclude this work.

Related Work
Knowledge distillation (KD) has been widely studied in the literature, and several notable works have contributed to the understanding and development of this technique [20][21][22][23][24][25]. Since its introduction by Hinton et al. [8], this approach has attracted growing interest in the machine learning research community. Table 1 presents some recent knowledge distillation works in the field of image classification. It lists the databases, the architectures of the teacher and student models, and the main evaluation metric used to perform KD in image classification tasks.

Knowledge Distillation in the Literature
Several works have explored various aspects of knowledge distillation in deep neural networks [26], including teacher and student model architectures, regularization techniques, and optimization methods.
For example, Li et al. proposed a transferred attention method to improve the performance of convolutional neural networks [27], while Yazdanbakhsh et al. studied the application of knowledge distillation in specific domains such as healthcare [19]. However, despite these significant advances, little attention has been paid to the impact of data on this knowledge transfer process. In the original work [8], the authors demonstrated the effectiveness of distillation on various tasks and highlighted its potential for model compression. The FitNets paper [20] proposed a specific form of knowledge distillation in which a student network is guided not only by the output probabilities of a teacher network but also by intermediate representations (or hints). This work aimed to improve the transfer of information in the training process. Ref. [27] introduces attention transfer as a form of knowledge distillation. It focuses on transferring attention maps from a teacher to a student network to improve the student's performance. Attention transfer has proven effective in enhancing the generalization capabilities of the student model. To address the limitations of traditional knowledge distillation, ref. [31] introduces Jacobian matching, a method that aims to transfer not only the output probabilities but also the derivatives of the teacher model's predictions. This approach provides a more comprehensive form of knowledge transfer. Ref. [30] explores the benefits of knowledge distillation beyond model compression. The authors show that the knowledge distillation process not only compresses models but also accelerates optimization, enabling faster convergence during training. Ref. [32] extends traditional knowledge distillation with the concept of a "teacher assistant". The teacher assistant helps bridge the performance gap between the teacher and the student, leading to enhanced knowledge transfer.

Role of Datasets for Model Training by KD
The impact of datasets on model training has been a longstanding focus in machine learning research. Datasets serve as the foundation upon which models learn to recognize and classify patterns, making their composition and characteristics crucial determinants of model performance. Studies by refs. [37,38] emphasize the importance of diverse datasets in fostering robust image recognition systems, highlighting how exposure to a wide range of scenarios aids in generalization. In the context of image classification, biases present in datasets have been identified as potential challenges, leading to models that may not generalize well across different domains [37]. Addressing these biases and ensuring dataset diversity are pivotal considerations in the pursuit of building models that can perform reliably across various real-world scenarios.

Research Gap and Motivation
While the individual importance of KD and dataset characteristics in image classification has been adequately explored, a comprehensive examination of how different datasets impact the success of KD remains a notable gap in the literature. Synthesizing the existing literature, we recognize the intertwined nature of knowledge distillation and dataset influence on image classification models. Furthermore, the literature review confirms the preliminary observation that several works have studied knowledge distillation in neural networks [8,28,29,31,32,36]. However, the majority of these studies have relied on a single dataset (CIFAR10, CIFAR100, MNIST, ImageNet, etc.) [20,30-32,34] and, more often than not, on residual network architectures (ResNet) [30,32-35]. Moreover, knowledge acquisition depends on context, that is, on the data, yet existing studies often focus on benchmark datasets without thoroughly investigating the nuances introduced by varying dataset characteristics.
This study aims to bridge this gap by systematically exploring the relationship between dataset properties and the efficacy of knowledge distillation. Successful knowledge transfer relies not only on the distillation techniques but also on the inherent properties of the datasets used during training. In the subsequent sections, we detail our methodology, experimentally addressing this critical gap and shedding light on how different datasets impact the performance of knowledge-distilled models.

Research Method
The research approach adopted in this paper aims to highlight the impact of data complexity on knowledge distillation in deep convolutional neural networks. Figure 1 illustrates this approach and gives an overview of the different steps of the method followed.
As Figure 1 shows, the first step in our approach is to select the databases most commonly used in the literature (see the analysis in Section 2.2) on which to carry out our study, as detailed in Section 3.1. Once the databases have been selected, the next step is to choose the architectures of the teacher and student neural networks with which to test our approach. This stage, which we detail in Section 3.2, ends with training the teacher model and an instance of the student model from scratch on all the experimental datasets. The third stage of our experiment consists of training the student models through knowledge distillation according to two configurations, namely response-based distillation (RKD) and intermediate-based distillation (IKD), which we explain in Section 3.3. Finally, the fourth and last stage of our study consists of comparing the results and assessing the effect of the different databases on knowledge distillation.

Datasets Selection
To comprehensively investigate the impact of datasets on knowledge distillation in image classification, a diverse set of datasets is curated. The selection criteria include considerations of size, domain specificity, and potential biases. Well-established benchmark datasets, such as CIFAR-10, CIFAR-100, and MNIST, as shown in Table 1, form the core of our study, providing a foundation for cross-dataset comparisons.

Dataset Description and Complexity Classification
To highlight the impact of datasets on the distillation of knowledge learned by deep neural networks, we tested teacher and student network architectures on the most popular datasets in the scientific machine learning literature. For this purpose, we used five different datasets, namely MNIST [6], FashionMNIST [7], USPS [11], CIFAR10 and CIFAR100 [5], which are summarized in Table 2 and described in turn in the rest of this section.
Each dataset was selected to represent different characteristics and complexities, ensuring a comprehensive evaluation of the distillation process. The classification of the level of data complexity in this article is based on a combined analysis of the dataset's characteristics (dimensionality, class diversity, data volume, variability and domain specificity) and the performances obtained in the literature [39,40]. The complexity levels of the datasets, which are those mentioned in the literature review for knowledge distillation in image classification, were determined according to several key criteria, which include the following:

• Dimensionality: the resolution and color channels of the images. Higher resolution and multiple color channels generally increase the complexity of the dataset, as they require more sophisticated models to capture detail.

• Class diversity: the number and variability of classes within the dataset. A larger number of classes with significant differences between them increases complexity because the model has to distinguish between a larger set of categories.

• Data volume: the size of the dataset in terms of the number of samples. Larger datasets can be more complex to manage and require more computing resources, but they also provide more information for robust model training.

• Variability: the level of noise, background variation, and object diversity within the dataset. Datasets with high variability in object appearance, backgrounds, and noise levels are more difficult for models to learn and generalize.

• Domain specificity: the within-domain specificity and variability of the dataset (e.g., handwritten digits versus real-world objects). Datasets from domains with high intra-class variability and inter-class similarity are considered more complex due to the more subtle distinctions that need to be learned.
The complexity increases from MNIST and USPS to FashionMNIST, CIFAR-10, and finally CIFAR-100, with the latter being the most challenging among the mentioned datasets for an image classification task using the ResNet architecture.
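To make the dimensionality, class diversity, and data volume criteria concrete, the following sketch computes a few raw descriptors for the five datasets. The image shapes, class counts, and total sample counts are the standard published figures for these public benchmarks (training plus test sets) and are an assumption relative to the exact splits used in this study; the helper name `summarize` is hypothetical. Variability and domain specificity are not captured by such simple counts.

```python
# (height, width, channels, number of classes, total samples)
# Standard published figures for each benchmark, not values from this study.
DATASETS = {
    "MNIST": (28, 28, 1, 10, 70_000),
    "USPS": (16, 16, 1, 10, 9_298),
    "FashionMNIST": (28, 28, 1, 10, 70_000),
    "CIFAR-10": (32, 32, 3, 10, 60_000),
    "CIFAR-100": (32, 32, 3, 100, 60_000),
}

def summarize(name):
    """Return raw descriptors related to three of the complexity criteria."""
    height, width, channels, classes, samples = DATASETS[name]
    return {
        "input_dim": height * width * channels,   # dimensionality
        "classes": classes,                       # class diversity
        "samples_per_class": samples / classes,   # data volume per class
    }

for name in DATASETS:
    print(name, summarize(name))
```

CIFAR-100, for instance, combines the largest input dimensionality (3072 values per image) with the fewest samples per class (600), which is consistent with its position as the most challenging dataset in the list.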

Model Architecture Details
Our experimental setup involves employing state-of-the-art model architectures as both teacher and student networks. Convolutional neural networks (CNNs) [42,43] have demonstrated exceptional performance in image classification tasks [44,45], and we leverage ResNet [46] architectures for our experiments. Table 1 shows the frequency of use of ResNet in the literature. The teacher model, being more complex, serves as the knowledge source, while the student model is designed with fewer parameters to facilitate efficient deployment.
ResNet, introduced by ref. [46], has become a pivotal architecture in deep learning due to its ability to tackle the vanishing gradient problem through the innovative use of residual connections [14].
The key innovation of ResNet lies in the use of residual blocks (Figure 2), where each block contains a shortcut connection that bypasses one or more convolutional layers. This shortcut connection enables the network to learn residual mappings, making it easier to optimize deeper architectures. ResNet architectures come in various depths, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152 [46], each with a different number of layers. Table 3 shows the key details of the ResNet-50 and ResNet-18 architectures [46] used in our experiments. The Bottleneck layers (ResNet-50) consist of three layers (1 × 1 convolution for channel reduction, 3 × 3 convolution, and 1 × 1 convolution for channel restoration), optimizing network efficiency and depth, while the Basic Unit layers (ResNet-18) consist of two 3 × 3 convolution layers, maintaining simplicity and reducing computational load.
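The difference between the two block types can be made tangible with a short calculation. The sketch below counts convolution weights only (no biases, batch normalization, or projection shortcuts) and assumes the usual bottleneck expansion factor of 4; the function names are hypothetical and the numbers are illustrative rather than taken from Table 3.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution without bias."""
    return c_in * c_out * k * k

def basic_block_params(channels):
    """Basic Unit (ResNet-18 style): two 3 x 3 convolutions."""
    return 2 * conv_params(channels, channels, 3)

def bottleneck_params(width, expansion=4):
    """Bottleneck (ResNet-50 style): 1 x 1 reduce, 3 x 3, 1 x 1 restore."""
    outer = width * expansion
    return (conv_params(outer, width, 1)
            + conv_params(width, width, 3)
            + conv_params(width, outer, 1))

def residual_forward(x, residual_fn):
    """y = F(x) + x: the shortcut adds the block input to the learned residual."""
    return [xi + fi for xi, fi in zip(x, residual_fn(x))]

print(basic_block_params(64))   # 73728 weights, 64 output channels
print(bottleneck_params(64))    # 69632 weights, 256 output channels
# If the residual branch outputs zeros, the block reduces to the identity.
print(residual_forward([1.0, 2.0], lambda v: [0.0] * len(v)))  # [1.0, 2.0]
```

Under these assumptions, a bottleneck block handles four times as many output channels as a basic block of the same inner width for a similar number of weights, which is why it suits the deeper teacher network.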

Knowledge Distillation Processes
The knowledge distillation process involves transferring the knowledge from the teacher to the student model. We employ a combination of soft targets and intermediate representations during training. The soft targets, representing the teacher model's softened predictions, are integrated with the traditional cross-entropy loss in a weighted objective in which $\alpha \in (0, 1)$ is the balance factor between the two loss terms; $\mathcal{L}_{CE}$ is the cross-entropy loss; $y$ is the one-hot label; $P^{(t)}$ is the teacher output; $P^{(s)}$ is the student output; $D_{KL}$ is the KL divergence [47]; and $\tau$ is a temperature [8].
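One common way of writing such an objective with these symbols is sketched below. The placement of $\alpha$ on the distillation term and the $\tau^2$ scaling factor follow a widespread convention and are assumptions here, not necessarily the exact form used in this study.

```latex
\mathcal{L}_{KD} = (1 - \alpha)\,\mathcal{L}_{CE}\big(y, P^{(s)}\big)
  + \alpha\,\tau^{2}\,D_{KL}\big(P^{(t)}_{\tau} \,\|\, P^{(s)}_{\tau}\big),
\qquad
P_{\tau,i} = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
```

Here $P^{(t)}_{\tau}$ and $P^{(s)}_{\tau}$ denote the teacher and student outputs softened with temperature $\tau$, and $z_i$ is the logit for class $i$.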
Additionally, we incorporate feature-matching techniques to ensure the student model captures intermediate representations from the teacher [20].

Response-Based Knowledge Distillation (RKD)
Response-based knowledge distillation (RKD) is a variant of knowledge distillation that relies on the neural response of the last output layer of the teacher model [48]. The operating principle of RKD is illustrated in Figure 3. Response-based knowledge focuses on the final output layer of the teacher model, on the assumption that the student model will learn to mimic the predictions of the teacher model. As Figure 3 shows, this can be achieved with a loss function, called the distillation loss, which captures the difference between the respective logits of the student model and the teacher model. As this loss is minimized during the learning process, the student model becomes increasingly capable of making the same predictions as the teacher model. By considering the decision-making process of the teacher model, response-based methods can potentially improve the generalization ability and robustness of the student model.

Intermediate Knowledge Distillation
Intermediate-based knowledge distillation (IKD), or feature-based knowledge distillation, is a variant of knowledge distillation in DNNs that highlights knowledge learned from hidden layers. The operating principle of IKD is illustrated in Figure 4. According to Figure 4, IKD extends traditional knowledge distillation by transferring knowledge not just from the final output layer of the teacher model but also from intermediate layers. Indeed, a trained teacher model also captures knowledge of the data in its intermediate layers, which is particularly relevant for deep neural networks. Thus, the intermediate layers learn to discriminate specific features, and this knowledge can be used to train a student model. As depicted in Figure 4, the aim is to train the student model to learn the same feature activations as the teacher model. The distillation loss function achieves this goal by minimizing the difference between the feature activations of the teacher model and the student model. IKD requires careful design to balance the complexity of transferring knowledge from multiple layers while ensuring computational efficiency and avoiding issues such as vanishing gradients.
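A feature-matching loss of this kind can be sketched in a few lines of plain Python. The example assumes a hint-style setup in which a learned linear projection maps the student features to the teacher's feature dimension before a mean squared error is computed; the projection, the function names, and the toy values are assumptions for illustration, since the text above does not specify how the teacher and student features are aligned.

```python
def linear_project(features, weights):
    """Map a student feature vector to the teacher's dimension.

    weights has one row per teacher dimension and one column per
    student dimension (a hypothetical learned regressor).
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def hint_loss(teacher_features, student_features, weights):
    """Mean squared error between teacher and projected student features."""
    projected = linear_project(student_features, weights)
    squared = [(t - p) ** 2 for t, p in zip(teacher_features, projected)]
    return sum(squared) / len(squared)

# Toy example: 2-dimensional student features, 4-dimensional teacher features.
student = [1.0, 2.0]
regressor = [[1, 0], [0, 1], [1, 1], [1, -1]]
print(linear_project(student, regressor))                 # [1.0, 2.0, 3.0, -1.0]
print(hint_loss([1.0, 2.0, 3.0, -1.0], student, regressor))  # 0.0
print(hint_loss([1.0, 2.0, 3.0, 1.0], student, regressor))   # 1.0
```

A projection of this sort is needed whenever the two networks expose features of different widths, as is the case for the final feature vectors of ResNet-18 (512 values) and ResNet-50 (2048 values).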

Experimental Setup and Results Analysis
To investigate the impact of datasets, we conduct experiments with varying configurations, including knowledge distillation with and without dataset-specific adaptations. The success of these manipulations depends on the optimal configuration of experimental parameters and a logical, transparent experimental protocol, which we present in Section 4.1 below.

Experimental Setup
As shown in Figure 1 and further motivated by the literature review in Section 2, we use a teacher-student architecture to distill knowledge in DNNs. Specifically, ResNet50 was used as the teacher model and ResNet18 as the student model.
The teacher model is first trained on the original dataset, producing accurate predictions. We also trained the students from scratch to later compare the results after training the students via distillation. Figure 5 shows the validation accuracy over epochs during the training of the teacher and the student from scratch.
During the knowledge distillation process, the student model is trained on the same dataset using a combination of ground truth labels and soft targets generated by the teacher. This dual learning approach helps the student model generalize better and capture intricate patterns. The loss function used in knowledge distillation incorporates both the traditional cross-entropy loss, comparing the student's predictions with the ground truth labels, and a distillation loss, quantifying the similarity between the student's predictions and the soft targets provided by the teacher. The distillation loss encourages the student to mimic the teacher's decision-making process.
Ref. [15] confirms that good data augmentation can considerably improve knowledge distillation. For data augmentation, we use RandomRotation with a value of 15 to randomly rotate the image by up to 15 degrees, RandomHorizontalFlip to randomly flip the image horizontally, and RandomVerticalFlip to randomly flip the image vertically. We then transform the images to a PyTorch (version 2.1.2) tensor and, finally, normalize the data. The cross-entropy loss was used to train all models with the ground truth labels, and the distillation loss used was the Kullback-Leibler divergence. The hyperparameter controlling the balance between the two losses was α = 0.7. The temperature was T = 4 [8]. We trained the teacher model for 20 epochs and the students for 10 epochs. We used SGD as the optimizer with a learning rate of lr = 0.001. The Kaggle environment (GPU P100) was used as the hardware and PyTorch as the software to conduct the experiments. Each dataset was split into three subsets for training, validation, and testing. Table 4 shows the sizes of each subset. Evaluation metrics encompass traditional classification metrics such as accuracy, as shown in Table 1. We conducted multiple runs for each experiment to account for variability and report averaged results for robust conclusions.
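For reference, the settings reported above can be gathered into a single configuration object, together with a helper for averaging repeated runs. This is only a convenience sketch: the class name, field names, and `aggregate_runs` helper are hypothetical, the accuracy values in the usage example are invented for illustration, and the number of runs is not specified in the text.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)
class DistillationConfig:
    """Hyperparameters as reported in the experimental setup."""
    rotation_degrees: int = 15       # RandomRotation
    horizontal_flip: bool = True     # RandomHorizontalFlip
    vertical_flip: bool = True       # RandomVerticalFlip
    alpha: float = 0.7               # balance between the two losses
    temperature: float = 4.0         # softmax temperature
    optimizer: str = "SGD"
    learning_rate: float = 0.001
    teacher_epochs: int = 20
    student_epochs: int = 10

def aggregate_runs(accuracies):
    """Return the mean and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)

config = DistillationConfig()
print(config)

# Hypothetical accuracies from three runs of the same experiment.
avg, spread = aggregate_runs([0.90, 0.92, 0.94])
print(f"accuracy = {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the average makes it easier to judge whether a difference between two training regimes exceeds run-to-run variability.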
Our methodology combines a diverse set of datasets, state-of-the-art model architectures, and a nuanced knowledge distillation process.This comprehensive approach aims to elucidate the impact of datasets on the effectiveness of knowledge distillation in image classification, providing valuable insights for researchers and practitioners in the field.

Results Analysis
After the experiments, we first analyze and compare the performance of the teacher (ResNet50) and student (ResNet18 trained from scratch) models on all the datasets (Section 4.2.1). We then analyze and compare the knowledge distilled from the teacher (ResNet50) to the student (ResNet18) under RKD and IKD in Sections 4.2.2 and 4.2.3, respectively.

Analysis of the Results of the Teacher and Student Models from Scratch
Recall that the first step in knowledge transfer is to train the teacher model, since its results will guide the learning of the student model. Figure 6 shows the results of the teacher model after training on the different databases. It also shows the results of the student model trained from scratch, which will serve as a basis for comparison after knowledge distillation. Figure 6 shows that the teacher model performs better than the student model. This is expected, since the student model is shallower than the teacher model. Table 5 completes this figure by presenting the performance differences between the two models on the databases involved. From these representations, we can also see that the performance of both models decreases with database complexity. Further analysis after distillation will enable us to determine whether the same behavior is observed. Once the teacher model has been trained, the student model can be trained by knowledge distillation. We carried out two different types of distillation experiments, namely RKD [8] for response-based KD and IKD [20] for intermediate-based KD. Sections 4.2.2 and 4.2.3 present the results of these distillations, respectively.

RKD Performance Results Analysis
In the RKD architecture, the student model is trained and guided by the results of the last layer of the teacher model [8]. Figure 7 shows the results of the student model after training by RKD on the different databases. In Figure 7, we can generally see a slight performance gain for the student model. This gain increases as the complexity of the database increases. To complement Figure 7, Table 6 shows the performance gap between the student instance trained from scratch and that trained by response-based knowledge distillation. Part (b) of Figures A1, A3, A5, A7 and A9 shows the precision and loss curves during the epochs of RKD training of the student model on the MNIST, USPS, FashionMNIST, CIFAR10 and CIFAR100 databases, respectively.

IKD Performance Results Analysis
In the IKD architecture, the student model is trained and guided by the results of the teacher model's intermediate layer [20]. Figure 8 shows the results of the student model after training by IKD on the different databases.
According to the illustration in Figure 8, we observe a considerable overall performance gain for the student model.This gain is even greater as the complexity of the database increases and is much better than that of RKD.Once again, Table 7 completes Figure 8 by presenting the numerical differences in performance between the student instance trained from scratch and that trained by intermediate-based knowledge distillation (IKD).
Part (c) of Figures A1, A3, A5, A7 and A9 shows the precision and loss curves during the epochs of IKD training of the student model on the MNIST, USPS, FashionMNIST, CIFAR10 and CIFAR100 databases, respectively. We can draw two major observations from Figure 9 by comparing it with Figure 6. The first observation concerns RKD: although the student model gains in performance from RKD, this gain is nevertheless slight, and the observation that the performance of the two models decreases as the complexity of the database increases is confirmed. On the other hand, when we look at IKD, the gain for the student model is much more significant. Here, we see that, unlike the others, the student model gains much more in performance as the complexity of the database increases.
Figure 10 shows the gains in student performance after distillation.We first note that IKD [20] performs significantly better than RKD [8].We note that the more complex the database, the greater the gain in terms of performance.
Part (c) (IKD) of Figures A2, A4, A6, A8 and A10 confirms this last observation. Indeed, we observe a significant increase in the F1-score compared to part (a), from scratch, and part (b), RKD. This increase is proportional to the complexity of the database. We can conclude from this that the more complex the database, the greater the effect of distillation. Figure 11 shows the differences in performance between the different instances of the student model (from scratch, RKD, and IKD) compared to that of the teacher model. Knowledge distillation is indeed effective, and we even note that in the case of IKD, the student performs better than the teacher. On the other hand, we observe that in the least complex databases (MNIST, USPS and FashionMNIST), the performances of the teacher and the different instances of the student are approximately the same. We observe a notable difference in the IKD framework on the CIFAR10 and CIFAR100 databases. This leads us to draw two conclusions:

1. Knowledge distillation has a considerable effect on problems with complex databases. The more complex the database, the deeper and more powerful the model required for training. With a powerful teacher model capable of characterizing knowledge, the transfer to the student model will be assured.

2. By observing the performance provided by RKD and that provided by IKD on different databases, we conclude that the IKD method is preferable to RKD when dealing with complex databases.

According to Figure 12, we see a slight variation in the curve for the LOW and LOW TO MODERATE databases, namely MNIST and FashionMNIST. The MODERATE (USPS) database curve shows a slightly more marked variation. Finally, the most complex CIFAR10 and CIFAR100 databases (MODERATE TO HIGH and HIGH) show a significant variation.
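The quantities compared in this analysis (accuracy, F1-score, and the gain of a distilled student over the student trained from scratch) can be computed as in the following sketch. Macro-averaging of the F1-score is an assumption, since the averaging scheme is not stated above, and the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

def distillation_gain(metric_scratch, metric_distilled):
    """Positive values mean the distilled student beats the scratch student."""
    return metric_distilled - metric_scratch

# Toy labels for a 3-class problem (not results from this study).
y_true = [0, 0, 1, 1, 2, 2]
pred_scratch = [0, 1, 1, 1, 2, 0]
pred_distilled = [0, 0, 1, 1, 2, 0]
gain = distillation_gain(accuracy(y_true, pred_scratch),
                         accuracy(y_true, pred_distilled))
print(macro_f1(y_true, pred_scratch, 3), gain)
```

Computing the gain per dataset and ordering the datasets by complexity level gives the kind of comparison summarized in the figures discussed above.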

Discussion
The analysis of the results sheds valuable light on the effect of databases on knowledge distillation. By highlighting the importance of choosing the appropriate distillation method according to the complexity of the data and the learning objectives, these results could have important implications for the development of more robust and generalizable learning models. We highlight these insights in Sections 5.1 and 5.2.

Impact of Database Complexity on Distillation
By examining the performance curves for different databases, we observed significant variations according to the complexity of the data. This observation highlights the importance of considering the diversity of the data and its specific characteristics when designing learning models. The results show a significant difference between RKD and IKD. While RKD shows modest performance gains, IKD shows much more significant improvements, especially with complex databases. This raises questions about the mechanisms underlying these two approaches and their effectiveness in different contexts. More specifically, IKD outperforms RKD mainly because of the nature of the information that each method transfers from the teacher model to the student model. IKD focuses on aligning the student's internal representations or feature maps with those of the teacher at different levels [20,49]. This method ensures that the student model not only learns the final results but also mimics the teacher's hierarchical feature extraction process, capturing richer and more nuanced information throughout its architecture [20]. The theoretical underpinnings support this advantage. Intermediate representations contain fine-grained information and hierarchical abstractions that are crucial for complex tasks. By transferring these representations, the student model is better equipped to understand and generalize from the data. This approach exploits the concept of learning intermediate features, which are often more informative than final logits alone, particularly in deep networks where each layer captures progressively higher-level abstractions. In contrast, RKD relies solely on the teacher's final logits [8,9]. Although this method helps the student learn the final decision boundaries, it does not provide the intermediate knowledge essential for a comprehensive understanding of the input [48]. This can lead to less effective transfer, as the student does not benefit from the multi-level learning process followed by the teacher. Interestingly, IKD seems to be more resilient to increasing data complexity; the results show that in some cases, the distilled student (especially with IKD) can even outperform the teacher. This suggests that the transmission of knowledge through abstract features may be more robust in varied or complex data environments.
In such cases, knowledge distillation may lead to better generalization or adaptation to the specific test data.
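To make the contrast between the two kinds of transferred information concrete, the following is a minimal sketch in plain Python, not the implementation used in this study. It assumes that the logit-based term (RKD) is a temperature-softened KL divergence between teacher and student outputs, and that the feature-based term (IKD) is a mean squared error between matched intermediate feature vectors. The function names, the temperature value, and the choice of MSE are illustrative assumptions. The sketch also assumes that student and teacher features already have the same dimension; with ResNet18 and ResNet50, whose layer widths differ, a projection layer would be needed in practice.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rkd_loss(student_logits, teacher_logits, temperature=4.0):
    """Logit-based term (assumed form): T^2 * KL(teacher || student)
    computed on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def ikd_loss(student_feats, teacher_feats):
    """Feature-based term (assumed form): mean squared error between
    matched feature vectors, averaged over the selected layers."""
    per_layer = []
    for fs, ft in zip(student_feats, teacher_feats):
        per_layer.append(sum((a - b) ** 2 for a, b in zip(fs, ft)) / len(fs))
    return sum(per_layer) / len(per_layer)


# Hypothetical toy values, for illustration only.
teacher_logits = [1.0, 2.0, 3.0]
student_logits = [3.0, 2.0, 1.0]
print(rkd_loss(student_logits, teacher_logits))      # positive: outputs disagree
print(ikd_loss([[0.0, 0.0]], [[1.0, 1.0]]))          # 1.0: features differ by 1 per unit
```

The logit-based term only sees one vector per input, whereas the feature-based term receives one signal per matched layer, which is the multi-level information the discussion above refers to.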

Optimisation of Distillation Strategies
The results indicate the need to develop more sophisticated distillation strategies that take into account the specific nature of the data and the characteristics of the models. In fact, the more complex the database, the greater the effect of distillation on improving the performance of the student model. This observation highlights the importance of taking into account the specific nature of the data when choosing the distillation method and designing the model. According to the results obtained, the IKD method is preferable to RKD due to its greater performance gains.
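The performance gain discussed here can be read as the accuracy of a distilled student minus the accuracy of the same student trained from scratch. The short sketch below shows that arithmetic. The dataset names and accuracy values are hypothetical placeholders chosen only to illustrate the computation; they are not results from this study.

```python
def distillation_gain(scratch_acc, distilled_acc):
    """Gain, in percentage points, of a distilled student over the
    same student architecture trained from scratch."""
    return distilled_acc - scratch_acc


# Placeholder accuracies (%), NOT measurements from the study.
results = {
    "simple_dataset": {"scratch": 98.0, "rkd": 98.2, "ikd": 98.5},
    "complex_dataset": {"scratch": 60.0, "rkd": 61.0, "ikd": 65.0},
}

# Gain per dataset and per distillation method, rounded for display.
gains = {
    name: {
        method: round(distillation_gain(acc["scratch"], acc[method]), 2)
        for method in ("rkd", "ikd")
    }
    for name, acc in results.items()
}

for name, g in gains.items():
    print(name, g)
```

Comparing these gains across datasets ordered by complexity is one simple way to check whether a given method benefits more from harder data.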

Limitations of the Study
Although the results obtained in our work are promising, our study has certain limitations. The limited choice of model architectures (ResNet50, ResNet18), the restriction of the scope to image classification tasks, the nature of the data used, and the choice of distillation methods (RKD, IKD) were deliberate choices to maintain a controlled and detailed analysis in a well-defined context. The performance measures and evaluation methods used in this study could also be a limitation. The scope of the literature search was limited by access restrictions on some articles, so we may have overlooked important findings that could influence our results. Another limitation lies in isolating the impact of individual variables on knowledge distillation performance: we compared KD performance on very different datasets rather than systematically varying individual parameters while holding other factors constant. The interpretation of the results is also open to discussion.

Conclusions
We conducted a thorough examination of the impact of databases on knowledge distillation in the context of image classification. We used a diverse array of databases with different levels of complexity. By meticulously analyzing the performance of both teacher and student models across various distillation methods, we were able to derive several important conclusions.
Firstly, our results clearly demonstrated that knowledge distillation can be pivotal in enhancing the performance of student models, particularly in scenarios where the data are intricate and heterogeneous. Specifically, the IKD method exhibited more substantial performance improvements compared to the RKD method, underscoring the significance of transferring knowledge through abstract and generalizable representations. Furthermore, we observed that the complexity of the database plays a critical role in determining the effectiveness of knowledge distillation. Our findings indicated that as the complexity of the database increases, so do the performance gains of the student model, emphasizing the necessity of considering the unique characteristics of the data during the distillation process.
Additionally, our comprehensive analyses allowed us to compare the performance of the teacher and student models in detail, revealing instances where the distilled student models actually outperformed their teacher counterparts. This observation highlights the remarkable potential of knowledge distillation to foster improved generalization and adaptation to specific test data. Moreover, our results provided guidance on selecting the most appropriate distillation method based on the complexity of the database. Specifically, they suggest that the IKD method is particularly advantageous in scenarios involving complex and varied data.
Overall, our study offers valuable insights into the influence of databases on knowledge distillation, contributing important perspectives for the development of more robust, generalizable, and efficient machine learning models applicable to a wide range of domains. By delving into the nuances of how different distillation methods perform across diverse datasets, we provide a deeper understanding that can inform future research and practical applications in the field of machine learning and image analysis in particular.

Figure 1. Flowchart of the proposed approach to highlight the impact of the dataset on knowledge distillation in DNN.

Figure 5. Variation in the validation accuracy by epochs for (a) the teacher model (ResNet50) and (b) the student model (ResNet18).

Figure 6. Test accuracy for the teacher and the student model instance trained from scratch.

Figure 7. Test accuracy for the teacher and the student model instance distilled with RKD.

Figure 8. Difference between the student model from scratch and the student IKD accuracy.

4.2.4. Analysis of the Impact of the Database on Knowledge Distillation
After analysing and comparing the results of the teacher model with those of the different instances of the student model, in this section we will analyse the effect of the databases on the distillation itself. To do this, we will first look at Figure 9, which shows the results of the different distillations compared with those of the teacher model; then we will look at Figures 10 and 11, which present the effect of distillation on the different databases; and finally we will observe Figure 12, which presents the impact of datasets on knowledge distillation.

Figure 9. Difference between the teacher and instance student model distilled from RKD (a) and instance student model distilled from IKD (b).

Figure 11. All instances of student performance compared to teacher performance: bar visualisation (a) and curve visualisation (b).

Figure A10. Student metrics after the training phase of the student model on the CIFAR100 dataset. (a) Training student from scratch, (b) RKD student training, and (c) IKD student training.

Table 1. Summary of recent literature on knowledge distillation in image classification. EM = evaluation metric.

Table 2. Key statistics for each dataset.
• USPS [11] is a digit dataset automatically scanned from envelopes by the U.S. Postal Service, containing a total of 9298 16 × 16 pixel grayscale samples; the images are centered and normalized and show a broad range of font styles. Similar to MNIST, USPS contains images of handwritten digits. It is slightly more challenging than MNIST but still relatively simple. Criteria: small image size (16 × 16 pixels), same number of classes (10 digits), and slight variations in style and noise compared to MNIST.
• MNIST [41] is a dataset with 28 × 28 grayscale images of handwritten digits. It consists of ten different classes and is often used for image classification tasks. The dataset is relatively simple and is often used as a beginner's dataset for image classification tasks. Criteria: small image size (28 × 28 pixels), a limited number of classes (10 digits), and a simple and uniform structure with minimal noise.
• FashionMNIST [7] is a dataset with 28 × 28 grayscale images of fashion items, such as clothing and accessories. It consists of ten different classes and is often used as a replacement for the traditional MNIST dataset for image classification tasks. The dataset is more complex than MNIST, as it requires the model to recognize various types of clothing items. Criteria: same image size (28 × 28 pixels) as MNIST, but with 10 different classes of clothing, introducing more variability in shapes and textures.
• CIFAR-10 [5]: This dataset consists of 60,000 32 × 32 color images across ten different classes, each containing 6000 images. The classes include common objects like cars, dogs, and cats. The addition of color and more diverse objects increases the complexity compared to MNIST and USPS. Criteria: larger image size (32 × 32 pixels), three-channel color images, more diverse classes, and significant background variations.
• CIFAR-100 [5]: Similar to CIFAR-10, CIFAR-100 has 100 classes, with 600 images per class. It covers a broader range of object categories, making it more challenging. The increased number of classes and the finer distinctions between categories make it a more complex classification task compared to the previous datasets. Criteria: same image size (32 × 32 pixels) and color channels as CIFAR-10, but a much larger number of classes (100), increasing variability and the challenge of classification.

Table 3. Details of the ResNet architectures used for the teacher and student models.

Table 4. Distribution of different data sizes for training (83.33%), validation (11.66%), and testing (5%). These data sizes were chosen based on experimental results from the literature review.

Table 5. Difference between teacher and student accuracy.

Table 6. Difference between the student model from scratch and the student RKD accuracy.

Table 7. Difference between the student model from scratch and the student IKD accuracy.