Resolution-Based Distillation for Efficient Histology Image Classification

Developing deep learning models to analyze histology images has been computationally challenging, as the massive size of the images strains every part of the computing pipeline. This paper proposes a novel deep learning-based methodology for improving the computational efficiency of histology image classification. The proposed approach is robust when used with images that have reduced input resolution and can be trained effectively with limited labeled data. Our method uses knowledge distillation (KD) to transfer learned knowledge from a teacher model, pre-trained on the original high-resolution (HR) images, to a student model trained on the same images at a much lower resolution. To address the lack of large-scale labeled histology image datasets, we perform KD in a self-supervised manner. We evaluate our approach on two histology image datasets associated with celiac disease (CD) and lung adenocarcinoma (LUAD). Our results show that a combination of KD and self-supervision allows the student model to approach, and in some cases surpass, the classification accuracy of the teacher, while being much more efficient. Additionally, we observe an increase in student classification performance as the size of the unlabeled dataset increases, indicating that there is potential to scale further. For the CD data, our model outperforms the HR teacher model while requiring 4 times less computation. For the LUAD data, our student model results at 1.25x magnification are within 3% of the teacher model at 10x magnification, with a 64 times reduction in computational cost. Moreover, our CD results benefit from performance scaling as more unlabeled data is used: at 0.625x magnification, using unlabeled data improves accuracy by 4% over the baseline. Thus, our method can improve the feasibility of deep learning solutions for digital pathology on standard computational hardware.


INTRODUCTION
Digital pathology was introduced over 20 years ago to facilitate viewing and examining high-resolution scans of histology slides. A digital scanning process produces whole-slide images (WSIs), which are then analyzed with computational tools [1,2]. While digital scans circumvent traditional microscope use, they introduce new computational challenges. The resulting WSIs can be as large as 150,000×150,000 pixels and occupy gigabytes of space. Software tools such as OpenSlide [3] and QuPath [4] have been instrumental in reading and analyzing WSIs. Nevertheless, these tools are only a single step in the pipeline to store and analyze WSIs, and they cannot address other computational requirements such as storage capacity, network bandwidth, computing power, and graphics processing unit (GPU) memory.
In recent years, computer vision-based deep learning methods have been developed for digital pathology [5][6][7]; however, their application and scope have been limited by the massive size of WSIs. Figure 1 illustrates the magnitude of a sample histology image. Even with the most recent computational advancements, deep learning models for analyzing WSIs are still not feasible to run on all except the most expensive hardware and GPUs. These computational constraints on analyzing high-resolution WSIs have limited the adoption of deep learning solutions in digital pathology. This paper addresses this computational bottleneck by implementing a deep learning approach designed to operate accurately on lower-resolution versions of WSIs. This approach aims to lower the resolution of the input image while minimizing the effect on classification performance. By operating on lower-resolution WSIs, our approach allows slides to be scanned at a lower resolution, reducing scanning time and the strain on computational hardware and infrastructure. Our proposed methodology is a novel approach to make high-resolution histology image analysis more efficient and feasible on standard hardware and infrastructure. Specifically, we propose a knowledge distillation-based method in which a teacher model operates at a high resolution and a student model operates at a low resolution. We aim to distill the teacher model's learned representation knowledge into a student model trained at a much lower resolution. The knowledge distillation is performed in a self-supervised fashion on a larger unlabeled dataset from the same domain. Large labeled datasets are hard to find in the medical field, leading us to adopt a self-supervised approach to account for the lack of access to sizeable labeled histology image datasets. This knowledge distillation method can potentially increase the model's performance on lower-resolution images while saving significant amounts of memory and computation.

Histology Image Classification
Several methods have previously been proposed to solve the WSI classification problem. Some approaches work by tiling the WSI into more reasonably sized patches and learning to classify at the patch level [5][6][7][8]. In some recent works, the patch-level predictions are aggregated using simple heuristic rules to produce a slide-level prediction [5,6,8]. These rules are modeled after how pathologists classify WSIs in clinical practice. In another work, a simple maximum function was applied to patch-based slide heat maps for whole-slide predictions [7]. While these methods achieved reasonable overall performance, their analyses are fragmented, and they do not incorporate the relevant spatial information into the training process. We aim to avoid patch-based processing, since it introduces additional computational overhead that can be bypassed with tissue- or slide-based analysis methods.
Multiple-instance learning (MIL) has been proposed to address the slide-level labeling problem [9][10][11][12][13]. MIL is a supervised learning scheme in which data points, or instances, are grouped into bags, and each bag is labeled with a class based on the count of instances of that class it contains. MIL is well suited to histology slide classification, as it is designed to operate on weakly labeled data. MIL-based methods better account for the weakly labeled nature of patches, but they still tend to miss the holistic slide information.
Recent work has shown that operating at the slide level is possible by splitting the computation into discrete units that can be run on commodity hardware [14,15]. The overall calculation is equivalent to the one performed at the slide level due to the invariance of most layers in a convolutional neural network. This method aims to analyze WSIs at the original high-resolution level to avoid losing the larger context and fine details. Although this approach helps run large neural networks, it still requires considerable computational resources to analyze WSIs at high resolution.
Attention-based processes have also been suggested for WSI analysis. Attention-based mechanisms divide the high-resolution image into large tiles and simultaneously learn the most critical regions of WSIs for each class and their labels [16][17][18]. Although these methods achieve high classification performance, they still require considerable computational resources to analyze high-resolution images.

Self-Supervised Learning
Self-supervised learning is a machine learning scheme that allows models to learn without explicit labels. Large, unlabeled datasets are readily accessible in most domains, and self-supervised methods can help improve classification performance without requiring resource-intensive, manually labeled data. In this scheme, learning occurs through a pre-text task based on an inherent attribute of the data. As the pre-text task operates on an existing data feature, it requires no manual intervention and can be easily scaled. Proposed pre-text tasks include colorization [19,20], rotation [21,22], jigsaw puzzles [23], and counting [24]. Recent studies have explored the invariance of histology images to affine transformations, but none use self-supervised learning [25,26]. Several other works have proposed self-supervised techniques for histology images that exploit domain-specific pre-text tasks, including slide magnification prediction [27], nuclei segmentation [28], and spatial continuity [29]. In contrast, our work introduces a new pre-text task designed to transfer the knowledge present in models trained on high-resolution WSIs to models operating on low-resolution WSIs.

Knowledge Distillation
Knowledge distillation has proven to be a valuable technique for transferring learned information between distinct models with different capacities [30,31]. As models and datasets exponentially increase in size, it is critical to adapt our methods accordingly to support less powerful devices [32]. Knowledge distillation has benefited many areas of computer vision, such as semantic segmentation [33], facial recognition [34,35], object detection [36], and classification [37]. Although some prior work has used knowledge distillation for chest X-rays in the medical domain [38], knowledge distillation has not been widely used for histology image analysis.
Initial knowledge distillation studies used neural network output activations, called logits, to transfer the learned knowledge from a teacher model to a student model [30][31][32].
FitNet built upon this knowledge distillation paradigm by suggesting that while the logits are important, the intermediate activations also encode the model's knowledge [39]. This method adds a regression term to the knowledge distillation objective to improve the overall performance of the student model while reducing the number of parameters. In this paper, we model our architecture after the FitNet approach to maintain the spatial correspondence between teacher and student models, as it represents clinically relevant information. Of note, in contrast to our approach, previous work in this domain does not include self-supervision [40]. As we show later in this paper, self-supervision proves to be a deciding factor in increasing overall classification performance for histology images.

TECHNICAL APPROACH

Overview
There are two main phases and one optional phase to our approach:

1. Train a teacher model at 10x magnification on the labeled dataset, as explained in Section 3.2.

2. Train the knowledge distillation model on the unlabeled dataset from 10x down to a lower magnification, as explained in Section 3.3 and shown in Figure 2.

3. (Optional) Fine-tune the student model using the labeled dataset at a lower resolution, as explained in Section 3.4.

Teacher Model
For the teacher model, we used a residual network (ResNet) [41]. ResNet was chosen for its excellent empirical performance compared to other deep learning architectures. We used the built-in ResNet PyTorch implementation [42].
The teacher model input was slides at 10x magnification (1 μm/pixel), which are considered high-resolution. We performed online data augmentation consisting of random perturbations to the color brightness, contrast, hue, and saturation; horizontal and vertical flips; and rotations. Additionally, each input was standardized by the mean and standard deviation of the training set across each color channel.
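To make the teacher setup concrete, the following is a minimal PyTorch sketch. The specific ResNet depth, jitter magnitudes, and normalization statistics below are illustrative assumptions, not the values used in our experiments.

```python
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 3  # e.g., Normal, Non-specific Duodenitis, Celiac Sprue

# Teacher: a standard torchvision ResNet with a new classification head.
# The depth (resnet18) is an assumption; the paper only specifies "ResNet".
teacher = models.resnet18(weights=None)
teacher.fc = nn.Linear(teacher.fc.in_features, NUM_CLASSES)

# Online augmentation as described above; all magnitudes are placeholders.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
    # Standardize by the per-channel mean/std of the training set
    # (the numbers below are placeholders).
    transforms.Normalize(mean=[0.80, 0.60, 0.75], std=[0.15, 0.20, 0.15]),
])
```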

Knowledge Distillation from High-Resolution
Knowledge distillation (also referred to as 'KD') is a machine learning method in which, typically, a larger, more complex model "teaches" a smaller, simpler student model what to learn [30]. The learning occurs by optimizing over some desired commonality between the models. For our approach, we opted to keep the student and teacher model architectures identical and instead modified the input resolution. As input data resolution is a significant factor for efficient and accurate histology image analysis, we decided that the teacher model should receive the original high-resolution image as input while the student model receives a low-resolution input. For optimizing our knowledge distillation model, the total loss is the sum of (1) the soft loss and (2) the pixel-wise loss. These loss components are described below, and an overview of our knowledge distillation approach is shown in Figure 2.
Figure 2. Overview of the knowledge distillation model. The resizing block scales the teacher feature maps to the same size as the corresponding student ones. The pixel-wise and soft losses are combined to produce the total loss for the optimization process.
To promote classification similarity between the teacher and student models, we utilized the Kullback-Leibler (KL) divergence over the outputs of the teacher and student models as the loss function [30,43]. Additionally, the loss function is "softened" by adding a temperature $T$ to the softmax computation. Intuitively, softening the loss function gives more weight to smaller outputs, thus transferring information that would have been overpowered by greater values. The soft loss is computed as follows:

$$\mathcal{L}_{\text{soft}} = T^2 \, D_{\mathrm{KL}}\!\left(\sigma\!\left(\vec{z}_t / T\right) \,\middle\|\, \sigma\!\left(\vec{z}_s / T\right)\right),$$

where $\vec{z}_t$, $\vec{z}_s$, and $\sigma(\cdot)$ represent the teacher logits, student logits, and softmax function, respectively. Note that we multiply by $T^2$ since the gradients scale inversely with this factor [30].
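A minimal PyTorch sketch of this soft loss follows, assuming the standard formulation from [30]; the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def soft_loss(teacher_logits: torch.Tensor,
              student_logits: torch.Tensor,
              temperature: float = 4.0) -> torch.Tensor:
    """Temperature-softened KL divergence between teacher and student logits."""
    # Teacher probabilities and student log-probabilities at temperature T.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # multiply by T^2 to compensate for the 1/T^2 gradient scaling [30].
    return F.kl_div(log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2
```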
To ensure that the teacher and student models focus on similar areas, we compute the mean-squared error over the feature map outputs. We introduce $f_1(\cdot)$ and $f_2(\cdot)$ as the max-pooling and bicubic interpolation operations, respectively. We use both max-pooling and bicubic interpolation for the pixel-wise loss; as shown in the Supplementary Material, we found that these two functions, when combined, provide the most consistent performance for the loss function. The pixel-wise loss is computed as follows:

$$\mathcal{L}_{\text{pixel}} = \mathrm{MSE}\!\left(f_2\!\left(f_1\!\left(A_t\right)\right), A_s\right),$$

where $A_t$ and $A_s$ are the outputs of the final convolutional layer in the teacher and student models, respectively. We require the functions $f_1(\cdot)$ and $f_2(\cdot)$ since the teacher feature map outputs are $n^2$ times larger than the student ones, ignoring negligible differences due to the rounding of non-integer dimensions in some instances.
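Below is a sketch of the pixel-wise loss under the assumption that max-pooling ($f_1$) followed by bicubic interpolation ($f_2$) resizes the teacher feature maps down to the student's spatial size; the pooling factor is a placeholder.

```python
import torch
import torch.nn.functional as F

def pixel_loss(teacher_fmap: torch.Tensor,
               student_fmap: torch.Tensor,
               pool_factor: int = 2) -> torch.Tensor:
    """MSE between resized teacher feature maps and student feature maps."""
    # f1: max-pool the larger teacher maps (pool_factor is an assumption).
    pooled = F.max_pool2d(teacher_fmap, kernel_size=pool_factor)
    # f2: bicubic interpolation to exactly the student's spatial size,
    # absorbing any rounding from non-integer dimensions.
    resized = F.interpolate(pooled, size=student_fmap.shape[-2:],
                            mode="bicubic", align_corners=False)
    return F.mse_loss(resized, student_fmap)

# The total distillation objective is then the sum of the two terms:
# total = soft_loss(z_t, z_s) + pixel_loss(a_t, a_s)
```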

Fine-Tuning
After performing the knowledge distillation, we fine-tuned the student model weights on the lower-resolution training dataset. The goal of fine-tuning is to make small weight adjustments for maximal performance on the labeled data without undoing the learning in the previous layers. To this end, all weights were frozen except those in the fully connected layer. The weights were trained using the Adam optimization algorithm [44] until convergence. The data augmentations enumerated in Section 3.2 were applied to the input data. As in the teacher model training, we used the cross-entropy loss to learn ground-truth labels in this phase. We opted to skip this phase in experiments where the same training set was used across Phases I and II, as it resulted in lower classification performance on the validation set due to overfitting on the training set.
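As a sketch, the freezing and optimizer setup could look like the following, where `student` and `labeled_loader` are assumed to hold the distilled student model and the low-resolution labeled training data; the learning rate is a placeholder.

```python
import torch
import torch.nn as nn

# Freeze all weights except those in the fully connected layer.
for param in student.parameters():
    param.requires_grad = False
for param in student.fc.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(student.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One epoch of fine-tuning on the labeled low-resolution data;
# in practice this loop runs until convergence.
for images, labels in labeled_loader:
    optimizer.zero_grad()
    loss = criterion(student(images), labels)
    loss.backward()
    optimizer.step()
```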

Gradient Accumulation
We used gradient accumulation to account for the large and variable sizes of the slides. Gradient accumulation computes the forward and backward passes for each input but does not update the model weights until all mini-batch backward passes are complete. While gradient accumulation does not affect most layers, batch normalization layers are affected, as they require statistics computed over a mini-batch to work correctly. In our model, we therefore replaced batch normalization with group normalization, which performs more consistently across varying mini-batch sizes [45] because it is independent of the mini-batch dimension. This modification allows the model to learn properly with gradient accumulation.
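The following sketch illustrates both pieces: a ResNet built with group normalization, and a training loop that accumulates gradients over a virtual mini-batch of slides processed one at a time. The group count and accumulation steps are illustrative assumptions, and `optimizer`, `criterion`, and `train_loader` are assumed to be defined as in the earlier sketches.

```python
import torch.nn as nn
from torchvision import models

# torchvision's ResNet accepts a custom norm_layer; replace BatchNorm with
# GroupNorm so normalization does not depend on the mini-batch dimension.
model = models.resnet18(
    weights=None,
    norm_layer=lambda channels: nn.GroupNorm(32, channels),
)

ACCUM_STEPS = 8  # virtual mini-batch size (placeholder)

optimizer.zero_grad()
for step, (slide, label) in enumerate(train_loader):  # batch size of 1
    # Scale so the accumulated gradient averages over the virtual batch.
    loss = criterion(model(slide), label) / ACCUM_STEPS
    loss.backward()  # gradients accumulate in the .grad buffers
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()       # one weight update per virtual mini-batch
        optimizer.zero_grad()
```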

Datasets
We perform our experiments on two independent datasets collected at the Dartmouth-Hitchcock Medical Center, a tertiary academic medical center in New Hampshire, USA. The slides were hematoxylin and eosin-stained, formalin-fixed, paraffin-embedded, and digitized by a Leica Aperio scanner at either 20x or 40x magnification. To generate the required low-resolution WSIs, we used the Lanczos filter to create several downsampled versions of each image [46]. Every downsampled version was obtained directly from the original image to avoid any potential artifacts caused by a composition of downsamplings. We use the notation nx to denote magnification relative to the original. For example, an originally 20x slide downsampled four times in both the height and width dimensions would have n = 20/4 = 5 and be denoted 5x.
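As an illustration, the downsampling step could be implemented as follows with Pillow; a production pipeline would read WSIs through OpenSlide rather than opening them directly, and the file name below is hypothetical.

```python
from PIL import Image

def downsample(path: str, factor: int) -> Image.Image:
    """Create a low-resolution version of an image with the Lanczos filter.

    Each downsampled version is generated directly from the original image,
    never from a previously downsampled copy, to avoid compounding artifacts.
    """
    img = Image.open(path)
    new_size = (img.width // factor, img.height // factor)
    return img.resize(new_size, resample=Image.LANCZOS)

# An originally 20x image downsampled by a factor of 4 becomes a 5x image.
low_res = downsample("slide_20x.png", factor=4)
```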
This study and the use of human participant data in this project were approved by the Dartmouth-Hitchcock Health Institutional Review Board (IRB) with a waiver of informed consent. The research reported in this article was conducted in accordance with this approved Dartmouth-Hitchcock Health IRB protocol and the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects.

Celiac Disease Dataset
Celiac disease (CD) is a disorder estimated to affect 1% of the population worldwide [47,48]. Diagnosing and treating CD is clinically significant, as undiagnosed CD is associated with a higher risk of death [47,48]. A duodenal biopsy is considered the gold standard for CD diagnosis [49]. A pathologist examines these biopsies under a microscope to identify the markers associated with CD.
Our CD dataset comprises 1,364 patients distributed across the Normal, Non-specific Duodenitis, and Celiac Sprue classes. Each patient has one or more WSIs consisting of one or more tissues. A gastrointestinal pathologist diagnosed every slide as Normal, Non-specific Duodenitis, or Celiac Sprue.
The CD slides contained significant amounts of background. Hence, as a preprocessing step, we used the tissueloc [50] code repository to find approximate bounding boxes around the relevant regions of the slide using a combination of image morphological operations. This process reduces the computational burden while removing the clinically unimportant background regions.
We partitioned the dataset into a labeled set and an unlabeled auxiliary set.

Table 1. Distribution of the CD tissues for all sets used in the model. The class counts for the self-supervised sets ADv1 and ADv2 are provided only as a reference; this class information was not used in the self-supervision process.
For self-supervision, we randomly sampled from the CD slides not used in any of the training, validation, or testing sets. To explore the effects of unlabeled dataset size, we created two auxiliary datasets, ADv1 and ADv2, such that ADv1 ⊂ ADv2. ADv1 and ADv2 comprise 300 and 1,004 patients, respectively. Experimenting with two unlabeled datasets allowed us to demonstrate the efficacy of our method as the dataset size scales. We also sampled an additional 20 patients from each class to use as a proxy development set for hyperparameter tuning. This 60-patient development set was intended to validate the self-supervision process and remained independent of the test set used for evaluation. The distribution of these datasets for self-supervised learning is shown in Table 1.

Lung Adenocarcinoma Dataset
Lung cancer is the leading cause of cancer death in the United States [51]. Of all histologic subtypes, lung adenocarcinoma (LUAD) is the most common pattern [52], and its rates continue to increase among certain subpopulations [53]. The World Health Organization identifies five predominant subtypes of lung adenocarcinoma, in order of increasing severity: lepidic, acinar, papillary, solid, and micropapillary [54]. The classification of lung adenocarcinoma subtypes on histology slides has proven especially challenging, as over 80% of cases contain mixtures of multiple patterns [55,56].

Table 2. Distribution of the LUAD tissues for all sets used in the model. The counts correspond to the annotations provided by the pathologists.

Our LUAD dataset was randomly split into two sets, with 235 slides for training and 34 slides for testing. Both the training and testing sets were annotated by pathologists for predominant subtypes, and each slide in the training and testing sets consists of at least one tissue. Some training and testing slides contained tissues annotated as benign, but we removed them, as they are trivial to identify and are not related to cancer subtypes. Given the considerably smaller size of this dataset compared to the CD dataset, we did not perform any experiments on varying unlabeled dataset sizes and used the entire training set for all runs.
No hyperparameter tuning was performed for this model; we used the same configuration as the CD equivalent. The distribution of the LUAD data is presented in Table 2 for both the training and testing sets.

Implementation Details
We evaluated all models on the labeled test set corresponding to each training dataset. No data augmentation was applied to the test sets beyond standardizing the color channels by the mean and standard deviation of the respective labeled training sets. To evaluate classification performance, we used accuracy, F1-score, precision, and recall. These metrics were computed in a one-vs.-rest fashion for each class, and the mean value for each metric was computed by macro-averaging over all classes. The 95% confidence intervals (CIs) were produced by bootstrapping on the test set for 10,000 iterations. We calculate each model's computational cost by counting the number of billions of floating-point operations (GFLOPS) for a forward pass of that model. Using the number of GFLOPS allows us to evaluate the performance gains while also considering the computational cost.
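For reference, the bootstrapped confidence intervals can be computed along the following lines; this is a minimal sketch, and the accuracy metric here stands in for any of the reported metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, n_iter=10_000, alpha=0.05, seed=0):
    """95% CI for a test-set metric via bootstrapping over the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_iter):
        # Resample the test set with replacement and recompute the metric.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        scores.append(accuracy_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```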
Teacher Model. We trained the teacher model on high-resolution input images at 10x magnification. The He initialization scheme [57] was used to initialize the weights. We utilized the Adam optimization algorithm [44] for 100 epochs of training with a learning rate of 0.001. The Adam optimizer minimized the cross-entropy loss with respect to the ground-truth slide labels.
Baseline. All baseline models are trained at a specified magnification from randomly initialized weights using the He initialization scheme [57]. We use the same ResNet architecture as the teacher model for these baselines.

KD. Our knowledge distillation approach consists of a teacher model, described above, and a student model with the same ResNet architecture. In contrast to the standard ResNet architecture, we use both the final convolutional and fully connected layer outputs as our unlabeled hints and feature recognition knowledge, respectively. We use the labeled training and validation sets for the distillation and ignore the labels in the self-supervised part of our approach. As explained in Section 3.4, we do not apply fine-tuning in these experiments, as it contributes to overfitting according to our validation set.

KD (AD). The knowledge distillation approach using the auxiliary datasets in this paper is similar to stock distillation [30]. The main difference is that we utilized unlabeled auxiliary datasets for self-supervised learning instead of a labeled dataset.

RESULTS
In Table 3, we present the results of the teacher model trained from scratch at 10x magnification for the CD and LUAD test sets. We present the results of our proposed approach for all tested magnifications in Table 4. The performance and computational costs of our models are shown in Figure 3.

DISCUSSION
As presented in Table 4, our KD method outperforms the baseline metrics in all trials for celiac disease. The lung adenocarcinoma results show that our approach improves performance for 0.625x (16 μm/pixel), 1.25x (8 μm/pixel), and 2.5x (4 μm/pixel) input images, and is within a small margin of the baseline performance for 5x (2 μm/pixel) input images. This outcome is consistent with our 5x results on the CD dataset without the AD self-supervision phase.
While adding more data increased CD classification accuracy at 0.625x magnification by over 4%, this performance benefit narrowed as the magnification increased. This trend can be seen in Figure 3, where the test set accuracy curves approach each other as the computational cost grows. Most importantly, our method outperforms the baseline at 10x magnification for the distillation approaches on the auxiliary dataset. This performance gain comes with at least a 4x reduction in computational cost.
Using our model to maintain accurate classification performance while minimizing computational cost could facilitate scanning histology slides at a much lower resolution.
According to the Digital Pathology Association, scanners cost up to $300,000 depending on the configuration [58]. Reducing the scanning resolution would have a two-fold benefit, lessening both the scan time and the scanner cost. Beyond the scanner, storing and analyzing lower-resolution WSIs would be less burdensome on the computational infrastructure. Instead of investing in complex data solutions, pathology laboratories could migrate to cloud-based services that could manage and analyze smaller datasets using standard network bandwidth.
There are still areas of improvement for our work, namely evaluating our model on additional datasets from different institutions. While our method was validated on two datasets, both were collected at our institution and may contain inherent biases in staining and slide preparation. With more datasets, we could investigate the scaling effects of self-supervised learning beyond the size of our existing dataset. The impact of scaling could prove especially useful for smaller healthcare facilities that may not have the capability to collect and label data at the scale required for training a typical deep learning model for histology image analysis. In addition to larger datasets, it is important to explore the efficacy of this methodology on slides from different medical centers and various diseases to evaluate the generalizability of our proposed approach.
Although the trained models can be used on WSIs with lower resolutions, our method still requires high-resolution WSIs during training. While reducing the computational requirements of the inference stage is always beneficial, there is no reduction in the cost of training the teacher model or the self-supervised and knowledge distillation models. This weakness is an active area of investigation in our future work. One possibility is using transfer learning to adapt a pre-trained model to an alternative high-resolution histology dataset. A method that utilizes transfer learning in this fashion would remove the burden of continuously retraining teacher models for each new dataset. Lastly, we plan to combine our approach with a neural network visualization method. As pathologists rely on visual markers to diagnose slides, it is critical to provide humanly interpretable visualizations and insights to avoid a black-box approach in histology image analysis.

CONCLUSION
In this work, we demonstrated that knowledge distillation could be applied to histology image analysis and further improved by self-supervision.We showed that our method both improves performance at significantly lower computational cost and scales with dataset size.
The empirical evidence presented shows that it is possible to transfer information learned across magnifications and still produce clinically meaningful results. Our approach allows for scanning WSIs at a significantly lower resolution with little to no degradation in classification accuracy. Our method removes a major computational bottleneck in the use of deep learning for histology image analysis and opens new opportunities for this technology to be integrated into the pathology workflow.

Figure 1. A sample WSI intended to show the high resolution and large size of histology images.

Table 3. Results and the corresponding 95% CIs for the teacher model trained at 10x magnification, as percentages. These results were obtained on the respective test sets, detailed in Sections 4.2 and 4.3.

Table 4. Results for baseline and KD approaches as percentages with corresponding 95% CIs. Baseline models were trained from scratch until convergence at the corresponding magnification. The KD model without an auxiliary dataset was trained using the labeled dataset. Boldface text indicates the best-performing model for each magnification and metric.

Figure 3. Test set accuracy plotted against the computational cost. The computational cost is measured in GFLOPS and corresponds to the approximate number of floating-point operations per forward pass. The magnification of the model input data is displayed under the computational cost values.