HistoPerm: A permutation-based view generation approach for improving histopathologic feature representation learning

Deep learning has been effective for histology image analysis in digital pathology. However, many current deep learning approaches require large, strongly or weakly labeled images and regions of interest, which can be time-consuming and resource-intensive to obtain. To address this challenge, we present HistoPerm, a view generation method for joint embedding architectures that enhances representation learning for histology images. HistoPerm permutes augmented views of patches extracted from whole-slide histology images to improve classification performance. We evaluated the effectiveness of HistoPerm on two histology image datasets, for Celiac disease and Renal Cell Carcinoma, using three widely used joint embedding architecture-based representation learning methods: BYOL, SimCLR, and VICReg. Our results show that HistoPerm consistently improves patch- and slide-level classification performance in terms of accuracy, F1-score, and AUC. Specifically, for patch-level classification accuracy on the Celiac disease dataset, HistoPerm boosts BYOL and VICReg by 8% and SimCLR by 3%. On the Renal Cell Carcinoma dataset, patch-level classification accuracy increases by 2% for BYOL and VICReg and by 1% for SimCLR. In addition, on the Celiac disease dataset, models with HistoPerm outperform the fully supervised baseline model by 6%, 5%, and 2% for BYOL, SimCLR, and VICReg, respectively. For the Renal Cell Carcinoma dataset, HistoPerm narrows the classification accuracy gap to the fully supervised baseline by up to 10%. These findings suggest that HistoPerm can be a valuable tool for improving representation learning of histopathologic features when access to labeled data is limited, and it can lead to whole-slide classification results that are comparable or superior to fully supervised methods.


INTRODUCTION
Digital pathology involves the visualization and analysis of whole-slide images (WSIs) to assist pathologists in the diagnosis and prognosis of various diseases. These WSIs are digitized at high resolutions and can be analyzed manually, with computer vision models, or with a combination of the two. However, the large size of these images, up to 150,000×150,000 pixels, can present challenges for typical computer vision-based image analysis tools.
Deep learning methods have been widely applied to the analysis of WSIs.2-8 In terms of label and annotation requirements, these methods differ from those used on natural images in several ways. Firstly, the labeling process for WSIs requires highly trained experts, while natural images often require minimal or no prerequisites for labeling. Secondly, labels are typically provided at the slide level rather than at the patch level. Finally, the class label may be determined by only a small portion of the WSI. These characteristics present major challenges for the application of standard computer vision methods in digital pathology.
Among these three annotation bottlenecks, the last two are most unique to digital pathology. Due to the large size of the WSIs, it is infeasible for pathologists to label all regions of interest on a slide. Instead, the label is usually provided at the slide level, which also applies to class-negative regions of a slide. Moreover, an object in the average image from the ImageNet natural image dataset occupies 25% of the area,9 while a typical region-of-interest annotation in a WSI can occupy as little as 5% of the image.4,10 The combination of weak labeling and low object scale poses a unique challenge and makes applying standard computer vision methods a suboptimal solution in digital pathology.
To make standard deep learning models12,13 more feasible and effective for WSIs, it is common to preprocess the images into smaller patches, typically 224×224 pixels. However, this can lead to further issues if the weakly labeled nature of the slides is not considered. A common strategy is to assign the slide-level label to all extracted patches,16-21 but this approach can have suboptimal performance if the signal-to-noise ratio is low among the extracted patches. More advanced methods involving attention22,23 or multiple-instance learning24-32 have been developed to use the weak labels, but these still require large, labeled datasets.
In recent years, self-supervised representation learning techniques have gained significant traction for their ability to solve difficult problems in computer vision without relying on labor-intensive, manually labeled datasets. These methods utilize a pretext task to learn a latent representation of an unlabeled dataset, which is often readily available in the medical domain.35-37 These approaches aim to exploit the unique characteristics of these images, such as rotation invariance or local-to-global consistency. In addition, contrastive learning-based methods have gained popularity in histology feature representation,38-45 with techniques such as Contrastive Predictive Coding38,42 and DSMIL46 proving effective for incorporating multiscale information into contrastive models. However, all of these approaches still require all input data to be labeled, whether weakly or strongly.
To address this shortcoming, we propose a model-agnostic view generation method called HistoPerm for representation learning in histology image classification. Unlike prior methods, HistoPerm is flexible and incorporates both labeled and unlabeled data into the learning process. In contrast to prior view generation approaches for histology images, which produce views at random from the same instance, we perform a permutation on a portion of the mini-batch such that the view comes from the same class but a different instance of that class. By taking advantage of the large pool of both class-positive and class-negative patches, our approach can derive stronger representations of histologic features. Our experiments show that adding HistoPerm to an existing state-of-the-art representation learning image analysis pipeline improves histology image classification performance.

Representation Learning with Joint Embedding Architectures
Self-supervised representation learning methods can be broadly grouped into contrastive, noncontrastive, and information preservation paradigms.58-60 These paradigms all employ joint embedding architectures, where two models are trained to produce similar outputs when given augmented views of the same source image.
Contrastive learning relies on positive and negative samples to guide the network through learning unique identifiers for each class in the downstream task. However, these approaches often require large mini-batch sizes and significant computational resources, making them impractical for many studies and applications. Smaller mini-batch sizes can be used by implementing "tricks" such as momentum encoders,49,50 but in general these approaches are still resource intensive and require massive computational power that is inaccessible to most researchers.
Noncontrastive approaches, which utilize only positive instances, require fewer resources but may result in a slight decrease in downstream classification performance.54-56,63 Information preservation methods, such as Barlow Twins,64,65 Whitening-MSE,65 and VICReg,66,67 aim to decorrelate the variables of the learned representations and explicitly prevent collapse. These methods are effective at avoiding trivial embeddings and have shown promise in natural image tasks.

Rationale for Our Work
Prior work has shown that representation learning methods rely on building representations that are invariant to irrelevant variations in the input.68 For histopathology, many patches share similar histologic features and visual attributes, independent of the class. Given this, and unlike in natural images, many of the patches sampled from WSIs are unsuitable as negative samples for learning. Hence, we utilize the large pool of both class-positive and class-negative patches to build stronger representations of histologic features by allowing permutation at the mini-batch scale. While we may encounter instances where class-positive and class-negative patches are paired and these instances are not morphologically similar, such hard cases should not be common enough in a typical WSI classification task to adversely impact feature learning, and they may even be beneficial to learning according to previous research.69 Notably, HistoPerm is model-agnostic and can be integrated into any joint embedding architecture-based representation learning framework operating on two input views to improve histologic feature representation learning.

METHOD
In this section, we introduce our proposed method, HistoPerm, a permutation-based view generation technique that improves the capture of histologic features in representation learning frameworks. A high-level overview of our approach is shown in Figure 1.

WSI to Patch Conversion
Let $D$ be a dataset composed of WSIs, partitioned into disjoint labeled and unlabeled subsets $D_\ell$ and $D_u$. For each WSI $s_i \in D_\ell$, we have an associated ground-truth label $y_i$ corresponding to the pathologist-provided slide-level classification label. Given a slide $s_i$, we produce a set of patches $P_i$ and assign them the slide-level label $y_i$. In this weakly labeled setting, we can have anywhere between 1 and $|P_i|$ class-positive patches per set $P_i$. As discussed earlier, in histopathologic classification we can assume that the majority of patches in $P_i$ will be negative relative to the slide-level class $y_i$. After this step, we have labeled and unlabeled sets of patches $P_\ell$ and $P_u$ produced from $D_\ell$ and $D_u$, respectively. In the next section, we explain how we generate the input views for our model given $P_\ell$ and $P_u$.

View Generation
View Augmentation. Given a weakly labeled patch dataset $P_\ell$ and an unlabeled patch dataset $P_u$, we sample mini-batches $B_\ell \subset P_\ell$ and $B_u \subset P_u$ and apply random augmentations to produce two views of each patch. Analogously to the unlabeled mini-batch, we produce augmented views of the labeled mini-batch $B_\ell$, denoted $V_{\ell,1} = \{v_{\ell,1}^{(1)}, v_{\ell,2}^{(1)}, \dots, v_{\ell,|B_\ell|}^{(1)}\}$ and $V_{\ell,2} = \{v_{\ell,1}^{(2)}, v_{\ell,2}^{(2)}, \dots, v_{\ell,|B_\ell|}^{(2)}\}$. We then form $V'_{\ell,2}$, a permutation of $V_{\ell,2}$ in which each view is paired with a view whose original image differs but whose ground-truth class is the same. Through this permutation, we augment the set of possible view pairings, enabling the model to learn richer representations. Note that permuting $V_{\ell,2}$ was an arbitrary choice; due to symmetry, either view could be permuted without loss of generality.
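To make the permutation step concrete, the following minimal PyTorch sketch (our own illustration, not the released implementation) shuffles the second set of labeled views within each class so that paired views share a ground-truth class but generally come from different source patches; the tensor shapes and the function name are assumptions.

```python
import torch

def class_preserving_permutation(views_2: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Permute the second set of labeled views so that each position is paired
    with a view of the same class but, in general, a different source patch.

    views_2: tensor of shape (B, C, H, W) holding the second augmented views.
    labels:  tensor of shape (B,) holding the slide-level class of each patch.
    """
    permuted = torch.arange(labels.size(0))
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        # Shuffle indices within the class so views are swapped among same-class patches.
        permuted[idx] = idx[torch.randperm(idx.numel())]
    return views_2[permuted]
```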

Datasets
We applied and evaluated our approach on two datasets from an academic medical center.
Our datasets are representative of Celiac Disease (CD) and Renal Cell Carcinoma (RCC).
Each dataset consists of hematoxylin-eosin-stained, formalin-fixed, paraffin-embedded slides scanned at either 20× (0.5 µm/pixel) or 40× (0.25 µm/pixel) magnification. For run-time purposes, we downsampled the slides to 5× (2 µm/pixel) magnification using the Lanczos filter.70 We divided the slides into overlapping 224×224-pixel patches for use with the PyTorch deep learning framework.71 A different overlap factor was used for each class in the training set to produce approximately 80,000 patches per class. For the development and testing sets, we used a constant overlap of 112 pixels between patches.
We provide dataset statistics in the supplementary material. Although these datasets are labeled in their original form, we ignore the labels for the portion of the dataset used in the unlabeled section of the architecture in each epoch, according to the formulation provided in Section 2.1, to simulate the intended use of our approach. This means that $P_\ell$ and $P_u$ vary and are built dynamically in each epoch.
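For illustration, the overlapping tiling described above can be sketched as follows; the function name and array-based slide representation are assumptions, and the exact tiling code used in this work may differ.

```python
import numpy as np

def extract_patches(slide: np.ndarray, patch_size: int = 224, overlap: int = 112):
    """Tile a downsampled WSI into overlapping square patches.

    slide:      (H, W, 3) array of the slide at the working magnification.
    patch_size: side length of each square patch in pixels.
    overlap:    number of pixels shared by neighboring patches; the stride is
                patch_size - overlap (112 pixels for the development/test setting above).
    """
    stride = patch_size - overlap
    h, w = slide.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(slide[y:y + patch_size, x:x + patch_size])
    return patches
```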

Implementation Details
Image Augmentation. We used a typical set of image augmentations in our experiments, following common joint embedding architecture-based representation learning methods. A crop is randomly selected from each image and resized to 224×224 pixels with bilinear interpolation. Next, we randomly flip the patches over both the horizontal and vertical axes, as histology patches are rotation invariant. Finally, we perform random Gaussian blurring on the augmented images. Empirical justification, as well as exact implementation details for these transformations, is provided in the supplementary material.
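As a rough illustration, the named operations map onto standard torchvision transforms as in the sketch below; the crop scale, flip probabilities, and blur parameters are assumptions rather than the exact settings used here (see the supplementary material for those).

```python
from torchvision import transforms

# Minimal augmentation pipeline mirroring the operations described above.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])
```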
Pretraining. In the pretraining phase, we used the LARS optimizer74 for 50 epochs of training the networks, with a 5-epoch warm-up and cosine learning rate decay75 thereafter. The initial learning rate was 0.45, with a mini-batch size of 256 and weight decay of $10^{-6}$. We chose α = 0.75 (i.e., 64 unlabeled and 192 labeled examples) as the optimal balance between the unlabeled and labeled portions of the mini-batch; details of how α = 0.75 was selected are provided in the supplementary material. For experiments without HistoPerm, all 256 examples in the mini-batch are considered unlabeled.
Linear Evaluation. Linear training uses the SGD optimizer with Nesterov momentum76 for 80 epochs of training a linear layer on top of the frozen encoders, with cosine learning rate decay.75 We used an initial learning rate of 0.2 and a mini-batch size of 256. Unlike the pretraining step, we only performed affine transformations on the input data. In this phase, we utilized all data in the respective training set.
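The warm-up plus cosine decay schedule mentioned above can be expressed as in the following sketch; the function signature is our own and the exact scheduler implementation may differ.

```python
import math

def learning_rate(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warm-up followed by cosine decay of the learning rate."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```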

Patch-level Results
First, we investigated the effect of HistoPerm on model patch-level classification performance.

Slide-level Results
In Table 2, we present the effects of HistoPerm on slide-level classification performance. For slide-level classification, we utilized average pooling to aggregate the patch-level predictions.
This slide-level aggregation approach is straightforward and keeps the evaluation focused on the impact of HistoPerm. Our results showed that incorporating HistoPerm improved performance in all cases on the CD dataset compared to the fully supervised baseline. On the RCC dataset, the models with HistoPerm showed improved performance for BYOL and VICReg, although all models fell short of the fully supervised baseline.
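For reference, average-pooling aggregation of patch predictions can be written in a few lines; this is an illustrative sketch with assumed tensor shapes, not the evaluation code used for the reported results.

```python
import torch

def slide_prediction(patch_logits: torch.Tensor) -> int:
    """Aggregate patch-level predictions into a slide-level label by average pooling.

    patch_logits: (num_patches, num_classes) tensor of per-patch class scores for one slide.
    """
    slide_scores = patch_logits.softmax(dim=1).mean(dim=0)  # average class probabilities
    return int(slide_scores.argmax().item())
```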

DISCUSSION
In this study, we presented HistoPerm, an approach for generating views of histology images to improve representation learning. HistoPerm leverages the weakly labeled nature of histology images to expand the available pool of views. By expanding the available view pool, we improved the learned representation quality and observed enhanced downstream performance. Our results suggest that HistoPerm is a promising approach for medical image analysis in digital pathology when access to labeled data is limited.
We incorporated HistoPerm into BYOL, SimCLR, and VICReg and showed improved classification performance on two histology datasets. At the patch level, adding HistoPerm to BYOL, SimCLR, and VICReg improved accuracy by 8%, 3%, and 8%, respectively, on the CD dataset. Similarly, on the RCC dataset, models with HistoPerm improved accuracy by 2%, 1%, and 2% for BYOL, SimCLR, and VICReg, respectively. For CD, models with HistoPerm increased slide-level accuracy by 18%, 2%, and 22% for BYOL, SimCLR, and VICReg, respectively. On the RCC data, HistoPerm increased slide-level accuracy by 4% for BYOL and 1% for VICReg, but decreased performance by 2% for SimCLR. Critically, HistoPerm was able to outperform fully supervised models at the slide level without patch-level annotations. These findings have important implications for using unlabeled histology images in clinical settings, as image annotation can be a labor-intensive and highly skilled process. Reducing the need for labeled data with HistoPerm would increase the utility of existing representation learning approaches.
We demonstrated that the addition of HistoPerm can lead to a notable performance improvement over the fully supervised baseline on the CD dataset. However, this trend was not observed on the RCC dataset, where all models performed worse than the fully supervised baseline. For whole-slide classification, we used an average-pooling approach to aggregate the patch-level predictions. We expect that with more sophisticated aggregation approaches, such as multi-head attention or self-attention, our slide-level classification results will surpass those presented here, including the fully supervised ones.
Of note, the results on the RCC dataset did not show as much improvement as those on the CD dataset. It is possible that this difference is due to the higher morphological complexity and variability of the RCC samples, as indicated by the original study on this dataset.14 Despite the smaller improvements on the RCC dataset, the use of HistoPerm on both datasets showed clear benefits over standard representation learning approaches. In future work, we plan to investigate the relationship between histologic pattern complexity and learned representation quality to enhance the model's ability to generate more representative features. Furthermore, we intend to expand our work to explore the biological underpinnings in depth.
While HistoPerm requires less labeled data than fully supervised approaches, it still requires some labeled data. In future work, we aim to further reduce the labeled data requirements of HistoPerm to enable its use in label-constrained settings. We also plan to examine the impact of incorporating unlabeled data from diverse sources to explore the generalizability of HistoPerm across histology datasets with varied preparation and scanning procedures. In addition, we intend to utilize datasets from multiple disease types and evaluate the effectiveness of the learned histologic representations for transfer learning. This is particularly relevant because data for certain disease types may be scarce, and pretrained representations could provide a solution for building effective image analysis models.
Finally, we will explore datasets for tasks like survival prediction in future work.

CONCLUSION
The presented study showed that the proposed permutation-based view generation method, HistoPerm, offered improved histologic feature representations and resulted in enhanced classification accuracy compared to current representation learning techniques. In some cases, HistoPerm even outperformed the fully supervised model. This approach allows the incorporation of unlabeled histology data alongside labeled data for representation learning, resulting in overall higher classification performance. Additionally, the use of HistoPerm may reduce the annotation workload for pathologists, making it a viable option for various digital pathology applications.
4. Predictor: a multilayer perceptron $q$ mapping $z$ to $p = q(z) \in \mathbb{R}^{d_2}$. We instantiate $q$ as a two-layer multilayer perceptron with a single hidden layer of size 4096 and output size $d_2 = 256$. Next, we cover how these components are used in the standard BYOL, SimCLR, and VICReg methods, as well as when HistoPerm is added.
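As an illustration of item 4, one plausible PyTorch instantiation of the predictor $q$ is sketched below; the input dimension, normalization, and activation choices are assumptions beyond the stated hidden size of 4096 and output size of 256.

```python
import torch.nn as nn

# Two-layer predictor MLP with one hidden layer of size 4096 and 256-dimensional output.
predictor = nn.Sequential(
    nn.Linear(256, 4096),     # input dimension assumed to match the projector output
    nn.BatchNorm1d(4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 256),
)
```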

Appendix C.1 BYOL
The BYOL architecture is split into two components, called the online and target branches, parameterized by $\theta$ and $\xi$, respectively. The online branch is composed of three stages: encoder $f_\theta$, projector $g_\theta$, and predictor $q_\theta$. Likewise, the target branch has two stages: encoder $f_\xi$ and projector $g_\xi$. The encoders $f_\theta$ and $f_\xi$ map input views to a representation space, and the representations are then fed to the respective projectors $g_\theta$ and $g_\xi$. Note that the predictor $q_\theta$ is used only in the online network, as prior work has shown that this architectural asymmetry is necessary to avoid collapsing to the trivial solution.3 At the end of training, we keep only the online encoder $f_\theta$ and use the pretrained weights as initialization for the fully supervised downstream tasks.
Learning progresses by computing the mean squared error between the online and target branches. For both the unlabeled and labeled views, the loss is the normalized mean squared error between the online predictions and the stop-gradient target projections, where $\mathrm{sg}(\cdot)$ is the stop-gradient operation, so the target branch weights are not updated through optimization. As shown in the SimSiam paper,3 the stop-gradient is necessary to avoid collapse. We symmetrize the loss by passing $(V_{u,2}, V'_{\ell,2})$ and $(V_{u,1}, V_{\ell,1})$ to the online and target branches, respectively. In the code implementation of the loss function, we combine and process both mini-batches simultaneously to avoid inadvertently leaking information about the source dataset through batch normalization. Note that in the case where $V_{\ell,1} = V'_{\ell,2} = \emptyset$ (i.e., $B_\ell = P_\ell = \emptyset$), our method reduces to the default BYOL formulation.
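For concreteness, a minimal sketch of the normalized mean squared error with stop-gradient (the standard BYOL objective applied to each view pair) is given below; the function and argument names are our own.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    """Normalized MSE between online predictions and target projections.

    The target projections are detached (stop-gradient), so only the online
    branch receives gradients; the full loss is symmetrized by swapping views.
    """
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # sg(.)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()
```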
Then, we perform the weight updates as follows:
$$\theta \leftarrow \mathrm{optimizer}(\theta, \nabla_\theta \mathcal{L}, \eta), \qquad \xi \leftarrow \tau \xi + (1 - \tau)\,\theta,$$
where $\tau \in [0, 1]$ is a momentum hyperparameter and $\eta$ is the learning rate for gradient descent.
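The exponential moving average update of the target parameters can be sketched as follows; the default value of tau is an assumption.

```python
import torch

@torch.no_grad()
def update_target_network(online_net, target_net, tau: float = 0.99):
    """EMA update of the target branch: xi <- tau * xi + (1 - tau) * theta."""
    for theta, xi in zip(online_net.parameters(), target_net.parameters()):
        xi.data.mul_(tau).add_((1 - tau) * theta.data)
```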

Appendix C.2 SimCLR
The SimCLR architecture consists of an encoder $f_\phi$ and projector $g_\phi$, parameterized by weights $\phi$. The encoder $f_\phi$ maps pairs of input views to representations that are then fed to the projector $g_\phi$. Finally, $\ell_2$-normalization is applied to the outputs of the projector. At the end of training, we keep only the encoder $f_\phi$ and use the pretrained weights as initialization for the fully supervised downstream tasks.
Learning progresses using the NT-Xent loss as defined in the original SimCLR formulation.4 For both the unlabeled and labeled views, the loss for a positive pair $(i, j)$ is
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$
where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity and $\tau$ is the temperature hyperparameter. Note that in the case where $V_{\ell,1} = V'_{\ell,2} = \emptyset$ (i.e., $B_\ell = P_\ell = \emptyset$), our method reduces to the default SimCLR formulation.
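A compact sketch of the NT-Xent loss for a batch of paired projections is shown below; it follows the standard SimCLR formulation and is not the implementation used in this work.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss over paired projections z1, z2 of shape (N, d)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2N, d), L2-normalized
    sim = z @ z.t() / temperature                                 # cosine similarities / tau
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                    # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                          # positives sit N positions apart
```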
Then, we perform the weight updates as follows:
$$\phi \leftarrow \mathrm{optimizer}(\phi, \nabla_\phi \mathcal{L}, \eta),$$
where $\eta$ is the learning rate for gradient descent.

Appendix C.3 VICReg
The VICReg architecture consists of an encoder $f_\psi$ and expander $h_\psi$, parameterized by weights $\psi$. The encoder $f_\psi$ maps pairs of input views to representations that are then fed to the expander $h_\psi$. At the end of training, we keep only the encoder $f_\psi$ and use the pretrained weights as initialization for the fully supervised downstream tasks.
Learning progresses using variance, invariance, and covariance loss terms, following the original VICReg formulation.5 Let $Z = [z_1, \dots, z_n]$ and $Z' = [z'_1, \dots, z'_n]$ denote the two batches of expanded representations with dimension $d$. The variance term is
$$v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\!\left(0, \gamma - S(z^{j}, \epsilon)\right), \qquad S(x, \epsilon) = \sqrt{\mathrm{Var}(x) + \epsilon},$$
where $z^{j}$ denotes the vector of the $j$-th dimension across the batch. The invariance term is
$$s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2 .$$
Lastly, the covariance term is
$$c(Z) = \frac{1}{d} \sum_{i \neq j} \left[ C(Z) \right]_{i,j}^{2}, \qquad C(Z) = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{\top}.$$
Given these loss terms, the overall loss function is
$$\mathcal{L} = \lambda\, s(Z, Z') + \mu \left[ v(Z) + v(Z') \right] + \nu \left[ c(Z) + c(Z') \right],$$
where $\lambda$, $\mu$, and $\nu$ are hyperparameters that weight the effect of each loss term. Note that in the case where $V_{\ell,1} = V'_{\ell,2} = \emptyset$ (i.e., $B_\ell = P_\ell = \emptyset$), our method reduces to the default VICReg formulation.
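A minimal sketch of the three VICReg terms, following the original formulation cited above, is given below; the default weights and epsilon are commonly used values and are assumptions with respect to this work.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Variance + invariance + covariance loss for embeddings z1, z2 of shape (N, d)."""
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                                   # invariance term s(Z, Z')

    def variance(z):                                           # hinge on per-dimension std
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(gamma - std).mean()

    def covariance(z):                                         # off-diagonal covariance penalty
        zc = z - z.mean(dim=0)
        cov = (zc.t() @ zc) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    return lam * inv + mu * (variance(z1) + variance(z2)) + nu * (covariance(z1) + covariance(z2))
```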

Appendix E.2. Renal Cell Carcinoma Dataset
In Supplementary Tables 4 and 5, we present the dataset distribution for the RCC dataset at the patch and slide levels, respectively.


Figure 1. Overview of our HistoPerm method. The joint embedding networks are fed randomly augmented views $V_{*,1}$ and $V_{*,2}$. For the labeled mini-batch of patches, $B_\ell$, the solid or dashed patch outlines represent the labels. The numbers for the labeled views $V_{\ell,2}$ and $V'_{\ell,2}$ show the patch order before and after the permutation operation, respectively. The unlabeled mini-batch of views, $B_u$, is fed to the joint embedding networks without permutation, as in the standard architectures.

Table 1 shows that the use of HistoPerm consistently resulted in improved accuracy compared to the baseline approaches on both datasets. Specifically, BYOL with HistoPerm outperformed standard BYOL by 8% and 2% in classification accuracy on the CD and RCC datasets, respectively. SimCLR with HistoPerm also improved accuracy, by 3% and 1% on the CD and RCC datasets, respectively. Additionally, incorporating HistoPerm into VICReg led to an increase in accuracy of 8% and 2% on the CD and RCC datasets, respectively.

Table 2. Slide-level linear performance results on the respective test sets. All reported values are the mean of three runs, with standard deviations in parentheses. The top results for each architecture are presented in boldface. The supervised results are provided in the top row for comparison.

Supplementary Table 2. The Celiac disease patch-level dataset splits for the Normal, Nonspecific duodenitis, and Celiac sprue classes. The assigned label per patch is the corresponding slide-level label.

Supplementary Table 3. The Celiac disease slide-level dataset splits for the Normal, Nonspecific duodenitis, and Celiac sprue classes.

Supplementary Table 4. The Renal Cell Carcinoma patch-level dataset splits for the Benign, Oncocytoma, Chromophobe, Clear cell, and Papillary classes. The assigned label per patch is the corresponding slide-level label.

Supplementary Table 5. The Renal Cell Carcinoma slide-level dataset splits for the Benign, Oncocytoma, Chromophobe, Clear cell, and Papillary classes.