Self supervised contrastive learning for digital histopathology

Unsupervised learning has been a long-standing goal of machine learning and is especially important for medical image analysis, where learning can compensate for the scarcity of labeled datasets. A promising subclass of unsupervised learning is self-supervised learning, which aims to learn salient features using the raw input as the learning signal. In this paper, we apply SimCLR, a contrastive self-supervised learning method that achieved state-of-the-art results on natural-scene images, to digital histopathology by collecting and pretraining on 57 histopathology datasets without any labels. We find that combining multiple multi-organ datasets with different types of staining and resolution properties improves the quality of the learned features. Furthermore, we find that using more images for pretraining leads to better performance in multiple downstream tasks. Linear classifiers trained on top of the learned features show that networks pretrained on digital histopathology datasets perform better than ImageNet-pretrained networks, boosting task performance by more than 28% in F1 score on average. These findings may also be useful when applying newer contrastive techniques to histopathology data. Pretrained PyTorch models are made publicly available at https://github.com/ozanciga/self-supervised-histopathology.


Introduction
The number of labeled images in machine learning tasks is positively correlated with task performance; however, labeled data is scarce and expensive. The problem is exacerbated in medical image analysis tasks, where expert annotations are often required and crowdsourcing is not usually an option. In many cases, labeling must also be done on-site due to regulations regarding the dissemination of private patient data. In any medical image analysis task, the most laborious and time-consuming step tends to be labeling the data, and several approaches have been proposed to mitigate this data annotation bottleneck. Unsupervised and self-supervised methods that can utilize unlabeled data, and semi-supervised methods that use partially labeled data, have been found to improve task performance (Peikari et al., 2018; Komura and Ishikawa, 2018; Campanella et al., 2019a).
Until recently, most self-supervised techniques have relied on natural-scene image properties which are not applicable to histopathology images. However, recent contrastive learning approaches can be applied to digital pathology images (see Section 2). In this work, we use residual networks pretrained with self-supervised learning to learn generalizable features. We employ SimCLR (Chen et al., 2020a), a contrastive self-supervised technique that has comparable performance to the supervised ResNet 50 network on top-1 classification accuracy for the ILSVRC-2012 dataset. We pretrain residual networks with this method and use the pretrained networks for multiple downstream tasks on multiple multi-organ digital histopathology datasets. We explore different data sampling strategies to understand the amount and the type of data that leads to a representation that improves task performance over ImageNet pretraining or training from scratch. Specifically, we examine the impact of the number of images used for pretraining, resolution, staining, and tissue type on the learned representations and downstream task performance. Furthermore, we compare multiple image augmentation strategies to identify best practices when pretraining for digital histopathology. Our results indicate that pretraining with unlabeled histopathology images can improve task performance over ImageNet pretraining and mitigate labeled data requirements for various classification, regression, and segmentation tasks.

Related work
Unsupervised learning has been a long-standing goal of computer-aided diagnostic systems. Previously, sparse and variational autoencoders have been used for unsupervised nuclei detection and transfer learning (e.g., learning filter banks at multiple scales) (Xu et al., 2015; Chang et al., 2017; Hou et al., 2019), and generative adversarial networks have been used for tissue and cell-level classification, nuclei segmentation, and cell counting (Hu et al., 2018). Most of these unsupervised methods are not applicable to structures larger than cells or to tasks more complicated than distinguishing tissue and cell types, mostly due to the small image sizes they are capable of working with (e.g., 64 × 64 pixels) and the limited information that can be encoded by such methods. They may also require custom networks, and may only be applicable to applications with specific resolution and staining properties.
Self-supervised learning is a promising subclass of unsupervised learning, where the raw input data is used to generate the learning signal instead of a prior such as mutual information or cluster density. These methods can generally be applied to larger images and work with standard architectures such as residual networks (He et al., 2016). Context-based self-supervised methods rearrange the image input and task the network with restoring the spatial order. For instance, Noroozi and Favaro (2016) tile an image into nine square pieces and then shuffle the tiles, whereas Gidaris et al. (2018) rotate the input image by multiples of 90°. In either case, the aim is to recover the original input using a neural network, which is effectively pretrained for downstream tasks by predicting tile orderings or rotation angles, respectively. While contextual information can be exploited in natural-scene images to obtain meaningful representations, structures within histopathological images are elastic and may form infinitely many valid groupings. Therefore, these techniques are not directly applicable to the histology domain. For instance, predicting rotations is not a viable task for whole slide images, since cells and surrounding structures will have a valid arrangement in the rotated image as well. In histopathology, Gildenblat and Klaiman (2019) use spatial adjacency as a signal for similarity, and the pretraining task is to label image pairs as similar or dissimilar based on their spatial distance on a whole slide image. Tellez et al. (2019) also utilize contrastive learning by applying augmentations to image patches extracted from WSIs. Their aim is to learn salient features by distinguishing whether two augmented images are from the same source image.
More recently, contrastive approaches based on learning latent-space features by discriminating between unlabeled training samples have achieved state-of-the-art results in computer vision tasks. Such contrastive learning methods assume that under minor transformations, two views of the same image patch should have similar feature representations (Becker and Hinton, 1992). Importantly, since contrastive methods only rely on consistency at the instance level, they do not require any spatial regularity between or within instances and are applicable to digital histopathology images. The consistency assumption was exploited by Dosovitskiy et al. (2014) to obtain a parametric feature representation for each training instance. Later, Wu et al. (2018) extended Dosovitskiy et al. (2014) into a non-parametric feature representation using a dynamic memory bank to store latent features of data samples. The memory bank is used for selecting negative examples for each training instance, where any image that is not another view or augmentation of the original training instance is considered negative; it allows negative samples to be obtained without recomputing feature vectors. The use of simple image augmentations (e.g., resizing images, horizontal flips, color jittering, etc.) and memory banks has proved successful in learning representations by maximizing the mutual information between latent representations of positives (Bachman et al., 2019; Hénaff et al., 2019; He et al., 2020). Data augmentations have also been utilized by applying an appropriate contrastive loss function on the feature vectors of positive and negative image pairs (Tian et al., 2019; Misra and Maaten, 2020).
Recently, Chen et al. (2020a) proposed a contrastive learning approach that does not require a custom network or a memory bank, but instead relies on using a large number of minibatch instances (≥ 256) for obtaining negative samples per training instance. By doing so, they were able to improve the quality of learned representations by providing more negative samples per training instance over training epochs. Along with a few architectural improvements, this method outperforms aforementioned techniques by a large margin (+7%) and has comparable performance to the supervised ResNet 50 network when its features are used in training a linear classifier for the ImageNet ILSVRC-2012 dataset (Russakovsky et al., 2015). This method was later incrementally improved by Grill et al. (2020) through techniques such as exponential moving averaging of the model weights (Tarvainen and Valpola, 2017) or by knowledge distillation and using larger projection layers (Chen et al., 2020b). We choose not to explore these techniques in this paper as the added increase in performance (∼ 4% in ImageNet top-1 accuracy) cannot be justified by the increase in the number of trainable parameters (e.g., ∼ 800 million trainable parameters versus ∼ 24 million).

Contrastive self-supervised learning
We use the contrastive learning method proposed by Chen et al. (2020a), which relies on maximizing agreement between representations of two stochastically augmented views of the same image (see Fig. 1). Specifically, given an image i, a probabilistic augmentation function f_aug(·), a neural network (encoder) f_θ(·) with parameters θ, and an auxiliary projection layer p_θ̂(·) with parameters θ̂, the aim is to match the ℓ2-normalized feature representations of two augmentations of the same image, given by z_i = p_θ̂(f_θ(f_aug(i))). Simultaneously, other images in a batch are made dissimilar from the image i through a contrastive loss function called NT-Xent (the normalized temperature-scaled cross-entropy loss), defined for a positive pair (i, j) as

ℓ_{i,j} = −log [ exp(similarity(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(similarity(z_i, z_k)/τ) ],

where τ is the temperature parameter that helps weigh different examples to achieve hard negative mining, 1_{[k≠i]} is the indicator function which outputs 1 when k ≠ i and 0 otherwise, and the similarity function is a distance metric between two ℓ2-normalized vectors. We refer to the output of f_θ(·) as the pre-activation layer output, and the output of p_θ̂(·) as the MLP output in Fig. 1, or z_i. The auxiliary projection layer is a single-hidden-layer MLP used to project the pre-activation layer output into a lower-dimensional embedding space. Comparing z_i and z_j was found to be more effective in learning representations than directly comparing the pre-activation layer outputs. For our experiments, we use cosine similarity, defined as similarity(u, v) = uᵀv / (‖u‖‖v‖). The authors experimentally find that NT-Xent helps learn better representations than similar loss functions such as margin (Schroff et al., 2015) or logistic (Mikolov et al., 2013) losses.
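To make the loss concrete, a minimal PyTorch sketch of NT-Xent for a batch of N positive pairs is given below. This is our own reimplementation for illustration, not the authors' released code; the function name and default temperature are ours.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over a batch of N positive pairs (2N augmented views).

    z1, z2: (N, d) projection outputs for the two views of the same images.
    Rows are L2-normalized, so dot products equal cosine similarities.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = (z @ z.t()) / temperature                      # (2N, 2N)
    # An image is never contrasted against itself: mask the diagonal.
    mask = torch.eye(2 * n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))
    # The positive for row i is the other view of the same image: i+N or i-N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```

Each row is treated as a (2N−1)-way classification problem whose correct class is the other view of the same image, which is exactly the normalized temperature-scaled cross entropy described above.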
For each pretraining step with a batch size of 2N, each augmented image has one similar (or positive) and 2(N − 1) dissimilar (or negative) samples. By using samples in the same batch as negatives, we avoid the expensive explicit negative example mining present in many methods (Wu et al., 2018; He et al., 2020) and are able to scale up batch sizes, where each pretraining step simultaneously optimizes for ∼4N² feature vector pairs.
Defining image diversity for histopathology. The contrastive method exploits the variability in visual properties between image patches to learn salient features. Obtaining visually diverse patches is challenging for digital histopathology, especially when images are viewed at high resolution. Given multiple unlabeled datasets, we assume that selecting images with different staining, resolution, and tissue types will lead to a more diverse dataset than selecting image patches extracted from the same WSI or sampling images from the same dataset. In our early experiments, we found that pretraining on a dataset constructed with the former approach resulted in better validation performance than pretraining on a dataset where images are sampled from a single source. Furthermore, when images are inspected visually, the former approach exhibits more diversity (see Fig. 2).
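The multi-dataset sampling strategy can be sketched as follows. The dataset layout (a mapping from dataset name to patch paths) and the helper name are hypothetical, and drawing an even per-dataset quota is one simple way to realize the idea.

```python
import random

def build_pretraining_set(datasets, total_images, seed=0):
    """Draw roughly even quotas from every dataset instead of one source.

    `datasets` maps a dataset name to a list of patch paths (hypothetical
    layout). Mixing sources mixes stains, resolutions, and tissue types,
    which is the diversity criterion described in the text.
    """
    rng = random.Random(seed)
    quota = max(1, total_images // len(datasets))
    sample = []
    for name, patches in sorted(datasets.items()):
        k = min(quota, len(patches))  # small datasets contribute all they have
        sample.extend(rng.sample(patches, k))
    rng.shuffle(sample)
    return sample
```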
Figure 2: Defining image diversity in the context of digital histopathology. (b) Sampling a single dataset (Camelyon 17). Images evenly sampled from all 60 datasets visually look more diverse compared to sampling only from a single dataset containing multiple WSIs.

Pretraining datasets. The pretraining datasets span multiple organs, including pancreas, bladder, cervix, esophagus, head, neck, kidney, lung, thyroid, uterus, bone marrow, skin, brain, stomach, and ovary. Out of the total 57 datasets, 22 are comprised of image patches and 35 are WSI datasets. Most datasets are stained with hematoxylin and eosin (H&E) and come at a highest resolution of 0.25 µm/pixel (commonly referred to as 40×). While the majority of the WSI datasets are from The Cancer Genome Atlas Program (TCGA) and Clinical Proteomic Tumor Analysis Consortium (CPTAC) databases, we also use multiple public challenge datasets from lymph and breast cancer tasks. Image patch datasets are collected from various publicly available sources and exhibit similar organ diversity to the WSI datasets. For the detailed list of pretraining datasets, please refer to Appendix F.1.

Validation experiments
Validation datasets. We validate networks pretrained under multiple settings (e.g., training only with a single dataset or tissue type, different numbers of training images, etc.) on five classification, two segmentation, and one regression dataset. The classification datasets include BACH (four-class breast cancer classification), Lymph (three-class malignant lymph node cancer classification), BreakHisv1 (binary breast cancer classification), NCT-CRC-HE-100K (nine-class colorectal cancer tissue classification), and Gleason2019 (five-class prostate cancer classification). The segmentation datasets are BACH (four-class breast cancer segmentation on WSIs) and DigestPath2019 (WSI segmentation of early-stage colon tumors into healthy and cancerous tissue). Our single regression dataset is BreastPathQ, which involves assigning a percentage cancer cellularity score to a given image patch. See Appendix C.1 for a detailed explanation of each validation set. In the following, the abbreviations NCT, Bpq, and Dp19 refer to NCT-CRC-HE-100K, BreastPathQ, and DigestPath2019, respectively.
Tasks. We compare pretrained networks with randomly initialized and ImageNet pretrained Resnet 18, 34, 50 and 101. For segmentation tasks, we use a UNet architecture (Ronneberger et al., 2015) which contains a pretrained encoder and a randomly initialized decoder. We compare two supervised training settings: fine-tuning and last layer training. In order to assess the learned representations directly, we freeze each residual network at the pre-activation layer and only train a linear classifier or a regressor on the learned representations (last layer training). The last layer supervised training setting is omitted for the segmentation task, since the UNet-like decoder (Yakubovskiy, 2020) that we use contains a comparable number of trainable parameters to the pretrained Resnet encoder, and can mitigate the benefits of using a pretrained encoder or ImageNet initialization. We also train each network without freezing any layers (fine-tuning). While fine-tuning is commonly employed, freezing various layers of a network may be used to avoid overfitting whenever the training dataset is small. Furthermore, pretrained features can also be used in clustering (see Section 5.3), feature selection, and in more traditional machine learning methods such as decision trees and support vector machines.
Validation setup. We train for 100 epochs per experiment and use the Adam optimizer with a batch size of 128, with a weight decay of 0.00001 for Resnet18 models and 0.0001 otherwise. We found that 100 epochs is enough for convergence on each validation dataset for all the networks (e.g., Resnet101) considered in this work. We use 50% of the original dataset for training, 25% for validation, and 25% as the test set. We use the macro F1 score as our validation metric for the classification and segmentation tasks. Macro F1 weighs each class equally regardless of the number of samples per class, which accounts for the class imbalance seen in most digital histopathology datasets. For the regression task, we compute the mean absolute error (L1) between the predictions and the ground truth regression labels (varying between 0 and 100%). The test metrics corresponding to the maximum validation metrics are reported. Please refer to Appendix A for the definitions of the evaluation metrics.
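Both evaluation metrics are standard and available in scikit-learn; the toy labels below are purely illustrative.

```python
from sklearn.metrics import f1_score, mean_absolute_error

# Macro F1 averages per-class F1 with equal weight, so a rare class counts
# as much as a dominant one. Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
macro = f1_score(y_true, y_pred, average="macro")  # (1 + 0 + 2/3) / 3

# For regression, the reported metric is the mean absolute (L1) error
# between predicted and ground-truth cellularity percentages.
l1 = mean_absolute_error([10.0, 50.0, 90.0], [12.0, 45.0, 90.0])  # (2+5+0)/3
```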

Results
Unless otherwise stated, all pretraining experiments are conducted with the Resnet18 model, and validation experiments were conducted without freezing any layers of the network. We use "pretraining" to refer to the unsupervised training. Pretrained networks are then used for "supervised" training on validation datasets. A Resnet 18 model can be pretrained for 1000 epochs on a dataset with half a million images of size 224 × 224 pixels in about 24 hours using PyTorch on 4 Tesla P100 GPUs. For a detailed explanation of augmentation and hyperparameter selection for the pretraining stage, please refer to Appendix C.
In the following, we report the average classification and segmentation macro F1 scores, as well as the L1 error for regression, in the appropriate subsections. Expanded results with individual datasets for each experiment are given in Appendix D.

Overall comparison
We compare networks pretrained with self supervision, ImageNet initialization, and randomly initialized networks for Resnet 18, 34, 50, and 101 in Table 1. For comparison to other self- and unsupervised methods, please refer to Appendix E.

Table 1: The downstream task performance of networks trained on top of the self-supervised pretraining, ImageNet initialization, and random initialization. We report F1 scores (higher is better) averaged over five validation datasets for the classification task (Cls.), the mean L1 error (lower is better) between the ground truth and the predicted cellularity percentage for one dataset for regression (Reg.), and the average of two F1 scores (higher is better) for segmentation (Seg.).

We find that self-supervised pretraining is superior to ImageNet initialization on classification and regression tasks in all settings. Self supervision is comparable to ImageNet for segmentation with Resnet 18 and 34; for larger networks, ImageNet performs better.

Pretraining is most useful when only a small number of training images is available
It is generally accepted that pretrained networks boost performance in medical image analysis (Tajbakhsh et al., 2016) and digital histopathology tasks (Mehra et al., 2018). We conduct validation experiments on the classification and regression tasks for all datasets using {5, 10, 20, 50, 100}% of each training dataset, where for each percentage value we repeat experiments five times to obtain a better estimate of the performance. By limiting the amount of training data, we compare the effect of training dataset size under different pretraining settings. Specifically, we compare random initialization, ImageNet pretraining, and self-supervised pretraining for Resnet 18, 34, 50, and 101. We report detailed results per dataset in Figure 3.
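One way to realize the {5, 10, 20, 50, 100}% settings is class-stratified subsampling; the paper does not state its exact subsampling procedure, so the stratification in the helper below is an assumption.

```python
import random

def stratified_fraction(labels, fraction, seed=0):
    """Indices of a class-balanced subset covering `fraction` of the data.

    Stratifying per class keeps minority classes represented even at the
    smallest percentage settings; this is our assumption, not a procedure
    stated in the text.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    chosen = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))  # keep at least one sample
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)
```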
We found that in the absence of a large training set, training a classifier or a regression network using a pretrained initialization can improve the performance. For instance, self supervision outperforms training from scratch (random setting) by over 40% for the NCT dataset when only 5% of the labeled images are used for training. When more data is used, self supervision still remains superior to both ImageNet and random initializations.
5.3. Features obtained by pretraining are more representative of histopathological image patches than ImageNet features

We compare representations learned through self-supervision with pretrained ImageNet features and randomly initialized residual network weights by training linear classifiers on the pre-activation layer of the Resnet models for the validation datasets. For Resnet 18 and Resnet 34, this amounts to 512 features, and for Resnet 50 and Resnet 101 to 2048 features. We use 100% of the available data for each experiment. We performed the same comparison for the regression task, where we froze the network at the pre-activation layer and trained a single regression layer. We do not include analogous frozen-encoder experiments for the segmentation task, since a decoder can contain millions of parameters that can be trained to achieve satisfactory performance regardless of the encoder weights. We report the average results in Table 2, and the detailed results per dataset in Figure 4.

Figure 3: Classification results showing the macro F1 score on the five validation datasets, and regression results showing the mean absolute error (L1) on the BreastPathQ dataset, for four differently sized Resnet models, when the supervised training was conducted without freezing any layers (fine tuning). We report the F1 score and the L1 error when {5, 10, 20, 50, 100}% of the labeled data is used for supervised training. Each percentage setting was run five times using different samples; the average and the standard deviation of the five runs are shown as dots and error bars.

Table 2: The downstream task performance of linear classifiers trained on top of the features learned by self supervision, ImageNet initialization, and a randomly initialized network. The network weights up to the pre-activation layer are frozen and not updated; only the linear classification layer is trained.
We report F1 scores averaged over five validation datasets for the classification task, and the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression. Segmentation results using frozen encoders are omitted, since a decoder can contain millions of parameters that can be trained to achieve satisfactory performance regardless of the encoder weights. The self-supervised network achieves better results than ImageNet initialization, which indicates our method has learned domain-specific features that can be useful, especially when training samples are scarce. One may freeze the network and obtain features from various layers for training various machine learning models, including neural networks, support vector machines, or random forest classifiers.
Unsupervised clustering using the learned representations. Learned representations can be used directly for clustering without any labeled training examples. In this section, we use the learned representations to cluster image patches extracted from WSIs. Note that the learned features can also be used to query an image for its nearest neighbors without clustering the dataset, which is useful in applications such as active learning for sample selection and various data retrieval systems.
Negative mining is important in most tasks involving WSIs. To illustrate the saliency of the learned representations, we perform negative mining on WSIs from a dataset of post-neoadjuvant therapy BRCA specimens with annotations of regions containing tumor (Peikari et al., 2017; Martel et al., 2019). Randomly sampling patches from a WSI results in a large class imbalance in favor of negatives, which leads to an increased ratio of false negatives. One may mitigate this issue by aggregating the sampled patches and selecting a subset of "representative" patches. This is done by clustering, where we rely on the perceived visual or morphological similarity of patches according to their relative distance in feature space. Specifically, we sample 1.4 million images from 69 WSIs and cluster them into three thousand clusters using their feature representations (∈ R^512) generated by the Resnet18 trained using self supervision. The number of clusters was determined using the elbow heuristic (Ciga et al., 2021) over candidate cluster counts {1000, 1500, ..., 10000}: the explained variance is 67.1% with 2500 clusters, 83.3% with 3000 clusters, and 91% with 5000 clusters. The features are clustered using the mini-batch K-means algorithm. The resulting clusters can be seen in Fig. 5, with three samples per cluster. In addition, we select a few clusters and highlight them on WSIs to segment various regions of interest without any supervision in Figure 6.
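The clustering step can be sketched with scikit-learn's mini-batch K-means; the helper name and defaults are illustrative rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_patch_features(features, n_clusters, seed=0):
    """Cluster patch embeddings (e.g., 512-d Resnet18 features) with
    mini-batch K-means, as in the unsupervised clustering experiment;
    helper name and parameters are our own choices.
    """
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    assignments = km.fit_predict(np.asarray(features))
    return km, assignments
```

To display the patches closest to each cluster center (as in Fig. 5), `km.transform(features)` gives per-sample distances to every center, from which the nearest few samples per cluster can be selected.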
5.4. Using more unlabeled images in pretraining improves the downstream task performance

Extracting patches of size 224 × 224 pixels from WSIs that contain foreground (refer to Appendix B) can lead to a few million patches per dataset, which can quickly become intractable. Therefore, we randomly sample a maximum of 100 patches from each WSI and use all images from non-WSI datasets to generate the unsupervised pretraining data. We use the maximum available resolution per WSI, which is ∼0.25 µm/pixel or ∼0.50 µm/pixel depending on the dataset. We use 206 thousand images from 23 non-WSI datasets and sample over four million patches from 35 WSI datasets containing around 25 thousand WSIs. We compare using {0.01, 0.1, 1, 10}% of the sampled datasets, with a maximum of two thousand images per dataset, and a minimum of 10 images per dataset if the number of images corresponding to the percentage setting for a dataset is below 10. In the end, we obtain 4, 40, and 400 thousand images for the respective percentage settings. In addition, we pretrain a model using only 10 images per dataset (∼0.01% of all available images). In this setting, we sample 10 image patches from each WSI dataset as well, even though such a dataset can contain a few million image patches across all of its WSIs; the dataset is therefore drastically undersampled. The results can be seen in Table 3.

Figure 5: In Fig. 5b, each yellow box shows three samples closest to the center of a randomly selected cluster for illustration. Note that visually similar patches with features such as out-of-focus regions, creases, ink marks of different colors, morphological structures such as ducts, outlines of the nipple, and red blood cells are clustered together. Furthermore, patches with varying nuclei formation patterns are also clustered roughly according to their density.

Table 3: The effect of the number of images used for pretraining on the downstream task performance for classification, regression, and segmentation tasks.
We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation. We found that using a larger training set outperforms using a training set with fewer images. Notably, we observe diminishing returns with each order-of-magnitude increase in dataset size: while the difference between using 0.01% and 0.1% of the available images is 7.4% in average F1 for the classification task, the difference between 1% and 10% is around 2.1%.
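The per-dataset image budget described above (a percentage of the available images, clamped between 10 and 2000) can be written as a small helper; the function name is ours, while the clamping follows the rule stated in the text.

```python
def images_per_dataset(n_available, fraction, lo=10, hi=2000):
    """Image budget for one dataset under the percentage settings:
    `fraction` of the available images, clamped to [lo, hi].
    """
    return min(hi, max(lo, int(n_available * fraction)))
```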

The impact of resolution on the pretraining
To assess the impact of resolution on the learned features, we pretrain four networks using images at different resolutions. We use a breast image dataset originally aimed at identifying the invasive ductal carcinoma grade at multiple resolutions (Bolhasani et al., 2020). Here, 10×, 20×, and 40× refer to 1 µm/pixel, 0.50 µm/pixel, and 0.25 µm/pixel, respectively. We pretrain networks using images at 10×, 20×, and 40×, in addition to using all available images (10, 20, 40×).

Table 4: Examining the impact of resolution on task performance. We use images from one resolution for pretraining and evaluate the pretrained network on all available validation datasets. 10×, 20×, and 40× refer to 1 µm/pixel, 0.50 µm/pixel, and 0.25 µm/pixel, respectively. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation.

We find that networks pretrained on images at higher resolutions tend to perform better in downstream tasks. Furthermore, combining multiple resolutions further improves task performances.

Transferability of features between tissue types and staining
To assess whether it is better to pretrain using datasets drawn exclusively from a tissue type similar to that of the target task, we conducted separate experiments using Resnet 18 with three different tissue types: breast, lymph nodes, and prostate. Specifically, we trained a network using only the Camelyon 16 and Camelyon 17 datasets for lymph nodes, and the TCGA-PRAD and Prostate-MRI datasets for prostate. Due to the greater availability of data for breast, we trained on a larger number of datasets: TCGA-BRCA, TUPAC16, TNBC, several datasets from Andrew Janowczyk, ICPR2014, and ICPR2012 (see Table F.13 for dataset descriptions). For each experiment, roughly the same number of training images was used (∼ four thousand). The results can be viewed in Table 5.
Despite the site-specific pretraining (i.e., matching the tissue used in pretraining to the target task), we did not observe a strong correlation between the pretraining tissue and the validation performance. For instance, prostate models outperformed breast models on the BACH dataset for both breast cancer classification and segmentation. Similarly, the lymph model was outperformed by the breast model on malignant lymph cancer detection. It should be noted, however, that for some datasets the site-specific pretraining yielded better performance: the breast model performed best on the BreastPathQ dataset, and the prostate model outperformed the other models on Gleason prostate cancer grading. In addition, we observed poorer performance compared to training with all datasets; the results in Tables 1 and D.7 indicate that using images from a more diverse pretraining dataset with a comparable number of pretraining images leads to better performance across all tasks.

Table 5: Examining the transferability of features between organs. We use images from one organ for pretraining and validate on all available validation datasets. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation.

Discussion
In Section 5.1, we observed that self supervision performs better than ImageNet initialization on the segmentation task for Resnet 18 and 34. In contrast, for Resnet 50 and 101, ImageNet performs better. We hypothesize this is due to the increased number of trainable parameters in the decoder, which diminishes the effect of pretraining the encoder.
In Section 5.6, we observed that the tissue type used for pretraining was not correlated with downstream task performance on the validation datasets. We believe the lack of correlation, and the degradation compared to using all tissue types, can be due to the following factors: (1) limitations of the contrastive approach, where representations are incapable of encoding domain-specific information in the absence of other tissue types; (2) convolutional networks are highly sensitive to visual properties such as staining, resolution, and morphological shapes, and do not encode abstract features in the absence of a specific objective (e.g., cancer grading); and (3) the network may only be incentivized to encode a richer representation when given a diverse pretraining dataset.

Conclusion
Our main objective in this work was to show that, by pretraining, we can learn better features that improve performance on multiple downstream tasks, including classification, regression, and segmentation. The self-supervised method outlined in this paper is the first to consistently achieve performance comparable to ImageNet pretraining without additional complexity. To our knowledge, there is no prior research on histopathological image analysis with a training regimen that consistently reaches or surpasses supervised training. In addition, this is the first study to use a very large number of images in a digital histopathology setting: 23 image datasets with over 206 thousand patches, and ∼25 thousand gigapixel images from 35 datasets that consist of whole-slide images.
We have shown that the success of the contrastive pretraining method relies heavily on the diversity of the unlabeled training set, as opposed to the number of images. This is an important consideration when adopting a technique from the computer vision community, where most methods are validated on natural-scene images that contain significantly more diversity than medical images. Furthermore, we have shown that the site from which the training images were extracted did not have a substantial effect on the quality of the learned representations, as shown in Section 5.6. While this is a clear divergence from the training of human experts, who focus on a specific organ, it can also significantly increase the number of datasets available for training such systems.
In this work, we focused on the simplest contrastive method that significantly improved the state-of-the-art, and experimented under the multiple settings outlined in Section 4 to understand the capabilities and limitations of contrastive training for histopathology images. We believe that so long as the fundamentals of the same contrastive training framework remain unchanged, the insights we obtained will remain valid for future work on self-supervision. Overall, we found that contrasting images which are visually distinguishable helped in learning salient representations. In contrast, images which look similar but differ in small nuances that are important in histopathology (e.g., single-cell tumors that only occupy a small portion of a given patch) were not suitable for contrastive learning, and led to noisy representations. As this is rarely the case for natural-scene images, researchers working on digital histopathology images need to address domain-specific issues to bridge the gap between histopathology and computer vision in self-supervised learning.

Conflict of interest
We have no conflict of interest to declare.

Acknowledgments
This work was funded by the Canadian Cancer Society (grant #705772) and NSERC. It was also enabled in part by support provided by Compute Canada.

The F1 score is defined as F1 = 2·tp / (2·tp + fp + fn), where tp stands for true positive, fp for false positive, and fn for false negative. We use the macro F1 score in reporting classification and segmentation results. Macro F1 assigns the same weight to each class-wise score prior to averaging over all classes, in order to take into account the class imbalance present in most of our validation datasets.
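The macro F1 computation described above can be sketched as follows; the per-class counts here are hypothetical, purely for illustration:

```python
def macro_f1(tp, fp, fn):
    """Macro F1: compute per-class F1 = 2*tp / (2*tp + fp + fn),
    then average with equal weight per class, so that rare classes
    count as much as frequent ones."""
    scores = []
    for c in tp:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# hypothetical counts for a three-class problem
tp = {"benign": 8, "in_situ": 2, "invasive": 5}
fp = {"benign": 2, "in_situ": 1, "invasive": 0}
fn = {"benign": 0, "in_situ": 3, "invasive": 5}
score = macro_f1(tp, fp, fn)
```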
We use the L1 error for the regression task. L1 is defined as the mean absolute difference (MAD) between the ground truth (gt_i) and the prediction (pred_i) over N samples: L1 = (1/N) ∑_{i=1}^{N} |gt_i − pred_i|, where the subscript i indexes the ground truth and prediction belonging to sample i.

Results. In our initial experiments, we observed that Adam did not converge for batch sizes ≥ 256, so it was discarded.
A batch size of 512 performed the best for both optimizers across all temperature values and is used in all subsequent experiments. Lamb is a newer technique for which You et al. (2019) have shown better large-batch training performance with faster convergence on convolutional networks. However, we observed that Lars outperformed Lamb in most settings, and it was therefore used for the remaining experiments. We found that smaller temperatures generally performed better, while larger ones (1, 2, 10) did not converge. The temperature parameter of 0.1 performed the best for both optimizers. For the complete list of experiments, please refer to Table C.6.

Interestingly, we found that randomly cropping 1% of the image patch and resizing it to its original size outperformed both 5% and 25% for the medium and heavy color jittering settings. We found that color jittering strongly correlated with the quality of the learned representations. No color jittering has the lowest training loss, since it is easy to distinguish images without tampering with their color properties; however, it was outperformed by both the light and heavy jittering settings. Overall, we observed that more aggressive augmentations resulted in better representations. For instance, while 1% of a 224 × 224 patch (a 2 × 2 square) cannot be used to distinguish the original patch it was extracted from, it still outperformed {5%, 25%} for the medium jittering setting, which performed the best across all settings by over 0.6% in F1 over the next best setting. We argue that the improvement is due to a regularization effect on the contrastive objective, which prevents saturation and enables the learning of more salient features. The average F1 scores obtained from a combination of different augmentation and random cropping settings are shown in Fig. D.

Appendix C.1. Validation datasets

For all validation datasets, the training data comprises 50% of the total data, with 25% each for the validation and testing sets, unless otherwise mentioned.
For classification, we used the following five datasets: the BACH challenge dataset (Aresta et al., 2019), a malignant lymphoma classification dataset which we denote as "Lymph" (Orlov et al., 2010), BreakHis (Spanhol et al., 2016), NCT-CRC-HE-100K (Kather et al., 2018), and Gleason2019 (Nir et al., 2018). The BACH challenge classification dataset contains 400 patches of size 2048 × 1536 pixels extracted from breast biopsy WSIs. The dataset is evenly split into four classes (normal, benign, in situ carcinoma, and invasive carcinoma), and the challenge task was to automatically classify images into these classes. The Lymph dataset contains 374 H&E stained images from lymph nodes, sampled by multiple pathologists at different sites. The images are separated into three classes: CLL (chronic lymphocytic leukemia), FL (follicular lymphoma), and MCL (mantle cell lymphoma). Notably, this dataset contains a large degree of staining variation within the dataset. The BreakHis dataset contains 7909 patches of size 700 × 460 pixels taken from WSIs of breast tumor tissue. The data is labelled as either benign or malignant, and images belong to one of four magnification factors (40×, 100×, 200×, and 400×). The NCT-CRC-HE-100K dataset contains 107180 patches of size 224 × 224 pixels. The split differs for this dataset, with 75000 patches in the training set, 25000 in validation, and 7180 in the test set. The test set comes from a related dataset called CRC-VAL-HE-7K, which was recommended for use as validation for the larger dataset. Images are taken from 86 stained colorectal cancer tissue slides. The data is split into nine classes: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). The Gleason2019 dataset contains 244 tissue micro-array (TMA) images from prostate biopsies.
The segmentation labels for this dataset are the Gleason scores for each pixel in the image, ranging from 1 (healthy) to 5 (abnormal).
For segmentation, we used the following two datasets: the BACH challenge dataset (Aresta et al., 2019) and DigestPath2019 (Li et al., 2019). The BACH segmentation dataset originates from part B of the challenge and consists of 10 breast biopsy WSIs with pixel-level segmentation masks. These whole slides are split 5/3/2 for train/validation/test, respectively. The segmentation labels are the same four classes as in the classification dataset. Patches of size 1024 × 1024 pixels were extracted from each slide with a sliding window. The DigestPath2019 challenge dataset contains 250 image patches of size 5000 × 5000 pixels taken from colonoscopy tissue slides. For our task, we resize each patch and its corresponding pixel-level label image to 1024 × 1024. The segmentation masks provided for this dataset are divided into two classes (benign or malignant), and the aim of this challenge is to use these segmentation labels to identify early-stage colon tumors.
For regression, we used the BreastPathQ dataset (Akbar et al., 2019). It consists of a total of 2579 patches of size 512 × 512 pixels extracted from 69 WSIs of Post-NAT-BRCA specimens (Peikari et al., 2017). BreastPathQ images are labelled according to the percentage of cancer cellularity in each patch. The aim is to predict the percentage cellularity given an input image patch.

Table D.7: The downstream task performance of networks trained on top of the pretraining, ImageNet initialization, and a randomly initialized network. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation. The column with the header R indicates the Resnet model, and Reg. stands for regression.

Table D.8: The downstream task performance of linear classifiers trained on top of the features learned by self-supervision, ImageNet initialization, and a randomly initialized network. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation. The column with the header R indicates the Resnet model.

Table D.9: Examining the impact of resolution on task performance. We use images from one resolution for pretraining and evaluate the pretrained network on all available validation datasets. 10×, 20×, and 40× refer to 1 µm/pixel, 0.50 µm/pixel, and 0.25 µm/pixel, respectively. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation.
Table D.10: The effect of the number of images used for pretraining on the downstream task performance for classification, regression, and segmentation tasks. We report F1 scores averaged over five validation datasets for the classification task, the mean L1 difference between the ground truth and the predicted cellularity percentage for one dataset for regression, and the average of two F1 scores for segmentation.

We use the segmentation network used in our experiments (Yakubovskiy, 2020) with the Resnet50 encoder and optimize the image reconstruction objective with an L1 loss to train an autoencoder. We also use an earlier self-supervised technique based on predicting the RGB channels of a grayscale image (colorization) for learning representations, and a more recent self-supervised method called CPCv2 that encodes small patches from a larger image in a sliding window. The aim is then to predict encoded representations based on spatial proximity (patches extracted from an image that are spatially adjacent are considered more similar) (Hénaff et al., 2019).
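The reconstruction objectives behind the autoencoder and colorization baselines above can be illustrated with a small sketch. The grayscale conversion uses the standard ITU-R BT.601 luma weights; the toy pixel values and the exact preprocessing are illustrative assumptions, not the pipelines of the cited methods:

```python
def to_grayscale(rgb):
    """Convert an RGB pixel (values in [0, 1]) to a single luma value
    using ITU-R BT.601 weights. The colorization pretext task feeds
    the grayscale image to the network and asks it to predict RGB."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def l1_reconstruction_loss(pred, target):
    """Mean absolute (L1) error between predicted and target pixels,
    as used to train the autoencoder baseline."""
    flat_p = [c for px in pred for c in px]
    flat_t = [c for px in target for c in px]
    return sum(abs(p - t) for p, t in zip(flat_p, flat_t)) / len(flat_t)

# a toy two-pixel "image": grayscale input, original RGB as the target
image = [(0.8, 0.2, 0.1), (0.3, 0.6, 0.9)]
gray = [to_grayscale(px) for px in image]
loss = l1_reconstruction_loss(image, image)  # perfect prediction -> 0.0
```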