Consistency Regularisation in Varying Contexts and Feature Perturbations for Semi-Supervised Semantic Segmentation of Histology Images

Semantic segmentation of various tissue and nuclei types in histology images is fundamental to many downstream tasks in computational pathology (CPath). In recent years, deep learning (DL) methods have been shown to perform well on segmentation tasks, but they generally require a large amount of pixel-wise annotated data. Pixel-wise annotation often requires expert knowledge and time, making it laborious and costly to obtain. In this paper, we present a consistency based semi-supervised learning (SSL) approach that helps mitigate this challenge by exploiting a large amount of unlabelled data for model training, thus alleviating the need for a large annotated dataset. However, SSL models may also be susceptible to changing contexts and feature perturbations, exhibiting poor generalisation due to the limited training data. We propose an SSL method that learns robust features from both labelled and unlabelled images by enforcing consistency against varying contexts and feature perturbations. The proposed method incorporates context-aware consistency by contrasting pairs of overlapping images in a pixel-wise manner under changing contexts, resulting in robust, context-invariant features. We show that cross-consistency training makes the encoder features invariant to different perturbations and improves prediction confidence. Finally, entropy minimisation is employed to further boost the confidence of the final prediction maps on unlabelled data. We conduct an extensive set of experiments on two publicly available large datasets (BCSS and MoNuSeg) and show superior performance compared to state-of-the-art methods.


Introduction
Segmentation of fundamental objects and regions in histology images is key to several downstream analysis tasks in computational pathology (CPath) [1,2], e.g., cancer type classification [3,4,5,6], tumour and glandular segmentation [7], and other tasks such as mutation prediction [8,9]. Their utility is not limited to diagnosis; they have also been employed for prognostic purposes, e.g., tumour infiltrating lymphocytes (TILs) have been found to be a significant prognostic biomarker in various types of tumours [10]. Similarly, tumour progression has been linked with the interaction between tumour epithelial cells and the tumour-associated stroma [11]. Hence, it is important to segment different types of histological objects precisely, as their quantification is vital to downstream analysis.
Traditional machine learning methods accomplished this task using different hand-crafted features, e.g., colour [12], texture [13,14] and morphological features [15]. Recently, deep learning (DL) algorithms have gained increasing attention in semantic segmentation due to their superior performance on natural and medical images [16,17,18,19]. However, DL methods are known to be "data hungry" and require a large amount of annotated data. Precise annotation of histology images is an expensive and laborious process, requiring up to ∼5-6 hours of an expert histopathologist's time to annotate one whole-slide image (WSI) [20]. To alleviate the annotation burden, other modes of training have been proposed, such as patch based segmentation [21,22], coarse segmentation [23,24] and interactive segmentation [1,25], but these methods still require large-scale weak annotations involving human experts.
Semantic segmentation is a pixel-level classification task of predicting a label for each pixel. Most of the early DL methods were based on fully convolutional networks (FCNs) [26], where pooling layers aggregate information by focusing on "what" rather than "where", resulting in a loss of spatial information. Subsequent studies addressed this shortcoming by combining pooling layers with more advanced techniques involving skip connections, encoders and boundary information. As semantic segmentation is more than just assigning labels to pixels, it inevitably requires contextual information along with knowledge of colour, edges and resolution. In this regard, algorithms like UNet [16], PSPNet [27], HRNet [28] and DeepLab-v3 [17] use techniques like encoder-decoder architectures, wider receptive fields and dilated/atrous convolutions to improve segmentation performance. More recently, another line of work has focused on transforming the task of semantic segmentation into sequence-to-sequence prediction, where a self-attention mechanism is introduced using transformers [29] to encode the global context in each layer [30,18] for subsequent decoding. However, a downside of transformer based techniques is their computational complexity.
On the other hand, semi-supervised learning (SSL) can train DL models with a small set of annotated data by leveraging unlabelled data for better representation learning, hence boosting performance. SSL methods use different techniques to incorporate unlabelled data into learning, including pseudo labelling [31,32,33], generative adversarial modelling [34,35,36,37], consistency training [38,39,33,40] and entropy minimisation [41,42,43]. However, SSL methods face an additional issue of overfitting to the small labelled input set, which may lead to poor generalisation.
In this paper, we propose a novel consistency based SSL method for semantic segmentation which leverages unlabelled data under varying contexts and feature perturbations. Consistency regularisation is enforced using context-aware contrastive learning in changing contexts, and cross-consistency training is used to handle feature perturbations, along with entropy minimisation for confident predictions. The main purpose of consistency regularisation is to force the model to output consistent predictions for unlabelled data under changing conditions. For consistency to work effectively, the input space must satisfy the cluster assumption, i.e., nearby samples are most likely to share the same label, thus forming a cluster. Therefore, high density regions correspond to clusters (i.e., samples with the same label) whereas low density regions are separation spaces (i.e., object boundaries). For histology and natural images, the pixel space might not satisfy the cluster assumption, as can be seen in Figure 1. The low density regions (i.e., high average distance) do not align well with the class boundaries in most scenarios, e.g., in the 1st row we observe low density regions throughout the image, while in the last row there exists a cluster of high density regions for the foreground object, i.e., the road. However, the cluster assumption does hold in the encoder's latent feature space [39], as we show and discuss later in Figure 10. Therefore, we apply the feature perturbations to the encoder's output rather than to the input images. Also, due to the limited labelled data, the model may become overly dependent on context alone, overlooking the objects themselves and losing self-awareness [40]. Therefore, to enforce consistency against changing contexts, we propose context-aware contrastive learning, which helps the model learn high-level semantic features by contrasting positive and negative pairs of images in different contexts. As shown in Figure 2, under
varying contexts, a model trained in a fully supervised manner is unable to produce consistent feature distributions, in contrast to our proposed method, Consistency Regularisation in varying Contexts and Feature Perturbations for semi-supervised semantic segmentation of histology images (CRCFP), which produces consistent feature distributions. While context-aware consistency brings robustness to changing contexts, cross-consistency training can help the model learn invariant feature representations that are robust to small perturbations. While context-aware and cross-consistency regularisation can bring consistency to the encoder's feature representations, they often fail to optimise the pixel classifier, leading to less confident prediction maps. Finally, entropy minimisation coupled with the aforementioned techniques helps the model acquire high-quality and confident predictions. We extensively evaluated our proposed CRCFP on two publicly available histology image datasets, BCSS [44] and MoNuSeg [45], for two different semantic segmentation tasks, i.e., tissue region segmentation and nuclei segmentation. In summary, our contributions are as follows:

• We propose a consistency regularisation based SSL method against varying contexts and perturbations using a novel combination of context-aware consistency loss and cross-consistency training for feature generalisability.


Literature Review

Semantic Segmentation
The transformation of the pixel values of an image into class labels using high-level features is known as semantic segmentation and is fundamentally a challenging task. FCNs extract meaningful hierarchical visual features for various computer vision tasks, e.g., classification, segmentation and object detection. However, spatial information, which is vital in segmentation tasks, is lost in aggregation due to the pooling layers, resulting in a smaller output [26]. Encoder-decoder based architectures solve this issue by recovering and refining the output spatially in a step-wise fashion [46,47,48]. Further improvements are possible with the help of skip connections, which result in more refined boundaries and confident predictions [16]. However, the downside of encoder-decoder architectures is a limited receptive field, resulting in missing long-range dependencies. Dilated/atrous convolutions [17,49,50,24], spatial pyramid pooling [51,27,52,7] and attention based algorithms [53,54,55,18] enable the aggregation of context by using larger receptive fields or maintaining spatial information. More recently, the attention mechanism [29] has been used to replace the limited local receptive field of convolutions with global context using transformers. Images are transformed into a sequence of patches for the transformer [56] to process, as transformers capture more consistent global contexts due to their self-attention mechanism [30,18,57]. Despite the advancements and improvements in semantic segmentation, the bottleneck for high accuracy remains the dependence on pixel-wise annotations.
Figure 3: Overview of the proposed framework (CRCFP). The encoder and decoder are trained in a supervised manner with the cross-entropy (CE) loss for the labelled instances. For unlabelled instances, two cropped patches with a partial overlap are passed through the encoder together with the input image, where the cropped patches are used for contrastive learning and the input image for cross-consistency training.

Semi-Supervised Learning
Semi-supervised learning (SSL) exploits unlabelled data on top of limited labelled data to improve model performance and internal feature representations. Recently, SSL based methods have been widely adopted in the computer vision domain [58]. Popular SSL techniques include pseudo labelling [31,32,33], where a model trained on limited data is used to predict labels for the unlabelled data, known as pseudo labels. Generative adversarial based methods improve the generalisability of the trained model using perturbations in the direction of maximum vulnerability, aligning the distributions of labelled and unlabelled inputs in the latent space [35,36,59,37]. Data interpolation based methods augment the input space to create perturbed linear inputs for training [60,61,62]. Temporal ensembling based methods ensemble predictions over epochs using a momentum/moving average to enforce consistency between predictions [63,64]. Self-supervised consistency training contrasts the unlabelled input using pre-text tasks to learn important representations [65,66,67,33,40], and entropy minimisation based methods push the model to assign each sample confidently to one of the labels [41,42,43].

Contrastive Learning
Learning by contrasting pairs of similar (positive) and dissimilar (negative) images for improved representation learning is known as contrastive learning [68,65,69]. Several loss functions have been proposed, from the maximum margin loss [70], triplet loss [71] and N-pair loss [72] to contrastive predictive coding (CPC) [73], which introduced the mutual information based InfoNCE loss to improve contrastive learning. Contrastive learning has been used in both supervised and unsupervised learning tasks in conjunction with self-supervision [66,65,74]. Recently, it has been established that using more accurate positive and negative pairs, heavy augmentations and larger batch sizes improves the quality of the learned representations. Memory banks are adopted when large batches are not computationally feasible (i.e., do not fit in GPU memory), providing a large set of negative samples for the contrastive loss.

Semi-Supervised Semantic Segmentation
SSL based semantic segmentation approaches utilise the aforementioned techniques to extract knowledge from unlabelled data. Recently, CutMix, MixUp, and CutOut based augmentation techniques were used together with a student-teacher model where consistency was enforced between the mixed predictions [75]. Guided collaborative training (GCT) [76] performed network perturbations with the help of different network initialisations and enforced a dynamic consistency constraint between the predictions. Cross-consistency training (CCT) [39] performed perturbations on the main encoder's features and enforced consistency over the outputs of multiple decoders, making it robust to various perturbation types. Context-aware consistency [40] proposed a directional consistency loss for contrasting different contexts by cropping two overlapping patches of the same input to improve representation learning. Recently, in the field of computational pathology, a few methods for semi-supervised semantic segmentation have been proposed. [77] proposed a semi-supervised method for signet cell detection with the help of self-supervised learning for label generation. [78] proposed a two-stage SIM-FixMatch approach utilising self-supervised learning in the first stage and then using FixMatch for pseudo label generation along with consistency regularisation. [79] proposed an exponential moving average (EMA) student-teacher framework where the model is trained using noisy labels to enforce consistency over similar and dissimilar patch pairs. Cross-patch dense contrastive learning [43] proposed a student-teacher based method to enforce EMA based consistency over predictions and to improve the internal representations, applying a pixel-wise contrastive loss to background and foreground patches.

Figure 4: Directional contrastive loss for context-aware consistency. In the overlapping area (yellow overlay) of ϕ_u1 and ϕ_u2, positive pixels with higher confidence pull each other closer (orange arrows), while negative pixels from ϕ_u2 as well as from the memory bank push each other apart (red arrows). Class masks ŷ_u1, ŷ_u2 (dashed green arrows) are applied to obtain the negative samples from ϕ_u2 and from the memory bank, illustrated in the grey overlay.
In this work, we show that (a) by enforcing consistency over varying contexts and feature perturbations in encoder's latent space, models can generalise better and (b) minimising entropy in output prediction maps can boost the confidence of the final predictions resulting in improved performance.

The Proposed Method
Figure 3 shows an overview of the proposed framework (CRCFP), where L denotes the set of labelled images with their pixel-wise masks and U = {x_u} represents the M unlabelled images. Labelled and unlabelled images x_l and x_u are sampled from L and U respectively in batches. Both x_l and x_u have H × W × D spatial dimensions, with a corresponding pixel-wise mask y_l ∈ R^{C×H×W} only for the labelled image, where C is the number of classes. Each labelled image x_l is passed through the supervised pathway of the CRCFP framework (blue arrows in Figure 3), whereas the unlabelled image x_u passes through the unsupervised pathways of the framework (brown arrows in Figure 3), along with two overlapping patches extracted randomly from x_u, denoted x_u1 and x_u2 (green arrows in Figure 3). Feature maps f_l and f_u are extracted from the input images using the shared encoder h(•; θ_h) and the decoder. Further, f_l and f_u are processed by a pixel classifier C_f for the final prediction as ŷ_l = C_f(f_l; θ_p) and ŷ_u = C_f(f_u; θ_p), where ŷ_l is optimised using the cross-entropy loss over y_l as L_sup, shown in equation 1.
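A minimal PyTorch sketch of the supervised pathway described above (the modules here are illustrative stand-ins; the paper uses DeepLab-v3 with a ResNet-50 encoder, and only the data flow encoder → decoder → pixel classifier → cross-entropy is mirrored):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins for the shared encoder h(.; theta_h), the decoder,
# and the pixel classifier C_f (a 1x1 convolution to C classes).
encoder = nn.Conv2d(3, 16, 3, padding=1)
decoder = nn.Conv2d(16, 16, 3, padding=1)
classifier = nn.Conv2d(16, 5, 1)            # C = 5 classes, as in BCSS

def supervised_step(x_l, y_l):
    """Supervised pathway: L_sup = cross-entropy between C_f(f_l) and y_l."""
    f_l = decoder(encoder(x_l))
    logits = classifier(f_l)                # (B, C, H, W)
    return F.cross_entropy(logits, y_l)

x_l = torch.randn(2, 3, 64, 64)             # labelled batch
y_l = torch.randint(0, 5, (2, 64, 64))      # pixel-wise class mask
l_sup = supervised_step(x_l, y_l)
```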

Context-Aware Consistency
With only the supervised loss L_sup, the model may start relying excessively on context due to the limited labelled data. Context-aware consistency can alleviate this issue by aligning two different contexts of the same patch with the help of contrastive learning. For this purpose, the encoded feature maps f_u1 and f_u2 are projected to a low-dimensional space using a non-linear projector φ which preserves the important contextual information. The non-linear projection head is chosen over linear and identity projection heads due to its superior performance [65]. The projection head φ(•; θ_z) outputs projection maps ϕ_u1 = φ(f_u1; θ_z) and ϕ_u2 = φ(f_u2; θ_z). Similar to [40], context-aware consistency is maintained between the overlapping regions of ϕ_u1 and ϕ_u2 using the directional contrastive loss L_cont to keep the feature representation consistent under different contexts, as shown in Figure 4. For computing the directional consistency loss, class maps ŷ_ui are first extracted using the pixel classifier C_f, and the maximum probability among all C classes is retained, as it is linked with higher confidence, as shown in equation 2.
where i ∈ {1, 2}, and higher probability features are used to align less confident features towards more confident ones [76,40,43], which can improve learning by avoiding the exchange of unreliable knowledge from the less confident features, as shown in Figure 4. In order to extract negative samples (i.e., negative pairs), class maps ŷ_u1 = C_f(f_u1; θ_p) and ŷ_u2 = C_f(f_u2; θ_p) are used. For a positive feature projection ϕ_u1+ with class map ŷ_u1+ (i.e., in the direction u1 → u2), the negative samples η should satisfy ŷ_u1+ ≠ ŷ_u−, as shown in equation 5. Further, to avoid less confident features contributing towards the loss, a threshold λ is applied to prevent an exchange of knowledge between less confident features. The loss ℓ_cont(ϕ_u1, ϕ_u2) for one pair is calculated as shown below, where sim(·) is the cosine similarity measure with temperature τ, M_c+ represents the binary mask for confident features corresponding to ϕ_u1+, M_+ is the binary mask for positive confident samples above the threshold λ, and M_− is the binary mask for negative samples indicating different pseudo labels between ϕ_u+ and ϕ_u−. To increase the number of negative samples for better contrastive performance, we use a memory bank which stores features from recent batches [65,40,43]. Finally, the directional contrastive loss L_cont is calculated as below:
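A simplified per-pixel sketch of a directional contrastive loss with confidence filtering, in the spirit of the description above. All tensor shapes, the stop-gradient on the more confident view, and the memory-bank handling are illustrative assumptions, not the exact formulation of the paper:

```python
import torch
import torch.nn.functional as F

def directional_contrastive_loss(phi1, phi2, conf1, conf2, labels1,
                                 neg_bank, neg_labels, tau=0.1, lam=0.75):
    """Illustrative directional contrastive loss (direction u1 -> u2).

    phi1, phi2 : (N, D) projected features of the overlapping region
    conf1/2    : (N,)   max class probabilities from the pixel classifier
    labels1    : (N,)   pseudo labels of phi1
    neg_bank   : (M, D) memory-bank features (assumed L2-normalised)
    neg_labels : (M,)   their pseudo labels
    """
    # Positive pair: pull phi1 towards phi2; detach the target so knowledge
    # flows from the more confident view to the less confident one.
    pos = torch.exp(F.cosine_similarity(phi1, phi2.detach(), dim=1) / tau)
    # Negatives: bank features whose pseudo label differs from labels1.
    sim_neg = torch.exp(phi1 @ neg_bank.t() / tau)
    neg_mask = (labels1.unsqueeze(1) != neg_labels.unsqueeze(0)).float()
    neg = (sim_neg * neg_mask).sum(dim=1)
    loss = -torch.log(pos / (pos + neg + 1e-8))
    # Directional + confidence filtering: keep only pixels where phi2 is the
    # more confident view and its confidence exceeds the threshold lambda.
    m = ((conf2 > conf1) & (conf2 > lam)).float()
    return (loss * m).sum() / m.sum().clamp(min=1)
```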

Cross-Consistency Training
Although context-aware consistency improves the model's robustness towards changing contexts without losing self-awareness, the model is still susceptible to small perturbations in the input due to the limited labelled data. Therefore, in order to leverage unlabelled data and make the model invariant to small perturbations, we utilise cross-consistency training [39], where f_u is perturbed K times for each perturbation type and consistency is maintained between the outputs of the pixel classifier and the auxiliary classifiers. This not only improves the model's robustness but also regularises the main pixel classifier towards correct predictions. We use ŷ_u to regularise the pixel classifier with a mean squared error (MSE) loss measuring the distance between the output of the main pixel classifier C_f and the outputs of the auxiliary classifiers C_f^k. Formally, a perturbation function p_k with k ∈ {1, ..., K} outputs a perturbed version of f_u as f_u^k = p_k(f_u) for a given perturbation type, and the cross-consistency training loss L_cross can be defined as below, where d measures the squared distance between the output probabilities of the main pixel classifier and the perturbed auxiliary classifier outputs. The following perturbations are applied to enforce consistency. Feature Noise: noise uniformly sampled from the interval [α, β] is added to the feature map f_u in two steps. First, the sampled noise is multiplied with f_u to scale the noise relative to the feature activations. Second, the scaled noise is added to the feature map f_u. This makes the noise proportional to each feature activation, as shown below.
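The cross-consistency loss and the feature-noise perturbation can be sketched as follows. This is a simplified illustration, assuming auxiliary classifiers are plain modules over the feature map and showing only the feature-noise perturbation; the paper uses K auxiliary classifiers per perturbation type:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_noise(f_u, alpha=-0.3, beta=0.3):
    """Uniform noise N ~ U(alpha, beta): first scaled by the activations
    (f_u * N), then added to f_u, making it proportional to each activation."""
    noise = torch.empty_like(f_u).uniform_(alpha, beta)
    return f_u + f_u * noise

def cross_consistency_loss(main_probs, aux_classifiers, f_u):
    """L_cross sketch: MSE between the main classifier's probabilities and
    each auxiliary classifier's output on a perturbed feature map."""
    loss = f_u.new_zeros(())
    for aux in aux_classifiers:
        aux_probs = torch.softmax(aux(feature_noise(f_u)), dim=1)
        loss = loss + F.mse_loss(aux_probs, main_probs.detach())
    return loss / len(aux_classifiers)
```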
Feature Dropout: A uniformly sampled threshold γ is used to prune the less confident activations to stop the model from relying on them. This is done by first summing f_u over the channels and then applying min-max normalisation. Anything below γ is then dropped, as seen below, where M_drop is the binary mask used for pruning the activations.
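A sketch of the feature-dropout perturbation as described above (the [0.75, 0.9] sampling range for γ follows the experimental settings reported later; the exact masking convention is an assumption):

```python
import torch

def feature_dropout(f_u, gamma_low=0.75, gamma_high=0.9):
    """Prune low activations: sum over channels, min-max normalise, and zero
    out every spatial location whose normalised activation is below gamma."""
    attn = f_u.sum(dim=1, keepdim=True)              # (B, 1, H, W)
    mn = attn.amin(dim=(2, 3), keepdim=True)
    mx = attn.amax(dim=(2, 3), keepdim=True)
    attn = (attn - mn) / (mx - mn + 1e-8)            # min-max normalisation
    gamma = torch.empty(1).uniform_(gamma_low, gamma_high).item()
    m_drop = (attn >= gamma).float()                 # binary mask M_drop
    return f_u * m_drop
```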
DropOut: A fraction of the activations is dropped out spatially, where the fraction is decided using a Bernoulli distribution with probability δ.

Entropy Minimisation
Context-aware contrastive learning and cross-consistency training improve the encoder's features but often fail to improve the final pixel classifier, leading to less reliable pseudo labels that corrupt training from unlabelled data. Higher confidence means better prediction maps and more refined pseudo labels, which in turn provide improved positive/negative pairs and pseudo labels for both context-aware and cross-consistency training. Hence, in order to improve the confidence of predictions, we employ entropy regularisation following its applications in semi-supervised learning [41,80,69,43], as shown in equation 17, where it penalises uncertain predictions on the unlabelled data and improves the overall confidence of the prediction maps.
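The entropy regularisation term can be sketched as the mean per-pixel entropy of the predicted class distribution (a standard formulation; exact reductions are an assumption):

```python
import math
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Mean per-pixel entropy of the predicted class distribution; minimising
    it pushes the pixel classifier towards confident predictions."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()
    return -(p * log_p).sum(dim=1).mean()
```

Uniform predictions give the maximum entropy log C, and the loss shrinks as the predictions sharpen.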

Training
Finally, the entire framework is trained in an end-to-end fashion using a weighted combination of the above-mentioned losses as shown below, where w_sup, w_cont, w_cross and w_ent are the weights of the corresponding loss components.
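The combined objective is then a simple weighted sum; the default weights below are the fixed values reported in the experimental settings:

```python
def total_loss(l_sup, l_cont, l_cross, l_ent,
               w_sup=1.0, w_cont=0.1, w_cross=0.01, w_ent=0.01):
    """Weighted combination of the four loss terms for end-to-end training."""
    return w_sup * l_sup + w_cont * l_cont + w_cross * l_cross + w_ent * l_ent
```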

Datasets
We evaluated the proposed framework on two publicly available datasets, the Breast Cancer Semantic Segmentation (BCSS) [44] and Multi-organ Nucleus Segmentation Challenge (MoNuSeg) [45] dataset for semantic segmentation.The data was obtained from the respective challenge pages hosted on Grand Challenge for Medical Image Analysis website (https://grand-challenge.org/).
MoNuSeg. The MoNuSeg challenge was organised as a MICCAI 2018 satellite event and contains 21,623 annotated nuclei from 30 H&E stained images for training and 7,223 annotated nuclei from 14 H&E stained images for testing. Annotations were produced by engineering students and then verified by an expert pathologist for quality control. Each image is of size 1000 × 1000, extracted from a WSI of an individual patient scanned at 40× magnification, obtained from The Cancer Genome Atlas (TCGA) [81]. The WSIs are sampled from 18 different centres and 7 different organs, including breast, liver, kidney, prostate, bladder, colon and stomach, with various tumour stages.
BCSS. The BCSS challenge was conducted in 2021 and contains over 20,000 annotated regions of interest (ROIs) from 151 H&E stained WSIs, one per patient, from TCGA [81]. Twenty-five annotators, including pathologists, residents, and medical students, helped annotate this large-scale dataset into 25 fine-grained categories, which were later merged into 5 broad categories: tumour, stroma, inflammatory, necrosis, and others. For this work, we used the same 5 broad categories by relabelling the regions and then split the data into training and test centres following [44], with 14 centres for training and 7 centres for testing.

Data Preparation
In order to validate the CRCFP framework, we evaluated it against different label proportions of each dataset. For BCSS, the different label proportions were collected from different centres (hospitals) to make training more susceptible to variations in colour, introducing a domain shift. DL methods often fail to perform well on samples from a different domain (centre), mainly due to domain shift, which also makes this a domain generalisation problem. Therefore, the training set was divided into portions by dividing the total training centres as 1/1 (full), 1/2 (half), 1/4 (quarter), and 1/8 (one-eighth), where 1/8 results in training images coming from only 1 centre, while the test set remains intact. Similarly, for 1/4 (quarter) the training images come from 4 centres, and for 1/2 (half) from 7 centres. For MoNuSeg, the label proportions were based on the training images themselves, divided into 1/1, 1/8, 1/16, and 1/32 proportions to make our results comparable to the work of [43]. This whole process was repeated using 3 different random seeds, and the results are reported as the mean with standard deviation.

Evaluation metrics
In order to compare our proposed method quantitatively with other state-of-the-art (SOTA) methods, we used several quantitative measures, including accuracy, F1-score (Dice) and mean intersection over union (mIoU), for both datasets.

Network Architecture
We used DeepLab-v3 [17] as the base segmentation network with a ResNet-50 [82] encoder pretrained on ImageNet [83]. The projector consists of two fully connected (FC) layers of size 128 with ReLU as an intermediate activation, i.e., FC → ReLU → FC. The pixel classifiers consist of convolutional layers with a kernel of size 1 × 1 to reduce the number of channels to the number of classes, with non-linear ReLU activation. The final layer upsamples the output using bilinear interpolation to match the input size H × W × C.
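The projector and pixel classifier described above can be sketched as follows; the input width of 2048 is an assumption based on the ResNet-50 encoder's output channels, and the placement of the ReLU in the classifier is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection head: FC -> ReLU -> FC with output size 128, as described.
projector = nn.Sequential(
    nn.Linear(2048, 128),
    nn.ReLU(inplace=True),
    nn.Linear(128, 128),
)

class PixelClassifier(nn.Module):
    """1x1 convolution reducing the channels to the number of classes,
    followed by bilinear upsampling to the input resolution."""
    def __init__(self, in_ch=2048, num_classes=5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, f, out_size):
        logits = self.conv(F.relu(f))
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)
```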

Experimental Settings
The input size for the proposed framework for both labelled and unlabelled images was 320 × 320. For contrastive learning, two patches x_u1 and x_u2 were randomly cropped from the unlabelled image with an overlap in the range [0.1, 1.0] and resized to match the input dimensions. For the positive filtering mask, λ was set to 0.75, and τ = 0.1 was used as the temperature for cosine similarity. For cross-consistency training, the number of auxiliary pixel classifiers was set to K = 4 for each perturbation type, and for the feature noise perturbation the parameters α = −0.3, β = 0.3 were used. For the feature dropout perturbation, the threshold γ was sampled uniformly from [0.75, 0.9], which removes approximately 10% to 30% of the active regions from the feature map. For simple DropOut, the probability of the Bernoulli distribution was set to δ = 0.5. During training, a set of standard augmentations was applied to the input images, including horizontal and vertical flipping, Gaussian blur, colour and grey scaling. The framework was implemented in PyTorch and trained for 80 epochs. For the initial 5 epochs, only the supervised loss L_sup was used, as this provides a stable head start for semi-supervised learning. A batch size of 8 was used for labelled and unlabelled images with the stochastic gradient descent (SGD) optimiser and a learning rate of 0.001. As is common practice, a poly learning rate decay policy was used, where the learning rate is scaled by (1 − iter/max_iter)^power at each iteration with power = 0.9. The weights of the losses L_sup, L_cont, L_cross and L_ent were set to fixed values w_sup = 1, w_cont = 0.1, w_cross = 0.01 and w_ent = 0.01 respectively. All models were trained with the same configuration for both datasets on two Nvidia GeForce 1080Ti GPUs.
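The poly learning-rate schedule used above can be written in a few lines:

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """Poly decay: lr = base_lr * (1 - it / max_iter) ** power."""
    return base_lr * (1 - it / max_iter) ** power
```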

Results
The performance of our proposed method (CRCFP) compared to recent SOTA semi-supervised semantic segmentation methods, including DeepLab [17], CCT [39], CAC [40] and CDCL [43], is shown in Table 1 and Table 2. As these methods were originally implemented using different configurations and baseline segmentation models, for a fair comparison we implemented them within a unified framework with the same segmentation baseline, experimental settings and data augmentations.
Table 1 shows the performance of our CRCFP model compared to supervised and semi-supervised methods on all metrics for the BCSS dataset. In particular, when 1/8 of the training centres was used, our method performs ∼6% better than the supervised method in terms of mIoU and ∼3% better than the recent CAC [40]. Similarly, it is worth noting that with 1/4 of the total centres, the CRCFP performance is comparable to the fully supervised method trained on all data. On the other hand, the poor performance of CCT [39] can be attributed to the heavy perturbations applied directly to the features, which bring perturbed features from different contexts closer without pushing dissimilar ones apart, whereas CAC [40] not only brings them closer but also pushes away features from different classes. However, CAC focuses more on encoder feature generalisation, leaving the pixel classifier with less confident features. Figure 5 shows a visual comparison of CRCFP with the SOTA algorithms, where it can be observed that the prediction maps of CRCFP are better than the rest, especially in the areas highlighted by the dashed red boxes. On the MoNuSeg dataset (Table 2), CRCFP also outperforms the compared methods, with a smaller standard deviation of 0.22. It can also be observed from the table that fully supervised models are more susceptible to the domain generalisation problem: with 1/32 of the training images, the performance of DeepLab-v3 [17] is 4% better than with 1/16 of the training images, even though more data is available in the latter. This is due to the fact that, in a random sampling of training images, some training images are better indicators of the testing distribution due to similarities in stain, organ and tumour stage. However, most of the SOTA semi-supervised algorithms alleviate this issue with the help of unlabelled data, as the performance of all these methods increases with the amount of data. Figure 6 shows a visual comparison of CRCFP with SOTA methods, where it can be seen that our approach predicts fewer false positives compared to CDCL [43].
Further, in order to validate the contribution of each component (i.e., context-aware consistency, cross-consistency training and entropy minimisation), we conducted an extensive ablation study. The ablation study was performed on the BCSS dataset due to its complexity and multi-class nature, where we studied the effect of using all data proportions for different encoders and when stripping down the framework. When studying the effect of the number of negative samples and the number of auxiliary pixel classifiers, we used the 1/8 data proportion.

Encoder
To verify the performance boost obtained by plugging a bigger encoder into the base segmentation network, we replaced ResNet-50 with ResNet-101 for all data proportions. Table 3 shows the performance of the proposed CRCFP framework with the bigger encoder: there is an overall performance boost for most of the methods, especially for CCT [39]. However, CRCFP with the smaller encoder (i.e., ResNet-50) still performs comparably to or better than other SOTA techniques with the bigger encoder, e.g., at the 1/8 proportion CAC [40] with ResNet-101 achieves an mIoU of 46.91 whereas CRCFP with ResNet-50 achieves an mIoU of 47.09, showing the superiority of our proposed method. It is also worth mentioning that with ResNet-101 the standard deviation observed with ResNet-50 was reduced, suggesting that bigger encoders are more stable in semi-supervised learning frameworks. Overall, the CRCFP framework provides improved and stable performance with bigger encoders compared to the other methods.

Network Schemes
We validated the contribution of each component by breaking down the whole framework with respect to the different losses, referring to each variant as a network scheme. We started with a baseline segmentation network, i.e., DeepLab-v3 with ResNet-50, as SupOnly; Scheme.1 adds the context-aware consistency loss; Scheme.2 uses the context-aware consistency loss with entropy minimisation; and finally Scheme.3 is our proposed framework with the context-aware consistency loss, cross-consistency training and entropy minimisation. Table 4 lists the schemes with their respective losses; it can be seen that the addition of each component improves overall performance. For example, in the 1/8 data proportion, the addition of context-aware consistency brings about 4% of improvement, entropy minimisation further bumps it up by 1%, and finally cross-consistency training brings about 2% of improvement, accumulating to an overall gain of ∼7% over the baseline supervised model. For the other data proportions, the performance boost from Scheme.2 and Scheme.3 over Scheme.1 is less significant. However, it is worth mentioning that the standard deviation of Scheme.2 and Scheme.3 is smaller than that of Scheme.1, since these schemes increase the confidence of the prediction maps, improving overall performance with stability.
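The entropy minimisation and consistency terms added in Scheme.2 and Scheme.3 can be sketched as follows. This is a minimal NumPy illustration, not the training code; the loss weights `w_ent` and `w_cons` and the toy prediction maps are hypothetical placeholders, and the consistency term here uses a simple mean squared error between softmax outputs.

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entropy_loss(logits):
    """L_ent: mean per-pixel Shannon entropy of the softmax prediction map.
    Minimising it sharpens the predictions on unlabelled data."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-8)).sum(axis=-1).mean())

def consistency_loss(main_logits, aux_logits):
    """L_cons-style term: mean squared error between the main and an
    auxiliary pixel classifier's softmax outputs."""
    return float(((softmax(main_logits) - softmax(aux_logits)) ** 2).mean())

# Toy prediction maps: (H, W, num_classes) logits.
rng = np.random.default_rng(0)
main = rng.normal(size=(8, 8, 4))
aux = main + 0.1 * rng.normal(size=(8, 8, 4))  # perturbed branch

# Unlabelled part of the Scheme.3 objective (weights are illustrative).
w_ent, w_cons = 0.1, 1.0
unsup = w_ent * entropy_loss(main) + w_cons * consistency_loss(main, aux)
print(unsup)
```

A uniform prediction map attains the maximum entropy log(num_classes), so driving `entropy_loss` down pushes each pixel towards a confident class assignment.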

Negative Samples
Increasing the number of negative samples when training a contrastive learning framework typically boosts the performance of the underlying model. This is mostly done by increasing the batch size to 2048 or 4096 where possible, as a bigger batch size provides more samples for comparison [65,84]. Where this is not possible, a workaround is to use a memory bank in which negative samples from previous batches are stored for later use. Therefore, in order to find the upper bound of performance with respect to negative samples, we experimented with different numbers of negative samples, as seen in Table 5. With increasing negative samples, the performance first increases, then plateaus and grows only marginally, as can also be observed visually in Figure 7. This may be because there are not many more variations to cover in the training set with additional negative samples. Also, due to the gradient checkpointing functionality in PyTorch, adding more negative samples does not drastically affect training efficiency, but it does consume more compute time and memory. Hence, based on these observations, we set the number of negative samples to 1200 for this study, given its memory vs accuracy trade-off.
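To make concrete how negative samples enter a pixel-wise contrastive objective, the sketch below computes an InfoNCE-style loss for one anchor pixel feature against a bank of negatives. It is a simplified stand-in for the contrastive term, not the paper's exact L cont; the feature dimension and temperature are assumptions.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor pixel feature.
    anchor, positive: (d,) L2-normalised features of the same pixel seen
    in the two overlapping crops; negatives: (n, d) bank of negatives."""
    pos = np.exp(anchor @ positive / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return float(-np.log(pos / (pos + neg)))

def normalise(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d = 16
a = normalise(rng.normal(size=d))
p = normalise(a + 0.05 * rng.normal(size=d))  # same pixel, other context

small_bank = normalise(rng.normal(size=(100, d)))
large_bank = np.concatenate([small_bank,
                             normalise(rng.normal(size=(1100, d)))])

# A larger bank adds terms to the denominator, making the task harder
# (the loss value for the same positive pair grows with more negatives).
print(info_nce(a, p, small_bank), info_nce(a, p, large_bank))
```

The monotone growth of the denominator with the bank size is why larger banks give a stronger training signal, up to the plateau observed in Table 5.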

Auxiliary Pixel Classifier
To study the effect of varying the number of auxiliary pixel classifiers per perturbation, we conducted experiments with K ∈ {1, 2, 4, 6, 8, 10}, as seen in Table 6. Increasing the number of pixel classifiers per perturbation improves performance, but the upper bound is reached at K = 4, after which the performance drops slightly, as can be observed in Figure 8. Increasing the number of perturbations further results in more aggressive penalisation of the model overall, since it accumulates to K × 3 losses, which can deviate the model from learning meaningful representations. Based on this observation, we set K = 4 for the rest of the comparisons on both datasets.
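The accumulation of K × 3 losses described above can be sketched as follows. The perturbations, the linear pixel classifiers and all dimensions are toy stand-ins assumed for illustration only; they show how the loss count grows with K, not the paper's exact decoders.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturb_noise(z):   # additive Gaussian noise on the features
    return z + 0.1 * rng.normal(size=z.shape)

def perturb_drop(z):    # random feature dropout
    return z * (rng.random(z.shape) > 0.3)

def perturb_scale(z):   # random channel re-scaling
    return z * rng.uniform(0.8, 1.2, size=z.shape)

PERTURBATIONS = [perturb_noise, perturb_drop, perturb_scale]

def cross_consistency(z, main_pred, aux_weights):
    """Sum of K x 3 consistency losses: each of the K auxiliary (toy
    linear) classifiers sees each of the 3 feature perturbations and
    must match the main prediction via mean squared error."""
    total = 0.0
    for w in aux_weights:                 # K auxiliary classifiers
        for perturb in PERTURBATIONS:     # 3 feature perturbations
            total += ((perturb(z) @ w - main_pred) ** 2).mean()
    return total

K, d, c = 4, 16, 5
z = rng.normal(size=(8 * 8, d))           # encoder features for 8x8 pixels
w_main = rng.normal(size=(d, c))
main_pred = z @ w_main
aux_weights = [w_main + 0.01 * rng.normal(size=(d, c)) for _ in range(K)]

loss = cross_consistency(z, main_pred, aux_weights)
print(loss)
```

With K = 4 this sums 12 penalty terms; pushing K higher multiplies the penalty further, which matches the over-regularisation effect noted above.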

Discussion
Interpretable features from histology slides can be extracted by segmenting objects/structures from ROIs, e.g., nuclei, glands, stroma, tumours, etc. Interpretable features can enable the discovery of novel digital biomarkers with explanations for histology images for hard tasks like survival analysis [85,10] and mutation prediction [86,87,8]. Therefore, good quality and precise segmentation of regions of interest is vital for downstream tasks. For this purpose, utilising unlabelled data for representation learning not only improves performance but also improves the internal representations for better learning. The qualitative and quantitative results, along with the ablation study, have shown the superior performance of our proposed CRCFP with respect to other SOTA methods. However, it is worth exploring the internal representations of the learned models (i.e., feature embeddings) to account for (1) consistency in feature space and (2) the cluster assumption, in order to validate the aforementioned claims from the introduction section.

Feature Space Visualisation
In order to observe the consistency in feature space, feature embeddings were extracted from our SSL-based CRCFP trained on 1/2 of the training data and from DeepLab-v3 trained on all data (i.e., fully supervised), since they achieved the same performance. Extracted feature maps were upsampled to match the size of the input image (i.e., 320 × 320) and then mapped to lower dimensions using UMAP [88] for visualisation purposes. It can be seen in Figure 9 that the feature embedding distributions are consistent across varying contexts, especially in the 1st and 2nd columns for our CRCFP model, as compared to the fully supervised one. Similar behaviour can be observed in the other examples, where the varying context is inherent due to the sequential overlap in the patch tessellation process, whereas the fully supervised model is susceptible to perturbations in contextual cues. It is worth noting the last two columns, where the shape and orientation of the feature embedding distribution change for the same sample points from the same class, especially the ones shown as yellow dots, as compared to our proposed framework, where the distributions remain almost consistent under these perturbations.

Cluster Assumption
Consistency regularisation based methods work on the basis of the cluster assumption and have achieved SOTA results in semi-supervised classification and segmentation. The main idea is that samples close to each other are likely to share the same label, forming high density regions with a low average distance, while class boundaries are likely to be aligned with low density regions, i.e., a high average distance. In order to observe the cluster assumption, feature embeddings were extracted from CRCFP and compared against the RGB colour space, as shown in Figure 10. Extracted feature maps were upsampled to match the size of the input image, and then the average Euclidean distance between each patch of size 21 × 21 and its 4 immediate spatial neighbours (left, right, top and bottom) was calculated. It can be seen in Figure 10(d) that the class boundaries are much more aligned and apparent in the feature space than in the colour space, where the boundaries do not align well, thus violating the cluster assumption. This can be attributed to the fact that CNNs at higher layers tend to learn more semantic features on top of the basic low-level features. Interestingly, the background/fat, represented in white in the input images, largely forms high density regions because there is little change in colour values in that area, while the rest of the tissue is not very homogeneous in pixel values due to the presence of cells of various shapes and sizes.
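The patch-based density measure used in Figure 10 can be sketched as follows; a minimal NumPy version under the assumption that the input is an RGB image (or an upsampled feature map) and that the 21 × 21 patch size from the paper is used. The toy image below is a hypothetical example with a flat region and a textured region.

```python
import numpy as np

def patch(img, cy, cx, half=10):
    return img[cy - half:cy + half + 1, cx - half:cx + half + 1]

def avg_neighbour_distance(img, cy, cx, half=10):
    """Average L2 distance between the 21x21 patch centred at (cy, cx)
    and its four immediate neighbours (left, right, top, bottom),
    each shifted by one patch width. A low value means the local
    region is homogeneous, i.e., a high-density region."""
    c = patch(img, cy, cx, half)
    step = 2 * half + 1
    shifts = [(-step, 0), (step, 0), (0, -step), (0, step)]
    dists = [np.linalg.norm(c - patch(img, cy + dy, cx + dx, half))
             for dy, dx in shifts]
    return float(np.mean(dists))

# Toy image: homogeneous left half (background/fat), textured right half.
rng = np.random.default_rng(3)
img = np.zeros((96, 192, 3))
img[:, 96:] = rng.random((96, 96, 3))

homog = avg_neighbour_distance(img, 48, 48)   # inside the flat region
bound = avg_neighbour_distance(img, 48, 96)   # across the boundary
print(homog, bound)
```

Running this over every patch centre and colouring low values dark produces a map like Figure 10(c,d), where boundaries between regions show up as high average distance.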

Conclusions
In this work, we have presented a novel consistency-based semi-supervised semantic segmentation framework for region and nuclei segmentation in histology images. The proposed method is invariant to varying contexts and perturbations, making it efficient and robust for semantic segmentation tasks. We have shown that context-aware consistency learning can exploit unlabelled images efficiently with the help of cross-consistency training and entropy minimisation. Extensive experiments on two publicly available large histopathology datasets have shown the superiority of the CRCFP framework, achieving new SOTA results for semi-supervised semantic segmentation. Furthermore, detailed ablation studies of different network parameters and components show the contribution of each network component, demonstrating the effectiveness of our method. Future directions include improving the context-aware loss for minor classes and finding histology-specific perturbations.

Figure 1 :
Figure 1: (1st column) Example images from histological (BCSS) and natural (PASCAL VOC 2012) datasets; (2nd column) Respective masks showing the foreground and background objects with boundaries; (3rd column) Average Euclidean distance L2 between the central patch of size 21 × 21 and the four overlapping patches in its immediate neighbourhood, in RGB colour space. Note that the darker blue colour represents the low density regions corresponding to a high average distance.

Figure 2 :
Figure 2: (a) Images from the BCSS dataset with overlapping regions cropped sequentially from the same image to mimic changing contexts; (b) UMAP visualisations of feature embedding distributions extracted from a fully supervised model; (c) UMAP visualisations of feature embedding distributions extracted from our semi-supervised model. Note that the feature embeddings are represented in the same UMAP space, where dots with the same colour represent feature embeddings from the same class.

Figure 5 :
Figure 5: Visual comparison of the CRCFP with different state-of-the-art methods for tissue region segmentation with 1/2 training data only. The dashed red boxes highlight the superior performance of our method as compared to SOTA methods.

Figure 6 :
Figure 6: Visual comparison of the CRCFP with different state-of-the-art techniques in nuclei image segmentation with 1/8 training data only. GT represents the ground truth nuclei masks, and SupOnly shows the models trained with labelled training data only. Red pixels correspond to the ground truth while green shows the prediction. Yellow pixels represent the overlap regions between the prediction and ground truth.

Figure 7 :
Figure 7: Performance graph with respect to the varying number of negative samples used while training the L cont loss with the BCSS data split of 1/8.

Figure 8 :

Figure 9 :
Figure 9: (a) BCSS dataset images with overlapping regions cropped sequentially from the same image to mimic changing contexts. (b) UMAP visualisations of feature embedding distributions extracted from a fully supervised model. (c) UMAP visualisations of feature embedding distributions extracted from a semi-supervised model. Note that the feature embeddings are represented in the same UMAP space, where dots with the same colour represent feature embeddings from the same class.

Figure 10 :
Figure 10: (a) Example images from the BCSS test dataset. (b) Respective masks showing the foreground and background pixels. (c,d) Average Euclidean distance L2 between the central patch of size 21 × 21 and the four overlapping patches in its immediate neighbourhood, in RGB colour space and feature space respectively. Note that for the feature space visualisation, encoder embeddings were upsampled to the input size. The darker blue colour represents the low density regions corresponding to a high average distance.

Table 1 :
Comparison of the state-of-the-art methods with mIoU, dice score and accuracy aggregated over 3 different random seeds as mean (standard deviation). The first column represents the fraction of data used for training the model.

Table 2 shows the performance of CRCFP surpassing other SOTA methods in all data proportions and metrics, especially in the 1/32 proportion of the MoNuSeg dataset. It can be seen that our CRCFP outperforms CAC [40] by 4.32% in mIoU.

Table 2 :
Comparison of the state-of-the-art methods with mIoU, dice score and accuracy aggregated over 3 different random seeds as mean (standard deviation). The first column represents the fraction of data used for training the model.

Table 3 :
Comparison of the state-of-the-art methods on the mean (standard deviation) of mean intersection over union (mIoU), dice score and accuracy with the baseline encoder as ResNet-101. The first column represents the fraction of data used for training the model.

Table 4 :
CRCFP breakdown into different schemes with respect to their loss functions. SupOnly corresponds to the baseline segmentation model with the L sup loss only. Scheme.1 corresponds to the addition of the L cont loss on top of SupOnly, Scheme.2 to the addition of L ent on top of Scheme.1 and, finally, Scheme.3 to the addition of L cons on top of Scheme.2.
Method Split L sup L cont L ent L cons mIoU

Table 5 :
Performance of CRCFP with respect to different numbers of negative samples used while training the L cont loss with the BCSS data split of 1/8.

Table 6 :
Performance of CRCFP with respect to different numbers K of auxiliary classifiers used while training the L cons loss.