Multi-scale fully convolutional neural networks for histopathology image segmentation: from nuclear aberrations to the global tissue architecture

Histopathologic diagnosis is dependent on simultaneous information from a broad range of scales, ranging from nuclear aberrations ($\approx \mathcal{O}(0.1 \mu m)$) over cellular structures ($\approx \mathcal{O}(10\mu m)$) to the global tissue architecture ($\gtrapprox \mathcal{O}(1 mm)$). Bearing in mind which information is employed by human pathologists, we introduce and examine different strategies for the integration of multiple and widely separate spatial scales into common U-Net-based architectures. Based on this, we present a family of new, end-to-end trainable, multi-scale multi-encoder fully-convolutional neural networks for human modus operandi-inspired computer vision in histopathology.


Introduction
If the rumour is tumour, the issue is tissue.Histopathology is the gold standard and backbone of cancer diagnosis, providing important information for various stages of the treatment process.
This includes, amongst others, the assessment of resection specimens with respect to whether the resection margins are free of tumour cells and how close to the resection margins they reach, which can be viewed as a segmentation task.In addition, individualised treatment planning strongly relies on a finegrained grading of precursor and differently malignant cancer lesions, which can be understood as multi-class segmentation and/or classification problems.
Human pathologists meet these challenges with the help of highly specialised grading systems for all kinds of cancer and cancer precursors.Even though these grading systems vary for different kinds of cancer, most of them rely on a combination of features such as -Local alterations of the nuclear inner structure -Deformed and varying nuclear shapes or global alterations of the nuclei -Altered nucleus to stroma ratio for individual tumour cells and -Altered positions of the nuclei within their parent cells (e.g.not at the bottom anymore, as observed in many glandular tumours) -Altered cellular shapes and -Altered relative positions of neighbouring cells (e.g.not all on the same, single level but some stacked over others) -Disorganised long-range order (e.g.atypical or deformed glandular shapes) -Invasion, i.e. disrespecting global tissue order and borders between different layers As can be seen from this (in-extensive) list, diagnosis and grading of malignancy inherently involves a range of different scales.These scales span a factor of about or more than a thousand, ranging from sub-nuclear features (which lie on a spatial scale of ≈ O(0.1 µm)) over nuclear, cellular (≈ O(10 µm)), intercellular (≈ O(100 µm)) to glandular and other higher organisational features ( O(1 mm)).
The importance of the integration of information from different scales is reflected in how human pathologists approach these tasks: regions of interest are typically viewed at several different scales by turning the objective revolver of the microscopy back and forth again and again.
With its success in various computer vision tasks, deep learning methods have opened up a myriad of perspectives for computer-aided diagnoses (CADx) in histopathology.For the segmentation tasks, fully convolutional neural networks (FCNs, [9]) and, most prominently, U-Net-based architectures [12], have successfully been adopted [3,4,8].Whilst in many of these cases, generic computer vision networks have been applied to histopathology images, there are some approaches for specialised histopathology architectures, [2,7,14], and training techniques [4,15].Some of these works address the question on how additional context can be provided to the network, however, they are confined to local, same-scale context, [2,7], and/or sliding-window CNN techniques [2].
In this paper, we introduce a family of U-Net-based deep neural nets that are specifically designed for the extensive integration of largely different spatial scales.

Dataset
Throughout this work, we use the dataset provided for the PAIP 2019 challenge (part of the MICCAI 2019 Grand Challenge for Pathology) for training and evaluation.It is comprised of 50 de-identified whole-slide histopathology images from 50 patients that underwent resection for hepatocellular carcinoma (HCC) in three Korean hospitals (SNUH, SNUBH, SMG-SNU BMC) from 2005 to June 2018.The slides have been stained by hematoxylin and eosin and digitalised using an Aperio AT2 whole-slide scanner at ×20 power.
Figure 1 shows an exemplary whole-slide image from the PAIP dataset.It is worth noting that all individual cases of the given dataset do indeed include cancerous regions.All de-identified pathology images and annotations used in this research were prepared and provided by the Seoul National University Hospital under a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).

Data Processing & Training
As compared to the annotations provided for the PAIP challenge, which comprise the classes "viable tumour" and "whole tumour" (i.e.including viable tumour cells, stroma, necrosis, ...), we added "overall tissue" as an another class for improved patch balancing and enhanced training stability.The "annotations" for the "overall tissue" class are created fully automatically before training, namely by thresholding of the original images with [R, G, B] ≤ [235, 210, 235] and subsequent application of morphological opening and closing operations.This threshold is the same as the one that has been used by the challenge organisers for cleaning of the whole and viable tumour annotations.
Images are standardised by Reinhard colour normalisation [11] with respect to the tissue regions and normalised to channel-wise zero average and unit variance.
The training process is split into (short pseudo-)epochs of 1920 patches each, with all patches randomly redrawn after each individual "epoch".(For this reason, the number of patches per epoch is arbitrary.)This allows for a close inspection of the training process, first, and, as compared to the use of a fixed number of pre-selected patches, it effectively reduces overfitting and/or resource requirements when facing enormously large images as in histopathology.For each individual "epoch", the patches are balanced with respect to both the individual cases in the training set and the available classes (background, overall tissue, whole tumour, viable tumour).
Our models are implemented using PyTorch [10].Training is conducted by use of an in-house developed training pipeline for generic multi-modal medical image deep learning.
Optimisation is performed using Adam [6] employing a learning rate of 10 −3 and a learning rate decay with γ = 0.5 every 57,600 iterations (30 epochs).
During training, we employ online data augmentation including the following operations: rotation, flip along the horizontal axis, saturation and brightness transformation.All models are trained for 228,480 iterations (120 epochs), which is found to be well above what is needed for convergence for all of our models.
We choose the binary cross entropy (BCE) loss as a common and widely used loss function for this study.In order to balance for class inhomogeneities and draw attention to the tumour regions, we use class-weights of [0, 1, 2, 6] for background, overall tissue, whole tumour and viable tumour, respectively.

Evaluation & Metrics
We have chosen the task of viable tumour cell segmentation, such as presented in task 1 of the PAIP 2019 challenge, as an example task that may profit from extensive multi-scale integration.In addition to task 1 of that challenge, we also examine the predictions for the "whole tumour" class.The segmentation of the viable tumour cells, however, remains our primary interest, also due to the fact that the respective annotations appear to be very well consistent.
To this end, we train and evaluate all of the models described in the forthcoming on the PAIP 2019 challenge dataset using a 5-fold cross-validation (CV) split.We examine both the BCE loss on the validation dataset, as well as the Jaccard index which has been used for assessment of the PAIP 2019 task 1 and provides an intuitive parameter for segmentation quality.
Validation is conducted at the following steps (number of iterations): 0, 3,840, 7,680, 13,440, 19,200, 28,800, 38,400, 57,600, 76,800, 96,000, 115,200, 134,400, 153,600, 172,800, 192,000, 211,200, 228,400.At each such step, four 3072 × 3072 px-sized sub-images per case are evaluated (totalling 40 validation sub-images from 10 cases per split).The positions of these sub-images are randomly chosen once, such that for any validation case, all four classes are represented in at least one of the sub-images.The sub-image positions are kept fixed for each individual split, such that all models for a given split are evaluated with respect to the same sub-images.
We evaluate all models with respect to their underlying models, i.e. all models are compared with the baseline U-Net.In addition, msYI -Net and msY 2 -Net are compared with the msY-Net they are built upon.As in all of these comparisons, one model can reproduce the other, a one-sided paired-sample t-test is used.In order to correct for multiple testing, we use the BenjaminiHochberg procedure.

Baseline method
Beyond the still popular sliding-window CNN-based techniques, U-Net-based FCN architectures form the de facto standard in histopathology image segmentation.Therefore, we have chosen a ResNet18-based U-Net [5] as baseline for our studies.An implementation of this architecture can be found in [1].The ResNet18, which forms the encoder of this architecture (cf.figure 2), has been pre-trained on the ImageNet dataset [13].Four our study, the baseline model is trained at full-resolution patches of the size 512 × 512.

msY-Net -Integrating local tissue architecture
Whole-slide images (WSI) are commonly saved as pyramid representations that underlie the common WSI formats such as *.mrxs, *.czi, *.svs and others.Therefore, when working with WSI images, retrieval of multiple scales and of very broad context of a given patch comes with a moderate overhead only.
Fig. 2. ResNet18-based U-Net architecture (baseline model).For each block the spacial image shape as well as the number of channels are given.Here, C f i ,fo m×n denotes a single m × n convolution with fi input and fo output feature maps, followed by ReLU activation.C f m×n l shall represent l consecutive m × n convolutions with f feature input and output maps, each followed by a ReLU activation function.For the encoding part, the blocks (C f res ) of a ResNet18 are used where f denotes number of the respective output feature maps.Each individual block introduces a spatial downscaling by a factor of 2, either through max pooling or strided convolutions.The decoder uses m × n bilinear upsampling (Um×n) to enlarge the spacial dimensions. ) of the n-times down-scaled context path is performed followed by n × n bilinear upsampling (Un×n), where n = 4 if n = 4 and n = 8 if n = 16.For the case n = n, another centre cropping with 16 × 16 is conducted.Both, now spatially consistent, paths are then merged by concatenation (⊕).Finally, the number of feature maps is reduced to the original number by a 1 × 1 convolution.Cf. section 2.5 for a detailed description of the rationale.
In order to provide the network with local architectural information (cf.figure 4 and the considerations in section 1), we construct a multi-scale Y-Net (msY-Net) as a dual-scale dual-encoder network building upon the baseline Res-U-Net architecture, cf.figure 3.
The msY-Net is provided with two input patches of the scales 1 and 4 (which correspond to the inner two rectangles in figure 4).The former, the full-resolution patch, is fed into the standard U-Net architecture.The latter is passed through a separate but analogous encoder architecture ("context encoder") built from another ResNet18.As the skip connections in the U-Net are for helping the decoder re-localise the high-level features and only the full-resolution patch is the one that needs to be segmented, the context encoder does not have any skip connections to the decoder.
The information from the context encoder is provided to the underlying U-Net at its bottleneck by concatenation.At this point, however, it is crucial to concatenate both paths in a way such that their spatial scales agree.This means that just before concatenation, the context-encoder high-level feature maps are cropped to the region that spatially corresponds to the bottleneck feature maps from the full-resolution encoder.The cropped context-encoder high-level feature maps are then bilinearly upsampled by a factor of 4 and concatenated to the bottleneck feature maps from the full-resolution encoder.Before being fed to the decoder, the resulting 2 × 512 = 1024 feature maps are reduced to 512 feature maps by 1 × 1 convolution.This operation shall learn which of the feature maps from the two paths are relevant and how they need to be related to each other.These steps form the "multi-scale merge block" that is at the heart of the msY-Net and figure 3.

msYI -Net and msY 2 -Net -Integration of larger context
In order to provide the model with large scale context information, cf.figure 4, we construct two models that have another large-context encoder for patches of the scale 16.
We examine two different ways of how to add this information to the underlying msY-Net: First, we add the large-context encoder in the same way the first context-encoder of the msY-Net has been added before, i.e. through another multi-scale merge block following the first multi-scale merge block.We refer to this model as msY 2 -Net.
For the second model, we keep the large-context encoder separate, parallel, so to say, to the underlying msY-Net.The I in msYI -Net shall refer to the large-context encoder that parallels the underlying msY-Net.The large-context encoder consists of another ResNet18 that is used for classification of the overall content of the full-resolution patch.Hence, the network has two outputs, the segmentation of the full-resolution patch from its msY-Net part and the classification of the full-resolution patch contents from the large-context encoder, its I -part.In addition, the classification output from the I -part is concatenated to the msY-Net logits just before the final output of the network and merged by a 1 × 1 convolution, so that the msY-Net-part can use the large-context classification information for adjusting its final predictions.For training of this model, however, both the segmentation and the classification output are treated separately and fed to a segmentation and a classification loss function, respectively.The model is then optimised with respect to the arithmetic mean of the two losses.For both loss functions, binary cross entropy losses are used.
In the msY-Net and the msY 2 -Net architectures, spatial correspondence between the full resolution encoder and the context encoder(s) is enforced in the multi-scale merge block (cf.figure 3).It should be noted that for the largecontext encoder in the msYI -Net, there is no such requirement.Therefore, this model can, in principle, be fed with large-context patches of entirely arbitrary scales.

Validation loss
All of the models that we have introduced above are trained and evaluated on fixed 5-fold splits of the dataset.The best validation loss that is individually reached is shown in table 3.1.For the msYI -Net, only the segmentation part of the loss is given.

Jaccard metrics
Table 3.2 shows the best Jaccard metric values that are reached for each individual model and CV split.As it can be seen, all the context-enhanced models perform significantly better than the baseline U-Net.This also holds when the models that are best converged with respect to the BCE loss are evaluated for their mean Jaccard index on the viable and the whole Fig. 4. Input patches to the different models under consideration, shown in an illustrative region of the whole-slide image depicted in figure 1.The innermost rectangular corresponds to the input with which a U-Net is provided that is trained on full-resolution patches of the size 512 × 512.The innermost plus the next larger rectangular is the input to the msY-Net model described in section 2.5.As can be seen, this scale contains information on how the cells are organised in strands and "trabeculae" (or whether the cells violate these patterns) that are impossible to deduce using the innermost patch alone.In this sense, it adds "architectural" information.The msYI -Net and the msY 2 -Net can make use of the inputs represented by all the three rectancles.The outermost rectangle contains information on the large-scale organisation of the tissue, such as the pseudocapsule, for instance, that is typical for hepatocellular carcinoma.The scale bar is 2 mm.tumour classes, cf.table 3.2.In addition, these results suggest that there may be an additional benefit from the large-context encoder as compared to the msY-Net context encoder alone.Though, this difference is statistically significant only for one of the two three-arm models, namely the msY 2 -Net.
Further studies and results in favour of the understanding that there is a benefit from the context-encoder over the baseline U-Net and that there might be an additional benefit from another large-context encoder can be found in the Supplementary Materials.

Conclusion
Using the segmentation of hepatocellular carcinoma in hematoxylin-eosin stained whole-slide images of the liver as an example task, our results show that the extensive integration of widely different spatial scales, as a "mimicry" of how humans approach analogous tasks, can benefit U-Net-based architectures.As the detailed structure of the encoder and the decoder are left entirely untouched, the approach presented herein can be seamlessly integrated into various encoderdecoder models.This study has important limitations: First of all, it is restricted to one particular task and organ and disease entity.It remains to future work to examine how these findings generalise to other tasks, organs and types of cancer or even other diseases.In addition, further work is needed to identify the optimal approach for connecting (particularly) the large-context encoder to the base network.
This proof-of-principle study advocates the benefit of extensive multi-scaling in histopathology deep learning.It goes without saying that future research is likely to optimise the models presented herein much further.

Fig. 1 .
Fig.1.An exemplary case from the PAIP dataset.The tissue sample is stained by hematoxylin and eosin.A hepatocalluar carcinoma (HCC) can easily be identified in the bottom left corner of the image.The so-called "pseudocapsule" around it is a typical, though not obligatory, feature of HCC.

Table 1 .
Best validation BCE loss per split and model.A denotes statistical significance at the level of 0.05 (cf.section 2.3).Results significantly above those of the baseline model are marked bold.

Table 2 .
Best Jaccard index for the viable tumour area.A ( ) denotes statistical significance at the level of 0.05 (0.005).Results significantly above those of the baseline model are marked bold.

Table 3 .
Class-average of the Jaccard index for whole and viable tumour, at best validation loss.A ( ) denotes statistical significance at the level of 0.05 (0.005).Results significantly above those of the baseline model are marked bold.