HookNet: multi-resolution convolutional neural networks for semantic segmentation in histopathology whole-slide images

We propose HookNet, a semantic segmentation model for histopathology whole-slide images, which combines context and details via multiple branches of encoder-decoder convolutional neural networks. Concentric patches at multiple resolutions with different fields of view are used to feed different branches of HookNet, and intermediate representations are combined via a hooking mechanism. We describe a framework to design and train HookNet for achieving high-resolution semantic segmentation and introduce constraints to guarantee pixel-wise alignment in feature maps during hooking. We show the advantages of using HookNet in two histopathology image segmentation tasks where tissue type prediction accuracy strongly depends on contextual information, namely (1) multi-class tissue segmentation in breast cancer and (2) segmentation of tertiary lymphoid structures and germinal centers in lung cancer. We show the superiority of HookNet when compared with single-resolution U-Net models working at different resolutions as well as with a recently published multi-resolution model for histopathology image segmentation.


Introduction
Semantic image segmentation is the separation of concepts by grouping pixels belonging to the same concept, with the aim of simplifying image representation and understanding. In medical imaging, tumor detection and segmentation are necessary steps for diagnosis and disease characterization. This is especially relevant in histopathology, where tissue samples with a wide variety and amount of cells within a specific context have to be analyzed by pathologists for diagnostic purposes.
The introduction of high-resolution and high-throughput digital scanners has de facto revolutionized the field of pathology by digitizing tissue samples and producing gigapixel whole-slide images (WSI). In this context, the digital nature of WSIs makes it possible to use computer algorithms for automated histopathology image segmentation, which can be a valuable diagnostic tool for pathologists to identify and characterize different types of tissue, including cancer.

Context and details in histopathology
It has long been known that although individual cancer cells may share morphological characteristics, the way they grow into specific patterns makes a profound difference in the prognosis of the patient. As an example, in hematoxylin and eosin (H&E) stained breast tissue samples, different histological types of breast cancer can be distinguished. For instance, an invasive tumor that originates in the breast duct (invasive ductal carcinoma, IDC) can show a wide variety in growth patterns. In contrast, an invasive tumor that originates in the breast lobules (invasive lobular carcinoma, ILC) is characterized by individually arranged tumor cells. Furthermore, the same type of ductal carcinoma cells can be confined within the breast duct (ductal carcinoma in situ, DCIS) or become invasive by spreading outside the duct (IDC) (see Figure 1) (Lakhani (2012)). To differentiate between these types of cancer, pathologists typically combine observations. For example, they look at the global architectural composition of the tissue sample and analyze the context of each tissue component, including cancer, to identify the presence of ducts (both healthy and potentially cancerous) and other tissue structures. Additionally, they zoom in on each region of interest, where the tissue is examined at high resolution, to obtain the details of the cancer cells and characterize the tumor based on its local cellular composition. Another example where pathologists take advantage of both context and details is the spatial distribution of immune cells, which may be detected in the presence of inflammation inside the tumor or the stromal compartment of the cancer regions, as well as in specific clustered groups called tertiary lymphoid structures (TLS), which may develop in response to cancer. In subsequent stages of TLS maturation, germinal centers (GC) are formed within the TLS (see Figure 1). It has been shown that the development of GC-containing TLS has significant relevance for patient survival and is an essential factor for the understanding of tumor development and treatment (Sautès-Fridman et al. (2016), Siliņa et al. (2018)).
A GC always lies within a TLS. TLS contain a high density of lymphocytes with poorly visible cytoplasm, while GCs instead share similarities with other less dense tissues like tumor nests. To identify the TLS region and to differentiate between TLS and GC, both fine-grained details and contextual information are needed.

The receptive field and the field of view
In recent years, the vast majority of state-of-the-art image analysis algorithms have been based on convolutional neural networks (CNN), a deep learning model that can tackle several computer vision tasks, including semantic segmentation (Long et al. (2015), Jégou et al. (2017), Chen et al. (2018)). In semantic segmentation, label prediction at pixel level depends on the receptive field, which is the extent of the area of the input that is observable by a model. The size of the receptive field of a CNN depends on the filter size, the pooling factor, and the number of convolutional and pooling layers. By increasing these parameters, the receptive field also increases, allowing the model to capture more contextual information. However, this often comes at the cost of an increase in the input size, which causes high memory consumption due to large feature maps. As a consequence, a number of implicit restrictions in model optimization often have to be applied, such as a reduction of the number of model parameters, the number of feature maps, the mini-batch size, or the size of the predicted output, which may result in ineffective training and inefficient inference.
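As a rough illustration of how filter size, pooling factor, and depth drive the receptive field, the sketch below applies the standard recurrence for a stack of convolution and pooling layers. The function names and the layer layout (two 3x3 convolutions followed by 2x2 pooling per level, plus a two-convolution bottleneck) are assumptions made for illustration, not a specification of any model in this paper.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def encoder_layers(depth):
    """Hypothetical U-Net-like encoder: per level two 3x3 convs and one 2x2 pooling."""
    layers = []
    for _ in range(depth):
        layers += [(3, 1), (3, 1), (2, 2)]
    layers += [(3, 1), (3, 1)]  # bottleneck convolutions
    return layers


if __name__ == "__main__":
    for depth in range(5):
        print(depth, receptive_field(encoder_layers(depth)))
```

Under these assumptions, each additional pooling level more than doubles the receptive field, which is why deeper encoders see more context but also need larger inputs.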
Another aspect concerning the observable information is the field of view (FoV), which is the distance over the area (i.e., the actual space that the pixels disclose) in an input image and depends on the spatial resolution of the input image. The FoV has implications for the receptive field: the same model, the same input size, and the same receptive field can comprise a wider FoV by considering an image at lower resolution due to the compressed distance over the area (i.e., fewer pixels disclose the same FoV of the original resolution). Thereby, using a down-sampled representation of the original input image, the model can benefit from more contextual aggregation (Graham and Rajpoot (2018)), at the cost of losing high-resolution details. Furthermore, contextual aggregation is limited by the input dimensions, meaning that a receptive field size can only exceed the input dimensions if padded artificial input pixels are used (a technique usually referred to as the use of same padding), which do not contain contextual information. While reducing the original input dimensions can be used to focus on scale information (Kausar et al. (2018), Li et al. (2018)), the potential contextual information remains unchanged.

Multi-field-of-view multi-resolution patches
Whole-slide images (WSI) are pyramidal data structures containing multi-resolution gigapixel images, including down-sampled representations of the original image. In the context of CNN model development, it is not possible to capture a complete WSI at full resolution in the receptive field, because the billions of pixels in a WSI exceed the memory capacity of a single modern GPU of the kind usually used to train CNN models. A common way to overcome this limitation is to train a CNN with patches (i.e., sub-regions from a WSI). Due to the multi-resolution nature of WSIs, the patches can originate from different spatial resolutions, which are expressed in micrometers per pixel (µm/px). A patch is extracted by selecting a position within the WSI, together with a size and a particular resolution. When extracting a patch at the highest available resolution, the potential contextual information is not yet depleted because there is available tissue around the patch that is not considered in the receptive field. By extracting a concentric (i.e., centered on the same location of the whole-slide image) patch with the same size but lower resolution, the same receptive field aggregates more contextual information and includes information that was not available before. Multiple concentric same-sized patches, extracted at different resolutions, can be interpreted as having multiple FoVs. Hence we call a set of these patches multi-field-of-view multi-resolution (MFMR) patches (see Figure 1).
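The relation between patch size, resolution, and field of view can be made concrete with a small sketch. The function names, the center coordinate, and the base resolution of 0.25 µm/px are hypothetical values chosen to keep the arithmetic simple; the patch size and the 0.5 and 8.0 µm/px resolutions are those used later in the paper.

```python
def field_of_view_um(patch_size_px, resolution_um_per_px):
    """Side length of the tissue area covered by a square patch, in micrometers."""
    return patch_size_px * resolution_um_per_px


def concentric_patch_bounds(center_xy, patch_size_px, resolution, base_resolution):
    """Bounding box (x0, y0, x1, y1) of a patch in base-resolution pixel coordinates.

    All patches share the same center, so lower-resolution patches cover a
    larger area around the same location.
    """
    scale = resolution / base_resolution
    half = patch_size_px * scale / 2
    cx, cy = center_xy
    return (cx - half, cy - half, cx + half, cy + half)


if __name__ == "__main__":
    center = (10000, 10000)  # hypothetical location in the slide
    for res in (0.5, 8.0):
        print(res, field_of_view_um(284, res),
              concentric_patch_bounds(center, 284, res, base_resolution=0.25))
```

With the same 284-pixel patch, the field of view grows from 142 µm at 0.5 µm/px to 2272 µm at 8.0 µm/px.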
To date, research using MFMR patches extracted from histopathology images has mostly focused on combining features obtained from MFMR patches for patch classification (Alsubaie et al. (2018), Wetteland et al. (2019), Sirinukunwattana et al. (2018)). However, using patch classification for the purpose of semantic segmentation results in either a coarse segmentation map or heavy computation, because a sliding window approach is needed to segment every pixel. For the task of segmentation, the use of MFMR patches is not straightforward: when combining features obtained from MFMR patches, pixel-wise alignment should be enforced when integrating them. Gu et al. (2018) proposed to use a U-Net architecture (Ronneberger et al. (2015)) for processing a high-resolution patch, and additional encoders to process lower-resolution patches. Subsequently, feature maps from the additional encoders are cropped, up-sampled, and concatenated in the decoder parts of U-Net, at the places where skip connections are concatenated as well. The feature maps from the additional encoder are up-sampled without skip connections, at the cost of localization precision. Furthermore, their proposed model concatenates feature maps at every depth in the decoder, which might be redundant and results in a model with high memory consumption. Moreover, considering the necessity of pixel-wise alignment, their model is restricted to same padding, which can introduce artifacts.
A multi-class problem where classes are known to depend on both context and fine-grained details can benefit from the combined information in a set of MFMR patches. However, this is still an unsolved problem. The challenge is to simultaneously output a high-resolution segmentation based on fine details detectable at high resolution and the incorporation of unconcealed contextual features.

Our contribution
In this paper, we introduce a multi-branch segmentation framework based on convolutional neural networks that can simultaneously incorporate contextual information and high-resolution details to produce fine-grained segmentation maps of histopathology images. The model, which we call HookNet, consists of encoder-decoder branches that allow for multi-resolution representations at different depths, which we take advantage of by concatenating relevant features between branches in the decoder parts via a hooking mechanism. We give a detailed description of how to design and use the framework. In particular, we will show how to instantiate two U-Net branches and, in the design, limit the implementation to what is possible using a single modern GPU. Furthermore, we show the performance on multi-class and multi-organ problems, including tissue subjected to high-resolution details as well as context.

Materials
In order to train and assess the performance of HookNet, we collected data for two applications in histopathology image segmentation, namely multi-class tissue segmentation in breast cancer sections, and segmentation of TLS and GC in lung cancer sections.

Breast dataset
We collected 86 breast cancer tissue sections containing IDC (n=34), DCIS (n=35) and ILC (n=17). For the DCIS and IDC cases, we used H&E stained tissue sections, which were initially made for routine diagnostics. All tissue sections were prepared according to the laboratory protocols from the Department of Pathology of Radboud University Medical Center, Nijmegen (the Netherlands). Slides were digitized using a Pannoramic P250 Flash II scanner (3DHistech, Hungary) at a spatial resolution of 0.24 µm/px. For ILC, new tissue sections were cut and stained for H&E, after which slides were scanned using the same scanner and scanning protocol as for the IDC/DCIS cases. After inspection of the WSIs, the H&E stained ILC sections were de-stained and subsequently restained using P120 catenin antibody (P120) (Canas-Marques and Schnitt (2016)), which shows a cytoplasmic staining pattern in lobular carcinoma cells, rather than the membranous staining pattern seen in normal epithelial cells. P120 stained sections were subsequently scanned using the same scanner and protocol as the H&E sections (Brand et al. (2014)). This procedure allowed us to have both H&E and immunohistochemistry (IHC) of the same tissue section. Three people were involved in the creation of manual annotations: two medical research assistants, who had undergone a supervised training procedure in the pathology department to specifically recognize and annotate breast cancer tissue in histopathology slides, and a resident pathologist (MB), with six years of experience in diagnostics and research in digital pathology. To guide the procedure of annotating ILC, the resident pathologist visually identified and annotated the region containing the bulk of tumor in the H&E slide. Subsequently, the research assistants used this information alongside the available IHC slide to identify and annotate ILC cells. Additionally, the research assistants made annotations of DCIS, IDC, fatty tissue, benign epithelium, and an additional class of other tissue, containing inflammatory cells, skin/nipple, erythrocytes, and stroma. All annotations were finally checked by the resident pathologist and corrections were made when needed. The in-house developed open-source software ASAP (Litjens et al. (2018)) was used to make manual annotations. As a result, 6279 regions were annotated, of which 1810 contained ILC cells. Sparse annotations of tissue regions were made, meaning that drawn contours could be both non-exhaustive (i.e., not all instances of that tissue type were annotated) and non-dense (i.e., not all pixels belonging to the same instance were included in the drawn contour). Examples of sparse annotations are depicted in

Lung dataset
We randomly selected 27 diagnostic H&E-stained digital slides from The Cancer Genome Atlas lung squamous cell carcinoma (TCGA-LUSC) data collection, which is publicly available in the Genomic Data Commons (GDC) Data Portal (Grossman et al. (2016)). For this dataset, sparse annotations of TLS, GC, tumor, and other lung parenchyma were made by a senior researcher (KS) with more than six years of experience in tumor immunology and histopathology, and checked by a resident pathologist (MB). As a result, 1,098 annotations covering 4 classes were made in this dataset. For model development and performance assessment, we used 3-fold cross-validation, which allowed us to test the performance of the presented models on all available slides. All three folds contain 12:6:9 images for training:validating:testing. We made sure that all splits had an appropriate class balance.

HookNet: multi-branch encoder-decoder network
In this section we present "HookNet", a convolutional neural network model for semantic segmentation that processes concentric MFMR patches via multiple branches of encoder-decoder models and combines information from different branches via a "hooking" mechanism (see Figure 3). The aim of HookNet is to produce semantic segmentation by combining information from (1) low-resolution patches with a large field of view, which carry contextual visual information, and (2) high-resolution patches with a small field of view, which carry fine-grained visual information. For this purpose, we propose HookNet as a model that consists of two encoder-decoder branches, namely a context branch, which extracts features from input patches containing contextual information, and a target branch, which extracts fine-grained details from the highest-resolution input patches for the target segmentation. The key idea of this model is that fine-grained and contextual information can be combined by concatenating feature maps across branches, thereby resembling the process pathologists go through when zooming in and out while examining tissue.
We present the four main HookNet components following the order in which they should be designed to fulfill the constraints necessary for a seamless and accurate segmentation output, namely (1) branches architecture and properties, (2) the extraction of MFMR patches, (3) constraints of the "hooking" mechanism, and (4) the handling of targets and losses.

Context and target branches
The first step in the design of HookNet is the definition of its branches. Without loss of generality, we designed the model under the assumptions that (1) the two branches have the same architecture but do not share their weights, and (2) each branch consists of an encoder-decoder model based on the U-Net (Ronneberger et al. (2015)) architecture. As in the original U-Net model, each convolutional layer performs valid 3x3 convolutions with stride 1, followed by max-pooling layers with a 2x2 down-sampling factor. For the up-sampling path, we adopted the approach proposed in Odena et al. (2016), consisting of nearest-neighbour 2x2 up-scaling followed by convolutional layers.

MFMR input patches
The input to HookNet is a pair $(P_C, P_T)$ of $(M \times M \times 3)$ MFMR RGB concentric patches extracted at two different spatial resolutions $r_C$ and $r_T$, measured in µm/px, for the context (C) and the target (T) branch, respectively. In this way, we ensure that the field of view of $P_T$ corresponds to the central square region of size $(M \frac{r_T}{r_C} \times M \frac{r_T}{r_C} \times 3)$ of $P_C$, but at lower resolution. In order to get a seamless segmentation output and to avoid artifacts due to misalignment of feature maps in the encoder-decoder branches, specific design choices should be made on (1) the size and (2) the resolution of the input patches. First, $M$ has to be chosen such that all feature maps in the encoder path have an even size before each pooling layer. Stated initially in Ronneberger et al. (2015), this constraint is crucial for HookNet, as an unevenly sized feature map will cause misalignment of feature maps not only via skip connections but also across branches. Hence, this constraint ensures that feature maps across the two branches remain pixel-wise aligned. Second, $r_T$ and $r_C$ should be chosen in such a way that, given the branch architecture, a pair of feature maps in the decoding paths across branches comprise the same resolution. This pair is an essential requisite for the "hooking" mechanism detailed in section 3.3. In practice, given the depth $D$ (i.e., the number of pooling layers) of the encoder-decoder architecture, $r_C$ and $r_T$ should take on values such that the following inequality is true: $2^D r_T \geq r_C$.
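The two design constraints above can be checked programmatically before training. The following is a minimal sketch; the helper names are hypothetical, and it assumes two valid 3x3 convolutions per encoder level, as in the original U-Net.

```python
def is_valid_input_size(m, depth, convs_per_level=2, kernel=3):
    """True if every encoder feature map has an even size before pooling.

    Assumes valid convolutions, so each convolution removes (kernel - 1) pixels.
    """
    shrink = convs_per_level * (kernel - 1)
    size = m
    for _ in range(depth):
        size -= shrink
        if size <= 0 or size % 2 != 0:
            return False
        size //= 2  # 2x2 max-pooling
    return size - shrink > 0  # the bottleneck must still produce an output


def resolutions_compatible(r_target, r_context, depth):
    """Check the constraint 2^D * r_T >= r_C."""
    return 2 ** depth * r_target >= r_context


def central_region_size(m, r_target, r_context):
    """Side length, in context-patch pixels, of the region covered by the target patch."""
    return m * r_target / r_context


if __name__ == "__main__":
    print(is_valid_input_size(284, depth=4))
    print(resolutions_compatible(0.5, 8.0, depth=4))
    print(central_region_size(284, 0.5, 2.0))
```

For example, under these assumptions an input size of 284 passes the even-size check at depth 4, whereas 286 fails at the second level.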

Hooking mechanism
We propose to combine, i.e., hook up, information from the context branch into the target branch via the simple concatenation of feature maps extracted from the decoding paths of the two branches. Our choice of concatenation as the operation to combine feature maps is based on the success of skip connections in the original U-Net, which also use concatenation. Moreover, concatenation allows downstream layers to operate over all feature maps, and therefore to learn the optimal operation to apply during the parameter optimization procedure. To take maximum advantage of semantic encoding, the feature maps should not be concatenated before the bottleneck layer. We postulate that hooking could be best done at the beginning of the decoder in the target branch, to take maximum advantage of the inherent up-sampling in the decoding path, where the concatenated feature maps can benefit from every skip connection within the target branch. We call this concatenation "hooking", and in order to guarantee pixel-wise alignment in feature maps, we define the spatial resolution of a feature map as $\mathrm{SFR} = 2^d r$, where $d$ is the depth in the encoder-decoder model and $r$ is the resolution of the input patch measured in µm/px. To define the relative depths where the hooking can take place, we define an SFR ratio between a pair of feature maps as $\frac{\mathrm{SFR}_C}{\mathrm{SFR}_T} = \frac{2^{d_C} r_C}{2^{d_T} r_T}$, where $d_T$ and $d_C$ are the relative depths for the target and context branch, respectively. In practice, hooking can take place when feature maps from both branches comprise the same resolution: $\frac{\mathrm{SFR}_C}{\mathrm{SFR}_T} = 1$. As a result, the central square region in the feature maps of the context branch at depth $d_C$ corresponds to the feature maps of the target branch at depth $d_T$. The size of this central square region is equal to the size of the feature maps of the target branch because both feature maps comprise the same resolution. To do the actual hooking, simple cropping can be applied, such that context branch feature maps are concatenated, pixel-aligned, with feature maps in the target branch.
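The SFR condition can be turned into a small helper that finds the context depth matching a given target depth and then crops the context feature map to the target size. This is a sketch under stated assumptions: the function names are hypothetical, feature maps are represented as plain 2D lists, and the crop is exactly centered only when the size difference is even.

```python
import math


def sfr(depth, resolution):
    """Spatial resolution of a feature map: 2^d * r, in um/px."""
    return 2 ** depth * resolution


def context_hook_depth(r_target, r_context, d_target):
    """Context depth d_C for which SFR_C / SFR_T == 1, or None if none exists."""
    d = math.log2(sfr(d_target, r_target) / r_context)
    if d < 0 or d != int(d):
        return None
    return int(d)


def center_crop(feature_map, target_size):
    """Central target_size x target_size region of a square 2D list."""
    offset = (len(feature_map) - target_size) // 2
    rows = feature_map[offset:offset + target_size]
    return [row[offset:offset + target_size] for row in rows]


if __name__ == "__main__":
    # Target branch at 0.5 um/px, hooking into its bottleneck (depth 4).
    print(context_hook_depth(0.5, 2.0, d_target=4))
    print(context_hook_depth(0.5, 8.0, d_target=4))
    grid = [[4 * r + c for c in range(4)] for r in range(4)]
    print(center_crop(grid, 2))
```

With a target resolution of 0.5 µm/px and target depth 4, the helper returns context depth 2 for a 2.0 µm/px context branch and depth 0 for an 8.0 µm/px context branch.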

Targets and losses
The goal of HookNet is to predict a segmentation map based on $P_C$ and $P_T$. Therefore, it can be trained with a single loss, the target loss, computed for the output of the target branch and backpropagated. In addition, the context branch can generate predictions for the lower-resolution patch, and this context error can be used simultaneously with the target loss. For training purposes, we propose a loss function $L = \lambda L_{high} + (1 - \lambda) L_{low}$, where $L_{high}$ and $L_{low}$ are the pixel-wise categorical cross-entropy for the target and the context branch, respectively, and $\lambda$ controls the importance of each branch.
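The weighted loss can be sketched in a few lines. This is an illustrative, framework-free version with hypothetical function names; it averages the cross-entropy over a flat list of labeled pixels and is not the training code used for the experiments.

```python
import math


def pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean categorical cross-entropy over pixels.

    probs: list of per-pixel class-probability lists (softmax outputs).
    labels: list of ground-truth class indices, one per pixel.
    """
    total = -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return total / len(labels)


def hooknet_loss(l_high, l_low, lam):
    """L = lam * L_high + (1 - lam) * L_low."""
    return lam * l_high + (1 - lam) * l_low


if __name__ == "__main__":
    target_probs = [[0.9, 0.1], [0.2, 0.8]]
    context_probs = [[0.6, 0.4], [0.5, 0.5]]
    labels = [0, 1]
    l_high = pixel_cross_entropy(target_probs, labels)
    l_low = pixel_cross_entropy(context_probs, labels)
    for lam in (1.0, 0.75, 0.5, 0.25):
        print(lam, hooknet_loss(l_high, l_low, lam))
```

Setting `lam` to 1 removes the context term entirely, which corresponds to training with the target loss only.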

Pixel-based-sampling
Patches are sampled with a particular tissue type, i.e., class label, at the center location of the sampled patch. Due to the sparseness of the ground truth labels, some patches contain fewer ground truth pixels than other patches. During training, we ensured that every class label is equally represented through the following pixel-based sampling strategy. In the first mini-batch, patches are randomly sampled. In all subsequent mini-batches, patch sampling is guided by the accumulation of the ground-truth pixels for every class seen in the previous mini-batches. Classes that have a lower amount of pixel accumulation have a higher chance of being sampled, to compensate for underrepresented classes.
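One way to realize such a strategy is sketched below. The exact weighting is not specified in the text, so the inverse-count weighting used here, as well as the function and class names, are assumptions made for illustration only.

```python
import random


def class_sampling_weights(pixel_counts):
    """Sampling probability per class from accumulated ground-truth pixel counts.

    Assumed scheme: weights are inversely proportional to (count + 1), so
    classes with fewer accumulated pixels are more likely to be sampled.
    Before any pixels are seen, all classes are equally likely.
    """
    if not any(pixel_counts.values()):
        return {c: 1 / len(pixel_counts) for c in pixel_counts}
    inverse = {c: 1 / (n + 1) for c, n in pixel_counts.items()}
    total = sum(inverse.values())
    return {c: w / total for c, w in inverse.items()}


def sample_center_classes(pixel_counts, batch_size, rng=random):
    """Draw the class label for the center pixel of each patch in a mini-batch."""
    weights = class_sampling_weights(pixel_counts)
    classes = list(weights)
    return rng.choices(classes, weights=[weights[c] for c in classes], k=batch_size)


if __name__ == "__main__":
    counts = {"DCIS": 0, "IDC": 0, "ILC": 0}
    print(sample_center_classes(counts, 12))  # first mini-batch: uniform
    counts = {"DCIS": 50000, "IDC": 120000, "ILC": 8000}
    print(class_sampling_weights(counts))     # ILC is now favored
```

After each mini-batch, the caller would add the number of labeled pixels per class in the sampled patches to `pixel_counts` before drawing the next batch.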

Model training setup
Patches were extracted with dimensions 284x284x3, and we used a mini-batch size of 12, which allows for two times the number of classes to be in a batch. Convolutional layers used valid convolutions, an $L_2$ regularizer, and the ReLU activation function. Each convolutional layer was followed by batch normalization. Both branches had a depth of 4 (i.e., 4 down-sampling and 4 up-sampling operations). As mentioned in section 3.1, for down- and up-sampling operations, we used 2x2 max-pooling and 2x2 nearest-neighbour up-scaling followed by a convolutional layer. To predict the soft labels, we used the softmax activation function. The contribution of the losses from the target and context branch can be controlled with a λ value. We tested λ = 1 to ignore the context loss, λ = 0.75 to give more importance to the target branch, λ = 0.5 for equal importance, and λ = 0.25 to give more importance to the context loss. Moreover, we made use of the Adam optimizer with a learning rate of $5 \times 10^{-6}$. We customized the number of filters for all models, such that every model has approximately 50 million parameters. We trained for 200 epochs, where each epoch consists of 1000 training steps followed by the calculation of the F1 score on the validation set, which was used to determine the best model. To increase the dataset and account for color changes induced by the variability of staining, we applied spatial, color, noise and stain (Tellez et al. (2018)) augmentations. No stain normalization techniques were used in this work.
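To see how valid convolutions shrink the feature maps in a branch of depth 4, the sketch below traces the spatial size through the encoder and decoder. The number of convolutions per level (two, as in the original U-Net) and the single convolution after each up-scaling step are assumptions for illustration; the function name is hypothetical.

```python
def branch_feature_sizes(m, depth=4, convs_per_level=2, kernel=3):
    """Trace spatial sizes through a U-Net-like branch with valid convolutions.

    Returns (encoder sizes before pooling, bottleneck size, decoder sizes).
    """
    shrink = convs_per_level * (kernel - 1)
    size = m
    encoder = []
    for _ in range(depth):
        size -= shrink          # valid convolutions
        encoder.append(size)
        size //= 2              # 2x2 max-pooling
    size -= shrink              # bottleneck convolutions
    bottleneck = size
    decoder = []
    for _ in range(depth):
        size = size * 2 - (kernel - 1)  # nearest-neighbour up-scaling, then one valid conv
        size -= shrink                  # convolutions after concatenating the skip connection
        decoder.append(size)
    return encoder, bottleneck, decoder


if __name__ == "__main__":
    print(branch_feature_sizes(284))
```

Under these assumptions, a 284-pixel input produces a 10-pixel bottleneck and a 70-pixel output, which illustrates why the predicted segmentation map is smaller than the input patch when valid convolutions are used.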

Experiments
In order to assess HookNet, we compared it to five individual U-Nets trained with patches extracted at the following resolutions: 0.5, 1.0, 2.0, 4.0, and 8.0 µm/px. The models are expressed as U-Net($r_t$) and HookNet($r_t$, $r_c$), where $r_t$ and $r_c$ are the input resolutions for the target and context branch, respectively. The aim of HookNet is to output high-resolution segmentation maps, and therefore the target branch processes input patches extracted at 0.5 µm/px. For the context branch, we extracted patches at 2.0 or 8.0 µm/px (breast only), which are the intermediate and extreme resolutions that we tested for the single-resolution models and that showed potential value in single-resolution performance measures for the breast and lung data (as can be seen in Tables 1 and 3). We applied all models to the breast and the lung dataset, with the exception of HookNet trained with resolutions 0.5 and 8.0 µm/px on the lung data, because it was evident from the single-resolution models (U-Net trained with resolutions 4.0 or 8.0 µm/px) that no potential information is available at these resolutions for this particular use case. In the HookNet models, "hooking" from the context branch into the target branch takes place at relative depths where the feature maps of both branches comprise the same resolution, which depends on the input resolutions. Considering the target resolution 0.5 µm/px, we applied "hooking" from depth 2 (the middle) of the context encoder and depth 0 (the end) of the context decoder into depth 4 (the start or bottleneck) of the target decoder, for the context resolutions 2.0 µm/px and 8.0 µm/px, respectively. To the best of our knowledge, the model proposed by Gu et al. (2018), namely MRN, is the most recent model along the same line as HookNet. Therefore, we compared HookNet to MRN. However, HookNet differs from MRN by (1) using "valid" instead of "same" convolutions, (2) using an additional branch consisting of an encoder-decoder (which enables multi-loss models) instead of a branch with an encoder only, and (3) using single up-sampling via the decoder of the target branch instead of multiple independent up-samplings. We instantiated MRN with one extra encoder and used input sizes of 256x256x3. The convolutions in MRN make use of same padding, which results in a bigger output size compared to using valid convolutions, therefore allowing for more pixel examples in each output prediction. For this reason, and to allow MRN to be trained on a single GPU, we used a mini-batch size of 6 instead of 12.
All U-Net models and the HookNet model using a single loss (where λ = 1) were trained within approximately 2 days. HookNet trained with the additional contextual loss and MRN were trained within approximately 2.5 days. We argue that this increase is due to the extra loss in HookNet and the larger size of the feature maps in MRN, which were a result of using "same" padding. All training times were measured using a GeForce GTX 1080 Ti and 10 CPUs for parallel patch extraction and data augmentation.

Results
Quantitative performance for the breast dataset, in terms of F1 score for each considered class as well as an overall macro F1 (Haghighi et al., 2018), is reported in Table 1 for all U-Net models and for each considered resolution. Quantitative performance for all models with target resolution 0.5 µm/px (i.e., U-Net(0.5), MRN and HookNet) is reported in Table 2. Likewise, for the lung dataset, quantitative performance is reported in Table 3 for all U-Net models for each considered resolution, and in Table 4 for all models with target resolution 0.5 µm/px (i.e., U-Net(0.5), MRN and HookNet).
Confusion matrices for U-Net and HookNet models for breast and lung test sets are depicted in Figure 4
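For readers unfamiliar with the metrics, the per-class F1 score and the macro F1 (the unweighted mean of the per-class scores) can be computed as in the following sketch. The function names and the toy labels are illustrative only and do not reproduce any result from the tables.

```python
def f1_per_class(y_true, y_pred, classes):
    """F1 score for each class from flat lists of true and predicted labels."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, classes)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_pred = [0, 1, 1, 1]
    print(f1_per_class(y_true, y_pred, classes=[0, 1]))
    print(macro_f1(y_true, y_pred, classes=[0, 1]))
```

Because the macro average weights every class equally, a poor score on a rare class lowers the overall value as much as a poor score on a frequent one.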

Single-resolution models
Experimental results of single-resolution U-Net on DCIS and ILC confirm our initial hypothesis, namely an increase in performance that correlates with an increase of context (microns per pixel) for DCIS (e.g., from F1=0.47 at 0.5 µm/px to F1=0.86 at 8.0 µm/px), and a completely opposite trend for ILC (e.g., from F1=0.85 at 0.5 µm/px to F1=0.20 at 8.0 µm/px), corroborating the need for a multi-resolution model such as HookNet. As expected, the lack of context causes confusion between DCIS and IDC in U-Net(0.5), where breast duct structures are not visible due to the limited field of view, whereas the lack of details causes confusion between ILC and IDC in U-Net(8.0) (see Figure 4), where all loose tumor cells are interpreted as part of a single bulk.
The highest performance in IDC and benign breast epithelium was observed at relatively intermediate resolutions. The performance of segmentation of fatty tissue is comparable in every model, and the performance of segmenting other tissue decreases when using relatively low resolutions (4.0 and 8.0 µm/px) or high resolution (0.5 µm/px).
For lung tissue, we observed an increase in performance that correlates with an increase in context and a decrease in resolution. This is mostly due to an increase in F1 score for GC in U-Net(2.0), similar to the increase observed for DCIS, whereas lack of details causes confusion between Tumor and Other in U-Net(2.0), similar to what was observed for ILC in breast tissue.

Multi-resolution models
In breast tissue segmentation, the performance of HookNet strongly depends on which fields of view are combined. We obtained the best results with an overall F1 score of 0.91 for HookNet(0.5, 8.0) with λ = 0.75, which substantially differs from HookNet(0.5, 2.0). HookNet(0.5, 8.0) shows an overall increase in all tissue types for the output resolution 0.5 µm/px, except for a small decrease in performance for DCIS relative to the U-Net trained with patches at resolution 8.0 µm/px, and improves the performance on IDC, mostly due to an improvement in ILC segmentation, which likely increases the precision of the model for IDC. Note that DCIS is the only class where the U-Net working at the lowest considered resolution gives the best performance (F1=0.86). However, that same U-Net has a dismal F1 score of 0.2 for ILC. HookNet(0.5, 8.0) processes the same low-resolution input, but increases the F1 score for ILC by 0.66 compared to U-Net(8.0), and at the same time increases the F1 score for DCIS by 0.37 compared to U-Net(0.5). In general, HookNet(0.5, 8.0) improves the F1 score for all classes compared to U-Net(0.5) and to U-Net(8.0), except for a small difference of 0.02 in F1 score in DCIS segmentation. As for single U-Net models, all HookNet models perform comparably in the fatty tissue and other tissue classes, as can be observed in Figure 6.
In lung tissue segmentation, the best HookNet (with λ = 1.0) outperforms U-Net(0.5) on the classes TLS and GC with an increase of 0.03 and 0.1 in F1 score, respectively, and at the same time shows a decrease in F1 score for Tumor of 0.01. F1 scores for the Other class are the same for both models. Different models outperform HookNet on each individual class (U-Net(1.0) for TLS, U-Net(2.0) for GC, U-Net(0.5) for Tumor, and MRN for Other). However, HookNet achieves the highest overall F1 score. We observed that HookNet, using the same fields of view as MRN, performs better than MRN for both the breast and lung tissue segmentation with respect to the overall F1 score. Finally, we observe that for breast tissue segmentation, HookNet(0.5, 8.0) performs best when giving more importance to the target branch (i.e., λ = 0.75), while for the lung tissue segmentation the best F1 scores are obtained when ignoring the context loss (i.e., λ = 1).
To verify whether there is a significant difference between HookNet and other models with the same target resolution (i.e., 0.5 µm/px), we calculated the F1 score per test slide. We applied the Wilcoxon test, which revealed that for the breast dataset, the differences between HookNet and U-Net (p-value=0.004), and between HookNet and MRN (p-value=0.001), are statistically significant. For the lung dataset, the differences between HookNet and U-Net (p-value=0.442), and between HookNet and MRN (p-value=0.719), are not statistically significant. These results suggest that HookNet substantially benefits from wide contextual information (e.g., 8.0 µm/px for the input resolution), whereas the added value of context may be less prominent, but still beneficial, when relevant contextual information is restricted (e.g., 2.0 µm/px for the input resolution). Nonetheless, we argue that HookNet, based on the improvements made on TLS and GC, as can be seen in the F1 scores (see Table 4) and in the confusion matrix (see Figure 5), can reduce the confusion between classes that depend on contextual information.

Discussion
The main outcome of this research paper is two-fold. The first outcome is a framework to effectively combine information from context and details in histopathology images. We have shown its effect in segmentation tasks, in comparison with single-resolution approaches and with one recently presented multi-resolution approach. The presented framework takes MFMR patches as input and applies a series of convolutional and pooling layers, ensuring that feature maps are combined according to (1) the same spatial resolution, without the need for arbitrary upscaling and interpolation, as done in Gu et al. (2018), but allowing a direct concatenation of feature maps from the context branch to the target branch; (2) pixel-wise alignment, effectively combined with the use of valid convolutions, which mitigates the risk of artifacts in the output segmentation map. The optimal combination of fields of view used in the two branches of HookNet has been determined experimentally. We first tested single-resolution U-Net models and then combined the fields of view that showed the best performance in two critical aspects of the problem-specific segmentation task, namely segmentation of DCIS and ILC for the breast dataset, and Tumor and GC for the lung dataset. At the moment, no procedure exists to select the optimal combination of spatial resolutions a priori, and empirical case-based analysis is needed.
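The two conditions for hooking, equal spatial resolution and pixel-wise alignment, can be illustrated with a small arithmetic sketch. It assumes that every down-sampling step halves the resolution (a factor of 2, as in the architecture figure) and that the target bottleneck sits at depth 4; the function names, the default depth, and the feature-map sizes in the usage lines are hypothetical.

```python
def resolution_at_depth(input_res, depth):
    """Feature-map resolution (µm/px) after `depth` 2x down-samplings."""
    return input_res * 2 ** depth


def hook_depth(target_res, context_res, bottleneck_depth=4):
    """Return the context-decoder depth whose feature maps match the
    resolution of the target-branch bottleneck, or raise if none does."""
    wanted = resolution_at_depth(target_res, bottleneck_depth)
    for depth in range(bottleneck_depth + 1):
        if resolution_at_depth(context_res, depth) == wanted:
            return depth
    raise ValueError("no context depth matches the target bottleneck")


def center_crop_bounds(context_size, target_size):
    """Start/stop indices of a centred crop along one spatial axis.

    An odd margin cannot be split evenly, so the crop would be shifted by
    half a pixel; this sketch treats that as a misalignment and rejects it.
    """
    margin = context_size - target_size
    if margin < 0 or margin % 2:
        raise ValueError("crop is not centred on whole pixels")
    start = margin // 2
    return start, start + target_size


print(hook_depth(0.5, 2.0))        # context depth to hook from
print(center_crop_bounds(30, 10))  # hypothetical feature-map sizes
```

Under these assumptions, a target branch at 0.5µm/px paired with a context branch at 2.0µm/px is hooked from context depth 2, since both feature maps then have a resolution of 8.0µm/px, and the cropped context features can be concatenated directly without interpolation.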
The second outcome consists of two models for multi-class semantic segmentation in breast and lung cancer histopathology samples stained with H&E. In both cases, we have included tumor as one of the classes to segment, as well as other types of tissue that can be present in the tumor tissue compartment, and for breast cancer we made a specific distinction between three subtypes, namely DCIS, IDC, and ILC. Although a specific set of classes in breast and lung cancer tissue samples has been used to show the potential of HookNet, the presented methods are general and extendable to an arbitrary number of classes, as well as applicable to histopathology images of other organs. Qualitative examples of segmentation output at whole-slide image level are depicted in Figure 8, which shows the potential for using the outcome of this paper in several applications. Segmentation of TLS and GC in lung squamous cell carcinoma can be used to automate TLS detection in lung cancer histopathology images, which will allow us to easily scale the analysis to a large number of cases, with the aim of further investigating the prognostic and predictive value of TLS count. At the same time, segmentation of tumor and other tissue types makes it possible to describe the morphology and tissue architecture of the tumor microenvironment, for example by identifying the region of the tumor bulk or the interface between tumor and stroma. This is an active research topic in immuno-oncology, due to the role of tumor-infiltrating lymphocytes (TILs), which have to be assessed in the tumor-associated stroma (Salgado et al. (2015)) as well as in the tumor bulk and at the invasive margin (Galon et al. (2006), Galon et al. (2014)). Furthermore, segmentation of both benign and malignant epithelial cells in breast cancer can be used as the first step in an automated pipeline for breast cancer grading, where the tumor region has to be identified to perform mitotic count, and regions of both healthy and cancer epithelial cells have to be compared to assess nuclear pleomorphism.
In order to show the advantage of a multi-resolution approach compared to a single-resolution model in semantic segmentation of histopathology images, several design choices have been made in this paper. Our future research will focus on investigating the general applicability and design of HookNet with respect to the constraints used. First, U-Net was used as the base model for HookNet branches. This choice was motivated by the effectiveness and flexibility of the encoder-decoder U-Net model, as well as the presence of skip connections. Other encoder-decoder models can be adopted to build a HookNet model. Second, inspired and motivated by the multi-resolution nature of WSIs, we developed HookNet for, and applied it solely to, histopathology images. However, we argue that HookNet has the potential to be useful for any application where a combination of context and details is essential to produce an accurate segmentation map. Several applications of HookNet can be found in medical imaging, but it has the potential to be extended to natural images as well. Third, we showed that using two branches makes it possible to take advantage of clear trends, such as the performance of single-resolution models on DCIS and ILC (see Figure 1) in breast cancer data. However, when focusing on the IDC class, we note that a single-resolution U-Net performs best at intermediate resolutions. This motivates further research into incorporating more branches, to include intermediate fields of view as well. Fourth, we limited HookNet, as well as the models used in the comparison, to 50M parameters, which allows model training on a single modern GPU with 11GB of RAM. Introducing more branches will likely require a multi-GPU approach, which would also allow for experimenting with deeper/wider networks and would speed up inference.
We compared HookNet in a single-loss setup (λ=1) and in a multi-loss setup (λ=0.75, 0.5, or 0.25). Our results showed that the multi-loss model, when giving more importance to the target branch (i.e., λ=0.75), performs best for breast tissue segmentation, and that the single-loss model (i.e., λ=1.0) scores best for lung tissue segmentation. Future work will focus on an extensive optimization search for the value of λ.
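The role of λ can be made concrete with a minimal sketch. It assumes that the training loss is a convex combination of the two branch losses, which is consistent with the statements that λ=1 ignores the context loss and that larger λ gives more importance to the target branch; the function name and the loss values in the usage lines are hypothetical.

```python
def hooknet_loss(target_loss, context_loss, lam):
    """Weighted sum of the two branch losses, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * target_loss + (1.0 - lam) * context_loss


# Hypothetical branch losses, for illustration only.
print(hooknet_loss(0.4, 0.8, 1.0))   # single-loss setup: context loss ignored
print(hooknet_loss(0.4, 0.8, 0.75))  # multi-loss setup favouring the target branch
```

With λ=1 only the target branch is supervised directly, so the context branch receives gradients solely through the hooked feature maps.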
Finally, we reported results in terms of F1 scores for both the Radboudumc and TCGA datasets based on sparse manual annotations. Although this is a common approach to obtain a large heterogeneous set of data in medical imaging, we observed that it limits the assessment of performance in the transition zones between different tissue types. Extending the evaluation to an additional set of densely annotated data, as well as the effort of generating such manual annotations, is part of our future research.

Conclusion
In this paper, we proposed HookNet, a framework for high-resolution tissue segmentation. We applied the model to two different datasets, both of which included high-resolution and context-dependent tissue. Our results show that the proposed model increases overall performance compared to single-resolution models and can simultaneously deal with both subtle differences at high resolution and contextual information.

Figure 1: Examples of ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) in breast tissue, and tertiary lymphoid structures (TLS) and germinal centers (GC) in lung tissue. For each example, multi-resolution/multi-field-of-view (MFMR) patches (introduced in section 1.3) are shown: both a low-resolution/large-field-of-view and a concentric, high-resolution/small-field-of-view patch are depicted.

Figure 2: Example of the procedure to manually annotate ILC regions in breast cancer slides. Left: Slide stained with H&E, with manual annotation of the region containing the bulk of tumor and details of manual (sparse) annotations of ILC and of healthy epithelium. Right: Immunohistochemistry of the same tissue sample, de-stained from H&E and restained with P120, used by a medical research assistant to guide manual annotations and to identify ILC cells, and details of the effects of P120 for ILC and healthy epithelium.
Figure 2. As a result 6 classes were annotated in this dataset.For training, validation, and testing purposes, the WSIs were divided into training (n=50), validation (n=18), and test (n=18) sets, all containing a similar distribution of cancer types.

Figure 3: HookNet model architecture. Concentric patches with multiple views at multiple resolutions (MFMR patches) are used as input to a dual encoder-decoder model. Skip connections for both branches are omitted for clarity. Feature maps are down- and up-sampled by a factor of 2. In this example, the feature maps at depth 2 in the decoder part of the context branch have the same resolution as the feature maps in the bottleneck of the target branch. To combine contextual information with high-resolution information, feature maps from the context branch are hooked into the target branch before the first decoder layer by cropping and concatenation.
and Figure 5, respectively. Finally, visual results are shown for each class of breast and lung tissue in Figure 6 and Figure 7, respectively.

Figure 6: Segmentation results on breast tissue shown for DCIS, IDC, ILC, Benign epithelium, Other and Fat. HookNet results are shown for λ=0.75. The last three rows focus on failure examples of HookNet.

Table 1: Performance of U-Net with different input resolutions on the Radboudumc test set of breast cancer tissue types. Performance is reported in terms of F1 score per tissue type: ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), benign epithelium (BE), Other, and Fat, as well as overall score (Macro F1) measured on all classes together.

Table 3: Performance of U-Net trained with different input resolutions on the TCGA test set of lung cancer tissue. Performance is reported in terms of F1 score per tissue type: tertiary lymphoid structures (TLS), germinal centers (GC), Tumor, and Other, as well as overall score (Macro F1) measured on all classes together.