Multi-domain stain normalization for digital pathology: A cycle-consistent adversarial network for whole slide images

The variation in histologic staining between different medical centers is one of the most profound challenges in the field of computer-aided diagnosis. The appearance disparity of pathological whole slide images causes algorithms to become less reliable, which in turn impedes the widespread applicability of downstream tasks like cancer diagnosis. Furthermore, different stainings lead to biases during training which, in the case of a domain shift, negatively affect test performance. Therefore, in this paper we propose MultiStain-CycleGAN, a multi-domain approach to stain normalization based on CycleGAN. Our modifications to CycleGAN allow us to normalize images of different origins without retraining or using different models. We perform an extensive evaluation of our method using various metrics and compare it to commonly used methods that are multi-domain capable. First, we evaluate how well our method fools a domain classifier that tries to assign a medical center to an image. Then, we test our normalization on the tumor classification performance of a downstream classifier. Furthermore, we evaluate the image quality of the normalized images using the structural similarity index (SSIM) and the ability to reduce the domain shift using the Fréchet inception distance (FID). We show that our method is multi-domain capable, provides the highest image quality among the compared methods, and can most reliably fool the domain classifier while keeping the tumor classifier performance high. By reducing the domain influence, biases in the data can be removed on the one hand, and the origin of the whole slide image can be disguised on the other, thus enhancing patient data privacy.


Introduction
The gold standard of cancer diagnosis is the histopathologic investigation of tissue. This involves microscopic examination of dissected and stained tissue to examine signs and characteristics of specific diseases. During dissection, the tissue is fixed in formaldehyde and embedded in paraffin [Chatterjee, 2014]. The subsequent staining of the tissue sections serves to highlight the various structures and cells in the tissue [Ghaznavi et al., 2013]. For light microscopy, the tissue sections are routinely stained with hematoxylin and eosin (H&E) [Alturkistani et al., 2015]. The staining of the tissue helps the pathologist to make diagnoses based on certain features such as cell morphology or the arrangement of cells [Gurcan et al., 2009]. Staining mainly depends on the formulation of the stain and the application time, among other pipeline-dependent aspects [Kothari et al., 2014]. After the tissue is stained, the slides are reviewed by pathologists. Both diagnosis and tumor grading are performed with the goal of providing prognosis and treatment recommendations [Farahani et al., 2015]. With the advancement of technology and the rise of whole-slide imaging, i.e., the digitization of high-resolution microscopic images, digital pathology has rapidly gained importance [Alturkistani et al., 2015]. Digitizing the slides allows the use of automated systems for detection, classification, or segmentation of a desired entity. Deep learning (DL) and convolutional neural networks (CNN) have emerged as effective tools in automatic image analysis, using big data to parameterize models that no longer rely on hand-crafted features [LeCun et al., 2015]. DL-based algorithms can support pathologists and relieve them of monotonous and repetitive tasks [Echle et al., 2021]. DL has already been successfully tested in various clinical tasks, such as tumor detection [Cruz-Roa et al., 2017], mitosis detection [Saha et al., 2018, Albarqouni et al., 2016], grading of cancer [Bulten et al., 2020],
predicting the lymph sentinel status [Kiehl et al., 2021], patient survival estimation [Wessels et al., 2022], or tumor subtyping [Wang et al., 2020]. The appearance of slides is highly variable as a result of different scanners, staining techniques, and laboratories [Ciompi et al., 2017]. This variability is not a problem for pathologists [Bancroft, 2008], but current deep learning algorithms struggle with it [Ciompi et al., 2017]. Even very small changes in the image can lead to large deviations in performance for DL-based algorithms [Kurakin et al., 2016]. This is especially true for domain shifts, where a different input data distribution can lead to reduced performance and in turn potentially harm the patient [Stacke et al., 2021]. As early as 1994, Lyon et al. postulated that the standardization of dyes and stains would play an increasingly important role in the future [Lyon et al., 1994]. However, not all differences are due to non-standardized processes, and thus not all variations are avoidable [Niethammer et al., 2010]. In conclusion, every tissue source site, e.g., a medical center or clinic, has a distinct signature due to biological variations in patients treated at various centers, specimen acquisition, staining, and digitization. On the one hand, this center-specific signature leads to biases in the data, which many algorithms suffer from; on the other hand, it can be used to determine the source of a whole slide image (WSI). The origin of the WSI can then be used to draw conclusions about the patient demographics, such as age, nationality, and ethnicity. DL is able to determine the origin of a WSI with high accuracy [Howard et al., 2021]. This means that patient data privacy is no longer guaranteed, and it also enables the misuse of this information. To integrate DL into the work of pathologists and clinicians, methods that make these algorithms robust and stain-invariant must be developed, as well as new approaches that can remove the center-specific
signature. Stain normalization is one possible method for achieving this objective.
HETZ ET AL., 2022

Stain normalization
Methods for the normalization of histological slides have already been shown to be effective for some applications [Ciompi et al., 2017]. These normalization methods transform images x of a domain X to look like they originated from a target domain Y or to match the appearance of a template image y. Consequently, stain normalization, which aims to map images from a source domain to a target domain, is a very active field of research. In this regard, Salvi et al. divide the field into three areas: global color normalization, color normalization after stain separation, and deep-learning-based normalization [Salvi et al., 2021]. Methods based on global color normalization apply procedures that use the statistics of a template image and derive a transformation from them, such as the global color transformation using principal component analysis (PCA) by [Reinhard et al., 2001] or the histogram specification proposed by [Gurcan et al., 2009]. In methods based on color normalization after stain separation, the individual staining components, usually hematoxylin and eosin, are separated. This technique makes use of the property that the two stains can be linearly separated by a transformation into optical density space [Roy et al., 2018]. Each pixel can then be calculated as the product of a stain color appearance matrix, which is acquired from a template image, and the stain density map [Salvi et al., 2021]. The estimation of the appearance matrix can be based on singular value decomposition as described by [Macenko et al., 2009], prior information [Ruifrok and Johnston, 2001], non-negative matrix factorization [Vahadane et al., 2016], or spectral matching [Tosta et al., 2019]. More recent approaches rely increasingly on neural networks. Generative adversarial networks (GANs) are used to normalize stains and show promising results [Zanjani et al., 2018, BenTaieb and Hamarneh, 2018]. Following the work of [Gatys et al., 2016], stain normalization is treated as neural style transfer, where
the goal is to give an input image the appearance of a learned distribution of images. GANs are deep generative models which consist of two networks: a generator, which generates images, and a discriminator, which tries to separate the generated images from real images drawn from the distribution of the training data.
The training is a minimax game in which the two models compete with each other. The goal of the generator is to generate images from a noise vector z such that the distribution of the generated images P_G(z) corresponds as closely as possible to the distribution of the training data P_data(x). The loss function can be described as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

The generator tries to minimize the loss function, while the discriminator tries to maximize it. The models are trained until they reach an optimum [Goodfellow et al., 2014]. [Isola et al., 2017] extend this approach by passing an image instead of a noise vector z as input to the generator. This allows images to be transferred to another domain, which is essential for the normalization of histological stains. However, the disadvantage of this approach is that image pairs are required. In histopathology, it is rare for the same slide to be re-stained multiple times, so this approach is of limited use. Salehi et al. circumvent this drawback by converting images to gray-scale images, thus generating image pairs synthetically [Salehi and Chalechale, 2020]. Due to the missing image pair problem, more work has been directed towards cycle-consistent adversarial networks (CycleGANs), proposed by [Zhu et al., 2017], which no longer require image pairs thanks to cycle consistency. Approaches based on CycleGAN are particularly suitable for the field of image-to-image translation in histopathology. CycleGAN and its variants have already demonstrated in several studies that they are suitable for stain-to-stain translation or normalization tasks [Shaban et al., 2019, de Bel et al., 2021, Runz et al., 2021, Zhou et al., 2019].
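To make the minimax objective in Eq. (1) concrete, the following is a minimal numpy sketch, not the paper's implementation, that estimates the GAN value function from discriminator outputs on real and generated samples:

```python
import math
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of the GAN value function in Eq. (1):
    V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].
    `d_real` are discriminator outputs on real samples, `d_fake` on generated
    ones; `eps` guards the logarithms against zero inputs."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# At the theoretical optimum the discriminator outputs 0.5 everywhere,
# giving V = 2 * log(0.5) = -log 4 [Goodfellow et al., 2014]:
v_opt = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

The generator lowers this value by making `d_fake` approach the discriminator's outputs on real data, while the discriminator raises it by separating the two sets.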

Multi-domain stain normalization
The majority of previous literature focuses on the transfer between two stains, but not on a many-to-one approach, which offers more flexibility in real-world settings. One-to-one approaches require training a new model whenever stains from a new domain have to be normalized, see Fig. 1. Furthermore, they depend on the local availability of the data to perform the normalization. In privacy-preserving settings such as federated or swarm learning, this is not the case. For this reason, we introduce MultiStain-CycleGAN, an unsupervised multi-domain-capable stain normalization method based on CycleGAN. The method presented here follows an approach comparable to that described by [Tellez et al., 2019], but with the additional reconstruction of the input image to ensure the integrity of the image structure through the cycle consistency condition. In doing so, we reformulate the stain normalization task into an image-to-image translation task, in which heavily augmented input images are subjected to gray-scale conversion and transformed into the desired stain. The network learns to reconstruct images from gray images with different contrasts so that they have the appearance of the learned staining, and thus performs stain normalization. Our method is tested for several properties, including tumor classification, domain classification, image quality of the generated images, and distribution shift after normalization. Our contributions can be summarized as follows:
• Developing MultiStain-CycleGAN, a robust deep-learning-based multi-domain approach for stain normalization without the need for retraining to normalize untrained stainings.
• Achieving low accuracy on tissue source site classification to remove spurious domain factors and thus improving data privacy, while generating images with the highest SSIM index and retaining tumor prediction accuracy.
• Providing a detailed analysis of the studied normalization methods in terms of their ability to improve a downstream task, their image quality, and their ability to disguise the origin of the images, as well as the influence of different data augmentation intensities.
The remainder of this work is structured as follows: Chapter 2 introduces the datasets and their attributes. In addition, it gives a description of how the data is acquired, preprocessed, and stratified. Chapter 3 explains stain normalization with CycleGAN and how we derived a multi-domain approach from it. Furthermore, the metrics used to measure the distribution shift and the image quality of the normalized images, as well as the neural networks used, are explained. Chapter 4 describes the experimental setup and chapter 5 lists the results in detail. Afterwards, the results are discussed and placed into context in chapter 6. Chapter 7 provides a concluding summary and identifies research gaps for future work.
2 Materials

All in all, this dataset allows us to make valid statements about the proposed normalization method and the quality of the normalized images. The 50 lesion-level annotated slides were primarily used to evaluate tumor classification, and the remaining slides were then used to classify the tissue source site, assess image quality, and determine the distance between the target and source domains. For the preprocessing of the tile-based approach, we utilized our self-developed, publicly available pipeline for the extraction of tiles from WSIs and a subsequent filtering by tissue presence according to [Khened et al., 2021]. We employed a configuration such that each tile had a spatial extent of 64 × 64 µm and a resolution of 256 × 256 pixels. Furthermore, we sampled the tiles with an overlap of 25% for non-annotated and 50% for annotated tissue. For the evaluation of our multi-domain stain normalization, we chose CWZ as the target domain.

Normalization Network
Our MultiStain-CycleGAN required tiles from two of the five domains for training. In our experiments, we arbitrarily chose the two centers CWZ and LPON. For the training, we used about 240 000 tiles from center CWZ and about 270 000 tiles from center LPON. The tiles were extracted only from the slides containing lesion-level annotations, i.e., 10 slides per center.

Tumor classifier
For the tumor classifier, we used a 5-fold cross-validation. The folds were stratified according to the class 'tumor' or 'non-tumor'. In training, we only used data from the target domain with lesion-level annotations, with each fold containing 39 000 tiles, resulting in 195 000 training images in total. Of these, a total of 24 000 tiles belong to the 'tumor' class. For the test set, the target domain CWZ was omitted because in this work we only focused on images with domain shift for the tumor classifier performance. For each of the remaining centers, we decided to use 5 000 randomly drawn tiles of each of the two classes, resulting in a total test set of 40 000 tiles for the tumor classifier.

Domain classifier
For the training of the domain classifier, we also used a 5-fold cross-validation, stratified by centers. Each training fold contains 290 000 tiles, in total about 1 430 000 tiles, which were extracted from the 50 slides with lesion-level annotations. The test set consists of a total of 25 000 tiles, composed of 5 000 tiles per center. The test set is based on tiles from the 90 slides without lesion-level annotations, which have not been used in the evaluation before.

Methods
This section introduces the general methods needed for this work, such as CycleGAN and stain normalization. Then, we present how we derived MultiStain-CycleGAN and how we perform stain normalization. Furthermore, we explain how we conduct the study. Finally, we describe the performance metrics used.

Cycle-consistent adversarial networks
Cycle-consistent adversarial networks, proposed by Zhu et al., learn a mapping from a domain X to a domain Y using training data x_i ∈ X and y_i ∈ Y, where X and Y denote the respective datasets of the domains. In Fig. 3, the two mappings are illustrated with G : X → Y and F : Y → X. The domain-dependent adversarial discriminators D_X and D_Y learn whether the input image is a generated image G(x) or F(y) or a sample x or y from the distribution of the training data [Zhu et al., 2017]. Here, the objective function contains several loss terms. Among them are the adversarial losses L_{D_X} and L_{D_Y}, which match the distributions of generated and training data [Goodfellow et al., 2014], a cycle-consistency loss L_cyc, which helps to preserve the structure of the input images, as well as the identity losses L_{idt_X} and L_{idt_Y}, which help to keep the color palette close to the input image [Zhu et al., 2017]. The two adversarial losses are applied to the two mapping functions G and F. For example, the adversarial loss for the function G : X → Y with its discriminator D_Y can be expressed as follows:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_Y(G(x)))] \quad (2)$$

Here, the generator G tries to generate images G(x) that appear as if they originate from the distribution Y, while the discriminator D_Y tries to distinguish the generated samples G(x) from the real samples y. This process is repeated for the function F : Y → X with L_GAN(F, D_X, Y, X) [Zhu et al., 2017]. Adversarial training can theoretically learn mappings G and F such that the generated images correspond to the distributions of the respective target domains Y and X. Furthermore, with a sufficiently large capacity of the model, it is possible to map the same input to several different outputs in the target domain. To reduce the solution space, Zhu et al.
suggest that the mapping functions should be cycle-consistent. This behavior is enforced by the cycle-consistency loss:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\lVert G(F(y)) - y \rVert_1]$$

For certain tasks, including stain normalization, it is useful to add an identity loss, which ensures that the mapping is consistent with the color of the input image. The two mapping functions G and F learn the identity function in case a sample from the real target distribution is given as the input image. This loss is described by:

$$\mathcal{L}_{idt}(G, F) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\lVert G(y) - y \rVert_1] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\lVert F(x) - x \rVert_1]$$

The complete objective function is therefore obtained as:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda_{cyc} \mathcal{L}_{cyc}(G, F) + \lambda_{idt} \mathcal{L}_{idt}(G, F)$$

[Zhu et al., 2017].
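The cycle-consistency and identity terms are simple L1 penalties. A minimal numpy sketch, with plain functions standing in for the generator networks, assuming expectations are approximated by batch means:

```python
import numpy as np

def l1(a, b):
    """Mean absolute (L1) difference between two image batches."""
    return float(np.mean(np.abs(a - b)))

def cycle_loss(G, F, x, y):
    """L_cyc(G, F): the round trips F(G(x)) and G(F(y)) should return to the inputs."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

def identity_loss(G, F, x, y):
    """L_idt(G, F): feeding a real sample of the target domain should leave it unchanged."""
    return l1(G(y), y) + l1(F(x), x)

# Sanity check with identity mappings, for which both losses vanish:
rng = np.random.default_rng(0)
x = rng.random((2, 3, 8, 8))   # a tiny batch standing in for domain X
y = rng.random((2, 3, 8, 8))   # and for domain Y
ident = lambda t: t
```

Any deviation of the generators from a perfect round trip increases the cycle loss, which is what constrains the otherwise underdetermined adversarial mappings.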

Multi-domain stain normalization with MultiStain-CycleGAN
To convert CycleGAN into a many-to-one approach, we performed several modifications. To reduce the space of possible inputs and thus simplify the problem, the images are converted to 3-channel grayscale images. Thus, the new task for the two mapping functions G and F is the reconstruction of the RGB images of the respective domain from the gray-converted images. We hypothesize that if the reconstructions F(G(x)) and G(F(y)) are sufficiently good, the loss of information due to grayscale conversion will be compensated by the model. In order to increase the variance of the input data and thus improve the generalization ability of the generators, an augmentation function is applied. The augmentation function H is then obtained from the color augmentation and the gray-value conversion. This function is essential to later normalize data outside the distribution of the raw training data. Depending on the intensity of the augmentation, more or less information can be lost in the image due to a lack of contrast. The task of the generators thus changes to denoising and recoloring into the target domain. By applying the function H, the images are transformed into the intermediate domain W, which represents the input space (see Fig. 4). The use of the intermediate domain W allows us to normalize a large variation of input images at inference time. Since the original identity loss task is no longer relevant in this setup, the additional task of reconstructing unaugmented grayscale images was added instead. This additional task compensates for the noisy images that may result from strong contrast augmentations, thus focusing G and F on the normalization of color instead of the denoising task. This domain-faithful reconstruction loss between the gray-converted input images x', y' and the original images x, y is:

$$\mathcal{L}_{idt}(G, F) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\lVert G(y') - y \rVert_1] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\lVert F(x') - x \rVert_1]$$

Thus, the complete objective function for our model results in:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda_{cyc} \mathcal{L}_{cyc}(G, F) + \lambda_{idt} \mathcal{L}_{idt}(G, F)$$

For our experiments, we choose λ_cyc = 10 and λ_idt = 0.5 as in the original implementation [Zhu et al., 2017]. The two generators G and F are each an adapted U-Net, which has proven to be a very effective architecture in various medical tasks [Ronneberger et al., 2015]. The network contains several downsampling and corresponding upsampling blocks, which double the number of filters while halving the image dimensions, and vice versa. For our evaluation, a tile size of 256 × 256 and a filter count of 32 for the innermost block proved effective. A downsampling block consists of a convolution layer, a leaky ReLU activation function with slope a = 0.2, and an instance normalization layer as described by [Ulyanov et al., 2016]. An upsampling block consists of a transpose convolution layer, a ReLU activation function, and an instance normalization layer. The two discriminators are PatchGANs, proposed by [Isola et al., 2017]; their loss can be interpreted as a kind of style loss. In addition, spectral normalization, introduced by [Miyato et al., 2018], was used as a normalization layer to stabilize the training. Each discriminator consists of three blocks, each comprising a convolution layer, a leaky ReLU activation function, and a normalization layer. The filter numbers increase quadratically with the depth of the discriminator. The least squares GAN (LSGAN) loss is used as the adversarial loss, which allows a better image quality for generated images [Mao et al., 2017]. To further stabilize the training and reduce oscillations [Goodfellow, 2016], we implemented an image buffer of size 50, as proposed by [Shrivastava et al., 2017], which stores a history of generated images. Since in our experiments we noticed a tendency of the discriminator loss to converge to zero, we introduced an update threshold to avoid this, which prevents a gradient update as soon as one of the discriminator losses falls below a threshold value. The threshold was set to 0.1.
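The update threshold described above amounts to a simple guard around the discriminator's optimizer step. A hedged sketch, where `opt_step` is a hypothetical callable performing the actual gradient update:

```python
# Skip the discriminator's gradient step whenever its loss drops below 0.1,
# preventing the loss from collapsing to zero and the discriminator from
# overpowering the generators.
D_LOSS_THRESHOLD = 0.1

def maybe_update_discriminator(d_loss, opt_step):
    """Apply the optimizer step only while the discriminator loss stays above the threshold."""
    if d_loss < D_LOSS_THRESHOLD:
        return False  # discriminator frozen for this iteration
    opt_step()
    return True
```

The generator, by contrast, is updated every iteration, so a temporarily frozen discriminator gives the generator time to catch up.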
For our training, we used the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 10^-5 and a linear decay over 50 epochs. We trained the MultiStain-CycleGAN for 100 epochs in total on an Nvidia V100. For the color augmentation, we used the following factors: saturation 0.75, brightness 0.75, contrast 0.5. Since we subject our images to gray conversion, changing the hue value has no effect on the output image. These parameters were determined empirically by visual inspection of test images. They were chosen in such a way that the morphology of the structures is largely preserved.
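The augmentation function H, i.e., color jitter with the factors above followed by a 3-channel gray conversion, can be sketched as follows. This is a numpy sketch under assumptions about the exact jitter formulas (the paper does not spell them out), not the authors' implementation:

```python
import numpy as np

def augment_to_gray(img, rng, brightness=0.75, saturation=0.75, contrast=0.5):
    """Sketch of the intermediate-domain mapping H: jitter brightness,
    saturation, and contrast within the stated factors, then convert to a
    3-channel gray image. `img` is an H x W x 3 float array in [0, 1].
    Hue jitter is omitted because the final gray conversion cancels its effect."""
    b = 1.0 + rng.uniform(-brightness, brightness)
    s = 1.0 + rng.uniform(-saturation, saturation)
    c = 1.0 + rng.uniform(-contrast, contrast)
    out = img * b                                        # brightness jitter
    luma = out.mean(axis=-1, keepdims=True)
    out = luma + s * (out - luma)                        # saturation jitter
    out = out.mean() + c * (out - out.mean())            # contrast jitter
    gray = np.clip(out, 0.0, 1.0).mean(axis=-1)          # gray conversion
    return np.repeat(gray[..., None], 3, axis=-1)        # replicate to 3 channels
```

Strong contrast jitter can flatten the gray values almost entirely, which is exactly the information loss the domain-faithful reconstruction loss is meant to compensate.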

Image quality and domain shift metrics
For the evaluation of the image quality and the measurement of the domain shift of our normalization method, we chose the structural similarity (SSIM) index [Wang et al., 2004] and the Fréchet inception distance (FID) [Heusel et al., 2017], respectively. These metrics are intended to help evaluate the quality of the generated images and quantify the change in domain shift.

Fréchet inception distance
In order to be able to make a statement about the domain shift before and after normalization and to evaluate different stain normalization methods with respect to their ability to reduce the domain gap, we decided to use the FID described by Heusel et al. The FID is an improvement over the Inception Score proposed by [Salimans et al., 2016] in terms of consistency with human perception as the disturbance of the image increases. We chose this metric because of its frequent application in generative tasks and the extensive evaluation of the method [Xu et al., 2018, Lucic et al., 2017].
The FID represents the difference between two Gaussians fitted to the features of an Inception model. The FID is given by:

$$\mathrm{FID} = \lVert \mu_X - \mu_Y \rVert_2^2 + \mathrm{Tr}\left(C_X + C_Y - 2\left(C_X C_Y\right)^{1/2}\right)$$

where the mean and covariance µ_X, C_X correspond to the Gaussian of the generated data and µ_Y, C_Y correspond to the Gaussian of the real-world data. The FID is zero in the case of matching distributions. [Liu et al., 2018] describe that using an ImageNet model to project generated images of domains unrelated to ImageNet into feature space can be ineffective. Following their suggestion, we used the model of [Ciga et al., 2022] as a domain-specific encoder, which was trained on several histopathology datasets. The FID correlates with human visual perception and measures how large the perceived difference is between images from two different distributions. In the stain normalization use case, a high FID means that there is a large domain shift. This domain shift, and thus the FID, should be reduced by stain normalization methods.
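As a minimal illustration of the formula, the following numpy sketch computes the Fréchet distance under a simplifying diagonal-covariance assumption, so that the matrix square root reduces to an elementwise one; the actual metric uses the full covariance matrices of the encoder features:

```python
import numpy as np

def fid_diagonal(feats_x, feats_y):
    """Fréchet distance between Gaussians fitted to two feature sets,
    FID = ||mu_X - mu_Y||^2 + Tr(C_X + C_Y - 2 (C_X C_Y)^{1/2}),
    simplified to diagonal covariances (per-dimension variances)."""
    mu_x, mu_y = feats_x.mean(axis=0), feats_y.mean(axis=0)
    var_x, var_y = feats_x.var(axis=0), feats_y.var(axis=0)
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.sum(var_x + var_y - 2.0 * np.sqrt(var_x * var_y)))
```

Identical feature sets give a distance of zero; a constant shift of the features shows up purely in the mean term.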

Structural Similarity Index
To compare the perceived structure before and after normalization, we utilize the structural similarity index proposed by Wang et al. for our evaluation. The SSIM index compares two images in terms of their similarity and image quality.
The SSIM index is 1 in the case of two identical images. In contrast to the use of the mean squared error (MSE) for the comparison of images, the SSIM index calculates contrast, structure, and luminance separately and then combines them.
We chose this metric because it has already been used in stain normalization scenarios [Hoque et al., 2021, Shaban et al., 2019] and due to the high importance of keeping structural features unaffected by normalization. Preserving the structure after normalization is essential, as otherwise the morphology of the cells is altered. This can lead to errors in classification, as will be shown in chapter 5. The SSIM is given by:

$$\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where µ_x and σ_x^2 correspond to the mean and variance of the image x, respectively, σ_xy corresponds to the covariance of the images x and y, and C_1 and C_2 are constants to stabilize the denominator.
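A minimal numpy sketch of the standard SSIM statistic follows. Note that this is the global variant computed over whole images; the metric as commonly used (and as implemented in scikit-image, which the experiments rely on) averages the same statistic over local windows:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between two images with values in [0, 1], using the usual
    stabilizing constants derived from K1 = 0.01 and K2 = 0.03."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()   # covariance of x and y
    return float(((2 * mx * my + c1) * (2 * cxy + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

For identical images the covariance equals the variance and the expression collapses to 1; structural distortions reduce the covariance term and pull the index down.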

Network-based metrics
Image-based similarity metrics such as the FID and the SSIM index can give clues about the visual similarity of the normalized images. However, they do not evaluate how well downstream classifiers perform on the normalized images. As the goal of stain normalization is to at least maintain task performance while ideally obscuring the origin of the slide, we additionally evaluate our approach with network-based metrics.

Domain classifier
To verify that the respective normalization methods are able to disguise the tissue source site, we use an Xception classifier proposed by [Chollet, 2017], pre-trained on ImageNet, following [Howard et al., 2021]. We measure accuracy because the test dataset is balanced and we are interested in the clinic an image originates from. As described in chapter 2, we applied a 5-fold cross-validation to obtain an approximate distribution of the accuracy and to exclude outliers. Each model was trained with a learning rate of 10^-4 and a batch size of 32 for 50 epochs. Due to very large performance differences of the classifier with different levels of color augmentation in training, we decided to use a high intensity of color augmentation. We used color jitter with the parameters brightness: 0.7, contrast: 0.7, saturation: 0.7, hue: 0.5.

Tumor classifier
For tumor classification, we utilized a ResNet18 pre-trained on ImageNet. Analogous to the domain classifier, we trained the classifier with a learning rate of 10^-4 and a batch size of 32 for 50 epochs. Again, we chose accuracy as the target metric, as the test set has balanced classes, as described in chapter 2. Also with this classifier, we could see very large performance differences depending on the intensity of the color augmentation that was used. Thus, we employed the same parameters as described for the domain classifier in 3.4.1.

Figure 5: The target domain representative template images from three different slides used for the template-based methods.
4 Experimental setup

4.1 Normalization

For all experiments, we trained our MultiStain-CycleGAN to learn the transformations G : CWZ → LPON and F : LPON → CWZ. For simplicity, we examine only the normalization F. Due to the huge memory requirements of WSIs, we were forced to perform the evaluation in a tile-by-tile manner. Thus, in order to analyze all the criteria under consideration, both the domain dataset and the tumor classifier dataset have to be normalized using the normalization methods under investigation. To place our method in the context of the existing literature, we compare it with other common normalization methods that can be used in a multi-domain manner. We chose the methods of [Macenko et al., 2009], [Reinhard et al., 2001], and [Vahadane et al., 2016] because of their frequent use for stain normalization in histopathology. Due to the template-based nature of these methods, we selected the three representative templates shown in Fig. 5. This leads to three normalized datasets for each of the template-based approaches for the respective task. For the Macenko normalization, we utilized the implementation from torchstain; for the Reinhard and Vahadane methods, we used staintools.

Deep learning classifier
We trained each of the classifiers five times using different seeds. We trained our models with four different augmentation intensities to analyze the behavior of the studied methods with respect to color augmentation. Since we use tiles of size 256 × 256 as input for the normalization methods, as described in chapter 2, they were brought to the size 224 × 224 using a center crop in order to use the pretrained ResNet18 and Xception models. For each task, the accuracy for each of the normalized and unnormalized datasets is determined for each of the five models.
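The center crop from 256 × 256 to 224 × 224 is a one-liner; a small numpy sketch for completeness:

```python
import numpy as np

def center_crop(tile, size=224):
    """Center-crop an H x W x C tile (here 256 x 256) to size x size, matching
    the input resolution expected by the ImageNet-pretrained classifiers."""
    h, w = tile.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return tile[top:top + size, left:left + size]
```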

FID and SSIM index
We likewise use an open-source implementation to calculate the FID and have modified it as described in section 3.3.1 to use a model trained on histopathological data. We calculated the FID for both the normalized domain and tumor classifier datasets by computing the distance between 40 000 random tiles from the 10 WSIs with lesion-level annotations of the target domain and the respective normalized dataset.
In contrast to the FID, the SSIM index is calculated on image pairs. Here, for each of the normalized datasets, the SSIM index is calculated between the normalized image and the corresponding non-normalized image. Thus, for each of the datasets, we estimate a mean value and the associated standard deviation. For the calculation, we use the implementation of scikit-image [van der Walt et al., 2014]. A summary of the results is given in table 1. This includes the classifier results averaged over each of the five models for the heavy augmentation level, the mean SSIM index, and the FID. In addition, we introduce a check mark that indicates whether the range µ ± σ of the tumor classifier accuracy of the respective method overlaps with or exceeds the range of the unnormalized case, where µ denotes the estimated mean and σ the estimated standard deviation of the tumor classifier accuracy. This is then referred to as sustained performance. An overview of the normalization results of the different methods studied is provided by Fig. 6, which shows the original images and the normalized variant in each case. The images are from all of the centers available in the Camelyon17 dataset. For comparison, images of the target domain CWZ are given in A. More tiles normalized with our method can be seen in Fig. 11; here, the tiles are from the untrained centers RST, UMCU, and RUMC. In the following, the results of the domain classification are discussed in more detail. It is shown that with a decreasing SSIM index, the tumor classifier performance decreases significantly. This is possibly due to a loss of contrast and the associated structural changes and loss of information in the images. The result is similar for the FID, where accuracy decreases with increasing FID.
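The sustained-performance criterion described above reduces to a one-sided interval comparison; a small sketch of how that check could be computed:

```python
def sustained_performance(mu_method, sigma_method, mu_unnorm, sigma_unnorm):
    """True if the interval mu +/- sigma of a method's tumor-classifier accuracy
    overlaps with, or exceeds, the interval of the unnormalized baseline,
    i.e. the method's upper bound reaches the baseline's lower bound."""
    return mu_method + sigma_method >= mu_unnorm - sigma_unnorm
```

A method whose accuracy band lies entirely below the unnormalized band fails the check and is marked as not sustaining performance.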

Domain classification
We examined the results of the tissue source site classifier, or domain classifier, for the different normalization methods, as well as for the unnormalized case. A detailed overview of the different classification results of the domain classifier is given in table 2.
The domain could be estimated with a very high accuracy of 0.952 in the unnormalized case. The Reinhard normalization still has a high accuracy of over 0.9 in all cases. The lowest accuracy was achieved by the Vahadane and Macenko normalizations for template 2, with 0.605 and 0.634, respectively, followed by MultiStain-CycleGAN with an accuracy of 0.701. Fig. 7 shows the accuracy of the domain classifier over the two metrics SSIM index and FID. In the left plot, as well as in table 1, it can be seen that the two methods that result in the lowest accuracy of the domain classifier also have a very low SSIM index. Further, our presented method achieves the highest SSIM index, with a value of 0.957.
In the right plot of Fig. 7, the accuracy of the domain classifier is shown over the FID. Again, the outlier character of the two methods Macenko Template 2 and Vahadane Template 2 can be seen: they have by far the highest FID. Our method is placed in the middle and only slightly below the unnormalized data. In general, a trend emerges in which the accuracy of the domain classifier decreases with increasing FID, although our approach deviates somewhat from this trend.
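For reference, the FID compares the Gaussian statistics (mean and covariance) of feature embeddings of two image sets. A minimal numpy sketch of the closed form, using random feature arrays in place of the Inception embeddings used in the evaluation:

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """FID between two feature sets of shape (n_samples, dim):
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_a C_b, which are real and non-negative for
    # covariance matrices
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats_ref = rng.normal(0.0, 1.0, (1000, 4))    # stand-in for target-domain features
feats_close = rng.normal(0.0, 1.0, (1000, 4))  # same distribution -> small FID
feats_far = rng.normal(2.0, 1.0, (1000, 4))    # shifted distribution -> large FID
```

A larger distribution shift between the two feature sets directly increases the distance, which is why the FID is used here as a proxy for the remaining domain shift after normalization.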

Tumor classification
The evaluation of the tumor classifier was performed analogously to that of the domain classifier. Again, a detailed listing of the results is given in the appendix in table 3. The accuracy of tumor classification is relatively high at about 0.9 for all studied normalization methods as well as for the unnormalized case, with the exception of Macenko Template 2 and Vahadane Template 2, which suffer a significant loss in accuracy. Thus, for this task, applying the studied stain normalizations yields no improvement in classifier performance. Fig. 8 (left) shows tumor classifier accuracy as a function of the SSIM index. Most methods perform very similarly to the unnormalized case, while several methods decrease the accuracy moderately to strongly, especially when the SSIM index drops strongly. On the right side of Fig. 8, tumor classifier accuracy is visualized over the FID: the accuracy decreases with increasing FID, although the trend is less clear than in the left plot.

Discussion
As shown in Fig. 8, saturation effects occur for both metrics. It can be seen that the performance of the tumor classification is only significantly reduced when the SSIM index falls below a threshold. Similar results can be observed for the FID, but the trend is somewhat more continuous.
As expected, a high FID and a low SSIM index result in a significant performance loss of the tumor classifier. This is possibly because these two methods reduce the contrast too much, altering the morphology of the tissue, which amounts to a severe loss of information. Due to this loss of essential information, the domain can no longer be estimated by the domain classifier. Interestingly, despite a mid-range FID, our method is significantly more capable of fooling the domain classifier than the other methods, except for Macenko Template 2 and Vahadane Template 2. Thus, our method beats the other methods by combining the highest SSIM index with a comparatively very good ability to fool the domain classifier. This behavior is again supported by Fig. 9, which shows a clear gap in domain classifier accuracy between our method and the other normalization methods, and at the same time a clear gap in tumor classifier accuracy to the two outlier methods. Despite an only slightly lower FID compared to the unnormalized case, our approach significantly lowered the accuracy of the domain classifier. Here, visual perception and the FID drift apart, as clear differences before and after normalization are visible in Fig. 6. Thus, in our case, we consider the FID only conditionally suitable for quantifying domain shifts in histopathological data. Furthermore, the metrics show that our method is able to normalize stainings that are not contained in the training set and can thus normalize different domains with one model, without retraining. This is supported by the high tumor classifier accuracy despite the distribution shifts across the different centers, demonstrating how well our method normalizes domains unseen during training. We could not observe any significant improvement of the downstream task, in our case tumor classification, when heavy data augmentation was used. This is partly consistent with the findings of [Tellez et al., 2019], although it could also be due to the task being too easy to solve. Tables 2 and 3 show that it is essential to investigate stain normalization methods in combination with color augmentation. The results show that with no to moderate color augmentation, our normalization method achieves large improvements, which become smaller when strong color augmentation is used. Thus, with heavy color augmentation applied in training, the domain classifier can be fooled much less: its accuracy increases by more than 30% compared to the case of no color augmentation. A similar picture emerges for tumor classification, where our normalization provides an accuracy gain of over 10% in the case of no color augmentation. With medium color augmentation, we still achieve a slightly increased performance, which disappears with heavy color augmentation. Another point the results show is that template-based approaches, depending on the template chosen, can deliver very different solutions and associated performances with very high variances. Thus, the choice of an inappropriate template can lead to performance losses. GAN-based approaches have an advantage here, since they take the entire training dataset into account and can thus lead to more consistent solutions.
In the end, we can say that we partially reproduce the results of [Howard et al., 2021] and extend them with a GAN-based approach. Our results also show that, despite the use of commonly used stain normalization methods, the domain can be predicted with high accuracy. Our method is the exception here, as it leads to a significant loss of domain classifier accuracy while maintaining high image quality.

Conclusion
We have presented MultiStain-CycleGAN, a new approach to stain normalization based on a modified CycleGAN that uses an intermediate domain, works for multiple unseen stainings without retraining, and is thus multi-domain capable. We have extensively compared our approach with several commonly used normalization methods, analyzing them with different metrics and augmentation levels to understand their behavior. It has been shown that our method does not suffer from the problems of template-based approaches, while, in contrast to conventional GAN-based approaches, training a single model is sufficient. Furthermore, our method is best able to fool the domain classifier while providing the best image quality and high performance in the downstream task. This work is thus a step towards disguising the origin of tissue sections and reducing bias across the different domains. Continuing this work, other stainings such as IHC can be investigated. Furthermore, more complex downstream tasks should be analyzed, since for our task none of the normalization methods led to an improvement in performance.

Figure 1 :
Figure 1: Left: Stain normalization based on conventional GAN approaches. For each staining, a separate model has to be trained to normalize many stainings to the target domain. Right: Stain normalization with MultiStain-CycleGAN, which is trained on one staining and can normalize any H&E staining of the same tissue type. Dotted arrows indicate the data needed for training the respective model; normal arrows show the inference path.

Figure 2 :
Figure 2: Example slides for the different domains from the CAMELYON17 dataset. The images a)-e) show examples of tissue sections from the different centers with their different stainings: a) CWZ; b) RST; c) UMCU; d) RUMC; e) LPON.

Figure 3 :
Figure 3: The principle of image-to-image translation with CycleGAN as proposed by Zhu et al. An image from a domain X is mapped to a domain Y by a generative model. After the mapping, the image is reconstructed into its original domain and the cycle-consistency loss is computed, enabling unpaired image-to-image translation.

Figure 4 :
Figure 4: Overview of MultiStain-CycleGAN. Images x from a source domain are mapped to an intermediate domain by a function H, which consists of a color augmentation function and a grayscale conversion. The generator G then transforms the gray image w into the target domain. This process, including projecting y into the intermediate domain, is repeated for the normalized image y to reconstruct the original image. The second path has been omitted for clarity. Further, instead of feeding the network a real image from the respective domain for calculating the identity loss used by Zhu et al., a reconstruction task on unaugmented gray images H(x) is performed. The intermediate domain allows normalizing any H&E staining without having to re-train the model.
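As a rough illustration, the mapping H into the intermediate domain can be sketched as below. The per-channel jitter merely stands in for the color augmentation, and the function name is hypothetical; the actual augmentation and grayscale conversion used by MultiStain-CycleGAN may differ in detail:

```python
import numpy as np

def to_intermediate_domain(rgb_tile, rng=None, jitter=0.1):
    """Map an RGB tile (H, W, 3, uint8) to the intermediate (grayscale) domain.
    The random per-channel scaling stands in for the color augmentation in H."""
    img = rgb_tile.astype(np.float64) / 255.0
    if rng is not None:
        # color augmentation: random per-channel intensity scaling
        img = np.clip(img * rng.uniform(1 - jitter, 1 + jitter, size=3), 0.0, 1.0)
    # luminance-weighted grayscale conversion
    gray = img @ np.array([0.299, 0.587, 0.114])
    return (gray * 255.0).astype(np.uint8)
```

Because the generator only ever sees such gray images, any H&E staining is projected onto the same intermediate representation at inference time, which is what makes the approach multi-domain capable.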

Figure 6 :
Figure 6: Examples of the normalizations of the different methods studied. Tiles are taken from slides of all centers. Our method achieves very good stain adaptation. Furthermore, the problems of template-based normalization are evident in the very different results depending on the template.

Figure 8 :
Figure 8: The dependence of tumor classifier accuracy on the SSIM index (left) and the FID (right). Each method is represented by the five models trained. It is shown that with decreasing SSIM index, tumor classifier performance decreases significantly. This is possibly due to loss of contrast and the associated structural changes and loss of information in the images. The result is similar for the FID, where accuracy decreases with increasing FID.

Figure 11 :
Figure 11: Example images of our proposed normalization. Each row shows images from one of the three untrained centers RST, UMCU and RUMC.

Table 1 :
Estimated mean and standard deviation of domain and tumor classifier accuracy for the different normalization methods analyzed. Furthermore, the mean SSIM index and its standard deviation, as well as the FID for the tumor classifier dataset, are shown.

Figure 7 :
Figure 7: The behavior of the domain classifier under the different normalization methods over the two metrics SSIM index and FID. For each method, the five trained models are visualized. MultiStain-CycleGAN achieves the highest SSIM index while performing well in fooling the domain classifier. Macenko 2 and Vahadane 2 achieve their low domain classifier accuracy only at the cost of a strong perturbation of the image content.

Figure 9 :
Figure 9: The accuracy of the domain classifier over the tumor classifier accuracy for the different methods. MultiStain-CycleGAN is the only method able both to fool the domain classifier and to maintain tumor classifier accuracy. Despite its frequent use, Macenko normalization in our case decreases tumor classifier accuracy in all cases. The Reinhard normalization is very consistent but does not offer the capability to fool the domain classifier.