Generalization of intensity distribution of medical images using GANs

The performance of a CNN-based medical-image classification network depends on the intensities of the training images. Therefore, it is necessary to generalize medical images of various intensities to prevent performance degradation. For lesion classification, the features of the generalized images must be carefully preserved. To maintain the performance of the medical-image classification network and minimize the loss of features, we propose a method that uses a generative adversarial network (GAN) as a generator to adapt an arbitrary intensity distribution to the specific intensity distribution of the training set. We select CycleGAN and UNIT, which can be trained on unpaired medical image data sets. To evaluate each method's performance, the similarity between the generalized images and the originals was measured via the structural similarity index (SSIM) and histograms, and the resulting data sets were passed to a classifier trained only on the original-domain images for accuracy comparisons. The results show that the classification performance on the generalized images is better than that on the originals, confirming that our proposed method is a simple but powerful solution to the performance degradation of a classification network.

performance for new medical images with intensity distributions that are completely different from those of the training data sets. This instability makes it impossible to commercialize CNNs for the medical domain, because it is impossible to obtain a data set that covers all conditions of the image-shooting environment. Also, because there is an unlimited number of new data sets with a variety of intensities, retraining the network on each of them is very expensive. Therefore, to solve this problem practically, generalization of new data sets needs to be considered.
Traditionally, histogram processing, such as histogram equalization and histogram matching, was used to adjust the similarity of intensity distribution. However, it is very difficult to adjust the intensity distribution of all input images to the distribution of a training data set with these methods.
The task of transforming image data from an arbitrary domain into a target domain is known as image-to-image translation. This is a kind of domain adaptation. Image-to-image translation has been actively researched using generative adversarial networks (GANs) [14][15][16] and variational auto-encoders (VAEs) [17,18].
The Pix2Pix network [19] performs image-to-image translation using a paired data set. For each image in the original domain, the paired data set contains an image converted to the target domain. It is not easy to get a paired data set like this, but CycleGAN [20] and UNIT [21] overcame this limitation and showed that a GAN can learn from an unpaired data set. In the medical imaging domain, it is practically impossible to obtain a true paired data set. Therefore, much research has been done with GANs that can be trained on unpaired data sets. Figure 1 shows examples of paired and unpaired image data sets. The paired data shows what multiple chest X-rays of a single person taken on several machines might look like. In reality, such data is virtually unattainable, and the paired data shown are fake images that we created. The unpaired data shows X-rays of different people taken on different machines. This is usually the kind of data set we are dealing with. The intensities of the two sets are very different.
GANs have been applied to medical imaging in earnest since 2017 [22]. In particular, many studies show data augmentation using GANs for image synthesis [23,24], and most of them were conducted using magnetic resonance (MR) and computed tomography (CT) data. Data augmentation is useful for training the network, but it is not a good method for maintaining existing network performance. MR and CT can obtain high-quality images but are less accessible than X-rays. Therefore, image synthesis needs to be considered to maintain a network's performance on X-ray targets.
Fig. 1 Examples of our paired data and unpaired data
In this paper, we propose a generalization framework that adjusts a set of X-ray images with arbitrary intensity distribution to match the intensity distribution of a training set, using CycleGAN and UNIT as generalizers to maintain the accuracy of a medical-image classification network (Fig. 2).

Contributions
This paper presents two contributions:
1. A solution to performance degradation in lesion classification and other tasks via the generalization of medical-image intensity distribution.
2. Data augmentation using intensity generalization in the medical image domain, which suffers from a lack of data.

Paper organization
The rest of the paper is organized as follows. In the "Related works" section, we introduce traditional intensity generalization using histogram processing and recent research on image-to-image translation using GANs, and then describe some GAN applications in the medical image domain. In the "Methods" section, we detail the architecture of our proposed generalization network, which adjusts the intensity of new test images to that of the original training data set. We also provide brief details of CycleGAN and UNIT as generalizers in our network. Performance comparisons of the proposed networks are given in the "Experiments" section. Finally, in the "Conclusion" section, we present our conclusions.

Related works
Adjusting the intensity of arbitrary image data sets to that of a specific image data set is a difficult task. In this section, we introduce the traditional approach and a newer approach based on generative adversarial networks.

Traditional method
Traditionally, histogram matching was used to solve this problem. Histogram matching (or histogram specification) is a method in which an input histogram is matched to a target histogram using their cumulative distribution functions (CDFs). The cumulative histogram is calculated for each image. An input value x_i has a cumulative histogram value G(x_i) in the input image, and a target value x_j has a cumulative histogram value H(x_j) in the target image. The input value x_i is replaced by the value x_j for which G(x_i) is equal to H(x_j). However, this method simply matches the CDFs of two histograms and cannot handle matching between two different data sets. Therefore, it cannot be used for challenges such as finding the probability distribution or the manifold of the training data set.
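The CDF-matching rule described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single pair of 8-bit grayscale images, not the implementation used in the experiments; the function name `match_histogram` and the `levels` parameter are assumptions introduced here.

```python
import numpy as np

def match_histogram(source, target, levels=256):
    """Remap `source` intensities so that its CDF follows the CDF of `target`.

    Both inputs are assumed to be unsigned-integer images with values
    in [0, levels).
    """
    src_hist = np.bincount(source.ravel(), minlength=levels)
    tgt_hist = np.bincount(target.ravel(), minlength=levels)
    G = np.cumsum(src_hist) / source.size  # input CDF, G(x_i)
    H = np.cumsum(tgt_hist) / target.size  # target CDF, H(x_j)
    # For every input level x_i, take the smallest x_j with H(x_j) >= G(x_i).
    lut = np.searchsorted(H, G, side="left").clip(0, levels - 1)
    return lut.astype(source.dtype)[source]
```

Because the lookup table is built from one target histogram, the sketch also makes the limitation visible: it aligns one image with one reference and does not model the distribution of a whole training set.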

Generative adversarial networks
A GAN is a generative model that estimates the probability distribution of a training data set, p(x), and generates new data, G(z), similar to that distribution. This allows the GAN to find the manifold of a specific domain. Sampling in a well-approximated manifold space yields results that are similar to the original but have different details. A GAN consists of two neural networks: the generator and the discriminator. The generator learns to generate images that can fool the discriminator. The discriminator learns to tell whether an input image is an original image or a fake image from the generator. That is, the generator and discriminator have different goals. The discriminator needs to maximize log(D(x)), while the generator needs to minimize log(1-D(G(z))), where z is a random vector. Therefore, this network can be considered adversarial. The vanilla GAN loss is called adversarial loss [14] (Eq. 1). The vanilla GAN draws z from random noise. Because of this, the generator at the beginning of training always produces a completely fake image. This allows the discriminator to distinguish fake inputs perfectly. That is, D(G(z)) approaches 0, log(1-D(G(z))) saturates, and the generator can hardly learn. Therefore, in practice the generator is trained to maximize log(D(G(z))) instead of minimizing log(1-D(G(z))).
To address the problem of generating from a purely random z vector, the conditional GAN (cGAN) [25] was proposed. A conditional input vector, c, is concatenated with the random noise z to obtain a better output image. In the cGAN, the vector combining z and c becomes the input to the generator and enters the loss function, giving Eq. 2, where y is a given label vector.
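For reference, the adversarial loss referred to as Eq. 1 is commonly written as follows in the GAN literature [14]. This is a sketch in standard notation, assuming D(x) is the probability that x is real; the symbols p_data and p_z for the data and noise distributions are assumptions introduced here. The second expression is the generator objective often used in practice in place of minimizing log(1-D(G(z))).

```latex
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]
\[
\max_G \; \mathbb{E}_{z \sim p_z(z)}\big[\log D(G(z))\big]
\]
```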
The c vector does not have a specific type. For example, image labels can be used as the c vector [26].
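The cGAN objective referred to as Eq. 2 is usually stated as below, following the formulation in [25]. It is given here as a sketch: y denotes the conditioning label vector (the role played by c in the text), and p_data and p_z are assumed names for the data and noise distributions.

```latex
\[
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
\]
```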
There are some popular methods using cGAN for image-to-image translation. Pix2Pix proposes a modified cGAN loss function (Eq. 3) combined with L1 regularization (Eq. 4) for denoising the generated result. Their L1 loss also captures the self-similarity between G(x,z) and the label y. However, in the actual implementation of Eq. 3, random noise is not needed because the input image is sufficiently complex. This network shows good performance but has the limitation that it can only be trained on a paired data set. To solve the unpaired data set problem, CycleGAN, an improvement on Pix2Pix, was proposed. It uses an idea called cycle-consistency loss (or reconstruction loss), which performs bidirectional conversion between the source domain and the target domain.
In addition to the cGAN, there are many GAN variants. Some networks use auxiliary classifiers or a VAE. Besides image-to-image translation using cGAN, UNIT, which uses a VAE and weight sharing, has been widely used.
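The two Pix2Pix terms referred to as Eq. 3 and Eq. 4 are commonly written as follows, based on the formulation in [19]. This is a sketch rather than a reproduction of the original equations; x is the input image, y the paired target, z the noise vector, and the weight λ is an assumed symbol for the trade-off between the two terms.

```latex
\[
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x, y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x, z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
\]
\[
\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big]
\]
\[
G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
\]
```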

GANs for computer-aided diagnosis
Various GAN methods have already been applied to CAD, especially in the synthesis, segmentation, reconstruction (such as enhancement or denoising), and classification fields. Most studies have focused on synthesis and segmentation [22], as these tasks are well suited to image-to-image translation.
In segmentation research, Li et al. [27] proposed a cGAN combining Pix2Pix and ACGAN [28] for MR segmentation. Dai et al. [29] proposed the SCAN network, which shows that adversarial loss can be applied to organ segmentation in X-rays.
In synthesis research, the performance of conversion between MR and CT is outstanding. Emami et al. [30] synthesized brain MRIs from CT using cGAN with a paired data set. On the other hand, Wolterink et al. [24] synthesized MRI images into CT images using CycleGAN with an unpaired data set. Dar et al. [31,32] studied the transformation between T1- and T2-weighted MR images using CycleGAN. Mahmood et al. [33] applied adversarial training methods to depth estimation from monocular endoscopy.
Although there have not been many, some studies have used GANs for classification. Madani et al. [34] showed that DCGAN [16] can be used for classification. They used the discriminator as a classifier and conducted data augmentation using generated images in the training process.
The research so far has the following unsatisfactory points:
1. Most studies focus on MR and CT images. There are few studies on X-ray images. X-ray images are readily available in many areas regardless of the medical infrastructure, so application to X-rays is meaningful, too.
2. There is no research on maintaining the performance of a classifier to make it robust to intensity differences. Research has only focused on data augmentation.
3. Most of the research has utilized CycleGAN and Pix2Pix for their tasks and compared their performance. This trend is evident regardless of the application. However, there is no performance comparison between the VAE-based UNIT and CycleGAN.
Therefore, we propose generalizing medical image intensity to maintain the performance of a network using GANs, comparing the performance of CycleGAN and UNIT.

Methods
We propose a generalization method for a new data set with a different intensity distribution from the training data set to maintain the performance of good existing classification networks (Fig. 3).
It is, however, impossible to collect a paired data set in the medical image domain. For example, obtaining a data set with varied intensity distributions for one medical image would require imaging one person on several machines at the same time. This is an impossible task and, indeed, unnecessary. Therefore, we choose as our generalizer a GAN that can be trained with an unpaired data set.
CycleGAN and UNIT, the networks we chose, are popular image-to-image translation GANs that can be trained with unpaired data sets and work well in various domains, including medical imaging. We introduce the two networks below.

Generalizer using CycleGAN
CycleGAN is a widely used network for style transfer tasks. The results of this network tend to maintain features of the original domain, such as the shape of an instance, as much as possible. Figure 4 shows the structure of the CycleGAN used to generalize the intensity distribution of medical images. The key idea is cycle consistency, that is, the loss between the original domain image and the reconstructed image, which enables training on an unpaired data set. The reconstructed image is obtained by transforming the fake image back to the original domain. First, the generator G_XY generates a domain-transformed image, G_XY(X), and then a reconstructed image, G_YX(G_XY(X)), is obtained through G_YX, which maps the transformed image, G_XY(X), back into the original domain. By reducing the loss between the original domain image and the reconstructed image, the network can be trained on an unpaired data set. In this case, CycleGAN uses forward and backward cycle-consistency losses. Forward cycle-consistency loss is the loss in converting from domain X to domain Y and then transforming back to the original domain (Fig. 4a). Backward cycle-consistency loss, on the other hand, is the loss when converting from the target domain to the original domain and then back to the target domain (Fig. 4b). This cycle-consistency loss is similar to the L1 loss containing self-similarity (Eq. 4), and it is summarized in Eq. 5. The final objective function is constructed by adding it to the existing GAN loss.
For model stability and to avoid mode collapse, CycleGAN uses a least-squares loss function [35] (Eq. 6) instead of the vanilla GAN loss (Eq. 1).
The final loss function is given in Eq. 7.
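The forward and backward cycle-consistency terms can be illustrated numerically. The sketch below assumes the usual L1 form of the loss and replaces the two generators with toy intensity mappings (`g_xy`, `g_yx_exact`, `g_yx_poor`), which are hypothetical stand-ins rather than trained networks.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle-consistency loss for image batches x (domain X) and y (domain Y)."""
    # Forward cycle: X -> Y -> X should reproduce x.
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))
    # Backward cycle: Y -> X -> Y should reproduce y.
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))
    return forward + backward

# Toy "generators": simple intensity remappings on images scaled to [0, 1].
g_xy = lambda img: 0.5 * img + 0.2          # X -> Y
g_yx_exact = lambda img: (img - 0.2) / 0.5  # exact inverse of g_xy
g_yx_poor = lambda img: img                 # ignores the intensity shift

rng = np.random.default_rng(0)
x = rng.random((4, 32, 32))
y = g_xy(rng.random((4, 32, 32)))

loss_exact = cycle_consistency_loss(x, y, g_xy, g_yx_exact)
loss_poor = cycle_consistency_loss(x, y, g_xy, g_yx_poor)
```

When the backward mapping truly inverts the forward one, the loss is numerically zero; a mapping that does not undo the intensity change is penalized, which is what ties the two generators together without paired samples.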
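The least-squares adversarial loss (Eq. 6) and the full objective (Eq. 7) are typically written as follows in the CycleGAN formulation [20,35]. This is a sketch under assumed notation: D_X and D_Y are the discriminators of the two domains, \(\mathcal{L}_{cyc}\) denotes the cycle-consistency loss of Eq. 5, and λ is an assumed symbol for its weight. The discriminator D_Y minimizes the first expression and the generator G_XY minimizes the second.

```latex
\[
\mathcal{L}_{D_Y} =
  \mathbb{E}_{y}\big[(D_Y(y) - 1)^2\big]
  + \mathbb{E}_{x}\big[D_Y(G_{XY}(x))^2\big]
\]
\[
\mathcal{L}_{G_{XY}} = \mathbb{E}_{x}\big[(D_Y(G_{XY}(x)) - 1)^2\big]
\]
\[
\mathcal{L}(G_{XY}, G_{YX}, D_X, D_Y) =
  \mathcal{L}_{GAN}(G_{XY}, D_Y)
  + \mathcal{L}_{GAN}(G_{YX}, D_X)
  + \lambda\, \mathcal{L}_{cyc}(G_{XY}, G_{YX})
\]
```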

Generalizer using UNIT
UNIT combines VAEs and GANs, as shown in Fig. 5, and its core idea is a shared latent vector for inferring the joint distribution between data sets from different domains. The aim of the shared latent vector is to limit the space of the joint distribution. Each generator is conceptually divided into an encoder, G_e, and a decoder, G_d. The two encoders, G_eA and G_eB, generate latent vectors, z ~ q(z|x), through weight sharing. Here q is modeled as a Gaussian N(E(x), I), where I is an identity matrix. These vectors are shared by the decoders G_dA and G_dB, and the decoders also share weights. As a result, UNIT is quite complex, using VAE and GAN losses together.
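Sampling z ~ N(E(x), I) can be sketched with the usual reparameterization: add unit-variance Gaussian noise to the encoder mean. In the snippet below, `mean` stands for the encoder output E(x); the function names and the choice of a standard normal prior N(0, I) for the KL term are assumptions made for illustration.

```python
import numpy as np

def sample_latent(mean, rng):
    """Draw z ~ N(mean, I) as mean + eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + eps

def kl_to_standard_normal(mean):
    """KL( N(mean, I) || N(0, I) ), which reduces to 0.5 * ||mean||^2."""
    return 0.5 * np.sum(mean ** 2)

rng = np.random.default_rng(0)
mean = np.full(10000, 3.0)   # stand-in for an encoder output E(x)
z = sample_latent(mean, rng)
```

Because the covariance is fixed to the identity, the encoder only has to predict a mean, and the KL term reduces to a squared-norm penalty on that mean.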
Suppose we have data from different domains, A and B, for the same x. The entire loss function is shown in Eq. 8. UNIT also learns in both directions by applying cycle consistency, but we present the loss for only one direction. The other loss is easily obtained by swapping the domains.
The two VAEs are trained by minimizing a variational upper bound [19] (Eq. 9).

The GAN loss is based on the cGAN loss (Eq. 10).
To model the cycle-consistency constraint, the loss is given by Eq. 11. We use CycleGAN and UNIT as generalizers. The two networks have very different structures depending on whether a VAE is used. In the next section, we compare their performance as generalizers.
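For orientation, the one-directional UNIT losses referred to as Eqs. 8 to 11 can be sketched as below, adapted from the formulation in [21] to the A/B notation used here. The notation is an assumption: q_A and q_B are the encoder distributions, \(p_\eta(z)\) is the prior, \(x_A^{A \to B} = G_{dB}(z_A)\) is the translated image, and \(\lambda_0, \dots, \lambda_4\) are weights. The exact form used in this work may differ.

```latex
\[
\mathcal{L}_A = \mathcal{L}_{VAE_A} + \mathcal{L}_{GAN_A} + \mathcal{L}_{CC_A}
\]
\[
\mathcal{L}_{VAE_A} =
  \lambda_1\, \mathrm{KL}\big(q_A(z_A \mid x_A) \,\Vert\, p_\eta(z)\big)
  - \lambda_2\, \mathbb{E}_{z_A \sim q_A(z_A \mid x_A)}\big[\log p_{G_{dA}}(x_A \mid z_A)\big]
\]
\[
\mathcal{L}_{GAN_A} =
  \lambda_0\, \mathbb{E}_{x_A \sim P_A}\big[\log D_A(x_A)\big]
  + \lambda_0\, \mathbb{E}_{z_B \sim q_B(z_B \mid x_B)}\big[\log\big(1 - D_A(G_{dA}(z_B))\big)\big]
\]
\[
\mathcal{L}_{CC_A} =
  \lambda_3\, \mathrm{KL}\big(q_A(z_A \mid x_A) \,\Vert\, p_\eta(z)\big)
  + \lambda_3\, \mathrm{KL}\big(q_B(z_B \mid x_A^{A \to B}) \,\Vert\, p_\eta(z)\big)
  - \lambda_4\, \mathbb{E}_{z_B \sim q_B(z_B \mid x_A^{A \to B})}\big[\log p_{G_{dA}}(x_A \mid z_B)\big]
\]
```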

Experiments
This section presents the experimental methods and evaluation results. We propose three conditions for evaluating our generalization framework and four methods to check these conditions. Moreover, we compare the performance of three generalizers: CycleGAN, UNIT, and histogram matching. We also describe our data sets in this section. Figure 6 shows an example of the unpaired medical image data set used in the experiments of this paper. Our data set includes frontal chest X-ray images, labeled tuberculosis (TB) or non-TB, from the National Library of Medicine and National Institutes of Health (Bethesda, MD, USA) [36][37][38].

Experiment data
There were two data sets: the Shenzhen data set from Shenzhen No. 3 People's Hospital, captured with Philips DR Digital Diagnose systems, and the Montgomery County (MC) data set from the Department of Health and Human Services of Montgomery County (MD, USA), captured with a Eureka stationary X-ray machine.
The Shenzhen data set consisted of 336 cases with TB and 326 non-TB cases. The MC data set consisted of 58 cases with TB and 80 non-TB cases. Because both data sets were captured using different machines, the intensity distribution of the two data sets is completely different, so the network cannot find lesions in other test data sets, even though its original classification performance is high.

Experimental process
The experiment was divided into a generalization step and a classification step. Our test scenario assumes a situation where a new MC data set is presented to a classifier that learned the Shenzhen data set. In other words, the MC data set was used as a test set for generalization performance, and the Shenzhen data set was used as a training set to train the classifier. Our classifier is based on a pretrained AlexNet [2] with an area under the curve (AUC) of 0.95 ± 0.02. The MC data set is generalized by two generalizers, CycleGAN and UNIT. In the generalization step, the intensity distribution of the MC data set was adjusted to that of the Shenzhen data set using each generalizer.
Fig. 6 Intensity distribution of our unpaired dataset
Unlike with simple translation, the generalization of medical images requires the preservation of highly advanced features. In particular, detailed feature retention is necessary to distinguish the presence or absence of lesions.
For evaluation, we have to consider three conditions. Conditions (1) and (2) can be judged visually, but condition (3) is very difficult to confirm. We propose the following methods to solve this problem:
1. Show visualizations using the two generalizers.
2. Use the histogram as a simple measure for conditions (1) and (2).
Figure 7 shows the results of generalization using CycleGAN (Fig. 7a), UNIT (Fig. 7b), and histogram matching (Fig. 7c). Figure 8 shows the histogram comparison of each method. Table 1 also shows the mean SSIM score over the images and the standard deviation (std). The difference in the results can be seen with the naked eye. The generalized image using CycleGAN is very similar to the training set, which is the target domain, and the intensity of the generalized result (red) is similar to the training domain (blue) in Fig. 8a. Compared with the distribution of the MC domain, the results are also the most distant (Fig. 8b). CycleGAN also showed good performance in the SSIM results (Table 1) as well as in the visual and histogram results. Its SSIM is 0.737, a higher score than that of the other methods. The std is also small, meaning CycleGAN is a stable method for the generalization task. It can be confirmed that all the features of the lung are maintained while the intensity transformation is properly performed.
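The per-image SSIM and its mean ± std summary can be sketched as follows. Note the simplification: the standard SSIM is computed over local windows and then averaged, whereas this sketch uses global image statistics, so its values will not match Table 1. The function names and the `data_range` parameter are assumptions introduced for illustration.

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM computed from global means, variances and covariance."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def ssim_mean_std(originals, generalized):
    """Mean and standard deviation of the per-image SSIM over a data set."""
    scores = np.array([global_ssim(o, g) for o, g in zip(originals, generalized)])
    return scores.mean(), scores.std()
```

An image compared with itself scores 1, and a pure brightness shift lowers the score through the luminance term even though the structure is unchanged, which is why SSIM is read here together with the histograms.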

Experimental result
Generalization with UNIT, on the other hand, gave poor results. In the visualization, the biggest problem was that the blurring was very severe (Fig. 7b). UNIT also did not show intensities similar to the target domain in the histogram results (yellow) in Fig. 8a, where it has the furthest distribution from the target domain. This can be seen immediately in the SSIM results. The SSIM of UNIT was 0.691, which was lower than that of histogram matching. However, UNIT still performed better than histogram matching in preserving features. As can be seen in Fig. 9, histogram-matched images were completely blacked out, and the features had completely disappeared. This shows that UNIT did not preserve the structural characteristics of the original domain. This is minimized when an image in the generator's target domain is given as input. That is, when the image in the target domain comes in, the generator does nothing. This is especially good for coloring. UNIT also showed poor results in the preservation of features.
However, it is difficult to be sure whether the features are well preserved by evaluating only the visualization results. Therefore, we use a pretrained classifier to test the accuracy on the resulting data sets. The accuracy was confirmed by the ROC curve and AUC.
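The AUC used for this comparison can be computed directly from classifier scores. The sketch below uses the rank-based (Mann-Whitney) form of the AUC: the fraction of TB/non-TB pairs in which the TB image receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[labels]    # scores of positive (e.g. TB) cases
    neg = scores[~labels]   # scores of negative (e.g. non-TB) cases
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Toy example: 2 negatives and 2 positives, one positive ranked below a negative.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form needs quadratic memory in the number of images, which is acceptable for a test set the size of the MC data set (138 images).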

Table 1 The SSIM between the three methods and the original data set
We calculate the SSIM for each image, then take the mean and the standard deviation (std) over all images. We denote the result as mean ± std. The CycleGAN generalizer achieves an SSIM of 0.737 (given in italic), higher than that of the other methods

Method                       SSIM (%)
Generalizer-CycleGAN         73.7 ± 5.92
Generalizer-UNIT             69.1 ± 4.84
Histogram matched dataset    73.2 ± 8.23

Figure 10 shows the ROC curve results. The difference between the curves of the methods can be seen clearly. As seen in Table 2, the AUC of CycleGAN is 0.84 and that of UNIT is 0.81. This is a notable improvement, because the AUC of the original data set is 0.73. This shows that the images generalized through the GANs were appropriately converted to the target domain while preserving the important features.
We have shown through experiments that the intensity generalization of medical images through GAN is effective. Generalizers using CycleGAN (given in italic) showed the best performance in all experiments

Conclusion
In this paper, we proposed a method to generalize the intensity of arbitrary medical images by using a GAN generalizer based on CycleGAN or UNIT (which is based on a VAE) to maintain the accuracy of a medical-image classification network. Performing generalization without losing important features of lesions is a very sensitive task, and we evaluated the results in the following way. We created three data sets, based on the two generalizers and histogram matching. We presented the detailed result images and the intensity distribution of the data sets using histograms and measured the similarity of the generalized results numerically using SSIM. We also evaluated the accuracy of the proposed method and the existing method with AUC. As a result, both generalization methods using the GAN achieved AUCs 0.08 to 0.11 higher than that of the original data set (0.81 and 0.84 versus 0.73). We confirmed that our proposed method creates images whose intensity distribution is very similar to that of the training domain data set without significant feature loss. We have also shown that CycleGAN, which maintains the characteristics of instances, is more suitable for the generalization of medical images. These results show that our proposed generalization is an effective method to maintain performance in a classification network that suffers from performance degradation due to differences in the intensity of medical images. Recently, generator structures that greatly improve the quality of the generated image [39,40] and a model with advanced few-shot capability [41] have been proposed. As future work, applying these methods to our generalization module could improve the robustness and accuracy of our framework.