DU-GAN: Generative Adversarial Networks with Dual-Domain U-Net Based Discriminators for Low-Dose CT Denoising

Low-dose computed tomography (LDCT) has drawn major attention in the medical imaging field due to the potential health risks of CT-associated X-ray radiation to patients. Reducing the radiation dose, however, decreases the quality of the reconstructed images, which consequently compromises diagnostic performance. Various deep learning techniques have been introduced to improve the quality of LDCT images through denoising. GAN-based denoising methods usually leverage an additional classification network, i.e., a discriminator, to learn the most discriminative difference between the denoised and normal-dose images and hence regularize the denoising model accordingly; such a discriminator often focuses on either the global structure or the local details. To better regularize the LDCT denoising model, this paper proposes a novel method, termed DU-GAN, which leverages U-Net based discriminators in the GAN framework to learn both global and local differences between the denoised and normal-dose images in both the image and gradient domains. The merit of such a U-Net based discriminator is that it can not only provide per-pixel feedback to the denoising network through the outputs of the U-Net but also focus on the global structure at a semantic level through the middle layer of the U-Net. In addition to the adversarial training in the image domain, we also apply another U-Net based discriminator in the image gradient domain to alleviate the artifacts caused by photon starvation and enhance the edges of the denoised CT images. Furthermore, the CutMix technique enables the per-pixel outputs of the U-Net based discriminator to provide radiologists with a confidence map that visualizes the uncertainty of the denoised results, facilitating LDCT-based screening and diagnosis. Extensive experiments on simulated and real-world datasets demonstrate superior performance over recently published methods both qualitatively and quantitatively.


I. INTRODUCTION
Computed tomography (CT), one of the most important imaging modalities in clinical diagnosis, provides cross-sectional images of the internal body using X-ray radiation. Although CT plays an essential role in diagnosing diseases, its widespread use is raising growing public concern about its safety, since CT-related X-ray radiation may cause unavoidable damage to human health and induce cancers. Consequently, keeping the radiation dose of CT as low as reasonably achievable (ALARA) has been a well-accepted principle in CT-related research over the past decades [1]. Reducing the radiation dose, however, inevitably introduces noise and artifacts into the reconstructed images, severely compromising the subsequent diagnosis and other tasks such as LDCT-based lung nodule classification [2].
A straightforward way to address this issue is to reduce the noise in the LDCT image [3], [4]. However, it remains a challenging problem due to its ill-posed nature. In recent years, various deep learning based methods have been proposed for LDCT denoising [5], [6], [7], [8], [9], [10], [11], achieving impressive results. There are two key components in designing a denoising model: network architecture and loss function; the former one can determine the capacity of the denoising model while the latter one can control how the denoised images visually look like [7]. Although different network architectures such as 2D convolutional neural networks (CNNs) [5], 3D CNNs [7], [10], and residual encoder-decoder CNNs (RED-CNN) [12] have been explored for LDCT denoising, literature has shown that the loss function is relatively more important than the network architecture as it has a direct impact on the image quality [7], [13].
One of the most popular loss functions is the mean-squared error (MSE), which computes the average of the squared per-pixel errors between the denoised and normal-dose images. Although it achieves impressive performance in terms of peak signal-to-noise ratio (PSNR), MSE usually leads to over-smoothed images and has been shown to correlate poorly with the human perception of image quality [14], [15]. In view of this observation, alternative loss functions such as perceptual loss, $\ell_1$ loss, and adversarial loss have been investigated for LDCT denoising. Among them, adversarial loss has been shown to be a powerful one, as it dynamically measures the similarity between the denoised and normal-dose images during training, which enables the denoised images to preserve more texture information from the normal-dose ones. The adversarial loss is computed from the discriminator, a classification network that learns a representation differentiating the denoised images from the normal-dose images; it measures the most discriminative difference at either a global or a local level, depending on whether one unit of the discriminator output corresponds to the whole image or to a local region. Such a discriminator is prone to forgetting previous differences because the distribution of synthetic samples shifts as the generator constantly changes during training, so it fails to maintain a powerful data representation that characterizes the global and local image differences [16]. As a result, the generated images often exhibit discontinuous and mottled local structures [17] or incoherent geometric and structural patterns [18]. In addition to the noise, LDCT images may contain severe streak artifacts caused by photon starvation, which may not be effectively removed by a loss function defined solely in the image domain.
To learn a powerful data representation that regularizes the denoising model in adversarial training, we propose a U-Net [19] based discriminator in the GAN framework for LDCT denoising, termed DU-GAN, which can simultaneously learn the global and local differences between the denoised and normal-dose images in the image and gradient domains. More specifically, our proposed discriminator follows the U-Net architecture with an encoder and a decoder network, where the encoder encodes the input to a scalar value focusing on the global structures while the decoder reconstructs a per-pixel confidence map capturing the changes of local details between the denoised and normal-dose images. In doing so, it can provide the denoising network with not only per-pixel feedback but also the global structural difference. In addition to the adversarial training in the image domain, we also apply another U-Net based discriminator in the image gradient domain to alleviate the artifacts caused by photon starvation and enhance the edges of the denoised images. Moreover, to regularize the U-Net based discriminator, we introduce the CutMix data augmentation to mix the denoised and normal-dose images. Consequently, the U-Net based discriminator can provide radiologists with its per-pixel outputs as a confidence map that visualizes the uncertainty of the denoised results, which can facilitate radiologists' screening and diagnosis when using the denoised LDCT images.
The benefits of the proposed DU-GAN are as follows.
1) Unlike existing GAN-based denoising methods that use a classification network as the discriminator, the proposed DU-GAN utilizes a U-Net based discriminator for LDCT denoising, which can simultaneously learn the global and local differences between the denoised and normal-dose images. Consequently, it can provide the denoising model with not only per-pixel feedback but also the global structural difference.
2) In addition to adversarial training in the image domain, the proposed DU-GAN also performs adversarial training in the image gradient domain, which can alleviate the streak artifacts caused by photon starvation and enhance the edges of the denoised images.
3) The proposed DU-GAN can provide radiologists with a confidence map visualizing the uncertainty of the denoised results through the CutMix technique, which could facilitate radiologists' screening and diagnosis when using the denoised LDCT images.
4) Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of the proposed method through both qualitative and quantitative comparisons.
The remainder of this paper is organized as follows. We briefly survey the developments of LDCT denoising methods and generative adversarial networks in Section II. We present our LDCT denoising framework DU-GAN with dual-domain U-Net based discriminators, and then introduce the CutMix regularization technique as well as the network architectures and loss functions of our framework in Section III, followed by both qualitative and quantitative comparisons with state-of-the-art methods on the simulated and real-world datasets in Section IV. Finally, we conclude this paper in Section V.

II. RELATED WORK
This section briefly surveys the development of LDCT denoising and generative adversarial networks.

A. LDCT Denoising
The noise reduction algorithms for LDCT can be summarized into three categories: 1) sinogram filtration; 2) iterative reconstruction; and 3) image post-processing. As a significant difference from routine CT, the LDCT acquires noisy sinogram data from scanner. A straightforward solution is to perform the denoising process on the sinogram data before image reconstruction, i.e. sinogram filtration-based methods [20], [21], [22]. Iterative reconstruction methods combine the statistics of raw data in the sinogram domain [23], [24] and the prior information in the image domain such as total variation [25] and dictionary learning [26]; these pieces of generic information can be effectively integrated in the maximum likelihood and compressed sensing frameworks. These two categories, however, require the access to raw data that are typically unavailable from commercial CT scanner.
[Figure caption: The generator produces denoised LDCT images, and two independent branches with U-Net based discriminators operate in the image and gradient domains. The U-Net based discriminator provides both global structure and local per-pixel feedback to the generator. Furthermore, the image discriminator encourages the generator to produce photo-realistic CT images while the gradient discriminator is utilized for better edges and for alleviating streak artifacts caused by photon starvation.]
Different from the previous two categories, image post-processing methods operate directly on the reconstructed images, which are publicly available once patient-identifying information has been removed. Traditional methods such as non-local means [27] and block-matching 3D [28], however, lose some critical structural details and produce over-smoothed denoised LDCT images. The rapid development of deep learning techniques has advanced many medical applications. In LDCT denoising, deep-learning-based models have achieved impressive results [5], [7], [9], [10], [12], [29]. There are two critical components in designing a deep-learning-based denoising model: network architecture and loss function; the former determines the capacity of a denoising model while the latter controls how the denoised images look visually.
Although several different network architectures have been proposed for LDCT denoising, such as 2D CNNs [5], 3D CNNs [7], [10], RED-CNN [5], and cascaded CNNs [12], the literature has shown that the loss function plays a relatively more important role than the network architecture as it has a direct impact on the image quality [7], [13]. The simplest loss function is the MSE, which, however, has been shown to correlate poorly with the human perception of image quality [14], [15]. In view of this observation, alternative loss functions such as perceptual loss, $\ell_1$ loss, adversarial loss, or mixed loss functions have been investigated for LDCT denoising. Among them, adversarial loss has been shown to be a powerful one, as it dynamically measures the similarity between the denoised and normal-dose images during training, which enables the denoised images to preserve more texture information from the normal-dose ones. Adversarial loss reflects either global or local similarity, depending on the design of the discriminator. Unlike the conventional adversarial loss, the adversarial loss used in this study is based on a U-Net based discriminator, which can simultaneously characterize the global and local differences between the denoised and normal-dose images, better regularizing the denoising model. That is, DU-GAN enjoys the advantages of both the per-pixel discriminator, which captures changes at the pixel level, and the traditional classification discriminator, which focuses on global structures. In addition to the adversarial loss in the image domain, the adversarial loss in the image gradient domain proposed in this paper can alleviate the streak artifacts caused by photon starvation and enhance the edges of the denoised images.

B. Generative Adversarial Networks (GANs)
As one of the hottest research topics in recent years, GANs [14] and their variants have been successfully applied to various tasks [30], [31], [32]. They typically consist of two networks: 1) a generator that learns to capture the distribution of the training data and produce new samples that are indistinguishable from the real ones, and 2) a discriminator that attempts to distinguish real samples from fake ones produced by the generator. These two networks are trained alternately, and training ends once a balance is achieved. In the context of LDCT denoising, the generator aims to produce photo-realistic denoised results to fool the discriminator, while the discriminator tries to distinguish the real normal-dose CT (NDCT) images from the denoised ones. To stabilize GAN training, several variants have been proposed, such as Wasserstein GAN (WGAN) [33], WGAN with gradient penalty (WGAN-GP) [34], and least-squares GANs [35].
In this paper, we adopt the least-squares GANs [35], spectral normalization [36], and U-Net based discriminator [16] to form the GANs framework for LDCT denoising. As a significant difference, our DU-GAN performs adversarial training in both image and gradient domains, which can reduce noise and alleviate streak artifacts simultaneously. We note that the proposed DU-GAN is also suitable for other variants of GAN such as WGAN and WGAN-GP.
III. METHODOLOGY
Fig. 1 presents the proposed DU-GAN for LDCT denoising, which contains a denoising model as the generator and two U-Net based discriminators, one in the image domain and one in the gradient domain. We highlight that the U-Net based discriminator is able to learn the global and local differences between denoised and normal-dose images. Next, we present all components, the network architecture, and the loss functions in detail, followed by a discussion of the complexity of the framework.

A. The Denoising Process
The denoising process learns a generative model $G$ that maps an LDCT image $I_{LD} \in \mathbb{R}^{w \times h}$ of size $w \times h$ to its normal-dose CT (NDCT) counterpart $I_{ND} \in \mathbb{R}^{w \times h}$ by removing the noise in the LDCT image. Formally, it can be written as:
$$I_{den} = G(I_{LD}), \tag{1}$$
where $I_{den}$ denotes the denoised LDCT image. LDCT denoising can be seen as a specific image translation problem. Therefore, GAN-based methods [7], [8], [9], [37] utilize GANs to improve the visual quality of denoised LDCT images, thanks to the strong capability of GANs in generating high-quality images. Different from conventional GANs, which take a noise vector to generate an image, our denoising model serves as a generator that takes only the LDCT image as input. In this study, we used RED-CNN [6] as the denoising model to demonstrate the effectiveness of the dual-domain U-Net based discriminators in adversarial training.

B. Dual-Domain U-Net Based Discriminator
GAN-based methods [7], [8], [9], [37] for LDCT denoising usually maintain the GAN competition at the structural level: their discriminators progressively downsample the input into a scalar value and are trained with Wasserstein GANs [33], [34], as shown in Fig. 2(a). However, such a discriminator is prone to forgetting previous samples because the distribution of synthetic samples shifts as the generator constantly changes during training, so it fails to maintain a powerful data representation that characterizes the global and local image differences [16], [38].
To address the problems above, we introduce the U-Net based discriminators in both image and gradient domains.
1) U-Net based discriminator in the image domain: To learn a powerful data representation that can characterize both global and local difference, we design an LDCT denoising framework based on GANs to deal with LDCT denoising. Traditionally, U-Net contains an encoder, a decoder, and several skip connections copying the feature-maps from the encoder to the decoder to preserve high-resolution features, which has demonstrated its state-of-the-art performance in many semantic segmentation tasks [39], [40] and image translation tasks [16], [31]. In the context of LDCT denoising, we highlight that U-Net and its variants are only used as the denoising model, which have not been explored as the discriminator. We adopt the U-Net to replace the standard classification discriminator in GANs to have a U-Net style discriminator that allows the discriminator to maintain both global and local data representation. Fig. 2(b) details the architecture of U-Net based discriminator.
Here, we use $D_{img}$ to denote the U-Net based discriminator in the image domain. The encoder of $D_{img}$, $D_{img}^{enc}$, follows the traditional discriminator that progressively downsamples the input using several convolutional layers, capturing the global structural context. The decoder $D_{img}^{dec}$, on the other hand, performs progressive upsampling with skip connections from the encoder $D_{img}^{enc}$ in reverse order, further enhancing the ability of the discriminator to capture the local details of real and fake samples. Furthermore, the discriminator loss is computed from the outputs of both $D_{img}^{enc}$ and $D_{img}^{dec}$, while the traditional discriminator used in previous works [7], [8], [37] only classifies the inputs as real or fake from the encoder. In doing so, the U-Net based discriminator can provide more informative feedback to the generator, including both local per-pixel and global structural information. In this paper, we employ least-squares GANs [35] rather than conventional GANs [14] for the discriminators to stabilize the training process and improve the visual quality of the denoised LDCT images. Formally, the discriminator loss for $D_{img}$ from both $D_{img}^{enc}$ and $D_{img}^{dec}$ can be written as:
$$\mathcal{L}_{D_{img}} = \mathbb{E}_{I_{ND}}\big[(D_{img}^{enc}(I_{ND}) - 1)^2\big] + \mathbb{E}_{I_{LD}}\big[(D_{img}^{enc}(I_{den}))^2\big] + \mathbb{E}_{I_{ND}}\big[\|D_{img}^{dec}(I_{ND}) - 1\|_2^2\big] + \mathbb{E}_{I_{LD}}\big[\|D_{img}^{dec}(I_{den})\|_2^2\big], \tag{2}$$
where 1 is the decision boundary of least-squares GANs.
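As an illustration of how the encoder and decoder outputs enter a least-squares discriminator loss, the following minimal sketch uses plain Python lists in place of network outputs. The function names (`mean_sq`, `unet_disc_loss`) and the choice to average the per-pixel term over the map are assumptions made for this example, not details taken from the paper.

```python
from typing import List

Map = List[List[float]]  # per-pixel decoder output (raw confidence scores)


def flatten(m: Map) -> List[float]:
    """Flatten a 2D map into a list of pixel values."""
    return [v for row in m for v in row]


def mean_sq(values: List[float], target: float) -> float:
    """Mean squared distance of `values` from `target`."""
    return sum((v - target) ** 2 for v in values) / len(values)


def unet_disc_loss(enc_real: float, dec_real: Map,
                   enc_fake: float, dec_fake: Map) -> float:
    """Least-squares loss: real outputs are pulled to 1, fake outputs to 0."""
    # Global term from the encoder's scalar score.
    global_term = (enc_real - 1.0) ** 2 + enc_fake ** 2
    # Local term from the decoder's per-pixel map (averaged over pixels here).
    local_term = (mean_sq(flatten(dec_real), 1.0)
                  + mean_sq(flatten(dec_fake), 0.0))
    return global_term + local_term


ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(unet_disc_loss(1.0, ones, 0.0, zeros))  # a perfect discriminator incurs no loss
```

An undecided discriminator that outputs 0.5 everywhere is penalized equally by all four terms, which is the behaviour expected when 1 serves as the decision boundary for real samples.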
2) U-Net based discriminator in the gradient domain: However, the competition in the image domain alone can only push the generator towards photo-realistic denoised LDCT images; it is insufficient to encourage better edges that preserve the pathological changes of the original NDCT images, or to alleviate the streak artifacts caused by photon starvation in LDCT. Previous methods such as [9] measure the MSE difference in the gradient domain, which may be insufficient to enhance the edges as MSE tends to blur images. To this end, we propose to perform an additional GAN competition in the gradient domain; our motivation is presented in Fig. 3. Specifically, the streaks and edges in CT images are highlighted in their horizontal and vertical gradient magnitudes. Therefore, another branch operating on the gradients estimated by a Sobel operator [41] is added alongside the image branch, which encourages better edge information and alleviates streak artifacts. Similar to (2), we can define the discriminator loss in the gradient domain, $\mathcal{L}_{D_{grd}}$, where $D_{grd}$ represents the discriminator in the gradient domain.
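The gradient branch can be illustrated with a small pure-Python Sobel filter. This is only a sketch: the text does not state how the horizontal and vertical responses are combined or how borders are handled, so the Euclidean magnitude and the "valid" (unpadded) output used here are assumptions, as are the function names.

```python
from typing import List

Image = List[List[float]]

KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal intensity changes
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical intensity changes


def correlate_valid(img: Image, kernel: List[List[int]]) -> Image:
    """Slide a 3x3 kernel over the image without padding."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)]
            for i in range(h - 2)]


def sobel_magnitude(img: Image) -> Image:
    """Per-pixel gradient magnitude from the two Sobel responses."""
    gx = correlate_valid(img, KX)
    gy = correlate_valid(img, KY)
    return [[(x * x + y * y) ** 0.5 for x, y in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]


# A vertical step edge: flat regions give zero response, the edge gives a strong one.
step = [[0, 0, 1, 1] for _ in range(4)]
print(sobel_magnitude(step))
```

Because flat regions map to zero, the gradient image is sparse, and edges and streaks stand out for the gradient-domain discriminator.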
3) Dual-domain U-Net based discriminators: Combining the U-Net based discriminators in the image and gradient domains, two independent GAN competitions are maintained during training. The overall framework of our proposed LDCT denoising model is shown in Fig. 1. In detail, the generator denoises an LDCT image, and the result is then fed into two independent discriminators operating in the image and gradient domains. The discriminator $D_{img}$ in the image domain branch pushes the generator towards photo-realistic denoised LDCT images, while the discriminator $D_{grd}$ in the gradient domain branch encourages better edges and alleviates streak artifacts caused by photon starvation. Additionally, the discriminator in each branch employs a U-Net based architecture to encourage the generator to focus on both global structure and local details, which can also boost the interpretability of the denoising process through the per-pixel confidence maps output by $D_{img}^{dec}$ and $D_{grd}^{dec}$. Finally, the dual-domain U-Net based discriminator loss can be defined as follows:
$$\mathcal{L}_{D_{dual}} = \mathcal{L}_{D_{img}} + \mathcal{L}_{D_{grd}}. \tag{3}$$

C. CutMix Regularization
The capability of the discriminator $D_{img}$ to recognize the local differences between real and fake samples decreases as training proceeds, which may unexpectedly harm the denoising performance. Besides, the discriminator is supposed to focus on structural changes at the global level and on local details at the per-pixel level. To address these issues, we adopt the CutMix augmentation technique to regularize the discriminator, inspired by [16], [42], which empowers the discriminator to learn the intrinsic difference between real and fake samples. Specifically, the CutMix technique generates a new training image from two images by cutting patches from one and pasting them onto the other. We define this augmentation technique in the context of LDCT denoising as follows:
$$I_{mix} = M \odot I_{ND} + (1 - M) \odot I_{den}, \tag{4}$$
where $M \in \{0, 1\}^{w \times h}$ is a binary mask controlling how to mix the NDCT and denoised images, and $\odot$ represents element-wise multiplication.
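The mixing operation itself is a per-pixel selection between the two images. A minimal sketch with nested lists follows; the function name `cutmix` and the toy 2 × 2 inputs are illustrative only.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]


def cutmix(i_nd: Image, i_den: Image, mask: Mask) -> Image:
    """Take the NDCT pixel where the mask is 1 and the denoised pixel where it is 0."""
    return [[m * a + (1 - m) * b for a, b, m in zip(row_nd, row_den, row_m)]
            for row_nd, row_den, row_m in zip(i_nd, i_den, mask)]


i_nd = [[1.0, 1.0], [1.0, 1.0]]     # stand-in for a normal-dose patch
i_den = [[0.0, 0.0], [0.0, 0.0]]    # stand-in for a denoised patch
mask = [[1, 0], [0, 1]]
print(cutmix(i_nd, i_den, mask))
```

Since the mask is binary, the mixed image never blends intensities; every pixel comes entirely from one source, which is what allows the mask to serve as a per-pixel label.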
The mixed samples should be regarded as fake samples globally by the encoder $D_{img}^{enc}$, since the CutMix operation has destroyed the global context of the NDCT image; otherwise, CutMix-like patterns may be introduced into the denoised LDCT images during GAN training, causing undesirable denoising. Similarly, $D_{img}^{dec}$ should be able to recognize the mixed area to provide the generator with accurate per-pixel feedback. Therefore, the regularization loss of CutMix can be formulated as:
$$\mathcal{L}_{reg} = \mathbb{E}_{I_{mix}}\big[(D_{img}^{enc}(I_{mix}))^2\big] + \mathbb{E}_{I_{mix}}\big[\|D_{img}^{dec}(I_{mix}) - M\|_2^2\big], \tag{5}$$
where $M$ used in CutMix also serves as the ground truth for $D_{img}^{dec}$. Furthermore, to penalize the discriminator when its outputs are inconsistent with the per-pixel predictions after the CutMix operation, we introduce a consistency loss following [16], which can be written as:
$$\mathcal{L}_{con} = \mathbb{E}\Big[\big\|D_{img}^{dec}(I_{mix}) - \big(M \odot D_{img}^{dec}(I_{ND}) + (1 - M) \odot D_{img}^{dec}(I_{den})\big)\big\|_F^2\Big], \tag{6}$$
where $\|\cdot\|_F$ represents the Frobenius norm.
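The two CutMix terms can be sketched on toy decoder outputs as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, the per-pixel regularization term is averaged over the map, and the consistency term uses the squared Frobenius norm as in [16]; the exact normalization in the original formulation may differ.

```python
from typing import List

Map = List[List[float]]
Mask = List[List[int]]


def cutmix_reg_loss(enc_mix: float, dec_mix: Map, mask: Mask) -> float:
    """Encoder should score the mixed image as fake (0); decoder should reproduce the mask."""
    global_term = enc_mix ** 2
    pairs = [(d, m) for rd, rm in zip(dec_mix, mask) for d, m in zip(rd, rm)]
    local_term = sum((d - m) ** 2 for d, m in pairs) / len(pairs)
    return global_term + local_term


def consistency_loss(dec_mix: Map, dec_nd: Map, dec_den: Map, mask: Mask) -> float:
    """Squared Frobenius distance between D_dec(mix) and the mix of the D_dec outputs."""
    total = 0.0
    for r_mix, r_nd, r_den, r_m in zip(dec_mix, dec_nd, dec_den, mask):
        for x, a, b, m in zip(r_mix, r_nd, r_den, r_m):
            total += (x - (m * a + (1 - m) * b)) ** 2
    return total


mask = [[1, 0], [0, 1]]
ideal = [[1.0, 0.0], [0.0, 1.0]]            # decoder output that matches the mask
ones = [[1.0, 1.0], [1.0, 1.0]]             # decoder output on the NDCT image
zeros = [[0.0, 0.0], [0.0, 0.0]]            # decoder output on the denoised image
print(cutmix_reg_loss(0.0, ideal, mask))
print(consistency_loss(ideal, ones, zeros, mask))
```

Both terms vanish when the decoder's response to the mixed image equals the mask-wise mixture of its responses to the two source images, which is the equivariance property the consistency loss is meant to encourage.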
During training, the binary mask $M$ is generated following the same pipeline as [42], [43]. Specifically, we first sample the combination ratio $r$ from the beta distribution Beta(1, 1) and then uniformly sample the top-left coordinates of the bounding box of the region cropped from $I_{ND}$ and pasted into $I_{den}$, while preserving the ratio $r$. Similar to [42], [43], we employ a probability $p_{mix}$ to control whether to apply the CutMix regularization technique to each mini-batch, which is empirically set to 0.5. Fig. 4 presents the visual results of $D_{img}$ with the CutMix regularization technique. It can be observed that the outputs of $D_{img}^{dec}$ are the spatial combination of the real and generated patches with respect to the real/fake classification score. These results demonstrate the strong discriminative capability of the U-Net based discriminator in accurately learning per-pixel differences between real and generated samples, even when they are cut and mixed together to fool the discriminator. Besides learning the per-pixel local details, $D_{img}^{enc}$ can accurately predict the proportion of real patches, i.e., the mixed ratio, as it focuses on the global structures.
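The mask sampling described above can be sketched with the standard library alone. Several details here are assumptions rather than facts from the text: the box is taken to mark NDCT pixels (mask value 1), its sides are scaled by the square root of $r$ so that its area fraction is roughly $r$, and the helper name `sample_cutmix_mask` is hypothetical.

```python
import random
from typing import List, Tuple


def sample_cutmix_mask(h: int, w: int,
                       rng: random.Random) -> Tuple[List[List[int]], float]:
    """Sample a binary mask whose box of ones covers roughly a fraction r of the image."""
    r = rng.betavariate(1.0, 1.0)     # Beta(1, 1) is uniform on [0, 1]
    scale = r ** 0.5                  # side lengths scale with sqrt(r), area with r
    bh, bw = round(h * scale), round(w * scale)
    top = rng.randint(0, h - bh)      # uniform top-left corner keeping the box inside
    left = rng.randint(0, w - bw)
    mask = [[0] * w for _ in range(h)]
    for i in range(top, top + bh):
        for j in range(left, left + bw):
            mask[i][j] = 1            # 1 marks pixels taken from the NDCT image
    return mask, r


rng = random.Random(0)
P_MIX = 0.5                            # probability of applying CutMix to a mini-batch
if rng.random() < P_MIX:
    mask, r = sample_cutmix_mask(64, 64, rng)
    real_fraction = sum(map(sum, mask)) / (64 * 64)
    print(round(r, 3), round(real_fraction, 3))
```

The fraction of ones in the mask approximates $r$ up to rounding of the box size, which is the quantity the encoder is reported to predict for mixed inputs.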

D. Network Architecture
As we described above, our proposed method follows the GANs framework to optimize the generator effectively for LDCT denoising, with the U-Net based discriminator focusing on both global structures and local details, and an extra gradient branch encouraging better boundaries and details. In this subsection, we describe the network architectures of the generator and U-Net based discriminator.
1) RED-CNN based generator: In this paper, we employ RED-CNN [6] as the generator of our framework for LDCT denoising since this paper mainly focuses on the adversarial loss from dual-domain U-Net based discriminators. The main difference from [6] is that our framework is optimized in GANs manner, while the vanilla RED-CNN suffers the problem of over-smoothened LDCT images with MSE. Specifically, RED-CNN employs the U-Net architecture but removes the downsampling/upsampling operations to prevent information loss. We stack 10 (de)convolutional layers at both encoder and decoder, each of which has 32 filters for the sake of the computation cost, followed by a ReLU activation function. There are in total 10 residual skip connections. It is important to note that although RED-CNN is adopted as the generator in our framework, the proposed method can be also adapted to other GANs-based methods such as CPCE [7] and WGAN-VGG [8] with only changing the discriminators.
2) U-Net based discriminator: As detailed in Section III-B, there are two independent discriminators in both image and gradient domains, each of which follows a U-Net architecture. Specifically, D enc has 6 downsampling ResBlocks [44] with increasing number of filters; i.e. 64, 128, 256, 512, 512, and 512. At the bottom of D enc , a fully-connected layer is used to output the global confidence score. Similarly, D dec used the same number of ResBlocks in a reverse order to process the bilinearly upsampled features and the skip residuals of the same resolution, followed by a 1 × 1 convolutional layer to output the per-pixel confidence map. Most importantly, a spectral normalization layer [36] and a Leaky ReLU activation with a slope of 0.2 for negative input follow each convolutional layer of D except the last one.
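To make the encoder layout concrete, the following sketch tracks feature-map shapes through the six downsampling ResBlocks. It assumes that each block halves the spatial resolution and that the input is a 64 × 64 training patch; the function name is hypothetical and the sketch does not model the layers themselves.

```python
from typing import List, Sequence, Tuple

ENC_FILTERS = (64, 128, 256, 512, 512, 512)  # filter counts listed in the text


def encoder_shapes(patch: int,
                   filters: Sequence[int] = ENC_FILTERS) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each downsampling ResBlock."""
    shapes = []
    size = patch
    for channels in filters:
        size //= 2  # assumption: every ResBlock halves height and width
        shapes.append((channels, size, size))
    return shapes


for shape in encoder_shapes(64):
    print(shape)
```

Under these assumptions, a 64 × 64 patch is reduced to a 1 × 1 map with 512 channels after six halvings, which is where the fully-connected layer would produce the global confidence score, and the decoder would revisit the same resolutions in reverse order.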
We note that the network architectures of the generator and discriminator were proposed in literature; we did not propose a new network architecture to achieve the performance gain. One of our key contributions is to use the U-net as the discriminator in dual-domain to capture both local details and global structures for LDCT denoising.
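E. Loss Functions
1) Adversarial loss: Here we employ the sum of the two branches as the adversarial loss, which is defined in the context of least-squares GANs as follows:
$$\mathcal{L}_{adv} = \mathbb{E}_{I_{LD}}\big[(D_{img}^{enc}(I_{den}) - 1)^2\big] + \mathbb{E}_{I_{LD}}\big[\|D_{img}^{dec}(I_{den}) - 1\|_2^2\big] + \mathbb{E}_{I_{LD}}\big[(D_{grd}^{enc}(\nabla I_{den}) - 1)^2\big] + \mathbb{E}_{I_{LD}}\big[\|D_{grd}^{dec}(\nabla I_{den}) - 1\|_2^2\big], \tag{7}$$
where $\nabla$ denotes the Sobel operator used to obtain the image gradient.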
2) Pixel-wise loss: To encourage the generator to output denoised LDCT images that match the NDCT images at both the pixel level and the gradient level, we adopt a pixel-wise loss between the NDCT images and the denoised LDCT images, which includes a pixel loss and a gradient loss, one for each branch, as shown in Fig. 1. The additional gradient loss encourages better preservation of edge information at the pixel level. The two losses can be written as:
$$\mathcal{L}_{img} = \mathbb{E}_{(I_{LD}, I_{ND})}\big[\|I_{den} - I_{ND}\|_2^2\big], \tag{8}$$
$$\mathcal{L}_{grd} = \mathbb{E}_{(I_{LD}, I_{ND})}\big[\|\nabla I_{den} - \nabla I_{ND}\|_1\big]. \tag{9}$$
Note that we employ the mean squared error at the pixel level, rather than at the feature level using a pretrained model [7], [8], for the sake of computational cost, and the mean absolute error at the gradient level because the gradients are much sparser than the pixels.
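The difference between the two pixel-wise terms is easy to see on a toy example. The sketch below implements a mean squared error and a mean absolute error over nested lists; the function names are illustrative, and the gradient images are assumed to have been computed beforehand by a Sobel filter.

```python
from typing import List

Image = List[List[float]]


def mse(a: Image, b: Image) -> float:
    """Mean squared error, used here for the pixel (image-domain) loss."""
    diffs = [x - y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(d * d for d in diffs) / len(diffs)


def mae(a: Image, b: Image) -> float:
    """Mean absolute error, used here for the gradient-domain loss."""
    diffs = [x - y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(abs(d) for d in diffs) / len(diffs)


pred = [[0.0, 0.0]]
target = [[1.0, 3.0]]
print(mse(pred, target), mae(pred, target))
```

The squared error is dominated by the largest deviation, whereas the absolute error weights all deviations linearly, which is one reason an absolute error is a common choice for sparse signals such as gradient maps.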

3) Final loss: To encourage the generator to produce photo-realistic denoised LDCT images with better edge information and fewer streak artifacts, the final loss function used to optimize the generator $G$ is expressed as:
$$\mathcal{L}_{G} = \lambda_{adv} \mathcal{L}_{adv} + \lambda_{img} \mathcal{L}_{img} + \lambda_{grd} \mathcal{L}_{grd}, \tag{10}$$
where $\lambda_{adv}$, $\lambda_{img}$, and $\lambda_{grd}$ are the weights for $\mathcal{L}_{adv}$, $\mathcal{L}_{img}$, and $\mathcal{L}_{grd}$, respectively. Here, we empirically determine the hyper-parameters in a sequential way. First, with only the pixel-wise loss, our proposed DU-GAN reduces to RED-CNN since the discriminators are not included during training. Although it converges fast, optimizing only the MSE loss leads to over-smoothed and blurred results, causing the loss of structural details. We set $\lambda_{img}$ to 1. Second, we tune $\lambda_{adv}$ to control the importance of the adversarial loss in capturing texture details. We start from a small value for $\lambda_{adv}$, gradually increase the importance of the adversarial loss, and visualize the denoising results. Finally, we tune $\lambda_{grd}$ to capture edge information, using a large value as the gradients are much sparser than the pixels. The discriminators $D_{img}$ and $D_{grd}$ are optimized by minimizing a mixed loss (11) that combines the least-squares discriminator loss with the CutMix regularization and consistency losses. Note that we employ the same loss function in (11) to optimize both $D_{img}$ and $D_{grd}$, but the two discriminators are independent of each other and $D_{grd}$ has an additional Sobel operator to compute the gradients.
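The weighted combination of the three generator terms can be written as a one-line function. In this sketch the default weights are the values reported in the implementation details (0.1 for the adversarial term, 1 for the image term, and 20 for the gradient term); the function name and the sample loss values are illustrative only.

```python
def generator_loss(l_adv: float, l_img: float, l_grd: float,
                   lam_adv: float = 0.1, lam_img: float = 1.0,
                   lam_grd: float = 20.0) -> float:
    """Weighted sum of the adversarial, pixel, and gradient losses."""
    return lam_adv * l_adv + lam_img * l_img + lam_grd * l_grd


# With only the pixel term active, the objective reduces to plain MSE training.
print(generator_loss(l_adv=1.0, l_img=2.0, l_grd=0.5,
                     lam_adv=0.0, lam_grd=0.0))
# Full objective with the default weights.
print(generator_loss(l_adv=1.0, l_img=2.0, l_grd=0.5))
```

Setting the adversarial and gradient weights to zero recovers the MSE-only regime described as the first tuning step, and the large gradient weight compensates for the small magnitude of a loss averaged over mostly zero gradient pixels.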
F. Complexity of DU-GAN
Next, we discuss the complexity of DU-GAN in terms of hyper-parameters and computational costs. First, compared to MSE-based methods that directly optimize the mean-squared loss, DU-GAN is a GAN-based method that introduces an additional adversarial loss into the training process. Compared to vanilla GAN-based methods with a traditional classification discriminator, DU-GAN uses a U-Net based discriminator to focus on both local details and global structures. Furthermore, DU-GAN introduces a gradient branch alongside the original pixel branch to encourage clear boundaries. Therefore, there is only one extra hyperparameter, which controls the importance of the gradient branch. Second, the main computational costs of DU-GAN come from the proposed U-Net based discriminators and the gradient branch. However, these costs are affordable considering the better denoising quality and performance of DU-GAN, and they are incurred only during the training stage. That is, the inference efficiency remains the same as that of the traditional methods.

IV. EXPERIMENTS
This section presents the datasets, implementation details, qualitative and quantitative evaluations, uncertainty visualization, and ablation study.
A. Datasets
1) Simulated dataset: The LDCT dataset used in this study was originally prepared for the 2016 NIH-AAPM-Mayo Clinic Low-Dose CT Grand Challenge and was later released in [45]. It provides scans from three regions of the body with different simulated low doses, i.e., head at 25% of the normal dose, abdomen at 25%, and chest at 10%. In our experiments, we used the 25% abdomen and 10% chest datasets, named Mayo-25% and Mayo-10%, respectively. We evaluated our method on abdomen scans for comparison with most previous works, and conducted experiments on chest scans since 10% of the normal dose at the chest is considerably more challenging than 25% of the normal dose at the abdomen. For each dataset, we randomly selected 20 patients for training and another 20 patients for testing, with no patient overlap between training and testing. In detail, 300K and 64K image patches were randomly selected from the training and testing sets, respectively. For more information about this dataset, please refer to [45].
2) Real-world dataset: The real-world dataset from [37] includes 850 CT scans of a deceased piglet obtained by a GE scanner (Discovery CT750 HD). The dataset provides CT scans at the normal dose and at 50%, 25%, 10%, and 5% of the normal dose with a size of 512 × 512; 708 of the scans are used for training and the rest for testing. We evaluated our method on the 5% low-dose CT scans, as this is the most challenging dose; the resulting dataset is named Piglet-5%. We randomly selected 60K and 12K image patches from the training and testing sets, respectively. For more information about this dataset, please refer to [37].

B. Implementation Details
Following [7], [8], [46], we employed the image patches with a size of 64 × 64 and a window of [−300, 300] to train all models with emphasis on tissue CT window, which are then directly applied to the whole image for visualization and testing. Note that we excluded those image patches that were mostly air. During training, all images are linearly normalized to [0, 1].
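The windowing and normalization step can be sketched as a clip followed by a linear rescale. The sketch assumes that the window bounds are given in Hounsfield units and that values outside the window are clipped before scaling; the function name is hypothetical.

```python
def window_normalize(hu: float, lo: float = -300.0, hi: float = 300.0) -> float:
    """Clip an intensity to the window [lo, hi] and map it linearly to [0, 1]."""
    clipped = min(max(hu, lo), hi)
    return (clipped - lo) / (hi - lo)


# Air (about -1000 HU) saturates at 0, dense bone saturates at 1.
for value in (-1000.0, -300.0, 0.0, 300.0, 1000.0):
    print(value, window_normalize(value))
```

With this window, soft-tissue intensities occupy the full [0, 1] range, while air and dense bone saturate at the two ends.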
We trained the model for a maximum of 100K iterations with a mini-batch size of 64 on one NVIDIA V100 GPU. All networks in the proposed framework are initialized with He initialization [48] and optimized by the Adam optimization method [49] with a fixed learning rate of $10^{-4}$. The hyperparameters in the loss functions were empirically set as follows: $\lambda_{adv}$ was 0.1, $\lambda_{img}$ was 1, and $\lambda_{grd}$ was 20. We implemented five deep-learning-based methods for comparison, namely RED-CNN [6], WGAN-VGG [8], CPCE-2D [7], Q-AE [46], and CNCL [47], with reference to the official source code.

C. Qualitative Evaluations
To demonstrate the effectiveness of the proposed method in generating photo-realistic denoised results with faithful details, Fig. 5 showcases the representative results from three different datasets while Fig. 6 presents the results of one neck CT slice with strong streak artifacts. The regions-of-interest (ROIs) marked by the red rectangles are zoomed below, respectively.
All methods produce visually reasonable denoised results to some degree. However, RED-CNN and Q-AE over-smooth and blur the LDCT images because they are optimized with the MSE loss, which tends to average the results and causes a loss of structural details. Although WGAN-VGG and CPCE-2D greatly improve the visual fidelity, as expected from their use of an adversarial loss, minor streak artifacts can still be observed because their traditional classification discriminator only provides the generator with global structure feedback. In addition, they employ a perceptual loss in a high-level feature space to suppress the blurriness resulting from the MSE loss. The perceptual loss, however, can only preserve the structures of NDCT images, since some local details may be lost after being processed by a pre-trained model. For example, the low-attenuation lesions in Fig. 5 and the bones in Fig. 6 are less clear in the results of WGAN-VGG and CPCE-2D, whereas they can be easily observed in the NDCT images and in the results of our method. Most importantly, our method consistently preserves small structures and their boundaries with clear visual fidelity. This benefits from the dual-domain U-Net based discriminators, which provide the generator with feedback on both global structures and local details, in contrast to the traditional classification discriminator used in WGAN-VGG and CPCE-2D, which provides only structure information. In addition, the gradient domain branch encourages the denoising model to better preserve edge information.
Beyond encouraging sharper edges, Fig. 6 also demonstrates the performance of our method on LDCT images with strong streak artifacts caused by photon starvation. Compared to the baseline methods, which produce strongly blurred and ghosted results, our method addresses this problem in two ways: 1) streak artifacts can be easily detected by the gradient domain branch; and 2) once detected, the dual-domain U-Net based discriminators guide the generator, through adversarial training, to fill in the occluded area and alleviate the impact of the streak artifacts. In summary, these results further validate the superiority of our method.
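To illustrate why streaks stand out in the gradient domain, the following sketch computes a gradient-magnitude image with a 3 × 3 Sobel filter. The choice of the Sobel operator, the edge-replicated padding, and the function names are assumptions of this example; the text above only states that a gradient domain branch is used.

```python
import numpy as np

# Sobel kernels (an assumed choice of gradient operator).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Same-size 3x3 cross-correlation with edge-replicated padding."""
    image = np.asarray(image, dtype=np.float32)
    padded = np.pad(image, 1, mode="edge")
    height, width = image.shape
    out = np.zeros((height, width), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out


def gradient_magnitude(image):
    """Per-pixel gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


# A flat region has zero gradient; a thin streak produces a strong response
# on the rows next to it.
flat = np.full((16, 16), 0.5, dtype=np.float32)
streak = flat.copy()
streak[8, :] = 1.0
streak_gradient = gradient_magnitude(streak)
```

In this toy case the response is zero everywhere except on the two rows bordering the streak, which is the kind of localized signal a per-pixel discriminator in the gradient domain can pick up.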

D. Quantitative Evaluations
For quantitative evaluations, we adopted three widely used metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and root mean square error (RMSE). More specifically, PSNR and RMSE measure the denoising performance at the pixel level, while SSIM computes the structural similarity within a window. Table I presents the results of the different methods. First, RED-CNN and Q-AE are MSE-based denoising methods, as they are trained solely with the MSE loss. Although they achieve better PSNR and RMSE results, the visual results in Figs. 5 and 6 confirm that MSE-based methods produce over-smoothed results compared to the NDCT images, leading to a loss of structural information [7], [8], [50]. Note that the over-smoothed denoising results lead to a lower SSIM score. Second, WGAN-VGG, CPCE-2D, and our DU-GAN are GAN-based methods. CPCE-2D performs better than WGAN-VGG owing to its conveying path, since WGAN-VGG has to reconstruct the denoised results from the input LDCT images. Our method achieves the best SSIM score with high visual fidelity, and its PSNR and RMSE are also better than those of WGAN-VGG and CPCE-2D, indicating superior denoising performance together with better structural fidelity.
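The three metrics can be sketched in NumPy as follows. RMSE and PSNR follow their standard definitions for images normalized to a known data range. The `global_ssim` function is a simplification that evaluates the SSIM formula once over the whole image; standard SSIM averages the same quantity over local windows, so its values will differ from those in Table I. The constants `k1 = 0.01` and `k2 = 0.03` are the usual defaults, assumed here.

```python
import numpy as np


def rmse(x, y):
    """Root mean square error between two images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.sqrt(np.mean((x - y) ** 2)))


def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated over a single window covering the whole image."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Because PSNR is a monotonic function of the mean squared error, a method trained directly on MSE is favored by both PSNR and RMSE, whereas SSIM also depends on local contrast and correlation.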
Although DU-GAN uses the same network architecture as RED-CNN for the denoising model, the qualitative and quantitative differences between them come directly from the adversarial training and the dual-domain U-Net based discriminators. The results of DU-GAN preserve more structural details that are important for diagnosis, at the cost of compromising quantitative metrics such as PSNR and RMSE. We note that PSNR and RMSE are pixel-wise metrics that correlate poorly with human perception of image quality [14].

E. Uncertainty Visualization
Fig. 4 shows that the proposed discriminator, with its U-Net architecture and CutMix regularization, can robustly learn the per-pixel differences in local details between NDCT and denoised LDCT images through the decoder while also focusing on the global structures through the encoder. With this well-trained discriminator, we can provide radiologists with a confidence map showing the uncertainty of the denoised results, since the discriminator learns the distribution of real samples, i.e., NDCT images. Therefore, we directly applied the trained discriminator $D_{img}$ to the LDCT, NDCT, and denoised LDCT images of the different methods. Fig. 7 shows the uncertainty visualization. The discriminator accurately distinguishes the LDCT from the NDCT images in both the global score and the per-pixel confidence. Because both RED-CNN and Q-AE over-smooth the LDCT images, the abdomen area of the transverse CT image looks better than in the LDCT images on the confidence map, according to the results of $D_{img}^{dec}$. This also explains why RED-CNN and Q-AE have the lowest global scores from $D_{img}^{enc}$, which indicates that the discriminator can robustly detect the blurriness in the CT images. Furthermore, although CPCE-2D produces clearer denoised results than RED-CNN, streak artifacts significantly compromise the quality of its results. Similarly, WGAN-VGG learns more local details than CPCE-2D but still cannot handle the impact of the streak artifacts.
In contrast, the proposed method produces the most photo-realistic denoised results with the highest global score. Compared to the traditional classification discriminator used in CPCE-2D and WGAN-VGG, the discriminator of DU-GAN provides the generator with per-pixel feedback by learning the differences in local details. This can be seen from the per-pixel output of $D_{img}^{dec}$: we obtain a smoother per-pixel confidence map, indicating that the discriminator cannot distinguish real from fake samples at the per-pixel level.
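The CutMix regularization referred to above can be illustrated with a minimal sketch of the general idea: a rectangular region of a real (NDCT) patch is replaced by the corresponding region of a fake (denoised) patch, and the binary mixing mask serves as the per-pixel target for the decoder output. The box-sampling scheme and the function names are assumptions of this example and are not necessarily the exact variant used in DU-GAN.

```python
import numpy as np


def random_box_mask(height, width, rng):
    """Binary mask: 1 keeps the real pixel, 0 pastes the fake pixel."""
    fake_ratio = rng.uniform(0.0, 1.0)                  # area fraction to paste
    box_h = int(round(height * np.sqrt(fake_ratio)))
    box_w = int(round(width * np.sqrt(fake_ratio)))
    top = int(rng.integers(0, height - box_h + 1))
    left = int(rng.integers(0, width - box_w + 1))
    mask = np.ones((height, width), dtype=np.float32)
    mask[top:top + box_h, left:left + box_w] = 0.0
    return mask


def cutmix(real, fake, mask):
    """Mix two aligned patches; the mask doubles as the per-pixel real/fake label."""
    mixed = mask * real + (1.0 - mask) * fake
    per_pixel_target = mask
    return mixed, per_pixel_target
```

A decoder trained against such targets has to decide real versus fake at every pixel, which is what makes its output usable as a per-pixel confidence map.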

F. Ablation Study
In this subsection, we conducted an ablation study to explore the proposed method in terms of the importance of its different components, the architecture of the discriminator, and the patch size. The ablation study was performed on the testing set of the Mayo-10% dataset, which includes a total of 6,590 slices from 20 patients.
1) Components analysis: We investigate the impact of the U-Net based discriminator in the image domain, CutMix regularization, and dual-domain training (i.e., with the gradient branch) by gradually applying them to the baseline method. Similar to WGAN-VGG and CPCE-2D, the baseline method includes only the traditional classification discriminator, with the same hyperparameters for fair comparison. Table II presents the quantitative results of the ablation study. First, replacing the traditional classification discriminator with a U-Net based discriminator simultaneously provides the generator with both global structure and local per-pixel feedback, which leads to a significant increase in SSIM. Second, when we further use the CutMix technique to regularize the U-Net based discriminator, the mixed samples boost the discriminative capacity of the discriminator and make it focus more on local details, leading to an increased SSIM score and slightly degraded PSNR and RMSE. Last, adding a U-Net based discriminator in the gradient domain to the method above forms the dual-domain training and yields our full method. Specifically, the additional gradient domain training helps our method remove streak artifacts and encourages clearer edges in the denoised LDCT images. As a result, it effectively improves all metrics, including PSNR and RMSE in pixel space and SSIM in terms of structural similarity.
2) Architectures of discriminator: Since the architecture of the discriminator plays a critical role in the training of GANs, it is worthwhile to study the advantage of the U-Net based discriminator over other classical discriminator architectures such as the patch discriminator [31], the pixel discriminator [31], and the traditional global discriminator. Compared to the traditional classification discriminator, which classifies real and fake samples at the image level, the patch discriminator focuses on image patches.
Because LDCT denoising models are trained on image patches, the traditional classification discriminator trained in this way can be regarded as a patch discriminator. The discriminator with seven convolutional layers and one fully-connected layer is regarded as the global discriminator. On the other hand, the pixel discriminator [31] contains seven 1 × 1 convolutional layers to penalize the generator at the per-pixel level. For fair comparison, we trained the patch and pixel discriminators with image patches and the global discriminator with whole images of size 512 × 512. Table III shows that the combination of global and pixel information in the U-Net based discriminator produces the best SSIM score. This indicates the advantage of the U-Net based discriminator for LDCT denoising over the other classical discriminator architectures. Whereas the pixel discriminator captures only per-pixel differences and the traditional classification discriminator focuses only on global structure, the U-Net based discriminator combines the advantages of both, yielding better quantitative results and denoising quality.
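The difference between the pixel discriminator and discriminators with spatial context can be made concrete with the standard receptive-field recurrence for stacked convolutions. The seven 1 × 1 layers are taken from the text; the 3 × 3 configurations are hypothetical and included only for contrast, since the kernel sizes and strides of the other discriminators are not given here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a sequence of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump   # growth contributed by this layer
        jump *= stride                      # spacing of outputs in input pixels
    return field


pixel_discriminator = [(1, 1)] * 7          # seven 1x1 convolutions (from the text)
hypothetical_3x3 = [(3, 1)] * 7             # hypothetical, for contrast
hypothetical_strided = [(3, 2)] * 3         # hypothetical, for contrast

pixel_field = receptive_field(pixel_discriminator)
stacked_field = receptive_field(hypothetical_3x3)
strided_field = receptive_field(hypothetical_strided)
```

A stack of 1 × 1 convolutions never looks beyond a single pixel, which is why the pixel discriminator captures only per-pixel differences, whereas larger kernels and strides widen the context seen by each output.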
3) Patch size: Because the discriminator has a U-Net architecture, it is also important to analyze the influence of the patch size during training. However, it is very difficult to train the denoising model directly from scratch at larger patch sizes. Therefore, we trained our model with image sizes of 64 × 64, 128 × 128, 256 × 256, and 512 × 512 in turn, fine-tuning the generator at each size from the model trained at the previous, smaller size. Table IV shows that a small patch size achieves better performance, because larger patch sizes may introduce training difficulties with fewer training samples.
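A rough way to see the sample-count effect is to count how many non-overlapping patches fit into one 512 × 512 slice at each patch size. This is only an illustration: the experiments sample patches randomly rather than tiling, and the function names below are hypothetical.

```python
SLICE_SIZE = 512                      # slice size stated in the text
PATCH_SIZES = (64, 128, 256, 512)     # patch sizes compared in the text


def tiles_per_slice(slice_size, patch_size):
    """Number of non-overlapping square patches that tile one square slice."""
    return (slice_size // patch_size) ** 2


def pixel_ratio(slice_size, patch_size):
    """How many times more pixels a full slice has than one patch."""
    return (slice_size * slice_size) // (patch_size * patch_size)


tiles = {size: tiles_per_slice(SLICE_SIZE, size) for size in PATCH_SIZES}
for size in PATCH_SIZES:
    print(f"{size}x{size}: {tiles[size]} tiles per slice")
```

Each doubling of the patch side reduces the number of independent tiles per slice by a factor of four, and a 64 × 64 patch contains 64 times fewer pixels than a full slice, which is also relevant to the per-iteration training cost.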

V. DISCUSSION AND CONCLUSION
In this paper, we proposed a novel DU-GAN for LDCT denoising. The introduced U-Net based discriminator not only provides per-pixel feedback to the denoising network but also focuses on the global structure. We further added an extra U-Net based discriminator in the gradient domain, which enhances the edge information and alleviates the streak artifacts caused by photon starvation. We also showed that the CutMix technique can boost the training of the discriminator, which in turn can provide radiologists with a confidence map of the uncertainty of the denoised results. Extensive experiments demonstrated the effectiveness of the proposed method through visual and quantitative comparisons.
Although a different architecture or model size could certainly affect the results, DU-GAN has demonstrated its generalization ability on two simulated low-dose CT datasets with different doses and one real-world dataset, using the same network architecture and hyperparameters. Our ablation study validates each component, and the relative importance of the components should remain consistent on a new dataset, which suggests that DU-GAN can be easily adapted to different scenarios.
We acknowledge some limitations in this work. First, we used qualitative and quantitative comparisons to evaluate the image quality. A human reader study may be needed to further validate the potential of our method in clinical diagnosis, though there are significant differences between the proposed and the baseline methods. Second, the U-Net based discriminator can provide radiologists with a confidence map of the denoised images. How this helps radiologists in clinical routine could be examined with specific tasks such as liver lesion diagnosis, which can be studied as a future direction. Third, DU-GAN introduces slightly more computational cost during training, since it employs a U-Net as the discriminator and adopts a dual-domain training strategy. However, we emphasize that the extra computational cost is relatively affordable, as DU-GAN trains the whole framework on 64 × 64 image patches instead of the original image size of 512 × 512. Therefore, we believe that the computational cost can be significantly reduced. We note that the extra computational cost is incurred only in the training stage. That is, the inference efficiency is the same as that of traditional methods, as the dual-domain discriminators are not involved during the testing stage. Finally, we only validated DU-GAN with two dose levels in this paper, and it is worth further validating DU-GAN at lower radiation doses.
In conclusion, the proposed DU-GAN achieves better denoising performance than other GAN-based models and has great potential for clinical use with uncertainty visualization.