Low-Dose CT Image Denoising Based on Improved WGAN-gp

To improve the quality of low-dose computed tomography (CT) images, this paper proposes an improved image denoising approach based on WGAN-gp, which uses the Wasserstein distance. To improve training and convergence efficiency, the method adds a gradient penalty term to the WGAN network. A novel perceptual loss is introduced to preserve the texture information of low-dose images to which the diagnostician's eye is sensitive. Experimental results show that, compared with state-of-the-art methods, the time complexity is reduced and the visual quality of low-dose CT images is significantly improved.


Introduction
Computed tomography (CT) has received extensive attention and development in recent years, but the radiation generated during CT scanning poses a potential hazard to the human body. Reducing the radiation dose by lowering the X-ray tube current, however, introduces noise and artifacts into CT images, which affects the physician's diagnosis. In response to these problems, many methods [Xia, Xiong, Athanasios et al. (2017); Du, Li and Guo (2013)] have been proposed to improve the quality of reconstructed low-dose CT (LDCT) images. These methods fall into three main categories: projection domain methods, iterative reconstruction methods and post-processing methods.
The projection domain method designs a denoising algorithm suited to the characteristics of the projection domain, and then reconstructs the image with a traditional analytical reconstruction algorithm, namely filtered back projection (FBP). The iterative reconstruction method uses a likelihood function to relate the image to the projection data according to their statistical characteristics. The image reconstruction [Munteanu, Brisan, Chiroiu et al. (2012); Cui, McIntosh and Sun (2018); Nie, Xu, Feng et al. (2018)] is performed iteratively by incorporating a priori information into the objective function. Chen et al. [Chen, Zhang, Zhang et al. (2017a)] proposed adaptive weighted non-local priors based on the non-local means algorithm. These methods improve the quality of CT image reconstruction, but they suffer from long iteration times.
The post-processing method operates directly on low-dose CT images [Jayashree and Bhuvaneswaran (2019)]. Since LDCT images contain significant streak artifacts and noise, the purpose of post-processing is to remove these artifacts and noise from the image. Chen et al. [Chen, Zhang, Zhang et al. (2017b)] proposed a fast dictionary learning method to improve the quality of reconstructed CT images based on dictionary learning and sparse representation, but dictionary learning may introduce blurring and artifacts. Chen et al. [Chen, Zhang, Zhang et al. (2017a)] applied the non-local means (NLM) method to CT image denoising. This method can significantly improve image quality, but residual error and excessive smoothing can still be observed in the processed images, and because CT image noise is unevenly distributed, these problems have not been solved well.
The recent rapid development of deep neural networks provides a new way to solve the problem of low-dose CT image denoising. Owing to their powerful feature learning and mapping capabilities, deep neural networks achieve better reconstruction quality and faster speed than traditional methods, and much work has been carried out in this direction. Chen et al. [Chen, Zhang, Zhang et al. (2017b)] first applied a convolutional neural network (CNN) to low-dose CT image denoising; it shows a certain superiority over traditional methods in visual quality and evaluation indicators. Chen et al. [Chen, Zhang, Zhang et al. (2017b)] proposed the residual encoder-decoder convolutional neural network (RED-CNN) and obtained the best results on the objective evaluation indices, but the network complexity is high. These deep neural networks have achieved good results in low-dose CT image denoising. However, they all use the mean square error (MSE) as the loss function, and minimizing the MSE usually results in excessive edge smoothing and loss of detail, while the image texture that is crucial to human perception is neglected.
The original generative adversarial network (GAN) suffers from problems such as difficult training and vanishing gradients. Yang et al. [Yang, Yan, Zhang et al. (2018)] applied WGAN to low-dose CT image denoising and achieved good results. However, due to inherent defects in the WGAN design, the convergence speed can be further improved.

Denoising based on WGAN-gp
This section first introduces the denoising model and then the WGAN-gp proposed in this paper. WGAN uses the Wasserstein distance instead of a simple pixel-wise MSE as its loss function, which overcomes the problems of excessive edge smoothing and loss of detail. At the same time, WGAN-gp, with its gradient penalty term, remedies the deficiencies of WGAN and accelerates convergence. The proposed novel perceptual loss also preserves the image texture that is vital to human perception.

Denoising model
Let $x \in \mathbb{R}^{N \times N}$ represent a normal-dose CT image (NDCT) and $z \in \mathbb{R}^{N \times N}$ the corresponding low-dose CT image (LDCT). The goal of denoising is to find a function $G$ that maps $z$ to $x$:
$$G: z \rightarrow x \tag{1}$$
Here $x$ can be regarded as a sample of the NDCT image distribution $P_r$, and $z$ as a sample of the corresponding LDCT image distribution $P_L$. The function $G$ maps the LDCT image distribution $P_L$ to a specific image distribution $P_g$, making the generated distribution similar to the real sample distribution $P_r$.

WGAN-gp
A generative adversarial network (GAN) is a combined neural network comprising a generator network G and a discriminator network D. The generator G accepts a random vector z and generates an image G(z). The discriminator D receives both the real image x and the generated image G(z), and tries to give the real image a higher score and the generated "false" image a lower score. Over multiple training iterations, the image generated by G gets closer to the real image, until D can no longer judge whether its input is real or generated. At that point the network has been trained successfully. D and G are trained by solving the following minimax problem:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] \tag{2}$$
where $\mathbb{E}$ represents the expectation, $P_z$ is the distribution of the random (noise) vector, $P_r$ is the real data distribution, and $P_g$ is the distribution of the samples produced by the generator. $D(x)$ and $D(G(z))$ are both probabilities with values in $[0, 1]$. The original GAN has a fatal flaw: the better the discriminator is trained, the more severely the generator gradient vanishes. Specifically, for any sample $x$, its contribution to the discriminator loss function is
$$-P_r(x) \log D(x) - P_g(x) \log[1 - D(x)] \tag{3}$$
Setting the derivative with respect to $D(x)$ to zero gives the optimal discriminator:
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \tag{4}$$
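Substituting the optimal discriminator into the generator loss function expresses that loss through the Jensen-Shannon (JS) divergence between $P_r$ and $P_g$:
$$2\,JS(P_r \,\|\, P_g) - 2 \log 2 \tag{5}$$
In this case, if the overlap between the supports of $P_r$ and $P_g$ has measure zero, the JS divergence equals the constant $\log 2$, so the generator's loss function is always zero and the gradient vanishes. To address this problem, the Wasserstein distance [Arjovsky, Chintala and Bottou (2017); Champion, Pascale and Juutinen (2008)] is used instead of the JS divergence. The advantage of the Wasserstein distance is that it can still measure the distance between two distributions even if they do not overlap. WGAN trains D and G with the following loss:
$$L_{WGAN}(D, G) = -\mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{z \sim P_z}[D(G(z))] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2\right] \tag{6}$$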
The first two terms estimate the Wasserstein distance, and the last is the gradient penalty term that regularizes the network. Compared with the original GAN, WGAN removes the logarithm from the loss and also removes the sigmoid layer from the discriminator D. These changes are small and simple to implement, but WGAN still suffers from training difficulties and slow convergence.
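To illustrate why the Wasserstein distance is preferred over the JS divergence when the two distributions do not overlap, the following minimal sketch compares both quantities for two point masses on a one-dimensional grid. The helper names (`js_divergence`, `wasserstein_1d`, `point_mass`) and the grid setup are illustrative assumptions and are not part of the proposed method.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=10):
    """Distribution with all probability at one grid position (toy example)."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for shift in (1, 3, 6):
    fake = point_mass(shift)
    # JS stays at log 2 for every shift; the Wasserstein distance equals the shift.
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```

In this toy setting the JS divergence is the same constant however far apart the two point masses are, so it carries no gradient information, whereas the Wasserstein distance grows with the shift.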
Yang et al. [Yang, Yan, Zhang et al. (2018)] used WGAN for low-dose CT image denoising and achieved good results. By adding a gradient penalty term, the convergence speed can be further improved. The specific improvement is as follows. The loss function of the WGAN discriminator is defined as
$$L(D) = -\mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{x \sim P_g}[D(x)] \tag{7}$$
So that the discriminator's score does not change drastically when the input sample changes, a Lipschitz constraint is imposed on the discriminator:
$$\|\nabla_x D(x)\|_p \le k, \quad \forall x \in \chi \tag{8}$$
That is, the $L_p$ norm of the gradient of the discriminator $D(x)$ is no more than a constant $k$. In WGAN, the Lipschitz constraint is enforced by weight clipping: each time the discriminator parameters are updated, the absolute value of every parameter is checked against a threshold, and any parameter that exceeds the threshold is clipped back into the threshold range. This introduces two drawbacks. First, it pushes the network toward a binary network: because the discriminator wants to maximize the score gap between real and fake samples, its parameters tend to sit at the upper and lower bounds. Second, gradients easily vanish or explode, because the multi-layer discriminator network repeatedly shrinks or amplifies the gradient. Alternatively, the Lipschitz constraint can be expressed as an additional loss term. Setting $k$ to 1 and adding this term, with a weight $\lambda$, to the original WGAN discriminator loss gives the discriminator loss of WGAN-gp:
$$L(D) = -\mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{x \sim P_g}[D(x)] + \lambda\, \mathbb{E}_{x \sim \chi}\left[(\|\nabla_x D(x)\|_p - 1)^2\right] \tag{9}$$
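In the third term, $\chi$ represents the entire sample space. Estimating an expectation over the whole high-dimensional sample space requires an exponential number of samples, which is practically infeasible. It is, however, not necessary to impose the Lipschitz constraint on the entire space: it is enough to sample from the regions where the generated samples and the real samples concentrate and from the region between them. The final discriminator loss is therefore
$$L(D) = -\mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{x \sim P_g}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2\right] \tag{10}$$
where $\lambda$ is the weight of the penalty term and, in the third term, $\hat{x} \sim P_{\hat{x}}$ is a random interpolation sample on the line between $x_r$ and $x_g$: $\hat{x} = \epsilon x_r + (1 - \epsilon) x_g$, with $x_r \sim P_r$, $x_g \sim P_g$ and $\epsilon \sim U[0, 1]$.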
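The WGAN-gp discriminator loss with random interpolation can be sketched as follows. This is a minimal illustration, not the network used in this paper: the critic is a hypothetical toy function whose input gradient is known in closed form, the $L_2$ norm is assumed for the gradient, and the penalty weight `lam=10.0` is an assumed value. In a real implementation the gradient would come from automatic differentiation.

```python
import math
import random

# Hypothetical toy critic D(x) = w.x + 0.5 * c * ||x||^2, chosen only because
# its input gradient, grad D(x) = w + c * x, is available in closed form.
def critic(x, w, c):
    return sum(wi * xi for wi, xi in zip(w, x)) + 0.5 * c * sum(xi * xi for xi in x)

def critic_grad(x, w, c):
    return [wi + c * xi for wi, xi in zip(w, x)]

def wgan_gp_critic_loss(real_batch, fake_batch, w, c, lam=10.0, rng=random):
    """-E[D(x_r)] + E[D(x_g)] + lam * E[(||grad D(x_hat)||_2 - 1)^2]."""
    n = len(real_batch)
    d_real = sum(critic(x, w, c) for x in real_batch) / n
    d_fake = sum(critic(x, w, c) for x in fake_batch) / n
    penalty = 0.0
    for x_r, x_g in zip(real_batch, fake_batch):
        eps = rng.random()  # interpolation coefficient drawn uniformly from [0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x_r, x_g)]
        grad_norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w, c)))
        penalty += (grad_norm - 1.0) ** 2
    return -d_real + d_fake + lam * penalty / n

real = [[1.0, 0.0], [0.5, 0.5]]
fake = [[0.0, 0.0], [0.1, -0.2]]
print(wgan_gp_critic_loss(real, fake, w=[0.6, 0.8], c=0.0))  # unit gradient norm
print(wgan_gp_critic_loss(real, fake, w=[3.0, 4.0], c=0.0))  # gradient norm 5
```

With `c = 0` the critic is linear, so its gradient norm is `||w||` everywhere: the penalty vanishes for a unit-norm `w` and grows quadratically as the norm moves away from 1, which is how the soft Lipschitz constraint acts without clipping any weights.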

Loss function
The loss function in the network is generally intended to preserve image detail, and the mean square error (MSE) is the usual choice:
$$L_{MSE}(G) = \frac{1}{N^2} \|G(z) - x\|_F^2 \tag{11}$$
Although this loss minimizes the pixel-wise difference between the generated denoised image $G(z)$ and the NDCT image $x$, it may result in image blurring and loss of detail. The improvement is to introduce a perceptual loss based on high-level features to better characterize the image. This paper uses the perceptual loss proposed by Johnson et al. [Johnson, Alahi and Li (2016)]:
$$L_{perc}(G) = \frac{1}{whd} \|\phi(G(z)) - \phi(x)\|_F^2 \tag{12}$$
where $\phi$ denotes the feature extractor, and $w$, $h$ and $d$ represent the width, height and depth of the feature space. The perceptual loss in this paper is computed on the output of a ReLU activation layer of the pre-trained 19-layer Visual Geometry Group (VGG-19) network, so $\phi$ represents VGG-19. The compound loss function proposed in this paper combines the adversarial loss of WGAN-gp with this perceptual loss.
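To make the perceptual loss concrete, the following minimal sketch compares it with the pixel-wise MSE on a tiny image. The feature extractor `toy_features` is a hypothetical stand-in (ReLU of local differences with $d = 2$ channels) and is not VGG-19; only the structure of the loss, a squared feature-space distance normalized by $w h d$, follows the definition in this section.

```python
def toy_features(img):
    """Hypothetical stand-in for a VGG-19 ReLU feature map (NOT VGG-19).

    Returns an h x w grid of 2-tuples: ReLU of the horizontal and of the
    vertical finite difference at each pixel.
    """
    h, w = len(img) - 1, len(img[0]) - 1
    feats = []
    for i in range(h):
        row = []
        for j in range(w):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            row.append((max(dx, 0.0), max(dy, 0.0)))
        feats.append(row)
    return feats

def perceptual_loss(generated, reference, phi=toy_features):
    """Squared distance between feature maps, divided by w * h * d."""
    f_g, f_r = phi(generated), phi(reference)
    h, w, d = len(f_g), len(f_g[0]), len(f_g[0][0])
    squared = sum((a - b) ** 2
                  for row_g, row_r in zip(f_g, f_r)
                  for px_g, px_r in zip(row_g, row_r)
                  for a, b in zip(px_g, px_r))
    return squared / (w * h * d)

def mse_loss(generated, reference):
    """Pixel-wise mean square error between two equally sized images."""
    n = len(generated) * len(generated[0])
    return sum((a - b) ** 2
               for row_g, row_r in zip(generated, reference)
               for a, b in zip(row_g, row_r)) / n

reference = [[float(i + j) for j in range(3)] for i in range(3)]
shifted = [[px + 0.5 for px in row] for row in reference]  # same structure, brighter
print(mse_loss(shifted, reference), perceptual_loss(shifted, reference))
```

For this example the MSE is 0.25 while the feature-space loss is 0, because a constant brightness offset changes every pixel but none of the local structure captured by the toy features. The two losses measure different things, which is the motivation for combining them with a learned feature extractor.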

Network architecture design
The WGAN-gp network structure consists of three parts. Fig. 1 shows the generator G, which is a convolutional neural network with 8 convolutional layers. The convolution kernel size of each convolutional layer is 3×3 pixels. The first seven hidden layers of the generator have 32 filters each, and the last layer uses one 3×3 filter to generate a feature map. The rectified linear unit (ReLU) is the activation function. In the figures, n denotes the number of convolution kernels and s the convolution stride. Fig. 3 shows the discriminator network D. The discriminator has six convolutional layers: the first two have 64 filters, the middle two have 128 filters, and the last two have 256 filters. As in the generator, the convolution kernel size of each convolutional layer is 3×3 pixels. The convolutional layers are followed by two fully connected layers, the first with 1024 outputs and the second with a single output. Fig. 2 shows the VGG-19 network, through which the novel perceptual loss proposed in this paper is obtained.
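As a rough check on the size of the networks described above, the number of trainable convolution parameters can be counted from the layer widths alone. The sketch below assumes a single-channel (grayscale) input, one bias per filter, and a single filter in the last generator layer. The fully connected layers of the discriminator are left out because their input size depends on the strides, which are given only in the figures.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Trainable parameters of one kernel x kernel convolutional layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def count_conv_stack(in_channels, widths, kernel=3):
    """Total parameters of a chain of convolutional layers with the given widths."""
    total = 0
    for out_channels in widths:
        total += conv_params(in_channels, out_channels, kernel)
        in_channels = out_channels
    return total

# Layer widths taken from the architecture description; input channels are assumed.
GENERATOR_WIDTHS = [32] * 7 + [1]
DISCRIMINATOR_WIDTHS = [64, 64, 128, 128, 256, 256]

gen_total = count_conv_stack(1, GENERATOR_WIDTHS)
disc_conv_total = count_conv_stack(1, DISCRIMINATOR_WIDTHS)
print(gen_total, disc_conv_total)
```

Under these assumptions the generator has 56,097 convolution parameters and the convolutional part of the discriminator has 1,144,256, so the generator that runs at inference time is the much smaller of the two networks.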

Experimental results and analysis
The experiments were run on a computer with an Intel Core i7-6700K CPU and an NVIDIA GeForce GTX 1080 Ti GPU. The model is tested in Python using the TensorFlow library.

Network training
A total of 300 CT images of size 512×512 pixels were selected from the TCGA-COAD clinical CT dataset as training data. Low-dose CT images are obtained by simulation. A fan-beam projection transform is applied to the normal-dose CT (NDCT) images. The resulting projection matrix $S$ is exponentiated and Poisson noise is added; the logarithm is then taken, and the noisy projection is transformed back into the image domain by the FBP algorithm to obtain a simulated low-dose CT image. The noise is added in the projection domain as
$$I_n \sim \mathrm{Poisson}(b\, e^{-S}), \qquad S_n = \ln \frac{b}{I_n} \tag{14}$$
where $b$, the number of emitted photons, is taken as $10^6$; $I_n$ is the number of photons received by the detector; and $S_n$ is the projection matrix after noise contamination. The input to the training network is an image patch of size 64×64 pixels; blank patches are excluded in advance when the patches are selected. In the experiments, all networks are optimized with the Adam algorithm [Kingma and Ba (2014)], a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure and iteratively updates the neural network weights based on the training data. In this paper, the hyperparameters of the Adam algorithm are set to:
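To make the low-dose simulation described in this section concrete, the following minimal sketch applies the projection-domain noise step to a small array of line integrals. It assumes the projection data $S$ are already available, so the fan-beam projection and the FBP reconstruction are omitted. The function names are hypothetical, large Poisson means are drawn with a Gaussian approximation because the Python standard library has no Poisson sampler, and the photon count is clamped to at least 1 to avoid taking the logarithm of zero.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw from Poisson(lam); Gaussian approximation for large lam (an assumption)."""
    if lam > 50.0:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    # Knuth's multiplication method for small lam.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_low_dose_sinogram(sinogram, b=1e6, seed=0):
    """Add photon-counting noise to line integrals S and return noisy S_n."""
    rng = random.Random(seed)
    noisy = []
    for row in sinogram:
        noisy_row = []
        for s in row:
            photons = poisson_sample(b * math.exp(-s), rng)  # I_n at the detector
            photons = max(photons, 1)                        # guard against log(0)
            noisy_row.append(math.log(b / photons))          # S_n = ln(b / I_n)
        noisy.append(noisy_row)
    return noisy

clean = [[0.5, 1.0], [2.0, 0.0]]
print(simulate_low_dose_sinogram(clean))
```

Larger line integrals leave fewer photons at the detector, so their relative noise is higher. Lowering `b` simulates a lower dose and increases the noise everywhere.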

Subjective evaluation
Ten images from the TCGA-COAD clinical dataset were randomly selected as test images; none of them overlaps with the 300 training images. The 10 selected images are shown in Fig. 4. The denoising results of the proposed method and the RED-CNN method from the literature are compared in Fig. 5. As the figure shows, both the proposed method and RED-CNN achieve good results; the difference is that the network proposed in this paper retains more image detail. Fig. 6 is an enlarged view of the boxed area of Fig. 5. As shown in the red box, the proposed method retains more image detail than RED-CNN. The gray bone area produced by RED-CNN in Fig. 6(c) is already difficult to identify, whereas it is better preserved by the method in this paper.

Objective comparison
For quantitative analysis, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are used as evaluation indicators of the denoising effect on low-dose CT images. The detailed data for the test images are shown in Tab. 1. As can be seen from Tab. 1, the proposed method is ahead of RED-CNN in the objective indicators on 7 test images. Its average PSNR is 0.8 dB higher than that of RED-CNN, and its SSIM is slightly ahead of RED-CNN.
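For reference, PSNR can be computed as in the following minimal sketch. The `data_range` of 255 assumes 8-bit images and would need to match the actual intensity range of the CT data. SSIM is not shown because it additionally requires local windowed statistics.

```python
import math

def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    n = len(reference) * len(reference[0])
    mse = sum((a - b) ** 2
              for row_r, row_t in zip(reference, test)
              for a, b in zip(row_r, row_t)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)

reference = [[100.0, 120.0], [140.0, 160.0]]
denoised = [[101.0, 119.0], [141.0, 159.0]]  # every pixel is off by one gray level
print(psnr(reference, denoised))
```

Because PSNR is a logarithmic function of the MSE, a gain of 0.8 dB corresponds to reducing the mean square error by a factor of about $10^{0.08} \approx 1.2$.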

Algorithm complexity comparison
The average time consumption is obtained by averaging the time of 20 forward propagations for each test image, as shown in Tab. 2. As can be seen from Tab. 2, WGAN takes over 50% less time than RED-CNN. WGAN-gp speeds up convergence and takes 64% less time than WGAN, and it will perform better on the GPU.
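The timing protocol of averaging 20 forward propagations per test image can be sketched as follows. The function `dummy_forward` is a hypothetical stand-in for a trained network's forward pass, and the untimed warm-up call is an assumption rather than a documented part of the experiment.

```python
import time

def average_forward_time(forward, image, repeats=20):
    """Average wall-clock time in seconds of `repeats` forward passes."""
    forward(image)  # warm-up call, excluded from the timing (an assumption)
    start = time.perf_counter()
    for _ in range(repeats):
        forward(image)
    return (time.perf_counter() - start) / repeats

def dummy_forward(image):
    """Hypothetical stand-in for a network forward pass."""
    return [[0.5 * px for px in row] for row in image]

image = [[float(i + j) for j in range(64)] for i in range(64)]
print(f"{average_forward_time(dummy_forward, image):.6f} s per forward pass")
```

Averaging over repeated runs reduces the influence of scheduling jitter, so that differences between networks reflect their computational cost rather than measurement noise.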

Conclusion
In view of the difficulty traditional algorithms have in suppressing noise in low-dose CT images, this paper uses a generative adversarial network (WGAN-gp), which greatly shortens the convergence process. With the new loss function, the relevant details of the image are preserved. Experimental results show that the proposed method denoises low-dose CT images better than RED-CNN. Compared with the WGAN network, the WGAN-gp network with its gradient penalty term significantly improves the convergence speed.