Low-dose CT via convolutional neural network

Abstract: In order to reduce the potential radiation risk, low-dose CT has attracted increasing attention. However, simply lowering the radiation dose significantly degrades the image quality. In this paper, we propose a new noise reduction method for low-dose CT via deep learning without accessing the original projection data. A deep convolutional neural network is used to map low-dose CT images to their corresponding normal-dose counterparts in a patch-by-patch fashion. Qualitative results demonstrate the great potential of the proposed method for artifact reduction and structure preservation. In terms of quantitative metrics, the proposed method shows substantial improvements in PSNR, RMSE and SSIM over the competing state-of-the-art methods. Furthermore, our method is one order of magnitude faster than the iterative reconstruction and patch-based image denoising methods.


Introduction
In recent decades, X-ray computed tomography (CT) has been widely used in both diagnostic and industrial fields. With the huge number of CT scans performed, the potential radiation risk has become a public concern [1, 2]. Most current commercial CT scanners utilize the filtered backprojection (FBP) method to analytically reconstruct images. One of the most common ways to reduce the radiation dose is to lower the operating current of the X-ray tube. However, directly lowering the tube current significantly degrades the image quality due to the excessive quantum noise caused by an insufficient number of photons in the projection domain. Many approaches have been proposed to improve the quality of low-dose CT images. These approaches generally fall into three classes: sinogram filtering, iterative reconstruction, and image processing.
Sinogram filtering directly smooths the raw data before FBP is applied. Structural adaptive filtering and bilateral filtering are two efficient methods, proposed by Balda's and Manduca's groups respectively [3, 4]. After investigating the statistical model of noisy sinogram data, Li et al. developed a penalized likelihood method to suppress the quantum noise [5]. Two groups independently improved this method via multiscale decomposition [6, 7]. Iterative reconstruction solves the low-dose CT problem iteratively, aided by prior information on the target images. Different priors have been proposed, involving total variation (TV) [8-11], nonlocal means (NLM) [12-14], dictionary learning [15] and low-rank matrix decomposition [16]. Despite the successes of these two approaches, they are often restricted in practice by the difficulty of accessing projection data, since vendors are generally not open in this aspect. Meanwhile, the iterative reconstruction methods involve heavy computational costs.
In contrast to the first two categories, image processing does not rely on projection data, is directly applied to low-dose CT images, and can be readily integrated into the current CT workflow. However, it should be underlined that the noise in low-dose CT images does not obey a uniform distribution. As a result, it is not easy to remove image noise and artifacts completely with traditional methods. Extensive efforts have been made to suppress image noise in low-dose CT via image processing [42]. Wang shared his opinions on deep learning for image reconstruction [43]. Kang et al. proposed an image denoising method based on a deep CNN [44]. Inspired by the great potential of deep learning in image processing, here we propose a deep convolutional neural network to map low-dose CT images to corresponding normal-dose CT images. Note that the work in [44] differs from ours in two aspects. First, their method works on wavelet coefficients of low-dose CT images, while our method is directly applied to images. Second, their network has 26 layers, which is very large and time-consuming to train, while ours has only 3 layers. The goal of this paper is to demonstrate the effectiveness of deep learning for low-dose CT image restoration; as shown below, the proposed method achieves excellent performance at a much-reduced computational cost and can be further improved in future work. In the second section, the network and training details are described. In the third section, qualitative and quantitative results are presented. In the last section, relevant issues are discussed, and the conclusion is drawn.

Noise reduction model
Due to the encryption of raw projection data, post-reconstruction restoration is a reasonable alternative to sinogram-based methods. Once the target image is reconstructed from a low-dose scan, the problem becomes one of image restoration, noise reduction, or image denoising. The main difference between low-dose CT and natural image restoration is that the statistical properties of low-dose CT images cannot be precisely determined in the image domain. This significantly compromises the performance of noise-type-dependent methods, such as median filtering, Gaussian filtering and anisotropic diffusion, which were designed for specific noise types. On the other hand, learning-based methods are immune to this problem, because such methods are to a large degree optimized with reference to training samples, instead of an assumed noise type.
We model the noise reduction problem for low-dose CT as follows. Let X denote a low-dose CT image and Y the corresponding normal-dose image. Their relationship can be formulated as X = σ(Y), where σ represents the corrupting process that contaminates the normal-dose CT image with quantum noise. The noise reduction problem is then to find a function f such that f(X) ≈ Y, where f is treated as the best approximation of σ⁻¹; here we use a deep neural network to approximate f.

Convolutional neural network
In this study, the low-dose CT problem is solved in three steps: patch encoding, non-linear filtering, and reconstruction. Next, we introduce each step in detail.

Patch encoding
Sparse representation (SR) is now popular in the field of image processing. The key idea of SR is to represent patches extracted from an image with a pre-trained dictionary. Such dictionaries can be categorized into two groups according to how dictionary atoms are constructed. The first group comprises analytic dictionaries such as DCT, wavelet and FFT bases [45]. The other is the learned dictionary [46], which has better specificity and can preserve more application-oriented details given proper training samples. SR can be implemented using convolution operations with a series of filters, each of which is an atom. Our method is similar in that SR is involved in the patch encoding step via a neural network. First, we extract patches from training images with a fixed sliding step. Second, the first layer for patch encoding can be formulated as

C₁(y) = ReLU(W₁ * y + b₁),  (3)

where W₁ and b₁ denote the weights and biases respectively, * represents the convolution operator, y is a patch extracted from an image, and ReLU(x) = max(0, x) is the activation function [47]. Clearly, b₁ has the same size as W₁ * y. In the CNN, W₁ can be seen as n₁ convolution kernels with a uniform size of s₁ × s₁. After patch encoding, the image patches are embedded into a feature space, and the output C₁(y) is a feature vector of size n₁.
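As a concrete illustration of the patch-encoding layer, the NumPy sketch below (our own illustration; the original experiments used MATLAB) encodes a single s₁ × s₁ patch into an n₁-dimensional feature vector, with n₁ = 64 and s₁ = 9 as set later in the paper:

```python
import numpy as np

def encode_patch(y, w1, b1):
    """Patch encoding (first layer): each of the n1 kernels fires once on an
    s1 x s1 patch, so the output is an n1-dimensional feature vector."""
    z = np.array([np.sum(w1[k] * y) + b1[k] for k in range(w1.shape[0])])
    return np.maximum(z, 0.0)  # ReLU(x) = max(0, x)

rng = np.random.default_rng(0)
y = rng.random((9, 9))                    # a patch the same size as the kernels (s1 = 9)
w1 = rng.normal(0.0, 0.001, (64, 9, 9))   # n1 = 64 kernels, Gaussian init as in the paper
b1 = np.zeros(64)
feat = encode_patch(y, w1, b1)
print(feat.shape)  # (64,)
```

On a larger input image, sliding the same kernels over every position yields n₁ feature maps rather than a single vector.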

Non-linear filtering
After processing by the first layer, an n₁-dimensional feature vector is obtained from each extracted patch. In the second layer, we transform the n₁-dimensional vectors into n₂-dimensional ones. This operation is equivalent to filtering the feature maps from the first layer. The second layer, implementing non-linear filtering, can be formulated as

C₂(y) = ReLU(W₂ * C₁(y) + b₂),  (4)

where W₂ is composed of n₂ convolution kernels with a uniform size of s₂ × s₂, and b₂ has the same size as W₂ * C₁(y). If the desired network had only two layers, the n₂-dimensional output vectors of this layer would be the corresponding cleaned patches for the final reconstruction.
Generally speaking, inserting more layers is a valid way to potentially boost the capacity of the network. However, a deeper CNN is at a cost of more complex computation especially longer training time. In our initial feasibility study, there were three layers in our neural network; see below for more details.

Reconstruction
In this step, the processed overlapping patches are merged into a final complete image. These overlapping patches must be properly weighted before their summation. This operation can also be considered as filtering by a pre-defined convolutional kernel, formulated as

C₃(y) = W₃ * C₂(y) + b₃,  (5)

where W₃ consists of only one convolution kernel with a size of s₃ × s₃, and b₃ has the same size as W₃ * C₂(y). Equations (3)-(5) are all convolutional operations, although they were designed for different purposes. In other words, the CNN architecture is used here for low-dose CT image denoising.
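Putting the three layers together, a minimal forward pass can be sketched in NumPy as below (an illustrative sketch, not the original implementation; the sizes n₁ = 64, n₂ = 32 and s₁, s₂, s₃ = 9, 3, 5 follow the parameter settings reported later):

```python
import numpy as np

def conv_valid(x, w, b):
    """Valid (no-padding) convolution. x: (C_in, H, W); w: (C_out, C_in, k, k); b: (C_out,)."""
    c_out, c_in, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.empty((c_out, H, W))
    for o in range(c_out):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k]) + b[o]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def cnn_forward(y, params):
    """Three-layer mapping: patch encoding -> non-linear filtering -> reconstruction."""
    (w1, b1), (w2, b2), (w3, b3) = params
    c1 = relu(conv_valid(y, w1, b1))    # Eq. (3): n1 feature maps
    c2 = relu(conv_valid(c1, w2, b2))   # Eq. (4): n2 feature maps
    return conv_valid(c2, w3, b3)[0]    # Eq. (5): single output map, no ReLU

rng = np.random.default_rng(0)
params = [
    (rng.normal(0, 0.001, (64, 1, 9, 9)), np.zeros(64)),
    (rng.normal(0, 0.001, (32, 64, 3, 3)), np.zeros(32)),
    (rng.normal(0, 0.001, (1, 32, 5, 5)), np.zeros(1)),
]
patch = rng.random((1, 33, 33))         # one 33 x 33 low-dose patch
out = cnn_forward(patch, params)
print(out.shape)  # (19, 19): 33-9+1 = 25, 25-3+1 = 23, 23-5+1 = 19
```

With valid convolutions, a 33 × 33 input shrinks to 19 × 19, which is why the target patches must be cropped accordingly during training.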

Training
Once the network is configured, the parameter set Θ = {W₁, b₁, W₂, b₂, W₃, b₃} needs to be estimated from a set of training pairs {(yᵢ, xᵢ)}, i = 1, …, N, where yᵢ and xᵢ denote low-dose and corresponding normal-dose image patches respectively, and N is the total number of training samples. The parameters are estimated by minimizing the following loss function in terms of the mean squared error:

L(Θ) = (1/N) Σᵢ ‖C₃(yᵢ; Θ) − xᵢ‖²,  (6)

The loss function is optimized using the stochastic gradient descent method [48].
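The SGD minimization of the mean-squared-error loss can be sketched as follows. For brevity, the network f is replaced here by a single linear layer on flattened patches (a deliberate simplification, not the paper's three-layer CNN); the initial learning rate of 0.001 matches the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 16, 200
W_true = rng.normal(size=(d, d))    # synthetic "clean-from-noisy" mapping to recover
Y = rng.normal(size=(N, d))         # stand-ins for low-dose patches (flattened)
X = Y @ W_true.T                    # stand-ins for normal-dose targets

W = np.zeros((d, d))
lr = 0.001
for epoch in range(200):
    for i in rng.permutation(N):    # stochastic gradient descent over shuffled samples
        r = W @ Y[i] - X[i]         # residual f(y_i) - x_i
        W -= lr * np.outer(r, Y[i]) # gradient of 0.5 * ||f(y_i) - x_i||^2
mse = np.mean((Y @ W.T - X) ** 2)
print(mse)
```

For the actual CNN, the same per-sample update is applied to all six parameter tensors, with gradients obtained by backpropagation.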

Experimental design and results
There are many factors that may affect the performance of the proposed model; here we focus on the key aspects of radiation dose, training data and testing data. Meanwhile, representative state-of-the-art methods, covering both iterative reconstruction and post-reconstruction restoration, were selected for comparison, including: 1) ASD-POCS [8]: This is an early iterative CT reconstruction method with TV minimization as an efficient sparse constraint.
2) K-SVD [21]: This is a typical dictionary learning based method. The noisy patch is restored via a sparse linear combination of learned dictionary atoms.

3) BM3D [23]: This is currently the most popular denoising method, which is based on the self-similarity of images. The parameters of the competing methods were set according to the recommendations in the original references.
The peak signal-to-noise ratio (PSNR), root mean square error (RMSE), and structural similarity index (SSIM) [49] were used as quantitative metrics. All the experiments were performed using MATLAB 2015b on a PC (Intel i7-6700K CPU, 16 GB RAM and a GTX 980 Ti graphics card).
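For reference, the three metrics can be computed as below (a NumPy sketch; the SSIM here uses a single global window for simplicity, whereas the reference implementation of [49] uses a sliding Gaussian window):

```python
import numpy as np

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((x - ref) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def rmse(x, ref):
    """Root mean square error."""
    return np.sqrt(np.mean((x - ref) ** 2))

def ssim_global(x, ref, peak=1.0):
    """Simplified single-window SSIM (global means/variances, standard constants)."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, mr = x.mean(), ref.mean()
    vx, vr = x.var(), ref.var()
    cov = ((x - mx) * (ref - mr)).mean()
    return ((2 * mx * mr + c1) * (2 * cov + c2)) / ((mx ** 2 + mr ** 2 + c1) * (vx + vr + c2))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
noisy = np.clip(ref + rng.normal(0, 0.1, ref.shape), 0, 1)
print(psnr(noisy, ref), rmse(noisy, ref), ssim_global(noisy, ref))
```

A constant offset of 0.1 on a peak-1 image, for instance, gives an MSE of 0.01 and hence a PSNR of exactly 20 dB.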

Data set preparation
In total, 7,015 normal-dose CT images of 256 × 256 pixels from 165 patients, covering different parts of the human body, were downloaded from the National Cancer Imaging Archive (NCIA). Figure 1 illustrates several typical slices included in our training set. Siddon's ray-driven algorithm [50] was used to simulate a fan-beam geometry. The source-to-rotation center distance was 40 cm, and the detector-to-rotation center distance was also 40 cm. The image region was 20 cm × 20 cm. The detector width was 41.3 cm, containing 512 detector elements.
The data were uniformly sampled at 1,024 views over a full scan. The input patches of the network, of size 33 × 33 (m = 33), were extracted from the original images with a sliding step of 4. The original 200 training images included in our first experiment resulted in about 10⁶ samples. There are two reasons why patches were used instead of whole images: first, images can be better represented by local structures; second, deep learning requires a big training data set, and chopping the original images into patches efficiently boosts the number of samples. Meanwhile, there are two other possible ways to enlarge the number of training samples. The first and simplest is to directly add more images to the training set; however, in many situations, collecting a large number of samples is very difficult. The second, data augmentation, is an effective alternative. These two ways were evaluated and are reported below.
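The patch extraction described above can be sketched in NumPy (our own illustration). With 33 × 33 patches and a sliding step of 4, each 256 × 256 image yields 56 × 56 = 3,136 patches, so 200 images give roughly 6 × 10⁵ samples, consistent with the count stated above:

```python
import numpy as np

def extract_patches(img, size=33, stride=4):
    """Extract overlapping size x size patches with the given sliding step."""
    H, W = img.shape
    return np.array([img[i:i + size, j:j + size]
                     for i in range(0, H - size + 1, stride)
                     for j in range(0, W - size + 1, stride)])

img = np.arange(256 * 256, dtype=float).reshape(256, 256)
patches = extract_patches(img)
print(patches.shape)  # (3136, 33, 33)
```

In practice the same indices are applied to the low-dose and normal-dose images so that the patch pairs stay aligned.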
200 normal-dose and corresponding low-dose image pairs were randomly selected from the original data set as the training set. For fairness, 100 low-dose images were randomly selected from the original data set as the testing set, excluding the images from the patients involved in the training set.

Parameter setting
In this study, three layers were used in the neural network. The filter numbers n₁ and n₂ were set to 64 and 32 respectively, and the corresponding filter sizes s₁, s₂ and s₃ were set to 9, 3 and 5. The initial filter weights in each layer were drawn randomly from a Gaussian distribution with zero mean and standard deviation 0.001. The initial learning rate was 0.001 and was slowly decayed to 0.0001 during training.
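These settings can be summarized in a short configuration sketch (NumPy; the geometric decay schedule and the epoch count are our assumptions, as the paper only states that the rate slowly decays from 0.001 to 0.0001):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian initialization: zero mean, standard deviation 0.001
w1 = rng.normal(0.0, 0.001, size=(64, 1, 9, 9))   # n1 = 64 kernels, s1 = 9
w2 = rng.normal(0.0, 0.001, size=(32, 64, 3, 3))  # n2 = 32 kernels, s2 = 3
w3 = rng.normal(0.0, 0.001, size=(1, 32, 5, 5))   # 1 kernel, s3 = 5

# learning rate decayed geometrically from 1e-3 to 1e-4 over the run
epochs = 100  # hypothetical epoch count, not stated in the paper
lrs = 0.001 * 0.1 ** (np.arange(epochs) / (epochs - 1))
print(lrs[0], lrs[-1])
```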

Visual inspection
We selected two representative slices, containing different parts (chest and abdomen) of the human body, from the results on the testing set. Figures 2 and 3 show the results obtained using the different methods.
In both Figs. 2 and 3, the noise and artifacts caused by the lack of incident photons severely degrade the image quality. Important details and structures cannot all be discriminated. All the methods can eliminate the noise and artifacts to different degrees. However, due to the piecewise-constant assumption behind TV minimization, ASD-POCS caused blocky effects in both Figs. 2 and 3. KSVD and BM3D cannot efficiently suppress the streak artifacts near the bone, as indicated by the red arrows in Fig. 2; the zoomed regions (Fig. 2(g)-2(l)) marked by the red boxes show more details. In Fig. 3, the arrows point to details and boundaries, and demonstrate that only the proposed CNN-based method appears to recover most of these details.

Quantitative measurement
To quantitatively evaluate the proposed CNN-based algorithm, PSNR, RMSE and SSIM were measured for all the images in the testing set. The results for the restored images in Figs. 2 and 3 are listed in Tables 1 and 2. It can be seen in Table 1 that, for all the metrics of Fig. 2, our proposed method achieved the best results, which is consistent with the visual inspection. In Table 2, the CNN also achieved the best performance except for SSIM. One possible reason is that most of the abdomen image is soft tissue, and BM3D is preferable in this situation. Table 3 shows the average values of the measurements over all 100 images in the testing set. The proposed CNN-based method outperformed the competing methods in terms of all the metrics. For qualitative evaluation, 20 normal-dose images from the testing set, their corresponding low-dose versions, and 20 processed images obtained using the different methods were randomly selected for review. Artifact reduction, noise suppression, contrast retention and overall quality were used as qualitative indicators with a five-grade assessment (1 = worst and 5 = best). Three radiologists (R1-R3), each with more than 8 years of clinical experience, evaluated these images and provided their scores. The low-dose images and the processed images obtained using the different algorithms were compared with the original normal-dose images. Student's t-test with P < 0.05 was performed to assess the discrepancies. The statistical results are summarized in Table 4.

Qualitative measurement
As demonstrated in Table 4, the quality scores of the low-dose images are much lower than those of the normal-dose ones. All the image denoising methods significantly improve the image quality, with BM3D and the proposed CNN-based algorithm achieving the best results. The scores for the CNN are closest to those of the normal-dose images, and the t-test results show a similar trend: the differences between the normal-dose images and the results of the CNN-based method are not statistically significant for all the qualitative indices except contrast retention.
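The per-image score comparison described above can be reproduced with a paired t statistic (a NumPy sketch on hypothetical scores; `scipy.stats.ttest_rel` would additionally return the P-value):

```python
import numpy as np

def paired_t(a, b):
    """Paired two-sided t statistic for matched per-image quality scores."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))  # mean difference over its standard error

# hypothetical scores from one reader for 4 images: normal-dose vs. a processed version
t = paired_t([5, 5, 5, 5], [4, 3, 2, 1])
print(t)  # sqrt(15), approximately 3.873
```

The statistic is then compared against the t distribution with n − 1 degrees of freedom at the P < 0.05 level.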

Sensitivity analysis
In this sub-section, two important factors, the noise level of the images and the size of the training set, were evaluated for their impact on the performance of the CNN.

1) Noise level of images
It is well known that the performance of learning-based methods depends on the samples in the training set. The noise levels of the images in the training and testing sets were consistent in our study. The results above proved the effectiveness of the proposed CNN-based method compared to the state-of-the-art methods at the specified noise level (b₀ = 10⁵). However, the noise level of the target images is not always available. Hence, different combinations of noise levels (of training and testing images) were used to validate the robustness of our method, as reported in Table 5. CNN200-1 denotes a training set including only a single type of data (situations (a)-(d)), while CNN200-4 denotes a training set including mixed data (situations (e)-(h)). In all the cases except BM3D-F, the parameters of ASD-POCS, KSVD and BM3D were adjusted to achieve the best average PSNR values at the different noise levels. BM3D-F indicates running BM3D with the parameters fixed for b₀ = 10⁵. From Table 5, it can be observed that (1) CNN200-4 with the mixed training set had the best results in almost every metric in all the situations with different noise levels; especially at high noise levels, the advantage of CNN200-4 is more evident; and (2) when the noise level was low (b₀ = 5 × 10⁵), BM3D achieved better performance than the other methods, but as the noise level increased, CNN200-1 became competitive with BM3D, which can be seen as evidence of the robustness of the CNN-based method at different noise levels. Meanwhile, we must mention that the parameters of the other methods were adjusted for the different noise levels, and it can be predicted that if the parameters were fixed, their performance would deteriorate at other noise levels. BM3D-F is an example of this situation: its results at the other noise levels (except b₀ = 5 × 10⁵) became worse than the benchmark.
Thus, CNN200-1 should achieve even better results than BM3D-F when the noise level is unknown.

2) Size of the training set
The size of the training set is another factor we considered for its impact on the performance of the CNN-based method. In this subsection, we extended the size of the training set in two ways: increasing the number of images included in the training set, and boosting the number of samples via data augmentation. In the first way, we simply increased the number of images in the training set to 2,000. Data augmentation has been proved powerful in avoiding overfitting when the training set is small, and has been successfully applied in many different tasks, such as speech recognition, text categorization, object recognition, and sentiment classification [51-53]. In this experiment, three kinds of transformations, namely rotation (by 45 degrees), flipping (vertical and horizontal) and scaling (by factors of 2 and 0.5), were used for data augmentation. After these operations, the training set became 12 times larger than the original one. In both situations, the testing set remained the same. Figure 4 shows the results of the CNN-based method with different sizes of the training set, using the same image as in Fig. 3. CNN2000-1 and CNN2000-4 have meanings similar to those in the previous experiment, with '2000' denoting the number of images included in the training set. CNN200-1-DA and CNN200-4-DA denote that data augmentation was performed in data preparation. In Fig. 4, all the methods efficiently suppressed the noise and artifacts, with similar visual effects. However, it is noticeable that the methods trained at a single noise level (Fig. 4(a), 4(c) and 4(e)) produced smoother images than the ones trained at multiple noise levels (Fig. 4(b), 4(d) and 4(f)). Some structural differences are marked with red arrows. In Fig. 5, the significant parts are zoomed. Two small structural details indicated by the red arrows cannot be identified in the low-dose image (Fig. 5(b)).
After processing by the CNN-based methods, these details became clearer to different degrees than their counterparts in the low-dose image. The methods trained at various noise levels kept these details more faithfully, relative to the normal-dose image, than the methods trained at a single noise level. These observations are consistent with the previous results: a network trained at a single noise level is reasonably robust in dealing with testing data corrupted by noise at different levels, and training with data corrupted at various noise levels further enhances this robustness. From the quantitative results in Table 6, improvements in the metrics can be noticed with 2,000 images in the training set; after data augmentation, however, the metrics were not improved as much as by injecting more real images into the training set. A possible reason is that the original number of samples was already close to the 10⁶ level, which is large; data augmentation boosts the number to the 10⁷ level, but the features of the augmented training set were not enriched as effectively as by adding real training images. This result can also be explained by the fact that the performance of the current network was basically saturated by the number of samples in the initial training set, leaving no essential need to further increase the number of samples. The impact of different network configurations will be investigated in our future work. In addition, all the methods to increase the number of training samples significantly extend the training time. How to balance the computational cost and the performance is an important issue in practical applications.
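The flipping and scaling transformations used for data augmentation above can be sketched in NumPy as below (nearest-neighbour scaling for simplicity; the 45-degree rotation is omitted here because it requires interpolation, e.g. `scipy.ndimage.rotate`):

```python
import numpy as np

def augment(img):
    """Return the original image plus flipped and rescaled copies."""
    return [img,
            np.flipud(img),                  # vertical flip
            np.fliplr(img),                  # horizontal flip
            np.kron(img, np.ones((2, 2))),   # 2x upscale, nearest neighbour
            img[::2, ::2]]                   # 0.5x downscale

img = np.arange(16.0).reshape(4, 4)
aug = augment(img)
print([a.shape for a in aug])  # [(4, 4), (4, 4), (4, 4), (8, 8), (2, 2)]
```

Each augmented image is then chopped into patch pairs exactly as the original images are.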

Real data test
In the previous sections, the low-dose CT images in both the training and testing sets were generated via numerical simulation. To validate the effectiveness of our method on real data, low-dose raw projections from a sheep lung perfusion study were used. An anesthetized sheep was scanned at normal and low doses respectively on a SIEMENS Somatom Sensation 64-slice CT scanner (Siemens Healthcare, Forchheim, Germany) in a helical scanning mode (credited to Dr. Eric Hoffman of the University of Iowa, Iowa, USA, with his permission for the retrospective use of the data). The normal-dose scan was acquired with a 100 kVp, 150 mAs protocol, while the low-dose scan was acquired with an 80 kVp, 17 mAs protocol. The sinograms of the central slice, in a fan-beam geometry, were extracted. The radius of the trajectory was 57 cm. 1,160 projections were uniformly collected over a full scan. For each projection, 672 detector bins were equi-angularly distributed to define a field-of-view (FOV) of 25.05 cm in radius. In this experiment, the reconstructed images were 768 × 768 pixels with a physical size of 43.63 × 43.63 cm.
The results from the low-dose sinogram are shown in Fig. 6. Strong noise can be observed in the FBP reconstruction in Fig. 6(b), and streak artifacts exist near the highly attenuating structures. ASD-POCS removed most of the noise and artifacts, but blocky effects also arose and blurred some important structures. BM3D and the CNN-based method (CNN200-1) suppressed the noise better than KSVD, which left streak artifacts, as shown in Fig. 6(d) and 6(e). The three regions indicated by the red arrows are examples where the CNN-based method achieved the best results. Figure 7 shows the differences between the low-dose FBP image and the counterparts obtained using the other methods, which serves as supplementary evidence of the structural preservation of the CNN-based method. Few artifacts can be seen in Fig. 7(d), which means that little structural information was lost.

Discussions and conclusion
Deep learning has achieved exciting results in the fields of computer vision and image processing, such as image segmentation, object detection, object tracking, image super-resolution, etc. In medical imaging, deep learning has so far been applied mainly to image analysis, such as segmentation and detection, and its potential for other applications has yet to be widely explored. In this paper, a deep convolutional neural network is presented to denoise low-dose CT images. The proposed method learns a mapping from low-dose to normal-dose images. After the training stage, our method achieves a competitive performance relative to the state-of-the-art methods in both qualitative and quantitative aspects. We have also assessed several factors relevant to performance tradeoffs, suggesting the robustness of the proposed method at different noise levels and with respect to various training settings, including the size of the training set. It is hypothesized that the use of a deeper network with a larger capacity may further improve the performance, which we will investigate in the future.
Operating on 3D blocks should be an effective way to improve the performance of the proposed method. However, the goal of this initial paper is to demonstrate the effectiveness of deep learning, and operating on 2D patches is easier to implement and compare with other representative low-dose CT denoising methods, most of which were originally designed for 2D images. Training with 3D blocks will clearly cost much more computational time, and will be pursued in our future work.
The computational cost of the proposed CNN-based method also needs to be commented on. The main computational cost lies in the training stage. Although the training is normally done on a GPU, it is still time-consuming. For the smaller training set with 200 images, about 10⁶ patches were involved, and training took about 17 hours. For the larger training set with 2,000 images, about 10⁷ patches were involved and 56 hours were spent. Iterative methods do not need a training phase, but their execution time is much longer than that of the CNN-based method. In this study, the average execution times (in CPU mode) for ASD-POCS, KSVD, BM3D and CNN were 23.79, 40.88, 3.33 and 2.05 seconds respectively. Once the network is trained offline, the CNN-based method is much more efficient than the other methods in terms of actual execution time.
In conclusion, we are much encouraged by the excellent performance of the deep convolutional neural network approach for noise reduction in low-dose CT. The results have demonstrated the potential of CNN-based methods for medical imaging. In the future, the proposed network structure will be refined and adapted to other topics in CT imaging, such as few-view reconstruction, metal artifact reduction, and interior CT. Another possible direction is to investigate different hyperparameters of the network, especially the number of layers, and other network architectures to deal with these CT problems and other imaging tasks.