Neural networks-based regularization for large-scale medical image reconstruction

In this paper we present a generalized Deep Learning-based approach for solving ill-posed large-scale inverse problems occurring in medical image reconstruction. Recently, Deep Learning methods using iterative neural networks (NNs) and cascaded NNs have been reported to achieve state-of-the-art results with respect to various quantitative quality measures such as PSNR, NRMSE and SSIM across different imaging modalities. However, since these approaches repeatedly apply the forward and adjoint operators within the network architecture, the network has to process whole images or volumes at once, which for some applications is computationally infeasible. In this work, we follow a different reconstruction strategy by strictly separating the application of the NN, the regularization of the solution and the consistency with the measured data. The regularization is given in the form of an image prior obtained as the output of a previously trained NN, which is then used in a Tikhonov regularization framework. By doing so, more complex and sophisticated network architectures can be used for the removal of artefacts or noise than is usually the case in iterative NNs. Due to the large scale of the considered problems and the resulting computational complexity of the employed networks, the priors are obtained by processing the images or volumes as patches or slices. We evaluated the method for 3D cone-beam low-dose CT and undersampled 2D radial cine MRI and compared it to a total variation-minimization-based reconstruction algorithm as well as to a method with regularization based on learned overcomplete dictionaries. The proposed method outperformed all reported comparison methods with respect to all chosen quantitative measures and further accelerates the regularization step of the reconstruction by several orders of magnitude.


Introduction
In inverse problems, the goal is to recover an object of interest from a set of indirect and possibly incomplete observations. In medical imaging, for example, a classical inverse problem is given by the task of reconstructing a diagnostic image from a certain number of measurements, e.g. x-ray projections in computed tomography (CT) or the spatial frequency information (k-space data) in magnetic resonance imaging (MRI). The reconstruction from the measured data can be an ill-posed inverse problem for different reasons. In low-dose CT, for example, the reconstruction from noisy data is ill-posed because of the ill-posedness of the inversion of the Radon transform. In accelerated MRI, on the other hand, the reconstruction from incomplete data is ill-posed since the underlying problem is essentially underdetermined and therefore no unique solution exists without integrating prior information.
In order to constrain the space of possible solutions, a typical approach is to impose specific a-priori chosen properties on the solution by adding a regularization (or penalty) term to the problem. Well-known examples are total variation- or sparsity-based penalties. Obtaining the regularizer from the output of a previously trained NN, as proposed here, makes the method general and applicable both to problems which are ill-posed due to noisy measurements (e.g. low-dose CT) and to ones where only incomplete data is available (accelerated MRI).
This paper is organized as follows. In section 2, we formally introduce the inverse problem of image reconstruction and motivate our proposed approach for the solution of large-scale ill-posed inverse problems. We demonstrate the feasibility of our method by applying it to 3D low-dose cone-beam CT and 2D radial cine MRI in section 3. We further compare the proposed approach to an iterative reconstruction method given by total variation-minimization (TV) and a learning-based method (DIC) using Dictionary Learning-based priors in section 4. We then conclude the work with a discussion and conclusion in sections 5 and 6.

Large-scale image reconstruction with CNN-priors
In this section, we present the proposed deep learning scheme for solving large-scale, possibly non-linear, inverse problems. For the sake of clarity, we do not focus on a functional analytic setting but consider discretized problems of the form

A(x) + z = y, (1)

where A : X → Y is a discrete, possibly non-linear, forward operator between finite-dimensional Hilbert spaces, y ∈ Y is the measured data, z ∈ Y the noise and x ∈ X the unknown object to be recovered. The operator A could for example model the measurement process in various imaging modalities such as the x-ray projection in CT or the Fourier encoding in MRI. Depending on the nature of the underlying imaging modality one is considering, problem (1) can be ill-posed for different reasons. For example, in low-dose CT, the measurement data is inherently contaminated by noise. In cardiac MRI, k-space data is often undersampled in order to speed up the acquisition process. This leads to incomplete data and therefore to an underdetermined problem with an infinite number of theoretically possible solutions. In order to constrain the space of solutions of interest, a typical approach is to impose specific a-priori chosen properties on the solution x by adding a regularization (or penalty) term R(x) with a Lagrange multiplier. Then, one solves the relaxed problem

D(Ax, y) + λ R(x) → min, (2)

where D(·, ·) is an appropriately chosen data-discrepancy measure and λ > 0 controls the strength of the regularization. The choice of D(·, ·) depends on the considered problem. For the examples presented in sections 3 and 4 we choose the discrepancy measure as the squared norm distance in the case of radial cine MRI and the Kullback-Leibler divergence in the case of low-dose CT, respectively. A particular feature of the proposed large-scale approach is that the regularizer depends on the actual data, see (4) below.

CNN-based regularization
Clearly, the regularization term R(x) significantly affects the quality and the characteristics of the solution x.
Here, we propose a generalized approach for solving high-dimensional inverse problems in the following three steps: First, an initial guess of the solution is provided by a direct reconstruction from the measured data, i.e. x_ini = A†y, where A† : Y → X denotes some reconstruction operator. Then, a CNN is used to remove the noise or the artefacts from the direct reconstruction x_ini in order to obtain another intermediate reconstruction x_CNN, which is used as a CNN-prior in a generalized Tikhonov functional

F(x) = D(Ax, y) + λ ∥x − x_CNN∥₂². (3)

As a third and final step, the CNN-Tikhonov functional (3) is minimized, resulting in the proposed CNN-based reconstruction. Note that the regularization of the problem, i.e. obtaining the CNN-prior, is decoupled from the step of ensuring data-consistency of the solution via minimization of (3). This means that, in our case, the regularization term R(x) in (2) depends on the data y and is of the form

R(x) = ∥x − f_Θ(A†y)∥₂², (4)

where f_Θ denotes a function which decomposes the initial reconstruction A†y into patches, processes them with a pre-trained CNN u_Θ with trainable parameters Θ and recomposes the patches in order to provide the CNN image-prior x_CNN. Therefore, for minimizing (3), we only have to apply f_Θ one single time. Thus, the training of the network u_Θ is facilitated, since it only has to learn to map the manifold given by the initial reconstructions to the manifold of the ground truth images. This also allows the use of deeper and more sophisticated CNNs than the ones typically used in iterative networks. Given the high dimensionality of the considered problems, network training is further carried out on sub-portions of the image samples, i.e. on patches or slices which are previously extracted from the images or volumes. This is motivated by the fact that in most medical imaging applications, one typically has access to datasets with only a relatively small number of subjects.
The images or volumes of these subjects, on the other hand, are elements of a high-dimensional space. Therefore, one is concerned with the problem of having topologically sparse training data with only very few data points in the original high-dimensional image space. Working with sub-portions of the image samples increases the number of available data points and at the same time decreases the ambient dimensionality.

Large-scale CNN-prior
Suppose we have access to a finite set of N ground truth samples (x_k)_{k=1}^N and corresponding initial estimates (x_ini,k)_{k=1}^N. We are in particular interested in the case where N is relatively small and the considered samples x_k have a relatively large size, which is the case for most medical imaging applications. For any sample x ∈ X we consider its decomposition into N_{p,s} possibly overlapping patches,

x = W_{p,s} Σ_{j=1}^{N_{p,s}} (R_j^{p,s})^T R_j^{p,s} x, (5)

where R_j^{p,s} and (R_j^{p,s})^T extract and reposition the patches at the original position, respectively, and the diagonal operator W_{p,s} accounts for the weighting of regions containing overlaps. The entries of the tuples p and s specify the size of the patches and the strides in each dimension and therefore the number of patches N_{p,s} which are extracted from a single sample.
We aim for improved estimates x_CNN,k = f_θ(x_ini,k) ≈ x_k via a trained network function f_θ to be constructed. Since the operator norm of W_{p,s} is less than or equal to one, by the triangle inequality, we can estimate the average error by

(1/N) Σ_{k=1}^N ∥f_θ(x_ini,k) − x_k∥₂ ≤ (1/N) Σ_{k=1}^N Σ_{j=1}^{N_{p,s}} ∥u_θ(R_j^{p,s} x_ini,k) − R_j^{p,s} x_k∥₂. (6)

Inequality (6) suggests that it is beneficial to estimate each patch of the sample x_k by a neural network u_θ applied to R_j^{p,s} x_ini,k rather than estimating the whole sample at once. The neural network u_θ is trained on a subset of pairs of all possible patches extracted from the N samples in the dataset,

{(R_j^{p,s} x_ini,k, R_j^{p,s} x_k) : (k, j) ∈ I} with I ⊆ I_{N,N_{p,s}}, (7)

where I_{N,N_{p,s}} := {1, …, N} × {1, …, N_{p,s}}. During training, we optimize the set of parameters θ to minimize the L²-error between the estimated outputs of the patches and the corresponding ground truth patches by minimizing

(1/N_train) Σ_{(k,j) ∈ I} ∥u_θ(R_j^{p,s} x_ini,k) − R_j^{p,s} x_k∥₂², (8)

where N_train = |I| is the number of training patches. Denote by f_θ the composite function which decomposes a sample image or volume x into patches, applies the neural network u_θ to each patch, and reassembles the sample from them. This results in the proposed CNN-prior x_CNN given by

x_CNN = f_θ(x_ini), (9)

where x_ini = A†y is the initial reconstruction obtained from the measured data.
Remark 1. Since the right-hand side of (6) is an upper bound for the average reconstruction error, inequality (6) guarantees that a set of parameters found by minimizing (8) is also suitable for obtaining the prior x_CNN, i.e. that u_θ is powerful enough to deliver a CNN-prior to regularize the solution of (3). Figure 1 illustrates the process of extracting patches from a volume using the operator R_j^{p,s}, processing them with a neural network u_θ and repositioning them at the original position using the transposed operator (R_j^{p,s})^T. The example is shown for a 2D cine MR image sequence.
Figure 1. Workflow for obtaining a CNN-prior by patch-based processing: First, the initial reconstruction is divided into patches, then the network u_θ is applied to all patches. Reassembling all processed patches results in the CNN-prior which is then used for the regularization of the inverse problem.
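The patch-wise application of u_θ and the weighted recomposition by W_{p,s} can be sketched as follows (a minimal 2D NumPy sketch; the function `u` stands in for the trained network u_θ, and the overlap weighting of W_{p,s} is realized by dividing by the per-pixel patch count):

```python
import numpy as np

def extract_patches(x, p, s):
    """Extract possibly overlapping patches of size p with strides s (2D case)."""
    px, py = p; sx, sy = s
    return [x[i:i + px, j:j + py]
            for i in range(0, x.shape[0] - px + 1, sx)
            for j in range(0, x.shape[1] - py + 1, sy)]

def patch_prior(x_ini, u, p, s):
    """Apply a patch-wise network u and reassemble with overlap weighting."""
    px, py = p; sx, sy = s
    acc = np.zeros_like(x_ini, dtype=float)   # sum of processed, repositioned patches
    cnt = np.zeros_like(x_ini, dtype=float)   # overlap counts (inverse of the weights)
    for i in range(0, x_ini.shape[0] - px + 1, sx):
        for j in range(0, x_ini.shape[1] - py + 1, sy):
            acc[i:i + px, j:j + py] += u(x_ini[i:i + px, j:j + py])
            cnt[i:i + px, j:j + py] += 1.0
    return acc / np.maximum(cnt, 1.0)         # weighted recomposition
```

With `u` the identity, the decomposition-recomposition is exact whenever the strides cover the image, which is precisely the consistency expressed by (5).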

Reconstruction algorithm
After having found the CNN-prior (9), as a final reconstruction step, the optimality condition for the minimization problem (3) is solved with an iterative method depending on the specific application. Minimizing this functional increases data-consistency compared to x_CNN since the discrepancy to the measured data is taken into account again. The solution of (3) is then the final CNN-based reconstruction of the proposed method. Algorithm 1 summarizes the proposed three-step reconstruction scheme.
Algorithm 1. Proposed reconstruction scheme.
Data: trained network u_θ, function f_θ, noisy or incomplete measured data y, regularization parameter λ > 0
1: compute the initial reconstruction x_ini = A†y
2: compute the CNN-prior x_CNN = f_θ(x_ini)
3: obtain x_REC by minimizing the Tikhonov functional (3)
Output: Return x_REC
Note that the regularizer R(x) = ∥x − x_CNN∥₂² is strongly convex. Therefore, if the discrepancy term D(Ax, y) is convex, then the Tikhonov functional (3) is strongly convex and can be efficiently minimized by most gradient-based iterative schemes, including Landweber's iteration and conjugate gradient-type methods. The specific strategy for minimizing (3) depends on the considered application. In the case of an ill-conditioned inverse problem with noisy measurements, it might be beneficial that (3) is only approximately minimized. For example, for the case of low-dose CT, early stopping of the Landweber iteration is applied as an additional regularization method due to the semi-convergence property of the Landweber iteration (Strand 1974). Such a strategy is used for the numerical results presented in section 4.
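For a linear operator A with quadratic discrepancy, step 3 above reduces to plain gradient descent (Landweber's iteration) on the Tikhonov functional; a minimal NumPy sketch, with A as an explicit matrix and a hand-chosen step size τ (both illustrative assumptions):

```python
import numpy as np

def landweber_tikhonov(A, y, x_cnn, lam, tau, n_iter):
    """Gradient descent on  0.5*||Ax - y||^2 + 0.5*lam*||x - x_cnn||^2."""
    x = x_cnn.copy()                                   # warm start at the CNN-prior
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) + lam * (x - x_cnn)   # gradient of the functional
        x = x - tau * grad
    return x
```

Since the functional is strongly convex, the iteration converges to the unique minimizer x = (AᵀA + λI)⁻¹(Aᵀy + λ x_CNN) for any sufficiently small step size.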

Convergence analysis
Another benefit of our approach is that minimization of the Tikhonov functional (3) corresponds to convex variational regularization with a quadratic regularizer. Therefore, one can use well-known stability, convergence and convergence rates results (Engl et al 1996, Scherzer et al 2009, Grasmair 2010, Li et al 2020). Consequently, as opposed to most existing neural network-based reconstruction algorithms, the proposed framework is built on a solid theoretical foundation for regularizing inverse problems. As an example of such results we have the following theorem.
Theorem 1. Let A : X → Y be linear, x_CNN ∈ X, y⁰ ∈ A(X), and y^δ ∈ Y satisfy ∥y^δ − y⁰∥ ≤ δ. Then the following hold:
1. For all δ, λ > 0, the quadratic Tikhonov functional
∥Ax − y^δ∥² + λ ∥x − x_CNN∥₂² (10)
has a unique minimizer x^{δ,λ}.
2. The exact data equation Ax = y⁰ has a unique x_CNN-minimizing solution x⁰ ∈ arg min{∥x − x_CNN∥₂ : Ax = y⁰}.
3. If λ = λ(δ) is chosen such that λ(δ) → 0 and δ²/λ(δ) → 0 as δ → 0, then x^{δ,λ(δ)} → x⁰.
Proof. The change of variables x̄ := x − x_CNN, ȳ^δ := y^δ − A x_CNN reduces (10) to standard Tikhonov regularization ∥A x̄ − ȳ^δ∥² + λ∥x̄∥₂² → min for the inverse problem A x̄ = ȳ⁰. Therefore, items (1.)-(3.) follow from standard results that can be found for example in (Engl et al 1996, section 5).
Theorem 1 also holds in the infinite-dimensional setting (Engl et al 1996, section 5), reflecting the stability of the proposed CNN regularization. Similar results hold for non-linear problems and general discrepancy measures (Grasmair 2010). Moreover, one can derive quantitative error estimates similar to Scherzer et al (2009), Grasmair (2010), Li et al (2020). Such theoretical investigations, however, are beyond the scope of this paper.
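For completeness, the change of variables used in the proof of Theorem 1 can be written out explicitly; since A is linear, the shifted functional coincides term by term with the standard Tikhonov functional:

```latex
\bar{x} := x - x_{\mathrm{CNN}}, \qquad
\bar{y}^{\delta} := y^{\delta} - A\, x_{\mathrm{CNN}},
\\[4pt]
\|Ax - y^{\delta}\|^{2} + \lambda \|x - x_{\mathrm{CNN}}\|_{2}^{2}
  = \|A(\bar{x} + x_{\mathrm{CNN}}) - y^{\delta}\|^{2} + \lambda \|\bar{x}\|_{2}^{2}
  = \|A\bar{x} - \bar{y}^{\delta}\|^{2} + \lambda \|\bar{x}\|_{2}^{2}.
```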

Experiments
In the following, we evaluate our proposed method on two different examples of large-scale inverse problems given by 3D low-dose CT and 2D undersampled radial cine MRI. We compare our proposed method to the well-known TV-minimization-based and dictionary learning-based approaches presented in Block et al (2007), Wang and Ying (2014) and Tian et al (2011), Xu et al (2012), which we abbreviate by TV and DIC, respectively. Further details about the comparison methods are discussed later in the paper.

2D radial cine MRI
Here we applied our method to image reconstruction in undersampled 2D radial cine MRI. Typically, MRI is performed using multiple receiver coils and therefore, the inverse problem is given by

E_I x = y_I, (11)

where x ∈ C^N with N = N_x · N_y · N_t is an unknown complex-valued image sequence. The encoding operator E_I applies, for each coil, the coil-sensitivity map, the frame-wise 2D Fourier transform and the sampling mask, i.e.

(E_I x)_i = S_I F C_i x for i = 1, …, n_c. (12)

Here, C_i denotes the ith coil-sensitivity map, n_c is the number of coil-sensitivity maps, F the 2D frame-wise Fourier operator and S_I with I ⊂ J = {1, …, N_rad}, |I| := m ≤ N_rad, a binary mask which models the undersampling process of the N_rad Fourier coefficients sampled on a radial grid. The vector y_I ∈ C^M with M = m · n_c corresponds to the measured data. Here, we sampled the k-space data along radial trajectories chosen according to the golden-angle method (Winkelmann et al 2006). Note that problem (11) is ill-posed mainly not due to the presence of noise in the acquisition, but because the data acquisition is accelerated and hence only a fraction of the required measurements is acquired. If we assume a radial data-acquisition grid, problem (11) is a large-scale inverse problem mainly for two reasons. First, the measurement vector y_I corresponds to n_c copies of the Fourier-encoded image data multiplied by the corresponding coil-sensitivity map. Second, the adjoint operator E_I^H consists of two computationally demanding steps. The radially acquired k-space data is first properly re-weighted and interpolated to a Cartesian grid, for example by using Kaiser-Bessel functions (Rasche et al 1999). Then, a 2D inverse Fourier operation is applied to the image of each cardiac phase and the final image sequence is obtained by weighting the images from each estimated coil-sensitivity map and combining them to a single image sequence. We refer to the reconstruction obtained by x_I = E_I^H y_I as the non-uniform fast Fourier-transform (NUFFT) reconstruction.
Therefore, in radial multi-coil MRI, the measured k-space data is high-dimensional and the application of the encoding operators E_I and E_I^H is furthermore more computationally demanding than for sampling on a Cartesian grid, see e.g. Smith et al (2019). This makes the construction of cascaded networks which also process the k-space data (Han et al 2019) or repeatedly employ the forward and adjoint operators (Schlemper et al 2018, Qin et al 2018) computationally challenging. Therefore, the separation of the regularization given by the CNNs from the data-consistency step is necessary in this case for computational reasons.
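To make the structure of E_I and its adjoint concrete, the following NumPy sketch implements a multi-coil encoding operator; for brevity, the radial sampling and NUFFT interpolation are replaced by a binary Cartesian mask, so this is an illustrative stand-in rather than the operator used in the experiments:

```python
import numpy as np

def apply_E(x, coil_maps, mask):
    """Multi-coil encoding: (E x)_i = S F C_i x, with a Cartesian-mask stand-in for S_I."""
    return np.stack([mask * np.fft.fft2(c * x, norm="ortho") for c in coil_maps])

def apply_EH(y, coil_maps, mask):
    """Adjoint operator: sum over coils of C_i^H F^H S^H y_i."""
    return sum(np.conj(c) * np.fft.ifft2(mask * yi, norm="ortho")
               for c, yi in zip(coil_maps, y))
```

With the orthonormal FFT, `apply_EH` is the exact adjoint of `apply_E`, i.e. ⟨E x, y⟩ = ⟨x, E^H y⟩, which is the property the iterative schemes below rely on.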
As proposed in section 2, we solve a regularized version of problem (11) by minimizing

∥E_I x − y_I∥₂² + λ ∥x − x_CNN∥₂² → min, (15)

where x_CNN is obtained a-priori by using an already trained network. For this example, for obtaining the CNN-prior x_CNN, we adopted the XT,YT approach presented in Kofler et al (2019), which has been reported to achieve results comparable to the 3D U-net (Hauptmann et al 2019). Using the XT,YT approach has the main advantage of only using 2D convolutional layers while still exploiting the spatio-temporal information of the cine MR images. The XT,YT approach works by first extracting all x,t- and y,t-slices from the image, subsequently processing them with a modified version of the 2D U-net and then properly reassembling them to obtain an estimate of the artefact-free image. Since the network only has to learn the manifold of the x,t- and y,t-spatio-temporal slices, the approach was shown to be applicable even when only little training data is available. Since the XT,YT method was previously introduced to only process real-valued data (i.e. the magnitude images), we followed a similar strategy by processing the real and imaginary parts of the image sequences separately but using the same real-valued network u_θ. This further increases the amount of training data by a factor of two. More precisely, let R_j^{xt} and R_j^{yt} denote the operators which extract the jth two-dimensional spatio-temporal slice in xt- and yt-direction from a 3D volume, and (R_j^{xt})^T and (R_j^{yt})^T their respective transposed operations which reposition the spatio-temporal slices at their original position.
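The slice extraction and repositioning operators R_j^{xt}, R_j^{yt} and their transposes can be sketched as follows (NumPy; averaging the contributions of the two directions is one simple recombination choice, assumed here for illustration):

```python
import numpy as np

def xtyt_slices(x):
    """Extract all x,t- and y,t-slices from an (Nx, Ny, Nt) image sequence."""
    xt = [x[:, j, :] for j in range(x.shape[1])]   # R^{xt}_j x
    yt = [x[i, :, :] for i in range(x.shape[0])]   # R^{yt}_j x
    return xt, yt

def reassemble_xtyt(xt, yt, shape):
    """Reposition the slices ((R)^T operators) and average the two directions."""
    out = np.zeros(shape)
    for j, s in enumerate(xt):
        out[:, j, :] += s
    for i, s in enumerate(yt):
        out[i, :, :] += s
    return out / 2.0   # every voxel is covered exactly once per direction
```

Applying a 2D network to each slice between extraction and reassembly yields the XT,YT estimate of the artefact-free image sequence.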
By u_θ we denote a 2D U-net as the one described in Kofler et al (2019) which is trained on spatio-temporal slices, i.e. on a dataset of pairs which consist of the spatio-temporal slices in xt- and yt-direction of both the real and imaginary parts of the complex-valued images. The network u_θ was trained to minimize the L²-error between the ground truth image and the estimated output of the CNN. Our dataset consists of radially acquired 2D cine MR images from n = 19 subjects (15 healthy volunteers and 4 patients with known cardiac dysfunction) with 30 images covering the cardiac cycle. The ground truth images were obtained by kt-SENSE reconstruction using N_θ = 3400 radial lines. We retrospectively generated the radial k-space data y_I by sampling the k-space data along N_θ = 1130 radial spokes using n_c = 12 coils. Note that sampling with N_θ = 3400 already corresponds to an acceleration factor of approximately 3 and therefore, N_θ = 1130 corresponds to an accelerated data-acquisition by an approximate factor of 9. The forward and the adjoint operators E_I and E_I^H were implemented using the ODL library. The complex-valued CNN-regularized image sequence x_CNN was obtained by applying f_θ to the NUFFT reconstruction, i.e.

x_CNN = f_θ(x_I). (16)
Given x_CNN, functional (15) was minimized by setting its derivative with respect to x to zero and applying the pre-conditioned conjugate gradient (PCG) method to iteratively solve the resulting system. PCG was used to solve the system Hx = b with

H = E_I^H E_I + λ Id, b = E_I^H y_I + λ x_CNN. (17)

Since the XT,YT method gives access to a large number of training samples, training the network u_θ for 12 epochs was sufficient. The CNN was trained by minimizing the L²-norm of the error between labels and outputs using the Adam optimizer (Kingma and Ba 2014). We split our dataset into 12/3/4 subjects for training, validation and testing and performed a 4-fold cross-validation. For the experiment, we performed n_iter = 16 subsequent iterations of PCG and empirically set λ = 0.1. Note that due to strong convexity, (15) has a unique minimizer and solving system (17) yields the desired minimizer. The obtained results can be found in subsection 4.1.
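A minimal (unpreconditioned) conjugate gradient sketch for a system of the form (17); `apply_H` would implement x ↦ E_I^H E_I x + λx without ever forming the matrix (NumPy, illustrative only):

```python
import numpy as np

def cg(apply_H, b, n_iter=16, tol=1e-12, x0=None):
    """Basic conjugate gradient for Hx = b with H symmetric positive definite."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - apply_H(x)          # initial residual
    p = r.copy()                # initial search direction
    rs = np.vdot(r, r)
    for _ in range(n_iter):
        Hp = apply_H(p)
        alpha = rs / np.vdot(p, Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = np.vdot(r, r)
        if np.sqrt(abs(rs_new)) < tol:   # early stop once the residual vanishes
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Because H = E_I^H E_I + λ Id is positive definite for λ > 0, the iteration is guaranteed to converge; warm-starting with `x0 = x_cnn` is a natural choice here.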

3D low-dose computed tomography
The current generation of CT scanners performs the data-acquisition by emitting x-rays along trajectories in the form of a cone-beam for each angular position of the scanner. Therefore, for each angle ϕ of the rotation, one obtains an x-ray image which is measured by the detector array and thus, the complete sinogram data can be identified with a 3D array of shape (N_ϕ, N_rx, N_ry). Here, N_ϕ corresponds to the number of angles the rotation of the scanner is discretized by, and N_rx and N_ry denote the number of elements of the detector array. The values of these parameters vary from scanner to scanner but are in the order of N_ϕ ≈ 1000 for a full rotation of the scanner and N_rx × N_ry ≈ 320 × 800 for a 320-row detector array, which is for example used for cardiac CT scans (Dewey et al 2009). The volumes obtained from the reconstructions are typically given by an in-plane number of pixels of N_x × N_y = 512 × 512 and a varying number of slices N_z, depending on the specific application. For this example, we consider a similar set-up as in Adler and Öktem (2017). The non-linear problem is given by

y_η = p e^{−μ R(x)} + η, (18)

where p denotes the average number of photons per pixel, μ is the linear attenuation coefficient of water, R corresponds to the discretized version of a ray-transform with cone-beam geometry and the vector η denotes the Poisson-distributed noise in the measurements. Following our approach, we are interested in solving

D_KL(p e^{−μ R(x)}, y_η) + λ ∥x − x_CNN∥₂² → min, (19)

where D_KL denotes the Kullback-Leibler divergence which corresponds to the log-likelihood function for Poisson-distributed noise. According to the previously introduced notation, the prior x_CNN is given by x_CNN = f_θ(x_η), where f_θ denotes a CNN-based processing method with trainable parameters θ and x_η = R†(−μ⁻¹ ln(p⁻¹ y_η)), with R† being the filtered back-projection (FBP) reconstruction.
Since our object of interest x is a volume, it is intuitive to choose an NN which involves 3D convolutions in order to learn the filters by exploiting the spatial correlation of adjacent voxels in x-, y- and z-direction. In this particular case, u_θ denotes a 3D U-net similar to the one presented in Hauptmann et al (2019). Due to the large dimensionality of the volumes x, the network u_θ cannot be applied to the whole volume. Instead, following our approach, the volume was divided into patches to which the network u_θ is applied. Therefore, the output x_CNN was obtained as described in (9), where u_θ operates on 3D patches given by the vector p = (128, 128, 16), which denotes the maximal size of 3D patches which we were able to process with a 3D U-net. The strides used for the extraction and the reassembling of the volumes in (9) were empirically chosen to be s = (16, 16, 8).
Training of the network u_θ was performed on a dataset of pairs according to (7), where we retrospectively generated the measurements y_η by simulating a low-dose scan on the ground truth volumes. For the experiment, we used 16 CT volumes from the randomized DISCHARGE trial (Napp et al 2017) which we cropped to a fixed size of 512 × 512 × 128. The simulation of the low-dose scan was performed as described in Adler and Öktem (2017) by setting p = 10 000 and μ = 0.02. The operator R is assumed to perform N_ϕ = 1000 projections which are measured by a detector array of shape N_rx × N_ry = 320 × 800. For the implementation of the operators, we used the ODL library. The source-to-axis and source-to-detector distances were chosen according to the DICOM files. Since the dataset is relatively small, we performed a 7-fold cross-validation where for each fold we split the dataset into 12 patients for training, 2 for validation and 2 for testing. The number of training samples N_train results from the number of patches times the number of volumes contained in the training set. We trained the network u_θ for 115 epochs by minimizing the L²-norm of the error between labels and outputs. For training, we used the Adam optimizer (Kingma and Ba 2014). With the described configuration of p and s, the resulting number of patches to be processed in order to obtain the prior x_CNN is given by N_{p,s} = 9375. In this example, the solution x_REC of problem (19) was then obtained by performing n_iter = 4 iterations of Landweber's method, where we further used the filtered back-projection R† as a left-preconditioner to accelerate the convergence of the scheme. For the derivation of the gradient of (19) with respect to x, we refer to Adler and Öktem (2017). The regularization parameter was empirically set to λ = 1. The results can be found in subsection 4.2.
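The value N_{p,s} = 9375 follows directly from the patch and stride configuration; a quick check using the standard sliding-window count ⌊(n − p)/s⌋ + 1 per axis:

```python
def num_patches(shape, p, s):
    """Total number of patches: prod over axes of floor((n - p) / s) + 1."""
    n_total = 1
    for n, pi, si in zip(shape, p, s):
        n_total *= (n - pi) // si + 1
    return n_total

# Volume 512 x 512 x 128 with p = (128, 128, 16) and s = (16, 16, 8):
# (512-128)//16 + 1 = 25 per in-plane axis and (128-16)//8 + 1 = 15 in z,
# giving 25 * 25 * 15 = 9375 patches, matching N_{p,s} in the text.
```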

Reference methods
Here we discuss the methods of comparison in more detail and report the times needed to process and reconstruct the images or volumes. The data-discrepancy term D(·, ·) was again chosen according to the considered examples as previously discussed. The TV-minimization approach used for comparison is given by solving

D(Ax, y) + λ ∥Gx∥₁ → min, (20)

where G denotes the discretized version of the isotropic first-order finite-differences filter in all three dimensions, i.e. in x-, y- and t-direction for the MR example and in x-, y- and z-direction for the CT example. The solution of problem (20) was obtained by introducing an auxiliary variable z and alternating between solving for x and z. One of the resulting sub-problems was solved with an iterative shrinkage method, see Chambolle (2005) for more details. The second sub-problem was solved by iteratively solving a system of linear equations, either by Landweber's method for the CT example or by PCG for the MRI example, as mentioned before.
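A sketch of the two ingredients of problem (20): the finite-differences operator G and the isotropic shrinkage (soft-thresholding) step for the z-subproblem (NumPy; boundary handling and the exact splitting are illustrative conventions, not necessarily those of the cited implementation):

```python
import numpy as np

def grad3d(x):
    """Forward finite differences along the three dimensions (operator G)."""
    return np.stack([np.diff(x, axis=a, append=x.take([-1], axis=a))
                     for a in range(3)])          # replicate boundary -> zero diff

def shrink(z, t):
    """Isotropic soft-thresholding of a difference field by its pointwise magnitude."""
    mag = np.sqrt((z ** 2).sum(axis=0, keepdims=True))
    return z * np.maximum(mag - t, 0.0) / np.maximum(mag, 1e-12)
```

`shrink` shortens each 3-vector of differences by the threshold t (setting it to zero if its magnitude is below t), which is the proximal map of the isotropic ℓ₁-norm appearing in (20).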
The dictionary learning-based method used for comparison is given by the solution of the problem

D(Ax, y) + λ ∥x − x_DIC∥₂² → min, (21)

where, in contrast to our proposed method, x_DIC was obtained by the patch-wise sparse approximation of the initial image estimate using an already trained dictionary D. Therefore, using a similar notation as in (9), the prior x_DIC is given by

x_DIC = W_{p,s} Σ_j (R_j^{p,s})^T D γ_j, (22)

where the dictionary D was previously trained by 15 iterations of the iterative thresholding and K residual means algorithm (ITKRM) (Schnass 2018) on a set of ground truth images which were given by the high-dose images for the CT example and the kt-SENSE reconstructions from N_θ = 3400 radial lines for the MRI example. Here again, the dictionary D operates on three-dimensional patches, (x, y, z)-patches for the CT example and (x, y, t)-patches for the MR example. Note that for each fold, for training the dictionary D, we only used the data which we included in the training set for our method. This means we trained a total of seven dictionaries for the CT example and four dictionaries for the MRI example. For each iteration of ITKRM, we randomly selected a subject to extract 10 000 3D training patches. The corresponding sparse codes γ_j were then obtained by solving

γ_j = arg min_γ ∥R_j^{p,s} x_ini − D γ∥₂² subject to ∥γ∥₀ ≤ S, (23)

which is a sparse coding problem and was solved using orthogonal matching pursuit (OMP) (Tropp and Gilbert 2007). Thereby, the image x_ini corresponds to either the FBP-reconstruction x_η for the CT example or to the NUFFT-reconstruction x_I for the MRI example. In both cases, we used patches of shape given by p = (4, 4, 4) and strides given by s = (2, 2, 2). The number of atoms K and the sparsity level were set to K = 4 · d, with d = 4 · 4 · 4, and S = 16. Note that, in contrast to Xu et al (2012) and Wang et al (2004), Caballero et al (2014), the dictionary and the sparse codes were not learned during the reconstruction, as the sparse coding step for all patches would be too time consuming for very large-scale inverse problems, such as the CT example.
Instead, the dictionary and the sparse codes were used to generate the prior x_DIC, which also makes the method more similar and comparable to ours. The parameter λ is set as previously stated in the manuscript, depending on the considered example.
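The greedy OMP solver for the sparse coding problem (23) can be sketched in a few lines (NumPy; a textbook version without the optimizations of practical OMP implementations):

```python
import numpy as np

def omp(D, b, S):
    """Orthogonal matching pursuit: greedy S-sparse code gamma with D @ gamma ≈ b."""
    idx = []
    gamma = np.zeros(D.shape[1])
    r = b.astype(float).copy()
    for _ in range(S):
        idx.append(int(np.argmax(np.abs(D.T @ r))))        # most correlated atom
        sub = D[:, idx]
        coef, *_ = np.linalg.lstsq(sub, b, rcond=None)     # re-fit on the support
        r = b - sub @ coef                                  # update residual
    gamma[idx] = coef
    return gamma
```

Note that this loop runs once per patch, which is exactly why the DIC prior is so much more expensive to compute than a single forward pass of u_θ.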

Quantitative measures
For the evaluation of the reconstructions we report the normalized root mean squared error (NRMSE) and the peak signal-to-noise ratio (PSNR) as error-based measures and the structural similarity index measure (SSIM) (Wang et al 2004) and the Haar wavelet-based perceptual similarity index (HPSI) as image-similarity-based measures.

Figure 2 shows an example of the results obtained with our proposed method. Figure 2(A) shows the initial NUFFT-reconstruction x_I obtained from the undersampled k-space data y_I. The CNN-prior x_CNN obtained by the XT,YT network can be seen in figure 2(B) and shows a strong reduction of undersampling artefacts but also blurring of small structures, as indicated by the yellow arrows. The CNN-prior x_CNN is then used as a prior in functional (15) which is subsequently minimized in order to obtain the solution x_REC which can be seen in figure 2(C). Figure 2(D) shows the kt-SENSE reconstruction from the complete sampling pattern using N_θ = 3400 radial spokes for the acquisition. From the point-wise error images, we clearly see that the NRMSE is further reduced by the subsequent iterations minimizing the CNN-prior-regularized functional. Further, fine details are recovered, as can be seen from the yellow arrows in figure 2(C).

Figure 3 shows a comparison of all different reported methods. As can be seen from the point-wise error in figure 3(B), the TV-minimization method (Block et al 2007) was able to eliminate some artefacts, but to a lesser extent than the learning-based methods.

Figure 4 shows all the intermediate results obtained with the proposed method. Figure 4(A) shows the initial FBP-reconstruction which is contaminated by noise. The FBP-reconstruction was then processed using the function f_θ described in (9) to obtain the prior x_CNN which can be seen in figure 4(B). From the point-wise error, we see that patch-wise post-processing with the 3D U-net removed a large portion of the noise resulting from the low-dose acquisition. Solving problem (19) increases data-consistency since we make use of the measured data y_η.
Note that in contrast to the previous example of undersampled radial MRI, the minimization of the functional increased data-consistency of the solution but also re-introduced some noise into the solution, since the measured data is noisy due to the simulated low-dose scan protocol. The solution of (19) accordingly showed a slight deterioration in PSNR and NRMSE, which is related to the use of the noisy measured data. However, fine diagnostic details such as the coronary arteries are still visible in the prior x_CNN and in the solution x_REC, as indicated by the yellow arrows. SSIM slightly increased while HPSI stayed approximately the same.

Figure 5 shows a comparison of images obtained by the different reconstruction methods. In figure 5(A), we see again the FBP-reconstruction obtained from the noisy data. Figure 5(B) shows the result obtained by the TV-minimization method, which removed some of the noise, as can be seen from the point-wise error image. The result obtained by the DIC method can be seen in figure 5(C); it further reduced image noise compared to the TV method and surpasses TV with respect to the reported statistics, as can be seen in table 2. Finally, figure 5(D) shows the solution x_REC obtained with our proposed scheme and figure 5(E) shows the ground truth image. The reconstruction using the CNN output as a prior further increased the SSIM of the final result and is visually more similar to the ground truth image. The NRMSE and PSNR, on the other hand, slightly deteriorated due to the use of the noisy measured data. HPSI remained approximately the same, as can be seen from table 2.

Table 3 summarizes the times for the different components of the reconstructions using all different approaches for both examples. The abbreviations 'SHRINK' and 'LS1' stand for 'shrinkage' and 'linear system - one iteration' and denote the times needed to apply the iterative shrinkage method for the TV approach and to perform one iteration of the iterative schemes used to solve the linear sub-problems, respectively.
Obviously, in terms of achieved image quality, the advantage of the DIC- and the CNN-based Tikhonov regularization is that they provide stronger priors, which allow a smaller number of iterations to be used to regularize the solution. The advantage of our proposed approach compared to the dictionary learning-based one is the highly reduced time needed to compute the prior used for regularization. The reason lies in the fact that the DIC-based method requires solving problem (23) to obtain the prior x DIC, while in our method a CNN is used to obtain the prior x CNN. Since problem (23) is separable, OMP is applied to each image/volume patch, which is prohibitive as the number of overlapping patches in a 3D volume is of the order of O(N x · N y · N z ) or O(N x · N y · N t ), respectively. Obtaining x CNN, on the other hand, does not involve the solution of any minimization problem but only requires the application of the network u θ to the different patches. As this corresponds to matrix-vector multiplications with sparse matrices, its computational cost is lower and the calculations are further highly accelerated by performing them on a GPU.

Discussion
The proposed three-step reconstruction scheme provides a general framework for solving large-scale inverse problems. The method is motivated by the observations stated in the ablation study (Kofler et al 2018), where the performance of cascades of CNNs with different numbers of intercepting data-consistency layers but an approximately fixed number of trainable parameters was studied. First, it was noted that the replacement of simple blocks of convolutional layers by multi-scale CNNs given by U-nets had a visually positive impact on the obtained results. Further, it was empirically shown that the results obtained by cascades of U-nets of different lengths but with approximately the same number of trainable parameters were all visually and quantitatively comparable in terms of all reported measures. This suggests that, for large-scale problems where the construction of cascaded networks might be infeasible, investing the same computational effort and expressive power (in terms of the number of trainable parameters) in one single network might be similarly beneficial to intercepting several smaller sub-networks by data-consistency layers, as for example in Schlemper et al (2018) and Qin et al (2018).
Due to the large sizes of the considered objects of interest, the prior x CNN is obtained by processing patches of the images. Training the network on patches or slices of the images further has the advantage of reducing the computational overhead while naturally enlarging the available training data, making it possible to successfully train neural networks even with datasets from a relatively small number of subjects. Further, as demonstrated in Kofler et al (2019) for the case of 2D radial MRI, one can also exploit the low topological complexity of 2D spatio-temporal slices for training the network u θ. This makes it possible to reduce the network complexity by using 2D instead of 3D convolutional layers while still exploiting spatio-temporal correlations, and therefore to prevent overfitting. Note that the network architectures we consider are CNNs and, since they mainly consist of convolutional and max-pooling layers, we can expect the networks to be translation-equivariant; therefore, patch-related artefacts arising from the re-composition of the processed overlapping patches are unlikely to occur in the CNN-prior.
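The patch-wise application with overlapping patches and averaging on re-composition can be sketched as follows; this is a simplified 2D illustration with hypothetical helper names, assuming the patch size and stride divide the image evenly, and a generic callable `f` in place of the trained network:

```python
import numpy as np

def apply_patchwise(f, x, patch=32, stride=16):
    """Apply a patch-based model f to a 2D image x by processing
    overlapping patches and averaging the overlaps on re-composition."""
    H, W = x.shape
    out = np.zeros_like(x, dtype=float)
    weight = np.zeros_like(x, dtype=float)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            # accumulate the processed patch and count the contributions
            out[i:i + patch, j:j + patch] += f(x[i:i + patch, j:j + patch])
            weight[i:i + patch, j:j + patch] += 1.0
    # average overlapping contributions (guard against uncovered pixels)
    return out / np.maximum(weight, 1.0)
```

Because the averaging weights are the same for every patch position in the interior, a translation-equivariant model produces no visible seams at the patch boundaries, which matches the observation above.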
We have tested and evaluated our method on two examples of large-scale inverse problems given by 2D undersampled radial MRI and 3D low-dose CT. For both examples, our method outperformed the TV-minimization method and the dictionary learning-based method with respect to all reported quantitative measures. For the case of 2D undersampled radial cine MRI, using the CNN-prior as a regularizer in the subsequent iterative reconstruction increased the achieved image quality with respect to all reported measures, as can be seen in table 1. For the CT example, due to the inherent presence of noise in the measured data, the quantitative measures of the final reconstruction are only similar to the ones obtained by post-processing the FBP-reconstruction. However, performing a few iterations to minimize functional (19) increased the data-consistency of the obtained solution, resulted in a slight re-enhancement of the edges and gave the CT images back their characteristic texture. Future work to qualitatively assess the achieved image quality with respect to clinically relevant features, e.g. the visibility of coronary arteries for the assessment of coronary artery disease in cardiac CT, is already planned.
Using the CNN for obtaining a learning-based prior is faster by several orders of magnitude compared to the dictionary learning-based approach. This is because obtaining the prior with a CNN reduces to a forward pass of all patches, i.e. to multiplications of vectors with sparse matrices, whereas the sparse coding of all patches involves the solution of an optimization problem for each patch. Further, the time needed for OMP depends on the sparsity level and the number of atoms of the dictionary, see Sturm and Christensen (2012). In our comparison, for the 2D radial MRI example, the total reconstruction times of our proposed method and the DIC-based regularization method mainly differ in the step of obtaining the priors x DIC and x CNN. Note that, in contrast to Wang and Ying (2014) and Caballero et al (2014), in our comparison the prior x DIC was only calculated once. In the original works, however, the proposed reconstruction algorithms use an alternating direction method of multipliers (ADMM) which alternates between first training the dictionary D and sparse coding with OMP and then updating the image estimate. Therefore, the realistic time needed to reconstruct the 2D cine MR images according to Wang and Ying (2014) and Caballero et al (2014) is given by the product of the seven minutes needed for one sparse approximation and the number of iterations in the ADMM algorithm, plus the total time used for PCG for solving the obtained linear systems. Note that for the 3D low-dose CT example, even one patch-wise sparse approximation of the whole volume already takes about one hour and therefore applying an ADMM-type reconstruction method is computationally prohibitive.
Also, note that, even if the size of the image sequences for the MRI example is smaller than that of the 3D CT volumes, the reconstruction of the 2D cine MR images takes relatively long compared to the CT example, due to the fact that we use two different iterative methods (Landweber and PCG) for two different systems with different operators. Further, the number of iterations for the CT example is intentionally smaller than for the MR example, as the measurement data is noisy and early stopping of the iteration can already be thought of as a proper regularization method, see for example Strand (1974). Also, the operators used for the CT example were implemented using the operators provided by the ODL library and are therefore optimized for performing calculations on the GPU. For the MRI example, on the other hand, we used our own implementation of a radial encoding operator E, which could be further improved and accelerated.
Clearly, one difficulty of the proposed method is shared by all iterative reconstruction schemes with regularization: the need to choose the hyper-parameter λ, which balances the contribution of the regularization and the data-fidelity term and can highly affect the achieved image quality, especially when the data is contaminated by noise. In cascaded networks, on the other hand, the parameter λ can also be learned during training. Further, other hyper-parameters, such as the number of iterations used to minimize the Tikhonov functional, have to be chosen as well. In this work, we chose λ empirically, but point out that an exhaustive parameter search might yield superior results.
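An exhaustive search for λ on a validation case could be sketched as follows; this is an illustrative toy setup, not the paper's procedure, assuming a small dense operator `A` so that the Tikhonov normal equations (A^T A + λ I) x = A^T y + λ x CNN can be solved directly and the NRMSE against a known reference can be evaluated for each candidate λ:

```python
import numpy as np

def nrmse(x, ref):
    """Normalized root mean squared error with respect to a reference."""
    return np.linalg.norm(x - ref) / np.linalg.norm(ref)

def pick_lambda(A, y, x_cnn, x_ref, lams):
    """Grid-search the regularization weight on a validation case by
    solving (A^T A + lam I) x = A^T y + lam * x_cnn for each candidate."""
    AtA, Aty = A.T @ A, A.T @ y
    n = A.shape[1]
    best_lam, best_err = None, np.inf
    for lam in lams:
        x = np.linalg.solve(AtA + lam * np.eye(n), Aty + lam * x_cnn)
        err = nrmse(x, x_ref)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

In practice the optimal λ depends on the noise level of the measured data, so a value tuned on noiseless validation data would need re-tuning for the low-dose setting.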
A limitation of the presented work is that the applied CNNs were task-specific and trained for only one undersampling factor in the MR example and one dose-reduction level in the CT example. However, note that this issue can be overcome by, for example, including a greater variety of samples in the training data. Further, more sophisticated approaches could involve regularization of the CNN based on generative adversarial networks (GANs) (Rick Chang et al 2017). In Rick Chang et al (2017), it was reported that it is possible to train a single CNN to solve arbitrary inverse problems. Using a similar training strategy for the CNN and combining it with our approach could, in particular, overcome the problem arising from the use of different acceleration factors in MR or dose-reduction levels in CT.
The proposed method is related to the ones presented in Schlemper et al (2018), Qin et al (2018) and Kofler et al (2018) in the sense that there, steps 2 and 3 of Algorithm 1 are iterated in a cascaded network which represents the different iterations. However, in Schlemper et al (2018) and Qin et al (2018), the encoding operator is given by a Fourier transform sampled on a Cartesian grid and is therefore an isometry. Thus, assuming a single-coil data-acquisition, given x CNN, problem (3) has a closed-form solution which is also fast and cheap to compute, since it corresponds to performing a linear combination of the acquired k-space data and the one estimated from the CNN outputs and subsequently applying the inverse Fourier transform. In the case where the operator A is not an isometry, one usually needs either to solve a system of linear equations in order to obtain a solution which matches the measured data or, alternatively, to rely on another formulation of the functional (3) which is suitable for more general, also non-orthogonal, operators (Kofler et al 2018). However, if the operator A and its adjoint A H are computationally demanding to apply, as in the case of radial multi-coil MRI, or if the objects of interest are high-dimensional, e.g. 3D volumes in low-dose CT, the construction of cascaded or iterative networks is prohibitive with nowadays available hardware. In contrast, in the proposed approach, since the regularization is separated from the data-consistency step, large-scale problems can be tackled as well. Further, efficient solutions for making the CNN applicable to images with different noise levels in low-dose CT have been proposed and compared to commercial algorithms for low-dose CT image reconstruction. By employing a CNN which can successfully remove different levels of noise or artefacts from images, the procedure proposed in Algorithm 1 can also be iterated.
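The closed-form solution for the Cartesian single-coil case mentioned above can be made explicit: since a masked unitary Fourier transform is an isometry, the minimizer of ||M F x - y||^2 + λ ||x - x CNN||^2 is, in k-space, a convex combination of the measured data and the transform of the CNN output at the sampled locations, and the CNN k-space elsewhere. A minimal sketch, with a hypothetical function name and `y` stored as a zero-filled full-size k-space array:

```python
import numpy as np

def dc_closed_form(x_cnn, y, mask, lam):
    """Closed-form data-consistency for A = M * F (sampled unitary FFT).

    Minimizes ||M F x - y||^2 + lam * ||x - x_cnn||^2. In k-space the
    solution is (y + lam * F x_cnn) / (1 + lam) at sampled locations
    and F x_cnn at unsampled ones; one inverse FFT recovers the image.
    """
    k_cnn = np.fft.fft2(x_cnn, norm="ortho")
    k = np.where(mask, (y + lam * k_cnn) / (1.0 + lam), k_cnn)
    return np.fft.ifft2(k, norm="ortho")
```

The two limiting cases make the behaviour transparent: for λ → 0 the sampled k-space entries are replaced by the measured data (hard data-consistency), while for λ → ∞ the CNN output is returned unchanged.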
By separating the application of the CNN, the regularization and the further iterations needed to obtain the reconstruction, one can also choose to employ more complex and sophisticated NNs to obtain the CNN-prior x CNN than is typically the case for cascaded or iterative networks. For example, in Schlemper et al (2018) or Adler and Öktem (2017), the CNNs were given by simple blocks of fully convolutional neural networks with residual connections. In contrast, in Kofler et al (2018), the CNNs were replaced by more sophisticated U-nets (Ronneberger et al 2015, Jin et al 2017). However, the examples in Kofler et al (2018), Adler and Öktem (2017, 2018) and Gupta et al (2018) all use two-dimensional CT geometries, which do not correspond to the ones used in clinical practice. Therefore, particularly for large-scale inverse problems where the construction of iterative networks is infeasible, our method represents a valid alternative for obtaining accurate reconstructions.
While in this work we used relatively simple neural network architectures based on a plain U-net as in Jin et al (2017), further focus could be put on the choice of the network u θ, for example by using more sophisticated approaches such as improved versions of the U-net (Han and Ye 2018) or generative adversarial networks, in order to obtain a more accurate prior to be used in the proposed reconstruction scheme.

Conclusion
We have presented a general framework for solving large-scale ill-posed inverse problems in medical image reconstruction. The strategy consists in strictly separating the application of the CNN, the regularization of the solution and the step needed to ensure data-consistency, by solving the problem in three stages. First, an initial guess of the solution is obtained by a direct reconstruction from the measured data. Second, the initial solution is processed patch-wise by a previously trained CNN in order to obtain a CNN image-prior, which is then used in a Tikhonov-regularized functional to obtain the final reconstruction in a third step. The strict separation of obtaining the CNN-prior and subsequently minimizing the Tikhonov functional makes it possible to tackle large-scale problems. For both shown examples of 2D undersampled radial MRI and 3D low-dose CT, the proposed method outperformed the total variation-minimization method and the dictionary learning-based approach with respect to all reported quantitative measures. Since the reconstruction scheme is a general one, we expect the proposed method to be successfully applicable to other imaging modalities as well.