Abstract
The reconstruction of images from their corresponding noisy Radon transform is a typical example of an ill-posed linear inverse problem as arising in computerized tomography (CT). As the (naïve) solution does not depend continuously on the measured data, regularization is needed to reestablish a continuous dependence. In this work, we investigate simple yet provably convergent approaches to learning linear regularization methods from data. More specifically, we analyze two approaches: a generic linear regularization that learns how to manipulate the singular values of the linear operator, extending our previous work, and a tailored approach in the Fourier domain that is specific to CT reconstruction. We prove that such approaches yield convergent regularization methods, and that the reconstructions they provide are typically much smoother than the training data they were trained on. Finally, we compare the spectral and Fourier-based approaches for CT reconstruction numerically, discuss their advantages and disadvantages, and investigate the effect of discretization errors at different resolutions.
1 Introduction
Linear inverse problems are at the heart of a variety of imaging applications, including restoration tasks such as image deblurring as well as the inference of unknown images from measurements that contain implicit information about them as, for instance, arising in computerized tomography (CT), positron emission tomography (PET), or magnetic resonance imaging (MRI). All of these problems are commonly modeled as the task of recovering an image \(u\) from measurements
for a linear operator \(A\) and noise \(\nu\) characterized by some error bound (noise level) \(\delta \in [0,\infty [\) to be specified later. A key challenge for most practically relevant problems is that \(A\!\!: X \rightarrow {Y}\) is a compact linear operator with infinite-dimensional range (assuming here that X and Y are infinite-dimensional Hilbert spaces), so that zero is an accumulation point of its singular values, which makes the pseudo-inverse \(A^\dagger\) discontinuous.
The compactness of A admits a singular value expansion, i.e., we can write the operator in the form
where the singular values \(\sigma _n > 0\) are non-increasing, and \(\{u_n\}_{n \in {\mathbb {N}}}\) and \(\{v_n\}_{n \in {\mathbb {N}}}\) form orthonormal bases of the orthogonal complement of the nullspace \({\mathcal {N}}(A)^\perp\) and of the closure of the range \(\overline{{\mathcal {R}}(A)}\), respectively. With a slight abuse of notation, \(\langle \cdot , \cdot \rangle\) denotes the inner product on both spaces X and Y. Classical linear regularization strategies therefore aim to approximate \(u\) with so-called spectral regularization operators of the form
for a suitable function \(g_\delta\) that remains bounded for all \(\delta >0\) but for which \(g_\delta (\sigma ) \rightarrow 1/\sigma\) as \(\delta \rightarrow 0\). With a suitable speed of this pointwise convergence, the continuous dependence on the data, i.e., \(R(f^\delta ;\;g_\delta ) \rightarrow A^\dagger f\), can be reestablished. An extensive overview of the requirements for this kind of convergence and of classical examples for \(g_\delta\), including Tikhonov, Lavrentiev, or truncated SVD regularization, can be found in [18].
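For concreteness, the three classical choices just mentioned can be sketched as filter functions of the singular value. This is an illustrative sketch (the coupling of the parameter `alpha` to the noise level \(\delta\) is omitted), not the paper's implementation:

```python
import numpy as np

def g_tikhonov(sigma, alpha):
    # Tikhonov: g(sigma) = sigma / (sigma^2 + alpha), bounded by 1/(2*sqrt(alpha))
    return sigma / (sigma ** 2 + alpha)

def g_tsvd(sigma, alpha):
    # Truncated SVD: invert only singular values above the threshold sqrt(alpha)
    return np.where(sigma >= np.sqrt(alpha), 1.0 / np.maximum(sigma, 1e-300), 0.0)

def g_lavrentiev(sigma, alpha):
    # Lavrentiev (for self-adjoint problems): g(sigma) = 1 / (sigma + alpha)
    return 1.0 / (sigma + alpha)
```

All three stay bounded for fixed `alpha > 0` and converge pointwise to \(1/\sigma\) as `alpha` tends to zero, which is exactly the behavior required of \(g_\delta\) above.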
In the specific case of the linear operator
being the Radon operator defined on functions on the whole space (which is not compact in contrast to integral operators on bounded domains), we have to work with a continuous spectrum. The most commonly used reconstruction technique for inverting the Radon transform is the filtered back-projection, which follows a very similar strategy to (3), but exploits the structure of the operator in a different way. With the central slice theorem, the inverse Radon transform applied to some range-element \(f\in {{\mathcal {R}}}(A)\) is given by
where \(A^*\) denotes the adjoint operator of \(A\) (also called back-projection operator), \({\hat{\rho }}(r) = |r |\) is called the ramp-filter, \({\mathcal {F}}\!_{\text { 1-D}}\) is the one-dimensional Fourier transform with respect to the spatial offset variable of the Radon transform, and r (in the definition of \({\hat{\rho }}\)) is the variable resulting from the Fourier transform.
In analogy to the regularized version (3), the most common classical way to ensure a stable reconstruction is to replace (4) by
with the filter \(\rho _\delta\) chosen to avoid the amplification of high-frequency components with large \(|r|\). Common choices include the Hamming and the ramp / Ram-Lak filters.
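In a discrete implementation these filters might look as follows; the frequency grid and the exact window convention are illustrative assumptions, as conventions vary between implementations:

```python
import numpy as np

def ramp_filter(n):
    # Ram-Lak / ramp filter: |r| on the discrete frequency grid of length n
    return np.abs(np.fft.fftfreq(n))

def hamming_filter(n):
    # Ramp damped by a Hamming window: equals |r| near r = 0 and suppresses
    # high frequencies (window value 0.08 at the Nyquist frequency)
    r = np.fft.fftfreq(n)
    return np.abs(r) * (0.54 + 0.46 * np.cos(2.0 * np.pi * r))
```

The Hamming variant lies below the pure ramp everywhere, which is precisely the damping of large \(|r|\) described above.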
In this work, we study (3) and (5) for functions \(g_\delta\) (respectively, \(\rho _\delta\)) that can be learned from data. In particular, we show that the shape of the optimal functions \(g_\delta\) and \(\rho _\delta\) can be characterized in closed form. We prove that the learned approaches result in convergent regularization methods under certain conditions, investigate their behavior under different discretizations, and conduct numerical experiments on CT reconstruction problems.
2 Related Work
Regularization methods for linear inverse problems in general, and manipulations of the singular values in particular, have long been studied in applied mathematics, cf. the classical reference [18] or the more recent overview [12]. Classical examples of (3) include Lavrentiev, Tikhonov, or truncated SVD regularization. Subsequently, a lot of research has focused on nonlinear regularization techniques, such as variational methods or (inverse) scale space flows. More recently, researchers have focused on learning reconstruction schemes through neural networks, which tend to show significantly stronger practical performance, but often lack theoretical guarantees, e.g., being convergent regularizations (independent of their discretization) with error bounds in suitable (problem-specific) metrics. We refer to the two overview papers [12, 18] for classical and nonlinear regularization theory and recall some machine-learning-specific regularization approaches below.
The simplest way to benefit from data-driven approaches is to use pre- or post-processing networks, i.e., parameterized functions \({\mathcal {G}}\) that are applied either to the data \(f^\delta\) before a classical reconstruction technique, or to a preliminary reconstruction like (3). Common architectures in the area of image reconstruction problems are simple convolutional neural networks (CNNs) or multiscale approaches such as the celebrated U-Net architecture [20, 33]. Direct reconstructions (with different types of problem-specific information being accounted for) can, for instance, be found in [19, 26, 41]. Natural extensions of pre- and post-processing networks use \(A\) and its adjoint to switch between the measurement and reconstruction spaces with intermediate learnable operations, in structures that are often motivated by classical iterative reconstruction/optimization methods such as gradient descent, forward-backward splitting, or primal-dual approaches, e.g., LEARN [15], FISTA-Net [38], or the learned primal-dual method [1]. Yet, without further restrictions, such methods do not permit proofs of error estimates or convergence results.
Coming from the perspective of regularization schemes based on variational methods, many works have proposed learning the regularizer, ranging from (sparsity-based) dictionary learning, e.g., [2, 27], over learning (convex and non-convex) regularizers motivated by sparsity penalties, e.g., [14, 21, 34], to schemes that merely learn descent directions [29], operators provably convergent to a global optimum [32] or playing the role of a proximal operator [28, 31], and reparametrizations of the unknown in unsupervised settings (e.g., the deep image prior [36]) or in a learned latent space of realistic reconstructions [13, 22]. In a similar fashion, deep equilibrium models [9] generalize the optimality conditions arising from learning regularizers (cf. [30]) and can, for instance, be combined with convex neural networks [5] for learning suitable regularizations. Yet, the above approaches are either not viewed in the light of infinite-dimensional problems, or are non-convex, such that standard regularization results do not apply, or at least require the ability to compute global minimizers (as in the analysis of [25]) or strict assumptions like the tangential cone condition [7]. For a recent overview of data-driven approaches in the context of inverse problems, we refer the reader to [6].
Even if properties like convexity of the regularizer allow one to deduce convergence and/or error estimates, the complex parametrization of the above methods makes it highly difficult to understand how a data-driven approach extracts information from the training data, which quantities matter, and what properties a learned regularization has. Therefore, our approach in this paper is to extend our previous work [11] by studying the simple linear (singular-value-based) reconstruction (3) as well as the application-specific learning of a filter in Radon inversion (5), in order to provide more insight into the properties of learned regularization techniques. While optimizing such spectral filters in a data-driven manner has been studied in the past (cf. [17]), to the best of our knowledge it has not been studied to what extent the obtained reconstruction operators fulfill the characteristics of classical regularization methods.
While a large number of different data-driven approaches have been tailored to CT reconstruction (cf. [8, 10, 16, 39] for particular examples or [24, 37, 40] for surveys), they largely follow the above categories in terms of theoretical guarantees and an understanding of the underlying learning process.
Our framework is close to learning the optimal Tikhonov regularizer, which was extended to the infinite-dimensional case and studied with a focus on generalization to unseen data in [3]. Although our approach is more restricted and requires knowledge of the forward operator and its singular value expansion, it allows for deriving reasonable assumptions that guarantee convergence in the zero-noise limit.
3 Supervised Learning of Spectral Regularizations
In this section, we derive a closed-form solution for \(g_\delta (\sigma _n)\) in (3) when the expectation of the squared norm difference between \(u\) and the regularization applied to data \(f^\delta\) is minimized. Here \(u\) and \(f^\delta {= Au+ \nu }\) come from a distribution of training data. We then analyze the corresponding regularization operator and show that the operator is a convergent linear regularization.
3.1 Optimally Learned Spectral Regularization
In this section, we study the learning of the function \(g_\delta\) in (3) from a theoretical perspective. First note that, due to the assumption of A being compact (and hence having a discrete spectrum), only the evaluations of \(g_\delta\) at \(\sigma _n\) matter, such that we will focus on the optimal values \(g_n:= g_\delta (\sigma _n)\) directly. For the sake of simplicity, we guarantee the well-definedness of \(g(\sigma _n) = g_n\) by assuming the singular values \(\sigma _n\) to be strictly decreasing, i.e., to have multiplicity \(\mu (\sigma _n) = 1\). Let us further assume that our data formation process (1) arises with noise drawn independently of \(u\) from a zero-mean noise distribution parameterized by the noise level \(\delta\).
For a fixed noise level \(\delta\), the most common way to learn parameters (i.e., the \(g_n\) in our case) is to minimize the expectation of a suitable loss, e.g., the squared norm, over the training distribution of pairs \((u, f^\delta )\) of the desired ground truth \(u\) and the noisy data \(f^\delta\). Thus, the ideal learned method is obtained by choosing
where \({\mathbb {E}}_{u,\nu }\) denotes the expected value with respect to the joint distribution of the ground truth data and the noise. With the spectral decomposition (3) we have
where \(u_0 \in {\mathcal {N}}(A)\) is the unique projection of u onto the nullspace of the operator. Due to the independence of \(u\) and \(\nu\) together with the zero mean of the noise we find
hence
with
where \({\mathbb {E}}_{u}\) and \({\mathbb {E}}_{\nu }\) denote the expected values with respect to the marginal distributions of the ground truth data, or the noise, respectively. Since the whole training only makes sense in the chosen spaces X and Y if the ground truth data u are indeed elements of X, we shall assume in the following without further notice that
The above problem can be minimized for each \(g_n\) separately by solving a quadratic minimization problem, where a solution is given by
To obtain uniqueness of the solution, we assume that \(\Pi _n > 0\) or \(\Delta _n > 0\) for all \(n \in {\mathbb {N}}\) throughout this paper.
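For intuition, the componentwise solution \({\overline{g}}_n = \sigma _n \Pi _n / (\sigma _n^2 \Pi _n + \Delta _n)\) can be checked numerically; the decay rates for \(\sigma _n\) and \(\Pi _n\) below are illustrative assumptions, and the white-noise model \(\Delta _n = \delta ^2\) is the special case discussed later in the text:

```python
import numpy as np

n = np.arange(1, 51)
sigma = n ** -1.0                      # illustrative singular values
Pi = n ** -2.0                         # signal variances: smooth training images
delta = 1e-2
Delta = np.full_like(Pi, delta ** 2)   # white noise, Delta_n = delta^2

# Optimal learned coefficients g_n = sigma_n*Pi_n / (sigma_n^2*Pi_n + Delta_n) ...
g_opt = sigma * Pi / (sigma ** 2 * Pi + Delta)

# ... coincide with Tikhonov filtering with the componentwise parameter Delta_n/Pi_n
alpha_n = Delta / Pi
g_tik = sigma / (sigma ** 2 + alpha_n)
```

Note that `alpha_n` grows with n in this model, i.e., rougher components are regularized more strongly; this is the componentwise adaptivity discussed below.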
Remark 1
The above result can be generalized to the case of singular values that are not strictly decreasing. Since \(\sigma _n\) converges to zero, each singular value has finite multiplicity and we can rewrite them as a sequence of strictly decreasing singular values \(\sigma _n\) with multiplicity \(\mu (\sigma _n) \geqslant 1\) and corresponding singular functions \(\{u_{nm}\}_{m=1}^{\mu (\sigma _n)}\) and \(\{v_{nm}\}_{m=1}^{\mu (\sigma _n)}\), i.e.,
In this case, the optimal coefficients \(g_n\) are again given by (6), where we use the generalized variance coefficients
We observe that for \(\Pi _n > 0\), (6) can be rewritten as
and therefore has the same form as the classical Tikhonov regularization
with the commonly fixed regularization parameter \(\alpha\) being replaced by a componentwise adaptive and data-driven term \(\Delta _n/\Pi _n\). We note that the restricted structure of the regularizer prevents it from approximating the nullspace component of the unknown. This is quantified by the term \({\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right)\) which is independent of the choice of g. For the generalization error this implies
and hence,
In general, we will need an assumption on the structure of the noise in the data set. First of all, we interpret the noise level \(\delta\) in a statistical sense as a bound on the standard deviation of the noise, formulated, however, such that it includes white noise. This leads to
which we will assume throughout the paper without further notice. Note that when studying the zero-noise limit \(\delta \rightarrow 0\) it is natural to interpret \(\Delta _n\) as depending on \(\delta\), and in these instances we will use the explicit notation \(\Delta _n(\delta )\). We refer to \(\Delta _n(\delta ) = \delta ^2\) for all \(n\in {\mathbb {N}}\) as the case of white noise.
Naturally, we expect the clean images to be smoother than the noise, meaning that their coefficients decay faster in the respective singular basis. We thus formulate the following assumption, which automatically holds for white noise (\(\Delta _n = \delta ^2\) for all n), since \(\Pi _n\) goes to zero.
Assumption 1
For \(\delta > 0\), there exists \(n_0 \in {\mathbb {N}}\) such that the sequence \(\Delta _n/\Pi _n\) is well-defined for \(n \geqslant n_0\) and diverges to \(+\infty\).
For some purposes it will suffice to assume a weaker form, which only requires a lower bound instead of divergence to infinity.
Assumption 2
There exists \(c > 0\) and \(n_0 \in {\mathbb {N}}\) such that \(\Delta _n(\delta ) \geqslant c \, \delta ^2 \, \Pi _n\) for every \(n \geqslant n_0\) and \(\delta > 0\).
With Assumption 2 we can conclude that the operator \(R(\cdot ;\;{\overline{g}})\) with \({\overline{g}}\) given by (6) is a linear and bounded operator for strictly positive \(\delta\).
Lemma 1
Let Assumption 2 be satisfied. Then \(R(\cdot ;\;{\overline{g}})\) with \({\overline{g}}\) as defined in (6) is a bounded linear operator for \(\delta > 0\).
Proof
By construction, the operator R is linear in its argument \(f^\delta\). We first note that \(\Pi _n = 0\) implies \({\overline{g}}_n= 0\). For \(\Pi _n > 0\) and \(n \geqslant n_0\), the condition \(\Delta _n \geqslant c \, \delta ^2 \, \Pi _n\) implies
due to \(\sigma ^2_n + \Delta _n/\Pi _n \geqslant 2\sigma _n\sqrt{\Delta _n/\Pi _n}\). For \(n<n_0\) we have
Hence, we conclude
which implies that the operator \(R(\cdot ;\;{\overline{g}})\) is a bounded linear operator for \(\delta > 0\).
Remark 2
In practical scenarios with finite (empirical) data \((u^i, f^i)_{i=1,\cdots , N}\) one typically minimizes the empirical risk
instead of the expectation, leading to coefficients
with
Yet, a convergence analysis of the empirical risk minimization requires further (strong) assumptions to control \(\Gamma _n^N\), such that we limit ourselves to the analysis of the ideal case (6) in this work.
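Since the empirical risk decouples in the singular basis, the empirical coefficients reduce to a per-component least-squares fit of \(\langle u^i,u_n\rangle\) against \(\langle f^{\delta ,i},v_n\rangle\). The following sketch illustrates this with a synthetic model of the basis coefficients (the decay rates and sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modes, n_samples, delta = 20, 500, 0.05
sigma = np.arange(1, n_modes + 1) ** -1.0   # illustrative singular values
Pi = np.arange(1, n_modes + 1) ** -2.0      # true signal variances of the model

# Synthetic coefficients: column n of U holds <u^i, u_n>, of F holds <f^{delta,i}, v_n>
U = rng.normal(0.0, np.sqrt(Pi), size=(n_samples, n_modes))
F = U * sigma + rng.normal(0.0, delta, size=(n_samples, n_modes))

# Empirical coefficients: per-component least squares
g_emp = (U * F).sum(axis=0) / (F * F).sum(axis=0)

# For comparison: the ideal coefficients (6), approached as the sample size grows
g_ideal = sigma * Pi / (sigma ** 2 * Pi + delta ** 2)
```

For finite N, `g_emp` fluctuates around `g_ideal`; the fluctuations of the cross term \(\Gamma _n^N\) are exactly what makes the convergence analysis of the empirical case delicate.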
3.2 Range Conditions
We start our analysis of the learned regularization method with an inspection of the range condition (cf. [12]), which for Tikhonov-type regularization methods equals the so-called source condition in classical regularization theory (cf. [18]). More precisely, we ask which elements \(u \in X\) can be obtained from the regularization operator \(R(\cdot ,{\overline{g}})\) for some data f, i.e., we characterize the range of \(R(\cdot ,{\overline{g}})\). Intuitively, one might expect that the range of R includes the set of training samples, or, in a probabilistic setting, that the expected smoothness of elements in the range (with expectation over the noise and training images) equals the expected smoothness of the training images. However, as we shall see below, the expected smoothness of reconstructions is typically higher.
We start with a characterization of the range condition, where we again denote the range of an operator by \({\mathcal {R}}\).
Proposition 1
Let \({\overline{g}}\) be given by (6). Then \(u \in {\mathcal {R}}(R(\cdot ,{\overline{g}}))\) if and only if
If Assumption 2 is satisfied we have in particular \(u \in {\mathcal {R}}(A^*)\) and in the case of white noise the range condition (10) holds if and only if
Proof
Given the form of \({\overline{g}}\) we see that \(u \in {\mathcal {R}}(R(\cdot ,{\overline{g}}))\) if
for some \(f \in Y\), which is further equivalent to \(\frac{1}{{\overline{g}}_n} \langle u,u_n \rangle \in \ell ^2({\mathbb {N}})\). We see that
and due to the boundedness of \(\sigma _n\), we find \(\sigma _n \langle u,u_n \rangle \in \ell ^2({\mathbb {N}})\) for any \(u \in X\). Hence, the range condition is satisfied if and only if
holds, which is just (10). Under Assumption 2 we have \(\frac{\Delta _n}{\sigma _n\Pi _n}\geqslant \frac{c \, \delta ^2}{\sigma _n}\), thus
This implies \(u =A^*w\) with
For white noise, we have \(\Delta _n = \delta ^2\), which implies that (10) is satisfied if and only if (11) is satisfied.
Let us mention that under Assumption 1 the range condition of the learned regularization method is actually stronger than the source condition \(u=A^*w\), since we even have
with the weight \(\frac{\Delta _n^2}{\Pi _n^2}\) diverging under Assumption 1. This indicates that the learned regularization might be highly smoothing or even oversmoothing, which depends on the roughness of the noise compared to the signal, i.e., the quotient \(\frac{\Delta _n}{\Pi _n}\).
Although the range of the reconstruction operator is still dense in the range of \(A^*\) if \(\Pi _n > 0\) for all \(n \in {\mathbb {N}}\), the oversmoothing effect of the learned spectral regularization method can be made clear by looking at the expected smoothness of the reconstructions, i.e.,
A straightforward computation yields
While we expect (at least on average) similar smoothness of the reconstructions as for the training images, i.e., \({\tilde{\Pi }}_n \approx \Pi _n\), we find under Assumption 2 that
for large n, i.e., \(\frac{{{\tilde{\Pi }}}_n}{\Pi _n}\) converges to zero at least as fast as \(\sigma _n^2\). The distribution of reconstructions thus has much more mass on smoother elements of X than the training set. The main reason for this behavior is the noise in the data: although the method is supervised and tries to match all reconstructions to the training data, it needs to produce a linear operator that maps noise to suitable elements of X as well. Note that the behavior changes as \(\delta \rightarrow 0\). For small \(\frac{\Delta _n}{\sigma _n^2 \Pi _n}\), the denominator is approximately equal to one, hence
i.e., in the limit the right degree of smoothness will be restored.
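This limit behavior can be checked numerically: inserting the optimal coefficients into the expression for \({\tilde{\Pi }}_n\) gives the ratio \({\tilde{\Pi }}_n/\Pi _n = 1/(1 + \Delta _n/(\sigma _n^2 \Pi _n))\), consistent with the discussion above. A sketch with illustrative decay rates and white noise:

```python
import numpy as np

n = np.arange(1, 101)
sigma = n ** -1.0   # illustrative singular values
Pi = n ** -2.0      # illustrative signal variances

def smoothness_ratio(delta):
    # Pi_tilde_n / Pi_n = 1 / (1 + Delta_n / (sigma_n^2 * Pi_n)) with Delta_n = delta^2:
    # reconstructions carry less mass on rough (large-n) components than the training data
    return 1.0 / (1.0 + delta ** 2 / (sigma ** 2 * Pi))

ratio_noisy = smoothness_ratio(0.1)    # strong oversmoothing of rough modes
ratio_clean = smoothness_ratio(1e-6)   # close to one: the right smoothness is restored
```

The ratio decreases monotonically in n for fixed noise level and tends to one componentwise as the noise level vanishes.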
Another perspective on the expected smoothness is to investigate the bias of the obtained reconstruction operator, or, since we cannot expect to reconstruct nullspace elements
Inserting the definition of \({\overline{g}}\), we derive that for any \(N \in {\mathbb {N}}\)
which yields the convergence of the bias to zero as \(\delta \rightarrow 0\). We see that, just like the speed of convergence of the variance coefficients \(\tilde{\Pi }_n\), the speed of convergence of the bias is determined by the convergence of the factors \(\frac{\Delta _n}{\sigma _n^2 \Pi _n} \rightarrow 0\).
3.3 Convergence of the Regularization
In the following we will investigate the convergence properties of the learned spectral regularization as the noise level \(\delta\) tends to zero. For this sake we will write \(\Delta _n(\delta )\) and \({\overline{g}}(\delta )\) here to make clear that we are looking at a sequence converging to zero.
We shall naturally consider the convergence in the mean-squared error, i.e.,
This equals the 2-Wasserstein distance (cf. [4, Sect. 2.1]) between the measure concentrated at u and the distribution generated by the regularization applied to random noisy data, and can thus be understood as a concentration of measure in the zero-noise limit.
In addition to the convergence of the regularization for single elements \(u \in X\) we can also look at the mean-squared error over the whole distribution of given images and noise, i.e.,
which is identical to the quantity (7), with emphasis on the dependence of \(\Delta _n\) on \(\delta\).
With the following theorem we verify that \(e(u, \delta )\) as defined in (14) and \({\overline{e}}(\delta )\) as defined in (15) indeed converge to zero as \(\delta\) tends to zero.
Theorem 1
For \(u \in {\mathcal {N}}(A)^\perp\), let \(\langle u,u_n\rangle = 0\) for all indices n with \(\Pi _n = 0\). Then \(e(u,\delta ) \rightarrow 0\) as \(\delta \rightarrow 0\). Moreover, we also have \({\overline{e}}(\delta ) \rightarrow 0\) for \(\delta \rightarrow 0\).
Proof
We start by computing
Taking expectations with respect to the noise yields
Now let \(\epsilon > 0\) be arbitrary. Due to the summability of \(\langle u, u_n \rangle ^2 + \Pi _n\) we can choose N large enough such that
Now we can choose
to obtain
for \(\delta\) sufficiently small, which implies the convergence to zero.
In the same way we can estimate
where the first equality follows from (7), and obtain the convergence by the same argument.
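The convergence of \({\overline{e}}(\delta )\) can also be observed numerically: inserting the optimal coefficients into (7) reduces the per-component error to \(\Pi _n\Delta _n/(\sigma _n^2\Pi _n+\Delta _n)\) (plus the nullspace term, omitted here). A quick check under white noise, with illustrative decay rates:

```python
import numpy as np

n = np.arange(1, 1001)
sigma = n ** -1.0   # illustrative singular values
Pi = n ** -2.0      # illustrative signal variances, summable as required

def e_bar(delta):
    # Error with the optimal coefficients, white noise Delta_n = delta^2:
    # sum over components of Pi_n * Delta_n / (sigma_n^2 * Pi_n + Delta_n)
    return np.sum(Pi * delta ** 2 / (sigma ** 2 * Pi + delta ** 2))

errs = [e_bar(d) for d in (0.1, 0.01, 1e-6)]
```

Each component of the error is increasing in \(\delta\), so the total error decreases monotonically and vanishes in the zero-noise limit, as the theorem asserts.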
4 Learned Radon Filters
In this section, we show the correspondence between the data-driven regularization of the inverse Radon transform and filtered back-projection. We therefore analyze the properties of an optimal filter for the regularized filtered back-projection (5), i.e., we want to find a function \({\overline{\rho }}\!: {\mathbb {R}} \times [0,\uppi ] \rightarrow {\mathbb {C}}\) such that
The function \(\rho\) now plays the role of the learned spectral coefficients. In particular, \(\rho (r, \theta ) = |r |\) corresponds to an inversion, i.e., \(g_n = \frac{1}{\sigma _n}\), and \(\rho (r, \theta ) = 1\) yields the adjoint operator, which corresponds to \(g_n = {\sigma _n}\). In analogy to the derivation of the optimal coefficients in Sect. 3, we first rewrite the \(L^2\)-error of the difference of a single \(u\in L^2({\mathbb {R}}^2)\) and the corresponding regularized reconstruction of the measurement \(f^\delta \in L^2({\mathbb {R}}^2)\) as follows (using the isometry property of the Fourier transform):
In the last line, we use the central slice theorem to get
where the latter equality holds since the range of the Radon transform is dense in \(L^2({\mathbb {R}} \times [0,\uppi ])\) and thus
Note that for the reconstruction operator to be well-defined and continuous on \(L^2\), the filter function \(\rho\) has to be essentially bounded. With the same arguments used in Sect. 3 we derive
with \(\psi (r,\theta ) = \frac{\rho (r,\theta )}{\vert r \vert }\),
By the above formula we can restrict \(\psi\) and thus \(\rho\) to be real-valued, as adding an imaginary component would always increase the expected value. Accordingly, the optimal filter is real-valued and given by
To fulfill the boundedness of \({\overline{\rho }}\) we assume that
as \(|r|\longrightarrow \infty\) for every \(\theta \in [0,\uppi ]\), which can be recognized as the natural condition that the ground-truth data are smoother than the noise on average. The expected error using the optimal filter is then given by
Using the central slice theorem again we can further determine the expected smoothness \(\tilde{\Pi }(r,\theta ):= {\mathbb {E}}\left( \left|{\mathcal {F}}\!_{\text { 1-D}}AR(f^\delta ;\;{\overline{\rho }})(r,\theta )\right|^2\right)\) of the reconstructions as
This reveals the analog to the smoothing effect obtained by the spectral reconstruction that was discussed in Sect. 3.2.
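To illustrate how such a learned filter could act in practice, here is a heavily simplified sketch: a Wiener-type filter \(\rho (r) = |r|\,\Pi (r)/(\Pi (r)+\Delta (r))\), i.e., the ramp damped by a per-frequency signal-to-signal-plus-noise ratio, applied row-wise to a sinogram in the 1-D Fourier domain. The variance models `Pi_hat` and `Delta_hat` and the overall form are illustrative assumptions, not the paper's exact optimal filter:

```python
import numpy as np

def learned_filter(Pi_hat, Delta_hat, r):
    # Ramp |r| damped by the per-frequency ratio Pi / (Pi + Delta); stays below the
    # pure ramp everywhere and decays where the noise dominates the signal
    return np.abs(r) * Pi_hat / (Pi_hat + Delta_hat)

def apply_filter(sinogram, rho):
    # Filter each projection (row = one angle) in the 1-D Fourier domain
    return np.fft.ifft(rho * np.fft.fft(sinogram, axis=-1), axis=-1).real

n = 128
r = np.fft.fftfreq(n)
Pi_hat = 1.0 / (1.0 + (n * np.abs(r)) ** 4)   # smooth data: fast spectral decay
Delta_hat = np.full(n, 1e-4)                  # white-noise floor
rho = learned_filter(Pi_hat, Delta_hat, r)
```

Because `Pi_hat` decays faster than `Delta_hat`, the learned filter remains bounded even though the ramp \(|r|\) does not, mirroring the boundedness condition above.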
Remark 3
Similar to the spectral regularization, optimal filters can also be computed for a finite set of training data by rewriting the empirical risk for input-output pairs \(\{u^i, f^{\delta ,i}\}_{i = 1}^N\) as
Computing the optimality conditions for \(\psi\) then yields
with
and
5 Numerical Experiments with the Discretized Radon Transform
In this section, we support the theoretical results of the previous sections with numerical experiments.
5.1 Datasets
The following experiments are performed on a synthetic dataset and on the LoDoPaB-CT dataset (cf. [23]). The synthetic dataset contains 32 000 images of random ellipses supported on a circle contained completely in the image domain (see Fig. 4 for an example). Each ellipse has a random width and height, and is rotated and translated by a random angle and distance. All ellipses are completely contained within the image borders, and two or more ellipses may overlap. We split the data into \(64\%\) training, \(16\%\) validation, and \(20\%\) test data. The LoDoPaB-CT dataset contains over 40 000 human chest scans and is split into 35 820 training, 3 522 validation, and 3 553 test images. For both datasets, we simulate sinograms as described in the following section and model noisy data by adding Gaussian noise with zero mean and variance \(\delta ^2\). In “Appendix A” we additionally report results for uniformly distributed noise.
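A generator for such random-ellipse phantoms could be sketched as follows; the parameter ranges (semi-axes, centers, intensities) are illustrative assumptions, not the dataset's exact ones:

```python
import numpy as np

def random_ellipse_phantom(size=64, n_ellipses=3, rng=None):
    # Draw random ellipses (random semi-axes, rotation angle, and center)
    # on a square image; overlapping ellipses simply add up.
    rng = np.random.default_rng(rng)
    y, x = np.mgrid[0:size, 0:size] / size - 0.5
    img = np.zeros((size, size))
    for _ in range(n_ellipses):
        a, b = rng.uniform(0.05, 0.2, size=2)       # semi-axes
        phi = rng.uniform(0, np.pi)                  # rotation angle
        cx, cy = rng.uniform(-0.25, 0.25, size=2)    # center, away from the borders
        xr = (x - cx) * np.cos(phi) + (y - cy) * np.sin(phi)
        yr = -(x - cx) * np.sin(phi) + (y - cy) * np.cos(phi)
        img += ((xr / a) ** 2 + (yr / b) ** 2 <= 1.0) * rng.uniform(0.2, 1.0)
    return img

phantom = random_ellipse_phantom(rng=0)
```

Keeping the centers well inside the domain ensures every ellipse is completely contained within the image borders, as in the dataset description.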
5.2 Discretization and Optimization
We perform numerical experiments with the Radon transform for discrete data based on the Spline-0-Pixel-Model where we assume the image functions \(u\in L^2({\mathbb {R}}^2)\) to be of the form
where \(I \in {\mathbb {N}}\) denotes the number of pixels in each direction, \(u_{ij} \in {\mathbb {R}}\) for each \(i,j < I\) and the 0-spline \(\beta _0\) is defined as
Accordingly, we have \(u^{(ij)} = u\left( \frac{i}{I}, \frac{j}{I}\right)\), so with a slight abuse of notation we write \(u \in {\mathbb {R}}^{I\times I}\). Reformulating the Radon transform for this particular choice of input functions yields a finite-dimensional linear operator (see, e.g., [35] for a detailed derivation), which can thus be represented by a matrix A. For sufficiently small dimensions, the finite set of vectors \(u_n\) and \(v_n\) is given by the SVD of A, more precisely \(A = V\Sigma U^{\rm{T}}\), where \(u_n\) and \(v_n\) are the column vectors of the orthogonal matrices U and V, and \(\Sigma\) is a diagonal matrix containing the singular values in non-increasing order. Since the considered problem is finite-dimensional, Assumption 2 is always fulfilled. We still see a faster decay of \(\Pi ^N\) compared to \(\Delta ^N\), as depicted in Fig. 1.
Since the computation of the SVD has high memory requirements and the matrix multiplication with U and \(V^{\rm{T}}\) is computationally expensive, this approach is only feasible for low resolution data. Using the fact that V is orthogonal, we see the connection to the continuous optimization (16) by rewriting the regularized inversion operator as
where the division is to be interpreted element-wise, and where \(\text {diag}(\Sigma )\) returns the main diagonal of \(\Sigma\).
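A minimal numpy version of this rewriting could look as follows; a random matrix stands in for the discretized Radon transform, and the variance vectors are illustrative:

```python
import numpy as np

def learned_svd_reconstruction(A, f_batch, Pi, Delta):
    # Discrete analog of the learned spectral regularization: with A = V Sigma U^T,
    # reconstruct R f = U @ (g * (V^T f)), where g = s*Pi / (s^2*Pi + Delta)
    # is applied element-wise in the singular basis.
    V, s, Ut = np.linalg.svd(A, full_matrices=False)
    g = s * Pi / (s ** 2 * Pi + Delta)
    return (g * (f_batch @ V)) @ Ut   # rows of f_batch are measurement vectors

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 20))                  # stand-in for the Radon matrix
u_true = rng.normal(size=20)
f = (A @ u_true)[None, :]                      # one clean measurement
rec = learned_svd_reconstruction(A, f, Pi=np.ones(20), Delta=np.full(20, 1e-12))
```

For vanishing `Delta` the coefficients reduce to \(1/\sigma _n\) and the operator approaches the pseudo-inverse, so clean data are reconstructed almost exactly.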
As we will analyze in more detail below, the discretization of the Radon transform leads to (4) being incorrect for \({\mathcal {F}}\) denoting the discrete Fourier transform. Thus, we will compare four different approaches in our numerical results.
i) The spectral regularization (3) with coefficients computed according to the analytical solution (9).

ii) The spectral regularization (3) with coefficients optimized in a standard machine learning setting, i.e., initializing the parameters with zeros, partitioning the data into batches of 32 instances, and then using the PyTorch implementation of Adam (with default parameters except for a learning rate of 0.1) for optimization. Ideally, these results should be identical to those of i).

iii) The learned filtered back-projection (5) with coefficients computed according to the analytical solution (17), keeping in mind that the resulting coefficients might not be optimal due to the aforementioned discretization error in (5). To partially compensate for this, we replaced the discretization of the ramp filter \(\vert r\vert\) by filter coefficients we optimized on clean data. For all non-zero noise levels we kept these pseudo-ramp filter coefficients from the no-noise baseline.

iv) The learned filtered back-projection (5) with coefficients that optimize

$$\begin{aligned} {\overline{\rho }}^N = \mathop {\rm {arg\,min}}\limits _{\rho } \frac{1}{N} \sum _{i=1}^N\Bigg \Vert u^i - A^{\rm T}\left( F^{-1} \rho \, F f^{\delta ,i}\right) \Bigg \Vert ^2, \end{aligned}$$

where F denotes the discrete Fourier transform and the multiplication with \(\rho\) is element-wise. In keeping with the results from Sect. 4, we only consider real-valued \(\rho\). The optimization is done in the same way as described above for the SVD. We point out that for rotation-invariant distributions, i.e., cases where the noise and data distributions are constant in the direction of the angles, the filter \(\rho\) can be chosen constant in that direction as well, so that the number of necessary parameters equals the number of positions s. We further note that due to the inevitable discretization errors, we cannot expect perfect reconstructions, not even for perfect measurements without noise.
We refer to Approaches ii) and iv) as learned and to Approaches i) and iii) as analytic and compare their behavior below. Although the discrete Radon transform A is an approximation of the continuous Radon transform (denoted by \({\hat{A}}\) in the following), several scaling factors have to be taken into account to relate the discrete regularization to a continuous regularization. We collect these factors in the following remark.
Remark 4
For data supported on the unit square, we consider discretizations \(u \in {\mathbb {R}}^{I\times I}\) and \(v \in {\mathbb {R}}^{K\times L}\) of functions \({\hat{u}}\) and \({\hat{v}}\). Here, I denotes the number of pixels in each direction, K the number of equispaced angles in \([0,\uppi ]\), and L the number of equispaced positions in \(\left[ -\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right]\). Then, the following approximations hold for

(i) the Radon transform and its adjoint:
$$\begin{aligned} {\hat{A}}{\hat{u}} \approx Au \quad \text {and}\quad {\hat{A}}^*{\hat{v}} \approx \frac{\sqrt{2}\uppi }{KL} A^{\rm T} v, \end{aligned}$$

(ii) the SVD:
$$\begin{aligned} {\hat{\sigma }}_n \approx \sqrt{\frac{\sqrt{2}\uppi }{KL}}\,\sigma _n,\qquad {\hat{u}}_n \approx I \,u_n, \quad \text {and}\quad {\hat{v}}_n \approx \sqrt{\frac{KL}{\sqrt{2}\uppi }}\, v_n, \end{aligned}$$
where \({\hat{u}}_n\), \({\hat{v}}_n\) are normalized with respect to the \(L^2\)-norm and \(u_n\), \(v_n\) with respect to the Euclidean norm,

(iii) the data-driven (co-)variance terms:
$$\begin{aligned} {\hat{\Pi }}_n = \frac{\Pi _n}{I^2},\qquad {\hat{\Delta }}_n = \frac{\sqrt{2}\uppi }{KL}\,\Delta _n,\quad \text {and}\quad {\hat{\Gamma }}_n = \sqrt{\frac{\sqrt{2}\uppi }{KL}}\,\frac{\Gamma _n}{I}. \end{aligned}$$
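These scalings are mechanical enough to collect in a small helper. The following sketch rescales discrete quantities to their continuous counterparts according to Remark 4; the values of I, K, and L are toy placeholders, not the discretization used in the experiments.

```python
import numpy as np

def continuous_scalings(I, K, L):
    """Scaling factors of Remark 4 for an I x I image, K angles, L positions."""
    c = np.sqrt(2) * np.pi / (K * L)
    return {
        "adjoint": c,                 # A_hat^* v_hat ~ c * A^T v
        "sigma":   np.sqrt(c),        # sigma_hat_n ~ sqrt(c) * sigma_n
        "u":       float(I),          # u_hat_n ~ I * u_n
        "v":       1.0 / np.sqrt(c),  # v_hat_n ~ v_n / sqrt(c)
        "Pi":      1.0 / I**2,        # Pi_hat_n = Pi_n / I^2
        "Delta":   c,                 # Delta_hat_n = c * Delta_n
        "Gamma":   np.sqrt(c) / I,    # Gamma_hat_n = sqrt(c) * Gamma_n / I
    }

# Toy discretization (assumed values) and toy discrete singular values.
sc = continuous_scalings(I=128, K=180, L=183)
sigma_disc = np.array([3.0, 1.5, 0.2])
sigma_cont = sc["sigma"] * sigma_disc     # approximations of the continuous SVD
```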
5.3 Results
In the following, we present several numerical results for the approaches introduced in the previous section.
5.3.1 Learned vs. Analytic
The losses as well as the PSNR and SSIM values obtained by the different approaches on the test dataset are documented in Table 1. Based on these results, we deduce that the SVD-based regularization performs better than the FFT-based regularization. In the SVD case, the results obtained with the analytic coefficients are still slightly better than those obtained with the learned coefficients. The opposite holds in the FFT case, where the analytic solution performs worse than the learned one. This is to be expected due to the discretization errors made in the computation of the analytic optimum. Figure 2 shows the loss curves for both data-driven optimization approaches as well as the loss obtained with the analytically optimal filters. As expected, for the SVD approach the learned loss converges to the analytic loss. For the FFT approach, on the other hand, the loss predicted by the analytic formula is higher than the actual data-driven loss, which again indicates that the optimization balances out the discretization errors. This gap between the continuous and discrete optima is also visible at higher resolutions, as the comparison of the analytic and learned filters in Fig. 3 shows. Based on these insights, we only show results for analytic SVD coefficients and learned FFT filters in the following.
We further observe that the losses of the two approaches become more similar as the noise level increases, which is also visible in the reconstructions (see Fig. 4). Here, we compare our two approaches to naïve reconstruction with filtered back-projection (FBP) and to Tikhonov reconstruction, where the reconstruction operator is chosen individually for each example with the discrepancy principle. The reconstruction examples also reveal the predicted oversmoothing effect for both approaches.
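A Tikhonov baseline with the discrepancy principle can be sketched as follows. This is not the paper's implementation: a small random matrix stands in for the discrete Radon transform, and the noise level and the safety factor \(\tau\) are illustrative choices. The residual \(\Vert Au_\alpha - v\Vert\) is monotonically increasing in \(\alpha\), so a bisection in log scale finds the parameter matching the target residual.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem (assumption): a random matrix replaces the discrete Radon
# transform; the selection procedure itself is generic.
m, s = 16, 24
A = rng.standard_normal((s, m)) / np.sqrt(s)
u_true = rng.standard_normal(m)
delta = 0.1                                  # componentwise noise level
v = A @ u_true + delta * rng.standard_normal(s)

U, sig, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(sig) Vt

def tikhonov(alpha):
    """SVD-based Tikhonov reconstruction with parameter alpha."""
    coeff = sig / (sig**2 + alpha)
    return Vt.T @ (coeff * (U.T @ v))

def residual(alpha):
    return np.linalg.norm(A @ tikhonov(alpha) - v)

# Discrepancy principle: pick alpha with ||A u_alpha - v|| ~ tau * delta * sqrt(s)
# (sqrt(s) because each of the s components carries noise of std delta).
tau = 1.5
target = tau * delta * np.sqrt(s)
lo, hi = 1e-12, 1e3                          # bracket: residual(lo) < target <= residual(hi)
for _ in range(100):
    mid = np.sqrt(lo * hi)                   # bisection in log scale
    if residual(mid) < target:
        lo = mid
    else:
        hi = mid
alpha = np.sqrt(lo * hi)
u_rec = tikhonov(alpha)
```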
5.3.2 Generalization to Different Resolutions
Since the operator we use in our experiments is discretized, the optimal coefficients (or filters, respectively) depend on the resolution of the operator and the corresponding data. However, both approaches offer ways to transfer coefficients optimized for a certain resolution to another resolution. One approach to generalize the learned SVD-based coefficients would be to resort to the classic linear regularization formulation (3) and retrieve a function \(g(\sigma )\) by inter-/extrapolating the optimal values \(g(\sigma _n) = {\overline{g}}_n\). A generalization of the FFT-filters can be obtained in a similar way by interpolation of \({\overline{\rho }}(\theta , r)\) in the direction of the angles \(\theta\) and extrapolation in the direction of the frequencies r. Comparing the optimal coefficients obtained for different resolutions after rescaling them in accordance with Remark 4 (see Fig. 5 for SVD-coefficients and Fig. 6 for FFT-filters), we see that all curves behave similarly, but there are obvious differences in the magnitude and location of the peak for different resolutions. While the optimal coefficients seem to converge in the domain of small singular values (or low frequencies, respectively), up to the resolutions we were able to compute, there is no obvious convergence visible in the domain of large singular values for the SVD-approach or the domain of high frequencies for the FFT-approach. Note that since high singular values correspond to low frequencies, this reveals an opposite behavior of the two approaches. Inspecting the curves for \({\hat{\Pi }}\), we see that they are almost identical for high resolutions (see Fig. 5). Therefore, a convergence of the optimal spectral coefficients for higher resolutions seems probable.
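The transfer of learned coefficients between resolutions can be sketched in a few lines. The coefficients below are synthetic placeholders (a Tikhonov-like curve), not the paper's learned values; the point is only the mechanism of treating the learned pairs \((\sigma _n, {\overline{g}}_n)\) as samples of a function \(g(\sigma )\) and evaluating it at the singular values of the target resolution.

```python
import numpy as np

# Learned pairs (sigma_n, g_n) at the source resolution -- synthetic
# placeholders here, standing in for the actually learned coefficients.
sigma_src = np.linspace(0.05, 2.0, 64)
g_src = sigma_src / (sigma_src**2 + 0.1)     # toy Tikhonov-like curve

# Singular values of the operator at the target resolution (also synthetic).
sigma_tgt = np.linspace(0.02, 2.5, 256)

# np.interp interpolates linearly and extrapolates by holding the boundary
# values constant, which is one simple choice outside the learned range.
g_tgt = np.interp(sigma_tgt, sigma_src, g_src)
```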
5.4 Generalization to Different Data Distributions
To evaluate how well the learned regularizers generalize between different datasets, we show empirical results on the LoDoPaB-CT dataset in this section. Figures 7, 8, and 9 compare the results obtained with a regularizer trained on the LoDoPaB-CT dataset with those obtained with a regularizer trained on the synthetic ellipse dataset. While both approaches clearly improve the reconstructions compared to the filtered back-projection, the effect of optimizing the regularizers on the correct dataset is still clearly visible.
6 Conclusions and Outlook
In this work, we studied the behavior of data-driven linear regularization of inverse problems. For problems emerging from compact linear forward operators we showed that the optimal linear regularization can be computed analytically with the singular value expansion of the operator. An analysis of the range conditions revealed an oversmoothing effect depending on the noise level. We further confirmed that the optimal regularization is convergent under suitable assumptions. By deriving analogous results for the Radon transform, a specific non-compact linear operator, we could establish a connection to the well-known approach of filtered back-projection. Numerical experiments with a discretized (and thus compact) version of the Radon transform verified our theoretical findings for compact operators. We additionally proposed a computationally more efficient discretization of the filtered back-projection using the FFT, which allows for finer discretizations of the operator. However, the filters obtained by this approach turned out to deviate strongly from the theoretical results for the continuous case. Especially in the low-noise regime, the discretization errors that come with this approach caused a considerable loss of accuracy compared to the SVD-approach. Additionally, we empirically evaluated the convergence behavior of both regularization approaches for the discrete Radon transform with an increasing level of discretization.
For the future, we would like to better understand the discretization error that arises from the discretization of the Radon transform and its impact on the computed filters. It would also be interesting to study more general nonlinear data-driven regularization methods and to find formulations and criteria for which one can prove that these methods are convergent regularizations, similar to the analysis carried out in Sect. 3.3.
Data Availability
Further information on how to access the LoDoPaB-CT dataset can be found at https://doi.org/10.1038/s41597-021-00893-z. The generation of the synthetic data as well as the code for our experiments can be found at https://github.com/AlexanderAuras/convergent-data-driven-regularizations-for-ctreconstruction.
References
Adler, J., Öktem, O.: Learned primal-dual reconstruction. IEEE Trans. Med. Imaging 37(6), 1322–1332 (2018). https://doi.org/10.1109/TMI.2018.2799231
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
Alberti, G.S., De Vito, E., Lassas, M., Ratti, L., Santacesaria, M.: Learning the optimal Tikhonov regularizer for inverse problems. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 25205–25216. Curran Associates Inc., New York (2021)
Ambrosio, L., Gigli, N.: A user’s guide to optimal transport. In: Modelling and Optimisation of Flows on Networks, pp. 1–155. Springer, Berlin (2013). https://doi.org/10.1007/978-3-642-32160-3_1
Amos, B., Xu, L., Kolter, J.Z.: Input convex neural networks. In: ICML, pp. 146–155. PMLR (2017)
Arridge, S., Maass, P., Öktem, O., Schönlieb, C.-B.: Solving inverse problems using data-driven models. Acta Numer. 28, 1–174 (2019)
Aspri, A., Banert, S., Öktem, O., Scherzer, O.: A data-driven iteratively regularized Landweber iteration. Numer. Funct. Anal. Optim. 41(10), 1190–1227 (2020)
Baguer, D.O., Leuschner, J., Schmidt, M.: Computed tomography reconstruction using deep image prior and learned reconstruction methods. Inverse Prob. 36(9), 094004 (2020). https://doi.org/10.1088/1361-6420/aba415
Bai, S., Kolter, J.Z., Koltun, V.: Deep equilibrium models. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates Inc., New York (2019)
Barutcu, S., Aslan, S., Katsaggelos, A.K., Gürsoy, D.: Limited-angle computed tomography with deep image and physics priors. Sci. Rep. 11(1), 17740 (2021). https://doi.org/10.1038/s41598-021-97226-2
Bauermeister, H., Burger, M., Moeller, M.: Learning spectral regularizations for linear inverse problems. In: NeurIPS 2020 Workshop on Deep Learning and Inverse Problems (2020)
Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numer. 27, 1–111 (2018)
Bora, A., Jalal, A., Price, E., Dimakis, A.G.: Compressed sensing using generative models. In: ICML, pp. 537–546. PMLR (2017)
Chen, Y., Ranftl, R., Pock, T.: Insights into analysis operator learning: from patch-based sparse models to higher order MRFs. IEEE Trans. Image Process. 23(3), 1060–1072 (2014)
Chen, H., Zhang, Y., Chen, Y., Zhang, J., Zhang, W., Sun, H., Lyu, Y., Liao, P., Zhou, J., Wang, G.: LEARN: learned experts’ assessment-based reconstruction network for sparse-data CT. IEEE Trans. Med. Imaging 37(6), 1333–1347 (2018)
Chen, H., Zhang, Y., Kalra, M.K., Lin, F., Chen, Y., Liao, P., Zhou, J., Wang, G.: Low-dose CT with a residual encoder-decoder convolutional neural network. IEEE Trans. Med. Imaging 36(12), 2524–2535 (2017). https://doi.org/10.1109/TMI.2017.2715284
Chung, J., Chung, M., O’Leary, D.P.: Designing optimal spectral filters for inverse problems. SIAM J. Sci. Comput. 33(6), 3132–3152 (2011)
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Kluwer, Dordrecht (1996)
He, J., Wang, Y., Ma, J.: Radon inversion via deep learning. IEEE Trans. Med. Imaging 39(6), 2076–2087 (2020)
Jin, K.H., McCann, M.T., Froustey, E., Unser, M.: Deep convolutional neural network for inverse problems in imaging. IEEE Trans. Image Process. 26(9), 4509–4522 (2017). https://doi.org/10.1109/TIP.2017.2713099
Kobler, E., Effland, A., Kunisch, K., Pock, T.: Total deep variation for linear inverse problems. In: CVPR, pp. 7546–7555 (2020)
Latorre, F., Eftekhari, A., Cevher, V.: Fast and provable ADMM for learning with generative priors. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates Inc., New York (2019)
Leuschner, J., Schmidt, M., Baguer, D.O., Maass, P.: LoDoPaB-CT, a benchmark dataset for low-dose computed tomography reconstruction. Sci. Data. 8, 109 (2021). https://doi.org/10.1038/s41597-021-00893-z
Leuschner, J., Schmidt, M., Ganguly, P.S., Andriiashen, V., Coban, S.B., Denker, A., Bauer, D., Hadjifaradji, A., Batenburg, K.J., Maass, P., van Eijnatten, M.: Quantitative comparison of deep learning-based image reconstruction methods for low-dose and sparse-angle CT applications. J. Imaging 7(3), 44 (2021)
Li, H., Schwab, J., Antholzer, S., Haltmeier, M.: NETT: solving inverse problems with deep neural networks. Inverse Prob. 36(6), 065005 (2020)
Li, Y., Li, K., Zhang, C., Montoya, J., Chen, G.-H.: Learning to reconstruct computed tomography images directly from sinogram data under a variety of data acquisition conditions. IEEE Trans. Med. Imaging 38(10), 2469–2481 (2019). https://doi.org/10.1109/TMI.2019.2910760
Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., Bach, F.: Supervised dictionary learning. In: NeurIPS (2008)
Meinhardt, T., Moeller, M., Hazirbas, C., Cremers, D.: Learning proximal operators: using denoising networks for regularizing inverse imaging problems. In: ICCV, pp. 1781–1790 (2017)
Moeller, M., Mollenhoff, T., Cremers, D.: Controlling neural networks via energy dissipation. In: ICCV, pp. 3256–3265 (2019)
Riccio, D., Ehrhardt, M.J., Benning, M.: Regularization of inverse problems: deep equilibrium models versus bilevel learning. arXiv:2206.13193 (2022)
Rick Chang, J., Li, C.-L., Poczos, B., Vijaya Kumar, B., Sankaranarayanan, A.C.: One network to solve them all—solving linear inverse problems using deep projection models. In: ICCV, pp. 5888–5897 (2017)
Romano, Y., Elad, M., Milanfar, P.: The little engine that could: regularization by denoising (RED). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, pp. 234–241. Springer, Cham (2015)
Roth, S., Black, M.J.: Fields of experts. Int. J. Comput. Vis. 82(2), 205–229 (2009)
Servieres, M.C.J., Normand, N., Subirats, P., Guedon, J.: Some links between continuous and discrete Radon transform. In: Fitzpatrick, J.M., Sonka, M. (eds.) Medical Imaging 2004: Image Processing, vol. 5370, pp. 1961–1971. International Society for Optics and Photonics, SPIE (2004). https://doi.org/10.1117/12.533472
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. In: CVPR, pp. 9446–9454 (2018)
Wang, G., Ye, J.C., De Man, B.: Deep learning for tomographic image reconstruction. Nat. Mach. Intell. 2(12), 737–748 (2020). https://doi.org/10.1038/s42256-020-00273-z
Xiang, J., Dong, Y., Yang, Y.: FISTA-Net: learning a fast iterative shrinkage thresholding network for inverse problems in imaging. IEEE Trans. Med. Imaging 40(5), 1329–1339 (2021). https://doi.org/10.1109/TMI.2021.3054167
Xu, M., Hu, D., Luo, F., Liu, F., Wang, S., Wu, W.: Limited-angle x-ray CT reconstruction using image gradient \(\ell _0\)-norm with dictionary learning. IEEE Trans. Radiat. Plasma Med. Sci. 5(1), 78–87 (2021). https://doi.org/10.1109/TRPMS.2020.2991887
Zhang, M., Gu, S., Shi, Y.: The use of deep learning methods in low-dose computed tomography image reconstruction: a systematic review. Complex Intell. Syst. 8, 5545–5561 (2022). https://doi.org/10.1007/s40747-022-00724-7
Zhu, B., Liu, J.Z., Cauley, S.F., Rosen, B.R., Rosen, M.S.: Image reconstruction by domain-transform manifold learning. Nature 555(7697), 487–492 (2018). https://doi.org/10.1038/nature25988
Acknowledgements
A part of this work was carried out while SK and MBu were with the Friedrich-Alexander-Universität Erlangen-Nürnberg. SK and MB acknowledge support from DESY (Hamburg, Germany), a member of the Helmholtz Association HGF. SK, AA, HB, MM, and MBu acknowledge the support of the German Research Foundation, projects BU 2327/19-1 and MO 2962/7-1. DR acknowledges support from the EPSRC grant EP/R513106/1. MBe acknowledges support from the Alan Turing Institute. This research utilized Queen Mary's Apocrita and Andrena HPC facilities, supported by QMUL Research-IT (http://doi.org/10.5281/zenodo.438045).
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author declares that there is no conflict of interest.
Appendix A Results for Uniformly Distributed Noise
In this Appendix, we present the results for the numerical experiments introduced in Sect. 5 with another noise model. More precisely, instead of using centered Gaussian noise with variance \(\delta ^2\), for a noise level \(\delta > 0\), we corrupt the simulated sinograms with noise that is uniformly distributed in the interval \([-\sqrt{3}\delta , \sqrt{3}\delta ]\). Based on the reconstructions shown in Figs. 10, 11, and 12, we deduce that the method is robust with respect to different noise models.
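The interval \([-\sqrt{3}\delta , \sqrt{3}\delta ]\) is chosen so that the uniform noise matches the variance \(\delta ^2\) of the Gaussian model, since a uniform distribution on \([-a, a]\) has variance \(a^2/3\). A minimal numerical check (with an illustrative noise level):

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform noise matched in variance to Gaussian noise of level delta:
# Var(U[-a, a]) = a^2 / 3, so a = sqrt(3) * delta yields variance delta^2.
delta = 0.05
a = np.sqrt(3) * delta
noise = rng.uniform(-a, a, size=1_000_000)
```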
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kabri, S., Auras, A., Riccio, D. et al. Convergent Data-Driven Regularizations for CT Reconstruction. Commun. Appl. Math. Comput. (2024). https://doi.org/10.1007/s42967-023-00333-2