1 Introduction

Linear inverse problems are at the heart of a variety of imaging applications, including restoration tasks such as image deblurring as well as the inference of unknown images from measurements that contain implicit information about them, as arising, for instance, in computerized tomography (CT), positron emission tomography (PET), or magnetic resonance imaging (MRI). All of these problems are commonly modeled as the task of recovering an image \(u\) from measurements

$$\begin{aligned} f^\delta = Au+ \nu \end{aligned}$$
(1)

for a linear operator \(A\) and noise \(\nu\) characterized by some error bound (noise level) \(\delta \in [0,\infty [\) to be specified later. A key challenge for most practically relevant problems is that \(A\!\!: X \rightarrow {Y}\) is a compact linear operator with infinite-dimensional range (assuming here that X and Y are infinite-dimensional Hilbert spaces), leading to zero being an accumulation point of its singular values and making the pseudo-inverse \(A^\dagger\) discontinuous.

The compactness of \(A\) allows for a singular value expansion, i.e., we can write it in the form

$$\begin{aligned} Au= \sum _{n=1}^\infty \sigma _n \langle u, u_n \rangle v_n, \end{aligned}$$
(2)

where the singular values \(\sigma _n > 0\) are non-increasing, and \(\{u_n\}_{n \in {\mathbb {N}}}\) and \(\{v_n\}_{n \in {\mathbb {N}}}\) form orthonormal bases of the orthogonal complement of the nullspace of the operator, \({\mathcal {N}}(A)^\perp\), and of the closure of its range, \(\overline{{\mathcal {R}}(A)}\), respectively. With a slight abuse of notation, \(\langle \cdot , \cdot \rangle\) denotes the inner product on both spaces X and Y. Classical linear regularization strategies therefore aim to approximate \(u\) with so-called spectral regularization operators of the form

$$\begin{aligned} R(f^\delta ;\;g_\delta ) = \sum _{n=1}^\infty g_\delta (\sigma _n) \langle f^\delta , v_n \rangle u_n \, \end{aligned}$$
(3)

for a suitable function \(g_\delta\) that remains bounded for each \(\delta >0\) but satisfies \(g_\delta (\sigma ) \rightarrow 1/\sigma\) as \(\delta \rightarrow 0\). If this pointwise convergence is suitably fast relative to the noise level, continuous dependence on the data is restored and the reconstructions converge, i.e., \(R(f^\delta ;\;g_\delta ) \rightarrow A^\dagger f\) as \(\delta \rightarrow 0\). An extensive overview of requirements for this kind of convergence and of classical examples for \(g_\delta\), including Tikhonov, Lavrentiev, or truncated SVD regularization, can be found in [18].
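As an illustration, such classical filter functions can be applied via a precomputed (discretized) singular value decomposition. The following minimal Python sketch assumes that the matrices `U`, `V` (with columns approximating \(u_n\), \(v_n\)) and the vector `sigma` stem from a discretization of \(A\); the function names are illustrative and not part of any particular library.

```python
import numpy as np

def g_tikhonov(sigma, alpha):
    # Tikhonov filter: g_delta(sigma) = sigma / (sigma^2 + alpha)
    return sigma / (sigma**2 + alpha)

def g_tsvd(sigma, alpha):
    # truncated SVD: keep 1/sigma only for singular values above the threshold alpha
    return np.where(sigma >= alpha, 1.0 / sigma, 0.0)

def spectral_reconstruction(f_delta, U, sigma, V, g_values):
    # discrete analogue of (3): R(f; g) = sum_n g(sigma_n) <f, v_n> u_n
    return U @ (g_values * (V.T @ f_delta))

# e.g., reconstruction = spectral_reconstruction(f_delta, U, sigma, V, g_tikhonov(sigma, 1e-2))
```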

In the specific case of the linear operator

$$\begin{aligned} A\!\!: L_2({\mathbb {R}}^2) \rightarrow L_2({\mathbb {R}}{\times } [0,\uppi ]) \end{aligned}$$

being the Radon operator defined on functions on the whole space (which, in contrast to integral operators on bounded domains, is not compact), we have to work with a continuous spectrum. The most commonly used reconstruction technique for inverting the Radon transform is filtered back-projection, which follows a very similar strategy to (3), but exploits the structure of the operator in a different way. By the central slice theorem, the inverse Radon transform applied to a range element \(f\in {{\mathcal {R}}}(A)\) is given by

$$\begin{aligned} A^{-1}f = A^*\left( {\mathcal {F}}\;^{-1}_{\text { 1-D}} \left( {\hat{\rho }} \cdot {\mathcal {F}}\!_{\text { 1-D}}{f }\right) \right) , \end{aligned}$$
(4)

where \(A^*\) denotes the adjoint operator of \(A\) (also called back-projection operator), \({\hat{\rho }}(r) = |r |\) is called the ramp-filter, \({\mathcal {F}}\!_{\text { 1-D}}\) is the one-dimensional Fourier transform with respect to the spatial offset variable of the Radon transform, and r (in the definition of \({\hat{\rho }}\)) is the variable resulting from the Fourier transform.

In analogy to the regularized version (3), the most common classical way to ensure a stable reconstruction is to replace (4) by

$$\begin{aligned} R(f^\delta ;\;\rho _\delta ) = A^*\left( {\mathcal {F}}\;^{-1}_{\text { 1-D}} \left( \rho _\delta \cdot {\mathcal {F}}\!_{\text { 1-D}}f^\delta \right) \right) \end{aligned}$$
(5)

with the filter \(\rho _\delta\) chosen to avoid the amplification of high-frequency components, i.e., of components with large \(|r|\). Common choices include the Hamming and ramp/Ram-Lak filters.
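For a discretized sinogram, the filtering step in (5) can be sketched with the FFT as below. This is only a schematic illustration: the back-projection \(A^*\) is assumed to be available separately (e.g., from a tomography toolbox and here denoted by a hypothetical `backproject`), and the Hamming-windowed ramp is just one common choice of \(\rho_\delta\).

```python
import numpy as np

def filter_sinogram(sinogram, filter_name="ramp"):
    # sinogram: array of shape (n_angles, n_detector)
    n = sinogram.shape[-1]
    freqs = np.fft.fftfreq(n)           # discrete frequency variable r
    rho = np.abs(freqs)                 # ramp filter |r|
    if filter_name == "hamming":
        # damp high frequencies with a Hamming window (one common regularizing choice)
        rho = rho * (0.54 + 0.46 * np.cos(np.pi * freqs / freqs.max()))
    f_hat = np.fft.fft(sinogram, axis=-1)           # 1-D FFT in the offset variable
    return np.fft.ifft(rho * f_hat, axis=-1).real   # filtered sinogram

# reconstruction = backproject(filter_sinogram(sinogram, "hamming"))  # A* assumed given
```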

In this work, we study (3) and (5) for functions \(g_\delta\) (respectively, \(\rho _\delta\)) that can be learned from data. In particular, we show that the shape of the optimal functions \(g_\delta\) and \(\rho _\delta\) can be characterized in closed form. We prove that the learned approaches result in convergent regularization methods under certain conditions, investigate their behavior under different discretizations, and conduct numerical experiments on CT reconstruction problems.

2 Related Work

Regularization methods for linear inverse problems in general, and manipulations of the singular values in particular, have long been studied in applied mathematics, cf. the classical reference [18] or the more recent overview [12]. Classical examples of (3) include Lavrentiev, Tikhonov, or truncated SVD regularization. Subsequently, a lot of research focused on nonlinear regularization techniques, such as variational methods or (inverse) scale space flows. More recently, researchers have focused on learning reconstruction schemes through neural networks, which tend to show significantly stronger practical performance but often lack theoretical guarantees, e.g., being convergent regularizations (independent of their discretization) with error bounds in suitable (problem-specific) metrics. We refer to the two overview papers [12, 18] for classical and nonlinear regularization theory and recall some machine-learning specific regularization approaches below.

The simplest way to benefit from data-driven approaches is to use pre- or post-processing networks, i.e., parameterized functions \({\mathcal {G}}\) that are applied either to the data \(f^\delta\) before exploiting a classical reconstruction technique, or to a preliminary reconstruction like (3). Common architectures in the area of image reconstruction problems are simple convolutional neural networks (CNNs) or multiscale approaches such as the celebrated U-Net architecture [20, 33]. Direct reconstructions (with different types of problem-specific information being accounted for) can, for instance, be found in [19, 26, 41]. Natural extensions of pre- and post-processing networks use \(A\) and its adjoint to switch between the measurement and reconstruction spaces with intermediate learnable operations in structures that are often motivated by classical iterative reconstruction/optimization methods such as gradient descent, forward-backward splitting, or primal-dual approaches, e.g., LEARN [15], FISTA-Net [38], or the learned primal-dual method [1]. Yet, without further restrictions, such methods do not allow one to prove error estimates or convergence results.

Coming from the perspective of regularization schemes based on variational methods, many approaches suggest learning the regularizer, starting from (sparsity-based) dictionary learning, e.g., [2, 27], over learning (convex and non-convex) regularizers motivated by sparsity penalties, e.g., [14, 21, 34], to schemes that merely learn descent directions [29] or operators provably converging to a global optimum [32] or playing the role of a proximal operator [28, 31], to reparametrizations of the unknown in unsupervised settings (e.g., the deep image prior [36]) or in a learned latent space of realistic reconstructions [13, 22]. In a similar fashion, deep equilibrium models [9] generalize the optimality conditions arising from learning regularizers (cf. [30]) and can, for instance, be combined with convex neural networks [5] for learning suitable regularizations. Yet, the above approaches are either not viewed in the light of infinite-dimensional problems, or are non-convex, such that standard regularization results do not apply or at least require the ability to compute global minimizers (as in the analysis of [25]) or strict assumptions like the tangential cone condition [7]. For a recent overview on data-driven approaches in the context of inverse problems, we refer the reader to [6].

Even if properties like convexity of the regularizer allow one to deduce convergence and/or error estimates, the complex parametrization of the above methods makes it highly difficult to understand how a data-driven approach extracts information from the training data, which quantities matter, and what properties a learned regularization has. Therefore, our approach in this paper is to extend our previous work [11] by studying the simple linear (singular-value based) reconstruction (3) as well as the application-specific learning of a filter in Radon inversion (5) to provide more insights into the properties of learned regularization techniques. While optimizing such spectral filters in a data-driven manner has been studied in the past (cf. [17]), to the best of our knowledge it has not been studied to what extent the obtained reconstruction operators fulfill the characteristics of classical regularization methods.

While a large number of different data-driven approaches have been tailored to CT-reconstruction (cf. [8, 10, 16, 39] for particular examples or [24, 37, 40] for surveys), they largely follow the above categories in terms of theoretical guarantees and an understanding of the underlying learning process.

Our framework is close to learning the optimal Tikhonov regularizer, which was extended to the infinite-dimensional case and studied with a focus on the generalization to unseen data in [3]. Although our approach is more restricted and requires knowledge of the forward operator and its singular value expansion, it allows for deriving reasonable assumptions that guarantee convergence in the no-noise limit.

3 Supervised Learning of Spectral Regularizations

In this section, we derive a closed-form solution for \(g_\delta (\sigma _n)\) in (3) when the expectation of the squared norm difference between \(u\) and the regularization applied to data \(f^\delta\) is minimized. Here \(u\) and \(f^\delta {= Au+ \nu }\) come from a distribution of training data. We then analyze the corresponding regularization operator and show that the operator is a convergent linear regularization.

3.1 Optimally Learned Spectral Regularization

In this section, we study the approach of learning the function \(g_\delta\) in (3) from a theoretical perspective. First of all, note that due to the assumption of A being compact (and hence having a discrete spectrum), only the evaluations of \(g_\delta\) at the \(\sigma _n\) matter, such that we will focus on the optimal values \(g_n:= g_\delta (\sigma _n)\) directly. For the sake of simplicity, we guarantee the well-definedness of \(g(\sigma _n) = g_n\) by assuming the singular values \(\sigma _n\) to be strictly decreasing, i.e., to have multiplicity \(\mu (\sigma _n) = 1\). Let us further assume that in the data formation process (1) the noise is drawn independently of \(u\) from a zero-mean noise distribution parameterized by the noise level \(\delta\).

For a fixed noise level \(\delta\), the most common way to learn parameters (i.e., the \(g_n\) in our case) is to minimize the expectation of a suitable loss, e.g., the squared norm, over the training distribution of pairs \((u, f^\delta )\) of the desired ground truth \(u\) and the noisy data \(f^\delta\). Thus, the ideal learned method is obtained by choosing

$$\begin{aligned} {\overline{g}} = \text {arg}\min _g {\mathbb {E}}{_{u,\nu }}( \Vert u- R(f^\delta ;\;g) \Vert ^2 ), \end{aligned}$$

where \({\mathbb {E}}_{u,\nu }\) denotes the expected value with respect to the joint distribution of the ground truth data and the noise. With the spectral decomposition (3) we have

$$\begin{aligned} \Vert u- R(f^\delta ;\;g) \Vert ^2&= {\Vert u_0 \Vert ^2 +} \sum _n ( (1- \sigma _n g_n)\langle u, u_n \rangle - g_n \langle \nu , v_n \rangle )^2 \\&= {\Vert u_0 \Vert ^2 +} \sum _n ( (1- \sigma _n g_n)^2 \langle u, u_n \rangle ^2 + g_n^2 \langle \nu , v_n \rangle ^2 \\&\quad\; {-} 2 (1- \sigma _n g_n)g_n \langle u, u_n \rangle \langle \nu , v_n \rangle ), \end{aligned}$$

where \(u_0 \in {\mathcal {N}}(A)\) denotes the orthogonal projection of \(u\) onto the nullspace of the operator. Due to the independence of \(u\) and \(\nu\) together with the zero mean of the noise we find

$$\begin{aligned} {\mathbb {E}}{_{u,\nu }}( \langle u, u_n \rangle \langle \nu , v_n \rangle ) = 0, \end{aligned}$$

hence

$$\begin{aligned} {\mathbb {E}}{_{u,\nu }}( \Vert u- R(f^\delta ;\;g) \Vert ^2 ) = {{\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) +} \sum _n ( (1- \sigma _n g_n)^2 \Pi _n + g_n^2 \Delta _n) \end{aligned}$$

with

$$\begin{aligned} \Pi _n:= {\mathbb {E}}{_{u}}( \langle u, u_n \rangle ^2), \qquad \Delta _n:= {\mathbb {E}}{_{\nu }}(\langle \nu , v_n \rangle ^2 ), \end{aligned}$$

where \({\mathbb {E}}_{u}\) and \({\mathbb {E}}_{\nu }\) denote the expected values with respect to the marginal distributions of the ground truth data, or the noise, respectively. Since the whole training only makes sense in the chosen spaces X and Y if the ground truth data u are indeed elements of X, we shall assume in the following without further notice that

$$\begin{aligned} \sum _n \Pi _n = \sum _n {\mathbb {E}}_{u}( \langle u, u_n \rangle ^2) = {\mathbb {E}}_{u}\left( \sum _n \langle u, u_n \rangle ^2\right) = {\mathbb {E}}_{u}(\Vert u\Vert ^2) < \infty . \end{aligned}$$

The above expression can be minimized for each \(g_n\) separately by solving a scalar quadratic minimization problem; setting the derivative \(-2\sigma _n(1-\sigma _n g_n)\Pi _n + 2 g_n \Delta _n\) with respect to \(g_n\) to zero yields the solution

$$\begin{aligned} {\overline{g}}_n = \frac{\sigma _n \Pi _n}{\sigma _n^2 \Pi _n + \Delta _n}. \end{aligned}$$
(6)

To obtain uniqueness of the solution, we assume that \(\Pi _n > 0\) or \(\Delta _n > 0\) for all \(n \in {\mathbb {N}}\) throughout this paper.
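A minimal sketch of the optimal coefficients (6), assuming the singular values and the variance sequences \(\Pi_n\), \(\Delta_n\) are given as NumPy arrays (variable names are illustrative):

```python
import numpy as np

def optimal_coefficients(sigma, Pi, Delta):
    # componentwise optimum (6): g_n = sigma_n Pi_n / (sigma_n^2 Pi_n + Delta_n);
    # equivalently a Tikhonov filter with adaptive parameter Delta_n / Pi_n
    denom = sigma**2 * Pi + Delta
    g = np.zeros_like(sigma)
    # the assumption Pi_n > 0 or Delta_n > 0 guarantees denom > 0; guard anyway
    np.divide(sigma * Pi, denom, out=g, where=denom > 0)
    return g
```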

Remark 1

The above result can be generalized to the case of singular values that are not strictly decreasing. Since \(\sigma _n\) converges to zero, each singular value has finite multiplicity, and we can rewrite the expansion in terms of strictly decreasing singular values \(\sigma _n\) with multiplicity \(\mu (\sigma _n) \geqslant 1\) and corresponding singular functions \(\{u_{nm}\}_{m=1}^{\mu (\sigma _n)}\) and \(\{v_{nm}\}_{m=1}^{\mu (\sigma _n)}\), i.e.,

$$\begin{aligned} Au = \sum _n \sigma _n \sum _{m = 1}^{\mu (\sigma _n)} \langle u, u_{nm}\rangle v_{nm} \qquad \text {and} \qquad R(f,g) = \sum _n g_n \sum _{m = 1}^{\mu (\sigma _n)} \langle f, v_{nm}\rangle u_{nm}. \end{aligned}$$

In this case, the optimal coefficients \(g_n\) are again given by (6), where we use the generalized variance coefficients

$$\begin{aligned} \Pi _n:= {\mathbb {E}}_u\left( \sum _{m = 1}^{\mu (\sigma _n)}\langle u, u_{nm} \rangle ^2\right) , \qquad \Delta _n:= {\mathbb {E}}_{\nu }\left( \sum _{m = 1}^{\mu (\sigma _n)}\langle \nu , v_{nm} \rangle ^2 \right) .\end{aligned}$$

We observe that for \(\Pi _n > 0\), (6) can be rewritten as

$$\begin{aligned} {\overline{g}}_n = \frac{\sigma _n}{\sigma _n^2+ \Delta _n/ \Pi _n} \end{aligned}$$

and therefore has the same form as the classical Tikhonov regularization

$$\begin{aligned} {\hat{g}}_n = \frac{\sigma _n}{\sigma _n^2 + \alpha } \end{aligned}$$

with the commonly fixed regularization parameter \(\alpha\) being replaced by a componentwise adaptive and data-driven term \(\Delta _n/\Pi _n\). We note that the restricted structure of the regularizer prevents it from approximating the nullspace component of the unknown. This is quantified by the term \({\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right)\) which is independent of the choice of g. For the generalization error this implies

$$\begin{aligned}&\min _g \, {\mathbb {E}}{_{u,\nu }}( \Vert u- R(f^\delta ;\;g) \Vert ^2 ) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) } \\ =& {} {\mathbb {E}}{_{u,\nu }}( \Vert u- R(f^\delta ;\;{\overline{g}}) \Vert ^2 ) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) }= \sum _n ( (1- \sigma _n {\overline{g}}_n)^2 \Pi _n + {\overline{g}}_n^2 \Delta _n) \\ =& \sum _n \left( \left( 1- \frac{\sigma _n^2}{\sigma _n^2 + \Delta _n/\Pi _n} \right) ^2 \Pi _n + \frac{\sigma _n^2}{(\sigma _n^2 + \Delta _n/\Pi _n)^2} \Delta _n \right) \\ = & \sum _n \left( \Pi _n + \frac{\sigma _n^2 \Delta _n + \sigma _n^4 \Pi _n - 2 \Pi _n \sigma _n^2 (\sigma _n^2 + \Delta _n/\Pi _n)}{(\sigma _n^2 + \Delta _n / \Pi _n)^2} \right) \\ = & \sum _n \left( \Pi _n - \frac{\sigma _n^2 \Delta _n + \sigma _n^4 \Pi _n}{(\sigma _n^2 + \Delta _n / \Pi _n)^2} \right) = \sum _n \frac{\Pi _n (\sigma _n^2 + \Delta _n / \Pi _n)^2 - \sigma _n^2 \Delta _n - \sigma _n^4 \Pi _n}{(\sigma _n^2 + \Delta _n / \Pi _n)^2} \\ = & \sum _n \frac{\sigma _n^2 \Delta _n + \Delta _n^2 / \Pi _n}{(\sigma _n^2 + \Delta _n / \Pi _n)^2} = \sum _n \frac{\Delta _n}{ \sigma _n^2 + \Delta _n / \Pi _n} = \sum _n \frac{\Delta _n \Pi _n}{ \sigma _n^2 \Pi _n + \Delta _n} \, , \end{aligned}$$

and hence,

$$\begin{aligned} \min _g \, {\mathbb {E}}{_{u,\nu }}( \Vert u- R(f^\delta ;\;g) \Vert ^2 ) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) } = \sum _n \frac{\Delta _n \Pi _n}{ \sigma _n^2 \Pi _n + \Delta _n} \, . \end{aligned}$$
(7)

In general we will need an assumption on the structure of the noise in the data set. First of all, we interpret the noise level \(\delta\) in a statistical sense as a bound on the componentwise standard deviation of the noise, formulated such that it includes white noise. This leads to

$$\begin{aligned} \delta ^2 = {\sup _{n \in {\mathbb {N}}}} \;\Delta _n, \end{aligned}$$
(8)

which we will assume throughout the paper without further notice. Note that when studying the zero noise limit \(\delta \rightarrow 0\) it is natural to interpret \(\Delta _n\) as depending on \(\delta\) and in these instances we will use the explicit notation \(\Delta _n(\delta )\). We will refer to the case of white noise as \(\Delta _n(\delta ) = \delta ^2\) for all \(n\in {\mathbb {N}}\).

Naturally, we expect the clean images to be smoother than the noise, which means that their coefficients decay faster in the respective singular basis. We thus formulate the following assumption, which is automatically true for white noise (\(\Delta _n = \delta ^2\) for all n), since \(\Pi _n\) goes to zero.

Assumption 1

For \(\delta > 0\), there exists \(n_0 \in {\mathbb {N}}\) such that the sequence \(\Delta _n/\Pi _n\) is well-defined for \(n \geqslant n_0\) and diverging to \(+\infty\).

For some purposes it will suffice to assume a weaker form, which only has a lower bound instead of divergence to infinity.

Assumption 2

There exists \(c > 0\) and \(n_0 \in {\mathbb {N}}\) such that \(\Delta _n(\delta ) \geqslant c \, \delta ^2 \, \Pi _n\) for every \(n \geqslant n_0\) and \(\delta > 0\).

With Assumption 2 we can conclude that the operator \(R(\cdot ;\;{\overline{g}})\) with \({\overline{g}}\) given by (6) is a linear and bounded operator for strictly positive \(\delta\).

Lemma 1

Let Assumption 2 be satisfied. Then \(R(\cdot ;\;{\overline{g}})\) with \({\overline{g}}\) as defined in (6) is a bounded linear operator for \(\delta > 0\).

Proof

By construction, the operator R is linear in its argument \(f^\delta\). We first note that \(\Pi _n = 0\) implies \({\overline{g}}_n= 0\). For \(\Pi _n > 0\) and \(n \geqslant n_0\), the condition \(\Delta _n \geqslant c \, \delta ^2 \, \Pi _n\) implies

$$\begin{aligned} {\overline{g}}_n = \frac{\sigma _n}{\sigma _n^2 + \Delta _n/\Pi _n} \leqslant \frac{1}{2 \sqrt{\Delta _n/\Pi _n}} \leqslant \frac{1}{2 \sqrt{c} \delta } \, , \end{aligned}$$

due to \(\sigma ^2_n + \Delta _n/\Pi _n \geqslant 2\sigma _n\sqrt{\Delta _n/\Pi _n}\). For \(n<n_0\) we have

$$\begin{aligned} {\overline{g}}_n \leqslant \frac{1}{\sigma _{n_0}}. \end{aligned}$$

Hence, we conclude

$$\begin{aligned} \Vert R(f^\delta , {\overline{g}}) \Vert \leqslant \frac{1}{\min \{ \sigma _{n_0},2 \sqrt{c} \delta \}}\Vert f^\delta \Vert, \qquad \forall f^\delta , \end{aligned}$$

which implies that the operator \(R(\cdot ;\;{\overline{g}})\) is a bounded linear operator for \(\delta > 0\).

Remark 2

In practical scenarios with finite (empirical) data \((u^i, f^i)_{i=1,\cdots , N}\) one typically minimizes the empirical risk

$$\begin{aligned} \frac{1}{N} \sum _{i=1}^N \Vert u^i - R(f^i; \;g) \Vert ^2 \end{aligned}$$

instead of the expectation, leading to coefficients

$$\begin{aligned} {\overline{g}}_n^N = \frac{\sigma _n \Pi _n^N {+} \Gamma _n^N}{\sigma _n^2 \Pi _n^N + \Delta _n^N + 2 \sigma _n \Gamma _n^N} \, \end{aligned}$$
(9)

with

$$\begin{aligned} \Pi _n^N:= \frac{1}{N} \sum _{i=1}^N \langle u^i, u_n \rangle ^2, \quad \Delta _n^N:=\frac{1}{N} \sum _{i=1}^N \langle \nu ^i, v_n \rangle ^2, \quad \Gamma _n^N:=\frac{1}{N} \sum _{i=1}^N \langle u^i, u_n \rangle ~\langle \nu ^i, v_n \rangle . \end{aligned}$$

Yet, a convergence analysis of the empirical risk minimization requires further (strong) assumptions to control \(\Gamma _n^N\), such that we limit ourselves to the analysis of the ideal case (6) in this work.
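A possible sketch of the empirical quantities in Remark 2 for a discretized operator: given matrices `U`, `V` whose columns approximate \(u_n\), \(v_n\), and arrays of training images and noise realizations, the coefficients (9) could be estimated as follows (all names are illustrative).

```python
import numpy as np

def empirical_coefficients(U, V, sigma, u_train, nu_train):
    # u_train: shape (N, dim_X), nu_train: shape (N, dim_Y); rows are samples
    a = u_train @ U                      # <u^i, u_n> for all samples i and indices n
    b = nu_train @ V                     # <nu^i, v_n>
    Pi_N = np.mean(a**2, axis=0)
    Delta_N = np.mean(b**2, axis=0)
    Gamma_N = np.mean(a * b, axis=0)
    # empirical optimum (9); assumes the denominator is nonzero for every n
    return (sigma * Pi_N + Gamma_N) / (sigma**2 * Pi_N + Delta_N + 2 * sigma * Gamma_N)
```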

3.2 Range Conditions

We start our analysis of the learned regularization method with an inspection of the range condition (cf. [12]), which for Tikhonov-type regularization methods equals the so-called source condition in classical regularization methods (cf. [18]). More precisely, we ask which elements \(u \in X\) can be obtained from the regularization operator \(R(\cdot ,{\overline{g}})\) for some data f, i.e., we characterize the range of \(R(\cdot ,{\overline{g}})\). Intuitively, one might expect that the range of R includes the set of training samples or, in a probabilistic setting, that the expected smoothness of elements in the range (with expectation over the noise and training images) equals the expected smoothness of the training images. However, as we shall see below, the expected smoothness of the reconstructions is typically higher.

We start with a characterization of the range condition, where we again denote the range of an operator by \({\mathcal {R}}\).

Proposition 1

Let \({\overline{g}}\) be given by (6). Then \(u \in {\mathcal {R}}(R(\cdot ,{\overline{g}}))\) if and only if

$$\begin{aligned} \sum _n \frac{\Delta _n^2}{\Pi _n^2 \sigma _n^2} \langle u , u_n \rangle ^2 < \infty . \end{aligned}$$
(10)

If Assumption 2 is satisfied we have in particular \(u \in {\mathcal {R}}(A^*)\) and in the case of white noise the range condition (10) holds if and only if

$$\begin{aligned} \sum _n \frac{\langle u , u_n \rangle ^2}{\Pi _n^2 \sigma _n^2} < \infty . \end{aligned}$$
(11)

Proof

Given the form of \({\overline{g}}\) we see that \(u \in {\mathcal {R}}(R(\cdot ,{\overline{g}}))\) if

$$\begin{aligned} \langle u,u_n \rangle = {\overline{g}}_n \langle f, v_n \rangle \end{aligned}$$

for some \(f \in Y\), which is further equivalent to \(\frac{1}{{\overline{g}}_n} \langle u,u_n \rangle \in \ell ^2({\mathbb {N}})\). We see that

$$\begin{aligned} \frac{1}{{\overline{g}}_n} = \sigma _n + \frac{\Delta _n}{\sigma _n\Pi _n} \end{aligned}$$

and due to the boundedness of \(\sigma _n\), we find \(\sigma _n \langle u,u_n \rangle \in \ell ^2({\mathbb {N}})\) for any \(u \in X\). Hence, the range condition is satisfied if and only if

$$\begin{aligned} \frac{\Delta _n}{\sigma _n\Pi _n} \langle u,u_n \rangle \in \ell ^2({\mathbb {N}}) \end{aligned}$$

holds, which is just (10). Under Assumption 2 we have \(\frac{\Delta _n}{\sigma _n\Pi _n}\geqslant \frac{c \, \delta ^2}{\sigma _n}\), thus

$$\begin{aligned} \sum _n \frac{1}{\sigma _n^2} \langle u, u_n \rangle ^2 < \infty . \end{aligned}$$

This implies \(u =A^*w\) with

$$\begin{aligned} w = \sum _n \frac{1}{\sigma _n} \langle u, u_n\rangle v_n \in Y. \end{aligned}$$

For white noise, we have \(\Delta _n = \delta ^2\), which implies that (10) is satisfied if and only if (11) is satisfied.

Let us mention that under Assumption 1 the range condition of the learned regularization method is actually stronger than the source condition \(u=A^*w\), since we even have

$$\begin{aligned} \sum _n \frac{\Delta _n^2}{\Pi _n^2} \langle w, v_n \rangle ^2 < \infty \end{aligned}$$

with the weight \(\frac{\Delta _n^2}{\Pi _n^2}\) diverging under Assumption 1. This indicates that the learned regularization might be highly smoothing or even oversmoothing, which depends on the roughness of the noise compared to the signal, i.e., the quotient \(\frac{\Delta _n}{\Pi _n}\).

Although the range of the reconstruction operator is still dense in the range of \(A^*\) if \(\Pi _n > 0\) for all \(n \in {\mathbb {N}}\), the oversmoothing effect of the learned spectral regularization method can be made clear by looking at the expected smoothness of the reconstructions, i.e.,

$$\begin{aligned} {\tilde{\Pi }}_n = {\mathbb {E}}_{u,\nu }(\langle R(Au+\nu ,{\overline{g}}) , u_n \rangle ^2 ) \, . \end{aligned}$$
(12)

A straight-forward computation yields

$$\begin{aligned} {\tilde{\Pi }}_n = \frac{\sigma _n^2 \Pi _n}{\sigma _n^2 \Pi _n + \Delta _n} \Pi _n = \frac{1}{1 + \frac{\Delta _n}{\sigma _n^2 \Pi _n}} \Pi _n. \end{aligned}$$
(13)

While we expect (at least on average) similar smoothness of the reconstructions as for the training images, i.e., \({\tilde{\Pi }}_n \approx \Pi _n\), we find under Assumption 2 that

$$\begin{aligned} {\tilde{\Pi }}_n \leqslant \frac{1}{1 + \frac{c \delta ^2}{\sigma _n^2 }} \Pi _n \approx \frac{\sigma _n^2}{c\delta ^2} \Pi _n \end{aligned}$$

for large n, i.e., \(\frac{{{\tilde{\Pi }}}_n}{\Pi _n}\) converges to zero at least as fast as \(\sigma _n^2\). The distribution of reconstructions thus has much more mass on smoother elements of X than the training distribution. This behavior seems to stem mainly from the noise in the data: although the method is supervised and tries to match all reconstructions to the training data, it needs to produce a linear operator that maps noise to suitable elements of X as well. Note that the behavior changes as \(\delta \rightarrow 0\). For small \(\frac{\Delta _n}{\sigma _n^2 \Pi _n}\), the denominator in (13) is approximately equal to one, hence

$$\begin{aligned} {{\tilde{\Pi }}}_n \approx \Pi _n, \end{aligned}$$

i.e., in the limit the right degree of smoothness will be restored.
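The relation (13) can be checked with a small Monte Carlo experiment for a single index n; the following sketch uses Gaussian coefficients and purely illustrative values of \(\sigma_n\), \(\Pi_n\), \(\Delta_n\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_n, Pi_n, Delta_n = 0.1, 1.0, 0.01        # illustrative values
g_n = sigma_n * Pi_n / (sigma_n**2 * Pi_n + Delta_n)

u_coeff = rng.normal(0.0, np.sqrt(Pi_n), 100_000)       # samples of <u, u_n>
nu_coeff = rng.normal(0.0, np.sqrt(Delta_n), 100_000)   # samples of <nu, v_n>
rec_coeff = g_n * (sigma_n * u_coeff + nu_coeff)        # <R(Au + nu; g_bar), u_n>

Pi_tilde_mc = np.mean(rec_coeff**2)
Pi_tilde_formula = sigma_n**2 * Pi_n / (sigma_n**2 * Pi_n + Delta_n) * Pi_n
print(Pi_tilde_mc, Pi_tilde_formula)   # both approximately 0.5 for these values
```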

Another perspective on the expected smoothness is to investigate the bias of the obtained reconstruction operator or, since we cannot expect to reconstruct nullspace elements, the bias outside the nullspace,

$$\begin{aligned} \bar{e}_0 := {\mathbb {E}}_u\left( \Vert u - R(Au,\,{\overline{g}})\Vert ^2 \right)&- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) . \end{aligned}$$

Inserting the definition of \({\overline{g}}\), we derive that for any \(N \in {\mathbb {N}}\)

$$\begin{aligned} \bar{e}_0 = \sum _n \frac{1}{\left( \frac{\sigma _n^2 \Pi _n}{\Delta _n} + 1\right) ^2}\Pi _n \leqslant \frac{\delta ^2}{\min _{\begin{subarray}{c} n \leqslant N \\ \Pi _n> 0 \end{subarray}} (\sigma _n^2 \Pi _n)}\sum _{n \leqslant N} \Pi _n + \sum _{n > N} \Pi _n, \end{aligned}$$

which yields the convergence of the bias to zero as \(\delta \rightarrow 0\). We see that, just like the speed of convergence of the variance coefficients \(\tilde{\Pi }_n\), the speed of convergence of the bias is determined by how fast the factors \(\frac{\Delta _n}{\sigma _n^2 \Pi _n}\) converge to zero.

3.3 Convergence of the Regularization

In the following we will investigate the convergence properties of the learned spectral regularization as the noise level \(\delta\) tends to zero. To this end, we write \(\Delta _n(\delta )\) and \({\overline{g}}(\delta )\) here to make explicit that we consider a sequence of noise levels converging to zero.

We shall naturally consider the convergence in the mean-squared error, i.e.,

$$\begin{aligned} e(u,\delta ) := {\mathbb {E}}_\nu (\Vert u- R(f^\delta ;\;{\overline{g}}{(\delta )})\Vert ^2) .\end{aligned}$$
(14)

This equals the squared 2-Wasserstein distance (cf. [4, Sect. 2.1]) between the Dirac measure concentrated at u and the distribution generated by the regularization applied to random noisy data; convergence can thus be understood as a concentration of measure in the zero-noise limit.

In addition to the convergence of the regularization for single elements \(u \in X\) we can also look at the mean-squared error over the whole distribution of given images and noise, i.e.,

$$\begin{aligned} \begin{aligned} {\overline{e}}(\delta )&:= {\mathbb {E}}_u(e(u,\delta ) ) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) } \\&\;= {\mathbb {E}}_{u,\nu }(\Vert u- R(f^\delta ;\;{\overline{g}}{(\delta )})\Vert ^2) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) ,} \, \end{aligned} \end{aligned}$$
(15)

which is identical to the quantity in (7), with the dependence of \(\Delta _n\) on \(\delta\) made explicit.

With the following theorem we verify that \(e(u, \delta )\) as defined in (14) and \({\overline{e}}(\delta )\) as defined in (15) indeed converge to zero as \(\delta \rightarrow 0\).

Theorem 1

For \(u \in {\mathcal {N}}(A)^\perp\), let \(\langle u,u_n\rangle = 0\) for all indices n with \(\Pi _n = 0\). Then \(e(u,\delta ) \rightarrow 0\) as \(\delta \rightarrow 0\). Moreover, we also have \({\overline{e}}(\delta ) \rightarrow 0\) for \(\delta \rightarrow 0\).

Proof

We start by computing

$$\begin{aligned} \Vert u- R(f^\delta ; \;{\overline{g}}{(\delta )}) \Vert ^2&= \sum _{n: \Pi _n > 0} \left[ (1 - \sigma _n {\overline{g}}_n{(\delta )})^2 \langle u, u_n \rangle ^2 + {\overline{g}}_n{(\delta )}^2 \langle \nu , v_n \rangle ^2 \right. \\&\quad \left. -\, 2(1 - \sigma _n {\overline{g}}_n{(\delta )}){\overline{g}}_n{(\delta )} \langle u, u_n \rangle \langle \nu , v_n \rangle \right] \,. \end{aligned}$$

Taking expectations with respect to the noise yields

$$\begin{aligned} {\mathbb {E}}_\nu (\Vert u- R(f^\delta ;\; {\overline{g}}{(\delta )}) \Vert ^2)&= \sum _{\begin{subarray}{c} n, \\ {\Pi _n> 0} \end{subarray}} \left[ (1 - \sigma _n {\overline{g}}_n{(\delta )})^2 \langle u, u_n \rangle ^2 + {\overline{g}}_n{(\delta )}^2 \Delta _n{(\delta )} \right] \\&= \sum _{\begin{subarray}{c} n, \\ {\Pi _n> 0} \end{subarray}} \frac{\Delta _n{(\delta )}^2 \langle u, u_n \rangle ^2 + \sigma _n^2 \Pi _n^2 \Delta _n{(\delta )}}{(\sigma _n^2 \Pi _n + \Delta _n{(\delta )})^2} \\&= \sum _{\begin{subarray}{c} n, \\ {\Pi _n> 0} \end{subarray}} \left( \frac{\Delta _n{(\delta )}}{\sigma _n^2 \Pi _n + \Delta _n{(\delta )}}\right) ^2 \langle u, u_n \rangle ^2 + \frac{\sigma _n^2\Delta _n {(\delta )}}{(\sigma _n^2 + \frac{\Delta _n{(\delta )}}{\Pi _n})^2} \\&\leqslant \sum _{\begin{subarray}{c} n \leqslant N, \\ {\Pi _n> 0} \end{subarray}} \frac{\delta ^2}{\sigma _n^2 \Pi _n + \Delta _n{(\delta )}}\langle u, u_n \rangle ^2 + \sum _{\begin{subarray}{c} n> N, \\ {\Pi _n> 0} \end{subarray}} \langle u, u_n \rangle ^2 \\&\quad+\sum _{\begin{subarray}{c} n \leqslant N, \\ {\Pi _n> 0} \end{subarray}} \frac{\sigma _n^2 \delta ^2}{\sigma _n^4} + \sum _{n> N} \Pi _n\\&\quad\leqslant \delta ^2 \left( \frac{\Vert u \Vert ^2}{{\min _{\begin{subarray}{c} n \leqslant N, \\ \Pi _n> 0 \end{subarray}} \sigma _n^2 \Pi _n}} + \frac{N}{\sigma _N^2} \right) + \sum _{n>N} (\langle u, u_n \rangle ^2 + \Pi _n). \end{aligned}$$

Now let \(\epsilon > 0\) be arbitrary. Due to the summability of \(\langle u, u_n \rangle ^2 + \Pi _n\) we can choose N large enough such that

$$\begin{aligned} \sum _{n>N} (\langle u, u_n \rangle ^2 + \Pi _n) < \frac{\epsilon }{2}. \end{aligned}$$

Now we can choose

$$\begin{aligned} \delta < \sqrt{\frac{\epsilon }{2}} \left( \frac{\Vert u \Vert ^2}{\min _{\scriptsize{\begin{array}{c} n \leqslant N, \\ \Pi _n > 0 \end{array}}} \sigma _n^2 \Pi _n} + \frac{N}{\sigma _N^2} \right) ^{-1/2} \end{aligned}$$

to obtain

$$\begin{aligned} {\mathbb {E}}_\nu (\Vert u- R(f^\delta ;\; {\overline{g}}) \Vert ^2) < \epsilon \end{aligned}$$

for \(\delta\) sufficiently small, which implies the convergence to zero.

In the same way we can estimate

$$\begin{aligned} {\mathbb {E}}_{u,\nu }( \Vert u- R(f^\delta ;\;{\overline{g}}) \Vert ^2 ) {- {\mathbb {E}}_u\left( \Vert u_0 \Vert ^2\right) }&= \sum _n \frac{\Delta _n \Pi _n}{ \sigma _n^2 \Pi _n + \Delta _n} \, , \\&\leqslant {N} \frac{\delta ^2}{\sigma _N^2} + \sum _{n > N} \Pi _n \, , \end{aligned}$$

where the first equality follows from (7), and obtain the convergence by the same argument.
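The convergence of \({\overline{e}}(\delta)\) can be illustrated numerically by evaluating the closed-form expression (7) for white noise and algebraically decaying \(\sigma_n\) and \(\Pi_n\); the decay rates below are purely illustrative and the series is truncated at a finite index.

```python
import numpy as np

n = np.arange(1, 10_001)
sigma = n ** -1.0          # illustrative decay of the singular values
Pi = n ** -2.0             # illustrative decay of the smoothness coefficients

def mean_error(delta):
    Delta = delta**2 * np.ones_like(Pi)                   # white noise
    return np.sum(Delta * Pi / (sigma**2 * Pi + Delta))   # truncated series (7)

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(delta, mean_error(delta))    # decreases towards zero as delta -> 0
```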

4 Learned Radon Filters

In this section, we show the correspondence between data-driven regularization of the inverse Radon transform and filtered back-projection. To this end, we analyze the properties of an optimal filter for the regularized filtered back-projection (5), i.e., we want to find a function \({\overline{\rho }}\!: {\mathbb {R}} \times [0,\uppi ] \rightarrow {\mathbb {C}}\) such that

$$\begin{aligned} {\overline{\rho }} = \text {arg}\min _\rho {\mathbb {E}} {_{u, \nu }}( \Vert u- R(f^\delta ;\;\rho ) \Vert ^2 ). \end{aligned}$$
(16)

The function \(\rho\) now plays the role of the learned spectral coefficients. In particular, \(\rho (r, \theta ) = |r |\) corresponds to an inversion, i.e., \(g_n = \frac{1}{\sigma _n}\), and \(\rho (r, \theta ) = 1\) yields the adjoint operator, which corresponds to \(g_n = {\sigma _n}\). In analogy to the derivation of the optimal coefficients in Sect. 3, we first rewrite the \(L^2\)-error between a single \(u\in L^2({\mathbb {R}}^2)\) and the corresponding regularized reconstruction from the measurement \(f^\delta \in L^2({\mathbb {R}}{\times } [0,\uppi ])\) as follows (using the isometry property of the Fourier transform):

$$\begin{aligned} \Vert u- R(f^\delta ;\;\rho ) \Vert ^2&= \int _{{\mathbb {R}}^2} \vert {\mathcal {F}}\!_{\text {2-D}}u(\xi ) - {\mathcal {F}}\!_{\text { 2-D}}R(f^\delta ;\;\rho )(\xi )\vert ^2 \, \mathop {}\!\!\textrm{d}\xi \\&= \int _0^{\uppi } \int _{{\mathbb {R}}} \vert r\vert \,\vert {\mathcal {F}}\!_{\text { 2-D}}u(r {\varvec {\theta }}) - {\mathcal {F}}\!_{\text { 2-D}}R(f^\delta ;\;\rho )(r{\varvec {\theta } })\vert ^2 \,\mathop {}\!\!\textrm{d}r\textrm{d}{\varvec {\theta }}\\&= \int _0^{\uppi } \int _{{\mathbb {R}}} \vert r\vert \,\left| {\mathcal {F}}\!_{\text { 1-D}}Au(r,\theta ) - \frac{\rho (r,\theta )}{\vert r \vert }\,{\mathcal {F}}\!_{\text { 1-D}}f^\delta (r,\theta )\right| ^2 \,\mathop {}\!\!\textrm{d}r\mathop {}\!\!\textrm{d}{\varvec {\theta }}. \end{aligned}$$

In the last line, we use the central slice theorem to get

$$\begin{aligned} {\mathcal {F}}\!_{\text { 2-D}}u(r{\varvec {\theta }}) = {\mathcal {F}}\!_{\text { 1-D}}Au(r,\theta ) \qquad \text {and} \qquad {\mathcal {F}}\!_{\text { 2-D}}R(f^\delta ;\;\rho )(r{\varvec {\theta }}) =\frac{\rho (r,\theta )}{\vert r \vert }\,{\mathcal {F}}\!_{\text { 1-D}}f^\delta (r,\theta ), \end{aligned}$$

where the latter equality holds since the range of the Radon transform is dense in \(L^2({\mathbb {R}} \times [0,\uppi ])\) and thus

$$\begin{aligned} AR(f^\delta ;\;\rho ) = {\mathcal {F}}\;^{-1}_{\text { 1-D}}\left( \frac{\rho }{\vert \cdot \vert }\, \cdot {\mathcal {F}}\!_{\text { 1-D}}f^\delta \right) . \end{aligned}$$

Note that for the reconstruction operator to be well-defined and continuous on \(L^2\), the filter function \(\rho\) has to be essentially bounded. With the same arguments used in Sect. 3 we derive

$$\begin{aligned} {\mathbb {E}}( \Vert u- R(f^\delta ;\;\rho ) \Vert ^2 ) = \int _0^{\uppi } \int _{{\mathbb {R}}} |r|\left( |1-\psi (r,\theta )|^2 \, \Pi (r, \theta ) + |\psi (r,\theta )|^2 \, \Delta (r,\theta )\right) \mathop {}\!\textrm{d}r \textrm{d}\theta \end{aligned}$$

with \(\psi (r,\theta ) = \frac{\rho (r,\theta )}{\vert r \vert }\),

$$\begin{aligned} \Pi (r, \theta ) = {\mathbb {E}}\left( \left|{\mathcal {F}}\!_{\text { 1-D}} Au(r,\theta )\right|^2\right), \qquad \text {and} \qquad \Delta (r, \theta ) = {\mathbb {E}}\left( \left|{\mathcal {F}}\!_{\text { 1-D}} \nu (r,\theta )\right|^2\right) . \end{aligned}$$

By the above formula we can restrict \(\psi\) and thus \(\rho\) to be real-valued, as adding an imaginary component would always increase the expected value. Accordingly, the optimal filter is real-valued and given by

$$\begin{aligned} {\overline{\rho }}(r,\theta ) = \vert r \vert \,\frac{\Pi (r,\theta )}{\Pi (r,\theta ) + \Delta (r, \theta )}. \end{aligned}$$

To fulfill the boundedness of \({\overline{\rho }}\) we assume that

$$\begin{aligned} \frac{\Delta (r,\theta )}{\Pi (r,\theta ) |r|} \longrightarrow \infty , \end{aligned}$$

as \(|r|\longrightarrow \infty\) for every \(\theta \in [0,\uppi ]\), which can be recognized as the natural condition that the ground-truth data are on average smoother than the noise. The expected error obtained with the optimal filter is then given by

$$\begin{aligned} {\mathbb {E}}( \Vert u- R(f^\delta ;\;{\overline{\rho }}) \Vert ^2 )= \int _0^{\uppi } \int _{{\mathbb {R}}} |r|\, \Pi (r, \theta ) \left( 1- \frac{\Pi (r, \theta )}{\Pi (r, \theta ) + \Delta (r,\theta )}\right) \mathop {}\!\textrm{d}r \mathop {}\!\!\textrm{d}\theta . \end{aligned}$$

Using the central slice theorem again we can further determine the expected smoothness \(\tilde{\Pi }(r,\theta ):= {\mathbb {E}}\left( \left|{\mathcal {F}}\!_{\text { 1-D}}AR(f^\delta ;\;{\overline{\rho }})(r,\theta )\right|^2\right)\) of the reconstructions as

$$\begin{aligned} \tilde{\Pi }(r,\theta ) = \left( \frac{{\overline{\rho }}(r,\theta )}{\vert r \vert }\right) ^2\,{\mathbb {E}}\left( \left|{\mathcal {F}}\!_{\text { 1-D}} f^\delta (r,\theta )\right|^2\right) = \frac{\Pi (r,\theta )}{\Pi (r,\theta ) + \Delta (r,\theta )} \Pi (r,\theta ). \end{aligned}$$

This reveals the analog of the smoothing effect of the spectral reconstruction discussed in Sect. 3.2.

Remark 3

Similar to the spectral regularization, optimal filters can also be computed for a finite set of training data. Rewriting the empirical risk for input-output pairs \(\{u^i, f^{\delta ,i}\}_{i = 1}^N\) as

$$\begin{aligned} \frac{1}{N} \sum _{i = 1}^N \Vert u^i- R(f^{\delta ,i};\;\rho ) \Vert ^2&=\int _0^{\uppi } \int _{{\mathbb {R}}} |r|\left( |1-\psi (r,\theta )|^2 \, \Pi ^N(r, \theta ) + |\psi (r,\theta )|^2 \, \Delta ^N(r,\theta ) \right. \\ {}&\quad \left. -\, 2(1-\psi (r,\theta ))\psi (r,\theta )\Gamma ^N(r,\theta ) \right) \mathop {}\!\textrm{d}r \mathop {}\!\!\textrm{d}\theta . \end{aligned}$$

Computing the optimality condition for \(\psi\) then yields

$$\begin{aligned} {\overline{\psi }}^N = \frac{\Pi ^N{+}\Gamma ^N}{\Pi ^N+\Delta ^N+2\Gamma ^N} \end{aligned}$$
(17)

with

$$\begin{aligned} \Pi ^N(r,\theta ) = \frac{1}{N} \sum _{i = 1}^N\vert {\mathcal {F}}\!_{\text { 1-D}} Au_i(r,\theta ) \vert ^2, \qquad \Delta ^N(r,\theta ) = \frac{1}{N} \sum _{i = 1}^N\vert {\mathcal {F}}\!_{\text { 1-D} } \nu _i(r,\theta ) \vert ^2, \end{aligned}$$

and

$$\begin{aligned} \Gamma ^N(r,\theta ) = \frac{1}{{2}N} \sum _{i = 1}^N \left( {\mathcal {F}}\!_{\text { 1-D}} Au_i(r,\theta )\,\overline{{\mathcal {F}}\!_{\text { 1-D}}\nu _i(r, \theta )} + \overline{{\mathcal {F}}\!_{\text { 1-D}} Au_i(r,\theta )}\,{\mathcal {F}}\!_{\text { 1-D}}\nu _i(r, \theta )\right) . \end{aligned}$$

5 Numerical Experiments with the Discretized Radon Transform

In this section, we support the theoretical results of the previous sections with numerical experiments.

5.1 Datasets

The following experiments are performed on a synthetic dataset and the LoDoPaB-CT dataset (cf. [23]). The synthetic dataset contains 32 000 images of random ellipses supported on a circle contained completely in the image domain (see Fig. 4 for an example). Each ellipse has a random width and height, and is roto-translated by a random offset and angle. All ellipses are completely contained within the image borders, and two or more ellipses may overlap. We split the data into \(64\%\) training, \(16\%\) validation, and \(20\%\) test data. The LoDoPaB-CT dataset contains over 40 000 human chest scans and is split into 35 820 training, 3 522 validation, and 3 553 test images. For both datasets, we simulate sinograms as described in the following section and model noisy data by adding Gaussian noise with zero mean and variance \(\delta ^2\). In “Appendix A” we additionally report results for uniformly distributed noise.

5.2 Discretization and Optimization

We perform numerical experiments with the Radon transform for discrete data based on the spline-0 pixel model, where we assume the image functions \(u\in L^2({\mathbb {R}}^2)\) to be of the form

$$\begin{aligned} u(x,y) = \sum _{i,j = 0}^{I-1} u^{(ij)} \,\beta _0\left( I\left( x-\frac{i}{I}\right) \right) \,\beta _0\left( I\left( y-\frac{j}{I}\right) \right) \quad \text {for a.e.}\; (x,y) \in {\mathbb {R}}^2, \end{aligned}$$

where \(I \in {\mathbb {N}}\) denotes the number of pixels in each direction, \(u^{(ij)} \in {\mathbb {R}}\) for each \(i,j < I\), and the 0-spline \(\beta _0\) is defined as

$$\begin{aligned} \beta _0(t) = {\left\{ \begin{array}{ll} 1, &{} \text {if } |t |\leqslant \frac{1}{2},\\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Accordingly we have \(u^{(ij)} = u\left( \frac{i}{I}, \frac{j}{I}\right)\), so with a slight abuse of notation we write \(u \in {\mathbb {R}}^{I\times I}\). Reformulating the Radon transform for this particular choice of input functions yields a finite-dimensional linear operator (see, e.g., [35] for a detailed derivation), which can thus be represented by a matrix A. For sufficiently small dimensions, the finite set of vectors \(u_n\) and \(v_n\) is given by the SVD of A, more precisely \(A = V\Sigma U^{\rm{T}}\), where \(u_n\) and \(v_n\) are the column vectors of the orthogonal matrices U and V and \(\Sigma\) is a diagonal matrix containing the singular values in non-increasing order. Since the considered problem has finite dimensions, Assumption 2 is always fulfilled. We still see a faster decay of \(\Pi ^N\) compared to \(\Delta ^N\), as depicted in Fig. 1.

Fig. 1

Comparison of coefficients \(\langle u, u_n\rangle\) and \(\langle \nu , v_n\rangle\) for \(n = 5\), \(200\), and \(4\,000\) for \(64 \times 64\) ground truth images, \(93 \times 256\) sinograms, and noise level \({\delta } = 0.005\). While the variance of the data coefficients decreases with increasing n, the variance of the noise coefficients remains stable

Since the computation of the SVD has high memory requirements and the matrix multiplication with U and \(V^{\rm{T}}\) is computationally expensive, this approach is only feasible for low resolution data. Using the fact that V is orthogonal, we see the connection to the continuous optimization (16) by rewriting the regularized inversion operator as

$$\begin{aligned} R(f^\delta ;\;{\overline{g}}^N) = A^{\rm{T}}\,\left( V \,\frac{{\overline{g}}^N}{\text {diag}(\Sigma )} \,V^{\rm{T}}\right) f^\delta , \end{aligned}$$
(18)

where the division is to be interpreted element-wise, and where \(\text {diag}(\Sigma )\) returns the main diagonal of \(\Sigma\).
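A minimal NumPy sketch of the reconstruction (18), assuming the matrix A and a coefficient vector \({\overline{g}}^N\) are given (in practice the SVD would be precomputed once and reused):

```python
import numpy as np

def svd_reconstruction(A, f_delta, g_bar):
    # np.linalg.svd returns A = W diag(s) Zt; W corresponds to V and Zt.T to U in the text
    V, s, Ut = np.linalg.svd(A, full_matrices=False)
    # (18): R(f; g) = A^T ( V diag(g / sigma) V^T ) f
    return A.T @ (V @ ((g_bar / s) * (V.T @ f_delta)))
```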

As we will analyze in more detail below, the discretization of the Radon transform leads to (4) no longer holding exactly when \({\mathcal {F}}\) denotes the discrete Fourier transform. Thus, we will compare four different approaches in our numerical results.

  i)

    The spectral regularization (3) with coefficients computed according to the analytical solution (9).

  ii)

    The spectral regularization (3) with coefficients optimized in a standard machine learning setting, i.e., initializing the parameters with zeros, partitioning the data into batches of 32 instances, and then using the PyTorch implementation of Adam (with default parameters except for a learning rate of 0.1) for optimization. Ideally, these results should be identical to i).

  iii)

    The learned filtered back-projection (5) with coefficients computed according to the analytical solution (17), knowing that the resulting coefficients might not be optimal due to the aforementioned discretization error in (5). To partially compensate for this, we replaced the discretization of the ramp filter \(\vert r\vert\) by filter coefficients optimized on clean data. For all non-zero noise levels, we kept these pseudo-ramp filter coefficients fixed to the no-noise baseline.

  iv)

    The learned filtered back-projection (5) with coefficients that optimize

    $$\begin{aligned} {\overline{\rho }}^N = \text {arg}\min _{\rho } \frac{1}{N} \sum _{i=1}^N\Bigg \Vert u^i - A^{\rm T}\left( F^{-1}\left( \rho \cdot F f^{\delta ,i}\right) \right) \Bigg \Vert ^2, \end{aligned}$$

    where F denotes the discrete Fourier transform. In keeping with the results from Sect. 4, we only consider real-valued \(\rho\). The optimization is done in the same way as described above for the SVD (see the sketch after this list). We point out that for rotation-invariant distributions, i.e., cases where the noise and data distributions are constant in the direction of the angles, the filter \(\rho\) can be chosen constant in the direction of the angles as well. By this, the number of necessary parameters is equal to the number of detector positions. We further note that due to the inevitable discretization errors, we cannot expect perfect reconstructions, not even for noise-free measurements.
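A sketch of the data-driven optimization in ii) and iv) could look as follows in PyTorch; the tensor `AT` representing \(A^{\rm T}\), the data loader, and the restriction to an angle-independent real-valued filter are assumptions of this illustration, not a fixed implementation.

```python
import torch

def learn_fft_filter(AT, train_loader, n_detector, n_epochs=20):
    # AT: tensor of shape (n_pixels, n_angles * n_detector); train_loader yields (u, f_delta)
    rho = torch.zeros(n_detector, requires_grad=True)   # zero initialization
    opt = torch.optim.Adam([rho], lr=0.1)
    for _ in range(n_epochs):
        for u, f_delta in train_loader:                 # batches of 32 pairs
            F = torch.fft.fft(f_delta, dim=-1)          # FFT along the detector axis
            filtered = torch.fft.ifft(rho * F, dim=-1).real
            rec = (AT @ filtered.reshape(len(u), -1).T).T
            loss = ((rec - u.reshape(len(u), -1)) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return rho.detach()
```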

We refer to Approaches ii) and iv) as learned and to Approaches i) and iii) as analytic and compare their behavior below. Although the discrete Radon transform A is an approximation of the continuous Radon transform (denoted by \({\hat{A}}\) in the following), several scaling factors have to be taken into account to relate the discrete regularization to a continuous regularization. We collect these factors in the following remark.

Remark 4

For data supported on the unit square, we consider discretizations \(u \in {\mathbb {R}}^{I\times I}\) and \(v \in {\mathbb {R}}^{K\times L}\) of functions \({\hat{u}}\) and \({\hat{v}}\). Here I denotes the number of pixels in each direction, K denotes the number of equispaced angles in \([0,\uppi ]\), and L denotes the number of equispaced positions in \(\left[ -\frac{\sqrt{2}}{2},\frac{\sqrt{2}}{2}\right]\). Then, the following approximations hold for

  (i)

    the Radon transform and its adjoint:

    $$\begin{aligned} {\hat{A}}{\hat{u}} \approx Au \quad \text {and}\quad {\hat{A}}^*{\hat{v}} \approx \frac{\sqrt{2}\uppi }{KL} A^{\rm{T}} v, \end{aligned}$$
  (ii)

    the SVD:

    $$\begin{aligned} {\hat{\sigma }}_n \approx \sqrt{\frac{\sqrt{2}\uppi }{KL}}\,\sigma _n,\qquad {\hat{u}}_n \approx I \,u_n, \quad \text {and}\quad {\hat{v}}_n \approx \sqrt{\frac{KL}{\sqrt{2}\uppi }}\, v_n, \end{aligned}$$

    where \({\hat{u}}_n\), \({\hat{v}}_n\) are normalized with respect to the \(L^2\)-norm, and \(u_n\), \(v_n\) are normalized with respect to the Euclidean norm,

  (iii)

    the data-driven (co-)variance terms:

    $$\begin{aligned} {\hat{\Pi }}_n = \frac{\Pi _n}{I^2},\qquad {\hat{\Delta }}_n = \frac{\sqrt{2}\uppi }{KL}\,\Delta _n,\quad \text {and}\quad {\hat{\Gamma }}_n = \sqrt{\frac{\sqrt{2}\uppi }{KL}}\,\frac{\Gamma _n}{I}. \end{aligned}$$

5.3 Results

In the following, we present several numerical results for the approaches introduced in the previous section.

5.3.1 Learned vs. Analytic

The loss as well as the PSNR- and SSIM-values obtained by the different approaches on the test dataset are documented in Table 1. Based on these results, we deduce that the SVD-based regularization performs better than the FFT-based regularization. In the SVD case we see that the results obtained by the analytic coefficients are still slightly better than the ones obtained by the learned coefficients. The opposite is true in the FFT case, where the analytic solution performs worse than the learned one. This is to be expected due to the discretization errors made in the computation of the analytic optimum. Figure 2 shows the loss curves for both data-driven optimization approaches as well as the loss obtained by the analytically optimal filters. As expected, for the SVD-approach the learned loss converges to the analytic loss. On the other hand, the loss approximated with the analytic formula for the Fast-Fourier-approach is higher than the actual data-driven loss, which may again indicate that the optimization balances out the discretization errors. This gap between the continuous and discrete optima can also be seen for higher-resolution data, as the comparison of the analytic filters to the learned filters in Fig. 3 shows. Based on these insights, we only show the results for analytic SVD-coefficients and learned FFT-filters in the following.

Table 1 Performance of the different approaches on test data for \(64 \times 64\) ground truth images, \(93 \times 256\) sinograms, and different noise levels \({\delta }\)
Fig. 2

Training loss curves for both optimization approaches and different noise levels \({\delta }\) for \(64 \times 64\) ground truth images and \(93 \times 256\) sinograms. The scale on the right shows the loss obtained by the analytically computed filters

Fig. 3

Comparison of analytic and learned Fast-Fourier-Filters for \(256 \times 256\) ground truth images and \(365\times 256\) sinograms (values for negative r are omitted due to symmetry)

We further see that the losses of the two approaches become more similar the higher the noise level is, which is also visible in the reconstructions (see Fig. 4). Here, we compare our two approaches to naïve reconstruction with filtered back-projection (FBP) and to Tikhonov reconstruction, where the regularization parameter is chosen individually for each example via the discrepancy principle. The reconstruction examples also reveal the predicted oversmoothing effect for both approaches.

Fig. 4

Comparison of the ellipse reconstructions obtained by different approaches for different levels of additive Gaussian noise. The first row (\(\delta = 0\)) shows the reconstructions from the sinograms before adding noise

5.3.2 Generalization to Different Resolutions

Since the operator we use in our experiments is discretized, the optimal coefficients (or filters, respectively) depend on the resolution of the operator and the corresponding data. However, both approaches offer ways to transfer coefficients optimized for a certain resolution to another resolution. One approach to generalize the learned SVD-based coefficients would be to resort to the classic linear regularization formulation (3) and retrieve a function \(g(\sigma )\) by inter-/extrapolating the optimal values \(g(\sigma _n) = {\overline{g}}_n\). A generalization of the FFT-filters can be obtained in a similar way by interpolation of \({\overline{\rho }}(\theta , r)\) in the direction of the angles \(\theta\) and extrapolation in the direction of the frequencies r. Comparing the optimal coefficients obtained for different resolutions after rescaling them in accordance with Remark 4 (see Fig. 5 for SVD-coefficients and Fig. 6 for FFT-filters), we see that all curves behave similarly, but there are obvious differences in the magnitude and location of the peak for different resolutions. While the optimal coefficients seem to converge in the domain of small singular values (or low frequencies, respectively), up to the resolutions we were able to compute, there is no obvious convergence visible in the domain of large singular values for the SVD-approach or in the domain of high frequencies for the FFT-approach. Note that since high singular values correspond to low frequencies, this reveals an opposite behavior of the two approaches. Inspecting the curves for \({\hat{\Pi }}\), we see that they are almost identical for high resolutions (see Fig. 5). Therefore, a convergence of the optimal spectral coefficients for higher resolutions seems plausible.
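A minimal sketch of the inter-/extrapolation described above for the SVD-based coefficients, using the rescaled singular values of Remark 4 as abscissae (constant extrapolation outside the covered range is one possible choice):

```python
import numpy as np

def transfer_coefficients(sigma_src, g_src, sigma_tgt):
    # interpolate the learned filter function g(sigma) at the singular values of the
    # target resolution; np.interp requires increasing abscissae and extrapolates
    # by the boundary values outside the source range
    order = np.argsort(sigma_src)
    return np.interp(sigma_tgt, sigma_src[order], g_src[order])
```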

Fig. 5

Comparison of the optimal spectral coefficients for different resolutions of the input data with noise level \({\delta } = 0.005\). All sinograms have a spatial resolution of one ray per pixel and an angular resolution of 256 angles

Fig. 6

Comparison of the optimal FFT-filters for different angular and frequency resolutions for \(256\times 256\) ground truth images with noise level \({\delta } = 0.005\). Since the filters are not constant in the direction of the angles, multiple slightly different curves correspond to one filter. The filters for higher spatial resolutions are cut off at frequency \(r = 183\)

5.4 Generalization to Different Data Distributions

To evaluate how well the learned regularizers generalize between different datasets, we show empirical results on the LoDoPaB-CT dataset in this section. Figures 7, 8, and 9 compare the results obtained with a regularizer trained on the LoDoPaB-CT dataset with the ones obtained with a regularizer trained on the synthetic ellipse dataset. While both approaches clearly improve the reconstructions compared to the filtered back-projection, the effect of optimizing the regularizers on the correct dataset is still clearly visible.

Fig. 7

Example image from the LoDoPaB-CT dataset and the reconstructions of an uncorrupted sinogram by the different methods

Fig. 8

Comparison of LoDoPaB reconstructions obtained by SVD-approach, trained on different datasets, for different levels of additive Gaussian noise

Fig. 9

Comparison of LoDoPaB reconstructions obtained by FFT-approach, trained on different datasets, for different levels of additive Gaussian noise

6 Conclusions and Outlook

In this work, we studied the behavior of data-driven linear regularization of inverse problems. For problems emerging from compact linear forward operators, we showed that the optimal linear regularization can be computed analytically with the singular value expansion of the operator. An analysis of the range conditions revealed an oversmoothing effect depending on the noise level. We further confirmed that the optimal regularization is convergent under suitable assumptions. By deriving analogous results for the Radon transform, a specific non-compact linear operator, we could establish a connection to the well-known approach of filtered back-projection. Numerical experiments with a discretized (and thus compact) version of the Radon transform verified our theoretical findings for compact operators. We additionally proposed a computationally more efficient discretization of the filtered back-projection using the FFT, which allows for finer discretizations of the operator. However, the filters obtained by this approach turned out to strongly deviate from the theoretical results for the continuous case. Especially in the regime of low noise, the discretization errors that come with this approach caused a substantial loss of accuracy compared to the SVD-approach. Additionally, we empirically evaluated the convergence behavior of both regularization approaches for the discrete Radon transform with increasingly fine discretizations.

For the future, we would like to better understand the discretization error that arises from the discretization of the Radon transform and its impact on the computed filters. It would also be interesting to study more general nonlinear data-driven regularization methods and to find formulations and criteria for which one can prove that these methods are convergent regularizations, similar to the analysis carried out in Sect. 3.3.