Data driven regularization by projection

We demonstrate that regularisation by projection and variational regularisation can be formulated in a purely data driven setting when the forward operator is given only through training data. We study convergence and stability of the regularised solutions. Our results also demonstrate that the role of the amount of training data is twofold. In regularisation by projection, the amount of training data plays the role of a regularisation parameter and needs to be chosen depending on the amount of noise in the measurements. In this case using more data than allowed by the measurement noise can decrease the reconstruction quality. In variational regularisation, however, the amount of training data controls the approximation error of the forward operator and hence more training data always results in better reconstructions.


Introduction
Inverse problems are concerned with the reconstruction of an unknown quantity u ∈ U from indirect measurements y ∈ Y, which are related by a forward model A : U → Y. The forward operator A describes the physics of data acquisition and may involve, for instance, integral transforms (such as the Radon transform, see for instance [1,2]) or partial differential equations (PDEs) (see for instance [3]). In this paper we consider linear inverse problems.
Until recently, the methods for solving inverse problems were model driven, meaning that the physics and chemistry of the measurement acquisition process were represented as precisely as possible by the forward operator A (for some relevant applications see [1,4,5]). Nowadays, with the rise of the area of big data, methods that combine forward modelling with data driven techniques are being developed [6]. Some of these techniques build upon the similarity between deep neural networks and classical approaches to inverse problems such as iterative regularisation [7,8] and proximal methods [9]. Some are based on postprocessing of the reconstructions obtained by a simple inversion technique such as filtered backprojection [10]. Others use data driven regularisers in the context of variational regularisation [11,12] or use deep learning to learn a component of the solution in the null space of the forward operator [13,14].
On the other hand, methods that discard forward modelling (that is, modelling of the operator A) have emerged. They are appealing since they do not require knowledge of the physics of the data acquisition process, bypass costly forward model evaluations and often yield results of superior visual quality. However, it has been demonstrated that for ill-posed inverse problems naïve applications of such methods can be unstable with respect to small perturbations in the measurement data [15,16]. Moreover, there is currently no theory for purely data driven regularisation in inverse problems, i.e. a theory for the setting in which the forward operator is given only via training pairs {u_i, y_i} such that

A u_i = y_i. (1.1)

In this paper we make a first step towards an analysis of purely data driven regularisation by utilising its similarity to the concept of regularisation by projection. We demonstrate that regularisation by projection [17,18] and variational regularisation [19] can be formulated in a data driven setting, and the usual results such as convergence and stability can be obtained. A connection between regularisation by projection and machine learning has been observed by several authors, for instance in the context of statistical learning [20,21], where it was demonstrated that subsampling plays the role of regularisation by (random) projections.
The paper is organised as follows. In Section 2 we present a formal problem statement and introduce our notation. In Section 3 we study regularisation by projection. We derive sufficient conditions on the training data and the measurement data under which projections in the space of images result in a regularisation. On the other hand, if no adequate prior conditions on the training data are known, we show in Section 3.5 that we can use a data driven variational regularisation method that is convergent with respect to the amount of training data. In Section 4 we analyse projections in the space of training data, which always define a regularisation, but we demonstrate that such a method needs training data not for the forward operator itself, but for its adjoint. For both projection methods, the amount of training data plays the role of a regularisation parameter that controls the stability of the inversion process. Increasing the amount of training data beyond the amount consistent with the measurement error therefore compromises the stability of the inversion and renders it ill-posed. This observation may shed some light on the counter-intuitive phenomenon that using more training data in data driven inversion methods can in fact deteriorate their performance [15]. Our theoretical analysis is complemented by numerical experiments in Sections 3.4.1, 3.5.1 and 4.1, based on the Radon transform.

Main Assumptions
We consider a standard linear inverse problem

A u = y, (2.1)

where U and Y are separable Hilbert spaces and A : U → Y is a bounded linear operator. We assume that A is injective, but that its inverse A^{-1} is unbounded (hence the problem of solving (2.1) is ill-posed). Instead of the exact (ideal) data y we are given noisy measurement data y^δ satisfying

‖y − y^δ‖ ≤ δ. (2.2)

Throughout this paper, the solution of (2.1) with exact data will be denoted by u^†. Different from the standard setting, we assume that we do not have (numerical) access to the operator A. This happens, for instance, if the modelling of the forward operator is incomplete or uncertain, or if its numerical evaluation is costly. Instead we assume that we are given a collection of training pairs as in (1.1).
The goal of this paper is to develop stable algorithms for solving (2.1) without explicit knowledge of the forward operator A, having access only to
• noisy measurement data y^δ approximating y, and
• training pairs (1.1).
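As an illustration of this setting, the following toy sketch (entirely hypothetical: the operator, sizes and names are ours; the experiments in the paper use the Radon transform instead) generates training pairs and a noisy measurement with a discretised smoothing operator; the operator itself is then treated as unavailable to the reconstruction methods:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50  # discretisation level (hypothetical)

# Stand-in for an ill-posed forward operator: a Gaussian blur matrix
# with rapidly decaying singular values.
x = np.linspace(0.0, 1.0, m)
A = np.exp(-50.0 * (x[:, None] - x[None, :]) ** 2)
A /= A.sum(axis=1, keepdims=True)

# Training pairs {u_i, y_i} with A u_i = y_i, cf. (1.1). The operator A is
# used only to generate them and is assumed unavailable afterwards.
n_train = 30
U_train = rng.standard_normal((m, n_train))  # columns u_i
Y_train = A @ U_train                        # columns y_i = A u_i

# Noisy measurement y_delta of an unseen image u_dagger, cf. (2.2).
u_dagger = np.sin(3 * np.pi * x)
delta = 1e-3
y_delta = A @ u_dagger + delta * rng.standard_normal(m)
```

The reconstruction methods discussed below consume only U_train, Y_train and y_delta.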
Throughout the paper, we refer to elements of U as images and to elements of Y as data. The spaces U and Y are referred to as image and data space, respectively. The results we obtain are by no means restricted to the imaging context, but we think that this is a convenient terminology.
We make the following assumptions on the training pairs throughout this paper.

Assumption 1 (Training pairs). The training images {u_i}_{i∈N} are linearly independent and uniformly bounded, i.e. there exists a constant C_u > 0 such that ‖u_i‖ ≤ C_u for all i ∈ N.

We write U_n := span{u_1, ..., u_n} and Y_n := span{y_1, ..., y_n} = A U_n, and denote by P_n and Q_n the orthogonal projectors onto U_n and Y_n, respectively. We also need to make an assumption that the training data are sufficiently rich, in the sense that the collection of all training images is dense in U.
Assumption 2 (Density). We assume that the union of the training image spaces is dense in U, that is, the closure of ⋃_{n∈N} U_n equals U.
As a consequence of the previous assumptions, we have P_n u → u for every u ∈ U as n → ∞, and likewise Q_n y → y for every y ∈ R(A).

Regularisation by Projection
In this section we assume that y ∈ R(A). Regularisation by projection consists in approximating the solution u^† of (2.1) by the minimum norm solution of the projected equation

A P_n u = y. (3.1)

The minimum norm solution of this equation is unique and is given by

u_n^U = (A P_n)^† y, (3.2)

where (A P_n)^† denotes the Moore-Penrose inverse of A P_n (see [22]). The projection takes place in the image space U, hence the superscript in our notation u_n^U. The next result shows that there exists a simple expression for u_n^U.

Theorem 4. The Moore-Penrose inverse of A P_n is given by

(A P_n)^† = A^{-1} Q_n.

Proof. First we observe that R(A^{-1} Q_n) = U_n = (N(A P_n))^⊥. To see this, note that for u = A^{-1} Q_n y we have by definition A u = Q_n y ∈ Y_n = A U_n, and hence, by injectivity of A, u ∈ U_n. Conversely, any u ∈ U_n can be represented in the form u = A^{-1} Q_n y for some y ∈ Y (take y = A u). Thus the first identity is shown. Now, let z ∈ N(A P_n). Since A is injective, this is equivalent to z ∈ N(P_n) = U_n^⊥, and therefore (N(A P_n))^⊥ = U_n. We directly verify the Moore-Penrose equations [22] for A^{-1} Q_n. Since the Moore-Penrose equations uniquely characterise the Moore-Penrose inverse, the assertion follows.
The counterexample by Seidman [17] demonstrates that, in general, the minimum norm solution of (3.1) does not converge to the exact solution u^† as n → ∞, i.e. A^{-1} Q_n is not a regularisation of A^{-1}. To overcome this issue, we proceed in two different ways. One is to assume sufficient conditions on the training pairs that ensure convergence of the minimum norm solutions of (3.1). The other is to regularise the problem (3.1) using variational regularisation.

Gram-Schmidt orthogonalisation in the data space
The application of Gram-Schmidt orthogonalisation to the training data {y_i}_{i=1,...,n} gives an orthonormal basis {ȳ_i}_{i=1,...,n} of Y_n. By solving A ū_i = ȳ_i for i = 1, ..., n, we obtain, in general, a non-orthogonal basis {ū_i}_{i=1,...,n} of U_n. In matrix form we can write

Y_n = Ȳ_n R_n and Ū_n = U_n R_n^{-1}, (3.7)

where Y_n, Ȳ_n, U_n and Ū_n denote the matrices composed of the basis vectors {y_i}, {ȳ_i}, {u_i} and {ū_i}, respectively, and R_n is the upper triangular n × n matrix with entries (R_n)_{ij} = (y_j, ȳ_i) for i ≤ j. Using this orthonormal basis of Y_n, we can represent every element y ∈ Y_n as y = Σ_{i=1}^n (y, ȳ_i) ȳ_i, hence the minimum norm solution of (3.1) is given by

u_n^U = A^{-1} Q_n y = Σ_{i=1}^n (y, ȳ_i) ū_i. (3.8)

Taking the limit n → ∞ and applying the Gram-Schmidt process to the sequence {y_i}_{i∈N}, we obtain an orthonormal basis of R(A).
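A minimal numerical sketch of this construction (function names are ours): modified Gram-Schmidt yields Ȳ_n and R_n, and since Ū_n = U_n R_n^{-1}, the reconstruction (3.8) can be evaluated from the training pairs alone, with no access to A or A^{-1}:

```python
import numpy as np

def mgs(Y):
    """Modified Gram-Schmidt QR: Y = Ybar @ R with orthonormal columns Ybar."""
    m, n = Y.shape
    Ybar = Y.astype(float).copy()
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(Ybar[:, i])
        Ybar[:, i] /= R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Ybar[:, i] @ Ybar[:, j]
            Ybar[:, j] -= R[i, j] * Ybar[:, i]
    return Ybar, R

def reconstruct_image_space(U_train, Y_train, y):
    """Regularisation by projection in the image space, eq. (3.8):
    u_n = sum_i (y, ybar_i) ubar_i, using Ubar = U_train @ inv(R)."""
    Ybar, R = mgs(Y_train)
    coeffs = Ybar.T @ y                      # (y, ybar_i)
    # The ubar_i come from the training images alone via Ubar = U R^{-1},
    # so neither A nor A^{-1} is ever evaluated.
    return U_train @ np.linalg.solve(R, coeffs)
```

For data y ∈ Y_n with u^† ∈ U_n the reconstruction is exact, which gives a quick sanity check of an implementation.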

Remark 5. It is easy to verify that the Gram-Schmidt transformed images {ū_i}_{i∈N} satisfy

ū_i = (u_i − A^{-1} Q_{i−1} y_i) / ‖y_i − Q_{i−1} y_i‖.

Lemma 6. Under Assumption 1, if A is compact, then lim inf_{n→∞} ‖y_n − Q_{n−1} y_n‖ = 0.

Proof. Since by Assumption 1 all training images {u_i}_{i∈N} are uniformly bounded and A is compact, the sequence {y_i}_{i∈N} has a convergent subsequence (that we do not relabel), which, in particular, satisfies ‖y_n − y_{n−1}‖ → 0. Then we obtain the estimate

‖y_n − Q_{n−1} y_n‖ = min_{z∈Y_{n−1}} ‖y_n − z‖ ≤ ‖y_n − y_{n−1}‖ → 0,

which proves the assertion.
In the following sections we provide sufficient conditions on the training data and the exact solution to get the convergence of the minimum norm solution u U n to u † .

Weak convergence
In this section we derive sufficient conditions for the partial sums (3.8) to be convergent (as n → ∞). For this purpose we use the expansion (3.3) of Q_n y ∈ Y_n with respect to the training data {y_i}_{i=1,...,n},

Q_n y = Σ_{i=1}^n λ_i^{(n)} y_i. (3.3)

In order to guarantee boundedness of (3.8), we make the following assumption.

Assumption 3. There exists a constant C_λ > 0 such that the coefficients in the expansion (3.3) satisfy

Σ_{i=1}^n |λ_i^{(n)}| ≤ C_λ for all n ∈ N.

Theorem 7. Under Assumptions 1 and 3 the approximations (3.8) are uniformly bounded with respect to n. Moreover, they converge weakly to the exact solution u^† (up to a subsequence).
Proof. Applying the inverse A^{-1} to (3.3) and using the fact that the norms of the training images {u_i}_{i∈N} are uniformly bounded (Assumption 1), we get

‖u_n^U‖ = ‖A^{-1} Q_n y‖ = ‖Σ_{i=1}^n λ_i^{(n)} u_i‖ ≤ C_u Σ_{i=1}^n |λ_i^{(n)}| ≤ C_u C_λ, (3.9)

where in the last inequality Assumption 3 has been applied. Therefore, since the sequence {u_n^U} is uniformly bounded, it contains a weakly convergent subsequence (that we do not relabel) such that u_n^U ⇀ u. Since A is bounded, we get A u_n^U ⇀ A u. On the other hand, applying A to (3.8), we get A u_n^U = Q_n y → y by Assumption 2, hence A u = y, and since A is injective this proves the assertion.

Strong convergence
A sufficient condition for strong convergence of the partial sums (3.8) is that the series

Σ_{i=1}^∞ |(y, ȳ_i)| ‖ū_i‖ (3.11)

is convergent. To ensure this, we make an assumption similar to Assumption 3. For this purpose, for fixed n, we expand Q_{n−1} y_n with respect to the basis {y_i}_{i=1,...,n−1},

Q_{n−1} y_n = Σ_{i=1}^{n−1} α_i^{(n)} y_i. (3.12)

Assumption 4. There exists a constant C_α > 0 such that the expansion coefficients in (3.12) satisfy

Σ_{i=1}^{n−1} |α_i^{(n)}| ≤ C_α for all n ∈ N.

Proposition 8. Under Assumptions 1 and 4 the following estimate holds:

‖ū_n‖ ≤ C_u (1 + C_α) / ‖y_n − Q_{n−1} y_n‖. (3.13)

Proof. We have

ū_n = A^{-1} ȳ_n = A^{-1} (y_n − Q_{n−1} y_n) / ‖y_n − Q_{n−1} y_n‖ = (u_n − Σ_{i=1}^{n−1} α_i^{(n)} u_i) / ‖y_n − Q_{n−1} y_n‖.

The estimate (3.13) then follows from (3.9).
Under Assumptions 1 and 4, ‖ū_n‖ is bounded from above by C / ‖y_n − Q_{n−1} y_n‖, where C is a constant independent of n. Therefore, if we can control the decay of the coefficients |(y, ȳ_i)|, we can guarantee convergence of the series (3.11) and hence convergence of (3.8). We make the following assumption.

Assumption 5. The exact data y = A u^† satisfy

Σ_{i=1}^∞ |(y, ȳ_i)| / ‖y_i − Q_{i−1} y_i‖ < ∞.

Remark 9. Since |(y, ȳ_i)| is just the norm of the projection of y onto the span of ȳ_i, we can write the condition in Assumption 5 as follows:

Σ_{i=1}^∞ ‖(Q_i − Q_{i−1}) y‖ / ‖(Q_i − Q_{i−1}) y_i‖ < ∞.

Therefore, Assumption 5 requires that the exact data y are well approximated by the sequence of training data {y_i}_{i∈N}, i.e. ‖(Q_i − Q_{i−1}) y‖ decays sufficiently fast, while the training data themselves retain some diversity, in the sense that ‖(Q_i − Q_{i−1}) y_i‖ does not decay too fast.

Theorem 10. Under Assumptions 4 and 5 the approximations (3.8) converge strongly to u^†, that is, u_n^U → u^†.

Proof. This is a direct consequence of Proposition 8 and the sufficient condition (3.11).
Example 11. In the following example we interpret Assumptions 3, 4 and 5 in the case when idealistic training pairs are available, i.e. they are built from the eigensystem of the compact operator A:

u_i = v_i, y_i = A v_i = σ_i z_i,

where {σ_i; v_i, z_i}_{i∈N} denotes the singular system of A (see [18]). Consider the condition

((u^†, u_i))_{i∈N} ∈ ℓ¹. (3.14)

Weak convergence: We can write

Q_n y = Σ_{i=1}^n (y, z_i) z_i = Σ_{i=1}^n σ_i (u^†, v_i) z_i = Σ_{i=1}^n (u^†, u_i) y_i,

so that λ_i^{(n)} = (u^†, u_i). Consequently, if (3.14) holds, then u_n^U is uniformly bounded and weakly convergent according to Theorem 7.

Strong convergence: Because Q_{n−1} y_n = 0, Assumption 4 is trivially satisfied. Moreover, because ȳ_i = z_i and ‖y_i − Q_{i−1} y_i‖ = ‖y_i‖ = σ_i ‖z_i‖ = σ_i, it follows from (3.14) that Assumption 5 is again satisfied if ((u^†, u_i))_{i∈N} ∈ ℓ¹.
We note that, in this setting, we get both the weak and strong convergence under the same condition.
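Example 11 can be checked numerically. In the sketch below (synthetic singular system; all names are ours), the training pairs are built from the right singular vectors of a constructed operator, and the reconstruction (3.8) reduces to a truncated singular value expansion of u^†:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
# Hypothetical compact operator built from a prescribed singular system:
# A = Z diag(sigma) V^T with rapidly decaying singular values.
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
Z, _ = np.linalg.qr(rng.standard_normal((m, m)))
sigma = 2.0 ** -np.arange(m)
A = Z @ np.diag(sigma) @ V.T

u_dagger = V @ (1.0 / (1 + np.arange(m)) ** 2)  # ell^1 coefficients, cf. (3.14)
y = A @ u_dagger

# Idealistic training pairs: u_i = v_i, y_i = sigma_i z_i.
n = 10
U_train, Y_train = V[:, :n], A @ V[:, :n]

# Here ybar_i = z_i and ubar_i = v_i / sigma_i, so (3.8) becomes the
# truncated singular value expansion of u_dagger.
Zbar = Y_train / np.linalg.norm(Y_train, axis=0)
u_n = V[:, :n] @ (Zbar.T @ y / sigma[:n])
u_trunc = V[:, :n] @ (V[:, :n].T @ u_dagger)
```

Here u_n coincides with u_trunc up to rounding, illustrating that with singular-vector training data the projection method reproduces truncated SVD.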

Noisy data
If we are given noisy data y^δ = y + ∆ with ∆ ∈ R(A) and ‖∆‖ ≤ δ, we cannot expect convergence as n → ∞ due to the unboundedness of the inverse A^{-1}. Instead, with u_n^{U,δ} := A^{-1} Q_n y^δ, we observe the typical semi-convergence behaviour: the error splits as

‖u_n^{U,δ} − u^†‖ ≤ ‖A^{-1} Q_n y − u^†‖ + ‖A^{-1} Q_n ∆‖. (3.15)

While the first term converges to zero under the assumptions of Theorem 10, the second term explodes. The amount of training pairs n therefore plays the role of a regularisation parameter that balances the influence of the two error terms in (3.15).
To find a suitable parameter choice rule n = n(δ), we need to understand the growth of A −1 Qn∆ as n → ∞. Since R(APn) = Yn is closed, the Moore-Penrose inverse (APn) † = A −1 Qn is bounded [18], but its norm grows with n.

Theorem 12. Under Assumptions 1 and 4 the following estimate holds:

‖A^{-1} Q_n‖ ≤ C √n sup_{i=1,...,n} ‖y_i − Q_{i−1} y_i‖^{-1},

where C is a constant independent of n.

Proof. Take an arbitrary y ∈ Y with ‖y‖ = 1. Then from (3.8) and (3.13) it follows that

‖A^{-1} Q_n y‖ ≤ Σ_{i=1}^n |(y, ȳ_i)| ‖ū_i‖ ≤ C sup_{i=1,...,n} ‖y_i − Q_{i−1} y_i‖^{-1} Σ_{i=1}^n |(y, ȳ_i)| ≤ C √n sup_{i=1,...,n} ‖y_i − Q_{i−1} y_i‖^{-1},

where the last inequality follows from Cauchy-Schwarz.

Remark 13. It can be easily verified that the diagonal elements of the matrix R_n in (3.7) are equal to ‖y_i − Q_{i−1} y_i‖, i = 1, ..., n. Since R_n is upper triangular, its eigenvalues are given by its diagonal entries. Hence, sup_{i=1,...,n} ‖y_i − Q_{i−1} y_i‖^{-1} is just the inverse of the smallest eigenvalue of R_n.

Theorem 14. Let the assumptions of Theorem 10 hold and let the number of training pairs n = n(δ) be chosen such that n(δ) → ∞ and

δ √n(δ) sup_{i=1,...,n(δ)} ‖y_i − Q_{i−1} y_i‖^{-1} → 0 as δ → 0.

Then u_{n(δ)}^{U,δ} → u^† as δ → 0.
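Remark 13 makes the bound of Theorem 12 computable from the QR factorisation of the training data. The sketch below (function name ours) evaluates the resulting stability indicator, mirroring the quantity c(n) used in the experiments of Section 3.4.1:

```python
import numpy as np

def stability_indicator(Y_train):
    """c(n) = sqrt(n) / min_i |R_ii|: a computable proxy for the bound on
    ||A^{-1} Q_n|| in Theorem 12, read off from the QR factors of the
    training data (Remark 13). Larger values indicate a less stable inversion."""
    _, R = np.linalg.qr(Y_train)
    n = Y_train.shape[1]
    return np.sqrt(n) / np.min(np.abs(np.diag(R)))
```

Since the QR factorisation is computed during "training" anyway, this indicator comes essentially for free.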
Remark 15. The condition in Theorem 14 agrees with the necessary and sufficient condition for convergence of arbitrary linear regularisations, see [18].
Remark 16. Since ‖A^{-1} Q_{n(δ)}‖ depends on ‖y_i − Q_{i−1} y_i‖ for i = 1, ..., n(δ), the parameter choice rule in Theorem 14 depends not only on the amount of training data n, but also on the training data themselves.

Numerical experiments
We take U = L 2 (Ω) to be the space of images supported on a bounded Lipschitz domain Ω ⊂ R 2 . For all numerical experiments in this paper, we take the Radon transform [1] as the forward operator in (2.1). We use images from "The 10k US Adult Faces Database" [23] (to which we refer as "Faces"), downsampled to 64 × 64 pixels. To produce Radon data, we use the Matlab implementation of the Radon transform. We use 89 projections uniformly distributed on the interval [0, 2π). The size of the Radon data is 95 × 89 pixels. Sample images from the "Faces" dataset along with their Radon transforms are shown in Figure 1.
We examine the reconstruction quality of (3.8) for clean (δ = 0) and noisy (δ = 0.01) measurement data for n = 1000, 2000 and 3000 training pairs. To obtain the orthonormal basis {ȳ_i}_{i=1,...,n} from Subsection 3.1, we use the modified Gram-Schmidt algorithm [24]. We expect that for clean data the reconstructions will improve as n grows, while for noisy data they should start deteriorating after a certain amount of training data is reached.
The parameter choice rule in Theorem 14 requires that c(n) := √n (min_{i=1,...,n} |(R_n)_{ii}|)^{-1} is small, where R_n is the matrix in (3.7). By examining this quantity, we can roughly assess whether the reconstruction is stable. The values of c(n) for n = 1000, 2000 and 3000 are c(1000) ≈ 0.07, c(2000) ≈ 0.3 and c(3000) ≈ 3. We expect the reconstruction to be stable for n = 1000 and n = 2000, while for n = 3000 we expect instability for noisy data. The results presented in Figure 2 confirm this intuition. For n = 1000 (Figures 2a-2b) the reconstruction from noisy data is only marginally worse than that from clean data. The reconstruction error is relatively small (around 5-6%) in both cases.
For n = 2000 reconstructions from clean data improve (the reconstruction error is about 3%). Reconstructions from noisy data do not improve compared to n = 1000, but are still stable (Figures 2c-2d). Increasing the size of the training set further to n = 3000 (Figures 2e-2f) makes the reconstructions from noisy data unstable while those from clean data remain stable.
From the computational point of view, once the "training", i.e. Gram-Schmidt orthogonalisation, is complete, computing the reconstruction (3.8) amounts to just matrix-vector products, which can be done very efficiently.

Variational regularisation
If the validity of the assumptions in Sections 3.2-3.4 cannot be guaranteed, explicit regularisation can be used to ensure stability of the inversion. We use variational regularisation [19], which, as we demonstrate, can also be formulated in our purely data driven setting.
If we apply the Gram-Schmidt process to the training images {u_i}_{i=1,...,n} defined in (1.1), we get an orthonormal basis of U_n, which we call {ũ_i}_{i=1,...,n}. Then we define ỹ_i := A ũ_i. As every element u ∈ U_n can be represented as u = Σ_{i=1}^n (u, ũ_i) ũ_i, we trivially find that

A P_n u = Σ_{i=1}^n (u, ũ_i) ỹ_i for every u ∈ U. (3.16)

We consider the following optimisation problem:

min_{u∈U} ½ ‖A P_n u − y^δ‖² + α J(u), (3.17)

where J is a regularisation functional and α is a regularisation parameter. If we let α → 0 (for fixed n and δ), we get back to (3.1). As the amount of training data n grows, we obtain a sequence of operators {A P_n}_{n∈N} that approximates the forward operator A. Indeed, for any u ∈ U,

‖A P_n u − A u‖ ≤ ‖A‖ ‖P_n u − u‖ → 0 as n → ∞,

i.e. {A P_n}_{n∈N} approximates A pointwise. If A is compact, this also implies approximation in the operator norm [25]. Therefore, our situation is similar to [26], where approximations of the forward operator in the operator norm were considered in the context of discretisation. Convergence of regularised solutions was established there and convergence rates were obtained. Using these results, we can conclude that (3.17) defines a valid regularisation, which means that we are able to reproduce variational regularisation in a purely data driven setting with no direct access to the forward operator A.
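To make (3.17) concrete, here is a minimal numerical sketch (function and variable names are ours, not from the paper). It assembles the data driven operator A P_n from the training pairs alone and, for simplicity, uses the quadratic regulariser J(u) = ½‖u‖² instead of the TV functional employed in Section 3.5.1, so that the problem reduces to a ridge system in the coefficients of the orthonormalised training images:

```python
import numpy as np

def reconstruct_variational(U_train, Y_train, y_delta, alpha):
    """Data driven Tikhonov sketch of (3.17) with J(u) = 0.5*||u||^2
    (a quadratic stand-in for the TV regulariser used in the paper)."""
    # Orthonormalise the training images; apply the same change of basis
    # to the data, so that A maps the columns of Utilde to those of Ytilde
    # without A ever being evaluated: A Utilde = Y_train @ inv(R).
    Utilde, R = np.linalg.qr(U_train)
    Ytilde = np.linalg.solve(R.T, Y_train.T).T
    n = U_train.shape[1]
    # For this quadratic J the minimiser of (3.17) lies in U_n, and its
    # coefficients c solve the ridge system (Yt^T Yt + alpha I) c = Yt^T y.
    G = Ytilde.T @ Ytilde + alpha * np.eye(n)
    c = np.linalg.solve(G, Ytilde.T @ y_delta)
    return Utilde @ c
```

As α → 0 with consistent data in the training span, the reconstruction recovers the projected minimum norm solution, consistent with the remark after (3.17).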
We make the following standard assumptions on the regularisation functional.

Assumption 6. The regularisation functional J : U → R ∪ {+∞} is proper, convex and weakly lower semicontinuous.

Figure 2: Reconstruction of "Faces" images from clean and noisy (1% noise) data using direct inversion; the final panels show reconstructions using 3000 training images from clean and from noisy data. As the size of the training set grows, reconstructions from clean data improve, while those from noisy data demonstrate semiconvergence behaviour, becoming unstable for large n.
In modern variational regularisation, (generalised) Bregman distances are typically used to study convergence of approximate solutions [27]. We briefly recall the definition.
Definition 18. For a proper convex functional J, the generalised Bregman distance between u, v ∈ U corresponding to the subgradient p ∈ ∂J(v) is defined as follows:

D_J^p(u, v) := J(u) − J(v) − (p, u − v).

Here ∂J(v) denotes the subdifferential of J at v ∈ U.
To obtain convergence rates, an additional assumption on the regularity of the exact solution, called the source condition, needs to be made [18]. Several variants of the source condition exist (e.g., [18,19,27]); we use the following one [28].

Assumption 7 (Source condition). There exists an element q ∈ Y such that A* q ∈ ∂J(u^†).
Theorem 19. Suppose that Assumptions 6 and 7 are satisfied. Then the following estimate for the Bregman distance between u n J and u † corresponding to the subgradient A * q from Assumption 7 holds for some constant C > 0.
If the regularisation parameter α = α(δ, n) is chosen as in Theorem 17 then For the particular choice we obtain the following estimate The proof of this result is very similar to the proof in [26], which in turn is based on the proof in [29] in the Hilbert space setting.
Remark 20. The fact that the speed of convergence depends on how well the basis approximates the exact solution is not surprising. However, from Theorem 19 we conclude that it is equally important for the convergence rate that the subgradient A* q from the source condition is well approximated by the basis {ũ_i}_{i∈N}.

Numerical experiments
We conclude this section with numerical experiments that demonstrate the performance of the approach (3.17) in the same setting as in Section 3.4.1. For natural images that contain sharp edges, such as those in the "Faces" dataset, Total Variation (TV) [30] is a natural choice of the regulariser J, although more sophisticated regularisers could also be used. We solve the following problem:

min_{u∈U} ½ ‖A P_n u − y^δ‖² + α TV(u). (3.19)

We perform our experiments with noisy data y^δ, where δ = 0.01. We use the CVX package [31,32] to solve this problem. To generate the differential operator needed to evaluate TV, we use the DIFFOP package [33]. The results are presented in Figure 3, where we also report the relative error of the reconstruction. Overall, the reconstructions look very reasonable (Figure 3a) and exhibit features characteristic of the regulariser, such as staircasing [34], i.e. the introduction of additional jumps and blocky structures not present in the ground truth. Some artefacts at the boundary are also present. If we increase the size of the training set to n = 1500 training pairs, which is roughly 37% of a full basis, the results improve (Figure 3b) and some of the boundary artefacts disappear.
From the computational point of view, once "training" (i.e. Gram-Schmidt orthogonalisation) is complete, the application of the forward operator is cheap. This can be advantageous in applications where exact evaluation of the forward operator is computationally demanding. However, the reconstruction of an image from a given measurement y still requires solving an optimisation problem (3.19). For regularisers that are typically used in imaging, such as Total Variation or Total Generalised Variation [35], this optimisation problem involves non-smooth terms, which makes efficient computation non-trivial [36].

Dual Least Squares
Although projections in the image space (3.1) do not, in general, yield a regularisation, it is known that projections in the data space do define a regularisation. This method is also referred to as dual least squares [18].
Solving the dual least squares problem consists in calculating u_n^Y, the minimum norm solution of

Q_n A u = Q_n y^δ, (4.1)

where Q_n is the orthogonal projector onto the span of the training data {y_i}_{i=1,...,n}. This method is indeed strongly convergent, as the following result from the literature shows.

Theorem 21 ([18]). Let y ∈ R(A). Then the minimum norm solution u_n^Y of (4.1) with exact data satisfies u_n^Y = P_{A* Y_n} u^†, and u_n^Y → u^† as n → ∞.

From Theorem 21 we also know that u_n^Y ∈ A* Y_n (note that on this finite dimensional space the solution of (4.1) is unique), and thus it can be represented as u_n^Y = Σ_{i=1}^n (u_n^Y)_i A* ȳ_i. This means that

Σ_{j=1}^n (u_n^Y)_j (A* ȳ_j, A* ȳ_i) = (y, ȳ_i) for i = 1, ..., n, (4.3)

where ((A* ȳ_j, A* ȳ_i))_{i,j=1,...,n} is the Gram matrix. As one can see from (4.3), to implement the dual least squares approach we require the functions v_i := A* ȳ_i, i = 1, ..., n, or in other words we need training pairs {v_i, y_i}_{i=1,...,n}.
The following result gives a simple characterisation of the Moore-Penrose inverse of QnA.
Theorem 23. The Moore-Penrose inverse of Q_n A is given by

(Q_n A)^† = P_{A* Y_n} A^{-1} Q_n.

Hence, the minimum norm solution u_n^Y of (4.1) is given by

u_n^Y = P_{A* Y_n} A^{-1} Q_n y^δ. (4.4)

Proof. First we observe that R(Q_n A) = Y_n and (N(Q_n A))^⊥ = A* Y_n. Proceeding similarly to Theorem 4, we verify the Moore-Penrose equations for P_{A* Y_n} A^{-1} Q_n, which uniquely characterise the Moore-Penrose inverse. The formula (4.4) follows from this expression.

Moreover, for exact data y = A u^† we have P_{A* Y_n} u_n^U = P_{A* Y_n} A^{-1} Q_n y = u_n^Y = P_{A* Y_n} u^†, i.e. u_n^U and u^† differ only on the orthogonal complement of A* Y_n. It is this component of u_n^U that can potentially cause its divergence.
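Equations (4.3) and (4.4) translate directly into a small solver. The following sketch (names are ours) takes the training data {y_i} and the adjoint images {v_i} with v_i = A* y_i, orthonormalises the data by QR, and solves the Gram system (4.3); no other access to A or A* is needed:

```python
import numpy as np

def reconstruct_dual_least_squares(Y_train, V_train, y_delta):
    """Dual least squares sketch, eqs. (4.3)-(4.4), from training pairs
    {v_i, y_i} with v_i = A* y_i."""
    # Orthonormalise the data: Y_train = Ybar @ R, so A* ybar_i are the
    # columns of V_train @ inv(R).
    Ybar, R = np.linalg.qr(Y_train)
    Vbar = np.linalg.solve(R.T, V_train.T).T
    # Gram system (4.3): G c = (y, ybar_i), then u = sum_i c_i A* ybar_i.
    G = Vbar.T @ Vbar
    c = np.linalg.solve(G, Ybar.T @ y_delta)
    return Vbar @ c
```

For exact data the result is the orthogonal projection of u^† onto A* Y_n, in line with Theorem 21, which makes the method stable by construction.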
As in the case of projections in the image space, the dimension of the space Y_n (the amount of training data) plays the role of the regularisation parameter in the dual least squares method. If the training images are chosen as singular vectors of A, then

‖(Q_n A)^†‖ = σ_n^{-1},

where σ_n is a singular value of A. Therefore, the ideal choice of the training images would be the singular vectors of A, since they would allow one to use a larger n(δ) for the same noise level δ.
The reconstruction formula (4.4) shows that the norm of P_{A* Y_n} A^{-1} Q_n controls the stability of the inversion. It can be estimated using the transformation matrix R_n in (3.7) obtained during the Gram-Schmidt process. The following result is a trivial consequence of Theorem 12.
Theorem 27. The following estimate holds:

‖P_{A* Y_n} A^{-1} Q_n‖ ≤ ‖A^{-1} Q_n‖ ≤ C √n sup_{i=1,...,n} ‖y_i − Q_{i−1} y_i‖^{-1}.

Using this estimate, the same parameter choice rule as in Theorem 14 can be applied. Although this estimate might be conservative, it can still be useful, since it is a by-product of the Gram-Schmidt process and does not require any additional computations.

Numerical experiments
To use the reconstruction formula (4.4), we need training triples {u_i, y_i, v_i}_{i=1,...,n} such that y_i = A u_i and v_i = A* y_i = A* A u_i. Sample triples for the "Faces" dataset are shown in Figure 4. To compute the projection onto A* Y_n, we can again use Gram-Schmidt orthogonalisation, this time on the adjoint images {v_i}_{i=1,...,n}.
We perform our experiments in the same setting as in Section 3.4.1. To avoid boundary artefacts related to the application of the adjoint, we pad the images with ones (which corresponds to Neumann boundary conditions). The size of padding is five pixels on each side of the image.
The results for clean and noisy (δ = 0.01) measurement data for n = 1000, 2000 and 3000 training triples are shown in Figure 5. As expected, the results for clean data improve with n, while for noisy data we observe semiconvergence: from n = 1000 to n = 2000 the reconstructions get slightly worse but remain relatively stable, while for n = 3000 they become unstable.
Comparing the reconstructions to those in Section 3.4.1 (Figure 2), we notice that the dual least squares method produced much smoother reconstructions. This is not surprising, since the solutions of the dual least squares method (4.4) are necessarily in the range of A * (in fact, even of A * A) and hence smooth, whilst those obtained by projecting in the image space (3.8) have the same smoothness as the training images {u i }i=1,...,n. The instabilities in the case of noisy data are also much more severe for projections in the image space ( Figure 2) than for dual least squares ( Figure 5).

Discussion
We considered three types of data driven regularisation methods: one based on restrictions of the inverse A^{-1} (Section 3.1), one based on restrictions of the forward operator A (Section 3.5) and one based on restrictions of the adjoint A* (Section 4). The main conclusion is that methods that use restrictions of bounded operators (A or A*) are more stable and converge under very mild assumptions. The approach in Section 3.1, which is based on the restriction of the inverse A^{-1}, requires much stronger assumptions on the training data {y_i}_{i=1,...,n} and the measurement data y; in particular, it requires that the training data and the measurement data are very local, in the sense that their projections onto the spans Y_n can be expanded in the bases of training data {y_i}_{i=1,...,n} with ℓ¹ coefficients.

Figure 5: Reconstruction of "Faces" images from clean and noisy (1% noise) data using dual least squares; the final panels show reconstructions using 3000 training images from clean and from noisy data. As the size of the training set grows, reconstructions from clean data improve, while those from noisy data demonstrate semiconvergence behaviour, becoming unstable for large n. Compared to projections in the image space (Figure 2), the reconstructions are much smoother, since they lie in the range of the adjoint A*.
Another observation is that in methods that do not use explicit regularisation (Sections 3.1 and 4) the amount of training data becomes the regularisation parameter. This means that excessive amounts of training data can deteriorate the performance, which is related to overfitting in learning based methods. However, overfitting refers to the case when a method reproduces the training data too closely, whilst in our case the instabilities are due to the method reproducing the measurement data too closely. This behaviour is due to the ill-posed nature of the inverse problem.
Finally we summarize our results on data driven regularisation by projection in Table 1.

Table 1: Summary of our results. Different kinds of regularisation by projection require different kinds of training data and different sets of assumptions to guarantee convergence.

Projection in | Training data                 | Convergence  | Reference
Image space   | {u_i, y_i} s.t. A u_i = y_i   | weak, strong | Theorems 7 and 10
Data space    | {v_i, y_i} s.t. v_i = A* y_i  | strong       | Theorem 23

Conclusions
Our results demonstrate that some classical regularisation methods such as regularisation by projection and variational regularisation can be formulated in a purely data driven setting when we have access to the forward operator only via training data. From the theoretical point of view, the main conclusion is that it is possible to carry out the same analysis as in the classical model-based setting and, for instance, obtain convergence rates under standard assumptions. Therefore, it is possible to use well-established regularisation methods in problems where evaluating the forward operator is either computationally too expensive or exact modelling is impossible. Our results also give some insights into the role of the amount of training data in data driven regularisation. In particular, our analysis demonstrates that in regularisation by projection, where no explicit regularisation is used, regularisation is controlled by the size of the training set. This may be counterintuitive, since the common wisdom in data driven methods is that vast amounts of training data are key to success. For ill-posed problems, however, this need not always be the case. These results are in line with recent experiments on the stability of data driven approaches to inverse problems, which demonstrate that the performance can deteriorate as the amount of training data increases. We emphasise, however, that regularisation by projection is a linear inversion method, whilst many of the available data driven inversion methods are non-linear.
If explicit regularisation is used, however, the role of the data changes. It controls the approximation error of the forward operator, and hence using more training data always improves the reconstructions, since stability is guaranteed by the use of an appropriate regularisation functional and a parameter choice rule.