Bayesian view on the training of invertible residual networks for solving linear inverse problems

Learning-based methods for inverse problems, adapting to the data’s inherent structure, have become ubiquitous in the last decade. Besides empirical investigations of their often remarkable performance, an increasing number of works address the issue of theoretical guarantees. Recently, Arndt et al (2023 Inverse Problems 39 125018) exploited invertible residual networks (iResNets) to learn provably convergent regularizations given reasonable assumptions. They enforced these guarantees by approximating the linear forward operator with an iResNet. Supervised training on relevant samples introduces data dependency into the approach. An open question in this context is to which extent the data’s inherent structure influences the training outcome, i.e. the learned reconstruction scheme. Here, we address this delicate interplay of training design and data dependency from a Bayesian perspective and shed light on opportunities and limitations. We resolve these limitations by analyzing reconstruction-based training of the inverses of iResNets, where we show that this optimization strategy introduces a level of data-dependency that cannot be achieved by approximation training. We further provide and discuss a series of numerical experiments underpinning and extending the theoretical findings.


Introduction
The mathematical field of inverse problems has many applications, e.g., in imaging and image processing. Inverse problems come with characteristic difficulties summarized under the term "ill-posedness". Typically, one wants to recover causes x which depend discontinuously on the observed measurements z. However, a good reconstruction algorithm needs to be stable; otherwise, it cannot handle noisy measurement data. Still, one naturally wants the reconstruction to be as accurate as possible. This results in a compromise called regularization (see [6] for a recent survey on regularization theory). The more stable the reconstructor becomes, the more restricted the set of causes for which it provides accurate results.
Hence, this set of accurate performance is critical, and one typically chooses it using prior knowledge about the application-specific data. In imaging problems, this knowledge often amounts to solutions x looking somehow "natural". However, the mathematical characterization of natural images is challenging. Thus, learned methods, which learn stable and accurate reconstructions from given training data, often excel in this area (see, e.g., the early survey [4]).
While many experimental studies confirm the impressive performance of learned methods, the theoretical understanding remains limited. In particular, learned methods often lack stability guarantees. However, the topic is gaining in importance [20]. In the present work, we address this issue by studying invertible residual networks (iResNets) [5]. As proposed in [3], their invertibility makes them readily applicable to linear inverse problems: the iResNet approximates the forward operator (x → z), and its inverse then solves the inverse problem (z → x). Here, one can control the regularization strength by choosing a hyperparameter of the iResNet that directly controls its stability.
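The inversion of an iResNet ϕ = Id − f is typically evaluated by a Banach fixed-point iteration, which converges because Lip(f) < 1. The following is a minimal sketch in NumPy with a toy residual function (an illustrative choice, not an actual trained network):

```python
import numpy as np

def residual(x):
    # toy residual function f with Lip(f) = 0.5 < 1 (illustrative choice)
    return 0.5 * np.tanh(x)

def invert_iresnet(z, f, n_iter=100):
    """Evaluate phi^{-1}(z) for phi = Id - f via the fixed-point
    iteration x_{k+1} = z + f(x_k); Lip(f) < 1 makes it a contraction."""
    x = np.copy(z)
    for _ in range(n_iter):
        x = z + f(x)
    return x

z = np.array([0.3, -1.2, 2.0])
x = invert_iresnet(z, residual)
print(np.max(np.abs(x - residual(x) - z)))  # residual error of phi(x) = z
```

The contraction rate equals Lip(f), so the number of iterations needed grows as the Lipschitz bound L approaches 1, mirroring the trade-off between stability and accuracy discussed above.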
The authors of [3] also develop a regularization theory for these iResNets. For this purpose, they considered particular architectures and uncovered equivalences to filter functions from classical regularization theory. In the present article, we analyze what iResNets actually learn in practice from the given training data. For this purpose, the Bayesian view is suitable, as it encodes prior knowledge about x and the measurement noise in z as probability distributions. We consider two different ways of training, via the forward and via the inverse mapping, and investigate to what extent the iResNet uses the given information about the data to regularize inverse problems.
The manuscript is structured as follows: Section 2 introduces the problem setting, basic assumptions, and the reconstruction approach using iResNets. The subsequent two sections contain the theoretical analysis of the iResNet's training in a Bayesian setting. First, Section 3 considers the so-called approximation training, where the network is trained in a supervised manner to approximate the forward operator. In particular, we investigate what information the network learns from the training data distribution (i.e., the effect of the prior distribution and the noise on the network). Second, Section 4 considers the so-called reconstruction training, where the iResNet's inverse is trained to map noisy measurements to reconstructions. Section 5 complements the theoretical analyses with extensive numerical experiments. We implement the two training types and underpin the theoretical findings of the previous sections.

Related literature
The Bayesian theory for inverse problems differs from the functional analytic regularization theory. While the functional analytic theory focuses mainly on the stability and convergence of reconstruction algorithms, the Bayesian perspective considers the full posterior distribution and uncertainty estimation for reconstructions. Detailed introductions are given in [12, 9, 23]. An overview of the basic theory and Bayesian methods for solving inverse problems is also contained in [4]. Learning-based methods are particularly powerful for solving Bayesian inverse problems; e.g., [1] describes two different general concepts for an efficient application of neural networks in this framework. [16] proposes a method that demonstrates the Bayesian theory's advantages for inverse problems using a trained denoiser in a plug-and-play Langevin algorithm. The denoiser is assumed to fulfill a Lipschitz condition (similar to the iResNet, see Section 2), implying guaranteed convergence of the algorithm to the posterior distribution. [22] leverages convex analysis to create nonexpansive residual networks and uses them to solve inverse problems. This is particularly desirable for denoising and plug-and-play schemes. Furthermore, invertible neural networks are also of interest for generative modeling. In [8], iResNets act as normalizing flows, i.e., they learn to map from a base distribution to a target distribution, and perform competitively with or even superior to alternative architectures. [3] provides a more detailed discussion of the literature concerning learned convergent regularization schemes.

Problem setting and basic properties
We consider linear inverse problems based on the operator equation Ax = z, where A ∈ L(X, X) is a self-adjoint and positive semidefinite operator and X is a finite-dimensional inner product space, here X = ℝ^n. For simplification, we assume ‖A‖ = 1, which can easily be obtained by a scaling of the operator. In practice, neural network domains tend to be finite-dimensional; this justifies the restriction to the finite-dimensional case. Also, the Bayesian perspective becomes less standard if the underlying probability theory uses infinite-dimensional probability spaces. Due to the properties of A, there exist eigenvalues σ_j² ∈ (0, 1] and corresponding eigenvectors v_j such that N(A)^⊥ = span{v_j | j = 1, ..., ñ}, ñ ≤ n. We use this eigendecomposition in some of our theoretical analyses.
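The normalization ‖A‖ = 1 and the eigendecomposition can be sketched numerically; the matrix below is a random self-adjoint, positive semidefinite example (an illustrative choice, not tied to any concrete forward operator):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                    # self-adjoint and positive semidefinite
A = A / np.linalg.norm(A, 2)   # rescale so that ||A|| = 1 (spectral norm)

# eigendecomposition: eigenvalues sigma_j^2 in (0, 1], eigenvectors v_j
eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals.min(), eigvals.max())
```

After the rescaling, the largest eigenvalue equals 1, matching the assumption σ_j² ∈ (0, 1].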
The aim is to recover the unknown ground truth vector x† as well as possible by only having access to a noisy observation z^δ = Ax† + η. The noise η is assumed to be distributed according to a probability density function (pdf) p_H : X → ℝ_{≥0}. Since there may exist arbitrarily many solutions x which could explain the data z^δ, it is important to incorporate prior knowledge about the unknown solutions. The pdf p_X : X → ℝ_{≥0} encodes this knowledge. In practice, p_X may describe the distribution of natural-looking images or the typical structure of a cross-section of the human body (e.g., in CT problems). One can transfer a problem with an arbitrary linear operator Ã ∈ L(X, Y) (X and Y being different spaces) into the above setting by considering A = Ã*Ã and z = Ã*y.
In this case, the noise η on z may arise from noise η̃ on y via η = Ã*η̃. To illustrate this, let us consider the example of Gaussian noise η̃ ∼ N(0, Σ). Then, it holds that η = Ã*η̃ ∼ N(0, Ã*ΣÃ), which means that Ã* transforms the covariance matrix Σ. If Ã has a nontrivial nullspace, the distribution of Ã*η̃ is singular, and there exists no pdf. Nevertheless, it is possible to approximate the distribution, e.g., by adding ε Id to the covariance matrix or by restricting the problem to N(Ã)^⊥.
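The transformation of the covariance matrix under Ã* can be checked empirically; the matrix Ã and the diagonal Σ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A_tilde = rng.standard_normal((4, 3))   # A~ maps X = R^3 into Y = R^4
Sigma = np.diag([1.0, 0.5, 2.0, 1.5])   # covariance of the noise on Y

# sample eta~ ~ N(0, Sigma) on Y and map it to X via the adjoint A~*
eta_y = rng.multivariate_normal(np.zeros(4), Sigma, size=200_000)
eta_x = eta_y @ A_tilde                 # row i equals (A~* eta_y[i])^T

emp_cov = np.cov(eta_x, rowvar=False)   # empirical covariance on X
print(np.max(np.abs(emp_cov - A_tilde.T @ Sigma @ A_tilde)))
```

The empirical covariance of the pushed-forward samples matches Ã*ΣÃ up to Monte Carlo error.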
While implicit knowledge about p_X and p_H via the given dataset is sufficient for training ϕ_θ, we derive some theoretical results using these pdfs explicitly. For this purpose, we need to make the following assumptions.
Assumption 2.1. Let
• p_X : X → ℝ_{≥0} be a pdf, i.e., ∫_X p_X(x) dx = 1, with existing first and second moments (i.e., p_X(x)·x and p_X(x)·‖x‖² are Lebesgue-integrable) and expectation value µ_X = ∫_X p_X(x)·x dx,
• p_H : X → ℝ_{≥0} be a pdf, i.e., ∫_X p_H(η) dη = 1, with existing first and second moments (i.e., p_H(η)·η and p_H(η)·‖η‖² Lebesgue-integrable) and zero expectation, i.e., ∫_X p_H(η)·η dη = 0, and
• x and η be stochastically independent.
The crucial condition to guarantee the invertibility of ϕ_θ is Lip(f_θ) ≤ L < 1. Consequently, the inverse ψ_θ = ϕ_θ^{-1} fulfills a property describable as a combination of coercivity and Lipschitz continuity, which, in turn, trivially implies strong monotonicity. We formulate this equivalence in the following lemma.
For the converse implication, we now prove that (2.6) guarantees invertibility of ψ. Injectivity and Lipschitz continuity follow directly by applying the Cauchy-Schwarz inequality. To prove surjectivity, we construct a convergent sequence (z_k) such that ψ(z_k) converges to an arbitrary x ∈ X. We define (z_k) recursively such that (2.10) holds. Using this, it follows that ψ(z_k) → x, and (2.6) guarantees convergence of (z_k). Since x was arbitrary, ψ is surjective and therefore invertible. With the argumentation from the beginning in reversed order, we obtain the implication (2) ⇒ (1).
The following remark simplifies condition (2.6) for X = R.
Remark 2.2. In the case of X = ℝ (one-dimensional space), condition (2.6) becomes 1/(1+L) ≤ (ψ(z₁) − ψ(z₂))/(z₁ − z₂) ≤ 1/(1−L) for all z₁ ≠ z₂, which is a constraint on the slope of ψ from above and from below.
This motivates us to think of the condition on ψ as a Lipschitz constraint similar to the one that applies to an iResNet.The following remark shows a direct connection between the iResNet and its inverse.
Remark 2.3 (Inverses of iResNets are iResNets). From Lemma 2.1, we can deduce that one can write the inverse of an iResNet as a scaled iResNet. The constraint (2.6) is equivalent to the representation (2.13), i.e., a scaled iResNet Id − g, where g satisfies the same Lipschitz constraint as the residual function f in the forward mapping.

Approximation training
In [3], the approximation training is introduced, in which the iResNet ϕ_θ is trained to approximate A, i.e., to solve the minimization problem (3.1) for a given dataset of N pairs (x^(i), z^{δ,(i)}) ∈ X × X, z^{δ,(i)} = Ax^(i) + η^(i). The parameter space Θ_L encodes the architecture choice and the Lipschitz constraint Lip(f_θ) ≤ L. This setting was partly motivated by the so-called local approximation property ([3, Theorem 3.1]) characterizing convergence guarantees for the regularized solution ϕ_θ^{-1}(z^δ) as δ → 0. In [3], specific network architectures were trained according to the approximation training, and it was analyzed under which conditions they satisfy the properties of a convergent regularization scheme. This revealed a connection to the classical linear filter-based regularization theory.
In contrast, we now aim to derive more general results without making restrictions on the architecture of the iResNet apart from the constraint on the Lipschitz constant of f. This enables us to analyze the influence of the noise and prior distribution on the trained network and, especially, on the regularized solution. To this end, we consider the case of an infinite amount of training data, allowing us to interpret Equation (3.1) from a Bayesian point of view. More precisely, taking the limit N → ∞ in Equation (3.1) and exploiting the independence of x and η (Assumption 2.1) results in the expected loss (3.2). The squared Euclidean norm in (3.2) can be decomposed into three terms; again because of the independence of x and η and due to E_{p_H}(η) = 0, the mixed term vanishes in expectation. Therefore, up to the constant E‖η‖², the training objective reduces to the minimization of E_{x∼p_X}‖ϕ_θ(x) − Ax‖² over Θ_L. Consequently, the noise does not influence the training. We could interpret this positively, since the noise cannot lead to approximation errors of ϕ_θ. However, a big drawback is that ϕ_θ^{-1}, which shall regularize the inverse problem, does not depend on the noise level either. Accordingly, the amount of regularization has to be set manually via the choice of L for the noise level δ (see [3]) and is not data-dependent.
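The vanishing of the mixed term can be illustrated in a scalar toy model: fitting a linear ϕ(x) = m·x to the noisy data z^δ = σ²x + η by least squares recovers the same slope σ² as fitting to the noise-free data Ax. The parameter values and distributions below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.7                          # eigenvalue sigma^2 of A (toy value)
x = rng.normal(1.0, 2.0, 500_000)     # samples from the prior p_X
eta = rng.normal(0.0, 3.0, x.size)    # zero-mean noise, independent of x
z = sigma2 * x + eta                  # noisy data z^delta

# least-squares slope of phi(x) = m*x fitted to the noisy data ...
m_noisy = np.sum(x * z) / np.sum(x * x)
# ... and fitted to the noise-free data A x
m_clean = np.sum(x * (sigma2 * x)) / np.sum(x * x)
print(m_noisy, m_clean)               # both are close to sigma2
```

Despite the substantial noise level, the fitted slope agrees with the noise-free fit up to Monte Carlo error, so the training outcome carries no information about the noise.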
What remains is the influence of the prior distribution p_X on the training of ϕ_θ. We are especially interested in how ϕ_θ acts on the different eigenspaces of A, in order to analyze the dependence on the size of the eigenvalues. Therefore, we make the rather strong assumption of stochastic independence of the components x_j = ⟨x, v_j⟩ ∼ p_{X,j}. Observe that this assumption is implicitly made when using Tikhonov regularization with ‖·‖²-penalty term. Furthermore, Assumption 2.1 implies that p_{X,j} has existing first and second moments, which follows from Fubini's theorem and the independence of the components. In this setting, a diagonal structure of the network, as in [3], is sufficient. Hence, the above minimization problem can be analyzed for each component separately due to the properties of the eigendecomposition. This is equivalent to a 1d-setting with A : ℝ → ℝ, x_j ↦ σ_j² x_j. In the following, instead of minimizing over a parameter space Θ_L, we directly consider a function space F encoding the Lipschitz constraint (Lip(f) ≤ L) and the architecture choice. For simplicity, in what follows, we omit the index j and consider the componentwise minimization problem (3.7). If F allows for (affine) linear functions and in case of 1 − σ² ≤ L, we can indicate the trivial solution f = (1 − σ²) Id. Obviously, this solution is unique on supp(p_X). Thus, for eigenvalues σ² which are not too small, the training leads to a perfect approximation of the forward operator and no regularization of the inverse problem. For 1 − σ² > L, the minimization problem becomes more interesting due to the Lipschitz constraint. First, we derive the following result, which builds the basis for a subsequent generalization.
Lemma 3.1. Let 1 − σ² > L and let F consist of the affine linear functions f(x) = mx + b with |m| ≤ L. Then f*(x) = Lx + (1 − σ² − L)µ_X is the unique solution of the minimization problem (3.7).
Proof. For a function f of the form f(x) = mx + b with the constraint m² ≤ L², we can solve (3.7) by using the Lagrangian, where the integral is well-defined due to the existence of the first and second moments of p_X. The convexity, coercivity, and continuity of the integral term w.r.t. (m, b) imply that there exists a minimizer. Therefore, we can calculate the minimizer using the necessary conditions (KKT conditions). Since 1 − σ² > L holds by assumption, we need λ > 0 to ensure m ≤ L. Then, (3.12) directly implies m = L.
As m is uniquely determined, we also know that b = (1 − σ² − L)µ_X is unique. The previous lemma provides the prerequisite for the following theorem, where F contains arbitrary Lipschitz continuous functions with constrained Lipschitz constant.
Theorem 3.1. Let F consist of all Lipschitz continuous functions f with Lip(f) ≤ L. Then f*, given by f*(x) = (1 − σ²)x in case of 1 − σ² ≤ L and by f*(x) = Lx + (1 − σ² − L)µ_X in case of 1 − σ² > L, is the solution of the minimization problem (3.7). This solution is unique on supp(p_X) and, for 1 − σ² > L, even on the convex hull of supp(p_X).
Proof. We define F : F → ℝ as the objective of (3.7) and start with the case 1 − σ² ≤ L. Obviously, it holds that F(f*) = 0, so f* is a minimizer. Using the fundamental lemma of the calculus of variations, one can deduce the uniqueness on supp(p_X). Now, consider 1 − σ² > L and let g ∈ C^{0,1}(ℝ), Lip(g) ≤ L, be an arbitrary function. We will show that F(g) > F(f*) holds if g ≠ f* on the convex hull of supp(p_X).
First, we verify that F is well-defined, i.e., F(f) < ∞ holds for f ∈ F, as the second moment of p_X exists. Due to L < 1 − σ², the function g always has a smaller slope than (1 − σ²) Id, which implies that there exists an intersection point x₀ such that g(x₀) = (1 − σ²)x₀. The affine linear function f̃(x) = L(x − x₀) + (1 − σ²)x₀ possesses the same intersection point.
In the case g = f̃ on the convex hull of supp(p_X), we simply apply Lemma 3.1. This shows that g can be the minimizer only if f̃ = f*.
In the case g ≠ f̃, let us examine the integrand of F(g). For x ≤ x₀, we have (1 − σ²)x − f̃(x) ≤ 0, and Lip(g) ≤ L implies g(x) − f̃(x) ≥ 0; an analogous estimate holds for x ≥ x₀. Thus, the integrand of F(g) dominates that of F(f̃) pointwise, and we obtain F(g) ≥ F(f̃). Finally, we show that F(g) = F(f̃) implies f̃ = g on the convex hull of supp(p_X). Since the term under the integral is always greater than or equal to zero, equality forces the integral over any measurable Ω ⊂ ℝ to vanish. The fundamental lemma of the calculus of variations then implies g(x) = f̃(x) for p_X-almost all x ∈ ℝ. Thus, for any x ∈ supp(p_X), it holds that g(x) = f̃(x), and for x₁, x₂ ∈ supp(p_X), we obtain g(x₁) − g(x₂) = L(x₁ − x₂). Consequently, for any x between x₁ and x₂, g(x) = f̃(x) must also hold, as otherwise Lip(g) ≤ L would be violated. Hence, g and f̃ coincide on the convex hull of supp(p_X).
The residual function f* which results from the approximation training (Theorem 3.1) is affine linear and depends only on σ², L, and µ_X. In case of σ² < 1 − L, f* exhibits the maximum possible slope of L and intersects (1 − σ²) Id at the mean µ_X of the prior distribution.
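The closed form of Theorem 3.1 can be checked against a direct numerical minimization of the empirical loss over affine linear residual functions with |m| ≤ L. This is a sketch with illustrative parameter values, using SciPy for the constrained fit:

```python
import numpy as np
from scipy.optimize import minimize

sigma2, L = 0.1, 0.6                  # small eigenvalue: 1 - sigma2 > L
rng = np.random.default_rng(3)
x = rng.normal(2.0, 1.0, 100_000)     # prior samples with mean mu_X = 2

def loss(p):
    # empirical componentwise approximation-training loss for f(x) = m*x + b
    m, b = p
    return np.mean((m * x + b - (1.0 - sigma2) * x) ** 2)

res = minimize(loss, x0=[0.0, 0.0], bounds=[(-L, L), (None, None)])
m_hat, b_hat = res.x

mu_hat = x.mean()
print(m_hat, b_hat, (1.0 - sigma2 - L) * mu_hat)
```

The numerical minimizer attains the maximal slope m = L and the offset b = (1 − σ² − L)µ_X, as predicted.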
Figure 1 exemplifies the solution f* for a Gaussian mixture prior p_X. The inverse ϕ_θ^{-1} corresponding to the minimizer of (3.7) derived in the previous theorem provides a convergent regularization scheme, which we discuss in the following remark.
Remark 3.1. One can express ϕ_θ^{-1} as a filter-based regularization scheme with the squared soft TSVD filter function with bias. The affine linear diagonal architecture was already analyzed in [3, Lemma 4.2]. For the residual function of Theorem 3.1, ϕ_θ^{-1} coincides with this filter scheme. By [3, Lemma 3.3], this filter scheme with bias defines a convergent regularization method for L → 1 in case of vanishing noise and a suitable parameter choice L(δ).
The previous results show that approximation training of a diagonal architecture always leads to an affine linear ϕ_θ, independent of the prior and noise distribution (p_X, p_H). Hence, an affine linear residual layer is the best architecture choice for this task. This implies that ϕ_θ^{-1} is a reconstruction scheme with minimal data dependency, since only the mean µ_X of the prior distribution has an influence. Furthermore, ϕ_θ^{-1} is equivalent to a classical regularization scheme, where one predefines the amount of regularization by choosing the parameter L depending on the noise level.
For the general approximation training problem, the previous investigations suggest that the solution depends at most on the second moments of the prior distribution p_X. A detailed consideration of the general setting for the approximation training is beyond the scope of the present work.

Reconstruction training
The results in the last section show that the approximation training is insufficient for learning a noise- and data-dependent regularization. To address this, we instead consider the training objective (4.1) for given training data {x^(i)}_i ⊂ X, noise realizations {η^(i)}_i ⊂ X, and ϕ_θ = Id − f_θ. This is also motivated by sufficient conditions for the convergence analysis in [3, Remark 4.1]. We refer to this training scheme as the reconstruction training. One can also interpret this reconstruction training as supervised training on data pairs (x^(i), z^{δ,(i)}) enforcing ϕ_θ^{-1}(z^{δ,(i)}) ≈ x^(i) with z^{δ,(i)} = Ax^(i) + η^(i). Using Lemma 2.1, we know that (4.2) is an equivalent problem, assuming that the architectures of ϕ_θ and ψ_θ can approximate any continuous function.
Similar to the approximation training, we analyze the case of an unlimited amount of training data with x ∼ p_X and η ∼ p_H fulfilling Assumption 2.1. Thus, we obtain the minimization problem (4.3), where the set of functions Ψ represents the choice of the architecture and the constraints on the parameters.
In this context, we also make use of the density function of z^δ = Ax + η, which is given by (4.4). With this, we can define the space of p_Z-weighted L²-functions, denoted by L²_{p_Z}(X, X), which is a Hilbert space. Note that functions from L²_{p_Z}(X, X) are (only) well-defined on supp(p_Z) ⊂ X, which is sufficient for our purposes.
At first, we consider the unconstrained case Ψ = L²_{p_Z}(X, X). In this setting, the conditional mean ψ̂(z^δ) = E(x|z^δ) is the solution of (4.3), which is in line with the established theory in statistical inverse problems (see, e.g., the conditional mean estimator in the discussion of Bayes cost estimators in [12] or [1, Prop. 2]).
Lemma 4.1. Let Assumption 2.1 hold and Ψ = L²_{p_Z}(X, X). Then, ψ̂ : z^δ ↦ E(x|z^δ) is the solution of (4.3), which is unique w.r.t. the L²_{p_Z}-norm.
Proof. We denote the objective function by F, which coincides with (4.3) after the substitution z^δ = Ax + η, and the integrals are both finite. Besides, note that F is convex w.r.t. ψ, since ψ ↦ ‖ψ(z^δ) − x‖² is convex for any x, z^δ ∈ X. Thus, we can find the minimizer of F by setting its derivative to zero.
To compute the derivative of F, we expand F(ψ + h) into three summands. The first of these is F(ψ), the last one equals ‖h‖²_{p_Z,2}, and the second summand equals ∂F(ψ)h. Note that the second summand is finite, since F(ψ) and ‖h‖²_{p_Z,2} are both finite. Using Fubini's theorem, we obtain an explicit expression for ∂F(ψ)h. We are looking for ψ̂ such that ∂F(ψ̂)h = 0 for any h ∈ L²_{p_Z}(X, X). Hence, the fundamental lemma of the calculus of variations implies the pointwise characterization of ψ̂ for almost all z^δ ∈ supp(p_Z). According to Bayes' formula, ψ̂(z^δ) is the expectation value of the posterior density function p(x|z^δ).
Finally, we need to make sure that ψ̂ ∈ L²_{p_Z}(X, X). For this, we use Jensen's inequality for conditional expectation values [14, Theorem 8.20, Corollary 8.21]. Then, by the definition of conditional expectations, we obtain a bound which is finite, and the proof is complete. Next, we consider the constrained reconstruction training, where we encode an arbitrary constraint, e.g., (4.2), by choosing Ψ to be a suitable subset of L²_{p_Z}(X, X).
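As a sanity check before turning to the constrained case, the conditional-mean characterization of Lemma 4.1 can be verified by Monte Carlo in a scalar Gaussian model, where the posterior mean is known in closed form. All parameter values below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, mu_x, var_x, var_h = 0.8, 1.0, 2.0, 0.5   # illustrative values

x = rng.normal(mu_x, np.sqrt(var_x), 2_000_000)   # prior samples
z = sigma2 * x + rng.normal(0.0, np.sqrt(var_h), x.size)

# empirical conditional mean E(x | z ~ z0): average x over a narrow z-window
z0 = 1.5
window = np.abs(z - z0) < 0.02
empirical = x[window].mean()

# analytic posterior mean of the scalar Gaussian model z = sigma2*x + eta
gain = sigma2 * var_x / (sigma2**2 * var_x + var_h)
analytic = mu_x + gain * (z0 - sigma2 * mu_x)
print(empirical, analytic)
```

The window average estimates E(x|z^δ = z₀) and matches the closed-form posterior mean up to Monte Carlo error.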
Lemma 4.2. Let Assumption 2.1 hold, let Ψ be an arbitrary subset of L²_{p_Z}(X, X), and let ψ̂ : X → X, z^δ ↦ E(x|z^δ), be the conditional mean estimator. Then, the minimization problem (4.3) is equivalent to (4.17). Note that the existence of an actual minimizer is only guaranteed for closed Ψ.
In the proof of Lemma 4.1, we have already established that the integrals are finite. To split the integral term into two parts, we use Fubini's theorem and the definition of p_Z in (4.4). Again using Fubini's theorem and the definition of ψ̂(z^δ) = E(x|z^δ) (see proof of Lemma 4.1), we obtain that (4.18) is equivalent to the projection formulation, from which the assertion follows.
Thus, in the constrained case, the function ψ*, obtained by reconstruction training, aims to approximate the conditional mean estimator in the p_Z-weighted L²-norm. In other words, reconstruction training with a constraint corresponds to a projection of the conditional mean estimator onto the constraint set with respect to the p_Z-weighted L²-norm.
Remark 4.1. Since we know from Remark 2.3 that the inverse network ψ can be interpreted as a scaled iResNet, we can further compare the minimization problem to the case of approximation training. In the notation of an iResNet, the problem formulated in (4.17) is equivalent to training an iResNet with residual function g (Lip(g) ≤ L) to fit a scaled version of the conditional mean estimator. In contrast, approximation training aims to fit the same architecture type to the linear operator A. Overall, this indicates that theoretical and numerical properties (such as data-dependence) of the two strategies are the sole consequences of the training approach, and there is no additional bias due to the architecture choice of an iResNet as the forward mapping when assuming sufficient approximation capability.
So far, the distribution of the noise p_H was fixed. Now, we want to consider a variable noise level δ > 0 by introducing the pdf p_{H,δ} : X → ℝ_{≥0}. We do not specify the exact dependence of p_{H,δ} on δ, but make the rather informal assumption that η_δ ∼ p_{H,δ} implies ‖η_δ‖ ≤ δ with high probability. So, δ → 0 corresponds to the case of vanishing noise. Analogously, let p_{Z,δ} be defined according to (4.4) s.t. z^δ ∼ p_{Z,δ} holds for z^δ = Ax + η_δ. The posterior mean now also depends on δ and is therefore denoted by ψ̂_δ : z^δ ↦ E(x|z^δ). Further, we want to specify the set Ψ ⊂ L²_{p_Z}(X, X) which represents the inverses of possible iResNet architectures depending on δ and L. To encode the side constraint of (4.2), we define the set Ψ^δ_L accordingly. The set Ψ^δ_L is closed and convex in L²_{p_{Z,δ}}(X, X) for any L < 1 and δ > 0. Consequently, the minimization problem in Lemma 4.2 is well-defined and admits a unique solution. This solution, w.r.t. Ψ^δ_L, p_{Z,δ}, and ψ̂_δ, is denoted by ψ*_{L,δ}.
The parameter L controls the stability of the elements of Ψ^δ_L and can therefore be interpreted as a regularization parameter. The question remains whether we can also obtain a convergence result, i.e., ψ*_{L,δ}(Ax† + η_δ) → x† for ‖η_δ‖ ≤ δ, δ → 0 (vanishing noise) and L → 1, analogous to the convergence theorems of classical methods like Tikhonov regularization. In the following remark, we discuss difficulties and supporting facts regarding a convergence result.
Remark 4.2. There are different results in the literature for the posterior distribution to converge to one single point x† (or its respective delta distribution) [7, Theorem 1 and 2], [24]. Thus, it might be realistic to expect that the conditional mean estimator ψ̂_δ(Ax†) also converges to x†. In classical convergence theorems [10], the ground truth x† is mostly assumed to be a minimum-norm solution of (2.1). However, in a Bayesian setting with a learned reconstruction scheme, a criterion based on p_X for characterizing x† is more appropriate, e.g., as in [7] or the one provided in the subsequent Lemma 4.3. Since ψ*_{L,δ} is the projection of ψ̂_δ onto the set Ψ^δ_L (Lemma 4.2), it is not unlikely that ψ*_{L,δ}(Ax†) also converges to x†. However, the projection is w.r.t. the L²_{p_{Z,δ}}-norm, and we are actually interested in pointwise convergence. Besides, all functions ψ from the sets Ψ^δ_L are Lipschitz continuous and fulfill the monotonicity condition (4.28). Thus, one cannot approximate arbitrary L²_{p_{Z,δ}}-functions. There are two facts that partly address this difficulty. First, the posterior mean ψ̂_δ is also a Lipschitz continuous function under certain assumptions [9, Theorem 4.5, Remark 4.6]. Second, a generalized inverse on R(A) of the form A† : z ↦ x† would indeed fulfill (4.28). This follows from A being self-adjoint and positive semidefinite (we can write A = Ã*Ã) and ‖A‖ = 1. Assuming that ψ*_{L,δ} converges pointwise to A† on R(A) would be sufficient to obtain a convergence result for noisy data as well: if ‖z^δ − z‖ ≤ δ holds and one chooses L in a way that L → 1 and δ/(1 − L) → 0 for δ → 0, the desired convergence follows. In the following, we provide a result for a potential candidate for x† for the convergence analysis.
Lemma 4.3. Let Assumption 2.1 hold with p_{H,δ} in place of p_H, indicating the dependence on δ. In addition, let ψ̂_δ : X → X, z^δ ↦ E(x|z^δ), be the conditional mean estimator. We further make the following assumptions:
(i) The noise on R(A)^⊥ = N(A) and on R(A) = N(A)^⊥ (as A = A* and X is finite-dimensional) is stochastically independent, i.e., there exist corresponding pdfs p⁰_{H,δ} and p†_{H,δ} for the two components.
(ii) p⁰_{H,δ} and p†_{H,δ} define Dirac sequences with respect to δ → 0.
(iii) p_X is compactly supported and continuous. We define R_{p_X}(A) := A(supp(p_X)).
(iv) For any z ∈ R_{p_X}(A) there exists a δ̄ such that for all δ ∈ (0, δ̄] it holds that z ∈ supp(p_{Z,δ}).
We then have pointwise convergence of ψ̂_δ for z ∈ R_{p_X}(A), i.e., ψ̂_δ(z) converges to the minimum-norm solution A†z plus the conditional expectation E(x₀|A†z) ∈ N(A) in the nullspace.
Proof.The proof can be found in Appendix A.

Reconstruction training for diagonal architecture
In order to derive more specific results for the minimizer of (4.3), we make the assumption of stochastic independence of the components x_j = ⟨x, v_j⟩ ∼ p_{X,j} and η_j = ⟨η, v_j⟩ ∼ p_{H,j}, similar to the setting in the last section on the approximation training. Observe that the first and second moments of p_{X,j} and p_{H,j} exist due to Assumption 2.1 for all j. In addition, the density of z_j = σ_j² x_j + η_j is given by the convolution of the accordingly scaled p_{X,j} with p_{H,j}. Analogously to the approximation training, in this setting it is sufficient to consider a diagonal structure of ϕ, which then implies the same structure for ψ, and the resulting optimization problem to train each ψ_j reads as (4.37) with a suitable set of functions Ψ_j. Recall that ψ_j : ℝ → ℝ represents the inverse of an iResNet if ψ_j satisfies the slope condition (4.38) for some 0 ≤ L < 1, cf. Remark 2.2. Therefore, we consider the set (4.39) for some 0 ≤ L < 1. The following lemma provides a formula for the minimizer of problem (4.37). For better readability, we leave out the index j in the subsequent derivations.
Lemma 4.4. The unique solution to the minimization problem (4.37) with Ψ_j as in (4.39) is given by (4.41).
Proof. We solve the minimization problem (4.37) by applying the KKT conditions. To this end, consider the Lagrange functional with multipliers λ₁, λ₂ corresponding to the lower and upper slope constraint. Inserting the last equation into Equation (4.48) yields formulas for m and b, which correspond to the unconstrained solution of problem (4.37). In order to satisfy the constraint on m, we need to guarantee that 1/(1+L) ≤ m ≤ 1/(1−L). If this is not satisfied, we must require λ₁ > 0 or λ₂ > 0, which we will deal with in the following paragraphs.
Case λ₁ > 0, λ₂ = 0: Equation (4.44) is independent of λ₁ and λ₂. Furthermore, Equation (4.43) determines m. In combination with the condition λ₁ > 0, we now obtain

the condition (4.57).
Case λ₁ = 0, λ₂ > 0: The same line of reasoning as in the preceding case yields the analogous formula. Observe that the case λ₁ > 0 and λ₂ > 0 is not possible, as there is no m satisfying Equations (4.45) and (4.46) simultaneously for 0 < L < 1. The uniqueness of the solution follows directly from the uniqueness of m and b, which concludes the proof.
Note that the inverse of the affine linear function ψ* : ℝ → ℝ is again an affine linear function ϕ* with residual f*. In the case of noiseless data, i.e., Var_{p_H}(η) = 0, f* corresponds to the function f* derived in Lemma 3.1 and Theorem 3.1. The function ψ* plays an important role in the case of Gaussian prior and noise distributions, which we will deal with in the following corollary.
Figure 2: Behavior of ψ* and ψ̂ for small singular values (1 − σ² > L), assuming a fixed prior distribution p_X but increasing variance of p_H. In the case Var_{p_H}(η) = 0 (left), the slope of the unconstrained solution exceeds 1/(1−L). If the noise increases, the slope of the unconstrained solution decreases, resulting in m ∈ [1/(1+L), 1/(1−L)] (middle). For very noisy data, the slope of the unconstrained solution is smaller than 1/(1+L) (right), again resulting in ψ* ≠ ψ̂.
Proof. In Lemma 4.1, we have seen that the unconstrained solution of problem (4.37) is given by ψ̂(z) = E(x|z) for all z ∈ ℝ. In the case of Gaussian noise and prior distributions, E(x|z) can be expressed as an affine linear function of z for all z ∈ ℝ [23, Theorem 6.20 and Eq. (2.16a)], which is an element of C¹(ℝ) ∩ L²_{p_Z}(ℝ). In combination with Lemma 4.2, the minimization problem (4.3) can be rewritten accordingly. The same reasoning as in the proof of Theorem 3.1 now shows that ψ* of Lemma 4.4 is a solution to the minimization problem.
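In the scalar Gaussian case, the behavior shown in Figure 2 amounts to clipping the unconstrained posterior-mean slope to the admissible interval [1/(1+L), 1/(1−L)]. The following sketch uses the standard closed-form slope of the Gaussian posterior mean together with this clipping (parameter values are illustrative):

```python
import numpy as np

def recon_slope(sigma2, var_x, var_h, L):
    """Slope of the 1d reconstruction map in the Gaussian case: the
    unconstrained posterior-mean slope, clipped to [1/(1+L), 1/(1-L)]
    as required by the invertibility condition (4.38)."""
    m = sigma2 * var_x / (sigma2**2 * var_x + var_h)
    return float(np.clip(m, 1.0 / (1.0 + L), 1.0 / (1.0 - L)))

L, sigma2, var_x = 0.6, 0.1, 1.0      # small eigenvalue: 1 - sigma2 > L
for var_h in [0.0, 0.05, 1.0]:        # noiseless / moderate / high noise
    print(var_h, recon_slope(sigma2, var_x, var_h, L))
```

For vanishing noise the slope saturates at the upper bound 1/(1−L), for moderate noise it lies strictly inside the interval, and for strong noise it saturates at the lower bound 1/(1+L), reproducing the three regimes of Figure 2.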
Figure 2 illustrates the behavior of the unconstrained solution ψ̂ and the constrained solution ψ* in the case of Gaussian probability density functions for varying noise and small singular values (1 − σ² > L). Note that both solutions can be rewritten to depend on µ_Z instead of µ_X using µ_Z = σ²µ_X. It can be observed that the noise level affects the slope of the unconstrained solution, with decreasing slope at higher noise levels. Thus, ψ̂ violates the invertibility condition (4.38) for very small and very large values of Var_{p_H}(η), leading to ψ* ≠ ψ̂ in these cases (left and right image of Figure 2).
General behavior of ψ*: The previous results deal with special cases where either the architecture or the probability density functions are known. In order to derive more general results, we make use of the theory of optimal control. For this, we need to restrict ourselves to piecewise continuously differentiable functions ψ with bounded domain, i.e., we consider the corresponding set with fixed z₀, z₁ ∈ ℝ and Pr(z ≤ z₀) ≤ ε, Pr(z ≥ z₁) ≤ ε for some small ε > 0, to stay close to the previous setting. Furthermore, to apply the optimal control theory, we need to split the optimization problem into two successive minimization problems. First, we minimize over all functions ψ ∈ Ψ with fixed starting point ψ(z₀) = ψ₀ ∈ ℝ. Then, the starting point minimizing the objective function is determined. In combination with Lemma 4.2, the minimization problem thus reads as (4.66). We would like to stress that this minimization problem is not equivalent to the initial one of Equation (4.3) due to the bounded domain of ψ. However, this error is negligible for small ε, and the two minimization problems coincide if supp(p_Z) ⊂ [z₀, z₁].
Remark 4.3. The restriction to a bounded domain of ψ might seem artificial at first. Nevertheless, in applications, the dataset rarely contains samples belonging to low-density regions of p_Z, and thus these cases are covered by our setting.
The inner minimization problem can be solved with the help of Pontryagin's maximum principle, resulting in the following necessary and sufficient conditions for the derivative of ψ. Then, in all points of differentiability, the derivative ψ_0′ must satisfy the necessary and sufficient conditions

(Figure 3 caption:) Behavior of the solution ψ* of (4.66) in the case that p_Z can be represented as a Gaussian mixture, cf. Remark 5.1. Observe that the slope of the unconstrained solution ψ exceeds 1/(1−L) on an interval; there ψ* attains the maximal admissible slope, and it is equal to ψ outside this interval.
with the Hamiltonian function, for some function f_0 satisfying (4.69).
(4.76) Furthermore, for a function ψ_0 satisfying the conditions of Pontryagin's maximum principle to be a solution of problem (4.67), the Hamiltonian needs to be jointly convex in ψ and u, and the constraint set defined by Equation (4.72) needs to be convex, cf. [21, Theorem 2]. Both of these conditions are satisfied in our setting, and thus the proof is complete.
We would like to remark that λ can be expressed in closed form whenever p_Z and ψ are continuous. To illustrate the previous lemma, let us look at a very simple example. Assume that ψ′(z) > 1/(1−L) for all z ∈ [z_0, z_1]. Then, Lemma 4.5, in combination with a minimization over the starting points ψ_0, yields a solution to problem (4.66). In addition, the general behavior of the solution to problem (4.66) is illustrated in Figure 3. This exemplifies that the best possible architecture choice under the reconstruction training loss is not necessarily an affine linear one, unlike in the approximation training studied in the previous section. This is partly because of the influence of the noise distribution p_H, which cancels out when using the approximation training loss. Moreover, the variances of the prior and noise distributions influence the best architecture and parameter choice, which is not the case in the approximation training setting, where only the expectation of the prior distribution influences the solution.

Numerical experiments
To study the implications of the previously developed theory for the practical application of iResNets to inverse problems, we perform experiments with two forward operators, comparing approximation training (3.1) to reconstruction training (4.1). In all experiments, we train single-layer iResNets with the diagonal structure (3.5), where the residual functions f_θ comprise multiple layers.
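The diagonal structure can be sketched in a few lines. The following is a minimal numpy illustration, not the paper's actual implementation: the affine residual functions are toy stand-ins for the trained fully connected subnetworks, and all names and values are illustrative.

```python
import numpy as np

# Minimal sketch of the diagonal iResNet structure (3.5): in the eigenbasis
# {v_j} of A, the network acts component-wise as phi_j(x_j) = x_j - f_j(x_j)
# with Lip(f_j) <= L < 1.  The affine residual functions below are toy
# stand-ins for the trained subnetworks (an assumption for illustration).

L = 0.8  # Lipschitz bound on the residual functions

def f_j(x, slope):
    # toy affine residual function; |slope| <= L keeps phi_j invertible
    assert abs(slope) <= L
    return slope * x

def phi_diagonal(x_coeffs, slopes):
    # apply the network to the coefficients <x, v_j> component-wise
    return np.array([x - f_j(x, s) for x, s in zip(x_coeffs, slopes)])

x = np.array([1.0, -0.5, 2.0])
slopes = np.array([0.75, 0.3, 0.1])
z = phi_diagonal(x, slopes)
x_rec = z / (1.0 - slopes)  # closed-form inverse of the affine components
```

Since each component map x ↦ (1 − s)x has slope at least 1 − L > 0, the inversion is well-posed in every component.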
In the setting presented in the following sections, we consider a discrete convolution with a smoothing kernel a ∈ R^{9×9}, depicted in Figure 4, with zero padding to preserve dimensionality. Since the resulting Toeplitz matrix M_a performing the convolution with a is symmetric and positive definite, it immediately serves as a self-adjoint operator A = M_a. The second inverse problem we aim to solve is given by a discrete Radon operator Ã : R^{28×28} → R^{30×41}, such that A = Ã*Ã, in line with the setting used in the prior work [3]. We restrict the discussion of the numerical results to the convolution operator; for the sake of completeness, the results for the Radon operator are provided in Appendix D.
In both cases, we train our models on the MNIST handwritten digits dataset [17], where we treat images as flattened vectors in R^{28·28}. In addition, we study an artificially generated bimodal Gaussian dataset, for which we sample the prior distribution in every singular vector independently from a (bimodal) Gaussian mixture distribution. The pdf in every component j therefore reads p_{X,j}(x_j) = ρ_1 g_{v,t_1}(x_j) + ρ_2 g_{v,t_2}(x_j), (5.1) where ρ_1 = 0.35, ρ_2 = 0.65, v = 0.15, t_1 = −0.6, t_2 = 0.6, and g_{s,t} is the pdf of a Gaussian with standard deviation s and expectation value t. This bimodal structure enables us to further explore the data dependency of the optimized models. The architecture of the subnetworks included in the diagonal architecture and their training are realized identically to [3]: each subnetwork consists of a three-layer fully connected network with 35 hidden neurons in each of the first two layers and one output neuron. In addition, we apply the ReLU activation function to the first and second layers of each subnetwork; Figure 7 of [3] depicts the architecture. The network weights are optimized using Adam [13]. Extending the approach in [19], we parameterize the network weights in each layer of all subnetworks to fulfill the Lipschitz constraint. Reconstruction training is accomplished by computing the inverse of the iResNets through the usual fixed-point iteration (30 iterations) and backpropagating through the unrolled iteration to optimize the network weights. As a result, the iterative inversion has to be run in every training step, making reconstruction training computationally much more expensive than approximation training. The Lipschitz bound, realized by a proper parameterization of the f_θ, has to be computed only once per iteration.
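The bimodal dataset of (5.1) can be sampled directly from the stated parameters; a short sketch (function name and sample counts are illustrative, not taken from the paper):

```python
import numpy as np

# Sampling the artificial bimodal dataset of (5.1): in every singular
# component, coefficients are drawn independently from the Gaussian mixture
#   p_X,j = rho_1 * N(t_1, v^2) + rho_2 * N(t_2, v^2)
# with the parameters stated in the text.
rho_1, rho_2 = 0.35, 0.65
v, t_1, t_2 = 0.15, -0.6, 0.6

def sample_bimodal(n_samples, n_components, seed=None):
    rng = np.random.default_rng(seed)
    # pick a mixture mode per coefficient, then add Gaussian jitter
    modes = rng.choice([t_1, t_2], size=(n_samples, n_components),
                       p=[rho_1, rho_2])
    return modes + v * rng.standard_normal((n_samples, n_components))

coeffs = sample_bimodal(10000, 28 * 28, seed=0)
```

The sample mean then concentrates around ρ_1 t_1 + ρ_2 t_2 = 0.18, matching the mixture's expectation.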
Of course, a numerically more efficient training approach would be to build on Remark 2.3 and construct ψ, the inverse of an iResNet, as a scaled iResNet that is trained on reversed data pairs, similar to the approximation approach. However, we currently do not have guarantees on the approximation capability of the involved (fully connected) iResNet architectures and their inverses. To provide a fair comparison, we aim to enforce the same inductive bias in both training methods by choosing the same forward-mapping architecture.
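The fixed-point inversion used in reconstruction training (described above) can be sketched as follows; the residual function is a toy example, and convergence is guaranteed by Banach's fixed-point theorem since Lip(f) ≤ L < 1:

```python
import numpy as np

# Sketch of the fixed-point inversion of an iResNet layer: for phi = Id - f
# with Lip(f) <= L < 1, the iteration x_{k+1} = z + f(x_k) is a contraction
# and converges to phi^{-1}(z).  The paper unrolls 30 such iterations and
# backpropagates through them during reconstruction training; the residual
# function below is a toy stand-in for the trained subnetworks.

def invert_residual_layer(f, z, n_iter=30):
    x = np.copy(z)  # warm start at the data
    for _ in range(n_iter):
        x = z + f(x)
    return x

L = 0.8
f = lambda x: L * np.tanh(x)  # Lip(f) = L < 1
z = np.array([0.3, -1.2, 2.0])
x = invert_residual_layer(f, z)
# x now satisfies x - f(x) ≈ z, i.e. x ≈ phi^{-1}(z)
```

After 30 iterations the residual of the fixed-point equation is of order L^30 ≈ 1e-3, which illustrates why a moderate, fixed number of unrolled steps suffices in practice.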
In the described settings, we perform experiments for varying noise levels δ̂ and Lipschitz constants L_m, where the resulting noise η is Gaussian with standard deviation δ = δ̂ · std_dataset. Here, std_dataset denotes the averaged standard deviation of the coefficients std_j := std({⟨x^(i), v_j⟩}_{i=1,...,N}) of the current dataset (i.e., standard deviation with respect to i, mean with respect to j). In the following, we discuss the results in terms of the learned solutions, the resulting data-dependent filter functions, and the regularization and approximation properties of the models.
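The dataset-relative noise level described above can be computed as follows (a sketch; function name, sizes, and the identity eigenbasis are illustrative assumptions):

```python
import numpy as np

# Sketch of the dataset-relative noise level: delta = delta_hat * std_dataset,
# where std_dataset averages over j the standard deviations (over samples i)
# of the coefficients <x^(i), v_j>.

def dataset_noise_std(X, V, delta_hat):
    # X: (N, d) samples as rows; V: (d, d) orthonormal eigenvectors of A
    coeffs = X @ V                   # coefficients <x^(i), v_j>
    std_j = coeffs.std(axis=0)       # std over samples i, per component j
    return delta_hat * std_j.mean()  # mean over components j

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) * np.array([1.0, 2.0, 3.0, 4.0])
V = np.eye(4)  # identity eigenbasis, for illustration only
delta = dataset_noise_std(X, V, delta_hat=0.1)
```

For the illustrative data above, std_dataset ≈ (1 + 2 + 3 + 4)/4 = 2.5, so delta ≈ 0.25.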

Learned inverse mappings
To compare the characteristics of approximation and reconstruction training with our theoretical findings, we plot the learned one-dimensional inverse mappings ψ_j in the different components j (corresponding to the singular values σ_j) for the bimodal dataset. We visualize the results in Figure 5 for a large and a small eigenvalue of A.
In the case of approximation training, we observe the predicted affine linear behavior within the support of the data distribution, limited by the Lipschitz constraint and aligned with the expectation value of the training data. This result is independent of the noise and optimal only for small noise levels. At the boundaries of the data distribution seen during training, the behavior of the optimal solution from Theorem 3.1 is not learned properly.
For the case of reconstruction training, we can again corroborate our theoretical findings numerically. For this purpose, we compute the posterior expectation value E(x_j|z^δ_j) for our setting.

Remark 5.1. For the multimodal Gaussian distribution with pdf p_X and Gaussian noise p_H, p_H(z) = g_{w,0}(z), (5.4) the posterior expectation value ψ admits a closed-form expression. We note that this recovers the linear behavior E(x|z) = z/σ^2 in the noise-free case, while in the noisy case it adds a correction term that pulls and pushes data points towards more likely results.
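For the Gaussian-mixture setting of Remark 5.1, the posterior mean can be computed by standard Gaussian conjugacy. The sketch below derives this closed form from the model z = σ²x + η rather than copying the paper's formula, so treat it as an assumption-laden illustration:

```python
import numpy as np

# Posterior mean E(x|z) for the component-wise model z = sigma2 * x + eta,
# with Gaussian mixture prior p_X = sum_k rho_k N(t_k, v^2) and Gaussian
# noise eta ~ N(0, w^2).  Derived via standard Gaussian conjugacy; a sketch
# of the expression indicated in Remark 5.1, not a verbatim copy.

def gauss_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def posterior_mean(z, sigma2, rhos, ts, v, w):
    z_var = sigma2 ** 2 * v ** 2 + w ** 2  # variance of z given component k
    # responsibilities of the mixture components given the observation z
    weights = np.array([r * gauss_pdf(z, sigma2 * t, z_var)
                        for r, t in zip(rhos, ts)])
    weights /= weights.sum(axis=0)
    # per-component posterior means (Gaussian conjugacy)
    gain = sigma2 * v ** 2 / z_var
    means = np.array([t + gain * (z - sigma2 * t) for t in ts])
    return (weights * means).sum(axis=0)

z = np.linspace(-1.5, 1.5, 7)
psi = posterior_mean(z, sigma2=0.5, rhos=[0.35, 0.65],
                     ts=[-0.6, 0.6], v=0.15, w=0.1)
```

In the noise-free limit w → 0 the gain tends to 1/σ² and the expression collapses to E(x|z) = z/σ², as stated in the remark.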
For regions within the support of the data distribution, where constraint (2.6) permits the model to approximate E(x_j|z^δ_j), we observe in Figure 5 that the learned solutions match the posterior expectation well. If the model reaches the limiting constraint, it exhausts the possible slope to stay as close to the posterior expectation value as possible. This results in a much more data-dependent inversion scheme, in which reconstructions that were more likely to appear during training are favored. Consequently, the model can compensate for larger noise levels based on additional learned knowledge about the data. The behavior of the learned solution thus coincides with the theoretically founded one in Figure 3.

(Figure 5 caption, fragment:) ... trained via reconstruction training at Lipschitz bound L_2 for different singular values and for noise levels "zero" (δ_0, top row), "small" (δ_6, middle row) and "large" (δ_2, bottom row) for A = M_a.
In the case of large noise, the reconstruction-based model regularizes and does not necessarily exhaust the Lipschitz constraint, while the approximation model always tries to fit the operator as well as possible. If noise is absent, the learned mappings coincide for both training strategies.

Learned filter functions
As a link to classical regularization theory, we visualize the data-dependent filter functions corresponding to the learned models. For this purpose, we evaluate the filter r_L for each singular value σ_j at data points s(q) := σ_j^2 (µ_{X,j} + q · std_j), where we subtract the axis intercept b_{L,j} = (Id − f_{θ,j})^{-1}(0). The variable q ∈ R determines the number of standard deviations std_j away from the mean value µ_{X,j} = (1/N) Σ_{i=1}^N ⟨x^(i), v_j⟩ in the image of the dataset with respect to the fixed singular value of the operator. For simplicity, we define R_L(σ_j, q) := r_L(σ_j^2, s(q)) s(q). (5.9) The results for approximation and reconstruction training are visualized as surface plots for both datasets in Figures 6 and 7. In all cases, the r_L show a sensible behavior, damping small singular values and roughly satisfying r_L → 1 as σ^2 → 1.
On the bimodal dataset in Figure 6, data dependency occurs in all filter functions. In the case of approximation training, it is visible only in the sense that proper regularizing behavior is learned exclusively within the support of the data distribution. Reconstruction training shows a more complex dependency, since the posterior mean pushes points lying close to 0 towards the two peaks of the bimodal data distribution whenever noise is present in the data, as can also be seen in Figure 5. As a result, the filters for the medium singular values, for which an amplified slope of the solution is both permitted by the Lipschitz constraint and necessary due to the influence of noise, are elevated near the center of the distribution. This results in filter functions that may not lead to convergent regularization schemes but include data-driven corrections for the observed noise.
On MNIST in Figure 7, the stronger data dependence of reconstruction training is not directly visible; all filters appear approximately constant within a range of five standard deviations around the mean. This is likely due to the fact that the coefficient distributions in the singular components are all approximately unimodal. Therefore, the learned correction can be similar to the simple regularizing behavior of the networks in the approximation training approach. In turn, the action of L as a regularization parameter becomes visible. Especially for small L, the filter functions show similarities to the case of squared soft TSVD; see Remark 3.1. The filter functions learned in approximation training are very similar for every noise level, which is in line with the developed theory. In contrast, the filters learned in reconstruction training show stronger regularization (i.e., more damping of small singular values) for larger noise levels, adapting to the data seen during training. This again fits well with the theoretical results developed in Section 4 and is especially visible for large L = L_3, where the model is, in principle, able to fit the operator well. However, the filter functions for the largest noise level δ_3 and L_2, L_3 look very similar, indicating that the model learns strong regularization from the data at the cost of a worse operator approximation.

Reconstruction quality and convergence
To compare the performance of the different models in terms of reconstruction quality, we show images and filter functions for a single MNIST digit in Figure 8.
Especially for large noise, the visual quality clearly benefits from reconstruction training compared to the approximation approach. This supports the argument that additional data dependence may be desirable in large-noise applications. In the same case, approximation training shows its regularization properties and indicates proper parameter choice rules: the image quality improves for smaller Lipschitz constraints, which imply a strong regularization. This behavior is not visible in the reconstruction approach, where a large L generally seems to improve the quality. The filter plots we provide for the given sample in Figure 8 (right) also underline this; in this case, they show a similar graph for L_2 and L_3. The filter plots also reveal that the models optimized via approximation training are independent of the noise seen during training and therefore learn identical filter functions for all noise levels. Table 1 also reveals an advantage of this method for small noise levels.
Another way to study the convergence in L and δ of the trained models is to evaluate the approximation properties in terms of the overall errors on the dataset with respect to different error measures. To evaluate the localized approximation property introduced in [3, Theorem 3.1], we define an error estimating the approximation quality of the trained model for the whole dataset or a single sample. In Figure 9, we plot this error for the models trained without noise and varying L. In the case of approximation training, we observe slightly superlinear convergence on the dataset on average, indicating that the property is satisfied for many samples. As proven in prior work [3], this implies that the training constructs a convergent regularization scheme for these samples. For reconstruction-based training, this property is not fulfilled on average. To preserve some insights on the convergence of the method, we extend the weaker but sufficient condition of [3, Remark 3.2] and define a quantity which is closer to the target of reconstruction training. In this case, E_{x^(m)}(ϕ_{θ(L)}, A) → 0 as L → 1 would be sufficient for local convergence. Figure 9 indicates that this property can still be satisfied, however, with slow convergence rates.
In addition, we evaluate the reconstruction error of the training approaches for varying noise levels. Figure 10 depicts the results for the mean squared error and the averaged structural similarity index measure (SSIM) as defined in [25], computed for the dataset by (5.15). Here, one can again see that approximation training comes with a typical parameter choice rule known from regularization theory, where one has to choose L → 1 as δ → 0. In contrast, reconstruction training performs best with the largest L among all noise levels, since the regularization emanates from the data. This learned regularization behavior also becomes apparent in Figure 11, which illustrates the actual Lipschitz constant of the learned residual function. While in the approximation training case the actual Lipschitz constant always reaches the constraint, we can observe a different behavior in reconstruction training, particularly for larger Lipschitz constants used in the constraint and for larger noise. First, with increasing noise levels, we observe a decay of the actual Lipschitz constant. This is because the noise distribution p_H, which has a larger standard deviation, decreases the slope of the conditional mean ψ, and thus, for larger singular values, the slope of ϕ^{-1}_{θ(L,δ)} no longer needs to reach the upper bound 1/(1−L). As the noise grows further, the actual Lipschitz constant decreases until it reaches a minimum and then increases again. While the larger singular values cause the initial decay, the subsequent increase is caused by small singular values due to an analogous influence of p_H on the conditional mean: the larger the noise becomes, the more the slope of ϕ^{-1}_{θ(L,δ)} for small singular values needs to approach the lower bound 1/(1+L) to approximate the posterior mean. As a result, we observe the increase of the actual Lipschitz constant for large noise. A visualization of this behavior can be found in Figure 13 in the appendix, which illustrates the differences in the behavior for small, intermediate, and large singular values. Overall, the reconstruction training method outperforms approximation training for all large δ but stays slightly behind for very small noise.
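The error measures above can be sketched as follows. Note that the paper uses the windowed SSIM of [25]; the single-window (global) variant below, with the usual constants for data range 1, is only an illustrative simplification:

```python
import numpy as np

# Sketch of the error measures of (5.15): mean squared error and a
# simplified, global SSIM (one window over the whole image; the windowed
# SSIM of [25] averages this quantity over local patches).

def mse(x, x_rec):
    return np.mean((x - x_rec) ** 2)

def ssim_global(x, y, data_range=1.0):
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
x = rng.random((28, 28))
noisy = x + 0.1 * rng.standard_normal(x.shape)
```

For identical images the measure equals 1, and any additive noise strictly decreases it, which is the qualitative behavior reported in Figure 10.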

Discussion and outlook
The present work can be seen as a continuation of [3]. There, the authors investigated regularization properties of the proposed iResNet reconstruction approach for specific network architectures trained via approximation training on samples to impart data dependency. Here, we have extended the theory by focusing on the question to what extent the training data distribution influences the optimal parameters of the iResNet and, in turn, the resulting reconstruction scheme. To this end, we considered the training of iResNets from a Bayesian perspective and focused on two different loss functions, namely approximation training and reconstruction training.
In approximation training, our results for the diagonal architecture show that, for all possible prior and noise distributions, the best suited residual function f is an affine linear one whose optimal parameters depend on the mean of the prior distribution, the eigenvalue σ_j^2, and the Lipschitz constant L < 1. Thus, in this setting, the influence of the training data on the training outcome is minimal: the noise distribution has no influence at all and, in particular, none on the regularization properties of the resulting reconstruction scheme. Instead, the amount of regularization of ϕ_θ^{-1} is solely controlled by the Lipschitz constant L, which also becomes apparent in the observed equivalence of ϕ_θ^{-1} to a convergent filter-based regularization scheme with bias. In contrast, the prior and noise distributions significantly impact the optimal architecture and parameters when considering reconstruction training. Here, we realize the network as an iResNet's inverse and train it to approximate A^{-1}, resulting in a stable reconstruction scheme. We showed that the optimal network can be interpreted as an approximation of the conditional mean estimator z^δ → E(x|z^δ) w.r.t. the p_Z-weighted L^2-norm, where p_Z is the density function of Ax + η. Hence, the optimal architecture choice and the corresponding optimal parameters depend on the prior and the noise distribution. Consequently, this indicates that the amount of regularization of the trained network is controlled by the Lipschitz constant L and possibly by the amount of noise in the training data.
The theoretical findings are validated and further corroborated by a series of numerical investigations on the MNIST dataset and an artificially generated dataset following a bimodal Gaussian distribution, for two different forward operators. In particular, the results highlight that in reconstruction training the noise distribution influences the regularization properties of the network. In approximation training, the noise has no such influence; the regularization properties are solely controlled by the Lipschitz constant L. As a result, reconstruction training leads to superior regularizations in high-noise regimes, whereas approximation training is more suitable in low-noise regimes due to better convergence properties.
These investigations of approximation and reconstruction training illustrate how the loss function determines the influence of the prior and noise distribution on the reconstruction scheme, and they shed light on which architectures are suitable. Investigating a link to MAP estimation and how it could be represented in terms of an iResNet might reveal further connections to regularization theory in future works (see also the more detailed discussion in Appendix B).
The presented results allow further investigations and can serve as a foundation for future research directions. In the approximation training case, it might be desirable to relax the naive Bayes assumption and to consider the general non-diagonal network case (3.27). In line with the limited data dependency found in the diagonal case, we conjecture a dependency on at most the second moments of the prior distribution p_X. In the general network case, we showed that reconstruction training leads to a data-dependent and stable reconstruction scheme that approximates the mean of the posterior distribution, where the degree of stability can be controlled by the Lipschitz constant L. What is left to prove is a convergence property as discussed in Remark 4.2, which could, in principle, further manifest the superiority of the reconstruction training approach and provide additional guarantees. Potential generalizations could also incorporate alternative loss functions in the integrand of the reconstruction training loss, which might result in approximations of alternative estimators induced by Bayes costs [12]. In this context, one may, as a starting point, limit oneself to linear estimators to further investigate the relation to learned MAP estimators, respectively Tikhonov regularization, as in [11, 2]. In order to obtain convergence guarantees as well as data dependence, when desirable, one can explore a noise-controlled convex combination of both training losses. Theoretical as well as numerical investigations in this direction remain future research.
In addition, Remark 2.3 serves as a basis for improving the numerical implementation of reconstruction training by constructing the network as a scaled iResNet, resulting in a more efficient training approach, as mentioned in Section 5. This is a reasonable approach when one is not interested in directly comparing approximation and reconstruction training. Numerical investigations including the general non-diagonal network architecture remain immediate future research.
Besides these improvements, extending our results to deeper network architectures, e.g., a concatenation of iResNets, could be beneficial. This would allow for more expressive network architectures and further improve the reconstruction quality of the networks. Finally, it might be worthwhile to generalize the results to nonlinear inverse problems, allowing for an application to a larger number of operators.

Since R(A*) = N(A)^⊥, the approximation property [15, Sec. II] of the Dirac sequence p†_{H,δ} delivers uniform convergence of the denominator; this implies pointwise convergence for any z ∈ X. Analogous arguments apply to the numerator of (A.6) by exploiting the approximation property of the Dirac sequence, implying uniform and thus pointwise convergence for any z ∈ X. Due to (iv), for any fixed z ∈ R_{p_X}(A) the representation of ψ_δ in (A.6) is well-defined for sufficiently small δ. Consequently, we have convergent sequences in the numerator and in the denominator, such that the quotient converges by standard sequence arguments. As R_{p_X}(A) ⊂ R(A) = N(A)^⊥, we thus obtain the desired pointwise convergence.

B MAP estimation and its interpretation as an iResNet

The posterior density p(x|y^δ) can be derived via Bayes' rule from the pdfs of the prior (x ∼ p_X), the noise (y^δ − Ãx ∼ p_H) and the data (y^δ ∼ p_Y). Using this and the monotonicity of the logarithm, one observes a well-known similarity to variational regularization schemes. In the case of Gaussian noise with noise level δ > 0, the negative log-likelihood (NLL) −log p_X can be interpreted as a penalty term, weighted with the squared noise level δ^2. If −log p_X is differentiable, we can use the first-order optimality condition and derive (B.5), where the last implication only holds if Ã*Ã − δ^2 ∂(log p_X) is invertible (which is guaranteed, e.g., in the case of a convex NLL). Now, we can interpret (B.5) as an iResNet approach for solving the
inverse problem where A = Ã*Ã and z = Ã*y. Thus, MAP estimation with an iResNet is possible as long as the above-defined f_θ has a Lipschitz constant of at most L < 1.
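In a single singular component with a Gaussian prior, this iResNet interpretation of MAP estimation can be checked numerically against the closed-form Tikhonov solution. The sketch below derives the residual function from the first-order condition stated above; the concrete numbers are illustrative assumptions:

```python
import numpy as np

# Sketch of Appendix B in one singular component: with Gaussian prior
# N(mu, b^2) and noise level delta, the MAP first-order condition
#   sigma2 * x + (delta^2 / b^2) * (x - mu) = z
# reads x - f(x) = z with the affine residual function
#   f(x) = (1 - sigma2) * x - (delta^2 / b^2) * (x - mu),
# so the MAP estimator is the inverse of an iResNet whenever Lip(f) < 1.

sigma2, delta, b, mu = 0.3, 0.2, 0.5, 0.1

def f(x):
    return (1.0 - sigma2) * x - (delta ** 2 / b ** 2) * (x - mu)

lip_f = abs(1.0 - sigma2 - delta ** 2 / b ** 2)  # slope of the affine f
assert lip_f < 1.0  # invertibility of Id - f

z = 0.8
x = z
for _ in range(200):  # fixed-point iteration x <- z + f(x)
    x = z + f(x)

# closed-form MAP (Tikhonov) solution for comparison
x_map = (z + (delta ** 2 / b ** 2) * mu) / (sigma2 + delta ** 2 / b ** 2)
```

The fixed-point iterate converges to x_map, illustrating that the iResNet inversion and the MAP/Tikhonov estimate coincide in this Gaussian setting.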
We can derive conditions for the prior p_X and the noise level δ from this Lipschitz constraint by making use of Assumption 3.1, i.e., stochastic independence of the components x_j ∼ p_{X,j}, and the eigendecomposition of A. In this setting, the components can be handled separately. To obtain further insights, we distinguish between large eigenvalues (i.e., σ_j^2 > 1 − L) and small ones (i.e., σ_j^2 ≤ 1 − L).
Remark B.1. The prior p_{X,j} corresponding to a large eigenvalue can have a rather arbitrary character. The only important property is that the derivative of the NLL (i.e., ∂(log p_{X,j})) must be Lipschitz continuous, because in this case, for small enough δ, the constraint holds. This is visualized in the left plot of Figure 12. For smaller eigenvalues, however, it holds that 1 − σ^2 ≥ L. Hence, the prior must decrease the slope of f_{θ,j}. This holds true if the NLL of the prior is convex (or even strongly convex), because then ∂(log p_{X,j}) is monotonically decreasing and δ can again be chosen such that the constraint holds. An example for such a p_{X,j} is a Gaussian prior, where ∂(log p_{X,j}) is a linear function with a negative slope (see the right plot of Figure 12 and the subsequent derivations).
Figure 12: MAP estimation with an iResNet. For large singular values σ_j^2 > 1 − L, the NLL is not very restricted (it can be nonconvex) to guarantee Lip(f_{θ,j}) ≤ L < 1. For small singular values σ_j^2 ≤ 1 − L, the NLL has to be (strongly) convex to guarantee that the slope of f_{θ,j} is smaller than 1 − σ_j^2 (and smaller than L).
Example prior distributions: To exemplify the previous observations, let us look at two commonly used prior distributions, namely the Gaussian distribution and the Laplace distribution. In the case of the Gaussian distribution, we have, for all j ∈ N, mean µ_j ∈ R and variance b_j^2 > 0, and obtain the corresponding residual layer. Hence, similar to the observations in the previous remark, for singular values with 1 − σ_j^2 < L the Lipschitz constraint Lip(f_j) ≤ L is fulfilled for all δ and b_j. In the case 1 − σ_j^2 ≥ L, δ and b_j need to satisfy δ^2/b_j^2 ≥ 1 − σ_j^2 − L to guarantee Lip(f_j) ≤ L. In the case of the prior distribution being a Laplacian, i.e.,
with mean µ_j ∈ R and variance 2b_j^2 > 0, the subgradient of log p_{X,j} yields a residual layer for the Laplacian prior distribution which is not Lipschitz continuous. As a result, the Laplace distribution does not satisfy the conditions of Remark B.1. In summary, the previous considerations illustrate that MAP estimation with a Gaussian noise model can also be represented by the proposed iResNet approach for certain prior distributions guaranteeing invertibility.

C Approximation training in diagonal architecture with dependent data and noise distribution
In Section 3, we assumed that the random variables x ∼ p_X and η ∼ p_H are independent. However, one can obtain a more general version of Lemma 3.1 with less restrictive assumptions on the joint data and noise distribution. To this end, we denote the joint density by p, where we use the abbreviated notation E_p for E_{(x,η)∼p}, E_{p_X} for E_{x∼p_X} and E_{p_H} for E_{η∼p_H}. Combining both equations and exploiting (C.9) yields b in either case, which provides the desired f*. In combination with the observation that m and b are uniquely determined, the proof is complete.

D Additional numerical experiments
The numerical results in Section 5 are illustrated for the convolution operator A = M_a. In the following, additional illustrations for the convolution operator and all corresponding results for the Radon operator described in Section 5 are provided.

Remark 2.1. We can translate a more general linear inverse problem Ãx = y (2.2)

(3.10) for m and (3.11) for b leads to

Figure 2: Illustration of the constrained solution ψ* and the unconstrained solution ψ in the case of Gaussian probability density functions p_X and p_H, cf. Corollary 4.1. The slope of the unconstrained solution ψ is denoted by m, i.e., m = σ^2 Var_{p_X}(x) / (σ^4 Var_{p_X}(x) + Var_{p_H}(η)).

Corollary 4.1. Assume that p_X : R → R and p_H : R → R are Gaussian probability density functions. Then, the function ψ* of Lemma 4.4 is a solution to the minimization problem (4.37) with

Figure 4: The filter kernel a used in the convolution operator M_a.

Consider a linear inverse problem Ãx = y with Ã ∈ L(X, Y) as in Remark 2.1, where noisy data y^δ ∈ Y is given. One established approach in Bayesian inverse problems is the so-called MAP (maximum a posteriori) estimator x_MAP = arg max_{x ∈ X} p(x|y^δ). (B.1)
Observe that the integral is well-defined, as the first and second moments of p_X and p_H exist by assumption. Moreover, K is convex and coercive w.r.t. (m, b), and the integral is continuous w.r.t. (m, b). Consequently, there exists a minimizer of problem (4.3) which satisfies the necessary KKT conditions

Table 1: SSIM and MSE measures corresponding to reconstructions of x