NETT: Solving Inverse Problems with Deep Neural Networks

Recovering a function or high-dimensional parameter vector from indirect measurements is a central task in various scientific areas. Several methods for solving such inverse problems are well developed and well understood. Recently, novel algorithms using deep learning and neural networks for inverse problems appeared. While still in their infancy, these techniques show astonishing performance for applications like low-dose CT or various sparse data problems. However, there are few theoretical results for deep learning in inverse problems. In this paper, we establish a complete convergence analysis for the proposed NETT (Network Tikhonov) approach to inverse problems. NETT considers data consistent solutions having small value of a regularizer defined by a trained neural network. We derive well-posedness results and quantitative error estimates, and propose a possible strategy for training the regularizer. Our theoretical results and framework are different from any previous work using neural networks for solving inverse problems. A possible data driven regularizer is proposed. Numerical results are presented for a tomographic sparse data problem, which demonstrate good performance of NETT even for unknowns of different type from the training data. To derive the convergence and convergence rates results we introduce a new framework based on the absolute Bregman distance generalizing the standard Bregman distance from the convex to the non-convex case.


Introduction
We study the stable solution of inverse problems of the form

Estimate $x \in D$ from data $y_\delta = F(x) + \xi_\delta$. (1.1)

Here $F \colon D \subseteq X \to Y$ is a possibly non-linear operator between Banach spaces $X$ and $Y$. We allow a possibly infinite-dimensional function space setting, but clearly the approach and results apply to a finite dimensional setting as well. The element $\xi_\delta \in Y$ models the unknown data error (noise), which is assumed to satisfy the estimate $\|\xi_\delta\| \le \delta$ for some noise level $\delta \ge 0$. We focus on the ill-posed (or ill-conditioned) case where, without additional information, the solution of (1.1) is either highly unstable, highly underdetermined, or both. Many inverse problems in biomedical imaging, geophysics, engineering sciences, or elsewhere can be written in such a form (see, for example, [18,38,44]). For its stable solution one has to employ regularization methods, which are based on approximating (1.1) by neighboring well-posed problems that enforce stability and uniqueness.

NETT regularization
Any method for the stable solution of (1.1) uses, either implicitly or explicitly, a-priori information about the unknowns to be recovered. Such information can be that $x$ belongs to a certain set of admissible elements or that $x$ has a small value of a regularizer (or regularization functional) $R \colon X \to [0, \infty]$. In this paper we focus on the latter situation, and assume that the regularizer takes the form

$\forall x \in X \colon \quad R(x) = R(V, x) := \psi(\Phi(V, x))$. (1.2)

Here $\psi \colon \mathbb{X}_L \to [0, \infty]$ is a scalar functional and $\Phi(V, \cdot) \colon X \to \mathbb{X}_L$ a neural network of depth $L$, where $V \in \mathcal{V}$, for some vector space $\mathcal{V}$, contains free parameters that can be adjusted to available training data (see Section 2.1 for a precise formulation).
With the regularizer (1.2), we approach (1.1) via

$T_{\alpha; y_\delta}(x) := D(F(x), y_\delta) + \alpha R(V, x) \to \min_x$, (1.3)

where $D \colon Y \times Y \to [0, \infty]$ is an appropriate similarity measure in the data space enforcing data consistency and $\alpha > 0$ is the regularization parameter. One may take $D(F(x), y_\delta) = \|F(x) - y_\delta\|^2$, but other similarity measures such as the Kullback-Leibler divergence (which, among others, is used in emission tomography) are also reasonable choices. Optimization problem (1.3) can be seen as a particular instance of generalized Tikhonov regularization for solving (1.1) with a neural network as regularizer. We therefore name (1.3) the network Tikhonov (NETT) approach for inverse problems.
In this paper, we show that under reasonable assumptions, the NETT approach (1.3) is stably solvable. As $\delta \to 0$, the regularized solutions $x_{\alpha,\delta} \in \arg\min_x T_{\alpha; y_\delta}(x)$ are shown to converge to $R(V, \cdot)$-minimizing solutions of $F(x) = y_0$. Here and below, $R(V, \cdot)$-minimizing solutions of $F(x) = y_0$ are defined as any element

$x_+ \in \arg\min \{ R(V, x) \mid x \in D \wedge F(x) = y_0 \}$. (1.4)

Additionally, we derive convergence rates (quantitative error estimates) between $R(V, \cdot)$-minimizing solutions $x_+$ and regularized solutions $x_{\alpha,\delta}$. As a consequence, (1.3) provides a stable solution scheme for (1.1) using data consistency and encoding a-priori knowledge via neural networks. For proving norm convergence and convergence rates, we introduce the absolute Bregman distance as a new generalization of the standard Bregman distance for non-convex regularization.
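To make the structure of the NETT functional concrete, the following minimal Python sketch evaluates a functional of the form $D(F(x), y_\delta) + \alpha R(V, x)$ for a toy problem. The 1×2 matrix, the leaky ReLU stand-in for the network, and all function names are illustrative assumptions and not the setting used in this paper.

```python
# Toy functional T(x) = ||A x - y||^2 + alpha * R(x); every name here is hypothetical.

def apply_matrix(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def squared_distance(u, v):
    """Similarity measure D(u, v) = ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def regularizer(x):
    """Stand-in for psi(Phi(V, x)): leaky ReLU features, then a squared norm."""
    features = [max(xi, 0.1 * xi) for xi in x]
    return sum(f ** 2 for f in features)

def nett_functional(A, x, y_delta, alpha):
    return squared_distance(apply_matrix(A, x), y_delta) + alpha * regularizer(x)

A = [[1.0, 1.0]]   # one measurement, two unknowns: underdetermined
y_delta = [2.0]
value_a = nett_functional(A, [1.0, 1.0], y_delta, alpha=0.5)
value_b = nett_functional(A, [3.0, -1.0], y_delta, alpha=0.5)
```

Both candidates fit the data exactly, so the functional separates them only through the regularizer term.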

Possible regularizers
The network regularizer R(V, · ) can either be user-specified, or a trained network, where free parameters are adjusted on appropriate training data. Some examples are as follows.
In this paper we study a non-linear extension of $\ell^q$-regularization, where the (in general) non-convex network regularizer takes the form

$R(V, x) = \sum_{\lambda \in \Lambda} |\Phi_\lambda(V, x)|^q$, (1.5)

with $q \ge 1$ and $\Phi(V, \cdot) = (\Phi_\lambda(V, \cdot))_{\lambda \in \Lambda}$ being a possibly non-linear neural network with multiple layers. In Section 3.4 we present convergence results for this non-linear generalization of $\ell^q$-regularization. By selecting non-negative weights, one can easily construct networks that are convex with respect to the inputs [4]. In this work, however, we consider the general situation of arbitrary weights, in which the network regularizer (1.5) can be non-convex.
CNN regularizer: The network regularizer $R(V, \cdot)$ in (1.2) may also be defined by a convolutional neural network (CNN) $\Phi(V, \cdot)$, containing free parameters that can be adjusted on appropriate training data. The CNN can be trained in such a way that the regularizer has a small value for elements $x$ in a set of training phantoms and a larger value on a class of undesirable phantoms. The class of undesirable phantoms can consist of elements containing undersampling artifacts, noise, or both. In Section 5, we present a possible regularizer design together with a strategy for training the CNN to remove undersampling artifacts. We present numerical results demonstrating that our approach performs well in practice for a sparse tomographic data problem.

Comparison to previous work
Very recently, several deep learning approaches for inverse problems have been developed (see, for example, [1,5,13,28,30,31,32,45,47,50,52,53]). In all these approaches, a trained network $\Phi_{\rm rec}(V, \cdot) \colon Y \to X$ maps measured data to the desired output image. Two-step reconstruction networks take the form $\Phi_{\rm rec}(V, \cdot) = \Phi_{\rm CNN}(V, \cdot) \circ B$, where $B \colon Y \to X$ maps the data to the reconstruction space (backprojection; no free parameters) and $\Phi_{\rm CNN}(V, \cdot) \colon X \to X$ is a neural network, for example a convolutional neural network (CNN), whose free parameters are adjusted to the training data. This basic form allows the use of well established CNNs for image reconstruction [20] and already demonstrates impressive results. Network cascades [32,45] and trained iterative schemes [1,2,27,12] learn free parameters in iterative schemes. In such approaches, the reconstruction network can be written in the form

$\Phi_{\rm rec}(V, y) = \bigl( \Phi_N(V_N, \cdot) \circ B_N(y, \cdot) \circ \cdots \circ \Phi_1(V_1, \cdot) \circ B_1(y, \cdot) \bigr)(x_0)$,

where $x_0$ is the initial guess, $\Phi_k(V_k, \cdot) \colon X \to X$ are CNNs that can be trained, and $B_k(y, \cdot) \colon X \to X$ are iterative updates based on the forward operator and the data. The iterative updates may be defined by a gradient step with respect to the given inverse problem. The free parameters are adjusted to available training data.
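The two network classes described above can be sketched as plain function compositions. In the Python sketch below, the scalar forward map $x \mapsto 2x$, the clipping function standing in for a trained CNN, and the ordering "update first, then CNN" within each iteration are assumptions made purely for illustration.

```python
# Hypothetical stand-ins: a scalar forward operator F(x) = 2 x and a "CNN" that clips to x >= 0.

def two_step_network(backprojection, cnn):
    """Phi_rec = Phi_CNN o B: map data to image space, then post-process."""
    return lambda y: cnn(backprojection(y))

def unrolled_network(updates, cnns, x0):
    """Alternate data-dependent updates B_k(y, .) with networks Phi_k, starting at x0."""
    def reconstruct(y):
        x = x0
        for B_k, Phi_k in zip(updates, cnns):
            x = Phi_k(B_k(y, x))
        return x
    return reconstruct

def clip_nonnegative(x):
    return max(x, 0.0)

def gradient_update(y, x, step=0.1):
    """One gradient step for 1/2 (2 x - y)^2, i.e. x - step * 2 * (2 x - y)."""
    return x - step * 2.0 * (2.0 * x - y)

two_step = two_step_network(lambda y: y / 2.0, clip_nonnegative)
unrolled = unrolled_network([gradient_update] * 3, [clip_nonnegative] * 3, x0=0.0)

two_step_result = two_step(4.0)
unrolled_result = unrolled(4.0)
```

In the unrolled variant the data enter every iteration, whereas the two-step variant uses the data only once through the backprojection.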
Network cascades and trained iterative schemes repeatedly make use of the forward problem, which might yield increased data consistency compared to the first class of methods. Nevertheless, in existing approaches, no provable non-trivial estimates bounding the data consistency term $D(F(x), y)$ are available; data consistency can only be guaranteed for the training data $(F(z_n), z_n)_{n=1}^N$ for which the parameters in the neural network are optimized. This may result in instability and degraded reconstruction quality if the unknown to be recovered is not similar enough to the class of employed training data. The proposed NETT bounds the data consistency term $D(F(x_{\alpha,\delta}), y)$ also for data outside the training set. We expect the combination of the forward problem and a neural network via (1.3) (or, for the noiseless case, (1.4)) to increase reconstruction quality, especially in the case of limited access to a large amount of appropriate training data.
Note, further, that the formulation of NETT (1.3) separates the noise characteristic and the a-priori information of unknowns. This allows us to incorporate knowledge of the data generating mechanism, e.g. Poisson noise or Gaussian noise, by choosing the corresponding log-likelihood as the data consistency term. It also simplifies the training process of $R(V, \cdot)$, as it to some extent avoids the impact of noise. Meanwhile, this enhances the interpretability of the resulting approach: on the one hand we require fidelity to the data, and on the other hand we penalize unfavorable features (e.g. artifacts in tomography).
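As an illustration of choosing the data consistency term according to the noise model, the sketch below implements a squared-norm fidelity (Gaussian noise) and a generalized Kullback-Leibler fidelity (Poisson noise, up to constants). The function names and the argument order (prediction first, data second) are assumptions for this example only.

```python
import math

def squared_norm_fidelity(z, y):
    """D(z, y) = ||z - y||^2, matching a Gaussian noise model."""
    return sum((zi - yi) ** 2 for zi, yi in zip(z, y))

def kullback_leibler_fidelity(z, y):
    """Generalized KL divergence sum_i [ z_i - y_i + y_i log(y_i / z_i) ].

    Assumes predictions z_i > 0 and data y_i >= 0; returns infinity otherwise.
    """
    total = 0.0
    for zi, yi in zip(z, y):
        if zi <= 0 or yi < 0:
            return math.inf
        total += zi - yi
        if yi > 0:
            total += yi * math.log(yi / zi)
    return total

data = [2.0, 1.0]
kl_exact = kullback_leibler_fidelity(data, data)          # prediction equals data
kl_mismatch = kullback_leibler_fidelity([1.0, 2.0], data)
l2_mismatch = squared_norm_fidelity([1.0, 2.0], data)
```

Both terms vanish when the prediction reproduces the data and are positive otherwise; only the way they weight deviations differs.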
An early related work [42] uses denoisers as a regularization term, which also includes certain CNNs. The authors of [2] use a residual network for $\Phi$ and $\psi(\cdot) = \|\cdot\|_2^2$. Another related work [12] uses a learned proximal operator instead of a regularization term. After the present paper was initially submitted, other works explored the idea of neural networks as regularizers. In particular, in [35] a regularizer has been proposed that distinguishes the distributions of desired images and noisy images. We note that neither convergence nor convergence rate results have been derived in any previous work using neural networks as regularizers.
The results in this paper are a main step towards the regularization of inverse problems with neural networks. For the first time, we present a complete convergence analysis and derive convergence rates under reasonable assumptions. Many additional issues can be addressed in future work. This includes the design of appropriate CNN regularizers, the development of efficient algorithms for minimizing (1.3), and the consideration of other regularization strategies for (1.4). The focus of the present paper is on the theoretical analysis of NETT and on demonstrating the feasibility of our approach; a detailed comparison with other methods in terms of reconstruction quality, computational performance and applicability to real-world data is beyond our scope here and will be addressed in future work.

Outline
The rest of this paper is organized as follows. In Section 2, we describe the proposed NETT framework for solving inverse problems. We show its stability and derive convergence in the weak topology (see Theorem 2.6). To obtain the strong convergence of NETT, we introduce a new notion of total non-linearity of non-convex functionals. For totally non-linear regularizers, we show norm convergence of NETT (see Theorem 2.11). Convergence rates (quantitative error estimates) for NETT are derived in Section 3. Among others, we derive a convergence rate result in terms of the absolute Bregman distance (see Proposition 3.3). A framework for learning the data driven regularizer is proposed in Section 4, and applied to a sparse data problem in photoacoustic tomography in Section 5. The paper concludes with a short summary and outlook presented in Section 6.

NETT regularization
In this section we introduce the proposed NETT and analyze its well-posedness (existence, stability and weak convergence). We introduce the novel concepts of absolute Bregman distance and total non-linearity, which are applied to establish convergence of NETT with respect to the norm.

The NETT framework
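Our goal is to solve (1.1) with $\|\xi_\delta\| \le \delta$ and $\delta > 0$. For that purpose we consider minimizing the NETT functional (1.3), where the regularizer $R(V, \cdot) \colon X \to [0, \infty]$ is defined by

$R(V, x) := \psi(\Phi(V, x))$ with $\Phi(V, x) := (\sigma_L \circ V_L \circ \sigma_{L-1} \circ V_{L-1} \circ \cdots \circ \sigma_1 \circ V_1)(x)$. (2.1)

Here $L$ is the depth of the network (the number of layers after the input layer) and $V_\ell(x) = A_\ell(x) + b_\ell$ are affine linear operators between Banach spaces $\mathbb{X}_{\ell-1}$ and $\mathbb{X}_{\ell-1/2}$; we take $\mathbb{X}_0 := X$. The operators $A_\ell \colon \mathbb{X}_{\ell-1} \to \mathbb{X}_{\ell-1/2}$ are the linear parts and $b_\ell \in \mathbb{X}_{\ell-1/2}$ the so-called bias terms. The operators $\sigma_\ell \colon \mathbb{X}_{\ell-1/2} \to \mathbb{X}_\ell$ are possibly non-linear and the functional $\psi \colon \mathbb{X}_L \to [0, \infty]$ is possibly non-convex. Note that we use two different spaces $\mathbb{X}_{\ell-1}$ and $\mathbb{X}_{\ell-1/2}$ in each layer because common operations in networks like max-pooling, downsampling or upsampling change the domain space.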
As is common in machine learning, the affine mappings $V_\ell$ depend on free parameters that can be adjusted in the training phase, whereas the non-linearities $\sigma_\ell$ are fixed. Therefore $V_\ell$ and $\sigma_\ell$ are treated separately and only the affine part $V = (V_\ell)_{\ell=1}^{L}$ is indicated in the notation of the neural network regularizer $R(V, \cdot)$. Throughout our theoretical analysis we assume $R(V, \cdot)$ to be given and all free parameters to be trained before the minimization of (1.3). In Section 4, we present a possible framework for training a neural network regularizer.
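The layered structure described above, affine maps with trainable parameters alternating with fixed non-linearities, followed by a scalar functional, can be sketched in a few lines of Python. The weights, the leaky ReLU activation and the choice of $\psi$ as a sum of $q$-th powers are arbitrary assumptions for illustration.

```python
def affine(A, b):
    """Return V(x) = A x + b for a matrix A (list of rows) and bias b."""
    def V(x):
        return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]
    return V

def leaky_relu(x, slope=0.1):
    """Fixed non-linearity sigma, applied component-wise."""
    return [max(xi, slope * xi) for xi in x]

def network(layers, x):
    """Phi(V, x): apply sigma_l(V_l(.)) for l = 1, ..., L in order."""
    for V, sigma in layers:
        x = sigma(V(x))
    return x

def regularizer(layers, x, q=2):
    """R(V, x) = psi(Phi(V, x)) with psi the sum of q-th powers of absolute values."""
    return sum(abs(c) ** q for c in network(layers, x))

# Depth L = 2: R^2 -> R^3 -> R^2 (hypothetical weights).
layers = [
    (affine([[1.0, -1.0], [0.5, 0.5], [0.0, 1.0]], [0.0, 0.0, 0.0]), leaky_relu),
    (affine([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]], [0.1, -0.1]), leaky_relu),
]
features = network(layers, [1.0, 2.0])
value = regularizer(layers, [1.0, 2.0])
```

Only the pairs `(A, b)` would be adjusted during training; the activation and the functional stay fixed, mirroring the separation of the affine parts and the non-linearities in the text.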
Remark 2.1 (CNNs in Banach space setting). A typical instance of the neural network in NETT (1.2) is a deep convolutional neural network (CNN). In a possible infinite dimensional setting, such CNNs can be written in the form (2.1), where the involved spaces satisfy $\mathbb{X}_\ell := \ell^p(\Lambda_\ell, X_\ell)$ and $\mathbb{X}_{\ell-1/2} := \ell^p(\Lambda_\ell, X_{\ell-1/2})$ with $p \ge 1$, $X_\ell$ and $X_{\ell-1/2}$ being function spaces, and $\Lambda_\ell$ being an at most countable set that specifies the number of different filters (depth) of the $\ell$-th layer. The linear operators $A_\ell \colon \ell^p(\Lambda_{\ell-1}, X_{\ell-1}) \to \ell^p(\Lambda_\ell, X_{\ell-1/2})$ are taken as

$A_\ell(x) = \Bigl( \sum_{\mu \in \Lambda_{\ell-1}} K^{(\ell)}_{\lambda,\mu}(x_\mu) \Bigr)_{\lambda \in \Lambda_\ell}$, (2.2)

where $K^{(\ell)}_{\lambda,\mu} \colon X_{\ell-1} \to X_{\ell-1/2}$ are convolution operators. We point out that in the existing machine learning literature, only finite dimensional settings have been considered so far, where $X_\ell$ and $X_{\ell-1/2}$ are finite dimensional spaces. In such a finite dimensional setting, we can take $X_\ell = \mathbb{R}^{N_1 \times N_2}$ and $\Lambda_\ell$ as a set with $N_c$ elements, where the sizes $N_1$, $N_2$ and $N_c$ depend on the layer. One can then identify $\mathbb{X}_\ell = \ell^p(\Lambda_\ell, X_\ell) \simeq \mathbb{R}^{N_1 \times N_2 \times N_c}$ and interpret its elements as stacks of discrete images (the same holds for $\mathbb{X}_{\ell-1/2}$). In typical CNNs, either the dimensions $N_1 \times N_2$ of the base space $X_\ell$ are progressively reduced and the number of channels $N_c$ increased, or vice versa. While we are not aware of any general infinite dimensional formulation of CNNs, our proposed formulation (2.1), (2.2) is the natural infinite-dimensional Banach space version of CNNs, which reduces to standard CNNs [20] in the finite dimensional setting.
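Classical $\ell^q$-regularization with respect to a frame $(\varphi_\lambda)_{\lambda \in \Lambda}$ is also covered by this framework. In this case one may take (2.1) as a single-layer neural network with $\Phi(V, x) = (\langle \varphi_\lambda, x \rangle)_{\lambda \in \Lambda}$. The frame $(\varphi_\lambda)_{\lambda \in \Lambda}$ may be a prescribed wavelet or curvelet basis [11,16,37] or a trained dictionary [3,25]. In Section 3.4, we analyze a non-linear version of $\ell^q$-regularization, where the $\langle \varphi_\lambda, \cdot \rangle$ are replaced by non-linear functionals. In this case the resulting regularizer will in general be non-convex even if $q \ge 1$.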

Well-posedness and weak convergence
For the convergence analysis of NETT regularization, we use the following assumptions on the regularizer and the data consistency term in (1.3).
Condition 2.2 collects these assumptions: (A1) concerns the network regularizer $R(V, \cdot)$, (A2) the data consistency term $D$, and (A3) the coercivity of $R(V, \cdot)$. The conditions in (A1) guarantee the lower semicontinuity of the regularizer. The conditions in (A2) for the data consistency term are not very restrictive and, for example, are satisfied for the squared norm distance. The coercivity condition (A3) might be the most restrictive condition. Several strategies for obtaining it are discussed in the following.

Remark 2.3 (Coercivity via skip or residual connections). Coercivity (A3) clearly holds for regularizers of the form

$R(V, x) = \psi^{(1)}(\Phi^{(1)}(V, x)) + \psi^{(2)}(x)$, (2.3)

where $\psi^{(1)}(\Phi^{(1)}(V, \cdot))$ is a trained regularizer as in (A1) and $\psi^{(2)}$ is coercive and weakly lower semi-continuous. The regularizer (2.3) fits into our general framework and results from a network using a skip connection between the input and the output layer. In this case, the overall network takes the form $\Phi(V, x) = [\Phi^{(1)}(V, x), x]$. Another possibility to obtain coercivity is to use a residual connection in the network structure, which results in a regularizer of the form $R(V, x) = \psi(\Phi^{(r)}(V, x) + x)$. If the last non-linearity $\sigma_L$ in the network $\Phi^{(r)}(V, x)$ is a bounded function and the functional $\psi$ is coercive, then the resulting regularizer is coercive. Coercivity also holds if $\Phi^{(r)}(V, \cdot)$ has Lipschitz constant $< 1$, which can be achieved by appropriate training [7].
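The effect of a skip connection on coercivity can be checked numerically. In the sketch below, a bounded term built from `tanh` plays the role of a trained network part, and the squared norm plays the role of a coercive term acting directly on the input; both choices are assumptions made only for this illustration.

```python
import math

def bounded_network_term(x):
    """A network term with bounded activation: its value never exceeds len(x)."""
    return sum(math.tanh(xi) ** 2 for xi in x)

def skip_term(x):
    """Coercive functional applied directly to the input: ||x||^2."""
    return sum(xi ** 2 for xi in x)

def regularizer_with_skip(x):
    return bounded_network_term(x) + skip_term(x)

radii = [1.0, 10.0, 100.0]
without_skip = [bounded_network_term([r, -r]) for r in radii]
with_skip = [regularizer_with_skip([r, -r]) for r in radii]
```

Along the ray $(r, -r)$ the bounded term saturates, so it cannot be coercive on its own, while the sum grows at least like $2r^2$.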

Remark 2.4 (Layer-wise coercivity). A set of specific conditions that implies coercivity of the regularizer is to assume that, for all $\ell$, the activation functions $\sigma_\ell$ are coercive and there exists $c_\ell \in [0, \infty)$ such that $\forall x \in \mathbb{X}_{\ell-1} \colon \|x\| \le c_\ell \|A_\ell x\|$. The coercivity of $A_\ell$ can be obtained by including a skip connection, that is, by letting $A_\ell x$ contain a copy of its input $x$ in addition to the learned linear part. In CNNs, the spaces $\mathbb{X}_\ell$ and $\mathbb{X}_{\ell-1/2}$ are function spaces (see Remark 2.1) and a standard operation for $\sigma_\ell$ is the ReLU (the rectified linear unit), $\mathrm{ReLU}(x) := \max\{x, 0\}$, that is applied component-wise. The plain form of the ReLU is not coercive. However, the slight modification $x \mapsto \max\{x, ax\}$ for some $a \in (0, 1)$, named leaky ReLU, is coercive; see [36,29]. Another coercive standard operation for $\sigma_\ell$ in CNNs is max pooling, which takes the maximum value $\max\{|x(i)| : i \in I_k\}$ within clusters of transform coefficients. We emphasize, however, that by using one of the strategies described in Remark 2.3, one can use any common activation function without worrying about its coercivity.
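The difference between the plain and the leaky ReLU with respect to coercivity can be seen on inputs with large negative entries. The following Python sketch (with slope $a = 0.1$ chosen arbitrarily) compares the output norms of both activations along the ray $(-t, -t)$.

```python
import math

def relu(x):
    return [max(xi, 0.0) for xi in x]

def leaky_relu(x, a=0.1):
    return [max(xi, a * xi) for xi in x]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

ts = [1.0, 10.0, 100.0]
relu_norms = [norm(relu([-t, -t])) for t in ts]         # stays at zero
leaky_norms = [norm(leaky_relu([-t, -t])) for t in ts]  # grows linearly in t
```

The ReLU maps the whole ray to zero, so its output norm does not grow with the input norm, whereas the leaky ReLU satisfies $\|\sigma(x)\| \ge a \|x\|$.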
Remark 2.5 (Generalization of the coercivity condition). The results derived below also hold under the following weaker alternative to the coercivity condition (A3) in Condition 2.2:

(A3') For all $y \in Y$ and $\alpha > 0$, there exists a $C > 0$ such that the level set

$\{ x \in D \mid T_{\alpha; y}(x) \le C \}$ (2.5)

is non-empty and bounded.

Condition (A3') ensures that the level set in (2.5) is sequentially weakly pre-compact for all $y \in Y$ and $\alpha > 0$. It is indeed weaker than Condition (A3). For instance, in case that $F$ is linear and $D(\cdot, 0)$ is convex, (A3') amounts to requiring that $R(V, \cdot)$ is coercive on the null space of $F$, whereas (A3) requires coercivity of $R(V, \cdot)$ on the whole space $X$.

Theorem 2.6. Let Condition 2.2 be satisfied. Then the following assertions hold true:
(a) Existence: For all $y \in Y$ and $\alpha > 0$, there exists a minimizer of $T_{\alpha; y}$.
(b) Stability: If $y_k \to y$ and $x_k \in \arg\min T_{\alpha; y_k}$, then weak accumulation points of $(x_k)_{k \in \mathbb{N}}$ exist and are minimizers of $T_{\alpha; y}$.
(c) Convergence: Let $y$ be in the range of $F$, let $(y_k)_{k \in \mathbb{N}}$ satisfy $D(y_k, y), D(y, y_k) \le \delta_k$ for some sequence $\delta_k \to 0$, let $x_k \in \arg\min_x T_{\alpha(\delta_k); y_k}(x)$, and let the parameter choice $\alpha$ satisfy (2.6). Then the following holds: weak accumulation points of $(x_k)_{k \in \mathbb{N}}$ are $R(V, \cdot)$-minimizing solutions of $F(x) = y$; $(x_k)_{k \in \mathbb{N}}$ has at least one weak accumulation point $x_+$.

Proof. According to [21,44] it is sufficient to show that the functional $R(V, \cdot)$ is weakly sequentially lower semi-continuous and the set $\{x \mid T_{\alpha; y}(x) \le t\}$ is sequentially weakly pre-compact for all $t > 0$, $y \in Y$ and $\alpha > 0$. By the Banach-Alaoglu theorem, the latter condition is satisfied if $R(V, \cdot)$ is coercive. The coercivity of $R(V, \cdot)$, however, is assumed in Condition 2.2 (for sufficient coercivity conditions see Remarks 2.3 and 2.4). Also from Condition 2.2 it follows that $R(V, \cdot)$ is sequentially lower semi-continuous.
Note that the convergence and stability results of Theorem 2.6 are valid for any test data, independent of the training data used for optimizing the network regularizer. Clearly, if the considered inverse problem does not have a unique solution, an $R(V, \cdot)$-minimizing solution is not necessarily the one corresponding to the desired signal class for test data very different from this class.

Absolute Bregman distance and total non-linearity
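For convex regularizers, the notion of Bregman distance is a powerful concept [8,44]. For non-convex regularizers, the standard Bregman distance can take negative values. In this paper, we therefore use the notion of absolute Bregman distance. To the best of our knowledge, the absolute Bregman distance has not been used in regularization theory so far.
Definition 2.7 (Absolute Bregman distance). Let $F \colon D \subseteq X \to \mathbb{R}$ be Gâteaux differentiable at $x \in D$. The absolute Bregman distance $B_F(\cdot, x) \colon D \to [0, \infty]$ with respect to $F$ at $x$ is defined by

$B_F(\tilde{x}, x) := |F(\tilde{x}) - F(x) - F'(x)(\tilde{x} - x)|$.

Here $F'(x)$ denotes the Gâteaux derivative of $F$ at $x$.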
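A one-dimensional example illustrates why the absolute value is needed in the non-convex case. The sketch below assumes the reading $B_F(\tilde{x}, x) = |F(\tilde{x}) - F(x) - F'(x)(\tilde{x} - x)|$ of the absolute Bregman distance and uses the non-convex function $F(x) = x^3$ as a hypothetical test case.

```python
def bregman(F, dF, x_tilde, x):
    """Standard Bregman expression F(x~) - F(x) - F'(x)(x~ - x); may be negative if F is non-convex."""
    return F(x_tilde) - F(x) - dF(x) * (x_tilde - x)

def absolute_bregman(F, dF, x_tilde, x):
    """Absolute Bregman distance: absolute value of the standard expression."""
    return abs(bregman(F, dF, x_tilde, x))

def cube(x):
    return x ** 3

def cube_prime(x):
    return 3.0 * x ** 2

standard_value = bregman(cube, cube_prime, -3.0, 1.0)
absolute_value = absolute_bregman(cube, cube_prime, -3.0, 1.0)
```

For $x = 1$ and $\tilde{x} = -3$ the standard expression equals $-27 - 1 + 12 = -16$, so it cannot serve as a distance, while its absolute value is $16$.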
From Theorem 2.6 we can conclude convergence of $x_{\alpha,\delta}$ to the exact solution in the absolute Bregman distance. Below we show that this implies strong convergence under some additional assumption on the regularization functional. For this purpose we introduce the new notion of total non-linearity, which has not been studied before.
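We define the modulus of total non-linearity of $F$ at $x$ as

$\nu_F(x, t) := \inf \{ B_F(\tilde{x}, x) \mid \tilde{x} \in D \wedge \|\tilde{x} - x\| = t \}$ for $t \in [0, \infty)$.

The function $F$ is called totally non-linear at $x$ if $\nu_F(x, t) > 0$ for all $t \in (0, \infty)$.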
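In one dimension the modulus of total non-linearity can be evaluated directly, because the set of points at distance $t$ from $x$ consists of only two elements. The sketch below assumes $D = \mathbb{R}$, the infimum-based reading of the modulus $\nu_F(x, t) = \inf\{B_F(\tilde{x}, x) : |\tilde{x} - x| = t\}$, and again the hypothetical test function $F(x) = x^3$.

```python
def absolute_bregman(F, dF, x_tilde, x):
    return abs(F(x_tilde) - F(x) - dF(x) * (x_tilde - x))

def modulus_1d(F, dF, x, t):
    """nu_F(x, t) on the real line: minimum over the two points x + t and x - t."""
    return min(absolute_bregman(F, dF, x + t, x),
               absolute_bregman(F, dF, x - t, x))

def cube(x):
    return x ** 3

def cube_prime(x):
    return 3.0 * x ** 2

at_zero = modulus_1d(cube, cube_prime, 0.0, 2.0)  # equals t^3 at x = 0
at_one = modulus_1d(cube, cube_prime, 1.0, 3.0)   # tangent at x = 1 meets the graph at x~ = -2
```

At $x = 0$ the modulus equals $t^3 > 0$ for every $t > 0$, so the function is totally non-linear there. At $x = 1$ the absolute Bregman distance is $|(\tilde{x} - 1)^2 (\tilde{x} + 2)|$, which vanishes at $\tilde{x} = -2$, so the modulus is zero for $t = 3$ and total non-linearity fails.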
The notion of total non-linearity is similar to total convexity [10] for convex functionals. As opposed to total convexity, we do not assume convexity of $F$, and use the absolute Bregman distance instead of the standard Bregman distance. For convex functions, total non-linearity reduces to total convexity, as the Bregman distance is always non-negative for convex functionals. For a Gâteaux differentiable convex function, total non-linearity essentially requires that its second derivative at $x$ is bounded away from zero.
We have the following result, which generalizes [41, Proposition 2.2] (see also [44, Theorem 3.49]) from the convex to the non-convex case.
Proposition 2.9 (Characterization of total non-linearity). For $F \colon D \subseteq X \to \mathbb{R}$ and any $x \in D$ the following assertions are equivalent: (i) The function $F$ is totally non-linear at $x$;
Proof. This leads to $\nu_F(x, \delta) = 0$, which contradicts the total non-linearity of $F$ at $x$. Then, the assertion follows by considering subsequences of $(x_n)_{n \in \mathbb{N}}$.

Strong convergence of NETT regularization
For totally non-linear regularizers R(V, · ) we can prove convergence of NETT with respect to the norm topology.
Theorem 2.11 (Strong convergence of NETT). Let Condition 2.2 hold and assume additionally that $F(x) = y$ has a solution, $R(V, \cdot)$ is totally non-linear at $R(V, \cdot)$-minimizing solutions, and $\alpha$ satisfies (2.6). Then for every sequence $(y_k)_{k \in \mathbb{N}}$ with $D(y_k, y), D(y, y_k) \le \delta_k$ where $\delta_k \to 0$ and every sequence $x_k \in \arg\min_x T_{\alpha(\delta_k); y_k}(x)$, there exist a subsequence $(x_{k(n)})_{n \in \mathbb{N}}$ and an $R(V, \cdot)$-minimizing solution $x_+$ with $\|x_{k(n)} - x_+\| \to 0$. If the $R(V, \cdot)$-minimizing solution is unique, then $x_k \to x_+$ with respect to the norm topology.
Proof. It follows from Theorem 2.6 that there exists a subsequence $(x_{k(n)})_{n \in \mathbb{N}}$ weakly converging to some $R(V, \cdot)$-minimizing solution $x_+$ such that $R(V, x_{k(n)}) \to R(V, x_+)$. From the weak convergence of $(x_{k(n)})_{n \in \mathbb{N}}$ and the convergence of $(R(V, x_{k(n)}))_{n \in \mathbb{N}}$ it follows that $B_{R(V, \cdot)}(x_{k(n)}, x_+) \to 0$. Thus it follows from Proposition 2.9 that $\|x_{k(n)} - x_+\| \to 0$. If $x_+$ is the unique $R(V, \cdot)$-minimizing solution, the strong convergence to $x_+$ again follows from Theorem 2.6 and Proposition 2.9.

Convergence rates
In this section, we derive convergence rates for NETT in terms of general error measures under certain variational inequalities. Throughout we denote by $\delta > 0$ the noise level and by $\alpha > 0$ the regularization parameter. We discuss instances where the variational inequality is satisfied for the absolute Bregman distance. Additionally, we consider a non-linear generalization of $\ell^q$-regularization.

Rates in the absolute Bregman distance
We next derive conditions under which a variational inequality in the form of (3.1) is possible for the absolute Bregman distance as error measure, $E(x, x_+) := B_{R(V, \cdot)}(x, x_+)$.

Proposition 3.3 (Rates in the absolute Bregman distance). Let $X$ and $Y$ be Hilbert spaces and let $F \colon X \to Y$ be a bounded linear operator. Assume that $R(V, \cdot)$ is Gâteaux differentiable, that $R'(V, x_+) \in \mathrm{Ran}(F^*)$, and that there exist positive constants $\gamma$, $\varepsilon$ with for all $x$ satisfying $|R(V, x) - R(V, x_+)| < \varepsilon$. Then, for some constant $C$.
In particular, for the similarity measure D(z , y ) = z − y 2 and under Condition 2.2, Items (a) and (b) of Theorem 3.1 hold true.
with the constant C := ξ + 2γ, and concludes the proof.

General regularizers
So far we derived well-posedness, convergence and convergence rates for regularizers of the form (1.2). These results can be generalized to Tikhonov regularization where the regularization term is not necessarily defined by a neural network. These results are derived by replacing Condition 2.2 with the following one.
Then we have the following: Note that Item (a) in the above theorem is contained in [21]. Items (b)-(d) have not been obtained previously for non-convex regularizers.

Non-linear q -regularization
We now analyze a special instance of NETT regularization (1.2), generalizing classical $\ell^q$-regularization by including non-linear transformations. More precisely, we consider the $\ell^q$-Tikhonov functional (3.6), where $\Lambda$ is a countable set and $\varphi_\lambda \colon X \to \mathbb{R}$ are possibly non-linear functionals. Theorem 2.6 assures existence and convergence of minimizers of (3.6) provided that $(\varphi_\lambda)_{\lambda \in \Lambda}$ is coercive and weakly continuous. If $(\varphi_\lambda)_{\lambda \in \Lambda}$ is non-linear, minimizers are not necessarily unique.
, and $\psi$ as a weighted $\ell^q$-norm. However, more general choices for $\varphi_\lambda$ are also allowed in (3.6) (see Condition 3.7).
We assume the following: (C1) F : X → Y is a bounded linear operator between Hilbert spaces X and Y .
Proof. The convexity of t → |t| q implies that .
Thus, the assertion follows from Proposition 3.6.

Comparison to the W -Bregman distance
A different framework for deriving convergence rates for non-convex regularization functionals is based on the W -Bregman distance introduced in [21,24]. In this subsection we compare our absolute Bregman distance with the W -Bregman for some specific examples.
Moreover, the $W$-Bregman distance with respect to $w \in \partial_W R(x_+)$ between $x_+$ and $x \in X$ is defined by

$B^{w}_{R,W}(x, x_+) := R(x) - R(x_+) - w(x) + w(x_+)$.

The notion of $W$-Bregman distance reduces to its classical counterpart if we take $W$ as the set of all bounded linear functionals on $X$. Allowing more general function sets $W$ provides an extension of the Bregman distance to non-convex functionals that can be used as an alternative approach to convergence rates. It, however, requires finding a suitable function set such that $R$ is $W$-convex at $x_+$.
Relations between the absolute Bregman distance and the $W$-Bregman distance depend on the particular choice of $W$. We illustrate this by an example.
Example 3.11. Consider the functional R(x) := 2 ReLU(x −x + )−1 |x − x + | q for x ∈ R with q > 1. The absolute Bregman distance according to Definition 2.7 is given by B R (x, x + ) = |x − x + | q . For the W -Bregman distance consider the family W := w α,β : Then R is locally convex at x + with respect to W . Moreover, w α,β ∈ ∂ W R(x + ) if α = 0 and β ≥ 1. For w 0,β ∈ ∂ W R(x + ), it follows that In case of q = 2, the W -subdifferential is closely related to the notion of proximal subdifferentiability used in [14].
If $\beta > 1$, then $B^{w_{0,\beta}}_{R,W}(\cdot, x_+)$ and $B_R(\cdot, x_+)$ only differ in terms of the front constants. As a consequence, the rates with respect to both Bregman distances will be of the same order. However, if $\beta = 1$, then $B^{w_{0,1}}_{R,W}(\cdot, x_+)$ equals $0$ when $x \le x_+$. In contrast, $B_R(\cdot, x_+)$ always treats both $x \ge x_+$ and $x \le x_+$ equally.
We conclude that the relation between the absolute Bregman distance and the W -Bregman distance depends on the particular situation and the choice of the family W . Using the W -Bregman distance for the analysis of NETT and studying relations between the two generalized Bregman distances are interesting lines of research that we aim to address in future work.

A data driven regularizer for NETT
In this section we present a framework for constructing a trained neural network regularizer R(V, · ) of the form (2.1). Additionally, we develop a strategy for network training and minimizing the NETT functional.

A trained regularizer
For the regularizer we propose $R(V, x) = \sum_{\lambda \in \Lambda_L} \|\Phi_\lambda(V, x)\|_q^q$ with a network $\Phi(V, \cdot) = (\Phi_\lambda(V, \cdot))_{\lambda \in \Lambda_L}$ of the form (2.1), which itself is part of an encoder-decoder type network

$\Psi(W, \cdot) \circ \Phi(V, \cdot) \colon X \to X$. (4.1)

Here $\Phi(V, \cdot) \colon X \to \mathbb{X}_L$ can be interpreted as the encoding network and $\Psi(W, \cdot) \colon \mathbb{X}_L \to X$ as the decoding network. We note, however, that any network with at least one hidden layer can be written in the form (4.1). Moreover, we also allow $\mathbb{X}_L$ to be of large dimension, in which case the encoder $\Phi(V, \cdot)$ does not perform any form of dimensionality reduction or compression.
Training of the network is performed such that R(V, · ) is small for artifact free images and large for images with artifacts. The proposed training strategy is presented below.
For suitable network training of the encoder-decoder scheme (4.1), we propose the following strategy (compare Figure 4.1). We choose a set of training phantoms $z_n \in X$ for $n = 1, \dots, N_1 + N_2$, from which we construct back-projection images $x_n := F^+(F z_n)$ (where $F^+$ denotes the pseudo-inverse) for the first $N_1$ training examples, and set $x_n = z_n$ for the last $N_2$ training images. From this we define the training data $\{(x_n, r_n)\}_{n=1}^{N_1+N_2}$, where

$r_n = z_n - x_n = z_n - F^+(F z_n)$ for $n = 1, \dots, N_1$, (4.2)
$r_n = z_n - x_n = 0$ for $n = N_1 + 1, \dots, N_1 + N_2$. (4.3)

The free parameters in (4.1) are adjusted in such a way that $\Psi(W, \Phi(V, x_n)) \simeq r_n$ for any training pair $(x_n, r_n)$. This is achieved by minimizing the error function

$E_N(V, W) := \sum_{n=1}^{N_1+N_2} d(\Psi(W, \Phi(V, x_n)), r_n)$, (4.4)

where $d$ is a suitable distance measure (or loss function) that quantifies the error made by the network function on the $n$-th training sample. Typical choices for $d$ are the mean absolute error or the mean squared error.
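The construction of the training pairs can be sketched for a toy subsampling problem. In the Python sketch below, the forward operator keeps only a subset of entries of a discrete signal, its pseudo-inverse is zero filling, and the mean squared error serves as the distance measure; these choices and all names are assumptions for illustration, not the PAT setting of Section 5.

```python
def forward(z, kept):
    """Toy forward operator F: keep only the entries with indices in `kept`."""
    return [z[i] for i in kept]

def pseudo_inverse(y, kept, n):
    """Pseudo-inverse of the row-selection operator: zero filling."""
    x = [0.0] * n
    for value, i in zip(y, kept):
        x[i] = value
    return x

def build_training_pairs(phantoms, n_artifact, kept):
    """First n_artifact phantoms: x = F^+ F z with residual r = z - x; the rest: x = z and r = 0."""
    pairs = []
    for index, z in enumerate(phantoms):
        if index < n_artifact:
            x = pseudo_inverse(forward(z, kept), kept, len(z))
        else:
            x = list(z)
        r = [zi - xi for zi, xi in zip(z, x)]
        pairs.append((x, r))
    return pairs

def mean_squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def training_error(predict, pairs):
    """Sum of per-sample losses d(predict(x_n), r_n)."""
    return sum(mean_squared_error(predict(x), r) for x, r in pairs)

phantoms = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
pairs = build_training_pairs(phantoms, n_artifact=1, kept=[0, 2])
error_of_zero_predictor = training_error(lambda x: [0.0] * len(x), pairs)
```

A predictor that always returns zero is exact on the artifact-free pair and incurs its whole error on the undersampled pair, which is the kind of contrast the trained network is meant to pick up.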
Given an arbitrary unknown $x \in X$, the trained network estimates the artifact part. As a consequence, $R(V, x)$ is expected to be large if $x$ contains severe artifacts and small if it is almost artifact free. If $x$ is similar to elements in the training set, this should produce almost artifact free results with NETT regularization. Even if the true unknown is of a different type from the training data, artifacts as well as noise will have a large value of the regularizer. Thus our approach is applicable to a wider range of images than the training ones. This claim is confirmed by our numerical results in Section 5. Note that we did not explicitly account for the coercivity condition (A3) during the training phase. Several possibilities for ensuring coercivity are discussed in Section 2.2. Moreover, note that the class of problems we have in mind for the above training strategy are underdetermined problems such as undersampled CT, MRI or PAT. We expect that similar training strategies can be designed for problems that have many small but non-vanishing singular values. Investigating such issues in more detail (theoretically and numerically) is an interesting line of future research.
Remark 4.1 (Alternative trained regularizers). Another natural choice would be to simply take $R(W, V, x) := \|\Psi(W, \Phi(V, x))\|^2$ for the regularizer. Such a regularizer has been used in the proceedings [6] combined with quite simple network architectures for $\Psi$ and $\Phi$. The main emphasis of this paper is the convergence analysis of NETT, so the investigation of the effects of different trained regularizers is beyond its scope. We nevertheless point out that including training data corresponding to (4.3) makes the trained network $\Psi(W, \Phi(V, x))$ different from a standard artifact removal network [30]. Such methods only use training data corresponding to (4.2) to remove artifacts. A detailed investigation of the benefits of each approach is an interesting aspect of future research.

Minimizing the NETT functional
Using the encoder-decoder scheme, regularized solutions are defined as minimizers of the NETT functional

$T_{\alpha; y_\delta}(x) = \frac{1}{2} \|F(x) - y_\delta\|^2 + \alpha \sum_{\lambda \in \Lambda_L} \|\Phi_\lambda(V, x)\|_q^q$, (4.5)

where $\Phi$ is trained as above. The optimization problem (4.5) is non-convex (due to the presence of the non-linear network) and non-smooth if $q = 1$. Note that the subgradient of the regularization term $R(V, x) = \sum_{\lambda \in \Lambda_L} \|\Phi_\lambda(V, x)\|_q^q$ can be evaluated by standard software for network training with the backpropagation algorithm. We therefore propose to use an incremental gradient method for minimizing the Tikhonov functional (4.5), which alternates between a gradient descent step for $\frac{1}{2} \|F(x) - y_\delta\|^2$ and a subgradient descent step for the regularizer $R(V, x)$.
The resulting minimization procedure is summarized in Algorithm 1.

Algorithm 1 Incremental gradient descent for minimizing NETT
Choose family of step-sizes $(s_i) > 0$
Choose initial iterate $x_0$
for $i = 1$ to maxiter do
    $\bar{x}_i \leftarrow x_{i-1} - s_i F'(x_{i-1})^* (F(x_{i-1}) - y_\delta)$ (gradient step for the data consistency term)
    $x_i \leftarrow \bar{x}_i - s_i \alpha \nabla_x R(V, \bar{x}_i)$ (subgradient step for the regularizer)
end for

In practice, we found that Algorithm 1 gives favorable performance and is stable with respect to tuning parameters. Other algorithms such as proximal gradient methods [15] or Newton type methods might also be used for the minimization of (4.5). A detailed comparison with other algorithms is beyond the scope of this article.
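The alternating structure of the incremental gradient method can be sketched for a small linear toy problem. In the Python sketch below, the matrix, the toy regularizer $R(x) = \sum_i \mathrm{leakyReLU}(x_i)^2$ with its closed-form gradient, the constant step size and the placement of $\alpha$ in the regularizer step are all assumptions; in the setting of this paper the regularizer gradient would come from backpropagation through the trained network.

```python
SLOPE = 0.1  # slope of the leaky ReLU used in the toy regularizer

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Adjoint (transpose) product A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def regularizer(x):
    return sum(max(xi, SLOPE * xi) ** 2 for xi in x)

def regularizer_gradient(x):
    # d/dt leaky_relu(t)^2 equals 2 t for t >= 0 and 2 SLOPE^2 t for t < 0
    return [2.0 * xi if xi >= 0 else 2.0 * SLOPE ** 2 * xi for xi in x]

def tikhonov_functional(A, x, y, alpha):
    residual = [fi - yi for fi, yi in zip(matvec(A, x), y)]
    return 0.5 * sum(r * r for r in residual) + alpha * regularizer(x)

def incremental_gradient(A, y, alpha, step_sizes, x0):
    x = list(x0)
    for s in step_sizes:
        # gradient step for the data term 1/2 ||A x - y||^2
        residual = [fi - yi for fi, yi in zip(matvec(A, x), y)]
        x_bar = [xi - s * gi for xi, gi in zip(x, rmatvec(A, residual))]
        # gradient step for the regularizer, weighted by alpha
        x = [xi - s * alpha * gi for xi, gi in zip(x_bar, regularizer_gradient(x_bar))]
    return x

A = [[1.0, 1.0]]
y = [2.0]
alpha = 0.1
x0 = [0.0, 0.0]
x_final = incremental_gradient(A, y, alpha, [0.2] * 200, x0)
```

With a constant step size the iterates settle close to, but not exactly at, the minimizer of the functional; decreasing step sizes are a common remedy for incremental gradient methods.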
Note that the regularizer may be taken as $R(V, x) = \|\Phi(V, x)\|_L$ with an arbitrary norm $\|\cdot\|_L$ on $\mathbb{X}_L$. The concrete training procedure is described below. In the form (4.5), NETT constitutes a non-linear generalization of $\ell^q$-regularization.

Application to sparse data tomography
As a demonstration, we apply NETT regularization with the encoder-decoder scheme presented in Section 4 to the sparse data problem in photoacoustic tomography (PAT). PAT is an emerging hybrid imaging method based on the conversion of light into sound, and beneficially combines the high contrast of optical imaging with the good resolution of ultrasound tomography (see, for example, [33,39,48,49]).

Sparse sampling problem in PAT
The aim of PAT is to recover the initial pressure $p_0 : \mathbb{R}^d \to \mathbb{R}$ in the wave equation

$$\partial_t^2 p(x, t) - \Delta_x p(x, t) = 0 \ \text{ for } (x, t) \in \mathbb{R}^d \times (0, \infty), \qquad p(x, 0) = p_0(x), \qquad \partial_t p(x, 0) = 0 \qquad (5.1)$$

from measurements of p made on an observation surface S outside the support of $p_0$. Here d is the spatial dimension, $\Delta_x$ the spatial Laplacian, and $\partial_t$ the derivative with respect to the time variable t. Both cases d = 2, 3 for the spatial dimension are relevant in PAT: the case d = 3 corresponds to classical point-wise measurements; the case d = 2 to integrating line detectors [9,39]. In this paper we consider the case d = 2 and assume that the initial pressure $p_0 : \mathbb{R}^2 \to \mathbb{R}$ vanishes outside the unit disc $D_1$, the ball of radius 1, and that acoustic data are collected on the boundary circle $S^1 = \partial D_1$. In particular, we are interested in the sparse sampling case, where data are only given for a small number of sensor locations on $S^1$. This is the case that one often faces in practical applications.
In the full sampling case, the discrete PAT forward operator is written as $\mathcal{F} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{m_{\rm full} \times m_2}$, where $m_{\rm full}$ is the number of spatial sampling points under complete sampling and $m_2$ the number of temporal sampling points. Sufficient sampling conditions for PAT in the circular geometry have been derived in [26]. We discretize the exact inversion formula of [19] to obtain an approximation $\mathcal{F}^\sharp : \mathbb{R}^{m_{\rm full} \times m_2} \to \mathbb{R}^{n_1 \times n_2}$ to the inverse of $\mathcal{F}$. In the full data case, application of $\mathcal{F}^\sharp$ to data $\mathcal{F} x \in \mathbb{R}^{m_{\rm full} \times m_2}$ gives an almost artifact-free reconstruction of $x \in \mathbb{R}^{n_1 \times n_2}$, see [26]. Note that $\mathcal{F}^\sharp$ is the discretization of the continuous adjoint of $\mathcal{F}$ with respect to a weighted $L^2$-inner product (see [19]).
In the sparse sampling case, the PAT forward operator is given by

$$F = S \circ \mathcal{F} : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{m_1 \times m_2}. \qquad (5.2)$$

Here $S : \mathbb{R}^{m_{\rm full} \times m_2} \to \mathbb{R}^{m_1 \times m_2}$ is the subsampling operator, which restricts the full data to a small number of spatial sampling points. In the case of spatial under-sampling, the filtered backprojection (FBP) reconstruction $F^\sharp := \mathcal{F}^\sharp \circ S^T$ yields typical streak-like under-sampling artifacts (see, for example, the examples in Figure 5.1).
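The action of the subsampling operator S and its transpose can be sketched as follows. This is an illustration under assumptions, not the implementation used for the experiments: S is taken to keep equispaced sensor rows of the full data array, its transpose fills the missing sensor rows with zeros, and the sizes `m_full` and `m2` as well as all function names are hypothetical.

```python
import numpy as np

def sensor_indices(m_full, m1):
    """Indices of m1 equispaced sensors among m_full full-sampling positions."""
    return (np.arange(m1) * m_full) // m1

def subsample(data_full, idx):
    """S: restrict full data of shape (m_full, m2) to the selected sensor rows."""
    return data_full[idx]

def subsample_adjoint(data_sparse, idx, m_full):
    """S^T: place sparse data back on the full grid, zero elsewhere."""
    out = np.zeros((m_full, data_sparse.shape[1]))
    out[idx] = data_sparse
    return out

# Toy sizes chosen for illustration only.
m_full, m1, m2 = 120, 30, 8
idx = sensor_indices(m_full, m1)
```

Applying a full-data inversion formula to `subsample_adjoint(y, idx, m_full)` then treats the unmeasured sensor positions as zero data, which is one way to see where the streak-like artifacts come from.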

Implementation details
Consider NETT where the regularizer is defined by the encoder-decoder framework described in Section 4. The network Ψ(W, · ) • Φ(V, · ) is taken as the Unet, where Φ(V, x) corresponds to the output of the bottom layer with smallest image size and largest depth. The Unet has been proposed in [43] for image segmentation and successfully applied to PAT in [5,46]. However, we point out that any encoder-decoder network of the form Ψ(W, · ) • Φ(V, · ) can be used in an analogous manner.
The network was trained on a set of training pairs $\{(x_n, r_n)\}_{n=1}^{N_1 + N_2}$, with $N_1 = N_2 = 975$, where exactly half of them contained under-sampling artifacts. For generating such training data we used (4.2), (4.3), where the $z_n$ are taken as randomly generated piecewise constant Shepp-Logan type phantoms. The Shepp-Logan type phantoms have position, angle, shape and intensity of every ellipse chosen uniformly at random under the side constraints that the support of every ellipse lies inside the unit disc and the intensity of the phantom is in the range [0, 6]. During training we had no problems with overfitting or instability and thus we did not use dropout or batch normalization. However, such techniques might be needed for different networks or training sets.
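A possible way to generate such piecewise constant ellipse phantoms is sketched below. The number of ellipses, the range of the semi-axes and the way the intensity bound is enforced (each ellipse contributes at most 6 divided by the number of ellipses) are assumptions made for this illustration; the paper only states the side constraints. The function name is hypothetical.

```python
import numpy as np

def random_ellipse_phantom(n=256, num_ellipses=5, max_intensity=6.0, rng=None):
    """Piecewise constant phantom on [-1, 1]^2 built from random ellipses."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(grid, grid)
    phantom = np.zeros((n, n))
    for _ in range(num_ellipses):
        a, b = rng.uniform(0.05, 0.4, size=2)   # semi-axes (assumed range)
        r_max = 1.0 - max(a, b)                 # keeps the ellipse inside the unit disc
        r = r_max * np.sqrt(rng.uniform())      # centre uniform in the disc of radius r_max
        phi = rng.uniform(0.0, 2.0 * np.pi)
        cx, cy = r * np.cos(phi), r * np.sin(phi)
        theta = rng.uniform(0.0, np.pi)         # rotation angle
        u = (X - cx) * np.cos(theta) + (Y - cy) * np.sin(theta)
        v = -(X - cx) * np.sin(theta) + (Y - cy) * np.cos(theta)
        mask = (u / a) ** 2 + (v / b) ** 2 <= 1.0
        # Per-ellipse intensity bounded so that overlaps never exceed max_intensity.
        phantom[mask] += rng.uniform(0.0, max_intensity / num_ellipses)
    return phantom

phantom = random_ellipse_phantom(n=64, rng=np.random.default_rng(0))
```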
In this discrete sparse sampling case, we take the forward operator F as in (5.2) with $n_1 = n_2 = 256$ and $m_1 = 30$ spatial samples distributed equidistantly on the boundary circle. We used $m_2 = 2000$ time samples spaced evenly in the interval [0, 2.5]. The under-sampling problem in PAT is solved by FBP and by NETT regularization with α = 1/4. We minimize (4.5) using Algorithm 1, where we chose a constant step size $s_i = 0.4$ and took the zero image $x_0 = 0$ as the initial guess. These parameters have been selected by hand using similar phantoms as reference.
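The measurement geometry described above can be written down in a few lines. The sketch assumes that the first sensor sits at angle zero and that both endpoints of the time interval are included; neither detail is stated in the text, and the variable names are hypothetical.

```python
import numpy as np

m1, m2, t_end = 30, 2000, 2.5

# Sensor positions: m1 points spaced equidistantly on the unit circle.
angles = 2.0 * np.pi * np.arange(m1) / m1
sensors = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (m1, 2)

# Measurement times: m2 samples spaced evenly in [0, t_end].
times = np.linspace(0.0, t_end, m2)
```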

Results and discussion
The top left image in Figure 5.1 shows a Shepp-Logan type phantom $x \in \mathbb{R}^{256 \times 256}$ corresponding to a function on the domain $[-1, 1]^2$. It is of the same type as the training data, but is not contained in the training data. The NETT reconstructions $x_{10}$ and $x_{50}$ obtained with Algorithm 1 after 10 and 50 iterations for the Shepp-Logan type phantom are shown in the bottom row. They indicate that NETT is able to remove under-sampling artifacts well while preserving high-resolution information.
We also consider a phantom image (blobs phantom) that additionally includes smooth parts and is of a different type from the phantoms used for training. The blobs phantom as well as the FBP reconstruction $x_{\rm FBP}$ and the NETT reconstructions $x_{10}$ and $x_{50}$ are shown in Figure 5.
Finally, Figure 5.4 shows reconstruction results with NETT from noisy data, where we added 5 % Gaussian noise to the data. We performed 15 iterations with Algorithm 1. The relative reconstruction errors are $E(x_{15}) = 0.280$ for the Shepp-Logan phantom and $E(x_{15}) = 0.210$ for the blobs phantom. Parameters have been taken as in the noiseless data case. We also calculated the average relative $L^2$-error and the average structural similarity index (SSIM) of [51] on a test set of 100 phantoms, which were similar to the training set. The errors for both the noiseless and the noisy case can be seen in Table 1. We used the same parameters as above for both cases.
In both cases, the reconstructions are free from under-sampling artifacts and contain high frequency information, which demonstrates the applicability of NETT for noisy data as well.
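For readers who want to reproduce this kind of evaluation, the following sketch shows one common convention for the quantities involved. It assumes that the relative error E is the $L^2$-norm of the difference divided by the $L^2$-norm of the reference image, and that "5 % noise" means the noise norm equals 5 % of the data norm; the text does not spell out either definition, and the function names are hypothetical.

```python
import numpy as np

def relative_l2_error(x_rec, x_true):
    """Assumed definition: ||x_rec - x_true||_2 / ||x_true||_2."""
    return np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)

def add_relative_gaussian_noise(data, level, rng):
    """Add Gaussian noise rescaled so that ||noise||_2 = level * ||data||_2."""
    noise = rng.standard_normal(data.shape)
    noise *= level * np.linalg.norm(data) / np.linalg.norm(noise)
    return data + noise

# Toy data standing in for simulated PAT measurements.
clean = np.arange(1.0, 13.0).reshape(3, 4)
noisy = add_relative_gaussian_noise(clean, 0.05, np.random.default_rng(0))
```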
The above results demonstrate that the proposed NETT regularization, using the encoder-decoder framework and Algorithm 1 for minimization, is able to remove under-sampling artifacts. It also gives consistent results even on images with smooth structures not contained in the training data. This shows that in the NETT framework, learning the regularization functional on one class of training data can lead to good results even for images beyond that class.

Conclusion and outlook
In this paper we developed a new framework for the solution of inverse problems via NETT (1.3). We presented a complete convergence analysis and derived well-posedness and weak convergence (Theorem 2.6), norm-convergence (Theorem 2.11), as well as various convergence rates results (see Section 3). For these results we introduced the absolute Bregman distance as a new generalization of the standard Bregman distance from the convex to the non-convex setting. NETT combines deep neural networks with a Tikhonov regularization strategy. The regularizer is defined by a network that might be a user-specified function (generalizing frame based regularization), or might be a CNN trained on an appropriate training data set. We have developed a possible strategy for learning a deep CNN (using an encoder-decoder framework, see Section 4). Initial numerical results for a sparse data problem in PAT (see Section 5) demonstrated that NETT with the trained regularizer works well and also yields good results for phantoms different from the class of training data. This may be a result of the fact that, as opposed to other deep learning approaches for image reconstruction, NETT includes a data consistency term as well as a trained network that focuses on identifying artifacts. A detailed comparison with other deep learning methods for inverse problems as well as with variational regularization methods (including TV-minimization) is subject of future studies.
Many possible lines of future research arise from the proposed NETT regularization and the corresponding network-minimizing solution concept (1.4). For example, instead of the Tikhonov variant (1.3) one can employ and analyze the residual method (or Ivanov regularization) for approximating (1.4), see [24]. Instead of the simple incremental gradient descent algorithm (cf. Algorithm 1) for minimizing NETT one could investigate different algorithms such as proximal gradient or semi-smooth Newton methods. Studying network designs and training strategies different from the encoder-decoder scheme is a promising aspect of future studies. Finally, application of NETT to other inverse problems is another interesting research direction.