Statistical Multiresolution Dantzig Estimation in Imaging: Fundamental Concepts and Algorithmic Framework

In this paper we are concerned with fully automatic and locally adaptive estimation of functions in a "signal + noise" model, where the regression function may additionally be blurred by a linear operator, e.g. by a convolution. To this end, we introduce a general class of statistical multiresolution estimators, that is, estimators defined as solutions of convex optimization problems with supremum-type constraints, and develop an algorithmic framework for computing them. We employ a combination of the alternating direction method of multipliers with Dykstra's algorithm for computing orthogonal projections onto intersections of convex sets and prove numerical convergence. The capability of the proposed method is illustrated by various examples from imaging and signal detection.


Introduction
In numerous applications, the relation between observable data Y and the (unknown) signal of interest u_0 can be modeled as an inverse linear regression problem. We shall assume that the data Y = {Y_ν}_{ν∈X} is sampled on the equidistant grid X = {1, . . . , m}^d, with m, d ∈ N, and that u_0 ∈ U for some linear space U, such as a Euclidean space or a Sobolev class of functions. Hence the model can be formalized as

Y_ν = (K u_0)_ν + ε_ν,   ν ∈ X.   (1)

Here we assume that ε = {ε_ν}_{ν∈X} are independent and identically distributed r.v. with E(ε_ν) = 0 and E(ε_ν²) = σ² > 0 (white noise). Moreover, K : U → (R^m)^d denotes a linear operator that encodes the functional relation between the quantities that are accessible by experiment and the underlying signal. Often the operator K does not have a continuous inverse (or its inverse is ill-conditioned in a discrete setting, where K is a matrix), that is, estimation of u_0 given the data Y is an ill-posed problem. As a consequence, estimators for u_0 cannot, in general, be obtained by merely applying the inverse of K to an estimator of Ku_0. Instead, more sophisticated statistical regularization techniques have to be employed that, loosely speaking, are capable of simultaneously inverting K and solving the regression problem.
The application we primarily have in mind is the reconstruction of low-dimensional signals (e.g. images) u 0 which are presumed to exhibit a strong neighborhood structure as it is characteristic for imaging or signal detection problems. These neighborhood relations are often modeled by prior smoothness or structural assumptions on u 0 (e.g. on the texture of an image).
The aim of this paper is twofold. First, we will introduce the broad class of statistical multiresolution estimators (SMRE). We claim that numerous regularization techniques that were recently proposed for different problems in various branches of applied mathematics and statistics can be considered as special cases of these. Among others, this includes the Dantzig selector (see [4,7,29] and references therein) that was recently proposed in the context of high dimensional statistics. Our main focus, however, will be on imaging problems, and it will turn out that the aforementioned neighborhood relations can be modeled within our SMRE framework in a straightforward manner. This results in locally adaptive and fully automatic image reconstruction methods.
The high intrinsic structure of the signals that are typically under consideration in imaging is in contrast to the usual situation in high-dimensional statistics. There, u_0 is usually assumed to be unstructured but to have a sparse representation with respect to some basis of U (cf. [7,8,43]). Consequently, the consistent estimation of u_0 is realized by minimizing a regularization functional which fosters sparsity, such as the ℓ¹-norm of the coefficients, subject to an ℓ^∞-constraint on the coefficients of the residual, i.e.

inf_{u∈U} ‖u‖_1   subject to   ‖K^T (Y − Ku)‖_∞ ≤ q.   (2)
In order to apply this approach to image reconstruction, two modifications become necessary: Often one aims to minimize other regularization functionals, such as the total variation seminorm (cf. [34,37]) or Sobolev norms, say. Hence, we suggest replacing the ℓ¹-norm in (2) by a general convex functional J that models the smoothness or texture information of signals or images (cf. [1,35]). Furthermore, we relax the ℓ^∞-constraint such that neighborhood relations of the image can be taken into account. This generalizes the Dantzig selector in a natural way and increases estimation efficiency. As we will lay out in Paragraph 1.2.2, this requires new algorithms to solve the resulting large-scale optimization problems efficiently.
1.1. Statistical Multiresolution Estimation. We will now introduce the announced class of estimators. To this end, let S be some index set and W = {ω_S : S ∈ S} be a set of given weight-functions on the grid X = {1, . . . , m}^d. A statistical multiresolution estimator (SMRE) (or generalized Dantzig selector) is defined as a solution of the constrained optimization problem

inf_{u∈U} J(u)   subject to   sup_{S∈S} ⟨ω_S, Λ(Y − Ku)⟩ ≤ q.   (3)

Here, J : U → R denotes a regularization functional that incorporates a priori knowledge on the unknown signal u_0 (such as smoothness) and Λ : (R^m)^d → (R^m)^d a possibly non-linear transformation. The constant q can be considered as a universal regularization parameter that governs the trade-off between regularity and data-fit of the reconstruction. In most practical situations q is chosen to be the α-quantile q_α of the multiresolution (MR) statistic T(ε), where T : (R^m)^d → R encodes the inequality constraint in (3), i.e.

T(v) = sup_{S∈S} ⟨ω_S, Λ(v)⟩.   (4)
To this end, we assume the distribution of T(ε) to be (approximately) known. This can either be obtained by simulations or, in some cases, the limiting distribution can even be derived explicitly. The regularization parameter q then admits a sound statistical interpretation: Each solution û_α of (3) satisfies

P( J(û_α) ≤ J(u_0) ) ≥ α,

i.e. the estimator û_α is more regular (in terms of J) than u_0 with a probability of at least α. To see this, simply observe that the true signal u_0 satisfies the constraint in (3) with probability at least α. For a given estimator û of u_0, the set W is assumed to be rich enough to catch all relevant non-random signals that are visible in the residual Y − Kû. Then, the average function evaluated at v = Y − Kû is supposed to be significantly larger than q for at least one ω ∈ W whenever Y − Kû fails to resemble white noise. Put differently, the MR-statistic T(Y − Kû) is bounded by q whenever Y − Kû is accepted as white noise according to the resolution provided by W. In fact, this is a key observation that reveals numerous potential application areas of the estimation method (3). The examples we have in mind are mainly from statistical signal detection and imaging, where the index set S is typically chosen to be an overlapping (redundant) system of subsets of the grid X and ω_S is the normalized indicator function on S ∈ S. Consequently, the inequality constraint in (3) guarantees that the residual resembles white noise on all sets S ∈ S. In other words, the SMRE approach in (3) yields a reconstruction method that locally adapts the amount of regularization according to the underlying image features. We illustrate this in Section 3 by various examples.
Summarizing, the optimization problem in (3) amounts to choosing the most parsimonious among all estimators û for which the residual Y − Kû resembles white noise according to the statistic T. If Y − Kû contains some non-random signal, then T(Y − Kû) is likely to be larger than q and û happens to lie outside the admissible domain of (3). Thus, the multiresolution constraint prevents too parsimonious reconstructions due to the minimization of J.
1.2. Algorithmic Challenges and Related Work.
1.2.1. Multiresolution Methods. SMREs and related MR statistics have recently been studied in various contexts. We give a brief (but incomplete) overview.
Classical MR statistics are obtained from the general form in (4) by setting U = (R m ) d and Λ = Id. Moreover, one considers the system W to contain indicator functions on cubes.
To be more precise, define the index set S to be the system of all d-dimensional cubes in X and set ω_S = χ_S / √#S. Then, the MR-statistic in (4) reduces to

T(v) = max_{S∈S} (1/√#S) ∑_{ν∈S} v_ν.   (5)

This statistic was introduced in [42] (called scanning statistic there) in order to detect a signal against a noisy background. It was shown in [31] that T, after suitable transformations, converges weakly to a Gumbel distribution.
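The scanning statistic over all squares can be evaluated efficiently with an integral image (cumulative sums), so that each box sum costs O(1). The following sketch is our own illustration, for d = 2, Λ = Id and the two-sided variant using |⟨ω_S, v⟩|; the function name `mr_statistic` and the restriction to a given list of side lengths are ours.

```python
import numpy as np

def mr_statistic(v, cube_sizes):
    """Multiresolution statistic over all square subsets S of the grid with
    omega_S = chi_S / sqrt(#S): the maximum of |<omega_S, v>| over S.
    Box sums are computed via an integral image in O(1) per box."""
    m = v.shape[0]
    # integral image: P[i, j] = sum of v[:i, :j]
    P = np.zeros((m + 1, m + 1))
    P[1:, 1:] = np.cumsum(np.cumsum(v, axis=0), axis=1)
    T = -np.inf
    for s in cube_sizes:  # side lengths of the squares considered
        box = P[s:, s:] - P[:-s, s:] - P[s:, :-s] + P[:-s, :-s]
        T = max(T, float(np.abs(box).max()) / np.sqrt(s * s))
    return T
```

For a constant image of ones, the normalized sum over an s × s square is s²/√(s²) = s, so the statistic grows with the largest scale that is scanned.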
If the system S is reduced to the set of all dyadic squares, then it was proved in [28] that (after suitable transformations) T also converges weakly to the Gumbel distribution. There, the authors also established a method for locally adaptive image denoising employing linear diffusion equations with spatially varying diffusivity. SMREs (3) have been studied recently for the case d = 1 in [12] and [6], where the total-variation penalty and the number of jumps in piecewise constant regression were considered as regularization functional J, respectively. In [21] consistency and convergence rates for SMREs have been studied in a general Hilbert space setting. SMREs with squared residuals, that is Λ(v)_ν = v_ν², yield another class of estimators that have attracted much attention. Above all, the situation where S consists of the single set X and ω_X is chosen to be the constant 1 function is of special interest, since then (3) reduces to penalized least squares estimation. In particular, (3) can then be rewritten as

inf_{u∈U} (1/2) ‖Ku − Y‖² + λ J(u)   (6)

for a suitable multiplier λ > 0. If J(u) = ‖u‖_1, the LASSO estimator results (cf. [43]). Recently, also non-trivial choices of S were considered. In [3], S is chosen to consist of a partition of X which is obtained beforehand by a Mumford-Shah segmentation. In [15], a subset S ⊂ X is fixed and afterwards S is defined as the collection of all translates of S.
In [17] MR-statistics are used for shape-constrained estimation based on testing qualitative hypotheses in nonparametric regression for d = 1. Here, the weight functions ω_S incorporate qualitative features such as monotonicity or concavity. Similarly, MR-statistics are used in [18] in order to detect locations of local increase and decrease in density estimation. Much in the same spirit is the work in [16], where multiscale sign tests are employed for computing confidence bands for isotonic median curves.
As mentioned previously, the Dantzig selector [7] is also covered by the general SMRE framework in (3). To see this, set U = R^p (with typically p ≫ m), Λ = Id and define the weights ω_S = Kχ_S for S ∈ S. Then, each solution of (3) can be considered as a generalized Dantzig selector. The matrix K ∈ R^{m×p} in this context is usually interpreted as the design matrix of a high dimensional linear model. The classical Dantzig selector as introduced in [7] then results in the special case where S only consists of single-element subsets of {1, . . . , p} and J is chosen to be the ℓ¹-regularization functional J(u) = ‖u‖_1. Hence, LASSO and Dantzig selector are uni-scale estimators which take into account the largest (S = {X}) and smallest (S consists of all singletons in {1, . . . , p}) scales, respectively. In this sense, they constitute two extreme cases of SMRE.
We note that in practical applications the number of constraints in (7), that is the cardinality of the index set S, can be quite large (in Section 3.2 denoising of a 512 × 512 image results in more than 6 million inequalities). Moreover, the inequalities (even in the simplest case where Λ = Id) are mutually correlated. Both of these facts turn (7) into a numerically challenging problem, and standard approaches (such as interior point or conjugate gradient methods) perform far from satisfactorily. The authors in [3,15,28] approach the numerical solution of (7) by means of an analogue of (6) with a spatially dependent multiplier λ ∈ (R^m)^d.
Starting from a (constant) initial parameter λ = λ_0, the parameter λ is iteratively adjusted by increasing it in regions which were poorly reconstructed before according to the MR-statistic T. This approach strongly depends on the special structure of S that allows a straightforward identification of each set S ∈ S with a unique point in the grid X. Put differently, it is not clear how to modify this paradigm in order to solve (3) for highly redundant systems S as we have in mind. Recently, a general algorithmic framework was introduced in [2] for the solution of large-scale convex cone problems, where the constraint set is a convex cone 𝒦 in some Euclidean space. The approach was realized in the software package Templates for First-Order Conic Solvers (TFOCS). This formulation is very general, and in order to recover (7) one has to consider the cone generated by the constraints in (7). The approach in [2] employs the dual formulation of the problem, which involves the computation of the dual cone 𝒦* and the Legendre-Fenchel dual J* of J. This approach is particularly appealing for the uni-scale Dantzig selector, since in this situation the cone 𝒦 coincides with the epigraph of the ℓ^∞-norm and hence its dual cone is straightforward to compute (it is the epigraph of the ℓ¹-norm). As it is argued in [2], this approach is capable of computing Dantzig selectors for large scale problems, in contrast to previous approaches such as standard linear programming techniques [7] or homotopy methods such as DASSO [29] or [41]. As the authors stress, their approach works well in the case when 𝒦 is the epigraph of a norm for which the projections onto 𝒦* are tractable and computationally efficient.
However, for the applications we have in mind (such as locally adaptive image reconstruction), the approach in [2] is only of limited use: In contrast to the aforementioned epigraphs, the large number of (strongly dependent) constraints in (7) brings about a cone 𝒦 that on the one hand exhibits a tremendous number of faces compared to the dimension of the image space dim(H) = m^d, and that on the other hand is no longer symmetric w.r.t. the q-axis. Both of these facts turn the computation of the dual cone 𝒦* (or the projections onto it) into a most challenging problem, even in the simplest case when Λ is linear. The aim of this paper is to develop a general algorithmic framework that makes solutions of (7) numerically accessible for many applications. In order to do so, we propose to introduce a slack variable in (7) and then use the alternating direction method of multipliers, an Uzawa-type algorithm that decomposes problem (7) into a J-penalized least squares problem for the primal variable and an orthogonal projection problem onto the feasible set of (7) for the slack variable. This approach has the appealing effect that once an implementation for the projection problem is established, different regularization functionals J can easily be employed without changing the backbone of the algorithm. Our work is much in the same spirit as [33], which recently considered an alternating direction method for the computation of the Dantzig selector. In that case the occurring orthogonal projections are available in closed form, whereas in our applications this is not the case, due to the aforementioned dependencies.
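To make the slack-variable decomposition concrete, here is a minimal self-contained sketch of the resulting ADMM iteration, under strong simplifying assumptions: K = Id, Λ = Id, singleton sets S (so the feasible set is an ℓ^∞-ball and the projection has a closed form) and the quadratic penalty J(u) = ½‖u‖². The function name `admm_smre`, the parameter defaults and the step ordering are ours; the general scheme is Algorithm 1 below.

```python
import numpy as np

def admm_smre(Y, q, lam=1.0, n_iter=200):
    """ADMM sketch for  min 0.5*||u||^2  s.t.  ||Y - u||_inf <= q
    (K = Id, Lambda = Id, single-pixel constraints).
    v-update: projection onto the l_inf ball of radius q;
    u-update: closed-form penalized least squares; p: dual variable."""
    u = np.zeros_like(Y)
    v = np.zeros_like(Y)
    p = np.zeros_like(Y)
    for _ in range(n_iter):
        # v-step: project Y - K u + lam * p onto {v : ||v||_inf <= q}
        v = np.clip(Y - u + lam * p, -q, q)
        # u-step: argmin_u 0.5*||u||^2 + (2*lam)^{-1}*||u + v - Y - lam*p||^2
        u = (Y + lam * p - v) / (1.0 + lam)
        # dual ascent on the linear constraint K u + v = Y
        p = p - (u + v - Y) / lam
    return u
```

For this toy problem the exact solution is the componentwise soft-thresholding of Y at level q (the smallest u with |Y_ν − u_ν| ≤ q), which the iteration approaches geometrically.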
In order to tackle the orthogonal projection problem, we employ Dykstra's projection method [5], which is capable of computing the projection onto the intersection of convex bodies by merely using the individual projections onto the latter. The efficiency of the proposed method hence increases considerably if the index set S can be decomposed into "few" partitions that contain indices of mutually independent inequalities in (7). In particular, by this approach we will be able to compute classical SMREs (as introduced in [12,21]) in d = 2 space dimensions, which to our knowledge has never been done before. This puts us in the position to study the performance of such estimators compared with other benchmark methods in locally adaptive signal recovery (such as adaptive weights smoothing, cf. [40]). As it will turn out in Section 3, the SMRE outperforms these visually as well as quantitatively.

1.3. Organization of the Paper. The paper is organized as follows: In Section 2 we introduce a general algorithmic approach for computing SMREs. We will rewrite (7) into a linearly constrained problem and compute a saddle point of the corresponding augmented Lagrangian by the alternating direction method of multipliers in Paragraph 2.2. Under quite general assumptions, we prove convergence of the algorithm in Theorem 2.2 and give some qualitative estimates for the iterates in Theorem 2.4. One of the occurring minimization steps amounts to the computation of an orthogonal projection onto a convex set in Euclidean space. In Paragraph 2.3, this problem will be tackled by means of Dykstra's projection algorithm introduced in [5]. Finally, we illustrate the performance of some particular instances of SMREs in Section 3: we study problems in nonparametric regression, image denoising and deconvolution of fluorescence microscopy images, and compare our results to other methods by means of simulations.

Computational Methodology
In this section we will address the question of how to solve the constrained optimization problem (7). After discussing some notation and basic assumptions in Subsection 2.1, we will reformulate the problem in Paragraph 2.2 such that the alternating direction method of multipliers (ADMM), an Uzawa-type algorithm, can be employed as a solution method. As a result, the task of computing a solution of (7) is replaced by alternating between i) solving an unconstrained penalized least squares problem that is independent of the MR-statistic T and ii) computing the orthogonal projection onto a convex set in Euclidean space that is independent of J. This reveals the appealing modular nature of our approach: The regularization functional J can easily be replaced once a method for the projection problem is settled. For the latter we will propose an iterative projection algorithm in Paragraph 2.3 that was introduced by Boyle and Dykstra in [5].
2.1. Basic Assumptions and Notation. From now on, X will stand for the d-dimensional grid {1, . . . , m}^d and we agree upon H = R^X ≃ (R^m)^d being the space of all real valued functions v : X → R. Moreover, we assume that S denotes some index set and that W = {ω_S : S ∈ S} is a collection of elements in H. For two elements v, w ∈ H we will use the standard inner product ⟨v, w⟩ = ∑_{ν∈X} v_ν w_ν and the induced norm ‖v‖ = ⟨v, v⟩^{1/2}. Next, we assume that Λ : H → H is continuous such that Λ(0) = 0 and that for all S ∈ S the mapping v ↦ ⟨ω_S, Λ(v)⟩ is convex. With this notation, we can rewrite the average function in (5) in the compact form

µ_S(v) = ⟨ω_S, Λ(v)⟩,   S ∈ S.

We note that it would not be restrictive to consider, more generally, Λ : H → R^N with arbitrary N ∈ N. This could e.g. be useful for augmenting the constraint set of (7) with further constraints of a different type. For the signal and image detection problems studied in this paper, however, Λ is always a pointwise transformation of the residuals. Hence, we will restrict our considerations to the case Λ : H → H.
Furthermore, we define U to be a separable Hilbert space with inner product ⟨·, ·⟩_U and induced norm ‖·‖_U. The operator K : U → H is assumed to be linear and bounded, and the functional J : U → R is convex and lower semi-continuous, that is, J(u) ≤ lim inf_{k→∞} J(u_k) whenever u_k ⇀ u in U. Recall the definition of the MR-statistic in (4). Throughout this paper we will agree upon the following assumption.

Assumption A. i) There exists ū ∈ U such that T(Y − Kū) < q. ii) For all y ∈ H and c ∈ R the set {u ∈ U : J(u) + ‖Ku − y‖² ≤ c} is bounded.

Under Assumption A it follows from standard techniques in convex optimization that a solution of (7) exists. As we will discuss in Section 2.2, it even follows that a saddle point of the corresponding Lagrangian exists (cf. Theorem 2.1 below). In this context, Assumption A i) is often referred to as Slater's constraint qualification and is for instance satisfied if K(U) is dense in H. Moreover, Assumption A ii) will be needed in order to guarantee convergence of the algorithm for computing such a solution, as proposed in the upcoming section.
In many applications U is some function space and J a gradient based regularization method, such as the total variation semi-norm (cf. Section 3.2). Then a typical sufficient condition for Assumption A ii) is that K does not annihilate constant functions.

2.2. Alternating Direction Method of Multipliers.
By introducing a slack variable v ∈ H we rewrite (7) into the equivalent problem

inf_{u∈U, v∈H} J(u) + G(v)   subject to   Ku + v = Y.   (8)

Here, G denotes the characteristic function of the feasible region C of (7), that is,

G(v) = 0 if v ∈ C and G(v) = +∞ otherwise, where C = {v ∈ H : µ_S(v) ≤ q for all S ∈ S}.

Note that due to the assumptions on Λ, the set C is closed and convex. The technique of rewriting (7) into (8) is referred to as the decomposition-coordination approach, see e.g. Fortin & Glowinski [20, Chap. III]. There, Lagrangian multiplier methods are used for solving (8).
To this end, we recall the definition of the augmented Lagrangian of Problem (8), that is

L_λ(u, v, p) = J(u) + G(v) − ⟨p, Ku + v − Y⟩ + (2λ)^{−1} ‖Ku + v − Y‖².   (10)

The name stems from the fact that the ordinary Lagrangian is augmented by the quadratic penalty term (2λ)^{−1}‖Ku + v − Y‖² that fosters the fulfillment of the linear constraint in (8). The augmented Lagrangian method consists in computing a saddle point (û, v̂, p̂) of L_λ, that is

L_λ(û, v̂, p) ≤ L_λ(û, v̂, p̂) ≤ L_λ(u, v, p̂)   for all (u, v, p) ∈ U × H × H.

We note that each saddle point (û, v̂, p̂) of the augmented Lagrangian L_λ is already a saddle point of the ordinary Lagrangian L and vice versa, and that in either case the pair (û, v̂) is a solution of (8) (and thus û is a desired solution of (7)). This follows e.g. from [20, Chap. III], provided that the feasible set is non-empty and the coercivity condition (11) holds. According to Assumption A i) and due to the continuity of Λ, the first requirement is clearly satisfied. Further, the coercivity assumption (11) is a consequence of Assumption A ii).
We will use the Alternating Direction Method of Multipliers (ADMM) (cf. Algorithm 1), as proposed in [20, Chap. III Sec. 3.2], for the computation of a saddle point of L_λ (and hence of a solution of (7)): Successive minimization of the augmented Lagrangian L_λ w.r.t. the first and second variable, followed by an explicit step for maximizing w.r.t. the third variable. Convergence of this method is established in Theorem 2.2, which is a generalization of [20, Chap. III Thm. 4.1]. We note that the proof, as presented in Appendix A, allows for approximate solutions of the individual subproblems. For the sake of simplicity, we present the algorithm in its exact form.
Theorem 2.2. Every sequence {(u_k, v_k)}_{k≥1} that is generated by Algorithm 1 is bounded in U × H and every weak cluster point is a solution of (8). Moreover, Ku_k + v_k − Y → 0 as k → ∞.

Remark 2.3. i) Theorem 2.2 implies that each weak cluster point of {u_k}_{k≥1} is a solution of (7). In particular, if the solution u† of (7) is unique (e.g. if J is strictly convex), then u_k ⇀ u†. ii) Note in particular that (12) is independent of the choice of J, while (13) is independent of the multiresolution statistic being used. This decomposition gives the proposed method a neat modular appeal: once an efficient solution method for the projection problem (12) is established (see e.g. Section 2.3), the regularization functional J in (3) can easily be replaced by providing an algorithm for the penalized least squares problem (13). For most popular choices of J, problem (13) is well studied and efficient computational methods are at hand (see [44] for an extensive collection of algorithms and [32] for an overview of MCMC methods).
For a given tolerance τ > 0, Theorem 2.2 implies that Algorithm 1 terminates and outputs approximate solutions u[τ] and v[τ] of (8). However, the breaking condition in Algorithm 1

Algorithm 1 Alternating Direction Method of Multipliers

Require: Y ∈ H (data), λ > 0 (step size), τ > 0 (tolerance), initial values u_0 ∈ U, p_0 ∈ H
repeat for k = 1, 2, . . .
  v_k ← v̄ where v̄ solves the projection problem (12)
  u_k ← ũ where ũ solves the penalized least squares problem (13)
  p_k ← p_{k−1} − λ^{−1} (K u_k + v_k − Y)
until ‖K u_k + v_k − Y‖ ≤ τ

merely guarantees that the linear constraint in (8) is satisfied up to the tolerance τ. The result in Theorem 2.4 shows how the accuracy of the approximate solution u[τ] depends on τ. Moreover, it reveals that choosing a small step size λ in Algorithm 1 possibly yields a slow decay of the objective functional J. However, it follows from the definition of L_λ in (10) that a small value of λ fosters the linear constraint in (8).
Corollary 2.5. Let the assumptions of Theorem 2.4 be satisfied. Moreover, assume that J is a quadratic functional, i.e. J(u) = ½ ‖Lu‖²_V, where V is a further Hilbert space and L : U ⊃ D → V is a linear, densely-defined and closed operator. Then the estimates of Theorem 2.4 can be sharpened accordingly.

Example 2.6 (Dantzig selector). As already mentioned in the introduction, SMRE (i.e. finding solutions of (3)) reduces to the computation of Dantzig selectors in the particular setting d = 1, U = R^p (with usually p ≫ m), Λ = Id and J(u) = ‖u‖_1. When applying Algorithm 1, the subproblem (13) amounts to computing a minimizer of

u ↦ (2λ)^{−1} ‖Ku + v_k − Y − λ p_{k−1}‖² + ‖u‖_1.

This is the well-known least absolute shrinkage and selection operator (LASSO) estimator [43]. For the classical Dantzig selector, one chooses S = {1, . . . , p} and defines for S ∈ S the weight ω_S = Kχ_{{S}}. Hence, the subproblem (12) in this case consists in the orthogonal projection onto the set {v ∈ H : ‖K^T v‖_∞ ≤ q}. The implications of Theorem 2.4 in the present case are in general rather weak. If the saddle point û is known to be S-sparse and K restricted to the support of û is injective, then stronger estimates can be shown. We finally note that for this particular situation a slightly different decomposition than the one proposed in Section 2.2 is favorable. To be more precise, define K̃ = K^T K and Ỹ = K^T Y and consider the problem

inf_{u, v} ‖u‖_1 + G̃(v)   subject to   K̃u + v = Ỹ,

where G̃ is the characteristic function of the set {v ∈ H : ‖v‖_∞ ≤ q}. Algorithm 1 applied to this modified decomposition then results in the ADMM as introduced in [33]. In this case the projection in step (12) has a closed form.
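Both building blocks of the modified decomposition are elementary: the projection onto the ℓ^∞-ball is a componentwise clipping, and the proximal map of the ℓ¹-norm, which appears when solving the LASSO subproblem (13) by proximal (iterative thresholding) schemes, is componentwise soft-thresholding. A hedged sketch, with function names of our choosing:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of t*||.||_1: componentwise soft-thresholding,
    the basic building block of proximal schemes for the LASSO step (13)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def linf_project(v, q):
    """Closed-form projection onto {v : ||v||_inf <= q}, i.e. the
    projection step of the modified decomposition with G-tilde."""
    return np.clip(v, -q, q)
```

Both maps act independently on each coordinate, which is precisely why this decomposition avoids the coupled projection problem of the general SMRE constraints.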
2.3. The Projection Problem. Algorithm 1 resolves the constrained convex optimization problem (7) into a quadratic program (12) and an unconstrained optimization problem (13). The quadratic program (12) in the k-th step of Algorithm 1 can be written as a projection:

P_C(Y_k),   where   Y_k = Y − K u_{k−1} + λ p_{k−1}   and   C = ∩_{S∈S} C_S   with   C_S = {v ∈ H : µ_S(v) ≤ q}.   (14)

The sets C_S are closed and convex, and problem (14) thus amounts to computing the projection P_C(Y_k) of Y_k onto the intersection C of closed and convex sets. According to this interpretation, we use Dykstra's projection algorithm as introduced in [5] to solve (14). This algorithm takes an element v ∈ H and convex sets D_1, . . . , D_M ⊂ H as arguments. It then creates a sequence converging to the projection of v onto the intersection of the D_j by successively performing projections onto the individual D_j's. To this end, let P_D(·) denote the projection onto D ⊂ H and S_D = P_D − Id the corresponding projection step. Dykstra's method is summarized in Algorithm 2.
A natural explanation of the algorithm in a primal-dual framework, as well as a proof that the sequence {h_k}_{k∈N} converges to P_D(h) in norm, can be found in [13,23]. For the case when D constitutes a polyhedron, even explicit error estimates are at hand (cf. [46]):

Theorem 2.7. Let {h_k}_{k∈N} be the sequence generated by Algorithm 2 and P_D(h) be the projection of the input h onto D. Then there exist constants ρ > 0 and 0 ≤ c < 1 such that ‖h_k − P_D(h)‖ ≤ ρ c^k for all k ∈ N.

Remark 2.8. The constant c increases with the number M of convex sets whose intersection forms the set D that h is to be projected onto. The convergence rate therefore improves with decreasing M. For further details and estimates for the constants ρ and c, we refer to [46].
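Dykstra's method is short enough to state in a few lines of code. The following sketch (names ours) projects h onto ∩_j D_j given only the individual projection maps; the correction terms r_j are what distinguish it from plain alternating projections and make the limit the orthogonal projection onto the intersection:

```python
import numpy as np

def dykstra(h, projections, n_iter=100):
    """Dykstra's algorithm: computes the orthogonal projection of h onto
    the intersection of closed convex sets D_1, ..., D_M, given only the
    individual projections P_{D_j} (passed as callables)."""
    x = h.copy()
    M = len(projections)
    r = [np.zeros_like(h) for _ in range(M)]  # correction increments
    for _ in range(n_iter):
        for j, P in enumerate(projections):
            y = P(x + r[j])          # project the corrected iterate
            r[j] = x + r[j] - y      # update the j-th correction
            x = y
    return x
```

For polyhedral sets the convergence is geometric (cf. Theorem 2.7), so moderate iteration counts already yield high accuracy in practice.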

Algorithm 2 Dykstra's Algorithm
Require: h ∈ H (data), D_1, . . . , D_M ⊂ H (closed and convex sets)
initialize r_j ← 0 for j = 1, . . . , M
repeat
  for j = 1, . . . , M do
    w ← h + r_j
    h ← P_{D_j}(w)
    r_j ← −S_{D_j}(w)
until convergence

Note that the application of Dykstra's algorithm is particularly appealing if the projections P_{D_j} can be easily computed or even stated explicitly, as is the case in the following examples.
Example 2.9. Assume that Λ = Id. Then the sets C_S are the rectangular cylinders

C_S = {v ∈ H : |⟨ω_S, v⟩| ≤ q}.

The projection can therefore be explicitly computed as

P_{C_S}(v) = v + ( max(−q, min(q, ⟨ω_S, v⟩)) − ⟨ω_S, v⟩ ) ω_S / ‖ω_S‖².

The left image in Figure 1 depicts an example of C for H = R². For a detailed geometric interpretation of the MR-statistic we also refer to [36].
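Assuming the two-sided form of the constraint with a normalized weight (‖ω_S‖ = 1, as for ω_S = χ_S/√#S), the projection in the case Λ = Id can be coded directly; the function name `project_slab` is ours:

```python
import numpy as np

def project_slab(v, omega, q):
    """Projection onto C_S = {v : |<omega_S, v>| <= q} for Lambda = Id,
    assuming ||omega_S|| = 1. Moves v along omega_S just far enough that
    the weighted average of the residual lies in [-q, q]; if v is already
    feasible it is returned unchanged."""
    t = float(np.dot(omega.ravel(), v.ravel()))
    return v + (np.clip(t, -q, q) - t) * omega
```

Since the correction is a rank-one update along ω_S, each projection costs only O(#S) operations when ω_S is supported on S.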
Example 2.10. Assume that Λ(v)_ν = v_ν². Then the mapping v ↦ ⟨ω_S, Λ(v)⟩ is convex if and only if ω_{S,ν} ≥ 0 for all ν ∈ X. In this case, the sets C_S are the elliptic cylinders

C_S = {v ∈ H : ∑_{ν∈X} ω_{S,ν} v_ν² ≤ q}.

Moreover, if ω_{S,ν} ∈ {0, 1} for all ν ∈ X, then the projection P_{C_S} can be explicitly computed as

P_{C_S}(v)_ν = v_ν · min( 1, √q / (∑_{µ∈S} v_µ²)^{1/2} )   for ν ∈ S,   and   P_{C_S}(v)_ν = v_ν   otherwise.

The right image in Figure 1 depicts an example of C for H = R².
A first approach to using Dykstra's algorithm to solve (14) is to set M = #S and identify D_j with C_S for all j = 1, . . . , M. In view of Remark 2.8, however, it is clearly desirable to decrease the number M of convex sets that enter Dykstra's algorithm. In order to do so, we take a slightly more sophisticated approach than the one just presented. We partition the index set S = {1, . . . , N} into subsets S_1, . . . , S_M whose constraints

µ_S(v) ≤ q,   S ∈ S_j,   (15)

can be treated independently, in the sense that

supp ω_S ∩ supp ω_{S̃} = ∅   for all S ≠ S̃ ∈ S_j,   (16)

and define D_j = ∩_{S∈S_j} C_S. Given the projections P_{C_S}, the projection onto D_j can be easily computed: For v ∈ H, identify the set V_j = {S ∈ S_j : µ_S(v) > q} of indices for which v violates the side condition (15) and apply the projections P_{C_S} for all S ∈ V_j simultaneously, leaving v unchanged elsewhere. To keep M small, we choose S_1 ⊂ {1, . . . , N} as the biggest set such that (16) holds for all S, S̃ ∈ S_1. We then choose S_2 ⊂ S \ S_1 with the same property and continue in this way until all indices are utilized. While this procedure does not necessarily result in M being minimal with the desired property, it still yields a distinct reduction of N in many practical situations. We will illustrate this approach for SMREs in imaging in Section 3.

Figure 1. Admissible set C for the projection problem (14) as in Example 2.9 (left) and Example 2.10 (right).
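The greedy grouping described above can be sketched as follows. We assume here that "independent" means disjoint supports of the corresponding weights; the function name and data layout are ours:

```python
def greedy_partition(supports):
    """Greedy grouping of the index set S: each group collects sets with
    pairwise disjoint supports, so their constraints can be treated as one
    convex set D_j in Dykstra's algorithm. `supports` is a list of
    frozensets of grid indices; returns a list of groups, each a list of
    indices into `supports`."""
    groups = []  # list of (union_of_used_indices, member_list) pairs
    for k, s in enumerate(supports):
        for used, members in groups:
            if used.isdisjoint(s):   # fits into this group
                used |= s
                members.append(k)
                break
        else:                        # no group fits: open a new one
            groups.append((set(s), [k]))
    return [members for _, members in groups]
```

The scan is first-fit rather than exact, so the number of groups M need not be minimal, but for systems of overlapping cubes it typically reduces M far below #S.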

Applications
In this section we will illustrate the capability of Algorithm 1 for computing SMREs in some practical situations: in Section 3.1 we study a simple one-dimensional regression problem, as it was also studied in [12], yet with a different penalty function J. In Section 3.2 we illustrate how SMREs perform in image denoising. In both cases we compare our results to other methods. Finally, we apply the SMRE technique to the problem of image deblurring in confocal fluorescence microscopy in Section 3.3.
Before we study the aforementioned examples, we clarify some common notation. We will henceforth assume that U = H = (R^m)^d with d = 1 (Section 3.1) and d = 2 (Sections 3.2 and 3.3), respectively. Moreover, we will employ gradient based regularization functionals of the form

T V_p(u) = (1/p) ∑_{ν∈X} |(Du)_ν|_2^p,   p ∈ {1, 2},

where |·|_2 is the Euclidean norm in R^d and D denotes the forward difference operator defined by

(Du)_ν = (u_{ν+e_1} − u_ν, . . . , u_{ν+e_d} − u_ν),

with the differences set to zero at the boundary of the grid. For the case p = 2 the minimization problem (13) amounts to solving an implicit time step of the d-dimensional diffusion equation with initial value (Y + λp_{k−1} − v_k) and time step size λ. This can be done by a simple (sparse) matrix inversion.
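The forward difference operator and the TV_p functionals can be sketched as follows. We use the 1/p normalization and zero differences at the right/bottom boundary; both conventions, as well as the names, are ours and may differ in detail from the definitions above:

```python
import numpy as np

def forward_diff(u):
    """Forward differences of a 2-d array, zero at the last row/column:
    returns the discrete gradient Du with shape (2,) + u.shape."""
    gx = np.zeros_like(u); gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy = np.zeros_like(u); gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.stack([gx, gy])

def tv(u, p=1, beta=0.0):
    """Discrete (1/p) * sum |(Du)_nu|_2^p; beta > 0 gives a smoothed TV_1
    variant in the spirit of (19), avoiding the non-differentiability at 0."""
    g = forward_diff(u)
    mag = np.sqrt(g[0] ** 2 + g[1] ** 2 + beta ** 2)
    return float(np.sum(mag ** p)) / p
```

For a unit step in one direction, TV_1 sums the jump height along the edge, while TV_2 penalizes the squared jumps, which explains the smoothing behavior of the p = 2 diffusion step.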
For the case p = 1, T V_1 is better known as the total-variation semi-norm. It was shown in [34] (see also [24] for similar results in the continuous setting) that the taut-string algorithm (as introduced in [10]) constitutes an efficient solution method for (13) in the case d = 1. In the general case d ≥ 1, we employ the fixed point approach for solving the Euler-Lagrange equations for (13) described in [14] (see also [44, Chap. 8.2.4]). We finally note that the functional T V_1 fails to be differentiable, a fact that leads to serious numerical problems when trying to compute the Euler-Lagrange conditions for (13). Hence, in our simulations we will use a regularized version of T V_1 defined by

T V_1^β(u) = ∑_{ν∈X} ( |(Du)_ν|_2² + β² )^{1/2}   (19)

for a small constant β > 0.
Evaluation. In order to evaluate the performance of SMREs, we will employ various distance measures between an estimator û and the true signal u_0. On the one hand, we will use standard measures such as the mean integrated squared error (MISE) and the mean integrated absolute error (MIAE), which are given by

MISE(û) = E ‖û − u_0‖²   and   MIAE(û) = E ‖û − u_0‖_1,

respectively. On the other hand, we also intend to measure how well an estimator û matches the "smoothness" of the true signal u_0, where smoothness is characterized by the regularization functional J. To this end, we introduce the symmetric Bregman divergence

D^sym_J(û, u_0) = ⟨∇J(û) − ∇J(u_0), û − u_0⟩,

where ∇J denotes the gradient of the regularization functional J. Clearly, D^sym_J(û, u_0) is symmetric and, since J is assumed to be convex, also non-negative. However, the symmetric Bregman divergence usually does not satisfy the triangle inequality and hence in general does not define a (semi-)metric on U [9]. The following examples shed some light on how the Bregman divergence incorporates the functional J in order to measure the distance of û and u_0. i) If p = 2, then D^sym_{T V_2}(û, u_0) = ‖Dû − Du_0‖². In other words, the symmetric Bregman distance w.r.t. T V_2 is the mean squared distance of the derivatives of û and u_0.
ii) If p = 1, then the symmetric Bregman divergence can be expressed in terms of the angles γ_ν between the level lines of û and u_0 at the points x_ν. Put differently, the symmetric Bregman divergence w.r.t. the total variation semi-norm T V_1 is small if for sufficiently many points x_ν either both û and u_0 are constant in a neighborhood of x_ν or the level lines of û and u_0 through x_ν are parallel. In practice, T V_1^β in (19) (for a small β > 0) is used instead of T V_1 in order to avoid singularities. Then the above formulas become slightly more complicated.
We will use the mean symmetric Bregman divergence (MSB), given by

MSB(û) = E D^sym_J(û, u_0),

as an additional evaluation measure. In all our simulations we approximate the expectations above by empirical means over 500 trials.
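In the quadratic case the divergence is explicit: for J(u) = ½‖∇u‖² one has D^sym_J(û, u_0) = ‖∇û − ∇u_0‖². A 1-D sketch with forward differences (our choice of discretization) together with a Monte Carlo approximation of the MSB; the trial count is illustrative and far smaller than the 500 trials used in the paper.

```python
import numpy as np

def sym_bregman_quadratic(u_hat, u0):
    """D^sym for J(u) = 0.5 * ||D u||^2 with D the forward difference
    matrix: <D'D(u_hat - u0), u_hat - u0> = ||D(u_hat - u0)||^2."""
    return float(np.sum((np.diff(u_hat) - np.diff(u0)) ** 2))

def msb(estimator, u0, sigma, trials=100, rng=None):
    """Mean symmetric Bregman divergence, approximated by an empirical
    mean over simulated noisy observations, as described in the text."""
    rng = np.random.default_rng(0) if rng is None else rng
    vals = [sym_bregman_quadratic(
                estimator(u0 + sigma * rng.standard_normal(len(u0))), u0)
            for _ in range(trials)]
    return float(np.mean(vals))
```

Note that the divergence is blind to constant shifts of û, reflecting that J only sees derivatives.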
Comparison with other methods. We will compare the SMREs to other regression methods. Firstly, we will consider estimators obtained by the global penalized least squares method

û(λ) ∈ argmin_{u∈U} ‖Ku − Y‖² + λ J(u). (20)

In particular, we focus on estimators û(λ) that are closest (in some sense) to the true function u_0. We call such estimators oracles. We define the L²- and Bregman-oracle by û_{L²} = û(λ_{L²}) and û_B = û(λ_B), where

λ_{L²} ∈ argmin_λ E ‖û(λ) − u_0‖²  and  λ_B ∈ argmin_λ E D^sym_J(û(λ), u_0),

respectively. Of course, oracles are not available in practice, since the true signal u_0 is unknown. However, they represent ideal instances within the class of estimators given by (20) that usually perform better than any data-driven parameter choice (such as cross-validation) and hence may serve as a reference. Secondly, we also compare our approach to adaptive weights smoothing (AWS) [40], which constitutes a benchmark technique for data-driven, spatially adaptive regression. We compute these estimators by means of the official R-package 2 and denote them by û^ker_aws, where ker ∈ {Gaussian, Triangle} encodes the shape of the underlying regression kernel.
3.1. Non-parametric Regression. In this section we apply the SMRE technique to a nonparametric regression problem in d = 1 dimension, i.e. the noise model (1) becomes

Y_ν = (u_0)_ν + ε_ν,  ν = 1, …, m, (21)

where we assume that the ε_ν are independent and normally distributed r.v. with E(ε_ν) = 0 and E(ε²_ν) = σ². The upper left image in Figure 2 depicts the true signal u_0 (solid line) and the data Y, with m = 1024 and σ = 0.5. The application we have in mind with this example arises in NMR spectroscopy, where NMR spectra provide structural information on the number and type of chemical entities in a molecule. In this context, we suggest choosing J = TV_2, since the true signal u_0 is rather smooth (see [12] for examples where J is chosen to be the total variation of the first and second derivative).
Finally, we discuss the MR-statistic T in (4). We choose Λ = Id and the index set S to consist of all discrete intervals with side lengths ranging from 1 to 100. For an interval S ∈ S we set ω_S = (#S)^{−1/2} χ_S. Thus, each SMRE solves the constrained optimization problem

min_u TV_2(u)  subject to  max_{S∈S} (#S)^{−1/2} | Σ_{ν∈S} (Y − u)_ν | ≤ q. (22)

We choose q to be the α-quantile of the MR-statistic T, that is,

q_α = inf { q ∈ R : P( T(ε) ≤ q ) ≥ α }. (23)

We note that except for a few special cases (cf. [28, 30]) closed form expressions for the distribution of the MR-statistic T are usually not at hand. In practice one rather considers the empirical distribution of T, where the variance σ² can be estimated at a rate √md (cf. [38]). We will henceforth denote by û_α a solution of (22) with q = q_α. As argued in Section 1.1, û_α is smoother (i.e. has a smaller value of TV_2) than the true signal u_0 with probability at least α while it satisfies the constraint that the multiresolution statistic T does not exceed q_α. This gives a sound statistical interpretation of the regularization parameter α.
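In practice the quantile q_α is obtained by simulating T(ε). The following sketch shows this Monte Carlo step, using cumulative sums for the interval averages; the grid size, maximal interval length and number of trials are much smaller here than in our experiments.

```python
import numpy as np

def mr_statistic(eps, max_len=100):
    """Multiresolution statistic: maximum over all discrete intervals S
    of length 1..max_len of |sum_{nu in S} eps_nu| / sqrt(#S).
    All window sums of a given length come from one cumulative sum."""
    c = np.concatenate(([0.0], np.cumsum(eps)))
    t = 0.0
    for s in range(1, min(max_len, len(eps)) + 1):
        sums = c[s:] - c[:-s]              # all window sums of length s
        t = max(t, np.max(np.abs(sums)) / np.sqrt(s))
    return t

def mr_quantile(alpha, m=256, max_len=50, trials=500, rng=None):
    """Empirical alpha-quantile of T(eps) for standard normal noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    vals = [mr_statistic(rng.standard_normal(m), max_len)
            for _ in range(trials)]
    return float(np.quantile(vals, alpha))
```

The empirical distribution of T obtained this way replaces the unavailable closed form expression.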
Numerical results and simulations. In Figure 2 the oracles û_{L²} and û_B, the AWS estimators û^Triangle_aws and û^Gauss_aws as well as the SMREs û_0.9, û_0.75 and û_0.5 are depicted. It is evident that the SMREs match the smoothness of the true object much better than the other estimators while the essential features of the signal (such as peak location and peak height) are preserved. In particular, almost no additional local extrema are generated by our approach, which stands in obvious contrast to the other methods. Moreover, we point out that the SMREs are quite robust w.r.t. the choice of the confidence level α.
We verify this behavior by a simulation study in Table 1. For different noise levels (σ = 0.1, 0.3 and 0.5) we compare the MISE, MIAE and MSB. Additionally, we compute the mean number of local maxima (MLM) of û relative to the number of local maxima in u_0 (which is 11), where û is any of the above estimators. Note that the latter measure (similar to the MSB) takes into account the smoothness of the estimators: a value MLM ≫ 1 indicates too many local maxima and hence a lack of regularity, whereas MLM < 1 implies severe oversmoothing.
As becomes apparent from Table 1, the SMREs perform similarly well to the reference estimators as far as the standard measures MISE and MIAE are concerned. For small noise levels (σ = 0.1) SMREs even prove to be superior. The distance measures MSB and MLM, however, are significantly smaller for SMREs, which indicates that these match the smoothness of the true object u_0 much better than the reference estimators (cf. Example 3.1 i)). All in all, the simulation results confirm our visual impressions above.
Implementation Details. The current index set S results in an overall number of Σ_{s=1}^{100} (m − s + 1) = 97 450 constraints in (22). As pointed out in Section 2.3, the efficiency of Dykstra's algorithm can be increased by grouping independent side-conditions, that is, side-conditions corresponding to intervals in S with empty intersection. For example, the system S can be grouped such that each of the corresponding sets D_1, …, D_M in (17) encodes a family of mutually independent constraints.

Table 1. Simulation studies for the one dimensional peak data set.
In all our simulations we set τ = 10^−4 and λ = 1.0 in Algorithm 1, which results in k[τ] ≈ 100 iterations and an overall computation time of approximately 20 minutes for each SMRE. We note, however, that more than 95% of the computation time is spent in the projection step (12) and that a considerable speed-up for the latter could be achieved by parallelization.
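The projection step (12) itself is carried out with Dykstra's algorithm. The generic sketch below takes one projector per convex set; the two toy sets in the usage example stand in for the grouped constraint sets D_1, …, D_M.

```python
import numpy as np

def dykstra(x, projectors, n_iter=100):
    """Dykstra's cyclic projection algorithm: computes the orthogonal
    projection of x onto the intersection of convex sets, each given by
    its (exact) projector.  Unlike plain alternating projections, the
    correction terms p_i make the limit the orthogonal projection of x,
    not just some point of the intersection."""
    p = [np.zeros_like(x, dtype=float) for _ in projectors]
    y = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        for i, proj in enumerate(projectors):
            z = y + p[i]        # re-add the correction from the last cycle
            y = proj(z)
            p[i] = z - y        # store the new correction term
    return y
```

Grouping independent side-conditions means that each projector handles a whole family of disjoint intervals at once, which shortens the cycle length M.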
3.2. Image denoising. In this section we apply the SMRE technique to the problem of image denoising, that is, non-parametric regression in d = 2 dimensions. In other words, we consider the noise model (21) as in Section 3.1, where the index ν ranges over the discrete square {1, …, m}². In Figure 3 two typical examples for images u_0 and noisy observations Y are depicted (m = 512 and σ = 0.1, where u_0 is scaled between 0 (black) and 1 (white)). We will use the total-variation semi-norm J = TV^β_1 as regularization functional (β = 10^−8). Moreover, we choose Λ to be the quadratic transformation

Λ(x) = x², (24)

applied pointwise. The index set S is defined to be the collection of all discrete squares with side lengths ranging from 1 to 25 and we set ω_S = c_S χ_S with yet to be defined constants c_S. Thus, each SMRE solves the constrained optimization problem

min_u TV^β_1(u)  subject to  max_{S∈S} c_S Σ_{ν∈S} (Y − u)²_ν ≤ q. (25)

We agree upon q = 1 and specify the constants c_S. To this end, we compute for s = 1, …, 25 the α-quantiles q_{α,s²} of the scale-wise maximum statistics and set c_S = q^{−1}_{α,#S}. In other words, the definition of c_S implies that the true signal u_0 satisfies the constraints in (25) for squares of a fixed side length s with probability at least α. We will henceforth denote by û_α a solution of (25). We remark on this particular choice of the parameters ω_S below.

In Figures 4 and 5 the oracles û_{L²} and û_B, the AWS estimators û^Triangle_aws and û^Gauss_aws as well as the SMRE û_0.9 are depicted (for the "cameraman" and "roof" test image, respectively). It is rather obvious that the L²-oracles are not favorable: although relevant details in the image are preserved, smooth parts (such as the sky) still contain random structures. In contrast, the estimator û^Gauss_aws preserves smooth areas but loses essential details. The AWS estimator with triangular kernel performs much better; however, it gives piecewise constant reconstructions of smoothly varying portions of the image, which is clearly undesirable. The SMRE and the Bregman-oracle visually perform superior to the other methods.
The good performance of the Bregman-oracle indicates that the symmetric Bregman distance is a good measure for comparing images. In contrast to the Bregman-oracle, the SMRE adapts the amount of smoothing to the underlying image structure: constant image areas are smoothed nicely (e.g. sky portions), while oscillating patterns (e.g. the grass part in the "cameraman" image or the roof tiles in the "roof" image) are recovered.
We evaluate the performance of the SMREs by means of a simulation study. To this end, we compute the MISE, MIAE and MSB and compare these values with the reference estimators. We note, however, that the MISE and MIAE in particular are not well suited to measuring distances between images, for they are inconsistent with human visual perception. In [45] the structural similarity index (SSIM) was introduced for image quality assessment; it takes into account luminance, contrast and structure of the images at the same time. We use the authors' implementation 3, which is normalized such that the SSIM lies in the interval [−1, 1] and equals 1 in case of a perfect match. We denote by MSSIM the empirical mean of the SSIM in our simulations.
In Table 2 the simulation results are listed. A first striking fact is the good performance of the L²-oracle w.r.t. the MISE and MIAE, which might be taken to imply reconstruction properties superior to the other methods. Keeping in mind the visual comparison in Figures 4 and 5, however, this is rather questionable. On the other hand, it becomes evident that the L²-oracle has a rather poor performance w.r.t. the MSB, which is better suited for measuring image distances. It is therefore remarkable that the SMRE performs equally well as the Bregman-oracle which, in contrast to the SMRE, is not accessible (since u_0 is usually unknown). As far as the structural similarity measure MSSIM is concerned, our approach proves to be superior to all others. Finally, the simulation results indicate that AWS estimation is not favorable for denoising of natural images.

Table 2. Simulation studies for the test images "cameraman" and "roof".
Notes on the choice of Λ and ω_S. In general, a proper choice of the transformation Λ and of the weight functions ω_S can be achieved by including prior structural information on the true image to be estimated. Substantial parts of natural images, such as photographs, consist of oscillating patterns (e.g. fabric, wood, hair, grass etc.). This becomes obvious in the standard test images depicted in Figure 3. We claim that for signals that exhibit oscillating patterns, a quadratic transformation Λ as in (24) is favorable, since it yields (compared to the linear statistic studied in Section 3.1) a larger power of the local test statistic on small scales.
In order to illustrate this, we simulate noisy observations Y of the test images u_0 in Figure 3 as in (21) and consider the local means µ_{1,S} of the residuals and µ_{2,S} of the squared residuals over the squares S ∈ S. To be more precise, in Figure 6 the center coordinate of each square S ∈ S is colored according to µ_{i,S}. Hence, large values indicate locations where the estimator û is considered over-smoothed according to the respective statistic. It becomes visually clear that the localization of oversmoothed regions is better for µ_{2,S}. This is a good motivation for incorporating the local means of the squared residuals in the SMRE model (7).

Figure 6. Local means µ_{1,S} (left) and µ_{2,S} (right) of the residuals for the "roof" image.
We finally comment on the choice of c_S. Since the ε_ν are independent and normally distributed random variables, the scaled sum of squared residuals σ^{−2} µ_S(ε) = σ^{−2} Σ_{ν∈S} ε²_ν is χ²-distributed with #S degrees of freedom. Note that the distribution of σ^{−2} µ_S(ε) is identical only for sets S of the same scale #S. As a consequence, it is likely that certain scales dominate the supremum in the MR-statistic T, which spoils the multiscale properties of our approach. As a way out, we compute normalizing constants for each scale separately.
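The scale-wise normalization can be sketched as follows. We estimate the χ² quantiles by Monte Carlo rather than from the exact distribution, which is purely an implementation shortcut.

```python
import numpy as np

def scale_constants(alpha, sizes, trials=2000, rng=None):
    """For each scale #S in `sizes`, estimate the alpha-quantile of
    sum_{nu in S} eps_nu^2 / sigma^2 ~ chi^2_{#S} by Monte Carlo and
    return c_S = 1/quantile, so that every scale meets its constraint
    with the same probability alpha."""
    rng = np.random.default_rng(1) if rng is None else rng
    out = {}
    for n in sizes:
        # chi^2_n samples as sums of n squared standard normals
        samples = np.sum(rng.standard_normal((trials, n)) ** 2, axis=1)
        out[n] = 1.0 / float(np.quantile(samples, alpha))
    return out
```

Since χ² quantiles grow with the degrees of freedom, the constants c_S decrease with the scale, preventing large squares from dominating the supremum.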
An alternative approach would be to search for transformations that turn µ S (ε) into almost identically distributed random variables. Logarithmic and p-root transformations are often employed for this purpose (see e.g. [25]). This will be investigated separately.
Implementation Details. The current index set S results in a large overall number of constraints in (25). Again, by grouping independent side-conditions, the system S can be arranged such that each of the corresponding sets D_1, …, D_M in (17) encodes a family of mutually independent constraints. In all our simulations we set τ = 10^−4 and λ = 0.25 in Algorithm 1, which results in k[τ] ≈ 30 iterations and an overall computation time of approximately 2 hours for each SMRE. Hence, parallelization is clearly desirable in this case.

3.3. Deconvolution. Another interesting class of problems which can be approached by means of SMREs are deconvolution problems. To be more precise, we assume that K is a convolution operator, that is,

(Ku)_ν = Σ_{µ∈Z^d} k_{ν−µ} u_µ,

where k is a square-summable kernel on the lattice Z^d and u ∈ H is extended by zero-padding. We will focus on the situation where k is a circular Gaussian kernel with standard deviation σ, given by

k_ν = c exp( −|ν|² / (2σ²) ) (26)

with a normalizing constant c. With Z = Y + λp^{k−1} + v^k, the primal step (13) in Algorithm 1 amounts to solving the penalized least squares problem

min_{u∈U} λ J(u) + ½ ‖Ku − Z‖²,

where we choose J as in (18) and apply the techniques described in [44] for the numerical solution.
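A circular Gaussian blur of this kind is conveniently applied in the Fourier domain. A brief sketch; the periodization and normalization conventions below are our choices for illustration.

```python
import numpy as np

def gaussian_kernel(m, sigma):
    """Circular (periodized) Gaussian kernel on a 2-D m x m grid,
    normalized to sum to one.  The centering convention (peak at the
    origin, wrapped around) matches circular convolution via the FFT."""
    x = np.minimum(np.arange(m), m - np.arange(m))  # circular distance to 0
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur(u, k):
    """Apply the convolution operator K: circular convolution with
    kernel k, computed via the 2-D FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(u) * np.fft.fft2(k)))
```

Because the kernel sums to one, the blur preserves constants and total mass, and the operator diagonalizes in the Fourier basis, which is what makes the penalized least squares step efficient.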
In order to illustrate the performance of our approach in practical applications, we give an example from confocal microscopy, nowadays a standard technique in fluorescence microscopy (cf. [39]). When recording images with this kind of microscope, the original object gets blurred (to first order) by a Gaussian kernel. The observations (photon counts) can be modeled as a Poisson process, i.e.

Y_ν ∼ Poiss( (Ku_0)_ν ),  ν ∈ X. (27)
The image depicted in Figure 7(a) shows a recording of a PtK2 cell taken from the kidney of Potorous tridactylus. Before the recording, the protein β-tubulin was tagged with a fluorescent marker such that it can be traced by the microscope. The image in 7(a) shows an area of 18 × 18 µm² at a resolution of 798 × 798 pixels. The point spread function of the optical system can be modeled as a Gaussian kernel with a full width at half maximum of 230 nm, which corresponds to σ = 4.3422 in (26).
Note that (27) does not fall immediately into the range of models covered by (1). We adapt the present situation to the SMRE methodology described in Section 1 by standardization and consider, instead of (3), the modified problem

min_{u∈U} J(u)  subject to  T( (Y − Ku) / √(Ku) ) ≤ q, (28)

where the division is understood pointwise. Clearly, the problem of finding a solution of (28) is much more involved than solving (3), since the constraints are nonconvex: firstly, the functional G as defined in (9) is nonconvex, as a consequence of which the convergence result in Theorem 2.2 does not apply, and secondly, Dykstra's projection algorithm as described in Section 2 cannot be employed. We propose the following ansatz in order to circumvent this problem: instead of projecting onto the intersection C of the sets C_S as described in (15), we now project in the k-th step of Algorithm 1 onto the analogous intersection obtained after a pointwise division of the residuals by the square root of Ku^k. Put differently, in the k-th step of Algorithm 1 we use the previous estimate u^k of u_0 as a lagged standardization in order to approximate the constraints in (28). In fact, we use max(Ku^k, ε) with a small number ε > 0 for standardization, in order to avoid instabilities.
We note that while with this modification Dykstra's algorithm becomes applicable again, the projection problem (12) now changes in each iteration step of Algorithm 1. As a consequence, Theorem 2.2 does not hold anymore after this modification, either. So far, we have not come up with a similar convergence analysis.
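The lagged standardization of the k-th step can be written down directly; the clipping constant (here called `floor`) plays the role of the stabilizing constant ε mentioned above.

```python
import numpy as np

def standardized_residuals(Y, Ku_prev, floor=1e-6):
    """Lagged standardization for Poisson-type noise: divide the
    residuals pointwise by sqrt(max(K u^k, floor)), using the previous
    iterate to approximate the local variance.  The floor guards
    against division by (near) zero intensities."""
    denom = np.sqrt(np.maximum(Ku_prev, floor))
    return (Y - Ku_prev) / denom
```

The MR-statistic T is then evaluated on these standardized residuals, so that each constraint set in the cycle stays convex while the standardization is updated between outer iterations.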
We compute the SMRE û_0.9 by employing Algorithm 1 with the modifications described above. As in the denoising examples in Section 3.2, the index set S consists of all squares with side lengths {1, …, 25} and we choose ω_S = χ_S and Λ = Id. We note that this results in an overall number of #S = 95 436 200 inequality constraints. The constants q are chosen as in (23), where we assume that the ε_ν are independent and standard normally distributed r.v.
In Algorithm 1 we set λ = 0.05 and compute 100 steps. We observe that after a few iterations (∼ 15) the error τ falls below 10^−3 and almost stagnates thereafter, which is due to the fact that we do not increase the accuracy in the subroutines for (13) and (12). Each iteration step in Algorithm 1 takes approximately 10 minutes, where 90% of the computation time is needed for (13). The result is depicted in Figure 7(b).
The benefits of our method are twofold:

i) The amount of regularization is chosen in a completely automatic way. The only parameter to be selected is the level α in (23). Note that the parameter λ in Algorithm 1 has no effect on the output (though it has an effect on the number of iterations needed and on the numerical stability).

ii) The reconstruction has an appealing locally adaptive behavior, which in the present example mainly concerns the gaps between the protein filaments: whereas the marked β-tubulin is concentrated in regions of basically one scale, the gaps in between actually make up the multiscale nature of the image.

In the present situation we are in the comfortable position of having a reference image at hand by means of which we can evaluate the result of our method: STED (STimulated Emission Depletion) microscopy constitutes a relatively new method that is capable of recording images at a physically 5-10 times higher resolution than confocal microscopy (see [26, 27]). Hence a STED image of this object may serve as a "gold standard" reference image. Figure 8(a) depicts a STED recording of the PtK2 cell data set in Figure 7(a). The comparison of the SMRE û_0.9 with the STED recording in Figure 8(b) shows that our SMRE technique chooses a reasonable amount of regularization: no artifacts due to under-regularization are generated and, on the other hand, almost all relevant geometrical features that are present in the high-resolution STED recording become visible in the reconstruction. In particular, we note that filament bifurcations (one such bifurcation is marked by a black box in Figure 8(b)) become apparent in our reconstruction that are not visible in the recorded data.
Finally, we mention that, aside from standardization, other transformations of the Poisson data (27) could be considered. For example, Anscombe's transformation is known to yield reasonable approximations to normality even for low Poisson intensities and hence has a particular appeal for, e.g., microscopy data with low photon counts. We are currently investigating SMREs that employ Anscombe's transform, where in particular the arising projection problems are challenging.
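For reference, Anscombe's variance-stabilizing transform itself is a one-liner:

```python
import numpy as np

def anscombe(y):
    """Anscombe's transform for Poisson data: 2*sqrt(y + 3/8).
    For moderate intensities lambda, the transformed counts are
    approximately N(2*sqrt(lambda), 1), i.e. of unit variance."""
    return 2.0 * np.sqrt(np.asarray(y, dtype=float) + 3.0 / 8.0)
```

After the transform, the residuals are approximately homoscedastic Gaussian, so the multiresolution constraints could in principle be imposed on anscombe(Y) − anscombe(Ku); the projection problems this induces are the challenging part alluded to above.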

Conclusion and Outlook
In this work, we propose a general estimation technique for nonparametric inverse regression problems in the white noise model (1), based on the convex program (7). It amounts to finding a minimizer of a convex regularization functional J(u) over the set of feasible estimators that satisfy the fidelity condition T(Y − Ku) ≤ q, where T is assumed to be the maximum over simple convex constraints and q is some quantile of the statistic T(ε). Any such minimizer we call a statistical multiresolution estimator (SMRE). This approach covers well known uni-scale techniques, such as the Dantzig selector, but opens a vast field of potentially new application areas, such as locally adaptive imaging. The particular appeal of the multi-scale generalization arises in those situations where a "neighboring relationship" within the signal can be exploited to gain additional information by "averaging" neighboring residuals. We demonstrate in various examples that this improvement is drastic.
We approach the numerical solution of the problem by the ADMM (cf. Algorithm 1), which decomposes the problem into two subproblems: a J-penalized least squares problem, independent of T, and an orthogonal projection problem onto the feasible set of (7), independent of J. The first problem is well studied and for most typical choices of J fast and reliable numerical approaches are at hand. The projection problem, however, is computationally demanding, in particular for image denoising applications. We propose Dykstra's cyclic projection method for its approximate solution. Finally, by extensive numerical experiments, we illustrate the performance of our estimation scheme (in nonparametric regression, image denoising and deblurring problems) and the applicability of our algorithmic approach.
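The interplay of the two building blocks can be seen in the following stripped-down sketch: K = Id, a quadratic J, and a single ℓ∞ constraint in place of the full multiresolution family, all of which are simplifications for illustration only.

```python
import numpy as np

def smre_admm(Y, q, lam=1.0, n_iter=500):
    """Toy ADMM for   min 0.5*||D u||^2  s.t.  ||Y - u||_inf <= q,
    split via the constraint v = Y - u:
      u-step: penalized least squares solve (independent of the constraint),
      v-step: orthogonal projection onto the feasible set (independent of J),
      p-step: Lagrange multiplier update."""
    m = len(Y)
    D = np.eye(m, k=1)[:-1] - np.eye(m)[:-1]   # forward difference matrix
    A = D.T @ D + np.eye(m) / lam              # normal equations of the u-step
    u = Y.astype(float).copy()
    v = np.zeros(m)
    p = np.zeros(m)
    for _ in range(n_iter):
        z = Y - v + lam * p
        u = np.linalg.solve(A, z / lam)        # penalized least squares step
        v = np.clip(Y - u + lam * p, -q, q)    # projection onto the l-inf ball
        p = p + (Y - u - v) / lam              # dual update
    return u
```

In the full method the v-step is exactly where Dykstra's algorithm enters, since the feasible set is then an intersection of many such simple sets rather than a single ball.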
Summarizing, this paper is meant to introduce a novel class of statistical estimators, to provide a general algorithmic approach for their numerical computation and to evaluate their performance by numerical simulations. The inherent questions on the asymptotic behaviour of these estimators (such as consistency, convergence rates or oracle inequalities) remain -to a large extent-unanswered. This opens an interesting area for future research.
A first attempt has been made in [21], where it is assumed that the model space U ∋ u_0 is some Hilbert space of real valued functions on a domain Ω and that K : U → L²(Ω) is linear and bounded. The error model (1) then has to be adapted accordingly. When Y is a Gaussian process on L²(Ω) with mean Ku_0 and variance σ² > 0, consistency and convergence rates for SMREs as σ → 0+ have been proved in [21] for the case Λ = Id. However, in order to extend these results to the present setting, one would rather work with a discrete sample of Ku_0 on the grid X and then consider the case where the number of observations N = m^d tends to infinity. The previous analysis in [21] indicates two major aspects that have to be considered in the asymptotic analysis for SMREs: (a) As N → ∞, the cardinality of the index set S (and hence of the set of weight functions W) usually becomes unbounded. Thus, the multiresolution statistic T(ε) = T_N(ε) in (4) is likely to degenerate unless it is properly normalized and W satisfies some entropy condition. In the linear case (Λ = Id) we utilized a result from [17] that guarantees a.s. boundedness of T_N(ε). (b) In order to derive convergence rates (or risk bounds) it is well known that the true signal u_0 has to satisfy some a priori regularity conditions. When using general convex regularization functionals J, this is usually expressed by the source condition K*p_0 ∈ ∂J(u_0), for some p_0 ∈ L²(Ω).
Here K* denotes the adjoint of K and ∂J the (generalized) derivative of J. For example, if J(u) = ½‖u‖², then this condition means that u_0 ∈ ran(K*). It would be of great interest to transfer and extend the results in [21] to the present situation. It is to be expected that (a) and (b) above are necessary assumptions for this purpose.
As stressed by the referees, other extensions are of interest and will be postponed to future work. In contrast to imaging, in many other applications the design X is random rather than fixed. In these situations, an obvious way to extend our algorithmic framework would be to select suitable partitions S according to the design density, i.e. with finer resolution at locations with a high concentration of design points. It also remains an open issue how to extend the SMRE methodology to density estimation rather than regression, in particular in a deconvolution setup. For d = 1 a first step in this direction has been taken in [11] and it will be of great interest to explore whether our approach allows this to be extended to d ≥ 2.
necessary that certain regularity conditions for J hold that are not realistic for our purposes (as e.g. in the case of total-variation regularization). The assertions of Theorems 2.2 and 2.4 are modifications of the standard results.
We will henceforth assume that {(u^k, v^k, p^k)}_{k∈N} is a sequence generated by iteratively repeating the steps (29a) - (29c). Further, we introduce the notation ū^k := u^k − û, v̄^k := v^k − v̂ and p̄^k := p^k − p̂.
We continue with the proof of Theorem 2.2. More precisely, we prove the following generalized version.

Proposition A.2. Assume that the sums Σ_{k=1}^∞ δ_k and Σ_{k=1}^∞ ε_k are finite. Then the sequence {(u^k, v^k)}_{k≥1} is bounded in U × H and every weak cluster point is a solution of (8). Moreover,

Σ_{k∈N} ( ‖Ku^k + v^k − Y‖² + ‖K(u^k − u^{k−1})‖² ) < ∞.
Together with the optimality condition for (13), this in turn implies a corresponding estimate for an arbitrary feasible element. Summarizing, we find that J(u^k) ≤ c for a suitably chosen constant c ∈ R, since Λ is supposed to be continuous. Thus, it follows from Assumption A that {u^k}_{k∈N} is bounded and hence sequentially weakly compact. Now, let (ũ, ṽ, p̃) be a weak cluster point of {(u^k, v^k, p^k)}_{k∈N} and recall that (û, v̂, p̂) was assumed to be a saddle point of the augmented Lagrangian L_λ. Setting u = û in (33) yields (34). Using the relation Kû + v̂ = Y, we further find (35). From the definition of v^k in (29a) and from the fact that v̂, v^k ∈ C, it follows that (36) holds. Combining (34), (35) and (36) gives lim sup_{k→∞} J(u^k) ≤ J(û).
Now, choose a subsequence {u^{ρ(k)}}_{k∈N} such that u^{ρ(k)} ⇀ ũ. Since J is convex and lower semi-continuous, it is also weakly lower semi-continuous, and hence the previous estimate yields J(ũ) ≤ lim inf_{k→∞} J(u^{ρ(k)}) ≤ J(û).
Moreover, we have that v^{ρ(k)} ∈ C for all k ∈ N. Since C is closed and convex, we conclude that ṽ ∈ C.
We proceed with the proof of Theorem 2.4. Again we present a generalized version. To this end, let D and E be as in Proposition A.2.