Adaptive complexity regularization for linear inverse problems

We tackle the problem of building adaptive estimation procedures for ill-posed inverse problems. For general regularization methods depending on tuning parameters, we construct a penalized method that selects the optimal smoothing sequence without prior knowledge of the regularity of the function to be estimated. We provide for such estimators oracle inequalities and optimal rates of convergence. This penalized approach is applied to Tikhonov regularization and to regularization by projection.


Introduction
We are interested in recovering an unobservable signal x 0 based on observations where T : X → Y is a known compact linear operator between Hilbert spaces X, Y . We cannot observe our target function x 0 directly, but rather through a blurred (by the linear filter T ), noise corrupted sample of y over a collection of discrete observation points t i , i = 1, . . . , n. Throughout the paper, we shall denote y = (y(t i )) n i=1 . We assume that the observations y(t i ) ∈ R and that the observation noise ε i are i.i.d. realizations of a certain random variable ε.
We will say the problem is ill-posed if the generalized inverse operator T + : Y → X defined by (T * T ) −1 T , where T * stands for the transpose operator, is unbounded. We In this case, this inverse operator needs to be, in some sense, regularized.
Regularization methods replace an ill-posed problem by a family of well-posed problems. Their solution, called regularized solutions, are used as approximations of the desired solution of the inverse problem. These methods always involve some parameter measuring the closeness of the regularized and the original (unregularized) inverse problem. Rules (and algorithms) for the choice of these regularization parameters as well as convergence properties of the regularized solutions are central points in the theory of these methods.
The statistical problem has been extensively studied, although in general efficient regularization-parameter choice is still under active research. Two main kinds of estimators have been considered in the literature. First regularized estimators such as Tikhonov type estimators, and second non linear thresholded estimators. The first approach has been studied in great detail. An interesting early survey of this topics are provided in [20,19]. In this setting, the main issues are, what kind of regularization functional should be considered, and, closely related, what the relative weight of the selected regularization functional should be. More recently, Mair and Ruymgaart in [18] or Bissantz et al. in [3] studied different regularized inverse problems and proved the optimality of the rate of convergence for their estimators assuming the regularity of the target function x 0 to be known. Special attention has been devoted in this setting when considering a Singular value decomposition (SVD) of operator T . We cite the recent work in this direction in [7,6]. The second approach has its most popular version in the wavelet-vaguelet decomposition introduced in [9]. In this case the main issue is finding an appropriate basis over which T + , the generalized inverse, is almost diagonal. This idea is further developed in [12] who introduce mirror wavelets. Closely related, Cohen et al. in [8] construct an adaptive thresholded estimator based on Galerkin's method. We also cite the recent work in [14], where they combine an SVD approach with a thresholding technique over a certain new basis.
Reconstructing the unknown target function x 0 is related to four issues. To the filter T , to the probabilistic structure of the noise, to the fact we are only observing T (x 0 ) over a finite observation scheme and finally to the regularity of x 0 . In order to unify notation, our assumptions will be presented in terms of an underlying basis of Y , {ψ j } j∈N , and the increasing sequence of approximating spaces Y m :=< ψ j > j∈dm . The ill-posedness of T will then be defined in terms of these subspaces Y m and the discrete observation scheme. As is usual in the numerical literature, the regularity of x 0 will be defined in terms of operator T (for a detailed discussion see [11] or [13]).
Our goal in this article is to develop algorithms providing estimators of x 0 that achieve optimal rates of convergence within the regularization method when the smoothness of the true solution is not known a priori. For this, we focus on model selection techniques for regularization procedures. As in the work of [5], [15] or [16], we choose the best regularization scheme among a set of regularization operators by minimizing a contrast and a well chosen penalty. Since the choice of a penalty is crucial, we provide a very general way of calibrating the penalty with respect to the regularization operators. This enables us to build optimal estimators, within the chosen regularization method, for inverse problems for two general classes of estimators, Tikhonov estimators and projection estimators.
The article is divided into four main parts. In Section 1 we describe the general framework and provide several assumptions. In Section 2 we build a general penalty which leads to an oracle inequality for regularization operators and shows optimality of the adaptive procedures. In Section 3 we apply the above results to Tikhonov regularization operators and projection operators. Section 4 is devoted to technical lemmas providing deviation bounds which are used in order to obtain non asymptotic bounds for the quadratic risk of our estimator.

Model description and general assumptions
In this section we introduce our general notation and assumptions.
We want to estimate a function x 0 : R → R based on the blurred and noisy observations It is important to stress that the observations depend on a fixed design (t 1 , . . . , t n ) ∈ R n . This will require introducing an empirical norm based on this design. Let Q n be the empirical measure of the co-variables Q n = 1 n n i=1 δ ti , where δ is the Dirac function. The L 2 (Q n )-norm of a function y ∈ Y is then given by , and the empirical scalar product by < y, ε > n = 1 n n i=1 ε i y(t i ). Remark that this empirical norm is defined over the observation space Y . Over the solution space X we will consider the norm given by the Hilbert space structure. For sake of simplicity, we will write . X = . when no confusion is possible. Over a finite dimensional space, the norm . will always stand for the Euclidean norm and if v ∈ R d , v t will stand for the transpose vector. Likewise for any matrix A ∈ R d×r , A t will stand for the transpose matrix and A + := (A t A) −1 A t for the generalized inverse. Considered as an operator, we will write A * for the adjoint of the corresponding operator. We also introduce certain standard assumptions on the observation noise AN moment condition for the errors ε is a centered random variable satisfying the moment condition E(|ε| q /σ q ) ≤ q!/2 for all q ≥ 1, with E(ε 2 ) = σ 2 .
As usual in statistics, assume that X satisfies a certain smoothness condition.
In this paper, we assume the following source assumption encountered typically in the inverse problems literature.

Moreover consider
where 0 ν ν 0 , ν 0 > 0 and use the further notation These sets are usually called source sets, x ∈ A ν is said to have a source representation.
Remark 1.1. Such condition is usual in analysis of inverse problems (see [11]). It links the decay of the eigenvalues of the operator with the decay of the coefficients of the decomposition of the function in the SVD basis. Thus, the parameter ν can be seen as a smoothness parameter providing a restriction over the regularity of the function to be recovered. Indeed, let (λ j , φ j , ϕ j ) j 1 be the singular value decomposition of the operator T , in the sense that the following decomposition holds Hence, The parameter ν is not known a priori, hence adaptation results will be provided with respect to this smoothness parameter.
Estimating over all X is in general not possible because we can only observe T x 0 over the fixed design (t 1 , . . . , t n ). Thus we assume that we are equipped with a sequence of nested linear subspaces whose union is dense in We are interested in a subcollection of these spaces generated by a set of indices M n . In this paper, we will use these approximation spaces as projection spaces in order to study the data. So, denote the projection of any space W over any subspace Z by Π Z W . Let Π n Ym stands for the projection in the empirical norm. Set also the corresponding projected operator T m = Π n Ym T . Using a sieve of the space Y , we consider the corresponding approximation spaces in the space X, defined as X m = T * m Y . By construction We point out that both T m and its adjoint operator T * m depend on the observation sequence t i . However, we will usually drop this fact from the notation. To illustrate this assertion, consider the following example.
with respect to the L 2 norm over Y , and T = Id, then where y j,n =< Π n Ym y, φ j > n are the solution to the projection problem under the empirical measure Q n . Set G m = (φ j (t i )) i,j , i = 1, . . . , n and j = 1, . . . , d m . Thus, we may write in matrix notation ).
An important issue is that, as above, we can always define T m in matrix notation and thus T t m y always makes sense. Moreover, if u ∈ Y , we will use indistinctly T t m u ∈ R dm and T * m u ∈ X m , the latter in operator notation. Now, define p the degree of ill-posedness of the forward operator T . In our case we will relate this parameter to the approximation properties of the spaces Y m , i.e with the projection operator Π n Ym as follows. IP ill-posedness of the operator Let M n be an index set. For m ∈ M n there exists a parameter p > 0 such that p is denoted the index of ill-posedness of the forward operator. In the following, this parameter is supposed to be known.
To illustrate this condition we include the following example Example 1.2. The above assumption can be seen to hold under certain conditions over operator T and matrix G m defined in example (1.1). Let (λ j , φ j , ϕ j ) j be the singular value decomposition of the forward operator T and assume that there exists p > 0 such that . This parameter is in the statistical literature often considered as an index of ill-posedness, since the difficulty of the estimation in inverse problem usually comes from the difficulty to invert the operator due to the decrease of its eigenvalues. But, if we let Y m be the linear subspace generated by {ϕ j } 1≤j≤dm , then assume that the fixed observation design t i , i = 1, . . . , n is such that this basis is also orthogonal in the empirical norm. Assume also that sup j=1,...,dm ϕ j ∞ < ∞. Then, leading to the approximation property provided in [IP]. If the forward operator acts in a smoothing way as integrating p times, for example, the eigenvalues λ j satisfy the required decay rates (see [13]).
Define ν m := T + m Π n Ym . This quantity controls the amplification of the observation error over the solution space X m . Consider which expresses the effect of operator T * m over the approximating subspace Y m . We have as in [13], ν m ≥ γ m .
On the other hand this term is related to the goodness of the approximation scheme. Following the proof in [13], it can be seen that The next assumption requires that γ m and γ (m) are of the same order, which will be written γ m ∼ γ (m) .

AS amplification error
We assume Moreover assume there exists a positive constant U such that In the estimation procedure, we will first project the data onto a large enough approximation space indexed by m 0 , to be selected later in this paper. For this . . , n, j = 1, , d m0 . Thus, abusing notation we may write T t m0 = D(G m0 G t m0 ) −1 G m0 : R n → R dm 0 , where D = D(λ j ) j=1,...,dm 0 is the diagonal matrix with entries λ j . Since T t m0 u = T * m0 u the latter in operator notation, both interpretations will be used indistinctly. On the other hand, for x ∈ X m0 , identified with a d m0 dimensional vector, we can think of (T m0 x(t 1 ), . . . , T m0 x(t n )) = G t m0 Dx. So that in matrix notation also T * m0 T m0 = D 2 . We also introduce the following assumptions SV There exist positive constants k 1 < k 2 such that k 1 j −p ≤ λ j ≤ k 2 j −p . SF Let ν j , j = 1, . . . , d m0 be the eigenvalues of matrix G t G, then there exist constants a 1 < a 2 such that a 1 n ≤ ν j ≤ a 2 n.

Remark 1.2.
• Assumption AS thus establishes that the worst amplification of the error over X m is roughly equivalent to the best approximation over Y m in the empirical norm. This yields a uniform bound on the operator T T + m Π n Ym , so that regularization by T compensates the bad condition of the approximation T + m Π n Ym to T + (see [11] or [13] for further comments). Notice T + m Π n Ym T is just the projection operator over X m so it is a bounded operator.
• Assumption SV is slightly stronger than AS as it establishes the exact order of the γ m . It is seen to hold, for instance, in example (1.2). Assumption SF is necessary to assure convergence results further on. It holds also in example (1.2).
• We point out that the Assumptions are met as soon as the approximation spaces are constructed as in example (1.2), i.e called truncated SVD where the spaces are defined as Y m = span{ϕ j , j = 1, . . . , d m }, where we recall that ϕ j are eigenfunctions corresponding to the eigenvalues of T * T sorted in decreasing order. Since eigenfunctions are often hard to obtain explicitly, for practical purpose we are more interested in the usual piecewise polynomial projection bases or wavelet bases, though, as in [13].

Adaptive choice for regularized estimator
Consider a class of regularized estimators built using a projection and a regularization procedure. Namely let Y m0 be a big enough space in the sense that m 0 is such that This quantity can be chosen so as not to depend on the unknown regularity of the solution x 0 . Under assumption SC the above inequality is satisfied if the dimension of the set is such that Thus it is enough to choose m 0 such that d m0 ≥ n 1/(2p+1) . For K n a set of indices, consider {R k , k ∈ K n } a collection of regularization operators which depend on different values of the smoothing parameters. For instance consider Tikhonov regularization operators which rely on the choice of a smoothing sequence, Landweber iteration operators which rely on the choice of a stopping index, or other general smoothing operators described in [10]. Consider the corresponding estimatorŝ where we have written R k :=R k Π n Ym 0 . The behavior of such general estimators depends on the choice of the regularization sequence. From the theory of inverse problems, we know that it is possible to choose a regularization operator for which the corresponding estimator achieves the optimal rate of convergence, but this choice depends on ν defined in SC, which characterizes the regularity of the solution.
Our aim is building a method that picks, according to the data, an optimal R k , among all the R k , k ∈ K n in such a way that optimal rates are maintained. This choice must also not depend on a priori regularity assumptions. We point out that selecting the optimal smoothing parameter in a collection of sequences, belongs to model selection theory since it is equivalent as selecting a good model among a collection of sets.
For this consider the following penalized procedure. For a given constant r > 2 and weights L k , k ∈ K n to be chosen, define the penalty as where T r(R t k R k ) is the trace and ρ(R t k R k ) = ρ 2 (R k ) is the spectral radius. Finallyk is selected as the solution of which defines the estimatorxk = Rky. Let x k = R k T x 0 be the regularized true function, which measures the accuracy of the estimation procedure without observation noise. The following result states the asymptotic behaviour of the estimatorxk.
where we have set for d as in lemma 4.3.
Hence, the estimator is optimal in the sense that the adaptive estimator achieves the best rate of convergence among all the regularized estimators, up to an error of order pen(k) and Σ(d)/n. This bound is non asymptotic and the rate of convergence depends on both previous terms.
We point out that the constant r and the weights must be chosen in order to control respectively the penalization term and the constant Σ(d). The weights are technically introduced to achieve the so called Kraft inequality, Σ(d) < +∞ and hence to control the size of the set of parameters K n .
We also point out that under SF ρ 2 (nR k ) and Tr(R t k R k )/ρ 2 (R k ) do not depend on n.
Proof. For any x k and any k ∈ N, the mere definition of the estimatorxk implies that Thus, following standard arguments we have Let 0 < κ < 1. Since 2ab κa 2 + 1 κ b 2 , for any a, b we have for any k and On the other hand, using that 1 R k T C, we have that for any x k ∈ X m0 and any k ∈ N, The proof then follows directly from Lemma 4.3 which characterizes the supremum of the empirical process under the linear application as defined by the regularization family. The choice of κ will depend on r in the penalty.

Applications to standard regularization operators
Penalized estimators are widely used to solve linear inverse problems and can be written in the form (4). Indeed for k ∈ K, consider a sequence of (diagonal) matrices A k ∈ M m0×m0 (R). Then, define for a chosen sequence A k the following penalized estimator Note that, for each fixed A k , the expression (7) can be written in the following way:x In practice the second expression is more complicated (the matrix to invert might be big), but it is simpler to deal with in order to show our results concerning the selection of A k . With this notation set the smoothing operator We point out that choosing the smoothing sequence A k is the key point since it balances the two terms: if A k is large the solution will be smooth but will not, in general, comply to the observations. On the other hand, if A k is small, the solution might be too close to the noisy observations to yield a good approximation of x 0 . Remark that we can write R k = (D 2 +A t k A k I m0 ) −1 D(GG t ) −1 G, as a matrix, with R t k its transpose matrix.

Tikhonov regularization
Consider the choice A k = √ α k I m0 , where α k is a positive decreasing sequence.
In this case, the estimator (7) can be written aŝ Where we recognize the usual expression of the Tikhonov estimator. Assume AN holds true. Hence Theorem 2.1 can be written in the following way, using the approximation properties of the Tikhonov regularization scheme, Then, by direct calculation of the trace and spectral radius of R k under SF and SV, Also under IP and SC, standard approximation results (see for example [11], Chapter 1) the following inequality holds true Hence under IP the above inequality yields, for ν ≤ 1, Interpreting this rate in the statistical literature reads s = 2νp: the regularity depends on the ill-posedness of the problem. In the ill-posed literature the error is not related to the underlying dimension so that rates are different. Typically in a Hilbert scale setting, if the true solution x 0 belongs to the Hilbert space H s , optimal rates are of order O(n −s/(2s+2p+1) ), see for example [7]. So, Equation 10 implies that the penalized Tikhonov estimatorxk, withk selected as in (5), achieves the best rate of convergence over all the possible choices of smoothing sequences α k , k ∈ K n . Moreover, if the input data is not too smooth, i.e for ν 1, Equation 11 implies that the penalized estimate achieves the minimax rate of convergence for this problem.
To overcome the restrictions induced by the use of Tikhonov estimate, we turn to model selection estimators in the following part.

Regularization by projection
Consider the projection estimator x m over X m . That is, x m is chosen in such a way as to minimize Π Ym (y − T x m ) n . The estimation error can thus be bounded by Since Y m is a sequence of linear subspaces E Π n Ym ǫ 2 = O( dm n ). So rates of convergence are of order This rate depends on the ill-posedness of the operator and the approximation properties of X m . Indeed, following [13] and under SC we have if ν ≤ 1/2, for a certain p. An optimal choice of the dimension (depending on ν) leads to the rate We aim at using Theorem 2.1 to select the best projection space among a collection of spaces. For this, consider a collection of index sets m = (j 1 , . . . , j m ) such that m ⊂ m 0 . For each m, define formally A m = (a ij ) with ∀i = j, a ij = 0 ∀i ∈ m, a ii = 0, ∀i m, a ii = ∞.
Then we obtain a model selection estimator which can be written aŝ where A m0 = T + m0 Π n Hence the penalized model selection estimator is optimal in the sense that it achieves the best rate among the collection of projection estimators. This leads to optimal rates also, as long as there exists a model m ∈ M n with d m = n 1/(4νp+2p+1) , as shown in Equation (13).
We also have the following interpretation for this estimator which offers an important insight. Note that (14) is equivalent to minimizinĝ Let {e j } j∈m be the canonical base over X m . Define for each m, x m,j =< A m0 y, e j >=< y, A t m0 e j >, j = 1, . . . , m.
Thus, m is selected by minimizing We can recognize a hard thresholding scheme, defined in [2], highlighting the equivalence between model selection and penalized M-estimation with an ℓ 0 penalty, as is also quoted in [17].

Appendix
In this section we give some technical lemmas. The next lemma characterizes the supremum of an empirical process by the norm of an orthogonal projection.
Proof. Using the definition of an orthogonal projector, we have As a consequence we can write: which ends the proof.
The next result is a deviation inequality based on a functional exponential inequality (Theorem 7.4) due to [4] 2003. Then, Proof. Since the application u → A t u is continuous, we have η(A) = sup u∈S n i=1 ε i (A t u) i for S some countable subset of the unit ball. On the other hand, Thus sup u ≤1 |(A t u) i /ρ 1/2 (A t A)| ≤ 1. Also, following the proof of Corollary 5.1 in [1] sup u =1 Set Z = Z(ε 1 , . . . , ε n ) = η(A)/(σρ 1/2 (A t A)). Let E j stand for the conditional expectation given ε i for i = j. Hence, in the proof of Theorem 7.4 in [4] we may bound Thus, E|Z − E j Z| q ≤ z j q!/2. Finally, note that Thus, the proof follows from Theorem 7.4 in [4].
As a corollary, we have the following lemma Lemma 4.3.
• There exists a positive constant d that depends on r such that the following inequality holds • Set k 1 = d/(ρ(A t A)σ 2 ) and k 2 = dr/2L[T r(A t A)/ρ(A t A)+1]. Then, there exists a constant C q , which depends only on q, such that, holds.
Proof. As a first step we will bound v.