Solution of linear ill-posed problems by model selection and aggregation

We consider a general statistical linear inverse problem, where the solution is represented via a known (possibly overcomplete) dictionary that allows its sparse representation. We propose two different approaches. A model selection estimator selects a single model by minimizing the penalized empirical risk over all possible models. By contrast with direct problems, the penalty depends on the model itself rather than on its size only as for complexity penalties. A Q-aggregate estimator averages over the entire collection of estimators with properly chosen weights. Under mild conditions on the dictionary, we establish oracle inequalities both with high probability and in expectation for the two estimators. Moreover, for the latter estimator these inequalities are sharp. The proposed procedures are implemented numerically and their performance is assessed by a simulation study.


Introduction
Linear inverse problems, where the data is available not on the object of primary interest but only in the form of its linear transform, appear in a variety of fields: medical imaging (X-ray tomography, CT and MRI), astronomy (blurred images), finance (model calibration of volatility) to mention just a few. The main difficulty in solving inverse problems is due to the fact that most of practically interesting and relevant cases fall into the category of so-called ill-posed problems, where the solution cannot be obtained numerically by simple inversion of the transform. In statistical inverse problems the data is, in addition, corrupted by random noise that makes the solution even more challenging.
Statistical linear inverse problems have been intensively studied and there exists an enormous amount of literature devoted to various theoretical and applied aspects of their solutions. We refer a reader to [7] for review and references therein.
Let G and H be two separable Hilbert spaces and A : G → H be a bounded linear operator. Consider a general statistical linear inverse problem y = Af + ε, (1.1) where y is the observation, f ∈ G is the (unknown) object of interest, ε is a white noise with a (known) noise level σ. For ill-posed problems A −1 does not exist as a linear bounded operator. Most of approaches for solving (1.1) essentially rely on reduction of the original problem to a sequence model using the following general scheme: 1. Choose some orthonormal basis {φ j } on G and expand the unknown f in (1.1) as 2. Define ψ j as the solution of A * ψ j = φ j , where A * is the adjoint operator, that is, ψ j = A(A * A) −1 φ j . Reduce (1.1) to the equivalent sequence model: where for ill-posed problems Var ( y, ψ j H ) = σ 2 ψ j 2 H increases with j. Following the common terminology, an inverse problem is called mildly ill-posed, if the variances increase polynomially and severely ill-posed if their growth is exponential (see, e.g., [7]). 3. Estimate the unknown coefficients f, φ j G from empirical coefficients y, ψ j H : f, φ j G = δ{ y, ψ j H }, where δ(·) is some truncating/shrinking/ thresholding procedure (see, e.g., [7], Section 1.2.2.2 for a survey), and re- Efficient representation of f in a chosen basis {φ j } in (1.2) is essential. In the widely-used singular value decomposition (SVD), φ j 's are the orthogonal eigenfunctions of the self-adjoint operator A * A and ψ j = λ −1 j Aφ j , where λ j is the corresponding eigenvalue. SVD estimators are known to be optimal in various minimax settings over certain classes of functions (e.g., [20]; [10]; [9]). A serious drawback of SVD is that the basis is defined entirely by the operator A and ignores the specific properties of the object of interest f ∈ G. Thus, for a given A, the same basis will be used regardless of the nature of a scientific problem at hand. While the SVD-basis could be very efficient for representing f in one area, it might yield poor approximation in the other. The use of SVD, therefore, restricts one within certain classes G depending on a specific operator A. See [15] for further discussion.
In wavelet-vaguelette decomposition (WVD), φ j 's are orthonormal wavelet series. Unlike SVD-basis, wavelets allows sparse representation for various classes of functions and the resulting WVD estimators have been studied in [15], [2], [21], [19]. However, WVD imposes relatively stringent conditions on A that are satisfied only for specific types of operators, mainly of convolution type.
A general shortcoming of orthonormal bases is due to the fact that they may be "too coarse" for efficient representation of unknown f . Since 90s, there was a growing interest in the atomic decomposition of functions over overcomplete dictionaries (see, for example, [23], [11], [16]). Every basis is essentially a minimal dictionary that allows only a unique (not necessarily sparse) representation. Such scarceness usually causes poor adaptivity [23]. Application of overcomplete dictionaries improves adaptivity of the representation, because one can choose now the most efficient (sparse) one among many available. One can see here an interesting analogy with colors. Theoretically, every other color can be generated by combining three basic colors (green, red and blue) in corresponding proportions. However, a painter would definitely prefer to use the whole available palette (overcomplete dictionary) to get the hues he needs! Selection of appropriate subset of atoms (model selection) that allows a sparse representation of a solution is a core element in such an approach. It becomes even more important for ill-posed inverse problems, where for large models the variance component in the risk of the estimator increases drastically. Pensky in [24] was probably the first to use overcomplete dictionaries for solving inverse problems. See the discussion on advantages of overcomplete dictionaries in applications to ill-posed problems in her paper.
Pensky in [24] utilized the Lasso techniques for model selection within the overcomplete dictionary, established oracle inequalities with high probability and applied the proposed procedure to several examples of linear inverse problems. However, as usual with Lasso, it required restrictive compatibility conditions on the design matrix Φ.
In this paper we propose two alternative approaches for overcomplete dictionaries based estimation in linear ill-posed problems. The first estimator is obtained by minimizing penalized empirical risk with a penalty on model M proportional to j∈M ψ 2 j . The second one is based on a Q-aggregation type procedure that is specifically designed for solution of linear ill-posed problems. We establish oracle inequalities for both estimators that hold with high probabilities and in expectation. Moreover, for the Q-aggregation estimator, the inequalities are sharp. Simulation study shows that the new techniques produce more accurate estimators than Lasso.
The rest of the paper is organized as follows. Section 2 introduces the notations and some preliminary results. The model selection and aggregation-type procedures are studied respectively in Section 3 and Section 4. The simulation study is described in Section 5. All proofs are given in the Appendix.

Setup and notations
Consider a discrete analog of a general statistical linear inverse problem (1.1): where y ∈ R n is the vector of observations, f ∈ R m is the unknown vector to be recovered, A is a known (ill-posed) n × m (n ≥ m) matrix with rank(A) = m, and ε ∼ N (0, σ 2 I n ).
In what follows · and ·, · denote respectively Euclidean norms and inner products. Let φ j ∈ R m , j = 1, · · · , p with φ j = 1 be a set of normalized vectors (dictionary), where typically p > m (overcomplete dictionary). Let Φ m×p be the complete dictionary matrix with the columns φ j , j = 1, . . . , p, and Ψ n×p is such that and assume that any r Φ columns of Φ are linearly independent.
For any 1 ≤ r ≤ r Φ define the r-sparse minimal and maximal eigenvalues of Φ T Φ as The oracle risk is unachievable but can be used as a benchmark for a quadratic risk of any available estimator. For direct problems (A = I), the penalized empirical risk approach, with the complexity type penalties Pen(|M |) on a model size, has been intensively studied in the literature. In the last decade, in the context of linear regression, the in-depth theories (risk bounds, oracle inequalities, minimaxity) have been developed by a number of authors. See, e.g., [18], [4], [5], [1], [25], [27] among many others.

Model selection by penalized empirical risk
For inverse problems, [9] considered a truncated orthonormal series estimator, where the cut-off point was chosen by SURE criterion corresponding to the AICtype penalty Pen(M ) = 2σ 2 Tr((A T A) −1 H M ) and established oracle inequalities for the resulting estimatorf M . It was further generalized and improved by risk hull minimization in [8].
To the best of our knowledge, [24] was the first to consider model selections within overcomplete dictionaries by empirical risk minimization for statistical inverse problems. She utilized Lasso penalty. However, as usual with Lasso, it required restrictive compatibility conditions on the design matrix Φ (see [24] for more details).
In this paper, we utilize the penalty Pen(M ) that depends on the Frobenius norm of the matrix Ψ M : The following theorem provides nonasymptotic upper bounds for the quadratic risk of the resulting estimatorf M both with high probability and in expectation: for some δ > 0 and 0 < a < 1. Then,

If, in addition, we restrict the set of admissible models to
The additional restriction on the set of models M r in the second part of Theorem 1 is required to guarantee that the oracle risk in (2.5) does not grow faster than n.
Note that for the direct problems, Ψ M = Φ M , while Φ M 2 F = |M | (recall that φ j 's are normalized to have unit norms).Thus, the penalty (3.2) is the RIC-type complexity penalty of [18] of the form Pen(M ) = C|M | ln p and the additional restriction on the set of admissible models required for (3.4) trivially holds with γ = 1. It is important to note that for inverse problems P en(M ) depends on the model itself rather on its size only. Furthermore, the presence of ν 2 2r in the penalty is essential. As a result, although the risk bounds obtained in Theorem 1 hold for any fixed 1 ≤ r ≤ r Φ /2, they increase with r due to the variance component that dominates strongly for large models. However, the use of overcomplete dictionaries allows one to assume that the unknown f has a sparse representation in φ j 's and, therefore, makes reasonable to consider M r with relatively small r to control the variance.
We can compare the quadratic risk of the proposed estimator with the oracle risk R(oracle) in (2.5). Consider the penalty for some 0 < a < 1. Assume that p ≥ n (overcomplete dictionary) and choose δ ≥ 4. Then, the last term in the RHS of (3.4) turns to be of a smaller order and we obtain for some positive constants C 1 , C 2 depending on a and δ only. By standard linear algebra arguments, Tr (A T A) −1 H M and, therefore, the following oracle inequality holds: for some constant C 0 > 0 depending on a, δ and γ only.
Thus, the quadratic risk of the proposed estimatorf M is within ln p-factor of the ideal oracle risk. The ln p-factor is a common closest rate at which an estimator can approach an oracle even in direct (complete) model selection problems (see, e.g., [17], [4], [1], [25] and also [24] for inverse problems). For an ordered model selection within a set of nested models, it is possible to construct estimators that achieve the oracle risk within a constant factor (see, e.g., [9] and [8]). Similar oracle inequalities (even sharp with the coefficient in front of f M − f 2 equals to one) with high probability were obtained for the Lasso estimator but under the additional compatibility assumption on the matrix Φ (see [24]).

Q-aggregation
Note that inequalities (3.3) and (3.4) in Theorem 1 for model selection estimator are not sharp in the sense that the coefficient in front of the minimum is greater than one. In order to derive sharp oracle inequalities both in probability and expectation, one needs to aggregate the entire collection of estimators: f = M ∈Mr θ MfM rather than to select a single estimatorf M .
[22] considered exponentially weighted aggregation (EWA) with θ M ∝ π M exp{− z −f M 2 /β}, where π is some (prior) probability measure on M r and β > 0 is a tuning parameter. They established sharp oracle inequalities in expectation for EWA in direct problems. [14] proved sharp oracle inequalities in expectation for EWA of affine estimators in nonparametric regression model. Their paper offers limited extension of the theory to the case of mildly ill-posed inverse problems where Var( y, ψ H ) = σ 2 ψ j 2 H increase at most polynomially with j. However, their results are valid only for the SVD decomposition and require block design which seriously limits the scope of application of their theory. Moreover, [12], [13] and [3] argued that EWA cannot satisfy sharp oracle inequalities with high probability and proposed instead to use Q-aggregation.
Define a general Q-aggregation estimator of f aŝ where the vector of weightsθ is the solution of the following optimization problem: 2) for some 0 < α < 1 and a penalty Pen(θ), and Θ Mr is the simplex In particular, [12] and [13] considered Pen(θ) proportional to the Kullback-Leibler divergence KL(θ, π) for some prior π on Θ Mr . For direct problems, they derived sharp oracle inequalities both in expectation and with high probability for Q-aggregation with such penalty. In fact, EWA can also be viewed as an extreme case of Q-aggregation for α = 1 (see [26]). However, the results for Q-aggregation with Kullback-Leibler-type penalty are not valid for ill-posed problems. A slightly different form of Q-aggregation was considered in [3]. In this section we propose a different type of penalty for Q-aggregation in (4.2) that is specifically designed for the solution of inverse problems. In particular, this penalty allows one to obtain sharp oracle inequalities both in expectation and with high probability in both mild and severe ill-posed linear inverse problems (see Introduction). We consider the penalty P en(θ) of the form where P en(M ) ≥ 4σ 2 (δ+1) aν 2 2r ||Ψ M || 2 F ln p for some δ > 0 and 0 < a < 1 as in (3.2).
For such type of a penalty P en(θ) and α = 1/2, the resultingθ iŝ (4.5) Note that the first term in the minimization criteria (4.5) is the same as in model selection (3.1). The presence of the second term is inherent for Q-aggregation. In fact, the model selection estimatorf M from Section 3 is a particular case of a Q-aggregate estimatorfθ with the weights obtained by solution of problem (4.2) with α = 1.
The non-asymptotic upper bounds for the quadratic risk offθ, both with high probability and in expectation, are given by the following theorem :

Then, with probability at least
2. If, in addition, we restrict the set of admissible models to M r,γ defined in Theorem 1 for some γ, then Unlike Theorem 1 for model selection estimator, inequalities in both (4.6) and (4.7) for Q-aggregation are sharp. For A = I, the results of Theorem 2 are similar to those obtained in [13] and [3] for Q-aggregation in direct problems.

Simulation study
In this section we present results of a simulation study that illustrates the performance of the model selection estimator M from (3.1) with the penalty (3.5) and the Q-aggregation estimatorfθ given by (4.1) with the weights defined in (4.5).
The data were generated w.r.t. a (discrete) ill-posed statistical linear problem (2.1) corresponding to the convolution-type operator Af (t) = In order to investigate the behavior of the estimators, we considered test functions of various sparsity and several noise levels. In particular, we used four test functions presented in Figure 2, that correspond to different sparsity scenarios: . f 4 is the well-known HeaviSine function from [17] (uncontrolled sparsity) For each test function, we used three different values of σ that were chosen to ensure a signal-to-noise ratios SN R = 10, 7, 5, where SN R(f ) = f /σ.
The accuracy of each estimator was measured by its relative integrated error: F over the entire model space M r of a very large size, we used a Simulated Annealing (SA) stochastic optimization algorithm for an approximate solution. The SA algorithm is a kind of a Metropolis sampler where the acceptance probability is "cooled down" by a synthetic temperature parameter (see [6], Chapter 7, Section 8). More precisely, if M (r) is a solution at step r > 0 of the algorithm, at step r + 1 a tentative solution M * is selected according to a given symmetric proposal distribution and it is accepted with probability a(M * , M (r) ) = min 1, exp − π(M * ) − π(M (r) ) T (r) .
where T (r) is a temperature parameter at step r. The expression (5.1) is motivated by the fact that while M * is always accepted if π(M (r) ) ≥ π(M * ), it can still be accepted even if π(M (r) ) < π(M * ) in spite of being worse than the current one. The chance of acceptance of M * for the same value of π(M (r) )−π(M * ) < 0 diminishes at every step as the temperature T (r) decreases with r. The law that reduces the temperature is called the cooling schedule, in particular, here we choose T (r) = 1/(1 + log(r)).
In this paper we adopted the classical symmetric uniform proposal distribution and selected a starting solution M (0) according to the following initial probability where c = j ψ j , y 2 / j ψ j 2 and C = p j=1 exp{ ψ j , y 2 − c ψ j 2 } −1 is the normalizing constant. Observe that the argument in the exponent in (5.2) is the difference of ψ j , y 2 and ψ j 2 where the first term ψ j , y 2 is the squared j-th empirical coefficient while the second term ψ j 2 is the increase in the variance due to the addition of the j-th dictionary function. Hence, the prior p(j) is more likely to choose dictionary functions with small variances that are highly correlated to the true function f . Thus, the adopted SA procedure can be summarized as follows: • generate a random number m ≤ n/ log(p). Set T ( While various stopping criteria could be used in the SA procedure, we found r max = 100, 000 to be sufficient for obtaining a good approximation of the global minimum in (3.1). Once the algorithm is terminated, we evaluatedf M , where M = arg min 0≤r≤rmax π(M (r) ) was the "best" model in the chain of models generated by SA algorithm. Similarly, the Q-aggregation estimatorfθ involves computationally expensive aggregation of estimators over the entire model space M r . We, therefore, approximated it by aggregating over the subset M r of the last 50 "visited" models in the SA chain, i.e.fθ = M ∈M rθ MfM withθ being a solution of (4.5).
For f 1 , f 2 and f 3 we also considered the oracle projection estimatorf oracle based on the true model. In addition, we compared the proposed estimators with the Lasso-based estimatorf Lasso = p j=1θ j φ j of [24], where the vector of coefficientsθ is a solution of the following optimization problem and λ is a tuning parameter. The tuning parameters λ forf M andf Lasso , were chosen by minimizing the error on a grid of possible values. To reduce heavy computational costs we used the same λ off M for all 50 aggregated models used for calculatingfθ.