The quasi-optimality criterion in the linear functional strategy

The linear functional strategy for the regularization of inverse problems is considered. For selecting the regularization parameter therein, we propose the heuristic quasi-optimality principle and some modifications that take the smoothness of the linear functionals into account. We prove convergence rates for the linear functional strategy with these heuristic rules, taking into account the smoothness of the solution and of the functionals and imposing a structural condition on the noise. Furthermore, we study these noise conditions in both a deterministic and a stochastic setup and verify that for mildly ill-posed problems and Gaussian noise, these conditions are satisfied almost surely, whereas in the severely ill-posed case, in a similar setup, the corresponding noise condition fails to hold. Moreover, we propose an aggregation method for adaptively optimizing the parameter choice rule by making use of improved rates for linear functionals. Numerical results indicate that this method yields better results than the standard heuristic rule.


Introduction
The estimation of bounded linear functionals of an unknown element x from an indirect noisy observation y^δ given as

    y^δ = T x + δ ξ,    (1.1)

is one of the classical problems in regularization theory [2]. Here, we assume that T is a linear, injective, not necessarily boundedly invertible operator from a solution Hilbert space X into an observation Hilbert space Y, ξ is an additive noise process, and δ is its intensity, or noise level, such that for y = T x, it holds that ‖y − y^δ‖ ≤ δ, δ ∈ (0, 1). We use the same symbols ⟨·,·⟩ and ‖·‖ for the inner products and the corresponding norms in both X and Y. It is known that the problem of estimating the value f(x) = ⟨f, x⟩ of a bounded linear functional f ∈ X from (1.1) is less ill-posed than the problem of estimating x itself, in the sense that the value f(x) allows for a more accurate reconstruction than the element x in the X-norm [10,3,17]. A regularization of the first-named problem is usually performed by the so-called linear functional strategy [1], which is also closely related to the mollifier methods [16]. In the case of a known noise intensity δ, the choice of the regularization parameters in the linear functional strategy has been extensively studied (see, e.g., [11,18,17] and references therein).
At the same time, in some applications, such as satellite gravity gradiometry, one cannot expect to have good knowledge of the noise model in general and of the noise intensity δ in particular (see, e.g., discussions in [14,5]). As a remedy for this, regularization theory has an arsenal of so-called heuristic parameter choice strategies that do not require knowledge of the noise intensity and therefore can be used in the above mentioned applications. The quasi-optimality criterion [21] is one of the simplest and the oldest but still quite efficient instance among such strategies.
Of course, in the worst-case scenario, where the noise ξ in (1.1) is assumed to be chosen by some antagonistic opponent subject only to the constraint ‖ξ‖ ≤ 1, the quasi-optimality criterion, as well as any other heuristic parameter choice strategy, cannot guarantee convergence of the corresponding regularized approximants because of the so-called Bakushinskii veto [4]. On the other hand, it has been shown [6,7] that for the quasi-optimality criterion, the Bakushinskii veto can be avoided if the regularization performance is measured on average over realizations of ξ.
At the same time, another way to overcome the Bakushinskii veto has been proposed in [13,19], where convergence of the regularized approximants to x in the solution space norm and its rates have been established under a qualitative restriction on the noise ξ (a noise condition of Muckenhoupt type). Our intention in this paper is to extend this restricted noise approach in [13,19] to the context of the linear functional strategy. We also show that for a wide class of moderately ill-posed problems (1.1) and for random noise ξ with bounded moments, the above mentioned Muckenhoupt-type condition is satisfied almost surely.
The case of severely ill-posed problems is considered as well. Note that in this case, the theoretical bounds on the convergence rates of the regularized approximants selected by the quasi-optimality criterion in the solution space norm are worse than those for the noise level-dependent parameter choice strategies. At the same time, as follows from our results, in the linear functional strategy, the above-mentioned convergence rate gap can be essentially reduced. This hints at an opportunity to use the linear functional strategy equipped with the quasi-optimality criterion for aggregating the constructed regularized approximants in a way described in [9]. Then from [9], it follows that such aggregation by the linear functional strategy can improve the accuracy compared to the aggregated regularized approximations, and this can be seen as a way to use the quasi-optimality criterion for mildly and severely ill-posed problems.
Note that a practical implementation of the quasi-optimality criterion depends on the so-called differential quadrature [8] that is used to approximate the partial derivative ∂x^δ_α/∂α of the regularized solution x^δ_α of (1.1) at a current value of the regularization parameter α = α_i. Starting from the original paper [21], one usually uses a simple backward difference formula, where a_ij = 0 for j ≠ i, i − 1. On the other hand, as mentioned in [8], there are many ways of determining the coefficients a_ij in (1.2). For example, in the backward difference formula, one can introduce correction factors, where ⟨x, x^δ_{α_ℓ}⟩ is the value of the bounded linear functional x^δ_{α_ℓ} ∈ X at the unknown solution, which can be approximated by ⟨x^δ_{α_j}, x^δ_{α_ℓ}⟩ with α_j chosen by the quasi-optimality criterion. The use of the backward difference formula corrected in this way can be seen as an iterated quasi-optimality rule. We will demonstrate in Section 5 that such a combination of the linear functional strategy (by an aggregation approach) and the quasi-optimality criterion can also improve the regularization performance as compared to the standard quasi-optimality criterion.
The paper is organized as follows. In the next section, we present the problem setup and formulate the main results. The proofs are given in Section 3. In Section 4, we describe random noise processes and investigate whether they almost surely meet the Muckenhoupt-type conditions. In Section 5, we discuss a combination of the aggregation by means of the linear functional strategy with the quasi-optimality criterion and present numerical experiments.

The main convergence rates results
In this section, we formulate the main results. Let us introduce some standard notation. Let X, Y be Hilbert spaces and T : X → Y a continuous linear operator such that Ker(T) = {0} and Ker(T*) = {0}. Here, the injectivity assumptions on T and T* are only imposed for simplicity; the main results hold, with modifications, in the general case as well. We denote by E_λ and F_λ the spectral families of the operators T*T and TT*, respectively. The notation R(T) stands for the range and Ker(T) for the nullspace of the operator T. For f, g being functions or sequences, the notation f ≍ g indicates that there exist constants c_1, c_2 > 0 such that c_1 f ≤ g ≤ c_2 f for all arguments or sequence indices, where in particular the constants do not depend on δ.
Consider an ill-posed problem in the form T x = y. Suppose that we observe y^δ ∈ Y such that ‖y^δ − y‖ ≤ δ. We introduce regularized solutions obtained by a general spectral filter function g_α:

    x^δ_α = g_α(T*T) T* y^δ.

Moreover, let f ∈ X* = X be a linear functional.
One aim of this paper is to obtain upper bounds for the error of linear functionals of the solutions, i.e., for the quantity ⟨f, x^δ_{α(y^δ)} − x⟩, where the parameter α(y^δ) is selected in a special way and depends only on the observation y^δ. To state a smoothness/source condition for x and/or f, we use ϕ and κ, which are continuous, non-negative, increasing real functions defined for positive real values (so-called index functions). Below we impose some standard assumptions on ϕ, κ, g_α.
Convergence rate estimates for the error ‖x^δ_α − x‖ under smoothness conditions on x are nowadays a classical topic. For instance, if δ is known (see, e.g., [17]), then under some natural conditions the best accuracy that can be guaranteed under the smoothness condition x ∈ R(ϕ(T*T)) is of the order ϕ(θ^{-1}(δ)), where θ(t) = ϕ(t)√t and θ^{-1} is its inverse function. For linear functionals, the situation improves: if x ∈ R(ϕ(T*T)) and f ∈ R(κ(T*T)), where ϕ, κ are index functions, then the best accuracy for ⟨f, x^δ_α − x⟩ is of the order (κϕ)(θ^{-1}(δ)).
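For orientation, a worked instance of these rates for power-type index functions (a routine calculation, included here for concreteness):

```latex
% Power-type smoothness: \varphi(t) = t^{\mu}, \kappa(t) = t^{\gamma}.
\theta(t) = \varphi(t)\sqrt{t} = t^{\mu+1/2},
\qquad
\theta^{-1}(\delta) = \delta^{2/(2\mu+1)},
\qquad
\varphi\bigl(\theta^{-1}(\delta)\bigr) = \delta^{2\mu/(2\mu+1)},
\qquad
(\kappa\varphi)\bigl(\theta^{-1}(\delta)\bigr) = \delta^{2(\mu+\gamma)/(2\mu+1)}.
```

The functional rate thus gains the extra factor δ^{2γ/(2µ+1)} over the rate for the solution itself.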
If the noise intensity is known, then the best-order accuracy can usually be achieved by standard parameter choice rules. However, if δ is not known, the choice of the optimal α is a serious problem. For α(y^δ) selected according to the quasi-optimality principle, some upper bounds for ‖x^δ_{α(y^δ)} − x‖ were obtained in [13,19]. There it is proved that if ϕ(t) = t^µ and if the qualification µ_0 of the regularization g_α satisfies µ_0 ≥ µ, then the corresponding convergence rate holds. The main assumption on the noise was a condition of Muckenhoupt type (the noise condition (2.1)). We give some sufficient conditions that ensure (2.1) in Section 4. In this paper, we consider (2.1) and its generalization for the linear functional strategy. We discuss these conditions in the deterministic and the random case; in particular, we verify that for mildly ill-posed problems and Gaussian noise, the condition is satisfied almost surely. Moreover, we provide upper bounds for ⟨f, x^δ_{α(y^δ)} − x⟩, where α(y^δ) is selected by the quasi-optimality principle as in [13,19], and we also obtain some generalizations of the upper bounds there. Furthermore, we prove improved bounds for ⟨f, x^δ_{α_κ(y^δ)} − x⟩, where α_κ(y^δ) is selected heuristically but using information about y^δ and also κ.
For later use, we introduce the quasi-optimality functional ψ and a variant ψ_κ suited for functionals. Based on these, we consider the following minimization-based heuristic parameter choice rules; the first one is the classical quasi-optimality rule as in [13,19], while the second one is our modification:

    α(y^δ) = argmin_α ψ(α, y^δ),    α_κ(y^δ) = argmin_α ψ_κ(α, y^δ).
It is clear that α(y^δ) can be computed without knowledge of δ, which is the defining feature of heuristic parameter choice rules. The novel modified rule α_κ(y^δ) additionally requires knowledge of the functional smoothness (via κ). It will be shown that this additional information leads to improvements in the error bounds.
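As an illustration, the classical rule α(y^δ) = argmin_α ψ(α, y^δ), with ψ(α_i, y^δ) = ‖x^δ_{α_i} − x^δ_{α_{i−1}}‖ on a geometric grid, can be sketched numerically. The diagonal toy operator, the Tikhonov filter, and all numerical values below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Toy sketch of the quasi-optimality rule (illustrative setup):
# diagonal operator with singular values lambda_k = 1/k (mildly ill-posed),
# Tikhonov filter g_alpha(lam) = 1/(alpha + lam).
rng = np.random.default_rng(0)
n = 200
lam = 1.0 / np.arange(1, n + 1)                 # singular values lambda_k
x_true = (1.0 / np.arange(1, n + 1)) ** 1.5     # "smooth" solution coefficients
y = lam * x_true                                # exact data y = T x
delta = 1e-3
xi = rng.standard_normal(n)
y_delta = y + delta * xi / np.linalg.norm(xi)   # noisy data, ||y - y_delta|| = delta

def x_alpha(alpha):
    # Tikhonov regularized solution, coefficient-wise:
    # x^delta_alpha = (alpha I + T*T)^{-1} T* y^delta
    return lam * y_delta / (alpha + lam**2)

# geometric grid alpha_i = q^i and quasi-optimality functional
# psi(alpha_i) = ||x^delta_{alpha_i} - x^delta_{alpha_{i-1}}||
alphas = 0.5 ** np.arange(14)
psi = [np.linalg.norm(x_alpha(a) - x_alpha(ap))
       for ap, a in zip(alphas[:-1], alphas[1:])]
alpha_qo = alphas[1:][int(np.argmin(psi))]      # alpha(y^delta) = argmin psi
err = np.linalg.norm(x_alpha(alpha_qo) - x_true)
```

Note that the rule uses no knowledge of δ; only the α-grid and the successive differences enter. The grid is kept above the smallest singular value squared so that the differences do not degenerate in this discrete setting.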
At first, we state some standard assumptions (Assumption 1):

1. For all α > 0 we have
4. The qualification of g_α covers ϕ and ϕκ, i.e., for all α > 0,
5. The function κ is covered by the qualification 1/2, i.e., for all α > 0,
6. The functions κ, ϕ are regularly varying: for all c_8 > 0 there exist c_9 > 0 and δ_0 > 0 such that

We note that in several places, condition (2.5) could be replaced by one with a more general qualification, i.e., that there exists µ_0 > 0 such that for any λ > 0, (2.10) holds. In addition to the structural conditions on the filter and index functions, we impose the following generalization of the noise condition (2.1): (2.11). We now state the main convergence result of the paper. In the sequel, we denote by ∨ the maximum.
Remark 1. If we replace (2.5) by the more general condition (2.10), then the convergence rates in this theorem change accordingly.

Remark 2. Formula (2.12) can be deduced using the reasoning of [13,19] (the authors used concrete power functions in their estimates). It can also be seen from our proof with κ(λ) ≡ 1. To verify (2.12), actually only (2.1) is required, which is implied by (2.11), as the following remark indicates.
Remark 4. If we use the generalized qualification condition (2.10) and replace the condition µ + γ ≤ 1 by µ + γ ≤ µ_0, then the rates in Corollary 1 change accordingly.

Remark 5. Under the conditions of Corollary 1, the bound for ‖x^δ_{α(y^δ)} − x‖ in [13,19] holds for the case with µ_0, while the order-optimal bound is O(δ^{2µ/(2µ+1)}). For linear functionals as in the corollary, a correspondingly improved order-optimal rate is known as δ → 0; see [17].

Proof of the main result
We need the following auxiliary results. Many of them are quite standard; we provide the proofs to make the exposition self-contained. First, we provide bounds for the approximation errors.
which proves the first inequality. For the remaining ones, we estimate using (2.7).
Next we bound the parameter choice functionals.
Lemma 2. Let Assumption 1 hold. Then there exists a constant c > 0 such that for all α > 0 and all δ > 0 we have the stated bounds. The inequalities for ψ_κ follow in an analogous way.
The following result is a straightforward consequence of (2.3) and ‖y^δ − y‖ ≤ δ.

Lemma 3. Let Assumption 1 hold. There exists c > 0 such that for all α > 0 and all δ > 0 we have

Lemma 4. Let Assumption 1 hold. We have for δ > 0,
Proof. Let ᾱ be such that ϕ(ᾱ) = δ/√ᾱ, i.e., ᾱ = θ^{-1}(δ). Then (3.1) follows from Lemmas 1 and 3 and the following calculations. The next lemma gives a very important consequence of (2.11), which is crucial for our proofs. In the sequel, we use the symbols K_1, K_2, . . ., and C for generic constants that may take different values in different formulas.
Lemma 7. We have

    (3.4)

where C is a constant independent of δ.

Case studies of noise conditions
In order to understand (2.1) and (2.11), we study situations in which these inequalities hold or fail, in particular for the case of random noise.
In this section, we specialize to the case where T is a compact operator; thus it admits a singular system (λ_k, v_k, u_k), i.e., λ_k > 0, T v_k = λ_k u_k, T* u_k = λ_k v_k. Then (2.1) and (2.11) can be equivalently rephrased as (4.1) and (4.2), respectively.
As an example, we now assume a polynomially decaying deterministic noise as in (4.3). Then the following table exemplifies some sufficient conditions for the noise condition (4.1) for different degrees of ill-posedness:

    [Table: ill-posedness | noise | sufficient condition for (4.1)]

Similar results can be stated for the modified noise condition (4.2):

    [Table: ill-posedness | noise | κ | sufficient condition for (4.2)]

In contrast to the deterministic case, we now investigate the case of random noise. We assume that the noise is random and of the form (4.4), where ξ_k = ξ_k(ω), ω ∈ Ω, are independent random variables on a probability space (Ω, F, P) with

    E ξ_k = 0,  Var(ξ_k) = 1,    (4.5)

and, analogously to (4.3), we assume (4.6). Note that E(y^δ − y) = 0 and Var(y^δ − y) ≍ δ^2.
The stochastic analogue of inequality (4.1) is of the following form: for almost all ω there is a constant C = C(ω) such that the corresponding bound holds. The stochastic analogue of (4.2) can be considered similarly, with the natural modifications.
Theorem 2. Assume a mildly ill-posed case, i.e., λ_k^2 ≍ k^{-β} with β > 0. Moreover, let the noise satisfy (4.4)-(4.6), and assume that the random variables {ξ_k} have moments of all orders. The proof of this theorem is given below. The assumptions on {ξ_k} hold in particular for independent Gaussian N(0, 1) random variables. Thus, for mildly ill-posed operators, the stochastic case is completely analogous to the deterministic one, and the analogous convergence rate results hold true (almost surely).
This, however, is not true for the severely ill-posed case as the following theorem shows.
Theorem 3. Assume a severely ill-posed case, i.e., λ_k^2 ≍ a^k with a ∈ (0, 1), and let (4.4) and (4.6) hold, where {ξ_k} are independent Gaussian N(0, 1) random variables. Then, in particular, the noise condition (4.1) fails almost surely in this situation. This shows that the difference between the stochastic and the deterministic case may be very essential.
Similarly to the reasoning above, we get the corresponding inequalities. Choose m ∈ (−log p/log a, √p/2), i.e., m/p < 1/(2√p) and a^m < 1/p. Then we again obtain (4.9), the failure of the noise condition. The conclusion from the above reasoning is that if the {ξ_k} are i.i.d. and λ_k^2 ≍ a^k with a ∈ (0, 1), then assumption (4.8) holds if P(|ξ_k| ∈ [ε, ε^{-1}]) = 1 for some ε > 0, that is, if the support of ξ_k is separated from 0 and ∞. The sufficiency follows from the deterministic statement.
To prove the positive results in the mildly ill-posed case, we need the following known result for random variables Y_n with Var Y_n < ∞.
Remark 7. It is interesting that the Muckenhoupt-type condition fails for typical random noise in the case of severely ill-posed problems. This observation, however, is in line with numerical investigations of the performance of heuristic rules done, for instance, by Hämarik, Palm, and Raus [12], in particular in [20]. Typically, for mildly ill-posed problems, the quasi-optimality principle is among the most efficient heuristic rules. However, for the backward heat equation (which is severely ill-posed), it performs worse than competitors such as the Hanke-Raus rules, which by our results can be understood as a consequence of the failure of the noise condition. Note that the convergence theory for the latter rules is based on a weaker Muckenhoupt-type condition, which might not suffer from the negative result in Theorem 3. Thus, the restricted noise analysis clearly explains the behaviour of heuristic rules, which was quite mysterious for a long time.
The quasi-optimality criterion in the aggregation of the regularized approximants: numerical illustration

In this section, we illustrate how the quasi-optimality criterion can be used in the aggregation of the regularized approximants by means of the linear functional strategy. Recall that the idea of such an aggregation is to approximate the best linear combination x^s_agg of the constructed regularized approximants x^δ_{α_j} of x, where "best" means that x^s_agg solves the corresponding minimization problem. It is clear that the vector c^s = (c^s_1, c^s_2, . . . , c^s_s) ∈ R^s satisfies the system of linear equations Gc = p with the Gram matrix G = (⟨x^δ_{α_i}, x^δ_{α_j}⟩ : i, j = 1, 2, . . . , s) and the vector p = (⟨x, x^δ_{α_i}⟩ : i = 1, 2, . . . , s). Since the x^δ_{α_j}, j = 1, 2, . . . , s, are already computed, the matrix G can be formed, and the calculation of the inverse matrix G^{-1} can be controlled. However, the vector p involves the unknown solution x, and therefore the system Gc = p cannot be solved directly.
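The oracle version of this aggregation (with p formed from the exact x, which is unavailable in practice and is precisely what the linear functional strategy is meant to estimate) can be sketched as follows; the diagonal test problem and all numerical values are illustrative assumptions:

```python
import numpy as np

# Oracle aggregation sketch: solve the Gram system G c = p for the best
# linear combination of precomputed Tikhonov approximants (illustrative setup).
rng = np.random.default_rng(2)
n, s = 100, 5
x_true = (1.0 / np.arange(1, n + 1)) ** 2
lam2 = 0.5 ** np.arange(1, n + 1)               # severely ill-posed spectrum
y_delta = np.sqrt(lam2) * x_true + 1e-4 * rng.standard_normal(n)
alphas = 0.1 * 0.25 ** np.arange(s)
# rows are the approximants x^delta_{alpha_j}
X = np.stack([np.sqrt(lam2) * y_delta / (a + lam2) for a in alphas])

G = X @ X.T                                     # G_ij = <x_i, x_j>
p = X @ x_true                                  # p_i = <x, x_i> (oracle!)
c = np.linalg.lstsq(G, p, rcond=None)[0]
x_agg = c @ X                                   # aggregated approximant

best_single = min(np.linalg.norm(xj - x_true) for xj in X)
agg_err = np.linalg.norm(x_agg - x_true)
```

Since each x^δ_{α_j} lies in the span being optimized over, the aggregated error can never exceed that of the best single approximant; the point of the paper is that p can be estimated by the linear functional strategy accurately enough to nearly retain this property.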
At the same time, each component ⟨x, x^δ_{α_i}⟩ of the vector p is the value of the bounded linear functional x^δ_{α_i}, and the linear functional strategy allows us to estimate ⟨x, x^δ_{α_i}⟩, i = 1, 2, . . . , s, more accurately than x in ‖·‖. For example, if x ∈ R(ϕ(T*T)) and x^δ_α = (αI + T*T)^{-1} T* y^δ, then under the conditions of Theorem 1 we have the bound (5.1), while for each α_i the quasi-optimality criterion in the linear functional strategy gives us α_i(y^δ) = α_{κ_i}(y^δ) such that (5.2) holds, where κ_i is an index function for which x^δ_{α_i} ∈ R(κ_i(T*T)).
Note that x^s_{agg,y^δ} can be effectively computed because it only requires access to T and y^δ. Then, by the same arguments as in the proof of Theorem 3.7 in [9], it follows from (5.2) that (5.3) holds. If

    α(y^δ) ∈ {α_j, j = 1, 2, . . . , s},    (5.5)

then the accuracy of x^s_agg can only be better than that of x^δ_{α(y^δ)}. Moreover, from (5.1) and (5.4), it follows that the error of the effectively computed aggregator x^s_{agg,y^δ} differs from the error of x^s_agg by a quantity of higher order than the accuracy guaranteed by the standard quasi-optimality criterion. In this way, a combination of the linear functional strategy and the quasi-optimality criterion resulting in (5.3) may improve the accuracy of the latter. Such an improvement is indeed observed in the numerical illustrations below.
Note that the family of regularized approximations x^δ_{α_j} may consist of only a single approximant x^δ_{α_i}. Then the aggregation coefficient can be written explicitly and interpreted as a correction factor c*_i for x^δ_{α_i}. If a value α = α(y^δ) has already been selected by the quasi-optimality criterion, then c*_i can be approximated accordingly, and under the conditions of Theorem 1 we have the corresponding bound with f ∈ R(κ(T*T)). Therefore, in view of (5.10) and (5.11), for a given f, say f = x^δ_{α_i}, it is reasonable to use the following discretized version of the quasi-optimality criterion in the linear functional strategy: choose α_i(y^δ) = α_{κ_i} from (5.7) such that (5.12) holds. To illustrate the quasi-optimality criterion in the aggregation (5.3), (5.12), we simulate the data by (1.1), where T is an m × n matrix T = (t_ij), i = 1, 2, . . . , m, j = 1, 2, . . . , n, with the non-zero entries t_kk = a^k, 0 < a < 1; x is a vector x = (x_j = j^{-µ} η_j, j = 1, 2, . . . , n); and the η_j are randomly sampled from the uniform distribution on [−1, 1]. We take a = 0.5, µ = 2, n = 100, m = 150.
Our simulation mimics a severely ill-posed problem because the singular values λ_k^2 = t_kk^2 = a^{2k} of T*T decrease exponentially, while the Fourier coefficients x_j of x in the corresponding basis decrease only polynomially. A reason to consider this case is that, as can be seen from Theorem 1, for severely ill-posed problems the difference between the estimation of the solution and the estimation of a functional is the most noticeable. For example, if ϕ(λ) = log^{-ν}(1/λ), ν > 0, which corresponds to the severely ill-posed case, then the quasi-optimality criterion can guarantee an accuracy of order O(log^{-ν}(log(1/δ))) for an approximation of x, while the value of a bounded linear functional ⟨f, x⟩ can be estimated with the use of the quasi-optimality criterion much more accurately, say with an accuracy of order O(δ^{2γ^2} log^{-ν(1+γ−2γ^2)}(1/δ)) when f ∈ R((T*T)^γ), 0 < γ < 1/2.
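For comparison, the noise-level-dependent benchmark rate ϕ(θ^{-1}(δ)) for this logarithmic ϕ follows from a standard asymptotic calculation (stated to leading order as δ → 0):

```latex
\varphi(\lambda) = \Bigl(\log\tfrac{1}{\lambda}\Bigr)^{-\nu},
\qquad
\theta(\lambda) = \sqrt{\lambda}\,\Bigl(\log\tfrac{1}{\lambda}\Bigr)^{-\nu},
\qquad
\theta^{-1}(\delta) \asymp \delta^{2}\Bigl(\log\tfrac{1}{\delta}\Bigr)^{2\nu},
\qquad
\varphi\bigl(\theta^{-1}(\delta)\bigr) \asymp \Bigl(\log\tfrac{1}{\delta}\Bigr)^{-\nu}.
```

This is the benchmark against which the heuristic bounds stated above are to be compared.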
Numerical illustrations below demonstrate that in the considered simulation scenario, the aggregation (5.3), (5.12), which is based on the quasi-optimality criterion and the linear functional strategy, improves the accuracy resulting from the quasi-optimality criterion and performs at the level of the best (but unknown) regularization parameter choice.
To guarantee almost surely that the Muckenhoupt-type condition (2.11) on the noise ξ is satisfied in our test, we simulate ξ as ξ = (ξ_i, i = 1, 2, . . . , m), m = 150, where the ξ_i are randomly sampled from the uniform distribution on [−1, −δ] ∪ [δ, 1], δ > 0, so that the noise support is separated from 0 and ∞, as suggested by Remark 6 in the previous section.
The random simulations of ξ and x are performed 10 times, and the noise intensity is chosen as δ = 0.01. The regularized approximants x^δ_{α_i} are constructed by Tikhonov regularization, i.e., x^δ_{α_i} = (α_i I + T*T)^{-1} T* y^δ.
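The experiment described above can be sketched as follows; the α-grid, the random seed, and the error measures are illustrative choices, while T, x, and the noise follow the text:

```python
import numpy as np

# Sketch of the simulation: diagonal T with t_kk = a^k, x_j = j^{-mu} eta_j,
# noise uniform on [-1, -delta] U [delta, 1] (support separated from 0),
# Tikhonov approximants on a geometric grid of alphas.
rng = np.random.default_rng(3)
a, mu, n, m, delta = 0.5, 2.0, 100, 150, 0.01
t = a ** np.arange(1, n + 1)                     # nonzero diagonal of T
x_true = np.arange(1, n + 1) ** (-mu) * rng.uniform(-1, 1, n)
y = np.zeros(m)
y[:n] = t * x_true                               # exact data y = T x
xi = rng.uniform(delta, 1, m) * rng.choice([-1.0, 1.0], m)
y_delta = y + delta * xi / np.linalg.norm(xi)    # noisy data

def tik(alpha):
    # x^delta_alpha = (alpha I + T*T)^{-1} T* y^delta (diagonal T)
    return t * y_delta[:n] / (alpha + t**2)

alphas = 0.5 ** np.arange(1, 8)                  # alpha_i, i = 1, ..., 7
approx = [tik(al) for al in alphas]
rel_err = [np.linalg.norm(xa - x_true) / np.linalg.norm(x_true)
           for xa in approx]
qo = [np.linalg.norm(b, ord=None) for b in
      (xb - xc for xb, xc in zip(approx[1:], approx[:-1]))]
```

From approx, rel_err, and qo one can read off both the best attainable approximant on the grid and the quasi-optimality choice; the aggregation (5.3) would then be built on top of these approximants.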
The performance of the regularized approximants is measured in terms of the quantities below, where s = max{i : α_i ≥ α(y^δ)} and x^s_{agg,y^δ} is given by (5.3), (5.12). The mean values of the considered quantities over the performed simulations are given in Table 1. The table also reports the values observed in a particular simulation displayed in Figure 1.
The presented illustration confirms that for severely ill-posed problems, the aggregation based on the linear functional strategy is able to perform at the level of the best, but unknown, regularization parameter choice.

Figure 1: ‖x − x^δ_{α_i}‖ (error2), ‖x^δ_{α_i} − x^δ_{α_{i−1}}‖ (qo), ‖x^δ_{α_i} − x^δ_{α_{i−1}}‖ (qo2), plotted against the corresponding values of α_i, i = 1, 2, . . . , 7.