Early stopping for statistical inverse problems via truncated SVD estimation

We consider truncated SVD (or spectral cut-off, projection) estimators for a prototypical statistical inverse problem in dimension $D$. Since calculating the singular value decomposition (SVD) only for the largest singular values is much less costly than the full SVD, our aim is to select a data-driven truncation level $\widehat m\in\{1,\ldots,D\}$ based only on the knowledge of the first $\widehat m$ singular values and vectors. We analyse in detail whether sequential {\it early stopping} rules of this type can preserve statistical optimality. Information-constrained lower bounds and matching upper bounds for a residual-based stopping rule are provided, which give a clear picture of the situations in which optimal sequential adaptation is feasible. Finally, a hybrid two-step approach is proposed which allows for classical oracle inequalities while considerably reducing numerical complexity.


1.1 Model
A classical model for statistical inverse problems is the observation of
$$Y = A\mu + \delta \dot W, \qquad (1.1)$$
where $A : H_1 \to H_2$ is a linear, bounded operator between real Hilbert spaces $H_1, H_2$, $\mu \in H_1$ is the signal of interest, $\delta > 0$ is the noise level and $\dot W$ is a Gaussian white noise in $H_2$; see e.g. Bissantz et al. [1], Cavalier [5] and the references therein. In any concrete situation the problem is discretised, for instance by a Galerkin scheme projecting onto finite element or other approximation spaces. We can therefore assume $H_1 = \mathbb R^D$, $H_2 = \mathbb R^P$ with possibly very large $D$ and $P$. Since the discretisation of $\mu$ is at our choice, we assume $D \le P$ and that $A : \mathbb R^D \to \mathbb R^P$ is one-to-one. We transform (1.1) by the singular value decomposition (SVD) of $A$ into the Gaussian vector observation model
$$Y_i = \lambda_i \mu_i + \delta \varepsilon_i, \quad i = 1, \ldots, D, \qquad (1.2)$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D > 0$ are the nonzero singular values of $A$, $(\mu_i)_{1\le i\le D}$ are the coefficients of $\mu$ in the orthonormal basis of singular vectors and $(\varepsilon_i)_{1\le i\le D}$ are independent standard Gaussian random variables. The results extend easily to subgaussian errors, as discussed below.
Working in the SVD representation (1.2), the objective is to recover the signal $\mu = (\mu_i)_{1\le i\le D}$ with the best possible accuracy from the data $(Y_i)_{1\le i\le D}$. A classical method is to use the truncated SVD estimators (also called projection or spectral cut-off estimators) $\widehat\mu^{(m)}$, $0 \le m \le D$, given by
$$\widehat\mu^{(m)}_i = \lambda_i^{-1} Y_i \,\mathbf 1(i \le m), \quad 1 \le i \le D, \qquad (1.3)$$
which are ordered with decreasing bias and increasing variance (with respect to $m$).
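As a concrete illustration, the truncated SVD estimator and its strong bias and variance can be sketched in a few lines. All numerical values here are hypothetical and only serve to exhibit the monotone bias-variance trade-off; the function names are ours.

```python
import numpy as np

def tsvd_estimator(Y, lam, m):
    """Truncated SVD (spectral cut-off) estimator: invert the first m
    SVD coefficients of the data, set the remaining ones to zero."""
    mu_hat = np.zeros_like(Y)
    mu_hat[:m] = Y[:m] / lam[:m]
    return mu_hat

def squared_bias(mu, m):
    """Strong squared bias B_m^2(mu): energy of the truncated tail."""
    return float(np.sum(mu[m:] ** 2))

def variance(lam, delta, m):
    """Strong variance V_m = delta^2 * sum_{i<=m} lambda_i^{-2}."""
    return float(delta ** 2 * np.sum(lam[:m] ** -2.0))
```

The squared bias is non-increasing and the variance non-decreasing in m, which is exactly the trade-off a data-driven truncation index has to balance.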
Choosing a suitable truncation index $\widehat m = \widehat m(Y)$ from the observed data is the genuine problem of adaptive model selection. Typical methods use (generalised) cross-validation, see e.g. Wahba [19], unbiased risk estimation, see e.g. Cavalier et al. [6], penalised empirical risk minimisation, see e.g. Cavalier and Golubev [7], or Lepski's balancing principle for inverse problems, see e.g. Mathé and Pereverzev [15]. They all share the drawback that the estimators $\widehat\mu^{(m)}$ first have to be computed for all values $0 \le m \le D$ and then compared to each other in some way.
In this work, we are motivated by constraints due to the possibly prohibitive computational complexity of calculating the full SVD in high dimensions. We stress that the initial discretisation of the observation is generally based on a fixed scheme which does not deliver a representation of the observation vector $Y$ in an SVD basis; nor is the full SVD basis of the discretised operator $A$ a priori available in general, so that it has to be computed on the fly. Since the calculation of the largest singular value and its corresponding subspace is much less costly, efficient numerical algorithms rely on deflation or locking methods, which achieve the desired accuracy for the larger singular values first and then iteratively achieve it for the next smaller singular values. A basic example is the popular power method, which usually finds the top eigenvalue-eigenvector pair after a few vector-matrix multiplications (with exponentially small error in the iteration number), so that by iterative application the largest $m$ singular values and vectors are computed with roughly $O(mD^2)$ multiplications, compared to $O(D^3)$ multiplications for a full SVD in a worst-case scenario. We refer to the monograph by Saad [17] for a comprehensive exposition of the numerical methods.
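The power-method-with-deflation idea can be sketched as follows. This is a minimal stand-in for the Krylov/deflation solvers discussed in the text, not the paper's implementation; the function name and parameters are ours.

```python
import numpy as np

def top_singular_triplets(A, m, n_iter=500, seed=0):
    """Power iteration on A^T A plus deflation: computes only the m largest
    singular triplets, at roughly O(D^2) cost per matrix-vector sweep."""
    rng = np.random.default_rng(seed)
    A_work = np.array(A, dtype=float)
    sigmas, us, vs = [], [], []
    for _ in range(m):
        v = rng.standard_normal(A_work.shape[1])
        for _ in range(n_iter):          # power iteration on A^T A
            v = A_work.T @ (A_work @ v)
            v /= np.linalg.norm(v)
        sigma = np.linalg.norm(A_work @ v)
        u = (A_work @ v) / sigma
        sigmas.append(sigma)
        us.append(u)
        vs.append(v)
        A_work = A_work - sigma * np.outer(u, v)   # deflate the found pair
    return np.array(sigmas), np.column_stack(us), np.column_stack(vs)
```

Each deflation step subtracts the rank-one component just found, so the next power iteration converges to the next-largest singular pair; production code would use Lanczos-type methods with locking instead.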
We investigate the possibility of an approach which is both statistically efficient and sequential along the SVD in the following sense: we aim at early stopping methods, in which the truncated SVD estimators $\widehat\mu^{(m)}$ for $m = 0, 1, \ldots$ are computed iteratively, a stopping rule decides to stop at some step $\widehat m$, and then $\widehat\mu^{(\widehat m)}$ is used as the estimator.
More generally, we envision our setting as a simple and prototypical model to study the scope of statistical adaptivity using iterative methods, which are widely used in computational statistics and learning. A notable feature of these methods is that not only the numerical but also the statistical complexity (e.g., measured by the variance) increases with the number of iterations, so that early stopping is essential from both points of view. It is common to use stopping rules based on monitoring the residuals, because the user has access to the residual norm without substantial additional cost. Observe that the computation of the residual norm does not require the full SVD, but only the knowledge of the first $m$ coefficients and of the full norm $\|Y\|^2$, which is readily available. The properties of such rules have been well studied for deterministic inverse problems (e.g. the discrepancy principle, see Engl et al. [10]). In a statistical setting, minimax optimal solutions along the iteration path have been identified in different settings, see e.g. Yao et al. [20] for gradient descent learning, Blanchard and Mathé [3] for conjugate gradients, Raskutti, Wainwright and Yu [16] for (reproducing) kernel learning and Bühlmann and Hothorn [4] for the application to $L^2$-boosting. All these methods stop at a fixed iteration step, depending on prior knowledge of the smoothness of the unknown solution. By contrast, our goal is to analyse an a posteriori early stopping rule based on monitoring the residual, which corresponds to proposals by practitioners in the absence of prior smoothness information. An analysis of such a stopping rule for quite general spectral estimators like Landweber iteration is provided in the companion paper [2]. Although the general results can also be applied to the truncated SVD method, we exhibit a more transparent analysis for this prototypical method which gives more satisfactory results: we establish coherent lower bounds and we obtain adaptivity in strong norm via the oracle property, while for more general spectral estimators only rate results over Sobolev-type classes can be achieved. Moreover, a hybrid two-step procedure enjoys full adaptivity for the truncated SVD method.
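The sequential computability of the residual norm can be made explicit in a small sketch (our notation; the data are assumed to be given in SVD coordinates):

```python
import numpy as np

def residual_norms(Y):
    """All squared residuals R_m^2 = ||Y||^2 - sum_{i<=m} Y_i^2 for
    m = 0, ..., D.  Each step only needs the next SVD coefficient and
    the readily available full norm ||Y||^2 -- an O(1) update per m."""
    R2 = [float(np.sum(Y ** 2))]
    for y in Y:
        R2.append(R2[-1] - float(y ** 2))
    return np.array(R2)
```

This is why residual monitoring fits the sequential SVD computation: no coefficient beyond the current index is ever needed.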

1.2 Non-asymptotic oracle approach
Our approach is a priori non-asymptotic and concentrates on an oracle optimality analysis for individual signals. The oracle approach compares the error of $\widehat\mu^{(\widehat m)}$ to the minimal error among $(\widehat\mu^{(m)})_{0 \le m \le D}$ for any signal $\mu$ individually, which entails optimal adaptation in minimax settings, see e.g. Cavalier [5].
The risk (mean integrated squared error) of a fixed truncated SVD estimator $\widehat\mu^{(m)}$ obeys the standard squared bias-variance decomposition
$$E_\mu\big[\|\widehat\mu^{(m)} - \mu\|^2\big] = B_m^2(\mu) + V_m,$$
where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb R^D$, and
$$B_m^2(\mu) = \sum_{i > m} \mu_i^2, \qquad (1.4) \qquad\qquad V_m = \delta^2 \sum_{i \le m} \lambda_i^{-2}. \qquad (1.5)$$
In distinction from the weak-norm quantities defined below, we call $B_m^2(\mu)$ the strong bias and $V_m$ the strong variance.
A central quantity is the residual squared norm
$$R_m^2 = \sum_{i > m} Y_i^2 = \|Y\|^2 - \sum_{i \le m} Y_i^2, \qquad (1.6)$$
whose expectation $E_\mu[R_m^2] = B_{m,\lambda}^2(\mu) + (D - m)\delta^2$ involves $B_{m,\lambda}^2(\mu) = \sum_{i > m} \lambda_i^2 \mu_i^2$. We call $B_{m,\lambda}^2(\mu)$ the weak bias and similarly $V_{m,\lambda} = m\delta^2$ the weak variance. They correspond to measuring the error in the weak (or prediction) norm $\|v\|_\lambda^2 = \sum_{i=1}^D \lambda_i^2 v_i^2$, which is usually (always if $\lambda_1 < 1$) smaller than the strong Euclidean norm $\|\cdot\|$. The squared bias-variance decomposition for the weak risk then reads
$$E_\mu\big[\|\widehat\mu^{(m)} - \mu\|_\lambda^2\big] = B_{m,\lambda}^2(\mu) + V_{m,\lambda}.$$
Our setting is thus a particular instance of the question raised by Lepski [13] whether adaptation in one loss (here: weak norm) entails adaptation in another loss (here: strong norm). Our positive answer for truncated SVD or spectral cut-off estimation will also extend the results by Chernousova et al. [8].
Intrinsic to the sequential analysis is the fact that at truncation index $m$ we cannot say anything about the way the bias decreases for larger indices: it may drop to zero at $m+1$ or stay constant until $D-1$. Even if we knew the exact value of the bias up to index $m$, we could not minimise the sum of squared bias and variance sequentially. Instead, we should wait until the squared bias is sufficiently small to equal (approximately) the variance. This leads to the notion of the strongly balanced oracle
$$m_s = m_s(\mu) = \min\{ m \ge 0 : B_m^2(\mu) \le V_m \}, \qquad (1.7)$$
whose risk is always upper bounded by twice the classical oracle risk, see (3.5) below.

1.3 Setting for asymptotic considerations
Risk estimates over classes of signals and asymptotics for vanishing noise level $\delta \to 0$ often help to reveal the main features. In this way, we can also provide lower bounds for sequential estimation procedures and compare them directly to classical minimax convergence rates. In our setting, the magnitude of the discretisation dimension $D$ plays a central role, so that in an asymptotic view it is sensible to assume $D = D_\delta \to \infty$ as $\delta \to 0$. As classes of signals we consider the Sobolev-type ellipsoids
$$H^\beta(R, D) = \Big\{ \mu \in \mathbb R^D : \sum_{i=1}^D i^{2\beta} \mu_i^2 \le R^2 \Big\},$$
and we shall use the polynomial spectral decay assumption
$$C_A^{-1} i^{-p} \le \lambda_i \le C_A i^{-p}, \quad 1 \le i \le D, \qquad (\mathrm{PSD}(p, C_A))$$
for $p \ge 0$, $C_A \ge 1$. The spectrum is allowed to change with $D$ and $\delta$, but $p, C_A$ are considered fixed constants. Under these assumptions, standard computations yield for $\mu \in H^\beta(R, D)$, $1 \le m \le D$, the bounds $B_m^2(\mu) \le R^2 m^{-2\beta}$ and $V_m \lesssim \delta^2 m^{2p+1}$. Balance between these squared bias and variance bounds is obtained for $m$ of the order of the minimax truncation "time"
$$t_{\beta,p,R}(\delta) = (R\delta^{-1})^{2/(2\beta+2p+1)},$$
provided the condition $D \ge t_{\beta,p,R}(\delta)$ holds. This gives rise to the risk rate $R^2 (R^{-1}\delta)^{4\beta/(2\beta+2p+1)}$, which agrees with the optimal minimax rate in the standard Gaussian sequence model (i.e. $D = \infty$). On the other hand, for $D \ll t_{\beta,p,R}(\delta)$ the choice $m = D$ is optimal on $H^\beta(R, D)$ and the rate degenerates to $O(D^{2p+1}\delta^2)$. This situation is indicative of an insufficient discretisation and will be excluded from the asymptotic considerations.
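The "standard computations" behind these bias and variance bounds can be spelled out as a short sketch:

```latex
B_m^2(\mu) = \sum_{i > m} \mu_i^2
  \le m^{-2\beta} \sum_{i > m} i^{2\beta} \mu_i^2
  \le R^2 m^{-2\beta},
\qquad
V_m = \delta^2 \sum_{i \le m} \lambda_i^{-2}
  \le C_A^2\, \delta^2 \sum_{i \le m} i^{2p}
  \le C_A^2\, \delta^2\, m^{2p+1}.
```

Equating the two bounds, $R^2 m^{-2\beta} \asymp \delta^2 m^{2p+1}$, yields $m \asymp (R\delta^{-1})^{2/(2\beta+2p+1)}$ and hence a squared risk of order $R^2 (R^{-1}\delta)^{4\beta/(2\beta+2p+1)}$.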

1.4 Overview of results
Our results consist of lower and upper bounds for sequentially adaptive stopping rules. The stopping rules permitted are most conveniently described in terms of stopping times with respect to an appropriate filtration. Introduce the frequency filtration
$$\mathcal F_m = \sigma(Y_1, \ldots, Y_m), \quad 1 \le m \le D, \qquad (1.10)$$
with $\mathcal F_0$ the trivial sigma-field. Stopping rules with respect to the filtration $\mathcal F = (\mathcal F_m)_{0 \le m \le D}$ must decide whether to halt and output $\widehat\mu^{(m)}$ based only on the information of the first $m$ estimators. Statistical adaptation will turn out to be essentially impossible for such stopping rules (Section 2.1). If the residual (1.6) is available at no substantial computational cost, we take this information into account and define the residual filtration
$$\mathcal G_m = \sigma(\mathcal F_m, R_0^2, \ldots, R_m^2), \quad 0 \le m \le D, \qquad (1.11)$$
which is the filtration $\mathcal F_m$ enlarged by the residuals up to index $m$. Pushing some technical details aside, the main message conveyed by our lower bounds is that oracle statistical adaptation with respect to the residual filtration is impossible for signals $\mu$ such that the strongly balanced oracle $m_s(\mu)$ is $o(\sqrt D)$ (Section 2.2). On the other hand, we establish in Section 3 that this statement is sharp, in the sense that the simple residual-based stopping rule
$$\tau = \min\{ m \ge m_0 : R_m^2 \le \kappa \}$$
with a proper choice of $\kappa$ and $m_0$ is statistically adaptive for signals $\mu$ such that $m_s(\mu) \gtrsim \sqrt D$. Let us stress that by minimax adaptive we always mean that the procedure attains the optimal rate even among all methods with access to the entire data, that is, without information constraints.
Finally, in Section 4 we introduce a hybrid two-step approach consisting of the above stopping rule with $m_0 \sim \sqrt D \log D$, followed by a traditional (non-sequential) model selection procedure over $m \le m_0$ in the case $\tau = m_0$ (immediate stopping, hinting at an optimal index smaller than $m_0$). This procedure enjoys full oracle adaptivity at the computational cost of calculating on average the first $O(\max(\sqrt D \log D,\, m_s(\mu)))$ singular values, to be compared with the full SVD with $D$ singular values in non-sequential adaptation. Some numerical simulations illustrate the theoretical analysis. Technical proofs are gathered in an appendix.

2.1 The frequency filtration
Let $\tau$ be an $\mathcal F$-stopping time, where $\mathcal F$ is the frequency filtration defined in (1.10), and let $R(\mu, \tau)^2 = E_\mu[\|\widehat\mu^{(\tau)} - \mu\|^2]$ denote its risk. By Wald's identity, we obtain the simple formula
$$R(\mu, \tau)^2 = E_\mu\big[ B_\tau^2(\mu) + V_\tau \big] \qquad (2.1)$$
with $B_m^2(\mu)$ and $V_m$ from (1.4), (1.5). This implies in particular that an oracle stopping time, i.e. an optimal $\mathcal F$-stopping time constructed using the knowledge of $\mu$, coincides almost surely with the deterministic oracle $\operatorname{argmin}_m (B_m^2(\mu) + V_m)$. The next proposition encapsulates the main argument for the lower bound and relies merely on a two-point analysis. It clarifies that if the stopping time $\tau$ yields a squared risk comparable to the optimally balanced risk for a given signal $\mu$, then this signal can be changed arbitrarily to $\tilde\mu$ after the index $3Cm_s(\mu)$, while the risk of the rule $\tau$ always stays larger than the squared bias of that part, which can be made arbitrarily large by "hiding" signal in large-index coefficients.
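The Wald-type computation behind this risk formula can be sketched as follows (a standard derivation, using only the model (1.2) and the definitions of the strong bias and variance):

```latex
\|\widehat\mu^{(\tau)} - \mu\|^2
  = \sum_{i > \tau} \mu_i^2
    + \delta^2 \sum_{i \le \tau} \lambda_i^{-2} \varepsilon_i^2
  = B_\tau^2(\mu) + \delta^2 \sum_{i \le \tau} \lambda_i^{-2} \varepsilon_i^2,
```

since $\widehat\mu_i^{(\tau)} - \mu_i = \delta\lambda_i^{-1}\varepsilon_i$ for $i \le \tau$ and $-\mu_i$ otherwise. Because $\{\tau \ge i\} \in \mathcal F_{i-1}$ is independent of $\varepsilon_i$, taking expectations term by term gives $E_\mu\big[\delta^2 \sum_{i \le \tau} \lambda_i^{-2} \varepsilon_i^2\big] = E_\mu[V_\tau]$, which is Wald's identity in this setting.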
2.1 Proposition. Let $\mu, \tilde\mu \in \mathbb R^D$ with $\mu_i = \tilde\mu_i$ for all $i \le i_0$ and some $i_0 \in \{1, \ldots, D-1\}$. Then any $\mathcal F$-stopping rule $\tau$ satisfies
$$R(\tilde\mu, \tau)^2 \ge \Big(1 - \frac{R(\mu, \tau)^2}{V_{i_0+1}}\Big) B_{i_0}^2(\tilde\mu).$$
Suppose in addition $R(\mu, \tau)^2 \le 2C V_{m_s}$ for the balanced oracle $m_s$ in (1.7) and some $C \ge 1$. Then for any $\tilde\mu \in \mathbb R^D$ with $\tilde\mu_i = \mu_i$ for $i \le 3Cm_s$ we obtain
$$R(\tilde\mu, \tau)^2 \ge \tfrac13\, B_{3Cm_s}^2(\tilde\mu).$$
Proof. We use the fact that $(Y_i)_{1 \le i \le i_0}$ has the same law under $P_\mu$ and $P_{\tilde\mu}$, and so has $\mathbf 1(\tau \le i_0)$ by the stopping time property of $\tau$. Thanks to the monotonicity of $m \mapsto V_m$ and $m \mapsto B_m^2(\tilde\mu)$, Markov's inequality and identity (2.1):
$$R(\tilde\mu, \tau)^2 \ge B_{i_0}^2(\tilde\mu)\, P_{\tilde\mu}(\tau \le i_0) = B_{i_0}^2(\tilde\mu)\, P_\mu(\tau \le i_0) \ge B_{i_0}^2(\tilde\mu)\Big(1 - \frac{R(\mu, \tau)^2}{V_{i_0+1}}\Big).$$
The second assertion follows by inserting $i_0 = 3Cm_s$ and $R(\mu, \tau)^2 \le 2C V_{m_s}$ together with $V_{m_s}/V_{i_0+1} \le m_s/(i_0+1)$, which holds since the singular values $\lambda_i$ are non-increasing.
In Appendix 5.1 we use this proposition to provide a result suitable for asymptotic interpretation (with the notation from Section 1.3). The conclusion that rate-optimal adaptation is impossible is a direct consequence of Corollary 2.2: since for any $\alpha < \beta$ the rate $\delta^{2\alpha/(2\beta+2p+1)}$ is suboptimal, no $\mathcal F$-stopping rule can adapt over Sobolev classes with different regularities. Finally, the rate $R(R^{-1}\delta)^{2\alpha/(2\beta+2p+1)}$ is attained by a deterministic stopping rule that stops at the oracle frequency for $H^\beta(R, D)$, so that the lower bound is in fact a sharp no-adaptation result.

2.2 Residual filtration
We start with a key lemma, similar in spirit to the first step in the proof of Proposition 2.1, but valid for an arbitrary random τ .Here and in the sequel the numerical values are not optimised, but give rise to more transparent proofs and convey some intuition for the worst case order of magnitude.The proof is delayed until Appendix 5.2.

2.3 Lemma. Let $\widehat m \in \{1, \ldots, D\}$ be an arbitrary (measurable) data-dependent index. Then for any $m \in \{1, \ldots, D\}$ the following implication holds true:
$$V_m \ge 200\, R(\mu, \widehat m)^2 \;\Longrightarrow\; P_\mu(\widehat m \ge m) \le 0.9.$$
For $\mathcal G$-stopping rules, where $\mathcal G$ is the residual filtration defined in (1.11), we deduce the following lower bound, again based on a two-point argument.
2.4 Proposition. Let $\tau$ be an arbitrary $\mathcal G$-stopping rule. Consider $\mu \in \mathbb R^D$ and $i_0 \in \{1, \ldots, D\}$ such that the weak-bias conditions (b) and (c) discussed below hold; any $i_0 \ge 400 C_\mu m_s$ will satisfy the initial requirement.
Proof. First, we lower bound the risk of $\tilde\mu$ by its bias on $\{\tau \le i_0\}$ and then transfer to the law of $\tau$ under $P_\mu$, using the total variation distance on $\mathcal G_{i_0}$. Since the law of $W_{i_0}$ is identical under $P_\mu$ and $P_{\tilde\mu}$, and $W_{i_0}$ is independent of $R_{i_0}^2$ under both measures, the total variation distance between $P_\mu$ and $P_{\tilde\mu}$ on $\mathcal G_{i_0}$ equals the total variation distance between the respective laws of the scaled residual $\delta^{-2} R_{i_0}^2$, which in turn exactly equals $\|P_\vartheta^K - P_\theta^K\|_{TV}$ for the corresponding noncentral laws. By Lemma 5.1 in the Appendix, taking account of $\vartheta = \delta^{-1} B_{i_0,\lambda}(\mu)$ and similarly for $\theta$, we infer from (c) a simplified bound which, under our assumption on $\tilde\mu$, is at most $0.05$, and the inequality follows.
In comparison with the frequency filtration, the main new hypothesis is that at $i_0$ the weak bias of $\tilde\mu$ is sufficiently close to that of $\mu$, while the lower bound is still expressed in terms of the strong bias. This is natural since the bias enters the residuals only in weak form, while the risk involves the strong bias. Condition (c) is assumed merely to simplify the bound. To obtain valuable counterexamples, $\tilde\mu$ is usually chosen at the maximal weak bias distance from $\mu$ allowed by (b), so that (c) is always satisfied in the interesting cases where $\sqrt{D - i_0}$ is not small. Considering the behaviour over Sobolev-type ellipsoids, we obtain in Appendix 5.4 a lower bound result (Corollary 2.5 below) comparable to Corollary 2.2 for the frequency filtration; its constants $c_1, c_3 > 0$ and $c_2 \in (0, 1]$ depend only on $C_\mu, C_A, \alpha, p$. The form of the lower bound is transparent: as in the case of the frequency filtration, the suboptimal rate $R(R^{-1}\delta)^{2\alpha/(2\beta+2p+1)}$ is the one attained by a deterministic rule that stops at the oracle frequency for $H^\beta(R, D)$, whereas $R(R^{-1}\delta D^{1/4})^{2\alpha/(2\alpha+2p)}$ is the size of a signal that may be hidden in the noise of the residual, i.e. is not detected with positive probability by any test, thus also leading to erroneous early stopping. Note that for the direct problem ($p = 0$) the latter quantity is just $\delta D^{1/4}$, which is exactly the critical signal strength in nonparametric testing, see Ingster and Suslina [11], while for $p > 0$ it reflects the interplay between the weak bias part in the residual and the strong bias part in the risk within the Sobolev ellipsoid.
Corollary 2.5 implies in turn explicit constraints on the maximal Sobolev regularity to which a $\mathcal G$-stopping rule can possibly adapt. Here we argue asymptotically and let $D = D_\delta$ tend to infinity as the noise level $\delta$ tends to zero. In this setting, a stopping rule $\tau$ is to be understood as a family of stopping rules depending on the knowledge of $D$ and $\delta$.
2.6 Corollary. In particular, if a $\mathcal G$-stopping rule $\tau$ is rate-optimal over $H^\beta(R, D_\delta)$ for all $\beta \in [\beta_{\min}, \beta_{\max}]$, $\beta_{\max} > \beta_{\min} \ge 0$, and some $R > 0$, then the rate-optimal truncation index for the largest regularity must necessarily satisfy $t_{\beta_{\max},p,R}(\delta) \gtrsim \sqrt{D_\delta}$.
Proof. In this proof, '$\lesssim$', '$\gtrsim$' denote inequalities holding up to factors depending on $C_A, p, \beta_+, \beta_-, R_-, R_+$. We apply Corollary 2.5; the conditions are fulfilled for sufficiently small $\delta > 0$, and we conclude a lower bound on the attainable rate ($R_+, R_-$ being fixed). By assumption, that rate must be $O(\delta^{2\beta_-/(2\beta_-+2p+1)})$. Since the second term in the minimum is of larger order than this, the stated constraint follows. For the second assertion, we proceed by contradiction and assume that the constraint fails along some sequence $\delta_k \to 0$, contradicting the first part of the corollary.
For statistical inverse problems with singular values satisfying the polynomial decay (PSD($p, C_A$)), we may choose the maximal dimension $D_\delta \sim \delta^{-2/(2p+1)}$ without losing in the convergence rate for a Sobolev ellipsoid of any regularity $\beta \ge 0$, see e.g. Cohen et al. [9]. In fact, we then have the variance $V_{D_\delta} \asymp \delta^2 D_\delta^{2p+1} \asymp 1$, and an estimator with truncation index of the order of $D_\delta$ will not be consistent anyway; the oracle index is always of order $o(D_\delta)$, whatever the signal regularity. For this choice of $D_\delta$, optimal adaptation is only possible if the squared minimax rate lies within the interval $[\delta, 1]$; faster adaptive rates, up to the parametric order $\delta^2$, cannot be attained. Usually, $D_\delta$ will be chosen much smaller, assuming some minimal a priori regularity $\beta_{\min}$. The choice $D_\delta \sim \delta^{-2/(2\beta_{\min}+2p+1)}$ ensures that rate optimality is possible for all (sequence space) Sobolev regularities $\beta \ge \beta_{\min}$ when using either oracle (non-adaptive) rules or adaptive rules that are not stopping times. In contrast, any $\mathcal G$-stopping rule can at best adapt over the regularity interval $[\beta_{\min}, \beta_{\max}]$ with $\beta_{\max} = 2\beta_{\min} + p + 1/2$ (keeping the radius $R$ of the Sobolev ball fixed). These adaptation intervals are best understood by inspecting the corresponding rate-optimal truncation indices $t_{\beta,p,R}(\delta)$, which must at least be of order $\sqrt{D_\delta} \sim \delta^{-1/(2\beta_{\min}+2p+1)}$ in order to distinguish a signal in the residual from the pure noise case.

Consider the residual-based stopping rule
$$\tau = \min\{ m \ge m_0 : R_m^2 \le \kappa \}$$
for a threshold $\kappa > 0$ and an initial index $m_0 \ge 0$. Since $m \mapsto R_m^2$ is decreasing with $R_D^2 = 0$, the minimum is attained and we have $R_\tau^2 \le \kappa$. In order to obtain clearer oracle inequalities, we work with continuous oracle-type truncation indices in $[0, D]$. Interpolating such that $V_{t,\lambda} = t\delta^2$ continues to hold for real $t \in [0, D]$, we define interpolated bias, variance and residual quantities, and obtain decompositions into a bias term and a stochastic error term, (3.1) and (3.2), noting that the last term in (3.1) has expectation zero for deterministic $t$ and vanishes for the integer-valued random time $\tau$. Analogously, the linear interpolations for bias and variance in weak norm are defined, so that the continuously interpolated residual has expectation
$$E_\mu[R_t^2] = B_{t,\lambda}^2(\mu) + (D - t)\delta^2.$$
Integrating the last interpolation error term into the definition, we define the oracle-proxy index $t^* \in [m_0, D]$ as the first $t$ for which this expected residual falls below $\kappa$. Then by continuity
$$B_{t^*,\lambda}^2(\mu) + D\delta^2 - V_{t^*,\lambda} = \kappa \quad \text{whenever } t^* > m_0, \qquad (3.4)$$
while for $t^* = m_0$ we still have $B_{t^*,\lambda}^2(\mu) + D\delta^2 - V_{t^*,\lambda} \le \kappa$. Let us finally define the weakly and strongly balanced oracles $t_w$ and $t_s$ in a continuous manner:
$$t_w = \inf\{ t \in [0, D] : B_{t,\lambda}^2(\mu) \le V_{t,\lambda} \}, \qquad t_s = \inf\{ t \in [0, D] : B_t^2(\mu) \le V_t \}.$$
While the balanced oracles are the natural oracle quantities we try to mimic by early stopping, they should be compared with the classical oracles. Since $t \mapsto B_t^2(\mu)$ is decreasing and $t \mapsto V_t$ is increasing, we derive
$$B_{t_s}^2(\mu) + V_{t_s} \le 2 V_{t_s} \le 2 \inf_{t \in [0, D]} \big( B_t^2(\mu) + V_t \big). \qquad (3.5)$$
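The stopping rule itself is a few lines of code (our sketch; $\kappa$ and $m_0$ as in the text, with the data given in SVD coordinates):

```python
import numpy as np

def residual_stop(Y, kappa, m0=0):
    """Early stopping: tau = min{ m >= m0 : R_m^2 <= kappa }.  Since
    R_D^2 = 0 <= kappa, the minimum is always attained."""
    R2 = float(np.sum(Y ** 2))
    for m in range(len(Y) + 1):
        if m >= m0 and R2 <= kappa:
            return m
        if m < len(Y):
            R2 -= float(Y[m] ** 2)   # sequential O(1) residual update
    return len(Y)
```

Only the coefficients up to the stopping index and the full norm $\|Y\|^2$ are ever touched, matching the sequential SVD computation.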

3.1 Upper bounds in weak norm
The following is an analogue of Proposition 2.1 in [2], but includes a discretisation error for the discrete time stopping rule τ .
3.1 Proposition. The balanced oracle inequality in weak norm holds with the discretisation error $\Delta_\tau(\mu)$.
Proof. The main argument is completely deterministic. For $\tau > t^* \ge m_0$ we obtain the corresponding bound with weights $\gamma_i = 1$ for $i > t^* + 1$ and an interpolation weight $\gamma_i \in (0, 1]$ at the boundary index. The maximal inequality in Corollary 1.3 of [18] then controls the stochastic term by a quantity smaller than $\Delta_\tau(\mu)^2 - \tfrac12 \delta^2$. Bounding the main term via Jensen's inequality and using $\operatorname{Var}(\varepsilon_i^2) = 2$, the assertion follows.
Remark that the proof relies only on the moments of $(\varepsilon_i)$ up to fourth order and on a maximal deviation inequality, so that an extension to sub-Gaussian distributions is straightforward. More heavy-tailed distributions can be treated at the cost of a looser bound on $\Delta_\tau(\mu)$.
So far, the choice of $\kappa$ has not been addressed. The identity (3.4) shows that the choice $\kappa = D\delta^2$ balances weak squared bias and variance exactly, so that $t^* = t_w$. In practice, however, we might have to estimate the noise level $\delta^2$, or we may prefer a larger threshold $\kappa$ to reduce numerical complexity. Therefore, precise bounds for general $\kappa$, relating the oracle-proxy and the weakly balanced errors in weak norm, are useful.
Remark that the weak variance control of Lemma 3.2 applies directly; we infer further $t_w \le t_s$, so that the comparison (3.7) always holds. As a consequence of the preceding two results, we obtain directly a weakly balanced oracle inequality with error terms of order $\sqrt D \delta^2$ and a numerical constant $C > 0$, provided $|\kappa - D\delta^2|$ is at most of that order.
In weak norm, we have thus obtained a completely general oracle inequality for our early stopping rule. In view of the lower bounds, the "residual term" of order $\sqrt D \delta^2$, which is much larger than the usual parametric order $\delta^2$, is unavoidable. This will be developed further in the strong norm error analysis.

3.2 Upper bounds in strong norm
In Appendix 5.5 we derive exponential bounds for $P(R_m^2 \le \kappa)$, $m < t^*$, in terms of the weak bias, and deduce by partial summation the weak bias deviation inequality of Proposition 3.4, which is the probabilistic basis for the main bias oracle inequality.
3.5 Proposition. We have the balanced oracle inequality for the strong bias.
Proof. On the event $\{\tau \ge t_s\}$ we have $B_\tau^2(\mu) \le B_{t_s}^2(\mu)$. On $\{\tau < t_s\}$ we use the fact that only coefficients up to index $t_s + 1$ enter the bias differences. From the weak bias control given by Proposition 3.4 it follows that the corresponding terms are bounded, where in the last step we used $\kappa \ge B_{t^*,\lambda}^2(\mu) + D\delta^2 - V_{t^*,\lambda}$ (see (3.4) and afterwards) and $V_{t^*,\lambda} = t^*\delta^2$. By (3.7) we see $t^* \le t_s + (D - \kappa\delta^{-2})_+$ and the result follows.
To assess the size of the bias bound, let us assume the polynomial decay (PSD($p, C_A$)). Then a Riemann sum approximation applies for any $t$, and consequently the bias bound is upper bounded by the balanced strong oracle risk.
Let us see by a counterexample that $(B_{t^*}^2(\mu) - B_{t_s}^2(\mu))_+$ can be of the same order as the strongly balanced risk itself, meaning that the bound of Proposition 3.5 is not too pessimistic in general. Suppose $\kappa = D\delta^2$ (so that $t^* = t_w$), $\mu_D \ne 0$, and $\delta, D$ such that $t_s = D - 3/4$. This gives $\mu_D^2/4 = B_{t_s}^2(\mu) = V_{t_s}$. In weak norm the corresponding relation for $B_{t_s,\lambda}^2(\mu)$ holds, and we must indeed pay a positive factor for using the weak oracle in strong norm. The bound can be met, for instance, for $\lambda_i = i^{-p}$ with $p > 3/2$ and $D$ sufficiently large.
For the stochastic error, we use in Appendix 5.6 exponential inequalities for $P(R_{m-1}^2 > \kappa)$, $m > t^*$, to obtain the following bound.
3.6 Proposition. We have the oracle-proxy inequality for the strong norm stochastic error.

If the polynomial decay condition (PSD($p, C_A$)) holds, the bound holds with a constant $C_p$ depending only on $p$.
3.7 Corollary. We have the balanced oracle inequality for the stochastic error.
Proof. By the monotonicity of $S_t$ and $V_t$ in $t$, and in view of Proposition 3.6, it suffices to prove the corresponding bound at the weakly balanced oracle. We apply Lemma 3.2 and note $V_{t_w} \le V_{t_s}$ by (3.7) to conclude.
Everything is prepared to prove our main strong norm result.
3.8 Theorem. The following balanced oracle inequality holds in strong norm. If in addition the polynomial decay condition (PSD($p, C_A$)) is satisfied, then there is a constant $C > 0$, depending only on $p, C_A, C_\kappa$, such that the corresponding risk bound holds. Intuitively, stopping about $\sqrt D$ steps later does not affect the rate under (PSD($p, C_A$)), but does affect it under exponential singular value decay.
(b) Compared with the main Theorem 3.5 in [2], this is a proper oracle inequality, since for the truncated SVD method $t_w \le t_s$ always holds. Note also that the more direct proof here gives much simpler and tighter bounds.
Proof. By (3.2) we obtain the risk decomposition. Combining Proposition 3.5 and Corollary 3.7, and using $t^* \le t_s + (D - \kappa\delta^{-2})_+$ from (3.7), the first inequality follows. Under (PSD($p, C_A$)) we use (3.9), with a factor depending on $p, C_A, C_\kappa$, together with the comparability of the $\lambda_i^{-2}$ under index shifts of order $\sqrt D$.
Let us now derive from Theorem 3.8 an asymptotic minimax upper bound over the Sobolev-type ellipsoids $H^\beta(R, D)$. For $m_0 = \sqrt D + 1$ the bound (3.10) gives the following adaptive upper bound: there is a constant $C > 0$, depending only on $p, C_A$ and $C_\kappa$, such that the minimax rate is attained for all $(\beta, R)$ whose rate-optimal truncation index is at least of order $\sqrt D$. In summary, together with the matching lower bound of Corollary 2.6, this shows that the stopping rule $\tau$ is sequentially minimax adaptive.
4. An adaptive two-step procedure

4.1 Construction and results
The lower bounds show that, in general, there is no hope for an early stopping rule attaining the order of the (unconstrained) oracle risk if the strongly balanced oracle $t_s$ is of smaller order than $\sqrt D$. We can therefore always start the stopping rule $\tau$ at some $m_0 \gtrsim \sqrt D$. If, however, immediate stopping $\tau = m_0$ occurs, we might have stopped too late in the sense that $t_s \ll m_0$. To avoid this overfitting, we propose to run a second model selection step on $\{\widehat\mu^{(0)}, \ldots, \widehat\mu^{(m_0)}\}$ in the event $\tau = m_0$.
Below, we formalise this procedure and prove that the combined model selection indeed achieves adaptivity, that is, its risk is controlled by an oracle inequality. While this violates the strict sequential stopping prescription, we still gain substantially in terms of numerical complexity. At the heart of this twofold model selection procedure is a simple observation of independence.
4.1 Lemma. The stopping rule $\tau$ is independent of the estimators $\widehat\mu^{(0)}, \ldots, \widehat\mu^{(m_0)}$.
Proof. By construction, $\tau$ is measurable with respect to the $\sigma$-algebra $\sigma(R_{m_0}^2, \ldots, R_D^2)$, and $\widehat\mu^{(m)}$ is $\sigma(Y_1, \ldots, Y_m)$-measurable. Since $R_m^2 = \sum_{i > m} Y_i^2$ for $m \ge m_0$ only involves $(Y_{m_0+1}, \ldots, Y_D)$, the claim follows from the independence of $(Y_1, \ldots, Y_{m_0})$ and $(Y_{m_0+1}, \ldots, Y_D)$.
For the second step, we suppose that $\widehat m \in \{0, \ldots, m_0\}$ is obtained from any model selection procedure among $\{\widehat\mu^{(0)}, \ldots, \widehat\mu^{(m_0)}\}$ that satisfies, with a constant $C_2 \ge 1$ and for any signal $\mu$, an oracle inequality over $m \le m_0$. Such an oracle inequality holds for standard procedures, for instance the AIC criterion; we refer to Section 2.3 in Cavalier and Golubev [7] for the corresponding result and further discussion. If we are interested in a weak norm oracle inequality, the AIC criterion takes the weak empirical risk and reduces to the minimisation of $-\sum_{i=1}^m Y_i^2 + 2m\delta^2$, which is classical. Based on the lemma and the tools developed in the previous section, we prove in Appendix 5.7 the following oracle inequality in an asymptotic setting.
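The two-step selection can be sketched as follows (our code, not the paper's; the AIC variant shown is the weak-norm one mentioned above):

```python
import numpy as np

def two_step_select(Y, delta, m0, kappa):
    """Hybrid rule: run the residual-based stopping rule started at m0;
    on immediate stopping (tau == m0), rerun a weak-norm AIC selection
    over the already computed truncation levels m <= m0."""
    R2 = np.sum(Y ** 2) - np.cumsum(np.concatenate([[0.0], Y ** 2]))
    tau = next(m for m in range(m0, len(Y) + 1) if R2[m] <= kappa)
    if tau > m0:
        return tau
    # second step: minimise the weak-norm AIC criterion
    # -sum_{i<=m} Y_i^2 + 2 m delta^2 over m = 0, ..., m0
    crit = [-np.sum(Y[:m] ** 2) + 2 * m * delta ** 2 for m in range(m0 + 1)]
    return int(np.argmin(crit))
```

By Lemma 4.1 the first-step stopping decision is independent of the estimators entering the second step, which is what makes the combined risk analysis tractable.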
In particular, $\widehat\mu^{(\rho)}$ is minimax adaptive over all Sobolev-type balls $H^\beta(R, D)$ at a usually much reduced computational complexity compared with standard model selection procedures, which require all $\widehat\mu^{(m)}$, $m = 0, \ldots, D$.

4.2 Numerical illustration
Let us exemplify the procedure by some Monte Carlo results. As a test bed we take the moderately ill-posed case $\lambda_i = i^{-1/2}$ with noise level $\delta = 0.01$ and dimension $D = 10\,000$. We consider early stopping at $\tau$ with $\kappa = D\delta^2 = 1$.
In Figure 1 (left), we see the SVD representation of three signals: a very smooth signal $\mu(1)$, a relatively smooth signal $\mu(2)$ and a rough signal $\mu(3)$, the attributes coming from the interpretation via the decay of Fourier coefficients. The corresponding weakly balanced oracle indices $t_w$ are $(34, 316, 1356)$; the classical oracle indices in strong norm are $(43, 504, 1331)$. Figure 1 (right) shows box plots of the relative efficiency of early stopping in 1000 Monte Carlo replications, defined as $\min_m E[\|\widehat\mu^{(m)} - \mu\|^2]^{1/2} / \|\widehat\mu^{(\tau)} - \mu\|$, both for strong and weak norm. Ideally, the relative efficiency should concentrate around one. This is well achieved for the smooth and rough signals, even better than for the corresponding Landweber results in [2]. The super-smooth case with its very small oracle risk suffers from the variability within the residual and attains on average an efficiency of about 0.5, meaning that its root mean squared error is about twice as large as the oracle error. Let us mention that in unreported simulations with higher ill-posedness the relative efficiency is similarly good or even better.
We are thus led to consider the two-step procedure. According to Proposition 4.2 we have to choose an initial index somewhat larger than $\sqrt D$; the factor in the choice there is very conservative, due to non-tight concentration bounds. For the implementation we choose $m_0$ such that for the zero signal $\mu = 0$ the probability of $\{\tau > m_0\} = \{R_{m_0}^2 > \kappa\}$ is about $0.01$ under a normal approximation, that is, $m_0 = \lfloor q_{0.99}\sqrt{2D} \rfloor + 1 = 329$ with the 99%-quantile $q_{0.99}$ of $N(0, 1)$. In Figure 2 (left) we see that with this choice, for the super-smooth signal 6 out of 1000 MC realisations lead to $\tau > m_0$; for the others we apply the second model selection step. The truncation index for the smooth signal varies around $m_0$, and the second step is applied in about 50% of the realisations. In the rough case, $\tau > m_0$ was always satisfied and no second model selection step was applied.
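The normal-approximation calibration of $m_0$ can be reproduced in a few lines, using only the standard library (the rounding convention is ours):

```python
import math
from statistics import NormalDist

# Under the zero signal, delta^{-2} R_{m0}^2 follows a chi^2(D - m0) law,
# which for large D is approximately normal with standard deviation ~ sqrt(2D).
# Requiring P(R_{m0}^2 > kappa) ~ 0.01 for kappa = D * delta^2 then gives:
D = 10_000
q99 = NormalDist().inv_cdf(0.99)              # 99%-quantile of N(0, 1)
m0 = math.floor(q99 * math.sqrt(2 * D)) + 1   # = 329 for D = 10 000
```

This matches the value $m_0 = 329$ used in the simulation study.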
As model selection procedure we apply the AIC criterion, based on the weak and strong empirical norm for the weak and strong norm criterion, respectively. The results are shown in Figure 2 (right). We see that the efficiency for the super-smooth signal improves significantly (with the 6 outliers not being affected). The variability is still considerably higher than for the other two signals; this phenomenon is well known for unbiased risk estimation. Especially for more strongly ill-posed problems, one should penalise more strongly, see the comparison with the risk hull approach in Cavalier and Golubev [7] and the numerical findings in Lucka et al. [14]. Here, let us rather emphasise that a pure AIC minimisation for the super-smooth signal gives exactly the same result, apart from the 6 outliers, but requires calculating the AIC criterion for $D = 10\,000$ indices in each of the 1000 MC iterations. The two-step procedure, even for known SVD, is about 30 times faster.

5.2 Proof of Lemma 2.3
Proof.
By Lemma 1 in Laurent and Massart [12], we have a lower deviation bound for weighted chi-squared sums with nonnegative weights $a_i$. Picking $a = \delta^2(\lambda_1^{-2}, \ldots, \lambda_m^{-2})$ and $x = \log(5/4)$, so that $2\sqrt x \le 0.95$, it follows with probability larger than $1 - e^{-x} = 0.2$ that $S_m \ge V_m - 0.95 V_m = 0.05 V_m$, where we used $\|a\|_2 \le \sum_{i=1}^m a_i = E[S_m] = V_m$ (observe that we could tighten the latter inequality significantly under additional assumptions on the singular value decay). We now have
$$R(\mu, \tau)^2 \ge E\big[S_m \mathbf 1(\tau \ge m)\big] \ge 0.05 V_m \big( P(\tau \ge m) - 0.8 \big).$$
We deduce from this that $V_m \ge 200 R(\mu, \tau)^2$ implies $P(\tau \ge m) \le 0.8 + R(\mu, \tau)^2/(0.05 V_m) \le 0.9$.
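The concentration inequality invoked here (Lemma 1 of Laurent and Massart [12], stated for independent $Z_i \sim N(0,1)$ and nonnegative weights $a_i$) reads, for every $x > 0$:

```latex
P\Big(\sum_i a_i Z_i^2 \ge \sum_i a_i + 2\|a\|_2\sqrt{x} + 2\|a\|_\infty x\Big) \le e^{-x},
\qquad
P\Big(\sum_i a_i Z_i^2 \le \sum_i a_i - 2\|a\|_2\sqrt{x}\Big) \le e^{-x}.
```

The proof above uses the lower-deviation bound with $x = \log(5/4)$, for which $2\sqrt x \approx 0.945 \le 0.95$.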
If $f_p$ denotes the $\chi^2(p)$ density, then for any $t > 0$ we have $f_p(x - t) > f_p(x)$ if and only if $x \ge x_t = t/(1 - e^{-t/(p-2)})$. Thus we obtain the corresponding bound, knowing that $x = p - 2$ is the mode of $f_p$. Stirling's formula guarantees $\Gamma(x) \ge \sqrt{2\pi/x}\,(x/e)^x$ for all $x > 0$, so that the last expression is always bounded by $t(\pi p)^{-1/2} e$. This yields the claim after taking expectation with respect to $Z_1 \sim N(0, 1)$; using the triangle inequality and $E[|Z_1|] = \sqrt{2/\pi}$, the upper bound follows.
For the numerator we use the corresponding lower bound, and for the denominators the bound involving $16\delta^2$. In the sequel, '$\lesssim$', '$\gtrsim$' denote inequalities up to a factor depending only on $p$. A Riemann sum approximation shows the first estimate for any $R > 0$; on the other hand, a matching reverse bound holds, implying the result with a suitable constant $C_p$.

5.7 Proof of Proposition 4.2
Proof. In this proof, '$\lesssim$' denotes an inequality holding up to factors depending only on $p, C_A, C_\kappa$ and $C_2$; similarly, '$\sim$' denotes a two-sided inequality up to such factors. In the case $t_s > m_0$ we use the independence of $\tau$ from $\widehat\mu^{(0)}, \ldots, \widehat\mu^{(m_0)}$ established in Lemma 4.1.

2.5 Corollary. Assume (PSD($p, C_A$)) and let $\tau$ be any $\mathcal G$-stopping time. If there exist indices and signals as in Proposition 2.4, then the lower bound holds with constants $c_1, c_3 > 0$ and $c_2 \in (0, 1]$ depending only on $C_\mu, C_A, \alpha, p$ (see Appendix 5.4).

3.9 Remarks. (a) The impact of the polynomial decay condition (PSD($p, C_A$)) is quite transparent here. If the eigenvalues decay exponentially, $\lambda_i = e^{-\alpha i}$ say, then a factor $e^{2\alpha C_\kappa \sqrt D}$ appears in the balanced oracle inequality, which we then also lose compared with the optimal minimax rate in strong norm. Polynomial decay ensures that $\lambda_m^{-2}$ remains comparable under index shifts of order $\sqrt D$, which is used to deduce the second bound. Again, the proof relies only on concentration bounds for the residuals and extends easily to sub-Gaussian errors.

Fig 1. Left: SVD representation of a super-smooth (blue), a smooth (red) and a rough (olive) signal. Right: Relative efficiency for early stopping with $m_0 = 0$.