Distributional Results for Thresholding Estimators in High-Dimensional Gaussian Regression Models

We study the distribution of hard-, soft-, and adaptive soft-thresholding estimators within a linear regression model where the number of parameters k can depend on sample size n and may diverge with n. In addition to the case of known error-variance, we define and study versions of the estimators when the error-variance is unknown. We derive the finite-sample distribution of each estimator and study its behavior in the large-sample limit, also investigating the effects of having to estimate the variance when the degrees of freedom n-k does not tend to infinity or tends to infinity very slowly. Our analysis encompasses both the case where the estimators are tuned to perform consistent model selection and the case where the estimators are tuned to perform conservative model selection. Furthermore, we discuss consistency, uniform consistency and derive the uniform convergence rate under either type of tuning.


Introduction
We study the distribution of thresholding estimators such as hard-thresholding, soft-thresholding, and adaptive soft-thresholding in a linear regression model when the number of regressors can be large. These estimators can be viewed as penalized least-squares estimators, with soft-thresholding coinciding with the Lasso (introduced by Alliney and Ruzinsky (1994), Frank and Friedman (1994), and Tibshirani (1996)) and with adaptive soft-thresholding coinciding with the adaptive Lasso (introduced by Zou (2006)) in the case of an orthogonal design matrix. [Thresholding estimators have of course been discussed earlier in the context of model selection (see Bauer, Pötscher and Hackl (1988)) and in the context of wavelets (see, e.g., Donoho, Johnstone, Kerkyacharian, Picard (1995)).] Contributions concerning distributional properties of thresholding and penalized least-squares estimators are as follows: Knight and Fu (2000) study the asymptotic distribution of the Lasso estimator when it is tuned to act as a conservative variable selection procedure, whereas Zou (2006) studies the asymptotic distribution of the Lasso and the adaptive Lasso estimators when they are tuned to act as consistent variable selection procedures. Fan and Li (2001) and Fan and Peng (2004) study the asymptotic distribution of the so-called smoothly clipped absolute deviation estimator when it is tuned to act as a consistent variable selection procedure. In the wake of Fan and Li (2001) and Fan and Peng (2004) a large number of papers have been published that derive the asymptotic distribution of various penalized maximum likelihood estimators under consistent tuning; see the introduction in Pötscher and Schneider (2009) for a partial list. Except for Knight and Fu (2000), all these papers derive the asymptotic distribution in a fixed-parameter framework.
As pointed out in Leeb and Pötscher (2005), such a fixed-parameter framework is often highly misleading in the context of variable selection procedures and penalized maximum likelihood estimators. For that reason, Pötscher and Leeb (2009) and Pötscher and Schneider (2009) have conducted a detailed study of the finite-sample as well as large-sample distribution of various penalized least-squares estimators, adopting a moving-parameter framework for the asymptotic results. [Related results for so-called post-model-selection estimators can be found in Pötscher (2003, 2005) and for model averaging estimators in Pötscher (2006).] The papers by Pötscher and Leeb (2009) and Pötscher and Schneider (2009) are set in the framework of an orthogonal linear regression model with a fixed number of parameters and with the error-variance being known.
In the present paper we build on the just mentioned papers by Pötscher and Leeb (2009) and Pötscher and Schneider (2009). In contrast to these papers, we allow for arbitrary design and do not assume the number of regressors $k$ to be fixed, but let it depend on sample size, thus allowing for high-dimensional models. We also consider the case where the error-variance is unknown, which in case of a high-dimensional model creates non-trivial complications as then estimators for the error-variance will typically not be consistent. While the asymptotic distributional results in the known-variance case do not differ in substance from the results in Pötscher and Leeb (2009) and Pötscher and Schneider (2009), not unexpectedly we observe different asymptotic behavior in the unknown-variance case if the number of degrees of freedom $n-k$ is constant, the difference resulting from the non-vanishing variability of the error-variance estimator in the limit. Less expected is the result that, under consistent tuning, for the variable selection probabilities (implied by all the estimators considered) as well as for the distribution of the hard-thresholding estimator, estimation of the error-variance still has an effect asymptotically even if $n-k$ diverges, but does so only slowly.
The paper is organized as follows. We introduce the model and define the estimators in Section 2. Section 3 treats the variable selection probabilities implied by the estimators. Consistency, uniform consistency, and minimax rates are discussed in Section 4. We derive the finite-sample distribution of each estimator in Section 5 and study the large-sample behavior of these in Section 6.

The Model and the Estimators
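Consider the linear regression model
$$Y = X\theta + u$$
with $Y$ an $n \times 1$ vector, $X$ a nonstochastic $n \times k$ matrix of rank $k \geq 1$, and $u \sim N(0, \sigma^2 I_n)$, $0 < \sigma < \infty$. We allow $k$, the number of columns of $X$, as well as the entries of $Y$, $X$, and $u$ to depend on sample size $n$ (in fact, also the probability spaces supporting $Y$ and $u$ may depend on $n$), although we shall almost always suppress this dependence on $n$ in the notation. Note that this framework allows for high-dimensional regression models, where the number of regressors $k$ is large compared to sample size $n$, as well as for the more classical situation where $k$ is much smaller than $n$. Furthermore, let $\xi_{i,n}$ denote the nonnegative square root of $((X'X/n)^{-1})_{ii}$, the $i$-th diagonal element of $(X'X/n)^{-1}$. Now let
$$\hat\theta_{LS} = (X'X)^{-1}X'Y, \qquad \hat\sigma^2 = (n-k)^{-1}(Y - X\hat\theta_{LS})'(Y - X\hat\theta_{LS})$$
denote the least-squares estimator for $\theta$ and the associated estimator for $\sigma^2$, the latter being defined only if $n > k$.

The hard-thresholding estimator $\tilde\theta_H$ is defined via its components as follows:
$$\tilde\theta_{H,i} = \tilde\theta_{H,i}(\eta_{i,n}) = \hat\theta_{LS,i}\, 1\big(|\hat\theta_{LS,i}| > \hat\sigma \xi_{i,n}\eta_{i,n}\big),$$
where the tuning parameters $\eta_{i,n}$ are positive real numbers and $\hat\theta_{LS,i}$ denotes the $i$-th component of the least-squares estimator. We shall also need to consider its infeasible counterpart $\hat\theta_H$ given by
$$\hat\theta_{H,i} = \hat\theta_{H,i}(\eta_{i,n}) = \hat\theta_{LS,i}\, 1\big(|\hat\theta_{LS,i}| > \sigma \xi_{i,n}\eta_{i,n}\big).$$
The soft-thresholding estimator $\tilde\theta_S$ and its infeasible counterpart $\hat\theta_S$ are given by
$$\tilde\theta_{S,i} = \mathrm{sign}(\hat\theta_{LS,i})\big(|\hat\theta_{LS,i}| - \hat\sigma\xi_{i,n}\eta_{i,n}\big)_+, \qquad \hat\theta_{S,i} = \mathrm{sign}(\hat\theta_{LS,i})\big(|\hat\theta_{LS,i}| - \sigma\xi_{i,n}\eta_{i,n}\big)_+,$$
and the adaptive soft-thresholding estimator $\tilde\theta_{AS}$ and its infeasible counterpart $\hat\theta_{AS}$ by
$$\tilde\theta_{AS,i} = \hat\theta_{LS,i}\big(1 - \hat\sigma^2\xi_{i,n}^2\eta_{i,n}^2/\hat\theta_{LS,i}^2\big)_+, \qquad \hat\theta_{AS,i} = \hat\theta_{LS,i}\big(1 - \sigma^2\xi_{i,n}^2\eta_{i,n}^2/\hat\theta_{LS,i}^2\big)_+.$$
Note that $\tilde\theta_H$, $\tilde\theta_S$, and $\tilde\theta_{AS}$ as well as their infeasible counterparts are equivariant under scaling of the columns of $(Y : X)$ by non-zero column-specific scale factors. We have chosen to let the thresholds $\hat\sigma\xi_{i,n}\eta_{i,n}$ ($\sigma\xi_{i,n}\eta_{i,n}$, respectively) depend explicitly on $\hat\sigma$ ($\sigma$, respectively) and $\xi_{i,n}$ in order to give $\eta_{i,n}$ an interpretation independent of the values of $\sigma$ and $X$. Furthermore, often $\eta_{i,n}$ will be chosen independently of $i$, i.e., $\eta_{i,n} = \eta_n$ where $\eta_n$ is a positive real number. Clearly, for the feasible versions we always need to assume $n > k$, whereas for the infeasible versions $n \geq k$ suffices.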
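We note the simple fact that
$$0 \leq \tilde\theta_{S,i} \leq \tilde\theta_{AS,i} \leq \tilde\theta_{H,i} \leq \hat\theta_{LS,i}$$
holds on the event that $\hat\theta_{LS,i} \geq 0$, with the inequalities reversed on the event that $\hat\theta_{LS,i} \leq 0$. Analogous inequalities hold for the infeasible versions of the estimators.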
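The component-wise definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the argument `scale` are hypothetical, with `scale` standing for $\hat\sigma$ (feasible version) or $\sigma$ (infeasible version), and `xi`, `eta` standing for $\xi_{i,n}$ and $\eta_{i,n}$.

```python
import math

def hard_threshold(theta_ls, scale, xi, eta):
    """Keep the least-squares component if it exceeds the threshold, else return 0."""
    return theta_ls if abs(theta_ls) > scale * xi * eta else 0.0

def soft_threshold(theta_ls, scale, xi, eta):
    """Shrink the least-squares component towards 0 by the threshold amount."""
    t = scale * xi * eta
    return math.copysign(max(abs(theta_ls) - t, 0.0), theta_ls)

def adaptive_soft_threshold(theta_ls, scale, xi, eta):
    """Multiply the least-squares component by (1 - t^2 / theta_ls^2)_+."""
    t = scale * xi * eta
    if theta_ls == 0.0:
        return 0.0  # avoid division by zero; the estimator is 0 here anyway
    return theta_ls * max(1.0 - t * t / (theta_ls * theta_ls), 0.0)

# Ordering soft <= adaptive soft <= hard <= least squares for nonnegative inputs
for y in (0.0, 0.5, 1.0, 1.5, 3.0):
    s = soft_threshold(y, 1.0, 1.0, 1.0)
    a = adaptive_soft_threshold(y, 1.0, 1.0, 1.0)
    h = hard_threshold(y, 1.0, 1.0, 1.0)
    print(y, s, a, h)
```

All three functions return exactly zero whenever the absolute value of the least-squares component does not exceed the threshold, which is why the three estimators share the same selection event.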
Remark 1 (Lasso) (i) Consider the objective function
$$\sum_{t=1}^{n}\Big(y_t - \sum_{i=1}^{k} x_{ti}\theta_i\Big)^2 + 2n\hat\sigma\sum_{i=1}^{k}\eta'_{i,n}|\theta_i|,$$
where $\eta'_{i,n}$ are positive real numbers. It is well-known that a unique minimizer $\tilde\theta_L$ of this objective function exists, the Lasso-estimator. It is easy to see that in case $X'X$ is diagonal, we have
$$\tilde\theta_{L,i} = \mathrm{sign}(\hat\theta_{LS,i})\big(|\hat\theta_{LS,i}| - \hat\sigma\eta'_{i,n}\xi_{i,n}^2\big)_+.$$
Hence, in the case of diagonal $X'X$, the Lasso $\tilde\theta_{L,i}$ reduces to a soft-thresholding estimator with an appropriate threshold; in particular, $\tilde\theta_{L,i}$ coincides with $\tilde\theta_{S,i}$ for the choice $\eta'_{i,n} = \eta_{i,n}\xi_{i,n}^{-1}$. Therefore all results derived below for soft-thresholding immediately give corresponding results for the Lasso as well as for the Dantzig-selector in the diagonal case. We shall abstain from spelling out further details.
(ii) Sometimes $\eta'_{i,n}$ in the definition of the Lasso is chosen independently of $i$; more reasonable choices seem to be (a) $\eta'_{i,n} = \eta_{i,n}\zeta_{i,n}$ (where $\zeta_{i,n}$ denotes the nonnegative square root of the $i$-th diagonal element of $(X'X/n)$), and (b) $\eta'_{i,n} = \eta_{i,n}\xi_{i,n}^{-1}$, where $\eta_{i,n}$ are positive real numbers (not depending on the design matrix and often not on $i$), as then $\eta_{i,n}$ again has an interpretation independent of the values of $\sigma$ and $X$. Note that in case (a) or (b) the solution of the optimization problem is equivariant under scaling of the columns of $X$ by non-zero column-specific scale factors.
(iii) Similar results obviously hold for the infeasible versions of the estimators.
Remark 2 (Adaptive Lasso) Consider the objective function
$$\sum_{t=1}^{n}\Big(y_t - \sum_{i=1}^{k} x_{ti}\theta_i\Big)^2 + 2n\hat\sigma^2\sum_{i=1}^{k}(\eta'_{i,n})^2\,|\theta_i|/|\hat\theta_{LS,i}|,$$
where $\eta'_{i,n}$ are positive real numbers. This is the objective function of the adaptive Lasso if $\eta'_{i,n} = \eta'_n$. Again the minimizer $\tilde\theta_{AL}$ exists and is unique (at least on the event where $\hat\theta_{LS,i} \neq 0$ for all $i$). Clearly, $\tilde\theta_{AL}$ is equivariant under scaling of the columns of $X$ by non-zero column-specific scale factors. It is easy to see that in case $X'X$ is diagonal, we have
$$\tilde\theta_{AL,i} = \hat\theta_{LS,i}\big(1 - \hat\sigma^2\xi_{i,n}^2(\eta'_{i,n})^2/\hat\theta_{LS,i}^2\big)_+.$$
Hence, in the case of diagonal $X'X$, the adaptive Lasso $\tilde\theta_{AL,i}$ reduces to the adaptive soft-thresholding estimator $\tilde\theta_{AS,i}$ (for $\eta'_{i,n} = \eta_{i,n}$). Therefore all results derived below for adaptive soft-thresholding immediately give corresponding results for the adaptive Lasso in the diagonal case. We shall again abstain from spelling out further details. Similar results obviously hold for the infeasible versions of the estimators.
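The reduction of an $\ell_1$-penalized least-squares problem to soft-thresholding in the diagonal case rests on a one-dimensional fact: the minimizer of $a(\theta - y)^2 + 2b|\theta|$ with $a, b > 0$ is the soft-thresholded value of $y$ at threshold $b/a$. The following sketch checks this numerically by brute-force grid search. The names `soft`, `objective`, and `grid_argmin` and the constants are illustrative assumptions, not part of the paper.

```python
import math

def soft(y, t):
    """Soft-threshold y at level t >= 0."""
    return math.copysign(max(abs(y) - t, 0.0), y)

def objective(theta, y, a, b):
    """One-dimensional penalized least-squares criterion a*(theta-y)^2 + 2*b*|theta|."""
    return a * (theta - y) ** 2 + 2.0 * b * abs(theta)

def grid_argmin(y, a, b, lo=-5.0, hi=5.0, steps=100001):
    """Brute-force minimizer over an equally spaced grid on [lo, hi]."""
    best_theta, best_val = lo, objective(lo, y, a, b)
    for j in range(steps):
        theta = lo + (hi - lo) * j / (steps - 1)
        val = objective(theta, y, a, b)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta

# Compare the closed form with the grid search for a few inputs
a, b = 4.0, 2.0  # threshold b / a = 0.5
for y in (1.3, 0.3, -2.0):
    print(y, soft(y, b / a), grid_argmin(y, a, b))
```

In the regression setting with diagonal $X'X$, the quadratic coefficient for coordinate $i$ is $a = n/\xi_{i,n}^2$, so a penalty weight $b$ translates into the threshold $b\,\xi_{i,n}^2/n$.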
For all asymptotic considerations in this paper we shall always assume without further mentioning that $\xi_{i,n}^2/n = ((X'X)^{-1})_{ii}$ satisfies
$$\sup_n \xi_{i,n}^2/n < \infty \qquad (3)$$
for every fixed $i \geq 1$ satisfying $i \leq k(n)$ for large enough $n$. The case excluded by assumption (3) seems to be rather uninteresting as unboundedness of $\xi_{i,n}^2/n$ means that the information contained in the regressors gets weaker with increasing sample size (at least along a subsequence); in particular, this implies (coordinate-wise) inconsistency of the least-squares estimator. [In fact, if $k$ as well as the elements of $X$ do not depend on $n$, this case is actually impossible as $\xi_{i,n}^2/n$ is then necessarily monotonically nonincreasing.]

The following notation will be used in the paper: Let $\bar{\mathbb{R}}$ denote the extended real line $\mathbb{R} \cup \{-\infty, \infty\}$ endowed with the usual topology. On $\mathbb{N} \cup \{\infty\}$ we shall consider the topology it inherits from $\bar{\mathbb{R}}$. Furthermore, $\Phi$ and $\phi$ denote the cumulative distribution function (cdf) and the probability density function (pdf) of a standard normal distribution, respectively. By $T_{m,c}$ we denote the cdf of a non-central $T$-distribution with $m \in \mathbb{N}$ degrees of freedom and non-centrality parameter $c \in \mathbb{R}$. In the central case, i.e., $c = 0$, we simply write $T_m$. We use the convention $\Phi(\infty) = 1$, $\Phi(-\infty) = 0$ with a similar convention for $T_{m,c}$.

Variable Selection Probabilities
The estimators $\tilde\theta_H$, $\tilde\theta_S$, and $\tilde\theta_{AS}$ can be viewed as performing variable selection in the sense that these estimators set components of $\theta$ exactly equal to zero with positive probability. In this section we study the variable selection probability $P_{n,\theta,\sigma}(\tilde\theta_i \neq 0)$, where $\tilde\theta_i$ stands for any of the estimators $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$. Since these probabilities are the same for any of the three estimators considered we shall drop the subscripts $H$, $S$, and $AS$ in this section. We use the same convention also for the variable selection probabilities of the infeasible versions.

Known-Variance Case
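Since $P_{n,\theta,\sigma}(\hat\theta_i \neq 0) = 1 - P_{n,\theta,\sigma}(\hat\theta_i = 0)$ it suffices to study the variable deletion probability
$$P_{n,\theta,\sigma}(\hat\theta_i = 0) = \Phi\big(n^{1/2}(-\theta_i/(\sigma\xi_{i,n}) + \eta_{i,n})\big) - \Phi\big(n^{1/2}(-\theta_i/(\sigma\xi_{i,n}) - \eta_{i,n})\big). \qquad (4)$$
As can be seen from the above formula, $P_{n,\theta,\sigma}(\hat\theta_i = 0)$ depends on $\theta$ only via $\theta_i$. We first study the variable selection/deletion probabilities under a "fixed-parameter" asymptotic framework.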
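Proposition 3 For every $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have:
(a) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\hat\theta_i = 0) \to 0$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i \neq 0$ ($\theta_i$ not depending on $n$) is $\xi_{i,n}\eta_{i,n} \to 0$.
(b) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\hat\theta_i = 0) \to 1$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i = 0$ is $n^{1/2}\eta_{i,n} \to \infty$.
(c) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\hat\theta_i = 0) \to c_i < 1$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i = 0$ is $n^{1/2}\eta_{i,n} \to e_i$, $0 \leq e_i < \infty$. The constant $c_i$ is then given by $c_i = \Phi(e_i) - \Phi(-e_i)$.
Proof. We first prove Part (a). Rewrite $P_{n,\theta,\sigma}(\hat\theta_i = 0)$ as
$$\Phi\big(n^{1/2}\xi_{i,n}^{-1}(-\theta_i/\sigma + \xi_{i,n}\eta_{i,n})\big) - \Phi\big(n^{1/2}\xi_{i,n}^{-1}(-\theta_i/\sigma - \xi_{i,n}\eta_{i,n})\big).$$
Note that $n^{1/2}\xi_{i,n}^{-1}$ is bounded away from zero (but may be unbounded) by our maintained assumption (3). Now it is easy to see that this converges to zero for every $\theta_i \neq 0$ if and only if $\xi_{i,n}\eta_{i,n} \to 0$ as $n \to \infty$. Parts (b) and (c) are obvious since $P_{n,\theta,\sigma}(\hat\theta_i = 0) = \Phi(n^{1/2}\eta_{i,n}) - \Phi(-n^{1/2}\eta_{i,n})$ whenever $\theta_i = 0$.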
Part (a) of the above proposition gives a necessary and sufficient condition for the procedure to correctly detect nonzero coefficients with probability converging to 1. Part (b) gives a necessary and sufficient condition for correctly detecting zero coefficients with probability converging to 1.
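To make the known-variance deletion probability concrete, the following sketch evaluates the difference of two normal cdf values from the proof of Proposition 3 and illustrates parts (a) and (b) for one tuning sequence. The function names and the choice $\eta_{i,n} = n^{-1/4}$ with $\xi_{i,n} = 1$ are assumptions made purely for illustration.

```python
import math

def Phi(x):
    """Standard normal cdf, computed via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def deletion_prob_known(theta_i, sigma, xi, eta, n):
    """P(infeasible estimator = 0) when the error variance is known."""
    c = math.sqrt(n) * theta_i / (sigma * xi)  # scaled true coefficient
    e = math.sqrt(n) * eta                     # scaled tuning parameter
    return Phi(-c + e) - Phi(-c - e)

# eta_n = n^(-1/4): xi*eta -> 0 and sqrt(n)*eta -> infinity
for n in (10**2, 10**4, 10**6):
    eta = n ** -0.25
    p_zero = deletion_prob_known(0.0, 1.0, 1.0, eta, n)     # tends to 1
    p_nonzero = deletion_prob_known(0.5, 1.0, 1.0, eta, n)  # tends to 0
    print(n, p_zero, p_nonzero)
```

With a fixed nonzero coefficient the deletion probability vanishes, while for a zero coefficient it approaches one, matching the two conditions of the proposition for this particular sequence.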
Remark 4 If $\xi_{i,n}/n^{1/2}$ does not converge to zero, the conditions on $\eta_{i,n}$ in Parts (a) and (b) are incompatible; also the conditions in Parts (a) and (c) are then incompatible (except when $e_i = 0$). However, the case where $\xi_{i,n}/n^{1/2}$ does not converge to zero is of little interest as the least-squares estimator is then not consistent.
Remark 5 (Speed of convergence in Proposition 3) (i) The speed of convergence in (a) is $\xi_{i,n}\eta_{i,n}$ in case $n^{1/2}\xi_{i,n}^{-1}$ is bounded (an uninteresting case as noted above); if $n^{1/2}\xi_{i,n}^{-1} \to \infty$, the speed of convergence in (a) is not slower than $\exp(-cn\xi_{i,n}^{-2})/(n^{1/2}\xi_{i,n}^{-1})$ for some suitable $c > 0$ depending on $\theta_i/\sigma$.
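(ii) The speed of convergence in (b) is $\exp(-0.5n\eta_{i,n}^2)/(n^{1/2}\eta_{i,n})$. In (c) the speed of convergence is given by the rate at which $n^{1/2}\eta_{i,n}$ approaches $e_i$.

Remark 6 For $\theta \in \mathbb{R}^{k(n)}$ let $A_n(\theta) = \{i : 1 \leq i \leq k(n),\ \theta_i \neq 0\}$. Then (i) for every $i \in A_n(\theta)$
$$P_{n,\theta,\sigma}(\hat\theta_i = 0) \leq P_{n,\theta,\sigma}\Big(\bigcup_{j \in A_n(\theta)}\{\hat\theta_j = 0\}\Big) \leq \sum_{j \in A_n(\theta)} P_{n,\theta,\sigma}(\hat\theta_j = 0).$$
Suppose now that the entries of $\theta$ do not change with $n$ (although the dimension of $\theta$ may depend on $n$).$^1$ Then, given that $\mathrm{card}(A_n(\theta))$ is bounded (this being in particular the case if $k(n)$ is bounded), the probability of incorrect non-detection of at least one nonzero coefficient converges to 0 if and only if $\xi_{i,n}\eta_{i,n} \to 0$ as $n \to \infty$ for every $i \in A_n(\theta)$. [If $\mathrm{card}(A_n(\theta))$ is unbounded then this probability converges to 0, e.g., if $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\xi_{i,n}^{-1} \to \infty$ as $n \to \infty$ for every $i \in A_n(\theta)$ and $\inf_{i \in A_n(\theta)}|\theta_i| > 0$ and $\sum_{i \in A_n(\theta)} \exp(-cn\xi_{i,n}^{-2})/(n^{1/2}\xi_{i,n}^{-1}) \to 0$ as $n \to \infty$ for a suitable $c$ that is determined by $\inf_{i \in A_n(\theta)}|\theta_i|/\sigma$.]
(ii) For every $i \notin A_n(\theta)$ we have
$$P_{n,\theta,\sigma}(\hat\theta_i \neq 0) \leq P_{n,\theta,\sigma}\Big(\bigcup_{j \notin A_n(\theta)}\{\hat\theta_j \neq 0\}\Big) \leq \sum_{j \notin A_n(\theta)} P_{n,\theta,\sigma}(\hat\theta_j \neq 0).$$
Suppose again that the entries of $\theta$ do not change with $n$. Then, given that $\mathrm{card}(A_n^c(\theta))$ is bounded (this being in particular the case if $k(n)$ is bounded), the probability of incorrectly classifying at least one zero parameter as a non-zero one converges to 0 as $n \to \infty$ if and only if $n^{1/2}\eta_{i,n} \to \infty$ for every $i \notin A_n(\theta)$. [If $\mathrm{card}(A_n^c(\theta))$ is unbounded then this probability converges to 0, e.g., if $\sum_{i \notin A_n(\theta)} \exp(-0.5n\eta_{i,n}^2)/(n^{1/2}\eta_{i,n}) \to 0$ as $n \to \infty$.]
(iii) In case $X'X$ is diagonal, the relevant probabilities of the unions appearing in (i) and (ii) can be directly expressed in terms of products of $P_{n,\theta,\sigma}(\hat\theta_i = 0)$ or $1 - P_{n,\theta,\sigma}(\hat\theta_i = 0)$, and Proposition 3 can then be applied.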
$^1$ More precisely, this means that $\theta$ is made up of the initial $k(n)$ elements of a fixed element of $\mathbb{R}^{\infty}$.
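Since the fixed-parameter asymptotic framework often gives a misleading impression of the actual behavior of a variable selection procedure (cf. Leeb and Pötscher (2005), Pötscher and Leeb (2009)) we turn to a "moving-parameter" framework next, i.e., we allow the elements of $\theta$ as well as $\sigma$ to depend on sample size $n$. In the proposition to follow (and all subsequent large-sample results) we shall concentrate only on the case where $\xi_{i,n}\eta_{i,n} \to 0$ as $n \to \infty$, since otherwise the estimators $\hat\theta_i$ are not even consistent for $\theta_i$ as a consequence of Proposition 3, cf. also Theorem 15 below. Given the condition $\xi_{i,n}\eta_{i,n} \to 0$, we shall then distinguish between the case $n^{1/2}\eta_{i,n} \to e_i$, $0 \leq e_i < \infty$, and the case $n^{1/2}\eta_{i,n} \to \infty$, which in light of Proposition 3 we shall call the case of "conservative tuning" and the case of "consistent tuning", respectively.

Proposition 7 Suppose that for given $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \leq e_i \leq \infty$.
(a) Assume $e_i < \infty$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n}, \ldots, \theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0, \infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$. Then
$$\lim_{n \to \infty} P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = \Phi(\nu_i + e_i) - \Phi(\nu_i - e_i).$$
(b) Assume $e_i = \infty$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n}, \ldots, \theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0, \infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \zeta_i \in \bar{\mathbb{R}}$. Then
1. $|\zeta_i| < 1$ implies $\lim_{n \to \infty} P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = 1$.
2. $|\zeta_i| > 1$ implies $\lim_{n \to \infty} P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = 0$.
3. $|\zeta_i| = 1$ and $r_{i,n} := n^{1/2}(\eta_{i,n} - \zeta_i\theta_{i,n}/(\sigma_n\xi_{i,n})) \to r_i$, for some $r_i \in \bar{\mathbb{R}}$, imply $\lim_{n \to \infty} P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = \Phi(r_i)$.
Proof. Part (a) follows immediately from (4) and the assumptions. To prove Part (b) we use (4) to write
$$P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = \Phi\big(n^{1/2}\eta_{i,n}(1 - \theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}))\big) - \Phi\big(n^{1/2}\eta_{i,n}(-1 - \theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}))\big).$$
The first and the second claim then follow immediately. For the third claim, assume first that $\zeta_i = 1$. Then
$$P_{n,\theta^{(n)},\sigma_n}(\hat\theta_i = 0) = \Phi\big(n^{1/2}(\eta_{i,n} - \theta_{i,n}/(\sigma_n\xi_{i,n}))\big) - \Phi\big(n^{1/2}\eta_{i,n}(-1 - \theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}))\big) \to \Phi(r_i),$$
since the argument of the second term diverges to $-\infty$. The case $\zeta_i = -1$ is handled analogously.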
In a fixed-parameter asymptotic analysis, which in Proposition 7 corresponds to the case $\theta_{i,n} \equiv \theta_i$ and $\sigma_n \equiv \sigma$, the limit of the probabilities $P_{n,\theta,\sigma}(\hat\theta_i = 0)$ is always 0 in case $\theta_i \neq 0$, and is 1 in case $\theta_i = 0$ and consistent tuning (it is $\Phi(e_i) - \Phi(-e_i)$ in case $\theta_i = 0$ and conservative tuning); this clearly does not properly capture the finite-sample behavior of these probabilities.
The moving-parameter asymptotic analysis underlying Proposition 7 better captures the finite-sample behavior and, e.g., allows for limits other than 0 and 1 even in the case of consistent tuning. In particular, Proposition 7 shows that the convergence of the variable selection/deletion probabilities to their limits in a fixed-parameter asymptotic framework is not uniform in $\theta_i$, and this non-uniformity is local in the sense that it occurs in an arbitrarily small neighborhood of $\theta_i = 0$ (holding the value of $\sigma > 0$ fixed). Furthermore, the above proposition entails that under consistent tuning deviations from $\theta_i = 0$ of larger order than under conservative tuning go unnoticed asymptotically with probability 1 by the variable selection procedure corresponding to $\hat\theta_i$. For more discussion in a special case (which in its essence also applies here) see Pötscher and Leeb (2009).
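The local non-uniformity under consistent tuning can be illustrated numerically. The sketch below places the true coefficient at a multiple of the threshold $\sigma\xi_{i,n}\eta_{i,n}$ and evaluates the known-variance deletion probability at a large sample size. The tuning sequence $\eta_{i,n} = n^{-1/4}$, the constants, and the function names are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def deletion_prob(theta_i, sigma, xi, eta, n):
    """Known-variance probability that the thresholding estimator equals 0."""
    c = math.sqrt(n) * theta_i / (sigma * xi)
    e = math.sqrt(n) * eta
    return Phi(-c + e) - Phi(-c - e)

n = 10**6
sigma, xi = 1.0, 1.0
eta = n ** -0.25          # consistent tuning: sqrt(n) * eta diverges
threshold = sigma * xi * eta

# Coefficient at half the threshold: deleted with probability close to 1
inside = deletion_prob(0.5 * threshold, sigma, xi, eta, n)
# Coefficient at 1.5 times the threshold: deletion probability close to 0
outside = deletion_prob(1.5 * threshold, sigma, xi, eta, n)
# Boundary case: coefficient within r / sqrt(n) of the threshold
r = 0.3
boundary = deletion_prob(threshold - sigma * xi * r / math.sqrt(n), sigma, xi, eta, n)

print(inside, outside, boundary, Phi(r))
```

Although every coefficient used here is nonzero and would be detected with probability tending to one if it were held fixed, the deletion probability along these sequences can be close to 1, close to 0, or any intermediate value, depending on where the coefficient sits relative to the threshold.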
Remark 8 The convergence conditions in Proposition 7 on the various quantities involving $\theta_{i,n}$ and $\sigma_n$ are essentially cost-free in the sense that given any sequence $(\theta_{i,n}, \sigma_n)$ we can, due to compactness of $\bar{\mathbb{R}}$, select from any subsequence $n_j$ a further subsubsequence $n_{j(l)}$ such that along this subsubsequence all relevant quantities such as $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n})$ (or $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n})$ and $r_{i,n}$) converge in $\bar{\mathbb{R}}$. Proposition 7 can then be applied to this subsubsequence, resulting in a characterization of all possible accumulation points of the variable selection/deletion probabilities.
Remark 9 (Speed of convergence in Proposition 7) (i) The speed of convergence in (a) is given by the slower of the rate at which $n^{1/2}\eta_{i,n}$ approaches $e_i$ and $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n})$ approaches $\nu_i$ provided that $|\nu_i| < \infty$; if $|\nu_i| = \infty$, the speed of convergence is not slower than $\exp(-cn\theta_{i,n}^2/(\sigma_n^2\xi_{i,n}^2))/(n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}))$ for any $c < 1/2$. [We have here made use of Lemma VII.1.2 in Feller (1957).]
(ii) The speed of convergence in (b1) is not slower than $\exp(-cn\eta_{i,n}^2)/(n^{1/2}\eta_{i,n})$ where $c$ depends on $\zeta_i$. The same is true in case (b2) provided $|\zeta_i| < \infty$; if $|\zeta_i| = \infty$, the speed of convergence is not slower than $\exp(-cn\theta_{i,n}^2/(\sigma_n^2\xi_{i,n}^2))/(n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}))$ for every $c < 1/2$. In case (b3) the speed of convergence is not slower than the speed of convergence of $\max\big(\exp(-cn\eta_{i,n}^2)/(n^{1/2}\eta_{i,n}),\ |r_{i,n} - r_i|\big)$ for any $c < 2$ in case $|r_i| < \infty$; in case $|r_i| = \infty$ it is not slower than $\max\big(\exp(-cn\eta_{i,n}^2)/(n^{1/2}\eta_{i,n}),\ \exp(-0.5r_{i,n}^2)/|r_{i,n}|\big)$ for any $c < 2$.
The preceding remark corrects and clarifies the remarks at the end of Section 3 in Pötscher and Leeb (2009) and Section 3.1 in Pötscher and Schneider (2009).

Unknown-Variance Case
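In the unknown-variance case the finite-sample variable selection/deletion probabilities can be obtained as follows:
$$P_{n,\theta,\sigma}(\tilde\theta_i = 0) = P_{n,\theta,\sigma}\big(|\hat\theta_{LS,i}| \leq \hat\sigma\xi_{i,n}\eta_{i,n}\big) = \int_0^\infty P_{n,\theta,\sigma}\big(|\hat\theta_{LS,i}| \leq \hat\sigma\xi_{i,n}\eta_{i,n} \mid \hat\sigma = s\sigma\big)\rho_{n-k}(s)\,ds$$
$$= \int_0^\infty \Big[\Phi\big(n^{1/2}(-\theta_i/(\sigma\xi_{i,n}) + s\eta_{i,n})\big) - \Phi\big(n^{1/2}(-\theta_i/(\sigma\xi_{i,n}) - s\eta_{i,n})\big)\Big]\rho_{n-k}(s)\,ds. \qquad (5)$$
Here we have used (4), and independence of $\hat\sigma$ and $\hat\theta_{LS,i}$ allowed us to replace $\hat\sigma$ by $s\sigma$ in the relevant formulae, cf. Leeb and Pötscher (2003, p. 110). In the above $\rho_{n-k}$ denotes the density of $(n-k)^{-1/2}$ times the square root of a chi-square distributed random variable with $n-k$ degrees of freedom. It will turn out to be convenient to set $\rho_{n-k}(s) = 0$ for $s < 0$, making $\rho_{n-k}$ a bounded continuous function on $\mathbb{R}$.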
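The mixture representation of the unknown-variance deletion probability can be evaluated numerically. The sketch below integrates the known-variance probability (with the tuning parameter multiplied by $s$) against the density of the square root of a chi-square variable divided by its degrees of freedom, using a plain midpoint rule. The function names, the integration range, and the number of grid points are illustrative assumptions rather than anything prescribed by the paper.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def rho(s, df):
    """Density of sqrt(chi2_df / df) at s > 0, computed on the log scale."""
    log_density = ((df - 1) * math.log(s) + 0.5 * df * math.log(df)
                   - 0.5 * df * s * s - (0.5 * df - 1.0) * math.log(2.0)
                   - math.lgamma(0.5 * df))
    return math.exp(log_density)

def deletion_prob_unknown(theta_i, sigma, xi, eta, n, df, steps=20000):
    """Midpoint-rule approximation of the mixture integral over s."""
    c = math.sqrt(n) * theta_i / (sigma * xi)
    e = math.sqrt(n) * eta
    upper = 1.0 + 12.0 / math.sqrt(df)  # mass beyond this point is negligible
    h = upper / steps
    total = 0.0
    for j in range(steps):
        s = (j + 0.5) * h
        total += (Phi(-c + s * e) - Phi(-c - s * e)) * rho(s, df)
    return total * h

# One degree of freedom, zero coefficient, sqrt(n) * eta = 1:
# the probability equals T_1(1) - T_1(-1) for the Cauchy cdf T_1.
print(deletion_prob_unknown(0.0, 1.0, 1.0, 0.5, 4, 1))
```

For a zero coefficient the integral reduces to a difference of central t cdf values, which for one degree of freedom gives the closed form $(2/\pi)\arctan(n^{1/2}\eta_{i,n})$ and provides a convenient check of the numerical integration.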
We now have the following fixed-parameter asymptotic result for the variable selection/deletion probabilities in the unknown-variance case that perfectly parallels the corresponding result in the known-variance case, i.e., Proposition 3:

Proposition 10 For every $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have:
(a) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\tilde\theta_i = 0) \to 0$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i \neq 0$ ($\theta_i$ not depending on $n$) is $\xi_{i,n}\eta_{i,n} \to 0$.
(b) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\tilde\theta_i = 0) \to 1$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i = 0$ is $n^{1/2}\eta_{i,n} \to \infty$.
(c) A necessary and sufficient condition for $P_{n,\theta,\sigma}(\tilde\theta_i = 0) - c_{i,n} \to 0$ as $n \to \infty$ for all $\theta$ satisfying $\theta_i = 0$ and with $c_{i,n} = T_{n-k}(e_i) - T_{n-k}(-e_i)$ satisfying $\limsup_{n \to \infty} c_{i,n} < 1$ is $n^{1/2}\eta_{i,n} \to e_i$, $0 \leq e_i < \infty$.
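Proof. We prove Part (b) first. Observe that for $\theta_i = 0$
$$P_{n,\theta,\sigma}(\tilde\theta_i = 0) = T_{n-k}(n^{1/2}\eta_{i,n}) - T_{n-k}(-n^{1/2}\eta_{i,n}).$$
By a subsequence argument it suffices to prove the result under the assumption that $n-k = n-k(n)$ converges in $\mathbb{N} \cup \{\infty\}$. If the limit is finite, then $n-k(n)$ is eventually constant and the result follows since every $t$-distribution has unbounded support. If $n-k \to \infty$ then
$$\Big|\big[T_{n-k}(n^{1/2}\eta_{i,n}) - T_{n-k}(-n^{1/2}\eta_{i,n})\big] - \big[\Phi(n^{1/2}\eta_{i,n}) - \Phi(-n^{1/2}\eta_{i,n})\big]\Big| \leq 2\,\|T_{n-k} - \Phi\|_\infty,$$
where $\|\cdot\|_\infty$ denotes the supremum norm. Since $\|T_{n-k} - \Phi\|_\infty \to 0$ if $n-k \to \infty$ by Polya's Theorem, the result follows. Part (c) is proved analogously.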
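We next prove Part (a). Observe that the collection of distributions corresponding to $\{\rho_m : m \in \mathbb{N}\}$ is tight on $(0, \infty)$, meaning that for every $\delta > 0$ we can find $0 < c_*(\delta) < c^*(\delta) < \infty$ such that $\int_{c_*(\delta)}^{c^*(\delta)} \rho_m(s)\,ds \geq 1 - \delta$ holds for all $m \in \mathbb{N}$. Note that the map $s \mapsto P_{n,\theta,\sigma}(\hat\theta_i(s\eta_{i,n}) = 0)$ is monotonically nondecreasing. Hence, by (5),
$$(1 - \delta)\,P_{n,\theta,\sigma}\big(\hat\theta_i(c_*(\delta)\eta_{i,n}) = 0\big) \leq P_{n,\theta,\sigma}(\tilde\theta_i = 0) \leq P_{n,\theta,\sigma}\big(\hat\theta_i(c^*(\delta)\eta_{i,n}) = 0\big) + \delta.$$
Since $\xi_{i,n}c_*(\delta)\eta_{i,n}$ ($\xi_{i,n}c^*(\delta)\eta_{i,n}$, respectively) converges to zero if and only if $\xi_{i,n}\eta_{i,n}$ does so, Part (a) follows from Proposition 3 applied to the estimators $\hat\theta_i(c_*(\delta)\eta_{i,n})$ and $\hat\theta_i(c^*(\delta)\eta_{i,n})$.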
Proposition 10 shows that the dichotomy regarding conservative tuning and consistent tuning is expressed by the same conditions in the unknown-variance case as in the known-variance case. Furthermore, note that $c_{i,n}$ appearing in Part (c) of the above proposition converges to $c_i = \Phi(e_i) - \Phi(-e_i)$ in the case where $n-k \to \infty$, the limit thus being the same as in the known-variance case. This is different in case $n-k$ is eventually constant, equal to $m$, say: the sequence $c_{i,n}$ is then eventually constant and equal to $T_m(e_i) - T_m(-e_i)$. We finally note that Remark 4 also applies to Proposition 10 above.
We next investigate the asymptotic behavior of the variable selection/deletion probabilities under a moving-parameter asymptotic framework. We consider the case where $n-k$ is (eventually) constant and the case where $n-k \to \infty$. There is no essential loss in generality in considering these two cases only, since by compactness of $\mathbb{N} \cup \{\infty\}$ we can always assume (possibly after passing to subsequences) that $n-k$ converges in $\mathbb{N} \cup \{\infty\}$.
[Note that the integral in the above display reduces to 1 if Since P n (s) as well as P (s) are continuous functions of s, are monotonically nondecreasing in s, and have the property that their limits for s ! 0 are 0 while the limits for s ! 1 are 1, it follows from Polya's Theorem that the convergence is uniform in s. But then as n ! 1. This completes the proof in case n k = m eventually; in case n k ! 1 observe that R 1 0 ( ( i + se i ) ( i se i )) n k (s)ds then converges to ( i + e i ) ( i e i ) as the distribution corresponding to n k converges weakly to pointmass at s = 1 and the integrand is bounded and continuous.
(b) Observe that P n; (n) ; n ^ i (s i;n ) = 0 converges to 1 for s > j i j and to 0 for s < j i j by Proposition 7 applied to the estimator^ i (s i;n ). Now (5) and dominated convergence deliver the result in (b1). Next consider (b2): Suppose …rst that j i j < 1. Choose " > 0 small enough such that j i j + " < 1. Then, recalling that P n; (n) ; n ^ i (s i;n ) = 0 is monotonically nondecreasing in s, eq. (5) gives n k (s)ds: Now the integral on the r.h.s. converges to 1 since j i j + " < 1, and the probability on the r.h.s. converges to 1 by Proposition 7 applied to the estimator^ i ((j i j + ") i;n ). Next assume that j i j > 1. Choose " > 0 small enough such that j i j " > 1 holds. Then from eq. (5) we have 0 n k (s)ds is not larger than 1 and the integrand is monotonically nondecreasing in s. Since j i j " > 1 and n k ! 1 the second term on the r.h.s. goes to zero, while the …rst term goes to zero by Proposition 7 applied to the estimator^ i ((j i j ") i;n ).
Next we prove 3.&4. and assume i = 1 …rst. Then using eq. (5) and performing the substitution s 1 = (2 (n k)) 1=2 t we obtain (recalling that n k is zero for negative arguments and using the abbreviations r i;n = n 1=2 i;n i;n =( n i;n ) and r i;n = n 1=2 i;n i;n =( n i;n ) ) The indicated term in the above display is o(1) because the expression in brackets inside the integral is bounded by 1 and by the Lemma in the Appendix. Since r i;n ! r i and r i;n ! 1, the integrand converges to (r i ) under 3. and to (r i + d i t) under 4. The dominated convergence theorem then completes the proof. The case i = 1 is treated similarly. It remains to prove 5. Again assume i = 1 …rst. De…ne r 0 i;n = 2 1=2 n 1=2 1 i;n (n k) 1=2 r i;n and r 00 i;n = 2 1=2 n 1=2 1 i;n (n k) 1=2 r i;n and rewrite the above display as Observe that r 0 i;n ! r 0 i and r 00 i;n ! 1. The expression in brackets inside the integral hence converges to 1 for t > r 0 i and to 0 for t < r 0 i . By dominated convergence the integral converges to R 1 Theorem 11 shows, in particular, that also in the unknown-variance case the convergence of the variable selection/deletion probabilities to their limits in a …xed-parameter asymptotic framework is not locally uniform in i . In the case of conservative tuning the theorem furthermore shows that the limit of the variable selection/deletion probabilities in the unknown-variance case is the same as in the known-variance case if the degrees of freedom n k go to in…nity (entailing that the distribution of^ = concentrates more and more around 1); if n k is eventually constant, the limit turns out to be a mixture of the known-variance case limits (with replaced by s ), the mixture being w.r.t. the distribution of^ = .
[We note that in the somewhat uninteresting case $e_i = 0$ this mixture also reduces to the same limit as in the known-variance case.] While this result is as one would expect, the situation is different and more subtle in the case of consistent tuning: If $n-k \to \infty$ the limits are the same as in the known-variance case if $|\zeta_i| < 1$ or $|\zeta_i| > 1$ holds, namely 1 and 0, respectively. However, in the "boundary" case $|\zeta_i| = 1$ the rate at which $n-k$ diverges to infinity becomes relevant. If the divergence is fast enough in the sense that $n^{1/2}\eta_{i,n}/(n-k)^{1/2} \to 0$, again the same limit as in the known-variance case, namely $\Phi(r_i)$, is obtained; but if $n-k$ diverges to infinity more slowly, a different limit arises (which, e.g., in case 4 of Part (b2) is obtained by averaging $\Phi(r_i + \cdot)$ w.r.t. a suitable distribution). The case where the degrees of freedom $n-k$ is eventually constant looks very much different from the known-variance case and again some averaging w.r.t. the distribution of $\hat\sigma/\sigma$ takes place. Note that in this case the limiting variable deletion probabilities are 1 and 0, respectively, only if $\zeta_i = 0$ and $|\zeta_i| = \infty$, respectively, which is in contrast to the known-variance case (and the unknown-variance case with $n-k \to \infty$).

Remark 12
As in the known-variance case, the convergence conditions on the various quantities involving $\theta_{i,n}$ and $\sigma_n$ in Theorem 11 are essentially cost-free for the same reasons as given in Remark 8. Theorem 11 thus provides a full characterization of all possible accumulation points of the variable selection/deletion probabilities in the unknown-variance case.
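As just discussed, in the case of conservative tuning we get the same limiting behavior under moving-parameter asymptotics in the known-variance and in the unknown-variance case along any sequence of parameters precisely if $n-k \to \infty$ or $e_i = 0$ (which in the conservatively tuned case can equivalently be stated as $n^{1/2}\eta_{i,n}/(n-k)^{1/2} \to 0$). In the case of consistent tuning the same coincidence of limits occurs precisely if $n-k \to \infty$ fast enough such that $n^{1/2}\eta_{i,n}/(n-k)^{1/2} \to 0$. This is not accidental but a consequence of the following fact:

Proposition 13 Suppose that for given $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ as $n \to \infty$. Then
$$\sup_{\theta \in \mathbb{R}^k,\ 0 < \sigma < \infty}\big|P_{n,\theta,\sigma}(\hat\theta_i = 0) - P_{n,\theta,\sigma}(\tilde\theta_i = 0)\big| \to 0 \quad \text{for } n \to \infty.$$
Proof. Observe that by (4) and (5)
$$\big|P_{n,\theta,\sigma}(\hat\theta_i = 0) - P_{n,\theta,\sigma}(\tilde\theta_i = 0)\big| \leq \int_0^\infty \big|P_{n,\theta,\sigma}(\hat\theta_i(\eta_{i,n}) = 0) - P_{n,\theta,\sigma}(\hat\theta_i(s\eta_{i,n}) = 0)\big|\,\rho_{n-k}(s)\,ds.$$
By a trivial modification of Lemma 13 in Pötscher and Schneider (2010) we conclude that for every $\varepsilon > 0$ there exists a real number $c = c(\varepsilon) > 0$ such that
$$\int_{|s-1| > (n-k)^{-1/2}c} \rho_{n-k}(s)\,ds < \varepsilon$$
for every $n > k$. Using the fact that $\Phi$ is globally Lipschitz with constant $(2\pi)^{-1/2}$, so that the integrand is bounded by $2(2\pi)^{-1/2}n^{1/2}\eta_{i,n}|s-1|$ as well as by 1, this gives
$$\sup_{\theta \in \mathbb{R}^k,\ 0 < \sigma < \infty}\big|P_{n,\theta,\sigma}(\hat\theta_i = 0) - P_{n,\theta,\sigma}(\tilde\theta_i = 0)\big| \leq 2(2\pi)^{-1/2}c\,n^{1/2}\eta_{i,n}(n-k)^{-1/2} + \varepsilon,$$
which proves the result since $\varepsilon$ can be made arbitrarily small.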
We note that Theorem 11 shows that the condition $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ in the above proposition cannot be weakened.

Consistency, Uniform Consistency, and Minimax Convergence Rate
For purposes of comparison we start with the following obvious proposition, which immediately follows from the observation that $\hat\theta_{LS,i}$ is $N(\theta_i, \sigma^2\xi_{i,n}^2/n)$-distributed.
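Proposition 14 For every $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have the following:
(a) $\xi_{i,n}/n^{1/2} \to 0$ is a necessary and sufficient condition for $\hat\theta_{LS,i}$ to be consistent for $\theta_i$, the convergence rate being $\xi_{i,n}/n^{1/2}$.
(b) Suppose $\xi_{i,n}/n^{1/2} \to 0$. Then $\hat\theta_{LS,i}$ is uniformly consistent for $\theta_i$ in the sense that for every $\varepsilon > 0$
$$\lim_{n \to \infty}\ \sup_{\theta \in \mathbb{R}^k}\ \sup_{0 < \sigma < \infty} P_{n,\theta,\sigma}\big(|\hat\theta_{LS,i} - \theta_i| > \sigma\varepsilon\big) = 0.$$
In fact, $\hat\theta_{LS,i}$ is uniformly $n^{1/2}/\xi_{i,n}$-consistent for $\theta_i$ in the sense that for every $\varepsilon > 0$ there exists a real number $M > 0$ such that
$$\sup_n\ \sup_{\theta \in \mathbb{R}^k}\ \sup_{0 < \sigma < \infty} P_{n,\theta,\sigma}\big((n^{1/2}/\xi_{i,n})|\hat\theta_{LS,i} - \theta_i| > \sigma M\big) < \varepsilon.$$
[Note that the probabilities in the displays above in fact depend neither on $\theta$ nor on $\sigma$. In particular, the probabilities in the above displays equal $2\Phi(-\varepsilon n^{1/2}/\xi_{i,n})$ and $2\Phi(-M)$, respectively.]

The corresponding result for the estimators $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, or $\tilde\theta_{AS,i}$ and their infeasible counterparts $\hat\theta_{H,i}$, $\hat\theta_{S,i}$, or $\hat\theta_{AS,i}$ is now as follows.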
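Theorem 15 Let $\tilde\theta_i$ stand for any of the estimators $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, or $\tilde\theta_{AS,i}$. Then for every $i \geq 1$ satisfying $i \leq k = k(n)$ for large enough $n$ we have the following:
(a) $\tilde\theta_i$ is consistent for $\theta_i$ if and only if $\xi_{i,n}\eta_{i,n} \to 0$ and $\xi_{i,n}/n^{1/2} \to 0$.
(b) Suppose $\xi_{i,n}\eta_{i,n} \to 0$ and $\xi_{i,n}/n^{1/2} \to 0$. Then $\tilde\theta_i$ is uniformly consistent in the sense that for every $\varepsilon > 0$
$$\lim_{n \to \infty}\ \sup_{\theta \in \mathbb{R}^k}\ \sup_{0 < \sigma < \infty} P_{n,\theta,\sigma}\big(|\tilde\theta_i - \theta_i| > \sigma\varepsilon\big) = 0.$$
Furthermore, $\tilde\theta_i$ is uniformly $a_{i,n}$-consistent with $a_{i,n} = \min\big(n^{1/2}/\xi_{i,n},\ (\xi_{i,n}\eta_{i,n})^{-1}\big)$ in the sense that for every $\varepsilon > 0$ there exists a real number $M > 0$ such that
$$\sup_n\ \sup_{\theta \in \mathbb{R}^k}\ \sup_{0 < \sigma < \infty} P_{n,\theta,\sigma}\big(a_{i,n}|\tilde\theta_i - \theta_i| > \sigma M\big) < \varepsilon.$$
(c) Suppose $\xi_{i,n}\eta_{i,n} \to 0$ and $\xi_{i,n}/n^{1/2} \to 0$ and $b_{i,n} \geq 0$. If for every $\varepsilon > 0$ there exists a real number $M > 0$ such that
$$\sup_n\ \sup_{\theta \in \mathbb{R}^k}\ \sup_{0 < \sigma < \infty} P_{n,\theta,\sigma}\big(b_{i,n}|\tilde\theta_i - \theta_i| > \sigma M\big) < \varepsilon \qquad (7)$$
holds, then $b_{i,n} = O(a_{i,n})$ necessarily holds.
(d) Let $\hat\theta_i$ stand for any of the estimators $\hat\theta_{H,i}$, $\hat\theta_{S,i}$, or $\hat\theta_{AS,i}$. Then the results in (a)-(c) also hold for $\hat\theta_i$.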
holds for any of the estimators. Hence, consistency of~ i under i;n i;n ! 0 and i;n =n 1=2 ! 0 follows immediately from Proposition 14(a) since the distributions of^ = are tight. Conversely, suppose~ i is consistent. Then clearly P n; ; ~ i = 0 ! 0 whenever i 6 = 0 must hold, which implies i;n i;n ! 0 by Proposition 10(a). This then entails consistency of^ LS;i by (8)   Choose c ("=2) as in the proof of Proposition 10. Using continuity of and the fact that the probability appearing on the r.h.s. above is monotonically increasing as j i j approaches M=a i;n from above, this can be further bounded by (c) By a subsequence argument we can reduce the argument to the case where n 1=2 i;n ! e i 2 R and n k converges in N [ f1g. Suppose …rst that e i = 1: Observe that then a i;n = ( i;n i;n ) 1 eventually. Choose i;n and n such that i;n = n i;n i;n = i , where i does not depend on n and 0 < j i j < 1 holds, and set the other coordinates of (n) to arbitrary values (e.g., equal to zero). Observe that there exists a constant > 0 such that holds: If n k converges to a …nite limit, i.e., is eventually constant, the claim follows from Theorem 11(b1); if n k ! 1, then use Theorem 11(b2). By (7) we have for " = and a suitable M that > P n; (n) ; n b i;n ~ i i;n > n M P n; (n) ; n b i;n ~ i i;n > n M;~ i = 0 = P n; (n) ; n jb i;n i;n j = n > M;~ i = 0 = 1 (jb i;n i;n j = n > M ) P n; (n) ; n ~ i = 0 > 1 (jb i;n i;n j = n > M ) for all n su¢ ciently large. But this is only possible if b i;n i;n i;n M= j i j < 1 holds eventually, implying that b i;n = O(a i;n ). Next consider the case where 0 < e i < 1: Observe that then a i;n is of the same order as n 1=2 = i;n . Then de…ne i;n and n such that n 1=2 i;n = n i;n = i , where i does not depend on n and 0 < j i j < 1 holds, and set the other coordinates of (n) to arbitrary values (e.g., equal to zero). 
Observe that then (9) also holds, in view of Theorem 11(a1) in case $n-k$ is eventually constant, and in view of Theorem 11(a2) in case $n-k \to \infty$. The rest of the proof is then similar as before. It remains to consider the case $e_i = 0$: It follows from (8), the assumptions on $\xi_{i,n}$ and $\eta_{i,n}$, from $e_i = 0$, and from the observation that $\hat\theta_{LS,i}$ is $N(\theta_i, \sigma^2\xi_{i,n}^2/n)$-distributed, that $\sigma^{-1} n^{1/2}\xi_{i,n}^{-1}(\tilde\theta_i - \theta_i)$ converges in distribution to a standard normal distribution for each fixed $\theta_i$ and $\sigma$. Hence, stochastic boundedness of $\sigma^{-1} b_{i,n}(\tilde\theta_i - \theta_i)$ for each $\theta_i$ (and a fortiori (7)) necessarily implies that $b_{i,n} = O(n^{1/2}\xi_{i,n}^{-1}) = O(a_{i,n})$.
(d) The proof for $\hat\theta_i$ is similar and in fact simpler: note that now $|\hat\theta_i - \hat\theta_{LS,i}| \le \sigma\xi_{i,n}\eta_{i,n}$ holds and that in the proof of (b) the integration over $s$ can simply be replaced by evaluation at $s = 1$. For (c) one uses Proposition 7 instead of Theorem 11.
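Remark 16 If $n^{1/2}\eta_{i,n} \to e_i = 0$, then $\tilde\theta_i$ is asymptotically equivalent to $\hat\theta_{LS,i}$ in the sense that for every $\varepsilon > 0$
$$\lim_{n\to\infty}\ \sup_{\theta\in\mathbb{R}^k}\ \sup_{0<\sigma<\infty} P_{n,\theta,\sigma}\left( n^{1/2}\xi_{i,n}^{-1}\,|\tilde\theta_i - \hat\theta_{LS,i}| > \sigma\varepsilon \right) = 0.$$
A similar statement holds for $\hat\theta_i$. For $\tilde\theta_i$ this follows immediately from (8) and the fact that the family of distributions corresponding to $\rho_{n-k}$ is tight; for $\hat\theta_i$ this follows from the relation $|\hat\theta_i - \hat\theta_{LS,i}| \le \sigma\xi_{i,n}\eta_{i,n}$ noted above.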
Remark 17 (i) A variation of the proof of Theorem 15 shows that in case of consistent tuning for the infeasible estimators additionally also holds for every $M > 1$ provided that $n-k \to \infty$.
(ii) Inspection of the proof shows that the conclusion of Theorem 15(c) continues to hold if the supremum over $\theta \in \mathbb{R}^k$ is replaced by the supremum over an arbitrarily small neighborhood of $0$ and $\sigma$ is held fixed at an arbitrary positive value.
(iii) If $\sigma\varepsilon$ and $\sigma M$ are replaced by $\varepsilon$ and $M$, respectively, in the displays in Proposition 14 and Theorem 15 as well as in Remark 16, the resulting statements remain true provided the suprema over $0 < \sigma < \infty$ are replaced by suprema over $0 < \sigma \le c$, where $c > 0$ is an arbitrary real number.
The preceding theorem shows that the thresholding estimators $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$ (as well as their infeasible versions) are uniformly $a_{i,n}$-consistent and that this rate is sharp and cannot be improved. In particular, if the tuning is conservative these estimators are uniformly $n^{1/2}/\xi_{i,n}$-consistent, which is the usual rate one expects to find in a linear regression model as considered here. However, if consistent tuning is employed, the preceding theorem shows that these thresholding estimators are then only uniformly $(\xi_{i,n}\eta_{i,n})^{-1}$-consistent, i.e., have a slower minimax convergence rate than the least-squares (maximum likelihood) estimator (or the conservatively tuned thresholding estimators for that matter). For a discussion of the pointwise convergence rate see Section 6.4.
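The gap between the two rates can be made concrete with a small numerical sketch. The code below only evaluates $a_{i,n} = \min(n^{1/2}/\xi_{i,n}, (\xi_{i,n}\eta_{i,n})^{-1})$ from Theorem 15; the two tuning sequences `eta_conservative` and `eta_consistent`, and the choice $\xi_{i,n} = 1$, are hypothetical examples chosen for illustration and are not taken from the paper.

```python
import math


def uniform_rate(n, xi, eta):
    """Return a_{i,n} = min(n^{1/2} / xi_{i,n}, 1 / (xi_{i,n} * eta_{i,n}))."""
    return min(math.sqrt(n) / xi, 1.0 / (xi * eta))


def eta_conservative(n, e=2.0):
    # Hypothetical conservative tuning: n^{1/2} * eta_{i,n} -> e < infinity.
    return e / math.sqrt(n)


def eta_consistent(n):
    # Hypothetical consistent tuning: eta_{i,n} -> 0 while n^{1/2} * eta_{i,n} -> infinity.
    return n ** -0.25


if __name__ == "__main__":
    xi = 1.0  # assumed value of xi_{i,n}, held fixed for simplicity
    for n in (10**2, 10**4, 10**6):
        ls_rate = math.sqrt(n) / xi
        print(n, ls_rate,
              uniform_rate(n, xi, eta_conservative(n)),
              uniform_rate(n, xi, eta_consistent(n)))
```

Under these assumed sequences the conservatively tuned rate stays within a constant factor of $n^{1/2}/\xi_{i,n}$, whereas the consistently tuned rate grows only like $n^{1/4}$.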

Known-Variance Case
We next present the finite-sample distributions of the infeasible thresholding estimators. It will turn out to be convenient to give the results for scaled versions, where the scaling factor $\alpha_{i,n}$ is a positive real number, but is otherwise arbitrary. Note that below we suppress the dependence of the distribution functions of the thresholding estimators on the scaling sequence $\alpha_{i,n}$ in the notation. Furthermore, observe that the finite-sample distributions depend on $\theta$ only through $\theta_i$. The finite-sample distributions of $\hat\theta_{H,i}$, $\hat\theta_{S,i}$, and $\hat\theta_{AS,i}$ are seen to be non-normal. They are made up of two components, one being a multiple of pointmass at $-\alpha_{i,n}\theta_i/\sigma$ and the other one being absolutely continuous with a density that is generally bimodal. For more discussion and some graphical illustrations in a special case see Pötscher and Leeb (2009) and Pötscher and Schneider (2009).
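The pointmass component can be checked by simulation. The sketch below assumes the usual coordinate-wise definitions of the infeasible estimators with threshold $t = \sigma\xi_{i,n}\eta_{i,n}$ (hard: keep $\hat\theta_{LS,i}$ if $|\hat\theta_{LS,i}| > t$; soft: shrink by $t$; adaptive soft: $\hat\theta_{LS,i} - t^2/\hat\theta_{LS,i}$ if $|\hat\theta_{LS,i}| > t$), since this section does not restate them. It uses only that $\hat\theta_{LS,i}$ is $N(\theta_i, \sigma^2\xi_{i,n}^2/n)$-distributed. The function names and the parameter values are illustrative.

```python
import math
import random


def hard(ls, t):
    return ls if abs(ls) > t else 0.0


def soft(ls, t):
    return math.copysign(max(abs(ls) - t, 0.0), ls)


def adaptive_soft(ls, t):
    return ls - t * t / ls if abs(ls) > t else 0.0


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def atom_mass(n, theta, sigma, xi, eta):
    """P(|LS| <= sigma*xi*eta) for LS ~ N(theta, sigma^2 xi^2 / n)."""
    centre = -math.sqrt(n) * theta / (sigma * xi)
    half_width = math.sqrt(n) * eta
    return Phi(centre + half_width) - Phi(centre - half_width)


def simulated_atom_mass(estimator, n, theta, sigma, xi, eta, reps=20000, seed=0):
    """Monte Carlo frequency of the event {estimator == 0}."""
    rng = random.Random(seed)
    t = sigma * xi * eta
    sd = sigma * xi / math.sqrt(n)
    zeros = sum(estimator(rng.gauss(theta, sd), t) == 0.0 for _ in range(reps))
    return zeros / reps


if __name__ == "__main__":
    args = dict(n=100, theta=0.1, sigma=1.0, xi=1.0, eta=0.2)
    print("analytic:", atom_mass(**args))
    for est in (hard, soft, adaptive_soft):
        print(est.__name__, simulated_atom_mass(est, **args))
```

All three estimators are zero on the same event $\{|\hat\theta_{LS,i}| \le t\}$, so the simulated frequencies should agree with each other and with the analytic mass up to Monte Carlo error; on the scaled axis that mass sits at $-\alpha_{i,n}\theta_i/\sigma$.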

Remark 21
In the case where $X'X$ is diagonal, the estimators of the components $\theta_i$ and $\theta_j$ for $i \neq j$ are independent and hence the above results immediately allow one to determine the finite-sample distributions of the entire vectors $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{AS}$. In particular, this provides the finite-sample distribution of the Lasso $\hat\theta_L$ and the adaptive Lasso $\hat\theta_{AS}$ in the diagonal case (cf. Remarks 1 and 2).

Unknown-Variance Case
The finite-sample distributions of $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, $\tilde\theta_{AS,i}$ are obtained next. The same remark on the scaling as in the previous section applies here.

where we have used independence of $\hat\sigma$ and $\hat\theta_{LS,i}$ allowing us to replace $\hat\sigma$ by $s\sigma$ in the relevant formulae, cf. Leeb and Pötscher (2003, p. 110). Substituting (10), with $\eta_{i,n}$ replaced by $s\eta_{i,n}$, into the above equation gives (15). Representing $H^i_{H,s\eta_{i,n},n,\theta,\sigma}(x)$ as an integral of $dH^i_{H,s\eta_{i,n},n,\theta,\sigma}$ given in (11) and applying Fubini's theorem then gives (16).

Proposition 23
The cdf $H^{i\maltese}_{S,n,\theta,\sigma} := H^{i\maltese}_{S,\eta_{i,n},n,\theta,\sigma}$ of

where we have used independence of $\hat\sigma$ and $\hat\theta_{LS,i}$ allowing us to replace $\hat\sigma$ by $s\sigma$ in the relevant formulae. Substituting (14), with $\eta_{i,n}$ replaced by $s\eta_{i,n}$, into the above equation gives (19).
As in the known-variance case the distributions are a convex combination of pointmass and an absolutely continuous part. In case of hard-thresholding, the averaging with respect to the density $\rho_{n-k}$ smooths the indicator functions, leading to a continuous density function for the absolutely continuous part (while in the known-variance case the density function is only piecewise continuous, cf. Figure 1 in Pötscher and Leeb (2009)). This is not so for soft-thresholding and adaptive soft-thresholding, where the averaging with respect to the density $\rho_{n-k}$ does not affect the indicator functions involved; here the shape of the distribution is qualitatively the same as in the known-variance case (Figure 2 in Pötscher and Leeb (2009) and Figure 1 in Pötscher and Schneider (2009)).
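The mixing over $\rho_{n-k}$ can be illustrated for the mass of the atomic part. The sketch below assumes, as in the proofs above, that $\hat\sigma/\sigma$ is independent of $\hat\theta_{LS,i}$ and is distributed as the square root of a chi-square variable with $m = n-k$ degrees of freedom divided by $m$. It compares (i) the known-variance probability with $\eta_{i,n}$ replaced by $s\eta_{i,n}$, averaged over draws of $s$, with (ii) a direct simulation of the event $\{|\hat\theta_{LS,i}| \le \hat\sigma\xi_{i,n}\eta_{i,n}\}$. All function names and parameter values are illustrative.

```python
import math
import random


def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def atom_mass_known(n, theta, sigma, xi, eta):
    """P(|LS| <= sigma*xi*eta) for LS ~ N(theta, sigma^2 xi^2 / n)."""
    centre = -math.sqrt(n) * theta / (sigma * xi)
    half_width = math.sqrt(n) * eta
    return Phi(centre + half_width) - Phi(centre - half_width)


def draw_s(rng, m):
    # sigma_hat / sigma = sqrt(chi2_m / m); chi2_m is Gamma(shape=m/2, scale=2).
    return math.sqrt(rng.gammavariate(m / 2.0, 2.0) / m)


def atom_mass_mixed(n, theta, sigma, xi, eta, m, reps=20000, seed=1):
    """Average the known-variance mass over s, with eta replaced by s*eta."""
    rng = random.Random(seed)
    total = sum(atom_mass_known(n, theta, sigma, xi, draw_s(rng, m) * eta)
                for _ in range(reps))
    return total / reps


def atom_mass_direct(n, theta, sigma, xi, eta, m, reps=20000, seed=2):
    """Simulate LS and sigma_hat independently and count thresholded draws."""
    rng = random.Random(seed)
    sd = sigma * xi / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        ls = rng.gauss(theta, sd)
        sigma_hat = sigma * draw_s(rng, m)
        hits += abs(ls) <= sigma_hat * xi * eta
    return hits / reps


if __name__ == "__main__":
    args = dict(n=100, theta=0.1, sigma=1.0, xi=1.0, eta=0.2)
    print("known variance:", atom_mass_known(**args))
    for m in (3, 30, 2000):
        print(m, atom_mass_mixed(m=m, **args), atom_mass_direct(m=m, **args))
```

For small $m$ the mixed mass differs visibly from the known-variance value, and the two should approach each other as $m$ grows.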

Remark 25
In the case where $X'X$ is diagonal, the finite-sample distributions of the entire vectors $\tilde\theta_H$, $\tilde\theta_S$, and $\tilde\theta_{AS}$ can be found from the distributions of $\hat\theta_H$, $\hat\theta_S$, and $\hat\theta_{AS}$ (see Remark 21) by conditioning on $\hat\sigma = s\sigma$ and integrating with respect to $\rho_{n-k}(s)\,ds$. In particular, this provides the finite-sample distributions of the Lasso $\tilde\theta_L$ and the adaptive Lasso $\tilde\theta_{AS}$ in the diagonal case (cf. Remarks 1 and 2).

Large-Sample Distributions
We next derive the asymptotic distributions of the thresholding estimators under a moving-parameter (and not only under a fixed-parameter) framework, since it is well-known (cf. also the discussion in Section 6.4) that asymptotics based only on a fixed-parameter framework often lead to misleading conclusions regarding the performance of the estimators.

The Known-Variance Case
We first consider the infeasible versions of the thresholding estimators.
Proposition 26 Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i \le \infty$.
(a) Assume $e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$. Then $H^i_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

the corresponding measure being

[This distribution reduces to a standard normal distribution in case $|\nu_i| = \infty$ or $e_i = 0$.]
(b) Assume $e_i = \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
1. If $|\nu_i| < 1$, then $H^i_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-\nu_i}$.
2. If $|\nu_i| > 1$, then $H^i_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_0$.
3. If $|\nu_i| = 1$ and $n^{1/2}\left(\eta_{i,n} - \nu_i\theta_{i,n}/(\sigma_n\xi_{i,n})\right) \to r_i$ for some $r_i \in \bar{\mathbb{R}}$, then $H^i_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to

Proof. The proof of (a) is completely analogous to the proof of Theorem 4 in Pötscher and Leeb (2009), whereas the proof of (b) is analogous to the proof of Theorem 17 in the same reference.
Proposition 27 Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i \le \infty$.
(a) Assume $e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$. Then $H^i_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

the corresponding measure being

(b) Assume $e_i = \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$. Then $H^i_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-\operatorname{sign}(\nu_i)\min(1,|\nu_i|)}$.
Proof. The proof of (a) is completely analogous to the proof of Theorem 5 in Pötscher and Leeb (2009), whereas the proof of (b) is analogous to the proof of Theorem 18 in the same reference.
Proposition 28 Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i \le \infty$.
(a) Assume $e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$. Then $H^i_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

in case $|\nu_i| < \infty$, the corresponding measure being

In case $|\nu_i| = \infty$, the cdf $H^i_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to $\Phi$, i.e., to a standard normal distribution. [In case $e_i = 0$ the limit always reduces to a standard normal distribution.]
(b) Assume $e_i = \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
1. If $|\nu_i| < 1$, then $H^i_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-\nu_i}$.
2. If $1 \le |\nu_i| < \infty$, then $H^i_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-1/\nu_i}$.
3. If $|\nu_i| = \infty$, then $H^i_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_0$.
Proof. The proof of (a) is completely analogous to the proof of Theorem 4 in Pötscher and Schneider (2009), whereas the proof of (b) is analogous to the proof of Theorem 6 in the same reference.
Observe that the scaling factors $\alpha_{i,n}$ used in the above propositions are exactly of the same order as $a_{i,n}$ in the case of conservative as well as in the case of consistent tuning and thus correspond to the minimax rate of convergence in both cases. In the case of conservative tuning the limiting distributions have essentially the same form as the finite-sample distributions, demonstrating that the moving-parameter asymptotic framework captures the finite-sample behavior of the estimators in a satisfactory way. In contrast, a fixed-parameter asymptotic framework, which corresponds to setting $\theta_{i,n} \equiv \theta_i$ and $\sigma_n \equiv \sigma$ in the above propositions, misrepresents the finite-sample properties of the thresholding estimators whenever $\theta_i \neq 0$ but small, as the fixed-parameter limiting distribution is then, in case of hard-thresholding and adaptive soft-thresholding, always $N(0,1)$, regardless of the size of $\theta_i$. For soft-thresholding we also observe a strong discrepancy between the finite-sample distribution and the fixed-parameter limit for $\theta_i \neq 0$, which is given by $N(-\operatorname{sign}(\theta_i)e_i, 1)$. In particular, the above propositions demonstrate non-uniformity in the convergence of finite-sample distributions to their limit in a fixed-parameter framework.
In the case of consistent tuning we observe an interesting phenomenon, namely that the limiting distributions now correspond to pointmasses (but not always located at zero!), or are convex combinations of two pointmasses in some cases when considering the hard-thresholding estimator. This essentially means that consistently tuned thresholding estimators are plagued by a bias problem in that the "bias-component" is the dominant component and is of larger order than the "stochastic variability" of the estimator. In a fixed-parameter framework we get the trivial limits $\delta_0$ for every value of $\theta_i$ in case of hard-thresholding and adaptive soft-thresholding. At first glance this seems to suggest that we have used a scaling sequence that does not increase fast enough with $n$, but recall that the scaling used here corresponds to the minimax convergence rate. We shall take up this issue further in Section 6.4. The situation is different for the soft-thresholding estimator, where the fixed-parameter limit is $\delta_{-\operatorname{sign}(\theta_i)}$, which reduces to $\delta_0$ only for $\theta_i = 0$; this is a reflection of the well-known fact that soft-thresholding is plagued by bias problems to a higher degree than are hard-thresholding and adaptive soft-thresholding.
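The pointmass limits under consistent tuning can be seen in a short simulation. The sketch below assumes the usual coordinate-wise definitions of the infeasible estimators with threshold $t = \sigma\xi_{i,n}\eta_{i,n}$, uses the hypothetical tuning sequence $\eta_{i,n} = n^{-1/4}$ with $\xi_{i,n} = \sigma = 1$, and sets $\theta_{i,n} = \nu_i\sigma\xi_{i,n}\eta_{i,n}$ so that $\theta_{i,n}/(\sigma\xi_{i,n}\eta_{i,n}) = \nu_i$ exactly. It reports the sample mean of $\sigma^{-1}(\xi_{i,n}\eta_{i,n})^{-1}(\hat\theta_i - \theta_{i,n})$ for $\nu_i = 0.5$ and $\nu_i = 2$.

```python
import math
import random


def hard(ls, t):
    return ls if abs(ls) > t else 0.0


def soft(ls, t):
    return math.copysign(max(abs(ls) - t, 0.0), ls)


def adaptive_soft(ls, t):
    return ls - t * t / ls if abs(ls) > t else 0.0


def scaled_mean(estimator, nu, n=10**8, sigma=1.0, xi=1.0, reps=2000, seed=0):
    """Sample mean of (estimator - theta_{i,n}) / (sigma * xi * eta)."""
    rng = random.Random(seed)
    eta = n ** -0.25                 # assumed consistent tuning sequence
    t = sigma * xi * eta             # threshold sigma * xi * eta
    theta = nu * t                   # moving parameter with ratio nu
    sd = sigma * xi / math.sqrt(n)   # std. dev. of the least-squares estimator
    draws = [(estimator(rng.gauss(theta, sd), t) - theta) / t for _ in range(reps)]
    return sum(draws) / reps


if __name__ == "__main__":
    for est in (hard, soft, adaptive_soft):
        print(est.__name__, [round(scaled_mean(est, nu), 3) for nu in (0.5, 2.0)])
```

According to Propositions 26(b), 27(b) and 28(b), the limits for $\nu_i = 0.5$ are located at $-0.5$ for all three estimators, while for $\nu_i = 2$ they are located at $0$ (hard), $-1$ (soft) and $-1/2$ (adaptive soft); the simulated means should lie close to these values for large $n$.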

Uniform Closeness of Distributions in the Known- and Unknown-Variance Case
We next show that the finite-sample cdfs of $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$ and of their infeasible counterparts $\hat\theta_{H,i}$, $\hat\theta_{S,i}$, and $\hat\theta_{AS,i}$, respectively, are uniformly (w.r.t. the parameters) close in the total variation distance (or the supremum norm) provided the number of degrees of freedom $n-k$ diverges to infinity fast enough. Apart from being of interest in their own right, these results will be instrumental in the subsequent section. We note that the results in this section hold for any choice of the scaling factors $\alpha_{i,n}$.
Theorem 29 Suppose that for given i 1 satisfying i k = k(n) for large enough n we have n 1=2 i;n (n k) 1=2 ! 0 as n ! 1. H iz H;n; ; T V ! 0 for n ! 1: Proof. Observe that the total variation distance between two cdfs is bounded by the sum of the total variation distances between the corresponding discrete and continuous parts. Furthermore, recall that the total variation distance between the absolutely continuous parts is bounded from above by the L 1 -distance of the corresponding densities. Hence, from (11) and (16) where A is the l.h.s. of (6) and where we have made use of Fubini's theorem and performed an obvious substitution. By a trivial modi…cation of Lemma 13 in Pötscher and Schneider (2010) we conclude that for every " > 0 there exists a real number c = c(") > 0 such that Z js 1j>(n k) 1=2 c n k (s)ds < " for every n k > 0. Using the fact, that is globally Lipschitz with constant (2 ) 1=2 , this gives i;n Z js 1j (n k) 1=2 c j(s _ 1) (s^1)j n k (s)ds 2" + 2(2 ) 1=2 n 1=2 i;n (n k) 1=2 c: The r.h.s. now converges to 2" because n 1=2 i;n (n k) 1=2 ! 0. Since " > 0 was arbitrary, this shows that sup 2R k ;0< <1 B converges to zero. Note also that sup 2R k ;0< <1 A has already been shown to converge to zero in Proposition 13. This completes the proof.
Theorem 30 Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ as $n \to \infty$. Then

where $A$ is given by the l.h.s. of (6) and, using (13) and (18),

where we have used the fact that $\Phi$ is globally Lipschitz with constant $(2\pi)^{-1/2}$. Since $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ and $\varepsilon > 0$ was arbitrary, the proof is complete, because $\sup_{\theta\in\mathbb{R}^k,\,0<\sigma<\infty} A$ goes to zero by Proposition 13.
Observe that on the one hand C 1 (s) and C 2 (s) are bounded by 1, and that on the other hand, using the Lipschitz-property of and the mean-value theorem, (2 ) 1=2 n 1=2 2 i;n js 1j sup where s is a mean-value between s and 1 which may depend on x. The supremum over x on the r.h.s. is now clearly assumed for x = i;n i = , resulting in the bound jC 1 (s)j (2 ) 1=2 n 1=2 i;n js 1j : The same bound is obtained for C 2 in exactly the same way. Consequently, using (24) Since n 1=2 i;n (n k) 1=2 ! 0 and " > 0 was arbitrary, the proof is complete.
Remark 32 In case of conservative tuning, the condition $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ is always satisfied if $n-k \to \infty$. [In fact it is then equivalent to $n-k \to \infty$ or $e_i = 0$.] In case of consistent tuning $n-k \to \infty$ is clearly a weaker condition than $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$. However, in general, a sufficient condition for $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ is that $\eta_{i,n} \to 0$ and $\limsup_{n\to\infty} k/n < 1$.
Remark 33 Since different limiting probabilities arise in the known-variance case (Proposition 7(b)) and the unknown-variance case (Theorem 11(b2)) in the case where $n-k \to \infty$ but $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ is violated, it follows that the condition $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ cannot be weakened.
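How the condition of Theorems 29 to 31 interacts with the growth of $k$ can be explored numerically. The sketch below only evaluates the ratio $n^{1/2}\eta_{i,n}(n-k)^{-1/2}$; the tuning sequence $\eta_{i,n} = n^{-1/4}$ and the two choices of $k(n)$ are hypothetical examples, not taken from the paper.

```python
import math


def closeness_ratio(n, k, eta):
    """Return n^{1/2} * eta_{i,n} * (n - k)^{-1/2}."""
    return math.sqrt(n) * eta / math.sqrt(n - k)


def eta_consistent(n):
    # Hypothetical consistent tuning: eta -> 0 while n^{1/2} * eta -> infinity.
    return n ** -0.25


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6, 10**8):
        half = n // 2                     # k/n = 1/2, so limsup k/n < 1
        slow = n - round(math.sqrt(n))    # n - k = n^{1/2}: diverges, but slowly
        print(n,
              closeness_ratio(n, half, eta_consistent(n)),
              closeness_ratio(n, slow, eta_consistent(n)))
```

With $k = n/2$ the ratio equals $2^{1/2} n^{-1/4}$ and tends to zero, in line with the sufficient condition in Remark 32. With $n - k = n^{1/2}$ the degrees of freedom still diverge, yet the ratio stays at $1$, which is the kind of situation addressed in Remark 33.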

Conservative Tuning
We next obtain the limiting distributions of $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$ in a moving-parameter framework under conservative tuning.
Theorem 34 (Hard-thresholding with conservative tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) If $n-k$ is eventually constant equal to $m$, say, then $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

the corresponding measure being

Proof. (a) The atomic part of $dH^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ as given in (16) clearly converges weakly to the atomic part of (25) in view of Theorem 11(a1) and the fact that $-\alpha_{i,n}\theta_{i,n}/\sigma_n = -n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to -\nu_i$ by assumption; also note that the atomic part converges to the zero measure in case $|\nu_i| = \infty$ or $e_i = 0$, as then the total mass of the atomic part converges to zero. We turn to the absolutely continuous part next. For later use we note that what has been established so far also implies that the total mass of the absolutely continuous part converges to the total mass of the absolutely continuous part of the limit, since it is easy to see that the limiting distribution given in the theorem has total mass 1. The density of the absolutely continuous part of (16) takes the form

Observe that for given $x \in \mathbb{R}$, the indicator function in the above display converges to $\mathbf{1}(|x + \nu_i| > s e_i)$ for Lebesgue almost all $s$. [If $e_i = 0$, this is necessarily true only for $x \in \mathbb{R}$ with $x \neq -\nu_i$.] Since $n-k = m$ eventually, we get from the dominated convergence theorem that the above display converges to $\phi(x)\int_0^\infty \mathbf{1}(|x + \nu_i| > s e_i)\,\rho_m(s)\,ds$ for every $x \in \mathbb{R}$ (for every $x \in \mathbb{R}$ with $x \neq -\nu_i$ in case $e_i = 0$), which is the density of the absolutely continuous part in (25). Since the total mass of the absolutely continuous part is preserved in the limit as shown above, the proof is completed by Scheffé's Lemma.
(b) Follows immediately from Proposition 26 and Theorem 29.
Theorem 35 (Soft-thresholding with conservative tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) If $n-k$ is eventually constant equal to $m$, say, then $H^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

the corresponding measure being

[The atomic part in the above expression is absent in case $|\nu_i| = \infty$. Furthermore, the distribution reduces to a standard normal distribution if $e_i = 0$.]
(b) If $n-k \to \infty$ holds, then $H^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution given in Proposition 27(a).
Proof. (a) The atomic part of $dH^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ as given in (18) converges weakly to the atomic part of (26) in view of Theorem 11(a1) and the fact that $-\alpha_{i,n}\theta_{i,n}/\sigma_n = -n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to -\nu_i$ by assumption; also note that the atomic part converges to the zero measure in case $|\nu_i| = \infty$ or $e_i = 0$, as then the total mass of the atomic part converges to zero. We turn to the absolutely continuous part next. For later use we note that what has been established so far also implies that the total mass of the absolutely continuous part converges to the total mass of the absolutely continuous part of the limit, since it is easy to see that the limiting distribution given in the theorem has total mass 1. The density of the absolutely continuous part of (18) takes the form

Observe that for given $x \in \mathbb{R}$, the functions $\phi(x \pm s n^{1/2}\eta_{i,n})$ converge to $\phi(x \pm s e_i)$, respectively, for all $s$. Since $n-k = m$ eventually, we then get from the dominated convergence theorem that the above display converges to

for every $x \neq -\nu_i$; the last display is precisely the density of the absolutely continuous part in (26). Since the total mass of the absolutely continuous part is preserved in the limit as shown above, the proof is completed by Scheffé's Lemma.
(b) Follows immediately from Proposition 27 and Theorem 30.
Theorem 36 (Adaptive soft-thresholding with conservative tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to e_i$ where $0 \le e_i < \infty$. Set the scaling factor $\alpha_{i,n} = n^{1/2}/\xi_{i,n}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) Suppose $n-k$ is eventually constant equal to $m$, say. Then $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf

in case $|\nu_i| < \infty$, the corresponding measure being given by

In case $|\nu_i| = \infty$, the cdf $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to $\Phi$, i.e., a standard normal distribution. [If $e_i = 0$, the limit always reduces to a standard normal distribution.]
(b) If $n-k \to \infty$, then $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution given in Proposition 28(a).

Proof. (a) Observe that

where $z^{(1)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n})$ and $z^{(2)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n})$ reduce to
$$0.5\left(x - n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n})\right) \pm \sqrt{\left(0.5\left(x + n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n})\right)\right)^2 + s^2 n\,\eta_{i,n}^2}.$$
Clearly, $\Phi(z^{(1)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n}))$ as well as $\Phi(z^{(2)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n}))$

respectively, if $|\nu_i| < \infty$, and the dominated convergence theorem shows that the weights of the indicator functions in (29) converge to the corresponding weights in (28). Since $n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n})$ converges to $\nu_i$ by assumption, it follows that for every $x \neq -\nu_i$ we have convergence of $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}$ to the cdf given in (28). This proves part (a) in case $|\nu_i| < \infty$. In case $\nu_i = \infty$, we have that $z^{(2)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n})$ converges to $x$ by an application of Proposition 15 in Pötscher and Schneider (2009). Consequently, the limit of $\Phi(z^{(2)}_{n,\theta^{(n)},\sigma_n}(x, s\eta_{i,n}))$ is now $\Phi(x)$. Again applying the dominated convergence theorem and observing that for each $x \in \mathbb{R}$ we have that $\mathbf{1}\left(x + n^{1/2}\theta_{i,n}/(\sigma_n\xi_{i,n}) < 0\right)$ is eventually zero, shows that $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}(x)$ converges to $\Phi(x)$. The case $\nu_i = -\infty$ is proved analogously.
(b) Follows immediately from Proposition 28 and Theorem 31.
It transpires that in case of conservative tuning and $n-k \to \infty$ we obtain exactly the same limiting distributions as in the known-variance case and hence the relevant discussion given at the end of Section 6.1 applies also here. [That one obtains the same limits does not come as a surprise given the results in Section 6.2 and the observation made in Remark 32.] In the case where $n-k$ is eventually constant, the limits are obtained from the limits in the known-variance case (with $\sigma$ replaced by $s\sigma$) by averaging w.r.t. the distribution of $\hat\sigma/\sigma$. Again the limiting distributions have essentially the same structure as the corresponding finite-sample distributions. The fixed-parameter limiting distributions (corresponding to setting $\theta_{i,n} \equiv \theta_i$ and $\sigma_n \equiv \sigma$ in the above theorems) again misrepresent the finite-sample properties of the thresholding estimators whenever $\theta_i \neq 0$ but small, as the fixed-parameter limiting distribution is then, in case of hard-thresholding and adaptive soft-thresholding, always $N(0,1)$, regardless of the size of $\theta_i$. For soft-thresholding we also observe a strong discrepancy between the finite-sample distribution and the fixed-parameter limit, especially for $\theta_i \neq 0$ but small; this limit is given by the distribution with pdf $\int_0^\infty \phi(x + s\operatorname{sign}(\theta_i)e_i)\,\rho_m(s)\,ds$ regardless of the size of $\theta_i$. As a consequence, we again observe non-uniformity in the convergence of finite-sample distributions to their limit in a fixed-parameter framework, also in the case where the number of degrees of freedom is (eventually) constant.

Consistent Tuning
We next derive the limiting distribution of $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$ in a moving-parameter framework under consistent tuning.
Theorem 37 (Hard-thresholding with consistent tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) If $n-k$ is eventually constant equal to $m$, say, then $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to

[The above display reduces to $\delta_0$ for $|\nu_i| = \infty$.]
(b) If $n-k \to \infty$ holds, then
1. $|\nu_i| < 1$ implies that $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-\nu_i}$.
2. $|\nu_i| > 1$ implies that $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_0$.
3. $|\nu_i| = 1$ and $n^{1/2}\eta_{i,n}/(n-k)^{1/2} \to 0$ imply that $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to

provided $r_{i,n} = n^{1/2}\left(\eta_{i,n} - \nu_i\theta_{i,n}/(\sigma_n\xi_{i,n})\right) \to r_i$ for some $r_i \in \bar{\mathbb{R}}$.
4. $|\nu_i| = 1$ and $n^{1/2}\eta_{i,n}/(n-k)$

provided $r_{i,n} \to r_i$ for some $r_i \in \bar{\mathbb{R}}$. [Note that the above display reduces to $\delta_{-\nu_i}$ if $r_i = \infty$, and to $\delta_0$ if $r_i = -\infty$.]
5. $|\nu_i| = 1$ and $n^{1/2}\eta_{i,n}/(n-k)^{1/2} \to \infty$ imply that $H^{i\maltese}_{H,n,\theta^{(n)},\sigma_n}$ converges weakly to

Proof. Observe that

if $|\nu_i| < \infty$. Part (b) of Theorem 11 completes the proof of both parts of the theorem in case $|\nu_i| < \infty$. If $|\nu_i| = \infty$ the same theorem shows that the weak limit is now $\delta_0$.
Theorem 38 (Soft-thresholding with consistent tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) If $n-k$ is eventually constant equal to $m$, say, then $H^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution given by

(30)

where we recall the convention that $\rho_m(x) = 0$ for $x < 0$. [In case $|\nu_i| = \infty$, the atomic part in (30) is absent and (30) reduces to $\rho_m(-\operatorname{sign}(\nu_i)x)\,dx$.]
(b) If $n-k \to \infty$ holds, then $H^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ converges weakly to $\delta_{-\operatorname{sign}(\nu_i)\min(1,|\nu_i|)}$.
Proof. (a) The atomic part of $dH^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ as given in (18) converges weakly to the atomic part given in (30) by Theorem 11(b1). The density of the absolutely continuous part of $dH^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ can be written as

$\cdots\ \mathbf{1}\left(x + \theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) < 0\right)$

recalling the convention that $\rho_m(s) = 0$ for $s < 0$. Note that with this convention $\rho_m$ is then a bounded continuous function on the real line. Since $n^{1/2}\eta_{i,n}\,\phi\left(n^{1/2}\eta_{i,n}(x + \cdot)\right)$ and $n^{1/2}\eta_{i,n}\,\phi\left(n^{1/2}\eta_{i,n}(x - \cdot)\right)$ clearly converge weakly to $\delta_{-x}$ and $\delta_{x}$, respectively, the density of the absolutely continuous part of $dH^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ is seen to converge to $\rho_m(-x)\mathbf{1}(x + \nu_i > 0) + \rho_m(x)\mathbf{1}(x + \nu_i < 0)$ for every $x \neq -\nu_i$. An application of Scheffé's Lemma then completes the proof, noting that the total mass of the absolutely continuous part of $dH^{i\maltese}_{S,n,\theta^{(n)},\sigma_n}$ converges to the total mass of the absolutely continuous part of (30), as the same is true for the atomic part in view of Theorem 11(b1) (and since the distributions involved all have total mass 1).
(b) Rewrite $\sigma_n^{-1}\alpha_{i,n}(\tilde\theta_{S,i} - \theta_{i,n})$ as
$$-\frac{\theta_{i,n}}{\sigma_n\xi_{i,n}\eta_{i,n}}\,\mathbf{1}(\tilde\theta_{S,i} = 0) + \left[W_n - (\hat\sigma/\sigma_n)\operatorname{sign}\left(W_n + \frac{\theta_{i,n}}{\sigma_n\xi_{i,n}\eta_{i,n}}\right)\right]\mathbf{1}(\tilde\theta_{S,i} \neq 0),$$
where $W_n$ is a sequence of $N(0, n^{-1}\eta_{i,n}^{-2})$-distributed random variables. Observe that $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n})$ converges to $\nu_i$ and that $W_n$ converges to zero in $P_{n,\theta^{(n)},\sigma_n}$-probability. Now, if $|\nu_i| < 1$, then $P_{n,\theta^{(n)},\sigma_n}(\tilde\theta_{S,i} = 0) \to 1$ by Theorem 11(b2), and hence $\sigma_n^{-1}\alpha_{i,n}(\tilde\theta_{S,i} - \theta_{i,n})$ converges to $-\nu_i$ in $P_{n,\theta^{(n)},\sigma_n}$-probability. This proves the result in case $|\nu_i| < 1$. In case $|\nu_i| > 1$ we have that $P_{n,\theta^{(n)},\sigma_n}(\tilde\theta_{S,i} \neq 0) \to 1$ and $P_{n,\theta^{(n)},\sigma_n}\left(\operatorname{sign}(W_n + \theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n})) = \operatorname{sign}(\nu_i)\right) \to 1$. Clearly, also $\hat\sigma/\sigma_n$ converges to 1 in $P_{n,\theta^{(n)},\sigma_n}$-probability since $n-k \to \infty$. Consequently, $\sigma_n^{-1}\alpha_{i,n}(\tilde\theta_{S,i} - \theta_{i,n})$ converges to $-\operatorname{sign}(\nu_i)$ in $P_{n,\theta^{(n)},\sigma_n}$-probability, which proves the case $|\nu_i| > 1$. Finally, if $|\nu_i| = 1$, then (31) continues to hold and we can write

where $o_p(1)$ refers to a term that converges to zero in $P_{n,\theta^{(n)},\sigma_n}$-probability. This then completes the proof of part (b).
Theorem 39 (Adaptive soft-thresholding with consistent tuning) Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to \infty$. Set the scaling factor $\alpha_{i,n} = (\xi_{i,n}\eta_{i,n})^{-1}$. Suppose that the true parameters $\theta^{(n)} = (\theta_{1,n},\ldots,\theta_{k_n,n}) \in \mathbb{R}^{k_n}$ and $\sigma_n \in (0,\infty)$ satisfy $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n}) \to \nu_i \in \bar{\mathbb{R}}$.
(a) Suppose $n-k$ is eventually constant equal to $m$, say. Then $H^{i\maltese}_{AS,n,\theta^{(n)},\sigma_n}$ converges weakly to the distribution with cdf
$$\int_{\sqrt{|x\nu_i|}}^{\infty}\rho_m(s)\,ds\ \mathbf{1}(-\nu_i \le x < 0) + \mathbf{1}(x \ge 0) = \Pr\left(\chi^2_m > m|x\nu_i|\right)\mathbf{1}(-\nu_i \le x < 0) + \mathbf{1}(x \ge 0)$$
in case $0 \le \nu_i < \infty$, and to the distribution with cdf

where $W_n$ is a sequence of $N(0, n^{-1}\eta_{i,n}^{-2})$-distributed random variables. Note that $\theta_{i,n}/(\sigma_n\xi_{i,n}\eta_{i,n})$ converges to $\nu_i$ by assumption. Now, if $|\nu_i| < 1$, then $P_{n,\theta^{(n)},\sigma_n}(\tilde\theta_{AS,i} = 0) \to 1$ by Theorem 11(b2), hence $\sigma_n^{-1}\alpha_{i,n}(\tilde\theta_{AS,i} - \theta_{i,n})$ converges to $-\nu_i$ in $P_{n,\theta^{(n)},\sigma_n}$-probability, establishing the result in this case. Furthermore, for $1 \le |\nu_i| \le \infty$ rewrite the above display as

with the convention that $1/\nu_i = 0$ in case $|\nu_i| = \infty$. If $|\nu_i| > 1$ (including the case $|\nu_i| = \infty$), then $P_{n,\theta^{(n)},\sigma_n}(\tilde\theta_{AS,i} \neq 0) \to 1$ by Theorem 11(b2), and hence the last display shows that $\sigma_n^{-1}\alpha_{i,n}(\tilde\theta_{AS,i} - \theta_{i,n})$ converges to $-1/\nu_i$ in $P_{n,\theta^{(n)},\sigma_n}$-probability, establishing the result in this case. Finally, if $|\nu_i| = 1$ holds, then the last line in the above display reduces to $-\nu_i + o_p(1)$, completing the proof of part (b).
We know from the results in Section 6.2 that we obtain the same limiting distributions for $\tilde\theta_{H,i}$, $\tilde\theta_{S,i}$, and $\tilde\theta_{AS,i}$ as for $\hat\theta_{H,i}$, $\hat\theta_{S,i}$, and $\hat\theta_{AS,i}$, respectively, provided $n-k$ diverges to infinity sufficiently fast in the sense that $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$. The theorems in this section show that for the soft-thresholding as well as for the adaptive soft-thresholding estimator we actually get the same limiting distribution as in the known-variance case whenever $n-k$ diverges, even if $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ is violated. However, for the hard-thresholding estimator the picture is different: in case $n-k$ diverges but $n^{1/2}\eta_{i,n}(n-k)^{-1/2} \to 0$ is violated, limit distributions different from the known-variance case arise (these limiting distributions still being convex combinations of two pointmasses, but with weights different from the known-variance case). It seems that this is a reflection of the fact that the hard-thresholding estimator is a discontinuous function of the data, whereas the other two estimators considered depend continuously on the data. The fixed-parameter limiting distributions for all three estimators are again the same as in the known-variance case.
In the case where the degrees of freedom $n-k$ are eventually constant, the limiting distribution of the hard-thresholding estimator is again a convex combination of two pointmasses, with weights that are in general different from the known-variance case. However, for the soft-thresholding as well as for the adaptive soft-thresholding estimator the limiting distributions can also contain an absolutely continuous component. This component seems to stem from an interaction of the more pronounced "bias-component" (as compared to hard-thresholding) with the nonvanishing randomness in the estimated variance. The fixed-parameter limiting distributions for hard-thresholding and adaptive soft-thresholding are again given by $\delta_0$ for all values of $\theta_i$ as in the known-variance case, whereas for soft-thresholding the fixed-parameter limiting distribution is $\delta_0$ only for $\theta_i = 0$ and otherwise has a pdf given by $\rho_m(-\operatorname{sign}(\theta_i)x)$ (as compared to a limit of $\delta_{-\operatorname{sign}(\theta_i)}$ in the known-variance case).

Consistent Tuning: Some Comments on Fixed-Parameter Large-Sample Distributions and the "Oracle-Property"

6.4.1 Hard-thresholding and Adaptive Soft-thresholding
As already mentioned at the end of Section 6.1 as well as Section 6.3.2, under consistent tuning the fixed-parameter limiting distributions of the hard-thresholding and of the adaptive soft-thresholding estimator, in the known-variance as well as in the unknown-variance case, always degenerate to pointmass at zero. Recall that in these results the estimators (after centering with $\theta_i$) are scaled by $\sigma^{-1}(\xi_{i,n}\eta_{i,n})^{-1}$, which corresponds to the minimax convergence rate.
We next show that if the estimators are scaled by $\sigma^{-1}n^{1/2}\xi_{i,n}^{-1}$ instead, a limit distribution under fixed-parameter asymptotics arises that is not degenerate in general (under an additional condition on the tuning parameter in case of adaptive soft-thresholding). In fact, we show that the hard-thresholding as well as the adaptive soft-thresholding estimators then satisfy what has been called the "oracle-property". However, it should be kept in mind that, with this faster scaling sequence $\sigma^{-1}n^{1/2}\xi_{i,n}^{-1}$, the centered estimators are no longer stochastically bounded in a moving-parameter framework (for certain sequences of parameters), cf. Theorem 15. This shows the fragility of the "oracle-property", which is a fixed-parameter concept, and calls into question the statistical significance of this notion. For a more extensive discussion of the "oracle-property" and its consequences see Leeb and Pötscher (2008), Pötscher and Leeb (2009), and Pötscher and Schneider (2009).
Proposition 40 Suppose that for given $i \ge 1$ satisfying $i \le k = k(n)$ for large enough $n$ we have $\xi_{i,n}\eta_{i,n} \to 0$ and $n^{1/2}\eta_{i,n} \to \infty$.

Proof. (a) By a subsequence argument we may assume that $n-k$ converges in $\mathbb{N} \cup \{\infty\}$. Applying Theorem 11(b) we obtain that $P_{n,\theta,\sigma}(\tilde\theta_{H,i} = 0)$ converges to 1 in case $\theta_i = 0$, and to 0 in case $\theta_i \neq 0$. Observe that

The result then follows in view of the fact that $Z_n$ is standard normally distributed. The proof for $\hat\theta_{H,i}$ is similar, using Proposition 7(b) instead of Theorem 11(b) (it is in fact simpler as the subsequence argument is not needed).
(b) Again we may assume that $n-k$ converges in $\mathbb{N} \cup \{\infty\}$. By the same reference as in the proof of (a) we obtain that $P_{n,\theta,\sigma}(\tilde\theta_{AS,i} = 0)$ converges to 1 in case $\theta_i = 0$, and to 0 in case