Abstract
We investigate the asymptotic distributions of coordinates of regression M-estimates in the moderate p / n regime, where the number of covariates p grows proportionally with the sample size n. Under appropriate regularity conditions, we establish the coordinate-wise asymptotic normality of regression M-estimates assuming a fixed-design matrix. Our proof is based on the second-order Poincaré inequality and leave-one-out analysis. We present several relevant examples to show that our regularity conditions are satisfied by a broad class of design matrices. We also give a counterexample, namely an ANOVA-type design, to emphasize that the technical assumptions are not just artifacts of the proof. Finally, numerical experiments confirm and complement our theoretical results.
References
Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, New York (1962)
Bai, Z., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices, vol. 20. Springer, Berlin (2010)
Bai, Z., Yin, Y.: Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3), 1275–1294 (1993)
Baranchik, A.: Inadmissibility of maximum likelihood estimators in some multiple regression problems with three or more independent variables. Ann. Stat. 1(2), 312–321 (1973)
Bean, D., Bickel, P.J., El Karoui, N., Lim, C., Yu, B.: Penalized robust regression in high-dimension. Technical Report 813, Department of Statistics, UC Berkeley (2012)
Bean, D., Bickel, P.J., El Karoui, N., Yu, B.: Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. 110(36), 14563–14568 (2013)
Bickel, P.J., Doksum, K.A.: Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, vol. 117. CRC Press, Boca Raton (2015)
Bickel, P.J., Freedman, D.A.: Some asymptotic theory for the bootstrap. Ann. Stat. 9(6), 1196–1217 (1981)
Bickel, P.J., Freedman, D.A.: Bootstrapping regression models with many parameters. In: A Festschrift for Erich L. Lehmann, pp. 28–48 (1983)
Chatterjee, S.: Fluctuations of eigenvalues and second order Poincaré inequalities. Probab. Theory Relat. Fields 143(1–2), 1–40 (2009)
Chernoff, H.: A note on an inequality involving the normal distribution. Ann. Probab. 9(3), 533–535 (1981)
Cizek, P., Härdle, W.K., Weron, R.: Statistical Tools for Finance and Insurance. Springer, Berlin (2005)
Cochran, W.G.: Sampling Techniques. Wiley, Hoboken (1977)
David, H.A., Nagaraja, H.N.: Order Statistics. Wiley, Hoboken (1981)
Donoho, D., Montanari, A.: High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 166, 935–969 (2016)
Durrett, R.: Probability: Theory and Examples. Cambridge University Press, Cambridge (2010)
Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans, vol. 38. SIAM, Philadelphia (1982)
El Karoui, N.: Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19(6), 2362–2405 (2009)
El Karoui, N.: High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Stat. 38(6), 3487–3566 (2010)
El Karoui, N.: Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445 (2013)
El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields, pp. 1–81 (2015)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On Robust Regression with High-Dimensional Predictors. Technical Report 811, Department of Statistics, UC Berkeley (2011)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. 110(36), 14557–14562 (2013)
El Karoui, N., Purdom, E.: Can We Trust the Bootstrap in High-Dimension? Technical Report 824. Department of Statistics, UC Berkeley (2015)
Esseen, C.G.: Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law. Acta Math. 77(1), 1–125 (1945)
Geman, S.: A limit theorem for the norm of random matrices. Ann. Probab. 8(2), 252–261 (1980)
Hanson, D.L., Wright, F.T.: A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Stat. 42(3), 1079–1083 (1971)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Huber, P.J.: The 1972 Wald lecture. Robust statistics: a review. Ann. Math. Stat. 43(4), 1041–1067 (1972)
Huber, P.J.: Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1(5), 799–821 (1973)
Huber, P.J.: Robust Statistics. Wiley, New York (1981)
Johnstone, I.M.: On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 29(2), 295–327 (2001)
Jurečkovà, J., Klebanov, L.B.: Inadmissibility of robust estimators with respect to \(L_1\) norm. In: Dodge, Y. (ed.) \(L_1\)-Statistical Procedures and Related Topics. Lecture Notes–Monograph Series, vol. 31, pp. 71–78. Institute of Mathematical Statistics, Hayward (1997)
Latała, R.: Some estimates of norms of random matrices. Proc. Am. Math. Soc. 133(5), 1273–1282 (2005)
Ledoux, M.: The Concentration of Measure Phenomenon, vol. 89. American Mathematical Society, Providence (2001)
Litvak, A.E., Pajor, A., Rudelson, M., Tomczak-Jaegermann, N.: Smallest singular value of random matrices and geometry of random polytopes. Adv. Math. 195(2), 491–523 (2005)
Mallows, C.: A note on asymptotic joint normality. Ann. Math. Stat. 43(2), 508–515 (1972)
Mammen, E.: Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 17(1), 382–400 (1989)
Marčenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR Sbornik 1(4), 457 (1967)
Muirhead, R.J.: Aspects of Multivariate Statistical Theory, vol. 197. Wiley, Hoboken (1982)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^{2}/n\) is large. I. Consistency. Ann. Stat. 12(4), 1298–1309 (1984)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^{2} / n\) is large. II. Normal approximation. Ann. Stat. 13(4), 1403–1417 (1985)
Portnoy, S.: On the central limit theorem in \(\mathbb{R}^{p}\) when \(p\rightarrow \infty \). Probab. Theory Relat. Fields 73(4), 571–583 (1986)
Portnoy, S.: A central limit theorem applicable to robust regression estimators. J. Multivar. Anal. 22(1), 24–50 (1987)
Posekany, A., Felsenstein, K., Sykacek, P.: Biological assessment of robust noise models in microarray data analysis. Bioinformatics 27(6), 807–814 (2011)
Relles, D.A.: Robust Regression by Modified Least-Squares. Technical reports, DTIC Document (1967)
Rosenthal, H.P.: On the subspaces of \(l^{p} (p > 2)\) spanned by sequences of independent random variables. Isr. J. Math. 8(3), 273–303 (1970)
Rudelson, M., Vershynin, R.: Smallest singular value of a random rectangular matrix. Commun. Pure Appl. Math. 62(12), 1707–1739 (2009)
Rudelson, M., Vershynin, R.: Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990 (2010)
Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-gaussian concentration. Electron. Commun. Probab. 18(82), 1–9 (2013)
Scheffé, H.: The Analysis of Variance, vol. 72. Wiley, Hoboken (1999)
Silverstein, J.W.: The smallest eigenvalue of a large dimensional Wishart matrix. Ann. Probab. 13(4), 1364–1368 (1985)
Stone, M.: Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B (Methodolog) 36(2), 111–147 (1974)
Tyler, D.E.: A distribution-free M-estimator of multivariate scatter. Ann. Stat. 15(1), 234–251 (1987)
Van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010)
Wachter, K.W.: Probability plotting points for principal components. In: Ninth Interface Symposium Computer Science and Statistics, pp. 299–308. Prindle, Weber and Schmidt, Boston (1976)
Wachter, K.W.: The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Probab. 6(1), 1–18 (1978)
Wasserman, L., Roeder, K.: High dimensional variable selection. Ann. Stat. 37(5A), 2178 (2009)
Yohai, V.J.: Robust M-Estimates for the General Linear Model. Universidad Nacional de la Plata. Departamento de Matematica (1972)
Yohai, V.J., Maronna, R.A.: Asymptotic behavior of M-estimators for the linear model. Ann. Stat. 7(2), 258–268 (1979)
Additional information
Peter J. Bickel and Lihua Lei gratefully acknowledge support from NSF DMS-1160319 and NSF DMS-1713083. Noureddine El Karoui gratefully acknowledges support from NSF grant DMS-1510172. He would also like to thank Criteo for providing a great research environment.
Appendices
Proof sketch of Lemma 4.4
In this Appendix, we provide a roadmap for proving Lemma 4.4 by considering a special case where X is one realization of a random matrix Z with i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries. Random matrix theory [3, 26, 53] implies that \(\lambda _{+}= (1 + \sqrt{\kappa })^{2} + o_{p}(1) = O_{p}(1)\) and \(\lambda _{-}= (1 - \sqrt{\kappa })^{2} + o_{p}(1) = \varOmega _{p}(1)\). Thus assumption A3 is satisfied with high probability, and hence Lemma 4.3 (p. 17) holds with high probability. It remains to prove the following lemma to obtain Theorem 3.1.
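As a quick numerical illustration of these random matrix facts (our sketch, not part of the paper; we take Gaussian entries with unit variance for convenience), one can check that the extreme eigenvalues of \(Z^{T}Z/n\) land near the Bai–Yin limits \((1\pm \sqrt{\kappa })^{2}\):

```python
import numpy as np

# Minimal check of the Bai-Yin limits for lambda_+ and lambda_- of Z^T Z / n.
# Gaussian entries are one convenient sub-gaussian choice; kappa = p / n.
rng = np.random.default_rng(0)
n, p = 4000, 800
kappa = p / n

Z = rng.standard_normal((n, p))
eigvals = np.linalg.eigvalsh(Z.T @ Z / n)

print(f"lambda_+ = {eigvals.max():.3f}, predicted {(1 + np.sqrt(kappa))**2:.3f}")
print(f"lambda_- = {eigvals.min():.3f}, predicted {(1 - np.sqrt(kappa))**2:.3f}")
```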
Lemma A-1
Let Z be a random matrix with i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries and X be one realization of Z. Then under assumptions A1 and A2,
where \(M_{j}\) is defined in (11) on p. 17, and the randomness in \(o_{p}(\cdot )\) and \(O_{p}(\cdot )\) comes from Z.
Note that we prove in Proposition 3.1 that assumptions A4 and A5 are satisfied with high probability in this case. However, we will not use them directly but prove Lemma A-1 from scratch instead, in order to clarify why assumptions in the form of A4 and A5 are needed in the proof.
1.1 Upper bound of \(M_{j}\)
First by Proposition E.3,
In the rest of the proof, the symbols \(\mathbb {E}\) and \(\hbox {Var}\) denote the expectation and the variance conditional on Z. Let \(\tilde{Z} = D^{\frac{1}{2}}Z\); then \(M_{j} = \mathbb {E}\Vert e_{j}^{T}(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T}\Vert _{\infty }\). Let \(\tilde{H}_{j} = I - \tilde{Z}_{[j]}(\tilde{Z}_{[j]}^{T}\tilde{Z}_{[j]})^{-1}\tilde{Z}_{[j]}^{T}\); then we apply the block matrix inversion formula, which we state as Proposition E.1 in “Appendix E”.
This implies that
Since \(Z^{T}DZ / n\succeq K_{0}\lambda _{-}I\), we have
and we obtain a bound for \(M_{j}\) as
Similarly,
The vector in the numerator is a linear contrast of \(Z_{j}\), and \(Z_{j}\) has i.i.d. mean-zero sub-gaussian entries. For any fixed matrix \(A\in \mathbb {R}^{n\times n}\), denote by \(A_{k}\) its k-th column; then \(A_{k}^{T}Z_{j}\) is \(\sigma ^{2}\Vert A_{k}\Vert _{2}^{2}\)-sub-gaussian (see Section 5.2.3 of [57] for a detailed discussion) and hence, by the definition of sub-gaussianity,
Therefore, by a simple union bound, we conclude that
Let \(t = 2\sqrt{\log n}\),
This entails that
with high probability. In \(M_{j}\), the coefficient matrix \((I - H_{j})D^{\frac{1}{2}}\) depends on \(Z_{j}\) through D and hence we cannot use (A-3) directly. However, the dependence can be removed by replacing D by \(D_{[j]}\) since \(r_{i, [j]}\) does not depend on \(Z_{j}\).
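For concreteness, the union-bound step can be spelled out as follows (a standard sketch we include here; the exact constants play no role): for each k, the sub-gaussian tail bound gives \(\mathbb {P}(|A_{k}^{T}Z_{j}| > \sigma \Vert A_{k}\Vert _{2}t)\le 2e^{-t^{2}/2}\), so that
$$\begin{aligned} \mathbb {P}\left( \max _{k\le n}|A_{k}^{T}Z_{j}| > \sigma t\max _{k\le n}\Vert A_{k}\Vert _{2}\right) \le 2ne^{-t^{2}/2} = \frac{2}{n} \quad \text {with } t = 2\sqrt{\log n}, \end{aligned}$$
i.e. \(\max _{k}|A_{k}^{T}Z_{j}| = O(\sqrt{\log n})\cdot \sigma \max _{k}\Vert A_{k}\Vert _{2}\) with high probability.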
Since Z has i.i.d. sub-gaussian entries, no column is highly influential. In other words, the estimator does not change drastically after removing the j-th column, which suggests \(R_{i}\approx r_{i, [j]}\). It is proved by [20] that
It can be rigorously proved that
where \(H_{j} = I - D_{[j]}^{\frac{1}{2}}Z_{[j]}(Z_{[j]}^{T}D_{[j]}Z_{[j]})^{-1}Z_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\); see “Appendix A-1” for details. Since \(D_{[j]}(I - H_{j})\) is independent of \(Z_{j}\) and
it follows from (A-2) and (A-3) that
In summary,
1.2 Lower bound of \(\hbox {Var}(\hat{\beta }_{j})\)
1.2.1 Approximating \(\hbox {Var}(\hat{\beta }_{j})\) by \(\hbox {Var}(b_{j})\)
It is shown by [20] that
where
It has been shown by [20] that
Thus, \(\hbox {Var}(\hat{\beta }_{j})\approx \hbox {Var}(b_{j})\) and a more refined calculation in “Appendix A-2.1” shows that
It is left to show that
1.2.2 Bounding \(\hbox {Var}(b_{j})\) via \(\hbox {Var}(N_{j})\)
By definition of \(b_{j}\),
As will be shown in “Appendix B-6.4”,
As a result, \(\xi _{j}\approx \mathbb {E}\xi _{j}\) and
As in the previous paper [20], we rewrite \(\xi _{j}\) as
The middle matrix is idempotent and hence positive semi-definite. Thus,
Then we obtain that
and it is left to show that
1.2.3 Bounding \(\hbox {Var}(N_{j})\) via \(\hbox {tr}(Q_{j})\)
Recalling the definition of \(N_{j}\) in (A-5) and that of \(Q_{j}\) (see Sect. 3.1 on p. 8), we have
Notice that \(Z_{j}\) is independent of \(r_{i, [j]}\) and hence the conditional distribution of \(Z_{j}\) given \(Q_{j}\) is the same as the marginal distribution of \(Z_{j}\). Since \(Z_{j}\) has i.i.d. sub-gaussian entries, the Hanson–Wright inequality ([27, 51]), which we state as Proposition E.2, implies that the quadratic form \(Z_{j}^{T}Q_{j}Z_{j}\) is concentrated around its mean, i.e.
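For reference, the Hanson–Wright inequality in the form of [51] states that if \(Z_{j}\) has independent mean-zero K-sub-gaussian entries and \(Q_{j}\) is a fixed symmetric matrix, then for every \(t\ge 0\),
$$\begin{aligned} \mathbb {P}\left( \left| Z_{j}^{T}Q_{j}Z_{j} - \mathbb {E}Z_{j}^{T}Q_{j}Z_{j}\right| > t\right) \le 2\exp \left( -c\min \left( \frac{t^{2}}{K^{4}\Vert Q_{j}\Vert _{F}^{2}}, \frac{t}{K^{2}\Vert Q_{j}\Vert _{\mathrm {op}}}\right) \right) \end{aligned}$$
for a universal constant \(c > 0\); here it is applied conditionally on \(Q_{j}\), which is legitimate by the independence noted above.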
As a consequence, it is left to show that
1.2.4 Lower bound of \(\hbox {tr}(Q_{j})\)
By definition of \(Q_{j}\),
To lower bound the variance of \(\psi (r_{i, [j]})\), recall that for any random variable W,
where \(W'\) is an independent copy of W. Suppose \(g: \mathbb {R}\rightarrow \mathbb {R}\) is a function such that \(|g'(x)|\ge c\) for all x, then (A-9) implies that
In other words, (A-10) entails that \(\hbox {Var}(W)\) is a lower bound for \(\hbox {Var}(g(W))\) provided that the derivative of g is bounded away from 0. As an application, we see that
and hence
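In detail, (A-9) is the two-point identity \(\hbox {Var}(W) = \frac{1}{2}\mathbb {E}(W - W')^{2}\), and the mechanism behind (A-10) is the elementary chain (our one-line reconstruction):
$$\begin{aligned} \hbox {Var}(g(W)) = \frac{1}{2}\mathbb {E}(g(W) - g(W'))^{2}\ge \frac{c^{2}}{2}\mathbb {E}(W - W')^{2} = c^{2}\,\hbox {Var}(W), \end{aligned}$$
where the inequality uses \(|g(w) - g(w')| = |g'(\nu )|\cdot |w - w'|\ge c|w - w'|\) for some \(\nu \in ]w, w'[\).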
By the variance decomposition formula,
where \(\varepsilon _{(i)}\) includes all but i-th entry of \(\varepsilon \). Given \(\varepsilon _{(i)}\), \(r_{i, [j]}\) is a function of \(\varepsilon _{i}\). Using (A-10), we have
This implies that
Summing \(\hbox {Var}(r_{i, [j]})\) over \(i = 1, \ldots , n\), we obtain that
It will be shown in “Appendix B-6.3” that under assumptions A1–A3,
This proves (A-8) and as a result,
Proof of Theorem 3.1
1.1 Notation
To be self-contained, we summarize our notation in this subsection. The model we consider here is \(y = X\beta ^{*}+ \varepsilon \),
where \(X\in \mathbb {R}^{n\times p}\) is the design matrix and \(\varepsilon \) is a random vector with independent entries. Since the target quantity \(\frac{\hat{\beta }_{j} - \mathbb {E}\hat{\beta }_{j}}{\sqrt{\hbox {Var}(\hat{\beta }_{j})}}\) is shift invariant, we may assume \(\beta ^{*}= 0\) without loss of generality provided that X has full column rank; see Sect. 3.1 for details.
Let \(x_{i}^{T}\in \mathbb {R}^{1\times p}\) denote the i-th row of X and \(X_{j}\in \mathbb {R}^{n\times 1}\) the j-th column of X. Throughout the paper we denote by \(X_{ij}\in \mathbb {R}\) the (i, j)-th entry of X, by \(X_{(i)}\in \mathbb {R}^{(n-1)\times p}\) the design matrix X after removing the i-th row, by \(X_{[j]}\in \mathbb {R}^{n\times (p-1)}\) the design matrix X after removing the j-th column, by \(X_{(i), [j]}\in \mathbb {R}^{(n-1)\times (p-1)}\) the design matrix after removing both the i-th row and the j-th column, and by \(x_{i, [j]}\in \mathbb {R}^{1\times (p-1)}\) the vector \(x_{i}\) after removing the j-th entry. The M-estimator \(\hat{\beta }\) associated with the loss function \(\rho \) is defined as
Similarly we define the leave-j-th-predictor-out version as
Based on this notation, we define the full residual \(R_{k}\) as
the leave-j-th-predictor-out residual as
Four diagonal matrices are defined as
Further we define G and \(G_{[j]}\) as
Let \(J_{n}\) denote the indices of coefficients of interest. We say \(a\in ]a_{1}, a_{2}[\) if and only if \(a\in [\min \{a_{1}, a_{2}\}, \max \{a_{1}, a_{2}\}]\). Regarding the technical assumptions, we need the following quantities. Let \(\lambda _{+}\) (resp. \(\lambda _{-}\))
be the largest (resp. smallest) eigenvalue of the matrix \(\frac{X^{T}X}{n}\). Let \(e_{i}\in \mathbb {R}^{n}\) be the i-th canonical basis vector and
Finally, let
We adopt Landau's notation (\(O(\cdot ), o(\cdot ), O_{p}(\cdot ), o_{p}(\cdot )\)). In addition, we say \(a_{n} = \varOmega (b_{n})\) if \(b_{n} = O(a_{n})\) and, similarly, \(a_{n} = \varOmega _{p}(b_{n})\) if \(b_{n} = O_{p}(a_{n})\). To simplify the logarithmic factors, we use the symbol \(\mathrm {polyLog(n)}\) to denote any factor that can be upper bounded by \((\log n)^{\gamma }\) for some \(\gamma > 0\). Similarly, we use \(\frac{1}{\mathrm {polyLog(n)}}\) to denote any factor that can be lower bounded by \(\frac{1}{(\log n)^{\gamma '}}\) for some \(\gamma ' > 0\).
Finally we restate all the technical assumptions:
- A1: \(\rho (0) = \psi (0) = 0\) and there exist \(K_{0} = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \) and \(K_{1}, K_{2} = O\left( \mathrm {polyLog(n)}\right) \) such that for any \(x\in \mathbb {R}\),
$$\begin{aligned} K_{0} \le \psi '(x)\le K_{1}, \quad \bigg |\frac{d}{dx}(\sqrt{\psi '}(x))\bigg | = \frac{|\psi ''(x)|}{\sqrt{\psi '(x)}}\le K_{2}; \end{aligned}$$
- A2: \(\varepsilon _{i} = u_{i}(W_{i})\) where \((W_{1}, \ldots , W_{n})\sim N(0, I_{n\times n})\) and the \(u_{i}\) are smooth functions with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\) and \(\Vert u''_{i}\Vert _{\infty }\le c_{2}\) for some \(c_{1}, c_{2} = O(\mathrm {polyLog(n)})\); moreover, \(\min _{i}\hbox {Var}(\varepsilon _{i}) = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A3: \(\lambda _{+}= O(\mathrm {polyLog(n)})\) and \(\lambda _{-}= \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A4: \(\min _{j\in J_{n}}\frac{X_{j}^{T}Q_{j}X_{j}}{\hbox {tr}(Q_{j})} = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A5: \(\mathbb {E}\varDelta _{C}^{8} = O\left( \mathrm {polyLog(n)}\right) \).
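For concreteness, one loss satisfying A1 with constants free of n (an illustrative example of ours, not one prescribed by the paper) is \(\rho (x) = x^{2}/2 + \log \cosh x\), for which
$$\begin{aligned} \psi (x) = x + \tanh x, \quad \psi '(x) = 1 + \mathrm {sech}^{2}x\in [1, 2], \quad \frac{|\psi ''(x)|}{\sqrt{\psi '(x)}} = \frac{2\,\mathrm {sech}^{2}x\,|\tanh x|}{\sqrt{1 + \mathrm {sech}^{2}x}}\le 2, \end{aligned}$$
so one may take \(K_{0} = 1\), \(K_{1} = 2\) and \(K_{2} = 2\); moreover \(\rho (0) = \psi (0) = 0\) holds.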
1.2 Deterministic approximation results
In “Appendix A”, we use several approximations under random designs, e.g. \(R_{i}\approx r_{i, [j]}\). To prove them, we follow the strategy of [20], which establishes deterministic results and then applies concentration inequalities to obtain high probability bounds. Since \(\hat{\beta }\) is the solution of \(f(\beta ) = \frac{1}{n}\sum _{i=1}^{n}x_{i}\psi (\varepsilon _{i} - x_{i}^{T}\beta ) = 0\),
we need the following key lemma to bound \(\Vert \beta _{1} - \beta _{2}\Vert _{2}\) by \(\Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\), which can be calculated explicitly.
Lemma B-1
[20, Proposition 2.1] For any \(\beta _{1}\) and \(\beta _{2}\), \(\Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\ge K_{0}\lambda _{-}\Vert \beta _{1} - \beta _{2}\Vert _{2}\).
Proof
By the mean value theorem, there exists \(\nu _{i}\in ]\varepsilon _{i} - x_{i}^{T}\beta _{1}, \varepsilon _{i} - x_{i}^{T}\beta _{2}[\) such that
Then
\(\square \)
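In detail, writing \(D_{\nu } = \hbox {diag}(\psi '(\nu _{i}))\succeq K_{0}I\), the mean value theorem gives \(f(\beta _{1}) - f(\beta _{2}) = -\frac{X^{T}D_{\nu }X}{n}(\beta _{1} - \beta _{2})\), whence by Cauchy–Schwarz (a reconstruction of the standard argument in [20]):
$$\begin{aligned} \Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\Vert \beta _{1} - \beta _{2}\Vert _{2}\ge \left| (\beta _{1} - \beta _{2})^{T}(f(\beta _{1}) - f(\beta _{2}))\right| \ge K_{0}\lambda _{-}\Vert \beta _{1} - \beta _{2}\Vert _{2}^{2}. \end{aligned}$$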
Based on Lemma B-1, we can derive the deterministic results informally stated in “Appendix A”. Such results were shown by [20] for ridge-penalized M-estimates, and here we derive a refined version for unpenalized M-estimates. Throughout this subsection, we only assume assumption A1, which implies the following lemma.
Lemma B-2
Under assumption A1, for any x and y,
To state the result, we define the following quantities.
The following proposition summarizes all deterministic results which we need in the proof.
Proposition B.1
Under assumption A1,
- (i) The norm of the M-estimator is bounded by
$$\begin{aligned} \Vert \hat{\beta }\Vert _{2} \le \frac{1}{K_{0}\lambda _{-}}(U + U_{0}); \end{aligned}$$
- (ii) Define \(b_{j}\) as
$$\begin{aligned} b_{j} = \frac{1}{\sqrt{n}}\frac{N_{j}}{\xi _{j}} \end{aligned}$$
where
$$\begin{aligned} N_{j}= & {} \frac{1}{\sqrt{n}}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]}), \quad \\ \xi _{j}= & {} \frac{1}{n}X_{j}^{T}\left( D_{[j]} - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}\right) X_{j}. \end{aligned}$$
Then
$$\begin{aligned} \max _{j\in J_{n}}|b_{j}|\le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}}\cdot \varDelta _{C}\cdot \sqrt{\mathscr {E}}; \end{aligned}$$
- (iii) The difference between \(\hat{\beta }_{j}\) and \(b_{j}\) is bounded by
$$\begin{aligned}\max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}|\le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E};\end{aligned}$$
- (iv) The difference between the full and the leave-one-predictor-out residuals is bounded by
$$\begin{aligned}\max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}|\le \frac{1}{\sqrt{n}}\left( \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}+ \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}\right) . \end{aligned}$$
Proof
- (i) By Lemma B-1,
$$\begin{aligned}\Vert \hat{\beta }\Vert _{2} \le \frac{1}{K_{0}\lambda _{-}}\Vert f(\hat{\beta }) - f(0)\Vert _{2} = \frac{\Vert f(0)\Vert _{2}}{K_{0}\lambda _{-}},\end{aligned}$$since \(\hat{\beta }\) is a zero of \(f(\beta )\). By definition,
$$\begin{aligned}f(0) = \frac{1}{n}\sum _{i=1}^{n}x_{i}\psi (\varepsilon _{i}) = \frac{1}{n}\sum _{i=1}^{n}x_{i}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i})) + \frac{1}{n}\sum _{i=1}^{n}x_{i}\mathbb {E}\psi (\varepsilon _{i}).\end{aligned}$$This implies that
$$\begin{aligned}\left\| f(0)\right\| _{2} \le U + U_{0}.\end{aligned}$$
- (ii) First we prove that
$$\begin{aligned} \xi _{j}\ge K_{0}\lambda _{-}. \end{aligned}$$(B-25)Since all diagonal entries of \(D_{[j]}\) are lower bounded by \(K_{0}\), we conclude that
$$\begin{aligned} \lambda _{\mathrm {min}}\left( \frac{X^{T}D_{[j]}X}{n}\right) \ge K_{0}\lambda _{-}. \end{aligned}$$Since \(\xi _{j}\) is the Schur complement ([28], Chapter 0.8) of \(\frac{X^{T}D_{[j]}X}{n}\), we have
$$\begin{aligned} \xi _{j}^{-1} = e_{j}^{T}\left( \frac{X^{T}D_{[j]}X}{n}\right) ^{-1} e_{j}\le \frac{1}{K_{0}\lambda _{-}}, \end{aligned}$$which implies (B-25). As for \(N_{j}\), we have
$$\begin{aligned} N_{j} = \frac{X_{j}^{T}h_{j, 0}}{\sqrt{n}} = \frac{\left\| h_{j, 0}\right\| _{2}}{\sqrt{n}}\cdot \frac{X_{j}^{T}h_{j, 0}}{\left\| h_{j, 0}\right\| _{2}}. \end{aligned}$$(B-26)The second term is bounded by \(\varDelta _{C}\) by definition; see (B-21). For the first term, the bound \(\psi '(x)\le K_{1}\) in assumption A1 implies that
$$\begin{aligned} \rho (x) = \rho (x) - \rho (0) = \int _{0}^{x}\psi (y)dy\ge \int _{0}^{x}\frac{\psi '(y)}{K_{1}}\cdot \psi (y)dy = \frac{1}{2K_{1}}\psi ^{2}(x). \end{aligned}$$Here we use the fact that \(\hbox {sign}(\psi (y)) = \hbox {sign}(y)\). Recalling the definition of \(h_{j, 0}\), we obtain that
$$\begin{aligned} \frac{\left\| h_{j, 0}\right\| _{2}}{\sqrt{n}} = \sqrt{\frac{\sum _{i=1}^{n}\psi (r_{i, [j]})^{2}}{n}}\le \sqrt{2K_{1}}\cdot \sqrt{\frac{\sum _{i=1}^{n}\rho (r_{i, [j]})}{n}}. \end{aligned}$$Since \(\hat{\beta }_{[j]}\) is the minimizer of the loss function \(\sum _{i=1}^{n}\rho (\varepsilon _{i} - x_{i, [j]}^{T}\beta _{[j]})\), it holds that
$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\rho (r_{i, [j]})\le \frac{1}{n}\sum _{i=1}^{n}\rho (\varepsilon _{i}) = \mathscr {E}. \end{aligned}$$Putting together the pieces, we conclude that
$$\begin{aligned} |N_{j}|\le \sqrt{2K_{1}}\cdot \varDelta _{C}\sqrt{\mathscr {E}}. \end{aligned}$$(B-27)By definition of \(b_{j}\),
$$\begin{aligned} |b_{j}|\le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}}\varDelta _{C}\sqrt{\mathscr {E}}. \end{aligned}$$
- (iii) The proof of this result is almost the same as that in [20]. We state it here for the sake of completeness. Let \(\tilde{\mathbf {b}}_{\mathbf {j}}\in \mathbb {R}^{p}\) with
$$\begin{aligned} (\tilde{\mathbf {b}}_{\mathbf {j}})_{j} = b_{j}, \quad (\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} = \hat{\beta }_{[j]} - b_{j}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}\end{aligned}$$(B-28)where the subscript j denotes the j-th entry and the subscript [j] denotes the sub-vector formed by all but j-th entry. Furthermore, define \(\gamma _{j}\) with
$$\begin{aligned} (\gamma _{j})_{j} = -1, \quad (\gamma _{j})_{[j]} = \left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}. \end{aligned}$$(B-29)Then we can rewrite \(\tilde{\mathbf {b}}_{\mathbf {j}}\) as
$$\begin{aligned} (\tilde{\mathbf {b}}_{\mathbf {j}})_{j} = -b_{j}(\gamma _{j})_{j}, \quad (\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} = \hat{\beta }_{[j]} - b_{j}(\gamma _{j})_{[j]}. \end{aligned}$$By definition of \(\hat{\beta }_{[j]}\), we have \([f(\hat{\beta }_{[j]})]_{[j]} = 0\) and hence
$$\begin{aligned}{}[f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{[j]}&= [f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{[j]} - [f(\hat{\beta }_{[j]})]_{[j]} \nonumber \\&= \frac{1}{n}\sum _{i=1}^{n}x_{i, [j]}\left[ \psi (\varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}) - \psi (\varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]})\right] . \end{aligned}$$(B-30)By mean value theorem, there exists \(\nu _{i, j}\in ]\varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}, \varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]}[\) such that
$$\begin{aligned}&\psi \left( \varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) - \psi \left( \varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right) = \psi '(\nu _{i, j})\left( x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) \\&\quad = \psi '(\nu _{i, j})\left( x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i, [j]}^{T}(\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} - X_{ij}b_{j}\right) \\&\quad = \psi '(\nu _{i, j})\cdot b_{j}\cdot \left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \end{aligned}$$Let
$$\begin{aligned} d_{i, j} = \psi '(\nu _{i, j}) - \psi '(r_{i, [j]}) \end{aligned}$$(B-31)and plug the above result into (B-30); we obtain that
$$\begin{aligned} \left[ f(\tilde{\mathbf {b}}_{\mathbf {j}})\right] _{[j]}&= \frac{1}{n}\sum _{i=1}^{n}x_{i, [j]}\cdot \left( \psi '(r_{i, [j]}) + d_{i, j}\right) \cdot b_{j}\cdot \left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&= b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})x_{i, [j]}\left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}d_{i, j}x_{i, [j]}\left( x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right) \\&= b_{j}\cdot \frac{1}{n}\left[ X_{[j]}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{[j]}^{T}D_{[j]}X_{j}\right] \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}d_{i, j}x_{i, [j]}\cdot x_{i}^{T}\gamma _{j}\\&= b_{j}\cdot \frac{1}{n}\left( \sum _{i=1}^{n}d_{i, j}x_{i, [j]}x_{i}^{T}\right) \gamma _{j}. \end{aligned}$$Now we calculate \([f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{j}\), the j-th entry of \(f(\tilde{\mathbf {b}}_{\mathbf {j}})\). Note that
$$\begin{aligned} \left[ f(\tilde{\mathbf {b}}_{\mathbf {j}})\right] _{j}&= \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi \left( \varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) = \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]}) \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}X_{ij}(\psi '(r_{i, [j]})+ d_{i, j}) \\&\quad \cdot \left[ x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&= \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]})+ b_{j} \\&\quad \cdot \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})X_{ij}\left[ x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j}= \frac{1}{\sqrt{n}}N_{j}+ b_{j} \\&\quad \cdot \left( \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})X_{ij}^{2}\right) \\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j}= \frac{1}{\sqrt{n}}N_{j} - b_{j}\cdot \xi _{j}\\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j} = b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j} \end{aligned}$$where the second last line uses the definition of \(b_{j}\). Putting the results together, we obtain that
$$\begin{aligned} f(\tilde{\mathbf {b}}_{\mathbf {j}}) = b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i,j}x_{i}x_{i}^{T}\right) \cdot \gamma _{j}. \end{aligned}$$This entails that
$$\begin{aligned} \Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}\le |b_{j}|\cdot \max _{i}|d_{i,j}|\cdot \lambda _{+}\cdot \Vert \gamma _{j}\Vert _{2}. \end{aligned}$$(B-32)Now we derive a bound for \(\max _{i}|d_{i,j}|\), where \(d_{i,j}\) is defined in (B-31). By Lemma B-2,
$$\begin{aligned} |d_{i,j}| = |\psi '(\nu _{i,j}) - \psi '(r_{i, [j]})|\le K_{3}\left| \nu _{i, j} - r_{i, [j]}\right| = K_{3}\left| x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right| . \end{aligned}$$By definition of \(\tilde{\mathbf {b}}_{\mathbf {j}}\) and \(h_{j, 1, i}\),
$$\begin{aligned} |x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}|&= |b_{j}|\cdot \big |x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\big |\nonumber \\&= |b_{j}| \cdot \left| e_{i}^{T}(I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]})X_{j}\right| \nonumber \\&= |b_{j}|\cdot \left| h_{j, 1, i}^{T}X_{j}\right| \le |b_{j}| \cdot \varDelta _{C}\left\| h_{j, 1, i}\right\| _{2}, \end{aligned}$$(B-33)where the last inequality follows from the definition of \(\varDelta _{C}\); see (B-21). Since \(h_{j, 1, i}\) is the i-th column of the matrix \(I - D_{[j]}X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}\), its \(L_{2}\) norm is upper bounded by the operator norm of this matrix. Notice that
$$\begin{aligned}&I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T} \\&\quad = D_{[j]}^{\frac{1}{2}}\left( I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right) D_{[j]}^{-\frac{1}{2}}. \end{aligned}$$The middle matrix on the RHS of the above display is an orthogonal projection matrix, and hence
$$\begin{aligned} \left\| I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}\right\| _{\mathrm {op}}\le \left\| D_{[j]}^{\frac{1}{2}}\right\| _{\mathrm {op}}\cdot \left\| D_{[j]}^{-\frac{1}{2}}\right\| _{\mathrm {op}} \le \left( \frac{K_{1}}{K_{0}}\right) ^{\frac{1}{2}}. \end{aligned}$$(B-34)Therefore,
$$\begin{aligned} \max _{i, j}\Vert h_{j, 1, i}\Vert _{2}\le \max _{j\in J_{n}}\left\| I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}\right\| _{\mathrm {op}}\le \left( \frac{K_{1}}{K_{0}}\right) ^{\frac{1}{2}}, \end{aligned}$$(B-35)and thus
$$\begin{aligned} \max _{i}|d_{i,j}|\le K_{3}\sqrt{\frac{K_{1}}{K_{0}}}\cdot |b_{j}|\cdot \varDelta _{C}. \end{aligned}$$(B-36)As for \(\gamma _{j}\), we have
$$\begin{aligned} K_{0}\lambda _{-}\Vert \gamma _{j}\Vert _{2}^{2}&\le \gamma _{j}^{T}\left( \frac{X^{T}D_{[j]}X}{n}\right) \gamma _{j} \\&= (\gamma _{j})_{j}^{2}\cdot \frac{X_{j}^{T}D_{[j]}X_{j}}{n} + (\gamma _{j})_{[j]}^{T}\left( \frac{X_{[j]}^{T}D_{[j]}X_{[j]}}{n}\right) (\gamma _{j})_{[j]}\\&\quad + 2(\gamma _{j})_{j}\frac{X_{j}^{T}D_{[j]}X_{[j]}}{n}(\gamma _{j})_{[j]} \end{aligned}$$Recalling the definition of \(\gamma _{j}\) in (B-29), we have
$$\begin{aligned} (\gamma _{j})_{[j]}^{T}\left( \frac{X_{[j]}^{T}D_{[j]}X_{[j]}}{n}\right) (\gamma _{j})_{[j]} = \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j} \end{aligned}$$and
$$\begin{aligned} (\gamma _{j})_{j}\frac{X_{j}^{T}D_{[j]}X_{[j]}}{n}(\gamma _{j})_{[j]} = - \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}. \end{aligned}$$As a result,
$$\begin{aligned} K_{0}\lambda _{-}\Vert \gamma _{j}\Vert _{2}^{2}&\le \frac{1}{n}X_{j}^{T}D_{[j]}^{\frac{1}{2}}\left( I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right) D_{[j]}^{\frac{1}{2}}X_{j}\\&\le \frac{\left\| D_{[j]}^{\frac{1}{2}}X_{j}\right\| _{2}^{2}}{n}\cdot \left\| I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right\| _{op}\\&\le \frac{\left\| D_{[j]}^{\frac{1}{2}}X_{j}\right\| _{2}^{2}}{n} \le \frac{K_{1}\Vert X_{j}\Vert _{2}^{2}}{n}\le T^{2}K_{1}, \end{aligned}$$where T is defined in (B-23). Therefore we have
$$\begin{aligned} \left\| \gamma _{j}\right\| _{2}\le \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T. \end{aligned}$$(B-37)Putting (B-32), (B-36), (B-37) and part (ii) together, we obtain that
$$\begin{aligned} \Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}&\le \lambda _{+}\cdot |b_{j}|\cdot K_{3}\sqrt{\frac{K_{1}}{K_{0}}} \varDelta _{C}|b_{j}| \cdot \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T\\&\le \lambda _{+}\cdot \frac{1}{n}\frac{2K_{1}}{(K_{0}\lambda _{-})^{2}}\varDelta _{C}^{2}\mathscr {E}\cdot K_{3}\sqrt{\frac{K_{1}}{K_{0}}} \varDelta _{C}\cdot \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T\\&= \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{3}\lambda _{-}^{\frac{5}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$By Lemma B-1,
$$\begin{aligned} \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2}&\le \frac{\Vert f(\hat{\beta }) - f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}}{K_{0}\lambda _{-}} = \frac{\Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}}{K_{0}\lambda _{-}} \le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$Since \(\hat{\beta }_{j} - b_{j}\) is the j-th entry of \(\hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\), we have
$$\begin{aligned} |\hat{\beta }_{j} - b_{j}|\le \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2} \le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3} \cdot \mathscr {E}. \end{aligned}$$
- (iv) Similar to part (iii), this result has been shown by [20]. Here we state a refined version for the sake of completeness. Let \(\tilde{\mathbf {b}}_{\mathbf {j}}\) be defined as in (B-28); then
$$\begin{aligned} |R_{i} - r_{i, [j]}|&= \left| x_{i}^{T}\hat{\beta } - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| = \left| x_{i}^{T}(\hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}) + x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| \\&\le \Vert x_{i}\Vert _{2} \cdot \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2} + \left| x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| . \end{aligned}$$Since \(\left\| x_{i}\right\| _{2}\le \sqrt{n}T\), by part (iii) we have
$$\begin{aligned} \Vert x_{i}\Vert _{2} \cdot \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2}\le \frac{1}{\sqrt{n}}\frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$(B-38)On the other hand, similar to (B-36), by (B-33),
$$\begin{aligned} \left| x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| \le \sqrt{\frac{K_{1}}{K_{0}}}\cdot |b_{j}|\cdot \varDelta _{C} \le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}. \end{aligned}$$(B-39)Therefore,
$$\begin{aligned} |R_{i} - r_{i, [j]}|\le \frac{1}{\sqrt{n}}\left( \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}+ \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}\right) . \end{aligned}$$
\(\square \)
1.3 Summary of approximation results
Under our technical assumptions, we can derive the rates for these approximations via Proposition B.1. This justifies all the approximations in “Appendix A”.
Theorem B.1
Under assumptions A1–A5,
- (i) $$\begin{aligned} T\le \lambda _{+}= O\left( \mathrm {polyLog(n)}\right) ; \end{aligned}$$
- (ii) $$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j}|\le \Vert \hat{\beta }\Vert _{2} = O_{L^{4}}\left( \mathrm {polyLog(n)}\right) ; \end{aligned}$$
- (iii) $$\begin{aligned} \max _{j\in J_{n}}|b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) ; \end{aligned}$$
- (iv) $$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{n}\right) ; \end{aligned}$$
- (v) $$\begin{aligned} \max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
Proof
- (i) Since \(X_{j} = Xe_{j}\), where \(e_{j}\) is the j-th canonical basis vector in \(\mathbb {R}^{p}\), we have
$$\begin{aligned} \frac{\Vert X_{j}\Vert ^{2}}{n} = e_{j}^{T}\frac{X^{T}X}{n}e_{j}\le \lambda _{+}. \end{aligned}$$Similarly, considering \(X^{T}\) in place of X, we conclude that
$$\begin{aligned} \frac{\Vert x_{i}\Vert ^{2}}{n} \le \lambda _{\max }\left( \frac{XX^{T}}{n}\right) = \lambda _{+}. \end{aligned}$$Recalling the definition of T in (B-23), we conclude that
$$\begin{aligned} T \le \sqrt{\lambda _{+}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$
- (ii) Since \(\varepsilon _{i} = u_{i}(W_{i})\) with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\), the Gaussian concentration property ([36], Chapter 1.3) implies that \(\varepsilon _{i}\) is \(c_{1}^{2}\)-sub-gaussian and hence \(\mathbb {E}|\varepsilon _{i}|^{k} = O(c_{1}^{k})\) for any finite \(k > 0\). By Lemma B-2, \(|\psi (\varepsilon _{i})|\le K_{1}|\varepsilon _{i}|\) and hence for any finite k,
$$\begin{aligned} \mathbb {E}|\psi (\varepsilon _{i})|^{k}\le K_{1}^{k}\mathbb {E}|\varepsilon _{i}|^{k} = O\left( c_{1}^{k}\right) . \end{aligned}$$By part (i) of Proposition B.1, using the convexity of \(x^{4}\) and hence \(\left( \frac{a + b}{2}\right) ^{4} \le \frac{a^{4} + b^{4}}{2}\),
$$\begin{aligned} \mathbb {E}\Vert \hat{\beta }\Vert _{2}^{4}\le \frac{1}{(K_{0}\lambda _{-})^{4}}\mathbb {E}(U + U_{0})^{4}\le \frac{8}{(K_{0}\lambda _{-})^{4}}\left( \mathbb {E}U^{4} + U_{0}^{4}\right) . \end{aligned}$$Recall (B-24) that \(U = \left\| \frac{1}{n}\sum _{i=1}^{n}x_{i}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))\right\| _{2}\),
$$\begin{aligned} U^{4}&= (U^{2})^{2} = \frac{1}{n^{4}}\left( \sum _{i,i'=1}^{n}x_{i}^{T}x_{i'}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))\right) ^{2}\\&= \frac{1}{n^{4}}\left( \sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{2}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2} \right. \\&\quad \left. + \sum _{i\not = i'}|x_{i}^{T}x_{i'}|(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))\right) ^{2}\\&=\frac{1}{n^{4}}\bigg \{\sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4} \\&\quad + \sum _{i\not = i'}(2|x_{i}^{T}x_{i'}|^{2} + \Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2})(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\\&\quad + \sum _{\mathrm{others}}|x_{i}^{T}x_{i'}|\cdot |x_{k}^{T}x_{k'}|\cdot (\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) \\&\quad - \mathbb {E}\psi (\varepsilon _{i'}))(\psi (\varepsilon _{k}) - \mathbb {E}\psi (\varepsilon _{k}))(\psi (\varepsilon _{k'}) - \mathbb {E}\psi (\varepsilon _{k'}))\bigg \} \end{aligned}$$Since \(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i})\) has a zero mean, we have
$$\begin{aligned}&\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))(\psi (\varepsilon _{k}) \\&\quad -\, \mathbb {E}\psi (\varepsilon _{k}))(\psi (\varepsilon _{k'}) - \mathbb {E}\psi (\varepsilon _{k'})) = 0 \end{aligned}$$for any \((i, i')\not = (k, k') \text{ or } (k', k)\) and \(i\not = i'\). As a consequence,
$$\begin{aligned} \mathbb {E}U^{4}&= \frac{1}{n^{4}}\bigg (\sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4}\\&+\,\sum _{i\not =i'}(2|x_{i}^{T}x_{i'}|_{2}^{2} + \Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2})\mathbb {E}(\psi (\varepsilon _{i})\\&-\,\mathbb {E}\psi (\varepsilon _{i}))^{2}\mathbb {E}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\bigg )\\&\le \frac{1}{n^{4}}\left( \sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4} \right. \\&\left. +\,3\sum _{i\not =i'}\Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}\mathbb {E}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\right) . \end{aligned}$$For any i, using the convexity of \(x^{4}\), hence \((\frac{a + b}{2})^{4}\le \frac{a^{4} + b^{4}}{2}\), we have
$$\begin{aligned} \mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4}\le & {} 8\mathbb {E}\left( \psi (\varepsilon _{i})^{4} + (\mathbb {E}\psi (\varepsilon _{i}))^{4}\right) \le 16 \mathbb {E}\psi (\varepsilon _{i})^{4}\\\le & {} 16\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}. \end{aligned}$$By the Cauchy–Schwarz inequality,
$$\begin{aligned} \mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}\le \mathbb {E}\psi (\varepsilon _{i})^{2}\le \sqrt{\mathbb {E}\psi (\varepsilon _{i})^{4}}\le \sqrt{\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}}. \end{aligned}$$Recall (B-23) that \(\Vert x_{i}\Vert _{2}^{2} \le nT^{2}\) and thus,
$$\begin{aligned} \mathbb {E}U^{4}&\le \frac{1}{n^{4}}\left( 16 n\cdot n^{2}T^{4} + 3n^{2}\cdot n^{2}T^{4}\right) \cdot \max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}\\&\le \frac{1}{n^{4}}\cdot (16 n^{3} + 3n^{4})T^{4}\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$On the other hand, let \(\mu ^{T} = (\mathbb {E}\psi (\varepsilon _{1}), \ldots , \mathbb {E}\psi (\varepsilon _{n}))\), then \(\Vert \mu \Vert _{2}^{2} = O(n\cdot \mathrm {polyLog(n)})\) and hence by definition of \(U_{0}\) in (B-24),
$$\begin{aligned} U_{0} = \frac{\Vert \mu ^{T}X\Vert _{2}}{n} = \frac{1}{n}\sqrt{\mu ^{T}XX^{T}\mu }\le \sqrt{\frac{\Vert \mu \Vert _{2}^{2}}{n}\cdot \lambda _{+}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$In summary,
$$\begin{aligned} \mathbb {E}\Vert \hat{\beta }\Vert _{2}^{4} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$
- (iii) By the mean value theorem, there exists \(a_{x}\in (0, x)\) such that
$$\begin{aligned} \rho (x) = \rho (0) + x\psi (0) + \frac{x^{2}}{2}\psi '(a_{x}). \end{aligned}$$By assumption A1 and Lemma B-2, we have
$$\begin{aligned} \rho (x) = \frac{x^{2}}{2}\psi '(a_{x})\le \frac{x^{2}}{2}\Vert \psi '\Vert _{\infty } \le \frac{K_{3}x^{2}}{2}, \end{aligned}$$where \(K_{3}\) is defined in Lemma B-2. As a result,
$$\begin{aligned} \mathbb {E}\rho (\varepsilon _{i})^{8} \le \left( \frac{K_{3}}{2}\right) ^{8}\mathbb {E}\varepsilon _{i}^{16} = O\left( c_{1}^{16}\right) . \end{aligned}$$Recalling the definition of \(\mathscr {E}\) in (B-23) and using the convexity of \(x^{8}\), we have
$$\begin{aligned} \mathbb {E}\mathscr {E}^{8} \le \frac{1}{n}\sum _{i=1}^{n}\mathbb {E}\rho (\varepsilon _{i})^{8} = O(c_{1}^{16}) = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$(B-40)Under assumption A5, by the Cauchy–Schwarz inequality,
$$\begin{aligned} \mathbb {E}(\varDelta _{C}\sqrt{\mathscr {E}})^{2} = \mathbb {E}\varDelta _{C}^{2}\mathscr {E}\le \sqrt{\mathbb {E}\varDelta _{C}^{4}}\cdot \sqrt{\mathbb {E}\mathscr {E}^{2}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Under assumptions A1 and A3,
$$\begin{aligned} \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Putting all the pieces together, we obtain that
$$\begin{aligned} \max _{j\in J_{n}}|b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
- (iv) Similarly, by Hölder's inequality,
$$\begin{aligned} \mathbb {E}\left( \varDelta _{C}^{3}\mathscr {E}\right) ^{2} = \mathbb {E}\varDelta _{C}^{6}\mathscr {E}^{2}\le \left( \mathbb {E}\varDelta _{C}^{8}\right) ^{\frac{3}{4}}\cdot \left( \mathbb {E}\mathscr {E}^{8}\right) ^{\frac{1}{4}} = O\left( \mathrm {polyLog(n)}\right) , \end{aligned}$$and under assumptions A1 and A3,
$$\begin{aligned} \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Therefore,
$$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{n}\right) . \end{aligned}$$
- (v) It follows from the previous part that
$$\begin{aligned} \mathbb {E}(\varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}})^{2} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Under assumptions A1 and A3, the multiplicative factors are also \(O\left( \mathrm {polyLog(n)}\right) \), i.e.
$$\begin{aligned} \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}} = O\left( \mathrm {polyLog(n)}\right) , \quad \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Therefore,
$$\begin{aligned} \max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
\(\square \)
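The rates in parts (iii)–(v) are easy to probe numerically. The following minimal sketch (ours; it uses a Gaussian design and the strongly convex loss \(\rho (x) = x^{2}/2 + \log \cosh x\), which satisfies A1, neither of which is prescribed by the theorem) compares the full residuals \(R_{i}\) with the leave-one-predictor-out residuals \(r_{i, [j]}\):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, j = 500, 100, 0                  # moderate p/n regime; drop predictor j

X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)           # with beta* = 0 WLOG, y = eps

def loss(beta, A):
    r = eps - A @ beta                 # residuals for design A
    return np.sum(0.5 * r**2 + np.log(np.cosh(r)))

beta_full = minimize(loss, np.zeros(p), args=(X,), method="L-BFGS-B").x
X_mj = np.delete(X, j, axis=1)         # design without the j-th column
beta_mj = minimize(loss, np.zeros(p - 1), args=(X_mj,), method="L-BFGS-B").x

R = eps - X @ beta_full                # full residuals R_i
r_mj = eps - X_mj @ beta_mj            # leave-j-th-predictor-out residuals
print("max_i |R_i - r_i,[j]| =", np.abs(R - r_mj).max())  # expected O(1/sqrt(n))
```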
1.4 Controlling gradient and Hessian
Proof
(Proof of Lemma 4.1) Recall that \(\hat{\beta }\) is the solution of the following equation
Taking the derivative of (B-41), we have
This establishes (9). To establish (10), note that (9) can be rewritten as
Fix \(k\in \{1, \ldots , n\}\). Note that
Recalling that \(G = I - X(X^{T}DX)^{-1}X^{T}D\), we have
where \(e_{i}\) is the i-th canonical basis vector of \(\mathbb {R}^{n}\). As a result,
Taking the derivative of (B-42), we have
where \(G = I - X(X^{T}DX)^{-1}X^{T}D\) is defined in (B-18) on p. 31. Then for each \(j\in \{1, \ldots , p\}\) and \(k\in \{1, \ldots , n\}\),
where we use the fact that \(a^{T}\hbox {diag}(b) = b^{T}\hbox {diag}(a)\) for any vectors a, b. This implies that
\(\square \)
Proof
(Proof of Lemma 4.2) Throughout the proof we use the simple fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\). Based on this, we find that
Thus for any \(m > 1\), recalling that \(M_{j} = \mathbb {E}\left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{\infty }\),
We should emphasize that we cannot use the naive bound that
since it fails to guarantee the convergence of the TV distance. We will address this issue after deriving Lemma 4.3.
By contrast, as proved below,
Thus (B-46) produces a slightly tighter bound
It turns out that the above bound suffices to prove the convergence. Although (B-48) suggests the possibility of sharpening the bound from \(n^{-\frac{m + 1}{2m}}\) to \(n^{-1}\) using a refined analysis, we do not pursue this, to avoid extra conditions and notation.
- Bound for \(\kappa _{0j}\). First we derive a bound for \(\kappa _{0j}\). By definition,
$$\begin{aligned} \kappa _{0j}^{2} = \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{4}^{4}\le \mathbb {E}\left( \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{\infty }^{2}\cdot \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{2}\right) . \end{aligned}$$By Lemma 4.1 and (B-46) with \(m = 2\),
$$\begin{aligned} \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{\infty }^{2} \le \mathbb {E}\left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{\infty }^{2}\cdot K_{1} = \frac{K_{1}M_{j}}{(nK_{0}\lambda _{-})^{\frac{1}{2}}}. \end{aligned}$$On the other hand, it follows from (B-45) that
$$\begin{aligned} \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{2}&= \left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D\right\| _{2}^{2} \le K_{1}\cdot \left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{2}^{2} \le \frac{K_{1}}{nK_{0}\lambda _{-}}. \end{aligned}$$(B-49)Putting the above two bounds together we have
$$\begin{aligned} \kappa _{0j}^{2}\le \frac{K_{1}^{2}}{(nK_{0}\lambda _{-})^{\frac{3}{2}}}\cdot M_{j}. \end{aligned}$$(B-50)
- Bound for \(\kappa _{1j}\). As a by-product of (B-49), we obtain that
$$\begin{aligned} \kappa _{1j}^{4}&= \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{4} \le \frac{K_{1}^{2}}{(nK_{0}\lambda _{-})^{2}}. \end{aligned}$$(B-51)
- Bound for \(\kappa _{2j}\). Finally, we derive a bound for \(\kappa _{2j}\). By Lemma 4.1, \(\kappa _{2j}\) involves the operator norm of a symmetric matrix of the form \(G^{T}MG\), where M is a diagonal matrix. Then by the submultiplicativity of the operator norm,
$$\begin{aligned} \left\| G^{T}MG\right\| _{op}\le \Vert M\Vert _{\mathrm {op}}\cdot \left\| G^{T}G\right\| _{op} = \Vert M\Vert _{\mathrm {op}}\cdot \left\| G\right\| _{op}^{2}. \end{aligned}$$
Note that
is a projection matrix, which is idempotent. This implies that
Write G as \(D^{-\frac{1}{2}}(D^{\frac{1}{2}}GD^{-\frac{1}{2}})D^{\frac{1}{2}}\), then we have
Returning to \(\kappa _{2j}\), we obtain that
Assumption A1 implies that
Therefore,
By (B-46) with \(m = 4\),
\(\square \)
Proof
(Proof of Lemma 4.3) By Theorem B.1, for any j,
Then using the second-order Poincaré inequality (Proposition 4.1),
It follows from (B-45) that \(nM_{j}^{2} = O\left( \mathrm {polyLog(n)}\right) \) and the above bound can be simplified as
\(\square \)
Remark B.1
If we use the naive bound (B-47), then by repeating the above derivation we obtain the worse bounds \(\kappa _{0j} = O(\frac{\mathrm {polyLog(n)}}{n})\) and \(\kappa _{2j} = O(\frac{\mathrm {polyLog(n)}}{\sqrt{n}})\), in which case,
However, we can only prove that \(\hbox {Var}(\hat{\beta }_{j}) = \varOmega (\frac{1}{n})\). Without the numerator \((nM_{j}^{2})^{\frac{1}{8}}\), which will be shown to be \(O(n^{-\frac{1}{8}}\mathrm {polyLog(n)})\) in the next subsection, the convergence cannot be proved.
1.5 Upper bound of \(M_{j}\)
As mentioned in “Appendix A”, we should approximate D by \(D_{[j]}\) to remove the functional dependence on \(X_{j}\). To achieve this, we introduce two terms, \(M_{j}^{(1)}\) and \(M_{j}^{(2)}\), defined as
We will first prove that both \(|M_{j} - M_{j}^{(1)}|\) and \(|M_{j}^{(1)} - M_{j}^{(2)}|\) are negligible and then derive an upper bound for \(M_{j}^{(2)}\).
1.5.1 Controlling \(|M_{j} - M_{j}^{(1)}|\)
By Lemma B-2,
and by Theorem B.1,
Then we can bound \(|M_{j} - M_{j}^{(1)}|\) via the fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\) and some algebra, as follows.
By Lemma B-2,
thus
This entails that
1.5.2 Bound of \(|M_{j}^{(1)} - M_{j}^{(2)}|\)
First we prove a useful lemma.
Lemma B-3
For any symmetric matrix N with \(\Vert N\Vert _{\mathrm {op}} < 1\),
Proof
First, notice that
and therefore
Since \(\Vert N\Vert _{\mathrm {op}} < 1\), \(I + N\) is positive semi-definite and
Therefore,
\(\square \)
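For reference, a standard bound of this type, consistent with how Lemma B-3 is used below, follows from the Neumann series: for symmetric N with \(\Vert N\Vert _{\mathrm {op}} < 1\),
$$\begin{aligned} \left\| (I + N)^{-1} - I\right\| _{\mathrm {op}} = \bigg \Vert \sum _{k\ge 1}(-N)^{k}\bigg \Vert _{\mathrm {op}}\le \sum _{k\ge 1}\Vert N\Vert _{\mathrm {op}}^{k} = \frac{\Vert N\Vert _{\mathrm {op}}}{1 - \Vert N\Vert _{\mathrm {op}}}. \end{aligned}$$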
We now return to bounding \(|M_{j}^{(1)} - M_{j}^{(2)}|\). Let \(A_{j} = X^{T}D_{[j]}X\) and \(B_{j} = X^{T}(D - D_{[j]})X\). By Lemma B-2,
and hence
where \(\eta _{j} = K_{3}\lambda _{+}\cdot \mathscr {R}_{j}\). Then by Theorem B.1.(v),
Using the fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\), we obtain that
The inner matrix can be rewritten as
Let \(N_{j} = A_{j}^{-\frac{1}{2}}B_{j}A_{j}^{-\frac{1}{2}}\), then
On the event \(\{\eta _{j} \le \frac{1}{2}K_{0}\lambda _{-}\}\), \(\Vert N_{j}\Vert _{\mathrm {op}}\le \frac{1}{2}\). By Lemma B-3,
This together with (B-53) entails that
Since \(A_{j}\succeq nK_{0}\lambda _{-}I\), and \(\Vert B_{j}\Vert _{\mathrm {op}}\le n\eta _{j}\), we have
Thus,
On the event \(\{\eta _{j} > \frac{1}{2}K_{0}\lambda _{-}\}\), since \(nK_{0}\lambda _{-}I \preceq A_{j}\preceq nK_{1}\lambda _{+}I\) and \(A_{j} + B_{j}\succeq nK_{0}\lambda _{-}I\),
This, together with Markov's inequality, implies that
Putting the pieces together, we conclude that
1.5.3 Bound of \(M_{j}^{(2)}\)
Similarly to (A-1), by the block matrix inversion formula (see Proposition E.1),
where \(H_{j} = D_{[j]}^{\frac{1}{2}} X_{[j]} (X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\). Recall that \(\xi _{j}\ge K_{0}\lambda _{-}\) by (B-25), so we have
As for the numerator, recalling the definition of \(h_{j, 1, i}\), we obtain that
As proved in (B-35),
This entails that
Putting the pieces together we conclude that
1.5.4 Summary
Based on results from Sections B.5.1–B.5.3, we have
Note that the bounds we obtained do not depend on j, so we conclude that
1.6 Lower Bound of \(\hbox {Var}(\hat{\beta }_{j})\)
1.6.1 Approximating \(\hbox {Var}(\hat{\beta }_{j})\) by \(\hbox {Var}(b_{j})\)
By Theorem B.1,
Using the fact that
we can bound the difference between \(\mathbb {E}\hat{\beta }_{j}^{2}\) and \(\mathbb {E}b_{j}^{2}\) by
Similarly, since \(|a^{2} - b^{2}| = |a - b|\cdot |a + b|\le |a - b|(|a - b| + 2|b|)\),
Putting the above two results together, we conclude that
Then it is left to show that
1.6.2 Controlling \(\hbox {Var}(b_{j})\) by \(\hbox {Var}(N_{j})\)
Recall that
where
Then
Using the fact that \((a + b)^{2} - (\frac{1}{2}a^{2} - b^{2}) = \frac{1}{2}(a + 2b)^{2}\ge 0\), we have
1.6.3 Controlling \(I_{1}\)
Assumption A4 implies that
It is left to show that \( \hbox {tr}(\hbox {Cov}(h_{j, 0})) / n = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \). Since this result will also be used later in “Appendix C”, we state it as the following lemma.
Lemma B-4
Under assumptions A1–A3, \(\hbox {tr}(\hbox {Cov}(h_{j, 0})) / n = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \).
Proof
Inequality (A-10) implies that
Since \(r_{i, [j]}\) is a function of \(\varepsilon \), we can apply (A-10) again to obtain a lower bound for \(\hbox {Var}(r_{i, [j]})\). In fact, by the variance decomposition formula and the independence of the \(\varepsilon _{i}\)'s,
where \(\varepsilon _{(i)}\) includes all but the i-th entry of \(\varepsilon \). Applying (A-10) again,
and hence
Now we compute \(\frac{\partial r_{i, [j]}}{\partial \varepsilon _{i}}\). Similarly to (B-43) on p. 40, we have
where \(G_{[j]}\) is defined in (B-18) in p. 31. When \(k = i\),
By definition of \(G_{[j]}\),
Let \(\tilde{X}_{[j]} = D_{[j]}^{\frac{1}{2}}X_{[j]}\) and \(H_{j} = \tilde{X}_{[j]}(\tilde{X}_{[j]}^{T}\tilde{X}_{[j]})^{-1}\tilde{X}_{[j]}^{T}\). Denote by \(\tilde{X}_{(i), [j]}\) the matrix \(\tilde{X}_{[j]}\) after removing the i-th row; then by the block matrix inversion formula (see Proposition E.1),
This implies that
Applying the above argument with \(H_{j}\) replaced by \(X_{[j]}(X_{[j]}^{T}X_{[j]})^{-1}X_{[j]}^{T}\), we have
Summing over \(i = 1, \ldots , n\), we obtain that
Since \(\min _{i}\hbox {Var}(\varepsilon _{i}) = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \) by assumption A2, we conclude that
\(\square \)
In summary,
Recall that
we conclude that
1.6.4 Controlling \(I_{2}\)
By definition,
By (B-27) in the proof of Proposition B.1,
where the last equality uses the fact that \(\mathscr {E}= O_{L^{2}}\left( \mathrm {polyLog(n)}\right) \) as proved in (B-40). On the other hand, let \(\tilde{\xi }_{j}\) be an independent copy of \(\xi _{j}\), then
Since \(\xi _{j}\ge K_{0}\lambda _{-}\) as shown in (B-25), we have
To bound \(\hbox {Var}(\xi _{j})\), we use the standard Poincaré inequality [11], stated as follows.
Proposition B.2
Let \(W = (W_{1}, \ldots , W_{n})\sim N(0, I_{n\times n})\) and f be a twice differentiable function; then \(\hbox {Var}(f(W))\le \mathbb {E}\Vert \nabla f(W)\Vert _{2}^{2}\).
In our case, \(\varepsilon _{i} = u_{i}(W_{i})\), and hence for any twice differentiable function g, the chain rule together with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\) gives \(\hbox {Var}(g(\varepsilon ))\le c_{1}^{2}\,\mathbb {E}\Vert \nabla g(\varepsilon )\Vert _{2}^{2}\).
Applying it to \(\xi _{j}\), we have
For a given \(k\in \{1, \ldots , n\}\), using the chain rule and the fact that \(dB^{-1} = -B^{-1}(dB) B^{-1}\) for any invertible square matrix B, we obtain that
where \(G_{[j]} = I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}\), as defined in the last subsection. This implies that
Then (B-64) entails that
First we compute \(\frac{\partial D_{[j]}}{\partial \varepsilon _{k}}\). Similarly to (B-44) on p. 40, recalling the definition of \(D_{[j]}\) in (B-17) and that of \(G_{[j]}\) in (B-18) on p. 31, we have
Let \(\mathscr {X}_{j} = G_{[j]}X_{j}\) and \(\tilde{\mathscr {X}}_{j} = \mathscr {X}_{j}\circ \mathscr {X}_{j}\), where \(\circ \) denotes the Hadamard product. Then
Here we use the fact that for any vectors \(x, a\in \mathbb {R}^{n}\),
This, together with (B-65), implies that
Note that \(G_{[j]}G_{[j]}^{T}\preceq \Vert G_{[j]}\Vert _{\mathrm {op}}^{2} I\), and \(\tilde{D}_{[j]}\preceq K_{3}I\) by Lemma B-2 on p. 32. Therefore we obtain that
As shown in (B-34),
On the other hand, since the i-th row of \(G_{[j]}\) is \(h_{j, 1, i}\) (see (B-20) for the definition), by the definition of \(\varDelta _{C}\) we have
By (B-35) and assumption A5,
This entails that
Combining with (B-62) and (B-63), we obtain that
1.6.5 Summary
Putting (B-55), (B-61) and (B-62) together, we conclude that
Combining with (B-54),
Proof of other results
1.1 Proofs of propositions in Section 2.3
Proof
(Proof of Proposition 2.1) Let \(H_{i}(\alpha ) = \mathbb {E}\rho (\varepsilon _{i} - \alpha )\). First we prove that the conditions imply that 0 is the unique minimizer of \(H_{i}(\alpha )\) for all i. In fact, since \(\varepsilon _{i}\overset{d}{=}-\varepsilon _{i}\),
Using the fact that \(\rho \) is even, we have
By (4), for any \(\alpha \not = 0\), \(H_{i}(\alpha ) > H_{i}(0)\). As a result, 0 is the unique minimizer of \(H_{i}\). Then for any \(\beta \in \mathbb {R}^{p}\)
The equality holds iff \(x_{i}^{T}(\beta - \beta ^{*}) = 0\) for all i since 0 is the unique minimizer of \(H_{i}\). This implies that
Since X has full column rank, we conclude that
\(\square \)
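To illustrate Proposition 2.1 numerically, here is a minimal sketch (the Huber loss and \(t_{3}\) noise below are illustrative choices satisfying the symmetry and convexity conditions): the map \(\alpha \mapsto \mathbb {E}\rho (\varepsilon _{i} - \alpha )\) is minimized at 0.

```python
import numpy as np

# H(alpha) = E rho(eps - alpha) for symmetric noise eps and an even convex
# loss rho is minimized at alpha = 0 (Proposition 2.1).
rng = np.random.default_rng(1)
eps = rng.standard_t(df=3, size=500_000)   # symmetric noise
def huber(x, k=1.345):                     # even, convex Huber loss
    return np.where(np.abs(x) <= k, 0.5 * x**2, k * (np.abs(x) - 0.5 * k))

alphas = np.linspace(-1, 1, 41)
H = np.array([huber(eps - a).mean() for a in alphas])
print(f"argmin of H over the grid: alpha = {alphas[H.argmin()]:.3f}")  # ~ 0.0
```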
Proof
(Proof of Proposition 2.2) For any \(\alpha \in \mathbb {R}\) and \(\beta \in \mathbb {R}^{p}\), let
Since \(\alpha _{\rho }\) minimizes \(\mathbb {E}\rho (\varepsilon _{i} - \alpha )\), it holds that
Since \(\alpha _{\rho }\) is the unique minimizer of \(\mathbb {E}\rho (\varepsilon _{i} - \alpha )\), the above equality holds if and only if
Since \((\mathbf 1 \,\, X)\) has full column rank, it must hold that \(\alpha = \alpha _{\rho }\) and \(\beta = \beta ^{*}\). \(\square \)
1.2 Proof of Corollary 3.1
Proposition C.1
Suppose that \(\varepsilon _{i}\) are i.i.d. such that \(\mathbb {E}\rho (\varepsilon _{1} - \alpha )\) as a function of \(\alpha \) has a unique minimizer \(\alpha _{\rho }\). Further assume that \(X_{J_{n}^{c}}\) contains an intercept term, \(X_{J_{n}}\) has full column rank and
Let
Then \(\beta _{J_{n}}(\rho ) = \beta ^{*}_{J_{n}}\).
Proof
Let
For any minimizer \(\beta (\rho )\) of G, which might not be unique, we prove that \(\beta _{J_{n}}(\rho ) = \beta ^{*}_{J_{n}}\). It follows by the same argument as in Proposition 2.2 that
Since \(X_{J_{n}^{c}}\) contains the intercept term, we have
It then follows from (C-68) that
Since \(X_{J_{n}}\) has full column rank, we conclude that
\(\square \)
Proposition C.1 implies that \(\beta ^{*}_{J_{n}}\) is identifiable even when X is not of full column rank. A similar conclusion holds for the estimator \(\hat{\beta }_{J_{n}}\) and the residuals \(R_{i}\). The following two propositions show that, under certain assumptions, \(\hat{\beta }_{J_{n}}\) and \(R_{i}\) are invariant to the choice of \(\hat{\beta }\) in the presence of multiple minimizers.
Proposition C.2
Suppose that \(\rho \) is convex and twice differentiable with \(\rho ''(x)> c > 0\) for all \(x\in \mathbb {R}\). Let \(\hat{\beta }\) be any minimizer, which might not be unique, of
Then \(R_{i} = y_{i} - x_{i}^{T}\hat{\beta }\) is independent of the choice of \(\hat{\beta }\) for any i.
Proof
The conclusion is obvious if \(F(\beta )\) has a unique minimizer. Otherwise, let \(\hat{\beta }^{(1)}\) and \(\hat{\beta }^{(2)}\) be two different minimizers of F, and denote by \(\eta \) their difference, i.e. \(\eta = \hat{\beta }^{(2)} - \hat{\beta }^{(1)}\). Since F is convex, \(\hat{\beta }^{(1)} + v\eta \) is a minimizer of F for all \(v\in [0, 1]\). By Taylor expansion,
Since both \(\hat{\beta }^{(1)} + v\eta \) and \(\hat{\beta }^{(1)}\) are minimizers of F, we have \(F(\hat{\beta }^{(1)} + v\eta ) = F(\hat{\beta }^{(1)})\) and \(\nabla F(\hat{\beta }^{(1)}) = 0\). By letting v tend to 0, we conclude that
The Hessian of F can be written as
Thus, \(\eta \) satisfies that
This implies that
and hence \(R_{i}\) is the same under both minimizers for every i. \(\square \)
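The following sketch illustrates Proposition C.2 in the simplest admissible case \(\rho (x) = x^{2}\) (up to scaling, \(\rho '' \equiv 2 > 0\)): with a rank-deficient design the minimizer is not unique, yet the residuals do not depend on which minimizer is chosen.

```python
import numpy as np

# Rank-deficient least squares: two distinct minimizers, identical residuals.
rng = np.random.default_rng(2)
n, p = 50, 6
X = rng.standard_normal((n, p))
X[:, 5] = X[:, 0] + X[:, 1]                    # force rank deficiency
y = X[:, :3] @ np.ones(3) + rng.standard_normal(n)

beta1 = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm minimizer
eta = np.linalg.svd(X)[2][-1]                  # null-space direction: X @ eta ~ 0
beta2 = beta1 + 3.0 * eta                      # a second minimizer

r1, r2 = y - X @ beta1, y - X @ beta2
print(np.max(np.abs(r1 - r2)))                 # ~ 1e-14: residuals coincide
```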
Proposition C.3
Suppose that \(\rho \) is convex and twice differentiable with \(\rho ''(x)> c > 0\) for all \(x\in \mathbb {R}\). Further assume that \(X_{J_{n}}\) has full column rank and
Let \(\hat{\beta }\) be any minimizer, which might not be unique, of
Then \(\hat{\beta }_{J_{n}}\) is independent of the choice of \(\hat{\beta }\).
Proof
As in the proof of Proposition C.2, we conclude that for any minimizers \(\hat{\beta }^{(1)}\) and \(\hat{\beta }^{(2)}\), \(X\eta = 0\) where \(\eta = \hat{\beta }^{(2)} - \hat{\beta }^{(1)}\). Decomposing \(X\eta \) into two parts, we have
It then follows from (C-68) that \(X_{J_{n}}\eta _{J_{n}} = 0\). Since \(X_{J_{n}}\) has full column rank, we conclude that \(\eta _{J_{n}}= 0\) and hence \(\hat{\beta }^{(1)}_{J_{n}} = \hat{\beta }^{(2)}_{J_{n}}\). \(\square \)
Proof
(Proof of Corollary 3.1) Under assumption A3*, \(X_{J_{n}}\) must have full column rank. Otherwise there exists a nonzero \(\alpha \in \mathbb {R}^{|J_{n}|}\) such that \(X_{J_{n}}\alpha = 0\), in which case \(\alpha ^{T}X_{J_{n}}^{T}(I - H_{J_{n}^{c}})X_{J_{n}}\alpha = 0\). This violates the assumption that \(\tilde{\lambda }_{-} > 0\). On the other hand, it also guarantees that
This together with assumption A1 and Proposition C.3 implies that \(\hat{\beta }_{J_{n}}\) is independent of the choice of \(\hat{\beta }\).
Let \(B_{1}\in \mathbb {R}^{|J_{n}^{c}|\times |J_{n}|}\), \(B_{2}\in \mathbb {R}^{|J_{n}^{c}|\times |J_{n}^{c}|}\), and assume that \(B_{2}\) is invertible. Let \(\tilde{X}\in \mathbb {R}^{n\times p}\) be such that
Then \(\hbox {rank}(X) = \hbox {rank}(\tilde{X})\) and model (1) can be rewritten as
where
Let \(\tilde{\hat{\beta }}\) be an M-estimator, which might not be unique, based on \(\tilde{X}\). Then Proposition C.3 shows that \(\tilde{\hat{\beta }}_{J_{n}}\) is independent of the choice of \(\tilde{\hat{\beta }}\), and an invariance argument shows that
In the rest of the proof, we use \(\tilde{\cdot }\) to denote quantities obtained from \(\tilde{X}\). First we show that assumption A4 is not affected by this transformation. In fact, for any \(j\in J_{n}\), by definition we have
and hence the leave-j-th-predictor-out residuals are not changed, by Proposition C.2. This implies that \(\tilde{h}_{j, 0} = h_{j, 0}\) and \(\tilde{Q}_{j} = Q_{j}\). Recalling the definition of \(h_{j, 0}\), the first-order condition of \(\hat{\beta }\) entails that \(X^{T}h_{j, 0} = 0\). In particular, \(X_{J_{n}^{c}}^{T}h_{j, 0} = 0\), and this implies that for any \(\alpha \in \mathbb {R}^{n}\),
Thus,
Next we prove that assumption A5 is also unaffected by the transformation. The above argument has shown that
On the other hand, let \(B = \left( \begin{array}{cc} I_{|J_{n}|} & 0 \\ -B_{1} & B_{2} \end{array} \right) \); then B is non-singular and \(\tilde{X} = XB\). Let \(B_{(j),[j]}\) denote the matrix B after removing the j-th row and the j-th column. Then \(B_{(j),[j]}\) is also non-singular and \(\tilde{X}_{[j]} = X_{[j]}B_{(j), [j]}\). Recalling the definition of \(h_{j, 1, i}\), we have
On the other hand, by definition,
Thus,
In summary, for any \(j\in J_{n}\) and \(i\le n\),
Putting the pieces together we have
By Theorem 3.1,
provided that \(\tilde{X}\) satisfies assumption A3.
Now let \(U\varLambda V\) be the singular value decomposition of \(X_{J_{n}^{c}}\), where \(U\in \mathbb {R}^{n\times p}, \varLambda \in \mathbb {R}^{p\times p}, V \in \mathbb {R}^{p\times p}\) with \(U^{T}U = V^{T}V = I_{p}\) and \(\varLambda = \hbox {diag}(\nu _{1}, \ldots , \nu _{p})\) the diagonal matrix of singular values of \(X_{J_{n}^{c}}\). First we consider the case where \(X_{J_{n}^{c}}\) has full column rank, so that \(\nu _{j} > 0\) for all \(j\le p\). Let \(B_{1} = (X_{J_{n}}^{T}X_{J_{n}})^{-}X_{J_{n}}^{T}X_{J_{n}}\) and \(B_{2} = \sqrt{n / |J_{n}^{c}|}V^{T}\varLambda ^{-1}\). Then
This implies that
The assumption A3* implies that
By Theorem 3.1, we conclude that
Next we consider the case where \(X_{J_{n}^{c}}\) does not have full column rank. We first remove the redundant columns from \(X_{J_{n}^{c}}\), i.e. replace \(X_{J_{n}^{c}}\) by the matrix formed by a maximal linearly independent subset of its columns. Denote this matrix by \(\mathbf {X}\). Then \(\hbox {span}(X) = \hbox {span}(\mathbf {X})\) and \(\hbox {span}(\{X_{j}: j\not \in J_{n}\}) = \hbox {span}(\{\mathbf {X}_{j}: j\not \in J_{n}\})\). As a consequence of Propositions C.1 and C.3, neither \(\beta ^{*}_{J_{n}}\) nor \(\hat{\beta }_{J_{n}}\) is affected. Thus, the same reasoning as above applies to this case. \(\square \)
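To make the transformation concrete, here is a small numerical check in the least-squares case (least squares standing in for a general M-estimator; all dimensions and names are illustrative): replacing \(X_{J_{n}}\) by \(X_{J_{n}} - X_{J_{n}^{c}}B_{1}\) and \(X_{J_{n}^{c}}\) by \(X_{J_{n}^{c}}B_{2}\) leaves the fitted coefficients on \(J_{n}\) unchanged.

```python
import numpy as np

# Invariance of beta_hat on J under tilde X = X B with B = [[I, 0], [-B1, B2]].
rng = np.random.default_rng(3)
n, k, m = 80, 2, 4                             # |J| = k, |J^c| = m
XJ, XJc = rng.standard_normal((n, k)), rng.standard_normal((n, m))
y = XJ @ np.array([1.0, -2.0]) + XJc @ rng.standard_normal(m) + rng.standard_normal(n)

B1 = rng.standard_normal((m, k))
B2 = rng.standard_normal((m, m))               # invertible with probability 1
XJ_t, XJc_t = XJ - XJc @ B1, XJc @ B2          # transformed blocks

fit_J = lambda A1, A2: np.linalg.lstsq(np.hstack([A1, A2]), y, rcond=None)[0][:k]
print(fit_J(XJ, XJc))                          # coefficients on J
print(fit_J(XJ_t, XJc_t))                      # identical up to rounding
```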
1.3 Proofs of results in Section 3.3
First we prove two lemmas regarding the behavior of \(Q_{j}\); these lemmas are needed to justify assumption A4 in the examples.
Lemma C-1
Under assumptions A1 and A2,
where \(Q_{j} = \hbox {Cov}(h_{j, 0})\) is as defined in “Appendix B-1”.
Proof
(Proof of Lemma C-1) By definition,
where \(\mathbb {S}^{n - 1}\) is the unit sphere in \(\mathbb {R}^{n}\). For given \(\alpha \in \mathbb {S}^{n - 1}\),
It has been shown in (B-59) in “Appendix B-6.3” that
where \(G_{[j]} = I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}\). This yields that
By the standard Poincaré inequality (see Proposition B.2), since \(\varepsilon _{i} = u_{i}(W_{i})\),
We conclude from Lemma B-2 and (B-34) in “Appendix B-2” that
Therefore,
and hence
\(\square \)
Lemma C-2
Under assumptions A1–A3,
where \(K^{*} = \frac{K_{0}^{4}}{K_{1}^{2}}\cdot \left( \frac{n - p + 1}{n}\right) ^{2}\cdot \min _{i}\hbox {Var}(\varepsilon _{i})\).
Proof
This is a direct consequence of Lemma B-4 on p. 49. \(\square \)
Throughout the following proofs, we will use several results from random matrix theory to bound the largest and smallest singular values of Z; these results are collected in “Appendix E”. Furthermore, in contrast to other sections, in this section the notations \(P(\cdot ), \mathbb {E}(\cdot ), \hbox {Var}(\cdot )\) denote the probability, expectation and variance with respect to both \(\varepsilon \) and Z.
Proof
(Proof of Proposition 3.1) By Proposition E.3,
and thus assumption A3 holds with high probability. By the Hanson–Wright inequality ([27, 51]; see Proposition E.2), for any given deterministic matrix A,
for some universal constant c. Taking \(A = Q_{j}\) and conditioning on \(Z_{[j]}\), we know by Lemma C-1 that
and hence
Note that
By Lemma C-2, we conclude that
Setting \(t = \frac{1}{2}\tau ^{2}nK^{*}\) and taking the expectation of both sides over \(Z_{[j]}\), we obtain that
and hence
This entails that
Thus, assumption A4 is also satisfied with high probability. On the other hand, since \(Z_{j}\) has i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries, for any deterministic unit vector \(\alpha \in \mathbb {R}^{n}\), \(\alpha ^{T}Z_{j}\) is \(\sigma ^{2}\)-sub-gaussian and mean-zero, and hence
Let \(\alpha _{j, i} = h_{j, 1, i} / \Vert h_{j, 1, i}\Vert _{2}\) and \(\alpha _{j, 0} = h_{j, 0} / \Vert h_{j, 0}\Vert _{2}\). Since \(h_{j, 1, i}\) and \(h_{j, 0}\) are independent of \(Z_{j}\), a union bound then gives
By Fubini’s formula ([16], Lemma 2.2.8.),
This, together with Markov inequality, guarantees that assumption A5 is also satisfied with high probability. \(\square \)
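For intuition about the role of the Hanson–Wright inequality in the argument above, the following sketch (with an arbitrary PSD matrix A of our choosing) shows the quadratic form \(Z^{T}AZ\) concentrating around its mean \(\hbox {tr}(A)\).

```python
import numpy as np

# Z^T A Z concentrates around tr(A) for isotropic sub-gaussian Z.
rng = np.random.default_rng(4)
n, n_mc = 200, 2000
M = rng.standard_normal((n, n))
A = M @ M.T / n                                 # fixed PSD matrix
Z = rng.choice([-1.0, 1.0], size=(n_mc, n))     # Rademacher entries (sub-gaussian)

quad = np.einsum('ij,jk,ik->i', Z, A, Z)        # one quadratic form per draw
print(f"tr(A) = {np.trace(A):.1f}, mean = {quad.mean():.1f}, sd = {quad.std():.1f}")
```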
Proof
(Proof of Proposition 3.2) It is left to prove that assumption A3 holds with high probability; the proof for assumptions A4 and A5 is exactly the same as in the proof of Proposition 3.1. By Proposition E.4,
On the other hand, by Proposition E.7 [37],
and thus assumption A3 holds with high probability. \(\square \)
Proof
(Proof of Proposition 3.3) Since \(J_{n}\) excludes the intercept term, the proofs for assumptions A4 and A5 are the same as in Proposition 3.2. It is left to prove assumption A3. Let \(R_{1}, \ldots , R_{n}\) be i.i.d. Rademacher random variables, i.e. \(P(R_{i} = 1) = P(R_{i} = -1) = \frac{1}{2}\), and
Then \((Z^{*})^{T}Z^{*} = Z^{T}Z\). It is left to show that assumption A3 holds for \(Z^{*}\) with high probability. Note that
For any \(r \in \{1, -1\}\) and Borel sets \(B_{1}, \ldots , B_{p}\subset \mathbb {R}\),
where the last two lines use the symmetry of \(\tilde{Z}_{ij}\). We conclude that \(Z^{*}_{i}\) has independent entries; since the rows of \(Z^{*}\) are independent, \(Z^{*}\) has independent entries. Since the \(R_{i}\) are symmetric and sub-gaussian with unit variance and \(R_{i}\tilde{Z}_{ij}\overset{d}{=}\tilde{Z}_{ij}\), which is also symmetric and sub-gaussian with variance bounded from below, \(Z^{*}\) satisfies the conditions of Proposition 3.2 and hence assumption A3 is satisfied with high probability. \(\square \)
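The symmetrization device used in this proof is easy to check numerically; a minimal sketch:

```python
import numpy as np

# Flipping each row of Z by an independent Rademacher sign leaves Z^T Z unchanged.
rng = np.random.default_rng(5)
Z = rng.standard_normal((30, 5))
R = rng.choice([-1.0, 1.0], size=30)
Z_star = R[:, None] * Z                         # Z*_i = R_i * Z_i
print(np.allclose(Z_star.T @ Z_star, Z.T @ Z))  # True
```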
Proof
(Proof of Proposition 3.5 (with Proposition 3.4 being a special case)) Let \(Z_{*} = \varLambda ^{-\frac{1}{2}}Z\varSigma ^{-\frac{1}{2}}\), then \(Z_{*}\) has i.i.d. standard gaussian entries. By Proposition 3.3, \(Z_{*}\) satisfies assumption A3 with high probability. Thus,
and
As for assumption A4, the first step is to calculate \(\mathbb {E}(Z_{j}^{T}Q_{j}Z_{j} | Z_{[j]})\). Let \(\tilde{Z} = \varLambda ^{-\frac{1}{2}}Z\), then \(\hbox {vec}(\tilde{Z})\sim N(0, I\otimes \varSigma )\). As a consequence,
where
Thus,
where \(\mu _{j} = Z_{[j]}\varSigma _{[j], [j]}^{-1}\varSigma _{[j], j}\). It is easy to see that
It has been shown that \(Q_{j}\mu _{j} = 0\) and hence
Let \(\mathscr {Z}_{j} = \varLambda ^{-\frac{1}{2}}(Z_{j} - \mu _{j})\) and \(\tilde{Q}_{j} = \varLambda ^{\frac{1}{2}}Q_{j}\varLambda ^{\frac{1}{2}}\), then \(\mathscr {Z}_{j}\sim N(0, \sigma _{j}^{2}I)\) and
By Lemma C-1,
and hence
By Hanson-Wright inequality ([27, 51]; see Proposition E.2), we obtain a similar inequality to (C-69) as follows:
On the other hand,
By definition,
By Lemma C-2,
Similar to (C-70), we obtain that
Setting \(t = \frac{1}{2}\sigma _{j}^{2}nK^{*}\), we have
and a union bound together with (C-73) yields that
As for assumption A5, let
then for \(i = 0, 1, \ldots , p\),
Note that
using the same argument as in (C-72), we obtain that
and by Markov inequality and (C-73),
\(\square \)
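The key Gaussian computation in this proof is the conditional-mean formula defining \(\mu _{j}\); here is a quick Monte Carlo check (with an arbitrary covariance of our choosing) that \(Z_{j} - \mu _{j}\) is uncorrelated with the remaining columns.

```python
import numpy as np

# For rows drawn from N(0, Sigma): E(Z_j | Z_[j]) = Z_[j] Sigma_[j][j]^{-1} Sigma_[j],j.
rng = np.random.default_rng(6)
p, n = 4, 200_000
Arand = rng.standard_normal((p, p))
Sigma = Arand @ Arand.T / p + np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

j, rest = 0, [1, 2, 3]
coef = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
resid = Z[:, j] - Z[:, rest] @ coef             # Z_j - mu_j
cross_cov = resid @ Z[:, rest] / n              # empirical cross-covariance
print(np.abs(cross_cov).max())                  # ~ 0: uncorrelated
```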
Proof
(Proof of Proposition 3.6) The proof that assumptions A4 and A5 hold with high probability is exactly the same as the proof of Proposition 3.5. It is left to prove assumption A3*; see Corollary 3.1. Let \(c = (\min _{i}|(\varLambda ^{-\frac{1}{2}}\mathbf {1})_{i}|)^{-1}\) and \(\mathbf {Z} = (c\mathbf {1} \,\, \tilde{Z})\). Recalling the definitions of \(\tilde{\lambda }_{+}\) and \(\tilde{\lambda }_{-}\), we have
where
Rewrite \(\varSigma _{\{1\}}\) as
It is obvious that
As a consequence,
It remains to prove that
To prove this, we let
where \(\nu = c\varLambda ^{-\frac{1}{2}}\mathbf {1}\) and \(\tilde{Z}_{*} = \varLambda ^{-\frac{1}{2}}\tilde{Z}\varSigma ^{-\frac{1}{2}}\). Then
and
It is left to show that
By definition, \(\min _{i}|\nu _{i}| = 1\) and \(\max _{i}|\nu _{i}| = O\left( \mathrm {polyLog(n)}\right) \), then
Since \(\tilde{Z}_{*}\) has i.i.d. standard gaussian entries, by Proposition E.3,
Moreover, \(\Vert \nu \Vert _{2}^{2} \le n \max _{i}|\nu _{i}|^{2} = O(n\cdot \mathrm {polyLog(n)})\) and thus,
On the other hand, similar to Proposition 3.3,
where \(B_{1}, \ldots , B_{n}\) are i.i.d. Rademacher random variables. The same argument as in the proof of Proposition 3.3 implies that \(\mathbf {Z}_{*}\) has independent entries with sub-gaussian norm bounded by \(\Vert \nu \Vert _{\infty }^{2}\vee 1\) and variance lower bounded by 1. By Proposition E.7, \(\mathbf {Z}_{*}\) satisfies assumption A3 with high probability. Therefore, A3* holds with high probability. \(\square \)
Proof
(Proof of Proposition 3.7) Let \(\varLambda = (\lambda _{1}, \ldots , \lambda _{n})\) and let \(\mathscr {Z}\) be the matrix with entries \(\mathscr {Z}_{ij}\); then by Proposition 3.1 or Proposition 3.2, \(\mathscr {Z}\) satisfies assumption A3 with high probability. Notice that
and
Thus Z satisfies assumption A3 with high probability.
Conditioning on any realization of \(\varLambda \), the law of \(\mathscr {Z}_{ij}\) does not change, due to the independence between \(\varLambda \) and \(\mathscr {Z}\). Repeating the arguments in the proofs of Propositions 3.1 and 3.2, we can show that
where
Then
and
By Markov inequality, the assumption A5 is satisfied with high probability. \(\square \)
Proof
(Proof of Proposition 3.8) The concentration inequality for \(\zeta _{i}\), combined with a union bound, implies that
Thus, with high probability,
Let \(n' = \lfloor (1 - \delta )n\rfloor \) for some \(\delta \in (0, 1 - \kappa )\). Then for any subset I of \(\{1, \ldots , n\}\) with size \(n'\), by Proposition E.6 (Proposition E.7), under the conditions of Proposition 3.1 (Proposition 3.2), there exist constants \(c_{3}\) and \(c_{4}\), depending only on \(\kappa \), such that
where \(\mathscr {Z}_{I}\) denotes the sub-matrix of \(\mathscr {Z}\) formed by \(\{\mathscr {Z}_{i}: i \in I\}\), with \(\mathscr {Z}_{i}\) being the i-th row of \(\mathscr {Z}\). Then by a union bound,
By Stirling’s formula, there exists a constant \(c_{5} > 0\) such that
where \(\tilde{\delta } = n' / n\). For sufficiently small \(\delta \) and sufficiently large n,
and hence
for some \(c_{6} > 0\). By the Borel–Cantelli lemma,
On the other hand, since \(F^{-1}\) is continuous at \(\delta \),
where \(\zeta _{(k)}\) is the k-th largest of \(\{\zeta _{i}: i = 1, \ldots , n\}\). Let \(I^{*}\) be the set of indices corresponding to the largest \(\lfloor (1 - \delta ) n\rfloor \) of the \(\zeta _{i}\)'s. Then with probability 1,
To prove assumption A4, similar to (C-75) in the proof of Proposition 3.7, it is left to show that
Furthermore, by Lemma C-2, it remains to prove that
Recalling Eq. (B-60) in the proof of Lemma B-4, we have
By Proposition E.5,
On the other hand, applying (C-77) to \(\mathscr {Z}_{(i), [j]}\), we have
A union bound indicates that with probability \(1 - (c_{5}np + 2p)e^{-\min \{C_{2}, c_{6}\}n} = 1 - o(1)\),
This implies that for any j,
and for any i and j,
Moreover, as discussed above,
almost surely. Thus, it follows from (C-78) that with high probability,
The above bound holds for all diagonal elements of \(Q_{j}\) uniformly with high probability. Therefore,
As a result, assumption A4 is satisfied with high probability. Finally, by (C-76), we obtain that
By the Cauchy–Schwarz inequality,
Similar to (C-72), we conclude that
and by Markov inequality, the assumption A5 is satisfied with high probability. \(\square \)
1.4 More results on least squares (Section 5)
1.4.1 The relation between \(S_{j}(X)\) and \(\varDelta _{C}\)
In Sect. 5, we give a sufficient and almost necessary condition for the coordinate-wise asymptotic normality of the least-squares estimator \(\hat{\beta }^{LS}\); see Theorem 5.1. In this subsubsection, we show that \(\varDelta _{C}\) is a generalization of \(\max _{j\in J_{n}}S_{j}(X)\) to general M-estimators.
Consider the matrix \((X^{T}DX)^{-1}X^{T}\), where D is obtained from a general loss function; then by the block matrix inversion formula (see Proposition E.1),
where we use the approximation \(D \approx D_{[1]}\). Since the same result holds for all \(j\in J_{n}\),
Recalling that \(h_{j, 1, i}^{T}\) is the i-th row of \(I - D_{[1]}X_{[1]}(X_{[1]}^{T}D_{[1]}X_{[1]})^{-1}X_{[1]}^{T}\), we have
The right-hand side equals \(S_{j}(X)\) in the least-squares case. Therefore, although complicated in form, assumption A5 is not an artifact of the proof but essential for the asymptotic normality.
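In the least-squares case (\(D = I\)), the identity above reduces to the classical leave-one-predictor-out representation of a single coefficient (the Frisch–Waugh–Lovell theorem), which is easy to verify numerically; a quick sketch:

```python
import numpy as np

# beta_hat_j equals the regression of y on the part of Z_j orthogonal to Z_[j].
rng = np.random.default_rng(7)
n, p, j = 60, 5, 2
Z = rng.standard_normal((n, p))
y = Z @ rng.standard_normal(p) + rng.standard_normal(n)

beta = np.linalg.lstsq(Z, y, rcond=None)[0]
Zmj = np.delete(Z, j, axis=1)                   # Z_[j]
Hj = Zmj @ np.linalg.solve(Zmj.T @ Zmj, Zmj.T)  # projection onto span(Z_[j])
z_perp = Z[:, j] - Hj @ Z[:, j]                 # (I - H_j) Z_j
print(beta[j], z_perp @ y / (z_perp @ z_perp))  # equal
```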
1.4.2 Additional examples
Benefiting from the closed form of the least-squares estimator, we can dispense with sub-gaussianity of the entries. The following proposition shows that a random design matrix Z with independent entries satisfying appropriate moment conditions has \(\max _{j\in J_{n}}S_{j}(Z) = o(1)\) with high probability. This implies that, when X is a realization of Z, the conditions of Theorem 5.1 are satisfied for X with high probability over Z.
Proposition C.4
If \(\{Z_{ij}: i\le n, j\in J_{n}\}\) are independent random variables with
1. \(\max _{i\le n, j\in J_{n}}(\mathbb {E}|Z_{ij}|^{8 + \delta })^{\frac{1}{8 + \delta }}\le M\) for some \(\delta , M > 0\);
2. \(\min _{i\le n, j\in J_{n}}\hbox {Var}(Z_{ij}) > \tau ^{2}\) for some \(\tau > 0\);
3. \(P(Z \hbox { has full column rank}) = 1 - o(1)\);
4. \(\mathbb {E}Z_{j} \in \hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\) almost surely for all \(j\in J_{n}\);
where \(Z_{j}\) is the j-th column of Z. Then
A typical example of practical interest is one where Z contains an intercept term, which is not in \(J_{n}\), and \(Z_{j}\) has i.i.d. entries with a continuous distribution and sufficiently many moments for each \(j\in J_{n}\). In this case the first three conditions are easily checked, and \(\mathbb {E}Z_{j}\) is a multiple of \((1, \ldots , 1)\), which belongs to \(\hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\).
In fact, condition 4 allows Proposition C.4 to cover more general cases than the one above. For example, in a census study, a state-specific fixed effect might be added to the model, i.e.
where \(s_{i}\) represents the state of subject i. In this case, Z contains a sub-block formed by the \(z_{i}\) and a sub-block of ANOVA form as mentioned in Example 1. The latter is usually incorporated only to adjust for group bias and is not the target of inference. Then condition 4 is satisfied as long as \(Z_{ij}\) has the same mean within each group for each j, i.e. \(\mathbb {E}Z_{ij} = \mu _{s_{i}, j}\); a minimal sketch of this setting is given below.
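The sketch (all names and dimensions are illustrative) checks that the dummy columns span the group means, so \(\mathbb {E}Z_{j}\) lies in \(\hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\).

```python
import numpy as np

# Covariate with state-specific means: E z = D @ mu lies in the span of the
# state-dummy columns D, so condition 4 of Proposition C.4 holds.
rng = np.random.default_rng(10)
n, n_states = 12, 3
states = rng.integers(n_states, size=n)
D = np.eye(n_states)[states]                   # ANOVA block: one dummy per state
mu = np.array([0.5, -1.0, 2.0])                # state-specific means
z = mu[states] + rng.standard_normal(n)        # one covariate column of Z_J
print(np.allclose(mu[states], D @ mu))         # E z is exactly D @ mu: True
```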
Proof
(Proof of Proposition C.4) By the Sherman–Morrison–Woodbury formula,
where \(H_{j} = Z_{[j]}(Z_{[j]}^{T}Z_{[j]})^{-1}Z_{[j]}^{T}\) is the projection matrix generated by \(Z_{[j]}\). Then
As in the proofs of the other examples, the strategy is to show that the numerator, a linear contrast of \(Z_{j}\), and the denominator, a quadratic form of \(Z_{j}\), are both concentrated around their means. Specifically, we will show that there exist constants \(C_{1}\) and \(C_{2}\) such that
If (C-80) holds, since \(H_{j}\) is independent of \(Z_{j}\) by assumptions, we have
Thus with probability \(1 - o(|J_{n}| / n) = 1 - o(1)\),
and hence
Now we prove (C-80). The proof, though messy in appearance, is essentially the same as the proofs for the other examples. Instead of relying on the exponential concentration given by sub-gaussianity, we establish concentration via higher-order moments.
In fact, for any idempotent matrix A, the sum of squares of each row is bounded by 1, since
By Jensen’s inequality,
For any j, by Rosenthal's inequality [48], there exists a universal constant C such that
Let \(C_{1}= (2CM^{8 + \delta })^{\frac{1}{8 + \delta }}\); then for given i, by Markov inequality,
and a union bound implies that
Now we derive a bound for \(Z_{j}^{T}AZ_{j}\). Since \(p/n \rightarrow \kappa \in (0, 1)\), there exists \(\tilde{\kappa }\in (0, 1 - \kappa )\) such that \(n - p > \tilde{\kappa } n\). Then
To bound the tail probability, we need the following result.
Lemma C-3
[2, Lemma 6.2] Let B be an \(n\times n\) nonrandom matrix and \(W = (W_{1}, \ldots , W_{n})^{T}\) be a random vector of independent entries. Assume that \(\mathbb {E}W_{i} = 0\), \(\mathbb {E}W_{i}^{2} = 1\) and \(\mathbb {E}|W_{i}|^{k}\le \nu _{k}\). Then, for any \(q\ge 1\),
where \(C_{q}\) is a constant depending on q only.
It is easy to extend Lemma C-3 to the non-isotropic case by rescaling. In fact, denote by \(\sigma _{i}^{2}\) the variance of \(W_{i}\), and let \(\varSigma = \hbox {diag}(\sigma _{1}^{2}, \ldots , \sigma _{n}^{2})\), \(Y = (W_{1} / \sigma _{1}, \ldots , W_{n} / \sigma _{n})\). Then
with \(\hbox {Cov}(Y) = I\). Let \(\tilde{B} = \varSigma ^{\frac{1}{2}}B\varSigma ^{\frac{1}{2}}\), then
This entails that
On the other hand,
Thus we obtain the following result:
Lemma C-4
Let B be an \(n\times n\) nonrandom matrix and \(W = (W_{1}, \ldots , W_{n})^{T}\) be a random vector of independent mean-zero entries. Suppose \(\mathbb {E}|W_{i}|^{k}\le \nu _{k}\), then for any \(q\ge 1\),
where \(C_{q}\) is a constant depending on q only.
Applying Lemma C-4 with \(W = Z_{j}\), \(B = A\) and \(q = 4 + \delta / 2\), we obtain that
for some constant C. Since A is idempotent, all eigenvalues of A are either 1 or 0, and thus \(AA^{T}\preceq I\). This implies that
and hence
for some constant \(C_{1}\), which depends only on M. By Markov inequality,
Combining with (C-84), we conclude that
where \(C_{2} = \frac{\tilde{\kappa }\tau ^{2}}{2}\). Notice that neither (C-83) nor (C-85) depends on j or A. Therefore, (C-80) is proved, and hence the proposition. \(\square \)
Additional numerical experiments
In this section, we repeat the experiments in Sect. 6 using the \(L_{1}\) loss, i.e. \(\rho (x) = |x|\). The \(L_{1}\) loss is not smooth and does not satisfy our technical conditions. The results are displayed below; the performance is quite similar to that with the Huber loss (Figs. 5, 6, 7).
Miscellaneous
In this appendix we state several technical results for the sake of completeness.
Proposition E.1
([28], formula (0.8.5.6)) Let \(A\in \mathbb {R}^{p\times p}\) be an invertible matrix and write A as a block matrix
with \(A_{11}\in \mathbb {R}^{p_{1}\times p_{1}}, A_{22}\in \mathbb {R}^{(p - p_{1})\times (p -p_{1})}\) being invertible matrices. Then
where \(S = A_{22} - A_{21}A_{11}^{-1}A_{12}\) is the Schur complement.
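Proposition E.1 is easy to verify numerically; the following sketch checks the top-left block of the inverse against the Schur-complement expression.

```python
import numpy as np

# (A^{-1})_{11} = A11^{-1} + A11^{-1} A12 S^{-1} A21 A11^{-1},
# with S = A22 - A21 A11^{-1} A12 the Schur complement.
rng = np.random.default_rng(8)
p, p1 = 6, 2
A = rng.standard_normal((p, p)) + p * np.eye(p)     # invertible blocks w.h.p.
A11, A12, A21, A22 = A[:p1, :p1], A[:p1, p1:], A[p1:, :p1], A[p1:, p1:]

A11_inv = np.linalg.inv(A11)
S = A22 - A21 @ A11_inv @ A12                       # Schur complement
block_11 = A11_inv + A11_inv @ A12 @ np.linalg.solve(S, A21 @ A11_inv)
print(np.allclose(np.linalg.inv(A)[:p1, :p1], block_11))  # True
```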
Proposition E.2
([51]; improved version of the original form by [27]) Let \(X = (X_{1}, \ldots , X_{n})\in \mathbb {R}^{n}\) be a random vector with independent mean-zero \(\sigma ^{2}\)-sub-gaussian components \(X_{i}\). Then, for every t,
Proposition E.3
[3] If \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are i.i.d. random variables with zero mean, unit variance and finite fourth moment and \(p / n\rightarrow \kappa \), then
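The Bai–Yin limits of Proposition E.3 are visible already at moderate sizes; a quick sketch:

```python
import numpy as np

# Extreme singular values of an n x p i.i.d. matrix scale as sqrt(n)(1 +/- sqrt(kappa)).
rng = np.random.default_rng(9)
n, p = 4000, 1000                                  # kappa = 1/4
s = np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False)
kappa = p / n
print(f"s_max/sqrt(n) = {s[0]/np.sqrt(n):.3f} vs 1+sqrt(kappa) = {1+np.sqrt(kappa):.3f}")
print(f"s_min/sqrt(n) = {s[-1]/np.sqrt(n):.3f} vs 1-sqrt(kappa) = {1-np.sqrt(kappa):.3f}")
```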
Proposition E.4
[35] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent mean-zero random variables with finite fourth moment, then
for some universal constant C. In particular, if \(\mathbb {E}Z_{ij}^{4}\) are uniformly bounded, then
Proposition E.5
[50] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent mean-zero \(\sigma ^{2}\)-sub-gaussian random variables. Then there exist universal constants \(C_{1}, C_{2} > 0\) such that
Proposition E.6
[49] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are i.i.d. \(\sigma ^{2}\)-sub-gaussian random variables with zero mean and unit variance, then for \(\varepsilon \ge 0\)
for some universal constants C and c.
Proposition E.7
[37] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent \(\sigma ^{2}\)-sub-gaussian random variables such that
for some \(\sigma , \tau > 0\), and \(p / n\rightarrow \kappa \in (0, 1)\), then there exist constants \(c_{1}, c_{2} > 0\), which depend only on \(\sigma \) and \(\tau \), such that
Keywords
- M-estimation
- Robust regression
- High-dimensional statistics
- Second order Poincaré inequality
- Leave-one-out analysis