Abstract
We investigate the asymptotic distributions of coordinates of regression M-estimates in the moderate p / n regime, where the number of covariates p grows proportionally with the sample size n. Under appropriate regularity conditions, we establish the coordinate-wise asymptotic normality of regression M-estimates assuming a fixed-design matrix. Our proof is based on the second-order Poincaré inequality and leave-one-out analysis. We present several relevant examples to show that our regularity conditions are satisfied by a broad class of design matrices. We also give a counterexample, namely an ANOVA-type design, to emphasize that the technical assumptions are not just artifacts of the proof. Finally, numerical experiments confirm and complement our theoretical results.
References
Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, New York (1962)
Bai, Z., Silverstein, J.W.: Spectral Analysis of Large Dimensional Random Matrices, vol. 20. Springer, Berlin (2010)
Bai, Z., Yin, Y.: Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3), 1275–1294 (1993)
Baranchik, A.: Inadmissibility of maximum likelihood estimators in some multiple regression problems with three or more independent variables. Ann. Stat. 1(2), 312–321 (1973)
Bean, D., Bickel, P.J., El Karoui, N., Lim, C., Yu, B.: Penalized robust regression in high-dimension. Technical Report 813, Department of Statistics, UC Berkeley (2012)
Bean, D., Bickel, P.J., El Karoui, N., Yu, B.: Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. 110(36), 14563–14568 (2013)
Bickel, P.J., Doksum, K.A.: Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, vol. 117. CRC Press, Boca Raton (2015)
Bickel, P.J., Freedman, D.A.: Some asymptotic theory for the bootstrap. Ann. Stat. 9(6), 1196–1217 (1981)
Bickel, P.J., Freedman, D.A.: Bootstrapping regression models with many parameters. In: A Festschrift for Erich L. Lehmann, pp. 28–48 (1983)
Chatterjee, S.: Fluctuations of eigenvalues and second order Poincaré inequalities. Probab. Theory Relat. Fields 143(1–2), 1–40 (2009)
Chernoff, H.: A note on an inequality involving the normal distribution. Ann. Probab. 9(3), 533–535 (1981)
Cizek, P., Härdle, W.K., Weron, R.: Statistical Tools for Finance and Insurance. Springer, Berlin (2005)
Cochran, W.G.: Sampling Techniques. Wiley, Hoboken (1977)
David, H.A., Nagaraja, H.N.: Order Statistics. Wiley, Hoboken (1981)
Donoho, D., Montanari, A.: High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 166, 935–969 (2016)
Durrett, R.: Probability: Theory and Examples. Cambridge University Press, Cambridge (2010)
Efron, B.: The Jackknife, the Bootstrap and Other Resampling Plans, vol. 38. SIAM, Philadelphia (1982)
El Karoui, N.: Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 19(6), 2362–2405 (2009)
El Karoui, N.: High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Stat. 38(6), 3487–3566 (2010)
El Karoui, N.: Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445 (2013)
El Karoui, N.: On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields, pp. 1–81 (2015)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On Robust Regression with High-Dimensional Predictors. Technical Report 811, Department of Statistics, UC Berkeley (2011)
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B.: On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. 110(36), 14557–14562 (2013)
El Karoui, N., Purdom, E.: Can We Trust the Bootstrap in High-Dimension? Technical Report 824. Department of Statistics, UC Berkeley (2015)
Esseen, C.G.: Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law. Acta Math. 77(1), 1–125 (1945)
Geman, S.: A limit theorem for the norm of random matrices. Ann. Probab. 8(2), 252–261 (1980)
Hanson, D.L., Wright, F.T.: A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Stat. 42(3), 1079–1083 (1971)
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Huber, P.J.: The 1972 Wald lecture. Robust statistics: a review. Ann. Math. Stat. 43(4), 1041–1067 (1972)
Huber, P.J.: Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Stat. 1(5), 799–821 (1973)
Huber, P.J.: Robust Statistics. Wiley, New York (1981)
Johnstone, I.M.: On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 29(2), 295–327 (2001)
Jurečkovà, J., Klebanov, L.B.: Inadmissibility of robust estimators with respect to \(L_1\) norm. In: Dodge, Y. (ed.) \(L_1\)-Statistical Procedures and Related Topics. Lecture Notes–Monograph Series, vol. 31, pp. 71–78. Institute of Mathematical Statistics, Hayward (1997)
Latała, R.: Some estimates of norms of random matrices. Proc. Am. Math. Soc. 133(5), 1273–1282 (2005)
Ledoux, M.: The Concentration of Measure Phenomenon, vol. 89. American Mathematical Society, Providence (2001)
Litvak, A.E., Pajor, A., Rudelson, M., Tomczak-Jaegermann, N.: Smallest singular value of random matrices and geometry of random polytopes. Adv. Math. 195(2), 491–523 (2005)
Mallows, C.: A note on asymptotic joint normality. Ann. Math. Stat. 43(2), 508–515 (1972)
Mammen, E.: Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 17(1), 382–400 (1989)
Marčenko, V.A., Pastur, L.A.: Distribution of eigenvalues for some sets of random matrices. Math. USSR Sbornik 1(4), 457 (1967)
Muirhead, R.J.: Aspects of Multivariate Statistical Theory, vol. 197. Wiley, Hoboken (1982)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^{2}/n\) is large. I. Consistency. Ann. Stat. 12(4), 1298–1309 (1984)
Portnoy, S.: Asymptotic behavior of M-estimators of \(p\) regression parameters when \(p^{2} / n\) is large. II. Normal approximation. Ann. Stat. 13(4), 1403–1417 (1985)
Portnoy, S.: On the central limit theorem in \(\mathbb{R}^{p}\) when \(p\rightarrow \infty \). Probab. Theory Relat. Fields 73(4), 571–583 (1986)
Portnoy, S.: A central limit theorem applicable to robust regression estimators. J. Multivar. Anal. 22(1), 24–50 (1987)
Posekany, A., Felsenstein, K., Sykacek, P.: Biological assessment of robust noise models in microarray data analysis. Bioinformatics 27(6), 807–814 (2011)
Relles, D.A.: Robust Regression by Modified Least-Squares. Technical reports, DTIC Document (1967)
Rosenthal, H.P.: On the subspaces of \(l^{p} (p > 2)\) spanned by sequences of independent random variables. Isr. J. Math. 8(3), 273–303 (1970)
Rudelson, M., Vershynin, R.: Smallest singular value of a random rectangular matrix. Commun. Pure Appl. Math. 62(12), 1707–1739 (2009)
Rudelson, M., Vershynin, R.: Non-asymptotic theory of random matrices: extreme singular values. arXiv preprint arXiv:1003.2990 (2010)
Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-gaussian concentration. Electron. Commun. Probab. 18(82), 1–9 (2013)
Scheffé, H.: The Analysis of Variance, vol. 72. Wiley, Hoboken (1999)
Silverstein, J.W.: The smallest eigenvalue of a large dimensional Wishart matrix. Ann. Probab. 13(4), 1364–1368 (1985)
Stone, M.: Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B (Methodolog) 36(2), 111–147 (1974)
Tyler, D.E.: A distribution-free M-estimator of multivariate scatter. Ann. Stat. 15(1), 234–251 (1987)
Van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010)
Wachter, K.W.: Probability plotting points for principal components. In: Ninth Interface Symposium Computer Science and Statistics, pp. 299–308. Prindle, Weber and Schmidt, Boston (1976)
Wachter, K.W.: The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Probab. 6(1), 1–18 (1978)
Wasserman, L., Roeder, K.: High dimensional variable selection. Ann. Stat. 37(5A), 2178 (2009)
Yohai, V.J.: Robust M-Estimates for the General Linear Model. Universidad Nacional de la Plata. Departamento de Matematica (1972)
Yohai, V.J., Maronna, R.A.: Asymptotic behavior of M-estimators for the linear model. Ann. Stat. 7(2), 258–268 (1979)
Additional information
Peter J. Bickel and Lihua Lei gratefully acknowledge support from NSF DMS-1160319 and NSF DMS-1713083. Noureddine El Karoui gratefully acknowledges support from NSF grant DMS-1510172. He would also like to thank Criteo for providing a great research environment.
Appendices
Proof sketch of Lemma 4.4
In this Appendix, we provide a roadmap for proving Lemma 4.4 by considering a special case where X is one realization of a random matrix Z with i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries. Random matrix theory [3, 26, 53] implies that \(\lambda _{+}= (1 + \sqrt{\kappa })^{2} + o_{p}(1) = O_{p}(1)\) and \(\lambda _{-}= (1 - \sqrt{\kappa })^{2} + o_{p}(1) = \varOmega _{p}(1)\). Thus assumption A3 is satisfied with high probability, and hence Lemma 4.3 (p. 17) holds with high probability. It remains to prove the following lemma to obtain Theorem 3.1.
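As a quick numerical illustration of these random matrix facts (our sketch, not part of the paper; we take Gaussian entries with unit variance for convenience), one can check that the extreme eigenvalues of \(Z^{T}Z/n\) land near the Bai–Yin limits \((1\pm \sqrt{\kappa })^{2}\):

```python
import numpy as np

# Minimal check of the Bai-Yin limits for lambda_+ and lambda_- of Z^T Z / n.
# Gaussian entries are one convenient sub-gaussian choice; kappa = p / n.
rng = np.random.default_rng(0)
n, p = 4000, 800
kappa = p / n

Z = rng.standard_normal((n, p))
eigvals = np.linalg.eigvalsh(Z.T @ Z / n)

print(f"lambda_+ = {eigvals.max():.3f}, predicted {(1 + np.sqrt(kappa))**2:.3f}")
print(f"lambda_- = {eigvals.min():.3f}, predicted {(1 - np.sqrt(kappa))**2:.3f}")
```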
Lemma A-1
Let Z be a random matrix with i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries and X be one realization of Z. Then under assumptions A1 and A2,
where \(M_{j}\) is defined in (11) on p. 17, and the randomness in \(o_{p}(\cdot )\) and \(O_{p}(\cdot )\) comes from Z.
Note that we prove in Proposition 3.1 that assumptions A4 and A5 are satisfied with high probability in this case. However, we will not use them directly but prove Lemma A-1 from scratch instead, in order to clarify why assumptions in the form of A4 and A5 are needed in the proof.
1.1 Upper bound of \(M_{j}\)
First by Proposition E.3,
In the rest of the proof, the symbols \(\mathbb {E}\) and \(\hbox {Var}\) denote the expectation and the variance conditional on Z. Let \(\tilde{Z} = D^{\frac{1}{2}}Z\); then \(M_{j} = \mathbb {E}\Vert e_{j}^{T}(\tilde{Z}^{T}\tilde{Z})^{-1}\tilde{Z}^{T}\Vert _{\infty }\). Let \(\tilde{H}_{j} = I - \tilde{Z}_{[j]}(\tilde{Z}_{[j]}^{T}\tilde{Z}_{[j]})^{-1}\tilde{Z}_{[j]}^{T}\); then we apply the block matrix inversion formula, which we state as Proposition E.1 in “Appendix E”.
This implies that
Since \(Z^{T}DZ / n\succeq K_{0}\lambda _{-}I\), we have
and we obtain a bound for \(M_{j}\) as
Similarly,
The vector in the numerator is a linear contrast of \(Z_{j}\), and \(Z_{j}\) has i.i.d. mean-zero sub-gaussian entries. For any fixed matrix \(A\in \mathbb {R}^{n\times n}\), denote by \(A_{k}\) its k-th column; then \(A_{k}^{T}Z_{j}\) is \(\sigma ^{2}\Vert A_{k}\Vert _{2}^{2}\)-sub-gaussian (see Section 5.2.3 of [57] for a detailed discussion) and hence, by the definition of sub-gaussianity,
Therefore, by a simple union bound, we conclude that
Let \(t = 2\sqrt{\log n}\),
This entails that
with high probability. In \(M_{j}\), the coefficient matrix \((I - H_{j})D^{\frac{1}{2}}\) depends on \(Z_{j}\) through D and hence we cannot use (A-3) directly. However, the dependence can be removed by replacing D by \(D_{[j]}\) since \(r_{i, [j]}\) does not depend on \(Z_{j}\).
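For concreteness, the union-bound step can be spelled out as follows (a standard sketch we include here; the exact constants play no role): for each k, the sub-gaussian tail bound gives \(\mathbb {P}(|A_{k}^{T}Z_{j}| > \sigma \Vert A_{k}\Vert _{2}t)\le 2e^{-t^{2}/2}\), so that
$$\begin{aligned} \mathbb {P}\left( \max _{k\le n}|A_{k}^{T}Z_{j}| > \sigma t\max _{k\le n}\Vert A_{k}\Vert _{2}\right) \le 2ne^{-t^{2}/2} = \frac{2}{n} \quad \text {with } t = 2\sqrt{\log n}, \end{aligned}$$
i.e. \(\max _{k}|A_{k}^{T}Z_{j}| = O(\sqrt{\log n})\cdot \sigma \max _{k}\Vert A_{k}\Vert _{2}\) with high probability.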
Since Z has i.i.d. sub-gaussian entries, no column is highly influential. In other words, the estimator does not change drastically after removing the j-th column, which suggests \(R_{i}\approx r_{i, [j]}\). It is proved by [20] that
It can be rigorously proved that
where \(H_{j} = I - D_{[j]}^{\frac{1}{2}}Z_{[j]}(Z_{[j]}^{T}D_{[j]}Z_{[j]})^{-1}Z_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\); see “Appendix A-1” for details. Since \(D_{[j]}(I - H_{j})\) is independent of \(Z_{j}\) and
it follows from (A-2) and (A-3) that
In summary,
1.2 Lower bound of \(\hbox {Var}(\hat{\beta }_{j})\)
1.2.1 Approximating \(\hbox {Var}(\hat{\beta }_{j})\) by \(\hbox {Var}(b_{j})\)
It is shown by [20] that
where
It has been shown by [20] that
Thus, \(\hbox {Var}(\hat{\beta }_{j})\approx \hbox {Var}(b_{j})\) and a more refined calculation in “Appendix A-2.1” shows that
It is left to show that
1.2.2 Bounding \(\hbox {Var}(b_{j})\) via \(\hbox {Var}(N_{j})\)
By definition of \(b_{j}\),
As will be shown in “Appendix B-6.4”,
As a result, \(\xi _{j}\approx \mathbb {E}\xi _{j}\) and
As in the previous paper [20], we rewrite \(\xi _{j}\) as
The middle matrix is idempotent and hence positive semi-definite. Thus,
Then we obtain that
and it is left to show that
1.2.3 Bounding \(\hbox {Var}(N_{j})\) via \(\hbox {tr}(Q_{j})\)
Recalling the definition of \(N_{j}\) in (A-5) and that of \(Q_{j}\) (see Sect. 3.1 on p. 8), we have
Notice that \(Z_{j}\) is independent of \(r_{i, [j]}\) and hence the conditional distribution of \(Z_{j}\) given \(Q_{j}\) is the same as the marginal distribution of \(Z_{j}\). Since \(Z_{j}\) has i.i.d. sub-gaussian entries, the Hanson–Wright inequality ([27, 51]), which we state as Proposition E.2, implies that the quadratic form \(Z_{j}^{T}Q_{j}Z_{j}\) is concentrated around its mean, i.e.
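For reference, the Hanson–Wright inequality in the form of [51] states that if \(Z_{j}\) has independent mean-zero K-sub-gaussian entries and \(Q_{j}\) is a fixed symmetric matrix, then for every \(t\ge 0\),
$$\begin{aligned} \mathbb {P}\left( \left| Z_{j}^{T}Q_{j}Z_{j} - \mathbb {E}Z_{j}^{T}Q_{j}Z_{j}\right| > t\right) \le 2\exp \left( -c\min \left( \frac{t^{2}}{K^{4}\Vert Q_{j}\Vert _{F}^{2}}, \frac{t}{K^{2}\Vert Q_{j}\Vert _{\mathrm {op}}}\right) \right) \end{aligned}$$
for a universal constant \(c > 0\); here it is applied conditionally on \(Q_{j}\), which is legitimate by the independence noted above.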
As a consequence, it is left to show that
1.2.4 Lower bound of \(\hbox {tr}(Q_{j})\)
By definition of \(Q_{j}\),
To lower bound the variance of \(\psi (r_{i, [j]})\), recall that for any random variable W,
where \(W'\) is an independent copy of W. Suppose \(g: \mathbb {R}\rightarrow \mathbb {R}\) is a function such that \(|g'(x)|\ge c\) for all x, then (A-9) implies that
In other words, (A-10) entails that \(\hbox {Var}(W)\) is a lower bound for \(\hbox {Var}(g(W))\) provided that the derivative of g is bounded away from 0. As an application, we see that
and hence
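In detail, (A-9) is the two-point identity \(\hbox {Var}(W) = \frac{1}{2}\mathbb {E}(W - W')^{2}\), and the mechanism behind (A-10) is the elementary chain (our one-line reconstruction):
$$\begin{aligned} \hbox {Var}(g(W)) = \frac{1}{2}\mathbb {E}(g(W) - g(W'))^{2}\ge \frac{c^{2}}{2}\mathbb {E}(W - W')^{2} = c^{2}\,\hbox {Var}(W), \end{aligned}$$
where the inequality uses \(|g(w) - g(w')| = |g'(\nu )|\cdot |w - w'|\ge c|w - w'|\) for some \(\nu \in ]w, w'[\).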
By the variance decomposition formula,
where \(\varepsilon _{(i)}\) includes all but i-th entry of \(\varepsilon \). Given \(\varepsilon _{(i)}\), \(r_{i, [j]}\) is a function of \(\varepsilon _{i}\). Using (A-10), we have
This implies that
Summing \(\hbox {Var}(r_{i, [j]})\) over \(i = 1, \ldots , n\), we obtain that
It will be shown in “Appendix B-6.3” that under assumptions A1–A3,
This proves (A-8) and as a result,
Proof of Theorem 3.1
1.1 Notation
To be self-contained, we summarize our notation in this subsection. The model we consider here is \(y = X\beta ^{*}+ \varepsilon \),
where \(X\in \mathbb {R}^{n\times p}\) is the design matrix and \(\varepsilon \) is a random vector with independent entries. Since the target quantity \(\frac{\hat{\beta }_{j} - \mathbb {E}\hat{\beta }_{j}}{\sqrt{\hbox {Var}(\hat{\beta }_{j})}}\) is shift invariant, we may assume \(\beta ^{*}= 0\) without loss of generality provided that X has full column rank; see Sect. 3.1 for details.
Let \(x_{i}^{T}\in \mathbb {R}^{1\times p}\) denote the i-th row of X and \(X_{j}\in \mathbb {R}^{n\times 1}\) the j-th column of X. Throughout the paper we denote by \(X_{ij}\in \mathbb {R}\) the (i, j)-th entry of X, by \(X_{(i)}\in \mathbb {R}^{(n-1)\times p}\) the design matrix X after removing the i-th row, by \(X_{[j]}\in \mathbb {R}^{n\times (p-1)}\) the design matrix X after removing the j-th column, by \(X_{(i), [j]}\in \mathbb {R}^{(n-1)\times (p-1)}\) the design matrix after removing both the i-th row and the j-th column, and by \(x_{i, [j]}\in \mathbb {R}^{1\times (p-1)}\) the vector \(x_{i}\) after removing the j-th entry. The M-estimator \(\hat{\beta }\) associated with the loss function \(\rho \) is defined as
Similarly we define the leave-j-th-predictor-out version as
Based on this notation, we define the full residual \(R_{k}\) as
the leave-j-th-predictor-out residual as
Four diagonal matrices are defined as
Further we define G and \(G_{[j]}\) as
Let \(J_{n}\) denote the indices of coefficients of interest. We say \(a\in ]a_{1}, a_{2}[\) if and only if \(a\in [\min \{a_{1}, a_{2}\}, \max \{a_{1}, a_{2}\}]\). Regarding the technical assumptions, we need the following quantities. Let \(\lambda _{+}\) (resp. \(\lambda _{-}\))
be the largest (resp. smallest) eigenvalue of the matrix \(\frac{X^{T}X}{n}\). Let \(e_{i}\in \mathbb {R}^{n}\) be the i-th canonical basis vector and
Finally, let
We adopt Landau's notation (\(O(\cdot ), o(\cdot ), O_{p}(\cdot ), o_{p}(\cdot )\)). In addition, we say \(a_{n} = \varOmega (b_{n})\) if \(b_{n} = O(a_{n})\) and, similarly, \(a_{n} = \varOmega _{p}(b_{n})\) if \(b_{n} = O_{p}(a_{n})\). To simplify the logarithmic factors, we use the symbol \(\mathrm {polyLog(n)}\) to denote any factor that can be upper bounded by \((\log n)^{\gamma }\) for some \(\gamma > 0\). Similarly, we use \(\frac{1}{\mathrm {polyLog(n)}}\) to denote any factor that can be lower bounded by \(\frac{1}{(\log n)^{\gamma '}}\) for some \(\gamma ' > 0\).
Finally we restate all the technical assumptions:
- A1: \(\rho (0) = \psi (0) = 0\) and there exist \(K_{0} = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \) and \(K_{1}, K_{2} = O\left( \mathrm {polyLog(n)}\right) \) such that for any \(x\in \mathbb {R}\),
$$\begin{aligned} K_{0} \le \psi '(x)\le K_{1}, \quad \bigg |\frac{d}{dx}(\sqrt{\psi '}(x))\bigg | = \frac{|\psi ''(x)|}{\sqrt{\psi '(x)}}\le K_{2}; \end{aligned}$$
- A2: \(\varepsilon _{i} = u_{i}(W_{i})\) where \((W_{1}, \ldots , W_{n})\sim N(0, I_{n\times n})\) and the \(u_{i}\) are smooth functions with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\) and \(\Vert u''_{i}\Vert _{\infty }\le c_{2}\) for some \(c_{1}, c_{2} = O(\mathrm {polyLog(n)})\); moreover, \(\min _{i}\hbox {Var}(\varepsilon _{i}) = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A3: \(\lambda _{+}= O(\mathrm {polyLog(n)})\) and \(\lambda _{-}= \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A4: \(\min _{j\in J_{n}}\frac{X_{j}^{T}Q_{j}X_{j}}{\hbox {tr}(Q_{j})} = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \);
- A5: \(\mathbb {E}\varDelta _{C}^{8} = O\left( \mathrm {polyLog(n)}\right) \).
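For concreteness, one loss satisfying A1 with constants free of n (an illustrative example of ours, not one prescribed by the paper) is \(\rho (x) = x^{2}/2 + \log \cosh x\), for which
$$\begin{aligned} \psi (x) = x + \tanh x, \quad \psi '(x) = 1 + \mathrm {sech}^{2}x\in [1, 2], \quad \frac{|\psi ''(x)|}{\sqrt{\psi '(x)}} = \frac{2\,\mathrm {sech}^{2}x\,|\tanh x|}{\sqrt{1 + \mathrm {sech}^{2}x}}\le 2, \end{aligned}$$
so one may take \(K_{0} = 1\), \(K_{1} = 2\) and \(K_{2} = 2\); moreover \(\rho (0) = \psi (0) = 0\) holds.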
1.2 Deterministic approximation results
In “Appendix A”, we use several approximations under random designs, e.g. \(R_{i}\approx r_{i, [j]}\). To prove them, we follow the strategy of [20], which establishes deterministic results and then applies concentration inequalities to obtain high probability bounds. Since \(\hat{\beta }\) is the solution of \(f(\beta ) = \frac{1}{n}\sum _{i=1}^{n}x_{i}\psi (\varepsilon _{i} - x_{i}^{T}\beta ) = 0\),
we need the following key lemma to bound \(\Vert \beta _{1} - \beta _{2}\Vert _{2}\) by \(\Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\), which can be calculated explicitly.
Lemma B-1
[20, Proposition 2.1] For any \(\beta _{1}\) and \(\beta _{2}\), \(\Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\ge K_{0}\lambda _{-}\Vert \beta _{1} - \beta _{2}\Vert _{2}\).
Proof
By the mean value theorem, there exists \(\nu _{i}\in ]\varepsilon _{i} - x_{i}^{T}\beta _{1}, \varepsilon _{i} - x_{i}^{T}\beta _{2}[\) such that
Then
\(\square \)
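In detail, writing \(D_{\nu } = \hbox {diag}(\psi '(\nu _{i}))\succeq K_{0}I\), the mean value theorem gives \(f(\beta _{1}) - f(\beta _{2}) = -\frac{X^{T}D_{\nu }X}{n}(\beta _{1} - \beta _{2})\), whence by Cauchy–Schwarz (a reconstruction of the standard argument in [20]):
$$\begin{aligned} \Vert f(\beta _{1}) - f(\beta _{2})\Vert _{2}\Vert \beta _{1} - \beta _{2}\Vert _{2}\ge \left| (\beta _{1} - \beta _{2})^{T}(f(\beta _{1}) - f(\beta _{2}))\right| \ge K_{0}\lambda _{-}\Vert \beta _{1} - \beta _{2}\Vert _{2}^{2}. \end{aligned}$$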
Based on Lemma B-1, we can derive the deterministic results informally stated in “Appendix A”. Such results were shown by [20] for ridge-penalized M-estimates, and here we derive a refined version for unpenalized M-estimates. Throughout this subsection, we only assume assumption A1, which implies the following lemma.
Lemma B-2
Under assumption A1, for any x and y,
To state the result, we define the following quantities.
The following proposition summarizes all deterministic results which we need in the proof.
Proposition B.1
Under assumption A1,
- (i) The norm of the M-estimator is bounded by
$$\begin{aligned} \Vert \hat{\beta }\Vert _{2} \le \frac{1}{K_{0}\lambda _{-}}(U + U_{0}); \end{aligned}$$
- (ii) Define \(b_{j}\) as
$$\begin{aligned} b_{j} = \frac{1}{\sqrt{n}}\frac{N_{j}}{\xi _{j}} \end{aligned}$$
where
$$\begin{aligned} N_{j}= & {} \frac{1}{\sqrt{n}}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]}), \quad \\ \xi _{j}= & {} \frac{1}{n}X_{j}^{T}\left( D_{[j]} - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}\right) X_{j}. \end{aligned}$$
Then
$$\begin{aligned} \max _{j\in J_{n}}|b_{j}|\le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}}\cdot \varDelta _{C}\cdot \sqrt{\mathscr {E}}; \end{aligned}$$
- (iii) The difference between \(\hat{\beta }_{j}\) and \(b_{j}\) is bounded by
$$\begin{aligned}\max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}|\le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E};\end{aligned}$$
- (iv) The difference between the full and the leave-one-predictor-out residuals is bounded by
$$\begin{aligned}\max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}|\le \frac{1}{\sqrt{n}}\left( \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}+ \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}\right) . \end{aligned}$$
Proof
- (i) By Lemma B-1,
$$\begin{aligned}\Vert \hat{\beta }\Vert _{2} \le \frac{1}{K_{0}\lambda _{-}}\Vert f(\hat{\beta }) - f(0)\Vert _{2} = \frac{\Vert f(0)\Vert _{2}}{K_{0}\lambda _{-}},\end{aligned}$$since \(\hat{\beta }\) is a zero of \(f(\beta )\). By definition,
$$\begin{aligned}f(0) = \frac{1}{n}\sum _{i=1}^{n}x_{i}\psi (\varepsilon _{i}) = \frac{1}{n}\sum _{i=1}^{n}x_{i}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i})) + \frac{1}{n}\sum _{i=1}^{n}x_{i}\mathbb {E}\psi (\varepsilon _{i}).\end{aligned}$$This implies that
$$\begin{aligned}\left\| f(0)\right\| _{2} \le U + U_{0}.\end{aligned}$$
- (ii) First we prove that
$$\begin{aligned} \xi _{j}\ge K_{0}\lambda _{-}. \end{aligned}$$(B-25)Since all diagonal entries of \(D_{[j]}\) are lower bounded by \(K_{0}\), we conclude that
$$\begin{aligned} \lambda _{\mathrm {min}}\left( \frac{X^{T}D_{[j]}X}{n}\right) \ge K_{0}\lambda _{-}. \end{aligned}$$Since \(\xi _{j}\) is the Schur complement ([28], Chapter 0.8) of \(\frac{X^{T}D_{[j]}X}{n}\), we have
$$\begin{aligned} \xi _{j}^{-1} = e_{j}^{T}\left( \frac{X^{T}D_{[j]}X}{n}\right) ^{-1} e_{j}\le \frac{1}{K_{0}\lambda _{-}}, \end{aligned}$$which implies (B-25). As for \(N_{j}\), we have
$$\begin{aligned} N_{j} = \frac{X_{j}^{T}h_{j, 0}}{\sqrt{n}} = \frac{\left\| h_{j, 0}\right\| _{2}}{\sqrt{n}}\cdot \frac{X_{j}^{T}h_{j, 0}}{\left\| h_{j, 0}\right\| _{2}}. \end{aligned}$$(B-26)The second term is bounded by \(\varDelta _{C}\) by definition; see (B-21). For the first term, the bound \(\psi '(x)\le K_{1}\) in assumption A1 implies that
$$\begin{aligned} \rho (x) = \rho (x) - \rho (0) = \int _{0}^{x}\psi (y)dy\ge \int _{0}^{x}\frac{\psi '(y)}{K_{1}}\cdot \psi (y)dy = \frac{1}{2K_{1}}\psi ^{2}(x). \end{aligned}$$Here we use the fact that \(\hbox {sign}(\psi (y)) = \hbox {sign}(y)\). Recalling the definition of \(h_{j, 0}\), we obtain that
$$\begin{aligned} \frac{\left\| h_{j, 0}\right\| _{2}}{\sqrt{n}} = \sqrt{\frac{\sum _{i=1}^{n}\psi (r_{i, [j]})^{2}}{n}}\le \sqrt{2K_{1}}\cdot \sqrt{\frac{\sum _{i=1}^{n}\rho (r_{i, [j]})}{n}}. \end{aligned}$$Since \(\hat{\beta }_{[j]}\) is the minimizer of the loss function \(\sum _{i=1}^{n}\rho (\varepsilon _{i} - x_{i, [j]}^{T}\beta _{[j]})\), it holds that
$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}\rho (r_{i, [j]})\le \frac{1}{n}\sum _{i=1}^{n}\rho (\varepsilon _{i}) = \mathscr {E}. \end{aligned}$$Putting together the pieces, we conclude that
$$\begin{aligned} |N_{j}|\le \sqrt{2K_{1}}\cdot \varDelta _{C}\sqrt{\mathscr {E}}. \end{aligned}$$(B-27)By definition of \(b_{j}\),
$$\begin{aligned} |b_{j}|\le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}}\varDelta _{C}\sqrt{\mathscr {E}}. \end{aligned}$$
- (iii) The proof of this result is almost the same as that in [20]. We state it here for the sake of completeness. Let \(\tilde{\mathbf {b}}_{\mathbf {j}}\in \mathbb {R}^{p}\) with
$$\begin{aligned} (\tilde{\mathbf {b}}_{\mathbf {j}})_{j} = b_{j}, \quad (\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} = \hat{\beta }_{[j]} - b_{j}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}\end{aligned}$$(B-28)where the subscript j denotes the j-th entry and the subscript [j] denotes the sub-vector formed by all but j-th entry. Furthermore, define \(\gamma _{j}\) with
$$\begin{aligned} (\gamma _{j})_{j} = -1, \quad (\gamma _{j})_{[j]} = \left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}. \end{aligned}$$(B-29)Then we can rewrite \(\tilde{\mathbf {b}}_{\mathbf {j}}\) as
$$\begin{aligned} (\tilde{\mathbf {b}}_{\mathbf {j}})_{j} = -b_{j}(\gamma _{j})_{j}, \quad (\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} = \hat{\beta }_{[j]} - b_{j}(\gamma _{j})_{[j]}. \end{aligned}$$By definition of \(\hat{\beta }_{[j]}\), we have \([f(\hat{\beta }_{[j]})]_{[j]} = 0\) and hence
$$\begin{aligned}{}[f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{[j]}&= [f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{[j]} - [f(\hat{\beta }_{[j]})]_{[j]} \nonumber \\&= \frac{1}{n}\sum _{i=1}^{n}x_{i, [j]}\left[ \psi (\varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}) - \psi (\varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]})\right] . \end{aligned}$$(B-30)By mean value theorem, there exists \(\nu _{i, j}\in ]\varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}, \varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]}[\) such that
$$\begin{aligned}&\psi \left( \varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) - \psi \left( \varepsilon _{i} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right) = \psi '(\nu _{i, j})\left( x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) \\&\quad = \psi '(\nu _{i, j})\left( x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i, [j]}^{T}(\tilde{\mathbf {b}}_{\mathbf {j}})_{[j]} - X_{ij}b_{j}\right) \\&\quad = \psi '(\nu _{i, j})\cdot b_{j}\cdot \left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \end{aligned}$$Let
$$\begin{aligned} d_{i, j} = \psi '(\nu _{i, j}) - \psi '(r_{i, [j]}) \end{aligned}$$(B-31)and plug the above result into (B-30); we obtain that
$$\begin{aligned} \left[ f(\tilde{\mathbf {b}}_{\mathbf {j}})\right] _{[j]}&= \frac{1}{n}\sum _{i=1}^{n}x_{i, [j]}\cdot \left( \psi '(r_{i, [j]}) + d_{i, j}\right) \cdot b_{j}\cdot \left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&= b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})x_{i, [j]}\left[ x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}d_{i, j}x_{i, [j]}\left( x_{i, [j]}^{T}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right) \\&= b_{j}\cdot \frac{1}{n}\left[ X_{[j]}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{[j]}^{T}D_{[j]}X_{j}\right] \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}d_{i, j}x_{i, [j]}\cdot x_{i}^{T}\gamma _{j}\\&= b_{j}\cdot \frac{1}{n}\left( \sum _{i=1}^{n}d_{i, j}x_{i, [j]}x_{i}^{T}\right) \gamma _{j}. \end{aligned}$$Now we calculate \([f(\tilde{\mathbf {b}}_{\mathbf {j}})]_{j}\), the j-th entry of \(f(\tilde{\mathbf {b}}_{\mathbf {j}})\). Note that
$$\begin{aligned} \left[ f(\tilde{\mathbf {b}}_{\mathbf {j}})\right] _{j}&= \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi \left( \varepsilon _{i} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right) = \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]}) \\&\quad + b_{j}\cdot \frac{1}{n}\sum _{i=1}^{n}X_{ij}(\psi '(r_{i, [j]})+ d_{i, j}) \\&\quad \cdot \left[ x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&= \frac{1}{n}\sum _{i=1}^{n}X_{ij}\psi (r_{i, [j]})+ b_{j} \\&\quad \cdot \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})X_{ij}\left[ x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\right] \\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j}= \frac{1}{\sqrt{n}}N_{j}+ b_{j} \\&\quad \cdot \left( \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- \frac{1}{n}\sum _{i=1}^{n}\psi '(r_{i, [j]})X_{ij}^{2}\right) \\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j}= \frac{1}{\sqrt{n}}N_{j} - b_{j}\cdot \xi _{j}\\&\quad + b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j} = b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i, j}X_{ij}x_{i}^{T}\right) \gamma _{j} \end{aligned}$$where the second last line uses the definition of \(b_{j}\). Putting the results together, we obtain that
$$\begin{aligned} f(\tilde{\mathbf {b}}_{\mathbf {j}}) = b_{j}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}d_{i,j}x_{i}x_{i}^{T}\right) \cdot \gamma _{j}. \end{aligned}$$This entails that
$$\begin{aligned} \Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}\le |b_{j}|\cdot \max _{i}|d_{i,j}|\cdot \lambda _{+}\cdot \Vert \gamma _{j}\Vert _{2}. \end{aligned}$$(B-32)Now we derive a bound for \(\max _{i}|d_{i,j}|\), where \(d_{i,j}\) is defined in (B-31). By Lemma B-2,
$$\begin{aligned} |d_{i,j}| = |\psi '(\nu _{i,j}) - \psi '(r_{i, [j]})|\le K_{3}\left| \nu _{i, j} - r_{i, [j]}\right| = K_{3}\left| x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}\right| . \end{aligned}$$By definition of \(\tilde{\mathbf {b}}_{\mathbf {j}}\) and \(h_{j, 1, i}\),
$$\begin{aligned} |x_{i, [j]}^{T}\hat{\beta }_{[j]} - x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}}|&= |b_{j}|\cdot \big |x_{i, [j]}^{T}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}X_{j}- X_{ij}\big |\nonumber \\&= |b_{j}| \cdot \left| e_{i}^{T}(I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]})X_{j}\right| \nonumber \\&= |b_{j}|\cdot \left| h_{j, 1, i}^{T}X_{j}\right| \le |b_{j}| \cdot \varDelta _{C}\left\| h_{j, 1, i}\right\| _{2}, \end{aligned}$$(B-33)where the last inequality follows from the definition of \(\varDelta _{C}\); see (B-21). Since \(h_{j, 1, i}\) is the i-th column of the matrix \(I - D_{[j]}X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}\), its \(L_{2}\) norm is upper bounded by the operator norm of this matrix. Notice that
$$\begin{aligned}&I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T} \\&\quad = D_{[j]}^{\frac{1}{2}}\left( I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right) D_{[j]}^{-\frac{1}{2}}. \end{aligned}$$The middle matrix on the RHS of the above display is an orthogonal projection matrix, and hence
$$\begin{aligned} \left\| I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}\right\| _{\mathrm {op}}\le \left\| D_{[j]}^{\frac{1}{2}}\right\| _{\mathrm {op}}\cdot \left\| D_{[j]}^{-\frac{1}{2}}\right\| _{\mathrm {op}} \le \left( \frac{K_{1}}{K_{0}}\right) ^{\frac{1}{2}}. \end{aligned}$$(B-34)Therefore,
$$\begin{aligned} \max _{i, j}\Vert h_{j, 1, i}\Vert _{2}\le \max _{j\in J_{n}}\left\| I - D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}\right\| _{\mathrm {op}}\le \left( \frac{K_{1}}{K_{0}}\right) ^{\frac{1}{2}}, \end{aligned}$$(B-35)and thus
$$\begin{aligned} \max _{i}|d_{i,j}|\le K_{3}\sqrt{\frac{K_{1}}{K_{0}}}\cdot |b_{j}|\cdot \varDelta _{C}. \end{aligned}$$(B-36)As for \(\gamma _{j}\), we have
$$\begin{aligned} K_{0}\lambda _{-}\Vert \gamma _{j}\Vert _{2}^{2}&\le \gamma _{j}^{T}\left( \frac{X^{T}D_{[j]}X}{n}\right) \gamma _{j} \\&= (\gamma _{j})_{j}^{2}\cdot \frac{X_{j}^{T}D_{[j]}X_{j}}{n} + (\gamma _{j})_{[j]}^{T}\left( \frac{X_{[j]}^{T}D_{[j]}X_{[j]}}{n}\right) (\gamma _{j})_{[j]}\\&\quad + 2(\gamma _{j})_{j}\frac{X_{j}^{T}D_{[j]}X_{[j]}}{n}(\gamma _{j})_{[j]} \end{aligned}$$Recalling the definition of \(\gamma _{j}\) in (B-29), we have
$$\begin{aligned} (\gamma _{j})_{[j]}^{T}\left( \frac{X_{[j]}^{T}D_{[j]}X_{[j]}}{n}\right) (\gamma _{j})_{[j]} = \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j} \end{aligned}$$and
$$\begin{aligned} (\gamma _{j})_{j}\frac{X_{j}^{T}D_{[j]}X_{[j]}}{n}(\gamma _{j})_{[j]} = - \frac{1}{n}X_{j}^{T}D_{[j]}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}X_{j}. \end{aligned}$$As a result,
$$\begin{aligned} K_{0}\lambda _{-}\Vert \gamma _{j}\Vert _{2}^{2}&\le \frac{1}{n}X_{j}^{T}D_{[j]}^{\frac{1}{2}}\left( I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right) D_{[j]}^{\frac{1}{2}}X_{j}\\&\le \frac{\left\| D_{[j]}^{\frac{1}{2}}X_{j}\right\| _{2}^{2}}{n}\cdot \left\| I - D_{[j]}^{\frac{1}{2}}X_{[j]}\left( X_{[j]}^{T}D_{[j]}X_{[j]}\right) ^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\right\| _{op}\\&\le \frac{\left\| D_{[j]}^{\frac{1}{2}}X_{j}\right\| _{2}^{2}}{n} \le \frac{K_{1}\Vert X_{j}\Vert _{2}^{2}}{n}\le T^{2}K_{1}, \end{aligned}$$where T is defined in (B-23). Therefore we have
$$\begin{aligned} \left\| \gamma _{j}\right\| _{2}\le \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T. \end{aligned}$$(B-37)Putting (B-32), (B-36), (B-37) and part (ii) together, we obtain that
$$\begin{aligned} \Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}&\le \lambda _{+}\cdot |b_{j}|\cdot K_{3}\sqrt{\frac{K_{1}}{K_{0}}} \varDelta _{C}|b_{j}| \cdot \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T\\&\le \lambda _{+}\cdot \frac{1}{n}\frac{2K_{1}}{(K_{0}\lambda _{-})^{2}}\varDelta _{C}^{2}\mathscr {E}\cdot K_{3}\sqrt{\frac{K_{1}}{K_{0}}} \varDelta _{C}\cdot \sqrt{\frac{K_{1}}{K_{0}\lambda _{-}}}T\\&= \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{3}\lambda _{-}^{\frac{5}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$By Lemma B-1,
$$\begin{aligned} \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2}&\le \frac{\Vert f(\hat{\beta }) - f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}}{K_{0}\lambda _{-}} = \frac{\Vert f(\tilde{\mathbf {b}}_{\mathbf {j}})\Vert _{2}}{K_{0}\lambda _{-}} \le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$Since \(\hat{\beta }_{j} - b_{j}\) is the j-th entry of \(\hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\), we have
$$\begin{aligned} |\hat{\beta }_{j} - b_{j}|\le \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2} \le \frac{1}{n}\cdot \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3} \cdot \mathscr {E}. \end{aligned}$$
- (iv) Similar to part (iii), this result has been shown by [20]. Here we state a refined version for the sake of completeness. Let \(\tilde{\mathbf {b}}_{\mathbf {j}}\) be defined as in (B-28); then
$$\begin{aligned} |R_{i} - r_{i, [j]}|&= \left| x_{i}^{T}\hat{\beta } - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| = \left| x_{i}^{T}(\hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}) + x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| \\&\le \Vert x_{i}\Vert _{2} \cdot \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2} + \left| x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| . \end{aligned}$$Since \(\left\| x_{i}\right\| _{2}\le \sqrt{n}T\), by part (iii) we have
$$\begin{aligned} \Vert x_{i}\Vert _{2} \cdot \Vert \hat{\beta } - \tilde{\mathbf {b}}_{\mathbf {j}}\Vert _{2}\le \frac{1}{\sqrt{n}}\frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}. \end{aligned}$$(B-38)On the other hand, similar to (B-36), by (B-33),
$$\begin{aligned} \left| x_{i}^{T}\tilde{\mathbf {b}}_{\mathbf {j}} - x_{i, [j]}^{T}\hat{\beta }_{[j]}\right| \le \sqrt{\frac{K_{1}}{K_{0}}}\cdot |b_{j}|\cdot \varDelta _{C} \le \frac{1}{\sqrt{n}}\cdot \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}. \end{aligned}$$(B-39)Therefore,
$$\begin{aligned} |R_{i} - r_{i, [j]}|\le \frac{1}{\sqrt{n}}\left( \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}}\cdot \varDelta _{C}^{3}\cdot \mathscr {E}+ \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}}\cdot \varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}}\right) . \end{aligned}$$
\(\square \)
1.3 Summary of approximation results
Under our technical assumptions, we can derive the rates for these approximations via Proposition B.1. This justifies all the approximations in “Appendix A”.
Theorem B.1
Under assumptions A1–A5,
- (i) $$\begin{aligned} T\le \lambda _{+}= O\left( \mathrm {polyLog(n)}\right) ; \end{aligned}$$
- (ii) $$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j}|\le \Vert \hat{\beta }\Vert _{2} = O_{L^{4}}\left( \mathrm {polyLog(n)}\right) ; \end{aligned}$$
- (iii) $$\begin{aligned} \max _{j\in J_{n}}|b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) ; \end{aligned}$$
- (iv) $$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{n}\right) ; \end{aligned}$$
- (v) $$\begin{aligned} \max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
Proof
- (i) Since \(X_{j} = Xe_{j}\), where \(e_{j}\) is the j-th canonical basis vector in \(\mathbb {R}^{p}\), we have
$$\begin{aligned} \frac{\Vert X_{j}\Vert ^{2}}{n} = e_{j}^{T}\frac{X^{T}X}{n}e_{j}\le \lambda _{+}. \end{aligned}$$Similarly, considering \(X^{T}\) in place of X, we conclude that
$$\begin{aligned} \frac{\Vert x_{i}\Vert ^{2}}{n} \le \lambda _{\max }\left( \frac{XX^{T}}{n}\right) = \lambda _{+}. \end{aligned}$$Recalling the definition of T in (B-23), we conclude that
$$\begin{aligned} T \le \sqrt{\lambda _{+}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$
- (ii) Since \(\varepsilon _{i} = u_{i}(W_{i})\) with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\), the Gaussian concentration property ([36], Chapter 1.3) implies that \(\varepsilon _{i}\) is \(c_{1}^{2}\)-sub-gaussian and hence \(\mathbb {E}|\varepsilon _{i}|^{k} = O(c_{1}^{k})\) for any finite \(k > 0\). By Lemma B-2, \(|\psi (\varepsilon _{i})|\le K_{1}|\varepsilon _{i}|\) and hence for any finite k,
$$\begin{aligned} \mathbb {E}|\psi (\varepsilon _{i})|^{k}\le K_{1}^{k}\mathbb {E}|\varepsilon _{i}|^{k} = O\left( c_{1}^{k}\right) . \end{aligned}$$By part (i) of Proposition B.1, using the convexity of \(x^{4}\) and hence \(\left( \frac{a + b}{2}\right) ^{4} \le \frac{a^{4} + b^{4}}{2}\),
$$\begin{aligned} \mathbb {E}\Vert \hat{\beta }\Vert _{2}^{4}\le \frac{1}{(K_{0}\lambda _{-})^{4}}\mathbb {E}(U + U_{0})^{4}\le \frac{8}{(K_{0}\lambda _{-})^{4}}\left( \mathbb {E}U^{4} + U_{0}^{4}\right) . \end{aligned}$$Recall (B-24) that \(U = \left\| \frac{1}{n}\sum _{i=1}^{n}x_{i}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))\right\| _{2}\),
$$\begin{aligned} U^{4}&= (U^{2})^{2} = \frac{1}{n^{4}}\left( \sum _{i,i'=1}^{n}x_{i}^{T}x_{i'}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))\right) ^{2}\\&= \frac{1}{n^{4}}\left( \sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{2}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2} \right. \\&\quad \left. + \sum _{i\not = i'}|x_{i}^{T}x_{i'}|(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))\right) ^{2}\\&=\frac{1}{n^{4}}\bigg \{\sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4} \\&\quad + \sum _{i\not = i'}(2|x_{i}^{T}x_{i'}|^{2} + \Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2})(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\\&\quad + \sum _{\mathrm{others}}|x_{i}^{T}x_{i'}|\cdot |x_{k}^{T}x_{k'}|\cdot (\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) \\&\quad - \mathbb {E}\psi (\varepsilon _{i'}))(\psi (\varepsilon _{k}) - \mathbb {E}\psi (\varepsilon _{k}))(\psi (\varepsilon _{k'}) - \mathbb {E}\psi (\varepsilon _{k'}))\bigg \} \end{aligned}$$Since \(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i})\) has a zero mean, we have
$$\begin{aligned}&\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))(\psi (\varepsilon _{k}) \\&\quad -\, \mathbb {E}\psi (\varepsilon _{k}))(\psi (\varepsilon _{k'}) - \mathbb {E}\psi (\varepsilon _{k'})) = 0 \end{aligned}$$for any \((i, i')\not = (k, k') \text{ or } (k', k)\) and \(i\not = i'\). As a consequence,
$$\begin{aligned} \mathbb {E}U^{4}&= \frac{1}{n^{4}}\bigg (\sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4}\\&+\,\sum _{i\not =i'}(2|x_{i}^{T}x_{i'}|_{2}^{2} + \Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2})\mathbb {E}(\psi (\varepsilon _{i})\\&-\,\mathbb {E}\psi (\varepsilon _{i}))^{2}\mathbb {E}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\bigg )\\&\le \frac{1}{n^{4}}\left( \sum _{i=1}^{n}\Vert x_{i}\Vert _{2}^{4}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4} \right. \\&\left. +\,3\sum _{i\not =i'}\Vert x_{i}\Vert _{2}^{2}\Vert x_{i'}\Vert _{2}^{2}\mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}\mathbb {E}(\psi (\varepsilon _{i'}) - \mathbb {E}\psi (\varepsilon _{i'}))^{2}\right) . \end{aligned}$$For any i, using the convexity of \(x^{4}\), hence \((\frac{a + b}{2})^{4}\le \frac{a^{4} + b^{4}}{2}\), we have
$$\begin{aligned} \mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{4}\le & {} 8\mathbb {E}\left( \psi (\varepsilon _{i})^{4} + (\mathbb {E}\psi (\varepsilon _{i}))^{4}\right) \le 16 \mathbb {E}\psi (\varepsilon _{i})^{4}\\\le & {} 16\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}. \end{aligned}$$By the Cauchy–Schwarz inequality,
$$\begin{aligned} \mathbb {E}(\psi (\varepsilon _{i}) - \mathbb {E}\psi (\varepsilon _{i}))^{2}\le \mathbb {E}\psi (\varepsilon _{i})^{2}\le \sqrt{\mathbb {E}\psi (\varepsilon _{i})^{4}}\le \sqrt{\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}}. \end{aligned}$$Recall (B-23) that \(\Vert x_{i}\Vert _{2}^{2} \le nT^{2}\) and thus,
$$\begin{aligned} \mathbb {E}U^{4}&\le \frac{1}{n^{4}}\left( 16 n\cdot n^{2}T^{4} + 3n^{2}\cdot n^{2}T^{4}\right) \cdot \max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4}\\&\le \frac{1}{n^{4}}\cdot (16 n^{3} + 3n^{4})T^{4}\max _{i}\mathbb {E}\psi (\varepsilon _{i})^{4} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$On the other hand, let \(\mu ^{T} = (\mathbb {E}\psi (\varepsilon _{1}), \ldots , \mathbb {E}\psi (\varepsilon _{n}))\), then \(\Vert \mu \Vert _{2}^{2} = O(n\cdot \mathrm {polyLog(n)})\) and hence by definition of \(U_{0}\) in (B-24),
$$\begin{aligned} U_{0} = \frac{\Vert \mu ^{T}X\Vert _{2}}{n} = \frac{1}{n}\sqrt{\mu ^{T}XX^{T}\mu }\le \sqrt{\frac{\Vert \mu \Vert _{2}^{2}}{n}\cdot \lambda _{+}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$In summary,
$$\begin{aligned} \mathbb {E}\Vert \hat{\beta }\Vert _{2}^{4} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$
- (iii) By the mean value theorem, there exists \(a_{x}\in (0, x)\) such that
$$\begin{aligned} \rho (x) = \rho (0) + x\psi (0) + \frac{x^{2}}{2}\psi '(a_{x}). \end{aligned}$$By assumption A1 and Lemma B-2, we have
$$\begin{aligned} \rho (x) = \frac{x^{2}}{2}\psi '(a_{x})\le \frac{x^{2}}{2}\Vert \psi '\Vert _{\infty } \le \frac{K_{3}x^{2}}{2}, \end{aligned}$$where \(K_{3}\) is defined in Lemma B-2. As a result,
$$\begin{aligned} \mathbb {E}\rho (\varepsilon _{i})^{8} \le \left( \frac{K_{3}}{2}\right) ^{8}\mathbb {E}\varepsilon _{i}^{16} = O\left( c_{1}^{16}\right) . \end{aligned}$$Recalling the definition of \(\mathscr {E}\) in (B-23) and using the convexity of \(x^{8}\), we have
$$\begin{aligned} \mathbb {E}\mathscr {E}^{8} \le \frac{1}{n}\sum _{i=1}^{n}\mathbb {E}\rho (\varepsilon _{i})^{8} = O(c_{1}^{16}) = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$(B-40)Under assumption A5, by the Cauchy–Schwarz inequality,
$$\begin{aligned} \mathbb {E}(\varDelta _{C}\sqrt{\mathscr {E}})^{2} = \mathbb {E}\varDelta _{C}^{2}\mathscr {E}\le \sqrt{\mathbb {E}\varDelta _{C}^{4}}\cdot \sqrt{\mathbb {E}\mathscr {E}^{2}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Under assumptions A1 and A3,
$$\begin{aligned} \frac{\sqrt{2K_{1}}}{K_{0}\lambda _{-}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Putting all the pieces together, we obtain that
$$\begin{aligned} \max _{j\in J_{n}}|b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
- (iv) Similarly, by Hölder's inequality,
$$\begin{aligned} \mathbb {E}\left( \varDelta _{C}^{3}\mathscr {E}\right) ^{2} = \mathbb {E}\varDelta _{C}^{6}\mathscr {E}^{2}\le \left( \mathbb {E}\varDelta _{C}^{8}\right) ^{\frac{3}{4}}\cdot \left( \mathbb {E}\mathscr {E}^{8}\right) ^{\frac{1}{4}} = O\left( \mathrm {polyLog(n)}\right) , \end{aligned}$$and under assumptions A1 and A3,
$$\begin{aligned} \frac{2K_{1}^{2}K_{3}\lambda _{+}T}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Therefore,
$$\begin{aligned} \max _{j\in J_{n}}|\hat{\beta }_{j} - b_{j}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{n}\right) . \end{aligned}$$
- (v) It follows from the previous part that
$$\begin{aligned} \mathbb {E}(\varDelta _{C}^{2}\cdot \sqrt{\mathscr {E}})^{2} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Under assumptions A1 and A3, the multiplicative factors are also \(O\left( \mathrm {polyLog(n)}\right) \), i.e.
$$\begin{aligned} \frac{2K_{1}^{2}K_{3}\lambda _{+}T^{2}}{K_{0}^{4}\lambda _{-}^{\frac{7}{2}}} = O\left( \mathrm {polyLog(n)}\right) , \quad \frac{\sqrt{2}K_{1}}{K_{0}^{\frac{3}{2}}\lambda _{-}} = O\left( \mathrm {polyLog(n)}\right) . \end{aligned}$$Therefore,
$$\begin{aligned} \max _{j\in J_{n}}\max _{i}|R_{i} - r_{i, [j]}| = O_{L^{2}}\left( \frac{\mathrm {polyLog(n)}}{\sqrt{n}}\right) . \end{aligned}$$
\(\square \)
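The rates in parts (iii)–(v) are easy to probe numerically. The following minimal sketch (ours; it uses a Gaussian design and the strongly convex loss \(\rho (x) = x^{2}/2 + \log \cosh x\), which satisfies A1, neither of which is prescribed by the theorem) compares the full residuals \(R_{i}\) with the leave-one-predictor-out residuals \(r_{i, [j]}\):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p, j = 500, 100, 0                  # moderate p/n regime; drop predictor j

X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)           # with beta* = 0 WLOG, y = eps

def loss(beta, A):
    r = eps - A @ beta                 # residuals for design A
    return np.sum(0.5 * r**2 + np.log(np.cosh(r)))

beta_full = minimize(loss, np.zeros(p), args=(X,), method="L-BFGS-B").x
X_mj = np.delete(X, j, axis=1)         # design without the j-th column
beta_mj = minimize(loss, np.zeros(p - 1), args=(X_mj,), method="L-BFGS-B").x

R = eps - X @ beta_full                # full residuals R_i
r_mj = eps - X_mj @ beta_mj            # leave-j-th-predictor-out residuals
print("max_i |R_i - r_i,[j]| =", np.abs(R - r_mj).max())  # expected O(1/sqrt(n))
```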
1.4 Controlling gradient and Hessian
Proof
(Proof of Lemma 4.1) Recall that \(\hat{\beta }\) is the solution of the following equation
Taking the derivative of (B-41), we have
This establishes (9). To establish (10), note that (9) can be rewritten as
Fix \(k\in \{1, \ldots , n\}\). Note that
Recalling that \(G = I - X(X^{T}DX)^{-1}X^{T}D\), we have
where \(e_{i}\) is the i-th canonical basis vector of \(\mathbb {R}^{n}\). As a result,
Taking the derivative of (B-42), we have
where \(G = I - X(X^{T}DX)^{-1}X^{T}D\) is defined in (B-18) on p. 31. Then for each \(j\in \{1, \ldots , p\}\) and \(k\in \{1, \ldots , n\}\),
where we use the fact that \(a^{T}\hbox {diag}(b) = b^{T}\hbox {diag}(a)\) for any vectors a, b. This implies that
\(\square \)
Proof
(Proof of Lemma 4.2) Throughout the proof we use the simple fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\). Based on this, we find that
Thus for any \(m > 1\), recalling that \(M_{j} = \mathbb {E}\left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{\infty }\),
We should emphasize that we cannot use the naive bound that
since it fails to guarantee the convergence of the TV distance. We will address this issue after deriving Lemma 4.3.
By contrast, as proved below,
Thus (B-46) produces a slightly tighter bound
It turns out that the above bound suffices to prove the convergence. Although (B-48) suggests the possibility of sharpening the bound from \(n^{-\frac{m + 1}{2m}}\) to \(n^{-1}\) using a refined analysis, we do not pursue this, to avoid extra conditions and notation.
- Bound for \(\kappa _{0j}\). First we derive a bound for \(\kappa _{0j}\). By definition,
$$\begin{aligned} \kappa _{0j}^{2} = \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{4}^{4}\le \mathbb {E}\left( \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{\infty }^{2}\cdot \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{2}\right) . \end{aligned}$$By Lemma 4.1 and (B-46) with \(m = 2\),
$$\begin{aligned} \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{\infty }^{2} \le \mathbb {E}\left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{\infty }^{2}\cdot K_{1} = \frac{K_{1}M_{j}}{(nK_{0}\lambda _{-})^{\frac{1}{2}}}. \end{aligned}$$On the other hand, it follows from (B-45) that
$$\begin{aligned} \left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{2}&= \left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D\right\| _{2}^{2} \le K_{1}\cdot \left\| e_{j}^{T}(X^{T}DX)^{-1}X^{T}D^{\frac{1}{2}}\right\| _{2}^{2} \le \frac{K_{1}}{nK_{0}\lambda _{-}}. \end{aligned}$$(B-49)Putting the above two bounds together we have
$$\begin{aligned} \kappa _{0j}^{2}\le \frac{K_{1}^{2}}{(nK_{0}\lambda _{-})^{\frac{3}{2}}}\cdot M_{j}. \end{aligned}$$(B-50)
- Bound for \(\kappa _{1j}\). As a by-product of (B-49), we obtain that
$$\begin{aligned} \kappa _{1j}^{4}&= \mathbb {E}\left\| \frac{\partial \hat{\beta }_{j}}{\partial \varepsilon ^{T}}\right\| _{2}^{4} \le \frac{K_{1}^{2}}{(nK_{0}\lambda _{-})^{2}}. \end{aligned}$$(B-51)
- Bound for \(\kappa _{2j}\). Finally, we derive a bound for \(\kappa _{2j}\). By Lemma 4.1, \(\kappa _{2j}\) involves the operator norm of a symmetric matrix of the form \(G^{T}MG\), where M is a diagonal matrix. Then by the submultiplicativity of the operator norm,
$$\begin{aligned} \left\| G^{T}MG\right\| _{op}\le \Vert M\Vert _{\mathrm {op}}\cdot \left\| G^{T}G\right\| _{op} = \Vert M\Vert _{\mathrm {op}}\cdot \left\| G\right\| _{op}^{2}. \end{aligned}$$
Note that
is a projection matrix, which is idempotent. This implies that
Write G as \(D^{-\frac{1}{2}}(D^{\frac{1}{2}}GD^{-\frac{1}{2}})D^{\frac{1}{2}}\), then we have
Returning to \(\kappa _{2j}\), we obtain that
Assumption A1 implies that
Therefore,
By (B-46) with \(m = 4\),
\(\square \)
Proof
(Proof of Lemma 4.3) By Theorem B.1, for any j,
Then using the second-order Poincaré inequality (Proposition 4.1),
It follows from (B-45) that \(nM_{j}^{2} = O\left( \mathrm {polyLog(n)}\right) \) and the above bound can be simplified as
\(\square \)
Remark B.1
If we use the naive bound (B-47), then by repeating the above derivation we obtain the worse bounds \(\kappa _{0j} = O(\frac{\mathrm {polyLog(n)}}{n})\) and \(\kappa _{2j} = O(\frac{\mathrm {polyLog(n)}}{\sqrt{n}})\), in which case,
However, we can only prove that \(\hbox {Var}(\hat{\beta }_{j}) = \varOmega (\frac{1}{n})\). Without the numerator \((nM_{j}^{2})^{\frac{1}{8}}\), which will be shown to be \(O(n^{-\frac{1}{8}}\mathrm {polyLog(n)})\) in the next subsection, the convergence cannot be proved.
1.5 Upper bound of \(M_{j}\)
As mentioned in “Appendix A”, we should approximate D by \(D_{[j]}\) to remove the functional dependence on \(X_{j}\). To achieve this, we introduce two terms, \(M_{j}^{(1)}\) and \(M_{j}^{(2)}\), defined as
We will first prove that both \(|M_{j} - M_{j}^{(1)}|\) and \(|M_{j}^{(1)} - M_{j}^{(2)}|\) are negligible and then derive an upper bound for \(M_{j}^{(2)}\).
1.5.1 Controlling \(|M_{j} - M_{j}^{(1)}|\)
By Lemma B-2,
and by Theorem B.1,
Then we can bound \(|M_{j} - M_{j}^{(1)}|\) via the fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\) and some algebra, as follows.
By Lemma B-2,
thus
This entails that
1.5.2 Bound of \(|M_{j}^{(1)} - M_{j}^{(2)}|\)
First we prove a useful lemma.
Lemma B-3
For any symmetric matrix N with \(\Vert N\Vert _{\mathrm {op}} < 1\),
Proof
First, notice that
and therefore
Since \(\Vert N\Vert _{\mathrm {op}} < 1\), \(I + N\) is positive semi-definite and
Therefore,
\(\square \)
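For reference, a standard bound of this type, consistent with how Lemma B-3 is used below, follows from the Neumann series: for symmetric N with \(\Vert N\Vert _{\mathrm {op}} < 1\),
$$\begin{aligned} \left\| (I + N)^{-1} - I\right\| _{\mathrm {op}} = \bigg \Vert \sum _{k\ge 1}(-N)^{k}\bigg \Vert _{\mathrm {op}}\le \sum _{k\ge 1}\Vert N\Vert _{\mathrm {op}}^{k} = \frac{\Vert N\Vert _{\mathrm {op}}}{1 - \Vert N\Vert _{\mathrm {op}}}. \end{aligned}$$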
We now return to bounding \(|M_{j}^{(1)} - M_{j}^{(2)}|\). Let \(A_{j} = X^{T}D_{[j]}X\) and \(B_{j} = X^{T}(D - D_{[j]})X\). By Lemma B-2,
and hence
where \(\eta _{j} = K_{3}\lambda _{+}\cdot \mathscr {R}_{j}\). Then by Theorem B.1.(v),
Using the fact that \(\left\| a\right\| _{\infty }\le \left\| a\right\| _{2}\), we obtain that
The inner matrix can be rewritten as
Let \(N_{j} = A_{j}^{-\frac{1}{2}}B_{j}A_{j}^{-\frac{1}{2}}\), then
On the event \(\{\eta _{j} \le \frac{1}{2}K_{0}\lambda _{-}\}\), \(\Vert N_{j}\Vert _{\mathrm {op}}\le \frac{1}{2}\). By Lemma B-3,
This together with (B-53) entails that
Since \(A_{j}\succeq nK_{0}\lambda _{-}I\), and \(\Vert B_{j}\Vert _{\mathrm {op}}\le n\eta _{j}\), we have
Thus,
On the event \(\{\eta _{j} > \frac{1}{2}K_{0}\lambda _{-}\}\), since \(nK_{0}\lambda _{-}I \preceq A_{j}\preceq nK_{1}\lambda _{+}I\) and \(A_{j} + B_{j}\succeq nK_{0}\lambda _{-}I\),
This, together with Markov's inequality, implies that
Putting the pieces together, we conclude that
1.5.3 Bound of \(M_{j}^{(2)}\)
Similarly to (A-1), by the block matrix inversion formula (see Proposition E.1),
where \(H_{j} = D_{[j]}^{\frac{1}{2}} X_{[j]} (X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}^{\frac{1}{2}}\). Recall that \(\xi _{j}\ge K_{0}\lambda _{-}\) by (B-25), so we have
As for the numerator, recalling the definition of \(h_{j, 1, i}\), we obtain that
As proved in (B-35),
This entails that
Putting the pieces together we conclude that
1.5.4 Summary
Based on results from Sections B.5.1–B.5.3, we have
Note that the bounds we obtained do not depend on j, so we conclude that
1.6 Lower Bound of \(\hbox {Var}(\hat{\beta }_{j})\)
1.6.1 Approximating \(\hbox {Var}(\hat{\beta }_{j})\) by \(\hbox {Var}(b_{j})\)
By Theorem B.1,
Using the fact that
we can bound the difference between \(\mathbb {E}\hat{\beta }_{j}^{2}\) and \(\mathbb {E}b_{j}^{2}\) by
Similarly, since \(|a^{2} - b^{2}| = |a - b|\cdot |a + b|\le |a - b|(|a - b| + 2|b|)\),
Putting the above two results together, we conclude that
Then it is left to show that
1.6.2 Controlling \(\hbox {Var}(b_{j})\) by \(\hbox {Var}(N_{j})\)
Recall that
where
Then
Using the fact that \((a + b)^{2} - (\frac{1}{2}a^{2} - b^{2}) = \frac{1}{2}(a + 2b)^{2}\ge 0\), we have
1.6.3 Controlling \(I_{1}\)
Assumption A4 implies that
It is left to show that \( \hbox {tr}(\hbox {Cov}(h_{j, 0})) / n = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \). Since this result will also be used later in “Appendix C”, we state it as the following lemma.
Lemma B-4
Under assumptions A1–A3, \(\hbox {tr}(\hbox {Cov}(h_{j, 0})) / n = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \).
Proof
Inequality (A-10) implies that
Since \(r_{i, [j]}\) is a function of \(\varepsilon \), we can apply (A-10) again to obtain a lower bound for \(\hbox {Var}(r_{i, [j]})\). In fact, by the variance decomposition formula and the independence of the \(\varepsilon _{i}\)'s,
where \(\varepsilon _{(i)}\) includes all but the i-th entry of \(\varepsilon \). Applying (A-10) again,
and hence
Now we compute \(\frac{\partial r_{i, [j]}}{\partial \varepsilon _{i}}\). Similarly to (B-43) on p. 40, we have
where \(G_{[j]}\) is defined in (B-18) in p. 31. When \(k = i\),
By definition of \(G_{[j]}\),
Let \(\tilde{X}_{[j]} = D_{[j]}^{\frac{1}{2}}X_{[j]}\) and \(H_{j} = \tilde{X}_{[j]}(\tilde{X}_{[j]}^{T}\tilde{X}_{[j]})^{-1}\tilde{X}_{[j]}^{T}\). Denote by \(\tilde{X}_{(i), [j]}\) the matrix \(\tilde{X}_{[j]}\) after removing the i-th row; then by the block matrix inversion formula (see Proposition E.1),
This implies that
Applying the above argument with \(H_{j}\) replaced by \(X_{[j]}(X_{[j]}^{T}X_{[j]})^{-1}X_{[j]}^{T}\), we have
Summing over \(i = 1, \ldots , n\), we obtain that
Since \(\min _{i}\hbox {Var}(\varepsilon _{i}) = \varOmega \left( \frac{1}{\mathrm {polyLog(n)}}\right) \) by assumption A2, we conclude that
\(\square \)
In summary,
Recall that
we conclude that
1.6.4 Controlling \(I_{2}\)
By definition,
By (B-27) in the proof of Proposition B.1,
where the last equality uses the fact that \(\mathscr {E}= O_{L^{2}}\left( \mathrm {polyLog(n)}\right) \) as proved in (B-40). On the other hand, let \(\tilde{\xi }_{j}\) be an independent copy of \(\xi _{j}\), then
Since \(\xi _{j}\ge K_{0}\lambda _{-}\) as shown in (B-25), we have
To bound \(\hbox {Var}(\xi _{j})\), we use the standard Poincaré inequality [11], stated as follows.
Proposition B.2
Let \(W = (W_{1}, \ldots , W_{n})\sim N(0, I_{n\times n})\) and f be a twice differentiable function; then \(\hbox {Var}(f(W))\le \mathbb {E}\Vert \nabla f(W)\Vert _{2}^{2}\).
In our case, \(\varepsilon _{i} = u_{i}(W_{i})\), and hence for any twice differentiable function g, the chain rule together with \(\Vert u'_{i}\Vert _{\infty }\le c_{1}\) gives \(\hbox {Var}(g(\varepsilon ))\le c_{1}^{2}\,\mathbb {E}\Vert \nabla g(\varepsilon )\Vert _{2}^{2}\).
Applying it to \(\xi _{j}\), we have
For a given \(k\in \{1, \ldots , n\}\), using the chain rule and the fact that \(dB^{-1} = -B^{-1}(dB) B^{-1}\) for any invertible square matrix B, we obtain that
where \(G_{[j]} = I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}\), as defined in the last subsection. This implies that
Then (B-64) entails that
First we compute \(\frac{\partial D_{[j]}}{\partial \varepsilon _{k}}\). Similarly to (B-44) on p. 40, recalling the definition of \(D_{[j]}\) in (B-17) and that of \(G_{[j]}\) in (B-18) on p. 31, we have
Let \(\mathscr {X}_{j} = G_{[j]}X_{j}\) and \(\tilde{\mathscr {X}}_{j} = \mathscr {X}_{j}\circ \mathscr {X}_{j}\), where \(\circ \) denotes the Hadamard product. Then
Here we use the fact that for any vectors \(x, a\in \mathbb {R}^{n}\),
This, together with (B-65), implies that
Note that \(G_{[j]}G_{[j]}^{T}\preceq \Vert G_{[j]}\Vert _{\mathrm {op}}^{2} I\), and \(\tilde{D}_{[j]}\preceq K_{3}I\) by Lemma B-2 on p. 32. Therefore we obtain that
As shown in (B-34),
On the other hand, since the i-th row of \(G_{[j]}\) is \(h_{j, 1, i}\) (see (B-20) for the definition), by the definition of \(\varDelta _{C}\) we have
By (B-35) and assumption A5,
This entails that
Combining with (B-62) and (B-63), we obtain that
1.6.5 Summary
Putting (B-55), (B-61) and (B-62) together, we conclude that
Combining with (B-54),
Proof of other results
1.1 Proofs of propositions in Section 2.3
Proof
(Proof of Proposition 2.1) Let \(H_{i}(\alpha ) = \mathbb {E}\rho (\varepsilon _{i} - \alpha )\). First we prove that the conditions imply that 0 is the unique minimizer of \(H_{i}(\alpha )\) for all i. In fact, since \(\varepsilon _{i}\overset{d}{=}-\varepsilon _{i}\),
Using the fact that \(\rho \) is even, we have
By (4), for any \(\alpha \not = 0\), \(H_{i}(\alpha ) > H_{i}(0)\). As a result, 0 is the unique minimizer of \(H_{i}\). Then for any \(\beta \in \mathbb {R}^{p}\)
The equality holds iff \(x_{i}^{T}(\beta - \beta ^{*}) = 0\) for all i since 0 is the unique minimizer of \(H_{i}\). This implies that
Since X has full column rank, we conclude that
\(\square \)
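To illustrate Proposition 2.1 numerically, here is a minimal sketch (the Huber loss and \(t_{3}\) noise below are illustrative choices satisfying the symmetry and convexity conditions): the map \(\alpha \mapsto \mathbb {E}\rho (\varepsilon _{i} - \alpha )\) is minimized at 0.

```python
import numpy as np

# H(alpha) = E rho(eps - alpha) for symmetric noise eps and an even convex
# loss rho is minimized at alpha = 0 (Proposition 2.1).
rng = np.random.default_rng(1)
eps = rng.standard_t(df=3, size=500_000)   # symmetric noise
def huber(x, k=1.345):                     # even, convex Huber loss
    return np.where(np.abs(x) <= k, 0.5 * x**2, k * (np.abs(x) - 0.5 * k))

alphas = np.linspace(-1, 1, 41)
H = np.array([huber(eps - a).mean() for a in alphas])
print(f"argmin of H over the grid: alpha = {alphas[H.argmin()]:.3f}")  # ~ 0.0
```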
Proof
(Proof of Proposition 2.2) For any \(\alpha \in \mathbb {R}\) and \(\beta \in \mathbb {R}^{p}\), let
Since \(\alpha _{\rho }\) minimizes \(\mathbb {E}\rho (\varepsilon _{i} - \alpha )\), it holds that
Since \(\alpha _{\rho }\) is the unique minimizer of \(\mathbb {E}\rho (\varepsilon _{i} - \alpha )\), the above equality holds if and only if
Since \((\mathbf 1 \,\, X)\) has full column rank, it must hold that \(\alpha = \alpha _{\rho }\) and \(\beta = \beta ^{*}\). \(\square \)
1.2 Proof of Corollary 3.1
Proposition C.1
Suppose that \(\varepsilon _{i}\) are i.i.d. such that \(\mathbb {E}\rho (\varepsilon _{1} - \alpha )\) as a function of \(\alpha \) has a unique minimizer \(\alpha _{\rho }\). Further assume that \(X_{J_{n}^{c}}\) contains an intercept term, \(X_{J_{n}}\) has full column rank and
Let
Then \(\beta _{J_{n}}(\rho ) = \beta ^{*}_{J_{n}}\).
Proof
Let
For any minimizer \(\beta (\rho )\) of G, which might not be unique, we prove that \(\beta _{J_{n}}(\rho ) = \beta ^{*}_{J_{n}}\). It follows by the same argument as in Proposition 2.2 that
Since \(X_{J_{n}^{c}}\) contains the intercept term, we have
It then follows from (C-68) that
Since \(X_{J_{n}}\) has full column rank, we conclude that
\(\square \)
Proposition C.1 implies that \(\beta ^{*}_{J_{n}}\) is identifiable even when X is not of full column rank. A similar conclusion holds for the estimator \(\hat{\beta }_{J_{n}}\) and the residuals \(R_{i}\). The following two propositions show that, under certain assumptions, \(\hat{\beta }_{J_{n}}\) and \(R_{i}\) are invariant to the choice of \(\hat{\beta }\) in the presence of multiple minimizers.
Proposition C.2
Suppose that \(\rho \) is convex and twice differentiable with \(\rho ''(x)> c > 0\) for all \(x\in \mathbb {R}\). Let \(\hat{\beta }\) be any minimizer, which might not be unique, of
Then \(R_{i} = y_{i} - x_{i}^{T}\hat{\beta }\) is independent of the choice of \(\hat{\beta }\) for any i.
Proof
The conclusion is obvious if \(F(\beta )\) has a unique minimizer. Otherwise, let \(\hat{\beta }^{(1)}\) and \(\hat{\beta }^{(2)}\) be two different minimizers of F, and denote by \(\eta \) their difference, i.e. \(\eta = \hat{\beta }^{(2)} - \hat{\beta }^{(1)}\). Since F is convex, \(\hat{\beta }^{(1)} + v\eta \) is a minimizer of F for all \(v\in [0, 1]\). By Taylor expansion,
Since both \(\hat{\beta }^{(1)} + v\eta \) and \(\hat{\beta }^{(1)}\) are minimizers of F, we have \(F(\hat{\beta }^{(1)} + v\eta ) = F(\hat{\beta }^{(1)})\) and \(\nabla F(\hat{\beta }^{(1)}) = 0\). By letting v tend to 0, we conclude that
The Hessian of F can be written as
Thus, \(\eta \) satisfies that
This implies that
and hence \(R_{i}\) is the same under both minimizers for every i. \(\square \)
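The following sketch illustrates Proposition C.2 in the simplest admissible case \(\rho (x) = x^{2}\) (up to scaling, \(\rho '' \equiv 2 > 0\)): with a rank-deficient design the minimizer is not unique, yet the residuals do not depend on which minimizer is chosen.

```python
import numpy as np

# Rank-deficient least squares: two distinct minimizers, identical residuals.
rng = np.random.default_rng(2)
n, p = 50, 6
X = rng.standard_normal((n, p))
X[:, 5] = X[:, 0] + X[:, 1]                    # force rank deficiency
y = X[:, :3] @ np.ones(3) + rng.standard_normal(n)

beta1 = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm minimizer
eta = np.linalg.svd(X)[2][-1]                  # null-space direction: X @ eta ~ 0
beta2 = beta1 + 3.0 * eta                      # a second minimizer

r1, r2 = y - X @ beta1, y - X @ beta2
print(np.max(np.abs(r1 - r2)))                 # ~ 1e-14: residuals coincide
```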
Proposition C.3
Suppose that \(\rho \) is convex and twice differentiable with \(\rho ''(x)> c > 0\) for all \(x\in \mathbb {R}\). Further assume that \(X_{J_{n}}\) has full column rank and
Let \(\hat{\beta }\) be any minimizer, which might not be unique, of
Then \(\hat{\beta }_{J_{n}}\) is independent of the choice of \(\hat{\beta }\).
Proof
As in the proof of Proposition C.2, we conclude that for any minimizers \(\hat{\beta }^{(1)}\) and \(\hat{\beta }^{(2)}\), \(X\eta = 0\) where \(\eta = \hat{\beta }^{(2)} - \hat{\beta }^{(1)}\). Decomposing \(X\eta \) into two parts, we have
It then follows from (C-68) that \(X_{J_{n}}\eta _{J_{n}} = 0\). Since \(X_{J_{n}}\) has full column rank, we conclude that \(\eta _{J_{n}}= 0\) and hence \(\hat{\beta }^{(1)}_{J_{n}} = \hat{\beta }^{(2)}_{J_{n}}\). \(\square \)
Proof
(Proof of Corollary 3.1) Under assumption A3*, \(X_{J_{n}}\) must have full column rank. Otherwise there exists a nonzero \(\alpha \in \mathbb {R}^{|J_{n}|}\) such that \(X_{J_{n}}\alpha = 0\), in which case \(\alpha ^{T}X_{J_{n}}^{T}(I - H_{J_{n}^{c}})X_{J_{n}}\alpha = 0\). This violates the assumption that \(\tilde{\lambda }_{-} > 0\). On the other hand, it also guarantees that
This together with assumption A1 and Proposition C.3 implies that \(\hat{\beta }_{J_{n}}\) is independent of the choice of \(\hat{\beta }\).
Let \(B_{1}\in \mathbb {R}^{|J_{n}^{c}|\times |J_{n}|}\), \(B_{2}\in \mathbb {R}^{|J_{n}^{c}|\times |J_{n}^{c}|}\), and assume that \(B_{2}\) is invertible. Let \(\tilde{X}\in \mathbb {R}^{n\times p}\) be such that
Then \(\hbox {rank}(X) = \hbox {rank}(\tilde{X})\) and model (1) can be rewritten as
where
Let \(\tilde{\hat{\beta }}\) be an M-estimator, which might not be unique, based on \(\tilde{X}\). Then Proposition C.3 shows that \(\tilde{\hat{\beta }}_{J_{n}}\) is independent of the choice of \(\tilde{\hat{\beta }}\), and an invariance argument shows that
In the rest of the proof, we use \(\tilde{\cdot }\) to denote quantities obtained from \(\tilde{X}\). First we show that assumption A4 is not affected by this transformation. In fact, for any \(j\in J_{n}\), by definition we have
and hence the leave-j-th-predictor-out residuals are not changed, by Proposition C.2. This implies that \(\tilde{h}_{j, 0} = h_{j, 0}\) and \(\tilde{Q}_{j} = Q_{j}\). Recalling the definition of \(h_{j, 0}\), the first-order condition of \(\hat{\beta }\) entails that \(X^{T}h_{j, 0} = 0\). In particular, \(X_{J_{n}^{c}}^{T}h_{j, 0} = 0\), and this implies that for any \(\alpha \in \mathbb {R}^{n}\),
Thus,
Next we prove that assumption A5 is also unaffected by the transformation. The above argument has shown that
On the other hand, let \(B = \left( \begin{array}{cc} I_{|J_{n}|} & 0 \\ -B_{1} & B_{2} \end{array} \right) \); then B is non-singular and \(\tilde{X} = XB\). Let \(B_{(j),[j]}\) denote the matrix B after removing the j-th row and the j-th column. Then \(B_{(j),[j]}\) is also non-singular and \(\tilde{X}_{[j]} = X_{[j]}B_{(j), [j]}\). Recalling the definition of \(h_{j, 1, i}\), we have
On the other hand, by definition,
Thus,
In summary, for any \(j\in J_{n}\) and \(i\le n\),
Putting the pieces together we have
By Theorem 3.1,
provided that \(\tilde{X}\) satisfies assumption A3.
Now let \(U\varLambda V\) be the singular value decomposition of \(X_{J_{n}^{c}}\), where \(U\in \mathbb {R}^{n\times p}, \varLambda \in \mathbb {R}^{p\times p}, V \in \mathbb {R}^{p\times p}\) with \(U^{T}U = V^{T}V = I_{p}\) and \(\varLambda = \hbox {diag}(\nu _{1}, \ldots , \nu _{p})\) the diagonal matrix of singular values of \(X_{J_{n}^{c}}\). First we consider the case where \(X_{J_{n}^{c}}\) has full column rank, so that \(\nu _{j} > 0\) for all \(j\le p\). Let \(B_{1} = (X_{J_{n}}^{T}X_{J_{n}})^{-}X_{J_{n}}^{T}X_{J_{n}}\) and \(B_{2} = \sqrt{n / |J_{n}^{c}|}V^{T}\varLambda ^{-1}\). Then
This implies that
The assumption A3* implies that
By Theorem 3.1, we conclude that
Next we consider the case where \(X_{J_{n}^{c}}\) does not have full column rank. We first remove the redundant columns from \(X_{J_{n}^{c}}\), i.e. replace \(X_{J_{n}^{c}}\) by the matrix formed by a maximal linearly independent subset of its columns. Denote this matrix by \(\mathbf {X}\). Then \(\hbox {span}(X) = \hbox {span}(\mathbf {X})\) and \(\hbox {span}(\{X_{j}: j\not \in J_{n}\}) = \hbox {span}(\{\mathbf {X}_{j}: j\not \in J_{n}\})\). As a consequence of Propositions C.1 and C.3, neither \(\beta ^{*}_{J_{n}}\) nor \(\hat{\beta }_{J_{n}}\) is affected. Thus, the same reasoning as above applies to this case. \(\square \)
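To make the transformation concrete, here is a small numerical check in the least-squares case (least squares standing in for a general M-estimator; all dimensions and names are illustrative): replacing \(X_{J_{n}}\) by \(X_{J_{n}} - X_{J_{n}^{c}}B_{1}\) and \(X_{J_{n}^{c}}\) by \(X_{J_{n}^{c}}B_{2}\) leaves the fitted coefficients on \(J_{n}\) unchanged.

```python
import numpy as np

# Invariance of beta_hat on J under tilde X = X B with B = [[I, 0], [-B1, B2]].
rng = np.random.default_rng(3)
n, k, m = 80, 2, 4                             # |J| = k, |J^c| = m
XJ, XJc = rng.standard_normal((n, k)), rng.standard_normal((n, m))
y = XJ @ np.array([1.0, -2.0]) + XJc @ rng.standard_normal(m) + rng.standard_normal(n)

B1 = rng.standard_normal((m, k))
B2 = rng.standard_normal((m, m))               # invertible with probability 1
XJ_t, XJc_t = XJ - XJc @ B1, XJc @ B2          # transformed blocks

fit_J = lambda A1, A2: np.linalg.lstsq(np.hstack([A1, A2]), y, rcond=None)[0][:k]
print(fit_J(XJ, XJc))                          # coefficients on J
print(fit_J(XJ_t, XJc_t))                      # identical up to rounding
```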
1.3 Proofs of results in Section 3.3
First we prove two lemmas regarding the behavior of \(Q_{j}\); these lemmas are needed to justify assumption A4 in the examples.
Lemma C-1
Under assumptions A1 and A2,
where \(Q_{j} = \hbox {Cov}(h_{j, 0})\) is as defined in “Appendix B-1”.
Proof
(Proof of Lemma C-1) By definition,
where \(\mathbb {S}^{n - 1}\) is the unit sphere in \(\mathbb {R}^{n}\). For given \(\alpha \in \mathbb {S}^{n - 1}\),
It has been shown in (B-59) in “Appendix B-6.3” that
where \(G_{[j]} = I - X_{[j]}(X_{[j]}^{T}D_{[j]}X_{[j]})^{-1}X_{[j]}^{T}D_{[j]}\). This yields that
By the standard Poincaré inequality (see Proposition B.2), since \(\varepsilon _{i} = u_{i}(W_{i})\),
We conclude from Lemma B-2 and (B-34) in “Appendix B-2” that
Therefore,
and hence
\(\square \)
Lemma C-2
Under assumptions A1–A3,
where \(K^{*} = \frac{K_{0}^{4}}{K_{1}^{2}}\cdot \left( \frac{n - p + 1}{n}\right) ^{2}\cdot \min _{i}\hbox {Var}(\varepsilon _{i})\).
Proof
This is a direct consequence of Lemma B-4 on p. 49. \(\square \)
Throughout the following proofs, we will use several results from random matrix theory to bound the largest and smallest singular values of Z; these results are collected in “Appendix E”. Furthermore, in contrast to other sections, in this section the notations \(P(\cdot ), \mathbb {E}(\cdot ), \hbox {Var}(\cdot )\) denote the probability, expectation and variance with respect to both \(\varepsilon \) and Z.
Proof
(Proof of Proposition 3.1) By Proposition E.3,
and thus assumption A3 holds with high probability. By the Hanson–Wright inequality ([27, 51]; see Proposition E.2), for any given deterministic matrix A,
for some universal constant c. Taking \(A = Q_{j}\) and conditioning on \(Z_{[j]}\), we know by Lemma C-1 that
and hence
Note that
By Lemma C-2, we conclude that
Setting \(t = \frac{1}{2}\tau ^{2}nK^{*}\) and taking the expectation of both sides over \(Z_{[j]}\), we obtain that
and hence
This entails that
Thus, assumption A4 is also satisfied with high probability. On the other hand, since \(Z_{j}\) has i.i.d. mean-zero \(\sigma ^{2}\)-sub-gaussian entries, for any deterministic unit vector \(\alpha \in \mathbb {R}^{n}\), \(\alpha ^{T}Z_{j}\) is \(\sigma ^{2}\)-sub-gaussian and mean-zero, and hence
Let \(\alpha _{j, i} = h_{j, 1, i} / \Vert h_{j, 1, i}\Vert _{2}\) and \(\alpha _{j, 0} = h_{j, 0} / \Vert h_{j, 0}\Vert _{2}\). Since \(h_{j, 1, i}\) and \(h_{j, 0}\) are independent of \(Z_{j}\), a union bound then gives
By Fubini’s formula ([16], Lemma 2.2.8.),
This, together with Markov inequality, guarantees that assumption A5 is also satisfied with high probability. \(\square \)
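For intuition about the role of the Hanson–Wright inequality in the argument above, the following sketch (with an arbitrary PSD matrix A of our choosing) shows the quadratic form \(Z^{T}AZ\) concentrating around its mean \(\hbox {tr}(A)\).

```python
import numpy as np

# Z^T A Z concentrates around tr(A) for isotropic sub-gaussian Z.
rng = np.random.default_rng(4)
n, n_mc = 200, 2000
M = rng.standard_normal((n, n))
A = M @ M.T / n                                 # fixed PSD matrix
Z = rng.choice([-1.0, 1.0], size=(n_mc, n))     # Rademacher entries (sub-gaussian)

quad = np.einsum('ij,jk,ik->i', Z, A, Z)        # one quadratic form per draw
print(f"tr(A) = {np.trace(A):.1f}, mean = {quad.mean():.1f}, sd = {quad.std():.1f}")
```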
Proof
(Proof of Proposition 3.2) It is left to prove that assumption A3 holds with high probability; the proof for assumptions A4 and A5 is exactly the same as in the proof of Proposition 3.1. By Proposition E.4,
On the other hand, by Proposition E.7 [37],
and thus assumption A3 holds with high probability. \(\square \)
Proof
(Proof of Proposition 3.3) Since \(J_{n}\) excludes the intercept term, the proofs for assumptions A4 and A5 are the same as in Proposition 3.2. It is left to prove assumption A3. Let \(R_{1}, \ldots , R_{n}\) be i.i.d. Rademacher random variables, i.e. \(P(R_{i} = 1) = P(R_{i} = -1) = \frac{1}{2}\), and
Then \((Z^{*})^{T}Z^{*} = Z^{T}Z\). It is left to show that assumption A3 holds for \(Z^{*}\) with high probability. Note that
For any \(r \in \{1, -1\}\) and Borel sets \(B_{1}, \ldots , B_{p}\subset \mathbb {R}\),
where the last two lines use the symmetry of \(\tilde{Z}_{ij}\). We conclude that \(Z^{*}_{i}\) has independent entries; since the rows of \(Z^{*}\) are independent, \(Z^{*}\) has independent entries. Since the \(R_{i}\) are symmetric and sub-gaussian with unit variance and \(R_{i}\tilde{Z}_{ij}\overset{d}{=}\tilde{Z}_{ij}\), which is also symmetric and sub-gaussian with variance bounded from below, \(Z^{*}\) satisfies the conditions of Proposition 3.2 and hence assumption A3 is satisfied with high probability. \(\square \)
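The symmetrization device used in this proof is easy to check numerically; a minimal sketch:

```python
import numpy as np

# Flipping each row of Z by an independent Rademacher sign leaves Z^T Z unchanged.
rng = np.random.default_rng(5)
Z = rng.standard_normal((30, 5))
R = rng.choice([-1.0, 1.0], size=30)
Z_star = R[:, None] * Z                         # Z*_i = R_i * Z_i
print(np.allclose(Z_star.T @ Z_star, Z.T @ Z))  # True
```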
Proof
(Proof of Proposition 3.5 (with Proposition 3.4 being a special case)) Let \(Z_{*} = \varLambda ^{-\frac{1}{2}}Z\varSigma ^{-\frac{1}{2}}\), then \(Z_{*}\) has i.i.d. standard gaussian entries. By Proposition 3.3, \(Z_{*}\) satisfies assumption A3 with high probability. Thus,
and
As for assumption A4, the first step is to calculate \(\mathbb {E}(Z_{j}^{T}Q_{j}Z_{j} | Z_{[j]})\). Let \(\tilde{Z} = \varLambda ^{-\frac{1}{2}}Z\), then \(\hbox {vec}(\tilde{Z})\sim N(0, I\otimes \varSigma )\). As a consequence,
where
Thus,
where \(\mu _{j} = Z_{[j]}\varSigma _{[j], [j]}^{-1}\varSigma _{[j], j}\). It is easy to see that
It has been shown that \(Q_{j}\mu _{j} = 0\) and hence
Let \(\mathscr {Z}_{j} = \varLambda ^{-\frac{1}{2}}(Z_{j} - \mu _{j})\) and \(\tilde{Q}_{j} = \varLambda ^{\frac{1}{2}}Q_{j}\varLambda ^{\frac{1}{2}}\), then \(\mathscr {Z}_{j}\sim N(0, \sigma _{j}^{2}I)\) and
By Lemma C-1,
and hence
By Hanson-Wright inequality ([27, 51]; see Proposition E.2), we obtain a similar inequality to (C-69) as follows:
On the other hand,
By definition,
By Lemma C-2,
Similar to (C-70), we obtain that
Setting \(t = \frac{1}{2}\sigma _{j}^{2}nK^{*}\), we have
and a union bound together with (C-73) yields that
As for assumption A5, let
then for \(i = 0, 1, \ldots , p\),
Note that
using the same argument as in (C-72), we obtain that
and by Markov inequality and (C-73),
\(\square \)
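The key Gaussian computation in this proof is the conditional-mean formula defining \(\mu _{j}\); here is a quick Monte Carlo check (with an arbitrary covariance of our choosing) that \(Z_{j} - \mu _{j}\) is uncorrelated with the remaining columns.

```python
import numpy as np

# For rows drawn from N(0, Sigma): E(Z_j | Z_[j]) = Z_[j] Sigma_[j][j]^{-1} Sigma_[j],j.
rng = np.random.default_rng(6)
p, n = 4, 200_000
Arand = rng.standard_normal((p, p))
Sigma = Arand @ Arand.T / p + np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

j, rest = 0, [1, 2, 3]
coef = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
resid = Z[:, j] - Z[:, rest] @ coef             # Z_j - mu_j
cross_cov = resid @ Z[:, rest] / n              # empirical cross-covariance
print(np.abs(cross_cov).max())                  # ~ 0: uncorrelated
```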
Proof
(Proof of Proposition 3.6) The proof that assumptions A4 and A5 hold with high probability is exactly the same as the proof of Proposition 3.5. It is left to prove assumption A3*; see Corollary 3.1. Let \(c = (\min _{i}|(\varLambda ^{-\frac{1}{2}}\mathbf {1})_{i}|)^{-1}\) and \(\mathbf {Z} = (c\mathbf {1} \,\, \tilde{Z})\). Recalling the definitions of \(\tilde{\lambda }_{+}\) and \(\tilde{\lambda }_{-}\), we have
where
Rewrite \(\varSigma _{\{1\}}\) as
It is obvious that
As a consequence,
It remains to prove that
To prove this, we let
where \(\nu = c\varLambda ^{-\frac{1}{2}}\mathbf {1}\) and \(\tilde{Z}_{*} = \varLambda ^{-\frac{1}{2}}\tilde{Z}\varSigma ^{-\frac{1}{2}}\). Then
and
It is left to show that
By definition, \(\min _{i}|\nu _{i}| = 1\) and \(\max _{i}|\nu _{i}| = O\left( \mathrm {polyLog(n)}\right) \), then
Since \(\tilde{Z}_{*}\) has i.i.d. standard gaussian entries, by Proposition E.3,
Moreover, \(\Vert \nu \Vert _{2}^{2} \le n \max _{i}|\nu _{i}|^{2} = O(n\cdot \mathrm {polyLog(n)})\) and thus,
On the other hand, similar to Proposition 3.3,
where \(B_{1}, \ldots , B_{n}\) are i.i.d. Rademacher random variables. The same argument as in the proof of Proposition 3.3 implies that \(\mathbf {Z}_{*}\) has independent entries with sub-gaussian norm bounded by \(\Vert \nu \Vert _{\infty }^{2}\vee 1\) and variance lower bounded by 1. By Proposition E.7, \(\mathbf {Z}_{*}\) satisfies assumption A3 with high probability. Therefore, A3* holds with high probability. \(\square \)
Proof
(Proof of Proposition 3.7) Let \(\varLambda = (\lambda _{1}, \ldots , \lambda _{n})\) and let \(\mathscr {Z}\) be the matrix with entries \(\mathscr {Z}_{ij}\); then by Proposition 3.1 or Proposition 3.2, \(\mathscr {Z}\) satisfies assumption A3 with high probability. Notice that
and
Thus Z satisfies assumption A3 with high probability.
Conditioning on any realization of \(\varLambda \), the law of \(\mathscr {Z}_{ij}\) does not change, due to the independence between \(\varLambda \) and \(\mathscr {Z}\). Repeating the arguments in the proofs of Propositions 3.1 and 3.2, we can show that
where
Then
and
By Markov inequality, the assumption A5 is satisfied with high probability. \(\square \)
Proof
(Proof of Proposition 3.8) The concentration inequality for \(\zeta _{i}\), combined with a union bound, implies that
Thus, with high probability,
Let \(n' = \lfloor (1 - \delta )n\rfloor \) for some \(\delta \in (0, 1 - \kappa )\). Then for any subset I of \(\{1, \ldots , n\}\) with size \(n'\), by Proposition E.6 (Proposition E.7), under the conditions of Proposition 3.1 (Proposition 3.2), there exist constants \(c_{3}\) and \(c_{4}\), depending only on \(\kappa \), such that
where \(\mathscr {Z}_{I}\) denotes the sub-matrix of \(\mathscr {Z}\) formed by \(\{\mathscr {Z}_{i}: i \in I\}\), with \(\mathscr {Z}_{i}\) being the i-th row of \(\mathscr {Z}\). Then by a union bound,
By Stirling’s formula, there exists a constant \(c_{5} > 0\) such that
where \(\tilde{\delta } = n' / n\). For sufficiently small \(\delta \) and sufficiently large n,
and hence
for some \(c_{6} > 0\). By the Borel–Cantelli lemma,
On the other hand, since \(F^{-1}\) is continuous at \(\delta \),
where \(\zeta _{(k)}\) is the k-th largest of \(\{\zeta _{i}: i = 1, \ldots , n\}\). Let \(I^{*}\) be the set of indices corresponding to the largest \(\lfloor (1 - \delta ) n\rfloor \) of the \(\zeta _{i}\)'s. Then with probability 1,
To prove assumption A4, similar to (C-75) in the proof of Proposition 3.7, it is left to show that
Furthermore, by Lemma C-2, it remains to prove that
Recalling Eq. (B-60) in the proof of Lemma B-4, we have
By Proposition E.5,
On the other hand, applying (C-77) to \(\mathscr {Z}_{(i), [j]}\), we have
A union bound indicates that with probability \(1 - (c_{5}np + 2p)e^{-\min \{C_{2}, c_{6}\}n} = 1 - o(1)\),
This implies that for any j,
and for any i and j,
Moreover, as discussed above,
almost surely. Thus, it follows from (C-78) that with high probability,
The above bound holds for all diagonal elements of \(Q_{j}\) uniformly with high probability. Therefore,
As a result, assumption A4 is satisfied with high probability. Finally, by (C-76), we obtain that
By the Cauchy–Schwarz inequality,
Similar to (C-72), we conclude that
and by Markov inequality, the assumption A5 is satisfied with high probability. \(\square \)
1.4 More results on least squares (Section 5)
1.4.1 The relation between \(S_{j}(X)\) and \(\varDelta _{C}\)
In Sect. 5, we give a sufficient and almost necessary condition for the coordinate-wise asymptotic normality of the least-squares estimator \(\hat{\beta }^{LS}\); see Theorem 5.1. In this subsubsection, we show that \(\varDelta _{C}\) is a generalization of \(\max _{j\in J_{n}}S_{j}(X)\) to general M-estimators.
Consider the matrix \((X^{T}DX)^{-1}X^{T}\), where D is obtained from a general loss function; then by the block matrix inversion formula (see Proposition E.1),
where we use the approximation \(D \approx D_{[1]}\). Since the same result holds for all \(j\in J_{n}\),
Recalling that \(h_{j, 1, i}^{T}\) is the i-th row of \(I - D_{[1]}X_{[1]}(X_{[1]}^{T}D_{[1]}X_{[1]})^{-1}X_{[1]}^{T}\), we have
The right-hand side equals \(S_{j}(X)\) in the least-squares case. Therefore, although complicated in form, assumption A5 is not an artifact of the proof but essential for the asymptotic normality.
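In the least-squares case (\(D = I\)), the identity above reduces to the classical leave-one-predictor-out representation of a single coefficient (the Frisch–Waugh–Lovell theorem), which is easy to verify numerically; a quick sketch:

```python
import numpy as np

# beta_hat_j equals the regression of y on the part of Z_j orthogonal to Z_[j].
rng = np.random.default_rng(7)
n, p, j = 60, 5, 2
Z = rng.standard_normal((n, p))
y = Z @ rng.standard_normal(p) + rng.standard_normal(n)

beta = np.linalg.lstsq(Z, y, rcond=None)[0]
Zmj = np.delete(Z, j, axis=1)                   # Z_[j]
Hj = Zmj @ np.linalg.solve(Zmj.T @ Zmj, Zmj.T)  # projection onto span(Z_[j])
z_perp = Z[:, j] - Hj @ Z[:, j]                 # (I - H_j) Z_j
print(beta[j], z_perp @ y / (z_perp @ z_perp))  # equal
```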
1.4.2 Additional examples
Benefiting from the closed form of the least-squares estimator, we can dispense with sub-gaussianity of the entries. The following proposition shows that a random design matrix Z with independent entries satisfying appropriate moment conditions has \(\max _{j\in J_{n}}S_{j}(Z) = o(1)\) with high probability. This implies that, when X is a realization of Z, the conditions of Theorem 5.1 are satisfied for X with high probability over Z.
Proposition C.4
If \(\{Z_{ij}: i\le n, j\in J_{n}\}\) are independent random variables with
1. \(\max _{i\le n, j\in J_{n}}(\mathbb {E}|Z_{ij}|^{8 + \delta })^{\frac{1}{8 + \delta }}\le M\) for some \(\delta , M > 0\);
2. \(\min _{i\le n, j\in J_{n}}\hbox {Var}(Z_{ij}) > \tau ^{2}\) for some \(\tau > 0\);
3. \(P(Z \hbox { has full column rank}) = 1 - o(1)\);
4. \(\mathbb {E}Z_{j} \in \hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\) almost surely for all \(j\in J_{n}\);
where \(Z_{j}\) is the j-th column of Z. Then
A typical example of practical interest is one where Z contains an intercept term, which is not in \(J_{n}\), and \(Z_{j}\) has i.i.d. entries with a continuous distribution and sufficiently many moments for each \(j\in J_{n}\). In this case the first three conditions are easily checked, and \(\mathbb {E}Z_{j}\) is a multiple of \((1, \ldots , 1)\), which belongs to \(\hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\).
In fact, condition 4 allows Proposition C.4 to cover more general cases than the one above. For example, in a census study, a state-specific fixed effect might be added to the model, i.e.
where \(s_{i}\) represents the state of subject i. In this case, Z contains a sub-block formed by the \(z_{i}\) and a sub-block of ANOVA form as mentioned in Example 1. The latter is usually incorporated only to adjust for group bias and is not the target of inference. Then condition 4 is satisfied as long as \(Z_{ij}\) has the same mean within each group for each j, i.e. \(\mathbb {E}Z_{ij} = \mu _{s_{i}, j}\); a minimal sketch of this setting is given below.
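The sketch (all names and dimensions are illustrative) checks that the dummy columns span the group means, so \(\mathbb {E}Z_{j}\) lies in \(\hbox {span}\{Z_{j}: j\in J_{n}^{c}\}\).

```python
import numpy as np

# Covariate with state-specific means: E z = D @ mu lies in the span of the
# state-dummy columns D, so condition 4 of Proposition C.4 holds.
rng = np.random.default_rng(10)
n, n_states = 12, 3
states = rng.integers(n_states, size=n)
D = np.eye(n_states)[states]                   # ANOVA block: one dummy per state
mu = np.array([0.5, -1.0, 2.0])                # state-specific means
z = mu[states] + rng.standard_normal(n)        # one covariate column of Z_J
print(np.allclose(mu[states], D @ mu))         # E z is exactly D @ mu: True
```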
Proof
(Proof of Proposition C.4) By the Sherman–Morrison–Woodbury formula,
where \(H_{j} = Z_{[j]}(Z_{[j]}^{T}Z_{[j]})^{-1}Z_{[j]}^{T}\) is the projection matrix generated by \(Z_{[j]}\). Then
As in the proofs of the other examples, the strategy is to show that the numerator, a linear contrast of \(Z_{j}\), and the denominator, a quadratic form of \(Z_{j}\), are both concentrated around their means. Specifically, we will show that there exist constants \(C_{1}\) and \(C_{2}\) such that
If (C-80) holds, since \(H_{j}\) is independent of \(Z_{j}\) by assumptions, we have
Thus with probability \(1 - o(|J_{n}| / n) = 1 - o(1)\),
and hence
Now we prove (C-80). The proof, though messy in appearance, is essentially the same as the proofs for the other examples. Instead of relying on the exponential concentration given by sub-gaussianity, we establish concentration via higher-order moments.
In fact, for any idempotent matrix A, the sum of squares of each row is bounded by 1, since
By Jensen’s inequality,
For any j, by Rosenthal's inequality [48], there exists a universal constant C such that
Let \(C_{1}= (2CM^{8 + \delta })^{\frac{1}{8 + \delta }}\); then for given i, by Markov inequality,
and a union bound implies that
Now we derive a bound for \(Z_{j}^{T}AZ_{j}\). Since \(p/n \rightarrow \kappa \in (0, 1)\), there exists \(\tilde{\kappa }\in (0, 1 - \kappa )\) such that \(n - p > \tilde{\kappa } n\). Then
To bound the tail probability, we need the following result.
Lemma C-3
[2, Lemma 6.2] Let B be an \(n\times n\) nonrandom matrix and \(W = (W_{1}, \ldots , W_{n})^{T}\) be a random vector of independent entries. Assume that \(\mathbb {E}W_{i} = 0\), \(\mathbb {E}W_{i}^{2} = 1\) and \(\mathbb {E}|W_{i}|^{k}\le \nu _{k}\). Then, for any \(q\ge 1\),
where \(C_{q}\) is a constant depending on q only.
It is easy to extend Lemma C-3 to the non-isotropic case by rescaling. In fact, denote by \(\sigma _{i}^{2}\) the variance of \(W_{i}\), and let \(\varSigma = \hbox {diag}(\sigma _{1}^{2}, \ldots , \sigma _{n}^{2})\), \(Y = (W_{1} / \sigma _{1}, \ldots , W_{n} / \sigma _{n})\). Then
with \(\hbox {Cov}(Y) = I\). Let \(\tilde{B} = \varSigma ^{\frac{1}{2}}B\varSigma ^{\frac{1}{2}}\), then
This entails that
On the other hand,
Thus we obtain the following result:
Lemma C-4
Let B be an \(n\times n\) nonrandom matrix and \(W = (W_{1}, \ldots , W_{n})^{T}\) be a random vector of independent mean-zero entries. Suppose \(\mathbb {E}|W_{i}|^{k}\le \nu _{k}\), then for any \(q\ge 1\),
where \(C_{q}\) is a constant depending on q only.
Applying Lemma C-4 with \(W = Z_{j}\), \(B = A\) and \(q = 4 + \delta / 2\), we obtain that
for some constant C. Since A is idempotent, all eigenvalues of A are either 1 or 0, and thus \(AA^{T}\preceq I\). This implies that
and hence
for some constant \(C_{1}\), which depends only on M. By Markov inequality,
Combining with (C-84), we conclude that
where \(C_{2} = \frac{\tilde{\kappa }\tau ^{2}}{2}\). Notice that neither (C-83) nor (C-85) depends on j or A. Therefore, (C-80) is proved, and hence the proposition. \(\square \)
Additional numerical experiments
In this section, we repeat the experiments in Sect. 6 using the \(L_{1}\) loss, i.e. \(\rho (x) = |x|\). The \(L_{1}\) loss is not smooth and does not satisfy our technical conditions. The results are displayed below; the performance is quite similar to that with the Huber loss (Figs. 5, 6, 7).
Miscellaneous
In this appendix we state several technical results for the sake of completeness.
Proposition E.1
([28], formula (0.8.5.6)) Let \(A\in \mathbb {R}^{p\times p}\) be an invertible matrix and write A as a block matrix
with \(A_{11}\in \mathbb {R}^{p_{1}\times p_{1}}, A_{22}\in \mathbb {R}^{(p - p_{1})\times (p -p_{1})}\) being invertible matrices. Then
where \(S = A_{22} - A_{21}A_{11}^{-1}A_{12}\) is the Schur complement.
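Proposition E.1 is easy to verify numerically; the following sketch checks the top-left block of the inverse against the Schur-complement expression.

```python
import numpy as np

# (A^{-1})_{11} = A11^{-1} + A11^{-1} A12 S^{-1} A21 A11^{-1},
# with S = A22 - A21 A11^{-1} A12 the Schur complement.
rng = np.random.default_rng(8)
p, p1 = 6, 2
A = rng.standard_normal((p, p)) + p * np.eye(p)     # invertible blocks w.h.p.
A11, A12, A21, A22 = A[:p1, :p1], A[:p1, p1:], A[p1:, :p1], A[p1:, p1:]

A11_inv = np.linalg.inv(A11)
S = A22 - A21 @ A11_inv @ A12                       # Schur complement
block_11 = A11_inv + A11_inv @ A12 @ np.linalg.solve(S, A21 @ A11_inv)
print(np.allclose(np.linalg.inv(A)[:p1, :p1], block_11))  # True
```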
Proposition E.2
([51]; improved version of the original form by [27]) Let \(X = (X_{1}, \ldots , X_{n})\in \mathbb {R}^{n}\) be a random vector with independent mean-zero \(\sigma ^{2}\)-sub-gaussian components \(X_{i}\). Then, for every t,
Proposition E.3
[3] If \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are i.i.d. random variables with zero mean, unit variance and finite fourth moment and \(p / n\rightarrow \kappa \), then
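The Bai–Yin limits of Proposition E.3 are visible already at moderate sizes; a quick sketch:

```python
import numpy as np

# Extreme singular values of an n x p i.i.d. matrix scale as sqrt(n)(1 +/- sqrt(kappa)).
rng = np.random.default_rng(9)
n, p = 4000, 1000                                  # kappa = 1/4
s = np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False)
kappa = p / n
print(f"s_max/sqrt(n) = {s[0]/np.sqrt(n):.3f} vs 1+sqrt(kappa) = {1+np.sqrt(kappa):.3f}")
print(f"s_min/sqrt(n) = {s[-1]/np.sqrt(n):.3f} vs 1-sqrt(kappa) = {1-np.sqrt(kappa):.3f}")
```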
Proposition E.4
[35] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent mean-zero random variables with finite fourth moment, then
for some universal constant C. In particular, if \(\mathbb {E}Z_{ij}^{4}\) are uniformly bounded, then
Proposition E.5
[50] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent mean-zero \(\sigma ^{2}\)-sub-gaussian random variables. Then there exist universal constants \(C_{1}, C_{2} > 0\) such that
Proposition E.6
[49] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are i.i.d. \(\sigma ^{2}\)-sub-gaussian random variables with zero mean and unit variance, then for \(\varepsilon \ge 0\)
for some universal constants C and c.
Proposition E.7
[37] Suppose \(\{Z_{ij}: i = 1, \ldots , n, j = 1, \ldots , p\}\) are independent \(\sigma ^{2}\)-sub-gaussian random variables such that
for some \(\sigma , \tau > 0\), and \(p / n\rightarrow \kappa \in (0, 1)\), then there exist constants \(c_{1}, c_{2} > 0\), which depend only on \(\sigma \) and \(\tau \), such that
Keywords
- M-estimation
- Robust regression
- High-dimensional statistics
- Second order Poincaré inequality
- Leave-one-out analysis