Abstract
With the development of contemporary science, large amounts of generated data contain heterogeneity and outliers in the response and/or covariates. Subsampling is an effective method to overcome the limitation of computational resources. However, when the data contain heterogeneity and outliers, inappropriate subsampling probabilities may select inferior subdata, and statistical inference based on such subdata can perform poorly. Combining asymmetric least squares and \(L_2\) estimation, this paper proposes a double-robustness framework (DRF) that can simultaneously tackle heterogeneity and outliers in the response and/or covariates. Poisson subsampling is implemented based on the DRF for massive data, and a more robust probability is derived to select the subdata. Under some regularity conditions, we establish the asymptotic properties of the subsampling estimator based on the DRF. Numerical studies and real data demonstrate the effectiveness of the proposed method.
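As a minimal illustration of the Poisson subsampling step referred to above (not the paper's method itself), the following Python sketch draws each point independently with inclusion probability \(\min(\pi_i,1)\) and refits by inverse-probability-weighted least squares. The toy data, the uniform probabilities \(\pi_i = n/N\), and the plain squared-error refit are placeholders for the robust DRF-based quantities derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_subsample(X, y, pi):
    """Draw a Poisson subsample: point i enters independently with
    probability min(pi_i, 1) and carries weight 1 / pi_i."""
    pi = np.minimum(pi, 1.0)
    keep = rng.random(len(y)) < pi
    return X[keep], y[keep], 1.0 / pi[keep]

# Toy data (hypothetical); uniform probabilities pi_i = n / N stand in
# for the robust DRF-based probabilities derived in the paper.
N, n = 10_000, 200
X = rng.standard_normal((N, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(N)

Xs, ys, w = poisson_subsample(X, y, np.full(N, n / N))

# Inverse-probability-weighted least squares on the subsample.
sw = np.sqrt(w)
beta_hat = np.linalg.lstsq(Xs * sw[:, None], ys * sw, rcond=None)[0]
```

The expected subsample size is \(n\), and the weighted fit on roughly 200 points already approximates the full-data coefficients.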
References
Ai M, Wang F, Yu J, Zhang H (2021) Optimal subsampling for large-scale quantile regression. J Complexity 62:101512
Ai M, Yu J, Zhang H, Wang H (2021) Optimal subsampling algorithms for big data regressions. Stat Sin 31(2):749–772
Aigner D, Amemiya T, Poirier D (1976) On the estimation of production frontiers: maximum likelihood estimation of the parameters of a discontinuous density function. Int Econ Rev 17(2):377–396
Barry A, Oualkacha K, Charpentier A (2022) A new GEE method to account for heteroscedasticity using asymmetric least-square regressions. J Appl Stat 49(14):3564–3590
Bowman A (1984) An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2):353–360
Ciuperca G (2021) Variable selection in high-dimensional linear model with possibly asymmetric errors. Comput Stat Data Anal 155:107–112
Drineas P, Mahoney M, Muthukrishnan S, Sarlós T (2011) Faster least squares approximation. Numer Math 117:219–249
Efron B (1991) Regression percentiles using asymmetric squared error loss. Stat Sin 1(1):93–125
Fan J, Han F, Liu H (2014) Challenges of big data analysis. Natl Sci Rev 1(2):293–314
Gijbels I, Karim R, Verhasselt A (2019) On quantile-based asymmetric family of distributions: properties and inference. Int Stat Rev 87(3):471–504
Gu Y, Zou H (2016) High-dimensional generalizations of asymmetric least squares regression and their applications. Ann Stat 44(6):2661–2694
Hájek J (1964) Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat 35(4):1491–1523
Hjort N, Pollard D (2011) Asymptotics for minimisers of convex processes. arXiv:1107.3806
Koenker R (2005) Quantile regression. Cambridge University Press, Cambridge
Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50
Liao L, Park C, Choi H (2019) Penalized expectile regression: an alternative to penalized quantile regression. Ann Inst Stat Math 71(2):409–438
Lin L, Li F (2019) A global bias-correction DC method for biased estimation under memory constraint. arXiv:1904.07477
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16(1):861–919
Ma P, Zhang X, Xing X, Ma J, Mandal A (2022) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. J Mach Learn Res 23(177):1–45
Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2021) LowCon: a design-based subsampling approach in a misspecified linear model. J Comput Graph Stat 30:694–708
Newey W, Powell J (1987) Asymmetric least squares estimation and testing. Econometrica 55(4):819–847
Pollard D (1982) Empirical choice of histogram and kernel density estimators. Scand J Stat 9(2):65–78
Pollard D (1991) Asymptotics for least absolute deviation regression estimators. Economet Theor 7(2):186–199
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics, Philadelphia
Schifano E, Wu J, Wang C, Yan J, Chen H (2016) Online updating of statistical inference in the big data setting. Technometrics 58(3):393–403
Scott D (2001) Parametric statistical modeling by minimum integrated square error. Technometrics 43(3):274–285
Shao Y, Wang L (2022) Optimal subsampling for composite quantile regression model in massive data. Stat Pap 63:1139–1161
van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang H, Ma Y (2021) Optimal subsampling for quantile regression in big data. Biometrika 108(1):99–112
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844
Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114(525):393–405
Wang L, Elmstedt J, Wong W, Xu H (2021) Orthogonal subsampling for big data linear regression. Ann Appl Stat 15(3):1273–1290
Xiong S, Li G (2008) Some results on the convergence of conditional distributions. Stat Probab Lett 78(18):3249–3253
Yan Q, Li Y, Niu M (2022) Optimal subsampling for functional quantile regression. Stat Pap. https://doi.org/10.1007/s00362-022-01367-z
Yu J, Wang H, Ai M, Zhang H (2022) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 117(537):265–276
Yuan X, Li Y, Dong X, Liu T (2022) Optimal subsampling for composite quantile regression in big data. Stat Pap 63:1649–1676
Yu J, Wang H (2022) Subdata selection algorithm for linear model discrimination. Stat Pap. 63:1883–1906
Yu J, Ai M, Ye Z (2023a) A review on design inspired subsampling for big data. Stat Pap. https://doi.org/10.1007/s00362-022-01386-w
Yu J, Liu J, Wang H (2023b) Information-based optimal subdata selection for non-linear models. Stat Pap. https://doi.org/10.1007/s00362-023-01430-3
Zhu X, Li F, Wang H (2021) Least-square approximation for a distributed system. J Comput Graph Stat 30(4):1004–1018
Acknowledgements
The authors would like to thank the Editor and two referees for the constructive suggestions that led to a significant improvement of the article. This research is supported in part by the National Natural Science Foundation of China (12171277, 12271294, 12071248).
Appendix: Technical details
Proof of Proposition 1
Direct calculation yields
and
The result can be derived. \(\square \)
Lemma 1
(Gu and Zou 2016) Denote \(r(v_{i}) = \rho _{\tau }(\varepsilon _{i} - v_{i}) - \rho _{\tau }(\varepsilon _{i}) + 2\varepsilon _{i}v_{i}\psi _{\tau }(\varepsilon _{i})\), \(i = 1, \ldots , N\). The asymmetric squared error loss \(\rho _{\tau }(\cdot )\) is continuously differentiable, but is not twice differentiable at zero when \(\tau \ne 0.5\). Moreover, for any \(\varepsilon _{i}\), \(v_{i} \in {\mathbb {R}}\) and \(\tau \in (0,1)\), we have
where \(\tau \wedge (1-\tau ) = \text {min}\{\tau , 1-\tau \}\) and \(\tau \vee (1-\tau ) = \text {max }\{\tau , 1-\tau \}\). It follows that \(\rho _{\tau }(\cdot )\) is strongly convex.
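The loss in Lemma 1 can be checked numerically. The sketch below implements the standard asymmetric squared error loss \(\rho_{\tau}(u)=|\tau - I(u<0)|u^{2}\) with weight \(\psi_{\tau}(u)=|\tau - I(u<0)|\) (as in Newey and Powell 1987), and verifies on random inputs the two-sided bound \((\tau \wedge (1-\tau ))v_i^{2} \le r(v_i) \le (\tau \vee (1-\tau ))v_i^{2}\), which we take to be the displayed inequality omitted here; the data are arbitrary.

```python
import numpy as np

def psi(u, tau):
    # psi_tau(u) = |tau - I(u < 0)|
    return np.abs(tau - (u < 0))

def rho(u, tau):
    # asymmetric squared error loss: rho_tau(u) = |tau - I(u < 0)| * u^2
    return psi(u, tau) * u ** 2

rng = np.random.default_rng(1)
tau = 0.7
eps = rng.standard_normal(1000)
v = rng.standard_normal(1000)

# r(v_i) = rho_tau(eps_i - v_i) - rho_tau(eps_i) + 2 eps_i v_i psi_tau(eps_i)
r = rho(eps - v, tau) - rho(eps, tau) + 2 * eps * v * psi(eps, tau)

# two-sided bound: (tau ∧ (1-tau)) v^2 <= r(v) <= (tau ∨ (1-tau)) v^2
assert np.all(r >= min(tau, 1 - tau) * v ** 2 - 1e-9)
assert np.all(r <= max(tau, 1 - tau) * v ** 2 + 1e-9)
```

The lower bound is exactly the strong-convexity statement: the remainder of the first-order expansion of \(\rho_{\tau}\) is bounded below by a positive multiple of \(v_i^{2}\).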
Lemma 2
(Corollary of Hjort and Pollard (2011)) Suppose \(\varvec{Z}_{n}(\varvec{d})\) is convex and can be represented as \(\frac{1}{2} \varvec{d}' \varvec{V} \varvec{d} + \varvec{W}'_{n}\varvec{d} + C_{n} + a_{n}(\varvec{d})\), where \(\varvec{V}\) is symmetric and positive definite, \(\varvec{W}_{n}\) is stochastically bounded, \(C_{n}\) is an arbitrary constant and \(a_{n}(\varvec{d})\) goes to zero in probability for each \(\varvec{d}\). Then \(\varvec{\beta }_{n} = \arg \min \varvec{Z}_{n}\) is only \(o_P(1)\) away from \(\varvec{\alpha }_{n} = -\varvec{V}^{-1}\varvec{W}_{n}\), where \(\varvec{\alpha }_{n} = \arg \min (\frac{1}{2}\varvec{d}'\varvec{V} \varvec{d} + \varvec{W}'_{n}\varvec{d} + C_{n})\). If \(\varvec{W}_{n} \overset{d}{\rightarrow }\ \varvec{W}\), then \(\varvec{\beta }_{n} \overset{d}{\rightarrow }\ -\varvec{V}^{-1}\varvec{W}\).
Lemma 3
If Conditions 1, 3, 4, 5 hold, as \(n, N \rightarrow \infty \), then
(a) \(\underset{\varvec{\beta }\in \Lambda _\textrm{B}}{\textrm{sup}}|Q_n(\varvec{\beta }) - Q_N(\varvec{\beta })|\rightarrow 0\) in conditional probability for given \({\mathcal {F}}_{N}\);

(b) \(\parallel \tilde{\varvec{\beta }} - \varvec{\beta }_{t}\parallel = o_{P}(1)\).
Proof
Direct calculation yields
Since \((\varvec{x}_{i}, y_{i})\)’s are i.i.d., we have
where the last equality is derived by
where the last equality is due to Conditions 1, 3, 4. So \({\mathbb {E}}\left\{ Q_n(\varvec{\beta })-Q_N(\varvec{\beta })\mid {\mathcal {F}}_{N} \right\} ^{2}\rightarrow 0\) as \(N\rightarrow \infty \) and \(n\rightarrow \infty \). Combining (A.1) and Chebyshev's inequality, \(Q_n(\varvec{\beta }) - Q_N(\varvec{\beta }) \rightarrow 0\) in conditional probability for given \({\mathcal {F}}_{N}\). Since \(Q_n(\varvec{\beta })\) is a convex function of \(\varvec{\beta }\), by the Convexity Lemma of Pollard (1991), \(\underset{\varvec{\beta }\in \Lambda _{\textrm{B}}}{\text {sup}}|Q_n(\varvec{\beta }) - Q_N(\varvec{\beta })|\rightarrow 0\) in conditional probability for given \({\mathcal {F}}_{N}\).
\(Q_N(\varvec{\beta })\) has a unique minimum \(\hat{\varvec{\beta }}_{f}\) by Lemma A of Newey and Powell (1987). Thus, based on Theorem 5.9 and its remark of van der Vaart (1998), we have
Xiong and Li (2008) showed that if a sequence is bounded in conditional probability, then it is bounded in unconditional probability, so we have \(\Vert \tilde{\varvec{\beta }}-\hat{\varvec{\beta }}_{f}\Vert = o_{P}(1)\). Newey and Powell (1987) proved that \(\Vert \hat{\varvec{\beta }}_{f} - \varvec{\beta }_{t} \Vert = o_{P}(1)\). By the triangle inequality, then
This completes the proof. \(\square \)
Lemma 4
Denote \(\varvec{L}^{*}(\varvec{\beta })=\frac{\partial Q_n(\varvec{\beta })}{\partial \varvec{\beta }}\). If Conditions 1, 3, 4, 5 hold, as \(n, N \rightarrow \infty \), then
Proof
For any \(\varvec{\beta } \in \Lambda _{\textrm{B}}\), direct calculation yields
Substituting \(\varvec{\beta }=\varvec{\beta }_{t}\) into (A.2), we have
Let \(L_{j_1}^{*}(\varvec{\beta }_{t})\), \(L_{j_2}^{*}(\varvec{\beta }_{t})\) be the elements of \(\varvec{L}^{*}(\varvec{\beta }_{t})\) and \(x_{ij_1}\), \(x_{ij_2}\) be the elements of \(\varvec{x}_i\), \(j_1,j_2 = 0, 1, \ldots , p\), then the conditional expectation and conditional covariance are
where the last equality is due to Conditions 1, 4, 5. Equation (A.3) holds because
and
By Chebyshev's inequality, (A.3) is obtained. Therefore, the result follows from (A.3), (A.4) and Chebyshev's inequality. \(\square \)
Lemma 5
Denote \(Z=\frac{1}{N}\sum _{i=1}^{N}[\rho _{\tau }(\varepsilon _{i}-v_{i})-\rho _{\tau }(\varepsilon _{i})]\). Under Conditions 1 and 3, Z can be split into two parts, i.e.
where \(v_{i}=\varvec{x}_{i}'\left( \varvec{\beta }-\varvec{\beta }_{t} \right) \), \(\varvec{\beta }\in \Lambda _{\textrm{B}}\).
Proof
From Lemma 1, we have
This completes the proof. \(\square \)
Proof of Theorem 3
The first part of Theorem 3 was shown in Lemma 3. Now we prove the second part. Let \(\varvec{\xi } = \varvec{\beta }-\varvec{\beta }_t\) and
Note that \(\varvec{Z}(\varvec{\xi })\) is convex and minimized by \(\tilde{\varvec{\beta }}-\varvec{\beta }_t\). Thus, we focus on \(\varvec{Z}(\varvec{\xi })\) when assessing the properties of \(\tilde{\varvec{\beta }}-\varvec{\beta }_t\). Denote \(\varvec{Z}_{N2}=\sum _{i=1}^{N}\frac{\omega _i(\varvec{\beta }_0)\psi _{\tau }(\varepsilon _{i})R_i\varvec{x}_i\varvec{x}_i'}{N\pi _i}\), then
where \(\varvec{Z}_{N2}-{\mathbb {E}}(\varvec{Z}_{N2}\mid {\mathcal {F}}_{N})=o_{P\mid {\mathcal {F}}_{N}}(1)\) can be derived by (A.5), (A.6) and Chebyshev’s inequality. Denote \(\varvec{Z}_{N3}=\varvec{Z}_{N2}-{\mathbb {E}}(\varvec{Z}_{N2}\mid {\mathcal {F}}_{N})\), then
and let \(Z_{N3j_1}\), \(Z_{N3j_2}\) be the elements of \(\varvec{Z}_{N3}\) and \(x_{ij_1}\), \(x_{ij_2}\) be the elements of \(\varvec{x}_i\), \(j_1,j_2 =0, 1, \ldots , p\),
From Lemma 5, we have
Since \(\varvec{Z}(\varvec{\xi })\) is convex, and from Lemma 2,
By Condition 2 and Lemma 4, we have
This completes the proof of Theorem 3. \(\square \)
Proof of Theorem 4
By Lemma 4,
Now we check the Lindeberg-Feller condition. Note that
For every \(\epsilon >0\), we have
where the last inequality holds by Conditions 1, 4, 5.
Given \({\mathcal {F}}_{N}\), using (A.8) and Lindeberg-Feller central limit theorem,
with \(\sqrt{n}{\mathbb {E}}\left( \varvec{L}^{*}(\varvec{\beta }_t)\mid {\mathcal {F}}_{N} \right) = O_{P}\left( \sqrt{\frac{n}{N}}\right) = o_{P}(1)\).
By Theorem 3, (A.7), (A.9) and Slutsky's theorem, we conclude that as \(n\rightarrow \infty \), \(N\rightarrow \infty \), conditional on \({\mathcal {F}}_{N}\), with probability approaching one,
where \(\varvec{D}_N\) and \(\varvec{V}_{\pi }\) are defined in (12) and (13), respectively. \(\square \)
Proof of Theorem 5
Define \(h_i^{\textrm{Aopt}}=\Vert \omega _i(\varvec{\beta }_0)\varepsilon _i\psi _{\tau }(\varepsilon _i)\varvec{D}_{N}^{-1}\varvec{x}_{i}\Vert \), \(i=1,\ldots ,N\). Without loss of generality, we assume that \(h_i^{\textrm{Aopt}}>0\) for any i, and set \(h_{N+1}^{\textrm{Aopt}}=+\infty \). To minimize \(\text {tr}(\varvec{D}^{-1}_N\varvec{V}_{\pi }\varvec{D}^{-1}_N)\), it suffices to solve the following optimization problem:
Without loss of generality, we assume that \(h_1^{\textrm{Aopt}}\le h_2^{\textrm{Aopt}}\le \ldots \le h_N^{\textrm{Aopt}}\),
where the last step follows from the Cauchy–Schwarz inequality, and the equality holds if and only if \(\pi _{i}\propto h_i^{\textrm{Aopt}}\). Now we consider two cases:
Case 1. If all \(\frac{nh_i^{\textrm{Aopt}}}{\sum _{j=1}^{N}h_j^{\textrm{Aopt}}}\le 1\), then \(\pi _i^{\textrm{Aopt}}=\frac{nh_i^{\textrm{Aopt}}}{\sum _{j=1}^{N}h_j^{\textrm{Aopt}}}\), where \(i=1,\ldots ,N\).
Case 2. Assume there exists some i such that \(\pi _i^{\textrm{Aopt}}=\frac{nh_i^{\textrm{Aopt}}}{\sum _{j=1}^{N}h_j^{\textrm{Aopt}}}>1\); by the definition of k, the number of such i is k. Therefore, the original optimization turns into the following optimization problem:
Similarly to the calculation of \(\pi _i^{\textrm{Aopt}}\) under Case 1, from the Cauchy–Schwarz inequality,
and the equality holds if and only if \(\pi _{i}\propto h_i^{\textrm{Aopt}}\), i.e. \(\pi _{i}^{\textrm{Aopt}}=\frac{(n-k)h_i^{\textrm{Aopt}}}{\sum _{j=1}^{N-k}h_j^{\textrm{Aopt}}}\), \(i=1,\ldots ,N-k\). Assume there exists \(\tilde{\textrm{M}}\) such that
and \(h_{N-k}^{\textrm{Aopt}}\le \tilde{\textrm{M}}\le h_{N-k+1}^{\textrm{Aopt}}\), so that \(\sum _{i=1}^{N-k}h_i^{\textrm{Aopt}}=(n-k)\tilde{\textrm{M}}\) holds. Thus, the set \(\{1, \ldots , N\}\) can be divided into two parts, i.e. \(\{1, \ldots , N-k\}\) and \(\{N-k+1, \ldots , N\}\), which correspond to \(\pi _{i}^{\textrm{Aopt}}=\frac{h_i^{\textrm{Aopt}}}{\tilde{\textrm{M}}}\) and \(\pi _{i}^{\textrm{Aopt}}=1\), respectively. So we have
(A.10) shows that the lower bound of \(\text {tr}(\varvec{D}^{-1}_N\varvec{V}_{\pi }\varvec{D}^{-1}_N)\) is attained when equality holds in the Cauchy–Schwarz inequality.
Substituting \(\pi _{i}^{\textrm{Aopt}}=\frac{n(h_i^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}{\sum _{j=1}^{N}(h_j^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}\) into \({\tilde{H}}\), the following equation holds:
which attains the lower bound of \(\text {tr}(\varvec{D}^{-1}_N\varvec{V}_{\pi }\varvec{D}^{-1}_N)\) in (A.10). Therefore, by (A.10) and (A.11), \(\pi _{i}^{\textrm{Aopt}}=\frac{n(h_i^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}{\sum _{j=1}^{N}(h_j^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}\) is the optimal solution.
Now we prove the existence and rationality of \(\tilde{\textrm{M}}\) when \(\tilde{\textrm{M}} \in (h_{N-k}^{\textrm{Aopt}}, h_{N-k+1}^{\textrm{Aopt}}]\). The definition of k implies that
Taking \(\tilde{\textrm{M}}_{1}=h_{N-k+1}^{\textrm{Aopt}}\) and \(\tilde{\textrm{M}}_{2}=h_{N-k}^{\textrm{Aopt}}\), we have
which implies that

We can see that \(\underset{1\le i\le N}{\max }\frac{h_{i}^{\textrm{Aopt}}\wedge \tilde{\textrm{M}}}{\sum _{j=1}^{N}(h_{j}^{\textrm{Aopt}} \wedge \tilde{\textrm{M}})}\) is continuous in \(\tilde{\textrm{M}}\), which establishes the existence of \(\tilde{\textrm{M}}\).
For the rationality of \(\tilde{\textrm{M}}\), it suffices to prove that \(\underset{1\le i\le N}{\max }\frac{h_{i}^{\textrm{Aopt}}\wedge \tilde{\textrm{M}}}{\sum _{j=1}^{N}(h_{j}^{\textrm{Aopt}} \wedge \tilde{\textrm{M}})}=\frac{1}{n}\), i.e. that \(\frac{h_{N}^{\textrm{Aopt}}\wedge \tilde{\textrm{M}}}{\sum _{j=1}^{N}(h_{j}^{\textrm{Aopt}} \wedge \tilde{\textrm{M}})}\) is nondecreasing in \(\tilde{\textrm{M}}\) on \((h_{1}^{\textrm{Aopt}},h_{N}^{\textrm{Aopt}})\). For any \(h_N^{\textrm{Aopt}}\ge \tilde{\textrm{M}}'\ge \tilde{\textrm{M}}\), we have \(\tilde{\textrm{M}}'\wedge h_N^{\textrm{Aopt}}\ge \tilde{\textrm{M}}\wedge h_N^{\textrm{Aopt}}\) and \(\left( \tilde{\textrm{M}}' / \tilde{\textrm{M}}\right) \sum _{i=1}^{N}(h_{i}^{\textrm{Aopt}} \wedge \tilde{\textrm{M}}) \ge \sum _{i=1}^{N}(h_{i}^{\textrm{Aopt}} \wedge \tilde{\textrm{M}}')\). So the rationality is proved. \(\square \)
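The case analysis in the proof of Theorem 5 amounts to a water-filling rule: assign probability one to the k largest scores \(h_i^{\textrm{Aopt}}\) and distribute the remaining budget \(n-k\) proportionally, which reproduces the truncated form \(\pi _{i}^{\textrm{Aopt}}=\frac{n(h_i^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}{\sum _{j=1}^{N}(h_j^{\textrm{Aopt}}\wedge \tilde{\textrm{M}})}\). The following sketch is an illustration under these assumptions; the function name and toy scores are hypothetical, not from the paper.

```python
import numpy as np

def aopt_probs(h, n):
    """Truncated A-optimal probabilities
    pi_i = n * min(h_i, M) / sum_j min(h_j, M),
    computed by giving probability one to the k largest scores (Case 2)
    and splitting the remaining budget n - k proportionally (Case 1)."""
    h = np.asarray(h, dtype=float)
    order = np.argsort(h)[::-1]          # indices, largest score first
    pi = np.empty_like(h)
    k = 0
    # while the proportional probability of the largest remaining
    # score would exceed 1, truncate it at 1
    while k < n and (n - k) * h[order[k]] > h[order[k:]].sum():
        pi[order[k]] = 1.0
        k += 1
    rest = order[k:]
    pi[rest] = (n - k) * h[rest] / h[rest].sum()
    return pi

# One dominant score gets pi = 1; the rest share the remaining budget.
pi = aopt_probs([100.0, 1.0, 1.0, 1.0, 1.0], n=3)
# pi sums to n = 3 and every entry lies in (0, 1]
```

The while-loop threshold plays the role of \(\tilde{\textrm{M}}\): once no remaining proportional probability exceeds one, Case 1 applies to the rest.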
Cite this article
Ren, M., Zhao, S., Wang, M. et al. Robust optimal subsampling based on weighted asymmetric least squares. Stat Papers (2023). https://doi.org/10.1007/s00362-023-01480-7