Optimal Subsampling Bootstrap for Massive Data

The bootstrap is a widely used procedure for statistical inference because of its simplicity and attractive statistical properties. However, the vanilla version of the bootstrap is no longer computationally feasible for many modern massive datasets due to the need to repeatedly resample the entire data. Therefore, several improvements to the bootstrap method have been made in recent years, which assess the quality of estimators by subsampling the full dataset before resampling the subsamples. Naturally, the performance of these modern subsampling methods is influenced by tuning parameters such as the size of the subsamples, the number of subsamples, and the number of resamples per subsample. In this paper, we develop a novel methodology for selecting these tuning parameters. Formulated as an optimization problem that optimizes a measure of the accuracy of an estimator subject to a computational cost constraint, our framework provides closed-form solutions for the optimal hyperparameter values of the subsampled bootstrap, the subsampled double bootstrap, and the bag of little bootstraps, at little or no extra time cost. Using the mean squared error as a proxy for the accuracy measure, we apply our methodology to study, compare, and improve the performance of these modern versions of the bootstrap developed for massive data through simulation studies. The results are promising.


Introduction
Real data analysis often faces situations where statistical inference is not tractable. This can happen if the interest is to estimate the variance of an estimator that is not easily estimable, or when a robust estimator of the variance is warranted under the suspicion that the assumptions are invalid. In such cases, the bootstrap method (Efron, 1990; Efron and Tibshirani, 1994) provides a simple, elegant solution for automatic inference, enabling the computation of various inference quantities without the need to know their analytical formulas. Under certain conditions, many bootstrap estimators are generally consistent (Van Der Vaart and Wellner, 1996) and can be more accurate than those based on asymptotic approximation (Hall, 1994).
Traditional bootstrap (TB) methods were first developed for small datasets for which computation was not an issue. In particular, the vanilla version of the bootstrap estimates parameters of interest by repeatedly resampling N observations with replacement from the N data points in the original sample, and thus is feasible only if the sample size is relatively modest and the computation is performed on a single computer (Booth and Hall, 1994).
With the emergence of Big Data, these methods are no longer applicable to modern datasets that are massive in size. First and foremost, it is no longer the case that these datasets can be loaded into main memory. One approach to mitigate this problem is to use parallel and distributed computing systems to divide a dataset, estimate the parameters of interest on each computer, and aggregate these estimates on a central machine. This approach usually involves high communication costs between different computer nodes and thus is sometimes not desirable (Li et al., 2013), although progress has been made to alleviate this (Jordan et al., 2019; Volgushev et al., 2019; Chen et al., 2019; Chen and Peng, 2021; Fan et al., 2021). The main focus of this paper is to develop methods for optimal bootstrap inference on a single computer, which frees us from concerns about communication cost.
Recognizing the limitation of TB for big datasets, subsampling methods that only require repeated computation of the estimator on subsamples much smaller than the original dataset have been developed (Politis et al., 1999). A leading example is the so-called n-out-of-N bootstrap (Bickel et al., 1997), in which each subsample draws n observations out of N points, often with n ≪ N. We will refer to this bootstrap scheme as the subsampled bootstrap and abbreviate it as SB hereafter. Although SB reduces the size of each subsample from N in TB to n, Bickel and Sakov (2008) showed that its performance is rather sensitive to the choice of subsample size. Moreover, SB must rescale its output, which requires knowledge and explicit use of the convergence rate of the estimator, making it less automatic to deploy than TB. These limitations of SB prompted Kleiner et al. (2014) to introduce the method of Bag of Little Bootstraps (BLB). Similar to SB, BLB starts by subsampling the whole data, with the size of each subsampled subset much smaller than N, and then follows up by resampling the subsets via simple random sampling with replacement. Crucially, the resampling in the second step of BLB is performed in such a way that the size of each resample is N, the same as the size of the entire data. The computational saving of BLB over TB is rooted in the fact that the number of distinct elements in each resample is bounded by the size of each subsample in the first step. For many estimators, such as M-estimators obtained via empirical risk minimization, this means that we only need to optimize a weighted loss function with a smaller number of distinct items in the empirical risk than under TB. Compared to SB, BLB requires no analytical re-scaling because of the size of the resamples in step two, and thus is a fully automatic method for statistical inference (Kleiner et al., 2014).
Intuitively, the performance of BLB relies on three hyperparameters:
• n, the size of the bootstrap subsamples or subsets in step one;
• R, the number of subsamples or replicates in step one;
• B, the number of resamples per subset in step two.
With unlimited computational resources, these hyperparameters should be set as large as possible for reliable inference. With a limited computational budget, however, these three hyperparameters need to be selected carefully. Indeed, the performance of BLB can be sensitive to the choice of these parameters, as discussed in Kleiner et al. (2014) and Sengupta et al. (2016). Because of this, Sengupta et al. (2016) raised an open question: how do we optimally choose n, B, and R in BLB to balance statistical accuracy and running time?
Kleiner et al. (2014) suggested an adaptive method for selecting R and B: by introducing a new tolerance parameter, they keep drawing resamples for each subset until that tolerance level is reached. Without knowing the variability of the precision estimate, however, it is unclear how this additional hyperparameter should be specified for a given computational cost. As a partial solution, Sengupta et al. (2016) proposed an alternative approach named the subsampled double bootstrap (SDB), which simply sets B = 1 in BLB. This certainly removes the need to choose between R and B. However, the relationship between the subset size n and the number of replicates R remains elusive. To choose n and R in SDB, an intuitive approach is to take n as large as possible, as long as n data points fit on a computer, while adjusting R so that the computational budget is not exceeded. As we show, however, the choice of n and R does impact the performance of SDB.
In this paper, we develop a novel framework to find the optimal balance between statistical efficiency and computational cost for these modern variants of the bootstrap method. To strike this balance, an intuitive procedure is to maximize a certain measure of this efficiency subject to a running time constraint. Intuitively, any measure of efficiency will depend on the hyperparameters n, R, and B, and so will the running time. Denoting these dependencies as Efficiency(n, R, B) and Time(n, R, B) respectively, our general framework seeks to identify
$$(n^*, R^*, B^*) = \underset{n,R,B}{\arg\operatorname{optim}}\ \mathrm{Efficiency}(n, R, B) \quad \text{subject to} \quad \mathrm{Time}(n, R, B) \le C_{\max},$$
where $C_{\max}$ is a given computational cost allocated to the bootstrap, and arg optim reads as the argument that optimizes the subsequent function. For the constraint $C_{\max}$, it is often more convenient to specify it as the time needed to implement a bootstrap method with some prespecified (n, R, B) combination. In this case, our framework can be understood as improving the performance of BLB with prespecified n, R, and B by replacing them with an optimal combination of these hyperparameters obtained via optimization.
Often, for mathematical convenience, it is easier to work with the inefficiency of an estimator based on various bootstrap implementations, for example by examining some loss function or the variance of the estimator. Intuitively, the dependence of the inefficiency on these hyperparameters can be characterized by terms that are usually inversely proportional to n, R, and B, or simple functions of them. This is because, conditional on the data, bootstrap subsamples and resamples can be viewed as independent and identically distributed (Efron, 1990; Efron and Tibshirani, 1994). The measure of inefficiency of interest often depends on the empirical distribution of these bootstrap subsamples and resamples. This is in sharp contrast to the n-out-of-N bootstrap, where explicit knowledge of the convergence rate is required for it to be applicable.
To illustrate our general framework, we use the mean squared error (MSE) of an estimator as a proxy for the accuracy measure and discuss the use of various bootstrap methods for estimating the standard error of the sample mean. The sample mean estimates the population mean, which is a quantity of major interest in many statistical problems (He and Shao, 1996).
By a careful theoretical analysis, we find that the asymptotic efficiency of various bootstrap estimates is closely related to the hyperparameters. In particular, these relationships for BLB and SDB are analytically simple, allowing us to identify the optimal combinations of the hyperparameters in these procedures in closed form. Thus, we go beyond providing an affirmative answer to the open question raised in Sengupta et al. (2016) by presenting closed-form solutions for the optimal hyperparameter triple (n, R, B) in BLB. Furthermore, our procedure provides the optimal pair of parameters (n, R) for the SB considered in Sengupta et al. (2016). Our approach can be readily extended to deal with multivariate random variables and general parameters, and we discuss how this can be done. Although we only consider independent data, a similar approach can extend this method to the block bootstrap designed for dependent data. We show via extensive simulations that our approach improves the performance of BLB and SB under a similar computational budget. Thus, our answer to the open question posed in Sengupta et al. (2016) is confirmed empirically. Note that an early approach to finding the optimal resampling size in the double bootstrap appeared in Booth and Hall (1994), but their method is computationally infeasible for massive data.
The rest of the article is organized as follows. Section 2 introduces different bootstrap methods and their associated theoretical properties. Our hyperparameter selection approach is presented in Section 3, and the estimation of more general parameters and statistics is discussed in Section 3.2. An extensive numerical study is conducted in Section 4. The article is concluded with a discussion in Section 5. All the theoretical conditions, proofs, and additional numerical results can be found in the Supplementary Material.

Bootstrap Methods for Univariate Random Variable
For a better illustration of our approach, we start by examining the simplest case where the statistic of interest is the mean of a univariate random variable. The theory presented hereafter is further developed for multivariate random variables in Section 2.5, and we discuss general statistics in Section 3.2. To facilitate the theoretical development, we need the following technical conditions.
Conditions (C1)–(C3), stated in the Supplementary Material, are mild and reasonable. By assuming these relationships between the different hyperparameters and the whole sample size N, they essentially require that n, R, and B be large enough to facilitate an asymptotic analysis of higher-order terms.
Let $S = \{X_1, \cdots, X_N\}$ denote the whole sample, consisting of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. We assume that the centered $j$th moment of $X_i$ exists such that $E(X_i - \mu)^j = \sigma_j < \infty$ for $3 \le j \le 6$. A simple estimator of the mean parameter $\mu$ is the sample average, denoted as $\bar X = N^{-1}\sum_{i=1}^N X_i$. The estimation accuracy of $\bar X$ is measured by its standard error (SE), which equals $\sigma/\sqrt{N}$ analytically. This SE can be consistently estimated as
$$\widehat{\mathrm{SE}} = \hat\sigma/\sqrt{N}, \qquad (2.1)$$
where $\hat\sigma^2 = N^{-1}\sum_{i=1}^N (X_i - \bar X)^2$. We characterize the mean and the variance of $\widehat{\mathrm{SE}}^2$ in the following theorem.
Theorem 1. For the estimator in (2.1), we have $E(\widehat{\mathrm{SE}}^2) = \sigma^2 N^{-1}\{1 + O(N^{-1})\}$ and $\mathrm{var}(\widehat{\mathrm{SE}}^2) = (\sigma_4 - \sigma^4) N^{-3}\{1 + o(1)\}$.
By this theorem, the MSE of $\widehat{\mathrm{SE}}^2$ is dominated by its variance and can be seen as
$$\mathrm{MSE}(\widehat{\mathrm{SE}}^2) = (\sigma_4 - \sigma^4) N^{-3}\{1 + o(1)\}, \qquad (2.2)$$
which scales inverse-proportionally to $N^3$ in the leading order.
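For concreteness, a minimal Python sketch of the plug-in quantities used above is given below; the helper name is ours, and the moment estimates are exactly those used later in the simulation study to evaluate the analytical formulas.

```python
import numpy as np

def analytic_se2(x):
    """Plug-in estimate of SE^2 = sigma^2/N for the sample mean, together with
    the estimates of sigma^2 and the centered fourth moment sigma_4 that enter
    the MSE formulas."""
    x = np.asarray(x, dtype=float)
    N = x.size
    xbar = x.mean()
    sigma2_hat = np.mean((x - xbar) ** 2)   # hat{sigma}^2 (divides by N)
    sigma4_hat = np.mean((x - xbar) ** 4)   # centered fourth-moment estimate
    return sigma2_hat / N, sigma2_hat, sigma4_hat
```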

Traditional bootstrap
We analyze the MSE of the traditional, or vanilla, bootstrap estimator of SE in this subsection. Denote $B$ as the total number of bootstrap resamples that are sampled uniformly with replacement from the original sample. That is, for any $b = 1, \cdots, B$, the $b$th bootstrap sample is obtained as $B^{(b)} = \{X_1^{(b)}, \cdots, X_N^{(b)}\}$, where $X_i^{(b)}$ is independently generated via simple random sampling with replacement from the whole sample $S$. From $B^{(b)}$, the $b$th bootstrap sample mean can be calculated as $\bar X^{(b)} = N^{-1}\sum_{i=1}^N X_i^{(b)}$. Subsequently, an estimate of $\mathrm{SE}^2$ via the bootstrap method, denoted $\widehat{\mathrm{SE}}^2_{\mathrm{TB}}$, is obtained from the empirical variance of the $\bar X^{(b)}$'s. Conditional on $S$, for any $b$, the $X_i^{(b)}$'s are independent and identically distributed due to the bootstrap scheme. This immediately implies that $P(X_i^{(b)} = X_j \mid S) = 1/N$ for any $1 \le j \le N$; that is, each element in $S$ has equal probability of being sampled. Accordingly, we have $E(\bar X^{(b)} \mid S) = \bar X$ and $\mathrm{var}(\bar X^{(b)} \mid S) = \hat\sigma^2/N$. We have the following theorem for this traditional bootstrap estimator of $\mathrm{SE}^2$.
Theorem 2. For the traditional bootstrap estimator, assume Condition (C1) holds; then the mean and variance of $\widehat{\mathrm{SE}}^2_{\mathrm{TB}}$ can be characterized explicitly.
From this theorem, we can immediately obtain the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{TB}}$, given in (2.3). This suggests that $B$, the number of bootstrap resamples, needs to be of the same order as $N$ or larger for $\mathrm{MSE}(\widehat{\mathrm{SE}}^2_{\mathrm{TB}})$ to achieve the same convergence rate as $\mathrm{MSE}(\widehat{\mathrm{SE}}^2)$ in (2.2). Given a computational budget, the most efficient traditional bootstrap estimator is therefore obtained by drawing resamples until the budget runs out.
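As an illustration, a minimal sketch of the traditional bootstrap estimate of $\mathrm{SE}^2$ for the sample mean is given below; centering the bootstrap means at $\bar X$ is one common convention and is our assumption here, since the paper's defining display is not reproduced.

```python
import numpy as np

def se2_tb(x, B, rng=None):
    """Traditional bootstrap: B resamples of size N drawn with replacement from x."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    N = x.size
    boot_means = np.empty(B)
    for b in range(B):
        boot_means[b] = rng.choice(x, size=N, replace=True).mean()
    # the empirical variance of the bootstrap sample means estimates SE^2 = var(Xbar)
    return np.mean((boot_means - x.mean()) ** 2)
```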

Bag of little bootstraps
We next provide a brief review of the Bag of Little Bootstraps (BLB) method in Kleiner et al. (2014), which is carried out via a two-step procedure including a subsampling and a resampling step.
Step 1. The subsampling step: We draw R subsamples or little bootstrap samples, each of which, denoted as $S^{(r)} = \{X_1^{(r)}, \cdots, X_n^{(r)}\}$ for $r = 1, \cdots, R$, is generated via simple random sampling with replacement from the whole sample $S$. Note that the size of $S^{(r)}$ is $n$, which is usually much smaller than $N$.
Step 2. The resampling step: From each subsample, B weighted resamples are drawn such that the cardinality of each resample is N. Specifically, denote the $b$th resample from the $r$th subsample as $B^{(r,b)} = \{X_1^{(r,b)}, \cdots, X_N^{(r,b)}\}$, where $X_i^{(r,b)}$ is independently drawn via simple random sampling with replacement from $S^{(r)}$. This scheme implies that the number of distinct points in each bootstrap resample $B^{(r,b)}$ is at most $n$. From $B^{(r,b)}$, the target statistic $\bar X$ can be estimated by the sample average in the resample, $\bar X^{(r,b)} = N^{-1}\sum_{i=1}^N X_i^{(r,b)}$. As a result of the two-step procedure, each bootstrap resample $B^{(r,b)}$ has cardinality N but at most n distinct elements. Equivalently, each $B^{(r,b)}$ can be seen as a weighted resample of size n. Thus, BLB avoids the need for repeated computation on resamples of size comparable to that of the original dataset, since each BLB resample contains at most n distinct elements. For a large class of commonly encountered estimators, including M-estimators, the computation can take a weighted data representation. It is for these estimators that BLB has a large computational advantage. In comparison to the traditional bootstrap, the storage requirement of BLB is substantially less demanding as long as n ≪ N, and for these estimators the cost of computing the estimator based on BLB resamples is substantially lower than that based on traditional bootstrap resamples.
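The two-step scheme above can be implemented with multinomial weights so that each size-N resample is stored through at most n counts; a minimal sketch (helper name is ours), assuming the statistic of interest accepts a weighted data representation:

```python
import numpy as np

def blb_resamples(x, n, R, B, rng=None):
    """Yield (subsample, weights) pairs: R subsamples of size n (step 1), each paired
    with B multinomial weight vectors summing to N, i.e., size-N resamples with at
    most n distinct points (step 2)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    N = x.size
    for r in range(R):
        sub = rng.choice(x, size=n, replace=True)           # step 1: subsample of size n
        for b in range(B):
            w = rng.multinomial(N, np.full(n, 1.0 / n))     # step 2: weighted size-N resample
            yield sub, w
```

Each yielded pair gives a resample mean via np.average(sub, weights=w), so the per-resample cost depends only on n rather than N.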
Given the bootstrap resample means $\bar X^{(r,b)}$, $\mathrm{SE}^2$ can be estimated by $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$, obtained by averaging over the R subsamples the empirical variance of the $\bar X^{(r,b)}$'s within each subsample. We now derive the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$. Note that conditional on $S$ and $S^{(r)}$, the $\bar X^{(r,b)}$'s are independent and identically distributed. Accordingly, we have $E(\bar X^{(r,b)} \mid S, S^{(r)}) = \bar X^{(r)}$ and $\mathrm{var}(\bar X^{(r,b)} \mid S, S^{(r)}) = (\hat\sigma^{(r)})^2/N$, where $\bar X^{(r)}$ and $(\hat\sigma^{(r)})^2$ denote the sample mean and sample variance of $S^{(r)}$. We have the following theorem.
Theorem 3. For the BLB estimator, assume Condition (C2) holds; then the mean and variance of $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$ can be characterized explicitly. From the above theorem, we obtain the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$, given in (2.4). Compared with (2.2), for the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$ to achieve the same convergence rate, we may take, for example, $n$, $R$, and $B$ each of order $\sqrt{N}$; alternatively, we can take $B \sim N^{\gamma}$ with $\gamma > 0.5$.

Subsampled bootstrap
The BLB method is closely related to the so-called "n-out-of-N" bootstrap, or subsampled bootstrap, studied in Bickel et al. (1997). With some abuse of notation, let R be the total number of subsamples. For any $1 \le r \le R$, we use $S^{(r)} = \{X_1^{(r)}, \cdots, X_n^{(r)}\}$ to denote the $r$th subsample, where $X_i^{(r)}$ is generated independently by simple random sampling with replacement from the whole sample $S$. Note that the number of distinct elements in each subsample is at most $n$. Based on $S^{(r)}$, the target statistic $\bar X$ can be computed as $\bar X^{(r)} = n^{-1}\sum_{i=1}^n X_i^{(r)}$. Accordingly, $\mathrm{SE}^2$ can be estimated by $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$, obtained by rescaling the empirical variance of the $\bar X^{(r)}$'s by the factor $n/N$. Comparing $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$ with $\widehat{\mathrm{SE}}^2_{\mathrm{TB}}$, we find that a rescaling factor $n/N$ is needed for $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$, which requires knowledge of the convergence rate of the target estimator. This makes SB less automatic (Kleiner et al., 2014; Sengupta et al., 2016).
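A minimal sketch of the subsampled bootstrap estimate of $\mathrm{SE}^2$ with the $n/N$ rescaling factor follows; centering at the full-sample mean is our assumption, and the helper name is ours.

```python
import numpy as np

def se2_sb(x, n, R, rng=None):
    """n-out-of-N bootstrap: R subsamples of size n, rescaled by n/N."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    N = x.size
    sub_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(R)])
    # var(subsample mean) targets sigma^2/n, so multiply by n/N to target sigma^2/N
    return (n / N) * np.mean((sub_means - x.mean()) ** 2)
```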
We now study the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$. Note that conditional on $S$, for any given $r$, the $X_i^{(r)}$'s are independent and identically distributed with $P(X_i^{(r)} = X_j) = 1/N$ for any $1 \le j \le N$.
We have the following theorem.
Theorem 4. For the SB estimator, assume Condition (C3) holds; then the mean and variance of $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$ can be characterized explicitly, and the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$ follows, given in (2.5). It is seen that for the MSE of the subsampled bootstrap to be comparable to that of the analytical estimator in (2.2), we require $n \gtrsim \sqrt{N}$ and $R \gtrsim N$.

Subsampled double bootstrap
Motivated by the BLB method and the subsampled bootstrap, Sengupta et al. (2016) proposed the so-called subsampled double bootstrap (SDB). The SDB takes a similar two-step approach as the BLB, with the crucial difference of taking a single resample in the resampling step of BLB. Specifically, in the first step, SDB randomly draws R subsamples of the data, denoted as $S^{(r)} = \{X_1^{(r)}, \cdots, X_n^{(r)}\}$, where the $X_i^{(r)}$'s are independently generated by simple random sampling with replacement from the whole sample $S$. In the second stage, we generate only one bootstrap resample from each subset $S^{(r)}$, which is denoted as $B^{(r,1)} = \{X_1^{(r,1)}, \cdots, X_N^{(r,1)}\}$. Here the $X_i^{(r,1)}$'s are independently generated from $S^{(r)}$ by simple random sampling with replacement. Based on $B^{(r,1)}$, $\bar X$ can be estimated as $\bar X^{(r,1)} = N^{-1}\sum_{i=1}^N X_i^{(r,1)}$, and $\mathrm{SE}^2$ can be estimated by $\widehat{\mathrm{SE}}^2_{\mathrm{SDB}}$, the empirical variance of the $\bar X^{(r,1)}$'s. Conditional on $S$, the $\bar X^{(r,1)}$'s are independent and identically distributed for $1 \le r \le R$.
Accordingly, we have $E(\bar X^{(r,1)} \mid S, S^{(r)}) = \bar X^{(r)}$ and $\mathrm{var}(\bar X^{(r,1)} \mid S, S^{(r)}) = (\hat\sigma^{(r)})^2/N$, where $\bar X^{(r)}$ and $(\hat\sigma^{(r)})^2$ denote the sample mean and sample variance of $S^{(r)}$. We have the following theorem.

Theorem 5. For the SDB estimator, assume Condition (C3) holds; then the mean and variance of $\widehat{\mathrm{SE}}^2_{\mathrm{SDB}}$ can be characterized explicitly.

From the above theorem, we immediately obtain the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{SDB}}$, given in (2.6). From this expression, for the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{SDB}}$ to be comparable with that of the analytical estimator in (2.2), we will require that $n \gtrsim \sqrt{N}$ and $R \gtrsim N$.
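A minimal sketch of the SDB estimator described above, with one weighted size-N resample per subsample; as before, centering at the full-sample mean is our assumption.

```python
import numpy as np

def se2_sdb(x, n, R, rng=None):
    """Subsampled double bootstrap: R subsamples of size n, one size-N resample each."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    N = x.size
    res_means = np.empty(R)
    for r in range(R):
        sub = rng.choice(x, size=n, replace=True)        # step 1: subsample of size n
        w = rng.multinomial(N, np.full(n, 1.0 / n))      # step 2: single size-N resample
        res_means[r] = np.average(sub, weights=w)
    return np.mean((res_means - x.mean()) ** 2)
```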

Bootstrap for multivariate random variables
In this subsection, we extend the results for a univariate random variable to the multivariate case.
With some abuse of notation, we still use $S = \{X_1, \cdots, X_N\}$ to denote the whole sample, where each $X_i = (X_{i,1}, \cdots, X_{i,p})^\top \in \mathbb{R}^p$ is independent and identically distributed with mean $\mu = (\mu_1, \cdots, \mu_p)^\top$ and covariance matrix $\Sigma = (\sigma_{j,k})$. We will assume that $p \ge 2$ is fixed for ease of exposition. By definition, we have $E(X_{i,j} - \mu_j)(X_{i,k} - \mu_k) = \sigma_{j,k}$ and $E(X_{i,j} - \mu_j)^2 = \sigma_{j,j}$. We assume that the centered fourth moments of the $X_{i,j}$'s exist. The estimation accuracy of $\bar X = (\bar X_1, \cdots, \bar X_p)^\top \in \mathbb{R}^p$ can be measured by $\Sigma/N$, which can be consistently estimated via its sample analogue.
Using the vector notation for $\bar X$, an estimate of $\Sigma/N$ via BLB is denoted as $\widehat\Sigma_{\mathrm{BLB}}$, with its $(j,k)$th ($j \ne k$) element denoted by $\widehat{\mathrm{SE}}^2_{j,k,\mathrm{BLB}}$ and the $j$th diagonal element denoted by $\widehat{\mathrm{SE}}^2_{j,\mathrm{BLB}}$. We now derive the MSE of $\widehat{\mathrm{SE}}^2_{j,k,\mathrm{BLB}}$ in the following theorem.

Theorem 6. For the BLB estimator, assume Condition (C2) holds; then the mean and variance of $\widehat{\mathrm{SE}}^2_{j,k,\mathrm{BLB}}$ for $j \ne k$ can be characterized explicitly.

On the other hand, by the result of Theorem 3, we immediately have the corresponding results for the diagonal terms. Define the MSE of $\widehat\Sigma_{\mathrm{BLB}}$ as the sum of the MSEs of estimating each individual element.
Combining (2.7) and (2.8), we further obtain the MSE of $\widehat\Sigma_{\mathrm{BLB}}$, given in (2.9), which involves constants $c_1$, $c_2$, and $c_3$ defined in (2.10), (2.11), and (2.12), respectively. Note that in order to implement our optimal hyperparameter selection method, $c_1$, $c_2$, and $c_3$ need to be consistently estimated. Since they only need to be computed once, the computational cost of obtaining them is negligible in comparison to the bootstrap procedure.
Next, the estimate of $\Sigma/N$ via SB, denoted $\widehat\Sigma_{\mathrm{SB}}$, is defined analogously, for which we have the following theorem.
Theorem 7. For the SB estimator, assume Condition (C3) holds; then the off-diagonal elements of $\widehat\Sigma_{\mathrm{SB}}$ can be characterized explicitly. Moreover, for its diagonal terms, the corresponding results follow from Theorem 4. Combining the above results, we further obtain the MSE over all the elements of $\widehat\Sigma_{\mathrm{SB}}$, given in (2.13). Lastly, we derive the estimate of $\Sigma/N$ via SDB, denoted $\widehat\Sigma_{\mathrm{SDB}}$, with its MSE characterized by the following theorem.
Theorem 8. For the SDB estimator, assume Condition (C3) holds; then the off-diagonal ($j \ne k$) elements of $\widehat\Sigma_{\mathrm{SDB}}$ can be characterized explicitly. For the diagonal terms, the corresponding results follow from Theorem 5. Combining the above results, we immediately obtain the MSE over all the elements of $\widehat\Sigma_{\mathrm{SDB}}$, given in (2.14), which involves the constant $c_4 = \sum_{j_1 \ne j_2} (\sigma_{j_1,j_2,2} - \sigma_{j_1,j_1}\sigma_{j_2,j_2})$. When $R$ is of a larger order than $n$, the expression in (2.14) simplifies further.
We note that the MSEs of these bootstrap estimators in (2.9), (2.13) and (2.14) based on multivariate random variables are similar to their univariate counterparts but with different constants in the numerators.
3 Optimal Subsampling Bootstraps and General Parameters

Optimal subsampling bootstraps
All the subsampling bootstrap methods discussed so far require the specification of the hyperparameters including the size of the subsamples n, the number of subsamples R, and for the BLB, the number of the resamples B. For these methods to produce satisfactory statistical performance, intuitively these hyperparameters should be set as large as possible.
However, setting them unnecessarily large will incur additional computational cost that we want to avoid in the first place when resorting to the subsampling bootstrap instead of the traditional bootstrap. Thus, there is a clear trade-off between the statistical accuracy of the estimator and the computational budget. Our approach to achieving the optimal trade-off is to optimize the statistical accuracy subject to a fixed, pre-specified computational constraint.
Having analyzed the MSE of the three subsampling bootstrap methods BLB, SB, and SDB in the previous section, we are ready to derive the hyperparameter values that give the optimal statistical accuracy in terms of MSE under a fixed computational cost. Since these bootstrap methods are developed solely for massive data, it makes sense to compare their performance when the full sample resides on a hard drive. That is, to conduct any bootstrap scheme, the data have to be repeatedly read from the hard disk, while the computation happens in memory. Sampling n data points from the disk costs O(n) in terms of time complexity. Denote the computational time for obtaining an estimator on a sample of size n as t(n). For an estimator taking a weighted data representation, we assume that the estimation time is t(n), where n is the number of distinct points in the sample. This is the scenario where BLB and SDB can gain computationally over TB. We summarize the sampling and computational cost in Table 1, with more details provided when a specific method is discussed. Later in this paper, we focus on those estimators with t(n) ≍ n, that is, those estimators that can be obtained in computational time linear in the sample size. This setup comprises a large class of estimators, including the sample mean and the least-squares estimator of the coefficients in a linear model. For example, to compute the mean discussed so far, the total time required for BLB is $\alpha_1 \cdot nRB + \alpha_2 \cdot nR$, where $\alpha_1$ and $\alpha_2$ are two parameters specific to the computer system; see below for more details.
We start by looking at SB first. Since SB needs to sample n data points from the disk R times, its sampling cost is O(nR). To compute $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$, the leading computation is $\bar X^{(r)}$, i.e., the sample average of each bootstrap subsample, which is evaluated R times, resulting in a computational cost of O(nR). Therefore, the total computational cost of SB is of the order O(nR). Motivated by (2.5) and (2.13), to identify the optimal subsample size and optimal number of subsamples in terms of minimizing the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{SB}}$, we need to find the (n, R) pair
$$(n^*, R^*) = \arg\min_{n,R}\,\{c_1'/R + c_2'/n^2\} \quad \text{subject to} \quad \alpha_{\mathrm{SB}}\, nR \le C_{\max},$$
where $\alpha_{\mathrm{SB}}$ is a constant quantifying the time constant in front of nR for the SB implementation, and $C_{\max}$ is the computational budget. Here, $c_1' = 2$ and $c_2' = 1$ when estimating the mean parameter for univariate random variables. For multivariate random variables, we have $c_1' = c_1$ and $c_2' = c_3$, where $c_1$ and $c_3$ are defined in (2.10) and (2.12), respectively. Note that for the multivariate case, $c_1$ and $c_3$ can be consistently estimated, and the computational cost of obtaining them is negligible in comparison to the bootstrap procedure. In practice, if we have the computational budget to draw R subsamples, each of size n, so that the computational budget corresponds to nR, then solving the above optimization problem gives the SB optimal hyperparameters $(n^*, R^*)$ in closed form, expressed through the floor function, where $\lfloor s \rfloor$ stands for the largest integer that is not larger than $s$.
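The closed form can be recovered by substituting $R = C_{\max}/(\alpha_{\mathrm{SB}} n)$ into the objective and setting the derivative in n to zero; the sketch below implements that calculation (our own derivation from the stated problem, not a verbatim transcription of the paper's display).

```python
import numpy as np

def sb_optimal_pair(c1p, c2p, budget_nR):
    """Minimize c1'/R + c2'/n^2 subject to n*R <= budget_nR, where
    budget_nR = C_max / alpha_SB (e.g. n0*R0 for a prespecified pair)."""
    n_star = max(1, int(np.floor((2.0 * c2p * budget_nR / c1p) ** (1.0 / 3.0))))
    R_star = max(1, int(np.floor(budget_nR / n_star)))
    return n_star, R_star
```

For the univariate mean (c1' = 2, c2' = 1), this sketch gives n* = ⌊budget_nR^{1/3}⌋.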
For the SDB procedure, the MSEs for the univariate and multivariate cases, given in (2.6) and (2.14), are similar to those of SB, and the optimal (n, R) pair for SDB takes the same closed form with $\alpha_{\mathrm{SDB}}$ in place of $\alpha_{\mathrm{SB}}$. In practice, $\alpha_{\mathrm{SB}}$ and $\alpha_{\mathrm{SDB}}$ can depend on the subsample size n because different computer architectures implement their algorithms differently. However, their dependence on R is linear because resampling is just a repetition of the same operation. Our strategy is to tune $\alpha_{\mathrm{SB}}$ or $\alpha_{\mathrm{SDB}}$ progressively several times. The time needed to tune them can be costly, but we minimize its impact by pilot-running experiments with small n and small R. The details can be found in the simulations in Sections 4.2–4.3 and also in Appendix B.1.
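A sketch of the pilot tuning: time a few runs with small n and R and fit time ≈ α · nR by least squares through the origin (the run_once callable and pilot sizes below are illustrative placeholders).

```python
import time
import numpy as np

def estimate_alpha(run_once, pilot_pairs):
    """Estimate alpha in time ~ alpha * n * R from a few small pilot runs.
    run_once(n, R) should execute the bootstrap scheme once with the given sizes."""
    nr, times = [], []
    for n, R in pilot_pairs:
        start = time.perf_counter()
        run_once(n, R)
        times.append(time.perf_counter() - start)
        nr.append(n * R)
    nr, times = np.asarray(nr, dtype=float), np.asarray(times)
    return float(nr @ times / (nr @ nr))    # least-squares slope through the origin

# e.g. alpha_sb = estimate_alpha(lambda n, R: se2_sb(x, n, R), [(100, 5), (200, 5), (400, 10)])
```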
We are now ready to discuss how to choose the optimal hyperparameters for BLB, which involves the three parameters n, R, and B. The sampling cost of BLB is similar to that of SB, which is O(nR). The computational cost can be seen as O(nRB), as the leading computation is to calculate $\bar X^{(r,b)}$, each based on a resample with at most n distinct points, for a total of RB times. Motivated by the expressions for the MSE in (2.4) and (2.9), to identify the optimal hyperparameter triple, we minimize the MSE of $\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}$ subject to the constraint that the computational cost is capped at $C_{\max}$. More specifically, we solve for $(n^*, R^*, B^*)$ equal to
$$\arg\min_{n,R,B}\ \mathrm{MSE}(\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}) \quad \text{subject to} \quad \alpha_1\, nRB + \alpha_2\, nR \le C_{\max},$$

where, up to a common factor, $\mathrm{MSE}(\widehat{\mathrm{SE}}^2_{\mathrm{BLB}}) \propto c_1(RB)^{-1} + c_2(nR)^{-1} + c_3 n^{-2}$, with $c_1 = 2\sigma^4(\sigma_4 - \sigma^4)^{-1}$, $c_2 = 1$, and $c_3 = \sigma_4(\sigma_4 - \sigma^4)^{-1}$ for the univariate mean-estimation case, and $c_1$, $c_2$, and $c_3$ defined in (2.10), (2.11), and (2.12), respectively, for the multivariate case. In the above optimization problem, $\alpha_1$ and $\alpha_2$ are two system-specific constants quantifying the computational cost and the sampling cost, respectively. To determine their values for a specific problem, we can take different combinations of (n, R, B) and record the running time of BLB under these combinations.
Fitting a linear model with the running time as the response variable and nRB and nR as the two covariates, we can obtain estimates of $\alpha_1$ and $\alpha_2$. Our simulation study below suggests that with very few combinations (e.g., 8), we can obtain fairly accurate estimates of $\alpha_1$ and $\alpha_2$, with the associated $R^2$ as large as 98%. In estimating $\alpha_1$ and $\alpha_2$ using pilot runs, we can keep R, the number of subsamples, small, because any dependence of the computational and sampling time on R is proportional to it. Therefore, the extra time needed to estimate $\alpha_1$ and $\alpha_2$ is negligible in comparison to the actual implementation of any subsampling bootstrap method discussed so far, because for any bootstrap R is often very large.
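A sketch of this pilot regression, with the running times and (n, R, B) combinations supplied by the user; the cost model time ≈ α1·nRB + α2·nR is fitted without an intercept.

```python
import numpy as np

def estimate_alpha12(times, combos):
    """times[i]: observed BLB running time under combos[i] = (n, R, B).
    Fit time ~ alpha1 * nRB + alpha2 * nR by least squares (no intercept)."""
    X = np.array([[n * R * B, n * R] for n, R, B in combos], dtype=float)
    y = np.asarray(times, dtype=float)
    (alpha1, alpha2), *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha1, alpha2
```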
Solving the optimization problem above for BLB amounts to finding the minimum value of $c_1(RB)^{-1} + c_2(nR)^{-1} + c_3 n^{-2}$ under the constraint $\alpha_1 nRB + \alpha_2 nR = C_{\max}$. To this end, let (n, R, B) be an arbitrary specification such that $\alpha_1 nRB + \alpha_2 nR = C_{\max}$. Applying Cauchy's inequality yields a lower bound on the objective, displayed in (3.4), and the first inequality in (3.4) becomes an equality if $c_1\alpha_2 n/B = c_2\alpha_1 B$. When n is fixed, this leads to the BLB optimal hyperparameters $(R^*, B^*)$. For n, our simulation results suggest that $n = \lfloor N^{0.7}\rfloor$ is an optimal choice in most cases, consistent with the findings in Kleiner et al. (2014) and Sengupta et al. (2016). Accordingly, we can directly use $n = \lfloor N^{0.7}\rfloor$ in practice.
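Putting the pieces together for BLB with n fixed: the equality condition $c_1\alpha_2 n/B = c_2\alpha_1 B$ determines B, and the budget constraint then determines R; the sketch below implements this calculation (again a derivation from the stated conditions rather than a copy of the paper's display).

```python
import numpy as np

def blb_optimal_RB(n, c1, c2, alpha1, alpha2, C_max):
    """Given subsample size n, choose B from c1*alpha2*n/B = c2*alpha1*B, then R
    from the budget alpha1*n*R*B + alpha2*n*R = C_max."""
    B_star = max(1, int(np.floor(np.sqrt(c1 * alpha2 * n / (c2 * alpha1)))))
    R_star = max(1, int(np.floor(C_max / (n * (alpha1 * B_star + alpha2)))))
    return R_star, B_star

# e.g. R_star, B_star = blb_optimal_RB(int(N ** 0.7), c1, c2, alpha1, alpha2, C_max)
```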

General parameters and statistics
We have so far focused on estimating the mean. In this subsection, we study more general parameters. Without loss of generality, we first assume that the random variables $X_i \in \mathbb{R}^1$, $i = 1, \ldots, N$, have mean $\mu$ and variance $\sigma^2$. Our interest is to estimate $\theta = g(\mu)$, where $g(\cdot)$ is a known, possibly complicated, but sufficiently smooth (twice differentiable) function.
Extension to random vectors is straightforward. One simple estimator of $\theta$ can be constructed as $\hat\theta = g(\bar X)$, since $\bar X$ estimates $\mu$. Asymptotically, we have $\hat\theta - \theta = \dot g(\mu)(\bar X - \mu) + o_p(N^{-1/2})$, where $\dot g(\cdot)$ is the first-order derivative of $g(\cdot)$. This means that the asymptotic $\mathrm{SE}^2$ of $\hat\theta$ is given by $\mathrm{SE}^{*2} = \dot g^2(\mu)\sigma^2/N$. Accordingly, a natural estimator for $\mathrm{SE}^{*2}$ is given by $\widehat{\mathrm{SE}}^{*2} = \dot g^2(\bar X)\hat\sigma^2/N$, which can be easily computed if $\dot g(\cdot)$ is analytically simple.
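A minimal sketch of this plug-in calculation, with $g(\mu) = \exp(\mu)$ used purely as a hypothetical example of a smooth transformation:

```python
import numpy as np

def se2_delta(x, g_dot):
    """Delta-method plug-in: SE*^2 estimate = g_dot(xbar)^2 * sigma^2_hat / N."""
    x = np.asarray(x, dtype=float)
    N = x.size
    xbar = x.mean()
    sigma2_hat = np.mean((x - xbar) ** 2)
    return (g_dot(xbar) ** 2) * sigma2_hat / N

# e.g. theta = g(mu) = exp(mu), so g_dot = np.exp:  se2_star = se2_delta(x, np.exp)
```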
Next, we discuss how to use the three subsampling bootstraps discussed so far to estimate $\mathrm{SE}^{*2}$. We start with the traditional bootstrap method, for which a natural estimator of $\mathrm{SE}^{*2}$ plugs the traditional bootstrap estimate of $\mathrm{var}(\bar X)$ into the delta-method formula above. When $X_i$ is multivariate with a scalar parameter $\theta = g(\mu)$, the asymptotic $\mathrm{SE}^2$ of $\hat\theta$ is given by $\mathrm{SE}^{*2} = \dot g(\mu)^\top \Sigma\, \dot g(\mu)/N$. Similar to the above calculation, this suggests that the asymptotic behavior of $\widehat{\mathrm{SE}}^{*2}$ is determined by the estimate of $\Sigma$.
When $\theta \in \mathbb{R}^d$ for $d > 1$, we use $\Sigma^*$ to denote the covariance matrix of $\hat\theta = g(\bar X)$. By a careful calculation, we can find that $\Sigma^* = \dot g(\mu)^\top \Sigma\, \dot g(\mu)/N$ still holds. We then discuss how to use the three subsampling bootstraps to estimate $\Sigma^*$. We first study the SB method, for which a natural estimator, $\widehat\Sigma^*_{\mathrm{SB}}$, is obtained by plugging $\widehat\Sigma_{\mathrm{SB}}$ into this formula; this suggests that the asymptotic behavior of $\widehat\Sigma^*_{\mathrm{SB}}$ is largely determined by that of $\widehat\Sigma_{\mathrm{SB}}$. A similar analysis can be conducted for the BLB and SDB methods, whose associated estimators $\widehat\Sigma^*_{\mathrm{BLB}}$ and $\widehat\Sigma^*_{\mathrm{SDB}}$ are defined analogously. Again, the asymptotic behaviors of $\widehat\Sigma^*_{\mathrm{BLB}}$, $\widehat\Sigma^*_{\mathrm{SB}}$, and $\widehat\Sigma^*_{\mathrm{SDB}}$ are largely determined by those of $\widehat\Sigma_{\mathrm{BLB}}$, $\widehat\Sigma_{\mathrm{SB}}$, and $\widehat\Sigma_{\mathrm{SDB}}$. Accordingly, the optimization approaches in Section 3.1 remain applicable for general parameters and statistics, as long as the target parameter can be expressed as $\theta = g(\mu)$ where $g(\cdot)$ is a sufficiently smooth function.
Note that the computational advantage of SDB and BLB relative to the traditional bootstrap lies in the fact that the estimator of interest can take a weighted data representation as its argument. A large class of commonly used estimators, including M-estimators, can be represented in this way. We now discuss a few examples that can be placed in the general parameter framework.
Example 1 (Estimators based on moments). Denote $\mu_j = E(X - \mu)^j$ as the $j$th centered moment of a univariate random variable $X$ and its moment estimator as $\hat\mu_j = N^{-1}\sum_{i=1}^N (X_i - \bar X)^j$. Note that $\hat\mu_j$ is a consistent estimator of $\mu_j$ that can take a weighted data representation. Thus, the optimal hyperparameter selection method presented in Section 3.1 is applicable for minimizing the MSE of $\hat\theta = g(\hat\mu)$ when one uses a subsampling bootstrap for inference.
As another example, we will later apply our method to the problem of estimating the correlation coefficient when missing data are present, a problem considered in Shao and Wang (2002). Denote the dataset as $\{(X_i, Y_i): i = 1, \cdots, N\}$ and the missing indicator as $w_i$, where $w_i = 1$ if the $i$th observation $(X_i, Y_i)$ is missing and $w_i = 0$ otherwise. Shao and Wang (2002) are interested in estimating the correlation coefficient $\rho$ between $X$ and $Y$, for which a simple sample estimator $\hat\rho$ is available. Writing the relevant moments collectively as $\mu$, we can express $\rho = g(\mu)$ and $\hat\rho = g(\hat\mu)$. The optimization technique for multivariate random variables in Section 3.1 can then be directly applied to optimize the estimation accuracy of $\widehat{\mathrm{SE}}^2$ for $\hat\rho$.
Numerical Studies

In this section, we first verify the analytical MSE formulas derived in Section 2 for estimating the mean. In Section 4.2, we apply our hyperparameter selection method to linear regression, and we further extend it to logistic regression in Section 4.3. Due to the page limit, we leave the estimation of the correlation coefficient defined for missing data, hypothesis testing, and the comparison with plain subsampling to the Supplementary Material. All programs are written in Python 3.7 and run on a cloud computing platform called Matpool (https://www.matpool.com). The simulations are run on a computer equipped with an NVIDIA Tesla P100 data center GPU, 16 GB of graphics memory, 64 GB of RAM, and a 500 GB SSD.

Verifying the formulas for the mean squared errors
This section aims to verify the $\mathrm{MSE}(\widehat{\mathrm{SE}}^2)$ formulas derived in Section 2, as expressed in (2.2), (2.3), (2.4), (2.5), and (2.6), when the mean is estimated. Towards this, we repeat each experiment M times and denote $\widehat{\mathrm{SE}}^{2(m)}$ as the estimate of $\mathrm{SE}^2$ in the $m$th simulation replicate. The true mean squared error (MSE) can then be estimated as
$$\widehat{\mathrm{MSE}} = M^{-1}\sum_{m=1}^M \big(\widehat{\mathrm{SE}}^{2(m)} - \mathrm{SE}^{*2}\big)^2,$$
where $\mathrm{SE}^{*2} = M^{-1}\sum_{m=1}^M (\bar X^{(m)} - \bar X)^2$ and $\bar X^{(m)}$ is the estimator of $\bar X$ based on the $m$th simulation replicate. With some abuse of notation, denote $\mathrm{MSE}^*$ as the estimator of $\mathrm{MSE}(\widehat{\mathrm{SE}}^2)$ in (2.2), (2.3), (2.4), (2.5), and (2.6), where $\sigma^2$ and $\sigma_4$ are estimated as $\hat\sigma^2 = N^{-1}\sum_{i=1}^N (X_i - \bar X)^2$ and $\hat\sigma_4 = N^{-1}\sum_{i=1}^N (X_i - \bar X)^4$, respectively. We then compare the difference between $\widehat{\mathrm{MSE}}$ and $\mathrm{MSE}^*$ by evaluating the ratio $\mathrm{MSE}^*/\widehat{\mathrm{MSE}}$. We generate $X_i \in \mathbb{R}^1$ for $i = 1, \cdots, N$ either from a standard normal distribution (Normal) or a centered standard exponential distribution (Exponential). We take $N = 10^5$, $n \in \{\lfloor N^{0.4}\rfloor, \lfloor N^{0.5}\rfloor, \lfloor N^{0.6}\rfloor\}$, $B \in \{25, 50\}$, and $R \in \{25, 50\}$. For each setup, a total of M = 1000 datasets are generated. The simulation results are summarized in Table 2. From this table, we observe that all of the ratios are close to one, suggesting that our analytical formulas are fairly accurate. Similarly, we can verify the formulas for $\mathrm{MSE}(\widehat{\mathrm{SE}}^2)$ derived in (2.9), (2.13), and (2.14) with $X_i \in \mathbb{R}^p$. We find that the results are similar to those presented in Table 2 and omit them to save space.
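A sketch of this Monte Carlo check is given below; se2_estimator is any of the SE² estimators sketched earlier, and analytic_mse evaluates the corresponding analytical formula with $\hat\sigma^2$ and $\hat\sigma_4$ plugged in (both callables are placeholders supplied by the user).

```python
import numpy as np

def mse_ratio(se2_estimator, analytic_mse, N, M=1000, rng=None):
    """Compare the Monte Carlo MSE of an SE^2 estimator with its analytical formula."""
    rng = np.random.default_rng(rng)
    se2_hat, xbars, formula_vals = np.empty(M), np.empty(M), np.empty(M)
    for m in range(M):
        x = rng.standard_normal(N)            # or a centered standard exponential sample
        se2_hat[m] = se2_estimator(x)
        xbars[m] = x.mean()
        formula_vals[m] = analytic_mse(x)     # formula with sigma^2, sigma_4 estimated
    true_se2 = np.mean((xbars - xbars.mean()) ** 2)    # SE*^2 across replicates
    mse_mc = np.mean((se2_hat - true_se2) ** 2)        # Monte Carlo estimate of the MSE
    return np.mean(formula_vals) / mse_mc              # ratio MSE* / hat{MSE}
```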

Hyperparameter selection for linear regression
As an application, we consider a two-dimensional linear regression model $Y_i = X_i^\top \beta_0 + \varepsilon_i$ for $i = 1, \cdots, N$, where $\beta_0 = (0.1, 0.1)^\top$ is the regression coefficient to be estimated, $X_i$ follows a bivariate normal distribution $N(0, \Sigma)$ with $\Sigma = (1, 0; 0, 1)$, and the $\varepsilon_i$ are independent noise following the standard normal distribution. We denote the ordinary least-squares estimator as $\hat\beta$, whose theoretical covariance is $\mathrm{Cov}(\hat\beta) = \Sigma/N$. This covariance can be consistently estimated by BLB as
$$\widehat{\mathrm{Cov}}_{\mathrm{BLB}} = R^{-1}\sum_{r=1}^R \Big\{ B^{-1}\sum_{b=1}^B \big(\hat\beta^{(r,b)} - \hat\beta^{(r)}\big)\big(\hat\beta^{(r,b)} - \hat\beta^{(r)}\big)^\top \Big\},$$
where $\hat\beta^{(r,b)}$ is the ordinary least-squares estimator obtained from the $b$th resample of the $r$th subsample and $\hat\beta^{(r)}$ is the ordinary least-squares estimate based on the $r$th subsample. Since the true covariance is known in this example, the performance of each estimator is evaluated by computing its MSE over $M_0 = 2000$ replications, where $X_i^{(m)}$ denotes the $X_i$ generated in the $m$th replication. We can calculate the MSE of the SB and SDB methods likewise. We then evaluate the relative performance of BLB and SDB under the same computational budget. Here, the original SDB corresponds to the case with $n = \lfloor N^{0.7}\rfloor$, and R is further calculated based on the time consumption of BLB. For the logistic regression setting summarized in Table 4, we train the optimal hyperparameters in the same way as in Section 4.2. Towards this, we randomly generate 6 different combinations of (R, B) values from $\lfloor U(15, 30)\rfloor \times \lfloor U(2500, 5000)\rfloor$ and another 6 pairs of (R, B) from $\lfloor U(150, 300)\rfloor \times \lfloor U(1, 80)\rfloor$, as seen in Table 4. Their optimal specifications are obtained via (3.1), (3.2), and (3.5) for the multivariate case. From this table, we can see clearly that the running times of the optimal BLB and SDB are similar to those of the original settings with prespecified hyperparameters. On the other hand, both methods with optimally chosen hyperparameters outperform their original settings. Moreover, the optimal BLB always performs better than the optimal (and also the original) SDB, with an average improvement margin of 44.9%. It is reasonable that optimizing BLB over both parameters yields better results than fixing one of them.
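For completeness, a sketch of one BLB covariance estimate for the least-squares coefficients using the weighted-data representation is given below; centering the resample coefficients at the per-subsample OLS estimate follows the description above, and the helper name is ours.

```python
import numpy as np

def blb_cov_ols(X, y, n, R, B, rng=None):
    """BLB covariance estimate for OLS coefficients via weighted least squares,
    so each size-N resample involves at most n distinct rows."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    cov = np.zeros((d, d))
    for _ in range(R):
        idx = rng.choice(N, size=n, replace=True)               # step 1: subsample rows
        Xs, ys = X[idx], y[idx]
        beta_r = np.linalg.lstsq(Xs, ys, rcond=None)[0]         # centering estimate
        betas = np.empty((B, d))
        for b in range(B):
            w = rng.multinomial(N, np.full(n, 1.0 / n))         # step 2: weights summing to N
            Xw = Xs * w[:, None]
            betas[b] = np.linalg.solve(Xs.T @ Xw, Xw.T @ ys)    # weighted least squares
        diff = betas - beta_r
        cov += diff.T @ diff / B
    return cov / R
```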

Real Data Analysis
To demonstrate the application of our method in data analysis, we consider the US Airline Dataset publicly available at http://stat-computing.org. This dataset contains detailed flight information. For this analysis, we take the data from the year 2008, with a sample size of N = 492809, and study how the flight Distance and ArrDelay (arrival delay) affect the ActualElapsedTime (actual elapsed time) via a linear model. We preprocess the two covariates by taking the signed log transformation log|x| · sign(x) to reduce the influence of outliers. We evaluate the performance using a procedure similar to that of Section 4.2, with the critical difference being that the variance of the ordinary least-squares estimator has to be estimated. Towards this, we make use of the traditional bootstrap, estimating it from the empirical covariance of the estimates $\hat\beta^{(m)}$ obtained on $M_0 = 10000$ bootstrap resamples drawn with replacement.
We set $n = \lfloor N^{0.7}\rfloor$ and generate 5 different (R, B) combinations from $\lfloor U(15, 30)\rfloor \times \lfloor U(2500, 5000)\rfloor$ and another 5 pairs of (R, B) from $\lfloor U(250, 500)\rfloor \times \lfloor U(1, 80)\rfloor$, as seen in Table 5. We repeat the experiment 50 times under each setting. Again, we use the MSE defined in Section 4.2 as the performance measure. From Table 5, we can see clearly that the running times of the optimal BLB and SDB are similar to those under the original settings with prespecified hyperparameters. On the other hand, BLB and SDB with hyperparameters tuned by our approach outperform their original settings. Moreover, the optimal BLB outperforms the optimal SDB (and also SB, results not shown) under the same time cost.

Conclusion
In this article, we propose a hyperparameter selection approach that can be applied to subsampling bootstrap methods. A novelty of the approach is to formulate the problem of finding optimal hyperparameters as an optimization problem, for which closed-form solutions can be readily obtained using simple, intuitive, almost back-of-the-envelope calculations. Our extensive simulation study confirms that our hyperparameter selection approach improves the efficiency of an estimator without increasing the computational burden. We emphasize, as discussed in the Introduction, that bootstrap methods are preferred when analytical formulas are unavailable or not easily computable. In view of this, the examples examined in this paper should be taken as a proof of concept.
This paper focuses on estimators whose computational cost is linear in the number of distinct observations in a dataset. It is straightforward to extend our methodology to estimators whose computational cost is of the order $O(n^\gamma)$ for some $\gamma \ge 1$ based on a dataset of size n. This setup includes all computationally feasible estimators in statistics that can be computed in polynomial time. To see how the extension can be done, for SB and SDB we just need to solve $\arg\min_{n,R}\{c_1'/R + c_2'/n^2\}$ under the constraint $\alpha n^\gamma R \le C_{\max}$, where the constraint is on $n^\gamma R$ instead of $nR$ as for the estimators discussed in Section 3.
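Under the O(n^γ) cost model, the same substitution argument used for SB yields the sketch below (our own derivation from the stated constraint); setting γ = 1 recovers the linear-cost case.

```python
import numpy as np

def sb_optimal_pair_poly(c1p, c2p, budget, gamma):
    """Minimize c1'/R + c2'/n^2 subject to n**gamma * R <= budget (budget = C_max/alpha)."""
    n_star = max(1, int(np.floor((2.0 * c2p * budget / (gamma * c1p)) ** (1.0 / (gamma + 2)))))
    R_star = max(1, int(np.floor(budget / n_star ** gamma)))
    return n_star, R_star
```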
We now discuss an extension of our approach to quantile regression. Specifically, we examine linear quantile regression for illustration by considering the model $Y_i = X_i^\top \beta_\tau + \varepsilon_i$, where $\beta_\tau$ is a d-dimensional coefficient vector, $\varepsilon_i \in \mathbb{R}^1$ is the random noise such that $P(\varepsilon_i \le 0 \mid X_i) = \tau$, and $\tau \in (0, 1)$. That is, $X_i^\top \beta_\tau$ is the $\tau$th conditional quantile of $Y_i$ given $X_i$. Using the quantile regression approaches in Koenker (2005), Wang et al. (2007), and Chen et al. (2019), we know that $\hat\beta_\tau$, the quantile regression estimator, admits an asymptotic linear expression (5.1) whose leading term involves $N^{-1}\sum_{i=1}^N X_i\{\tau - I(\varepsilon_i \le 0)\}$, plus an $o_p(N^{-1/2})$ remainder. Intuitively, if the smaller-order term in (5.1) can be ignored, the quantile regression estimator can be expressed as $g(\hat\mu_1, \hat\mu_2)$, a smooth function of the two moment estimators $\hat\mu_1 = N^{-1}\sum_{i=1}^N X_i X_i^\top$ and $\hat\mu_2 = N^{-1}\sum_{i=1}^N X_i\{\tau - I(\varepsilon_i \le 0)\}$. Thereafter, our theory in Section 3.1 can potentially be applied, though justifying it is challenging due to the non-smoothness of the loss function in quantile regression. We leave this for future research.
Finally, we target the optimal estimator in terms of its MSE in this paper, but this need not be the only choice. Our framework can be extended to other efficiency measures by quantifying the relationship between the chosen efficiency measure and the hyperparameters and then formulating an optimization problem in a manner similar to that used in this paper.
The following notations are used throughout the paper. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some constant $C$ independent of $n$, $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. When stating any results, we always assume that the sample size satisfies $N \to \infty$. We use $U(a, b)$ to denote a uniform random variable on $[a, b]$. Finally, for a matrix $A = (a_{ij}) \in \mathbb{R}^{m \times n}$, we denote $\|A\|_F = (\sum_{i,j} a_{ij}^2)^{1/2}$ as its Frobenius norm.

Table 1:
Sampling time and estimation time of various subsampling bootstrap methods. N: the total sample size; n: the sample size of the subsamples; R: the number of subsamples; B: the number of resamples.

Table 2:
Simulation results for checking the analytical formula of MSE for various bootstrap methods.

Table 4:
Simulation results for logistic regression. The explanation of the column headings can be found in the caption of Table 3.