A short note on fitting a single-index model with massive data

This paper studies inference for the index coefficient in single-index models with massive datasets. Analysis of massive datasets is challenging owing to formidable computational costs or memory requirements. A natural approach is the averaging divide-and-conquer method, which splits the data into several blocks, obtains an estimator on each block and then aggregates the estimators via averaging. However, this approach imposes a restriction on the number of blocks. To overcome this limitation, this paper proposes a computationally efficient method that only requires an initial estimator and then successively refines it via multiple rounds of aggregation. The proposed estimator achieves the optimal convergence rate without any restriction on the number of blocks. We present both theoretical analysis and experiments to explore the properties of the proposed method.


Introduction
Single-index models provide an efficient way of coping with high-dimensional nonparametric estimation problems and avoid the 'curse of dimensionality' by assuming that the response is related to the covariates only through a single linear combination. Because of their usefulness in areas such as discrete choice analysis in econometrics and dose-response modelling in biometrics, we restrict our attention to the single-index model of the form Y = g0(X⊤γ01) + ε, (1) where Y is the univariate response and X is a p-dimensional vector of covariates. The function g0(·) is an unspecified nonparametric smooth function, and γ01 is the unknown index coefficient vector. For identifiability, certain conditions are imposed on γ01; we assume that γ01 = (1, γ0⊤)⊤ with γ0 ∈ R^(p−1). This 'remove-one-component' method for γ01 has also been applied in Christou and Akritas (2016), Delecroix et al. (2006) and Ichimura (1993). The error ε is independent and identically distributed with E[ε | X] = 0.
In single-index model (1), the primary parameter of interest is the coefficient γ01 in the index X⊤γ01, since γ01 makes explicit the relationship between the response variable Y and the covariate X. Various strategies for estimating γ01 have been proposed in the literature; see, for example, Jiang et al. (2013), Jiang et al. (2016), Tang et al. (2018), Wu et al. (2010) and Xia et al. (2002).
The development of modern technology has enabled data collection of unprecedented size. For instance, Facebook had 1.55 billion monthly active users in the third quarter of 2015. In recent years, statistical analysis of such massive datasets has become a subject of increased interest. However, when the sample size is excessively large, there are two major obstacles. First, the data can be too big to be held in a computer's memory. Second, the computing task can take too long to complete. Statisticians have made important contributions to addressing these obstacles. One of these methods, the averaging divide-and-conquer (ADC) approach, has been widely adopted. The main idea of ADC is to first compute local estimators on each block and then take their average; see Chen and Xie (2014), Chen et al. (2019), Jiang et al. (2020) and Lin and Xi (2011), among others.
These averaging-based ADC approaches suffer from one drawback. For the averaging estimator to achieve the optimal convergence rate of the estimator that uses all data points at once, each block must have access to at least O(√n) samples, where n is the sample size of the full dataset. In other words, the number of blocks must be much smaller than √n, which is a highly restrictive assumption. Jordan et al. (2019) proposed the communication-efficient surrogate likelihood procedure for distributed statistical learning, which relaxes the condition on the number of blocks. However, their method cannot be applied to estimate the unknown index coefficient vector in the single-index model (1), owing to the unknown nonparametric link function. This paper proposes an iterative divide-and-conquer (IDC) method for estimating the unknown index coefficient vector in model (1) with massive datasets, which reduces the required primary memory and computation time. The proposed IDC method removes the constraint on the number of blocks in the ADC method: it only requires an initial estimator, which is then successively refined via multiple rounds of aggregation. The resulting estimator is as efficient as the estimator computed on the entire dataset.
The remainder of the paper is organized as follows. In Section 2, we introduce the proposed procedures for model (1). Both simulation examples and applications to two real datasets are given in Section 3 to illustrate the proposed procedures. Final remarks are given in Section 4. All conditions and their discussion, as well as the technical proofs, are relegated to the Appendix.

Iterative divide-and-conquer method
We first review the estimation method for the full data (Wang et al., 2010), which assumes the data can be analysed on a single machine.
We can obtain the estimator γ̂ of γ0 by minimizing the kernel-based objective (2), in which the kernel is a symmetric function and h is a bandwidth.
However, for a massive dataset, obtaining the estimator of γ0 in this way may be infeasible: a single computer may be unable to store the full data, or may take prohibitively long to solve the optimization problem in (2).
Let us assume that the n samples are partitioned into M subsets. In particular, we split the data index set {1, . . . , n} into S1, . . . , SM, where Sm denotes the set of indices in the m-th block, m = 1, . . . , M. Without loss of generality, each block has sample size ñ = n/M, where ñ is assumed to be an integer.
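As an illustration, the partitioning step above can be sketched as follows. This is a hypothetical sketch in Python/NumPy (the paper's experiments are written in R); the function name `partition_indices` is our own.

```python
import numpy as np

def partition_indices(n, M, seed=0):
    """Split the index set {0, ..., n-1} into M blocks of equal size n/M."""
    assert n % M == 0, "each block size n_tilde = n/M must be an integer"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)          # shuffle indices before splitting
    return np.array_split(perm, M)     # S_1, ..., S_M

blocks = partition_indices(n=10000, M=10)
print(len(blocks), len(blocks[0]))     # 10 blocks, each of size n_tilde = 1000
```

Each element of `blocks` plays the role of one index set Sm.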
The averaging divide-and-conquer (ADC) estimator of γ0 (Jiang et al., 2020) is γ̂_ADC = (1/M) Σ_{m=1}^{M} γ̂_m, where γ̂_m is obtained by minimizing (2) on the subset Sm. Large numbers of blocks arise naturally in practice; for example, sensor network data are collected by many sensors. However, by Theorem 4.1 in Jiang et al. (2020), for γ̂_ADC to achieve the optimal convergence rate O_p(n^(−1/2)), the number of machines M has to be fixed, which is a highly restrictive assumption. In this section, we propose a method that allows M → ∞ and remains valid for fixed M.
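The ADC aggregation step can be sketched generically. This is an illustrative Python stand-in, not the paper's R code: for concreteness, the local estimator on each block is ordinary least squares rather than the kernel-based criterion (2), and the names `adc_estimate` and `ols` are our own.

```python
import numpy as np

def adc_estimate(X, y, M, local_fit):
    """ADC: gamma_ADC = (1/M) * sum of the M block estimators gamma_hat_m."""
    n = len(y)
    blocks = np.array_split(np.arange(n), M)
    local = [local_fit(X[idx], y[idx]) for idx in blocks]  # gamma_hat_m on S_m
    return np.mean(local, axis=0)                          # aggregate by averaging

# Stand-in local estimator: ordinary least squares, not the criterion (2).
ols = lambda Xm, ym: np.linalg.lstsq(Xm, ym, rcond=None)[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=1000)
gamma_adc = adc_estimate(X, y, M=10, local_fit=ols)
print(np.round(gamma_adc, 2))
```

The recovered coefficients are close to the true vector (1, 2, −1); the point of the sketch is only the split-fit-average structure.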

Asymptotic normality of the resulting estimator
To establish the asymptotic property of the proposed estimator, the following technical conditions are imposed.
(C1) The density function of X⊤γ1 is positive and satisfies a Lipschitz condition of order 1 for γ1 in a neighbourhood of γ01. Further, X⊤γ01 has a positive and bounded density function on T = {t = X⊤γ01 : X ∈ A}, where A denotes the compact support of X.
(C2) The functions involved have two bounded and continuous derivatives.
(C4) The kernel is a bounded, continuous and symmetric probability density function satisfying standard moment conditions.
Remark 2.1: Conditions (C1)-(C4) are commonly used in the literature; see Wang et al. (2010). Condition (C1) bounds the density function of X⊤γ1 away from zero. This ensures that the denominators of ĝ(u, γ1) and ĝ′(u, γ1) are, with high probability, bounded away from 0 for u = X⊤γ1, where X ∈ A and γ1 is near γ01. The Lipschitz condition and the two derivatives in conditions (C1) and (C2) are standard smoothness conditions. Condition (C3) is necessary for the asymptotic normality of the estimator. Condition (C4) is a usual assumption on the kernel function.
Theorem 2.1 shows that γ̂ achieves the same asymptotic efficiency as the estimator in (2) computed directly on all the samples; see Theorem 2 in Wang et al. (2010). Compared to the averaging divide-and-conquer method, which also achieves the convergence rate O_p(n^(−1/2)) but only for fixed M, our approach removes the restriction on the number of machines M by applying multiple rounds of aggregation. It is also important to note that the required number of rounds Q is usually quite small. For example, if n = 10^20 and ñ = 10^5, then Q = 3.
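One plausible accounting for the example above, assuming the estimation error contracts quadratically per round (starting from the single-block rate ñ^(−1/2) and stopping once the full-data rate n^(−1/2) is reached), is sketched below. The function `rounds_needed` and this accounting are our illustration, not a formula stated in the paper.

```python
import math

def rounds_needed(n, n_tilde):
    """Smallest Q such that squaring the initial accuracy n_tilde^(-1/2)
    each round reaches the optimal rate n^(-1/2), i.e. n_tilde^(2^(Q-1)) >= n."""
    return math.ceil(math.log2(math.log(n) / math.log(n_tilde))) + 1

print(rounds_needed(n=1e20, n_tilde=1e5))  # 3, matching the example above
```

The doubly-logarithmic growth in n explains why Q remains small even for extremely large samples.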
After obtaining the estimator γ̂ of γ0, for any given point u we can estimate g0(·) in model (1) with the massive dataset by (3).

Algorithm
Based on the above analysis, we now introduce an iterative divide-and-conquer method for estimating γ 0 .
Step 1: Without loss of generality, the entire dataset is partitioned into M subsets: S1, . . . , SM.
Step 2: Calculate the initial estimator γ̂(0) based on S1.

Numerical studies
In this section, we first use Monte Carlo simulation studies to assess the finite-sample performance of the proposed procedures and then demonstrate their application with two real data analyses. All programs are written in R, and our computer has a 3.3 GHz Pentium processor and 8 GB of memory.

Simulation example 1: effect of M with fixed n
In this example, we fix the total sample size n = 10000 and vary the number of blocks M over {10, 50, 100} to assess the influence of M on the proposed estimation method. The model for the simulated data is given in (8), where X is uniformly distributed on [0, 1]^3, γ01 = (1, γ0⊤)⊤, γ0 = (2, −1)⊤ and ε ∼ N(0, 1). We compare the proposed iterative divide-and-conquer (IDC) estimator of γ0 with the oracle procedure (Oracle), obtained by solving (2) on the full data, and with the averaging divide-and-conquer (ADC) method. Table 1 reports the mean squared error (MSE = (γ̂ − γ0)⊤(γ̂ − γ0)) and the absolute bias (|γ̂ − γ0|) of the estimator γ̂ to assess the accuracy of the estimation methods. From Table 1, the following conclusions can be drawn. (iii) t in Table 1 is the average computing time in seconds used to estimate the index parameter. From t, we see that ADC and IDC run faster than Oracle. Moreover, IDC is faster than ADC in the case M = 10.
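The two accuracy measures used in Table 1 can be computed directly from their definitions. A minimal Python sketch (function names `mse` and `abs_bias` are our own):

```python
import numpy as np

def mse(gamma_hat, gamma0):
    """MSE = (gamma_hat - gamma0)' (gamma_hat - gamma0), as in Table 1."""
    d = np.asarray(gamma_hat, dtype=float) - np.asarray(gamma0, dtype=float)
    return float(d @ d)

def abs_bias(gamma_hat, gamma0):
    """Componentwise absolute bias |gamma_hat - gamma0|."""
    return np.abs(np.asarray(gamma_hat, dtype=float) - np.asarray(gamma0, dtype=float))

print(round(mse([2.1, -0.9], [2.0, -1.0]), 3))  # 0.1^2 + 0.1^2 = 0.02
```

In the simulation these would be averaged over the Monte Carlo replications.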

Simulation example 2: effect of M with fixedñ
To compare the behaviour of the three methods as the number of blocks grows with the sample size on each block fixed (ñ = 100), we consider M ∈ {10, 20, . . . , 100}. The model for the simulated data is again (8).
The MSE results are presented in Figure 1, and the average computing time in seconds used to estimate the index parameter is presented in Figure 2. From Figures 1 and 2, the following conclusions can be drawn.
(i) From Figure 1, the Oracle method performs best of the three methods for every M. However, by Figure 2, the computing times of the Oracle method are much greater than those of ADC and IDC for every M.
(ii) As the number of blocks M increases, the MSEs of Oracle and IDC decrease. However, ADC does not follow this pattern.
(iii) If the number of blocks is less than 30, the MSEs of the ADC method are smaller than those of IDC. However, as the number of blocks continues to increase, IDC significantly outperforms the ADC method. Furthermore, if the number of blocks is less than 60, the computing times of the IDC method are smaller than those of ADC.

Simulation example 3: effect of n
To examine the effect of increasing the sample size, n = 5000, 10000 and 20000 are considered. The data are generated from the single-index model (9), where X = (X1, X2)⊤ is a two-dimensional standard normal vector, the correlation between X1 and X2 is 0.5, γ01 = (1, γ0)⊤, γ0 = 2 and ε ∼ N(0, 0.2^2). Table 2 presents the averages of the absolute bias and the computing time t over the 100 datasets, along with their estimated standard errors. As expected, the absolute bias becomes smaller and the computing time t becomes larger as the sample size grows.
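The covariate design in this example (bivariate standard normal with correlation 0.5) can be simulated as follows; this is an illustrative Python stand-in for the paper's R simulation, and `simulate_X` is our own name.

```python
import numpy as np

def simulate_X(n, rho=0.5, seed=0):
    """Draw X = (X1, X2): bivariate standard normal with corr(X1, X2) = rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])   # unit variances, correlation rho
    return rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n)

X = simulate_X(n=20000)
print(X.shape, float(np.corrcoef(X.T)[0, 1]))  # sample correlation near 0.5
```

With unit marginal variances, the covariance matrix and the correlation matrix coincide, so `cov` encodes the design directly.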

Real data example 1: combined cycle power plant data
We apply the proposed method to the combined cycle power plant data. The dataset contains 9568 data points collected from a combined cycle power plant over six years (2006-2011), while the power plant was set to work with full load. The features consist of the hourly average ambient variables Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), used to predict the net hourly electrical energy output (EP) of the plant. The dataset is available online at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant. In this study, the single-index model (10) is used to fit the data, where all variables are normalized. We consider the number of blocks M ∈ {8, 16, 26, 46, 92}; the corresponding local sample sizes are ñ ∈ {1196, 598, 368, 208, 104}. Table 3 summarizes the estimated coefficients for the model, showing that AP has the smallest effect on EP among the four covariates and that AT is the most important covariate. Figure 3 shows the EP estimated by the Oracle method along with the observations, illustrating that single-index model (10) fits the combined cycle power plant data very well. Furthermore, we evaluate the performance of the three estimators based on the mean squared fitting error. From Table 3, the following conclusions can be drawn. t in Table 3 is the average computing time in seconds used to estimate the index parameter. From t, we see that IDC runs faster than Oracle and ADC.
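The normalization step mentioned above (all variables standardized before fitting) can be sketched minimally; this Python function and the illustrative numbers are our own, not the paper's preprocessing code.

```python
import numpy as np

def normalize(Z):
    """Standardize each column to mean 0 and unit standard deviation."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Two illustrative columns standing in for, e.g., AT and AP readings.
Z = np.array([[10.0, 1005.0], [20.0, 1010.0], [30.0, 1021.0]])
Zn = normalize(Z)
print(Zn.mean(axis=0), Zn.std(axis=0))  # columns now centred and scaled
```

Standardizing puts the four covariates on a common scale, so the estimated index coefficients in Table 3 are directly comparable in magnitude.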

Real data example 2: airline on-time data
Airline on-time performance data from the 2009 ASA Data Expo are used as a case study. These data are publicly available (http://stat-computing.org/dataexpo/2009/the-data.html). The dataset consists of flight arrival and departure details for all commercial flights within the United States from October 1987 to April 2008. About 12 million flights were recorded, with 29 variables. In this section, we only consider the data of year 2008 (1011963 samples). The first 1000000 data points are used for estimation and the remaining 11963 for prediction. Schifano et al. (2016) developed the linear model (11) for these data, where AD is the arrival delay (ArrDelay), a continuous variable obtained as log(ArrDelay − min(ArrDelay) + 1); HD is the departure hour (range 0 to 24); DIS is the distance (in 1000 miles); NF is the dummy variable for a night flight (1 if departure between 8 p.m. and 5 a.m., 0 otherwise); and WF is the dummy variable for a weekend flight (1 if departure occurred during the weekend, 0 otherwise). In this study, the single-index model (12) is used to fit the data. For comparison, we use the least squares (LS) method to estimate (γ1, γ2, γ3, γ4) in model (11), and the ADC and IDC methods proposed in Section 2 to estimate (γ1, γ2, γ3, γ4) in model (12). The number of blocks is 1000 for all three methods. Furthermore, we evaluate the performance of these estimators by their out-of-sample prediction accuracy, using the mean absolute prediction error (MAPE) of the predictions, where ÂD_i is the fitted value of AD_i, i = 1, . . . , n, with n = 11963; ÂD_i can be obtained by (3). Table 4 presents the estimated coefficients and MAPEs of the three methods. From Table 4, we can see that the IDC method performs better than LS and ADC, as reflected in its smaller MAPE.
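The MAPE criterion used in Table 4 is simply the average absolute gap between fitted and observed arrival delays on the hold-out set. A minimal Python sketch (the function name `mape` and the toy numbers are our own):

```python
import numpy as np

def mape(ad_hat, ad):
    """Mean absolute prediction error over the hold-out set."""
    ad_hat = np.asarray(ad_hat, dtype=float)
    ad = np.asarray(ad, dtype=float)
    return float(np.mean(np.abs(ad_hat - ad)))

print(round(mape([1.0, 2.0, 3.5], [1.5, 2.0, 3.0]), 4))  # (0.5 + 0 + 0.5) / 3
```

In the case study, `ad_hat` would hold the 11963 predicted values ÂD_i and `ad` the observed AD_i.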

Final remarks
We have proposed an iterative divide-and-conquer method for the single-index model with massive datasets. The proposed method significantly reduces the required amount of primary memory and enjoys a very low computational cost. It achieves the same asymptotic efficiency as the estimator using all the data. Furthermore, it imposes only a weak condition on the sample size as a function of memory size.

Funding
We would like to acknowledge support for this project from the Fundamental Research Funds for the Central Universities of China (No. 2232020D-43).
We prove that the mean and the variance of n^(1/2) V_{n3,s} tend to 0. This proves that ‖V_{n3}‖_2 = o_p(n^(−1/2)).
We now consider V n4 .
Finally, we consider V n5 .