Divide and Conquer Local Average Regression

The divide and conquer strategy, which breaks a massive data set into a series of manageable data blocks and then combines the independent results of the data blocks to obtain a final decision, has been recognized as a state-of-the-art method for overcoming the challenges of massive data analysis. In this paper, we merge the divide and conquer strategy with local average regression methods to infer the regressive relationship of input-output pairs from a massive data set. After theoretically analyzing the pros and cons, we find that although divide and conquer local average regression can reach the optimal learning rate, the restriction on the number of data blocks is a bit strong, which makes it feasible only for a small number of data blocks. We then propose two variants to lessen (or remove) this restriction. Our results show that these variants can achieve the optimal learning rate under a much milder restriction (or without such a restriction). Extensive experimental studies are carried out to verify our theoretical assertions.


Introduction
The rapid expansion of capacity in automatic data generation and acquisition has made a profound impact on statistics and machine learning, as it brings data of unprecedented size and complexity. Such data are generally called "massive data" or "big data" (Wu et al., 2014). Massive data bring new opportunities for discovering subtle population patterns and heterogeneities, which are believed to embody rich value and are impossible to find in relatively small data sets. They, however, simultaneously lead to a series of challenges such as the storage bottleneck, efficient computation, etc. (Zhou et al., 2014).
To attack the aforementioned challenges, a divide and conquer strategy was suggested and widely used in the statistics and machine learning communities (Li et al., 2013; Mcdonald et al., 2009; Zhang et al., 2013, 2015; Dwork and Smith, 2009; Battey et al., 2015; Wang et al., 2015). This approach first distributes a massive data set into m data blocks, then runs a specified learning algorithm on each data block independently to get a local estimate f̂_j, j = 1, ..., m, and at last transmits the m local estimates to one machine to synthesize a global estimate f̄, which is expected to model the structure of the original massive data set. The practical and almost exclusively used synthesizing method is the average mixture (AVM) (Li et al., 2013; Mcdonald et al., 2009; Zhang et al., 2013, 2015; Dwork and Smith, 2009; Battey et al., 2015), i.e., f̄ = (1/m) Σ_{j=1}^m f̂_j.
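As a minimal sketch (our own illustration, not the paper's code), the AVM synthesis f̄ = (1/m) Σ_{j=1}^m f̂_j can be written as follows; `mean_learner` is a trivial stand-in for any learning algorithm run on one block:

```python
import numpy as np

def avm(data_blocks, local_learner, x):
    """Average mixture (AVM): run the learner on each block independently,
    then average the m local estimates at the query point x."""
    return float(np.mean([local_learner(block, x) for block in data_blocks]))

# A stand-in local learner: predict the mean response of the block.
def mean_learner(block, x):
    _, y = block
    return float(np.mean(y))

rng = np.random.default_rng(0)
X = rng.uniform(size=1000)
y = 2.0 * np.ones(1000)                         # constant response, truth is 2
blocks = [(X[j::4], y[j::4]) for j in range(4)]  # m = 4 disjoint blocks
estimate = avm(blocks, mean_learner, x=0.5)
```

Any learner that works on a single block can be plugged in for `mean_learner`; the synthesis step never looks at the raw data of other blocks.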
In practice, the above divide and conquer strategy has many applicable scenarios.
We show the following three situations as motivating examples. The first one is using limited primary memory to handle a massive data set. In this situation, the divide and conquer strategy is employed as a two-stage procedure. In the first stage, it reads the whole data set sequentially block by block, each block having a sample size manageable for the limited primary memory, and derives a local estimate based on each block. In the second stage, it averages the local estimates to build up a global estimate (Li et al., 2013).
The second motivating example is using distributed data management systems to tackle massive data. Such systems (e.g., Hadoop) are designed by the divide and conquer strategy: they load the whole data set into the system and tackle computational tasks separately and automatically. Guha et al. (2012) developed an integrated programming environment of R and Hadoop (called RHIPE) for expedient and efficient statistical computing. The third motivating example is massive data privacy. In this situation, a massive data set is divided into several small pieces and the estimates derived from these pieces are combined, so as to keep the data private (Dwork and Smith, 2009).
For nonparametric regression, the aforementioned divide and conquer strategy has been shown to be efficient and feasible for global modeling methods such as kernel ridge regression (Zhang et al., 2015) and the conditional maximum entropy model (Mcdonald et al., 2009). Compared with global modeling methods, local average regression (LAR) (Györfi et al., 2002; Fan, 2000; Tsybakov, 2008), such as the Nadaraya-Watson kernel (NWK) and k nearest neighbor (KNN) estimates, has computational advantages and is therefore also widely used in image processing (Takeda et al., 2007), power prediction (Kramer et al., 2010), recommender systems (Biau et al., 2010) and financial engineering (Kato, 2012). LAR is by definition a learning scheme that averages outputs whose corresponding inputs satisfy certain localization assumptions. To tackle massive data regression problems, we combine the divide and conquer approach with LAR to produce a new learning scheme, average mixture local average regression (AVM-LAR), just as Zhang et al. (2015) did for kernel ridge regression.
Our first purpose is to analyze the performance of AVM-LAR. After providing the optimal learning rate of LAR, we show that AVM-LAR can also achieve this rate, provided the number of data blocks, m, is relatively small. We also prove that the restriction concerning m cannot be essentially improved. In a word, we provide both the optimal learning rate and the essential restriction concerning m that guarantees the optimal rate of AVM-LAR. It should be highlighted that this essential restriction is a bit strong and makes AVM-LAR feasible only for small m. Therefore, compared with LAR, AVM-LAR does not bring an essential improvement, since much attention must be paid to determining an appropriate m.
Our second purpose is to pursue other divide and conquer strategies to equip LAR efficiently. In particular, we provide two concrete variants of AVM-LAR in this paper.
The first variant is motivated by the distinction between KNN and NWK. In our experiments, we note that the range of m that guarantees the optimal learning rate of AVM-KNN is much larger than that of AVM-NWK. Recalling that the localization parameter of KNN depends on the data while that of NWK does not, we propose a variant of AVM-LAR in which the localization parameter of each data block depends on the data. We establish the optimal learning rate of this variant under a mild restriction on m and verify its feasibility by numerical simulations. The second variant is based on the definitions of AVM and LAR. It follows from the definition of LAR that the predicted value at a new input depends on samples near that input. If there are no such samples in a specified data block, then this data block does not affect the prediction of LAR. However, AVM averages the local estimates directly, neglecting the concrete value of each local estimate, which may lead to an inaccurate prediction. Based on this observation, we propose another variant of AVM-LAR that distinguishes whether a specified data block affects the prediction. We provide the optimal learning rate of this variant without any restriction on m and also present experimental verification.
To complete the above missions, the rest of the paper is organized as follows. In Section 2, we present optimal learning rates of LAR and AVM-LAR and analyze the pros and cons of AVM-LAR. In Section 3, we propose two modified AVM-LARs to improve the performance of AVM-LAR. A set of simulation studies supporting our assertions is given in Section 4. In Section 5, we provide detailed proofs of all the theorems. In Section 6, we present conclusions and some useful remarks.

Divide and Conquer Local Average Regression
In this section, after introducing some basic concepts of LAR, we present a baseline of our analysis, where an optimal minimax learning rate of LAR is derived. Then, we deduce the learning rate of AVM-LAR and analyze its pros and cons.

Local Average Regression
Let D = {(X_i, Y_i)}_{i=1}^N be a sample set, where X_i ∈ X ⊆ R^d is the input and Y_i ∈ R is the real-valued response. We always assume X is a compact set. LAR, as one of the most widely used nonparametric regression approaches, constructs an estimate of the form

f_{D,h}(x) = Σ_{i=1}^N W_{h,X_i}(x) Y_i.    (2.1)

Here, h > 0 is the so-called localization parameter. Generally speaking, W_{h,X_i}(x) is "small" if X_i is "far" from x. Two widely used examples of LAR are the Nadaraya-Watson kernel (NWK) and k nearest neighbor (KNN) estimates.
Example 1. (NWK estimate) Let K : X → R_+ be a kernel function (Györfi et al., 2002), and h > 0 be its bandwidth. The NWK estimate is defined by

f_{D,h}(x) = Σ_{i=1}^N Y_i K((x − X_i)/h) / Σ_{i=1}^N K((x − X_i)/h),

and therefore,

W_{h,X_i}(x) = K((x − X_i)/h) / Σ_{j=1}^N K((x − X_j)/h).

It is worth noting that we use the convention 0/0 = 0 in the following. Two popular kernel functions are the naive kernel K(x) = I{‖x‖ ≤ 1} and the Gaussian kernel K(x) = exp(−‖x‖²), where I{·} denotes the indicator function and ‖·‖ denotes the Euclidean norm.
Example 2. (KNN estimate) For x ∈ X, reorder the samples according to increasing distances ‖X_i − x‖. Then the KNN estimate is defined by

f_{D,k}(x) = (1/k) Σ_{i: X_i is among the k nearest neighbors of x} Y_i.

According to (2.1), this corresponds to W_{h,X_i}(x) = 1/k if X_i is one of the k nearest neighbors of x, and W_{h,X_i}(x) = 0 otherwise. Here we denote the weight of KNN as W_{h,X_i} instead of W_{k,X_i} in order to unify the notation; the h in KNN depends on the distribution of the points and on k.
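The two examples can be sketched as follows (our own illustration, using the naive kernel for NWK and the 0/0 = 0 convention stated above):

```python
import numpy as np

def nwk_estimate(X, Y, x, h):
    """NWK estimate with the naive kernel K(u) = 1{||u|| <= 1}: a weighted
    average of the Y_i with weights proportional to K((x - X_i)/h).
    Uses the convention 0/0 = 0 when no sample lies within distance h of x."""
    K = (np.linalg.norm(X - x, axis=1) <= h).astype(float)
    return 0.0 if K.sum() == 0 else float((K * Y).sum() / K.sum())

def knn_estimate(X, Y, x, k):
    """KNN estimate: average the responses of the k samples nearest to x,
    i.e. weight 1/k on each of the k nearest neighbors and 0 elsewhere."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return float(Y[idx].mean())

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([0.0])
```

Here `nwk_estimate(X, Y, x, h=1.5)` averages the two responses within distance 1.5 of x, and `knn_estimate(X, Y, x, k=2)` averages the two nearest responses; both give 0.5 in this toy setting.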

Optimal Learning Rate of LAR
The weak universal consistency and optimal learning rates of some specified LARs have been justified by Stone (1977, 1980, 1982) and summarized in the book by Györfi et al. (2002). In particular, Györfi et al. (2002, Theorem 4.1) presented a sufficient condition to guarantee the weak universal consistency of LAR, and Györfi et al. (2002, Theorems 5.2 and 6.2) deduced optimal learning rates of NWK and KNN. The aim of this subsection is to present sufficient conditions that guarantee optimal learning rates of general LAR.
Let F_{c_0,r} denote the class of (r, c_0)-smooth functions on X (Györfi et al., 2002). We suppose in this paper that f_ρ ∈ F_{c_0,r}. This is a commonly accepted prior assumption on the regression function, employed in (Tsybakov, 2008; Györfi et al., 2002; Zhang et al., 2015). The following Theorem 1 is our first main result.
Theorem 1. Let f_{D,h} be defined by (2.1). Assume that: (A) there exists a positive number c_1 such that Σ_{i=1}^N (W_{h,X_i}(x))² ≤ c_1/(N h^d) for all x ∈ X; (B) there exists a positive number c_2 such that Σ_{i: ‖X_i − x‖ > h} W_{h,X_i}(x) ≤ c_2/(N h^d) for all x ∈ X. If h ∼ N^{−1/(2r+d)}, then there exist constants C_0 and C_1 depending only on d, r, c_0, c_1 and c_2 such that

C_0 N^{−2r/(2r+d)} ≤ E‖f_{D,h} − f_ρ‖² ≤ C_1 N^{−2r/(2r+d)}.    (2.4)

Theorem 1 presents sufficient conditions on the localization weights to ensure the optimal learning rate of LAR. There are in total four constraints on the weights W_{h,X_i}(·).
The first one is the averaging constraint Σ_{i=1}^N W_{h,X_i}(x) = 1 for all x ∈ X. It essentially reflects the word "average" in LAR. The second one is the non-negativity constraint. We regard it as a mild constraint, as it holds for all the widely used LARs such as NWK and KNN. The third constraint is condition (A), which serves to control the magnitude of the weights; it aims at avoiding the extreme case in which one weight is near 1 and the others are almost 0. The last constraint is condition (B), which encodes the localization property of LAR.
We should highlight that Theorem 1 is of central importance for our analysis. On the one hand, the AVM-LAR estimate (see Section 2.3) is obviously also an LAR estimate; thus, Theorem 1 provides a theoretical tool for deriving optimal learning rates of AVM-LAR. On the other hand, Theorem 1 also provides a sanity check: an efficient AVM-LAR estimate should possess a learning rate similar to (2.4).

AVM-LAR
The AVM-LAR estimate, a marriage of the classical AVM strategy (Mcdonald et al., 2009; Zhang et al., 2013, 2015) and LAR, is formulated in Algorithm 1.
Local processing: For j = 1, 2, . . . , m, implement LAR on the data block D_j to get the jth local estimate f_{j,h}(x) = Σ_{(X_i,Y_i) ∈ D_j} W_{h,X_i}(x) Y_i.

Synthesization:
Transmit the m local estimates f_{j,h} to one machine, getting a global estimate defined by

f̄_h(x) = (1/m) Σ_{j=1}^m f_{j,h}(x).    (2.5)

In Theorem 2, we show that this simple generalization of LAR achieves the optimal learning rate under a rigorous condition concerning m. We also show that this condition is essential.
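Algorithm 1 can be sketched as follows (our own illustration with a one-dimensional naive-kernel NWK as the local LAR; the stride-based split stands in for the random division):

```python
import numpy as np

def nwk(X, Y, x, h):
    # Naive-kernel NWK on one block, with the 0/0 = 0 convention.
    K = (np.abs(X - x) <= h).astype(float)
    return 0.0 if K.sum() == 0 else float((K * Y).sum() / K.sum())

def avm_lar(X, Y, m, h, x):
    """Algorithm 1 (AVM-LAR): divide the N samples into m equal blocks,
    run LAR with the same bandwidth h on each block, then average the
    m local estimates -- whether or not a block has samples near x."""
    blocks = [(X[j::m], Y[j::m]) for j in range(m)]
    return float(np.mean([nwk(bx, by, x, h) for bx, by in blocks]))

X = np.linspace(0.0, 1.0, 101)
Y = X.copy()                         # noiseless f(x) = x for illustration
pred = avm_lar(X, Y, m=4, h=0.2, x=0.5)
```

Note that a block with no sample within distance h of x contributes 0 to the average; this is exactly the weakness that the variants in Section 3 address.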
Theorem 2. Let f̄_h be defined by (2.5) and let h_{D_j} denote the mesh norm of the data block D_j, j = 1, . . . , m. Assume that: (C) there exists a positive number c_3 such that Σ_{X_i ∈ D_j} (W_{h,X_i}(x))² ≤ c_3/(n h^d) for all x ∈ X and all D_1, . . . , D_m; (D) for all D_1, . . . , D_m, there holds almost surely Σ_{X_i ∈ D_j: ‖X_i − x‖ > h} W_{h,X_i}(x) = 0. If h ∼ N^{−1/(2r+d)} and the event {h_{D_j} ≤ h for all D_j} holds, then there exists a constant C_2 depending only on d, r, M, c_0, c_3 and c_4 such that

E‖f̄_h − f_ρ‖² ≤ C_2 N^{−2r/(2r+d)}.    (2.6)

If instead the event {there is a D_j such that h_{D_j} > h} holds, then for any h ≥ (1/2)(n + 2)^{−1/d}, the learning rate of f̄_h is not faster than

(n h^d)^{−1}.    (2.7)

The assertions in Theorem 2 can be divided into two parts. The first is the positive assertion: if some conditions on the weights and an extra constraint on the data blocks are imposed, then the AVM-LAR estimate (2.5) possesses the same learning rate as that in (2.4) with the same localization parameter h (ignoring constants). This means that the divide and conquer operation in Algorithm 1 does not affect the learning rate under this circumstance. In fact, we can relax the restriction (D) for the bound (2.6) to the following condition (D*).
(D*) For all D_1, . . . , D_m, there exists a positive number c_4 such that Σ_{X_i ∈ D_j: ‖X_i − x‖ > h} W_{h,X_i}(x) ≤ c_4/(N h^d) for all x ∈ X.

To guarantee the optimal minimax learning rate of AVM-LAR, condition (C) is the same as condition (A), noting that there are only n samples in each D_j. Moreover, condition (D*) is a bit stronger than condition (B), as there are only n samples in D_j while its localization bound is still c_4/(N h^d). However, we should point out that such a restriction is also mild, since in almost all widely used LARs the localization bound either is 0 (NWK with the naive kernel, and KNN) or decreases exponentially (NWK with the Gaussian kernel). All these methods satisfy conditions (C) and (D*).
The negative assertion, however, shows that if the event {there is a D_j such that h_{D_j} > h} holds, then for any h ≥ (1/2)(n + 2)^{−1/d}, the learning rate of AVM-LAR is not faster than (n h^d)^{−1}. It follows from Theorem 1 that the best localization parameter to guarantee the optimal learning rate satisfies h ∼ N^{−1/(2r+d)}. The condition h ≥ (1/2)(n + 2)^{−1/d} then implies that if the best parameter is selected, m should satisfy m ≤ O(N^{2r/(2r+d)}).
Under this condition, from (2.7), we have (n h^d)^{−1} = m (N h^d)^{−1} ∼ m N^{−2r/(2r+d)}. This means that if we select h ∼ N^{−1/(2r+d)} and m ≤ O(N^{2r/(2r+d)}), then the learning rate of AVM-LAR is essentially slower than that in (2.4). If we select a smaller h, the above inequality yields a similar conclusion. If we select a larger h, however, the approximation error (see the proof of Theorem 1) is O(h^{2r}), which can be larger than the learning rate in (2.4). In a word, if the event {h_{D_j} ≤ h for all D_j} does not hold, then the AVM operation essentially degrades the optimal learning rate of LAR.
At last, we should discuss the probability of the event {h_{D_j} ≤ h for all D_j}. We have P{h_{D_j} ≤ h for all D_j} ≥ 1 − m P{h_{D_1} > h}, and it can be found in (Györfi et al., 2002, P.93-94) that P{h_{D_1} > h} ≤ (n h^d)^{−1}, so that P{h_{D_j} ≤ h for all D_j} ≥ 1 − m (n h^d)^{−1}. The latter quantity is small when m is large, which means that the event {h_{D_j} ≤ h for all D_j} has a significant chance of failing. By using the method in (Györfi et al., 2002, Problem 2.4), we can show that this estimate of the confidence is essential, in the sense that for the uniform distribution the equality holds up to some constant c′.
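The fragility of the event can be checked empirically. The following Monte-Carlo sketch (our own illustration, uniform samples on [0, 1] with the mesh norm approximated on a grid) estimates P{h_{D_j} ≤ h for all D_j}:

```python
import numpy as np

rng = np.random.default_rng(2)

def mesh_norm(block):
    # h_D: the largest distance from a point of [0, 1] to its nearest
    # sample in the block, approximated on a fine grid.
    grid = np.linspace(0.0, 1.0, 501)
    return np.min(np.abs(grid[:, None] - block[None, :]), axis=1).max()

def prob_event(N, m, h, trials=100):
    """Monte-Carlo estimate of P{h_Dj <= h for all j} when N uniform
    samples on [0, 1] are split into m equal blocks of size n = N/m."""
    n = N // m
    hits = sum(
        all(mesh_norm(b) <= h for b in rng.uniform(size=(m, n)))
        for _ in range(trials)
    )
    return hits / trials
```

For N = 200 and m = 1 the event holds almost always, while for m = 50 (so n = 4) it essentially never does: four balls of radius 0.1 cannot even cover [0, 1].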

Modified AVM-LAR
As shown in Theorem 2, if h D j ≤ h does not hold for some D j , then AVM-LAR cannot reach the optimal learning rate. In this section, we propose two variants of AVM-LAR such that they can achieve the optimal learning rate under mild conditions.

AVM-LAR with data-dependent parameters
The event {h_{D_j} ≤ h for all D_j} essentially implies that, for arbitrary x, every data block contains at least one sample within distance h of x. This condition holds for KNN, as the parameter h in KNN changes with respect to the samples. However, for NWK and other local average methods (e.g., the partition estimate (Györfi et al., 2002)), such a condition usually fails. Motivated by KNN, it is natural to select a sample-dependent localization parameter h̃ to ensure the event {h_{D_j} ≤ h̃ for all D_j}. We therefore propose a variant of AVM-LAR with data-dependent parameters in Algorithm 2.
Algorithm 2 AVM-LAR with data-dependent parameters
Local processing: For j = 1, 2, . . . , m, implement LAR with the bandwidth parameter h̃ on the data block D_j to get the jth local estimate f_{j,h̃}(x) = Σ_{(X_i,Y_i) ∈ D_j} W_{h̃,X_i}(x) Y_i.
Synthesization: Average the m local estimates to obtain the global estimate f̄_h̃(x) = (1/m) Σ_{j=1}^m f_{j,h̃}(x). (3.1)

Compared with AVM-LAR in Algorithm 1, the only difference of Algorithm 2 is the division step, where we select the bandwidth parameter h̃ to be greater than all h_{D_j}, j = 1, . . . , m. The following Theorem 3 states the theoretical merit of AVM-LAR with data-dependent parameters.
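The data-dependent bandwidth selection can be sketched as follows (our own one-dimensional illustration; the mesh norm is approximated on a grid, and `data_dependent_bandwidth` reflects our reading of Algorithm 2's choice h̃ ≥ max_j h_{D_j}):

```python
import numpy as np

def mesh_norm_1d(block_X, grid_size=1001):
    """Mesh norm h_D of a one-dimensional block on [0, 1]: the largest
    distance from a point of the domain to its nearest sample."""
    grid = np.linspace(0.0, 1.0, grid_size)
    return float(np.min(np.abs(grid[:, None] - block_X[None, :]), axis=1).max())

def data_dependent_bandwidth(blocks_X, h):
    """Enlarge h to h~ so that h~ >= h_Dj for every block, making the
    event {h_Dj <= h~ for all j} hold by construction."""
    return max(h, max(mesh_norm_1d(bx) for bx in blocks_X))

blocks_X = [np.array([0.0, 0.5, 1.0]),
            np.array([0.0, 0.25, 0.5, 0.75, 1.0])]
h_tilde = data_dependent_bandwidth(blocks_X, h=0.1)
```

The sparser block {0, 0.5, 1} has mesh norm 0.25, so h̃ is enlarged from 0.1 to 0.25; the price of this enlargement is exactly the bias term E{h̃^{2r}} discussed after Theorem 3.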
Theorem 3. Let r < d/2 and f̄_h̃ be defined by (3.1). Assume that (C) and (D*) hold, and suppose m ≤ N^{2r/(2r+d)}. Then there exists a constant C_3 depending only on c_0, c_3, c_4, r, d and M such that

E‖f̄_h̃ − f_ρ‖² ≤ C_3 N^{−2r/(2r+d)}.    (3.2)

Theorem 3 shows that if the localization parameter is selected elaborately, then AVM-LAR can achieve the optimal learning rate under mild conditions concerning m.
It should be noted that there is an additional restriction on the smoothness degree, r < d/2. We highlight that this condition cannot be removed in general: without it, (3.2) may fail for some marginal distribution ρ_X. For example, for d = 1 it can be deduced from (Györfi et al., 2002, Problem 6.1) that there exists a ρ_X such that (3.2) does not hold. However, if we do not aim at a distribution-free result, we can dispense with this condition by using the technique in (Györfi et al., 2002, Problem 6.7). Actually, for d ≤ 2r, assume the marginal distribution ρ_X satisfies the following: there exist ε_0 > 0 and a nonnegative function g such that ρ_X(B_ε(x)) > g(x)ε^d for all x ∈ X and 0 < ε ≤ ε_0, and ∫_X g(x)^{−2/d} dρ_X < ∞. Then (3.2) holds for arbitrary r and d. It is obvious that the uniform distribution satisfies these conditions.
Instead of imposing a restriction on h_{D_j}, Theorem 3 states that, with the data-dependent parameter h̃, AVM-LAR does not degrade the learning rate for a large range of m. We now illustrate that the derived bound on m cannot be improved further.
Indeed, it can be found in our proof that the bias of AVM-LAR is of order E{h̃^{2r}}. Under the conditions of Theorem 3, if m ∼ N^{(2r+ε)/(2r+d)} for some ε > 0, then E{h̃^{2r}} is of order N^{(−2r+ε)/(2r+d)}, which implies a learning rate slower than N^{−2r/(2r+d)}.

Qualified AVM-LAR
Algorithm 2 provided an intuitive way to improve the performance of AVM-LAR.
However, Algorithm 2 increases the computational complexity of AVM-LAR, because we have to compute the mesh norms h_{D_j}, j = 1, . . . , m. A natural question is whether we can avoid this computation while maintaining the learning performance. Algorithm 3 provides a possible answer.
Algorithm 3 Qualified AVM-LAR
Input: N samples D = {(X_i, Y_i)}_{i=1}^N, the number of data blocks m, the bandwidth parameter h.
Output: The global estimate f̄_h.
Division: Randomly divide D into m data blocks, i.e. D = ∪_{j=1}^m D_j with D_j ∩ D_k = ∅ for k ≠ j and |D_1| = · · · = |D_m| = n.
Qualification: For a test input x, if there exists an X_{j_0} ∈ D_j such that ‖x − X_{j_0}‖ ≤ h, then we qualify D_j as an active data block for the local estimate. Rewrite all the active data blocks as T_1, . . . , T_{m_0}.
Local processing: For each active data block T_j, j = 1, . . . , m_0, define the local estimate f_{j,h}(x) = Σ_{(X_i,Y_i) ∈ T_j} W_{h,X_i}(x) Y_i.
Synthesization: Transmit the m_0 local estimates f_{j,h} to one machine, getting the global estimate f̄_h(x) = (1/m_0) Σ_{j=1}^{m_0} f_{j,h}(x).

Compared with Algorithms 1 and 2, the only difference of Algorithm 3 is the qualification step, which essentially requires no extra computation. In fact, the qualification and local processing steps can be implemented simultaneously. It should further be mentioned that the qualification step eliminates exactly those data blocks that have a chance to break the event {h_{D_j} ≤ h for all D_j}. Although this strategy may lose part of the information, we show that the qualified AVM-LAR can achieve the optimal learning rate without any restriction on m.
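The qualification step can be sketched as follows (our own one-dimensional illustration with naive-kernel weights; blocks with no sample within distance h of x are simply excluded from the average):

```python
import numpy as np

def qualified_avm_lar(blocks, x, h):
    """Algorithm 3 sketch: only blocks containing a sample within distance
    h of x are 'qualified' as active; inactive blocks are excluded instead
    of contributing a meaningless zero estimate to the average."""
    active = [(bx, by) for bx, by in blocks if np.min(np.abs(bx - x)) <= h]
    if not active:
        return 0.0   # no block carries information near x
    preds = []
    for bx, by in active:
        K = (np.abs(bx - x) <= h).astype(float)
        preds.append(float((K * by).sum() / K.sum()))  # K.sum() >= 1 here
    return float(np.mean(preds))

blocks = [
    (np.array([0.10, 0.20]), np.array([1.0, 1.0])),   # near x: active
    (np.array([0.90]),       np.array([5.0])),        # far from x: inactive
]
pred = qualified_avm_lar(blocks, x=0.15, h=0.1)
```

Here the second block is inactive and the prediction is 1.0; plain AVM (Algorithm 1) would average in the second block's zero estimate and return 0.5, illustrating the bias that qualification removes.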
In Theorem 3, we showed that AVM-LAR with a data-dependent parameter does not slow down the learning rate of LAR. However, the bound on m in Theorem 3 depends on the smoothness of the regression function, which is usually unknown in real-world applications. This makes m a latent parameter of AVM-LAR with data-dependent parameters, as we do not know which m definitely works. Theorem 4 states that we can avoid this problem by introducing the qualification step. The theoretical price of this improvement is only that condition (E) replaces condition (D*). As shown above, all the widely used LARs, such as the partition estimate, NWK with the naive kernel, NWK with the Gaussian kernel and KNN, satisfy condition (E) (with a logarithmic term for NWK with the Gaussian kernel).

Experiments
In this section, we report experimental studies on synthetic data sets to demonstrate the performance of AVM-LAR and its variants. We employ three criteria for comparison. The first is the global error (GE), the mean square error on the testing set when all N samples are used as the training set; we use GE as a baseline that does not change with m. The second is the local error (LE), the mean square error on the testing set when only one data block (n samples) is used as the training set. The third is the average error (AE), the mean square error of AVM-LAR (Algorithms 1, 2 and 3, denoted AE-A1, AE-A2 and AE-A3, respectively).

Simulation 1
We fix the total number of samples at N = 10,000 and vary the number of data blocks m (with data block size n = N/m) and the dimensionality d. The simulation results are averages over 20 trials.
We generate data from the regression models y = g_j(x) + ε, j = 1, 2, where ε is Gaussian noise N(0, 0.1). Wendland (2004) revealed that g_1 and g_2 are so-called Wendland functions, with g_1, g_2 ∈ F_{c_0,1} and g_1, g_2 ∉ F_{c_0,2} for some absolute constant c_0. For KNN, according to Theorem 2, the parameter k is set as k ∼ N^{2r/(2r+d)}. Theorems 2, 3 and 4 yield that all these estimates can reach the optimal learning rates.
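Data of this form can be generated as in the following sketch (our own illustration; the bump `g` is a hypothetical compactly supported Wendland-type stand-in, since the paper's exact g_1 and g_2 are not reproduced here, and the noise variance is taken as 0.1):

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    # Hypothetical stand-in target: the compactly supported Wendland-type
    # bump (1 - |x|)_+^2; the paper's exact g_1, g_2 are not reproduced.
    return np.maximum(1.0 - np.abs(x), 0.0) ** 2

N = 10_000
X = rng.uniform(-1.0, 1.0, size=N)
# y = g(x) + eps, with centered Gaussian noise of variance 0.1.
y = g(X) + rng.normal(0.0, np.sqrt(0.1), size=N)
```

The resulting pairs (X, y) can then be fed to Algorithms 1, 2 and 3 and evaluated under the GE/LE/AE criteria above.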
As m increases, the event {h_{D_j} > h for some j} inevitably happens, and Algorithm 1 then fails according to the negative part of Theorem 2; at the same time, AE-A1 begins to increase dramatically. As Algorithms 2 and 3 are designed to avoid this weakness of Algorithm 1, the curves of AE-A2 and AE-A3 stay below that of AE-A1 once m exceeds this threshold. Another interesting fact is that AE-A3 is smaller than AE-A2, although both achieve the same learning rate in theory.

Simulation 2
We make use of the same simulation study as conducted by Zhang et al. (2015) to compare the learning performance of AVM-NWK (Algorithms 1, 2 and 3) with that of divide and conquer kernel ridge regression (DKRR for short).

3D Road Network Data
Building accurate 3D road networks is one of the centrally important problems for Advanced Driver Assistance Systems (ADAS). Benefiting from an accurate 3D road network, eco-routing, as an application of ADAS, can yield fuel cost savings of 8-12% compared with standard routing (Tavares et al., 2009). For this reason, obtaining an accurate 3D road network plays an important role for ADAS (Kaul et al., 2013).
North Jutland (NJ), the northern part of Jutland, Denmark, covers a region of 185 km × 130 km. NJ contains a spatial road network with a total length of 1.17 × 10^7 m, whose 3D polyline representation contains 414,363 points. Elevation values, extracted from a publicly available massive laser scan point cloud for Denmark, are added to the data set. Thus, the data set includes four attributes: OSMID, the OpenStreetMap ID for each road segment or edge in the graph; longitude and latitude in Google format; and elevation in meters. In practice, the acquired data set may include missing values. In this subsection, we use AVM-LAR based on Algorithms 1, 2 and 3 to rebuild the missing elevation information at the points of the 3D road network from the aerial laser scan data.
To this end, we randomly select 1000 samples as a test set. We can find in Figure 5 that the AEs of Algorithm 1 are larger than GE, which reflects the weakness of AVM-NWK indicated by the negative part of Theorem 2. AE-A3 takes almost the same values as GE for all m, whereas AE-A2 possesses a similar property only when m ≤ 32. Hence, for the 3D road network data set, Algorithm 3 is applicable for recovering the missing elevation information.

Proofs
Proof of Theorem 1

Let E*{·} = E{· | X_1, X_2, . . . , X_N} denote the conditional expectation given the inputs. Decomposing the excess risk by conditioning on the inputs, we obtain two terms, referred to as the sample error and the approximation error, respectively. To bound the sample error, note that the conditional variance of f_{D,h}(x) is controlled by the sum of squared weights, so condition (A) bounds the sample error by a term of order (N h^d)^{−1}. Now we turn to the approximation error. Let B_h(x) be the ℓ_2 ball with center x and radius h. It follows from (Györfi et al., 2002, P.66), f_ρ ∈ F_{c_0,r}, condition (B) and Jensen's inequality that the approximation error is bounded by a term of order h^{2r}. Balancing the two bounds with h = (4(c_1 + c_2² + 4)M²/c_0²)^{1/(2r+d)} N^{−1/(2r+d)} yields the upper bound (4(c_1 + c_2² + 4)M²)^{2r/(2r+d)} N^{−2r/(2r+d)}.

Proof of Theorem 2
Since h ≥ h_{D_j} for all 1 ≤ j ≤ m, we obtain that B_h(x) ∩ D_j ≠ ∅ for all x ∈ X and 1 ≤ j ≤ m. Then, using the same method as in the proof of Theorem 1 together with (C), and applying Jensen's inequality, the sample error of f̄_h is bounded as before. Noting that B_h(X) ∩ D_j ≠ ∅ almost surely, the same method as in the proof of Theorem 1 together with (D) bounds the approximation error. This finishes the proof of (2.6) by taking h = (16(c_3 + c_4²)M²/(3c_0²))^{1/(2r+d)} (nm)^{−1/(2r+d)} into account.

Now we turn to prove (2.7). Without loss of generality, assume h < h_{D_1}. It then follows from the definition of the mesh norm that there exists x̄ ∈ X which is not in B_h(X_i) for any X_i ∈ D_1. Define the separation radius of a set of points S = {ζ_i}_{i=1}^n ⊂ X via q_S := (1/2) min_{i ≠ j} ‖ζ_i − ζ_j‖. The mesh ratio τ_S := h_S/q_S ≥ 1 provides a measure of how uniformly the points of S are distributed on X; if τ_S ≤ 2, we call S a quasi-uniform point set. Let Ξ_l = {ξ_1, . . . , ξ_l} be a set of l = ⌈(2h)^{−d}⌉ quasi-uniform points (Wendland, 2004) in X. Since h ≥ (1/2)(n + 2)^{−1/d}, we can let ρ_X be a marginal distribution satisfying ρ_X(B_{q_{Ξ_l}}(ξ_k)) = 1/n, k = 1, 2, . . . , l − 1. A direct computation then yields a lower bound of order (n h^d)^{−1}. This finishes the proof of Theorem 2.

Proofs of Theorems 3 and 4
To bound E{h_{D_1}^{2rd/(2r+d)}}, we note that for arbitrary ε > 0, the tail probability of h_{D_1} controls this expectation. Let t_1, . . . , t_l be quasi-uniform points of X. It then follows from (Györfi et al., 2002, P.93) that P{h_{D_1} > ε} ≤ (n ε^d)^{−1}, and integrating this tail bound gives the desired estimate. To bound E{h_{D_1}^{2r}}, we use the above method again together with r < d/2 to derive E{h_{D_1}^{2r}} ≤ C N^{−2r/(2r+d)}. Now we turn to prove that (B) holds for the qualified estimate. This can be deduced directly by the same method as the last inequality together with condition (E). Then Theorem 4 follows from Theorem 1.

Conclusion
In this paper, we combined the divide and conquer strategy with local average regression to provide a new method, average mixture local average regression (AVM-LAR), to attack massive data regression problems. We found that the estimate obtained by AVM-LAR can achieve the minimax learning rate under a strict restriction concerning m. We then proposed two variants of AVM-LAR to either lessen the restriction or remove it. Theoretical analysis and simulation studies confirmed our assertions.
We discuss here three interesting topics for future study. Firstly, LAR cannot handle high-dimensional data due to the curse of dimensionality (Györfi et al., 2002; Fan, 2000); how to design variants of AVM-LAR to overcome this hurdle is a desirable research topic. Secondly, we have justified that applying the divide and conquer strategy to LARs does not degrade the order of the learning rate under mild conditions. However, we did not show that there is no loss in the constant factor; discussing the constant factor of the optimal learning rate is an interesting project. Finally, equipping other nonparametric methods (e.g., Fan and Gijbels (1994); Györfi et al. (2002); Tsybakov (2008)) with the divide and conquer strategy can be considered for massive data analysis. For example, Cheng and Shang (2015) have discussed how to appropriately apply the divide and conquer strategy to the smoothing spline method.