Additive Partially Linear Models for Massive Heterogeneous Data

We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring the heterogeneity of each sub-population. We propose an aggregation-type estimator of the commonality parameters that possesses the asymptotically optimal bounds and the asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator of the heterogeneity parameters is further constructed and shown to possess the asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data.


Introduction
Recent revolutions in technology have produced many kinds of massive data, where the number of variables $p$ is fixed but the sample size $N$ is very large. In this paper we consider massive heterogeneous data. The analysis of non-massive heterogeneous data has been well studied in the literature; for example, such data can be handled by fitting mixture models [1] or by modeling variance functions [3]. However, as far as we are aware, [21] is the only work that considers the analysis of massive heterogeneous data. They proposed a partially linear framework for modelling massive heterogeneous data, attempting to extract the common feature across all sub-populations while exploring the heterogeneity of each sub-population. But the partially linear framework can deal with only one common feature. In this paper, we propose an additive partially linear framework for modelling massive heterogeneous data, which can be applied to extract several common features across all sub-populations while exploring the heterogeneity of each sub-population. Additive partially linear models (APLMs) are a generalization of multiple linear regression models and, at the same time, a special case of generalized additive nonparametric regression models [7]. As discussed in [11], APLMs allow an easier interpretation of the effect of each variable and are preferable to completely nonparametric additive models, since they combine both parametric and nonparametric components when it is believed that the response variable depends on some variables in a linear way but is non-linearly related to the remaining independent variables. Estimation and inference for APLMs have been well studied in the literature [e.g., 2, 13]. Recently, [6] proposed an approach for the analysis of heterogeneous data, fitting both the mean function and the variance function using different additive partially linear models.
In this paper, we generalize the partially linear model (PLM) considered in [21] and propose an additive partially linear model (APLM) for modeling massive heterogeneous data. Let $\{(Y_i, X_i, Z_i)\}_{i=1}^N$ be the observations from a sample of $N$ subjects. We assume that there exist $s$ independent sub-populations, and that the data from the $j$th sub-population follow the additive partially linear model
$$Y^{(j)} = X^\top \beta_0^{(j)} + \sum_{k=1}^K g_{0k}(Z_k) + \varepsilon, \quad j = 1, \ldots, s, \tag{1.1}$$
where $X = (X_1, \ldots, X_d)^\top$, $Z = (Z_1, \ldots, Z_K)$, $\beta_0^{(j)} = (\beta_{01}^{(j)}, \ldots, \beta_{0d}^{(j)})^\top$ is the vector of unknown parameters for the $j$th sub-population, $g_{01}, \ldots, g_{0K}$ are unknown smooth functions, and $\varepsilon$ has zero mean and variance $\sigma^2$. Under model (1.1), $Y^{(j)}$ depends on $X$ linearly but with coefficients varying across sub-populations, whereas $Y^{(j)}$ depends on $Z$ through additive non-linear functions that are common to all sub-populations. This model implies that the heterogeneity of the data comes from the differences among $\beta_0^{(j)}$, $j = 1, \ldots, s$. Compared with [21], the novelty of this paper is two-fold. First, we consider a more general and practical model: the partially linear model considered in [21] is a special case of (1.1) with $K = 1$. Second, we use a different nonparametric tool (regression splines) to fit the non-parametric functions than the reproducing kernel Hilbert space (RKHS) tool used in [21], making the theoretical development easier and the computational implementation faster. The first novelty is more significant than the second, because it allows us to consider more than one non-parametric component ($K > 1$). However, the theoretical novelty of our approach is limited, because it would be straightforward to generalize the PLM in [21] to the APLM using the same RKHS tool. Instead, we propose to use regression splines, because the resulting computational implementation is fast, especially for massive data.
To understand this, let us count the number of knots when the RKHS tool and regression splines are used, respectively. If the RKHS tool is used, the number of knots is $N$ for each non-parametric component, and therefore the total number of knots equals $KN$. If regression splines are used, the number of knots, denoted $J_N$, satisfies $J_N \ll N$, and therefore the total number of knots equals $KJ_N$. The computational gain is small if the PLM is fitted, but the gain is significant if the APLM is fitted for massive data where $N$ is extremely large.
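To make the contrast concrete, here is a back-of-the-envelope sketch of the basis-size comparison. The smoothness $p$, the sample size, and the choice $J_N \approx N^{1/(2p+1)}$ are illustrative assumptions for the sketch, not values fixed by the paper.

```python
# Back-of-the-envelope basis-size comparison: RKHS fit vs. regression splines
# for an APLM with K nonparametric components. All values are illustrative;
# J_N ~ N^{1/(2p+1)} is one common spline choice.
N = 1_000_000                        # total sample size
K = 3                                # number of additive components
p = 2                                # assumed smoothness
J_N = round(N ** (1 / (2 * p + 1)))  # ~ N^{1/5} interior knots, here 16

rkhs_params = K * N                  # one knot per observation per component
spline_params = K * J_N              # only K * J_N spline knots in total
print(rkhs_params, spline_params)
```

The spline basis is smaller by orders of magnitude, which is exactly where the computational gain for massive $N$ comes from.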
The rest of the paper is organized as follows. We propose the main methods in Section 2 and hypothesis testing procedures in Section 3, deriving their asymptotic properties. We evaluate the performance of the proposed methods via simulation studies in Section 4 and a real data application in Section 5. We conclude the paper with a brief summary in Section 6 and relegate all technical proofs to the Appendix.

Notation and assumptions
Recall that $\beta_0^{(j)}$ is the true sub-population specific parameter vector for the $j$th sub-population, $j = 1, \ldots, s$, and $g_0(z) = g_{01}(z_1) + \cdots + g_{0K}(z_K)$ is the true additive common non-parametric function. Without loss of generality, assume that the $g_{0k} = g_{0k}(\cdot)$, $k = 1, \ldots, K$, have a common support $[0, 1]$. We propose to use polynomial splines [2] to approximate each smooth function $g_{0k}$, $k = 1, \ldots, K$. Let $S_N$ be the space of polynomial splines on $[0, 1]$ of degree $\ell \ge 1$, with a sequence of $J_N$ interior knots,
$$t_{-\ell} = \cdots = t_{-1} = t_0 = 0 < t_1 < \cdots < t_{J_N} < 1 = t_{J_N+1} = \cdots = t_{J_N+\ell+1},$$
where $J_N$ increases with the overall sample size $N$. Although we could choose different sequences of interior knots for different non-parametric functions in different sub-populations, for simplicity, as in [11], we consider the same sequence of equally spaced knots and let $h_N = 1/(J_N + 1)$ be the distance between neighboring knots.
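The knot layout above (boundary knots repeated $\ell + 1$ times, $J_N$ equally spaced interior knots) can be sketched with a small Cox-de Boor implementation. This is a minimal illustrative helper, not the paper's code; the function name `bspline_design` is hypothetical.

```python
import numpy as np

def bspline_design(x, n_interior, degree=3):
    """B-spline design matrix on [0, 1] with equally spaced interior knots.

    The knot vector repeats each boundary knot degree + 1 times, matching
    t_{-l} = ... = t_0 = 0 < t_1 < ... < t_{J_N} < 1 = t_{J_N+1} = ... .
    Returns a (len(x), J_N + degree + 1) matrix via the Cox-de Boor recursion.
    """
    x = np.asarray(x, dtype=float)
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]   # t_1, ..., t_{J_N}
    t = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    # degree-0 basis: indicators of the half-open knot intervals
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)
    B[x == 1.0, n_interior + degree] = 1.0   # close the last interval at x = 1
    for d in range(1, degree + 1):
        Bn = np.zeros((len(x), len(t) - 1 - d))
        for m in range(len(t) - 1 - d):
            if t[m + d] > t[m]:              # skip zero-length spans
                Bn[:, m] += (x - t[m]) / (t[m + d] - t[m]) * B[:, m]
            if t[m + d + 1] > t[m + 1]:
                Bn[:, m] += (t[m + d + 1] - x) / (t[m + d + 1] - t[m + 1]) * B[:, m + 1]
        B = Bn
    return B
```

A quick sanity check is that the columns are non-negative and sum to one at every evaluation point (partition of unity), and that the basis dimension is $J_N + \ell + 1$ per component, as used below.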
Assume that the $X_i$ are i.i.d. copies of $X$ and the $Z_i$ are i.i.d. copies of $Z$. Define $T = (X, Z)$. Let $m_0^{(j)}(T) = X^\top \beta_0^{(j)} + g_0(Z)$, $\Gamma(z) = E(X \mid Z = z)$, and $\widetilde{X} = X - \Gamma(Z)$, and let $C^{\otimes 2}$ denote $CC^\top$ for any matrix or vector $C$. Let $r$ be a positive integer and $\nu \in (0, 1]$ such that $p = r + \nu > 2$. Let $\mathcal{H}$ be the collection of functions $h$ on $[0, 1]$ whose $r$th derivative exists and satisfies the Lipschitz condition of order $\nu$,
$$|h^{(r)}(s) - h^{(r)}(t)| \le C|s - t|^{\nu}, \quad s, t \in [0, 1],$$
where, here and hereafter, $C$ is a generic positive constant. In order to derive asymptotic results, we make the following mild assumptions.
(A1) Each component function $g_{0k} \in \mathcal{H}$, $k = 1, \ldots, K$;
(A2) The distribution of $Z$ is absolutely continuous and its density $f$ is bounded away from zero and infinity on $[0, 1]^K$;
(A3) There exists $c > 0$ such that $c\|\omega\|^2 \le \omega^\top E(X^{\otimes 2} \mid Z = z)\,\omega \le C\|\omega\|^2$ for any vector $\omega \in \mathbb{R}^d$;
(A4) The number of interior knots $J_N$ satisfies $N^{1/(4p)} \ll J_N \ll N^{1/4}$;
(A5) The projection function $\Gamma(z)$ has the additive form $\Gamma(z) = \Gamma_1(z_1) + \cdots + \Gamma_K(z_K)$, where $\Gamma_k \in \mathcal{H}$, $E[\Gamma_k(z_k)] = 0$ and $E[\Gamma_k(z_k)]^2 < \infty$, $k = 1, \ldots, K$.
In addition, to quantify the asymptotic consistency of the non-parametric estimators, we consider both empirical norms and the corresponding population norms. Let $\|z\|$, $\|z\|_\infty$ and $\|z\|_1$ be the Euclidean, supremum and absolute-value norms of a vector $z$, respectively. For a matrix $C$, its $L_2$-norm is defined as $\|C\|_2 = \sup_{\|u\| \ne 0} \|Cu\|/\|u\|$. Let $\|\varphi\|_\infty = \sup_{x \in [0,1]} |\varphi(x)|$ be the supremum norm of a function $\varphi$ on $[0, 1]$. Following [14] and [9], for any measurable functions $\phi_1$ and $\phi_2$ on $[0, 1]^K$, the empirical inner products for the $j$th sub-sample (of size $n$, with index set $G_j$ defined in the next section) and for the whole sample are defined as
$$\langle \phi_1, \phi_2 \rangle_{n,j} = n^{-1} \sum_{i \in G_j} \phi_1(Z_i)\phi_2(Z_i), \qquad \langle \phi_1, \phi_2 \rangle_{N} = N^{-1} \sum_{i=1}^{N} \phi_1(Z_i)\phi_2(Z_i),$$
with the induced empirical norms $\|\phi\|_{n,j}^2 = \langle \phi, \phi \rangle_{n,j}$ and $\|\phi\|_{N}^2 = \langle \phi, \phi \rangle_{N}$. If $\phi_1$ and $\phi_2$ are $L_2$-integrable, the population inner product and norm are defined as
$$\langle \phi_1, \phi_2 \rangle = \int_{[0,1]^K} \phi_1(z)\phi_2(z) f(z)\, dz, \qquad \|\phi\|^2 = \langle \phi, \phi \rangle,$$
where $f$ is the density of $Z$. Similarly, for the $k$th component of $Z$, $Z_k$ with density $f_k$, the empirical norm on the $j$th sub-sample, the empirical norm on the whole sample, and the population norm of any $L_2$-integrable univariate function $\varphi$ on $[0, 1]$ are defined as
$$\|\varphi\|_{n,j,k}^2 = n^{-1} \sum_{i \in G_j} \varphi^2(Z_{ik}), \qquad \|\varphi\|_{N,k}^2 = N^{-1} \sum_{i=1}^{N} \varphi^2(Z_{ik}), \qquad \|\varphi\|_{k}^2 = \int_0^1 \varphi^2(z_k) f_k(z_k)\, dz_k.$$

Estimations for each sub-population
First we consider the estimation of $\beta_0^{(j)}$ and $g_0 = g_0(\cdot)$ based on the data from the $j$th sub-population only, $j = 1, \ldots, s$. To this aim, let $G_j$ denote the index set of all the observations from sub-population $j$, and let $\mathcal{G}_n^{(j)} = \{g^{(j)}(\cdot)\}$ be the collection of additive functions of the form $g^{(j)}(z) = g_1^{(j)}(z_1) + \cdots + g_K^{(j)}(z_K)$, where each component function $g_k^{(j)} \in S_N$ and $\sum_{i \in G_j} g_k^{(j)}(Z_{ik}) = 0$. Thus $\sum_{i \in G_j} g^{(j)}(Z_i) = 0$ for any $g^{(j)} \in \mathcal{G}_n^{(j)}$. For the $j$th sub-population, we consider the least-squares estimators
$$\big(\widehat{\beta}^{(j)}, \widehat{g}^{(j)}\big) = \arg\min_{\beta \in \mathbb{R}^d,\; g \in \mathcal{G}_n^{(j)}} \sum_{i \in G_j} \big\{Y_i - X_i^\top \beta - g(Z_i)\big\}^2. \tag{2.1}$$
For the $k$th covariate $Z_k$, let $b_{m,k}(z_k)$ be the B-spline basis functions of degree $\ell$ equipped with the $J_N$ knots defined above. For any $g \in \mathcal{G}_n^{(j)}$, we can write $g(z) = b(z)^\top \gamma$, where $b(z) = \{b_{m,k}(z_k), m = -\ell, \ldots, J_N, k = 1, \ldots, K\}^\top$ is a $K(J_N + \ell + 1)$-dimensional vector given $z$, along with the $K(J_N + \ell + 1)$-dimensional coefficient vector $\gamma = \{\gamma_{m,k}, m = -\ell, \ldots, J_N, k = 1, \ldots, K\}^\top$. Therefore, (2.1) is equivalent to a parametric least-squares problem in $(\beta, \gamma)$ if we consider the empirically centered estimator
$$\widehat{g}^{(j)}(z) = b(z)^\top \widehat{\gamma}^{(j)} - n^{-1} \sum_{i \in G_j} b(Z_i)^\top \widehat{\gamma}^{(j)}.$$
We derive some asymptotic results associated with the sub-population specific estimators, summarized in the following theorem.
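The sub-population fit in (2.1) is ordinary least squares on the stacked design $[X, b(Z)]$. The sketch below illustrates this on one simulated sub-sample, using a degree-1 truncated-power basis as a simple stand-in for the paper's B-spline basis; the data-generating choices ($n$, the slope, the sine function, the noise level) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def spline_basis(z, knots):
    # degree-1 truncated-power basis (a simple stand-in for B-splines)
    return np.column_stack([z] + [np.maximum(z - t, 0.0) for t in knots])

# hypothetical j-th sub-sample: Y = X * beta_j + g0(Z) + eps
n, beta_j = 500, 2.0
X = rng.normal(size=n)
Z = rng.uniform(0, 1, size=n)
g = np.sin(2 * np.pi * Z)                    # illustrative smooth g0
Y = X * beta_j + g + rng.normal(scale=0.1, size=n)

knots = np.linspace(0, 1, 7)[1:-1]           # 5 interior knots
B = spline_basis(Z, knots)
B = B - B.mean(axis=0)                       # empirical centering: sum_i g(Z_i) = 0
design = np.column_stack([np.ones(n), X, B])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
beta_hat = coef[1]                           # beta-hat^{(j)}
g_hat = B @ coef[2:]                         # centered g-hat^{(j)} at sample points
print(beta_hat)
```

The intercept absorbs the centering constant, so `beta_hat` and the centered curve `g_hat` correspond to the pair in (2.1).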
Theorem 2.1. Under the assumptions above, if the number of knots satisfies $J_N \ll n^{1/2}$, we have, for each sub-population, $j = 1, \ldots, s$,
$$\big\|\widehat{g}^{(j)} - g_0\big\|^2 = O_P\Big(\frac{J_N}{n} + J_N^{-2p}\Big).$$
If the number of knots further satisfies $J_N \asymp n^{1/(2p)}$, we have
$$\sqrt{n}\,\big(\widehat{\beta}^{(j)} - \beta_0^{(j)}\big) \xrightarrow{d} N\big(0, \sigma^2 D^{-1}\big), \quad \text{where } D = E\big(\widetilde{X}^{\otimes 2}\big).$$
Remark 1: Assume that we consider $s = O(N^{1-\gamma})$ sub-samples, each sub-sample of $n = O(N^\gamma)$ observations, where $\gamma$ is some positive number between 0 and 1. In order to minimize the mean-square error of estimating $g_0$ by $\widehat{g}^{(j)}$, we select $J_N \asymp n^{1/(2p+1)}$. Under this selection, the mean-square error achieves the optimal rate $O_P(N^{-2p\gamma/(2p+1)})$, or equivalently, $O_P(n^{-2p/(2p+1)})$. Remark 2: On the other hand, in order to ensure that $\widehat{\beta}^{(j)}$ is $\sqrt{n}$-consistent for estimating $\beta_0^{(j)}$, we should adopt the under-smoothing tuning $J_N \asymp n^{1/(2p)}$ and carefully balance the number of sub-samples against the size of each sub-sample. For example, this can be achieved if we select $J_N \asymp N^q$ with $1/(4p) < q < 1/4$ and consider $s = O(N^{1-\gamma})$ sub-samples, each of size $n = O(N^\gamma)$, with $2q < \gamma < 2pq$. The order of $J_N$ is consistent with existing results in the literature. The recommended balance between $s$ and $n$ provides guidance for the appropriate application of the divide-and-conquer strategy.
imsart-ejs ver. 2014/10/16 file: APLM_EJS_revision.tex date: January 1, 2019
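The balance recommended in Remark 2 can be checked numerically. The particular values of $p$, $q$, $\gamma$ and $N$ below are illustrative choices satisfying the stated inequalities, not values prescribed by the paper.

```python
# Check the Remark 2 balance: J_N = O(N^q), s = O(N^{1-gamma}), n = O(N^gamma),
# with 1/(4p) < q < 1/4 and 2q < gamma < 2pq. All numbers are illustrative.
p = 2          # assumed smoothness
q = 0.20       # satisfies 1/(4p) = 0.125 < 0.20 < 0.25
gamma = 0.60   # satisfies 2q = 0.40 < 0.60 < 2pq = 0.80

N = 10 ** 6
J_N = round(N ** q)            # number of interior knots
n = round(N ** gamma)          # sub-sample size
s = round(N ** (1 - gamma))    # number of sub-samples

# gamma > 2q makes J_N << n^{1/2}; gamma < 2pq makes J_N >= n^{1/(2p)}
print(J_N, n ** (1 / (2 * p)), n ** 0.5)
```

With these choices the under-smoothing window $n^{1/(2p)} \lesssim J_N \ll n^{1/2}$ is non-empty, which is exactly what the two inequalities on $\gamma$ guarantee.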

Aggregation of commonality
We consider the aggregated estimator, $\bar{g}(z) = s^{-1} \sum_{j=1}^{s} \widehat{g}^{(j)}(z)$, as the final estimator of $g_0(z)$ based on the whole sample. Let $\mathcal{G}_N$ be the collection of functions with the additive form $g(z) = g_1(z_1) + \cdots + g_K(z_K)$, where $g_k \in S_N$ and $\sum_{j=1}^{s} \sum_{i \in G_j} g_k(Z_{ik}) = 0$. Thus, for any $g \in \mathcal{G}_N$, $\sum_{j=1}^{s} \sum_{i \in G_j} g(Z_i) = 0$. In order to ensure that $\bar{g} \in \mathcal{G}_N$, as in (2.3), we center each individual estimator $\widehat{g}_k^{(j)}(z_k)$ via
$$\widehat{g}_k^{(j)}(z_k) - N^{-1} \sum_{j'=1}^{s} \sum_{i \in G_{j'}} \widehat{g}_k^{(j)}(Z_{ik}).$$
Abusing notation slightly, we still denote the centered estimator by $\widehat{g}_k^{(j)}(z_k)$ and write $\widehat{g}^{(j)}(z) = \sum_{k=1}^{K} \widehat{g}_k^{(j)}(z_k)$. We derive the mean-square error of $\bar{g}$ in the following theorem.
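The variance-reduction effect of averaging the sub-sample curves can be seen in a small simulation. This sketch reuses a degree-1 truncated-power basis as a stand-in for B-splines; the slopes $\beta^{(j)} = j$ follow the heterogeneous design used later in the paper's simulations, while $n$, $s$, the sine function and the noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def spline_basis(z, knots):
    # degree-1 truncated-power basis (a simple stand-in for B-splines)
    return np.column_stack([z] + [np.maximum(z - t, 0.0) for t in knots])

def fit_g_coef(X, Z, Y, knots):
    """Least-squares APLM fit on one sub-sample; returns the g-coefficients."""
    B = spline_basis(Z, knots)
    D = np.column_stack([np.ones(len(Y)), X, B - B.mean(axis=0)])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return coef[2:]

s, n = 20, 300
knots = np.linspace(0, 1, 7)[1:-1]              # 5 interior knots
g0 = lambda z: np.sin(2 * np.pi * z)            # common nonparametric part
coef_list = []
for j in range(1, s + 1):                       # heterogeneous slopes beta_j = j
    X = rng.normal(size=n)
    Z = rng.uniform(0, 1, size=n)
    Y = j * X + g0(Z) + rng.normal(scale=0.5, size=n)
    coef_list.append(fit_g_coef(X, Z, Y, knots))

zg = np.linspace(0, 1, 200)                     # evaluation grid
Bg = spline_basis(zg, knots)
truth = g0(zg) - g0(zg).mean()

def rmse(coef):
    fit = Bg @ coef
    fit = fit - fit.mean()                      # compare centered curves
    return np.sqrt(np.mean((fit - truth) ** 2))

ind_rmses = [rmse(c) for c in coef_list]        # per sub-sample errors
agg_rmse = rmse(np.mean(coef_list, axis=0))     # aggregated estimator g-bar
print(agg_rmse, np.mean(ind_rmses))
```

Because the RMSE is a convex function of the coefficients, the aggregated curve can never be worse than the average sub-sample curve, and in practice its stochastic error shrinks roughly like $1/\sqrt{s}$.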
Remark 3: In order to minimize the mean-square error of estimating $g_0$ using the aggregated estimator, we select $J_N \asymp N^{1/(2p+1)}$; the mean-square error then achieves the optimal rate $O_P(N^{-2p/(2p+1)})$. Remark 4: We compare the mean-square error of $\bar{g}$ with that of the following "oracle estimator", which assumes that $\beta_0^{(j)}$, $j = 1, \ldots, s$, are known. Following the proof of Theorem 2.1, we can show that $\|\widehat{g}_{\mathrm{oracle}} - g_0\|^2 = O_P(J_N/N + J_N^{-2p})$. Therefore, as long as $n \gg J_N^2$, the mean-square errors of the aggregated estimator $\bar{g}$ and the oracle estimator $\widehat{g}_{\mathrm{oracle}}$ are of the same order.
We conclude this subsection with some results for massive homogeneous data, where $\beta_0^{(j)} = \beta_0$, $j = 1, \ldots, s$. These results are of independent interest when the divide-and-conquer strategy is applied to massive homogeneous data, where $\beta_0$ and $g_0$ are estimated by the aggregated estimators $\bar{\beta} = s^{-1} \sum_{j=1}^{s} \widehat{\beta}^{(j)}$ and $\bar{g}$, respectively. The result for $\bar{g}$ is the same as that in Theorem 2.2, and the result for $\bar{\beta}$ is stated in the following corollary.

Efficiency boosting for heterogeneous parameters
The asymptotic variance matrix of $\widehat{\beta}^{(j)}$ derived in Theorem 2.1 shows that there is room to improve the estimation efficiency, because $D^{-1} = E^{-1}(\widetilde{X}^{\otimes 2})$ is larger than the Cramér-Rao lower bound $E^{-1}(X^{\otimes 2})$. Therefore, we re-substitute the aggregated estimator $\bar{g}$ into (2.1) to improve the efficiency of estimating $\beta_0^{(j)}$. This leads to the more efficient estimator
$$\check{\beta}^{(j)} = \arg\min_{\beta \in \mathbb{R}^d} \sum_{i \in G_j} \big\{Y_i - X_i^\top \beta - \bar{g}(Z_i)\big\}^2, \quad j = 1, \ldots, s.$$
We derive the asymptotic normality of $\check{\beta}^{(j)}$ in the following theorem.
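The gap between $D = E(\widetilde{X}^{\otimes 2})$ and $A = E(X^{\otimes 2})$ can be checked numerically using the paper's later simulation design $X = (W + Z)/2$ with $W, Z \sim U(-1,1)$, where $D = 1/12 < A = 1/6$. In this sketch the true $g_0$ stands in for the aggregated estimate $\bar{g}$ (an oracle simplification); the function $g_0$, $\beta_0$ and the noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# One sub-sample; the true g0 stands in for the aggregated estimate g-bar,
# which is what the plug-in step assumes is (approximately) available.
n = 2000
W = rng.uniform(-1, 1, size=n)
Z = rng.uniform(-1, 1, size=n)
X = 0.5 * (W + Z)                    # X and Z dependent, as in the paper's design
g0 = np.sin(np.pi * Z)               # illustrative common function
beta0 = 1.5
Y = X * beta0 + g0 + rng.normal(scale=0.3, size=n)

# efficiency-boosted estimator: plain least squares of Y - g-bar(Z) on X
beta_check = np.sum(X * (Y - g0)) / np.sum(X * X)

A = np.mean(X ** 2)                  # estimates E(X^2)       = 1/6
D = np.mean((0.5 * W) ** 2)          # estimates E(X_tilde^2) = 1/12, X_tilde = W/2
print(beta_check, A, D)
```

Since $A > D$, the boosted estimator's asymptotic variance $\sigma^2 A^{-1}$ is smaller than the profile estimator's $\sigma^2 D^{-1}$ exactly when $X$ and $Z$ are dependent.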
Theorem 2.3. Under Assumptions (A1)-(A5), if $J_N$ satisfies the condition $J_N \ll n^{1/2}$ given in the first part of Theorem 2.1 and the condition $J_N \asymp N^{1/(2p)}$ given in Corollary 2.1, and if it further satisfies $J_N \ll s^{1/2}$, then we have
$$\sqrt{n}\,\big(\check{\beta}^{(j)} - \beta_0^{(j)}\big) \xrightarrow{d} N\big(0, \sigma^2 A^{-1}\big), \quad \text{where } A = E\big(X^{\otimes 2}\big).$$
Remark 5: As in Remarks 1-2, assume that we consider $s = O(N^{1-\gamma})$ sub-samples, each of size $n = O(N^\gamma)$, where $\gamma$ is some positive number between 0 and 1. In order to satisfy all the conditions in Theorem 2.3, we can consider $N^{2q} \ll n \ll N^{1-2q}$, with $1/(2p) < q < 1/4$, and select $J_N \asymp N^q$. If $X$ and $Z$ are not independent, then $A^{-1} < D^{-1}$, implying that we can achieve such an efficiency boost by balancing $n$ and $s$.

Practical issues
In this subsection, we consider several practical issues, including the selection of the number of knots $J_N$, the determination of linear and non-linear components, and how to conduct statistical inference. We first consider the selection of $J_N$. All the theoretical results need Assumption (A4), $N^{1/(4p)} \ll J_N \ll N^{1/4}$, together with one of the following conditions: (a) $J_N \ll n^{1/2}$; (b) $J_N \asymp n^{1/(2p)}$; (c) $J_N \asymp N^{1/(2p)}$ and $n \gg N^{1/2}$; (d) $J_N \asymp N^{1/(2p)}$ and $J_N \ll s^{1/2}$.
In Theorem 2.1, under Condition (a), we derive the bound for the mean-square error of each sub-population specific estimator $\widehat{g}^{(j)}$, $j = 1, \ldots, s$. In Theorem 2.1, under Conditions (a) and (b), we derive the asymptotic normality of each sub-population specific estimator $\widehat{\beta}^{(j)}$, $j = 1, \ldots, s$. In Theorem 2.2, under Condition (a), we derive the bound for the mean-square error of the aggregated estimator $\bar{g}$. In Corollary 2.1, under Condition (c) and for massive homogeneous data, we derive the asymptotic normality of the aggregated estimator $\bar{\beta}$. In Theorem 2.3, under Condition (d), we derive the asymptotic normality of each sub-population specific efficiency-boosted estimator $\check{\beta}^{(j)}$. These conditions can be satisfied by carefully balancing $n$ and $s$, with guidance provided in Remarks 1-5.
However, it is hard to use these theoretical requirements on the order of $J_N$ to guide its selection in practice. If computational power allows, we can utilize cross-validation to select $J_N$ adaptively. If cross-validation is used, as discussed in [20], we should consider distributed cross-validation aiming for global optimality (selecting a single $J_N$ based on the entire dataset) instead of subsample cross-validation aiming for local optimality (selecting a $J_N$ for each subsample separately). As we consider massive data here, we do not recommend any data-driven tuning procedure, which incurs heavy computational cost. Instead, as in most studies using regression splines, we recommend a fixed choice of the number of internal knots $J_N$ (e.g., some small number between 2 and 10). Although a pre-specified fixed $J_N$ might be sub-optimal, it is much more convenient and computationally efficient than any data-driven procedure. This is another advantage of regression splines over smoothing splines: the number of knots in regression splines can be determined directly as part of the model construction without looking at the data, while in smoothing splines the tuning parameter $\lambda$, which controls the balance between goodness-of-fit and function smoothness, has to be determined adaptively using a data-driven procedure as in [20].
Another practical issue is the determination of linear and non-linear components. In practice, we rely on graphical tools such as box-plots and scatterplots to get a rough idea of which components might be linear or non-linear. For the setting in this paper, we consider a parametric function for each heterogeneous component (the sample size is smaller for each subsample, so a simpler model is used) and a non-parametric function for each homogeneous component (the sample size is larger for the entire sample, so a more flexible model can be afforded). Motivated by [12] and [21], formal hypothesis tests for heterogeneity and homogeneity are proposed in the next section.
The third practical issue is conducting statistical inference about the regression parameters. In the theorems above, we derive explicit formulas for the corresponding covariance matrices by examining the asymptotic normality. If the covariance matrices involved the density of the error term, we could not apply the plug-in procedure to estimate them, as discussed in Section 4 of [16]. Fortunately, all the asymptotic covariance matrices developed in the above theorems can be estimated using the plug-in procedure, and conducting statistical inference is straightforward. In addition, as discussed in Remark 2, in order to achieve estimation efficiency for the regression parameters, the non-parametric functions should be under-smoothed. This is another reason that we do not recommend a data-driven method for the determination of $J_N$: if $J_N$ is selected adaptively, the estimation of the regression parameters may be sub-efficient, and it is also subject to the problem of post-selection inference.

Testing heterogeneity
As in [21], we develop statistical tests for the heterogeneity of the linear components across sub-populations. First, we consider a general class of pairwise tests for the heterogeneous parameters. Then, we develop a more general heterogeneity test involving many (up to as many as $s$) sub-populations.
First, consider the following general class of pairwise tests for the heterogeneous parameters:
$$H_0: Q\beta_0^{(j_1)} = Q\beta_0^{(j_2)} \quad \text{versus} \quad H_1: Q\beta_0^{(j_1)} \ne Q\beta_0^{(j_2)}, \tag{3.1}$$
where $j_1 \ne j_2 \in \{1, \ldots, s\}$ and $Q = (q_1^\top, \ldots, q_{d_1}^\top)^\top$ is a $d_1 \times d$ matrix with $d_1 \le d$. This class includes testing whether the whole vector or specific entries of $\beta_0^{(j)}$ are equal across two sub-populations. We construct a Wald-type test statistic based on either the estimators from Subsection 2.2 or the estimators from Subsection 2.4, respectively. We summarize the asymptotic properties of these two test statistics in the following theorem, based on which we can conduct the Wald tests.
Furthermore, if the conditions in Theorem 2.3 hold, under the null hypothesis (3.1), the analogous result holds for $\check{\beta}^{(j)}$. Based on Theorem 3.1, we can construct the Wald tests $\Psi_1$ and $\Psi_2$, which reject the null hypothesis when the standardized statistic exceeds $Z_{1-\alpha/2}$, where $\alpha$ is a given significance level and $Z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution. It is clear that $\Psi_2$ is more powerful than $\Psi_1$ because of the smaller variance of $\check{\beta}^{(j)}$ compared with $\widehat{\beta}^{(j)}$. Coverage probabilities, lengths of confidence intervals, nominal levels and powers are empirically evaluated in the simulation studies, which confirm our theoretical results.
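A pairwise Wald test of the form above can be sketched for the scalar case ($d = d_1 = 1$, $Q = 1$) with plug-in variances $\widehat{\sigma}^2 / (n \widehat{D})$. The degree-1 spline basis, sample sizes and error level are illustrative assumptions; with $X$ independent of $Z$ here, $D = E(\widetilde{X}^2)$ reduces to $E(X^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_beta(X, Z, Y, knots):
    """Profile least-squares beta-hat on one sub-sample (degree-1 spline stand-in)."""
    B = np.column_stack([Z] + [np.maximum(Z - t, 0.0) for t in knots])
    D = np.column_stack([np.ones(len(Y)), X, B - B.mean(axis=0)])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return coef[1], Y - D @ coef             # slope estimate and residuals

def sample(beta, n, rng):
    X = rng.normal(size=n)
    Z = rng.uniform(0, 1, size=n)
    return X, Z, beta * X + np.sin(2 * np.pi * Z) + rng.normal(scale=0.5, size=n)

def wald(data1, data2, knots):
    (b1, r1), (b2, r2) = fit_beta(*data1, knots), fit_beta(*data2, knots)
    # plug-in variances sigma_hat^2 / (n * D_hat); X is independent of Z here,
    # so D is estimated by the sample second moment of X
    v1 = np.var(r1) / (len(r1) * np.mean(data1[0] ** 2))
    v2 = np.var(r2) / (len(r2) * np.mean(data2[0] ** 2))
    return abs(b1 - b2) / np.sqrt(v1 + v2)

n, knots = 400, np.linspace(0, 1, 7)[1:-1]
w_null = wald(sample(1.0, n, rng), sample(1.0, n, rng), knots)  # equal slopes
w_alt = wald(sample(1.0, n, rng), sample(3.0, n, rng), knots)   # Delta = 2
print(w_null, w_alt)
```

The test rejects at level $\alpha = 0.05$ when the statistic exceeds $Z_{0.975} \approx 1.96$; under the large alternative the statistic is far beyond that threshold.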
Then, we consider a more general heterogeneity test involving many sub-populations indexed by $j \in S \subseteq \{1, \ldots, s\}$. Note that $|S|$ is allowed to be as large as $s$, i.e., all sub-populations. The null and alternative hypotheses are formulated as
$$H_0: \beta_0^{(j)} = \tilde{\beta}_0^{(j)} \text{ for all } j \in S \quad \text{versus} \quad H_1: \beta_0^{(j)} \ne \tilde{\beta}_0^{(j)} \text{ for some } j \in S, \tag{3.2}$$
where the $\tilde{\beta}_0^{(j)}$'s are some predetermined values. To test (3.2), it is natural to define a max-type test statistic $T_S$ based on the standardized deviations $\sqrt{n}(\widehat{\beta}^{(j)} - \tilde{\beta}_0^{(j)})$, $j \in S$. Its distribution can be approximated by the multiplier bootstrap quantity $W_S$ constructed from i.i.d. multipliers $e_i \sim N(0, \sigma^2)$. The following theorem summarizes the consistency of the proposed multiplier bootstrap method: under the preceding conditions, provided the remainder terms are $o_P(1)$ after $\sqrt{n}$-scaling, we have $P(T_S > c_S) \to \alpha$, where $c_S = \inf\{w \in \mathbb{R} : P(W_S \le w \mid \mathcal{X}) \ge 1 - \alpha\}$.
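The max-type statistic with a Gaussian multiplier bootstrap critical value can be sketched as follows. To keep the sketch short, the sub-population estimators are drawn directly from their $\sqrt{n}$-Gaussian limit rather than fitted from data, and $s$, $n$, $\sigma^2$, $D$ and the bootstrap size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Max-type statistic for H0: beta^{(j)} = beta0 for all j in S, with a Gaussian
# multiplier bootstrap critical value. Estimators are simulated directly from
# their root-n Gaussian limit N(beta0, sigma^2 / (D n)) to keep the sketch short.
s, n, sigma2, D = 30, 500, 1.0, 1.0
beta0 = 1.0

def max_stat(beta_hat):
    return np.sqrt(n) * np.max(np.abs(beta_hat - beta0))

beta_null = beta0 + rng.normal(scale=np.sqrt(sigma2 / (D * n)), size=s)
beta_alt = beta_null.copy()
beta_alt[0] += 0.5                     # one sub-population deviates by 0.5

B = 500                                # bootstrap replications
W = np.array([np.max(np.abs(rng.normal(scale=np.sqrt(sigma2 / D), size=s)))
              for _ in range(B)])
c = np.quantile(W, 0.95)               # bootstrap critical value c_S
print(max_stat(beta_null), max_stat(beta_alt), c)
```

Under the alternative, the deviating coordinate pushes the max statistic far above the bootstrap critical value, while under the null the statistic stays comparable to it.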
Remark 6: The above hypotheses can be extended to test the heterogeneity of all sub-populations without specifying the $\tilde{\beta}_0^{(j)}$'s. By an argument similar to that in [21], we only need to test whether the differences of the heterogeneity parameters between any two consecutive sub-populations are equal to zero. The modified test statistic $T'_S$ is based on the deviations $\sqrt{n}(\widehat{\beta}^{(j+1)} - \widehat{\beta}^{(j)})$, with a corresponding bootstrap quantity $W'_S$. Similarly, we define $c'_S = \inf\{w \in \mathbb{R} : P(W'_S \le w \mid \mathcal{X}) \ge 1 - \alpha\}$.

Testing homogeneity
Now we consider testing whether the non-linear components $g_{0k}(Z_k)$, $k = 1, \ldots, K$, are homogeneous across all $s$ sub-populations, which is a prerequisite for the aggregation of commonality. With a slight abuse of notation, for the $j$th sub-population, we denote the true unknown smooth functions by $g_{0k}^{(j)}(Z_k)$, $k = 1, \ldots, K$. We apply the likelihood ratio principle to the homogeneity test for the $k$th smooth function,
$$H_0: g_{0k}^{(1)} = \cdots = g_{0k}^{(s)} \quad \text{versus} \quad H_1: g_{0k}^{(j_1)} \ne g_{0k}^{(j_2)} \text{ for some } j_1 \ne j_2,$$
and construct a likelihood ratio test statistic accordingly. Before providing the limiting distribution of this test, extra notation is required. We say that a statistic $T_n$ is nearly $\chi^2_{b_n}$, denoted $T_n \overset{a}{\sim} \chi^2_{b_n}$, if $(2b_n)^{-1/2}(T_n - b_n)$ converges weakly to $N(0, 1)$ for some sequence $b_n \to \infty$.
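The "nearly $\chi^2_{b_n}$" notion is just the normal approximation to a chi-square with growing degrees of freedom, which a quick simulation illustrates (the degrees of freedom and sample size below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)

# "Nearly chi-square": if T_n ~ chi2 with b_n degrees of freedom and b_n -> inf,
# then (T_n - b_n) / sqrt(2 b_n) is approximately standard normal.
b_n = 400
T = rng.chisquare(b_n, size=20000)
Zstd = (T - b_n) / np.sqrt(2 * b_n)
print(Zstd.mean(), Zstd.std())         # close to 0 and 1
```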
Remark 7: The condition $J_N^{1/2} = o(s)$ is a weak requirement, implying that the number of sub-populations cannot be too small, so that sufficient information can be borrowed across all of them. This phenomenon is also observed in [12], where it is called the "blessing of aggregation". According to Remarks 1 and 2, $J_N \asymp N^q$, where $1/(4p) < q < 1/4$, so $J_N^{1/2} = o(s)$ means $\gamma < 1 - q/2$, where the right-hand side has minimum value $7/8$. Remark 8: The above theorem considers testing one smooth function. If we want to test the summation of all the smooth functions, the result can be proved similarly; it is summarized in the following theorem, with the extra condition $J_N^{1/2} = o(s)$ removed.

Simulation Studies
We conduct simulation studies to examine the impact of the balance between the sub-population size $n$ and the number of sub-populations $s$ on the performance of the proposed estimators, $\bar{g}$ and $\check{\beta}^{(j)}$. We consider an additive partially linear model with two nonparametric components ($K = 2$) as the data generating model, where $\varepsilon$ is generated from the normal distribution $N(0, 1)$; $Z_1$, $Z_2$ and $W$ are generated independently from the uniform distribution $U(-1, 1)$; $X = \frac{1}{2}(W + Z_1)$; and the constant $C_0$ is taken as $100(1 - e^{-3.25})/3.25 - 400(1 - e^{-6.5})/6.5 + 300(1 - e^{-9.75})/9.75$ to ensure that $E\{g_1(Z_1)\} = E\{g_2(Z_2)\} = 0$. We can show that $\widetilde{X} = W/2$, $D = E(\widetilde{X}^2) = 1/12$, and $A = E(X^2) = 1/6$. In order to generate heterogeneous data, we let $\beta_0^{(j)} = j$ for the $j$th sub-population, $j = 1, \ldots, s$, with $d = 1$.
First, we evaluate the performance of the aggregated estimator $\bar{g}$ as an estimator of $g_0$. We compute the root mean-square error (RMSE) of $\bar{g}$ under different choices of $J_N$ and $s$ and different settings of $N$. The results are summarized in Figure 1. The condition $J_N^2 \ll n$, which is needed in all the theorems, implies that the larger the number of knots we take, the shorter the range of $s$ we should consider. In Figure 1, for each selection of the number of knots, we see that the performance of $\bar{g}$ is good and stable over a wide range of $s$. We also see that the RMSE of $\bar{g}$ deteriorates quickly when $\log(s)/\log(N)$ approaches $1 - 2q$, where $q \approx \log_N(J_N)$. For example, using 5 knots and $N = 2^{11}$, $q = \log_N(5) \approx 0.21$, so $1 - 2q \approx 0.58$; from Figure 1 we indeed see that the corresponding RMSE increases sharply once the ratio passes 0.5. In summary, from Figure 1 we see a clear boundary for $\log(s)/\log(N)$: within this boundary, the performance of $\bar{g}$ is very good, while beyond it, the performance deteriorates badly. These findings confirm the theoretical results presented in Theorem 2.2.
Second, we evaluate the performance of the proposed estimators $\widehat{\beta}^{(j)}$ and $\check{\beta}^{(j)}$ for estimating $\beta_0^{(j)}$. We consider 95% confidence intervals $CI_1$ and $CI_2$ based on $\widehat{\beta}^{(j)}$ and $\check{\beta}^{(j)}$, respectively. For simplicity, we summarize the results for the first sub-population in Figures 2-4, where both the coverage probabilities and the interval lengths are displayed, with the results for $\widehat{\beta}^{(1)}$ in solid lines with circles and those for $\check{\beta}^{(1)}$ in dashed lines with triangles. From Figure 2, where $N = 2^{11}$, and Figure 3, where $N = 2^{14}$, we see that within a proper range of $s$, $CI_1$ and $CI_2$ have similar coverage probabilities. We also see that, on average, the interval length of $CI_2$ is shorter than that of $CI_1$. This finding confirms that the asymptotic variance derived in Theorem 2.3 is smaller than that in Theorem 2.1. However, the coverage probability of $CI_2$ is valid over a shorter range of $\log(s)/\log(N)$ than that of $CI_1$. This finding is consistent with the fact that Theorem 2.3 requires more conditions than Theorem 2.1.
To visualize the performance of $CI_2$ more clearly, in Figure 4 we display its coverage probability in more detail for different settings of $s$ and $N$, given different numbers of knots. From Figure 4 we can see that, given the number of knots, a larger $N$ implies a wider valid range of $s$ for achieving good coverage; given $N$, a larger number of knots implies a smaller transition point for $s$. Third, we evaluate the pairwise heterogeneity tests using the Wald test statistics $\Psi_1$ and $\Psi_2$ constructed based on Theorem 3.1, where $C_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution and $\widehat{D}$ and $\widehat{A}$ are the plug-in estimates. The alternatives are set as $\beta_0^{(j_1)} - \beta_0^{(j_2)} = \Delta$ with $\Delta = 0.5$, 1 and 1.5, respectively. We see that the power increases as $N$ increases and as $\Delta$ increases.
We also see that the power of $\Psi_2$ is larger than that of $\Psi_1$ across the different settings. These findings confirm the asymptotic results stated in Theorem 3.1. Fourth, we evaluate the more general heterogeneity tests involving all sub-populations using the test statistics constructed based on Theorem 3.2 and Remark 6, with $\Delta = 0$, 0.4, 0.6 and 1; note that here $Q = 1$. In the simulations, we set $\beta_0^{(j)} = 1$ for the null hypothesis and use 500 bootstrap samples to obtain the upper $\alpha = 0.05$ quantile of $W'_S$. The results are summarized in Figure 6, where the y-axis shows the probability $P(T'_S > c'_S)$ under different settings of $s$ and $N$. From the panel with $\Delta = 0$ of Figure 6, we see that the proposed test statistic $T'_S$ has appropriate type-I error around the nominal level $\alpha = 0.05$ over a wide range of $s$, but the type-I error becomes inflated after $s$ passes a transition point. The remaining panels compare the testing powers under the alternatives with $\Delta = 0.4$, 0.6 and 1, respectively. We see that the power increases as $N$ increases and as $\Delta$ increases. These findings confirm the asymptotic results stated in Theorem 3.2. Finally, we evaluate the homogeneity test for one non-linear function using the test statistic constructed based on Theorem 3.3, with $\Delta = 0$, 0.5, 1.0 and 1.5. In the simulations, we set $g_{01}^{(j)}(Z_1) = 5\sin(2\pi Z_1)$, $j \ge 2$, as used previously, for the null hypothesis. The results are summarized in Figure 7, where the y-axis shows the probability $P\big({-\tfrac{1}{3\widehat{\sigma}^2}}\, n \cdot \overline{\mathrm{LRT}}_{nk} > \chi^2_{u_N, 1-\alpha}\big)$ under different settings of $s$ and $N$, and $\chi^2_{u_N, 1-\alpha}$ is the $(1-\alpha)$ quantile of $\chi^2_{u_N}$. From the panel with $\Delta = 0$ of Figure 7, we see that the proposed test statistic has appropriate type-I error around the nominal level $\alpha = 0.05$ when $s$ is neither too large nor too small ($\log(s)/\log(N) > q/2$), but the type-I error becomes inflated after $s$ passes a transition point.
As we use a fixed number of knots, the transition point shifts to the right as $N$ increases, so that $J_N \ll n^{1/2}$ continues to hold. Panels (b)-(d) compare the testing powers under the three alternative hypotheses with $\Delta = 0.5$, 1.0 and 1.5, respectively. We see that the power increases as $N$ increases and as $\Delta$ increases. Similar to the observations in [12], the powers for $\Delta > 0$ return to 1 when $s$ is very large. This can still be explained by the limiting distribution of the test statistic deviating greatly from both the null and the alternative distributions. Therefore, the homogeneity test does not perform well for very large $s$.

Real data application
We apply the proposed divide-and-conquer strategy for APLMs to the Medicare Provider Utilization and Payment Data (the Physician and Other Supplier Public Use File), which contains information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals. This dataset was prepared by the Centers for Medicare & Medicaid Services (CMS) as part of the Obama Administration's efforts to make the healthcare system more transparent, affordable, and accountable. We downloaded the dataset "Medicare Physician and Other Supplier Data CY 2014" from www.CMS.gov, with more than nine million records for health care providers from the U.S. or U.S. possessions. We focus on the subset consisting of the 50 U.S. states and the District of Columbia (DC), which accounts for the majority of the whole dataset.
Our goal is to model the outcome variable "average Medicare standardized amt" (the average amount that Medicare paid after beneficiary deductible and coinsurance amounts were deducted for the line item service and after standardization of the Medicare payment was applied) on other covariates, including gender or entity of provider, provider type, Medicare participation status, place of service, HCPCS drug indicator, number of distinct Medicare beneficiaries ("bene unique cnt"), number of services provided ("bene day srvc cnt"), and number of distinct Medicare beneficiary/per day services ("line srvc cnt"). Detailed explanations of these variables can be found on the official website www.CMS.gov. All covariates except the last three are categorical, and the variable for provider type in particular has 91 categories. Because those three quantitative variables are all count data, we take the $\log_{10}$-transformation and rescale each of them to the range $[-1, 1]$ using the formula $(Z - \min Z)/(\max Z - \min Z) \times 2 - 1$. We also apply the $\log_{10}$-transformation to the outcome variable, which is skewed to the right. After excluding records with value 0 for the quantitative variables and keeping records with overlapping ranges for the last three variables across states, the working dataset has 9,263,068 records, and the corresponding file size is greater than 2GB. It is hard to apply any complicated model fitting with iterative algorithms on a single PC with limited memory. Therefore, we turn to the developed divide-and-conquer strategy. It is natural to split the data by location, such as states or counties. According to our theoretical results, the number of sub-populations cannot be too large: the number of counties in the U.S. is more than 3,000, while $\sqrt{9{,}263{,}068} \approx 3044$. Thus, we split the whole dataset by the states and DC, resulting in 51 sub-populations.
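The preprocessing of the count covariates described above is a two-step transform; the counts in this sketch are made-up illustrative values, not records from the CMS data.

```python
import numpy as np

# Preprocessing for the three count covariates: log10-transform, then rescale
# to [-1, 1] via (Z - min Z) / (max Z - min Z) * 2 - 1. Counts are illustrative.
counts = np.array([3.0, 10.0, 250.0, 8000.0])
Z = np.log10(counts)
Z_scaled = (Z - Z.min()) / (Z.max() - Z.min()) * 2 - 1
print(Z_scaled)
```

The transform is monotone, maps the smallest value exactly to -1 and the largest exactly to 1, and compresses the heavy right tail of the counts.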
The number of records per sub-population varies from 14,809 (Alaska) to 719,970 (California), with a median of 128,069. It is reasonable to hypothesize that the categorical covariates are heterogeneous, because their effects on the average amount that Medicare paid may vary across states. On the other hand, the outcome variable is the standardized payment, which removes geographic differences in payment rates for individual services, and all three quantitative covariates are counts of services and beneficiaries. It is therefore reasonable to assume that the effects of the quantitative covariates are homogeneous.
We choose cubic B-splines (degree 3) to approximate the non-parametric functions of the three quantitative covariates. Assumption (A4) requires that the number of internal knots be much smaller than $N^{1/4} \approx 55$. Additionally, we expect these curves to be smooth. Thus, we set the number of internal knots to 5. Since the sizes of the sub-populations differ, rather than a simple average we obtain the aggregated curves by a weighted average with weights $n_j / \sum_{j=1}^{s} n_j$, where $n_j$ is the size of the $j$th sub-sample.
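The size-weighted aggregation of the sub-sample curves can be sketched as below; this is a minimal illustration with names of our own choosing, assuming every sub-sample curve has already been evaluated on a common grid.

```python
import numpy as np

def aggregate_curves(sub_curves, sub_sizes):
    """Aggregate sub-population curve estimates evaluated on a common grid
    using the weights n_j / sum_j n_j, where n_j is the jth sub-sample size."""
    curves = np.asarray(sub_curves, dtype=float)   # shape (s, n_grid)
    weights = np.asarray(sub_sizes, dtype=float)
    weights = weights / weights.sum()
    return weights @ curves                        # weighted average per grid point
```

With equal sub-sample sizes this reduces to the simple average $\tilde g = \frac{1}{s}\sum_{j=1}^{s}\tilde g^{(j)}$ used in the homogeneous-weight case.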
Since the effects of the categorical covariates are allowed to be heterogeneous, we use box-plots to summarize the variability of their estimates across the 51 sub-populations. From Figure 8, which displays the extent of heterogeneity, we can see that only the effect of male versus female shows a small degree of heterogeneity around 0, while all the other effects vary considerably. Although the range of this effect is small, with such a large sample size we can easily detect a small heterogeneity across sub-populations. In summary, the effects of the categorical covariates on the average amount that Medicare paid vary substantially across states. Figure 9 presents the non-parametric estimates of the effects of the three quantitative covariates. The largest value of each quantitative covariate differs across states, so we only plot the aggregated curves on the common range from $-1$ to $0$. From panels (a)-(c) of the figure, for each covariate, the estimated curves from the 51 sub-samples (black dashed lines) are almost parallel to each other within a narrow band, while the aggregated curve (red solid line) lies right in the middle of the sub-sample-specific curves. In addition, we are interested in the validity of the aggregation for commonality, noting that the sub-population curves vary more at the boundaries than around the middle. We therefore test the homogeneity of the nonlinear components across the 51 sub-populations on the range from $-0.8$ to $-0.1$ via the testing procedure proposed in Section 3.2. The test statistic is $(3\hat\sigma_n^2)^{-1}\,\mathrm{LRT}^s_{nk} = 480.24$ with $u_N = 500$. The resulting p-value of 0.73 means we fail to reject the null hypothesis, so aggregation for commonality is a valid strategy.
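The p-value computation for the homogeneity test can be sketched with a normal approximation. This is our own illustrative helper, assuming (as is standard for such standardized LRT-type statistics) that the statistic, centered at its null mean and scaled by its null standard deviation, is asymptotically standard normal.

```python
import math

def normal_upper_tail_p(statistic, null_mean, null_sd):
    """One-sided p-value under a standard normal approximation for a test
    statistic with null mean `null_mean` and null standard deviation
    `null_sd`: P(Z > z) for z = (statistic - null_mean) / null_sd."""
    z = (statistic - null_mean) / null_sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A statistic falling below its null center, as in the application above, yields a p-value above 0.5 and hence no evidence against homogeneity.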

Summary
In this paper, we develop a framework of additive partially linear models for massive heterogeneous data, using the divide-and-conquer strategy. As summarized in [19], the divide-and-conquer strategy is one of the three commonly used strategies for analyzing massive data, the other two being the sub-sampling strategy and the sequential updating strategy. However, the sub-sampling and sequential updating strategies are only suitable for analyzing homogeneous massive data. We can combine the divide-and-conquer and sub-sampling strategies to analyze heterogeneous data, by dividing the data into homogeneous subgroups and then conducting sub-sampling within each subgroup. Likewise, we can combine the divide-and-conquer and sequential updating strategies, by dividing the data into homogeneous subgroups and then conducting sequential updating within each subgroup. The framework developed in this paper extends the partially linear framework proposed in [21]. Their partially linear framework considers only one common feature, using the smoothing-splines technique to fit the non-parametric function based on the general reproducing kernel Hilbert space (RKHS) theory [17]. Although the smoothing-splines technique and the RKHS theory have been well developed in the framework of generalized additive models [7], we find it very hard to extend them to our goal of analyzing massive data with multiple common features. Instead, we adopt polynomial splines for modeling the non-parametric effects of multiple common features simultaneously across all sub-populations while exploring the heterogeneity of each sub-population. The proposed methods can be implemented easily and perform well in both the simulation studies and the real data application.
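As a toy illustration of the divide-and-conquer strategy discussed above (a sketch under simplifying assumptions: a sample mean stands in for the per-sub-sample model fit, and aggregation is the size-weighted average of sub-sample estimates):

```python
import numpy as np

def divide_and_conquer_mean(data, n_splits):
    """Toy divide-and-conquer: fit each sub-sample separately (here the
    'fit' is just a sample mean), then aggregate the sub-sample estimates
    with weights proportional to sub-sample sizes. For the mean the
    aggregate equals the full-sample estimate exactly; for spline
    estimators the equivalence is only asymptotic."""
    blocks = np.array_split(np.asarray(data, dtype=float), n_splits)
    sizes = np.array([len(b) for b in blocks], dtype=float)
    estimates = np.array([b.mean() for b in blocks])
    return float(estimates @ (sizes / sizes.sum()))
```

The point of the illustration is that each block can be processed independently (and in limited memory), with only the low-dimensional estimates communicated for aggregation.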

A.1 Technical lemmas for Section 2.2
Define the centered version of the B-spline basis as $b^*_{m,k}(z_k) = b_{m,k}(z_k) - \frac{E[b_{m,k}]}{E[b_{1,k}]}\, b_{1,k}(z_k)$, $k = 1, \ldots, K$, $m = -\ell+1, \ldots, J_N$, and the standardized version of the B-spline basis as $B_{m,k}(z_k) = b^*_{m,k}(z_k)/\|b^*_{m,k}\|_{2k}$, $m = -\ell+1, \ldots, J_N$, $k = 1, \ldots, K$.
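An empirical version of this centering can be sketched as follows (our own illustration, not the paper's code: expectations are replaced by sample averages over the evaluation points, and the first column of the basis matrix plays the role of $b_{1,k}$).

```python
import numpy as np

def center_basis(basis_values):
    """Empirical centering of a B-spline basis for one component k:
    subtract from each basis function b_{m,k} the multiple
    (E[b_{m,k}] / E[b_{1,k}]) * b_{1,k}, with expectations estimated by
    column means of the evaluated basis matrix."""
    B = np.asarray(basis_values, dtype=float)   # shape (n_points, n_basis)
    means = B.mean(axis=0)                      # empirical E[b_{m,k}]
    # Each centered column m is B[:, m] - (means[m] / means[0]) * B[:, 0].
    return B - np.outer(B[:, 0], means / means[0])
```

By construction every centered column has empirical mean zero, which is what makes the additive components identifiable.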
Then the minimization problem (2.2) is equivalent to a minimization over $\beta$ and $\gamma$, where $B(z) = \{B_{m,k}(z_k),\, m = -\ell+1, \ldots, J_N,\, k = 1, \ldots, K\}^T$. Abusing notation, we still write $\hat\gamma^{(j)}$. Then $\hat g^{(j)}(z) = (\hat\gamma^{(j)})^T B(z)$ is a spline estimator of $g_0$ for the $j$th sub-population, with the centered spline estimators of the component functions defined accordingly. In practice, the basis $\{b_{m,k},\, m = -\ell+1, \ldots, J_N,\, k = 1, \ldots, K\}$ is used for computational implementation, while $\{B_{m,k}\}$ is convenient for asymptotic analysis. [4] showed that for any function $f \in \mathcal{H}$ and $N \ge 1$, there exists a function $\tilde f \in \mathcal{S}_N$ such that $\|\tilde f - f\|_\infty \le C h_N^p$. Thus, for $g_0$ satisfying Assumption (A1), there exists $\tilde g^{(j)}(z) = B^T(z)\tilde\gamma_j \in \mathcal{G}_n^{(j)}$ such that $\|\tilde g^{(j)} - g_0\|_\infty = O(h_N^p)$ and $\tilde g^{(j)}(z)$ is the best least-squares projection of $g_0(z)$ onto the space $\mathcal{G}_n^{(j)}$, implying $\langle \tilde g^{(j)}(z) - g_0(z), B(z)\rangle_{jn} = 0$, $j = 1, \ldots, s$.
Proof. Let $\tilde\delta^{(j)} = \sqrt{n}\,(\tilde\beta^{(j)} - \beta_0^{(j)})$, so that $\tilde\delta^{(j)}$ minimizes the criterion $\tilde l_n^{(j)}(\delta)$. Let $A_n^{(j)} = \frac{1}{n}\sum_{i \in G_j} X_i^{\otimes 2}$. Taking derivatives with respect to $\delta$, and using arguments similar to those of Lemma A.1 in [11] together with the fact that $\|\tilde g^{(j)} - g_0\|_\infty = O(h_N^p)$, the lemma follows.
$O\!\left(\frac{\log n}{(n h_N)^{1/2}}\right) = o(1)$, a.s., by Lemma A.8 in [18]. Here the constant $C$ is taken large enough to ensure that the above result holds for all $j = 1, \ldots, s$. The condition $J_N \ll n/(\log n)^2$ implies $O\big(\log n / (n h_N)^{1/2}\big) = o(1)$. Therefore, the lemma is proved.
Proof. The proof is similar to that of Lemma A.4 in [11], applying Lemmas A.2 and A.3 together with a bound implied by Lemma A.8 in [18] under the condition $J_N \ll n/(\log n)^2$.
Proof. The proof is similar to that of Lemma A.5 in [11] with the following revisions. We only show the second equality; the first one can be proved similarly.

A.2 Technical lemmas for Section 2.3
Let r g " 1 s ř s j"1 r g pjq . In order to ensure that r g P G N , we re-center the individual estimator r g pjq pzq via r g pjq pzq " To abuse the notation, we still denote the centered estimator as r g pjq pzq. Lemma A.2 shows }pV pjq n q´1} 2 ď C a.s., j " 1, . . . , s, if n " J 2 N . This property plays a key role in all following proofs. Define u Proof. If follows , . Observing that , .
where the last inequality is due to the fact that $e_m^T (V_n^{(j)})^{-1} e_m \le \|(V_n^{(j)})^{-1}\|_2$.

Proof. The last equality follows from (A.1), and the result follows.

imsart-ejs ver. 2014/10/16 file: APLM_EJS_revision.tex date: January 1, 2019
The proof of Lemma A.1 gives the stated expansion, where $v_m^{(j)} = \sum_{i \in G_j} u_{im}^{(j)} (A_n^{(j)})^{-1} X_i$.
The claim then follows from the Central Limit Theorem and Slutsky's theorem.
Note that the condition $J_N^2 \ll n$ implies $J_N \ll n/(\log n)^2$, and the condition $J_N \gg n^{1/(2p)}$ implies $h_N^p = O(J_N^{-p}) \ll n^{-1/2}$. Therefore, the stated result about $\hat\beta^{(j)}$ can be shown by Lemmas A.1-A.5, following the same argument as in the proof of Theorem 1 in [11].

Proof of Theorem 2.2. We first quantify $\|\hat g^{(j)} - g_0\|_2$, noting that $\|\tilde g - g_0\|_2 \le \|\tilde g - g_0\|_\infty = O(h_N^p)$, and then consider the remaining terms.
Therefore, combining Lemmas A.6-A.8, the stated bounds follow. Next we quantify $\|\bar g - g_0\|_N$. Using Lemma A.8 in [18], and noting that $\bar g, \tilde g \in \mathcal{G}_N$, the result follows.

Proof of Corollary 2.1. For homogeneous massive data, $\hat\beta^{(j)}$, $j = 1, \ldots, s$, are i.i.d. random vectors. [10] showed that if $E(\hat\beta^{(j)} - \beta_0) = o(N^{-1/2})$, then $\bar\beta$ is as efficient as $\hat\beta$, the estimator obtained from the corresponding minimization using all $N$ observations; [11] established the asymptotic normality of $\hat\beta$.
Therefore it suffices to show that $E(\hat\beta^{(j)} - \beta_0) = o(N^{-1/2})$. Following the proof of Theorem 2.1, under Assumption A3 and the fact that $E(\varphi(Z)\widetilde X) = 0$ for any measurable function $\varphi$, the relevant terms can be bounded with some positive constants $c$ and $C$. Therefore, if $n \gg N^{1/2}$, the Cauchy-Schwarz inequality yields $E(\hat\beta^{(j)} - \beta_0) = o(N^{-1/2})$. □

Proof of Theorem 2.3. Consider the estimating equation, and in particular the first and second terms on its right-hand side. For the second term, let $w_2(Z, g) = g(Z)\,(A_n^{(j)})^{-1} X$. By Lemma A.2 of [8], the logarithm of the $\varepsilon$-bracketing number of the class of functions $\mathcal{A}_2(\delta) = \{w_2(\cdot, g) - s(\cdot, g_0): g \in \mathcal{G}_N, \|g - g_0\|_2 \le \delta\}$ is $c\{(J_N - \ell)\log(\delta/\varepsilon) + \log(\delta^{-1})\}$. Thus, the corresponding entropy integral satisfies $J_{[\,]}(\delta, \mathcal{A}_2(\delta), \|\cdot\|_2) \le c\delta\{(J_N - \ell)^{1/2} + \log^{1/2}(\delta^{-1})\}$. According to Lemma 7 of [15] and Theorem 2.2, $\|\bar g - g_0\|_\infty$ can be bounded accordingly, where the last inequality is due to the condition $J_N \gg N^{1/(2p)}$. The condition $J_N^2 \ll n$ implies $O(J_N n^{-1/2}) = o(1)$, and $n \gg N^{1/(2p)}$ ensures that the above expectation is of order $o(n^{-1/2})$. Furthermore, the last equality is due to the condition $J_N^2 \ll s$. Therefore, the theorem is proved.
Therefore, the second result is also proved.
where $R_i = n^{-1/2}\sum_{i \in G_j} (A_n^{(j)})^{-1} X_i\,(g_0(Z_i) - \bar g(Z_i))$. Noting that $s \gg J_N^2 \log(pd)$, it follows that $\max_{j \in S} \|R_i\|_\infty = o_P\big(\log^{-1/2}(pd)\big)$.
Then the rest of the proof follows the same lines as the counterpart of Theorem 3.9 in [21], by discussing the asymptotic differences between the approximation terms. □

Proof of Theorem 3.3. The estimates for $g_k$ and $g_{-k}$ depend on B-spline basis expansions. Denote by $\gamma_k$ the coefficients for the basis $B_k(Z_k)$ of the $k$th function $g_k$, and by $\gamma_{-k}$ the coefficients for the basis $B_{-k}(Z_{-k})$ of $g_{-k}$. With a slight abuse of notation, we write $L_{nk}^{(j)}(\beta, \gamma_k, \gamma_{-k})$ for the corresponding criterion, where $C$ is some constant. Also, $E[B_k(Z_{ik})^T \Sigma_k^{-1} B_k(Z_{ik})] = J_N$. Let $W_{i_1 i_2}^{(j)} = 2\varepsilon_{i_1}\varepsilon_{i_2} B_k(Z_{i_1 k})^T \Sigma_k^{-1} B_k(Z_{i_2 k})$, and define $W(N)$ accordingly. The next step is to show the asymptotic normality of $n^{-1} W(N)$, which follows the same lines as the proof of Theorem 4.1 in [12] using Proposition 3.2 in [5]. First, it is easy to see that $E(W_{i_1 i_2}^{(j)} \mid Z_{i_1 k}) = 0$ for all $i_1 < i_2$. Second, bounds for several moments of $W(N)$ are derived; define $\sigma(N)^2 = \mathrm{Var}(W(N))$ and the quantities $E[(W_{i_1 i_2}^{(j)})^4]$, etc. Third, $\sigma(N)^4 = \big[(s-1)\binom{2n}{2}\,4\sigma^4 J_N + 2(s-2)\binom{n}{2}\,4\sigma^4 J_N\big]^2 = (12\sigma^4 (s-1) n^2 J_N)^2$, because $E[(W_{i_1 i_2}^{(j)})^2] = 4\sigma^4 J_N$. Since $U_I$, $U_{II}$ and $U_{III}$ have a smaller order than $\sigma(N)^4$ under the condition $J_N^2 \ll n$, Proposition 3.2 in [5] shows that $\frac{1}{\sqrt{12\sigma^4 (s-1) n^2 J_N}}\, W(N)$