Variance function additive partial linear models

To model heteroscedasticity in a broad class of additive partial linear models, we allow the variance function to be an additive partial linear model as well, with parameters that may differ from those in the mean function. We develop a two-step estimation procedure: in the first step, initial estimates of the parameters in both the mean and variance functions are obtained, and in the second step the estimates are updated using weights based on the initial estimates. We use polynomial splines to approximate the additive nonparametric components in both the mean and variance functions and derive their convergence rates. The resulting weighted estimators of the linear coefficients in both the mean and variance functions are shown to be asymptotically normal and more efficient than the initial unweighted estimators. Simulation experiments are conducted to examine the numerical performance of the proposed procedure, which is also applied to analyze a dataset from a nutritional epidemiology study.


Introduction
Additive partial linear models (APLMs) are a generalization of multiple linear regression models, and at the same time they are a special case of generalized additive nonparametric regression models (Hastie and Tibshirani, 1990). As discussed in Liu et al. (2011), APLMs allow an easier interpretation of the effect of each variable. Also, they are preferable to completely nonparametric additive models, since they combine both parametric and nonparametric components when it is believed that the response variable depends on some variables in a linear way but is nonlinearly related to the remaining independent variables.
Estimation and inference for APLMs have been well studied in the literature (Opsomer and Ruppert, 1997; Stone, 1985; Opsomer and Ruppert, 1999; Liang et al., 2008; Li, 2000; Liu et al., 2011). However, most existing work focuses on statistical inference for the mean function, while variance function estimation has received much less attention. Although a wealth of work has been done to take heteroscedasticity into account to enhance the efficiency of estimating the parameters in the mean function, estimating the variance function is also of independent interest. For example, an appropriate estimator of the variance is needed when one derives confidence intervals/bands for the mean function (Ruppert et al., 2003; Cai and Wang, 2008). Other examples in which variance function estimation plays an important role include the study of kinetic rate parameters (Box and Hill, 1974), quality control (Box and Meyer, 1986), and the study of social inequality (Western and Bloome, 2009). More recently, Thomas et al. (2012) demonstrated that individual variability in longitudinal measurements can be predictive of a health outcome, and Teschendorff and Widschwendter (2012) argued that differential variability can be as important as differential means for predicting disease phenotypes in cancer genomes.
In response to these demonstrations of the importance of variance function estimation, many flexible and efficient methods for variance function estimation have been proposed; see Carroll (2003) and Carroll and Ruppert (1988) for nice surveys. Representative work on modeling heteroscedasticity in linear or nonlinear models includes Carroll and Härdle (1989), Carroll and Ruppert (1982), Carroll (1982), Hall and Carroll (1989) and Bickel (1978). Motivated by Davidian and Carroll (1987), Lian et al. (2015) studied variance function partially linear single-index models (VF-PLSIMs), in which the variance function is a function of the sum of a linear part and a single-index function, and the parameters in the variance function are allowed to differ from those in the mean function. They developed efficient and practical estimators of the parameters in the mean and variance functions, and weighted the objective function to obtain more efficient estimators of the parameters in the mean function.
In this paper, we consider variance function additive partial linear models (VF-APLMs), a broad class of heteroscedastic regression models in which the mean function is an additive partial linear model and the variance function depends upon a generalized additive partial linear model as well. Unlike the classic generalized additive partial linear model, here we do not insist that the variance function depend only upon the mean function. Suppose that {(X_1, Z_1, Y_1), . . . , (X_n, Z_n, Y_n)} is an i.i.d. random sample of size n from the following VF-APLM:

Y = X^T α + g_1(Z_1) + · · · + g_K(Z_K) + φ{X^T β + h_1(Z_1) + · · · + h_K(Z_K)} ε,   (1)

where X = (1, X*^T)^T = (1, X_1, . . . , X_d)^T and Z = (Z_1, . . . , Z_K)^T are the linear and nonparametric covariates, g_1, . . . , g_K are unknown smooth functions in the mean function, h_1, . . . , h_K are unknown smooth functions in the variance function, α = (α_0, α_1, . . . , α_d)^T and β = (β_0, β_1, . . . , β_d)^T are vectors of unknown parameters, and ε is independent of X and Z with E(ε) = 0 and E(|ε|) = 1. Here φ is a known function, generally either φ(v) = v or φ(v) = exp(v). However, using φ(v) = v does not guarantee that φ(v) is positive in practice; thus, in all our numerical examples we use φ(v) = exp(v). To ensure identifiability of the nonparametric functions, we assume that E{g_k(Z_k)} = 0 and E{h_k(Z_k)} = 0 for k = 1, . . . , K. We also assume that E(X*) = 0, which can be achieved in practice by centering, that is, by replacing X*_i with X*_i − n^{-1} Σ_{j=1}^n X*_j. The challenge in investigating model (1), in both theoretical derivation and numerical implementation, is that there can be more than one nonparametric component in each of the mean and variance functions. If there were only one nonparametric component in each, one could use kernel methods to estimate the nonparametric components, as in the VF-PLSIMs investigated by Lian et al. (2015). However, kernel methods cannot be applied directly to estimate the variance parameters in VF-APLMs.
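As an illustration of model (1), the following minimal Python sketch generates data from a VF-APLM with φ(v) = exp(v), two linear covariates, and K = 2 nonparametric components. All component functions and parameter values here are hypothetical, chosen only to satisfy the model's constraints (centered covariates and components, E(ε) = 0, E(|ε|) = 1), not the settings used in the paper's simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 500, 2, 2

# Hypothetical linear covariates, centered so E(X*) = 0.
Xstar = rng.normal(size=(n, d))
Xstar -= Xstar.mean(axis=0)
X = np.column_stack([np.ones(n), Xstar])      # X = (1, X*^T)^T
Z = rng.uniform(size=(n, K))                  # nonparametric covariates on [0, 1]

# Illustrative component functions, centered so E{g_k} = E{h_k} = 0.
g = np.sin(2 * np.pi * Z[:, 0]) + (Z[:, 1] ** 2 - 1 / 3)
h = 0.5 * np.cos(2 * np.pi * Z[:, 0]) + 0.5 * (Z[:, 1] - 0.5)

alpha = np.array([1.0, 2.0, -1.0])
beta = np.array([-1.0, 0.5, 0.5])

# Error with E(eps) = 0 and E(|eps|) = 1: rescale a standard normal,
# using E|N(0,1)| = sqrt(2/pi).
eps = rng.normal(size=n) / np.sqrt(2 / np.pi)

phi = np.exp(X @ beta + h)                    # phi(v) = exp(v) is always positive
Y = X @ alpha + g + phi * eps
```

Note that with φ(v) = exp(v) the scale function is positive by construction, which is exactly why the exponential link is used in the numerical examples.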
For APLMs, Opsomer and Ruppert (1997) and Stone (1985) proposed backfitting algorithms, and Opsomer and Ruppert (1999) studied the asymptotic properties of kernel-based backfitting estimators of the parameters in the mean function. Liang et al. (2008) showed that a kernel-based estimation procedure is available for APLMs without an undersmoothing requirement. When there are multiple nonparametric terms, however, kernel-based procedures are computationally expensive. Motivated by this, Liu et al. (2011) proposed to approximate the nonparametric components with splines. The resulting estimators of the linear components are easy to compute and, most importantly, still asymptotically normal.
In this paper, we use the polynomial-spline procedure (Xue, 2009; Xue and Yang, 2006a,b) to approximate the multiple nonparametric components in both the mean and variance functions. However, we face additional challenges in establishing the asymptotic properties of the estimators of the parameters in the variance function. It is also worth pointing out that developing the theory with spline approximation for VF-APLMs is more difficult than that for VF-PLSIMs (Lian et al., 2015).
The remainder of the paper is organized as follows. In Section 2, we describe in detail the initial and updated estimation procedures for VF-APLMs. In Section 3, we present the main theoretical results and their implications. We examine the numerical performance of the proposed method through simulation studies in Section 4 and through the analysis of a real dataset in Section 5. Some discussion is presented in Section 6, and all technical assumptions and proofs of the theoretical results are collected in the Appendix.

Spline approximation
In model (1), let g_0(z) = g_{01}(z_1) + · · · + g_{0K}(z_K) and α_0 be the true additive function and parameter vector for the mean, and let h_0(z) = h_{01}(z_1) + · · · + h_{0K}(z_K) and β_0 be the true additive function and parameter vector for the variance. For simplicity, we assume that each covariate Z_k is distributed on a compact interval [a_k, b_k], k = 1, . . . , K, and without loss of generality we take all intervals to be [a_k, b_k] = [0, 1]. Under some smoothness assumptions, the g_{0k}'s and h_{0k}'s can be well approximated by spline functions. Although in practice we could consider different sets of spline functions for g_0 and h_0, for notational simplicity we use the same set of spline functions for both.
Let S_n be the space of polynomial splines on [0, 1] of degree ρ ≥ 1. We introduce a knot sequence with J_n interior knots, where J_n increases with the sample size n at some rate. Equally spaced knots are used here for simplicity; other regular knot sequences can also be used, with similar asymptotic results. Then S_n consists of functions ξ satisfying (i) ξ is a polynomial of degree ρ on each of the subintervals between consecutive knots, and (ii) ξ is (ρ − 1) times continuously differentiable on [0, 1]. We consider an additive spline estimate ĝ of g_0 in the mean and an additive spline estimate ĥ of h_0 in the variance, based on the independent random sample (X_i, Z_i, Y_i), i = 1, . . . , n. Let A_n be the collection of functions ξ with the additive form ξ(z) = ξ_1(z_1) + · · · + ξ_K(z_K), where each component function ξ_k ∈ S_n and Σ_{i=1}^n ξ_k(Z_{ik}) = 0.
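The spline space S_n with equally spaced interior knots can be realized with standard B-spline routines. The sketch below uses SciPy's `BSpline.design_matrix` with a clamped knot vector; the degree and the number of interior knots are illustrative. It also forms the empirically centered basis used for the additive components.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(z, n_interior, degree=3):
    """Evaluate a B-spline basis on [0, 1] with equally spaced interior knots."""
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    # clamped knot vector: (degree + 1) copies of each boundary knot
    t = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    # one column per basis function b_{j,k}; shape (len(z), n_interior + degree + 1)
    return BSpline.design_matrix(np.clip(z, 0, 1), t, degree).toarray()

z = np.linspace(0, 1, 200)
B = bspline_basis(z, n_interior=2)        # 2 + 3 + 1 = 6 basis functions

# B-spline bases form a partition of unity on [0, 1].
assert np.allclose(B.sum(axis=1), 1.0)

# Empirically centered basis: subtract column means so sum_i b_j(Z_i) = 0,
# matching the centering constraint on each component in A_n.
Bc = B - B.mean(axis=0)
```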

Initial estimator of the mean
The problem of estimating g_0 and α_0 in the mean function is well studied when potential heteroscedasticity is ignored; see, for example, Liu et al. (2011). We would like to find a function g ∈ A_n and a value of α that minimize the following sum of squared residuals:

Σ_{i=1}^n {Y_i − X_i^T α − g(Z_i)}².   (2)

For the k-th covariate z_k, let {b_{j,k}(z_k): j = −ρ, . . . , J_n} be the B-spline basis functions of degree ρ. For any g ∈ A_n, one can write

g(z) = η^T b(z),   (3)

where b(z) = {b_{j,k}(z_k), j = −ρ, . . . , J_n, k = 1, . . . , K}^T and η = {η_{j,k}, j = −ρ, . . . , J_n, k = 1, . . . , K}^T is the spline coefficient vector. Minimizing (2) is then equivalent to finding α and η to minimize

Σ_{i=1}^n {Y_i − X_i^T α − η^T b(Z_i)}².   (4)

We denote the minimizers as α̂ and η̂. Then the spline estimator of g_0 is ĝ(z) = η̂^T b(z), and the centered spline estimator of the component g_k is

ĝ_k(z_k) = Σ_{j=−ρ}^{J_n} η̂_{j,k} b_{j,k}(z_k) − n^{-1} Σ_{i=1}^n Σ_{j=−ρ}^{J_n} η̂_{j,k} b_{j,k}(Z_{ik}),   (5)

for k = 1, . . . , K. This estimation approach can be easily implemented with existing linear-model routines in any statistical software.

Davidian and Carroll (1987) developed general methodology and theory for variance function estimation in the parametric case. They distinguished between methods based on squared residuals and those based on absolute residuals; the former are more efficient if the regression errors ε_i are normally distributed, but they called this potential efficiency gain "tenuous" because it is less robust to outliers. Here we consider absolute residuals. Define the unobserved absolute residuals

R_i = |Y_i − X_i^T α_0 − g_0(Z_i)|,  i = 1, . . . , n,

so that E(R_i | X_i, Z_i) = φ{X_i^T β_0 + h_0(Z_i)}, since E(|ε|) = 1.
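The initial mean estimation in (4) reduces to ordinary least squares on the combined design matrix [X, B(Z)]. A minimal sketch on toy data, with a centered cubic B-spline basis; the data-generating process, seed, and knot choice are all hypothetical.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n = 400
# Toy data: one linear covariate, one nonparametric covariate (hypothetical).
x = rng.normal(size=n); x -= x.mean()
z = rng.uniform(size=n)
g0 = np.sin(2 * np.pi * z)
y = 1.0 + 2.0 * x + g0 + 0.3 * rng.normal(size=n)

# Cubic B-spline basis on [0, 1] with 2 interior knots, empirically centered.
t = np.r_[np.zeros(4), [1 / 3, 2 / 3], np.ones(4)]
B = BSpline.design_matrix(z, t, 3).toarray()
Bc = B - B.mean(axis=0)

# Minimize sum_i {Y_i - X_i^T alpha - eta^T b(Z_i)}^2 by ordinary least squares.
D = np.column_stack([np.ones(n), x, Bc])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
alpha_hat, eta_hat = coef[:2], coef[2:]
g_hat = Bc @ eta_hat            # spline estimate of the centered g0 at the Z_i
```

Centering the basis makes the intercept identifiable; the remaining rank deficiency inside the centered basis is resolved by the minimum-norm least-squares solution without affecting the fitted function.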

Initial estimator of the variance
A very quick way to estimate h_0 and β_0 is to regress the absolute residuals on φ{h_0(Z) + X^T β_0}. Since the R_i are unobserved, we replace them with the fitted absolute residuals R̂_i = |Y_i − X_i^T α̂ − ĝ(Z_i)|. We would like to find a function h ∈ A_n and a value of β that minimize the following sum of squared residuals:

Σ_{i=1}^n [R̂_i − φ{X_i^T β + h(Z_i)}]².   (6)

For any h ∈ A_n, one can write h(z) = γ^T b(z), where the spline coefficient vector is γ = {γ_{j,k}, j = −ρ, . . . , J_n, k = 1, . . . , K}^T. Minimizing (6) is then equivalent to finding β and γ to minimize

Σ_{i=1}^n [R̂_i − φ{X_i^T β + γ^T b(Z_i)}]².   (8)

We denote the minimizers as β̂ and γ̂.
Then the spline estimator of h_0 is ĥ(z) = γ̂^T b(z), and the centered spline estimator of the component h_k is

ĥ_k(z_k) = Σ_{j=−ρ}^{J_n} γ̂_{j,k} b_{j,k}(z_k) − n^{-1} Σ_{i=1}^n Σ_{j=−ρ}^{J_n} γ̂_{j,k} b_{j,k}(Z_{ik}),   (9)

for k = 1, . . . , K. This estimation approach can also be easily implemented with standard regression routines in any statistical software.
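With φ = exp, the initial variance step is a nonlinear least-squares regression of the absolute residuals on exp{X^T β + h(Z)}. The sketch below illustrates this on toy data, approximating the nonparametric part by a simple centered quadratic rather than a spline basis for brevity; the true functions, constants, and seed are all illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
n = 600
x = rng.normal(size=n); x -= x.mean()
z = rng.uniform(size=n)

# Hypothetical model: linear mean, scale phi(v) = exp(v) with a centered
# nonparametric component sin(pi z) - 2/pi (its mean under U[0,1] is 2/pi).
eps = rng.normal(size=n) / np.sqrt(2 / np.pi)     # E|eps| = 1
phi0 = np.exp(-0.5 + 0.8 * x + np.sin(np.pi * z) - 2 / np.pi)
y = 1.0 + 2.0 * x + phi0 * eps

# Step 1: unweighted fit of the mean, then fitted absolute residuals R_i.
D = np.column_stack([np.ones(n), x])
a, *_ = np.linalg.lstsq(D, y, rcond=None)
R = np.abs(y - D @ a)

# Step 2: nonlinear least squares of R_i on exp{X_i^T beta + h(Z_i)},
# with h approximated here by a centered quadratic in z.
Hz = np.column_stack([z, z ** 2])
Hz -= Hz.mean(axis=0)

def resid(theta):
    b0, b1 = theta[:2]
    return R - np.exp(b0 + b1 * x + Hz @ theta[2:])

fit = least_squares(resid, x0=np.zeros(4))
beta_hat = fit.x[:2]            # estimates of (beta_0, beta_1)
```

Because E(R_i | X_i, Z_i) = φ{X_i^T β_0 + h_0(Z_i)} under E(|ε|) = 1, this regression is a consistent first-stage estimator of the variance parameters.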

More efficient estimators
After the initial estimates of h_0 and β_0 in the variance function are obtained, we can estimate g_0 and α_0 in the mean function more efficiently via generalized least squares. To this end, let Φ̂_i = φ{X_i^T β̂ + ĥ(Z_i)}, i = 1, . . . , n. Then g_0 and α_0 can be estimated more efficiently by the minimizers, ĝ_wls and α̂_wls, of the following sum of weighted squared residuals:

Σ_{i=1}^n {Y_i − X_i^T α − g(Z_i)}² / Φ̂_i².   (11)

Equivalently, if η̂_wls and α̂_wls are the minimizers of

Σ_{i=1}^n {Y_i − X_i^T α − η^T b(Z_i)}² / Φ̂_i²,   (12)

then ĝ_wls(z) = η̂_wls^T b(z), whose components can be centered as in (5).
Consequently, the absolute residuals can be updated as R̂_{i,wls} = |Y_i − X_i^T α̂_wls − ĝ_wls(Z_i)|. Then h_0 and β_0 can be estimated more efficiently by the minimizers, ĥ_wls and β̂_wls, of the following sum of weighted squared residuals:

Σ_{i=1}^n [R̂_{i,wls} − φ{X_i^T β + h(Z_i)}]² / Φ̂_i².   (13)

Equivalently, if γ̂_wls and β̂_wls are the minimizers of

Σ_{i=1}^n [R̂_{i,wls} − φ{X_i^T β + γ^T b(Z_i)}]² / Φ̂_i²,   (14)

then ĥ_wls(z) = γ̂_wls^T b(z), whose components can be centered as in (9).
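The efficiency gain from the weighted second step can be checked by a small Monte Carlo experiment. The sketch below compares the unweighted estimator with the infeasible weighted estimator that uses the true Φ_i (weights 1/Φ_i²) in a toy heteroscedastic linear model; all model specifics are hypothetical. The empirical variance of the weighted slope estimate is markedly smaller.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 300
slopes_ols, slopes_wls = [], []
for _ in range(reps):
    x = rng.normal(size=n); x -= x.mean()
    phi = np.exp(0.5 * x)                      # true scale, phi(v) = exp(v)
    eps = rng.normal(size=n) / np.sqrt(2 / np.pi)
    y = 1.0 + 2.0 * x + phi * eps
    D = np.column_stack([np.ones(n), x])

    # initial (unweighted) estimator
    a0, *_ = np.linalg.lstsq(D, y, rcond=None)

    # weighted estimator with weights 1/Phi_i^2 (infeasible: true Phi used)
    w = 1.0 / phi ** 2
    a1 = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * y))

    slopes_ols.append(a0[1]); slopes_wls.append(a1[1])

var_ols = np.var(slopes_ols)
var_wls = np.var(slopes_wls)
```

In practice Φ_i is replaced by the first-stage estimate Φ̂_i; the paper's simulations compare exactly these three variants (initial, weighted, infeasible).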

Theoretical Results
Let r be an integer and ν ∈ (0, 1], with p = r + ν > 1.5. Let H_{r,ν} be the collection of functions ξ on [0, 1] whose r-th derivative ξ^{(r)} exists and satisfies the Lipschitz condition of order ν:

|ξ^{(r)}(s) − ξ^{(r)}(t)| ≤ C|s − t|^ν,  for 0 ≤ s, t ≤ 1,

where, here and below, c and C are generic positive constants. To derive the theoretical results, we make the following assumptions.
Let Γ_2(z) = E{X/Φ² | Z = z}/E{1/Φ² | Z = z}, and let Γ_2^add(z) = Σ_{k=1}^K Γ_{2k}^add(z_k) be the projection of Γ_2 onto the Hilbert space of theoretically centered additive functions under the corresponding weighted inner product; Γ_3 and its projection Γ_3^add(z) = Σ_{k=1}^K Γ_{3k}^add(z_k) onto the Hilbert space of theoretically centered additive functions are defined analogously under its own weighted inner product, and likewise for Γ_m and Γ_m^add, m = 0, 1, 2, 3. We also make the following assumption on these centered additive projections.
(A6) The additive components in Γ_m^add satisfy Γ_{mk}^add ∈ H_{r,ν} for k = 1, . . . , K and m = 0, 1, 2, 3.

Remark 1. The convergence rate O_p{(J_n/n)^{1/2} + J_n^{−p}} enjoyed by the estimators of the nonparametric components in both the mean and variance functions is natural. A similar assumption on J_n was made, and a similar convergence rate obtained, in Wang et al. (2014). If J_n ≍ n^{1/(2p+1)}, then we obtain the optimal convergence rate n^{−2p/(2p+1)}.

Remark 2. Following the approach proposed in Newey (1994) and the theory developed in Bickel et al. (1993), we can show that, when ε is normally distributed, α̂_wls is the most efficient estimator in the sense of semiparametric efficiency.

Remark 3. Consider the estimators for α_0. The weighted estimator α̂_wls is more efficient than the initial estimator α̂. In this and the next remark, for simplicity, we ignore the factor n when comparing asymptotic variances.

Remark 4. Consider the estimators for β_0. The weighted estimator β̂_wls is more efficient than the initial estimator β̂. To see this, for simplicity, we only consider the special case where ε is symmetric. In this special case, E(D) = 0 and therefore the second term in each of Σ_β and Σ_{β,wls} becomes zero.
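The balancing of the two error terms in Remark 1 can be verified with a small exact-arithmetic check: with J_n = n^{1/(2p+1)}, the exponents of n in (J_n/n)^{1/2} and J_n^{−p} coincide, so neither the variance nor the bias term dominates (illustrated here for the integer value p = 2).

```python
from fractions import Fraction

def exponents(p):
    """Exponent of n in each error term when J_n = n^{1/(2p+1)}."""
    j = Fraction(1, 2 * p + 1)       # exponent of n in J_n
    half_term = (j - 1) / 2          # from (J_n / n)^{1/2}
    bias_term = -p * j               # from J_n^{-p}
    return half_term, bias_term

a, b = exponents(2)
# Both terms balance at n^{-p/(2p+1)}; the squared L2 rate is n^{-2p/(2p+1)}.
```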

Simulation Experiments
To assess the finite-sample performance of the methods proposed in Section 2, we simulate data from model (1), where G is the cumulative distribution function of the standard normal distribution, and generate responses accordingly; the resulting model is one example of model (1). In each case, 500 datasets are generated and fitted. We use cubic splines (ρ = 3) to approximate the nonparametric functions. The number of basis functions is set to 5 for n = 200, 400 and to 6 for n = 800. This choice applies to both the mean and variance functions, and to both the initial and the more efficient estimators. Although a data-adaptive choice of the number of interior knots could be developed, as in Wang and Yang (2007), Fan et al. (2011) and Lian et al. (2013), a fixed choice is much more convenient and is indeed adopted in most studies using B-splines for function estimation; for regression splines, a small number of basis functions is typically used in numerical studies. For both the mean and variance functions, we consider three estimators: the initial estimators (4) and (8), the weighted estimators (12) and (14), and the infeasible estimators in which Φ̂_i is replaced by the true Φ_i and R̂_{i,wls} is replaced by the true R_i in (12) and (14).
First, we examine the estimation errors, defined analogously for each nonparametric function. The estimation errors averaged over 500 datasets for each of the nine parameter settings are reported in Table 1, with the standard deviations of the errors shown in brackets. We see that both mean and variance estimation improve with larger sample sizes. While errors in mean estimation increase with the noise level, errors in variance estimation remain almost the same across noise levels. Most importantly, the updated, weighted estimators significantly improve on the initial estimators in all situations.

Table 1 Estimation errors (average errors with standard deviations inside brackets on 500 simulated datasets) for the simulated data sets.

In Figures 1-3, we show the estimated nonparametric functions on 20 generated datasets when n = 400, for the three noise levels respectively. The weighted estimators are clearly better than the initial estimators and are visually very similar to the infeasible estimators. For the case n = 400, σ = 0.2, we also show the squared bias and variance of the three estimators in Figure 4. We see that the improvement of the weighted estimators comes mostly from the reduction in variance. Furthermore, both bias and variance are relatively large close to the boundary.

Second, we consider estimation of standard errors for the parameters α_0 and β_0. Standard error estimates are easily obtained from the asymptotic normality results using the sandwich formula. On each generated dataset we compute these standard error estimates, and their averages over the 500 datasets are reported in Table 2, in rows labeled "s.e. (est)". The sample standard deviations of the estimated parameter values over the 500 datasets are reported in rows labeled "s.e. (emp)". The estimated standard errors are reasonably close to the empirical standard errors, especially when the sample size is large.
Third, we consider the coverage of the pointwise confidence intervals for the nonparametric functions. Our construction of the pointwise confidence intervals is again based on the sandwich formula. More specifically, since the nonparametric functions are approximated by linear combinations of known basis functions, we regard the model as parametric with parameters α, η, β, γ and we find the estimated covariance matrix of these parameters as in parametric models by the standard sandwich formula. Note that this ignores the bias term caused by the series expansion of nonparametric functions. However, it is a difficult problem to construct bona fide confidence intervals in additive models. Thus we just use this naive approach and mainly regard the intervals as "exploratory" in nature.
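Treating the spline-approximated model as parametric, the sandwich covariance has the familiar form (D^T D)^{-1} D^T diag(r̂_i²) D (D^T D)^{-1}. A sketch on toy heteroscedastic data, with a centered polynomial standing in for the spline basis; all data specifics are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n); x -= x.mean()
z = rng.uniform(size=n)
# Toy heteroscedastic data (hypothetical functions).
y = 1.0 + 2.0 * x + np.sin(2 * np.pi * z) + 0.3 * np.exp(0.5 * x) * rng.normal(size=n)

# "Parametric" design: intercept, x, and a centered polynomial stand-in
# for the spline basis (linear, quadratic, cubic terms, for brevity).
Pz = np.column_stack([z, z ** 2, z ** 3]); Pz -= Pz.mean(axis=0)
D = np.column_stack([np.ones(n), x, Pz])

coef, *_ = np.linalg.lstsq(D, y, rcond=None)
r = y - D @ coef                                   # residuals

# Sandwich covariance: (D'D)^{-1} D' diag(r^2) D (D'D)^{-1}
bread = np.linalg.inv(D.T @ D)
meat = D.T @ (r[:, None] ** 2 * D)
V = bread @ meat @ bread

se = np.sqrt(np.diag(V))
# 95% pointwise confidence interval for the linear coefficient of x:
ci = (coef[1] - 1.96 * se[1], coef[1] + 1.96 * se[1])
```

As noted above, this construction ignores the spline approximation bias, so the resulting intervals should be treated as exploratory.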
For illustration we only used n = 200, 400, 800 with σ = 0.2. Figures 5-7 show the empirical coverage of the 95% pointwise confidence intervals for the three sample sizes, respectively. When n = 200, the coverage is not satisfactory, especially close to the boundary in some cases, although it is always above 80%. For larger sample sizes, the coverage improves somewhat. Better construction of pointwise confidence intervals or simultaneous confidence bands for the nonparametric part is an interesting problem for future investigation.
Finally, we modify the above example in two ways. In the first modification, we change the mean functions to g_01(x) = −sin(15x) and g_02(x) = sin(15x) while other aspects of the model remain the same. This is used to explore, when the mean functions cannot be estimated well, whether the estimates of the variance functions will be seriously affected, and whether weighting will still improve the estimation. In the second modification, we change the normal errors in the original example to Student's t errors with 3 degrees of freedom and scale parameter σ. We only report the results for σ = 0.2 with n = 200, 400, 800 for these two modified examples. The estimation errors are reported in Table 3, and 20 estimated nonparametric functions are illustrated in Figures 8 and 9, respectively (for n = 400 only; in each figure, the first row is for the initial estimators and the second row is for the weighted estimators).

From Figure 8, we see that when the mean function cannot be estimated well, this causes obvious bias in the variance function estimates. Table 3 shows that the second-stage weighted estimator does not improve the estimation (for this particular case the second-stage estimator is actually even worse). This indicates that using totally wrong weights can be worse than using equal weights, because of the extra variability of the estimated weights; even using the correct weights does not help when the mean function estimates are seriously biased. In conclusion, a reasonably accurate estimate of the mean function is a requirement for the proposed method to work well. The poor performance in this example is mainly due to the fact that, with highly oscillating true functions, the spline approximation using only 5 or 6 basis functions results in a large bias. To illustrate this point, we increase the number of basis functions to 9; some estimated curves are shown in Figure 10. With a larger number of basis functions, the mean function estimates are now much more reasonable. However, a larger number of basis functions also makes the estimates of the variance functions more variable. In such examples, it is desirable to choose the number of basis functions adaptively according to the (unknown) smoothness of each function, although this is outside the scope of the current paper. For the second modification, using heavy-tailed errors, we still see that the second-stage estimator improves upon the initial estimator; with heavy-tailed errors, the estimates are obviously worse than those with normal errors.
Table 3 Estimation errors (average errors with standard deviations inside brackets on 500 simulated datasets) for the simulated data sets. The first block is for the case g_01(x) = −sin(15x) and g_02(x) = sin(15x), and the second block reports results when the errors follow a Student's t distribution.

Nutrition Data
We apply the proposed method to the dataset from a nutritional epidemiology study (Nierenberg et al., 1989), which investigated the relationships between plasma beta-carotene levels and personal characteristics, including AGE, SEX, BMI, and other factors: CALORIES (number of calories consumed per day), FAT (grams of fat consumed per day), FIBER (grams of fiber consumed per day), ALCOHOL (number of alcoholic drinks consumed per week), CHOL (cholesterol consumed, mg per day), BETADIET (dietary beta-carotene consumed, mcg per day), SMOKE2 (smoking status [1 = former smoker, 0 = never smoked]), and SMOKE3 (smoking status [1 = current smoker, 0 = never smoked]). We remove the predictor FAT since it is highly correlated with CALORIES. One extremely high leverage point in alcohol consumption is deleted prior to analysis, so the sample size of the dataset is n = 314. As in the simulations, we use cubic splines with 5 basis functions. By first fitting an APLM that puts all continuous predictors in the nonparametric part and the discrete predictors in the linear part, we find that AGE and CHOL seem to have nonlinear effects in the mean function, while ALCOHOL and FIBER seem to have nonlinear effects in the variance function. Thus we fit the model with AGE and CHOL in the nonparametric part of the mean and, at the same time, ALCOHOL and FIBER in the nonparametric part of the variance. The shapes of the estimated nonparametric functions for both the initial and the weighted estimators are shown in Figure 11. The nonparametric functions g_1(AGE) and g_2(CHOL) in the mean function are similar to those obtained in Liu et al. (2011). For the nonparametric functions h_1(ALCOHOL) and h_2(FIBER) in the variance function, the plots show a nonlinear contribution of ALCOHOL to the variance, while the effect of FIBER appears nonsignificant. The estimated coefficients and their standard errors are listed in Table 4.
The variables BMI, FIBER, SEX and SMOKE3 have significant effects in the mean function at the 0.05 level, while none has a significant effect in the variance function.

Discussion
Additive partial linear models have been well studied in the literature when heteroscedasticity is ignored. In this paper we investigate a broad class of models, variance function additive partial linear models. The flexibility of these models comes from the fact that the variance is not limited to being a known function of the mean. The models are useful in settings where estimating the variance function is of interest in its own right. They are also useful when estimating the mean function is of main interest, because taking heteroscedasticity into account improves the efficiency of estimating the mean function. Also, under even moderate heteroscedasticity, prediction intervals will not have correct coverage probabilities unless the variance is modeled properly. The polynomial-spline method we adopt for approximating the nonparametric components in both the mean and variance functions has at least three principal advantages. First, it avoids iterative algorithms and is therefore computationally convenient. Second, the resulting estimators of the linear components in both the mean and variance functions are still asymptotically normal. Third, it is very easy to conduct variable selection on the linear components in both the mean and variance functions, which we discuss next. On the other hand, spline estimators tend to be less accurate at the boundary than some other estimation methods such as local linear regression. As we show in our simulations, reasonably good mean function estimates are necessary for the weighted estimator to work well. Thus, if the boundary problem is a serious concern, one may consider local linear regression, at the cost of an increased computational burden.
If the number of predictors is large and the curse of dimensionality is a concern, we should consider variable selection. Fortunately, there is a wealth of literature on variable selection from the past two decades, and many existing variable selection procedures can be easily extended to our setting. For example, we can add sparsity penalty terms such as the LASSO (Tibshirani, 1996) or SCAD (Fan and Li, 2001) penalty to the objective functions (4) and (8), respectively.
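As a sketch of how an L1 penalty on the linear coefficients could be attached to an objective like (4), the following toy coordinate-descent LASSO penalizes only the linear columns of the design, leaving the spline-type columns unpenalized. The function name, tuning value, and data specifics are all illustrative, not part of the paper's procedure.

```python
import numpy as np

def lasso_cd(D, y, lam, penalized, n_iter=200):
    """Coordinate-descent LASSO that penalizes only the columns in `penalized`."""
    n, p = D.shape
    beta = np.zeros(p)
    col_ss = (D ** 2).sum(axis=0)
    r = y.copy()                      # running residual y - D @ beta
    for _ in range(n_iter):
        for j in range(p):
            r += D[:, j] * beta[j]    # add back column j's contribution
            rho = D[:, j] @ r
            if j in penalized:        # soft-threshold penalized coefficients
                beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0) / col_ss[j]
            else:                     # plain least-squares update otherwise
                beta[j] = rho / col_ss[j]
            r -= D[:, j] * beta[j]
    return beta

rng = np.random.default_rng(5)
n = 300
X = rng.normal(size=(n, 4))           # linear covariates; last two are inactive
z = rng.uniform(size=n)
Pz = np.column_stack([z, z ** 2]); Pz -= Pz.mean(axis=0)   # spline stand-in
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + np.sin(np.pi * z) + 0.3 * rng.normal(size=n)

D = np.column_stack([X, Pz])
b = lasso_cd(D, y, lam=0.15, penalized={0, 1, 2, 3})
```

With this penalty level, the inactive linear coefficients are shrunk exactly to zero while the active ones are retained (with the usual soft-thresholding bias).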
In addition, the asymptotic properties of weighted estimators for the mean function have been studied in the literature by many authors, and it is well known that, in general, weighted estimators are more efficient than unweighted estimators. However, the asymptotic properties of weighted estimators for the variance function have drawn much less attention. In this paper, we investigate the asymptotic properties of weighted estimators for the variance function in VF-APLMs and show that the weighted estimators are more efficient than the unweighted ones.
Finally, we emphasize that a great deal of effort should be put into deciding which predictors enter the linear components of the mean and variance functions. Scatterplots of the response variable against the predictors, or an initial fit using a fully additive model, can help make this decision for the mean function, as we demonstrate in Section 5. For notational simplicity, in the main text we assume that the same subset of predictors enters the linear components of both the mean and variance functions. Model (1) can easily be extended to allow different subsets of predictors in the linear components of the mean and variance functions, as we demonstrate in Section 5.

Appendix
Let ‖·‖ be the Euclidean norm. For a matrix A, denote its L_2 norm by ‖A‖_2 = sup_{u ≠ 0} ‖Au‖/‖u‖. Let ‖ξ‖_∞ = sup_z |ξ(z)| be the supremum norm of a function ξ on [0, 1].
Following Stone (1985) and Huang (2003), for any measurable functions ζ_1, ζ_2 on [0, 1]^K, we take the empirical inner product and the corresponding norm to be

⟨ζ_1, ζ_2⟩_n = n^{-1} Σ_{i=1}^n ζ_1(Z_i) ζ_2(Z_i),  ‖ζ‖_n² = ⟨ζ, ζ⟩_n,

where {Z_i} is a sample from density f. If ζ_1 and ζ_2 are L_2-integrable, take the theoretical inner product to be

⟨ζ_1, ζ_2⟩ = ∫_{[0,1]^K} ζ_1(z) ζ_2(z) f(z) dz,

with the corresponding induced norm ‖ζ‖_2² = ∫_{[0,1]^K} ζ²(z) f(z) dz. The empirical and theoretical norms of a univariate function ξ on [0, 1] are taken to be

‖ξ‖_{nk}² = n^{-1} Σ_{i=1}^n ξ²(Z_{ik}),  ‖ξ‖_{2k}² = ∫_0^1 ξ²(z_k) f_k(z_k) dz_k,

where f_k is the density of Z_k for k = 1, . . . , K. Define the centered version of the spline basis, with its standardized version given by (A.1), for any k = 1, . . . , K. Notice that finding the (η, α) that minimizes (4) is equivalent to finding the (η, α) that minimizes the same criterion with b replaced by B, where B(z) = {B_{j,k}(z_k), j = −ρ + 1, . . . , J_n, k = 1, . . . , K}^T. Then the spline estimator of g_0 is ĝ(z) = η̂^T B(z), and the centered spline estimator of a component function follows as before. Similarly, finding the (γ, β) that minimizes (8) is equivalent to finding the (γ, β) that minimizes the corresponding criterion with b replaced by B; then the spline estimator of h_0 is ĥ(z) = γ̂^T B(z), and the centered components follow analogously. In practice, the basis {b_{j,k}, j = −ρ, . . . , J_n, k = 1, . . . , K}^T is used for data-analytic implementation, while the basis (A.1) is convenient for asymptotic analysis.
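The empirical norm is a Monte Carlo version of the theoretical norm, so for a fixed function the two agree up to O_p(n^{-1/2}). A quick numerical check with f uniform on [0, 1] and the hypothetical function ξ(z) = z − 1/2, whose theoretical squared norm is ∫_0^1 (z − 1/2)² dz = 1/12:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
Z = rng.uniform(size=n)          # density f is uniform on [0, 1]

def xi(z):
    return z - 0.5               # a centered univariate function

# Empirical norm: ||xi||_n^2 = n^{-1} sum_i xi(Z_i)^2
emp_norm2 = np.mean(xi(Z) ** 2)

# Theoretical norm: ||xi||_2^2 = integral of xi(z)^2 f(z) dz = 1/12 here
theo_norm2 = 1 / 12
```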

A.1. Proof of Theorem 1
The proof of Theorem 1 is similar to that of Theorem 1 in Liu et al. (2011), except that, in proving their theorem, Liu et al. (2011) assumed that the intercept α_0 is zero and that Γ_0 = Γ_0^add in their Assumption (C5). Here we partition the predictor vector as X = (1, X*^T)^T to relax their zero-intercept assumption, and we define X_{\0} as X − Γ_0^add(Z) to relax their assumption that Γ_0 = Γ_0^add; in Liu et al. (2011), X_{\0} was defined as X − Γ_0(Z).

A.2. Proof of Theorem 2
According to the result of de Boor (2001, page 149), for any function ξ ∈ H_{r,ν} and n ≥ 1, there exists a function ξ̃ ∈ S_n such that ‖ξ̃ − ξ‖_∞ ≤ CJ_n^{−p}. For h_0 satisfying (A1), we can find γ̃ = {γ̃_{j,k}, j = −ρ + 1, . . . , J_n, k = 1, . . . , K}^T and an additive spline function h̃ = γ̃^T B(z) ∈ A_n such that ‖h̃ − h_0‖_∞ ≤ CJ_n^{−p}. In the following, let

Proof of Lemma 1. Let δ = √n(β̂ − β_0). According to (A.3), δ minimizes the corresponding criterion. By expansion, the first term on the right-hand side of the above can be further expressed as in (A.4). Consider the first summation in (A.4). Using an identity in Knight (1998, p. 758), we obtain a decomposition of this summation. By Lemma A.5 in Liu et al. (2011), term (A.6) equals o_p(1). By Lemma 7 of Stone (1986) and Theorem 1, ‖ĝ − g_0‖_∞ ≤ CJ_n^{1/2} ‖ĝ − g_0‖_2 = o_p(n^{−1/6}). Hence, max_{1≤i≤n} S_i = o_p(n^{−1/6}). Let a_n = n^{1/6}. Following the proof of Theorem 1 in Knight (1998), we can establish the corresponding bound. Noting that Φ_i^{(1)} X_i^T δ is bounded, term (A.8) equals O_p(a_n^{−1}), which is o_p(1). Combining the results for (A.6)-(A.8), we obtain the limit of the first summation. The second summation in (A.4) equals −n^{−1/2} Σ_i e_i Φ_i^{(1)} X_i^T δ. By Taylor expansion, the third summation in (A.4) equals a term which, again by Lemma A.5 in Liu et al. (2011), is o_p(1). In addition, combining the above, we obtain the quadratic expansion in δ. By Assumption (A5), Q_β > 0. Then for any τ > 0, there exists a large constant C such that, with probability at least 1 − τ, there exists a local minimum in the ball {β_0 + n^{−1/2}δ : ‖δ‖ ≤ C}. Hence, there exists a local minimizer such that ‖β̂ − β_0‖ = O_p(n^{−1/2}). By the expression of √n(α̂ − α_0) in (16), the second part of the lemma can be derived easily.
This implies that there exists a local minimizer such that ‖θ̂ − θ̃‖ = O_p(a_n), where the last equality follows from ‖h̃ − h_0‖_2 = O_p(J_n^{−p}). By Lemma 1 of Stone (1985), ‖ĥ_k − h_{0k}‖_{2k} = O_p{(J_n/n)^{1/2} + J_n^{−p}}, 1 ≤ k ≤ K. By Lemma A.8 in Wang and Yang (2007), we have the corresponding bound for the empirical norm.