Backfitting and Smooth Backfitting in Varying Coefficient Quantile Regression

In this paper, we study ordinary backfitting and smooth backfitting as methods of fitting varying coefficient quantile models. We do this in a unified framework that accommodates various types of varying coefficient models. Our framework also covers the additive quantile model as a special case. Under a set of weak conditions, we derive the asymptotic distributions of the backfitting estimators. We also briefly report on the results of a simulation study.


INTRODUCTION
There have been a number of recent developments for varying coefficient models, which originated with Hastie and Tibshirani (1993). Several types of varying coefficient models have been introduced as extensions of the classical linear model that allow the regression coefficients to be nonparametric functions of some variable(s). A more flexible form takes different smoothing or effect-modifying variables for different coefficient functions (i.e., it lets different regression coefficients be functions of different variables). This type is a structural regression model, which is known to be an effective way of avoiding the curse of dimensionality when there are many covariates. For this type, most work has focused on mean regression. Recent works of this type include Yang et al. (2006), Roca-Pardinas and Sperlich (2010) and Lee et al. (2012a). Park et al. (2013) provided a comprehensive review of several different types of varying coefficient models, along with various statistical problems and methods associated with those models. Lee et al. (2012b) proposed and studied a unifying approach to all types of varying coefficient models, with powerful techniques for fitting the fully flexible model.
Varying coefficient models have many applications in economics. In a wage model in labour economics, for example, it might be desirable to allow marginal returns to education to vary with the level of schooling and with work experience. For other applications, see Hong and Lee (2003), Li et al. (2002) and Cai et al. (2006), for example. In these applications, however, quantile regression is often a more natural or more informative approach than mean regression. Quantile regression gives a picture of the entire conditional distribution and can be used to build conditional prediction intervals. It also provides a useful tool for verifying the presence of conditional heteroscedasticity; see, e.g., Furno (2004). There have been only a few works on varying coefficient quantile regression. Among these, Kim (2007) and Wang et al. (2009) considered sieve estimation, and Honda (2004) worked on kernel estimation. These works considered the single smoothing variable case (i.e., the case where all coefficients are functions of the same smoothing variable). In kernel estimation of models of this type, the standard technique applies and directly gives the proper estimators of the coefficient functions, so there is no need for backfitting.
In this paper, we discuss quantile regression for the fully flexible varying coefficient model, which was introduced and studied by Lee et al. (2012b) in the mean regression context. We consider the application of the ordinary backfitting (BF) and smooth backfitting (SBF) approaches. The two procedures were proposed by Buja et al. (1989) and Mammen et al. (1999), respectively, in the additive mean regression setting. In that setting, the theory for BF was developed by Opsomer and Ruppert (1997), and the SBF approach was further studied in related problems; see, e.g., Mammen and Park (2005), Mammen and Park (2006) and Yu et al. (2008). The marginal integration technique of Linton and Nielsen (1995), another way of fitting additive models, is not considered here because it is widely accepted that it suffers from the curse of dimensionality. We provide backfitting algorithms for BF and SBF and discuss their statistical properties in the quantile regression setting. In particular, we show that both backfitting methods produce rate-optimal estimators, and we derive their asymptotic distributions.
For covariates X_1, . . . , X_D, we write X = (X_1, . . . , X_D)^⊤ and we assume that the conditional α-quantile function m(x) of Y given X = x satisfies

m(x) = ∑_{j=1}^d x_j ∑_{k∈I_j} f_{jk}(x_k),   (1.1)

where d ≤ D and the index sets I_j ⊂ {1, 2, . . . , D} are known and need not be disjoint. Without loss of generality, we assume that each I_j does not include j. The reason is that we allow X_l ≡ 1 for some 1 ≤ l ≤ d, and for such X_l we may absorb X_j f_{jj}(X_j) into X_l ∑_{k∈I_l} f_{lk}(X_k) in the form X_l g(X_j), where g(X_j) = X_j f_{jj}(X_j). The level α ∈ (0, 1) of the quantile is fixed throughout the paper. The covariates that enter one of the coefficient functions f_{jk} (i.e., X_k for k ∈ C ≡ ∪_{j=1}^d I_j) are of continuous type. For simplicity, we assume that the X_k with k ∈ C are supported on the interval [0, 1]. We allow some of the covariates X_j, for 1 ≤ j ≤ d, to be discrete random variables. In particular, the model may include a variable X_j ≡ 1 for one j ∈ {1, 2, . . . , d}. We also allow C and {1, 2, . . . , d} to have common indices.
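To fix ideas, here is a minimal R sketch that evaluates a toy instance of the model (1.1); the index sets and the coefficient functions below are hypothetical choices made purely for illustration and are not taken from the paper.

```r
## Toy instance of (1.1) with D = 3, d = 2 and I_1 = I_2 = {3}:
## m(x) = x_1 f_13(x_3) + x_2 f_23(x_3).
f13 <- function(u) sin(2 * pi * u)  # hypothetical coefficient function
f23 <- function(u) u^2              # hypothetical coefficient function

m <- function(x) x[1] * f13(x[3]) + x[2] * f23(x[3])

m(c(1, -0.5, 0.25))  # evaluate the quantile function at one design point
```

Setting x_1 ≡ 1 in this toy model illustrates how an additive component can be absorbed into the varying coefficient form, as described above.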
The form of the model (1.1) unifies various types of varying coefficient models. First, it reduces to the single smoothing variable case studied by Kim (2007), Wang et al. (2009) and Honda (2004) if one takes D = d + 1 and I_j = {d + 1} for all 1 ≤ j ≤ d. It also covers the varying coefficient regression models of Xue and Yang (2006) and Lee et al. (2012a), where {X_1, . . . , X_d} and the group of smoothing variables {X_k : k ∈ C} have no variable in common (i.e., the set C_0 ≡ C ∩ {1, . . . , d} is empty). Furthermore, the framework includes even the additive quantile model studied by Horowitz and Lee (2005) and Lee et al. (2010) as a special case, with the choice d = 1 and X_1 ≡ 1.
The functions f_{jk} in the representation (1.1) are not identifiable. To make all f_{jk} identifiable, we impose the following constraints for non-negative weight functions w_k:

∫_0^1 f_{jk}(x_k) w_k(x_k) dx_k = 0 for 1 ≤ j ≤ d and k ∈ I_j,
∫_0^1 f_{jk}(x_k) x_k w_k(x_k) dx_k = 0 for pairs (j, k) with k ∈ I_j and j ∈ I_k.   (1.2)

The first constraint of (1.2) is typical in additive models; it rules out the two different representations f_{j1}(x_1) + f_{j2}(x_2) = g_{j1}(x_1) + g_{j2}(x_2), for example, where g_{j1} = f_{j1} + c and g_{j2} = f_{j2} − c for some constant c. The second constraint is also required. It rules out the two different representations x_1 f_{12}(x_2) + x_2 f_{21}(x_1) = x_1 g_{12}(x_2) + x_2 g_{21}(x_1), for example, where g_{12}(x_2) = f_{12}(x_2) + x_2 and g_{21}(x_1) = f_{21}(x_1) − x_1. With these constraints, all f_{jk} are identifiable; see Lee et al. (2012b) for a proof. We can also rewrite the model (1.1) as

m(x) = ∑_{j=1}^d a_j x_j + ∑_{j<k; j,k∈C_0} a_{jk} x_j x_k + ∑_{j=1}^d x_j ∑_{k∈I_j} f_{jk}(x_k),   (1.3)

where the f_{jk} satisfy (1.2).
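As an illustration of how the constraints (1.2) can be enforced in practice, the following R sketch removes the weighted constant and linear parts from a coefficient function evaluated on a grid; the function names and the Riemann-sum approximation are our own illustrative choices and do not reproduce the normalization (4.7) of Lee et al. (2012b).

```r
## Hedged sketch: weighted least squares residuals are orthogonal to {1, x}
## under the weights w_k, which is a discrete version of the two constraints
## in (1.2). The linear part needs removing only for pairs (j, k) with both
## x_j f_jk(x_k) and x_k f_kj(x_j) in the model.
normalize_fjk <- function(fvals, grid, w) {
  wv <- w(grid)                          # weight function w_k on the grid
  fit <- lm(fvals ~ grid, weights = wv)  # weighted projection onto {1, x}
  residuals(fit)                         # constant and linear parts removed
}

w_ex  <- function(u) u * (1 - u)         # a weight with w(0) = w(1) = 0
grid  <- seq(0, 1, length.out = 101)
fvals <- sin(2 * pi * grid) + 1 + grid   # unnormalized function values
f_new <- normalize_fjk(fvals, grid, w_ex)
```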
In the description of our methods and in our theory, we also make use of a different representation of the model (1.3). In this representation, we collect the coefficients that are functions of the same continuous covariate and put them together as an additive component. Suppose that, among X_1, . . . , X_d in the model (1.3), there are r (0 ≤ r ≤ d) variables whose indices do not enter into C. Without loss of generality, we denote these by X_1, . . . , X_r. Let p = D − r ≥ 2 be the number of indices in C. Thus, C = {r + 1, . . . , r + p} and C_0 = {r + 1, . . . , d}. The case p = 1 is not considered in this paper because, in that case, no backfitting is needed. Define

X̃_k = (X_j : 1 ≤ j ≤ d, r + k ∈ I_j),   1 ≤ k ≤ p.   (1.4)

The vector X̃_k is the collection of all X_j, for 1 ≤ j ≤ d, that have interactions with X_{r+k} in the form X_j f_{j,r+k}(X_{r+k}). Thus, X̃_k does not include X_{r+k} as an element because the index set I_j does not contain j. Let d_k denote the number of index sets I_j that contain r + k, so that X̃_k is d_k-dimensional. Likewise, for a given vector x, we denote the above rearrangements of its entries by x̃_k. Collecting the coefficient functions f_{j,r+k} that correspond to the entries of X̃_k into a vector f_k, we may rewrite the model (1.3) as

m(x) = ∑_{j=1}^d a_j x_j + ∑_{j<k; j,k∈C_0} a_{jk} x_j x_k + ∑_{k=1}^p x̃_k^⊤ f_k(x_{r+k}).   (1.5)

In the next section, we describe the BF and SBF algorithms that fit the model (1.5). Section 3 is devoted to the asymptotic theory and states our main theorem. The numerical properties of the backfitting estimators are presented in Section 4.

BACKFITTING METHODS
We introduce two kernel estimators, one based on ordinary BF and the other based on the SBF approach. Model (1.1) can be rewritten as (1.3) with the constraints (1.2). Both the ordinary BF and the SBF estimator require starting values f̂^{BF,[0]}_j and f̂^{SBF,[0]}_j (1 ≤ j ≤ p), respectively. Furthermore, we need initial estimators â^{l,[0]}_j and â^{l,[0]}_{jk}, for l ∈ {BF, SBF}, of a_j (1 ≤ j ≤ d) and a_{jk} (j < k; j, k ∈ C_0). We assume that the initial estimators f̂^{BF,[0]}_j and f̂^{SBF,[0]}_j fulfil the constraints (1.2), but their updates f̂^{BF,[k]}_j and f̂^{SBF,[k]}_j (k ≥ 1), defined below, might not fulfil (1.2). We do not update the estimators of a_j and a_{jk}; see (2.1) and (2.2). After stopping the iteration of a backfitting algorithm after k* steps, say, it is easy to adjust the last updated estimators, f̂^{BF,[k*]}_j or f̂^{SBF,[k*]}_j, by subtracting constants or linear functions so that (1.2) is fulfilled for the normalized estimators; we refer to (4.7) in Lee et al. (2012b) for the normalizing procedure. We denote the resulting estimators by f̂^{BF}_j and f̂^{SBF}_j, respectively. We discuss the choice of k* below. The final estimators â^l_j and â^l_{jk} of a_j and a_{jk} are simply â^l_j = â^{l,[0]}_j and â^l_{jk} = â^{l,[0]}_{jk}. For theoretical purposes, we make some assumptions on the starting functions f̂^{BF,[0]}_j and f̂^{SBF,[0]}_j; see Assumptions 3.5 and 3.6 in Section 3. Some choices that fulfil these conditions are discussed in Section 2 of Lee et al. (2010) in the setting of the additive model, and similar choices can be made for our varying coefficient model. In their Section 2, Lee et al. (2010) also discussed the possibility of relaxing the conditions; see also the remark given immediately after Theorem 3.1 in Section 3. For given starting functions f̂^{BF,[0]}_j and f̂^{SBF,[0]}_j, the initial estimators â^{l,[0]}_j and â^{l,[0]}_{jk} for l ∈ {BF, SBF} can be obtained by minimizing

∑_{i=1}^n τ_α( Y^i − ∑_{j=1}^d a_j X^i_j − ∑_{j<k; j,k∈C_0} a_{jk} X^i_j X^i_k − ∑_{ℓ=1}^p (X̃^i_ℓ)^⊤ f̂^{l,[0]}_ℓ(X^i_{r+ℓ}) )

over the a_j and a_{jk}, where τ_α denotes the so-called check function, defined by τ_α(u) = u{α − I(u < 0)}. In our simulation study in Section 4, we have taken f̂^{BF,[0]}_j = f̂^{SBF,[0]}_j = 0 for all 1 ≤ j ≤ p and found that this choice works very well.
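To make the initial step concrete, here is a hedged R sketch of the minimization above using quantreg::rq, with the starting functions set to zero as in our simulations; the data-generating process and all variable names are hypothetical.

```r
## With f-hat^{l,[0]} = 0, the criterion reduces to a quantile regression that
## is linear in (a_j, a_jk), which rq() solves for the check loss tau_alpha.
library(quantreg)

set.seed(1)
n <- 400; alpha <- 0.5
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- runif(n)      # hypothetical covariates
Y  <- X1 * sin(2 * pi * X3) + X2 * X3^2 + rnorm(n)  # hypothetical model

## the check function tau_alpha(u) = u * (alpha - I(u < 0))
tau_alpha <- function(u, alpha) u * (alpha - (u < 0))

## a_1, a_2 and, illustratively, an interaction coefficient a_12
init <- rq(Y ~ X1 + X2 + X1:X2 - 1, tau = alpha)
coef(init)
```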
The ordinary BF algorithm runs in iterative cycles. Let K_g be a kernel function with bandwidth g (see Assumption 3.2); for the jth component we use the bandwidth h_j. Write

Ỹ^i = Y^i − ∑_{j=1}^d â^{l,[0]}_j X^i_j − ∑_{j<k; j,k∈C_0} â^{l,[0]}_{jk} X^i_j X^i_k

for the response adjusted by the initial parametric estimates. Then, in the jth step of the kth cycle, the estimator f̂^{BF,[k]}_j is updated as follows:

f̂^{BF,[k]}_j(u) = argmin_{θ∈Θ_j} ∑_{i=1}^n τ_α( Ỹ^i − (X̃^i_j)^⊤ θ − ∑_{ℓ<j} (X̃^i_ℓ)^⊤ f̂^{BF,[k]}_ℓ(X^i_{r+ℓ}) − ∑_{ℓ>j} (X̃^i_ℓ)^⊤ f̂^{BF,[k−1]}_ℓ(X^i_{r+ℓ}) ) K_{h_j}(u, X^i_{r+j}).   (2.1)

To simplify the mathematical arguments, the minimization in (2.1) is undertaken over a compact set Θ_j ⊂ R^{d_j}. It is assumed that all values of the function f_j lie in the interior of Θ_j. As in the case of mean regression, the ordinary BF estimator is not defined as the solution of a global minimization problem. The SBF estimator is also computed by an iterative scheme. The estimate of f_j is updated by the following minimization:

f̂^{SBF,[k]}_j(u_j) = argmin_{θ∈Θ_j} ∑_{i=1}^n ∫ τ_α( Ỹ^i − (X̃^i_j)^⊤ θ − ∑_{ℓ<j} (X̃^i_ℓ)^⊤ f̂^{SBF,[k]}_ℓ(u_ℓ) − ∑_{ℓ>j} (X̃^i_ℓ)^⊤ f̂^{SBF,[k−1]}_ℓ(u_ℓ) ) ∏_{ℓ=1}^p K_{h_ℓ}(u_ℓ, X^i_{r+ℓ}) du_{−j},   (2.2)

where du_{−j} indicates integration over all u_ℓ with ℓ ≠ j and the integration is over the support of (X^i_{r+1}, . . . , X^i_{r+p}). As mentioned above, we assume for theoretical purposes that the iterative algorithm stops after a finite number k* of cycles; see Theorem 3.1 and the discussion that follows it for a theoretical choice of k*. A stopping rule in practice would be a value k* for which the difference between the updates at the (k* − 1)th cycle and the k*th cycle is sufficiently small. One measure of the difference is

∑_{j=1}^p ∫_0^1 ‖ f̂^{l,[k*]}_j(u) − f̂^{l,[k*−1]}_j(u) ‖^2 du.

In this case, we should use the normalized versions of the updates that satisfy the constraints (1.2). Comparing the two backfitting equations (2.1) and (2.2), we see that the SBF estimators require multiple integration while the BF estimators do not. The dimension of the multiple integration for SBF is p − 1, so the computational cost increases as p becomes large. This is a drawback of the SBF method in quantile regression; it does not arise in the mean regression setting, where only a one-dimensional integration is needed regardless of the dimension p. It is possible to speed up the computation for SBF by applying a well-devised Monte Carlo method for the numerical integration; for a discussion of this issue in the setting of quasi-likelihood additive regression, we refer to Section 5 of Yu et al. (2008). The SBF method, however, gives more stable estimators, as we have found in our simulation study in Section 4: it produces estimators with better mean integrated squared error performance than the BF method in all settings of the study.
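The following R sketch shows how a single ordinary BF update of the form (2.1), at one point u, can be carried out through a kernel-weighted call to quantreg::rq. It assumes a one-dimensional X̃_j and a plain (not boundary-corrected) Epanechnikov kernel, so it illustrates the idea rather than the full algorithm; all names are our own.

```r
library(quantreg)

epan <- function(v) pmax(0.75 * (1 - v^2), 0)  # Epanechnikov kernel on [-1, 1]

## One local update of f_j at the point u. `offset` carries the contribution
## of the other components and the parametric part, held fixed in this step;
## `Z_j` holds the observations of the smoothing variable X_{r+j}.
bf_update_at <- function(u, Y, Xtil_j, Z_j, offset, h, alpha) {
  w <- epan((u - Z_j) / h) / h                 # kernel weights K_h(u, Z_j^i)
  keep <- w > 0                                # keep strictly positive weights
  dat <- data.frame(R = (Y - offset)[keep], X = Xtil_j[keep], w = w[keep])
  fit <- rq(R ~ X - 1, tau = alpha, weights = w, data = dat)
  coef(fit)                                    # updated value of f_j(u)
}
```

A full implementation would loop this update over a grid of points u and over the components j = 1, . . . , p within each cycle, and for SBF would in addition discretize the (p − 1)-dimensional integral in (2.2).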

ASYMPTOTIC THEORY
We develop a complete asymptotic theory for these estimators. We do this for the case where i.i.d. observations (X^i, Y^i) are made on the random vector (X, Y). To keep the presentation simple, we assume that a_j ≡ 0 and a_{jk} ≡ 0, and we also take â^{l,[0]}_j = 0 and â^{l,[0]}_{jk} = 0 for l ∈ {BF, SBF}. The theory we present here remains valid in the general case because a_j and a_{jk} can be estimated at a faster rate than the functions f_{jk}.
To discuss the theoretical properties of the BF estimators, we make the following assumptions.
ASSUMPTION 3.1. The product measure ⊗_{j=1}^D P_{X_j} has a density with respect to the distribution P_X of X that is bounded away from zero and infinity on the support of P_X. Here, P_{X_j} is the marginal distribution of X_j. The marginal distributions are absolutely continuous with respect to Lebesgue measure, or they are discrete measures with finite support. Furthermore, the weight functions w_j for j ∈ C in the constraints (1.2) are chosen so that w_j/p_{X_j} is bounded away from zero and infinity on the support of P_{X_j}, where p_{X_j} is the density of X_j. The smallest eigenvalues of the matrices E[X̃_j X̃_j^⊤ | X_{r+j} = z_j] for 1 ≤ j ≤ p are bounded away from zero for z_j in the support of p_{X_{r+j}}. The densities p_{X_j} for r + 1 ≤ j ≤ r + p are bounded away from zero on [0, 1]. The weight functions w_j are continuously differentiable and fulfil w_j(0) = w_j(1) = 0, w_j(x_j) ≥ 0 for x_j ∈ [0, 1] and ∫_0^1 w_j(x_j) dx_j > 0 for j ∈ C.
Assumption 3.1 is a modification of Assumption A0 in Lee et al. (2012b). It guarantees that, for each function m, there exists only one tuple (f_1, . . . , f_p) satisfying (1.5) subject to the constraints (1.2).
We now make assumptions on the kernels and the order of the bandwidths. We use boundary-corrected kernels. Assumption 3.2, in particular, concerns the shape of the kernel at the boundary. Note that we need different kinds of boundary correction for the ordinary BF and for the SBF estimator.
ASSUMPTION 3.2. There exist constants c_K, C_D > 0 and C_K, C_S, C'_S > 0 such that: (a) for all u ∈ [0, 1], the kernel K_g(u, ·) is positive, bounded by C_K g^{−1}, has support contained in [u − C_S g, u + C'_S g] and is Lipschitz continuous with Lipschitz constant bounded by C_K g^{−2}; (b) the kernel K_g(u, ·) has a second derivative with respect to u that is bounded by C_D g^{−3}; (c) away from the boundary, the kernel fulfils K_g(u, v) = g^{−1} K((u − v)/g) for a function K with ∫ K(v) dv = 1 and ∫ v K(v) dv = 0; (d) in the case of the ordinary BF, the kernel satisfies ∫ K_g(u, v) dv = 1 and ∫ K_g(u, v)(u − v) dv = 0, while in the case of the SBF, ∫ K_g(u, v) du = 1.
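For concreteness, the following R sketch implements one possible boundary correction satisfying the SBF normalization in part (d), ∫ K_g(u, v) du = 1; it starts from an Epanechnikov base kernel and is our own illustrative construction, as other corrections also satisfy Assumption 3.2.

```r
epan <- function(v) pmax(0.75 * (1 - v^2), 0)  # base kernel K

## Boundary-corrected kernel: for each v, rescale so that the integral of
## K_g(., v) over u in [0, 1] equals 1 (the SBF case of Assumption 3.2(d)).
K_g <- function(u, v, g) {
  raw  <- epan((u - v) / g) / g
  norm <- integrate(function(t) epan((t - v) / g) / g, 0, 1)$value
  raw / norm
}

K_g(0.50, 0.50, g = 0.1)  # interior: correction is negligible
K_g(0.02, 0.01, g = 0.1)  # near the boundary: the kernel is rescaled
```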
ASSUMPTION 3.4. The conditional density p_{ε|X}(0|x) of ε ≡ Y − m(X) given X = x is bounded away from zero and infinity on the support of P_X. Furthermore, it satisfies the uniform Lipschitz condition |p_{ε|X}(e|x) − p_{ε|X}(0|x)| ≤ C_1 |e|, for x in the support of P_X and for e in a neighbourhood of 0, with a constant C_1 > 0 that does not depend on x. The partial derivative ∂p_X(x)/∂x_c of the joint density function p_X exists and is continuous in x_c ≡ (x_{r+1}, . . . , x_{r+p}) for all (x_1, . . . , x_r). The components of f_j are twice continuously differentiable.
ASSUMPTION 3.5. For j = 1, . . . , p, it holds for i = BF and i = SBF that sup_{u∈[0,1]} ‖ f̂^{i,[0]}_j(u) − f_j(u) ‖ = O_P(n^{−2/5+ξ}).

ASSUMPTION 3.6. There exist random functions g_1, . . . , g_p with derivatives that fulfil the Lipschitz condition ‖ g'_j(u) − g'_j(v) ‖ ≤ C |u − v|^{δ_1} for j = 1, . . . , p and u, v ∈ [0, 1]. Furthermore, these functions satisfy f̂^{i,[0]}_j(u) = f_j(u) + n^{−2/5+ξ} g_j(u) for j = 1, . . . , p.

For our final assumption, we need some additional notation. Put

W_{jk}(z_j, z_k) = E[ p_{ε|X}(0|X) X̃_j X̃_k^⊤ | X_{r+j} = z_j, X_{r+k} = z_k ] p_{j,k}(z_j, z_k),
W_{jj}(z_j) = E[ p_{ε|X}(0|X) X̃_j X̃_j^⊤ | X_{r+j} = z_j ] p_j(z_j).

Here, p_j denotes the density of X_{r+j} and p_{j,k} denotes the density of (X_{r+j}, X_{r+k}). Note that Assumptions 3.1 and 3.4 imply that the smallest eigenvalues of the matrices W_{jj}(z_j), for 1 ≤ j ≤ p, are bounded away from zero for z_j ∈ [0, 1].
ASSUMPTION 3.7. The elements of the matrices W_{jj}(z_j) and W_{jk}(z_j, z_k) have second-order derivatives with respect to z_j that are bounded over z_j ∈ [0, 1] and z_k ∈ [0, 1], 1 ≤ j ≠ k ≤ p. Furthermore, the elements of W_{jk}(z_j, z_k) and their first-order derivatives with respect to z_j are continuous in z_k.
We now introduce notation for the asymptotic biases and variances of the backfitting estimators. Define p^{(1)}_{j,X}(x) = ∂p_X(x)/∂x_{r+j} and p^{(1)}_{j,ε}(0|x) = ∂p_{ε|X}(0|x)/∂x_{r+j}, and denote the first and second derivatives of f_j by f'_j and f''_j, respectively. In the bias expressions, δ_{jk} denotes a vector that has the same dimension as X̃_j and has elements δ_{jk,ℓ} such that δ_{jk,ℓ} = 1 if, in our rearrangement (1.4), the ℓth element of x̃_j equals x_{r+k}, and δ_{jk,ℓ} = 0 otherwise. Thus, if x̃_j does not contain x_{r+k} as an element, then δ_{jk} is a zero vector. For the ordinary BF estimator, let β^{*,BF}(z) = (β^{*,BF}_j(z_j) : 1 ≤ j ≤ p) be the solution of the associated system of integral equations, and let β^{BF}_j(z_j) be the normalized versions of β^{*,BF}_j(z_j) that satisfy the constraints (1.2). For the SBF estimator, define β^{*,SBF}(z) = (β^{*,SBF}_j(z_j) : 1 ≤ j ≤ p) to be the solution of the corresponding SBF system, and let β^{SBF}_j(z_j) be the normalized versions of β^{*,SBF}_j(z_j) that satisfy the constraints (1.2). We write Σ_j(z_j) for the asymptotic variance matrices appearing in Theorem 3.1 below.
THEOREM 3.1. Assume that Assumptions 3.1-3.7 hold with ξ ≥ 0 and δ_1, δ_2 > 0 small enough. Let p_{X_c} denote the density of X_c ≡ (X_{r+1}, . . . , X_{r+p}). Then, for f̂^i_j = f̂^{i,[k*]}_j (i = BF and i = SBF) with k* = C_{iter,i} log n for an appropriate choice of C_{iter,i}, we obtain that, for all z in the interior of the support of p_{X_c}, the variables n^{2/5}( f̂^i_j(z_j) − f_j(z_j) ), 1 ≤ j ≤ p, are jointly asymptotically normal with mean ( β^i_1(z_1)^⊤, . . . , β^i_p(z_p)^⊤ )^⊤ and variance matrix diag( Σ_j(z_j) ).
We conjecture that the ordinary BF and the SBF methods also work for less accurate starting values f̂^{i,[0]}_j and for the limits of the iterative backfitting algorithms (k* = ∞), and that the same asymptotic limit applies as in Theorem 3.1. For two reasons, our proof needs f̂^{i,[0]}_j to converge with a certain rate to the true function and k* to be finite. First, in the proof we approximate the non-smooth and non-linear criteria given at (2.1) and (2.2) by smooth linear approximations; at that point, we need the initial functions to be not far from the true functions. Second, for the BF method in additive mean regression, one can show that the BF algorithm converges at a geometric rate; the main argument behind this claim is that the updating operator is a contraction. Updating algorithms for the minimizers of the smooth linear approximations mentioned above also converge at a geometric rate, and the limits of these algorithms have the asymptotic distributions given in Theorem 3.1. However, the smooth linear approximations of the original non-smooth and non-linear criteria produce errors in the resulting estimators, and these approximation errors accumulate as the iteration goes on. For this reason, we need to stop after a finite number of iterations in order to keep these errors negligible in comparison with the first-order estimation error, which is of order n^{−2/5}. As can be seen from Theorem 3.1, we allow the number of iterations to increase with the sample size and choose an appropriate rate that gives the same asymptotic distributions as the limits of the approximating smooth linear backfitting algorithms.

NUMERICAL RESULTS
In this section, we briefly report the results of our simulation study of BF and SBF. We considered a model of the form (1.1) with two varying coefficient regressors X_1 and X_2, two continuous smoothing variables X_3 and X_4, and Gaussian errors. There were two scenarios for the choice of the covariates X_j: (1) the (X^i_1, X^i_2) were i.i.d. from N(0, I), where I is the 2 × 2 identity matrix, and the (X^i_3, X^i_4), independent of (X^i_1, X^i_2), were i.i.d. from N((1/2)1, I) truncated on [0, 1]^2, where 1 = (1, 1)^⊤; (2) the (X^i_1, X^i_2) were i.i.d. from N(0, J_1), where J_1 is the 2 × 2 matrix with unit diagonal entries and off-diagonal entries 0.7, so that the correlation between X^i_1 and X^i_2 was 0.7, and the (X^i_3, X^i_4), independent of (X^i_1, X^i_2), were i.i.d. from N((1/2)1, J_2) truncated on [0, 1]^2, where J_2 has unit diagonal entries and off-diagonal entries 0.9. The error terms U^i were i.i.d. N(0, 1), independent of (X^i_1, . . . , X^i_4). In terms of the representation (1.1), the centred coefficient functions in the conditional α-quantile function m, denoted by f_{jk}(·; α), involve centring constants c_{jk} chosen so that ∫_0^1 f_{jk}(u; α) du = 0, and the α-quantile Φ^{−1}(α) of the standard normal distribution. The sample sizes were n = 400 and n = 1000.
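For reproducibility, here is a hedged R sketch of the covariate design in scenario (1); the rejection sampler for the truncated normal is one convenient implementation choice, and all names are illustrative.

```r
set.seed(42)
n <- 400
X12 <- matrix(rnorm(2 * n), n, 2)   # (X1, X2) i.i.d. N(0, I)

## N((1/2, 1/2), I) truncated to the unit square, by rejection sampling
rtruncnorm2 <- function(n) {
  out <- matrix(numeric(0), 0, 2)
  while (nrow(out) < n) {
    cand <- matrix(rnorm(2 * n, mean = 0.5), n, 2)
    ok <- cand[, 1] >= 0 & cand[, 1] <= 1 & cand[, 2] >= 0 & cand[, 2] <= 1
    out <- rbind(out, cand[ok, , drop = FALSE])
  }
  out[seq_len(n), ]
}
X34 <- rtruncnorm2(n)               # (X3, X4), independent of (X1, X2)
U <- rnorm(n)                       # errors, independent of the covariates
```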
We used the R function rq() in the library quantreg to optimize the objective functions at (2.1) and (2.2). For SBF, we discretized the integrals on a fine grid in [0, 1]^2. We used the Epanechnikov kernel K(u) = (3/4)(1 − u^2)I_{[−1,1]}(u). For the bandwidths, we took a number of combinations (h_1, h_2) with h_1 ∈ [0.10, 0.40] and h_2 ∈ [0.05, 0.20]. We chose the initial estimates in the iterations (2.1) and (2.2) to be zero, and we found that the algorithms converged with this choice in all cases.
Tables 1 and 2 provide the Monte Carlo estimates of the mean integrated squared error (MISE), the integrated squared bias (ISB) and the integrated variance (IV) of the BF and SBF estimators of the individual coefficient functions. Table 1 is for the case where the covariates X_j are uncorrelated, while Table 2 is for the correlated case. In the tables, we report only the results for the bandwidth choices (h_1, h_2) = (0.3, 0.15) for n = 400 and (h_1, h_2) = (0.2, 0.1) for n = 1000, which give an average performance. The results are based on 500 pseudo-samples.
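The reported quantities can be computed from the Monte Carlo output as in the following R sketch, in which fhat (a matrix of estimates of one coefficient function, one row per pseudo-sample, evaluated on an equispaced grid) and ftrue are illustrative names; by construction, MISE = ISB + IV.

```r
## Monte Carlo decomposition of the integrated error of one estimator.
error_decomposition <- function(fhat, ftrue, grid) {
  dx   <- grid[2] - grid[1]                 # equispaced grid spacing
  fbar <- colMeans(fhat)                    # pointwise Monte Carlo mean
  ISB  <- sum((fbar - ftrue)^2) * dx        # integrated squared bias
  IV   <- sum(colMeans(sweep(fhat, 2, fbar)^2)) * dx  # integrated variance
  c(MISE = ISB + IV, ISB = ISB, IV = IV)
}
```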
A close investigation of the tables reveals that the SBF estimator has slightly better MISE performance than the BF estimator in all cases. Note that the true coefficient functions f_{23} and f_{24} remain the same as the level α of the quantile changes. Because of this, the values of the ISB of the estimators f̂^{BF}_{23}, f̂^{BF}_{24}, f̂^{SBF}_{23} and f̂^{SBF}_{24} do not change much across different values of α. We also find that both the BF and SBF methods perform fairly well in the case of correlated covariates. Comparing the values of MISE, ISB and IV in Table 2 with those in Table 1, we observe that the values in the correlated case are only marginally larger than the corresponding values in the uncorrelated case.