Efficient model selection in semivarying coefficient models

Abstract: Varying coefficient models are useful extensions of classical linear models. In practice, some of the coefficients may be constant, while other coefficients are varying. Several methods have been developed to utilize the information that some coefficient functions are constant to improve estimation efficiency. However, for such methods to work, the information about which coefficient functions are constant must be given in advance. In this paper, we propose a computationally efficient method to discriminate, in a consistent way, the constant coefficient functions from the varying ones. Additionally, we compare the performance of our proposal with that of previous methods developed for the same purpose in terms of model selection accuracy and computing time.


Introduction
Varying coefficient models are an important generalization of classical linear models. They form an alternative class of models to ameliorate the so-called 'curse of dimensionality' in multidimensional nonparametric modeling from a theoretical point of view, but they also allow us to explore the dynamic features hidden in the data. Let Y be the response variable and (U, X_1, ..., X_p) be its associated covariates. Then the varying coefficient model is defined by the following linear model:

Y = Σ_{l=1}^{p} a_l(U) X_l + ε,  (1.1)

with E(ε|U, X_1, ..., X_p) = 0 and Var(ε|U, X_1, ..., X_p) = σ²(U). In practice, occasionally, some of the coefficient functions in model (1.1) are constant, while the other coefficients vary with the index variable U. In this case, assuming without loss of generality that the first q coefficients are varying, we can rewrite the model as

Y = Σ_{l=1}^{q} a_l(U) X_l + Σ_{l=q+1}^{p} a_l X_l + ε,  (1.2)

which is the so-called semivarying coefficient model. From an estimation point of view, this model cannot be treated as a special case of a varying coefficient model, because treating constant coefficients as varying results in a loss of estimation efficiency. This motivated several authors to develop methods that incorporate the information of constancy into the estimation procedure. To obtain a root-n convergence rate for the constant coefficients, as in parametric models, Zhang et al. [17] and Cheng et al. [3] considered a two-step estimation procedure that estimates the constant coefficients by averaging the initial estimators over a grid of points. Xia et al. [15] proposed a semi-local least squares method that estimates the varying coefficients locally and the constant coefficients globally. Additionally, Fan and Huang [4] showed that it is also possible to attain a parametric convergence rate for the constant coefficients with the profiled least squares method. However, for such methods to work, the information about which coefficient functions are constant must be given in advance.
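To fix ideas, data from a model of the form (1.2) can be simulated as follows; the particular choices a_1(u) = sin(2πu) (varying) and a_2 = 1.5 (constant), the error scale, and the covariate distributions are hypothetical, purely for illustration:

```python
import numpy as np

def simulate_semivarying(n, seed=0):
    """Simulate from Y = a1(U)*X1 + a2*X2 + eps with a1(u) = sin(2*pi*u)
    varying and a2 = 1.5 constant (illustrative coefficient choices)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, n)          # index variable U
    x1 = rng.normal(size=n)               # covariate with varying coefficient
    x2 = rng.normal(size=n)               # covariate with constant coefficient
    eps = rng.normal(scale=0.5, size=n)   # error with E(eps | U, X) = 0
    y = np.sin(2 * np.pi * u) * x1 + 1.5 * x2 + eps
    return u, x1, x2, y

u, x1, x2, y = simulate_semivarying(500)
```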
Since such information is rarely available in practice, it is of interest to develop an efficient and fast method to discriminate constant coefficients from varying ones. Identifying constant coefficients can be done through hypothesis testing, as in Fan et al. [5], Fan and Huang [4] and Wang et al. [13]. Nevertheless, the theoretical properties of such identification based on hypothesis testing can be somewhat hard to analyze. Another way to identify the constant coefficients is to consider the problem in a variable selection framework. This type of approach has often been considered for the identification of nonzero coefficients in varying coefficient models, as in Wang et al. [14], Noh and Park [11] and Antoniadis et al. [1], but not as often for the identification of nonconstant coefficients. Xia et al. [15] proposed a cross-validation (CV) procedure based on local linear estimators to discriminate constant coefficients from varying ones. However, the implementation of their method is computationally very demanding. Indeed, to calculate the CV score for each candidate model, the bandwidth must be chosen via leave-one-out cross-validation using estimates based on the semi-local least squares method, which involves large-scale matrix computation. Further, the computation becomes even more challenging as the number of covariates increases. To tackle this problem, Hu and Xia [7] developed a shrinkage method based on local polynomial smoothing and proposed a Bayesian Information Criterion to select the shrinkage parameter. Although their method is able to identify the constant coefficients without an exhaustive search over the candidate models, its discrimination ability is not as good as that of the method of Xia et al. [15], because it is based on local constant estimators. Additionally, the adaptation of the method of Hu and Xia [7] to local linear estimation is not at all a trivial task.
The reason is that when the coefficient function a_l(·) is constant, we should estimate a_l(·) as constant but its derivative a'_l(·) as zero. Due to this, a totally different penalty from the one Hu and Xia [7] considered is required, and hence much theoretical and computational work would be needed for the extension. However, the extension is necessary, since local constant estimators tend to suffer from boundary problems, which can substantially degrade the discrimination ability of the procedure. Based on this observation, we are motivated to develop a new method whose discrimination ability is as good as that of the procedure of Xia et al. [15], which is based on local linear estimators, and which is as fast as that of Hu and Xia [7], which is known to be computationally efficient.
The rest of this paper is organized as follows. In Section 2, we will introduce our new model selection method. Consistency of the model selection method is established in Section 3. In Section 4, we compare the performance of our proposal with that of previous methods developed for the same purpose in terms of model selection accuracy and computing time. All technical details are deferred to the Appendix.

A new Bayesian Information Criterion to detect constancy
In this section, we propose an information criterion that can consistently identify constant coefficients. For the estimation of model (1.2), suppose that we have a random sample of size n, {(Y_i, U_i, X_{1i}, ..., X_{pi}), i = 1, ..., n}. As a first step for model selection, we obtain the estimates â_l(U_i), i = 1, ..., n, l = 1, ..., p, based on local linear regression. Using Taylor's expansion, we have

a_l(U) ≈ a_l(u) + a'_l(u)(U − u)

for U in a neighborhood of u. This leads to the following local least squares estimation problem: minimize

Σ_{i=1}^{n} [Y_i − Σ_{l=1}^{p} {b_l + c_l(U_i − u)} X_{li}]² K_h(U_i − u)

with respect to b = (b_1, ..., b_p)⊤ and c = (c_1, ..., c_p)⊤, for a given kernel function K and a bandwidth h. Here, K(·) is a symmetric density function and K_h(·) = h^{−1}K(·/h). Let ã_l(u) be b̂_l for l = 1, ..., p and let ã(u) = (ã_1(u), ..., ã_p(u))⊤. Now, we are ready to define a Bayesian Information Criterion (BIC) to identify which coefficients are varying. Without loss of generality, we can assume that the whole set of covariates can be separated into two exclusive subsets: I_v, the set of indices of the coefficients treated as varying, and I_c, the set of indices of those treated as constant. Based on these estimates, the BIC for detecting constancy, BIC(I_v), is defined in terms of the residual sum of squares RSS(I_v) and a penalty on model complexity, where |I| denotes the cardinality of the set I. Since the set I_c is the complement of I_v, we suppress I_c in the remainder of the paper. The set of indices of varying coefficient functions is estimated by the minimizer of BIC(I_v), which we denote by Î_v. Similar criteria were considered in Hu and Xia [7] and Cheng et al. [3]. Although Cheng et al. [3] considered a similar criterion in a likelihood estimation framework, they did not focus on the consistency of the procedure to identify constant coefficients. Hu and Xia [7] used another similar BIC to select the amount of shrinkage for the same identification problem. In order to detect underfit more sensitively, we use the traditional residual sum of squares, whereas Hu and Xia [7] used the kernel-weighted residual sum of squares, which is common in kernel smoothing methods and whose theoretical properties are easy to show.
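At each point u, the local least squares problem above is a weighted least squares regression of Y_i on the 2p columns X_li and X_li(U_i − u), with weights K_h(U_i − u). A minimal sketch, assuming a Gaussian kernel for concreteness:

```python
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def local_linear_fit(u0, U, X, Y, h):
    """Local linear estimate a_tilde(u0) of the coefficient functions.

    Solves the kernel-weighted least squares problem at u0: regress Y on
    the 2p columns [X_l, X_l*(U - u0)] with weights K_h(U - u0), and
    return the first p coefficients (the b-part).
    """
    n, p = X.shape
    w = gaussian_kernel((U - u0) / h) / h          # K_h(U_i - u0)
    D = np.hstack([X, X * (U - u0)[:, None]])      # design: (b, c) parts
    WD = D * w[:, None]
    beta = np.linalg.solve(D.T @ WD, WD.T @ Y)     # weighted normal equations
    return beta[:p]                                # a_tilde(u0) = b_hat
```

Evaluating this fit at each observed U_i yields the pointwise estimates ã_l(U_i) that enter the residual sum of squares.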
In Section 4, we will show by means of simulations that this difference in the definition of RSS(I_v) is crucial for enhancing the ability to prevent underfit. From a computational point of view, our BIC method involves an exhaustive search over the Σ_{j=0}^{p} (p choose j) = 2^p candidate models to identify constant coefficients consistently, which is quite demanding when p is very large. For situations where computational efficiency is more desirable, one can use heuristic algorithms that avoid an exhaustive search, such as the forward selection and backward selection procedures. In our context, forward selection means increasing the number of coefficients to be estimated as varying, one at a time, starting from the classical linear model, as long as the proposed BIC decreases. The forward selection procedure is given as follows:

Forward selection
(1) Start with I^(0) = ∅, corresponding to the classical linear model.
(2) In the k-th step (k ≥ 1), we are given I^(k−1). Then, for each j ∈ F \ I^(k−1), we consider the candidate model with varying-index set I^(k−1) ∪ {j} and compute its BIC.
(3) If the smallest of these BIC values is less than BIC(I^(k−1)), set I^(k) to the corresponding index set and return to Step (2). Otherwise, stop the iteration and set Î_v = I^(k−1).
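The search logic of the forward selection steps above can be sketched independently of the estimation details; here `bic` is any user-supplied function evaluating the criterion for a candidate varying-index set (a hypothetical stand-in for BIC(I_v)):

```python
def forward_selection(p, bic):
    """Forward selection over varying-index sets.

    `bic` maps a frozenset of indices (coefficients treated as varying)
    to its criterion value.  Start from the classical linear model
    (empty set) and grow one index at a time while the criterion
    decreases; stop at the first step with no improvement.
    """
    current = frozenset()                 # I^(0): classical linear model
    current_bic = bic(current)
    while True:
        candidates = [current | {j} for j in range(p) if j not in current]
        if not candidates:
            return current
        best = min(candidates, key=bic)
        if bic(best) < current_bic:       # accept the move, continue
            current, current_bic = best, bic(best)
        else:                             # no improvement: stop
            return current
```

With a toy criterion whose minimum sits at a target index set, the search recovers that set after at most p steps.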
In a similar manner, one can define the backward selection procedure, which increases the number of coefficients to be estimated as constant according to the proposed BIC. These two heuristic algorithms take less computation time than an exhaustive search, while leading to comparable performance. However, in our simulations the gain in computation time was not very remarkable, although we do not report the results to save space. The reason is that, since the computation needed to calculate BIC(I_v) for each candidate model only involves simple averaging, it is hard to see a considerable gain in computation time until the number of covariates becomes very large.

Consistency of the model selection rule
To study the consistency of the proposed method for model selection, the following standard regularity conditions are needed [4,7]: The support of the random variable U is [0, 1]. The density function f(u) of U is Lipschitz continuous and bounded away from 0 on its support. The conditional density of U given X is continuous and uniformly bounded with respect to u and x, up to its second derivative with respect to u. (A5) The function E(ε^4|U = u, X = x) and the second-order derivative of f(u) are bounded with respect to u and x. (A6) The second-order derivatives of the coefficients a_l(u), l = 1, ..., p, are continuous.
We are now ready to show the consistency of the proposed BIC procedure.
When the simple averaging approach in (2.3) for the estimation of constant coefficients is used, as in Zhang et al. [17], Lee [8] and Cai and Xiao [2], undersmoothing is inevitable for a root-n convergence rate of â_l (l ∈ I_c). However, obtaining a root-n convergence rate for â_l (l ∈ I_c) is not necessary for the consistency of our model selection method (as will be shown in the proof of Theorem 1). Since our theoretical result is established under the ordinary optimal bandwidth given in (A8), the usual bandwidth selection methods based on the mean squared error can be used for our method. Further, the simple averaging approach enables us to choose the bandwidth only once, whereas the method of Xia et al. [15] requires the selection of a bandwidth for each candidate model.
Note that one could consider another type of BIC for our method, BIC*(I_v), which is directly motivated by Hu and Xia [7] and replaces the residual sum of squares in BIC(I_v) by its kernel-weighted counterpart. Under assumptions similar to (A1)-(A8), it is possible to show that our method with BIC*(I_v) is also asymptotically consistent for the identification of constant coefficients. However, our simulation study suggests that BIC*(I_v) is significantly inferior to BIC(I_v) in terms of resistance against underfit.

Numerical studies
In this section we illustrate the competitiveness of our method in terms of model selection accuracy and computation time through a comparison with two previously proposed methods for the same purpose. Additionally, we illustrate the necessity of using BIC(I_v) instead of BIC*(I_v) by means of simulations. Throughout this section, the Gaussian kernel K(u) = exp(−u²/2)/√(2π) is used, and the optimal bandwidth for our method is chosen by the leave-one-out cross-validation procedure. One could also select the bandwidth using the generalized cross-validation procedure proposed in Li and Palta [9], which is computationally less intensive.
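For reference, leave-one-out cross-validation for the bandwidth can be sketched in the one-covariate case as follows (a minimal illustration; the grid of candidate bandwidths is arbitrary, and the paper's actual setting has p covariates):

```python
import numpy as np

def gauss(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def loocv_score(h, U, X, Y):
    """Leave-one-out CV score for a one-covariate local linear fit:
    for each i, fit without observation i and predict Y_i."""
    n = len(Y)
    err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        u0, Ui, Xi, Yi = U[i], U[mask], X[mask], Y[mask]
        w = gauss((Ui - u0) / h)                      # kernel weights at u0
        D = np.column_stack([Xi, Xi * (Ui - u0)])     # local linear design
        WD = D * w[:, None]
        beta = np.linalg.solve(D.T @ WD, WD.T @ Yi)
        err += (Y[i] - beta[0] * X[i]) ** 2           # (U_i - u0) term is zero
    return err / n

def select_bandwidth(U, X, Y, grid):
    """Choose the bandwidth minimizing the LOOCV score over a grid."""
    return min(grid, key=lambda h: loocv_score(h, U, X, Y))
```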

Comparison with previous methods
We compare the finite sample performance of our method with that of the two CV-based methods of Xia et al. [15] and of the shrinkage method of Hu and Xia [7].
Concerning the CV methods, Xia et al. [15] proposed two modifications of their original proposal. The first one is the simplified CV, which is designed to avoid an exhaustive search of the proposed CV criterion. The second one is an improvement of the first one in terms of overfit resistance. We included both improved versions for comparison, but not the original one because it is very time-consuming.
To compare the computation time of the different methods, we first implemented all methods in R (R Core Team [12]). In particular, we reimplemented the shrinkage method in R, referring to its original MATLAB [10] implementation, which Prof. Yingcun Xia kindly provided. However, we found that the MATLAB implementation of the shrinkage method is somewhat faster than the R one, which is not the case for the other methods. For this reason, we report the computation time of the shrinkage method from its MATLAB implementation.
We consider two sets of models, Models A and B, where U_i ~ Unif[0, 1], X_{1i} ≡ 1, (X_{2i}, ..., X_{7i})⊤ is simulated from a multivariate normal distribution with mean zero and Cov(X_{j1,i}, X_{j2,i}) = 0.5^{|j1−j2|} for any 2 ≤ j1, j2 ≤ 7, and ε_i follows a standard normal distribution. Models A and B were already considered in Xia et al. [15] and Hu and Xia [7], respectively. To evaluate the performance of the three model selection methods, we distinguish three different situations in Tables 1-3. In column "C", we present the percentage of trials in which all the varying and constant coefficients are correctly identified. When the estimated model misses at least one varying coefficient, we count it as an underfitted model and report its percentage in column "U". Finally, we show in column "O" the percentage of overfitted cases, in which all the varying coefficients are correctly identified but some of the constant coefficients are additionally identified as varying. We measure the computing time in seconds on a machine with an Intel(R) Core(TM) i7 2.80 GHz CPU and 4GB of RAM.

Table 1: Comparison of the results of the two CV methods in Xia et al. [15], our proposal and the shrinkage method of Hu and Xia [7].

Table 2: Comparison of the results of the two CV methods in Xia et al. [15], our proposal and the shrinkage method of Hu and Xia [7].

Tables 1 and 2 summarize the model selection results. The consistency rates of all methods approach 1 as the sample size grows. Only the simplified CV method seems to suffer a bit from slow convergence, as was already explained in Xia et al. [15]. In terms of model selection performance, both the modified CV method and ours outperform the other methods, but the former seems to be slightly better than the latter in Model A, which corresponds to the case where the nonconstant coefficients do not strongly deviate from a constant.
However, if we consider the computation time, our method is substantially faster than all the other methods, including the shrinkage method, which is supposed to be computationally efficient. Finally, although the results are not presented here, the same conclusions held when other error distributions were used, as long as the errors satisfy the assumptions in Section 3.

Comparison between BIC(I_v) and BIC*(I_v)

As will be shown in the proof of Theorem 1, the residual sum of squares in the BIC criterion prevents underfit, whereas the penalty guards against overfit. Our intuition in proposing BIC(I_v) was that a residual sum of squares that does not involve additional smoothing in its definition would prevent underfitting more effectively than one that does. To confirm this intuition, we compare BIC(I_v) with BIC*(I_v) defined in (3.2) for Model A. Table 3 shows that BIC*(I_v) seriously underfits when c = 1, which shows that avoiding additional smoothing helps to sharpen the sensitivity of the BIC against underfit. However, BIC*(I_v) seems to become consistent in constancy identification as the sample size increases, considering the case where c = 2, as expected from Section 3.

Conclusion and future research
We developed a new model selection method for semivarying coefficient models, which is shown to be competitive in identification performance and significantly faster than previous methods studied in the literature. Since we needed to study the asymptotic behavior of the residual sum of squares involving the estimator ã_l(·) to show the consistency of our BIC method, we worked with a simplified version of the error assumption made for the estimation of the coefficient functions a_l(·), as in Fan and Zhang [6] and Fan and Huang [4]. It would be an interesting research topic to show that the proposed BIC method works under the general error assumption. Although we only deal with the case of independent and identically distributed data, we expect that our method would also work in a time series setting, as the method of Xia et al. [15] does. Also, it is possible to extend our approach to quantile regression, and we have some numerical evidence supporting its validity. However, the lack of a closed form for the estimator makes the theoretical investigation more challenging than for mean regression. We leave these as future work.
Proof. The result can easily be shown using arguments similar to those in the Appendix of Fan and Huang [4].
Using calculations similar to those for A_1 and A_2, it can easily be shown that

â_2 − a_2 = â_3 − a_3 = O_p(h² + n^{−1/2}).  (A.6)

One can show that both A_3 and A_4 are o_p(log(nh)/(nh)) using (A.6) and U-statistic computations similar to those done for A_1 and A_2. This shows that A = o_p(log(nh)/(nh)). Let us now consider the term B in (A.5). Consider the decomposition (A.7). Regarding the first two terms, it is easy to show that they have the desired order using similar arguments as for the term A and the third term in (A.7).