Local linear smoothing for sparse high dimensional varying coefficient models

Abstract: Varying coefficient models are useful generalizations of parametric linear models. They allow for parameters that depend on a covariate or that develop in time. They have a wide range of applications in time series analysis and regression. In time series analysis they have turned out to be a powerful approach to infer on behavioral and structural changes over time. In this paper, we are concerned with high dimensional varying coefficient models including the time varying coefficient model. Most studies in high dimensional nonparametric models treat penalization of series estimators. On the other hand, kernel smoothing is a well established, well understood and successful approach in nonparametric estimation, in particular in the time varying coefficient model. But not much has been done for kernel smoothing in high-dimensional models. In this paper we close this gap and develop a penalized kernel smoothing approach for sparse high-dimensional models. The proposed estimators make use of a novel penalization scheme working with kernel smoothing. We establish a general and systematic theoretical analysis in high dimensions. This complements recent alternative approaches that are based on basis approximations and that allow more direct arguments to carry over insights from high-dimensional linear models. Furthermore, we develop theory not only for regression with independent observations but also for locally stationary time series in high-dimensional sparse varying coefficient models. The development of theory for locally stationary processes in a high-dimensional setting creates technical challenges. We also address issues of numerical implementation and of data adaptive selection of tuning parameters for penalization. The finite sample performance of the proposed methods is studied by simulations and it is illustrated by an empirical analysis of NASDAQ composite index data.


Introduction
Varying coefficient models arise in a wide range of applications. They are an important generalization of parametric linear regression models. They relax the assumption that the parameters are constant and allow regression coefficients to be smooth functions of other predictors, called index variables. On the one hand, the models are very flexible and give an accurate fit of complex data; on the other hand, they still maintain a simple structure. This allows for an intuitive interpretation and accurate estimation. For an overview on varying coefficient models, we refer to Fan and Zhang [13] and Park et al. [26].
In this paper we will propose an approach based on kernel smoothing for sparse high-dimensional varying coefficient models. Kernel smoothing has so far been considered mostly for finite dimensional models. This is the case for varying coefficient models and for other nonparametric settings. Typically, work on sparse nonparametric high-dimensional models has made use of orthogonal series estimators. These estimators are more closely linked to linear models and for this reason they more easily allow theory to be carried over from high-dimensional linear models. Our paper argues that kernel smoothing is an attractive alternative to orthogonal series estimation also for high-dimensional nonparametric models. This will be shown for varying coefficient models. Our implementation of kernel smoothing in high-dimensional settings requires the introduction of novel penalization schemes. We will show that kernel smoothing inherits from finite dimensional models its intuitive interpretation and clear asymptotic theory for the distribution of the estimator.
In our theory we will consider both regression models and time series models. In time series analysis a central example of a varying coefficient model is the time varying coefficient model, where the index variable is rescaled time. This class of models has been developed independently from varying coefficient models and it has turned out to be a very powerful tool in the empirical analysis of structural changes over time in time series data [see 27, 28, 9, 10, 2, 3, 44, for example]. An important example in this class is the time varying autoregressive model. In this model the data are nonstationary because the autoregressive structure changes over time. This complicates the asymptotic analysis. A common strategy to handle this nonstationarity is to model the time series as a locally stationary process, see Dahlhaus [8, 9]. Roughly speaking, a locally stationary process behaves approximately as a stationary process over a short period of time. This naturally suggests the use of local smoothing methods like kernel smoothing [see 32, for example]. Estimation and statistical inference based on kernel smoothing have been established and their statistical properties are well understood in the time varying coefficient model [2, 3, 44]. However, all this work is restricted to finite dimensional settings. As noted in Fan et al. [14], high dimensionality is encountered in many time series data applications, e.g. in economics and finance. Besides exogenous variables, often lagged variables of different lag orders and interaction terms have to be included in the model for accurate fits. These applications serve as an important motivation for our paper.
Sparse modeling provides an effective framework to analyze high dimensional data. It allows for identifiability of the model and it facilitates consistent statistical estimation even in high dimensional situations. Many penalized methods such as the Least Absolute Shrinkage and Selection Operator [LASSO, 29] and Smoothly Clipped Absolute Deviation [SCAD, 12] have been proposed for variable selection and estimation in sparse linear regression. The methods have proven to possess high computational efficiency as well as desirable statistical properties even under high dimensional settings. This has motivated extending the ideas to varying coefficient models for i.i.d. and longitudinal data. Varying coefficient models using orthogonal series estimation have been considered in Wei et al. [38], Lian [24], Xue and Qu [40] and Klopp and Pensky [21]. Their asymptotics allowed for an increasing number of coefficients and the studies include variable selection based on groupwise penalized methods such as the group LASSO [41]. Moreover, Klopp and Pensky [21] developed a non-asymptotic minimax theory for a model where the coefficient functions possibly have different degrees of smoothness and where they are spatially inhomogeneous. All these papers treat neither kernel smoothing nor time series models. Furthermore, their theoretical studies heavily rely on the assumption of independent observations and in part they require that the covariates $X_i$ and the index variable $Z_i$ be independent, see (2.1) for the definition of $X_i$, $Z_i$. This could be considered a restrictive assumption. We will drop this condition on the way to cover time series models. For an initial screening procedure to handle ultra-high dimensional variables see also Cheng et al. [5], Fan et al. [15] and Cheng et al. [6]. However, not much is known about penalized kernel smoothing methods. For varying coefficient models, the only works we are aware of are Wang and Xia [35], Hu and Xia [19], Wang and Kulasekera [33] and Kong et al. [22]. However, their asymptotic analysis is restricted to the case of fixed dimension and only the case of independent observations is treated.
Kernel smoothing is a very popular estimation technique for many nonparametric models and it is especially well suited for time varying coefficient models. In this paper we will develop kernel smoothing techniques that work, theoretically and computationally, for varying coefficient models with a diverging number of variables. Our first contribution to accomplish this task is to propose a penalized local linear kernel estimation method in varying coefficient models and to provide its sound asymptotic theory under high dimensionality. We will adapt the group LASSO and SCAD methods to the local linear method and we will systematically study variable selection and estimation properties of these methods. Our theory will include oracle inequalities for the group LASSO kernel method and we will show that the group SCAD kernel method consistently identifies the true structure of a partially linear varying coefficient model. Our methodological and theoretical developments require technical treatments that are quite different from asymptotics for groupwise penalized methods using series estimators. For example, in the sieve approach, one approximates a nonparametric model by a parametric model with increasing dimension. Thus the estimation problem of the nonparametric model is methodologically very similar to the estimation of a parametric model with increasing dimension. Such a simplifying technical approach does not apply to kernel smoothing. Furthermore, we also treat locally stationary varying coefficient models including the above-mentioned time varying autoregressive model. The study of this class of models requires new mathematical tools. Locally in time, the time series has to be approximated by a stationary process. This approximation makes it possible to carry over techniques from the study of stationary processes. We are not aware that such a theoretical study has been done in another high dimensional nonparametric setup. Our theory includes models with errors that have serial correlations with lagged errors, observations and covariates. In particular, we allow for conditionally heteroskedastic errors. We also do not assume that the errors are sub-Gaussian. The latter point may be important in financial applications.
Our second contribution is to develop a new computational method for the implementation of our proposals. Implementing our estimator involves a quite complicated optimization problem to which a typical group LASSO algorithm cannot be applied. By reformulating the problem as a second order cone programming problem, we are able to provide a simple and computationally efficient algorithm for the implementation. The details can be found in Section 4.1. The third contribution is to develop a criterion for determining the amount of penalization in the penalized estimation. This is a crucial step in the identification of the true partially linear structure. Although penalization methods for consistent identification of semiparametric models have been proposed, see Cheng et al. [5] and Zhang et al. [43], to our knowledge no work has been done on the choice of the tuning parameters for such estimators. We propose a tuning parameter selector based on the Bayesian information criterion (BIC) and we provide its theoretical justification. For this task, we verify that our penalized estimators of the relevant parametric and nonparametric components achieve the respective optimal rates of convergence simultaneously. The result is new and, compared to the usual oracle properties in the literature (see Theorem 1 and Remark 3 in Zhang et al. [43] and Theorem 3.3 in Cheng et al. [5]), it is much stronger. It serves as the theoretical foundation for showing that the proposed BIC identifies the true partially linear structure with probability tending to one. Finally, even in the fixed dimensional case, our methods extend other kernel smoothing based penalization methods.
The rest of this paper is organized as follows. The next section introduces the model and our statistical procedures based on kernel smoothing: LASSO estimators that use $L_1$-type penalties for the non-zero and the non-linear component functions, SCAD estimators with a BIC choice of their penalty constants, and BIC choices of the set of non-zero coefficient functions and of the set of non-linear coefficient functions. Section 3 contains our theoretical results. Section 4 discusses the numerical implementation of our methods and shows some simulation results. An illustrative data example is given in Section 5. All proofs are deferred to the Appendix.

Model and methodology
We suppose that the data $(Y_i, X_i, Z_i)$, $1 \le i \le n$, follow the varying coefficient model

Y_i = \sum_{j=1}^p m_j^0(Z_i) X_i^{(j)} + \varepsilon_i, \qquad (2.1)

where $X_i = (X_i^{(1)}, \ldots, X_i^{(p)})^\top$ is a $p$-dimensional covariate vector, $m_j^0$ are unknown smooth coefficient functions, $Z_i$ is an index variable with values in $[0, 1]$ and $\varepsilon_i$ are random errors. With rescaled time $Z_i = i/n$ we get the so-called 'time varying regression model'. This model includes the time varying autoregressive model as a special case. In our paper, we consider both independent data and time series versions of model (2.1): in the i.i.d. scenario, we assume that $(X_i, Z_i, \varepsilon_i)$ in (2.1) are independent and identically distributed (i.i.d.) copies of $(X, Z, \varepsilon)$ with $E(\varepsilon | X, Z) = 0$ and $E(\varepsilon^2 | X, Z) \le \sigma^2 < \infty$ for some $\sigma^2 > 0$; in the time series scenario, we suppose that $Z_i = i/n$ and that the covariates and errors form a locally stationary, mixing array, as specified in the assumptions of Section 3. We allow $p$ to tend to infinity as $n \to \infty$. Our main assumption is the sparsity of the model (2.1), that is, $m_j^0 \equiv 0$ for many $j$'s, as specified in more detail below.
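For concreteness, the following minimal Python sketch generates data from a sparse instance of model (2.1) in the i.i.d. scenario; the dimensions, coefficient functions and error scale are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 50                  # sample size and dimension (illustrative only)

# Sparse truth: only two of the p coefficient functions are nonzero.
def m1(z):                      # a varying coefficient
    return np.sin(2 * np.pi * z)

m2 = 1.0                        # a non-varying (constant) coefficient

Z = rng.uniform(0.0, 1.0, n)    # index variable with values in [0, 1]
X = rng.normal(size=(n, p))     # covariates X_i^{(j)}
eps = 0.5 * rng.normal(size=n)  # errors with E(eps | X, Z) = 0

Y = m1(Z) * X[:, 0] + m2 * X[:, 1] + eps
```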
Let $(m_j^0)^{(s)}$ be the $s$-th derivative of the true coefficient function $m_j^0$ for $1 \le j \le p$. Given any $z \in [0, 1]$, we define $m^0(z) = (m_1^0(z), \ldots, m_p^0(z))$ and, for $Z_i \approx z$, we approximate $m_j^0(Z_i)$ by $m_j^0(z) + (m_j^0)^{(1)}(z)(Z_i - z)$. The local linear estimator of $m^0(z)$ and $(m^0)^{(1)}(z)$, $z \in [0, 1]$, is defined by minimizing the following local kernel weighted least squares criterion:

\sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^p \big\{ m_j(z) + m_j^{(1)}(z)(Z_i - z) \big\} X_i^{(j)} \Big]^2 K_h(Z_i - z),

where $K_h(u) = K(u/h)/h$, $K$ is a kernel function and $h > 0$ is a bandwidth. Equivalently, the estimated coefficient functions $\hat m(\cdot)$ and $\hat m^{(1)}(\cdot)$ are the minimizer of the integrated criterion

L(m, m^{(1)}) = \int_0^1 \frac{1}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^p \big\{ m_j(z) + m_j^{(1)}(z)(Z_i - z) \big\} X_i^{(j)} \Big]^2 K_h(Z_i - z) \, dz

with respect to $m = (m_1, \ldots, m_p)$ and $m^{(1)} = (m_1^{(1)}, \ldots, m_p^{(1)})$. From now on, we omit the arguments of functions when no confusion arises.
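A minimal sketch of the pointwise (unpenalized) local linear fit, reusing the simulated data from the previous snippet; in high dimensions this raw least squares step is unstable, which is what motivates the penalization introduced next:

```python
def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def local_linear(z, Y, X, Z, h):
    """Unpenalized local linear fit of (m(z), m^(1)(z)) at a point z."""
    w = epanechnikov((Z - z) / h) / h            # kernel weights K_h(Z_i - z)
    D = np.hstack([X, X * (Z - z)[:, None]])     # local linear design matrix
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
    p = X.shape[1]
    return coef[:p], coef[p:]                    # estimates of m(z), m^(1)(z)

m_hat, m1_hat = local_linear(0.5, Y, X, Z, h=0.15)
```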
Given a function $g$ defined on $[0, 1]$, let $\|g\| = [\int_0^1 g^2(z)\, dz]^{1/2}$ and $\|g\|_c = [\int_0^1 \{g(z) - \int_0^1 g(u)\, du\}^2 dz]^{1/2}$ be the respective $L_2$ norms of $g$ and of its centered version. They measure how much the function $g$ differs from zero or from a constant function, respectively. In this paper, we consider estimation of $m_j^0$ and $(m_j^0)^{(1)}$, $1 \le j \le p$, for sparse high dimensional varying coefficient models, where sparsity is defined on a functional level (in the $L_2$ sense). Adapting the idea of the group LASSO to our context, we propose to minimize the following penalized criterion:

(\hat m, \hat m^{(1)}) = \operatorname{argmin}_{m, m^{(1)}} \; L(m, m^{(1)}) + \lambda_1 P(m, m^{(1)}), \qquad (2.2)

where $P(m, m^{(1)}) = \sum_{j=1}^p (\|m_j\|^2 + h^2 \|m_j^{(1)}\|^2)^{1/2}$ and the minimization runs over $m = (m_1, \ldots, m_p)$ and $m^{(1)} = (m_1^{(1)}, \ldots, m_p^{(1)})$. Here, $\lambda_1 > 0$ is a regularization parameter. The penalty $(\|m_j\|^2 + h^2 \|m_j^{(1)}\|^2)^{1/2}$ jointly controls both sparsity of the coefficient functions $m_j^0$ and sparsity of their derivatives $(m_j^0)^{(1)}$. It contains the rescaling factor $h^2$ for technical reasons. Our proposal (2.2) is different from the penalized local linear method in Kong et al. [22]. In that paper a penalized criterion for a fixed value of $z$ is considered, see (5) in Kong et al. [22]. This simplifies the asymptotic treatment of the estimator, but the chosen set of non-zero coefficient functions depends on the value of $z$, so that it is not applicable for our purpose of estimation under sparsity on the functional level.
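Since the norms $\|\cdot\|$ and $\|\cdot\|_c$ and the penalty $P$ enter all later criteria, here is a sketch of their Riemann-sum approximations on a uniform grid, anticipating the discretization used in Section 4 (function names are ours):

```python
def l2_norm(g_vals):
    # ||g||: Riemann-sum approximation of [int_0^1 g(z)^2 dz]^(1/2)
    # for function values g_vals on a uniform grid of J points in [0, 1]
    return np.sqrt(np.mean(g_vals**2))

def centered_l2_norm(g_vals):
    # ||g||_c: distance of g from its mean, i.e. from the best constant
    return np.sqrt(np.mean((g_vals - g_vals.mean())**2))

def group_penalty(m_vals, m1_vals, h):
    # P(m, m^(1)) = sum_j sqrt(||m_j||^2 + h^2 ||m_j^(1)||^2);
    # m_vals and m1_vals have shape (p, J): grid values of m_j and m_j^(1)
    return np.sum(np.sqrt(np.mean(m_vals**2, axis=1)
                          + h**2 * np.mean(m1_vals**2, axis=1)))
```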
It is well known that penalized estimators which employ the LASSO or the group LASSO may fail to achieve consistency in model selection, see also Section 3.2. For this reason, we consider a penalized estimator that corrects for this. Further, the method should be able to discriminate between varying coefficient functions $m_j^0(z)$ over $z \in [0, 1]$ with $\|m_j^0\| \ne 0$ and $\|m_j^0\|_c \ne 0$, nonzero constant functions $m_j^0$ with $\|m_j^0\| \ne 0$ and $\|m_j^0\|_c = 0$, and zero functions $m_j^0 \equiv 0$ with $\|m_j^0\| = 0$ and $\|m_j^0\|_c = 0$. In the first case, we say the coefficients $m_j^0$ are varying; in the second case, that they are non-varying. We now propose a procedure for estimating the coefficient functions that performs this discrimination. For this purpose we adapt the idea of a group SCAD penalty to our setting. Our version of the SCAD estimator $(\hat m, \hat m^{(1)})$ is defined as the minimizer of a criterion, labeled (2.3), that augments $L(m, m^{(1)})$ by SCAD-type penalties on the norms $\|m_j\|$ and on the centered norms $\|m_j\|_c$. Here $p_\lambda'(\cdot)$ is the derivative of the SCAD penalty function with regularization parameter $\lambda > 0$, which is given by

p_\lambda'(x) = \lambda \Big\{ I(x \le \lambda) + \frac{(\gamma \lambda - x)_+}{(\gamma - 1)\lambda} I(x > \lambda) \Big\}

for some $\gamma > 2$ and $x > 0$. In our simulations and in our data example, $\gamma = 3.7$ is chosen according to the suggestion of Fan and Li [12]. Instead of the SCAD itself, we use a linear approximation of the SCAD penalty (around suitable initial estimates, e.g. the minimizer of (2.2)) in order to overcome difficulties due to the non-convexity of the SCAD penalty [46]. The Bayesian information criterion (BIC) has been used for consistent model selection in linear models. In recent years, it has been proposed as a method for selecting regularization parameters of penalized methods. Work on linear models in high-dimensional settings includes Wang and Leng [34], Wang et al. [36, 37] and Lee et al. [23]. For our semiparametric setting, we propose the following version of BIC for criterion (2.3): first, we only consider choices $\lambda_2 = \lambda_2^*$ from a set of candidate values. This is done to get a stable choice of the regularization parameter. The value of $\lambda_2 = \lambda_2^*$ is chosen which minimizes the BIC criterion, labeled (2.5), that balances a measure of fit against complexity terms involving $\mathrm{df}_{\lambda_2,V}$, $\mathrm{df}_{\lambda_2,I}$ and $C_n$. Here, the estimators $\hat m_{\lambda_2} = (\hat m_{\lambda_2,1}, \ldots, \hat m_{\lambda_2,p})$ and $\hat m^{(1)}_{\lambda_2} = (\hat m^{(1)}_{\lambda_2,1}, \ldots, \hat m^{(1)}_{\lambda_2,p})$ are defined as the minimizer of (2.3) with $\lambda_2 = \lambda_2^*$. Furthermore, $C_n > 0$ is a sequence of positive constants whose choice will be discussed below. The terms $\mathrm{df}_{\lambda_2,V}$ and $\mathrm{df}_{\lambda_2,I}$ are the estimated numbers of varying and non-varying coefficients, respectively. That is, $\mathrm{df}_{\lambda_2,V} = |\hat V_{\lambda_2}|$ and $\mathrm{df}_{\lambda_2,I} = |\hat A_{\lambda_2} \setminus \hat V_{\lambda_2}|$ with estimated index sets $\hat A_{\lambda_2} = \{j = 1, \ldots, p : \|\hat m_{\lambda_2,j}\| \ne 0\}$ and $\hat V_{\lambda_2} = \{j \in \hat A_{\lambda_2} : \|\hat m_{\lambda_2,j}\|_c \ne 0\}$ of nonzero and varying coefficient functions, respectively. When calculating the BIC in (2.5), the effective sample size $nh$ is used for the nonparametric components instead of the original sample size $n$. For fixed $p$, the BIC with $C_n = 1$ guarantees consistent model selection, but it may fail to work when $p$ increases, see also Chen and Chen [4] and Lee et al. [23]. Using the ideas developed in Wang et al. [37] and Lee et al. [23] for high dimensional linear models, we consider a diverging constant $C_n \to \infty$ as $n \to \infty$ for high dimensional cases. We will see that a proper choice of $C_n$ leads to consistency of the proposed BIC even in high dimensions. See assumption (A11) and the discussion in Section 3.4.
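For illustration, a short sketch of the SCAD penalty derivative $p_\lambda'$ with the conventional choice $\gamma = 3.7$:

```python
def scad_derivative(x, lam, gamma=3.7):
    """Derivative p'_lam(x) of the SCAD penalty (Fan and Li), for x > 0."""
    x = np.asarray(x, dtype=float)
    linear_part = (x <= lam)                     # constant slope lam near zero
    tapered_part = np.maximum(gamma * lam - x, 0.0) / ((gamma - 1.0) * lam)
    return lam * (linear_part + tapered_part * (x > lam))
```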
Although we propose the BIC in (2.5) primarily for selecting $\lambda_2$ in our penalization (2.3), the idea also applies to a direct selection problem for the index sets $V$ and $I$ of varying and non-varying coefficient functions. This is done by minimizing a criterion $\mathrm{BIC}(V, I)$ of the same form, labeled (2.6), where the underlying estimators are computed under the constraints that $m_j$ is constant with $m_j^{(1)} = 0$ for $j \in I$ and $m_j \equiv 0$ for $j \notin V \cup I$. A similar type of estimator has been discussed in Xia et al. [39] in an i.i.d. setting with fixed dimension. As far as we know, a statistical procedure for simultaneous identification has never been established, even in fixed dimensional cases. Only statistical methods for discriminating between zero functions and varying functions, and methods for identifying non-varying coefficients among nonzero functions, have been developed [see 39, 35, 19, 44, for example]. Because calculation of $\mathrm{BIC}(V, I)$ over all sets $(V, I)$ is too complex, we propose to let $\mathrm{BIC}(V, I)$ run only over sets $(V, I)$ chosen by $\hat m_{\lambda_2}$, $\hat m^{(1)}_{\lambda_2}$ with $\lambda_2$ in an appropriate set of values. We have checked the performance of this estimator in our simulation study. We will show consistency of the proposed BICs in (2.5) and (2.6) in Section 3.4.

Oracle inequality
Let $A_0 = \{1 \le j \le p : \|m_j^0\| \ne 0\}$ be the true active set, with cardinality $a_0 \equiv |A_0|$. For $p$-tuples of (square integrable) functions $m = (m_1, \ldots, m_p)$ and $m^{(1)} = (m_1^{(1)}, \ldots, m_p^{(1)})$ we use the norms introduced in Section 2. For our theoretical analysis we will make use of the following assumptions.
(A1) The covariates are uniformly bounded, i.e., $\max_{i,j} |X_i^{(j)}| \le C_0$ for some constant $C_0 > 0$, not depending on $n$. (A2) The kernel $K$ is a symmetric probability density function with support $[-1, 1]$ and it is Lipschitz continuous. (A3) There exists a constant $C_1 > 0$, not depending on $n$, that bounds the smoothness of the true coefficient functions. (A4) There exists a constant $\phi_n > 0$ such that, with probability tending to one, the compatibility inequality (3.1) holds. Assumption (A1) can be relaxed: we only need that $\max_{i,j} |X_i^{(j)}|$ is bounded up to a logarithmic factor. When $p$ is fixed, it has typically been assumed that there exists a constant $\phi > 0$ such that a lower bound of the form (3.2) holds (with large probability) for all $m(\cdot) = (m_1(\cdot), \ldots, m_p(\cdot))$ and $m(\cdot)^{(1)} = (m_1(\cdot)^{(1)}, \ldots, m_p(\cdot)^{(1)})$. However, for very large $p$ it may be too restrictive to assume (3.2) for all $(m, m^{(1)})$. For (A4), we adapt the concept of the 'compatibility condition' that has been developed for high dimensional models [see 1, for example]. For a general comparison of different conditions on design matrices in high dimensional linear models, see van de Geer and Bühlmann [30]. There it has also been pointed out that their version of the 'compatibility condition' allows for a fairly general class of design matrices. Since assumption (A4) depends on the data, we will discuss a population version of (A4) later.
The following lemma and theorem state oracle results for the estimator $(\hat m, \hat m^{(1)})$ defined in (2.2). They are formulated on events $T_1$, $T_2$ and $T_3$, where $T_3$ is the event that (3.1) holds for $m = (m_1, \ldots, m_p)$ and $m^{(1)} = (m_1^{(1)}, \ldots, m_p^{(1)})$.

Theorem 3.1. Suppose that the assumptions (A1)-(A4) hold and that
Below we will state assumptions under which $P(T_2) \to 1$, see Theorem 3.2. Here and below, we write $a_n \approx b_n$ for two sequences $a_n$ and $b_n$ if the ratio $a_n/b_n$ is bounded away from zero and infinity. Suppose $nh \to \infty$ and $h \to 0$ as $n \to \infty$.
Moreover, suppose that the $X_i^{(j)}$ are bounded up to a log-factor, that $\phi_n$ is bounded away from zero and that the cardinality of the true active set $A_0$ is of order $(\log p)^\gamma$ for some $\gamma > 0$. Then, up to a log term, the above rates coincide with those of oracle estimators that utilize knowledge of the set $A_0$ and that achieve the optimal nonparametric convergence rate when $h \approx n^{-1/5}$. Thus, our results can be interpreted as oracle results for the estimators of the coefficient functions and their derivatives. Regarding model selection, we will show in Section 3.2 that the LASSO estimator $(\hat m, \hat m^{(1)})$ (with any choice of $\lambda_1$) cannot achieve consistency in general. Thus, we chose the $\lambda_1$ which minimizes an estimate of the prediction error in the simulated and real data examples.
We present theoretical results for varying coefficient models with i.i.d. data and for time varying regression models. We introduce some generic notation whose definition differs between the two settings. We define $\Sigma(z) = E[X X^\top | Z = z]$ in the i.i.d. setting and $\Sigma(i/n) = E[X_i X_i^\top]$ in the time series setting. Furthermore, $f$ denotes the density of $Z$ in the i.i.d. setting and $f(z) \equiv 1$ in the time series setting. We now state sufficient conditions for (A4). Note that the quantity (3.1) in the assumption (A4) depends on the data. We now state an assumption, (A4'), that is related to (A4) but with the random quantities replaced by nonrandom terms: it requires a constant $\bar\phi_n > 0$ such that the population analogue of (3.1) holds for any $m = (m_1, \ldots, m_p)$ and $m^{(1)} = (m_1^{(1)}, \ldots, m_p^{(1)})$.
In our notation, 0 < C < ∞ denotes a generic constant, not depending on n. This means that the variable name C is used for different constants, even in the same equation. We now state additional assumptions. We will show below that under (A1)-(A2) and under these conditions, (A4') implies (A4).
(A5) There exists a constant $0 < C < \infty$, not depending on $n$, such that $\sup_{1 \le j,k \le p} \sup_{z \in [0,1]} |\Sigma_{jk}(z)| \le C$. (A6) In the i.i.d. setting, $n^{-1} h^{-2} \log n \to 0$; in the time series setting, the $\alpha$-mixing coefficients satisfy $\alpha(k) \le C k^{-\alpha}$ for some $C > 0$ and $\alpha > 1$, and the rate conditions in (3.4) hold. Since the constants $\phi_n$ and $\bar\phi_n$ in the assumptions are not unique, we suppose that $\phi_n$ and $\bar\phi_n$ are chosen as the largest positive constants satisfying (A4) and (A4'), respectively. In (A6), the condition $n^{-1} h^{-2} \log n \to 0$ for the i.i.d. setting guarantees uniform bounds on $N(z)$ of the order $nh$, with probability tending to one, where $N(z)$ is the number of $Z_i$'s that fall into the interval $[z - h, z + h]$; see the discussion in the first paragraph of the Appendix for details. Note that since $Z_i = i/n$ in the time series setting, $N(z) \le 2nh + 1$ for $z \in [0, 1]$. In the time series setting, the mixing condition of (A6) is not strong; compare also recent work on locally stationary processes [see 17, 32, for example]. Time dependency of the covariates $X$ in the time varying coefficient model restricts the growth rate of $p$. The first condition in (3.4) implies that the order of $p$ does not exceed $(nh)^{\alpha/4}$; thus the larger (smaller) $\alpha$ is, the more (fewer) covariates are allowed in the model. If the $\alpha$-mixing coefficients $\alpha(k)$ decrease exponentially and $p$ grows at a polynomial rate in $n$, i.e. $p = O(n^\kappa)$ for some $\kappa > 0$, then there is no such restriction on $p$ as long as (A7) holds: since then $\alpha(k) \le Ck^{-\alpha}$ holds for every $\alpha > 0$, the first condition in (3.4) is automatically satisfied. The assumption (A7) implies that the number $a_0$ of true nonzero coefficient functions cannot grow too fast. In the i.i.d. setting, it allows for ultra-high dimensionality of the variables, i.e., $p = o(\exp(nh))$ if $\phi_n$, $a_0$, $d_n$ are bounded.
The following theorem states an asymptotic equivalence between $\phi_n$ and $\bar\phi_n$: with probability tending to one, assumption (A4) holds with a sequence $\phi_n$ that fulfills $C \bar\phi_n \le \phi_n \le C^{-1} \bar\phi_n$ for some $C > 0$.
Theorem 3.2 also gives a uniform rate of convergence for the estimators $\hat m_j$ and $\hat m_j^{(1)}$.

Consistency and inconsistency of group LASSO estimators
In this section, we study whether the proposed estimator $(\hat m, \hat m^{(1)})$ in (2.2) achieves consistency in model selection. Here, consistency means that the set $\hat A = \{j = 1, \ldots, p : \hat m_j \ne 0\}$ selected by the estimator is equal to the true active set $A_0$ with probability tending to 1 as $n \to \infty$. Discrimination between varying and non-varying functions in model (2.1) would require an additional model choice procedure. In this section we will state a condition that is necessary for consistency, see Proposition 3.1 and Theorem 3.3. At the end of this section, we will use these results to show inconsistency of our group LASSO estimator $(\hat m, \hat m^{(1)})$ in an example.
For simplification we make the following additional condition: (C) the cardinality $a_0 = |A_0|$ of the true active set is fixed and the smallest eigenvalues of $\Sigma(z)$ are bounded away from zero, uniformly over $z \in [0, 1]$. Before presenting our theoretical results, we introduce some notation.
We also skip the argument $z$ and write $\Gamma_i$ and $\hat S$ for $\Gamma_i(z)$ and $\hat S(z)$.
The following proposition and theorem give a necessary condition for consistency of the group LASSO.
Similarly as in the proof of Lemma 3.1, it can be shown that the corresponding event has probability tending to one. This also implies (3.6) for fixed $p$.
We now give an example where our group LASSO is inconsistent. Suppose that the matrix $\Sigma(z)$ is constant over $z$, i.e., $\Sigma(z) \equiv \Sigma$, with nonzero entries $\Sigma_{j, a_0+1} = \Sigma_{a_0+1, j}$ for $j \le a_0$ and with $\Sigma_{jk} = \Sigma_{kj} = 0$ for $j \ge a_0 + 2$ and $k \le a_0 + 1$. In this model there is one irrelevant predictor, $X^{(a_0+1)}$, that is correlated with the relevant predictors. If these correlations are of order $1/a_0$ and if the nonvanishing functions $m_j$, $j \in A_0$, are all nonnegative (or nonpositive), then model selection via the method $(\hat m, \hat m^{(1)})$ defined in (2.2) is not consistent, because condition (3.6) does not hold.

Oracle properties
In this section, we present oracle properties of the estimator $(\hat m, \hat m^{(1)})$ defined as the minimizer of (2.3).
Define $V_0 = \{j \in A_0 : \|m_j^0\|_c \ne 0\}$ as the index set of true coefficient functions that are varying over $z \in [0, 1]$. We compare $\hat m_{A_0}$ and $\hat m^{(1)}_{A_0}$ with the oracle estimators $\hat m^{ora} = (\hat m_j^{ora} : j \in A_0)$ and $(\hat m^{ora})^{(1)} = ((\hat m_j^{ora})^{(1)} : j \in A_0)$ that are defined as minimizers of the restricted criterion (3.7) with respect to $m_j$, $(m_j)^{(1)}$ for $j \in A_0$, under the constraint that $m_j$ are constant functions for $j \in A_0 \setminus V_0$ and that $(m_j)^{(1)} \equiv 0$ for $j \in A_0 \setminus V_0$. The oracle estimator is an infeasible estimator that makes use of the unknown true index sets $A_0$ and $V_0$. For the asymptotic analysis in this section we make the following additional assumption.
(A8) There exists a positive constant $\delta > 0$ such that $\inf_{j \in A_0} \|m_j^0\| > \delta$ and $\inf_{j \in V_0} \|m_j^0\|_c > \delta$.
Theorem 3.4 states that our proposed procedure consistently identifies the true index sets of varying and non-varying coefficients in the model. Thus, the resulting estimators $\hat m_j$ and $\hat m_j^{(1)}$ of the nonzero coefficient functions have the same asymptotic properties as the oracle estimators $\hat m_j^{ora}$ and $(\hat m_j^{ora})^{(1)}$ for $j \in A_0$. Using standard arguments of kernel smoothing, it can be shown that the estimators $\hat m_j$, $j \in A_0 \setminus V_0$, of the (nonzero) constant coefficients achieve the parametric $\sqrt{n}$-rate of convergence under certain regularity conditions, see (A.26) in Appendix A.7. The required assumptions allow $h \approx n^{-1/5}$ for the case that $a_0 = |A_0|$ is fixed. This implies that in this case the same bandwidth $h \approx n^{-1/5}$ can be used to achieve an optimal rate of convergence for both the parametric and the nonparametric components at the same time. In contrast to other methods in semiparametrics, undersmoothing is not required for $\sqrt{n}$-consistency of the parametric estimators.

Consistent identification of BIC
In this section, we study consistency of the BIC methods proposed in (2.5) and (2.6), where the minimization in (2.6) runs over a space $\mathcal{M}$ of candidate index sets $(V, I)$ whose cardinality is bounded by a sequence $s_n$. A modification where the minimization does not run over the full space $\mathcal{M}$ will be discussed below and studied in the simulation section 4.3. The theory now developed for the estimator with $(V, I)$ running over the full space $\mathcal{M}$ will then be applied to this modification.
For stating our results on the BIC methods we need the following additional assumptions.
(A4") There exists a constant φ > 0, not depending on n, such that with probability tending to one, (1) for all m = (m 1 , . . . , m p ) and m (1) = (m For simplicity, we make assumptions (A4") and (A9) that put stronger conditions on φ , d n , a 0 , δ than the assumptions in Subsections 3.1 and 3.3. Our theory can be generalized to cases where the constants φ , d n , a 0 and δ depend on the sample size n, i.e, φ , δ tend to zero as n → ∞ or d n , a 0 diverge with n. However, then more restrictive conditions on C n are needed that depend on the unknown quantities φ , d n , a 0 and δ. This restricts the practical use of such a result. The (A4 ) is a modification of the so-called 'sparse Riesz condition' of Zhang and Huang [42] in high dimensional linear regression. For an application of this assumption see also Wang et al. [37] and Lee et al. [23]. We make the assumption (A12) in order to show that our procedures correctly classify the estimated constant coefficients into zeros and non-varying ones. It is also needed for getting asymptotic properties for the oracle estimatorm ora j , (m ora j ) (1) for j ∈ A 0 . The asymptotics of the oracle estimator is well understood and the derivation of sufficient high-level conditions follows standard lines, see also Appendix A.7 for a related discussion.
The following theorem states that the BIC method defined in (2.6) consistently estimates the index sets $V_0$ and $I_0$. Theorem 3.5 can be translated into a consistency result for the BIC in (2.5), which is a penalization parameter selector for the problem (2.3). Define $\hat\lambda_2 = \operatorname{argmin}_{\lambda_2} \mathrm{BIC}(\lambda_2)$, where the 'argmin' runs over all $\lambda_2 > 0$ such that the cardinality $|\hat A_{\lambda_2}|$ of $\hat A_{\lambda_2}$ is smaller than or equal to $s_n$. Here $\hat A_{\lambda_2} = \hat V_{\lambda_2} \cup \hat I_{\lambda_2}$ is the subset selected by the penalized estimators $\hat m_{\lambda_2,j}$ and $\hat m^{(1)}_{\lambda_2,j}$, and the selector $\hat\lambda_2$ consistently selects the true $V_0$ and $I_0$. Furthermore, we get that minimizing $\mathrm{BIC}(V, I)$ over $(\hat V_{\lambda_2}, \hat I_{\lambda_2})$ leads to a consistent estimator of $(V_0, I_0)$. We denote the minimizing sets by $(\hat V, \hat I)$. We call the estimator $(\hat m_{\hat V, \hat I}, \hat m^{(1)}_{\hat V, \hat I})$, computed under the constraints that $m_j$ is constant with $m_j^{(1)} = 0$ for $j \in \hat I$ and $m_j \equiv 0$ for $j \notin \hat V \cup \hat I$, the BIC-estimator. Thus we have three types of estimators: the LASSO-estimator, the SCAD-estimator with penalty constant chosen by BIC and the just introduced BIC-estimator. We will compare the three estimators in our simulation study in the next section.

Numerical implementation
Our proposed criteria (2.2) and (2.3) include integrals over the interval $[0, 1]$. In the numerical implementation of the method we propose to approximate the integrals by discretization schemes. In our computations we take $J$ discretization points of the interval $[0, 1]$ with $J = 100$ and compute the Riemann sum of the integral for numerical integration. Then our problems turn into a $2Jp$ dimensional optimization problem. The discretized problem of minimizing (2.2) can be formulated as a typical group LASSO problem and easily solved by any numerical algorithm for the group LASSO. In contrast, the resulting problem of (2.3) is quite complicated because there is a hierarchical structure between the different penalties on $\|m_j\|$ and on $\|m_j\|_c$. We therefore develop a numerical algorithm for minimizing (the discretized version of) the criterion (2.3). The optimization problem (2.2) can be solved either by using any available software for a group LASSO problem or by applying our algorithm for (2.3) with $v_1 = \cdots = v_p = \lambda$ and $w_1 = \cdots = w_p = 0$. Recall the definitions of $\Gamma_i(z)$ and $\hat S(z)$ in Section 3.2. With $J$ discretization points $z_1, \ldots, z_J$ of the interval $[0, 1]$, our problem (2.3) can be rewritten as a constrained problem, labeled (4.1), in the variables $x = (x(z_1)^\top, \ldots, x(z_J)^\top)^\top$, $q = (q_1, \ldots, q_J)^\top$, $s = (s_1, \ldots, s_p)^\top$ and $t = (t_1, \ldots, t_p)^\top$, where $A_k$ and $B_k$ denote $2Jp \times 2Jp$ matrices of the form $A_k = \operatorname{diag}(e_k, e_k, \ldots, e_k)$, $e_k$ is the $k$th standard basis vector of a Euclidean space with dimension $p$ and $0_{p \times 2p}$ is the $p \times 2p$ zero matrix. Because of this reformulation as a second order cone programming (SOCP) problem, the problem (2.3) can be minimized by any of the many available numerical solvers for SOCP problems. In our simulations and real data analysis, we used the package 'cvx' in MATLAB, see CVX Research, Inc. [7] and Grant and Boyd [18] for details. Note that the dimension of $x$ is $2Jp$. When $p$ is large, $Jp$ is very large, so that the optimization can lead to serious data handling difficulties for available software. This difficulty generally occurs if one considers penalized methods based on kernel smoothing in high dimensional models. To circumvent the problem, we used an iterative algorithm that minimizes (4.1) in a coefficient-wise (covariate-wise) manner for our simulations and in our data example. The idea of coordinatewise optimization is widely used in high dimensional models for similar reasons [see 16, for example]. Although we observed in our simulations that our iterative algorithm converges in a few iterations (3 to 10, on average about 4.9), computation is not fast enough. The reason is that one has to solve an SOCP problem numerically at each covariate-wise step. It deserves further study to develop more efficient and faster computational algorithms.
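To give the flavor of the discretized problem, the following sketch solves the group LASSO criterion (2.2) with the Python package cvxpy rather than the MATLAB package 'cvx' used in the paper; the function name, scaling conventions and the default conic solver are our own assumptions:

```python
import cvxpy as cp
import numpy as np

def fit_group_lasso_vc(Y, X, Z, h, lam1, J=100):
    """Discretized version of criterion (2.2), solved as a conic program."""
    n, p = X.shape
    grid = np.linspace(0.0, 1.0, J)
    M = cp.Variable((J, p))    # values m_j(z_l) on the grid
    M1 = cp.Variable((J, p))   # values m_j^(1)(z_l) on the grid

    loss = 0
    for l, z in enumerate(grid):
        # Epanechnikov kernel weights K_h(Z_i - z)
        w = np.maximum(0.75 * (1.0 - ((Z - z) / h) ** 2), 0.0) / h
        resid = Y - X @ M[l] - (X * (Z - z)[:, None]) @ M1[l]
        loss += cp.sum_squares(cp.multiply(np.sqrt(w), resid)) / (n * J)

    # Riemann-sum version of sum_j sqrt(||m_j||^2 + h^2 ||m_j^(1)||^2)
    penalty = cp.sum(cp.norm(cp.vstack([M, h * M1]), 2, axis=0)) / np.sqrt(J)
    prob = cp.Problem(cp.Minimize(loss + lam1 * penalty))
    prob.solve()
    return M.value, M1.value
```

For large $p$ one would instead iterate over coordinates as described above, solving a small conic problem per covariate at each step.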
In our numerical work we used the Epanechnikov kernel $K(u) = (3/4)(1 - u^2) I(|u| \le 1)$ with bandwidth $h = 0.15$. To select the regularization parameter $\lambda_1$ in (2.2), we used a 5-fold cross validation estimate of the prediction error. For this, we randomly partitioned the original sample into 5 groups of subsamples, $\mathcal{X}_1, \ldots, \mathcal{X}_5$. Then, for each $j$, the sample with the $j$th partition removed, $\mathcal{X}_{-j}$, is used for estimation whereas the $j$th partition, $\mathcal{X}_j$, is used for validation. For the method (2.2), we selected the regularization parameter $\lambda_1$ that minimizes the cross validation criterion $\sum_{j=1}^5 \sum_{i \in \mathcal{X}_j} \{Y_i - \hat Y_i^{(-j)}\}^2$, where $\hat Y_i^{(-j)}$ denotes the fit for observation $i$ computed from $\mathcal{X}_{-j}$.
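A sketch of this 5-fold selection rule, reusing the hypothetical fitting function from the previous snippet:

```python
def select_lambda1_cv(Y, X, Z, h, lambda_grid, J=100, n_folds=5, seed=1):
    """5-fold cross validation choice of lambda_1 for criterion (2.2)."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(Y))     # random partition
    best_lam, best_cv = None, np.inf
    for lam in lambda_grid:
        cv = 0.0
        for k in range(n_folds):
            tr, te = fold != k, fold == k
            M, _ = fit_group_lasso_vc(Y[tr], X[tr], Z[tr], h, lam, J)
            # Evaluate estimated curves at the nearest grid point of each Z_i
            idx = np.round(Z[te] * (J - 1)).astype(int)
            fitted = np.sum(X[te] * M[idx], axis=1)
            cv += np.sum((Y[te] - fitted) ** 2)
        if cv < best_cv:
            best_lam, best_cv = lam, cv
    return best_lam
```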

Model identification and estimation of penalized methods
We simulated the varying coefficient model in both specifications, the i.i.d. and the time series settings, introduced in Section 2. We generated data from the following two designs:
• Model I (i.i.d. setting): model (2.1) with i.i.d. observations, where the covariates $X_i$ and errors $\varepsilon_i$ are generated independently across $i$;
• Model II (time series setting): model (2.1) with $Z_i = i/n$, where $\{\varepsilon_i\}$ and $\{X_i\}$ are independently generated by AR(1) models; here the innovations $W_i^{(j)}$ are independently generated from $N(0, 1)$ and the $\eta_i$ are i.i.d. from $N(0, 0.25^2)$.
For an assessment of the model selection, we computed the proportion (CM) of cases out of 100 Monte Carlo replications in which the true semiparametric model was correctly selected, that is, the proportion of cases where the procedure correctly identified both of the true index sets $(V_0, I_0)$. We also report the number of correct and incorrect identifications of the varying and non-varying coefficient functions: $(N_{V \to V})$ denotes the average number of correctly identified varying components, $(N_{I \to V})$ the number of non-varying components classified as varying and $(N_{Z \to V})$ the number of zeros incorrectly identified as varying. Furthermore, $(N_{I \to I})$ is the number of correctly identified non-varying components, $(N_{V \to I})$ the number of varying components classified as non-varying, and $(N_{Z \to I})$ the number of zeros incorrectly identified as non-varying. As measures of estimation accuracy we report the average of the integrated squared error (ISE), $\sum_{j=1}^p \int_0^1 \{\hat m_j(z) - m_j^0(z)\}^2 dz$, and the median value of the relative integrated squared error with respect to the oracle estimator (rISE). As above, the oracle estimator is defined as the minimizer of (3.7) subject to knowledge of the true index sets $(V_0, I_0)$. Tables 1 and 2 summarize the simulation results of the LASSO-estimator, see (2.2), and of the SCAD-estimator, see (2.3), with penalty constant chosen by BIC. The values of ISE and rISE show that both methods work in the simulation scenarios. This is expected from Theorems 3.1, 3.4 and 3.5. From the tables, we also see that the LASSO-estimator is not capable of discriminating non-varying components from varying coefficients: the resulting values of CM, $N_{I \to I}$, $N_{V \to I}$ and $N_{Z \to I}$ are always zero. Furthermore, the tables show that this method tends to include more unnecessary varying components. In contrast, the SCAD-method correctly discriminates both varying and non-varying coefficients from zeros, so that it gives quite accurate estimation, also compared to the oracle estimator. That LASSO performs worse than SCAD might also be caused by the fact that LASSO generally has bias terms for all coefficient functions because penalization applies to all coefficients by the same amount. In Figure 1, the first components, $\hat m_1$, of the LASSO estimates in Model I with $p = 200$ are displayed for the three samples that show good/median/poor performance in estimation. More precisely, the estimates are shown for the random samples corresponding to the 0.25, 0.5 and 0.75 quantiles of the ISE results, respectively. From the figure, it can be seen that all the estimates have a bias but that they follow the shape of the true coefficient function $m_1^0$ quite well. The simulation results confirm the theoretical results in Section 3.
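The error measures reported in the tables can be approximated on the discretization grid as follows (a sketch with our own function names):

```python
def integrated_squared_error(M_hat, M_true):
    # ISE = sum_j int_0^1 (m_hat_j(z) - m_j^0(z))^2 dz via Riemann sums;
    # M_hat and M_true are (J, p) arrays of function values on the grid
    return np.sum(np.mean((M_hat - M_true) ** 2, axis=0))

def relative_ise(M_hat, M_oracle, M_true):
    # rISE: accuracy of the estimator relative to the (infeasible) oracle
    return (integrated_squared_error(M_hat, M_true)
            / integrated_squared_error(M_oracle, M_true))
```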

Consistency of BIC in semiparametric model identification
We carried out additional simulations to see how well the criterion BIC(V, I) in (2.6) performs in model selection. For $s_n$ in the definition of $\mathcal{M}$, we set $s_n = 20$ in the simulations. As discussed in Section 3.4, it is computationally infeasible to calculate all values of BIC(V, I) within $\mathcal{M}$ when $p$ is large. We therefore let BIC(V, I) run only over the sets $(\hat V_{\lambda_2}, \hat I_{\lambda_2})$ selected by the penalized estimator $(\hat m_{\lambda_2}, \hat m^{(1)}_{\lambda_2})$ with the same value of $\lambda_2$. Note that $\mathrm{BIC}(\hat V_{\lambda_2}, \hat I_{\lambda_2})$ is not equal to $\mathrm{BIC}(\lambda_2)$. Table 3 shows the results of model selection by using the BIC(V, I)-criterion over the described subset of $\mathcal{M}$ and it also gives the values of the integrated squared error for the BIC-estimator. The table shows that in our simulations model choice by BIC(V, I) works well and leads to a very accurate estimator. However, the differences between the SCAD-estimator and the BIC-estimator are small. Both show excellent performance, in particular compared to the LASSO-estimator in our simulations.

A data example
In this section, we apply our methods to daily observations of the NASDAQ composite index from January 1, 1998 to December 31, 2011 ($n = 3523$). The data include daily returns $R_i$, i.e., the differences between the closing logarithmic prices of today and yesterday, and the high-low ranges $Y_i$, i.e., the differences between the highest and lowest logarithmic prices of a day. The latter has been proposed as a measure of daily volatility in finance. Figure 2 displays the two series. The data plots show changes over time in the time series dynamics. In particular, one sees patterns in the conditional variance such as heteroskedasticity and volatility clustering. This motivates the use of time varying coefficient models.
We have fitted such a model with the daily volatility $Y_i$ as response variable. As (potential) covariates, we took the high-low ranges as well as the value, the squared value, the sign, the negative part and the squared negative part of the daily returns. All these values have been taken from the last 4 weeks, i.e. 20 working days. The latter terms are included in the model to check for asymmetric patterns. Thus, we have 120 covariates (besides the intercept) and the model is given by

Y_i = m_0(Z_i) + \sum_{j=1}^{20} \big\{ m_{1,j}(Z_i) Y_{i-j} + m_{2,j}(Z_i) R_{i-j} + m_{3,j}(Z_i) R_{i-j}^2 + m_{4,j}(Z_i) \operatorname{sign}(R_{i-j}) + m_{5,j}(Z_i) R_{i-j}^- + m_{6,j}(Z_i) (R_{i-j}^-)^2 \big\} + \varepsilon_i,

where $R_{i-j}^-$ denotes the negative part of $R_{i-j}$ and $Z_i = i/n$ is rescaled time. In the analysis the variables are standardized to have zero mean and unit variance, although all results are presented on the original scale of the data.
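A sketch of how such a lagged design matrix can be assembled from the return and range series; the sign convention for the negative part and all names are our assumptions:

```python
def build_design(R, Y, n_lags=20):
    """Stack the 6 x n_lags = 120 lagged covariate types into a design matrix."""
    n = len(R)
    Rneg = np.minimum(R, 0.0)                         # negative part of returns
    base = [Y, R, R ** 2, np.sign(R), Rneg, Rneg ** 2]
    cols = [v[n_lags - lag : n - lag]                 # value at time t - lag
            for lag in range(1, n_lags + 1) for v in base]
    Xmat = np.column_stack(cols)                      # shape (n - n_lags, 120)
    y = Y[n_lags:]                                    # response: today's range
    return y, Xmat
```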
We applied the penalization methods (2.2) and (2.3) to the dataset. The Epanechnikov kernel was used with a bandwidth that spans approximately one year and a half. As in the simulations, we chose the regularization parameter $\lambda_1$ for (2.2) by cross validation, and the choice of $\lambda_2$ for (2.3) is based on the ordinary and the high dimensional BIC. Method (2.2) identified 55 nonzero coefficient functions, whereas method (2.3) with both versions of BIC selected 12 nonzeros, among them 5 varying and 7 non-varying components, see Table 4 and Table 5. Table 6 contains the estimates of the coefficients in the data example that were classified as non-varying by method (2.3). Figure 3 shows the plots of the estimated (non-constant) coefficient functions. The model fit makes sense. First, this holds for the signs of the selected coefficients. This also concerns the selected covariates depending on daily returns: covariates involving the negative parts of past returns are among the selected components. This implies an asymmetric effect of returns on volatility, which is well documented in the literature. Furthermore, in Figure 3 one sees that during the financial crisis periods the daily volatility tends to react more strongly to the volatilities and the (negative) returns of the last days. However, the curves differ in their shape. A past return (volatility) seems more (less) influential in increasing volatility in the first financial crisis period (i) than in the second period (ii). This may be explained by the difference in the pattern of $Y_i$ during the two financial crisis periods: while rather sporadic peaks and drop-offs were observed during the whole period (i), a number of peaks tend to be concentrated within a relatively narrow time span (late 2008) during period (ii).
Table 4: The selected covariates in the data example by (2.2).

Conclusion
This paper closes a gap in the recent literature on sparse high dimensional nonparametric regression. Most papers in this area were only concerned with sieve and orthogonal series estimation. In this paper we have developed a penalized estimation method based on kernel smoothing. This has been done for a central model of sparse high dimensional nonparametric regression. We considered high dimensional varying coefficient models in two settings: for i.i.d. observations and for time varying coefficient models. We showed that our methods can be easily implemented numerically. We proposed several adaptations of the group LASSO and SCAD to the local linear kernel method and we carefully investigated their theoretical properties in model structure identification and estimation. We showed that the group LASSO has an estimation error with nearly the same accuracy as if the zero coefficient functions were known, but that typically it is inconsistent in model selection. Furthermore, the group SCAD estimators have the same asymptotic properties as if one knew the true structure of a partially linear varying coefficient model. We also argued that the penalized estimators of purely parametric components achieve parametric rates of convergence. This is a stronger result than an oracle property as typically shown in the high dimensionality literature. Further, we proposed an extension of BIC to select the shrinkage parameter for structure identification. We theoretically justified the proposed BIC-methods by showing their consistency in (semiparametric) model choice.

Appendix
Note that, using Hoeffding's inequality, we get with probability tending to one a uniform bound of order $nh$ on the number $N(z)$ of $Z_i$'s falling into the interval $[z - h, z + h]$; we refer to this bound as (A.1). Here, we only present the proofs of Theorems 3.1-3.5 and of Proposition 3.1 in the time series setting. The proofs for the i.i.d. setting follow along the lines of the proofs in the time series setting, together with the fact (A.1), so we omit them.

A.1. Proof of Lemma 3.1
For a given $p$-dimensional vector $a = (a_1, \ldots, a_p) \in \mathbb{R}^p$, we let $|a|_\infty = \sup_{1 \le j \le p} |a_j|$ be the supremum norm of $a$. The methods leading to Theorem 2.3 of Dümbgen et al. [11] can be used to derive the following lemma for martingales.
Then, there exists a constant $C > 0$ such that the exponential bound (A.2) holds. This implies that the complement of the event $T$ has probability tending to zero. Because of (A.2), this implies Lemma 3.1.

A.3. Proof of Theorem 3.2
Let $r_n = \{d_n^2 (\log n + \log p)/(nh)\}^{1/2}$. The following lemma is taken from Liebscher [25]. It states an exponential inequality for sums of $\alpha$-mixing random variables. We will use the result in the proof of Theorem 3.2. Proof. Put $\hat\Psi_{jk}(z) = \hat\Psi_{jk,0}(z)$. From Lemma 2.2 in Liebscher [25], we obtain a bound that holds for some constant $0 < C_1 < \infty$. Applying Lemma A.2 with $m = r_n^{-1} d_n^2$, we get, for $1 \le j, k \le p$ and $z \in [0, 1]$, an exponential bound that holds for $M > 1$ and some $0 < C_2 < \infty$. Let $\delta_n = r_n h$ and let $B_\ell \equiv \{z : |z - z_\ell| \le \delta_n\}$, $1 \le \ell \le N$, be a minimal number of balls with radius $\delta_n$ that cover $[0, 1]$. By the Lipschitz continuity of $K$, observe that, for all $z \in B_\ell$, the kernel can be bounded in terms of some bounded and nonnegative function $\tilde K$ with compact support. Then, for $r_n < 1$ and sufficiently large $M > 0$, the desired bound follows, which completes the proof for $s = 0$. The fact (A.5) for $s = 1, 2$ can be proved along the lines of the proof for $s = 0$.

A.6. Proof of Theorem 3.5
Let $(V, I) \in \mathcal{M}$ be given. For simplicity, we denote the corresponding constrained estimator $(\hat m_{V,I}, \hat m^{(1)}_{V,I})$ by $(\hat m, \hat m^{(1)})$ in this proof.