"Variable selection of varying coefficient models in quantile regression"

Varying coefficient (VC) models are commonly used to study dynamic patterns in many scientific areas. In particular, VC models in quantile regression are known to provide a more complete description of the response distribution than in mean regression. In this paper, we develop a variable selection method for VC models in quantile regression using a shrinkage idea. The proposed method is based on the basis expansion of each varying coefficient and the regularization penalty on the Euclidean norm of the corresponding coefficient vector. We show that our estimator is obtained as an optimal solution to the second order cone programming (SOCP) problem and that the proposed procedure has consistency in vari- able selection under suitable conditions. Further, we show that the estimated relevant coefficients converge to the true functions at the univariate optimal rate. Finally, the method is illustrated with numerical simulations including the analysis of forced expiratory volume (FEV) d... Abstract: Varying coeﬃcient (VC) models are commonly used to study dynamic patterns in many scientiﬁc areas. In particular, VC models in quantile regression are known to provide a more complete description of the response distribution than in mean regression. In this paper, we develop a variable selection method for VC models in quantile regression using a shrinkage idea. The proposed method is based on the basis expansion of each varying coeﬃcient and the regularization penalty on the Euclidean norm of the corresponding coeﬃcient vector. We show that our estimator is obtained as an optimal solution to the second order cone programming (SOCP) problem and that the proposed procedure has consistency in variable selection under suitable conditions. Further, we show that the esti- mated relevant coeﬃcients converge to the true functions at the univariate optimal rate. Finally, the method is illustrated with numerical simulations including the analysis of forced expiratory volume (FEV) data.


Introduction
Varying coefficient (VC) models have been widely used as a useful generalization of the linear regression model to depict dynamic behaviors in scientific research. Because of their flexibility and interpretability, much work has been done on their parameter estimation and hypothesis testing, but mostly in the mean regression setting until some authors such as Honda [6], Cai and Xu [2] and Kim [9] paid attention to the quantile regression setting in the early 2000s.
Quantiles themselves can be defined without moment conditions. Inspecting several conditional quantiles can provide us with a more complete description of the data than inspecting only the conditional mean. These advantages have propelled many researchers to consider the quantile regression framework not only in parametric models but also in semi-parametric or nonparametric models. Regarding VC models, Honda [6] and Cai and Xu [2] considered a VC model for the conditional quantiles using local polynomials. Kim [9] proposed a polynomialspline-based methodology for the same model. For longitudinal data analysis, Wang et al. [22] developed the partially linear VC model in quantile regression. In spite of remarkable progress in estimation and hypotheses testing, it is not well understood how to conduct variable selection efficiently for the VC model in the quantile regression framework.
Variable selection is important for any regression problem in that ignoring important predictors brings out seriously biased results, whereas including spurious predictors leads to substantial loss in estimation efficiency. Due to their computational efficiency, various shrinkage methods such as the nonnegative garrotte, the Least Absolute Shrinkage and Selection Operator (LASSO) and the Smoothly Clipped Absolute Deviation (SCAD) have been used in parametric models and recognized as promising methods to allow us to do estimation and variable selection simultaneously. Furthermore, the past decade has observed their extensions to semi-parametric and nonparametric models using basis approximation techniques. Regarding variable selection in VC models, Wang et al. [23] and Wang and Xia [21] proposed penalization methods for selecting nonzero coefficients using the SCAD penalty [4] and the adaptive LASSO penalty [26], respectively. Noh and Park [14] improved the performance of the estimator in Wang et al. [23] by extending the result of Zou and Li [27] to VC models. However, most existing research about the variable selection for VC models is concentrated on mean regression. Since there are few such works in the quantile regression context, we are stimulated to develop a shrinkage method for the VC model in quantile regression.
While Kai et al. [8] recently developed a variable selection method for the parametric part of a partially linear VC model in quantile regression using SCAD, they assumed that there is no sparsity of the variables in the nonparametric part. Different from their work, we propose a variable selection method for estimating coefficient functions using a nonparametric approach. We approximate each varying coefficient function with a B-spline basis [3] and consider a penalized check loss function based on the Euclidean norm of the corresponding coefficient vector. Our variable selection method uses a penalization of the norm of each coefficient vector. Therefore, our work shares the same motivation as in Wang et al. [23] and Xue [24]. However, because our interest lies in the estimation of the conditional quantile of the response, we need to use the check loss instead of the squared loss. Therefore, we take a different approach both theoretically and computationally. In particular, the non-differentiability of the check loss function requires us to develop a different computational algorithm. Furthermore, to show its asymptotic properties, we have to adopt a different approach from the one used in mean regression to handle the issues caused by the non-differentiability of the check loss. We will discuss these issues in the proof of our main results, which are given in detail in the Appendix.
The rest of the article is organized as follows. Section 2 introduces the estimator that we propose. We present a computational algorithm for obtaining the estimator and selection methods for the tuning parameters in Section 3. Its asymptotic properties are fully described in Section 4. Finally, we report numerical simulation results as well as an application of our methodology to the forced expiratory volume (FEV) data in Section 5. All the technical proofs are provided in the Appendix.

Varying coefficient model
is the response of interest and U i ∈ [0, 1] is the univariate index variable. We consider the following VC model for the conditional quantile of Y i given (X i , U i ): where α τ (u) = (α 0,τ (u), α 1,τ (u), . . . , α p,τ (u)) ⊤ is a coefficient vector and the errors e i,τ are independent random variables with the τ th quantile 0 and e i,τ is independent of (U i , X i ). We assume that only s covariates among the X (k) i 's are relevant in model (2.1). It is unknown which s covariates are relevant, nor what the value of s is. Without loss of generality, we let α k (·), k = 1, . . . , s be the nonzero coefficient functions, and α k (·), k = s + 1, . . . , p, be identically zero. Additionally we suppress the subscript "τ " whenever no confusion is caused hereafter.

Regularized estimation using one-step group SCAD
We assume that each coefficient function α k (u), k = 0, . . . , p, can be approximated by a set of basis functions, that is, where {B kl (·), l = 1, . . . , ∞} for all k = 0, . . . , p span a function space F k which is assumed to contain α k (u), and q k is the number of basis functions that are needed to approximate α k (u). Following the approximation (2.2), model (2.1) can be rewritten as 3) The parameters γ kl in the basis expansion can be estimated by minimizing where ρ(t) = 2(τ − I(t < 0))t is the check loss function at a given quantile level 0 < τ < 1. We denote the minimizer of (2.4) asγ = (γ ⊤ 0 ,γ ⊤ 1 , . . . ,γ ⊤ p ) ⊤ , wherẽ γ k = (γ k1 , . . . ,γ kq k ) ⊤ . The statistical properties of the estimator of α k (·) based onγ are fully addressed in Kim [9]. Now suppose that some variables are irrelevant in (2.1) so that the corresponding coefficients are zero functions. Since, using the approximation (2.2), each function α k (u) in (2.1) is characterized by a set of parameters γ k = (γ k1 , . . . , γ kq k ) ⊤ , we should not select nonzero individual components γ kl , but choose the whole nonzero vector γ k . For that purpose, we use the regularized estimation by adding penalties not to an individual element γ kl but to the Euclidean norm γ k 2 of a coefficient vector γ k . To gain some insights into the proposed method, consider the approximation of α k (·) as g k (·) = q k l=1 γ kl B kl (·). We note that its squared L 2 -norm can be rewritten as g k Since the relevance of a coefficient function α k (·) is equivalent to α k L2 > 0, we add to (2.4) thresholding penalties based on γ k w where γ k w ≡ (γ ⊤ k H k γ k ) 1/2 is the weighted Euclidean norm of a vector γ k ∈ R q k . Examples of penalties include the LASSO or the SCAD penalty. Such penalties are called group LASSO [25] penalty and group SCAD [23] penalty, respectively.
In this paper, we use a one-step group SCAD penalty, which is a local linear approximation of the group SCAD. It is known that the one-step group SCAD outperforms the group SCAD both in theoretical and computational aspects. For details, we refer to Noh and Park [14]. Let p λ (·) be the SCAD penalty function. The function p λ is defined on R + by its derivative as for some constant a > 2 and I(·) is the indicator function. In this paper, we define the one-step group SCAD regularized estimator of where ν k = p ′ λ ( γ k w ). Note that for a given basis {B k1 (·), . . . , B kq k (·)}, if there exist constants α ≥ 0 and 0 < N 1 , N 2 < ∞, not depending on q k , such that the Euclidean norm of γ k , γ k 2 , can be easily used in the penalty of (2.5) instead of the weighted Euclidean norm, γ k w . This is because the condition (2.6) enables the direct translation between the Euclidean norm of the estimated coefficient vectorγ k and the L 2 -norm of the estimated functionα k . B-splines (α = 1) and Riesz bases (α = 0) are examples of such bases. For these reasons, we will use γ k 2 in the penalty throughout this paper.

Implementation of the proposed estimator
Since quantile regression typically requires a non-differentiable and asymmetric check loss function, the computation of the estimator defined in (2.5) is quite demanding when penalizing the Euclidean norm of γ k . In particular, the iterative algorithm using local quadratic approximation of p k=1 ν k γ k 2 is not so useful for our estimator as for the one of Wang et al. [23]. On account of this fact, for variable selection of model (2.1), Tang et al. [20] considered the penalty based on the ℓ 1 -norm of γ k instead. However, their proposal needs to be justified theoretically in that such penalization cannot generally guarantee groupwise sparsity in the estimator γ, which is necessary for variable selection in nonparametric models. In our work, while still using the Euclidean norm of γ k , we show that the minimization of l(γ) in (2.5) is equivalent to a second-order cone programming problem. Since it is a well-known convex optimization problem, the estimatorγ can be calculated using computationally efficient methods from the optimization literature.

Computational algorithms
Let π(·) = (B 1 (·), . . . , B q k (·)) ⊤ be a set of basis functions for the estimation of . . , n. Using these notations, the optimization problem (2.5) is reformulated as: The reformulation (3.1) shows that the problem to minimize (2.5) is expressed as one of the second order cone programming (SOCP) problems in which a linear objective function is minimized over the intersection of an affine set and second-order (quadratic) cones. For more details, we refer to Lobo et al. [13] and Alizadeh and Goldfarb [1]. It is clear that (3.1) always has a feasible solution because the original problem (2.5) is an unconstrained optimization problem. Therefore, an optimal solution to (3.1) can be obtained by using the convex optimization algorithms such as primal-dual interior point methods. In our simulations, we use CVX [5] to solve the SOCP problem (3.1).

Selection of tuning parameters
Variable selection in nonparametric models needs to determine two regularization parameters. One is a smoothing parameter for controlling the smoothness of the estimated coefficient functions. The other is a shrinkage parameter for managing the complexity that is the number of covariates in the model. In our model, they are the number of basis functions (q k ) to approximate each coefficient α k (·) and the penalty parameter (λ).
For an initial estimatorγ, we need to choose the number q k of basis for each coefficient function. To choose the q k , we use the Schwarz-type Information Criterion (SIC) [17] as follows: For the proposed estimator, we need to choose q k 's again for the minimization of (2.5). Further, we also have to determine the penalty parameter λ. For the simplicity of implementation, we use the same q k 's for an initial estimator. This is justified by the fact that the optimal order of q k for the proposed estimator is the same as that of the initial estimator. For the selection of λ, we use SIC as follows: where edf is the number of zero residuals. The number of zero residuals is widely used as a measure of the effective dimension of the fitted models in quantile regression. Li and Zhu [12] provided its justification in L 1 -norm penalized quantile regression and Koenker et al. [10] heuristically argued that in the case of univariate quantile smoothing splines the number of zero residuals is a plausible measure for the effective dimension. Additionally, a recent work by Lee and Noh [11] showed that the proposed SIC in (3.3) with such edf gives consistency in model selection for model (2.1).

Asymptotic properties
As a desirable property of nonparametric variable selection, Storlie et al. [19] defined a selection procedure to be nonparametric oracle (np-oracle) if it identifies all relevant variables and estimates the nonzero coefficient functions at the optimal nonparametric rate simultaneously. Wang et al. [23] showed that the estimator for (2.1) in mean regression is np-oracle. In this section, we extend their results to the case of quantile regression. We focus our asymptotic analysis on the case of B-spline basis functions. Note that if we use normalized B-splines of the order d + 1 with b k uniform internal knots to approximate α k (·), the number of basis functions q k is equal to b k + d + 1. The extension to general basis expansions satisfying (2.6) or to the case of the penalty based on γ k w can be obtained using similar technical arguments given in this paper.
To investigate asymptotic properties of our estimator, we consider the case where q k tends to infinity as n goes to infinity. Hence, we should use the notation q k,n because q k depends on n. However, since the assumption that q k,n = q n for all k does not incur any loss of generality in the asymptotic analysis as long as lim sup n→∞ (max 0≤k≤p q k,n / min 0≤k≤p q k,n ) < ∞ is guaranteed, we suppress the dependence of q k,n on k for simplicity. Since we focus on B-splines in our asymptotic analysis, we use the notation b n instead of b k,n in this section and in the Appendix. Similarly, we use the notation λ n since λ needs to vary as n increases. Finally, we use a n ≈ b n to indicate that there exist constants 0 < A < B < ∞ such that A ≤ a n /b n ≤ B for sufficiently large n. To derive asymptotic properties of the proposed estimator, we make the following assumptions.
(A1) α k (·) ∈ H r , k = 0, . . . , p for a constant r > 3/2, where H r is the collection of all functions on [0, 1] for which the mth order derivative satisfies the Hölder condition of the order γ with r ≡ m + γ and 0 < γ ≤ 1. (A2) The conditional distribution of U , given X = x, has a bounded density This set of assumptions is the same as those used by Kim [9] to develop the convergence of the unpenalized estimator for model (2.1). Now, we describe the main result of our paper.
of the function f .
The detailed proof of Theorem 1 is given in the Appendix. We briefly provide the main idea of the proof. First, to handle the issue of non-differentiability of the check function, we uniformly approximate by a quadratic function of γ the difference between l 0 (γ) and l 0 (γ 0 ) (which is non-differentiable) when γ and γ 0 differ by a term of order n −1/2 b n . Here, γ 0 is a coefficient vector that makes the corresponding coefficient function vector α 0 (u) best approximate the true one α(u) within the function space under consideration. For the precise definition of γ 0 , we refer to Lemma 2 in Appendix. Using this approximation, we then show asymptotic properties of our estimator with the idea of the proof in Wang et al. [23].
Part (a) of Theorem 1 shows that the proposed estimator consistently identifies the relevant covariates with probability tending to 1. Part (b) provides the convergence rate of the nonzero coefficient functions. If α k (·) has a bounded second derivative, and if piecewise linear splines with b n ∼ n 1/5 are used, then Theorem 1 (b) implies that α k − α k L2 = O p (n −2/5 ), which is the same as the univariate optimal rate for nonparametric regression with i.i.d. data [18]. As a result, Theorem 1 shows that our estimator is np-oracle.

Numerical studies
In this section, we use B-splines as the basis functions in our implementation of the proposed method. Following the recommendation of Ruppert [16], in our simulation we use a common value of q k , which we will call q as before. Thus the number of basis functions q is equal to b+d+1, where b is the number of interior knots for approximating a coefficient function and d is the degree of the spline. We use equally spaced knots for all simulations in this paper. As for the degree of the spline, Kim [9] recommended the use of lower degree splines because higher degree splines would induce unnecessary interactions between the spline basis and the collinearity among variables in varying coefficient models. In our simulations, we use d = 3 corresponding to cubic splines.

Simulation examples
When the underlying error distribution has a fat tail, is not normal or has infinite variance, the median regression (MR) estimator that minimizes the sum of absolute errors is often considered as a robust alternative to the least squares regression (LSR) estimator in order to understand the conditional central tendency in a dataset. In this section, we compare the performance of the median regression estimator for the model (2.1) with that of the least squares regression estimator while the tails of the error becomes gradually fat using the contaminated normal distribution. The LSR estimator can be obtained by substituting the check loss with the squared loss in (2.4) and (2.5). Since the LSR estimator in the model (2.1) is regarded as a special case of the estimator in Wang et al. [23] where the number of repeated measurements is one, it is easy to see that the estimator has the same asymptotic properties as the MR estimator. For the LSR estimator, we use the Bayesian Information Criterion (BIC) to select the tuning parameters as follows: whereγ LS λ is the estimator in mean regression and edf is the number of nonzero elements inγ LS λ . Theoretical background of this type of BIC criterion can be found partly in Huang and Yang [7] and the same type of BIC was also considered in Tang et al. [20] to select the tuning parameters.
In our simulations, we consider the following two models: i ) ⊤ is generated from a truncated multivariate normal distribution with mean vector consisting of all zeros and covariance matrix given by cov(X (k1) i , X (k2) i ) = 0.5 |k1−k2| for any 1 ≤ k 1 , k 2 ≤ 6. The support of each marginal distribution is restricted to [−5, 5] and e is a symmetric random error independent of the covariates. Because the median and mean of the error coincide when the error is symmetric, both the LSR and the MR estimator aim for the same quantity and hence are directly comparable. The index variable U is randomly generated from Uniform(0,1). It is clear that only the covariate X To investigate the effect of relatively heavy tail error distributions, we consider the contaminated normal distribution CN (ρ; σ 1 , σ 2 ) that is the mixture of N (0, σ 2 1 ) and N (0, σ 2 2 ) with weights 1 − ρ and ρ, respectively. To be more specific for our simulations, we set σ 1 = 1 for model (I) and σ 1 = 1.5 for model (II). For both models, we set σ 2 = 5 and take the values of ρ equal to 0, 0.1 and 0.2. As ρ increases, the error distribution gradually changes from a normal distribution to a relatively heavy tail distribution. For each model, we generate two random samples of n = 200 and n = 400 and then repeat the simulations 500 times.
To compare the MR and LSR estimates for α(u), we first calculate the absolute deviation error (ADE) defined as where the u r 's are the equally spaced grid points on the support of U with n grid = 201. We define the relative absolute deviation error (RADE) by for an estimatorα of the coefficient function vector, whereα MR (·) is the MR estimator. Table 1 shows the sample mean and standard deviation of the RADE with respect to the LSR and unpenalized MR (uMR) estimators over 500 simulations. From the RADE with respect to the LSR estimator, we see that the LSR estimator slightly outperforms the MR estimator in terms of the estimation accuracy when the error is normal (ρ = 0). However, when the error distribution becomes relatively heavier which is the case when ρ = 0.1 or 0.2, the performance of the LSR estimator deteriorates more rapidly than that of the MR estimator. This shows that the quantile regression estimator in the VC model is more useful than the least squares estimator when the error deviates from a normal distribution. The RADE with respect to the uMR estimator implies that the estimation efficiency has substantially improved by the variable selection. Additionally, when comparing models (I) and (II) in terms of the RADE with respect to the uMR, we observe that the sparser model has more benefits from the variable selection.   The results of variable selection for each estimator is given in Table 2. As a performance measure of the variable selection, we report in the column "Correct" the average number of irrelevant coefficient functions correctly estimated to be zero functions and in the column "Incorrect" the average number of relevant coefficient functions incorrectly estimated to be zero functions. In another aspect, we summarize the variable selection results by discriminating three different situations. In the column "Correct fit", we present the percentage of trials for which the estimated model coincides with the true model. When the estimated model misses at least one relevant covariate (even if some irrelevant covariates are selected), we count it as an underfitted model and report its percentage in the column "Underfit". Finally, we show in the column "Overfit" the percentage of overfitted cases in which the estimated model includes all relevant covariates and some irrelevant covariates. Overall, similar conclusions are obtained in terms of variable selection as in the result of estimation accuracy. Additionally, Table 2 suggests that both the MR and the LSR estimator have a tendency to overfit more than to underfit. A possible explanation for this phenomenon is that both estimators determine the amount of penalization for variable selection using the estimator of the L 2 norm of each coefficient. When the sample size is relatively small, the instability of estimation could bring out a situation where the L 2 norm estimates of some irrelevant coefficients are larger than those of other irrelevant ones, which results in overfit. However, even in such a situation, the L 2 norm estimates of relevant coefficients are not likely to be smaller than those of irrelevant ones so underfit is less frequent than overfit. Similar results can be found in Wang and Xia [21], who also uses the L 2 norm estimate of each coefficient for variable selection.
Throughout the simulation, we observed that the performance of our estimator is fairly dependent on that of the initial estimator because our variable selection method uses the initial estimator to determine the amount of shrinkage for each coefficient function. However, this dependency can be eliminated in a certain degree by using the final estimate as a new initial estimator if necessary. An example of this kind of iterations is found in Tang et al. [20].

Application to forced expiratory volume data
The forced expiratory volume (FEV) is the amount of air which is forcibly exhaled from the lungs in the first second. The FEV data from Rosner [15] contain measurements of FEV in liters, age (U ) in years, height (H) in inches, sex (S = 0 for girls/S = 1 for boys), and smoking status (SM = 1 for a regular smoker/SM = 0 otherwise) which were collected from 654 children aged 3-19. To assess the effect of smoking status on FEV and the lower conditional quantiles of FEV as a measure for poor pulmonary functioning, Kim [9] considered the VC model (2.1) in quantile regression as follows: In this subsection, we illustrate our proposed method by showing how it selects relevant covariates in model (5.1) with respect to different quantile levels.
We consider the conditional first vigintile (q 0.2 ) and median (q 0.5 ) for the analysis. In order to detect smoking effects accurately, we only consider children over the age 10 (345 subjects) because there are no smokers among the children below the age 10. We use piecewise linear (d = 1) and quadratic (d = 2) splines with equispaced knots. Since the number of covariates in model (5.1) is relatively small, choosing different number of knots for each coefficient function is computationally allowed. Hence, we determine the degree of splines and the number of equispaced knots for each coefficient function simultaneously using the SIC values in (3.2).
Through the proposed procedure, all the covariates except the smoking status are selected as relevant for both quantiles. FEV is known to increase in accordance with the body growth, which is measured by the height to a certain extent. Typically boys tend to have a larger FEV on average than girls as they grow. Based on this information, our selection results of covariates coincide with the previous results commonly known in the medical science. Figure  1 shows the coefficient function estimates for the selected covariates. Since only a small number of data are available beyond the age of 15, we restrict our attention to the estimates below the age of 15 in Figure 1. At both quantiles, the estimated baseline function β 0 (u) increases linearly, which implies that FEV increases with age. This makes sense because all the children in the data were in the middle of growth. Regarding the effect of sex on FEV, the difference between boys and girls becomes bigger in both quantiles if we ignore the age range beyond 15.
One of the interesting results in our analysis is that the relevance of smoking status on FEV is found to be different according to the quantile level. The smoking status is not selected for the median level (q 0.5 ), while it is selected for the relatively low quantile (q 0.2 ). In Figure 2, we observe that the coefficient function of smoking status in the first vigintile is negative during all the period except at age 10 and for the age range 17-18. The coefficients in those ranges are not reliable because the smoking period of a 10-year old boy is supposed to be very short and few observations exist in the age range [17][18]. Consequently, our result shows that the negative effect of smoking on FEV is not clear at the median level. However, smoking smoking should at least be considered as a relevant factor which has some negative effects on the group who has weak lungs, especially at the period of growth. This kind of heterogeneity would not have been revealed by variable selection procedures for mean regression. However, to investigate the relationship between the smoking status and FEV more rigorously, we need to know not only the smoking status of each subject but also the years they have been smoking for. This relationship should be checked again with such data.

Conclusion and future work
In this paper, we proposed an efficient variable selection method for varying coefficient models in quantile regression when all coefficient functions are assumed to be varying. Theoretically, we showed that our method has a nonparametric oracle property, which is desirable as a variable selection procedure. From a practical point of view, we illustrated through a simulation study that when the error is asymmetric or has a fat tail, our estimator is more robust than the one in mean regression in terms of the consistency of the variable selection method and the efficiency of the estimation procedure.
For more efficient estimation through variable selection, it is also important to separate varying and constant coefficients among the nonzero coefficients. For that purpose, it is possible to use the hypothesis testing given in Section 4 of Wang et al. [22] after identifying the nonzero coefficients based on our method.
Finally we addressed variable selection issues of varying coefficient models under the assumption that there are no repeated measurements for each subject. To extend our work to a longitudinal data setting seems a promising and useful project for practitioners. We leave it as a future work.
and hence it follows from (6.5) that P inf γ−γ 0 2=C1n −1/2 bn (l(γ) − l(γ 0 )) > 0 → 1, as n → ∞. By the convexity of l(γ) and the fact that l(γ) − l(γ) ≤ 0, there exists some C ξ , for any ξ > 0, such that as n → ∞, Proof of Theorem 1 (a). Different from the mean regression, the gradient func- is not useful to establish asymptotic properties of the estimatorγ. The reason is because the check function is not differentiable at zero and some of the estimated residuals are exactly zero in quantile regression. Because of these two reasons, the gradient function of l 0 (γ) evaluated atγ is not directly applicable since it involves the first derivative of the check function evaluated at each residual Y i − Π ⊤ iγ . As an alternative, we may consider the subgradient function of l 0 (γ). However, the subgradient function of l 0 (γ) is so complicated that it is very difficult to derive easy-to-check optimality conditions. Even though Tang et al. [20] seemed to derive the concise optimality conditions as in the case of mean regression when penalizing by the Euclidean norm of γ k , their optimality conditions are not correct. So in our work we derive the consistency in variable selection using a certain lower bound of the difference of two check loss functions. Since the selection consistency regarding the relevant coefficients is clear from Theorem 1 (b), which we will show, we focus on the case of the irrelevant coefficients.
Let γ * be the vector obtained fromγ withγ k0 being replaced by 0. It will be shown that there exists a η > 0 such that l(γ) − l(γ * ) > 0 with probability at least η for infinitely many n, which contradicts with the fact that l(γ)−l(γ * ) ≤ 0.
Proof of Theorem 1 (b). It is easy to see that, for all k = 0, 1, . . . , s, By Lemma 3 and 5, the desired result is obtained.