Screening-based Bregman divergence estimation with NP-dimensionality

Feature screening via marginal screening ([5]; [7]) has gained special attention for high dimensional regression problems. However, existing results are confined to the generalized linear model (GLM) with the exponential family of distributions. This inspires us to explore the suitability of applying screening procedures to more general models, for example without assuming either the explicit form of distributions or parametric forms between response and covariates. In this paper, we extend the marginal screening procedure, by means of Bregman divergence (BD) as the loss function, to include not only the GLM but also the quasi-likelihood model. A sure screening property for the resulting screening procedure is established under this very general framework, assuming only certain moment conditions and tail properties, where the dimensionality p_n is allowed to grow with the sample size n as fast as log(p_n) = O(n^a) for some a ∈ (0, 1). Simulation and real data studies illustrate that a two-step procedure, which combines the feature screening in the first step and a penalized-BD estimation in the second step, is practically applicable to identifying the set of relevant variables and achieving good estimation of model parameters, with a computational cost much lower than that of methods without the screening step.


Introduction
In the recent literature, there has been a tremendous amount of work on high dimensional regression estimation and classification. These types of studies arise frequently in many different areas of scientific research, such as fMRI brain images, microarrays, genomics, financial data, and internet traffic data. With the development of new technologies, we are now able to collect data sets which are much larger and more complex than we could have imagined a few years ago. In certain applications, the dimensionality p = p_n can grow much faster than the sample size n. In particular, if p_n can grow at log(p_n) = O(n^a) for some a > 0, we call p_n the non-polynomial dimensionality or "NP-dimensionality".
For problems with NP-dimensionality, the classical regression model with p_n parameters is not identifiable. On the other hand, in many applications only a small number of variables among all p_n covariates have an actual impact on the response variable. Thus, a sparse structure is usually assumed in such cases. As a result, techniques that can generate sparse solutions are preferred and extensively studied. Regularization is one of the most commonly used techniques for obtaining well behaved solutions to overparameterized estimation problems. Numerous variable selection methods based on regularization/penalization have been developed, including the bridge regression ([8]), the Lasso ([17]), the SCAD ([6]), the MCP ([21]), and the Dantzig selector ([3]), among many others. [5] proposed another approach, a screening procedure that selects relevant variables based on their marginal correlations; the "sure independence screening" property was established under certain conditions in their work. [7] extended the sure independence screening procedure to the generalized linear model (GLM) ([15]). Their result works well for the GLM, but is somewhat restrictive, since their arguments largely depend on the nice properties associated with the exponential family and the canonical link function. [24] proposed a model-free feature screening approach (SIRS) using a special marginal utility measure based on the conditional distribution of the response given the covariates, and showed that their approach can asymptotically rank the relevant variables above irrelevant ones in multi-index models. [14] studied another model-free feature screening method based on the distance correlation and demonstrated the sure screening property for their method. [13] developed the robust rank correlation based screening procedure for the linear model and discussed the possibility of extending their method to the generalized linear model. In [4], an independence feature screening procedure based on the marginal empirical likelihood is studied for generalized linear models under the exponential family distribution assumption.
These works inspire us to explore the suitability of applying screening procedures to more general models, for example without assuming either the explicit form of distributions or any parametric form between response and covariates. In this paper, we extend the marginal screening procedure, by means of the notion of Bregman divergence (BD) as the loss function ([22]), to a wider class of screening procedures, including not only ranking by the marginal maximum likelihood estimate in the GLM, which has been studied by [7], but also ranking by the quasi-likelihood ([16]), which has been less developed, and more. An interesting example is given in Section 5.1.1 for overdispersed Poisson responses, to which the conventional GLM is not applicable. Hence, compared with the methods in [13] and [4], our proposed method is applicable to a more general setting than the generalized linear model. Furthermore, several BD loss functions are widely used in machine learning systems, for example the hinge loss for the support vector machine ([19]) and the exponential loss for boosting ([12]), but they do not generally fulfill GLM model assumptions.
The proposed method utilizes the marginal regression minimum-BD estimator of each covariate and ranks their importance according to the absolute values of the marginal estimates. The sure screening property can be established based on a non-asymptotic probability bound for the occurrence of selection inconsistency. This means that all the truly relevant variables will be selected with overwhelming probability, and the results are applicable under NP-dimensionality, which allows the dimension p_n to grow as fast as log(p_n) = O(n^a) for some a ∈ (0, 1). Our results do not require that the covariates either follow an elliptically contoured distribution as in [7] or satisfy the linearity condition as in [24].
While the above screening procedure is able to identify relevant variables, it does not directly provide good estimates of the non-zero parameters in a given parametric model. It is thus natural to devise a two-step procedure, which combines "feature screening" in the first step with "parameter estimation" in the second step, to obtain the final model. To our knowledge, the theoretical properties of such a two-step procedure, where the second step uses the penalized-BD estimation of parameters ([23]), have not been studied in the existing works on screening procedures. We carry out numerical assessment and comparison of the proposed screening method and the resulting estimation performance of several popular methods for the final model, via both simulation studies and real data analysis. In the simulation studies, our proposed screening method performs better than model-free alternatives such as those in [24] and [14] in ultra-high dimensional settings. Our screening method based on the marginal regression minimum-BD estimator performs well, especially in the most stringent simulation setting, in which both response and covariates are binary and very sparse. The results show that the two-step procedure is practically applicable. Another considerable advantage of this two-step procedure is that, by filtering out most of the irrelevant variables in the first step, we can greatly reduce the computational expense of parameter estimation in the second step, which is usually more costly. Thus, the computation time of the two-step procedure in the simulation is just a fraction of that without the screening step.
It is relevant to note that our main contribution in this paper is to use Bregman divergence as a powerful tool to unify many commonly used loss functions and to simultaneously study their asymptotic behavior under a very general framework, in which the distribution of the response, conditional on the covariates, is allowed to be incompletely or not fully specified, and only certain moment conditions and tail properties are assumed. The main result of the sure screening property does not require a particular parametric model, and it reveals that different choices of BD only affect some constants in the probability bound.
The rest of the paper is organized as follows. Section 2 introduces the setup of the general regression model and briefly reviews the Bregman divergence (BD). Section 3 develops a screening procedure based on componentwise regression minimum-BD estimation and justifies its sure screening property. Section 4 proposes a two-step procedure combined with the penalized-BD estimation and demonstrates the oracle property of parameter estimation. Simulation results are presented in Section 5, and real data applications are given in Section 6. Details of technical assumptions and proofs are relegated to the Appendix.

A general framework
Assume that the observed data {(X_i, Y_i) : i = 1, . . ., n} are random samples from the population distribution of a p-dimensional covariate vector X and a scalar response Y, where X = (X_1, . . ., X_p)^T and X_i = (X_i1, . . ., X_ip)^T. The number of variables p is allowed to grow with the sample size n, so we denote it by p_n when needed. In this paper, we are interested in predicting the response Y by its conditional expectation given X, namely m(x) = E(Y | X = x). While p_n may grow much faster than n, we assume that the true underlying model is sparse, which means that m(X) functionally depends on only a small fraction of the covariates, whose index set we denote by M_n, with s_n = |M_n|. Without loss of generality, we can re-arrange the covariates such that M_n = {1, . . ., s_n}. Write X = (X^(I)T, X^(II)T)^T, where X^(I) collects all truly relevant covariates and X^(II) is just noise. Under this framework, our goal is to investigate the performance of a wide class of feature screening methods by means of Bregman divergence, which will be introduced in the next section. [2] introduced a device for constructing a bivariate function which can be used as a general loss function: for a given concave function q, the Bregman divergence is defined accordingly.
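Concretely, for a concave generating q-function, the BD between ν and μ can be written as ([2]; [22]):

```latex
Q(\nu, \mu) \;=\; -\,q(\nu) \;+\; q(\mu) \;+\; (\nu - \mu)\, q'(\mu).
```

For example, q(μ) = −μ² yields the quadratic loss Q(ν, μ) = (ν − μ)²; adding an affine term to q leaves Q unchanged, so the generating function is unique only up to affine shifts.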

Bregman divergence
Conversely, for a given Q-loss, [22] provided necessary and sufficient conditions for Q being a BD, and further derived an explicit formula for solving the generating q-function. They also showed that the quadratic function, the Kullback-Leibler divergence (or the deviance loss) for the exponential family of probability functions, the (negative) quasi-likelihood function, and many margin-based loss functions, such as the misclassification loss, the hinge loss for the support vector machine ([19]), and the exponential loss used in AdaBoost ([12]), are all special cases of BD.
As an illustration, when we relax the distributional assumption on the response Y by only assuming var(Y | X = x) = σ²V{m(x)} for a known continuous function V(•) > 0, the quasi-likelihood function Q, characterized by a partial differential equation involving a nuisance parameter σ² > 0, is usually used as an alternative to the complete log-likelihood function. [22] verified that the (negative) quasi-likelihood function belongs to the BD class and derived the generating q-function, in which a is a finite constant such that the integral is well-defined.
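This generating function can be reconstructed from the BD structure: since the (negative) quasi-likelihood loss satisfies ∂Q(y, μ)/∂μ = −(y − μ)/{σ²V(μ)}, the generating function must satisfy q''(μ) = −1/{σ²V(μ)}, which gives

```latex
q(\mu) \;=\; \int_{a}^{\mu} \frac{s - \mu}{\sigma^{2}\, V(s)}\, ds ,
```

where a is a finite constant such that the integral is well-defined; concavity of q follows immediately from V(•) > 0.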

Screening via componentwise regression minimum-BD estimation
Our proposed screening procedure is based on the componentwise regression minimum-BD estimators defined in (3.1), where the loss function Q is a BD as defined in (2.2) with a generating q-function. Furthermore, we restrict m(x) in (3.1) to be of the form (3.2), where α_j and β_j are two parameters to be estimated and F is a known link function for the appropriate data type. Usually, an identity link F(μ) = μ corresponds to the linear regression model for continuous responses; a logit link F(μ) = log{μ/(1 − μ)} is utilized in the logistic regression for binary responses; a log link F(μ) = log(μ) is used in the Poisson regression of count responses.
The functional form in (3.2) is a linear approximation to the problem in (3.1), which appears somewhat restrictive; however, our later theoretical results show that such a class of functions is actually rich enough to detect the marginal importance of covariates for the screening purpose.
Thus, the minimization problem in (3.1) is equivalent to estimating (α̂_j^CR, β̂_j^CR), for j = 1, . . ., p_n, which are defined by the sample criterion (3.3). We select the variables by choosing those whose componentwise coefficient estimators |β̂_j^CR| exceed a predefined threshold value γ_n > 0; that is, the variables X_j with indices j belonging to the set M̂_{γ_n} = {1 ≤ j ≤ p_n : |β̂_j^CR| ≥ γ_n} will be selected, and the remaining variables will be screened out.
The minimization problem (3.3) only involves a univariate covariate and an intercept. Thus, fast and robust computation is feasible even in NP-dimensional problems. When an appropriate γ_n is chosen, we can significantly reduce the dimension of the original parameter space to a much smaller one and thus make it more manageable. After the screening step, other variable selection methods, like those (mentioned in Section 1) based on penalization, become more feasible on the surviving variables.
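As a concrete illustration (not the authors' code), the procedure can be sketched for binary responses with the Bernoulli deviance BD and the logit link; the function names and the plain gradient-descent solver below are our own choices:

```python
import numpy as np

def componentwise_bd_fit(x, y, steps=400, lr=1.0):
    """Minimize the empirical Bernoulli-deviance BD for one covariate:
    (alpha_j, beta_j) = argmin (1/n) * sum_i Q{y_i, F^{-1}(alpha + beta*x_i)}
    with the logit link F, via plain gradient descent."""
    a = b = 0.0
    for _ in range(steps):
        mu = 1.0 / (1.0 + np.exp(-(a + b * x)))  # F^{-1}(eta), the fitted mean
        g = mu - y                               # d(deviance loss)/d(eta)
        a -= lr * g.mean()
        b -= lr * (g * x).mean()
    return a, b

def sis_bd(X, y, top):
    """Rank covariates by |beta_hat_j^CR| and keep the 'top' largest."""
    betas = np.array([componentwise_bd_fit(X[:, j], y)[1]
                      for j in range(X.shape[1])])
    keep = set(np.argsort(-np.abs(betas))[:top].tolist())
    return keep, betas
```

With independent covariates and a strong signal on a few of them, the truly relevant indices reliably appear at the top of the ranking; in practice any convex optimizer (for example Newton iterations) can replace the gradient loop.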
In our screening procedure, the magnitude of the componentwise regression coefficient estimator, β̂_j^CR, serves as a proxy for the importance of the corresponding feature X_j. Two questions arise naturally: (I) how well the set M̂_{γ_n} preserves all relevant covariates, given that the estimators β̂_j^CR from the componentwise minimization problem (3.3) only approximate the importance of covariates in the original model (2.1); (II) how small the size of M̂_{γ_n} can be, given that M̂_{γ_n} should still include all truly relevant variables.
We will answer these two questions, by showing that the sure screening property holds under certain conditions, in the following sections.

Population version of componentwise regression minimum-BD estimator
Recall that the estimators in (3.3) are based on empirical minimization. To gain further insight, we define a population analogue of (3.3), denoted by (α_j^CR, β_j^CR), where the expectation is taken with respect to the underlying joint distribution of (X, Y). Note that the componentwise minimum-BD estimator β̂_j^CR will converge in probability to the population version β_j^CR. To guarantee the validity of the screening procedure, it is necessary that β_j^CR should at least preserve the significance of truly relevant covariates, i.e., those on which m(X) functionally depends. Theorem 1 confirms that the significance of β_j^CR (i.e., β_j^CR ≠ 0) depends only on the correlation between Y and X_j.
Theorem 1 implies that if the response variable Y is correlated with a relevant variable X_j, then the componentwise regression coefficient β_j^CR will be non-zero. In contrast, for those irrelevant variables X_j which are uncorrelated with Y, β_j^CR will be zero. Theorem 2 further indicates that the magnitude of β_j^CR is closely related to the magnitude of the correlation between X_j and Y.

Theorem 2. Assume Conditions A1-A5 in the Appendix, for any positive sequences A_n and B_n.

The conditions used in Theorem 2 are typically regarded as mild and are often assumed in the literature ([22]; [7]). Assumptions A1 and A2 concern the tail behavior of the population distribution. Assumptions A3 and A4 concern the convexity and smoothness of the BD. The requirement on the covariance between Y and X_j for j ∈ M_n ensures that the minimal signal strength of relevant variables is not too weak and remains identifiable.
If those conditions hold and A_n ≳ B_n, we can naturally utilize the gap between the two groups of {|β_j^CR|}_{j=1}^{p_n} to identify the relevant variables, where a_n ≳ b_n denotes that there exists a constant c > 0 such that a_n ≥ c·b_n for all n ≥ 1.

Sure screening property of componentwise BD regression
We start by giving the uniform convergence of the componentwise regression minimum-BD estimators (3.3). To facilitate the derivation, we assume E(X_j) = 0 and E(X_j²) = 1, for j = 1, . . ., p_n, in the following results.

Theorem 3. Assume Conditions A1-A5 in the Appendix. Then, for any positive sequence, a uniform exponential probability bound holds for the estimators, where m_0 and m_1 are the constants given in Condition A2 of the Appendix.
Theorem 3 is an application of the exponential bound for the quasi-MLE in [7] (Theorem 1). It guarantees that the empirical estimators β̂_j^CR converge uniformly to their population counterparts β_j^CR. Corollary 1 addresses the first question raised at the beginning of Section 3. It is easy to see that if we assume A_n = c_0 n^{−α} with a constant 0 < α < 1/2 and log(p_n) = o(n^{1−2α}), then the probability bounds in Corollary 1 approach one at the rate 1 − O{p_n exp(−c_3 n^{1−2α})} for a positive constant c_3, which is the same rate obtained in [5] and [7]. This implies that the correct model will be selected with probability tending to one even under NP-dimensionality, where p_n is permitted to be as large as log(p_n) = o(n^{1−2α}).

Remark 1.
In Corollary 1, the conclusions of part (i) and part (ii) hold separately. In some cases, the assumption about the covariance between the response and covariates in part (ii) of Corollary 1 needs to be relaxed, but the sure screening property given in part (i) of Corollary 1 will still hold. This means that even if we cannot eliminate all irrelevant covariates, due to certain correlations between covariates, it is still guaranteed that we will not miss any truly relevant variables.

Comparison with sure independence screening in GLM
While our motivation comes from [7]'s work on the generalized linear model (GLM), our results largely enhance the capability of marginal screening methods by extending them to a broader class of models with any BD as the loss function. In fact, proving the sure screening property under such a general framework is by no means straightforward. The main challenge is that certain relationships in the GLM are not applicable under an arbitrary choice of BD and link function in our setting. For example, the equality (3.5) holds under the GLM combined with its canonical link F, where β_{0;0} and β_0 = (β_{1;0}, . . ., β_{p;0})^T are the unknown true parameters (see the parameterization in (4.1)). Actually, (3.5) is the same as equation (14) in [7], which is a significant part of the proofs of Theorems 2, 3, and 5 therein. However, (3.5) no longer holds in the BD estimation when F can be an arbitrary link.
To overcome this technical challenge, we introduce a different way to express the componentwise regression minimum-BD estimate and also impose assumptions on the uniform bound of the covariates; for details, see the Appendix. Although the proposed screening procedure is based on a certain linear form, the sure screening property of the proposed screening method does not actually require any particular parametric form of the relationship between the response Y and the covariates X. Instead, the sure screening property is mainly built on the assumption about the minimal signal strength of relevant variables measured by the marginal covariance. Our results also reveal that different choices of BD only affect the constants c_1 and c_2 in the probability bounds in Corollary 1.

Two-step procedure with penalized-BD estimation
The results in Section 3 show that the screening procedure based on componentwise regression minimum-BD estimation works well in selecting the truly relevant variables. However, it may not be a good way to build a predictive model and provide estimates. In the absence of screening, [23] investigated the penalized-BD estimation and its oracle property in a large-dimensional model of the form (4.1), where β_{0;0} and β_0 = (β_{1;0}, . . ., β_{p;0})^T are unknown true parameters and F is a known link function. In particular, their penalized-BD estimator using weighted L_1 penalties minimizes the criterion function (4.2), where λ_n is the tuning parameter and {w_j}_{j=1}^{p_n} are given weights for the parameters {β_j}_{j=1}^{p_n}. In this section, we adopt the same setting. We can now answer the second question posed at the beginning of Section 3 via Theorem 4, which controls the number of variables selected into the set M̂_{γ_n}.
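In symbols, the criterion (4.2) combines the average BD loss with a weighted-L_1 penalty; the notation ℓ_n, and the symbol α for the (unpenalized) intercept, are ours:

```latex
\ell_n(\alpha, \boldsymbol{\beta})
\;=\; \frac{1}{n}\sum_{i=1}^{n} Q\{Y_i,\; F^{-1}(\alpha + \mathbf{X}_i^{T}\boldsymbol{\beta})\}
\;+\; \lambda_n \sum_{j=1}^{p_n} w_j\,\lvert \beta_j \rvert .
```

The weights w_j allow differential shrinkage of the coordinates, which is what yields the oracle property when the weights are chosen adaptively.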

Theorem 4. Assume the conditions in Theorem 3 and Condition D, where c_1 is the constant given in Theorem 2. Then the size of M̂_{γ_n} is bounded in probability, where Σ = var(X) and λ_max(Σ) denotes the maximum eigenvalue of Σ.
When A_n = O(n^{−α}) and λ_max(Σ) = O(n^τ) with 2α + τ < 1, Theorem 4 indicates that the number of selected covariates will not exceed the order n^{2α+τ} = o{n/log(n)}. Therefore, we propose a two-step procedure, from screening features to estimating the coefficients of the selected variables, as follows. Since the cutoff value γ_n involves an unknown constant, in practice we propose an easier and more straightforward scheme: choose γ_n to be the p̃_n-th largest value of |β̂_j^CR|, where p̃_n = ⌊n/log(n)⌋ and ⌊•⌋ denotes the floor function. This choice of p̃_n is large enough that all truly relevant covariates will be selected, and it is also suitable for the estimation method in the second stage.
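In code, this cutoff scheme is straightforward; the helper names below are ours:

```python
import math
import numpy as np

def screening_size(n):
    """Data-driven model size: floor(n / log(n))."""
    return math.floor(n / math.log(n))

def keep_top(beta_hat, n):
    """Indices j whose |beta_hat_j| rank among the top floor(n/log(n))."""
    d = screening_size(n)
    order = np.argsort(-np.abs(np.asarray(beta_hat, dtype=float)))
    return sorted(order[:d].tolist())
```

For example, n = 250 gives a cutoff of 45 retained variables, and n = 350 gives 59.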
Step 1: Obtain the componentwise regression minimum-BD estimators β̂_j^CR and retain the covariates whose estimators rank among the top p̃_n in magnitude. Step 2: Apply the penalized-BD estimation to the variables retained in Step 1. The idea of two-step procedures is widely used in the literature, for example the multi-stage method in [20]. After the first step, we greatly reduce the dimensionality and, at the same time, by the sure screening property, still preserve all truly relevant variables with high probability.
Proposition 1 also indicates that our two-step procedure enjoys the oracle property under certain conditions.

Proposition 1. Assume the conditions in Theorem 4 together with a rate condition involving √n, and select the weights in (4.2) by (4.3), where β̂_j^PCR is based on the penalized componentwise regression BD estimation. Then we have the following results for the two-step estimator.
(i) There exists a local minimizer of the criterion; then, for any fixed integer k and any k × (s_n + 1) matrix A_n, where W_n = diag(0, w_1, . . ., w_{s_n}) and sign(β_0^{(I)}) = (sign(β_{0;0}), sign(β_{1;0}), . . ., sign(β_{s_n;0}))^T. Note that the componentwise BD regression weight selection method proposed in [23] excludes the intercept term. In contrast, our current weight selection method (4.3) includes the intercept term. Nevertheless, the assumptions for the oracle property are still satisfied, and thus our procedure also enjoys the oracle property.

Simulation study
In this section, we assess the performance of both the screening step and the estimation step of the two-step procedure. Two settings of (n, p_n) are used in our simulation, (250, 250) and (350, 15000), representing high dimensionality and ultra-high dimensionality, respectively.

Performance of feature screening
To evaluate the performance of the screening methods, we measure the accuracy of the importance ranking of the covariates by the minimum model size (MMS) needed to include all truly relevant covariates. We also provide a coverage measure, defined as the percentage of runs in which all truly non-zero coefficients are picked up when setting p̃_n = ⌊n/log(n)⌋. The following methods are compared: "SIS-BD": our proposed screening method; "DC-SIS": the method in [14]; "EL-SIS": the method in [4]; "RRCS": the method in [13]; "SIRS": the method in [24]. (5.1) All the results are averaged over 400 simulation runs. The covariates are generated such that, with J_d a d × d matrix in which all entries are ones and I_d a d × d identity matrix, (X_i1, . . ., X_ip_n) are marginally Uniform(−0.5, 0.5) random variables and are correlated if ρ ≠ 0. The type of BD we use here is the (negative) quasi-likelihood.
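A covariate design with Uniform(−0.5, 0.5) marginals and equicorrelation ρ can be generated through a Gaussian copula; the sketch below is our illustration of one such construction (the paper's exact generator is not fully shown here), using a shared factor to induce the correlation:

```python
import math
import numpy as np

def equicorrelated_uniforms(n, p, rho, rng):
    """Draw an n x p matrix whose rows have N(0, (1-rho)*I + rho*J) latent
    structure; applying the standard normal CDF and centering gives
    Uniform(-0.5, 0.5) marginals."""
    w0 = rng.standard_normal((n, 1))   # shared factor across the p columns
    w = rng.standard_normal((n, p))    # idiosyncratic part
    z = math.sqrt(rho) * w0 + math.sqrt(1.0 - rho) * w
    norm_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    return norm_cdf(z) - 0.5
```

For ρ = 0 the columns are independent; for ρ ≠ 0 they are positively correlated (the induced correlation on the uniform scale is slightly below ρ).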

Overdispersed Poisson responses
It is generated by the q-function in (2.3) with the variance function appropriate for overdispersed Poisson responses. Table 1 presents the mean and standard deviation, along with a five-number summary, of the MMS, as well as the coverage percentage, for the screening methods in different settings. For the case of (n, p_n) = (250, 250) and ρ = 0.2, all the procedures work well in this nonlinear model and rank the truly relevant covariates at the very top of the list, as the resulting MMSs are very close to the true model size. Also for (n, p_n) = (250, 250), when the dependence parameter ρ increases from 0.2 to 0.5, the correlation between the covariates becomes larger and the irrelevant covariates can more easily be confounded with the relevant covariates. In this case, the MMS becomes a little bigger and the coverage percentage becomes slightly smaller, while all methods still perform comparably well. For (n, p_n) = (350, 15000), our proposed method performs better than or as well as the other methods. In particular, when ρ = 0.5, SIS-BD has a significantly smaller mean MMS and a larger coverage percentage than the other methods. It is also seen that, in the ultra-high dimensional case, as ρ increases, the MMS increases and the coverage percentage decreases, as expected. Comparing the results for (n, p_n) = (350, 15000) with those for (n, p_n) = (250, 250), the higher dimensionality makes the feature selection problem harder, but the values of the MMS do not increase much, which supports our theoretical results in Section 3.
Results from two types of BD are presented in this section: the Bernoulli deviance (DEV) loss and the exponential (EXP) loss. Table 2 reports the mean and the five-number summary of the MMS from our simulation. Our proposed methods SIS-BD(DEV) and SIS-BD(EXP), combined with the two choices of loss functions, DEV and EXP respectively, give similar results in all cases, as expected, and both perform better than or as well as the other methods. When r = 0.2, the signal from the data is scarcer and it is more difficult to identify the truly relevant variables. Under this stringent situation, the mean values of the MMS for our methods remain at a low level compared with the total number of covariates, and are smaller than those of the other methods. For r = 0.5, the performance of the screening methods improves compared with that for r = 0.2. As the dimensionality grows, the MMSs for our methods increase, but at a much smaller rate, so that the identification of relevant covariates and further parameter estimation remain possible. However, because of the ultra-high dimensionality and the sparse signals from the covariates, there is a dramatic decrease in the coverage percentage as p_n increases from 250 to 15000 for r = 0.2. Compared with the corresponding results in Table 1 for the overdispersed Poisson model, the values of the MMS in Table 2 are much larger because, in the logistic regression model, the response and covariates are all binary, which provides very limited information from either part. It is worth noting that the RRCS method fails to identify the truly non-zero coefficients. This does not contradict the conclusions in [13], where the sure screening properties of RRCS are demonstrated for linear models but not for generalized linear models, due to technical difficulties as discussed in their Section 4.2.

Performance of parameter estimation
Here we compare the performance of the two-step procedure with that of variable selection methods using penalization applied directly to all variables. Namely, they minimize a criterion function similar to (4.2), except that the penalty can be the SCAD, the MCP, the L_1 penalty, or the weighted-L_1 penalty with weights from the penalized componentwise regression (WL1PCR). Let p̃_n = ⌊n/log(n)⌋ in the first step. For brevity, the two-step procedures are referred to as S-SCAD, S-MCP, S-L_1, and S-WL1PCR, respectively. The tuning constants λ_n and κ_n are selected via separate grid searches to minimize the BIC. All the results are averaged over 100 simulation runs.
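The BIC-based grid search can be sketched generically; fit_fn here is a placeholder for any penalized-BD solver returning the estimate and its average loss, and the BIC form n·loss + df·log(n) with df = number of nonzeros is our assumption:

```python
import numpy as np

def bic_select(lambdas, fit_fn, X, y):
    """Grid search over tuning parameters: pick the one minimizing
    BIC(lam) = n * average_loss + df * log(n), with df = #nonzero coefficients.
    fit_fn(lam, X, y) must return (beta_hat, average_loss)."""
    n = len(y)
    best = None
    for lam in lambdas:
        beta, avg_loss = fit_fn(lam, X, y)
        bic = n * avg_loss + np.count_nonzero(beta) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best[1], best[2]  # chosen lambda and its estimate
```

The same loop applies to the second tuning constant κ_n by nesting or by a joint grid.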

Overdispersed Poisson responses
The setting is similar to that in Section 5.1.1, except that the dependence parameter between the covariates is fixed at ρ = 0.2. To compare prediction accuracy, we compute the test error, the average loss Σ_{l=1}^{L} Q{y_l*, m̂(x_l*)}/L over an independent test set of size L. We also evaluate the model selection performance via C-Z, the total number of coefficients correctly estimated to be zero when the true coefficients are zero, and C-NZ, the total number of coefficients correctly estimated to be non-zero when the true coefficients are non-zero. Finally, we record the average running time of each method under the different settings. High dimensional problems usually impose a big computational challenge, in addition to those of model selection and estimation, so considerably faster speed can be viewed as an advantage. Table 3 summarizes the simulation results.
Comparing the average losses of the estimates of β_0 and the test errors, we see that all methods have satisfactory performance, while most two-step procedures are slightly better than their counterparts which apply the estimation step directly. Besides, every method is able to select all of the truly non-zero parameters. While the gain in accuracy from the screening step does not seem dramatic, we notice that the two-step procedures are much faster: the screening step reduces the computation time by a factor of 3 to 20. This indicates that the screening step can indeed filter out most irrelevant covariates without sacrificing accuracy, so that we can make better use of computational resources. When p_n grows from 250 to 15000, the average loss of the estimates of β_0 and the test error increase slightly for each method, indicating that the two-step procedures work reasonably well in the ultra-high dimensional case. Compared with all the other methods, the screening-based estimation method using the MCP penalty gives the smallest values of ‖β̂ − β_0‖_2 and the test error, and enjoys the shortest computational time. However, the differences between the two-step procedures are not significant.

Bernoulli binary responses
The setting is similar to that in Section 5.1.2, except that we only present the results for r = 0.5. For this type of classification problem, we calculate the misclassification rate (MR) on an independent test set instead of the test error. Other metrics are similar to those in Section 5.2.1. The results are summarized in Table 4.
From Table 4, for most of the methods, the two-step procedures perform better than or as well as their counterparts without the screening step in each setting. Among all the two-step procedures, the one using the weighted-L_1 penalized estimation with weights selected by the PCR gives the smallest average loss of the estimates of β_0 and the lowest misclassification rate in each setting. As p_n increases, for each method and loss function, there is a reasonable increase in the loss of the estimate of β_0 and a slight increase in the misclassification rate. Therefore, the two-step procedures have satisfactory performance in the ultra-high dimensional situation. Again, in our comparison, the screening step makes the computation much faster than for the methods that work directly with all possible covariates. Compared with Table 3, Table 4 has much larger values of ‖β̂ − β_0‖_2 and smaller values of C-NZ, which is due to the specific setting of the logistic regression model, where the covariates and response are binary.

Real data application
In this section, we apply the methods considered in Section 5.2 to real data to illustrate the practical usefulness of the screening procedures. The tuning constants λ_n and κ_n in the second step are selected by Akaike's information criterion (AIC).

Colon data
The classification of colon cancer is discussed in [1], and the data set can be downloaded from http://genomics-pubs.princeton.edu/oncology/. It consists of p = 2000 genes and n = 62 samples, of which 22 are from normal colon tissues and 40 from tumor tissues. In our analysis, the data set is randomly split into two parts, with 45 samples as training samples and the remaining 17 as test samples. We repeat the random split 100 times and calculate the average number of misclassified cases in both sets. The results are summarized in Table 5.
From Table 5, we see that the two-step procedures are capable of identifying the relevant variables and obtaining good estimation and prediction while requiring considerably fewer computational resources. Among all the two-step procedures, the one using the L_1 penalty gives the smallest number of misclassified cases for both the training and test sets, using either the deviance or the exponential loss. Although the WL1PCR method without the screening step performs slightly better than S-L_1, its computational time is much longer. Moreover, the choice of loss function in the penalized-BD estimators has a relatively negligible effect on the classification performance.

Appendix: Proofs of Main Results
Notation. For notational brevity, let X_j = (1, X_j)^T and b_j = (α_j, β_j)^T denote the two-dimensional covariate and parameter of the componentwise minimum-BD regression estimation in (3.3), respectively. Denote b̂_j^CR = (α̂_j^CR, β̂_j^CR)^T and b_j^CR = (α_j^CR, β_j^CR)^T. Throughout this section, ‖·‖1 is the L1-norm, ‖·‖2 is the Euclidean L2-norm, and ‖·‖∞ denotes the L∞-norm.
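For reference, the componentwise objective appearing repeatedly below (cf. (3.3) and the proof of Theorem 2) can be displayed as follows; this restates quantities already defined in the surrounding text, with the empirical version obtained by replacing the expectation with a sample average.

```latex
\ell_j^{CR}(\alpha_j, \beta_j) \;=\; \mathbb{E}\bigl[\, Q\{\,Y,\; F^{-1}(\alpha_j + X_j \beta_j)\,\} \bigr],
\qquad
(\alpha_j^{CR}, \beta_j^{CR}) \;=\; \arg\min_{(\alpha_j, \beta_j)} \ell_j^{CR}(\alpha_j, \beta_j).
```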
Conditions. We make the following assumptions, in which M, B, B are sufficiently large constants. These conditions are not the weakest possible, but serve to facilitate the technical derivations.
A1. For all j, the X_j are uniformly bounded, i.e., ‖X‖∞ ≤ M. Σ = var(X) exists finitely and is nonsingular.

A2. var(Y | X) > 0, E(Y^2) < ∞, and the tail probability of Y satisfies: there exist positive constants m_0 and m_1 such that, for sufficiently large t, P(|Y| > t) ≤ … .

A3. … exist finitely up to any order required. Suppose q_2(y; θ) > 0 for all θ ∈ R and all y in the range of Y.

A4. F(·) is a bijection and F is continuous. Without loss of generality, assume … .

D. Assume that the eigenvalues of Ω_n and H_n are uniformly bounded away from 0.

It follows from Condition A3 that Q{y, F^{-1}(θ)} is strictly convex in θ. Therefore, the minimizer of (3.4) is the solution of the score equations of (3.4), which are given by … , which imply cov(Y, X_j) = 0. On the other hand, if cov(Y, X_j) = 0, it is easy to verify that (F(E(Y)), 0) satisfies the score equations.

Proof of Theorem 2. We first show part (i). The first two partial derivatives of ℓ_j^CR(α_j, β_j) = E[Q{Y, F^{-1}(α_j + X_j β_j)}] with respect to α_j are given by … . By Condition A3, it follows that ℓ_j^CR(α_j, β_j) is convex in α_j. Then, for any given β_j = b, the minimizer of ℓ_j^CR(α_j, b) is the solution of the equation … , which implies that ℓ_j(b) is convex in b and β_j^CR is unique and satisfies ℓ_j(β_j^CR) = 0. By the mean-value theorem, … , where b* lies between 0 and b. It can be shown that … . Since E{q_1(Y; α_j(0))} = 0, we observe that … , where C_0 = q(E(Y))/F(E(Y)). By Conditions A1 and A3, |X_j| ≤ M and q_2(y; θ) > 0, so for any b and j = 1, …, p_n, … . Then for any j = 1, …, p_n and any −B < b < B, … . By (6.1), for any j = 1, …, s_n, … . Again by (6.1), for any j = s_n + 1, …, p_n, …

Proof of Theorem 3. To prove Theorem 3, the following lemma, which is Theorem 1 of [7], will be needed.
… exists finitely and is positive definite at β = β_0. Moreover, … We now prove Theorem 3. The main idea is to apply Lemma 1 by letting d = 2, X = X_j, β = b_j and ℓ(X^T β, Y) = Q{Y, F^{-1}(X_j^T b_j)}.
So we need to show that the conditions (F1)-(F3) hold under our assumptions.
For condition (F1), … , where μ_j = F^{-1}(X_j^T b_j). By assumptions A1 and A2, I_j(b_j) is bounded and positive definite at b_j = b_j^CR. For condition (F3), it suffices to show that E{q_2(Y; x_j^T b) X_j X_j^T} ≥ V I_2 for some V > 0, where I_2 is the 2 × 2 identity matrix. Since E(X_j X_j^T) = I_2 and q_2(Y; x… , where f_1(t) = q(F^{-1}(t)), f_2(t) = q(F^{-1}(t)) and f_3(t) = F^{-1}(t) q(F^{-1}(t)). Let …

Corollary 1.
… will be close enough to the population version β_j^CR with large probability. With Theorem 3, we obtain Corollary 1 below, which demonstrates the sure screening property of the componentwise BD regression. Assume the conditions in Theorem 3. Set γ_n = c_1 A_n / 2, where c_1 is the constant given in Theorem 2.
Step 1: Compute β̂_j^CR in (3.3) and select sufficiently many covariates, corresponding to the p̃_n largest values of |β̂_j^CR|. Denote by M̂ the set of indices of the selected variables, where |M̂| = p̃_n. Step 2: Set the coefficients of the (p_n − p̃_n) variables not in M̂ equal to zero. Use (4.2) to estimate the remaining parameters of the p̃_n features selected in Step 1.
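The two-step procedure can be sketched as below for the special case where the BD is the squared-error loss with identity link, so that the componentwise slope in Step 1 has a closed form (the marginal least-squares coefficient); for brevity, Step 2 here uses an unpenalized refit on the selected set in place of the paper's penalized-BD estimation.

```python
import numpy as np

def two_step_screen_fit(X, y, n_keep):
    """Two-step sketch with squared-error loss as the BD (identity link).

    Step 1: rank covariates by the absolute componentwise slope and keep
    the n_keep largest (the selected index set M-hat).
    Step 2: refit on the kept covariates only, setting all other
    coefficients to zero (unpenalized least squares stands in for the
    penalized-BD step)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # componentwise minimum-BD slopes = marginal OLS coefficients
    slopes = Xc.T @ yc / np.sum(Xc**2, axis=0)
    keep = np.argsort(-np.abs(slopes))[:n_keep]
    beta = np.zeros(X.shape[1])
    coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(y)), X[:, keep]], y, rcond=None)
    beta[keep] = coef[1:]
    return keep, beta
```

With a general BD, the closed-form slope in Step 1 would be replaced by a one-dimensional numerical minimization of the componentwise BD objective for each j.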

Table 1 (Simulation results: overdispersed Poisson count responses)
Mean, standard deviation (std), and five-number summary of the minimum model size, out of 400 replications, with Q1: first quartile; Q2: median; Q3: third quartile; CP: the proportion of runs in which all truly nonzero coefficients are covered by M̂. The methods SIS-BD, DC-SIS, EL-SIS, RRCS and SIRS are described in (5.1).
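The minimum model size reported in Tables 1 and 2 can be computed as follows; this is a straightforward sketch of the standard metric (the smallest number of top-ranked covariates needed to cover every truly relevant one), with hypothetical argument names.

```python
import numpy as np

def minimum_model_size(scores, true_idx):
    """Smallest k such that the top-k covariates, ranked by |score|,
    contain every truly relevant index in `true_idx`."""
    order = np.argsort(-np.abs(np.asarray(scores)))
    rank = {int(j): r for r, j in enumerate(order)}
    return max(rank[j] for j in true_idx) + 1
```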

Table 2 (Simulation results: Bernoulli binary responses)
Mean, standard deviation, and five-number summary of the minimum model size, out of 400 replications. The methods SIS-BD, DC-SIS, EL-SIS, RRCS and SIRS are described in (5.1).

Table 3 (Simulation results: overdispersed Poisson count responses)
Covariates are marginally Uniform(−0.5, 0.5) with dependence parameter ρ = 0.2 in (5.2). Results are averaged over 100 replications. Here TE is the test error obtained from an independent test set; time is the average running time in seconds. Timing was carried out on an Intel 3.60 GHz processor.
… For the estimated parameters obtained by the different methods, the average of ‖β̂ − β0‖2 across those 100 training sets is calculated. The test error (TE) is obtained from an independently generated test set {(x_ℓ, y_ℓ)}, ℓ = 1, …, L, with L = 10000, by L^{-1} Σ_{ℓ=1}^{L} …

Table 4 (Simulation results: Bernoulli binary responses)
Covariates are independent Bernoulli random variables with r = 0.5 in (5.3). MR is the misclassification rate on an independent test set. Results are averaged over 100 replications. Timing was carried out on an Intel 3.60 GHz processor.

Table 5 (Real data: Colon)
Average number of misclassified cases among the 45 training samples and the 17 test samples, average number of selected variables among all 2000 covariates, and average computation time in seconds, over 100 replications. Timing was carried out on an Intel 3.60 GHz processor.

Lemma 1 .
Consider data {(X_i, Y_i) : i = 1, …, n}, which are n i.i.d. samples of (X, Y) ∈ X × Y for some spaces X ⊂ R^d and Y ⊂ R. A regression model for X and Y is assumed with loss function ℓ(X^T β, Y). Let …