Feature Screening for High-Dimensional Variable Selection in Generalized Linear Models

The two-stage feature screening method for linear models applies dimension reduction at the first stage to screen out nuisance features and dramatically reduce the dimension to a moderate size; at the second stage, penalized methods such as LASSO and SCAD can be applied for feature selection. The majority of subsequent work on sure independence screening has focused on the linear model. This motivates us to extend independence screening to generalized linear models, in particular with a binary response, by using the point-biserial correlation. We develop a two-stage feature screening method called point-biserial sure independence screening (PB-SIS) for high-dimensional generalized linear models, aiming for high selection accuracy and low computational cost. We demonstrate that PB-SIS is a highly efficient feature screening method. The PB-SIS method possesses the sure independence property under certain regularity conditions. A set of simulation studies confirms the sure independence property and the accuracy and efficiency of PB-SIS. Finally, we apply PB-SIS to one real data example to show its effectiveness.


Introduction
As data with a huge number of features become common in real life, many feature screening approaches have been developed to reduce the number of features. Ref. [1] introduced a model-free category-adaptive feature screening approach to detect category-specific important covariates for high-dimensional heterogeneous data. Ref. [2] proposed the cumulative divergence (CD) metric and developed a model-free CD-based forward screening procedure. In [3], a distributed screening framework was proposed, which expresses a correlation measure as a function of several component parameters, each of which can be estimated distributively; with the component estimates, a final correlation estimate can be assembled and used for screening features. Ref. [4] proposed a model-free and data-adaptive feature screening method for ultra-high-dimensional data based on the projection correlation between two random vectors. This approach accommodates heavy-tailed and multivariate responses.
A large number of variable selection approaches based on regularization have been developed to tackle the high-dimensionality issue. One of the most popular and renowned regularization methods, the Least Absolute Shrinkage and Selection Operator (LASSO), was proposed by Tibshirani [5]. The LASSO minimizes the squared error subject to an l_1 penalty. The major advantage of the LASSO is that it performs variable selection and parameter estimation simultaneously. Unlike ridge regression, the LASSO is able to shrink coefficient estimates exactly to zero. Despite the popularity of the LASSO, many alternative penalty functions are also available. Fan and Li [6] proposed the smoothly clipped absolute deviation (SCAD) penalty, which is nonconvex. Another example is the Dantzig selector (DS) proposed by [7], which minimizes the maximum component of the gradient of the squared error function. Ref. [6] also reviewed and summarized a family of well-established work on variable selection using the penalized likelihood approach in finite-parameter settings, and established the oracle properties for nonconcave penalized likelihood estimators.
It has been argued that the regularization methods cited above may not perform as expected due to the simultaneous challenges of computational expediency, statistical accuracy, and algorithmic stability [8]. Thus, a large number of two-stage approaches have been proposed to improve the performance of regularization methods and reduce their computational cost. In the first stage of these two-stage methods, the dimension of the data is reduced: one can choose among different dimension reduction methods to cut the number of variables from very large to moderate. In the second stage, classic variable selection algorithms, ranging from regularization to model selection criteria, can then be applied without the curse of high dimensionality to identify the important features retained in the first stage. Ideally, all of the important features are selected and only a few nuisance variables are kept in the first stage. Therefore, the first stage is usually referred to as the feature screening stage and the second stage as the post-screening stage.
The two-stage approach can be applied to linear models. Fan and Lv [9] proposed the sure independence screening (SIS) method, which in the first stage selects important variables based on the marginal Pearson correlation coefficient between each predictor and the response variable. By applying SIS in the first stage, we can select the features that have the strongest correlation with the response variable and reduce the high dimension to a relatively moderate size. Following the first stage, appropriate regularization methods such as LASSO, SCAD, and the Dantzig selector can be applied in the second stage to further select the important features. These methods are referred to as SIS-LASSO, SIS-SCAD, and SIS-DS.
To broaden the application of two-stage feature screening and variable selection, we consider generalized linear models, popularized by McCullagh and Nelder [10]. In such models, a link function (often nonlinear) connects the mean of the response variable to a linear combination of the predictors. A generalized linear model serves as a flexible and more general framework for building many types of regression models. The response variable is assumed to follow an exponential family distribution and does not have to be normal. With the normality assumption relaxed, generalized linear models can be applied to a wide spectrum of data. As an extension of linear regression, generalized linear models are used extensively in a variety of fields, such as biomedical and educational research, the social sciences, agriculture, environmental health, and financial analysis.
The sure independence screening method was demonstrated to be capable of efficiently selecting important predictors with low computational cost in linear models, so it is natural to extend the feature screening method to generalized linear models. Fan and Song [11] extended the feature screening procedure to generalized linear models by ranking the maximum marginal likelihood estimators (MMLE). This method ranks the marginal regression coefficients of the generalized linear model to screen for important features, and it dramatically reduces the dimension of the data, making subsequent computation feasible. In the linear model setting, the MMLE ranking coincides with the marginal correlation ranking; moreover, it does not depend on a normality assumption and can be applied to other models. A variety of marginal screening procedures have since been proposed by applying different types of correlations to different types of models.
Even though some feature screening procedures, such as MMLE and the Kolmogorov filter [12], have been proposed for generalized linear models, these methods have their own limitations. MMLE can select important predictors effectively, but its computational cost is relatively high, since it requires fitting a marginal model for each predictor. The Kolmogorov filter is computationally fast, but its selection accuracy is relatively low compared with other methods. Inspired by these two-stage feature screening approaches, we propose a two-stage feature screening approach for high-dimensional variable selection in generalized linear models with a binary response variable. The point-biserial correlation [13] is a well-known correlation that measures the strength and direction of the association between one continuous variable and one binary variable. In the first stage, we apply the point-biserial correlation as a marginal index of the correlation between each predictor and the response to reduce the dimension of the data to a moderate size. In the second stage, we apply a regularization method to further select important predictors and build the final sparse model.
The primary objective of this paper is to develop a two-stage feature screening method called point-biserial sure independence screening (PB-SIS) for high-dimensional generalized linear models, aiming for high selection accuracy and low computational cost. The latter property is especially important in the era of big data, where data sets keep growing with the advancement of modern science and technology. We demonstrate that PB-SIS is a highly efficient feature screening method. Section 2 introduces generalized linear models. Section 3 presents the PB-SIS method and the two-stage point-biserial correlation screening procedure. Section 4 conducts a set of simulation studies to compare the performance of the proposed method with MMLE [11] and the Kolmogorov filter method [12]; the predictors are set to have different strengths of pairwise correlation, and the response variable is generated using different link functions. These simulations confirm the sure independence property and the accuracy and efficiency of PB-SIS. We demonstrate the effectiveness of PB-SIS in an application to a real data example in Section 5. Section 6 concludes with a discussion.

Generalized Linear Models (GLMs)
Even though the sure independence screening (SIS) method proposed by Fan and Lv [9] provides a very useful and powerful tool for high-dimensional data analysis, it focuses on the linear model setting and its properties depend on joint normality assumptions. Fan and Song [11] proposed a more general version of sure independence screening for generalized linear models (GLMs), which ranks the maximum marginal likelihood estimators (MMLE) or the maximum marginal likelihoods themselves. Assume that the response Y is from an exponential family with the canonical form

f(y; θ) = exp{yθ − b(θ) + c(y)},

where b(·) and c(·) are known functions and θ is the natural parameter. Let X = (X_1, X_2, . . . , X_p)^T be the p-dimensional vector of explanatory variables, observed as an n × p design matrix. Denote by X_ij the ith observation of the jth variable, so that X_i = (X_i1, X_i2, . . . , X_ip)^T. Then we have the following generalized linear model:

g(E(Y | X = x)) = β_0 + x^T β,

where g(·) is the link function, β_0 is an unknown scalar, and β = (β_1, β_2, . . . , β_p)^T is a p-dimensional unknown vector. Let {x_i, Y_i}, i = 1, 2, . . . , n, be an independent and identically distributed sample from the population {x, Y}. For the MMLE method, the marginal estimator for the jth predictor X_j is defined as

β̂_j^M = (β̂_{j0}^M, β̂_{j1}^M) = argmax_{β_0, β_1} (1/n) Σ_{i=1}^n ℓ(β_0 + β_1 X_ij, Y_i),

where ℓ(·, ·) is the log-likelihood function. Ref. [11] considered ranking the magnitudes of the marginal regression coefficients β̂_{j1}^M to select important features and defined the selected submodel as

M̂_{γ_n} = {1 ≤ j ≤ p : |β̂_{j1}^M| ≥ γ_n},

where γ_n is a pre-specified threshold. The dimension p decreases dramatically to a moderate size when a large value of γ_n is chosen.
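As an illustration of the MMLE screening step, the following is a minimal numpy sketch for the logistic case (our own illustrative code, not the implementation of [11]; the function names are assumptions): each predictor gets its own marginal logistic fit by Newton-Raphson, and predictors are then ranked by the magnitude of the fitted marginal slope.

```python
import numpy as np

def marginal_logistic_coef(x, y, n_iter=25):
    """Slope of the marginal logistic fit logit P(Y=1) = b0 + b1*x,
    computed by Newton-Raphson (illustrative, no safeguards)."""
    X1 = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X1 @ b)))
        w = prob * (1.0 - prob)
        grad = X1.T @ (y - prob)
        hess = (X1 * w[:, None]).T @ X1
        b = b + np.linalg.solve(hess, grad)
    return b[1]

def mmle_screen(X, y, d):
    """Keep the indices of the d largest marginal slopes in magnitude."""
    slopes = np.array([marginal_logistic_coef(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-np.abs(slopes))[:d]

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
eta = 1.0 + 3.0 * X[:, 0] - 3.0 * X[:, 1]      # only X_0 and X_1 are active
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
print(sorted(mmle_screen(X, y, d=5).tolist()))
```

Fitting p separate marginal models is exactly why MMLE is relatively expensive: the cost grows with p times the per-fit Newton iterations, which the point-biserial index avoids later in the paper.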
To establish the theoretical properties of MMLE, Fan and Song [11] defined the population version of the marginal likelihood maximizer as

β_j^M = (β_{j0}^M, β_{j1}^M) = argmax_{β_0, β_1} E ℓ(β_0 + β_1 X_j, Y),

where E denotes the expectation under the true model. Based on this population version, it was shown that the marginal regression parameter β_{j1}^M = 0 if and only if cov(Y, X_j) = 0, for j = 1, 2, . . . , p. Thus, β_{j1}^M ≠ 0 when an important feature is correlated with the response variable. Define the true model as M = {1 ≤ j ≤ p : β_j ≠ 0} with size s = |M|. Under some conditions, if |cov(Y, X_j)| ≥ c_1 n^{−κ} for j ∈ M and some c_1 > 0, then

min_{j ∈ M} |β_{j1}^M| ≥ c_2 n^{−κ}

for some c_2 > 0 and κ > 0. Thus, the marginal signals β_{j1}^M are stronger than the stochastic noise, provided that the X_j's are marginally correlated with Y.
Fan and Song [11] also showed that, under proper regularity conditions, this procedure has the sure screening property and the size control property if γ_n is chosen at an ideal rate. Under certain conditions, we have

P(M* ⊆ M̂_{γ_n}) → 1 as n → ∞,

where γ_n = c n^{−2κ} for some 0 < κ < 1/2 and c > 0, and M* = {1 ≤ j ≤ p : β*_j ≠ 0} is the true index set of the model.

Feature Screening Methodology for Generalized Linear Models via Point-Biserial Correlation
We propose a two-stage feature screening method for GLM variable selection using the point-biserial correlation. In the first stage, we use the point-biserial correlation as a marginal utility to rank predictors and select a submodel using a predefined threshold. This step reduces the number of features from a very large scale to a moderate size in a computationally fast manner. In the second stage, we apply a regularization method, such as LASSO, SCAD, or MCP, to further shrink the number of parameters and find the final sparse model from the screened set obtained in the first stage. This proposed method is referred to as two-stage PB-SIS.
We remark that Ref. [9] demonstrated that two-stage methods combining independence screening with a penalized method outperform a one-step penalized method. The effectiveness of the two-stage method is guaranteed by the sure screening property, which means that all important predictors are retained in the reduced model almost surely. Specifically, the sure screening property for PB-SIS guarantees that PB-SIS retains all of the variables of the true model in the screened submodel with probability going to one as the sample size goes to infinity, and the convergence rate is exponential. It can be shown that PB-SIS possesses the sure independence property under certain regularity conditions, so that the PB-SIS method selects all of the important variables with probability tending to one.

Point-Biserial Correlation and Its Asymptotic Distribution
Let Y be a binary variable with two classes y_0 = 0 and y_1 = 1, and again let X = (X_1, X_2, . . . , X_p)^T denote the covariates, observed as an n × p matrix. Given n independent and identically distributed random samples X_i = (X_i1, . . . , X_ip)^T, let X_ij, i = 1, 2, . . . , n, j = 1, . . . , p, be the ith sample of the jth covariate. To investigate the point-biserial correlation between Y and X_j, j = 1, 2, . . . , p, we consider the correlation between each X_j and Y.
For each j, consider (X_i, Y_i), i = 1, 2, . . . , n, a sequence of independent random vectors. Assume Y_i has the Bernoulli distribution

P(Y_i = 1) = p_1, P(Y_i = 0) = p_0, (1)

where 0 < p_1 < 1 and p_1 + p_0 = 1. Assume X_i has a two-component mixture normal distribution, written in terms of either the distribution function F or the density f:

F(x) = p_1 Φ((x − µ_1)/σ_1) + p_0 Φ((x − µ_0)/σ_0), (2)

where Φ denotes the standard normal distribution function. That is, X is normally distributed within each of the subpopulations Y = 1 and Y = 0 separately, with means µ_1, µ_0 and variances σ_1^2, σ_0^2; we assume a common variance σ_1^2 = σ_0^2 = σ^2. Since Y_i has the Bernoulli distribution with probabilities in Equation (1), the mean and variance of Y are E(Y) = 1(p_1) + 0(p_0) = p_1 and Var(Y) = p_1 p_0. Since X follows the mixture normal with CDF in Equation (2), the expected value of X is E(X) = p_1 µ_1 + p_0 µ_0. Denote the standardized difference of the means µ_1 and µ_0 by ∆ = (µ_1 − µ_0)/σ. The variance of X can then be written as

Var(X) = σ^2 + p_1 p_0 (µ_1 − µ_0)^2 = σ^2 (1 + p_1 p_0 ∆^2).

Next, we derive the expected value of the product XY. Since XY is zero whenever Y = 0, the expectation only involves the event Y = 1; therefore, E(XY) = p_1 µ_1.
Now we can find the population correlation coefficient between X and Y, which has the form

ρ(X, Y) = cov(X, Y) / √(Var(X) Var(Y)) = ∆ √(p_1 p_0) / √(1 + p_1 p_0 ∆^2),

and it has a natural estimator, r_pb. Remark 1 states the asymptotic distribution of the point-biserial correlation, which can be easily extended from [13]. Remark 1. Let the random variable Y have a Bernoulli distribution and the random variable X have a mixture normal distribution with CDF of the form (2). Then the point-biserial correlation r_pb between X and Y is asymptotically normal: √n (r_pb − ρ) converges in distribution to a normal limit with mean zero and the asymptotic variance given in [13].
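The estimator r_pb can be computed directly from the class means and proportions. The following sketch (our own illustrative code, with assumed function names) also checks numerically that r_pb agrees with the population value ρ = ∆√(p_1 p_0)/√(1 + p_1 p_0 ∆²) derived above; algebraically, r_pb is identical to the Pearson correlation between X and the 0/1 labels.

```python
import numpy as np

def point_biserial(x, y):
    """Sample point-biserial correlation r_pb between continuous x and binary y.
    Equals the Pearson correlation of x with the 0/1 labels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    p1 = np.mean(y == 1)
    m1, m0 = x[y == 1].mean(), x[y == 0].mean()
    return (m1 - m0) * np.sqrt(p1 * (1.0 - p1)) / x.std()

rng = np.random.default_rng(1)
n = 100_000
y = rng.binomial(1, 0.4, size=n)
x = np.where(y == 1, 2.0, 0.0) + rng.standard_normal(n)  # mu1=2, mu0=0, sigma=1
# Population value: Delta = 2, p1 = 0.4, so rho = Delta*sqrt(p1*p0)/sqrt(1 + p1*p0*Delta^2)
rho = 2.0 * np.sqrt(0.24) / np.sqrt(1.0 + 0.24 * 4.0)
print(round(point_biserial(x, y), 3), round(rho, 3))
```

With n this large, the sample value lands very close to the population ρ ≈ 0.70, consistent with the asymptotic normality in Remark 1.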

Two-Stage Point-Biserial Correlation Screening Procedure
We consider using the point-biserial correlation to measure the correlation between X_j, j = 1, 2, . . . , p, and Y. We define the index ω_j = ρ(X_j, Y), the point-biserial correlation between X_j and Y, as a marginal utility measure for screening. Intuitively, if X_j and Y are independent or close to independent, then ω_j = 0 or ω_j is very close to 0. On the other hand, if X_j and Y are strongly correlated, ω_j is close to −1 or 1. Thus, we can rank the marginal ω_j values to select important features that have higher correlation with the response variable. A natural estimator for ω_j is the sample point-biserial correlation

ω̂_j = (X̄_j1 − X̄_j0) √(p̂_1 p̂_0) / s_j,

where X̄_j1 and X̄_j0 are the sample means of X_j within the classes Y = 1 and Y = 0, p̂_1 and p̂_0 are the sample proportions of the two classes, and s_j is the sample standard deviation of X_j. Based on ω̂_j, we propose a two-stage screening procedure for high-dimensional GLMs with a binary response variable. In the first stage, we compute the sample point-biserial correlation ω̂_j, j = 1, 2, . . . , p, for each predictor. We then sort the magnitudes of the components of ω̂ = (ω̂_1, ω̂_2, . . . , ω̂_p)^T in decreasing order and select the submodel

M_d = {1 ≤ j ≤ p : |ω̂_j| is among the d largest of all}, (3)

where the submodel size d is smaller than the sample size n. Thus, we reduce the high dimension p to the moderate size d. As Ref. [9] suggested, the submodel size d can be set to ⌊n/log(n)⌋, where ⌊a⌋ denotes the floor function of a. The submodel (3) has the equivalent form M_γ = {1 ≤ j ≤ p : |ω̂_j| ≥ γ}, where d or γ is a predefined threshold value. This proposed procedure is referred to as point-biserial correlation sure independence screening (PB-SIS).
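The first-stage screening just described can be sketched as follows; this is an illustrative numpy implementation under the stated model (the function name pb_sis is our own). Note that, unlike the MMLE, the whole ranking reduces to class-wise means and standard deviations, which is the source of the computational savings.

```python
import numpy as np

def pb_sis(X, y, d=None):
    """First-stage PB-SIS: rank predictors by |sample point-biserial correlation|
    with the binary response and keep the top d (default floor(n / log n))."""
    n, p = X.shape
    if d is None:
        d = int(np.floor(n / np.log(n)))
    p1 = y.mean()
    m1 = X[y == 1].mean(axis=0)           # class-wise means of every predictor
    m0 = X[y == 0].mean(axis=0)
    omega = (m1 - m0) * np.sqrt(p1 * (1.0 - p1)) / X.std(axis=0)
    return np.sort(np.argsort(-np.abs(omega))[:d])

rng = np.random.default_rng(2)
n, p = 200, 100
X = rng.standard_normal((n, p))
eta = 1.0 + 3.0 * X[:, 0] + 3.0 * X[:, 1] + 3.0 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
kept = pb_sis(X, y)                       # d = floor(200 / log 200) = 37
print(len(kept), {0, 1, 2} <= set(kept.tolist()))
```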
Although the PB-SIS method reduces the high dimension p to a moderate size d, we apply a penalized method in the second stage to further select important variables and find the final sparse model. In the second stage, a penalized regression procedure, such as the least absolute shrinkage and selection operator (LASSO), can be applied to further select important variables and estimate the coefficients in the model. The LASSO is a shrinkage method which places a constraint on the absolute values of the parameters in a model, and it is the most popular approach for selecting significant variables and estimating coefficients simultaneously. The LASSO estimate is defined as

β̂ = argmin_β { (1/2n) Σ_{i=1}^n (y_i − β_0 − x_i^T β)^2 + λ Σ_{j=1}^p |β_j| }. (4)

Ref. [14] proposed fast regularization paths for GLMs via coordinate descent, which handles the LASSO penalty efficiently. For solving Equation (4) in the generalized linear model setting, Ref. [14] considered coordinate descent steps. Suppose we have estimates β̃_0 and β̃_l for l ≠ j, and we would like to partially optimize with respect to β_j. Let R(β_0, β) be the objective function in Equation (4); its gradient with respect to β_j can be computed wherever β_j ≠ 0. Ref. [15] showed that, after a simple calculation, the coordinate-wise update has the form

β̂_j = S( (1/n) Σ_{i=1}^n x_ij (y_i − ỹ_i^{(j)}), λ ),

where ỹ_i^{(j)} = β̃_0 + Σ_{l≠j} x_il β̃_l is the fitted value excluding the contribution of x_ij, so that y_i − ỹ_i^{(j)} is the partial residual for fitting β_j (the update assumes standardized predictors). Here S(z, γ) is the soft-threshold operator with value

S(z, γ) = sign(z)(|z| − γ)_+,

that is, z − γ if z > 0 and γ < |z|; z + γ if z < 0 and γ < |z|; and 0 if γ ≥ |z|. The details of this derivation are shown in [16].
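The soft-threshold operator and the coordinate-wise update can be illustrated for the squared-error objective in Equation (4) with standardized predictors; this is a minimal sketch under those assumptions, not the glmnet implementation of [14].

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1, assuming the
    columns of X are standardized to mean 0 and (1/n)-variance 1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b                                # full residual
    for _ in range(n_sweeps):
        for j in range(p):
            zj = X[:, j] @ r / n + b[j]          # inner product with the partial residual
            bj_new = soft_threshold(zj, lam)
            r += X[:, j] * (b[j] - bj_new)       # keep the residual in sync
            b[j] = bj_new
    return b

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)
b = lasso_cd(X, y, lam=0.1)
print(np.nonzero(np.abs(b) > 1e-8)[0].tolist())
```

Keeping the full residual updated in place is what makes each coordinate step O(n) rather than O(np), the design choice emphasized in [14].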
Since we focus on feature screening for GLMs with a binary response, the logistic regression model is commonly used, and we now describe optimization and estimation for penalized logistic regression. The logistic regression model represents the class-conditional probabilities through a linear function of the predictors,

log( P(G = 1 | x) / P(G = 0 | x) ) = β_0 + x^T β, (5)

where P(G = 1 | x) = 1 − P(G = 0 | x). Let p(x_i) = P(G = 1 | x_i) be the probability in Equation (5) for observation i at particular values of the parameters (β_0, β). Ref. [14] maximized the penalized log-likelihood

(1/n) Σ_{i=1}^n [ I(g_i = 1) log p(x_i) + I(g_i = 0) log(1 − p(x_i)) ] − λ Σ_{j=1}^p |β_j|. (6)

Denoting y_i = I(g_i = 1), the penalized log-likelihood in Equation (6) can be represented as

(1/n) Σ_{i=1}^n [ y_i (β_0 + x_i^T β) − log(1 + e^{β_0 + x_i^T β}) ] − λ Σ_{j=1}^p |β_j|, (7)

whose unpenalized part is a concave function of the parameters. For the unpenalized log-likelihood problem, we can apply Newton's method, which amounts to iteratively reweighted least squares. Forming a quadratic approximation (Taylor expansion) of the log-likelihood about the current estimates (β̃_0, β̃) gives

Q(β_0, β) = −(1/2n) Σ_{i=1}^n w_i (z_i − β_0 − x_i^T β)^2 + C(β̃_0, β̃), (8)

where

z_i = β̃_0 + x_i^T β̃ + (y_i − p̃(x_i)) / ( p̃(x_i)(1 − p̃(x_i)) ), w_i = p̃(x_i)(1 − p̃(x_i)), (9)

and p̃(x_i) is evaluated at the current parameters. The last term in Equation (8) is constant; z_i is the working response and w_i are the weights in Equations (9). The Newton update is obtained by maximizing Q in Equation (8). Ref. [14] proposed a coordinate descent approach to optimize the penalized log-likelihood in (7), similar in spirit to Newton's method. As they suggested, we can create an outer loop which computes the quadratic approximation Q about the current parameters (β̃_0, β̃) for each value of λ, and then use coordinate descent to solve the penalized weighted least-squares problem

min_{(β_0, β)} { −Q(β_0, β) + λ Σ_{j=1}^p |β_j| }. (10)

To implement this algorithm, we need a sequence of nested loops.
We use the outer loop to decrement λ, the middle loop to update the quadratic approximation Q using the current parameters (β̃_0, β̃), and the inner loop to run the coordinate descent algorithm on the penalized weighted least-squares problem in objective (10); we then iterate these nested loops until convergence. Besides the LASSO penalty, the smoothly clipped absolute deviation (SCAD) penalty [6] and the minimax concave penalty (MCP) [17] can also be applied in the second stage to further select important predictors and estimate the coefficients. SCAD and MCP are concave penalties that satisfy the oracle properties, meaning these two penalized methods correctly select important variables and estimate coefficients with high probability when certain regularity conditions are met. For the SCAD penalty, Ref. [6] proposed a local quadratic approximation (LQA) algorithm to find the optimal solutions. However, in the LQA algorithm, once a coefficient is set to zero at any iteration, it stays at zero and the corresponding variable is removed from the final model. Ref. [18] proposed the majorization-minimization (MM) approach to optimize a perturbed version of LQA by bounding the denominator away from zero. In addition, Ref. [19] proposed a local linear approximation (LLA) algorithm to approximate the concave penalized solution by repeatedly using algorithms for the LASSO penalty. However, most of these optimization methods are for linear models. Ref. [20] proposed majorization-minimization by coordinate descent (MMCD) to find the optimal solutions of concave penalized GLMs, with emphasis on logistic regression, and implemented this algorithm for penalized logistic regression with the SCAD and MCP penalties.
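The middle-loop step above, the quadratic approximation, is just one iteration of iteratively reweighted least squares. The unpenalized case can be sketched as follows (our own illustrative code), with the working response and weights exactly as in Equations (9):

```python
import numpy as np

def loglik(X1, y, beta):
    """Logistic log-likelihood (up to constants)."""
    eta = X1 @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

rng = np.random.default_rng(4)
n, p = 300, 4
X = rng.standard_normal((n, p))
X1 = np.column_stack([np.ones(n), X])            # intercept column first
true = np.array([1.0, 2.0, -2.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X1 @ true))))

beta = np.zeros(p + 1)
for _ in range(8):                               # IRLS = Newton's method
    eta = X1 @ beta
    prob = 1.0 / (1.0 + np.exp(-eta))
    w = prob * (1.0 - prob)                      # weights, Eq. (9)
    z = eta + (y - prob) / w                     # working response, Eq. (9)
    WX = X1 * w[:, None]
    beta = np.linalg.solve(X1.T @ WX, WX.T @ z)  # weighted least squares = Newton update
print(np.round(beta, 2).tolist())
```

Each pass solves the weighted least-squares problem that Q defines; the penalized algorithm of [14] replaces this exact solve with coordinate descent on objective (10).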
Because this algorithm cannot run λ all the way to zero when p is much greater than n (the saturated logistic regression fit is undefined), it is necessary to apply the first stage of our proposed method to reduce the number of parameters to a moderate size. We then use a penalized method, such as LASSO, SCAD, or MCP, in the second stage to obtain the final model. This algorithm is easy to implement using the R package SIS. When applying SIS, one can use cross-validation (CV), AIC [21], BIC [22], or EBIC [23] to choose the tuning parameter λ.
A summary of the two-stage PB-SIS method is provided in Algorithm 1.

Algorithm 1 Two-stage PB-SIS Algorithm.
1: Compute the point-biserial correlation ω̂_j between x_j and y for each j and rank the magnitudes of the absolute marginal correlations |ω̂_j|. 2: Choose a predefined threshold value d and take the selected submodel to be M_d = {1 ≤ j ≤ p : |ω̂_j| is among the first d largest of all}. 3: Starting with all variables in the submodel M_d, apply a penalized method, such as LASSO, SCAD, or MCP, to further select important variables and estimate the coefficients (β̂_0, β̂).
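Algorithm 1 can be sketched end to end as follows. This is an illustrative numpy-only implementation (screening followed by an L1-penalized logistic fit via IRLS plus coordinate descent), not the SIS package code; the function name is ours, and the tuning parameter lam is fixed here rather than chosen by CV or an information criterion.

```python
import numpy as np

def two_stage_pb_sis(X, y, lam=0.05, outer=25, inner=15):
    """Algorithm 1 sketch: PB-SIS screening (step 1-2), then an L1-penalized
    logistic fit by IRLS + coordinate descent (step 3). Illustrative only."""
    n, p = X.shape
    d = int(np.floor(n / np.log(n)))
    # Stage 1: rank |point-biserial correlation|, keep the top d predictors.
    p1 = y.mean()
    omega = (X[y == 1].mean(0) - X[y == 0].mean(0)) * np.sqrt(p1 * (1 - p1)) / X.std(0)
    keep = np.sort(np.argsort(-np.abs(omega))[:d])
    # Stage 2: penalized logistic regression on the screened submodel.
    Z = X[:, keep]
    b0, b = 0.0, np.zeros(d)
    for _ in range(outer):
        eta = b0 + Z @ b
        prob = np.clip(1.0 / (1.0 + np.exp(-eta)), 1e-5, 1 - 1e-5)
        w = prob * (1.0 - prob)
        zr = eta + (y - prob) / w               # working response
        for _ in range(inner):                  # coordinate descent sweeps
            b0 = np.sum(w * (zr - Z @ b)) / np.sum(w)
            for j in range(d):
                r = zr - b0 - Z @ b + Z[:, j] * b[j]
                num = np.sum(w * Z[:, j] * r) / n
                den = np.sum(w * Z[:, j] ** 2) / n
                b[j] = np.sign(num) * max(abs(num) - lam, 0.0) / den
    return keep[np.abs(b) > 1e-8], b0, b

rng = np.random.default_rng(5)
n, p = 200, 400
X = rng.standard_normal((n, p))
eta = 1.0 + 3.0 * X[:, 0] - 3.0 * X[:, 1] + 3.0 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
selected, b0, b = two_stage_pb_sis(X, y)
print({0, 1, 2} <= set(selected.tolist()))
```

Note how the p >> n problem in stage 2 disappears: the penalized fit only ever sees the d = ⌊n/log n⌋ screened columns.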

Simulations
We conduct Monte Carlo simulations to compare the performance of the proposed PB-SIS method with existing feature screening methods for generalized linear models (GLMs): sure screening by ranking the maximum marginal likelihood estimators (MMLE) [11] and screening for binary classification based on the Kolmogorov-Smirnov statistic (the Kolmogorov filter) [12]. We also evaluate the two-stage PB-SIS method with different penalized methods and different tuning parameter selection criteria.

Simulation Settings
In each example, the data (x_i, Y_i), i = 1, 2, . . . , n, are generated with the conditional distribution of the response Y given X = x being Bernoulli with probability of success π_i. We generate x = (X_1, X_2, . . . , X_p)^T from a multivariate normal distribution with mean 0 and covariance matrix Σ = (σ_ij)_{p×p} with σ_ij = ρ^{|i−j|}. We set up five different ρ values, from small to large, to generate X with different correlation strengths among the p predictors: independence (ρ = 0), low correlation (ρ = 0.2), moderate correlation (ρ = 0.4), high correlation (ρ = 0.6), and very high correlation (ρ = 0.8). We vary the size of the non-sparse set of coefficients as s = 2, 3, 4 with varying signals, and set the number of parameters to p = 200 and p = 600. We apply one link function, the logit, to generate the binomial proportion π_i and then generate the binary response variable Y. For this link function, we consider six different models with different covariates, presented in Table 1. The true coefficient vectors for these six models include, for example, β = (3, 3, 3)^T and β = (2, −3, 3, −3)^T, all with the same constant term β_0 = 1. Note that these parameter values are chosen somewhat arbitrarily, with easily recognizable numbers for brevity; the patterns and trends of the simulation results do not depend on the parameter values. Thus, the proposed PB-SIS method is compared with the MMLE and Kolmogorov filter methods under all 2 × 6 = 12 simulation settings. All simulation results are based on 1000 replicates.
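The simulation design above can be reproduced with a short data generator; a sketch, assuming the logit link and the AR(1)-type covariance Σ_ij = ρ^{|i−j|} described in the text. For simplicity, the sketch places the s nonzero coefficients on the first s covariates, whereas Table 1 assigns them to specific model-dependent positions.

```python
import numpy as np

def gen_logit_data(n, p, beta, beta0=1.0, rho=0.4, seed=0):
    """Draw X ~ N(0, Sigma) with Sigma_ij = rho^|i-j| and a binary response
    from the logit link, as in the simulation design (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eta = beta0 + X[:, :len(beta)] @ np.asarray(beta, dtype=float)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))      # pi_i via the logit link
    return X, y

X, y = gen_logit_data(n=100, p=200, beta=[3.0, 3.0, 3.0], rho=0.4, seed=7)
print(X.shape, round(float(y.mean()), 2))
```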

For each simulation, we evaluate each setting by the proportion P_1 of the 1000 replications in which the submodel M_d of size d contains all the true predictors, and by the computing time. For the threshold value d, we follow [9] and choose d_1 = ⌊n/log n⌋, d_2 = ⌊2n/log n⌋, and d_3 = ⌊3n/log n⌋ throughout our simulations to empirically examine the effect of the cutoff, where ⌊a⌋ denotes the floor function of a. Since we take n = 100 in our simulation settings, we have d_1 = 21, d_2 = 43, and d_3 = 65. We also evaluate each method by the median minimum model size (MMMS) of the selected models and its robust estimate of the standard deviation (RSD), where the RSD is the interquartile range (IQR) divided by 1.34, as given in [11].
Regarding the principle for choosing d, Ref. [9] set d = ⌊n/log(n)⌋ as one choice, which is conservative yet effective. Their preference is to select sufficiently many features in the first stage; when d is not very small, the selection results are not very sensitive to the choice of d. Clearly, a larger d means a larger probability of including the true model M* in the submodel M_d. Provided that d = ⌊n/log(n)⌋ is large enough, we can use it as the threshold: doing so detects all significant predictors in the selected subset, and the P_1 value is large. Therefore, the principle for choosing d is to take d large enough that the first-stage selection includes all important predictors in the submodel M_d. The simulation results in the next subsection show that taking d_1 = ⌊n/log n⌋, d_2 = ⌊2n/log n⌋, and d_3 = ⌊3n/log n⌋ as thresholds yields P_1 values close to 1, verifying that these thresholds perform effectively in the proposed feature screening method.
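The threshold computation is a one-liner; the following reproduces the values d_1 = 21, d_2 = 43, d_3 = 65 quoted above for n = 100 (interpreting d_k as ⌊k·n/log n⌋, which matches those values).

```python
import math

# Threshold values for the simulations, n = 100: d_k = floor(k * n / log n).
n = 100
d1, d2, d3 = (math.floor(k * n / math.log(n)) for k in (1, 2, 3))
print(d1, d2, d3)  # -> 21 43 65
```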

Presentation of Simulation Results for Logit Models
We present a series of simulation results where the response variable is generated from GLMs for binary data using the logit link. We summarize the simulation results for the six models in Table 1: the proportion P_1 and computing time are tabulated in Tables 2-7, and the MMMS and the associated RSD are summarized in Tables 8-13.
The simulation results for models 1 to 6, where the data are generated using the logit link, are tabulated in Tables 2-7. From Table 2, we can see that all proportions P_1 are close to 1, which illustrates the sure screening property. The MMLE screening procedure usually has a higher proportion P_1 than the other two methods, but it takes much longer computing time than the PB-SIS and Kolmogorov filter methods in all settings. Even though the proportion P_1 for PB-SIS is slightly lower than for MMLE when ρ = 0 and ρ = 0.2, the difference is very small: the biggest difference in P_1 is only 1.3%, between PB-SIS and MMLE when ρ = 0 and p = 600. When ρ is 0.4 or greater, PB-SIS and MMLE have exactly the same proportion P_1. In terms of computational cost, however, the PB-SIS method can be implemented much faster than the MMLE method. The average computing times for the PB-SIS and MMLE methods in logit model 1 are 41.85 s and 579.18 s when p = 200, and 282.05 s and 1289.69 s when p = 600; the computing time for MMLE is almost 6.74 times and 2.23 times longer than that of the PB-SIS method when p = 200 and p = 600, respectively. The Kolmogorov filter method has the lowest proportion P_1 and moderate computing time in each setting. Since all coefficients in logit model 1 are positive, the proportions P_1 do not depend on the independence assumption: even for highly correlated predictors, all three feature screening methods can still successfully select all the true predictors. For example, the proportions P_1 all equal 100% when ρ = 0.6 and ρ = 0.8. In addition, the proportion P_1 decreases as the dimensionality increases: as the number of features increases from p = 200 to p = 600, the proportions P_1 decrease in most settings.
The proportion P_1 and computing time for logit model 2 are reported in Table 3. In logit model 2, the two true covariates are assigned different signs. All P_1 values of PB-SIS and MMLE are still very close, meaning the two screening procedures perform equally well in most settings. However, comparing computing times, PB-SIS takes much less time than MMLE in all settings. Comparing covariance structures with different ρ's, the settings in which predictors are independent (ρ = 0) or have low correlation (ρ = 0.2) typically perform better than those with high (ρ = 0.6) or very high (ρ = 0.8) correlation for all three screening procedures. This is because the probabilities of selecting some unimportant variables are inflated by adjacent important ones when the predictors are highly correlated: unimportant predictors that are strongly correlated with the true predictors may be selected, which weakens the probability of selecting all true predictors. Table 4 depicts the proportion P_1 and computing time for logit model 3. Similar conclusions can be drawn from Table 4 as from Table 2. All proportions P_1 of the three screening approaches are close to one, meaning all three approaches are able to select all important predictors in this setting. As the submodel size d increases, the proportions P_1 for all three approaches increase as well; thus, increasing the submodel size d helps increase the proportion P_1. The computing time does not change much as the submodel size d increases. To obtain a higher proportion P_1, we can choose a larger threshold d; however, a larger threshold d makes the model more complex. There is a trade-off between model complexity and selection accuracy.
Our suggestion is to choose the smaller submodel size d = ⌊n/log(n)⌋, since the small gain in the proportion P_1 is not worth doubling or tripling the model complexity. Table 5 reports the proportion P_1 and computing time for logit model 4. In logit model 4, the three true covariates are assigned different signs. PB-SIS and MMLE perform equally well, and the PB-SIS approach is more efficient, when ρ = 0, 0.2, 0.4, or 0.6. However, when predictors are highly correlated (ρ = 0.8), all three feature screening methods fail to detect the important predictors. This is because, when predictors are highly correlated, each predictor's contribution to the response variable is partially cancelled out, especially for predictors with opposite signs.
The proportion P 1 and computing time for logit model 5 and logit model 6 are summarized in Tables 6 and 7. For logit model 5, we observe a qualitative pattern similar to logit models 1 and 3. The PB-SIS and MMLE approaches perform equally well, and the PB-SIS approach yields a comparable computing time. The Kolmogorov filter approach performs slightly worse than PB-SIS in both selection accuracy and computing time. We also observe that the proportion P 1 increases as the correlation ρ increases. From Table 7, the simulation results show that PB-SIS and MMLE perform equally well in selection accuracy, while the PB-SIS approach has lower computational cost than MMLE when predictors are independent or have lower correlation. Similar to the logit model 1 and logit model 3 simulation results, when predictors are highly correlated, all three feature screening approaches tend to fail to select the important predictors. Table 8 summarizes the MMMS containing all true predictors for logit model 1 and its RSD. These two values can be used to measure the effectiveness of a screening method; the MMMS avoids the issue of choosing different thresholds d. From Table 8, we can observe that the PB-SIS and MMLE methods perform equally well and the Kolmogorov filter approach performs slightly worse than the PB-SIS and MMLE approaches in all settings. The Kolmogorov filter has a somewhat larger RSD due to outliers, which make the minimum model size spread out in some cases. For the high and very high correlation settings, the RSD values for PB-SIS and MMLE are larger, which means the minimum model size has higher variability when covariates are highly correlated with each other.  Table 9 depicts the MMMS and RSD for logit model 2. We observe similar results as for logit model 1. PB-SIS and MMLE still perform well in selecting all important variables when predictors are independent or have low correlation. 
However, all three feature screening procedures fail to detect the important predictors when predictors are highly correlated (ρ = 0.8), especially the Kolmogorov filter method. For example, when the correlation is high, the MMMS of the Kolmogorov filter is 16 and 33 for p = 200 and 600, and the RSD values even reach 30.60 and 79.85 when p = 200 and p = 600. This means the minimum-size models containing all important predictors are very spread out over the 1000 replications and outliers may exist. This is mainly because each predictor's contribution to the response variable is cancelled out when the predictors have different signs and are highly correlated with each other. The Kolmogorov filter method still performs slightly worse than the PB-SIS and MMLE methods. In general, these three screening approaches do not differ much when the number of true predictors is small and the predictors have the same signs. Table 11 presents the simulation results for logit model 4 in terms of MMMS and the associated RSD. The simulation results illustrate that PB-SIS and MMLE have more effective and consistent performance than the Kolmogorov filter method when ρ = 0, 0.2 or 0.4. In addition, we also notice that across different dimensions and correlation levels, the MMMS and the associated RSD usually increase as the dimension or the correlation level increases. When predictors are highly correlated, the PB-SIS, MMLE and Kolmogorov filter methods fail to select the important predictors. For example, when ρ = 0.8, the MMMS of the PB-SIS, MMLE and Kolmogorov filter procedures are 105, 105 and 144 for p = 200 and 140, 140 and 199 for p = 600, which are much larger than the true model size 3. The simulation results for logit model 5 in terms of the MMMS and the associated RSD are presented in Table 12. The overall pattern of logit model 5 is similar to logit models 1 and 3. 
The PB-SIS and MMLE methods still outperform the Kolmogorov filter method in selection effectiveness. The Kolmogorov filter method has larger MMMS and associated RSD than PB-SIS and MMLE in almost all settings. The simulation results of MMMS with the associated RSD for logit model 6 are summarized in Table 13. From Table 13, we can observe that as the correlation increases, the MMMS and the associated RSD usually increase as well for the PB-SIS, MMLE and Kolmogorov filter approaches. In addition, as the dimension increases from 200 to 600, the MMMS also increases for all three feature screening approaches. Among all the approaches, the PB-SIS method usually achieves the smallest MMMS value in most settings. When predictors are highly correlated, all three feature screening methods fail to select the important predictors. As discussed before, this is because the contributions of predictors with opposite signs may cancel out when the predictors are highly correlated.

Simulations in Two-Stage Approach
We investigate the selection performance of the two-stage PB-SIS method with different penalties. We consider the LASSO penalty, the SCAD penalty and the MCP, along with four tuning parameter selection criteria: cross-validation (CV), the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the extended Bayesian information criterion (EBIC). In this section, only the logit link is applied to generate the binomial proportion π i and the binary response Y. We use the same model settings as in Section 4.1, presented in Table 1. In the first stage, PB-SIS is conducted to obtain the submodel M d with size d = ⌊n/ log(n)⌋. Then in the second stage, three different penalized methods are applied to further select important predictors and recover the final sparse model. All the simulation results are based on 1000 replicates.
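The first-stage ranking can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it uses the standard point-biserial formula r = (x̄₁ − x̄₀)·√(n₁n₀/n²)/sₓ, which for a 0/1 response coincides with the Pearson correlation, and the toy data (true signal on columns 0 and 1, coefficient 2) is an assumption for the example:

```python
import numpy as np

def pb_screen(X: np.ndarray, y: np.ndarray, d: int) -> np.ndarray:
    """Stage 1 of PB-SIS: rank predictors by |point-biserial correlation|
    with the binary response y and keep the top-d column indices."""
    n = len(y)
    x1 = X[y == 1].mean(axis=0)            # per-column means in the y = 1 group
    x0 = X[y == 0].mean(axis=0)            # per-column means in the y = 0 group
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    s = X.std(axis=0)                       # population s.d. of each column
    r = (x1 - x0) * np.sqrt(n1 * n0 / n**2) / s
    return np.argsort(-np.abs(r))[:d]       # indices of the d top-ranked predictors

# toy logit-model data (dimensions and coefficients are illustrative)
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
logits = 2.0 * X[:, 0] + 2.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))
kept = pb_screen(X, y, d=int(n / np.log(n)))
```

With a strong marginal signal, the two true columns land comfortably inside the screened submodel of size d ≈ n/log(n).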
We evaluate the two-stage PB-SIS performance based on P 2 , the proportion of final models containing all the true predictors among 1000 iterations, and the mean of the final model size. The proportion P 2 and the mean model size are summarized for model 1 to model 6 in Tables 14-19, and the mean of the final model size after regularization is reported in parentheses. We use the package SIS in R to implement the penalized methods in the second stage. The tune.fit function in the SIS package fits a generalized linear model via penalized maximum likelihood, with penalties such as LASSO, SCAD and MCP as provided in the glmnet and ncvreg packages. The number of folds used in cross-validation is 10 and the loss function used in selecting the final model is the deviance.  The proportion P 2 and mean model size for model 1 and model 2 are tabulated in Tables 14 and 15. For model 1 and model 2, the number of true predictors is two. In general, we observe that the PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the higher proportions P 2 , while the PB-SIS+MCP two-stage approaches with different tuning parameter selection criteria yield the sparsest models among the three penalties. Even though the PB-SIS+LASSO two-stage approaches usually have the highest proportion P 2 , they also give the largest final model sizes for all tuning parameter selection criteria. Furthermore, the PB-SIS+SCAD two-stage approach using EBIC to select the tuning parameter occasionally fails to select the important predictors. For example, in Table 14, the proportions P 2 for the PB-SIS+SCAD two-stage approach using EBIC are just 0.742 and 0.599 when p = 200 and p = 600, which are the smallest among all two-stage approaches with different penalties. We also notice that as the dimension p increases, the proportion P 2 decreases and the mean model size increases for all three penalties.
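Since the second stage here relies on R's SIS::tune.fit, a rough Python analogue of the PB-SIS+LASSO variant may help fix ideas. scikit-learn provides only the L1 (LASSO-type) penalty, not SCAD or MCP, and the data, dimensions and grid size below are toy assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Stand-in for the second stage: penalized logistic regression on the
# d predictors that survived stage-1 screening, tuned by 10-fold CV on
# a deviance-type loss, mirroring the tune.fit setup described above.
rng = np.random.default_rng(1)
n, d = 200, 37                       # d predictors survive stage-1 screening
X_sub = rng.standard_normal((n, d))
logits = 2.0 * X_sub[:, 0] + 2.0 * X_sub[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

fit = LogisticRegressionCV(
    penalty="l1", solver="liblinear", cv=10,
    scoring="neg_log_loss",          # deviance-based tuning criterion
    Cs=20,                            # grid of 20 candidate penalty strengths
).fit(X_sub, y)
selected = np.flatnonzero(fit.coef_.ravel())  # indices with nonzero coefficients
```

The nonzero coefficients of the tuned fit define the final sparse model.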
Tables 16 and 17 summarize the proportion P 2 and mean model size for model 3 and model 4, which both contain three true predictors. For these two models, we observe a similar overall pattern as for model 1 and model 2. The final models selected by the PB-SIS+MCP two-stage approach with different tuning parameter selection criteria usually have the smallest model size among the three penalties. The PB-SIS+SCAD two-stage approaches return final models of moderate size, and the PB-SIS+LASSO two-stage approaches return the largest final models. In terms of the proportion P 2 , the PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the largest proportion P 2 . We can conclude that the PB-SIS+LASSO two-stage approach performs better in selection accuracy and the PB-SIS+MCP two-stage approach performs better in finding the sparsest model.
The simulation results for model 5 and model 6 on the proportion P 2 and mean model size are presented in Tables 18 and 19. The overall performance of the PB-SIS+LASSO, PB-SIS+SCAD and PB-SIS+MCP two-stage approaches for model 5 and model 6 is similar to that for models 1 to 4. The PB-SIS+LASSO two-stage approaches with different tuning parameter selection criteria have the highest proportion P 2 along with the largest mean model sizes. On the other hand, the PB-SIS+MCP two-stage approaches with different tuning parameter selection criteria end up with the smallest model size on average, with a slightly smaller proportion P 2 than the PB-SIS+LASSO and PB-SIS+SCAD two-stage approaches. Therefore, there is a trade-off between the selection accuracy and the final model size for these two-stage methods. Our suggestion is to choose the two-stage PB-SIS+LASSO method when we care more about selecting all true predictors, while the two-stage PB-SIS+MCP approach is a better choice if we would like to find the sparsest final model.
We now remark on the choice of a criterion for selecting the tuning parameter λ. In the simulations, as mentioned prior to Algorithm 1, one can use cross-validation (CV), AIC [21], BIC [22] or EBIC [23] to choose the tuning parameter λ, each of which serves as a model selection criterion. Depending on the properties of each model selection criterion, we can choose one for selecting the tuning parameter λ based on different needs. CV chooses the model with the best out-of-sample predictive accuracy. AIC is an efficient model selection criterion, but not a consistent one: it chooses the model with the minimum disparity between a candidate model and the true model and is very likely to select an overfitted model including more predictors than the true model. BIC is consistent, which means that asymptotically BIC chooses the true model. EBIC, the extended BIC, is consistent as well and may incur a small loss in the positive rate but tightly controls the false discovery rate (see [23]). In many applications, CV or BIC is used for selecting the tuning parameter λ.
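As a quick reference, the three information criteria can be written down in a few lines. The EBIC form below follows the formulation of [23] (BIC plus 2γ log C(p, k)); the log-likelihood and counts fed in at the end are toy values for illustration only:

```python
import math

def ic_scores(loglik: float, k: int, n: int, p: int, gamma: float = 0.5):
    """AIC, BIC and EBIC for a fitted model with log-likelihood `loglik`,
    k nonzero coefficients, sample size n and ambient dimension p.
    EBIC adds 2*gamma*log C(p, k) to BIC, penalizing the search space."""
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * math.log(n)
    ebic = bic + 2 * gamma * math.log(math.comb(p, k))
    return aic, bic, ebic

# toy numbers (assumed, for illustration only)
aic, bic, ebic = ic_scores(loglik=-120.0, k=3, n=200, p=600)
```

Along a penalization path, one evaluates the chosen criterion at each candidate λ and keeps the λ with the smallest score.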

Application in COPD Gene Expression Data
The simulation studies in Section 4 demonstrate that the PB-SIS method can select important variables for generalized linear models with a high accuracy rate and low computational cost. We therefore apply the proposed method to a real data example, the chronic obstructive pulmonary disease data, which was utilized in Bahr et al. [24].
Chronic obstructive pulmonary disease (COPD) was classified by the Centers for Disease Control and Prevention in 2014 as the 3rd leading cause of death in the United States (US). COPD weakens lung function and reduces lung capacity. COPD involves inflammation of the bronchial tubes (chronic bronchitis) and destruction of the air sacs (emphysema) within the lungs, and chronic bronchitis and emphysema usually co-occur in COPD. In addition, the Global Initiative for Chronic Obstructive Lung Disease (GOLD) describes COPD as a common and preventable disease caused by exposure to harmful particles and gases that affect the airways and alveoli of the lungs. The symptoms of COPD include shortness of breath due to lowered concentrations of oxygen in the blood and a chronic cough accompanied by mucus production. COPD progresses with time and the damage caused to the lungs is irreversible.
The main causes of COPD are exposure to tobacco smoke and air pollutants. Problems associated with COPD include under-diagnosis of the disease and an increase in the number of smokers worldwide. Based on previous research, tobacco exposure through smoking cigarettes, second-hand exposure to smoke, and continuous exposure to burning fuels, chemicals, polluted air and dust can all cause COPD. Besides tobacco smoke and air pollution, previous studies also found that a genetic deficiency, alpha-1 antitrypsin deficiency (AATD), is associated with COPD: alpha-1 antitrypsin protects the lungs, and in its absence the lungs become vulnerable to COPD. There were over 250 million reported COPD cases worldwide in 2016, and 3.17 million individuals died from COPD in 2015. The prevalence of COPD is expected to rise due to increasing smoking rates and aging populations in many countries.
Prior to the analysis of the COPD data, we remark on the usage conditions that distinguish the proposed method from some other known methods such as minimum redundancy and maximum relevance (MRMR, e.g., Ding and Peng [25], Radovic et al. [26]) and mutual information feature screening (MIFS, e.g., Hoque et al. [27]). When the response variable of a real data set is binary, the proposed PB-SIS employs point-biserial correlations to conduct feature screening in the first stage and a regularization method in the second stage, which ensures that the two-stage variable selection method is consistent. The MRMR method utilizes various measures/criteria (e.g., the mutual information difference criterion, the mutual information quotient criterion) to maximize relevance and minimize redundancy and then chooses a subset of genes. The MIFS method depends on mutual information and a computational algorithm to obtain a subset of genes. Since the response variable in the COPD data set takes two possible values (disease or no disease), it conforms to the condition under which we use point-biserial correlations for the first-stage feature screening and GLM-logit modeling for the second-stage variable selection, so the proposed PB-SIS method is applied to the COPD data for two-stage feature screening and variable selection. In contrast to the proposed method, the MRMR and MIFS approaches do not have such a restriction on data types or data distributions.
Some previous studies have been conducted to identify blood biomarkers for earlier diagnosis of COPD. Ref. [24] compared gene expression profiles of smokers with COPD and smokers without COPD. They applied multiple linear regression to identify candidate genes and pathways.
The goal of our study is to identify disease variability in the gene expression profiles of COPD subjects compared to controls, by re-analyzing pre-existing, publicly available micro-array expression datasets. The data merge resulted in 1262 samples (574 controls and 688 COPD subjects) and 16,237 genes. Our 1262 samples consist of 792 males and 470 females, including 661 former smokers, 418 current smokers and 183 non-smokers.
To check the performance of the different variable selection methods, we randomly split the dataset into two parts, a training set and a test set, to evaluate the prediction performance of the different methods. The training set contains 80% of the observations and the test set contains the remaining 20%. Thus, the training sample size is 1010 and the test sample size is 252. We compare the two-stage PB-SIS approach with the two-stage MMLE and two-stage Kolmogorov filter approaches. For the second stage, we apply three different penalized methods: LASSO [5], SCAD [6] and MCP [17]. For the tuning parameter selection options of each penalized method, we report the results using cross-validation (CV), AIC [21], BIC [22] and EBIC [23].
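The 80/20 split can be reproduced schematically as below. This is a sketch only: the paper does not report its splitting code or random seed, so the seed and the rounding convention (which yields the reported 1010/252 split for n = 1262) are assumptions:

```python
import numpy as np

def split_80_20(n: int, seed: int = 0):
    """Random 80/20 train/test index split. For the COPD data
    (n = 1262) this yields 1010 training and 252 test observations."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)       # random reordering of the n row indices
    n_train = round(0.8 * n)        # 80% of observations go to training
    return perm[:n_train], perm[n_train:]

train_idx, test_idx = split_80_20(1262)
```

Each competing two-stage method is then fit on the training rows and scored by classification accuracy on the held-out test rows.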
The final model size and classification accuracy rates are summarized in Table 20. The numbers in parentheses are the final model sizes. When we use CV, AIC, BIC or EBIC as the tuning parameter selection criterion, the PB-SIS+LASSO, PB-SIS+SCAD and PB-SIS+MCP methods select a model with higher classification accuracy than the MMLE and Kolmogorov filter with different penalties, with the exception of PB-SIS+LASSO with AIC and PB-SIS+MCP with EBIC as the tuning parameter selection criterion. When we use a more stringent criterion such as EBIC, the PB-SIS method with different penalties performs significantly better than MMLE with different penalties. For example, when EBIC is used to select the tuning parameter, PB-SIS+MCP selects 4 predictors and has a classification accuracy rate of 0.817, while the MMLE+MCP method selects 2 predictors and has a classification accuracy rate of 0.765. This clearly demonstrates that by using the two-stage PB-SIS approach, we can select a model with reasonably good prediction performance and an appropriate model size.
In Table 20, we bold the best results in each column with a relatively high classification accuracy and a medium model size, indicating that the proposed PB-SIS method with CV or BIC as the tuning parameter selection criterion performs best in feature screening and variable selection. Even though using AIC can give better classification accuracy than BIC, the results have larger model sizes, which is not favorable because AIC tends to select overfitted models. Ref. [24] listed 16 top candidates as the most significant genes in their final selection (Table 2 of that paper). Based on the proposed PB-SIS method, in stage 1 we select 176 genes that have the highest absolute point-biserial correlations with the response variable. However, our selection result does not align with the results in [24]. We judge that this can happen in gene expression analysis. Both analyses are exploratory research on the COPD data set, and the real mechanism of COPD is still unknown, so there is no benchmark to decide which selection result is more accurate in reality. Further, no ground truth is available to show which genes are truly associated with COPD. So it is quite possible for different approaches to give different results based on different measures. Theoretically, the proposed two-stage PB-SIS method is consistent, which means that as the sample size goes to infinity, the procedure selects the true model with probability 1. The simulation results demonstrate that the two-stage PB-SIS method has higher accuracy than the MMLE and Kolmogorov filter approaches in variable selection, and in the real data example we can select the best model with reasonably good accuracy and an appropriate model size, as in the test sets of the simulations. 
Even though the final gene selection results are not fully consistent with the previous study, the proposed method is an effective approach to feature screening for high-dimensional generalized linear models, with high selection accuracy and low computational cost.

Conclusions and Discussion
We propose a two-stage feature screening method, PB-SIS, for variable selection in generalized linear models. The point-biserial correlation is utilized as a marginal utility measure to rank and filter, in the first stage, the important features that have higher correlation with the response variable. After the first stage, the model size can be dramatically reduced from a high dimensionality p to a moderate size d. The subsequent step is to further select the important variables and build the final model through a regularization method, such as LASSO, SCAD or MCP. This two-stage approach is confirmed to be very efficient, with high selection accuracy and low computational cost.
The PB-SIS method can retain all of the important variables in the selected submodel M d with probability going to one as the sample size goes to infinity. To investigate the performance of the proposed feature screening method, we conduct Monte Carlo simulations. The simulations evaluate the ability of PB-SIS to select variables in generalized linear models by generating data from two different link functions: logit and probit. The simulation results using the logit link are presented in this paper. The simulation results using the probit link have similar trends but are not presented here. We compare P 1 , the proportion of submodels M d with size d that contain all the true predictors among 1000 replications, and the computing time for our proposed method against the MMLE and Kolmogorov filter methods for three different choices of submodel size d. We also compare the MMMS and the associated RSD for these three feature screening approaches. The simulation results demonstrate that the proposed method and MMLE perform equally well in almost all settings, but MMLE takes much longer computing time than the proposed method.
The simulation results also show that the proposed method PB-SIS outperforms the Kolmogorov filter method in both selection accuracy and computational cost. We notice that when true predictors have different signs and are highly correlated, all three feature screening approaches fail to select the important predictors. Therefore, we should always check the independence assumption before applying feature screening approaches. In addition, we compare the performance of the two-stage PB-SIS method with different penalized methods using different tuning parameter selection criteria. The simulation results show that the PB-SIS+LASSO method usually has the highest selection accuracy and the PB-SIS+MCP method can obtain the sparsest model.
We also apply the two-stage PB-SIS method to the COPD gene expression data. The real data example shows that the PB-SIS method is effective in identifying important predictors in real-world data.
We note that the proposed PB-SIS method has limitations. In the application of the proposed method, it is assumed that the response variable is binary or has a binomial distribution. To achieve a competitive variable selection result, the proposed PB-SIS method should be applied when the data meet this assumption, in which case well-performing results are expected. However, if the response variable in a real data set is not binary, the proposed PB-SIS method is not an option. In addition, if the predictors are not continuous, the variable selection result from the second stage of the proposed method may be deficient.
Future research is still needed on feature screening for high-dimensional and ultrahigh-dimensional variable selection problems. Even though the PB-SIS method is able to efficiently select important predictors for high-dimensional generalized linear models, it encounters a similar issue as SIS [9]. Since the PB-SIS method is based on the marginal point-biserial correlation ω̂ j , it tends to miss important predictors that are marginally uncorrelated but jointly correlated with the response variable. To deal with this issue, Ref. [9] also proposed iterative sure independence screening (ISIS), which uses joint information about the predictors rather than just the marginal information in high-dimensional variable selection. Therefore, it would be an interesting topic to extend the marginal PB-SIS procedure to an iterative feature screening procedure by iteratively carrying out the marginal screening procedure.
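To make the iterative idea concrete, here is a rough ISIS-style sketch. It illustrates only the general idea, not the exact algorithm of Ref. [9]: the residual-based re-screening and the least-squares working model are simplifying assumptions, as is the toy data at the end:

```python
import numpy as np

def iterative_screen(X: np.ndarray, y: np.ndarray, d: int, n_iter: int = 2):
    """ISIS-style sketch: alternate marginal screening with residual
    adjustment so that predictors important only jointly with the
    already-selected set can surface in later rounds."""
    n, p = X.shape
    active = []
    resid = y - y.mean()
    for _ in range(n_iter):
        remaining = [j for j in range(p) if j not in active]
        # rank remaining predictors by |correlation| with the current residual
        corr = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
        order = np.argsort(corr)[::-1]
        active += [remaining[k] for k in order[: d // n_iter]]
        # refit a least-squares working model on the active set, then
        # recompute residuals for the next screening round
        Xa = X[:, active]
        beta, *_ = np.linalg.lstsq(Xa, y - y.mean(), rcond=None)
        resid = (y - y.mean()) - Xa @ beta
    return sorted(active)

# toy binary-response data with signal on columns 0 and 1 (assumed)
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))
prob = 1 / (1 + np.exp(-(2.0 * X[:, 0] + 2.0 * X[:, 1])))
y = rng.binomial(1, prob).astype(float)
picked = iterative_screen(X, y, d=10)
```

An iterative PB-SIS would replace the residual correlation here with a conditional point-biserial ranking and refit a GLM at each round.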
In the numerical studies, we generate predictors from a multivariate normal distribution and apply a specific model (generalized linear models) to generate the response variable. For future research, we could examine the performance of PB-SIS for predictors with heavy tails or outliers. In addition, the proposed method can also be applied in other classical classification methods such as linear discriminant analysis, quadratic discriminant analysis and robust discriminant analysis, or even in model-free settings. Some pioneering work can be found in the related references, including a model-free screening procedure for ultrahigh-dimensional analysis based on the conditional distribution function by [28] and model-free feature screening with dependent variables in ultrahigh-dimensional binary classification by [29].

Data Availability Statement:
The data utilized in this study are available and were studied in Bahr et al. [24], doi: 10.1165/rcmb.2012-0230OC.

Conflicts of Interest:
The authors declare no conflict of interest.