Keywords
R, variable selection, statistical inference
As the sheer volume of data grows at an astronomical rate, variable selection plays an increasingly crucial role in research. This is particularly true in the high-dimensional setting, where the number of features exceeds the sample size and classical statistical methods that exploit the full feature space no longer work. An ideal variable selection procedure would recover the underlying true support with high probability, yield parameter estimates with low bias, and achieve good prediction performance. While it is hard for a statistical procedure to strike a balance between inference and prediction tasks,1,2 the ProSGPV algorithm is remarkably able in this sense.3,4
One natural approach to variable selection is best subset selection (BSS). For each model size k, BSS chooses the k out of p total variables that optimize a chosen loss function. It can be thought of as an ℓ0-penalized regression, a problem known to be nonconvex and NP-hard.5 With recent advancements,6–8 solving the BSS routine with thousands of features is no longer infeasible. In particular, an efficient R package called BeSS8 can identify the best sub-model in seconds or a few minutes when p is around 10,000. ℓ1-penalized likelihood procedures are also used for variable selection. Lasso, for example, produces models with strong predictive ability.9 However, lasso is not always variable selection consistent.1,10 To address the inconsistency issue, adaptive lasso was proposed, which introduces weights in the penalty.11 Despite the oracle variable selection properties of the adaptive lasso, it is often hard in practice to find a tuple of tuning parameters that achieves those properties. Both lasso and adaptive lasso can be implemented in the glmnet package.12,13 Smoothly clipped absolute deviation (SCAD)14 and the minimax concave penalty with penalized linear unbiased selection (MC+)15 were proposed to bridge the gap between the ℓ0 and ℓ1 penalties. The two algorithms are largely distinguished by their piecewise linear thresholding functions. Sure independence screening (SIS),16,17 implemented in the SIS package,18 ranks the maximum marginal likelihood estimates and can greatly reduce the dimensionality of the feature space by keeping the top variables in the ranking, even when p is much larger than n. Iterative SIS (ISIS) can improve its performance in finite sample sizes. Note that all the aforementioned algorithms shrink point estimates to derive a sparse solution, and there is room for improvement in their inference and prediction properties in finite sample sizes.
Many R packages have been developed to address particular data types. For example, clogitL119 performs variable selection with lasso and elastic net penalties in conditional logistic regression; pogit20 performs Bayesian variable selection with spike-and-slab priors in Poisson and logistic regressions; and penPHcure21 performs variable selection in the Cox proportional hazards cure model with time-varying covariates. The ideal R package would have performance superior or comparable to the current algorithms, and would work with each of continuous, binary, count, and time-to-event outcomes.
Recently, we3,4 developed penalized regression with second-generation p-values (ProSGPV) for variable selection in both low-dimensional (n > p) and high-dimensional (p > n) settings. Unlike traditional algorithms, ProSGPV incorporates estimation uncertainty, via confidence intervals, into the variable selection process. This addition often leads to better support recovery, parameter estimation, and prediction performance. This paper describes an R package named ProSGPV, which implements the ProSGPV algorithm.29 Here, we extend the algorithm to work with data from logistic regression, Poisson regression, and Cox proportional hazards regression. We also provide visualization tools for the variable selection process, which were not discussed in the original papers.3,4 Simulation studies below compare the inference and prediction performance of ProSGPV against that of glmnet, BeSS, and ISIS in scenarios not covered previously.3,4 A real-world example compares the sparsity of solutions and prediction accuracy of all algorithms.
Second-generation p-values (SGPVs) were proposed to address some of the well-known flaws of traditional p-values.22,23 The basic idea is to replace the point null hypothesis with a pre-specified interval null. The SGPV is denoted by p_δ, where δ represents the half-width of the interval null. The interval null represents the set of effect sizes that are scientifically indistinguishable from the point null hypothesis, due to limited precision or practicality.
SGPVs are essentially the fraction of data-supported hypotheses that are also null hypotheses. See Figure 1 for an illustration of how SGPVs work. Their formal definition is as follows: let θ be the parameter of interest, and let I = [θ_l, θ_u] be an interval estimate of θ whose length is given by |I| = θ_u − θ_l. Here, I can be a confidence interval (we will use 95% CIs in this paper), a likelihood support interval, or a credible interval. The coverage probability of the interval estimate will largely drive the frequency properties of the SGPV upon which it is based.
Denote the interval null by H_0 = [−δ, δ] and its length by |H_0|. Then the SGPV, p_δ, is defined as

p_δ = (|I ∩ H_0| / |I|) × max{ |I| / (2|H_0|), 1 },

where |I ∩ H_0| is the length of the overlap between the interval estimate and the interval null.
Notice also that the correction term max{|I|/(2|H_0|), 1} resolves any problems that arise when the interval estimate is too wide to be useful or reliable, in which case the data are effectively deemed inconclusive and the SGPV is capped at 1/2. It is in this way that SGPVs emphasize effects that are clinically meaningful over effects that are small and near the null hypothesis. Empirical studies have shown how SGPVs can be used to identify feature importance in high-dimensional settings.3,4,22,23 The null bound δ in the SGPVs is typically the smallest effect that would be clinically relevant, or the effect magnitude that can be distinguished from noise on average. In previous work,3,4 we proposed using a generic null interval for regression coefficients that shrinks to zero and is based on the observed level of noise in the data. This extends the constant null bound used earlier.22,23 The interval is easily obtained in the variable selection step and promotes good statistical properties in the selection algorithm.
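As a concrete illustration, the SGPV definition above can be computed in a few lines of base R. This is our own sketch of the formula, not a function exported by the ProSGPV package.

```r
# Sketch of the SGPV for an interval estimate I = [lb, ub] and an
# interval null H0 = [-delta, delta]; helper name is ours, not from
# the ProSGPV package.
sgpv <- function(lb, ub, delta) {
  overlap  <- max(min(ub, delta) - max(lb, -delta), 0)  # |I intersect H0|
  len.i    <- ub - lb                                   # |I|
  len.null <- 2 * delta                                 # |H0|
  # The correction term caps the SGPV at 1/2 when I is very wide
  (overlap / len.i) * max(len.i / (2 * len.null), 1)
}

sgpv(-0.2, 0.8, 0.5)   # partial overlap with the null: 0.7
sgpv( 0.6, 0.9, 0.5)   # interval clears the null:      0
sgpv(-10, 10, 0.5)     # very wide interval, capped at  0.5
```

An SGPV of exactly zero, as in the second call, is what the ProSGPV algorithm uses to keep a variable.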
The ProSGPV algorithm is a two-stage algorithm. In the first stage, the algorithm identifies a candidate set of variables using a broad regularization scheme. In the second stage, the algorithm applies SGPV regularization to the model based on the candidate set identified in the first stage. The successive regularization approach is easy and fast to implement, and quite accurate for screening out false features. The steps of the ProSGPV algorithm are shown in Algorithm 1.
Algorithm 1. The ProSGPV algorithm.
1: procedure ProSGPV(X, Y)
2: Stage one: Find a candidate set
3: Fit a lasso and evaluate it at λgic
4: Fit OLS/GLM/Cox models on the lasso active set
5: Stage two: SGPV screening
6: Extract the confidence intervals of all variables from the previous step
7: Calculate the mean coefficient standard error
8: Calculate the SGPV for each variable, where the null region is [−λ, λ], λ being the mean standard error from step 7, and the interval estimate is the 95% confidence interval
9: Keep variables with SGPV of zero
10: Refit the OLS/GLM/Cox with selected variables
11: end procedure
By default, data are scaled and centered in linear regression but are not transformed as such in GLM and Cox regression. No notable difference is observed when data are standardized in GLM and Cox regression. In the first stage, lasso is used to reduce the feature space to a candidate set that is very likely to contain the true signals. This pre-screening is crucial for high-dimensional data (p > n) and improves support recovery and parameter estimation in low-dimensional data.3 The lasso is evaluated at λgic, but the algorithm is robust with respect to the choice of λ.3 In the second stage, the null bound is set to be the mean standard error of all coefficient estimates. However, the algorithm is insensitive to any reasonable scale change in the null bound.3,4 When data are highly correlated, a generalized variance inflation factor (GVIF)24 adjusted null bound can be used to improve the inference and prediction performance.4
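To make stage two concrete, the following base-R sketch mimics the SGPV screening step for a linear model, assuming a candidate set has already been produced by the stage-one lasso (the glmnet step is omitted here). All object names are ours, and the data are simulated for illustration only.

```r
# Stage-two sketch: SGPV screening on a given candidate set.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
y <- drop(x %*% c(3, 0, 2, 0, 0)) + rnorm(n)  # true signals: variables 1 and 3

candidate <- c(1, 2, 3)                  # pretend the stage-one lasso kept these
fit <- lm(y ~ x[, candidate])

ci <- confint(fit)[-1, , drop = FALSE]   # 95% CIs, intercept row dropped
# Null bound: mean standard error of the candidate coefficients
lambda <- mean(summary(fit)$coefficients[-1, "Std. Error"])
# SGPV = 0 exactly when the CI clears the null region [-lambda, lambda]
keep <- candidate[ci[, 1] > lambda | ci[, 2] < -lambda]
keep                                     # variables surviving stage two
```

A final OLS refit on the kept variables would give the ProSGPV point estimates; in the package, all of this is handled internally by pro.sgpv.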
In essence, ProSGPV is a hard thresholding algorithm that shrinks small effects to zero and preserves large effects, so as to obtain unbiased estimates when the true support is successfully recovered. Notation-wise, the solution to the ProSGPV algorithm is the OLS/GLM/Cox fit restricted to the selected set S = {j in the candidate set : the SGPV of variable j is zero}.
When λgic in Algorithm 1 is replaced with zero in the first stage, ProSGPV reduces to a one-stage algorithm. That amounts to calculating SGPVs for each variable in the full model and selecting the variables whose effects are above the threshold. However, the support recovery and parameter estimation performance of the one-stage algorithm is slightly worse than that of the two-stage algorithm.3 Moreover, the one-stage algorithm is not applicable when p > n, i.e., when the full OLS/GLM/Cox model is not identifiable.
The ProSGPV package is publicly available from the Comprehensive R Archive Network (CRAN), a development version is available on GitHub, and the source code is archived with Zenodo.29 To install from CRAN, please run
install.packages("ProSGPV")
To install a development version, please run
library(devtools)
devtools::install_github("zuoyi93/ProSGPV")
The main function pro.sgpv implements the default two-stage and optional one-stage ProSGPV algorithm. User-friendly print, coef, summary, predict, and plot methods accompany pro.sgpv for both one- and two-stage algorithms. Jeffreys prior penalized logistic regression25 is used when the outcome is binary, to stabilize coefficient estimates in the case of complete/quasi-complete separation. In the next section, we demonstrate how pro.sgpv works with simulated continuous outcome data.
The ProSGPV package works across platforms (Windows, macOS, and Linux) and requires R version 3.5.0 or greater. Once installed, the workflow is as described below. We first present an example that applies the ProSGPV algorithm to a dataset simulated with the gen.sim.data function. With sample size n = 100, number of variables p = 20, number of true signals s = 4, smallest effect size 1, largest effect size 5, autoregressive correlation ρ = 0.2, variance 1, and signal-to-noise ratio (SNR, defined as in26) ν = 2, we generate outcomes following a Gaussian distribution. gen.sim.data outputs X, Y, the indices of the true signals, and the vector of true coefficients.
> library(ProSGPV)
> set.seed(1)
> sim.data <- gen.sim.data(n = 100, p = 20, s = 4, family = "gaussian",
+   beta.min = 1, beta.max = 5, rho = 0.2, nu = 2)
> x <- sim.data[[1]]
> y <- sim.data[[2]]
> (true.index <- sim.data[[3]])
[1] 2 4 5 7
> true.beta <- sim.data[[4]]
By default, the two-stage algorithm is used in ProSGPV. The pro.sgpv function takes as inputs the explanatory variables x, the outcome y, the outcome type family (default "gaussian"), a stage indicator (default 2), and a GVIF indicator (default FALSE). A print method is available to show the labels of the variables selected by ProSGPV.
> sgpv.out.2 <- pro.sgpv(x, y)
> sgpv.out.2
Selected variables are V2 V4 V5 V7
The variable selection process can be visualized using the plot function. Figure 2 shows the fully relaxed lasso path over a range of λs. The shaded area is the null zone, and any effect whose 95% confidence interval overlaps with the null region is considered irrelevant or insignificant. ProSGPV is evaluated at λgic. The lpv argument can be used to display either one line per variable (the confidence bound that is closer to the null region) or three lines per variable (a point estimate and both confidence bounds; the default). The lambda.max argument controls the limit of the x-axis in the plot.
summary function outputs the regression summary of the selected model. When the outcome is continuous, an OLS is used.
> summary(sgpv.out.2)

Call:
lm(formula = Response ~ ., data = data.d)

Residuals:

Coefficients:
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.834 on 95 degrees of freedom
Multiple R-squared: 0.6367, Adjusted R-squared: 0.6214
F-statistic: 41.63 on 4 and 95 DF, p-value: < 2.2e-16
The coef function can be used to extract the coefficient estimates, a vector of length p. When signals are sparse, some of the estimates are zero. A comparison shows that the estimates are close to the true effect sizes in the simulation.
> beta.hat <- coef(sgpv.out.2)
> rbind(beta.hat, true.beta)
          [,1]      [,2] [,3]     [,4]     [,5] [,6]      [,7] [,8] [,9] [,10]
beta.hat     0 -4.199224    0 1.591675 2.660038    0 -3.314971    0    0     0
true.beta    0 -5.000000    0 1.000000 2.333333    0 -3.666667    0    0     0
The predict function can be used to predict outcomes with the selected model. In-sample predictions can be made by calling predict(sgpv.out.2), and out-of-sample predictions can be made by supplying a new data set to the newdata argument.
In addition to the two-stage algorithm, the one-stage algorithm can also be used to select variables when n > p. The computation time is shorter for the one-stage algorithm, at the expense of a slightly reduced support recovery rate in the limit, as shown previously.3 Figure 3 shows the variable selection result of the one-stage algorithm on the same data. The one-stage algorithm missed V4 and selected only three variables. The lower confidence bound of the estimate for V4 barely fails to clear the null region, and V4 was dropped from the final model because of that. The one-stage algorithm illustrates how estimation uncertainty, via confidence intervals (horizontal segments), can be incorporated into variable selection.
> sgpv.out.1 <- pro.sgpv(x, y, stage = 1)
> sgpv.out.1
Selected variables are V2 V5 V7
> plot(sgpv.out.1)
Examples of binary, count, and time-to-event data can be found in the package vignettes.
The simulation design is adapted from3,4 and we present below scenarios not discussed in those two papers. The setup can be found in Table 1 below. The scale and shape parameters are used to generate time-to-event outcomes from a Weibull distribution. ProSGPV is compared against lasso, BeSS, and ISIS. Results are aggregated over 1000 simulations. Evaluation metrics include the support recovery rate, the parameter estimation mean absolute error (MAE), and the prediction root mean square error (RMSE) or area under the curve in a separate test set. Support recovery is defined as capturing the exact true support, not merely containing it. An estimate of the MAE is (1/p) Σ_{j=1}^{p} |β̂_j − β_j|, where β_j is the jth true coefficient. Simulation results are summarized in Figure 4.
|  | Linear regression | Logistic regression | Poisson regression | Cox regression |
|---|---|---|---|---|
| n | 100 | 32:320 | 40 | 80:800 |
| p | 100:1000 | 16 | 40:400 | 40 |
| s | 10 | 6 | 4 | 20 |
| βl | 1 | 0.4 | 0.2 | 0.3 |
| βu | 2 | 1.2 | 0.5 | 1 |
| ρ | 0.3 | 0.6 | 0.3 | 0.3 |
| σ | 2 | 2 | 2 | 2 |
| ν | 2 |  |  |  |
| Intercept | 0 | 2 | 0 |  |
| Scale |  |  |  | 2 |
| Shape |  |  |  | 1 |
| Rate of censoring |  |  |  | 0.2 |
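For reference, the support recovery and MAE metrics above are straightforward to compute; the helper functions below are our own illustrations, not part of the package, and the numbers are toy values.

```r
# Support recovery: the estimated support must match the true support exactly
support.recovered <- function(est, truth) {
  identical(which(est != 0), which(truth != 0))
}

# Parameter estimation mean absolute error, (1/p) * sum |beta.hat - beta|
mae <- function(est, truth) mean(abs(est - truth))

est   <- c(0, -4.2, 0, 1.6)   # toy estimates
truth <- c(0, -5.0, 0, 1.0)   # toy true coefficients

support.recovered(est, truth)  # TRUE: same nonzero positions
mae(est, truth)                # (0 + 0.8 + 0 + 0.6) / 4 = 0.35
```

Note that an estimate containing extra nonzero entries fails this support recovery check, which is why it is stricter than mere inclusion of the true support.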
From Figure 4, we observe results similar to those in the original papers.3,4 ProSGPV often has the highest capture rate of the exact true model, the lowest parameter estimation bias, one of the lowest prediction errors, and the fastest computation. Note that the GVIF-adjusted null bound is used in the logistic regression because of the high correlation in the design matrix.
In this section, we compare the performance of ProSGPV with the lasso, BeSS, and ISIS algorithms using real-world financial data.28 The close price of the Dow Jones Industrial Average (DJIA) was documented from January 1, 2010 to November 15, 2017, and eight groups of predictors (primitive indicators, technical indicators, big U.S. companies, commodities, exchange rates of currencies, future contracts, world stock indices, and other sources of information27) were collected to predict the DJIA close price. The analyzed data with complete records contain 1114 observations and 82 predictors. We randomly sampled 614 observations as a fixed test set to evaluate the prediction performance of models built on the training set. We allowed the training set size n to vary from 40 to 300. At each n, we recorded the distribution of the training model size for each algorithm, as well as the distribution of the prediction RMSE, over 1000 repetitions. Results are summarized in Figure 5. ProSGPV and lasso produced sparser models than BeSS and ISIS did, and ProSGPV had much better prediction performance than lasso in the test set. BeSS and ISIS achieved comparable prediction performance at the cost of including many more variables in the final model. The tradeoff between sparsity of the solution and prediction accuracy is well illustrated in this example. Variables frequently selected by ProSGPV include the 5-, 10-, and 15-day rates of change and the 10-day exponential moving average of the DJIA. Technical indicators seem more predictive than the other groups, such as world indices, commodities, exchange rates, and futures.
We introduced an R package implementing the ProSGPV algorithm, a variable selection algorithm that incorporates estimation uncertainty via confidence intervals. This novel addition often leads to better support recovery, parameter estimation, and prediction performance. The package is user-friendly, has nice visualization tools for the variable selection process, facilitates subsequent inference and prediction, and is very fast computationally. The efficacy of our package is demonstrated on both simulated and real-world data sets.
The real-world data are from a Kaggle data challenge, available at https://www.kaggle.com/ehoseinz/cnnpred-stock-market-prediction. A detailed description of the data elements can be found in the original paper.27
Zenodo: zuoyi93/r-code-prosgpv-r: v1.0.0. https://doi.org/10.5281/zenodo.5655772.28
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Analysis code available from: https://github.com/zuoyi93/r-code-prosgpv-r/tree/v1.0.0#readme
Archived analysis code as at time of publication: https://doi.org/10.5281/zenodo.5655772.28
License: Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Software available from: https://cran.r-project.org/web/packages/ProSGPV/index.html
Source code available from: https://github.com/zuoyi93/ProSGPV
Archived source code at time of publication: https://doi.org/10.5281/zenodo.5655795.29
License: 3-Clause BSD License.
Version 1 published 18 Jan 22.