Software Tool Article

ProSGPV: an R package for variable selection with second-generation p-values

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 18 Jan 2022

Abstract

We introduce the ProSGPV R package, which implements a variable selection algorithm based on second-generation p-values (SGPVs) instead of traditional p-values. Most variable selection algorithms shrink point estimates to arrive at a sparse solution. In contrast, the ProSGPV algorithm accounts for estimation uncertainty, via confidence intervals, in the selection process. This additional information leads to better inference and prediction performance in finite sample sizes. ProSGPV maintains good performance even in the high dimensional case where p > n, or when explanatory variables are highly correlated. Moreover, ProSGPV is a unifying algorithm that works with continuous, binary, count, and time-to-event outcomes. No cross-validation or iterative processes are needed, so ProSGPV is very fast to compute. Visualization tools are available in this package for assessing the variable selection process. Here we present simulation studies and a real-world example to demonstrate ProSGPV's inference and prediction performance relative to current standard variable selection procedures.

Keywords

R, variable selection, statistical inference

Introduction

As the sheer volume of data grows at an astronomical rate, variable selection plays an increasingly crucial role in research. This is particularly true in the high dimensional setting where p > n and classical statistical methods exploiting the full feature space no longer work. An ideal variable selection procedure would recover the underlying true support with high probability, yield parameter estimates with low bias, and achieve good prediction performance. While it is generally hard for a statistical procedure to strike a balance between inference and prediction tasks,1,2 the ProSGPV algorithm manages both remarkably well.3,4

One natural approach to variable selection is best subset selection (BSS). For each k ∈ {1, 2, …, p}, BSS chooses the k out of p total variables that optimize a chosen loss function. It can be thought of as ℓ0-penalized regression, and the BSS problem has been shown to be nonconvex and NP-hard.5 With recent advancements,6–8 solving the BSS routine with thousands of features is no longer infeasible. In particular, the efficient R package BeSS8 can identify the best sub-model in seconds or a few minutes when p is around 10,000. ℓ1-penalized likelihood procedures are also used for variable selection. Lasso, for example, produces models with strong predictive ability.9 However, lasso is not always variable selection consistent.1,10 To address this inconsistency, the adaptive lasso was proposed, which introduces weights into the ℓ1 penalty.11 Despite the oracle variable selection properties of the adaptive lasso, it is often hard in practice to find a tuple of tuning parameters that achieves those properties. Both lasso and adaptive lasso can be implemented in the glmnet package.12,13 The smoothly clipped absolute deviation (SCAD)14 and the minimax concave penalty with penalized linear unbiased selection (MC+)15 were proposed to bridge the gap between the ℓ0 and ℓ1 penalties; the two algorithms are largely distinguished by their piecewise linear thresholding functions. Sure independence screening (SIS),16,17 implemented in the SIS package,18 ranks the maximum marginal likelihood estimates and can greatly reduce the dimensionality of the feature space by keeping the top-ranked variables, even when p ≫ n. Iterative SIS (ISIS) improves its performance in finite sample sizes. Note that all of the aforementioned algorithms shrink point estimates to derive a sparse solution, leaving room for improvement in inference and prediction properties in finite sample sizes.

Many R packages have been developed to address specific data types. For example, clogitL119 performs variable selection with lasso and elastic net penalties in conditional logistic regression; pogit20 performs Bayesian variable selection with spike-and-slab priors in Poisson and logistic regressions; and penPHcure21 performs variable selection in the Cox proportional hazards cure model with time-varying covariates. An ideal R package would have performance superior or comparable to current algorithms and would work with continuous, binary, count, and time-to-event outcomes alike.

Recently, we3,4 developed penalized regression with second-generation p-values (ProSGPV) for variable selection in both low-dimensional (n > p) and high-dimensional (p > n) settings. Unlike traditional algorithms, ProSGPV incorporates estimation uncertainty, via confidence intervals, in the variable selection process. This addition often leads to better support recovery, parameter estimation, and prediction performance. This paper describes an R package named ProSGPV, which implements the ProSGPV algorithm.29 Here, we extend the algorithm to work with data from logistic regression, Poisson regression, and Cox proportional hazards regression. We also provide visualization tools for the variable selection process, which were not discussed in the original papers.3,4 Simulation studies below compare the inference and prediction performance of ProSGPV against that of glmnet, BeSS, and ISIS in scenarios not covered by those papers.3,4 A real-world example compares the sparsity of solutions and prediction accuracy of all the algorithms.

Methods

What is a second-generation p-value?

Second-generation p-values (SGPVs) were proposed to address some of the well-known flaws in traditional p-values.22,23 The basic idea is to replace the point null hypothesis with a pre-specified interval null. The SGPV is denoted by pδ, where δ represents the half-width of the interval null. The interval null represents the set of effect sizes that are scientifically indistinguishable from the point null hypothesis, due to limited precision or practicality.

SGPVs are essentially the fraction of data-supported hypotheses that are also null hypotheses. See Figure 1 for an illustration of how SGPVs work. Their formal definition is as follows: let θ be the parameter of interest, and let I = [θl, θu] be an interval estimate of θ whose length is given by |I| = θu − θl. Here, I can be a confidence interval (we use 95% CIs in this paper), a likelihood support interval, or a credible interval. The coverage probability of the interval estimate I will largely drive the frequency properties of the SGPV upon which it is based.


Figure 1. An illustration of how SGPVs work.

Denote the length of the interval null by |H0|. Then the SGPV, pδ, is defined as

(1)
$p_\delta = \frac{|I \cap H_0|}{|I|} \times \max\left\{ \frac{|I|}{2|H_0|},\ 1 \right\}$
where I ∩ H0 is the intersection of the two intervals.22,23 Notice how the SGPV indicates when the data are compatible with null hypotheses (pδ = 1), or with alternative hypotheses (pδ = 0), or when the data are inconclusive (0 < pδ < 1).

Notice also that the correction term max{|I|/(2|H0|), 1} resolves any problems that arise when the interval estimate I is too wide to be useful or reliable, in which case the data are effectively deemed inconclusive. It is in this way that SGPVs emphasize effects that are clinically meaningful over effects that are small and near the null hypothesis. Empirical studies have shown how SGPVs can be used to identify feature importance in high dimensional settings.3,4,22,23 The null bound in the SGPVs is typically the smallest effect that would be clinically relevant, or the effect magnitude that can be distinguished from noise on average. Our earlier work3,4 proposed a generic null interval for regression coefficients that shrinks to zero and is based on the observed level of noise in the data; this extends the constant null bound used in the original SGPV proposals.22,23 The interval is easily obtained in the variable selection step and promotes good statistical properties in the selection algorithm.
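To fix ideas, the SGPV of Equation 1 can be computed in a few lines of R. This is a minimal sketch for illustration only; the function sgpv and its arguments are hypothetical and are not part of the package, which computes SGPVs internally.

sgpv <- function(lb, ub, delta) {
  len.I <- ub - lb                                     # |I|, length of the interval estimate
  len.H0 <- 2 * delta                                  # |H0|, length of the interval null [-delta, delta]
  overlap <- max(min(ub, delta) - max(lb, -delta), 0)  # |I intersect H0|
  overlap / len.I * max(len.I / (2 * len.H0), 1)       # Equation 1
}

sgpv(0.5, 1.5, 0.2)   # CI clears the null zone entirely: returns 0
sgpv(-0.1, 0.3, 0.2)  # CI partially overlaps the null zone: returns 0.75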

The ProSGPV algorithm

The ProSGPV algorithm is a two-stage algorithm. In the first stage, the algorithm identifies a candidate set of variables using a broad regularization scheme. In the second stage, the algorithm applies SGPV regularization to the model based on the candidate set identified in the first stage. The successive regularization approach is easy and fast to implement, and quite accurate for screening out false features. The steps of the ProSGPV algorithm are shown in Algorithm 1.

Algorithm 1. The ProSGPV algorithm.

1: procedure ProSGPV(X, Y)

2:  Stage one: Find a candidate set

3:   Fit a lasso and evaluate it at λgic

4:   Fit OLS/GLM/Cox models on the lasso active set

5:  Stage two: SGPV screening

6:   Extract the confidence intervals of all variables from the previous step

7:   Calculate the mean coefficient standard error SE¯

8:   Calculate the SGPV for each variable, where Ij = β̂j ± 1.96 × SEj and H0 = [−SĒ, SĒ]

9:   Keep variables with SGPV of zero

10:   Refit the OLS/GLM/Cox with selected variables

11: end procedure

By default, data are scaled and centered in linear regression but are not transformed in GLM and Cox regressions; no notable difference is observed when data are standardized in GLM and Cox regressions. In the first stage, lasso is used to reduce the feature space to a candidate set that is very likely to contain the true signals. This pre-screening is crucial for high dimensional data (n < p) and improves support recovery and parameter estimation in low dimensional data.3 The lasso is evaluated at λgic, but the algorithm is robust to the choice of λ.3 In the second stage, the null bound is set to the mean standard error of all coefficient estimates; however, the algorithm is insensitive to any reasonable scale change in the null bound.3,4 When the data are highly correlated, a generalized variance inflation factor (GVIF)24 adjusted null bound can be used to improve inference and prediction performance.4
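To make the two stages concrete, below is a conceptual R sketch for a continuous outcome. This is a simplification for illustration, not the package's internal code: the function two.stage.sketch and its lambda.gic argument are hypothetical (pro.sgpv computes λgic internally), and edge cases such as an empty candidate set are ignored.

library(glmnet)

two.stage.sketch <- function(x, y, lambda.gic) {
  # Stage one: lasso evaluated at lambda.gic defines the candidate set C
  lasso.fit <- glmnet(x, y)
  b.lasso <- as.matrix(coef(lasso.fit, s = lambda.gic))[-1, 1]
  cand <- which(b.lasso != 0)
  # Refit OLS on the candidate set to get estimates and standard errors
  ols <- lm(y ~ ., data = data.frame(y = y, x[, cand, drop = FALSE]))
  est <- coef(ols)[-1]
  se <- summary(ols)$coefficients[-1, 2]
  # Stage two: keep variables whose 95% CI clears the null zone [-mean(se), mean(se)],
  # which is equivalent to |estimate| > 1.96 * SE_k + mean(SE)
  cand[abs(est) > 1.96 * se + mean(se)]
}

The variables surviving stage two are then refit with OLS/GLM/Cox to produce the final model, as in line 10 of Algorithm 1.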

In essence, ProSGPV is a hard thresholding algorithm: it shrinks small effects to zero and preserves large effects, so that an unbiased estimate is obtained when the true support is successfully recovered. Notation-wise, the solution to the ProSGPV algorithm, β̂prosgpv, is

(2)
$\hat{\beta}^{prosgpv} = \hat{\beta}\big|_S^m \in \mathbb{R}^p, \quad \text{where } S = \left\{ k \in C : |\hat{\beta}_k^m| > \lambda_k \right\} \text{ and } C = \left\{ j \in \{1, 2, \dots, p\} : |\hat{\beta}_j^{lasso}| > 0 \right\}$
where β̂|Sm is a vector of length p whose non-zero elements are the OLS/GLM/Cox coefficient estimates from the model containing only the variables in S. S is the final selection set and C is the candidate set from the first-stage screening. Note that the cutoff is λk = 1.96 SEk + SĒ.
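As a toy numerical illustration of the cutoff (with made-up values): a candidate variable with β̂k = 1.5 and SEk = 0.3, in a model whose mean standard error is SĒ = 0.25, has λk = 1.96 × 0.3 + 0.25 = 0.838. Since |β̂k| = 1.5 > 0.838, the variable is kept; equivalently, its 95% CI [0.912, 2.088] lies entirely outside the null zone [−0.25, 0.25], so its SGPV is zero.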

When λgic in Algorithm 1 is replaced with zero in the first stage, ProSGPV reduces to a one-stage algorithm. That amounts to calculating SGPVs for each variable in the full model and selecting the variables whose effects are above the threshold. However, the support recovery and parameter estimation performance of the one-stage algorithm is slightly worse than that of the two-stage algorithm.3 Moreover, the one-stage algorithm is not applicable when p > n, i.e., when the full OLS/GLM/Cox model is not identifiable.

Implementation

The ProSGPV package is publicly available from the Comprehensive R Archive Network (CRAN); a development version is available on GitHub, and the source code is archived with Zenodo.29 To install from CRAN, please run

install.packages("ProSGPV")

To install a development version, please run

library(devtools)
devtools::install_github("zuoyi93/ProSGPV")

The main function, pro.sgpv, implements the default two-stage and the optional one-stage ProSGPV algorithm. User-friendly print, coef, summary, predict, and plot methods accompany pro.sgpv for both the one- and two-stage algorithms. Jeffreys prior penalized logistic regression25 is used when the outcome is binary to stabilize coefficient estimates in the case of complete/quasi-complete separation. In the next section, we demonstrate how pro.sgpv works with simulated continuous outcome data.
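For example, a binary outcome uses the same call with a different family argument. A sketch, where y.bin denotes a hypothetical 0/1 outcome vector:

> sgpv.bin <- pro.sgpv(x, y.bin, family = "binomial")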

Operation

The ProSGPV package works across platforms (Windows, macOS, and Linux) and requires R version 3.5.0 or later. Once installed, the workflow is as described below. We first present an example of applying the ProSGPV algorithm to a dataset simulated with the gen.sim.data function. With sample size n = 100, number of variables p = 20, number of true signals s = 4, smallest effect size βmin = 1, largest effect size βmax = 5, autoregressive correlation ρ = 0.2, variance σ² = 1, and signal-to-noise ratio (SNR, defined in26) ν = 2, we generate outcomes Y following a Gaussian distribution. gen.sim.data outputs X, Y, the indices of the true signals, and the vector of true coefficients.

> library(ProSGPV)
> set.seed(1)
> sim.data <- gen.sim.data(n = 100, p = 20, s = 4, family = "gaussian",
                beta.min = 1, beta.max = 5, rho = 0.2, nu = 2)
> x <- sim.data[[1]]
> y <- sim.data[[2]]
> (true.index <- sim.data[[3]])

[1] 2 4 5 7

> true.beta <- sim.data[[4]]

By default, the two-stage algorithm is used in ProSGPV. The pro.sgpv function takes as inputs the explanatory variables x, the outcome y, the outcome type family (default "gaussian"), a stage indicator (default 2), and a GVIF indicator (default FALSE). A print method is available to show the labels of the variables selected by ProSGPV.

> sgpv.out.2 <- pro.sgpv(x, y)
> sgpv.out.2

Selected variables are V2 V4 V5 V7

The variable selection process can be visualized with the plot function. Figure 2 shows the fully relaxed lasso path over a range of λs. The shaded area is the null zone; any effect whose 95% confidence interval overlaps the null region is considered irrelevant or insignificant. ProSGPV is evaluated at λgic. The lpv argument chooses between displaying one line per variable (the confidence bound closer to the null region) or three lines per variable (a point estimate plus confidence bounds, the default). The lambda.max argument controls the upper limit of the x-axis in the plot.
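For instance, the following call displays three lines per variable and truncates the x-axis at an arbitrary illustrative value of λ = 0.5:

> plot(sgpv.out.2, lpv = 3, lambda.max = 0.5)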


Figure 2. Solution path of the two-stage ProSGPV algorithm.

The summary function outputs the regression summary of the selected model. When the outcome is continuous, an OLS model is used.

> summary(sgpv.out.2)

Call:
lm(formula = Response ~ ., data = data.d)

Residuals:

    Min      1Q  Median      3Q     Max
-9.8066 -3.5912  0.0817  3.1530  9.5042

Coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.02359    0.48792  -0.048  0.96154
V2           -4.19922    0.46775  -8.978 2.53e-14 ***
V4            1.59168    0.52876   3.010  0.00334 **
V5            2.66004    0.47441   5.607 2.01e-07 ***
V7           -3.31497    0.46207  -7.174 1.58e-10 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘’ 1

Residual standard error: 4.834 on 95 degrees of freedom
Multiple R-squared: 0.6367, Adjusted R-squared: 0.6214
F-statistic: 41.63 on 4 and 95 DF, p-value: < 2.2e-16

The coef function can be used to extract the coefficient estimates, a vector of length p. When signals are sparse, some of the estimates are zero. A comparison shows that the estimates are close to the true effect sizes in the simulation.

> beta.hat <- coef(sgpv.out.2)
> rbind(beta.hat, true.beta)

          [,1]      [,2] [,3]     [,4]     [,5] [,6]      [,7] [,8] [,9] [,10]
beta.hat     0 -4.199224    0 1.591675 2.660038    0 -3.314971    0    0     0
true.beta    0 -5.000000    0 1.000000 2.333333    0 -3.666667    0    0     0

The predict function can be used to predict outcomes with the selected model. In-sample predictions are made by calling predict(sgpv.out.2); out-of-sample predictions are made by supplying a new data set to the newdata argument.
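For example, out-of-sample predictions could be obtained as below, where x.new is a hypothetical new design matrix drawn the same way as x:

> x.new <- gen.sim.data(n = 50, p = 20, s = 4, family = "gaussian",
                beta.min = 1, beta.max = 5, rho = 0.2, nu = 2)[[1]]
> y.pred <- predict(sgpv.out.2, newdata = x.new)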

In addition to the two-stage algorithm, the one-stage algorithm can be used to select variables when n > p. Its computation time is shorter, at the expense of a slightly reduced support recovery rate in the limit.3 Figure 3 shows the variable selection result of the one-stage algorithm on the same data. The one-stage algorithm missed V4 and selected only three variables: the confidence interval for V4 barely fails to clear the null region, so V4 was dropped from the final model. The one-stage algorithm illustrates how estimation uncertainty, via confidence intervals (the horizontal segments in Figure 3), is incorporated into variable selection.

> sgpv.out.1 <- pro.sgpv(x, y, stage = 1)
> sgpv.out.1
Selected variables are V2 V5 V7

> plot(sgpv.out.1)


Figure 3. Visualization of variable selection results of the one-stage ProSGPV algorithm.

Examples of binary, count, and time-to-event data can be found in the package vignettes.

Simulation

The simulation design is adapted from3,4 and we present below scenarios not discussed in those two papers. The setup can be found in Table 1 below; the scale and shape parameters are used to generate time-to-event outcomes from a Weibull distribution. ProSGPV is compared against lasso, BeSS, and ISIS, with results aggregated over 1000 simulations. Evaluation metrics include the support recovery rate, the parameter estimation mean absolute error (MAE), and the prediction root mean square error (RMSE) or area under the curve in a separate test set. Support recovery is defined as capturing the exact true support, not merely containing it. An estimate of the MAE is $\frac{1}{p}\sum_{j=1}^{p} |\hat{\beta}_j - \beta_{0,j}|$, where β0,j is the jth true coefficient. Simulation results are summarized in Figure 4.
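For a single simulated data set, these metrics can be computed directly; a minimal sketch using the beta.hat and true.beta objects from the Operation section above:

> mean(abs(beta.hat - true.beta))                        # parameter estimation MAE
> setequal(which(beta.hat != 0), which(true.beta != 0))  # exact support recovery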

Table 1. Summary of parameters in simulation studies.

Parameter            Linear regression   Logistic regression   Poisson regression   Cox regression
n                    100                 32:320                40                   80:800
p                    100:1000            16                    40:400               40
s                    10                  6                     4                    20
βl                   1                   0.4                   0.2                  0.3
βu                   2                   1.2                   0.5                  1
ρ                    0.3                 0.6                   0.3                  0.3
σ                    2                   2                     2                    2
ν                    2                   -                     -                    -
Intercept t0         -                   2                     0                    -
Scale                -                   -                     -                    2
Shape                -                   -                     -                    1
Rate of censoring    -                   -                     -                    0.2

Figure 4. Simulation results over 1000 iterations.

A) Average support recovery rate, means surrounded by 95% Wald confidence intervals. B) Parameter estimation mean absolute error, medians surrounded by 1st and 3rd quartiles. C) Prediction accuracy in a separate test set (root mean square error for linear and Poisson regressions, and area under the curve for logistic regression), medians surrounded by 1st and 3rd quartiles. D) Running time in seconds, medians surrounded by 1st and 3rd quartiles. Values are censored at 0.8 seconds for aesthetic reasons.

From Figure 4, we observe results similar to those reported previously.3,4 ProSGPV often has the highest capture rate of the exact true model, the lowest parameter estimation bias, one of the lowest prediction errors, and the fastest computation. Note that the GVIF-adjusted null bound is used in the logistic regression case because of the high correlation in the design matrix.

Results

Real-world data example

In this section, we compare the performance of ProSGPV with the lasso, BeSS, and ISIS algorithms using real-world financial data.28 The close price of the Dow Jones Industrial Average (DJIA) was documented from January 1, 2010 to November 15, 2017, and eight groups of features (primitive indicators, technical indicators, big U.S. companies, commodities, exchange rates of currencies, futures contracts, world stock indices, and other sources of information)27 were collected to predict the DJIA close price. The analyzed data with complete records contain 1114 observations and 82 predictors. We randomly sampled 614 observations as a fixed test set to evaluate the prediction performance of models built on the training set, and allowed the training set size n to vary from 40 to 300. At each n, we recorded the distribution of the training model size for each algorithm, as well as the distribution of the prediction RMSE, over 1000 repetitions. Results are summarized in Figure 5. ProSGPV and lasso produced sparser models than BeSS and ISIS did, while ProSGPV had much better prediction performance than lasso in the test set; BeSS and ISIS achieved better prediction performance at the cost of including many more variables in the final model. The tradeoff between sparsity of solutions and prediction accuracy is well illustrated in this example. Variables frequently selected by ProSGPV include the 5-, 10-, and 15-day rates of change and the 10-day exponential moving average of the DJIA. Technical indicators appear more predictive than the world indices, commodities, exchange rates, futures, and other feature groups.


Figure 5. Sparsity of solutions (A) and prediction performance (B) of all algorithms in the real-world example.

Medians as well as 1st and 3rd quartiles from 1000 repetitions are compared.

Conclusions

We introduced an R package that implements the ProSGPV algorithm, a variable selection algorithm that incorporates estimation uncertainty via confidence intervals. This novel addition often leads to better support recovery, parameter estimation, and prediction performance. The package is user-friendly, provides visualization tools for the variable selection process, facilitates subsequent inference and prediction, and is computationally very fast. The efficacy of the package is demonstrated on both simulated and real-world data sets.

Data availability

Source data

The real-world data are from a Kaggle data challenge, available at https://www.kaggle.com/ehoseinz/cnnpred-stock-market-prediction. Detailed descriptions of the data elements are available elsewhere.27

Underlying data

Zenodo: zuoyi93/r-code-prosgpv-r: v1.0.0. https://doi.org/10.5281/zenodo.5655772.28

  • Processed_DJI.csv (real-world data)

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Extended data

Analysis code available from: https://github.com/zuoyi93/r-code-prosgpv-r/tree/v1.0.0#readme

Archived analysis code as at time of publication: https://doi.org/10.5281/zenodo.5655772.28

License: Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Software availability

Software available from: https://cran.r-project.org/web/packages/ProSGPV/index.html

Source code available from: https://github.com/zuoyi93/ProSGPV

Archived source code at time of publication: https://doi.org/10.5281/zenodo.5655795.29

License: 3-Clause BSD License.
