Abstract
High-dimensional data are frequently collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental to high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of the data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional data in the literature. This work presents a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models and on the motivation for model-free feature screening procedures.
Cite this article
Liu, J., Zhong, W. & Li, R. A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 1–22 (2015). https://doi.org/10.1007/s11425-015-5062-9
Keywords
- correlation learning
- distance correlation
- sure independence screening
- sure joint screening
- sure screening property
- ultrahigh-dimensional data