Abstract
High-dimensional data are frequently collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental to high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of the data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional data in the literature. This work presents a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models and on the motivation for model-free feature screening procedures.
Cite this article
Liu, J., Zhong, W. & Li, R. A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 1–22 (2015). https://doi.org/10.1007/s11425-015-5062-9
Keywords
- correlation learning
- distance correlation
- sure independence screening
- sure joint screening
- sure screening property
- ultrahigh-dimensional data