
A selective overview of feature screening for ultrahigh-dimensional data

  • Reviews
  • Invited Articles
Science China Mathematics

Abstract

High-dimensional data are frequently collected in many scientific areas, including genome-wide association studies, biomedical imaging, tomography, tumor classification, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental to high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of the data may grow exponentially with the sample size. Such data have been called ultrahigh-dimensional in the literature. This work presents a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening under specific models, and on the motivation for model-free feature screening procedures.
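
One widely used marginal utility is the absolute sample correlation between each predictor and the response: predictors are ranked by this utility and only the top d are retained before any refined modeling. The sketch below is illustrative only, not the authors' code; the function name `sis_screen` and the toy data are hypothetical.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by absolute marginal Pearson correlation with y
    and return the column indices of the top d (a sure-independence-
    screening-style step; illustrative sketch, not the paper's code)."""
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    # marginal correlation of each column of X with the response
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.argsort(-np.abs(corr))[:d]

# toy example: p = 1000 predictors, n = 50 observations,
# only the first two predictors actually drive the response
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))
y = 5 * X[:, 0] - 5 * X[:, 1] + rng.standard_normal(50)
kept = sis_screen(X, y, d=20)
```

The screening step reduces the dimension from p = 1000 to a moderate d = 20 at O(np) cost; a penalized method such as the lasso can then be applied to the retained predictors.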



Corresponding author

Correspondence to RunZe Li.

About this article

Cite this article

Liu, J., Zhong, W. & Li, R. A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58, 1–22 (2015). https://doi.org/10.1007/s11425-015-5062-9

