Model-free conditional independence feature screening for ultrahigh dimensional data

Science China Mathematics

Abstract

Feature screening plays an important role in ultrahigh dimensional data analysis. This paper is concerned with conditional feature screening, where the goal is to detect the association between the response and ultrahigh dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical or environmental variables). To this end, we first propose a new index to measure conditional independence, and then develop a conditional screening procedure based on this index. We systematically study the theoretical properties of the proposed procedure and establish its sure screening and ranking consistency properties under mild conditions. The proposed screening procedure has several appealing properties: (a) it is model-free, in that its implementation does not require specification of the model structure; (b) it is robust to heavy-tailed distributions and outliers in both the response and the predictors; and (c) it handles both feature screening and conditional screening in a unified way. We study the finite-sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the method through two real data examples.
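
To make the idea of conditional screening concrete, the following is a minimal sketch, not the authors' procedure. The abstract does not define the proposed index, so the sketch swaps in a simple stand-in utility: the absolute partial correlation between the ranks of each predictor and the ranks of the response, after linearly adjusting both for the ranks of a scalar exposure variable. The function names (`rank_transform`, `conditional_screen`), the simulated data, and the choice of `d` are all assumptions made for illustration.

```python
import numpy as np

def rank_transform(v):
    """Return ranks (1..n) of a 1-D array; ties are broken by order of appearance."""
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    return ranks

def conditional_screen(X, y, z, d):
    """Rank predictors by a stand-in conditional utility and keep the top d.

    Stand-in utility (an assumption, NOT the paper's index): the absolute
    partial correlation between rank(X_j) and rank(y) after linearly
    adjusting both for rank(z).
    """
    n, p = X.shape
    # Adjustment design: intercept plus the ranked exposure variable.
    Z = np.column_stack([np.ones(n), rank_transform(z)])

    def residual(v):
        # Least-squares residual of v after regressing on Z.
        coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ coef

    ry = residual(rank_transform(y))
    utilities = np.empty(p)
    for j in range(p):
        rx = residual(rank_transform(X[:, j]))
        denom = np.linalg.norm(rx) * np.linalg.norm(ry)
        utilities[j] = abs(rx @ ry) / denom if denom > 0 else 0.0
    # Indices of the d predictors with the largest utility, best first.
    keep = np.argsort(-utilities)[:d]
    return keep, utilities

# Hypothetical simulated example: predictors 0 and 1 are active,
# every predictor is correlated with the exposure z, and the noise is heavy-tailed.
rng = np.random.default_rng(0)
n, p, d = 200, 500, 20
z = rng.standard_normal(n)
X = 0.5 * z[:, None] + rng.standard_normal((n, p))
y = 3 * X[:, 0] + 3 * X[:, 1] + z + rng.standard_t(3, size=n)
keep, utilities = conditional_screen(X, y, z, d)
print(sorted(keep[:2].tolist()))
```

Working on ranks is what gives this stand-in some robustness to heavy tails and outliers, echoing property (b) above, but the linear adjustment for the exposure is a modelling shortcut that a genuinely model-free index would avoid.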

Acknowledgements

This work was supported by National Science Foundation of USA (Grant No. P50 DA039838), the Program of China Scholarships Council (Grant No. 201506040130), National Natural Science Foundation of China (Grant No. 11401497), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the National Key Basic Research Development Program of China (Grant No. 2010CB950703), the Fundamental Research Funds for the Central Universities, National Institute on Drug Abuse, National Institutes of Health (Grants Nos. P50 DA036107 and P50 DA039838) and National Science Foundation of USA (Grant No. DMS 1512422). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIDA, NIH, NSF, NKBRDPC, FRFCU, CSC or NNSFC.

Author information

Corresponding author

Correspondence to JingYuan Liu.

About this article

Cite this article

Wang, L., Liu, J., Li, Y. et al. Model-free conditional independence feature screening for ultrahigh dimensional data. Sci. China Math. 60, 551–568 (2017). https://doi.org/10.1007/s11425-016-0186-8

  • DOI: https://doi.org/10.1007/s11425-016-0186-8
