
Improving Group Lasso for High-Dimensional Categorical Data

  • Conference paper
  • In: Computational Science – ICCS 2023 (ICCS 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14074)

Abstract

Sparse modeling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates corresponding to a selected factor usually all differ, so the fitted model may not be sparse, which makes it difficult to interpret. To obtain a sparse Group Lasso solution, we propose the following two-step procedure: first, we reduce the data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models prepared by clustering the levels of individual factors. As a consequence, our procedure reduces the dimensionality of the Group Lasso solution and greatly improves the interpretability of the final model. Importantly, this reduction comes at only a small increase in prediction error. In the paper we investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real data sets and show that it outperforms state-of-the-art algorithms with respect to prediction accuracy, model dimension and execution time. Our procedure is implemented in the R package DMRnet, available in the CRAN repository.
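
The two-step procedure is implemented in the authors' CRAN package DMRnet. As a rough illustration, the sketch below fits a path of models on synthetic data and selects the final one with a Generalized Information Criterion. It assumes the interface documented on CRAN (DMRnet(), gic.DMR(), coef(), predict()); the synthetic data, the level structure of f1 and the GIC constant c = 2.5 are illustrative assumptions, not values taken from the paper.

    ## Illustrative sketch (see assumptions above): step 1 screens variables
    ## with the Group Lasso; step 2 merges factor levels and chooses the final
    ## model by an information criterion. Both steps run inside DMRnet().
    library(DMRnet)

    set.seed(1)
    n <- 200
    X <- data.frame(
      f1 = factor(sample(letters[1:6], n, replace = TRUE)),  # 6-level factor
      f2 = factor(sample(LETTERS[1:4], n, replace = TRUE)),  # 4-level noise factor
      x1 = rnorm(n)                                          # continuous predictor
    )
    ## Assumed truth: levels a-c of f1 share one effect, d-f another; f2 is inactive
    y <- 2 * (X$f1 %in% c("d", "e", "f")) + 0.5 * X$x1 + rnorm(n)

    fit <- DMRnet(X, y, family = "gaussian")  # path of models of decreasing dimension
    sel <- gic.DMR(fit, c = 2.5)              # pick the model minimizing GIC
    sel$df.min                                # dimension of the selected model
    coef(fit, df = sel$df.min)                # coefficients with merged factor levels
    head(predict(sel, newx = X))              # predictions from the selected model

A sparse fit here shows up as merged levels: coefficients for levels that the procedure clusters together are equal, so the selected model has far fewer distinct parameters than the initial Group Lasso fit.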


Notes

  1. https://github.com/SzymonNowakowski/ICCS-2023.

Author information

Corresponding author

Correspondence to Wojciech Rejchel.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nowakowski, S., Pokarowski, P., Rejchel, W., Sołtys, A. (2023). Improving Group Lasso for High-Dimensional Categorical Data. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_47

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-36021-3_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36020-6

  • Online ISBN: 978-3-031-36021-3

  • eBook Packages: Computer Science, Computer Science (R0)
