Abstract
Sparse modeling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates it produces for the levels of a selected factor usually all differ, so the fitted model may not be sparse and can be difficult to interpret. To obtain a sparse Group Lasso solution, we propose the following two-step procedure: first, we reduce the data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models obtained by clustering the levels of the individual factors. As a consequence, our procedure reduces the dimensionality of the Group Lasso solution and greatly improves the interpretability of the final model. Importantly, this reduction comes at the cost of only a small increase in prediction error. In the paper we investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real data sets and show that it outperforms state-of-the-art algorithms with respect to prediction accuracy, model dimension and execution time. Our procedure is implemented in the R package DMRnet, available in the CRAN repository.
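To make the two-step procedure concrete, the following is a minimal R sketch using the DMRnet package named in the abstract. The function names (DMRnet, cv.DMRnet, gic.DMR and the associated predict methods) follow our reading of the package's CRAN documentation and should be treated as assumptions, as should their exact signatures; the toy data set below is purely illustrative and is not from the paper's experiments.

# Minimal sketch of the two-step procedure with the DMRnet package.
# Assumes install.packages("DMRnet") has been run; the toy data are
# illustrative only.
library(DMRnet)

set.seed(1)
n <- 200
X <- data.frame(
  f1 = factor(sample(letters[1:5], n, replace = TRUE)),  # categorical predictor
  f2 = factor(sample(letters[1:4], n, replace = TRUE)),  # categorical predictor
  x3 = rnorm(n)                                          # continuous predictor
)
# The response depends on f1 only through two merged groups of its
# levels, {a, b} versus {c, d, e}, so a sparse model should fuse levels.
y <- ifelse(X$f1 %in% c("a", "b"), 1, 0) + 0.5 * X$x3 + rnorm(n)

# Group Lasso screening followed by clustering of factor levels;
# the fit holds a path of candidate models of decreasing dimension.
fit <- DMRnet(X, y, family = "gaussian")

# Select the final model dimension by cross-validation ...
cv_fit <- cv.DMRnet(X, y)
# ... or with the Generalized Information Criterion on the fitted path.
gic_fit <- gic.DMR(fit)

# Predictions from the selected model (here on the training data,
# only to keep the sketch self-contained).
y_hat <- predict(cv_fit, newx = X)

In the selected model, levels of a factor that were clustered together share a single coefficient, which is what makes the final fit both low-dimensional and interpretable.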