Abstract
Sparse modeling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates it produces for the levels of a selected factor usually all differ, so the fitted model may not be sparse and can be difficult to interpret. To obtain a sparse Group Lasso solution, we propose the following two-step procedure: first, we reduce the data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models obtained by clustering the levels of the individual factors. As a consequence, our procedure reduces the dimensionality of the Group Lasso solution and greatly improves the interpretability of the final model. Importantly, this reduction comes at the cost of only a small increase in prediction error. In the paper we investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real data sets and show that it outperforms state-of-the-art algorithms with respect to prediction accuracy, model dimension and execution time. Our procedure is implemented in the R package DMRnet, available in the CRAN repository.
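To make the two-step procedure concrete, the following is a minimal R sketch using the DMRnet package named in the abstract. The function names (DMRnet, cv.DMRnet, gic.DMR and the associated predict methods) follow our reading of the package's CRAN documentation and should be treated as assumptions, as should their exact signatures; the toy data set below is purely illustrative and is not from the paper's experiments.

# Minimal sketch of the two-step procedure with the DMRnet package.
# Assumes install.packages("DMRnet") has been run; the toy data are
# illustrative only.
library(DMRnet)

set.seed(1)
n <- 200
X <- data.frame(
  f1 = factor(sample(letters[1:5], n, replace = TRUE)),  # categorical predictor
  f2 = factor(sample(letters[1:4], n, replace = TRUE)),  # categorical predictor
  x3 = rnorm(n)                                          # continuous predictor
)
# The response depends on f1 only through two merged groups of its
# levels, {a, b} versus {c, d, e}, so a sparse model should fuse levels.
y <- ifelse(X$f1 %in% c("a", "b"), 1, 0) + 0.5 * X$x3 + rnorm(n)

# Group Lasso screening followed by clustering of factor levels;
# the fit holds a path of candidate models of decreasing dimension.
fit <- DMRnet(X, y, family = "gaussian")

# Select the final model dimension by cross-validation ...
cv_fit <- cv.DMRnet(X, y)
# ... or with the Generalized Information Criterion on the fitted path.
gic_fit <- gic.DMR(fit)

# Predictions from the selected model (here on the training data,
# only to keep the sketch self-contained).
y_hat <- predict(cv_fit, newx = X)

In the selected model, levels of a factor that were clustered together share a single coefficient, which is what makes the final fit both low-dimensional and interpretable.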