Abstract
Model-based component-wise gradient boosting is a popular tool for data-driven variable selection. To further improve its prediction and selection qualities, several modifications of the original algorithm have been developed. These mainly focus on different stopping criteria and leave the actual variable selection mechanism untouched. We investigate different prediction-based mechanisms for the variable selection step in model-based component-wise gradient boosting. These approaches include Akaike's Information Criterion (AIC) as well as a selection rule relying on the component-wise test error computed via cross-validation. We implemented the AIC and cross-validation routines for Generalized Linear Models and evaluated them regarding their variable selection properties and predictive performance. An extensive simulation study revealed improved selection properties, whereas the prediction error could be lowered in a real-world application with age-standardized COVID-19 incidence rates.
Funding source: Deutsche Forschungsgemeinschaft
Award Identifier / Grant number: 426493614
Funding source: Volkswagen Foundation
Award Identifier / Grant number: Freigeist Fellowship
Funding source: codeocean capsule
Acknowledgment
We would like to thank Gabriele Doblhammer and Daniel Kreft for aggregating and sharing the data.
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no conflict of interest.
- Research funding: The study was supported by Deutsche Forschungsgemeinschaft (DOI: http://dx.doi.org/10.13039/501100001659, 426493614); Volkswagen Foundation (http://dx.doi.org/10.13039/501100001663; Freigeist Fellowship) and codeocean capsule.
- Data availability: The raw data can be obtained on request from the corresponding author.
- Software availability: The R files of the new algorithms as well as a minimal working example can be found on GitHub (https://github.com/SophiePotts/PBVS) and Code Ocean (https://codeocean.com/capsule/2221576/tree).
Appendix A: Simulations
A.1 Simulation setup
| y ∼ Poi(λ) | y ∼ Bin(π) | |
---|---|---|---|
β r for NSR = low | See Figure A.1 | {−0.8, 0.8} | {−0.8, 0.8} |
β r for NSR = medium | See Figure A.1 | {−0.25, 0.4} | {−0.25, 0.4} |
β r for NSR = high | See Figure A.1 | {−0.12, 0.1} | {−0.12, 0.1} |
k | {100, 500} | {100, 500} | {100, 500} |
INF | {5, 20} | {5, 20} | {5, 20} |
cor | {uncor, toep} | {uncor, toep} | {uncor, toep} |
runs | 100 | 50 | 50 |
T | 3000 | 6500 | 10,000 |
ν | 0.1 | 0.01 | 0.8 |
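
The factor levels in the table above can be crossed into individual simulation settings. The following is a minimal sketch, assuming a full factorial design over NSR, k, INF and cor; the table itself does not state this, although the denominators reported in Appendix A.2 (12 settings per correlation structure, 8 per NSR level) are consistent with it. The names `grid`, `settings` and `count_where` are illustrative and not taken from the authors' R code.

```python
from itertools import product

# Factor levels as listed in the setup table (the labels are placeholders).
grid = {
    "NSR": ["low", "medium", "high"],
    "k": [100, 500],
    "INF": [5, 20],
    "cor": ["uncor", "toep"],
}

# Assumption: all factors are fully crossed.
settings = [dict(zip(grid, combo)) for combo in product(*grid.values())]


def count_where(factor, level):
    """Number of settings in which `factor` takes the value `level`."""
    return sum(1 for s in settings if s[factor] == level)


print(len(settings))                 # total number of settings
print(count_where("cor", "toep"))    # settings with Toeplitz correlation
print(count_where("NSR", "low"))     # settings with low NSR
```

Under this assumption each distribution is evaluated on 24 settings, each repeated for the number of runs given in the table.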
A.2 Simulation results of normally distributed data
Settings with superior performanceᵃ of AIC-boost compared to mboostCV

Setting | Number | Percentage |
---|---|---|
Uncorrelated | 9/12 | 75 % |
Toeplitz correlation | 10/12 | 83 % |
NSR = 0.2 | 8/8 | 100 % |
NSR = 0.5 | 7/8 | 88 % |
NSR = 1 | 4/8 | 50 % |
INF = 5 | 8/12 | 67 % |
INF = 20 | 11/12 | 92 % |
k = 100 | 11/12 | 92 % |
k = 500 | 8/12 | 67 % |
ᵃ Superior performance is measured by means of the median FPR.
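
As a quick arithmetic check of the table above, the sketch below recomputes the percentages from the reported counts and sums the counts within each factor. It assumes that the levels of each factor partition the same set of settings, in which case every factor must yield the same overall tally. The names `table`, `percent`, `percentages` and `totals` are illustrative.

```python
# (superior, total) counts copied from the table above.
table = {
    "cor": {"Uncorrelated": (9, 12), "Toeplitz correlation": (10, 12)},
    "NSR": {"0.2": (8, 8), "0.5": (7, 8), "1": (4, 8)},
    "INF": {"5": (8, 12), "20": (11, 12)},
    "k": {"100": (11, 12), "500": (8, 12)},
}


def percent(superior, total):
    """Integer percentage, rounded half up, using integer arithmetic only."""
    return (200 * superior + total) // (2 * total)


# Percentage per factor level, e.g. percentages["NSR"]["0.5"].
percentages = {
    factor: {level: percent(*counts) for level, counts in levels.items()}
    for factor, levels in table.items()
}

# Sum of (superior, total) over the levels of each factor.
totals = {
    factor: tuple(sum(pair) for pair in zip(*levels.values()))
    for factor, levels in table.items()
}

print(percentages)
print(totals)
```

Each factor sums to 19 superior settings out of 24, so the rows of the table are mutually consistent, and the recomputed percentages match the reported ones.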
A.3 Simulation results of Poisson distributed data
A.4 Simulation results of binomial distributed data
Figure A.14 depicts how the TPR and FPR of the different algorithms evolve with increasing model size for a specific setting (P = 100, INF = 20, uncorrelated covariates, T = 3000, ν and number of simulation runs as in Table A.1) with varying NSR. Note that the lines correspond to averages over the simulation runs and that the NSR cannot be directly compared between the distributions. The paths of the three compared methods appear very similar, with highly overlapping empirical 90 % quantile bands. The algorithms thus tend to behave very similarly in terms of FPR and TPR when a model of a specific size is chosen. In the Gaussian setting (second row) it becomes clear that AIC-boost stops at a smaller model size: it possibly continues to fit the covariates already included but, compared to the benchmark, does not add further FPs. This pattern can be seen in Figure 3 as well. In high NSR settings, AIC-boost refrains from including more FPs to reach TPR = 1, while mboostCV and CV-boost do so.
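
For readers who want to reproduce such curves, the two rates can be computed from the set of selected covariates and the set of truly informative ones. This is a minimal sketch using the standard definitions (TPR = TP / INF, FPR = FP / (number of covariates − INF)); this section does not spell out the exact definitions used in the paper, so these are an assumption. The function name `selection_rates` and the example indices are hypothetical.

```python
def selection_rates(selected, informative, n_covariates):
    """Return (TPR, FPR) of a selected covariate set.

    selected:     indices of covariates with non-zero coefficient
    informative:  indices of the truly informative covariates
    n_covariates: total number of candidate covariates
    """
    selected, informative = set(selected), set(informative)
    true_pos = len(selected & informative)
    false_pos = len(selected - informative)
    tpr = true_pos / len(informative)
    fpr = false_pos / (n_covariates - len(informative))
    return tpr, fpr


# Illustrative setting: 100 covariates, the first 20 are informative.
informative = range(20)
selected = list(range(18)) + [40, 41, 42, 43]  # 18 TPs and 4 FPs
tpr, fpr = selection_rates(selected, informative, n_covariates=100)
print(tpr, fpr)
```

In this illustrative example the model of size 22 has TPR = 18/20 and FPR = 4/80. Evaluating the function after every boosting iteration, and averaging over runs at each model size, would give paths of the kind described above.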
For the binomial distribution, the two methods likewise follow similar paths in terms of model size. AIC-boost again seems to refrain longer from including new variables, as the differing path lengths indicate, i.e. it updates non-zero coefficients rather than including new variables. However, keeping in mind that most of the settings use the final iteration as the stopping iteration t*, AIC-boost would possibly reach the same model sizes if fitted for a much larger number of iterations T. The observed differences in predictive performance for binomial distributed data arise from different estimated coefficient sizes and different stopping iterations.
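
To illustrate why an iteration either updates an already included coefficient or enlarges the model, the following sketch implements plain component-wise L2 boosting with simple linear base-learners for a Gaussian response. It uses the standard residual-sum-of-squares selection rule of the classical algorithm, not the AIC- or cross-validation-based selection rules studied in the paper, and it is not the authors' implementation. The function name `componentwise_l2_boost` and the toy data are hypothetical.

```python
def componentwise_l2_boost(X, y, n_iter, nu=0.1):
    """Component-wise L2 boosting with one linear base-learner per covariate.

    X is a list of rows with centred, non-constant columns; y is a list of
    responses. Returns the intercept, the coefficients and the index of the
    covariate selected in each iteration.
    """
    n, k = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(k)]
    sum_sq = [sum(v * v for v in col) for col in cols]
    intercept = sum(y) / n
    beta = [0.0] * k
    resid = [yi - intercept for yi in y]
    path = []
    for _ in range(n_iter):
        best_j, best_b, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(cols):
            # Least-squares fit of the current residuals on covariate j alone.
            b = sum(c * r for c, r in zip(col, resid)) / sum_sq[j]
            rss = sum((r - b * c) ** 2 for c, r in zip(col, resid))
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        # Only the best-fitting component receives a (shrunken) update.
        beta[best_j] += nu * best_b
        resid = [r - nu * best_b * c for c, r in zip(cols[best_j], resid)]
        path.append(best_j)
    return intercept, beta, path


# Toy data: only the first covariate is informative (y = 1 + 3 * x0).
X = [[-2, 1, 1], [-1, -1, -2], [0, 0, 2], [1, -1, -2], [2, 1, 1]]
y = [-5, -2, 1, 4, 7]
intercept, beta, path = componentwise_l2_boost(X, y, n_iter=50)
model_size = sum(1 for b in beta if b != 0.0)
print(intercept, beta, model_size)
```

In this toy example every iteration updates the same coefficient, so the model size stays at one while the coefficient approaches its least-squares value. A selection rule that is more reluctant to add variables would show the same behaviour over longer stretches of the path.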
Appendix B: Application
B.1 Model table of application
Covariate | mboostCV | | LASSO | | AIC-boost | |
---|---|---|---|---|---|---|
Intercept | 4.133 | | 3.992 | | 2.829 | |
Age-standardized incidence rate per 100,000 person-years until 15.03 | 0.043 | (5) | 0.046 | (6) | 0.109 | (6) |
Unemployment rate of young persons (under 26 years) in 2017 | −0.071 | (4) | −0.055 | (5) | −0.071 | (8) |
%Voter turnout (number of valid votes in the last Bundestag election) of all registered voters | 0.023 | (7) | 0.036 | (7) | 0.033 | (13) |
%Roman-catholics in 2011 | 0.075 | (3) | 0.079 | (4) | 0.086 | (7) |
Persons in long-term care per 10,000 persons in 2017 | −0.024 | (6) | −0.033 | (8) | −0.039 | (12) |
Premature mortality (deaths of persons younger than 65 years) per 1000 persons | −0.097 | (2) | −0.088 | (3) | −0.117 | (5) |
%Households with low income (1500€ per month) in all households in 2016 | −0.015 | (8) | −0.015 | (9) | −0.012 | (18) |
Latitude | −0.122 | (1) | −0.130 | (2) | −0.130 | (4) |
%Older employed persons (55 years+) in all employed persons in 2017 | | | −0.013 | (10) | | |
Ever 100+ inbound commuters from Tirschenreuth | | | 0.141 | (1) | 0.920 | (1) |
%Change of number of persons at age 50–65 in 2012–2017 | | | | | 0.009 | (19) |
Sex ratio (females to males) at age 20–40 in 2017 | | | | | 0.021 | (17) |
Total sex ratio (females to males) in 2017 | | | | | 0.068 | (9) |
%Young employed persons in all young persons (under 26 years) in 2017 | | | | | 0.054 | (11) |
%Older employed persons in all older persons (55 years+) in 2011–2017 | | | | | 0.027 | (15) |
%Change of number of employed persons in 2012–2016 | | | | | 0.009 | (19) |
Average travel time to the next large-sized regional center (“Oberzentrum”) | | | | | −0.032 | (14) |
%Outbound commuters over a distance of 50 km + in all employed persons in 2017 | | | | | −0.006 | (20) |
%Outbound commuters over a distance of 150 km + in all employed persons in 2017 | | | | | −0.056 | (10) |
Ever 100+ inbound commuters from Hohenlohekreis | | | | | 0.136 | (3) |
Ever 100+ inbound commuters from Aachen | | | | | 0.027 | (15) |
Ever 100+ inbound commuters from Rosenheim (city + district) | | | | | 0.194 | (2) |
© 2023 Walter de Gruyter GmbH, Berlin/Boston