Prediction-based variable selection for component-wise gradient boosting

Sophie Potts; Elisabeth Bergherr; Constantin Reinke; Colin Griesbach

doi:10.1515/ijb-2023-0052

Published online by De Gruyter November 27, 2023

From the journal The International Journal of Biostatistics

Abstract

Model-based component-wise gradient boosting is a popular tool for data-driven variable selection. In order to improve its prediction and selection qualities even further, several modifications of the original algorithm have been developed that mainly focus on different stopping criteria, leaving the actual variable selection mechanism untouched. We investigate different prediction-based mechanisms for the variable selection step in model-based component-wise gradient boosting. These approaches include Akaike's Information Criterion (AIC) as well as a selection rule relying on the component-wise test error computed via cross-validation. We implemented the AIC and cross-validation routines for Generalized Linear Models and evaluated them regarding their variable selection properties and predictive performance. An extensive simulation study revealed improved selection properties, whereas the prediction error could be lowered in a real-world application with age-standardized COVID-19 incidence rates.
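
To make the idea of a prediction-based selection step more concrete, the following is a minimal sketch for a Gaussian outcome (L2 loss). It is not the authors' implementation: the function names (`fit_slope`, `cv_error`, `boost`), the simple no-intercept linear base-learners, the fold construction and the toy data are all assumptions made for illustration. In each iteration the component is chosen by the smallest cross-validated squared error of its base-learner on the current residuals, instead of the smallest in-sample residual sum of squares.

```python
import random


def fit_slope(x, u):
    """Least-squares slope of u on x (no intercept; x is assumed centered)."""
    sxx = sum(v * v for v in x)
    return sum(a * b for a, b in zip(x, u)) / sxx if sxx > 0 else 0.0


def cv_error(x, u, folds):
    """Summed held-out squared error of a one-covariate base-learner."""
    err = 0.0
    for test in folds:
        held = set(test)
        train = [i for i in range(len(u)) if i not in held]
        b = fit_slope([x[i] for i in train], [u[i] for i in train])
        err += sum((u[i] - b * x[i]) ** 2 for i in test)
    return err


def boost(cols, y, steps=100, nu=0.1, n_folds=5, seed=1):
    """Component-wise L2 boosting with a CV-based selection step (sketch)."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]  # assumed fold scheme
    offset = sum(y) / n
    fit = [offset] * n
    coef = [0.0] * len(cols)
    for _ in range(steps):
        # Negative gradient of the L2 loss = current residuals.
        u = [yi - fi for yi, fi in zip(y, fit)]
        # Prediction-based selection: smallest cross-validated error wins.
        j = min(range(len(cols)), key=lambda c: cv_error(cols[c], u, folds))
        # Refit the selected base-learner on all data, update with step nu.
        b = fit_slope(cols[j], u)
        coef[j] += nu * b
        fit = [fi + nu * b * xi for fi, xi in zip(fit, cols[j])]
    return offset, coef


# Toy data (hypothetical): only the first of three covariates is informative.
rng = random.Random(0)
n = 60
raw = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
cols = [[v - sum(c) / n for v in c] for c in raw]  # center each covariate
y = [2.0 * a + 0.1 * rng.gauss(0, 1) for a in cols[0]]
offset, coef = boost(cols, y)
print([round(c, 2) for c in coef])
```

An AIC-based rule would keep the same loop and only replace the criterion inside `min(...)`; how the degrees of freedom enter that criterion is not specified in this preview and is therefore left out of the sketch.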

Keywords: gradient boosting; variable selection; prediction analysis; high-dimensional data; sparse models

Corresponding author: Sophie Potts, Chair of Spatial Data Science and Statistical Learning, University of Goettingen, Goettingen, Germany, E-mail: sophie.potts@uni-goettingen.de

Funding source: Deutsche Forschungsgemeinschaft

Award Identifier / Grant number: 426493614

Funding source: Volkswagen Foundation

Award Identifier / Grant number: Freigeist Fellowship

Funding source: codeocean capsule

Acknowledgment

We would like to thank Gabriele Doblhammer and Daniel Kreft for aggregating and sharing the data.

Research ethics: Not applicable.
Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Competing interests: The authors state no conflict of interest.
Research funding: The study was supported by Deutsche Forschungsgemeinschaft (DOI: http://dx.doi.org/10.13039/501100001659, 426493614); Volkswagen Foundation (http://dx.doi.org/10.13039/501100001663; Freigeist Fellowship) and codeocean capsule.
Data availability: The raw data can be obtained on request from the corresponding author.
Software availability: The R files of the new algorithms as well as a minimal working example can be found via github https://github.com/SophiePotts/PBVS and codeocean https://codeocean.com/capsule/2221576/tree.

Appendix A: Simulations

A.1 Simulation setup

Table A.1:

Simulation setup by type of outcome distribution.

	y ∼ N(μ, σ²)	y ∼ Poi(λ)	y ∼ Bin(π)
β_r for NSR = low	See Figure A.1	{−0.8, 0.8}	{−0.8, 0.8}
β_r for NSR = medium	See Figure A.1	{−0.25, 0.4}	{−0.25, 0.4}
β_r for NSR = high	See Figure A.1	{−0.12, 0.1}	{−0.12, 0.1}
k	{100, 500}	{100, 500}	{100, 500}
INF	{5, 20}	{5, 20}	{5, 20}
cor	{uncor, toep}	{uncor, toep}	{uncor, toep}
runs	100	50	50
T	3000	6500	10,000
ν	0.1	0.01	0.8
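
As a rough illustration of how the factor levels in Table A.1 combine, the sketch below enumerates the settings for one outcome distribution. It assumes a fully crossed design, which is consistent with the denominators reported in Table A.2 (12 settings per correlation structure, 8 per NSR level); the dictionary keys and string labels are illustrative names, not identifiers from the authors' code.

```python
from collections import Counter
from itertools import product

# Factor levels as listed in Table A.1 (labels are illustrative strings).
levels = {
    "NSR": ["low", "medium", "high"],
    "k": [100, 500],
    "INF": [5, 20],
    "cor": ["uncor", "toep"],
}

# Assumption: every combination of levels forms one simulation setting.
settings = [dict(zip(levels, combo)) for combo in product(*levels.values())]

per_cor = Counter(s["cor"] for s in settings)
per_nsr = Counter(s["NSR"] for s in settings)
print(len(settings), dict(per_cor), dict(per_nsr))
```

Under this assumption there are 3 × 2 × 2 × 2 = 24 settings per outcome distribution.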

Figure A.1:

Scaled β coefficients in the simulation study with normally distributed outcome by noise-to-signal ratio.

A.2 Simulation results of normally distributed data

Table A.2:

Comparison of false positive rates of AIC-boost and mboost_CV by simulation parameter.

	Settings with superior performance^a of AIC-boost compared to mboost_CV
	Number	Percentage
Uncorrelated	9/12	75 %
Toeplitz correlation	10/12	83 %
NSR = 0.2	8/8	100 %
NSR = 0.5	7/8	88 %
NSR = 1	4/8	50 %
INF = 5	8/12	67 %
INF = 20	11/12	92 %
k = 100	11/12	92 %
k = 500	8/12	67 %

^aSuperior performance is measured by means of the median FPR.
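
The percentages in Table A.2 follow directly from the reported counts. The short sketch below recomputes them and checks that the counts add up to the same total within each simulation parameter; the variable names and the half-up rounding rule are assumptions chosen to reproduce the printed values.

```python
import math

# (settings where AIC-boost is superior, settings in total), from Table A.2.
wins = {
    "cor": {"uncor": (9, 12), "toep": (10, 12)},
    "NSR": {"0.2": (8, 8), "0.5": (7, 8), "1": (4, 8)},
    "INF": {"5": (8, 12), "20": (11, 12)},
    "k": {"100": (11, 12), "500": (8, 12)},
}


def pct(num, den):
    """Percentage rounded half up to a whole number."""
    return math.floor(100 * num / den + 0.5)


percentages = {
    factor: {level: pct(*counts) for level, counts in by_level.items()}
    for factor, by_level in wins.items()
}

# Each factor partitions the same settings, so the sums should agree.
totals = {
    factor: (
        sum(c[0] for c in by_level.values()),
        sum(c[1] for c in by_level.values()),
    )
    for factor, by_level in wins.items()
}
print(percentages)
print(totals)
```

Every row block sums to 19 of 24 settings, so the marginal counts in the table are mutually consistent.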

Figure A.2:

False positive rate for new variable selection strategies by simulation setting.

Figure A.3:

Mean squared prediction error for new variable selection strategies by simulation setting.

Figure A.4:

True positive rate for new variable selection strategies by simulation setting.

Figure A.5:

Stopping iteration t* for new variable selection strategies by simulation setting.

A.3 Simulation results of Poisson distributed data

Figure A.6:

Poisson distribution: False positive rate for new variable selection strategies by simulation setting.

Figure A.7:

Poisson distribution: Mean squared prediction error for new variable selection strategies by simulation setting.

Figure A.8:

Poisson distribution: True positive rate for new variable selection strategies by simulation setting.

Figure A.9:

Poisson distribution: Stopping iteration t* for new variable selection strategies by simulation setting.

A.4 Simulation results of binomial distributed data

Figure A.10:

Binomial distribution: false positive rate for new variable selection strategies by simulation setting.

Figure A.11:

Binomial distribution: mean squared prediction error for new variable selection strategies by simulation setting.

Figure A.12:

Binomial distribution: true positive rate for new variable selection strategies by simulation setting.

Figure A.13:

Binomial distribution: stopping iteration t* for new variable selection strategies by simulation setting.

Figure A.14:

TPR and FPR by model size for a setting with P = 100 and INF = 20. Averages with 90 % quantiles.

Figure A.14 depicts the evolution of the TPR and FPR with increasing model size for the different algorithms in a specific setting (P = 100, INF = 20, uncorrelated covariates, T = 3000, ν and number of simulation runs as in Table A.1) with varying NSR. Note that the lines correspond to averages over the simulation runs and that the NSR cannot be directly compared between the distributions. The paths of the three compared methods appear very similar, with highly overlapping empirical 90 % quantile bands. The algorithms thus tend to behave very similarly in terms of FPR and TPR when a model of a specific size is chosen. In the Gaussian setting (second row) it becomes clear that AIC-boost stops at a smaller model size and possibly fits the included covariates, but does not add further FPs compared to the benchmark. This pattern can be seen in Figure 3 as well. In high NSR settings, AIC-boost refrains from including more FPs to reach TPR = 1, while mboost_CV and CV-boost do so.

Regarding the binomial distribution, one can also observe similar paths in terms of model size for the two methods. AIC-boost again seems to refrain longer from including new variables, since the paths differ in length, i.e., it updates non-zero coefficients rather than including new variables. However, keeping in mind that most of the settings use the final iteration as the stopping iteration t*, AIC-boost would possibly reach the same model sizes if fitted for a much larger number of iterations T. The observed differences in predictive performance for binomially distributed data arise from different estimated coefficient sizes and different stopping iterations.
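
For readers who want to reproduce quantities like those in Figure A.14, the following sketch computes the TPR and FPR of a single fitted model from the set of selected covariates. It assumes the usual definitions (share of informative covariates that are selected, and share of non-informative covariates that are selected); the function name and the example indices are hypothetical.

```python
def selection_rates(selected, informative, n_covariates):
    """Return (TPR, FPR) for one fitted model.

    selected      indices of covariates with non-zero coefficients
    informative   indices of truly informative covariates
    n_covariates  total number of candidate covariates
    """
    selected, informative = set(selected), set(informative)
    true_pos = len(selected & informative)
    false_pos = len(selected - informative)
    tpr = true_pos / len(informative)
    fpr = false_pos / (n_covariates - len(informative))
    return tpr, fpr


# Hypothetical model in a setting with 100 covariates and INF = 20:
# 15 informative and 4 non-informative covariates were selected.
informative = range(20)
selected = list(range(15)) + [40, 41, 42, 43]
tpr, fpr = selection_rates(selected, informative, n_covariates=100)
print(tpr, fpr)
```

Here the model has size 19, with TPR = 15/20 and FPR = 4/80; averaging such pairs over simulation runs at each model size would give curves of the kind described above.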

Appendix B: Application

B.1 Model table of application

Table B.3:

Coefficients of standardized covariates on log(age-standardized incidence rate) of different estimation models. Numbers in brackets correspond to the ordering of the absolute values of the coefficients.

Covariate	mboost_CV		LASSO		AIC-boost
Intercept	4.133		3.992		2.829
Age-standardized incidence rate per 100,000 person-years until 15.03	0.043	(5)	0.046	(6)	0.109	(6)
Unemployment rate of young persons (under 26 years) in 2017	−0.071	(4)	−0.055	(5)	−0.071	(8)
%Voter turnout (number of valid votes in the last Bundestag election) of all registered voters	0.023	(7)	0.036	(7)	0.033	(13)
%Roman-catholics in 2011	0.075	(3)	0.079	(4)	0.086	(7)
Persons in long-term care per 10,000 persons in 2017	−0.024	(6)	−0.033	(8)	−0.039	(12)
Premature mortality (deaths of persons younger than 65 years) per 1000 persons	−0.097	(2)	−0.088	(3)	−0.117	(5)
%Households with low income (1500€ per month) in all households in 2016	−0.015	(8)	−0.015	(9)	−0.012	(18)
Latitude	−0.122	(1)	−0.130	(2)	−0.130	(4)
%Older employed persons (55 years+) in all employed persons in 2017			−0.013	(10)
Ever 100+ inbound commuters from Tirschenreuth			0.141	(1)	0.920	(1)
%Change of number of persons at age 50–65 in 2012–2017					0.009	(19)
Sex ratio (females to males) at age 20–40 in 2017					0.021	(17)
Total sex ratio (females to males) in 2017					0.068	(9)
%Young employed persons in all young persons (under 26 years) in 2017					0.054	(11)
%Older employed persons in all older persons (55 years+) in 2011–2017					0.027	(15)
%Change of number of employed persons in 2012–2016					0.009	(19)
Average travel time to the next large-sized regional center (“Oberzentrum”)					−0.032	(14)
%Outbound commuters over a distance of 50 km + in all employed persons in 2017					−0.006	(20)
%Outbound commuters over a distance of 150 km + in all employed persons in 2017					−0.056	(10)
Ever 100+ inbound commuters from Hohenlohekreis					0.136	(3)
Ever 100+ inbound commuters from Aachen					0.027	(15)
Ever 100+ inbound commuters from Rosenheim (city + district)					0.194	(2)
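
The bracketed numbers in Table B.3 order the covariates by the absolute value of their coefficients. As a small check, the sketch below recomputes this ordering for the mboost_CV column; the shortened covariate labels are abbreviations introduced here for readability, and the intercept is left out.

```python
# mboost_CV coefficients from Table B.3 (intercept omitted, labels shortened).
mboost_cv = {
    "Incidence rate until 15.03": 0.043,
    "Youth unemployment rate": -0.071,
    "Voter turnout": 0.023,
    "Roman-catholics": 0.075,
    "Persons in long-term care": -0.024,
    "Premature mortality": -0.097,
    "Low-income households": -0.015,
    "Latitude": -0.122,
}

# Rank 1 = largest absolute coefficient.
ordered = sorted(mboost_cv, key=lambda name: abs(mboost_cv[name]), reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

for name in ordered:
    print(ranks[name], name, mboost_cv[name])
```

Because the covariates are standardized, this ordering can be read as a rough ranking of effect sizes within one model, but not across models with different intercepts and shrinkage.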

Received: 2023-04-24

Accepted: 2023-09-18

Published Online: 2023-11-27