Gradient and Newton boosting for classification and regression
Introduction
Boosting (Freund et al., 1996, Friedman, 2001, Friedman et al., 2000) refers to a class of supervised learning algorithms that enjoy high popularity in applied data science and research, among other things due to their high predictive accuracy (Chen & Guestrin, 2016). This is reflected in statements such as “[i]n general ‘boosted decision trees’ is regarded as the most effective off-the-shelf nonlinear learning method for a wide range of application problems” (Johnson & Zhang, 2013). Boosting iteratively adds so-called base learners to an ensemble of learners. Broadly speaking, there are three different ways of selecting a base learner in every boosting iteration: functional gradient descent, a functional version of Newton’s method, and a combination of the two. We refer to these three versions of boosting as gradient boosting, Newton boosting, and hybrid gradient-Newton boosting; see Section 2 for more information.
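To make the distinction concrete, the following minimal Python sketch (our illustration, not the implementation used in this article) shows one boosting iteration under each of the three update rules, using a scikit-learn regression tree as base learner; the functions `grad` and `hess`, which return the per-observation first and second derivatives of the loss with respect to the current fit, are assumed to be supplied by the user.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, f, grad, hess, version="newton", shrinkage=0.1, max_depth=3):
    """One boosting iteration for a twice-differentiable loss (illustrative sketch).

    grad(y, f) and hess(y, f) return per-observation first and second
    derivatives of the loss with respect to the current ensemble fit f.
    """
    g, h = grad(y, f), hess(y, f)
    tree = DecisionTreeRegressor(max_depth=max_depth)

    if version == "gradient":
        # Gradient boosting: fit the tree to the negative gradient and use
        # its fitted values directly as the update.
        tree.fit(X, -g)
        update = tree.predict(X)
    elif version == "newton":
        # Newton boosting: weighted least-squares fit to -g/h with weights h,
        # so both the tree structure and the leaf values use second-order information.
        tree.fit(X, -g / h, sample_weight=h)
        update = tree.predict(X)
    elif version == "hybrid":
        # Hybrid gradient-Newton boosting: learn the tree structure on the
        # negative gradient, then set each leaf value to a one-step Newton
        # estimate -sum(g)/sum(h) over the samples falling into that leaf.
        tree.fit(X, -g)
        leaves = tree.apply(X)
        update = np.zeros_like(f)
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            update[idx] = -g[idx].sum() / h[idx].sum()
    else:
        raise ValueError(f"unknown version: {version}")

    return f + shrinkage * update
```

A full implementation would of course store the fitted trees so that new observations can be predicted; the sketch only illustrates how the three update rules differ in the way they use the gradients g and Hessians h.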
Tree-boosting is frequently used in expert and intelligent systems. Recent applications of boosting in expert systems include bankruptcy prediction and credit scoring (Djeundje et al., 2020, Moscatelli et al., 2020, Wang et al., 2014, Xia et al., 2017), network intrusion detection (Zhou, Mazzuchi, & Sarkani, 2020), epileptic seizure diagnosis (Al-Hadeethi, Abdulla, Diykh, Deo, & Green, 2020), early-stage disease symptom detection (Ahamad et al., 2020), cancer prognosis (Lu, Wang, & Yoon, 2019), and face re-identification (Soleymani, Granger, & Fumera, 2018).
In both methodological and applied research, the distinction between gradient and Newton boosting is often not made and/or it is not declared which version of boosting is used (e.g. Ahamad et al., 2020, Djeundje et al., 2020, Moscatelli et al., 2020). It is thus implicitly assumed that the difference is not important. For instance, the two recent popular boosting libraries LightGBM and TF Boosted Trees do not distinguish between gradient and Newton boosting in their companion articles (Ke et al., 2017, Ponomareva et al., 2017), and it is unclear to the reader which version is used. Similarly, Prokhorenkova, Gusev, Vorobev, Dorogush, and Gulin (2018) briefly mention in their article on CatBoost that the minimization for finding a boosting update can be done using the Newton method or with a gradient step, and then continue to write that “[b]oth methods are kinds of functional gradient descent”. However, Newton’s method is different from gradient descent. Further, Bühlmann and Hothorn (2007) state that for gradient boosting “an additional line search ... seems unnecessary for achieving a good estimator”. For trees as base learners, this additional line search is often done for each leaf separately using a Newton step (Friedman, 2001); that is, it corresponds to what we denote as hybrid gradient-Newton boosting, which differs from plain gradient boosting also in terms of predictive accuracy. Besides, particular software implementations of boosting such as XGBoost (Chen & Guestrin, 2016) are sometimes presented as if they were separate boosting algorithms (e.g. Ahamad et al., 2020, Djeundje et al., 2020, Xia et al., 2017) when, in fact, they implement a particular version of boosting.
The novel contributions of this article are the following. First, we show how gradient, Newton, and hybrid gradient-Newton boosting can be derived in a unified framework. Second, we systematically compare gradient, Newton, and hybrid gradient-Newton boosting on a large set of both real-world and simulated classification and regression datasets. In our experiments, using trees as base learners, we find that Newton boosting achieves lower test errors than both gradient boosting and hybrid gradient-Newton boosting, and that hybrid gradient-Newton boosting often has higher predictive accuracy than gradient boosting. Interestingly, we find that Newton boosting results in both lower in-sample training losses, which are essentially zero for most classification datasets, and lower out-of-sample test errors for most datasets. We also present evidence that the higher predictive accuracy is not due to a faster convergence speed of Newton boosting. In addition, we introduce a novel tuning parameter for Newton boosting with trees as base learners. We argue that this minimum equivalent sample size per leaf parameter is a natural and interpretable tuning parameter that is important for predictive accuracy. In particular, we present evidence that the unnormalized version of this tuning parameter currently adopted in popular software implementations such as XGBoost is difficult to tune and thus likely results in lower predictive accuracy.
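To illustrate why the unnormalized variant of such a parameter is hard to tune, recall that XGBoost's min_child_weight constrains the sum of the loss Hessians in a leaf. The short sketch below (illustrative numbers only; the exact normalization proposed in this article is given in Section 2 and not reproduced here) shows that this sum lives on a scale that depends on both the loss function and the current fit.

```python
import numpy as np

# Illustration of why an unnormalized "minimum sum of Hessians per leaf"
# threshold (e.g. min_child_weight in XGBoost) is hard to tune: its scale
# depends on the loss and on the current fit. Hypothetical numbers only.
rng = np.random.default_rng(0)
n_leaf = 50                               # samples falling into one leaf

# Squared-error loss: the Hessian is 1 for every sample,
# so the sum of Hessians simply counts samples.
h_l2 = np.ones(n_leaf)
print(h_l2.sum())                         # 50.0

# Logistic (Bernoulli) loss: the Hessian is p*(1-p), bounded by 0.25 and
# shrinking towards 0 as the predicted probabilities approach 0 or 1.
p_early = rng.uniform(0.3, 0.7, n_leaf)   # early iterations: p near 0.5
p_late = rng.uniform(0.01, 0.05, n_leaf)  # late iterations: confident fit
print((p_early * (1 - p_early)).sum())    # roughly 12
print((p_late * (1 - p_late)).sum())      # roughly 1.5

# The same threshold therefore corresponds to very different effective
# sample sizes across losses and across boosting iterations, which is the
# motivation for a normalized, sample-size-like parameter.
```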
The first boosting algorithms for classification, including the well-known AdaBoost algorithm, were introduced by Freund and Schapire (1995), Freund et al. (1996), and Schapire (1990). Later, several authors (Breiman, 1998, Breiman, 1999, Friedman, 2001, Friedman et al., 2000, Mason et al., 2000) presented the statistical view of boosting as a stagewise optimization approach. See Bühlmann and Hothorn (2007), Mayr et al. (2014a), Mayr et al. (2014b), Schapire (2003), and Schapire and Freund (2012) for reviews on boosting algorithms in both the machine learning and statistical literature.
To the best of our knowledge, a systematic comparison of the predictive accuracy of gradient, Newton, and hybrid gradient-Newton boosting for various choices of loss functions, including regression and classification losses, has not been done so far. The L_K_TreeBoost algorithm (Friedman, 2001) is compared in Friedman (2001) with K-class LogitBoost (Friedman et al., 2000) for classification with five classes in a simulation study for one type of random functions. In our terminology, L_K_TreeBoost is a version of hybrid gradient-Newton boosting, and K-class LogitBoost corresponds to Newton boosting for the Bernoulli likelihood. Friedman (2001) finds that the algorithms perform “nearly the same” with “LogitBoost perhaps having a slight advantage”. In addition, it is mentioned that “it is likely that when the shrinkage parameter is carefully tuned for each of the three methods [L_K_TreeBoost, K-class LogitBoost, AdaBoost], there would be little performance differential between them”. Our empirical evidence is not in line with this statement. Saberian, Masnadi-Shirazi, and Vasconcelos (2011) also briefly compare variants of boosting with gradient and second-order updates using three different binary classification datasets and Haar wavelets as base learners. However, their boosting approach differs from the one usually adopted in practice and research in the sense that they assume normed base learners, find base learners as maxima of inner products of gradients and base learners, and then have to perform an additional line search to find the step size. Further, tuning parameters such as the learning rate and the number of boosting iterations are not chosen using cross-validation, and only 25 boosting iterations are performed. Nonetheless, they come to the same conclusion as we do, i.e., they find that their version of second-order boosting performs better than gradient boosting. The closest to our empirical analysis are Li (2010) and Zheng and Liu (2012). Li (2010) compares Newton boosting (“logitboost”) with hybrid gradient-Newton boosting (“mart”) for several multi-class classification datasets and also finds that Newton boosting results in lower test errors than hybrid gradient-Newton boosting. Further, Zheng and Liu (2012) compare gradient and Newton boosting when using the probit link function in a logistic regression model and find that Newton boosting results in lower error rates than gradient boosting for several classification applications including face detection, cancer classification, and handwritten digit recognition. However, both Li (2010) and Zheng and Liu (2012) consider only specific classification tasks, tuning parameters are not chosen using validation data in their experiments, and it is not investigated whether the observed differences are statistically significant. Finally, Sun, Zhang, and Zhou (2014) compare Newton and gradient boosting for binary classification using the logistic loss. However, their focus is on the convergence rate, and their empirical comparison considers only training errors.
Section snippets
The statistical view of boosting: three approaches for stagewise optimization
In this section, we present the statistical view of boosting as finding the minimizer of a risk functional in a function space using a stagewise, or greedy, optimization approach. We distinguish between gradient and Newton boosting as well as a hybrid version of the two and show how these can be presented in a unified framework. Note that these boosting algorithms have been proposed in prior research (Breiman, 1998, Breiman, 1999, Friedman, 2001, Friedman et al., 2000, Mason et al., 2000,
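For reference, a standard formulation of the three update rules is the following (notation ours, following Friedman, 2001, and Friedman et al., 2000; it may differ from the paper's exact notation):

```latex
% Notation (ours): n observations (x_i, y_i), loss L, ensemble F_m after m iterations.
% Empirical risk and per-observation derivatives at the current ensemble F_{m-1}:
\[
  R(F) = \sum_{i=1}^{n} L\bigl(y_i, F(x_i)\bigr), \qquad
  g_i = \frac{\partial L(y_i, F)}{\partial F}\bigg|_{F = F_{m-1}(x_i)}, \qquad
  h_i = \frac{\partial^2 L(y_i, F)}{\partial F^2}\bigg|_{F = F_{m-1}(x_i)}.
\]
% Gradient boosting: least-squares fit of the base learner f_m to the negative gradient.
\[
  f_m = \operatorname*{arg\,min}_{f \in \mathcal{F}} \sum_{i=1}^{n} \bigl(-g_i - f(x_i)\bigr)^2.
\]
% Newton boosting: weighted least-squares fit to -g_i/h_i with weights h_i,
% i.e. a second-order approximation of the risk around F_{m-1}.
\[
  f_m = \operatorname*{arg\,min}_{f \in \mathcal{F}} \sum_{i=1}^{n} h_i \Bigl(-\frac{g_i}{h_i} - f(x_i)\Bigr)^{2}.
\]
% In both cases the ensemble is updated with a shrinkage (learning-rate) parameter nu:
\[
  F_m = F_{m-1} + \nu \, f_m.
\]
% Hybrid gradient-Newton boosting determines the tree structure as in gradient
% boosting and then sets each terminal-node value to a one-step Newton estimate.
```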
Empirical evaluation and comparison
In the following, we compare the three different boosting algorithms presented in the previous section for different loss functions on various datasets using regression trees as base learners. Specifically, we use the CART version of Breiman et al. (1984) with the mean squared error as splitting criterion. Note that we use trees (Breiman et al., 1984) as base learners as these are the most
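As one concrete example of the ingredients entering such a comparison, the following sketch (again ours, not the paper's experimental code) gives the per-observation gradient and Hessian of the Bernoulli (logistic) loss, which can be plugged into the one-step sketch shown in the introduction.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def logistic_grad(y, f):
    # dL/dF for the negative Bernoulli log-likelihood
    # L(y, F) = -y*F + log(1 + exp(F)), with y in {0, 1} and F on the log-odds scale.
    return sigmoid(f) - y

def logistic_hess(y, f):
    # d^2 L / dF^2 = p * (1 - p), where p = sigmoid(F).
    p = sigmoid(f)
    return p * (1.0 - p)
```

For the squared-error loss, the corresponding derivatives are f - y and 1, in which case the three update rules coincide; differences between gradient, Newton, and hybrid gradient-Newton boosting only arise for losses with non-constant second derivatives.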
Does Newton boosting show higher predictive accuracy than gradient boosting due to faster convergence?
In the previous sections, we have empirically shown that Newton boosting often results in higher predictive accuracy than gradient and also hybrid gradient-Newton boosting. A potential explanation for the observed phenomenon is that Newton boosting converges faster than both gradient and hybrid gradient-Newton boosting, and that hybrid boosting also converges faster than gradient boosting. This, in turn, could allow for using a smaller shrinkage parameter, and smaller shrinkage parameters
Conclusions
We compare gradient and Newton boosting as well as a hybrid variant of the two with trees as base learners on a wide range of classification and regression datasets. Our empirical results show that Newton boosting outperforms gradient and often also hybrid gradient-Newton boosting in terms of predictive accuracy. Further, we present empirical evidence that this outperformance is not a consequence of a faster convergence speed of Newton boosting. Interestingly, Newton boosting converges to lower
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was partially supported by the Swiss Innovation Agency - Innosuisse (25746.1 PFES-ES). We are grateful to Christoph Hirnschall and Torsten Hothorn for valuable suggestions and discussions.
References (64)
- A machine learning model to identify early stage symptoms of SARS-CoV-2 infected patients. Expert Systems with Applications (2020).
- Adaptive boost LS-SVM classification approach for time-series signal classification in epileptic seizure diagnosis applications. Expert Systems with Applications (2020).
- An analysis of boosted ensembles of binary fuzzy decision trees. Expert Systems with Applications (2020).
- Failure detection in robotic arms using statistical modeling, machine learning and hybrid gradient boosting. Measurement (2019).
- Data classification with binary response through the boosting algorithm and logistic regression. Expert Systems with Applications (2017).
- Stochastic gradient boosting. Computational Statistics & Data Analysis (2002).
- HBoost: A heterogeneous ensemble classifier based on the boosting method and entropy measurement. Expert Systems with Applications (2020).
- A dynamic gradient boosting machine using genetic optimizer for practical breast cancer prognosis. Expert Systems with Applications (2019).
- Corporate default forecasting with machine learning. Expert Systems with Applications (2020).
- Boosting additive models using component-wise P-splines. Computational Statistics & Data Analysis (2008).
- Grabit: Gradient tree-boosted Tobit models for default prediction. Journal of Banking & Finance.
- Progressive boosting for class imbalance and its application to face re-identification. Expert Systems with Applications.
- An improved boosting based on feature selection for corporate bankruptcy prediction. Expert Systems with Applications.
- A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications.
- Functional gradient ascent for probit regression. Pattern Recognition.
- M-AdaBoost-A based ensemble system for network intrusion detection. Expert Systems with Applications.
- Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate.
- To understand deep learning we need to understand kernel learning.
- No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research.
- Arcing classifiers. Annals of Statistics.
- Prediction games and arcing algorithms. Neural Computation.
- Classification and regression trees.
- Boosting algorithms: Regularization, prediction and model fitting. Statistical Science.
- Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association.
- Boosting for high-dimensional linear models. The Annals of Statistics.
- XGBoost: A scalable tree boosting system.
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
- Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.
- Enhancing credit scoring with alternative data. Expert Systems with Applications.
- Least angle regression. The Annals of Statistics.
- Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association.
- A decision-theoretic generalization of on-line learning and an application to boosting.