Gradient and Newton boosting for classification and regression
Introduction
Boosting (Freund et al., 1996, Friedman, 2001, Friedman et al., 2000) refers to a class of supervised learning algorithms that enjoy high popularity in applied data science and research, among other things due to their high predictive accuracy (Chen & Guestrin, 2016). This is reflected in statements such as “[i]n general ‘boosted decision trees’ is regarded as the most effective off-the-shelf nonlinear learning method for a wide range of application problems” (Johnson & Zhang, 2013). Boosting iteratively adds so-called base learners to an ensemble of learners. Broadly speaking, there are three different ways of selecting a base learner in every boosting iteration: functional gradient descent, a functional version of Newton’s method, and a combination of the two. We refer to these three versions of boosting as gradient boosting, Newton boosting, and hybrid gradient-Newton boosting; see Section 2 for more information.
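To make the distinction concrete, the following minimal Python sketch (our illustration, not the implementation used in this article) shows one boosting iteration under each of the three update rules, using a scikit-learn regression tree as base learner; the functions `grad` and `hess`, which return the per-observation first and second derivatives of the loss with respect to the current fit, are assumed to be supplied by the user.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, f, grad, hess, version="newton", shrinkage=0.1, max_depth=3):
    """One boosting iteration for a twice-differentiable loss (illustrative sketch).

    grad(y, f) and hess(y, f) return per-observation first and second
    derivatives of the loss with respect to the current ensemble fit f.
    """
    g, h = grad(y, f), hess(y, f)
    tree = DecisionTreeRegressor(max_depth=max_depth)

    if version == "gradient":
        # Gradient boosting: fit the tree to the negative gradient and use
        # its fitted values directly as the update.
        tree.fit(X, -g)
        update = tree.predict(X)
    elif version == "newton":
        # Newton boosting: weighted least-squares fit to -g/h with weights h,
        # so both the tree structure and the leaf values use second-order information.
        tree.fit(X, -g / h, sample_weight=h)
        update = tree.predict(X)
    elif version == "hybrid":
        # Hybrid gradient-Newton boosting: learn the tree structure on the
        # negative gradient, then set each leaf value to a one-step Newton
        # estimate -sum(g)/sum(h) over the samples falling into that leaf.
        tree.fit(X, -g)
        leaves = tree.apply(X)
        update = np.zeros_like(f)
        for leaf in np.unique(leaves):
            idx = leaves == leaf
            update[idx] = -g[idx].sum() / h[idx].sum()
    else:
        raise ValueError(f"unknown version: {version}")

    return f + shrinkage * update
```

A full implementation would of course store the fitted trees so that new observations can be predicted; the sketch only illustrates how the three update rules differ in the way they use the gradients g and Hessians h.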
Tree-boosting is frequently used in expert and intelligent systems. Recent applications of boosting in expert systems include bankruptcy prediction and credit scoring (Djeundje et al., 2020, Moscatelli et al., 2020, Wang et al., 2014, Xia et al., 2017), network intrusion detection (Zhou, Mazzuchi, & Sarkani, 2020), epileptic seizure diagnosis (Al-Hadeethi, Abdulla, Diykh, Deo, & Green, 2020), early-stage disease symptom detection (Ahamad et al., 2020), cancer prognosis (Lu, Wang, & Yoon, 2019), and face re-identification (Soleymani, Granger, & Fumera, 2018).
In both methodological and applied research, the distinction between gradient and Newton boosting is often not made and/or it is not declared which version of boosting is used (e.g. Ahamad et al., 2020, Djeundje et al., 2020, Moscatelli et al., 2020). It is thus implicitly assumed that the difference is not important. For instance, the two recent popular boosting libraries LightGBM and TF Boosted Trees do not distinguish between gradient and Newton boosting in their companion articles (Ke et al., 2017, Ponomareva et al., 2017), and it is unclear to the reader which version is used. Similarly, Prokhorenkova, Gusev, Vorobev, Dorogush, and Gulin (2018) briefly mention in their article on CatBoost that the minimization for finding a boosting update can be done using the Newton method or with a gradient step, and then continue to write that “[b]oth methods are kinds of functional gradient descent”. However, Newton’s method is different from gradient descent. Further, Bühlmann and Hothorn (2007) state that for gradient boosting “an additional line search ... seems unnecessary for achieving a good estimator”. For trees as base learners, this additional line search is often done for each leaf separately using a Newton step (Friedman, 2001); that is, it corresponds to what we denote as hybrid gradient-Newton boosting, which differs from plain gradient boosting also in terms of predictive accuracy. Besides, particular software implementations of boosting such as XGBoost (Chen & Guestrin, 2016) are sometimes presented as if they were separate boosting algorithms (e.g. Ahamad et al., 2020, Djeundje et al., 2020, Xia et al., 2017) when, in fact, they implement a particular version of boosting.
The novel contributions of this article are the following. First, we show how gradient, Newton, and hybrid gradient-Newton boosting can be derived in a unified framework. Second, we systematically compare gradient, Newton, and hybrid gradient-Newton boosting on a large set of both real-world and simulated classification and regression datasets. In our experiments, using trees as base learners, we find that Newton boosting achieves lower test errors than both gradient boosting and hybrid gradient-Newton boosting, and that hybrid gradient-Newton boosting often has higher predictive accuracy than gradient boosting. Interestingly, we find that Newton boosting results in both lower in-sample training losses, which are essentially zero for most classification datasets, and lower out-of-sample test errors for most datasets. We also present evidence that the higher predictive accuracy is not due to a faster convergence speed of Newton boosting. In addition, we introduce a novel tuning parameter for Newton boosting with trees as base learners. We argue that this minimum equivalent sample size per leaf parameter is a natural and interpretable tuning parameter that is important for predictive accuracy. In particular, we present evidence that the unnormalized version of this tuning parameter currently adopted in popular software implementations such as XGBoost is difficult to tune and thus likely results in lower predictive accuracy.
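To illustrate why the unnormalized variant of such a parameter is hard to tune, recall that XGBoost's min_child_weight constrains the sum of the loss Hessians in a leaf. The short sketch below (illustrative numbers only; the exact normalization proposed in this article is given in Section 2 and not reproduced here) shows that this sum lives on a scale that depends on both the loss function and the current fit.

```python
import numpy as np

# Illustration of why an unnormalized "minimum sum of Hessians per leaf"
# threshold (e.g. min_child_weight in XGBoost) is hard to tune: its scale
# depends on the loss and on the current fit. Hypothetical numbers only.
rng = np.random.default_rng(0)
n_leaf = 50                               # samples falling into one leaf

# Squared-error loss: the Hessian is 1 for every sample,
# so the sum of Hessians simply counts samples.
h_l2 = np.ones(n_leaf)
print(h_l2.sum())                         # 50.0

# Logistic (Bernoulli) loss: the Hessian is p*(1-p), bounded by 0.25 and
# shrinking towards 0 as the predicted probabilities approach 0 or 1.
p_early = rng.uniform(0.3, 0.7, n_leaf)   # early iterations: p near 0.5
p_late = rng.uniform(0.01, 0.05, n_leaf)  # late iterations: confident fit
print((p_early * (1 - p_early)).sum())    # roughly 12
print((p_late * (1 - p_late)).sum())      # roughly 1.5

# The same threshold therefore corresponds to very different effective
# sample sizes across losses and across boosting iterations, which is the
# motivation for a normalized, sample-size-like parameter.
```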
The first boosting algorithms for classification, including the well-known AdaBoost algorithm, were introduced by Freund and Schapire (1995), Freund et al. (1996), and Schapire (1990). Later, several authors (Breiman, 1998, Breiman, 1999, Friedman, 2001, Friedman et al., 2000, Mason et al., 2000) presented the statistical view of boosting as a stagewise optimization approach. See Bühlmann and Hothorn (2007), Mayr et al. (2014a), Mayr et al. (2014b), Schapire (2003), and Schapire and Freund (2012) for reviews on boosting algorithms in both the machine learning and statistical literature.
To the best of our knowledge, a systematic comparison of the predictive accuracy of gradient, Newton, and hybrid gradient-Newton boosting for various choices of loss functions, including regression and classification losses, has not been done so far. The L_K_TreeBoost algorithm (Friedman, 2001) is compared in Friedman (2001) with K-class LogitBoost (Friedman et al., 2000) for classification with five classes in a simulation study for one type of random functions. In our terminology, L_K_TreeBoost is a version of hybrid gradient-Newton boosting, and K-class LogitBoost corresponds to Newton boosting for the Bernoulli likelihood. Friedman (2001) finds that the algorithms perform “nearly the same” with “LogitBoost perhaps having a slight advantage”. In addition, it is mentioned that “it is likely that when the shrinkage parameter is carefully tuned for each of the three methods [L_K_TreeBoost, K-class LogitBoost, AdaBoost], there would be little performance differential between them”. Our empirical evidence is not in line with this statement. Saberian, Masnadi-Shirazi, and Vasconcelos (2011) also briefly compare variants of boosting with gradient and second-order updates using three different binary classification datasets and Haar wavelets as base learners. However, their boosting approach differs from the one usually adopted in practice and research in the sense that they assume normed base learners, find base learners as maxima of inner products of gradients and base learners, and then have to perform an additional line search to find the step size. Further, tuning parameters such as the learning rate and the number of boosting iterations are not chosen using cross-validation, and only 25 boosting iterations are performed. Nonetheless, they come to the same conclusion as we do, i.e., they find that their version of second-order boosting performs better than gradient boosting. The closest to our empirical analysis are Li (2010) and Zheng and Liu (2012). Li (2010) compares Newton boosting (“logitboost”) with hybrid gradient-Newton boosting (“mart”) for several multi-class classification datasets and also finds that Newton boosting results in lower test errors than hybrid gradient-Newton boosting. Further, Zheng and Liu (2012) compare gradient and Newton boosting when using the probit link function in a logistic regression model and find that Newton boosting results in lower error rates than gradient boosting for several classification applications including face detection, cancer classification, and handwritten digit recognition. However, both Li (2010) and Zheng and Liu (2012) consider only specific classification tasks, tuning parameters are not chosen using validation data in their experiments, and it is not investigated whether the observed differences are statistically significant. Finally, Sun, Zhang, and Zhou (2014) compare Newton and gradient boosting for binary classification using the logistic loss. However, their focus is on the convergence rate, and their empirical comparison considers only training errors.
Section snippets
The statistical view of boosting: three approaches for stagewise optimization
In this section, we present the statistical view of boosting as finding the minimizer of a risk functional in a function space using a stagewise, or greedy, optimization approach. We distinguish between gradient and Newton boosting as well as a hybrid version of the two and show how these can be presented in a unified framework. Note that these boosting algorithms have been proposed in prior research (Breiman, 1998, Breiman, 1999, Friedman, 2001, Friedman et al., 2000, Mason et al., 2000,
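For reference, a standard formulation of the three update rules is the following (notation ours, following Friedman, 2001, and Friedman et al., 2000; it may differ from the paper's exact notation):

```latex
% Notation (ours): n observations (x_i, y_i), loss L, ensemble F_m after m iterations.
% Empirical risk and per-observation derivatives at the current ensemble F_{m-1}:
\[
  R(F) = \sum_{i=1}^{n} L\bigl(y_i, F(x_i)\bigr), \qquad
  g_i = \frac{\partial L(y_i, F)}{\partial F}\bigg|_{F = F_{m-1}(x_i)}, \qquad
  h_i = \frac{\partial^2 L(y_i, F)}{\partial F^2}\bigg|_{F = F_{m-1}(x_i)}.
\]
% Gradient boosting: least-squares fit of the base learner f_m to the negative gradient.
\[
  f_m = \operatorname*{arg\,min}_{f \in \mathcal{F}} \sum_{i=1}^{n} \bigl(-g_i - f(x_i)\bigr)^2.
\]
% Newton boosting: weighted least-squares fit to -g_i/h_i with weights h_i,
% i.e. a second-order approximation of the risk around F_{m-1}.
\[
  f_m = \operatorname*{arg\,min}_{f \in \mathcal{F}} \sum_{i=1}^{n} h_i \Bigl(-\frac{g_i}{h_i} - f(x_i)\Bigr)^{2}.
\]
% In both cases the ensemble is updated with a shrinkage (learning-rate) parameter nu:
\[
  F_m = F_{m-1} + \nu \, f_m.
\]
% Hybrid gradient-Newton boosting determines the tree structure as in gradient
% boosting and then sets each terminal-node value to a one-step Newton estimate.
```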
Empirical evaluation and comparison
In the following, we compare the three different boosting algorithms presented in the previous section for different loss functions on various datasets using regression trees as base learners. Specifically, we use the CART version of Breiman et al. (1984) with the mean squared error as splitting criterion. Note that we use trees (Breiman et al., 1984) as base learners as these are the most
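As one concrete example of the ingredients entering such a comparison, the following sketch (again ours, not the paper's experimental code) gives the per-observation gradient and Hessian of the Bernoulli (logistic) loss, which can be plugged into the one-step sketch shown in the introduction.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def logistic_grad(y, f):
    # dL/dF for the negative Bernoulli log-likelihood
    # L(y, F) = -y*F + log(1 + exp(F)), with y in {0, 1} and F on the log-odds scale.
    return sigmoid(f) - y

def logistic_hess(y, f):
    # d^2 L / dF^2 = p * (1 - p), where p = sigmoid(F).
    p = sigmoid(f)
    return p * (1.0 - p)
```

For the squared-error loss, the corresponding derivatives are f - y and 1, in which case the three update rules coincide; differences between gradient, Newton, and hybrid gradient-Newton boosting only arise for losses with non-constant second derivatives.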
Does Newton boosting show higher predictive accuracy than gradient boosting due to faster convergence?
In the previous sections, we have empirically shown that Newton boosting often results in higher predictive accuracy than gradient and also hybrid gradient-Newton boosting. A potential explanation for the observed phenomenon is that Newton boosting converges faster than both gradient and hybrid gradient-Newton boosting, and that hybrid boosting also converges faster than gradient boosting. This, in turn, could allow for using a smaller shrinkage parameter, and smaller shrinkage parameters
Conclusions
We compare gradient and Newton boosting as well as a hybrid variant of the two with trees as base learners on a wide range of classification and regression datasets. Our empirical results show that Newton boosting outperforms gradient and often also hybrid gradient-Newton boosting in terms of predictive accuracy. Further, we present empirical evidence that this outperformance is not a consequence of a faster convergence speed of Newton boosting. Interestingly, Newton boosting converges to lower
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was partially supported by the Swiss Innovation Agency - Innosuisse (25746.1 PFES-ES). We are grateful to Christoph Hirnschall and Torsten Hothorn for valuable suggestions and discussions.
References (64)
- A machine learning model to identify early stage symptoms of SARS-CoV-2 infected patients. Expert Systems with Applications (2020).
- Adaptive boost LS-SVM classification approach for time-series signal classification in epileptic seizure diagnosis applications. Expert Systems with Applications (2020).
- An analysis of boosted ensembles of binary fuzzy decision trees. Expert Systems with Applications (2020).
- Failure detection in robotic arms using statistical modeling, machine learning and hybrid gradient boosting. Measurement (2019).
- Data classification with binary response through the boosting algorithm and logistic regression. Expert Systems with Applications (2017).
- Stochastic gradient boosting. Computational Statistics & Data Analysis (2002).
- HBoost: A heterogeneous ensemble classifier based on the boosting method and entropy measurement. Expert Systems with Applications (2020).
- A dynamic gradient boosting machine using genetic optimizer for practical breast cancer prognosis. Expert Systems with Applications (2019).
- Corporate default forecasting with machine learning. Expert Systems with Applications (2020).
- Boosting additive models using component-wise P-splines. Computational Statistics & Data Analysis (2008).
- Grabit: Gradient tree-boosted Tobit models for default prediction. Journal of Banking & Finance.
- Progressive boosting for class imbalance and its application to face re-identification. Expert Systems with Applications.
- An improved boosting based on feature selection for corporate bankruptcy prediction. Expert Systems with Applications.
- A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications.
- Functional gradient ascent for probit regression. Pattern Recognition.
- M-AdaBoost-A based ensemble system for network intrusion detection. Expert Systems with Applications.
- Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate.
- To understand deep learning we need to understand kernel learning.
- No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research.
- Arcing classifiers. Annals of Statistics.
- Prediction games and arcing algorithms. Neural Computation.
- Classification and regression trees.
- Boosting algorithms: Regularization, prediction and model fitting. Statistical Science.
- Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association.
- Boosting for high-dimensional linear models. The Annals of Statistics.
- XGBoost: A scalable tree boosting system.
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
- Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.
- Enhancing credit scoring with alternative data. Expert Systems with Applications.
- Least angle regression. The Annals of Statistics.
- Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association.
- A decision-theoretic generalization of on-line learning and an application to boosting.