Unified Robust Boosting

Boosting is a popular algorithm in supervised machine learning with wide applications in regression and classification problems. It combines a sequence of regression trees to obtain accurate predictions. In the presence of outliers, however, traditional boosting may show inferior results since the algorithm optimizes a convex loss function. Recent literature has proposed boosting algorithms that optimize robust nonconvex loss functions, but these methods lack a weighted estimation indicating the outlier status of the observations. This article proposes an iteratively reweighted boosting algorithm that combines robust loss optimization with weighted estimation and is conveniently constructed with existing software. The output includes the weights as a valuable diagnostic of the outlier status of the observations. For practitioners interested in boosting, the new algorithm can be interpreted as a way to tune observation weights, which can lead to a more accurate model. The R package irboost is demonstrated with publicly available data in various robust boosting approaches to generalized linear models, classification and survival data analysis.


Introduction
Boosting is a powerful supervised machine learning algorithm. As an ensemble method, boosting combines many weak learners to generate a strong prediction. As a functional descent method, boosting has wide applications in regression and classification problems. Friedman (2001) and Friedman, Hastie, and Tibshirani (2000) discussed boosting for a variety of convex loss functions. Boosting can be utilized to fit a variety of models with different base learners, including linear least squares, smoothing splines and regression trees (Bühlmann and Hothorn 2007; Wang 2018b). To deal with outliers, robust estimation and boosting can jointly provide more accurate estimation. Wang (2018a,b) proposed robust functional gradient boosting for nonconvex loss functions. These methods applied the majorization-minimization (MM) scheme, an extension of the popular expectation-maximization (EM) algorithm in statistics. However, they lack weights as an indication of the outlier status of observations, where small weights are assigned to observations that deviate from the underlying model. In classical robust estimation, the weights are derived from robust loss functions, such as the Huber loss.
There is some recent progress on how to generate weights from robust loss functions in more complex problems. Wang (2020) proposed a new framework of robust estimation by reducing the weight of an observation that leads to a large loss. The author initiated a unified class of robust loss functions, the concave convex (CC) family, and introduced the iteratively reweighted convex optimization (IRCO) that minimizes the loss functions in the CC family. The CC family includes traditional robust loss functions such as the Huber loss, the robust hinge loss for the support vector machine, and the robust exponential family for generalized linear models. The IRCO can be conveniently implemented with existing methods and software.
In this article, we integrate the IRCO and boosting into IRBoost for the CC family. This functional optimization is more general than the parameter-based estimation in Wang (2020). For instance, IRBoost permits function spaces derived from regression trees. Unlike previous boosting applications, including Wang (2018a,b), the major novelty is that the IRBoost framework provides weights that help identify outliers. We illustrate the proposed algorithm through the R package irboost with applications to the robust exponential family, including regression, logistic regression and Poisson regression. Another illustration is robust survival regression with the accelerated failure time model. The package also implements IRBoost for Gamma regression, Tweedie regression, hinge classification and multinomial logistic regression.

CC-family function estimation
To unify robust estimation, Wang (2020) proposed the concave convex family with functions Γ = g ∘ s satisfying the following conditions:

i. g is a nondecreasing closed concave function whose domain is the range of the function s;
ii. ∂(−g(z)) is nonempty and bounded for all z in the range of s;
iii. s is convex on R.
Here ∂(−g(z)) denotes the subdifferential of the function −g at the point z, which is equivalent to the derivative {−g′(z)} when it exists. Examples of the concave component are listed in Table 1. Note that the tcave is not differentiable everywhere, but is subdifferentiable. The parameter σ controls the level of robustness a model allows, and a smaller value leads to more robust estimation. See Wang (2020) for details. The convex component includes common loss functions in regression and classification, such as the squared loss s(u) = u² and the negative log-likelihood functions of the exponential family adopted by generalized linear models. Other examples include the negative log-likelihood functions for multinomial logistic regression, Tweedie regression, and the accelerated failure time model for time-to-event data subject to censoring (Barnwal, Cho, and Hocking 2020). The requirement that z ≥ 0 on the domain of g may be relaxed for some concave functions g in Table 1, although many commonly used loss functions have a range of nonnegative values. However, s < 0 does occur, for instance, when s is the negative log-likelihood of the Gamma distribution. In this case, a nonnegative value is easily obtained by subtracting a data-dependent constant, as described below.
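To make the composition Γ = g ∘ s concrete, the sketch below (Python for illustration; the formulas are language-agnostic) pairs the squared loss with a truncated concave component g(z) = min(z, σ). This truncated form is an assumption chosen for simplicity, meant to mimic, not reproduce, the tcave of Table 1.

```python
import numpy as np

# Composition Gamma = g(s(u)) from the CC family.
# Assumed concave component for illustration: a truncation
#   g(z) = min(z, sigma),
# which is nondecreasing and concave on z >= 0 and subdifferentiable
# (not differentiable) at z = sigma.
def g_trunc(z, sigma):
    return np.minimum(z, sigma)

# Convex component: squared error loss s(u) = u^2.
def s_square(u):
    return u ** 2

def cc_loss(residual, sigma):
    return g_trunc(s_square(residual), sigma)

residuals = np.array([0.1, 0.5, 2.0, 10.0])   # the last value mimics an outlier
plain = s_square(residuals)
robust = cc_loss(residuals, sigma=1.0)
# The composed loss agrees with the squared loss for small residuals but
# caps the contribution of large residuals at sigma, so a single outlier
# cannot dominate the objective.
```

A smaller σ truncates earlier, which matches the remark above that a smaller σ yields more robust estimation.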
Given a set of observations (x_i, y_i), i = 1, ..., n, where y_i ∈ R and x_i = (x_i1, ..., x_ip) ∈ R^p, denote by Ω the linear span of a set H of base learners, including regression trees and linear functions. The robust function estimation problem is

min_{f ∈ Ω} Σ_{i=1}^n ℓ(y_i, f(x_i)), (1)

where ℓ is a CC-family member, ℓ = g ∘ s = g(s(u)). With some abuse of notation, s(u) is also used to denote s(y, f(x)). For instance, s(u) = s(y − f(x)) in regression, and s(u) = s(yf(x)) in margin-based classification with y ∈ {−1, +1}. If s(y, f(x)) < 0, a simple remedy is to subtract some constant C such that s(y, f(x)) − C ≥ 0. For the exponential family, s(y_i, f(x_i)) ≥ s(y_i, y_i) holds since s(y_i, y_i) is the negative log-likelihood value of a saturated model. Hence, the desired nonnegative loss can be obtained from s(y_i, f(x_i)) − min_{i=1,...,n} s(y_i, y_i) ≥ 0. To simplify notation, f is often used in place of f(x).
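The saturated-model shift can be checked numerically. The sketch below (Python for illustration) assumes the Poisson negative log-likelihood s(y, f) = exp(f) − y·f, with f the log-mean and the log(y!) constant dropped; it shows that the raw loss can be negative while the shifted loss is nonnegative.

```python
import numpy as np

# Assumed Poisson negative log-likelihood (log(y!) constant dropped):
#   s(y, f) = exp(f) - y * f,  with f the log-mean.
def s_pois(y, f):
    return np.exp(f) - y * f

y = np.array([1.0, 3.0, 7.0])
f = np.full(3, np.log(2.0))          # a common fitted log-mean for all three

# Saturated-model loss s(y_i, y_i), attained at f = log(y_i):
s_sat = s_pois(y, np.log(y))

raw = s_pois(y, f)                   # can be negative (e.g., for y = 3 or 7)
shifted = raw - s_sat.min()          # nonnegative, as required for g
```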
The robust function estimation problem (1) can be solved by Algorithm 1, where step 4 involves ϕ, the Fenchel conjugate of −g. Denote ρ(f^(k)) = Σ_{i=1}^n ℓ(y_i, f^(k)(x_i)), where the f^(k) are generated in the algorithm. We have the following convergence result for the IRBoost.
Theorem 1. Suppose that g is a concave component in the CC-family and g is bounded below. The loss function values ρ(f^(k)) generated by Algorithm 1 are nonincreasing and converge.
This result is a generalization of Theorem 4 in Wang (2020), where the function is the linear predictor; here we study more broadly defined function spaces. On the other hand, if H is a space of linear models, Theorem 1 indeed coincides with the results in Wang (2020). Algorithm 1 is a majorization-minimization algorithm, and the proof follows the same argument as Theorem 4 in Wang (2020); hence only a sketch of the proof is given below.

Algorithm 1 IRBoost
1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, concave component g with parameter σ, convex component s, starting point f^(0) and iteration count K.
For a differentiable concave function g, the first-order condition is

g(u) ≤ g(v) + g′(v)(u − v), ∀u, v. (3)

Substituting u with s(u) and v with s(v) in (3), we get

g(s(u)) ≤ g(s(v)) + g′(s(v))(s(u) − s(v)). (4)

Setting u = (y_i, f(x_i)) and v = (y_i, f^(k)(x_i)) in (4), and summing up for i = 1, ..., n, we get

Σ_{i=1}^n g(s(y_i, f(x_i))) ≤ Σ_{i=1}^n {g(s(y_i, f^(k)(x_i))) + g′(s(y_i, f^(k)(x_i)))[s(y_i, f(x_i)) − s(y_i, f^(k)(x_i))]}. (5)

Denoting Q(f | f^(k)) the right-hand side of (5), the following inequalities hold:

ρ(f^(k+1)) ≤ Q(f^(k+1) | f^(k)) ≤ Q(f^(k) | f^(k)) = ρ(f^(k)). (6)

Alternatively, the majorization (6) can be constructed from a different surrogate function derived from the Fenchel convex conjugate of −g.
Step 4 computes the weights in two different ways, corresponding to the two surrogate functions. The solutions can be shown to be the same based on the Fenchel-Moreau theorem.
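The weight update and the weighted estimation step can be sketched on the simplest case, robust location estimation. The smooth concave component g(z) = σ log(1 + z/σ) below is assumed for illustration (not necessarily a Table 1 member); step 4 reduces to w_i = g′(s_i) and step 5 to a weighted least squares fit.

```python
import numpy as np

# IRCO/MM sketch for robust location estimation: minimize
#   rho(theta) = sum_i g(s(y_i - theta)),  with s(u) = u^2.
# Assumed smooth concave component for illustration:
sigma = 1.0
g  = lambda z: sigma * np.log(1.0 + z / sigma)
gp = lambda z: 1.0 / (1.0 + z / sigma)          # g'(z): the weight function

y = np.array([0.1, -0.2, 0.3, 0.0, 8.0])        # last observation is an outlier
theta = y.mean()                                # unweighted starting point
losses = []
for _ in range(50):
    z = (y - theta) ** 2
    losses.append(g(z).sum())                   # rho at the current iterate
    w = gp(z)                                   # step 4: weight update
    theta = np.sum(w * y) / np.sum(w)           # step 5: weighted LS fit

# The loss sequence is nonincreasing (the MM property of Theorem 1),
# the estimate moves toward the clean observations, and the outlier
# receives a small weight.
```

The weighted step here has a closed form; in IRBoost it is replaced by a run of weighted boosting.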

Boosting algorithm for function estimation
An important question is how to compute step 5 in Algorithm 1. For ease of notation, we only present methods for the following unweighted estimation, since the weighted estimation poses no additional technical difficulty:

min_f Σ_{i=1}^n s(y_i, f(x_i)). (7)

In a boosting algorithm, the solution is an additive model

F_M(x_i) = Σ_{m=1}^M t_m(x_i), (8)

where F_M(x_i) is constructed stagewise by sequentially adding an update t_m(x_i) to the current estimate F_{m−1}(x_i):

F_m(x_i) = F_{m−1}(x_i) + t_m(x_i). (9)

There are different ways to compute t_m(x) = (t_m(x_1), ..., t_m(x_n)): gradient and Newton-type updates are the most popular (Sigrist 2020). When the second derivative of the loss function exists, the Newton-type update is preferred over the gradient update to achieve fast convergence. The first and second derivatives of the loss s for observation i are given by

g_{m,i} = ∂s(y_i, F_{m−1}(x_i)) / ∂F_{m−1}(x_i), h_{m,i} = ∂²s(y_i, F_{m−1}(x_i)) / ∂F_{m−1}(x_i)².

For the quadratic loss s(u) = u²/2, we obtain h_{m,i} = 1, and the Newton-type update becomes the gradient update.
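A minimal numerical sketch of the Newton-type update, assuming a constant base learner (an intercept-only "tree") and the margin-based logistic loss; both choices are illustrative assumptions.

```python
import numpy as np

# One Newton-type boosting step with a constant base learner, using the
# margin-based logistic loss s(y, F) = log(1 + exp(-y * F)), y in {-1, +1}.
def loss(y, F):
    return np.log1p(np.exp(-y * F)).sum()

def grad(y, F):                        # g_{m,i}: first derivative in F
    return -y / (1.0 + np.exp(y * F))

def hess(y, F):                        # h_{m,i}: second derivative in F
    p = 1.0 / (1.0 + np.exp(-y * F))
    return p * (1.0 - p)

y = np.array([1, 1, 1, -1, 1, 1, -1, 1], dtype=float)
F = np.zeros_like(y)                   # F_0 = 0

gm, hm = grad(y, F), hess(y, F)
t_newton = -gm.sum() / hm.sum()        # Newton update for a constant learner
F_new = F + t_newton                   # stagewise update F_m = F_{m-1} + t_m

# For the quadratic loss s(u) = u^2 / 2, h_{m,i} = 1 and the same formula
# reduces to the plain gradient update.
```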

Penalized estimation
To avoid overfitting, we can augment the objective function with a regularization term Λ that penalizes the model complexity. If H is the space of linear regression with a p-dimensional predictor, i.e., t_m(x) = x^T β_m, a penalty combining L1 and L2 terms can be used:

Λ(t_m) = α Σ_{j=1}^p |β_jm| + (λ/2) Σ_{j=1}^p β_jm²,

where λ ≥ 0, α ≥ 0. Note that Λ(t_m) provides shrinkage estimators and can conduct variable selection. Suppose instead that H is the space of regression trees. Each regression tree splits the whole predictor space into disjoint hyper-rectangles with sides parallel to the coordinate axes (Wang 2018b). Specifically, denote the hyper-rectangles in the m-th boosting iteration by R_jm, j = 1, ..., J, and let t_m(x_i) = β_jm for x_i ∈ R_jm, i = 1, ..., n, j = 1, ..., J. With γ ≥ 0, the penalty can be defined as in Chen and Guestrin (2016):

Λ(t_m) = γJ + (λ/2) Σ_{j=1}^J β_jm².

A different penalized estimation is to implement a shrinkage parameter 0 < ν ≤ 1 in the update (9):

F_m(x_i) = F_{m−1}(x_i) + ν t_m(x_i).
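Under the tree penalty of Chen and Guestrin (2016), the penalized objective admits a closed-form leaf value. The sketch below shows that computation; the aggregated gradient and Hessian values G and H are made-up numbers for illustration.

```python
# XGBoost-style penalized leaf value (after Chen and Guestrin 2016):
# with Lambda(t_m) = gamma * J + (lambda / 2) * sum_j beta_jm^2, the
# optimal value of leaf j given the gradient sum G_j and Hessian sum
# H_j over the observations in the leaf is beta_j = -G_j / (H_j + lambda).
def leaf_value(G_j, H_j, lam):
    return -G_j / (H_j + lam)

G, H = 4.0, 8.0                         # made-up aggregates for one leaf
unpenalized = leaf_value(G, H, lam=0.0)
shrunk = leaf_value(G, H, lam=8.0)      # larger lambda shrinks toward zero

# The alternative shrinkage route scales the whole update instead:
nu = 0.1                                # learning rate, 0 < nu <= 1
F_prev = 0.0
F_new = F_prev + nu * unpenalized       # F_m = F_{m-1} + nu * t_m
```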

Implementation and tuning parameter selection
In summary, we use Algorithm 1 coupled with the boosting algorithm to minimize the following objective function:

Σ_{i=1}^n ℓ(y_i, f_i) + Σ_{m=1}^M Λ(t_m),

where f_i is given by (8). There are two layers of iterations: the outer layer is the weight update and the inner layer is the boosting iterations. An early stop of the boosting iterations does not guarantee convergence; on the other hand, the output f^(K) may overfit the data.
In practice, we may consider a two-stage process: in the first stage, apply Algorithm 1 to obtain the optimal observation weights; in the second stage, use a data-driven method such as cross-validation to select the optimal number of boosting iterations M, the tree penalty γ, and the penalty parameters λ and α. The same strategy can also be applied to the robustness parameter σ. However, since this parameter is typically considered a hyperparameter, a more computationally convenient approach in the literature is to conduct estimation for different values of σ and compare the results. One can begin with a large value of σ for less robust estimation and move toward smaller values of σ for more robust results.
The source version of the irboost package is freely available from the Comprehensive R Archive Network (http://CRAN.R-project.org). The reader can install the package directly from the R prompt via

R> install.packages("irboost")

All analyses presented below are contained in a package vignette. The rendered output of the analyses is available by the R command

R> library("irboost")
R> vignette("irbst", package = "irboost")

To reproduce the analyses, one can invoke the R code

R> edit(vignette("irbst", package = "irboost"))

Robust boosting for regression

In this example, we predict the median value of owner-occupied homes in suburbs of Boston, with data publicly available from the UCI machine learning data repository. There are 506 observations and 13 predictors. A different robust estimation can be found in Wang (2020).

The observation weights are plotted with the four smallest values highlighted; the corresponding four observations are considered outliers. Plotting the original median housing price against the predicted values shows, not surprisingly, that the four observations with the smallest weights have poor predictions. We can also view feature importance/influence from the learned model. The figure shows that the top two factors for predicting median housing price are the average number of rooms per dwelling (RM) and the percentage of lower-status population (LSTAT). The IRBoost in irboost is a weighted xgboost, where the weights are tuned by the robust argument.

Robust logistic boosting

A binary classification problem was proposed by Long and Servedio (2010). The response variable y is randomly chosen to be −1 or +1 with equal probability. We randomly generate symbols A, B and C with probabilities 0.25, 0.25 and 0.5, respectively. The predictor vector x with 21 elements is then generated as follows. If A is obtained, x_j = y, j = 1, ..., 21. If B is obtained, x_j = y, j = 1, ..., 11, and x_j = −y, j = 12, ..., 21. If C is obtained, x_j = y for j randomly chosen as 5 out of 1-11 and 6 out of 12-21, and x_j = −y for the remaining j. We generate training data with n = 400 and test data with n = 200.

We fit a robust logistic boosting model with concave component acave, where the maximum depth of a tree is 5. Other concave components in Table 1 can be applied similarly. We can compute the prediction error of the test data at each boosting iteration. Furthermore, we simulate data with 10% contamination of the response variables and apply the IRBoost. In a third robust logistic boosting fit, we reduce the parameter value σ (argument s in the irboost function) for more robust estimation; as a result, some observations have decreased weights in the model.

Robust multiclass boosting

In a 3-class classification on the iris dataset, xgboost generates a classification error of 0.02. With the same initial boosting parameters, the IRBoost in irboost automatically updates the observation weights and leads to a different decision rule while maintaining similar classification accuracy. Comparing model bst and fit7, with a small change of weights, a different classification rule is obtained with similar error.

Robust Poisson boosting

A survey collected from 3066 Americans was studied for health care utilization in terms of doctor office visits (Heritier, Cantoni, Copt, and Victoria-Feser 2009). The data contain 24 risk factors. Robust Poisson regression was conducted in Wang (2020). Here a robust Poisson boosting model is fitted with concave component ccave, and the observation weights are estimated. The doctor office visits in two years are highlighted for the 8 smallest weights, ranging from 200 to 750. We can view feature importance/influence from the learned model.

Robust boosting for accelerated failure time model

It is worth noting that the Cox regression in survival analysis is based on the partial likelihood function, which does not fit into the IRBoost framework. Alternatively, one may apply robust survival regression with the accelerated failure time model in irboost. The following code provides a survival analysis of patients with advanced lung cancer from the North Central Cancer Treatment Group; performance scores rate how well a patient can perform usual daily activities.

R> # update data with weights and rerun xgboost with the same niter
R> setinfo(dtrain, 'weight', bst_cc$weight_update)
[1] TRUE
R> # compare bst and bst_xg with different weights
R> bst_xg <- xgb.train(params, dtrain, nrounds = 50)
R> rcorr.cens(predict(bst_xg, dtrain), Surv(lung1$time, lung1$status))

Conclusion

In this article we propose IRBoost as a unified robust boosting algorithm and illustrate its applications in regression, generalized linear models, classification and time-to-event data analysis. The method can be used for outlier detection and can provide more robust predictive models. Based on existing weighted boosting software, we can explore the developed models for variable importance and the underlying trees. The R package irboost is a useful tool in machine learning applications.

Acknowledgment

This work was supported by the National Institute Of Diabetes And Digestive And Kidney Diseases of the National Institutes of Health under Award Number R21DK130006.