Robust boosting with truncated loss functions

Department of Research, Connecticut Children's Medical Center, Hartford, Connecticut 06106, USA. E-mail: zhuwang@gmail.com

Abstract: Boosting is a powerful machine learning tool with attractive theoretical properties. In recent years, boosting algorithms have been extended to many statistical estimation problems. For data contaminated with outliers, however, the development of boosting algorithms is very limited. In this paper, innovative robust boosting algorithms utilizing the majorization-minimization (MM) principle are developed for binary and multi-category classification problems. Based on truncated loss functions, the robust boosting algorithms share a unified framework for linear and nonlinear effects models. The proposed methods can reduce the heavy influence of a small number of outliers which could otherwise distort the results. In addition, adaptive boosting for the truncated loss functions is developed to construct sparser predictive models. We present convergence guarantees for smooth surrogate loss functions with both iteration-varying and constant step-sizes. We conducted empirical studies using data from simulations, a pediatric database developed for the US Healthcare Cost and Utilization Project, and breast cancer gene expression data. Compared with non-robust boosting, robust boosting improves classification accuracy and variable selection.


Introduction
Boosting algorithms are one of the most influential methodological approaches for data analysis developed in the last two decades. Initially, boosting was a powerful machine learning algorithm to predict binary outcomes [9]. The basic idea is to iteratively construct simple or weak classifiers and to combine their solutions to obtain a more accurate prediction. For instance, decision trees have been widely used in many classification problems, partly due to their interpretability. However, tree-based classification has a mixed track record of predictive performance. In boosting, single decision trees of arbitrary depth and shape are combined, adjusted to optimize the resulting model's classification performance. The resulting ensemble of decision trees performs substantially better than single-tree models. Boosting can be interpreted as a method for fitting regression models in a stagewise fashion when optimizing well-defined loss functions [12]. This gradient descent view of boosting has led to considerable development in different settings for both linear and nonlinear effects models [3,19].
Variable selection is an important issue in data analysis. Among a large number of candidate predictors, predictive models are expected to select a subset of risk factors for improved accuracy and parsimonious interpretation. In recent years, statistical methods to conduct variable selection have been actively developed. Among them, boosting is one of the most attractive methods by simply choosing appropriate base learners [12,3,19].
Boosting algorithms for classification include AdaBoost, LogitBoost and HingeBoost [12,3,31]. These algorithms were developed to minimize the exponential, logistic and hinge loss, respectively. Note that the three loss functions are convex. The convexity property can make optimization in some sense "easier" than the general case; for example, any local minimum must be a global minimum. A convex loss function, however, suffers from the negative impact of outliers. The above boosting algorithms therefore tend to be sensitive to noisy training data. When there exist points far away from their own classes, the classifiers are strongly influenced by such points because of the underlying loss functions used in these boosting algorithms.
To this end, robust boosting algorithms based on nonconvex loss functions have been proposed. For instance, BrownBoost and RobustBoost have been shown to be resistant to outliers [6,7]. A nonconvex loss can be conveniently obtained by truncating a popular loss function, leading to more robust prediction accuracy [35,24,14]. However, there is a lack of gradient descent boosting algorithms for truncated loss functions. In this paper, we aim to fill the gap. Because the truncated loss functions are nonconvex, a key computational trick is the difference of convex (DC) algorithm [29,35,24,14]. The idea is to work with a series of simpler but well-defined surrogate functions instead of the original loss function. As a result, the DC algorithm substitutes a series of simple optimization problems for a difficult optimization problem. Most notably, the DC algorithm has been linked with the majorization-minimization (MM) algorithm [15]. Another popular MM algorithm, the so-called EM (expectation-maximization) algorithm, has been extensively studied for missing data among many other applications [5,21]. The major contribution of this paper is to combine gradient descent boosting and the DC algorithm for a suite of truncated loss functions. These DC-boosting algorithms are more robust to data contaminated with outliers than their counterparts based on standard loss functions. We conduct convergence analysis of the proposed algorithms as well as the standard functional gradient boosting algorithms. We provide implementations of these algorithms through the publicly available R package bst (available at http://cran.r-project.org).
Methods for robust boosting have applications in many scientific fields, including healthcare research and gene expression analysis. In this article we apply the robust boosting methods to classify healthcare costs using the KID healthcare database (available at www.hcup-us.ahrq.gov). Healthcare cost is a tremendous burden on public expenditure, and high-cost patients are typically associated with severe disease diagnoses. Early identification of high-cost patients can help design targeted interventions that can defer or even avoid adverse outcomes [22]. Administrative healthcare databases have different sources of variability, including outliers. Research examining inpatient expenditures can be distorted by a small number of extremely high or low charges that have undue influence on the results. This is not desirable because these charges could reflect measurement errors that distort the goodness of fit of statistical models. Suspicious charges are potential data entry errors and represent discharges that are outliers in terms of the average charge per day of stay [10]. Another challenge is that administrative data have voluminous amounts of patient records, including patient demographics, payment information, disease severity, comorbidities, diagnostic and procedure information and hospital characteristics. For instance, there are hundreds of diagnosis categories as potential predictors. Therefore it is crucial to identify a small number of risk factors for healthcare cost stratification. Likewise, gene expression data may be mislabelled and also contain many predictors.
In Section 2 we describe robust truncated loss functions for binary classification problems. We show how to apply the DC technique to a truncated loss. In Section 3 we present the DC boosting algorithm (DCBA) to minimize truncated loss functions. To further reduce risk factors, adaptive robust boosting is presented. We generalize robust boosting to multi-class problems in Section 4. Theoretical analyses of the boosting algorithms are presented in Section 5. The performance of the proposed algorithms is investigated in Section 6 via simulated data. In Section 7, the proposed algorithms are applied to classify healthcare costs and clinical status from breast cancer patients. Proofs of the theoretical results are in the Appendix.

Robust loss functions
Assume that we have observations (x_i, y_i), where x_i is a p-dimensional predictor variable for i = 1, 2, ..., n. For a binary outcome y taking values +1 and −1, with prediction f and margin u = yf, consider a margin-based loss function ℓ(u). Table 1 lists three widely used loss functions: logistic, exponential and hinge loss [12,3,30]. In particular, support vector machines utilize the hinge loss [30]. These loss functions are sensitive to outliers because the loss values are unbounded and can go to infinity with outliers; see Figure 1. Therefore classification rules based on these loss functions can suffer from outliers. A simple remedy is to truncate the unbounded loss functions [1,36,24]. A truncated loss at a constant location s is

L(u, s) = min(ℓ(u), ℓ(s)). (1)

Because truncation reduces the impact of misclassified outliers, the resulting classifiers are more robust and accurate than the standard classifiers. Table 1 and Figure 1 compare standard loss functions ℓ(u) and their truncated counterparts [35,24,14,1,36]. The truncated logistic loss shows that L(u, s) increases as u decreases, but once u is less than s, the truncated loss becomes a constant. This implies that the truncated loss grows only up to an upper bound as an observation deviates further away from the classification boundary. For outliers located further away from the boundary, satisfying u ≤ s, the truncated loss maintains a constant value ℓ(s) so that the outliers cannot further influence the classification boundary. In contrast, as can be seen in Figure 1, the standard logistic loss has no bound and the impact of outliers grows to infinity. The interpretations of the truncated exponential and hinge loss functions are similar. One exception is the difference logistic loss. As u decreases, the difference logistic loss increases at a smaller rate compared to the logistic loss, and converges to a constant limit value. Theoretical properties of the truncated logistic, exponential and hinge loss have been established [35,24].
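The truncation in (1) is easy to verify numerically. Below is a minimal Python sketch (our own illustration, not code from the bst package; the function names are ours): a very negative margin inflates the standard logistic loss without bound, while the truncated loss is capped at ℓ(s).

```python
import math

def logistic_loss(u):
    # Standard logistic loss log(1 + exp(-u)), written stably for large |u|.
    return math.log1p(math.exp(-abs(u))) + max(-u, 0.0)

def truncated_logistic_loss(u, s):
    # Truncated loss (1): L(u, s) = min(loss(u), loss(s)), constant once u <= s.
    return min(logistic_loss(u), logistic_loss(s))

# An extreme outlier (margin u = -50) dominates the standard loss but is
# capped by the truncated loss at the constant value logistic_loss(s).
s = -2.0
print(truncated_logistic_loss(-50.0, s) == logistic_loss(s))   # True
print(logistic_loss(-50.0) > truncated_logistic_loss(-50.0, s))  # True
```

For well-classified points (u > s) the two losses coincide, so truncation changes nothing away from the outliers.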
In particular, Fisher consistency provides justification for these truncated loss functions when used in classification. In the statistical literature, Fisher consistency originally means that an estimation procedure applied to the entire population produces the true value of the parameter. For instance, the maximum likelihood estimator is typically Fisher consistent. In decision theory, a loss function is Fisher consistent if the population minimizer of the risk leads to the Bayes optimal decision rule given by f(x) = sign(p(x) − 1/2), where p(x) = P(Y = +1|x) for any x in the predictor space [17]. Minimizing an empirical risk (1/n) Σ_{i=1}^n L(y_i f(x_i), s), a classification procedure effectively approximates the optimal decision rule if the risk is Fisher consistent.

Table 1
Loss functions. z_+ = max(0, z); s > 0 for the difference logistic loss, and s ≤ 0 otherwise.

Truncated loss          L(u, s)                                       Standard loss    ℓ(u)
Truncated logistic      min{log(1 + exp(−u)), log(1 + exp(−s))}       logistic         log(1 + exp(−u))
Difference logistic     log(1 + exp(−u)) − log(1 + exp(−u − s))
Truncated exponential   min{exp(−u), exp(−s)}                         exponential      exp(−u)
Truncated hinge         min{(1 − u)_+, (1 − s)_+}                     hinge            (1 − u)_+

For the difference logistic loss [14], we have the following properties.

Proposition 1. The difference logistic loss L_DL(u, s) and the truncated logistic loss L_TL(u, s) are asymptotically equivalent.

Proposition 2. The minimizer of the population risk under the difference logistic loss has the same sign as p(x) − 1/2.

Proposition 1 recognizes the relationship between the difference logistic and truncated logistic loss, and Proposition 2 establishes Fisher consistency of the former loss.
In general, we may write a nonconvex loss function L(u, s) as a difference of the convex functions ℓ(u) and −ℓ_s(u):

L(u, s) = ℓ(u) + ℓ_s(u). (2)

For instance, Equation (1) may be converted to the form (2) with

ℓ_s(u) = −(ℓ(u) − ℓ(s))_+.

Many robust truncated loss functions in the literature can be written in this format, as demonstrated in Table 2 [14,35,24]. Note that the nonconvex loss functions may be nonsmooth, such as the truncated hinge loss.
An important issue is how to minimize the nonconvex truncated loss function L(u, s). Since ℓ_s(·) in (2) is a concave function, we can utilize the DC or majorize-minimize (MM) scheme. In the sequel we use the notation L(f, s) instead of L(u, s) after replacing u = yf. Even if suppressed in the notation, it should be understood that the loss function L(f, s) depends on the classification outcome y. The DC algorithm is an iterative process. Given the current estimate f^(k−1) in the (k−1)-th iteration, we apply a linear majorization to the concave function ℓ_s(f) in (2):

ℓ_s(f) ≤ ℓ_s(f^(k−1)) + ∇ℓ_s(f^(k−1))(f − f^(k−1)). (3)

Let H_k(f) denote the surrogate loss given by

H_k(f) = ℓ(f) + ℓ_s(f^(k−1)) + ∇ℓ_s(f^(k−1))(f − f^(k−1)). (4)

We have the following relationship regarding the objective loss function L(f, s):

L(f, s) ≤ H_k(f) for all f, with equality at f = f^(k−1). (5)

Therefore H_k(f) can be readily minimized or reduced to obtain a new estimate f^(k). Together with (5), the following statement holds:

L(f^(k), s) ≤ H_k(f^(k)) ≤ H_k(f^(k−1)) = L(f^(k−1), s). (6)

The descent property (6) guarantees that the objective loss values are monotonically nonincreasing across DC iterations. After eliminating constant terms in H_k(f), we obtain

L_DCF(f, s) = ℓ(f) + ∇ℓ_s(f^(k−1)) f,

so minimizing H_k(f) is equivalent to minimizing its simplified version L_DCF(f, s). For the truncated loss functions in Table 2, the corresponding surrogate loss functions L_DCF(f, s) are presented in Table 3. We have two remarks on the subscript DCF. First, the two letters DC highlight the linear majorization trick (3) that is often referred to as the DC algorithm. Second, the letter F emphasizes that the DC algorithm in this paper applies to function estimation problems, which are different from the parameter estimation problems typically found in the literature [1,14,35,36,24].
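The majorization steps (2)-(6) can be checked numerically for the truncated logistic loss. The sketch below is our own illustration (not the bst implementation): it builds the concave part ℓ_s, the surrogate H_k, and verifies that the surrogate lies above the truncated loss everywhere while touching it at the majorization point.

```python
import math

def ell(u):
    # Standard logistic loss log(1 + exp(-u)), computed stably.
    return math.log1p(math.exp(-abs(u))) + max(-u, 0.0)

def ell_prime(u):
    # Derivative of the logistic loss.
    return -1.0 / (1.0 + math.exp(u))

def ell_s(u, s):
    # Concave part in decomposition (2): ell_s(u) = -(ell(u) - ell(s))_+
    return -max(ell(u) - ell(s), 0.0)

def ell_s_grad(u, s):
    # A supergradient of the concave function ell_s.
    return -ell_prime(u) if u < s else 0.0

def L(u, s):
    # Truncated logistic loss: ell(u) + ell_s(u, s) = min(ell(u), ell(s)).
    return ell(u) + ell_s(u, s)

def H(u, u_prev, s):
    # Surrogate loss (4): ell plus the linearization (3) of ell_s at u_prev.
    return ell(u) + ell_s(u_prev, s) + ell_s_grad(u_prev, s) * (u - u_prev)

s, u_prev = -1.0, -3.0
# Majorization (5): H lies above L everywhere and touches it at u_prev.
assert all(H(t / 10, u_prev, s) >= L(t / 10, s) - 1e-12 for t in range(-80, 81))
assert abs(H(u_prev, u_prev, s) - L(u_prev, s)) < 1e-12
```

Any step that decreases H then decreases the truncated loss as well, which is exactly the descent property (6).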

Boosting nonconvex loss functions
The truncated loss (2) is minimized through the empirical loss:

min_f (1/n) Σ_{i=1}^n L(y_i f(x_i), s). (7)

We propose the difference of convex boosting algorithm (DCBA) in Section 3.1 and an adaptive DCBA in Section 3.2. To solve the nonconvex minimization problem (7), these algorithms minimize a sequence of surrogate convex problems given the current estimate f^(k−1)(x_i):

(1/n) Σ_{i=1}^n L_DCF(y_i f(x_i), s), (8)

where

L_DCF(y_i f(x_i), s) = ℓ(y_i f(x_i)) + ∇ℓ_s(y_i f^(k−1)(x_i)) y_i f(x_i).

Difference of convex boosting algorithm
The DCBA (Algorithm 1) has a sequence of nested loops. The outer loop updates the surrogate convex function (8) using the current estimate f^(k−1). In the inner loop, the gradient descent boosting algorithm minimizes the convex loss (8). When implemented in a computer program, we choose a constant step-size w_{m+1} = ν for 0 < ν ≤ 1 in this and subsequent boosting algorithms throughout the paper. Algorithm 1 can be utilized to fit a variety of models with different base learners, including componentwise linear least squares, componentwise smoothing splines and regression trees [3]. These base learners are presented in greater detail in Section 5.1 when the convergence analysis is concerned. As will be clear in Section 5.4, Algorithm 1 monotonically decreases the loss function (7) under certain conditions, although there is no guarantee that the algorithm can locate the global minimizer of the nonconvex loss. However, with a suitable starting point, the DC algorithm quite often converges to a global one; see Tao and An [29] and the references therein. In practice, different starting points can be explored for an optimal solution. This would add computing burden to Algorithm 1, which already has two layers of iterations. Note that the initial value in line 3 is updated with the estimate in line 10. This is known as a warm start. To partially alleviate the starting point issue, a practical strategy is to begin with a cold start. For instance, the developed computer program has an option to change line 3 from f_0(x) = f^(k−1) to a constant vector in which every element is (1/n) Σ_{i=1}^n U_i or 0. As a result, the loss values still monotonically decrease from line 3 to 9 under certain conditions. However, the initial loss value in the (k+1)-th DC outer loop (line 3 of Algorithm 1) may be larger than the last loss value in the k-th loop (line 10 of Algorithm 1).
Algorithm 1 Difference of Convex Boosting Algorithm (DCBA)
1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, iteration counts K and M, truncation parameter s, starting point f^(0).
2: for k = 1 to K do
3:   Set f_0(x) = f^(k−1).
4:   Construct the surrogate convex loss L_DCF(y_i, f, s) majorizing the objective function L at f^(k−1) via (8).
5:   for m = 1 to M do
6:     Compute the residuals U_i, defined as the negative gradient of the loss function (see Table 4).
7:     Fit a base learner g_m to the residuals U_i with predictors x_i, for i = 1, 2, ..., n.
8:     Update the estimated function with a selected step-size w_m: f_m = f_{m−1} + w_m g_m.
9:   end for
10:  Set f^(k) = f_M.
11: end for
12: Output: the classifier sign(f^(K)).

In practice, a
general observation is that the last loss values decrease for k = 1, 2, ..., K, albeit at a slower convergence rate. With high-dimensional data, as in this paper, this strategy avoids overfitting, generates better predictions and provides more parsimonious models. The numerical results in Sections 6 and 7 support such a choice. Indeed, the aforementioned cold start approach follows the general consensus of stopping boosting early, before convergence of the loss function [3].
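To make the nested loops concrete, here is a self-contained Python sketch of Algorithm 1 with a componentwise linear least squares base learner and warm starts; the toy data, function names and default settings (K, M, ν, s) are our own choices, and this is not the bst implementation.

```python
import math
import random

def ell(u):
    # Standard logistic loss log(1 + exp(-u)), computed stably.
    return math.log1p(math.exp(-abs(u))) + max(-u, 0.0)

def ell_prime(u):
    # Derivative of the logistic loss.
    return -1.0 / (1.0 + math.exp(u))

def trunc_loss(u, s):
    # Truncated logistic loss L(u, s) = min(ell(u), ell(s)).
    return min(ell(u), ell(s))

def dcba(x, y, s=-2.0, K=5, M=20, nu=0.1):
    """Minimal DCBA sketch: outer DC loop with warm starts (line 3), inner
    gradient boosting with a componentwise linear least squares learner."""
    n, p = len(x), len(x[0])
    f = [0.0] * n                          # f^(0) = 0
    for _ in range(K):
        f_prev = f[:]                      # majorization point f^(k-1)
        # supergradient of the concave part ell_s at the majorization point
        d = [(-ell_prime(y[i] * f_prev[i]) if y[i] * f_prev[i] < s else 0.0)
             for i in range(n)]
        for _ in range(M):
            # negative gradient of the surrogate loss L_DCF at the current f
            U = [-y[i] * (ell_prime(y[i] * f[i]) + d[i]) for i in range(n)]
            # componentwise least squares: pick the best single predictor
            best_j, best_beta, best_ss = 0, 0.0, -1.0
            for j in range(p):
                sxx = sum(x[i][j] ** 2 for i in range(n))
                sxu = sum(x[i][j] * U[i] for i in range(n))
                beta = sxu / sxx
                if beta * sxu > best_ss:   # beta * sxu = explained SS
                    best_j, best_beta, best_ss = j, beta, beta * sxu
            f = [f[i] + nu * best_beta * x[i][best_j] for i in range(n)]
    return f

random.seed(1)
n = 60
x = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [1 if xi[0] + 0.5 * xi[1] > 0 else -1 for xi in x]
y[0] = -y[0]                               # plant one mislabelled outlier
f = dcba(x, y)
loss0 = sum(trunc_loss(0.0, -2.0) for _ in y) / n          # loss at f = 0
loss1 = sum(trunc_loss(y[i] * f[i], -2.0) for i in range(n)) / n
assert loss1 < loss0
```

Because the mislabelled point eventually satisfies u ≤ s, its contribution to the surrogate gradient vanishes and it stops pulling on the fitted boundary.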
Prediction accuracy of robust boosting is greatly impacted by two tuning parameters. One is the location of truncation, which is often treated as a hyper-parameter [35,24]. For instance, consider the truncated logistic loss. If the truncation parameter s is too small, towards −∞, then the loss becomes similar to the standard logistic loss, which is sensitive to outliers. On the other hand, if s is too close to 0, the properties of the logistic loss can be difficult to maintain for the data. Therefore a good choice of s should be as effective as the standard logistic loss and also robust to outliers. We select s from several candidate values as in Wu and Liu [35] and Park and Liu [24]. Another tuning parameter is the number of boosting iterations. A larger number implies better fitting of the training data and possibly more predictors in the model, which may lead to overfitting and deteriorated prediction accuracy on test data. In practice, an optimal boosting iteration can be chosen by a data-driven method, such as a cross-validation scheme or a modified Bayesian information criterion (BIC) [25]. Alternatively, we may have disjoint training/tuning data sets and use the training data for model building and the tuning data for tuning parameter selection.

Table 4
Pseudo residuals in robust boosting algorithms at line 6 in Algorithm 1 (entries for the truncated logistic, truncated exponential (TAdaBoost) and truncated hinge (THingeBoost) losses).

Adaptive difference of convex boosting algorithm
We consider boosting as a sequential algorithm that iteratively fits a base learner with selected predictors, in particular, a single predictor variable. How to choose a particular variable in each iteration is an issue of paramount importance in variable selection with high-dimensional data. A simple approach to reducing noneffective predictor variables is adaptive boosting. Proposed by Bühlmann and Hothorn [4], twin boosting is a generic adaptive boosting procedure with two steps: the first step is the usual boosting, and the second step is constructed to resemble the first boosting round. As a result, if a variable has not been selected in the first round of boosting, it will not be included in the second. In addition, depending on the estimates of the variables selected in the first round of boosting, variables are selected with different weights in the second round. Twin boosting has much better variable selection results than the corresponding boosting algorithm and can also improve prediction accuracy [4].
Here we implement adaptive boosting (Algorithm 2) to minimize the truncated nonconvex loss function (7).

Algorithm 2 Adaptive DCBA
1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, iteration counts K and M, truncation parameter s, starting point f^(0).
2: Run the first round of the DCBA (Algorithm 1) to obtain the initial function estimate f_init and to select effective predictors.
3: Run a modified DCBA among the predictors selected in the first round to obtain the final function estimate f_final, modifying line 7 in Algorithm 1 as follows: fit a base learner g_m to the residuals U_i with the l̂-th predictor, where l̂ maximizes a goodness-of-fit criterion weighted by the first-round estimates; ⟨·, ·⟩ and ‖·‖ denote the inner product and norm, respectively.
4: Output: the classifier sign(f_final).

Robust multi-class boosting
There is limited research on robust multi-class boosting algorithms. See McDonald et al. [20] for an extension of BrownBoost in this setting. We now generalize the DCBA to multi-class problems using truncated loss functions [34,35]. We focus on the robust multi-class hinge loss, while the methodology can be extended to other truncated loss functions. Denote ℓ(u) = (u + 1)_+ and ℓ_s(u) = −(u − s)_+; the truncated hinge loss ℓ(u) + ℓ_s(u) and the non-robust hinge loss ℓ(u) = (u + 1)_+ are displayed in Figure 2. As we can see from the figure, the truncated loss becomes a constant if u ≥ s. This loss function has been utilized to develop robust multi-class classification rules. For a J-class problem with a response belonging to {1, ..., J}, denote f = (f_1, ..., f_J) a J-tuple of functions. We first consider minimizing a standard hinge loss [34]:

min_f (1/n) Σ_{i=1}^n Σ_{j≠y_i} (f_j(x_i) + 1)_+, (9)

subject to

Σ_{j=1}^J f_j(x) = 0. (10)

The constraint (10) is used for uniqueness of the function estimators. The classification rule is argmax_j f_j, which shares the Bayes decision rule. For a slightly different form from (9), Wang [32] developed a boosting algorithm as well as an adaptive boosting algorithm. The robust boosting aims to minimize the objective function

min_f (1/n) Σ_{i=1}^n Σ_{j≠y_i} [(f_j(x_i) + 1)_+ − (f_j(x_i) − s)_+] (11)

with constraint (10). This truncated multi-class loss is a generalization of the truncated binary hinge loss in Table 1 and holds Fisher consistency for s ≥ 0. Different from the binary classification problem in Section 2, there are a total of J differences of convex functions in (11). Given the current estimates f_j^(k−1)(x_i), each concave summand ℓ_s(f_j(x_i)) = −(f_j(x_i) − s)_+ in (11) can be linearly majorized, and thus the objective is majorized at f^(k−1) by

(1/n) Σ_{i=1}^n Σ_{j≠y_i} [(f_j(x_i) + 1)_+ + ℓ_s(f_j^(k−1)(x_i)) − I(f_j^(k−1)(x_i) > s)(f_j(x_i) − f_j^(k−1)(x_i))]. (12)

In an iterative process with given f^(k−1), estimation of f_j, j = 1, ..., J, can be achieved by minimizing the right hand side of (12), or equivalently, after eliminating constant terms,

L_DCF(y_i, f(x_i), s) = Σ_{j≠y_i} [(f_j(x_i) + 1)_+ − I(f_j^(k−1)(x_i) > s) f_j(x_i)], (13)

where I(·) is the indicator function.
Because the linearized term in (13) is a linear function of f_j, the difference between the hinge term and this linear function is convex. As a result, L_DCF(y_i, f(x_i), s) is convex, and hence we can use the boosting technology. For the truncated hinge loss, we present a multi-class DCBA, Algorithm 3 (mTHingeBoost). To circumvent overfitting, as in Algorithm 1, we may change line 3 and instead begin with some constant, for instance, 0. In conjunction with Algorithm 2, we can develop an adaptive robust multi-class DCBA, AmTHingeBoost. Without the outer loop in Algorithm 3, i.e., without lines 2-4, 15 and 16, and changing line 7 such that U_ij = −I(y_i ≠ j)I(f_j(x_i) + 1 > 0), the modified algorithm amounts to minimizing the standard loss (9). We call this reduced algorithm mHingeBoost. It can be conveniently extended for adaptive boosting, which is referred to as AmHingeBoost.
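The boundedness of the truncated multi-class hinge loss (11) and the corresponding pseudo residuals are easy to illustrate. The snippet below is our own sketch (names and the tiny example are ours, not from the paper's software); it uses the per-observation loss Σ_{j≠y} (f_j + 1)_+ − (f_j − s)_+ and one valid subgradient choice for the residuals.

```python
def pos(z):
    return max(z, 0.0)

def trunc_multi_hinge(f, y, s):
    """Truncated multi-class hinge loss (11) for one observation:
    sum over the wrong classes of (f_j + 1)_+ - (f_j - s)_+, with s >= 0."""
    return sum(pos(f[j] + 1.0) - pos(f[j] - s)
               for j in range(len(f)) if j != y)

def neg_gradients(f, y, s):
    # Pseudo residuals U_j for mTHingeBoost: zero for the true class,
    # -[I(f_j + 1 > 0) - I(f_j > s)] otherwise (a subgradient choice).
    return [0.0 if j == y else
            -(float(f[j] + 1.0 > 0) - float(f[j] > s))
            for j in range(len(f))]

# Three classes with sum-to-zero scores as in constraint (10); truth is class 0.
f, y, s = [2.0, -1.0, -1.0], 0, 1.0
print(trunc_multi_hinge(f, y, s))      # 0.0: both wrong-class scores <= -1
# A wrong class with a huge score contributes only a bounded amount:
f_bad = [-4.0, 8.0, -4.0]
print(trunc_multi_hinge(f_bad, y, s))  # (8+1) - (8-1) = 2.0
```

For any wrong-class score u ≥ s the contribution is the constant 1 + s, which is the multi-class analogue of the flat region in Figure 2.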

Convergence analysis
Convergence of functional gradient boosting has been investigated by Mason et al. [18] and Grubb and Bagnell [13], among others. In this section, we provide new convergence results. We first review three base learners commonly utilized in the literature, including in this paper. The properties of these base learners are building blocks of the analysis. We then develop new convergence results for the functional gradient boosting algorithm. Next, we extend the results to a universal MM boosting algorithm, and evaluate Algorithm 1 as a special case. It is worth noting that we omit convergence analysis of the adaptive DCBA,

Algorithm 3 Multi-class DCBA for Truncated Hinge Loss
1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, iteration counts K and M, truncation parameter s, starting point f^(0).
2: for k = 1 to K do
3:   Set f_0 = f^(k−1).
4:   Construct the surrogate convex loss L_DCF(y_i, f, s) majorizing the objective loss (11) at f^(k−1) via (13).
5:   for m = 1 to M do
6:     for class j = 1 to J do
7:       Compute U_ij, the negative gradient of L_DCF with respect to f_j.
8:       Fit a prediction model g_mj for the pseudo response variable U_ij with predictor variable x_i.
9:     end for
10:    for class j = 1 to J do
11:      Set g̃_mj = g_mj − (1/J) Σ_{j'=1}^J g_mj', so that the update respects constraint (10).
12:      Update f_mj = f_{m−1,j} + w_m g̃_mj.
13:    end for
14:  end for
15:  Update the current estimate f^(k) = f_M.
16: end for
17: Output: the classifier argmax_j f_j^(K).

Algorithm 2. This algorithm has two rounds of Algorithm 1, while the second round has a different variable selection. As will be clear in the sequel, variable selection does not contribute to the convergence analysis. Hence the convergence results for the DCBA also hold for the adaptive DCBA.

Base learners in boosting
The design matrix X is an n × p matrix given by X = (x_1, x_2, ..., x_n)ᵀ, and the response vector is given by U = (U_1, U_2, ..., U_n)ᵀ. A regression technique constructs an estimator Û of U from the predictors x_i, i = 1, ..., n. Many regression procedures are linear smoothers given by

Û = SU, (14)

where S is an n × n smoothing matrix whose i-th row assigns the weights given to each U_i in building the estimate Û_i. Note that the entries of S are related to the predictors x_i but do not involve the observations U. Popular linear smoothers used as base learners in the boosting literature are reviewed below. We pay special attention to their properties, which are building blocks in the subsequent convergence analysis.

Linear least squares
Let Z be an n × q matrix formed by a subset of the columns of X, with 1 ≤ q ≤ min{n, p}. In addition, assume that Z has full column rank. In practice, componentwise least squares, i.e., q = 1, is typically employed in boosting. The smoothing or hat matrix S is given by S = Z(ZᵀZ)^(−1)Zᵀ, and the least squares prediction is given by Û = SU. It is well known that the eigenvalues of the projection matrix S are either zero or one, and that the number of nonzero eigenvalues equals the rank of the matrix. Furthermore, the hat matrix is symmetric and idempotent.
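For the componentwise case q = 1, the projection property can be verified directly. The sketch below (our own, in plain Python) applies the smoother S = z zᵀ/(zᵀz) twice and checks idempotence, as well as the identity ⟨U, SU⟩ = ‖SU‖² that a symmetric idempotent smoother satisfies (cf. Proposition 5).

```python
def smooth_ls(z, U):
    # Componentwise least squares smoother: U_hat = S U with S = z z^T / (z^T z).
    beta = sum(zi * ui for zi, ui in zip(z, U)) / sum(zi * zi for zi in z)
    return [beta * zi for zi in z]

z = [1.0, 2.0, -1.0, 0.5]
U = [0.3, -1.2, 2.0, 0.7]
U_hat = smooth_ls(z, U)
# S is a projection (symmetric idempotent): smoothing the fit changes nothing.
U_hat2 = smooth_ls(z, U_hat)
assert all(abs(a - b) < 1e-12 for a, b in zip(U_hat, U_hat2))
# For a symmetric idempotent smoother, <U, S U> equals ||S U||^2.
inner = sum(a * b for a, b in zip(U, U_hat))
norm2 = sum(a * a for a in U_hat)
assert abs(inner - norm2) < 1e-12
```

The second assertion is the reason that fitting the negative gradient by least squares still guarantees a first-order decrease in the loss for this base learner.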

Componentwise smoothing spline
Without loss of generality, suppose the r-th predictor x_ir, 1 ≤ r ≤ p, is chosen by the base learner; we drop the subscript r. For observations (x_i, U_i), i = 1, ..., n, a smoothing spline ĝ minimizes the penalized residual sum of squares

Σ_{i=1}^n (U_i − g(x_i))² + λ ∫ (g''(t))² dt

among all functions with two continuous derivatives, where λ is a pre-specified smoothing parameter. A smoothing spline is a linear operator such that Û = SU [33]. The smoothing matrix S has two eigenvalues equal to 1, and the other eigenvalues are all strictly between zero and one. In addition to being positive definite, S is symmetric but not idempotent.

Regression trees
We start with a bin smoother for a single predictor, also known as a regressogram [33]. As in Section 5.1.2, suppose the r-th predictor x_ir, 1 ≤ r ≤ p, is chosen by the base learner; we drop the subscript r again. Consider the observations (x_i, U_i), i = 1, ..., n. We define J bins with cutpoints a_1 < ... < a_{J+1}, where a_1 = −∞ and a_{J+1} = ∞:

B_j = {i : a_j ≤ x_i < a_{j+1}}, j = 1, ..., J.

An estimate Û can be obtained by averaging the U_i's over each bin:

Û(x) = (1/n_j) Σ_{i ∈ B_j} U_i for x in the j-th bin, (15)

where n_j is the number of observations in B_j. Define the vector of fitted values Û = (Û(x_1), ..., Û(x_n))ᵀ. It then follows that Û = SU. Denote by A_j an n_j × n_j sub-matrix whose every entry is 1/n_j. With the observations ordered by bin, the matrix S is a symmetric block matrix whose diagonal blocks are A_j and whose off-diagonal blocks are matrices of zeros:

S = diag(A_1, ..., A_J). (16)

We have the following properties regarding the regressogram.

Proposition 3. The linear smoothing matrix S of a regressogram defined by (16) is symmetric idempotent, thus positive semidefinite. Furthermore, S has the eigenvalues 1 and 0, with multiplicities J and n − J, respectively.

A regression tree with p predictor variables follows the same idea as the bin smoother, although the splitting variables and cut points are optimally chosen. A regression tree splits the whole predictor space into disjoint hyper-rectangles R_j with sides parallel to the coordinate axes, and yields estimates of the same form as (15). Consequently, a regression tree is a linear operator with smoothing matrix S given by (16). Like the regressogram, we have the following result:

Proposition 4. The linear smoothing matrix S of a regression tree defined by (16) is symmetric idempotent, thus positive semidefinite. Furthermore, S has the eigenvalues 1 and 0, with multiplicities J and n − J, respectively.
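The idempotence asserted in Propositions 3 and 4 amounts to a simple fact: bin averages of bin averages are unchanged. A small Python sketch of the regressogram (15), using our own toy cutpoints and data:

```python
def regressogram(x, U, cuts):
    """Bin smoother (15): average U_i within each bin defined by the cut
    points; equivalent to U_hat = S U with the block-diagonal S in (16)."""
    def bin_of(v):
        # Count how many cut points v has passed; bins are [-inf, c1), [c1, c2), ...
        return sum(1 for c in cuts if v >= c)
    bins = [bin_of(v) for v in x]
    fitted = []
    for i in range(len(x)):
        members = [U[j] for j in range(len(x)) if bins[j] == bins[i]]
        fitted.append(sum(members) / len(members))
    return fitted

x = [0.1, 0.4, 0.6, 0.9, 1.5]
U = [1.0, 3.0, 10.0, 2.0, 5.0]
fit = regressogram(x, U, cuts=[0.5, 1.0])
# Bins: {0.1, 0.4} -> mean 2.0, {0.6, 0.9} -> mean 6.0, {1.5} -> 5.0
assert fit == [2.0, 2.0, 6.0, 6.0, 5.0]
# S is symmetric idempotent: smoothing the fitted values is a no-op.
assert regressogram(x, fit, cuts=[0.5, 1.0]) == fit
```

A single regression tree behaves the same way once its leaf regions play the role of the bins.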

Square integrable space
Before we evaluate the functional gradient boosting, we first review the relevant Hilbert space [13]. Given a measurable predictor set X, a complete vector space V of responses, and a measure μ over X, the square integrable function space L²(X, V, μ) consists of the functions f : X → V with ∫ ‖f(x)‖² dμ(x) < ∞. This Hilbert space has a natural inner product and norm:

⟨f, g⟩ = ∫ f(x)ᵀ g(x) dμ(x), ‖f‖ = ⟨f, f⟩^(1/2). (17)

For an observed set of points x_i, i = 1, ..., n, the empirical inner product and norm can be correspondingly defined as

⟨f, g⟩_n = (1/n) Σ_{i=1}^n f(x_i)ᵀ g(x_i), ‖f‖_n = ⟨f, f⟩_n^(1/2),

and the vector space operations defined by (17) reduce to these empirical versions.
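The empirical versions are just averages over the observed design points. A tiny Python illustration (our own, with made-up values), including the Cauchy-Schwarz inequality that the inner product inherits:

```python
def inner_n(f_vals, g_vals):
    # Empirical inner product <f, g>_n = (1/n) sum_i f(x_i) g(x_i)
    n = len(f_vals)
    return sum(a * b for a, b in zip(f_vals, g_vals)) / n

def norm_n(f_vals):
    # Empirical norm ||f||_n = sqrt(<f, f>_n)
    return inner_n(f_vals, f_vals) ** 0.5

f_vals = [1.0, -2.0, 3.0]
g_vals = [2.0, 0.0, -1.0]
print(inner_n(f_vals, g_vals))   # (2 + 0 - 3)/3 = -1/3
# Cauchy-Schwarz: |<f, g>_n| <= ||f||_n * ||g||_n
assert abs(inner_n(f_vals, g_vals)) <= norm_n(f_vals) * norm_n(g_vals)
```

These are the operations used throughout the edge and convergence arguments below.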

We need to compute gradients of functionals over the Hilbert space L²(X, V, μ). Let ℓ : L²(X, V, μ) → R be a functional; its Fréchet derivative ∇ℓ(f) is the linear operator satisfying

lim_{‖h‖→0} |ℓ(f + h) − ℓ(f) − ⟨∇ℓ(f), h⟩| / ‖h‖ = 0.

Convergence of functional gradient boosting algorithms
Given a set of observations (x_i, y_i), i = 1, ..., n, denote Ω ⊂ L²(X, V, μ) the hypothesis space generated by a base learner. We aim to minimize an empirical loss function for a functional ℓ:

min_{f ∈ Ω} ℓ(f). (18)

Mason et al. [18] frame boosting as a functional gradient descent method. The Fréchet derivative ∇ℓ implies

ℓ(f + ψ) ≈ ℓ(f) + ⟨∇ℓ(f), ψ⟩.

Define U = −∇ℓ(f) and let ψ = δg. We then have:

ℓ(f + δg) ≈ ℓ(f) − δ⟨U, g⟩. (19)

Hence, to first order in δ > 0, (19) implies that the loss is reduced whenever ⟨U, g⟩ > 0. For the greatest reduction in loss values, we should seek the g which maximizes ⟨U, g⟩. Now,

2⟨U, g⟩ = ‖U‖² + ‖g‖² − ‖U − g‖² ≥ −‖U − g‖²,

where the last equality holds only for the trivial case U = g = 0. Therefore seeking the solution argmax_g −‖U − g‖², or argmin_g ‖U − g‖², might be sub-optimal for the greatest loss reduction compared to directly seeking argmax_g ⟨U, g⟩. However, the former strategy brings in numerous regression-type base learners, as presented in Section 5.1. In addition, if the solution g* = argmin_g ‖U − g‖² results in ⟨U, g*⟩ > 0, the loss is still reduced. Some base learners, including linear least squares and regression trees, indeed provide the greatest loss reduction; see Proposition 5 for details. The above discussion is closely related to the concept of edge, which quantifies the performance of any given set of base learners [13].
If for every gradient ∇ℓ(f) there exists a function h ∈ Ω such that

⟨∇ℓ(f), h⟩ ≥ γ‖∇ℓ(f)‖ ‖h‖, (20)

then the hypothesis space Ω is said to have an edge γ ∈ [0, 1]. If Ω is closed under scalar multiplication, letting U = −∇ℓ(f) and g = −h, then (20) becomes

⟨U, g⟩ ≥ γ‖U‖ ‖g‖. (21)

The base learners in this paper are all generated from regression techniques, so Ω is closed under scalar multiplication. Hence we can also claim that (21) holds for every negative gradient U if Ω has an edge γ.
In the literature, the edge is related to the weak learnability assumption when base learners are binary classifiers [8]. Roughly speaking, if Ω is a space of binary classifiers more accurate than a random guess, then the edge γ > 0. The relationship between the edge and the base learners discussed in Section 5.1 can be summarized in the following result.

Proposition 5. Suppose the base learner is a linear smoother g = SU whose smoothing matrix S is symmetric and positive semidefinite. Then ⟨U, g⟩ ≥ 0. Furthermore, the hypothesis space Ω has an edge γ for every U. Finally, for a symmetric idempotent S, the following equality holds:

⟨U, g⟩ = ‖g‖². (22)

The equality (22) will be handy in the subsequent analysis. All three base learners described in Section 5.1 are linear smoothers with positive semidefinite smoothing matrices. According to Proposition 5, therefore, the hypothesis space Ω has an edge γ for every negative gradient −∇ℓ(f). Bühlmann and Hothorn [3] observed that seeking argmax ⟨U, g⟩ [18] and seeking argmin ‖U − g‖² [11] coincide for the componentwise linear least squares base learner. This instance is generalized by Proposition 5 to cover more base learners, for instance, regression trees (cf. Proposition 4).
The functional gradient boosting algorithm, Algorithm 4, aims to find a solution of (18) [18,11]. Different choices of step-size in line 5 have been proposed, including an iteration-varying number obtained by a line search [11] and a small constant [3]. From Proposition 5, if the base learner is a linear smoother whose smoothing matrix is symmetric idempotent, then maximizing ⟨U, g⟩ is essentially the same as minimizing ‖U − g‖² in line 4 of Algorithm 4. However, if the componentwise smoothing spline is used as the base learner, the two methods do not necessarily match since the smoothing matrix S is not idempotent.
In theoretical analysis of algorithms, it is common to assume some conditions on the loss functions. A functional ℓ is ζ-strongly smooth if, for some ζ > 0 and all φ, ψ ∈ Ω, the following inequality holds:

ℓ(ψ) ≤ ℓ(φ) + ⟨∇ℓ(φ), ψ − φ⟩ + (ζ/2)‖ψ − φ‖².
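For intuition, the scalar logistic loss has second derivative bounded by 1/4, so its quadratic upper bound holds with ζ = 1/4. A quick numeric check of this (our own sketch, on a grid of scalar arguments rather than functions):

```python
import math

def ell(u):
    # Standard logistic loss log(1 + exp(-u)), computed stably.
    return math.log1p(math.exp(-abs(u))) + max(-u, 0.0)

def ell_prime(u):
    # Derivative of the logistic loss; |ell''(u)| <= 1/4 everywhere.
    return -1.0 / (1.0 + math.exp(u))

# zeta-strong smoothness with zeta = 1/4: for all points a, b on a grid,
# ell(b) <= ell(a) + ell'(a) (b - a) + (zeta/2) (b - a)^2.
zeta = 0.25
ok = all(
    ell(q / 7) <= ell(p / 7) + ell_prime(p / 7) * (q / 7 - p / 7)
    + 0.5 * zeta * (q / 7 - p / 7) ** 2 + 1e-12
    for p in range(-35, 36) for q in range(-35, 36)
)
assert ok
```

The hinge loss, by contrast, is not smooth, which is why the smooth-surrogate guarantees in the abstract are stated for smooth losses.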

Let f* = argmin_{f ∈ Ω} ℓ(f). We first have a convergence result with an iteration-varying step-size.
Theorem 1. Let ℓ(f) be a ζ-strongly smooth functional bounded below over L²(X, V, μ). Assume that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇ℓ(f). Suppose the step-size in the (m+1)-th boosting iteration is given by

w_{m+1} = ⟨−∇ℓ(f_m), g_{m+1}⟩ / (ζ‖g_{m+1}‖²). (23)

Then ℓ(f_m) in Algorithm 4 converges to some value, in which case ‖∇ℓ(f_m)‖ → 0 as m → ∞. Furthermore, after M iterations, the following convergence rate is obtained:

min_{0≤m≤M} ‖∇ℓ(f_m)‖² ≤ 2ζ(ℓ(f_0) − ℓ(f*)) / (γ²(M + 1)).

The results in Theorem 1 differ from Theorem 12.3 in Mason et al. [18] in that we have adopted the concept of edge. In addition, there are at least two other differences: the base learner here is not restricted to a classifier and may be a regression-type base learner; and we may minimize the least squares ‖−∇ℓ(f_m) − g_{m+1}‖² rather than maximize ⟨−∇ℓ(f_m), g_{m+1}⟩. Notice that if the boosting base learner is a linear operator g_{m+1} = S(−∇ℓ(f_m)), and the associated linear smoothing matrix S is symmetric idempotent, then (23) reduces to the constant w_{m+1} = 1/ζ based on Proposition 5. Finally, Theorem 1 yields very similar results to a gradient method in function estimation; see, for instance, Nesterov [23].
For a gradient method in function estimation, a constant step-size and an iteration-varying step-size can be unified in the convergence analysis [23]. For a functional gradient method, however, the properties of a constant step-size in Algorithm 4 have not been well addressed [3]. The following theorem fills the gap. The results may be applied to linear smoother base learners including linear least squares and regression trees.

Algorithm 4 (displayed steps):
3: Compute the residuals U_i, defined as the negative gradient of the loss function.
4: Fit a base learner g_m to the residuals U_i with predictors x_i, for i = 1, 2, ..., n.
5: Update the estimated function with a selected step-size w_m: f_m = f_{m−1} + w_m g_m.

Furthermore, after M iterations, the following convergence rate is obtained:
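To make the generic recipe concrete, the sketch below implements the functional gradient boosting loop (Algorithm 4) with a logistic loss and a componentwise linear least squares base learner on simulated data. The data, step-size and iteration count are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary data; y in {-1, +1}, linear signal in the first predictor.
n, p = 200, 5
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))

def neg_gradient_logistic(y, f):
    # -d/df log(1 + exp(-y f)) = y / (1 + exp(y f))
    return y / (1.0 + np.exp(y * f))

f = np.zeros(n)
w = 0.1                         # small constant step-size
for m in range(100):
    U = neg_gradient_logistic(y, f)
    # Componentwise linear least squares: pick the single predictor
    # that best fits the residuals, as in L2-type boosting.
    betas = (X * U[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    sse = ((U[:, None] - X * betas) ** 2).sum(axis=0)
    j = int(np.argmin(sse))
    f = f + w * betas[j] * X[:, j]

loss = np.log1p(np.exp(-y * f)).mean()
assert loss < np.log(2.0)       # improved over the f = 0 starting point
```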

Convergence of MM boosting algorithms
Given a set of observations (x_i, y_i), i = 1, ..., n, where y_i ∈ R and x_i = (x_{i1}, ..., x_{ip}) ∈ R^p, denote by Ω the hypothesis space generated by a base learner. We seek the solution of the problem min_{f∈Ω} L(f) for a functional L : L²(X, V, μ) → R. From now on, we suppress y in the loss functions. We consider a general MM algorithm. Suppose the objective L(f) is majorized by a surrogate loss H_k(f) at the majorization point f^{(k−1)} in the k-th outer MM loop, k = 1, 2, ..., such that

H_k(f) ≥ L(f) for all f, with H_k(f^{(k−1)}) = L(f^{(k−1)}). (25)

Note that H_k(f) depends on f^{(k−1)} by design. The surrogate loss H_k(f) can be minimized or reduced to obtain a new estimate f^{(k)}. Therefore the DC surrogate function characterized by (5) is a special case of (25). Denote by ε_k(f) the error between the surrogate loss and the objective, given by

ε_k(f) = H_k(f) − L(f). (26)

The descent property of the MM framework is described below.

Proposition 6. Suppose a loss functional L is bounded below, and a surrogate functional H_k is defined by (25). Then L(f^{(k)}) and H_k(f^{(k)}) monotonically decrease and converge to the same value. Furthermore, ε_k(f^{(k)}) → 0 as k → ∞.
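As a concrete instance of such a surrogate, a difference-of-convex construction for a truncated logistic loss can be checked numerically: write min(ℓ(z), c) = ℓ(z) − max(ℓ(z) − c, 0) and linearize the subtracted convex part at the current point. This construction is illustrative; the paper's surrogate (4) may differ in details.

```python
import numpy as np

def ell(z):
    # Logistic loss as a function of the margin z = y * f.
    return np.log1p(np.exp(-z))

def truncated(z, c):
    return np.minimum(ell(z), c)

def surrogate(z, z0, c):
    # Linearize h(z) = max(ell(z) - c, 0) at z0 (a subgradient of the
    # subtracted convex part), which yields a convex majorizer of the
    # truncated loss that touches it at z0.
    grad_ell = -1.0 / (1.0 + np.exp(z0))
    slope = grad_ell if ell(z0) > c else 0.0
    hinge0 = max(ell(z0) - c, 0.0)
    return ell(z) - (hinge0 + slope * (z - z0))

c, z0 = 0.8, -1.5
zs = np.linspace(-4, 4, 401)
H = surrogate(zs, z0, c)
L = truncated(zs, c)

assert np.all(H >= L - 1e-12)                              # majorization
assert np.isclose(surrogate(z0, z0, c), truncated(z0, c))  # touch at z0
```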
We consider an implementation of the boosting algorithm in the MM framework, Algorithm 5, which is a generalization of Algorithm 1. We begin with some preliminary results concerning the convergence analysis. Denote by f^{(k)}_m the boosting estimate at the m-th inner iteration of the k-th outer MM loop. Algorithm 5 proceeds as follows:
2: for k = 1 to K do
3: Construct the surrogate convex loss H_k(f) majorizing L at f^{(k)}_0(x).
5: for m = 1 to M do
6: Compute the residuals U_i, defined as the negative gradient of the loss function.
7: Fit a base learner g_m to the residuals U_i with predictors x_i, for i = 1, 2, ..., n.
8: Update the estimated function with a selected step-size w_m: f_m = f_{m−1} + w_m g_m.
Notice that f^{(k)}_0 is the initial boosting estimate at the k-th MM outer loop. The equality between the surrogate and the objective at the majorization points is stated in the next lemma.

Lemma 3. For a loss functional L and a surrogate loss functional H_k defined by (25), the following results hold for Algorithm 5:
As a consequence, (27) also holds for Algorithm 1, which is a special case of Algorithm 5.
A concept of strongly convex functional is often needed in the convergence analysis. A functional ℓ over L²(X, V, μ) is η-strongly convex if, for some η > 0 and all φ, ψ ∈ Ω, the following inequality holds:

ℓ(ψ) ≥ ℓ(φ) + ⟨∇ℓ(φ), ψ − φ⟩ + (η/2)‖ψ − φ‖². (28)

For the sum of convex and concave loss functions (2), utilizing the special form of the surrogate loss (4), the following lemma suggests that we may focus only on the convex loss when studying the properties of the surrogate loss.

Lemma 4. Suppose a loss functional L(f) takes the form (2), with surrogate H_k(f) = ℓ(f) + h_k(f), where h_k(f) is given by (3). Then the following results hold:

Let f* = argmin_{f∈Ω} L(f). We first have a convergence result of Algorithm 5 with an iteration-varying step-size.

Theorem 5. Consider a loss functional L(f) bounded below over L²(X, V, μ). Assume that L(f) is majorized by a surrogate loss H_k(f), a ζ-strongly smooth functional over L²(X, V, μ). Assume that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Given a starting point f^{(0)}, suppose the step-size w_{m+1} in the (m+1)-th boosting iteration within the k-th outer MM loop is given by (29); then Algorithm 5 converges to some value, in which case: Furthermore, the following convergence rate is obtained:

Notice that if the boosting base learner is a linear operator and the associated linear smoothing matrix is symmetric idempotent, then the step-size (29) reduces to a constant w_{m+1} = 1/ζ by Proposition 5. With Theorem 5, the following results for the DC framework follow directly from Lemma 4.

Corollary 5.1. Suppose a loss functional L(f) takes the form (2) and h_k(f) is given by (3). Assume that ℓ(f) is a ζ-strongly smooth functional over L²(X, V, μ). Assume that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Given a starting point f^{(0)}, suppose the step-size w_{m+1} in the (m+1)-th boosting iteration within the k-th outer DC loop is given by (29); then Algorithm 1 converges to some value, in which case: Furthermore, the following convergence rate is obtained:

For a constant step-size, we have the following results, which may be applied to linear smoother base learners including linear least squares and regression trees.

Theorem 6. Suppose a loss functional L(f) bounded below over L²(X, V, μ) is majorized by a ζ-strongly smooth surrogate H_k(f), and that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Suppose that at the (m+1)-th boosting iteration within the k-th outer MM loop, the boosting base learner is a linear operator and the associated linear smoothing matrix S^{(k)}_{m+1} is symmetric idempotent. Given a starting point f^{(0)} and a constant step-size, Algorithm 5 converges to some value, in which case: Furthermore, the following convergence rate is obtained:

With Theorem 6, the following results for the DC framework follow directly from Lemma 4.

Corollary 6.1. Suppose a loss functional L(f) takes the form (2) and h_k(f) is given by (3). Assume that ℓ(f) is a ζ-strongly smooth functional over L²(X, V, μ). Assume that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Suppose that at the (m+1)-th boosting iteration within the k-th outer DC loop, the boosting base learner is a linear operator and the associated linear smoothing matrix S^{(k)}_{m+1} is symmetric idempotent. Given a starting point f^{(0)} and a constant step-size, Algorithm 1 converges to some value, in which case: Furthermore, the following convergence rate is obtained:

By Proposition 6, L(f^{(k)}) and H_k(f^{(k)}) monotonically decrease and converge to some value, say L(f‡). In general, and in particular if L is nonconvex, there is no guarantee that L(f‡) = L(f*). Nevertheless, we can explore how Algorithm 5 performs when L(f‡) = L(f*). For convenience of analysis we study Algorithm 6. Unlike Algorithm 5 with two layers of loops, Algorithm 6 is a simplified version with only one loop; this follows the same strategy as Krause and Singer [14]. We first prepare some results before the convergence analysis is conducted. In particular, we generalize strong convexity to the following condition, which will be useful in the constant step-size analysis.

Algorithm 6:
1: Input: data {(x_1, y_1), ..., (x_n, y_n)}, iteration count K, starting point f^{(0)}.
2: for k = 1 to K do
3: Construct the surrogate convex loss H_k(f) majorizing the objective function L at f^{(k−1)}.
4: Compute the residuals U_i, defined as the negative gradient of the loss function.
5: Fit a base learner g_k to the residuals U_i with predictors x_i, for i = 1, 2, ..., n.
6: Update the estimated function with a selected step-size w_k: f^{(k)} = f^{(k−1)} + w_k g_k.
7: end for
8: Output: f^{(K)}.
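Algorithm 6 can likewise be sketched in a few lines. The fragment below pairs a truncated logistic loss with a componentwise linear least squares base learner and a constant step-size on contaminated simulated data. The surrogate-gradient construction (switching the gradient off where the loss is already truncated) and all constants are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))
flip = rng.random(n) < 0.1      # 10% label contamination
y[flip] = -y[flip]

c, w = 2.0, 0.5                 # truncation level and constant step-size

f = np.zeros(n)
for k in range(200):
    z = y * f
    # Surrogate gradient: the logistic gradient, zeroed out where the
    # loss is already truncated (an assumed DC-type construction).
    active = np.log1p(np.exp(-z)) <= c
    U = np.where(active, y / (1.0 + np.exp(z)), 0.0)
    # Componentwise linear least squares base learner.
    betas = (X * U[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    j = int(np.argmin(((U[:, None] - X * betas) ** 2).sum(axis=0)))
    f = f + w * betas[j] * X[:, j]

trunc_loss = np.minimum(np.log1p(np.exp(-y * f)), c).mean()
assert trunc_loss < 0.6         # improved over the f = 0 value log(2)
```

Because the contaminated observations saturate at the truncation level c, they stop contributing to the surrogate gradient, which is the mechanism by which the truncated loss limits the influence of outliers.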

Condition 1. For a loss functional ℓ over L²(X, V, μ), ∀φ, ψ ∈ Ω, the inequality

ℓ(ψ) ≥ ℓ(φ) + ⟨∇ℓ(φ), ψ − φ⟩ + (η/2)‖S(ψ − φ)‖² (30)

holds for a symmetric idempotent matrix S and some constant η > 0.
Clearly, if (28) holds, then (30) holds, in which case S is the identity matrix. However, Condition 1 does not necessarily imply strong convexity. The following conclusion is analogous to that for a strongly convex functional:

Proposition 7. If Condition 1 holds for a functional ℓ, then ∀ψ ∈ Ω the following inequality holds, where ψ* is a minimization point of ℓ:
Replacing S with the identity matrix, (31) simplifies to a standard result for strongly convex functionals (cf. p. 460 in Boyd and Vandenberghe [2]):

ℓ(ψ) − ℓ(ψ*) ≤ (1/(2η))‖∇ℓ(ψ)‖².

The next theorem describes the performance of Algorithm 6.

Theorem 7. Suppose a loss functional L(f) bounded below over L²(X, V, μ) is majorized by a ζ-strongly smooth surrogate H_k(f), and that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Given a starting point f^{(0)}, consider the two cases below: (i) Assume that H_k(f) is an η-strongly convex functional, and a step-size w_{k+1} is given by (33).

(ii) Assume that the functional H_{k+1} satisfies Condition 1, and a constant step-size is given.
After K iterations of Algorithm 6, the bound on L(f^{(K)}) − L(f*) is given by (34). Furthermore, utilizing Proposition 6 to assume that the errors between the surrogate and the objective are bounded as in (35), where β > 0 is a constant, the following convergence rate is obtained:

Notice that if the boosting base learner is a linear operator and the associated linear smoothing matrix is symmetric idempotent, then the step-size (33) reduces to a constant w_{k+1} = 1/ζ by Proposition 5. Grubb and Bagnell [13] developed a convergence result on strongly smooth and strongly convex loss functions for Algorithm 4. Compared with Theorem 3 in Grubb and Bagnell [13], the right-hand side of (34) has an additional second error term, a bound characterizing the errors between the surrogate loss and the objective introduced at each iteration. Regardless of whether L is nonconvex, a convergence rate can be obtained; the latter rate is the same as the boosting convergence rate for minimizing a ζ-strongly smooth and η-strongly convex functional [13]. While assumption (35) seems difficult to verify, it suggests that the convergence rate of Algorithm 6, and hence of Algorithm 5, relies on how tightly the surrogate loss approximates the objective. Intuitively, we should seek surrogate loss functions as close as possible to the original objective, yet easier to minimize than the latter. With Theorem 7, the following results concerning an iteration-varying step-size for the DC framework can be developed, again with the aid of Lemma 4.
Corollary 7.1. Suppose a loss functional L(f) takes the form (2) and h_k(f) is given by (3). Assume that ℓ(f) is a ζ-strongly smooth and η-strongly convex functional over L²(X, V, μ). Assume that Ω ⊂ L²(X, V, μ) has an edge γ ∈ (0, 1] for every negative gradient −∇H_k(f). Given a starting point f^{(0)} and a step-size w_{k+1} provided by (33), after K iterations of Algorithm 6 the bound on L(f^{(K)}) − L(f*) is given by: Furthermore, utilizing Proposition 6 to assume that the errors between the surrogate and the objective are bounded as in (35), where β > 0 is a constant, the following convergence rate is obtained:

Examples
In the previous sections, general convergence theories are provided for boosting algorithms including FGB, MMBA, DCBA and SMMBA. In this section, we apply these theories to some standard and robust loss functions in Table 1.
Logistic loss

Consider the logistic loss ℓ(y, f) = log(1 + exp(−yf)). The partial second derivative with respect to f is given by

∂²ℓ(y, f)/∂f² = 1/(exp(−yf) + 2 + exp(yf)) ≤ 1/4.

The last inequality is trivially obtained by substituting a = 1/exp(yf), b = exp(yf) in the inequality a + b ≥ 2√(ab). It is straightforward to argue that the logistic loss is 1/4-strongly smooth (cf. pp. 460-461 in Boyd and Vandenberghe [2]). Together with the other assumptions in the theorems, Theorems 1 and 2 thus hold for Algorithm 4 when applied to the logistic loss.
Note that, for fixed y, (38) implies that the second derivative vanishes as |f| → ∞. Hence the strong convexity condition in Corollary 7.1 is not satisfied for the logistic loss. However, a minor modification restricts the support to a compact set, which in turn forces the logistic loss to be strongly convex. To maintain numerical stability, it has been suggested to clamp the range of f such that f ∈ [−τ, τ] for some constant τ > 0 [12]. Therefore yf ∈ [−τ, τ] since y ∈ {−1, 1}, which leads to

∂²ℓ(y, f)/∂f² = 1/(exp(−yf) + 2 + exp(yf)) ≥ 1/(2 + 2 exp(τ)).

The last inequality is obtained from 1/exp(yf) ≤ exp(τ) and exp(yf) ≤ exp(τ).
Hence ℓ(y, f) is 1/(2 + 2 exp(τ))-strongly convex. In conjunction with the other assumptions, Corollary 7.1 thus holds for Algorithm 6 when applied to the truncated logistic loss and the logistic difference loss.
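The two curvature bounds for the clamped logistic loss can be checked numerically; τ below is an arbitrary illustrative value.

```python
import numpy as np

tau = 2.0
z = np.linspace(-tau, tau, 1001)    # z = y*f with f clamped to [-tau, tau]

# Second derivative of the logistic loss log(1 + exp(-z)) in z:
# exp(z) / (1 + exp(z))^2 = 1 / (exp(-z) + 2 + exp(z)).
d2 = 1.0 / (np.exp(-z) + 2.0 + np.exp(z))

assert np.all(d2 <= 0.25 + 1e-12)                             # zeta = 1/4
assert np.all(d2 >= 1.0 / (2.0 + 2.0 * np.exp(tau)) - 1e-12)  # eta bound
```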

Exponential loss
Consider the exponential loss ℓ(y, f) = exp(−yf). The second derivative is given by ∂²ℓ(y, f)/∂f² = exp(−yf). Therefore ℓ(y, f) is smooth and convex, but neither strongly smooth nor strongly convex. Again, we can clamp the range of f such that f ∈ [−τ, τ] for some constant τ > 0. Therefore yf ∈ [−τ, τ] as before, which leads to

exp(−τ) ≤ ∂²ℓ(y, f)/∂f² ≤ exp(τ).

With the restricted range f ∈ [−τ, τ], ℓ(y, f) is thus strongly smooth and strongly convex. Combining with the other assumptions, Theorems 1 and 2 hold for Algorithm 4 when applied to the restricted exponential loss.
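The analogous check for the clamped exponential loss, again with an arbitrary illustrative τ:

```python
import numpy as np

tau = 1.5
z = np.linspace(-tau, tau, 1001)   # z = y*f on the clamped range

d2 = np.exp(-z)                    # second derivative of exp(-z) in z

assert np.all(d2 <= np.exp(tau) + 1e-12)    # strongly smooth, zeta = exp(tau)
assert np.all(d2 >= np.exp(-tau) - 1e-12)   # strongly convex, eta = exp(-tau)
```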

A simulation study
In the simulated examples, we study the performance of the proposed robust boosting algorithms and compare them with standard non-robust boosting methods and existing robust boosting algorithms. Random data generation is similar to that in Wu and Liu [35] and Park and Liu [24]. Examples 1 and 2 are binary classification problems while Example 3 is a multi-class problem. These examples include both linear and nonlinear classifiers.
In each example, we also generate another 18 noise variables from uniform[−1, 1]. Different levels of outliers are introduced to the data: we randomly select v percent of the data and switch their class labels for the binary problems, or switch their class labels to other classes with equal probabilities in Example 3. We consider v = 0, 5, 10 and 20. We generated random training/tuning/test samples. Training data were used for model estimation, tuning samples were used to select the optimal boosting iteration in the DCBA, and test data were used to evaluate classification accuracy. For the binary classification problems, RobustBoost requires choosing two parameters, the error goal and the margin goal. The error goal is the proportion of outlier observations, which is easy to set in the simulated data [7]. The margin goal was chosen based on the tuning data. For Example 3, we also evaluated the multi-class BrownBoost and pre-specified the tuning parameter following McDonald et al. [20]. With the simulated data, the performance of an algorithm can be compared with the optimal Bayes error. In Examples 1 and 2, the training/tuning/test sample sizes are n = 200/200/10,000 from 100 random samples. In Example 3, the training/tuning/test sample sizes are 100/1,000/10,000 from 50 random samples.
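The label-contamination scheme can be sketched as follows; the exact sampling details (e.g., sampling without replacement) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def contaminate(y, v, n_classes=2):
    """Switch a fraction v of labels: sign flip for binary (y in {-1, 1}),
    or a uniformly chosen different class for multi-class (y in 0..K-1)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(v * len(y)), replace=False)
    if n_classes == 2:
        y[idx] = -y[idx]
    else:
        shift = rng.integers(1, n_classes, size=len(idx))
        y[idx] = (y[idx] + shift) % n_classes
    return y

y = rng.choice([-1, 1], size=1000)
y_cont = contaminate(y, v=0.10)
assert np.isclose(np.mean(y_cont != y), 0.10)
```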
The DCBA in Table 4 was evaluated. For Examples 1 and 3, componentwise linear least squares was the base learner, while componentwise smoothing splines were the base learner in Example 2. The following candidate values of the truncation location s were investigated: for the truncated exponential loss, s = 0, −log 2, −log 3; for the truncated hinge loss, s = 0, −1, −2; for the truncated logistic loss, s = 0, −log 3, −log 7; for the logistic difference loss, s = log 2, log 4, log 8, except s = log 16 for Example 2 with v = 0 and 5; for the truncated multi-class hinge loss, s = 0, 0.5, 1, 1.5, 2. With the candidate values of s described above, we only report the results from which the best prediction was obtained. The results are summarized in Tables 5 to 10. With no contamination, each DCBA and its corresponding non-robust boosting algorithm perform similarly. With data contaminated with outliers, the DCBA is more accurate than its non-robust counterpart in prediction and variable selection. Overall, the DCBA provides smaller misclassification errors and eliminates more noise variables. Compared to existing robust boosting algorithms, the DCBA outperforms RobustBoost in Example 1. This is also the case in Example 2, except for THingeBoost, which has slightly poorer accuracy than RobustBoost but maintains more parsimonious variable selection. In Example 3, the robust multi-class HingeBoost performs better than BrownBoost, with smaller errors and fewer selected variables. To further understand the dynamics of robust boosting with contaminated data, the DCBA is compared with its non-robust counterparts in Figures 3 to 6. For each DCBA, we plot the evolution of the misclassification error on the test data, as well as the number of variables selected, versus the boosting iteration. Clearly, the DCBAs perform better than their non-robust counterparts.
While we illustrate the patterns using data generated in Example 1 with 20% contamination, similar conclusions hold for the other examples as well as other data contamination scenarios (figures not shown).
We also evaluate the performance of the adaptive DCBA; we illustrate only selected results. For Example 2, we focus on variations of HingeBoost. The results are summarized in Tables 7 and 8 and Figure 7. The adaptive DCBA (ATHingeBoost) performs better than its robust cousin, THingeBoost; it also outperforms its adaptive non-robust cousin, AHingeBoost. ATHingeBoost generated more accurate classification and variable selection than the two competitors. For Example 3, the results from the adaptive multi-class DCBA are summarized in Tables 9 and 10. The adaptive robust boosting (AmTHingeBoost) performs better than either robust boosting (mTHingeBoost) or adaptive non-robust boosting (AmHingeBoost).

Analysis of healthcare costs
The Kids' Inpatient Database (KID) is part of a family of databases developed for the US Healthcare Cost and Utilization Project. The KID contains a sample of pediatric discharges from more than 4,100 U.S. community hospitals in 44 states, sampling 10% of uncomplicated in-hospital births and 80% of other pediatric cases. Acute kidney injury (AKI) is a common hospital complication with a rising incidence and a strong association with mortality and overall morbidity, especially in critically ill children. We predict inpatient costs for children with AKI who were ≤ 18 years old in the 2009 KID. Inclusion and exclusion criteria follow Sutherland et al. [28]. There is a total of n = 10,322 AKI cases. The cost of AKI inpatient care for a discharge can be estimated by multiplying the total hospital charge by the group-average all-payer inpatient cost/charge ratio. Although healthcare cost prediction is often considered a regression-like problem, there are merits to treating it as a classification problem; for instance, one can evaluate whether a risk factor improves classification accuracy. In this paper, hospital charges are dichotomized at a threshold of $50,000. Potential risk factors include patient demographics, characteristics of admission and hospitals, payer information, discharge status and more than 260 additional variables. The boosting algorithms in Table 4 were applied to the data. Two thirds of the data were randomly selected as training data, and the remaining test data were used to evaluate prediction accuracy. We used trees (stumps) as the base learner in the boosting algorithms. Different truncation locations were evaluated, and we illustrate only selected results: for truncated LogitBoost, s = −2; for difference LogitBoost, s = 3.3; for truncated AdaBoost, s = −1; and for truncated HingeBoost, s = −1.05.
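A toy illustration of the outcome construction described above, with made-up charges and cost/charge ratios:

```python
import numpy as np

# Estimated cost = total hospital charge x group-average all-payer
# inpatient cost/charge ratio; the classification outcome dichotomizes
# charges at $50,000. Charges and ratios below are hypothetical.
charges = np.array([12_000.0, 55_000.0, 230_000.0, 49_999.0])
ccr = np.array([0.42, 0.42, 0.38, 0.42])   # assumed cost/charge ratios

est_cost = charges * ccr
high_charge = (charges > 50_000).astype(int)

assert high_charge.tolist() == [0, 1, 1, 0]
```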
For the DCBA and its non-robust counterparts, Figures 8 to 11 plot the evolution of the misclassification error on the test data versus the iteration counter, as the algorithms proceed. These figures also plot the evolution of the number of variables selected versus the iteration counter on the training set. When boosting iterations were selected by five-fold cross-validation with the training data, the derived models generated the results in Table 11. The best prediction accuracy of 86% was obtained with the truncated AdaBoost (TAdaBoost). The truncated LogitBoost (TLogitBoost) was better than its non-robust counterpart LogitBoost, the difference LogitBoost (DLogitBoost) was comparable to LogitBoost, and the truncated HingeBoost (THingeBoost) was slightly better than HingeBoost. In addition, the DCBA provided sparser variable selection than its non-robust counterparts. For instance, while more accurate in classification, TAdaBoost selected 84 risk factors compared to 127 with AdaBoost. With truncated loss functions, the advantage of the DCBA in variable selection is clearly demonstrated in Figures 8 to 11. We used one data realization to illustrate the difference between the DCBA and boosting, particularly in Figures 8 to 11; multiple random data realizations showed similar conclusions, and those results are not presented here. We also performed RobustBoost on the data. We considered different error goals and chose the optimal margin goal based on five-fold cross-validation. The best test error and the associated variable selection are presented in Table 11. Except for the truncated HingeBoost, the DCBA is more accurate than RobustBoost. In general, RobustBoost selected more risk factors in the predictive model.

Classifying breast cancer clinical status
The data contain 278 breast cancer samples with 164 estrogen receptor positive cases [27]. Following Zhang et al. [37] and Li and Bradic [16], we conducted a pre-screening procedure to reduce the computational demand: from 22,283 genes, the dataset was reduced to the 3,000 genes with the largest absolute values of the two-sample t-statistics. The remaining genes were then standardized. We randomly split the data into training and test data. The training data consist of 50/50 positive/negative estrogen receptor cases, and the rest form the test data. The positive/negative outcome status in the training data was randomly exchanged with percentage v = 0, 5, 10, 15. We used componentwise linear least squares as the base learner in the boosting algorithms. While different truncation location parameters were evaluated, we report only results from selected truncation parameters due to space limitations: for TLogitBoost, DLogitBoost, TAdaBoost and THingeBoost, the truncation location s = −0.2, 0.8, −0.2, −0.5, respectively. Five-fold cross-validation was implemented to select the optimal boosting iterations with the training data. The above data generation and analysis were repeated 100 times. The results are summarized in Tables 12 and 13. For comparison, the test errors of RobustBoost in Li and Bradic [16] are also reproduced in Table 12; note that the variable selection results of RobustBoost were not presented in that paper. For non-contaminated data, all methods show similar prediction while hinge loss based boosting algorithms select more variables. The same dataset was studied in Zhang et al. [37] and Li and Bradic [16], where the best test error among ten methods was 0.0945, larger than that of six methods in Table 12 with v = 0. For contaminated data with outliers, the truncated loss functions improve prediction accuracy. The DCBA has a better test error than its non-robust boosting counterpart; for instance, with 15% outliers, THingeBoost has test error 0.1197 compared to 0.1379 with HingeBoost.
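The pre-screening step can be sketched as below, with shrunken dimensions and simulated expression values; the paper does not specify which two-sample t-statistic variant was used, so a pooled-variance form is assumed here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical expression matrix: 278 samples x 500 genes (the paper
# screens 22,283 genes down to 3,000; sizes here are shrunk).
n, g, keep = 278, 500, 50
Xg = rng.normal(size=(n, g))
er_pos = np.zeros(n, dtype=bool)
er_pos[:164] = True                      # 164 ER-positive cases

a, b = Xg[er_pos], Xg[~er_pos]
na, nb = a.shape[0], b.shape[0]
# Two-sample t-statistic per gene (pooled-variance form, an assumption).
sp2 = ((na - 1) * a.var(axis=0, ddof=1)
       + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))

top = np.argsort(-np.abs(t))[:keep]      # genes with largest |t|
Xs = (Xg[:, top] - Xg[:, top].mean(axis=0)) / Xg[:, top].std(axis=0)

assert Xs.shape == (n, keep)
```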
As the level of outliers increases, classification becomes more difficult and more variables are selected. The robust methods in Li and Bradic [16] found best test errors of 0.1116 (v=5), 0.1217 (v=10) and 0.1329 (v=15), respectively. Clearly, the proposed TLogitBoost and THingeBoost have better classification accuracy.

Discussion
The results from the simulation studies and data examples considered in this paper strongly suggest that robust gradient boosting is an important alternative to traditional boosting in many applications. In particular, the numerical results suggest that robust boosting can provide more accurate classification and variable selection.
In this paper, the convergence analysis is centered on smooth loss functions. However, the hinge loss and truncated hinge loss are both nonsmooth. There is a lack of convergence analysis for such loss functions in the functional gradient algorithm, Algorithm 4; indeed, Algorithm 4 may break down for nonsmooth loss functions [13]. Further analysis of the proposed DCBA for nonsmooth loss functions remains to be done. Asymptotic theories of various boosting algorithms have been elaborated in the literature; for instance, see the early review by Bühlmann and Hothorn [3]. However, only limited asymptotic theories have been developed for Algorithm 4 despite its wide applications, particularly in the statistical community [3,19]. Most previous results have focused on special loss functions such as the least squares loss [3], or special algorithms like AdaBoost. As an extension of Algorithm 4, asymptotic consistency of Algorithms 1 and 5 remains a future research topic. Theorem 7 provides a convergence rate analysis for Algorithm 6; it is also interesting to investigate the convergence speed of Algorithms 1 and 5.
The gradient descent view of boosting, with the aid of difference of convex programming, allows us to design boosting algorithms for a variety of robust loss functions. This new approach can be extended to other nonconvex loss functions, including but not limited to truncated losses. We anticipate that more innovative boosting algorithms will be developed by coupling with other MM algorithms in addition to the difference of convex algorithm.

Hence f*(x) cannot be 0, and therefore f*(x) > 0. In conclusion, f*(x) has the same sign as p(x) − 1/2.

Proof of Proposition 3:
For the diagonal block matrix S defined by (16), S is a symmetric block matrix whose diagonal blocks are A_j and whose off-diagonal blocks are matrices of zeros. In addition, each A_j is an n_j × n_j sub-matrix whose every entry is 1/n_j; therefore each A_j is symmetric, and so is S. Furthermore, it can be easily shown that A_j, j = 1, ..., J, are idempotent (cf. Searle [26], p. 322). Thus S² = S, i.e., S is idempotent. Since S is a block diagonal matrix, its eigenvalues are just the list of the eigenvalues of each A_j. For each A_j, the eigenvalues are 1 and 0, with multiplicities 1 and n_j − 1, respectively (cf. Searle [26], p. 322). Therefore the eigenvalues of S are 1 and 0, with multiplicities J and Σ_{j=1}^{J}(n_j − 1) = n − J, respectively. The proof of Proposition 4 is the same as for Proposition 3.
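The symmetry, idempotency and eigenvalue claims can be verified numerically for toy block sizes:

```python
import numpy as np

# S is block diagonal with diagonal blocks A_j, each an n_j x n_j matrix
# whose every entry is 1/n_j (toy sizes; the structure is what matters).
sizes = [3, 4, 5]
n = sum(sizes)
S = np.zeros((n, n))
off = 0
for nj in sizes:
    S[off:off + nj, off:off + nj] = 1.0 / nj
    off += nj

assert np.allclose(S, S.T)       # symmetric
assert np.allclose(S @ S, S)     # idempotent: each A_j^2 = A_j

eig = np.linalg.eigvalsh(S)
# Eigenvalue 1 with multiplicity J = 3; eigenvalue 0 with n - J = 9.
assert int(np.sum(np.isclose(eig, 1.0))) == 3
assert int(np.sum(np.isclose(eig, 0.0))) == 9
```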

Proof of Proposition 5:
(i) Since g = SU and S is positive semidefinite, we have ⟨U, g⟩ = ⟨U, SU⟩ = UᵀSU ≥ 0.
(ii) Since S is symmetric idempotent, S is positive semidefinite. Thus, as in (i), Ω has an edge γ ∈ [0, 1] for every U. Finally, for the S defined above, it is simple to prove the following equality: