adabag: An R Package for Classification with Boosting and Bagging

Boosting and bagging are two widely used ensemble methods for classification. Their common goal is to improve the accuracy of a classifier by combining single classifiers which are slightly better than random guessing. Among the family of boosting algorithms, AdaBoost (adaptive boosting) is the best known, although it is suitable only for dichotomous tasks. AdaBoost.M1 and SAMME (stagewise additive modeling using a multi-class exponential loss function) are two easy and natural extensions to the general case of two or more classes. In this paper, the adabag R package is introduced. This version implements the AdaBoost.M1, SAMME and bagging algorithms with classification trees as base classifiers. Once the ensembles have been trained, they can be used to predict the class of new samples. The accuracy of these classifiers can be estimated on a separate data set or through cross validation. Moreover, the evolution of the error as the ensemble grows can be analysed and the ensemble can be pruned. In addition, the margin in the class prediction and the probability of each class for the observations can be calculated. Finally, several classic examples from the classification literature are shown to illustrate the use of this package.


Introduction
During the last decades, several new ensemble classification methods based on the tree structure have been developed. In this paper, the package adabag for R (R Core Team 2013) is described, which implements two of the most popular methods for ensembles of trees: boosting and bagging. The main difference between these two ensemble methods is that while boosting constructs its base classifiers in sequence, updating a distribution over the training examples to create each base classifier, bagging (Breiman 1996) combines the individual classifiers built on bootstrap replicates of the training set. Hence, in boosting the base classifier of each iteration depends on all the previous ones through the weight updating process, whereas in bagging they are independent. In addition, the final boosting ensemble uses a weighted majority vote while bagging uses a simple majority vote.

Boosting
Boosting is a method that makes maximum use of a classifier by improving its accuracy. The classifier method is used as a subroutine to build an extremely accurate classifier on the training set. Boosting applies the classification system repeatedly on the training data, but in each step the learning attention is focused on different examples of this set using adaptive weights, $w_b(i)$. Once the process has finished, the single classifiers obtained are combined into a final, highly accurate classifier on the training set. The final classifier therefore usually achieves a high degree of accuracy on the test set, as various authors have shown both theoretically and empirically (Banfield, Hall, Bowyer, and Kegelmeyer 2007; Bauer and Kohavi 1999; Dietterich 2000; Freund and Schapire 1997).
Even though there are several versions of the boosting algorithms (Schapire and Freund 2012; Friedman, Hastie, and Tibshirani 2000), the best known is AdaBoost (Freund and Schapire 1996). However, it can only be applied to binary classification problems. Among the versions of boosting algorithms for multiclass classification problems (Mukherjee and Schapire 2011), two of the simplest and most natural extensions of AdaBoost have been chosen, which are called AdaBoost.M1 and SAMME.
Firstly, the AdaBoost.M1 algorithm can be described as follows. Given a training set $T_n = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$ where $y_i$ takes values in $\{1, 2, \ldots, k\}$, the weight $w_b(i)$ is assigned to each observation $x_i$ and is initially set to $1/n$. This value will be updated after each step. A basic classifier, $C_b(x_i)$, is built on the modified training set ($T_b$) and is applied to every training example. The error of this classifier is represented by $e_b$ and is calculated as

$$e_b = \sum_{i=1}^{n} w_b(i) \, I(C_b(x_i) \neq y_i) \qquad (1)$$

where $I(\cdot)$ is the indicator function which outputs 1 if the inner expression is true and 0 otherwise.
From the error of the classifier in the $b$-th iteration, the constant $\alpha_b$ is calculated and used for weight updating. Specifically, according to Freund and Schapire, $\alpha_b = \ln((1 - e_b)/e_b)$. However, Breiman (1998) uses $\alpha_b = 1/2 \ln((1 - e_b)/e_b)$. In any case, the new weight for the $(b+1)$-th iteration will be

$$w_{b+1}(i) = w_b(i) \exp\left(\alpha_b \, I(C_b(x_i) \neq y_i)\right) \qquad (2)$$

Then, the calculated weights are normalized to sum to one. Consequently, the weights of the wrongly classified observations are increased, and the weights of the rightly classified ones are decreased, forcing the classifier built in the next iteration to focus on the hardest examples. Moreover, differences in weight updating are greater when the error of the single classifier is low, because if the classifier achieves a high accuracy then the few mistakes take on more importance. Therefore, the alpha constant can be interpreted as a learning rate calculated as a function of the error made in each step. Moreover, this constant is also used in the final decision rule, giving more importance to the individual classifiers that made a lower error.

The process is repeated for $b = 1, 2, \ldots, B$, and the final decision rule combines the single classifiers weighted by their alpha constants:

$$C_f(x_i) = \arg\max_{j \in \{1, \ldots, k\}} \sum_{b=1}^{B} \alpha_b \, I(C_b(x_i) = j) \qquad (3)$$

Table 1 shows the complete AdaBoost.M1 pseudocode. SAMME, the second boosting algorithm implemented here, is summarized in Table 2. It is worth mentioning that the only difference between this algorithm and AdaBoost.M1 is the way in which the alpha constant is calculated, because the number of classes is taken into account in this case. Due to this modification, the SAMME algorithm only needs $1 - e_b > 1/k$ in order for the alpha constant to be positive, so that the weight updating follows the right direction. That is to say, the accuracy of each weak classifier should be better than random guessing ($1/k$) instead of $1/2$, which would be an appropriate requirement for the two-class case but very demanding for the multi-class one.
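To make the weight-updating mechanics concrete, the following is a minimal sketch of the AdaBoost.M1 loop using rpart stumps as base classifiers, in the spirit of the boos = FALSE variant described later. The function adaboost_m1_sketch and all its internals are illustrative assumptions, not the adabag implementation.

library(rpart)

# Minimal AdaBoost.M1 sketch (illustrative only, not the adabag code)
adaboost_m1_sketch <- function(formula, data, B = 10) {
  n <- nrow(data)
  y <- model.response(model.frame(formula, data))
  w <- rep(1 / n, n)                       # initial weights w_1(i) = 1/n
  trees <- vector("list", B)
  alpha <- numeric(B)
  for (b in 1:B) {
    trees[[b]] <- rpart(formula, data = data, weights = w,
                        control = rpart.control(maxdepth = 1))
    pred <- predict(trees[[b]], data, type = "class")
    e <- sum(w * (pred != y))              # weighted error e_b, Equation 1
    e <- min(max(e, 0.001), 0.499)         # keep e_b inside (0, 0.5)
    alpha[b] <- log((1 - e) / e)           # Freund and Schapire's alpha
    w <- w * exp(alpha[b] * (pred != y))   # weight update, Equation 2
    w <- w / sum(w)                        # normalize to sum one
  }
  list(trees = trees, alpha = alpha)
}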

Bagging
Bagging is a method that combines bootstrapping and aggregating (Table 3). If the bootstrap estimate of the data distribution parameters is more accurate and robust than the traditional one, then a similar method can be used to achieve, after combining the single classifiers, a better classifier. The procedure is as follows:

1. Repeat for $b = 1, 2, \ldots, B$:
   a) Take a bootstrap replicate $T_b$ of the training set $T_n$.
   b) Construct a single classifier $C_b(x_i)$ in $T_b$.
2. Combine the basic classifiers $C_b(x_i)$, $b = 1, 2, \ldots, B$, by majority vote (the most often predicted class) into the final decision rule.

On the basis of the training set ($T_n$), $B$ bootstrap samples ($T_b$) are obtained, where $b = 1, 2, \ldots, B$. These bootstrap samples are obtained by drawing with replacement the same number of elements as the original set ($n$ in this case). In some of these bootstrap samples, the presence of noisy observations may be eliminated or at least reduced (as there is a lower proportion of noisy than non-noisy examples), so the classifiers built on these sets will behave better than the classifier built on the original set. Therefore, bagging can be really useful to build a better classifier when there are noisy observations in the training set.
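The procedure above can be sketched in a few lines of R; bag_sketch and predict_bag_sketch are hypothetical names for illustration, not adabag functions.

library(rpart)

# Minimal bagging sketch (illustrative only, not the adabag code)
bag_sketch <- function(formula, data, B = 50) {
  lapply(1:B, function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap replicate T_b
    rpart(formula, data = boot)                         # single classifier C_b
  })
}

# Combine the B trees by simple majority vote
predict_bag_sketch <- function(trees, newdata) {
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}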
The ensemble usually achieves better results than the single classifiers used to build the final classifier. This can be understood since combining the basic classifiers also combines the advantages of each one in the final ensemble.

The margin
In the boosting literature, the concept of margin (Schapire, Freund, Bartlett, and Lee 1998) is important. The margin of an object is intuitively related to the certainty of its classification and is calculated as the difference between the support of the correct class and the maximum support of an incorrect class. For $k$ classes, the margin of an example $x_i$ is calculated using the votes for every class $j$ in the final ensemble, which are known as the degree of support of the different classes or posterior probabilities $\mu_j(x_i)$, $j = 1, 2, \ldots, k$, as

$$m(x_i) = \mu_c(x_i) - \max_{j \neq c} \mu_j(x_i) \qquad (4)$$

where $c$ is the correct class of $x_i$ and $\sum_{j=1}^{k} \mu_j(x_i) = 1$. All the wrongly classified examples will therefore have negative margins and the correctly classified ones will have positive margins. Correctly classified observations with a high degree of confidence will have margins close to one. On the other hand, examples with an uncertain classification will have small margins, that is to say, margins close to zero. Since a small margin is an instability symptom in the assigned class, the same example could be assigned to different classes by similar classifiers. For visualization purposes, Kuncheva (2004) uses margin distribution graphs showing the cumulative distribution of the margins for a given data set. The x-axis is the margin ($m$) and the y-axis is the number of points whose margin is less than or equal to $m$. Ideally, all points should be classified correctly so that all the margins are positive. If all the points have been classified correctly and with the maximum possible certainty, the cumulative graph will be a single vertical line at $m = 1$.
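The margin is straightforward to compute from the matrix of class supports; the following helper, margin_sketch, is a hypothetical illustration of Equation 4 (adabag provides this through its margins function, described below).

# Sketch: margins from an n x k matrix mu of class supports (rows sum to one)
# and a vector y giving the column index of the correct class of each row
margin_sketch <- function(mu, y) {
  sapply(seq_len(nrow(mu)), function(i)
    mu[i, y[i]] - max(mu[i, -y[i]]))  # correct support minus best wrong support
}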

Functions
In this section, the functions of the adabag package in R are explained. As mentioned before, it implements AdaBoost.M1, SAMME and bagging with classification and regression trees (CART, Breiman, Friedman, Olshen, and Stone 1984) as base classifiers, using the rpart package (Therneau, Atkinson, and Ripley 2013). Here boosted or bagged trees are used, although these algorithms could be used with other base classifiers. Therefore, in the remainder of the paper boosting and bagging refer to boosted trees and bagged trees.
This package consists of a total of eight functions: three for each method, plus margins and errorevol. The three functions for each method are: one to build the boosting (or bagging) classifier and classify the samples in the training set; one to predict the class of new samples using the previously trained ensemble; and, lastly, one to estimate the accuracy of these classifiers in a data set by cross validation. Finally, the margin in the class prediction for each observation and the evolution of the error can be calculated.
The boosting, predict.boosting and boosting.cv functions

As aforementioned, there are three functions with regard to the boosting method in the adabag package. Firstly, the boosting function enables the user to build an ensemble classifier using AdaBoost.M1 or SAMME and assign a class to the training samples. Any function in R requires a set of initial arguments to be fixed; in this case, there are six of them. The formula, as in the lm function, spells out the dependent and independent variables. The data frame in which to interpret the variables named in the formula collects the data to be used for training the ensemble. If the logical parameter boos is TRUE (the default), a bootstrap sample of the training set is drawn using the weight of each observation in that iteration. If boos is FALSE, every observation is used with its weight. The integer mfinal sets the number of iterations for which boosting is run, or the number of trees to use (by default, mfinal = 100 iterations).
If the argument coeflearn = "Breiman" (the default), then $\alpha_b = 1/2 \ln((1 - e_b)/e_b)$ is used. However, if coeflearn = "Freund", then $\alpha_b = \ln((1 - e_b)/e_b)$ is used. In both cases the AdaBoost.M1 algorithm is applied and $\alpha_b$ is the weight updating coefficient. On the other hand, if coeflearn = "Zhu", the SAMME algorithm is used with $\alpha_b = \ln((1 - e_b)/e_b) + \ln(k - 1)$. As mentioned above, the error of the single trees must be in the range $(0, 0.5)$ for AdaBoost.M1, and in the range $(0, 1 - 1/k)$ for SAMME. When these assumptions are not fulfilled, Opitz and Maclin (1999) reset all the weights to be equal, choose appropriate values to weight these trees and restart the process. The same solution is applied here: when $e_b = 0$, the alpha constant is calculated using $e_b = 0.001$, and when $e_b \geq 0.5$ ($e_b \geq 1 - 1/k$ for SAMME), it is replaced by 0.499 ($0.999 - 1/k$, respectively).
Lastly, the option control, which regulates details of the rpart function, is also transferred to boosting, especially to limit the size of the trees in the ensemble. See rpart.control for more details.
Upon applying boosting and training the ensemble, this function outputs an object of class boosting, which is a list with seven components. The first one is the formula used to train the ensemble. The second contains the trees grown along the iterations. The weights vector gives the weights of the trees. The votes matrix describes, for each observation, the number of trees that assigned it to each class, weighting each tree by its alpha coefficient. The prob matrix shows, for each observation, an approximation to the posterior probability of each class; this estimation of the probabilities is calculated using the proportion of votes, or support, in the final ensemble. The class vector is the class predicted by the ensemble classifier. Finally, the importance vector returns the relative importance or contribution of each variable to the classification task.
It is worth highlighting that the boosting function allows the relative importance of the predictor variables to be quantified. Understanding a small individual tree can be easy. However, it is much more difficult to interpret the hundreds or thousands of trees used in a boosting ensemble. Therefore, being able to quantify the contribution of the predictor variables to the discrimination is a really important advantage. The measure of importance takes into account the gain of the Gini index given by a variable in a tree and, in the case of boosting, the weight of this tree. For this purpose, the varImp function of the caret package is used to obtain the gain of the Gini index of the variables in each tree.
The well-known iris data set is used as an example to illustrate the use of adabag. When the package is loaded using library("adabag"), the required packages rpart, caret and mlbench (Leisch and Dimitriadou 2012) are automatically loaded as well.
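The code for this example is not reproduced in the extracted text, so the following is a sketch of how it could look; the per-class split of iris into halves and the object names are assumptions (no random seed is given, so exact numbers may differ).

R> train <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
R> iris.adaboost <- boosting(Species ~ ., data = iris[train, ], mfinal = 10,
+    control = rpart.control(maxdepth = 1))
R> 1 - sum(iris.adaboost$class == iris$Species[train]) / length(train)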
The dataset is randomly divided into two equal-size sets. The training set is classified with a boosting ensemble of 10 trees of maximum depth 1 (stumps), as in the code above. The list returned consists of the formula used, the 10 small trees and the weights of the trees. The lower the error of a tree, the higher its weight; in other words, the lower the error of a tree, the more relevant it is in the final ensemble. In this case, there is a tie among the four trees with the highest weights (numbers 3, 4, 5 and 6). On the other hand, tree number 9 has the lowest weight (0.175). The matrices votes and prob show the weighted votes and probabilities that each observation (rows) receives for each class (columns). For instance, the first observation receives 2.02 votes for the first class, 0.924 for the second and 0 for the third. Therefore, the probabilities for each class are 68.62%, 31.38% and 0%, respectively, and the assigned class is "setosa", as can be seen in the class vector.
The training error of this ensemble is 0.0533. The second function, predict.boosting, matches the arguments of the generic function predict and adds one more: (object, newdata, newmfinal = length(object$trees), ...). It classifies a data frame using a fitted model object of class boosting. This is assumed to be the result of some function that produces an object with the same named components as those returned by the boosting function. The newdata argument is a data frame containing the values for which predictions are required; the predictors referred to on the right side of formula(object) must be present here under the same names. This way, predictions beyond the training data can be made. The newmfinal option fixes the number of trees of the boosting object to be used in the prediction, which allows the user to prune the ensemble. By default, all the trees in the object are used. Lastly, the three dots deal with further arguments passed to or from other methods.
On the other hand, this function returns an object of class predict.boosting, which is a list with the following six components. Four of them, formula, votes, prob and class, have the same meaning as in the boosting output. In addition, the confusion matrix compares the real class with the predicted one and, lastly, the average error is computed.
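Continuing the iris sketch above (object names are illustrative), the other half of the data can be classified with the trained ensemble:

R> iris.predboosting <- predict.boosting(iris.adaboost,
+    newdata = iris[-train, ])
R> iris.predboosting$confusion
R> iris.predboosting$error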
The third function, boosting.cv, has the same arguments as boosting plus one more, an integer v, specifying the type of v-fold cross validation. The default value is 10. If, on the contrary, v is set to the number of observations, leave-one-out cross validation is carried out. Besides this, every value between two and the number of observations is valid and means that one out of every v observations is left out. An object of class boosting.cv is supplied, which is a list with three components, class, confusion and error, which have been previously described for the predict.boosting function.
Therefore, cross validation can be used to estimate the error of the ensemble without dividing the available data set into training and test subsets. This is specially advantageous for small data sets.
Returning to the iris example, 10-fold cross validation is applied, maintaining the number and size of the trees. The output components are a vector with the assigned class, the confusion matrix and the mean error. In this case, there are four virginica flowers classified as versicolor and three versicolor classified as virginica, so the estimated error reaches 4.67%.
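A sketch of the call producing these results (the object name is an assumption):

R> iris.boostcv <- boosting.cv(Species ~ ., data = iris, v = 10, mfinal = 10,
+    control = rpart.control(maxdepth = 1))
R> iris.boostcv$confusion
R> iris.boostcv$error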
The bagging, predict.bagging and bagging.cv functions

The bagging function, analogously to boosting, produces an object of class bagging, which is a list with almost the same components as boosting. The only difference is that it stores the matrix with the bootstrap samples used along the iterations instead of the weights of the trees.
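Since the margins section below refers to a bagging classifier built for iris, here is a sketch of how it could be built and applied to the test set (object names are illustrative):

R> iris.bagging <- bagging(Species ~ ., data = iris[train, ], mfinal = 10,
+    control = rpart.control(maxdepth = 1))
R> iris.predbagging <- predict.bagging(iris.bagging, newdata = iris[-train, ])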

The margins and errorevol functions
Owing to its aforementioned importance, the margins function was added to adabag to calculate the margins of a boosting or bagging classifier for a data frame, as previously defined in Equation 4. Its usage is very easy, with only two arguments:

1. object. This object must be the output of one of the functions bagging, boosting, predict.bagging or predict.boosting. It is assumed to be the result of some function that produces an object with at least two components, named formula and class, as those returned for instance by the bagging function.
2. newdata. The same data frame used for building the object.
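For instance, the margins of the bagging classifier built in the previous section could be obtained for the training and test sets as follows (a sketch; [[1]] extracts the single vector of the output list):

R> margins.train <- margins(iris.bagging, iris[train, ])[[1]]
R> margins.test <- margins(iris.predbagging, iris[-train, ])[[1]]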
The output, an object of class margins, is a simple list with only one vector containing the margins. Examples with negative margins are those which have been wrongly classified; as stated before, there were two and four such cases in the training and test sets, respectively. The margin cumulative distribution graphs can be drawn with the code below.

R> plot(sort(margins.train), (1:length(margins.train)) /
+    length(margins.train), type = "l", xlim = c(-1, 1),
+    main = "Margin cumulative distribution graph", xlab = "m",
+    ylab = "% observations", col = "blue3", lwd = 2)
R> abline(v = 0, col = "red", lty = 2, lwd = 2)
R> lines(sort(margins.test), (1:length(margins.test)) / length(margins.test),
+    type = "l", cex = 0.5, col = "green", lwd = 2)
R> legend("topleft", c("test", "train"), col = c("green", "blue"), lty = 1,
+    lwd = 2)

In the current version of the package, the last function is errorevol, which calculates the error evolution of an AdaBoost.M1, AdaBoost-SAMME or bagging classifier for a data frame as the ensemble size grows. This function has the same two arguments as margins and outputs a list with only the vector of the error evolution. This information can be useful to see how fast bagging and boosting reduce the error of the ensemble. In addition, it can detect the presence of overfitting and, in those cases, the convenience of pruning the ensemble using the relevant predict function. Although, due to the small size of the ensembles, this example is not the most appropriate case to show the usefulness of the error evolution, it is used here for demonstrative purposes.
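A sketch of its use with the iris bagging ensemble follows; the object names are illustrative and the output component holding the vector is assumed to be called error.

R> evol.train <- errorevol(iris.bagging, iris[train, ])
R> evol.test <- errorevol(iris.bagging, iris[-train, ])
R> plot(evol.test$error, type = "l", main = "Bagging error versus number of trees",
+    xlab = "Iterations", ylab = "Error", col = "red")
R> lines(evol.train$error, col = "blue")
R> legend("topright", c("test", "train"), col = c("red", "blue"), lty = 1)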

Examples
The package adabag is now illustrated in more depth through two other classification examples. The first example shows a dichotomous classification problem using a simulated dataset previously used by Hastie, Tibshirani, and Friedman (2001, pp. 301-309) and Culp, Johnson, and Michailidis (2006). The second example is a four-class data set from the UCI repository of machine learning databases (Bache and Lichman 2013). This example and iris are available in the base and mlbench R packages.

A dichotomous example
The two-class simulated data set consists of ten standard independent Gaussians, which are used as features, and the two classes defined as

$$y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} x_j^2 > 9.34 \\ -1 & \text{otherwise} \end{cases}$$

Here the value 9.34 is the median of a Chi-squared random variable with 10 degrees of freedom (the sum of squares of 10 standard Gaussians). There are 2000 training and 10000 test cases, approximately balanced. In order to enhance the comparison, stumps are also used as weak classifiers and 400 iterations are run, as in Hastie et al. (2001) and Culp et al. (2006). Thus, there are only two arguments of the boosting function to be set: coeflearn ("Breiman", "Freund" or "Zhu") and boos (TRUE or FALSE). Since for k = 2 classes the Freund and Zhu options are completely equivalent, the latter is not considered in this example. The code for running the four possible combinations of parameters is available in the additional file, and a summary of the main results is shown below.
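Although the full code is in the additional file, a minimal sketch of how one such combination could be run is given here; the data generation and all object names are assumptions.

R> n <- 12000; p <- 10
R> x <- matrix(rnorm(n * p), ncol = p)
R> y <- factor(ifelse(rowSums(x^2) > 9.34, 1, -1))  # median of chi-squared(10)
R> simdata <- data.frame(y = y, x)
R> train <- 1:2000                                  # 2000 train, 10000 test
R> sim.boost <- boosting(y ~ ., data = simdata[train, ], mfinal = 400,
+    boos = TRUE, coeflearn = "Breiman",
+    control = rpart.control(maxdepth = 1))         # stumps
R> sim.pred <- predict.boosting(sim.boost, newdata = simdata[-train, ])
R> sim.pred$error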

A multiclass example
In order to show the usefulness of this package in solving problems beyond the dichotomous case, a four-class example is used in this section. This problem helps to illustrate the use of the functions in problems considerably more complex than the iris one. Specifically, the Vehicle data set drawn from the UCI repository is employed. This data set contains 846 samples and the purpose is to classify a given vehicle silhouette as one of four types (bus, opel, saab and van), using a set of 18 numerical features extracted from the silhouette.
With the aim of showing the advantages of using ensembles, results from a single tree are compared to those obtained with boosting and bagging. The single tree is built by means of the rpart package, which is automatically loaded with the adabag package. Due to the high number of classes, trees with a maximum depth of 5 are used both for the single tree and for the ensembles. The ensembles consist of 50 trees.
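A sketch of one run of the comparison follows; the two-thirds training split and the object names are assumptions (Vehicle is provided by the mlbench package, loaded with adabag).

R> data("Vehicle")
R> sub <- sample(1:nrow(Vehicle), 2 * nrow(Vehicle) / 3)
R> Vehicle.rpart <- rpart(Class ~ ., data = Vehicle[sub, ],
+    control = rpart.control(maxdepth = 5))
R> Vehicle.rpart.pred <- predict(Vehicle.rpart, newdata = Vehicle[-sub, ],
+    type = "class")
R> 1 - sum(Vehicle.rpart.pred == Vehicle$Class[-sub]) /
+    length(Vehicle.rpart.pred)
R> Vehicle.adaboost <- boosting(Class ~ ., data = Vehicle[sub, ], mfinal = 50,
+    control = rpart.control(maxdepth = 5))
R> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ], mfinal = 50,
+    control = rpart.control(maxdepth = 5))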
In order to compare the error of the four methods, 50 iterations are run and the mean result is calculated. On the other hand, to analyze some characteristics of the ensembles, such as the relative importance of the variables and the margins, a specific model is necessary, so the model which achieves the minimum error is used.

Summary and concluding remarks
In this paper, the R package adabag, consisting of eight functions, is described. This package implements the AdaBoost.M1, SAMME and bagging algorithms with CART trees as base classifiers, and is capable of handling multiclass tasks. The trained ensembles can be used for prediction on new data through the generic predict function. Cross validation estimations of the accuracy can also be obtained for these classifiers. In addition, the evolution of the error as the ensemble grows can be analysed. This can help to detect overfitting and, on the other hand, to see whether the ensemble has not been developed enough and should keep growing. In the former case, the classifier can be pruned without rebuilding it from scratch, by selecting a number of iterations of the current ensemble. Furthermore, not only the predicted class is provided, but also its margin and an approximation to the probability of all the classes.
The main functionalities of the package are illustrated here by applying them to three well-known datasets from the classification literature. The similarities and differences among the three algorithms implemented in the package are also discussed. Finally, the addition of some of the plots used here to the package, with the aim of increasing the interpretability of the results, is part of future work.