LPBoost with Strong Classifiers

The goal of a boosting algorithm is to maximize the minimum margin on the sample set. By minimax theory, this goal can be converted into minimizing the maximum edge. This idea motivates LPBoost and its variants (including TotalBoost, SoftBoost, and ERLPBoost), which solve the optimization problem by linear programming. These algorithms ignore the strong classifier and only minimize the maximum edge of the weak classifiers, so that every weak classifier's edge is at most γ. This paper shows that the edge of the strong classifier may be higher than the maximum edge of the weak classifiers, and proposes a novel boosting algorithm that introduces the strong classifier into the optimization problem and constrains the edges of both weak and strong classifiers to be no more than γ. Furthermore, we justify the reasonableness of introducing the strong classifier using minimax theory. We compare our algorithm with other approaches, including AdaBoost, LPBoost, TotalBoost, SoftBoost, and ERLPBoost, on UCI benchmark datasets. In simulation studies we show that our algorithm converges faster than SoftBoost and ERLPBoost. In a benchmark comparison we illustrate the competitiveness of our approach in terms of time consumption and generalization error.


Introduction
Boosting algorithms have shown considerable success in many fields, such as OCR (optical character recognition), face recognition, ranking and recommendation, text classification, and natural language processing. Boosting originated from PAC 1,2 (Probably Approximately Correct) learning theory. Kearns and Valiant (1989) posed the boosting conjecture in the framework of PAC learning: a weak classifier (with success probability just a bit over 50%) can be boosted into a strong one (strong classifier), in the sense that the training error of the combined classifier goes to zero within polynomial run time.
The AdaBoost algorithm, proposed by Freund and Schapire 3,4,5, is an efficient stagewise-optimal method that boosts a series of simple, weak learners into one strong learner. The corrective update of the sample distribution in AdaBoost can be viewed as minimizing the relative entropy of the current sample distribution to the uniform distribution, subject to the linear constraint that the edge of the last hypothesis is zero 6,7. One of the important properties of AdaBoost is that it has a decent iteration bound and approximately maximizes the margin of the examples 8. Similar algorithms include LogitBoost 9 and AdaBoost*_ν 10, all of which can be viewed as the "corrective" family of boosting algorithms that enforce only a single constraint at each iteration 6 (the edge of the last hypothesis must be at most γ, where γ is adapted).
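As an illustration of this corrective view (our own sketch in Python, not code from any of the cited papers), the update multiplies each d_i by exp(−α u_i), with u_i = y_i h_t(x_i), and renormalizes; with AdaBoost's step size α = ½ ln((1+γ)/(1−γ)), the last hypothesis's edge under the new distribution is driven exactly to zero. The 4-sample toy data and function name are illustrative only.

```python
import numpy as np

def corrective_update(d, u, alpha):
    """One AdaBoost-style corrective step.

    d     : current distribution over samples
    u     : u[i] = y_i * h_t(x_i) for the last hypothesis h_t
    alpha : step size for h_t
    """
    d_new = d * np.exp(-alpha * u)
    return d_new / d_new.sum()

# Toy example: 4 samples, hypothesis correct on 3 of them.
d = np.full(4, 0.25)
u = np.array([1.0, 1.0, 1.0, -1.0])
gamma = d @ u                                    # edge = 0.5
alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))  # standard AdaBoost step
d_new = corrective_update(d, u, alpha)           # new edge of h_t: 0
```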
A natural idea, however, is to constrain the edges of all past hypotheses to be at most γ and otherwise minimize the relative entropy to the initial distribution. Based on this idea, algorithms such as LPBoost 11,12 and TotalBoost 13 were proposed; they are called totally corrective in the sense that they optimize their weights based on all past hypotheses. Moreover, LPBoost and TotalBoost provably maximize the margin via linear programming. Nevertheless, unlike LPBoost, in which the upper bound γ on the edge is chosen to be as small as possible in each iteration, TotalBoost uses entropic regularization, and γ decreases more moderately in TotalBoost.
Maximizing the hard margin is a provably good approach for low generalization error 3 when the data are linearly separable. In the case of inseparable data, however, maximizing the soft margin is a more robust and efficient choice. Soft margin maximization can be implemented via linear programming with capping constraints on a few hard examples. Based on this idea, many boosting algorithms have been proposed, including AdaBoost with soft margin 8, MadaBoost 14, ν-Arc 15,16, SmoothBoost 17,18, SoftBoost 19, corrective ERLPBoost 20, and ERLPBoost 21. ERLPBoost introduced a regularization parameter η on the relative entropy to the initial distribution, trading off soft margin maximization against relative entropy minimization, which solves the main problem of SoftBoost: its generalization error decreases slowly in early iterations. However, all algorithms in the totally corrective family update the sample distribution using only the weak classifiers, ignoring the strong classifier. At each iteration they merely constrain the edges of the existing weak classifiers to be at most γ, even if the edge of the strong classifier is larger than γ. It can be shown that the strong classifier's edge is possibly larger than the maximum edge of all weak classifiers. A natural algorithm therefore emerges: simply add the edge constraint of the strong classifier to the edge constraints of ERLPBoost, which makes the constraints stricter. Based on this, we propose StrongLPBoost, which introduces the constraint of the strong hypothesis to improve the convergence rate.
Our new algorithm is most similar to ERLPBoost, because both optimize the soft margin over all past hypotheses while minimizing the relative entropy. The most important difference is that we use tighter constraints. A key result of our work is that this strategy can increase the convergence speed.
The paper is organized as follows: in Section 2 we introduce the relevant notation, basic concepts, and LPBoost. Section 3 discusses in depth the four problems that exist in LPBoost and gives corresponding solutions, before the detailed StrongLPBoost algorithm is described in Section 4. Section 5 contains our experimental evaluation of StrongLPBoost and its competitors, and the paper concludes with an outlook and discussion in Section 6.

Preliminaries
This section presents the basic notation and relevant concepts behind LPBoost; Table 1 summarizes our notation. We first introduce two definitions, edge and margin, which are dual to each other in a provable sense.
The edge γ_h of a weak classifier h on the data set under a distribution d over samples is γ_h = Σ_{i=1}^{M} d_i y_i h(x_i). The margin of a sample (x_i, y_i) with respect to the hypothesis weights w is ρ_i = y_i Σ_{t=1}^{T} w_t h_t(x_i), where T is the number of weak classifiers and w_t is the weight of weak classifier h_t. We let H(x) = Σ_t w_t h_t(x). The margin of the data set is the minimum margin over the set. Generally speaking, the margin represents the generalization ability of a classifier, and maximizing the margin improves the ability to generalize. It is noteworthy that edges are linear in the distribution over samples, and margins are linear in the distribution over the current set of hypotheses. The problem of maximizing the margin can therefore be formulated as a linear program 11,12. Following Ref. 11, we give a brief introduction to LPBoost. Given a fixed ensemble H of hypotheses and a training set X, the error matrix U (shown in Fig. 1) contains entries u_{i,t} = y_i h_t(x_i). In terms of U, the margin of sample i corresponds to the dot product ρ_i = (Uw)_i, and the margin of a set of samples is min_i (Uw)_i, i.e. the minimum margin over all samples. The goal is to find a weight vector w that obtains the largest possible margin subject to the constraints w ≥ 0 and Σ_t w_t = 1. This is a max-min problem:

max_w min_i (Uw)_i   s.t.  w ≥ 0, Σ_t w_t = 1.   (1)

By von Neumann's minimax theorem 23, the dual of (1) can be shown to be

min_d max_t (d^T U)_t   s.t.  d ≥ 0, Σ_i d_i = 1,   (2)

and the objective values of (1) and (2) coincide:

max_w min_i (Uw)_i = min_d max_t (d^T U)_t = γ*.   (3)
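The primal (1) can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch (our own illustration, not the paper's implementation), assuming SciPy's linprog and the error-matrix convention U[i, t] = y_i h_t(x_i):

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(U):
    """Solve (1): max_w min_i (Uw)_i  s.t.  w >= 0, sum(w) = 1.

    Variables are z = (w_1, ..., w_T, rho); linprog minimizes, so we
    minimize -rho subject to rho <= (Uw)_i for every sample i.
    """
    M, T = U.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                  # maximize rho
    A_ub = np.hstack([-U, np.ones((M, 1))])       # rho - (Uw)_i <= 0
    b_ub = np.zeros(M)
    A_eq = np.zeros((1, T + 1))
    A_eq[0, :T] = 1.0                             # sum(w) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * T + [(None, None)]     # w >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:T], res.x[-1]                   # (w*, rho*)

# Two samples, two hypotheses: the margins are 2*w1 - 1 and
# 0.8 - 0.6*w1, so the maximin margin sits where the lines cross.
U = np.array([[1.0, -1.0],
              [0.2, 0.8]])
w_star, rho_star = max_margin_lp(U)               # rho* = 5/13
```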
Following (2), another boosting approach can be derived: the distribution over samples can be computed by linear programming so as to minimize the maximum edge over all base hypotheses. In the dual problem (2), the goal is to find (d, γ) that minimizes γ subject to the constraints (d^T U)_t ≤ γ for all t, d ≥ 0, and Σ_i d_i = 1. These quantities have a natural interpretation: (d^T U)_t is the edge, i.e. a score of weak classifier h_t on the weighted sample set. Thus LPBoost tries to find a distribution over samples that minimizes the edge of the best weak classifier, which increases the weights of misclassified samples and decreases the weights of correctly classified samples.
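Symmetrically, the dual (2) can be solved directly as well. A sketch under the same assumptions (SciPy's linprog, error matrix U[i, t] = y_i h_t(x_i)) follows; by equation (3), the optimal γ it returns equals the optimal margin of the primal.

```python
import numpy as np
from scipy.optimize import linprog

def min_edge_lp(U):
    """Solve (2): min_d max_t (d'U)_t  s.t.  d >= 0, sum(d) = 1.

    Variables are z = (d_1, ..., d_M, gamma); we minimize gamma
    subject to (d'U)_t <= gamma for every hypothesis t.
    """
    M, T = U.shape
    c = np.zeros(M + 1)
    c[-1] = 1.0                                   # minimize gamma
    A_ub = np.hstack([U.T, -np.ones((T, 1))])     # (d'U)_t - gamma <= 0
    b_ub = np.zeros(T)
    A_eq = np.zeros((1, M + 1))
    A_eq[0, :M] = 1.0                             # sum(d) = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * M + [(None, None)]     # d >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:M], res.x[-1]                   # (d*, gamma*)

# Same toy error matrix as before: by (3), the minimal maximum edge
# gamma* equals the maximal minimum margin, 5/13.
U = np.array([[1.0, -1.0],
              [0.2, 0.8]])
d_star, gamma_star = min_edge_lp(U)
```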
SoftBoost and ERLPBoost represent the latest research on variants of LPBoost. In order to limit the bad effect of noisy or difficult samples, SoftBoost adds a slack variable ζ_i for each x_i and maximizes the "soft margin"; the new primal problem is shown in (4). Note that the relationship between capping and the hinge loss has long been exploited by the SVM community 24,25. Moreover, ERLPBoost introduces the relative entropy to update the sample distribution smoothly and continually. The relative entropy is defined as Δ(d, d⁰) := Σ_i d_i ln(d_i / d_i⁰), where d and d⁰ are the sample distributions in different iterations. The resulting dual problem is listed as (5).
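A sketch of these two ingredients (illustrative names, SciPy assumed): capping the distribution, d_i ≤ 1/ν, is just an extra box bound in the LP dual, and the relative entropy Δ(d, d⁰) is a few lines of NumPy. Since capping shrinks the feasible set, the optimal γ can only rise relative to the uncapped dual:

```python
import numpy as np
from scipy.optimize import linprog

def rel_entropy(d, d0):
    """Delta(d, d0) = sum_i d_i * ln(d_i / d0_i), with 0 ln 0 = 0."""
    d, d0 = np.asarray(d, float), np.asarray(d0, float)
    mask = d > 0
    return float(np.sum(d[mask] * np.log(d[mask] / d0[mask])))

def capped_min_edge_lp(U, nu):
    """LP dual with capping: min_d max_t (d'U)_t
    s.t. 0 <= d_i <= 1/nu, sum(d) = 1   (nu in [1, M])."""
    M, T = U.shape
    c = np.zeros(M + 1)
    c[-1] = 1.0                                   # minimize gamma
    A_ub = np.hstack([U.T, -np.ones((T, 1))])     # (d'U)_t - gamma <= 0
    b_ub = np.zeros(T)
    A_eq = np.zeros((1, M + 1))
    A_eq[0, :M] = 1.0                             # sum(d) = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1.0 / nu)] * M + [(None, None)]  # capping on d
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:M], res.x[-1]

U = np.array([[1.0, -1.0],
              [0.2, 0.8]])
_, gamma_hard = capped_min_edge_lp(U, nu=1.0)     # cap 1: no effect
_, gamma_soft = capped_min_edge_lp(U, nu=2.0)     # cap 1/2: d is forced uniform
uniform = np.full(2, 0.5)
```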

Strong classifier and LPBoost
From the dual problems (2) and (5), we can see that only the weak classifier with maximum edge is active when the distribution d is updated to minimize the edge. The convergence rate can be improved by tightening the constraints of the optimization problem. The edge of the strong classifier may be larger than the maximum edge of the weak classifiers. Thus we can convert the strong classifier into a new weak one and add it to the edge constraints of ERLPBoost, making the constraints stricter. In this way, convergence can be accelerated.
We now describe this formally. In addition to the notation above, define the real-valued strong classifier H(x) = Σ_t w_t h_t(x) and its thresholded version H'(x) = sign(H(x)). Obviously, all the edges of the weak hypotheses satisfy inequality (6), γ_{h_t} ≤ max_t γ_{h_t} = γ; but what about H(x_i) and H'(x_i)? For H(x_i), we deduce (7):

γ_H = Σ_i d_i y_i H(x_i) = Σ_i d_i y_i Σ_t w_t h_t(x_i) = Σ_t w_t Σ_i d_i y_i h_t(x_i) = Σ_t w_t γ_{h_t}.   (7)
Combining this with (6), we obtain (8): since w is a distribution over hypotheses, γ_H = Σ_t w_t γ_{h_t} ≤ max_t γ_{h_t}. From inequality (8) we find that the edge of the strong classifier H(x) is no higher than the maximum edge of the weak classifiers. By this logic, the error of the final strong classifier would be higher than the minimum error of the weak ones. However, this conclusion conflicts with LPBoost and runs counter to boosting theory. So what is wrong with the above deduction? Close inspection shows that the final strong classifier is not H(x) but H'(x) = sign(H(x)); that is, H'(x) need not satisfy inequality (9). Once it is observed that the edge of the strong classifier may be higher than that of the weak ones, a natural algorithm emerges: simply add the edge constraint of the strong classifier to LPBoost, which makes the constraints tighter and further improves the convergence rate. The experiment section gives detailed experimental verification.
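A small numeric check of this distinction (our own toy example, not from the paper): three weak classifiers, each correct on two of three samples under the uniform distribution, each have edge 1/3; the real-valued combination H has the same edge 1/3, yet the majority vote H' = sign(H) classifies every sample correctly, so its edge is 1, exceeding the maximum weak edge.

```python
import numpy as np

# Three samples (all labeled +1 for simplicity) and three weak
# classifiers; row t holds h_t's predictions on the three samples.
d = np.full(3, 1/3)                 # uniform sample distribution
y = np.array([1, 1, 1])
h = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1]])
w = np.full(3, 1/3)                 # uniform hypothesis weights

edges = h @ (d * y)                 # edge of each weak classifier: 1/3
H = w @ h                           # real-valued strong classifier outputs
edge_H = (d * y) @ H                # = sum_t w_t * edge_t, as in (7)
edge_H_sign = (d * y) @ np.sign(H)  # edge of H' = sign(H)
```

Here edge_H stays within the weak maximum, consistent with (8), while edge_H_sign exceeds it, which is exactly why the constraint worth adding is on H'(x) and not on H(x).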

Strong classifier and minimax theory
This section shows the necessity of introducing the strong classifier from the point of view of minimax theory. The goal of a boosting algorithm is to maximize the margin over the sample set 25, and this maximization problem (the left side of equation (10)) can be converted into the minimax problem (the right side of equation (10)) according to equation (3). More specifically, the minimax problem is solved in two steps: first find the weak classifier with maximum edge, j = arg max_t (d^T U)_t; then adjust the weights d over samples to minimize the edge of classifier j. We expand the minimax equation (3) to equation (10),

max_w min_d d^T U w = min_d max_w d^T U w,   (10)

which can be proved to still hold via von Neumann's minimax theorem 21,26. Unlike (3), equation (10) employs a weighted mixed strategy rather than a pure strategy. Observe that when d on the left side and w on the right side are restricted to basis vectors, we recover equation (3). Equation (10) improves on (3) because the mixed strategy is more practical than the pure strategy. In (10), the left side is a weighted combination of margins over examples, and the right side is a weighted combination of edges over hypotheses. In this respect, the strong classifier is representative of a combination of weak hypotheses. Consequently, it is reasonable to introduce the strong classifier into equation (3): specifically, the strong classifier H is converted into a new weak classifier h', h' is added into the error matrix, and the optimization problem (5) is solved.

StrongLPBoost
The minimax view above motivates the main algorithm of this paper, StrongLPBoost, in which a constraint on the strong classifier is added to the linear program (5). The modified problem is defined as the minimization problem (11), obtained by appending the edge constraint of H'(x_i) to the dual problem. The pseudo-code is shown in Fig. 2. StrongLPBoost minimizes the relative entropy to the initial sample distribution while maximizing the soft margin.
At each iteration, a weak hypothesis h_t is generated by calling the oracle with the current distribution d; then a new d and a new w are obtained by solving (11) and (5), respectively. Problem (11) is devoted solely to updating the sample distribution, and (5) is employed to get the weights w of the hypotheses. Thus we obtain the current strong classifier H(x) = Σ_t w_t h_t(x). On the one hand, our iteration bound for StrongLPBoost is the same as the bound proven for ERLPBoost, O((1/ε²) ln(M/ν)), since the algorithm only makes the constraints of ERLPBoost tighter; here M is the sample set size and ν is the capping parameter controlling the number of noisy examples. On the other hand, the tighter constraints make d reach the ideal distribution faster than in ERLPBoost; that is, fewer iterations are needed in practice to come within ε of the optimum soft margin.
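The distribution update can be sketched as the LPBoost-style dual with one extra column for h' (a simplified LP sketch with SciPy; the paper's problem (11) additionally carries the entropic regularizer, which we omit here for brevity). Since y_i ∈ {−1, +1}, the new column is u'_i = y_i sign(H(x_i)) = sign((Uw)_i):

```python
import numpy as np
from scipy.optimize import linprog

def strong_lp_update(U, w, nu=1.0):
    """Distribution update with the strong-classifier edge constraint.

    U  : M x T error matrix, U[i, t] = y_i * h_t(x_i)
    w  : current hypothesis weights
    nu : capping parameter, d_i <= 1/nu
    Solves min_d gamma  s.t. (d'V)_t <= gamma for every column of
    V = [U | u'], 0 <= d <= 1/nu, sum(d) = 1.
    """
    u_strong = np.sign(U @ w)            # column for h' = sign(H)
    V = np.column_stack([U, u_strong])
    M, T = V.shape
    c = np.zeros(M + 1)
    c[-1] = 1.0                          # minimize gamma
    A_ub = np.hstack([V.T, -np.ones((T, 1))])
    b_ub = np.zeros(T)
    A_eq = np.zeros((1, M + 1))
    A_eq[0, :M] = 1.0                    # sum(d) = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1.0 / nu)] * M + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:M], res.x[-1]

# Three samples, three weak classifiers: each weak edge is at most 1/3
# under the optimal d, but the majority vote is right on every sample,
# so its all-ones column forces gamma up to 1.
U = np.array([[ 1,  1, -1],
              [ 1, -1,  1],
              [-1,  1,  1]], dtype=float)
w0 = np.array([1.0, 0.0, 0.0])           # "strong" column duplicates h_1
_, gamma_plain = strong_lp_update(U, w0)  # ~ 1/3
w = np.full(3, 1/3)
d_star, gamma_with = strong_lp_update(U, w)  # ~ 1
```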

Experiment
To evaluate the performance of our new algorithm, we made an extensive comparison with the original AdaBoost, LPBoost, SoftBoost, and ERLPBoost, using decision trees as the weak classifiers.

Experiment Setup
As in Refs. 8,18,19, our experiments use 11 data sets: 9 derived from the UCI and DELVE benchmark repositories, plus the two-dimensional Spiral and Banana sets. The full list is: banana, breast cancer, diabetes, german, heart, image segment, ringnorm, newthyroid, twonorm, waveform, and spiral. These datasets cannot be used as experimental data before being preprocessed as follows: (1) A random partition into two classes is performed for any data set that was not originally intended for binary classification.
(2) We remove samples with missing values, so that all attributes of the remaining samples have values.
(3) Symbolic or nominal attributes are mapped to the numbers 1 to N, where N is the number of attribute values.
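A minimal sketch of steps (2) and (3) in plain Python (the function names and the "?" missing-value marker are our own assumptions; the UCI files use various markers):

```python
def drop_missing(rows, missing="?"):
    """Step (2): keep only samples in which every attribute has a value."""
    return [row for row in rows if missing not in row]

def encode_nominal(values):
    """Step (3): map the distinct symbolic values to integers 1..N."""
    mapping = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```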
The resulting experimental data are described in Table 2. Using the two-dimensional Spiral and Banana datasets (see Fig. 3), it is easier to observe the differences between the boosting algorithms in terms of edge, margin, and iteration bound.
All the weak classifiers in these boosting algorithms are single decision trees. On each training set, 5-fold cross-validation is used to train and test the model for every data set.

Accuracy
First, we evaluate the accuracy of StrongLPBoost compared with the other 4 boosting algorithms over the 11 datasets. Table 3 shows the average generalization performance (with standard deviation) over the 11 datasets, with 5 models for every boosting algorithm. For a more extensive comparison, we introduce further evaluation criteria (recall, F-score, FP rate, specificity, and Matthews correlation 24,29,30). It is difficult to list the results of all 5 algorithms, so we show only the results of AdaBoost, LPBoost, and StrongLPBoost (see Table 4). Note that, except on the heart and diabetes datasets, the performance of StrongLPBoost is better than the other boosting algorithms in almost all cases.
On these two datasets, even though StrongLPBoost does not perform as well as AdaBoost, the experimental results still show its competitiveness compared with the other variants of LPBoost.

Margin and iteration bound
Next, we compare the four margin-maximizing algorithms with respect to weak classifier relevance, margin, accuracy, and iteration number. To make it easier to compare StrongLPBoost with SoftBoost and ERLPBoost, we use the Banana dataset in this experiment, similar to the work in 18,19. AdaBoost is left out because it is not based on margin maximization. Figs. 4 and 5 show the experimental results of the 4 algorithms on the Banana dataset. Fig. 4 plots the margin value against the iteration number: StrongLPBoost has the best convergence rate, ERLPBoost and LPBoost have similar convergence speed, and the convergence of SoftBoost is worse than that of the other 3 boosting algorithms. Fig. 5 plots the accuracy against the iteration number for the four algorithms. From the two figures it can be seen that StrongLPBoost converges quickly to near the optimum soft margin with a small number of weak classifiers.

Strong classifier edge constraint and convergence
In this subsection, we show that the edge constraint of the strong classifier has an impact on convergence. Here the experiments are based on the spiral dataset. (The reason for not using the banana dataset is that all four boosting algorithms converge so easily to the optimum soft margin on it that it is difficult to observe any change in convergence after adding the strong classifier edge constraint.) Fig. 6 shows the edges of H'(x), H(x), and γ against the iteration number before adding the edge constraint of H'(x). The red curve represents the edge of H'(x); the edge of H(x) and γ approach zero (about 10⁻⁶) after solving the dual problem (5) at each iteration. Consistent with our conclusion in Section 3, the edge of H(x) (blue curve) is always no higher than γ, whereas the edge of H'(x) may be lower or higher than γ. When the edge constraints of H'(x) are added, the three edges, shown in Fig. 7, all approach zero; we can see that the constraint on H'(x) takes effect. The change in convergence after adding the edge constraint of H'(x) is shown in Fig. 8. Finally, this experiment shows how the constraint of the strong classifier influences the weak classifiers. To simplify the experiment, 800 samples of the banana dataset are used. Fig. 9 shows the 15 weak classifiers generated by SoftBoost in 20 iterations; there is considerable relevance and redundancy among these classifiers. By contrast, only seven weak classifiers are generated when the strong classifier constraint is used.

Conclusion
In this paper, we first reviewed the research progress of boosting algorithms and analyzed LPBoost and its variants from the point of view of minimax theory. The existing LPBoost-based algorithms originate from the pure-strategy minimax: the new sample distribution is computed by minimizing the edge of the weak classifier with maximum edge. We then extended the minimax from pure strategy to mixed strategy, because the mixed strategy is more practical. Based on the mixed-strategy minimax and ERLPBoost, we proposed a new boosting algorithm that simply adds the edge constraint of the strong classifier to the problem of minimizing the maximum edge. Finally, we evaluated StrongLPBoost experimentally on the benchmark data sets; the results show that the new algorithm has a higher convergence rate and accuracy than the popular boosting algorithms.
Our future work will concentrate on continued improvement of the selection of weak classifiers for noisy real-world applications, as well as a further analysis of the relation between the strong classifier's edge and margin convergence. It would also be interesting to see how the techniques established in this work can be applied to finding the support samples.

Fig. 1. Error matrix U. All existing variants of LPBoost (including SoftBoost and ERLPBoost) that are based on this idea find a weak classifier with maximum edge and minimize its edge.

Fig. 4. Soft margin of LPBoost, SoftBoost, ERLPBoost, and StrongLPBoost over the Banana dataset.
Fig. 5. Accuracy of the four algorithms over the Banana dataset.
Fig. 6. Edges of H'(x), H(x), and γ before adding the edge constraint of the strong classifier H'(x).

Table 2. Description of the datasets used in the experiments.

Table 3. Accuracy of the 5 algorithms over the 11 datasets: mean and variance of 5-fold cross-validation. (Bold marks the statistics of StrongLPBoost; underscores mark results weaker than the other algorithms.)

Table 4. Six evaluation criteria for 3 algorithms over the 11 datasets: average values of 5-fold cross-validation.