Sample re-weighting hyper box classifier for multi-class data classification

In this work, we propose two novel classifiers for multi-class classification problems using mathematical programming optimisation techniques. A hyper box-based classifier (Xu & Papageorgiou, 2009) that iteratively constructs hyper boxes to enclose samples of different classes has been adopted. We first propose a new solution procedure that updates the sample weights during each iteration

Over the past decades, a wide range of classification algorithms have been proposed in the literature to tackle various classification problems. Classification algorithms can be broadly divided into two categories: binary and multi-class classifiers. A binary classifier is applicable only to classification problems with two classes, while a multi-class classifier can deal with problems with more than two classes. Compared with the large number of binary classifiers, there are relatively few multi-class classifiers in the literature (Bal & Orkcu, 2011). Common strategies for tackling a multi-class classification problem include either solving the problem once using a multi-class classification algorithm, or decomposing the whole problem into a series of binary problems and solving the sub-problems iteratively using binary classifiers (Ou & Murphey, 2007; Wang, Chen, & Qin, 2010).
The existing classifiers in the open literature are based on diverse methodologies, including support vector machine (SVM), neural network (NN), Naïve Bayes, decision tree, mathematical programming optimisation techniques, and so on. We provide below a brief summary of some of the most popular classification approaches, with some key classifiers shown in Fig. 1.

Computers & Industrial Engineering (journal homepage: www.elsevier.com/locate/caie)

misclassifications of the samples. The balance between the distance of the constructed hyper plane to different classes of samples and the number of misclassifications is controlled by a user-specified trade-off parameter. One of the features that make SVM powerful is the so-called kernel trick, which maps the dataset to a higher-dimensional inner product space, where samples may be easier to separate. A number of kernel functions, which greatly enhance the suitability of SVM for identifying non-linear decision boundaries, can be employed, e.g., polynomial kernels and the radial basis function kernel. Training an SVM has been formulated as a convex quadratic programming optimisation problem, which can be solved to global optimality using a large number of non-linear solvers (Carrizosa & Romero Morales, 2013; van Gestel et al., 2004). Despite its popularity, optimal tuning of the trade-off parameter and the choice of kernel function remain problem-specific issues that considerably affect the predictive power of SVM (Amari & Wu, 1999; Diosan, Rogozan, & Pecuchet, 2012; Noble, 2006; Ozer, Chen, & Cirpan, 2011).

NN
Mimicking a biological neural network, an NN classifier consists of a number of connected layers of neurons, which transform an input layer of features into an output layer of class labels. Each neuron takes as input a weighted sum of the outputs from all the neurons in the previous layer, and applies a non-linear activation function before passing its output to all the neurons in the next layer (Kavzoglu, 2009). Frequently used activation functions include sigmoid, logarithmic and radial basis functions (Arulampalam & Bouzerdoum, 2003). Despite the capacity of NN to tackle datasets with non-linear and complex decision boundaries, choosing the number of hidden layers, the number of neurons in each hidden layer and the activation function amounts to a difficult optimisation problem, which limits the generality of the method (Hunter, Hao, Pukish, Kolbusz, & Wilamowski, 2012). In practice, the structure of the network, i.e. the number of layers, the number of neurons in each layer and the type of activation function, is usually specified by the user, which reduces the problem of training a neural network classifier to tuning the weights of the connections between consecutive layers of neurons so as to minimise the classification error. Training a neural network is known to be time consuming and can only guarantee local optimality.

Naïve Bayes
The Naïve Bayes classifier belongs to the group of statistical classifiers. It is based on the naive assumption that the effects of different features on class membership predictions are independent of each other (Martinez-Arroyo & Sucar, 2000; Rish, 2001). In general, Naïve Bayes simply computes the support of each feature for each class so that the maximum likelihood estimate is satisfied on the training sample set. With the derived Bayesian rules, the probability of a sample being predicted into a class can be calculated. The simplicity of Naïve Bayes classifiers also ensures computational efficiency (Almeida, Almeida, & Yamakami, 2011). Although the assumption of independence among features is more often than not violated in practical datasets, Naïve Bayes generally gives performance comparable to that of much more sophisticated classifiers (Jin, Lu, & Ling, 2003; Rish, 2001; Wong, 2012).

Decision tree
Decision tree is a recursive partitioning method that sequentially splits samples into subsets. Starting from the whole dataset, a decision tree identifies one attribute and a break point, and then partitions the samples into subsets so as to improve the homogeneity of the class labels within the subsets. The partitioning procedure is repeated for each child node until no further split can increase the training accuracy (Chipman, George, & McCulloch, 1998; Gray & Fan, 2008). After growing a large tree, small leaves that do not contribute significantly to the training accuracy are removed to improve the generalisation and predictive power of the constructed tree (Chipman et al., 1998; Polat & Gunes, 2007; Rastogi & Shim, 2000). Interpretability is one of the main strengths of the decision tree classifier. The set of sequential linear rules generated is easy to understand, providing valuable insights into the mechanism of the underlying system. However, decision trees have been shown to be particularly unstable: perturbing a small proportion of the training samples or re-sampling the training set is likely to result in a very different tree structure (Gray & Fan, 2008).

Mathematical programming
Another group of classification models is built with mathematical programming optimisation techniques. Sueyoshi (1999, 2001, 2004, 2006), Sueyoshi and Goto (2009), Bal and Orkcu (2011) and Gehrlein (1986) have all proposed hyper plane-based classifiers using either linear programming or mixed integer programming techniques. Ryoo (2006) and Bagirov, Ugon, and Webb (2011) propose models for piece-wise linear classifiers. In Bertsimas and Shioda (2007), a model is presented that separates samples into a number of polyhedrons, which are formed by multiple hyper planes. The proposed formulation tries to enclose as many samples belonging to the same class as possible in the same polyhedrons by optimising the positions of the polyhedrons.
On the other hand, Xu and Papageorgiou (2006) and Xu and Papageorgiou (2009) propose a mathematical programming-based formulation that models a hyper box (HB) classifier. A hyper box is essentially a multi-dimensional rectangle with the number of dimensions equal to the total number of attributes in the dataset. The proposed method aims to build for each class a number of hyper boxes enclosing as many samples as possible. The hyper boxes belonging to different classes are constrained not to overlap with each other, and each hyper box defines a distinct rule enclosing a proportion of the training samples. In Maskooki (2013), a modified version of the HB classifier has been developed which requires only 1/3 to 1/2 of the computational time of the original HB. Inspired by the promising performance of the HB classifier, we propose two refined hyper box classifiers in this work, aiming to improve the quality of the constructed boxes.
Without attempting to comprehensively review the above classifiers, we summarise their relative strengths and weaknesses in Table 1 below.

Ensemble classifiers
Besides the single classifiers described above, some recent research efforts have focused on developing ensemble classifiers, which build a number of single classifiers and aggregate their classification outcomes to produce the final prediction (Breiman, 1996). Given a training sample set, Bagging (Bauer & Kohavi, 1999; Rokach, 2009) creates a number of bootstrap sample sets by uniformly sampling with replacement, and each bootstrap sample set is then learned by a classifier. The final prediction is an aggregation of the decisions made by each classifier, either by simple averaging or by a more sophisticated voting strategy in which certain classifiers have more votes in the final decision (Breiman, 1996; Strobl, Malley, & Tutz, 2009). Another recent advance in ensemble classification is Boosting (Niu, Jin, Lu, & Li, 2009). One of the most recognised Boosting algorithms is Adaboost (Freund & Schapire, 1997), which trains a set of classifiers in an iterative manner so that each subsequent classifier is constructed in favour of those samples misclassified by the previous classifier, by updating the weights of samples. Given a new sample with unknown class label, all the single classifiers make their own predictions of which class it belongs to and their decisions are combined to yield a final prediction.
In this work, we introduce two new solution procedures to improve the performance of the HB classifier. We first extend our previous work on the HB classifier by incorporating a sample re-weighting scheme. For the HB classifier, misclassified samples can either lie outside all the derived hyper boxes or be enclosed in hyper boxes belonging to other classes. Our proposed sample re-weighting scheme works by assigning higher weights to misclassified samples enclosed by other hyper boxes, tweaking the model to favour those difficult samples in the next iteration. In doing so, we aim to increase the chance of them being correctly classified in subsequent iterations and of achieving a better final solution. Furthermore, observing the generally high computational cost of the traditional HB classifier, we introduce a data space splitting method that partitions the training samples into two disjoint regions, each of which defines a much smaller optimisation problem and can thus be solved more easily. Computational experiments clearly demonstrate that the proposed sample re-weighting scheme achieves consistently higher prediction accuracy than the traditional HB classifier. Meanwhile, the sample partitioning method reduces the computational cost by one or two orders of magnitude while maintaining the desirable level of prediction rates.
The rest of the paper is structured as follows: in Section 2, we summarise both the mathematical formulation and the solution procedure of the original HB classifier (Xu & Papageorgiou, 2006, 2009), which serves as the basis of our work. Section 3 proposes a refined HB classifier. In Section 4 we further introduce a data space partition method in a bid to ease the high computational demand of constructing the hyper box classifier. Results of computational experiments on a number of real world datasets are presented in Section 5, and the last section concludes with our major findings.

A hyper box classifier
As mentioned before, our work is based on the classification method proposed in Xu and Papageorgiou (2006, 2009).

If $E_s$ takes the value of 1, sample $s$ is correctly enclosed in hyper box $i_s$, i.e. the value of $A_{sm}$ lies between the lower bound ($X_{im} - LE_{im}/2$) and the upper bound ($X_{im} + LE_{im}/2$) of its hyper box $i_s$ for all attributes $m$; otherwise sample $s$ is misclassified as being outside its target box. In Fig. 2a, we present a two-dimensional representation of samples being inside and outside their corresponding hyper boxes.

Hyper boxes of different classes are not allowed to overlap, which is realised via two sets of constraints, (3) and (4). In constraint (3), when the binary variable $Y_{ijm} = 0$, hyper boxes $i$ and $j$ belonging to different classes do not overlap, i.e. the lower bound of box $i$ is greater than the upper bound of box $j$ on attribute $m$; when $Y_{ijm} = 1$, constraint (3) becomes redundant. To avoid overlapping of the hyper boxes in $M$-dimensional space, they need to be non-overlapping in at least one dimension, which is modelled by constraints (4). In Fig. 2b, we give a graphical example of overlapping and non-overlapping hyper boxes.

The objective function is to minimise the number of misclassifications (i.e. samples with $E_s = 0$):

$$\min \sum_{s} (1 - E_s) \qquad (5)$$

Objective function (5), sample enclosing constraints (1), (2), and hyper box non-overlapping constraints (3), (4) form the original mathematical formulation, named MCP, of the original HB classifier (Xu & Papageorgiou, 2009). The combination of a linear objective function, linear constraints and binary variables defines a mixed integer linear programming (MILP) formulation, which can be solved to global optimality using standard solution techniques, for example branch-and-bound.
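The constraint sets referred to as (1) to (4) can be sketched as below. This is a reconstruction from the verbal description, not a quotation of the original model: $U$ (a sufficiently large positive constant) and $\varepsilon$ (a small positive separation) are assumed symbols that the text does not define, $X_{im}$ and $LE_{im}$ are taken to be the centre and the side length of box $i$ on attribute $m$, and $M$ is the number of attributes.

$$
\begin{aligned}
A_{sm} &\ge X_{im} - \frac{LE_{im}}{2} - U\,(1 - E_s) && \forall s,\ m,\ i = i_s && (1)\\
A_{sm} &\le X_{im} + \frac{LE_{im}}{2} + U\,(1 - E_s) && \forall s,\ m,\ i = i_s && (2)\\
X_{im} - \frac{LE_{im}}{2} &\ge X_{jm} + \frac{LE_{jm}}{2} + \varepsilon - U\,Y_{ijm} && \forall m,\ i \ne j \text{ of different classes} && (3)\\
\sum_{m} \left( Y_{ijm} + Y_{jim} \right) &\le 2M - 1 && \forall i < j \text{ of different classes} && (4)
\end{aligned}
$$

Under these assumptions, setting $E_s = 1$ in (1) and (2) forces $A_{sm}$ between the two box bounds, while $E_s = 0$ relaxes both. In (4), at least one of the $2M$ binary variables of a box pair must be 0, which activates (3) on at least one attribute.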

HB iterative solution procedure
The last section describes a mathematical programming formulation for building hyper boxes to separate samples. In Xu and Papageorgiou (2009), an iterative solution procedure has also been developed that allows potentially multiple hyper boxes per class in order to improve the quality of the solution. This original iterative procedure is outlined in Fig. 3 below.
Initially, one hyper box is created for each class of samples (initialising $i_s$) and the MCP model is solved once to enclose as many samples as possible in their own hyper boxes. Starting from the second iteration, for any class having at least one misclassified sample ($E_s = 0$), one additional hyper box is allowed for this particular class, followed by updating $i_s$, i.e. the correctly classified samples remain mapped to their original hyper box while the misclassified samples are re-mapped to the new box. For the classes whose samples were all correctly classified in the last iteration, the sample-box mapping $i_s$ is kept. The iterative procedure terminates when the number of misclassified samples does not decrease between two adjacent iterations or when all the samples are correctly classified. An artificial example is given in Fig. 4 to illustrate the original iterative solution procedure.
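The outer loop of this procedure can be sketched in Python as follows. This is a minimal illustration, not the original implementation: `solve_mcp` is a hypothetical callable standing in for the MILP solve, assumed to return one flag per sample (`True` when $E_s = 1$), and the integer box identifiers are an arbitrary choice. The sketch simply returns the mapping of the last iteration solved.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def hb_iterative_training(
    labels: Sequence[Hashable],
    solve_mcp: Callable[[List[int], Dict[int, Hashable]], Sequence[bool]],
) -> Tuple[List[int], Dict[int, Hashable], int]:
    """Iteratively add boxes for classes that still have misclassified samples.

    solve_mcp(box_of_sample, class_of_box) is a hypothetical stand-in for
    solving the MCP model; it returns the E_s flags (True = enclosed).
    """
    # First iteration: one box per class, every sample mapped to its class box.
    classes = list(dict.fromkeys(labels))
    class_of_box: Dict[int, Hashable] = dict(enumerate(classes))
    box_of_sample: List[int] = [classes.index(c) for c in labels]

    previous_errors = None
    while True:
        enclosed = solve_mcp(box_of_sample, class_of_box)
        errors = sum(1 for ok in enclosed if not ok)
        # Stop when everything is enclosed or the error count did not decrease.
        if errors == 0 or (previous_errors is not None and errors >= previous_errors):
            return box_of_sample, class_of_box, errors
        previous_errors = errors

        # One additional box per class with at least one misclassified sample;
        # misclassified samples are re-mapped to it, the others keep their box.
        new_box_of_class: Dict[Hashable, int] = {}
        for s, ok in enumerate(enclosed):
            if ok:
                continue
            c = labels[s]
            if c not in new_box_of_class:
                new_box_of_class[c] = len(class_of_box)
                class_of_box[new_box_of_class[c]] = c
            box_of_sample[s] = new_box_of_class[c]
```

In practice `solve_mcp` would build and solve the MCP model for the current mapping; any callable with the same signature can be plugged in to exercise the loop.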

Predicting new samples using derived hyper boxes
After training the HB classifier, the derived hyper boxes are used to predict the class label of a new sample. The prediction procedure is intuitive: (1) if a new sample falls into one of the derived boxes, it is assigned the class label of that box; (2) if a new sample lies outside all derived hyper boxes, it is assigned the class label of its nearest box.
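The prediction rule can be sketched as follows. The box representation (lower bounds, upper bounds, class label) and the Euclidean point-to-box distance are assumptions made for illustration; the text does not specify which distance defines the nearest box.

```python
import math


def distance_to_box(x, lower, upper):
    """Euclidean distance from point x to the box [lower, upper]; 0 if inside.

    The choice of Euclidean distance is an assumption of this sketch.
    """
    return math.sqrt(
        sum(max(lo - v, 0.0, v - hi) ** 2 for v, lo, hi in zip(x, lower, upper))
    )


def predict(x, boxes):
    """Return the label of the enclosing box, or of the nearest box otherwise.

    boxes is a list of (lower, upper, label) tuples. A point inside a box has
    distance 0, so a single minimum over all boxes covers both rules.
    """
    return min(boxes, key=lambda b: distance_to_box(x, b[0], b[1]))[2]
```

For example, with boxes `[((0, 0), (1, 1), "A"), ((2, 2), (3, 3), "B")]`, the point `(0.5, 0.5)` is inside the first box and gets label `"A"`, while `(1.9, 2.5)` lies outside both and is assigned to the closer box `"B"`.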
Having reviewed the main features of the hyper box classifier proposed by Xu and Papageorgiou (2009), we propose a refined HB classifier in the next section.

A refined hyper box classifier
Inspired by the success of boosting algorithms, which typically consist of iteratively learning classifiers while updating the weight distribution of samples, we introduce a sample re-weighting scheme into the traditional hyper box classifier in a bid to improve its performance.
As mentioned earlier in Section 2, the traditional HB inherently involves iterative training, i.e., after each iteration any class with misclassified samples is given an extra hyper box and the MCP model is re-solved. In our proposed work, we mimic the behaviour of boosting algorithms by re-weighting samples between iterations. More specifically, after each iteration, we update the weights of all samples by assigning more weight to a subset of misclassified samples, thus putting more emphasis on correctly classifying them in the next iteration. When a sample $s$ is misclassified by its hyper box, the misclassification can fall into two categories: (1) the misclassified sample lies outside all derived boxes; (2) the misclassified sample lies inside at least one of the derived boxes that belong to a different class. We call the two types of errors type 1 and type 2, respectively. Fig. 5 visualises the two types of misclassifications for a two-dimensional case.
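A minimal sketch of how the two error types could be told apart is given below. It assumes boxes are stored as (lower bounds, upper bounds, class label) tuples, and it simplifies the definition by treating a sample as correct when any box of its own class encloses it, instead of checking only its assigned box $i_s$.

```python
def inside(x, lower, upper):
    """True when x lies within [lower, upper] on every attribute."""
    return all(lo <= v <= hi for v, lo, hi in zip(x, lower, upper))


def error_type(x, label, boxes):
    """Return 0 (correct), 1 (outside all boxes) or 2 (inside another class's box).

    boxes is a list of (lower, upper, label) tuples; this layout and the
    'any box of the own class' rule are assumptions of the sketch.
    """
    hit_labels = [lab for lower, upper, lab in boxes if inside(x, lower, upper)]
    if label in hit_labels:
        return 0
    return 2 if hit_labels else 1
```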
In Fig. 5a, two misclassified samples lie outside both derived hyper boxes, and before the next iteration another box will be allocated for these two samples with type 1 error. In the second iteration, the two samples will be correctly enclosed in the additional hyper box. In Fig. 5b, however, the two type 2 misclassified samples will still be misclassified in the next iteration despite the additional hyper box. In fact, type 2 misclassified samples have only a slight chance of being correctly classified in the following iterations. In this work, we propose a sample re-weighting scheme that gives more weight to the type 2 misclassifications, which increases the chance of them being correctly classified and of achieving a better final solution. In order to accommodate the different weights of samples, the objective function (5) of the traditional HB has been modified to the following:

$$\min \sum_{s} P_s \, (1 - E_s) \qquad (6)$$

where $P_s$ denotes the weight of sample $s$, equivalent to the cost of misclassifying it. Eq. (5) can be seen as a special case of Eq. (6) where $P_s = 1$ for all samples. With the new objective function, when different weights are assigned to different samples, the model prioritises the samples with higher weights so that the overall misclassification cost reaches its global minimum. We keep the other constraints of MCP unchanged and refer to the resulting model as W_MCP.

The proposed SRW_HB also implements an iterative solution procedure. The first iteration of SRW_HB is identical to the first iteration of the traditional HB: one box per class is generated to minimise the total cost of misclassifications while all the samples have a weight $P_s$ of 1. If there are misclassified samples, from the second iteration one more box is allowed for each class with at least one misclassified sample. The sample-box mapping is updated so that correctly classified samples keep their mapping from the last iteration, while the misclassified samples (both type 1 and type 2) are re-mapped to their newly generated hyper boxes. The misclassification cost for correctly classified samples and type 1 misclassified samples is set to 1, while the cost for type 2 misclassified samples is set to a higher value CT (CT > 1). The W_MCP model is re-solved and the above procedure is repeated. The iterative solution procedure terminates when the number of misclassified samples fails to improve in 2 consecutive iterations. The testing procedure is the same as for the original HB: a new sample is allocated to its nearest derived hyper box and then assigned the class membership of that hyper box.
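The weight update and the weighted misclassification cost can be sketched as follows. The function names and the encoding of error types as 0 (correct), 1 and 2 are assumptions for illustration; the default `ct=3.0` is only a placeholder for the user-specified CT > 1.

```python
def update_weights(error_types, ct=3.0):
    """Assign weight CT to type 2 errors and weight 1 to every other sample.

    error_types holds 0 (correct), 1 (type 1) or 2 (type 2) per sample;
    this encoding is an assumption of the sketch.
    """
    if ct <= 1:
        raise ValueError("CT must be greater than 1")
    return [ct if t == 2 else 1.0 for t in error_types]


def misclassification_cost(weights, enclosed):
    """Weighted objective: sum of P_s over samples with E_s = 0."""
    return sum(p for p, ok in zip(weights, enclosed) if not ok)
```

With all weights equal to 1 the cost reduces to the plain count of misclassified samples, which mirrors the relation between the unweighted and the weighted objective.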

A data space partition scheme
In the original publication (Xu & Papageorgiou, 2009), it is reported that for some datasets the MCP models cannot be solved to global optimality within 200 s for all iterations. Since the computational complexity of an MILP problem depends on the size of the problem, we propose here a simple data space partition scheme to ease the computational burden of building hyper boxes and to attempt to identify better solutions.
Given a dataset $A_{sm}$, the average value of all samples on each attribute $m$ is calculated as $Aver_m$, followed by computing the numbers of samples satisfying $A_{sm} \ge Aver_m$ and $A_{sm} < Aver_m$, which are denoted as $RU_m$ and $RL_m$, respectively. For each attribute, the difference between the numbers of samples placed in the two disjoint regions partitioned at $Aver_m$ is computed as $Diff_m = |RU_m - RL_m|$. The attribute offering the most even partition, i.e. the smallest $Diff_m$ value, is selected as the partition attribute $m^*$. When multiple attributes offer an equally low $Diff_m$ value, the partition attribute is randomly chosen among them. Subsequently the original dataset is partitioned into two disjoint regions R1 and R2, which respectively contain the samples satisfying $A_{sm^*} \ge Aver_{m^*}$ and $A_{sm^*} < Aver_{m^*}$. In each region, we train the proposed sample re-weighting hyper box classifier (SRW_HB). It is important to note that extra constraints are added to W_MCP to make sure that the hyper boxes derived in each region are not so large that they overlap with hyper boxes derived in the other region on the partition attribute $m^*$, thus ensuring that the boxes in one region do not overlap with the boxes in the other region:

$$X_{im^*} - \frac{LE_{im^*}}{2} \ge Aver_{m^*} \quad \forall i \qquad (7)$$

$$X_{im^*} + \frac{LE_{im^*}}{2} \le Aver_{m^*} - \varepsilon \quad \forall i \qquad (8)$$

Eq. (7) is added to W_MCP when training on the samples in R1, while Eq. (8) is added to W_MCP when training on the samples in R2. An arbitrarily small positive constant $\varepsilon$ is inserted in Eq. (8) to ensure the two regions do not share the same boundary. The final decision boundary is formed by all the hyper boxes derived from both regions.

The idea behind the data space partition method is that the computational time required to solve an MILP grows exponentially with the number of training samples, making it hard to identify optimal solutions at feasible computational cost. Partitioning the dataset into two disjoint regions with similar numbers of samples makes both regions similarly easy to solve. We name the framework that employs the proposed data space partition scheme to create two disjoint sub-regions and constructs sample re-weighting hyper box classifiers in both regions DR_SRW_HB; its flowchart is illustrated in Fig. 7.
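The selection of the partition attribute and the split into R1 and R2 can be sketched as follows. The function name, the list-of-rows data layout and the returned sample indices are assumptions made for illustration.

```python
import random


def choose_partition(data, rng=random):
    """Pick the attribute whose mean splits the samples most evenly.

    data is a list of rows (one list of attribute values per sample).
    Returns (m_star, aver_star, r1_indices, r2_indices).
    """
    n_samples, n_attributes = len(data), len(data[0])
    avers, diffs = [], []
    for m in range(n_attributes):
        aver = sum(row[m] for row in data) / n_samples
        ru = sum(1 for row in data if row[m] >= aver)  # samples at or above the mean
        rl = n_samples - ru                            # samples below the mean
        avers.append(aver)
        diffs.append(abs(ru - rl))

    # Smallest Diff_m wins; ties are broken at random, as in the description.
    best = min(diffs)
    m_star = rng.choice([m for m, d in enumerate(diffs) if d == best])
    aver_star = avers[m_star]
    r1 = [s for s, row in enumerate(data) if row[m_star] >= aver_star]
    r2 = [s for s, row in enumerate(data) if row[m_star] < aver_star]
    return m_star, aver_star, r1, r2
```

Each of the two index sets would then be passed to a separate SRW_HB training run, with the corresponding bound on the boxes along the partition attribute added to the model.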
In this work we have tested the proposed data space partition scheme, which splits the entire data space into two disjoint regions, on medium-size datasets. It is important to note that for larger datasets the proposed strategy can be further generalised, i.e. by partitioning the data space into 3, 4 or more disjoint parts, to accommodate more samples and attributes.

Computational results
In this section, the applicability and effectiveness of the proposed SRW_HB and DR_SRW_HB classifiers are demonstrated on 6 real world datasets, including Phenol (Niu et al., 2009), Firm (Sueyoshi, 2006; Xu & Papageorgiou, 2009) and 4 datasets downloaded from the UCI machine learning repository (http://archive.ics.uci.edu/ml/), namely Ionosphere, Glass, Breast tissue and Iris. We have implemented a number of classifiers from the literature to compare their classification rates with those of our proposed SRW_HB and DR_SRW_HB. The group of classifiers includes Naïve Bayes, SMO (support vector machine), Logistic regression, Bagging, Adaboost, NN and three mathematical programming-based multi-class classifiers: HB, Gehrlein (1986) and Bal and Orkcu (2011).
To comprehensively evaluate the overall classification performance of the various classification algorithms, we use two testing scenarios as below:
Scenario 2: conduct a leave-one-out cross validation, in which for each dataset only one sample is held out as the testing set while the rest are used as training samples. The process is repeated until every sample has been used as the testing sample.
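The leave-one-out protocol of Scenario 2 can be sketched generically as follows. The `train` and `predict` callables are hypothetical placeholders for any of the classifiers compared here; their signatures are assumptions of this sketch.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and average the hits.

    train(train_samples, train_labels) returns a model and
    predict(model, sample) returns a label; both are hypothetical callables.
    """
    samples, labels = list(samples), list(labels)
    correct = 0
    for i in range(len(samples)):
        # Training set: every sample except the i-th one.
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = train(train_samples, train_labels)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```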
All the mathematical programming-based classification methods, including SRW_HB, HB, and the approaches proposed by Gehrlein (1986) and Bal and Orkcu (2011), are implemented in the General Algebraic Modeling System (GAMS) 24.1 (GAMS Development Corporation, 2013) and solved using the CPLEX 12.3 solver on a 2.40 GHz speed, 2393 MHz CPU computer system. The optimality gap is set to 0 when solving MILP problems. For all hyper box-based methods we limit the computational time per iteration to 200 s. The other classifiers are implemented in the Waikato Environment for Knowledge Analysis (WEKA) machine learning software (Hall et al., 2009). Default settings are retained for Naïve Bayes, Logistic regression, SMO, Bagging and Adaboost, while for NN the following parameters from Xu and Papageorgiou (2009) are used: hiddenLayers = 2; learning rate = 0.1; momentum = 0.7; trainingTime = 10,000.

Real world datasets
We use 6 real world datasets to test the applicability and competitiveness of the proposed classification algorithms. Ionosphere concerns radar data: given 34 attributes reflecting the received signals, the task is to classify free electrons in the ionosphere into 2 classes. The Phenol dataset (Niu et al., 2009) concerns classifying 274 phenols, characterised by 9 molecular descriptors that quantify their compounds, into 4 possible toxicity mechanisms: polar narcotics, respiratory uncouplers, proelectrophiles and soft electrophiles. The Glass dataset is a collection of glass samples belonging to 6 types of glass. Each glass sample is described by 9 attributes, each of which corresponds to the weight percentage of a chemical element (sodium, aluminium, calcium, etc.) in its corresponding oxide. The Breast tissue dataset has 106 freshly excised tissue samples from the breast area, which are described by 9 attributes such as area under the spectrum and length of the spectral curve. Iris is one of the most studied benchmark datasets in data classification: 150 instances of 3 types of iris plant are characterised by 4 features, namely sepal length, sepal width, petal length and petal width. The Firm dataset aims to predict the financial performance of a number of companies based on certain performance indices, for example cash to total assets and long-term debt to total assets, classifying them into a class of 'good' firms and a class of firms that went bankrupt between 1996 and 2002. A brief summary of the employed real world datasets is provided in Table 2.

Sensitivity analysis of CT
In this section, a sensitivity analysis is performed to tune the user-specified parameter CT of the proposed SRW_HB, which denotes the cost for type 2 misclassified samples and is higher than 1. We present in Fig. 8 the results of the sensitivity analysis for all 6 datasets.
A series of values has been tested for CT, namely 2, 3, 4 and 5. It is clear from Fig. 8 that varying CT has different effects on different datasets. For the Ionosphere dataset and scenario 1, prediction accuracy first increases from CT = 2 to CT = 3, and then falls when CT is equal to 4 and 5. With regard to scenario 2, the trend is similar: the prediction rate goes up from CT = 2 to CT = 3, and then decreases. For Phenol, the classification rate for scenario 2 goes up from CT = 2 to CT = 3 and 4 before decreasing at CT = 5, while the classification rate for scenario 1 remains constant. Glass is the dataset most affected by the value of CT: for both scenarios prediction rates increase by about 5% from CT = 2 to CT = 4, and subsequently drop when CT = 5. With regard to the Breast tissue case study, classification rates for both scenarios fluctuate throughout the tested CT values and both peak at CT = 3. For the Iris dataset, increasing CT appears to have a minor impact on scenario 1, while for scenario 2 the prediction rate remains constant between CT = 2 and 4 before growing slightly at CT = 5. Lastly, for the Firm dataset, the classification rate for scenario 2 remains constant over the tested range, while for scenario 1 the accuracy goes down from CT = 4 to 5.
Overall, it is clear that the sensitivity analysis for SRW_HB does not yield a single optimal CT value, as peak prediction rates come from different CT values in different datasets and scenarios. On the other hand, CT = 3 appears to give robust performance, as the prediction rate often peaks at or near CT = 3 (e.g. Ionosphere, Breast tissue). Therefore we take CT = 3, which gives good performance for almost all datasets investigated, for SRW_HB when comparing its classification performance against the other implemented classifiers from the literature.

Classification performance comparison
In this section, we evaluate the classification performance of 10 classifiers, including the proposed SRW_HB and the traditional HB. For the proposed SRW_HB classifier, we set CT = 3 for all datasets to offer a fair comparison. The results are presented in Tables 3 and 4 for scenarios 1 and 2, respectively.
For both testing scenarios, no classifier shows dominant classification rates over the others, as different datasets play to the strengths of different classification methodologies. This observation is consistent with previous findings (Adem & Gochet, 2006; Lam & Moy, 2002). A good classifier should maintain consistently good performance across many different classification problems. The proposed SRW_HB, showing this desired consistency, is usually among the top 3 of the 10 classifiers. Note that the proposed SRW_HB outperforms the traditional HB in most cases.
We summarise here the overall classification performance of the 10 implemented classifiers using a scoring scheme also employed in Xu and Papageorgiou (2009). Briefly, for each scenario and each dataset, the classifiers are ranked in descending order according to their prediction accuracies, i.e. the classifier with the highest classification rate is awarded a score of 10, the classifier with the second highest classification rate is assigned a score of 9, and so on. For each scenario, the average score across all datasets is taken as the indication of the overall competitiveness of a particular classifier. The higher the average score, the better the performance of the classifier.
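The scoring scheme can be sketched as follows. The nested-dictionary input layout is an assumption for illustration, and so is the tie handling: the text does not say how equal accuracies are scored, so this sketch keeps the input order for ties.

```python
def average_scores(accuracy):
    """Average rank-based score per classifier across datasets.

    accuracy maps dataset name -> {classifier name: classification rate}.
    With n classifiers the best one on a dataset scores n, the next n - 1,
    and so on. Ties keep their input order, which is an assumption.
    """
    totals = {}
    for rates in accuracy.values():
        ranked = sorted(rates, key=rates.get, reverse=True)
        top_score = len(ranked)
        for position, name in enumerate(ranked):
            totals[name] = totals.get(name, 0) + (top_score - position)
    n_datasets = len(accuracy)
    return {name: total / n_datasets for name, total in totals.items()}
```

With 10 classifiers per dataset, the top-ranked classifier receives 10 points on that dataset, matching the description above.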
The score ranking is presented in Fig. 9, which shows that in both scenarios the proposed SRW_HB classifier not only gives improved classification accuracy over the traditional HB, but also outperforms the other state-of-the-art classifiers.

DR_SRW_HB significantly reduces computational cost while maintaining the classification accuracy compared with SRW_HB
In the last section, we demonstrated that the proposed SRW_HB classifier, which modifies the traditional HB classifier by updating the misclassification costs for samples with type 2 errors after each iteration, gives overall better prediction accuracy than a number of state-of-the-art classifiers. Recall that we have proposed in Section 4 a DR_SRW_HB method that implements a simple data space partition scheme to split the original data space into two disjoint regions, followed by training SRW_HB in both regions. The idea is that each region contains about half of the samples of the entire problem, and is therefore much easier to solve.
We now test the effectiveness of DR_SRW_HB against SRW_HB for both scenarios. With the proposed SRW_HB method, the W_MCP model cannot be solved to global optimality for at least one iteration (within 200 s) on 4 datasets (in either scenario), namely Phenol, Glass, Breast tissue and Ionosphere. We therefore run DR_SRW_HB on those 4 datasets and compare the prediction accuracy with that achieved by SRW_HB. The results are presented in Table 5. For scenario 1, DR_SRW_HB leads to a higher classification rate on Phenol, while SRW_HB is more accurate on Glass, Breast tissue and Ionosphere. It should be noted that, compared with the other classifiers from the literature, DR_SRW_HB still shows better overall performance. In scenario 2, DR_SRW_HB offers a much higher prediction rate on Glass, ties with SRW_HB on Phenol and Breast tissue, and loses on Ionosphere. We can see that DR_SRW_HB performs better in scenario 2 than in scenario 1, because scenario 2 requires more computational effort than scenario 1 as a result of more samples being involved in training. Considering both scenarios, we therefore conclude that the proposed data space partition scheme can maintain the overall prediction rates of SRW_HB on complex examples.
For each dataset, the highest classification accuracy achieved is marked in bold.
Recall that DR_SRW_HB was proposed to overcome the high computational cost of tackling complex classification problems. We therefore report, for scenario 2, the average computational time per run consumed by three variants of hyper box-based classifiers, namely HB, SRW_HB and DR_SRW_HB. The results, presented in Fig. 10, show clearly that partitioning a complex problem into two sub-problems and solving two relatively easy problems dramatically decreases the computational cost. On Phenol and Breast tissue, DR_SRW_HB constructs hyper boxes in a matter of seconds, while the CPU time consumed by HB and SRW_HB is significantly higher. Although DR_SRW_HB takes hundreds of seconds to train hyper boxes on Glass and Ionosphere, its computational time is still a small fraction of that consumed by HB and SRW_HB. For scenario 1, the trend is similar: the proposed data space partition method considerably reduces computational cost (data not shown).
We also compare our proposed DR_SRW_HB with an alternative solution procedure proposed in the literature for the hyper box classifier (Maskooki, 2013), in which, after each iteration, correctly classified training samples are removed and the dimensions of established hyper boxes are fixed before the hyper boxes are optimised for the next iteration. That solution procedure was shown to yield a 2-3-fold computational cost saving together with generally decreased classification accuracy. Our proposed DR_SRW_HB classifier clearly outperforms the procedure of Maskooki (2013) by offering a much higher computational cost reduction. Thus, it is concluded that DR_SRW_HB results in huge CPU savings of 1 or 2 orders of magnitude compared with HB and SRW_HB.
Overall, we propose the following strategy: for a classification problem on which SRW_HB struggles to identify globally optimal solutions in all iterations, DR_SRW_HB is used instead; otherwise, for an easy classification problem, SRW_HB is used. Despite the significant reduction in computational time, DR_SRW_HB, being based on mixed integer programming, still generally consumes more computational resources than the existing methods in the literature. We note here that prediction accuracy remains the most important aspect of many real world data classification problems, for example medical disease classification problems (Dagliyan et al., 2011; Nguyen & Rocke, 2002; West et al., 2001). The classifiers proposed in this work aim to achieve higher prediction accuracy for offline classification problems where computational time is not of major concern.
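The selection strategy proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the `IterationLog` record and the `choose_classifier` function are hypothetical names, and the default limit of 200 s per iteration is the one quoted above for the W_MCP model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IterationLog:
    """Hypothetical record of one SRW_HB training iteration."""
    solve_time_s: float    # time spent solving the W_MCP model
    proven_optimal: bool   # True if the solver proved global optimality


def choose_classifier(logs: Iterable[IterationLog],
                      time_limit_s: float = 200.0) -> str:
    """Return which classifier the strategy would pick for a dataset.

    If any iteration hit the time limit or ended without a proof of
    global optimality, the problem is treated as complex and
    DR_SRW_HB is chosen; otherwise SRW_HB is kept.
    """
    for log in logs:
        if not log.proven_optimal or log.solve_time_s > time_limit_s:
            return "DR_SRW_HB"
    return "SRW_HB"


# Illustrative logs only; these are not measurements from the paper.
easy = [IterationLog(3.0, True), IterationLog(12.5, True)]
hard = [IterationLog(3.0, True), IterationLog(200.0, False)]
print(choose_classifier(easy), choose_classifier(hard))
```

The check is made per iteration because a single iteration that is not solved to global optimality is already enough to classify the dataset as complex in the discussion above.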

Concluding remarks
Data classification is an important data mining area subject to extensive on-going research interest. Inspired by the promising classification rates of a hyper box classifier (Xu & Papageorgiou, 2006, 2009) in the literature, we propose in this work two new solution procedures that aim to improve the performance of the hyper box classifier. The first improvement, SRW_HB, updates the sample weights during each iteration of the training process so that type 2 misclassified samples, i.e. misclassified samples enclosed in one of the hyper boxes of another class, are given more weight than the other samples. On 6 binary and multi-class real world datasets, it is demonstrated that the proposed SRW_HB provides consistently good classification rates, outperforming the traditional HB and other state-of-the-art classifiers, for example SVM, NN and logistic regression.
We further introduce a data space partition method to reduce the computational cost of SRW_HB, which works by splitting the dataset into two disjoint regions, each of which is then solved independently using SRW_HB. On the 4 complex datasets, the proposed DR_SRW_HB consumes dramatically less computational time than the original HB and SRW_HB, often by 1 to 2 orders of magnitude, while maintaining a level of prediction accuracy comparable to that of the proposed SRW_HB classifier.
A natural extension of this work in the near future is to investigate a more generic data space partition scheme. The sample partition scheme presented and used for DR_SRW_HB significantly reduces computational cost but can only perform a binary partition. For large-scale data classification problems, the proposed DR_SRW_HB may struggle to identify quality solutions in the training procedure. Therefore, a generic data space partition method, which splits the data into multiple regions, each of which is easy to solve, can help scale up the hyper box classifiers to large-size problems.

Fig. 4. HB iterative solution procedure. At the initial iteration, one box is allowed per class. After solving the MCP model, the 4 misclassified samples represented by circles are re-assigned to an extra box, while no additional box is given to the class represented by triangles because it has zero misclassifications. After solving the MCP model for the 2nd iteration, the only misclassified sample from the circle class is given another box, and the MCP model is solved once more. The iterative procedure terminates at the 3rd iteration because the total number of misclassifications fails to decrease from the previous iteration.
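The iterative procedure in this caption can be sketched as a simple loop. The sketch below is a hedged illustration only: `solve_mcp` is a hypothetical callable standing in for the MCP optimisation model (it takes the number of boxes allowed per class and returns the number of misclassified samples per class), and `max_iter` is an assumed safeguard that is not part of the description above.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

BoxCounts = Dict[Hashable, int]


def iterate_hyper_boxes(
    solve_mcp: Callable[[BoxCounts], Dict[Hashable, int]],
    classes: Sequence[Hashable],
    max_iter: int = 50,  # assumed safeguard, not from the source
) -> Tuple[BoxCounts, List[int]]:
    """Add one box per iteration to every class with misclassifications.

    Stops when the total number of misclassifications fails to decrease
    from the previous iteration (or reaches zero).
    Returns the box counts used in the last solve and the history of
    total misclassifications.
    """
    boxes = {c: 1 for c in classes}  # one box per class initially
    history: List[int] = []
    previous_total = None
    for _ in range(max_iter):
        misclassified = solve_mcp(dict(boxes))
        total = sum(misclassified.values())
        history.append(total)
        if previous_total is not None and total >= previous_total:
            break  # no improvement over the previous iteration
        previous_total = total
        if total == 0:
            break
        for cls, count in misclassified.items():
            if count > 0:
                boxes[cls] += 1  # extra box for the misclassified samples
    return boxes, history


# Stub solver replaying the misclassification counts described in Fig. 4.
replay = iter([
    {"circle": 4, "triangle": 0},
    {"circle": 1, "triangle": 0},
    {"circle": 1, "triangle": 0},
])
counts, totals = iterate_hyper_boxes(lambda b: next(replay), ["circle", "triangle"])
print(counts, totals)
```

With the stub above, the loop stops at the 3rd iteration, as in the figure, because the total stays at one misclassification.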

Fig. 5. Two types of hyper box misclassifications. a: type 1 misclassification, where a sample is not correctly enclosed by its hyper box and lies outside all the boxes from other classes; b: type 2 misclassification, where a sample is not correctly enclosed by its hyper box and lies inside at least one of the boxes belonging to another class.
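The two misclassification types in this caption can be told apart with a simple geometric test. The following sketch is an illustration under assumptions rather than the formulation of this work: each box is represented by its central coordinate and length on every attribute, a sample counts as enclosed when it lies within half the length of the centre on all attributes, and a sample is treated as correctly enclosed if any box of its own class contains it (the model itself assigns each sample to one specific box). The function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# A box is (centre, length), one value per attribute.
Box = Tuple[Sequence[float], Sequence[float]]


def inside(sample: Sequence[float], box: Box) -> bool:
    """True if the sample lies within the box on every attribute."""
    centre, length = box
    return all(abs(a - c) <= l / 2 for a, c, l in zip(sample, centre, length))


def misclassification_type(
    sample: Sequence[float],
    label: Hashable,
    boxes: Dict[Hashable, List[Box]],
) -> int:
    """Return 0 (correctly enclosed), 1 (type 1) or 2 (type 2)."""
    if any(inside(sample, b) for b in boxes.get(label, [])):
        return 0
    for other, other_boxes in boxes.items():
        if other != label and any(inside(sample, b) for b in other_boxes):
            return 2  # enclosed by a box of another class
    return 1  # outside every box


# Toy two-attribute example with made-up boxes.
boxes = {
    "A": [((0.0, 0.0), (2.0, 2.0))],
    "B": [((5.0, 5.0), (2.0, 2.0))],
}
print(misclassification_type((0.5, 0.5), "A", boxes))  # enclosed by A
print(misclassification_type((3.0, 3.0), "A", boxes))  # outside all boxes
print(misclassification_type((5.0, 5.5), "A", boxes))  # inside a B box
```

A function of this kind would provide the input to a re-weighting step, since SRW_HB gives type 2 samples more weight than the others; the size of that weight increase is not specified here and is therefore left out of the sketch.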

Scenario 1: perform 50 random partitions of each dataset into a training set containing 70% of the samples and a testing set containing the remaining 30%. For each partition, we train a classifier on the training set and test its classification performance on the testing set.
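The evaluation protocol of scenario 1 can be sketched as follows. This is a minimal illustration with assumptions stated explicitly: the partition is plain (non-stratified) random sampling, the classification rate is averaged over the runs, and `fit` and `predict` are hypothetical callables standing in for whichever classifier is being tested.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple


def random_partition(
    n_samples: int,
    train_fraction: float = 0.7,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Split sample indices into disjoint training and testing sets."""
    if rng is None:
        rng = random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def evaluate_scenario_1(
    X: Sequence,
    y: Sequence,
    fit: Callable,
    predict: Callable,
    n_runs: int = 50,
    seed: int = 0,  # assumed, for reproducibility of the sketch
) -> float:
    """Average testing classification rate over repeated 70/30 splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_runs):
        train, test = random_partition(len(y), 0.7, rng)
        model = fit([X[i] for i in train], [y[i] for i in train])
        predictions = predict(model, [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(predictions, test))
        rates.append(correct / len(test))
    return mean(rates)


# Toy usage with a majority-class stub in place of a real classifier.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)


X = [[float(i)] for i in range(10)]
y = ["a"] * 10
print(evaluate_scenario_1(X, y, fit_majority, predict_majority))
```

Passing a single seeded generator through all runs keeps the 50 partitions reproducible while still making each partition different.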

Fig. 6. Flowchart of the proposed SRW_HB. The content highlighted in red differentiates SRW_HB from the traditional HB. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Sensitivity analysis of CT for the proposed SRW_HB on two testing scenarios. The blue line with triangle markers denotes scenario 1 and the red line with round markers denotes scenario 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Overall standing of classifiers. We give scores to competing classifiers according to the ranking of their classification rates in each dataset and average the scores over all datasets to comprehensively evaluate their relative competitiveness. In both scenarios, the proposed SRW_HB leads to the most robust classification performance among all implemented classifiers.
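The rank-based scoring described in this caption can be sketched as below. The exact scoring rule is not given here, so the sketch makes an assumption: with K classifiers, the best one on a dataset receives K points and the worst receives 1, and tied classifiers share the mean of the points for the positions they occupy. The function names and the toy rates are illustrative only and are not results from this work.

```python
from typing import Dict, List


def rank_scores(rates: Dict[str, float]) -> Dict[str, float]:
    """Score classifiers on one dataset: best gets K points, worst gets 1."""
    k = len(rates)
    ordered = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    scores: Dict[str, float] = {}
    pos = 0
    while pos < k:
        end = pos
        # Extend the group while the next classifier has the same rate.
        while end + 1 < k and ordered[end + 1][1] == ordered[pos][1]:
            end += 1
        shared = sum(k - p for p in range(pos, end + 1)) / (end - pos + 1)
        for name, _ in ordered[pos:end + 1]:
            scores[name] = shared
        pos = end + 1
    return scores


def overall_standing(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the per-dataset scores of each classifier over all datasets."""
    collected: Dict[str, List[float]] = {}
    for rates in table.values():
        for name, score in rank_scores(rates).items():
            collected.setdefault(name, []).append(score)
    return {name: sum(vals) / len(vals) for name, vals in collected.items()}


# Made-up classification rates for three hypothetical classifiers.
toy = {
    "dataset_1": {"clf_a": 0.9, "clf_b": 0.8, "clf_c": 0.8},
    "dataset_2": {"clf_a": 0.7, "clf_b": 0.95, "clf_c": 0.6},
}
print(overall_standing(toy))
```

Scoring by rank rather than by raw classification rate keeps one unusually easy or hard dataset from dominating the average.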

Fig. 10. Computational cost comparison between HB, SRW_HB and DR_SRW_HB. The figure reports the average computational time per run in scenario 2 for traditional HB, SRW_HB and DR_SRW_HB on 4 datasets: Phenol, Glass, Breast tissue and Ionosphere. DR_SRW_HB, which solves two sub-problems of smaller size, clearly requires significantly lower computational cost.

Table 1
Relative strength and weakness of certain classifiers.
im: central coordinate of hyper box i on attribute m
LE_im: length of hyper box i on attribute m
Binary variables
E_s: 1 if sample s is correctly enclosed in its hyper box i_s; 0 otherwise
Y_ijm: 1 if, on attribute m, the lower bound of hyper box i is greater than the upper bound of hyper box j, ensuring non-overlapping between the two; 0 otherwise
Whether a sample s is enclosed in its corresponding hyper box i or not is modelled using the following two sets of constraints:

Table 2
Summary of real world datasets.

Table 3
Classification rate comparison for scenario 1.

Table 4
Classification rate comparison for scenario 2.

Table 5
Classification rate comparison between two proposed classifiers DR_SRW_HB and SRW_HB.