Imbalanced Ensemble Classifier for learning from imbalanced business school data set

Private business schools in India face a common problem: selecting quality students for their MBA programs so as to achieve the desired placement percentage. The data sets involved are generally biased towards one class, i.e., imbalanced in nature, and learning from an imbalanced data set is difficult. This paper proposes an imbalanced ensemble classifier that handles the imbalanced nature of the data set and achieves higher accuracy in a feature selection (selection of important characteristics of students) cum classification problem (prediction of placements based on the students' characteristics) for an Indian business school data set. The optimal value of an important model parameter is found. Numerical evidence on the Indian business school data set is also provided to assess the performance of the proposed classifier.


Introduction
Of the many reasons behind the closure of private business schools, the foremost is the unemployment of Master of Business Administration (MBA) students graduating from them. The most challenging job for administrators is to find the optimal set of parameters for choosing the right candidates for their MBA program, so that the employability of those candidates is ensured. A business school's ability to attract students depends heavily on its past placement records. If the right set of students is not selected for a few years, the number of unplaced students will accumulate and damage the reputation of the school. One therefore needs a model that ensures appropriate feature selection (selection of important student characteristics), decides on the optimal values or ranges of those features, and achieves high prediction accuracy. In our previous work, we proposed a hybrid classifier based on classification tree (CT) and artificial neural network (ANN) (referred to as the hybrid CT-ANN model in the rest of the paper) to solve the business school problem [1].
In this article, we identify a vital property of the business school data set, namely its imbalanced nature. Usual classifiers assume that the classes to be distinguished have a comparable number of instances. Many real-world data sets, including the business school data set, are skewed: most cases belong to a larger class and fewer cases belong to a smaller, yet usually more interesting, class. There are also cases where the cost of misclassifying minority examples is much higher, given the seriousness of the problem at hand [2].
Email: Tanujit Chakraborty (tanujit_r@isical.ac.in)
Because such classifiers give more weight to the majority class, they tend to misclassify minority class examples as majority class examples, which leads to a high false negative rate. The business school data set is clearly a two-class problem with a class distribution of 80:20, where the naive strategy of predicting every student as placed would already achieve an accuracy of 80%.
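The 80% figure can be checked with a small sketch. The labels below are hypothetical (80 placed, 20 not placed) and are not the actual business school data; the function name `evaluate` is ours. The sketch shows that the "always placed" rule reaches 80% accuracy while identifying none of the minority class.

```python
def evaluate(y_true, y_pred, minority=0):
    """Return (accuracy, recall on the minority class) for two label lists."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    minority_total = sum(t == minority for t in y_true)
    minority_hits = sum(t == minority and p == minority
                        for t, p in zip(y_true, y_pred))
    return accuracy, minority_hits / minority_total

# Hypothetical labels with an 80:20 split: 1 = placed, 0 = not placed.
y_true = [1] * 80 + [0] * 20
# Naive rule: predict "placed" for every student.
y_pred = [1] * 100

accuracy, minority_recall = evaluate(y_true, y_pred)
print(accuracy, minority_recall)  # 0.8 0.0
```

Accuracy alone therefore says little about how well the minority class is handled, which is why measures based on the full confusion matrix are used later in the paper.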
There are broadly two ways to deal with imbalanced data problems. The first is to modify the class distribution of the training data by applying sampling techniques: oversampling the minority class to match the size of the majority class and/or undersampling the majority class to match the size of the minority class. Sampling is a popular strategy because it simply rebalances the data at the preprocessing stage. But these approaches have obvious deficiencies: undersampling majority instances may discard potentially useful information, and oversampling increases the size of the training data set, which may increase the computational cost. Sampling, however, is not the only way of handling imbalanced data sets. The second way is to use specially designed "imbalanced data-oriented" algorithms that perform well on the unmodified original imbalanced data sets. One of the most celebrated of these is the Hellinger distance decision tree (HDDT) [3], which uses the Hellinger distance (HD) as the decision tree splitting criterion and is insensitive to the skewness of the class distribution [4]. An immediate extension of this work is the HD based random forest (HDRF) [5]. Another breakthrough in the literature is the class confidence proportion decision tree (CCPDT), a robust decision tree algorithm that can also handle original imbalanced data sets [6]. "Imbalanced data-oriented" classifiers are sometimes preferred precisely because they work with the original data sets. We are therefore motivated to ask: can we create an ensemble imbalanced data-oriented classifier that improves on the performance of HDDT, mitigates the need for sampling, and solves the Indian business school data problem?
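For readers unfamiliar with the sampling route, the following is a minimal sketch of random oversampling and random undersampling. It is only an illustration of the two deficiencies mentioned above, not a method used in this paper; the function names and the toy records are hypothetical, and the majority list is assumed to be at least as long as the minority list.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority records until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep only a random subset of the majority class, as large as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

# Hypothetical records with an 80:20 class split.
majority = [("placed", i) for i in range(80)]
minority = [("unplaced", i) for i in range(20)]

over = random_oversample(majority, minority)
under = random_undersample(majority, minority)
print(len(over), len(under))  # 160 40
```

Oversampling grows the training set from 100 to 160 records, while undersampling discards 60 of the 80 majority records, which is the trade-off that motivates working with the original data.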
In response to this question, we propose an ensemble classifier for feature selection cum classification problems that can be used to solve the imbalanced business school data set problem. The proposed ensemble classifier combines the advantages of the HDDT and ANN algorithms and performs well in high dimensional feature spaces. The optimal choice of an important model parameter is also proposed in this paper. Numerical evidence based on the business school data set shows the robustness of the proposed algorithm.
This paper is organized as follows. In section 2, we describe the proposed ensemble model. The theoretical results are presented in section 3 and experimental evaluation is shown in section 4. Section 5 is fully devoted to the concluding remarks of the paper.

An overview on HDDT
Chawla [3] proposed HDDT, which uses HD as the splitting criterion to build a decision tree. HD is a measure of distributional divergence and has the property of skew insensitivity [7]. Let $(\Theta, \lambda)$ denote a measurable space. For any binary classification problem, suppose that $P$ and $Q$ are two continuous distributions with respect to the parameter $\lambda$, having densities $p$ and $q$ in a continuous space $\Omega$, respectively. HD is defined as follows:

$$d_H(P, Q) = \sqrt{\int_{\Omega} \left(\sqrt{p} - \sqrt{q}\right)^2 d\lambda} = \sqrt{2\left(1 - \int_{\Omega} \sqrt{pq}\, d\lambda\right)}, \qquad (1)$$

where $\int_{\Omega} \sqrt{pq}\, d\lambda$ is the Hellinger integral. Note that HD does not depend on the choice of the parameter $\lambda$. Given a countable space $\Phi$, HD can also be written as follows:

$$d_H(P, Q) = \sqrt{\sum_{\phi \in \Phi} \left(\sqrt{P(\phi)} - \sqrt{Q(\phi)}\right)^2}. \qquad (2)$$

The bigger the value of HD, the better the discrimination between the features. A feature is selected that carries the minimal affinity between the classes. For the application of HD as a decision tree criterion, the final formulation can be given as follows:

$$HD = d_H(X_+, X_-) = \sqrt{\sum_{j=1}^{K} \left(\sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}}\right)^2}, \qquad (3)$$

where $|X_+|$ indicates the number of examples that belong to the majority class in the training set and $|X_{+j}|$ is the subset of the training set with the majority class and the value $j$ for the feature $X$. A similar explanation holds for $|X_-|$ and $|X_{-j}|$, but for the minority class. Here $K$ is the number of partitions of the feature space $X$. Since equation (3) is not influenced by the prior probability, it is insensitive to the class distribution. Based on the experimental results, Chawla [3] concluded that unpruned HDDT is recommended for dealing with imbalanced problems as a better alternative to sampling approaches.
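As an illustration of the decision tree criterion, the sketch below computes the Hellinger distance between the two class-conditional distributions of a discretised feature, i.e., the square root of the summed squared differences between $\sqrt{|X_{+j}|/|X_+|}$ and $\sqrt{|X_{-j}|/|X_-|}$ over the partitions $j$. The function name `hellinger_split` and the toy data are hypothetical, and the sketch assumes both classes are present in the sample.

```python
from math import sqrt

def hellinger_split(feature_bins, labels, positive=1):
    """Hellinger distance between the class-conditional distributions of one
    discretised feature. Assumes both classes occur at least once."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    total = 0.0
    for b in set(feature_bins):
        pos_in_bin = sum(1 for x, y in zip(feature_bins, labels)
                         if x == b and y == positive)
        neg_in_bin = sum(1 for x, y in zip(feature_bins, labels)
                         if x == b and y != positive)
        # Only within-class proportions enter, never the class priors.
        total += (sqrt(pos_in_bin / n_pos) - sqrt(neg_in_bin / n_neg)) ** 2
    return sqrt(total)

# A feature that separates a 4:1 imbalanced sample perfectly.
separating = hellinger_split([0, 0, 0, 0, 1], [1, 1, 1, 1, 0])
# A feature distributed identically in both classes.
uninformative = hellinger_split([0, 1, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0])
print(separating, uninformative)
```

In the first case the distance reaches its maximum of $\sqrt{2}$ despite the 4:1 class ratio, and in the second it is zero, which reflects the skew insensitivity discussed above.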

An overview on ANN
Neural network models are inspired by biological nervous systems [8]. The function of a network is determined largely by the connections between its elements. A neural network is trained to perform a particular function by adjusting the values of the connections (weights) between elements, so that a particular input (feature vector) leads to a specific target output (class label). The network is adjusted, based on a comparison of the output and the target, until the network output matches the target. The mapping function used in an ANN is very flexible: given the right weights, it can approximate almost any functional form to any degree of accuracy. This function approximation is carried out mainly through an activation function (for example, sigmoid, logsig, tansig, etc.). A common neural network architecture is shown in Fig. 1. While training the network on a particular data set, overfitting can be avoided by training for a limited number of epochs [9]. Standard backpropagation (for feedforward networks) is a gradient descent algorithm in which the weights are moved along the negative of the gradient of the performance function. Typically, a properly trained network produces, for a new input, an output similar to the correct outputs for the input vectors used in training. Complex neural networks have more than one hidden layer in their architecture.
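To make the mapping function concrete, here is a minimal sketch of the forward pass of a network with one hidden layer and sigmoid activations. It covers only the forward computation, not backpropagation, and the function names and the example weights are hypothetical.

```python
from math import exp

def sigmoid(t):
    """Logistic activation function."""
    return 1.0 / (1.0 + exp(-t))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer network with a single sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(c * h for c, h in zip(output_weights, hidden)) + output_bias)

# With all weights at zero the network is uninformative and outputs 0.5.
untrained = forward([0.2, 0.7], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
# Arbitrary illustrative weights give an output strictly between 0 and 1.
example = forward([0.2, 0.7], [[1.5, -2.0], [0.3, 0.8]], [0.1, -0.4], [2.0, -1.0], 0.5)
print(untrained, example)
```

Training consists of moving these weights along the negative gradient of the performance function so that the output approaches the target class label.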

Proposed Imbalanced Ensemble Classifier
The motivation behind designing an ensemble classifier for imbalanced data sets is that we would like to work with the original data set without taking recourse to sampling. Here we create an ensemble classifier that utilizes the power of HDDT as well as the superiority of neural networks. In the proposed imbalanced ensemble classifier (denoted by IEC in the rest of the paper), we first split the feature space into areas using the HDDT algorithm. The most important features are chosen using HDDT and redundant features are eliminated. We then build an ANN model using the important variables obtained through the HDDT algorithm. In addition, the prediction results obtained from HDDT are used as a further input in the input layer of the neural network. The effectiveness of the proposed classifier lies in the selection of important features and the use of the HDDT prediction results in the ANN model that follows. The inclusion of the HDDT output as an additional input feature not only improves the model accuracy but also increases class separability. The informal workflow of our proposed IEC model, shown in Figure 2, is as follows:
• Sort the feature values in ascending order and find the splits between adjacent distinct values of the feature. Calculate the binary conditional probability divergence at each split using the HD measure (see equation (1)).
• Record the highest divergence as the divergence of the whole feature. Choose the feature that has the maximum HD value and grow an unpruned HDDT.
• Using the HDDT algorithm, build a decision tree. The feature selection model generated by HDDT takes into account the imbalanced nature of the data set.
• Use the prediction result of the HDDT algorithm as an additional feature in the input layer of the ANN model. Export the important input variables along with the additional feature to the ANN model and generate a neural network.
• Since the output of HDDT has been incorporated as an additional feature, along with the other important features obtained by HDDT, in the input layer of the ANN, the number of hidden layers is chosen to be one.
IEC not only handles imbalance by using HDDT to select features, but also improves the performance of the classifier by incorporating the classification results obtained from HDDT, which the ANN algorithm then improves further. The algorithm is a two-step problem-solving approach that handles the imbalanced class distribution, selects important features, and yields an improved ensemble classifier. The optimal characteristics of students that affect placements can be chosen by our model, and future predictions on the imbalanced data set can also be made by IEC.
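The hand-over from the tree to the network in the workflow above can be sketched as follows. This is only an illustration under stated assumptions: `stump_predict` is a hypothetical single-threshold stand-in for a fitted HDDT, and the feature indices in `selected` are hypothetical as well.

```python
def build_ann_inputs(rows, selected, tree_predict):
    """Keep the selected features of each row and append the tree's prediction,
    giving an ANN input of dimension len(selected) + 1."""
    return [[row[j] for j in selected] + [tree_predict(row)] for row in rows]

# Hypothetical stand-in for a fitted HDDT: one threshold on feature 0.
def stump_predict(row):
    return 1 if row[0] >= 0.5 else 0

rows = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.6]]
selected = [0, 2]  # hypothetical indices of the features retained by the tree

ann_inputs = build_ann_inputs(rows, selected, stump_predict)
print(ann_inputs)  # [[0.9, 0.4, 1], [0.2, 0.6, 0]]
```

Each input vector carries the retained features plus one extra coordinate holding the tree's prediction, which is the quantity the single hidden layer then refines.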

Optimal value of IEC model parameter
Our proposed IEC has the following architecture: first, it extracts important features from the feature space using the HDDT algorithm, then it builds one hidden layered ANN model with the important features extracted using HDDT along with HDDT outputs as an additional feature. Now we are going to find out the optimal value of the number of neurons in the hidden layer of the proposed model.
Let $\mathcal{X}$ be the space of all possible values of $p$ features and $C$ be the set of all possible binary outcomes. We are given a training sample with $n$ observations, $L = \{(X_1, C_1), (X_2, C_2), ..., (X_n, C_n)\}$, where $X_i = (X_{i1}, X_{i2}, ..., X_{ip}) \in \mathcal{X}$ and $C_i \in C$. We build IEC with the features given by HDDT and the HDDT output (OP) as another input feature in the model. The dimension of the input layer in the ANN model, denoted by $d_m$ ($\leq p$), is the number of important features obtained by HDDT + 1. We use one hidden layer in the model because OP is incorporated as input information. It should be noted that one-hidden layered neural networks yield strong universal consistency and there is little theoretical gain in considering two or more hidden layers [10]. In the IEC model, we use one hidden layer with $k$ neurons. This makes the proposed ensemble binary classifier less complex and less time consuming to implement. After elimination of redundant features by HDDT and incorporation of OP as another input vector, consider the training sequence $\xi_n = \{(Z_1, Y_1), ..., (Z_n, Y_n)\}$ of $n$ i.i.d. copies of $(Z, Y)$ taking values in $\mathbb{R}^{d_m} \times C$. A classification rule realized by a one-hidden layered neural network with logistic sigmoid activation function is chosen to minimize the empirical $L_1$ risk, where the $L_1$ error of a function $\psi : \mathbb{R}^{d_m} \to \{0, 1\}$ is defined by $J(\psi) = E\{|\psi(Z) - Y|\}$. The theorem stated below is based on the idea of Lugosi & Zeger (1995) [11], which states the regularity conditions for universal consistency of the one hidden layered ANN model.
Theorem 1. Consider a neural network with one hidden layer with bounded output weights having $k$ hidden neurons and let $\sigma$ be a logistic squasher. Let $F_{n,k}$ be the class of neural networks with logistic squasher defined as

$$F_{n,k} = \left\{ \sum_{i=1}^{k} c_i \sigma(a_i^T z + b_i) + c_0 \; ; \; a_i \in \mathbb{R}^{d_m},\ b_i, c_i \in \mathbb{R},\ \sum_{i=0}^{k} |c_i| \leq \beta_n \right\}$$

and let $\psi_n$ be the function that minimizes the empirical $L_1$ error over $F_{n,k}$. It can be shown that if $k$ and $\beta_n$ satisfy

$$k \to \infty, \quad \beta_n \to \infty, \quad \frac{k \beta_n^2 \log(k \beta_n)}{n} \to 0,$$

then the classification rule is universally consistent.
To obtain the optimal choice of $k$ for the proposed model, it is necessary to obtain upper bounds on the rate of convergence, i.e., on how fast $J(\psi_n)$ approaches zero [12]. For the estimation error, a distribution-free upper bound on the rate of convergence is available [13]. To obtain the optimal value of $k$, it is therefore enough to find upper bounds on the estimation and approximation errors. The upper bound on the approximation error was investigated by Barron [14].
Proposition 1. For a fixed $d_m$, the optimal number of hidden neurons $k$ in the IEC model is $O\left(\sqrt{\frac{n}{d_m \log(n)}}\right)$.
Proof. The upper bound on the approximation error was found by Barron [14] to be $O\left(\frac{1}{\sqrt{k}}\right)$. Although the approximation error goes to zero as the number of neurons goes to infinity for a strongly universally consistent classifier, in practical implementations the number of neurons is often fixed (e.g., it cannot be increased with the size of the training sample).
Using lemma 3 of [13], we can write that the estimation error is always $O\left(\sqrt{\frac{k d_m \log(n)}{n}}\right)$.
Bringing the above facts together, the overall error is of the order

$$O\left(\sqrt{\frac{k d_m \log(n)}{n}} + \frac{1}{\sqrt{k}}\right).$$

Now, to find the optimal value of $k$, the problem reduces to equating $\sqrt{\frac{k d_m \log(n)}{n}}$ with $\frac{1}{\sqrt{k}}$, which gives $k = O\left(\sqrt{\frac{n}{d_m \log(n)}}\right)$.
Remark. The optimal number of hidden nodes is found to be $O\left(\sqrt{\frac{n}{d_m \log(n)}}\right)$.
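As a rough numerical illustration, the sketch below evaluates $\sqrt{n / (d_m \log(n))}$, the value obtained when a $\sqrt{k d_m \log(n) / n}$ estimation term is balanced against a $1/\sqrt{k}$ approximation term. Both that form of the bound and the use of the natural logarithm are assumptions of this sketch, and the sample size below is hypothetical, not the size of the business school data set.

```python
from math import log, sqrt

def suggested_hidden_neurons(n, d_m):
    """Order-of-magnitude guide for the number of hidden neurons,
    assuming k = sqrt(n / (d_m * log(n))) with the natural logarithm."""
    return sqrt(n / (d_m * log(n)))

# Hypothetical sample size; d_m = 6 corresponds to five selected
# features plus the HDDT output as the extra input.
k = suggested_hidden_neurons(n=1000, d_m=6)
print(round(k, 2))  # about 4.9
```

Since the bound is stated only up to a constant, the value is best read as a starting point for the hidden layer size (here around five neurons) rather than as an exact prescription.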

Application to Indian Business School Data
In this section, we first describe the business school data in brief and also discuss different evaluation measures that are used in this study. Subsequently, we are going to report the experimental results and compare our proposed IEC model with other state-of-the-art classifiers.

Description of data set
The data was provided by a private business school that receives applications for its MBA program from across the country and admits a pre-specified number of students every year. The data set comprises several parameters of the profiles of students who graduated over the last 5 years, along with their placement information. It has 17 explanatory variables, of which 7 are categorical and 10 are continuous, representing the parameters of the students, and one response variable, namely placement, which indicates whether the student got placed or not [1]. To measure the level of imbalance of the data set, we compute the coefficient of variation (CV), which is the proportion of the deviation in the observed number of examples for each class versus the expected number of examples in each class [15]. A data set with a CV of at least 0.30 (a class ratio of 2 : 1 on a binary data set) is regarded as imbalanced. For the business school data set, the CV turns out to be 0.50. We also applied 5 × 2 cross-validation while evaluating the classifiers [16], in which the data set is broken into class-stratified parts, allowing two experiments in which one part is used for training (70% of the data) and the other for testing (30% of the data). The experiments are repeated 5 times and the average results are reported in the paper. Table 1 gives an overview of the data set.

Performance measures
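The performance evaluation measures used in our experimental analysis are based on the confusion matrix. The higher the value of a performance metric, the better the classifier. The performance measures are expressed in terms of the following quantities:

$$\text{Precision} = \frac{TP}{TP + FP}; \quad \text{Sensitivity} = \frac{TP}{TP + FN}; \quad \text{Specificity} = \frac{TN}{FP + TN},$$

where $TP$, $FP$, $TN$ and $FN$ denote the numbers of true positives, false positives, true negatives and false negatives in the confusion matrix.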
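The three quantities above can be computed directly from predicted and true labels. The sketch below uses hypothetical labels and function names, and assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def basic_measures(tp, fp, tn, fn):
    """Precision, sensitivity and specificity; assumes non-zero denominators."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
    }

# Hypothetical labels: four positives and six negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
measures = basic_measures(*counts)
print(counts, measures)
```

Here the counts are TP = 3, FP = 1, TN = 5 and FN = 1, so precision and sensitivity are both 0.75 and specificity is 5/6.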

Analysis of results
We aim to select the optimal set of features and the corresponding model for selecting the right set of students, who will be fit for the MBA program of a business school and will subsequently be placed as well. We compare our proposed imbalanced ensemble classifier (IEC) mainly with other "imbalanced data-oriented" classifiers of a similar type. Different performance metrics are computed to draw conclusions from the experimental results. All the methods were implemented in the R statistical package on a PC with a 2.1 GHz processor and 8 GB of memory. We started the experimentation with the HDDT algorithm, using the R package 'CORElearn' for learning from the imbalanced business school data set. HDDT achieved around 93% accuracy while CT achieved around 83% accuracy. This indicates that "imbalanced data-oriented" classifiers perform better than traditional supervised classifiers designed for general purposes. Further, we implemented HDRF and CCPDT, which are among the other imbalanced data-oriented algorithms. Finally, we applied our proposed imbalanced ensemble classifier, which is a two-step methodology. In the first step, we select important features using HDDT and record its classification outputs. The important features obtained for the business school data set by applying HDDT are: SSC Percentage, HSC Percentage, Entrance Test Percentile, Degree Percentage, and Work Experience. In the next step, we design a neural network with these important features along with the HDDT output as an additional feature vector. The number of hidden neurons in the hidden layer is chosen based on the recommendation for the model (see Remark in Section 3). The min-max method is used to scale the data to the interval [0, 1]. ANN training was done using the 'neuralnet' implementation in R. The performance of the different classifiers in terms of the different performance metrics is reported in Table 2.
It is clear from Table 2 that our proposed methodology achieved an accuracy of 96% for prediction in business school data set.
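The min-max scaling step mentioned above can be sketched as follows. The experiments in the paper were run in R; this Python version, with its hypothetical function name and values, is only an illustration of the transformation, and mapping a constant feature to zeros is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical percentages for three students.
scaled = min_max_scale([55.0, 70.0, 85.0])
print(scaled)  # [0.0, 0.5, 1.0]
```

Scaling each input to a common range keeps features measured on different scales, such as percentages and years of work experience, from dominating the weighted sums in the network.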

Conclusion
We proposed an imbalanced ensemble classifier (IEC) that takes data imbalance into account and used it for feature selection cum classification problems. Through experimental evaluation, we have shown that our proposed methodology performs well compared to other state-of-the-art models. It is also important to note that "imbalanced data-oriented" algorithms perform well on the original imbalanced data sets [4]. If we would like to work with the original data without taking recourse to sampling, our proposed methodology will be quite handy. IEC has desirable statistical properties such as universal consistency, has fewer tuning parameters, and achieves higher accuracy than the HDDT and ANN models. We thereby conclude that for the imbalanced business school data set it is sufficient to use the IEC model, without taking recourse to sampling or to any other imbalanced data-oriented single classifier. Owing to the robustness of the proposed IEC algorithm, it can also be useful in other imbalanced classification problems.