Creating Ensemble Classifiers with Information Entropy Diversity Measure

Chinese Academy of International Trade and Economic Cooperation, Beijing 10071, China
School of Business Administration, Zhejiang Gongshang University, Hangzhou 310018, China
School of E-Commerce & Management Science, Zhejiang Gongshang University, Hangzhou 310018, China
Sunyard System Engineering Co., Ltd., Hangzhou 310053, China
College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, China
Zhijiang College, Zhejiang University of Technology, Shaoxing 312030, China


Introduction
The ensemble method was first proposed by Hansen and Salamon to optimize neural networks [1]. It is well known that an ensemble learning model is usually more accurate than a single learning model [2-8]. According to Singh's work, classifier combination is now widely applied in machine learning and pattern recognition, in areas such as text classification, speech recognition, seismic wave analysis, communication networks, and online transaction log analysis [9]. Instead of constructing a monolithic system, ensemble learning constructs a pool of learners and combines them in a smart way into an overall system. In the research area of dynamic data stream classification, ensemble learning has become one of the hot spots [10].
Recently, many studies have proposed various kinds of classifiers, especially in the field of data stream mining [11,12]. To cope with concept drift, some published papers focus on dynamic weighting mechanisms. Wang and Pineau proposed online cost-sensitive boosting algorithms, which can achieve accuracy similar to that of traditional boosting with simpler base models [13]. Tennant et al. presented a real-time data stream classifier that addresses the overlap of the velocity and volume aspects of big data analytics and is adaptive to concept drift [14]. These ensemble classification approaches have good stability and can overcome the general concept drift phenomenon in data stream classification. However, there is no evidence that an ensemble classifier system with more base classifiers is better than one with fewer base classifiers. Sometimes, the classifier fusion method creates large-scale classifiers that require a great deal of memory and computing resources, leading to low efficiency. To solve this problem, Zhou offered valuable insights into the diversity metric, which can be leveraged to select a subset of learners to comprise the final ensemble [15]. It was proved that ensemble classifiers with greater diversity have stronger generalization ability. However, Bi demonstrated that the accuracy of classifiers was not strongly correlated with diversity; in some contexts, the relationship was negative [16]. Luo put forward a self-adapted classifier ensemble method with particle classification information, considering both ensemble classifier accuracy and diversity [17]. In this model, particle classification information was used to mark the learning effect, and the product of weighted accuracy and diversity was the selection criterion in a base classifier filter named C-Lib.
Thus, whether diversity is correlated with ensemble performance remains unclear.
In this paper, we propose a method for generating ensemble classifiers by measuring the diversity between base classifiers, and we present an incremental classification algorithm that maximizes the diversity of component classifiers while minimizing the system cost of the ensemble classifier. We verify the approach in data stream mining against other traditional algorithms; the results suggest that the proposed model is efficient and promising.

Ensemble Classifier Diversity.
The diversity of an ensemble classifier is the difference among its base classifiers, and an ensemble classifier with high diversity is one whose members complement each other. According to previous work, data misclassified by one classifier can probably be correctly classified by others, so an ensemble classifier with several different base classifiers achieves higher overall performance and better stability than a single classifier [18,19]. Figure 1(a) presents the diversity between two linear classifiers in an ensemble classifier with different data distributions. Suppose Dataset A and Dataset B are datasets from two different classes, and the data distributions are denoted by two irregular curves. Linear classifier p and linear classifier q are two selected base classifiers of an ensemble classifier, which are trained on the test dataset. The data correctly classified by linear classifier p is denoted by the regions marked with horizontal bars: areas S1 and S2. The data correctly classified by linear classifier q is denoted by the regions marked with vertical bars: area H1. It is clear that many blank regions are still left uncorrected and that the diversity between the two linear classifiers is not high, resulting in an ineffective ensemble classifier.
Suppose we select another two base classifiers, linear classifier i and linear classifier j, which are combined into an ensemble classifier with the same data distribution as in Figure 1(a). Figure 1(b) shows a better classification result. More data is correctly classified, as shown in areas S1′, S2′, H1′, and H2′. Comparing the two ensemble classifiers, the ensemble linear classifier with base classifier set (i, j) is significantly superior to the one with base classifier set (p, q), which may be attributed to base classifier selection and optimization [20,21].
In data stream mining, the distribution of the dataset changes rapidly over time. In Figure 1(c), the dataset is changed to dataset M and dataset N, which are two typical data stream datasets, while the base classifiers of the ensemble classifier remain the same. The proportion of blank regions increases relative to Figure 1(b), and the accuracy of the ensemble classifier (i, j) for the dataset (M, N) is lower than that for the dataset (A, B). Such classification performance is far from the requirements of dynamic stream data mining. Therefore, the diversity of the same ensemble classifier can differ when the dataset changes. The diversity measure method is particularly important in classifier combination optimization, both for better selection decision support and for low computing resource consumption, especially in data stream classifiers.
Further, suppose another base linear classifier E is added to the ensemble classifier in Figure 1(b). If all the data in dataset A and dataset B can be correctly classified by the new ensemble classifier with E, leaving no blank area, this ensemble classifier is considered the best ensemble classifier, and the diversity is positively correlated with ensemble classification accuracy. If instead the blank area is larger than that in Figure 1(b), as with a new ensemble classifier that adds another base classifier F, the performance is lower, and the diversity among base classifiers is negatively correlated with ensemble classification accuracy.

Diversity Measure Method in Ensemble Classifiers.
Diversity among the members of a team of classifiers is deemed to be a key issue in the classifier ensemble problem. Unfortunately, diversity measurement is not straightforward because there is no generally accepted definition [14, 22-24]. According to Zhukov et al. [11], the diversity measure methods for ensemble classifiers can be divided into two categories: pairwise measures and nonpairwise measures. Pairwise diversity measures emphasize the local optimum and calculate the average (dis)similarity metric between all possible pairs of individual classifiers in an ensemble; examples are the Q-statistic and the correlation coefficient. Nonpairwise measures emphasize the global optimum and often calculate a statistic using the notion of entropy or using (dis)similarity metrics between individual classifiers and the averaged classifier [25-27]. Both methods combine accuracy and diversity.
Relevant concepts are defined to describe the two types of diversity measures as follows. Let $Z = \{z_1, \ldots, z_N\}$ be a labeled training dataset with $M$ different classes in total, with $z_j \in \mathbb{R}^n$ coming from the classification problem in question. Let $D = \{D_1, D_2, \ldots, D_L\}$ be a set of base classifiers and $C = \{1, 2, \ldots, M\}$ the class label set. Assume $z_j$ is a sample of training data from dataset $Z$, $z_j = \{A_1, A_2, \ldots, A_s, C_j\}$, described by $s$ feature values $A$ and one class label value $C_j \in C$. The output of a base classifier $D_i$ for $z_j$ is denoted $y_{j,i}$, with $y_{j,i} = 1$ if $D_i$ classifies $z_j$ correctly and $y_{j,i} = 0$ otherwise, $i = 1, \ldots, L$. The output of each $D_i$ can thus be represented by an $N$-dimensional binary vector, and together these vectors include all of the classification results of the base classifier set $D$ on the training dataset $Z$. Let $D_i$ and $D_k$ be a pair of base classifiers from $D$; the relationship between them can be described as in Table 1.
$N^{ab}$ denotes the number of training samples for which $y_{j,i} = a$ and $y_{j,k} = b$, that is, the number of samples that each of the base classifiers $D_i$ and $D_k$ classifies correctly or not. For example, $N^{10}$ is the number of training samples that are correctly classified by base classifier $D_i$ but incorrectly classified by $D_k$. The table follows the concept of the confusion matrix. Since the size of the training dataset $Z$ is $N$, it follows that $N = N^{11} + N^{10} + N^{01} + N^{00}$. Two commonly used measures of diversity are given as follows.
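The four counts of Table 1 can be tallied directly from the correct/incorrect outputs of two base classifiers. The following is a minimal sketch, not the authors' implementation; the function name `pairwise_counts` and the example vectors are hypothetical.

```python
from typing import Dict, Sequence


def pairwise_counts(y_i: Sequence[int], y_k: Sequence[int]) -> Dict[str, int]:
    """Tally N^{ab} for two output vectors (1 = correct, 0 = incorrect)."""
    if len(y_i) != len(y_k):
        raise ValueError("both classifiers must be scored on the same samples")
    counts = {"11": 0, "10": 0, "01": 0, "00": 0}
    for a, b in zip(y_i, y_k):
        # Key "ab": a is the output of D_i, b is the output of D_k.
        counts[f"{a}{b}"] += 1
    return counts


# Hypothetical outputs of D_i and D_k on N = 8 training samples.
y_i = [1, 1, 1, 0, 1, 0, 1, 1]
y_k = [1, 0, 1, 1, 1, 0, 0, 1]
counts = pairwise_counts(y_i, y_k)
print(counts)
```

The four counts always sum to the number of samples N, which gives a quick sanity check on the tally.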
According to Yule's Q-statistic, the diversity between two base classifiers $D_i$ and $D_k$ can be calculated by the equation

$$Q_{ik} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}, \tag{1}$$

where $N^{ab}$ is the number of elements $z_j$ of $Z$ for which $y_{j,i} = a$ and $y_{j,k} = b$ (see Table 1). $Q_{ik}$ ranges between −1 and 1; classifiers that tend to classify the same objects correctly have a positive Q value. In contrast, those that commit errors on different objects have a negative Q value. If two base classifiers are statistically independent, the expectation of $Q_{ik}$ is 0 [6,28]. The correlation coefficient between two base classifiers can be calculated as follows:

$$\rho_{ik} = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11} + N^{10})(N^{01} + N^{00})(N^{11} + N^{01})(N^{10} + N^{00})}}. \tag{2}$$

$\rho_{ik}$ has the same range as $Q_{ik}$, and the two have the same changing trend. It can be proved that $|\rho_{ik}| < |Q_{ik}|$ [10]. By this comparison, diversity measured by the Q-statistic is more accurate and sensitive than diversity measured by the correlation coefficient.
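As an illustration of the two pairwise measures, the sketch below computes Yule's Q-statistic and the correlation coefficient from the four counts of Table 1, using the standard textbook definitions of both quantities. The function name `q_and_rho` and the example counts are hypothetical, and returning NaN for a zero denominator is a choice made here, not something the paper specifies.

```python
import math
from typing import Tuple


def q_and_rho(n11: int, n10: int, n01: int, n00: int) -> Tuple[float, float]:
    """Return (Q, rho) for one pair of base classifiers."""
    agree = n11 * n00      # product of the "both right" and "both wrong" counts
    disagree = n01 * n10   # product of the two "exactly one right" counts
    q_den = agree + disagree
    rho_den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Both measures are undefined when their denominator is zero.
    q = (agree - disagree) / q_den if q_den else float("nan")
    rho = (agree - disagree) / rho_den if rho_den else float("nan")
    return q, rho


# Hypothetical counts for N = 100 samples.
q, rho = q_and_rho(n11=40, n10=10, n01=5, n00=45)
print(f"Q = {q:.3f}, rho = {rho:.3f}")
```

For these counts the two classifiers mostly succeed and fail on the same samples, so both values are positive, and the magnitude of rho is smaller than that of Q.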
In addition to these two pairwise measures of diversity, there are many other methods. The disagreement measure and the double-fault measure are two popular ones. When processing a data stream with ensemble classifiers, a pairwise diversity measure is an effective way to incrementally adjust the number of base classifiers. However, this paper applies nonpairwise diversity measures to the classifier ensemble for data stream processing, because nonpairwise measures can ensure the global optimum among classifiers when learning an ensemble classifier. In this paper, information entropy is incorporated into the diversity measure. Entropy is defined as a measure of uncertainty in information theory; the greater the entropy value, the greater the uncertainty, and vice versa. Information entropy can be applied to nonpairwise diversity measures of classifiers through a transformation of entropy.
For a data sample $Z_j \in Z$, the output of base classifier $D_i$ for the training data $Z_j$ is denoted by $y_{ji}$. If $Z_j$ is successfully classified by $D_i$, $y_{ji} = 1$, and otherwise $y_{ji} = 0$, $i = 1, \ldots, L$. If the outputs of $\lfloor L/2 \rfloor$ of the $L$ base classifiers for $Z_j$ are the same (0 or 1) and the outputs of the remaining $L - \lfloor L/2 \rfloor$ base classifiers take the alternative value, the diversity among classifiers for $Z_j$ is highest. If all the $y_{ji}$ values of the $L$ base classifiers are the same, all 0s or all 1s, there is no disagreement among base classifiers, and the diversity among classifiers for $Z_j$ is lowest. For $N$ training data, the measure of diversity based on information entropy is given by the following equation:

$$E = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{L - \lceil L/2 \rceil} \min\{R(Z_j),\, L - R(Z_j)\}. \tag{3}$$

In equation (3), $R(Z_j)$ denotes the number of classifiers from $D$ with the same output value $y_{ji}$, and the entropy $E$ varies between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity among the base classifiers in $D$. In the context of data stream mining, $E$ equal to 0 means the lowest diversity among the base classifiers, and the number of base classifiers in the ensemble classifier can be reduced while keeping reasonable classifier effectiveness. In contrast, an $E$ value close to 1 means that the diversity of the classifiers is high; several new base classifiers can be added to the ensemble classifier for better classification effectiveness. Based on the above concepts, we design an incremental classification algorithm based on information entropy diversity measures to optimize the effectiveness of ensemble classifiers in data stream processing.
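To make the behaviour of the entropy measure concrete, the sketch below scores a small matrix of correct/incorrect outputs. It assumes the commonly used nonpairwise entropy form, in which each sample contributes min(R, L − R) / (L − ⌈L/2⌉) and R is the number of base classifiers that classify the sample correctly; this form matches the description above (0 when all classifiers agree, 1 when they split as evenly as possible), but the function name `entropy_diversity` and the data are hypothetical.

```python
import math
from typing import Sequence


def entropy_diversity(outputs: Sequence[Sequence[int]]) -> float:
    """outputs[j][i] is 1 if base classifier i is correct on sample j, else 0."""
    n_samples = len(outputs)
    n_classifiers = len(outputs[0])
    if n_classifiers < 2:
        raise ValueError("diversity needs at least two base classifiers")
    # Largest possible value of min(R, L - R), reached by an even split.
    max_split = n_classifiers - math.ceil(n_classifiers / 2)
    total = 0
    for row in outputs:
        correct = sum(row)
        total += min(correct, n_classifiers - correct)
    return total / (n_samples * max_split)


# Hypothetical outputs of L = 3 base classifiers on N = 4 samples.
outputs = [
    [1, 1, 1],  # all agree: contributes 0
    [1, 0, 1],  # 2-to-1 split: contributes 1
    [0, 0, 0],  # all agree: contributes 0
    [0, 1, 0],  # 1-to-2 split: contributes 1
]
e_value = entropy_diversity(outputs)
print(e_value)
```

Because the per-sample term is symmetric in R and L − R, the result is the same whether R counts the correct or the incorrect outputs.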

An Incremental Classification Algorithm Based on Information Entropy Diversity Measure.
A typical data stream processing flow chart is shown in Figure 2. A data stream is fed into an incremental ensemble classifier continuously and in chronological order. The data stream is processed according to the time period and the time granularity, which are set based on different requirements. For example, a weblog data stream changes frequently, so a fine time granularity is required. For a credit-rating data stream, however, a coarse time granularity is acceptable.
In the time period from $[t-f]$ to $[t]$, ensemble classifier $L_{t-f}$ deals with the data that arrive during the period of length $f$. At time $[t]$, the model is incrementally updated, producing a new ensemble classifier $L_t$ to process data during $[t]$ to $[t+f]$. For an ensemble model to cope with concept drift when processing a data stream, an incremental process is necessary, which can be achieved by repeating the model update in each time period.
Taking time $[t]$ as an example, the training dataset of $L_t$ is mainly composed of labeled data that has already been classified by ensemble classifier $L_{t-f}$ in the period from $[t-f]$ to $[t]$. First, base classifiers are generated from the labeled training dataset by the selected classification algorithms. Second, a certain number of base classifiers are selected and combined into an incremental ensemble classifier $L_t$ at time $[t]$. The base classifiers in the new ensemble classifier are selected from the newly learned classifiers and the old classifiers. The selection is based on two criteria, accuracy and diversity, the latter measured by transformed information entropy. On the one hand, accuracy is used as a criterion to remove base classifiers with poor classification performance. On the other hand, the diversity criterion is used to adjust the number of base classifiers to achieve the global optimization of the incremental ensemble classifier [29-31].
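The period-by-period processing described above can be sketched as a generator that groups timestamped records into consecutive periods of length f. This is an illustrative sketch only: it assumes that timestamps are non-decreasing and measured from 0, and the name `split_into_periods` and the record format are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def split_into_periods(
    stream: Iterable[Tuple[float, object]], f: float
) -> Iterator[List[object]]:
    """Yield the records of each period [k*f, (k+1)*f) in arrival order."""
    batch: List[object] = []
    boundary = f  # end of the current period
    for timestamp, record in stream:
        # Close every period that ended before this record arrived.
        while timestamp >= boundary:
            yield batch
            batch = []
            boundary += f
        batch.append(record)
    if batch:
        yield batch


# Hypothetical stream of (timestamp in seconds, record) pairs with f = 10.
stream = [(0.5, "a"), (3.0, "b"), (9.9, "c"), (10.0, "d"), (25.0, "e")]
periods = list(split_into_periods(stream, f=10))
print(periods)
```

Each yielded batch corresponds to the data that one ensemble classifier handles before the model is updated for the next period; a period with no arrivals yields an empty list.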

Incremental_SEM Algorithm.
The most important step in generating an incremental ensemble classifier is selecting the most suitable classifiers, with high accuracy and in a proper number. In this paper, the basic tactic for base classifier selection is to integrate the information entropy measure into a cyclic iterative selection algorithm, along with the accuracy performance data. The pseudocode of the base classifier selection algorithm for the proposed incremental classification model is given in Algorithm 1, Incremental_SEM.
Incremental_SEM uses a cyclic iterative optimization selection method to maximize the information entropy difference and dynamically adjust the number of base classifiers in the ensemble. The key part of the algorithm lies in the setting of the interval threshold of classification diversity, which should be set according to the application. Since the initialization and preprocessing part is the same as in the traditional method of processing a data stream, it is skipped in this paper. Starting from computing the diversity of ensemble classifier $L_{t-f}$, we compare its value to the interval threshold and take different actions according to the comparison (line 3). If the value is higher than the upper limit of the interval threshold, new base classifiers are generated and added to the ensemble classifier, and the diversity of the new ensemble classifier is recomputed until it lies within the interval threshold (lines 4-13). If the value is lower than the lower limit of the interval threshold, the accuracy of each base classifier is computed and the base classifier with the lowest accuracy is removed (lines 14-19). Otherwise, if the value lies within the interval threshold, there is no need to update the ensemble classifier for the next time stage (lines 21-23).
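The three branches described above can be sketched as a single update step. This is a simplified sketch under stated assumptions, not the authors' implementation: each base classifier is represented only by its vector of correct/incorrect outputs on the labeled window, diversity is computed with the commonly used nonpairwise entropy form, `new_member` stands in for training a new base classifier, and the names `incremental_sem_step`, `a`, `b`, `k`, and `n_drop` are hypothetical (the paper does not state how many classifiers are removed per step).

```python
import math
from typing import Callable, List

Oracle = List[int]  # oracle[j] = 1 if the classifier is correct on sample j


def entropy_diversity(ensemble: List[Oracle]) -> float:
    """Nonpairwise entropy diversity of an ensemble, in [0, 1]."""
    n_members, n_samples = len(ensemble), len(ensemble[0])
    max_split = n_members - math.ceil(n_members / 2)
    if max_split == 0:
        return 0.0  # a single classifier cannot disagree with itself
    total = 0
    for j in range(n_samples):
        correct = sum(member[j] for member in ensemble)
        total += min(correct, n_members - correct)
    return total / (n_samples * max_split)


def incremental_sem_step(
    ensemble: List[Oracle],
    new_member: Callable[[], Oracle],
    a: float,
    b: float,
    k: int,
    n_drop: int = 1,
) -> List[Oracle]:
    """Return the updated ensemble for the next period, given interval [a, b]."""
    current = list(ensemble)
    diversity = entropy_diversity(current)
    if diversity > b:
        # Above the upper limit: add up to k new base classifiers,
        # stopping as soon as the diversity falls inside [a, b].
        for _ in range(k):
            current.append(new_member())
            if a <= entropy_diversity(current) <= b:
                break
    elif diversity < a:
        # Below the lower limit: keep the most accurate members only.
        ranked = sorted(current, key=lambda m: sum(m) / len(m), reverse=True)
        current = ranked[: max(1, len(ranked) - n_drop)]
    # Inside the interval: the ensemble is carried over unchanged.
    return current


# Hypothetical ensembles scored on N = 4 labeled samples.
high = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]   # diversity 1.0
low = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]]    # diversity 0.25
grown = incremental_sem_step(high, lambda: [1, 1, 1, 1], a=0.4, b=0.8, k=5)
pruned = incremental_sem_step(low, lambda: [1, 1, 1, 1], a=0.4, b=0.8, k=5)
print(len(grown), entropy_diversity(grown), len(pruned))
```

In the first call one added member brings the diversity from 1.0 down to 0.75, which lies inside the interval, so the loop stops; in the second call the least accurate member is dropped.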

Results
This section describes the experiments conducted to evaluate the performance of the proposed algorithm on data stream classification. A trace-based simulation approach was used to evaluate the proposed algorithm and compare its performance with that of baseline algorithms.

Experimental Data.
The proposed algorithm was evaluated on stream data generated by the Massive Online Analysis (MOA) system. MOA is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. We select the following stream generators to generate data.
(i) The hyperplane generator generates a problem of predicting the class of a rotating hyperplane. HP1 and HP2 are the data streams generated by the hyperplane generator with 5% noise data in the experiment.
(ii) The random tree generator generates a stream based on a randomly generated tree. It constructs a decision tree by choosing attributes at random to split and assigning a random class label to each leaf. RT1 and RT2 are data streams generated by the random tree generator with both label attributes and numeric attributes.
(iii) The SEA generator generates SEA concept functions. This dataset contains abrupt concept drift. SEA1 is the data stream generated by the SEA generator with 5% noise data and concept drift.
(iv) The STAGGER generator generates STAGGER concept functions, which were introduced by Schlimmer. SG1 is generated by the STAGGER generator.
A detailed description of the experimental data streams is shown in Table 2. Due to the infinite nature of data streams in a real environment, simulating them in experiments is not easy. Massive data is used to simulate an infinite data stream; the experimental data size of each dataset is shown in column 2 of Table 2. Figure 3 shows scatter diagrams of each dataset, which help in understanding the data more intuitively. Since the volume of each dataset is large, only part of the data is shown in the diagrams. Usually, a dimension reduction operation is needed when preprocessing the dataset. As shown in Figure 3, the attributes are nonlinearly related.

Experiment Setup.
An open-source data mining software package, WEKA, has been used to implement the ensemble classifier algorithms. The baseline algorithms in Weka are Naïve Bayes; Sequential Minimal Optimization (SMO); J48, which is the implementation of C4.5 for building a decision tree; IBk, which is the implementation of the k-nearest neighbor algorithm (KNN); KStar, which is an instance-based classifier; NNge; PART, which builds a "partial" C4.5 decision tree in each iteration and makes the "best" leaf into a rule; and AOD [32,33]. The algorithms are shown in Table 3.
A computer with a 1.73 GHz CPU and 2 GB of memory, running the Windows XP operating system, is used for the experiments. In order to study the effectiveness of the proposed approach, experiments were set up to compare Incremental_SEM with Bagging and AdaBoost on different datasets. In all ensemble methods, decision trees were used as the base classifier. Based on WEKA 3.6, the decision tree construction method was J48 from the Weka library, which was used to generate base classifiers with the default parameter sets [10]. The performance of each ensemble classifier was evaluated using a stratified 10-fold cross-validation procedure, in which the original dataset was partitioned randomly into 10 equal-size subsamples, with each fold containing roughly the same proportions of class labels. The parameters of Bagging and AdaBoost were kept at their default values in Weka.
The ensemble size can be regarded as a hyperparameter of the ensemble method. It can be tuned through cross-validation or by using a separate validation set. It can also be thought of as an indicator of the operating complexity of the ensemble. For Incremental_SEM, different information entropy intervals were set for the six generated datasets: [0.21, 0.43] for HP1, HP2, and SEA1; [0.63, 0.85] for SG1; and [0.46, 0.69] for RT1 and RT2. As shown in Figure 2, $f$ is the time interval in the incremental model for processing the data stream, and the classifier is adjusted every time period. For each algorithm, the accuracy of the current ensemble classifier was calculated in every time period. We evaluated the algorithms from two aspects: classification accuracy and system memory cost. Suppose that at time $t$ the ensemble classifier has $m$ base classifiers, and the classification accuracy of each base classifier is $a_i$ ($i = 1, 2, \ldots, m$). The ensemble classification accuracy $A_t$ is taken as $A_t = (a_1 + a_2 + \cdots + a_m)/m$. The ensemble classification results for the datasets in Table 2 are shown in Figure 4.
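The stratified 10-fold partition used for evaluation can be illustrated with a few lines of plain Python. This is a minimal sketch of the general idea, not the procedure WEKA uses internally; the function name `stratified_folds`, the seed, and the example labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(
    labels: Sequence[Hashable], n_folds: int = 10, seed: int = 0
) -> List[List[int]]:
    """Split sample indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        for position, index in enumerate(indices):
            # Deal the samples of one class round-robin across the folds.
            folds[position % n_folds].append(index)
    return folds


# Hypothetical two-class dataset: 60 samples of class "A", 40 of class "B".
labels = ["A"] * 60 + ["B"] * 40
folds = stratified_folds(labels)
print([len(fold) for fold in folds])
```

In each of the 10 rounds, one fold serves as the test set and the remaining nine as the training set, so every sample is tested exactly once.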

Results and Analysis.
Figure 4(a) compares Incremental_SEM with Bagging and a single classifier on datasets HP1 and RT1 at a time interval of 10 seconds. The accuracy of the Incremental_SEM algorithm is slightly higher than that of the Bagging algorithm, and both are clearly higher than that of the single algorithm. However, the execution time of Incremental_SEM and Bagging is longer than that of the single classifier, mainly because diversity computing in Incremental_SEM is time-consuming. Moreover, to test the memory cost of adding entropy diversity to ensemble classifiers, Incremental_SEM was compared with traditional incremental algorithms, Bagging, and a variant without the diversity measure. The experiment results are shown in Table 4 with the average classification accuracy (ACA) and average system memory cost (ASM). Figure 4(b) compares Incremental_SEM with the AdaBoost classification algorithm on datasets HP2 and RT2 at a time interval of 20 seconds. The classification accuracy of Incremental_SEM is almost the same as that of AdaBoost, and both are higher than that of a single classifier. Due to the higher dimension of the two datasets, the average execution time is longer than that in Figure 4(a), illustrating that learning a new base classifier is time-consuming. Comparing Incremental_SEM with a single algorithm, the results support the conclusion that adding diversity can increase classification time without improving overall effectiveness.
Figure 4(c) compares Incremental_SEM with the AdaBoost algorithm at a time interval of 5 seconds. Sharp accuracy drops are clearly visible (such as at times 30, 55, 60, and 75) because concept drift exists in both datasets. Compared with a single algorithm, Incremental_SEM and AdaBoost are more stable, suggesting that an ensemble classifier has an advantage when concept drift exists in the dataset. It can be concluded that adding diversity to the ensemble classifier can improve algorithm performance, which is consistent with the view of Nan and Zhou [34-37].
From Tables 4-6, it can be seen that the accuracy of Incremental_SEM classification is not significantly higher than that of the AdaBoost and Bagging algorithms; all of them are nearly the same in our experiment. However, the average system memory cost of Incremental_SEM is much lower than that of AdaBoost and Bagging. This indicates that, in terms of system memory cost, Incremental_SEM classification is better than traditional ensemble classification algorithms.
In order to verify the advantages of adopting entropy as a diversity measure when processing data streams, an experiment with the dataset in Table 3 was conducted to compare the Q-statistic with the correlation coefficient diversity measure. Table 7 shows that the average accuracy of Incremental_SEM is higher with the Q-statistic than with the correlation coefficient ρ.

(1) Begin
(2) Loop
(3)   Compute diversity value λ0 of ensemble classifier L_{t-f};
(4)   If …
(5)     For i = 1 to k
(6)       Sampling training data from labeled dataset at period of [t−f, t] by L_{t-f};
(7)       Generate a new base classifier L_i;
(8)       Add L_i to L_{t-f};
(9)       Compute the diversity value λ1;
(10)      If …
(11)      …
(12)      Return L_t
(13)    End for
(14)  else if λ0 ∈ [0, a]
(15)    Compute the accuracy of each base classifier at L_{t-f};
(16)    Sort base classifiers in decreasing order of accuracy as baselist;
(17)    Delete some member base classifiers with the lowest accuracy at L_{t-f};
(18)    Update the L_{t-f};
(19)    L_t = L_{t-f};
(20)    return L_t;
(21)  else
(22)    L_t = L_{t-f};
(23)    return L_t
(24)  End if
(25)  Break;
(26) End loop
ALGORITHM 1: Incremental_SEM algorithm.

Security and Communication Networks

TABLE 3: Baseline algorithms.
No. | Algorithm | Description
1 | Naïve Bayes | The Naïve Bayes classifier using kernel density estimation over multiple values for continuous attributes, instead of assuming a simple normal distribution
2 | SMO | Sequential minimal optimization algorithm for training a support vector classifier using polynomial kernels
3 | J48 | Decision tree, the implementation of C4.5
4 | IBk | An instance-based learning algorithm, the implementation of k-nearest neighbor algorithm (kNN)
5 | KStar | The K* instance-based learner using all nearest neighbors and an entropy-based distance
6 | NNge | Nearest neighbor-like algorithm using nonnested generalized exemplars
7 | PART | Generating a PART decision list for classification
8 | AOD | Perform classification by averaging over all of a small space of alternative Naive Bayes-like models that have weaker independence

Conclusions
The ensemble classifier, a common approach to processing data streams, is known for its high classification accuracy and stability. We proposed an ensemble algorithm that incorporates entropy as the diversity measure. The experiments show that our Incremental_SEM algorithm has a higher classification accuracy than a single classifier and a lower system memory cost than the Bagging and AdaBoost algorithms. The results also suggest that the Q-statistic diversity measure outperforms the correlation coefficient diversity measure. Future research will focus on how to verify the relationship between accuracy and diversity in theory.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request as the size of the experimental data is too large to upload via this submission interface.

Conflicts of Interest
The authors declare no conflicts of interest.