Discriminatory Target Learning: Mining Significant Dependence Relationships from Labeled and Unlabeled Data

Machine learning techniques have shown superior predictive power, among which Bayesian network classifiers (BNCs) have remained of great interest due to their capacity to represent complex dependence relationships. Most traditional BNCs build only one model to fit the training instances, analyzing independence between attributes with conditional mutual information. However, for different class labels, the conditional dependence relationships may differ rather than remain invariant when attributes take different values, which may result in classification bias. To address this issue, we propose a novel framework, called discriminatory target learning, which can be regarded as a tradeoff between the probabilistic model learned from an unlabeled instance at the uncertain end and that learned from labeled training data at the certain end. The final model can discriminately represent the dependence relationships hidden in an unlabeled instance with respect to its different possible class labels. Taking the k-dependence Bayesian classifier as an example, experimental comparison on 42 publicly available datasets indicates that the final model achieves competitive classification performance compared to state-of-the-art learners such as Random forest and averaged one-dependence estimators.


Introduction
With the rapid development of computer technologies, business and government organizations create large amounts of data that need to be processed and analyzed. Over the past decade, to satisfy the urgent need to mine the knowledge hidden in these data, numerous machine learning models [1,2] (e.g., decision trees [3], Bayesian networks [4,5], support vector machines [6] and neural networks [7]) have been proposed.
To mine all the "right" knowledge that exists in a database, researchers have mainly proposed two kinds of learning strategies. (1) Increase structure complexity to represent more dependence relationships, e.g., convolutional neural networks [8] and the k-dependence Bayesian classifier (KDB) [9]. However, as structure complexity grows, overfitting will inevitably appear, resulting in redundant dependencies and performance degradation. Overly complex structures also hide the internal working mechanism and leave such models criticized as "black boxes". (2) Build an ensemble of several individual members with relatively simple network structures, e.g., Random forest [10] and averaged one-dependence estimators (AODE) [11]. Ensembles generally perform better than any individual member. However, it is difficult or even impossible to give a clear semantic explanation of the combined result, since the working mechanisms of the individual members may differ greatly. In practice, people would rather use models with simple and easy-to-explain structures, e.g., decision trees [12] and naive Bayes (NB) [13-15], although they may perform worse.
Bayesian networks (BNs) have long been a popular medium for graphically representing the probabilistic dependencies that exist in a domain. Recently, work in Bayesian methods for classification has grown enormously. Numerous Bayesian network classifiers (BNCs) [9,16-20] have been proposed to mine the significant dependence relationships implicated in training data. With solid theoretical support, they have strong potential to be effective in a number of massive and complex data-intensive fields such as medicine [21], astronomy [22] and biology [23]. A central concern for a BNC is to learn the conditional dependence relationships encoded in the network structure. Some BNCs, e.g., KDB, use conditional mutual information I(X_i; X_j|Y) to measure the conditional dependence between X_i and X_j, which is defined as follows [24]:

I(X_i; X_j|Y) = \sum_{x_i, x_j, y} P(x_i, x_j, y) \log \frac{P(x_i, x_j|y)}{P(x_i|y)P(x_j|y)}.    (1)

For example, I(X_i; X_j|Y) = 0 indicates that attributes X_i and X_j are conditionally independent. However, in practice, for any specific event or data point the situation is much more complex. Taking the Waveform dataset as an example, attributes X_15 and X_16 are conditionally dependent, since I(X_15; X_16|Y) > 0 always holds. Figure 1 shows the distributions of I(x_15; x_16|y_i), where i ∈ {1, 2, 3}. As can be seen, there exist some positive values of I(x_15; x_16|y_1) and I(x_15; x_16|y_2). However, for the class label y_3, negative or zero values of I(x_15; x_16|y_3) make up a high proportion of all values. That is, for different class labels, the conditional dependence relationships may differ rather than remain invariant when attributes take different values. We argue that most BNCs (e.g., NB and KDB), which build only one model to fit the training instances, cannot capture this difference and cannot flexibly represent the dependence relationships, especially those hidden in unlabeled instances.
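As a minimal sketch of how the conditional mutual information of Equation (1) can be estimated from discrete data columns, the function below uses raw empirical frequencies; the column encoding (small non-negative integers) and the absence of smoothing are simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def cond_mutual_info(xi, xj, y):
    """Empirical I(Xi; Xj | Y) from three aligned columns of discrete values.

    Sums P(xi, xj, y) * log2[P(xi, xj | y) / (P(xi | y) P(xj | y))] over all
    observed value combinations. A sketch: no smoothing for unseen values.
    """
    xi, xj, y = (np.asarray(a) for a in (xi, xj, y))
    cmi = 0.0
    for yv in np.unique(y):
        mask = y == yv
        p_y = mask.mean()                 # P(y)
        sub_i, sub_j = xi[mask], xj[mask]
        for iv in np.unique(sub_i):
            for jv in np.unique(sub_j):
                p_ij = np.mean((sub_i == iv) & (sub_j == jv))  # P(xi, xj | y)
                p_i = np.mean(sub_i == iv)                     # P(xi | y)
                p_j = np.mean(sub_j == jv)                     # P(xj | y)
                if p_ij > 0:
                    cmi += p_y * p_ij * np.log2(p_ij / (p_i * p_j))
    return cmi
```

When the two columns are identical given each class, the estimate is positive; when they vary independently, it is (near) zero, matching the interpretation of Equation (1) above.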
Scientific data can be massive, and labeled training data may account for only a small portion of it. In this paper, we propose a novel learning framework, called discriminatory target learning, for achieving better classification performance and representing high-level dependence relationships without increasing structure complexity. KDB is taken as an example to illustrate the basic idea and prove the feasibility of discriminatory target learning. By redefining mutual information and conditional mutual information, we build a "precise" model kdb_i for each unlabeled instance x with respect to class label y_i. The ensemble of the kdb_i, i.e., kdb_e, can finely describe the dependency relationships hidden in x. The final ensemble of kdb_e and regular KDB can fully and discriminately describe the dependence relationships in the training data and the unlabeled instance.
The rest of the paper is organized as follows: Section 2 introduces some state-of-the-art BNCs. Section 3 introduces the basic idea of discriminatory target learning. Experimental study on 42 UCI machine learning datasets is presented in Section 4, including a comparison with seven algorithms. The final section draws conclusions and outlines some directions for further research.

Bayesian Network Classifiers
The structure of a BN on the random variables {X_1, ..., X_n} is a directed acyclic graph (DAG), which represents each attribute in a given domain as a node and the dependencies between attributes as arcs connecting the respective nodes. Thus, independencies are represented by the lack of arcs connecting particular nodes. BNs are powerful tools for knowledge representation and inference under conditions of uncertainty. BNs were considered as classifiers only after the discovery of NB, a very simple kind of BN built on the conditional independence assumption, which is surprisingly effective and efficient for inference [5]. The success of NB has led to research on Bayesian network classifiers (BNCs), including tree-augmented naive Bayes (TAN) [16], averaged one-dependence estimators (AODE) [18] and the k-dependence Bayesian classifier (KDB) [9,17].
Let each instance x be characterized by n values {x_1, ..., x_n} for attributes {X_1, ..., X_n}, and let class label y ∈ {y_1, ..., y_m} be the value of class variable Y. NB assumes that the predictive attributes are conditionally independent of each other given the class label, that is,

P(x_1, ..., x_n|y) = \prod_{i=1}^{n} P(x_i|y).

Correspondingly, for any value pair of two arbitrary attributes X_i and X_j, P(x_i, x_j|y) = P(x_i|y)P(x_j|y) always holds. From Equation (1), it follows that I(X_i; X_j|Y) = 0, which explains why there are no arcs between attributes in NB. However, in the real world the situation is much more complex when considering a specific event or data point. We now formalize our notion of the spectrum of point dependency relationships in Bayesian classification.

Definition 1.
For an unlabeled data point x = {x_1, ..., x_n}, the conditional dependence between X_i and X_j (1 ≤ i, j ≤ n) with respect to label y on point x is measured by the pointwise y-conditional mutual information, which is defined as follows:

I(x_i; x_j|y) = \log \frac{P(x_i, x_j|y)}{P(x_i|y)P(x_j|y)} = \log \frac{P(x_i|x_j, y)}{P(x_i|y)}.    (2)

Equation (2) is a modified version of pointwise conditional mutual information that is applicable to labeled data points [25]. Comparing Equations (1) and (2), I(X_i; X_j|Y) is the expectation of I(x_i; x_j|y) over all possible values of X_i, X_j and Y. Traditional BNCs, e.g., TAN and KDB, use I(X_i; X_j|Y) to roughly measure the conditional dependence between X_i and X_j. I(X_i; X_j|Y) is non-negative, and I(X_i; X_j|Y) > 0 iff X_i and X_j are conditionally dependent given Y. However, taking only I(X_i; X_j|Y) = 0 as the criterion for identifying conditional independence is too strict for BN learning and may lead to classification bias, since I(x_i; x_j|y) ≤ 0 may hold for a specific data point x. That may be the main reason why NB performs better in some research domains. To address this issue, in this paper I(x_i; x_j|y) is applied to measure the extent to which X_i and X_j are relatively conditionally dependent when P(x_i|x_j, y) > P(x_i|y), or relatively conditionally independent or irrelevant when P(x_i|x_j, y) = P(x_i|y) or P(x_i|x_j, y) < P(x_i|y), respectively.
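A rough sketch of how the pointwise term of Equation (2) behaves on data: the estimator below computes it from labeled rows using raw frequencies. The row layout (attribute values followed by the label) is an illustrative assumption, and smoothing is deliberately omitted.

```python
import math

def pointwise_cmi(data, i, j, xi, xj, y):
    """Estimate I(xi; xj | y) = log2[P(xi | xj, y) / P(xi | y)] from labeled
    rows of the form (x1, ..., xn, label). A sketch without smoothing."""
    rows_y = [r for r in data if r[-1] == y]          # rows with label y
    rows_jy = [r for r in rows_y if r[j] == xj]       # ... that also match xj
    p_i_given_y = sum(r[i] == xi for r in rows_y) / len(rows_y)
    p_i_given_jy = sum(r[i] == xi for r in rows_jy) / len(rows_jy)
    return math.log2(p_i_given_jy / p_i_given_y)

# Toy data: under label 0, attribute 0 always co-occurs with attribute 1,
# so the pointwise term for (xi=0, xj=0, y=0) is positive (y-conditionally
# dependent in the sense of Definition 2).
data = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0)]
```

A positive result indicates y-conditional dependence on the point, a negative one y-conditional irrelevance, mirroring the sign convention introduced in Definition 2 below.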

Definition 2.
For an unlabeled data point x = {x_1, ..., x_n} with respect to label y, if I(x_i; x_j|y) > 0 (1 ≤ i, j ≤ n), then X_i and X_j are y-conditionally dependent on point x; if I(x_i; x_j|y) = 0, then they are y-conditionally independent on point x; and if I(x_i; x_j|y) < 0, then they are y-conditionally irrelevant on point x.
TAN maintains the structure of NB and allows each attribute to have at most one parent, so the number of arcs encoded in TAN is n − 1. During the construction of the maximum weighted spanning tree, TAN sorts the arcs between arbitrary attributes X_i and X_j by comparing I(X_i; X_j|Y), and adds them in turn to the network structure if no cycle appears. KDB further relaxes NB's independence assumption and can represent an arbitrary degree of dependence while retaining much of the computational efficiency of NB. KDB first sorts the attributes by comparing the mutual information I(X_i; Y), which is defined as follows [24]:

I(X_i; Y) = \sum_{x_i, y} P(x_i, y) \log \frac{P(x_i, y)}{P(x_i)P(y)}.    (3)

Suppose the attribute order is {X_1, ..., X_n}. By comparing I(X_i; X_j|Y), X_i selects its parents, e.g., X_j, from the attributes that rank before it in the order. KDB requires that X_i have min(i − 1, k) parents, so there are min(i − 1, k) arcs between X_i and its parents. The number of arcs encoded in KDB is nk − k(k + 1)/2 and grows with k. Thus, KDB can represent more dependency relationships than TAN. Neither TAN nor KDB evaluates the extent to which the conditional dependencies are weak enough to be neglected; they simply specify, before structure learning, the maximum number of parents that attribute X_i can have. Some arcs corresponding to weak conditional dependencies will inevitably be added to the network structure. The prior and joint probabilities in Equations (1) and (3) are estimated from training data as follows:

\hat{P}(y) = \frac{N(y)}{N}, \quad \hat{P}(x_i, y) = \frac{N(x_i, y)}{N}, \quad \hat{P}(x_i, x_j, y) = \frac{N(x_i, x_j, y)}{N},

where N is the number of training instances and N(·) denotes the number of training instances matching the given values. Then, P(x_j|y) and P(x_i, x_j|y) in Equations (1) and (3) can be computed as follows:

\hat{P}(x_j|y) = \frac{N(x_j, y)}{N(y)}, \quad \hat{P}(x_i, x_j|y) = \frac{N(x_i, x_j, y)}{N(y)}.

Sahami [9] suggested that, if k is large enough to capture all the "right" conditional dependencies that exist in a database, then a classifier would be expected to achieve optimal Bayesian accuracy. However, as k grows, KDB will encode more weak dependency relationships, which correspond to smaller values of I(X_i; X_j|Y).
That increases the risk of negative values of I(x_i; x_j|y) and may introduce redundant dependencies, which mitigate the positive effect of the significant dependencies corresponding to positive values of I(x_i; x_j|y). On the other hand, conditional mutual information I(X_i; X_j|Y) cannot finely measure the conditional dependencies hidden in different data points. An arc X_i → X_j in a BNC learned from training data corresponds to a positive value of I(X_i; X_j|Y) and represents strong conditional dependence between X_i and X_j. However, for a specific labeled instance d = {x_1, ..., x_n, y_1}, I(x_i; x_j|y_1) ≤ 0 may hold. Then, X_i and X_j are y_1-conditionally independent or irrelevant on point d, and the arc X_i → X_j should be removed. For an unlabeled instance, the possible dependency relationships between nodes may differ greatly with respect to different class labels.
Thus, BNCs with highly complex network structures do not necessarily beat those with simple ones. The conditional dependencies that hold for the training data in general do not necessarily hold for each instance. A BNC should discriminate between conditionally dependent and irrelevant relationships for different data points. Besides, a BNC should represent all possible spectrums of point dependency relationships that correspond to different class labels for dependence analysis.
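The KDB structure-learning procedure described in Section 2 (sort attributes by mutual information with the class, then let each attribute adopt its min(i − 1, k) strongest-scoring predecessors) can be sketched as follows. The precomputed score tables `mi` and `cmi` are illustrative stand-ins for I(X_i; Y) and I(X_i; X_j|Y).

```python
def kdb_structure(mi, cmi, k):
    """Sketch of KDB parent selection.

    mi[i]     -- I(Xi; Y) for attribute i
    cmi[i][j] -- I(Xi; Xj | Y), assumed symmetric
    Returns the attribute order and parents[i] = chosen parent attributes
    (the class node, parent of every attribute, is left implicit).
    """
    order = sorted(range(len(mi)), key=lambda i: mi[i], reverse=True)
    parents = {}
    for pos, a in enumerate(order):
        preds = order[:pos]                               # higher-ranked attributes
        preds.sort(key=lambda b: cmi[a][b], reverse=True) # strongest dependencies first
        parents[a] = preds[:min(pos, k)]                  # min(i - 1, k) parents
    return order, parents
```

Note that the weakest admitted dependency is never tested against a threshold, which is exactly the limitation the text points out: with larger k, arcs for weak (even pointwise-negative) dependencies are still added.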

Discriminatory Target Learning
In probabilistic classification, Bayes optimal classification suggests that, if we could determine the conditional probability distribution P(y|x) with the true distribution available, where y is one of the m class labels and x is the n-dimensional data point x = {x_1, x_2, ..., x_n} representing an observed instance, then we could achieve the theoretically optimal classification. P(y|x) can be described by an unrestricted Bayesian network, as shown in Figure 2a. By applying arc reversal, Shachter [26] proposed to produce the equivalent dependence structure shown in Figure 2b. The problem is thus reduced to estimating the conditional probability P(x|y). Figure 2a,b represent two inference processes that run in opposite directions. Figure 2a indicates causality that runs from the state of {X_1, ..., X_n} (the causes) to the state of Y (the effect). In contrast, if the causality runs in the opposite direction, as shown in Figure 2b, and the state of Y (the effect) is uncertain, the dependencies between the predictive attributes (the causes) should be tuned to match the different states of Y. That is, the restricted BNC shown in Figure 2b presupposes the class label first, and the conditional dependencies between attributes then verify the presupposition. For different class labels or presuppositions, the conditional dependencies should be different. It is not reasonable that, no matter what the effect (class label) is, the relationships between the causes (predictive attributes) remain the same. Consider an unlabeled instance x = {x_1, ..., x_n}; if I(x_i; x_j|y) > 0, then the conditional dependence between X_i and X_j on data point x with respect to class label y is reasonable; otherwise, it should be neglected.
Since the class label of x is uncertain and there are m labels available, we take x as the target and learn an ensemble of m micro BNCs, i.e., bnc_e = {bnc_1, ..., bnc_m}, each of which fully describes the conditional dependencies between the attribute values in x with respect to a different class label. The linear combiner is used for models that output real-valued numbers and is therefore applicable to bnc_e. The ensemble probability estimate for bnc_e is

P_e(y|x) = \frac{1}{m} \sum_{t=1}^{m} P_t(y|x),

where P_t(y|x) is the class membership probability estimated by bnc_t. bnc_e may overfit the unlabeled instance and underfit the training data. In contrast, a regular BNC learned from training data may underfit the unlabeled instance. Thus, they are complementary in nature. After training bnc_e and the regular BNC, the final ensemble, which estimates the class membership probabilities by averaging both predictions, is generated. The framework of discriminatory target learning is shown in Figure 3. Because in practice it is hardly possible to find the true distribution of P(x|y) from data, KDB approximates P(x|y) by allowing for the modeling of arbitrarily complex dependencies between attributes. The pseudocode of KDB is shown in Algorithm 1.
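The linear combiner above reduces to a per-label average of the members' estimates; the same averaging joins bnc_e with the regular BNC in the final ensemble. The dict-of-probabilities representation below is an illustrative assumption, not the paper's data structure.

```python
def ensemble_posterior(member_posteriors):
    """Linear combiner: average the class-membership estimates P_t(y|x)
    of the ensemble members. members: list of {label: probability} dicts."""
    m = len(member_posteriors)
    return {y: sum(p[y] for p in member_posteriors) / m
            for y in member_posteriors[0]}

# The final ensemble averages bnc_e's estimate with the regular BNC's:
bnc_e_est = {'y1': 0.8, 'y2': 0.2}    # assumed micro-ensemble output
regular_est = {'y1': 0.4, 'y2': 0.6}  # assumed regular-BNC output
final = ensemble_posterior([bnc_e_est, regular_est])
```

Averaging real-valued posteriors (rather than voting) preserves each member's confidence, which is why the linear combiner suits probability-outputting models.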

Algorithm 1 Structure learning of KDB.
Input: Training set T, parameter k, vector I(X_i; Y) (1 ≤ i ≤ n) and crosstab of I(X_i; X_j|Y) (1 ≤ i ≠ j ≤ n).
Output: The network structure of KDB.
1. Sort the attributes in descending order of I(X_i; Y); add the class node Y and an arc from Y to each attribute node.
2. For the i-th attribute X_i in the order, add arcs to X_i from the min(i − 1, k) attributes among its predecessors with the highest I(X_i; X_j|Y).

Definition 3.
For unlabeled data point x = {x_1, ..., x_n}, the dependence between x_i (1 ≤ i ≤ n) and any given label y is measured by the pointwise y-mutual information, which is defined as follows:

I(x_i; y) = \log \frac{P(x_i, y)}{P(x_i)P(y)} = \log \frac{P(y|x_i)}{P(y)}.    (8)

Equation (8) is a modified version of pointwise mutual information that is applicable to labeled data points [25]. The prior and joint probabilities in Equations (2) and (8) are estimated as follows:

\hat{P}(y) = \frac{N(y) + 1/m}{N + 1}, \quad \hat{P}(x_i, y) = \frac{N(x_i, y) + 1/m}{N + 1}, \quad \hat{P}(x_i, x_j, y) = \frac{N(x_i, x_j, y) + 1/m}{N + 1}.    (9)

Conditional probabilities in Equations (2) and (8) can then be estimated by:

\hat{P}(x_i|y) = \frac{\hat{P}(x_i, y)}{\hat{P}(y)}, \quad \hat{P}(x_i|x_j, y) = \frac{\hat{P}(x_i, x_j, y)}{\hat{P}(x_j, y)}.    (10)

Similar to the Laplace correction [27], the main idea behind Equation (9) is equivalent to creating a "pseudo" training set P by adding to the training data a new instance {x_1, ..., x_n} with multiple labels, assuming that the probability that this new instance is in class y is 1/m for each y ∈ {y_1, ..., y_m}.
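One plausible reading of the "pseudo" training set idea behind Equation (9): the unlabeled instance is added once, contributing a weight of 1/m to each class, so every joint count that matches the instance's values gains a 1/m bonus over N + 1 total instances. The function below is a sketch under that assumption, not necessarily the paper's exact estimator.

```python
def pseudo_joint_estimate(count, N, m):
    """P_hat(x_i, y) under the multi-label pseudo-instance reading:
    (empirical count + 1/m share from the pseudo instance) / (N + 1).

    count -- training instances matching (x_i, y)
    N     -- number of training instances
    m     -- number of class labels
    """
    return (count + 1.0 / m) / (N + 1)

# e.g., 4 of 9 training instances match (x_i, y), with two class labels:
p = pseudo_joint_estimate(4, 9, 2)  # (4 + 0.5) / 10
```

Like the Laplace correction, this keeps every estimate strictly positive, so the logarithms in Equations (2) and (8) are always defined for the target instance's values.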
KDB uses I(X_i; Y) to sort the attributes and I(X_i; X_j|Y) to measure the conditional dependence. Similarly, for an unlabeled instance x = {x_1, ..., x_n}, the corresponding micro KDB with respect to class label y_t, called kdb_t, uses I(x_i; y_t) (see Equation (8)) to sort the attribute values and I(x_i; x_j|y_t) (see Equation (2)) to measure the conditional dependence. The learning procedure of kdb_t is shown in Algorithm 2.
Algorithm 2 Structure learning of kdb_t with respect to class label y_t.
Input: Unlabeled instance x, parameter k, class label y_t, vector I(x_i; y_t) (1 ≤ i ≤ n) and crosstab of I(x_i; x_j|y_t) (1 ≤ i ≠ j ≤ n).

Breiman [28] revealed that ensemble learning brings an improvement in accuracy only to "unstable" learning algorithms, in the sense that small variations in the training set would lead them to produce very different models. bnc_e is obviously an example of such a learner. For the individual members of kdb_e, the difference in network structure is the result of changes in I(x_i; y) or I(x_i; x_j|y) (1 ≤ i ≠ j ≤ n) or, more precisely, in the probabilities defined in Equations (2) and (8). Given an unlabeled instance x = {x_1, ..., x_n} and binary class labels y_1 and y_2, if I(x_i; y_1) > 0, i.e., P(y_1|x_i) > P(y_1), then X_i is y_1-dependent on x. Because P(y_2) = 1 − P(y_1) and P(y_2|x_i) = 1 − P(y_1|x_i), we have

P(y_2|x_i) < P(y_2)

and

I(x_i; y_2) = \log \frac{P(y_2|x_i)}{P(y_2)} < 0.

Thus, X_i is y_2-irrelevant on x. X_i plays totally different roles in the relationships with different class labels on the same instance. Supposing that I(x_i; y_1) > 0 before small variations in the training set and I(x_i; y_1) < 0 after them, the attribute values will be re-sorted and, correspondingly, the network structures of kdb_1 and kdb_2 for x will change greatly. This sensitivity to variation lets kdb_e finely describe the dependencies hidden in x. Figure 4 shows examples of kdb_1 and kdb_2 corresponding to class labels y_1 and y_2, respectively. If the decision of the final ensemble is y_1, then we use Figure 4a for dependence analysis; otherwise, we use Figure 4b instead. The attribute values annotated in black correspond to positive values of I(x_i; y_t) (t = 1 or 2) and should be focused on.
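The binary-label symmetry argued above can be checked numerically with Equation (8); the probability values below are made-up illustrative inputs satisfying P(y_1|x_i) > P(y_1).

```python
import math

def pointwise_mi(p_y_given_xi, p_y):
    """I(x_i; y) = log2[P(y | x_i) / P(y)], the pointwise y-mutual information."""
    return math.log2(p_y_given_xi / p_y)

p_y1, p_y1_xi = 0.4, 0.7                      # assumed: observing x_i raises y1
mi_y1 = pointwise_mi(p_y1_xi, p_y1)           # positive: X_i is y1-dependent on x
mi_y2 = pointwise_mi(1 - p_y1_xi, 1 - p_y1)   # negative: X_i is y2-irrelevant on x
```

Whatever the starting probabilities, raising one label's posterior necessarily lowers the other's in the binary case, so the two pointwise scores always take opposite signs, which is why the member structures kdb_1 and kdb_2 rank the same attribute values so differently.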
KDB requires training time complexity of O(n²Nmv²) (dominated by the calculation of I(X_i; X_j|Y)) and classification time complexity of O(n²Nm) [9] for classifying a single unlabeled instance, where n is the number of attributes, N is the number of data instances, m is the number of class labels, and v is the maximum number of discrete values that an attribute may take. Discriminatory target learning requires no additional training time, thus the training time complexity of the final ensemble is the same as that of regular KDB. At classification time, it requires O(n²Nm) to calculate I(x_i; x_j|y), and the same time complexity for classifying a single unlabeled instance.

Experiments and Results
We compared the performance of our proposed methods kdb_e and KDB_e with several state-of-the-art classifiers. We analyzed the performance in terms of zero-one loss, root mean square error (RMSE), bias and variance on 42 natural domains from the UCI Machine Learning Repository [29]. These datasets are described in Table 1, in ascending order of number of instances. The structure of this section is as follows: we discuss our experimental methodology and evaluation functions in detail in Section 4.1. Section 4.2 includes comparisons with three classic single-structure BNCs, namely NB, TAN and KDB, as well as one ensemble BNC, AODE. Then, in Section 4.3, KDB_e is compared with Random forest with 100 decision trees. Section 4.4 presents a global comparison of all learners considered, applying the Friedman and Nemenyi tests.

Experimental Methodology and Evaluation Function
The experiments for all BNCs used C++ software (NetBeans 8.0.2) specially designed to deal with classification problems. Each algorithm was tested on each dataset using 10-fold cross validation.
All experiments were conducted on a desktop computer with an Intel(R) Core(TM) i3-6100 CPU @ 3.70 GHz, 64 bits and 4096 MB of memory(Dell Vostro 2667, Changchun, China).
• Win/Draw/Lose (W/D/L) Record: When two algorithms were compared, we counted the number of datasets for which one algorithm performed better, equally well or worse than the other on a given measure. We considered that a significant difference exists if the output of a one-tailed binomial sign test was less than 0.05.
• Missing Values: Missing values for qualitative attributes were replaced with modes, and those for quantitative attributes were replaced with means from the training data.
• Numeric Attributes: For each dataset, we used MDL (Minimum Description Length) discretization [30] to discretize numeric attributes.
• Dataset Sizes: Datasets were categorized in terms of their sizes. That is, datasets with instances <1000, ≥1000 and <10,000, and ≥10,000 were denoted as small, medium and large, respectively. We report results on these sets to discuss the suitability of a classifier for datasets of different sizes.
• Zero-one loss: Zero-one loss measures the extent to which a learner correctly identifies the class label of an unlabeled instance. Supposing y and ŷ are the true class label and the one generated by a learning algorithm, respectively, given M unlabeled test instances, the zero-one loss function is defined as

\xi = \frac{1}{M} \sum_{i=1}^{M} \left(1 - \delta(y_i, \hat{y}_i)\right),

where δ(y_i, ŷ_i) = 1 if y_i = ŷ_i and 0 otherwise.
• Bias and variance: The bias-variance decomposition proposed by Kohavi and Wolpert [31] provides valuable insights into the components of the zero-one loss of learned classifiers. Bias measures how closely the classifier can describe the decision boundary, which is defined as

bias^2 = \frac{1}{2} \sum_{x} P(x) \sum_{y} \left[P(y|x) - \hat{P}(y|x)\right]^2,

where x is the combination of any attribute values. Variance measures the sensitivity of the classifier to variations in the training data, which is defined as

variance = \frac{1}{2} \sum_{x} P(x) \left(1 - \sum_{y} \hat{P}(y|x)^2\right).

• RMSE: For each instance, RMSE accumulates the squared error, where the error is the difference between 1.0 and the probability estimated by the classifier for the true class of the instance, and then computes the square root of the mean of the sum, which is defined as

RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left(1 - \hat{P}(y_i|x_i)\right)^2},

where M is the number of test instances.
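The two instance-level measures above can be sketched directly. In the RMSE helper, `p_true_class` is an assumed list holding, per test instance, the probability the classifier assigned to that instance's true class.

```python
import math

def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified test instances (the zero-one loss)."""
    M = len(y_true)
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / M

def rmse(p_true_class):
    """Root mean square error: per-instance error is 1 minus the probability
    the classifier gave to the instance's true class."""
    M = len(p_true_class)
    return math.sqrt(sum((1 - p) ** 2 for p in p_true_class) / M)
```

Unlike zero-one loss, RMSE penalizes an under-confident correct prediction, so the two measures can rank classifiers differently, which is why both are reported in the tables below.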

KDB_e Versus Classic BNCs
We compared KDB_e with several classic BNCs, namely NB, TAN, KDB and AODE. Sahami [9] proposed the notion of the k-dependence BNC, which allows each attribute X_i to have a maximum of k attributes as parents. NB and TAN are, respectively, 0-dependence and 1-dependence BNCs. To clarify the effect of dependence complexity, we set k = 2 for both KDB and KDB_e.

Zero-One Loss and RMSE Results
The detailed results in terms of zero-one loss and RMSE are shown in Tables A1 and A2 in Appendix A, respectively. Tables 2 and 3 show W/D/L records summarizing the relative zero-one loss and RMSE of the different BNCs. When k = 2, NB, TAN and KDB can, respectively, represent 0, n − 1 and 2n − 3 conditional dependencies, where n is the number of predictive attributes. As shown in Table 1, since n > 3 holds for all datasets, 2n − 3 > n − 1 also holds. Thus, KDB can represent the largest number of dependencies of all. With respect to zero-one loss, NB represents no conditional dependencies due to its independence assumption and performed the worst in general. As the dependence degree or structure complexity increased, KDB was competitive compared to NB and TAN. AODE performed better than the other single-structure BNCs due to its ensemble mechanism. Surprisingly, kdb_e had significantly better zero-one loss performance than NB, TAN and KDB. When discriminatory target learning was introduced for the discovery of dependencies that exist in different unlabeled instances, the final ensemble KDB_e possessed a significant advantage over the other classifiers. For example, KDB_e beat KDB in 26 domains and lost in only three in terms of zero-one loss. RMSE-wise, KDB_e still performed the best. For instance, KDB_e enjoyed a significant advantage over TAN (20/19/3). When compared to KDB, KDB_e also achieved superior performance, with 17 wins and 5 losses. To make the experimental results more intuitive, from the viewpoints of the ensemble mechanism and structure complexity, Figure 5a,c shows the comparisons of KDB_e, KDB and AODE in terms of zero-one loss, whereas Figure 5b,d shows the comparisons for RMSE. The red squared symbols indicate significant advantages of KDB_e over the other BNCs. In Figure 5a,b, only two points are far above the diagonal line, thus the negative effect caused by discriminatory target learning was negligible. In contrast, many more points are below the diagonal line, which means that discriminatory target learning worked effectively in most cases. A notable case is the Waveform dataset, where discriminatory target learning helped to substantially reduce classification error: from 0.0256 to 0.0193 for zero-one loss and from 0.1145 to 0.0901 for RMSE. When comparing KDB_e with AODE, it can be seen in Figure 5c,d that there are still many points below the diagonal line, which means that KDB_e enjoyed a significant advantage over AODE. For example, a notable case is our largest dataset, Localization, where the zero-one loss of KDB_e (0.2743) was much lower than that of AODE (0.3596).

Bias and Variance Results
The detailed results in terms of bias and variance are shown in Tables A3 and A4 in Appendix A, respectively. The W/D/L records with respect to bias and variance are shown in Tables 4 and 5, respectively. We can observe in Table 4 that the ensemble classifiers, i.e., AODE and kdb_e, performed better than TAN but worse than KDB, although these results were not always statistically significant. NB still performed the worst. A high-dependence structure or an ensemble construction strategy can help reduce bias, and jointly applying both helped KDB_e reduce bias significantly. For example, KDB_e performed better than TAN (26/9/7) and KDB (11/27/4).
In terms of variance, since the network structures of NB and AODE are definite and irrelevant to variations in the training data, the independence assumption helped reduce variance significantly. KDB was the most sensitive of all the classifiers to variations in the training data. As discussed in Section 3, discriminatory target learning made kdb_e underfit the training data and overfit the unlabeled instance. When kdb_e was integrated with regular KDB, discriminatory target learning helped to reduce variance, and the final ensemble classifier, i.e., KDB_e, performed best after only NB and AODE.

Time Comparison
We compared KDB_e with the other classic BNCs in terms of training and classification time. Since kdb_e is a part of KDB_e, we excluded it from this experiment. Figure 6a,b shows the training and classification time comparisons for all BNCs. Each bar represents the total time over the 42 datasets in a 10-fold cross-validation experiment. No parallelization techniques were used in any case. As discussed in Section 3, discriminatory target learning requires no additional training time, thus the training time complexity of KDB_e was the same as that of regular KDB. Due to their structure complexity, KDB_e and KDB required a bit more time for training than the other BNCs. With respect to classification time, KDB_e took a little more time than the other BNCs. The reason is that KDB_e learned kdb_e for each unlabeled test instance, while the other BNCs only needed to directly calculate the joint probabilities. In general, discriminatory target learning helped to significantly improve the classification performance of its base classifier at the cost of a small increase in time consumption, which is perfectly acceptable.

KDB_e Versus Random Forest
To further illustrate the performance of our proposed discriminatory target learning framework, we compared KDB_e with a powerful learner, Random forest. Random forest (RF) is a combination of decision tree predictors, where each tree is trained on data selected at random, with replacement, from the original data [10]. As the number of trees in the forest becomes large, the classification error of the forest tends to converge to a limit. RF is an effective tool for prediction. RF can process high-dimensional data (i.e., data with many features) without performing feature selection. Furthermore, due to its random mechanism, RF has the capacity to deal with imbalanced datasets or data with numerous missing values. Moreover, the framework in terms of the strength of the individual predictors and their correlations gives insight into the predictive ability of RF [10]. Because of its high classification accuracy, RF has been applied to many scientific fields, e.g., ecology and agriculture [32]. In our experiment, RF with 100 decision trees was used. The detailed results of RF in terms of zero-one loss, RMSE, bias and variance can be found in Tables A1-A4 in Appendix A, respectively. Table 6 shows the W/D/L records for different dataset sizes. When zero-one loss was compared, KDB_e won more frequently than RF, especially on small and medium datasets: 10/4/3 on small datasets and 7/4/4 on medium datasets. The reason may be that 100 decision trees are complex and tend to overfit the training data. RMSE-wise, KDB_e also performed better than RF, with 16 wins and 11 losses. The bias and variance comparison of KDB_e and RF (Table 6) suggests that KDB_e is a low-variance, high-bias classifier; one can expect it to work extremely well on small and medium datasets. This is evident in the zero-one loss and RMSE comparisons in Table 6. KDB_e beat RF on 26 datasets and lost on 12 with respect to variance.
Thus, the advantages of KDB_e over RF in terms of zero-one loss and RMSE can be attributed to the change in variance. Since the variance term increases as an algorithm becomes more sensitive to changes in the labeled training data, discriminatory target learning evidently helped to alleviate the negative effect caused by overfitting. Besides, we display the time comparisons between KDB_e and RF in Figure 7. KDB_e clearly enjoyed a great advantage over RF in terms of training time on datasets of all sizes. This advantage can be attributed to the fact that KDB_e only learned a regular KDB for each dataset during the training phase, while RF needed to train 100 decision trees. When comparing classification time, the performance of KDB_e and RF showed a slight reversal: learning kdb_e for each unlabeled test instance made KDB_e take a bit more time than RF. However, on small and medium datasets, the advantage of RF over KDB_e was not significant. To conclude, on small and medium datasets, KDB_e had significantly better zero-one loss performance and better RMSE than RF. Packaged with KDB_e's far superior training times and competitive classification times, this makes KDB_e an excellent alternative to RF, especially for dealing with small and medium datasets.

Discussion
RF has been applied to several scientific fields and associated research areas [32] because of its high classification accuracy. However, RF is more negatively affected than BNCs in terms of computational consumption (memory and time) by dataset size [19]. Furthermore, due to its random mechanism, RF is sometimes criticized for the difficulty of giving a clear semantic explanation of the combined result output by numerous decision trees. In contrast, our proposed discriminatory target learning framework considers not only the dependence relationships that exist in the training data but also those hidden in unlabeled test instances, which makes the final model highly interpretable. KDB_e outperformed RF in terms of zero-one loss, RMSE and variance, especially on small and medium size datasets, while RF beat KDB_e in terms of bias. Moreover, RF required substantially more time for training, and KDB_e took a bit more time for classifying.
To illustrate that KDB e is more interpretable than RF, we took a medical diagnostic application as an example. The Heart-disease-c dataset (http://archive.ics.uci.edu/ml/datasets/Heart+Disease) from the UCI Machine Learning Repository was collected at the Cleveland Clinic Foundation and contains 13 attributes and two class labels. A detailed description of this dataset is shown in Table 7. The zero-one loss results of KDB, RF and KDB e are 0.2244, 0.2212 and 0.2079, respectively. The KDB learned from training data describes the general conditional dependencies, but for a particular instance only some of those dependencies may hold rather than all of the dependencies shown in KDB. In contrast, kdb e can encode the most probable local conditional dependencies hidden in a single test instance. We argue that, ideally, KDB and kdb e are complementary to each other for classification and may focus on different key points. To illustrate this, we randomly took an instance T = {x 0 = 57, ...} from the Heart-disease-c dataset. Figures 8 and 9 show the structural difference between KDB and the submodels of kdb e . For KDB, by comparing mutual information I(X; Y), {X 6 , X 1 , X 12 } are the first three key attributes for this dataset. There are 23 arcs in the structure of KDB, which represent the conditional dependencies between predictive attributes. However, the values of I(X 8 ; X 1 |Y), I(X 8 ; X 6 |Y), I(X 9 ; X 1 |Y) and I(X 9 ; X 6 |Y) are all 0. For the instance T , Figure 9 shows that the structure of kdb e differed greatly from that of KDB. The true class label of T is y 1 ; KDB misclassified T , while KDB e classified it correctly. Thus, we can use Figure 9a for dependence analysis. By comparing the pointwise y 1 -mutual information, {x 12 , x 11 , x 7 } are the first three key attribute values for T .
It is worth mentioning that X 1 ranked second in KDB, whereas x 1 ranked last in kdb y 1 . Furthermore, there were only 15 arcs in kdb y 1 , which means that some redundant dependencies were eliminated. In general, KDB e can exploit the knowledge learned from both the training data and unlabeled test instances by building different models, which makes it well suited to applications such as precision medical diagnosis.
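The attribute rankings discussed above are driven by mutual information estimated from frequency counts. As a rough sketch of how such a score can be computed, the hypothetical helper below estimates I(X; Y) from paired samples; it is an illustration under our own naming conventions, not the paper's implementation:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) = sum_{x,y} P(x,y) * log(P(x,y) / (P(x)P(y)))
    from paired samples of a discrete attribute X and class Y."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # P(x,y) = c/n, P(x) = px[x]/n, P(y) = py[y]/n
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Toy example: X perfectly predicts Y, so I(X;Y) = H(Y) = log 2.
print(mutual_information([0, 0, 1, 1], ['a', 'a', 'b', 'b']))
```

Ranking attributes then amounts to sorting them by this score in descending order, which is how the "first three key attributes" above would be identified.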

Imbalanced Datasets
There are 15 imbalanced datasets in our experiments, which are annotated with the symbol "*" in Table 1. To prove that KDB e has the capacity to deal with imbalanced datasets, we conducted a set of experiments comparing the performance of KDB e with RF in terms of the extended Matthews correlation coefficient (MCC). The MCC provides a balanced measure for skewed datasets by taking the class distribution into account [33]. The classification results can be shown in the form of a confusion matrix as follows:

$$N = \begin{pmatrix} N_{11} & N_{12} & \cdots & N_{1m} \\ N_{21} & N_{22} & \cdots & N_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ N_{m1} & N_{m2} & \cdots & N_{mm} \end{pmatrix}$$

Each entry N ij of the matrix gives the number of instances whose true class is Y i and that were assigned to class Y j , where 1 ≤ i, j ≤ m; the diagonal entries N ii therefore count the correctly classified instances. Given the confusion matrix, the extended MCC can be calculated as follows:

$$\mathrm{MCC} = \frac{c\,s - \sum_{k=1}^{m} t_k\, p_k}{\sqrt{s^2 - \sum_{k=1}^{m} p_k^2}\;\sqrt{s^2 - \sum_{k=1}^{m} t_k^2}},$$

where $s = \sum_{i,j} N_{ij}$ is the total number of instances, $c = \sum_{k} N_{kk}$ is the number of correctly classified instances, $t_k = \sum_{j} N_{kj}$ is the number of instances whose true class is Y k , and $p_k = \sum_{i} N_{ik}$ is the number of instances assigned to class Y k . Note that the MCC reaches its best value at 1, which represents a perfect prediction, and its worst value at −1, which indicates total disagreement between the predicted and observed classifications. Figure 10 shows the scatter plot of KDB e against RF in terms of MCC. Many points fall close to the diagonal line, which means that KDB e achieved competitive results compared with RF. Furthermore, three points lie far above the diagonal line, which means that KDB e enjoys significant advantages on these datasets. A notable case is the Dis dataset, annotated in red, where the MCC of KDB e (0.4714) was much higher than that of RF (0.3710). In general, KDB e had the capacity to handle imbalanced datasets.
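The extended MCC can be computed directly from the confusion matrix. The sketch below is an illustrative implementation of the multi-class MCC in the form given above; the function name and matrix encoding (a list of rows, with N[i][j] counting instances of true class Y i assigned to Y j ) are our own assumptions:

```python
import math

def multiclass_mcc(N):
    """Extended (multi-class) Matthews correlation coefficient
    computed from an m x m confusion matrix N."""
    m = len(N)
    s = sum(sum(row) for row in N)           # total number of instances
    c = sum(N[k][k] for k in range(m))       # correctly classified instances
    t = [sum(N[k]) for k in range(m)]                        # true count per class
    p = [sum(N[i][k] for i in range(m)) for k in range(m)]   # predicted count per class
    num = c * s - sum(t[k] * p[k] for k in range(m))
    den = (math.sqrt(s * s - sum(pk * pk for pk in p))
           * math.sqrt(s * s - sum(tk * tk for tk in t)))
    return num / den if den else 0.0

# A perfect binary prediction yields MCC = 1.
print(multiclass_mcc([[3, 0], [0, 2]]))  # → 1.0
```

For two classes this reduces to the familiar binary MCC, which is why the measure ranges from −1 (total disagreement) to 1 (perfect prediction).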

Global Comparison of All Classifiers
In this section, to assess whether the overall differences in performance of these learners were statistically significant, we employed the Friedman test [34] and the post-hoc Nemenyi test, as recommended by Demšar [35]. The Friedman test is a non-parametric test for multiple hypothesis testing. It ranks the algorithms for each dataset separately: the best performing algorithm gets the rank of 1, the second best the rank of 2, and so on. In case of ties, average ranks are assigned. The null hypothesis is that all of the algorithms perform equivalently and there is no significant difference in terms of average ranks. The Friedman statistic can be computed as follows:

$$\chi^2_F = \frac{12}{N\,t(t+1)} \sum_{j=1}^{t} R_j^2 - 3N(t+1),$$

where $R_j = \sum_{i} r_i^j$ and $r_i^j$ is the rank of the jth of t algorithms on the ith of N datasets. The Friedman statistic is distributed according to $\chi^2_F$ with t − 1 degrees of freedom. Thus, for any pre-determined level of significance α, the null hypothesis will be rejected if $\chi^2_F > \chi^2_\alpha$. The critical value of $\chi^2_\alpha$ for α = 0.05 with six degrees of freedom is 12.592. The Friedman statistics of zero-one loss and RMSE were 53.65 and 60.49, which were both larger than 12.592. Hence, the null hypothesis was rejected. According to the detailed rank results shown in Tables A5 and A6 in Appendix A, Figure 11 plots the average ranks across all datasets, along with the standard deviation for each learner. When assessing classification performance using zero-one loss, KDB e obtained the lowest average rank of 2.5952, followed by kdb e with 3.5595 and RF with 3.7024 (very close to that of AODE). When assessing the calibration of the probability estimates using RMSE, KDB e still performed the best, followed by RF with 3.4285 and AODE with 3.7500. We found NB at the other extreme on both measures, with average ranks of 5.8690 and 5.9523 out of a total of seven learners.
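Given a rank matrix (one row per dataset, one column per algorithm, ties replaced by average ranks), the Friedman statistic above reduces to a few lines of code. This is an illustrative sketch; the function and variable names are our own:

```python
def friedman_statistic(ranks):
    """Friedman chi-square statistic.
    ranks[i][j] is the rank of algorithm j on dataset i (1 = best)."""
    N = len(ranks)        # number of datasets
    t = len(ranks[0])     # number of algorithms
    # R_j: sum of ranks of algorithm j over all datasets
    R = [sum(ranks[i][j] for i in range(N)) for j in range(t)]
    return 12.0 / (N * t * (t + 1)) * sum(r * r for r in R) - 3.0 * N * (t + 1)

# Two datasets, three algorithms, identical orderings on both datasets.
print(friedman_statistic([[1, 2, 3], [1, 2, 3]]))  # → 4.0
```

If every algorithm ends up with the same rank sum (e.g., the orderings on different datasets cancel out), the statistic is 0, matching the null hypothesis of equivalent performance.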
Since we rejected the null hypothesis, the Nemenyi test was used to further analyze which pairs of algorithms were significantly different in terms of the average ranks of the Friedman test. The performance of two classifiers is significantly different if their corresponding average ranks differ by at least the critical difference (CD):

$$\mathrm{CD} = q_\alpha \sqrt{\frac{t(t+1)}{6N}}, \qquad (16)$$

where the critical value $q_\alpha$ for α = 0.05 and t = 7 is 2.949. Given seven algorithms and 42 datasets, we used Equation (16) to calculate the CD, and the result is 1.3902. The learners in Figure 12 are plotted on the red line on the basis of their average ranks, corresponding to the nodes on the top black line. If two algorithms had no significant difference, they are connected by a line. As shown in Figure 12a, KDB e had a significantly lower average zero-one loss rank than NB, TAN and KDB. KDB e also achieved a lower average zero-one loss rank than kdb e , RF and AODE, but not significantly so. When RMSE was considered, KDB e still performed the best, and the rank of KDB e was significantly lower than that of KDB, providing solid evidence for the effectiveness of our proposed discriminatory target learning framework.
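The critical difference in Equation (16) is a one-line computation. The hypothetical helper below reproduces the CD value used above (q α = 2.949, t = 7, N = 42 gives roughly 1.39):

```python
import math

def nemenyi_cd(q_alpha, t, N):
    """Critical difference for the Nemenyi post-hoc test:
    CD = q_alpha * sqrt(t(t+1) / (6N)),
    with t algorithms compared over N datasets."""
    return q_alpha * math.sqrt(t * (t + 1) / (6.0 * N))

print(nemenyi_cd(2.949, 7, 42))  # ≈ 1.39
```

Two learners whose average Friedman ranks differ by at least this amount are declared significantly different at the chosen α.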

Conclusions
Lack of explanatory insight into the relative influence of the random variables greatly restricts the application domains of machine learning techniques. By redefining mutual information and conditional mutual information, the framework of discriminatory target learning helps to fully and discriminately describe the dependence relationships in unlabeled instances and labeled training data. The kdb e learned from an unlabeled instance and the regular KDB learned from training data are different but complementary in nature, which helps to further improve classification performance. Discriminatory target learning can be expected to work for different types of BNCs with different dependence complexities. Exploring the application of discriminatory target learning to other kinds of machine learning techniques, e.g., decision trees or support vector machines, is a promising direction for future work.