Article

Universal Target Learning: An Efficient and Effective Technique for Semi-Naive Bayesian Learning

1 College of Software, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 Department of Software and Big Data, Changzhou College of Information Technology, Changzhou 213164, China
4 College of Computer Science and Technology, Jilin University, Changchun 130012, China
5 College of Instrumentation and Electrical Engineering, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Entropy 2019, 21(8), 729; https://doi.org/10.3390/e21080729
Submission received: 15 June 2019 / Revised: 15 July 2019 / Accepted: 22 July 2019 / Published: 25 July 2019
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract:
To mitigate the negative effect of classification bias caused by overfitting, semi-naive Bayesian techniques seek to mine the implicit dependency relationships in unlabeled testing instances. By redefining some criteria from information theory, Target Learning (TL) proposes to build, for each unlabeled testing instance P, the Bayesian Network Classifier BNC_P, which is independent of and complementary to BNC_T learned from training data T. In this paper, we extend TL to Universal Target Learning (UTL) to identify redundant correlations between attribute values and maximize the bits encoded in the Bayesian network in terms of log likelihood. We take the k-dependence Bayesian classifier as an example to investigate the effect of UTL on BNC_P and BNC_T. Our extensive experimental results on 40 UCI datasets show that UTL can help BNCs improve their generalization performance.

1. Introduction

Supervised learning is a machine learning paradigm that has been successfully applied in many classification tasks [1,2]. Supervised learning has widespread deployment in applications including medical diagnosis [3,4,5], email filtering [6,7], and recommender systems [8,9,10]. The mission of supervised classification is to learn a classifier, such as a neural network or a decision tree, from a labeled training set T and then use it to assign class label c to a testing instance x = {x_1, ..., x_n}, where x_i and c respectively denote the value of attribute X_i and of class variable C. Bayesian Network Classifiers (BNCs) [11] are tools for graphically representing probabilistic dependency relationships and for inference under uncertainty. They supply a framework to compute the joint probability, which can be written as a product of the conditional probabilities of attributes given their parents, that is:
$$ P(c, \mathbf{x}) = P(c \mid \pi_c) \prod_{i=1}^{n} P(x_i \mid \pi_i) \quad (1) $$
where π i and π c respectively denote the parents of attribute X i and that of class variable C.
Learning unrestricted BNCs is often time consuming and quickly becomes intractable as the number of attributes in a research domain grows. Moreover, inference in such unrestricted models has been shown to be NP-hard [12]. The success of zero-dependence Naive Bayes (NB) [13] has led to learning restricted BNCs, or BNC_T, from labeled training data T, e.g., the one-dependence Tree-Augmented Bayesian classifier (TAN) [14] and the k-Dependence Bayesian classifier (KDB) [12]. Among them, KDB can generalize from a one-dependence to an arbitrary k-dependence network structure and has received great attention from researchers in different domains. These BNCs attempt to extract the significant dependencies implicated in labeled training data, but overfitting may result in classification bias. For example, patients with similar symptoms may have quite different diseases: VM (viral myocarditis) [15] is often diagnosed as influenza due to its low incidence rate.
Semi-supervised learning methods generally apply unlabeled data to either reprioritize or modify hypotheses learned from labeled data alone [16,17,18]. These methods efficiently combine the explicit classification information of the labeled data with the information concealed in the unlabeled data [19]. The general assumption of this class of algorithms is that data points in high-density regions likely belong to the same class, while the decision boundary lies in low-density regions [20]. However, the information carried by one single unlabeled instance may be overwhelmed by mass training data, and a wrongly-assigned class label may result in "noise propagation". To address this problem, we presented the Target Learning (TL) framework [21], in which an independent Bayesian model BNC_P learned from testing instance P works jointly with BNC_T and effectively improves BNC_T's generalization performance with minimal additional computation. In this paper, we extend TL to Universal Target Learning (UTL), which dynamically adjusts the dependency relationships implicated in one single testing instance at classification time to explore the most appropriate network topology. Conditional entropy is introduced as the loss function to measure the bits encoded in the BNC in terms of log likelihood.
The remainder of the paper is organized as follows: Section 2 reviews related state-of-the-art BNCs. Section 3 gives the theoretical justification of the UTL framework and describes the learning procedure of KDB within UTL. Extensive experimental studies on 40 datasets are reported in Section 4. Finally, Section 5 presents the conclusions and future work.

2. Preliminaries

A Bayesian Network (BN) can be formalized as a pair ⟨G, Θ⟩. G is a directed acyclic graph whose nodes symbolize the class or attribute variables and whose arcs correspond to dependency relationships between child nodes and parent nodes. Θ represents the parameter set, which includes the conditional probability distribution of each node in G, namely P_B(c | π_c) or P_B(x_i | π_i), where π_i and π_c respectively denote the parents of attribute X_i and of class variable C in structure G. Learning an optimal BN has been proven to be NP-hard [22]. To deal with this complexity, learning restricted network structures has been investigated [23]. In restricted BNCs, where the class variable is taken as a parent of each attribute, the joint probability distribution is defined as:
$$ P_B(c, \mathbf{x}) = P(c) \prod_{i=1}^{n} P_B(x_i \mid c, \pi_i). \quad (2) $$
Taking advantage of the underlying network topology of B and Equation (2), a BNC computes P_B(c | x) by:
$$ P_B(c \mid \mathbf{x}) = \frac{P_B(c, \mathbf{x})}{P_B(\mathbf{x})} = \frac{P_B(c, \mathbf{x})}{\sum_{c' \in \Omega_C} P_B(c', \mathbf{x})} = \frac{P(c) \prod_{i=1}^{n} P_B(x_i \mid c, \pi_i)}{\sum_{c' \in \Omega_C} P(c') \prod_{i=1}^{n} P_B(x_i \mid c', \pi_i)}. \quad (3) $$
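As a concrete illustration of Equations (2) and (3), the following minimal Python sketch computes the joint and posterior probabilities from a hand-specified factorization. The two-attribute toy network and its conditional probability tables are hypothetical and only serve to show the mechanics.

```python
# Toy example, assuming a hypothetical network C -> {X1, X2} with an extra arc X1 -> X2,
# i.e., pi_1 = {} and pi_2 = {X1} besides the class C.
P_c = {0: 0.6, 1: 0.4}                                          # P(c)
P_x1_c = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # P(x1 | c), keyed by (x1, c)
P_x2_x1c = {(0, 0, 0): 0.9, (1, 0, 0): 0.1,                     # P(x2 | x1, c), keyed by (x2, x1, c)
            (0, 1, 0): 0.4, (1, 1, 0): 0.6,
            (0, 0, 1): 0.5, (1, 0, 1): 0.5,
            (0, 1, 1): 0.3, (1, 1, 1): 0.7}

def joint(c, x1, x2):
    """P_B(c, x) = P(c) * P(x1 | c) * P(x2 | x1, c), as in Equation (2)."""
    return P_c[c] * P_x1_c[(x1, c)] * P_x2_x1c[(x2, x1, c)]

def posterior(x1, x2):
    """P_B(c | x) by normalizing the joint over all class values, as in Equation (3)."""
    scores = {c: joint(c, x1, x2) for c in P_c}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior(1, 0))   # posterior class distribution for the instance x1 = 1, x2 = 0
```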
Among numerous restricted BNCs, NB is an extremely simple and remarkably effective approach with a zero-dependence structure (see Figure 1a) for classification [24,25]. It uses a simplifying assumption that given the class label, the attributes are independent of each other [26,27], i.e.,
$$ P_{NB}(\mathbf{x} \mid c) = \prod_{i=1}^{n} P(x_i \mid c). \quad (4) $$
However, in the real world, NB's attribute independence assumption is often violated, which sometimes degrades its classification performance. There has been considerable prior work exploring methods to improve NB's classification performance. Information theory, proposed by Shannon, established a mathematical basis for the rapid development of BNs. Mutual Information (MI) I(X_i; C) is the most commonly-used criterion to rank attributes for attribute sorting or filtering [28,29], and Conditional Mutual Information (CMI) I(X_i; X_j | C) is used to quantify the conditional dependence between attribute pair X_i and X_j for identifying possible dependencies. I(X_i; C) and I(X_i; X_j | C) are defined as follows:
$$ I(X_i; C) = \sum_{x_i \in \Omega_{X_i}} \sum_{c \in \Omega_C} P(x_i, c) \log \frac{P(x_i, c)}{P(x_i) P(c)}, \qquad I(X_i; X_j \mid C) = \sum_{x_i \in \Omega_{X_i}} \sum_{x_j \in \Omega_{X_j}} \sum_{c \in \Omega_C} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c) P(x_j \mid c)}. \quad (5) $$
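As an illustration of Equation (5), the sketch below estimates MI and CMI from empirical frequencies. The toy data and the function names are illustrative only; they are not part of the paper's implementation.

```python
import math
from collections import Counter

def mutual_information(xs, cs):
    """Empirical I(X; C) as in Equation (5), using frequencies as probability estimates."""
    n = len(xs)
    p_xc, p_x, p_c = Counter(zip(xs, cs)), Counter(xs), Counter(cs)
    return sum((nxc / n) * math.log((nxc / n) / ((p_x[x] / n) * (p_c[c] / n)))
               for (x, c), nxc in p_xc.items())

def conditional_mutual_information(xi, xj, cs):
    """Empirical I(Xi; Xj | C) as in Equation (5)."""
    n = len(cs)
    p_ijc = Counter(zip(xi, xj, cs))
    p_ic, p_jc, p_c = Counter(zip(xi, cs)), Counter(zip(xj, cs)), Counter(cs)
    return sum((nijc / n) * math.log((nijc / p_c[c]) /
               ((p_ic[(a, c)] / p_c[c]) * (p_jc[(b, c)] / p_c[c])))
               for (a, b, c), nijc in p_ijc.items())

# Hypothetical toy data: two discrete attributes and a class label.
x1 = [0, 0, 1, 1, 0, 1]
x2 = [0, 1, 1, 1, 0, 0]
c  = [0, 0, 1, 1, 0, 1]
print(mutual_information(x1, c), conditional_mutual_information(x1, x2, c))
```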
The independence assumption may not hold for all attribute pairs, but it may hold for some of them. Two categories of learning strategies based on NB have been proven effective. The first category aims at identifying the independency relationships so as to better approximate NB's independence assumption. Langley and Sage [27] proposed the wrapper-based Selective Bayes (SB) classifier, which carries out a greedy search through the space of attribute subsets to exclude redundant attributes from the prediction process. Some methods relieve violations of the attribute independence assumption by deleting strongly related attributes (such as Backwards Sequential Elimination (BSE) [30] and Forward Sequential Selection (FSS) [31]). Some attribute weighting methods also achieve competitive performance. The earliest weighted naive Bayes method was proposed by Hilden and Bjerregaard [32] and used a single weight; Ferreira [34] improved on this by weighting each attribute value rather than each attribute. Hall [33] assigned each attribute a weight inversely proportional to the minimum depth at which the attribute is first tested in an unpruned decision tree. The other group introduced various extensions to NB. Kwoh and Gillies [35] proposed a method that introduces one hidden variable into NB's model as a child of the class label and the parent of all predictor attributes. Kohavi [36] described a hybrid approach that attempts to utilize the advantages of both decision trees and naive Bayes. Yang [37] proposed to better fit NB's conditional independence assumption by discretization.
The second category aims at relaxing the independence assumption by introducing significant dependency relationships. TAN relaxes the independence assumption and extends NB from a zero-dependence structure to a one-dependence maximum weighted spanning tree [14] (see Figure 1b). Based on this, Keogh and Pazzani [38] proposed to construct TAN by choosing the augmented arcs that maximize the improvement in classification accuracy. ATAN [39] predicts by averaging the class-membership probabilities estimated by each built TAN. Weighted Averaged Tree-Augmented Naive Bayes (WATAN) [39] sets the aggregation weight by the mutual information between the class variable and the root attribute. To represent more dependency relationships, an ensemble of one-dependence BNCs or a high-dependence BNC is a feasible solution. RTAN [40] generates TANs that describe the dependency relationships within certain attribute subspaces; BaggingMultiTAN [40] trains these RTANs as component classifiers and makes the final prediction by majority vote. Averaged One-Dependence Estimators (AODE) [41] assumes that every attribute depends on the class and a shared attribute and only uses one-dependence estimators. To handle continuous variables, HAODE [42] considers, in every model, the discretized version of the super-parent attribute, so that the remaining relationships can be estimated by univariate Gaussian distributions. As shown in Figure 1c (KDB with four attributes when k = 2), KDB can represent an arbitrary degree of dependency relationships while achieving computational efficiency similar to that of NB [21]. Bouckaert proposed to average all of the possible network structures for a fixed value of k (including lower orders) [43]. Rubio and Gámez presented a variant of KDB that uses a hill-climbing algorithm to build a KDB incrementally [44].
To avoid the high variance and classification bias caused by overfitting, how to mine the information existing in testing instance P is an interesting issue that has attracted increasing attention recently. Some algorithms try to incorporate P into training data T, which can help refine the network structure of classifier BNC_T learned from T only. The recursive Bayesian classifier [31] captures each label predicted by NB and, if instances are misclassified, induces a new NB from the cases that share the predicted label. A random oracle classifier [45] splits the labeled training data into two subsets using the random oracle and trains two sub-classifiers respectively; the testing instance then uses the random oracle to select one sub-classifier for classification. Other algorithms, though few, seek to explore the dependency relationships implicated in P only. Subsumption Resolution (SR) [46] identifies pairs of attribute values in P, and if one is a generalization of the other, SR deletes the generalization. Target learning [21] extends P to a pseudo training set and then builds an independent BNC_P for it, which is complementary to BNC_T in nature.

3. UKDB: Universal Target Learning

3.1. Target Learning (TL)

Relaxing the independence assumption by adding augmented edges to NB is a feasible approach to refining NB and increasing the confidence of the estimate of the joint probability P(x, c). However, from Equation (5), we can see that, to compute MI or CMI, the required (conditional) probability distributions are learned from labeled training dataset T only. Thus, as the structure complexity increases, the corresponding BNC may overfit the training data and underfit the unlabeled testing instance, which may lead to classification bias and high variance. To address this issue, we proposed the TL framework to build a specific BNC_P for any testing instance P at classification time to explore possible conditional dependencies that exist in P only. BNC_P applies the same learning strategy as BNC_T learned from T. Thus, BNC_P and BNC_T are complementary to each other and can work jointly.
We take KDB as an example to illustrate the basic idea of TL. Given training dataset T, the learning procedure of KDB_T is shown in Algorithm 1.
Algorithm 1: The learning procedure of KDB_T.
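As a minimal sketch of the procedure that Algorithm 1 describes, assuming only the textual account given in this section (attributes ranked by I(X_i; C); each attribute takes the class plus up to k higher-ranked attributes with the largest I(X_i; X_j | C) as parents), the structure-learning step could look as follows. The function names and signatures are illustrative, not the paper's implementation.

```python
def learn_kdb_structure(attributes, k, mi, cmi):
    """Structure-learning sketch for KDB_T, following the textual description above.

    attributes: list of attribute names X_1 ... X_n
    mi(xi):     estimate of I(X_i; C) computed from the training data T
    cmi(xi, xj): estimate of I(X_i; X_j | C) computed from T
    Returns a dict mapping each attribute to its selected attribute parents
    (the class variable C is implicitly a parent of every attribute).
    """
    # Step 1: rank attributes by mutual information with the class.
    order = sorted(attributes, key=mi, reverse=True)
    parents = {}
    for i, xi in enumerate(order):
        candidates = order[:i]            # only higher-ranked attributes may be parents
        m = min(i, k)                     # the (i+1)-th attribute gets min(i, k) attribute parents
        # Step 2: keep the m candidates with the largest CMI with xi.
        parents[xi] = sorted(candidates, key=lambda xj: cmi(xi, xj), reverse=True)[:m]
    return parents
```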
From the viewpoint of information theory, MI, i.e., I(X_i; C), measures the mutual dependence between C and X_i. From Equation (5), we can see that I(X_i; C) is an expectation over all possible values of C and X_i. Thus, although the dependency relationships between attributes may vary to a certain extent for different instances [21], the structure of traditional KDB cannot automatically fit diverse instances. To address this issue, for an unlabeled testing instance {x_1, ..., x_n}, Local Mutual Information (LMI) and Conditional Local Mutual Information (CLMI) are introduced as follows to measure the dependency relationships between attribute values [21]:
$$ \hat{I}(X_i; C) = \sum_{c \in \Omega_C} P(x_i, c) \log \frac{P(x_i, c)}{P(x_i) P(c)}, \qquad \hat{I}(X_i; X_j \mid C) = \sum_{c \in \Omega_C} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c) P(x_j \mid c)}. \quad (6) $$
Given training set T, KDB_T sorts attributes by comparing I(X_i; C) and chooses conditional dependency relationships by comparing I(X_i; X_j | C). In contrast, given testing instance P = {x_1, x_2, ..., x_n}, KDB_P sorts attributes by comparing Î(X_i; C) and chooses conditional dependency relationships by comparing Î(X_i; X_j | C). The learning procedure of KDB_P is shown in Algorithm 2.
Algorithm 2: The learning procedure of KDB_P.
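A minimal sketch of the instance-specific criteria in Equation (6) that Algorithm 2 relies on is given below. The probability estimates passed in are assumed to be frequency estimates learned from T, and KDB_P could reuse the same structure-learning routine sketched above with these per-instance scores in place of MI and CMI; the function names are assumptions of this sketch.

```python
import math

def local_mi(p_xc, p_x, p_c, xi_value):
    """I_hat(X_i; C) for a fixed attribute value x_i (Equation (6)).
    p_xc[(x, c)], p_x[x], p_c[c] are probability estimates learned from T."""
    return sum(p_xc[(xi_value, c)] *
               math.log(p_xc[(xi_value, c)] / (p_x[xi_value] * p_c[c]))
               for c in p_c if p_xc.get((xi_value, c), 0) > 0)

def conditional_local_mi(p_ijc, p_ic, p_jc, p_c, xi_value, xj_value):
    """I_hat(X_i; X_j | C) for fixed values x_i and x_j (Equation (6))."""
    return sum(p_ijc[(xi_value, xj_value, c)] *
               math.log((p_ijc[(xi_value, xj_value, c)] / p_c[c]) /
                        ((p_ic[(xi_value, c)] / p_c[c]) * (p_jc[(xj_value, c)] / p_c[c])))
               for c in p_c if p_ijc.get((xi_value, xj_value, c), 0) > 0)
```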

3.2. Universal Target Learning

Generally speaking, the aim of BNC learning is to find a network structure that facilitates the shortest description of the original data. The length of this description accounts for both the description of the BNC itself and the description of the data given the BNC [38]. Such a BNC represents a probability distribution P_B(x) over the instances x appearing in the training data T.
Given training data T with N instances, T = {d_1, ..., d_N}, the log likelihood of classifier B given T is defined as:
$$ LL(B \mid \mathcal{T}) = \sum_{i=1}^{N} \log P_B(d_i), \quad (7) $$
which represents how many bits are required to describe T under the probability distribution P_B. The log likelihood also has a statistical interpretation: the higher the log likelihood, the closer classifier B comes to modeling the probability distribution in T. The label of testing instance U = {x_1, ..., x_n} may take any one of the |C| possible values of variable C. Thus, TL assumes that U is equivalent to a pseudo training set P that consists of |C| instances as follows:
$$ U = \{x_1, \ldots, x_n\} \;\Longrightarrow\; \mathcal{P} = \begin{cases} P_1 = \{x_1, \ldots, x_n, c_1\} \\ P_2 = \{x_1, \ldots, x_n, c_2\} \\ \;\;\vdots \\ P_{|C|} = \{x_1, \ldots, x_n, c_{|C|}\} \end{cases} \quad (8) $$
Similar to the definition of LL(B|T), the log likelihood of classifier B given P is defined as:
$$ LL(B \mid \mathcal{P}) = \sum_{i=1}^{|C|} \log P_B(P_i). \quad (9) $$
By applying the different (conditional) mutual information criteria shown in Equations (5) and (6), BNC_P and BNC_T provide two network structures to describe possible dependency relationships implicated in testing instances. However, these two criteria cannot directly measure the number of bits needed to describe P based on P_B, whereas LL(B|P) can. From Equation (2),
$$ LL(B \mid \mathcal{P}) = \sum_{i=1}^{|C|} \log P_B(P_i) = \sum_{i=1}^{|C|} \log \Big( P(c_i) \prod_{j=1}^{n} P_B(x_j \mid c_i, \pi_j) \Big) = \sum_{i=1}^{|C|} \log P(c_i) + \sum_{j=1}^{n} \sum_{i=1}^{|C|} \log P_B(x_j \mid c_i, \pi_j) = \hat{H}(C) + \sum_{j=1}^{n} \hat{H}(X_j \mid C, \Pi_j) \quad (10) $$
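The following sketch mirrors Equations (8)-(10): the testing instance is expanded into |C| pseudo-instances and LL(B|P) is the sum of their log joint probabilities under B. The `joint` callable stands for any routine computing P_B(c, x), such as the one sketched in Section 2, and is an assumption of this sketch rather than part of the paper's code.

```python
import math

def pseudo_training_set(instance, class_values):
    """Expand U = {x_1, ..., x_n} into |C| labelled pseudo-instances (Equation (8))."""
    return [dict(instance, **{"class": c}) for c in class_values]

def log_likelihood(joint, instance, class_values):
    """LL(B | P): sum of log P_B(P_i) over the |C| pseudo-instances (Equations (9)-(10))."""
    return sum(math.log(joint(p)) for p in pseudo_training_set(instance, class_values))
```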
If there exist strong correlations between the values of parent attributes, we may choose to replace the corresponding redundant dependency relationships with more meaningful ones. For example, let Gender and Pregnant be two attributes. If Pregnant = "yes", it follows that Gender = "female". Thus, Gender = "female" is a generalization of Pregnant = "yes" [46] and P(Gender = female, Pregnant = yes) = P(Pregnant = yes). Given some other attribute values x̂ = {x_1, ..., x_m}, we also have P(Gender = female, Pregnant = yes, x̂) = P(Pregnant = yes, x̂). Correspondingly,
$$ P(x_{m+1} \mid \textit{Gender}{=}\textit{female}, \textit{Pregnant}{=}\textit{yes}, \hat{x}) = \frac{P(\textit{Gender}{=}\textit{female}, \textit{Pregnant}{=}\textit{yes}, \hat{x}, x_{m+1})}{P(\textit{Gender}{=}\textit{female}, \textit{Pregnant}{=}\textit{yes}, \hat{x})} = \frac{P(\textit{Pregnant}{=}\textit{yes}, \hat{x}, x_{m+1})}{P(\textit{Pregnant}{=}\textit{yes}, \hat{x})} = P(x_{m+1} \mid \textit{Pregnant}{=}\textit{yes}, \hat{x}) \quad (11) $$
Obviously, for specific instances in which such correlations hold, the parent attribute Gender cannot provide any extra information about X_{m+1} and should be removed. To maximize LL(B|P), X_{m+1} may select another attribute, e.g., X_p, as its parent to take the place of attribute Gender; thus, the dependency relationship between X_p and X_{m+1} that was neglected before can be added into the network structure. Many algorithms only try to improve performance by removing redundant dependency relationships from the network structure, without searching for more meaningful dependency relationships. Because of the constraint of computational complexity, which is closely related to structure complexity, each node in a BNC can only take a limited number of attributes as parents. For example, KDB demands that at most k attribute parents can be chosen for each node. The proposed algorithm also follows this rule.
The second term in Equation (10), i.e., Ĥ(X_j | C, Π_j), is the log likelihood of the conditional dependency relationships in B given P. To find proper dependency relationships implicated in each testing instance and maximize the estimate of LL(B|P), we need to maximize Ĥ(X_j | C, Π_j) for each attribute X_j in turn. We argue that LL(B|P) provides a more intuitive and scalable measure for a proper evaluation. Based on the discussion presented above, in this paper, we propose to refine the network structures of BNC_P and BNC_T based on Universal Target Learning (UTL). In the following discussion, we take KDB as an example and apply UTL to KDB_T and KDB_P in similar ways, obtaining UKDB_T and UKDB_P correspondingly. For testing instance P, UKDB_T or UKDB_P recursively checks all possible combinations of candidate parent attributes and attempts to find the Π_j that corresponds to the maximum of Ĥ(X_j | C, Π_j); that is, Π_j may contain fewer than min{i − 1, k} attributes. By maximizing Ĥ(X_j | C, Π_j) for each attribute X_j, UKDB_T and UKDB_P are expected to find more proper dependency relationships implicated in the specific testing instance P, which may help to maximize the estimate of LL(B|P). For example, suppose that the attribute order of KDB_T is {X_0, X_1, X_2, X_3} and k = 2; then for attribute X_2, its candidate parents are {X_0, X_1}. Given testing instance P, we compare and find the Π_2 for which Ĥ(X_2 | C, Π_2) = max{Ĥ(X_2 | C, X_0), Ĥ(X_2 | C, X_1), Ĥ(X_2 | C, X_0, X_1)}, with Π_2 ∈ {X_0, X_1, (X_0, X_1)}. Thus, UKDB_T dynamically adjusts dependency relationships for different testing instances at classification time, as sketched below. Similarly, UKDB_P applies the same learning strategy to refine the network structure of KDB_P.
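A minimal sketch of the adjustment step just described: for each attribute, every non-empty subset of its candidate parents (up to size k) is scored, and the subset maximizing the instance-specific term Ĥ(X_j | C, Π_j) is kept. The scoring callable `h_hat` is assumed to compute that term from the probability estimates, as in Equation (10).

```python
from itertools import combinations

def select_parents(candidates, k, h_hat):
    """Return the subset of `candidates` (size at most k) maximizing h_hat(parent_subset).

    candidates: attributes ranked before X_j in the pre-determined order
    h_hat:      callable scoring a tuple of parents, i.e. H_hat(X_j | C, Pi_j)
    """
    best, best_score = (), float("-inf")
    for size in range(1, min(k, len(candidates)) + 1):
        for subset in combinations(candidates, size):
            score = h_hat(subset)
            if score > best_score:
                best, best_score = subset, score
    return best

# For the example in the text (candidates {X0, X1}, k = 2), the search compares
# h_hat((X0,)), h_hat((X1,)) and h_hat((X0, X1)) and keeps the maximum.
```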
Given n attributes, we can have n! possible attribute orders, and among them, the orders respectively determined by I(X_i; C) and Î(X_i; C) have been proven to be feasible and effective. Thus, for attribute X_i, its parents can be selected from two sets of candidates, and the final classifier is an ensemble of UKDB_T and UKDB_P. UTL retains the characteristic of target learning, that is, UKDB_T and UKDB_P are complementary and can work jointly to make the final prediction. The learning procedures of UKDB_T and UKDB_P, shown in Algorithms 3 and 4 respectively, are almost the same except for the pre-determined attribute orders.
In contrast to TL, UTL can help BNC_P and BNC_T encode as many of the dependency relationships implicated in one single testing instance as possible. A linear combiner is appropriate for models that output real-valued numbers, so it is applicable to BNCs. For testing instance x, the ensemble probability estimate of UKDB_T and UKDB_P is:
$$ \hat{P}(y \mid \mathbf{x}) = \alpha \cdot P(y \mid \mathbf{x}, \mathrm{UKDB}_{\mathcal{T}}) + \beta \cdot P(y \mid \mathbf{x}, \mathrm{UKDB}_{\mathcal{P}}) \quad (12) $$
For different instances, the weights α and β may differ greatly, and there is no effective way to address this issue. Thus, in practice, we simply use the uniformly rather than non-uniformly weighted average of the probability estimates, that is, we set α = β = 0.5 in Equation (12).
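A one-line sketch of the uniformly weighted combiner in Equation (12) with α = β = 0.5; `p_t` and `p_p` stand for the class-posterior estimates produced by UKDB_T and UKDB_P and are assumed to be dicts over the same class values.

```python
def combine(p_t, p_p, alpha=0.5, beta=0.5):
    """Equation (12) with uniform weights: P_hat(y | x) = alpha*P_T(y|x) + beta*P_P(y|x)."""
    return {y: alpha * p_t[y] + beta * p_p[y] for y in p_t}
```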
Algorithm 3: UKDB_T.
Algorithm 4: UKDB_P.
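As a hedged end-to-end sketch of the procedure behind Algorithms 3 and 4, under the description in Section 3.2: UKDB_T and UKDB_P share the same skeleton and differ only in the pre-determined attribute order (by I(X_i; C) versus Î(X_i; C)); both refine the parent set of each attribute per testing instance by maximizing Ĥ(X_j | C, Π_j). The helper `select_parents` is the subset search sketched earlier, and the remaining callables (`h_hat_factory`, `cond_prob`, `prior`) are assumptions of this sketch.

```python
def ukdb_posterior(order, k, h_hat_factory, cond_prob, prior, class_values):
    """Per-instance UKDB classification sketch (same skeleton for UKDB_T and UKDB_P).

    order:          attribute order (by I(X_i;C) for UKDB_T, by I_hat(X_i;C) for UKDB_P)
    h_hat_factory:  h_hat_factory(xi) returns a scorer over candidate parent tuples
    cond_prob:      cond_prob(xi, c, parents) ~ P(x_i | c, pi_i) estimated from T
    prior:          prior(c) ~ P(c)
    """
    # Refine the structure for this particular testing instance.
    parents = {}
    for i, xi in enumerate(order):
        candidates = order[:i]
        parents[xi] = select_parents(candidates, k, h_hat_factory(xi)) if candidates else ()
    # Compute the posterior via Equation (3) under the refined structure.
    scores = {}
    for c in class_values:
        p = prior(c)
        for xi in order:
            p *= cond_prob(xi, c, parents[xi])
        scores[c] = p
    z = sum(scores.values()) or 1.0
    return {c: s / z for c, s in scores.items()}
```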

4. Results and Discussion

All algorithms for the experimental study ran on a C++ system (GCC 5.4.0). For KDB and its variants, as k increases, the time complexity and the structure complexity increase exponentially. Larger values of k may help improve classification accuracy compared with smaller values of k. However, the constraint of currently available hardware resources places some requirements on k: when k = 3, UKDB could not be evaluated on some large-scale datasets due to the available CPU resources. Thus, we only selected k = 1 and k = 2 in the following experimental study. To demonstrate the effectiveness of the UTL framework, the following algorithms (including single-structure BNCs and ensemble BNCs) are compared with ours:
  • NB, the standard Naive Bayes.
  • TAN, Tree-Augmented Naive Bayes.
  • K1DB, the k-dependence Bayesian classifier with k = 1.
  • K2DB, the k-dependence Bayesian classifier with k = 2.
  • AODE, Averaged One-Dependence Estimators.
  • WATAN, the Weighted Averaged Tree-Augmented Naive Bayes.
  • TANe, an ensemble Tree-Augmented Naive Bayes applying target learning.
  • UK1DB, the k-dependence Bayesian classifier with k = 1 in the framework of UTL.
  • UK2DB, the k-dependence Bayesian classifier with k = 2 in the framework of UTL.
We randomly selected 40 datasets from the UCI machine learning repository [47] for our experimental study. The datasets were divided into three categories, i.e., large datasets with more than 5000 instances, medium datasets with between 1000 and 5000 instances, and small datasets with fewer than 1000 instances. These datasets are described in detail in Table 1, including the number of instances, attributes, and classes. All the datasets are ordered in ascending order of dataset size. The number of attributes ranges widely from 4 to 56, which is convenient for evaluating the effectiveness of the UTL framework in mining dependency relationships between attributes. Meanwhile, we can examine the classification performance with dataset sizes ranging from 24 instances to 5,749,132 instances. Missing values were replaced with distinct values. We used Minimum Description Length (MDL) discretization [48] to discretize the numeric attributes.
To validate the effectiveness of UTL, the proposed UKDB is contrasted with three single-structure BNCs (NB, TAN, and KDB), as well as three ensemble BNCs (AODE, WATAN, and TANe), in terms of zero-one loss, RMSE, and F1-score in Section 4.1. Then, we introduce two criteria, goal difference and relative zero-one loss ratio, to measure the classification performance of UKDB when dealing with different quantities of training data and different numbers of attributes in Section 4.2 and Section 4.3, respectively. In Section 4.4, we compare the time cost of training and classification. Finally, we conduct a global comparison in Section 4.5.

4.1. Comparison of Zero-One Loss, RMSE, and F1-Score

4.1.1. Zero-One Loss

The experiments applied 10 rounds of 10-fold cross-validation. We use Win/Draw/Loss (W/D/L) records to summarize the experimental results. To compare classification accuracy, Table A1 in Appendix A reports the average zero-one loss for each algorithm on the different datasets. The corresponding W/D/L records are summarized in Table 2.
As shown in Table 2, among the single-structure classifiers, UK1DB performed significantly better than NB and TAN. Most importantly, UK1DB achieved a significant advantage over K1DB in terms of zero-one loss, with 21 wins and only seven losses, providing convincing evidence for the validity of the proposed algorithm. For large datasets, the advantage was even stronger. Simultaneously, UK2DB achieved a significant advantage over K2DB with a W/D/L of 28/8/4. That is, K2DB achieved better zero-one loss than UK2DB on only four datasets (contact-lenses, lung-cancer, sign, nursery); thus, UK2DB seldom performed worse than KDB. In contrast, UK2DB performed better than K2DB on many datasets, such as car, poker-hand, primary-tumor, and waveform-5000. When compared with the ensemble algorithms, UK1DB and UK2DB still enjoyed an advantage over AODE, WATAN, and TANe. Moreover, the comparison results of UK2DB with AODE and WATAN were almost significant (24 wins and only three losses, and 24 wins and only two losses, respectively). Based on the discussion above, we argue that UTL is an effective approach to refining BNCs.

4.1.2. RMSE

The Root Mean Squared Error (RMSE) is used to measure the deviation between the predicted value and the true value [49]. Table A2 in Appendix A reports the RMSE results for each algorithm on the different datasets. The corresponding W/D/L records are summarized in Table 3. The scatter plot of UK2DB against K2DB in terms of RMSE is shown in Figure 2. The X-axis shows the RMSE results of K2DB, and the Y-axis shows the RMSE results of UK2DB. We can observe that many datasets lie below the diagonal line, such as labor-negotiations, lymphography, and poker-hand, which shows that UK2DB has some advantages over K2DB. Simultaneously, except for credit-a and nursery, the other datasets lie close to the diagonal line, which means UK2DB rarely performed worse than K2DB. For many datasets, UTL substantially helped reduce the classification error of K2DB, for example the reduction from 0.4362 to 0.3571 on dataset lymphography. As shown in Table 3, among the single-structure classifiers, UK1DB performed significantly better than NB and TAN. Moreover, UK1DB achieved a significant advantage over K1DB with 10 wins and four losses, and UK2DB over K2DB with 14 wins and few losses, which provides convincing evidence for the validity of the proposed framework. When compared with the ensemble group, UK1DB and UK2DB still had a significant advantage: they had an obvious advantage over AODE, with W/D/L of 10/24/6 and 24/13/3, respectively. UK2DB also achieved a relatively significant advantage over WATAN and TANe (14 wins and only two losses, and 15 wins and only three losses, respectively), reducing RMSE more substantially. UKDB not only performed better than single-structure classifiers, but was also shown to be an effective ensemble model when compared with AODE in terms of RMSE.

4.1.3. F1-Score

Generally speaking, zero-one loss can roughly measure the classification performance of a BNC, but it cannot evaluate whether the BNC works consistently when dealing with different parts of imbalanced data. In contrast, precision gives the ratio of instances predicted to belong to a class that truly belong to it, and recall gives the ratio of instances truly belonging to a class that are predicted to belong to it [50]. Precision and recall are sometimes in tension; therefore, we employed the F1-score, the harmonic mean of precision and recall, to measure the performance of our algorithm. To handle the multiclass classification problem, we employed the confusion matrix to compute the F1-score. Suppose that there exists a dataset to be classified with classes {C_1, C_2, ..., C_m}. The following confusion matrix summarizes the classification results:
$$ \begin{pmatrix} N_{11} & \cdots & N_{1m} \\ \vdots & \ddots & \vdots \\ N_{m1} & \cdots & N_{mm} \end{pmatrix} $$
Each diagonal entry N_{ii} of the matrix gives the number of instances whose true class is C_i and that are assigned to C_i (where 1 ≤ i ≤ m). Each off-diagonal entry N_{ij} gives the number of instances whose true class is C_i but that are assigned to C_j (where i ≠ j and 1 ≤ i, j ≤ m). Given the confusion matrix, precision, recall, and the F1-score are computed as follows:
$$ Precision_i = \frac{N_{ii}}{\sum_{j=1}^{m} N_{ji}} $$
$$ Recall_i = \frac{N_{ii}}{\sum_{j=1}^{m} N_{ij}} $$
$$ F_1\text{-}score_i = \frac{2 \cdot Precision_i \cdot Recall_i}{Precision_i + Recall_i} $$
$$ F_1\text{-}score = \frac{\sum_{i=1}^{m} F_1\text{-}score_i}{m} $$
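As a small worked illustration of the definitions above, the following sketch computes per-class precision, recall, and F1 from a confusion matrix and returns the macro-averaged F1-score; the 3-class matrix in the usage line is hypothetical.

```python
def macro_f1(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    m = len(confusion)
    f1_scores = []
    for i in range(m):
        predicted_i = sum(confusion[j][i] for j in range(m))   # column sum
        actual_i = sum(confusion[i][j] for j in range(m))      # row sum
        precision = confusion[i][i] / predicted_i if predicted_i else 0.0
        recall = confusion[i][i] / actual_i if actual_i else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / m

# Hypothetical 3-class confusion matrix.
print(macro_f1([[50, 3, 2], [4, 40, 6], [1, 5, 44]]))
```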
Table A3 in Appendix A reports the F 1 -score for each algorithm on different datasets. Table 4 summarizes the W/D/L of the F 1 -score. Several points in this table are worth discussing:
As shown in Table 4, for the single-structure classifiers, UK 1 DB performed significantly better than NB and TAN. When compared with the ensembles, UK 1 DB and UK 2 DB still had a slight advantage over AODE and achieved significant advantages over WATAN and TAN e . Most importantly, UK 1 DB performed better than K 1 DB and UK 2 DB better than K 2 DB , although the advantage was not significant, which provides solid evidence for the effectiveness of UTL.

4.2. Goal Difference

To further compare the performance of UKDB with the other algorithms mentioned above in terms of data size, the Goal Difference (GD) [51,52] is introduced. For two classifiers A and B, the value of GD is computed as follows:
$$ \mathrm{GD}(A; B \mid \mathcal{T}) = |win| - |loss|, $$
where T represents the datasets used for comparison, and |win| and |loss| are respectively the numbers of datasets on which the classification performance of A is better or worse than that of B.
Figure 3 and Figure 4 respectively show the fitting curve of GD( UK 1 DB ; K 1 DB | S t ) and GD( UK 2 DB ; K 2 DB | S t ) in terms of the zero-one loss. The X-axis represents the indexes of datasets described in Table 1 (referred to as t), and the Y-axis respectively represents the values of GD( UK 1 DB ; K 1 DB | S t ) and GD( UK 2 DB ; K 2 DB | S t ), where S t denotes the collection of datasets, i.e., S t = { D m | m t } and D m is the dataset with index m.
From Figure 3, we can see that UK1DB achieved a significant advantage over K1DB; only on a few large datasets (nursery, seer-mdl, adult) was the advantage not obvious. Similarly, from Figure 4, we can see that there was an obvious positive correlation between the values of GD(UK2DB; K2DB | S_t) and the dataset size. The advantage of UK2DB over K2DB was much more obvious than that of UK1DB over K1DB on small and medium datasets. This superior performance is owed to the ensemble learning mechanism of UTL. UTL played a very important role in discovering proper dependency relationships that exist in testing instances. Since UTL replaces redundant dependency relationships with more meaningful ones, we can infer that UKDB retains the advantages of KDB, i.e., the ability to represent an arbitrary degree of dependence and to fit the training data. This demonstrates the feasibility of applying UTL to search for proper dependency relationships. When dealing with large datasets, overfitting may lead to high variance and classification bias; thus, the advantage of UKDB over KDB was not obvious there, whether k = 1 or k = 2.
For imbalanced datasets, the numbers of instances with different class labels vary greatly, which may lead to biased estimates of the conditional probabilities. In this paper, the entropy of class variable C, i.e., H(C), is introduced to measure the extent to which a dataset is imbalanced. UTL refines the network structures of BNC_T and BNC_P according to the attribute values rather than the class label of testing instance U, so the negative effect caused by an imbalanced distribution of C is mitigated to a certain extent. From Figure 5 and Figure 6, we can see that the advantage of UKDB over KDB becomes more and more significant when H(C) > 0.8. Thus, the datasets with H(C) > 0.8 are considered relatively imbalanced and are highlighted in Table A1, Table A2, and Table A3. Table 5 reports the corresponding H(C) values of these 40 datasets.

4.3. Relative Zero-One Loss Ratio

The relative zero-one loss ratio measures the extent to which classifier A_1 performs relatively better or worse than A_2 on different datasets. For instance, suppose that on dataset D_1 the zero-one losses of classifiers A_1 and A_2 are 55% and 50%, respectively, whereas on dataset D_2 the zero-one losses of A_1 and A_2 are 0% and 5%, respectively. Although the zero-one loss difference is 5% in both cases, A_1 performs relatively better on dataset D_2 than on dataset D_1. Given two classifiers A and B, the relative zero-one loss ratio, referred to as RZ(·), is defined as follows:
$$ RZ(A \mid B) = 1 - \frac{Z_A}{Z_B}, $$
where Z_A (or Z_B) denotes the zero-one loss of classifier A (or B) on a specific dataset. The higher the value of RZ(A|B), the better the performance of classifier A relative to classifier B.
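Applying the RZ definition above to the example makes the point concrete; the two calls reproduce the hypothetical losses quoted in the text.

```python
def rz(z_a, z_b):
    """Relative zero-one loss ratio RZ(A | B) = 1 - Z_A / Z_B."""
    return 1.0 - z_a / z_b

print(rz(0.55, 0.50))   # dataset D1: about -0.10, A1 relatively worse than A2
print(rz(0.00, 0.05))   # dataset D2: 1.00, A1 relatively better than A2
```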
Figure 7 presents the comparison results of RZ(·) for UK2DB versus K2DB and for UK1DB versus K1DB. The X-axis represents the index of the dataset, and the Y-axis shows the value of RZ(·). As we can observe intuitively, on most datasets, the values of RZ(UK2DB | K2DB) and RZ(UK1DB | K1DB) were positive, which demonstrates that UKDB achieved significant advantages over KDB whether k = 1 or k = 2. In many cases, the difference between RZ(UK2DB | K2DB) and RZ(UK1DB | K1DB) was not obvious; thus, the working mechanism of UTL makes it insensitive to the structure complexity. For the first 10 datasets, the effectiveness of UTL was less significant: UK1DB beat K1DB on six datasets and lost on four, and UK2DB performed similarly. From Table 1, among the datasets on which UTL performed more poorly, contact-lenses (No. 1), echocardiogram (No. 5), and iris (No. 7) have small numbers of attributes, i.e., 4, 6, and 4 attributes, respectively. A small dataset may lead to low-confidence estimates of the probability distributions and thus to a low-confidence estimate of Ĥ(X_j | C, Π_j), and a small number of attributes makes it more difficult for UTL to adjust the dependency relationships dynamically. However, as the size of the datasets increased, UKDB generally achieved more significant advantages over KDB. For the last 30 datasets, UTL performed more poorly on only a few datasets, e.g., hypothyroid (No. 25), and among these datasets, UK2DB worked much better than UK1DB. From the above discussion, we can conclude that the UTL framework is effective at identifying significant conditional dependencies implicated in the testing instance, although enough data to ensure high-confidence probability estimates is a necessary prerequisite.

4.4. Training and Classification Time

The comparison results of training time and classification time are displayed in Figure 8 and Figure 9, respectively. Each bar shows the total time over the 40 datasets.
From Figure 8, we can observe that our proposed algorithms UK1DB and UK2DB needed substantially more training time than the rest of the classifiers considered, i.e., NB, TAN, K1DB, K2DB, AODE, WATAN, and TANe. UK2DB spent slightly more training time than UK1DB on account of the larger number of dependency relationships in UK2DB. On the other hand, as shown in Figure 9, NB, TAN, AODE, K1DB, and K2DB consumed less classification time than UKDB for both k = 1 and k = 2, due to the ensemble learning strategy of UTL and to the fact that, during the learning process, UTL recursively tries to find the stronger dependency relationships for each testing instance based on log likelihood. UK1DB and UK2DB had similar classification time costs. Although UKDB generally required more training and classification time than the other BNCs, it achieved higher classification accuracy. Compared with KDB, UKDB delivered markedly lower zero-one loss, albeit at a considerable additional computational overhead. The advantage of UTL in improving classification accuracy comes at a cost in training time and classification time.

4.5. Global Comparison

We compared our algorithms with the other algorithms using the Nemenyi test proposed by Demšar [53]; the results are shown in Figure 10. If the average ranks of two classifiers differ by at least the Critical Difference (CD), their performance differs significantly. The value of CD can be calculated as follows:
$$ \mathrm{CD} = q_{\alpha} \sqrt{\frac{t(t+1)}{6N}}, $$
where the critical value q_α for α = 0.05 and t = 9 is 3.102 [53]. Given nine algorithms and 40 datasets, the critical difference is CD = 3.102 × √(9 × (9 + 1)/(6 × 40)) = 1.8996. We plot the algorithms on the left line according to their average ranks, which are indicated on the parallel right line; the Critical Difference (CD) is also shown in the graph. The lower the position of an algorithm, the lower its rank, and hence the better its performance. Algorithms are connected by a line if their differences are not significant. As shown in Figure 10, UK2DB achieved the lowest mean zero-one loss rank, followed by UK1DB. The average ranks of UK2DB and UK1DB were significantly better than those of NB, TAN, K1DB, and K2DB, demonstrating the effectiveness of the proposed universal target learning framework. Compared with the ensemble models AODE, WATAN, and TANe, UK2DB and UK1DB also achieved lower ranks, but not significantly so.
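A short numerical check of the critical difference reported above, assuming the standard Nemenyi formula with q_α = 3.102, t = 9 algorithms, and N = 40 datasets.

```python
import math

q_alpha, t, n_datasets = 3.102, 9, 40
cd = q_alpha * math.sqrt(t * (t + 1) / (6 * n_datasets))
print(round(cd, 4))   # 1.8996, matching the value reported above
```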

5. Conclusions and Future Work

BNCs can graphically represent the dependency relationships implicit in training data, and they have previously been demonstrated to be effective and efficient. On the basis of analyzing and summarizing state-of-the-art BNCs in terms of log likelihood, this paper proposed UTL, a novel framework for BNC learning. Our experiments showed its advantages in the comparison results of zero-one loss, RMSE, F1-score, etc. UTL can help refine the network structure by fully mining the significant conditional dependencies among the attribute values of a specific instance. The application of UTL is time-consuming, and we will seek methods to make it more efficient. Further research on extending TL is a promising direction.

Author Contributions

All authors contributed to the study and preparation of the article. S.G. and L.W. conceived of the idea, derived the equations, and wrote the paper. Y.L., H.L., and T.F. did the analysis and finished the programming work. All authors read and approved the final manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61272209 and No. 61872164.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Zero-one loss results of NB, TAN, AODE, WATAN, TANe, K1DB, K2DB, UK1DB, and UK2DB.
Index | Dataset | NB | TAN | AODE | WATAN | TANe | K1DB | K2DB | UK1DB | UK2DB
1 | contact-lenses | 0.3750 | 0.3750 | 0.3750 | 0.4583 | 0.3750 | 0.2917 | 0.2500 | 0.3333 | 0.3333
2 | lung-cancer | 0.4375 | 0.5938 | 0.5000 | 0.6250 | 0.6563 | 0.5938 | 0.5625 | 0.5625 | 0.5938
3 | post-operative | 0.3444 | 0.3667 | 0.3333 | 0.3667 | 0.3444 | 0.3444 | 0.3778 | 0.3556 | 0.3333
4 | zoo | 0.0297 | 0.0099 | 0.0297 | 0.0198 | 0.0099 | 0.0495 | 0.0495 | 0.0198 | 0.0297
5 | echocardiogram | 0.3359 | 0.3282 | 0.3206 | 0.3282 | 0.3282 | 0.3053 | 0.3435 | 0.3206 | 0.3282
6 | lymphography | 0.1486 | 0.1757 | 0.1689 | 0.1689 | 0.1689 | 0.1757 | 0.2365 | 0.1486 | 0.1554
7 | iris | 0.0867 | 0.0800 | 0.0867 | 0.0800 | 0.0867 | 0.0867 | 0.0867 | 0.0867 | 0.0733
8 | teaching-ae | 0.4967 | 0.5497 | 0.4901 | 0.5364 | 0.5166 | 0.5430 | 0.5364 | 0.4702 | 0.4503
9 | wine | 0.0169 | 0.0337 | 0.0225 | 0.0337 | 0.0337 | 0.0393 | 0.0225 | 0.0225 | 0.0169
10 | autos | 0.3122 | 0.2146 | 0.2049 | 0.2146 | 0.2000 | 0.2146 | 0.2049 | 0.2049 | 0.2000
11 | glass-id | 0.2617 | 0.2196 | 0.2523 | 0.2196 | 0.2103 | 0.2243 | 0.2196 | 0.2150 | 0.2056
12 | hungarian | 0.1599 | 0.1701 | 0.1667 | 0.1735 | 0.1599 | 0.1701 | 0.1803 | 0.1633 | 0.1463
13 | heart-disease-c | 0.1815 | 0.2079 | 0.2013 | 0.2046 | 0.1881 | 0.2079 | 0.2244 | 0.1914 | 0.2013
14 | primary-tumor | 0.5457 | 0.5428 | 0.5752 | 0.5428 | 0.5575 | 0.5693 | 0.5723 | 0.5457 | 0.5339
15 | horse-colic | 0.2174 | 0.2092 | 0.2011 | 0.2120 | 0.2038 | 0.2174 | 0.2446 | 0.2065 | 0.2092
16 | house-votes-84 | 0.0943 | 0.0552 | 0.0529 | 0.0529 | 0.0552 | 0.0690 | 0.0506 | 0.0552 | 0.0391
17 | cylinder-bands | 0.2148 | 0.2833 | 0.1889 | 0.2463 | 0.1833 | 0.2278 | 0.2259 | 0.1815 | 0.1815
18 | balance-scale | 0.2720 | 0.2736 | 0.2832 | 0.2736 | 0.2784 | 0.2816 | 0.2784 | 0.2640 | 0.2640
19 | credit-a | 0.1406 | 0.1507 | 0.1391 | 0.1507 | 0.1391 | 0.1551 | 0.1464 | 0.1348 | 0.1377
20 | pima-ind-diabetes | 0.2448 | 0.2383 | 0.2383 | 0.2370 | 0.2383 | 0.2422 | 0.2448 | 0.2357 | 0.2331
21 | tic-tac-toe | 0.3069 | 0.2286 | 0.2651 | 0.2265 | 0.2724 | 0.2463 | 0.2035 | 0.2317 | 0.1733
22 | german | 0.2530 | 0.2730 | 0.2480 | 0.2760 | 0.2590 | 0.2760 | 0.2890 | 0.2560 | 0.2680
23 | car | 0.1400 | 0.0567 | 0.0816 | 0.0567 | 0.0579 | 0.0567 | 0.0382 | 0.0741 | 0.0723
24 | mfeat-mor | 0.3140 | 0.2970 | 0.3145 | 0.2980 | 0.3050 | 0.2990 | 0.3060 | 0.3080 | 0.3070
25 | hypothyroid | 0.0149 | 0.0104 | 0.0136 | 0.0104 | 0.0092 | 0.0107 | 0.0107 | 0.0098 | 0.0101
26 | kr-vs-kp | 0.1214 | 0.0776 | 0.0842 | 0.0776 | 0.0566 | 0.0544 | 0.0416 | 0.0485 | 0.0454
27 | dis | 0.0159 | 0.0159 | 0.0130 | 0.0154 | 0.0162 | 0.0146 | 0.0138 | 0.0141 | 0.0127
28 | abalone | 0.4762 | 0.4587 | 0.4472 | 0.4582 | 0.4554 | 0.4633 | 0.4563 | 0.4539 | 0.4554
29 | waveform-5000 | 0.2006 | 0.1844 | 0.1462 | 0.1844 | 0.1650 | 0.1820 | 0.2000 | 0.1598 | 0.1642
30 | phoneme | 0.2615 | 0.2733 | 0.2392 | 0.2345 | 0.2429 | 0.2120 | 0.1984 | 0.1901 | 0.1841
31 | wall-following | 0.1054 | 0.0554 | 0.0370 | 0.0550 | 0.0462 | 0.0462 | 0.0401 | 0.0389 | 0.0295
32 | page-blocks | 0.0619 | 0.0415 | 0.0338 | 0.0418 | 0.0342 | 0.0433 | 0.0391 | 0.0364 | 0.0358
33 | thyroid | 0.1111 | 0.0720 | 0.0701 | 0.0723 | 0.0726 | 0.0693 | 0.0706 | 0.0835 | 0.0669
34 | sign | 0.3586 | 0.2755 | 0.2821 | 0.2752 | 0.2713 | 0.2881 | 0.2539 | 0.2713 | 0.2572
35 | nursery | 0.0973 | 0.0654 | 0.0730 | 0.0654 | 0.0617 | 0.0654 | 0.0289 | 0.0702 | 0.0555
36 | seer_mdl | 0.2379 | 0.2376 | 0.2328 | 0.2374 | 0.2332 | 0.2367 | 0.2555 | 0.2363 | 0.2367
37 | adult | 0.1592 | 0.1380 | 0.1493 | 0.1380 | 0.1326 | 0.1385 | 0.1383 | 0.1382 | 0.1347
38 | localization | 0.4955 | 0.3575 | 0.3596 | 0.3575 | 0.3610 | 0.3706 | 0.2964 | 0.3319 | 0.3112
39 | poker-hand | 0.4988 | 0.3295 | 0.4812 | 0.3295 | 0.0763 | 0.3291 | 0.1961 | 0.0618 | 0.0752
40 | donation | 0.0002 | 0.0000 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0000
Table A2. RMSE results of NB, TAN, AODE, WATAN, TANe, K1DB, K2DB, UK1DB, and UK2DB.
Index | Dataset | NB | TAN | AODE | WATAN | TANe | K1DB | K2DB | UK1DB | UK2DB
1 | contact-lenses | 0.5017 | 0.6077 | 0.5258 | 0.5737 | 0.5736 | 0.5024 | 0.4996 | 0.5136 | 0.5033
2 | lung-cancer | 0.6431 | 0.7623 | 0.6915 | 0.7662 | 0.7069 | 0.7523 | 0.7313 | 0.6942 | 0.7391
3 | post-operative | 0.5103 | 0.5340 | 0.5215 | 0.5358 | 0.5157 | 0.5289 | 0.5632 | 0.5256 | 0.5289
4 | zoo | 0.1623 | 0.1309 | 0.1536 | 0.1341 | 0.1313 | 0.1984 | 0.1815 | 0.1426 | 0.1428
5 | echocardiogram | 0.4896 | 0.4886 | 0.4903 | 0.4890 | 0.4852 | 0.4846 | 0.4889 | 0.4878 | 0.4891
6 | lymphography | 0.3465 | 0.3813 | 0.3556 | 0.3857 | 0.3761 | 0.3726 | 0.4362 | 0.3614 | 0.3571
7 | iris | 0.2545 | 0.2441 | 0.2544 | 0.2435 | 0.2505 | 0.2435 | 0.2447 | 0.2628 | 0.2407
8 | teaching-ae | 0.6204 | 0.6300 | 0.6117 | 0.6242 | 0.6191 | 0.6332 | 0.6286 | 0.6286 | 0.6262
9 | wine | 0.1134 | 0.1746 | 0.1245 | 0.1748 | 0.1583 | 0.1761 | 0.1501 | 0.1355 | 0.1374
10 | autos | 0.5190 | 0.4475 | 0.4397 | 0.4420 | 0.4362 | 0.4460 | 0.4399 | 0.4252 | 0.4385
11 | glass-id | 0.4353 | 0.4109 | 0.4235 | 0.4087 | 0.4036 | 0.4223 | 0.4205 | 0.4179 | 0.4020
12 | hungarian | 0.3667 | 0.3429 | 0.3476 | 0.3418 | 0.3315 | 0.3380 | 0.3552 | 0.3534 | 0.3444
13 | heart-disease-c | 0.3743 | 0.3775 | 0.3659 | 0.3783 | 0.3583 | 0.3810 | 0.3963 | 0.3802 | 0.3877
14 | primary-tumor | 0.7084 | 0.7170 | 0.7155 | 0.7166 | 0.7154 | 0.7190 | 0.7262 | 0.7085 | 0.7048
15 | horse-colic | 0.4209 | 0.4205 | 0.4015 | 0.4215 | 0.3951 | 0.4131 | 0.4348 | 0.4164 | 0.4247
16 | house-votes-84 | 0.2997 | 0.2181 | 0.1994 | 0.2181 | 0.2126 | 0.2235 | 0.1969 | 0.2221 | 0.1779
17 | cylinder-bands | 0.4291 | 0.4358 | 0.4080 | 0.4277 | 0.3973 | 0.4435 | 0.4431 | 0.4077 | 0.4083
18 | balance-scale | 0.4431 | 0.4344 | 0.4350 | 0.4344 | 0.4414 | 0.4384 | 0.4323 | 0.4279 | 0.4286
19 | credit-a | 0.3350 | 0.3415 | 0.3271 | 0.3407 | 0.3300 | 0.3416 | 0.3480 | 0.3336 | 0.3355
20 | pima-ind-diabetes | 0.4147 | 0.4059 | 0.4078 | 0.4059 | 0.4044 | 0.4054 | 0.4074 | 0.4095 | 0.4082
21 | tic-tac-toe | 0.4309 | 0.4023 | 0.3995 | 0.4023 | 0.4216 | 0.4050 | 0.3772 | 0.4134 | 0.3421
22 | german | 0.4204 | 0.4367 | 0.4161 | 0.4373 | 0.4206 | 0.4389 | 0.4665 | 0.4364 | 0.4531
23 | car | 0.3395 | 0.2405 | 0.3022 | 0.2406 | 0.2565 | 0.2404 | 0.2031 | 0.2426 | 0.2358
24 | mfeat-mor | 0.4817 | 0.4657 | 0.4710 | 0.4660 | 0.4686 | 0.4665 | 0.4707 | 0.4673 | 0.4652
25 | hypothyroid | 0.1138 | 0.0955 | 0.1036 | 0.0951 | 0.0933 | 0.0956 | 0.0937 | 0.0931 | 0.0928
26 | kr-vs-kp | 0.3022 | 0.2358 | 0.2638 | 0.2358 | 0.2417 | 0.2159 | 0.1869 | 0.1992 | 0.1866
27 | dis | 0.1177 | 0.1103 | 0.1080 | 0.1098 | 0.1084 | 0.1072 | 0.1024 | 0.1059 | 0.1021
28 | abalone | 0.5871 | 0.5638 | 0.5559 | 0.5637 | 0.5596 | 0.5653 | 0.5646 | 0.5654 | 0.5625
29 | waveform-5000 | 0.4101 | 0.3611 | 0.3257 | 0.3610 | 0.3417 | 0.3618 | 0.3868 | 0.3474 | 0.3568
30 | phoneme | 0.4792 | 0.5048 | 0.4689 | 0.4676 | 0.4796 | 0.4385 | 0.4195 | 0.4055 | 0.4091
31 | wall-following | 0.3083 | 0.2245 | 0.1829 | 0.2223 | 0.1989 | 0.2050 | 0.1930 | 0.1884 | 0.1642
32 | page-blocks | 0.2331 | 0.1894 | 0.1629 | 0.1895 | 0.1646 | 0.1940 | 0.1811 | 0.1739 | 0.1696
33 | thyroid | 0.3143 | 0.2443 | 0.2425 | 0.2431 | 0.2403 | 0.2414 | 0.2423 | 0.2493 | 0.2331
34 | sign | 0.5270 | 0.4615 | 0.4702 | 0.4614 | 0.4682 | 0.4759 | 0.4370 | 0.4581 | 0.4387
35 | nursery | 0.2820 | 0.2194 | 0.2503 | 0.2194 | 0.2252 | 0.2193 | 0.1776 | 0.2177 | 0.2003
36 | seer_mdl | 0.4233 | 0.4131 | 0.4112 | 0.4132 | 0.4071 | 0.4131 | 0.4340 | 0.4214 | 0.4219
37 | adult | 0.3409 | 0.3076 | 0.3245 | 0.3076 | 0.3024 | 0.3071 | 0.3089 | 0.3167 | 0.3132
38 | localization | 0.6776 | 0.5656 | 0.5856 | 0.5656 | 0.5776 | 0.5767 | 0.5106 | 0.5471 | 0.5169
39 | poker-hand | 0.5801 | 0.4987 | 0.5392 | 0.4987 | 0.4390 | 0.4987 | 0.4055 | 0.3736 | 0.3500
40 | donation | 0.0123 | 0.0050 | 0.0114 | 0.0050 | 0.0079 | 0.0050 | 0.0046 | 0.0064 | 0.0055
Table A3. F1-score results of NB, TAN, AODE, WATAN, TANe, K1DB, K2DB, UK1DB, and UK2DB.
Index | Dataset | NB | TAN | AODE | WATAN | TANe | K1DB | K2DB | UK1DB | UK2DB
1 | contact-lenses | 0.4540 | 0.3778 | 0.3856 | 0.3377 | 0.3619 | 0.5748 | 0.6875 | 0.4878 | 0.4878
2 | lung-cancer | 0.5699 | 0.4211 | 0.5030 | 0.3922 | 0.3545 | 0.4163 | 0.4359 | 0.4188 | 0.3799
3 | post-operative | 0.2658 | 0.2981 | 0.3065 | 0.2981 | 0.3025 | 0.3185 | 0.3068 | 0.3154 | 0.2898
4 | zoo | 0.9237 | 0.9948 | 0.9296 | 0.9756 | 0.9824 | 0.8879 | 0.8805 | 0.9622 | 0.9364
5 | echocardiogram | 0.5631 | 0.5406 | 0.5658 | 0.5406 | 0.5406 | 0.5774 | 0.5078 | 0.5563 | 0.5507
6 | lymphography | 0.8720 | 0.7221 | 0.6856 | 0.5614 | 0.5618 | 0.7896 | 0.5281 | 0.8054 | 0.6357
7 | iris | 0.9133 | 0.9200 | 0.9133 | 0.9200 | 0.9133 | 0.9133 | 0.9133 | 0.9133 | 0.9267
8 | teaching-ae | 0.5011 | 0.4515 | 0.5051 | 0.4649 | 0.4832 | 0.4588 | 0.4660 | 0.5263 | 0.5481
9 | wine | 0.9832 | 0.9664 | 0.9780 | 0.9664 | 0.9664 | 0.9606 | 0.9780 | 0.9773 | 0.9832
10 | autos | 0.7825 | 0.8457 | 0.5792 | 0.8482 | 0.8606 | 0.8482 | 0.8596 | 0.8691 | 0.8685
11 | glass-id | 0.7400 | 0.7863 | 0.7564 | 0.7863 | 0.7968 | 0.7807 | 0.7864 | 0.7880 | 0.7994
12 | hungarian | 0.8224 | 0.8115 | 0.8148 | 0.8082 | 0.8232 | 0.8132 | 0.8033 | 0.8215 | 0.8390
13 | heart-disease-c | 0.8169 | 0.7894 | 0.7972 | 0.7926 | 0.8095 | 0.7897 | 0.7738 | 0.8070 | 0.7961
14 | primary-tumor | 0.3185 | 0.3307 | 0.2924 | 0.3139 | 0.2894 | 0.2848 | 0.2891 | 0.2880 | 0.2949
15 | horse-colic | 0.7701 | 0.7730 | 0.7849 | 0.7704 | 0.7781 | 0.7645 | 0.7391 | 0.7784 | 0.7706
16 | house-votes-84 | 0.9021 | 0.9419 | 0.9444 | 0.9444 | 0.9419 | 0.9276 | 0.9468 | 0.9422 | 0.9591
17 | cylinder-bands | 0.7628 | 0.6799 | 0.7979 | 0.7310 | 0.8041 | 0.7570 | 0.7588 | 0.8084 | 0.8088
18 | balance-scale | 0.5051 | 0.5041 | 0.4974 | 0.5041 | 0.5007 | 0.4985 | 0.5020 | 0.5108 | 0.5107
19 | credit-a | 0.8565 | 0.8469 | 0.8586 | 0.8470 | 0.8591 | 0.8424 | 0.8515 | 0.8639 | 0.8607
20 | pima-ind-diabetes | 0.7287 | 0.7317 | 0.7327 | 0.7334 | 0.7290 | 0.7311 | 0.7272 | 0.7319 | 0.7365
21 | tic-tac-toe | 0.6358 | 0.7300 | 0.6847 | 0.7330 | 0.6283 | 0.7131 | 0.7649 | 0.6853 | 0.7825
22 | german | 0.6880 | 0.6647 | 0.6824 | 0.6599 | 0.6569 | 0.6507 | 0.6451 | 0.6578 | 0.6509
23 | car | 0.6607 | 0.9175 | 0.7569 | 0.9175 | 0.8903 | 0.9175 | 0.9354 | 0.8685 | 0.8464
24 | mfeat-mor | 0.6759 | 0.7001 | 0.6797 | 0.6994 | 0.6927 | 0.6988 | 0.6905 | 0.6863 | 0.6874
25 | hypothyroid | 0.9251 | 0.9424 | 0.9299 | 0.9424 | 0.9488 | 0.9409 | 0.9405 | 0.9469 | 0.9447
26 | kr-vs-kp | 0.8782 | 0.9223 | 0.9154 | 0.9223 | 0.9432 | 0.9455 | 0.9583 | 0.9514 | 0.9546
27 | dis | 0.7460 | 0.5674 | 0.7799 | 0.5818 | 0.4959 | 0.6196 | 0.6870 | 0.6610 | 0.7041
28 | abalone | 0.5047 | 0.5367 | 0.5476 | 0.5372 | 0.5338 | 0.5334 | 0.5384 | 0.5396 | 0.5400
29 | waveform-5000 | 0.7886 | 0.8159 | 0.8532 | 0.8159 | 0.8351 | 0.8182 | 0.8002 | 0.8395 | 0.8350
30 | phoneme | 0.6971 | 0.6778 | 0.6551 | 0.7235 | 0.7087 | 0.7838 | 0.7908 | 0.7823 | 0.7781
31 | wall-following | 0.8742 | 0.9333 | 0.9564 | 0.9337 | 0.9445 | 0.9440 | 0.9514 | 0.9526 | 0.9613
32 | page-blocks | 0.7530 | 0.8219 | 0.8324 | 0.8199 | 0.8300 | 0.8130 | 0.8174 | 0.8189 | 0.8317
33 | thyroid | 0.6103 | 0.6065 | 0.6897 | 0.5947 | 0.6108 | 0.6114 | 0.5752 | 0.6503 | 0.6426
34 | sign | 0.6373 | 0.7228 | 0.7163 | 0.7230 | 0.7275 | 0.7099 | 0.7456 | 0.7269 | 0.7412
35 | nursery | 0.5709 | 0.6130 | 0.6047 | 0.6131 | 0.6134 | 0.6131 | 0.7053 | 0.6138 | 0.6509
36 | seer_mdl | 0.7363 | 0.7283 | 0.7364 | 0.7285 | 0.7303 | 0.7284 | 0.7082 | 0.7316 | 0.7292
37 | adult | 0.7986 | 0.8063 | 0.8070 | 0.8063 | 0.8092 | 0.8052 | 0.8028 | 0.8112 | 0.8133
38 | localization | 0.2396 | 0.3955 | 0.3686 | 0.3953 | 0.3707 | 0.3817 | 0.4742 | 0.4060 | 0.4383
39 | poker-hand | 0.0668 | 0.1919 | 0.0789 | 0.1920 | 0.1920 | 0.1912 | 0.2552 | 0.2749 | 0.2826
40 | donation | 0.9872 | 0.9977 | 0.9878 | 0.9977 | 0.9973 | 0.9977 | 0.9983 | 0.9947 | 0.9975

References

  1. Silvia, A.; Luis, M.; Javier, G.C. Learning Bayesian network classifiers: Searching in a space of partially directed acyclic graphs. Mach. Learn. 2005, 59, 213–235. [Google Scholar]
  2. Dagum, P.; Luby, M. Approximating probabilistic inference in Bayesian belief networks is NP-Hard. Artif. Intell. 1993, 60, 141–153. [Google Scholar] [CrossRef]
  3. Lavrac, N. Data mining in medicine: Selected techniques and applications. In Proceedings of the 2nd International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 11–31. [Google Scholar]
  4. Lavrac, N.; Keravnou, E.; Zupan, B. Intelligent data analysis in medicine. Encyclopedia Comput. Sci. Technol. 2000, 42, 113–157. [Google Scholar]
  5. Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artif. Intell. Med. 2001, 23, 89–109. [Google Scholar] [CrossRef]
  6. Androutsopoulos, I.; Koutsias, J.; Chandrinos, K.; Spyropoulos, C. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 160–167. [Google Scholar]
  7. Crawford, E.; Kay, J.; Eric, M. IEMS–The intelligent email sorter. In Proceedings of the 19th International Conference on Machine Learning, Sydney, NSW, Australia, 8–12 July 2002; pp. 83–90. [Google Scholar]
  8. Starr, B.; Ackerman, M.S.; Pazzani, M.J. Do-I-care: A collaborative web agent. In Proceedings of the ACM Conference on Human Factors in Computing Systems, New York, NY, USA, 13–18 April 1996; pp. 273–274. [Google Scholar]
  9. Miyahara, K.; Pazzani, M.J. Collaborative filtering with the simple Bayesian classifier. In Proceedings of the 6th Pacific Rim International Conference on Artificial Intelligence, Melbourne, Australia, 28 August–1 September 2000; pp. 679–689. [Google Scholar]
  10. Mooney, R.J.; Roy, L. Content-based book recommending using learning for text categorization. In Proceedings of the 5th ACM conference on digital libraries, Denver, CO, USA, 6–11 June 2000; pp. 195–204. [Google Scholar]
  11. Bielza, C.; Larranaga, P. Discrete bayesian network classifiers: A survey. ACM Comput. Surv. 2014, 47. [Google Scholar] [CrossRef]
  12. Sahami, M. Learning limited dependence Bayesian classifiers. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 335–338. [Google Scholar]
  13. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; A Wiley-Interscience Publication, Wiley: New York, NY, USA, 1973; ISBN 978-7-1111-2148-0. [Google Scholar]
  14. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef]
  15. Corsten, M.; Papageorgiou, A.; Verhesen, W.; Carai, P.; Lindow, M.; Obad, S.; Summer, G.; Coort, S.; Hazebroek, M.; van Leeuwen, R.; et al. Microrna profiling identifies microrna-155 as an adverse mediator of cardiac injury and dysfunction during acute viral myocarditis. Circulat. Res. 2012, 111, 415–425. [Google Scholar] [CrossRef] [PubMed]
  16. Triguero, I.; Garcia, S.; Herrera, F. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst. 2015, 42, 245–284. [Google Scholar] [CrossRef]
  17. Zhu, X.J.; Goldberg, A.B. Introduction to Semi-Supervised Learning. Synth. Lec. Artif. Intell. Mach. Learn. 2009, 3, 1–130. [Google Scholar] [CrossRef] [Green Version]
  18. Zhu, X.J. Semi-Supervised Learning Literature Survey. In Computer Science Department; University of Wisconsin: Madison, WI, USA, 2008; Volume 37, pp. 63–77. [Google Scholar]
  19. Ioannis, E.L.; Andreas, K.; Vassilis, T.; Panagiotis, P. An Auto-Adjustable Semi-Supervised Self-Training Algorithm. Algorithms 2018, 11, 139. [Google Scholar] [Green Version]
  20. Zhu, X.J. Semi-supervised learning. In Encyclopedia of Machine Learning; Springer: Berlin, Germany, 2011; pp. 892–897. [Google Scholar]
  21. Wang, L.M.; Chen, S.; Mammadov, M. Target Learning: A Novel Framework to Mine Significant Dependencies for Unlabeled Data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Cham, Switzerland, 2018; pp. 106–117. [Google Scholar]
  22. David, M.C.; David, H.; Christopher, M. Large-Sample Learning of Bayesian Networks is NP-Hard. J. Mach. Learn. Res. 2004, 5, 1287–1330. [Google Scholar]
  23. Arias, J.; Gámez, J.A.; Puerta, J.M. Scalable learning of k-dependence bayesian classifiers under mapreduce. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland, 20–22 August 2015; Volume 2, pp. 25–32. [Google Scholar]
  24. David, D.L. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the Machine Learning: ECML-98, Chemnitz, Germany, 21–23 April 1998; pp. 4–15. [Google Scholar]
  25. David, J.H.; Keming, Y. Idiot’s Bayes—Not so stupid after all? Int. Stat. Rev. 2001, 69, 385–398. [Google Scholar]
  26. Kononenko, I. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. Curr. Trend. Knowl. Acquisit. 1990, 11, 414–423. [Google Scholar]
  27. Langley, P.; Sage, S. Induction of selective Bayesian classifiers. In Uncertainty Proceedings 1994; Ramon, L., David, P., Eds.; Morgan Kaufmann: Amsterdam, Holland, 1994; pp. 399–406. ISBN 978-1-5586-0332-5. [Google Scholar]
  28. Pazzani, M.; Billsus, D. Learning and revising user profiles: the identification of interesting web sites. Mach. Learn. 1997, 27, 313–331. [Google Scholar] [CrossRef]
  29. Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Ph.D. Thesis, Waikato University, Waikato, New Zealand, 1998. [Google Scholar]
  30. Kittler, J. Feature selection and extraction. In Handbook of Pattern Recognition and Image Processing; Young, T.Y., Fu, K.S., Eds.; Academic Press: Orlando, FL, USA, 1994; Volume 2, ISBN 0-12-774561-0. [Google Scholar]
  31. Langley, P. Induction of recursive Bayesian classifiers. In Proceedings of the 1993 European conference on machine learning: ECML-93, Vienna, Austria, 5–7 April 1993; pp. 153–164. [Google Scholar]
  32. Hilden, J.; Bjerregaard, B. Computer-aided diagnosis and the atypical case. Decis. Mak. Med. Care 1976, 365–374. [Google Scholar]
  33. Hall, M.A. A decision tree-based attribute weighting filter for naive Bayes. In Proceedings of the International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 15–17 December 2015; pp. 59–70. [Google Scholar]
  34. Ferreira, J.T.A.S.; Denison, D.G.T.; Hand, D.J. Weighted Naive Bayes Modelling for Data Mining. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.1176 (accessed on 15 June 2001).
  35. Kwoh, C.K.; Gillies, D.F. Using hidden nodes in Bayesian networks. Artif. Intell. 1996, 88, 1–38. [Google Scholar] [CrossRef] [Green Version]
  36. Kohavi, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  37. Ying, Y.; Geoffrey, I.W. Discretization for naive-Bayes learning: managing discretization bias and variance. Mach. Learn. 2009, 74, 39–74. [Google Scholar] [CrossRef]
  38. Keogh, E.J.; Pazzani, M.J. Learning the structure of augmented Bayesian classifiers. Int. J. Artif. Intell. Tools 2002, 11, 587–601. [Google Scholar] [CrossRef]
  39. Jiang, L.X.; Cai, Z.H.; Wang, D.H.; Zhang, H. Improving tree augmented naive bayes for class probability estimation. Knowl. Syst. 2012, 26, 239–245. [Google Scholar] [CrossRef]
  40. Ma, S.C.; Shi, H.B. Tree-augmented naive Bayes ensemble. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, Shanghai, China, 26–29 August 2004; pp. 26–29. [Google Scholar]
  41. Webb, G.I.; Janice, R.B.; Zheng, F.; Ting, K.M.; Houssam, S. Learning by extrapolation from marginal to full-multivariate probability distributions: Decreasingly Naive Bayesian classification. Mach. Learn. 2012, 86, 233–272. [Google Scholar] [CrossRef]
  42. Flores, M.J.; Gamez, J.A.; Martinez, A.M.; Puerta, J.M. GAODE and HAODE: Two Proposals based on AODE to Deal with Continuous Variables. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 313–320. [Google Scholar]
  43. Bouckaert, R.R. Voting massive collections of Bayesian Network classifiers for data streams. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, Hobart, TAS, Australia, 4–8 December 2006; Volume 1, pp. 243–252. [Google Scholar]
  44. Rubio, A.; Gamez, J.A. Flexible learning of K-dependence Bayesian Network classifiers. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, Dublin, Ireland, 12–16 July 2011; pp. 1219–1226. [Google Scholar]
  45. Juan, J.R.; Ludmila, I.K. Naive Bayes ensembles with a random oracle. In Proceedings of the 7th International Workshop on Multiple Classifier Systems (MCS-2007), Prague, Czech Republic, 23–25 May 2007; pp. 450–458. [Google Scholar]
  46. Zheng, F.; Webb, G.I.; Pramuditha, S.; Zhu, L.G. Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning. Mach. Learn. 2012, 87, 93–125. [Google Scholar] [CrossRef]
  47. Murphy, P.M.; Aha, D.W. UCI Repository of Machine Learning Databases. 1995. Available online: http://archive.ics.uci.edu/ml/datasets.html (accessed on 1 February 2019).
  48. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1029. [Google Scholar]
  49. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688. [Google Scholar] [CrossRef] [Green Version]
  50. Gianni, A.; Cornelis, J.V.R. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 2002, 20, 357–389. [Google Scholar] [Green Version]
  51. Duan, Z.Y.; Wang, L.M. K-Dependence Bayesian classifier ensemble. Entropy 2017, 19, 651. [Google Scholar] [CrossRef]
  52. Liu, Y.; Wang, L.M.; Sun, M.H. Efficient heuristics for structure learning of k-dependence Bayesian classifier. Entropy 2018, 20, 897. [Google Scholar] [CrossRef]
  53. Demšar, J. Statistical comparisons of classifiers over multiple datasets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. (a) NB; (b) Tree Augmented Bayesian classifier (TAN); (c) k-Dependence Bayesian classifier (KDB) (k = 2) with four attributes.
Figure 2. The scatter plot of UK2DB and K2DB in terms of RMSE.
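For readers unfamiliar with the metric, the following minimal Python sketch shows one common way RMSE is computed for probabilistic classifiers, assuming it is taken over the predicted posterior probability of the true class; the paper may use a different variant, and the function name and toy arrays below are illustrative only.

import numpy as np

def rmse(posteriors, true_labels):
    # posteriors: (N, |C|) array, row i is the estimated P(c | x_i)
    # true_labels: (N,) array of integer class indices
    # Assumes RMSE = sqrt(mean((1 - P_hat(c_true | x))^2)); hypothetical helper.
    p_true = posteriors[np.arange(len(true_labels)), true_labels]
    return np.sqrt(np.mean((1.0 - p_true) ** 2))

# Toy example with two instances of a three-class problem:
post = np.array([[0.7, 0.2, 0.1],
                 [0.3, 0.5, 0.2]])
print(rmse(post, np.array([0, 2])))  # sqrt((0.3**2 + 0.8**2) / 2) ~= 0.604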
Figure 3. Goal Difference (GD(UK1DB; K1DB | T)) in terms of zero-one loss.
Figure 4. GD(UK2DB; K2DB | T) in terms of zero-one loss.
Figure 5. GD(UK1DB; K1DB | H(C)) in terms of zero-one loss.
Figure 6. GD(UK2DB; K2DB | H(C)) in terms of zero-one loss.
Figure 7. Comparison of the relative zero-one loss ratio between UKDB and KDB when k = 1 and k = 2.
Figure 8. Training time of NB, TAN, K1DB, K2DB, AODE, WATAN, TANe, UK1DB, and UK2DB.
Figure 9. Classification time of NB, TAN, K1DB, K2DB, AODE, WATAN, TANe, UK1DB, and UK2DB.
Figure 10. Zero-one loss comparison with the Nemenyi test.
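As a rough guide to how such a critical-difference comparison is obtained (a sketch, not necessarily the authors' exact procedure): the classifiers are ranked on each dataset, their average ranks are compared, and two classifiers differ significantly when their average ranks differ by at least the Nemenyi critical difference CD = q_alpha * sqrt(k(k+1)/(6N)). With k = 9 classifiers, N = 40 datasets, and the tabulated q_0.05 of about 3.102, CD is roughly 1.90.

import math

def nemenyi_cd(num_classifiers, num_datasets, q_alpha):
    # Critical difference of average ranks for the Nemenyi post-hoc test.
    # q_alpha is the tabulated critical value (about 3.102 for 9 classifiers
    # at alpha = 0.05).
    k, n = num_classifiers, num_datasets
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

print(nemenyi_cd(9, 40, 3.102))  # ~= 1.90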
Table 1. Datasets.

Index | Dataset | Instance | Attribute | Class | Index | Dataset | Instance | Attribute | Class
1 | contact-lenses | 24 | 4 | 3 | 21 | tic-tac-toe | 958 | 9 | 2
2 | lung-cancer | 32 | 56 | 3 | 22 | german | 1000 | 20 | 2
3 | post-operative | 90 | 8 | 3 | 23 | car | 1728 | 6 | 4
4 | zoo | 101 | 16 | 7 | 24 | mfeat-mor | 2000 | 6 | 10
5 | echocardiogram | 131 | 6 | 2 | 25 | hypothyroid | 3163 | 25 | 2
6 | lymphography | 148 | 18 | 4 | 26 | kr-vs-kp | 3196 | 36 | 2
7 | iris | 150 | 4 | 3 | 27 | dis | 3772 | 29 | 2
8 | teaching-ae | 151 | 5 | 3 | 28 | abalone | 4177 | 8 | 3
9 | wine | 178 | 13 | 3 | 29 | waveform-5000 | 5000 | 40 | 3
10 | autos | 205 | 25 | 7 | 30 | phoneme | 5438 | 7 | 50
11 | glass-id | 214 | 9 | 3 | 31 | wall-following | 5456 | 24 | 4
12 | hungarian | 294 | 13 | 2 | 32 | page-blocks | 5473 | 10 | 5
13 | heart-disease-c | 303 | 13 | 2 | 33 | thyroid | 9169 | 29 | 20
14 | primary-tumor | 339 | 17 | 22 | 34 | sign | 12,546 | 8 | 3
15 | horse-colic | 368 | 21 | 2 | 35 | nursery | 12,960 | 8 | 5
16 | house-votes-84 | 435 | 16 | 2 | 36 | seer_mdl | 18,962 | 13 | 2
17 | cylinder-bands | 540 | 39 | 2 | 37 | adult | 48,842 | 14 | 2
18 | balance-scale | 625 | 4 | 3 | 38 | localization | 164,860 | 5 | 11
19 | credit-a | 690 | 15 | 2 | 39 | poker-hand | 1,025,010 | 10 | 10
20 | pima-ind-diabetes | 768 | 8 | 2 | 40 | donation | 5,749,132 | 11 | 2
Table 2. Win/Draw/Loss (W/D/L) records of zero-one loss on the 40 datasets. AODE, Averaged One-Dependence Estimators; WATAN, Weighted Averaged Tree-Augmented Naive Bayes; UKDB, k-dependence Bayesian classifier with Universal Target Learning (UTL).

W/D/L | NB | TAN | K1DB | K2DB | AODE | WATAN | TANe | UK1DB
TAN | 20/9/11
K1DB | 22/9/9 | 9/26/5
K2DB | 19/11/10 | 17/13/10 | 15/16/9
AODE | 20/15/5 | 12/15/13 | 12/15/13 | 15/11/14
WATAN | 21/8/11 | 2/36/2 | 5/27/8 | 8/17/15 | 13/14/13
TANe | 21/16/3 | 26/9/5 | 17/17/6 | 13/14/13 | 10/24/6 | 16/22/2
UK1DB | 24/12/4 | 18/15/7 | 21/12/7 | 16/12/12 | 15/18/7 | 19/16/5 | 13/20/7
UK2DB | 26/12/2 | 26/12/2 | 28/8/4 | 28/8/4 | 24/13/3 | 24/14/2 | 18/18/4 | 16/22/2
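A hedged sketch of how each W/D/L cell can be tallied (the paper may additionally require a statistical test before a result counts as a win or loss): for every one of the 40 datasets, the row classifier records a win, draw, or loss against the column classifier according to their zero-one losses. The helper below and its margin parameter are illustrative assumptions, not the authors' exact protocol.

def win_draw_loss(losses_a, losses_b, tol=0.0):
    # losses_a, losses_b: per-dataset zero-one losses of classifiers A and B,
    # in the same dataset order.
    # tol: optional margin inside which a result counts as a draw; the paper
    # may instead require a significance test before counting a win.
    win = draw = loss = 0
    for a, b in zip(losses_a, losses_b):
        if a < b - tol:
            win += 1
        elif a > b + tol:
            loss += 1
        else:
            draw += 1
    return win, draw, loss

# Hypothetical three-dataset example:
print(win_draw_loss([0.10, 0.20, 0.15], [0.12, 0.20, 0.14]))  # (1, 1, 1)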
Table 3. W/D/L of RMSE on 40 datasets.

W/D/L | NB | TAN | K1DB | K2DB | AODE | WATAN | TANe | UK1DB
TAN | 20/14/6
K1DB | 20/17/3 | 6/33/1
K2DB | 18/14/8 | 16/20/4 | 13/22/5
AODE | 20/18/2 | 11/21/8 | 7/23/10 | 13/15/12
WATAN | 20/16/4 | 2/38/0 | 1/35/4 | 4/22/14 | 8/24/8
TANe | 21/15/4 | 10/28/2 | 8/27/5 | 12/15/13 | 8/26/6 | 9/29/2
UK1DB | 19/19/2 | 12/25/3 | 10/26/4 | 10/21/9 | 10/24/6 | 9/28/3 | 9/26/5
UK2DB | 24/14/2 | 17/21/2 | 15/23/2 | 14/23/2 | 24/13/3 | 14/24/2 | 15/21/4 | 19/17/4
Table 4. W/D/L of the F1-score on 40 datasets.

W/D/L | NB | TAN | K1DB | K2DB | AODE | WATAN | TANe | UK1DB
TAN | 12/22/6
K1DB | 15/19/6 | 7/31/2
K2DB | 15/16/9 | 7/29/4 | 6/31/3
AODE | 15/24/1 | 7/28/5 | 7/29/4 | 10/23/7
WATAN | 13/22/5 | 2/34/4 | 7/31/2 | 7/29/4 | 7/26/7
TANe | 15/20/5 | 2/32/6 | 3/29/8 | 6/26/8 | 6/28/6 | 2/33/5
UK1DB | 19/17/4 | 10/27/3 | 8/30/2 | 9/26/5 | 6/30/4 | 10/27/3 | 10/30/0
UK2DB | 24/14/2 | 12/25/3 | 9/26/5 | 8/26/6 | 7/28/5 | 11/27/2 | 19/17/4 | 4/33/3
Table 5. The H(C) values for the 40 datasets.

Index | Dataset | H(C) | Index | Dataset | H(C)
1 | contact-lenses | 1.0536 | 21 | tic-tac-toe | 0.9281
2 | lung-cancer | 1.5522 | 22 | german | 0.8804
3 | post-operative | 0.9679 | 23 | car | 1.2066
4 | zoo | 2.3506 | 24 | mfeat-mor | 3.3210
5 | echocardiogram | 0.9076 | 25 | hypothyroid | 0.2653
6 | lymphography | 1.2725 | 26 | kr-vs-kp | 0.9981
7 | iris | 1.5846 | 27 | dis | 0.1147
8 | teaching-ae | 1.5828 | 28 | abalone | 1.5816
9 | wine | 1.5664 | 29 | waveform-5000 | 1.5850
10 | autos | 2.2846 | 30 | phoneme | 4.7175
11 | glass-id | 1.5645 | 31 | wall-following | 1.7095
12 | hungarian | 0.9579 | 32 | page-blocks | 0.6328
13 | heart-disease-c | 0.9986 | 33 | thyroid | 1.7151
14 | primary-tumor | 3.7054 | 34 | sign | 1.5832
15 | horse-colic | 0.9533 | 35 | nursery | 1.7149
16 | house-votes-84 | 0.9696 | 36 | seer_mdl | 0.9475
17 | cylinder-bands | 0.9888 | 37 | adult | 0.7944
18 | balance-scale | 1.3112 | 38 | localization | 2.7105
19 | credit-a | 0.9911 | 39 | poker-hand | 0.9698
20 | pima-ind-diabetes | 0.9372 | 40 | donation | 0.0348
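The H(C) values in Table 5 are the class entropies of the datasets, in bits. A minimal sketch of the computation follows; the hypothetical labels list stands in for a dataset's class column.

import math
from collections import Counter

def class_entropy(labels):
    # H(C) = -sum_c P(c) * log2 P(c), with P(c) estimated from the class column.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A perfectly balanced three-class column gives log2(3) ~= 1.585,
# close to the 1.5846 reported for iris in Table 5.
print(class_entropy(['a'] * 50 + ['b'] * 50 + ['c'] * 50))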
