Exponential Loss Minimization for Learning Weighted Naive Bayes Classifiers

The naive Bayesian classification method has received significant attention in the field of supervised learning. This method has an unrealistic assumption in that it views all attributes as equally important. Attribute weighting is one of the methods used to alleviate this assumption and consequently improve the performance of the naive Bayes classification. This study, with a focus on nonlinear optimization problems, proposes four attribute weighting methods by minimizing four different loss functions. The proposed loss functions belong to a family of exponential functions that makes the optimization problems more straightforward to solve, provides analytical properties of the trained classifier, and allows for the simple modification of the loss function such that the naive Bayes classifier becomes robust to noisy instances. This research begins with a typical exponential loss which is sensitive to noise and provides a series of its modifications to make naive Bayes classifiers more robust to noisy instances. Based on numerical experiments conducted using 28 datasets from the UCI machine learning repository, we confirmed that the proposed scheme successfully determines optimal attribute weights and improves the classification performance.


I. INTRODUCTION
Based on the Bayesian decision theorem, a Bayesian classifier predicts a test instance as a class that has the highest membership probability. This implies that learning a Bayesian classifier involves estimating the prior and posterior distributions from the training data. When training a Bayesian classifier, to estimate the posterior distributions, knowledge of the relationships among attributes is required. In particular, Bayesian network classifiers, also called Bayesian belief networks [1], require background knowledge in the form of a graph structure. Determining the optimal structure of Bayesian networks is a well-known NP-hard problem.
Among Bayesian classifiers, the naive Bayesian classifier has the simplest structure as it assumes that all attributes in the training data are conditionally independent of each other and equally important in determining the classes. These assumptions prove advantageous for naive Bayes in that it is easy to implement and can be trained with a small dataset. Although naive Bayes has strong assumptions on training data, it has been the focus of many researchers and practitioners because it often performs remarkably well in many domains [2], such as rRNA sequence classification [3], antispam filtering, identifying data correctness in wireless sensor networks [4]-[6], document classification [7], and sentiment classification in e-commerce product reviews [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Sergio Consoli.
Owing to its unrealistic assumptions, naive Bayes sometimes exhibits poor performance. There are several approaches for improving the classification accuracy of naive Bayes [9]. One promising approach is the attribute weighting technique, called weighted naive Bayes, which works by assigning different weights to each attribute to alleviate the assumption that all attributes are equally important. Let w(j) be the weight of the jth attribute. If we restrict w(j) to be either one or zero, then the attribute weighting becomes attribute selection, where one indicates inclusion of the attribute, and zero indicates exclusion of the attribute. Therefore, in the approaches of attribute weighting, the value of w(j) is relaxed to a real number (usually a positive number for interpretation purposes). In this study, we mainly discuss the weighting approach, while still considering attribute selection by allowing w(j) to take the zero value if necessary.
For the weighted naive Bayes to be effective, it is important to assign appropriate weights to each attribute. Attribute weighting methods have been widely studied by researchers and practitioners, and they are divided into three groups: filter, wrapper, and embedded approaches [10]. Given a training dataset, the simplest method to assign attribute weights is to measure the importance of each attribute using external criteria and apply the measured importance directly to each attribute before learning a naive Bayes classifier. Such methods are filter approaches. In filter algorithms, calculating the attribute importance or selecting significant attributes is performed as a pre-processing step. Because this step is separated from the learning algorithms for classification or regression, filter approaches are relatively faster than wrapper and embedded approaches. One method to calculate the importance of an attribute is using the Kullback-Leibler measure for the amount of information [11]. A decision tree is often used to measure the importance of an attribute [12], in which the weight of the jth attribute is calculated as $1/d_j$, where $d_j$ is the depth at which the jth attribute first appears in a fully grown tree. The receiver operating characteristic (ROC) [13] can be a good choice for weight values when training data have a class imbalance problem [14]. Motivated by the fact that the area under the ROC curve (AUC) can be interpreted as the class-separating ability of each attribute, the study in [15] used the AUC scores as attribute weights in the weighted naive Bayes framework. The correlation-based feature selection (CFS) [16] is a widely used method that composes a subset of features based on their correlations. The main assumption in employing CFS is that important features are highly correlated with a class label but almost uncorrelated with other features.
CFS was employed for the weighted naive Bayes in [17], where redundant features are removed with the expectation that the conditional independence assumption of the original naive Bayes can be relaxed. Unlike other weighted naive Bayes research, it imposes the weights when estimating both prior and posterior probabilities. Another correlation-based weighting approach, CFW (correlation-based feature weighting), was proposed in [18]. Similar to CFS, CFW assumes that highly predictive features would be highly correlated with a class label (maximum mutual relevance) but uncorrelated with other attributes (minimum mutual redundancy) for naive Bayes. Mutual information was used to measure each attribute-class correlation and the attribute-attribute correlations.
Although the aforementioned filter approaches provided reasonable attribute weights and successful experimental results, there is no guarantee of improvement on the classification accuracy of naive Bayes because attribute weighting is performed independently of training a classifier. To obtain weights that are optimal with respect to classification performance, wrapper approaches have been proposed. They assign attribute weights or select a subset of attributes in a heuristic optimization manner by utilizing a learning algorithm of interest in the training step with the learning objective of maximizing predictive power. In an earlier study, the selective Bayesian classifier (SBC) [19], a stepwise attribute selection technique, was combined with naive Bayes. To maximize classification accuracy, the algorithm repeatedly inserts important attributes or deletes irrelevant ones. The selective naive Bayes (SNB) algorithm [20] was proposed to overcome the heavy computational complexity of SBC. It first calculates mutual information with the class of each attribute and selects the attributes in descending order of the mutual information. In another wrapper approach for finding optimal attribute weights, the differential evolutionary algorithm was applied to determine the optimal weight values [21].
Wrapper approaches usually require a large amount of computational time because they repeatedly train and evaluate a classifier. Another weakness is that they need a validation set to evaluate a classifier within iterative procedures, which could be a problem with a small amount of data. Embedded approaches are designed to train a classifier and simultaneously assign attribute weights (or select important attributes). A well-known embedded approach is the decision tree [22], [23], which is suitable for a small dataset because the given dataset does not need to be divided into training and validation sets. The main idea of the embedded approaches is to formulate attribute weighting as an optimization problem, in which the objective function is the performance of a classifier and the decision variables are the attribute weights. A weighted naive Bayes based on the gradient-based L-BFGS-M method [24] was proposed by introducing two objective functions, conditional log-likelihood (CLL) and mean square error (MSE) [25], which focus on maximizing the likelihood of data from a probability perspective and minimizing predictive error from a classification perspective, respectively. This method was extended by giving different attribute weights to different class labels [26]. To implement this idea, a weight matrix of size $n_c \times m$ ($n_c$: the number of classes, m: the number of attributes) must be constructed. Its superior performance was demonstrated through benchmark tests with real-world datasets. Another objective function, which maximizes the difference between membership probabilities for correctly classified and misclassified instances, was introduced in [27] for the weighted naive Bayes. This research also proposed different weight vectors for different classes.
In the context of embedded approaches, we propose an exponential loss minimization method to learn weighted naive Bayes classifiers. We investigate the intended characteristic of each loss function, such as robustness to noisy instances, and the linkage between our research and related studies. Beginning with a typical exponential loss of the classification margin, this study sequentially introduces four different loss functions, where each function has its own analytical reasoning. Because the loss functions are evaluated within a linear formulation of weighted naive Bayes, this study only considers binary classification.
VOLUME 10, 2022
The remainder of this paper is organized as follows. Section II reviews the naive Bayesian classification and mathematical formulation of the weighted naive Bayes. In Section III, we propose four different loss functions and their solutions for attribute weights. This section provides their analytical properties and empirical evidence from illustrative experiments. To confirm the performance of the proposed weighted naive Bayes classifiers, numerical experiments are presented in Section IV. Section V concludes the paper and discusses future research directions.

A. NAIVE BAYES CLASSIFIER
A Bayesian classifier predicts the class labels of unknown instances with a corresponding maximum posterior probability. Let m be the number of attributes and y ∈ {−1, 1} be the class label. The predicted class $\hat{y}_i$ of the ith unseen instance $x_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$ is then determined as in (1).
$$\hat{y}_i = \arg\max_{y \in \{-1, 1\}} P(y)\,P(x_i \mid y) \tag{1}$$
Estimating the posterior probability $P(x_i \mid y)$ increases the model complexity and requires a large amount of training data. Therefore, naive Bayes simply assumes that all attributes are conditionally independent. With this assumption, we use (2) instead of (1).
$$\hat{y}_i = \arg\max_{y \in \{-1, 1\}} P(y) \prod_{j=1}^{m} P(x_{ij} \mid y) \tag{2}$$
When the jth attribute is numerical, there are several methods to compute $P(x_{ij} \mid y)$, such as changing the jth attribute to a discrete attribute by applying the histogram method or interpreting it as some other probability distribution. The empirical distribution can be estimated using the kernel density estimation [28]. In this study, we assume that the numerical attribute follows a normal distribution. Equation (3) is used as an estimation of the posterior probability, where $\mu_{(j|y)}$ and $\sigma_{(j|y)}$ are the mean and standard deviation of the jth attribute, respectively, given class label y.
$$P(x_{ij} \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_{(j|y)}} \exp\left(-\frac{(x_{ij} - \mu_{(j|y)})^2}{2\sigma_{(j|y)}^2}\right) \tag{3}$$
For a categorical attribute, the posterior probability is estimated according to the proportion of the number of training instances in class y that take a particular value $x_{ij}$, denoted by $n(x_{ij} \mid y)$, to the total number of instances in class y, $n(y)$. After applying Laplace's correction, the estimated posterior probability is shown in (4), where $n_j$ is the number of categories (cardinality) of the jth attribute.
$$P(x_{ij} \mid y) = \frac{n(x_{ij} \mid y) + 1}{n(y) + n_j} \tag{4}$$
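As a minimal sketch of the two estimates described above, assuming the normal density for numerical attributes and Laplace's correction for categorical ones, the following Python functions compute $P(x_{ij} \mid y)$. The function names and the toy inputs are hypothetical and are not taken from the original study.

```python
import math


def gaussian_likelihood(x, mu, sigma):
    """Normal density used as the estimate of P(x_ij | y) for a numerical attribute."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_likelihood(count_value_in_class, count_class, n_categories):
    """Laplace-corrected estimate of P(x_ij | y) for a categorical attribute."""
    return (count_value_in_class + 1) / (count_class + n_categories)


# Hypothetical toy inputs, for illustration only.
print(gaussian_likelihood(0.0, 0.0, 1.0))
print(laplace_likelihood(3, 10, 4))
# A category never observed in class y still receives a non-zero probability.
print(laplace_likelihood(0, 10, 4))
```

The correction matters because a single zero estimate would otherwise force the whole product over attributes to zero.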

B. WEIGHTED NAIVE BAYES CLASSIFIER
Since the assumption of equally important attributes is rarely satisfied in practice, several studies have attempted to relax this assumption [9]. Attribute weighting is one of the most frequently used relaxation methods. The underlying principle of weighted naive Bayesian classification is that some attributes are more (or less) important than others in a classification task. This leads to the modification of (2), which is given in (5).
$$\hat{y}_i = \arg\max_{y \in \{-1, 1\}} P(y) \prod_{j=1}^{m} P_{w(j)}(x_{ij} \mid y) \tag{5}$$
where $P_{w(j)}(x_{ij} \mid y)$ represents the weighted posterior probability, which is usually defined in an exponential form as in (6).
$$P_{w(j)}(x_{ij} \mid y) = P(x_{ij} \mid y)^{w(j)} \tag{6}$$
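Note that the weight of the jth attribute, w(j), representing the significance of the attribute, can be any positive number. If it is restricted to take either 0 or 1, (6) is reduced to an attribute selection problem. In the binary class case, which is the main focus of this study, (5) can be described by an "if ... then ..." rule, as in (7).
$$\hat{y}_i = \begin{cases} 1 & \text{if } P(y = 1) \prod_{j=1}^{m} P(x_{ij} \mid y = 1)^{w(j)} \ge P(y = -1) \prod_{j=1}^{m} P(x_{ij} \mid y = -1)^{w(j)}, \\ -1 & \text{otherwise} \end{cases} \tag{7}$$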
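Using (6) and taking the logarithm, the weighted naive Bayes becomes a linear function of the weights, as shown in (8), with the classification rule $\hat{y}_i = \operatorname{sign}(f(x_i))$.
$$f(x_i) = P_0 + wP_{x_i} \tag{8}$$
where $P_0 = \log\frac{P(y = 1)}{P(y = -1)}$, $w = [w(1), \ldots, w(m)]$, and $P_{x_i} = \left[\log\frac{P(x_{i1} \mid y = 1)}{P(x_{i1} \mid y = -1)}, \ldots, \log\frac{P(x_{im} \mid y = 1)}{P(x_{im} \mid y = -1)}\right]^{\top}$.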
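The equivalence between the product form of the weighted rule and its linear form can be checked numerically. The sketch below assumes that $P_0$ and $P_{x_i}$ are the log ratios of the class priors and of the attribute-wise probabilities, and that ties are assigned to the positive class; the helper names and the toy probabilities are hypothetical.

```python
import math


def log_ratio_terms(prior_pos, prior_neg, lik_pos, lik_neg):
    """Return P_0 and the vector P_x of per-attribute log ratios."""
    p0 = math.log(prior_pos / prior_neg)
    px = [math.log(a / b) for a, b in zip(lik_pos, lik_neg)]
    return p0, px


def decision_value(p0, px, w):
    """Linear form f(x) = P_0 + sum_j w(j) * P_x[j]."""
    return p0 + sum(wj * pj for wj, pj in zip(w, px))


def weighted_score(prior, lik, w):
    """Product form P(y) * prod_j P(x_j | y) ** w(j)."""
    score = prior
    for p, wj in zip(lik, w):
        score *= p ** wj
    return score


# Hypothetical estimates for one instance with two attributes.
prior_pos, prior_neg = 0.4, 0.6
lik_pos, lik_neg = [0.8, 0.1], [0.2, 0.4]
w = [1.0, 0.5]

p0, px = log_ratio_terms(prior_pos, prior_neg, lik_pos, lik_neg)
f = decision_value(p0, px, w)
score_pos = weighted_score(prior_pos, lik_pos, w)
score_neg = weighted_score(prior_neg, lik_neg, w)

label_linear = 1 if f >= 0 else -1               # assumed tie handling
label_product = 1 if score_pos >= score_neg else -1
print(f, label_linear, label_product)
```

Because $f(x_i)$ equals the logarithm of the ratio of the two weighted scores, both rules return the same label, and the weights enter only linearly.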

III. PROPOSED METHOD
In this section, we propose an optimization approach for learning weighted naive Bayes classifiers. The optimization problem is formulated on the premise that training a weighted naive Bayes classifier involves finding optimal attribute weights that maximize the overall accuracy of the classifier. To this end, we first introduce a typical exponential loss of classification margin and then sequentially introduce three other loss functions, namely, binomial deviance, modified binomial deviance, and generalized binomial deviance loss functions.
With all the loss functions, the optimization problems can be solved using gradient-based nonlinear optimization methods, such as quasi-Newton methods [29].

A. EXPONENTIAL LOSS MINIMIZATION
To obtain the optimal weights that maximize the classification accuracy, an exponential loss can be used. With the given ith instance $x_i$ and its class label $y_i \in \{-1, 1\}$, the exponential loss $L_{exp}(x_i, y_i)$ is defined as an exponential form of the classification margin, as shown in (9).
$$L_{exp}(x_i, y_i) = \exp\left(-y_i f(x_i)\right) \tag{9}$$
With a training set of n instances, the total exponential loss is given as a summation of (9). One of the proposed weighting methods, namely exponential naive Bayes (ENB), is defined in (10), which is the solution for minimizing the total exponential loss. Once we find the optimal weights, we can use (8) to classify new instances.
$$w_{ENB} = \arg\min_{w} \sum_{i=1}^{n} \exp\left(-y_i (P_0 + wP_{x_i})\right) \tag{10}$$
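The validity of the ENB can be demonstrated as follows. It is straightforward to show that (10) is equivalent to (11), because $y_i \in \{-1, 1\}$ splits the summation into the positive and negative instances.
$$w_{ENB} = \arg\min_{w} \left(T_{FNR} + T_{FPR}\right), \quad T_{FNR} = \sum_{i:\,y_i = 1} \exp\left(-(P_0 + wP_{x_i})\right), \quad T_{FPR} = \sum_{i:\,y_i = -1} \exp\left(P_0 + wP_{x_i}\right) \tag{11}$$
It is obvious that both $T_{FNR}$ and $T_{FPR}$ are non-negative and measure incorrect classification of instances. If a positive instance is misclassified as a negative class, $T_{FNR}$ increases. Likewise, if a negative instance is incorrectly predicted, $T_{FPR}$ increases. Hence, the optimal attribute weights of $w_{ENB}$ simultaneously minimize the "false negative rate" and "false positive rate" of a classifier.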
As an ideal case, if infinite training instances are given, it is known that the minimization of exponential loss is equivalent to the minimization of binomial deviance loss [30], which is given in (12).
$$L_{dev}(x_i, y_i) = \log\left(1 + \exp\left(-2 y_i f(x_i)\right)\right) \tag{12}$$
However, with a finite training dataset, the result of binomial deviance loss minimization generally differs from that of exponential loss minimization in terms of scale.
As $y_i f(x_i) \to -\infty$, the value of (9) increases exponentially, whereas that of (12) increases almost linearly. Therefore, we can expect that $L_{dev}$ is less sensitive than $L_{exp}$ to noisy instances, examples of which are instances that are difficult to classify. Our second proposed model, namely deviance naive Bayes (DNB), is given in (13). We can train a weighted naive Bayes classifier by minimizing the total binomial deviance loss.
$$w_{DNB} = \arg\min_{w} \sum_{i=1}^{n} \log\left(1 + \exp\left(-2 y_i (P_0 + wP_{x_i})\right)\right) \tag{13}$$
Observing that the loss of $L_{dev}$ increases more slowly than that of $L_{exp}$ as the classification margin becomes more negative, we additionally propose to remove the constant multiplier 2 from the classification margin $y_i(P_0 + wP_{x_i})$ in (13) to further reduce the loss increase. We believe that this modification would make the trained weighted naive Bayes classifier less sensitive to noisy instances. We name it the log-likelihood naive Bayes (LNB), which is shown in (14).
$$w_{LNB} = \arg\min_{w} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (P_0 + wP_{x_i})\right)\right) \tag{14}$$
This simple modification explains why the solution to the optimization problem in (14) is reasonable, as shown in the following theorem.
Theorem 1: $w_{LNB}$ is the maximum likelihood estimator of the attribute weights.
Proof: By converting (14) into a maximization problem and removing the logarithm, we obtain (15).
(15) can be rewritten as (16), resulting in the product of two groups of terms, where each group of terms corresponds to one of the classes. By expanding (16), we obtain (17). The last term in (17) contains a product of the posterior probabilities for all of the training instances. Therefore, $w_{LNB}$ is the maximum likelihood estimator of the attribute weights. By Theorem 1, we can deduce that the LNB is equivalent to the CLL in [25].
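One way to spell out the steps of this argument is sketched below. It assumes that $P_0$ and $P_{x_i}$ are the log ratios of the class priors and of the attribute-wise probabilities, so that $f(x_i)$ is a difference of two logarithms. The shorthand $A_i$ and $B_i$ is introduced here only for illustration and does not appear in the original.

```latex
% Shorthand (an assumption of this sketch):
%   A_i = P(y = 1)  \prod_j P(x_{ij} | y = 1)^{w(j)},
%   B_i = P(y = -1) \prod_j P(x_{ij} | y = -1)^{w(j)}.
\begin{align*}
f(x_i) &= P_0 + wP_{x_i} = \log A_i - \log B_i, \\
\frac{1}{1 + \exp(-f(x_i))} &= \frac{A_i}{A_i + B_i}, \qquad
\frac{1}{1 + \exp(f(x_i))} = \frac{B_i}{A_i + B_i}, \\
\arg\min_{w} \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i f(x_i))\bigr)
&= \arg\max_{w} \prod_{i=1}^{n} \frac{1}{1 + \exp(-y_i f(x_i))} \\
&= \arg\max_{w} \prod_{i:\,y_i = 1} \frac{A_i}{A_i + B_i}
   \prod_{i:\,y_i = -1} \frac{B_i}{A_i + B_i}.
\end{align*}
```

Each factor in the last line is the weighted naive Bayes membership probability of the true class of an instance, so the product is the conditional likelihood of the training labels under the weights $w$.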
Further, we focus on the multiplier of the classification margin $y_i(P_0 + wP_{x_i})$ in (13). Because we changed the multiplier of the margin from −2 to −1 for the robustness of a classifier to noisy instances, we can apply a generalized multiplier, namely −α, and let the nonlinear optimization routine find its best value according to a given training set. Thus, it is reasonable to interpret the multiplier as a tuning parameter for the noise level in the data. By introducing −α, the generalized DNB (GDNB) is given in (18).
$$w_{GDNB} = \arg\min_{w} \sum_{i=1}^{n} \log\left(1 + \exp\left(-\alpha y_i (P_0 + wP_{x_i})\right)\right) \tag{18}$$
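With this definition, DNB and LNB can be seen as special cases of GDNB with α = 2 and α = 1, respectively. α is a tuning parameter that controls the penalty of misclassified instances when evaluating the loss function, which means that if α is high, the classifier is likely to focus more on outliers or noisy instances. Because the noise level would be different for different data, its optimal value should be determined such that it maximizes the classification performance of a trained classifier. To find the optimal attribute weights and optimal tuning parameter α simultaneously, (18) is reformulated as (19).
$$\arg\min_{w, \alpha} \sum_{i=1}^{n} \log\left(1 + \exp\left(-\alpha y_i (P_0 + wP_{x_i})\right)\right) = \arg\min_{w_0, w} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (w_0P_0 + wP_{x_i})\right)\right) \tag{19}$$
where, on the right-most side, $w_0 = \alpha$ and $\alpha w$ is rewritten as $w$.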
From the right-most side of (19), we observe that finding α is equivalent to assigning weights to not only the posterior probabilities, $P(x_{ij} \mid y)$, but also the prior probability, $P(y)$. In the context in which attribute weights indicate the degree of importance of the attributes, $w_0$ can be interpreted as the confidence in the prior information. With GDNB, we need to modify the classification rule in (8), as shown in (20).
$$f(x_i) = w_0P_0 + wP_{x_i} \tag{20}$$
In addition to the above loss functions, we introduce nonnegative constraints, which are shown in (21), for all four sets of attribute weights. The constraints are intended to make the proposed methods interpretable because a negative weight cannot be interpreted in terms of importance.
$$w(j) \ge 0, \quad j = 1, \ldots, m \tag{21}$$
By minimizing the proposed loss functions with respect to (10), (13), (14), and (19) [24], where p is the control parameter for memory allocation. In this study, we set p = 5 as recommended in [24]. The proposed loss functions defined by the summation over training instances should be evaluated in each iteration of the L-BFGS-B update. Therefore, the computational complexity of one iteration in the proposed algorithm becomes $O(p^2 mn)$.
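To illustrate how the nonnegative weights can be learned, the following sketch minimizes the deviance-type loss with a fixed multiplier α (α = 2 for DNB, α = 1 for LNB). It deliberately swaps in plain projected gradient descent for the L-BFGS-B routine, so it is not the optimizer used in the study. The function names, step size, iteration count, and toy log-ratio vectors are all hypothetical.

```python
import math


def total_loss_and_grad(w, p0, P, y, alpha=1.0):
    """Sum of log(1 + exp(-alpha * y_i * f(x_i))) and its gradient with respect to w."""
    loss, grad = 0.0, [0.0] * len(w)
    for px, yi in zip(P, y):
        f = p0 + sum(wj * pj for wj, pj in zip(w, px))
        z = -alpha * yi * f
        # Numerically stable log(1 + exp(z)) and sigmoid(z).
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        if z >= 0:
            s = 1.0 / (1.0 + math.exp(-z))
        else:
            s = math.exp(z) / (1.0 + math.exp(z))
        for j, pj in enumerate(px):
            grad[j] += s * (-alpha * yi * pj)
    return loss, grad


def fit_weights(p0, P, y, alpha=1.0, step=0.1, iters=200):
    """Projected gradient descent: take a gradient step, then clip weights at zero."""
    w = [1.0] * len(P[0])  # start from the standard NB (all weights equal to one)
    for _ in range(iters):
        _, g = total_loss_and_grad(w, p0, P, y, alpha)
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, g)]
    return w


# Hypothetical log-ratio vectors P_{x_i} for four training instances.
P = [[2.0, -1.0], [1.5, 1.0], [-2.0, 1.0], [-1.5, -1.0]]
y = [1, 1, -1, -1]
w_lnb = fit_weights(0.0, P, y, alpha=1.0)
print(w_lnb)
```

Clipping at zero after each step is the simplest way to respect the nonnegativity constraints; a bound-constrained quasi-Newton method handles the same bounds with far fewer iterations.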

B. COMPARISON OF LOSS FUNCTIONS
This subsection compares and interprets the loss functions using an illustrative example. Figure 1 depicts three loss functions, namely, exponential, binomial deviance, and modified binomial deviance loss functions, at varying classification margins $yf$; all of them monotonically decrease and have nonnegative values. The exponential loss is always greater than the other loss functions and increases exponentially in the negative margin, whereas the others increase less rapidly. This means that when noisy instances, which are difficult to classify, are included in a training set, it is expected that a classifier trained by minimizing the exponential loss would be more affected by those instances. Therefore, an overfitting problem could possibly occur because their losses account for a large part of the total loss that an employed nonlinear optimizer attempts to reduce.
The value of the binomial deviance is higher and lower than that of the modified binomial deviance at the negative and positive margins, respectively. This implies that the modified binomial deviance is less sensitive to misclassified instances than others, while it assigns a larger amount of loss to misclassified instances than to correctly classified instances; therefore, the less-steep losses are expected to be more robust to noisy data, as intended.
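The ordering of the three losses can be verified at a few margins. This sketch assumes the unscaled forms exp(−yf), log(1 + exp(−2yf)), and log(1 + exp(−yf)); whether Figure 1 applies any additional scaling is not known from the text, and the function names are hypothetical.

```python
import math


def exponential_loss(margin):
    """Exponential loss exp(-margin)."""
    return math.exp(-margin)


def generalized_deviance_loss(margin, alpha):
    """log(1 + exp(-alpha * margin)); alpha=2 is the binomial deviance, alpha=1 the modified one."""
    z = -alpha * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


for margin in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(
        margin,
        exponential_loss(margin),
        generalized_deviance_loss(margin, 2.0),
        generalized_deviance_loss(margin, 1.0),
    )
```

At negative margins the α = 2 loss exceeds the α = 1 loss, at positive margins the order reverses, and the exponential loss stays above both.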
To show the difference among the loss functions, we generated an illustrative 2D example consisting of two classes that are easily separable but have a few opposite-class instances (marked 1 to 5 in Figure 2(a)) in each of the class regions. Hence, we intentionally generated instances that would be incorrectly classified to observe their losses. We then trained four naive Bayes classifiers (NB, ENB, DNB, and LNB) from the synthetic dataset. We did not include GDNB because its result would be similar to that of DNB or LNB for this small and simple example. Figure 2(a) depicts the decision boundaries, among which that of the standard NB is distinct from the others. The instances marked 1 to 5 were misclassified by all the decision boundaries. Although the classification results are the same, the loss values of the instances varied across methods. More specifically, ENB showed a significantly different distribution of instance losses in comparison with DNB and LNB. Figure 2(b) shows the loss values of each training instance, which were normalized by the total loss such that the sum of the normalized losses became 1, for an effective comparison. The misclassified instances were marked from 1 to 5. Note that the loss values are the results after the optimization procedure, which implies that they are the already-minimized values. As can be seen from the figure, the misclassified instances of LNB and DNB account for more than 70% of the total loss, whereas those of ENB account for less than 60%. This implies that ENB is more focused on instances 1 to 5 than DNB and LNB because, to minimize the total loss, it was beneficial to reduce the losses of the noisy instances more than those of other well-classified instances. This example shows that our reasoning regarding the robustness to noise in sequentially introducing the loss functions is valid. We provide more empirical evidence for the robustness in Section IV-C.

IV. NUMERICAL EXPERIMENTS

A. EXPERIMENTAL SETTING
To confirm the performance of the proposed weighted naive Bayes, numerical experiments were conducted based on 28 real datasets from the University of California, Irvine (UCI) machine learning repository [31]. The datasets are listed in Table 1. They were collected to evaluate classifiers in various circumstances in terms of the percentage of minority class instances, total number of instances, and total number of attributes. The datasets are sorted in ascending order of minority ratio; the dataset "Nursery" is the most imbalanced case (2.53%), whereas the dataset "Breast Cancer" has the most balanced class distribution (37.26%). Each dataset contains different numbers of instances ranging from 151 to 28,056 and different numbers of attributes, ranging from 4 to 64. Most datasets consist of either numerical or categorical attributes. Some datasets, such as "Chess," have both. To focus on binary classification, the datasets with more than two classes were converted into binary class cases by assigning one class as a minority class and integrating the other classes into a majority class.

B. GENERAL PERFORMANCE
The experimental results are summarized in Table 2, which records the accuracy and relative improvement rates for each combination of a classifier and a dataset. The relative improvement rate (RI) in parentheses is calculated as (Accuracy(·) − Accuracy(NB))/Accuracy(NB) × 100(%), where Accuracy(·) is the accuracy of each method and Accuracy(NB) is the accuracy of the standard NB, to determine how much improvement can be achieved by the weighted NB methods over the standard NB. The average RIs are shown in the last row of the table. Every experiment involved 10-fold cross-validation; thus, the accuracy reported in the table is the average of 10 accuracy values. The underlined numbers indicate the best (boldface) or worst (italic) accuracy for each dataset.
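The relative improvement rate is a one-line computation. The sketch below uses hypothetical accuracies (80% for NB and 88% for a weighted method) that are not taken from Table 2.

```python
def relative_improvement(accuracy, accuracy_nb):
    """Relative improvement rate (%) of a method over the standard NB."""
    return (accuracy - accuracy_nb) / accuracy_nb * 100.0


# Hypothetical accuracies in percent, for illustration only.
ri = relative_improvement(88.0, 80.0)
print(round(ri, 2))
```

A negative value indicates that the weighted method performed worse than the standard NB on that dataset.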
In three datasets ("Steel Plate Faults," "Image Segmentation," and "Wholesale Customers"), the attribute weighting methods achieved a significant improvement. For example, while NB correctly classified 66.66% of instances in the "Steel Plate Faults" dataset, the proposed DNB showed an accuracy of 93.6%, resulting in a 40.43% improvement. The correlation heatmaps of the three datasets are shown in Figures 4, 5, and 6. It is clear that the attributes in those datasets are highly correlated, which implies that some of the attributes are redundant for classification. From this example, we can see that the proposed method can improve the classification performance by attribute weighting for highly correlated data. In fact, compared with NB, at least one of the proposed methods showed better accuracy for all datasets except for "Wall-Following Robot Navigation Data (4L)." All other attribute weighting methods also performed worse than NB for this dataset. Considering that this dataset has the smallest number of attributes, it may be unnecessary to apply attribute weighting to the dataset. For the proposed methods, the attribute selection rate was additionally computed as the number of attributes with non-zero weights divided by the total number of attributes in each dataset. As can be seen from Table 3, the selection rates are similar across the methods but different across the datasets. Attributes were dropped by zero weights in most cases, while there was no attribute selection in four datasets (No. 1, 2, 16, and 24). In particular, we observed that only about 20% of attributes contributed to the classification task for the "Multiple Attribute (mfeatzer)" dataset (No. 11).
From Table 2, we can observe that the attribute weighting methods generally outperformed the standard NB. Next, we compare the weighted naive Bayes methods by only considering the RI values. As can be seen from the last row of the table, the proposed methods showed the largest improvement over the standard NB, 10.65% on average. The average RI for the proposed methods was computed by taking the largest RI value from each row in the last four columns (ENB, DNB, LNB, and GDNB) and averaging them because this study provides four options to perform attribute weighting. The proposed methods were followed by MSE (10.26%), CBFWNB (9.83%), SBC (8.82%), DFWNB (8.29%), KLNB (6.18%), and TreeNB (5.99%) in order of performance. Note that KLNB, TreeNB, DFWNB, and CBFWNB are filter-based methods, SBC is a wrapper method, and MSE is one of the embedded methods, as are the proposed methods. These performance results are consistent with previous works showing that it is more desirable to find the attribute weights with simultaneous consideration of classification performance, as discussed in the introduction section.
Other observations from the experiments were as follows: In the severely class-imbalanced datasets ("Nursery," "Car Evaluation," and "Letter Recognition"), the filter and wrapper methods (KLNB, TreeNB, and SBC) showed even worse performances than the standard NB. Among the proposed methods, DNB, LNB, and GDNB showed similar performances, and ENB performed worse. This implies that it would be preferable to apply a logarithm in a loss function instead of using a pure exponential loss, and that our logical derivation from ENB to GDNB is reasonable. It is notable that MSE recorded the highest accuracy in 8 datasets, which means that minimizing the squared error is still a promising method to train a weighted naive Bayes classifier.
We conducted the Wilcoxon signed-ranks test [32] on the experimental results to verify the generalization performance of the proposed methods, and the test results are summarized in Table 4. Each cell contains the p-value for the hypothesis that the classification performance of a row classifier is different from that of a column classifier. The p-values below the chosen significance level of 0.05 are underlined. Boldface means that a row classifier outperformed a column classifier, while italic implies the opposite. Table 4 shows that the GDNB outperformed every benchmark classifier except for CBFWNB, MSE, and LNB.

C. ROBUSTNESS TEST
This subsection evaluates the robustness of the proposed methods. From one point of view, a robust classifier refers to a classifier that is minimally influenced by outliers in the training data. In this experiment, noise data were intentionally added to a training set to observe the degradation of the classification performance. If the degradation is small, we can consider the classifier to be robust.
As in the previous experiments, 10-fold cross-validation was employed. After extracting p% (noise rate) random instances from the folded training set, they were converted into noisy instances by changing their class labels. This means that we generated class noise. After learning a classifier with the noisy training set, the classification accuracy was measured using the remaining fold, which is a noise-free test set. This process was repeated 10 times per fold to reduce the uncertainty caused by random sampling. We considered the noise rates (p) from 0% to 40% with a 1% increment.
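The class-noise step can be sketched as follows, assuming labels in {−1, 1} and a noise rate given as a fraction. The function name and the rounding of the number of noisy instances are assumptions made for this illustration.

```python
import random


def flip_labels(y, noise_rate, rng):
    """Return a copy of y in which round(noise_rate * len(y)) randomly chosen labels are flipped."""
    n_noisy = round(noise_rate * len(y))
    noisy_idx = rng.sample(range(len(y)), n_noisy)
    y_noisy = list(y)
    for i in noisy_idx:
        y_noisy[i] = -y_noisy[i]  # binary labels in {-1, 1}
    return y_noisy


# Hypothetical training labels: 50 positive and 50 negative instances.
y_clean = [1] * 50 + [-1] * 50
y_noisy = flip_labels(y_clean, 0.10, random.Random(0))
print(sum(a != b for a, b in zip(y_clean, y_noisy)))
```

Only the training labels are corrupted in this way; the held-out fold keeps its original labels so that accuracy is measured on noise-free data.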
The results are shown in Figure 7. Each graph in the figure represents the change in the average accuracy at varying noise rates. NB was used as a baseline, and the four proposed methods were included in the comparison. As the noise level increases, the predictive accuracy of all classifiers tends to decrease. In general, LNB and GDNB showed the best performance and the least degradation in performance with increasing noise rate, whereas the standard NB performed the worst and showed a rapid decrease in accuracy as more noisy instances were included. Among the proposed methods, ENB appeared to be the most sensitive to noise. In the cases of (b), (c), and (e), although the proposed methods showed similar performances at the zero noise rate, the prediction accuracy of ENB remarkably decreased with a slight increase in noise.  These results support our concept of classifier robustness when introducing DNB, LNB, and GDNB in Section III.
We conducted another simulation study for the robustness to the dimensionality of data. It began with a two-dimensional separable dataset where one attribute ($X_{A1}$) was generated from $N(0, 2^2)$ and the other attribute ($X_{B1}$) was generated from $X_{B1} \mid Y = 1 \sim N(-1.5, 0.5^2)$ and $X_{B1} \mid Y = -1 \sim N(1.5, 0.5^2)$. It is clear that $X_{B1}$, not $X_{A1}$, solely contributes to the classification of the two classes. In order to increase the dimension of a dataset while preserving the separability, we generated the ith set of attributes from $X_{A,i} \sim N(X_{A,i-1}, 0.1^2)$ and $X_{B,i} \sim N(X_{B,i-1}, 0.1^2)$ and concatenated it to the previously generated sets (1st to (i−1)th). Notice that is important to the classification, if the ratio is at least greater than 0.5 or ideally close to 1, we can confirm that the proposed methods are robust to the dimensionality of data. Figure 8 shows the results verifying the robustness. For each of the proposed methods, we reported the test accuracy (black solid line) and the weight ratio (red dashed line). As can be seen from the figure, the test accuracy was very close to 1 and the weight ratio ranged from 0.99 to 1 across all dimensions. This means that the attribute weights of the $X_A$ attributes were extremely small and the classification performance remained almost perfect as the dimensionality increased.
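The data-generating procedure for this simulation can be sketched as below. The equal class probabilities, the interleaved column order (A1, B1, A2, B2, and so on), and the function name are assumptions of this sketch, since the text does not specify them.

```python
import random


def make_dataset(n_instances, n_sets, rng):
    """Generate n_sets pairs of (X_A, X_B) attributes per instance; only X_B depends on the class."""
    X, y = [], []
    for _ in range(n_instances):
        label = rng.choice([1, -1])              # assumed balanced classes
        a = rng.gauss(0.0, 2.0)                  # X_A1 ~ N(0, 2^2)
        b = rng.gauss(-1.5 if label == 1 else 1.5, 0.5)  # X_B1 | Y
        row = [a, b]
        for _ in range(n_sets - 1):
            a = rng.gauss(a, 0.1)                # X_A,i ~ N(X_A,i-1, 0.1^2)
            b = rng.gauss(b, 0.1)                # X_B,i ~ N(X_B,i-1, 0.1^2)
            row.extend([a, b])
        X.append(row)
        y.append(label)
    return X, y


X, y = make_dataset(200, 5, random.Random(0))
print(len(X), len(X[0]))
```

Each added pair is a slightly perturbed copy of the previous pair, so the dimensionality grows while the class information remains confined to the $X_B$ columns.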

D. MULTI-CLASS PROBLEMS
We have shown the generalization performance of the proposed methods for binary classification problems. We now apply the proposed methods to several multi-class datasets because many real-world problems require the separation of more than two classes. As mentioned in Section I, our method was designed mainly for the binary case owing to the underlying principle of the linear formulation for the weighted naive Bayes. Nonetheless, it can be easily generalized to multi-class classifiers using decomposition strategies [33].
We can decompose a multi-class problem into multiple binary sub-problems. The decomposition strategy has two approaches: one-against-all (OAA) and one-against-one (OAO). The OAA approach trains $k$ classifiers ($k$ is the number of classes), each of which discriminates a specific class from all the other classes. This approach has the advantage over the OAO approach that fewer classifiers are trained. However, the class imbalance problem, which degrades accuracy, is inevitable. In the OAO approach, a classifier is trained only for a pair of classes ($i$ and $j$, $i \neq j$). It therefore decomposes a multi-class problem into ${}_{k}C_{2} = k(k-1)/2$ sub-problems. Compared to the OAA approach, this approach trains more classifiers, and the number of classifiers increases rapidly with the number of classes. The small training set available to each classifier is another problem. In this study, we employed the OAO approach because its disadvantages are less likely to appear owing to the simple structure of naive Bayes.
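The difference between the two decomposition strategies can be illustrated with a short sketch that only enumerates the binary sub-problems, without training anything. The helper names `oaa_tasks` and `oao_tasks` are hypothetical and introduced here purely for illustration.

```python
from itertools import combinations


def oaa_tasks(classes):
    """One-against-all: one sub-problem per class (that class vs. the rest)."""
    classes = list(classes)
    return [(c, [o for o in classes if o != c]) for c in classes]


def oao_tasks(classes):
    """One-against-one: one sub-problem per unordered pair of distinct classes."""
    return list(combinations(classes, 2))


classes = ["c1", "c2", "c3", "c4"]
print(len(oaa_tasks(classes)))  # k sub-problems
print(len(oao_tasks(classes)))  # k(k-1)/2 sub-problems
```

For ten classes, this enumeration gives 10 OAA sub-problems and 45 OAO sub-problems, which shows how quickly the OAO count grows with the number of classes.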
The OAO approach uses the majority voting scheme to classify a test example [34]. The classification scheme with $k$ classes is described below. Let $f_{ij}$ be the classifier learned by the proposed method from a dataset containing only the classes $c_i$ and $c_j$ ($1 \le i, j \le k$, $i \neq j$). The corresponding sub-classifier, $p_{ij}(x)$, returns the probability that a test example $x$ belongs to the class $c_i$. For a test example $x_l$, the set of sub-classifiers, $P$, is defined by

$$P = \begin{bmatrix} 0 & p_{12}(x_l) & \cdots & p_{1k}(x_l) \\ p_{21}(x_l) & 0 & \cdots & p_{2k}(x_l) \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1}(x_l) & p_{k2}(x_l) & \cdots & 0 \end{bmatrix}$$

Notice that $p_{ij} = 1 - p_{ji}$ and that $P$ consists of ${}_{k}C_{2}$ sub-classifiers. The diagonal elements of $P$ are all zeros. The classification rule is shown below.

$$\hat{y}(x_l) = c_{i^*}, \qquad i^* = \arg\max_{1 \le i \le k} p_{i\cdot}$$

where $p_{i\cdot} = \sum_{j=1}^{k} p_{ij}$. The experimental results are summarized in Table 5 in the same manner as Table 2. The table also shows the number of classes in each dataset. The datasets were chosen from Table 1 with the intention of testing with at least three and at most ten classes.
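The OAO aggregation step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `pairwise_prob` is a hypothetical callable standing in for the trained sub-classifiers $p_{ij}$, and the final decision is assumed to select the class with the largest row sum $p_{i\cdot}$.

```python
def oao_predict(pairwise_prob, classes, x):
    """Classify x by aggregating pairwise sub-classifier probabilities.

    pairwise_prob(i, j, x) is a hypothetical callable returning the
    probability that x belongs to classes[i] rather than classes[j] (i < j).
    """
    k = len(classes)
    # Build the k-by-k matrix P with zeros on the diagonal.
    P = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            p = pairwise_prob(i, j, x)
            P[i][j] = p
            P[j][i] = 1.0 - p  # p_ji = 1 - p_ij
    # Row sums p_i. = sum_j p_ij; assumed rule: pick the largest row sum.
    row_sums = [sum(row) for row in P]
    best = max(range(k), key=lambda i: row_sums[i])
    return classes[best], P


# Toy usage with fixed pairwise probabilities instead of trained classifiers.
toy = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3}
label, P = oao_predict(lambda i, j, x: toy[(i, j)], ["a", "b", "c"], x=None)
```

Only the upper triangle requires a sub-classifier evaluation; the lower triangle follows from $p_{ji} = 1 - p_{ij}$, so $k(k-1)/2$ evaluations suffice per test example.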
Similar to the binary classification results, all proposed methods outperformed the original naive Bayes. The proposed methods yielded RI improvements of at least 17% and at most 20%. The RI of DNB was higher than that of LNB, and GDNB was slightly better than DNB. This performance ranking differs from that observed in the binary classification experiments. In the multi-class scenario with the OAO scheme, GDNB proved to be a good attribute weighting scheme because it finds the optimal attribute weights and the α values simultaneously.

V. CONCLUSION
In this study, new attribute weighting methods for improving naive Bayesian classification have been proposed. The proposed methods consider learning a weighted naive Bayes classifier as a nonlinear optimization problem. Different weights are assigned to attributes by minimizing the proposed loss functions, namely, ENB, DNB, LNB, and GDNB. The validity of each method was confirmed both analytically and empirically. The first method (ENB) is a training-error minimizer. Next, DNB was proposed to extend ENB to an ideal case with an infinite number of training instances. Because DNB is not a maximum likelihood estimator, LNB was proposed by modifying DNB. Finally, GDNB was proposed to automatically find the multiplier of the classification margin according to the noise level of a given training set. The attribute weights determined by the proposed weighted naive Bayes can be seen as a quantitative importance measure of the attributes, that is, the equal attribute importance assumption of the standard NB is relaxed. Because the proposed methods train a classifier and measure its attribute importance simultaneously, one can adjust the complexity of a trained classifier according to the resulting attribute weights. Based on numerical experiments using 28 real-world datasets, we confirmed that the proposed scheme was successful in terms of accuracy and robustness.
This study is limited in that we considered only the binary classification case, owing to the linear formulation of the weighted naive Bayes. Although Section IV-D showed successful results obtained by simply generalizing the proposed methods to multi-class problems with the OAO approach, this generalization remains inefficient because ${}_{k}C_{2}$ classifiers must be trained. In our ongoing work, we plan to investigate ways to build more elaborate optimization formulations for multi-class weighted naive Bayes.
TAEHEUNG KIM received the Ph.D. degree in industrial engineering from Sungkyunkwan University, South Korea. He is currently working as a Research Professor with Sungkyunkwan University. Before rejoining Sungkyunkwan University, he worked with the Production Research Institute, LG Electronics, as a member of the AI Solution Team. His research interest includes the design of learning and optimization algorithms to improve real manufacturing processes.