Nearest labelset using double distances for multi-label classification

Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting any correlation between labels. In this article we propose a novel approach, Nearest Labelset using Double Distances (NLDD), that predicts the labelset observed in the training data that minimizes a weighted sum of the distances in both the feature space and the label space to the new instance. The weights specify the relative tradeoff between the two distances. The weights are estimated from a binomial regression of the number of misclassified labels as a function of the two distances. Model parameters are estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking into account label dependencies. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of 0/1 loss and multi-label accuracy, and ranks second on the F-measure (after a method called ECC) and on Hamming loss (after a method called RF-PCT).


Introduction
In multi-label classification, an instance can belong to multiple labels at the same time. This is different from multi-class or binary classification, where an instance can only be associated with a single label. For example, a newspaper article talking about electronic books may be labelled with multiple topics such as business, arts and technology simultaneously. Multi-label classification has been applied in many areas including text (Schapire and Singer 2000; Godbole and Sarawagi 2004), image (Boutell et al 2004; Zhang and Zhou 2007), music (Li and Ogihara 2003; Trohidis et al 2008) and bioinformatics (Elisseeff and Weston 2001). A labelset for an instance is the set of all labels that are associated with that instance.
Approaches for solving multi-label classification problems may be categorized into either problem transformation methods or algorithm adaptation methods (Tsoumakas and Katakis 2007). Problem transformation methods transform a multi-label problem into one or more single-label problems. For the single-label classification problems, binary or multi-class classifiers are used. The results are combined and transformed back into a multi-label representation. Algorithm adaptation methods, on the other hand, modify specific learning algorithms directly for multi-label problems. Individual approaches are explained in Section 2.
In this paper, we propose a new problem transformation approach to multi-label classification. Our proposed approach applies the nearest neighbor method to predict the label with the shortest distance in the feature space. However, because we have multiple labels, we additionally consider the shortest distance in the label space. We then find the labelset that minimizes the expected label misclassification rate as a function of both distances, feature space and label space, exploiting high-order interdependencies between labels. The nonlinear function is estimated using maximum likelihood.
The effectiveness of the proposed approach is evaluated with various multi-label data sets. Our experiments show that the proposed method performs on average better on standard evaluation metrics (Hamming loss, 0/1 loss, multi-label accuracy and the F-measure) than other commonly used algorithms.
The rest of this paper is organized as follows: In Section 2 we review previous work on multi-label classification. In Section 3, we present the details of the proposed method. In Section 4, we report on experiments that compare the proposed method with other algorithms on standard metrics. In Section 5 we investigate how NLDD scales to larger data sets. In Section 6 we discuss the results. In Section 7, we draw conclusions.

Related work
There are several approaches to classifying multi-label data. The most common approach, binary relevance (BR) (Zhang and Zhou 2005; Tsoumakas and Katakis 2007), transforms a multi-label problem into separate binary problems. That is, using training data, BR constructs a binary classifier for each label independently. For a test instance, the predicted set of labels is obtained simply by combining the individual binary results. In other words, the predicted labelset is the union of the results predicted from the L binary models. This approach requires one binary model for each label. The method has been adapted in many domains including text (Gonçalves and Quaresma 2003), music (Li and Ogihara 2003) and images (Boutell et al 2004). One drawback of the basic binary approach is that it does not account for any correlation that may exist between labels, because the labels are modelled independently. Taking correlations into account is often critical for prediction in multi-label problems (Godbole and Sarawagi 2004; Ji et al 2008).
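The binary relevance transformation can be sketched in a few lines. This is a minimal illustration only: the nearest-centroid base learner `fit_centroid_classifier`, the function names and the toy data are hypothetical stand-ins and are not taken from any of the cited implementations.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
# A fitted binary model maps a feature vector to 0 or 1.
BinaryModel = Callable[[Vector], int]


def fit_centroid_classifier(X: List[Vector], y: List[int]) -> BinaryModel:
    """Hypothetical stand-in base learner: nearest class centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    if not pos:          # label never present in training: always predict 0
        return lambda x: 0
    if not neg:          # label always present in training: always predict 1
        return lambda x: 1
    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: int(sq_dist(x, c_pos) < sq_dist(x, c_neg))


def fit_binary_relevance(X, Y, fit_base=fit_centroid_classifier):
    """Fit one independent binary model per label (one per column of Y)."""
    n_labels = len(Y[0])
    return [fit_base(X, [row[j] for row in Y]) for j in range(n_labels)]


def predict_binary_relevance(models, x):
    """Combine the per-label binary decisions into one label vector."""
    return [m(x) for m in models]


# Toy data (made up): 4 instances, 2 features, 3 labels.
X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
Y_train = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
models = fit_binary_relevance(X_train, Y_train)
prediction = predict_binary_relevance(models, [0.05, 0.1])
```

Each label is fitted without reference to the others, which is exactly why correlations between labels are ignored.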
A method related to BR is Subset-Mapping (SMBR) (Schapire and Singer 1999; Read et al 2011). For a new instance, a vector of labels is obtained from the binary outputs of BR and the final prediction is the training labelset with the shortest Hamming distance to that vector. For predictions SMBR only chooses labelsets observed in the training data, thus SMBR exploits the interdependencies among labels.
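The mapping step of SMBR can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, and ties in Hamming distance are broken by taking the first labelset in sorted order, which the text above does not specify.

```python
def hamming_distance(a, b):
    """Number of positions at which two label vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def subset_mapping(br_prediction, training_labelsets):
    """Return the observed training labelset closest in Hamming distance.

    Assumption: ties are broken by sorted order of the labelsets.
    """
    distinct = sorted(set(map(tuple, training_labelsets)))
    return min(distinct, key=lambda y: hamming_distance(y, br_prediction))


# Toy example: the BR output (0, 1, 0) was never observed in training,
# so it is mapped to the nearest observed labelset.
observed = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 0, 0)]
mapped = subset_mapping((0, 1, 0), observed)
```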
An extension of binary relevance is Classifier Chain (CC) (Read et al 2011). CC fits labels sequentially using binary classifiers. Labels already predicted are included as features in subsequent classifiers until all labels have been fit. Including previous predictions as features "chains" the classifiers together and also takes into account potential label correlations. However, the order of the labels in a chain affects the predictive performance. Read et al (2011) also introduced the ensemble of classifier chains (ECC), where multiple CC are built with re-sampled training sets. The order of the labels in each CC is randomly chosen. The prediction of an ECC is obtained by the majority vote of the CC models.
Label Powerset learning (LP) transforms a multi-label classification problem into a multi-class problem (Tsoumakas and Katakis 2007). In other words, LP treats each labelset as a single label. The transformed problem requires a single classifier. Although LP captures correlations between labels, the number of classes in the transformed problem increases exponentially with the number of original labels. LP learning can only choose observed labelsets for predictions (Tsoumakas and Katakis 2007; Read et al 2008).
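The label powerset transformation itself is simple to illustrate. The sketch below only shows the encoding of labelsets as class ids; the function name and toy data are hypothetical, and any multi-class classifier could then be trained on the resulting ids.

```python
def label_powerset_encode(Y):
    """Map each distinct labelset to a class id.

    Returns the class id of every training instance and the list of
    distinct labelsets, so a predicted id can be decoded back.
    """
    classes = sorted(set(map(tuple, Y)))
    index = {labelset: k for k, labelset in enumerate(classes)}
    return [index[tuple(row)] for row in Y], classes


Y_train = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [1, 0, 0]]
class_ids, classes = label_powerset_encode(Y_train)
decoded = classes[class_ids[1]]  # decode the class id of the second instance
```

With L labels there are up to 2^L possible classes, but only the labelsets that occur in the training data receive an id, which is why LP cannot predict an unobserved labelset.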
The random k-labelsets method (RAKEL) (Tsoumakas and Vlahavas 2007) is a variation on the LP approach. In a multi-label problem with L different labels, RAKEL employs m multi-class models, each of which considers k (≤ L) randomly chosen labels rather than the entire labelset. For a test instance, the predicted labelset is obtained by the majority vote of the results based on the m models. RAKEL overcomes the problem that the number of multinomial classes increases exponentially as a function of the number of labels. It also considers interdependencies between labels by using multi-class models with subsets of the labels.
A popular lazy learning algorithm based on the k Nearest Neighbours (kNN) approach is MLKNN (Zhang and Zhou 2007). Like other kNN-based methods, MLKNN identifies the k nearest training instances in the feature space for a test instance. Then for each label, MLKNN estimates the prior and likelihood for the number of neighbours associated with the label. Using Bayes' theorem, MLKNN calculates the posterior probability from which a prediction is made.
The Conditional Bernoulli Mixtures (CBM) (Li et al 2016) approach transforms a multi-label problem into a mixture of binary and multi-class problems. CBM divides the feature space into K regions and learns a multi-class classifier for the regional components as well as binary classifiers in each region. The posterior probability for a labelset is obtained by mixing the multi-class and multiple binary classifiers. The model parameters are estimated using the Expectation Maximization algorithm.
The nearest labelset using double distances approach

Hypercube view of a multi-label problem
In multi-label classification, we are given a set of possible output labels $\mathcal{L} = \{1, 2, ..., L\}$. Each instance with a feature vector $x \in \mathbb{R}^d$ is associated with a subset of these labels. Equivalently, the subset can be described as $y = (y^{(1)}, y^{(2)}, ..., y^{(L)})$, where $y^{(i)} = 1$ if label $i$ is associated with the instance and $y^{(i)} = 0$ otherwise. A multi-label training data set is described as $T = \{(x_i, y_i), i = 1, 2, ..., N\}$.
Any labelset $y$ can be described as a vertex in the $L$-dimensional unit hypercube (Tai and Lin 2012). Each component $y^{(i)}$ of $y$ represents an axis of the hypercube. As an example, Figure 1 illustrates the label space of a multi-label problem with three labels $(y^{(1)}, y^{(2)}, y^{(3)})$.
Assume that the presence or absence of each label is modeled independently with a probabilistic classifier. For a new instance, the classifiers provide the probabilities, $p^{(1)}, ..., p^{(L)}$, that the corresponding labels are associated with the instance. Using the probability outputs, we may obtain an $L$-dimensional vector $\hat{p} = (p^{(1)}, p^{(2)}, ..., p^{(L)})$. Every element of $\hat{p}$ has a value from 0 to 1 and the vector $\hat{p}$ is an inner point in the hypercube (see Figure 1). Given $\hat{p}$, the prediction task is completed by assigning the inner point to a vertex of the cube.
For the new instance, we may calculate the Euclidean distance, $D_{y_i}$, between $\hat{p}$ and each $y_i$ (i.e. the labelset of the $i$th training instance). In Figure 1, three training instances $y_1$, $y_2$ and $y_3$ and the corresponding distances are shown. A small distance $D_{y_i}$ indicates that $y_i$ is likely to be the labelset for the new instance.
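As a small numerical illustration of the label-space distance, the sketch below computes the Euclidean distance between a probability vector and three hypercube vertices. The probability vector and the three training labelsets are made-up values, not those of Figure 1.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical classifier output for a new instance with L = 3 labels.
p_hat = [0.9, 0.2, 0.6]
# Hypothetical training labelsets (vertices of the unit cube).
training_labelsets = [(1, 0, 0), (1, 0, 1), (0, 1, 1)]

label_distances = [euclidean(p_hat, y) for y in training_labelsets]
nearest = training_labelsets[label_distances.index(min(label_distances))]
```

Here the squared distances are 0.41, 0.21 and 1.61, so the vertex (1, 0, 1) is the nearest training labelset in the label space.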

Nearest labelset using double distances (NLDD)
In addition to computing the distance in the label space, $D_{y_i}$, we may also obtain the (Euclidean) distance in the feature space, denoted by $D_{x_i}$. The proposed method, NLDD, uses both $D_x$ and $D_y$ as predictors to find a training labelset that minimizes the expected loss. For each test instance, we define loss as the number of misclassified labels out of $L$ labels. The expected loss is then $L\theta$ where $\theta = g(D_x, D_y)$ represents the probability of misclassifying each label. The predicted labelset, $\hat{y}^*$, is the labelset observed in the training data that minimizes the expected loss:
$$\hat{y}^* = \arg\min_{y \in T} L\, g(D_x, D_y) \qquad (1)$$
The loss follows a binomial distribution with $L$ trials and parameter $\theta$. We model $\theta = g(D_x, D_y)$ as follows:
$$\log\frac{\theta}{1-\theta} = \beta_0 + \beta_1 D_x + \beta_2 D_y \qquad (2)$$
where $\beta_0$, $\beta_1$ and $\beta_2$ are the model parameters. Greater values for $\beta_1$ and $\beta_2$ imply that $\theta$ becomes more sensitive to the distances in the feature and label spaces, respectively. The misclassification probability decreases as $D_x$ and $D_y$ approach zero.
A test instance with $D_x = D_y = 0$ has a duplicate instance in the training data (i.e. with identical features). The predicted probabilities for the test instance are either 0 or 1 and they match the labels of the duplicate training observation. For such a "double-duplicate" instance (i.e. $D_x = D_y = 0$), the probability of misclassification is $1/(1 + e^{-\beta_0}) > 0$. As expected, the uncertainty of a test observation with a "double-duplicate" training observation is greater than zero.
The model in (2) implies $g(D_x, D_y) = 1/(1 + e^{-(\beta_0 + \beta_1 D_x + \beta_2 D_y)})$. Because $\log\frac{\theta}{1-\theta}$ is a monotone transformation of $\theta$ and $L$ is a constant, the minimization problem in (1) is equivalent to
$$\hat{y}^* = \arg\min_{y \in T} \beta_1 D_x + \beta_2 D_y \qquad (3)$$
That is, NLDD predicts by choosing the labelset of the training instance that minimizes the weighted sum of the distances. For prediction, the only remaining issue is how to estimate the weights.
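The prediction rule, choosing the training labelset that minimizes the weighted sum of the two distances, can be sketched as below. This is an illustration under assumptions: the function names, the toy training data and the weights `beta1 = 1.0` and `beta2 = 2.0` are hypothetical, and the probability vector `p_hat` is taken as given rather than produced by fitted classifiers.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nldd_predict(x, p_hat, train_X, train_Y, beta1, beta2):
    """Return the training labelset minimising beta1 * Dx + beta2 * Dy.

    Dx is the feature-space distance from x to a training instance and
    Dy is the label-space distance from p_hat to its labelset.
    """
    best_y, best_score = None, math.inf
    for x_j, y_j in zip(train_X, train_Y):
        score = beta1 * euclidean(x, x_j) + beta2 * euclidean(p_hat, y_j)
        if score < best_score:
            best_y, best_score = y_j, score
    return best_y, best_score


# Hypothetical toy data and weights.
train_X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
train_Y = [(1, 0, 0), (1, 0, 1), (0, 1, 1)]
predicted, score = nldd_predict(
    x=[0.9, 0.8], p_hat=[0.9, 0.2, 0.6],
    train_X=train_X, train_Y=train_Y, beta1=1.0, beta2=2.0,
)
```

Because the intercept is the same for every candidate, it plays no role in the ranking; only the ratio of the two slope weights matters.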

Estimating the relative weights of the two distances
We need to estimate the parameters $\beta_0$, $\beta_1$ and $\beta_2$. This requires computing $D_y$, but of course the outcomes in the test data are not known. We therefore split the training data, $T$, equally into two data sets, $T_1$ and $T_2$. $T_2$ is used for validation. Using $T_1$, we next fit a binary classifier to each of the $L$ labels separately and obtain the labelset predictions (i.e. probability outcomes) for the instances in $T_2$. We then create a set of $(D_x, D_y)$ by pairing instances in $T_1$ with those in $T_2$. Note that matching any single instance in $T_2$ to those in $T_1$ results in $N/2$ distance pairs. Most of the pairs are uninformative because the distance in either the feature space or the label space is very large. Moreover, since $T_2$ contains $N/2$ instances, the number of possible pairs is potentially large ($N^2/4$). Therefore, to reduce computational complexity, for each instance we only identify two pairs: the pair with the smallest distance in $x$ and the pair with the smallest distance in $y$. In case of ties in one distance, the pair with the smallest value in the other distance is chosen. More formally, we identify the first pair $m_{i1}$ by
$$m_{i1} = \arg\min_{(D_x, D_y) \in W_{ix}} D_y$$
where $W_{ix}$ is the set of pairs that are tied; i.e. that each corresponds to the minimum distance in $D_x$. Similarly, the second pair $m_{i2}$ is found by
$$m_{i2} = \arg\min_{(D_x, D_y) \in W_{iy}} D_x$$
where $W_{iy}$ is the set of pairs that are tied with the minimal distance in $D_y$. Figure 2 illustrates an example of how to identify $m_{i1}$ and $m_{i2}$ for $N = 20$. The goal is to identify the instance with the smallest distance in $x$ and the instance with the smallest distance in $y$. Note that $m_{i1}$ and $m_{i2}$ may be the same instance. If we find a single instance that minimizes both distances, we use just that instance. (A possible duplication of that instance is unlikely to make any difference in practice.)
In Figure 2, the 10 points in the scatter plot were obtained by calculating $D_x$ and $D_y$ between an instance in $T_2$ and the 10 instances in $T_1$. In this example two points have the lowest distance in $D_y$ and are candidates for $m_{i2}$. Among the candidates, the point with the lowest $D_x$ is chosen.
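The selection of the two pairs with tie-breaking can be written compactly with lexicographic comparison. The sketch below is an illustration only: the function name and the five distance pairs are made up, and ties are assumed to be exact equality of the distances.

```python
def select_pairs(pairs):
    """Select the two informative pairs for one validation instance.

    pairs: list of (Dx, Dy) between one instance of T2 and every
    instance of T1. Returns the indices (m_i1, m_i2): the pair with the
    smallest Dx (ties broken by smallest Dy) and the pair with the
    smallest Dy (ties broken by smallest Dx).
    """
    indices = range(len(pairs))
    m1 = min(indices, key=lambda j: (pairs[j][0], pairs[j][1]))
    m2 = min(indices, key=lambda j: (pairs[j][1], pairs[j][0]))
    return m1, m2


# Hypothetical distances: indices 1 and 4 tie on Dx, indices 2 and 3 tie on Dy.
pairs = [(0.8, 0.5), (0.3, 0.9), (0.6, 0.2), (0.4, 0.2), (0.3, 1.1)]
m1, m2 = select_pairs(pairs)
# Keep a single copy when both criteria select the same instance.
selected = sorted({m1, m2})
```

Using a set for `selected` mirrors the rule that a single instance minimizing both distances is used only once.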
The two pairs corresponding to the $i$th instance in $T_2$ are denoted as the set $S_i = \{m_{i1}, m_{i2}\}$, and their union for all instances is denoted as $S = \bigcup_{i=1}^{N/2} S_i$. The binomial regression specified in (2) is performed on the instances in $S$ and maximum likelihood estimators of the parameters are obtained. Algorithm 1 outlines the training procedure.
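For reference, the maximum likelihood fit of the binomial regression can be written out explicitly. The following is a sketch under assumed notation that does not appear in the original: $c_s$ denotes the number of misclassified labels (out of $L$) for pair $s \in S$, $z_s = (1, D_{x,s}, D_{y,s})^\top$ collects the intercept and the two distances of that pair, and the pairs are treated as independent.

```latex
\begin{align*}
\theta_s &= \frac{1}{1 + e^{-\beta^\top z_s}}, \qquad \beta = (\beta_0, \beta_1, \beta_2)^\top \\
\ell(\beta) &= \sum_{s \in S} \left[ c_s \log \theta_s + (L - c_s) \log (1 - \theta_s) \right] + \text{const} \\
\nabla \ell(\beta) &= \sum_{s \in S} \left( c_s - L\,\theta_s \right) z_s \\
H(\beta) &= -\sum_{s \in S} L\,\theta_s (1 - \theta_s)\, z_s z_s^\top \\
\beta^{(t+1)} &= \beta^{(t)} - H\!\left(\beta^{(t)}\right)^{-1} \nabla \ell\!\left(\beta^{(t)}\right)
\end{align*}
```

The last line is one Newton-Raphson step; each step sums over the elements of $S$, so its cost grows linearly with the number of selected pairs.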

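Algorithm 1 The training process of NLDD
Input: training data $T$, number of labels $L$
Output: probabilistic classifiers $h^{(i)}$, binomial regression $g$
Split $T$ into $T_1$ and $T_2$
Fit a probabilistic binary classifier to each of the $L$ labels using $T_1$ and obtain $\hat{p}$ for the instances in $T_2$
For each instance in $T_2$, compute $(D_x, D_y)$ to the instances in $T_1$ and identify $S_i = \{m_{i1}, m_{i2}\}$
Fit the binomial regression $g$ in (2) to the pairs in $S$
Fit the probabilistic classifiers $h^{(i)}$, $i = 1, ..., L$, using $T$

For the classification of a new instance, we first obtain $\hat{p}$ using the probabilistic classifiers fitted to the training data $T$. $D_{x_j}$ and $D_{y_j}$ are obtained by matching the instance with the $j$th training instance. Using the MLEs $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$, we calculate $\hat{\theta}_j = e^{\hat{f}_j}/(1 + e^{\hat{f}_j})$ where $\hat{f}_j = \hat{\beta}_0 + \hat{\beta}_1 D_{x_j} + \hat{\beta}_2 D_{y_j}$, and predict
$$\hat{y} = \arg\min_{y_j \in T} \hat{E}(\text{loss}_j) = \arg\min_{y_j \in T} \hat{\theta}_j$$
The second equality holds because $\hat{E}(\text{loss}) = L\hat{\theta}$ and $L$ is a constant. As in LP, NLDD chooses a training labelset as the predicted vector. Algorithm 2 outlines the classification procedure.

Algorithm 2 The classification process of NLDD
Input: new instance $x$, binomial model $g$, probabilistic classifiers $h^{(i)}$, training data $T$ of size $N$
Output: multi-label classification vector $\hat{y}$
Compute $\hat{p} = (h^{(1)}(x), ..., h^{(L)}(x))$
For $j = 1$ to $N$: compute $D_{x_j}$ and $D_{y_j}$, and $\hat{\theta}_j = g(D_{x_j}, D_{y_j})$
Return $\hat{y} = y_m$ where $m = \arg\min_j \hat{\theta}_j$

In the training time complexity of NLDD, $O(f(d, N))$ is the complexity of each binary classifier with $d$ features and $N$ training instances, $O(g(d, N/2))$ is the complexity for predicting each label for $T_2$, $N^2(d + L)$ is the complexity for obtaining the distance pairs for the regression and $O(N \log(k))$ is the complexity for fitting a binomial regression. $T_1$ and $T_2$ have $N/2$ instances each. $O(L f(d, N/2))$ is the complexity for fitting binary classifiers using $T_1$, and obtaining the probability results for $T_2$ takes $O(L g(d, N/2))$. For each instance of $T_2$, we obtain $N/2$ distance pairs. This has complexity $O((N/2)(d + L))$. Since there are $N/2$ instances, overall it takes $O((N/2)(N/2)(d + L))$, or $O(N^2(d + L))$ when omitting the constant. Among the $N/2$ pairs for each instance of $T_2$, we only identify at most 2 pairs. This implies $N/2 \le s \le N$ where $s$ is the number of elements in $S$. Each iteration of the Newton-Raphson method has a complexity of $O(N)$. For $k$-digit precision, the method requires $O(\log k)$ iterations (Ypma 1995). Combined, the complexity for estimating the parameters with $k$-digit precision is $O(N \log(k))$. In practice, however, this term is dominated by $N^2(d + L)$ as we can set $k \ll N$.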

Experimental evaluation
In this section we compare the algorithms for multi-label classification on nine data sets in terms of Hamming loss, 0/1 loss, multi-label accuracy and F-measure. We next introduce the data sets and the evaluation measures and then present the results of our experiments.

Data sets
We evaluated the proposed approach using nine commonly used multi-label data sets from different domains. Table 1 shows basic statistics for each data set including its domain, numbers of labels and features.
In the text data sets, all features are categorical (i.e. binary). The last column "lcard", short for label cardinality, represents the average number of labels associated with an instance. The data sets are ordered by (|L| · |X| · |E|). The emotions data set (Trohidis et al 2008) consists of pieces of music with rhythmic and timbre features. Each instance is associated with up to 6 emotion labels such as "sad-lonely", "amazed-surprised" and "happy-pleased". The scene data set (Boutell et al 2004) consists of images with 294 visual features. Each image is associated with up to 6 labels including "mountain", "urban" and "beach". The yeast data set (Elisseeff and Weston 2001) contains 2417 genes of the yeast Saccharomyces cerevisiae. Each gene is represented by 103 features and is associated with a subset of 14 functional labels. The medical data set consists of documents that describe patient symptom histories. The data were made available in the Medical Natural Language Processing Challenge in 2007. Each document is associated with a set of 45 disease codes. The slashdot data set consists of 3782 text instances with 22 labels obtained from Slashdot.org. The enron data set (Klimt and Yang 2004) contains 1702 email messages from Enron corporation employees. The emails were categorized into 53 labels. The ohsumed data set (Hersh et al 1994) is a collection of medical research articles from the MEDLINE database. We used the same data set as in Read et al (2011) that contains 13929 instances and 23 labels. The tmc2007 data set (Srivastava and Zane-Ulman 2005) contains 28596 aviation safety reports associated with up to 22 labels. Following Tsoumakas et al (2011), we used a reduced version of the data set with 500 features. The bibtex data set (Katakis et al 2008)

Let $L$ be the number of labels in a multi-label problem. For a particular test instance, let $y = (y^{(1)}, ..., y^{(L)})$ be the labelset where $y^{(j)} = 1$ if the $j$th label is associated with the instance and 0 otherwise. Let $\hat{y} = (\hat{y}^{(1)}, ..., \hat{y}^{(L)})$ be the predicted values obtained by any machine learning method. Hamming loss refers to the percentage of incorrect labels. The Hamming loss for the instance is
$$\text{Hamming loss} = \frac{1}{L} \sum_{j=1}^{L} \mathbf{1}[y^{(j)} \neq \hat{y}^{(j)}]$$
where $\mathbf{1}$ is the indicator function. Despite its simplicity, the Hamming loss may be less discriminative than other metrics. In practice, an instance is usually associated with a small subset of labels. As the elements of the $L$-dimensional label vector are mostly zero, even the empty set (i.e. zero vector) prediction may lead to a decent Hamming loss.
The 0/1 loss for the instance is $\mathbf{1}[y \neq \hat{y}]$; that is, the loss is 0 if the predicted labelset matches the true labelset exactly and 1 otherwise. Compared to other evaluation metrics, 0/1 loss is strict as all $L$ labels must match the true ones simultaneously.
The multi-label accuracy (Godbole and Sarawagi 2004) (also known as the Jaccard index) is defined as the number of labels counted in the intersection of the predicted and true labelsets divided by the number of labels counted in the union of the labelsets. That is,
$$\text{Multi-label accuracy} = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}.$$
The multi-label accuracy measures the similarity between the true and predicted labelsets. The F-measure is the harmonic mean of precision and recall. The F-measure is defined as
$$\text{F-measure} = \frac{2\,|y \cap \hat{y}|}{|y| + |\hat{y}|}.$$
The metrics above were defined for a single instance. On each metric, the overall value for an entire test data set is obtained by averaging the individual values.
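The four instance-level metrics can be sketched directly from their definitions. The function names are hypothetical, and returning 1.0 when both the true and the predicted labelsets are empty is an assumption of this sketch, since the definitions above leave that case undefined.

```python
def hamming_loss(y, y_hat):
    """Fraction of the L labels that are predicted incorrectly."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


def zero_one_loss(y, y_hat):
    """0 if the labelsets match exactly, 1 otherwise."""
    return int(list(y) != list(y_hat))


def multilabel_accuracy(y, y_hat):
    """Jaccard index: |intersection| / |union| of the two labelsets."""
    inter = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(y, y_hat) if a == 1 or b == 1)
    return inter / union if union else 1.0  # assumption for two empty sets


def f_measure(y, y_hat):
    """Harmonic mean of precision and recall for one instance."""
    inter = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    denom = sum(y) + sum(y_hat)
    return 2 * inter / denom if denom else 1.0  # assumption for two empty sets


def mean_metric(metric, Y, Y_hat):
    """Average an instance-level metric over a test set."""
    return sum(metric(y, y_hat) for y, y_hat in zip(Y, Y_hat)) / len(Y)


y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
results = {
    "hamming": hamming_loss(y_true, y_pred),
    "zero_one": zero_one_loss(y_true, y_pred),
    "accuracy": multilabel_accuracy(y_true, y_pred),
    "f_measure": f_measure(y_true, y_pred),
}
```

In this toy example one label is shared, the union has three labels, and each labelset has two labels, so the accuracy is 1/3 while the F-measure is 0.5.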

Experimental setup
We compared our proposed method against BR, SMBR, ECC, MLKNN, RAKEL and CBM. To train multi-label classifiers, the parameters recommended by the authors were used. In the case of MLKNN, we set the number of neighbors and the smoothing parameter to 10 and 1 respectively. For RAKEL, we set the number of separate models to 2L and the size of each sub-labelset to 3. For ECC, the number of CC models for each ensemble was set to 10. On the larger data sets (ohsumed, tmc2007 and bibtex), we fit ECC using reduced training data sets (75% of the instances and 50% of the features) as suggested in Read et al (2011). On the same data sets, we ran NLDD using 70% of the training data to reduce redundancy in learning.
For NLDD, we used support vector machines (SVM) (Vapnik 2000) as the base classifier on unscaled variables with a linear kernel and tuning parameter C = 1. The SVM scores were converted into probabilities using Platt's method (Platt 2000). SVM was also used as the base classifier for BR, SMBR, ECC and RAKEL. The analysis was conducted in R (R Core Team 2014) using the e1071 package (Meyer et al 2014) for SVM. For the data sets with fewer than 5,000 instances, 10-fold cross validations (CV) were performed. On the larger data sets, we used 75/25 train/test splits. For fitting binomial regression models, we divided the training data sets at random into two parts of equal sizes. For implementing CBM we used a Java program developed by the authors. The default settings (e.g. logistic regression and 10 iterations for the EM algorithm) were used on non-large data sets. For the large data sets tmc2007 and bibtex, the number of iterations was set to 5 and random feature reduction was applied as suggested by the developers. On each data set we used the train/test split available at their website (https://github.com/cheng-li/pyramid).
We applied the Wilcoxon signed-rank test (Wilcoxon 1945; Demšar 2006) to compare the methods over multiple data sets because unlike the t-test it does not make a distributional assumption. Also, the Wilcoxon test is more robust to outliers than the t-test (Demšar 2006). Each test was one-sided at significance level 0.05. In multi-label classification, the Wilcoxon signed-rank test was employed by Tsoumakas et al (2011).
In NLDD, when calculating distances in the feature space we used the standardized features so that no particular features dominated distances. For a numerical feature variable $x$, the standardized variable $z$ is obtained by $z = (x - \bar{x})/sd(x)$ where $\bar{x}$ and $sd(x)$ are the mean and standard deviation of $x$ in the training data.
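The standardization step can be illustrated with a short sketch. The function names and the toy values are hypothetical; the sketch assumes the sample standard deviation (n - 1 denominator, as returned by R's `sd`), which the text does not state explicitly, and it does not handle constant features whose standard deviation is zero.

```python
import statistics


def fit_standardizer(train_column):
    """Estimate the mean and standard deviation from training data only."""
    mean = statistics.mean(train_column)
    sd = statistics.stdev(train_column)  # assumption: sample standard deviation
    return mean, sd


def standardize(value, mean, sd):
    """Apply z = (x - mean) / sd using the training-data statistics."""
    return (value - mean) / sd


# Toy feature column from the training data, and one test value.
mean, sd = fit_standardizer([2.0, 4.0, 6.0])
z_test = standardize(8.0, mean, sd)
```

The test value is transformed with the statistics of the training data, so no information from the test set enters the distance computation.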

Results
Tables 2 to 5 summarize the results in terms of Hamming loss, 0/1 loss, multi-label accuracy and F-measure, respectively. We also ranked the algorithms for each metric. The Wilcoxon test results report whether or not any two methods were significantly different in their rankings across data sets. The results are shown at the bottom of each table. NLDD achieved the highest average ranks on Hamming loss, 0/1 loss and multi-label accuracy, while ECC achieved the highest average rank on the F-measure with NLDD taking second place (and the difference between ECC and NLDD was not statistically significant). RAKEL achieved the second highest average rank on Hamming loss, while CBM achieved the second highest average rank on 0/1 loss and multi-label accuracy. The performance of CBM on the 0/1 loss was very variable, achieving the highest rank on five out of nine data sets and the second worst on two data sets. Table 6 shows the running time in seconds of the methods. On the non-large data sets, the relative differences of running time between NLDD and BR tended to increase with the size of the data sets. On two of the large data sets, ohsumed and tmc2007, NLDD required less time than BR as we only used 70% of the training data.
Table 3: 0/1 loss (lower is better) averaged over 10 cross validations (with ranks in parentheses). The loss is 0 if a predicted labelset matches the true labelset exactly and 1 otherwise. The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.
We next look at the performance of NLDD by whether or not the true labelsets were observed in the training data. A labelset has been observed if the exact labelset can be found in the training data and unobserved otherwise. Since NLDD makes a prediction by choosing a training labelset, a predicted labelset can only be partially correct on an unobserved labelset. Table 7 compares the evaluation results of BR and NLDD on two separate subsets of the test set of the bibtex data. The bibtex data were chosen because the data set contains by far the largest percentage of unobserved labelsets (33%) among the data sets investigated. The test data set was split into subsets A and B; if the labelset of a test instance was an observed labelset, the instance was assigned to A; otherwise the instance was assigned to B. For all of the four metrics, NLDD outperformed BR even though 33% of the labelsets in the test data were unobserved labelsets.
We next look at the three regression parameters the proposed method (NLDD) estimated (equation 2) for each data set in more detail. Table 8 reports the maximum likelihood estimates for each data set. In all data sets, the estimates of $\beta_1$ and $\beta_2$ were all positive. The positive slopes imply that the expected loss (or, equivalently, the probability of misclassification for each label) decreases as $D_x$ or $D_y$ decreases.
Table 5: F-measure (higher is better) averaged over 10 cross validations (with ranks in parentheses). The results from the Wilcoxon test on whether or not any two results are statistically significant from one another are summarized at the bottom of the table.
From the values of $\hat{\beta}_0$ we may infer how low the expected loss is when both $D_x$ and $D_y$ are 0. For example, $\hat{\beta}_0 = -3.5023$ in the scene data set. If $D_x = 0$ and $D_y = 0$, $\hat{p} = 0.0292$ because $\log\frac{\hat{p}}{1-\hat{p}} = -3.5023$. Hence $\hat{E}(\text{loss}) = L\hat{p} = 6 \cdot 0.0292 = 0.1752$. This is the expected number of mismatched labels for choosing a training labelset whose distances to the new instance are zero in both feature and label spaces. The results suggest the expected loss would be very small when classifying a new instance that had a duplicate in the training data ($D_x = 0$) and whose labels are predicted with probability 1 and the predicted labelset was observed in the training data ($D_y = 0$).
Table 8: The maximum likelihood estimates of the parameters of equation 2 averaged over 10 cross validations

Scaling up NLDD
As seen in Section 3.2, the time complexity of NLDD is dependent on the size of the training data ($N$). In particular, the term $O(N^2(d + L))$ makes the complexity of NLDD quadratic in $N$. For larger data sets the running time could be reduced by running the algorithm on a fraction of the $N$ instances, but performance may be affected. This is investigated next. Figure 3 illustrates the running time and the corresponding performance of NLDD as a function of the percentage of $N$. For the result, we used the tmc2007 data with 75/25 train/test splits. After splitting, we randomly chose 10% to 100% of the training data and ran NLDD with the reduced data. As before, we used SVM with a linear kernel as the base classifier.
The result shows that NLDD can obtain similar predictive performance in considerably less time. The running time increased quadratically as a function of $N$ while the improvement of the performance of NLDD appeared to converge. Using 60% of the training data, NLDD achieved almost the same performance in the number of mismatched labels as using the full training data. Similar results were obtained on other large data sets.

Discussion
For the sample data sets selected, NLDD performed significantly better than BR, SMBR and MLKNN on all of the four metrics. NLDD also significantly outperformed ECC on Hamming loss and 0/1 loss, and RAKEL on 0/1 loss, multi-label accuracy and F-measure. Although no significant difference was found between NLDD and CBM, NLDD achieved higher average ranks on all of the four metrics. On any evaluation metric, no method performed statistically significantly better than NLDD.
Like BR, NLDD uses outputs of independent binary classifiers. Using the distances in the feature and label spaces in binomial regression, NLDD can make more accurate predictions than BR. NLDD was also significantly superior to SMBR, which is similar to NLDD in the sense that it makes predictions by choosing training labelsets using binary classifiers. SMBR is based on the label space only, while NLDD uses the distances in the feature space as well.
Like LP, the proposed method treats each training labelset as a different class of a single-label problem in the prediction stage. Using a training labelset as a predicted vector, the proposed approach takes potentially high order label correlations into account.
In fitting the binomial regression, NLDD restricts the fit of the binomial model to distance pairs with low distances in the feature and label spaces. This dramatically reduces the size of the data used for regression fitting. In the yeast data set, the training data $T$ contained 2178 instances. Since we equally divided the training data into $T_1$ and $T_2$, each of them contained 1089 instances. Hence the number of possible distance pairs available for fitting is 1089 × 1089 = 1,185,921. On the other hand, NLDD used only 2,018 pairs, which is less than 0.2% of all pairs.
NLDD requires more time than BR. The relative differences of running time between NLDD and BR depended on the size of the training data ($N$). The number of labels and features had less impact on the differences, as the complexity of NLDD is linear in them. For the larger data sets, we reduced the running time of NLDD by using a subset (70%) of the training data. The results on the ohsumed and tmc2007 data sets show that NLDD with reduced data can run fast compared to not only BR but also the other methods on large data problems.
Because NLDD makes a prediction by choosing a training labelset, the predicted label vector is confined to a labelset appearing in the training data. If a new instance has a true labelset unobserved in the training data, there will be at least one incorrect predicted label. Even so, NLDD beat the other methods on average. How frequently an unobserved labelset occurs depends on the data set. For most data sets, less than 5% of the test data contained labelsets not observed in the training data. In other words, most of the labelsets of the test instances could be found in the training data. However, for the bibtex data set about 33% of the test data contained unobserved labelsets. As seen in Table 7, when the true labelsets of the test instances were not observed in the training data (subset B), BR performed slightly better than NLDD in terms of 0/1 loss, multi-label accuracy and F-measure. On the other hand, when the true labelsets of the test instances were observed in the training data (subset A), NLDD outperformed BR on all of the metrics. Combined, NLDD achieved higher performance than BR on the entire test data.
NLDD uses binomial regression to estimate the parameters. This setup assumes that the instances in $S$ are independent. While it turned out that this assumption worked well in practice, dependencies may arise between the two pairs of a given $S_i$. If required, this dependency could be modeled using, for example, generalized estimating equations (GEE) (Liang and Zeger 1986). We examined GEE using an exchangeable correlation structure. The estimates were almost the same and the prediction results were unchanged. The analogous results are not shown.
For prediction, the minimization in (3) only requires the estimates of the coefficients $\beta_1$ and $\beta_2$, which determine the tradeoff between $D_x$ and $D_y$. The estimate of $\beta_0$ is not needed. However, estimating $\beta_0$ allows estimating the probability of a misclassification of a label for an instance, $\hat{\theta}$. Such an assessment of uncertainty of the prediction can be useful. For example, one might only want to classify instances where the probability of misclassification is below a certain threshold value.
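The idea of classifying only when the estimated misclassification probability is below a threshold can be sketched as follows. The function names, the threshold of 0.05 and the slope values `1.0` and `2.0` are hypothetical; only the intercept `-3.5023` is taken from the scene data set discussed earlier.

```python
import math


def misclassification_probability(dx, dy, beta0, beta1, beta2):
    """Inverse logit of the linear predictor beta0 + beta1*Dx + beta2*Dy."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * dx + beta2 * dy)))


def classify_or_abstain(labelset, dx, dy, betas, threshold):
    """Return the labelset only if the estimated per-label
    misclassification probability is below the threshold; otherwise
    return None to signal that no prediction is made."""
    theta = misclassification_probability(dx, dy, *betas)
    return (labelset if theta < threshold else None), theta


# Intercept from the scene example; the two slopes are made-up values.
betas = (-3.5023, 1.0, 2.0)
kept, theta_near = classify_or_abstain((1, 0, 1), 0.0, 0.0, betas, threshold=0.05)
dropped, theta_far = classify_or_abstain((1, 0, 1), 2.0, 1.0, betas, threshold=0.05)
```

With both distances equal to zero the estimated probability is about 0.029, so the prediction is kept; for the distant candidate the probability exceeds one half and the instance is left unclassified.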
NLDD uses a linear model for binomial regression specified in (2). To investigate how the performance of NLDD changes in nonlinear models, we also considered the model $\log\frac{\theta}{1-\theta} = \beta_0 + D_x^{\beta_1} \cdot D_y^{\beta_2}$, in which the distances are combined in a multiplicative way. The difference of prediction results obtained by the linear and multiplicative models was small. The analogous results are not shown.
While SVM was employed as the base classifier, other algorithms could be chosen provided the classifier can estimate posterior probabilities rather than just scores. Better predictions of binary classifiers will make distances in the label space more useful and hence lead to a better performance.

Conclusion
In this paper, we have presented NLDD based on probabilistic binary classifiers. The proposed method chooses a training labelset with the minimum expected loss, where the expected loss is a function of two variables: the distances in feature and label spaces. The parameters are estimated by maximum likelihood. The experimental study with 9 different multi-label data sets showed that NLDD outperformed other state-of-the-art methods on average in terms of Hamming loss, 0/1 loss, multi-label accuracy and F-measure.
NLDD relies on labelsets observed in the training data and is unable to predict previously unobserved labelsets. NLDD outperformed other methods on the data sets examined, most of which had less than 5% unobserved labelsets in the test data. While the method still outperforms the other methods with 33% of unobserved labelsets on the bibtex data, the method might not fare as well when the percentage of unobserved labelsets is substantially greater.