Multi-label feature selection method based on dynamic weight

Multi-label feature selection attracts considerable attention in multi-label learning. Information theory-based multi-label feature selection methods aim to select the most informative features and thereby reduce the uncertainty of the labels. Previous methods treat the amount of uncertainty in the labels as constant. In fact, as the classification information of the label set is captured by features, the remaining uncertainty of each label changes dynamically. In this paper, we divide labels into two groups: one contains the labels with little remaining uncertainty, meaning that most of the classification information with respect to these labels has already been obtained by the selected features; the other contains the labels with extensive remaining uncertainty, meaning that the classification information of these labels has been neglected by the already-selected features. Feature selection should then favor new features that are highly relevant to the labels in the second group. Existing methods do not distinguish between the two label groups and ignore the dynamically changing amount of information of the labels. To this end, a Relevancy Ratio is designed to quantify the dynamic change in the amount of information of each label under the condition of the already-selected features. Afterward, a Weighted Feature Relevancy is defined to evaluate candidate features. Finally, a new multi-label feature selection method based on Weighted Feature Relevancy (WFRFS) is proposed. Experiments comparing WFRFS with six multi-label feature selection methods on thirteen real-world data sets yield encouraging results.


Introduction
In recent years, multi-label learning has emerged in many areas such as text categorization [3,6,21], semantic image annotation [35] and bioinformatics [9,38]. In multi-label data, each instance is associated with multiple labels simultaneously. High-dimensional multi-label data sets often contain many irrelevant and redundant features, which not only increase the computational burden but also degrade the classification performance of multi-label learning [23,11]. Multi-label feature selection aims to obtain a compact feature subset by selecting relevant features and eliminating the irrelevant and redundant ones [36,13,29,8].
From the perspective of the relationship between the learning algorithm and feature selection, multi-label feature selection methods can be divided into three categories: filter methods, wrapper methods and embedded methods [24]. Filter methods are independent of any learning algorithm and use predefined criteria to evaluate the importance of features [27,33]. Wrapper methods depend on the classification performance of a specific classifier to select the optimal feature subset [37,7]. Embedded methods carry out the classification task and the feature selection process simultaneously [2]. Filter methods are simple and efficient, and in this paper we focus on filter-based multi-label feature selection methods.
Different from single-label feature selection, which deals with data sets containing only one label, there are two ways to handle multi-label data in multi-label feature selection: problem transformation and algorithm adaptation [30,19]. Problem transformation is a straightforward approach that converts the multi-label data set into single-label data sets (binary or multi-class); a feature subset is then selected by applying single-label feature selection methods to the transformed data. However, this approach may create too many new labels or lose some label information. Algorithm adaptation methods select features directly from the multi-label data set, and many such methods have been proposed in recent years [22,14].
Information theory is widely utilized to measure the correlations between features and the label set in algorithm adaptation-based methods. Many multi-label feature selection methods based on information theory have been proposed [20,16,15,17] and have proved effective in reducing high dimensionality. Mutual information is an effective criterion that measures the reduction in uncertainty of one variable when another variable is given. In feature selection, feature relevancy can be regarded as selecting the features that maximize the reduction of uncertainty in the label set, so as to obtain the largest amount of classification information for the label set. In fact, the uncertainty of labels changes dynamically as different features are given. Feature selection should ensure that the selected features effectively cut down the uncertainty of all labels. However, existing methods regard the uncertainty of labels as constant. To avoid the problem that much of the information of some labels is never obtained, it is necessary to study how the uncertainty of labels changes under the effect of the selected features. In this paper, we divide labels into two groups: the first group contains the labels with little remaining uncertainty; the second group contains the labels whose remaining uncertainty is still large under the condition of the already-selected features. The proposed method aims to select new features that are highly relevant to the labels with extensive remaining uncertainty. The main contributions are as follows: (1) A Relevancy Ratio is designed to quantify the degree of the contribution of the already-selected features to different labels.
(2) A new feature relevancy term named Weighted Feature Relevancy (WFR) is defined that combines the mutual information with the Relevancy Ratio to evaluate the importance of candidate features. (3) A novel multi-label feature selection method named multi-label Feature Selection method based on Weighted Feature Relevancy (WFRFS) is proposed, which considers the WFR and the feature redundancy between candidate features and already-selected features. (4) To evaluate the classification performance of the proposed method, WFRFS is compared to six multi-label feature selection methods on thirteen realworld multi-label data sets. The experimental results show that WFRFS obtains better classification performance in terms of multiple evaluation criteria.
The rest of this paper is organized as follows. Section 2 introduces the preliminaries, including basic concepts of information theory and four evaluation criteria for multi-label classification performance. Section 3 briefly reviews the related work. Section 4 presents the proposed multi-label feature selection method WFRFS. Section 5 reports the experiments that verify the effectiveness of WFRFS. Section 6 concludes the paper and discusses future research directions.

Some basic concepts of information theory
In this subsection, we introduce some basic information-theoretic concepts for feature selection [26,4]. Information theory provides a way to measure the amount of information carried by random variables. Let X = {x_1, x_2, ..., x_n} be a discrete random variable. The information entropy H(X) measures the uncertainty of X and is defined as

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i),    (1)

where p(x_i) is the probability of x_i and the base of the logarithm is 2. Let Y = {y_1, y_2, ..., y_m} be another discrete random variable. H(X, Y) is the joint entropy of X and Y, and H(X|Y) is the conditional entropy of X given Y, which measures the remaining uncertainty of X under the condition of Y. They are defined as

H(X, Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j),    (2)

H(X|Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j) \log p(x_i|y_j),    (3)

where p(x_i, y_j) is the joint probability of (x_i, y_j) and p(x_i|y_j) is the conditional probability of x_i given y_j. Mutual information measures the amount of information shared by two variables; the greater the mutual information, the more relevant the two variables. It is defined as

I(X; Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)p(y_j)}.    (4)

Let Z be a third discrete random variable. The conditional mutual information between X and Y given Z is defined by

I(X; Y|Z) = H(X|Z) - H(X|Y, Z).    (5)

Interaction information measures the amount of information shared by three variables and is defined as

I(X; Y; Z) = I(X; Y) - I(X; Y|Z).    (6)
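As a concrete illustration, these quantities can be estimated from sample frequencies. The following sketch (plain Python, not part of the original paper, assuming discrete-valued samples) implements H(X), H(X, Y), H(X|Y) and I(X; Y) exactly as defined above:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy H(X) in bits, estimated from sample frequencies."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def joint_entropy(x, y):
    """Joint entropy H(X, Y) from paired samples."""
    return entropy(list(zip(x, y)))

def conditional_entropy(x, y):
    """H(X|Y) = H(X, Y) - H(Y)."""
    return joint_entropy(x, y) - entropy(y)

def mutual_information(x, y):
    """I(X; Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]   # y fully determines x
z = [0, 1, 0, 1]   # z is independent of x
print(mutual_information(x, y))  # 1.0 bit
print(mutual_information(x, z))  # 0.0 bits
```

The two printed values illustrate the interpretation above: a feature that fully determines a variable removes all of its uncertainty, while an independent feature removes none.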

Multi-label evaluation metrics
In multi-label learning, the evaluation metrics of classification performance differ from those of traditional single-label learning and are more complicated [34]. In this paper, we employ Macro-F1, Micro-F1, Hamming Loss and Coverage Error as the evaluation metrics. Macro-F1 and Micro-F1, based on the F1 score, are two widely used evaluation criteria for multi-label learning. Macro-F1 is the arithmetic average of the F1 scores of all labels:

Macro-F1 = \frac{1}{q} \sum_{i=1}^{q} \frac{2TP_i}{2TP_i + FP_i + FN_i},    (7)

where q is the number of labels, and TP_i, FP_i and FN_i are the numbers of true positives, false positives and false negatives on the i-th label, respectively. Micro-F1 computes the F1 score from the counts pooled over all labels:

Micro-F1 = \frac{2\sum_{i=1}^{q} TP_i}{2\sum_{i=1}^{q} TP_i + \sum_{i=1}^{q} FP_i + \sum_{i=1}^{q} FN_i}.    (8)

The larger the values of Macro-F1 and Micro-F1 are, the better the classification performance is.
Let T = {(x_1, L_1), (x_2, L_2), ..., (x_N, L_N)} be the test set and L = {l_1, l_2, ..., l_q} be the label set, where N is the number of instances and L_i ⊆ L is the label set associated with the instance x_i. Let L'_i be the predicted label set for the instance x_i. Hamming Loss (HL) calculates the average fraction of misclassified labels on the test data:

HL = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{q} |L_i \oplus L'_i|,    (9)

where ⊕ denotes the symmetric difference between the label sets L_i and L'_i. The smaller the value of HL is, the better the classification performance is.
Coverage Error (CE) measures the average search depth in the label ranking list needed to cover all the correct labels of an instance. The label ranking list is obtained from the real-valued likelihood between x_i and each label produced by a multi-label classifier:

CE = \frac{1}{N} \sum_{i=1}^{N} \max_{l \in L_i} r_i(l),    (10)

where r_i(l) is the rank of label l ∈ L_i in the ranking list for the instance x_i. The smaller the value of CE is, the better the classification performance is.
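The Hamming Loss and Coverage Error definitions above can be sketched directly. The following is a minimal illustration (not the authors' code), with binary label matrices as input and rank 1 assigned to the highest-scoring label:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average fraction of misclassified labels (symmetric difference / q)."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return np.mean(Y_true != Y_pred)

def coverage_error(Y_true, scores):
    """Average search depth in the label ranking needed to cover all true labels."""
    Y_true, scores = np.asarray(Y_true), np.asarray(scores)
    depths = []
    for y, s in zip(Y_true, scores):
        ranks = (-s).argsort().argsort() + 1  # rank 1 = highest score
        depths.append(ranks[y == 1].max())   # depth covering all true labels
    return np.mean(depths)

Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
print(hamming_loss(Y_true, Y_pred))    # 1 wrong label out of 6 -> 0.1666...
print(coverage_error(Y_true, scores))  # depths 2 and 1 -> 1.5
```

Note that some formulations of Coverage Error subtract 1 from the maximal rank; the sketch follows the "search depth" reading, which only shifts all methods' scores by a constant and does not affect their ranking.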

Related work
In recent years, many multi-label feature selection methods have been proposed. Among problem transformation-based methods, Spolaôr et al. [28] use Binary Relevance (BR) [1] and Label Powerset (LP) [31] to transform multi-label data sets into single-label data sets and then employ ReliefF and mutual information to evaluate the features. Doquire et al. [5] propose a multi-label feature selection method based on mutual information using Pruned Problem Transformation (PPT) [25] (PPT+MI). BR decomposes the label set into independent binary classes; LP maps each instance's label combination to a new single label; PPT improves LP by removing instances whose label combinations occur too infrequently, according to a minimum-occurrence threshold. The chi-square statistic has also been used after PPT to select effective features (PPT+CHI) [25]. However, problem transformation methods may create too many new classes or lose label information. Algorithm adaptation-based multi-label feature selection methods select features directly from the multi-label data set. Kashef et al. [12] propose a label-specific multi-label feature selection algorithm based on the Pareto dominance concept that requires no data transformation; it considers the effect of each feature on each label and casts multi-label feature selection as a multi-objective optimization problem. Multi-label Informed Feature Selection (MIFS) [10] exploits label correlations through the latent semantics of the multi-label data and alleviates the negative effects of noisy and incomplete labels on feature selection.
Li et al. [18] propose a granular multi-label feature selection method based on mutual information. The method first employs a balanced k-means algorithm to granulate the labels into several information granules containing locally dependent labels, and then applies the maximal-relevance minimal-redundancy criterion based on mutual information to evaluate the features on each information granule.
Lee et al. [15] propose an information-theoretic multi-label feature selection method named PMU, whose evaluation criterion J(f_k) scores a candidate feature f_k against the label set L, correcting the feature relevancy with multivariate interaction terms involving the already-selected feature subset S; here f_j ∈ S denotes an already-selected feature and l_i, l_j denote labels. The larger the value of J(f_k) is, the more important the candidate feature f_k is. In addition, a multi-label feature selection method using interaction information (D2F) [16] is proposed to efficiently evaluate feature dependency in multi-label data. Both PMU and D2F evaluate feature relevancy with the accumulated mutual information \sum_{l_i \in L} I(f_k; l_i) between the candidate feature and each label. Furthermore, SCLS [17] designs a multi-label feature selection method based on scalable relevance evaluation, which assesses the conditional relevance of features more accurately when a large number of labels is involved. To the best of our knowledge, these existing information-theoretic feature selection methods do not consider the dynamic change of the classification information of labels. In fact, as the classification information of the label set is captured in the process of feature selection, the remaining uncertainty of the labels changes under the effect of the already-selected features. In this paper, we design a Relevancy Ratio to represent the change of uncertainty of each label. Moreover, a new feature relevancy term named Weighted Feature Relevancy (WFR) is defined based on the Relevancy Ratio.
Finally, a novel method named multi-label feature selection method based on Weighted Feature Relevancy (WFRFS) is proposed.
Proposed multi-label feature selection method

The definition of Weighted Feature Relevancy
In multi-label data sets, each label has its own probability distribution over the instances. Information-theoretic multi-label feature selection methods intend to select the features that are closely related to the probability distributions of the labels. Information entropy H(·) quantifies the probability distribution of a label as a numerical value and is a measure of uncertainty: the larger the entropy, the greater the uncertainty, and vice versa. An ideal feature f shares the same probability distribution with the label l_i, so that the uncertainty of l_i given f is 0, i.e., the conditional entropy H(l_i|f) = 0; an irrelevant feature f' is independent of l_i, so that the uncertainty of l_i remains unchanged given f', i.e., H(l_i|f') = H(l_i). Let L be the label set. The task of feature selection is to find a feature subset S that minimizes the conditional entropy H(L|S), i.e., makes the remaining uncertainty of the label set L given S as small as possible. However, according to the definition of conditional entropy in Formula (3), it is difficult to calculate H(L|S) directly because of the high dimensionality of the feature set and the limited number of samples. Therefore, many feature selection methods instead reduce the uncertainty of each label in the label set to obtain an approximately optimal feature subset [16,15,17].
Previous methods treat the uncertainty of each label as constant when evaluating the feature relevancy between candidate features and the label set. In fact, as more features are selected, the uncertainty of the labels changes dynamically during the feature selection process. Let S be the already-selected feature subset and L the label set. For a label l_i ∈ L, the initial uncertainty is H(l_i), and the remaining uncertainty of l_i under the condition of a feature f_j ∈ S is H(l_i|f_j). The larger H(l_i|f_j) is, the greater the remaining uncertainty of l_i, i.e., the less information f_j provides for l_i; the smaller H(l_i|f_j) is, the more classification information f_j provides for l_i. When evaluating the feature relevancy between candidate features and the label set, we should therefore consider not only the relationship between the candidate features and each label, but also the dynamic change of the labels' uncertainty under the condition of the already-selected features. If the remaining uncertainty of a label l_i is very small given the already-selected features, those features already provide enough classification information for l_i, and we should pay less attention to the relevancy between candidate features and l_i, and vice versa. Moreover, the features in S contribute to different labels to different degrees. Therefore, in order to select features that are highly correlated with the labels for which the already-selected subset has obtained little classification information, we first propose the Relevancy Ratio to quantify the contribution of the already-selected features to each label, and then define a new feature relevancy term based on it.
Definition 4.1 (Relevancy Ratio) Let L be the label set with l_i ∈ L, and let S be the already-selected feature subset with f_j ∈ S. The Relevancy Ratio R_Ratio(l_i, S) of l_i is defined as

R\_Ratio(l_i, S) = \frac{1}{|S|} \sum_{f_j \in S} \frac{H(l_i|f_j)}{H(l_i)},    (14)

where H(l_i) is the initial uncertainty of l_i and H(l_i|f_j) is the remaining uncertainty of l_i given f_j. R_Ratio(l_i, S) calculates the proportion of the label's initial uncertainty that remains under the already-selected features, and it changes dynamically as features are selected. The larger the value of R_Ratio(l_i, S) is, the more information about the label l_i remains to be captured, so the more attention we should pay to the feature relevancy between candidate features and l_i, and vice versa. Based on the Relevancy Ratio, we naturally propose a new feature relevancy term.

Definition 4.2 (Weighted Feature Relevancy) For a candidate feature f_k, the Weighted Feature Relevancy between f_k and the label set L is defined as

Rel(f_k; L) = \sum_{l_i \in L} R\_Ratio(l_i, S) \cdot I(f_k; l_i).    (15)
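To make the two definitions concrete, the sketch below estimates R_Ratio and Rel(f_k; L) from discrete data. It is an illustrative reading rather than the authors' implementation; in particular, averaging H(l_i|f_j)/H(l_i) over the selected features, and defining the ratio as 1 for an empty subset, are assumptions about the exact form of the Relevancy Ratio:

```python
import numpy as np
from collections import Counter

def entropy(x):
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def cond_entropy(x, y):
    """H(X|Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)

def mutual_info(x, y):
    return entropy(x) - cond_entropy(x, y)

def relevancy_ratio(label, selected, X):
    """R_Ratio(l, S): fraction of the label's initial entropy still remaining
    under the selected feature columns of X (averaged over S; assumption).
    Defined as 1 when S is empty, i.e., no information obtained yet."""
    if not selected or entropy(label) == 0:
        return 1.0
    return np.mean([cond_entropy(label, X[:, j]) for j in selected]) / entropy(label)

def weighted_feature_relevancy(fk, labels, selected, X):
    """Rel(f_k; L) = sum_i R_Ratio(l_i, S) * I(f_k; l_i)."""
    return sum(relevancy_ratio(l, selected, X) * mutual_info(fk, l) for l in labels)

# Toy data: feature 0 equals label l0, feature 1 equals label l1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
l0, l1 = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(relevancy_ratio(l0, [0], X))  # 0.0: f0 removes all uncertainty of l0
print(relevancy_ratio(l0, [1], X))  # 1.0: f1 is independent of l0
```

With feature 0 already selected, the weight of l0 drops to zero, so `weighted_feature_relevancy` rewards candidates that are informative about the still-uncertain label l1.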
We use Fig. 1 to illustrate the effect of Rel(f_k; L). Let f_k1 and f_k2 be two candidate features and L = {l_1, l_2} two labels. Suppose that the Relevancy Ratios between the already-selected feature subset S and l_1, l_2 are R_Ratio(l_1, S) and R_Ratio(l_2, S), respectively. As shown in Fig. 1, area 1 is I(f_k1; l_1), area 2 is I(f_k2; l_1), area 3 is I(f_k1; l_2) and area 4 is I(f_k2; l_2). Area 1 is larger than area 2 and area 4 is larger than area 3, i.e., I(f_k1; l_1) > I(f_k2; l_1) and I(f_k1; l_2) < I(f_k2; l_2). Suppose that I(f_k1; l_1) + I(f_k1; l_2) equals I(f_k2; l_1) + I(f_k2; l_2), i.e., 1 + 3 = 2 + 4. In such a situation, the traditional feature relevancy term, namely the accumulated mutual information between a candidate feature and each label, cannot determine which feature is more important for the label set L. Employing Rel(f_k; L) effectively addresses this issue. According to Formula (15), Rel(f_k1; L) = R_Ratio(l_1, S) · I(f_k1; l_1) + R_Ratio(l_2, S) · I(f_k1; l_2) and Rel(f_k2; L) = R_Ratio(l_1, S) · I(f_k2; l_1) + R_Ratio(l_2, S) · I(f_k2; l_2). There are three cases: (1) If R_Ratio(l_1, S) > R_Ratio(l_2, S), the already-selected features provide more information for label l_2 than for label l_1, so more classification information for l_1 should be obtained from the candidate features. Observing Fig. 1, f_k1 is more informative with respect to l_1 than f_k2. From I(f_k1; l_1) + I(f_k1; l_2) = I(f_k2; l_1) + I(f_k2; l_2) and R_Ratio(l_1, S) > R_Ratio(l_2, S), it holds that Rel(f_k1; L) > Rel(f_k2; L), which means that Rel(f_k; L) accurately captures the key features for the labels. (2) If R_Ratio(l_1, S) < R_Ratio(l_2, S), the already-selected features provide more classification information for label l_1 than for label l_2.
Similar to the analysis in case (1), it holds that Rel(f_k1; L) < Rel(f_k2; L); in this case, f_k2 is more important than f_k1. (3) If R_Ratio(l_1, S) = R_Ratio(l_2, S), the two labels are weighted equally, so Rel(f_k1; L) = Rel(f_k2; L) and the two candidate features are equally important. As shown in Fig. 2, the candidate features f_k1 and f_k2 may also share information with respect to the labels l_1 and l_2. In this situation, the union of areas 1 and 5 is I(f_k1; l_1), the union of areas 2 and 5 is I(f_k2; l_1), the union of areas 3 and 6 is I(f_k1; l_2), and the union of areas 4 and 6 is I(f_k2; l_2). If I(f_k1; l_1) + I(f_k1; l_2) equals I(f_k2; l_1) + I(f_k2; l_2), then 1+5+3+6 = 2+5+4+6. Areas 5 and 6 are the information the two features share with the labels, so their effect cancels out when comparing the two candidate features. It follows that 1+3 = 2+4, which is the same situation as in Fig. 1, and Formula (15) therefore applies here as well. Fig. 2 The relationship between the candidate features f_k1, f_k2 and the labels l_1, l_2.
In summary, employing the new feature relevancy can more accurately capture the discriminative features.

Proposed method
Based on the definition of WFR, we propose a novel multi-label feature selection method based on Weighted Feature Relevancy (WFRFS). Its evaluation function is

J(f_k) = Rel(f_k; L) - Red(f_k; S) = \sum_{l_i \in L} R\_Ratio(l_i, S) \cdot I(f_k; l_i) - \sum_{f_j \in S} I(f_k; f_j).    (16)

In Formula (16), Red(f_k; S) is the feature redundancy term, measured by the accumulated mutual information between the candidate feature and each already-selected feature. If two candidate features have the same relevancy to the label set, the one that is less redundant with the already-selected features is preferred. Our method uses a sequential forward search strategy: at each step, the feature f_k that maximizes J(f_k) is selected. The pseudo code of WFRFS is as follows:

Input:
The feature set F = {f_1, f_2, ..., f_n}, the label set L and the number of selected features K.
Output:
The already-selected feature subset S.
1: S ← ∅;
2: k ← 0;
3: for i = 1 to n do
4:   calculate the accumulated mutual information Σ_{l_j∈L} I(f_i; l_j);
5: end for
6: while k < K do
7:   if k == 0 then
8:     select the feature f_i with the largest Σ_{l_j∈L} I(f_i; l_j);
9:     S ← S ∪ {f_i}; F ← F \ {f_i};
10:    k ← k + 1;
11:  else
12:    for each candidate feature f_i ∈ F do
13:      calculate Rel(f_i; L);
14:      calculate Σ_{f_j∈S} I(f_i; f_j);
15:      calculate J(f_i) according to Formula (16);
16:    end for
17:    select the feature f_i with the largest J(f_i);
18:    S ← S ∪ {f_i}; F ← F \ {f_i};
19:    k ← k + 1;
20:  end if
21: end while
22: return S

There are three stages in the WFRFS method. The first stage (lines 1-5) initializes the already-selected feature subset S and the counter k (lines 1-2) and calculates the accumulated mutual information between each feature and the label set (lines 3-5). The second stage (lines 7-10) selects the feature with the maximum accumulated mutual information as the first feature. The third stage (lines 11-21) updates J(f_i) for each candidate feature according to Formula (16) and selects the feature that maximizes it, until K features have been selected.
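The pseudo code above can be sketched end-to-end in Python. This is a hedged illustration rather than the authors' implementation: the Relevancy Ratio is again taken as the mean of H(l|f_j)/H(l) over the selected features, which is an assumption about the paper's exact form:

```python
import numpy as np
from collections import Counter

def H(x):
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def H_cond(x, y):
    return H(list(zip(x, y))) - H(y)

def I(x, y):
    return H(x) - H_cond(x, y)

def wfrfs(X, Y, K):
    """Greedy WFRFS sketch: pick the first feature by accumulated mutual
    information with all labels, then repeatedly maximize
    J(f_k) = Rel(f_k; L) - sum_{f_j in S} I(f_k; f_j)."""
    n_feat, n_lab = X.shape[1], Y.shape[1]
    S, F = [], list(range(n_feat))
    # Stage 1: accumulated mutual information of each feature with the labels.
    acc = [sum(I(X[:, i], Y[:, j]) for j in range(n_lab)) for i in range(n_feat)]
    while len(S) < K and F:
        if not S:
            best = max(F, key=lambda i: acc[i])  # Stage 2: first feature
        else:
            def score(i):  # Stage 3: J(f_i) per Formula (16)
                rel = sum(
                    (np.mean([H_cond(Y[:, j], X[:, s]) for s in S]) / H(Y[:, j])
                     if H(Y[:, j]) > 0 else 0.0) * I(X[:, i], Y[:, j])
                    for j in range(n_lab))
                red = sum(I(X[:, i], X[:, s]) for s in S)
                return rel - red
            best = max(F, key=score)
        S.append(best)
        F.remove(best)
    return S

# Toy data: f0 = label 0, f1 = label 1, f2 = constant, f3 = copy of f0.
X = np.array([[0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 1]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(wfrfs(X, Y, 2))  # [0, 1]: the redundant copy f3 is never chosen
```

On this toy data the method first picks f0; the Relevancy Ratio of label 0 then drops to zero, so the second pick is f1 (informative for the still-uncertain label 1) rather than the redundant copy f3.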

Experimental results and analysis
In this section, we verify the effectiveness of the proposed method WFRFS on thirteen real-world multi-label data sets. First, the data sets and experimental settings are described in Section 5.1. Then, in Section 5.2, WFRFS is compared with three information-theoretic feature selection methods (D2F [16], PMU [15] and SCLS [17]), two problem transformation-based methods (PPT+MI [5] and PPT+CHI [25]) and one embedded method (MIFS [10]) in terms of four evaluation metrics.

Data sets and experimental settings
To evaluate the classification performance of WFRFS, experiments are conducted on thirteen real-world multi-label data sets from the Mulan library [32]. The data sets are described in Table 1. They cover five application areas: birds is used for audio categorization, emotions for emotion classification in music, scene for image categorization, yeast and genbase are sampled from the biological domain, and the remaining data sets are widely used in text categorization. The continuous features of these data sets are discretized into three bins using the equal-width strategy, as recommended in the literature [16]. The training and test sets are those already provided by the Mulan library [32].

Table 1 Description of the multi-label data sets.

No.  Data set    Instances  Features  Labels  Train  Test
1    birds       645        260       19      322    323
2    emotions    593        72        6       391    202
3    medical     978        1449      45      333    645
4    scene       2407       294       6       1211   1196
5    yeast       2417       103       14      1500   917
6    Education   5000       550       33      2000   3000
7    Entertain   5000       640       21      2000   3000
8    Health      5000       612       32      2000   3000
9    Science     5000       743       40      2000   3000
10   Social      5000       1047      39      2000   3000
11   Computers   5000       681       33      2000   3000
12   genbase     662        1185      27      463    199
13   Society     5000       636       27      2000   3000

The experimental setting is as follows. First, the number of selected features K varies from 1 to M with a step size of 1, where M is 20% of the total number of features (M = 17% for the medical data set). Second, ML-KNN [39] (k = 10) is employed as the multi-label classifier to evaluate the Hamming Loss and Coverage Error performance of WFRFS and the six compared feature selection methods. Finally, a Liblinear-based Support Vector Machine (SVM) is used as the binary classifier to evaluate the Macro-F1 and Micro-F1 performance of the seven feature selection methods.
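The equal-width discretization used in the preprocessing can be sketched as follows. This is a minimal illustration; the handling of boundary values may differ from the authors' exact setup:

```python
import numpy as np

def equal_width_discretize(X, n_bins=3):
    """Discretize each continuous feature column into n_bins equal-width bins."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        col = X[:, j]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # Using only the interior edges yields bin indices 0..n_bins-1.
        out[:, j] = np.digitize(col, edges[1:-1])
    return out

X = [[0.0, 10.0], [0.5, 20.0], [1.0, 40.0]]
print(equal_width_discretize(X))  # [[0 0], [1 1], [2 2]]
```

Each column is binned independently over its own range, so features on very different scales all end up as discrete values in {0, 1, 2} before the mutual-information estimates are computed.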
Experimental comparison and analysis
Tables 2 and 3 show the multi-label classification performance in terms of Macro-F1 and Micro-F1 obtained by WFRFS and the six compared methods. Tables 4 and 5 record the classification performance in terms of Hamming Loss and Coverage Error. These tables report the average classification results and standard deviations across the M feature subsets selected by each feature selection method. Bold fonts indicate the best performance for each evaluation metric on the thirteen real-world data sets. In addition, the last row "Avg.rank" gives the average rank of each method over all the multi-label data sets. As shown in Tables 2 and 3, the proposed method WFRFS obtains better classification performance than the compared methods in terms of Macro-F1 and Micro-F1 on nine and ten data sets, respectively. As a result, WFRFS ranks best in terms of the average rank. In particular, WFRFS significantly outperforms the three information-theoretic feature selection methods D2F, PMU and SCLS on these data sets. Tables 4 and 5 show that WFRFS also achieves the best average rank among the compared methods in terms of Hamming Loss and Coverage Error. In Table 4, the average rank of WFRFS is 1.38, the best Hamming Loss performance among the seven feature selection methods on these data sets.
In Table 5, the best average rank on Coverage Error is achieved by the proposed method WFRFS, followed by PPT+CHI, D2F, PMU, PPT+MI, SCLS and MIFS. In general, our method outperforms the compared feature selection methods. In Tables 2 and 3, the experimental results indicate that the Macro-F1 and Micro-F1 performance of WFRFS on 12 of the experimental data sets is better than that of the three information-theoretic feature selection methods D2F, PMU and SCLS. In Table 4, the proposed method achieves better Hamming Loss performance than D2F, PMU and SCLS on all experimental data sets. These results show that it is useful to consider the dynamic change of the uncertainty of labels in the design of information-theoretic methods. In addition, WFRFS outperforms the other compared methods PPT+MI, PPT+CHI and MIFS in terms of Macro-F1, Micro-F1 and Hamming Loss on most data sets. In Table 5, PPT+CHI and D2F obtain better Coverage Error performance than WFRFS on three data sets, while WFRFS obtains better Coverage Error performance than PPT+MI, MIFS and SCLS on all experimental data sets. Overall, the feature subsets selected by the proposed method are more effective.
We employ the Friedman test and the Nemenyi post-hoc test to assess the statistical significance of the differences in classification performance between the proposed method and the compared methods. Table 6 records the Friedman statistic F_F for each evaluation criterion and the corresponding critical value. As shown in Table 6, the null hypothesis (i.e., that all methods have equal classification performance) is clearly rejected for all four evaluation criteria at significance level α = 0.05. Therefore, the Nemenyi post-hoc test is used to further analyze the relative performance of each pair of methods: two feature selection methods differ significantly if the distance between their average ranks exceeds the critical distance CD, which is 2.499 at significance level α = 0.05. Fig. 6 shows the CD diagrams for each evaluation criterion, where the average ranks of the seven methods are plotted along the axis and methods whose average ranks lie within one CD of each other are connected. The results indicate that the proposed method WFRFS ranks first among all the methods. WFRFS performs significantly better than D2F, PMU, SCLS and MIFS in terms of the Macro-F1, Micro-F1 and Hamming Loss evaluation criteria. In terms of Coverage Error, the proposed method is not significantly different from PPT+MI, PPT+CHI, D2F and PMU. As a result, the proposed method is highly competitive against the compared methods.
Conclusion
To verify the effectiveness of our method, WFRFS was compared with six representative multi-label feature selection methods (PPT+MI, PPT+CHI, MIFS, D2F, PMU and SCLS) using the SVM and ML-KNN classifiers on thirteen benchmark multi-label data sets in terms of Macro-F1, Micro-F1, Hamming Loss and Coverage Error. Compared with the six multi-label feature selection methods, WFRFS obtains better classification performance. We can therefore conclude that WFRFS effectively selects a compact feature subset for multi-label data.