Abstract

Feature selection is an important way to optimize the efficiency and accuracy of classifiers. However, traditional feature selection methods cannot work with many kinds of real-world data, such as multi-label data. Multi-label feature selection was developed to overcome this challenge and plays an irreplaceable role in pattern recognition and data mining, as it can improve the efficiency and accuracy of multi-label classification. However, traditional multi-label feature selection based on mutual information does not fully consider the effect of redundancy among labels. This deficiency may lead to repeated computation of mutual information and leaves room to improve the accuracy of multi-label feature selection. To deal with this challenge, this paper proposes a multi-label feature selection algorithm based on conditional mutual information among labels (CRMIL). Firstly, we analyze how to reduce the redundancy among features based on existing papers. Secondly, we propose a new approach to diminish the redundancy among labels, which takes label sets as conditions when calculating the relevance between features and labels; this weakens the impact of the redundancy among labels on feature selection results. Finally, we analyze the algorithm and balance the effects of relevance and redundancy on the evaluation function. To evaluate CRMIL, we compare it with eight other multi-label feature selection algorithms on ten datasets and examine the results with four evaluation criteria. Experimental results illustrate that CRMIL performs better than the existing algorithms.

1. Introduction

In the era of big data, data in all fields are increasing explosively [1–3]. Therefore, feature selection has rapidly become a hot topic. Proper feature selection can improve the efficiency and accuracy of classifiers. Compared with traditional single-label feature selection, multi-label feature selection is more suitable for solving problems in the real world [4]. Therefore, multi-label feature selection has been applied to various fields, such as image processing [5, 6], text categorization [7, 8], and bioinformatics [9].

Multi-label feature selection algorithms usually consider how to reduce the influence of redundant information. Commonly used approaches include swarm intelligence algorithms [10], which regard features as individuals and groups of features as populations that reproduce, evolve, and mutate to reduce information redundancy and improve the algorithm's accuracy. Another idea is manifold learning [11], which removes features useless to classifiers from the perspective of dimension reduction. A third approach considers the relevance between features and labels by calculating the mutual information between them [12], which helps judge which features should be kept. Much prior work has proved that mutual information is an efficient tool for extracting features [13, 14]. Because mutual information is concise and effective [15], this paper explores multi-label feature selection based on mutual information.

Many multi-label feature selection algorithms have been based on mutual information [16–18]. Once the mutual information of two different features or two labels is greater than zero, redundancy appears. Although these algorithms consider the relevance between features and labels and the redundancy among features, they do not adequately process the redundancy among labels, which eventually leads to unsatisfactory results. This paper proposes a new approach to deal with the redundancy among labels and a multi-label feature selection algorithm based on this approach.
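Before going further, the mutual-information criterion can be made concrete with a short sketch. The helper below is our illustration, not code from the paper: it estimates $I(X;Y)$ in bits from two discrete sequences by counting joint and marginal frequencies. A feature that mirrors the label carries all of the label's information, while an independent feature carries none.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A feature identical to the label carries all of the label's information;
# a feature independent of the label carries none.
label  = [0, 0, 1, 1]
f_good = [0, 0, 1, 1]   # perfectly informative: I = 1 bit
f_bad  = [0, 1, 0, 1]   # independent of the label: I = 0 bits
```

This frequency-counting estimator is the standard plug-in estimate used, in one form or another, by mutual-information-based feature selection methods.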

The rest of the paper is organized as follows: Section 2 summarizes the related work. Section 3 proposes a new multi-label feature selection algorithm. Section 4 presents experiments that prove the efficiency of the proposed algorithm. Section 5 summarizes the paper and explains directions for future work.

In summary, the study offers the following contributions:
(i) We propose a new method to avoid repeated calculation of redundant label information.
(ii) We propose a novel multi-label feature selection algorithm and obtain good results; it performs better on most datasets that have redundancy among labels.
(iii) We design experiments from different perspectives to test the proposed algorithm, some of which are innovative.

2. Related Work

In the early stage of multi-label feature selection, most proposed algorithms transform multi-label datasets into multiple single-label datasets and process all of them with traditional single-label feature selection algorithms. For example, the literature [19] divides a dataset D into q independent 0/1 datasets by Binary Relevance (BR) and transforms each possible label combination into a unique class by Label Powerset (LP); the new datasets are then processed by Relief and a traditional single-label feature selection algorithm based on mutual information. However, this kind of algorithm cannot work on large datasets. To overcome this challenge, the literature [20] pruned the labels that infrequently appear in datasets, which reduces the size of the final datasets. However, this algorithm still only transforms multi-label datasets into many single-label datasets, which may ignore the interactions among features and among labels in the original datasets.

In recent years, many algorithm adaptation methods have been applied to high-dimensional feature selection. For example, the literature [21] details two stages to implement feature selection on gene datasets: in the first stage, a greedy approach assigns the maximum number of samples to different gene classes; in the second stage, clustering and lasso methods extract the remaining features. Additionally, a Deep Neural Network has been embedded into a high-dimensional feature selection method [22]. To reduce the effects of outliers and noise in datasets, the literature [23] proposes Unsupervised Feature Selection with Robust Data Reconstruction (UFS-RDR), which minimizes a graph-regularized weighted data reconstruction error function; the relevant estimation tools are also developed. To evaluate the stability of high-dimensional feature selection approaches, the literature [24] proposes a novel estimator considering the inter- and intra-stability of subsets. These high-dimensional feature selection algorithms provide ideas for multi-label feature selection. In particular, multi-label feature selection based on mutual information attracts extensive attention. The literature [25] considers the interaction between selected and unselected features and proposes MDMR, which scores each candidate feature by its relevance to the label set and its redundancy with the selected feature set. The literature [26] considers redundancy when computing the relevance between features and labels and regards the redundancy existing among information as part of the relevance.

The coefficient C should become greater when the selected features are strongly dependent on other features, and conversely, C should become smaller; the redundancy among features can therefore form part of C. Additionally, because mutual information is bounded by entropy, entropy is used to normalize it. As a result, the next feature to select is the one that maximizes the resulting evaluation function.

However, that algorithm directly computes the relevance and redundancy without further processing, which may leave the effects of relevance and redundancy unbalanced. To solve this problem, the literature [27] proposes granular feature selection, which groups features into granular feature groups; after the relevance and redundancy are computed, the results are divided by the size of the related sets, that is, the granularity. However, these algorithms do not consider the redundancy among labels. The literature [15, 28] achieves better results after considering the redundancy among labels; the two algorithms can be described as formulas (5) and (6), respectively.

Although the redundancy among labels has been considered, the redundant information may be accumulated more than once. This problem is detailed in Section 3, where we also propose a solution.

3. Multi-Label Feature Selection considering Redundancy on Mutual Information of Labels (CRMIL)

Firstly, a problem in traditional multi-label feature selection is introduced; many multi-label feature selection algorithms proposed for solving this problem still have shortcomings. To improve the accuracy, we propose a new method to compute the redundancy among labels, which reduces the redundancy among labels while calculating the relevance between features and labels. Then, the redundancy among features is computed. Finally, we present the new multi-label feature selection algorithm and detail its pseudocode.

3.1. A Problem

Traditional multi-label feature selection, which does not consider redundancy among labels, might encounter the following problem:

Figures 1 and 2 show that Feature A and Feature B contain 16% and 20% of useful information, respectively, so Feature B should be selected. If the redundancy among labels is ignored, the valuable information apparently provided by Feature A and Feature B is 24% and 20%, respectively, so Feature A will be selected because of the redundancy among labels. After the redundancy among labels is taken into account, the mutual information between the features and the labels is 16% and 20%, respectively, and Feature B will be selected. Therefore, the redundancy among labels is worth considering. The following parts focus on how we design a multi-label feature selection algorithm that considers this redundancy.

3.2. Multi-Label Conditional Mutual Information

Existing multi-label feature selection algorithms usually use conditional mutual information to calculate the redundancy among labels; in the literature [15, 28], it is essential to this computation. However, these algorithms enumerate every label as a condition and sum up all the conditional mutual information; the sum is regarded as the relevance between features and labels with the redundancy among labels diminished, as in formula (6), where f is the pending feature and the conditioning labels range over any two different elements of the label set. Once more than two labels contain the same information, the overlapping information will be counted more than once, which may reduce the accuracy of the result. Formula (6) has been proved and detailed in the literature [22].

We propose that taking part of the label set as the condition of the mutual information can overcome this challenge. In the proposed multi-label feature selection algorithm, the part that computes the redundancy among labels is detailed in formula (7), in which the condition is a subset of the label set. This can reduce the effects of the redundancy among labels.

Proof. Compared with the traditional formula (6), formula (7) does not sum over every label in the label set as a separate condition, so the information shared by several labels is counted only once. This is why formula (7) yields the better result in the example of Section 3.1: it avoids the repeated calculation of information that many labels contain.
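The conditioning idea behind formula (7) can be sketched as follows. This is our simplified illustration, not the paper's implementation: `cond_mi` estimates $I(X;Y\mid C)$ by counting, where the condition `cond` may be a single label column or tuples representing a label subset. Conditioning on a label that already carries the same information drives the residual contribution of a duplicate label to zero.

```python
from collections import Counter
from math import log2

def cond_mi(xs, ys, cond):
    """Plug-in estimate of I(X; Y | C) in bits; cond may hold tuples
    so that a whole subset of labels can act as the condition."""
    n = len(xs)
    pc   = Counter(cond)
    pxc  = Counter(zip(xs, cond))
    pyc  = Counter(zip(ys, cond))
    pxyc = Counter(zip(xs, ys, cond))
    return sum(c / n * log2((c / n * pc[cv] / n) /
                            ((pxc[(x, cv)] / n) * (pyc[(y, cv)] / n)))
               for (x, y, cv), c in pxyc.items())

feature = [0, 0, 1, 1]
l1 = [0, 0, 1, 1]
l2 = list(l1)                      # l2 repeats l1 exactly
# Conditioning on l1 removes the information that l2 merely repeats:
residual = cond_mi(feature, l2, l1)
```

`residual` is zero: given l1, the duplicate label l2 adds nothing, so the overlapping information is no longer accumulated twice.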

3.3. Alleviating the Redundancy among Features

After considering the redundancy among labels, the proposed algorithm calculates the redundancy among features. Mutual information reflects the total information shared by two random variables, and in feature selection, features can be seen as random variables. Therefore, we regard the mutual information of all pairs of features as the redundancy among features. Then, when a new feature is to be selected, the redundancy of features is computed by formula (9), where f is a pending feature.
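A minimal sketch of this pairwise-sum redundancy term (our illustration, under the assumption that formula (9) sums the mutual information between the candidate and every already-selected feature):

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def feature_redundancy(candidate, selected):
    """Sum of I(candidate; s) over the already-selected features s."""
    return sum(mi(candidate, s) for s in selected)

selected = [[0, 0, 1, 1], [0, 1, 0, 1]]
copy_of_first = [0, 0, 1, 1]   # redundant with selected[0]
fresh         = [0, 1, 1, 0]   # independent of both selected features
```

A copy of an already-selected feature accumulates 1 bit of redundancy, while an independent feature accumulates none, so the evaluation function penalizes the copy.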

3.4. Proposed Algorithms

Based on the above analysis, features with a larger value on formula (8) and a smaller value on formula (9) should be selected. After analyzing the relevance and redundancy of information, we use the size of the label set and the size of the selected feature set to balance the effects of relevance and redundancy on the results; two coefficients weight the importance of the label set and the selected feature set, respectively, and the values we choose are validated in Section 4.3. Finally, we propose a new multi-label feature selection algorithm (CRMIL), whose evaluation function J(f) combines the relevance term of formula (8) with the redundancy term of formula (9), where f is a pending feature.


Property 2. The evaluation function attains its extreme value when most of the relevance between features and labels and most of the redundancy among features reach their bounds.

Proof. The extreme value requires the above conditions to hold simultaneously; however, this is hardly the case in normal datasets.

Property 3. Because the size of the dataset is considered, the value of the evaluation function remains bounded on normal datasets.

The selection procedure works as follows. In the beginning, the selected feature set S is empty. To choose k features, we need k steps. In every step, we choose the feature with the largest J(f), put it into S, and delete it from the feature set. Finally, the output is a k-dimensional vector containing the indices of the selected features.

3.5. Pseudocode

The proposed algorithm requires a feature set F, a label set L, and the number of selected features K, and returns the index set of the selected features. Lines 1–2 initialize the index set of selected features and the number of selected features k. Lines 3–7 preprocess the relevance between features and labels with formula (7). Lines 8–22 select k features iteratively. Among these lines, lines 9–10 select the first feature: the feature with the greatest relevance is chosen because there is no element in the selected feature set yet. In lines 12–17, the redundancy among features is calculated using formula (9). In lines 18–20, after a feature is selected, it is added to the selected feature set and deleted from the original feature set. Finally, the index set of the selected features is returned.

Input: a feature set F, a label set L, and the number of selected features K.
Output: selected feature subset S.
(1) S ← ∅
(2) k ← 0
(3) for i = 1 to n do
(4) for j = 1 to m do
(5) calculate the relevance between fi and lj
(6) end for
(7) end for
(8) while k < K do
(9) if S is empty then
(10) select the feature fi with the greatest relevance
(11) else
(12) for every element fi in F do
(13) for every element fj in F except fi do
(14) sum the redundancy between fi and fj
(15) end for
(16) calculate J(fi) according to formula (16)
(17) end for
(18) S ← S ∪ {fi}, where fi has the greatest J(fi)
(19) F ← F − {fi}
(20) k ← k + 1
(21) end if
(22) end while
(23) return S.
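The pseudocode above can be sketched in Python. This is a simplified stand-in, not the paper's exact method: plain mutual information replaces the relevance term of formula (7) and the evaluation function of formula (16), and the balancing coefficients are omitted. It still shows the greedy structure: the first feature is chosen by relevance alone, and later features trade relevance against summed redundancy with the already-selected set.

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def greedy_select(features, label, K):
    """Greedy forward selection mirroring the pseudocode."""
    F = list(range(len(features)))               # pending feature indices
    S = []                                       # selected feature indices
    rel = {i: mi(features[i], label) for i in F} # lines 3-7: precompute relevance
    while len(S) < K and F:
        if not S:                                # lines 9-10: first pick by relevance
            best = max(F, key=lambda i: rel[i])
        else:                                    # lines 12-17: penalize redundancy
            best = max(F, key=lambda i: rel[i] -
                       sum(mi(features[i], features[j]) for j in S))
        S.append(best)                           # lines 18-20: update S, F
        F.remove(best)
    return S

label = [0, 0, 1, 1, 2, 2]
features = [
    [0, 0, 1, 1, 1, 1],   # f0: separates class 0
    [0, 0, 1, 1, 1, 1],   # f1: exact duplicate of f0 (pure redundancy)
    [0, 0, 0, 0, 1, 1],   # f2: separates class 2 (complementary)
]
```

With K = 2, the procedure first picks f0 and then prefers the complementary f2 over the duplicate f1, even though f1 and f2 are equally relevant in isolation.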
3.6. Time Complexity Analysis

In the following explanation, $N$ is the number of samples, $n$ is the number of features, and $m$ is the number of labels. The time complexity of the proposed algorithm comes from three main parts. Firstly, processing the mutual information among features needs to enumerate every pair of different features, which takes $O(n^2)$ steps; calculating the information entropy for each pair needs $O(N)$, so this part consumes $O(n^2 N)$. Secondly, the proposed algorithm preprocesses the relevance between features and labels, which is the main part of the algorithm: enumerating every feature and label consumes $O(nm)$, and computing each conditional mutual information consumes $O(N)$, so the time complexity of this part is $O(nmN)$. Thirdly, the algorithm needs to select $K$ features; in every selection, pending features and selected features are enumerated simultaneously, which consumes $O(nK)$ at most, so the upper bound on this part is $O(nK^2)$. As a result, the algorithm's overall time complexity is $O(n^2 N + nmN + nK^2)$; which term dominates depends on the data in the datasets.

Following the time complexity test of a prior work [29], we use an Intel(R) Core(TM) i9-9880H CPU @ 2.30 GHz to measure the time cost on different datasets; all results are averaged over five runs. For example, when a dataset consists of 850 instances, 1000 features, and 50 labels, the algorithm takes 9.2 s on average. When the number of instances is doubled, the dataset costs around 17.3 s. Furthermore, if the number of features is halved, the time needed is around 2.1 s. These results suggest that the time complexity analysis holds in practice.

4. Experimental Results

In this section, we illustrate the adaptability of CRMIL on various datasets and list the experimental results. Firstly, four evaluation criteria are explained. Then we use ten different datasets (Corel5k, Delicious, Flags, Medical, Scene, Enron, GenBase, Social, Yeast, and Emotions) to test CRMIL and compare CRMIL with eight traditional multi-label feature selection algorithms, which are SCLS [26], D2F [30], FIMF [31], PMU [3], AMI [32], NMDG [33], FSSL [34], and MFS-MCDM [35].

4.1. Evaluation Criteria

This paper uses four evaluation criteria to examine the results of multi-label feature selection: Hamming Loss, Average Precision, One Error, and Ranking Loss, which are commonly used in multi-label feature selection papers [36, 37]. Hamming Loss can be defined as follows:

$$\text{HammingLoss} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{q}\,\bigl|h(x_i)\,\Delta\,Y_i\bigr|,$$

where $h(x_i)$ is the predicted label set for every sample, $Y_i$ is the real label set, and $\Delta$ is the XOR (symmetric difference) operation. Hamming Loss reflects the misclassification of every single label; the lower the Hamming Loss, the better the classification performance. Average Precision can be defined by the following:

$$\text{AveragePrecision} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{\bigl|\{y' \in Y_i : \operatorname{rank}_f(x_i, y') \le \operatorname{rank}_f(x_i, y)\}\bigr|}{\operatorname{rank}_f(x_i, y)},$$

where $|Y_i|$ is the size of every relevant label set and $\operatorname{rank}_f(x_i, y)$ records the rank of label $y$ after all labels are sorted in descending order of the predicted score. Average Precision reflects the average fraction of relevant labels ranked higher than a specific label; greater Average Precision indicates better classification performance. One Error can be defined as follows:

$$\text{OneError} = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\arg\max_{y} f(x_i, y) \notin Y_i\Bigr].$$

One Error records the percentage of samples whose highest-scored label is not contained in the relevant label set. The lower the One Error, the better the classification performance. Ranking Loss can be defined by the following:

$$\text{RankingLoss} = \frac{1}{N}\sum_{i=1}^{N}\frac{\bigl|\{(y, y') : f(x_i, y) \le f(x_i, y'),\ (y, y') \in Y_i \times \bar{Y}_i\}\bigr|}{|Y_i|\,|\bar{Y}_i|},$$

where $f(x_i, y)$ is the likelihood that $y$ is a proper label of $x_i$ and $\bar{Y}_i$ is the complementary set of $Y_i$. Ranking Loss reflects the fraction of label pairs that are ranked in the wrong order; the lower the Ranking Loss, the better the classification performance.
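Two of these criteria can be sketched directly from their definitions. The implementations below are our simplified versions over binary indicator matrices (not the paper's evaluation code): Hamming Loss averages per-label disagreements, and One Error checks whether each sample's top-scored label is relevant.

```python
def hamming_loss(pred, true):
    """Average per-label disagreement (XOR) over N samples and q labels."""
    N, q = len(true), len(true[0])
    return sum(sum(p != t for p, t in zip(pi, ti))
               for pi, ti in zip(pred, true)) / (N * q)

def one_error(scores, true):
    """Fraction of samples whose top-ranked label is not a relevant one."""
    errs = 0
    for s, t in zip(scores, true):
        top = max(range(len(s)), key=lambda j: s[j])
        errs += 0 if t[top] == 1 else 1
    return errs / len(true)

true   = [[1, 0, 1], [0, 1, 0]]               # ground-truth label indicators
pred   = [[1, 0, 0], [0, 1, 0]]               # one wrong bit out of six
scores = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.3]]   # top-scored label correct for both
```

On this toy input, Hamming Loss is 1/6 (one of six label bits wrong) and One Error is 0 (both top-scored labels are relevant).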

4.2. Datasets

The ten datasets are from the Mulan Library [38], and Table 1 lists their detailed information. The domains of Corel5k, Flags, and Scene are images; Delicious, Medical, and Enron are text; GenBase is biology. The ten datasets span various orders of magnitude in the number of instances, features, and labels. Additionally, the datasets include different types of features, such as binary and multi-valued ones. For the experiments, every dataset has been divided into a training set and a test set following the sizes recommended by the Mulan Library.

4.3. Analysis of Experiments
4.3.1. Experiment 1

To validate the coefficient values chosen in Section 3.4, we assign different values to the two coefficients in CRMIL and test them on all datasets. The Hamming Loss results are grouped by coefficient setting, and the mean Hamming Loss within each group is taken as the group's standard value. We choose the minimum of the standard values as the normalizing number and divide all standard values by it, which yields the normalized results of all groups.
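The grouping-and-normalization step can be sketched as follows. The numbers and group keys below are hypothetical placeholders, not the paper's measurements; the function only reproduces the described procedure (group mean, then division by the smallest mean).

```python
def normalized_group_scores(groups):
    """For each coefficient group, take the mean Hamming Loss as its standard
    value, then divide every standard value by the smallest one."""
    standard = {g: sum(v) / len(v) for g, v in groups.items()}
    norm = min(standard.values())
    return {g: s / norm for g, s in standard.items()}

# Hypothetical Hamming Loss results grouped by coefficient ratio:
groups = {
    "1:1": [0.10, 0.12],
    "2:1": [0.14, 0.16],
    "4:1": [0.20, 0.22],
}
```

The best-performing group normalizes to exactly 1.0, so the bar chart of Figure 3 can be read as "how many times worse than the best setting".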

The visualized results are represented in Figure 3. The corresponding bars are the lowest when the two coefficients are equal, and the results roughly worsen as the ratio between them grows. This indicates that CRMIL selects the best feature subset when the coefficients are equal; therefore, the constant chosen in formula (15) is suitable.

4.3.2. Experiment 2

To explore the comparative performance of CRMIL on different datasets, we test CRMIL and the eight other multi-label feature selection algorithms on the mentioned datasets. The results are evaluated by Hamming Loss, Average Precision, One Error, and Ranking Loss; Tables 2–5 demonstrate all experimental results in detail. The reported values are obtained by averaging over five simulations, after which the results tend to stabilize.

According to Hamming Loss, CRMIL performs better than the best-performing of the other eight algorithms on the ten datasets. For example, CRMIL is 25.8%, 23.5%, and 12.8% better than AMI on Enron, Corel5k, and Delicious, respectively. Compared with FSSL, CRMIL optimizes the target by 9% on Flags, 15.9% on Medical, and 9.9% on Scene; the average improvement over the ten datasets is about 17.7%. In terms of Average Precision, CRMIL improves the target by 0.0232 and 0.0204 on Flags and Scene, respectively. On Medical, GenBase, and Social, compared with the best-performing of the other eight algorithms (PMU, FSSL, and FSSL, respectively), CRMIL improves the results by 46.5%, 6.3%, and 4.3%. Although CRMIL is slightly worse on Enron, the average result over the ten datasets increases by around 10.2%. Taking One Error as the evaluation criterion, CRMIL reduces the percentage of errors by 32.6% and 6.2% on Flags and Scene, respectively, and the average One Error over the ten datasets improves by approximately 17.9%. For Ranking Loss, CRMIL performs well on all ten datasets, reducing the target by 83.3%, 44.8%, 12.1%, 9.8%, 7.4%, 6.9%, 5.0%, and 3.9% on Corel5k, Enron, GenBase, Flags, Yeast, Scene, Social, and Emotions, respectively, while the target becomes 0 on Delicious and Medical.

4.3.3. Experiment 3

To study how many features CRMIL needs to select before achieving stable experimental results, we record the results on Flags and Scene as the number of selected features increases.

Figures 4–7 show the experimental results of all the mentioned multi-label feature selection algorithms on Flags when different numbers of features are selected. Because there are 19 features in Flags, we set the step of the x-axis to 1 in Figures 4–7. The ranges of Hamming Loss, Average Precision, One Error, and Ranking Loss on Flags are (0.26, 0.44), (0.6, 0.85), (0.2, 0.6), and (0.05, 0.45), respectively. Similarly, Figures 8–11 detail the experimental results on Scene; we set 20 as the step of the x-axis because the maximum number of selected features is around 110 on this dataset. The ranges of Hamming Loss, Average Precision, One Error, and Ranking Loss on Scene are (0.15, 0.4), (0.45, 0.8), (0.2, 0.8), and (0.01, 0.45), respectively. On Flags and Scene, CRMIL achieves good experimental results when the numbers of selected features are 4 and 35, respectively. By contrast, on Flags, SCLS, AMI, and FIMF do not reach stable results even when all features are selected, and the results of the other algorithms converge only when the number of selected features is about 7; on Scene, most of the compared algorithms obtain stable results only when the number of selected features is around 60. This experiment indicates that CRMIL converges faster: compared with the other algorithms, CRMIL achieves better results and stabilizes with a small number of selected features.

4.3.4. Experiment 4

To further explore the performance of CRMIL and investigate the improvement obtained when algorithms consider the redundancy among labels, we make a comparative experiment regarding SCLS as the baseline. SCLS performs multi-label feature selection using mutual information without considering the redundancy among labels. If we can measure both the redundancy among labels and the improvement of results on every dataset, we can relate label redundancy to the improvement achieved by using CRMIL, which to some extent verifies the efficiency of CRMIL on label-redundant datasets.

We take the mean optimization percentage of CRMIL over SCLS on Hamming Loss, Average Precision, One Error, and Ranking Loss as the improvement result; Table 6 details the mean values. Additionally, to make the relation easy to see, we show both the redundancy between every two labels and the total label redundancy of every dataset. To illustrate the redundancy between every two labels, we use heatmaps (Figures 12–16) of five datasets: both axes represent labels, and the heat represents the redundancy between every two labels, with brighter colors indicating more redundancy. From Figures 12–16, the heatmaps become brighter, meaning the pairwise label redundancy of the five datasets increases in the order Corel5k, Delicious, Medical, Scene, and Flags. According to Table 6 and Figures 12–16, the proposed algorithm obtains better results when more redundancy exists among labels. To describe the total label redundancy of a dataset, we use formula (18) to compute the redundant value among labels; Table 7 records the results.
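The matrix behind such heatmaps can be sketched with pairwise mutual information between label columns. This is our illustration of the general idea, not the paper's formula (18): off-diagonal entries measure shared information between two labels, and the diagonal holds each label's entropy.

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def label_redundancy_matrix(labels):
    """Pairwise MI between label columns; the kind of matrix a
    label-redundancy heatmap visualizes."""
    q = len(labels)
    return [[mi(labels[i], labels[j]) for j in range(q)] for i in range(q)]

# Three toy label columns: l0 and l1 identical, l2 independent of both.
labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
M = label_redundancy_matrix(labels)
```

A bright off-diagonal cell (here M[0][1] = 1 bit) marks a redundant label pair, while a dark cell (M[0][2] = 0) marks an independent one, matching how the heatmaps in Figures 12–16 are read.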

According to Tables 6 and 7, the larger the redundancy among labels is, the better CRMIL performs. As shown in Figure 17, the improvement of the results is roughly proportional to the redundancy among labels.

5. Conclusion and Future Work

In recent years, multi-label feature selection has become a hot topic. However, the existing multi-label feature selection algorithms have not fully considered the redundancy among labels. This paper proposes a new multi-label feature selection algorithm (CRMIL) that has considered the label set as the condition when computing the mutual information between features and labels.

To test the performance of this algorithm, we compare CRMIL with eight existing multi-label feature selection algorithms (SCLS, D2F, FIMF, PMU, AMI, NMDG, FSSL, and MFS-MCDM) on ten commonly used datasets (Corel5k, Delicious, Flags, Medical, Scene, Enron, GenBase, Social, Yeast, and Emotions) and use four evaluation criteria (Hamming Loss, Average Precision, One Error, and Ranking Loss) to evaluate results. Experimental results show that CRMIL performs better on various datasets, and the algorithm has a fast convergence speed. Furthermore, the greater the redundancy among labels is, the better the experimental results are.

However, in the proposed multi-label feature selection algorithm, when the redundancy among labels is too dense, part of the mutual information may not be counted in the final result, which can reduce the accuracy of the results. We may incorporate more high-dimensional methods to partly overcome this challenge. In the future, we will take more special cases into account, study how to handle the redundancy among labels more reasonably, and make the computed relevance between features and labels closer to its real value.

Data Availability

The data that support the findings of this study are available from the author upon reasonable request.

Conflicts of Interest

The author declares no conflicts of interest.

Acknowledgments

This research was funded by the National Key R&D Program of China (grant no. 2017YFB0802803) and the Beijing Natural Science Foundation (grant no. 4202002).