Feature selection based on fuzzy joint mutual information maximization

Abstract: Nowadays, real-world applications handle huge amounts of data, often with high-dimensional feature spaces. These datasets pose a significant challenge for classification systems. Unfortunately, many of the features present are irrelevant or redundant, making these systems inefficient and inaccurate. For this reason, many feature selection (FS) methods based on information theory have been introduced to improve classification performance. However, current methods have limitations such as dealing with continuous features, estimating redundancy relations, and considering outer-class information. To overcome these limitations, this paper presents a new FS method, called Fuzzy Joint Mutual Information Maximization (FJMIM). The effectiveness of the proposed method is verified by an experimental comparison with nine conventional and state-of-the-art feature selection methods. Based on 13 benchmark datasets, the experimental results confirm that the proposed method leads to promising improvements in classification performance and feature selection stability.


Introduction
Recently, classification systems have been widely used in many fields such as text classification, intrusion detection, bio-informatics, and image retrieval [1]. Unfortunately, the huge amount of data, which may include irrelevant or redundant features, is one of the main challenges of these systems. These undesirable features reduce classification performance [2]. For this reason, reducing the number of features by finding an effective subset of features is an important task in classification systems [2]. Feature reduction has two techniques: feature selection and feature extraction [3]. Both reduce a high-dimensional dataset into a representative low-dimensional feature subset. Feature extraction is effective when the original features fail to discriminate the classes [4], but it requires extra computation. Moreover, it changes the true meaning of the original features. In contrast, feature selection preserves the true meaning of the selected features, which is important for some classification systems [5]. Furthermore, the result of FS is more understandable for domain experts [6].
FS tries to find the best feature subset that represents the dataset well and improves the performance of classification systems [7]. It can be classified into three approaches [6]: wrapper, embedded, and filter. According to the evaluation strategy, wrapper and embedded are called classifier-dependent approaches, while filter is called a classifier-independent approach [8]. In this paper, we use the filter approach because of its advantages over the wrapper and embedded approaches in terms of efficiency, simplicity, scalability, practicality, and classifier independence [6,9]. The filter approach is a pre-processing task which finds the highly ranked features to be the input of classification systems [7,10]. There are two criteria to rank features: feature relevance and feature redundancy [11]. Feature relevance is related to how features discriminate different classes, while feature redundancy is related to how features share the same information with each other [12]. To apply these criteria, the filter approach uses weighting functions which rank features based on their significance [10], such as correlation [13] and mutual information (MI) [14]. MI overcomes the weaknesses of correlation: correlation is suitable only for linear relationships and numerical features [1], whereas MI is suitable for any kind of relationship, linear or non-linear. Moreover, MI deals with both numerical and categorical features [1].
MI has been widely used in many methods to find the best feature subset that maximizes the relevancy between the candidate feature and the class label and minimizes the redundancy between the candidate feature and the pre-selected features [15]. However, the main limitations of these methods are: (1) it is difficult to indicate the best candidate features when they carry the same new classification information [16], (2) it is difficult to deal with continuous features without information loss [17], and (3) they consider the inner-class information only [18]. In this paper, we integrate the fuzzy concept with mutual information to propose a new FS method called Fuzzy Joint Mutual Information Maximization (FJMIM). The fuzzy concept helps the proposed method exploit all possible information in the data, as it can deal with any numerical data and extract both inner- and outer-class information. Moreover, the objective function of FJMIM can overcome the feature overestimation problem, which happens when the candidate feature is completely correlated with some of the pre-selected features while being independent of the majority of the subset [8].
The rest of this paper is organized as follows: Section 2 presents the basic measures of fuzzy information theory. Then, we present the proposed method in Section 3. After that, the experiment design is presented in Section 4, followed by the results and discussion in Section 5. Finally, Section 6 concludes the paper.

Basic measures of fuzzy information theory
For the purpose of measuring the significance of features, information theory introduced many information measures such as entropy and mutual information. To enhance these measures, the fuzzy concept is used to derive new extensions of information measures based on fuzzy equivalence relations, such as fuzzy entropy and fuzzy mutual information [19,20]. Fuzzy entropy measures the average amount of uncertainty of a fuzzy relation in order to estimate its discriminative power, while fuzzy mutual information measures the amount of information shared between two fuzzy relations. In the following, we present the basic measures of fuzzy information theory. Given a dataset $D = F \cup C$, where $F$ is a set of $n$ features and $C$ is the class label, let $\tilde{F} = \{a_1, a_2, \ldots, a_m\}$ be a feature of $m$ samples, where $\tilde{F} \in F$. Let $S$ be the subset of $d$ selected features, and let the remaining set be $\{F - S\}$, where $\tilde{F}_f \in F - S$ and $\tilde{F}_s \in S$. Based on the fuzzy equivalence relation $R_{\tilde{F}}$ on $\tilde{F}$, the feature $\tilde{F}$ can be represented by the relation matrix $M(R_{\tilde{F}})$:
$$M(R_{\tilde{F}}) = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mm} \end{pmatrix},$$
where $r_{ij} = R_{\tilde{F}}(a_i, a_j)$ is the fuzzy equivalence relation between the two samples $a_i$ and $a_j$.
In this paper, the fuzzy equivalence relation between two elements $a_i$ and $a_j$ is adopted from [21]. The fuzzy equivalence class of sample $a_i$ on $R_{\tilde{F}}$ is defined as
$$[a_i]_{R_{\tilde{F}}} = \frac{r_{i1}}{a_1} + \frac{r_{i2}}{a_2} + \cdots + \frac{r_{im}}{a_m},$$
with cardinality $|[a_i]_{R_{\tilde{F}}}| = \sum_{j=1}^{m} r_{ij}$. The fuzzy entropy of a feature $\tilde{F}_1$ based on its fuzzy equivalence relation is defined as
$$H(\tilde{F}_1) = -\frac{1}{m}\sum_{i=1}^{m}\log_2\frac{|[a_i]_{R_{\tilde{F}_1}}|}{m}.$$
Let $\tilde{F}_1$ and $\tilde{F}_2$ be two features of $F$. The fuzzy joint entropy of $\tilde{F}_1$ and $\tilde{F}_2$ is defined as
$$H(\tilde{F}_1, \tilde{F}_2) = -\frac{1}{m}\sum_{i=1}^{m}\log_2\frac{\left|[a_i]_{R_{\tilde{F}_1}} \cap [a_i]_{R_{\tilde{F}_2}}\right|}{m}.$$
The fuzzy conditional entropy of $\tilde{F}_1$ given $\tilde{F}_2$ is defined as
$$H(\tilde{F}_1 \mid \tilde{F}_2) = H(\tilde{F}_1, \tilde{F}_2) - H(\tilde{F}_2).$$
The fuzzy mutual information between two features $\tilde{F}_1$ and $\tilde{F}_2$ is defined as
$$I(\tilde{F}_1; \tilde{F}_2) = H(\tilde{F}_1) + H(\tilde{F}_2) - H(\tilde{F}_1, \tilde{F}_2).$$
The fuzzy conditional mutual information between $\tilde{F}_1$ and $\tilde{F}_2$ given the class $C$ is defined as
$$I(\tilde{F}_1; \tilde{F}_2 \mid C) = H(\tilde{F}_1 \mid C) - H(\tilde{F}_1 \mid \tilde{F}_2, C).$$
The fuzzy joint mutual information between two features $\tilde{F}_1$, $\tilde{F}_2$ and the class $C$ is defined as
$$I(\tilde{F}_1, \tilde{F}_2; C) = H(\tilde{F}_1, \tilde{F}_2) + H(C) - H(\tilde{F}_1, \tilde{F}_2, C).$$
The fuzzy interaction information among $\tilde{F}_1$, $\tilde{F}_2$ and $C$ is defined as
$$I(\tilde{F}_1; \tilde{F}_2; C) = I(\tilde{F}_1; \tilde{F}_2) - I(\tilde{F}_1; \tilde{F}_2 \mid C).$$
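To make these measures concrete, the following minimal NumPy sketch (not the authors' implementation) computes fuzzy entropy, fuzzy joint entropy, and fuzzy mutual information from precomputed fuzzy relation matrices; taking the element-wise minimum as the intersection of fuzzy equivalence classes is an assumption of this sketch.

```python
import numpy as np

def fuzzy_entropy(R):
    """Fuzzy entropy of a feature given its m x m fuzzy relation matrix R."""
    m = R.shape[0]
    card = R.sum(axis=1)                # |[a_i]_R| = sum_j r_ij
    return -np.mean(np.log2(card / m))

def fuzzy_joint_entropy(R1, R2):
    """Fuzzy joint entropy; the intersection of equivalence classes is taken
    element-wise with the min operator (an assumption of this sketch)."""
    m = R1.shape[0]
    card = np.minimum(R1, R2).sum(axis=1)
    return -np.mean(np.log2(card / m))

def fuzzy_mutual_information(R1, R2):
    """I(F1; F2) = H(F1) + H(F2) - H(F1, F2)."""
    return fuzzy_entropy(R1) + fuzzy_entropy(R2) - fuzzy_joint_entropy(R1, R2)

# Toy example with two 3 x 3 fuzzy equivalence relation matrices.
R1 = np.array([[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.2],
               [0.1, 0.2, 1.0]])
R2 = np.array([[1.0, 0.7, 0.3],
               [0.7, 1.0, 0.4],
               [0.3, 0.4, 1.0]])
print(fuzzy_mutual_information(R1, R2))
```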

Proposed feature selection method
In this section, we present the general theoretical framework of feature selection methods based on mutual information. Then, we discuss the limitations of previous work. Finally, we introduce the proposed method.

Feature selection based on mutual information
Brown et al. [22] studied the existing feature selection methods based on MI and analyzed their different criteria to propose the following theoretical framework:
$$J(\tilde{F}_f) = I(\tilde{F}_f; C) - \beta \sum_{\tilde{F}_s \in S} I(\tilde{F}_f; \tilde{F}_s) + \gamma \sum_{\tilde{F}_s \in S} I(\tilde{F}_f; \tilde{F}_s \mid C).$$
This framework is a linear combination of three terms: relevance, redundancy, and conditional redundancy, which measure the individual predictive power of the feature, the unconditional relation, and the class-conditional relation, respectively. The criteria of the different MI-based feature selection methods depend on the values of $\beta$ and $\gamma$. MIM ($\beta = \gamma = 0$) [23] is the simplest FS method based on MI. It considers only the relevance term; however, it may suffer from redundant features. MIFS ($\gamma = 0$) [24] introduced two criteria to estimate feature relevance and redundancy. An extension of MIFS, called MIFS-U [24], was proposed to improve the redundancy term of MIFS by considering the uniform distribution of the information. However, both MIFS and MIFS-U still require an input parameter $\beta$. To avoid this limitation, MRMR ($\beta = \frac{1}{|S|}$, $\gamma = 0$) [25] introduced the mean of the redundancy term as an automatic value of the input parameter $\beta$. JMI ($\beta = \gamma = \frac{1}{|S|}$) [26] extended MRMR to exploit the benefit of the conditional term. In addition, Brown et al. [22] also introduced a similar non-linear framework to represent some methods such as CMIM [27]. According to [22], CMIM can be written as:
$$J_{CMIM}(\tilde{F}_f) = I(\tilde{F}_f; C) - \max_{\tilde{F}_s \in S} \left[ I(\tilde{F}_f; \tilde{F}_s) - I(\tilde{F}_f; \tilde{F}_s \mid C) \right].$$
The non-linear relation in CMIM comes from the use of the max operation. Similar to CMIM, JMIM [8] introduces a non-linear relation as follows:
$$\tilde{F}_{JMIM} = \arg\max_{\tilde{F}_f \in F-S} \left( \min_{\tilde{F}_s \in S} \left( I(\tilde{F}_f, \tilde{F}_s; C) \right) \right).$$
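As an illustration of how the linear framework unifies these criteria, the sketch below scores a candidate feature from (conditional) mutual information estimators. It is not from the original paper: mi and cond_mi are assumed callables (any discrete or fuzzy estimator could be plugged in), and the beta/gamma settings reproduce MIM, MIFS, MRMR, or JMI as described above.

```python
def framework_score(f, C, S, mi, cond_mi, beta=0.0, gamma=0.0):
    """Generic linear criterion J(f) = I(f;C) - beta*sum I(f;s) + gamma*sum I(f;s|C).

    mi(a, b) and cond_mi(a, b, c) are assumed estimators of (conditional)
    mutual information; beta = gamma = 0 gives MIM, gamma = 0 gives MIFS,
    and beta = gamma = 1/len(S) gives JMI.
    """
    relevance = mi(f, C)
    redundancy = sum(mi(f, s) for s in S)
    conditional = sum(cond_mi(f, s, C) for s in S)
    return relevance - beta * redundancy + gamma * conditional
```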

Limitation of previous work
MI has been widely used in many feature selection methods such as MIFS [24], JMI [26], mRMR [25], DISR [28], IGFS [29], NMIFS [30], and MIFS-ND [31]. However, these methods suffer from the overestimation of feature significance [8]. For this reason, Bennasar et al. [8] proposed the JMIM method to address the overestimation problem. However, JMIM may fail to select the best candidate feature when the candidates carry the same new classification information. To illustrate this problem, Figure 1 shows an FS scenario, where $\tilde{F}_1$ and $\tilde{F}_2$ are two candidate features, $\tilde{F}_s$ is the pre-selected feature subset, and $C$ is the class label. $\tilde{F}_1$ is partially redundant with $\tilde{F}_s$, while $\tilde{F}_2$ is independent of $\tilde{F}_s$. Suppose that $\tilde{F}_1$ and $\tilde{F}_2$ have the same new classification information, i.e., $I(\tilde{F}_1; C \mid \tilde{F}_s)$ (area 3) equals $I(\tilde{F}_2; C \mid \tilde{F}_s)$ (area 5). In this case, JMIM may fail to indicate the best feature because $I(\tilde{F}_1, \tilde{F}_s; C)$ and $I(\tilde{F}_2, \tilde{F}_s; C)$ are equal.
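The tie can be made explicit with the chain rule of mutual information, a standard identity stated here for the scenario of Figure 1:
$$I(\tilde{F}_1, \tilde{F}_s; C) = I(\tilde{F}_s; C) + I(\tilde{F}_1; C \mid \tilde{F}_s), \qquad I(\tilde{F}_2, \tilde{F}_s; C) = I(\tilde{F}_s; C) + I(\tilde{F}_2; C \mid \tilde{F}_s).$$
Since $I(\tilde{F}_s; C)$ is common to both expressions, equal new classification information yields equal JMIM scores, regardless of the redundancy between $\tilde{F}_1$ and $\tilde{F}_s$.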
Unfortunately, JMIM also shares some limitations with the previous methods. Firstly, it cannot directly estimate MI between continuous features [17]. To address this limitation, two approaches have been introduced. One is to estimate MI based on the Parzen window [18], but it is inefficient in high-dimensional feature spaces with sparse samples [17]; moreover, its performance depends on the chosen window function, which requires a window-width parameter [32]. The other is to discretize continuous features before estimating MI [33], but this may cause information loss [34]. Secondly, JMIM depends only on inner-class information without considering outer-class information [18].

Fuzzy Joint Mutual Information (FJMIM)
Motivated by the aforementioned limitations of JMIM, we propose a new FS method, called Fuzzy Joint Mutual Information Maximization (FJMIM). Both FJMIM and JMIM depend on the "maximum of the minimum" approach. The main difference is that JMIM maximizes the joint mutual information of the candidate feature and the pre-selected feature subset with the class, whereas FJMIM maximizes the same joint mutual information after discarding the class-relevant redundancy. To illustrate the difference using Figure 1, JMIM depends on the union of areas 2, 4, and 3, while FJMIM depends on the union of areas 2 and 3. The proposed method discards the class-relevant redundancy (area 4) because it can reduce the predictive ability of the feature subset when the candidate feature is selected [15]. On the other hand, integrating the fuzzy concept with MI has many benefits. Firstly, the fuzzy concept helps to deal directly with continuous features. Furthermore, it enables MI to take advantage of inner- and outer-class information [35]. Moreover, FS methods based on the fuzzy concept are more robust toward any change in the data than methods based on the probability concept [36].
According to FJMIM, the candidate feature $\tilde{F}_f$ must satisfy the following condition:
$$\tilde{F}_{FJMIM} = \arg\max_{\tilde{F}_f \in F-S} \left( \min_{\tilde{F}_s \in S} \left( I(\tilde{F}_f, \tilde{F}_s; C) - I(\tilde{F}_f; \tilde{F}_s; C) \right) \right).$$
FJMIM can also be written according to the non-linear framework as follows:
$$J_{FJMIM}(\tilde{F}_f) = \min_{\tilde{F}_s \in S} \left( I(\tilde{F}_f, \tilde{F}_s; C) - I(\tilde{F}_f; \tilde{F}_s; C) \right).$$
The proposed method can be summarized as finding the best feature subset of size $d$ through the following steps (a code sketch is given after the steps).
Input: $F$ is a set of $n$ features, $C$ is the class label, and $d$ is the number of selected features.
Step 1: Initialize the empty selected feature subset S.
Step 2: Update the selected feature set S and the feature set F.
Step 2.1: Compute $I(\tilde{F}; C)$ for every feature $\tilde{F}$ in the feature set $F$.
Step 2.2: Add the feature $\tilde{F}$ that maximizes $I(\tilde{F}; C)$ to the selected feature set $S$.
Step 2.3: Remove the feature $\tilde{F}$ from the feature set $F$.
Step 3: Repeat until $|S| = d$:
Step 3.1: Add the feature $\tilde{F}_f$ that satisfies $\arg\max_{\tilde{F}_f \in F-S} \left( \min_{\tilde{F}_s \in S} \left( I(\tilde{F}_f, \tilde{F}_s; C) - I(\tilde{F}_f; \tilde{F}_s; C) \right) \right)$ to the selected feature set $S$.
Step 3.2: Remove the feature $\tilde{F}_f$ from the feature set $F$.
Output: Return the selected feature set S.
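The following Python sketch illustrates the greedy procedure above. It is a minimal illustration rather than the authors' implementation: fuzzy_mi, fuzzy_joint_mi, and fuzzy_interaction are assumed estimators of the fuzzy measures defined in Section 2, and features are handled by index.

```python
def fjmim_select(features, C, d, fuzzy_mi, fuzzy_joint_mi, fuzzy_interaction):
    """Greedy FJMIM selection of d features.

    features          : list of candidate feature arrays
    C                 : class label array
    fuzzy_mi          : callable, I(f; C)
    fuzzy_joint_mi    : callable, I(f, s; C)
    fuzzy_interaction : callable, I(f; s; C)
    """
    remaining = list(range(len(features)))
    # Step 2: seed S with the single most relevant feature.
    first = max(remaining, key=lambda i: fuzzy_mi(features[i], C))
    S = [first]
    remaining.remove(first)

    # Step 3: repeatedly add the feature maximizing the "max of the min" score.
    while len(S) < d and remaining:
        def score(i):
            return min(fuzzy_joint_mi(features[i], features[s], C)
                       - fuzzy_interaction(features[i], features[s], C)
                       for s in S)
        best = max(remaining, key=score)
        S.append(best)
        remaining.remove(best)
    return S
```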

Experiment
The success of FS methods depends on different criteria such as classification performance and stability [11]. Consequently, we designed the experiment based on these criteria (Figure 2). To clarify our improvement, we compared our proposed method FJMIM with four conventional methods (CMIM [27], JMI [26], QPFS [37], and Relief [38]) and five state-of-the-art methods (CMIM3 [39], JMI3 [39], JMIM [8], MIGM [40], and WRFS [41]). The compared methods can be divided into two groups: FS based on the fuzzy concept and FS based on the probability concept. For the methods which depend on the probability concept, data discretization is required as a pre-processing step prior to the FS process, so the continuous features are transformed into ten bins using equal-width discretization [42]. Then, we selected the feature subset from all methods based on a threshold defined as the median position of the ranked features (or the nearest integer position when the number of ranked features is even). The classification performance was evaluated in terms of accuracy, precision, F-measure, AUC, and AUCPR [43]. Popular classifiers were used in this study, namely Naive Bayes (NB), Support Vector Machine (SVM), and 3-Nearest Neighbors (KNN) [44]. The average classification performance measures were computed using the 10-fold cross-validation approach [45].
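A minimal sketch of this evaluation protocol, assuming scikit-learn, is given below; the choice of GaussianNB and the default SVC kernel are illustrative assumptions, and selected_idx stands in for the subset returned by any of the compared FS methods.

```python
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def discretize_equal_width(X, n_bins=10):
    """Ten-bin equal-width discretization used for the probability-based FS methods."""
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
    return disc.fit_transform(X)

def evaluate_subset(X, y, selected_idx):
    """10-fold cross-validated accuracy of NB, SVM, and 3-NN on a feature subset."""
    Xs = X[:, selected_idx]
    scores = {}
    for name, clf in [("NB", GaussianNB()),
                      ("SVM", SVC()),
                      ("KNN", KNeighborsClassifier(n_neighbors=3))]:
        scores[name] = cross_val_score(clf, Xs, y, cv=10, scoring="accuracy").mean()
    return scores
```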

Stability
Another important evaluation criterion for FS is stability. FS stability measures the impact of any change in the input data on the FS result [46]. In this study, we measure the impact of noise on the selected feature subset. Firstly, we generate the noise using the standard deviation and the normal distribution of each feature [47]. Then, we inject noise into 10% of the data. This step is repeated ten times, and each run produces a different sequence of selected features. Finally, we compute the stability of each method using the Kuncheva stability index [48].
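The stability computation can be sketched as follows; this is a direct implementation of the Kuncheva index definition, where each run selects a subset of the same size k from n features, and the overall stability is the average index over all pairs of runs.

```python
from itertools import combinations

def kuncheva_index(A, B, n):
    """Kuncheva consistency between two feature subsets of equal size k
    selected from n features: (r - k^2/n) / (k - k^2/n), where r = |A & B|."""
    k = len(A)
    r = len(set(A) & set(B))
    expected = k ** 2 / n
    return (r - expected) / (k - expected)

def stability(subsets, n):
    """Average Kuncheva index over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_index(a, b, n) for a, b in pairs) / len(pairs)
```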

Datasets
Our experiment was conducted using 13 datasets from the UCI machine learning repository [49]. Table 1 presents a brief description of these datasets.

Results and discussion
In the result tables, the symbols ([=], [+], and [-]) indicate that, at the 5% significance level, the proposed method respectively equals, wins against, or loses against the other methods.
1) Accuracy: According to the NB classifier, FJMIM achieved the maximum average accuracy with a score of 78.02%, while Relief achieved the minimum average accuracy with a score of 75.34% (Table 2). The proposed method outperformed the compared methods in the range from 0.06 to 2.68%. With the SVM classifier, FJMIM outperformed the other methods with a score of 80.03%, while Relief achieved the minimum average accuracy with a score of 77.33% (Table 3). The proposed method outperformed the compared methods in the range from 0.69 to 2.7%. Similarly, with the KNN classifier, FJMIM kept the maximum average accuracy at 81.45%, while Relief achieved the minimum average accuracy at 77.98% (Table 4). The proposed method outperformed the compared methods in the range from 0.57 to 3.47%. Across all datasets, Figure 3(a) shows the distribution of the average accuracy values of all used classifiers. In the box-plots, the box represents the upper and lower quartiles, while the black circle represents the median. The box-plot confirms that FJMIM is more consistent and outperformed the other compared methods. Figure 3(b) shows the average accuracy over the three used classifiers. FJMIM achieved the best accuracy, followed by QPFS, JMI3, both JMI and CMIM, both CMIM3 and JMIM, MIGM, WRFS, and Relief, respectively. The proposed method outperformed the compared methods in the range from 0.6 to 2.9%.
2) Precision: Figure 4 shows the precision results of NB, SVM, KNN, and their average. FJMIM achieved the highest precision, while Relief achieved the lowest precision, and the proposed method outperformed the other compared methods.
3) F-measure: Figure 6 shows the F-measure results of the three used classifiers and their average. FJMIM achieved the highest F-measure with 79.8, 71.5, and 84% on NB, SVM, and KNN, respectively. Relief achieved the lowest F-measure on NB and KNN, while WRFS achieved the lowest score on SVM. The proposed method outperformed the other methods in the range from 0.3 to 16.6% based on NB, from 0.1 to 1.4% based on SVM, and from 0.6 to 15.2% based on KNN. According to the average over all classifiers, FJMIM achieved the highest F-measure with 78.4% and outperformed the other methods in the range from 2.5 to 10.8%. The second-best result was achieved by JMI3, followed by QPFS, CMIM3, JMI, JMIM, CMIM, MIGM, WRFS, and Relief. Figure 7 shows the distribution of F-measure across all datasets. The box-plot confirms the outperformance of FJMIM compared to the other methods.
5) AUCPR: The highest AUCPR was achieved by both FJMIM and CMIM3 using NB, by MIGM using SVM, and by FJMIM using KNN (Figure 10). On the other hand, Relief achieved the lowest AUCPR using all classifiers. According to the average over all classifiers, FJMIM achieved the best AUCPR with 81.6%, while Relief kept the lowest AUCPR at 77.2%. The proposed method outperformed the other methods in the range from 0.3 to 4.4%. The second-best AUCPR was achieved by CMIM3, followed by MIGM, both QPFS and JMIM, both CMIM and JMI, and both JMI3 and WRFS. Figure 11 shows the distribution of AUCPR across all datasets, where CMIM achieved the highest lower quartile, while FJMIM achieved the highest median and upper quartile.
Figure 12(a) shows the stability of the used FS methods on all datasets. It is clear that FJMIM is more consistent and stable than all other methods. Figure 12(b) confirms the stability of the compared methods. FJMIM achieved the highest average stability with 87.8%. The proposed method outperformed the other methods in the range from 6.6 to 43%. JMI achieved the second-best position with 81.2%, while Relief achieved the lowest stability with 44.3%. More detailed results are presented in the appendix. According to the previous results, it is obvious that FJMIM achieves the best results on most measures. This is expected because our proposed method addresses the feature overestimation problem and handles the candidate feature problem well. Moreover, it avoids the discretization step. Another reason is the advantage of both inner- and outer-class information, on which FJMIM depends; this information helps the proposed method to be more robust toward noise. On the other hand, the other compared methods are closer to FJMIM than Relief, which achieved the lowest results, because all compared methods except Relief depend, like the proposed method, on mutual information to estimate the significance of features.

Conclusions
In this paper, we propose a new FS method, called Fuzzy Joint Mutual Information Maximization (FJMIM). The proposed method depends on integrating an improved JMIM objective function with the fuzzy concept. The benefits of our proposed method include: 1) The ability to deal directly with discrete and continuous features. 2) The suitability to handle any kind of relation between features, such as linear and non-linear relations.
3) The ability to take advantage of inner- and outer-class information. 4) The robustness toward noise. 5) The ability to select the most significant feature subset and avoid undesirable features.
To confirm the effectiveness of FJMIM, 13 benchmark datasets were used to evaluate the proposed method in terms of classification performance (accuracy, precision, F-measure, AUC, and AUCPR) and feature selection stability. Compared with nine conventional and state-of-the-art feature selection methods, the proposed method achieved promising improvements in both classification performance and stability.
In future work, we plan to extend the proposed method to cover the multi-label classification problem. Moreover, we plan to study the effect of imbalanced data on the proposed method.