Abstract

Feature reduction is essential at the preprocessing stage of designing any reliable and fast disease diagnosis model. Addressing limitations such as disease specificity, information loss, and the need to handle an NP subset-selection problem in polynomial time, this paper introduces a two-step hybrid feature selection approach to identify a subset of the most relevant and contributing features of each medical dataset for constructing a diagnostic model. The concept of information gain is used in Step I to select the informative features, whereas a correlation coefficient-based approach is employed in Step II to retain the informative features possessing high dependency on the class attribute but low dependency among the non-class attributes. In particular, the two approaches are sequentially fused to select approximately optimal features in order to construct a better classification model in terms of performance and time. Optimal threshold criteria are decided to choose the most appropriate features from the datasets. The effectiveness of the proposed approach is assessed using six individual competent learners and one ensemble learner over seventeen disease datasets of smaller to larger dimensions. The empirical results indicate that the proposed approach improves performance over the datasets after feature selection, removing a considerable amount of irrelevant and redundant data.

1. Introduction

Today, the healthcare sector is considered one of the essential industries in information technology (IT). With the application of IT to the healthcare industry, a huge amount of health data is constantly being generated. As a result, this industry places a heavy workload on human professionals like doctors, nurses, and other health workers who strive to deliver effective and efficient health services (e.g., diagnosis, nursing care, counselling, therapy, and nutrition). Almost all these services are primarily associated with the diagnosis of diseases, which should be accurate and prompt (on demand). It may be noted that the use of data mining (DM) and machine learning (ML) (including statistical analysis) has been an indispensable part of IT in the healthcare industry to improve the quality of health services. Undoubtedly, it is essential to design disease diagnosis support systems (DDSSs) by applying DM and ML approaches so that healthcare professionals benefit from accurate and fast diagnosis, reducing their time and effort. Further, DDSSs can fill the gaps of the existing techniques adopted in health units, and such models avoid information loss and reduce diagnosis costs.

However, clinical datasets are very complex in nature. For example, disease datasets generally possess a huge amount of data with high dimensionality (input variables/features/attributes); data are usually collected from different sources in different formats (i.e., heterogeneous); there may exist lots of missing data, outliers, and inconsistent data in the dataset; and the characteristics of the data change dynamically. Among these complexities, the curse of dimensionality poses a major issue in designing a good DDSS. In particular, clinical data with high dimensions may contain many redundant/unnecessary attributes (i.e., highly correlated non-target features) and irrelevant attributes (features with little relevance to the class feature), and these do not contribute to designing a DDSS. Instead, they often degrade the performance of the designed DDSS [1]. For example, machine learning algorithms like C4.5 [2], K-nearest neighbour (KNN) [3], and Naïve Bayes [4] often show adverse effects on their performance due to the presence of such redundant attributes. Also, their presence in the database raises time concerns during the construction of the DDSS and during decision making. So, feature reduction is the natural solution to these concerns, and an ideal reduction approach assists in developing a stable DDSS even when the characteristics of medical data change dynamically.

Dimensionality reduction is an essential but challenging task in data mining. It helps in data compression and hence reduces storage space. It also reduces computation time. The reduction techniques are primarily divided into two categories: feature extraction (FE) and feature selection (FS). FE methods (usually applicable in image processing and natural language processing) aim to reduce the number of features in a dataset by creating new features (combining the existing ones) and then discarding the original (actual) features. On the contrary, FS methods reduce the dataset size by choosing only the relevant and non-redundant features while retaining adequate information for the learning task. Several FS approaches have been introduced so far specifically to tackle medical datasets, and research is still going on for further improvement. The systematic review by Kawamoto et al. showed research interest prior to 2005 in improving clinical practice using clinical decision support systems through FS approaches [5]. A list of feature selection-based research works carried out over the last 20 years on clinical datasets is cited here to show the substantial research interest in FS in the medical domain [6-19].

Importantly, FS techniques are being extensively applied to reduce data dimensions in big data analytics [20, 21]. In 2021, Majid and Maryam proposed a distributed ensemble imbalanced FS framework to deal with big imbalanced datasets [22]. López et al. [23] proposed a distributed feature weighting algorithm based on the RELIEF technique, originally applied to small problems, to estimate feature importance in large-scale data. Reddy et al. [24] investigated two well-known dimension reduction approaches, namely, linear discriminant analysis (LDA) and principal component analysis (PCA), from the perspective of big datasets (including the cardiotocography (CTG) and diabetic retinopathy (DR) medical datasets) and concluded that if the dimensionality of datasets is low, ML algorithms without dimensionality reduction yield better results. In 2022, Chen et al. [25] proposed a multi-tasking particle swarm optimization (PSO) approach for high-dimensional datasets (including many clinical datasets) to achieve higher classification accuracy in a shorter time than other state-of-the-art FS methods for high-dimensional classification. Interestingly, Hu et al. [26] introduced a multi-participant federated evolutionary FS algorithm for imbalanced data under privacy protection.

Very recently, graph-based methods, including graph theory [27-29], spectral embedding [30], spectral clustering [31], and semi-supervised learning [32], have been significantly used in many FS problems because of their capability of encoding similarity relationships among the features. Interestingly, these techniques may be applied in the medical field, since many medical datasets consist of images. Alelyani proposed a bagging-based ensemble approach to improve the stability of feature selection in clinical datasets using data variance reduction [33]. In 2021, Xie et al. [34] developed a standard deviation and cosine similarity-based FS approach to tackle the challenges in genomic data analysis caused by tens of thousands of dimensions combined with a small number of examples and unbalanced classes. In 2020, Sarkar proposed a two-step knowledge extraction framework for faster and accurate detection of disease [35]. The model used an entropy reduction approach to select a few of the most relevant features from each dataset, but the issue is that several features in the selected set may be correlated (i.e., redundant) among themselves, which may degrade the performance of the developed model. A few more standard published studies are listed in Tables 1 and 2, comparing their performances with the present work.

1.1. Research Scope

As of now, there is extensive literature on feature selection in the medical domain. But most of the existing approaches are disease specific, and little of the research focuses on generalizability. Further, deciding threshold criteria sufficient to identify a minimal feature set is another issue in feature reduction. Also, dimension reduction often leads to information loss. It may be noted that, for dimension reduction, researchers prefer principal component analysis (PCA), but deciding how many components to retain is a big issue in PCA. Further, feature selection is viewed as a search optimization problem. More specifically, minimum feature subset selection (MFSS) is proved to be an NP problem [36, 37]. The MFSS NP problem is mathematically explained below.

1.1.1. Minimum Feature Subset Selection (MFSS) as NP Problem

The search space in the context of MFSS includes all possible feature subsets to discover the best feature subset, and the total number of possible ways to select feature subsets is

Σ (s = 0, 1, …, n) nCs = 2^n,

where n is the dimensionality (quantity of original features) and s denotes the size of the chosen current feature subset.

Certainly, selection of zero (0) features (i.e., nC0) may be ignored, leaving 2^n − 1 candidate subsets.
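To make the growth of this search space concrete, the short Python sketch below (illustrative only; the helper names are ours, not part of the proposed method) counts the candidate subsets and shows why exhaustive enumeration quickly becomes infeasible.

from itertools import combinations

def count_feature_subsets(n):
    """Number of non-empty feature subsets of n features: sum over s of nCs = 2**n - 1."""
    return 2 ** n - 1

def exhaustive_subsets(features):
    """Enumerate every non-empty feature subset (intractable beyond small n)."""
    for s in range(1, len(features) + 1):
        yield from combinations(features, s)

print(count_feature_subsets(10))   # 1023 candidate subsets for 10 features
print(count_feature_subsets(60))   # about 1.15e18 subsets, far beyond exhaustive search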

1.1.2. Challenges

The problem of discovering the ideal feature subset is NP-hard because analysing all feature subsets is computationally costly, time-consuming, and inefficient even for small sizes. In fact, exhaustive search can find the optimal solution only if the number of variables is not too large. In particular, there is still no effective way to deal with this problem in general. That is why the problem is commonly tackled approximately by using statistical, information theory-based, or search-based strategies, including best-first, branch-and-bound, simulated annealing, genetic algorithms, and so on.

1.1.3. Present Work

In the present study, a two-step hybrid generic model is proposed to identify an ideal subset of features for medical datasets, drawing on the strengths of two existing statistical measures: information gain and the correlation coefficient. The hybrid approach is a polynomial time approximation approach to tackle the MFSS problem. More specifically, the concept of information gain is used in the first step to identify the most relevant non-target attributes (those with higher information gain), whereas a correlation coefficient-based approach is used in the subsequent step to search for the non-target features having maximum dependency on the class attribute but minimum dependency among themselves. Optimal threshold criteria are decided based on a trial-and-error approach in Step II to select the most informative features from the datasets, whereas the threshold value in Step I is set deterministically. Generally, threshold limits are determined by expert knowledge, but such decisions often do not yield good solutions, and they may vary from problem to problem. That is why the threshold values in Step II are decided based on the trial-and-error approach. The approach also includes a provision for decreasing and increasing the threshold values dynamically. Hence, we may claim that this approach can substantially compress storage and reduce time, with minimal information loss (since improvement is observed in the performance metrics). Both steps follow a backward elimination technique to retain the best features.

1.2. Contributions of the New Hybrid Approach

(i) Operating MFSS (an NP problem) in polynomial time through hybridization and setting optimal threshold values via a trial-and-error approach to obtain maximum accuracy and minimum false rate. The time complexity of the approach is O(n³), where n is the number of attributes/features in the dataset.
(ii) It is a generic feature reduction model for medical datasets, i.e., it targets medical datasets irrespective of the particular disease. The model may even perform well for datasets outside the medical domain, which is not a drawback.
  (a) The speciality in handling disease datasets comes from applying the information gain-based approach and the correlation coefficient-based approach in sequence. Medical datasets usually possess many irrelevant and redundant features. The information gain approach in the first step aims to eliminate the irrelevant features, whereas the correlation coefficient-based approach in the second step aims to eliminate both irrelevant and redundant features (without losing the attributes with maximum information gain).
  (b) The speciality regarding disease generalizability lies in the provision for changing the threshold values decided for the (non-target, target) and (non-target, non-target) attribute pairs in the datasets.
(iii) Preventing information loss due to feature reduction, since the model aims not to lose the informative features while processing features in Steps I and II. Prevention of information loss is validated through performance measuring metrics like accuracy, true positive rate (TPR), false positive rate (FPR), and area under the receiver operating characteristic curve (AUROC).
(iv) Datasets of different dimensions and rarely considered datasets like Arrhythmia, Lower Back Pain, Malaria, and Parkinson's are used in the experiments of this study.
(v) The percentage of feature reduction achieved by the new model is high.

1.3. Organization of the Article

The rest of the paper is organized as follows. Section 2 includes previous works related to the present work. Section 3 discusses the proposed methodology in detail. The implementation of the method, the obtained results, and analysis of the results are illustrated in Section 4. The conclusions and the future scope are presented in Section 5.

2. Previous Works

Prior to the model description, previous works related to the proposed model are included in this section. It may be noted that basic knowledge on dimensionality reduction and its very common categories, namely, feature selection and feature extraction, is already included in the Introduction section. Truly, before feature reduction using machine learning approaches, a few features may simply be discarded as follows:
(i) A domain expert may remove unnecessary features.
(ii) Features exceeding a certain threshold of missing values may be removed.

Next, when interdependence among features or poorly contributing features is suspected, machine learning-based feature reduction approaches need to be applied. Based on whether data are labelled, unlabelled, or partially labelled, the standard feature selection methods are usually divided into three main categories: supervised, unsupervised, and semi-supervised [38]. A supervised method selects and evaluates convenient features based on labelled data; the entropy-based technique is a supervised FS technique. On the other hand, unsupervised FS techniques ignore the target variable and remove redundant variables. The correlation coefficient-based approach is usually considered an example of an unsupervised FS method. Evaluation and selection of features in the unsupervised method are based on the ability to satisfy certain properties of the dataset, such as locality preservation and variance. However, in many datasets only a small amount of labelled data is available, and obtaining further labels is costly. So, semi-supervised or constrained methods are used in such cases. In particular, the semi-supervised FS method uses both labelled and unlabelled data.

Further, based on the evaluation methods adopted for feature selection, the methods may be categorized as filter, wrapper, embedded, ensemble, and hybrid approaches [39-41]. In the filter-based method, four types of evaluation criteria, namely, dependency, information, distance, and consistency (i.e., unambiguity), are used. These methods are classifier independent, so they have better generalization ability but ignore interactions with the classifier. For more details about filter-based approaches, one may refer to the studies [42, 43]. On the contrary, in the wrapper-based method a learning algorithm is iteratively employed to evaluate the quality of feature subsets in the search space. This method interacts with the classifier frequently and focuses on minimizing the prediction error, so its major issue is computational complexity. Some common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination. In this regard, one may refer to the recent studies [44, 45]. The embedded method builds the FS mechanism into the learning algorithm and uses the algorithm's own properties for feature evaluation; it interacts with the classifier, so it is classifier specific, but it is cheaper than the wrapper method since it interacts with the classifier only once (not repeatedly). LASSO and RIDGE regressions are popular examples of this method; they have built-in penalization functions to reduce overfitting, although their time complexity is also high. Ensemble approaches are often a good way to tackle the limitations of the individual approaches: in general, an ensemble model constructs a group of feature subsets and then produces an aggregated result out of the group. Finally, approaches based on hybridization combine the wrapper model's predictive performance with the filter model's computational efficiency. However, accuracy may be a challenge in the hybrid model, since the filter and wrapper models are treated as two separate steps [46]. So, new ideas are needed to design a hybrid model that can improve the performance of the learners while consuming fewer computational resources. A hybrid method can be formed by combining two or more different methods (usually filter methods) and attempts to inherit the strengths of the individual methods.
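As a quick illustration of the filter/wrapper distinction discussed above, the following hedged scikit-learn sketch (using a synthetic stand-in dataset, not the clinical datasets or learners used later in this paper) contrasts a classifier-independent filter criterion with a classifier-driven wrapper.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a clinical dataset: 30 features, 10 of them informative.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10, random_state=0)

# Filter: score each feature independently of any classifier (information-based criterion).
filter_selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

# Wrapper: repeatedly refit a classifier, discarding the weakest features each round.
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

print("Filter picks:", filter_selector.get_support(indices=True))
print("Wrapper picks:", wrapper_selector.get_support(indices=True))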

3. Proposed Hybrid Feature Selection Approach

The conceptual view of the proposed selection-based feature reduction approach is depicted in Figure 1. The hybrid approach is, indeed, Phase II of the entire work carried out in the present study. More specifically, Phase I of the model deals with the original datasets (drawn directly from several sources) and the performance measures of the selected competent learners over the chosen datasets. Phase II attempts to search for the most appropriate features of each original dataset. Finally, Phase III uses the same learners and the same infrastructure (as applied in Phase I) to measure their performances over the datasets with reduced features.

3.1. Definition

(i) Original Dataset. Medical datasets with original features (drawn directly from the data repository) are termed here as the original datasets. In such a dataset, features/attributes are recommended by experts (physicians).
(ii) Relevant Attribute. As per the existing literature, a non-target feature (x) is relevant to the target attribute (C) if the two are highly dependent (or correlated); otherwise, it is irrelevant.
(iii) Redundant Attribute. As per the existing literature, a non-target feature (x) is redundant with another non-target attribute (y) if the two are highly dependent (or correlated). Here, the same information about the dataset is carried by both x and y.

Details of Phase II. The very common but unimportant nominal attributes (e.g., id number, zip code, eye colour, and so on) in a medical dataset are first discarded. Next, the suggested FS method is applied in Phase II. In fact, this phase consists of two steps: Step I and Step II. An entropy-based approach is employed in Step I to choose the most informative features from each original dataset. In Step II, the feature set decided by Step I (for each dataset) is passed to a correlation-based approach that finds the association between target and non-target attribute pairs and then between non-target attribute pairs in order to remove irrelevant and redundant non-target attributes. In particular, two supervised feature selection approaches are fused here. Truly, the entropy-based approach emphasizes identifying the most relevant informative features, but some of these may have strong dependency among themselves, which results in redundant attributes. Certainly, identifying such redundant attributes cannot be resolved through this approach alone. However, the correlation-based approach has the capability to tackle this limitation. That is why the entropy-based approach is applied in the first step and the correlation-based approach in the subsequent step.

Importantly, the optimal threshold criteria in Step II are decided based on the trial-and-error approach in order to retain the most informative and essential features of the datasets, whereas the threshold value in Step I is set deterministically to select only the informative features. More specifically, two threshold values in the correlation-based approach are set by applying the trial-and-error approach: one for checking the relationship between non-target and target attribute pairs and the other for non-target and non-target attribute pairs. Actually, the threshold values in the proposed approach are chosen to yield maximum improvement, or at least no loss, in the performance of the learners over almost all the chosen clinical datasets. More details about the threshold values are given in the respective algorithm sections. It may be noted that the selection of threshold values (through the trial-and-error approach) assists in solving the MFSS NP problem approximately.

3.1.1. Concept Adopted in Step I Using Information Gain Measure

The approach first computes the information gain, Gain(E, Ai), of each attribute Ai and then finds their mean (i.e., mean_Gain = Σi Gain(E, Ai)/n) and standard deviation. Next, the parameter Threshold_value is set as

Threshold_value = mean_Gain − Gain_std.

Here, Gain_std is the standard deviation of the information gain values.

Finally, it filters the attributes as follows. If any attribute has information gain less than Threshold_value, then it is discarded from the set of attributes. Thus, the procedure reduces the search space and enables filtering out the right informative attributes. The algorithmic version of this logic is presented in Algorithm 1.

Suppose a dataset (DS) of classification problem has n attributes, say Ai, (i = 1, …, n) and N instances. So, DS refers to the given dataset with N instances and n dimensions (attributes). Now let F be the set of original features of DS, where F = {A1, A2, …, An}. Further, let Fs denote the set of features, consisting of the most relevant informative features taken from F. Initially, Fs = F = {A1, A2,…, An}.
Goal: elimination of the non-informative features from Fs.
Input: DS  //Dataset with n features and N instances
Output: Fs  //Feature set reduced from n features to m features, m ≤ n
Parameter: Threshold_value
Variables: Gain_measure[1, …, n], Gain_sum = 0, mean_Gain, Gain_square_diff = 0, Gain_std
 begin
 1. for each attribute: Ai (i= 1, …, n) of DS do
  begin
   1.1. Compute the entropy reduction measure for Ai as: Gain(E, Ai) = Entropy(E) − Σj (|Ej|/|E|) Entropy(Ej), where Ej
     (j = 1, …, k) denotes the subset of examples for which Ai takes its j-th value, Entropy(E) = −Σm pm log2(pm), |E| returns the number of
    examples in DS, and pm = |Em|/|E|, where |Em| is the number of m-th class examples, out of c classes.
   1.2 Gain_measure[i] = Gain(E, Ai)//Stores i-th attribute’s (Ai) information gain in the i-th location of the Gain_measure[ ] array
  endfor
 2. for i: = 1 to n do
   Gain_sum = Gain_sum + Gain_measure[i]
  endfor
 3. mean_Gain = Gain_sum/n//finds mean value (mean_Gain) of information gain measures
 4. for i: = 1 to n do
  4.1Gain_square_diff = Gain_square_diff + square(Gain_measure[i] – mean_Gain)//square is the math function
  endfor
 5. Gain_std = sqrt(Gain_square_diff/n)//finds standard deviation (Gain_std) of information gain measures
 6. Threshold_value = mean_Gain – Gain_std
 7. for each attribute: Ai (i = 1,…., n) of DS do
  7.1 If Gain (E, Ai), (i = 1, …, n) < Threshold_value, then discard Ai from Fs, i.e., Fs = Fs – {Ai}//It is backward elimination.
  endfor
 end//of the algorithm

Complexity Analysis. The algorithm is very simple and straightforward. Its running time (including entropy calculation time) is simply O(n²), where n is the number of attributes in the dataset. The algorithm is implemented in Python 3.9.
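For readers who prefer executable code to pseudocode, the following minimal Python sketch mirrors the logic of Algorithm 1, assuming discrete-valued attributes held in a pandas DataFrame; the function names are illustrative rather than the exact implementation.

import numpy as np

def entropy(labels):
    """Shannon entropy of a class-label column (a pandas Series)."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, attr, target):
    """Gain(E, A) = Entropy(E) - sum_j (|Ej|/|E|) * Entropy(Ej), with Ej split by the values of A."""
    total = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attr)
    )
    return total - weighted

def step1_select(df, target):
    """Keep the attributes whose gain is at least (mean gain - standard deviation of gains)."""
    attrs = [c for c in df.columns if c != target]
    gains = {a: information_gain(df, a, target) for a in attrs}
    values = np.array(list(gains.values()))
    threshold = values.mean() - values.std()
    return [a for a in attrs if gains[a] >= threshold]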

Note. The decided Threshold_value = (mean_Gain − Gain_std) is statistically appropriate for filtering features. In particular, a feature whose information gain is less than the Threshold_value is assumed to contribute very little to constructing the expert system and may be ignored/removed from the feature set. This strategy is a kind of search-based filtration technique for the dimensionality reduction task. Now, the joint entropy of the discarded features is checked by using the inequality H(X1, X2, …, Xk) ≤ H(X1) + H(X2) + … + H(Xk), which always holds, with equality if and only if the variables are statistically independent; a strict inequality therefore indicates dependence. One may note that Shannon’s joint entropy formula for two ensemble variables X and Y is defined as H(X, Y) = −Σx Σy p(x, y) log2 p(x, y).

Now, if X and Y are dependent, we may not directly find the measure of dependency level by using the inequality. However, it can be obtained from correlation measures, and so correlation measure is used in the subsequent step of the ensemble approach.
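A minimal sketch of this check (the helper names are illustrative; discrete-valued columns in a pandas DataFrame are assumed) is given below: the gap between the sum of marginal entropies and the joint entropy is zero for independent features and positive otherwise.

import numpy as np

def joint_entropy(df, columns):
    """H(X1, ..., Xk) = -sum p(x1, ..., xk) log2 p(x1, ..., xk) over observed joint values."""
    probs = df.groupby(list(columns)).size() / len(df)
    return float(-(probs * np.log2(probs)).sum())

def dependence_gap(df, columns):
    """Sum of marginal entropies minus the joint entropy; 0 for independent columns, > 0 otherwise."""
    marginals = sum(joint_entropy(df, [c]) for c in columns)
    return marginals - joint_entropy(df, columns)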

Importantly, information gain is preferred here over the Gini index to remove irrelevant attributes, since the Gini index favours larger partitions, whereas information gain handles smaller partitions with small counts spread over multiple distinct values better.

3.1.2. Concept Adopted in Step II Using Correlation Coefficient Measure

The statistical measure, the correlation coefficient, represents the strength of association between variables. Its values lie in [−1, 1]. In this study, Pearson’s product moment correlation coefficient is employed. The adopted correlation coefficient-based logic to reduce features is first shown graphically in Figure 2 (a wheel of a complete graph) for easy understanding. The entire logic is described in two parts, namely, Substeps I and II (as shown in Figure 2).

Logic to Decide Initial Threshold Values. The initial threshold values, primarily decided by the trial-and-error approach (based on 10 trials), are set as

Threshold_value1 = 0.4 × Max_rnc, where Max_rnc = max over i of Cor(Ai, C).    (3)

Here, Ai is the i-th attribute and C is the class attribute.

Threshold_value2 = 0.75 × Max_rnn, where Max_rnn = max over i < j of Cor(Ai, Aj).    (4)

Here, Ai and Aj are the i-th attribute and j-th attribute, respectively.

This logic is implemented using Python 3.9.

Note. Here, selection of the best features is done via removal of the irrelevant and redundant features from the feature set.

The high-level description of the logic is presented in Algorithm 2.

Suppose the dataset (DS) now has m attributes, say Ai, i = 1, …, m (after applying Step I), with m ≤ n. Therefore, DS now refers to the given dataset with N instances and m dimensions (features), and the feature set (Fs) of DS is now described as Fs = {A1, A2, …, Am}.
  (i) The correlation value between two variables x and y (denoted as Cor(x, y)) is computed by the formula Cor(x, y) = cov(x, y)/(sqrt(var(x)) · sqrt(var(y))), where cov(x, y) = (1/N) Σi (xi − x̄)(yi − ȳ) and var(x) = (1/N) Σi (xi − x̄)². Likewise, we get the variance for y (i.e., var(y)).
  (ii) The approach includes provision of decremental and incremental scope of threshold values dynamically.
Goal: removal of irrelevant features (via non-target to target co-relationship) and removal of redundant (with less information gain) features (via non-target to non-target co-relationship)
Input: DS //Dataset with m features and N instances
Output: Fs  //Feature set reduced from m features to k features, k ≤ m
Parameter: Threshold_value1, Threshold_value2//For storing threshold values
Variables:
 rnc[1, …, m], rnn[1, …, (m − 1)][1, …, (m − 1)], Atemp  //Used matrices and a temporary variable, Atemp
 Max_rnc = 0  /* for capturing the maximum correlation value between the class (target) attribute and the non-target attributes */
 Max_rnn = 0  /* for capturing the maximum correlation value between pairs of non-target attributes */
Step 1. Find correlation coefficient matrix (C) of size (m + 1) × (m + 1) for dataset DS with total (m + 1) attributes including the class (target) attribute (placed at the last column of the matrix).
 Actually, the matrix: C(m + 1) × (m + 1), is represented by the following two arrays
  (i) rnc[1, …, m]: a 1-D array to store correlation measures between non-target and class attribute pairs:
   (e.g., rnc[1] = Cor(1,class) (i.e., correlation measure between attribute 1 and the class), rnc[2] = Cor(2,class),…)
 and
  (ii) rnn[1, …, m][1, …, m]: a 2-D array to store correlation measures between non-target and non-target attribute pairs
   (e.g., rnn[1][2] = Cor(1, 2), i.e., correlation measure between attributes 1 and 2), …
Step 2. Find the maximum value from rnc[i], i = 1, …, m and store that at Max_rnc
Step 3. Find the maximum value from rnn[i][j], i = 1, …, (m − 1); j = (i + 1), …, m and store that at Max_rnn.
Substep I of Step II. /* Removal of irrelevant non-target features from Fs using Threshold_value1 set for (non-target, class) attribute pairs */
Step 4. Threshold_value1 = 0.4 × Max_rnc  //i.e., 40% of Max_rnc
Step 5. For each attribute Ai (i = 1, 2, …, m) of the current DS do
  Step 5.1. If rnc[i] < Threshold_value1, then discard Ai from Fs, i.e., Fs = Fs − {Ai}.  //Backward elimination
    endfor
Step 6. If all the attributes are discarded from Fs (i.e., Fs = Φ), then perform the following substeps, else go to Step 7.
  Step 6.1. Threshold_value1 = Threshold_value1 − 0.1 × Max_rnc and take Fs = {A1, A2, …, Am}.
  Step 6.2. If Threshold_value1 > 0, then go to Step 5.
Substep II of Step II. /* Removal of redundant non-target features from the current Fs using Threshold_value2 set for (non-target, non-target) attribute pairs */
Step 7. Threshold_value2 = 0.75 × Max_rnn  //i.e., 75% of Max_rnn
Step 8. For each attribute Ai (i = 1, 2, …, k − 1) in the current DS do  //DS with reduced features
  Step 8.1. For each attribute Aj (j = i + 1, …, k) in the current DS do
   Step 8.1.1. If rnn[i][j] > Threshold_value2, then find Atemp = min_information_gain(Ai,Aj) and
         discard Atemp from Fs, (i.e., Fs = Fs − {Atemp}) if not already discarded.
   /* min_information_gain(Ai, Aj) returns the attribute with minimum information gain between the two attributes */
   endfor
  endfor
Step 9. If all the attributes are discarded from Fs (i.e., Fs = Φ), then perform the following substeps, else go to Step 10.
  Step 9.1. Threshold_value2 = Threshold_value2 + 0.1 × Max_rnn and take the Fs obtained after Step 6.
  Step 9.2. If Threshold_value2 < Max_rnn, then go to Step 8.
Step 10. Stop
3.2. Complexity Analysis

The algorithm is very simple and it uses two loops (in cascaded fashion), each continuing for a maximum of m times. So, its running time (including correlation coefficient computation time) is simply O(m³), where m is the number of attributes in the dataset. The approach is implemented in Python 3.9.
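The sketch below gives one possible Python rendering of Algorithm 2 under simplifying assumptions (Pearson correlation via pandas, a numerically encoded class column, and omission of the dynamic threshold increment of Step 9 for brevity); the names and structure are illustrative rather than the exact implementation.

import numpy as np

def step2_select(df, target, gains):
    """Correlation-based filtering after Step I.

    df     : pandas DataFrame restricted to the Step I features plus the class column
    target : name of the class column (assumed numerically encoded)
    gains  : dict of information gain per attribute (from Step I), used to drop the weaker of a pair
    """
    attrs = [c for c in df.columns if c != target]
    corr = df.corr(numeric_only=True).abs()   # absolute Pearson correlations

    # Sub-step I: drop attributes weakly correlated with the class attribute.
    rnc = corr[target].drop(target)
    max_rnc = rnc.max()
    t1 = 0.4 * max_rnc
    while t1 > 0:
        kept = [a for a in attrs if rnc[a] >= t1]
        if kept:
            break
        t1 -= 0.1 * max_rnc            # relax the threshold if everything was discarded
    else:
        kept = attrs                   # fall back to the full Step I feature set

    # Sub-step II: among survivors, drop the lower-gain member of each highly correlated pair.
    rnn = corr.loc[kept, kept]
    max_rnn = rnn.where(~np.eye(len(kept), dtype=bool)).max().max()
    t2 = 0.75 * max_rnn
    selected = list(kept)
    for i, ai in enumerate(kept):
        for aj in kept[i + 1:]:
            if rnn.loc[ai, aj] > t2:
                weaker = ai if gains[ai] < gains[aj] else aj
                if weaker in selected:
                    selected.remove(weaker)
    return selected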

4. Experimental Results and Discussion

To assess the performance of the proposed feature selection model, several extensive experiments are performed over seventeen publicly available datasets drawn from several machine learning repositories (e.g., UCI [47], Kaggle [48], and OpenML [49]). In particular, the values of the performance metrics obtained (before and after applying the suggested hybrid approach) by six state-of-the-art and well-known learners over the datasets are presented in Tables 3–5, respectively. Importantly, each learner belongs to one specific type of learning strategy, such as J48 (a decision tree-based rule inducer [2]), JRip (the Java version of Repeated Incremental Pruning to Produce Error Reduction (RIPPER)) [50] (a sequential covering algorithm), a nature-inspired artificial neural network [51] (here, a 3-layer NN with one hidden layer is taken, where the input layer has n neurons, one for each input parameter, the output layer has k neurons, one for each class, and each neuron uses the sigmoid function), KNN [3] (a distance/instance-based learner), Naïve Bayes [4] (a probability-based learner), and support vector machine (SVM) [52] (with the popular radial basis function (RBF) kernel). In fact, to show the performance of the proposed feature reduction model rigorously, learners are chosen based on different strategies. The experiments with the learners are performed on the Weka (Waikato Environment for Knowledge Analysis) platform (http://www.cs.waikato.ac.nz/ml/weka). On the other hand, the proposed combined feature selection model is implemented using Python 3.9.

Used Performance Measuring Metrics. The results of the standard performance metrics (prediction accuracy, TPR, FPR, and AUROC) obtained by the machine learning algorithms, applied before and after the proposed hybrid feature selection approach, are used to assess the effectiveness of the approach. In brief, classification accuracy (CA) is the ability to predict the right classes correctly. TPR is the proportion of actual positives that are classified as positive; a high TPR indicates that most of the positive cases in (TP + FN) are correctly labelled as positive. In medical models, we always expect high recall and low FPR. In fact, FPR gives the proportion of negative cases incorrectly predicted as positive. This measure, together with the related false negative rate, is extremely important in medical testing. Undoubtedly, false positives increase mental worry, so a high FPR is always undesirable and must be minimized. Lastly, AUROC measures the quality of predictions irrespective of the chosen classification threshold; an AUROC close to 1 is desirable.
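For a binary-class dataset, these four metrics can be computed as in the following short scikit-learn sketch (the helper name is illustrative).

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    """Accuracy, TPR, FPR, and AUROC for a binary classifier.

    y_score: predicted probability of the positive class, needed for AUROC.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "TPR": tp / (tp + fn),   # recall / sensitivity
        "FPR": fp / (fp + tn),   # false alarms among actual negatives
        "AUROC": roc_auc_score(y_true, y_score),
    }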

Description of the Datasets. In this study, several datasets with different properties are used in the experiments to demonstrate the robustness and effectiveness of the introduced hybrid model. The primary characteristics of the datasets like no. of non-target features, no. of classes, no. of instances, and presence/absence of missing values are presented in Table 6.

A brief description of each of the selected datasets is presented in Table 7.

Logic to Handle Missing/Null Values in the Datasets. Missing or null values exist in some of the datasets. To process the missing values in the attributes, the following strategy is adopted:
(i) If any attribute in a dataset possesses more than 60% missing values, then that attribute is simply dropped from the dataset.
(ii) Otherwise (for an attribute with less than 60% missing values), missing values are replaced by the mean value if the attribute type is continuous; otherwise, they are replaced by the value with the maximum frequency.
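A minimal pandas sketch of this policy (the 60% cut-off and the mean/mode imputation described above; the function name is illustrative) might look as follows.

import pandas as pd

def handle_missing(df, max_missing_ratio=0.6):
    """Drop attributes with too many missing values; impute the rest (mean for numeric, mode otherwise)."""
    df = df.copy()
    for col in list(df.columns):
        ratio = df[col].isna().mean()
        if ratio > max_missing_ratio:
            df = df.drop(columns=col)                              # attribute dominated by missing values
        elif ratio > 0:
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())           # continuous: mean imputation
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])   # nominal: most frequent value
    return df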

Data analysis from Table 6:
(i) Number of datasets with fewer than 10 features: 2 (11%); these are E. coli and New Thyroid.
(ii) Number of datasets with more than 10 but fewer than 50 features: 13 (76%).
(iii) Number of datasets with more than 50 features: 3 (17.6%); these are Arrhythmia, Colon Cancer (2000 features), and Lung Cancer.
(iv) Number of binary-class datasets: 10 (58.8%); number of multi-class datasets: 7 (41%).

From the data analysis, it may be noted that the chosen datasets are of different sizes with a diverse number of features (from smaller to larger, big-data-like dimensions). For instance, Arrhythmia, Colon Cancer, and Lung Cancer are significantly high dimensional datasets with small sample sizes, whereas COVID-19 is an example of a low dimensional dataset with a large number of samples. On the other hand, Primary Tumour is a multi-class dataset with twenty-two different classes.

4.1. Experimental Results

This section first describes the experiments conducted in the present research over the selected clinical datasets. The obtained results are presented in the tables, and the results are then analysed.

In the experiments, the number of selected features (including the percentage of dimension reduction), the classification accuracy (%), TPR, FPR, and AUROC are used as the performance measures to evaluate the proposed model. First, the number of features reduced from the original datasets by the proposed feature selection approach and by its individual steps is reported in Table 8. More specifically, the table describes, respectively, the name of the dataset (DN), the number of instances (NI), the number of features (NF) in each original dataset, and the number of features (NF) reduced individually by Step I and Step II and by their combination. Next, the significance of the introduced approach is affirmed through the standard performance metrics attained by the classifiers, namely, J48, JRip, KNN, ANN, NB, SVM, and J48 + JRip, over the chosen benchmark datasets. Importantly, the NB learner is chosen because it works better on datasets with independent features, and the suggested approach focuses on identifying such features. In particular, the accuracy results for each dataset are shown in Table 3 as follows:
(i) Results obtained prior to applying the proposed hybrid approach.
(ii) Results obtained after applying the proposed approach; these results are shown within parentheses as (results obtained by applying Step I separately, results obtained by applying Step II separately, results obtained by applying Steps I and II combined).

Likewise, the (TPR, FPR) and AUC (ROC) metrics obtained from the employed learners over the datasets are presented in Tables 4 and 5, respectively, following the same order as adopted in case of accuracy result presentation.

For better estimation of the performance metrics of the learners, each experiment is repeated 10 times based on a 10-fold cross validation scheme. Thus, each entry of Tables 3–5 denotes the mean value of the findings obtained from 10 independent runs, where each run applies the 10-fold cross validation scheme. Particularly, in each column corresponding to each row of the performance tables, the best mean value (if obtained by any learner after feature reduction) is marked in bold. A head-to-head comparison of the dimensionality reduction achieved by Step I singly, Step II singly, and their combination is reported in Table 9.
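This evaluation protocol (10 independent runs of 10-fold cross validation, averaged) can be reproduced with a short scikit-learn sketch such as the one below; the learner shown (Gaussian Naïve Bayes) and the function name are illustrative choices.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def repeated_cv_accuracy(model, X, y, repeats=10, folds=10):
    """Mean accuracy over `repeats` independent runs of stratified k-fold cross validation."""
    scores = []
    for run in range(repeats):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        scores.append(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
    return float(np.mean(scores))

# e.g., repeated_cv_accuracy(GaussianNB(), X_reduced, y)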

Recall that the trial-and-error approach is used here to decide the initial threshold values (mainly for Step II) in order to tackle the feature selection NP problem in polynomial time. In this model, 10 trials are conducted for each dataset. At each trial, a 10% increment/decrement of the maximum correlation (as specified in Substeps I and II of the correlation-based algorithm) is applied. Based on these 10 trials, the threshold values (as shown in equations (3) and (4) in Section 3) produce better performance over almost all the selected datasets, resulting in an acceptable amount of data reduction.

Referring to Table 8, we get the dimension reduction (% age) of the used clinical datasets by Step I and Step II separately and by the combination of Step I and Step II, and the corresponding measures are presented in Table 9.

4.2. Discussion on the Experimental Results

Based on the empirical results yielded by the applied learners over the chosen datasets, some significant findings about the proposed hybrid feature selection approach are listed below.
(i) From Table 8, we may claim that the proposed approach is good enough to reduce noise from medical data. The justification behind its strength is analysed below from the results presented in Tables 3–5 and 9.
  (a) Table 3 reveals that each learner’s classification accuracy over almost all the clinical datasets improves after applying the combined approach. More clearly, the competent classifiers’ mean accuracy (%) (presented in Table 3) increases in almost all cases after removing the non-informative, irrelevant, and redundant features. Notably, the performance of the NB classifier improves substantially over almost all the datasets, indicating that the introduced approach is good at removing redundant features: the NB learner works better on datasets with independent features, and the suggested approach selects such attributes. Thus, we may claim that the redundant features are removed from the datasets by applying this approach.
  (b) Table 4 reports that the TPR and FPR achieved by the chosen learners over the datasets improve considerably (i.e., TPR increases and FPR decreases over almost all the datasets) after applying the introduced feature reduction approach.
  (c) Table 5 deals with the ROC-AUC metric, the most desirable performance metric of learners for clinical datasets. The head-to-head comparison of the AUC values achieved by the learners on the original datasets and on the datasets with reduced features shows that AUC increases over almost all the datasets after applying the hybrid FS approach, which is a positive signal for the proposed approach.
  (d) From Table 9, it is clear that the proposed model results in more than 50% reduction of the features in 15 datasets (all except E. coli and New Thyroid). In particular, 80% or more data reduction is achieved in six datasets, namely, Arrhythmia, Breast Cancer, Colon Cancer, Heart (Hung.), Heart (Swiss), and Lower Back Pain. Further, it is worth noting that the reduction does not hurt classification accuracy; rather, the performance metrics improve. More specifically, the original datasets contain about 50% to 80% redundant attributes, and the current hybrid approach is competent enough to remove these redundant attributes without affecting classification accuracy. Consequently, the removal of features speeds up the learning algorithms. Further, the comparison results presented in Table 9 show that the proposed system is more efficient in terms of data reduction than the sole information gain-based approach and the sole correlation coefficient-based approach. However, it may be noted that the individual correlation coefficient-based approach is better than the information gain-based approach alone. The reason is that perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them. Besides, the correlation coefficient-based approach’s data reduction capability demonstrates that clinical datasets often possess many redundant features, and their removal is possible via a correlation-based approach.
(ii) It is well accepted among researchers that the SVM learner is comparatively appropriate for datasets with high dimension but a small number of classes; examples here include Arrhythmia, Lung Cancer, and COVID-19. An improvement in the performance of this learner over these datasets is observed after reducing the redundant features.
(iii) For some datasets, the improvement in the used metrics yielded by some learners (not all the chosen learners) is unchanged, very small, or acceptably lower, although the amount (percentage) of dimension reduction (i.e., noise reduction) is considerably good. This may be due to the learning strategies adopted by those learners, which usually desire more features during training. Then again, increasing the number of features in a dataset does not always increase classification performance; beyond a peak, adding features may reduce the classification rate.

The proposed approach is not compared exhaustively with other standard works in the literature, since almost all the approaches in the literature have chosen only a few medical datasets (not a list of datasets) and some of them are disease specific. Of greater interest, the data reduction (%) and classification accuracy of the present two-step system are compared with some standard studies for a few specific clinical datasets, and these comparisons are presented, respectively, in Tables 1 and 2.

Referring to Tables 1 and 2, the following insights may be highlighted in favor of the proposed approach.
(i) The presented model is generic (i.e., not disease specific). From Tables 1 and 2, it is observed that most of the feature reduction models are disease specific and use heuristic/metaheuristic/combinatorial strategies to tackle the MFSS NP problem. Therefore, lack of generalizability and time consumption are the main drawbacks of the described studies; due to the application of heuristic/metaheuristic/combinatorial strategies, the computation time tends to increase.
(ii) It is a good alternative to the standard dimension reduction models for clinical datasets. Data compression by the proposed model (as compared to most of the standard clinical dimension reduction models) is noticeable, and the performances over the datasets are quite encouraging. Some datasets of the comparison tables are analysed below.
  (a) For the Breast Cancer dataset, the present approach performs better than (or equal to) the studies [53, 58].
  (b) In the case of Colon Cancer, the proposed approach attains better performance than the methods [14, 53], while achieving a considerable amount of data reduction.
  (c) For the Lung Cancer dataset, the percentage of data reduction and the CA achieved by [54] are, respectively, 54% and 75%, whereas our approach achieves 65% and 84.34% (although not better than [58]).
  (d) For the Indian Liver dataset, the study [57] reduces 90% of the data and attains 71.68% CA, while the present study results in 50% data reduction and 72.14% CA. On the other hand, the work [55] yields only 70% CA without removing any feature.
  (e) Parkinson’s disease is rarely experimented on. The cited study [57] achieves only 31% data reduction and attains 90.78% CA, whereas the presented model results in 77% data reduction and yields 97.82% CA.

A few vital reasons for the good performance achieved by the proposed strategy are stated below.
(i) Step I of the presented approach removes the irrelevant attributes, whereas Step II removes the correlated redundant attributes. However, while removing a redundant attribute, the approach emphasizes retaining the more informative of the two correlated attributes. This not only prevents information loss but also avoids excessive dimensionality reduction; in practice, overly aggressive dimensionality reduction prevents the expected results from being achieved.
(ii) The idea for deciding the threshold value criteria is conceptually justifiable, and the introduction of these threshold values enhances the strength of the model.

5. Conclusions and Future Scope

Conclusions. Over the last 10 years, the growth of computer and database technologies has led to the rapid growth of large-scale datasets. Importantly, large-scale datasets give more accurate and valuable results, but they require considerable processing power. One reason is that the number of dimensions in such datasets is very high, i.e., the key issue is the curse of dimensionality, which is mostly faced in applications like pattern recognition, classification, and clustering.

A natural question may arise: feature reduction was once a very active field due to hardware limitations, but computational resources are now plentiful, so why not simply keep the larger datasets, since the larger the dataset, the better it is for machine learning and knowledge discovery? However, there may still be redundant and irrelevant attributes in a large dataset, which need to be removed to achieve more effective results. Further, recently advanced machine learning approaches are able to handle the curse of dimensionality and large datasets, but these approaches are more suitable for large datasets (not for small ones). That is why feature reduction is always welcome.

In this paper, a novel hybrid feature selection approach is proposed to predict disease in a cost-effective way. We compare the classification accuracy, TPR, FPR, and AUC over the chosen seventeen datasets with the selected features using six individual well-known state-of-the-art learning approaches (namely, C4.5 (J48), JRip, ANN, KNN, Naïve Bayes, and SVM) and one hybrid learning approach (J48 + JRip).
(i) The list of the datasets (collected from several standard web data repositories) consists of both communicable and non-communicable disease datasets of smaller to larger dimensions. The list includes the new dreadful disease, COVID-19.
(ii) Out of the 17 datasets, 4 datasets, namely, Arrhythmia, Lower Back Pain, Malaria, and Parkinson’s, are rarely considered by researchers.
(iii) In terms of the selected performance metrics, the overall performance of our method has been found to be very good for almost all these datasets. In summary, the presented approach works well for all the chosen medical datasets (i.e., it is not disease specific), and it can be an excellent alternative to the well-known data reduction approaches.
(iv) The approach is simple to implement, and its computational complexity is O(n³), where n is the number of attributes in the dataset.
(v) The percentage of feature reduction achieved by the new model is high.
(vi) The article gives solid background information (including a literature review) for researchers who are not familiar enough with feature reduction (specifically for medical datasets).
(vii) It assists in data collection by indicating which attributes need to be collected, saving data collection time.

Undoubtedly, with the help of the proposed method, redundant attributes can be removed efficiently from the datasets without sacrificing classification performance. The proposed feature selection method is also shown to perform well compared with feature selection using information gain alone.

5.1. Limitations

(i) The proposed method has not been applied to a larger number of big medical datasets.
(ii) Two variables that are useless by themselves can be useful together, but such variables are simply removed here.

Future Scope. We are in the process of investigating the following.
(i) A variable that is completely useless by itself can provide a significant performance improvement when taken with others.
(ii) Two variables that are useless by themselves can be useful together.

Data Availability

The data used to support the findings of the study are included within the article.

Additional Points

Highlights. (1) Dimensionality reduction operates in polynomial time. (2) The model is generic for medical datasets. (3) Data loss is prevented and diagnostic accuracy is improved. (4) Learning and diagnostic time is reduced. (5) Storage space is compressed.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors express great thanks to the Birla Institute of Technology for providing good infrastructure to carry out this research work.