A Novel Two-Stage Selection of Feature Subsets in Machine Learning

Abstract: In feature subset selection, a variable selection procedure selects a subset of the most relevant features. Filter and wrapper methods are the two main categories of variable selection methods. Feature subset selection acts as a data pre-processing step and is applied to reduce the feature dimensions of very large datasets. In this paper, to deal with such problems, feature subset selection methods that depend on the fitness evaluation of a classifier are introduced to ease the classification task and to improve classification performance. To reduce the dimensions of the feature space, a novel two-stage selection of feature subsets (TSFS) method for selecting optimal features is proposed and studied both theoretically and experimentally. The method improves performance measures such as the efficiency, accuracy, and scalability of machine learning algorithms. The proposed method is compared with known relevant methods on benchmark databases. It performs better than the earlier hybrid feature selection methodologies discussed in relevant works with regard to classifier accuracy and error.


I. INTRODUCTION
High dimensionality problems are caused by large numbers of feature dimensions [1]. As a remedy, feature subset selection is applied to reduce the feature dimensions and to improve performance by removing the less significant features. These methods boost classification accuracy and reduce the training time of learning techniques. Feature selection methods, namely filters and wrappers, are distinguished on the basis of classifier evaluation [2].
The filter method weights the attributes on the basis of relevance, computed with measures such as information, distance, consistency, and correlation. It is computationally fast and simple, independent of any learning algorithm, and scalable to high-dimensional datasets. Once feature selection (FS) has been performed, the feature subset can subsequently be evaluated by a classifier. The selected features may, however, give poor classification performance because dependencies among features are mostly overlooked.
Wrapper methods depend on a classifier and couple the feature subset search with model selection. The drawbacks of this approach are a high risk of overfitting and a high computational cost [3], particularly when a classifier has to be built. The embedded technique chooses features in the manner of filter methods and evaluates them with the classifier, as in wrapper methods, within the model. It is less computationally intensive than wrappers.
The selection of feature subsets is regarded as a pre-processing technique, driven by an evaluation criterion, for high-dimensional datasets. The four-step feature selection process stated in [1] consists of generating subsets, evaluating subsets, checking a stopping criterion, and validating results. The first step produces a subset of features, which is then evaluated. The process continues until the stopping criterion is reached. The feature subsets selected in the previous steps are validated for performance analysis by the classification algorithm. In general, a feature space of n features has $2^n$ possible subsets. The most important issues are the starting point of the search and the direction of the search. In forward selection, the subset begins as the empty set, and features that satisfy the evaluation criterion are added in the forward direction; otherwise they are eliminated.
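The forward search described above can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions: the function names, the greedy stopping rule (stop when no single addition improves the score), and the toy evaluation criterion are hypothetical and are not taken from the paper.

```python
def count_subsets(n):
    """Number of subsets of an n-feature space, including the empty set."""
    return 2 ** n


def forward_selection(features, score):
    """Greedy forward search: start from the empty set and repeatedly add
    the feature that most improves `score`; stop when nothing improves it."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best:  # stopping criterion (assumed)
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def toy_score(subset):
    """Hypothetical criterion: reward features 'a' and 'c', penalise size."""
    useful = {"a": 0.5, "c": 0.3}
    return sum(useful.get(f, 0.0) for f in subset) - 0.05 * len(subset)


selected, best = forward_selection(["a", "b", "c", "d"], toy_score)
```

The greedy search evaluates only a small fraction of the $2^n$ subsets, which is why heuristic strategies of this kind are preferred over exhaustive enumeration when n is large.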

A. Objectives
Feature subset methods depend on parameters. The number of features in the final set is determined by the inputs and the threshold values. The issues with the above-mentioned filter-wrapper techniques call for the development of innovative algorithms. The overall objectives of this paper are:
• To eliminate irrelevant and redundant features.
• To select the optimal features from a huge set.
• To build up a multiple objective, filter-wrapper hybrid framework for continuous, categorical and hybrid data.
• To conduct an extensive experiment with the hybrid framework for evaluating the proposed methodology.
• To improve the classification accuracy and to minimize the error rate.
A novel algorithm is proposed and compared with state-of-the-art methods. The results of the proposed method, using a subset of features, are comparable or superior to those of the existing methods.

II. RELEVANT WORK
Many attribute selection methodologies have been proposed in recent studies for classification by machine learning. In high-dimensional feature sets, the choice of relevant features is indispensable because the search space contains $2^n$ subsets for n variables. Performing a comprehensive search to improve the significant measures of the learning system is a challenging task [2]. Many filter, wrapper, and hybrid methodologies for feature selection have been adopted to obtain a smaller set of relevant and significant features [3]. One family of techniques searches every feasible candidate and is therefore guaranteed to uncover a solution. The greedy algorithm, in contrast, always makes the choice that seems best at that moment. Heuristic search and random search have been adopted in [4]. Particle swarm optimization (PSO) is proposed in [5]. In comparison with other evolutionary algorithms such as genetic algorithms (GAs) and genetic programming, PSO is cost effective and converges faster. A hybrid method combining PSO and ant colony optimization (ACO) was introduced for classification in [6]. A demerit of PSO was the required conversion of nominal data to binary; the hybrid method overcomes this need for preprocessing. A hybrid filter-wrapper method based on PSO-GA, which aimed to combine the merits of filter and wrapper techniques and yields a smaller number of optimal features with better efficiency, was proposed in [7]. A novel Gini-Index filter was proposed in [8]. It was formulated and adapted from Gini-Index theory for feature selection in text classification, and it produced better performance than the other methods.
Authors in [9] proposed a hybrid FS algorithm for gene data that combines mutual information maximization with an adaptive GA (MIMAGA) to enhance the competence of the algorithm. Authors in [10] employed a hybrid technique for GA that uses the merits of filters to improve the crossover and mutation operators. Hybrid FS approaches were created from subsets with features of different sizes and importance. Authors in [11] proposed a hybrid method using filters and wrappers based on instance learning. In the wrapper approach of [11], a classification algorithm is used in a cooperative subset search. Authors in [3] proposed a hybrid FS method that combines the information gain ratio (IGR) filter and the backward elimination (BE) wrapper in the first phase and PSO in the second phase. The method performs better for continuous features than for categorical features [3]. The relevant studies lead to a framework for a hybrid feature selection methodology.

A. The Proposed Methodology
In this paper, feature subsets are formulated from Chi square, Gini index, and PSO algorithms to solve FS problems in machine learning.

1) Chi Square Test Statistic (CHI)
The characteristics of categorical data have been studied from various perspectives, such as data size, number of features, possible attribute values, and the frequency distribution of the attribute values of a dataset. For categorical data, the similarity measure CHI is established as a goodness-of-fit test [12] with an approximate chi-square distribution. Several classes can be compared with the help of this test. It is also used as a test for independence and homogeneity [13,14].
Authors in [15] reported a comparative study of these methods and found that information gain and CHI are the best methods for feature selection. The chi-square distance is computed between every attribute and the class. It serves as an attribute weighting measure because of its capability for ranking attributes [16]. This test is applicable to categorical datasets and does not perform well for quantitative data. Chi-square calculations require frequencies or counts of the data. The null hypothesis states that there is no association between the attributes. The test builds a model by distributing the data over the categories under the assumption that the null hypothesis holds, and then compares the observed values of the distribution with the expected values:
$$\chi^2 = \sum_{i}\sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}} \tag{1}$$
where $O_{ij}$ is the observed frequency for cell $C_{ij}$ and $E_{ij} = n_i n_j / n$ is the expected frequency for that cell.
Rows whose actual and expected values differ contribute to the chi-square value. The maximum value indicates the related features.
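As an illustration of the chi-square weighting described above, the following sketch builds the contingency table of one categorical feature against the class label and compares observed with expected counts. The function name and the use of NumPy are assumptions made for this example; the paper's experiments were run in RapidMiner, not with this code.

```python
import numpy as np


def chi_square_score(feature, labels):
    """Chi-square statistic between one categorical feature and the class.

    Observed counts O_ij come from the feature/class contingency table;
    expected counts are E_ij = n_i * n_j / n (row total * column total / n).
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    f_vals, f_idx = np.unique(feature, return_inverse=True)
    c_vals, c_idx = np.unique(labels, return_inverse=True)

    observed = np.zeros((len(f_vals), len(c_vals)))
    np.add.at(observed, (f_idx, c_idx), 1)  # count each (value, class) pair

    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return float(((observed - expected) ** 2 / expected).sum())


# A feature that follows the class exactly scores high;
# a feature that is independent of the class scores zero.
informative = chi_square_score([0, 0, 1, 1], [0, 0, 1, 1])
uninformative = chi_square_score([0, 1, 0, 1], [0, 0, 1, 1])
```

Ranking all features by this score and keeping the highest-scoring ones is one way to realise a CHI filter; how many features to keep is a threshold choice that the sketch leaves open.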

2) Gini Index (GINI)
GINI is most suitable for classifying attributes whose distributions clump together. During evaluation, GINI combines the conditional probability of a feature with its prior probability to avoid the effects of unbalanced classes [17]. The Gini coefficient measures the inequality of a distribution [17] and can be interpreted as the percentage of inequality within a specified population. The Gini index is a correlation-based criterion that scores features by how well they discriminate among classes. It was first proposed as a splitting rule in decision tree generation [18], where it expresses the reduction of impurity obtained when a feature is chosen. In (2), the feature is represented by Y.
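The impurity-reduction reading of the Gini index can be sketched as follows. This is the textbook decision-tree form (Gini impurity of the class labels minus the weighted impurity after splitting on a feature); it is an assumption that this form is close to the paper's own Gini expression, which is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def gini_gain(feature, labels):
    """Reduction of Gini impurity obtained by splitting on a categorical
    feature Y (textbook form; assumed, not the paper's exact formula)."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each branch by the fraction of instances it receives.
        weighted += mask.mean() * gini_impurity(labels[mask])
    return gini_impurity(labels) - weighted


perfect = gini_gain([0, 0, 1, 1], [0, 0, 1, 1])
useless = gini_gain([0, 1, 0, 1], [0, 0, 1, 1])
```

A larger gain means the feature separates the classes better, so features can be ranked by this value in the same way as by the chi-square score.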

3) PSO
The wrapper method PSO [4] is selected for its advantages: it is simple to implement, has few parameters to tune, is robust, converges quickly, needs little computing time, and finds global optima with high probability and efficiency. PSO is inspired by the social behavior of birds in flocks. Its fundamental characteristic is the social interaction within the population, through which information is optimized. Every candidate solution is represented as a particle in the swarm. The position of particle i in the search space is described by a vector $S_i = (S_{i1}, S_{i2}, \ldots, S_{iD})$, where D is the dimensionality of the search space. The particles move through the search space to find optimal solutions, so every particle also has a velocity $C_i = (C_{i1}, C_{i2}, \ldots, C_{iD})$. Every particle's position and velocity can be revised with respect to the movement of its neighbors. pbest is the best previous position of a particle, its personal best, and gbest is the best position found in the population. Based on pbest and gbest, the optimal solution is searched for by updating the velocity C and the position S of every particle according to the following equations [5]:
$$C_{id}^{t+1} = \omega\, C_{id}^{t} + a_1 r_1 \left(\mathit{pbest}_{id} - S_{id}^{t}\right) + a_2 r_2 \left(\mathit{gbest}_{d} - S_{id}^{t}\right) \tag{3}$$
$$S_{id}^{t+1} = S_{id}^{t} + C_{id}^{t+1} \tag{4}$$
where t is the iteration number in the evolution process, ω is the inertia weight, which controls the influence of the earlier velocities, $a_1$ and $a_2$ are acceleration constants, and $r_1$ and $r_2$ are random numbers drawn from a uniform distribution on [0, 1]. The algorithm stops when a predefined condition is met, such as a sufficiently good fitness value or a specified number of iterations.
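One iteration of the velocity and position update can be sketched as below, using the standard PSO rule with the symbols of this section (S, C, pbest, gbest, ω, a1, a2, r1, r2). The default parameter values and the choice to draw r1 and r2 independently for every particle and dimension are assumptions for illustration, not settings reported in the paper.

```python
import numpy as np


def pso_step(S, C, pbest, gbest, omega=0.7, a1=1.5, a2=1.5, rng=None):
    """One PSO update for a swarm.

    S, C   : arrays of shape (particles, D) with positions and velocities
    pbest  : array of shape (particles, D), personal best positions
    gbest  : array of shape (D,), best position found by the swarm
    omega, a1, a2 : inertia weight and acceleration constants (assumed values)
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(S.shape)  # uniform in [0, 1)
    r2 = rng.random(S.shape)
    C_new = omega * C + a1 * r1 * (pbest - S) + a2 * r2 * (gbest - S)
    S_new = S + C_new
    return S_new, C_new


# When a particle already sits on both pbest and gbest, only inertia moves it.
S = np.array([[1.0, 2.0]])
C = np.array([[0.5, -0.5]])
S_new, C_new = pso_step(S, C, pbest=S.copy(), gbest=S[0].copy(),
                        omega=0.7, rng=np.random.default_rng(0))
```

For feature selection, a position vector is typically mapped to a feature mask (for example, by thresholding each coordinate) before the classifier-based fitness is computed; that mapping is a design choice outside this sketch.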

B. The Proposed Algorithm
The filter and wrapper methods are combined in the proposed two-stage method. The pseudocode of the proposed framework, named two-stage selection of feature subsets (TSFS), is presented in Figure 1. Figure 2 shows a flowchart of the TSFS algorithm. In Figure 1, lines 1-5 cover the preprocessing step of TSFS, which selects the relevant features. It is a preliminary step performed on the datasets downloaded from the UCI (University of California, Irvine) repository. In this first phase of the algorithm, the filter techniques Chi square and Gini index are applied separately to the datasets to eliminate superfluous or inappropriate features. The CHI filter selects the feature subset $f_1$ and the GINI filter selects the feature subset $f_2$. The selected subsets $f_1$ and $f_2$ contain the features most relevant to the class label. Both filters weigh the significance of the features by computing a weight for every feature of the dataset with respect to the class label. In line 6, the two filter outputs $f_1$ and $f_2$ are combined, and features that appear in both subsets are kept only once, which gives a feature subset with fewer features and better performance. Lines 7-11 correspond to the wrapper approach, which uses PSO to select the subset of optimal features in the feature space. The PSO algorithm computes the two best values, pbest and gbest, for every particle in each iteration. An optimal subset is selected to reduce the dimension of the features. Lines 12-16 correspond to the evaluation of the resulting optimal subset produced by the PSO wrapper. In this second phase, the training and testing sets undergo tenfold cross-validation (CV) to improve learning efficiency. This is the tuning step that removes redundancy from the optimal subsets.
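The combination of the two filter outputs in the first stage can be sketched as a union that keeps shared features only once. The helper names, the dictionary representation of the filter scores, and the cut-off k are hypothetical; the paper does not state how many features each filter retains.

```python
def top_k(scores, k):
    """Names of the k highest-scoring features (ties broken by name)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


def combine_filter_subsets(chi_scores, gini_scores, k):
    """Stage 1 sketch: take the top-k features of each filter (f1 and f2)
    and merge them, keeping a feature selected by both filters only once."""
    f1 = top_k(chi_scores, k)
    f2 = top_k(gini_scores, k)
    combined = list(f1)
    for name in f2:
        if name not in combined:
            combined.append(name)
    return combined


# Hypothetical filter scores for four features.
chi_scores = {"a": 3.0, "b": 2.0, "c": 0.1, "d": 0.0}
gini_scores = {"a": 0.4, "b": 0.0, "c": 0.3, "d": 0.1}
candidates = combine_filter_subsets(chi_scores, gini_scores, k=2)
```

The merged list is the reduced search space handed to the PSO wrapper in the second stage, so the wrapper only explores subsets of features that at least one filter considered relevant.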

C. K-Nearest Neighbour Classification (kNN)
The nature of the data is most important for classification. A classification method can be either parametric or non-parametric. K-nearest neighbor (kNN) classification is non-parametric.

A. Datasets
The experiments, run in a Java environment using RapidMiner, confirm the efficiency of the suggested algorithm. The eight UCI [20] datasets used are shown in Table I. The PSO parameters pbest and gbest are set to 1. The dynamic inertia weight is set to true, which adjusts the inertia during the run. The upper limit on the number of generations is 30, and the population size is 100. The model is evaluated on all sample instances with 10-fold CV using the kNN classifier.
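The evaluation protocol, k-fold cross-validation of a kNN classifier restricted to a selected feature subset, can be sketched as follows. This is not the RapidMiner setup used in the experiments; the NumPy implementation, the Euclidean distance, the number of neighbours k=3, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training instances (Euclidean)."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for row in nearest:
        values, counts = np.unique(y_train[row], return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)


def cross_val_accuracy(X, y, feature_idx, folds=10, k=3, seed=0):
    """Mean accuracy of kNN over `folds` folds, using only `feature_idx`."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), folds)
    X_sub = X[:, feature_idx]
    accuracies = []
    for i in range(folds):
        test = parts[i]
        train = np.concatenate([parts[j] for j in range(folds) if j != i])
        pred = knn_predict(X_sub[train], y[train], X_sub[test], k)
        accuracies.append(np.mean(pred == y[test]))
    return float(np.mean(accuracies))


# Synthetic data: feature 0 separates the classes, feature 1 is noise.
data_rng = np.random.default_rng(1)
y = np.array([0, 1] * 20)
X = np.column_stack([y * 10.0 + data_rng.normal(0, 0.1, 40),
                     data_rng.normal(0, 1.0, 40)])
acc_selected = cross_val_accuracy(X, y, [0])
```

In a wrapper, a function like `cross_val_accuracy` plays the role of the fitness: each PSO particle proposes a feature subset, and the cross-validated accuracy of kNN on that subset is the value to maximise.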

B. Performance Metrics
The most common measures for evaluating a feature subset are Precision and Recall. Precision is the fraction of retrieved instances that are relevant, while Recall is the fraction of all relevant instances that have been retrieved. Both are estimated from the counts of true positives, false positives, false negatives, and true negatives [21]. Figures 3 and 4 show the Precision and Recall obtained with various variable selection techniques and with the suggested TSFS.
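A minimal sketch of the two measures for a binary problem is given below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 when nothing was retrieved or nothing is relevant (assumed).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Two true positives, one false positive, one false negative.
precision, recall = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

For multi-class datasets, the same counts are usually computed per class and then averaged; which averaging the experiments used is not stated in this section.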

C. Statistical Analysis
The proposed TSFS algorithm has been evaluated with the kNN classifier. The main objectives, i.e., the number of selected features, the classification accuracy of the selected feature subsets, and the computing time, have been recorded together with performance measures such as Precision, Sensitivity, and Kappa. In Figure 5, the accuracy of the suggested algorithm is compared with that of the traditional feature subset methods in the literature. According to the results in Figure 5, the proposed TSFS method is superior to the traditional methods in terms of accuracy. TSFS performs better for continuous datasets such as sonar and waveform than for discrete datasets such as splice and chess and hybrid datasets such as sick, anneal, and hypothyroid.

D. Kappa Statistic as Performance Measure
The kappa statistic measures how closely the instances classified by the learning algorithm match the data labels, while correcting for the accuracy expected from random classification. Through kappa, classifiers built and evaluated on datasets with different class distributions can be compared in relation to the expected accuracy. It gives a better indication of how the instances were classified, because accuracy can be misleading when the class distribution is skewed. There is no single agreed interpretation of kappa values. Authors in [22] describe 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect, while the author in [23] considers kappa>0.75 as excellent, 0.40-0.75 as fair to good, and kappa<0.40 as poor. Kappa is commonly considered a more robust estimate than accuracy. The proposed TSFS's kappa analysis is outlined in Table II.
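The correction for chance agreement can be made concrete with a short sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the accuracy expected from the class frequencies alone. The function name is an assumption, and it is assumed here that the kappa reported in the paper is Cohen's kappa.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true labels and predictions.

    Undefined (division by zero) when the expected agreement is exactly 1,
    i.e. when labels and predictions all belong to one single class.
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)


perfect = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])
chance = cohen_kappa([1, 1, 0, 0], [1, 0, 1, 0])
```

In the second call the accuracy is 0.5, but kappa is 0 because that level of agreement is exactly what random labelling with the same class frequencies would produce.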

E. MAE and RMSE Analysis
Mean absolute error (MAE) [18] is the average of the absolute differences between predicted and actual data values, and is given by:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| p_i - a_i \right|$$
where $p_i$ is the predicted value, $a_i$ is the actual value, and n is the number of instances. MAE ranges from zero to infinity, and a perfect fit is obtained when MAE=0. Root Mean Squared Error (RMSE) [21] is one of the most commonly used measures of the average amount of error in numerical prediction. It is the square root of the average of the squared differences between prediction and actual observation. Its value is computed by:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( p_i - a_i \right)^2}$$
The smaller the RMSE value, the better the model performance. In Tables IV and V, the accuracy and run time of TSFS are compared with those of other methods available in the literature. Regarding accuracy, TSFS shows appreciable improvement over the other existing methods for all datasets except chess and sonar. TSFS achieves the best accuracy for the vowel, sick, splice, and waveform datasets and the best run time for the anneal, vowel, and sonar datasets.
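The two error measures described in this subsection can be computed as in the following sketch. The function names and the example values are illustrative assumptions; they are not results from the paper.

```python
import math


def mae(predicted, actual):
    """Mean absolute error: average of |predicted - actual|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error: square root of the mean squared difference."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical predictions: one of two values is off by 2.
example_mae = mae([3.0, 5.0], [1.0, 5.0])
example_rmse = rmse([3.0, 5.0], [1.0, 5.0])
```

Because the differences are squared before averaging, RMSE penalises a few large errors more strongly than MAE does, so the two values coincide only when all errors have the same magnitude.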

V. CONCLUSION AND FUTURE WORK
Feature selection is an essential method for dimensionality reduction in machine learning. In this paper, an efficient methodology for selecting informative features has been proposed for massive hybrid datasets. Experimentation confirms that the methodology is effective and efficient, especially with regard to classification performance. The proposed methodology provided better accuracy rates than the existing methods. Additionally, it decreased the error rates for two error measures, MAE and RMSE. These results are important and show that feature selection can provide better performance and higher efficiency for hybrid systems. The weakness of this work is that it fails to produce better computation time, even though its accuracy outperforms the other methodologies in experiments with UCI repository datasets. In future research, it would be worthwhile to use mutual information [34] to estimate and find features with minimal redundancy and maximal relevance (mRMR). These concepts could be integrated with the proposed method to obtain a faster and improved method that may be applied effectively to huge datasets.

Fig. 1. The pseudocode of the suggested TSFS algorithm
Input: Feature set D = {f_i, i = 1, ..., n}; C: class labels
Output: The subset P of D features
1. (Initialize) Let D ← "Original set having d dimensions"; P ← { }
2. Evaluate the significance of CHI with the resultant class C. For each f_d ∈ D find CHI(C; f_d) applying (1).
3. (Possibility of the first dimensions) Find a dimension f_1 that improves CHI(C; f_d); Let D ← D \ {f_1}; Set P ← {f_1}
4. Evaluate the significance of GINI with the output class C. For each f_d ∈ D find GINI(C; f_d) applying (2).
5. (Possibility of the next dimensions) Find a dimension f_2 that improves GINI(C; f_d); Let D ← D \ {f_2}
6. (Optimal output) Output the set P ← P ∪ {f_2}
7. Split the feature set P into s for training dataset and testing dataset. For P generate particles P.
8. For each particle from 1 to N
9. Initialize particle
   9.1 Initialize the particle's position in the search space.
   9.2 Initialize pbest and gbest.
   9.3 Initialize the velocity.
10. Repeat until the termination condition is met.
   10.1 Update the particle's velocity using (3).
   10.2 Update the particle's position using (4).
   10.3 t ← t + 1
11. Output gbest as the best found solution.
12. Fitness evaluation through CV.
13. Split the data into k equal-sized folds.
14. For s = 1, ..., k
   14.1 Train a model with basis features on the s-th fold's training set.
   14.2 Compute the testing error on the corresponding fold.
15. Return values over all k folds with the lowest average of testing error.
16. Learning results and significance of a predicted outcome for P as subset of selected features.

Fig. 3. Comparative study of the significance measure Precision of the TSFS with other existing methods

www.etasr.com Kamala & Thangaiah: A Novel Two-Stage Selection of Feature Subsets in Machine Learning

Fig. 5. Accuracy comparison for each feature selection algorithm with the proposed algorithm.

Instances of the data that lie in close proximity are independent and separately distributed. Such instances have similar classification results [19].

TABLE II. KAPPA COEFFICIENT OF THE FRAMEWORK TSFS

Table III presents the results of MAE and RMSE.

F. Result Analysis

TABLE V. Comparison of Classification Accuracy and Computing Time with Existing Methods
Best results are shown in bold.