Importance of Features Selection, Attributes Selection, Challenges and Future Directions for Medical Imaging Data: A Review

Abstract: In pattern recognition and machine learning, features play a key role in prediction. Well-known applications include medical imaging and image classification, to name a few. With the exponential growth of investment in medical data repositories and health service provision, medical institutions are collecting large volumes of data. These repositories contain detailed information that is essential to support medical diagnostic decisions and to improve the quality of patient care. On the other hand, this growth has also made it difficult to comprehend and utilize the data for various purposes. The results obtained from imaging data can become biased because of extraneous features present in larger datasets. Feature selection offers a way to decrease the number of components in such large datasets: selection techniques discard the unimportant features and retain a subset of components that yields better classification accuracy. Choosing good attributes produces a precise classification model, which improves learning speed and predictive power. This paper presents a review of feature selection techniques and attribute selection measures for medical imaging. The review describes feature selection techniques in the medical domain with their pros and cons and highlights their application to imaging data and data mining algorithms. It reveals the shortcomings of the existing feature and attribute selection techniques on multi-sourced data. Moreover, it shows the importance of feature selection for the correct classification of medical infections. In the end, a critical analysis and future directions are provided.


Introduction
Feature selection, or feature reduction, is one of the most critical steps in computer vision and image processing. A feature selection algorithm selects the most relevant features from the feature vector and drops the irrelevant attributes. It is also an active research area in machine learning and pattern recognition [1]. In medical imaging, a substantial number of images are processed for recognition [2]. In some cases the images are in high resolution and therefore contain more attributes [3]. Because of this large amount of data, a high-dimensional feature set is obtained in the feature extraction procedure. Not all of the extracted attributes are useful, and a higher-dimensional vector degrades the performance of the model in terms of both time cost and accuracy. Feature selection techniques are used to reduce the dimensionality as well as the cost of extracting information and of understanding the model. The importance of feature selection can be illustrated by an example in which only 2 of 7129 features were used to enhance the classification performance [4].
Recently, several feature selection algorithms have been introduced and used by researchers for the recognition of medical images. These algorithms include the genetic algorithm (GA), entropy-based selection, Particle Swarm Optimization (PSO), grasshopper optimization, and many more. These techniques have improved both accuracy and time performance. GA is a common feature selection method nowadays. It can deliver exact or approximate best solutions and is based on the theory of evolution and the genetics of natural selection [5]. A multi-task hypergraph-based feature selection method [6] was introduced for the classification of Alzheimer's disease. This method illustrates the high-order dimensionality for each modality. A novel hybrid supervised feature selection method [7] was proposed for the classification of brain tumours from MRI scans. Furthermore, different feature selection methods have been developed for the recognition of medical images [8].
New technologies and applications are easily accessible, which results in the production of an enormous amount of data [9]. Data can take the form of reports, design visuals, video and sound recordings and so forth [10,11]. In 1991, analysts anticipated that the amount of data would multiply at regular intervals. The increase in the size of datasets makes it difficult to comprehend and utilize data for various purposes [12]. Today the bulk of high-dimensional data is stored in online databases, which makes its use more difficult and challenging [13]. Data mining (DM) is the branch of data science that discovers knowledge from huge databases. In DM, knowledge is discovered from historical data: the data is analyzed, and information is learned in order to extract knowledge [14]. In the context of data management, discovery is performed on data to find meaningful patterns. Knowledge discovery in databases is not new and has proved its analytical capabilities for data analysis [15]. The application of data mining to different types of data, including graphs, numbers, text, and web data, among others, is establishing the worth of this domain in different applications. Data mining applies analytical techniques to databases to extract hidden and non-trivial patterns from data. Its applications are broad and include, but are not limited to, education, telecommunication, superstores, the energy sector and bioinformatics.
Knowledge discovery techniques for the classification of datasets are divided into two broad categories: (i) supervised classification and (ii) unsupervised classification. If class labels are provided in the dataset, the applied classification technique is supervised; classification done in the absence of class labels is called unsupervised. As discussed above, large datasets raise many issues, one of which is the difficulty of accurate classification after applying data mining algorithms, because of the extraneous features present in larger datasets. Feature selection and reduction are commonly applied to remove unnecessary features from the dataset.
Feature selection is a method for discarding unimportant features and selecting a subset of components that yields better classification accuracy. The problem of assessing the quality of an attribute is common to all inductive machine learning and data mining algorithms. Effective mining cannot be accomplished when irrelevant properties are part of the prediction process. Such properties are represented as features in a learning problem. Feature selection is carried out by setting criteria that assess the value of every candidate attribute [16]. Many terms are used for attribute determination, such as feature selection, feature reduction, dimensionality reduction and attribute selection [17]. These terms have some differences and similarities. Feature selection is the process of selecting, from the complete feature set, the relevant features that satisfy the qualifying selection criteria. In feature reduction, new features are created from existing features, thereby reducing the overall number. Dimensionality reduction transforms the data into a low-dimensional space in which the number of features is reduced but their impact on the dataset remains. Attribute selection, in contrast, is a mathematical formulation in which a criterion is specified against which an attribute is evaluated; if the attribute satisfies the criterion, it is selected [18,19].
Feature selection is used in many application domains [20]. It is useful when high-dimensional data is supplied with many irrelevant and less significant attributes. Recently, feature selection techniques have been used in many domains such as image recognition [21], image retrieval [22], bioinformatic data analysis [23], skin cancer [24][25][26], lung cancer [27,28], brain tumor [29,30], stomach infections [31,32], Alzheimer's [33,34], dental infections [35], to name a few [18,36,37]. When the correct attributes are selected, computation is reduced and time is saved. Fig. 1a shows the working of the general feature selection method. From the original feature set, a subset of features is generated on the basis of some search strategy. Once the subset is generated, all of its features are checked against the evaluation criteria; those that pass are gathered in the final feature set, and the process of evaluation goes on. Applications of feature selection techniques in some of the prominent domains are given in Fig. 2. Fig. 2 shows that feature selection techniques are most prominently used in sentiment analysis and image processing. The data in both domains are described by a large number of attributes. As in all other domains, feature selection techniques have evolved over time. This paper reviews feature selection techniques and algorithms, especially those that use data mining techniques for feature selection, with their pros and cons. The paper is organized as follows. Different categories of feature selection methods are given in the next section. Filter algorithms, wrapper feature selection algorithms and attribute selection measures are reviewed in Sections 3-5 respectively. In Section 6, feature selection techniques utilizing data mining are given. The conclusion is given in Section 7.

Feature Selection Methods
Feature selection strategies are grouped into three categories based on their selection criteria [38]: the filter, wrapper, and embedded approaches. In the filter-based approach, no data mining algorithm is used as the evaluation criterion. A filter method filters out irrelevant attributes before the induction process starts. In filter-based techniques, the quality of an attribute is assessed against some measure. The process of selecting attributes stops when every attribute has been evaluated, and the top-ranked attributes that best satisfy the evaluation criteria are selected. Wrapper-based techniques require one fixed data mining algorithm and use its performance as the evaluation criterion. The chosen data mining algorithm should satisfy two conditions: (a) it should be capable of reducing the number of attributes as far as possible; (b) it should be highly computationally efficient. Wrapper techniques are computationally costly but simple when compared with the filter strategy. Embedded techniques exploit both filter and wrapper techniques by changing the evaluation measure at various stages. The embedded strategy aims to reduce the excessive computational time of the wrapper approach. The algorithms proposed in the literature for feature selection fall into one of these three categories. The algorithm-wise division is shown in Fig. 3, which indicates that filter algorithms are discussed more often in the literature than wrapper and embedded algorithms: 47% of the algorithms are filter algorithms, 29% are wrapper algorithms, and only 24% are embedded algorithms.
Figs. 4-6 show the advantages and disadvantages of the three feature selection methods in graphical form. For example, the embedded and filter methods have better computational complexity than the wrapper method. The filter method does not depend on a classifier, whereas the wrapper and embedded methods do.

Feature Selection in Medical Data
Feature selection and extraction techniques are widely used on medical datasets. The role of data analysis techniques in the medical and healthcare domains is multifaceted. The data used for diagnosis (differential and non-differential), treatment, prognosis and analysis is diverse in nature. The types of data mined in medical applications for these purposes include microarray data [39], data related to heart issues [40], medical imaging data [41] and others. The data analyzed in the medical domain can be both dense and sparse; for example, it may take the form of simple and complex images, arrays, X-rays, magnetic resonance imaging (MRI), radiotherapy data and immunohistochemistry [42]. Conventional data is also used in the medical domain in combination with these complex data structures to reach a diagnosis or decide on a treatment. This sparsity increases the number of features present in the data and can cause the curse of dimensionality [43]. Numerous feature selection and extraction methods are applied to this data so that only the relevant features are analyzed. In the extant literature, these techniques are broadly classified into techniques for structured and for unstructured data. Unstructured data in this domain includes medical imaging data, genome sequences, medical digital signals and microarray data [44]. Medical images are complex, or at least time-consuming, to analyze, not only for the naïve user but also for domain specialists. These images are available in different formats, and their redundant features are eliminated by different feature selection techniques for X-rays [45], MRI [46], ultrasound [47], CT scans [48], and PET scans [49]. Data mining and machine learning techniques are used with these images, either as a standalone source of data or in combination with other data, for prognosis, differential diagnosis and treatment.
The feature selection algorithms in the literature for the above tasks convert the data to a more compact form from which implicit patterns can be extracted by data mining and machine learning techniques, but they are constrained by interpretability and integrity [50]. Another major limitation on applying feature selection techniques to medical image data is the sheer number of algorithms, all of which are closely related to each other [51]. The algorithms commonly used for feature selection from medical images are given in Sections 2.2-2.4 and 3.1. As mentioned earlier, feature selection techniques are also used on medical data for genome analysis [52]. Genome analysis is primarily done to detect genetic patterns, diseases, similarities and dissimilarities, by analyzing genome sequences arranged in different orders. A systematic survey of feature selection techniques used in bioinformatics is presented in [53]. A taxonomy of feature selection techniques, with the advantages of the filter, wrapper and embedded techniques used on bioinformatics data, is presented in [54]. In that paper, filter algorithms are classified into univariate and multivariate methods, wrapper methods into deterministic and random methods, and embedded methods into classifier-dependent and classifier-independent methods. Some ensemble feature selectors are also reviewed in [55]. Some feature selection algorithms cited in the literature are used on medical data for biomedical signal analysis [56]. Biomedical signals are prominently discussed in the literature in the context of electroencephalogram (EEG) signal analysis [57], heart disease prediction [58], brain-computer interfaces and electromyography (EMG) analysis. Many sequence analysis methods attempt to recognize short, more or less conserved signals to patronized protein sequences.
Another citable attempt made in the literature is related to the application of feature selection techniques on microarray data [59]. Microarray databases are rich repositories of genetic data and are mostly used for cancer differential diagnosis [60]. Genetic algorithms are also very popular for feature selection from microarray data [61]. Manonmani et al. [62] in the first part of their paper presented a review of feature selection techniques used on healthcare datasets. The distribution of these applications is shown in Fig. 7. The feature selection techniques used for the above tasks are presented and reviewed starting from Section 2.2.

Figure 7: Percent distribution of feature selection techniques in literature (recoverable labels: Microarray 22%, Signal Processing 17%)

Filter Algorithms
Filter methods evaluate each attribute against an evaluation criterion and filter out irrelevant attributes before the induction process starts. The selection process stops when every attribute has been evaluated against the measure. The top-ranked attributes that best satisfy the evaluation criteria are selected. Some of the filter algorithms are described below.

Focus Algorithm
The FOCUS algorithm [63] uses exponential search in the forward direction to find consistent attributes in a dataset, with consistency as the evaluation measure. The algorithm works incrementally, first evaluating each individual feature, then each set of two features, and so on. It stops when the globally consistent solution is found. It uses the concept of conflicts, that is, identical samples with different class values, and searches for a subset with fewer conflicts. FOCUS is computationally expensive because of its exhaustive search. It is not very noise tolerant and is applied mostly to discrete values.
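The incremental, conflict-based search described for FOCUS can be sketched as follows. This is a minimal illustration, not the implementation from [63]: the function names `is_consistent` and `focus` and the toy dataset are assumptions made for this example, and the sketch accepts only subsets with zero conflicts.

```python
from itertools import combinations


def is_consistent(rows, labels, subset):
    """Return True if no two samples agree on `subset` but differ in class."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        # setdefault stores the first label seen for this pattern;
        # a different label for the same pattern is a conflict.
        if seen.setdefault(key, label) != label:
            return False
    return True


def focus(rows, labels):
    """Exhaustively try subsets of size 0, 1, 2, ... and return the first
    (hence smallest) conflict-free subset of feature indices."""
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(rows, labels, subset):
                return subset
    # Reached only if the full feature set itself contains conflicts.
    return tuple(range(n_features))


# Hypothetical toy data: class = feature 0 XOR feature 2; feature 1 is irrelevant.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
labels = [r[0] ^ r[2] for r in rows]
selected = focus(rows, labels)
print(selected)
```

The nested loop over all subsets of every size is what makes the search exponential in the number of features, which matches the cost noted above.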

FOCUS-2 Algorithm
FOCUS-2 introduces the concept of conflicts in positive and negative sets to reduce the search space. In A = (001100), positive sets are those that possess 3 or 5 attributes, and the others are negative [64]. It maintains separate queues in which attribute subsets are placed. The time complexity of the FOCUS-2 algorithm is high because it first prunes on the basis of conflicts and then searches for relevant attributes. However, FOCUS-2 performs fewer tests than FOCUS.

LVF Algorithm
The LVF algorithm [65] performs a random search in a random direction and can work with any evaluation measure; consistency is the measure used in LVF. In each iteration, a subset is selected, and the first one is marked as the best subset. In later iterations, if the inconsistency rate of the selected subset is lower than that of the best subset, the best subset is replaced. LVF can handle small noisy datasets, but it performs better on large, noisy datasets. The time complexity of LVF is high because it constantly checks the consistency levels of attributes. The number of attributes selected by this algorithm is usually large, which is accepted for the sake of achieving a globally consistent solution [66].
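A random, consistency-driven search of this kind can be sketched as below. The sketch assumes a common formulation in which a random candidate is accepted when it is no larger than the current best subset and its inconsistency rate stays within a threshold; the names `inconsistency_rate` and `lvf`, the parameters `max_tries`, `threshold` and `seed`, and the toy data are all assumptions for illustration rather than details taken from [65].

```python
import random
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, subset):
    """Fraction of samples that are not in the majority class of their pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][label] += 1
    conflicts = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return conflicts / len(rows)


def lvf(rows, labels, max_tries=200, threshold=0.0, seed=0):
    """Las Vegas style filter: sample random subsets and keep the smallest
    one whose inconsistency rate does not exceed `threshold`."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    best = list(range(n_features))  # start from the full feature set
    for _ in range(max_tries):
        # Never sample a subset larger than the current best.
        size = rng.randint(1, len(best))
        candidate = sorted(rng.sample(range(n_features), size))
        if inconsistency_rate(rows, labels, candidate) <= threshold:
            best = candidate
    return best


# Hypothetical toy data: class = feature 0 XOR feature 2; feature 1 is irrelevant.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
labels = [r[0] ^ r[2] for r in rows]
chosen = lvf(rows, labels)
print(chosen, inconsistency_rate(rows, labels, chosen))
```

Each candidate requires a full pass over the data to compute its inconsistency rate, which is the repeated consistency check that drives the cost mentioned above.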

LVI Algorithm
LVI (Las Vegas Incremental) [67] is a version of LVF that also uses consistency as the evaluation measure. In LVI, it is not necessary to use the whole sample to evaluate the measure. The algorithm starts with a small sample of the data, and the inconsistency rate is set to 0. The sample is partitioned into two groups. LVI uses the feature subset found by LVF on Group 1 to check Group 2. If Group 2 does not exceed the inconsistency threshold, LVI stops. Otherwise, the new sample is handed back to LVI and the process iterates. The process continues until a solution is found; otherwise the whole set is returned as the solution. LVI, like LVF, performs an extensive search of attributes, which identifies relevant attributes but at a higher complexity. LVI initially performs well, but no improvement in the solution is seen in later phases, resulting in a flat graph [68].

B&B Algorithm
In the B&B (Branch and Bound) algorithm, exponential search in the backward direction with any monotonic evaluation measure is used in the medical domain [69]. It starts with the complete set, and features are removed using a depth-first strategy. It is an optimal search algorithm in which a threshold is defined. Nodes whose values fall below the threshold are not explored because, under the monotonicity assumption, their sub-nodes cannot generate the optimal solution. The algorithm finds relevant attributes by branching on the dataset and stops when no further branching is possible. It is storage-intensive, which makes it computationally expensive. ABB (Automatic Branch & Bound) is an extension of B&B in which the threshold is set automatically. The pie chart in Fig. 8 shows that consistency accounts for more than one fourth of the evaluation measures used, monotonicity and mutual information for almost one third, whereas relevance and distance are scarcely used [70]. Details of these measures are discussed in Section 5.

QBB Algorithm
The quick branch and bound (QBB) method uses random and exponential search in the backward direction with any monotonic evaluation measure [71]. It is a hybrid algorithm that combines LVF and ABB. QBB uses LVF to find a good starting point for ABB, which subsequently explores the search space efficiently. Its execution time is shorter than that of the FOCUS, LVF and ABB algorithms [72].

RELIEF Algorithm
The RELIEF algorithm performs a random search in the prescribed search space and assigns weights to the attributes that are near the optimal value [73]. Distance is the evaluation measure in this algorithm. From the sample, it chooses an instance at random and finds its nearest hit and nearest miss. The nearest hit is the closest instance with the same class, whereas the nearest miss is the closest instance with a different class. Computing the two nearest neighbors increases the time complexity of the algorithm [74].
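The nearest-hit and nearest-miss weighting can be sketched as follows. The sketch assumes the usual formulation in which an instance is sampled at random, features are numeric and scaled to [0, 1], and absolute differences are used in the update (squared differences are another common choice). The name `relief`, its parameters, and the toy data are assumptions made for illustration, not details from [73].

```python
import random


def relief(rows, labels, n_iter=50, seed=0):
    """Return one weight per feature; larger means more class-discriminative."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features

    def dist(a, b):
        # Manhattan distance between two instances.
        return sum(abs(x - y) for x, y in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(rows))
        same = [j for j in range(len(rows)) if j != i and labels[j] == labels[i]]
        other = [j for j in range(len(rows)) if labels[j] != labels[i]]
        hit = min(same, key=lambda j: dist(rows[i], rows[j]))    # nearest hit
        miss = min(other, key=lambda j: dist(rows[i], rows[j]))  # nearest miss
        for f in range(n_features):
            # Reward features that differ on the miss, penalise those that
            # differ on the hit.
            weights[f] += (abs(rows[i][f] - rows[miss][f])
                           - abs(rows[i][f] - rows[hit][f])) / n_iter
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [(0.0, 0.2), (0.1, 0.9), (0.2, 0.5), (0.8, 0.3), (0.9, 0.8), (1.0, 0.4)]
labels = [0, 0, 0, 1, 1, 1]
weights = relief(rows, labels)
print(weights)
```

The two nearest-neighbour searches per sampled instance each scan the whole dataset, which is the source of the additional time complexity noted above.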

RELIEF-F Algorithm
RELIEF-F is an improved version of the RELIEF algorithm. Attributes in RELIEF-F are selected randomly, after which estimated quality values are allocated to the attributes, their nearest hits, and their nearest misses. The hit and miss values are averaged after the algorithm has been executed n times, for which reason the time complexity of the algorithm is comparatively lower [75,76].

MRMR Algorithm
Minimum Redundancy Maximum Relevance (MRMR) [77] selects attributes on the basis of two evaluation measures, dependency and relevance. If attribute A can be derived from another attribute B, then A is considered dependent. Dependency is calculated from the correlation among the attributes. Relevance means closely connected: it is decided on the basis of the mutual information shared between an attribute and the class value. MRMR selects only those features whose relevance is maximum and whose dependency is minimum. The performance of MRMR is claimed to be better than that of the Naïve Bayes, SVM, and LDA algorithms [78].
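The trade-off between maximum relevance and minimum dependency can be sketched as a greedy procedure. This sketch makes two assumptions that go beyond the description above: mutual information on discrete values is used for both relevance and dependency (rather than correlation), and the two are combined by subtraction. The names `mutual_information` and `mrmr` and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def mrmr(columns, labels, k):
    """Greedily pick k feature indices with high relevance to the class and
    low average dependency on the features already picked."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < k:
        def score(j):
            dependency = sum(mutual_information(columns[j], columns[s])
                             for s in selected) / len(selected)
            return relevance[j] - dependency
        remaining = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Hypothetical toy data: column 1 duplicates column 0, column 2 is weaker
# but carries different information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 1, 1],
]
picked = mrmr(columns, labels, 2)
print(picked)
```

In this toy example the duplicate column is as relevant as the first one, but its dependency on the already selected feature is maximal, so the less relevant but less dependent third column is preferred in the second step.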

MRMS Algorithm
Finding the dependencies among attributes is complex. Maximum Relevance and Maximum Significance (MRMS) therefore uses the significance of an attribute in place of dependency [79]. It first selects the attribute whose relevance is maximum, and the significance of the remaining attributes is then compared with that of the selected attribute [80]. Significance determines mutually exclusive attributes. Zero significance indicates that removing the attribute will not affect the accuracy of the results. MRMS was compared with the quick reduct algorithm and gave better results [81].

Red Removing MRLR Algorithm
This algorithm is based on maximum mutual information and was proposed by Li et al. [82]. Features are selected on the basis of mutual information. The relevance of the selected feature is then compared with that of all candidate attributes. The RRMRLR algorithm yields more relevant and less redundant attributes.

Joint Mutual Information Maximization Algorithm
The Joint Mutual Information Maximization (JMIM) algorithm selects relevant attributes on the basis of mutual information [83,84]. JMIM uses joint mutual information through a greedy approach by taking attributes as a feed-forward network. One of its variants is Normalized Joint Mutual Information Maximization (NJMIM), which computes the symmetrical relevance of an attribute. The classification accuracy of JMIM is higher than that of NJMIM, but it seems to be biased towards attributes with diverse information content [85].
From the above, it can be concluded that the consistency attribute evaluator is the most prominently used in filter feature selection algorithms. This evaluator keeps only those attributes in the dataset that contribute to a globally consistent solution. Mutual information and monotonicity are also widely used in filter feature selection algorithms. The algorithms use different search strategies, including exponential, sequential and random search, to find the most relevant features. The statistics on the use of different search strategies in filter algorithms are given in Fig. 9.
Sequential search is clearly the most widely used strategy in filter feature selection algorithms. It is preferred because of its optimal results and non-recursive nature. Exponential search is rarely used because it takes more time and has a higher complexity. Random search is likewise hardly used because of its limitations.

Wrapper Feature Selection Methods
The wrapper model requires one fixed data mining algorithm and uses its performance as the evaluation criterion. The selected data mining algorithm should fulfill two conditions:
(i) It should be capable of reducing the number of attributes as far as possible.
(ii) The algorithm should be highly computationally efficient.
Some wrapper model algorithms are given below.

SFS Algorithm
Sequential Forward Selection (SFS) uses a heuristic search in the forward direction. It starts with an empty set and, in the first iteration, adds one feature to the final subset [86]. In the next iteration, pairs are formed from the existing feature and each new feature, and the best pair is selected; the subset is updated with the new feature. In the third iteration, each new feature is evaluated together with the previously selected features, and the best one is added to form a triplet. This continues until a predefined number of features has been selected. SFS finds only those attributes that are needed and gives its best results when the sample is small. Its main limitation is its inability to remove features once they are added [87].
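The forward search can be sketched as a greedy loop around an evaluation function. Because a wrapper needs some fixed learning algorithm, this sketch assumes leave-one-out accuracy of a 1-nearest-neighbour classifier as the criterion; that choice, the names `loo_1nn_accuracy` and `sfs`, and the toy data are assumptions for illustration and are not prescribed by [86].

```python
def loo_1nn_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def sfs(rows, labels, k, evaluate=loo_1nn_accuracy):
    """Start from the empty set and add, one at a time, the feature that
    gives the best score together with those already selected."""
    selected = []
    remaining = list(range(len(rows[0])))
    while len(selected) < k:
        best = max(remaining, key=lambda f: evaluate(rows, labels, selected + [f]))
        selected.append(best)   # once added, a feature is never removed
        remaining.remove(best)
    return selected


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [(0.0, 0.2), (0.1, 0.9), (0.2, 0.5), (0.8, 0.3), (0.9, 0.8), (1.0, 0.4)]
labels = [0, 0, 0, 1, 1, 1]
print(sfs(rows, labels, 1), sfs(rows, labels, 2))
```

The `selected.append` line with no matching removal step is the limitation noted above: a feature that looked best early on stays in the subset even if later additions make it redundant.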

SBS Algorithm
The sequential backward selection (SBS) algorithm is exactly the opposite of SFS. All features are initially included in the feature set [88]. Features are then removed one by one on the basis of an evaluation criterion. The removed features are the ones that have no effect on performance. The performance of SBS may be optimal when a large number of attributes is present in the dataset [89]. The main limitation of SBS is that it does not reevaluate attributes once they are discarded.

SFFS Algorithm
The sequential floating forward search (SFFS) algorithm extends SFS with a backtracking step, which is what distinguishes it from SFS. Like SFS, it initially adds one attribute to the feature set. During backtracking, attributes are removed on the basis of the evaluation measure. SFFS ensures that no relevant attribute is eliminated from the feature set. Redundant attributes are not eliminated by this algorithm [90,91].

SFBS Algorithm
The sequential floating backward search (SFBS) algorithm starts with the complete list of features and an objective function. In every iteration, backward and forward tracking are performed until the objective function reaches its maximum. This forward and backward tracking increases the complexity of the algorithm. The global feature set is readily obtained because the algorithm explores the features that contribute to the objective function [92].

ASFFS Algorithm
The Adaptive Sequential Forward Floating Selection (ASFFS) [93] algorithm is a bottom-up approach. It is a more organized search that works in the forward direction. It uses two variables, R and O: the former specifies how many features are to be added in the inclusion phase, and the latter the number of attributes to be removed in the exclusion phase. ASFFS yields fewer redundant features than SFFS. ASFBS works in exactly the same way but starts with the set of all attributes. ASFFS is more complex than ASFBS, and it works well on smaller datasets [94].

Plus-L Minus-R Algorithm
The Plus-L-Minus-R algorithm [95] is a generalized form of the SFS and SBS algorithms. It has two variables, L and R, which represent the number of attributes added to and removed from the set, respectively. If L is greater than R, it starts with an empty set and keeps adding and removing features. If L is less than R, it starts from the complete set of attributes. The process continues until the defined number of attributes is reached. It tries to compensate for the limitations of SFS and SBS by adding some backtracking capability. The algorithm is more dynamic in behavior, as it can act as SFS or SBS depending on the values of L and R. However, no proper procedure for selecting the values of L and R is defined, so they are chosen randomly.
The distribution of search strategies in wrapper algorithms is illustrated in Fig. 10. Sequential search is used most often in this type of algorithm, whereas exponential and random search are used considerably less. Different data mining techniques are used to determine the evaluation criteria in wrapper algorithms. Wrapper methods are computationally expensive and slow compared to filter algorithms. Fig. 11 shows the division of data mining tasks used in wrapper algorithms.

Embedded Feature Selection Methods
The embedded model tries to take advantage of both filter and wrapper methods by changing the evaluation measure at different stages. It searches for features at training time. The embedded method aims to reduce the excess computational time of the wrapper approach. Some algorithms of the embedded method are discussed below.

BDSFS Algorithm
Boosted Decision Stump Feature Selection (BDSFS) uses the information gain criterion to select features. The number of features to be selected is assigned to a variable, say k, and the algorithm runs k times. At every step, it ignores all features that were selected previously, so the focus is on the unselected feature with the highest information gain [96,97]. Boosting assigns weights: the attribute that correctly predicts the class value receives the highest weight. BDSFS focuses on the diversity of an attribute and picks the attributes that carry more information by performing a greedy search.

BDSFS-2 Algorithm
BDSFS requires the number of features to be selected to be specified, which is considered an undesirable property of that algorithm. In BDSFS-2, the stopping criterion is instead assigned on the basis of the learning algorithm used with it. Features with maximum information gain are inserted into the final subset. This algorithm gave comparatively better results. Selection depends not only on the information content of an attribute but also on the learning algorithm used as the stopping criterion.

BBHFS Algorithm
The boosting-based hybrid feature selection (BBHFS) technique selects features based on the value of information gain. It uses the learning algorithm to identify features whose weights are high. Its stopping criterion is the same as that of BDSFS-2. It is fast, and its classification accuracy is good. The performance of BDSFS-2 and BBHFS is equal and comparable with that of BDSFS.

SVM-RFE Algorithm
The SVM-RFE algorithm changes the weight of a feature on the basis of the linear discriminant function of weights [98]. The new objective function is classified using an SVM to perform recursive feature elimination. SVM-RFE is used for binary classification. It uses normalization to minimize the SVM problem [99]. Finding the optimal objective function is difficult in SVM-RFE.

LFS Algorithm
The lazy feature selection (LFS) algorithm selects and eliminates features from the feature space and also makes predictions. It uses the K-NN algorithm to predict the class label of an attribute and is mostly used in text categorization problems. It takes advantage of the scattering of attributes and uses it as a feature selection method. It is classifier dependent: any change in the classifier changes the working of the algorithm [100,101].
Figs. 12 and 13 show that embedded techniques mostly use the sequential search strategy to find relevant attributes. Classification techniques account for the larger share of the data mining techniques used in embedded algorithms, two thirds, and clustering for about a third.
* Cost, complexity, T.C. and S.C. are classified into low and high on the basis of certain thresholds.

Attribute Selection Measures
In addition to the three major feature selection methods stated above, the literature reports a few attribute selection measures on the basis of which attributes are either eliminated from the datasets or ranked by their prominence. A few evaluation measures are used to select or reject features [18]. Almost 29 measures, used in the induction and pruning phases, are discussed. Most of these attribute selection measures are used in decision tree algorithms, where they evaluate attribute importance. Some attribute selection measures are discussed below.

Information Gain
(Table continued) The wrapper algorithms despite using optimization techniques of setting objective function and developing a strategy to address the constraints are computationally expensive. These algorithms are mostly applied on structured datasets and to a smaller class of unstructured datasets like text. The algorithms are found to be used on multimedia data but the complexity is seen to be higher which is because of the incremental approach used in the algorithms. Another reason for its limited application in multimedia systems is the limited use of backpropagation mechanism which is the demand of recursive algorithms applied on multimedia data. Such datasets expect exponential search for selecting appropriate features.

(Table continued) Embedded algorithms are less in number for which the pros and cons of these algorithms need further deliberations in the literature. These algorithms attempt to combine the pros of wrapper and filter techniques. The algorithms attempt to reach global optimal and consistent solution in a sequential manner for which the time complexity of fewer algorithms is relatively lower. Embedded algorithms are applied on unstructured data for which their complexity in the literature seem higher in the application domain but should not be compared with the complexity of the algorithms applied on unstructured data.

Information gain is computed from the joint relation between two values. Equality in the values of numerator and denominator indicates independence of both, hence generating a 0 result [102,103]. An attribute with greater information gain tells that it contains maximum information, which shows its diversity. Information gain is biased towards more diverse attributes. Information gain of an attribute is computed by the following equation.
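The equation for information gain is not reproduced in this text. As a minimal sketch, assuming the standard mutual-information form (a sum over attribute and class values of the joint probability times the logarithm of the joint probability divided by the product of the marginals), information gain can be computed as follows. The function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def information_gain(attribute, labels):
    """Mutual information (in bits) between an attribute column and class labels.

    Assumed form: sum over (a, c) of P(a, c) * log2(P(a, c) / (P(a) * P(c))).
    """
    n = len(labels)
    joint = Counter(zip(attribute, labels))
    attr_counts = Counter(attribute)
    class_counts = Counter(labels)
    gain = 0.0
    for (a, c), count in joint.items():
        p_joint = count / n
        p_indep = (attr_counts[a] / n) * (class_counts[c] / n)
        # When numerator and denominator are equal the log term is 0,
        # which corresponds to independence of attribute and class.
        gain += p_joint * log2(p_joint / p_indep)
    return gain


# Hypothetical toy data: the first attribute determines the class,
# the second is independent of it.
labels = [0, 0, 1, 1]
informative = ["x", "x", "y", "y"]
uninformative = ["x", "y", "x", "y"]
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

In this sketch an attribute that is independent of the class yields a gain of 0, which matches the remark about equal numerator and denominator.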

Kapur's Entropy
Kapur's entropy is a multilevel, unsupervised, automatic thresholding technique in which the threshold is chosen from the entropy of the segmented classes of the image. Kapur's entropy is taken as an optimization objective function with a set of constraints applied to it. The sum of the entropies of the distinct classes $C_1, C_2, \ldots, C_n$, where $P_1, P_2, \ldots, P_n$ are the probabilities of the distinct classes in an image, is denoted as Kapur's entropy. The objective function given below is maximized to obtain the optimal threshold values [104].
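The objective function itself is not reproduced in this text. The following is a minimal sketch of the idea, restricted to a single threshold (two classes) for brevity, whereas the technique described above is multilevel. It assumes the usual formulation in which each class entropy is computed from the histogram probabilities normalized by the class probability; the function name and the histogram are hypothetical.

```python
from math import log


def kapur_threshold(histogram):
    """Return (threshold, score) maximizing the sum of the two class entropies.

    Bins with index < threshold form the first class, the rest the second.
    """
    total = sum(histogram)
    probs = [h / total for h in histogram]
    best_t, best_score = None, float("-inf")
    for t in range(1, len(probs)):
        w0 = sum(probs[:t])   # probability mass of the first class
        w1 = sum(probs[t:])   # probability mass of the second class
        if w0 <= 0 or w1 <= 0:
            continue          # skip thresholds that leave a class empty
        h0 = -sum((p / w0) * log(p / w0) for p in probs[:t] if p > 0)
        h1 = -sum((p / w1) * log(p / w1) for p in probs[t:] if p > 0)
        if h0 + h1 > best_score:
            best_t, best_score = t, h0 + h1
    return best_t, best_score


# Hypothetical 6-bin grey-level histogram with two well separated modes.
threshold, score = kapur_threshold([10, 10, 0, 0, 10, 10])
print(threshold, score)
```

A multilevel variant would search over several thresholds at once, which is where the optimization techniques discussed in this review are typically applied.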

Symmetric Information Gain Ratio 1
Information gain may result in biased symmetry, which is avoided by combining the attribute and class as given in the equation below. The attribute is represented by A and the class by C.

Symmetric Information Gain Ratio 2
Another way to normalize information gain is to divide it by the sum of the individual entropies of the attribute and the class. Both of these normalization techniques try to compensate for the bias towards many-valued attributes. Symmetric information gain is computed by the following formula.
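The two formulas are not reproduced in this text. As a hedged sketch, the code below assumes that the first ratio divides information gain by the joint entropy of attribute A and class C, and that the second divides it by the sum of the individual entropies of A and C, as the descriptions suggest. Function names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_gain_ratios(attribute, labels):
    """Return (ratio_1, ratio_2) under the assumptions stated in the lead-in."""
    h_a = entropy(attribute)
    h_c = entropy(labels)
    h_ac = entropy(list(zip(attribute, labels)))   # joint entropy of (A, C)
    gain = h_a + h_c - h_ac                        # information gain
    ratio_1 = gain / h_ac if h_ac > 0 else 0.0         # normalized by joint entropy
    ratio_2 = gain / (h_a + h_c) if h_a + h_c > 0 else 0.0  # normalized by entropy sum
    return ratio_1, ratio_2


# Hypothetical toy data in which the attribute determines the class.
print(symmetric_gain_ratios(["x", "x", "y", "y"], [0, 0, 1, 1]))
```

Both ratios are symmetric in A and C, so swapping the attribute and the class leaves the values unchanged.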

Quadratic Information Gain
In previous information gain computations, Shannon entropy was used which is not the only type of entropy. Another form of entropy that we can generate is quadratic entropy. In this type of entropy, guesses are made to select alternatives. The correctness of the guesses cannot be predicted but the frequency of incorrectness can be determined. Quadratic entropy is used to derive quadratic information gain. The formula is given below.

Shannon Measure of Information
Shannon proposed a measure of information, based on the uncertainty and unlikelihood of occurrence of an event, known as Shannon entropy [105]. The basic axiom of Shannon entropy for the selection of an attribute is that an attribute with a low probability value carries more information than an attribute with a higher probability value. The value of Shannon entropy is calculated by the equation given below [105].
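The equation is not reproduced in this text. The following minimal sketch uses the standard definitions, self-information as the negative base-2 logarithm of a probability and entropy as its expected value, to illustrate that less probable events carry more information. The function names are hypothetical.

```python
from math import log2


def self_information(p):
    """Information (in bits) carried by an event of probability p, 0 < p <= 1."""
    return -log2(p)


def shannon_entropy(probabilities):
    """Expected self-information of a discrete distribution (in bits)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# A rarer event carries more information than a common one.
print(self_information(0.125), self_information(0.5))
# A uniform two-outcome distribution has maximal entropy for two outcomes.
print(shannon_entropy([0.5, 0.5]), shannon_entropy([0.9, 0.1]))
```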

Correlation Analysis
Correlation analysis is also used for the selection of prominent attributes from larger datasets. It is a bivariate analysis that measures the strength of association between two variables. The outcome of correlation analysis ranges from −1 to +1, where −1 indicates maximum negative correlation, +1 represents maximum positive correlation, and 0 indicates independence of the variables. In order to select attributes based on the correlation value, a threshold value is chosen [106]. This is a popular attribute selection technique and is available in multiple forms, such as Pearson correlation, Spearman rank correlation, the Kendall correlation coefficient, and canonical correlation. Correlation between two variables is calculated by the following equation [107].
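The equation is not reproduced in this text. As a minimal sketch, assuming the Pearson form of the coefficient, threshold-based attribute selection could look as follows. The function names, the threshold of 0.5, and the data are hypothetical choices for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / (sd_x * sd_y)


def select_by_correlation(columns, target, threshold=0.5):
    """Keep the attributes whose absolute correlation with the target
    reaches the chosen threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Hypothetical attributes and target values.
columns = {"f1": [1, 2, 3, 4], "f2": [1, -1, 1, -1]}
target = [2, 4, 6, 8]
print(select_by_correlation(columns, target))
```

The absolute value is used so that strongly negative correlations are retained as well as strongly positive ones.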

Canonical Correlation Analysis
Canonical correlation analysis (CCA) is a method of inferring information from cross-covariance matrices. CCA finds the linear combinations of two variables that have maximum correlation. The basic idea of CCA is to find an index that may expose a link between the variables whose correlation is sought. If X and Y are assumed to be two random variables, canonical correlation analysis searches for vectors a and b so that the relation between the two indices $a^T X$ and $b^T Y$ can be quantified [108].

Metaheuristic Selection Algorithms
A metaheuristic is a process that generates a heuristic which can serve as a solution to an optimization problem. Metaheuristics usually do not guarantee a globally optimal solution. There are multiple classifications of metaheuristics, such as local vs. global search, single-solution vs. population-based, and parallel and nature-inspired. These techniques include particle swarm optimization (PSO), artificial bee colony optimization, ant colony optimization, genetic algorithms, and simulated annealing. The use of these algorithms on medical datasets has shown reduced execution time and earlier convergence [109].
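To illustrate how a metaheuristic can drive feature selection, the following is a minimal sketch of simulated annealing, one of the techniques listed above, searching over binary feature masks. It is not taken from the reviewed works. The fitness function, the cooling schedule, and all parameter values are hypothetical; in practice the fitness would be a classifier's validation accuracy.

```python
import math
import random


def simulated_annealing_select(n_features, fitness, iterations=200,
                               start_temp=1.0, cooling=0.98, seed=0):
    """Search for a feature mask (list of bools) with high fitness."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = fitness(current)
    best, best_score = current[:], current_score
    temp = start_temp
    for _ in range(iterations):
        # Neighbour: flip the inclusion of one randomly chosen feature.
        candidate = current[:]
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]
        score = fitness(candidate)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature decreases.
        if score >= current_score or \
                rng.random() < math.exp((score - current_score) / temp):
            current, current_score = candidate, score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temp *= cooling
    return best, best_score


def toy_fitness(mask):
    """Hypothetical fitness: features 0 and 2 are useful, and every
    selected feature incurs a small cost."""
    return sum(mask[i] for i in (0, 2)) - 0.1 * sum(mask)


mask, score = simulated_annealing_select(5, toy_fitness)
print(mask, score)
```

As with other metaheuristics, the returned mask is the best one found during the search and is not guaranteed to be globally optimal.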

Gini Index
Quadratic information gain (QIG) is very similar to conventional information gain, and both measures are considered to be biased. The Gini index is another measure used in different classification algorithms; it removes the bias of information gain by penalizing attributes [110,111]. It favors attributes with larger partitions and is mostly used in CART-type trees. The Gini index classifies data by using the squared proportions of the classes. The attribute with the lowest value of the Gini index is selected. The Gini index is computed by using the following formula.
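The formula is not reproduced in this text. As a hedged sketch, the code below assumes the CART-style definition, in which the impurity of a set is one minus the sum of the squared class proportions, and the Gini index of an attribute is the size-weighted impurity of the partitions it induces. Function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(attribute, labels):
    """Size-weighted Gini impurity of the partitions induced by an attribute."""
    groups = defaultdict(list)
    for a, c in zip(attribute, labels):
        groups[a].append(c)
    n = len(labels)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())


# Hypothetical toy data: the attribute with the lowest index is preferred.
labels = [0, 0, 1, 1]
candidates = {"informative": ["x", "x", "y", "y"],
              "uninformative": ["x", "y", "x", "y"]}
best = min(candidates, key=lambda name: gini_index(candidates[name], labels))
print(best)
```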

Modified Gini Index
Modified Gini index for the first time undermined the diversity of an attribute as to be taken for measuring the worth of an attribute. It squares the probabilities to increase the effect of heuristics. More probable values having higher influence are selected. The formula is given below.

Relief Measure
Relief measure is closely related to Gini index. It is developed for classification tasks and assesses how well an attribute predicts class values [112]. Good prediction is achieved when every value of attribute corresponds to a unique value of the class attribute. The formula is given below.

Weight of Evidence
Weight of evidence is also used in classification tasks. In order to calculate the weight of evidence, the odds are computed as odds(n) = n/(1 − n). Like the relief measure, it evaluates how important an attribute is in the prediction of the class label [113]. The greater the value for an attribute, the better the attribute. The formula is given below.

Relevance Measure
Like relief measure, in the calculation of relevance, an attribute is given priority if its every value corresponds to a unique value of class. Conditional probabilities of an attribute and class are calculated which shows their relevance. The formula is given below.

χ 2 Measure
The χ 2 measure is well known in statistics, and it serves the same purpose here. It computes the individual squared differences between two distributions, whereas the information gain measure finds the difference between the actual joint distribution and the independent distribution of two attributes. The equation is given below.
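The equation is not reproduced in this text. A minimal sketch, assuming the standard Pearson χ 2 statistic that compares the observed joint counts of attribute and class with the counts expected under independence, is given below. The function name and data are hypothetical.

```python
from collections import Counter


def chi_square(attribute, labels):
    """Pearson chi-square statistic between an attribute and the class."""
    n = len(labels)
    joint = Counter(zip(attribute, labels))
    attr_counts = Counter(attribute)
    class_counts = Counter(labels)
    statistic = 0.0
    for a, n_a in attr_counts.items():
        for c, n_c in class_counts.items():
            expected = n_a * n_c / n          # count expected under independence
            observed = joint.get((a, c), 0)   # actual joint count
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: a larger statistic indicates stronger dependence.
labels = [0, 0, 1, 1]
print(chi_square(["x", "x", "y", "y"], labels))
print(chi_square(["x", "y", "x", "y"], labels))
```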

Specificity Gain
The measure is well known in statistics, here it also does the same. It computes the individual squared difference between two distributions. Whereas information gain measure finds the difference between actual joint and independent distribution of two attributes. The equation is given below.

Symmetric Specificity Gain
As with information gain, there are many ways to reduce the bias towards multivalued attributes. Symmetric specificity gain also tries to eliminate this bias by using specificity gain. The formula is given below.

Possibilistic Mutual Information
Information gain was first known as mutual information because it finds the actual joint distribution and independent distributions. Mutual information is the combined probability of class and an attribute. The formula is given below.

Feature Selection Evaluation Criteria
In data mining, classification, and feature selection, different algorithms use different criteria to evaluate attributes. From the literature, it has been observed that most algorithms use consistency as an evaluation measure. A detailed breakdown of literature citations in this regard is given in Fig. 14. The consistency evaluation measure is a recursive process which continuously checks the consistency level achieved by selecting or rejecting an attribute. It is a computational overhead to compute the distances of attributes in the same class or across classes. Relevance evaluation selects relevant attributes at the cost of redundancy in a dataset, whereas the information evaluation measure picks the attribute that is most informative but is found to be biased towards multi-valued attributes.
Feature selection techniques use classification algorithms to cross-check the classification accuracy results. Many techniques, such as entropy, GA, PSO, and Grasshopper [114][115][116][117][118][119][120][121], are used for this purpose in the medical area. Moreover, these techniques are also employed in data mining, as shown in Fig. 15. SVM and GA are the most used data mining techniques in feature selection algorithms, with a difference of one point between them; clustering is two points below GA, ARM is two points below clustering, and RF is six points below ARM. Feature selection plays a crucial role in the medical imaging domain for classification tasks. The major medical areas in which feature selection techniques are mostly used include skin cancer, brain tumor, lung cancer, stomach diseases, and blood diseases, to name a few. The importance of feature selection techniques in these areas is discussed in Section 4. Based on the following points, feature selection techniques will be considered in future studies.
(a) The selection of robust features can improve the prediction accuracy of attribute-based skin lesion classification.
(b) The selection of the best features is also useful in the recognition of multi-type skin lesions.
(c) Selecting the most robust features reduces the number of predictors, which helps to reduce the computation time of skin lesion classification.
(d) Brain datasets such as the BRATS series are very large, and the number of images is in the millions. Therefore, brain modalities such as T1, T1W, T2, and Flair are difficult to classify into the relevant categories. Several deep learning techniques have been implemented in the literature, but they have still not achieved the desired accuracy. Feature selection techniques are therefore useful for brain modality classification with improved accuracy and low computational time.
(e) Sometimes, the selection of the best features is also employed in segmentation tasks such as brain tumor and skin lesion segmentation, where deep learning models are trained to detect the lesion area. However, due to the presence of noise and a few other factors, the extracted features do not map the exact lesion area. Feature selection techniques can then be used to select only the relevant features for accurate lesion or tumor detection.
(f) Feature optimization techniques such as genetic algorithm (GA), PSO, Swarm Intelligence, Bee Colony, Whale Optimization, Grasshopper, and a few others are useful in medical imaging and have achieved considerable success according to a few recent articles [122].
(g) The selection of deep learning features also supports better recognition accuracy in the medical domain, especially for a massive amount of imaging data. For deep learning, the meta-heuristic techniques are not useful; therefore, it is essential to implement computational methods like Newton Raphson [123].
(h) Physics-based feature selection techniques, such as entropy-controlled selection, have attracted wide attention from computer vision researchers working on medical imaging [124,125]. Entropy-based techniques minimize the computational time of the implemented CAD system and improve recognition accuracy.
(i) The performance of each selection method depends on the fitness function. Researchers have mostly utilized FKNN and MSER as fitness functions [2,126,127]. However, the fitness function can be optimized using better techniques like ELM, Neural Network, and Naïve Bayes.

Discussion & Conclusion
Due to the exponential growth of information investments in medical data repositories and health service provision, medical institutions are collecting large volumes of data [128]. These data repositories contain detailed information that is essential to support medical diagnostic decisions and to improve the quality of patient care. However, the data are high-dimensional, with the number of features exceeding the number of samples. Fast and accurate machine learning systems are desired in the medical domain. Most machine learning algorithms require a large number of samples for training, because small samples can reduce generalization capacity and lead to overfitting. Using traditional methods to deal with such data is not a good idea because of the curse of dimensionality [129]. In order to obtain better accuracy, the selection of important features is required [130][131][132][133]. Feature selection is one of the effective methods used to remove redundant and unimportant features before pattern classification [134]. In this study, different approaches to medical applications of feature selection are reviewed. It is demonstrated that the feature selection process is not only useful for reducing the number of features but can also enhance the accuracy rate, and thus helps in understanding the underlying cause of diseases. Our discussion spanned feature selection, feature extraction, dimensionality reduction, and attribute selection. Three general approaches to feature selection, namely filter, wrapper, and embedded methods, are described in detail, and their algorithms are also presented. The attribute evaluation measures are reviewed, and the advantages and disadvantages of feature selection methods and attribute evaluation measures are also discussed. It is concluded that feature selection and reduction algorithms benefit both execution time and accuracy in medical imaging. Moreover, it is also concluded that the selection of relevant features decreases the overall complexity of a CAD system.

Funding Statement:
The author(s) received no specific funding for this study.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.