Deep-FS: A feature selection algorithm for Deep Boltzmann Machines

A Deep Boltzmann Machine is a Deep Neural Network formed from multiple layers of neurons with nonlinear activation functions. The structure of a Deep Boltzmann Machine enables it to learn very complex relationships between features, and facilitates advanced performance in learning high-level representations of features compared to conventional Artificial Neural Networks. Feature selection at the input level of Deep Neural Networks has not been well studied, despite its importance in reducing the number of input features processed by the deep learning model, which facilitates understanding of the data. This paper proposes a novel algorithm, Deep Feature Selection (Deep-FS), which is capable of removing irrelevant features from large datasets in order to reduce the number of inputs which are modelled during the learning process. The proposed Deep-FS algorithm utilizes a Deep Boltzmann Machine, and uses knowledge acquired during training to remove features at the beginning of the learning process. Reducing inputs is important because it prevents the network from learning the associations between irrelevant features, which negatively impacts the knowledge the network acquires about the overall distribution of the data. The Deep-FS method embeds feature selection in a Restricted Boltzmann Machine which is used for training a Deep Boltzmann Machine. The generative property of the Restricted Boltzmann Machine is used to reconstruct eliminated features and calculate reconstruction errors, in order to evaluate the impact of eliminating features. The performance of the proposed approach was evaluated with experiments conducted using the MNIST, MIR-Flickr, GISETTE, MADELON and PANCAN datasets. The results revealed that the proposed Deep-FS method enables improved feature selection without loss of accuracy on the MIR-Flickr dataset, where Deep-FS removed 775 input features without any reduction in performance.
With regards to the MNIST dataset, Deep-FS reduced the number of input features by more than 45%; it reduced the network error from 0.97% to 0.90%, and also reduced processing and classification time by more than 5.5%. Additionally, when compared to classical feature selection methods, Deep-FS returned higher accuracy. The experimental results on GISETTE, MADELON and PANCAN showed that Deep-FS reduced the number of input features by 81%, 57% and 77%, respectively. Moreover, the proposed feature selection method reduced the classifier training time by 82%, 70% and 85% on the GISETTE, MADELON and PANCAN datasets, respectively. Experiments with various datasets, comprising large numbers of features and samples, revealed that the proposed Deep-FS algorithm overcomes the main limitations of classical feature selection algorithms. More specifically, most classical methods require, as a prerequisite, a pre-specified number of features to retain, whereas in Deep-FS this number is identified automatically. Deep-FS performs the feature selection task faster than classical feature selection algorithms, which makes it suitable for deep learning tasks. In addition, Deep-FS is suitable for finding features in large datasets which are normally stored in data batches for faster and more efficient processing.


Introduction
The successful performance of deep learning in various applications such as image recognition [1,2] , speech recognition [3] and bioinformatics [4] , has captured considerable attention in recent literature. Deep learning (DL) methods provide promising results on problems for which conventional machine learning methods have not made major progress, despite many attempts [1] . Conventional machine learning methods have limited ability to process raw data, and for this reason considerable effort is traditionally placed on feature engineering. Feature engineering represents data in a manner such that machine learning algorithms can identify patterns and classify the data. An important advantage of DL methods over conventional approaches (e.g. Artificial Neural Network, Support Vector Machine, Naïve Bayes), is that DL methods integrate the feature extraction and learning process into a single model, and thus feature engineering is dealt with as an integrated rather than a separate task.
Feature selection aims to eliminate redundant and irrelevant features via different criteria. The most commonly used criteria measure the relevance of each feature to the desired output, and use this information to select the most important features [5,6]. Highly dependent features can be considered redundant, and some of these can be eliminated during a feature selection process. Eliminating irrelevant and redundant features results in a permanent reduction in the dimensionality of the data, and this can increase the processing speed and accuracy of the utilized machine learning methods.
Deep Neural Networks (DNNs), such as those proposed in [1,7,8], use feature extraction rather than feature selection methods for extracting underlying features from big data. For example, Hinton et al. [1] proposed a DNN method which reduces data dimensionality through a non-linear combination of all input features in a number of layers of the neural network, and this approach inspired the development of new algorithms in the deep learning field [9-11]. DNNs can learn very complex relationships between variables through their high numbers of non-linear elements. However, if a high number of irrelevant features exist in the input feature space, then the relationships between these irrelevant features may also be modelled. Modelling the irrelevant features acts as noise, and learning the associations between irrelevant features negatively impacts the knowledge the network acquires about the overall distribution of the data, as well as the computational time. Modelling the irrelevant features can also lead to overfitting [10], because the method learns irrelevant details from the training data and becomes more biased towards previously seen data [12]. A technique called Dropout [10] was proposed to increase the generalisation ability of neural networks with high numbers of neurons. However, a major limitation of the Dropout method is that it can retain all input features and neurons, including redundant and irrelevant ones.
The Deep Belief Networks (DBNs) proposed by Hinton and Salakhutdinov [1], and the Deep Boltzmann Machines (DBMs) proposed by Srivastava and Salakhutdinov [7], are two types of DNNs which use densely connected Restricted Boltzmann Machines (RBMs). The high numbers of processing elements and connections, which arise because of the full connections between the visible and hidden units, increase the RBM's computational cost and training time. In addition, training several independent RBMs increases the training time [13-15]. When the scale of the network is increased, the required training time increases nonlinearly. Reducing the number of input features can significantly reduce the size of the constructed weight matrix, and consequently it can reduce the computational cost of running deep learning methods, especially when a large network size is required for practical applications [16].
This paper proposes a novel algorithm, Deep Feature Selection (Deep-FS), for embedding feature selection capabilities into DBMs, such that irrelevant features are removed from raw data to discover the underlying feature representations that are required for classification. DBMs primarily use an unsupervised learning method to initialize the learning parameters of a DNN; the initialized DNN is then fine-tuned by a backpropagation method. Deep-FS combines the feature extraction property of the DBM with a feature selection method based on the generative property of RBMs. RBMs are generative models and can reconstruct missing input features. The proposed Deep-FS uses an RBM that is trained during the learning procedure of the DBM, to improve the efficiency of the method in dealing with high volumes of data. Deep-FS returns a reduced subset of features and improves the deep learning method's computational efficiency by reducing the size of the constructed network. DBMs are known to perform well at feature extraction, and adding a feature selection ability to DBMs can lead to a new generation of deep learning models with an improved ability to deal with highly dimensional data.
The remainder of the paper is structured as follows: Section 2 discusses the background to the work; Section 3 provides details of the proposed method; Section 4 describes the experimental results; and Section 5 provides a conclusion and future work.

Background
This section provides a background on Deep Boltzmann Machines and feature selection methods.

Deep Boltzmann Machine
Deep Neural Networks (DNNs) are mainly trained with stochastic gradient descent and backpropagation algorithms. Two main techniques are used for training DNNs: the first is based on a filtering strategy, and the second on unsupervised pre-training. The filtering technique is used by Convolutional Neural Networks (CNNs) to locally filter inputs; filtering is performed by convolving the input with weight matrices.
In the second technique, information processing starts with an unsupervised learning method, in which unlabelled data can be used. Then, the DNN is fine-tuned by a supervised method using labelled data. Deep Belief Networks (DBNs) [1] and DBMs [7,17] are examples which use this semi-supervised technique. Pre-training using an unsupervised method improves the generalisation of the trained network, especially when the dataset contains a small amount of labelled data. This paper is focused on DBMs, which include unsupervised learning during the first stage of their training procedure.
DBMs have been used in different applications such as image-text recognition [7], facial expression recognition [18], 3D model recognition [19], and audio-visual person identification [20], and belong to a group of DNNs that use a pre-training procedure. After the pre-training procedure, the DBM is fine-tuned using labelled data [11,20]. A DBM is composed of a set of visible units corresponding to the input data, together with a network of symmetrically coupled stochastic binary units called hidden units. The binary hidden units are arranged in different layers, with top-down and bottom-up couplings between adjacent layers; there are no direct connections between units in the same layer. A DBM represents the input data in several layers with increasingly complex representations. In a DBM, a learning procedure is executed to pre-train a number of layers in conjunction with each other. A DBM can potentially be used to capture a complex internal representation of the input data. The interaction of different hidden layers during the initial learning generates a deep network with a high ability to reconstruct ambiguous inputs through top-down feedback between the interacting layers. During the pre-training stage of a DBM, a learning procedure similar to that of RBMs is used. The RBM is discussed in Section 3.
Recently, a multimodal data processing method was designed based on DBMs [20]. In this method, two DBM networks are initially trained separately on visual and auditory data. Then the outputs of the two DBMs are combined in a joint layer. The representation extracted from the joint layer is used as input to a Restricted Boltzmann Machine (RBM), which is trained on the joint-layer representation. Then the entire network is fine-tuned with labelled data.
The full connections between the visible and hidden units increase the computational cost of a DBM, and reducing the number of input features through a feature selection method can therefore reduce this cost.

Feature selection
The amount of data available to machine learning models is increasing exponentially. The number of features in datasets is also increasing, and many of these features are irrelevant or redundant, and thus not needed to improve the performance of a machine learning model [12]. Feature Selection (FS) can accelerate the learning process of a machine learning method because it allows the algorithm to learn using a smaller number of features. Additionally, it can improve classification performance by preventing overfitting [12,21]. A feature selection method reduces the number of input features without a significant decrease in the performance of a machine learning method [6,12,22].
Feature selection methods for classification can be divided into three main categories: filter, wrapper and embedded methods. These approaches are used to combine feature selection with a classification model. In filter methods, feature selection is performed as a preprocessing stage which is independent of the classification model. Each filter method ranks the features based on a criterion, and the highest ranked features are selected [6]. In feature ranking methods, the relevance of each feature to a class label is evaluated individually, and a weight is calculated for each feature. Then, features are ranked based on their weights and the top features with the highest weights are selected [12,21]. Maximum Relevance (MR) feature selection is a type of feature ranking algorithm, where mutual information and kernel-based independence measures are usually employed to calculate the relevance score [23]. Feature ranking methods are simple and have low computational cost. Kira and Rendell's "Relief" algorithm [24], and the feature selection method proposed by Hall [25], are examples of feature selection methods that work based on the dependency of features on the class labels. Although the selected top features have the highest relevance to the class labels, they might be correlated with each other, carry redundant information, and not include all the useful information in the original feature set.
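As an illustration of the feature-ranking idea, the following sketch scores each feature by the absolute Pearson correlation between the feature and the class label and keeps the top k. This is a generic minimal example of a filter method, not an implementation of any of the cited algorithms:

```python
import numpy as np

def rank_features(X, y, k):
    """Score each feature by |Pearson correlation| with the label
    and return the indices of the k highest-scoring features."""
    Xc = X - X.mean(axis=0)            # centre each feature column
    yc = y - y.mean()                  # centre the labels
    num = Xc.T @ yc                    # covariance of each column with y
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    scores = np.abs(num / den)         # |correlation| per feature
    return np.argsort(scores)[::-1][:k]
```

Because each feature is scored independently of the others, two highly correlated copies of the same informative feature would both be ranked near the top, which is exactly the redundancy problem noted above.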
The feature selection method proposed by Fleuret [26] considers the dependency between input features using conditional mutual information. The method considers the dependency of a new feature on the already selected features and eliminates features that are similar to the previously selected ones: if a feature is dependent on previously selected features then it does not contain new information about the class, and it can be eliminated.
Peng et al. [27] proposed a feature selection method called Minimum Redundancy Maximum Relevance (mRMR) to address feature redundancy. A subset of features which have high relevance to class labels and which are non-redundant is selected. The mRMR method has better performance than methods based solely on feature relevance. In the mRMR filtering feature selection method, the dimension of the selected feature space is not set at the start of the procedure.
In wrapper methods, the performance of a classifier is used to evaluate a subset of features. Different subsets of features are explored using different search algorithms to find the optimal feature subset that gives the highest classification performance [6].
For an initial feature space dimensionality of N, a total of 2^N subsets can be evaluated, which makes exhaustive search intractable (the problem is NP-hard). Sequential search, and evolutionary algorithms such as the Genetic Algorithm or Particle Swarm Optimization, can be used to design computationally feasible wrapper feature selection methods [6]. Wrappers have higher computational cost compared to filter feature selection methods.
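A minimal wrapper sketch of the sequential-search idea: greedy forward selection sidesteps the 2^N exhaustive search by adding one feature at a time. The classifier is abstracted as a user-supplied scoring function, and the stopping rule (stop when the score no longer improves) is one common choice among many:

```python
import numpy as np

def forward_select(X, y, score_fn, max_feats):
    """Greedy sequential forward selection: repeatedly add the feature
    that most improves score_fn(X_subset, y), a cheap alternative to
    evaluating all 2**N subsets."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_feats:
        # score every candidate extension of the current subset
        scores = [(score_fn(X[:, selected + [f]], y), f) for f in remaining]
        s, f = max(scores)
        if s <= best_score:            # no improvement: stop searching
            break
        best_score = s
        selected.append(f)
        remaining.remove(f)
    return selected
```

Each candidate subset triggers a full evaluation of the classifier, which is precisely why wrappers cost more than filters.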
In embedded feature selection methods, the feature selection procedure is integrated into the training procedure of the classifier. This reduces the computational cost compared to wrapper methods, where a high number of subsets must be retrained by the classifier [6]. Guyon et al. [28] used a Support Vector Machine (SVM) as the classifier to design an embedded feature selection method: features are evaluated during learning iterations, and the features that decrease the separation margin between classes are removed.
Feature selection algorithms coupled with feature extraction methods can improve the performance of machine learning methods [12,21]. Feature selection algorithms can reduce the inputs to a classifier, which in turn reduces computational cost and increases accuracy. However, classical feature selection algorithms are usually designed for small datasets, and thus there is an emerging need for feature selection algorithms which can optimally search through large datasets with thousands of samples and a high number of features. This paper focuses on coupling feature selection with the feature extraction capability of deep learning to improve the performance of deep learning models.

Deep learning and feature selection
This section discusses a number of feature selection methods that are used with deep learning methods. Ruangkanokmas et al. [29] used a classical filter-based feature selection approach for sentiment classification. A filter-based feature selection technique, the chi-squared method, was used to select features, and thereafter the selected features were utilised to train a DBN. Feature selection and training of the DBN were performed in two separate stages. Combining feature selection and DBN training has the potential to improve the efficiency of a classifier.
Ibrahim et al. [30] also used a DBN, classical feature selection methods, and an unsupervised Active Learning method to process gene expression data. The DBN was used to model high-dimensional input data and to extract a higher-level representation of the input data. Then classical statistical feature selection methods, such as the t-test, were used to select features from the higher-level extracted features. Ibrahim et al.'s [30] method uses three cascaded independent modules with separate computational costs, and these costs can increase significantly when the amount of training data is large. A DBN is designed to work with a high number of training samples, and it requires feature selection functionality which is suitable for large training datasets, in order to prevent unnecessary computational cost.
Nezhad et al. [31,32] proposed a feature selection method based on a five-layer stacked Auto-Encoder deep network. Higher-level representative features were extracted from the hidden layer placed in the middle of the Auto-Encoder. Then, a classical feature selection method based on feature ranking and random forests was used to select features. After that, a supervised learning method was trained on the selected features and evaluated using the Mean Squared Error (MSE). The number of hidden neurons in the Auto-Encoder's hidden layers is adjusted to optimize the structure of the network based on the obtained results, and this process continues until an acceptable result is reached. This method uses several modules, which increases its computational cost.
Li et al. [33] proposed a deep feature selection (DFS) model using a deep structure of nonlinear neurons. They used a multilayer perceptron (MLP) neural network to learn the nonlinearity of the input data. Additionally, they use a sparse one-to-one linear layer before the MLP to select input features. The weight matrix of the first linear layer is sparse, and the input features corresponding to nonzero weights are selected. Their feature selection method is flexible in that the one-to-one linear layer performing the feature selection can be added to other deep learning architectures. Despite this flexibility, its accuracy is limited: experimental results showed that the method did not outperform random forests.
Zhang and Wang [34] proposed a feature selection method for deep learning to classify scenes. They converted the feature selection task into a feature reconstruction problem, and for scene classification they selected the features that are more reconstructive than discriminative. However, removing discriminative features might reduce the accuracy in a typical classification task.
Deep learning methods, such as the DBM, are usually composed of a high number of nonlinear processing elements that need a high number of training samples. Moreover, the number of required observations (training samples) can grow exponentially with the number of input features [23,35] when training a DL method. Through feature selection, the number of input features, and consequently the number of training samples required for training a DBM, can be reduced. Therefore, feature selection can help deep learning methods to be trained with less training data. In this paper, a feature selection method is proposed for the DBM to improve its processing ability and reduce the computational cost of feature selection for the DBM.

Principles of the proposed deep feature selection method
In this section, the mathematical properties of the RBM are first described. Then, the RBM is used to design the proposed Deep-FS feature selection method, whose principle is presented in the second part of this section. Two versions of the proposed deep feature selection algorithm are presented: in the first version, the RBM is not trained during feature selection, whereas in the second version, the RBM is trained during the feature selection procedure.

Mathematical properties of Restricted Boltzmann Machines
A Boltzmann Machine (BM) is a parameterised probabilistic model. The parameters of a BM are trained to approximately model the important aspects of an unknown target distribution by using available samples drawn from that distribution. The available samples that are used to train the parameters of the model are called training samples [36]. There are two types of units in a BM: visible and hidden units (or neurons). The two sets of units are arranged in two layers; the first layer is constructed of visible units, and the hidden units are in the second layer. Each unit in the first layer corresponds to an input feature; for instance, if the input is an image then each visible unit corresponds to a pixel. In general, the visible units can accept different types of input features. The hidden units are used to model complex dependencies between the visible units.
Restricted Boltzmann Machines (RBMs) are a special case of the general BM where there are no inter-connections between units in a single layer, i.e. each unit is fully connected to all units in the other layer but does not have any connections with units in its own layer (Fig. 1). RBMs have a close connection with statistical physics [37], and represent a type of energy-based model. During training, the parameters of an RBM are adjusted to generate a model representing a probability distribution that is close to the actual probability distribution from which the data are drawn. RBMs have been successfully used for processing binary and real-valued data [1,38-40]. Consider the energy function in (1):

$$E(V, h) = -\sum_i b_i v_i - \sum_j a_j h_j - \sum_{i,j} v_i W_{ij} h_j \quad (1)$$

where $W_{ij}$ is a weight that connects the $i$th visible unit, $v_i$, and the $j$th hidden unit, $h_j$; $b_i$ and $a_j$ are the biases related to the $i$th visible and $j$th hidden units. The energy function is used to assign a joint distribution over the visible and hidden variables as shown in (2):

$$P(V, h) = \frac{1}{Z} \exp(-E(V, h)) \quad (2)$$
where $Z$ is a normalizing term called the partition function. $Z$ is calculated by (3):

$$Z = \sum_{V, h} \exp(-E(V, h)) \quad (3)$$
The sum is calculated over all possible pairs of $(V, h)$. Let $V$ be a $D$-dimensional vector and let $h$ be an $F$-dimensional binary vector; there are $2^{D+F}$ different pairs of $(V, h)$ when the visible units are binary. The conditional probabilities $P(h \mid V)$ and $P(V \mid h)$ factorize over the units and can be calculated by (4) and (5):

$$P(h \mid V) = \prod_j P(h_j \mid V) \quad (4), \qquad P(V \mid h) = \prod_i P(v_i \mid h) \quad (5)$$
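For toy dimensions, the partition function can be computed exactly by enumerating all $2^{D+F}$ binary $(V, h)$ pairs, which makes the normalization in (2) and (3) concrete and shows why brute-force evaluation is infeasible for realistic $D$ and $F$. A minimal sketch:

```python
import itertools
import numpy as np

def energy(v, h, W, b, a):
    # E(V, h) = -sum_i b_i v_i - sum_j a_j h_j - sum_ij v_i W_ij h_j
    return -(b @ v + a @ h + v @ W @ h)

def partition_function(W, b, a):
    """Brute-force Z over all 2**(D+F) binary (V, h) pairs --
    feasible only for toy sizes, since the count doubles per unit."""
    D, F = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=D):
        for h in itertools.product([0, 1], repeat=F):
            Z += np.exp(-energy(np.array(v), np.array(h), W, b, a))
    return Z
```

Dividing $\exp(-E(V,h))$ by the returned $Z$ yields a proper distribution: the probabilities of all $(V,h)$ pairs sum to one.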
Conditional probabilities (4) and (5) can be written as (6) and (7):

$$P(h_j = 1 \mid V) = g\Big(a_j + \sum_i W_{ij} v_i\Big) \quad (6)$$

$$P(v_i = 1 \mid h) = g\Big(b_i + \sum_j W_{ij} h_j\Big) \quad (7)$$

where $g(x)$ is the logistic function, $1/(1 + \exp(-x))$. The network can be trained to increase the probability it assigns to an input by reducing the energy related to that input. The network parameters are updated using a gradient-based method to reach this objective. Eq. (8) shows the derivative of the log probability of the visible inputs on a set of observations, $\{V_n\}_{n=1}^N$, with respect to a weight:

$$\frac{\partial}{\partial W_{ij}} \frac{1}{N} \sum_{n=1}^{N} \log P(V_n) = E_{data}[v_i h_j] - E_{model}[v_i h_j] \quad (8)$$

where $E_{data}[v_i h_j]$ is the frequency with which both the visible unit, $v_i$, and the hidden unit, $h_j$, have the binary value of one in the training data.
$E_{model}[v_i h_j]$ is the same expectation with respect to the distribution defined by the model. The one-step contrastive divergence approximation is used to approximate $E_{model}[v_i h_j]$ [11,39]. It is calculated by running one iteration of the Gibbs sampler to reconstruct the visible units using (6) and (7), to obtain (9):

$$\Delta W_{ij} = \varepsilon \big( E_{data}[v_i h_j] - E_{recon}[\tilde{v}_i h_j] \big) \quad (9)$$
where $\tilde{v}_i$ is the reconstructed $i$th visible unit obtained through Gibbs sampling. Using the Gibbs sampler to approximate $E_{model}[v_i h_j]$ in this way is called the contrastive divergence approximation. A similar procedure can be used to extract (10) and (11) for updating the biases:

$$\Delta b_i = \varepsilon \big( E_{data}[v_i] - E_{recon}[\tilde{v}_i] \big) \quad (10)$$

$$\Delta a_j = \varepsilon \big( E_{data}[h_j] - E_{recon}[\tilde{h}_j] \big) \quad (11)$$
Gaussian-Bernoulli RBMs [1,7] are used for modelling real-valued vectors. In this case, the visible vector is $V \in \mathbb{R}^D$, and the hidden units are binary, $h \in \{0, 1\}^F$. The RBM assigns the conditional distributions shown in (12) and (13):

$$P(h_j = 1 \mid V) = g\Big(a_j + \sum_i W_{ij} \frac{v_i}{\sigma_i}\Big) \quad (12)$$

$$P(v_i \mid h) = \mathcal{N}\Big(b_i + \sigma_i \sum_{j=1}^{F} W_{ij} h_j,\; \sigma_i^2\Big) \quad (13)$$
A Gaussian distribution with mean $\mu = b_i + \sigma_i \sum_{j=1}^{F} W_{ij} h_j$ and variance $\sigma_i^2$ is used for modelling a visible unit, i.e. $\mathcal{N}(\mu, \sigma_i^2)$. The derivative of the log probability on a set of observations, $\{V_n\}_{n=1}^N$, is given by (14), which is similar to (8):

$$\frac{\partial}{\partial W_{ij}} \frac{1}{N} \sum_{n=1}^{N} \log P(V_n) = E_{data}\Big[\frac{v_i}{\sigma_i} h_j\Big] - E_{model}\Big[\frac{v_i}{\sigma_i} h_j\Big] \quad (14)$$
As with the binary visible units, the first expectation on the right side of (14) can be calculated from the training data, and the second expectation can be approximated through a Gibbs sampler [7]. The resulting update is (15):

$$\Delta W_{ij} = \varepsilon \Big( E_{data}\Big[\frac{v_i}{\sigma_i} h_j\Big] - E_{recon}\Big[\frac{\tilde{v}_i}{\sigma_i} h_j\Big] \Big) \quad (15)$$
The input features, $v_i$, are usually normalized to have zero mean, $\mu = 0$, and a variance of one, $\sigma_i = 1$. This simplifies (15), making it similar to (9). The following sections refer to (9) for both binary and real-valued input data.
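The binary-unit equations (6) and (7) and the CD-1 updates (9)-(11) can be sketched as a small NumPy class. Hyperparameters such as the learning rate and the weight initialisation scale are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)), the logistic function
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(scale=0.01, size=(n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible biases b_i
        self.a = np.zeros(n_hid)   # hidden biases a_j
        self.lr = lr

    def hidden_probs(self, V):     # (6): P(h_j = 1 | V)
        return sigmoid(self.a + V @ self.W)

    def visible_probs(self, H):    # (7): P(v_i = 1 | h)
        return sigmoid(self.b + H @ self.W.T)

    def cd1_step(self, V):
        """One CD-1 update: data phase, one Gibbs step, then updates (9)-(11)."""
        ph = self.hidden_probs(V)                       # data-phase hidden probs
        h = (rng.random(ph.shape) < ph).astype(float)   # sample hidden states
        v_recon = self.visible_probs(h)                 # reconstructed visibles
        ph_recon = self.hidden_probs(v_recon)           # reconstruction-phase probs
        n = V.shape[0]
        self.W += self.lr * (V.T @ ph - v_recon.T @ ph_recon) / n   # (9)
        self.b += self.lr * (V - v_recon).mean(axis=0)              # (10)
        self.a += self.lr * (ph - ph_recon).mean(axis=0)            # (11)
        return np.mean((V - v_recon) ** 2)              # mean reconstruction error
```

Repeated calls to `cd1_step` on the same batch drive the reconstruction error down as the data-phase and reconstruction-phase expectations converge, which is exactly the stabilization of the weight adjustments described above.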

Proposed RBM-based deep feature selection algorithm
An RBM is a generative probabilistic model: it can generate the probability of the value of a visible unit given the states of the hidden units. This property can be used to reconstruct missing visible units, and has been adopted to draw samples from the learned distribution to extract textures from images [41], as well as to sample the missing parts of an input image in image denoising tasks [42]. The proposed Deep-FS algorithm adopts the generative property of the RBM to define a feature selection method.
Deep-FS aims to find a set of features with useful information. Therefore, features that do not hold useful information about the input data are identified through the generative property of the RBM and removed. The final selected feature set contains fewer features and reduces the complexity of the network. Feature selection is performed via three steps: 1) Initial training: training the RBM on the training data with all the features; 2) Feature elimination: removing redundant features using the initially trained RBM; and 3) Main training: training the DBM, with the initially trained RBM, on the training data consisting of the selected features. Each of these steps is described in the remainder of this section.
Step 1 - Initial training: The first step is performed via the learning method described in Section 3.1, using (9). During the training procedure, training data is input into the RBM and the outputs of the hidden neurons are calculated for all training data, from which the first expectation in (9), $E_{data}[v_i h_j]$, is computed. Next, the inputs are reconstructed by Gibbs sampling and the second expectation in (9), $E_{recon}[\tilde{v}_i h_j]$, is calculated accordingly. Finally, the RBM weights are adjusted by the difference of the two expectations in (9).
Step 2 - Feature elimination: In the second step, features are eliminated through the proposed feature selection algorithm described here. The algorithm can be tuned to eliminate a single feature or a group of features in each evaluation. It starts with the set of all input features and evaluates the effect of each group of features by using the trained RBM. The learning aim of the RBM is to minimize the network error, or equivalently to maximize the log-likelihood, as shown in (9). The derivative of the loss or error function of the RBM can be obtained by multiplying (9) by minus one [11]. During the learning process, the absolute value of the error is reduced, and consequently the weight adjustments stabilize. The learning procedure is stopped when the error reaches zero or a predefined number of learning epochs is reached; the weight adjustment becomes zero when the error reaches zero. In particular, consider the error related to one of the input features. The error related to the $i$th visible unit is obtained from (9) and given in (16):

$$err_i = E_{recon}[\tilde{v}_i h_j] - E_{data}[v_i h_j] \quad (16)$$
The reconstruction error is defined in (17):

$$e_i = E\big[(v_i - \tilde{v}_i)^2\big] \quad (17)$$

where $e_i$ is the reconstruction error related to input feature $i$. A lower $e_i$ leads to a lower absolute weight adjustment. The RBM learning rule is based on feature extraction and dimension reduction: in the RBM's feature extraction, different visible inputs are combined and hidden features are extracted. During this learning procedure, the weight adjustments stabilize and consequently the absolute value of the error shown in (16) is reduced. The error $e_i$ described in (17) is used to define the elimination criterion of the proposed feature selection method.
The input features are investigated to find whether a feature $v_i$ can be reconstructed by using the other input features, i.e. whether the other features contain enough information to reconstruct $v_i$. The $i$th visible unit, $v_i$, is eliminated and then reconstructed by the trained RBM. To eliminate $v_i$, it is set to zero; a visible unit with the value of zero does not contribute to the outputs of the hidden variables (see (6) and (12)), so the hidden features are generated only by the other visible inputs. Then the $i$th visible unit is reconstructed from the hidden units using (7) for binary data or (13) for continuous (real-valued) data. The $i$th reconstructed feature is called $\bar{v}_i$, and (17) is used to calculate its reconstruction error, $\bar{e}_i = E[(v_i - \bar{v}_i)^2]$. Note that $\bar{v}_i$ and $\tilde{v}_i$ are both reconstructed visible units; however, $\bar{v}_i$ is generated when the value of $v_i$ is set to zero, whereas $\tilde{v}_i$ is generated when $v_i$ is set to its original value.
If the reconstruction error after elimination of the $i$th visible unit, $\bar{e}_i$, is equal to or less than the original reconstruction error of the visible unit, $e_i$, then the visible unit does not add any knowledge to the network, or it even reduces the general knowledge of the network. Therefore, removing the $i$th input feature reduces the complexity of the network and may also reduce the error of the network. Additionally, reducing this error can reduce the absolute value of the learning error shown in (16); consequently, the absolute value of the weight adjustment shown in (9) is reduced and the weights become stabilized. Thus, the feature selection method works cooperatively with the RBM weight adjustment method to reduce the final error.
The proposed feature selection has been designed based on the training aim of the RBM. Suppose that an RBM is trained on all the input features. The weights are updated based on (9), which is driven by the difference ⟨v i h j⟩_data − ⟨v i h j⟩_recon: a high value of this expression leads to a high weight adjustment and a low value leads to a low weight adjustment. If the weights are adjusted in such a way that the RBM can regenerate the training data, then ⟨v i h j⟩_data ≈ ⟨v i h j⟩_recon, and this near-equality leads to an adjustment close to zero for the weights (i.e. the update term in (9) has a value close to zero).
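Assuming the update in (9) has the standard contrastive-divergence form, Δw_ij ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon, a minimal CD-1 sketch (hypothetical helper, random weights, not the paper's code) makes the cancellation visible:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_weight_update(v, W, lr=0.1):
    h = sigmoid(v @ W)                     # data-driven hidden probabilities
    v_recon = sigmoid(h @ W.T)             # reconstruction of the visible layer
    h_recon = sigmoid(v_recon @ W)
    positive = np.outer(v, h)              # <v h> under the data
    negative = np.outer(v_recon, h_recon)  # <v h> under the reconstruction
    return lr * (positive - negative)      # vanishes when the RBM regenerates its input

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, size=(6, 3))
v = rng.integers(0, 2, size=6).astype(float)
dW = cd1_weight_update(v, W)
```

When the reconstruction equals the input, `positive` and `negative` coincide and `dW` is exactly zero, which is the "training aim" the text refers to.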
Therefore, if removing a visible unit produces a similar situation, i.e. the error of the reconstructed visible unit is reduced, this implies that the removed feature not only fails to add new information to the network (for reconstructing the visible unit) but also reduces the overall knowledge by increasing the reconstruction error. Therefore, it is better to remove the visible unit from the selected feature set so that the RBM moves closer to its training aim. The proposed method, similar to the optimal brain surgeon approach [43], works based on the comparison of two errors. In the optimal brain surgeon approach the difference in the two errors is created by pruning learning parameters, whereas in the proposed method the difference in errors is generated by eliminating features. Then, based on the derived reconstruction error, the features are selected. Additionally, the proposed feature selection is designed to work with a DBM. The procedure of the proposed feature selection is summarized in the pseudocode described in Table 1. After training the RBM, it is used to calculate the reconstruction error e i for the i-th input feature by using (17), for all i. In the while loop in Table 1, the proposed method investigates a group of N e visible features. In each iteration, a group of N e features is eliminated by setting the corresponding visible units to zero. Investigating a group of input features when N e > 1 (see Fig. 2) reduces the number of required iterations and consequently reduces the processing time. The reconstruction error ē k is calculated for the N e eliminated features; ē k is the reconstruction error of the k-th visible unit among the N e features eliminated in the while loop in Table 1. For the k-th eliminated feature, if ē k < e k, where e k is the reconstruction error of the same input feature before eliminating the features, then the k-th eliminated feature is removed from the input features permanently.
Otherwise, the k-th eliminated feature is retained as a selected feature. The while loop in Table 1 continues until all the input features have been investigated.
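The selection loop of Table 1 (RBM frozen during selection, groups chosen by adjacency) can be sketched as follows. This is an illustrative NumPy reimplementation, not the authors' code: random weights stand in for a trained RBM, and a per-feature mean squared error over a batch stands in for (17).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_errors(V, W, b_vis, b_hid, mask):
    """Per-feature mean squared reconstruction error over a batch V,
    with masked (eliminated) features zeroed at the input."""
    H = sigmoid((V * mask) @ W + b_hid)
    V_rec = sigmoid(H @ W.T + b_vis)
    return ((V - V_rec) ** 2).mean(axis=0)

def deep_fs(V, W, b_vis, b_hid, n_e=5):
    D = V.shape[1]
    keep = np.ones(D, dtype=bool)                              # current feature set
    e = reconstruction_errors(V, W, b_vis, b_hid, np.ones(D))  # baseline e_i
    for start in range(0, D, n_e):                             # adjacent groups
        group = np.arange(start, min(start + n_e, D))
        mask = keep.astype(float)
        mask[group] = 0.0                                      # eliminate the group
        e_bar = reconstruction_errors(V, W, b_vis, b_hid, mask)
        for k in group:
            if e_bar[k] < e[k]:                                # no information lost
                keep[k] = False                                # remove permanently
    return keep

rng = np.random.default_rng(2)
V = rng.integers(0, 2, size=(64, 20)).astype(float)
W = rng.normal(0.0, 0.1, size=(20, 8))
selected = deep_fs(V, W, np.zeros(20), np.zeros(8))
```

The returned boolean mask plays the role of the selected feature set; in the real method the RBM weights come from the pre-training step rather than a random draw.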
In Fig. 2 the investigated group of features is selected by adjacency. The reconstruction error ē i is utilized to evaluate the effect of removing both groups of features that have not been selected based on adjacency and groups that contain adjacent features. ē i is the reconstruction error of the i-th input feature when N e input features are eliminated by setting them to zero. As shown in Table 1, the condition ē k < e k is used to decide whether to remove the k-th feature, i.e. a visible unit with a lower reconstruction error is removed permanently. Therefore, visible units with lower ē i in the current iteration are more likely to be removed in the next feature selection iteration. Consequently, ē i can be used to find the next N e features that should be tested for removal. The N e visible units that have the lowest ē i in the current iteration of feature selection are selected for the elimination test in the next iteration. In this case, the N e selected visible units are usually not adjacent units.
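This alternative ordering, testing next the N e not-yet-investigated features with the lowest ē i, amounts to a masked argsort; a small illustrative sketch (hypothetical helper name):

```python
import numpy as np

def next_group(e_bar, untested, n_e):
    """Return the n_e untested feature indices with the lowest ē values."""
    candidates = np.flatnonzero(untested)
    order = np.argsort(e_bar[candidates])   # lowest reconstruction error first
    return candidates[order[:n_e]]

e_bar = np.array([0.9, 0.1, 0.5, 0.05, 0.7])
untested = np.array([True, True, False, True, True])
group = next_group(e_bar, untested, n_e=2)   # → array([3, 1])
```

Feature 2 is skipped because it has already been tested; the two remaining features with the lowest ē are chosen, and they need not be adjacent.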
An alternative version of the algorithm is presented in Table 2, where the RBM is trained before feature selection on all the input features (similar to the first version presented in Table 1). Additionally, it is trained during the feature selection procedure using the reduced number of features. In the alternative version, the RBM is first trained; then e i is calculated using (17) for all visible features (see Table 2). After that, a set of N e features is selected as candidate features, and they are temporarily eliminated from the input feature set by setting their values to zero. Then it is decided whether the k-th eliminated feature should be removed permanently by using ē k. The elimination of N e features is repeated to investigate all the input features. After every group of features is removed, the RBM is retrained with the reduced number of input features. Then the newly trained RBM is used to continue the feature selection procedure (see Table 2).
Step 3 - Main training: After eliminating redundant features by using one of the two algorithms provided in Tables 1 and 2, the training of the RBM continues using the selected features, i.e. the RBM with the remaining visible units is initialized with the previously learned corresponding weights and the learning is continued. The RBM and the selected features are then used for training the DBM, similar to the method used in [17]. The learning method proposed in [17] has an unsupervised learning procedure in which a DBM is initially trained. Then the learning parameters of the trained DBM are used to initialize a corresponding multilayer neural network. Finally, a standard back propagation method is used to train the multilayer neural network.

Table 1
Pseudocode of the proposed feature selection method when the initially trained RBM is not trained during feature selection.

Train RBM on the training data.
Calculate the initial reconstruction error e i by the trained RBM using (17) for all i.
while not all input features have been investigated:
    Select N e features for evaluation.
    Set v k = 0 for k ∈ {N e selected features} (elimination of the N e features).
    Calculate the reconstruction error ē k for each eliminated feature using (17).
    for k ∈ {N e eliminated features}:
        if ē k < e k then: Remove the k-th visible unit.
        else: Reset v k from 0 to its original value, and add it to the selected features.
    i = i + N e (move on to the next group of features).

Table 2
Pseudocode of the alternative version of the proposed feature selection method when the RBM is trained during the feature selection procedure.

Train RBM on the training data.
Calculate the initial reconstruction error e i by the trained RBM using (17) for all i.
while not all input features have been investigated:
    Select N e features for evaluation.
    Set v k = 0 for k ∈ {N e selected features} (elimination of the N e features).
    Calculate the reconstruction error ē k for each eliminated feature using (17).
    for k ∈ {N e eliminated features}:
        if ē k < e k then: Remove the k-th visible unit.
        else: Reset v k from 0 to its original value, and add it to the selected features.
    Train RBM with the current reduced number of features.
    Calculate the reconstruction error e i by the retrained RBM using (17).
    N r = 0 (reset the count of features removed since the last RBM training).
In the proposed methods, the knowledge acquired during the training process of the DBM (i.e. training of the RBM as a building block of the DBM) is used to perform the feature selection; that is, the result of pre-training the DBM is used for feature selection. Therefore, the computational cost of the feature selection task is reduced, because feature selection is performed during the DBM's learning process.
The proposed feature selection method uses the generative ability of RBM to reconstruct eliminated features and then calculates reconstruction errors to determine whether to retain or remove features. Other deep learning algorithms that have the generative property can also be used by the proposed method. For instance, the Deep Belief Network (DBN) is a deep learning method that uses RBM during its training procedure; therefore, the proposed feature selection method can be used with DBN. The Auto-Encoder method is also a deep learning method that reconstructs its input at the network output. Therefore, the proposed method can be extended to an Auto-Encoder. The proposed feature selection method has the ability to work with data that are suitable for deep learning methods.

Normalized error
The normalized error shown in (18) is used to evaluate the performance of the proposed method:

e_norm = (1/D) Σ_{i=1..D} ē i (18)

where D is the number of visible input features, and ē i is the reconstruction error related to the i-th visible unit calculated by (17).
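Assuming the normalized error is the mean of the per-feature reconstruction errors over the D visible units (the equation itself is not reproduced in this excerpt), it can be computed as:

```python
import numpy as np

def normalized_error(e_bar):
    """Mean of the per-feature reconstruction errors over the D visible units."""
    return e_bar.sum() / e_bar.size   # (1/D) * sum_i ē_i

e_bar = np.array([0.2, 0.4, 0.0, 0.2])
err = normalized_error(e_bar)         # ≈ 0.2
```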

Experimental results
This section first presents the experimental results of applying the proposed Deep-FS method on five benchmark datasets, namely MNIST [44] , MIR-Flickr [7] , GISETTE [45] , MADELON [45] , and PAN-CAN [46] . Datasets with a high number of features and training samples were selected in order to evaluate the accuracy and speed of the proposed feature selection method on high dimensional datasets. Thereafter, Deep-FS is compared with other feature selection methods using the MNIST dataset. Finally, Deep-FS's time complexity is analysed using MNIST and randomly generated data.

Experimental results on MNIST
This section provides a brief description of the MNIST image dataset, and provides an explanation and illustration of how features are selected and removed using the proposed Deep-FS method. Next, the effect of training an RBM during the feature selection procedure is discussed via an empirical comparison of the two proposed methods described in Tables 1 and 2. After that, the effect of the two methods for selecting the features to evaluate, i.e. selecting N e adjacent features (Fig. 2) and selecting the N e features that have the lowest reconstruction error, ē i, is investigated. Finally, the proposed feature selection method, Deep-FS, is compared with the original DBM [17].

MNIST image dataset
In the first set of experiments, Deep-FS is applied to the MNIST image dataset [44]. MNIST contains 60,000 training samples and 10,000 test image samples of handwritten digits. In the MNIST dataset, each image is centred in a 28 × 28 pixel box. Each image in the MNIST dataset is obtained from an original image in a 20 × 20 pixel box through a transformation. The transformation is performed in such a way that the centre of mass of the pixels is preserved in the two images. This pre-processing is described by LeCun et al. [44]. The handwritten digits have different thickness, angular alignment, height and relative position in the image frame. Fig. 3 illustrates some samples of reconstructed images after applying the Deep-FS variant described in Table 2 to the MNIST dataset. The left column of Fig. 3 shows samples of the original images and the right-hand column shows the reconstructed images with the removed pixels filled in black. Fig. 3 shows that the method has in practice removed pixels surrounding the digits. The peripheral pixels do not contain any useful information about the digit, and therefore were removed. Some other pixels in the middle area of the images have also been removed. These removed pixels can be reconstructed from the information in neighbouring pixels, and removing them does not destroy the general appearance of the digits.

Illustration of the selected and removed features on the MNIST dataset
In Fig. 4(a), a histogram illustrating the removed pixels for digit 7 is shown. The pixels are indexed from 1 to 784 as shown in Fig. 4(b). The pixel located at the top left corner has index 1, the pixel next to it on its right has index 2, and so forth. The first and last columns in the histogram, shown in Fig. 4(b), have higher values than the others and correspond to the pixels at the top and bottom of the figure, respectively. As shown in Fig. 3, there is not much information in those pixels, and therefore they were removed appropriately by the proposed feature selection algorithm. Fewer pixels are removed from the middle part of each digit image, as shown by the shorter columns in the middle part of Fig. 4(b).
In this paper the learning method proposed by Salakhutdinov and Hinton [17] is used as the baseline learning method for comparison purposes. In this baseline learning method, a DBM is initially trained, and thereafter the trained DBM is used to initialize a multilayer neural network. Then, a standard back propagation method is used to train the multilayer neural network.
The input vector has 28 × 28 = 784 units. The DBM has two hidden layers: the first hidden layer has 500 hidden units and there are 1000 hidden neurons in the second hidden layer. A similar structure is used in the proposed method to compare the results. The difference between Deep-FS and the DBM [23] is in the first layer: Deep-FS selects a subset of input pixels to reduce the number of input units. The maximum number of learning epochs is the same for all the methods; for example, the DNNs are fine-tuned for 100 learning epochs using back propagation. Table 3 shows the experimental results when N e adjacent features are used during the feature selection stage. In this experiment, the results when the first RBM is not trained during the feature selection stage (using the algorithm described in Table 1) are compared with the results when the RBM is trained during feature selection (i.e. using the algorithm of Table 2). In the first column, Deep-FS 1, Deep-FS 5, and Deep-FS 10 are the proposed methods when N e is set to 1, 5, and 10, respectively. N e is the number of pixels that are eliminated before feature reconstruction. In each iteration of the feature selection phase, N e features are eliminated and thereafter reconstructed together by the RBM, as explained in Section 3.2 and Table 1. Deep-FS can evaluate features (pixels) one by one (N e = 1) or it can evaluate a group of features (pixels) as described in Section 3.2 (N e > 1). In the latter situation, instead of searching by a single feature, N e = 1, the search process evaluates features in groups of 5 and 10, i.e. N e = 5 and N e = 10, respectively. The third column, #Input Features, is the number of input pixels (features) which are selected during feature selection. The fourth column, Classification Error During Testing, shows the classification error on the testing data (i.e. test images).
Each test image is input into the network, and then the outputs of the ten neurons on the output layer, each of which corresponds to a class, are investigated. The test input is assigned to the class that corresponds to the output neuron with the maximum output value. The number of incorrect assignments is accumulated over the 10,000 testing images and the results are reported in the column Classification Error During Testing. The Processing Time column is the total time required for the methods to perform training on the 60,000 training images and also to test the trained network on the 10,000 testing images. Experiments were performed on an Intel E5-2640 v4 2.40 GHz processor with 64 GB RAM. Table 3 shows that training the RBM during feature selection slightly reduces the errors when N e > 1 adjacent features are used. Table 4 shows the results when the N e features that have the lowest reconstruction error, ē i, are used during the feature selection process. The N e features with the lowest reconstruction errors in the current iteration of feature selection are evaluated in the next iteration of the feature selection procedure, and these features are usually not adjacent. Comparing Tables 3 and 4 reveals that using the features that have the lowest reconstruction error, ē i, during the feature selection procedure (Table 4) improves the accuracy of the proposed feature selection method, compared to when adjacent features are used (Table 3). For instance, comparing Tables 3 and 4, the classification error for Deep-FS 10 is reduced from 97 (Table 3, when the RBM is not trained) to 90 (Table 4, when the RBM is not trained). The results show that the approach which eliminates the N e features with the lowest reconstruction error has higher accuracy compared to the method that eliminates N e adjacent features.
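The decision rule described above, assigning each test image to the class of the maximum-output neuron and counting the incorrect assignments, can be sketched as follows (illustrative code, with a 3-class toy output matrix in place of the 10-class MNIST network):

```python
import numpy as np

def count_misclassified(outputs, labels):
    """outputs: (n_samples, n_classes) network outputs; labels: true class ids."""
    predictions = outputs.argmax(axis=1)   # neuron with the maximum output wins
    return int((predictions != labels).sum())

outputs = np.array([[0.1, 0.8, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.2, 0.6]])
labels = np.array([1, 2, 2])
errors = count_misclassified(outputs, labels)   # second sample is wrong → 1
```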

The effect of evaluating the features with the lowest ē i
Additionally, the results in Table 4 show that the feature selection method which does not train the RBM during the feature selection procedure (see Table 1) has higher accuracy than the version in which the RBM is trained during feature selection in addition to the initial training (see Table 2) when N e > 1. For instance, Table 4 shows that Deep-FS 10 misclassified fewer image samples when the RBM was not trained during the feature selection procedure, with the number of misclassified images reduced from 94 to 90.
In conclusion, the highest classification accuracy is achieved when the RBM is not trained during the feature selection procedure, i.e. when Deep-FS 10 uses the RBM that was initially trained before feature selection and performs the entire feature selection procedure with that initially trained RBM.

Comparing Deep-FS with the baseline DBM
The proposed Deep-FS method was compared against the DBM [17] method using the MNIST dataset. The comparison considered the effect of each approach on reducing the number of input features and misclassified cases, and on reducing the processing time. The baseline DBM is the one originally introduced by Salakhutdinov and Hinton [17]. In this comparison, Deep-FS 10 is used because the experiments in Sections 4.1.3 and 4.1.4 revealed that removing features in groups of 10 is a better feature elimination strategy which provides higher classification accuracy. For Deep-FS 10, the RBM is not trained during the feature selection procedure. Additionally, features are evaluated based on the lowest reconstruction error, ē i, as reported in Table 4.
The baseline DBM is trained on all 784 input pixels (see the first row of Table 5). Table 5 shows that the proposed Deep-FS 10 method has reduced the number of input features: it selected 430 features out of the 784 total input features (see the second row of Table 5). The results show that the proposed method removes more than 45% of the input features. The proposed method also reduced the number of misclassified cases from 97 for the baseline DBM to 90 for the proposed method (see Table 5). The results demonstrate the proposed method's capability in finding redundant features that do not add new information to the network and removing them without reducing the classification accuracy, while being much faster than the baseline DBM. Selecting candidate features based on the lowest reconstruction error helps the algorithm find appropriate features and increases accuracy. The classification error of all the methods on the 60,000 training samples is zero.
The processing time of the experiment for the proposed method is reduced by about 5% when N e is increased from 1 to 10 features: the processing time is reduced from 55,134 s for N e = 1 to 52,442 s for N e = 10 (see Table 4, when the RBM is not trained during feature selection). Eliminating a number of input features, N e > 1, in each investigation reduces the required number of reconstruction procedures, which consequently reduces the processing time. The number of selected features is increased by about 4% when N e is increased from 1 to 10 (see Table 4, when the RBM is not trained during feature selection). The reconstruction error of each eliminated feature increases when the number of features eliminated before reconstruction, N e, is high. Eliminating a high number of input features during feature selection therefore increases the reconstruction errors, ē k. Consequently, a higher number of features is kept in the selected feature set when ē k < e k is used for removing the features (Table 1). Moreover, Table 5 shows that the processing time of Deep-FS 10 is lower than that of the baseline DBM [17].
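The saving comes from the number of elimination/reconstruction passes, which scales as ceil(D / N e) for D input features; for MNIST's 784 pixels:

```python
import math

def n_passes(d, n_e):
    """Number of elimination/reconstruction passes over d features in groups of n_e."""
    return math.ceil(d / n_e)

print(n_passes(784, 1))    # 784 passes when pixels are tested one by one
print(n_passes(784, 10))   # 79 passes when tested in groups of ten
```

The overall runtime shrinks by far less than the pass count because the RBM and DBM training dominate the total processing time.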

Experimental results on the MIR-Flickr dataset
In the second set of experiments, the performance of the proposed Deep-FS 1 method, which was used for the MNIST data in Section 4.1, is tested using the MIR-Flickr dataset obtained from the Flickr.com social photography site [7,47]. The MIR-Flickr dataset contains one million samples. Each input sample consists of an image which may have user-defined image tags. Additionally, some of the input samples (image and user text tags) are labelled. Out of the one million input samples, only 25,000 images with user-assigned tags are annotated with labels, and the remaining 975,000 samples are unlabelled. Labelling large data is a very demanding task. The images and their corresponding user-assigned tags are annotated with 38 labels, including object and scene categories such as tree, bird, people, clouds, indoor, sky, and sunset. Each sample can belong to a number of classes.
The unsupervised learning ability of RBMs, which are the building blocks of a DBM, enables DBMs to be trained on huge amounts of unlabelled data, and RBMs and DBMs are known for their suitability for training on unlabelled data. After initial training, a limited amount of labelled data can be used for fine-tuning the model. Out of the 25,000 labelled samples in the MIR-Flickr dataset, 10,000 samples are used for training and another 5000 labelled samples are used as a validation set. The remaining 10,000 samples are used during the testing stage [7]. Each sample in the MIR-Flickr dataset has two sets of features, i.e. text and image features. First, the text features are described, then the image features are introduced.
Many words which appear in the user-defined tags are not informative, and some of them are filtered out. To organize the text input, a dictionary is generated. The dictionary contains the 2000 most frequent tags, which were extracted from the set of user tags found in the one million samples. Then each text input is refined so that it contains only the text in the dictionary. Therefore, the text data of each sample is represented by the vocabularies of its user tags that are in the dictionary (i.e. the tags are restricted to the ones in the dictionary). Additionally, different feature extraction methods are used to extract real-valued features for each image. In the previous experiment on the MNIST dataset (see Section 4.1) the inputs had binary values; however, here there are a number of real-valued features for each image. In total, 3857 features were extracted for each image [7].

The DBM used in [7] is employed as the baseline learning method for comparison for the MIR-Flickr image-text data in this set of experiments. There are 3857 Gaussian visible units with real-number output values for the image input, and there are two hidden layers for the image pathway, each composed of 1024 binary units. A Replicated Softmax model [7] with 2000 input units is used for the text inputs. The text pathway is completed by two hidden layers, each of which has 1024 binary units. A joint layer with 2048 binary hidden units is placed after the image and text pathways.
Each sample, image with corresponding user text tags, found in the MIR-Flickr dataset can belong to a number of classes. Mean Average Precision (MAP) and precision at top-50 predictions (Prec@50) are two standard methods commonly adopted to evaluate multi-label classification tasks [7] . MAP and Prec@50 are also used in this research to evaluate the classification performance of the proposed Deep-FS and the baseline DBM on the MIR-Flickr data.
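The two metrics can be sketched with their standard definitions (this is an illustrative implementation, not necessarily the exact evaluation script of [7]): per-label average precision over the score ranking, averaged into MAP, and the fraction of true positives among the top-k scored samples for Prec@k (k = 50 in the paper; a smaller k is used in the toy example below):

```python
import numpy as np

def average_precision(scores, relevant):
    """AP for one label: mean precision at the rank of each relevant sample."""
    order = np.argsort(-scores)             # rank samples by descending score
    rel = relevant[order].astype(bool)
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return precisions[rel].mean()

def precision_at_k(scores, relevant, k):
    """Fraction of relevant samples among the k highest-scored ones."""
    top = np.argsort(-scores)[:k]
    return relevant[top].mean()

scores = np.array([0.9, 0.8, 0.3, 0.1])
relevant = np.array([1, 0, 1, 0])
ap = average_precision(scores, relevant)    # (1/1 + 2/3) / 2 = 5/6
p2 = precision_at_k(scores, relevant, k=2)  # 1 hit in the top 2 → 0.5
```

MAP is then the mean of `average_precision` over the 38 labels.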
Deep-FS uses the first 5000 batches of data with all features to train the RBM. There are 128 samples in each batch. Then the proposed feature selection method uses the trained RBM to select features. The learning process is continued with the remainder of the training data using the selected features. Features extracted from the first hidden layer of the image pathway are classified by the logistic regression method to show the effect of the feature selection method on the classification results. The results are shown in Table 6. Deep-FS returned a higher MAP than the baseline DBM method, achieving values of 0.478 and 0.476, respectively. There is no notable difference in Prec@50. Deep-FS selects 3082 out of 3857 image features; the proposed feature selection method thus removes 775 features to reduce the number of input features. The classification results on the hidden features extracted from the joint hidden layer are shown in Table 7. Deep-FS removes features without affecting the classification performance on the testing data.
In Fig. 6 the errors of the Deep-FS method on the training and evaluation sets are compared to those of the baseline method, DBM [7], across the learning steps. In each learning step a new batch of data is trained. Eq. (17) is used to calculate the errors. Until step 5000 the two methods use all the features, so they reach the same error levels of 0.4228 and 0.2658 on the training and evaluation sets, respectively. In the following learning steps, the errors of both methods are reduced; however, the error drops faster and reaches a lower final value when using the proposed method. For instance, the proposed method reaches the level of 0.2590 on the training set, which is lower than that of the baseline method, i.e. 0.3075 (Fig. 6(a)), at the end of the learning steps. Similarly, the proposed method reaches an error level of 0.1745 compared to 0.2278 for the baseline method on the evaluation set (Fig. 6(b)). The evaluation errors for the proposed Deep-FS method and the baseline DBM are 34% and 14% lower, respectively, than the error at step 5000, i.e. 0.2658. The feature selection method removes the redundant and irrelevant features and consequently prevents overfitting.

Experimental results on the GISETTE dataset
The GISETTE dataset [45] is a benchmark dataset generated for binary classification tasks. GISETTE is a handwritten digit recognition dataset, which was part of the Advances in Neural Information Processing Systems (NIPS 2003) feature selection challenge. The GISETTE training data contains 6000 samples and 5000 features. The GISETTE learning task is a binary classification task to discriminate between the two confusable handwritten digits 4 and 9. In GISETTE, each digit has a dimension of 28 × 28 pixels. In the experiment on GISETTE, 6000 samples are used to train and test the proposed method. The Decision Tree classifier was adopted, and k-fold cross validation with k = 10 was applied to evaluate the performance of the proposed feature selection method. The Decision Tree classifier was selected experimentally, as it achieved the highest classification accuracy compared to alternative conventional machine learning methods. Additionally, experiments showed that the Decision Tree classifier was trained in a shorter time than most of the other methods. The number of splits in the Decision Tree was experimentally selected and set to 30. Table 8 shows that the proposed Deep-FS reduces the number of input features from 5000 to 951, i.e. a reduction of 81% of the input features. The accuracy of the Decision Tree classifier on the selected features is 93.5%; its accuracy when using the entire set of features decreased to 92.9%. Using the smaller subset comprising the selected features also reduced the training time of the classifier (see Table 8). The classifier needed about 262 s to train using all 5000 features; however, when the training was performed using the selected features (i.e. 951 features), it needed only 47 s to train. The proposed method thus reduced the classifier's training time by 82% (see Table 8).
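The 10-fold protocol can be sketched as follows; a nearest-centroid classifier is used here as a self-contained stand-in for the Decision Tree (which is library-specific), since the point is the fold construction and the accuracy aggregation:

```python
import numpy as np

def k_fold_accuracy(X, y, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        classes = np.unique(y[train])
        # fit: one centroid per class on the training folds
        centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in classes])
        # predict: nearest centroid for each test sample
        d = ((X[test][:, None, :] - centroids[None]) ** 2).sum(axis=2)
        pred = classes[d.argmin(axis=1)]
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (50, 5)), rng.normal(3.0, 0.3, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
acc = k_fold_accuracy(X, y)   # well-separated classes: accuracy near 1.0
```

In the paper's experiment, the same splitting and averaging is applied with a Decision Tree in place of the centroid classifier, once on all 5000 features and once on the 951 selected ones.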

Experimental results on the MADELON dataset
The MADELON dataset [45] is an artificial dataset, which was part of the Advances in Neural Information Processing Systems (NIPS 2003) feature selection challenge. It is a two-class classification problem with continuous input variables. The challenge is that the problem is multivariate and highly non-linear. In this experiment, 2000 training samples from the MADELON dataset are used, and each sample has 500 features. The performance of the proposed Deep-FS method on the MADELON dataset is reported in Table 9. Deep-FS removes more than 57% of the input features and achieves a higher classification accuracy, i.e. 80.3%. Additionally, it reduced the computation time of training the classifier, as reported in Table 9, cutting 70% of the classifier training time. In summary, higher or very similar accuracy is achieved using a much smaller set of features and in less time (i.e. 4.44 fewer seconds) when Deep-FS is used.

Experimental results on the PANCAN dataset
PANCAN [46] was obtained from the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) [48]. The data contains 801 samples from patients with 5 types of tumours: COAD, LUAD, PRAD, BRCA and KIRC. Each patient sample contains 20,531 features. A total of 264 features had the same value for all samples in the dataset and these were removed, resulting in a total number of 20,264 features. Table 10 shows that Deep-FS has reduced the number of features from 20,264 to 4765, i.e. a 76.49% reduction in the number of input features, and increased the accuracy from 97.1% to 98.5%. 10-fold cross validation was applied to obtain the results for each of the two situations reported in Table 10. Importantly, the results revealed a significant reduction in the time needed by the classifier to train using the selected features: training on the selected features needed 59.19 s compared to 400.39 s when all the features are used. Hence, training on the selected features was 341.20 s faster (i.e. it reduced the classifier's training time by 85%).

Comparison of Deep-FS with other feature selection approaches
In the following experiments the proposed Deep-FS is compared with other feature selection approaches. The comparison is performed in the following two steps.
Step 1: Select features using the proposed and other existing feature selection algorithms; and Step 2: Train DBM using the selected subset of features.
Step 1: Initially, the proposed Deep-FS method and three other feature selection methods were separately applied to the MNIST dataset. Each feature selection method returned a selected subset of features, and the selected features were then used to train a DBM (results are presented in Table 11). Note that most conventional feature selection algorithms are computationally very expensive and not suitable for large data. The Genetic Algorithm (GA) for feature selection described in [49], Infinite Feature Selection (InfFS) [50], and the Laplacian Score (LS) for feature selection [51] were compared to Deep-FS. The GA-based mRMR feature selection algorithm [49] calculates the joint mutual information matrix between pairs of features, which makes the algorithm impractical for high-dimensional datasets. InfFS [50] is a filter feature selection method that selects the most important features based on their ranks; all other features are considered when evaluating the score of a feature. InfFS maps the feature selection task to a graph, and a selected feature set is considered as a subset of features that forms a path in the graph. A cost matrix is constructed to give pairwise associations between the features using the variance and correlation of the features; the matrix is used to evaluate the relevance and redundancy of a feature with respect to all the other features. The Laplacian Score (LS) for feature selection [51] is another well-known method with the ability to find features in unlabelled data [52,53]. The LS method uses a nearest neighbour graph to evaluate local geometric structures and selects features based on the constructed graph.
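As an illustration of the LS baseline, a simplified NumPy sketch is given below (a k-nearest-neighbour graph with heat-kernel similarities; parameter names are hypothetical and details differ from the reference implementation of [51]):

```python
import numpy as np

def laplacian_score(X, k=3, t=1.0):
    """Score each column (feature) of X; lower scores indicate features that
    better respect the local structure of a k-nearest-neighbour graph."""
    n = len(X)
    sq = ((X[:, None] - X[None]) ** 2).sum(axis=2)   # pairwise squared distances
    nn = np.argsort(sq, axis=1)[:, :k + 1]           # k neighbours (plus self)
    M = np.zeros((n, n))
    M[np.repeat(np.arange(n), k + 1), nn.ravel()] = 1.0
    S = np.exp(-sq / t) * np.maximum(M, M.T)         # symmetrised heat-kernel weights
    d = S.sum(axis=1)
    L = np.diag(d) - S                               # graph Laplacian
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ d) / d.sum()                  # remove the graph-weighted mean
        scores.append((f_t @ L @ f_t) / (f_t @ (d * f_t)))
    return np.array(scores)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.2, (10, 2)), rng.normal(2.0, 0.2, (10, 2))])
scores = laplacian_score(X)
```

Features are then ranked by ascending score and the top-ranked ones are kept; note that the number of features to keep must be supplied by the user, unlike in Deep-FS.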
Step 2: The selected features identified by each feature selection method are used to train DBMs. The weights of each DBM are initialized randomly, and then the DBM is trained on the selected features. In the proposed Deep-FS 10, the weights which were trained during feature selection are used for initializing the DBM. The results are reported in Table 11. Table 11 shows that the proposed Deep-FS 10 has achieved higher accuracy than the alternative approaches. Additionally, the proposed Deep-FS method can find the number of selected features, i.e. 430, automatically; however, the other three feature selection methods require the user to specify the number of selected features at the start of the feature selection procedure. The number of features (i.e. 430) obtained by the proposed method was therefore used by the other feature selection methods. The datasets are large, and it is computationally very expensive to run the experiments with various numbers of features to experimentally determine the most suitable number of features to select. For this reason, there is a need for feature selection algorithms, such as Deep-FS, which can automatically identify the most relevant features in large data. The simulation results in Table 11 show that the proposed method misclassified 90 images out of 10,000 (0.9% of the images), which is a lower error rate than the alternative methods. The training accuracy of the trained DBMs for all the methods is 100%. Table 11 also shows that the proposed feature selection method has the shortest processing time compared to the other methods. The GA [49] and Laplacian [51] methods need a much longer computation time to perform the feature selection task compared to Deep-FS. For instance, GA [49] took 30,784 s while the proposed method took only 133 s for the same feature selection task.
Classical feature selection methods have a high computational cost when applied to large datasets with a high number of features and/or training samples. The improvements in performance and time when using the proposed Deep-FS10 method instead of the other methods are provided in Table 12.
Most classical feature selection methods have not been designed to work on datasets which contain a large number of features. Furthermore, classical feature selection methods are designed to take as input a single matrix that contains all the training samples, another reason that makes them unsuitable for large data. For instance, the unsupervised feature selection for multi-cluster data (MCFS) method proposed by Cai et al. [54] was applied to the MNIST dataset, but computational problems were encountered. In particular, because MCFS constructs a square matrix of size N × N, where N is the number of training samples, and MNIST contains N = 60,000 image samples, the square matrix was very large and MCFS could not converge when applied to the MNIST data. Other methods, such as the GA method, can be applied to MNIST and other large datasets but have a high computational cost and computation time. The proposed Deep-FS overcomes the difficulties of working with datasets containing a high number of features and samples by dividing the training samples into a number of batches, similarly to what is done in deep learning methods.
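The batching strategy that lets Deep-FS sidestep the single-matrix requirement can be illustrated with a small sketch. The helpers `batched_metric` and `batches` below are hypothetical and not part of the paper's implementation; they only show how a per-feature statistic can be accumulated over mini-batches so that the full N × d training matrix (let alone an N × N matrix like MCFS's) is never held in memory at once.

```python
import numpy as np

def batched_metric(data_iter, fn):
    """Accumulate a per-feature statistic over mini-batches.
    `data_iter` yields (batch_size, d) arrays; `fn` maps a batch to a
    per-feature sum, shape (d,). Returns the per-sample mean."""
    total, count = None, 0
    for batch in data_iter:
        err = fn(batch)                       # per-feature sum for this batch
        total = err if total is None else total + err
        count += batch.shape[0]
    return total / count                      # mean per-feature statistic

def batches(X, size):
    """Yield consecutive mini-batches of X."""
    for i in range(0, X.shape[0], size):
        yield X[i:i + size]
```

Because each batch is processed and discarded, peak memory is proportional to the batch size rather than to the number of training samples, which is what makes the approach viable for datasets like MNIST.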

Time complexity analysis of the proposed method
In order to analyse the time complexity of the proposed method, two experiments are performed. The time complexity of the method is analysed in regard to the number of training samples and the number of input features.
In the first experiment, the computation times of the proposed Deep-FS method are obtained for different numbers of training samples, using the MNIST dataset. The number of training samples is increased from 5000 to 50,000 in steps of 5000, and the running time of the proposed method for each number of training samples is measured; the results are shown in Fig. 7. In the second experiment, the total number of input features is varied and the computation time of the proposed Deep-FS is measured for each number of input features. Uniformly distributed random datasets in the [0,1] interval with different numbers of input features are generated to perform the second time complexity experiment.
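The second timing experiment can be sketched as follows. The function `time_vs_features` and the placeholder `select_fn` are hypothetical names standing in for the actual Deep-FS routine; the sketch only shows the experimental protocol (uniform random data in [0, 1], varying feature count, wall-clock timing).

```python
import time
import numpy as np

def time_vs_features(select_fn, n_samples=1000, feature_counts=(100, 200, 400)):
    """Measure the running time of a feature selection routine on
    uniformly distributed random data with a varying number of features."""
    rng = np.random.default_rng(0)
    timings = {}
    for d in feature_counts:
        X = rng.uniform(0.0, 1.0, size=(n_samples, d))  # data in [0, 1]
        t0 = time.perf_counter()
        select_fn(X)                                     # routine under test
        timings[d] = time.perf_counter() - t0
    return timings                                       # seconds per setting
```

Plotting `timings` against `feature_counts` then yields the empirical time-versus-features curve analysed in this section.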

Conclusion
This paper proposes a novel feature selection algorithm, Deep-FS, which is embedded into the DBM classifier. The DBM is considered as a non-linear feature extractor that can reduce the dimensionality of large or big data. The proposed Deep-FS feature selection method works in conjunction with the feature extraction of the DBM to reduce the number of input features and the learning errors. Deep-FS reuses an RBM that is trained during the training stage of a DBM, which reduces computational cost. Deep-FS exploits the generative property of RBMs, which enables them to reconstruct missing input features. A group of features is temporarily eliminated from the input feature set, and the reconstruction error is evaluated using a new RBM-based criterion: the RBM treats the eliminated feature(s) as missing and reconstructs them using the information from the other input features. The reconstruction error is then used as a criterion for feature selection. The proposed feature selection method has two versions. In the first version, an RBM is initially trained and then used for feature selection. In the second version, the initially trained RBM is additionally trained on the reduced feature set during the feature selection procedure. Experiments revealed that the first version achieves a higher classification accuracy than the second version. Experiments also revealed that removing selected groups of features, instead of single adjacent features, improves performance and feature selection time.
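The reconstruction-error criterion can be illustrated with a minimal sketch. `TinyRBM` below is a toy Bernoulli RBM trained with one-step contrastive divergence; it is an illustration of the idea under simplified assumptions, not the paper's implementation, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    """Toy Bernoulli RBM (CD-1) used to sketch the reconstruction-error
    feature selection criterion."""
    def __init__(self, n_vis, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.a = np.zeros(n_vis)   # visible biases
        self.b = np.zeros(n_hid)   # hidden biases
        self.rng = rng

    def train(self, X, epochs=5, lr=0.1):
        for _ in range(epochs):
            h = sigmoid(X @ self.W + self.b)               # positive phase
            hs = (self.rng.random(h.shape) < h).astype(float)
            v1 = sigmoid(hs @ self.W.T + self.a)           # reconstruction
            h1 = sigmoid(v1 @ self.W + self.b)             # negative phase
            n = X.shape[0]
            self.W += lr * (X.T @ h - v1.T @ h1) / n
            self.a += lr * (X - v1).mean(0)
            self.b += lr * (h - h1).mean(0)

    def reconstruction_error(self, X, eliminated):
        """Zero out the `eliminated` feature group, reconstruct v -> h -> v,
        and measure how well the remaining features recover the group."""
        Xm = X.copy()
        Xm[:, eliminated] = 0.0                # treat the group as missing
        h = sigmoid(Xm @ self.W + self.b)
        v = sigmoid(h @ self.W.T + self.a)
        return np.mean((X[:, eliminated] - v[:, eliminated]) ** 2)
```

A low reconstruction error for a group indicates that the remaining features carry enough information to recover it, so the group is a candidate for elimination; a high error indicates the group carries information the other features cannot reproduce.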
Deep-FS was evaluated using the MNIST, MIR-Flickr, PANCAN, GISETTE and MADELON benchmark datasets. The results demonstrated that Deep-FS can reduce the number of inputs without affecting classification performance. Deep-FS reduced the number of misclassified samples on the MNIST data from 97 to 90. The proposed method automatically selected 430 out of 784 features and reduced the total processing time by 3063 s. When applied to the MIR-Flickr dataset, it slightly improved the MAP from 0.476 to 0.478. The impact on classification accuracy is therefore minor, which is a desirable result given that the number of inputs was reduced.
Moreover, Deep-FS reduced the computation time. The proposed algorithm was effective in reducing the number of input features: it removed 15,499 out of 20,264 features and reduced classifier training time by 85% on the PANCAN dataset. Experimental results also revealed that the proposed feature selection method reduced the number of input features, improved cross-validation accuracy, and reduced classifier training time on the GISETTE and MADELON datasets.
The proposed method was compared with three other feature selection methods, namely the GA [49], Infinite Feature Selection (InfFS) [50], and the Laplacian Score for feature selection [51,55], using the MNIST dataset. The results showed that the proposed feature selection method reduced the number of misclassified images compared to the other methods. It also reduced the processing time of feature selection: for instance, the proposed method performed automatic feature selection in 133 s, while the GA method [49] performed the same feature selection task in 30,784 s.
Deep-FS can improve the ability of deep learning methods to process multimodal data. Recently, Deep Neural Networks have demonstrated their ability to process large volumes of multimodal data [1]. One common property of multimodal data is high dimensionality. Not all input features carry useful information, and irrelevant input features can introduce noise and degrade performance. Reducing the number of input features and removing irrelevant features can therefore improve the ability of a deep learning model to process multimodal data. The feature selection method also reduces computational cost by reducing the size of the reconstructed matrix.
DBNs [1] belong to a group of DNNs that use an unsupervised pre-training stage. During the first learning phase of a DBN, layer-wise unsupervised training is performed: each layer learns a non-linear transformation of its input that captures the input's main information, and each adjacent pair of layers is treated as an RBM [1,8]. The RBM governs the unsupervised training and extracts features. The proposed feature selection method, which is based on RBMs, can therefore be applied to such DNNs to improve their processing abilities. DBNs [1] have demonstrated good results in applications such as speech recognition [11], audio [56], and image and text classification [7]. To apply the proposed method to other deep learning methods, the feature selection can be performed first, and the deep learning classifier (or another classifier) can then be trained on the selected features.
Koller and Sahami's Markov Blanket filtering feature selection method [12,57] eliminates a feature if there is a Markov Blanket for that feature. For a target feature, a Markov Blanket is a minimal set of variables from the feature space, conditioned on which all other variables are conditionally independent of the target feature. However, it is not straightforward to determine whether a set of features forms a Markov Blanket for the target feature, especially when the number of input features is high [12,57,58]. The proposed Deep-FS method instead defines a criterion for each feature and checks whether the other features can reconstruct the target feature. In particular, when the reconstruction error of a feature is low, the other features are likely to contain a Markov Blanket for the target feature, and that feature can be eliminated.
The proposed feature selection method will be very useful to researchers working with large and big data, as there are currently few feature selection methods suitable for large data. The paper demonstrates that the proposed Deep-FS can be applied to unimodal and multimodal data for various tasks. In particular, Deep-FS has been applied to unimodal handwritten digit recognition datasets (MNIST and GISETTE), a multimodal dataset comprising images and text (MIR-Flickr), and a biomedical dataset (PANCAN).
Reducing the number of inputs, and consequently the size of the constructed weight matrix, can help manage limited hardware resources during hardware implementation of DNNs for complex tasks [1,59,60]. Deep-FS can be used to reduce the input size, so that the network trained for a specific task can be implemented with less silicon area. Future work includes exploring the capability of Deep-FS to reduce the complexity of deep learning networks by reducing the number of input features in real-world applications where the inputs are generated by sensors. A reduction in input features leads to a reduction in the number of sensors, which can consequently reduce implementation costs; Deep-FS can thus offer a systematic way to find an optimized number of sensors, for example by optimizing the number and positions of the selected sensors. Future work also includes applying the algorithm to large-scale data analytics tasks, such as human activity recognition, which require the use of deep learning algorithms.