Minimum redundancy maximum relevance feature selection approach for temporal gene expression data

Background Feature selection, aiming to identify a subset of features among a possibly large set of features that are relevant for predicting a response, is an important preprocessing step in machine learning. In gene expression studies this is not a trivial task for several reasons, including potential temporal character of data. However, most feature selection approaches developed for microarray data cannot handle multivariate temporal data without previous data flattening, which results in loss of temporal information. We propose a temporal minimum redundancy - maximum relevance (TMRMR) feature selection approach, which is able to handle multivariate temporal data without previous data flattening. In the proposed approach we compute relevance of a gene by averaging F-statistic values calculated across individual time steps, and we compute redundancy between genes by using a dynamical time warping approach. Results The proposed method is evaluated on three temporal gene expression datasets from human viral challenge studies. Obtained results show that the proposed method outperforms alternatives widely used in gene expression studies. In particular, the proposed method achieved improvement in accuracy in 34 out of 54 experiments, while the other methods outperformed it in no more than 4 experiments. Conclusion We developed a filter-based feature selection method for temporal gene expression data based on maximum relevance and minimum redundancy criteria. The proposed method incorporates temporal information by combining relevance, which is calculated as an average F-statistic value across different time steps, with redundancy, which is calculated by employing dynamical time warping approach. As evident in our experiments, incorporating the temporal information into the feature selection process leads to selection of more discriminative features. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1423-9) contains supplementary material, which is available to authorized users.


Background
Feature selection approaches can be roughly categorized into filter-based methods [1], wrapper-based methods [2] and embedded methods [3]. Filter-based methods perform feature selection independently from the learning process. On the other hand, wrapper-based and embedded methods combine feature selection and the learning process in order to select an optimal subset of features. This combined process usually requires the use of nested cross validation procedure which may lead to increased computational cost and possible overfit, especially when a small number of observations is available, which is often the case in gene expression datasets. Therefore, we focus on filter-based feature selection approaches in this paper.
A challenge in gene expression studies is the identification of discriminative genes, which may be later used as predictors (inputs) to classification models. Removing irrelevant features may lead to improved accuracy and increased interpretability of the classification model. However, this task is challenging, especially when data have temporal characteristics. Various feature selection approaches have been developed for microarray data [4][5][6]. However, most of these methods cannot handle multivariate temporal data without data flattening, which is the process that transforms a temporal data into a single matrix and results in loss of temporal information.
Several feature selection approaches for temporal data have been proposed recently. For instance, [7] proposed a margin-based feature selection approach for temporal data, where the original feature space was transformed into weighted feature space to perform optimization in order to maximize temporal margin in this weighted feature space. However, redundancy among features was not considered. Following the same intuition, in [8,9] authors proposed an approach, where they project the data to another space to learn new features (factors or principal component). However, the methods are for dimension reduction, rather than feature selection which is our focus in this paper. The Multi-task Lasso method [10,11] employs group lasso regularization based on the L 2,1 -norm penalty for feature selection, thus ensuring all regression models at different time points to share a common set of features. This method removes redundant features by reducing their weights (coefficients) to zero but the approach belongs to the embedded feature selection methods (the search for an optimal subset of features is built into the classifier construction) rather than filter-type methods.
A special group of filter-based feature selection approaches tends to simultaneously select highly predictive but uncorrelated features. An example is the Maximum Relevance Minimum Redundancy (mRMR) algorithm developed for feature selection of microarray data [12]. It tends to select a subset of features having the most correlation with a class (relevance) and the least correlation between themselves (redundancy). In this algorithm, the features are ranked according to the minimal-redundancy-maximal-relevance criteria. Relevance can be calculated by using the F-statistic (for continuous features) or mutual information (for discrete features) and redundancy can be calculated by using Pearson correlation coefficient (for continuous features) or mutual information (for discrete features). In [13] authors proposed the MIFS-ND algorithm, which selects features according to the minimal-redundancy-maximalrelevance criteria by using an optimization algorithm known as Non-dominated Sorting Genetic Algorithm-II [14]. When selecting features, instead of using the calculated values for relevance and redundancy (e.g., F-statistic and Pearson correlation coefficient), authors used domination count and dominated count, which account for the rank in the sorted list of calculated relevance and the rank in the sorted list of calculated redundancy, respectively. In [15], authors proposed an approach, where they select one representative gene from each group/cluster with the objective that the selected genes are jointly discriminative. This approach requires features to be previously clustered based on correlation or domain knowledge (e.g., molecular functions, gene ontology, etc.). By clustering genes this algorithm prevents selection of redundant features. All these algorithms tend to select highly predictive uncorrelated features and require a preprocessing approach to perform temporal data flattening.
In this paper, we propose a temporal minimum redundancy -maximum relevance (TMRMR) feature selection approach, which is able to handle multivariate temporal data without data flattening. We preserve the idea of maximum relevance and minimum redundancy criteria [12] but we change evaluation procedure for relevance and redundancy. In the proposed approach, we compute the relevance of a gene by averaging the F-statistic values calculated across individual time steps, and redundancy between genes by using the dynamical time warping (DTW) approach. The proposed methodology, tested on three temporal gene expression datasets from viral studies, outperforms the alternatives used in this study.

mRMR algorithm and data flattening
The mRMR is a feature selection approach that tends to select features with a high correlation with the class (output) and a low correlation between themselves. For continuous features, the F-statistic can be used to calculate correlation with the class (relevance) and the Pearson correlation coefficient can be used to calculate correlation between features (redundancy). Thereafter, features are selected one by one by applying a greedy search to maximize the objective function, which is a function of relevance and redundancy. Two commonly used types of the objective function are MID (Mutual Information Difference criterion) and MIQ (Mutual Information Quotient criterion) representing the difference or the quotient of relevance and redundancy, respectively. For temporal data, mRMR feature selection approach requires some preprocessing techniques that flatten temporal data into a single matrix in advance. This may result in a loss of possibly important information among temporal data (such as temporal order information). A common way for data flattening used as a preprocessing step to mRMR is depicted in Fig. 1.

TMRMR algorithm
In this study, we preserve the idea of the mRMR algorithm by maximizing the objective function, which includes relevance and redundancy, but we adapt it to handle multivariate temporal data without flattening (Fig. 2).
Let us denote by D = {X i , c i } i=1,...,N the dataset with N individuals. X i ∈ R G×T represents G observed genes measured at T time steps for individual i. c i ∈ {1, ..., K} represents the class label for individual i. Let us also denote by g j ∈ R N×T , j = 1, ..., G the N × T matrix of gene expression data for jth gene. We represent relevance of a gene g j by calculating the F-statistic at each time step and then combining these values by using an appropriate aggregation operator. A number of aggregation operators may be applicable here, such as median, arithmetic mean, geometric mean, maximum or even an approach that combines aggregation operators [16]. However, we aim to choose an operator that is most appropriate for the observed problem, i.e., that is able to capture different gene expression patterns between groups even if differences are present in just a small fraction of the observed time period. Median is more robust to outliers than arithmetic mean which is a nice property, however in some cases it may fail to detect different levels of gene expressions. For instance, some genes may have different expression between groups in just a short time period following infection (e.g., initial two time points with large F-statistic values) and thereafter the differences between groups become insignificant (e.g., the next five time points with small or even neglectable F-statistic values). In such a case, operators like median and geometric mean would fail to detect different gene expression behavior between groups. The maximum F-statistic value may be more appropriate in this case. However, this operator is based on a single F-statistic value (maximal) and it neglects all other values corresponding to other time steps. On the other hand, arithmetic mean, although it will be affected with several small F-statistic values corresponding to the time points where differences in gene expression values between groups do not exist, will have a significant value. In addition, we implemented the Multilayer Aggregation (MLA) method from [16] to combine arithmetic mean, geometric mean and median for aggregation of F-statistic Fig. 2 The proposed approach for calculation of relevance and redundancy for temporal data values corresponding to different time steps, however, it did not improve results significantly and it reduced robustness of the proposed feature selection methods. For these reasons, we choose the arithmetic mean operator to aggregate F-statistic values calculated across all time steps into a single value representing the total gene relevance: where g is an N-dimensional vector containing gene expression data of a gene g j at the tth time step, c is a classification variable with K possible classes, n k is the number of observations belonging to the kth class,ḡ in all tissue samples belonging to the kth class, and g (t) j,l,(k) is the gene expression value of lth sample belonging to the kth class.
By using Eq. 1, we quantify correlation of a gene g j with a class at each time step t. Thereafter, we calculate the overall relevance of the gene g j (Eq. 2) by averaging relevance (F-statistic) values calculated for all time steps. Here, it should be noted that relevance calculated in this way differs from relevance calculated on flattened data. For instance, it may happen that for some phenotype 1 expression values for a certain gene have increasing trend (let say from 0 to 1) and for phenotype 2 symmetric decreasing trend (from 1 to 0). In this case, data flattening may lead to low inter-class variance and therefore to low relevance. On the other hand, relevance calculated by using Eqs. (1)-(2) should be able capture the different trends of gene expression data for the two phenotypes.
In the proposed approach for temporal feature selection we calculate redundancy by using DTW, which is an efficient algorithm for measuring similarity between two temporal sequences that may vary in time or speed. DTW uses "elastic" alignment and is able to capture similarity between curves even if they are out of phase in time (in such cases Euclidean and Manhattan distance measures, which align corresponding time points, would fail to detect similarity).
An issue with the mRMR algorithm is the possible selection of irrelevant features, which is possible especially in the first few iterations of the algorithm. For instance, based on the MIQ criterion the second feature may be selected simply because it is totally different from the first one (feature with the highest relevance) although it may be irrelevant. Thereafter, this problem is further propagated since a selected irrelevant feature affects selection of the next ones. In order to solve this issue, we introduced hyperparameter α, which controls the number of the top relevant features (according to the average Fstatistic value calculated by using Eqs. (1)-(2)) included in the feature selection process. This means that we choose the next non-redundant feature from only the top αG relevant genes (where G is the total number of genes). For each two genes g i and g j , belonging to the group of the αG most relevant features, we calculate N × N distance matrix D (Fig. 2) whose elements represent DTW distances between rows in matrices g i and g j (e.g., D pq represents DTW distance between pth row in matrix g i and qth row in matrix g j ). After computing the distance matrix D we calculate redundancy by using one of the following two approaches: In Eq. 3 R c represents redundancy calculated by using DTW distances between every pair of rows in matrices g i and g j , while in Eq. 4 R m represents redundancy calculated by using only DTW distances between corresponding rows in matrices g i and g j . Although DTW is able to capture similarity between curves that are out of phase in time it may fail to capture similarity between curves fluctuating in a similar manner but with different offsets and amplitudes. For instance, one signal may fluctuate with amplitude between 5 and 10, while another signal may fluctuate in a similar manner but with larger amplitude between 30 and 40. In order to deal with this issue, prior to evaluation of distance matrix D for each pair of genes g i and g j , all gene expression temporal sequences were normalized by the z-score normalization (Fig. 2) which is often used as a preprocessing step to DTW [17][18][19]: where g i,p is a time series corresponding to ith gene and pth observation (patient), andḡ i,p and σ i,p are the average value and standard deviation of this time series. Z-score normalization 'translates' gene expression time series to fluctuate around the same (zero) offset and removes differences in amplitudes. Thereafter, the gene expression time series differ only in shape which is exactly what we are interested in when calculating redundancy.
After the normalization of gene expression temporal sequences, for each pair of genes g i and g j distance matrix D is calculated. Each entry of D is calculated by using DTW approach: where dtw() is the function which calculates the DTW distance between temporal sequences g i,p and g j,q . As in mRMR [12], the proposed algorithm starts by selecting one feature (gene) having the largest relevance calculated by using Eq. 2. Thereafter, algorithm performs greedy search and adds one feature in each iteration according to the MIQ criterion: where S is a subset of already selected genes extended with gene g k and |S| is the number of features in S, F is the average F-statistic value across different time steps (Eq. 2), and R is either R c (Eq. 3) or R m (Eq. 4). Depending on the choice of the redundancy measure (R c or R m ), in this paper we propose two versions of the TMRMR algorithm: (1) TMRMR-C, using R c as a measure of redundancy and (2) TMRMR-M, using R m as a measure of redundancy. Figure 3 shows the pseudo-code of the proposed TMRMR-C and TMRMR-M algorithms. Solution to the optimization problem given in Eq. 7 requires O(αGm) computational complexity, where m is the number of genes selected. Taking into account that the computational complexity of the DTW algorithm is O(T 2 ) then the total time complexities of the TMRMR-C and TMRMR-M algorithms are O(αGmT 2 N 2 ) and O(αGmT 2 N), respectively. Both proposed algorithms require more computational complexity than the original mRMR algorithm whose computational complexity is O(GmTN) for the temporal gene expression dataset. However, in cases where it is necessary to reduce execution time of the proposed algorithms (e.g. datasets with large number of time points T), their computational complexity may be reduced through parameter α. In addition, we can further speed up the proposed algorithms by utilizing an approximate DTW that has a linear time complexity [20], however, it is out of the current manuscript's scope.

Implementation
Both, the TMRMR-C and TMRMR-M algorithms are implemented by using MATLAB software. DTW is implemented by using dynamic time warping package [21]. Our software takes as input a set of temporally aligned gene expression data and provides the ranked list of the top genes as the output. The number of genes to be selected is specified by a user. Source code is freely available at: https://github.com/radovicmiloskg/TMRMR.git.

Dataset description
In this study, we evaluated the proposed feature selected approach by comparing it with alternatives on three independent gene expression datasets from human viral challenge studies [22]. These datasets contain gene expression data for 17, 20 and 19 human volunteers, who were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively. A summary of the datasets is given in Table 1.
In each dataset, subjects were classified based on severity of reaction to infection into "symptomatic" and "asymptomatic" groups. In particular, symptoms were recorded twice daily and classified based on modified Jackson Score [23]. Patients with a modified Jackson score larger than or equal to 6 over the quarantine period were denoted as "symptomatic". Gene expression measurements were collected temporally, starting at baseline (24 hours prior to inoculation with virus) and thereafter at a certain time points following experimental procedure which is described in detail in [22], making a total of 16, 14 and 21 time-point measurements for H3N2, HRV and RSV datasets, respectively.

Comparison methods
We compared the proposed TMRMR-C and TMRMR-M methods with four popular state-of-the-art feature selection approaches, widely used for extraction of the most informative features from gene expression data: 1. mRMR: This algorithm tends to select a subset of features having the most correlation with the class (output) and the least correlation between themselves [12]. It ranks features according to the minimal-redundancy-maximal-relevance criterion which is based on mutual information. 2. F-statistic: ANOVA is one of the most widely applied techniques in microarray data analysis [24]. This approach selects features simply according to the F-statistic value (which is the statistic for ANOVA). It prefers to select features having small intra-class variances and large inter-class variance. 3. ReliefF: One of the most successful and most widely used feature selection approaches which is based on the idea that a good feature should have similar values in observations belonging to the same class and different values in observations belonging to  [25]. It choses instances randomly, finds their nearest neighbors from the same and the opposite class(es), and weights features according to their distances (more weight is given to features that discriminate the instances from neighbors of different class(es) and do not discriminate the instances from neighbors of the same class).

Multi-task Lasso (MT-LASSO): This method
represents one of the state-of-the-art methods for temporal feature selection [10,11]. It employs the group lasso regularization based on the L 2,1 -norm penalty for feature selection, thus ensuring that all regression models at different time points (tasks) share a common set of features. The method is implemented by using the MALSAR software package [26].

Performance evaluation procedure
We evaluated the feature selection approaches by calculating the classification accuracy of the three classifiers:

K-nearest neighbors (KNN):
Instance-based lazy learning algorithm which predicts the class of a testing observation that is dominant among the K most similar examples (nearest neighbors) in the problem space.

Naive Bayes classifier (NB):
A probabilistic classifier based on applying Bayes' theorem, which is often used for classification of gene expression data [12,27]. 3. Support vector machine (SVM): A discriminative classifier, which uses a kernel trick to transform the input data space in order to create a separating hyperplane. In this study, we used linear SVM

84.7
Bold represents the best average accuracy because previous studies have proved its effectiveness in gene expression classification problems [28].
For evaluation of the three classifiers, the 5-fold cross validation procedure was used, where, in each iteration, observations belonging to the left-out fold were used for testing purposes (test set), while the remaining observations were used for feature selection followed by classifier training (training set). In each iteration of the cross validation procedure we optimized parameters of the classifiers by applying nested 5-fold cross validation procedure on the training set. In this way optimal values of parameters C ∈ 10 −3 , 10 −2 , ..., 10 3 for SVM and K ∈ {1, 3, 5, 7} for KNN were selected. Here it should be noted that the test data were never used for feature selection and classifiers training (including optimization of classifiers parametersnested 5-fold cross validation procedure). In addition, the optimal value of the hyperparameter α can be estimated in a nested cross-validation procedure. However, due to the fact that datasets used in this study contain a huge number of features (12023) measured in a large number of time points (14-21 depending on a dataset), it was too time consuming to use the nested cross-validation to select the value of α. Thus, we simply fixed the α parameter to 0.3 in all experiments ensuring that all selected genes come from the pool of the top 30% relevant genes (3610 genes).
All three gene expression datasets used in this study are balanced, and therefore classification accuracy may serve as a good metric for comparison of TMRMR-C and TMRMR-M with other baseline feature selection methods. Prior to feature selection and evaluation, missing values in all three datasets were imputed by linear interpolation. In addition, gene expression values for each gene were normalized to the range [0, 1] by using min-max normalization. All methods were implemented by using MATLAB software.

Classification accuracy on gene expression data
The proposed TMRMR-C and TMRMR-M feature selection approaches were compared with four baseline feature selection algorithms according to the evaluation procedure described in the previous section. By using the 5-fold cross validation procedure, the accuracy of KNN, NB and SVM classifiers was calculated for the top m = {1, 10, 20, 30, 40, 50} genes.

91.3
Bold represents the best average accuracy Table 2 shows the results for the three datasets H3N2, HRV and RSV, respectively. It clearly shows that both TMRMR-C and TMRMR-M methods outperformed alternatives in most cases. More precisely, both TMRMR-C and TMRMR-M algorithms achieved improvement in 34 out of 54 cases when compared to alternatives (with 12 tie results). When comparing the two proposed approaches, TMRMR-C outperformed TMRMR-M in most cases  in favor of TMRMR-C and 33 tie results). These results reveal that redundancy calculated by using DTW distances between every pair of time series from gene expression matrices (R c ) significantly contributes to prediction accuracy. In addition, we calculated the average accuracy of the three classifiers over all datasets (last row in Table 2). These values show that, on average, both proposed methods outperformed alternatives in all cases (for all classifiers and all m values). This indicates that the proposed methods have selected the most discriminative features.
For each m value, we tested whether the proposed TMRMR-C approach (which outperformed the TMRMR-M) statistically significantly outperforms other methods. For this purpose, we applied Welch's t-test on the results given in Table 2 and found that the accuracy of the proposed TMRMR-C method is statistically more significant than other four baseline methods in 17 out of 24 cases (α = 0.05).
Results given in Table 2 are depicted in Fig. 4. In this figure the accuracy is plotted as a function of m for all classifiers and for all datasets. This figure clearly shows that in most cases both, TMRMR-C and TMRMR-M approaches, outperform baseline methods for most values of m. This figure also shows that, among the four tested baseline feature selection approaches, F-statistic outperformed the others in most cases including mRMR. Since mRMR uses F-statistic as a measure of relevance we can conclude that minimum redundancy condition, calculated as a Pearson correlation coefficient, hurts its performance. On the other hand, the proposed TMRMR-C and TMRMR-M methods achieved highest accuracy by combining relevance, calculated as an average F-statistic value across different time steps, with redundancy, calculated by employing DTW and thus succeeded to capture some additional information hidden in temporal characteristics of the data.
The accuracy of the DTW algorithm may degrade considerably when operating on expression profiles with not enough data points which is often the case in gene expression datasets. This may limit the applicability of the proposed TMRMR-C and TMRMR-M algorithms in such cases and, for this purpose, we performed analysis on how reducing the number of time points affects performance of the proposed methods comparing to baseline approaches. We repeated the same evaluation procedure but with reduced number of time points T = 3, T = 5 and T = 7 for all three datasets. We select the following time points for evaluation purposes: initial time point, end time point and equally distant time points between them (e.g. t 1 , t T/2 , t T ). Due to the space limitation, in Table 3 and Fig. 5 we show only results averaged over all datasets. The obtained results show that the reduction of time points did not affect the performance of the TMRMR-C algorithm, which outperformed all alternatives in all cases (for all classifiers and for all T and m values). On the other hand, the TMRMR-M algorithm showed improvement in all but 3 cases from which 2 occurred when the number of time points was set to 3 (T = 3) and the remaining one occurred when the number of time points was set to 5 (T = 5). This confirms the fact that a limited number of time points negatively affects the DTW approach and consequently the TMRMR-M algorithm, nevertheless, the proposed method showed improvement in most cases when comparing to baseline approaches. This leads to the conclusion that in cases with a limited number of time points the TMRMR-C approach, which is computationally more expensive, might be more appropriate than the TMRMR-M approach.

Gene ontology over-representation analysis
We have performed gene ontology over-representation analysis to find gene ontology (GO) terms that are overrepresented within the subset of selected genes. For this purpose we used annotations for the top 50 genes selected by the TMRMR-C algorithm from each of the three datasets used in this study (the full list of selected genes, together with error bars for the two groups, symptomatic and asymptomatic, is given in Additional file 1). Selected genes from each dataset were independently submitted to the PANTHER (protein annotation through evolutionary relationship) classification system (http:// www.pantherdb.org/) which extracted significantly overrepresented biological processes. For each of the three datasets, the top 5 GO terms are reported in Table 4. The last column in the table is p-value corrected based on the Bonferroni procedure. We can see from Table 4 that most of GO terms that are over-represented in all datasets are related to immune response to viral infection. This is consistent with the fact that the three gene expression datasets originate from human viral challenge studies where human volunteers were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively.

Robustness
In order to compare robustness of the proposed TMRMR-C and TMRMR-M feature selection methods with other baseline approaches used in this study, we calculated the Spearman's rank correlation coefficient (ρ), Tanimoto distance (T dist ) and number of features shared across all folds of the 5-fold cross validation procedure (N shared ) for the top 50 selected features (m = 50). For each method, Fig. 6 shows the average value of each stability measure across all datasets (H3N2, HRV and RSV) and across all tested numbers of time points (T = 3, T = 5, T = 7 and T = T all , where T all ∈ {16, 14, 21}). This figure clearly shows that, on average, the TMRMR-C feature selection method is the most stable one according to each of the three measures (N shared = 15, ρ = 0.40 and T dist = 0.33). The second most stable method is ReliefF (N shared = 10.08, ρ = 0.36 and T dist = 0.32) which appears to be more stable than the TMRMR-M algorithm (N shared = 9.66, ρ = 0.31 and T dist = 0.25), while the least stable method is mRMR (N shared = 2.16, ρ = 0.03 and T dist = 0.12). Since both the mRMR and the TMRMR-C algorithms are based on maximum relevance and minimum redundancy criteria, we can conclude that combining relevance, calculated as an average F-statistic value across different time steps, with redundancy, calculated by employing DTW significantly improves robustness for temporal data.

Conclusion
We presented filter-based feature selection methods for temporal gene expression data. The proposed methods utilize the maximum relevance and minimum redundancy criteria which were originally introduced by the mRMR algorithm. In order to handle multivariate temporal data without previous data flattening we modified the evaluation procedure for relevance and redundancy. Concretely, in the proposed methods we calculate the relevance of a gene by averaging F-statistic values calculated across individual time steps and redundancy between genes by using dynamical time warping. The proposed methods have been tested on three temporal gene expression datasets from viral studies. We showed that TMRMR-C and TMRMR-M proposed methods outperformed alternatives in most cases. In addition, we