
Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data

  • Qingzhong Liu,

    Affiliation Department of Computer Science and Institute for Complex Additive Systems Analysis, New Mexico Tech, Socorro, New Mexico, United States of America

  • Andrew H. Sung ,

    sung@cs.nmt.edu (AHS); youping.deng@usm.edu (YD)

    Affiliation Department of Computer Science and Institute for Complex Additive Systems Analysis, New Mexico Tech, Socorro, New Mexico, United States of America

  • Zhongxue Chen,

    Affiliation Department of Epidemiology & Biostatistics, Robert Stempel College of Public Health & Social Work, Florida International University, Miami, Florida, United States of America

  • Jianzhong Liu,

    Affiliation QuantumBio Inc., State College, Pennsylvania, United States of America

  • Xudong Huang,

    Affiliation Division of Nuclear Medicine and Molecular Imaging and Center for Advanced Medical Imaging, Department of Radiology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America

  • Youping Deng


    Affiliations SpecPro, Vicksburg, Mississippi, United States of America; Department of Biological Sciences, University of Southern Mississippi, Hattiesburg, Mississippi, United States of America

Abstract

Microarray data have a high dimension of variables, but available datasets usually contain only a small number of samples, which makes the study of such datasets both interesting and challenging. In analyzing microarray data for purposes such as predicting gene-disease associations, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can significantly reduce cost while maintaining or improving the classification or prediction accuracy of the learning machines employed to analyze the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:

  • Support Vector Machine Recursive Feature Elimination (SVMRFE)
  • Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
  • Gradient based Leave-one-out Gene Selection (GLGS)

To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection performance is tightly coupled with the learning classifier. Overall, our approach outperforms the compared methods. Biological functional analysis based on the MAQC-II breast cancer dataset supports applying our method to phenotype prediction. Additionally, learning classifiers play important roles in the classification of microarray data, and our experimental results indicate that the nearest mean scale classifier (NMSC) is a good choice owing to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

Introduction

Using microarray techniques, researchers can measure the expression levels of tens of thousands of genes in a single experiment. This ability allows scientists to investigate the functional relationships between the cellular and physiological processes of biological organisms and genes at a genome-wide level. The preprocessing procedure for raw microarray data consists of background correction, normalization, and summarization. After preprocessing, high-level analysis, such as gene selection, classification, or clustering, is applied to profile the gene expression patterns [1]. In high-level analysis, partitioning genes into closely related groups across time and classifying patients into different health statuses based on selected gene signatures have become the two main tracks of microarray data analysis in the past decade [2]–[6]. Various standards related to systems biology are discussed by Brazma et al. [7]. When sample sizes are substantially smaller than the number of features or genes, statistical modeling and inference become challenging, as the familiar “large p, small n problem” arises. Designing feature selection methods that lead to reliable and accurate predictions by learning classifiers is therefore an issue of great theoretical as well as practical importance in high-dimensional data analysis.

To address the “curse of dimensionality” problem, three basic strategies have been proposed for feature selection: filtering, wrapper, and embedded methods. Filtering methods select feature subsets independently of the learning classifiers and do not incorporate learning [8]–[11]. One weakness of filtering methods is that they consider each feature in isolation and ignore possible interactions among features. Yet the combination of certain features may have a net effect that does not necessarily follow from the individual performance of the features in that group [12]. As a consequence, filtering methods may end up selecting groups of highly correlated features/genes, which present redundant information to the learning classifier and ultimately worsen its performance. Also, if there is a practical limit on the number of features to be chosen, one may not be able to include all informative features.

To avoid this weakness of filtering methods, wrapper methods wrap around a particular learning algorithm that assesses the selected feature subsets in terms of estimated classification errors and then builds the final classifier [13]. Wrapper methods use a learning machine to measure the quality of subsets of features. One well-known recent wrapper method for feature/gene selection is Support Vector Machine Recursive Feature Elimination (SVMRFE) [14], which refines the optimal feature set using Support Vector Machines (SVM). The idea behind SVMRFE is that the orientation of the separating hyperplane found by the SVM can be used to select informative features: if the plane is orthogonal to a particular feature dimension, then that feature is informative, and vice versa. Beyond microarray data analysis, SVMRFE has been widely used in high-throughput biological data analyses and other areas involving feature selection and pattern classification [15].
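The weight-based elimination idea behind SVMRFE can be sketched as follows. This is a minimal illustration using scikit-learn's linear SVC; the function name `svm_rfe` and the one-feature-per-iteration schedule are our simplifications of the published algorithm, which may remove several features per iteration:

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_features):
    """Rank features by recursively training a linear SVM and
    discarding the feature with the smallest squared weight,
    i.e., the dimension to which the separating hyperplane is
    closest to parallel."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features:
        clf = SVC(kernel="linear").fit(X[:, remaining], y)
        w = clf.coef_.ravel()        # hyperplane orientation
        drop = np.argmin(w ** 2)     # least informative remaining feature
        del remaining[drop]
    return remaining
```

In practice one would retrain and evaluate a classifier on the surviving subset at each step, as done in the comparisons reported here.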

Wrapper methods can noticeably reduce the number of features and significantly improve classification accuracy [16], [17]. However, they have the drawback of a high computational load, making them less desirable as the dimensionality increases. Embedded methods perform feature selection simultaneously with learning classifiers to achieve better computational efficiency than wrapper methods while maintaining similar performance. LASSO [18], [19], logistic regression with a regularized Laplacian prior [20], and Bayesian regularized neural networks with automatic relevance determination [21] are examples of embedded methods.

To improve classification of microarray data, Zhou and Mao proposed SFS-LS bound and SFFS-LS bound algorithms for optimal gene selection by combining sequential forward selection (SFS) and sequential floating forward selection (SFFS) with the LS (least squares) bound measure [22]. Tang et al. designed two gene selection methods, leave-one-out calculation sequential forward selection (LOOCSFS) and gradient-based leave-one-out gene selection (GLGS) [23]. Diaz-Uriarte and De Andres [24] presented a gene selection method that calculates out-of-bag errors with random forests [25].

In human genetic research, exploiting information redundancy from highly correlated genes can potentially reduce the cost and simultaneously improve the reliability and accuracy of learning classifiers that are employed in data analysis. To exploit the information redundancy that exists among the huge number of variables and improve classification accuracy of microarray data, we propose a gene selection method, Recursive Feature Addition (RFA), which is based on supervised learning and similarity measures. We compare RFA with SVMRFE, LOOCSFS, and GLGS by using the MAQC-II breast cancer dataset to predict pre-operative treatment response (pCR) and estrogen receptor status (erpos) and compare RFA with SVMRFE and LOOCSFS on the MAQC-II multiple myeloma dataset to predict the overall survival milestone outcome (OSMO, 730-day cutoff) and to predict event-free survival milestone outcome (EFSMO, 730-day cutoff).

Results

Results on MAQC-II Breast Cancer Dataset

We compare MSC-based RFA methods with GLGS, LOOCSFS, and SVMRFE on the MAQC-II breast cancer dataset. Tables 1 to 10 list the cancer-related genes among the first 100 features selected by these methods. Compared with GLGS, LOOCSFS, and SVMRFE, RFA selects a greater number of currently known cancer-related genes for the prediction of pCR and a smaller number for the prediction of erpos. Since disease status is not simply related to the number of these cancer-related genes, we obtain prediction performance by running multiple experiments and compare average prediction performance using the following measurements: testing accuracy, the Matthews correlation coefficient (MCC) [26], [27], which has been used by the MAQC-II consortium [28], and area under the receiver operating characteristic curve (AUC) errors, with the classifiers NMSC, NBC, SVM, and UDC (uncorrelated normal-based quadratic Bayes classifier) [29].
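The three performance measurements used throughout can be computed directly from class predictions and decision scores. The sketch below (plain NumPy; the function names are ours) uses the standard contingency-table form of MCC and the rank-sum form of AUC; ties among scores are not handled here:

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels 0/1."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc_error(y_true, scores):
    """1 - AUC, via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 1.0 - auc
```

A perfect classifier yields MCC = 1 and an AUC error of 0; lower AUC error is better, matching the right-hand columns of the figures.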

Table 1. The 34 cancer related genes of the 100 features selected by NBC-MSC on original training group of MAQC-II breast cancer data for pCR prediction.

https://doi.org/10.1371/journal.pone.0008250.t001

Table 2. The 34 cancer related genes of the 100 features selected by NMSC-MSC on original training group of MAQC-II breast cancer data for pCR prediction.

https://doi.org/10.1371/journal.pone.0008250.t002

Table 3. The 26 cancer related genes of the 100 features selected by GLGS on original training group of MAQC-II breast cancer data for pCR prediction.

https://doi.org/10.1371/journal.pone.0008250.t003

Table 4. The 25 cancer related genes of the 100 features selected by LOOCSFS on original training group of MAQC-II breast cancer data for pCR prediction.

https://doi.org/10.1371/journal.pone.0008250.t004

Table 5. The 20 cancer related genes of the 100 features selected by SVMRFE on original training group of MAQC-II breast cancer data for pCR prediction.

https://doi.org/10.1371/journal.pone.0008250.t005

Table 6. The 34 cancer related genes of the 100 features selected by NBC-MSC on original training group of MAQC-II breast cancer data for erpos prediction.

https://doi.org/10.1371/journal.pone.0008250.t006

Table 7. The 33 cancer related genes of the 100 features selected by NMSC-MSC on original training group of MAQC-II breast cancer data for erpos prediction.

https://doi.org/10.1371/journal.pone.0008250.t007

Table 8. The 40 cancer related genes of the 100 features selected by GLGS on original training group of MAQC-II breast cancer data for erpos prediction.

https://doi.org/10.1371/journal.pone.0008250.t008

Table 9. The 44 cancer related genes of the 100 features selected by LOOCSFS on original training group of MAQC-II breast cancer data for erpos prediction.

https://doi.org/10.1371/journal.pone.0008250.t009

Table 10. The 40 cancer related genes of the 100 features selected by SVMRFE on original training group of MAQC-II breast cancer data for erpos prediction.

https://doi.org/10.1371/journal.pone.0008250.t010

Figure 1 shows the testing accuracy, MCC values, and AUC errors for the prediction of erpos using the two RFA methods, NBC-MSC and NMSC-MSC, as well as GLGS, LOOCSFS, and SVMRFE, with the four learning classifiers. Regarding prediction performance evaluated by testing accuracy and MCC values (left and middle columns, Fig. 1), on average the two RFA methods outperform the other gene selection methods. The advantage of RFA with NMSC and UDC is especially noticeable.

Figure 1. Comparison of different gene selection methods for prediction of erpos status of MAQC-II breast cancer dataset with different learning classifiers.

The x-axis shows the number of features used and the y-axis shows the average testing accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

https://doi.org/10.1371/journal.pone.0008250.g001

We notice that performance evaluation using testing accuracy (left column) and MCC (middle column) is consistent, but evaluation using the AUC measurement does not always agree with the other two. For instance, when applying UDC to the feature sets, the RFA methods have the best prediction performance as evaluated by testing accuracy and MCC values, but not as evaluated by AUC errors. With respect to testing performance measured by AUC errors, the best results are obtained using RFA with the NMSC classifier: the AUC errors are as low as 0.08, much better than the results obtained with the other gene selection methods.

The prediction results for the pCR endpoint, shown in Figure 2, also demonstrate the advantage of the RFA methods over the other compared methods. All the best prediction results, evaluated by AUC, MCC, and testing accuracy, are obtained using RFA methods.

Figure 2. Comparison of different gene selection methods for prediction of pCR status of MAQC-II breast cancer dataset with different learning classifiers.

The x-axis shows the number of features used and the y-axis shows the average testing accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

https://doi.org/10.1371/journal.pone.0008250.g002

Tables 11 and 12 list the average testing performance under the best training, MS_HR, and the best testing performance under the best training, HS_HR, for the predictions of erpos and pCR, respectively. Figures 3 and 4 show box-plots of the MS_HR values for the predictions of erpos and pCR. The results indicate that prediction performance depends not only on gene selection but also on the learning classifier; in other words, gene selection and classifier are coupled in determining prediction performance. On average, the RFA gene selection methods deliver the best performance, followed by GLGS; the NMSC classifier outperforms the others with respect to performance and consistency across the three measurements. The combination of RFA with NMSC has the best overall prediction performance.

Figure 3. Average erpos prediction performance by using MAQC-II breast cancer dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a red dashed circle. If there are multiple best combinations, or the differences among combinations are not conspicuous, multiple circles are placed.

https://doi.org/10.1371/journal.pone.0008250.g003

Figure 4. Average pCR prediction performance by using MAQC-II breast cancer dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a dashed circle.

https://doi.org/10.1371/journal.pone.0008250.g004

Table 11. Mean values and standard errors of HS_HR and MS_HR for erpos prediction in the study of the MAQC-II breast cancer dataset. For each classifier and each measurement, the best prediction value is in bold; the best prediction value across the four learning classifiers is in bold italic.

https://doi.org/10.1371/journal.pone.0008250.t011

Table 12. Mean values and standard errors of HS_HR and MS_HR for pCR prediction in the study of the MAQC-II breast cancer dataset. For each classifier and each measurement, the best prediction value is in bold; the best prediction value across the four learning classifiers is in bold italic.

https://doi.org/10.1371/journal.pone.0008250.t012

Results on MAQC-II Multiple Myeloma Dataset

Figures 5 and 6 show box-plots of the average testing values for EFSMO and OSMO; the classification models are based on the best training. We did not apply GLGS because identifying the feature sets on the multiple myeloma dataset would take an excessive amount of time. Experimental results again show that gene selection is tightly coupled with the learning classifier in performance measurement. On average, the RFA methods and LOOCSFS are superior to SVMRFE.

Figure 5. Average EFSMO prediction performance by using MAQC-II multiple myeloma dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a dashed circle.

https://doi.org/10.1371/journal.pone.0008250.g005

Figure 6. Average OSMO prediction performance by using MAQC-II multiple myeloma dataset with the measurements testing accuracy (left column), MCC values (middle column), and AUC errors (right column), respectively.

Classification models are set up based on the best training. In each column, the best combination of gene selection method and classifier is highlighted by a dashed circle. If there are multiple best combinations, or the differences among combinations are not conspicuous, multiple circles are placed.

https://doi.org/10.1371/journal.pone.0008250.g006

Discussion

Owing to the huge number of variables and small sample sizes, microarray data involve complicated interactions and relations among genes as well as highly redundant information. The selection of predictive models, which depends on the selected features and the employed classifiers, is extremely important for the classification of microarray data and for further biological function analysis and validation. Machine learning and data mining techniques provide a powerful approach for studying the relationships among genes. Based on supervised learning and similarity measures, we propose Recursive Feature Addition (RFA), which recursively employs supervised learning to obtain the highest training accuracy and adds each subsequent gene based on the similarity between the chosen features and the candidates, thereby minimizing redundancy within the feature set. We believe this RFA method captures more informative and differentially expressed genes than other methods. Experimental comparisons are performed using two MAQC-II microarray datasets, breast cancer and multiple myeloma. Our studies show that the method of gene selection is tightly coupled with the learning classifier, which determines the final predictive model from the training data; in other words, the best classification models under different learning classifiers are associated with different methods of gene selection. Using several popular learning classifiers, including NMSC, NBC, SVM, and UDC, on average the best method of gene selection is RFA, followed by GLGS, LOOCSFS, and SVMRFE. Among the compared learning classifiers, NMSC outperforms the others with respect to testing performance, stability, and consistency.

Biological function analysis based on the MAQC-II breast cancer dataset shows that the applied feature selection methods, including RFA, GLGS, LOOCSFS, and SVMRFE, generate features containing a significant portion of known cancer-related genes for both the pCR and erpos endpoints (Tables 1–10). Although the number of cancer-related genes is not strictly correlated with the prediction performance of the various feature selection methods, the notable presence of cancer-related genes among the selected features indicates that these methods can produce biologically meaningful features, which supports their use for phenotype prediction.

Across all results of the five gene selection methods with the four learning classifiers, on average the combination of the NMSC-MSC and NBC-MSC gene selection methods with the NMSC learning classifier performed best under the comprehensive evaluation criteria of testing accuracy, MCC values, and AUC errors. It should be noted that NMSC-MSC and NBC-MSC are not always the best across the four learning classifiers; in other words, the best models under different learning classifiers are associated with different gene selection methods. The best model for a specific learning classifier is normally selected based on training performance and the evaluation criteria: for a given criterion and classifier, the best training model among the five gene selection methods is selected as the best model for that classifier. To select the best model overall, the best models of the four learning classifiers are compared, and the model with the highest evaluation score is selected as the best among the five gene selection methods and four learning classifiers. For instance, Figure 7 shows the average training performance on the MAQC-II breast cancer dataset over twenty experiments, measured by training accuracy, MCC values, and AUC errors, for classification of the pCR endpoint status. Under the comprehensive evaluation of the three criteria, the best classification models are obtained using gene selection method NMSC-MSC for the NMSC classifier, NBC-MSC for the NBC classifier, and NBC-MSC and SVMRFE for the UDC classifier. With the SVM classifier, although SVMRFE reaches the best training first as the feature dimension increases, almost all gene selection methods achieve the best training before the feature number reaches 100; in that case it is hard to determine the best model for SVM. Under a limit on the number of features, SVMRFE can be considered the best for the SVM classifier. The comparison between the training (Figure 7) and the testing (Figure 2) shows that this way of selecting the best model among the various gene selection methods and learning classifiers is reasonable.

Figure 7. Comparison of different gene selection methods for the training of pCR endpoint of MAQC-II breast cancer dataset using the four classifiers.

The x-axis shows the number of features used and the y-axis shows the average training accuracy (left column), MCC values (middle column), and AUC errors (right column) over twenty experiments.

https://doi.org/10.1371/journal.pone.0008250.g007

Under different evaluation criteria, the gene selection method associated with the best model for a given learning classifier may differ, as shown in Figure 7: with the NMSC classifier, the best model is obtained using NMSC-MSC when evaluated by training accuracy and MCC values, but is associated with SVMRFE when evaluated by AUC errors; overall, NMSC-MSC is the best for the NMSC classifier. Figure 7 also shows that, with the SVM classifier, dozens of best training models are obtained with different gene selection methods. One possible solution for selecting the best model under the SVM classifier is to divide the data into training, validation, and testing sets: the training set is used to construct training models; the validation set is used to select, among the best training models, the one that produces the best validation result; and the testing set is used for prediction or testing. Selecting the best model with the SVM classifier is very interesting and challenging, especially with few data points and a huge number of features. Although this topic is beyond the scope of this paper, it is worth exploring in future studies.

Materials and Methods

MAQC-II Breast Cancer Dataset

The breast cancer dataset used in the MAQC-II project is used to predict pre-operative treatment response (pCR) and estrogen receptor status (erpos) [28]. Normalization was provided by the MAQC-II project using a standard procedure (i.e., MAS 5.0 for the Affymetrix platform). The data were originally split into two groups: a training set containing 130 samples (33 positives and 97 negatives for pCR; 80 positives and 50 negatives for erpos) and a validation set containing 100 samples (15 positives and 85 negatives for pCR; 61 positives and 39 negatives for erpos).

MAQC-II Multiple Myeloma Dataset

We use the MAQC-II multiple myeloma dataset to predict the overall survival milestone outcome (OSMO, 730-day cutoff) and the event-free survival milestone outcome (EFSMO, 730-day cutoff). For the OSMO label, there are 51 positives and 289 negatives in the original training set and 27 positives and 187 negatives in the original validation set; for the EFSMO label, there are 84 positives and 256 negatives in the original training set and 34 positives and 180 negatives in the original validation set [28]. Normalization was provided by the MAQC-II project research group.

Feature Selection

Supervised recursive learning.

Our method of recursive feature addition (RFA) employs supervised learning to achieve the best training accuracy and uses statistical similarity measures to choose the next variable with the least dependence on, or correlation with, the already identified variables, as follows:

  1. Insignificant genes are removed according to their statistical insignificance. Specifically, a gene with a high p-value is usually not differentially expressed and therefore contributes little to the classification of microarray data. To reduce the computational load, these insignificant genes are removed.
  2. Each individual gene is evaluated by supervised learning. The gene with the highest classification accuracy is chosen as the most important feature, the first element of the feature set. If multiple genes achieve the same highest classification accuracy, the one with the lowest p-value measured by a test statistic (e.g., the score test) is identified as the first element. At this point the chosen feature set, G1, contains just one element, g1, corresponding to feature dimension one.
  3. The (N+1)-st dimensional feature set, GN+1 = {g1, g2, …, gN, gN+1}, is obtained by adding gN+1 to the N-th dimensional feature set, GN = {g1, g2, …, gN}. The choice of gN+1 proceeds as follows:
  4. Add each remaining gene gi (gi ∉ GN) to GN and compute the classification accuracy of the feature set GN ∪ {gi}. Each gi whose set GN ∪ {gi} achieves the highest classification accuracy is a candidate for gN+1 (but not yet gN+1). Given the large number of variables, it is very likely that multiple features yield the same highest classification accuracy; these candidates are placed into a set C, but only one member of C will be identified as gN+1. How the selection is made is described next.

Candidate feature addition.

To find the most informative (or least redundant) candidate for gN+1, we measure the statistical similarity between the chosen features and each candidate, using a similarity measure based on the widely used Pearson correlation coefficient [30].

Suppose gn (gn ∈ GN, n = 1, 2, …, N) is a chosen gene, gc (gc ∈ C) is a candidate gene, and cor denotes Pearson's correlation coefficient. The sum of squared correlations, SC, is calculated as follows:

SC(gc) = Σ_{n=1}^{N} [cor(gn, gc)]²    (1)

Then the selection of gN+1 is based on the Minimal value of the Square of the Correlation (MSC), that is,

gN+1 = argmin_{gc ∈ C} SC(gc)    (2)

With the procedure described above, a feature is recursively added to the chosen feature set based on supervised learning and the similarity measure. In our experiments we choose the naïve Bayes classifier (NBC) and the nearest mean scale classifier (NMSC) [29] for supervised learning; the NBC- and NMSC-based RFA feature selection methods are denoted NBC-MSC and NMSC-MSC, respectively.
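The recursive addition with the MSC tie-breaker can be sketched compactly as follows. This is a simplified illustration, assuming scikit-learn's GaussianNB as the supervised learner standing in for NBC; the function name `rfa_msc` is ours, and the first-step tie is broken arbitrarily here rather than by the lowest p-value as described above:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def rfa_msc(X, y, n_features):
    """Recursive Feature Addition: at each step, add the gene giving
    the highest training accuracy; break ties by the Minimal value of
    the Square of the Correlation (MSC) with already-chosen genes."""
    chosen, remaining = [], set(range(X.shape[1]))
    while len(chosen) < n_features:
        best_acc, cands = -1.0, []
        for g in remaining:
            cols = chosen + [g]
            acc = GaussianNB().fit(X[:, cols], y).score(X[:, cols], y)
            if acc > best_acc:
                best_acc, cands = acc, [g]
            elif acc == best_acc:
                cands.append(g)
        if chosen:
            # SC(gc) = sum over chosen gn of cor(gn, gc)^2; pick minimum
            def sc(g):
                return sum(np.corrcoef(X[:, n], X[:, g])[0, 1] ** 2
                           for n in chosen)
            nxt = min(cands, key=sc)
        else:
            nxt = cands[0]  # paper breaks this tie by lowest p-value
        chosen.append(nxt)
        remaining.discard(nxt)
    return chosen
```

Swapping in an NMSC learner in place of GaussianNB would yield the NMSC-MSC variant.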

Model Implementation and Comparison

Cross-validation is a technique for estimating how accurately a predictive model will perform in practice. Generally, the data are partitioned into complementary subsets: one subset (the training set) is used to construct the predictive model and the other (the validation or testing set) is used for validation. To reduce variability, multiple rounds are performed using different partitions, and the validation results are averaged over all rounds. There are three common types of cross-validation:

  1. Repeated random sub-sampling validation. This technique randomly splits the dataset into training and testing data; the results are then averaged over the splits. The advantage over K-fold cross-validation (described below) is that the proportion of the training/testing split does not depend on the number of iterations (folds);
  2. K-fold cross-validation. The original sample is partitioned into K subsamples. A single subsample is retained as the validation data for testing the model and the remaining K-1 subsamples are used as training data. The cross-validation process is repeated K times with each of the K subsamples used exactly once as the validation data;
  3. Leave-one-out cross-validation. It uses a single observation from the original sample as the validation data and the remaining observations as the training data. It is the same as a K-fold cross validation with K being equal to the number of observations in the original sample. Leave-one-out cross-validation is often computationally expensive.

Considering the high computational cost of leave-one-out cross-validation and the insufficiency of a single round of K-fold cross-validation, we adopted repeated random sub-sampling validation. In the model implementation, we pooled all the training and validation data points. In each experiment, we randomly chose 80% of the samples for training and the remaining 20% for testing; twenty experiments were performed (approximately equivalent to performing 5-fold cross-validation four times). The average testing performance, evaluated in terms of testing accuracy, MCC values, and AUC errors, was compared. The learning classifiers UDC, NBC, NMSC, and SVM were employed for training and testing.
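The validation strategy above can be sketched as follows. This is a simplified illustration with scikit-learn's GaussianNB and accuracy only (the study averages MCC values and AUC errors the same way, across four classifiers); the function name and fixed seed are ours:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def repeated_subsampling(X, y, n_repeats=20, test_frac=0.2, seed=0):
    """Repeated random sub-sampling validation: in each repeat,
    hold out a random 20% of samples for testing, train on the
    remaining 80%, and average test accuracy over the repeats."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        n_test = int(len(y) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        clf = GaussianNB().fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
    return float(np.mean(accs))
```

Unlike K-fold cross-validation, the train/test proportion here is fixed regardless of how many repeats are performed, matching the rationale given above for choosing this strategy.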

Acknowledgments

The authors gratefully acknowledge Drs. Matteo Masotti and E Ke Tang for offering their code for RfeRanking, GLGS, and LOOCSFS; Drs. Weida Tong and Leming Shi at the National Center for Toxicological Research, US Food and Drug Administration, for their invaluable help during this study; and Ms. Kimberly Lawson at the Department of Radiology of the Brigham and Women's Hospital and Harvard Medical School for her excellent assistance in revising the manuscript. We also gratefully acknowledge the MAQC-II consortium for making the datasets available to this study.

Author Contributions

Conceived and designed the experiments: QL. Performed the experiments: QL. Analyzed the data: QL ZC JL YD. Contributed reagents/materials/analysis tools: YD. Wrote the paper: QL. Supervised the study: AS. Revised the manuscript: AS XH. Assisted the study: ZC JL XH. Coordinated the project: YD.
