A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection

  • Original Article
Metabolomics

Abstract

Metabolic markers are central to metabonomic studies, so selecting differential metabolites is important for both biological and clinical purposes. Here, a feature selection method was developed for complex metabonomic data sets. Support vector machine (SVM), an effective tool for metabonomics data analysis, was employed as the basic classifier. To find meaningful features efficiently, support vector machine recursive feature elimination (SVM-RFE) was applied first. Then genetic algorithm (GA) and random forest (RF), which consider the interaction among the metabolites and the independent performance of each metabolite across all samples, respectively, were used to obtain more informative metabolic differences and to avoid the risk of false positives. The method was validated on a data set from a plasma metabonomics study of rat liver diseases progressing from hepatitis through cirrhosis to hepatocellular carcinoma. In addition to good classification results for the 3 kinds of liver disease, 31 important metabolites, including lysophosphatidylethanolamine (LPE) C16:0, palmitoylcarnitine and lysophosphatidylcholine (LPC) C18:0, were selected for further study. The results show that the three feature selection methods complement one another. The combined method also revealed more differential metabolites and provided more metabolic information for a “global” understanding of the diseases than any single method. Furthermore, this method is also suitable for other complex biological data sets.
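The recursive elimination step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the usual SVM-RFE scheme (train a linear SVM, drop the feature with the smallest squared weight, repeat), replaces a real SVM solver with a small full-batch subgradient routine, and runs on invented toy data. The names `train_linear_svm` and `svm_rfe` and all parameter values are hypothetical. The GA and RF stages are not shown.

```python
# Illustrative SVM-RFE sketch in pure Python (hypothetical names, toy data).

def train_linear_svm(X, y, features, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM restricted to `features` by full-batch subgradient
    descent on the L2-regularised hinge loss; return the weights as a dict."""
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in features}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in features) + b)
            if margin < 1.0:  # this sample violates the margin
                for j in features:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in features:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w


def svm_rfe(X, y):
    """Rank feature indices from most to least important by repeatedly
    retraining and removing the feature with the smallest squared weight."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        w = train_linear_svm(X, y, remaining)
        worst = min(remaining, key=lambda j: w[j] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # last feature eliminated comes first


# Toy data (assumption): column 0 separates the two classes, while
# columns 1 and 2 take the same values in both classes and carry no
# class information.
X = [
    [2.0, 0.5, -0.3], [1.5, -0.5, 0.3], [2.5, 0.2, 0.1], [1.8, -0.2, -0.1],
    [-2.0, 0.5, -0.3], [-1.5, -0.5, 0.3], [-2.5, 0.2, 0.1], [-1.8, -0.2, -0.1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

ranking = svm_rfe(X, y)
print("features ranked by SVM-RFE:", ranking)
```

On this toy set the informative column 0 is the last feature to be eliminated, so it heads the ranking. For real LC/MS data, a tested SVM library and cross-validated elimination would be the more appropriate choice.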

Acknowledgments

This study was supported by the State Key Science & Technology Project for Infectious Diseases (2008ZX10002-019, 2008ZX10002-017), the National Basic Research Program of China (2007CB914701) and the National Key Project of Scientific and Technical Supporting Programs (2006038079037) from the State Ministry of Science & Technology of China, and by the foundation (No. 20835006, 90713032) from the National Natural Science Foundation of China.

Author information

Corresponding authors

Correspondence to Xiaohui Lin or Peiyuan Yin.

About this article

Cite this article

Lin, X., Wang, Q., Yin, P. et al. A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection. Metabolomics 7, 549–558 (2011). https://doi.org/10.1007/s11306-011-0274-7

  • DOI: https://doi.org/10.1007/s11306-011-0274-7
