A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection

Lin, Xiaohui; Wang, Quancai; Yin, Peiyuan; Tang, Liang; Tan, Yexiong; Li, Hong; Yan, Kang; Xu, Guowang

doi:10.1007/s11306-011-0274-7

Original Article
Published: 20 January 2011

Volume 7, pages 549–558, (2011)
Metabolomics

Xiaohui Lin¹, Quancai Wang¹, Peiyuan Yin², Liang Tang³, Yexiong Tan³, Hong Li¹, Kang Yan¹ & Guowang Xu²

Abstract

Metabolic markers are at the core of metabonomic studies, so selecting differential metabolites is of great importance for either biological or clinical purposes. Here, a feature selection method was developed for complex metabonomic data sets. Support vector machine (SVM), an effective tool for metabonomics data analysis, was employed as the basic classifier. To find meaningful features efficiently, support vector machine recursive feature elimination (SVM-RFE) was applied first. Genetic algorithm (GA) and random forest (RF), which consider the interactions among the metabolites and the independent performance of each metabolite in all samples, respectively, were then used to obtain more informative metabolic differences and to avoid the risk of false positives. A data set from a plasma metabonomics study of rat liver diseases progressing from hepatitis through cirrhosis to hepatocellular carcinoma was used to validate the method. Besides good classification results for the 3 kinds of liver diseases, 31 important metabolites, including lysophosphatidylethanolamine (LPE) C16:0, palmitoylcarnitine and lysophosphatidylethanolamine (LPE) C18:0, were selected for further study. The results show that the three feature selection methods complement one another well. The combined method also identified more differential metabolites and provided more metabolic information for a “global” understanding of the diseases than any single method. Furthermore, the method is also suitable for other complex biological data sets.
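The abstract names SVM-RFE as the first selection stage but gives no implementation details. The following is a minimal sketch of that stage only, under several assumptions that are not taken from the article: a linear kernel, training by full-batch subgradient descent on the regularised hinge loss, removal of one feature per iteration (the feature with the smallest squared weight), and a small synthetic data set in which only feature 0 carries class information. The function names `train_linear_svm` and `svm_rfe` and all parameter values are hypothetical. The GA and RF stages are not shown.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit a linear soft-margin SVM by full-batch subgradient descent.

    Minimises lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).
    Labels in y must be +1 or -1. All settings are illustrative.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin: hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, **svm_kwargs):
    """Rank feature indices from most to least important.

    Each round retrains the SVM on the surviving features and removes
    the one with the smallest squared weight.
    """
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y, **svm_kwargs)
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated feature is ranked first


# Synthetic two-class data (an assumption for illustration, not the rat
# plasma data): feature 0 tracks the label, features 1-4 are pure noise.
rng = random.Random(0)
labels = [1 if i % 2 == 0 else -1 for i in range(40)]
data = [
    [2.0 * yi + rng.gauss(0, 0.3)] + [rng.gauss(0, 0.3) for _ in range(4)]
    for yi in labels
]

ranking = svm_rfe(data, labels)
print("features ranked by SVM-RFE:", ranking)
```

On real LC/MS data the peak intensities would normally be scaled before training, and features are often removed in batches rather than one at a time to keep the run time manageable. Both are design choices left open here.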



Acknowledgments

This study was supported by the State Key Science & Technology Project for Infectious Diseases (2008ZX10002-019, 2008ZX10002-017), the National Basic Research Program of China (2007CB914701) and the National Key Project of Scientific and Technical Supporting Programs (2006038079037) from the State Ministry of Science & Technology of China, and by grants (Nos. 20835006, 90713032) from the National Natural Science Foundation of China.

Author information

Authors and Affiliations

School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
Xiaohui Lin, Quancai Wang, Hong Li & Kang Yan
CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
Peiyuan Yin & Guowang Xu
International Cooperation Laboratory on Signal Transduction, Eastern Hepatobiliary Surgery Institute, The Second Military Medical University, Shanghai, China
Liang Tang & Yexiong Tan

Corresponding authors

Correspondence to Xiaohui Lin or Peiyuan Yin.

About this article

Cite this article

Lin, X., Wang, Q., Yin, P. et al. A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection. Metabolomics 7, 549–558 (2011). https://doi.org/10.1007/s11306-011-0274-7

Received: 14 October 2010
Accepted: 06 January 2011
Published: 20 January 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s11306-011-0274-7
