Elsevier

Journal of Chromatography B

Volume 966, 1 September 2014, Pages 100-108

A modified k-TSP algorithm and its application in LC–MS-based metabolomics study of hepatocellular carcinoma and chronic liver diseases

https://doi.org/10.1016/j.jchromb.2014.05.044

Highlights

  • A modified k top scoring pairs (k-TSP) method is suggested to provide an improved classification procedure.

  • This new k-TSP method was applied to serum metabolomics data derived from LC–MS of liver diseases.

  • The metabolic feature pairs can be effectively used to differentiate HCC from chronic liver diseases.

Abstract

In systems biology, the ability to discern meaningful information that reflects the nature of related problems from large amounts of data has become a key issue. The classification method using top scoring pairs (TSP), which evaluates the features of a data set in pairs and selects the top-ranked feature pairs to construct the classifier, has been a powerful tool in genomics data analysis because of its simplicity and interpretability. This study examined the relationship between two features, modified the ranking criteria of the k-TSP method to measure the discriminative ability of each feature pair more accurately, and correspondingly provided an improved classification procedure. Tests on eight public data sets showed the validity of the modified method. The modified k-TSP method was then applied to our serum metabolomics data derived from liquid chromatography-mass spectrometry analysis of hepatocellular carcinoma (HCC) and chronic liver diseases. Based on the 27 selected feature pairs, HCC and chronic liver diseases were accurately distinguished using principal component analysis, and the feature pairs revealed certain profound metabolic disturbances related to liver disease development.

Introduction

Feature selection and classification techniques are powerful tools in many applications, such as text processing [1], intrusion detection [2], and bioinformatics data analysis [3]. The development of systems biology generates large amounts of data. The ability to extract and classify the information that reflects the nature of related problems from these data, and to make a correct prediction for an input sample, is critical for a comprehensive understanding of complex biological processes.

Support vector machines (SVMs) [4], random forests (RFs) [5], and genetic algorithms (GAs) [6] are popular classification methods. SVM and RF can easily handle high-dimensional data. The feature selection techniques based on SVM, RF and GA are also effective in applications. In addition, the t-test [7], principal component analysis (PCA) [8] and partial least squares discriminant analysis (PLS-DA) [9] are common and efficient data analysis techniques in metabolomic studies.

Because biological processes are very complex, two or more features may be needed together to interpret a biological phenomenon. The top scoring pair (TSP) [10] method combines two features to distinguish the samples in different groups. It measures the discriminative ability of a feature pair by the difference between the corresponding two features. If the two features change differently in different groups, the feature pair can distinguish the groups quite well. TSP calculates the score of each feature pair and selects the top-ranked feature pair to perform the classification without any parameter tuning. The TSP method uses only a few features to classify the input samples, which results in a simple interpretation of bioinformatics problems, especially biomedical problems. In certain cases, one feature pair may contain insufficient information; thus, the k-TSP method [11] is employed, which uses the k > 0 top-ranked feature pairs to build the classification model. The k-TSP method outperforms many machine learning techniques, such as the Naïve Bayes, k nearest neighbors, and decision tree methods [11].
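As a rough illustration of how k selected feature pairs can be turned into a classifier, the following Python sketch applies an unweighted majority vote, which is the voting scheme described for k-TSP in [11]. The pair list, the sample values, and the name `ktsp_predict` are hypothetical and chosen only for this example; the pairs are assumed to have been selected already.

```python
from typing import List, Sequence, Tuple

# (i, j, label_if_less): the class (+1 or -1) voted for when x[i] < x[j].
Pair = Tuple[int, int, int]

def ktsp_predict(x: Sequence[float], pairs: List[Pair]) -> int:
    """Unweighted majority vote over feature pairs; classes are +1 and -1."""
    votes = 0
    for i, j, label_if_less in pairs:
        # Each pair votes for one class if x[i] < x[j], otherwise for the other.
        votes += label_if_less if x[i] < x[j] else -label_if_less
    # With an odd number of pairs the vote total is never zero.
    return 1 if votes > 0 else -1

# Hypothetical pairs and sample, for illustration only.
pairs = [(0, 1, +1), (2, 3, -1), (1, 4, +1)]
sample = [0.2, 0.9, 0.5, 0.1, 1.3]
print(ktsp_predict(sample, pairs))
```

Choosing an odd k avoids tied votes, which is why the sketch does not handle a zero total separately.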

Because of the simplicity and interpretability of TSP, many TSP family algorithms have been proposed. The weighted-k-TSP method [12] modifies the ranking rule by combining the sample probabilities with the misclassification cost. Focusing on gene data analysis, the weight-k-TSP method [13] extends k-TSP by adopting the weight of pair-wise variable comparisons and the percentage changes in gene expression in different sample groups. The TSP-decision tree method [14] combines the TSP and decision-tree techniques.

The k-TSP method has demonstrated its validity in gene data analysis [3], [11]. Metabolomics, as a new branch of systems biology, is playing an increasingly important role in disease studies [15], [16], drug development [17], sports medicine [18], etc. Because metabolomics data are typically complex and high-dimensional, selecting the most meaningful information from the data is critical for a thorough comprehension of the biological problems under study.

The k-TSP method evaluates the feature pairs according to the relationship of the features in the samples. The relationship of two features in a sample is “<”, “=” or “>”. The k-TSP method, which includes TSP as the special case k = 1, simply treats the relationship as “<” or “≥”, so that “=” is not distinguished from “>”.
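The effect of folding “=” into “≥” can be seen by counting the three relationships for one feature pair. The sketch below is only an illustration of that bookkeeping under assumed toy values; it does not reproduce the ranking criteria of the modified method, and the function names are hypothetical.

```python
def relation_counts(a, b):
    """Count how often a[s] < b[s], a[s] == b[s], and a[s] > b[s] over samples s."""
    counts = {"<": 0, "=": 0, ">": 0}
    for u, v in zip(a, b):
        if u < v:
            counts["<"] += 1
        elif u == v:
            counts["="] += 1
        else:
            counts[">"] += 1
    return counts

def binary_view(counts):
    """The two-way view described in the text: '=' is merged into '>='."""
    return {"<": counts["<"], ">=": counts["="] + counts[">"]}

# Toy intensities of two features in four samples (assumed values).
feature_a = [5.0, 0.0, 2.0, 7.0]
feature_b = [5.0, 0.0, 3.0, 1.0]

three_way = relation_counts(feature_a, feature_b)
print(three_way)               # {'<': 1, '=': 2, '>': 1}
print(binary_view(three_way))  # {'<': 1, '>=': 3}
```

In the two-way view, the two tied samples are indistinguishable from samples in which the first feature is strictly larger, which is the information a three-way treatment can keep.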

In this study, k-TSP is applied to metabolomics for the first time, and a modified k-TSP (M-k-TSP) method is proposed in which the ranking criteria are improved to treat the “=” relationship accurately. Eight public data sets were used to test the validity of this modified method. The M-k-TSP method was then applied to the analysis of the metabolomics data of liver diseases from ultra-high-performance liquid chromatography (UHPLC)–mass spectrometry (MS) to select the discriminative ion features.

Section snippets

TSP and k-TSP

TSP is a binary classification method. Let X = {x1, x2, …, xn} contain n samples, where xi ∈ R^m, let F = {f1, f2, …, fm} denote the feature set, and let C = {c1, c2} denote the class label set, with c1 = +1 and c2 = −1. Y = (y1, y2, …, yn) is the class label vector, and yi ∈ C is the class label of xi, where 1 ≤ i ≤ n.

The TSP method measures the feature pairs according to two criteria and selects the top pair to build the classifier. The principal criterion is Δij, and the second criterion is Γij (1 ≤ i ≠ j ≤ m) [10], [11]:

Δij = |Pij(c1) − Pij(c2)|
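Because this snippet is truncated, the following Python sketch assumes the standard TSP definition from [10], [11]: Pij(c) is the fraction of samples in class c for which feature i is smaller than feature j, and Δij is the absolute difference of that fraction between the two classes. The secondary criterion Γij is not covered, and the toy data and the name `pair_scores` are hypothetical.

```python
from itertools import combinations

def pair_scores(X, y):
    """Return {(i, j): delta_ij} for all feature pairs with i < j.

    Assumes labels are +1 / -1 and both classes are non-empty.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    m = len(X[0])
    scores = {}
    for i, j in combinations(range(m), 2):
        # P_ij(c): fraction of class-c samples with x[i] < x[j].
        p_pos = sum(x[i] < x[j] for x in pos) / len(pos)
        p_neg = sum(x[i] < x[j] for x in neg) / len(neg)
        scores[(i, j)] = abs(p_pos - p_neg)
    return scores

# Toy data: 4 samples, 3 features (assumed values).
X = [[1.0, 2.0, 0.5],
     [0.8, 1.9, 0.7],
     [2.5, 1.0, 0.6],
     [2.2, 1.1, 0.4]]
y = [1, 1, -1, -1]

scores = pair_scores(X, y)
best = max(scores, key=scores.get)
print(best, scores[best])  # (0, 1) 1.0
```

Here features 0 and 1 reverse their order between the two classes in every sample, so their pair reaches the maximum score of 1, while the other pairs keep the same order in both classes and score 0.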

Experimental

The implementations of SVM and RF were taken from the WEKA machine learning package. Sequential minimal optimization with a linear kernel was used for SVM. The size of the RF was set to 100 trees. The k-TSP and M-k-TSP algorithms were written in C++. The executable file of M-k-TSP can be downloaded from http://www.402.dicp.ac.cn/download_ok_2.htm. A 10-fold cross-validation was run 10 times to obtain the average performance.
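The evaluation protocol of 10 runs of 10-fold cross-validation can be sketched in plain Python as follows. This is not the WEKA procedure used in the study: the folds here are simple shuffled splits (whether the original runs were stratified is not stated), and `evaluate` is a hypothetical callback that trains on the training indices and returns the accuracy on the test indices.

```python
import random
from statistics import mean

def kfold_indices(n, k, rng):
    """Shuffle sample indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def repeated_cv(n, evaluate, k=10, repeats=10, seed=0):
    """Average accuracy over `repeats` independent runs of k-fold cross-validation."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(repeats):
        folds = kfold_indices(n, k, rng)
        accs = []
        for f, test in enumerate(folds):
            # Training set: every fold except the current test fold.
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            accs.append(evaluate(train, test))
        run_means.append(mean(accs))
    return mean(run_means)

# Usage with a placeholder evaluator that always reports 100% accuracy.
print(repeated_cv(25, lambda train, test: 1.0))
```

Averaging per run first and then across runs gives each of the 10 repetitions equal weight in the reported performance.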

To show the validity of the modified ranking

Comparison of the M-TSP method with the TSP, SVM, and RF methods using eight public data sets

To compare the classification performance of M-TSP with those of the TSP, SVM, and RF methods, eight public data sets were tested; the results are shown in Table 1. In five of the eight public data sets, the M-TSP method attained accuracy rates greater than those achieved with the TSP method. For the remaining three data sets, the accuracy rates of the M-TSP method were lower than those of the TSP method by 0.65% at most.

SVM and RF are very popular learning techniques. These methods have been

Conclusions

In this study, we improved the ranking criteria of the k-TSP method and proposed a modified k-TSP technique. The tests on eight public data sets showed the validity of the M-k-TSP method. Employing the newly developed M-k-TSP, we explored the liver disease metabolome data set of HCC and chronic liver diseases. Compared with the k-TSP method, the M-k-TSP method could measure the feature pairs more accurately, and the feature pairs contained more information reflecting the profound metabolic

Acknowledgments

The study was supported by the State Key Science & Technology Project for Infectious Diseases (2012ZX10002011), the Sino-German Center for Research Promotion (GZ 753), and the National Natural Science Foundation of China (21375011).

References (44)

  • J. Yang et al., Inf. Process. Manage. (2012)
  • X.J. Wang et al., Mol. Cell. Proteomics (2012)
  • T.W.M. Fan et al., Pharmacol. Ther. (2012)
  • D. Singh et al., Cancer Cell (2002)
  • X. Lin et al., J. Chromatogr. B (2012)
  • C. Christin et al., Mol. Cell. Proteomics (2013)
  • S. Chen et al., Electrophoresis (2013)
  • M. Sheikhan et al., Neural Comput. Appl. (2012)
  • P. Shi et al., BMC Bioinf. (2011)
  • B.E. Boser et al.
  • L. Breiman, Mach. Learn. (2001)
  • D.E. Goldberg et al., Mach. Learn. (1988)
  • R.M. Salek et al., Physiol. Genomics (2007)
  • Z. Pan et al., Anal. Bioanal. Chem. (2007)
  • U. Lutz et al., Anal. Chem. (2006)
  • D. Geman et al., Stat. Appl. Genet. Mol. Biol. (2004)
  • A.C. Tan et al., Bioinformatics (2005)
  • H. Luo et al., Pattern Recognition in Bioinformatics (2008)
  • M. Czajkowski et al., New Frontiers in Applied Artificial Intelligence (2008)
  • Czajkowski Marcin, K. Marek, Software Tools and Algorithms for Biological Systems, Springer, New York, NY, 2011, pp....
  • S. Bereswill et al., PLoS One (2009)
  • A. Miccheli et al., J. Am. Coll. Nutr. (2009)

This paper is part of the special issue “Metabolomics II” by G. Theodoridis.