A modified k-TSP algorithm and its application in LC–MS-based metabolomics study of hepatocellular carcinoma and chronic liver diseases☆
Introduction
Feature selection and classification techniques are powerful tools in many applications, such as text processing [1], intrusion detection [2], and bioinformatics data analysis [3]. Systems biology research generates large amounts of data. Comprehensive understanding of complex biological processes depends on two abilities: extracting and classifying the information that reflects the nature of the problem from these data, and correctly predicting the class of a new input sample.
Support vector machines (SVMs) [4], random forests (RFs) [5], and genetic algorithms (GAs) [6] are popular classification methods. SVM and RF can readily handle high-dimensional data, and feature selection techniques based on SVM, RF, and GA are also effective in practice. In addition, the t-test [7], principal component analysis (PCA) [8], and partial least squares discriminant analysis (PLS-DA) [9] are common and efficient data analysis techniques in metabolomics studies.
Because biological processes are complex, two or more features may be needed together to explain a biological phenomenon. The top scoring pair (TSP) [10] method combines two features to distinguish samples in different groups. It measures the discriminative ability of a feature pair by comparing the two features within each sample. If the relationship between the two features differs between groups, the feature pair can distinguish the groups well. TSP calculates a score for each feature pair and selects the top-ranked feature pair to perform the classification, with no parameters to tune. Because the TSP method uses only a few features to classify the input samples, its results are simple to interpret in bioinformatics problems, especially biomedical problems. In certain cases, one feature pair may contain insufficient information; thus, the k-TSP method [11] is employed, which uses the k > 0 top-ranked feature pairs to build the classification model. The k-TSP method has been reported to outperform many machine learning techniques, such as the Naïve Bayes, k nearest neighbors, and decision tree methods [11].
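The pair-scoring and voting idea described above can be sketched in a few lines of Python. This is a minimal illustration of the general k-TSP scheme, not the authors' C++ implementation: the function names (`pair_score`, `top_k_pairs`, `predict`), the toy data, and the choice to keep only disjoint pairs are assumptions made for the example, and the secondary tie-breaking criterion is omitted.

```python
from itertools import combinations


def pair_score(X, y, i, j):
    """Score pair (i, j) as |P(x_i < x_j | +1) - P(x_i < x_j | -1)|.

    Also returns the class that the event x_i < x_j votes for.
    Assumes both classes (+1 and -1) occur in y.
    """
    def frac_less(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return sum(1 for x in rows if x[i] < x[j]) / len(rows)

    p_pos, p_neg = frac_less(+1), frac_less(-1)
    vote_if_less = +1 if p_pos >= p_neg else -1
    return abs(p_pos - p_neg), vote_if_less


def top_k_pairs(X, y, k):
    """Return up to k top-scoring, mutually disjoint pairs as (i, j, vote)."""
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        score, vote = pair_score(X, y, i, j)
        scored.append((score, i, j, vote))
    scored.sort(key=lambda t: -t[0])  # stable sort, highest score first

    chosen, used = [], set()
    for _score, i, j, vote in scored:
        if i in used or j in used:
            continue  # assumption: each feature appears in at most one pair
        chosen.append((i, j, vote))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def predict(pairs, x):
    """Unweighted majority vote over the pairs (a zero total falls to -1)."""
    total = sum(vote if x[i] < x[j] else -vote for i, j, vote in pairs)
    return +1 if total > 0 else -1


# Toy data (hypothetical): three features, three samples per class.
X = [[1, 5, 3], [2, 6, 1], [0, 4, 2], [5, 1, 3], [6, 2, 4], [7, 0, 1]]
y = [+1, +1, +1, -1, -1, -1]
pairs = top_k_pairs(X, y, k=1)
```

With k = 1 this reduces to plain TSP: the single selected pair decides the class by whether its first feature is smaller than its second. Choosing an odd k avoids tied votes.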
Because the TSP approach is simple and interpretable, many TSP family algorithms have been proposed. The weighted-k-TSP method [12] modifies the ranking rule by combining the sample probabilities with the misclassification cost. Focusing on gene data analysis, the weight-k-TSP method [13] extended k-TSP by adopting the weight of pair-wise variable comparisons and the change percentages in gene expression in different sample groups. The TSP-decision tree method [14] combines the TSP and decision-tree techniques.
The k-TSP method has demonstrated its validity in gene data analysis [3], [11]. Metabolomics, as a new branch of systems biology, is playing an increasingly important role in disease studies [15], [16], drug development [17], sports medicine [18], etc. Because metabolomics data are typically complex and high-dimensional, selecting the most meaningful information from the data is critical for a thorough understanding of the biological problems under study.
The k-TSP method evaluates the feature pairs according to the relationship of the features in the samples. Within a sample, the relationship between two features is “<”, “=” or “>”. The k-TSP method, which includes TSP, simply treats the relationship as either “<” or “≥”, so “=” is not distinguished from “>”.
In this study, k-TSP is applied to metabolomics for the first time, and a modified k-TSP (M-k-TSP) method is proposed in which the ranking criteria are improved to treat the “=” relationship accurately. Eight public data sets were used to test the validity of this modified method. The M-k-TSP method was then applied to metabolomics data on liver diseases, acquired by ultra-high-performance liquid chromatography (UHPLC)–mass spectrometry (MS), to select the discriminative ion features.
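A small sketch can make the consequence of merging “=” with “>” concrete. It only counts how often each of the three relationships occurs for one feature pair; it does not reproduce the modified ranking criteria of M-k-TSP, which are not given in this excerpt. The function name `relation_fractions` and the toy values are assumptions made for illustration.

```python
def relation_fractions(samples, i, j):
    """Return the fractions of samples with x_i < x_j, x_i == x_j, x_i > x_j."""
    n = len(samples)
    less = sum(1 for x in samples if x[i] < x[j]) / n
    equal = sum(1 for x in samples if x[i] == x[j]) / n
    greater = sum(1 for x in samples if x[i] > x[j]) / n
    return less, equal, greater


# Hypothetical two-feature data: in group A the two features are always equal
# (for instance both recorded as 0), in group B the first is always larger.
group_a = [[0, 0], [0, 0]]
group_b = [[3, 1], [2, 1]]

less_a, equal_a, greater_a = relation_fractions(group_a, 0, 1)
less_b, equal_b, greater_b = relation_fractions(group_b, 0, 1)

# Binary view ("<" versus ">="): the "<" frequency is 0 in both groups,
# so a score based only on "<" is 0 and the pair looks uninformative.
binary_score = abs(less_a - less_b)

# Three-way view: the groups differ completely in "=" versus ">".
equal_gap = abs(equal_a - equal_b)
```

In this toy case the pair separates the groups perfectly, yet a criterion that only asks whether “<” holds cannot see the difference. That is the kind of situation a separate treatment of “=” is meant to address.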
TSP and k-TSP
TSP is a binary classification method. Let X = {x_1, x_2, …, x_n} contain n samples, where x_i ∈ R^m, let F = {f_1, f_2, …, f_m} denote the feature set, and let C = {c_1, c_2} denote the class label set, with c_1 = +1 and c_2 = −1. Y = (y_1, y_2, …, y_n) is the class label vector, and y_i ∈ C is the class label of x_i, where 1 ≤ i ≤ n.
The TSP method measures the feature pairs according to two criteria and selects the top pair to build the classifier. The principal criterion is Δ_ij, and the second criterion is Γ_ij (1 ≤ i ≠ j ≤ m) [10], [11]:
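The equations themselves are not reproduced in this excerpt. As a hedged reconstruction from the general TSP literature rather than from this paper, the two criteria are commonly written as below. The sample index s, the symbol r_{s,i} (the rank of feature f_i within sample s), and the helper quantities p_ij and γ_ij are notation assumed here; the paper may define the second criterion on feature values instead of ranks.

```latex
\begin{align}
  p_{ij}(c) &= P\left(f_i < f_j \mid y = c\right), \qquad c \in \{c_1, c_2\} \\
  \Delta_{ij} &= \left| p_{ij}(c_1) - p_{ij}(c_2) \right| \\
  \gamma_{ij}(c) &= \frac{1}{\left|\{s : y_s = c\}\right|}
      \sum_{s \,:\, y_s = c} \left( r_{s,i} - r_{s,j} \right) \\
  \Gamma_{ij} &= \left| \gamma_{ij}(c_1) - \gamma_{ij}(c_2) \right|
\end{align}
```

Under this reading, pairs are ranked by Δ_ij first, and Γ_ij is used only to break ties between pairs with equal Δ_ij.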
Experimental
The implementations of SVM and RF were from the WEKA machine learning package. SVM was trained by sequential minimal optimization with a linear kernel. The size of the RF (the number of trees) was set to 100. The k-TSP and M-k-TSP algorithms were written in C++. The executable file of M-k-TSP can be downloaded from http://www.402.dicp.ac.cn/download_ok_2.htm. Ten-fold cross-validation was repeated 10 times, and the average performance was reported.
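The evaluation protocol (10-fold cross-validation repeated 10 times, then averaged) can be sketched as follows. This is not WEKA's procedure or the authors' code: the helper name `repeated_kfold_accuracy`, the `fit` and `predict` callables, the unstratified fold assignment, and the toy threshold classifier are all assumptions made for the example.

```python
import random


def repeated_kfold_accuracy(X, y, fit, predict, folds=10, repeats=10, seed=0):
    """Average accuracy over `repeats` runs of `folds`-fold cross-validation."""
    rng = random.Random(seed)
    n = len(X)
    accuracies = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)  # a fresh random partition for every repeat
        correct = 0
        for f in range(folds):
            test_idx = order[f::folds]  # every folds-th sample is held out
            held_out = set(test_idx)
            train_idx = [t for t in order if t not in held_out]
            model = fit([X[t] for t in train_idx], [y[t] for t in train_idx])
            correct += sum(1 for t in test_idx if predict(model, X[t]) == y[t])
        accuracies.append(correct / n)
    return sum(accuracies) / repeats


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(X_train, y_train):
    pos = [x[0] for x, lab in zip(X_train, y_train) if lab == +1]
    neg = [x[0] for x, lab in zip(X_train, y_train) if lab == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return +1 if x[0] > threshold else -1


# Toy data: two well-separated groups of ten one-feature samples.
X = [[v] for v in range(10)] + [[100 + v] for v in range(10)]
y = [-1] * 10 + [+1] * 10
mean_accuracy = repeated_kfold_accuracy(X, y, fit_threshold, predict_threshold)
```

Any classifier can be plugged in through `fit` and `predict`, including a TSP-style model. A stratified split, which keeps class proportions equal across folds, would be a reasonable refinement for small or unbalanced data sets.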
To show the validity of the modified ranking
Comparison of the M-TSP method with the TSP, SVM, and RF methods using eight public data sets
To compare the classification performance of M-TSP with those of the TSP, SVM, and RF methods, eight public data sets were tested; the results are shown in Table 1. In five of the eight public data sets, the M-TSP method attained higher accuracy rates than the TSP method. For the remaining three data sets, the accuracy rates of the M-TSP method were lower than those of the TSP method by at most 0.65%.
SVM and RF are very popular learning techniques. These methods have been
Conclusions
In this study, we improved the ranking criteria of the k-TSP method and proposed a modified k-TSP technique. The tests on eight public data sets showed the validity of the M-k-TSP method. Employing the newly developed M-k-TSP, we explored the liver disease metabolome data set of HCC and chronic liver diseases. Compared with the k-TSP method, the M-k-TSP method could measure the feature pairs more accurately, and the feature pairs contained more information reflecting the profound metabolic
Acknowledgments
This study was supported by the State Key Science & Technology Project for Infectious Diseases (2012ZX10002011), the Sino-German Center for Research Promotion (GZ 753), and the National Natural Science Foundation of China (21375011).
References (44)
- et al., Inf. Process. Manage. (2012)
- et al., Mol. Cell. Proteomics (2012)
- et al., Pharmacol. Ther. (2012)
- et al., Cancer Cell (2002)
- et al., J. Chromatogr. B (2012)
- et al., Mol. Cell. Proteomics (2013)
- et al., Electrophoresis (2013)
- et al., Neural Comput. Appl. (2012)
- et al., BMC Bioinf. (2011)
- et al., Mach. Learn.
- Mach. Learn.
- Physiol. Genomics
- Anal. Bioanal. Chem.
- Anal. Chem.
- Stat. Appl. Genet. Mol. Biol.
- Bioinformatics
- Pattern Recognition in Bioinformatics
- New Frontiers in Applied Artificial Intelligence
- PLoS One
- J. Am. Coll. Nutr.
☆ This paper is part of the special issue “Metabolomics II” by G. Theodoridis.